
add_bos_token is not correctly set for Gemma3 #3430

@IKACE

Description


Hi,

We found that `add_bos_token` is not correctly set for Gemma3. This led some vLLM users to suspect accuracy issues with the model (link).

Initializing the tokenizer only after the `add_bos_token` flag has been determined mitigates the issue.

/home/ubuntu/lm-evaluation-harness/lm_eval/models/vllm_causallms.py:208
        self.add_bos_token = add_bos_token
        if "gemma" in pretrained.lower():
            self.add_bos_token = True
            eval_logger.info(
                "Found 'gemma' in model name, a BOS token will be used as Gemma series models underperform without it."
            )
            
        self.tokenizer = get_tokenizer(
            tokenizer if tokenizer else pretrained,
            tokenizer_mode=tokenizer_mode,
            trust_remote_code=trust_remote_code,
            revision=tokenizer_revision,
            add_bos_token=self.add_bos_token
        )
        self.tokenizer = configure_pad_token(self.tokenizer, model_config=self._config)
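The root cause is ordering: `get_tokenizer` receives `add_bos_token` at construction time, so the Gemma override must run before the tokenizer is created, as in the snippet above. A minimal toy sketch of the same pattern (the `ToyTokenizer` class is hypothetical, not the vLLM API):

```python
class ToyTokenizer:
    """Stand-in tokenizer: captures add_bos_token once, at construction."""

    BOS_ID = 2

    def __init__(self, add_bos_token=False):
        self._add_bos = add_bos_token

    def encode(self, text):
        ids = [ord(c) for c in text]  # fake ids, one per character
        return [self.BOS_ID] + ids if self._add_bos else ids


# Buggy order: the tokenizer is built while the flag is still False,
# so flipping the flag afterwards never reaches the tokenizer.
add_bos_token = False
tok = ToyTokenizer(add_bos_token=add_bos_token)
add_bos_token = True  # Gemma override, applied too late
print(tok.encode("Hi")[0] == ToyTokenizer.BOS_ID)  # False: no BOS prepended

# Fixed order: finalize the flag first, then construct the tokenizer.
add_bos_token = True
tok = ToyTokenizer(add_bos_token=add_bos_token)
print(tok.encode("Hi")[0] == ToyTokenizer.BOS_ID)  # True: BOS prepended
```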

We ran a unit test on the effect of the fix, as well as a gsm8k comparison between the vLLM and HF backends.

Unit Test

The logs below show the result of tokenizing the string "This is a test.". After the fix, the BOS token is correctly prepended to the encoding.

Before the fix:

2025-11-26:03:11:43 INFO     [models.vllm_causallms:357] string: This is a test.
2025-11-26:03:11:43 INFO     [models.vllm_causallms:358] encoding: [2094, 563, 496, 1594, 236761]

After the fix:

2025-11-26:03:23:20 INFO     [models.vllm_causallms:357] string: This is a test.
2025-11-26:03:23:20 INFO     [models.vllm_causallms:358] encoding: [2, 2094, 563, 496, 1594, 236761]
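As a sanity check on the logs above: the post-fix encoding should be exactly the pre-fix encoding with the Gemma `<bos>` id (2, per the log) prepended, with all other ids unchanged.

```python
# Token ids copied from the log output above.
before = [2094, 563, 496, 1594, 236761]
after = [2, 2094, 563, 496, 1594, 236761]
GEMMA_BOS_ID = 2  # <bos> id, per the post-fix log

assert after[0] == GEMMA_BOS_ID      # BOS is now at position 0
assert after[1:] == before           # the rest of the encoding is untouched
print("fix only prepends BOS")
```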

GSM8k

We compare performance on gsm8k using the HF and vLLM backends. The accuracy regression mentioned in the vLLM issue is mitigated.

HF

hf (pretrained=google/gemma-3-27b-it,max_length=8192,parallelize=True,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64

| Tasks | Version | Filter           | n-shot | Metric      |  Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.9242 | ± | 0.0073 |
|       |         | strict-match     |      5 | exact_match | 0.9219 | ± | 0.0074 |

vLLM before the fix

vllm (pretrained=google/gemma-3-27b-it,tensor_parallel_size=4,gpu_memory_utilization=0.9,enforce_eager=True,max_num_seqs=64,max_model_len=8192,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64

| Tasks | Version | Filter           | n-shot | Metric      |  Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.8628 | ± | 0.0095 |
|       |         | strict-match     |      5 | exact_match | 0.8552 | ± | 0.0097 |

vLLM after the fix

vllm (pretrained=google/gemma-3-27b-it,tensor_parallel_size=4,gpu_memory_utilization=0.9,enforce_eager=True,max_num_seqs=64,max_model_len=8192,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64

| Tasks | Version | Filter           | n-shot | Metric      |  Value |   | Stderr |
|-------|--------:|------------------|-------:|-------------|-------:|---|-------:|
| gsm8k |       3 | flexible-extract |      5 | exact_match | 0.9265 | ± | 0.0072 |
|       |         | strict-match     |      5 | exact_match | 0.9234 | ± | 0.0073 |
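Summarizing the numbers above: before the fix, vLLM trails HF by roughly 6 points on both filters; after the fix, the gap shrinks to well under the reported stderr. A quick check of the arithmetic on the reported scores:

```python
# exact_match scores copied from the tables above.
hf = {"flexible-extract": 0.9242, "strict-match": 0.9219}
vllm_before = {"flexible-extract": 0.8628, "strict-match": 0.8552}
vllm_after = {"flexible-extract": 0.9265, "strict-match": 0.9234}

for metric in hf:
    gap_before = hf[metric] - vllm_before[metric]
    gap_after = hf[metric] - vllm_after[metric]
    print(f"{metric}: HF-vLLM gap {gap_before:+.4f} before fix, "
          f"{gap_after:+.4f} after fix")
```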

lm-eval version

v0.4.9.1
