Description
Hi,
We found that add_bos_token is not correctly set for Gemma3, which led some vLLM users to suspect accuracy issues with the model (link).
Initializing the tokenizer only after the add_bos_token flag has been determined mitigates the issue.
/home/ubuntu/lm-evaluation-harness/lm_eval/models/vllm_causallms.py:208
```python
# Finalize add_bos_token before constructing the tokenizer.
self.add_bos_token = add_bos_token
if "gemma" in pretrained.lower():
    self.add_bos_token = True
    eval_logger.info(
        "Found 'gemma' in model name, a BOS token will be used as Gemma series models underperform without it."
    )

# Initialize the tokenizer only after the flag is determined, so the
# Gemma override actually takes effect.
self.tokenizer = get_tokenizer(
    tokenizer if tokenizer else pretrained,
    tokenizer_mode=tokenizer_mode,
    trust_remote_code=trust_remote_code,
    revision=tokenizer_revision,
    add_bos_token=self.add_bos_token,
)
self.tokenizer = configure_pad_token(self.tokenizer, model_config=self._config)
```
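The key change is the ordering: get_tokenizer is now called only after the Gemma check has finalized self.add_bos_token, so the add_bos_token=True override reaches the tokenizer instead of being decided after the tokenizer has already been constructed.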
We ran a unit test to verify the effect of the fix, as well as a GSM8K comparison between the vLLM and HF backends.
Unit Test
Below are the results of tokenizing the string "This is a test.". After the fix, the BOS token (id 2) is correctly added at the beginning of the sequence.
Before the fix:
```
2025-11-26:03:11:43 INFO [models.vllm_causallms:357] string: This is a test.
2025-11-26:03:11:43 INFO [models.vllm_causallms:358] encoding: [2094, 563, 496, 1594, 236761]
```
After the fix:
```
2025-11-26:03:23:20 INFO [models.vllm_causallms:357] string: This is a test.
2025-11-26:03:23:20 INFO [models.vllm_causallms:358] encoding: [2, 2094, 563, 496, 1594, 236761]
```
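For anyone who wants to reproduce this check outside the harness, here is a minimal sketch using the transformers tokenizer directly. The model id and the add_bos_token kwarg are assumptions about the setup; the expected ids are taken from the log above.

```python
# Minimal sketch, not the harness's actual test: load the Gemma 3 tokenizer
# with add_bos_token=True (assumed setup) and confirm BOS (id 2) leads the
# encoding of the test string.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-3-27b-it", add_bos_token=True)
ids = tok.encode("This is a test.")
print(ids)  # expected: [2, 2094, 563, 496, 1594, 236761] per the log above
assert ids[0] == tok.bos_token_id
```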
GSM8K
We compare performance on GSM8K using the HF and vLLM backends.
As the tables below show, the accuracy regression reported in the vLLM issue is mitigated.
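For reference, the comparison can be reproduced with lm-eval's Python API. This is a sketch reconstructed from the configs reported below; the actual runs may have used the CLI with equivalent arguments.

```python
# Hedged sketch of the vLLM run, reconstructed from the reported model_args;
# num_fewshot is left unset (gsm8k defaults to 5-shot), matching the output
# headers below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=google/gemma-3-27b-it,tensor_parallel_size=4,"
        "gpu_memory_utilization=0.9,enforce_eager=True,max_num_seqs=64,"
        "max_model_len=8192,dtype=bfloat16,trust_remote_code=True"
    ),
    tasks=["gsm8k"],
    batch_size=64,
)
print(results["results"]["gsm8k"])
```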
HF
```
hf (pretrained=google/gemma-3-27b-it,max_length=8192,parallelize=True,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
```
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9242|±  |0.0073|
|     |       |strict-match    |     5|exact_match|↑  |0.9219|±  |0.0074|
vLLM before the fix
```
vllm (pretrained=google/gemma-3-27b-it,tensor_parallel_size=4,gpu_memory_utilization=0.9,enforce_eager=True,max_num_seqs=64,max_model_len=8192,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
```
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8628|±  |0.0095|
|     |       |strict-match    |     5|exact_match|↑  |0.8552|±  |0.0097|
vLLM after the fix
```
vllm (pretrained=google/gemma-3-27b-it,tensor_parallel_size=4,gpu_memory_utilization=0.9,enforce_eager=True,max_num_seqs=64,max_model_len=8192,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
```
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9265|±  |0.0072|
|     |       |strict-match    |     5|exact_match|↑  |0.9234|±  |0.0073|
lm-eval version
v0.4.9.1