junk outputs
If I eval this model on GCP Vertex AI Model Garden it's great, with no junk outputs.
If I use vLLM myself, I see a huge number of junk outputs and my eval metrics decline.
e.g.:
"*Ayano's eyes light up, her eyes expressionlijkly shifting"
"he doesn't even torightly look at the screen"
"He stays exactly where you're lean against him"
I have tried a lot of different settings, including copying the vLLM settings used by Vertex AI AND using the same Docker container that Vertex AI uses, but I still get issues.
It might be the slightly different weights that Vertex AI uses.
My vllm arguments look like:
"engine_args": {
    "gpu_memory_utilization": 0.92,
    "language_model_only": true,
    "max_model_len": 10240,
    "max_num_batched_tokens": 10240,
    "max_num_seqs": 64,
    "tensor_parallel_size": 1,
    "trust_remote_code": true,
    "tool-call-parser": "gemma4",
    "reasoning-parser": "gemma4"
},
I'm using the default sampling settings from the config (temp 1.0, top_k 64, top_p 0.95, etc.); they get set automatically by vLLM.
Hi @rirv938 ,
Thanks for raising the issue.
To help us investigate why you are seeing these junk outputs, could you please share more details about your environment? In particular, we would like to see the evaluation script you are using, to understand how the model's output is being handled.
Additionally, could you please provide the exact steps to reproduce this behavior?
Gemma 4's instruction-tuned format terminates turns with `<end_of_turn>` (106), not `<eos>` (1). The model's generation_config.json lists
multiple stop tokens, but transformers overrides that list with the tokenizer's scalar eos_token_id=1 on load.
After that override, vLLM only sees 1 as a stop token, so when the model emits 106 to end its turn, generation keeps going and decodes garbage
from the post-turn distribution. That matches the symptoms you're seeing ("expressionlijkly", "torightly"): off-manifold tokens past
the natural stop.
Fix: pass the full stop set to vLLM explicitly:
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=1.0, top_k=64, top_p=0.95,
    stop_token_ids=[1, 106],  # <eos>, <end_of_turn>
    max_tokens=...,
)
Or on the server: --override-generation-config '{"stop_token_ids":[1,106]}'.
Vertex AI's serving stack likely honors the model's generation_config.json directly and doesn't hit the transformers override path,
which is why it works there.
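You can sanity-check a checkpoint before serving by reading the stop set straight out of its generation_config.json and comparing it with the tokenizer's scalar eos. The config contents below are illustrative, not the real Gemma file; a real checkpoint ships its own values:

```python
import json
import os
import tempfile

# Illustrative generation_config.json contents (a real checkpoint has more keys).
cfg_text = '{"eos_token_id": [1, 106], "temperature": 1.0}'
path = os.path.join(tempfile.mkdtemp(), "generation_config.json")
with open(path, "w") as f:
    f.write(cfg_text)

with open(path) as f:
    cfg = json.load(f)

# eos_token_id may be a scalar or a list; normalize to a list.
stops = cfg["eos_token_id"]
stops = stops if isinstance(stops, list) else [stops]

tokenizer_eos = 1  # what the tokenizer reports as its scalar eos_token_id
missing = [t for t in stops if t != tokenizer_eos]
print(missing)  # stop tokens lost if only the scalar eos is honored -> [106]
```

If `missing` is non-empty, those are the IDs to pass via `stop_token_ids` or `--override-generation-config`.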