junk outputs
If I eval this model on GCP Vertex AI Model Garden it's great, with no junk outputs.
If I use vLLM myself, I see a huge number of junk outputs and my eval metrics decline.
e.g.:
"*Ayano's eyes light up, her eyes expressionlijkly shifting"
"he doesn't even torightly look at the screen"
"He stays exactly where you're lean against him"
I have tried a lot of different settings, including copying the vLLM settings used by Vertex AI AND using the same Docker container that Vertex AI uses, but I still get issues.
It might be the slightly different weights that Vertex AI uses.
My vllm arguments look like:
"engine_args": {
    "gpu_memory_utilization": 0.92,
    "language_model_only": true,
    "max_model_len": 10240,
    "max_num_batched_tokens": 10240,
    "max_num_seqs": 64,
    "tensor_parallel_size": 1,
    "trust_remote_code": true,
    "tool-call-parser": "gemma4",
    "reasoning-parser": "gemma4"
},
I'm using the default sampling settings from the config (temp 1.0, top_k 64, top_p 0.95, etc.); they get set automatically by vLLM.
Hi @rirv938 ,
Thanks for raising the issue.
To help us investigate why you are seeing these junk outputs, could you please share more details about your environment? In particular, we would like to see the evaluation script you are using, to understand how the model's output is being handled.
Additionally, could you please provide the exact steps to reproduce this behavior?
Gemma 4's instruction-tuned format terminates turns with `<end_of_turn>` (106), not `<eos>` (1). The model's generation_config.json lists
multiple stop tokens, but transformers overrides that list with the tokenizer's scalar eos_token_id=1 on load.
After that override, vLLM only sees 1 as a stop token, so when the model emits 106 to end its turn, generation keeps going and decodes garbage
from the post-turn distribution. That matches the symptoms you're seeing ("expressionlijkly", "torightly"): off-manifold tokens past
the natural stop.
Fix: pass the full stop set to vLLM explicitly:
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=1.0, top_k=64, top_p=0.95,
    stop_token_ids=[1, 106],  # <eos>, <end_of_turn>
    max_tokens=...,
)
Or on the server: --override-generation-config '{"stop_token_ids":[1,106]}'.
Vertex AI's serving stack likely honors the model's generation_config.json directly and doesn't hit the transformers override path,
which is why it works there.
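You can sanity-check a checkpoint before serving by reading the stop set straight out of its generation_config.json and comparing it with the tokenizer's scalar eos. The config contents below are illustrative, not the real Gemma file; a real checkpoint ships its own values:

```python
import json
import os
import tempfile

# Illustrative generation_config.json contents (a real checkpoint has more keys).
cfg_text = '{"eos_token_id": [1, 106], "temperature": 1.0}'
path = os.path.join(tempfile.mkdtemp(), "generation_config.json")
with open(path, "w") as f:
    f.write(cfg_text)

with open(path) as f:
    cfg = json.load(f)

# eos_token_id may be a scalar or a list; normalize to a list.
stops = cfg["eos_token_id"]
stops = stops if isinstance(stops, list) else [stops]

tokenizer_eos = 1  # what the tokenizer reports as its scalar eos_token_id
missing = [t for t in stops if t != tokenizer_eos]
print(missing)  # stop tokens lost if only the scalar eos is honored -> [106]
```

If `missing` is non-empty, those are the IDs to pass via `stop_token_ids` or `--override-generation-config`.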