Answer Repetition with FP8 Models
Hi. I'm deploying the Qwen/Qwen3-VL-2B-Instruct-FP8 and Qwen/Qwen3-VL-4B-Instruct-FP8 models based on the provided vLLM examples, but generation always ends up with the following output:
INFO 12-04 21:40:00 [llm.py:346] Supported tasks: ['generate']
========================================
Inputs[0]: input_['prompt']='<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>Describe the video with timestamps.<|im_end|>\n<|im_start|>assistant\n'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Adding requests: 100%|██████████| 1/1 [00:09<00:00,  9.32s/it]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.90s/it, est. speed input: 812.58 toks/s, output: 79.36 toks/s]
========================================
Generated text: '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!' (the '!' token repeats for the entire output)
[rank0]:[W1204 21:40:22.278306947 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 12-04 21:40:22 [core_client.py:598] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
This doesn't happen with the FP16 (vanilla) versions. What is the issue with FP8 models and are there any workarounds to this? My system is NVIDIA Blackwell based.
Hello Doruksonmez,
You're running into a known stability gap with the current FP8 builds of Qwen3-VL-2B-Instruct-FP8 and Qwen3-VL-4B-Instruct-FP8 when deployed through vLLM, especially on Blackwell hardware.
The behavior you're seeing (degenerate `!` output, NCCL teardown warnings, and an eventual engine crash) is consistent with:
- FP8 kernel instability on Blackwell for vision-language models
Blackwell FP8 support is still early, and the required fused FP8 kernels are not consistently implemented or optimized across all operators vLLM calls when loading the Qwen VL graph. When an unsupported FP8 op silently falls back or misfires, you get output drift → token explosion → core failure.
- Missing or partial support for FP8 weight-only quantization in vLLM
vLLM's FP8 pathway is not yet at feature parity with FP16. Several model families (especially vision-language models) hit ops that are still unimplemented or fall back to slower paths, causing instability under load.
- Qwen VL FP8 checkpoints currently require runtime settings that vLLM doesn't fully honor
Especially around:
• KV-cache precision
• FP8 scaling metadata
• Mixed-precision vision encoder routing
This is why the same checkpoints load cleanly in plain PyTorch or Hugging Face Transformers but fail inside vLLM.
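Until the underlying kernels are fixed, it can help to catch this failure mode programmatically before the output reaches downstream consumers. Here is a minimal sketch of such a check; the function name and the 0.9 ratio threshold are my own assumptions, not part of any vLLM API:

```python
def looks_degenerate(text: str, max_char_ratio: float = 0.9, min_len: int = 32) -> bool:
    """Flag outputs dominated by a single repeated character (e.g. '!!!!...')."""
    if len(text) < min_len:
        return False  # too short to judge reliably
    # Find the most frequent character and check how much of the text it covers.
    most_common = max(set(text), key=text.count)
    return text.count(most_common) / len(text) >= max_char_ratio

print(looks_degenerate("!" * 500))  # True
```

Running this on each generation lets you fail fast (or retry with an FP16 checkpoint) instead of serving garbage.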
⚙️ Workarounds that actually work right now
- Force the model to run with an FP16 KV cache
This prevents the corrupted attention outputs that cause the degenerate token repetition.
Try loading:
vllm serve Qwen/Qwen3-VL-4B-Instruct-FP8 \
  --dtype float16 \
  --kv-cache-dtype auto \
  --quantization fp8 \
  --max-model-len 8192
(Note: `vllm serve` takes the model as a positional argument, and `fp16` is not an accepted spelling for these flags — `--dtype` takes `float16`/`half`, and `--kv-cache-dtype auto` keeps the KV cache in the model dtype, i.e. FP16 here.)
This prevents the FP8 kernel path from triggering while still letting you load FP8 weights.
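If you're using the offline `LLM` API rather than `vllm serve`, the equivalent configuration is below. This is a sketch mirroring the flags above; the parameter names match vLLM's `LLM` constructor, but treat the exact values as a starting point to tune (it needs a Blackwell GPU and the downloaded checkpoint to actually run):

```python
from vllm import LLM

# Mirror of the serve flags: FP8 weights, FP16 activations and KV cache.
llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct-FP8",
    dtype="float16",         # compute/activation dtype
    kv_cache_dtype="auto",   # KV cache follows the model dtype (FP16 here)
    quantization="fp8",
    max_model_len=8192,
)
```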
There is nothing wrong with your setup. If you need true FP8 acceleration on Blackwell, also consider moving the deployment to TensorRT-LLM.