Answer Repetition with FP8 Models
Hi. I'm deploying the Qwen/Qwen3-VL-2B-Instruct-FP8 and Qwen/Qwen3-VL-4B-Instruct-FP8 models based on the provided vLLM examples, but generation always ends up with the following output:
INFO 12-04 21:40:00 [llm.py:346] Supported tasks: ['generate']
========================================
Inputs[0]: input_['prompt']='<|im_start|>user\n<|vision_start|><|video_pad|><|vision_end|>Describe the video with timestamps.<|im_end|>\n<|im_start|>assistant\n'
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Adding requests: 100%|██████████| 1/1 [00:09<00:00,  9.32s/it]
Processed prompts: 100%|██████████| 1/1 [00:12<00:00, 12.90s/it, est. speed input: 812.58 toks/s, output: 79.36 toks/s]
========================================
Generated text: '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!' (the '!' token repeats for the entire output)
[rank0]:[W1204 21:40:22.278306947 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 12-04 21:40:22 [core_client.py:598] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
This doesn't happen with the FP16 (vanilla) versions. What is the issue with FP8 models and are there any workarounds to this? My system is NVIDIA Blackwell based.
Hello Doruksonmez,
You're running into a known stability gap with the current FP8 builds of Qwen3-VL-2B-Instruct-FP8 and Qwen3-VL-4B-Instruct-FP8 when deployed through vLLM, especially on Blackwell hardware.
The behavior you're seeing (degenerate `!` output, NCCL teardown warnings, and an eventual engine crash) is consistent with:
- FP8 kernel instability on Blackwell for vision-language models
Blackwell FP8 support is still early, and the required fused FP8 kernels are not consistently implemented or optimized across all operators vLLM calls when loading the Qwen VL graph. When an unsupported FP8 op silently falls back or misfires, you get output drift → token explosion → core failure.
- Missing or partial support for FP8 weight-only quantization in vLLM
vLLM's FP8 pathway is not yet at feature parity with FP16. Several model families (especially vision-language models) hit ops that are still unimplemented or fall back to slower paths, causing instability under load.
- Qwen VL FP8 checkpoints currently require runtime settings that vLLM doesn't fully honor
Especially around:
• KV-cache precision
• FP8 scaling metadata
• Mixed-precision vision encoder routing
This is why the same checkpoints load cleanly in plain PyTorch or Hugging Face Transformers but fail inside vLLM.
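Until the underlying kernels are fixed, it can help to catch this failure mode programmatically before the output reaches downstream consumers. Here is a minimal sketch of such a check; the function name and the 0.9 ratio threshold are my own assumptions, not part of any vLLM API:

```python
def looks_degenerate(text: str, max_char_ratio: float = 0.9, min_len: int = 32) -> bool:
    """Flag outputs dominated by a single repeated character (e.g. '!!!!...')."""
    if len(text) < min_len:
        return False  # too short to judge reliably
    # Find the most frequent character and check how much of the text it covers.
    most_common = max(set(text), key=text.count)
    return text.count(most_common) / len(text) >= max_char_ratio

print(looks_degenerate("!" * 500))  # True
```

Running this on each generation lets you fail fast (or retry with an FP16 checkpoint) instead of serving garbage.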
⚙️ Workarounds that actually work right now
- Force the model to run with an FP16 KV cache
This prevents the corrupted attention outputs that cause the degenerate token repetition.
Try loading:
vllm serve Qwen/Qwen3-VL-4B-Instruct-FP8 \
  --dtype float16 \
  --kv-cache-dtype auto \
  --quantization fp8 \
  --max-model-len 8192
(Note: `vllm serve` takes the model as a positional argument, and `fp16` is not an accepted spelling for these flags — `--dtype` takes `float16`/`half`, and `--kv-cache-dtype auto` keeps the KV cache in the model dtype, i.e. FP16 here.)
This prevents the FP8 kernel path from triggering while still letting you load FP8 weights.
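If you're using the offline `LLM` API rather than `vllm serve`, the equivalent configuration is below. This is a sketch mirroring the flags above; the parameter names match vLLM's `LLM` constructor, but treat the exact values as a starting point to tune (it needs a Blackwell GPU and the downloaded checkpoint to actually run):

```python
from vllm import LLM

# Mirror of the serve flags: FP8 weights, FP16 activations and KV cache.
llm = LLM(
    model="Qwen/Qwen3-VL-4B-Instruct-FP8",
    dtype="float16",         # compute/activation dtype
    kv_cache_dtype="auto",   # KV cache follows the model dtype (FP16 here)
    quantization="fp8",
    max_model_len=8192,
)
```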
There is nothing wrong with your setup. If you need true FP8 acceleration on Blackwell, also consider moving the deployment to TensorRT-LLM.