RuntimeError: The size of tensor a (3072) must match the size of tensor b (6144) at non-singleton dimension 1
In sglang you must run with --disable-shared-experts-fusion; otherwise it will incorrectly attempt to fuse the BF16 shared expert.
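For anyone on sglang, here is a minimal sketch of where that flag goes. It is an assumption-laden example, not a verified launch line: the model path and served name are borrowed from the vLLM commands in this thread, the tensor-parallel size of 4 is illustrative, and any quantization or speculative-decoding options your setup needs are left out. The flag applies to sglang only; it is not a vLLM option.

```shell
# Assumption: the checkpoint lives at the path quoted elsewhere in this thread.
MODEL_PATH=/vllm-workspace/models/GLM-5.1-NVFP4

# Assemble the sglang launch command; the last flag is the one named above.
# --tp-size 4 is illustrative; match it to your GPU count.
CMD="python3 -m sglang.launch_server \
  --model-path $MODEL_PATH \
  --tp-size 4 \
  --served-model-name glm-5.1 \
  --disable-shared-experts-fusion"

# Print the command for review; launch it with: eval "$CMD"
echo "$CMD"
```

Printing the command first makes it easy to confirm the flag is actually present before starting a long model load.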
I am running it with vLLM like this:
vllm serve /vllm-workspace/models/GLM-5.1-NVFP4 \
  --tensor-parallel-size 4 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --chat-template-content-format=string \
  --served-model-name glm-5.1
I am getting the same error message.
How did you solve it, @lianyouzao? Any recommendations, @lukealonso?
I am using vLLM 0.19.1.
Same error on 8xH200:
vllm serve /vllm-workspace/models/GLM-5.1-NVFP4/ --tensor-parallel-size 8 --speculative-config.method mtp --speculative-config.num_speculative_tokens 3 --tool-call-parser glm47 --reasoning-parser glm45 --enable-auto-tool-choice --chat-template-content-format=string --served-model-name glm-5.1