Fix for vLLM Metal Compatibility for Qwen3-4b
#3
by AnandSingh - opened
Fix for vLLM Metal Compatibility
I successfully debugged and fixed multiple compatibility issues between vllm-metal and the core vllm library to enable inference on Apple Silicon (M4).
Issues and Solutions
- **Engine Initialization Failure (SIGSEGV)**
  - Issue: `RuntimeError: Engine core initialization failed with exit code -11`. This was caused by Python's `fork` multiprocessing method conflicting with Metal/MLX GPU contexts.
  - Fix: Set `VLLM_WORKER_MULTIPROC_METHOD=spawn` in `test.py` to use the `spawn` start method, which is safer for GPU contexts.
- **Device Capability Error (AttributeError)**
  - Issue: `AttributeError: 'tuple' object has no attribute 'to_int'`. The vllm quantization logic expected a device capability object with a `.to_int()` method, but vllm-metal returned a plain tuple `(8, 0)`.
  - Fix: Modified `vllm_metal/platform.py` to return a custom `MetalCapability` class (inheriting from `tuple`) that implements `.to_int()`.
- **Sampler Assertion Error (AssertionError)**
  - Issue: `AssertionError: prompt_token_ids is not None` in `vllm/v1/sample/sampler.py`. The vllm-metal model runner was explicitly passing `None` for `prompt_token_ids` in the sampling metadata.
  - Fix: Updated `vllm_metal/v1/model_runner.py` to accept `prompt_token_ids` in `_make_sampling_metadata`, pass it through to the `SamplingMetadata` constructor, and update the call sites (`_prefill_single`, `_batched_decode`) to pass the token history.
- **Tensor Device Mismatch (RuntimeError)**
  - Issue: `RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "mps:0"`. This occurred because vllm's `make_tensor_with_pad` tries to pin memory for CPU tensors via `pin_memory=True`; on macOS with PyTorch/MPS, this triggered an invalid attempt to use MPS-pinned memory for these tensors.
  - Fix: Monkey-patched `vllm.utils.platform_utils.is_pin_memory_available` in `vllm_metal/__init__.py` to always return `False`, disabling the problematic memory-pinning behavior for the Metal backend.
- **Invalid Tensor Format (AttributeError)**
  - Issue: `AttributeError: 'list' object has no attribute 'device'`. I initially passed `prompt_token_ids` as a list of lists, but vllm's `apply_all_penalties` expects a `torch.Tensor`.
  - Fix: Modified `vllm_metal/v1/model_runner.py` to convert `prompt_token_ids` into a padded `torch.Tensor` using `vllm.utils.torch_utils.make_tensor_with_pad` before creating the `SamplingMetadata` object.
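
The `MetalCapability` shim from the second item could look like the sketch below. The exact `to_int` encoding is an assumption here, chosen to match vLLM's own `DeviceCapability` convention of encoding `(major, minor)` as `major * 10 + minor`:

```python
class MetalCapability(tuple):
    """Tuple subclass so code that treats the capability as a plain
    (major, minor) tuple keeps working, while adding the .to_int()
    hook that the quantization logic expects."""

    def __new__(cls, major: int, minor: int):
        return super().__new__(cls, (major, minor))

    @property
    def major(self) -> int:
        return self[0]

    @property
    def minor(self) -> int:
        return self[1]

    def to_int(self) -> int:
        # Assumed encoding, following vLLM's DeviceCapability: 8.0 -> 80.
        return self.major * 10 + self.minor


cap = MetalCapability(8, 0)
assert cap == (8, 0)       # still compares equal to the old plain tuple
assert cap.to_int() == 80  # satisfies the quantization capability check
```

Inheriting from `tuple` is the key design choice: any existing call site that unpacks or compares the capability as `(8, 0)` is unaffected, so only the code path that needed `.to_int()` sees a difference.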
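
The pin-memory monkey-patch from the fourth item could be sketched as follows in `vllm_metal/__init__.py`. The import is guarded so the sketch stands alone; the attribute path is the one named above and may differ across vllm versions:

```python
def _is_pin_memory_available() -> bool:
    # MPS has no usable pinned-host-memory path for these tensors on
    # macOS, so report pinning as unavailable for the Metal backend.
    return False


try:
    import vllm.utils.platform_utils as _platform_utils
    # Replace the probe so make_tensor_with_pad never requests
    # pin_memory=True on this platform.
    _platform_utils.is_pin_memory_available = _is_pin_memory_available
except ImportError:
    # vllm is not installed in this standalone sketch; nothing to patch.
    pass
```

Patching at plugin import time means every later caller inside vllm picks up the override without any core-library changes.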
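
The list-of-lists conversion from the third and fifth items behaves like this pure-Python sketch; the real fix calls `vllm.utils.torch_utils.make_tensor_with_pad`, which additionally materializes the result as a `torch.Tensor` on the target device:

```python
def pad_token_ids(prompt_token_ids, pad_id=0):
    """Right-pad variable-length prompts to a rectangular batch,
    mirroring what make_tensor_with_pad does before tensor creation."""
    max_len = max(len(ids) for ids in prompt_token_ids)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in prompt_token_ids]


# Two prompts of different lengths become one rectangular batch that
# can back a 2-D tensor with a .device attribute.
batch = pad_token_ids([[11, 12, 13], [21]], pad_id=0)
assert batch == [[11, 12, 13], [21, 0, 0]]
```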
Verification
The final test script (`test.py`) running `Qwen/Qwen3-4B-MLX-4bit` executed successfully:
- Model Loading: Successful.
- Inference: Generated a coherent technical explanation about MLX vs PyTorch MPS.
- Performance: Input speed ~2.77 toks/s, output speed ~30.81 toks/s.
- Result: The process exited with code 0.
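
For reference, the multiprocessing fix from the first item amounts to one line near the top of `test.py`, executed before any engine object is created (a sketch; the rest of `test.py` is not shown in this post):

```python
import os

# Must run before vLLM spawns its engine workers: "spawn" starts clean
# child processes instead of fork()ing an interpreter that already
# holds Metal/MLX GPU state, which is what produced the SIGSEGV.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```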
Conclusion
The vllm-metal plugin is now functional for inference with vLLM v1. The fixes addressed both API mismatches between the plugin and the core library (`None` checks, expected types) and platform-specific behavior on macOS (memory pinning).