Tool use crashes the model
I'm using a DGX Spark with Docker and the latest official vLLM container build from NVIDIA. The model performs fine as a regular chat model, but when used from Zed, which makes tool calls, vLLM crashes roughly 30 seconds after inference starts. It seems to be CUDA-related. I'm dumping the relevant parts of the log below; sorry for the wall of text, I could not find a spoiler tag to hide it. I use the 1M context window config.
EDIT: It crashes when used for normal chat as well, it just takes longer.
vllm-1 | (APIServer pid=135) INFO: Started server process [135]
vllm-1 | (APIServer pid=135) INFO: Waiting for application startup.
vllm-1 | (APIServer pid=135) INFO: Application startup complete.
vllm-1 | (APIServer pid=135) INFO: 172.20.0.1:60856 - "GET /v1/models HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO: 172.20.0.1:52212 - "GET /v1/models HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO 01-29 08:03:37 [chat_utils.py:574] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
vllm-1 | (APIServer pid=135) INFO: 172.20.0.1:52218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO 01-29 08:04:07 [loggers.py:236] Engine 000: Avg prompt throughput: 3.8 tokens/s, Avg generation throughput: 29.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO 01-29 08:04:17 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO 01-29 08:04:27 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO: 172.20.0.1:50888 - "GET /v1/models HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO: 172.20.0.1:50896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO: 172.20.0.1:59948 - "GET /v1/models HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO 01-29 08:04:37 [loggers.py:236] Engine 000: Avg prompt throughput: 313.3 tokens/s, Avg generation throughput: 35.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO: 172.20.0.1:59950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO 01-29 08:04:47 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 50.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO 01-29 08:04:57 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO: 192.168.40.63:35670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO 01-29 08:10:17 [loggers.py:236] Engine 000: Avg prompt throughput: 248.0 tokens/s, Avg generation throughput: 44.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO 01-29 08:10:23 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
vllm-1 | (APIServer pid=135) INFO: 192.168.40.63:40820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
vllm-1 | (APIServer pid=135) INFO 01-29 08:10:24 [qwen3coder_tool_parser.py:83] vLLM Successfully import tool parser Qwen3CoderToolParser !
vllm-1 | (APIServer pid=135) INFO 01-29 08:10:27 [loggers.py:236] Engine 000: Avg prompt throughput: 1059.5 tokens/s, Avg generation throughput: 42.8 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.12.0+35a9f223.nv25.12.post1) with config: model='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4', speculative_config=None, tokenizer='nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='nano_v3', reasoning_parser_plugin='nano_v3_reasoning_parser.py', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=Nemotron3Nano, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/f57cbad072', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': 
True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/f57cbad072/rank_0_0/backbone'},
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['chatcmpl-bf9c84d14895a318'], resumed_req_ids=[], new_token_ids=[], all_token_ids={}, new_block_ids=[null], num_computed_tokens=[10910], num_output_tokens=[311]), num_scheduled_tokens={chatcmpl-bf9c84d14895a318: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0006991051454138253, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={})
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] EngineCore encountered a fatal error.
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] Traceback (most recent call last):
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 836, in run_engine_core
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] engine_core.run_busy_loop()
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 863, in run_busy_loop
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] self._process_engine_step()
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 892, in _process_engine_step
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] outputs, model_executed = self.step_fn()
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] ^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 441, in step_with_batch_queue
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] model_output = future.result()
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] ^^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] return self.__get_result()
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] ^^^^^^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] raise self._exception
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] result = self.fn(*self.args, **self.kwargs)
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 223, in get_output
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] self.async_copy_ready_event.synchronize()
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] torch.AcceleratorError: CUDA error: an illegal instruction was encountered
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
vllm-1 | (EngineCore_DP0 pid=179) ERROR 01-29 08:10:33 [core.py:845]
vllm-1 | (EngineCore_DP0 pid=179) Process EngineCore_DP0:
vllm-1 | (EngineCore_DP0 pid=179) Traceback (most recent call last):
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
vllm-1 | (EngineCore_DP0 pid=179) self.run()
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
vllm-1 | (EngineCore_DP0 pid=179) self._target(*self._args, **self._kwargs)
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 847, in run_engine_core
vllm-1 | (EngineCore_DP0 pid=179) raise e
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 836, in run_engine_core
vllm-1 | (EngineCore_DP0 pid=179) engine_core.run_busy_loop()
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 863, in run_busy_loop
vllm-1 | (EngineCore_DP0 pid=179) self._process_engine_step()
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 892, in _process_engine_step
vllm-1 | (EngineCore_DP0 pid=179) outputs, model_executed = self.step_fn()
vllm-1 | (EngineCore_DP0 pid=179) ^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 441, in step_with_batch_queue
vllm-1 | (EngineCore_DP0 pid=179) model_output = future.result()
vllm-1 | (EngineCore_DP0 pid=179) ^^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 456, in result
vllm-1 | (EngineCore_DP0 pid=179) return self.__get_result()
vllm-1 | (EngineCore_DP0 pid=179) ^^^^^^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
vllm-1 | (EngineCore_DP0 pid=179) raise self._exception
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
vllm-1 | (EngineCore_DP0 pid=179) result = self.fn(*self.args, **self.kwargs)
vllm-1 | (EngineCore_DP0 pid=179) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (EngineCore_DP0 pid=179) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 223, in get_output
vllm-1 | (EngineCore_DP0 pid=179) self.async_copy_ready_event.synchronize()
vllm-1 | (EngineCore_DP0 pid=179) torch.AcceleratorError: CUDA error: an illegal instruction was encountered
vllm-1 | (EngineCore_DP0 pid=179) Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
vllm-1 | (EngineCore_DP0 pid=179) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm-1 | (EngineCore_DP0 pid=179) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
vllm-1 | (EngineCore_DP0 pid=179) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
vllm-1 | (EngineCore_DP0 pid=179)
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] AsyncLLM output_handler failed.
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] Traceback (most recent call last):
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 498, in output_handler
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] outputs = await engine_core.get_output_async()
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 885, in get_output_async
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] raise self._format_exception(outputs) from None
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [async_llm.py:546] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] Error in chat completion stream generator.
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] Traceback (most recent call last):
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 619, in chat_completion_stream_generator
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] async for res in result_generator:
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 444, in generate
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] out = q.get_nowait() or await q.get()
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] ^^^^^^^^^^^^^
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 70, in get
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] raise output
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 498, in output_handler
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] outputs = await engine_core.get_output_async()
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 885, in get_output_async
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] raise self._format_exception(outputs) from None
vllm-1 | (APIServer pid=135) ERROR 01-29 08:10:33 [serving_chat.py:1287] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
vllm-1 | (APIServer pid=135) INFO: 192.168.40.63:35670 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
vllm-1 | [rank0]:[W129 08:10:33.254966974 CUDAGuardImpl.h:122] Warning: CUDA warning: an illegal instruction was encountered (function destroyEvent)
vllm-1 | terminate called after throwing an instance of 'c10::AcceleratorError'
vllm-1 | what(): CUDA error: an illegal instruction was encountered
vllm-1 | Search for `cudaErrorIllegalInstruction' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
vllm-1 | CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
vllm-1 | For debugging consider passing CUDA_LAUNCH_BLOCKING=1
vllm-1 | Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
vllm-1 |
vllm-1 | Exception raised from currentStreamCaptureStatusMayInitCtx at /opt/pytorch/pytorch/c10/cuda/CUDAGraphsC10Utils.h:71 (most recent call first):
vllm-1 | frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0xfa16368412b4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm-1 | frame #1: <unknown function> + 0x43e478 (0xfa163693e478 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
vllm-1 | frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, unsigned int, bool) + 0x1bc (0xfa163693e6ec in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
vllm-1 | frame #3: <unknown function> + 0x106ac68 (0xfa163760ac68 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
vllm-1 | frame #4: <unknown function> + 0x5cda64 (0xfa165f30da64 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
vllm-1 | frame #5: c10::TensorImpl::~TensorImpl() + 0x14 (0xfa16367d3464 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
vllm-1 | frame #6: <unknown function> + 0xb86d00 (0xfa165f8c6d00 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
vllm-1 | frame #7: <unknown function> + 0xb8700c (0xfa165f8c700c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
vllm-1 | frame #8: _PyObject_ClearManagedDict + 0x10c (0x4fbef0 in VLLM::EngineCore)
vllm-1 | frame #9: VLLM::EngineCore() [0x526e2c]
vllm-1 | frame #10: VLLM::EngineCore() [0x5b408c]
vllm-1 | frame #11: VLLM::EngineCore() [0x5b33cc]
vllm-1 | frame #12: VLLM::EngineCore() [0x58d334]
vllm-1 | frame #13: _PyEval_EvalFrameDefault + 0x8ee0 (0x56d514 in VLLM::EngineCore)
vllm-1 | frame #14: VLLM::EngineCore() [0x4c7024]
vllm-1 | frame #15: _PyEval_EvalFrameDefault + 0x43b8 (0x5689ec in VLLM::EngineCore)
vllm-1 | frame #16: VLLM::EngineCore() [0x4c7024]
vllm-1 | frame #17: _PyEval_EvalFrameDefault + 0x43b8 (0x5689ec in VLLM::EngineCore)
vllm-1 | frame #18: VLLM::EngineCore() [0x4c3cb4]
vllm-1 | frame #19: PyObject_CallMethodObjArgs + 0xa8 (0x4c5898 in VLLM::EngineCore)
vllm-1 | frame #20: PyImport_ImportModuleLevelObject + 0x378 (0x58ee2c in VLLM::EngineCore)
vllm-1 | frame #21: _PyEval_EvalFrameDefault + 0x4d10 (0x569344 in VLLM::EngineCore)
vllm-1 | frame #22: PyEval_EvalCode + 0x130 (0x563224 in VLLM::EngineCore)
vllm-1 | frame #23: VLLM::EngineCore() [0x5600e8]
vllm-1 | frame #24: VLLM::EngineCore() [0x50367c]
vllm-1 | frame #25: _PyEval_EvalFrameDefault + 0x3cf8 (0x56832c in VLLM::EngineCore)
vllm-1 | frame #26: VLLM::EngineCore() [0x4c3cb4]
vllm-1 | frame #27: PyObject_CallMethodObjArgs + 0xa8 (0x4c5898 in VLLM::EngineCore)
vllm-1 | frame #28: PyImport_ImportModuleLevelObject + 0x378 (0x58ee2c in VLLM::EngineCore)
vllm-1 | frame #29: _PyEval_EvalFrameDefault + 0x4d10 (0x569344 in VLLM::EngineCore)
vllm-1 | frame #30: PyEval_EvalCode + 0x130 (0x563224 in VLLM::EngineCore)
vllm-1 | frame #31: VLLM::EngineCore() [0x5600e8]
vllm-1 | frame #32: VLLM::EngineCore() [0x50367c]
vllm-1 | frame #33: _PyEval_EvalFrameDefault + 0x3cf8 (0x56832c in VLLM::EngineCore)
vllm-1 | frame #34: VLLM::EngineCore() [0x4c3cb4]
vllm-1 | frame #35: PyObject_CallMethodObjArgs + 0xa8 (0x4c5898 in VLLM::EngineCore)
vllm-1 | frame #36: PyImport_ImportModuleLevelObject + 0x378 (0x58ee2c in VLLM::EngineCore)
vllm-1 | frame #37: _PyEval_EvalFrameDefault + 0x4d10 (0x569344 in VLLM::EngineCore)
vllm-1 | frame #38: PyEval_EvalCode + 0x130 (0x563224 in VLLM::EngineCore)
vllm-1 | frame #39: VLLM::EngineCore() [0x5600e8]
vllm-1 | frame #40: VLLM::EngineCore() [0x50367c]
vllm-1 | frame #41: _PyEval_EvalFrameDefault + 0x3cf8 (0x56832c in VLLM::EngineCore)
vllm-1 | frame #42: VLLM::EngineCore() [0x4c3cb4]
vllm-1 | frame #43: PyObject_CallMethodObjArgs + 0xa8 (0x4c5898 in VLLM::EngineCore)
vllm-1 | frame #44: PyImport_ImportModuleLevelObject + 0x378 (0x58ee2c in VLLM::EngineCore)
vllm-1 | frame #45: _PyEval_EvalFrameDefault + 0x4d10 (0x569344 in VLLM::EngineCore)
vllm-1 | frame #46: PyEval_EvalCode + 0x130 (0x563224 in VLLM::EngineCore)
vllm-1 | frame #47: VLLM::EngineCore() [0x5600e8]
vllm-1 | frame #48: VLLM::EngineCore() [0x50367c]
vllm-1 | frame #49: _PyEval_EvalFrameDefault + 0x3cf8 (0x56832c in VLLM::EngineCore)
vllm-1 | frame #50: VLLM::EngineCore() [0x4c3cb4]
vllm-1 | frame #51: PyObject_CallMethodObjArgs + 0xa8 (0x4c5898 in VLLM::EngineCore)
vllm-1 | frame #52: PyImport_ImportModuleLevelObject + 0x378 (0x58ee2c in VLLM::EngineCore)
vllm-1 | frame #53: VLLM::EngineCore() [0x5605d8]
vllm-1 | frame #54: VLLM::EngineCore() [0x50367c]
vllm-1 | frame #55: _PyEval_EvalFrameDefault + 0x3cf8 (0x56832c in VLLM::EngineCore)
vllm-1 | frame #56: VLLM::EngineCore() [0x4c3cb4]
vllm-1 | frame #57: PyObject_CallMethodObjArgs + 0xa8 (0x4c5898 in VLLM::EngineCore)
vllm-1 | frame #58: PyImport_ImportModuleLevelObject + 0x378 (0x58ee2c in VLLM::EngineCore)
vllm-1 | frame #59: _PyEval_EvalFrameDefault + 0x4d10 (0x569344 in VLLM::EngineCore)
vllm-1 | frame #60: VLLM::EngineCore() [0x6c1cf8]
vllm-1 | frame #61: Py_FinalizeEx + 0x58 (0x67b278 in VLLM::EngineCore)
vllm-1 | frame #62: Py_Exit + 0x18 (0x67c708 in VLLM::EngineCore)
vllm-1 |
vllm-1 | (APIServer pid=135) INFO 01-29 08:10:37 [loggers.py:236] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 36.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
vllm-1 | (APIServer pid=135) INFO: Shutting down
vllm-1 | (APIServer pid=135) INFO: Waiting for application shutdown.
vllm-1 | (APIServer pid=135) INFO: Application shutdown complete.
vllm-1 | (APIServer pid=135) INFO: Finished server process [135]
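Since CUDA kernel errors are reported asynchronously (as the traceback itself notes), the stack trace above may not point at the kernel that actually faulted. A minimal Compose override sketch to capture a synchronous trace; the service name `vllm` and the image tag are assumptions, so adjust them to match your actual compose file:

```yaml
# docker-compose.override.yml (assumed service name: vllm)
services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.12-py3   # assumed tag; use whatever you run today
    environment:
      # Force synchronous kernel launches so the traceback points at the
      # kernel that raised cudaErrorIllegalInstruction. Much slower; for
      # debugging only.
      - CUDA_LAUNCH_BLOCKING=1
```

This won't fix the crash, but it should make the next log dump far more useful for whoever triages the kernel issue.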
@ruben-bibsyst Note that neither the official vLLM nor the NVIDIA vLLM build currently supports the SM 12.1 compute kernels required to use the Blackwell NVFP4 compute units. Some community members have made custom builds of vLLM that do support this, but official support is lacking. I expect this will be addressed in the coming months, as NVIDIA likely aims this model at the DGX Spark device.
AWQ quants by QuantTrio work fine for the Qwen3 models. Or, if you must use this Nemotron model, use the INT8 quant in the meantime.
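If you go the AWQ route as a stopgap, a hedged sketch of what the compose service might look like; the model repo id here is a guess, so check QuantTrio's Hugging Face page for the actual name, and the service name is assumed:

```yaml
services:
  vllm:
    # Hypothetical AWQ model id; verify the exact repo on Hugging Face.
    # vLLM should detect the AWQ quantization from the checkpoint config.
    command: >
      vllm serve QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      --max-model-len 131072
```

AWQ runs on standard INT/FP compute paths, so it sidesteps the NVFP4 kernel support question entirely.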
@ruben-bibsyst Note that the official vLLM and the Nvidia vLLM do not currently support the SM12.1 compute kernels required to use the blackwell NVFP4 compute units.
Thanks for the info! I assumed this model was compatible with the current vLLM version provided by Nvidia for the DGX Spark, since it was mentioned in the README: "If you are on Jetson Thor or DGX Spark, please use this vllm container."
I'll keep an eye out for improvements in the coming months!
Updates have been patchy, to be fair, but as far as I can tell SM 12.1 isn't yet supported.
vLLM does work on the Spark, just not NVFP4. Happy to be corrected if NVIDIA can provide more details.
I'm also seeing this somewhat randomly. I'm using nvcr.io/nvidia/vllm:26.01-py3, since it seems to be newer than the 25.12 one they link to.
Bumping this; I'm also seeing it with nvcr.io/nvidia/vllm:26.01-py3.
I also see the issue with vllm/vllm-openai:v0.17.1-aarch64-cu130.