Error while loading model
#ollama
ollama run hf.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:latest
Error: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-19a8630baaadacc810f55153f5c6a38b491c53a3cf8df170a27e23c6cbe47324
#llama-cpp-python
Python 3.10.15 (main, Oct 3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from llama_cpp import Llama
>>> llm = Llama(
...     model_path="/root/qwen3_omni_quantized.gguf",
...     n_gpu_layers=35,  # number of layers offloaded to the GPU
...     n_ctx=4096,       # context length
...     verbose=False
... )
Traceback (most recent call last):
File "", line 1, in
File "/root/miniconda3/envs/internutopia/lib/python3.10/site-packages/llama_cpp/llama.py", line 374, in init
internals.LlamaModel(
File "/root/miniconda3/envs/internutopia/lib/python3.10/site-packages/llama_cpp/_internals.py", line 58, in init
raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: /root/qwen3_omni_quantized.ggu
The same problem occurs with the latest llama.cpp:
qwen3omni | ggml_cuda_init: found 1 CUDA devices:
qwen3omni | Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
qwen3omni | load_backend: loaded CUDA backend from /app/libggml-cuda.so
qwen3omni | load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so
qwen3omni | build: 6588 (a86a580a) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
qwen3omni | system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
qwen3omni |
qwen3omni | system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
qwen3omni |
qwen3omni | main: binding port with default address family
qwen3omni | main: HTTP server is listening, hostname: 0.0.0.0, port: 36000, http threads: 11
qwen3omni | main: loading model
qwen3omni | srv load_model: loading model '/root/.cache/llama.cpp/qwen3_omni_quantized.gguf'
qwen3omni | llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 D) (0000:00:10.0) - 48150 MiB free
qwen3omni | llama_model_loader: loaded meta data with 17 key-value pairs and 56465 tensors from /root/.cache/llama.cpp/qwen3_omni_quantized.gguf (version GGUF V3 (latest))
qwen3omni | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
qwen3omni | llama_model_loader: - kv 0: general.architecture str = qwen3-omni
qwen3omni | llama_model_loader: - kv 1: general.name str = Qwen3-Omni
qwen3omni | llama_model_loader: - kv 2: qwen3-omni.vocab_size u32 = 152064
qwen3omni | llama_model_loader: - kv 3: qwen3-omni.context_length u32 = 65536
qwen3omni | llama_model_loader: - kv 4: qwen3-omni.embedding_length u32 = 2048
qwen3omni | llama_model_loader: - kv 5: qwen3-omni.attention.head_count u32 = 32
qwen3omni | llama_model_loader: - kv 6: qwen3-omni.attention.head_count_kv u32 = 4
qwen3omni | llama_model_loader: - kv 7: qwen3-omni.block_count u32 = 48
qwen3omni | llama_model_loader: - kv 8: qwen3-omni.feed_forward_length u32 = 768
qwen3omni | llama_model_loader: - kv 9: qwen3-omni.expert_count u32 = 128
qwen3omni | llama_model_loader: - kv 10: qwen3-omni.expert_used_count u32 = 8
qwen3omni | llama_model_loader: - kv 11: qwen3-omni.rope.freq_base f32 = 1000000.000000
qwen3omni | llama_model_loader: - kv 12: qwen3-omni.attention.layer_norm_rms_epsilon f32 = 0.000001
qwen3omni | llama_model_loader: - kv 13: general.file_type u32 = 1
qwen3omni | llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
qwen3omni | llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,151643] = ["!", "\"", "#", "$", "%", "&", "'", ...
qwen3omni | llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,151643] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
qwen3omni | llama_model_loader: - type f16: 56465 tensors
qwen3omni | print_info: file format = GGUF V3 (latest)
qwen3omni | print_info: file type = F16
qwen3omni | print_info: file size = 30.46 GiB (16.00 BPW)
qwen3omni | llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3-omni'
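So the loader rejects the file based purely on its `general.architecture` metadata: 'qwen3-omni' is not an architecture llama.cpp knows how to build, which is also why ollama and llama-cpp-python fail on the same file. This can be confirmed without loading any weights by reading the GGUF header directly, for example with the `gguf` Python package that ships with llama.cpp (a minimal sketch; the exact way string fields are decoded varies slightly between gguf-py versions):

```python
# Minimal diagnostic sketch, assuming `pip install gguf`.
# Only the GGUF header is read, so it works even when the model cannot be loaded.
from gguf import GGUFReader

reader = GGUFReader("/root/qwen3_omni_quantized.gguf")

# general.architecture is the key llama.cpp uses to select a model implementation.
arch_field = reader.fields["general.architecture"]
# For string fields, the last part holds the raw UTF-8 bytes of the value.
arch = bytes(arch_field.parts[-1]).decode("utf-8")
print(f"architecture: {arch}")  # prints 'qwen3-omni' for this file
```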
Yes, this architecture is not supported yet; I am still waiting for llama.cpp to add support for it. In the meantime, you can use the other quantized version here -> https://huggingface.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-INT8FP16
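If it helps, here is a minimal sketch of fetching that repo with `huggingface_hub`. Note that the INT8/FP16 checkpoint is not a GGUF file and is not loaded through llama.cpp, so the actual loading/inference code should follow the instructions on that model card:

```python
# Minimal sketch: download the suggested INT8/FP16 checkpoint locally.
# Assumes `pip install huggingface_hub`; how to load it afterwards is
# described on the model card, not here.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="vito95311/Qwen3-Omni-30B-A3B-Thinking-INT8FP16",
)
print(f"Model files downloaded to: {local_dir}")
```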