Error while loading model

#3
by LimingShen - opened

#ollama
ollama run hf.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:latest
Error: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-19a8630baaadacc810f55153f5c6a38b491c53a3cf8df170a27e23c6cbe47324

#llama-cpp-python
Python 3.10.15 (main, Oct 3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> from llama_cpp import Llama
>>> llm = Llama(
...     model_path="/root/qwen3_omni_quantized.gguf",
...     n_gpu_layers=35,  # number of layers offloaded to the GPU
...     n_ctx=4096,       # context length
...     verbose=False
... )

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/miniconda3/envs/internutopia/lib/python3.10/site-packages/llama_cpp/llama.py", line 374, in __init__
    internals.LlamaModel(
  File "/root/miniconda3/envs/internutopia/lib/python3.10/site-packages/llama_cpp/_internals.py", line 58, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: /root/qwen3_omni_quantized.ggu
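
One note on the llama-cpp-python attempt: with verbose=False, llama.cpp's own loader messages are suppressed, so only the generic ValueError surfaces. Re-running with verbose=True should print the underlying loader error, which matches the llama.cpp log below. A minimal sketch, assuming the same model path:

from llama_cpp import Llama

# verbose=True lets llama.cpp print its loader diagnostics to stderr,
# so the real failure shows up instead of only the generic ValueError.
try:
    llm = Llama(
        model_path="/root/qwen3_omni_quantized.gguf",
        n_gpu_layers=35,
        n_ctx=4096,
        verbose=True,
    )
except ValueError as err:
    print(f"load failed: {err}")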

The same problem occurs with the latest llama.cpp:

qwen3omni  | ggml_cuda_init: found 1 CUDA devices:
qwen3omni  |   Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
qwen3omni  | load_backend: loaded CUDA backend from /app/libggml-cuda.so
qwen3omni  | load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so
qwen3omni  | build: 6588 (a86a580a) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
qwen3omni  | system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
qwen3omni  | 
qwen3omni  | system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
qwen3omni  | 
qwen3omni  | main: binding port with default address family
qwen3omni  | main: HTTP server is listening, hostname: 0.0.0.0, port: 36000, http threads: 11
qwen3omni  | main: loading model
qwen3omni  | srv    load_model: loading model '/root/.cache/llama.cpp/qwen3_omni_quantized.gguf'
qwen3omni  | llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 D) (0000:00:10.0) - 48150 MiB free
qwen3omni  | llama_model_loader: loaded meta data with 17 key-value pairs and 56465 tensors from /root/.cache/llama.cpp/qwen3_omni_quantized.gguf (version GGUF V3 (latest))
qwen3omni  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
qwen3omni  | llama_model_loader: - kv   0:                       general.architecture str              = qwen3-omni
qwen3omni  | llama_model_loader: - kv   1:                               general.name str              = Qwen3-Omni
qwen3omni  | llama_model_loader: - kv   2:                      qwen3-omni.vocab_size u32              = 152064
qwen3omni  | llama_model_loader: - kv   3:                  qwen3-omni.context_length u32              = 65536
qwen3omni  | llama_model_loader: - kv   4:                qwen3-omni.embedding_length u32              = 2048
qwen3omni  | llama_model_loader: - kv   5:            qwen3-omni.attention.head_count u32              = 32
qwen3omni  | llama_model_loader: - kv   6:         qwen3-omni.attention.head_count_kv u32              = 4
qwen3omni  | llama_model_loader: - kv   7:                     qwen3-omni.block_count u32              = 48
qwen3omni  | llama_model_loader: - kv   8:             qwen3-omni.feed_forward_length u32              = 768
qwen3omni  | llama_model_loader: - kv   9:                    qwen3-omni.expert_count u32              = 128
qwen3omni  | llama_model_loader: - kv  10:               qwen3-omni.expert_used_count u32              = 8
qwen3omni  | llama_model_loader: - kv  11:                  qwen3-omni.rope.freq_base f32              = 1000000.000000
qwen3omni  | llama_model_loader: - kv  12: qwen3-omni.attention.layer_norm_rms_epsilon f32              = 0.000001
qwen3omni  | llama_model_loader: - kv  13:                          general.file_type u32              = 1
qwen3omni  | llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
qwen3omni  | llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151643]  = ["!", "\"", "#", "$", "%", "&", "'", ...
qwen3omni  | llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151643]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
qwen3omni  | llama_model_loader: - type  f16: 56465 tensors
qwen3omni  | print_info: file format = GGUF V3 (latest)
qwen3omni  | print_info: file type   = F16
qwen3omni  | print_info: file size   = 30.46 GiB (16.00 BPW) 
qwen3omni  | llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'qwen3-omni'
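
The failure is visible directly in the GGUF metadata: the file declares general.architecture = qwen3-omni, which this llama.cpp build (6588) does not recognize. A quick way to confirm is a sketch using the gguf Python package that ships with llama.cpp (the field-access pattern may differ slightly between gguf versions):

from gguf import GGUFReader

reader = GGUFReader("/root/.cache/llama.cpp/qwen3_omni_quantized.gguf")
field = reader.fields["general.architecture"]
# String fields keep their raw bytes in parts[]; data[] indexes the value part.
arch = bytes(field.parts[field.data[0]]).decode("utf-8")
print(arch)  # prints 'qwen3-omni', which llama.cpp's loader does not know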

Yes, this GGUF is not supported yet; I am still waiting for llama.cpp to add support for the qwen3-omni architecture. In the meantime, you can use the other quantized version here: https://huggingface.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-INT8FP16
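
That repo is a transformers-format checkpoint rather than a GGUF, so it loads through transformers instead of llama.cpp. A rough sketch, assuming the repo's remote code handles the mixed INT8/FP16 weights (check its model card for the exact model class and generation call, since Qwen3-Omni needs its own processor for multimodal input):

from transformers import AutoModel, AutoProcessor

repo = "vito95311/Qwen3-Omni-30B-A3B-Thinking-INT8FP16"

# trust_remote_code pulls in the repo's own modeling/quantization code;
# device_map="auto" spreads the ~30B weights across GPU and CPU as needed.
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype="auto",
)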
