Qwen3.5 PRISM Dynamic Quantization (GGUF)

PRISM Dynamic Quantization (PRISM-DQ) applies per-tensor-class bit allocation based on structural weight analysis — no calibration data or importance matrices required. Each tensor class (attention keys, FFN gates, SSM components, etc.) receives a quantization type proportional to its measured sensitivity, while staying within a target bits-per-weight budget.

This repo contains PRISM-DQ quantized GGUFs for the full Qwen3.5 vision-language model family (0.8B, 2B, 4B, 9B), plus multimodal projection weights (mmproj) for vision capabilities.

Benchmark Results

Pareto Frontier Analysis

Perplexity Comparison (UltraChat, 5 chunks, 512 ctx)

| Model | Method | BPW | PPL | Size |
|---|---|---|---|---|
| Qwen3.5-0.8B | Q3_K_M | 4.96 | 12.14 | 470 MB |
| Qwen3.5-0.8B | PRISM-DQ | 4.94 | 11.42 | 468 MB |
| Qwen3.5-0.8B | Q3_K_M (imatrix) | 4.96 | 11.31 | 470 MB |
| Qwen3.5-0.8B | UD-Q3_K_XL | 5.19 | 10.94 | 492 MB |
| Qwen3.5-0.8B | IQ4_XS (imatrix) | 5.20 | 10.35 | 493 MB |
| Qwen3.5-0.8B | UD-Q4_K_XL | 5.89 | 10.07 | 559 MB |
| Qwen3.5-2B | Q3_K_M | 4.69 | 9.35 | 1107 MB |
| Qwen3.5-2B | PRISM-DQ | 4.68 | 9.26 | 1104 MB |
| Qwen3.5-2B | Q3_K_M (imatrix) | 4.69 | 8.40 | 1107 MB |
| Qwen3.5-2B | UD-Q3_K_XL | 4.91 | 8.27 | 1159 MB |
| Qwen3.5-2B | IQ4_XS (imatrix) | 4.97 | 8.12 | 1173 MB |
| Qwen3.5-2B | UD-Q4_K_XL | 5.68 | 8.07 | 1340 MB |
| Qwen3.5-4B | Q3_K_M | 4.36 | 6.88 | 2293 MB |
| Qwen3.5-4B | PRISM-DQ | 4.31 | 6.82 | 2271 MB |
| Qwen3.5-4B | Q3_K_M (imatrix) | 4.36 | 6.62 | 2293 MB |
| Qwen3.5-4B | UD-Q3_K_XL | 4.63 | 6.66 | 2436 MB |
| Qwen3.5-4B | IQ4_XS (imatrix) | 4.70 | 6.51 | 2477 MB |
| Qwen3.5-4B | UD-Q4_K_XL | 5.53 | 6.56 | 2912 MB |
| Qwen3.5-9B | Q3_K_M | 4.17 | 6.25 | 4674 MB |
| Qwen3.5-9B | PRISM-DQ | 4.15 | 6.18 | 4652 MB |
| Qwen3.5-9B | Q3_K_M (imatrix) | 4.17 | 5.96 | 4674 MB |
| Qwen3.5-9B | UD-Q3_K_XL | 4.51 | 6.01 | 5054 MB |
| Qwen3.5-9B | IQ4_XS (imatrix) | 4.61 | 6.03 | 5169 MB |
| Qwen3.5-9B | UD-Q4_K_XL | 5.33 | 5.86 | 5966 MB |

Key Findings

  • PRISM-DQ beats uniform Q3_K_M on all four models (1–6% lower perplexity) at the same or lower BPW
  • Smallest file size at competitive perplexity across the Qwen3.5 family
  • No calibration data needed — allocation decisions are purely weight-analysis-based
  • When combined with importance matrices, PRISM-DQ+imatrix achieves Pareto-optimal results on the 4B and 9B models

Model Files

Each subfolder contains the quantized model GGUF plus multimodal projection weights:

Qwen3.5-0.8B/
  Qwen3.5-0.8B-PRISM-DQ.gguf    (446 MB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja

Qwen3.5-2B/
  Qwen3.5-2B-PRISM-DQ.gguf      (1.0 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja

Qwen3.5-4B/
  Qwen3.5-4B-PRISM-DQ.gguf      (2.1 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja

Qwen3.5-9B/
  Qwen3.5-9B-PRISM-DQ.gguf      (4.3 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja

Usage

Text-only (llama.cpp)

# Download a model
huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
  Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf --local-dir .

# Run with llama-cli
llama-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  -p "You are a helpful assistant." \
  --chat-template-file Qwen3.5-9B/chat_template.jinja \
  -cnv

Vision (multimodal)

# Download model + mmproj
huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
  Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  Qwen3.5-9B/mmproj-BF16.gguf --local-dir .

# Run with llama-mtmd-cli
llama-mtmd-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  --mmproj Qwen3.5-9B/mmproj-BF16.gguf \
  --chat-template-file Qwen3.5-9B/chat_template.jinja \
  -cnv

LM Studio / Ollama

These GGUFs work with any llama.cpp-compatible runtime. Simply point your application at the .gguf file.

PRISM-DQ Quantization Recipes

Qwen3.5-0.8B (target 3.5 BPW)
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q3_K" \
  --tensor-type "attn_output=IQ4_XS" \
  --tensor-type "attn_q=Q3_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q3_K" \
  --tensor-type "ssm_beta=IQ4_XS" \
  --tensor-type "ssm_out=IQ4_XS" \
  --tensor-type "token_embd=Q3_K" \
  --tensor-type "blk\.(4)\.ssm_beta=Q4_K" \
  --tensor-type "blk\.(18)\.ssm_out=Q4_K" \
  input.gguf output.gguf Q3_K

Qwen3.5-2B (target 3.5 BPW)
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=Q4_K" \
  --tensor-type "attn_q=Q4_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K

Qwen3.5-4B (target 3.5 BPW)
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=Q5_K" \
  --tensor-type "attn_q=Q3_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K

Qwen3.5-9B (target 3.5 BPW)
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=IQ4_XS" \
  --tensor-type "attn_q=Q4_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "output=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K
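
A quick way to sanity-check whether a recipe like the ones above lands near its BPW target is to weight each quant type's nominal per-weight cost by the number of parameters it covers. The sketch below uses llama.cpp's nominal k-quant bit widths (approximate — real GGUF files also carry block scales and metadata) with hypothetical parameter counts; the per-class counts are illustrative assumptions, not measured values for any Qwen3.5 model:

```python
# Nominal bits-per-weight of the llama.cpp quant types used in the recipes
# (approximate; actual file sizes include scale/metadata overhead).
BITS = {"Q3_K": 3.4375, "Q4_K": 4.5, "Q5_K": 5.5, "IQ4_XS": 4.25}

def mix_bpw(alloc):
    """alloc: list of (param_count, quant_type) pairs, one per tensor class.
    Returns the parameter-weighted average bits-per-weight of the mix."""
    total_bits = sum(n * BITS[q] for n, q in alloc)
    total_params = sum(n for n, _ in alloc)
    return total_bits / total_params

# Hypothetical parameter counts, for illustration only:
alloc = [
    (1_200_000_000, "Q3_K"),   # e.g. ffn_up / ffn_gate / ffn_down
    (300_000_000, "Q4_K"),     # e.g. attn_k / attn_q / attn_v
    (100_000_000, "IQ4_XS"),   # e.g. attn_output
]
print(round(mix_bpw(alloc), 2))  # → 3.69
```

If the estimate overshoots the budget, demoting the largest class (typically the FFN tensors) by one quant type moves the average far more than changing any attention class.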

How PRISM-DQ Works

PRISM Dynamic Quantization analyzes each weight tensor using 7 structural metrics:

  1. PL-Alpha-Hill — spectral heavy-tail index via eigenvalue analysis
  2. Spectral Dominance — top singular value ratio (rank-1 approximation quality)
  3. OSQE — optimal scale quantization error at multiple bit levels (2, 3, 4, 6 bit)
  4. Matrix Imbalance — max of row/column coefficient of variation
  5. Fragility — log-ratio of 2-bit vs 4-bit quantization error
  6. Boundary Density — fraction of values near quantization bin boundaries
  7. Spectral Position Prior — bidirectional spectral norm product encoding layer position
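
Two of these metrics can be sketched concretely. The following is a simplified illustration that assumes plain symmetric uniform quantization as the error model; PRISM-DQ's actual estimators (including the OSQE scale search) are not published here, so the function names and thresholds below are illustrative, not the real implementation:

```python
import numpy as np

def quant_error(w, bits):
    """RMS error of symmetric uniform quantization at a max-abs scale.
    A simplification of OSQE, which searches for the optimal scale."""
    levels = 2 ** bits
    scale = np.abs(w).max() / (levels / 2 - 0.5)
    q = np.clip(np.round(w / scale), -(levels // 2), levels // 2 - 1)
    return np.sqrt(np.mean((w - q * scale) ** 2))

def fragility(w):
    """Metric 5: log-ratio of 2-bit vs 4-bit quantization error.
    High values mean the tensor degrades sharply as bits are removed."""
    return np.log(quant_error(w, 2) / quant_error(w, 4))

def boundary_density(w, bits=4, eps=0.1):
    """Metric 6: fraction of values within eps (in bin units) of the
    midpoint between two quantization bins, where rounding is unstable."""
    levels = 2 ** bits
    scale = np.abs(w).max() / (levels / 2 - 0.5)
    frac = np.abs(w / scale - np.round(w / scale))  # distance to nearest bin
    return float(np.mean(frac > 0.5 - eps))
```

Tensor classes scoring high on both metrics are the natural candidates for the Q4_K/Q5_K promotions seen in the recipes above.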

These metrics are combined into a composite sensitivity score per tensor class. A Lagrangian allocator then distributes bits across classes to minimize total quantization distortion subject to the BPW budget, with per-block refinement for individual tensor overrides.
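
The allocation step can be illustrated with a toy greedy Lagrangian sweep. This is a sketch under a classic exponential rate-distortion model (distortion proportional to sensitivity times 2^(-2·bits)), not the actual PRISM-DQ allocator; the quant-type bit costs and the `allocate_bits` interface are assumptions made for illustration:

```python
# Toy Lagrangian bit allocator: start every class at the cheapest type,
# then repeatedly apply the upgrade with the best distortion-drop per
# extra bit until the BPW budget is exhausted.
def allocate_bits(classes, budget_bpw, choices=(3.4375, 4.25, 4.5, 5.5)):
    """classes: dict name -> (param_count, sensitivity).
    Returns (per-class bit choice, achieved BPW)."""
    alloc = {name: 0 for name in classes}  # index into `choices`
    total = sum(n for n, _ in classes.values())
    used = sum(n * choices[0] for n, _ in classes.values())

    def gain(name):
        n, s = classes[name]
        i = alloc[name]
        if i + 1 >= len(choices):
            return None
        # Assumed distortion model: s * 2**(-2 * bits)
        drop = s * (2 ** (-2 * choices[i]) - 2 ** (-2 * choices[i + 1]))
        extra_bits = n * (choices[i + 1] - choices[i])
        return drop * n / extra_bits, extra_bits

    while True:
        best = None
        for name in classes:
            g = gain(name)
            if g and (used + g[1]) / total <= budget_bpw:
                if best is None or g[0] > best[1][0]:
                    best = (name, g)
        if best is None:
            break
        name, (_, extra) = best
        alloc[name] += 1
        used += extra
    return {name: choices[i] for name, i in alloc.items()}, used / total
```

Under this model a small, highly sensitive class (such as attn_v) is upgraded long before a large, insensitive one, which mirrors the Q4_K attn_v / Q3_K ffn pattern in the recipes above; the real allocator additionally refines individual blocks, producing overrides like `blk\.(4)\.ssm_beta=Q4_K`.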

License

This model is released under the Apache 2.0 license, consistent with the base Qwen3.5 models.

Acknowledgments

  • Qwen Team for the Qwen3.5 model family
  • llama.cpp for the quantization infrastructure
  • Multimodal projection weights sourced from unsloth GGUF conversions