# Qwen3.5 PRISM Dynamic Quantization (GGUF)
PRISM Dynamic Quantization (PRISM-DQ) applies per-tensor-class bit allocation based on structural weight analysis — no calibration data or importance matrices required. Each tensor class (attention keys, FFN gates, SSM components, etc.) is assigned a quantization type according to its measured sensitivity, while staying within a target bits-per-weight (BPW) budget.
This repo contains PRISM-DQ quantized GGUFs for the full Qwen3.5 vision-language model family (0.8B, 2B, 4B, 9B), plus multimodal projection weights (mmproj) for vision capabilities.
## Benchmark Results

### Perplexity Comparison (UltraChat, 5 chunks, 512 ctx)
| Model | Method | BPW | PPL | Size |
|---|---|---|---|---|
| Qwen3.5-0.8B | Q3_K_M | 4.96 | 12.14 | 470 MB |
| | PRISM-DQ | 4.94 | 11.42 | 468 MB |
| | Q3_K_M (imatrix) | 4.96 | 11.31 | 470 MB |
| | UD-Q3_K_XL | 5.19 | 10.94 | 492 MB |
| | IQ4_XS (imatrix) | 5.20 | 10.35 | 493 MB |
| | UD-Q4_K_XL | 5.89 | 10.07 | 559 MB |
| Qwen3.5-2B | Q3_K_M | 4.69 | 9.35 | 1107 MB |
| | PRISM-DQ | 4.68 | 9.26 | 1104 MB |
| | Q3_K_M (imatrix) | 4.69 | 8.40 | 1107 MB |
| | UD-Q3_K_XL | 4.91 | 8.27 | 1159 MB |
| | IQ4_XS (imatrix) | 4.97 | 8.12 | 1173 MB |
| | UD-Q4_K_XL | 5.68 | 8.07 | 1340 MB |
| Qwen3.5-4B | Q3_K_M | 4.36 | 6.88 | 2293 MB |
| | PRISM-DQ | 4.31 | 6.82 | 2271 MB |
| | Q3_K_M (imatrix) | 4.36 | 6.62 | 2293 MB |
| | UD-Q3_K_XL | 4.63 | 6.66 | 2436 MB |
| | IQ4_XS (imatrix) | 4.70 | 6.51 | 2477 MB |
| | UD-Q4_K_XL | 5.53 | 6.56 | 2912 MB |
| Qwen3.5-9B | Q3_K_M | 4.17 | 6.25 | 4674 MB |
| | PRISM-DQ | 4.15 | 6.18 | 4652 MB |
| | Q3_K_M (imatrix) | 4.17 | 5.96 | 4674 MB |
| | UD-Q3_K_XL | 4.51 | 6.01 | 5054 MB |
| | IQ4_XS (imatrix) | 4.61 | 6.03 | 5169 MB |
| | UD-Q4_K_XL | 5.33 | 5.86 | 5966 MB |
## Key Findings

- PRISM-DQ beats uniform Q3_K_M on all four models (1-6% PPL improvement) at the same or lower BPW
- Smallest file size at competitive perplexity across the Qwen3.5 family
- No calibration data needed — allocation decisions are based purely on weight analysis
- Combined with an importance matrix, PRISM-DQ+imatrix achieves Pareto-optimal results on the 4B and 9B models
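As a rough cross-check, BPW can be estimated from file size and parameter count. The parameter counts below are the nominal family sizes (an assumption on my part), and GGUF metadata plus embedding tables add overhead, so small deviations from the table are expected:

```python
def estimate_bpw(size_mb: float, n_params: float) -> float:
    """Approximate bits-per-weight from a decimal-MB file size and a parameter count."""
    return size_mb * 1e6 * 8 / n_params

# PRISM-DQ 9B row from the table above; ~9.0e9 parameters is assumed.
bpw_9b = estimate_bpw(4652, 9.0e9)   # ~4.14, close to the reported 4.15
```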
## Model Files

Each subfolder contains the quantized model GGUF plus multimodal projection weights:
```
Qwen3.5-0.8B/
  Qwen3.5-0.8B-PRISM-DQ.gguf  (446 MB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja
Qwen3.5-2B/
  Qwen3.5-2B-PRISM-DQ.gguf  (1.0 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja
Qwen3.5-4B/
  Qwen3.5-4B-PRISM-DQ.gguf  (2.1 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja
Qwen3.5-9B/
  Qwen3.5-9B-PRISM-DQ.gguf  (4.3 GB)
  mmproj-BF16.gguf
  mmproj-F16.gguf
  mmproj-F32.gguf
  chat_template.jinja
```
## Usage

### Text-only (llama.cpp)

```bash
# Download the model
huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
  Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf --local-dir .

# Run an interactive chat with llama-cli
llama-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  -p "You are a helpful assistant." \
  --chat-template-file Qwen3.5-9B/chat_template.jinja \
  -cnv
```
### Vision (multimodal)

```bash
# Download model + mmproj
huggingface-cli download Ex0bit/Qwen3.5-PRISM-Dynamic-Quant-GGUF \
  Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  Qwen3.5-9B/mmproj-BF16.gguf --local-dir .

# Run with llama-mtmd-cli
llama-mtmd-cli -m Qwen3.5-9B/Qwen3.5-9B-PRISM-DQ.gguf \
  --mmproj Qwen3.5-9B/mmproj-BF16.gguf \
  --chat-template-file Qwen3.5-9B/chat_template.jinja \
  -cnv
```
### LM Studio / Ollama

These GGUFs work with any llama.cpp-compatible runtime: simply point your application at the `.gguf` file (for vision, also supply the matching mmproj file).
## PRISM-DQ Quantization Recipes

### Qwen3.5-0.8B (target 3.5 BPW)

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q3_K" \
  --tensor-type "attn_output=IQ4_XS" \
  --tensor-type "attn_q=Q3_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q3_K" \
  --tensor-type "ssm_beta=IQ4_XS" \
  --tensor-type "ssm_out=IQ4_XS" \
  --tensor-type "token_embd=Q3_K" \
  --tensor-type "blk\.(4)\.ssm_beta=Q4_K" \
  --tensor-type "blk\.(18)\.ssm_out=Q4_K" \
  input.gguf output.gguf Q3_K
```
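Maintaining these long flag lists by hand is tedious and error-prone. A small helper (purely a convenience sketch of mine, not part of llama.cpp) can render an allocation table into the `--tensor-type` argument list:

```python
def tensor_type_flags(allocation: dict[str, str]) -> list[str]:
    """Render {tensor_pattern: quant_type} as llama-quantize --tensor-type args."""
    args: list[str] = []
    for pattern, qtype in sorted(allocation.items()):
        args += ["--tensor-type", f"{pattern}={qtype}"]
    return args

# Excerpt of the 0.8B recipe above.
flags = tensor_type_flags({"attn_v": "Q4_K", "attn_output": "IQ4_XS", "ffn_down": "Q3_K"})
# flags == ['--tensor-type', 'attn_output=IQ4_XS',
#           '--tensor-type', 'attn_v=Q4_K',
#           '--tensor-type', 'ffn_down=Q3_K']
```

The resulting list can be passed straight to `subprocess.run(["llama-quantize", *flags, ...])`.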
### Qwen3.5-2B (target 3.5 BPW)

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=Q4_K" \
  --tensor-type "attn_q=Q4_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K
```
### Qwen3.5-4B (target 3.5 BPW)

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=Q5_K" \
  --tensor-type "attn_q=Q3_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K
```
### Qwen3.5-9B (target 3.5 BPW)

```bash
llama-quantize \
  --tensor-type "attn_gate=Q3_K" \
  --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_output=IQ4_XS" \
  --tensor-type "attn_q=Q4_K" \
  --tensor-type "attn_qkv=Q3_K" \
  --tensor-type "attn_v=Q4_K" \
  --tensor-type "ffn_down=Q3_K" \
  --tensor-type "ffn_gate=Q3_K" \
  --tensor-type "ffn_up=Q3_K" \
  --tensor-type "output=Q3_K" \
  --tensor-type "ssm_alpha=Q4_K" \
  --tensor-type "ssm_beta=Q4_K" \
  --tensor-type "ssm_out=Q3_K" \
  --tensor-type "token_embd=Q3_K" \
  input.gguf output.gguf Q3_K
```
## How PRISM-DQ Works

PRISM Dynamic Quantization analyzes each weight tensor using seven structural metrics:
- PL-Alpha-Hill — spectral heavy-tail index via eigenvalue analysis
- Spectral Dominance — top singular value ratio (rank-1 approximation quality)
- OSQE — optimal scale quantization error at multiple bit levels (2, 3, 4, 6 bit)
- Matrix Imbalance — max of row/column coefficient of variation
- Fragility — log-ratio of 2-bit vs 4-bit quantization error
- Boundary Density — fraction of values near quantization bin boundaries
- Spectral Position Prior — bidirectional spectral norm product encoding layer position
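Two of these metrics are cheap enough to sketch directly. The snippet below is a simplified illustration of mine (single absmax scale, dense NumPy), not PRISM's actual implementation:

```python
import numpy as np

def spectral_dominance(w: np.ndarray) -> float:
    """Ratio of the top singular value to the sum of all singular values:
    how much of the tensor a rank-1 approximation captures."""
    s = np.linalg.svd(w, compute_uv=False)
    return float(s[0] / s.sum())

def quant_error(w: np.ndarray, bits: int) -> float:
    """RMS error of symmetric round-to-nearest quantization at `bits` bits,
    using a single absmax scale (a stand-in for the optimal scale)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    q = np.round(w / scale) * scale
    return float(np.sqrt(np.mean((w - q) ** 2)))

def fragility(w: np.ndarray) -> float:
    """Log-ratio of 2-bit vs 4-bit quantization error: tensors whose error
    grows steeply as bits shrink are 'fragile' and deserve more bits."""
    return float(np.log(quant_error(w, 2) / quant_error(w, 4)))
```

Fragility is positive whenever the 2-bit error exceeds the 4-bit error (essentially always); the larger the value, the more a tensor class benefits from extra bits.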
These metrics are combined into a composite sensitivity score per tensor class. A Lagrangian allocator then distributes bits across classes to minimize total quantization distortion subject to the BPW budget, with per-block refinement for individual tensor overrides.
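A greedy variant of that allocator can be sketched as follows. The sensitivities, weight fractions, and error model are illustrative assumptions of mine, not measured PRISM values, and the real allocator solves a Lagrangian over the measured OSQE curves rather than greedily upgrading:

```python
BITS_MIN, BITS_MAX = 3, 6   # candidate widths, roughly Q3_K .. Q6_K

def distortion(sensitivity: float, bits: int) -> float:
    # Crude error model: quantization distortion roughly halves per extra bit.
    return sensitivity * 2.0 ** (-bits)

def allocate(classes: dict[str, tuple[float, float]], budget_bpw: float) -> dict[str, int]:
    """classes maps name -> (sensitivity, fraction of total weights).
    Greedily spend the average-BPW budget where the marginal distortion
    reduction is largest; returns name -> assigned bit width."""
    alloc = {name: BITS_MIN for name in classes}

    def avg_bpw(a: dict[str, int]) -> float:
        return sum(a[n] * classes[n][1] for n in classes)

    while True:
        # Feasible one-bit upgrades, scored by distortion reduction.
        candidates = [
            (distortion(s, alloc[n]) - distortion(s, alloc[n] + 1), n)
            for n, (s, _) in classes.items()
            if alloc[n] < BITS_MAX
            and avg_bpw({**alloc, n: alloc[n] + 1}) <= budget_bpw + 1e-9
        ]
        if not candidates:
            return alloc
        alloc[max(candidates)[1]] += 1

# Illustrative sensitivities and weight fractions — not measured PRISM values.
classes = {
    "attn_v":   (8.0, 0.10),
    "attn_k":   (4.0, 0.10),
    "ffn_down": (3.0, 0.30),
    "ffn_up":   (1.5, 0.30),
    "ffn_gate": (1.0, 0.20),
}
alloc = allocate(classes, budget_bpw=3.5)
```

In this toy run the high-sensitivity `attn_v` class is upgraded first while the bulky FFN tensors stay at the 3-bit floor — the same qualitative pattern as the recipes above.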
## License

This model is released under the Apache 2.0 license, consistent with the base Qwen3.5 models.
