Carnice-V2-27b-NVFP4-TEXT-MTP

Text-only NVFP4-quantized variant of kai-os/Carnice-V2-27b — a Hermes-agent SFT of Qwen/Qwen3.6-27B — with a working bf16 MTP head grafted from the base model and the vision tower stripped.

Built for NVIDIA RTX PRO 6000 Blackwell, DGX Spark (GB10), and 24–32 GB Blackwell cards, for users who want a fast Hermes-style agent with <think> reasoning and tool calls at ~20 GB on disk.

Highlights

  • MTP speculative decoding works out of the box — 93 % per-position acceptance verified on this build
  • 🛠 Hermes / OpenAI-XML tool calling — parallel calls, multi-turn round-trips, <think> reasoning all preserved
  • 🎯 ~19.6 GB on disk — fits 24 GB Blackwell with full KV-cache + cuda-graph headroom
  • 📈 100+ tok/s single-request decode on a single RTX PRO 6000 Blackwell with num_speculative_tokens=3 (134 short / 102 medium / 103 long-form). Drops to ~93 tok/s at num_speculative_tokens=1.
  • 🟢 vLLM-ready with --quantization modelopt + qwen3_5_mtp spec config

⚠️ Important: built on a prefix-fixed copy of the base

The shipped kai-os/Carnice-V2-27b BF16 safetensors carry a triple language_model. prefix on every key, which makes HF transformers load the model with all-random weights and produce gibberish (the GGUF variant is unaffected because llama.cpp normalizes prefixes during conversion).

This NVFP4 variant was quantized from a prefix-corrected copy of the base. Detail and reproducible fix script:

kai-os/Carnice-V2-27b · Discussion #1

If you load the original BF16 directly, you will see the same gibberish problem. This repo is unaffected — the export here uses the corrected weights.
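
The kind of key rewrite involved can be sketched as follows. This is an illustration only, not the script from the linked discussion: the example key name and the `collapse_prefix` / `fix_state_dict_keys` helpers are hypothetical, and it assumes the fix amounts to collapsing the repeated `language_model.` segment down to a single occurrence.

```python
import re

# Matches two or more consecutive "language_model." segments.
_REPEATED = re.compile(r"(?:language_model\.){2,}")


def collapse_prefix(key: str) -> str:
    """Collapse a repeated 'language_model.' prefix to a single occurrence."""
    return _REPEATED.sub("language_model.", key)


def fix_state_dict_keys(state_dict: dict) -> dict:
    """Return a new dict with normalized keys; values are left untouched."""
    fixed = {}
    for key, value in state_dict.items():
        new_key = collapse_prefix(key)
        if new_key in fixed:
            # Two different source keys must never map onto the same target.
            raise ValueError(f"key collision after normalization: {new_key}")
        fixed[new_key] = value
    return fixed


# Hypothetical key name, used only to show the transformation.
demo = {"model.language_model.language_model.language_model.layers.0.mlp.up_proj.weight": 0}
print(list(fix_state_dict_keys(demo)))
```

Keys that already carry a single prefix pass through unchanged, so the rewrite is safe to run on an already-corrected checkpoint.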

What's in here

| | This repo (NVFP4-TEXT-MTP) | Parent (kai-os/Carnice-V2-27b) |
|---|---|---|
| Format | NVFP4 (modelopt) + bf16 MTP head | bf16 safetensors |
| File size | ~19.6 GB | ~54.7 GB |
| Vision tower | ❌ stripped (333 tensors / ~0.92 GB removed) | ✅ present (but Hermes use is text-only anyway) |
| MTP head | ✅ bf16, working (grafted from Qwen/Qwen3.6-27B) | ❌ dropped during merge |
| Architecture | Qwen3_5ForConditionalGeneration (text-only mode) | Qwen3_5ForConditionalGeneration |
| Agent format | Hermes-style XML tool calls + <think> reasoning | same (this repo preserves it) |

Quantization details

  • Base: kai-os/Carnice-V2-27b (bf16, 27 B params, hybrid linear-attn + full-attn, 64 layers) — prefix-corrected before quantization
  • Quantizer: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
  • Calibration: 20 samples from neuralmagic/calibration (LLM split), max_seq_len 8192
  • MTP graft source: Qwen/Qwen3.6-27B (15 mtp.* tensors, ~850 MB bf16; the original Carnice merge dropped them)
  • Ignored from quantization (kept in bf16):
    • lm_head
    • All *linear_attn.conv1d* (Mamba-style SSM convolutions, 48 of 64 layers)
    • All mtp.* modules (15 tensors)
    • Other NVFP4_DEFAULT_CFG defaults (router, mlp.gate, output_layer …)
  • Vision-related ignore entries are removed since the vision tensors no longer exist.
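
As a rough illustration of how the bf16 exclusions above select modules, the sketch below matches glob-style patterns against module names. The module names are hypothetical examples, `kept_in_bf16` is an illustrative helper rather than a modelopt API, and the exact pattern syntax used in the real quantization config may differ.

```python
from fnmatch import fnmatchcase

# Patterns for modules kept in bf16, taken from the list above.
# Written as shell-style globs for illustration only.
BF16_PATTERNS = ["lm_head", "*linear_attn.conv1d*", "mtp.*"]


def kept_in_bf16(module_name: str) -> bool:
    """True if the module name matches any of the bf16 exclusion patterns."""
    return any(fnmatchcase(module_name, pattern) for pattern in BF16_PATTERNS)


# Hypothetical module names, not read from the actual checkpoint.
for name in [
    "lm_head",
    "model.language_model.layers.0.linear_attn.conv1d",
    "model.language_model.layers.3.self_attn.q_proj",
    "mtp.fc",
]:
    print(f"{name}: {'bf16' if kept_in_bf16(name) else 'NVFP4'}")
```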

Why MTP from the base works on an SFT'd body

Because Carnice's SFT used an assistant-token-only loss with LoRA and was then merged to bf16, the underlying hidden-state distribution stays close to the Qwen3.6 base. The Qwen3.6 MTP head therefore transfers cleanly, giving a per-position acceptance rate of 0.93 during smoke testing, comparable to a model that shipped with its own MTP head.

Mean acceptance length is 1.93 / 2.0 at num_speculative_tokens=1 and approximately 3.0 / 4.0 at num_speculative_tokens=3 (acceptance at positions 1/2/3 is ~87 / 72 / 61 %, per Pulsate1680 on RTX PRO 4500 Blackwell with the parent recipe, which matches our long-form decode profile here).
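
The relationship between per-position acceptance and mean acceptance length can be checked with a few lines of arithmetic. The sketch assumes each quoted rate is the unconditional fraction of draft passes in which that position was accepted, so the expected number of tokens per verification step is one (the verified token) plus the sum of the rates. Under that assumption the n=1 figure reproduces 1.93, while the rounded third-party n=3 figures give a value slightly above the quoted ≈ 3.0.

```python
def mean_acceptance_length(per_position_rates):
    """Expected tokens emitted per verification step: 1 + sum of acceptance rates.

    Assumes each rate is the unconditional fraction of draft passes in which
    that draft position was accepted.
    """
    return 1.0 + sum(per_position_rates)


n1 = mean_acceptance_length([0.93])
n3 = mean_acceptance_length([0.87, 0.72, 0.61])
print(f"n=1: {n1:.2f} of 2.0 possible")
print(f"n=3: {n3:.2f} of 4.0 possible")
```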

Usage with vLLM (Blackwell, SM120)

This model requires --language-model-only (it is a slimmed-down copy of the VLM architecture, so vLLM otherwise tries to instantiate the missing vision tower) and the qwen3_xml tool-call parser (Carnice emits OpenAI-style function XML, not canonical Hermes JSON-in-tags).

Recommended production launch — 256K context · KV FP8 · MTP n=3 · max-num-seqs 2

vllm serve sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP \
    --trust-remote-code \
    --quantization modelopt \
    --language-model-only \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --max-model-len 262144 \
    --max-num-seqs 2 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

This is the production launch on a single RTX PRO 6000 Blackwell — full 256K context, two concurrent agent slots, KV FP8 for concurrency headroom. The four flags that are easy to skip but matter:

  • --max-model-len 262144 — full 256K context (Qwen3.6 trained max).
  • --kv-cache-dtype fp8 — halves KV memory; lifts max concurrency at 256K from ~4× (BF16, won't fit) to 7.0× with the same VRAM. ~5–10 % per-token decode overhead, more than paid back by capacity.
  • --max-num-seqs 2 — load-bearing. --max-num-seqs 4 plus --kv-cache-dtype fp8 plus --speculative-config n=3 plus --max-model-len 262144 will silently OOM during cuda-graph capture on this build of vLLM (0.19.1rc1).
  • num_speculative_tokens: 3 — vLLM applies the single MTP layer (mtp_num_hidden_layers=1) recursively three times per draft pass. Per-position acceptance ~87 / 72 / 61 %, mean accepted-length ≈ 3.0 / 4.0. The qwen3_5_mtp handler is internally normalized to mtp (deprecated-name warning is harmless).
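
When the server is launched from a script rather than typed by hand, the flags above can be assembled programmatically so that the JSON passed to --speculative-config never goes through manual shell quoting. This is a minimal sketch: `build_serve_command` is a hypothetical helper, and it only reuses flags that already appear in the launch command above.

```python
import json
import shlex


def build_serve_command(model, max_model_len, max_num_seqs, num_spec_tokens, kv_fp8=True):
    """Build the argv list for `vllm serve` with the flags discussed above."""
    spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": num_spec_tokens}
    argv = [
        "vllm", "serve", model,
        "--trust-remote-code",
        "--quantization", "modelopt",
        "--language-model-only",
        "--enable-auto-tool-choice", "--tool-call-parser", "qwen3_xml",
        "--reasoning-parser", "qwen3",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", "0.9",
        # Compact JSON, serialized by the json module instead of by hand.
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
    ]
    if kv_fp8:
        argv += ["--kv-cache-dtype", "fp8"]
    return argv


cmd = build_serve_command("sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP", 262144, 2, 3)
print(shlex.join(cmd))  # shell-safe rendering, e.g. for logging
```

The list can be passed directly to `subprocess.run(cmd)`; keeping max_num_seqs as an explicit argument makes the memory-sensitive setting hard to overlook.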

Smaller-context launch (16K, no fp8) — fastest single-request decode

vllm serve sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP \
    --trust-remote-code \
    --quantization modelopt \
    --language-model-only \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
    --gpu-memory-utilization 0.85 \
    --max-model-len 16384

Send a tool-using chat request

curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP",
  "messages": [
    {"role": "system", "content": "You are a helpful agent that uses tools when needed."},
    {"role": "user", "content": "Get the weather in Tokyo, Paris, and New York simultaneously."}
  ],
  "tools": [
    {"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
  ],
  "tool_choice": "auto",
  "parallel_tool_calls": true,
  "max_tokens": 400
}'
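
The same request can be sent from Python with only the standard library, which also makes it easy to decode the returned tool calls. This is a sketch under the assumption that the server returns the usual OpenAI-compatible chat-completions shape (`choices[0].message.tool_calls` with JSON-encoded `arguments`); `build_payload`, `post_chat`, and `extract_tool_calls` are illustrative helper names.

```python
import json
import urllib.request


def build_payload(model, user_text, tools):
    """Mirror the JSON body of the curl request above."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful agent that uses tools when needed."},
            {"role": "user", "content": user_text},
        ],
        "tools": tools,
        "tool_choice": "auto",
        "parallel_tool_calls": True,
        "max_tokens": 400,
    }


def post_chat(base_url, payload):
    """POST the payload to {base_url}/chat/completions and return the parsed JSON."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def extract_tool_calls(response):
    """Return (call_id, function_name, decoded_arguments) for each tool call."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((call["id"], fn["name"], json.loads(fn["arguments"])))
    return calls
```

With the server running, `extract_tool_calls(post_chat("http://localhost:8000/v1", payload))` would list the calls the model requested.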

Hermes Agent (NousResearch)

Works as a drop-in custom / vllm provider for NousResearch/hermes-agent:

# ~/.config/hermes/cli-config.yaml
model:
  default: "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP"
  provider: "vllm"
  base_url: "http://localhost:8000/v1"

Verified locally (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)

Five-task agent capability test against the local OpenAI-compatible endpoint:

| Test | Result |
|---|---|
| 1. Plain chat with reasoning | <think> chain → final answer split correctly |
| 2. Single tool call (Tokyo weather) | ✅ valid JSON args, finish_reason=tool_calls |
| 3. Multi-turn continuation (Tokyo result → Paris call) | ✅ autonomous next-step |
| 4. Final synthesis after both results | ✅ markdown table, finish_reason=stop |
| 5. Parallel tool calls (3 cities) | ✅ 3 simultaneous calls in one turn |
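
The multi-turn round-trips in tests 3 and 4 depend on feeding tool results back in the OpenAI-compatible message format. The sketch below shows one way to build the follow-up message list; `append_tool_results` is a hypothetical helper, and the message and result contents in the usage lines are made-up placeholders rather than output from the tests above.

```python
import json


def append_tool_results(messages, assistant_message, results_by_call_id):
    """Return a new message list with the assistant turn and one 'tool' message per call."""
    updated = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        call_id = call["id"]
        if call_id not in results_by_call_id:
            raise KeyError(f"no result supplied for tool call {call_id}")
        updated.append({
            "role": "tool",
            "tool_call_id": call_id,  # ties the result back to the request
            "content": json.dumps(results_by_call_id[call_id]),
        })
    return updated


# Placeholder conversation state, for illustration only.
history = [{"role": "user", "content": "Get the weather in Tokyo."}]
assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{"id": "call_1", "type": "function",
                    "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}}],
}
follow_up = append_tool_results(history, assistant, {"call_1": {"city": "Tokyo", "temp_c": 20}})
print(len(follow_up))
```

Sending `follow_up` as the `messages` field of the next request lets the model either issue another call or write the final answer.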

Production-config bench vs the family baseline (256K + KV FP8 + max-num-seqs 2)

Same launch flags, T = 0, 1 × RTX PRO 6000 Blackwell, vLLM 0.19.1rc1:

| Repo | Format | MTP | Single (S/M/L) | 2-parallel agg (M/L) | vs baseline |
|---|---|---|---|---|---|
| Qwen3.6-27B-NVFP4 (the family baseline) | compressed-tensors | | 56 / 59 / 59 | 119 / 119 | 1.0× |
| Qwen3.6-27B-Text-NVFP4-MTP | modelopt | ✅ n=3 | 104 / 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP (this repo) | modelopt | ✅ n=3 | 107 / 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 117 / 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 118 / 97 / 100 | 183 / 198 | 1.66× / 1.66× |

(S = 50-token, M = 350-token, L = 700-token decodes.)

~1.7× the baseline's single-request decode and 1.6× its 2-parallel aggregate on the same hardware and same launch flags. The Hermes-SFT body lands within noise of the other modelopt + MTP siblings on raw decode speed, so the choice between this repo and the others is about behaviour (Hermes-style XML tool calls + <think> reasoning) rather than throughput.
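
The ratios behind that summary can be recomputed directly from the tok/s figures in the table. The sketch below does this for this repo against the family baseline; it makes no assumption about how the table's own "vs baseline" column was derived, and simply divides matching columns.

```python
# tok/s figures copied from the bench table above.
baseline = {"single": (56, 59, 59), "parallel2": (119, 119)}
this_repo = {"single": (107, 98, 102), "parallel2": (193, 194)}


def speedups(candidate, reference):
    """Element-wise throughput ratio, rounded to two decimals."""
    return [round(c / r, 2) for c, r in zip(candidate, reference)]


single = speedups(this_repo["single"], baseline["single"])          # S / M / L
parallel = speedups(this_repo["parallel2"], baseline["parallel2"])  # M / L
print("single-request speedup (S/M/L):", single)
print("2-parallel speedup (M/L):", parallel)
```

The short-prompt case benefits most from speculation, while the medium and long decodes sit close to the rounded headline figures.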

KV cache size at 256K + fp8: 491,200 tokens → max concurrency 6.98× per request at full 256K. The base-model MTP head grafted from Qwen/Qwen3.6-27B transfers cleanly onto the Hermes-SFT body — at n=1 we see 0.934 per-position acceptance / 1.93 mean acceptance length, comparable to a model that shipped with its own MTP head.

Smaller-context single-request bench (16K, no fp8) — fastest interactive use

| Prompt | Tokens | n=1 tok/s | n=3 tok/s |
|---|---|---|---|
| Short (50 tok) | 50 | 94.0 | 133.9 |
| Medium (350 tok) | 350 | 92.3 | 101.7 |
| Long-form (700 tok) | 700 | 93.5 | 103.4 |

GPU memory at load: ~20 GB.

Inherited SFT performance (from the base)

These are the upstream SFT numbers from kai-os/Carnice-V2-27b. Quantization quality should track these — held-out validation under NVFP4 was sane on the smoke tests but a full re-run is left to the user.

| Metric | Qwen3.6-27B base | Carnice SFT (parent) |
|---|---|---|
| IFEval prompt strict (limit 20) | 85.0 % | 90.0 % |
| IFEval prompt loose (limit 20) | 85.0 % | 90.0 % |
| IFEval instruction strict (limit 20) | 90.0 % | 93.3 % |
| IFEval instruction loose (limit 20) | 90.0 % | 93.3 % |
| Held-out assistant eval loss | 0.607 | 0.414 |
| Held-out perplexity | 1.835 | 1.513 |
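
The last two rows are two views of the same measurement: assuming the eval loss is a mean per-token negative log-likelihood in nats, perplexity is its exponential. A short check of that relationship against the reported values:

```python
import math


def perplexity(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)


# Loss values copied from the table above.
for name, loss in [("Qwen3.6-27B base", 0.607), ("Carnice SFT (parent)", 0.414)]:
    print(f"{name}: loss {loss:.3f} -> perplexity {perplexity(loss):.3f}")
```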

Hardware target

Built and tested on NVIDIA RTX PRO 6000 Blackwell (SM120). Should also work on DGX Spark (GB10), RTX 5090, and other Blackwell consumer/workstation cards with sufficient VRAM (~15 GB NVFP4 weights + ~4 GB bf16 MTP/SSM/lm_head ≈ 19.6 GB on disk).

Acknowledgements

Support the upstream authors

If you find this useful, please go support the people whose work this rests on:

License

Apache 2.0, inherited from the parent.

— Tonoken3 / Lna-Lab
