Carnice-V2-27b-NVFP4-TEXT-MTP
Text-only NVFP4-quantized variant of kai-os/Carnice-V2-27b — a Hermes-agent SFT of Qwen/Qwen3.6-27B — with a working bf16 MTP head grafted from the base model and the vision tower stripped.
Built for NVIDIA RTX PRO 6000 Blackwell, DGX Spark (GB10), and other 24–32 GB Blackwell cards that want a fast Hermes-style agent with <think> reasoning + tool calls at ~20 GB on disk.
Highlights
- ⚡ MTP speculative decoding works out of the box — 93 % per-position acceptance verified on this build
- 🛠 Hermes / OpenAI-XML tool calling: parallel calls, multi-turn round-trips, <think> reasoning all preserved
- 🎯 ~19.6 GB on disk: fits 24 GB Blackwell with full KV-cache + cuda-graph headroom
- 📈 100+ tok/s single-request decode on a single RTX PRO 6000 Blackwell with num_speculative_tokens=3 (134 short / 102 medium / 103 long-form). Drops to ~93 tok/s at num_speculative_tokens=1.
- 🟢 vLLM-ready with --quantization modelopt + qwen3_5_mtp spec config
⚠️ Important: built on a prefix-fixed copy of the base
The shipped kai-os/Carnice-V2-27b BF16 safetensors carry a triple language_model. prefix on every key, which makes HF transformers load the model with all-random weights and produce gibberish (the GGUF variant is unaffected because llama.cpp normalizes prefixes during conversion).
This NVFP4 variant was quantized from a prefix-corrected copy of the base. Detail and reproducible fix script:
→ kai-os/Carnice-V2-27b · Discussion #1
If you load the original BF16 directly, you will see the same gibberish problem. This repo is unaffected — the export here uses the corrected weights.
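To make the prefix problem concrete, here is a minimal sketch of the kind of key renaming involved. It is not the fix script from the linked discussion: the example key is hypothetical, and the sketch assumes the intended layout carries a single language_model. segment.

```python
import re


def collapse_repeated_prefix(key: str, segment: str = "language_model.") -> str:
    """Collapse consecutive repeats of `segment` in a tensor key into one."""
    pattern = "(?:" + re.escape(segment) + "){2,}"
    return re.sub(pattern, segment, key)


# Hypothetical key name, for illustration only.
bad_key = (
    "model.language_model.language_model.language_model."
    "layers.0.mlp.up_proj.weight"
)
fixed_key = collapse_repeated_prefix(bad_key)
print(fixed_key)
```

Keys that do not contain the repeated segment pass through unchanged, so the same function could be applied to every key of a checkpoint before re-saving it.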
What's in here
| | This repo (NVFP4-TEXT-MTP) | Parent (kai-os/Carnice-V2-27b) |
|---|---|---|
| Format | NVFP4 (modelopt) + bf16 MTP head | bf16 safetensors |
| File size | ~19.6 GB | ~54.7 GB |
| Vision tower | ❌ stripped (333 tensors / ~0.92 GB removed) | ✅ present (but Hermes use is text-only anyway) |
| MTP head | ✅ bf16, working (grafted from Qwen/Qwen3.6-27B) | ❌ dropped during merge |
| Architecture | Qwen3_5ForConditionalGeneration (text-only mode) | Qwen3_5ForConditionalGeneration |
| Agent format | Hermes-style XML tool calls + <think> reasoning | same (this repo preserves it) |
Quantization details
- Base: kai-os/Carnice-V2-27b (bf16, 27 B params, hybrid linear-attn + full-attn, 64 layers), prefix-corrected before quantization
- Quantizer: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
- Calibration: 20 samples from neuralmagic/calibration (LLM split), max_seq_len 8192
- MTP graft source: Qwen/Qwen3.6-27B (15 mtp.* tensors, ~850 MB bf16; the original Carnice merge dropped them)
- Ignored from quantization (kept in bf16):
  - lm_head
  - All *linear_attn.conv1d* (Mamba-style SSM convolutions, 48 of 64 layers)
  - All mtp.* modules (15 tensors)
  - Other NVFP4_DEFAULT_CFG defaults (router, mlp.gate, output_layer …)
- Vision-related ignore entries are removed since the vision tensors no longer exist.
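The ignore list above is a set of wildcard patterns over module names. As a rough illustration of which modules such patterns select, the sketch below matches them with Python's fnmatch. The module names are hypothetical, and modelopt's own matching rules may differ from fnmatch semantics.

```python
from fnmatch import fnmatchcase

# Patterns copied from the ignore list above.
IGNORE_PATTERNS = ["lm_head", "*linear_attn.conv1d*", "mtp.*"]


def kept_in_bf16(module_name: str) -> bool:
    """Return True if any ignore pattern matches, i.e. the module is not quantized."""
    return any(fnmatchcase(module_name, pattern) for pattern in IGNORE_PATTERNS)


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.layers.0.linear_attn.conv1d",
    "mtp.fc",
    "model.layers.3.self_attn.q_proj",
]
for name in examples:
    print(f"{name}: {'bf16' if kept_in_bf16(name) else 'NVFP4'}")
```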
Why MTP from the base works on an SFT'd body
Because Carnice's SFT used assistant-token-only loss + LoRA then merged to bf16, the underlying hidden-state distribution is close to the Qwen3.6 base. The Qwen3.6 MTP head transfers cleanly, giving a per-position acceptance rate of 0.93 during smoke testing — comparable to a model that shipped with its own MTP head.
Mean acceptance length is 1.93 / 2.0 at num_speculative_tokens=1 and approximately 3.0 / 4.0 at num_speculative_tokens=3. Per-position acceptance at positions 1/2/3 is ~87 / 72 / 61 %, as reported by Pulsate1680 on an RTX PRO 4500 Blackwell with the parent recipe, which matches our long-form decode profile here.
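The relationship between per-position acceptance and mean acceptance length can be sketched as follows. This assumes each per-position figure is the fraction of draft passes in which that position was accepted, so the expected tokens per step are one verified token plus the sum of those fractions. That reading of the figures is an assumption, not something the measurements above state.

```python
def mean_acceptance_length(per_position_acceptance):
    """One token from the target model plus the expected number of accepted draft tokens."""
    return 1.0 + sum(per_position_acceptance)


n1 = mean_acceptance_length([0.93])
n3 = mean_acceptance_length([0.87, 0.72, 0.61])
print(f"n=1: {n1:.2f} of 2.0")
print(f"n=3: {n3:.2f} of 4.0")
```

Under that assumption the n=1 case reproduces the 1.93 figure, and the n=3 case comes out at about 3.2, in the same range as the approximate 3.0 quoted above (the per-position figures were measured on different hardware).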
Usage with vLLM (Blackwell, SM120)
This model requires --language-model-only (it's a slim of the VLM architecture, so vLLM otherwise tries to instantiate the missing vision tower) and the qwen3_xml tool-call parser (Carnice emits OpenAI-style function XML, not canonical Hermes JSON-in-tags).
Recommended production launch — 256K context · KV FP8 · MTP n=3 · max-num-seqs 2
vllm serve sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP \
--trust-remote-code \
--quantization modelopt \
--language-model-only \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--max-model-len 262144 \
--max-num-seqs 2 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.9 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'
This is the production launch on a single RTX PRO 6000 Blackwell — full 256K context, two concurrent agent slots, KV FP8 for concurrency headroom. The four flags that are easy to skip but matter:
- --max-model-len 262144: full 256K context (Qwen3.6 trained max).
- --kv-cache-dtype fp8: halves KV memory; lifts max concurrency at 256K from ~4× (BF16, won't fit) to 7.0× with the same VRAM. ~5–10 % per-token decode overhead, more than paid back by capacity.
- --max-num-seqs 2: load-bearing. --max-num-seqs 4 plus --kv-cache-dtype fp8 plus --speculative-config n=3 plus --max-model-len 262144 will silently OOM during cuda-graph capture on this build of vLLM (0.19.1rc1).
- num_speculative_tokens: 3: vLLM applies the single MTP layer (mtp_num_hidden_layers=1) recursively three times per draft pass. Per-position acceptance ~87 / 72 / 61 %, mean accepted-length ≈ 3.0 / 4.0. The qwen3_5_mtp handler is internally normalized to mtp (deprecated-name warning is harmless).
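If the server is launched from a script rather than a shell, the same flags can be assembled as an argument list, which avoids hand-quoting the --speculative-config JSON. This is only a sketch: build_vllm_command is a hypothetical helper, and it uses no flags beyond those in the launch command above.

```python
import json
import shlex


def build_vllm_command(model, max_model_len, max_num_seqs,
                       num_speculative_tokens, kv_cache_dtype=None,
                       gpu_memory_utilization=0.9):
    """Assemble the vllm serve argv using only the flags shown above."""
    spec = json.dumps(
        {"method": "qwen3_5_mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )
    argv = [
        "vllm", "serve", model,
        "--trust-remote-code",
        "--quantization", "modelopt",
        "--language-model-only",
        "--enable-auto-tool-choice", "--tool-call-parser", "qwen3_xml",
        "--reasoning-parser", "qwen3",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--speculative-config", spec,
    ]
    if kv_cache_dtype is not None:
        argv += ["--kv-cache-dtype", kv_cache_dtype]
    return argv


argv = build_vllm_command(
    "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP",
    max_model_len=262144,
    max_num_seqs=2,
    num_speculative_tokens=3,
    kv_cache_dtype="fp8",
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv) as-is; shlex.join is used here only to print a copy-pasteable shell line.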
Smaller-context launch (16K, no fp8) — fastest single-request decode
vllm serve sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP \
--trust-remote-code \
--quantization modelopt \
--language-model-only \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
--gpu-memory-utilization 0.85 \
--max-model-len 16384
Send a tool-using chat request
curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model": "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP",
"messages": [
{"role": "system", "content": "You are a helpful agent that uses tools when needed."},
{"role": "user", "content": "Get the weather in Tokyo, Paris, and New York simultaneously."}
],
"tools": [
{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
],
"tool_choice": "auto",
"parallel_tool_calls": true,
"max_tokens": 400
}'
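The same request can be sent from Python using only the standard library. The sketch below mirrors the curl payload above and reads tool calls from the OpenAI-compatible response shape (choices[0].message.tool_calls, with arguments as a JSON string). The helper names are hypothetical, and the base URL assumes the local server from the launch commands above.

```python
import json
import urllib.request

MODEL = "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP"


def build_payload(user_message: str) -> dict:
    """Build the same chat-completions payload as the curl example."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful agent that uses tools when needed."},
            {"role": "user", "content": user_message},
        ],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "tool_choice": "auto",
        "parallel_tool_calls": True,
        "max_tokens": 400,
    }


def extract_tool_calls(response: dict) -> list:
    """Return (function name, parsed arguments) pairs from a chat-completions response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


def send(payload: dict, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST the payload to the chat-completions endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With the server running, extract_tool_calls(send(build_payload("Get the weather in Tokyo, Paris, and New York simultaneously."))) would return one entry per tool call the model emitted.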
Hermes Agent (NousResearch)
Works as a drop-in custom / vllm provider for NousResearch/hermes-agent:
# ~/.config/hermes/cli-config.yaml
model:
  default: "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP"
  provider: "vllm"
  base_url: "http://localhost:8000/v1"
Verified locally (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)
Five-task agent capability test against the local OpenAI-compatible endpoint:
| Test | Result |
|---|---|
| 1. Plain chat with reasoning | ✅ <think> chain → final answer split correctly |
| 2. Single tool call (Tokyo weather) | ✅ valid JSON args, finish_reason=tool_calls |
| 3. Multi-turn continuation (Tokyo result → Paris call) | ✅ autonomous next-step |
| 4. Final synthesis after both results | ✅ markdown table, finish_reason=stop |
| 5. Parallel tool calls (3 cities) | ✅ 3 simultaneous calls in one turn |
Production-config bench vs the family baseline (256K + KV FP8 + max-num-seqs 2)
Same launch flags, T = 0, 1 × RTX PRO 6000 Blackwell, vLLM 0.19.1rc1:
| Repo | Format | MTP | Single (S/M/L) | 2-parallel agg (M/L) | vs baseline |
|---|---|---|---|---|---|
| Qwen3.6-27B-NVFP4 (the family baseline) | compressed-tensors | ❌ | 56 / 59 / 59 | 119 / 119 | 1.0× |
| Qwen3.6-27B-Text-NVFP4-MTP | modelopt | ✅ n=3 | 104 / 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| Carnice-V2-27b-NVFP4-TEXT-MTP (this repo) | modelopt | ✅ n=3 | 107 / 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP | modelopt | ✅ n=3 | 117 / 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP (VLM) | modelopt | ✅ n=3 | 118 / 97 / 100 | 183 / 198 | 1.66× / 1.66× |
(S = 50-token, M = 350-token, L = 700-token decodes.)
→ ~1.7× the baseline's single-request decode and 1.6× its 2-parallel aggregate on the same hardware and same launch flags. The Hermes-SFT body lands within noise of the other modelopt + MTP siblings on raw decode speed, so the choice between this repo and the others is about behaviour (Hermes-style XML tool calls + <think> reasoning) rather than throughput.
KV cache size at 256K + fp8: 491,200 tokens → max concurrency 6.98× per request at full 256K. The base-model MTP head grafted from Qwen/Qwen3.6-27B transfers cleanly onto the Hermes-SFT body — at n=1 we see 0.934 per-position acceptance / 1.93 mean acceptance length, comparable to a model that shipped with its own MTP head.
Smaller-context single-request bench (16K, no fp8) — fastest interactive use
| Prompt | Tokens | n=1 tok/s | n=3 tok/s |
|---|---|---|---|
| Short (50 tok) | 50 | 94.0 | 133.9 |
| Medium (350 tok) | 350 | 92.3 | 101.7 |
| Long-form (700 tok) | 700 | 93.5 | 103.4 |
GPU memory at load: ~20 GB.
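The gain from n=3 over n=1 in the 16K table above can be read off as a simple ratio per prompt length. A small sketch using the tok/s figures from that table:

```python
# (n=1 tok/s, n=3 tok/s) from the 16K single-request table above.
bench = {
    "short": (94.0, 133.9),
    "medium": (92.3, 101.7),
    "long": (93.5, 103.4),
}

speedups = {name: n3 / n1 for name, (n1, n3) in bench.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratio is largest on the short decode (about 1.42×) and settles near 1.10× on the medium and long decodes.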
Inherited SFT performance (from the base)
These are the upstream SFT numbers from kai-os/Carnice-V2-27b. Quantization quality should track these — held-out validation under NVFP4 was sane on the smoke tests but a full re-run is left to the user.
| Metric | Qwen3.6-27B base | Carnice SFT (parent) |
|---|---|---|
| IFEval prompt strict (limit 20) | 85.0 % | 90.0 % |
| IFEval prompt loose (limit 20) | 85.0 % | 90.0 % |
| IFEval instruction strict (limit 20) | 90.0 % | 93.3 % |
| IFEval instruction loose (limit 20) | 90.0 % | 93.3 % |
| Held-out assistant eval loss | 0.607 | 0.414 |
| Held-out perplexity | 1.835 | 1.513 |
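The last two rows of the table are linked: perplexity is the exponential of the mean per-token loss, assuming the loss is a natural-log cross-entropy. A quick check with the figures above:

```python
import math

# Held-out assistant eval loss values from the table above.
losses = {"Qwen3.6-27B base": 0.607, "Carnice SFT (parent)": 0.414}

perplexities = {name: math.exp(loss) for name, loss in losses.items()}
for name, ppl in perplexities.items():
    print(f"{name}: perplexity {ppl:.3f}")
```

Both results agree with the perplexity row (1.835 and 1.513) to three decimals.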
Hardware target
Built and tested on NVIDIA RTX PRO 6000 Blackwell (SM120). Should also work on DGX Spark (GB10), RTX 5090, and other Blackwell consumer/workstation cards with sufficient VRAM (~15 GB NVFP4 weights + ~4 GB bf16 MTP/SSM/lm_head ≈ 19.6 GB on disk).
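As a rough sanity check on the ~15 GB weight figure, the sketch below estimates NVFP4 storage from the parameter count. It assumes the usual NVFP4 layout of 4-bit values with one 8-bit scale per 16-element block (4.5 bits per weight), treats all 27 B parameters as quantized, and ignores per-tensor scales and the bf16 modules, so it is an approximation only.

```python
PARAMS = 27e9          # parameter count quoted above
VALUE_BITS = 4         # NVFP4 element width
SCALE_BITS = 8         # assumed: one FP8 scale per block
BLOCK_SIZE = 16        # assumed: NVFP4 block size

bits_per_weight = VALUE_BITS + SCALE_BITS / BLOCK_SIZE
nvfp4_gb = PARAMS * bits_per_weight / 8 / 1e9
print(f"{bits_per_weight} bits/weight -> ~{nvfp4_gb:.1f} GB")
```

Under those assumptions the estimate lands at roughly 15.2 GB, in line with the ~15 GB quoted for the NVFP4 weights.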
Acknowledgements
- kai-os: for the Carnice-V2-27b Hermes-agent SFT (assistant-token-only loss, GLM-5.1 trace blend in the data mix)
- Qwen Team: for the original Qwen3.6-27B base (and the MTP head we grafted)
- NousResearch: for the Hermes Agent framework that motivated this variant
- osoleve: for the original MTP-restoration recipe on Qwen3.5
- nvidia-modelopt team
Support the upstream authors
If you find this useful, please go support the people whose work this rests on:
- kai-os (Hermes SFT): https://huggingface.co/kai-os/Carnice-V2-27b
- Qwen Team (base model): https://huggingface.co/Qwen/Qwen3.6-27B
License
Apache 2.0, inherited from the parent.
— Tonoken3 / Lna-Lab