Carnice-V2-27b-NVFP4-TEXT-MTP

Text-only NVFP4-quantized variant of kai-os/Carnice-V2-27b — a Hermes-agent SFT of Qwen/Qwen3.6-27B — with a working bf16 MTP head grafted from the base model and the vision tower stripped.

Built for NVIDIA RTX PRO 6000 Blackwell, DGX Spark (GB10), and 24–32 GB Blackwell cards that need a fast Hermes-style agent with <think> reasoning and tool calls at ~20 GB on disk.

Highlights

⚡ MTP speculative decoding works out of the box — 93 % per-position acceptance verified on this build
🛠 Hermes / OpenAI-XML tool calling — parallel calls, multi-turn round-trips, <think> reasoning all preserved
🎯 ~19.6 GB on disk — fits 24 GB Blackwell with full KV-cache + cuda-graph headroom
📈 100+ tok/s single-request decode on a single RTX PRO 6000 Blackwell with num_speculative_tokens=3 (134 short / 102 medium / 103 long-form). Drops to ~93 tok/s at num_speculative_tokens=1.
🟢 vLLM-ready with --quantization modelopt + qwen3_5_mtp spec config

⚠️ Important: built on a prefix-fixed copy of the base

The shipped kai-os/Carnice-V2-27b BF16 safetensors carry a triple language_model. prefix on every key, which makes HF transformers load the model with all-random weights and produce gibberish (the GGUF variant is unaffected because llama.cpp normalizes prefixes during conversion).

This NVFP4 variant was quantized from a prefix-corrected copy of the base. Detail and reproducible fix script:

→ kai-os/Carnice-V2-27b · Discussion #1

If you load the original BF16 directly, you will see the same gibberish problem. This repo is unaffected — the export here uses the corrected weights.
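
As a rough illustration of the kind of key normalization involved, the sketch below collapses a repeated `language_model.` segment in tensor names. The example key and the helpers `collapse_repeated_prefix` and `count_affected` are hypothetical; the actual key layout and the fix script are in the linked discussion and may differ.

```python
import re

# Hypothetical helper: matches two or more consecutive "language_model."
# segments. The real fix script lives in the linked discussion and may
# treat the keys differently.
_REPEATED = re.compile(r"(?:language_model\.){2,}")


def collapse_repeated_prefix(key: str) -> str:
    """Return `key` with any repeated `language_model.` run reduced to one."""
    return _REPEATED.sub("language_model.", key)


def count_affected(keys) -> int:
    """Count tensor names that still contain a repeated prefix."""
    return sum(1 for k in keys if _REPEATED.search(k))


# Illustrative key only; it is not copied from the actual checkpoint.
bad_key = (
    "model.language_model.language_model.language_model."
    "layers.0.mlp.up_proj.weight"
)
fixed_key = collapse_repeated_prefix(bad_key)
print(fixed_key)
```

Running `count_affected` over the tensor names of a checkpoint shard is one quick way to see whether a given copy carries the repeated prefix.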

What's in here

| | This repo (NVFP4-TEXT-MTP) | Parent (`kai-os/Carnice-V2-27b`) |
|---|---|---|
| Format | NVFP4 (`modelopt`) + bf16 MTP head | bf16 safetensors |
| File size | ~19.6 GB | ~54.7 GB |
| Vision tower | ❌ stripped (333 tensors / ~0.92 GB removed) | ✅ present (but Hermes use is text-only anyway) |
| MTP head | ✅ bf16, working (grafted from `Qwen/Qwen3.6-27B`) | ❌ dropped during merge |
| Architecture | `Qwen3_5ForConditionalGeneration` (text-only mode) | `Qwen3_5ForConditionalGeneration` |
| Agent format | Hermes-style XML tool calls + `<think>` reasoning | same (this repo preserves it) |

Quantization details

Base: kai-os/Carnice-V2-27b (bf16, 27 B params, hybrid linear-attn + full-attn, 64 layers) — prefix-corrected before quantization
Quantizer: nvidia-modelopt 0.43.0 with NVFP4_DEFAULT_CFG
Calibration: 20 samples from neuralmagic/calibration (LLM split), max_seq_len 8192
MTP graft source: Qwen/Qwen3.6-27B (15 mtp.* tensors, ~850 MB bf16, original Carnice merge dropped them)
Ignored from quantization (kept in bf16):
- lm_head
- All *linear_attn.conv1d* (Mamba-style SSM convolutions, 48 of 64 layers)
- All mtp.* modules (15 tensors)
- Other NVFP4_DEFAULT_CFG defaults (router, mlp.gate, output_layer …)
Vision-related ignore entries are removed since the vision tensors no longer exist.
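
To make the ignore list concrete, here is a minimal sketch of how wildcard patterns like the ones above select modules that stay in bf16. It uses Python's standard `fnmatch` matching and is not the modelopt API; the module names in `examples` are hypothetical, and only the three explicitly listed patterns are covered.

```python
from fnmatch import fnmatchcase

# Wildcard patterns taken from the ignore list above. The remaining
# NVFP4_DEFAULT_CFG defaults are not reproduced here.
IGNORE_PATTERNS = ["lm_head", "*linear_attn.conv1d*", "mtp.*"]


def stays_bf16(module_name: str) -> bool:
    """True if the module name matches any ignore pattern."""
    return any(fnmatchcase(module_name, p) for p in IGNORE_PATTERNS)


# Hypothetical module names, chosen only to exercise each pattern.
examples = [
    "lm_head",
    "model.layers.0.linear_attn.conv1d",
    "model.layers.3.self_attn.q_proj",
    "mtp.layers.0.mlp.gate_proj",
]
kept = {name: stays_bf16(name) for name in examples}
for name, is_kept in kept.items():
    print(f"{name}: {'bf16' if is_kept else 'NVFP4'}")
```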

Why MTP from the base works on an SFT'd body

Because Carnice's SFT used assistant-token-only loss with LoRA and was then merged to bf16, the underlying hidden-state distribution stays close to the Qwen3.6 base. The Qwen3.6 MTP head therefore transfers cleanly, giving a per-position acceptance rate of 0.93 during smoke testing, comparable to a model that shipped with its own MTP head.

Mean acceptance length is 1.93 / 2.0 at num_speculative_tokens=1 and approximately 3.0 / 4.0 at num_speculative_tokens=3. Per-position acceptance at n=3 is ~87 / 72 / 61 % for positions 1/2/3, as reported by Pulsate1680 on an RTX PRO 4500 Blackwell with the parent recipe; this matches our long-form decode profile here.
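
The relation between per-position acceptance and mean acceptance length can be sketched as follows. This assumes each per-position rate is the fraction of draft steps in which that position is accepted, so the mean length is one verified token plus the sum of the rates; the helper name `mean_acceptance_length` is made up for this example.

```python
def mean_acceptance_length(per_position_acceptance):
    """One verified token per step plus the expected accepted draft tokens.

    Assumes each rate is the share of draft steps in which that position
    was accepted (an assumption of this sketch, not a statement about how
    the figures above were measured).
    """
    return 1.0 + sum(per_position_acceptance)


n1 = mean_acceptance_length([0.93])              # num_speculative_tokens=1
n3 = mean_acceptance_length([0.87, 0.72, 0.61])  # num_speculative_tokens=3

print(f"n=1: {n1:.2f} of a possible 2.0")
print(f"n=3: {n3:.2f} of a possible 4.0")
```

With the rounded per-position figures, the n=3 result comes out near 3.2 of a possible 4.0, in the same range as the approximate value quoted above.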

Usage with vLLM (Blackwell, SM120)

This model requires --language-model-only (it is a slimmed-down copy of the VLM architecture, so vLLM otherwise tries to instantiate the missing vision tower) and the qwen3_xml tool-call parser (Carnice emits OpenAI-style function XML, not canonical Hermes JSON-in-tags).

Recommended production launch — 256K context · KV FP8 · MTP n=3 · max-num-seqs 2

vllm serve sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP \
    --trust-remote-code \
    --quantization modelopt \
    --language-model-only \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --max-model-len 262144 \
    --max-num-seqs 2 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}'

This is the production launch on a single RTX PRO 6000 Blackwell — full 256K context, two concurrent agent slots, KV FP8 for concurrency headroom. The four flags that are easy to skip but matter:

--max-model-len 262144 — full 256K context (Qwen3.6 trained max).
--kv-cache-dtype fp8 — halves KV memory; lifts max concurrency at 256K from ~4× (BF16, won't fit) to 7.0× with the same VRAM. ~5–10 % per-token decode overhead, more than paid back by capacity.
--max-num-seqs 2 — load-bearing. --max-num-seqs 4 plus --kv-cache-dtype fp8 plus --speculative-config n=3 plus --max-model-len 262144 will silently OOM during cuda-graph capture on this build of vLLM (0.19.1rc1).
num_speculative_tokens: 3 — vLLM applies the single MTP layer (mtp_num_hidden_layers=1) recursively three times per draft pass. Per-position acceptance ~87 / 72 / 61 %, mean accepted-length ≈ 3.0 / 4.0. The qwen3_5_mtp handler is internally normalized to mtp (deprecated-name warning is harmless).
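
If the launch is scripted, the flags above can be assembled in code so that the JSON passed to `--speculative-config` is always well-formed. This is a minimal sketch using only flags that appear in the command above; `build_vllm_command` is a hypothetical helper, not part of vLLM.

```python
import json
import shlex

MODEL = "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP"


def build_vllm_command(
    max_model_len=262144,
    max_num_seqs=2,
    kv_cache_dtype="fp8",
    num_speculative_tokens=3,
    gpu_memory_utilization=0.9,
):
    """Return the argv list for the production launch described above."""
    spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": num_speculative_tokens}
    argv = [
        "vllm", "serve", MODEL,
        "--trust-remote-code",
        "--quantization", "modelopt",
        "--language-model-only",
        "--enable-auto-tool-choice", "--tool-call-parser", "qwen3_xml",
        "--reasoning-parser", "qwen3",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]
    if kv_cache_dtype:
        argv += ["--kv-cache-dtype", kv_cache_dtype]
    argv += [
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        # Compact separators reproduce the JSON string used in the command above.
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
    ]
    return argv


production = build_vllm_command()
command_line = shlex.join(production)  # shell-quoted, safe to paste
print(command_line)
```

Passing the list to `subprocess.run(production)` avoids shell quoting entirely; `shlex.join` is only needed when the command is printed or written to a script.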

Smaller-context launch (16K, no fp8) — fastest single-request decode

vllm serve sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP \
    --trust-remote-code \
    --quantization modelopt \
    --language-model-only \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
    --gpu-memory-utilization 0.85 \
    --max-model-len 16384

Send a tool-using chat request

curl http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model": "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP",
  "messages": [
    {"role": "system", "content": "You are a helpful agent that uses tools when needed."},
    {"role": "user", "content": "Get the weather in Tokyo, Paris, and New York simultaneously."}
  ],
  "tools": [
    {"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}
  ],
  "tool_choice": "auto",
  "parallel_tool_calls": true,
  "max_tokens": 400
}'
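
The same request can be built and its reply unpacked from Python. The sketch below uses only the standard library and the OpenAI-compatible response shape (`choices[0].message.tool_calls`); `mock_response` is a hand-written stand-in rather than real model output, and the helper names and the `call_0` / `call_1` ids are illustrative.

```python
import json
import urllib.request

MODEL = "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP"
ENDPOINT = "http://localhost:8000/v1/chat/completions"

GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_payload(user_text, max_tokens=400):
    """Mirror the JSON body of the curl request above."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are a helpful agent that uses tools when needed."},
            {"role": "user", "content": user_text},
        ],
        "tools": [GET_WEATHER_TOOL],
        "tool_choice": "auto",
        "parallel_tool_calls": True,
        "max_tokens": max_tokens,
    }


def post_chat(payload, endpoint=ENDPOINT):
    """POST the payload to a running server and return the decoded JSON reply."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


def extract_tool_calls(response):
    """Return (name, arguments) pairs; arguments arrive as a JSON string."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"])) for c in calls]


payload = build_payload("Get the weather in Tokyo, Paris, and New York simultaneously.")

# Hand-written stand-in for a server reply (not real model output).
mock_response = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {"id": "call_0", "type": "function",
                 "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"}},
                {"id": "call_1", "type": "function",
                 "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}},
            ],
        },
    }]
}
calls = extract_tool_calls(mock_response)
print(calls)
```

With a server running, `extract_tool_calls(post_chat(payload))` would replace the mock.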

Hermes Agent (NousResearch)

Works as a drop-in custom / vllm provider for NousResearch/hermes-agent:

# ~/.config/hermes/cli-config.yaml
model:
  default: "sakamakismile/Carnice-V2-27b-NVFP4-TEXT-MTP"
  provider: "vllm"
  base_url: "http://localhost:8000/v1"

Verified locally (RTX PRO 6000 Blackwell, vLLM 0.19.1rc1)

Five-task agent capability test against the local OpenAI-compatible endpoint:

| Test | Result |
|---|---|
| 1. Plain chat with reasoning | ✅ `<think>` chain → final answer split correctly |
| 2. Single tool call (Tokyo weather) | ✅ valid JSON args, `finish_reason=tool_calls` |
| 3. Multi-turn continuation (Tokyo result → Paris call) | ✅ autonomous next-step |
| 4. Final synthesis after both results | ✅ markdown table, `finish_reason=stop` |
| 5. Parallel tool calls (3 cities) | ✅ 3 simultaneous calls in one turn |
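
For the multi-turn steps (tests 3 and 4), the client has to feed tool results back in the OpenAI-compatible message format before the model can continue. A minimal sketch of that bookkeeping is below; the helper name `append_tool_round_trip`, the call id, and the weather payload are invented for illustration and are not taken from the test harness.

```python
import json


def append_tool_round_trip(messages, assistant_message, results_by_call_id):
    """Return a new history with the assistant's tool-call turn and one
    `tool` message per call, keyed by `tool_call_id`."""
    updated = list(messages)
    updated.append(assistant_message)
    for call in assistant_message.get("tool_calls", []):
        updated.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(results_by_call_id[call["id"]]),
        })
    return updated


history = [{"role": "user", "content": "What is the weather in Tokyo?"}]

# Hypothetical assistant turn, shaped like an OpenAI-compatible tool call.
assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Tokyo\"}"},
    }],
}

# Invented tool result, used only to show the message shape.
history = append_tool_round_trip(
    history, assistant_turn, {"call_0": {"city": "Tokyo", "temp_c": 21}}
)
print([m["role"] for m in history])
```

Sending the extended `history` back with the same `tools` list lets the model either issue the next call or produce its final answer.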

Production-config bench vs the family baseline (256K + KV FP8 + max-num-seqs 2)

Same launch flags, T = 0, 1 × RTX PRO 6000 Blackwell, vLLM 0.19.1rc1:

| Repo | Format | MTP | Single tok/s (S/M/L) | 2-parallel aggregate tok/s (M/L) | vs baseline (single / 2-parallel) |
|---|---|---|---|---|---|
| `Qwen3.6-27B-NVFP4` (the family baseline) | `compressed-tensors` | ❌ | 56 / 59 / 59 | 119 / 119 | 1.0× |
| `Qwen3.6-27B-Text-NVFP4-MTP` | `modelopt` | ✅ n=3 | 104 / 98 / 100 | 189 / 207 | 1.67× / 1.74× |
| `Carnice-V2-27b-NVFP4-TEXT-MTP` (this repo) | `modelopt` | ✅ n=3 | 107 / 98 / 102 | 193 / 194 | 1.68× / 1.63× |
| `Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP` | `modelopt` | ✅ n=3 | 117 / 96 / 101 | 203 / 183 | 1.65× / 1.54× |
| `Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP` (VLM) | `modelopt` | ✅ n=3 | 118 / 97 / 100 | 183 / 198 | 1.66× / 1.66× |

(S = 50-token, M = 350-token, L = 700-token decodes.)

→ ~1.7× the baseline's single-request decode and 1.6× its 2-parallel aggregate on the same hardware and same launch flags. The Hermes-SFT body lands within noise of the other modelopt + MTP siblings on raw decode speed, so the choice between this repo and the others is about behaviour (Hermes-style XML tool calls + <think> reasoning) rather than throughput.
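
For readers who want the per-length ratios rather than the summary figures, they can be recomputed from the rounded table entries. The sketch below does this for this repo against the baseline row; because the inputs are rounded tok/s values, the results may differ slightly from the table's own "vs baseline" column, whose exact derivation is not stated.

```python
# tok/s values copied from the benchmark table above.
BASELINE = {"single": {"S": 56, "M": 59, "L": 59}, "parallel2": {"M": 119, "L": 119}}
THIS_REPO = {"single": {"S": 107, "M": 98, "L": 102}, "parallel2": {"M": 193, "L": 194}}


def speedups(candidate, baseline):
    """Ratio of candidate to baseline throughput for each mode and decode length."""
    return {
        mode: {
            length: round(candidate[mode][length] / baseline[mode][length], 2)
            for length in candidate[mode]
        }
        for mode in candidate
    }


ratios = speedups(THIS_REPO, BASELINE)
for mode, by_length in ratios.items():
    print(mode, by_length)
```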

KV cache size at 256K + fp8: 491,200 tokens → max concurrency 6.98× per request at full 256K. The base-model MTP head grafted from Qwen/Qwen3.6-27B transfers cleanly onto the Hermes-SFT body — at n=1 we see 0.934 per-position acceptance / 1.93 mean acceptance length, comparable to a model that shipped with its own MTP head.

Smaller-context single-request bench (16K, no fp8) — fastest interactive use

| Prompt | Tokens | n=1 tok/s | n=3 tok/s |
|---|---|---|---|
| Short (50 tok) | 50 | 94.0 | 133.9 |
| Medium (350 tok) | 350 | 92.3 | 101.7 |
| Long-form (700 tok) | 700 | 93.5 | 103.4 |

GPU memory at load: ~20 GB.

Inherited SFT performance (from the base)

These are the upstream SFT numbers from kai-os/Carnice-V2-27b. Quantization quality should track these — held-out validation under NVFP4 was sane on the smoke tests but a full re-run is left to the user.

| Metric | Qwen3.6-27B base | Carnice SFT (parent) |
|---|---|---|
| IFEval prompt strict (limit 20) | 85.0 % | 90.0 % |
| IFEval prompt loose (limit 20) | 85.0 % | 90.0 % |
| IFEval instruction strict (limit 20) | 90.0 % | 93.3 % |
| IFEval instruction loose (limit 20) | 90.0 % | 93.3 % |
| Held-out assistant eval loss | 0.607 | 0.414 |
| Held-out perplexity | 1.835 | 1.513 |
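
The loss and perplexity rows are two views of the same measurement. Assuming the eval loss is the mean per-token negative log-likelihood in nats, perplexity is its exponential, which reproduces the table values:

```python
import math


def perplexity(mean_nll_nats: float) -> float:
    """Perplexity from a mean per-token negative log-likelihood (natural log)."""
    return math.exp(mean_nll_nats)


base_ppl = perplexity(0.607)  # Qwen3.6-27B base
sft_ppl = perplexity(0.414)   # Carnice SFT (parent)

print(f"base: {base_ppl:.3f}, SFT: {sft_ppl:.3f}")
```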

Hardware target

Built and tested on NVIDIA RTX PRO 6000 Blackwell (SM120). Should also work on DGX Spark (GB10), RTX 5090, and other Blackwell consumer/workstation cards with sufficient VRAM (~15 GB NVFP4 weights + ~4 GB bf16 MTP/SSM/lm_head ≈ 19.6 GB on disk).
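
The ~15 GB figure for the NVFP4 weights can be sanity-checked with a back-of-the-envelope estimate. NVFP4 stores 4-bit values with one FP8 (E4M3) scale per 16-element block, or about 4.5 bits per weight. The sketch below simplifies by treating all 27 B parameters as quantized and ignoring per-tensor scales and the modules kept in bf16, so it is an approximation rather than an exact accounting of this checkpoint.

```python
def nvfp4_weight_gb(n_params, block_size=16, value_bits=4, scale_bits=8):
    """Approximate NVFP4 storage in GB (10^9 bytes).

    Counts the 4-bit values plus one 8-bit block scale per `block_size`
    weights; per-tensor scales and bf16 modules are ignored (simplification).
    """
    bits_per_weight = value_bits + scale_bits / block_size  # 4.5 with defaults
    return n_params * bits_per_weight / 8 / 1e9


estimate_gb = nvfp4_weight_gb(27e9)
print(f"~{estimate_gb:.1f} GB for 27 B parameters at 4.5 bits per weight")
```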

Acknowledgements

kai-os — for the Carnice-V2-27b Hermes-agent SFT (assistant-token-only loss, GLM-5.1 trace blend in the data mix)
Qwen Team — for the original Qwen3.6-27B base (and the MTP head we grafted)
NousResearch — for the Hermes Agent framework that motivated this variant
osoleve — for the original MTP-restoration recipe on Qwen3.5
nvidia-modelopt team

Support the upstream authors

If you find this useful, please go support the people whose work this rests on:

kai-os (Hermes SFT): https://huggingface.co/kai-os/Carnice-V2-27b
Qwen Team (base model): https://huggingface.co/Qwen/Qwen3.6-27B

License

Apache 2.0, inherited from the parent.

— Tonoken3 / Lna-Lab
