Qwen3.6-27B
MLX 8-bit · Text + Vision + Thinking + Tool Calling
Apple Silicon native


What's this?

Qwen3.6-27B is a 27B-parameter dense model from Alibaba. It uses a hybrid linear/full attention architecture (3:1 ratio across 64 layers) that combines efficient DeltaNet-style linear attention with full softmax attention at regular intervals. It supports 262K context, vision, video, and multi-token prediction.

This is an MLX 8-bit conversion of the official Qwen3.6-27B weights, ready to run on Apple Silicon with full text, image, and video support.

Architecture details
Spec                     Value
Total params             27.8B (dense, all active)
Layers                   64 (3x linear attention + 1x full attention, 16 repetitions)
Attention                24 Q heads, 4 KV heads (GQA), head_dim 256
Linear attention         16 QK heads, 48 V heads, head_dim 128
FFN                      intermediate_size 17408
Context                  262K native, 1M+ with YaRN
RoPE                     theta 10M, partial_rotary_factor 0.25, mrope_interleaved
Vocab                    248K tokens
Multimodal               Text, image, video
Multi-token prediction   Supported (1 draft layer)
model_type               qwen3_5
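
To make the 3:1 layout concrete, here is a toy sketch of the repeating block structure. Placing full attention on the last layer of each group of four is an assumption for illustration, not something read from the config; the point is the 48/16 split across 64 layers.

# Toy sketch of the hybrid layout: 16 groups of (3 linear + 1 full) layers.
# The position of the full-attention layer within each group is assumed;
# check config.json for the actual arrangement.
def layer_kind(i: int) -> str:
    return "full" if i % 4 == 3 else "linear"

kinds = [layer_kind(i) for i in range(64)]
assert kinds.count("linear") == 48  # 3 per group x 16 groups
assert kinds.count("full") == 16    # 1 per group x 16 groups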

This conversion

  • Source: Official Qwen3.6-27B safetensors (BF16, 15 shards)
  • Quantization: 8-bit (8.6 bits/weight, 28 GB across 6 shards)
  • Vision: Full support via mlx-vlm. Text, image, and video inputs work out of the box
  • Thinking: Toggleable via <|think_on|> / <|think_off|> tags (see below)
  • Tool calling: Works via the included fixed Jinja chat template
  • Requirements: mlx-lm >= 0.31.2, mlx-vlm >= 0.4.4
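
Both packages install from PyPI; the pins mirror the requirements above:

pip install "mlx-lm>=0.31.2" "mlx-vlm>=0.4.4"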

Quick start

Text

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("froggeric/Qwen3.6-27B-MLX-8bit")
sampler = make_sampler(temp=0.7)  # mlx-lm >= 0.31 takes a sampler, not a temp kwarg
response = generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler)
print(response)

Vision

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("froggeric/Qwen3.6-27B-MLX-8bit")
image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted = apply_chat_template(processor, model.config, prompt, num_images=len(image))
result = generate(model, processor, formatted, image, max_tokens=256, temperature=0.7)
print(result.text)

CLI

# Text
mlx_lm.generate --model froggeric/Qwen3.6-27B-MLX-8bit --prompt "Hello"

# Vision
mlx_vlm.generate --model froggeric/Qwen3.6-27B-MLX-8bit --image image.jpg --prompt "Describe this image"

System prompt

The first line of your system prompt must be:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

The model underperforms without it. You can append anything after that line.
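
For example, wired through the chat template with the Python API (a sketch reusing the Quick start setup; the appended instruction is arbitrary):

# Sketch: the mandatory first line, plus your own instructions after it.
messages = [
    {"role": "system", "content": (
        "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n"
        "Answer concisely."  # anything you like can follow the first line
    )},
    {"role": "user", "content": "Hello"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)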


Thinking toggle

This model ships with a fixed Jinja chat template that lets you toggle thinking on the fly. Drop <|think_on|> or <|think_off|> anywhere in your system or user prompt. The template intercepts the tag, strips it from context so the model never sees it, and flips the thinking mode.

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Fast answer, no internal reasoning.

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

The model thinks step by step, then answers.
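
In code, the tag simply rides along in the message content. A minimal sketch reusing the mlx-lm setup from Quick start:

# Sketch: toggling thinking off for a quick factual answer. The template
# strips <|think_off|> before the model sees the prompt.
messages = [
    {"role": "system", "content": (
        "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n"
        "You are a coding assistant. <|think_off|>"
    )},
    {"role": "user", "content": "What's 2+2?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=64))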

Chat template

The bundled Jinja template fixes several issues in the official Qwen 3.6 template:

  • Tool calls crash on C++ engines. The official template uses Python's |items filter and |safe, which do not exist in C++ Jinja runtimes (LM Studio, MLX). This template uses direct dictionary key lookups instead.
  • The developer role crashes. Modern APIs send message.role == "developer". The official template throws an exception. This template maps it to system (see the sketch below).
  • Empty preserve_thinking spam. The official template wraps every past turn in empty <think/> blocks, wasting context tokens. This template only emits thinking blocks when they contain actual reasoning content.
  • </thinking> hallucination handling. The model sometimes generates </thinking> instead of the expected closing tag. This template handles both gracefully.
  • Thinking toggle. <|think_on|> / <|think_off|> from any message role.

See chat_template.README.md for the full breakdown.
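
To sanity-check the first two fixes locally, you can render the bundled template with a tool definition and a developer-role message. A sketch; the get_weather tool is a hypothetical example, not part of this repo:

# Sketch: confirm that tools and the "developer" role both render
# without raising, using the tokenizer from Quick start.
messages = [
    {"role": "developer", "content": "Always answer in metric units."},
    {"role": "user", "content": "Weather in Paris?"},
]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
rendered = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(rendered)  # the developer message should appear as a system turn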


Sampling

Recommended sampling settings from the Qwen authors. Reserve 128K+ tokens of context for thinking mode.

Mode                          temp   top_p   top_k   min_p   repeat_penalty   presence_penalty
Thinking (coding) (default)   0.6    0.95    20      0       1.0              off
Thinking (general)            1.0    0.95    20      0       1.0              1.5
Non-thinking (general)        0.7    0.8     20      0       1.0              1.5

GGUF runtimes use presence_penalty (0 = off). MLX / LM Studio use repeat_penalty (1.0 = off).
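
With mlx-lm these settings map onto make_sampler. A sketch using the non-thinking row and the Quick start setup; since repeat_penalty 1.0 means "off", no logits processor is needed:

from mlx_lm.sample_utils import make_sampler

# Non-thinking (general) row from the table above. min_p 0 and
# repeat_penalty 1.0 are both "off", so only temp/top_p/top_k apply.
sampler = make_sampler(temp=0.7, top_p=0.8, top_k=20, min_p=0.0)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)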


Authorship

Role                                   Author
Original model                         Alibaba Cloud (Qwen team)
MLX 8-bit conversion (text + vision)   froggeric

License

Apache-2.0, inherited from Qwen3.6.
