Qwen3.6-27B
MLX 8-bit · Text + Vision + Thinking + Tool Calling
Apple Silicon native


What's this?

Qwen3.6-27B is a 27B-parameter dense model from Alibaba. It uses a hybrid linear/full attention architecture (3:1 ratio across 64 layers) that combines efficient DeltaNet-style linear attention with full softmax attention at regular intervals. It supports 262K context, vision, video, and multi-token prediction.

This is an MLX 8-bit conversion of the official Qwen3.6-27B weights, ready to run on Apple Silicon with full text, image, and video support.

Architecture details
Spec                     Value
Total params             27.8B (dense, all active)
Layers                   64 (3x linear attention + 1x full attention, 16 repetitions)
Attention                24 Q heads, 4 KV heads (GQA), head_dim 256
Linear attention         16 QK heads, 48 V heads, head_dim 128
FFN                      intermediate_size 17408
Context                  262K native, 1M+ with YaRN
RoPE                     theta 10M, partial_rotary_factor 0.25, mrope_interleaved
Vocab                    248K tokens
Multimodal               Text, image, video
Multi-token prediction   Supported (1 draft layer)
model_type               qwen3_5
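
To make the 3:1 layout concrete, here is a toy sketch of the repeating block structure. Placing full attention on the last layer of each group of four is an assumption for illustration, not something read from the config; the point is the 48/16 split across 64 layers.

# Toy sketch of the hybrid layout: 16 groups of (3 linear + 1 full) layers.
# The position of the full-attention layer within each group is assumed;
# check config.json for the actual arrangement.
def layer_kind(i: int) -> str:
    return "full" if i % 4 == 3 else "linear"

kinds = [layer_kind(i) for i in range(64)]
assert kinds.count("linear") == 48  # 3 per group x 16 groups
assert kinds.count("full") == 16    # 1 per group x 16 groups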

This conversion

  • Source: Official Qwen3.6-27B safetensors (BF16, 15 shards)
  • Quantization: 8-bit (8.6 bits/weight, 28 GB across 6 shards)
  • Vision: Full support via mlx-vlm. Text, image, and video inputs work out of the box
  • Thinking: Toggleable via <|think_on|> / <|think_off|> tags (see below)
  • Tool calling: Works via the included fixed Jinja chat template
  • Requirements: mlx-lm >= 0.31.2, mlx-vlm >= 0.4.4
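
Both packages install from PyPI; the pins mirror the requirements above:

pip install "mlx-lm>=0.31.2" "mlx-vlm>=0.4.4"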

Quick start

Text

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("froggeric/Qwen3.6-27B-MLX-8bit")
sampler = make_sampler(temp=0.7)  # mlx-lm >= 0.31 takes a sampler, not a temp kwarg
response = generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler)
print(response)

Vision

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("froggeric/Qwen3.6-27B-MLX-8bit")
image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted = apply_chat_template(processor, model.config, prompt, num_images=len(image))
result = generate(model, processor, formatted, image, max_tokens=256, temperature=0.7)
print(result.text)

CLI

# Text
mlx_lm.generate --model froggeric/Qwen3.6-27B-MLX-8bit --prompt "Hello"

# Vision
mlx_vlm.generate --model froggeric/Qwen3.6-27B-MLX-8bit --image image.jpg --prompt "Describe this image"

System prompt

The first line of your system prompt must be:

You are Qwen, created by Alibaba Cloud. You are a helpful assistant.

The model underperforms without it. You can append anything after that line.
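
For example, wired through the chat template with the Python API (a sketch reusing the Quick start setup; the appended instruction is arbitrary):

# Sketch: the mandatory first line, plus your own instructions after it.
messages = [
    {"role": "system", "content": (
        "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n"
        "Answer concisely."  # anything you like can follow the first line
    )},
    {"role": "user", "content": "Hello"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=256)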


Thinking toggle

This model ships with a fixed Jinja chat template that lets you toggle thinking on the fly. Drop <|think_on|> or <|think_off|> anywhere in your system or user prompt. The template intercepts the tag, strips it from context so the model never sees it, and flips the thinking mode.

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Fast answer, no internal reasoning.

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

The model thinks step by step, then answers.
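
In code, the tag simply rides along in the message content. A minimal sketch reusing the mlx-lm setup from Quick start:

# Sketch: toggling thinking off for a quick factual answer. The template
# strips <|think_off|> before the model sees the prompt.
messages = [
    {"role": "system", "content": (
        "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\n"
        "You are a coding assistant. <|think_off|>"
    )},
    {"role": "user", "content": "What's 2+2?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=64))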

Chat template

The bundled Jinja template fixes several issues in the official Qwen 3.6 template:

  • Tool calls crash on C++ engines. The official template uses Python's |items filter and |safe, which do not exist in C++ Jinja runtimes (LM Studio, MLX). This template uses direct dictionary key lookups instead.
  • The developer role crashes. Modern APIs send message.role == "developer". The official template throws an exception. This template maps it to system (see the sketch below).
  • Empty preserve_thinking spam. The official template wraps every past turn in empty <think/> blocks, wasting context tokens. This template only emits thinking blocks when they contain actual reasoning content.
  • </thinking> hallucination handling. The model sometimes generates </thinking> instead of the expected closing tag. This template handles both gracefully.
  • Thinking toggle. <|think_on|> / <|think_off|> from any message role.

See chat_template.README.md for the full breakdown.
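
To sanity-check the first two fixes locally, you can render the bundled template with a tool definition and a developer-role message. A sketch; the get_weather tool is a hypothetical example, not part of this repo:

# Sketch: confirm that tools and the "developer" role both render
# without raising, using the tokenizer from Quick start.
messages = [
    {"role": "developer", "content": "Always answer in metric units."},
    {"role": "user", "content": "Weather in Paris?"},
]
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
rendered = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(rendered)  # the developer message should appear as a system turn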


Sampling

Recommended sampling settings from the Qwen authors. Reserve 128K+ tokens of context for thinking mode.

Mode                          temp   top_p   top_k   min_p   repeat_penalty   presence_penalty
Thinking (coding) (default)   0.6    0.95    20      0       1.0              off
Thinking (general)            1.0    0.95    20      0       1.0              1.5
Non-thinking (general)        0.7    0.8     20      0       1.0              1.5

GGUF runtimes use presence_penalty (0 = off). MLX / LM Studio use repeat_penalty (1.0 = off).
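
With mlx-lm these settings map onto make_sampler. A sketch using the non-thinking row and the Quick start setup; since repeat_penalty 1.0 means "off", no logits processor is needed:

from mlx_lm.sample_utils import make_sampler

# Non-thinking (general) row from the table above. min_p 0 and
# repeat_penalty 1.0 are both "off", so only temp/top_p/top_k apply.
sampler = make_sampler(temp=0.7, top_p=0.8, top_k=20, min_p=0.0)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler)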


Authorship

Role                                   Author
Original model                         Alibaba Cloud (Qwen team)
MLX 8-bit conversion (text + vision)   froggeric

License

Apache-2.0, inherited from Qwen3.6.
