Instructions to use froggeric/Qwen3.6-27B-MLX-8bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use froggeric/Qwen3.6-27B-MLX-8bit with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("froggeric/Qwen3.6-27B-MLX-8bit") config = load_config("froggeric/Qwen3.6-27B-MLX-8bit") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi new
How to use froggeric/Qwen3.6-27B-MLX-8bit with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "froggeric/Qwen3.6-27B-MLX-8bit"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "froggeric/Qwen3.6-27B-MLX-8bit" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use froggeric/Qwen3.6-27B-MLX-8bit with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "froggeric/Qwen3.6-27B-MLX-8bit"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default froggeric/Qwen3.6-27B-MLX-8bit
Run Hermes
hermes
Qwen3.6-27B
MLX 8-bit · Text + Vision + Thinking + Tool Calling
Apple Silicon native
What's this?
Qwen3.6-27B is a 27B-parameter dense model from Alibaba. It uses a hybrid linear/full attention architecture (3:1 ratio across 64 layers) that combines efficient DeltaNet-style linear attention with full softmax attention at regular intervals. It supports 262K context, vision, video, and multi-token prediction.
This is an MLX 8-bit conversion of the official Qwen3.6-27B weights, ready to run on Apple Silicon with full text, image, and video support.
Architecture details
| Spec | Value |
|---|---|
| Total params | 27.8B (dense, all active) |
| Layers | 64 (3x linear attention + 1x full attention, 16 repetitions) |
| Attention | 24 Q heads, 4 KV heads (GQA), head_dim 256 |
| Linear attention | 16 QK heads, 48 V heads, head_dim 128 |
| FFN | intermediate_size 17408 |
| Context | 262K native, 1M+ with YaRN |
| RoPE | theta 10M, partial_rotary_factor 0.25, mrope_interleaved |
| Vocab | 248K tokens |
| Multimodal | Text, image, video |
| Multi-token prediction | Supported (1 draft layer) |
| model_type | qwen3_5 |
This conversion
- Source: Official Qwen3.6-27B safetensors (BF16, 15 shards)
- Quantization: 8-bit (8.6 bits/weight, 28 GB across 6 shards)
- Vision: Full support via
mlx-vlm. Text, image, and video inputs work out of the box - Thinking: Toggleable via
<|think_on|>/<|think_off|>tags (see below) - Tool calling: Works via the included fixed Jinja chat template
- Requirements:
mlx-lm >= 0.31.2,mlx-vlm >= 0.4.4
Quick start
Text
from mlx_lm import load, generate
model, tokenizer = load("froggeric/Qwen3.6-27B-MLX-8bit")
response = generate(model, tokenizer, prompt="Hello", max_tokens=256, temp=0.7)
print(response)
Vision
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model, processor = load("froggeric/Qwen3.6-27B-MLX-8bit")
image = ["path/to/image.jpg"]
prompt = "Describe this image."
formatted = apply_chat_template(processor, model.config, prompt, num_images=len(image))
result = generate(model, processor, formatted, image, max_tokens=256, temp=0.7)
print(result.text)
CLI
# Text
mlx_lm.generate --model froggeric/Qwen3.6-27B-MLX-8bit --prompt "Hello"
# Vision
mlx_vlm.generate --model froggeric/Qwen3.6-27B-MLX-8bit --image image.jpg --prompt "Describe this image"
System prompt
The first line of your system prompt must be:
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
The model underperforms without it. You can append anything after that line.
Thinking toggle
This model ships with a fixed Jinja chat template that lets you toggle thinking on the fly. Drop <|think_on|> or <|think_off|> anywhere in your system or user prompt. The template intercepts the tag, strips it from context so the model never sees it, and flips the thinking mode.
System: You are a coding assistant. <|think_off|>
User: What's 2+2?
Fast answer, no internal reasoning.
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
The model thinks step by step, then answers.
Chat template
The bundled Jinja template fixes several issues in the official Qwen 3.6 template:
- Tool calls crash on C++ engines. The official template uses Python's
|itemsfilter and|safe, which do not exist in C++ Jinja runtimes (LM Studio, MLX). This template uses direct dictionary key lookups instead. - The
developerrole crashes. Modern APIs sendmessage.role == "developer". The official template throws an exception. This template maps it tosystem. - Empty
preserve_thinkingspam. The official template wraps every past turn in empty<think/>blocks, wasting context tokens. This template only emits thinking blocks when they contain actual reasoning content. </thinking>hallucination handling. The model sometimes generates</thinking>instead of the expected closing tag. This template handles both gracefully.- Thinking toggle.
<|think_on|>/<|think_off|>from any message role.
See chat_template.README.md for the full breakdown.
Sampling
From the official Qwen authors. Reserve 128K+ context for thinking mode.
| Mode | temp | top_p | top_k | min_p | repeat_penalty | presence_penalty |
|---|---|---|---|---|---|---|
| Thinking (coding) (default) | 0.6 | 0.95 | 20 | 0 | 1.0 | off |
| Thinking (general) | 1.0 | 0.95 | 20 | 0 | 1.0 | 1.5 |
| Non-thinking (general) | 0.7 | 0.8 | 20 | 0 | 1.0 | 1.5 |
GGUF runtimes use presence_penalty (0 = off). MLX / LM Studio use repeat_penalty (1.0 = off).
Links
Authorship
| Role | Author |
|---|---|
| Original model | Alibaba Cloud (Qwen team) |
| MLX 8-bit conversion (text + vision) | froggeric |
License
Apache-2.0, inherited from Qwen3.6.
- Downloads last month
- 1,718
8-bit
Model tree for froggeric/Qwen3.6-27B-MLX-8bit
Base model
Qwen/Qwen3.6-27B