Qwen3-24B-A3B-Instruct-2507-REAP
✨ Highlights
Introducing Qwen3-24B-A3B-Instruct-2507-REAP, a memory-efficient compressed variant of Qwen3-30B-A3B-Instruct-2507 that maintains near-identical performance while being roughly 20% smaller in total parameters (30.5B to ~24B).
This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. Key features include:
- Near-Lossless Performance: Maintains strong performance on instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage compared to the full 30B model
- 20% Memory Reduction: Compressed from 30.5B to ~24B parameters, significantly lowering deployment costs and memory requirements
- Preserved Capabilities: Retains all core functionalities including 256K long-context understanding, tool calling, and multilingual capabilities
- Drop-in Compatibility: Works with vanilla vLLM and SGLang - no source modifications or custom patches required
- Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research
Roughly 100,000 tokens of KV cache fit on a single 32 GB GPU when FP8 KV-cache quantization is used. Pruning was run on a gfx1201 GPU (R9700) with a patched version of the REAP package. The model maintains strong coding ability and agentic behavior, and tool calling appears unaffected compared with the 30B FP8 checkpoint it was derived from.
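As a rough sanity check on the ~100,000-token figure, the KV cache size can be estimated from the layer and KV-head counts listed in the model overview. This is a back-of-the-envelope sketch, not a measurement: `head_dim=128` is an assumption taken to match the base model's configuration (it is not stated in this card), and the estimate ignores model weights, activations, and framework overhead.

```python
def kv_cache_bytes(num_tokens, num_layers=48, num_kv_heads=4,
                   head_dim=128, bytes_per_value=1):
    """Estimate KV cache size in bytes.

    head_dim=128 is an assumed value (not given in this card).
    bytes_per_value=1 corresponds to FP8, 2 to FP16/BF16.
    """
    # Factor of 2: each layer stores one key and one value tensor per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


fp8_bytes = kv_cache_bytes(100_000, bytes_per_value=1)
fp16_bytes = kv_cache_bytes(100_000, bytes_per_value=2)

print(f"FP8 KV cache for 100k tokens:  {fp8_bytes / 2**30:.2f} GiB")
print(f"FP16 KV cache for 100k tokens: {fp16_bytes / 2**30:.2f} GiB")
```

Under these assumptions the FP8 cache for 100,000 tokens comes to a little under 5 GiB, half of what a 16-bit cache would need.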
Model Overview
Qwen3-24B-A3B-Instruct-2507-REAP has the following specifications:
- Base Model: Qwen3-30B-A3B-Instruct-2507
- Compression Method: REAP (Router-weighted Expert Activation Pruning)
- Compression Ratio: ~25% of experts pruned (128 to 96), giving roughly 20% fewer total parameters
- Type: Sparse Mixture-of-Experts (SMoE) Causal Language Model
- Training Stage: Pretraining & Post-training
- Number of Parameters: ~24B total, 3.3B activated per token
- Number of Parameters (Non-Embedding): ~23.5B
- Number of Layers: 48
- Number of Attention Heads (GQA): 32 for Q and 4 for KV
- Number of Experts: 96 (pruned from 128)
- Number of Activated Experts: 8 per token
- Context Length: 262,144 tokens natively
- License: Apache 2.0
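The expert pruning rate and the total parameter reduction are different quantities, which is why both "25%" and "20%" appear in this card. The short calculation below uses only the figures listed here; the parameter counts are the approximate values from the card, so the result is approximate too.

```python
# Figures taken from this model card (parameter counts are approximate).
experts_before, experts_after = 128, 96
params_before_b, params_after_b = 30.5, 24.0  # billions of parameters

expert_pruning_rate = 1 - experts_after / experts_before
total_param_reduction = 1 - params_after_b / params_before_b

print(f"Experts pruned:            {expert_pruning_rate:.0%}")
print(f"Total parameter reduction: {total_param_reduction:.1%}")
```

The total reduction is smaller than the expert pruning rate because attention, embedding, and router weights are not pruned.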
NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output (but can be led to do so with a system prompt).
Deployment
You can deploy the model directly using vLLM >= 0.8.5 or SGLang >= 0.4.6.post1.
vLLM Deployment
vllm serve tcclaviger/Qwen3-24B-A3B-Instruct-2507-REAP \
    --max-model-len 262144 \
    --tensor-parallel-size 2
SGLang Deployment
python -m sglang.launch_server \
    --model-path tcclaviger/Qwen3-24B-A3B-Instruct-2507-REAP \
    --context-length 262144 \
    --tp 2
Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length to a shorter value, such as 32768, or adjusting --max-num-seqs to a lower value.
💻 Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tcclaviger/Qwen3-24B-A3B-Instruct-2507-REAP"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
)

# Keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
🧩 Model Creation
This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method across all Mixture-of-Experts (MoE) blocks of Qwen3-30B-A3B-Instruct-2507, with a ~25% pruning rate.
How REAP Works
REAP selects experts to prune based on a novel saliency criterion that considers both:
- Router gate values: How frequently and strongly the router activates each expert
- Expert activation norms: The magnitude of each expert's output contributions
This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.
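The criterion described above can be sketched in a few lines. This is a simplified illustration, not the REAP package's implementation: it assumes the saliency of an expert is the average, over the tokens routed to it, of the gate value multiplied by the norm of that expert's output, and all function names and toy values here are hypothetical.

```python
def reap_saliency(gate_values, output_norms, top_k):
    """Average of gate * output-norm over the tokens routed to each expert.

    gate_values[t][j]:  router gate value of expert j for token t
    output_norms[t][j]: norm of expert j's output for token t
    (Hypothetical helper; a simplified reading of the criterion.)
    """
    num_experts = len(gate_values[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, norms in zip(gate_values, output_norms):
        # Only the top-k experts by gate value are active for this token.
        active = sorted(range(num_experts), key=lambda j: gates[j], reverse=True)[:top_k]
        for j in active:
            totals[j] += gates[j] * norms[j]
            counts[j] += 1
    # Experts that were never routed to get a saliency of 0.
    return [totals[j] / counts[j] if counts[j] else 0.0 for j in range(num_experts)]


def experts_to_prune(saliency, num_to_prune):
    """Indices of the lowest-saliency experts."""
    order = sorted(range(len(saliency)), key=lambda j: saliency[j])
    return sorted(order[:num_to_prune])


# Toy calibration data: 3 tokens, 4 experts, 2 active experts per token.
gate_values = [
    [0.60, 0.30, 0.08, 0.02],
    [0.50, 0.10, 0.35, 0.05],
    [0.10, 0.45, 0.40, 0.05],
]
output_norms = [
    [2.0, 1.0, 3.0, 1.0],
    [2.0, 1.0, 0.5, 1.0],
    [2.0, 1.0, 0.5, 1.0],
]

saliency = reap_saliency(gate_values, output_norms, top_k=2)
print("saliency:", saliency)
print("prune:", experts_to_prune(saliency, num_to_prune=2))
```

In this toy layer, expert 3 is never selected and expert 2 is selected only with small output norms, so those two would be pruned first.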
Key Advantages
- One-Shot Compression: No fine-tuning required after pruning - the model is immediately ready for deployment
- Preserved Router Control: Unlike expert merging methods, REAP maintains the router's independent, input-dependent control over remaining experts, avoiding "functional subspace collapse"
- Generative Task Superiority: REAP significantly outperforms expert merging approaches on generative benchmarks while maintaining competitive performance on discriminative tasks
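To illustrate the router-control point, here is a minimal sketch of what pruning does to one MoE layer. It assumes the router is a linear layer with one weight row per expert (an assumption about the architecture, not something stated in this card), and `prune_layer` is a hypothetical helper rather than part of the REAP codebase.

```python
def prune_layer(router_rows, experts, pruned_ids):
    """Drop pruned experts together with their router rows.

    router_rows[j] is the (assumed) router weight row that produces the
    gate logit for experts[j]. Surviving rows are left untouched.
    """
    pruned = set(pruned_ids)
    keep = [j for j in range(len(experts)) if j not in pruned]
    return [router_rows[j] for j in keep], [experts[j] for j in keep]


# Toy layer with 4 experts and a 2-dimensional hidden state.
router_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
experts = ["expert_0", "expert_1", "expert_2", "expert_3"]

new_rows, new_experts = prune_layer(router_rows, experts, pruned_ids=[2])
print(new_experts)
print(new_rows)
```

Each surviving expert keeps its own router row, so its gate value still depends only on the input. Merging two experts into one would instead force them to share a single gate.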
🛠️ Best Practices
To achieve optimal performance, we recommend the following settings:
Sampling Parameters:
- Temperature=0.7, TopP=0.8, TopK=20, and MinP=0
- For supported frameworks, adjust presence_penalty between 0 and 2 to reduce repetitions
Adequate Output Length: Use an output length of 16,384 tokens for most queries
Standardize Output Format:
- Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}."
- Multiple-Choice Questions: Add "Please show your choice in the answer field with only the choice letter, e.g., \"answer\": \"C\"."
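The settings above could be bundled into a request body like the following. This is a sketch that assumes an OpenAI-compatible chat completions endpoint such as the one vLLM and SGLang expose; `top_k` and `min_p` are not part of the standard OpenAI schema, so whether the server accepts them as top-level fields is an assumption to verify against your framework's documentation. `build_chat_request` is a hypothetical helper.

```python
import json

MODEL_NAME = "tcclaviger/Qwen3-24B-A3B-Instruct-2507-REAP"
MATH_SUFFIX = r"Please reason step by step, and put your final answer within \boxed{}."


def build_chat_request(prompt, presence_penalty=0.0, max_tokens=16384):
    """Build a chat completions payload with the recommended sampling settings."""
    if not 0.0 <= presence_penalty <= 2.0:
        raise ValueError("presence_penalty should be between 0 and 2")
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,    # non-standard field; assumed to be accepted by the server
        "min_p": 0.0,   # non-standard field; assumed to be accepted by the server
        "presence_penalty": presence_penalty,
        "max_tokens": max_tokens,
    }


# Math problem using the standardized output instruction.
payload = build_chat_request("What is 17 * 23? " + MATH_SUFFIX, presence_penalty=0.5)
print(json.dumps(payload, indent=2))
```

The payload can then be sent as JSON to the server's chat completions route with any HTTP client.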
Resources
For more details on REAP methodology, refer to:
- REAP arXiv Preprint: arXiv:2510.13999
- REAP Codebase: GitHub
For Qwen3 model details:
- Qwen3 Blog: https://qwenlm.github.io/blog/qwen3/
- Qwen3 GitHub: https://github.com/QwenLM/Qwen3
- Qwen Documentation: https://qwen.readthedocs.io/en/latest/
⚖️ License
This model is derived from Qwen/Qwen3-30B-A3B-Instruct-2507 and distributed under the Apache 2.0 license.
Credits
- Original Model: Qwen Team (Qwen3-30B-A3B-Instruct-2507)
- Compression Method: Cerebras Research (REAP)
- REAP Implementation: Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa
Citation
If you use this checkpoint, please cite both the REAP paper and the Qwen3 technical report:
@article{lasby-reap,
title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
journal={arXiv preprint arXiv:2510.13999},
year={2025}
}
@misc{qwen3technicalreport,
title={Qwen3 Technical Report},
author={Qwen Team},
year={2025},
eprint={2505.09388},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.09388},
}