Qwen3-24B-A3B-Instruct-2507-REAP

✨ Highlights

Introducing Qwen3-24B-A3B-Instruct-2507-REAP, a memory-efficient compressed variant of Qwen3-30B-A3B-Instruct-2507 that maintains near-identical performance with 25% of its experts pruned and roughly 20% fewer parameters.

This model was created using REAP (Router-weighted Expert Activation Pruning), a novel expert pruning method that selectively removes redundant experts while preserving the router's independent control over remaining experts. Key features include:

Near-Lossless Performance: Maintains strong performance on instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage compared to the full 30B model
20% Memory Reduction: Compressed from 30.5B to ~24B parameters, significantly lowering deployment costs and memory requirements
Preserved Capabilities: Retains all core functionalities including 256K long-context understanding, tool calling, and multilingual capabilities
Drop-in Compatibility: Works with vanilla vLLM and SGLang - no source modifications or custom patches required
Optimized for Real-World Use: Particularly effective for resource-constrained environments, local deployments, and academic research

A KV cache of roughly 100,000 tokens fits on a single 32 GB GPU when using FP8 KV-cache quantization. The model was pruned on a gfx1201 GPU (R9700) with a patched version of the REAP package. It maintains strong coding ability and agentic behavior, and tool calling appears unaffected compared with the 30B FP8 model this was based on.
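
As a rough sanity check on the KV cache figure, the cache size per token can be estimated from the layer and KV-head counts listed in the Model Overview. The sketch below assumes a head dimension of 128, which this card does not state, and it counts only the cache itself, not the weights, activations, or framework overhead.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache needed for one token across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


NUM_LAYERS = 48     # from the Model Overview
NUM_KV_HEADS = 4    # from the Model Overview (GQA)
HEAD_DIM = 128      # assumption: not stated in this card
TOKENS = 100_000    # cache size quoted above

fp8_per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)
bf16_per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)

fp8_gib = fp8_per_token * TOKENS / 2**30
bf16_gib = bf16_per_token * TOKENS / 2**30

print(f"FP8 KV cache:  {fp8_per_token} bytes/token, {fp8_gib:.2f} GiB for {TOKENS} tokens")
print(f"BF16 KV cache: {bf16_per_token} bytes/token, {bf16_gib:.2f} GiB for {TOKENS} tokens")
```

Under these assumptions, FP8 halves the cache footprint relative to BF16, which is what leaves room for the weights on a 32 GB card.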

📋 Model Overview

Qwen3-24B-A3B-Instruct-2507-REAP has the following specifications:

Base Model: Qwen3-30B-A3B-Instruct-2507
Compression Method: REAP (Router-weighted Expert Activation Pruning)
Compression Ratio: 25% of experts pruned (128 to 96), ~20% fewer total parameters
Type: Sparse Mixture-of-Experts (SMoE) Causal Language Model
Training Stage: Pretraining & Post-training
Number of Parameters: ~24B total, 3.3B activated per token
Number of Parameters (Non-Embedding): ~23.5B
Number of Layers: 48
Number of Attention Heads (GQA): 32 for Q and 4 for KV
Number of Experts: 96 (pruned from 128)
Number of Activated Experts: 8 per token
Context Length: 262,144 tokens natively
License: Apache 2.0
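
The expert count and the parameter count in this list shrink by different percentages, because only the expert weights are pruned while attention, embeddings, and routers are kept. A small sketch of the arithmetic, using only the figures quoted in this card (the variable names are illustrative):

```python
# Figures taken from this card.
experts_before, experts_after = 128, 96
params_before_b, params_after_b = 30.5, 24.0  # billions; 24.0 is approximate

# Share of experts removed.
expert_pruning_ratio = 1 - experts_after / experts_before

# Share of total parameters removed; smaller, since non-expert weights remain.
param_reduction_ratio = 1 - params_after_b / params_before_b

print(f"Experts pruned:      {expert_pruning_ratio:.1%}")
print(f"Parameter reduction: {param_reduction_ratio:.1%}")
```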

NOTE: This model supports only non-thinking mode and does not generate <think></think> blocks in its output (but can be led to do so with a system prompt).
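
If a system prompt does lead the model to emit <think></think> blocks, downstream code that expects plain answers may want to drop them. The helper below is a minimal sketch using a regular expression; `strip_think_blocks` is a hypothetical name and not part of any library.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think_blocks(text: str) -> str:
    """Remove any <think>...</think> blocks and surrounding whitespace."""
    return _THINK_BLOCK.sub("", text).strip()


cleaned = strip_think_blocks("<think>\nplan the answer\n</think>\n\nHello!")
print(cleaned)
```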

🚀 Deployment

You can deploy the model directly using vLLM >= 0.8.5 or SGLang >= 0.4.6.post1.

vLLM Deployment

vllm serve tcclaviger/Qwen3-24B-A3B-Instruct-2507-REAP \
    --max-model-len 262144 \
    --tensor-parallel-size 2

SGLang Deployment

python -m sglang.launch_server \
    --model-path tcclaviger/Qwen3-24B-A3B-Instruct-2507-REAP \
    --context-length 262144 \
    --tp 2

Note: If you encounter out-of-memory (OOM) issues, consider reducing the context length (for example, to 32768) or lowering --max-num-seqs.

💻 Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tcclaviger/Qwen3-24B-A3B-Instruct-2507-REAP"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)

🧩 Model Creation

This checkpoint was created by applying the REAP (Router-weighted Expert Activation Pruning) method across all Mixture-of-Experts (MoE) blocks of Qwen3-30B-A3B-Instruct-2507, with a ~25% pruning rate.

How REAP Works

REAP selects experts to prune based on a novel saliency criterion that considers both:

Router gate values: How frequently and strongly the router activates each expert
Expert activation norms: The magnitude of each expert's output contributions

This dual consideration ensures that experts contributing minimally to the layer's output are pruned, while preserving those that play critical roles in the model's computations.
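
To make the criterion concrete, the sketch below scores each expert by the average of (gate weight × output norm) over the tokens routed to it, then keeps the highest-scoring experts. This is an illustrative reading of the description above on toy data, not the REAP package's code; the function names and the exact averaging are assumptions.

```python
def expert_saliency(gate_weights, output_norms):
    """Average gate-weighted output norm per expert.

    gate_weights[t][j]: router gate weight of expert j for token t
                        (0.0 when the expert was not selected for that token).
    output_norms[t][j]: L2 norm of expert j's output for token t.
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gate_row, norm_row in zip(gate_weights, output_norms):
        for j, (gate, norm) in enumerate(zip(gate_row, norm_row)):
            if gate > 0.0:  # only tokens actually routed to expert j count
                totals[j] += gate * norm
                counts[j] += 1
    return [totals[j] / counts[j] if counts[j] else 0.0 for j in range(num_experts)]


def experts_to_keep(saliency, num_keep):
    """Indices of the num_keep most salient experts, in ascending index order."""
    ranked = sorted(range(len(saliency)), key=lambda j: saliency[j], reverse=True)
    return sorted(ranked[:num_keep])


# Toy layer: 3 tokens, 4 experts, 2 experts active per token.
gates = [
    [0.6, 0.4, 0.0, 0.0],
    [0.7, 0.0, 0.3, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
norms = [
    [2.0, 1.0, 3.0, 0.5],
    [2.0, 1.0, 1.0, 0.5],
    [2.0, 1.0, 3.0, 0.2],
]

saliency = expert_saliency(gates, norms)
kept = experts_to_keep(saliency, num_keep=3)  # prune 1 of 4 experts (25%)
print("saliency:", saliency)
print("kept experts:", kept)
```

In this toy layer the last expert has both small gate weights and small output norms, so it is the one removed. An expert with a large output norm that is never routed to also scores zero, since neither signal alone is enough.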

Key Advantages

One-Shot Compression: No fine-tuning required after pruning - the model is immediately ready for deployment
Preserved Router Control: Unlike expert merging methods, REAP maintains the router's independent, input-dependent control over remaining experts, avoiding "functional subspace collapse"
Generative Task Superiority: REAP significantly outperforms expert merging approaches on generative benchmarks while maintaining competitive performance on discriminative tasks

🛠️ Best Practices

To achieve optimal performance, we recommend the following settings:

Sampling Parameters:
- Temperature=0.7, TopP=0.8, TopK=20, and MinP=0
- For supported frameworks, adjust presence_penalty between 0 and 2 to reduce repetitions
Adequate Output Length: Use an output length of 16,384 tokens for most queries
Standardize Output Format:
- Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}."
- Multiple-Choice Questions: Add "Please show your choice in the answer field with only the choice letter, e.g., \"answer\": \"C\"."

📚 Resources

For more details on REAP methodology, refer to:

🧾 REAP arXiv Preprint: arXiv:2510.13999
💻 REAP Codebase: GitHub

For Qwen3 model details:

📖 Qwen3 Blog: https://qwenlm.github.io/blog/qwen3/
💻 Qwen3 GitHub: https://github.com/QwenLM/Qwen3
📚 Qwen Documentation: https://qwen.readthedocs.io/en/latest/

⚖️ License

This model is derived from Qwen/Qwen3-30B-A3B-Instruct-2507 and distributed under the Apache 2.0 license.

🙏 Credits

Original Model: Qwen Team (Qwen3-30B-A3B-Instruct-2507)
Compression Method: Cerebras Research (REAP)
REAP Implementation: Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa

📝 Citation

If you use this checkpoint, please cite both the REAP paper and the Qwen3 technical report:

@article{lasby-reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}

Safetensors: 23B params; tensor types BF16 and F8_E4M3