Dedicated to building a more intuitive, comprehensive, and efficient LLM compression toolkit.
📣 GGUF | ✒️ TechnicalReport | 📖 Documentation | 🤗 Hugging Face | 🤖 ModelScope | 💬 WeChat
📣Latest News
- [26/02/09] We have released HY-1.8B-2Bit, a 2-bit on-device large language model.
- [26/01/13] We have released v0.3, which supports the training and deployment of Eagle3 for all-scale LLMs/VLMs/Audio models, as detailed in the guidance documentation. We have also released Sherry, a hardware-efficient 1.25-bit quantization algorithm [Paper coming soon] | [Code]🔥🔥🔥
For more detailed information, please refer to [AngelSlim]
🌟HY-1.8B-2Bit Key Features
Superior Model Capability: HY-1.8B-2Bit is produced by Quantization-Aware Training (QAT) on the Hunyuan-1.8B-Instruct backbone. Although its weights are compressed to 2-bit precision, it remains highly competitive with the PTQ-INT4 model. Across an evaluation suite covering mathematics, humanities, and programming, HY-1.8B-2Bit loses only about 4% on average relative to its full-precision counterpart, so most of the model's capability survives the reduction in bit-width.
Unmatched Scale-to-Performance Efficiency: Compared with dense models of equivalent size (e.g., 0.5B parameters), HY-1.8B-2Bit outperforms them by an average of 16% across core competencies. As a state-of-the-art (SOTA) model for its size class, it offers an extensible and efficient option for edge computing, delivering strong reasoning capabilities within a compact footprint.
Comprehensive Reasoning Proficiency: HY-1.8B-2Bit inherits the complete "full-thinking" capabilities of Hunyuan-1.8B-Instruct, which makes it the most compact model in the industry to support sophisticated reasoning pathways. With its Dual Chain-of-Thought (Dual-CoT) strategy, users can trade latency against depth: concise short-CoT for intuitive queries and detailed long-CoT for computationally intensive tasks. This flexibility lets HY-1.8B-2Bit be deployed in real-time, resource-constrained environments that need both fast responses and careful reasoning.
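To make the "compact footprint" claim concrete, here is a rough back-of-the-envelope sketch of weight storage at different bit-widths. It assumes exactly 1.8e9 and 0.5e9 parameters (rounded from the model names) and ignores quantization scales, embeddings, and any layers kept at higher precision, so the real file sizes will differ. The helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed parameter counts, rounded from the model names.
PARAMS_1_8B = 1.8e9
PARAMS_0_5B = 0.5e9

for label, params, bits in [
    ("1.8B @ 16-bit", PARAMS_1_8B, 16),
    ("1.8B @ 4-bit", PARAMS_1_8B, 4),
    ("1.8B @ 2-bit", PARAMS_1_8B, 2),
    ("0.5B @ 16-bit", PARAMS_0_5B, 16),
    ("0.5B @ 8-bit", PARAMS_0_5B, 8),
]:
    print(f"{label}: {weight_footprint_gb(params, bits):.2f} GB")
```

Under these assumptions the 2-bit 1.8B weights take about 0.45 GB, which is in the same range as a 0.5B dense model stored at 8 to 16 bits per weight.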
📈 Benchmark
Benchmark results for HY-1.8B-2Bit equivalent weights on vLLM across cmmlu, ceval, arc, bbh, gsm8k, humaneval, livecodebench, and gpqa_diamond.
The results show that HY-1.8B-2Bit maintains strong performance despite the extreme reduction in bit-width, with an average degradation of only 3.97 percentage points compared to its full-precision 1.8B teacher. HY-1.8B-2Bit performs nearly on par with the INT4 variant, trailing it by only 0.13 points on average while using half the weight precision. Compared with the dense HY-0.5B model, which occupies a comparable model size, the advantage of the 2-bit QAT approach is evident: the 0.5B dense model drops 21.87 points in average accuracy relative to the 1.8B model, whereas HY-1.8B-2Bit remains robust and outperforms the smaller dense model by 22.29 points on GSM8K and 20.62 points on LiveCodeBench.
| Model | cmmlu | ceval | arc | bbh | gsm8k | humaneval (pass@3) | livecodebench | gpqa_diamond (pass@3) |
|---|---|---|---|---|---|---|---|---|
| HY-1.8B | 55.07% | 54.27% | 70.50% | 79.08% | 84.08% | 94.51% | 31.50% | 68.18% |
| HY-0.5B | 37.08% | 35.98% | 49.89% | 58.10% | 55.04% | 67.07% | 12.11% | 46.97% |
| HY-1.8B-int4gptq | 50.80% | 48.67% | 68.83% | 74.80% | 78.70% | 89.02% | 30.08% | 65.56% |
| HY-1.8B-2Bit | 49.32% | 47.60% | 64.45% | 75.54% | 77.33% | 93.29% | 32.73% | 65.15% |
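The average gaps quoted above can be reproduced directly from the table. The short sketch below copies the table values and computes unweighted means over the eight benchmarks; the unweighted mean is an assumption about how the averages were taken, although it does reproduce the quoted figures.

```python
from statistics import mean

BENCHMARKS = ["cmmlu", "ceval", "arc", "bbh", "gsm8k",
              "humaneval", "livecodebench", "gpqa_diamond"]

# Scores in percent, copied from the table above.
SCORES = {
    "HY-1.8B":          [55.07, 54.27, 70.50, 79.08, 84.08, 94.51, 31.50, 68.18],
    "HY-0.5B":          [37.08, 35.98, 49.89, 58.10, 55.04, 67.07, 12.11, 46.97],
    "HY-1.8B-int4gptq": [50.80, 48.67, 68.83, 74.80, 78.70, 89.02, 30.08, 65.56],
    "HY-1.8B-2Bit":     [49.32, 47.60, 64.45, 75.54, 77.33, 93.29, 32.73, 65.15],
}

AVERAGES = {name: mean(values) for name, values in SCORES.items()}


def average_gap(model_a: str, model_b: str) -> float:
    """Difference of unweighted average scores, in percentage points."""
    return AVERAGES[model_a] - AVERAGES[model_b]


def benchmark_gap(model_a: str, model_b: str, benchmark: str) -> float:
    """Difference on a single benchmark, in percentage points."""
    i = BENCHMARKS.index(benchmark)
    return SCORES[model_a][i] - SCORES[model_b][i]


print(f"{average_gap('HY-1.8B', 'HY-1.8B-2Bit'):.2f}")           # 3.97
print(f"{average_gap('HY-1.8B-int4gptq', 'HY-1.8B-2Bit'):.2f}")  # 0.13
print(f"{average_gap('HY-1.8B', 'HY-0.5B'):.2f}")                # 21.87
print(f"{benchmark_gap('HY-1.8B-2Bit', 'HY-0.5B', 'gsm8k'):.2f}")          # 22.29
print(f"{benchmark_gap('HY-1.8B-2Bit', 'HY-0.5B', 'livecodebench'):.2f}")  # 20.62
```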
💻Deployment
This setup ONLY works on SME2-capable devices (for example, Apple M4, vivo X300, and other Arm CPUs with SME2 support). A Neon kernel will follow.
Running Hunyuan model on MacBook M4
We have provided the converted GGUF file, [LINK]
Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
Enter the llama.cpp folder
cd llama.cpp
Fetch and check out the PR branch
git fetch origin pull/19357/head:pr-19357-sme2-int2
git checkout pr-19357-sme2-int2
Build llama.cpp with KleidiAI enabled
mkdir build && cd build
cmake -DGGML_CPU_KLEIDIAI=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF ..
make -j8
Quantize the Hunyuan fp16 model to int2 per-channel (q2_0c)
./bin/llama-quantize hunyuan-fp16-qdq.gguf hunyuan-q2_0.gguf q2_0c
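For intuition about what "int2 per-channel" means, the following is a generic sketch of symmetric per-channel 2-bit quantization in plain Python. It is not the q2_0c format: the actual layout, level set, and rounding used by the PR are not described here, so the signed levels {-2, -1, 0, 1}, the max-abs scale, and the function names are all illustrative assumptions.

```python
from typing import List, Tuple

# Assumed signed 2-bit code range; the real q2_0c levels may differ.
Q_MIN, Q_MAX = -2, 1


def quantize_row_2bit(row: List[float]) -> Tuple[List[int], float]:
    """Quantize one output channel (row) with a single shared scale."""
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude onto |Q_MIN|; fall back to 1.0 for an all-zero row.
    scale = max_abs / abs(Q_MIN) if max_abs > 0 else 1.0
    codes = [min(Q_MAX, max(Q_MIN, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes: List[int], scale: float) -> List[float]:
    """Recover approximate weights from 2-bit codes and the channel scale."""
    return [c * scale for c in codes]


row = [0.4, -0.8, 0.1, 0.0]
codes, scale = quantize_row_2bit(row)
print(codes, scale)                  # [1, -2, 0, 0] 0.4
print(dequantize_row(codes, scale))  # [0.4, -0.8, 0.0, 0.0]
```

Each row keeps one floating-point scale and four possible codes per weight, which is why small weights such as 0.1 collapse to zero in this toy example.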
Run the llama.cpp CLI example
export GGML_KLEIDIAI_SME=1
# thinking
./bin/llama-cli -m hunyuan-q2_0.gguf -p "写一副春联" -t 1 --seed 4568 -n 32
# no thinking
./bin/llama-cli -m hunyuan-q2_0.gguf -p "/no_think写一副春联" -t 1 --seed 4568 -n 32
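The two commands above differ only in the `/no_think` prefix on the prompt. A small helper like the one below can build either variant programmatically. It is a sketch: the function name `build_cli_args` and its defaults are hypothetical, and the flags simply mirror the commands shown above.

```python
import shlex
from typing import List

NO_THINK_PREFIX = "/no_think"  # prefix taken from the "no thinking" command above


def build_cli_args(model_path: str, prompt: str, thinking: bool = True,
                   threads: int = 1, seed: int = 4568, n_predict: int = 32,
                   binary: str = "./bin/llama-cli") -> List[str]:
    """Return the argv list for llama-cli, with or without the thinking phase."""
    full_prompt = prompt if thinking else NO_THINK_PREFIX + prompt
    return [binary, "-m", model_path, "-p", full_prompt,
            "-t", str(threads), "--seed", str(seed), "-n", str(n_predict)]


for thinking in (True, False):
    args = build_cli_args("hunyuan-q2_0.gguf", "写一副春联", thinking=thinking)
    print(shlex.join(args))
```

The returned list can be passed to `subprocess.run` with `GGML_KLEIDIAI_SME=1` set in the environment, matching the export above.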
Run the llama.cpp benchmark
The general command is:
./bin/llama-bench -m hunyuan-q2_0.gguf -p <prompt-length> -t <number-of-threads> -n <gen-length>
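To sweep several settings with the general command above, the placeholders can be filled in programmatically. The sketch below only builds the command lines; the specific prompt lengths, thread counts, and generation lengths are illustrative values, not recommendations, and `bench_commands` is a hypothetical helper name.

```python
import itertools
import shlex
from typing import Iterable, List


def bench_commands(model_path: str,
                   prompt_lengths: Iterable[int],
                   thread_counts: Iterable[int],
                   gen_lengths: Iterable[int],
                   binary: str = "./bin/llama-bench") -> List[List[str]]:
    """Build one llama-bench argv list per (prompt, threads, gen) combination."""
    commands = []
    for p, t, n in itertools.product(prompt_lengths, thread_counts, gen_lengths):
        commands.append([binary, "-m", model_path,
                         "-p", str(p), "-t", str(t), "-n", str(n)])
    return commands


# Illustrative sweep: two prompt lengths, two thread counts, one generation length.
for cmd in bench_commands("hunyuan-q2_0.gguf", [128, 512], [1, 2], [32]):
    print(shlex.join(cmd))
```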
📝 License
The code for this project is open-sourced under the License for AngelSlim.
🔗 Citation
@article{angelslim2026,
title={AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression},
author={Hunyuan AI Infra Team},
journal={arXiv preprint arXiv:2602.21233},
year={2026}
}
💬 Technical Discussion
- AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub Issues or join our WeChat discussion group.
Model tree for AngelSlim/HY-1.8B-2Bit-GGUF
Base model
AngelSlim/HY-1.8B-2Bit
