Instructions to use AngelSlim/HY-1.8B-2Bit-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AngelSlim/HY-1.8B-2Bit-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="AngelSlim/HY-1.8B-2Bit-GGUF",
	filename="hunyuan-fp16-qdq.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use AngelSlim/HY-1.8B-2Bit-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0
# Run inference directly in the terminal:
llama-cli -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0
# Run inference directly in the terminal:
llama-cli -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0
# Run inference directly in the terminal:
./llama-cli -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Use Docker

docker model run hf.co/AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

LM Studio
Jan
Ollama
How to use AngelSlim/HY-1.8B-2Bit-GGUF with Ollama:
```
ollama run hf.co/AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0
```

Unsloth Studio new

How to use AngelSlim/HY-1.8B-2Bit-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AngelSlim/HY-1.8B-2Bit-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for AngelSlim/HY-1.8B-2Bit-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for AngelSlim/HY-1.8B-2Bit-GGUF to start chatting

Pi new

How to use AngelSlim/HY-1.8B-2Bit-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use AngelSlim/HY-1.8B-2Bit-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Run Hermes

hermes

Docker Model Runner
How to use AngelSlim/HY-1.8B-2Bit-GGUF with Docker Model Runner:
```
docker model run hf.co/AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0
```

Lemonade

How to use AngelSlim/HY-1.8B-2Bit-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull AngelSlim/HY-1.8B-2Bit-GGUF:Q4_0

Run and chat with the model

lemonade run user.HY-1.8B-2Bit-GGUF-Q4_0

List all available models

lemonade list

AngelSlim

Dedicated to building a more intuitive, comprehensive, and efficient LLMs compression toolkit.

📣Latest News

[26/02/09] We have released HY-1.8B-2Bit, 2bit on-device large language model.
[26/01/13] We have released v0.3. We support the training and deployment of Eagle3 for all-scale LLMs/VLMs/Audio models, as detailed in the guidance documentation. And We released Sherry, the hardware-efficient 1.25 bit quantization algorithm [Paper Comming soon] | [Code]🔥🔥🔥

For more detailed information, please refer to[AngelSlim]

🌟HY-1.8B-2Bit Key Features

Superior Model Capability HY-1.8B-2Bit is developed via Quantization-Aware Training (QAT) based on the Hunyuan-1.8B-Instruct backbone. By aggressively compressing the model to a 2-bit weight precision, we achieve a performance profile that remains highly competitive with PTQ-INT4 benchmarks. Across a multi-dimensional evaluation suite—encompassing mathematics, humanities, and programming—HY-1.8B-2Bit exhibits a marginal performance degradation of only 4% compared to its full-precision counterpart, demonstrating exceptional information retention despite the radical reduction in bit-width.
Unmatched Scale-to-Performance Efficiency When compared to dense models of equivalent size (e.g., 0.5B parameters), HY-1.8B-2Bit demonstrates a substantial competitive advantage, outperforming benchmarks by an average of 16% across core competencies. As a state-of-the-art (SOTA) solution for its parameter class, HY-1.8B-2Bit provides an extensible and highly efficient alternative for edge computing, delivering high-tier reasoning capabilities within a compact footprint.
Comprehensive Reasoning Proficiency HY-1.8B-2Bit inherits the complete "full-thinking" capabilities of the Hunyuan-1.8B-Instruct model, marking it as the industry's most compact model to support sophisticated reasoning pathways. By integrating a Dual Chain-of-Thought (Dual-CoT) strategy, the model empowers users to navigate the trade-off between latency and depth: utilizing concise short-CoT for intuitive queries and detailed long-CoT for computationally intensive tasks. This flexibility ensures that HY-1.8B-2Bit can be seamlessly deployed in real-time, resource-constrained environments that demand both rapid response and high-fidelity logical synthesis.

📈 Benchmark

Benchmark results for HY-1.8B-2Bit equivalent weights on vLLM across cmmlu,ceval,arc,bbh,gsm8k,humaneval,livecodebench and gpqa_diamond.

The empirical results reveal that HY-1.8B-2Bit maintains high-tier performance despite the extreme reduction in bit-width, incurring a marginal average degradation of only 3.97% compared to its full-precision 1.8B teacher. Remarkably, HY-1.8B-2Bit performs nearly on par with the INT4 variant,with a negligible accuracy gap of only 0.13%, while utilizing only half the weight precision. When compared to the dense HY-0.5B model, which occupies a comparable model size, the superiority of the 2-bit QAT approach becomes evident. While the 0.5B dense model suffers a catastrophic 21.87% drop in average accuracy, HY-1.8B-2Bit remains robust, outperforming the smaller dense counterpart by 22.29% in GSM8K and 20.62% in LiveCodeBench.

Model	cmmlu	ceval	arc	bbh	gsm8k	humaneval (pass@3)	livecodebench	gpqa_diamond (pass@3)
HY-1.8B	55.07%	54.27%	70.50%	79.08%	84.08%	94.51%	31.50%	68.18%
HY-0.5B	37.08%	35.98%	49.89%	58.10%	55.04%	67.07%	12.11%	46.97%
HY-1.8B-int4gptq	50.80%	48.67%	68.83%	74.80%	78.70%	89.02%	30.08%	65.56%
HY-1.8B-2Bit	49.32%	47.60%	64.45%	75.54%	77.33%	93.29%	32.73%	65.15%

💻Deployment

This setup ONLY works on SME2-capable devices (for example, Apple M4, vivo x300 and Arm CPUs with SME2 support). Neon kernel will follow up.

Running Hunyuan model on MacBook M4

We have provided the converted GGUF file, [LINK]

Clone llama.cpp

git clone https://github.com/ggml-org/llama.cpp.git

Enter the llama.cpp folder

cd llama.cpp

Fetch and check out the PR branch

git fetch origin pull/19357/head:pr-19357-sme2-int2
git checkout pr-19357-sme2-int2

Build llama.cpp with KleidiAI enabled

mkdir build && cd build

cmake -DGGML_CPU_KLEIDIAI=ON -DGGML_METAL=OFF -DGGML_BLAS=OFF ..

make -j8

Quantize the Hunyuan fp16 model to int2 per-channel (q2_0c)

./bin/llama-quantize hunyuan-fp16-qdq.gguf hunyuan-q2_0.gguf q2_0c

Run the CLI llama.cpp example

export GGML_KLEIDIAI_SME=1

# thinking 
./bin/llama-cli -m hunyuan-q2_0.gguf -p "写一副春联" -t 1 --seed 4568 -n 32
# no thinking 
./bin/llama-cli -m hunyuan-q2_0.gguf -p "/no_think写一副春联" -t 1 --seed 4568 -n 32

Run the llama.cpp benchmark

The general command is:

./bin/llama-bench -m hunyuan-q2_0.gguf -p <prompt-length> -t <number-of-threads> -n <gen-length>

📝 License

The code for this project is open-sourced under the License for AngelSlim.

🔗 Citation

@article{angelslim2026,
  title={AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression},
  author={Hunyuan AI Infra Team},
  journal={arXiv preprint arXiv:2602.21233},
  year={2026}
}

💬 Technical Discussion

AngelSlim is continuously iterating and new features will be released soon. If you have any questions or suggestions, please open an issue on GitHub Issues or join our WeChat discussion group.