Instructions to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference
HuggingChat
Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B

SGLang

How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepseek-ai/DeepSeek-R1-Distill-Qwen-32B with Docker Model Runner:
```
docker model run hf.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
```

Please convert these models to GGUF format...

#12

by Moodym - opened Jan 22, 2025

Discussion

Moodym

Jan 22, 2025

Please convert to GGUF format...

ZiggyS

Jan 25, 2025

Others already have. Just do a search.

fblgit

Jan 28, 2025

@bartowski has already quantz for this.
it may be helpful for the model card to include those refs.

ghostplant

Jan 30, 2025

Is quantz version still that smart against o1-mini?

fblgit

Jan 30, 2025

there is degradation obviously by the change in precision. But there are no performance metrics to showcase exactly how much is lost on each quant..
IMHO a non-quant 32B can perform as good as a quant R1 full.. in any case the model in 32B is extraordinary good.

ZiggyS

Jan 30, 2025

there is degradation obviously by the change in precision. But there are no performance metrics to showcase exactly how much is lost on each quant..
IMHO a non-quant 32B can perform as good as a quant R1 full.. in any case the model in 32B is extraordinary good.

To me its a lot like a wav->mp3 decision when you dont have a lot of storage on your music device. Sure, you will lose some quality but its smaller. So, its a trade off of necessity, and just have to decide the bit-rate you are willing to go down to.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment