Instructions to use microsoft/Phi-3-mini-4k-instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use microsoft/Phi-3-mini-4k-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# Phi-3-mini is a text-only causal language model, so AutoModelForCausalLM is the matching loader
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use microsoft/Phi-3-mini-4k-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "microsoft/Phi-3-mini-4k-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-mini-4k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/microsoft/Phi-3-mini-4k-instruct

SGLang

How to use microsoft/Phi-3-mini-4k-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "microsoft/Phi-3-mini-4k-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-mini-4k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "microsoft/Phi-3-mini-4k-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "microsoft/Phi-3-mini-4k-instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
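
The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and it assumes a server started as shown above is listening on the given base URL (port 8000 for vLLM, port 30000 for SGLang) and returns the usual OpenAI-style `choices` list.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, user_message, timeout=60):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style response shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(ask("http://localhost:8000", "microsoft/Phi-3-mini-4k-instruct",
#           "What is the capital of France?"))
```

Building the request separately from sending it keeps the payload easy to inspect before any network traffic happens.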

Docker Model Runner
How to use microsoft/Phi-3-mini-4k-instruct with Docker Model Runner:
```
docker model run hf.co/microsoft/Phi-3-mini-4k-instruct
```

Help with merging LoRA layers back onto Phi3

#55

by SHIMURA0321 - opened May 9, 2024

Discussion

SHIMURA0321

May 9, 2024

I have used QLoRA to fine-tune Phi3 on some domain-specific knowledge, and I am wondering how to merge the LoRA layers back into Phi3-4k-instruct. I have tried the following two approaches:

1. I want to run inference on CPU on a MacBook, so I used llama.cpp to convert the LoRA adapter to a GGML file so that I could merge it with Phi3 using Ollama, but I ran into the following error:
INFO:lora-to-gguf:model.layers.0.mlp.down_proj => blk.0.ffn_down.weight.loraA (8192, 32) float32 1.00MB
INFO:lora-to-gguf:model.layers.0.mlp.down_proj => blk.0.ffn_down.weight.loraB (3072, 32) float32 0.38MB
INFO:lora-to-gguf:model.layers.0.mlp.gate_up_proj => blk.0.ffn_up.weight.loraA (3072, 32) float32 0.38MB
INFO:lora-to-gguf:model.layers.0.mlp.gate_up_proj => blk.0.ffn_up.weight.loraB (16384, 32) float32 2.00MB
ERROR:lora-to-gguf:Error: could not map tensor name base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight
ERROR:lora-to-gguf: Note: the arch parameter must be specified if the model is not llama

(By the way, I applied LoRA to the following layers of the Phi3 model: "qkv_proj", "gate_up_proj", "down_proj".)
I would be grateful if someone could help me with this issue!
2. I used the merge_and_unload() method together with save_pretrained() from Hugging Face. I got back a .safetensors file and a .json file, but I do not know how to use this "new" fine-tuned model on CPU.

Thanks in advance!
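
For the second approach, a minimal sketch of merging with PEFT and then running the result on CPU might look like the following. It is not taken from the thread: the function names `merge_lora_into_base` and `generate_on_cpu` and the directory arguments are hypothetical, and it assumes `torch`, `transformers`, and `peft` are installed and that the adapter was saved with PEFT's `save_pretrained()`. The base model is loaded in half precision rather than 4-bit here, on the assumption that merging into an unquantized copy is the safer path after QLoRA training.

```python
def merge_lora_into_base(base_id, adapter_dir, merged_dir, trust_remote_code=False):
    """Fold LoRA weights into an unquantized copy of the base model and save it."""
    # Heavy imports stay local so this module can be imported without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model unquantized (fp16), not in 4-bit.
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.float16, trust_remote_code=trust_remote_code
    )
    # Attach the trained adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_dir)
    merged = model.merge_and_unload()

    # Writes the .safetensors weights plus the .json config files.
    merged.save_pretrained(merged_dir, safe_serialization=True)
    # Save the tokenizer alongside so the directory is self-contained.
    AutoTokenizer.from_pretrained(
        base_id, trust_remote_code=trust_remote_code
    ).save_pretrained(merged_dir)
    return merged_dir


def generate_on_cpu(merged_dir, prompt, max_new_tokens=40, trust_remote_code=False):
    """Load the merged checkpoint on CPU and generate a reply to one user message."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(merged_dir, trust_remote_code=trust_remote_code)
    # With no device argument the model is loaded on CPU; float32 is the safe CPU dtype.
    model = AutoModelForCausalLM.from_pretrained(
        merged_dir, torch_dtype=torch.float32, trust_remote_code=trust_remote_code
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

The saved directory is an ordinary Hugging Face checkpoint, so the `.safetensors` and `.json` files are loaded together by pointing `from_pretrained()` at the directory. To use the merged model with llama.cpp or Ollama, one would convert this merged directory to GGUF with llama.cpp's Hugging Face conversion script instead of converting the adapter on its own; the exact script name and options vary between llama.cpp versions and are not shown here.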

nguyenbh

Microsoft org Jul 17, 2024

@SHIMURA0321 Thank you for your interest in the Phi-3 models.
We have examples of using LoRA and QLoRA for fine-tuning in the Phi-3 Cookbook. These examples may be worth a look.

eshanc

Aug 17, 2025

@SHIMURA0321 you can also use Impulse AI (https://www.impulselabs.ai/) to fine-tune if you don't want to run it locally
