Update README.md
```Shell
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

## Code Example
```Py
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq", trust_remote_code=True)
```
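The example continues beyond this excerpt; a minimal end-to-end sketch of the remaining steps, where the prompt and sampling values are illustrative assumptions (only the final `print` line follows the original):

```Py
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq", trust_remote_code=True)

# Illustrative sampling settings (assumed, not from the original example).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate a completion for one prompt and print the generated text.
output = llm.generate(["What is the capital of France?"], sampling_params)
print(output[0].outputs[0].text)
```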
## Serving
Then we can serve with the following command:
```Shell
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
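Once the server is up, vLLM exposes an OpenAI-compatible API, by default on port 8000. A quick smoke test against it, assuming the default address:

```Shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "pytorch/Phi-4-mini-instruct-int4wo-hqq",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
      }'
```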
# Inference with Transformers

Install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
pip install accelerate
```

Example:
```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
```
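The body of the example falls outside this excerpt; a minimal sketch of loading the quantized checkpoint and generating with `pipeline`, where the loading options, prompt, and generation settings are illustrative assumptions (only the final `print` line follows the original):

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"

# Load the quantized checkpoint; dtype and device placement here are assumptions.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe("What is the capital of France?", max_new_tokens=64)
print(output[0]['generated_text'])
```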
# Quantization Recipe

Install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
```
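The recipe itself falls outside this excerpt; judging from the model name, it applies torchao int4 weight-only quantization with HQQ. A sketch of what such a recipe can look like with torchao's `quantize_` API; the exact import path, group size, and loading options are assumptions, not the original recipe:

```Py
import torch
from transformers import AutoModelForCausalLM

# Assumed torchao API; names and locations vary across torchao versions.
from torchao.quantization import quantize_, int4_weight_only

model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Int4 weight-only quantization; use_hqq=True fits the quantization
# parameters with HQQ (group_size=128 is an assumed value).
quantize_(model, int4_weight_only(group_size=128, use_hqq=True))
```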
Need to install lm-eval from source:
https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
```Shell
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## int4 weight only quantization with hqq (int4wo-hqq)
```Shell
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
```
| Peak Memory (GB) | 8.91 | 2.98 (67% reduction) |

## Code Example

We can use the following code to get a sense of peak memory usage during inference:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
```
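The measurement body falls outside this excerpt; a minimal sketch of the usual pattern with `torch.cuda` peak-memory counters, where the prompt and generation settings are illustrative assumptions:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Reset the peak-memory counter so we only measure inference.
torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=64)

# Peak GPU memory allocated during generate(), in GB.
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```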
Our int4wo is only optimized for batch size 1, so expect some slowdown with larger batch sizes.
Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes and longer token lengths.

## Setup

We need to install vllm nightly to get some recent changes:
```Shell
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

Get the vllm source code:
```Shell
git clone [email protected]:vllm-project/vllm.git
```

Run the benchmarks under the `vllm` root folder:

## benchmark_latency

### baseline
```Shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```

### int4wo-hqq
```Shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-int4wo-hqq --batch-size 1
```

## benchmark_serving

We benchmarked the throughput in a serving environment.

Download the sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`

Other datasets can be found at https://github.com/vllm-project/vllm/tree/main/benchmarks

### baseline
Server:
```Shell
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:
```Shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```

### int4wo-hqq
Server:
```Shell
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
```

Client:
```Shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
```