Update README.md
```Shell
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

## Code Example
```Py
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq", trust_remote_code=True)
```
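The example continues beyond this excerpt; a minimal end-to-end sketch of the remaining steps, where the prompt and sampling values are illustrative assumptions (only the final `print` line follows the original):

```Py
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq", trust_remote_code=True)

# Illustrative sampling settings (assumed, not from the original example).
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Generate a completion for one prompt and print the generated text.
output = llm.generate(["What is the capital of France?"], sampling_params)
print(output[0].outputs[0].text)
```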
## Serving
Then we can serve with the following command:
```Shell
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
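Once the server is up, vLLM exposes an OpenAI-compatible API, by default on port 8000. A quick smoke test against it, assuming the default address:

```Shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "pytorch/Phi-4-mini-instruct-int4wo-hqq",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
      }'
```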
# Inference with Transformers

Install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
pip install accelerate
```

Example:
```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
```
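The body of the example falls outside this excerpt; a minimal sketch of loading the quantized checkpoint and generating with `pipeline`, where the loading options, prompt, and generation settings are illustrative assumptions (only the final `print` line follows the original):

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"

# Load the quantized checkpoint; dtype and device placement here are assumptions.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe("What is the capital of France?", max_new_tokens=64)
print(output[0]['generated_text'])
```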
# Quantization Recipe

Install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
pip install torch
```
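The recipe itself falls outside this excerpt; judging from the model name, it applies torchao int4 weight-only quantization with HQQ. A sketch of what such a recipe can look like with torchao's `quantize_` API; the exact import path, group size, and loading options are assumptions, not the original recipe:

```Py
import torch
from transformers import AutoModelForCausalLM

# Assumed torchao API; names and locations vary across torchao versions.
from torchao.quantization import quantize_, int4_weight_only

model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Int4 weight-only quantization; use_hqq=True fits the quantization
# parameters with HQQ (group_size=128 is an assumed value).
quantize_(model, int4_weight_only(group_size=128, use_hqq=True))
```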
Need to install lm-eval from source:
https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
```Shell
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## int4 weight only quantization with hqq (int4wo-hqq)
```Shell
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
```
| Peak Memory (GB) | 8.91 | 2.98 (67% reduction) |

## Code Example

We can use the following code to get a sense of peak memory usage during inference:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
```
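The measurement body falls outside this excerpt; a minimal sketch of the usual pattern with `torch.cuda` peak-memory counters, where the prompt and generation settings are illustrative assumptions:

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-int4wo-hqq"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Reset the peak-memory counter so we only measure inference.
torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=64)

# Peak GPU memory allocated during generate(), in GB.
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```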
Our int4wo is only optimized for batch size 1, so expect some slowdown with larger batch sizes.
Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes and longer token lengths.

## Setup

We need to install vllm nightly to get some recent changes:
```Shell
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

Get the vllm source code:
```Shell
git clone [email protected]:vllm-project/vllm.git
```

Run the benchmarks under the `vllm` root folder:

## benchmark_latency

### baseline
```Shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
```

### int4wo-hqq
```Shell
python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-int4wo-hqq --batch-size 1
```

## benchmark_serving

We benchmarked the throughput in a serving environment.

Download the sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`

Other datasets can be found at https://github.com/vllm-project/vllm/tree/main/benchmarks

### baseline
Server:
```Shell
vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
```

Client:
```Shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
```

### int4wo-hqq
Server:
```Shell
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
```

Client:
```Shell
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
```