Qwen3-1.7B-Int8

This version of Qwen3-1.7B-Int8 has been converted to run on the Axera NPU using w8a16 quantization.


Compatible with Pulsar2 version: 4.2 (not released yet)

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo: https://huggingface.co/Qwen/Qwen3-1.7B

Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel

AXera NPU LLM Runtime

Supported Platforms

Chips   w8a16            w4a16
AX650   9.5 tokens/sec   TBD
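As a rough sanity check on the table above, the decode throughput gives a lower bound on generation time once the time to first token is paid. A minimal sketch, taking 9.5 tokens/sec from the table and the ~677 ms TTFT reported in the sample CLI run further down this page:

```python
# Rough wall-time estimate for w8a16 Qwen3-1.7B on AX650.
TTFT_S = 0.677        # time to first token (from the sample run log below)
DECODE_TOK_S = 9.5    # steady-state decode throughput (from the table above)

def estimate_seconds(new_tokens: int) -> float:
    """Approximate wall time to generate `new_tokens` tokens."""
    return TTFT_S + new_tokens / DECODE_TOK_S

print(f"~{estimate_seconds(256):.1f} s for 256 tokens")  # ~27.6 s
```

This ignores prompt length beyond the first prefill split, so treat it as a lower bound.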

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the executable built by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, download the latest CI-built executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm and then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model download (Hugging Face)

First create the model directory, then download the model into it:

mkdir -p AXERA-TECH/Qwen3-1.7B
hf download AXERA-TECH/Qwen3-1.7B --local-dir AXERA-TECH/Qwen3-1.7B

# structure of the downloaded files
tree -L 3
.
└── AXERA-TECH
    └── Qwen3-1.7B
        ├── README.md
        ├── config.json
        ├── model.embed_tokens.weight.bfloat16.bin
        ├── post_config.json
        ├── qwen3_p128_l0_together.axmodel
...
        ├── qwen3_p128_l9_together.axmodel
        ├── qwen3_post.axmodel
        └── qwen3_tokenizer.txt

2 directories, 34 files
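The "..." in the tree elides the per-layer files: each transformer layer ships as its own qwen3_p128_l{i}_together.axmodel. A sketch of how the 34 reported files break down, assuming the 28 decoder layers of the upstream Qwen/Qwen3-1.7B config:

```python
# Sketch of the expected file layout, assuming 28 decoder layers
# (the layer count of the upstream Qwen/Qwen3-1.7B config).
NUM_LAYERS = 28

layer_files = [f"qwen3_p128_l{i}_together.axmodel" for i in range(NUM_LAYERS)]
other_files = [
    "README.md",
    "config.json",
    "model.embed_tokens.weight.bfloat16.bin",
    "post_config.json",
    "qwen3_post.axmodel",
    "qwen3_tokenizer.txt",
]

print(len(layer_files) + len(other_files))  # 34, matching the tree summary
```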

Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro) or the AX650N DEMO board

Run (CLI)

(base) root@ax650:~# axllm run AXERA-TECH/Qwen3-1.7B/
[I][                            Init][ 127]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [4.99s<5.16s, 6.01 count/s] init post axmodel ok,remain_cmm(7437 MB)
[I][                            Init][ 188]: max_token_len : 2559
[I][                            Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 194]: prefill_token_num : 128
[I][                            Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
[I][                            Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
[I][                            Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
[I][                            Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
[I][                            Init][ 203]: prefill_max_token_num : 2048
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [4.99s<4.99s, 6.21 count/s] embed_selector init ok
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 224]: LLM init ok
Type "q" to exit
Ctrl+c to stop current running
"reset" to reset kvcache
"dd" to remove last conversation.
"pp" to print history.
----------------------------------------
prompt >> who are you
[I][                      SetKVCache][ 357]: prefill_grpid:2 kv_cache_num:512 precompute_len:0 input_num_token:22
[I][                      SetKVCache][ 359]: current prefill_max_token_num:2048
[I][                      SetKVCache][ 360]: first run
[I][                             Run][ 412]: input token num : 22, prefill_split_num : 1
[I][                             Run][ 474]: ttft: 676.74 ms
<think>
Okay, the user asked, "Who are you?" I need to respond appropriately. Let me think.

First, I should introduce myself clearly. My name is Assistant, but I should mention that I'm an AI developed by Alibaba Group. That's important to establish my identity.

Next, I should explain my purpose. I'm here to help with questions, provide information, and assist with tasks. It's good to mention that I can help with various topics, like answering questions, writing, coding, or even creative tasks.

I should also mention that I'm designed to be helpful and friendly. Maybe add something about being available 24/7. But I need to keep it concise.

Wait, the user might be testing if I know how to respond. I should make sure my response is clear and not too technical. Avoid jargon. Also, maybe offer to help with specific tasks if they need it.

Let me check if there's anything else. I should avoid any markdown and keep the response natural. Alright, putting it all together now.
</think>

I'm an AI assistant developed by Alibaba Group. My name is Assistant, and I'm here to help with questions, provide information, and assist with various tasks. I'm designed to be helpful, friendly, and available 24/7. Let me know how I can assist you! 😊

[N][                             Run][ 554]: hit eos,avg 8.17 token/s

[I][                      GetKVCache][ 331]: precompute_len:300, remaining:1748
prompt >> q
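The load_config output above shows all sampling switches disabled by default (greedy decoding). Assuming these keys are read from the post_config.json shipped with the model (a plausible but unverified mapping), enabling temperature and top-p sampling would look like this sketch:

```python
import json

# Sampling settings exactly as printed by the runtime's load_config step.
cfg = {
    "enable_repetition_penalty": False,
    "enable_temperature": False,
    "enable_top_k_sampling": False,
    "enable_top_p_sampling": False,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8,
}

# Turn on temperature + top-p sampling, keeping the default values above.
cfg["enable_temperature"] = True
cfg["enable_top_p_sampling"] = True

# Hypothetical path: adjust to wherever the model directory was downloaded.
with open("post_config.json", "w") as f:
    json.dump(cfg, f, indent=4)
```

Re-run axllm after editing the file so the new settings are picked up at init.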

Start the server (OpenAI-compatible)

(base) root@ax650:~# axllm serve AXERA-TECH/Qwen3-1.7B/
[I][                            Init][ 127]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [3.32s<3.43s, 9.04 count/s] init post axmodel ok,remain_cmm(7437 MB)
[I][                            Init][ 188]: max_token_len : 2559
[I][                            Init][ 191]: kv_cache_size : 1024, kv_cache_num: 2559
[I][                            Init][ 194]: prefill_token_num : 128
[I][                            Init][ 198]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 198]: grp: 2, prefill_max_kv_cache_num : 512
[I][                            Init][ 198]: grp: 3, prefill_max_kv_cache_num : 1024
[I][                            Init][ 198]: grp: 4, prefill_max_kv_cache_num : 1536
[I][                            Init][ 198]: grp: 5, prefill_max_kv_cache_num : 2048
[I][                            Init][ 203]: prefill_max_token_num : 2048
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [3.32s<3.32s, 9.33 count/s] embed_selector init ok
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 224]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3-1.7B'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3-1.7B

OpenAI client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-1.7B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)

OpenAI streaming example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3-1.7B"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print()
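Qwen3 interleaves its chain-of-thought inside <think>...</think> tags, as the CLI transcript above shows; the API returns them in the message content as well. If you only want the final answer, a minimal sketch to strip the reasoning block:

```python
import re

def strip_think(text: str) -> str:
    """Remove <think>...</think> reasoning blocks, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

reply = "<think>\nThe user greeted me, so reply politely.\n</think>\n\nHello! How can I help?"
print(strip_think(reply))  # -> Hello! How can I help?
```

Apply it to completion.choices[0].message.content; for streaming you would need to buffer until the closing tag instead.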