Qwen3-4B-Instruct-2507-GPTQ-Int8
This version of Qwen3-4B-Instruct-2507 has been converted to Qwen3-4B-Instruct-2507-GPTQ-Int8 with GPTQModel and runs on the Axera NPU using w8a16 quantization (8-bit weights, 16-bit activations).
This model has been optimized with the following LoRA:
Compatible with Pulsar2 version: 5.1
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel from the original repository: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507
Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
Supported Platforms
| Chips | w8a16 | w4a16 |
|---|---|---|
| AX650 | 4.5 tokens/sec | TBD |
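As a rough way to read the table, response time can be approximated as the time to first token (TTFT) plus the number of generated tokens divided by the decode rate. The sketch below uses the 4.5 tokens/sec figure from the table; the function name `estimate_response_seconds` and the 90-token reply length are illustrative assumptions, not measurements.

```python
def estimate_response_seconds(new_tokens: int, tokens_per_sec: float, ttft_ms: float = 0.0) -> float:
    """Approximate wall-clock time: prefill latency (TTFT) plus decode time."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft_ms / 1000.0 + new_tokens / tokens_per_sec

# Example: a hypothetical 90-token reply at the table's 4.5 tokens/sec, ignoring TTFT.
print(estimate_response_seconds(90, 4.5))  # 20.0
```

Passing a measured TTFT through `ttft_ms` adds the prefill cost to the estimate.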
How to use
Download all files from this repository to the device
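One way to fetch the repository is the `huggingface_hub` Python package. The sketch below assumes that package is installed (`pip install huggingface_hub`) and that the repository id is `AXERA-TECH/Qwen3-4B-Instruct-2507-GPTQ-Int8`; the helper name `download_model` and the target directory are illustrative.

```python
REPO_ID = "AXERA-TECH/Qwen3-4B-Instruct-2507-GPTQ-Int8"  # assumed repository id

def download_model(local_dir: str, repo_id: str = REPO_ID) -> str:
    """Download every file of the repository into local_dir and return the path."""
    # Imported lazily so this module still loads where huggingface_hub is absent.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_model("Qwen3-4B-Instruct-2507-GPTQ-Int8")` on a machine with network access, then copying the directory to the board, should produce the layout listed below.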
root@ax650:/mnt/qtang/llm-test/Qwen3-4B-Instruct-2507-GPTQ-Int8# tree -L 1
.
├── config.json
├── main_api_ax650
├── main_api_axcl_aarch64
├── main_api_axcl_x86
├── main_ax650
├── main_axcl_aarch64
├── main_axcl_x86
├── post_config.json
├── qwen2.5_tokenizer
├── Qwen3-4B-Instruct-2507-GPTQ-Int8-context-4k-prefill-3584
├── qwen3_tokenizer
├── qwen3_tokenizer_uid.py
├── README.md
├── run_qwen3_4b_int8_gptq_ax650_api.sh
├── run_qwen3_4b_int8_gptq_ax650.sh
├── run_qwen3_4b_int8_gptq_axcl_aarch64.sh
├── run_qwen3_4b_int8_gptq_axcl_x86_api.sh
└── run_qwen3_4b_int8_gptq_axcl_x86.sh
4 directories, 15 files
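Before launching anything, it can help to confirm that the copy is complete. A minimal sketch, assuming the layout shown above; the helper name `missing_entries` and the chosen subset of names are illustrative, taken from the listing rather than from any official manifest.

```python
from pathlib import Path

# Entries assumed to be needed for the AX650 host flow, copied from the listing.
REQUIRED = [
    "config.json",
    "post_config.json",
    "main_ax650",
    "qwen3_tokenizer",
    "qwen3_tokenizer_uid.py",
    "run_qwen3_4b_int8_gptq_ax650.sh",
    "Qwen3-4B-Instruct-2507-GPTQ-Int8-context-4k-prefill-3584",
]

def missing_entries(root: str, required=REQUIRED) -> list:
    """Return the required names that do not exist under root."""
    base = Path(root)
    return [name for name in required if not (base / name).exists()]
```

An empty list from `missing_entries(".")`, run inside the model directory, means every listed entry is present.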
Start the Tokenizer service
Install the requirements
pip install transformers jinja2
root@ax650:/mnt/qtang/llm-test/Qwen3-4B-Instruct-2507-GPTQ-Int8# python3 qwen3_tokenizer_uid.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Server running at http://0.0.0.0:12345
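Before starting the runtime in a second terminal, you may want to confirm that the tokenizer service is accepting connections. The sketch below only tests whether the TCP port is open and makes no assumption about the service's HTTP routes. The helper name `port_is_open` is illustrative; the port comes from the log line above.

```python
import socket

def port_is_open(host: str = "127.0.0.1", port: int = 12345, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If `port_is_open()` returns `False`, the tokenizer script is probably not running yet or is bound to a different port.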
Inference with AX650 Host, such as M4N-Dock(η±θ―ζ΄ΎPro) or AX650N DEMO Board
Open another terminal and run run_qwen3_4b_int8_gptq_ax650.sh
root@ax650:/mnt/qtang/llm-test/Qwen3-4B-Instruct-2507-GPTQ-Int8# ./run_qwen3_4b_int8_gptq_ax650.sh
[I][ Init][ 110]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: 4edbab3b-3892-479c-ba4e-7ac7f27780a8
bos_id: -1, eos_id: 151645
2% | █ | 1 / 39 [3.64s<142.08s, 0.27 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 39 / 39 [25.93s<25.93s, 1.50 count/s] init post axmodel ok,remain_cmm(3466 MB)[I][ Init][ 188]: max_token_len : 4095
[I][ Init][ 193]: kv_cache_size : 1024, kv_cache_num: 4095
[I][ Init][ 201]: prefill_token_num : 256
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 256
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 512
[I][ Init][ 205]: grp: 4, prefill_max_token_num : 768
[I][ Init][ 205]: grp: 5, prefill_max_token_num : 1024
[I][ Init][ 205]: grp: 6, prefill_max_token_num : 1280
[I][ Init][ 205]: grp: 7, prefill_max_token_num : 1536
[I][ Init][ 205]: grp: 8, prefill_max_token_num : 1792
[I][ Init][ 205]: grp: 9, prefill_max_token_num : 2048
[I][ Init][ 205]: grp: 10, prefill_max_token_num : 2304
[I][ Init][ 205]: grp: 11, prefill_max_token_num : 2560
[I][ Init][ 205]: grp: 12, prefill_max_token_num : 2816
[I][ Init][ 205]: grp: 13, prefill_max_token_num : 3072
[I][ Init][ 205]: grp: 14, prefill_max_token_num : 3328
[I][ Init][ 205]: grp: 15, prefill_max_token_num : 3584
[I][ Init][ 209]: prefill_max_token_num : 3584
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 280]: input token num : 21, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 320]: input_num_token:21
[I][ main][ 228]: precompute_len: 21
[I][ main][ 229]: system_prompt: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
prompt >> who are you
[I][ SetKVCache][ 467]: prefill_grpid:2 kv_cache_num:256 precompute_len:21 input_num_token:14
[I][ SetKVCache][ 470]: current prefill_max_token_num:3328
[I][ Run][ 596]: input token num : 14, prefill_split_num : 1
[I][ Run][ 622]: input_num_token:14
[I][ Run][ 745]: ttft: 2950.42 ms
Hello, I am Qwen, a large-scale language model independently developed by the Tongyi Lab under Alibaba Group. I can answer questions, create text such as stories, official documents, emails, scripts, perform logical reasoning, coding, and more. I can also express opinions and play games. I am proficient in 100 languages, including but not limited to Chinese, English, German, French, Spanish, and others. If you have any questions or need assistance, feel free to ask me anytime!
[N][ Run][ 864]: hit eos,avg 4.11 token/s
[I][ GetKVCache][ 436]: precompute_len:140, remaining:3444
prompt >> q
root@ax650:/mnt/qtang/llm-test/Qwen3-4B-Instruct-2507-GPTQ-Int8#
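The init log lists 15 prefill groups whose capacities grow in steps of 256 tokens up to 3584, and the run above places a 21-token prompt in group 2. One plausible reading is that the runtime picks the smallest group whose capacity covers the tokens to prefill. The sketch below illustrates that reading only; `select_prefill_group` is a hypothetical helper, not part of the AXERA runtime, and the actual selection logic may differ.

```python
# Capacities per group id as printed in the init log: grp 1 -> 1, grp 2 -> 256, ...
PREFILL_CAPACITY = {1: 1, **{g: 256 * (g - 1) for g in range(2, 16)}}

def select_prefill_group(num_tokens: int) -> int:
    """Return the smallest group id whose capacity fits num_tokens (assumed rule)."""
    for grp in sorted(PREFILL_CAPACITY):
        if num_tokens <= PREFILL_CAPACITY[grp]:
            return grp
    limit = max(PREFILL_CAPACITY.values())
    raise ValueError(f"{num_tokens} tokens exceed the maximum prefill of {limit}")

print(select_prefill_group(21))    # 2, matching "prefill_grpid : 2" in the log
print(select_prefill_group(3584))  # 15
```

Under this reading, prompts longer than 3584 tokens cannot be prefilled in one pass, which agrees with the `prefill_max_token_num : 3584` line in the log.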
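The sampling config printed during init enables only top-k sampling, with `top_k` set to 1; temperature, top-p, and repetition penalty are switched off. Top-k sampling with k = 1 keeps just the highest-scoring token, so decoding is effectively greedy and deterministic. The sketch below demonstrates that property on toy logits; `top_k_sample` is an illustrative helper, not the runtime's sampler.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest logits using softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]  # stable softmax numerators
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.5, -1.0, 2.4]
# With k=1 the candidate set has one element, so the result is always the argmax.
print({top_k_sample(logits, k=1) for _ in range(100)})  # {1}
```

Raising `top_k` above 1 in `post_config.json` would be expected to reintroduce randomness between the top candidates, though the exact behaviour depends on the runtime.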