Qwen3.5-2B

This version of Qwen3.5-2B has been converted to run on the Axera NPU using w4a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported Platforms

Image Processing

| Chips | input size | image num | ttft (168 tokens) | w4a16 | CMM | Flash |
|---|---|---|---|---|---|---|
| AX650 | 384*384 | 1 | 368 ms | 13.4 tokens/sec | 1.72 GiB | 2.8 GiB |

Video Processing

| Chips | input size | image num | ttft (600 tokens) | w4a16 | CMM | Flash |
|---|---|---|---|---|---|---|
| AX650 | 384*384 | 8 | 892 ms | 13.4 tokens/sec | 1.72 GiB | 2.8 GiB |

The CMM column refers to the CMM (DDR) memory consumed at runtime. Ensure that the CMM memory allocation on the development board is greater than this value.

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch: axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the prebuilt executable exported by GitHub Actions CI (for users without a build environment):

If you do not have a build environment, download the latest CI-exported executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Download the model (Hugging Face)

Create the model directory, enter it, and download into that directory:

mkdir -p AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047
cd AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047
hf download AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047 --local-dir .

# structure of the downloaded files (tree run from the parent directory)
tree -L 3
`-- AXERA-TECH
    `-- Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047
        |-- qwen3_5_vision.axmodel
        |-- README.md
        |-- config.json
        |-- image.png
        |-- model.embed_tokens.weight.bfloat16.bin
        |-- post_config.json
        |-- qwen3_5_tokenizer.txt
        |-- qwen3_5_text_p128_l0_together.axmodel
        ...
        |-- qwen3_5_text_p128_l23_together.axmodel
        |-- qwen3_5_text_post.axmodel
        `-- vision_cache

3 directories, 39 files
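
After downloading, a quick sanity check that the expected files are present can be sketched in Python. The file names are taken from the tree above; the `missing_files` helper is illustrative, not part of ax-llm:

```python
from pathlib import Path

# Files every converted model directory should contain (per the tree above).
REQUIRED = [
    "config.json",
    "post_config.json",
    "qwen3_5_tokenizer.txt",
    "qwen3_5_vision.axmodel",
    "qwen3_5_text_post.axmodel",
    "model.embed_tokens.weight.bfloat16.bin",
]

def missing_files(model_dir, required=REQUIRED):
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

If the returned list is non-empty, re-run the `hf download` command above before trying to run the model.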

Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or the AX650N DEMO board

Run (CLI)

Image understanding

root@ax650 ~/yongqiang/lhj/Qwen3_5.AXERA/ax-llm # axllm run Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047/
11:40:21.412 INF Init:218 | LLM init start
11:40:21.412 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
 96% | ##############################   |  26 /  27 [3.78s<3.93s, 6.87 count/s] init post axmodel ok,remain_cmm(5589 MB)
11:40:25.195 INF Init:368 | max_token_len : 2047
11:40:25.195 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
11:40:25.195 INF Init:374 | prefill_token_num : 128
11:40:25.195 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
11:40:25.195 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
11:40:25.195 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
11:40:25.195 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
11:40:25.195 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
11:40:25.195 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 768
11:40:25.195 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 1024
11:40:25.195 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 1152
11:40:25.195 INF Init:384 | prefill_max_token_num : 1152
11:40:25.195 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  27 /  27 [3.79s<3.79s, 7.13 count/s] embed_selector init ok
11:40:25.478 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
11:40:25.478 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
11:40:25.478 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
11:40:25.480 INF load_config:282 | load config: 
11:40:25.480 INF load_config:282 | {
11:40:25.480 INF load_config:282 |     "enable_repetition_penalty": false,
11:40:25.480 INF load_config:282 |     "enable_temperature": false,
11:40:25.480 INF load_config:282 |     "enable_top_k_sampling": true,
11:40:25.480 INF load_config:282 |     "enable_top_p_sampling": false,
11:40:25.480 INF load_config:282 |     "penalty_window": 20,
11:40:25.480 INF load_config:282 |     "repetition_penalty": 1.2,
11:40:25.480 INF load_config:282 |     "temperature": 0.9,
11:40:25.480 INF load_config:282 |     "top_k": 10,
11:40:25.480 INF load_config:282 |     "top_p": 0.8
11:40:25.480 INF load_config:282 | }
11:40:25.480 INF Init:448 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kvcache
  /dd        delete the last round of dialogue
  /pp        print the conversation history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe the image
image >> image.png
19:25:48.491 INF EncodeForContent:973 | Qwen-VL pixel_values[0] bytes=884736 min=0 max=238 (w=384 h=384 tp=2 ps=16 sm=2)
19:25:48.641 INF EncodeForContent:996 | vision cache store: image.png
19:25:48.665 INF SetKVCache:747 | prefill_grpid:3 kv_cache_num:256 precompute_len:0 input_num_token:168
19:25:48.665 INF SetKVCache:749 | current prefill_max_token_num:1152
19:25:48.665 INF SetKVCache:750 | first run
19:25:48.709 INF Run:805 | input token num : 168, prefill_split_num : 2
19:25:48.709 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
19:25:48.710 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
19:25:48.881 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=40
19:25:48.881 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
19:25:49.077 INF Run:1010 | ttft: 368.31 ms
<think>
The user wants a description of the provided image.

1.  **Identify the main subjects:** There are three individuals wearing white astronaut-style spacesuits or heavy, space-inspired uniforms.
2.  **Describe their appearance:**
    *   They are dressed in bulky, white, bulky jackets with hoods.
    *   They are wearing helmets with metallic visors (the reflective visors are slightly reflective in the image).
    *   They appear to be standing amidst tall, leaf-like vegetation.
    *   The setting looks outdoors, possibly a forest or a jungle-like area, given the tall, grass-like plants.
3.  **Describe the atmosphere/mood**:
    *   It looks surreal or dreamlike.
    *   It doesn't clearly look like a real-world space station.
    *   The lighting is somewhat soft, and the colors are somewhat muted and desaturated, suggesting an artistic filter or a digital manipulation effect rather than a natural scene.
4.  **Specific details:
    - The vegetation:
    - It looks like a dense, grass-like forest.
</think>

This image features three figures wearing bulky, white, space-inspired suits and helmets with reflective visors. They stand among an abundance of tall, leaf-like vegetation, resembling a dense, grass-like forest. The atmosphere is surreal and dreamlike, the lighting and color scheme suggest a surreal or digitally manipulated style rather a realistic depiction of space stations.

19:26:11.722 NTC Run:1132 | hit eos,avg 13.42 token/s
19:26:11.722 INF GetKVCache:721 | precompute_len:340, remaining:812
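
The sampling parameters printed at startup are read from post_config.json in the model directory (see the file tree above). A minimal sketch matching the fields in the log, with the values from this run (not tuning recommendations):

```json
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}
```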

Video understanding

root@ax650 ~/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047 # ./axllm run ./
19:29:18.588 INF Init:218 | LLM init start
19:29:18.588 INF Init:226 | mixed attention enabled: full_attention_interval=4 ref_full_layer_idx=3
tokenizer_type = 3
 96% | ##############################   |  26 /  27 [17.06s<17.72s, 1.52 count/s] init post axmodel ok,remain_cmm(6157 MB)
19:29:35.651 INF Init:368 | max_token_len : 2047
19:29:35.651 INF Init:371 | kv_cache_size : 512, kv_cache_num: 2047
19:29:35.651 INF Init:374 | prefill_token_num : 128
19:29:35.651 INF Init:379 | grp: 1, prefill_max_kv_cache_num : 1
19:29:35.651 INF Init:379 | grp: 2, prefill_max_kv_cache_num : 128
19:29:35.651 INF Init:379 | grp: 3, prefill_max_kv_cache_num : 256
19:29:35.651 INF Init:379 | grp: 4, prefill_max_kv_cache_num : 384
19:29:35.651 INF Init:379 | grp: 5, prefill_max_kv_cache_num : 512
19:29:35.651 INF Init:379 | grp: 6, prefill_max_kv_cache_num : 640
19:29:35.651 INF Init:379 | grp: 7, prefill_max_kv_cache_num : 768
19:29:35.651 INF Init:379 | grp: 8, prefill_max_kv_cache_num : 896
19:29:35.651 INF Init:379 | grp: 9, prefill_max_kv_cache_num : 1024
19:29:35.651 INF Init:379 | grp: 10, prefill_max_kv_cache_num : 1152
19:29:35.651 INF Init:384 | prefill_max_token_num : 1152
19:29:35.651 INF Init:27 | LLaMaEmbedSelector use mmap
100% | ################################ |  27 /  27 [17.07s<17.07s, 1.58 count/s] embed_selector init ok
19:29:35.998 INF Init:643 | Qwen-VL token ids: vision_start=248053 image_pad=248056 video_pad=248057
19:29:35.998 INF Init:668 | VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
19:29:35.998 WRN Init:677 | Vision preprocess backend: SimpleCV (OpenCV not found at build time; minor differences vs OpenCV are possible)
19:29:35.999 INF load_config:282 | load config: 
19:29:35.999 INF load_config:282 | {
19:29:35.999 INF load_config:282 |     "enable_repetition_penalty": false,
19:29:35.999 INF load_config:282 |     "enable_temperature": false,
19:29:35.999 INF load_config:282 |     "enable_top_k_sampling": true,
19:29:35.999 INF load_config:282 |     "enable_top_p_sampling": false,
19:29:35.999 INF load_config:282 |     "penalty_window": 20,
19:29:35.999 INF load_config:282 |     "repetition_penalty": 1.2,
19:29:35.999 INF load_config:282 |     "temperature": 0.9,
19:29:35.999 INF load_config:282 |     "top_k": 10,
19:29:35.999 INF load_config:282 |     "top_p": 0.8
19:29:35.999 INF load_config:282 | }
19:29:36.000 INF Init:448 | LLM init ok
Commands:
  /q, /exit  quit
  /reset     reset the kvcache
  /dd        delete the last round of dialogue
  /pp        print the conversation history
Ctrl+C: stop the current generation
VLM enabled: after each prompt, input image path (empty = text-only). Use "video:<frames_dir>" for video.
----------------------------------------
prompt >> describe the video
image >> video:video-test-03
19:30:06.823 INF SetKVCache:747 | prefill_grpid:6 kv_cache_num:640 precompute_len:0 input_num_token:600
19:30:06.823 INF SetKVCache:749 | current prefill_max_token_num:1152
19:30:06.823 INF SetKVCache:750 | first run
19:30:06.844 INF Run:805 | input token num : 600, prefill_split_num : 5
19:30:06.844 INF Run:845 | prefill chunk p=0 history_len=0 grpid=1 kv_cache_num=0 input_tokens=128
19:30:06.845 INF Run:868 | prefill indices shape: p=0 idx_elems=128 idx_rows=1 pos_rows=3
19:30:07.022 INF Run:845 | prefill chunk p=1 history_len=128 grpid=2 kv_cache_num=128 input_tokens=128
19:30:07.022 INF Run:868 | prefill indices shape: p=1 idx_elems=128 idx_rows=1 pos_rows=3
19:30:07.194 INF Run:845 | prefill chunk p=2 history_len=256 grpid=3 kv_cache_num=256 input_tokens=128
19:30:07.194 INF Run:868 | prefill indices shape: p=2 idx_elems=128 idx_rows=1 pos_rows=3
19:30:07.365 INF Run:845 | prefill chunk p=3 history_len=384 grpid=4 kv_cache_num=384 input_tokens=128
19:30:07.366 INF Run:868 | prefill indices shape: p=3 idx_elems=128 idx_rows=1 pos_rows=3
19:30:07.537 INF Run:845 | prefill chunk p=4 history_len=512 grpid=5 kv_cache_num=512 input_tokens=88
19:30:07.537 INF Run:868 | prefill indices shape: p=4 idx_elems=128 idx_rows=1 pos_rows=3
19:30:07.736 INF Run:1010 | ttft: 892.14 ms
<think>

</think>

This video appears to be a short clip, likely from a social media account (indicated by the Instagram handle "@kjanecron" and "INSIDE EDITION" watermark), featuring a woman in a kitchen setting.

**Scene 1: The Fire Incident (12 frames)**
- **Setting:** A kitchen with white cabinets, a silver refrigerator adorned with various magnets and photos, and a stainless-steel stove.
- **Action & Atmosphere:** A woman with long hair, wearing a white and pink patterned halter top, stands before the stove. She seems to be in the middle of a cooking process. Suddenly, a large cloud of smoke and fire erupt from the stove's right burner. Smoke billows into the room, and flames can be seen in the background, suggesting a fire started accidentally or intentionally during her cooking attempt. She looks startled but continues to stand near the fire.

The video then cuts to a different shot, the woman now standing in front of the now-smothered microwave. The fire and smoke are gone, but the overall scene remains the same.

In the next shot, the woman is now wearing a different apron (white with colorful birds), standing near the stove. Steam and smoke are again visible, indicating smoke may have reappeared.

Finally, the last shot shows the woman standing before what appears to be a refrigerator with magnets and photos, holding a plate of food with a fork, indicating she might be serving the food from the oven.

19:30:30.874 NTC Run:1132 | hit eos,avg 13.44 token/s
19:30:30.874 INF GetKVCache:721 | precompute_len:779, remaining:373
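
The video mode consumes a directory of frames ("video:<frames_dir>"); the run above uses 8 frames. If you extract frames from a clip yourself, picking evenly spaced frames can be sketched as below. The `sample_frame_indices` helper is illustrative and not part of ax-llm:

```python
def sample_frame_indices(total_frames, num_frames=8):
    """Return num_frames evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]

# Example: from a 120-frame clip, keep 8 evenly spaced frames.
print(sample_frame_indices(120, 8))  # → [0, 15, 30, 45, 60, 75, 90, 105]
```

Save the selected frames as individual images into one directory, then pass that directory as `video:<frames_dir>` at the image prompt.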

Start the server (OpenAI-compatible)

root@ax650:~# axllm serve AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047
[I][                            Init][ 138]: LLM init start
tokenizer_type = 1
 96% | ███████████████████████████████   |  30 /  31 [4.63s<4.79s, 6.47 count/s] init post axmodel ok,remain_cmm(9563 MB)
[I][                            Init][ 199]: max_token_len : 2047
[I][                            Init][ 202]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 205]: prefill_token_num : 128
[I][                            Init][ 209]: grp: 1, prefill_max_kv_cache_num : 1
[I][                            Init][ 209]: grp: 2, prefill_max_kv_cache_num : 128
[I][                            Init][ 209]: grp: 3, prefill_max_kv_cache_num : 256
[I][                            Init][ 209]: grp: 4, prefill_max_kv_cache_num : 384
[I][                            Init][ 209]: grp: 5, prefill_max_kv_cache_num : 512
[I][                            Init][ 209]: grp: 6, prefill_max_kv_cache_num : 640
[I][                            Init][ 209]: grp: 7, prefill_max_kv_cache_num : 768
[I][                            Init][ 209]: grp: 8, prefill_max_kv_cache_num : 896
[I][                            Init][ 209]: grp: 9, prefill_max_kv_cache_num : 1024
[I][                            Init][ 209]: grp: 10, prefill_max_kv_cache_num : 1152
[I][                            Init][ 214]: prefill_max_token_num : 1152
[I][                            Init][  27]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [4.64s<4.64s, 6.69 count/s] embed_selector init ok
[W][                            Init][ 457]: Qwen-VL vision size override: cfg=448x448 bytes=1204224, model_input_bytes=884736 -> 384x384 (square).
[I][                            Init][ 641]: Qwen-VL token ids: vision_start=151652 image_pad=151655 video_pad=151656
[I][                            Init][ 666]: VisionModule init ok: type=Qwen3VL, tokens_per_block=144, embed_size=2048, out_dtype=fp32
[I][                            Init][ 672]: VisionModule deepstack enabled: layers=3
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": false,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 272]: LLM init ok
Starting server on port 8000 with model 'AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047'...
OpenAI API Server starting on http://0.0.0.0:8000
Max concurrency: 1
Models: AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047

OpenAI client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
completion = client.chat.completions.create(
    model=MODEL,
    messages=messages,
)

print(completion.choices[0].message.content)

OpenAI streaming client example

from openai import OpenAI

API_URL = "http://127.0.0.1:8000/v1"
MODEL = "AXERA-TECH/Qwen3.5-2B-AX650-GPTQ-Int4-C128-P1152-CTX2047"

messages = [
    {"role": "system", "content": [{"type": "text", "text": "you are a helpful assistant."}]},
    {"role": "user", "content": "hello"},
]

client = OpenAI(api_key="not-needed", base_url=API_URL)
stream = client.chat.completions.create(
    model=MODEL,
    messages=messages,
    stream=True,
)

print("assistant:")
for ev in stream:
    delta = getattr(ev.choices[0], "delta", None)
    if delta and getattr(delta, "content", None):
        print(delta.content, end="", flush=True)
print("\n")