xinhe committed on
Commit 0e586cf · verified · 1 Parent(s): cbbf606

Update README.md

Files changed (1)
  1. README.md +9 -18
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
 ---
 ## Model Details
 
- This model is a mixed int4 model with group_size 128 and asymmetric quantization of [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5/) generated by [intel/auto-round](https://github.com/intel/auto-round). Please follow the license of the original model.
+ This model is a mixed int4 model with group_size 64 and symmetric quantization of [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5/) generated by [intel/auto-round](https://github.com/intel/auto-round). Please follow the license of the original model.
 
 **The model is quantized with pure RTN mode**.
 
@@ -17,49 +17,40 @@ This model is a mixed int4 model with group_size 128 and asymmetric quantization
 **Setup**
 ~~~bash
 pip install git+https://github.com/vllm-project/vllm.git@main
-
-
 pip install git+https://github.com/huggingface/transformers.git
 ~~~
 
 
-
- **MTP has not been supported, we will try to fix it later.**
+ **MTP is supported.**
 
 ~~~bash
- CUDA_VISIBLE_DEVICES=1,2,3,4 vllm serve Intel/GLM-5-int4-mixed-AutoRound \
+ CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Intel/GLM-5-int4-mixed-AutoRound \
 --tensor-parallel-size 4 \
 --gpu-memory-utilization 0.85 \
 --tool-call-parser glm47 \
 --reasoning-parser glm45 \
 --enable-auto-tool-choice \
+ --speculative-config.method mtp \
+ --speculative-config.num_speculative_tokens 1 \
 --served-model-name glm-5
 ~~~
 
-
-
-
-
 ~~~bash
- curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d ' {
+ curl http://localhost:8009/v1/chat/completions -H "Content-Type: application/json" -d ' {
 "model": "glm-5",
 "messages": [
 {"role": "system", "content": "You are a helpful assistant."},
- {"role": "user", "content": "Summarize GLM-5 in one sentence."}
+ {"role": "user", "content": "Summarize AutoRound in one sentence."}
 ],
- "temperature": 1,
- "max_tokens": 4096
+ "max_tokens": 512
 } '
 ~~~
 
 
-
 ## Generate the Model
 
- this pr is required https://github.com/intel/auto-round/pull/1466
-
 ~~~bash
- auto-round zai-org/GLM-5 --ignore_layers layers.0,layers.1,layers.2,self_attn,shared_experts --iters 0 --disable_opt_rtn --output_dir ./glm5_int4 --format auto_round:auto_awq --asym
+ auto_round /storage/wenhuach/GLM-5/ --iters 0 --disable_opt_rtn --ignore_layers layers.0,layers.1,layers.2,self_attn,shared_experts,eh_proj --output_dir /GLM-5-int4 --group_size 64
 ~~~
 
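Both the old and new `auto_round` commands run with `--iters 0`, i.e. pure RTN (round-to-nearest) quantization with no tuned rounding; the new revision switches from asymmetric group_size-128 to symmetric group_size-64 scales. Below is a minimal NumPy sketch of symmetric per-group RTN for illustration only; the function and variable names are ours, not AutoRound internals.

~~~python
import numpy as np

def rtn_quantize_sym(w: np.ndarray, bits: int = 4, group_size: int = 64):
    """Symmetric per-group round-to-nearest: one scale per group, no tuning."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4
    groups = w.reshape(-1, group_size)              # each row shares one scale
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                         # guard all-zero groups
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.randn(4, 256).astype(np.float32)
q, scale = rtn_quantize_sym(w.reshape(-1))
w_hat = (q * scale).reshape(w.shape)                # dequantize to check error
print("max abs error:", float(np.abs(w - w_hat).max()))
~~~

Symmetric quantization stores only a scale per group and no zero point, which is what dropping `--asym` selects in the new command.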
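The setup installs transformers from source alongside vLLM. For a quick smoke test without standing up a server, the checkpoint should also load through transformers' auto-round integration. A hedged sketch, assuming a recent transformers with the auto-round backend installed and enough GPU memory for the int4 weights; untested against this exact checkpoint:

~~~python
# pip install auto-round  # backend that transformers dispatches to for this format
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Intel/GLM-5-int4-mixed-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # shard across available GPUs
    torch_dtype="auto",
)

inputs = tokenizer("Summarize AutoRound in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
~~~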
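The curl request targets port 8009 while the serve command passes no `--port` (vLLM defaults to 8000), so align one with the other when reproducing. Any OpenAI-compatible client can hit the same endpoint, and the MTP speculative decoding configured at serve time is transparent to clients. A minimal sketch with the `openai` package; the base URL and port are assumptions carried over from the curl example:

~~~python
# pip install openai
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is not checked by vLLM.
# Port 8009 follows the curl example (vLLM defaults to 8000 unless --port is set).
client = OpenAI(base_url="http://localhost:8009/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm-5",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize AutoRound in one sentence."},
    ],
    max_tokens=512,
)
print(resp.choices[0].message.content)
~~~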
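The `auto_round` CLI call above points at the author's local copy of the model (`/storage/wenhuach/GLM-5/`). The same run can be expressed through the Python API; a rough, hedged equivalent of the quantization settings is sketched below. The constructor arguments follow auto-round's documented API, but the exact counterparts of `--ignore_layers` and `--disable_opt_rtn` should be checked against the installed version, so they are left as comments.

~~~python
# pip install auto-round
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "zai-org/GLM-5"  # the CLI run uses a local copy of this model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# iters=0 selects pure RTN (no tuned rounding), matching --iters 0 above.
# The CLI additionally keeps the layers listed in --ignore_layers unquantized
# and passes --disable_opt_rtn; consult the auto-round docs for the API-side
# equivalents in your installed version.
autoround = AutoRound(model, tokenizer, bits=4, group_size=64, sym=True, iters=0)
autoround.quantize()
autoround.save_quantized("./GLM-5-int4", format="auto_round")
~~~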