Update README.md
---

## Model Details

This model is a mixed int4 model with group_size 64 and symmetric quantization of [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5/), generated by [intel/auto-round](https://github.com/intel/auto-round). Please follow the license of the original model.

**The model is quantized with pure RTN mode.**
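As a rough illustration of what group-wise symmetric RTN quantization does, here is a toy sketch in plain Python (illustrative only; the helper names are ours, not AutoRound's): each group of `group_size` weights shares one scale derived from the group's maximum magnitude, and each weight is rounded to the nearest representable int4 value.

```python
# Toy sketch of group-wise symmetric round-to-nearest (RTN) int4 quantization.
# Illustrative only: this is not AutoRound's implementation.

def rtn_quantize(weights, group_size=64, bits=4):
    """Quantize a flat list of floats; returns per-group int codes and scales."""
    qmax = 2 ** (bits - 1) - 1          # 7 for symmetric int4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid div-by-zero on all-zero groups
        codes.append([max(-qmax - 1, min(qmax, round(w / scale))) for w in group])
        scales.append(scale)
    return codes, scales

def rtn_dequantize(codes, scales):
    """Reconstruct approximate float weights from codes and per-group scales."""
    return [c * s for group, s in zip(codes, scales) for c in group]

codes, scales = rtn_quantize([0.6, -1.0, 0.25, 0.75], group_size=4)
approx = rtn_dequantize(codes, scales)  # each value is within scale/2 of the original
```

The only per-group metadata kept is the scale, which is why a smaller group_size (64 here versus the common 128) trades a little extra storage for finer-grained scales.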

**Setup**

~~~bash
pip install git+https://github.com/vllm-project/vllm.git@main
pip install git+https://github.com/huggingface/transformers.git
~~~

**MTP is supported.**

~~~bash
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Intel/GLM-5-int4-mixed-AutoRound \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --served-model-name glm-5
~~~
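With MTP enabled, the server drafts `num_speculative_tokens` tokens per step with the MTP head and lets the main model verify them, so an accepted draft yields more than one token per forward pass. A toy greedy-verification sketch (purely illustrative, not vLLM's code):

```python
# Toy sketch of greedy speculative decoding with a single draft token,
# mirroring num_speculative_tokens=1. Not vLLM's implementation.

def speculative_step(prefix, draft_next, target_next, num_speculative_tokens=1):
    """Return the tokens produced by one speculative step."""
    # 1) Draft tokens cheaply (stand-in for the MTP head).
    ctx, drafted = list(prefix), []
    for _ in range(num_speculative_tokens):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # 2) Verify with the target model; keep drafts only while it agrees.
    ctx, produced = list(prefix), []
    for tok in drafted:
        if target_next(ctx) != tok:
            break
        produced.append(tok)
        ctx.append(tok)
    # 3) The target always contributes one token itself, so the step
    #    yields between 1 and num_speculative_tokens + 1 tokens.
    produced.append(target_next(ctx))
    return produced

# Stand-ins: the "models" just emit the current context length.
agreeing = lambda ctx: len(ctx)
tokens = speculative_step([101, 102], agreeing, agreeing)  # draft accepted -> 2 tokens
```

When draft and target disagree, the drafted token is discarded and only the target's own token is kept, so correctness never depends on the draft head.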

~~~bash
curl http://localhost:8009/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "glm-5",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize AutoRound in one sentence."}
  ],
  "max_tokens": 512
}'
~~~
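The same call can be made from Python using only the standard library; this is a sketch that builds the OpenAI-compatible payload from the curl example, with the actual send left commented out since it assumes the vLLM server is already listening on port 8009:

```python
import json
import urllib.request

# Build the same OpenAI-compatible chat payload as the curl example.
payload = {
    "model": "glm-5",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize AutoRound in one sentence."},
    ],
    "max_tokens": 512,
}
body = json.dumps(payload).encode("utf-8")

# Uncomment once the vLLM server from the previous step is running:
# req = urllib.request.Request(
#     "http://localhost:8009/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```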

## Generate the Model

~~~bash
auto_round /storage/wenhuach/GLM-5/ --iters 0 --disable_opt_rtn --ignore_layers layers.0,layers.1,layers.2,self_attn,shared_experts,eh_proj --output_dir /GLM-5-int4 --group_size 64
~~~
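Here `--iters 0` skips AutoRound's tuning loop and falls back to plain RTN, while `--ignore_layers` keeps the listed layers out of int4 quantization, which is what makes the checkpoint "mixed". A toy sketch of that kind of name-based filtering (our simplification; AutoRound's real matching logic may differ):

```python
# Toy substring-based layer filter, mimicking the spirit of --ignore_layers.
# Illustrative simplification: not AutoRound's actual matcher.

IGNORE = ["layers.0", "layers.1", "layers.2", "self_attn", "shared_experts", "eh_proj"]

def should_quantize(layer_name, ignore=IGNORE):
    """Quantize a layer only if no ignore pattern appears in its name."""
    return not any(pattern in layer_name for pattern in ignore)

kept_fp = [n for n in ("model.layers.0.mlp.gate_proj",
                       "model.layers.5.self_attn.q_proj")
           if not should_quantize(n)]            # both stay in higher precision
quantized = should_quantize("model.layers.5.mlp.experts.3.up_proj")  # routed expert -> int4
```

Note that naive substring matching like this would also catch names such as `layers.10`; it is only a sketch of the idea, not of the exact matching rules.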