Update README.md
---

## Model Details

This model is a mixed int4 model with group_size 64 and symmetric quantization of [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5/), generated by [intel/auto-round](https://github.com/intel/auto-round). Please follow the license of the original model.

**The model is quantized with pure RTN mode.**
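As a rough illustration of what group-wise symmetric RTN quantization does, here is a toy sketch in plain Python (illustrative only; the helper names are ours, not AutoRound's): each group of `group_size` weights shares one scale derived from the group's maximum magnitude, and each weight is rounded to the nearest representable int4 value.

```python
# Toy sketch of group-wise symmetric round-to-nearest (RTN) int4 quantization.
# Illustrative only: this is not AutoRound's implementation.

def rtn_quantize(weights, group_size=64, bits=4):
    """Quantize a flat list of floats; returns per-group int codes and scales."""
    qmax = 2 ** (bits - 1) - 1          # 7 for symmetric int4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0  # avoid div-by-zero on all-zero groups
        codes.append([max(-qmax - 1, min(qmax, round(w / scale))) for w in group])
        scales.append(scale)
    return codes, scales

def rtn_dequantize(codes, scales):
    """Reconstruct approximate float weights from codes and per-group scales."""
    return [c * s for group, s in zip(codes, scales) for c in group]

codes, scales = rtn_quantize([0.6, -1.0, 0.25, 0.75], group_size=4)
approx = rtn_dequantize(codes, scales)  # each value is within scale/2 of the original
```

The only per-group metadata kept is the scale, which is why a smaller group_size (64 here versus the common 128) trades a little extra storage for finer-grained scales.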

**Setup**

~~~bash
pip install git+https://github.com/vllm-project/vllm.git@main
pip install git+https://github.com/huggingface/transformers.git
~~~

**MTP is supported.**

~~~bash
CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve Intel/GLM-5-int4-mixed-AutoRound \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --served-model-name glm-5
~~~
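With MTP enabled, the server drafts `num_speculative_tokens` tokens per step with the MTP head and lets the main model verify them, so an accepted draft yields more than one token per forward pass. A toy greedy-verification sketch (purely illustrative, not vLLM's code):

```python
# Toy sketch of greedy speculative decoding with a single draft token,
# mirroring num_speculative_tokens=1. Not vLLM's implementation.

def speculative_step(prefix, draft_next, target_next, num_speculative_tokens=1):
    """Return the tokens produced by one speculative step."""
    # 1) Draft tokens cheaply (stand-in for the MTP head).
    ctx, drafted = list(prefix), []
    for _ in range(num_speculative_tokens):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)
    # 2) Verify with the target model; keep drafts only while it agrees.
    ctx, produced = list(prefix), []
    for tok in drafted:
        if target_next(ctx) != tok:
            break
        produced.append(tok)
        ctx.append(tok)
    # 3) The target always contributes one token itself, so the step
    #    yields between 1 and num_speculative_tokens + 1 tokens.
    produced.append(target_next(ctx))
    return produced

# Stand-ins: the "models" just emit the current context length.
agreeing = lambda ctx: len(ctx)
tokens = speculative_step([101, 102], agreeing, agreeing)  # draft accepted -> 2 tokens
```

When draft and target disagree, the drafted token is discarded and only the target's own token is kept, so correctness never depends on the draft head.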

~~~bash
curl http://localhost:8009/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "glm-5",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize AutoRound in one sentence."}
  ],
  "max_tokens": 512
}'
~~~
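The same call can be made from Python using only the standard library; this is a sketch that builds the OpenAI-compatible payload from the curl example, with the actual send left commented out since it assumes the vLLM server is already listening on port 8009:

```python
import json
import urllib.request

# Build the same OpenAI-compatible chat payload as the curl example.
payload = {
    "model": "glm-5",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize AutoRound in one sentence."},
    ],
    "max_tokens": 512,
}
body = json.dumps(payload).encode("utf-8")

# Uncomment once the vLLM server from the previous step is running:
# req = urllib.request.Request(
#     "http://localhost:8009/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```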

## Generate the Model

~~~bash
auto_round /storage/wenhuach/GLM-5/ --iters 0 --disable_opt_rtn --ignore_layers layers.0,layers.1,layers.2,self_attn,shared_experts,eh_proj --output_dir /GLM-5-int4 --group_size 64
~~~
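Here `--iters 0` skips AutoRound's tuning loop and falls back to plain RTN, while `--ignore_layers` keeps the listed layers out of int4 quantization, which is what makes the checkpoint "mixed". A toy sketch of that kind of name-based filtering (our simplification; AutoRound's real matching logic may differ):

```python
# Toy substring-based layer filter, mimicking the spirit of --ignore_layers.
# Illustrative simplification: not AutoRound's actual matcher.

IGNORE = ["layers.0", "layers.1", "layers.2", "self_attn", "shared_experts", "eh_proj"]

def should_quantize(layer_name, ignore=IGNORE):
    """Quantize a layer only if no ignore pattern appears in its name."""
    return not any(pattern in layer_name for pattern in ignore)

kept_fp = [n for n in ("model.layers.0.mlp.gate_proj",
                       "model.layers.5.self_attn.q_proj")
           if not should_quantize(n)]            # both stay in higher precision
quantized = should_quantize("model.layers.5.mlp.experts.3.up_proj")  # routed expert -> int4
```

Note that naive substring matching like this would also catch names such as `layers.10`; it is only a sketch of the idea, not of the exact matching rules.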