Model usage discussion.
This could make 3x 3090 interesting, perhaps!
Anyone using this on 3x24GB?
This looks like fascinating work, thank you very much.
It would be very fast on 3x 24GB GPUs. However, it is also quite usable with even a single 12GB GPU holding the context memory, with the experts offloaded to CPU as explained in the usage notes. It will do around 7 t/s running the experts on CPU even on an old DDR4 9900K PC, which I find quite usable. z.ai hit it out of the park with this model; it's the best I've seen over the last couple of years, and not by a small amount, so I put quite some time into making a full set of strong quants for it. Hope it works well for you!
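For reference, a minimal sketch of that hybrid GPU/CPU launch, assuming an ik_llama.cpp (or recent llama.cpp) build with tensor overrides; the model filename, context size, and thread count are placeholders to adjust:

# Keep attention and KV cache on the GPU; push the routed experts to CPU RAM.
# The model path, context size, and thread count here are illustrative.
./llama-server \
  -m glm-4.5-quant.gguf \
  -ngl 99 \
  -ot exps=CPU \
  -c 32768 \
  -t 8 \
  -fa

The -ot exps=CPU override matches the ffn_*_exps tensors, so only the large expert weights stay in system RAM while everything else fits in VRAM.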
Thank you! I will explore how to load as much as possible onto my 2x 3090s, and if I succeed I will report back.
5000 downloads and nobody comments on usage, kinda sad eh. People...
The Air main page got 400k downloads last month with a grand total of 13 comments, for probably the best open-weight model runnable on consumer-grade hardware dropped yet. It's just statistically rare for anyone to comment on this platform, for whatever reason.
I'm commenting!
Well TBH I have a question. You mentioned:
The last layer 46 in this quant was explicitly set to Q2_K_S since the layer is for multi-token prediction and is not currently used. This does not impact performance but will save a small amount of memory.
Is this true for the full GLM 4.5 as well, namely for layer 92?
...Can we zap it entirely?
This is what I see in the log:
[1735/1759] blk.91.attn_v.weight - [ 5120, 1024, 1, 1], type = bf16, Using custom type iq6_k for tensor blk.91.attn_v.weight
converting to iq6_k .. size = 10.00 MiB -> 4.14 MiB
[1736/1759] output_norm.weight - [ 5120, 1, 1, 1], type = f32, size = 0.020 MB
[1737/1759] blk.92.nextn.eh_proj.weight - [10240, 5120, 1, 1], type = bf16, Using custom type q8_0 for tensor blk.92.nextn.eh_proj.weight
====== llama_model_quantize_internal: did not find weights for blk.92.nextn.eh_proj.weight
converting to q8_0 .. size = 100.00 MiB -> 53.12 MiB
[1738/1759] blk.92.nextn.enorm.weight - [ 5120, 1, 1, 1], type = f32, size = 0.020 MB
[1739/1759] blk.92.nextn.hnorm.weight - [ 5120, 1, 1, 1], type = f32, size = 0.020 MB
[1740/1759] blk.92.attn_norm.weight - [ 5120, 1, 1, 1], type = f32, size = 0.020 MB
[1741/1759] blk.92.ffn_down_exps.weight - [ 1536, 5120, 160, 1], type = bf16, Using custom type iq3_ks for tensor blk.92.ffn_down_exps.weight
====== llama_model_quantize_internal: did not find weights for blk.92.ffn_down_exps.weight
converting to iq3_ks .. size = 2400.00 MiB -> 479.69 MiB
[1742/1759] blk.92.ffn_gate_exps.weight - [ 5120, 1536, 160, 1], type = bf16, Using custom type iq3_ks for tensor blk.92.ffn_gate_exps.weight
====== llama_model_quantize_internal: did not find weights for blk.92.ffn_gate_exps.weight
converting to iq3_ks .. size = 2400.00 MiB -> 478.59 MiB
[1743/1759] blk.92.ffn_up_exps.weight - [ 5120, 1536, 160, 1], type = bf16, Using custom type iq3_ks for tensor blk.92.ffn_up_exps.weight
====== llama_model_quantize_internal: did not find weights for blk.92.ffn_up_exps.weight
converting to iq3_ks .. size = 2400.00 MiB -> 478.59 MiB
[1744/1759] blk.92.exp_probs_b.bias - [ 160, 1, 1, 1], type = f32, size = 0.001 MB
[1745/1759] blk.92.ffn_gate_inp.weight - [ 5120, 160, 1, 1], type = f32, size = 3.125 MB
[1746/1759] blk.92.ffn_down_shexp.weight - [ 1536, 5120, 1, 1], type = bf16, Using custom type iq5_ks for tensor blk.92.ffn_down_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.92.ffn_down_shexp.weight
converting to iq5_ks .. size = 15.00 MiB -> 4.94 MiB
[1747/1759] blk.92.ffn_gate_shexp.weight - [ 5120, 1536, 1, 1], type = bf16, Using custom type iq5_ks for tensor blk.92.ffn_gate_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.92.ffn_gate_shexp.weight
converting to iq5_ks .. size = 15.00 MiB -> 4.93 MiB
[1748/1759] blk.92.ffn_up_shexp.weight - [ 5120, 1536, 1, 1], type = bf16, Using custom type iq5_ks for tensor blk.92.ffn_up_shexp.weight
====== llama_model_quantize_internal: did not find weights for blk.92.ffn_up_shexp.weight
converting to iq5_ks .. size = 15.00 MiB -> 4.93 MiB
[1749/1759] blk.92.post_attention_norm.weight - [ 5120, 1, 1, 1], type = f32, size = 0.020 MB
[1750/1759] blk.92.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[1751/1759] blk.92.attn_k.bias - [ 1024, 1, 1, 1], type = f32, size = 0.004 MB
[1752/1759] blk.92.attn_k.weight - [ 5120, 1024, 1, 1], type = bf16, Using custom type iq6_k for tensor blk.92.attn_k.weight
====== llama_model_quantize_internal: did not find weights for blk.92.attn_k.weight
converting to iq6_k .. size = 10.00 MiB -> 4.14 MiB
[1753/1759] blk.92.attn_output.weight - [12288, 5120, 1, 1], type = bf16, Using custom type iq5_ks for tensor blk.92.attn_output.weight
====== llama_model_quantize_internal: did not find weights for blk.92.attn_output.weight
converting to iq5_ks .. size = 120.00 MiB -> 39.39 MiB
[1754/1759] blk.92.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
[1755/1759] blk.92.attn_q.bias - [12288, 1, 1, 1], type = f32, size = 0.047 MB
[1756/1759] blk.92.attn_q.weight - [ 5120, 12288, 1, 1], type = bf16, Using custom type iq5_ks for tensor blk.92.attn_q.weight
====== llama_model_quantize_internal: did not find weights for blk.92.attn_q.weight
converting to iq5_ks .. size = 120.00 MiB -> 39.42 MiB
[1757/1759] blk.92.attn_v.bias - [ 1024, 1, 1, 1], type = f32, size = 0.004 MB
[1758/1759] blk.92.attn_v.weight - [ 5120, 1024, 1, 1], type = bf16, Using custom type iq6_k for tensor blk.92.attn_v.weight
====== llama_model_quantize_internal: did not find weights for blk.92.attn_v.weight
converting to iq6_k .. size = 10.00 MiB -> 4.14 MiB
[1759/1759] blk.92.nextn.shared_head_norm.weight - [ 5120, 1, 1, 1], type = f32, size = 0.020 MB
If you see this line in the config.json for the model, the last layer is MTP and will not be used:
"num_nextn_predict_layers": 1,
I tried pruning the whole layer out, but then got complaints about missing tensors when loading the model, so I set it to Q2_K_S to simply downsize it. There was an aborted attempt to get MTP working in llama.cpp that did not pan out, so the layer is currently useless. There may be other ways to zap it (e.g., marking all of the layer's tensors as unused/empty in the config), but I didn't investigate further since the size overhead for Air was not too large.
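For anyone reproducing the downsizing step, a hedged sketch using ik_llama.cpp's llama-quantize with a --custom-q override (the same mechanism behind the "Using custom type" lines in the log above); the file names, the per-tensor type, and the base ftype are illustrative, not the exact recipe used here:

# Force every tensor in the unused MTP layer (blk.92 in full GLM-4.5,
# blk.46 in Air) down to a small type; other tensors follow the base ftype.
./llama-quantize \
  --custom-q "blk\.92\.=q2_K" \
  glm-4.5-bf16.gguf glm-4.5-quant.gguf IQ3_KS

Since the layer is never executed, the aggressive quantization costs nothing at inference time; it just shrinks the file.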