IQ2_KS

#6
by gghfez - opened

@ubergarm Given the architectures are the same, if I grab your IQ2_KS 203.553 GiB (2.602 BPW) "secret cookbook" from ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF and run it on this model, would that work?

I've just upgraded to 6x3090 + 192GB RAM and the IQ2_KS is flying!

@gghfez Yes, and please let @Panchovix know, as I believe he is interested in an updated V3-0324-GGUF.

I lost my BF16s when swapping over to a different server and didn't have them backed up, so I haven't gone back to update this older model with my new recipes.

It should work fine though. Please upload to Hugging Face if you like, with the tag ik_llama.cpp so folks can find it!

Feel free to use my imatrix here as well to save some time.
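For reference, applying a recipe is basically one llama-quantize call. Here's a rough sketch assuming ik_llama.cpp's --custom-q per-tensor override syntax; the rules and file names below are illustrative placeholders, not the actual R1T2-Chimera recipe:

```bash
#!/usr/bin/env bash
# Sketch only: per-tensor rules and paths are placeholders, not the real recipe.
custom="
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks
"
# collapse the newline-separated rules into the comma-separated list --custom-q expects
custom=$(echo "$custom" | tr '\n' ',' | sed 's/^,*//;s/,*$//')

./build/bin/llama-quantize \
    --imatrix imatrix-DeepSeek-V3-0324.dat \
    --custom-q "$custom" \
    DeepSeek-V3-0324-BF16.gguf \
    DeepSeek-V3-0324-IQ2_KS.gguf \
    IQ2_KS 24
```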

Finally, the KT quants can be an interesting option if you can fully offload onto VRAM.

If you do V3-0324 it would be great! I have been waiting for a recent quant, but I don't have the storage to quant it haha

Assuming these missing attn weights aren't a problem, and if the AWS spot instance price doesn't kick me out, I'll upload it to HF.

[screenshot of the llama-quantize log showing "did not find weights for" warnings]

Assuming these missing attn weights aren't a problem, and if the AWS spot instance price doesn't kick me out, I'll upload it to HF.

It has to do with imatrix and the MLA tensors. Ideally it would only complain "did not find weights for" on the token embedding, the final output (maybe), and the attn_kv_b tensor. It is probably an issue if it is complaining about the attn_k_b or attn_v_b tensors.

I have not had time to go back and re-make my imatrix for Kimi-K2 or the older DeepSeeks to explore it fully; however, ik seems to think it's working on the DeepSeek-V2-Lite test model, so I'm not sure what the issue is.

I plan to open an issue on ik_llama.cpp about this, haven't done so yet, here is the old discussion:
https://github.com/ikawrakow/ik_llama.cpp/pull/642#issuecomment-3109818995

Hopefully after I upload some new quants at ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF I can finally dig out of my backlog of issues and messages haha

Thanks, and please link your model if you get it uploaded; I might be able to test its perplexity, etc.

@ubergarm It seems to work on my rig. I haven't tested it extensively.

gghfez/DeepSeek-V3-0324-IQ2_KS
Haven't had time to do the model card properly.

I also uploaded the converted BF16 weights here (the "FP8" in the .gguf file names was an accident).

@gghfez thanks for picking that one up!

I just opened more discussion about MLA tensors and imatrix here: https://github.com/ikawrakow/ik_llama.cpp/issues/651. Check your llama-quantize log to see if it is getting imatrix data for the attn_k_b and attn_v_b tensors. If not, you can juice them up to Q8_0 for a minimal size increase but slightly slower TG.
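For example, forcing just those two tensor families to q8_0 would look roughly like this (same assumed --custom-q syntax as the sketch above; paths are placeholders):

```bash
# Forces the MLA split tensors to q8_0 so missing imatrix data for them
# matters less, at a small size cost and slightly slower TG.
./build/bin/llama-quantize \
    --custom-q "blk\..*\.attn_k_b\.weight=q8_0,blk\..*\.attn_v_b\.weight=q8_0" \
    --imatrix imatrix-DeepSeek-V3-0324.dat \
    DeepSeek-V3-0324-BF16.gguf DeepSeek-V3-0324-IQ3_KS.gguf IQ3_KS 24
```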

I only got the tail end of the log in my terminal history, but yeah, it's missing weights for attn_k_b:

# cat logz1.txt |grep '=='
====== llama_model_quantize_internal: did not find weights for blk.57.attn_k_b.weight
====== llama_model_quantize_internal: did not find weights for blk.58.attn_k_b.weight
====== llama_model_quantize_internal: did not find weights for blk.59.attn_k_b.weight
====== llama_model_quantize_internal: did not find weights for blk.60.attn_k_b.weight
====== llama_model_quantize_internal: did not find weights for output.weight

That's for an IQ3_KS I made after this IQ2_KS (I think I saw @Panchovix say he wanted this size somewhere)

So does this mean those weights won't have the imatrix applied? (I'm wondering why the model appears to be coherent).
The IQ3_KS is smarter than the IQ2_KS (gets one of my tricky private benchmark questions right consistently)

@gghfez

So does this mean those weights won't have the imatrix applied?

Yes, unfortunately, that is my understanding. I will have to look into it more and try to find the right imatrix command to save data for all those tensors. I'm still trying to figure it out. There is also the recent mainline llama.cpp PR by compilade updating imatrix that I could test to compare what is going on.
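For what it's worth, the kind of command I have in mind is roughly this. Treat it as a guess, not a verified fix: whether running imatrix with ik_llama.cpp's -mla mode actually collects data for the split attn_k_b/attn_v_b tensors is exactly the open question, and the file names are placeholders:

```bash
# Hypothetical: compute the imatrix with MLA enabled (-mla is ik_llama.cpp's
# MLA mode switch) in the hope that the attn_k_b / attn_v_b tensors get
# exercised during the forward pass and therefore collect data. Unverified.
./build/bin/llama-imatrix \
    -m DeepSeek-V3-0324-BF16.gguf \
    -f calibration_data_v5_rc.txt \
    -o imatrix-DeepSeek-V3-0324.dat \
    -mla 1 -c 512
```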

(I'm wondering why the model appears to be coherent).

If you are keeping those attn_k_b and attn_v_b tensors above ~5 BPW, then imatrix is less of a big deal.

The IQ3_KS is smarter than the IQ2_KS

Yes, generally the more BPW distributed throughout the model will improve (lower) perplexity and give a smarter model. Of course, the larger size is harder to fit into RAM/VRAM and also gives slower TG, since TG is mostly memory-bandwidth limited by the size of the active weights.
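As a rough back-of-the-envelope example (assuming DeepSeek-V3's ~37B active parameters per token and a single 3090's ~936 GB/s; this ignores KV cache reads, attention compute, and how the tensors actually split across 6 cards):

```bash
# Back-of-the-envelope TG ceiling: active-weight bytes read per token vs. bandwidth.
awk 'BEGIN {
    active = 37e9        # DeepSeek-V3 active params per token (assumption)
    bpw    = 2.602       # bits per weight, e.g. the IQ2_KS above
    bw     = 936e9       # RTX 3090 memory bandwidth in bytes/s (assumption)
    bytes  = active * bpw / 8
    printf "%.1f GB read per token -> ceiling ~%.0f tok/s per 3090\n",
           bytes / 1e9, bw / bytes
}'
# prints: 12.0 GB read per token -> ceiling ~78 tok/s per 3090
```

Real multi-GPU numbers will differ since the layers are spread across cards, but it shows why BPW trades directly against TG speed.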

Yes, generally the more BPW distributed throughout the model will improve (lower) perplexity and give a smarter model.
Yeah, I get that. I just wasn't expecting such an obvious difference given the gap in the perplexity numbers. I'm used to comparing e.g. Q5 vs Q8.

I like the old DeepSeek models + wanted to play with the R1-Zero model, so I've uploaded BF16 GGUFs and imatrix files for them here in case it's helpful:

collections/gghfez/deepseek-moe-bf16-for-ik-llama-quants

  • DeepSeek-V3-0324-256x21B-BF16 - literally your imatrix file
  • DeepSeek-R1-Zero-256x21B-BF16 - generated imatrix with your calibration v5 data + script
  • DeepSeek-R1-0528-256x21B-BF16 - literally your imatrix file
  • DeepSeek-R1-OG-256x21B-BF16 - generated imatrix with your calibration v5 data + script

Also IQ2_KS and IQ3_KS quants here:
collections/gghfez/ik-llama-quants

  • R1-Zero is q8_0 for attn_k_b and attn_v_b
  • R1-0528 is q8_0 for attn_k_b and attn_v_b
  • R1-OG will be uploaded when it's finished with q8_0 for attn_k_b and attn_v_b
  • V3 is Q5_0 for attn_k_b and attn_v_b. Not sure if I'll redo it with q8_0 weights (cost is building up!), but if I do I'll push it to the same repo.

I'll fix/add model cards later. Oh, I also changed the output weights to q8_0 because I noticed a difference at longer contexts on other models.
I'm very happy with the R1-0528-IQ2_KS after using it for a day.

Just passing by to say thanks! Using V3 IQ3_KS and liking it a lot so far.
