Appraisal
Hey Ubergarm,
Quick thank you for releasing this quant.
On my local benchmarks it's scoring equivalent to Unsloth's Q6_K on an RTX 5090, which has been critical for my own work.
It definitely allows a much larger context window for the same memory consumption, with quicker PP+TG.
Thanks again! Keep up the amazing work.
Thanks! That is amazing to hear!
Some folks have been requesting that I do an actual ik_llama.cpp quant as well, since the one I released was a mainline-compatible experiment.
I just released a good Qwen3.5-35B-A3B if you're interested in that, but it does require ik_llama.cpp: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF/blob/main/Qwen3.5-35B-A3B-IQ4_KS.gguf
(the model card has a quick start to run it; your 32GB VRAM is perfect for full 256k context and the mmproj with it)
Hi Ubergarm,
First of all thanks for your efforts and contributions!
I have a question regarding the IQ4_NL quant: you wrote that it requires ik_llama.cpp, but it seems to work on recent llama.cpp versions too, e.g. the one that ships with recent LM Studio. I read that mainline llama.cpp has supported the native IQ4_NL quant for a while, but has issues (or crashes) with CPU/RAM offloading and is not that fast.
Is that true?
Also, do the IQ*_K variants yield better quality (lower PPL) than IQ4_NL? I think I saw your graph for some other model where the *_NL quant had the lowest PPL at similar BPW, but the official page says the *_K types should offer the best quality.
Thanks!
Yes, mainline llama.cpp supports all of the quantization types used in this smol-IQ4_NL 15.405 GiB (4.920 BPW). Sorry for the confusion; typically I release mostly ik_llama.cpp-only quantizations, but recently I did some mainline-compatible quantizations as experiments (some types may be faster on the Mac or Vulkan backends).

In general, yes, the ik_llama.cpp-exclusive quantization types tend to have lower perplexity than similar-BPW quantizations from mainline llama.cpp. ik (ikawrakow), who implemented many of the mainline types, continued work in his own fork to provide improved types, which mainline unfortunately will not accept back.
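(As an aside on reading those numbers, here's a rough back-of-envelope check - just arithmetic on the size and BPW quoted above, and only approximate since BPW is averaged over all tensors, including any kept at higher precision:)

```python
# Rough relation: file bits ~= parameter count * average BPW.
size_gib = 15.405   # quant file size quoted above
bpw = 4.920         # average bits per weight quoted above

total_bits = size_gib * 2**30 * 8   # GiB -> bits
params = total_bits / bpw           # weights covered by those bits
print(f"~{params / 1e9:.1f}B weights")  # ~26.9B
```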
It is complex. If you are interested, I have a discussion video here: https://blog.aifoundry.org/p/adventures-in-model-quantization
For example, both IQ4_NL and IQ4_K are 4.5 BPW types implemented by ik. IQ4_NL uses one scale per block of 32 weights, while IQ4_K uses super-blocks of 256 weights with a per-super-block scale.
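To make that concrete, here's a minimal sketch of per-block non-linear 4-bit quantization in that spirit. The codebook below is illustrative only - the real IQ4_NL table lives in ggml-common.h as kvalues_iq4nl; these values just mimic its shape (denser near zero, sparser at the tails):

```python
import numpy as np

# Illustrative non-linear 4-bit codebook (16 entries), mimicking the shape
# of the real kvalues_iq4nl table in ggml-common.h.
CODEBOOK = np.array([-127, -104, -83, -65, -49, -35, -22, -10,
                       2,   13,  25,  38,  53,  69,  89, 113], dtype=np.float32)

BLOCK = 32  # IQ4_NL stores one scale per block of 32 weights

def quantize_block(w):
    """Map one block of weights to (scale, 4-bit codebook indices)."""
    scale = float(np.abs(w).max()) / float(np.abs(CODEBOOK).max())
    if scale == 0.0:
        scale = 1.0
    # Nearest codebook entry for each scaled weight. (Real quantizers also
    # search over candidate scales to minimize error; this picks one naively.)
    idx = np.abs(w[:, None] / scale - CODEBOOK[None, :]).argmin(axis=1)
    return scale, idx.astype(np.uint8)

def dequantize_block(scale, idx):
    return scale * CODEBOOK[idx]

w = np.random.randn(BLOCK).astype(np.float32)
scale, idx = quantize_block(w)
print("mean abs error:", np.abs(w - dequantize_block(scale, idx)).mean())
```

Roughly speaking, a super-block layout like IQ4_K's amortizes the scale storage: sub-block scales can be stored with fewer bits relative to one higher-precision scale per 256 weights, which is where some of the extra quality per bit comes from.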
Cheers!
@ubergarm
Great! Thanks very much for clarification.
Re point 3: Thanks, I will watch it - and hopefully understand something :)
Yeah, the more I learn about different quants - and the resulting perplexity/KLD vs. actual accuracy on established benchmarks - the more complicated and nuanced the topic becomes. For instance, I thought NVFP4 (for Blackwell) or MXFP4 would be the future for limited-VRAM GPUs, but it looks like they're usually much worse than FP8 for the time being. For now, it seems some good Q4_K* or Q5_K* (and IQ4, of course) quants can still yield significantly better accuracy overall. I'm sure you've seen this, but it's interesting:
https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-2-imatrix-works-very-well
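(Aside, since KLD comes up a lot in these comparisons: it's the KL divergence between the full-precision model's and the quantized model's next-token distributions, averaged over a test corpus. A minimal sketch of the measurement itself, with made-up logits standing in for real model outputs:)

```python
import numpy as np

def kl_divergence(logits_ref, logits_q):
    """KL(P_ref || P_quant) for one next-token distribution."""
    # Numerically stable log-softmax of both logit vectors.
    log_p = logits_ref - logits_ref.max()
    log_p -= np.log(np.exp(log_p).sum())
    log_q = logits_q - logits_q.max()
    log_q -= np.log(np.exp(log_q).sum())
    p = np.exp(log_p)
    return float((p * (log_p - log_q)).sum())

# Made-up logits standing in for an fp16 model and its quantized version:
ref   = np.array([2.0, 1.0, 0.5, -1.0])
quant = np.array([1.8, 1.2, 0.4, -0.9])
print(f"KLD = {kl_divergence(ref, quant):.5f}")  # small but nonzero
```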
BTW: what do you think about AWQ? It seems like a great idea, but unfortunately it's still not that popular. I guess mainly because it's not supported by llama.cpp etc.?
the more complicated and nuanced the topic becomes
yes it is a very fun rabbit hole! haha... There is a lot of noise, confusion, and misinformation on r/LocalLLaMA and HN too
Right, I have no idea how MXFP4 became popular, as the original PR that added it is very clear that it is only good for gpt-oss QAT models:
But don't get excited about using mxfp4 to quantize other models to fp4. The zero-bit mantissa in the block scales, along with the E2M1 choice for the 4-bit floats, results in a horrible quantization accuracy for the 4.25 bpw spent (about the same as IQ3_K), unless the model was directly trained with this specific fp4 variant (as the gpt-oss models).
https://github.com/ikawrakow/ik_llama.cpp/pull/682
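To spell out why the "zero-bit mantissa in the block scales" bites, here's a decode sketch of MXFP4 as I understand the OCP MX format (E2M1 elements sharing one power-of-two E8M0 scale per block of 32):

```python
# MXFP4 decode sketch (per the OCP Microscaling spec as I understand it):
# 4-bit E2M1 elements share one E8M0 (power-of-two only) scale per block of 32.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # all representable |x|

def decode_fp4(nibble: int) -> float:
    """Decode one E2M1 nibble: 1 sign bit + 3 bits selecting a magnitude."""
    sign = -1.0 if nibble & 0b1000 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0b0111]

def decode_block(nibbles: list[int], scale_exp: int) -> list[float]:
    """The E8M0 block scale is 2**scale_exp -- no mantissa bits to tune it."""
    scale = 2.0 ** scale_exp
    return [scale * decode_fp4(n) for n in nibbles]

print(decode_block([0b0001, 0b0111, 0b1011], scale_exp=-1))
# -> [0.25, 3.0, -0.75]: every weight in a block lands on one of just 15
#    grid points (0 plus +/-7 magnitudes) times a power of two, which is why
#    quantizing an existing model to MXFP4 hurts so much at 4.25 bpw unless
#    the model was trained against this grid (QAT, like gpt-oss).
```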
Oh yes, unsloth learns a lot from me, AesSedai, and bartowski, which informs and keeps their recipes improving hehe...
BTW: what do you think about AWQ
I don't use vLLM much, as it is more of a full-GPU-offload, multi-user-optimized environment. But a guy named Phaelon on the BeaverAI Discord (https://huggingface.co/BeaverAI) knows a lot about it. I believe there are some AWQ quant combinations, kernels, and calibration methods that can give decent results.