Re-converting the GGUF for MLA?
+1 to this please! A 16K context cache currently uses ~80GB of VRAM by default, while with MLA it would only take a few GB.
+1 Would be a game changer for long context work.
Is there a script available anywhere so we can dynamically quant this model with the latest changes ourselves, in case Unsloth has no plans to?
Bumping this request in case anyone at Unsloth sees this. I pulled the q4_K_M last night and realized when I ran it that the current GGUF doesn't support MLA. To give a comparison: the Unsloth DeepSeek R1 0528 q4_K_M requires around 16GB of KV cache for 32768 tokens, while this v3 0324 GGUF requires around 156GB of KV cache for the same 32768 tokens. This means that on the M3 Ultra Mac Studio, I can load up to a q5_K_M of R1 0528 at 32k context, while I might be able to squeeze in a q3 of v3 0324... maybe.
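For anyone who wants to sanity-check those numbers, here's a rough back-of-the-envelope estimate of both cache layouts, using the attention dimensions from DeepSeek-V3's config.json and assuming an fp16 cache. It's only a sketch; the exact llama.cpp allocation will differ a bit (compute buffers, cache quantization, etc.):

```python
# Rough KV-cache size estimate for DeepSeek-V3-class models, with and
# without MLA. Architecture constants are from DeepSeek-V3's config.json.
N_LAYERS = 61          # num_hidden_layers
N_HEADS = 128          # num_attention_heads
QK_NOPE_DIM = 128      # qk_nope_head_dim
QK_ROPE_DIM = 64       # qk_rope_head_dim
V_HEAD_DIM = 128       # v_head_dim
KV_LORA_RANK = 512     # kv_lora_rank (MLA compressed latent width)
BYTES = 2              # fp16 cache

def kv_bytes_no_mla(n_tokens: int) -> int:
    """Naive cache: full K and V kept for every head, every layer."""
    per_token_per_layer = N_HEADS * ((QK_NOPE_DIM + QK_ROPE_DIM) + V_HEAD_DIM)
    return n_tokens * N_LAYERS * per_token_per_layer * BYTES

def kv_bytes_mla(n_tokens: int) -> int:
    """MLA cache: one compressed latent plus the shared RoPE key part."""
    per_token_per_layer = KV_LORA_RANK + QK_ROPE_DIM
    return n_tokens * N_LAYERS * per_token_per_layer * BYTES

for n in (16_384, 32_768):
    print(f"{n:>6} tokens: "
          f"no-MLA ~{kv_bytes_no_mla(n) / 2**30:6.1f} GiB, "
          f"MLA ~{kv_bytes_mla(n) / 2**30:4.1f} GiB")
```

That works out to roughly 76 GiB at 16K and 152 GiB at 32K without MLA, versus about 1-2 GiB with MLA, which lines up with the ~80GB and ~156GB figures reported above.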
I'm going to try quantizing this myself to see if it improves the situation, but I thought I'd toss the request out here for others, unless there's a reason I'm missing why it wouldn't work.
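For reference, here's the rough workflow I'm planning to try, assuming a recent llama.cpp checkout whose convert_hf_to_gguf.py emits the MLA tensors; the paths and quant type below are just examples:

```bash
# Build a current llama.cpp (conversion script + quantize tool)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt
cmake -B build && cmake --build build --config Release

# Re-convert the original safetensors checkpoint to a full-precision GGUF
python convert_hf_to_gguf.py /path/to/DeepSeek-V3-0324 \
  --outfile deepseek-v3-0324-bf16.gguf --outtype bf16

# Quantize it down to q4_K_M
./build/bin/llama-quantize deepseek-v3-0324-bf16.gguf \
  deepseek-v3-0324-q4_K_M.gguf Q4_K_M
```

Fair warning: the bf16 intermediate for a 671B-parameter model is enormous (on the order of 1.3TB), so this needs a lot of scratch disk before you even get to quantizing.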