Re-converting the GGUF for MLA?
+1 to this please! A 16K context cache currently uses ~80GB of VRAM by default, while with MLA it would only take a few GB.
+1 Would be a game changer for long context work.
Is there a script available anywhere so we can dynamically quant this model with the latest changes ourselves, in case Unsloth has no plans to?
Bumping this request in case anyone at Unsloth sees this. I pulled the q4_K_M last night and realized when I ran it that the current GGUF doesn't support MLA. To give a comparison: the Unsloth DeepSeek R1 0528 q4_K_M requires around 16GB of KV cache for 32768 tokens, while this v3 0324 GGUF requires around 156GB of KV cache for the same 32768 tokens. This means that on the M3 Ultra Mac Studio, I can load up to a q5_K_M of R1 0528 at 32k context, while I might be able to squeeze in a q3 of v3 0324... maybe.
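For anyone who wants to sanity-check those numbers, here's a rough back-of-the-envelope estimate of both cache layouts, using the attention dimensions from DeepSeek-V3's config.json and assuming an fp16 cache. It's only a sketch; the exact llama.cpp allocation will differ a bit (compute buffers, cache quantization, etc.):

```python
# Rough KV-cache size estimate for DeepSeek-V3-class models, with and
# without MLA. Architecture constants are from DeepSeek-V3's config.json.
N_LAYERS = 61          # num_hidden_layers
N_HEADS = 128          # num_attention_heads
QK_NOPE_DIM = 128      # qk_nope_head_dim
QK_ROPE_DIM = 64       # qk_rope_head_dim
V_HEAD_DIM = 128       # v_head_dim
KV_LORA_RANK = 512     # kv_lora_rank (MLA compressed latent width)
BYTES = 2              # fp16 cache

def kv_bytes_no_mla(n_tokens: int) -> int:
    """Naive cache: full K and V kept for every head, every layer."""
    per_token_per_layer = N_HEADS * ((QK_NOPE_DIM + QK_ROPE_DIM) + V_HEAD_DIM)
    return n_tokens * N_LAYERS * per_token_per_layer * BYTES

def kv_bytes_mla(n_tokens: int) -> int:
    """MLA cache: one compressed latent plus the shared RoPE key part."""
    per_token_per_layer = KV_LORA_RANK + QK_ROPE_DIM
    return n_tokens * N_LAYERS * per_token_per_layer * BYTES

for n in (16_384, 32_768):
    print(f"{n:>6} tokens: "
          f"no-MLA ~{kv_bytes_no_mla(n) / 2**30:6.1f} GiB, "
          f"MLA ~{kv_bytes_mla(n) / 2**30:4.1f} GiB")
```

That works out to roughly 76 GiB at 16K and 152 GiB at 32K without MLA, versus about 1-2 GiB with MLA, which lines up with the ~80GB and ~156GB figures reported above.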
I'm going to try quantizing this myself to see if it improves the situation, but I thought I'd toss the request out here for others, unless there's a reason I'm missing why it wouldn't work.
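For reference, here's the rough workflow I'm planning to try, assuming a recent llama.cpp checkout whose convert_hf_to_gguf.py emits the MLA tensors; the paths and quant type below are just examples:

```bash
# Build a current llama.cpp (conversion script + quantize tool)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt
cmake -B build && cmake --build build --config Release

# Re-convert the original safetensors checkpoint to a full-precision GGUF
python convert_hf_to_gguf.py /path/to/DeepSeek-V3-0324 \
  --outfile deepseek-v3-0324-bf16.gguf --outtype bf16

# Quantize it down to q4_K_M
./build/bin/llama-quantize deepseek-v3-0324-bf16.gguf \
  deepseek-v3-0324-q4_K_M.gguf Q4_K_M
```

Fair warning: the bf16 intermediate for a 671B-parameter model is enormous (on the order of 1.3TB), so this needs a lot of scratch disk before you even get to quantizing.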