IQ3_K Results inside!

#3
by phakio - opened

As always, just posting some fresh results with my system. 96GB VRAM, 256GB DDR5

3x3090
1x4090
Intel QYFS Sapphire Rapids ES CPU

I really think this model has improved output compared to the previous deepseek v3.1. If you remember, I wasn't too fond of that model, it felt "off". It feels as though it got some of its personality back, and genuinely feels a lot more fun to interact with in return. It was able to adhere to, and accurately reference, long contexts, and is fairly quick too!

If this is the final deepseek v3 family model, they really sealed it with the best this architecture can provide, imo

@ubergarm , thanks for the quick quantizations. always feel like a bum asking for your quants but nobody does it better than you! Every time I see a new model released I'm spam refreshing your huggingface... my friends saw me with your profiles tab open and they're like "who tf is that" lmao I had to explain to them you're somewhat of a legend.

looking forward to what the rest of 2025 has in store for us opensource nerds... and (im)patiently waiting for qwen3-next support in llama.cpp, I have a feeling next gen qwen will be no joke!


(screenshot of results)


necro update - 11/6/25

Gave this model another shot with IQ5_KS, after upgrading my system's RAM to accommodate bigger models.
Solid performance, noticeable accuracy increase, 3 t/s slower speeds, but the accuracy tradeoff is very much worth it in my opinion. I used to be fine with lower quants, but after running these with near-identical PPL to the full unquantized chungus model, I don't mind twiddling my thumbs waiting for a response!

@phakio

Thanks for the report, and cool that your vibe check suggests improved "personality"! Now that I've got the quants out I'll play around with it some more too. Your speeds look good!

One thing to double check: if you downloaded the IQ3_K in the first 30 minutes of release, the first GGUF had some issues due to my missing a single `.` in the recipe regex, leading to sub-par small tensors in the first couple of dense layers. I fixed it and re-uploaded ASAP, but you can confirm your file's sha256sum matches:

$ sha256sum IQ3_K/DeepSeek-V3.1-Terminus-IQ3_K-00001-of-00007.gguf
5ab908f8cb0f109dc9a01f4607ecdcc77c31c3130420f0f779b4b09817841fa5

Interestingly, the perplexity on Terminus is slightly higher ("worse") across all the quants and the baseline q8_0 than on the earlier V3.1. Supposedly they did some more fine-tuning for "agentic" data/use and for reducing language mixing. But you can't really compare perplexity across different models or draw strong conclusions.

Oh, one question: how are you running it with ik_llama.cpp regarding stuff like --jinja or --chat-template-kwargs '{"thinking": false }'? Maybe you have your own jinja template you use? I noticed this morning, for the first time playing with it a little, that it seems to take a while to answer, as if it was actually thinking but not returning the content, depending on how I run it.

Anyway, much thanks, you rock, and I appreciate a fun community working together to try all this wild stuff out! Go team! πŸ™πŸ™ŒπŸ˜‹

/home/phone/Documents/ik_llama.cpp/build/bin/llama-server \
    --model /home/phone/Downloads/LocalModels/DeepSeek-V3.1-Terminus-IQ3_K-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-V3.1-Terminus-GGUF \
    --ctx-size 20000 \
    -ctk q8_0 \
    -amb 512 \
    -mla 3 -fa -fmoe \
    -ngl 99 \
    -ot "blk\.(2|3|13|14)\.ffn_.*=CUDA0" \
    -ot "blk\.(1|4|5|6)\.ffn_.*=CUDA1" \
    -ot "blk\.(0|7|8|9)\.ffn_.*=CUDA2" \
    -ot "blk\.(10|11|12)\.ffn_.*=CUDA3" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 48 \
    --threads-batch 56 \
    --host 0.0.0.0 \
    --port 8081 \
    --no-mmap

My startup command is the one I use for Kimi K2 but slightly modified; I found I can fit the smaller first few layers (0, 1, 2) across all cards instead of shoving them all on CUDA0. This prevents an OOM error, because GPU0 is my main GPU and already has about 1GB in use just idling from running the display.

I'm actually running the model raw, no jinja business, and for my use case (summarizing research papers, extracting info, data comparison, etc.) it's good. I'm sure for coding or real agentic use my setup is sub-optimal, but right now it works for me.
EDIT: by not specifying any jinja it seems to default to non-thinking and answers promptly with no delay, so I'm not experiencing that issue you're talking about. (I did notice that issue when testing Qwen3.5 next on some other inference engines, like vLLM, but that's totally unrelated, and quite frankly I didn't have a good time with vLLM.)

That's part of the reason I enjoy open source models so much: we can fine-tune and really bend a model to our specific use case and have it work best for us...

Side note: is there any way to force the context cache onto one card? I'm thinking something like having one GPU store 24GB of context, while having JUST the model on the other 3. As far as I know, right now the cache is spread across all cards after the model is done loading...

I'm just commenting here to see how you go with putting the context cache on a single GPU ;)


Right? I mean it seems possible, but maybe due to how the model is distributed it's difficult to implement; I'm unsure if this is even a feature in llama.cpp... Likewise, if we could somehow load the model into system RAM and let the GPUs solely deal with prompt processing and context management, I wonder how that could benefit workloads with large context... like a 240GB model in system RAM and 96GB VRAM for fast context processing.

We might see a scene shift towards this once models are better at dealing with and referencing larger context. I only deal with this stuff at the user-end level; I'm sure someone who works on the actual algorithms knows more about this than I do and will tell me why this isn't possible, or the technical limitations of it! I'm curious to learn more.

Yeah. I was trying to do this at the start of the year with the rpc-server to reduce network traffic.
I think what you want is --tensor-split.
The following is from memory when I experimented at the start of the year, could be wrong:
Say you have 4 GPUs. If you do

-ts 1,0,0,0 -- that would put everything on your first GPU
And that's equivalent to -ts 1

-ts 0,0,0,1 would put everything on your last GPU

And you can control the ratio, I recall having something like
-ts 0,0,20,20,20,21 or something at one point (because the first 2 devices were RPC servers).
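
Putting that together, a rough sketch of full invocations (model path is a placeholder, and again this is from memory, so double-check against --help):

./build/bin/llama-server -m model.gguf -ngl 99 -ts 1,0,0,0      # everything on the first GPU
./build/bin/llama-server -m model.gguf -ngl 99 -ts 0,0,0,1      # everything on the last GPU
./build/bin/llama-server -m model.gguf -ngl 99 -ts 32,24,24,24  # uneven ratio, e.g. roughly proportional to each card's VRAM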

But I stopped experimenting with it a while ago because if it actually helped, smarter people like ubergarm would be doing it / advocating it.

Agreed about this one getting some of the fun personality back

...But I'll keep it simple for this version.

Let me code this piece of shit now.</think>

I noticed this morning, for the first time playing with it a little, that it seems to take a while to answer, as if it was actually thinking but not returning the content, depending on how I run it.

Yeah I get the same issue if I use --jinja. Also noticed it complains about double-BOS if I use my standard /v1/completions scripts. Haven't had a chance to tweak all that yet though.

Thanks for the tip, I figured out how to enable/disable thinking and why it was taking a long time before replying when using --jinja.

Basically, if you use only --jinja it seems to default to formatting the thinking traces into message.reasoning_content, so they were invisible in my simple streaming Python chat client. So there are some new (to me) options that can help.

Here are the main ways to use it as I understand now:

  1. Like @phakio says, by not specifying any jinja it seems to default to non-thinking and answers promptly with no delay; this is probably how I will use it myself.

  2. You can enable --jinja and specify a different .jinja template if you like (e.g. download one from another quant etc), then pick where you want the thinking traces to go in the response and also set the budget (-1 = thinking and 0 = no thinking), e.g.

    --jinja \
    --reasoning-format none \
    --reasoning-budget -1 \

Thanks for helping me figure out these new (to me) command arguments! Some more details in the PRs here.

NOTE: The behavior will also likely change depending on whether you're hitting the newer OpenAI-compliant /v1/* endpoints or the older non-/v1/* endpoints, I believe, but you might have to experiment depending on what kind of response your client is expecting haha..πŸ˜…
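
For example, this is roughly how I poke at the newer OpenAI-style chat endpoint with curl to see where the thinking ends up (port and prompt are placeholders, and the field names are just what I'm seeing with --jinja, so treat it as a sketch for your own setup):

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "hello, are you thinking right now?"}]}' \
  | python3 -m json.tool
# with the default/auto reasoning format the thinking shows up under
# choices[0].message.reasoning_content and the final answer under choices[0].message.content;
# with --reasoning-format none I believe the raw thinking stays inline in content instead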

is there any way to force the context cache onto one card?

As far as I know there is not, and I don't believe there would be any benefit. There is some old discussion on a similar concept, though, about NUMA node placement of the kv-cache for CPU-only inferencing situations: https://github.com/ggml-org/llama.cpp/discussions/11733#discussioncomment-12343078

if we could somehow load the model into system RAM and let the GPUs solely deal with prompt processing and context management, I wonder how that could benefit workloads with large context... like a 240GB model in system RAM and 96GB VRAM for fast context processing.

This is more or less already the strategy as I understand it. e.g.

  1. Pick the kv-cache size / context length you need for the application, keeping in mind anything over 64-128k is probably not very good anyway for most models today.
  2. Offload the entire kv-cache onto GPU, as this is very important for speed.
  3. Next priority is to offload the always-active parameters to GPU, e.g. attn / first N dense layers / shared expert.
  4. Any remaining space can be used for the sparse routed MoE experts (this is what manually overriding some layers to CUDAx before -ot exps=CPU does when placed at the end). This doesn't help as much, since only 8 of the ~128+ routed experts are used per token, and probabilistically the ones you put on GPU aren't even being used most of the time. This is why buying more fast RAM is generally a safer, cheaper bet than getting enough VRAM to hold the entire MoE model, of which only, say, 37B of 671B parameters are active at a given time while the rest of the memory sits idle. (See the sketch just below.)
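
To make that concrete, a stripped-down sketch of how those priorities map onto flags; the model path, context size, and layer numbers here are just placeholders to adapt to your own VRAM:

# 1. only as much context as the application actually needs
# 2. kv-cache on GPU, kept small with MLA + q8_0
# 3. -ngl 99 puts attn / first dense layers / shared expert on GPU
# 4. leftover VRAM gets a few routed-expert layers, the rest stay on CPU/RAM
./build/bin/llama-server \
    --model /path/to/model-00001-of-0000N.gguf \
    --ctx-size 32768 \
    -ctk q8_0 -mla 3 -fa -fmoe \
    -ngl 99 \
    -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    --threads 32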

If for some odd reason you used up all your VRAM just for a very large context kv-cache, then just don't offload any layers (e.g. -ngl 0); all the model weights would go to CPU/RAM and GPU/VRAM would be used only for kv-cache stuff.

I hope that makes sense though I'm sure some visuals in a 20 minute youtube video would be a better place to discuss this lol...

Yes, you can play with -ts 1,0,0,0, which is just a ratio; the most convenient way to use it is to use the VRAM amount for each GPU, e.g. for a 5090 and two 3090s you could use -ts 32,24,24, or any ratio you want. I advise against using this when you're using -ot, as with -ot you have more fine-grained control over each tensor's/layer's placement.

If you really want to try to min-max this stuff, I'd suggest trying to keep the layers in order on the same device before going to the next CUDA device, to avoid potentially more round-trips across the PCIe bus between GPUs, but really I'm not sure it makes a difference. The main thing is to keep those always-active parameters on GPU, so never put the attn / first N dense layers / shexp onto CPU if you have the VRAM.
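
Something like this for the -ot part, keeping consecutive routed-expert layers together per card (layer numbers are just placeholders):

-ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" \
-ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
-ot "blk\.(11|12|13|14)\.ffn_.*=CUDA2" \
-ot exps=CPU \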

Okay I'll stop rambling before I confuse myself even more haha... Cheers!

First off, thank you for your amazing quants and contributions... who'd have thought, just a couple of years back, that we'd be running these SOTA behemoths locally, at "half-decent" speeds along with outstanding PPL/KLD figures to boot.

I can share a few results for those still considering going X99/C612 for big MoEs - which IS viable but really... only for a couple of users AND with decent GPU help. And if you're going to pull the trigger, keep in mind that you'll be heavily compute bound (the low "AVX2 base clock" of the E5-2695 v4, 2.4 GHz, really hurts throughput... you'd want, say, a 2699A v4 for a non-negligible boost during AVX2 workloads). That's also the reason why I tried to keep as many tensors as possible on VRAM in spite of using -fmoe (gate and up take less space than down with this quant).

There are more bottlenecks to take into account (e.g. PCIe 3.0 hurts PP a lot with such a massive CPU offload), but with that said, it's still usable - again, as a home assistant with a few patient users.

(System: ASRock X99 Extreme4, E5-2695 v4, 256GB ECC 2400, 2x 3090s @ 3.0 x16 & 3x 3080s @ 2.0 x1 via splitter, Ubuntu Server 24.04.3)

  • R1-0528 IQ3_K_R4 (301 GiB):

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1344, flash_attn = 1, n_gpu_layers = 99, n_threads = 17, n_threads_batch = 17

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1344 | 336 | 0 | 34.871 | 38.54 | 63.947 | 5.25 |
| 1344 | 336 | 1344 | 35.049 | 38.35 | 64.246 | 5.23 |
| 1344 | 336 | 2688 | 35.257 | 38.12 | 64.941 | 5.17 |
| 1344 | 336 | 4032 | 35.510 | 37.85 | 65.162 | 5.16 |
| 1344 | 336 | 5376 | 34.749 | 38.68 | 65.934 | 5.10 |
| 1344 | 336 | 6720 | 34.966 | 38.44 | 65.957 | 5.09 |
| 1344 | 336 | 8064 | 35.247 | 38.13 | 66.578 | 5.05 |
| 1344 | 336 | 9408 | 35.443 | 37.92 | 66.824 | 5.03 |
| 1344 | 336 | 10752 | 35.646 | 37.70 | 67.514 | 4.98 |
| 1344 | 336 | 12096 | 35.851 | 37.49 | 67.690 | 4.96 |
| 1344 | 336 | 13440 | 36.058 | 37.27 | 68.320 | 4.92 |
| 1344 | 336 | 14784 | 36.271 | 37.05 | 68.450 | 4.91 |
| 1344 | 336 | 16128 | 36.974 | 36.35 | 72.078 | 4.66 |
| 1344 | 336 | 17472 | 36.929 | 36.39 | 69.328 | 4.85 |
| 1344 | 336 | 18816 | 37.196 | 36.13 | 69.955 | 4.80 |
| 1344 | 336 | 20160 | 37.364 | 35.97 | 70.101 | 4.79 |
| 1344 | 336 | 21504 | 37.631 | 35.72 | 70.912 | 4.74 |
| 1344 | 336 | 22848 | 37.844 | 35.51 | 71.018 | 4.73 |
| 1344 | 336 | 24192 | 38.058 | 35.31 | 71.650 | 4.69 |
| 1344 | 336 | 25536 | 38.261 | 35.13 | 71.821 | 4.68 |
| 1344 | 336 | 26880 | 38.481 | 34.93 | 72.538 | 4.63 |
| 1344 | 336 | 28224 | 38.698 | 34.73 | 72.730 | 4.62 |
| 1344 | 336 | 29568 | 38.946 | 34.51 | 73.451 | 4.57 |
| 1344 | 336 | 30912 | 39.121 | 34.35 | 73.667 | 4.56 |
  • V3.1-Terminus IQ3_K w/ -rtr (291 GiB):

main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 1344, flash_attn = 1, n_gpu_layers = 99, n_threads = 17, n_threads_batch = 17

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1344 | 336 | 0 | 33.133 | 40.56 | 69.160 | 4.86 |
| 1344 | 336 | 1344 | 33.282 | 40.38 | 69.384 | 4.84 |
| 1344 | 336 | 2688 | 33.478 | 40.15 | 70.053 | 4.80 |
| 1344 | 336 | 4032 | 33.722 | 39.85 | 70.240 | 4.78 |
| 1344 | 336 | 5376 | 33.986 | 39.55 | 70.979 | 4.73 |
| 1344 | 336 | 6720 | 34.185 | 39.32 | 71.128 | 4.72 |
| 1344 | 336 | 8064 | 34.472 | 38.99 | 71.739 | 4.68 |
| 1344 | 336 | 9408 | 34.684 | 38.75 | 71.919 | 4.67 |
| 1344 | 336 | 10752 | 34.933 | 38.47 | 72.736 | 4.62 |
| 1344 | 336 | 12096 | 35.150 | 38.24 | 72.908 | 4.61 |
| 1344 | 336 | 13440 | 35.273 | 38.10 | 73.445 | 4.57 |
| 1344 | 336 | 14784 | 35.505 | 37.85 | 73.708 | 4.56 |
| 1344 | 336 | 16128 | 37.158 | 36.17 | 74.455 | 4.51 |
| 1344 | 336 | 17472 | 37.197 | 36.13 | 74.575 | 4.51 |
| 1344 | 336 | 18816 | 37.421 | 35.92 | 75.183 | 4.47 |
| 1344 | 336 | 20160 | 37.614 | 35.73 | 75.358 | 4.46 |
| 1344 | 336 | 21504 | 37.856 | 35.50 | 76.177 | 4.41 |
| 1344 | 336 | 22848 | 38.052 | 35.32 | 76.349 | 4.40 |
| 1344 | 336 | 24192 | 38.280 | 35.11 | 76.994 | 4.36 |
| 1344 | 336 | 25536 | 38.535 | 34.88 | 77.148 | 4.36 |
| 1344 | 336 | 26880 | 38.739 | 34.69 | 77.822 | 4.32 |
| 1344 | 336 | 28224 | 38.962 | 34.50 | 77.988 | 4.31 |
| 1344 | 336 | 29568 | 39.161 | 34.32 | 78.623 | 4.27 |
| 1344 | 336 | 30912 | 38.347 | 35.05 | 78.743 | 4.27 |
  • Launch command (same as llama-server): (*)

export CUDA_VISIBLE_DEVICES=0,1,2,3,4
./ik_llama.cpp/build/bin/llama-sweep-bench \
--alias MODEL \
--model /media/nvme/Inference/GGUF/MODEL.gguf \
--ctx-size 32768 --warmup-batch --slot-save-path /media/SataSSD/Inference/MemSlots \
-mla 3 -fa -ctk q8_0 --no-mmap \
-fmoe -amb 256 (-rtr) \
--jinja --reasoning-format auto \
-ts 45,48,0,0,0 \
-ngl 99 \
-sm layer \
-ot "blk.(4|5|6).ffn_(gate|up)exps*=CUDA0" \
-ot "blk.(7|8|9).ffn
(gate|up)exps*=CUDA1" \
-ot "blk.(17|18|19).ffn
(gate|up)exps*=CUDA2" \
-ot "blk.(27|28|29).ffn
(gate|up)exps*=CUDA3" \
-ot "blk.(37|38|39).ffn
(gate|up)_exps*=CUDA4" \
-ot exps=CPU \
--parallel 1 \
--threads 17 \
--host 0.0.0.0 \
--port 8080 \
--ubatch-size 1344 \
--batch-size 2048

Interesting to note that Terminus is significantly slower with the exact same settings even if 10 GiB smaller; most likely due to iq6_K tensors as opposed to q8_0 ones in R1-0528. Also, only 4 MiB left on each 3080 and barely more on the 3090s so a higher ubatch is not really an option if targeting the fastest TG possible. (**)

(*) I usually don't quantize the K(V) cache, which is also 2-3% faster in this setup, but did use q8_0 here since it's common practice.

(**) For 32k context, using a single 3090 for the cache and first 3 layers is indeed faster (smaller compute buffers overall so more tensors on GPUs) but this split is much more flexible when dealing with longer context (64k+) and staying within RAM constraints.

@VoidStare (I'm u/VoidAlchemy on reddit hah!)

Thanks for the report, and what an interesting rig and configuration! Guessing this is one of the lower overall cost systems to do this. Glad you're able to enjoy running these quants of big models!

I tried to keep as many tensors as possible on VRAM in spite of using -fmoe (gate and up take less space than down with this quant).

Yeah, I see you're splitting -ot "blk.(4|5|6).ffn_(gate|up)_exps*=CUDA0" \ which is unusual. Yeah, -fmoe fuses (gate|up) so I think it is okay. I don't think in terms of how many tensors I can offload, just how many GB I can offload, so I don't bother getting so fine-grained. But I suppose with enough GPUs folks like to pack each one as much as possible, e.g. the "zero-one knapsack" problem haha...

Interesting to note that Terminus is significantly slower with the exact same settings even if 10 GiB smaller; most likely due to iq6_K tensors as opposed to q8_0 ones in R1-0528.

Yeah, I've fiddled a bit with the recipes. Correct, q8_0 takes less CPU than iq6_k for PP, though in theory the smaller iq6_k should be a bit faster for TG; unfortunately it isn't working out that way, I guess, for your exact config.

a higher ubatch is not really an option

I'm guessing you experimented to find exactly -ub 1344? I'm superstitious and keep it a power of 2, and try to keep the number of threads a power of 2 or at least a combination of powers of 2 lol... Your CPU has 18 cores, but you use 17 (leaving one for kswapd or whatever else I'm guessing?).

I usually don't quantize the K(V) cache, which is also 2-3% faster in this setup, but did use q8_0 here since it's common practice.

Right, it is common to do this to save VRAM, and it has been shown with llama-perplexity to take barely any hit to perplexity, likely not even noticeable in practice. Some models, like GLM, run quite a bit faster with unquantized f16 kv-cache when offloading it all to new-ish GPUs, as f16 is faster than q8_0 on GPU (again, likely on newer >= sm86 arch). Since q8_0 is faster for you anyway, then yes, definitely worth doing.

-ts 45,48,0,0,0 \

So does this really push all the kv-cache onto the first two cards, but -ot puts stuff on the other cards? Because @gghfez mentioned it too, but I don't mess with -ts except for dense models fully offloaded on multiple GPUs. Is the thinking to force the kv-cache onto the faster 3090s and it doesn't mess with -ot? Interesting

Otherwise, I've heard from some folks that going from -amb 256 to something a little higher, e.g. 512 or 1024, improved speeds just a little, but I believe they were on very new GPUs (and had VRAM to spare hah).

Finally, makes me want to try the "troll rig" strategy again on my local gaming rig stuffing a big 200+GB model into 96GB RAM + 24 GB VRAM and thrashing the page cache with mmap() off of a Gen 5 NVMe SSD e.g. T700 2TB. I doubt I could hit 5 tok/sec though...

Finally, in the "old days" (a few weeks ago lol) people would use -ser 7,1 or -ser 6,1 to reduce the number of active routed exps per token and get proportionally better TG speeds at some cost to perplexity (quality). I believe it is no longer supported though might still exist hah. But u might be able to still do it with overriding kv configs e.g. llama-server ... --override-kv deepseek2.expert_used_count=int:7 ......

Lastly, have you done the undervolt/overclock on your GPUs? It likely won't help speeds much on a big MoE but could lower power draw / fan noise. Some notes on that here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-0905-GGUF/discussions/1#68caf53133c4fdeba5ebebf1

It definitely helps for dense models or ComfyUI image / video gen in my experience.

Cheers!

I still check old.reddit.com but it's been ages since I last logged in, blame YC or whatever. :)

I acknowledge that splitting layers isn't ideal, but with 256 GiB RAM and 78 GiB VRAM... every MiB counts. I'm even considering giving Vulkan a go since, for CUDA to run, each GPU "wastes" a sizeable amount of VRAM (over 2 GiB total in my case). The reason I'm trying to optimize both VRAM and tensor count is that I'm severely compute bound here, so every CPU cycle saved is a net win in spite of the overhead. And yeah, I also acknowledge that I have the time, for whatever reasons, to try and squeeze every byte I can out of this "ancient" rig... It is time-consuming, no joke. And it does need a spare core for consistent results, especially since I disabled HT from the BIOS.

If I understood your recipes correctly, the iq6_K tensors still reside on the 3090s, but regardless, it's a bit of a moot point since, as you correctly point out, every setup is different and has its own quirks. I should know. And it makes sense, really: even if computed on the 3090s, a q8_0 is so much faster that it should make some difference. As for why I used that seemingly random -ts... mostly because I noticed that about 300 MiB was the most I could "move around" between GPUs, and given the compute buffers' size and KV cache, that particular split allows me to max out the VRAM without OOM'ing. Same with the 1344 ubatch size, mostly so the 3080s stay at 9.996 GiB out of 10.000 GiB. :D

As for -amb, I've gone as low as 128 MiB without noticing any slowdowns, but then again, this setup is pretty underwhelming in the grand scheme of things, so it also makes sense when dealing with 5 t/s, at best. But I keep it at 256 for good measure; the VRAM hit is rather small. And yes, using a -ts like X,0,0,0 behaves as if you only had one GPU (dense layers, KV cache, compute buffer), but you can still use the other ones as "auxiliary compute", I imagine? Also, I just -pl the GPUs (250W for the 3090s, 125W for the 3080s since they do pretty much nothing), too lazy to do a proper undervolt these days. :P At least when temps are in check.
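
For anyone curious, that power cap is just a couple of nvidia-smi calls (the GPU indices and wattages here are from my setup, adjust to taste):

sudo nvidia-smi -pm 1          # persistence mode so the settings aren't dropped when the driver unloads
sudo nvidia-smi -i 0 -pl 250   # 3090s at 250 W
sudo nvidia-smi -i 1 -pl 250
sudo nvidia-smi -i 2 -pl 125   # 3080s at 125 W, they do pretty much nothing anyway
sudo nvidia-smi -i 3 -pl 125
sudo nvidia-smi -i 4 -pl 125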

PS: Troll rig FTW! Your smol-IQ4_KSS quants are very much worth the "performance" hit more often than not on 256 GiB RAM, even if llama.cpp can barely do more than 2 GiB/s off NVMe storage regardless of Gen. It may not be DeepSeek FP8... but honestly, it's pretty darn close!

--jinja \
--reasoning-format none \
--reasoning-budget -1 \

Thanks for these. I'll try them out. I haven't written a jinja for it yet, and I don't use the unsloth ones because they've started putting copyright / licensing inside the fucking chat templates (wtf?!)

I just realized that the OpenAI-API-compliant endpoint now has at least two different styles:

It might be important that your client is using the correct API endpoints, etc.

Thanks for these. I'll try them out. I haven't written a jinja for it yet, and I don't use the unsloth ones because they've started putting copyright / licensing inside the fucking chat templates (wtf?!)

I came across this GitHub issue comment, which confirmed that using

--jinja --reasoning-format auto

works just fine. That way there’s no need to rely on unsloth’s templates. Tested with open-webui and ik bundled webui. Here’s my full command:

#!/bin/bash

CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" ./ik_llama.cpp/build/bin/llama-server \
    --model /home/jliao/Data/Repos/DeepSeek-V3.1-Terminus-IQ3_K/IQ3_K/DeepSeek-V3.1-Terminus-IQ3_K-00001-of-00007.gguf \
    --alias ubergarm/DeepSeek-V3.1-Terminus-IQ3_K \
    --ctx-size 32768 \
    -ctk q8_0 \
    -fa -fmoe \
    -mla 3 -amb 512 \
    -ngl 99 \
    -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
    -ot "blk\.(6|7|8)\.ffn_.*=CUDA1" \
    -ot "blk\.(9|10|11)\.ffn_.*=CUDA2" \
    -ot "blk\.(12|13|14)\.ffn_.*=CUDA3" \
    -ot "blk\.(15|16|17)\.ffn_.*=CUDA4" \
    -ot "blk\.(18|19|20)\.ffn_.*=CUDA5" \
    -ot "blk\.(21|22|23)\.ffn_.*=CUDA6" \
    -ot "blk\.(24|25|26)\.ffn_.*=CUDA7" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 32 \
    --jinja \
    --reasoning-format auto \
    --reasoning-budget -1 \
    --host 0.0.0.0 \
    --port 8088 \
    --no-display-prompt \
    --no-mmap

@JianLiao

Thanks for the detailed and very clean-looking command. Glad you have it working across 8 GPUs and your CPU/RAM! Also good to hear the built-in jinja is working for you (I always keep the official jinja template provided by the original model at the time of quantizing).

@JianLiao Thanks mate, that worked for me. And --no-display-prompt allowed me to get rid of my grep/awk pipe.
