Request for support - improved model fit
Hi John,
First, yet another great quant: many thanks for your constant effort keeping up with this plethora of new models! Much appreciated! 🙂
I need a bit of support from your far more focused brain. My 8x RTX 3090 rig is finally running smoothly. I have also managed to enable P2P, and the throughput increase in ik_llama has been substantial. The other half of the system is an AMD EPYC 7443P (24c/48t) + 256 GB DDR4-3200.
I am basically asking for your support, a good recipe of what to load and what to offload, plus the needed command-line parametrization, to run IQ4_KS at the best available performance with my 192 GB VRAM and the lazier 256 GB of RAM. Given the model size (208 GB) I am pretty sure it needs some CPU offloading, but I admit I feel dumb and lost when it comes to CPU / RAM offloading. I am willing to learn, though. And if you have a link to buy you a coffee for the effort, happy to oblige.
I currently use MiniMax M2.5 in sglang / vllm since mratsim's recipe there just flies with my VRAM amount. So far I have tried this model of yours only with mainline llama.cpp because of `--fit`, but given the P2P between my cards and `-sm graph` in ik_llama I have a feeling I can get far better performance with it, even with some CPU / RAM offloading.
Okay, let me try to parse your two questions and restate simply what you're asking:
- You have 8x3090s so roughly 192GB VRAM
- You want to run Qwen3.5-397B-A17B at the best quant and speed possible.
So bad news is ik_llama.cpp does not yet support `-sm graph` parallel for qwen35moe. I believe ik has to work out some fused gated delta net stuff first, as recently mentioned here: https://github.com/ikawrakow/ik_llama.cpp/pull/1320
Yes, vllm/sglang will likely give better aggregate throughput (parallel slots) than any ik/llama.cpp GGUF setup, if you can fully offload it and have enough room for context.
I have new questions for you now, given Qwen3.5MoE does not yet support graph parallel:
- Do you still want to workshop your best hybrid CPU+GPU `-sm layer` command? If so, share your full command here and we can workshop it (a starter sketch follows this list).
- You could try the slightly smaller Q3_K with full GPU offload to keep it simple.
- You could run a larger quant of the smaller Qwen3.5-122B-A10B, full offload at q8_0, quite easily.
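To make that first option concrete, here is the shape of a hybrid command I'd start workshopping from. This is a sketch only: the model path is hypothetical, and the `-ot` regex assumes the usual `ffn_*_exps` MoE tensor naming, so verify it against the actual tensor names in the GGUF.

```bash
# Hybrid CPU+GPU sketch for ik_llama.cpp (hypothetical IQ4_KS path; the
# -ot regex assumes the common ffn_{up,gate,down}_exps MoE tensor naming
# and should be checked against this model's real tensors).
./build/bin/llama-server \
  --model ~/models/Qwen3.5-397B-A17B-IQ4_KS-00001-of-00005.gguf \
  -ngl 999 -sm layer -fa \
  -ot "blk\..*\.ffn_(up|gate|down)_exps.*=CPU" \
  -t 24 -c 65536
```

In practice you would narrow the regex (e.g. to only the higher block numbers) so the experts that do fit in your 192 GB stay on the GPUs and only the overflow lives in system RAM.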
What are you thinking?
Here is your question from another thread:
> I am going to ask a stupid question and am fine with being slapped!
> Why can't I get the model in ik_llama to properly use the GPUs at full tilt like it happens in vllm / sglang? I love `-sm graph` + P2P, which gives a strong push, but it's still at about half the t/s I am getting in sglang or vllm.
> And I am not talking about multiple parallel requests. Nope, a single one.
> Is it some configuration / ik_llama command-line parametrization I am doing wrong, or is it architecturally so that *llama can't be that fast (given the lack of true tensor parallelism) and I have to move on with my life and accept it for what it is?
I answered briefly, but it sounds like you're not using `-ngl 999`?
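To illustrate what I mean, a minimal full-offload sketch (hypothetical model path, not your exact setup):

```bash
# Minimal full-offload sketch for ik_llama.cpp (hypothetical model path).
# -ngl 999 offloads every layer to the GPUs; -sm graph is the split mode
# where P2P between the 3090s pays off, on models that support it.
./build/bin/llama-server \
  --model ~/models/some-model-that-fits-in-192GB.gguf \
  -ngl 999 -sm graph -fa \
  -c 32768 --host 0.0.0.0 --port 5005
```

Without `-ngl` (or with too few layers offloaded) most of the weights stay on the CPU and the GPUs sit mostly idle, which would explain the halved t/s.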
Also, if you're okay with Discord (and hopefully don't have to upload ID), check out https://huggingface.co/BeaverAI; a lot of people there talk about their offload strategies and rigs.
First of all, apologies for going silent, but this has been a maddening week. Because of the nefarious quant "bug" (singular vs plural), two days of dev sprints have gone bananas and I had to scrap everything and start from zero.
Anyway, this way I've discovered how potent Qwen3.5-397B-A17B-GGUF really is. As of now I am using Unsloth's UD-Q3_K_XL (because I had it lying around and my internet at home sucks - Germany!)
> So bad news is ik_llama.cpp does not yet support `-sm graph` parallel for qwen35moe. I believe ik has to work out some fused gated delta net stuff first, as recently mentioned here: https://github.com/ikawrakow/ik_llama.cpp/pull/1320
OK, so that was all for my P2P + ik_llama anxiety! 🙂
> Do you still want to workshop your best hybrid CPU+GPU `-sm layer` command? If so, share your full command here and we can workshop it.
Given my HW config, would you recommend I go for a higher quant from your repo? I mean, process- and performance-wise, would I do any better than with UD-Q3_K_XL?
Currently I am at 20-22 t/s with ctx at 196k (I cannot go any lower than that since my vibe-coded project is huge), and this is with:
```bash
/home/vik/llms/llama.cpp/build/bin/llama-server \
  --model ~/models/gguf/unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL/Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf \
  --mmproj ~/models/gguf/unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL/mmproj-F16.gguf \
  --alias "unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL" \
  --fit on -t 24 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --seed 3407 \
  --api-key VLLM_API_KEY_2026 --port 5005 --host 0.0.0.0 \
  --jinja -fa on \
  -c 196608 -ctk q8_0 -ctv q8_0 --kv-offload \
  --context-shift -cb \
  -b 2048 -ub 1024 \
  --chat-template-kwargs '{"reasoning_effort": "normal"}'
```
> - You could try the slightly smaller Q3_K with full GPU offload to keep it simple.
> - You could run a larger quant of the smaller Qwen3.5-122B-A10B, full offload at q8_0, quite easily.
I have it at hand and will probably use it to kickstart another project. So far the initial tests are WOW: 45 t/s with mainline llama.cpp and AesSedai's Q5_K_M.
Do you plan on releasing any bigger quants than Qwen3.5-122B-A10B-IQ4_KSS?
> Given my HW config, would you recommend I go for a higher quant from your repo? I mean, process- and performance-wise, would I do any better than with UD-Q3_K_XL?
> Currently I am at 20-22 t/s with ctx at 196k
I love learning and would definitely put up with a bigger quant in ik_llama if your experience hints that better PP / TG speeds would be possible over my current range (20-22 t/s).
> I love learning and would definitely put up with a bigger quant in ik_llama if your experience hints that better PP / TG speeds would be possible over my current range (20-22 t/s).
Heya, yes, it was a wild weekend with so many Unsloth quants being re-uploaded to fix the MXFP4 bugs, and I guess they have some new recipes now. I'm still trying to figure out what has and has not been updated by them, and whether they're planning on updating the rest (e.g. qwen3-coder etc.).
Anyway, good news here is ik_llama.cpp just added `-sm graph` support, and ik shows it running full offload on many 3090s, kind of like your rig: https://github.com/ikawrakow/ik_llama.cpp/pull/1347; note the point about the max-2-GPUs argument he shows there.
I would recommend trying this. I'm currently working on it myself to figure out commands and get llama-sweep-bench benchmarks.
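Something along these lines is where I'd start (hypothetical model path; the `-c`/`-b`/`-ub` values mirror the sweep table further down this thread, and `--max-gpu` is the new argument from that PR):

```bash
# llama-sweep-bench sketch (ik_llama.cpp); hypothetical model path.
# Sweeps PP/TG speed as the KV cache fills, producing tables like the
# one below.
./build/bin/llama-sweep-bench \
  --model ~/models/Qwen3.5-122B-A10B-IQ4_KSS.gguf \
  -c 135168 -b 4096 -ub 4096 \
  -fa -ngl 999 -sm graph --max-gpu 2
```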
> Anyway, good news here is ik_llama.cpp just added `-sm graph` support, and ik shows it running full offload on many 3090s, kind of like your rig: https://github.com/ikawrakow/ik_llama.cpp/pull/1347; note the point about the max-2-GPUs argument he shows there. I would recommend trying this. I'm currently working on it myself to figure out commands and get llama-sweep-bench benchmarks.
Already done; I commented on ik's PR directly. There I am vikcious, sorry for the confusion! 🙂 With `--max-gpu 2` it flies...
`main: n_kv_max = 135168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 1, n_threads_batch = 1`
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 2.305 | 1777.30 | 2.243 | 57.06 |
| 4096 | 128 | 4096 | 2.407 | 1701.96 | 2.234 | 57.30 |
| 4096 | 128 | 8192 | 2.443 | 1676.65 | 2.265 | 56.52 |
| 4096 | 128 | 12288 | 2.528 | 1620.05 | 2.280 | 56.14 |
| 4096 | 128 | 16384 | 2.690 | 1522.47 | 2.262 | 56.58 |
| 4096 | 128 | 20480 | 2.795 | 1465.50 | 2.311 | 55.39 |
| 4096 | 128 | 24576 | 2.887 | 1418.91 | 2.359 | 54.26 |
| 4096 | 128 | 28672 | 3.034 | 1350.19 | 2.363 | 54.18 |
| 4096 | 128 | 32768 | 3.155 | 1298.31 | 2.420 | 52.89 |
| 4096 | 128 | 36864 | 3.253 | 1259.09 | 2.434 | 52.59 |
| 4096 | 128 | 40960 | 3.326 | 1231.55 | 2.388 | 53.60 |
| 4096 | 128 | 45056 | 3.473 | 1179.29 | 2.424 | 52.81 |
| 4096 | 128 | 49152 | 3.803 | 1077.09 | 2.443 | 52.39 |
| 4096 | 128 | 53248 | 4.054 | 1010.45 | 3.041 | 42.09 |
| 4096 | 128 | 57344 | 3.821 | 1071.99 | 2.701 | 47.38 |
| 4096 | 128 | 61440 | 3.954 | 1036.03 | 2.684 | 47.69 |
| 4096 | 128 | 65536 | 4.082 | 1003.42 | 2.739 | 46.74 |
| 4096 | 128 | 69632 | 4.167 | 983.05 | 2.850 | 44.91 |
| 4096 | 128 | 73728 | 4.216 | 971.48 | 2.732 | 46.85 |
| 4096 | 128 | 77824 | 4.318 | 948.61 | 2.688 | 47.62 |
| 4096 | 128 | 81920 | 4.414 | 927.87 | 2.689 | 47.61 |
| 4096 | 128 | 86016 | 4.542 | 901.82 | 2.705 | 47.32 |
| 4096 | 128 | 90112 | 4.634 | 883.98 | 2.745 | 46.63 |
| 4096 | 128 | 94208 | 4.779 | 857.03 | 2.688 | 47.62 |
| 4096 | 128 | 98304 | 4.868 | 841.49 | 2.742 | 46.67 |
| 4096 | 128 | 102400 | 4.974 | 823.43 | 2.765 | 46.29 |
| 4096 | 128 | 106496 | 5.091 | 804.60 | 2.839 | 45.09 |
| 4096 | 128 | 110592 | 5.218 | 785.05 | 2.787 | 45.93 |
| 4096 | 128 | 114688 | 5.354 | 765.00 | 2.822 | 45.36 |
| 4096 | 128 | 118784 | 5.417 | 756.21 | 2.906 | 44.05 |
| 4096 | 128 | 122880 | 5.563 | 736.28 | 2.917 | 43.88 |
| 4096 | 128 | 126976 | 5.580 | 734.07 | 2.924 | 43.77 |
| 4096 | 128 | 131072 | 5.785 | 708.02 | 2.901 | 44.12 |
Btw, stupid question... if I need the mmproj file to load images into the model (helps a lot with testing, where the model can look at .pngs of its own frontend and debug / look for errors), can I use the one from the Unsloth repository? Would that work?
Well, it looks like it doesn't work... any clue? 🙂
```
mtmd_init_from_file: error: mismatch between text model (n_embd = 3072) and mmproj (n_embd = 4096)
hint: you may be using wrong mmproj
ERR [ load_model] failed to load multimodal model, '/home/vik/models/gguf/ubergarm/Qwen3.5-122B-A10B-IQ4_KSS/mmproj-F16.gguf'
~ggml_backend_cuda_context: have 21 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
free(): invalid pointer
./ik-server_Qwen3.5-122B-A10B-IQ4_KSS.sh: line 21: 165731 Aborted (core dumped)
```
Ahh okay, yes, I saw your sweeps over on ik's PR and graphed them!
> if I need the mmproj file to load images into the model (helps a lot with testing, where the model can look at .pngs of its own frontend and debug / look for errors), can I use the one from the Unsloth repository? Would that work?

> mismatch between text model (n_embd = 3072) and mmproj (n_embd = 4096)
In theory yes, but apparently no? I haven't tried it yet myself, as that feature was only recently added to ik_llama.cpp... I think I grabbed the mmproj from AesSedai for Kimi-K2.5 and that worked in the past... maybe it's something with the qwen delta net stuff?
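One way to sanity-check a candidate mmproj before loading it: dump the GGUF metadata of both files and compare the embedding widths the error is complaining about. A sketch assuming the `gguf` pip package (which ships the gguf-dump tool); the model filename is hypothetical, the mmproj path is from your log, and the grep patterns are guesses, so eyeball the full dump if they miss:

```bash
# Compare the text model's hidden size with the width the mmproj projects
# into; the error above means these disagree (3072 vs 4096).
gguf-dump --no-tensors ~/models/gguf/ubergarm/Qwen3.5-122B-A10B-IQ4_KSS/Qwen3.5-122B-A10B-IQ4_KSS.gguf \
  | grep -i embedding_length
gguf-dump --no-tensors ~/models/gguf/ubergarm/Qwen3.5-122B-A10B-IQ4_KSS/mmproj-F16.gguf \
  | grep -iE 'embd|embedding|projection'
```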
Let me know if you find the answer, as I've got a few other irons in the fire currently!