Request for support - improved model fit

#9
by dehnhaide - opened

Hi John,

First, yet another great quant: many thanks for your constant effort keeping up with this plethora of new models! Much appreciated! 🙏

I need a bit of support from your far more focused brain. My 8x RTX 3090 rig is finally running smoothly. I have also managed to enable P2P, and the throughput increase in ik_llama has been substantial. The other half of the system is an AMD Epyc 7443P (24c/48t) + 256 GB DDR4-3200.

I am basically asking for your support, a good recipe, on what to load on GPU and what to offload, plus the needed command-line parametrization, to run IQ4_KS at the best available performance with my 192 GB VRAM and the slower 256 GB of RAM. Given the model size (208 GB) I am pretty sure it needs some CPU offloading, but I admit I feel dumb and lost when it comes to CPU / RAM offloading. I am willing to learn, though. And if you have a link where I can buy you a coffee for the effort, happy to oblige.

I currently use Minimax M2.5 in sglang / vllm since mratsim's recipe there just flies with my amount of VRAM. So far I have tried this model of yours only with mainline llama.cpp because of "--fit", but given the P2P between my cards and "-sm graph" in ik_llama I have a feeling I can get far better performance with it, even with some CPU / RAM offloading.

Okay, let me try to parse your two questions and restate simply what you're asking:

  1. You have 8x 3090s, so roughly 192 GB VRAM.
  2. You want to run Qwen3.5-397B-A17B at the best quant and speed possible.

So the bad news is that ik_llama.cpp does not yet support -sm graph parallelism for qwen35moe. I believe ik has to work out some fused gated delta net stuff first, as recently mentioned here: https://github.com/ikawrakow/ik_llama.cpp/pull/1320

Yes vllm/sglang will likely give better aggregate throughput (parallel slots) than any ik/llama.cpp GGUF if you can fully offload it and have enough room for context.

I have new questions for you now given Qwen3.5MoE does not yet support graph parallel:

  1. Do you still want to workshop your command for the best hybrid CPU+GPU -sm layer setup? If so, share your full command here and we can workshop it.
  2. You could try the slightly smaller Q3_K at full GPU offload to keep it simple.
  3. You could run a larger quant of the smaller Qwen3.5-122B-A10B, full offload at q8_0, quite easily.

What are you thinking?
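For option 1, just so we have something concrete to iterate on, a hybrid command usually has roughly the shape below. The paths, context size, and the -ot regex are placeholders I'm making up to show the idea, not a tested recipe for this model:

```
# sketch only: keep attention/dense tensors on GPU, spill a slice of routed-expert tensors to CPU
./build/bin/llama-server \
  --model /path/to/Qwen3.5-397B-A17B-IQ4_KS-00001-of-000NN.gguf \
  -ngl 999 -sm layer -fa \
  -c 65536 -ctk q8_0 -ctv q8_0 \
  -t 24 \
  -ot "blk\.(4[0-9]|5[0-9]|6[0-9])\.ffn_.*_exps.*=CPU" \
  --host 0.0.0.0 --port 5005
```

The idea is that the routed expert tensors make up most of the 208 GB but are touched sparsely, so they are the cheapest thing to leave in system RAM, while attention and shared weights stay on the 3090s.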


Here is your question from another thread:

I am going to ask a stupid question and am fine with being slapped!

Why can't I get the model in ik_llama to properly use the GPUs flat out like it does in vllm / sglang? I love that "-sm graph" + P2P give a strong push, but it's still at about half the t/s I am getting in sglang or vllm. And I am not talking about multiple parallel requests, nope, a single one.

Is it some configuration / ik_llama command-line parametrization I am doing wrong, or is it architecturally the case that *llama can't be that fast (given the lack of true tensor parallelism) and I have to move on with my life and accept it for what it is?

I answered briefly, but it sounds like you're not using -ngl 999?
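In case it helps, the bare-minimum shape of a full-offload run looks roughly like this (path and context size are placeholders; adjust to your build and model):

```
# minimal full-offload sketch, not a tested command
./build/bin/llama-server -m /path/to/model.gguf -ngl 999 -fa -c 32768
```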

Also, if you're okay with Discord (and hopefully you don't have to upload ID), check out https://huggingface.co/BeaverAI; a lot of people there talk about their offload strategies and rigs.

First of all, apologies for going silent, but this has been a maddening week. Because of the nefarious quant "bug" (singular vs plural) two days of dev sprints went bananas and I had to scrap everything and start from zero.
Anyway, along the way I've discovered how potent Qwen3.5-397B-A17B-GGUF really is. As of now I am using Unsloth's UD-Q3_K_XL (because I had it lying around and my internet at home sucks - Germany!).

So the bad news is that ik_llama.cpp does not yet support -sm graph parallelism for qwen35moe. I believe ik has to work out some fused gated delta net stuff first, as recently mentioned here: https://github.com/ikawrakow/ik_llama.cpp/pull/1320

OK, so much for my P2P + ik_llama anxiety! 😀

  1. Do you still want to workshop your command for the best hybrid CPU+GPU -sm layer setup? If so, share your full command here and we can workshop it.

Given my HW config, would you recommend I go for a higher quant from your repo? I mean, quality- and performance-wise, would I do any better than with UD-Q3_K_XL?
Currently I am at 20-22 t/s with ctx at 196k (I cannot go any lower than that since my vibe-coded project is huge), and this is with:

```
/home/vik/llms/llama.cpp/build/bin/llama-server \
  --model ~/models/gguf/unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL/Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf \
  --mmproj ~/models/gguf/unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL/mmproj-F16.gguf \
  --alias "unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL" \
  --fit on -t 24 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --seed 3407 \
  --api-key VLLM_API_KEY_2026 --port 5005 --host 0.0.0.0 \
  --jinja -fa on \
  -c 196608 -ctk q8_0 -ctv q8_0 --kv-offload \
  --context-shift -cb \
  -b 2048 -ub 1024 \
  --chat-template-kwargs '{"reasoning_effort": "normal"}'
```

  2. You could try the slightly smaller Q3_K at full GPU offload to keep it simple.
  3. You could run a larger quant of the smaller Qwen3.5-122B-A10B, full offload at q8_0, quite easily.
    I have it at hand and will probably use it to kickstart another project. So far the initial tests are a WOW: 45 t/s with mainline llama and AesSedai's Q5_K_M.
    Do you plan on releasing any bigger quants than Qwen3.5-122B-A10B-IQ4_KSS?


I love learning and would definitely put up with a bigger quant in ik_llama if your experience hints that better PP / t/s would be possible over my current range (20-22 t/s).

Owner

@dehnhaide

I love learning and would definitely put up with a bigger quant in ik_llama if your experience hints that better PP / t/s would be possible over my current range (20-22 t/s).

Heya, yes, it was a wild weekend with so many Unsloth quants being re-uploaded to fix the MXFP4 bugs, and I guess they have some new recipes now. I'm still trying to figure out what has been updated, what has not, and whether they're planning on updating the rest (e.g. qwen3-coder etc.).

Anyway, the good news here is that ik_llama.cpp just added -sm graph support, and ik shows it running full offload on many 3090s, kind of like your rig: https://github.com/ikawrakow/ik_llama.cpp/pull/1347 (note the point about the max-2-GPUs argument he shows there).

I would recommend trying this. I'm currently working on it myself to figure out commands and get llama-sweep-bench benchmarks.
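Something in this general shape is probably where I'll start; the sizes are guesses and the --max-gpu flag comes from the PR discussion, so treat it as a sketch until I've actually run it:

```
# rough llama-sweep-bench sketch for -sm graph, unverified
./build/bin/llama-sweep-bench \
  -m /path/to/Qwen3.5-397B-A17B-IQ4_KS-00001-of-000NN.gguf \
  -ngl 999 -fa -sm graph --max-gpu 2 \
  -c 65536 -b 4096 -ub 4096 \
  -ctk q8_0 -ctv q8_0
```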

Anyway, the good news here is that ik_llama.cpp just added -sm graph support, and ik shows it running full offload on many 3090s, kind of like your rig: https://github.com/ikawrakow/ik_llama.cpp/pull/1347 (note the point about the max-2-GPUs argument he shows there).

I would recommend trying this. I'm currently working on it myself to figure out commands and get llama-sweep-bench benchmarks.

I've done that already and commented on ik's PR directly. Over there I am vikcious, sorry for the confusion! 😃 With "--max-gpu 2" it flies...

main: n_kv_max = 135168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 1, n_threads_batch = 1

|   PP |  TG |   N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|----:|-------:|-------:|---------:|-------:|---------:|
| 4096 | 128 |      0 |  2.305 |  1777.30 |  2.243 |    57.06 |
| 4096 | 128 |   4096 |  2.407 |  1701.96 |  2.234 |    57.30 |
| 4096 | 128 |   8192 |  2.443 |  1676.65 |  2.265 |    56.52 |
| 4096 | 128 |  12288 |  2.528 |  1620.05 |  2.280 |    56.14 |
| 4096 | 128 |  16384 |  2.690 |  1522.47 |  2.262 |    56.58 |
| 4096 | 128 |  20480 |  2.795 |  1465.50 |  2.311 |    55.39 |
| 4096 | 128 |  24576 |  2.887 |  1418.91 |  2.359 |    54.26 |
| 4096 | 128 |  28672 |  3.034 |  1350.19 |  2.363 |    54.18 |
| 4096 | 128 |  32768 |  3.155 |  1298.31 |  2.420 |    52.89 |
| 4096 | 128 |  36864 |  3.253 |  1259.09 |  2.434 |    52.59 |
| 4096 | 128 |  40960 |  3.326 |  1231.55 |  2.388 |    53.60 |
| 4096 | 128 |  45056 |  3.473 |  1179.29 |  2.424 |    52.81 |
| 4096 | 128 |  49152 |  3.803 |  1077.09 |  2.443 |    52.39 |
| 4096 | 128 |  53248 |  4.054 |  1010.45 |  3.041 |    42.09 |
| 4096 | 128 |  57344 |  3.821 |  1071.99 |  2.701 |    47.38 |
| 4096 | 128 |  61440 |  3.954 |  1036.03 |  2.684 |    47.69 |
| 4096 | 128 |  65536 |  4.082 |  1003.42 |  2.739 |    46.74 |
| 4096 | 128 |  69632 |  4.167 |   983.05 |  2.850 |    44.91 |
| 4096 | 128 |  73728 |  4.216 |   971.48 |  2.732 |    46.85 |
| 4096 | 128 |  77824 |  4.318 |   948.61 |  2.688 |    47.62 |
| 4096 | 128 |  81920 |  4.414 |   927.87 |  2.689 |    47.61 |
| 4096 | 128 |  86016 |  4.542 |   901.82 |  2.705 |    47.32 |
| 4096 | 128 |  90112 |  4.634 |   883.98 |  2.745 |    46.63 |
| 4096 | 128 |  94208 |  4.779 |   857.03 |  2.688 |    47.62 |
| 4096 | 128 |  98304 |  4.868 |   841.49 |  2.742 |    46.67 |
| 4096 | 128 | 102400 |  4.974 |   823.43 |  2.765 |    46.29 |
| 4096 | 128 | 106496 |  5.091 |   804.60 |  2.839 |    45.09 |
| 4096 | 128 | 110592 |  5.218 |   785.05 |  2.787 |    45.93 |
| 4096 | 128 | 114688 |  5.354 |   765.00 |  2.822 |    45.36 |
| 4096 | 128 | 118784 |  5.417 |   756.21 |  2.906 |    44.05 |
| 4096 | 128 | 122880 |  5.563 |   736.28 |  2.917 |    43.88 |
| 4096 | 128 | 126976 |  5.580 |   734.07 |  2.924 |    43.77 |
| 4096 | 128 | 131072 |  5.785 |   708.02 |  2.901 |    44.12 |
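(For anyone reading along: the throughput columns are just tokens divided by seconds, e.g. in the first row S_PP = 4096 / 2.305 s ≈ 1777 t/s prompt processing and S_TG = 128 / 2.243 s ≈ 57 t/s generation.)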

Btw, stupid question... if I need the mmproj file so the model can take image input (it helps a lot with testing, where the model can look at .pngs of the frontend it generated and debug / look for errors), can I use the one from the Unsloth repository? Would that work?

Well, it looks like it doesn't work... any clue? 😁

mtmd_init_from_file: error: mismatch between text model (n_embd = 3072) and mmproj (n_embd = 4096)
hint: you may be using wrong mmproj

ERR [ load_model] failed to load multimodal model, '%s'
| ="/home/vik/models/gguf/ubergarm/Qwen3.5-122B-A10B-IQ4_KSS/mmproj-F16.gguf"
~ggml_backend_cuda_context: have 21 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
free(): invalid pointer
./ik-server_Qwen3.5-122B-A10B-IQ4_KSS.sh: line 21: 165731 Aborted (core dumped)

Owner

@dehnhaide

Ahh okay, yes, I saw your sweeps over on ik's PR and graphed them!

if I need the mmproj file so the model can take image input (it helps a lot with testing, where the model can look at .pngs of the frontend it generated and debug / look for errors), can I use the one from the Unsloth repository? Would that work?

mismatch between text model (n_embd = 3072) and mmproj (n_embd = 4096)

In theory yes, but apparently not? I haven't tried it yet myself, as that feature was only recently added to ik_llama.cpp... I think I grabbed the mmproj from AesSedai for Kimi-K2.5 maybe, and that worked in the past... maybe it's something with the qwen delta net stuff?
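If you want to poke at it before I do: a quick sanity check would be to dump the metadata of both files and compare the embedding widths. This assumes the gguf Python package is installed (pip install gguf) and that both files expose an embedding_length-style key, which I haven't verified for this mmproj:

```
# sketch: compare what embedding width each GGUF reports
gguf-dump /path/to/Qwen3.5-122B-A10B-IQ4_KSS-00001-of-0000N.gguf | grep -i embedding_length
gguf-dump /path/to/mmproj-F16.gguf | grep -i embedding_length
```

If the numbers disagree (like the 3072 vs 4096 in your log), no amount of flag-twiddling will make that pair load together; presumably you'd need an mmproj converted from the matching text model.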

Let me know if you find the answer, as I've got a few other irons in the fire currently!
