Request for support - improved model fit
Hi John,
First, yet another great quant: many thanks for your constant effort keeping up with this plethora of new models! Much appreciated! 🙂
I need a bit of support from your far more focused brain. My 8x RTX 3090 rig is finally running smoothly. I have also managed to enable P2P, and the throughput increase in ik_llama has been substantial. The other half of the system is an AMD EPYC 7443P (24c/48t) + 256 GB DDR4-3200.
I am basically asking for your support, a good recipe of what to load and what to offload, plus the needed command-line parametrization, to run IQ4_KS at the best available performance with my 192 GB VRAM and the lazier 256 GB of RAM. Given the model size (208 GB) I am pretty sure it needs some CPU offloading, but I admit I feel dumb and lost when it comes to CPU / RAM offloading. I am willing to learn, though. And if you have a link to buy you a coffee for the effort, happy to oblige.
I currently use MiniMax M2.5 in sglang / vllm since mratsim's recipe there just flies with my VRAM amount. So far I have tried this model of yours only with mainline llama.cpp because of `--fit`, but given the P2P between my cards and `-sm graph` in ik_llama I have a feeling I can get far better performance with it, even with some CPU / RAM offloading.
Okay, let me try to parse your two questions and restate simply what you're asking:
- You have 8x3090s so roughly 192GB VRAM
- You want to run Qwen3.5-397B-A17B at the best quant and speed possible.
So bad news is ik_llama.cpp does not yet support `-sm graph` parallel for qwen35moe. I believe ik has to work out some fused gated delta net stuff first, as recently mentioned here: https://github.com/ikawrakow/ik_llama.cpp/pull/1320
Yes, vllm/sglang will likely give better aggregate throughput (parallel slots) than any ik/llama.cpp GGUF setup, if you can fully offload it and have enough room for context.
I have new questions for you now, given Qwen3.5MoE does not yet support graph parallel:
- Do you still want to workshop your best hybrid CPU+GPU `-sm layer` command? If so, share your full command here and we can workshop it (a starter sketch follows this list).
- You could try the slightly smaller Q3_K with full GPU offload to keep it simple.
- You could run a larger quant of the smaller Qwen3.5-122B-A10B, full offload at q8_0, quite easily.
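To make that first option concrete, here is the shape of a hybrid command I'd start workshopping from. This is a sketch only: the model path is hypothetical, and the `-ot` regex assumes the usual `ffn_*_exps` MoE tensor naming, so verify it against the actual tensor names in the GGUF.

```bash
# Hybrid CPU+GPU sketch for ik_llama.cpp (hypothetical IQ4_KS path; the
# -ot regex assumes the common ffn_{up,gate,down}_exps MoE tensor naming
# and should be checked against this model's real tensors).
./build/bin/llama-server \
  --model ~/models/Qwen3.5-397B-A17B-IQ4_KS-00001-of-00005.gguf \
  -ngl 999 -sm layer -fa \
  -ot "blk\..*\.ffn_(up|gate|down)_exps.*=CPU" \
  -t 24 -c 65536
```

In practice you would narrow the regex (e.g. to only the higher block numbers) so the experts that do fit in your 192 GB stay on the GPUs and only the overflow lives in system RAM.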
What are you thinking?
Here is your question from another thread:
> I am going to ask a stupid question and am fine with being slapped!
> Why can't I get the model in ik_llama to properly use the GPUs at full tilt like it happens in vllm / sglang? I love `-sm graph` + P2P, which gives a strong push, but it's still at about half the t/s I am getting in sglang or vllm.
> And I am not talking about multiple parallel requests. Nope, a single one.
> Is it some configuration / ik_llama command-line parametrization I am doing wrong, or is it architecturally so that *llama can't be that fast (given the lack of true tensor parallelism) and I have to move on with my life and accept it for what it is?
I answered briefly, but it sounds like you're not using `-ngl 999`?
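To illustrate what I mean, a minimal full-offload sketch (hypothetical model path, not your exact setup):

```bash
# Minimal full-offload sketch for ik_llama.cpp (hypothetical model path).
# -ngl 999 offloads every layer to the GPUs; -sm graph is the split mode
# where P2P between the 3090s pays off, on models that support it.
./build/bin/llama-server \
  --model ~/models/some-model-that-fits-in-192GB.gguf \
  -ngl 999 -sm graph -fa \
  -c 32768 --host 0.0.0.0 --port 5005
```

Without `-ngl` (or with too few layers offloaded) most of the weights stay on the CPU and the GPUs sit mostly idle, which would explain the halved t/s.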
Also, if you're okay with Discord (and hopefully don't have to upload ID), check out https://huggingface.co/BeaverAI; a lot of people there talk about their offload strategies and rigs.
First of all, apologies for going silent, but this has been a maddening week. Because of the nefarious quant "bug" (singular vs plural), two days of dev sprints have gone bananas and I had to scrap everything and start from zero.
Anyway, this way I've discovered how potent Qwen3.5-397B-A17B-GGUF really is. As of now I am using Unsloth's UD-Q3_K_XL (because I had it lying around and my internet at home sucks - Germany!)
> So bad news is ik_llama.cpp does not yet support `-sm graph` parallel for qwen35moe. I believe ik has to work out some fused gated delta net stuff first, as recently mentioned here: https://github.com/ikawrakow/ik_llama.cpp/pull/1320
OK, so that was all for my P2P + ik_llama anxiety! 🙂
> Do you still want to workshop your best hybrid CPU+GPU `-sm layer` command? If so, share your full command here and we can workshop it.
Given my HW config, would you recommend I go for a higher quant from your repo? I mean, process- and performance-wise, would I do any better than with UD-Q3_K_XL?
Currently I am at 20-22 t/s with ctx at 196k (I cannot go any lower than that since my vibe-coded project is huge), and this is with:
```bash
/home/vik/llms/llama.cpp/build/bin/llama-server \
  --model ~/models/gguf/unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL/Qwen3.5-397B-A17B-UD-Q3_K_XL-00001-of-00005.gguf \
  --mmproj ~/models/gguf/unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL/mmproj-F16.gguf \
  --alias "unsloth/Qwen3.5-397B-A17B-UD-Q3_K_XL" \
  --fit on -t 24 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --seed 3407 \
  --api-key VLLM_API_KEY_2026 --port 5005 --host 0.0.0.0 \
  --jinja -fa on \
  -c 196608 -ctk q8_0 -ctv q8_0 --kv-offload \
  --context-shift -cb \
  -b 2048 -ub 1024 \
  --chat-template-kwargs '{"reasoning_effort": "normal"}'
```
> - You could try the slightly smaller Q3_K with full GPU offload to keep it simple.
> - You could run a larger quant of the smaller Qwen3.5-122B-A10B, full offload at q8_0, quite easily.
I have it at hand and will probably use it to kickstart another project. So far the initial tests are WOW: 45 t/s with mainline llama.cpp and AesSedai's Q5_K_M.
Do you plan on releasing any bigger quants than Qwen3.5-122B-A10B-IQ4_KSS?
> Given my HW config, would you recommend I go for a higher quant from your repo? I mean, process- and performance-wise, would I do any better than with UD-Q3_K_XL?
> Currently I am at 20-22 t/s with ctx at 196k
I love learning and would definitely put up with a bigger quant in ik_llama if your experience hints that better PP / TG speeds would be possible over my current range (20-22 t/s).
> I love learning and would definitely put up with a bigger quant in ik_llama if your experience hints that better PP / TG speeds would be possible over my current range (20-22 t/s).
Heya, yes, it was a wild weekend with so many Unsloth quants being re-uploaded to fix the MXFP4 bugs, and I guess they have some new recipes now. I'm still trying to figure out what has and has not been updated by them, and whether they're planning on updating the rest (e.g. qwen3-coder etc.).
Anyway, good news here is ik_llama.cpp just added `-sm graph` support, and ik shows it running full offload on many 3090s, kind of like your rig: https://github.com/ikawrakow/ik_llama.cpp/pull/1347; note the point about the max-2-GPUs argument he shows there.
I would recommend trying this. I'm currently working on it myself to figure out commands and get llama-sweep-bench benchmarks.
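Something along these lines is where I'd start (hypothetical model path; the `-c`/`-b`/`-ub` values mirror the sweep table further down this thread, and `--max-gpu` is the new argument from that PR):

```bash
# llama-sweep-bench sketch (ik_llama.cpp); hypothetical model path.
# Sweeps PP/TG speed as the KV cache fills, producing tables like the
# one below.
./build/bin/llama-sweep-bench \
  --model ~/models/Qwen3.5-122B-A10B-IQ4_KSS.gguf \
  -c 135168 -b 4096 -ub 4096 \
  -fa -ngl 999 -sm graph --max-gpu 2
```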
> Anyway, good news here is ik_llama.cpp just added `-sm graph` support, and ik shows it running full offload on many 3090s, kind of like your rig: https://github.com/ikawrakow/ik_llama.cpp/pull/1347; note the point about the max-2-GPUs argument he shows there. I would recommend trying this. I'm currently working on it myself to figure out commands and get llama-sweep-bench benchmarks.
Already done; I commented on ik's PR directly. There I am vikcious, sorry for the confusion! 🙂 With `--max-gpu 2` it flies...
`main: n_kv_max = 135168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 1, n_threads_batch = 1`
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 128 | 0 | 2.305 | 1777.30 | 2.243 | 57.06 |
| 4096 | 128 | 4096 | 2.407 | 1701.96 | 2.234 | 57.30 |
| 4096 | 128 | 8192 | 2.443 | 1676.65 | 2.265 | 56.52 |
| 4096 | 128 | 12288 | 2.528 | 1620.05 | 2.280 | 56.14 |
| 4096 | 128 | 16384 | 2.690 | 1522.47 | 2.262 | 56.58 |
| 4096 | 128 | 20480 | 2.795 | 1465.50 | 2.311 | 55.39 |
| 4096 | 128 | 24576 | 2.887 | 1418.91 | 2.359 | 54.26 |
| 4096 | 128 | 28672 | 3.034 | 1350.19 | 2.363 | 54.18 |
| 4096 | 128 | 32768 | 3.155 | 1298.31 | 2.420 | 52.89 |
| 4096 | 128 | 36864 | 3.253 | 1259.09 | 2.434 | 52.59 |
| 4096 | 128 | 40960 | 3.326 | 1231.55 | 2.388 | 53.60 |
| 4096 | 128 | 45056 | 3.473 | 1179.29 | 2.424 | 52.81 |
| 4096 | 128 | 49152 | 3.803 | 1077.09 | 2.443 | 52.39 |
| 4096 | 128 | 53248 | 4.054 | 1010.45 | 3.041 | 42.09 |
| 4096 | 128 | 57344 | 3.821 | 1071.99 | 2.701 | 47.38 |
| 4096 | 128 | 61440 | 3.954 | 1036.03 | 2.684 | 47.69 |
| 4096 | 128 | 65536 | 4.082 | 1003.42 | 2.739 | 46.74 |
| 4096 | 128 | 69632 | 4.167 | 983.05 | 2.850 | 44.91 |
| 4096 | 128 | 73728 | 4.216 | 971.48 | 2.732 | 46.85 |
| 4096 | 128 | 77824 | 4.318 | 948.61 | 2.688 | 47.62 |
| 4096 | 128 | 81920 | 4.414 | 927.87 | 2.689 | 47.61 |
| 4096 | 128 | 86016 | 4.542 | 901.82 | 2.705 | 47.32 |
| 4096 | 128 | 90112 | 4.634 | 883.98 | 2.745 | 46.63 |
| 4096 | 128 | 94208 | 4.779 | 857.03 | 2.688 | 47.62 |
| 4096 | 128 | 98304 | 4.868 | 841.49 | 2.742 | 46.67 |
| 4096 | 128 | 102400 | 4.974 | 823.43 | 2.765 | 46.29 |
| 4096 | 128 | 106496 | 5.091 | 804.60 | 2.839 | 45.09 |
| 4096 | 128 | 110592 | 5.218 | 785.05 | 2.787 | 45.93 |
| 4096 | 128 | 114688 | 5.354 | 765.00 | 2.822 | 45.36 |
| 4096 | 128 | 118784 | 5.417 | 756.21 | 2.906 | 44.05 |
| 4096 | 128 | 122880 | 5.563 | 736.28 | 2.917 | 43.88 |
| 4096 | 128 | 126976 | 5.580 | 734.07 | 2.924 | 43.77 |
| 4096 | 128 | 131072 | 5.785 | 708.02 | 2.901 | 44.12 |
Btw, stupid question... if I need the mmproj file to load images into the model (helps a lot with testing, where the model can look at .pngs of its own frontend and debug / look for errors), can I use the one from the Unsloth repository? Would that work?
Well, it looks like it doesn't work... any clue? 🙂
```
mtmd_init_from_file: error: mismatch between text model (n_embd = 3072) and mmproj (n_embd = 4096)
hint: you may be using wrong mmproj
ERR [ load_model] failed to load multimodal model, '/home/vik/models/gguf/ubergarm/Qwen3.5-122B-A10B-IQ4_KSS/mmproj-F16.gguf'
~ggml_backend_cuda_context: have 21 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
~ggml_backend_cuda_context: have 20 graphs
~ggml_backend_cuda_context: have 34 graphs
free(): invalid pointer
./ik-server_Qwen3.5-122B-A10B-IQ4_KSS.sh: line 21: 165731 Aborted (core dumped)
```
Ahh okay, yes, I saw your sweeps over on ik's PR and graphed them!
> if I need the mmproj file to load images into the model (helps a lot with testing, where the model can look at .pngs of its own frontend and debug / look for errors), can I use the one from the Unsloth repository? Would that work?

> mismatch between text model (n_embd = 3072) and mmproj (n_embd = 4096)
In theory yes, but apparently no? I haven't tried it yet myself, as that feature was only recently added to ik_llama.cpp... I think I grabbed the mmproj from AesSedai for Kimi-K2.5 and that worked in the past... maybe it's something with the qwen delta net stuff?
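One way to sanity-check a candidate mmproj before loading it: dump the GGUF metadata of both files and compare the embedding widths the error is complaining about. A sketch assuming the `gguf` pip package (which ships the gguf-dump tool); the model filename is hypothetical, the mmproj path is from your log, and the grep patterns are guesses, so eyeball the full dump if they miss:

```bash
# Compare the text model's hidden size with the width the mmproj projects
# into; the error above means these disagree (3072 vs 4096).
gguf-dump --no-tensors ~/models/gguf/ubergarm/Qwen3.5-122B-A10B-IQ4_KSS/Qwen3.5-122B-A10B-IQ4_KSS.gguf \
  | grep -i embedding_length
gguf-dump --no-tensors ~/models/gguf/ubergarm/Qwen3.5-122B-A10B-IQ4_KSS/mmproj-F16.gguf \
  | grep -iE 'embd|embedding|projection'
```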
Let me know if you find the answer, as I've got a few other irons in the fire currently!