How to split this model between 2 (3) GPUs and CPU/RAM?
I'm kinda struggling to figure out which tensors to offload where, any help please?
If you're using 2 or more GPUs, the easy way on ik_llama.cpp is to use -sm graph, and now you can also use -muge with that. This does "tensor parallel" and handles splitting between GPUs automatically.
Basically use the quick start example command from my 35B version and add -sm graph and see how it goes for you: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#quick-start
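For reference, a minimal multi-GPU startup sketch along those lines (the model path, context size, host, and port are placeholders; the flags are the ones discussed in this thread, so tune them to your VRAM):

# Hypothetical minimal sketch: full GPU offload split across all visible GPUs.
# "$model" is a placeholder for your GGUF path; adjust -c to your VRAM budget.
./build/bin/llama-server \
  --model "$model" \
  -ngl 99 \
  -sm graph \
  -muge \
  -fa on \
  -c 32768 \
  --jinja \
  --host 127.0.0.1 \
  --port 8080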
If you're still getting stuck, let me know your GPU models and the VRAM size of each, assuming you're attempting full offload. If you're doing hybrid CPU+GPU, let me know your processor(s), NUMA config if any (e.g. dual-socket servers or Xeon/EPYC class CPUs), and how much DRAM.
I'm out for a bit, but will check in again before end of the week.
Cheers!
Hey thanks for your reply!
My setup is 3x RTX 3090 (P2P enabled), although I'd like to offload more to CPU to keep 1x RTX 3090 free for diffusion models, if possible. The CPU is an EPYC 7J13 (64C/128T) with 256 GB RAM. NUMA is set to NPS4 in the H12SSL-NT BIOS.
These are my arguments with ik v4466 build from Thireus that let me start it up:
export GGML_CUDA_GRAPH_OPT=1
./llama-server \
-ngl 99 \
-c 131072 \
-ctk f16 -ctv q8_0 \
-sm graph \
--model models/ubergarm_Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf \
--jinja \
-np 1 \
--threads 1 \
--host 0.0.0.0 \
--port 8081 \
--api-key 12345 \
--alias "Qwen3.5-122B" \
--flash-attn on \
--batch-size 4096 \
--ubatch-size 1024 \
--no-mmap \
--reasoning-tokens none \
--chat-template-kwargs "{\"enable_thinking\": false}"
I get about 245 t/s PP, and 78 t/s TG.
I also see this, which is weird to me (no tensors on 2 of the GPUs?):
================================ max_gpu = 0
Estimated model buffer size per device:
Device 0: 20733.57 MiB
Device 1: 20669.17 MiB
Device 2: 20659.79 MiB
No tensors in buffer type CUDA0
No tensors in buffer type CUDA1
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 602.46 MiB
llm_load_tensors: CUDA_Split buffer size = 62063.39 MiB
llm_load_tensors: CUDA2 buffer size = 602.47 MiB
If I add --merge-qkv it says it doesn't apply when used with -sm-graph so merging is disabled.
When I try to add -muge I get this weird crash I've not seen before:
================================ max_gpu = 0
/home/runner/work/ik_llama.cpp/ik_llama.cpp/src/llama-load-tensors.cpp:3436: GGML_ASSERT(ntot == extra->splits[is]->ne[extra->split_dim]) failed
libggml.so(+0x101bcb) [0x7fb0e4101bcb]
libggml.so(ggml_abort+0x15e) [0x7fb0e410397e]
libllama.so(+0x11d5e5) [0x7fb106f1d5e5]
libllama.so(_ZN21create_tensors_helper14create_tensorsEv+0x40dc) [0x7fb106f77efc]
libllama.so(+0xb75c3) [0x7fb106eb75c3]
libllama.so(llama_model_load_from_file+0xccf) [0x7fb106eb9bef]
./llama-server(+0x2165dc) [0x56520f4485dc]
./llama-server(+0x13e6f7) [0x56520f3706f7]
./llama-server(+0x8d7d3) [0x56520f2bf7d3]
/lib/x86_64-linux-gnu/libc.so.6(+0x29ca8) [0x7fb0e3a35ca8]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85) [0x7fb0e3a35d65]
./llama-server(+0x94695) [0x56520f2c6695]
Aborted
I also tried your Devstral-2-123B but there I have even less luck fitting into 3 GPUs...
Oh great, you're able to get a baseline command up and running. Great start, and you can iterate on it from here as desired.
My setup is 3x RTX 3090 (P2P enabled), although I'd like to offload more to CPU to keep 1x RTX 3090 free for diffusion models, if possible. The CPU is an EPYC 7J13 (64C/128T) with 256 GB RAM. NUMA is set to NPS4 in the H12SSL-NT BIOS.
Okay, if you want to use only two GPUs you can export another env var or prepend it to the startup command, e.g.:
CUDA_VISIBLE_DEVICES="0,1" \
GGML_CUDA_GRAPH_OPT=1 \
./llama-server \
...
Keep in mind, however, that the IQ4_KSS quant at 61.219 GiB (4.306 BPW) is going to require all 3x 3090s (72GB VRAM total) to run comfortably with full context etc.
No tensors in buffer type CUDA0
Hrmm, clearly it is using all 3x GPUs or it wouldn't fit otherwise, e.g. CUDA_Split buffer size = 62063.39 MiB. That note might be related to the compute buffers or something. Seems like it is running okay, so it should be fine.
If I add --merge-qkv it says it doesn't apply when used with -sm-graph so merging is disabled.
Yes, that is expected: --merge-qkv only works with -sm layer, but you're not losing much, as -sm graph is much more important than fusing a few small tensors. You'd have to go read the old closed PRs on ik_llama.cpp to learn all this, but it's okay haha...
-muge I get this weird crash I've not seen before:
Oh that is not good, hrmm.. -muge should work now with -sm graph as of March 15th. You might just need a newer build?
Anyway, great job and you have a very nice rig with plenty of model options! If you want to use only 2x GPUs for LLM and 1x GPU for ComfyUI then check out my most recent quant which is among the best for lower VRAM: https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#iq4_ks-19799-gib-4907-bpw
Finally, since you are using NPS4, unfortunately your 256GB RAM will be split into 64GB NUMA nodes along with associated CPU chiplet cores.
If you need to do some CPU/RAM hybrid inference you can do stuff like this:
Glue together all NUMA nodes to access all memory for a big model:
numactl --interleave=ALL \
llama-server \
--numa distribute \
...
Use only a single NUMA node and 64GB of RAM for a smaller model, e.g.:
numactl -N 0 -m 0 \
llama-server \
--numa numactl \
...
Thanks for all the pointers and hints above. Every day I learn something new and it feels empowering!
Regarding NPS4, I don't need to run it that way. I set it up because the web said it's the most efficient way for this application. Originally it was set to NPS1, and to be honest, unless I can somehow (scientifically) demonstrate a difference between the two, it makes no difference (no pun intended) to me which one it is. 😄
Oh I forgot, I also want to mention something about this model.
If I set my context to 16384 I get near-perfect output on the first try when I ask it to make me an SVG of an analog clock with a certain time on it (this seems to be one of those tests that many models fail).
Any context above 16384 and the quality starts degrading to be totally off and unusable.
Do you know why that is, because I thought the context size does not determine the quality of the output (the precision of the quants does)?
I set it up because the web said it's the most efficient way for this application.
If your application is optimized for NUMA-aware thread<->memory management, then yes, it will give the best results by keeping memory channels going to the closest chiplet for best bandwidth and lowest latency.
However, llama.cpp and ik_llama.cpp are not NUMA-optimized. If you're seeing similar speeds either way then don't worry about it, but in general I just go with NPS1 and then don't have to use numactl. Though I've heard some folks report they got faster with NPS4, so what do I know! haha...
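If you do want to demonstrate a difference scientifically, one rough way is to run an identical benchmark under each BIOS setting and compare the numbers. A hedged sketch using llama-sweep-bench (which ships with ik_llama.cpp and shows up again later in this thread); the model path, context size, and thread count are placeholders:

# Hypothetical NPS1 vs NPS4 A/B test: set the mode in BIOS, reboot, run the identical
# CPU-only sweep each time, then compare the resulting PP/TG numbers.
# Under NPS4 you would additionally prepend "numactl --interleave=all" and add "--numa distribute".
./build/bin/llama-sweep-bench \
  --model "$model" \
  -c 32768 \
  -ngl 0 \
  --threads 32 \
  --no-mmap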
Any context above 16384 and the quality starts degrading to be totally off and unusable.
Do you mean you're starting the server with -c 16384, or are you just watching your client's context count climb higher and higher until it is over 16k, and then asking it the question in a long string of user<->assistant chats?
I assume you're just running up the context a bit manually before asking it your question. If so, then in general all models degrade in quality over long context. This is true of all models/quants as far as I know. But "totally unusable" sounds stronger than just degraded.
There are some things you can do to improve long context performance:
- Use the default of unquantized kv-cache, e.g. -ctk f16 -ctv f16
- This is kind of advanced, at your own risk, but you can change the fa-offset in ik_llama.cpp, e.g. -cuda fa-offset=0. I've had good luck with it with this model. 0 is the "best" as it won't scale flash attention and can preserve very small numbers near zero. Otherwise it scales things down a bit to prevent accumulator overflow for some other less well-behaved models. If you see a bunch of ..... or GGGGGG it means NaNs and gibberish or whatever, so remove it. Otherwise it doesn't hurt anything to try. (A combined example is sketched just below.)
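Putting those two suggestions together on top of your earlier startup command, a minimal sketch (everything else unchanged; note that an f16 V-cache takes a bit more VRAM than q8_0, so you may need to trim the context):

# Hypothetical combination of the two suggestions above: unquantized KV cache
# plus the CUDA flash-attention offset override, layered onto the existing command.
./llama-server \
  --model models/ubergarm_Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_KSS.gguf \
  -ngl 99 \
  -sm graph \
  -fa on \
  -ctk f16 -ctv f16 \
  -cuda fa-offset=0 \
  -c 131072 \
  --jinja \
  --host 0.0.0.0 \
  --port 8081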
If your application is optimized for NUMA-aware thread<->memory management, then yes, it will give the best results by keeping memory channels going to the closest chiplet for best bandwidth and lowest latency.
However, llama.cpp and ik_llama.cpp are not NUMA-optimized. If you're seeing similar speeds either way then don't worry about it, but in general I just go with NPS1 and then don't have to use numactl. Though I've heard some folks report they got faster with NPS4, so what do I know! haha...
It's that "Works great for me!" kinda thing...an anecdotal evidence that NPS4 is better than NPS1 with these local models which we trust without a qualification, LOL.
Do you mean you're starting the server with -c 16384, or are you just watching your client's context count climb higher and higher until it is over 16k, and then asking it the question in a long string of user<->assistant chats?
I assume you're just running up the context a bit manually before asking it your question. If so, then in general all models degrade in quality over long context. This is true of all models/quants as far as I know. But "totally unusable" sounds stronger than just degraded.
I mean I set the context size to begin with at 16384. If I go above that, it starts to mess up the quality of the SVG (misalignments and glitches), and things get all over the place at 32768. The client's token count is always around the same, since the SVG code generally ends up being about the same. I repeat the same test over and over again with different starting context sizes.
There are some things you can do to improve long context performance:
- Use the default of unquantized kv-cache, e.g. -ctk f16 -ctv f16
- This is kind of advanced, at your own risk, but you can change the fa-offset in ik_llama.cpp, e.g. -cuda fa-offset=0. I've had good luck with it with this model. 0 is the "best" as it won't scale flash attention and can preserve very small numbers near zero. Otherwise it scales things down a bit to prevent accumulator overflow for some other less well-behaved models. If you see a bunch of ..... or GGGGGG it means NaNs and gibberish or whatever, so remove it. Otherwise it doesn't hurt anything to try.
I already use -cuda fa-offset=0 - I read your suggestion for it elsewhere. 😁
I mean I set the context size to begin with at 16384. If I go above that, it starts to mess
Oh, if you're running the model with -c 16384 to begin with, I don't think you can "go over" that size; it is not a shifting window or whatever? I'm guessing it just drops or cuts off the beginning of your context window or something, but I would not expect that to work.
In practice, try to set -c xxxx higher than what you'll need, and as you approach that max window size use /compact or whatever on the client to "compress" the context by summarizing it to itself.
I already use -cuda fa-offset=0 - I read your suggestion for it elsewhere. 😁
Nice!!!
Oh one more thing, ik just got a patch today for Qwen3.5 models when using -muge or fused up|gate ffn tensors as there was a bug with exactly 3x GPUs: https://github.com/ikawrakow/ik_llama.cpp/pull/1479
Make sure to get that PR built if testing, or likely it will be merged into main soon. If you try it and it works for you feel free to report on the issue/PR itself.
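In case it's useful, one hedged way to build that PR branch locally before it lands in main (double-check the repo README for the current CUDA build flags):

# Fetch and build PR #1479 in an existing ik_llama.cpp clone (assumes a CUDA toolchain is installed).
cd ik_llama.cpp
git fetch origin pull/1479/head:pr-1479
git checkout pr-1479
cmake -B build -DGGML_CUDA=ON
cmake --build build -j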
Cheers!
That patch fixed the -muge crash indeed. Thanks.
Regarding the context, what I mean is that if my context is larger than 16384, the model does not produce good first-shot results. The more I increase the context and try again, the more its output is messed up.
Regarding the context, what I mean is that if my context is larger than 16384, the model does not produce good first-shot results. The more I increase the context and try again, the more its output is messed up.
Is this what you mean?
Case 1
Run llama-server -c 16384 ... and everything works fine
Case 2
Run llama-server -c 131072 (or anything higher than 16384) and it does not produce good first-shot results?
This should not be happening, as the amount of context reserved on startup should not affect the output if you're doing 1-shot short prompts without any context loaded.
Are you testing through the built-in local web-ui interface? Or which client are you using, e.g. opencode or whatever?
Yes, that's correct, on 16k context 1-shot seems fine, but if I increase the context beyond that it starts messing things up. I know, makes no sense at all. 🤷
I use the built-in web-ui, it being the easiest and fastest thing to iterate on.
Huh, using the built-in web-ui is probably the safest way to test.
That is very unusual behavior though... I am running right now on 2x RTX A6000 (the older sm86 arch) 48GB VRAM each like so just fine:
./build/bin/llama-server \
--alias Qwen3.5-122B-A10B \
--model "$model" \
--mmproj "$mmproj" \
--image-min-tokens 1024 \
--image-max-tokens 4096 \
-ub 4096 -b 4096 \
-fa on \
-ctk f16 -ctv f16 -cuda fa-offset=0 \
-c 262144 \
-muge \
-sm graph \
-ngl 99 \
--no-mmap \
--parallel 1 \
--threads 1 \
--host 127.0.0.1 \
--port 8080 \
--jinja \
--ctx-checkpoints 48 \
--ctx-checkpoints-interval 512 \
--ctx-checkpoints-tolerance 5 \
--cache-ram 16384
You could try --max-gpu 2 perhaps? It might still be something related to the 3x GPUs?
If you can repeat it, it might be worth opening an issue on ik_llama.cpp, but I'm not sure of the best way to test...
If I set --max-gpu=2, I run out of memory when launching. Even without it, when I try the inference it crashes OOM. That's with either 256k or 128k context. Dropping it down to 64k I can finally get an output, at just under 80 t/s.
"create a svg of an analog clock showing a time 10:15:35. add ticks for hours and roman numerals for each hour around the clock's inner border and orient them precisely in the correct order. make the hour handle wider than the minute handle, and colour the seconds handle differently."
Now I'm getting totally opposite results!?
For comparison, AesSedai's Qwen3.5-122B_IQ4_XS quant, 81 t/s:
All 1-shot, same startup parameters for both quants, except for the context:
./build/bin/llama-server \
--model models/AesSedai_Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-IQ4_XS-00001-of-00003.gguf \
--alias Qwen3.5-122B \
--ctx-size 262144 \
-ctk f16 -ctv q8_0 \
-ngl 99 \
-sm graph \
-muge \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--min-p 0.0 \
--top-k 20 \
-t 4 \
--parallel 1 \
--host 0.0.0.0 \
--port 8081 \
--api-key 12345 \
--flash-attn on \
--batch-size 4096 \
--ubatch-size 1024 \
--no-mmap \
--ctx-checkpoints 48 \
--ctx-checkpoints-interval 512 \
--ctx-checkpoints-tolerance 5 \
--cache-ram 16384 \
--reasoning-tokens none \
--chat-template-kwargs "{\"enable_thinking\": false}"
Ahh, thanks for the very detailed explanation and results! For a test like this it might be prudent to try fixing the seed, e.g. --seed 1337 or whatever you like. Given sampling uses RNG based on the initial seed at startup, it may allow you to see the differences more clearly while taking as much randomness out of the picture as possible.
Ahh, thanks for the very detailed explanation and results! For a test like this it might be prudent to try fixing the seed, e.g. --seed 1337 or whatever you like. Given sampling uses RNG based on the initial seed at startup, it may allow you to see the differences more clearly while taking as much randomness out of the picture as possible.
Cool, thanks, I didn't think of the seed argument even though I've seen it being used by people. Most of the time I'd just get rid of it, LOL. The same seed should definitely keep the results consistent then; will try that soon.
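For what it's worth, a minimal sketch of what a fixed-seed comparison run could look like on the server side; everything except the model path should stay identical between the two quants (GPU kernels can still introduce small nondeterminism, so outputs may not be bit-identical):

# Hypothetical fixed-seed A/B run: the same seed and sampling settings for both quants.
./llama-server \
  --model "$model" \
  --seed 1337 \
  --temp 0.6 --top-p 0.95 --min-p 0.0 --top-k 20 \
  -ngl 99 -sm graph -muge \
  --flash-attn on \
  --jinja \
  -c 65536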
"create a svg of an analog clock showing a time 10:15:35. add ticks for hours and roman numerals for each hour around the clock's inner border and orient them precisesy in the correct order. make the hour handle wider than the minute handle, and colour the seconds handle differently."
Pinching this prompt for my tests :)
"create a svg of an analog clock showing a time 10:15:35. add ticks for hours and roman numerals for each hour around the clock's inner border and orient them precisesy in the correct order. make the hour handle wider than the minute handle, and colour the seconds handle differently."Pinching this prompt for my tests :)
BTW, I made a typo in "precisely"
I have a similar question:
What performance drop (approximate, not necessarily exact measurements) could one expect from running this on something like dual RTX 5090s (PCI-E 5.0 bifurcated to 8x/8x, P2P mode) on ik_llama.cpp (using its tensor parallelism implementation) vs real/optimized TP on vLLM or TensorRT-LLM?
Of course assuming all layers and KV cache are offloaded to VRAM, because vLLM has only a very basic/poor layer offloading implementation (they call it streaming), so performance would really tank for vLLM.
I'd expect at least a 30-40% performance drop with ik_llama, but I also recently read that in some cases ik_llama's TP may not even show any speed increase vs pipeline parallelism.
On the other hand, it supports efficient layer offloading (I assume it can be combined with the -sm graph and -muge arguments for TP), which is very useful for MoE LLMs. So in this case, if one could offload at least 25% of the layers to RAM (considering PCI-E 5.0 8x and 6000MT/s RAM) with only a minor performance drop, then it would allow at least 4-bit quants (which is basically the minimum for decent accuracy for this model) and a decent KV cache size, right?
Also, is my assumption correct that it wouldn't make much sense to run it at Q4 or Q5 with a single 5090 (even with PCIE 5.0 16x and 6000MT/s RAM), because most of the time it would probably be waiting on system RAM access (as most weights would be offloaded there, if one also needs a decent amount of VRAM for the KV cache)? I know it'd be best to do the testing myself, but I currently have access to a 4090 and would have to find and rent a VM with a 5090 somewhere online.
@krzysztofma try it out by creating a bash script for each scenario and then plot them all on the same graph
./build/bin/llama-sweep-bench \
--model "$model" \
-c 69632 \
-muge \
-sm graph \
-ngl 999 \
-ub 4096 -b 4096 \
--threads 1 \
--no-mmap \
-n 64 \
--warmup-batch
Just vary this base script to suit your needs.
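For instance, a rough sketch of wrapping that base command in a loop over two split modes (the model path, scenario list, and log names are just placeholders; some flag combinations may need tweaking for your setup):

#!/usr/bin/env bash
# Hypothetical wrapper: run the same sweep with -sm graph vs -sm layer and save the
# logs, so the PP/TG results can be plotted side by side afterwards.
model="/path/to/model.gguf"   # placeholder path
for split in graph layer; do
  ./build/bin/llama-sweep-bench \
    --model "$model" \
    -c 69632 \
    -muge \
    -sm "$split" \
    -ngl 999 \
    -ub 4096 -b 4096 \
    --threads 1 \
    --no-mmap \
    -n 64 \
    --warmup-batch \
    2>&1 | tee "sweep-sm-${split}.log"
done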
@krzysztofma try it out by creating a bash script for each scenario and then plot them all on the same graph
./build/bin/llama-sweep-bench \
--model "$model" \
-c 69632 \
-muge \
-sm graph \
-ngl 999 \
-ub 4096 -b 4096 \
--threads 1 \
--no-mmap \
-n 64 \
--warmup-batch
Just vary this base script to suit your needs.
Thanks,
Currently I don't have access to a PC with dual RTX 5090s and, as said, I'd have to find and rent such a host online for a day or so. Hopefully I will have a single 5090 locally next weekend.
Anyway, will this command (in addition to enabling TP) automatically offload the weights that don't fit in VRAM into RAM? Because there's no --fit argument (or --n-cpu-moe)?
Also, have you already checked the qwen3.6-35B model? If so, is it really almost as good as qwen3.5-27B or this one (qwen3.5-122B)?
will this command (in addition to enabling TP) automatically offload the weights that don't fit in VRAM into RAM? Because there's no --fit argument (or n-cpu-moe)?
The command I show assumes full GPU offload with no weights on CPU/RAM. You can do that too; just read up on the commands and examples I show on the model cards. You can use -fit if you want, or older methods like --n-cpu-moe 60, etc. Run it however makes sense for the hardware you currently have.
Also , have you already checked qwen3.6-35B model? If yes then is it really almost as good as qwen3.5-27B or this one(qwen3.5-122B)?
I have not yet quantized Qwen3.6-35B-A3B, but I hear it is very good with agentic tasks using a coding harness like pi.dev or opencode etc. If you trust their benchmarks, it might be better at agentic tasks, but worse at 1 shot coding etc. The best way to know is to try it out yourself for a while and see if you like it.
will this command (in addition to enabling TP) automatically offload the weights that don't fit in VRAM into RAM? Because there's no --fit argument (or --n-cpu-moe)?
The command I show assumes full GPU offload with no weights on CPU/RAM. You can do that too; just read up on the commands and examples I show on the model cards. You can use -fit if you want, or older methods like --n-cpu-moe 60, etc. Run it however makes sense for the hardware you currently have.
Also, have you already checked the qwen3.6-35B model? If so, is it really almost as good as qwen3.5-27B or this one (qwen3.5-122B)?
I have not yet quantized Qwen3.6-35B-A3B, but I hear it is very good with agentic tasks using a coding harness like pi.dev or opencode etc. If you trust their benchmarks, it might be better at agentic tasks, but worse at 1-shot coding etc. The best way to know is to try it out yourself for a while and see if you like it.
Yeah, the whole idea of running it on ik_llama vs vLLM for me was that it allows offloading routed expert layers to RAM without a significant performance drop (at least that's my understanding) while still allowing its implementation of TP (or at least quasi-TP). If it wasn't for that and everything could fit in VRAM, I would use vLLM for "real" TP.
Regarding the qwen3.6 35B MoE model: yesterday I (briefly) tried AesSedai's Q5_K_M quant on a coding prompt and it seemed to produce good results, but at only 20-25 t/s TG on an RTX 4090 and DDR4-3800, though with max context size (256K), so it offloaded about a dozen of the 40 total layers to RAM. I also tried using Q8 for the KV cache and offloaded more layers to GPU, but it wasn't that much faster.
I know it's a bit off topic, but just a small update regarding my qwen3.6 35B-A3B test results: after lowering the context length to a more common ~64K tokens (also with Q8_0 for the KV cache) and with 35 of 40 layers offloaded to GPU, I got ~48-50 tok/s TG. I also disabled thinking ({%- set enable_thinking = false %}) in LM Studio's prompt template, and the performance uplift is huge. Previously, with thinking, TTFT was over a minute.
BTW: My prompt was something like: "How to write an iPadOS app that allows keeping a history of copied texts, similar to how it works in Windows 11 using the Win+V key combination?"
I still haven't done a Qwen3.6 ik_llama.cpp quant yet haha... the world is moving too fast!
Glad you got something working for your setup that you like; we've entered a new era of home vibe coding hah...
Kimi-K2.6 is quite opinionated!
The user now wants an SVG of an analog clock displaying the time 10:15:35, with hour tick marks, Roman numerals, and specific hand requirements. I need to analyze this request first.
The user's requirements are clear: to create a static SVG clock displaying a specific time. However, I think it could be made more interesting by adding interactivity, such as allowing the clock to move or switch between different times.
I also want to add some visual effects, such as a slight shadow for the clock to give it a more three-dimensional feel. A dark background with a light-colored dial would create good contrast.
I still haven't done a Qwen3.6 ik_llama.cpp quant yet haha... the world is moving too fast!
Glad you got something working for your setup that you like; we've entered a new era of home vibe coding hah...
Definitely, I’m not the only person who thinks it’s both amazing and scary :)
I only recently started dipping my toes into local LLMs, agentic coding, etc., as I know how to code and until recently used LLMs mostly for simple coding assistance (i.e. prompt programming / code reviews) rather than agentic coding or anything remotely close to vibe coding.
But for people with very little coding skill (but enough general IT knowledge or determination), the frontier models (plus something like Claude Code, RAG, etc.) can be something truly extraordinary.
But these small/medium qwen3.6 models are also surprisingly good.
Also, I forgot to clarify one of my previous comments:
I (briefly) tried AesSedai's Q5_K_M quant on a coding prompt and it seemed to produce good results, but at only 20-25 t/s TG on an RTX 4090 and DDR4-3800, though with max context size (256K), so it offloaded about a dozen of the 40 total layers to RAM. I also tried using Q8 for the KV cache and offloaded more layers to GPU, but it wasn't that much faster.
I meant that even when allocating the max context size in VRAM (but using Q8 for the KV cache), I expected significantly better performance (than 20-25 t/s) from the Qwen3.6-35B-A3B model vs the 27B dense model that I tested previously. For the latter I got ~40 t/s (but of course I had to allocate a much smaller context size; I think it was 100K maximum). I read that for MoE models you could easily offload 40% (or even more) of the layers to RAM without a significant performance hit, but in this case it was almost 2x slower.
But I know that with the 27B dense version, the performance would tank if it actually had to use its max context size (even if, somehow, all model layers at Q4/Q5 plus the cache would fit in VRAM on my 4090 in the first place).
I think I saw someone's results on the 122B-A10B variants showing that even with a context size way above Qwen's standard max (256K), the performance drop was surprisingly small, so I guess this might be one of the biggest advantages of these MoE variants.
I read that for MoE models you could easily offload 40% (or even more) of the layers to RAM without a significant performance hit, but in this case it was almost 2x slower.
MoE models are much better than dense models for hybrid CPU+GPU offload. In both cases full GPU offload will likely be quite a bit faster, largely because token generation (decode) is primarily memory-read-bandwidth bottlenecked, and an old 3090TI offers ~1000GB/s vs a well-tuned two-channel DDR5-6400MT/s system delivering ~80GB/s.
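As a rough back-of-envelope sketch (my own estimate, ignoring KV-cache reads and other overhead): this 122B-A10B model activates roughly 10B parameters per token, which at ~4.3 BPW is on the order of ~5.4 GB of weights read per generated token. Dividing bandwidth by that gives an optimistic TG ceiling of about 1000 / 5.4 ≈ 185 t/s on a ~1000GB/s GPU vs 80 / 5.4 ≈ 15 t/s on an ~80GB/s DDR5 system, which is why keeping the hot weights in VRAM matters so much.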
It is important to offload the always-active tensors for an MoE though: that is all attn/shexp/dense layers, leaving only the routed experts on CPU/RAM. The routed experts are used sparsely, e.g. only 8 per token out of the available 100+, depending on the model.
With a dense model, every weight is always active, so offloading any part of it is going to slow you down more.
So pay close attention to your offload strategy: e.g. llama-server -ngl 50 ... works for dense models, but it is the wrong way to offload MoE models, which want llama-server -ngl 99 --n-cpu-moe 20 or whatever for your specific model. There are more ways to do it these days with -fit etc. too, so it is kind of confusing.
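To make that concrete, a minimal hybrid-offload sketch for an MoE model; the model path, context size, thread count, and the --n-cpu-moe value are placeholders to tune against your VRAM/RAM:

# Hypothetical MoE hybrid offload: mark all layers for GPU (-ngl 99) but keep the routed
# experts of 20 layers on CPU/RAM via --n-cpu-moe, so attn/shexp/dense stay in VRAM.
./llama-server \
  --model "$model" \
  -ngl 99 \
  --n-cpu-moe 20 \
  -fa on \
  -c 32768 \
  --threads 16 \
  --no-mmap \
  --jinja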
Cheers!