Unable to use mmap() on this model?

#8
by x-polyglot-x - opened

In the past, I could use a combination of command-line flags like this:
-ngl 0
--no-warmup
-dev none

This would successfully load models larger than VRAM + RAM combined, relying on mmap() (and the OS page cache) for inference. It was slow, but very usable for overnight tasks.
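For reference, the full invocation looks roughly like this (a minimal sketch, not my exact command: the model filename and prompt are placeholders; mmap is llama.cpp's default loading path, so these flags just keep everything off the GPU and skip the warmup pass that would otherwise touch every tensor):

```sh
# Minimal sketch; model filename and prompt are placeholders.
#   -ngl 0       -> offload zero layers to the GPU
#   --no-warmup  -> skip the warmup run that would page in every tensor
#   -dev none    -> don't use any devices for offloading
./llama-cli \
  -m ./Kimi-K2.5-Q4_K_M.gguf \
  -ngl 0 \
  --no-warmup \
  -dev none \
  -p "Summarize the following report: ..."
```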

But this does not work with Kimi-K2.5, or at least it doesn't anymore: it always tries to load the full model into memory.

Has anyone else experienced this? Any ideas for how to get mmap() working? This trick worked on earlier Kimi-K2 releases, so I don't know whether something changed with Kimi-K2.5 specifically or whether a llama.cpp update is responsible. It still works for Qwen3.5 models that exceed VRAM + RAM, which suggests it is something about this particular model.
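In case it helps with diagnosis, this is roughly how I'm telling the two behaviors apart while the model loads (a rough sketch; run it in a second terminal): with mmap, the weights land in the OS page cache, so `free` reports them under `buff/cache`; a full in-memory load drives up `used` instead.

```sh
# Rough diagnostic sketch: with mmap, the model weights show up under
# "buff/cache" in free(1); a full in-memory load shows up under "used".
watch -n 2 free -h
```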

Any tips or advice is appreciated!
