Unable to use mmap() on this model?

#8
by x-polyglot-x - opened

In the past, I could use a combination of command-line flags like this:
-ngl 0
--no-warmup
-dev none

This would successfully load models larger than VRAM + RAM combined, relying on mmap() (and the OS page cache) for inference. It was slow, but very usable for overnight tasks.
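For reference, the full invocation looks roughly like this (a minimal sketch, not my exact command: the model filename and prompt are placeholders; mmap is llama.cpp's default loading path, so these flags just keep everything off the GPU and skip the warmup pass that would otherwise touch every tensor):

```sh
# Minimal sketch; model filename and prompt are placeholders.
#   -ngl 0       -> offload zero layers to the GPU
#   --no-warmup  -> skip the warmup run that would page in every tensor
#   -dev none    -> don't use any devices for offloading
./llama-cli \
  -m ./Kimi-K2.5-Q4_K_M.gguf \
  -ngl 0 \
  --no-warmup \
  -dev none \
  -p "Summarize the following report: ..."
```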

But this does not work with Kimi-K2.5, or at least it doesn't anymore: it always tries to load the full model into memory.

Has anyone else experienced this? Any ideas for how to get mmap() working? This trick worked on earlier Kimi-K2 releases, so I don't know whether something changed with Kimi-K2.5 specifically or whether a llama.cpp update is responsible. It still works for Qwen3.5 models that exceed VRAM + RAM, which suggests it is something about this particular model.
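In case it helps with diagnosis, this is roughly how I'm telling the two behaviors apart while the model loads (a rough sketch; run it in a second terminal): with mmap, the weights land in the OS page cache, so `free` reports them under `buff/cache`; a full in-memory load drives up `used` instead.

```sh
# Rough diagnostic sketch: with mmap, the model weights show up under
# "buff/cache" in free(1); a full in-memory load shows up under "used".
watch -n 2 free -h
```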

Any tips or advice is appreciated!
