Core dumped for me, using llama.cpp b8069. What did I do wrong?
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | launching slot : {"id":0,"n_ctx":262144,"speculative":false,"is_processing":false}
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 81
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
srv update_chat_: Parsing chat message:
Parsing input with format Hermes 2 Pro:
srv stop: all tasks already finished, no need to cancel
srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1771268548,"id":"chatcmpl-igXfjxhlo4BNL5NY09ZULwRJusUwq5P2","model":"minimax-m2","system_fingerprint":"b8069-d5dfc3302","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":0,"prompt_ms":5e-324,"prompt_per_token_ms":null,"prompt_per_second":null,"predicted_n":0,"predicted_ms":5.3514745826332e-310,"predicted_per_token_ms":null,"predicted_per_second":null},"prompt_progress":{"total":81,"cache":0,"processed":0,"time_ms":0}}
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 81, batch.n_tokens = 81, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 81, batch.n_tokens = 81
slot init_sampler: id 0 | task 0 | init sampler, took 0.04 ms, tokens: text = 81, total = 81
srv update_slots: decoding batch, n_tokens = 81
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2351: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
llama-server(+0x62579f)[0x6282ef99679f]
llama-server(+0x73130a)[0x6282efaa230a]
llama-server(+0x731a13)[0x6282efaa2a13]
llama-server(+0x7364f7)[0x6282efaa74f7]
llama-server(+0x738d19)[0x6282efaa9d19]
llama-server(+0x632d46)[0x6282ef9a3d46]
llama-server(+0x432e53)[0x6282ef7a3e53]
llama-server(+0x43325a)[0x6282ef7a425a]
llama-server(+0x436aca)[0x6282ef7a7aca]
llama-server(+0x1cf153)[0x6282ef540153]
llama-server(+0x222240)[0x6282ef593240]
llama-server(+0xfd94b)[0x6282ef46e94b]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x78ad8b99a1ca]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x78ad8b99a28b]
llama-server(+0x16ee05)[0x6282ef4dfe05]
/mnt/data/models/Qwen3.5-397B-A17B-GGUF/start-llama: line 26: 139 Aborted (core dumped) llama-server --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf --verbose --host 0.0.0.0 --alias minimax-m2 --parallel 1 --cache-ram 262144 --temp 1.0 --top-p 0.95 --top-k 40 --repeat-penalty 1.05 --reasoning-format auto --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --fit on --batch-size 4096 --ubatch-size 1024 --ctx-checkpoints 10 --cont-batching --no-mmap --mmproj /mnt/data/models/Qwen3.5-397B-A17B-GGUF/mmproj-BF16.gguf --mmproj-offload --jinja
Try reducing the ctx-size.
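For example, adding an explicit --ctx-size to the original command would look roughly like this (32768 is only an illustrative value, not a recommendation):

llama-server --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf --ctx-size 32768 [rest of the original flags unchanged]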
The system is nowhere near running out of GPU RAM, maybe 70% used across 3x RTX PRO 6000. I lowered the ctx-size to 32768 anyway. Both the Q4_K_XL and the MXFP4 quants fail immediately on the prompt. I've removed everything non-essential from the start command and it still core dumps.
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
srv load_model: initializing slots, n_slots = 1
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
slot reset: id 0 | task -1 |
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
que start_loop: processing new tasks
que start_loop: update slots
srv update_slots: all slots are idle
que start_loop: waiting for new tasks
srv params_from_: Grammar:
srv params_from_: Grammar lazy: false
srv params_from_: Chat format: Hermes 2 Pro
res add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que post: new task, id = 0/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | launching slot : {"id":0,"n_ctx":32768,"speculative":false,"is_processing":false}
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 281
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 281, batch.n_tokens = 281, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 281, batch.n_tokens = 281
srv update_chat_: Parsing chat message:
Parsing input with format Hermes 2 Pro:
slot init_sampler: id 0 | task 0 | init sampler, took 0.13 ms, tokens: text = 281, total = 281
srv update_slots: decoding batch, n_tokens = 281
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv stop: all tasks already finished, no need to cancel
srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1771336710,"id":"chatcmpl-fID4B9TJJax4axVgdkGnkcwZa2ArgLP5","model":"minimax-m2","system_fingerprint":"b8069-d5dfc3302","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":0,"prompt_ms":0.0,"prompt_per_token_ms":null,"prompt_per_second":null,"predicted_n":0,"predicted_ms":4.6518922289705e-310,"predicted_per_token_ms":null,"predicted_per_second":null},"prompt_progress":{"total":281,"cache":0,"processed":0,"time_ms":0}}
/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2351: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
llama-server(+0x62579f)[0x55a21409b79f]
llama-server(+0x73130a)[0x55a2141a730a]
llama-server(+0x731a13)[0x55a2141a7a13]
llama-server(+0x7364f7)[0x55a2141ac4f7]
llama-server(+0x738d19)[0x55a2141aed19]
llama-server(+0x632d46)[0x55a2140a8d46]
llama-server(+0x432e53)[0x55a213ea8e53]
llama-server(+0x43325a)[0x55a213ea925a]
llama-server(+0x436aca)[0x55a213eacaca]
llama-server(+0x1cf153)[0x55a213c45153]
llama-server(+0x222240)[0x55a213c98240]
llama-server(+0xfd94b)[0x55a213b7394b]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7adf4099a1ca]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7adf4099a28b]
llama-server(+0x16ee05)[0x55a213be4e05]
/mnt/data/models/Qwen3.5-397B-A17B-GGUF/start-llama-q4kxl: line 12: 139 Aborted (core dumped) llama-server --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf --verbose --host 0.0.0.0 --alias minimax-m2 --ctx-size 32768 --parallel 1 --temp 1.0 --top-p 0.95 --top-k 40 --jinja
I had the same problem when using --fit on, but when I distributed everything using -ot, it started working smoothly.
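For anyone who wants to try the same thing, here is a rough sketch of that kind of manual split. The -ot (--override-tensor) flag takes regex=buffer-type pairs; the layer ranges and CUDA device indices below are placeholders rather than my actual mapping, so adjust them to the model's block count and each GPU's free memory (MoE expert tensors in these GGUFs are typically named blk.N.ffn_*_exps):

llama-server \
  --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf \
  --ctx-size 32768 \
  --flash-attn on \
  --jinja \
  --n-gpu-layers 999 \
  -ot "blk\.([0-9]|1[0-9]|2[0-9])\.ffn_.*_exps.*=CUDA0" \
  -ot "blk\.(3[0-9]|4[0-9]|5[0-9])\.ffn_.*_exps.*=CUDA1" \
  -ot "blk\.(6[0-9]|7[0-9]|8[0-9]|9[0-9])\.ffn_.*_exps.*=CUDA2"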
Issue 19116 seems to be specific to --cache-reuse, but I'm not using that. I've given up on the model for now. I'll come back to it once it's working more reliably, or once I get my 4th GPU and switch to a different server, not llama.cpp.
I'm using RTX 6000 Pro (but 2 of them) + 2x RTX 5090. After a fresh fetch and build, the command below started working as it should (I had similar problems up until yesterday); the rebuild steps are sketched after the command.
OLLAMA_NUM_PARALLEL=1 CUDA_VISIBLE_DEVICES=0,1,2,3 \
./llama.cpp/build/bin/llama-server \
--model ~/.cache/llama.cpp/unsloth_Qwen3.5-397B-A17B-GGUF_UD-Q4_K_XL_Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf \
--chat-template-file ~/qwen35-chat-template.jinja \
--main-gpu 0 \
--threads 60 \
--n-gpu-layers 999 \
--fit on \
--ctx-size 262144 \
--jinja \
--flash-attn on \
--no-mmap \
--mlock \
--cont-batching \
--temp 0.6 \
--top_p 0.95 \
--top_k 40 \
--min_p 0.05 \
--no-context-shift \
--cache-ram -1 \
--host 0.0.0.0 \
--port 11110
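(For reference, the fresh fetch and build is just the standard llama.cpp CUDA rebuild; a minimal sketch, assuming an existing clone and a working CUDA toolchain:)

# from the existing clone: pull the latest commits and rebuild with CUDA enabled
cd llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j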
Indeed @AImhotep, I didn't re-download the file, just changed to the latest llama.cpp, and the core dump doesn't appear to happen anymore. Before, when I ran the model, the core dump was immediate. I've picked up an opencode task in progress and switched to this model running in llama.cpp. Tool calls seem to be working fine, and I'm not noticing any obvious errors in the output or logs. I'm getting 65 tokens per second generation with about 30K of context in the task. It seems pretty stable now, so I'm going to take on some bigger tasks with lots of tool calls.