Core dumped for me, using llama.cpp b8069. What did I do wrong?
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | launching slot : {"id":0,"n_ctx":262144,"speculative":false,"is_processing":false}
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 81
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
srv update_chat_: Parsing chat message:
Parsing input with format Hermes 2 Pro:
srv stop: all tasks already finished, no need to cancel
srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1771268548,"id":"chatcmpl-igXfjxhlo4BNL5NY09ZULwRJusUwq5P2","model":"minimax-m2","system_fingerprint":"b8069-d5dfc3302","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":0,"prompt_ms":5e-324,"prompt_per_token_ms":null,"prompt_per_second":null,"predicted_n":0,"predicted_ms":5.3514745826332e-310,"predicted_per_token_ms":null,"predicted_per_second":null},"prompt_progress":{"total":81,"cache":0,"processed":0,"time_ms":0}}
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 81, batch.n_tokens = 81, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 81, batch.n_tokens = 81
slot init_sampler: id 0 | task 0 | init sampler, took 0.04 ms, tokens: text = 81, total = 81
srv update_slots: decoding batch, n_tokens = 81
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2351: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
llama-server(+0x62579f)[0x6282ef99679f]
llama-server(+0x73130a)[0x6282efaa230a]
llama-server(+0x731a13)[0x6282efaa2a13]
llama-server(+0x7364f7)[0x6282efaa74f7]
llama-server(+0x738d19)[0x6282efaa9d19]
llama-server(+0x632d46)[0x6282ef9a3d46]
llama-server(+0x432e53)[0x6282ef7a3e53]
llama-server(+0x43325a)[0x6282ef7a425a]
llama-server(+0x436aca)[0x6282ef7a7aca]
llama-server(+0x1cf153)[0x6282ef540153]
llama-server(+0x222240)[0x6282ef593240]
llama-server(+0xfd94b)[0x6282ef46e94b]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x78ad8b99a1ca]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x78ad8b99a28b]
llama-server(+0x16ee05)[0x6282ef4dfe05]
/mnt/data/models/Qwen3.5-397B-A17B-GGUF/start-llama: line 26: 139 Aborted (core dumped) llama-server --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf --verbose --host 0.0.0.0 --alias minimax-m2 --parallel 1 --cache-ram 262144 --temp 1.0 --top-p 0.95 --top-k 40 --repeat-penalty 1.05 --reasoning-format auto --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --kv-unified --fit on --batch-size 4096 --ubatch-size 1024 --ctx-checkpoints 10 --cont-batching --no-mmap --mmproj /mnt/data/models/Qwen3.5-397B-A17B-GGUF/mmproj-BF16.gguf --mmproj-offload --jinja
Try reducing the ctx-size.
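For example, adding an explicit --ctx-size to the original command would look roughly like this (32768 is only an illustrative value, not a recommendation):

llama-server --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf --ctx-size 32768 [rest of the original flags unchanged]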
The system is nowhere near running out of GPU RAM, maybe 70% used across 3x RTX PRO 6000. I lowered the ctx-size to 32768 anyway. Both the Q4_K_XL and the MXFP4 quants fail immediately on the prompt. I've removed everything non-essential from the start command and it still core dumps.
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
srv load_model: initializing slots, n_slots = 1
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 32768
slot reset: id 0 | task -1 |
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use `--cache-ram 0` to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
que start_loop: processing new tasks
que start_loop: update slots
srv update_slots: all slots are idle
que start_loop: waiting for new tasks
srv params_from_: Grammar:
srv params_from_: Grammar lazy: false
srv params_from_: Chat format: Hermes 2 Pro
res add_waiting_: add task 0 to waiting list. current waiting = 0 (before add)
que post: new task, id = 0/1, front = 0
que start_loop: processing new tasks
que start_loop: processing task, id = 0
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task -1 | launching slot : {"id":0,"n_ctx":32768,"speculative":false,"is_processing":false}
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 1, front = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, task.n_tokens = 281
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 281, batch.n_tokens = 281, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_tokens = 281, batch.n_tokens = 281
srv update_chat_: Parsing chat message:
Parsing input with format Hermes 2 Pro:
slot init_sampler: id 0 | task 0 | init sampler, took 0.13 ms, tokens: text = 281, total = 281
srv update_slots: decoding batch, n_tokens = 281
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv stop: all tasks already finished, no need to cancel
srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"role":"assistant","content":null}}],"created":1771336710,"id":"chatcmpl-fID4B9TJJax4axVgdkGnkcwZa2ArgLP5","model":"minimax-m2","system_fingerprint":"b8069-d5dfc3302","object":"chat.completion.chunk","timings":{"cache_n":0,"prompt_n":0,"prompt_ms":0.0,"prompt_per_token_ms":null,"prompt_per_second":null,"predicted_n":0,"predicted_ms":4.6518922289705e-310,"predicted_per_token_ms":null,"predicted_per_second":null},"prompt_progress":{"total":281,"cache":0,"processed":0,"time_ms":0}}
/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2351: GGML_ASSERT(ids_to_sorted_host.size() == size_t(ne_get_rows)) failed
llama-server(+0x62579f)[0x55a21409b79f]
llama-server(+0x73130a)[0x55a2141a730a]
llama-server(+0x731a13)[0x55a2141a7a13]
llama-server(+0x7364f7)[0x55a2141ac4f7]
llama-server(+0x738d19)[0x55a2141aed19]
llama-server(+0x632d46)[0x55a2140a8d46]
llama-server(+0x432e53)[0x55a213ea8e53]
llama-server(+0x43325a)[0x55a213ea925a]
llama-server(+0x436aca)[0x55a213eacaca]
llama-server(+0x1cf153)[0x55a213c45153]
llama-server(+0x222240)[0x55a213c98240]
llama-server(+0xfd94b)[0x55a213b7394b]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x7adf4099a1ca]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x7adf4099a28b]
llama-server(+0x16ee05)[0x55a213be4e05]
/mnt/data/models/Qwen3.5-397B-A17B-GGUF/start-llama-q4kxl: line 12: 139 Aborted (core dumped) llama-server --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf --verbose --host 0.0.0.0 --alias minimax-m2 --ctx-size 32768 --parallel 1 --temp 1.0 --top-p 0.95 --top-k 40 --jinja
I had the same problem when using --fit on, but when I distributed everything using -ot, it started working smoothly.
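For anyone who wants to try the same thing, here is a rough sketch of that kind of manual split. The -ot (--override-tensor) flag takes regex=buffer-type pairs; the layer ranges and CUDA device indices below are placeholders rather than my actual mapping, so adjust them to the model's block count and each GPU's free memory (MoE expert tensors in these GGUFs are typically named blk.N.ffn_*_exps):

llama-server \
  --model /mnt/data/models/Qwen3.5-397B-A17B-GGUF/Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf \
  --ctx-size 32768 \
  --flash-attn on \
  --jinja \
  --n-gpu-layers 999 \
  -ot "blk\.([0-9]|1[0-9]|2[0-9])\.ffn_.*_exps.*=CUDA0" \
  -ot "blk\.(3[0-9]|4[0-9]|5[0-9])\.ffn_.*_exps.*=CUDA1" \
  -ot "blk\.(6[0-9]|7[0-9]|8[0-9]|9[0-9])\.ffn_.*_exps.*=CUDA2"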
Issue 19116 seems to be specific to --cache-reuse, but I'm not using that. I've given up on the model for now. I'll come back to it once it's working more reliably, or once I get my 4th GPU and switch to a different server, not llama.cpp.
I'm using RTX 6000 Pro (but 2 of them) + 2x RTX 5090. After a fresh fetch and build, the command below started working as it should (I had similar problems up until yesterday); the rebuild steps are sketched after the command.
OLLAMA_NUM_PARALLEL=1 CUDA_VISIBLE_DEVICES=0,1,2,3 \
./llama.cpp/build/bin/llama-server \
--model ~/.cache/llama.cpp/unsloth_Qwen3.5-397B-A17B-GGUF_UD-Q4_K_XL_Qwen3.5-397B-A17B-UD-Q4_K_XL-00001-of-00006.gguf \
--chat-template-file ~/qwen35-chat-template.jinja \
--main-gpu 0 \
--threads 60 \
--n-gpu-layers 999 \
--fit on \
--ctx-size 262144 \
--jinja \
--flash-attn on \
--no-mmap \
--mlock \
--cont-batching \
--temp 0.6 \
--top_p 0.95 \
--top_k 40 \
--min_p 0.05 \
--no-context-shift \
--cache-ram -1 \
--host 0.0.0.0 \
--port 11110
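(For reference, the fresh fetch and build is just the standard llama.cpp CUDA rebuild; a minimal sketch, assuming an existing clone and a working CUDA toolchain:)

# from the existing clone: pull the latest commits and rebuild with CUDA enabled
cd llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j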
Indeed @AImhotep, I didn't re-download the file, just changed to the latest llama.cpp, and the core dump doesn't appear to happen anymore. Before, when I ran the model, the core dump was immediate. I've picked up an opencode task in progress and switched to this model running in llama.cpp. Tool calls seem to be working fine, and I'm not noticing any obvious errors in the output or logs. I'm getting 65 tokens per second generation with about 30K of context in the task. It seems pretty stable now, so I'm going to take on some bigger tasks with lots of tool calls.