testing

#5
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090

IQ5_K:

Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40445.11 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 41176.97 MiB
llm_load_tensors: CPU buffer size = 43433.33 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 40927.67 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 40927.67 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 40927.67 MiB
llm_load_tensors: CPU buffer size = 33281.02 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CPU
... (blk.1 through blk.59: identical lines) ...
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CPU
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
ggml_cuda_host_malloc: failed to allocate 2852.76 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_kv_cache_init: CPU KV buffer size = 2852.76 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
ggml_cuda_host_malloc: failed to allocate 0.49 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_cuda_host_malloc: failed to allocate 5827.02 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_new_context_with_model: CUDA_Host compute buffer size = 5827.02 MiB
llama_new_context_with_model: graph nodes = 24166
llama_new_context_with_model: graph splits = 1

main: n_kv_max = 80128, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

======================================= HAVE_FANCY_SIMD is defined

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|--------:|---------:|--------:|---------:|
| 4090 | 1022 |     0 |  62.111 |    65.85 | 180.071 |     5.68 |
| 4090 | 1022 |  4090 |  86.984 |    47.02 | 313.001 |     3.27 |
| 4090 | 1022 |  8180 | 122.579 |    33.37 | 311.727 |     3.28 |
| 4090 | 1022 | 12270 | 135.382 |    30.21 | 388.599 |     2.63 |

@shewin

Thanks for the report. I wonder, though, how you're running it: this IQ5_K model is ~499GB, and looking closely at the debug log it seems to be putting the KV cache onto CPU, which would slow you down a lot:

ggml_cuda_host_malloc: failed to allocate 2852.76 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_kv_cache_init: CPU KV buffer size = 2852.76 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
ggml_cuda_host_malloc: failed to allocate 0.49 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_cuda_host_malloc: failed to allocate 5827.02 MiB of pinned memory: forward compatibility was attempted on non supported HW

It might be possible to fit this into 512GB RAM + 32GB VRAM, but it will be cutting it tight: it would likely require at least one routed-expert layer offloaded to GPU, and possibly running headless with no browser open, etc.
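
For anyone sanity-checking that claim, the back-of-envelope arithmetic looks roughly like this (a sketch using only the sizes quoted in this thread; real headroom depends on the OS, mmap behaviour, and exactly which tensors you offload):

```python
# Rough fit check for the ~499GB IQ5_K on 512GB RAM + 32GB VRAM.
# All figures come from this thread's logs; nothing here is exact.

GIB = 1024**3

model_bytes   = 499 * GIB          # IQ5_K weights, as noted above
ram_bytes     = 512 * GIB          # system RAM
kv_bytes      = 2852.76 * 1024**2  # q8_0 KV cache at ~80k context (from the log)
compute_bytes = 5827.02 * 1024**2  # compute buffer at -ub 4090 (from the log)

# If the weights stayed fully CPU-resident, they alone leave ~13GiB headroom,
# which is why at least some expert layers need to go to the GPU:
ram_headroom_gib = (ram_bytes - model_bytes) / GIB

# Meanwhile the GPU needs room for at least the KV cache + compute buffer:
vram_needed_gib = (kv_bytes + compute_bytes) / GIB

print(f"RAM headroom if fully CPU-resident: {ram_headroom_gib:.1f} GiB")
print(f"VRAM needed just for KV + compute:  {vram_needed_gib:.1f} GiB")
```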

The other option is to go with a smaller batch size, e.g. `-ub 1024` or the default 512, to free up some VRAM so you can offload a bit more. And if you can use `--no-mmap`, that would ensure you're not running mmap() off of your disk, which would slow things down.

I just made an intermediate quant that will fit nicely into 512GB RAM + a single GPU's VRAM, with basically the same perplexity as the IQ5_K and probably a little faster. Uploading now.

## smol-IQ5_KS 417.107 GiB (5.339 BPW)
Final estimate: PPL = 3.4059 +/- 0.01996
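
As a cross-check on those two figures, the size and BPW together imply the parameter count (this is a derivation from the numbers above, not a figure stated anywhere in this thread):

```python
# BPW (bits per weight) = total bits / parameter count, so the implied
# parameter count falls straight out of the "417.107 GiB (5.339 BPW)" line.

size_gib = 417.107
bpw = 5.339

total_bits = size_gib * 1024**3 * 8
params = total_bits / bpw

print(f"Implied parameter count: {params / 1e9:.0f}B")  # ~671B
```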

-b 2048 -ub 2048

IQ5_K:

Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40445.11 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 41176.97 MiB
llm_load_tensors: CPU buffer size = 731.86 MiB
llm_load_tensors: CUDA0 buffer size = 16499.89 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
... (blk.1 through blk.59: identical lines) ...
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 2852.79 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 7832.38 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 682.02 MiB
llama_new_context_with_model: graph nodes = 24227
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 80128, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|-----:|-------:|---------:|-------:|---------:|
| 2048 |  512 |    0 | 48.498 |    42.23 | 62.342 |     8.21 |
| 2048 |  512 | 2048 | 49.168 |    41.65 | 51.926 |     9.86 |
| 2048 |  512 | 4096 | 49.464 |    41.40 | 50.368 |    10.17 |
| 2048 |  512 | 6144 | 49.719 |    41.19 | 57.228 |     8.95 |
| 2048 |  512 | 8192 | 50.017 |    40.95 | 59.888 |     8.55 |

IQ4_K:

Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40018.33 MiB
llm_load_tensors: CPU buffer size = 44498.38 MiB
llm_load_tensors: CPU buffer size = 42242.79 MiB
llm_load_tensors: CPU buffer size = 44706.79 MiB
llm_load_tensors: CPU buffer size = 42242.79 MiB
llm_load_tensors: CPU buffer size = 44706.79 MiB
llm_load_tensors: CPU buffer size = 42242.79 MiB
llm_load_tensors: CPU buffer size = 44706.79 MiB
llm_load_tensors: CPU buffer size = 42974.65 MiB
llm_load_tensors: CPU buffer size = 497.11 MiB
llm_load_tensors: CUDA0 buffer size = 14399.89 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
... (blk.1 through blk.59: identical lines) ...
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 100096
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3563.70 MiB
llama_new_context_with_model: KV self size = 3563.67 MiB, c^KV (q8_0): 3563.67 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11098.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1675.88 MiB
llama_new_context_with_model: graph nodes = 24227
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 100096, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4090 | 1022 |     0 | 41.690 |    98.10 | 107.637 |     9.49 |
| 4090 | 1022 |  4090 | 42.840 |    95.47 | 101.844 |    10.03 |
| 4090 | 1022 |  8180 | 43.460 |    94.11 | 109.647 |     9.32 |
| 4090 | 1022 | 12270 | 45.039 |    90.81 | 110.518 |     9.25 |
| 4090 | 1022 | 16360 | 45.955 |    89.00 | 111.980 |     9.13 |

This model has a problem:
OpenAI completion error: 500 Failed to parse messages: Unsupported content part type: "image_url"

@shewin

I'm not sure what you expect, as this model is not multimodal, right? It works on text input and does not tokenize images.

image_url is specific to the openai api endpoint: https://github.com/openai/openai-python?tab=readme-ov-file#vision

Are you running with --jinja, and what jinja template are you using (the builtin or your own)? See this comment for more discussion, as it's easy to confuse the legacy API endpoints, the OpenAI chat completions endpoint, and the OpenAI responses endpoint: https://github.com/openai/openai-python?tab=readme-ov-file#vision

Finally, I see that ik is adding some new PRs to support multimodal stuff; if you want your LLM to look at an image, it probably needs multimodal support. So keep that in mind.

Regardless of jinja, when running roocode in a vscode environment, the following error occurs:
OpenAI completion error: 500 Failed to parse messages: Unsupported content part type: "image_url"
Retry attempt 1
Retrying now...

INFO [ log_server_request] request | tid="125075898470400" timestamp=1758985885 remote_addr="127.0.0.1" remote_port=59248 status=500 method="POST" path="/chat/completions" params={}
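
Since the server here only accepts plain-text content parts, one client-side workaround is to drop the image_url parts before the request goes out. A minimal sketch of that idea (the `strip_image_parts` helper is hypothetical, not part of roocode or llama.cpp):

```python
# Hypothetical shim: remove OpenAI-style "image_url" content parts so a
# text-only server doesn't 500 with "Unsupported content part type".

def strip_image_parts(messages):
    """Drop image_url parts; flatten list-style content back to a string."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            text_parts = [p["text"] for p in content if p.get("type") == "text"]
            msg = {**msg, "content": "\n".join(text_parts)}
        cleaned.append(msg)
    return cleaned

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]},
]
print(strip_image_parts(messages))
# [{'role': 'user', 'content': 'What is in this image?'}]
```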

smol-IQ5_KS:

llama_new_context_with_model: n_ctx = 90112
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3208.25 MiB
llama_new_context_with_model: KV self size = 3208.22 MiB, c^KV (q8_0): 3208.22 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 8744.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 760.02 MiB
llama_new_context_with_model: graph nodes = 24227
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 90112, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|-----:|-------:|---------:|-------:|---------:|
| 2048 |  512 |    0 | 42.965 |    47.67 | 53.676 |     9.54 |
| 2048 |  512 | 2048 | 43.221 |    47.38 | 50.273 |    10.18 |
| 2048 |  512 | 4096 | 43.995 |    46.55 | 46.646 |    10.98 |
| 2048 |  512 | 6144 | 44.267 |    46.26 | 46.366 |    11.04 |
| 2048 |  512 | 8192 | 44.573 |    45.95 | 46.669 |    10.97 |
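
For readers decoding the sweep-bench columns: the S_PP and S_TG throughput figures are simply token counts over wall time, e.g. for the first row above:

```python
# sweep-bench throughput columns are derived: S = tokens / seconds.
# First row of the smol-IQ5_KS run (PP=2048, TG=512):
pp, tg = 2048, 512
t_pp, t_tg = 42.965, 53.676

print(f"S_PP = {pp / t_pp:.2f} t/s")  # 47.67
print(f"S_TG = {tg / t_tg:.2f} t/s")  # 9.54
```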

roocode in vscode environment
"POST" path="/chat/completions"

This seems like a problem with your roocode client configuration: https://docs.roocode.com/features/api-configuration-profiles/

You will need to configure roocode to use the correct API endpoint, e.g. /v1/chat/completions or something similar under /v1/*. Or maybe use lmstudio, ollama, or llama.cpp as the provider; I haven't used it myself.
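
Concretely, an OpenAI-compatible client config should resolve to something like the following (host and port here are assumptions for illustration; point them at wherever your server is actually listening):

```python
# What a generic OpenAI-compatible client setup boils down to: a base URL
# ending in /v1 plus the chat-completions path, with text-only content.

BASE_URL = "http://127.0.0.1:8080/v1"   # assumed local server address
ENDPOINT = "chat/completions"

url = f"{BASE_URL.rstrip('/')}/{ENDPOINT}"

payload = {
    "model": "local",  # llama.cpp-style servers typically ignore this field
    "messages": [{"role": "user", "content": "hello"}],  # plain string, no image parts
    "stream": False,
}

print(url)  # http://127.0.0.1:8080/v1/chat/completions
```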

I don't know why your client is trying to use image_url, as that is only supported by the real OpenAI endpoint or a multimodal-specific model.


smol-IQ5_KS:

The smol-IQ5_KS seems a little faster for TG, which is nice! Do you OOM on VRAM with -ub 4096 -b 4096, though? It looks like you're still using -ub 2048.

I think it was fixed by unchecking image support.


That sounds likely, thanks for letting folks know! I believe ik recently added some support for multimodal models, so you might be able to try that feature with one of those. I've not tried it myself yet.
