testing

#5
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090

IQ5_K:

Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40445.11 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 41176.97 MiB
llm_load_tensors: CPU buffer size = 43433.33 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 40927.67 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 40927.67 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 32790.39 MiB
llm_load_tensors: CPU buffer size = 40927.67 MiB
llm_load_tensors: CPU buffer size = 33281.02 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CPU
... (blk.1 through blk.59: identical lines) ...
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CPU
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
ggml_cuda_host_malloc: failed to allocate 2852.76 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_kv_cache_init: CPU KV buffer size = 2852.76 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
ggml_cuda_host_malloc: failed to allocate 0.49 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_cuda_host_malloc: failed to allocate 5827.02 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_new_context_with_model: CUDA_Host compute buffer size = 5827.02 MiB
llama_new_context_with_model: graph nodes = 24166
llama_new_context_with_model: graph splits = 1

main: n_kv_max = 80128, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

======================================= HAVE_FANCY_SIMD is defined

|   PP |   TG |  N_KV |  T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|--------:|---------:|--------:|---------:|
| 4090 | 1022 |     0 |  62.111 |    65.85 | 180.071 |     5.68 |
| 4090 | 1022 |  4090 |  86.984 |    47.02 | 313.001 |     3.27 |
| 4090 | 1022 |  8180 | 122.579 |    33.37 | 311.727 |     3.28 |
| 4090 | 1022 | 12270 | 135.382 |    30.21 | 388.599 |     2.63 |

@shewin

Thanks for the report. I wonder, though, how you're running it: this IQ5_K model is ~499GB, and looking closely at the debug log it seems to be putting the KV cache onto CPU, which would slow you down a lot:

ggml_cuda_host_malloc: failed to allocate 2852.76 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_kv_cache_init: CPU KV buffer size = 2852.76 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
ggml_cuda_host_malloc: failed to allocate 0.49 MiB of pinned memory: forward compatibility was attempted on non supported HW
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
ggml_cuda_host_malloc: failed to allocate 5827.02 MiB of pinned memory: forward compatibility was attempted on non supported HW

It might be possible to fit this into 512GB RAM + 32GB VRAM, but it will be cutting it tight: it would likely require at least one routed-expert layer offloaded to GPU, and possibly running headless with no browser open, etc.
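
For anyone sanity-checking that claim, the back-of-envelope arithmetic looks roughly like this (a sketch using only the sizes quoted in this thread; real headroom depends on the OS, mmap behaviour, and exactly which tensors you offload):

```python
# Rough fit check for the ~499GB IQ5_K on 512GB RAM + 32GB VRAM.
# All figures come from this thread's logs; nothing here is exact.

GIB = 1024**3

model_bytes   = 499 * GIB          # IQ5_K weights, as noted above
ram_bytes     = 512 * GIB          # system RAM
kv_bytes      = 2852.76 * 1024**2  # q8_0 KV cache at ~80k context (from the log)
compute_bytes = 5827.02 * 1024**2  # compute buffer at -ub 4090 (from the log)

# If the weights stayed fully CPU-resident, they alone leave ~13GiB headroom,
# which is why at least some expert layers need to go to the GPU:
ram_headroom_gib = (ram_bytes - model_bytes) / GIB

# Meanwhile the GPU needs room for at least the KV cache + compute buffer:
vram_needed_gib = (kv_bytes + compute_bytes) / GIB

print(f"RAM headroom if fully CPU-resident: {ram_headroom_gib:.1f} GiB")
print(f"VRAM needed just for KV + compute:  {vram_needed_gib:.1f} GiB")
```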

The other option is to go with a smaller batch size, e.g. `-ub 1024` or the default 512, to free up some VRAM so you can offload a bit more. And if you can use `--no-mmap`, that would ensure you're not running mmap() off of your disk, which would slow things down.

I just made an intermediate quant that will fit nicely into 512GB RAM + a single GPU's VRAM, with basically the same perplexity as the IQ5_K and probably a little faster. Uploading now.

## smol-IQ5_KS 417.107 GiB (5.339 BPW)
Final estimate: PPL = 3.4059 +/- 0.01996
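
As a cross-check on those two figures, the size and BPW together imply the parameter count (this is a derivation from the numbers above, not a figure stated anywhere in this thread):

```python
# BPW (bits per weight) = total bits / parameter count, so the implied
# parameter count falls straight out of the "417.107 GiB (5.339 BPW)" line.

size_gib = 417.107
bpw = 5.339

total_bits = size_gib * 1024**3 * 8
params = total_bits / bpw

print(f"Implied parameter count: {params / 1e9:.0f}B")  # ~671B
```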

-b 2048 -ub 2048

IQ5_K:

Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40445.11 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43654.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 43150.39 MiB
llm_load_tensors: CPU buffer size = 41176.97 MiB
llm_load_tensors: CPU buffer size = 731.86 MiB
llm_load_tensors: CUDA0 buffer size = 16499.89 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
... (blk.1 through blk.59: identical lines) ...
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 80128
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 2852.79 MiB
llama_new_context_with_model: KV self size = 2852.76 MiB, c^KV (q8_0): 2852.76 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 7832.38 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 682.02 MiB
llama_new_context_with_model: graph nodes = 24227
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 80128, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|-----:|-------:|---------:|-------:|---------:|
| 2048 |  512 |    0 | 48.498 |    42.23 | 62.342 |     8.21 |
| 2048 |  512 | 2048 | 49.168 |    41.65 | 51.926 |     9.86 |
| 2048 |  512 | 4096 | 49.464 |    41.40 | 50.368 |    10.17 |
| 2048 |  512 | 6144 | 49.719 |    41.19 | 57.228 |     8.95 |
| 2048 |  512 | 8192 | 50.017 |    40.95 | 59.888 |     8.55 |

IQ4_K:

Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 40018.33 MiB
llm_load_tensors: CPU buffer size = 44498.38 MiB
llm_load_tensors: CPU buffer size = 42242.79 MiB
llm_load_tensors: CPU buffer size = 44706.79 MiB
llm_load_tensors: CPU buffer size = 42242.79 MiB
llm_load_tensors: CPU buffer size = 44706.79 MiB
llm_load_tensors: CPU buffer size = 42242.79 MiB
llm_load_tensors: CPU buffer size = 44706.79 MiB
llm_load_tensors: CPU buffer size = 42974.65 MiB
llm_load_tensors: CPU buffer size = 497.11 MiB
llm_load_tensors: CUDA0 buffer size = 14399.89 MiB
....................................................................................................
============ llm_prepare_mla: need to compute 61 wkv_b tensors
Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
... (blk.1 through blk.59: identical lines) ...
Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
llama_new_context_with_model: n_ctx = 100096
llama_new_context_with_model: n_batch = 4090
llama_new_context_with_model: n_ubatch = 4090
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3563.70 MiB
llama_new_context_with_model: KV self size = 3563.67 MiB, c^KV (q8_0): 3563.67 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 11098.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1675.88 MiB
llama_new_context_with_model: graph nodes = 24227
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 100096, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG |  N_KV | T_PP s | S_PP t/s |  T_TG s | S_TG t/s |
|-----:|-----:|------:|-------:|---------:|--------:|---------:|
| 4090 | 1022 |     0 | 41.690 |    98.10 | 107.637 |     9.49 |
| 4090 | 1022 |  4090 | 42.840 |    95.47 | 101.844 |    10.03 |
| 4090 | 1022 |  8180 | 43.460 |    94.11 | 109.647 |     9.32 |
| 4090 | 1022 | 12270 | 45.039 |    90.81 | 110.518 |     9.25 |
| 4090 | 1022 | 16360 | 45.955 |    89.00 | 111.980 |     9.13 |

This model has a problem:
OpenAI completion error: 500 Failed to parse messages: Unsupported content part type: "image_url"

@shewin

I'm not sure what you expect, as this model is not multimodal, right? It works on text input and does not tokenize images.

image_url is specific to the openai api endpoint: https://github.com/openai/openai-python?tab=readme-ov-file#vision

Are you running with --jinja, and what jinja template are you using (the builtin or your own)? See this comment for more discussion, as it's easy to confuse the legacy API endpoints, the OpenAI chat completions endpoint, and the OpenAI responses endpoint: https://github.com/openai/openai-python?tab=readme-ov-file#vision

Finally, I see that ik is adding some new PRs to support multimodal stuff; if you want your LLM to look at an image, it probably needs multimodal support. So keep that in mind.

Regardless of jinja, when running roocode in a vscode environment, the following error occurs:
OpenAI completion error: 500 Failed to parse messages: Unsupported content part type: "image_url"
Retry attempt 1
Retrying now...

INFO [ log_server_request] request | tid="125075898470400" timestamp=1758985885 remote_addr="127.0.0.1" remote_port=59248 status=500 method="POST" path="/chat/completions" params={}
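
Since the server here only accepts plain-text content parts, one client-side workaround is to drop the image_url parts before the request goes out. A minimal sketch of that idea (the `strip_image_parts` helper is hypothetical, not part of roocode or llama.cpp):

```python
# Hypothetical shim: remove OpenAI-style "image_url" content parts so a
# text-only server doesn't 500 with "Unsupported content part type".

def strip_image_parts(messages):
    """Drop image_url parts; flatten list-style content back to a string."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            text_parts = [p["text"] for p in content if p.get("type") == "text"]
            msg = {**msg, "content": "\n".join(text_parts)}
        cleaned.append(msg)
    return cleaned

messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]},
]
print(strip_image_parts(messages))
# [{'role': 'user', 'content': 'What is in this image?'}]
```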

smol-IQ5_KS:

llama_new_context_with_model: n_ctx = 90112
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 2048
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: fused_up_gate = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3208.25 MiB
llama_new_context_with_model: KV self size = 3208.22 MiB, c^KV (q8_0): 3208.22 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 8744.01 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 760.02 MiB
llama_new_context_with_model: graph nodes = 24227
llama_new_context_with_model: graph splits = 118

main: n_kv_max = 90112, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

|   PP |   TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----:|-----:|-----:|-------:|---------:|-------:|---------:|
| 2048 |  512 |    0 | 42.965 |    47.67 | 53.676 |     9.54 |
| 2048 |  512 | 2048 | 43.221 |    47.38 | 50.273 |    10.18 |
| 2048 |  512 | 4096 | 43.995 |    46.55 | 46.646 |    10.98 |
| 2048 |  512 | 6144 | 44.267 |    46.26 | 46.366 |    11.04 |
| 2048 |  512 | 8192 | 44.573 |    45.95 | 46.669 |    10.97 |
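
For readers decoding the sweep-bench columns: the S_PP and S_TG throughput figures are simply token counts over wall time, e.g. for the first row above:

```python
# sweep-bench throughput columns are derived: S = tokens / seconds.
# First row of the smol-IQ5_KS run (PP=2048, TG=512):
pp, tg = 2048, 512
t_pp, t_tg = 42.965, 53.676

print(f"S_PP = {pp / t_pp:.2f} t/s")  # 47.67
print(f"S_TG = {tg / t_tg:.2f} t/s")  # 9.54
```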

roocode in vscode environment
"POST" path="/chat/completions"

This seems like a problem with your roocode client configuration: https://docs.roocode.com/features/api-configuration-profiles/

You will need to configure roocode to use the correct API endpoint, e.g. /v1/chat/completions or something similar under /v1/*. Or maybe use lmstudio, ollama, or llama.cpp as the provider; I haven't used it myself.
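
Concretely, an OpenAI-compatible client config should resolve to something like the following (host and port here are assumptions for illustration; point them at wherever your server is actually listening):

```python
# What a generic OpenAI-compatible client setup boils down to: a base URL
# ending in /v1 plus the chat-completions path, with text-only content.

BASE_URL = "http://127.0.0.1:8080/v1"   # assumed local server address
ENDPOINT = "chat/completions"

url = f"{BASE_URL.rstrip('/')}/{ENDPOINT}"

payload = {
    "model": "local",  # llama.cpp-style servers typically ignore this field
    "messages": [{"role": "user", "content": "hello"}],  # plain string, no image parts
    "stream": False,
}

print(url)  # http://127.0.0.1:8080/v1/chat/completions
```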

I don't know why your client is trying to use image_url, as that is only supported by the real OpenAI endpoint or a multimodal-specific model.


smol-IQ5_KS:

The smol-IQ5_KS seems a little faster for TG, which is nice! Do you OOM on VRAM with -ub 4096 -b 4096, though? It looks like you're still using -ub 2048.

I think it was fixed by unchecking image support.


That sounds likely, thanks for letting folks know! I believe ik recently added some support for multimodal models, so you might be able to try that feature with one of those. I've not tried it myself yet.
