Testing Q3_K

by shewin - opened Feb 20

Feb 20

W790E Sage + QYFS + 512G + RTX5090

Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 61/61 layers to GPU
llm_load_tensors: CPU buffer size = 174720.00 MiB
llm_load_tensors: CUDA_Host buffer size = 545.62 MiB
llm_load_tensors: CUDA0 buffer size = 9024.70 MiB
...................................................................................................~ggml_backend_cuda_context: have 0 graphs
.
===================================== llama_init_from_model: f16
llama_init_from_model: n_ctx = 200192
llama_init_from_model: n_batch = 4096
llama_init_from_model: n_ubatch = 4096
llama_init_from_model: flash_attn = 1
llama_init_from_model: attn_max_b = 0
llama_init_from_model: fused_moe = 1
llama_init_from_model: grouped er = 0
llama_init_from_model: fused_up_gate = 1
llama_init_from_model: fused_mmad = 1
llama_init_from_model: rope_cache = 0
llama_init_from_model: graph_reuse = 1
llama_init_from_model: k_cache_hadam = 0
llama_init_from_model: split_mode_graph_scheduling = 0
llama_init_from_model: reduce_type = f16
llama_init_from_model: sched_async = 0
llama_init_from_model: ser = -1, 0
llama_init_from_model: freq_base = 10000000.0
llama_init_from_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3302.11 MiB
llama_init_from_model: KV self size = 3115.78 MiB, K (q8_0): 1557.89 MiB, V (q8_0): 1557.89 MiB
llama_init_from_model: CUDA_Host output buffer size = 0.95 MiB
llama_init_from_model: CUDA0 compute buffer size = 4268.23 MiB
llama_init_from_model: CUDA_Host compute buffer size = 1628.16 MiB
llama_init_from_model: graph nodes = 75685
llama_init_from_model: graph splits = 122
llama_init_from_model: enabling only_active_experts scheduling
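
For anyone checking how this hybrid split uses memory, the buffer sizes reported in the log above can be added up per device. This is only a rough sketch: the dictionary labels are descriptive names chosen here, not log fields, and the totals cover only the buffers the log prints, so real usage is likely somewhat higher.

```python
# Buffer sizes in MiB, copied from the load log above.
# The labels are descriptive names chosen for this sketch, not log fields.
GPU_BUFFERS_MIB = {
    "CUDA0 model buffer": 9024.70,
    "CUDA0 KV buffer": 3302.11,
    "CUDA0 compute buffer": 4268.23,
}
HOST_BUFFERS_MIB = {
    "CPU model buffer": 174720.00,
    "CUDA_Host model buffer": 545.62,
    "CUDA_Host output buffer": 0.95,
    "CUDA_Host compute buffer": 1628.16,
}


def total_gib(buffers_mib):
    """Sum a dict of MiB sizes and convert the total to GiB."""
    return sum(buffers_mib.values()) / 1024.0


if __name__ == "__main__":
    print(f"GPU  (CUDA0): {total_gib(GPU_BUFFERS_MIB):.2f} GiB")
    print(f"Host (CPU + CUDA_Host): {total_gib(HOST_BUFFERS_MIB):.2f} GiB")
```

By this tally the reported buffers come to about 16.2 GiB on CUDA0 and about 172.7 GiB on the host side.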

main: n_kv_max = 200192, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

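| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0 | 16.329 | 250.85 | 34.402 | 29.77 |
| 4096 | 1024 | 4096 | 27.218 | 150.49 | 34.449 | 29.73 |
| 4096 | 1024 | 8192 | 27.261 | 150.25 | 42.082 | 24.33 |
| 4096 | 1024 | 12288 | 27.432 | 149.32 | 42.441 | 24.13 |
| 4096 | 1024 | 16384 | 27.345 | 149.79 | 42.555 | 24.06 |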
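
The speed columns in the table above appear to be tokens divided by elapsed time (S_PP = PP / T_PP, S_TG = TG / T_TG). The sketch below recomputes them from the timing columns as a sanity check. The function and variable names are made up for illustration, and small differences in the last digit are expected because the printed times are rounded.

```python
# Rows copied from the benchmark table above:
# (PP, TG, N_KV, T_PP s, S_PP t/s, T_TG s, S_TG t/s)
ROWS = [
    (4096, 1024, 0, 16.329, 250.85, 34.402, 29.77),
    (4096, 1024, 4096, 27.218, 150.49, 34.449, 29.73),
    (4096, 1024, 8192, 27.261, 150.25, 42.082, 24.33),
    (4096, 1024, 12288, 27.432, 149.32, 42.441, 24.13),
    (4096, 1024, 16384, 27.345, 149.79, 42.555, 24.06),
]


def tokens_per_second(tokens, seconds):
    """Throughput as tokens processed divided by elapsed seconds."""
    return tokens / seconds


def recompute(rows):
    """Return (N_KV, recomputed S_PP, recomputed S_TG) for each row."""
    return [
        (n_kv, tokens_per_second(pp, t_pp), tokens_per_second(tg, t_tg))
        for pp, tg, n_kv, t_pp, _s_pp, t_tg, _s_tg in rows
    ]


if __name__ == "__main__":
    for (n_kv, s_pp, s_tg), row in zip(recompute(ROWS), ROWS):
        print(f"N_KV={n_kv:>6}  S_PP={s_pp:7.2f} (reported {row[4]})  "
              f"S_TG={s_tg:6.2f} (reported {row[6]})")
```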

shewin

Feb 20

git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
git checkout ik/qwen35moe

shewin

Feb 20

But it failed with the --mmproj option.

ubergarm

Owner Feb 20

@shewin

Wow, that looks pretty fast! I was amazed at how good the quality of that q3_K is... I'm guessing that if you want --mmproj for now, you'd have to use mainline and the autoparser branch, but it will likely be slower in this hybrid CPU+GPU configuration.

Hopefully ik will get mmproj support after the qwen35moe branch is merged!
