Surprisingly good coder, I'm impressed.
I've tested more than a hundred offline local models over the last year, and this model, which is a mixture of models, is the best I've tested so far. It does remarkable coding work, better than some of the big online commercial models.
Well done.
Anyone else share my experience with it?
Strangely enough, I can use the Q6_K with an RTX 5080 and 16GB VRAM at 30 t/s, and it stays that way even in long contexts. With the correct agents.md, it performs wonderfully. Congratulations to the creators! Perhaps there are some unnecessary parameters in my settings.
.\llama-server.exe -m C:\models\Qwopus3.6-35B-A3B-v1-Q6_K\Qwopus3.6-35B-A3B-v1-Q6_K.gguf `
  --n-gpu-layers 999 --n-cpu-moe 25 `
  --no-mmap --flash-attn on `
  --batch-size 2048 --ubatch-size 1024 `
  --ctx-size 131072 --cache-type-k q4_0 `
  --cache-type-v q4_0 --threads 24 `
  --threads-batch 24 --prio 2 `
  --host 0.0.0.0 --cache-reuse 0 `
  --alias Qwen3.6-35B-A3B --metrics `
  --port 8080
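For anyone reproducing this: since --metrics is enabled, the server also exposes Prometheus-style counters you can use to confirm the t/s numbers yourself, alongside the standard health check (port 8080 as in the command above):

curl http://localhost:8080/health
curl http://localhost:8080/metrics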
Thank you both for the amazing feedback!
It's also fantastic to see it running so smoothly with long contexts! And thanks for sharing your llama-server run command; that kind of configuration sharing is incredibly helpful for the rest of the community to get the best performance.
❤️ Thanks again for your support and happy coding!
wtf
could you at least try a bit to de-slopify your very warm messages next time?
I've been testing Qwopus 35B (IQ4_XS) on my RTX 5060 Ti 8GB and 64GB DDR4 (4x16GB @ 2100MHz). The model's logic and output quality are surprisingly strong for this setup, and I'm consistently hitting 30-35 t/s. However, I'm running into some repetition loops. Given the hardware constraints, specifically the 8GB VRAM and the low 2100MHz RAM frequency, are there any llama-server optimizations that fix the looping while maintaining this level of performance? Current command:
cd C:\Users\User\Documents\Informatik\Ki\llama-turbo\build\bin\Release
.\llama-server.exe -m "C:/Users/User/.lmstudio/models/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF/Qwopus3.6-35B-A3B-v1-IQ4_XS.gguf" `
  -ngl 999 --n-cpu-moe 35 `
  --no-mmap --mlock `
  --cache-type-k q8_0 --cache-type-v q8_0 `
  -c 65536 --host 0.0.0.0 `
  --port 8080 `
  --metrics
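In case it helps others hitting the same issue, these are the standard llama-server sampling flags I plan to try next against the looping, appended to the command above (the values are untuned starting points, not recommendations):

--repeat-penalty 1.1 --repeat-last-n 256 `
--dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2

The DRY sampler in particular penalizes verbatim sequence repetition, which sounds like what I'm seeing.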
I've tried this:
llama-server \
  --port ${PORT} \
  --model /models/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF/Qwopus3.6-35B-A3B-v1-IQ4_XS.gguf \
  --mmproj /models/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF/qwopus3.6-35b-a3b-mmproj.gguf \
  --n-gpu-layers ${n_gpu_layers_max} \
  --n-cpu-moe 18 \
  --ctx-size ${ctx_size_128k} \
  -ctk q4_0 \
  -ctv q4_0 \
  --flash-attn on \
  --host ${host}
and get 79 t/s on a 4070 Ti Super 16G.
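For anyone who doesn't use my wrapper script, here is the same command with the placeholders filled in (example values only: 131072 matches the 128k the variable name implies, and port/host mirror the rest of this thread):

llama-server \
  --port 8080 \
  --model /models/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF/Qwopus3.6-35B-A3B-v1-IQ4_XS.gguf \
  --mmproj /models/Jackrong/Qwopus3.6-35B-A3B-v1-GGUF/qwopus3.6-35b-a3b-mmproj.gguf \
  --n-gpu-layers 999 \
  --n-cpu-moe 18 \
  --ctx-size 131072 \
  -ctk q4_0 -ctv q4_0 \
  --flash-attn on \
  --host 0.0.0.0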
That's amazing!
"--n-cpu-moe 18 -ctk q4_0 -ctv q4_0" helps a lot.
My advice is to strengthen agents.md: write prompts that prevent hallucinations, loops, and lengthy reasoning. In my experience, 30-40 t/s isn't bad for development; the real time loss comes from the areas I mentioned.
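To make that concrete, here is a rough sketch of the kind of agents.md guardrails I mean (illustrative only, not my exact file):

## Output discipline
- Lead with code; keep explanations to a few sentences.
- Never repeat a previous answer verbatim; if the same fix fails twice, stop and summarize the blocker instead of retrying.
- If unsure an API or function exists, say so rather than inventing it.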