AMD Issue?

#5
by knarp - opened

When running this GGUF model (or any GLM-4 GGUF models found on HF) in Windows LM Studio using the Vulkan runtime (CPU runtime is working properly), it answers all questions with 'GGGGGGGGGGGGGGGGG'. When I press stop, LM Studio gives this error:

Failed to regenerate message
Unexpected empty grammar stack after accepting piece: G

My PC specs:
CPU: AMD RYZEN AI MAX+ 395
GPU: Radeon 8060S (allocated 64 GB VRAM)

When I run this model in Ollama (+ Open WebUI), it runs properly on my GPU.
Maybe Ollama is using ROCm instead of Vulkan?

Unsloth AI org

@knarp we reuploaded them with some chat template fixes. Could you retry? Thanks :)

Still the same situation with it responding with 'GGGGGGGGGGGGGGGGGGG' and giving the same error when I press stop:

Failed to regenerate message
Unexpected empty grammar stack after accepting piece: G

I've tried different setting configurations (enabling/disabling 'keep model in memory', mmap, flash attention, etc.), as well as default settings other than setting GPU Offload to max so that it uses my GPU. I've also tried quants of multiple sizes.
Oddly enough, I'm able to use the GLM-4-9B model with no issues. It seems the 32B GLM models don't work properly in LM Studio on the PC specs I mentioned earlier.

@knarp if you don't mind, you can try adding -b 8 -ub 8 or -b 16 -ub 16 to llama.cpp. At least it works like this. In LM Studio, changing "Evaluation Batch Size" is probably the same thing.
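For reference, here is a sketch of how those flags would be passed to llama.cpp's CLI (the model filename and prompt are placeholders, not from this thread):

```shell
# Sketch of the suggested workaround using llama.cpp's llama-cli.
# -b sets the logical batch size and -ub the physical (micro) batch size
# used for prompt processing; -ngl 99 offloads all layers to the GPU.
# Model filename and prompt below are placeholders.
llama-cli -m glm-4-32b-q4_k_m.gguf -ngl 99 -b 8 -ub 8 -p "Hello"
```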

I've set the 'Evaluation Batch Size' to 8 in LM Studio, and the output is now working normally. Anything higher, including the '16' you mentioned, produces 'GGGGGGGGGGGG' responses.

@knarp Also, I just noticed that performance in LM Studio looks worse than in the latest llama.cpp Vulkan build. I guess it's because LM Studio is using an outdated llama.cpp version. I tried the prompt "Write a snake game code in html": in LM Studio the game is almost fully working but not quite (no restart possible), while in the latest llama.cpp it's fully working with start and restart buttons. My settings are --temp 0.6 -b 8 -ub 8 --min_p 0 --top_k 0.
So try this prompt yourself if you want.

same issue on LMStudio on Windows with Vulkan backend running on AMD Vega II

image.png


Set the 'Evaluation Batch Size' to 8. The default in LM Studio is 512, and GLM-4 needs a very low batch size.

@knarp , @davinwang , @urtuuuu
Please don't just accept a workaround that significantly reduces inference speed! Reducing the prompt-ingestion batch size to 8 makes the model incredibly slow!
Demand proper support in llama.cpp and let the developers know this issue exists here: https://github.com/ggml-org/llama.cpp/issues

Unsloth AI org

@knarp @davinwang hey guys, can you guys try it again? We uploaded it again with more llama.cpp fixes! :)

Yes, this issue is now resolved in LM Studio on my AMD GPU. In general, the latest llama.cpp build in LM Studio seems to have resolved the issues with the older GLM-4 32B models as well.
I'm even able to increase the 'Evaluation Batch Size' above the default of 512 (to 2048) and all is well.

knarp changed discussion status to closed
