Request to Publish More GLM-4.6V-Flash GGUF Quantized Versions

by baijianfeng

Dear ggml-org Team,

Hello! I am a loyal user of the GLM-4.6V-Flash model, and I am currently deploying llama.cpp's multimodal text-and-image inference on macOS and Android. First of all, thank you for releasing the high-quality Q4_K_M version, which performs excellently when run locally.

During practical use, I found that the Q4_K_M build is currently the only GGUF on Hugging Face that actually supports the vision module. Other community versions (such as Q2_K, Q3_K, and Q4_K_S), although similarly named, are text-only models that lack the vision encoder and the mmproj alignment structure, and therefore cannot be used for image inference.
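
For reference, this is roughly how we exercise the vision path with llama.cpp's multimodal CLI (a minimal sketch: the binary name `llama-mtmd-cli` reflects current llama.cpp tooling, and the GGUF filenames here are assumptions, not the actual release names):

```bash
# Assumed filenames: the mmproj GGUF carries the vision encoder/projector
# and must accompany the quantized language model for image inference.
./llama-mtmd-cli \
  -m GLM-4.6V-Flash-Q4_K_M.gguf \
  --mmproj mmproj-GLM-4.6V-Flash-f16.gguf \
  --image test.jpg \
  -p "Describe this image."
```

The text-only community quantizations cannot be paired with the vision components this way, which is why Q4_K_M is currently the only option for image inference.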

Memory on mobile devices such as Android is limited, and the memory footprint of Q4_K_M is relatively high (about 6.2 GB measured on macOS), so it easily triggers out-of-memory errors when deployed on Android. We therefore sincerely hope to obtain the following lightweight vision GGUF versions:

- GLM-4.6V-Flash-Q2_K.gguf
- GLM-4.6V-Flash-Q3_K_M.gguf
- GLM-4.6V-Flash-Q4_K_S.gguf

These versions would greatly improve the model's usability on mobile and low-memory devices, and would help bring multimodal models into local-deployment and edge-computing scenarios.

If you plan to release these versions, or if there is an established conversion toolchain and process, please also consider documenting it in the repository. We would be glad to assist with testing and feedback.
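
If it helps as a starting point, lighter quantizations of the language model can typically be produced with llama.cpp's quantize tool. This is only a sketch, assuming an f16 GGUF of GLM-4.6V-Flash is available (the input filename below is hypothetical); the mmproj file is normally kept as a separate GGUF and reused as-is:

```bash
# Assumed input filename; Q2_K / Q3_K_M / Q4_K_S are the requested variants.
./llama-quantize GLM-4.6V-Flash-f16.gguf GLM-4.6V-Flash-Q2_K.gguf   Q2_K
./llama-quantize GLM-4.6V-Flash-f16.gguf GLM-4.6V-Flash-Q3_K_M.gguf Q3_K_M
./llama-quantize GLM-4.6V-Flash-f16.gguf GLM-4.6V-Flash-Q4_K_S.gguf Q4_K_S
```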

Thank you again for your contributions to the open-source community!

Sincerely,

Jeffery (from Beijing, China)
iOS + Android local deployment user and developer
December 22, 2025
