---
license: apache-2.0
tags:
- diffusion
- masked-diffusion
- llada
- llama
- gguf
- diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
---
# LLaDA-8B-Instruct-GGUF
GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.
LLaDA is a masked diffusion language model built on a Llama backbone with standard Multi-Head Attention (MHA, 32 query / 32 KV heads).
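A masked diffusion model decodes by starting from an all-mask sequence and unmasking the most confident positions over a fixed number of steps, rather than generating left to right. The toy sketch below illustrates that loop; the predictor is a stand-in for a forward pass, not LLaDA's actual network, and the unmasking schedule is a simplification:

```python
import math

MASK = -1  # stand-in for LLaDA's real mask token id (126336)

def decode(predict, length, steps):
    """Iteratively unmask the highest-confidence positions.

    `predict(tokens)` returns a (token, confidence) pair per position;
    here it stands in for a full forward pass of the network.
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = predict(tokens)
        # Unmask an equal share of the remaining positions each step,
        # taking the most confident predictions first.
        k = max(1, math.ceil(len(masked) / (steps - step)))
        for i in sorted(masked, key=lambda i: -preds[i][1])[:k]:
            tokens[i] = preds[i][0]
    return tokens

# Toy predictor: proposes token i+100 at position i, with confidence
# decreasing left to right, so earlier positions unmask first.
toy = lambda toks: [(i + 100, 1.0 / (i + 1)) for i in range(len(toks))]
```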
## Available Quantizations
| File | Type | Size | Description |
|------|------|------|-------------|
| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |
**Recommended:** Q4_K_M for most users. Q8_0 if you have enough RAM and want minimal quality loss.
## Performance
Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, Q4_K_M, B=256, 12 threads, seed=42:
| Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
|--------|----------------|-------------|-------|-------------|
| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
| Translate to French | 25.9 | **27.7** | 2 | 3.3x |
| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
| List the planets | 3.3 | **9.4** | 15 | 1.1x |
| **Average** | **9.6** | **15.3** | | **1.8x** |
- **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
- Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
- 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- Cache enabled by default, no quality degradation
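The entropy_exit speedups above come from ending the denoising loop early once the model's remaining predictions are confident. A minimal sketch of such a criterion follows; the threshold and the exact exit rule are assumptions for illustration, not diffuse-cpp's actual implementation:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_exit(step_probs, threshold=0.1):
    """Exit the denoising loop once every still-masked position's
    distribution is nearly deterministic (entropy below threshold)."""
    return all(entropy(p) < threshold for p in step_probs)

# Two masked positions with near-one-hot distributions -> safe to exit;
# a 50/50 split (entropy ln 2 ~ 0.69 nats) keeps the loop running.
confident = [[0.99, 0.01], [1.0, 0.0]]
uncertain = [[0.5, 0.5], [0.99, 0.01]]
```

Easy prompts resolve to low-entropy predictions in a few steps, which is why they finish in 2-4 steps in the table while harder prompts run to the step limit.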
## Usage
```bash
# Download
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf
# Run (requires diffuse-cpp)
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
```
## Model Details
- **Architecture:** Llama backbone with bidirectional attention
- **Parameters:** 8B
- **Layers:** 32
- **Hidden size:** 4096
- **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
- **FFN:** SwiGLU, intermediate size 12288
- **Vocabulary:** 126,464 tokens
- **RoPE theta:** 500,000
- **Mask token ID:** 126336
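The figures above are enough to sanity-check the parameter count, assuming untied input/output embeddings and ignoring norms (which contribute comparatively little):

```python
hidden, layers, ffn, vocab = 4096, 32, 12288, 126464

attn = 4 * hidden * hidden        # Q, K, V, O projections (full MHA, no GQA)
swiglu = 3 * hidden * ffn         # gate, up, and down projections
embeddings = 2 * vocab * hidden   # input embedding + LM head (assumed untied)

total = layers * (attn + swiglu) + embeddings
print(f"~{total / 1e9:.2f}B parameters")
# At 2 bytes per weight, F16 comes to total * 2 / 2**30 GiB,
# consistent with the ~14.9 GB listed above.
```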
## Also Available
- **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).
## Citation
```bibtex
@software{diffuse_cpp_2026,
title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
author={Carmen Estévez},
year={2026},
url={https://github.com/iafiscal1212/diffuse-cpp}
}
```
## License
Apache 2.0, following the original LLaDA model license.