---
license: apache-2.0
tags:
- diffusion
- masked-diffusion
- llada
- llama
- gguf
- diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
---

# LLaDA-8B-Instruct-GGUF

GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.

LLaDA is a masked diffusion language model built on a Llama-style backbone with full Multi-Head Attention (MHA, 32 query / 32 KV heads).
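To make the decoding model concrete, here is a toy sketch of masked-diffusion generation: start from a sequence of mask tokens and, over a few steps, commit the model's most confident predictions. This is a simplified illustration, not diffuse-cpp's actual scheduler; `dummy_model` and `masked_diffusion_decode` are hypothetical names, and a real model would run a bidirectional forward pass instead of returning random scores.

```python
import random

MASK = 126336  # LLaDA's mask token id (see Model Details)

def dummy_model(tokens):
    """Stand-in for the network: returns (predicted_token, confidence)
    for every position. A real model runs a bidirectional forward pass."""
    return [(random.randrange(1000), random.random()) for _ in tokens]

def masked_diffusion_decode(length=16, steps=4):
    tokens = [MASK] * length
    per_step = length // steps
    for _ in range(steps):
        preds = dummy_model(tokens)
        # Rank still-masked positions by model confidence...
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        # ...and commit the most confident predictions this step.
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
    return tokens

print(masked_diffusion_decode())
```

Because every step fills many positions at once, fewer forward passes are needed than autoregressive decoding, which is where the per-prompt step counts in the benchmark table below come from.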

## Available Quantizations

| File | Type | Size | Description |
|------|------|------|-------------|
| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |

**Recommended:** Q4_K_M for most users. Q8_0 if you have enough RAM and want minimal quality loss.
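As a sanity check on the sizes above, a rough estimate can be derived from nominal bits-per-weight. The Q4_K_M figure is approximate (it mixes Q4_K and Q6_K tensors), and real files run somewhat larger than this estimate because embeddings and the output head are often kept at higher precision; `estimate_gib` is a hypothetical helper, not part of diffuse-cpp.

```python
PARAMS = 8.0e9  # ~8B weights

# Nominal GGML bits-per-weight (Q8_0: 32-weight blocks of 34 bytes = 8.5 bpw;
# Q4_K_M's ~4.85 bpw is an approximation for its mixed-precision layout).
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

def estimate_gib(bpw, params=PARAMS):
    """Approximate file size in GiB for a given bits-per-weight."""
    return params * bpw / 8 / 2**30

for name, bpw in BPW.items():
    print(f"{name}: ~{estimate_gib(bpw):.1f} GiB")
```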

## Performance

Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, Q4_K_M, B=256, 12 threads, seed=42:

| Prompt | No-cache tok/s | Cache tok/s | Steps | Speedup vs llama.cpp |
|--------|----------------|-------------|-------|-------------|
| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
| Translate to French | 25.9 | **27.7** | 2 | 3.3x |
| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
| List the planets | 3.3 | **9.4** | 15 | 1.1x |
| **Average** | **9.6** | **15.3** | | **1.8x** |

- **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
- Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
- 6 of 8 prompts outperform llama.cpp (8.51 tok/s baseline)
- Cache enabled by default, no quality degradation
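The averages and the cache speedup can be reproduced directly from the per-prompt rows above:

```python
# tok/s per prompt, copied from the benchmark table above
no_cache = [17.5, 25.9, 12.8, 7.6, 3.2, 3.2, 3.3, 3.3]
cache    = [24.4, 27.7, 15.7, 22.9, 4.9, 5.3, 12.0, 9.4]

avg_nc = sum(no_cache) / len(no_cache)   # 9.6 tok/s
avg_c  = sum(cache) / len(cache)         # ~15.3 tok/s
speedup = avg_c / avg_nc                 # ~1.6x

print(f"{avg_nc:.1f} -> {avg_c:.1f} tok/s ({speedup:.1f}x)")
```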

## Usage

```bash
# Download
huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf

# Run (requires diffuse-cpp)
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
```

## Model Details

- **Architecture:** Llama backbone with bidirectional attention
- **Parameters:** 8B
- **Layers:** 32
- **Hidden size:** 4096
- **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
- **FFN:** SwiGLU, intermediate size 12288
- **Vocabulary:** 126,464 tokens
- **RoPE theta:** 500,000
- **Mask token ID:** 126336
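The 8B parameter count follows from these hyperparameters. The arithmetic below assumes an untied output head (a second vocab-by-hidden matrix) and ignores biases and norm weights, which are negligible at this scale:

```python
hidden, layers, ffn, vocab = 4096, 32, 12288, 126_464

attn = 4 * hidden * hidden   # Q, K, V, O projections (MHA, no GQA sharing)
mlp  = 3 * hidden * ffn      # SwiGLU: gate, up, and down matrices
per_layer = attn + mlp
embed = vocab * hidden

# 2 * embed assumes the input embedding and output head are untied.
total = layers * per_layer + 2 * embed
print(f"~{total / 1e9:.2f}B parameters")  # ~8.02B
```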

## Also Available

- **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).

## Citation

```bibtex
@software{diffuse_cpp_2026,
  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
  author={Carmen Estévez},
  year={2026},
  url={https://github.com/iafiscal1212/diffuse-cpp}
}
```

## License

Apache 2.0, following the original LLaDA model license.