@@ -1,39 +1,38 @@
 ---
 license: apache-2.0
-language:
-  - en
 tags:
-  - diffusion
-  - llada
-  - gguf
-  - diffuse-cpp
 base_model: GSAI-ML/LLaDA-8B-Instruct
-quantized_by: Carmenest
 pipeline_tag: text-generation
 ---
-# LLaDA-8B-Instruct GGUF
-GGUF quantized versions of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp).
-LLaDA is a **diffusion language model** that generates text by iterative unmasking rather than autoregressive token-by-token prediction.
-> **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119813) — C. Esteban, 2026
 ## Available Quantizations
-| File | Quant | Size | Description |
-|------|-------|------|-------------|
-| llada-8b-q4km.gguf | Q4_K_M | 5.1 GB | **Recommended** best throughput |
-| llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
-| llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |
-## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
-### Real Prompt Performance (Q4_K_M + entropy_exit + inter-step cache, B=256)
 | Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
-|---|---|---|---|---|
 | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
 | Translate to French | 25.9 | **27.7** | 2 | 3.3x |
 | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
@@ -44,39 +43,48 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
 | List the planets | 3.3 | **9.4** | 15 | 1.1x |
 | **Average** | **9.6** | **15.3** | | **1.8x** |
-*llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware). Cache enabled by default. 6 of 8 prompts outperform llama.cpp; 2 (code generation, creative writing) remain slower due to requiring all 16 steps.*
-### Quantization Comparison (low_confidence baseline, B=64)
-| Model | Size | tok/s | vs F16 |
-|-------|------|-------|--------|
-| F16 | 14.9 GB | 1.64 | 1.00x |
-| Q8_0 | 8.4 GB | 1.84 | 1.12x |
-| Q4_K_M | 5.1 GB | 2.52 | 1.54x |
-### Summary
-- **15-28 tok/s on easy real prompts** (Q4_K_M + entropy_exit + inter-step cache, B=256)
-- **Up to 3.2x faster than llama.cpp** on the same hardware
-- **Inter-step KV cache**: 1.6x average speedup with no quality degradation
-- **6 of 8 real prompts outperform llama.cpp** (vs 3 of 8 without cache)
-- **256-token generation** with 20% lower per-token cost vs 64-token batches
-- **7.5x thread scaling** from 1 to 12 threads
-Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
-## Usage
-```bash
-git clone --recursive https://github.com/iafiscal1212/diffuse-cpp
-cd diffuse-cpp
-cmake -B build -DCMAKE_BUILD_TYPE=Release
-cmake --build build -j$(nproc)
-# Generate with entropy_exit + cache (recommended, cache is ON by default)
-python tools/generate.py \
-    --model-dir /path/to/LLaDA-8B-Instruct \
-    --gguf llada-8b-q4km.gguf \
-    -p "What is the capital of France?" \
-    -n 256 -s 16 -t 12 --remasking entropy_exit
 ```

 ---
 license: apache-2.0
 tags:
+- diffusion
+- masked-diffusion
+- llada
+- llama
+- gguf
+- diffuse-cpp
 base_model: GSAI-ML/LLaDA-8B-Instruct
 pipeline_tag: text-generation
 ---
+# LLaDA-8B-Instruct-GGUF
+GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.
+LLaDA is a masked diffusion language model built on a Llama backbone with Multi-Head Attention (MHA: 32 query heads, 32 KV heads).
 ## Available Quantizations
+| File | Type | Size | Description |
+|------|------|------|-------------|
+| `llada-8b-f16.gguf` | F16 | ~14.9 GB | Full precision, best quality |
+| `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
+| `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |
+**Recommended:** Q4_K_M for most users. Q8_0 if you have enough RAM and want minimal quality loss.
+## Performance
+Benchmarked on diffuse-cpp with entropy_exit + inter-step KV cache, Q4_K_M, B=256, 12 threads, seed=42:
 | Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
+|--------|----------------|-------------|-------|-------------|
 | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
 | Translate to French | 25.9 | **27.7** | 2 | 3.3x |
 | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
 | List the planets | 3.3 | **9.4** | 15 | 1.1x |
 | **Average** | **9.6** | **15.3** | | **1.8x** |
+- **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
+- Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
+- 6 of 8 prompts outperform the llama.cpp baseline (8.51 tok/s, Llama-3-8B Q4_K_M on the same hardware)
+- Cache enabled by default, no quality degradation
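The ratios quoted above can be re-derived from the listed throughputs. The following is a minimal Python sketch that uses only the rows shown in this table and the 8.51 tok/s llama.cpp baseline; the names `ROWS`, `LLAMA_CPP_BASELINE`, and `speedup` are illustrative and not part of diffuse-cpp.

```python
# Throughputs in tok/s copied from the table above: (no-cache, cache).
ROWS = {
    "Capital of France?": (17.5, 24.4),
    "Translate to French": (25.9, 27.7),
    "15 x 23?": (12.8, 15.7),
    "List the planets": (3.3, 9.4),
    "Average": (9.6, 15.3),
}

# llama.cpp baseline quoted in the model card (tok/s).
LLAMA_CPP_BASELINE = 8.51


def speedup(fast: float, slow: float) -> float:
    """Return the throughput ratio fast/slow, rounded to one decimal place."""
    return round(fast / slow, 1)


for prompt, (no_cache, cache) in ROWS.items():
    cache_gain = speedup(cache, no_cache)
    vs_baseline = speedup(cache, LLAMA_CPP_BASELINE)
    print(f"{prompt}: cache gain {cache_gain}x, vs llama.cpp {vs_baseline}x")
```

Rounded to one decimal place, the average row gives a 1.6x cache gain and 1.8x over the baseline, matching the figures in the table and the bullets.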
+## Usage
+```bash
+# Download
+huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf
+# Run (requires diffuse-cpp)
+./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
+```
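For scripted runs, the same invocation can be wrapped in Python. This is a sketch under stated assumptions: it uses only the flags shown in the command above (`-m`, `-p`, `-n`, `-s`), reads `-n` as the number of tokens and `-s` as the number of diffusion steps based on the benchmark settings, and assumes the CLI writes the generated text to stdout. The helper names `build_diffuse_command` and `generate` are hypothetical.

```python
import subprocess


def build_diffuse_command(model_path, prompt, n_tokens=256, steps=16,
                          binary="./diffuse-cli"):
    """Build the argument list for the diffuse-cli call shown above.

    Only flags that appear in the usage example are used; their meaning
    (-n = tokens, -s = steps) is inferred, not taken from CLI documentation.
    """
    return [
        binary,
        "-m", model_path,
        "-p", prompt,
        "-n", str(n_tokens),
        "-s", str(steps),
    ]


def generate(model_path, prompt, **kwargs):
    """Run diffuse-cli and return its stdout (assumed to hold the output text)."""
    cmd = build_diffuse_command(model_path, prompt, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Example call (requires a built diffuse-cli and the downloaded GGUF file):
# print(generate("llada-8b-q4km.gguf", "What is the capital of France?"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when the prompt contains spaces or quotation marks.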
+## Model Details
+- **Architecture:** Llama backbone with bidirectional attention
+- **Parameters:** 8B
+- **Layers:** 32
+- **Hidden size:** 4096
+- **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
+- **FFN:** SwiGLU, intermediate size 12288
+- **Vocabulary:** 126,464 tokens
+- **RoPE theta:** 500,000
+- **Mask token ID:** 126336
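As a sanity check, the figures listed above can be combined into a rough parameter count. The sketch below assumes a standard Llama-style layer layout (four square attention projections, three SwiGLU matrices, no biases), untied input and output embeddings, and ignores the small normalization weights; none of these assumptions is stated in the model card, and `estimate_parameters` is an illustrative helper.

```python
# Architecture figures copied from the Model Details list above.
LAYERS = 32
HIDDEN = 4096
FFN = 12288
VOCAB = 126_464


def estimate_parameters(layers=LAYERS, hidden=HIDDEN, ffn=FFN, vocab=VOCAB):
    """Rough weight count for a Llama-style transformer with MHA.

    Assumptions (not confirmed by the model card): no bias terms,
    untied embedding and output matrices, norm weights ignored.
    """
    attention = 4 * hidden * hidden   # Q, K, V, O; with MHA all are hidden x hidden
    swiglu = 3 * hidden * ffn         # gate, up, and down projections
    embeddings = 2 * vocab * hidden   # input embedding + output head (assumed untied)
    return layers * (attention + swiglu) + embeddings


total = estimate_parameters()
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
print(f"F16 size at 2 bytes per weight: ~{total * 2 / 2**30:.1f} GiB")
```

Under these assumptions the estimate comes to about 8.02B parameters, consistent with the 8B label, and about 14.9 GiB at two bytes per weight, close to the listed F16 file size.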
+## Also Available
+- **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).
+## Citation
+```bibtex
+@software{diffuse_cpp_2026,
+  title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
+  author={Carmen Estévez},
+  year={2026},
+  url={https://github.com/iafiscal1212/diffuse-cpp}
+}
 ```
+## License
+Apache 2.0, following the original LLaDA model license.