Carmenest committed on
Commit faffe6f · verified · 1 Parent(s): 9dc8866

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +56 -48
README.md CHANGED
@@ -1,39 +1,38 @@
1
  ---
2
  license: apache-2.0
3
- language:
4
- - en
5
  tags:
6
- - diffusion
7
- - llada
8
- - gguf
9
- - diffuse-cpp
 
 
10
  base_model: GSAI-ML/LLaDA-8B-Instruct
11
- quantized_by: Carmenest
12
  pipeline_tag: text-generation
13
  ---
14
 
15
- # LLaDA-8B-Instruct GGUF
16
 
17
- GGUF quantized versions of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp).
18
 
19
- LLaDA is a **diffusion language model** that generates text by iterative unmasking rather than autoregressive token-by-token prediction.
20
-
21
- > **Paper:** [Diffusion Language Models are Faster than Autoregressive on CPU](https://doi.org/10.5281/zenodo.19119813) — C. Esteban, 2026
22
 
23
  ## Available Quantizations
24
 
25
- | File | Quant | Size | Description |
26
- |------|-------|------|-------------|
27
- | llada-8b-q4km.gguf | Q4_K_M | 5.1 GB | **Recommended** best throughput |
28
- | llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
29
- | llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |
 
 
30
 
31
- ## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
32
 
33
- ### Real Prompt Performance (Q4_K_M + entropy_exit + inter-step cache, B=256)
34
 
35
  | Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
36
- |---|---|---|---|---|
37
  | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
38
  | Translate to French | 25.9 | **27.7** | 2 | 3.3x |
39
  | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
@@ -44,39 +43,48 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
44
  | List the planets | 3.3 | **9.4** | 15 | 1.1x |
45
  | **Average** | **9.6** | **15.3** | | **1.8x** |
46
 
47
- *llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware). Cache enabled by default. 6 of 8 prompts outperform llama.cpp; 2 (code generation, creative writing) remain slower due to requiring all 16 steps.*
 
 
 
48
 
49
- ### Quantization Comparison (low_confidence baseline, B=64)
50
 
51
- | Model | Size | tok/s | vs F16 |
52
- |-------|------|-------|--------|
53
- | F16 | 14.9 GB | 1.64 | 1.00x |
54
- | Q8_0 | 8.4 GB | 1.84 | 1.12x |
55
- | Q4_K_M | 5.1 GB | 2.52 | 1.54x |
56
 
57
- ### Summary
 
 
58
 
59
- - **15-28 tok/s on easy real prompts** (Q4_K_M + entropy_exit + inter-step cache, B=256)
60
- - **Up to 3.2x faster than llama.cpp** on the same hardware
61
- - **Inter-step KV cache**: 1.6x average speedup with no quality degradation
62
- - **6 of 8 real prompts outperform llama.cpp** (vs 3 of 8 without cache)
63
- - **256-token generation** with 20% lower per-token cost vs 64-token batches
64
- - **7.5x thread scaling** from 1 to 12 threads
65
 
66
- Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
 
 
 
 
 
 
 
 
67
 
68
- ## Usage
69
 
70
- ```bash
71
- git clone --recursive https://github.com/iafiscal1212/diffuse-cpp
72
- cd diffuse-cpp
73
- cmake -B build -DCMAKE_BUILD_TYPE=Release
74
- cmake --build build -j$(nproc)
75
-
76
- # Generate with entropy_exit + cache (recommended, cache is ON by default)
77
- python tools/generate.py \
78
- --model-dir /path/to/LLaDA-8B-Instruct \
79
- --gguf llada-8b-q4km.gguf \
80
- -p "What is the capital of France?" \
81
- -n 256 -s 16 -t 12 --remasking entropy_exit
82
  ```
 
 
 
 
 
1
  ---
2
  license: apache-2.0
 
 
3
  tags:
4
+ - diffusion
5
+ - masked-diffusion
6
+ - llada
7
+ - llama
8
+ - gguf
9
+ - diffuse-cpp
10
  base_model: GSAI-ML/LLaDA-8B-Instruct
 
11
  pipeline_tag: text-generation
12
  ---
13
 
14
+ # LLaDA-8B-Instruct-GGUF
15
 
16
+ GGUF quantizations of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp), a CPU inference engine for Diffusion Language Models.
17
 
18
+ LLaDA is a masked diffusion language model built on a Llama backbone with Multi-Head Attention (MHA: 32 query heads, 32 KV heads).
 
 
19
 
20
  ## Available Quantizations
21
 
22
+ | File | Type | Size | Description |
23
+ |------|------|------|-------------|
24
+ | `llada-8b-f16.gguf` | F16 | ~14.9 GB | Unquantized 16-bit weights, best quality |
25
+ | `llada-8b-q8_0.gguf` | Q8_0 | ~8.4 GB | 8-bit quantization, near-lossless |
26
+ | `llada-8b-q4km.gguf` | Q4_K_M | ~5.1 GB | 4-bit mixed quantization, best quality/size ratio |
27
+
28
+ **Recommended:** Q4_K_M for most users; use Q8_0 if you have enough RAM and want minimal quality loss.
29
 
30
+ ## Performance
31
 
32
+ Benchmarked with diffuse-cpp on an AMD EPYC 4465P (12 cores) using Q4_K_M, entropy_exit remasking, inter-step KV cache, B=256, steps=16, 12 threads, seed=42:
33
 
34
  | Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
35
+ |--------|----------------|-------------|-------|-------------|
36
  | Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
37
  | Translate to French | 25.9 | **27.7** | 2 | 3.3x |
38
  | 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
 
43
  | List the planets | 3.3 | **9.4** | 15 | 1.1x |
44
  | **Average** | **9.6** | **15.3** | | **1.8x** |
45
 
46
+ - **Inter-step cache: 1.6x average speedup** (9.6 -> 15.3 tok/s)
47
+ - Easy prompts: **15-28 tok/s** (up to 3.3x faster than llama.cpp)
48
+ - 6 of 8 prompts outperform the llama.cpp baseline (8.51 tok/s, Llama-3-8B Q4_K_M on the same hardware)
49
+ - Cache enabled by default, no quality degradation
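The ratios quoted in the table and bullets can be re-derived from the throughput figures. The sketch below is only a sanity check, not part of diffuse-cpp: it uses the four prompt rows visible in this diff plus the Average row, and the names `cache_tok_s`, `speedup`, `vs_llama`, `cache_gain`, and `average_vs_llama` are illustrative.

```python
# Throughput figures (tok/s) copied from the benchmark table above.
LLAMA_CPP_BASELINE = 8.51  # llama.cpp baseline quoted in the README

# "Cache tok/s" column for the prompt rows visible in this diff.
cache_tok_s = {
    "Capital of France?": 24.4,
    "Translate to French": 27.7,
    "15 x 23?": 15.7,
    "List the planets": 9.4,
}

def speedup(tok_s, baseline=LLAMA_CPP_BASELINE):
    """Throughput ratio against a baseline, rounded to one decimal as in the table."""
    return round(tok_s / baseline, 1)

# Per-prompt "vs llama.cpp" column.
vs_llama = {prompt: speedup(value) for prompt, value in cache_tok_s.items()}

# Average row: cached vs uncached throughput, and cached vs llama.cpp.
cache_gain = speedup(15.3, baseline=9.6)
average_vs_llama = speedup(15.3)

for prompt, ratio in vs_llama.items():
    print(f"{prompt}: {ratio}x")
print(f"inter-step cache gain: {cache_gain}x, average vs llama.cpp: {average_vs_llama}x")
```

Under these inputs the recomputed ratios agree with the table, including the 3.3x maximum and the 1.6x average cache speedup.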
50
 
51
+ ## Usage
52
 
53
+ ```bash
54
+ # Download
55
+ huggingface-cli download diffuse-cpp/LLaDA-8B-Instruct-GGUF llada-8b-q4km.gguf
 
 
56
 
57
+ # Run (requires diffuse-cpp)
58
+ ./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -n 256 -s 16
59
+ ```
60
 
61
+ ## Model Details
 
 
 
 
 
62
 
63
+ - **Architecture:** Llama backbone with bidirectional attention
64
+ - **Parameters:** 8B
65
+ - **Layers:** 32
66
+ - **Hidden size:** 4096
67
+ - **Attention:** MHA (32 query heads, 32 KV heads, head dim 128)
68
+ - **FFN:** SwiGLU, intermediate size 12288
69
+ - **Vocabulary:** 126,464 tokens
70
+ - **RoPE theta:** 500,000
71
+ - **Mask token ID:** 126336
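As a rough cross-check, the figures in this list are enough to estimate the parameter count and the F16 file size. The sketch below makes assumptions the README does not state: no bias terms, normalization weights ignored, and separate (untied) input and output embedding matrices. All variable names are illustrative.

```python
# Values taken from the Model Details list.
n_layers = 32
hidden = 4096
n_heads = 32
head_dim = 128
ffn = 12288
vocab = 126_464

# With MHA, the K and V projections are as large as the Q projection.
assert n_heads * head_dim == hidden

# Assumptions (not stated in the README): no biases, norm weights ignored,
# untied input and output embeddings.
attn_per_layer = 4 * hidden * hidden  # Q, K, V, and output projections
ffn_per_layer = 3 * hidden * ffn      # SwiGLU: gate, up, and down projections
embeddings = 2 * vocab * hidden       # token embedding + output head

total_params = n_layers * (attn_per_layer + ffn_per_layer) + embeddings
f16_gib = total_params * 2 / 2**30    # 2 bytes per weight, in GiB

print(f"estimated parameters: {total_params / 1e9:.2f}B")
print(f"estimated F16 size: {f16_gib:.2f} GiB")
```

Under these assumptions the estimate is about 8.02B parameters and about 14.93 GiB at 16 bits per weight, which is consistent with the "8B" label and suggests the "~14.9 GB" in the quantization table is measured in GiB.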
72
 
73
+ ## Also Available
74
 
75
+ - **[Dream-v0-Instruct-7B-GGUF](https://huggingface.co/diffuse-cpp/Dream-v0-Instruct-7B-GGUF)** — Qwen2.5 backbone, GQA, 7.6B params. Excels at factual and math prompts (21.6 tok/s).
76
+
77
+ ## Citation
78
+
79
+ ```bibtex
80
+ @software{diffuse_cpp_2026,
81
+ title={diffuse-cpp: High-Performance Inference for Diffusion Language Models},
82
+ author={Carmen Estévez},
83
+ year={2026},
84
+ url={https://github.com/iafiscal1212/diffuse-cpp}
85
+ }
 
86
  ```
87
+
88
+ ## License
89
+
90
+ Apache 2.0, following the original LLaDA model license.