YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
TurboQuant: First Open-Source Implementation
First open-source implementation of TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni β Google Research / Google DeepMind / NYU, April 2025).
TurboQuant compresses LLM KV caches 4-7x at inference time using random rotation + optimal scalar quantization, with near-zero quality loss. No training, no calibration data, fully data-oblivious. Drop-in replacement for HuggingFace Transformers cache.
Key Results
Benchmarked across 5 model families, 6 models (7B to 70B) on NVIDIA H100 NVL (96GB):
| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity | Saved @8K |
|---|---|---|---|---|---|---|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact | 380 MB |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact | 890 MB |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact | 2,323 MB |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact | 1,392 MB |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact | 1,791 MB |
| Llama-3.3-70B | 80L, llama | 8 | 128 | none | exact | 501 MB (@2K) |
Prefill logits are bit-identical (0.0 difference) across all 6 tested models. Output quality is coherent and semantically correct β divergence from uncompressed output is purely greedy-decoding drift, not quality degradation.
Needle-in-a-Haystack: 100% Recall
Tested on Qwen2.5-7B across 5 context lengths (1K-16K) and 3 needle positions (25%, 50%, 75%):
| Default Cache | TurboQuant Cache | |
|---|---|---|
| Recall | 15/15 (100%) | 15/15 (100%) |
TurboQuant preserves retrieval quality perfectly, matching the paper's 0.997 recall claim.
Memory Savings Scale with Context
Qwen2.5-32B (4-bit weights) on H100:
| Context | Default KV | TurboQuant KV | Saved |
|---|---|---|---|
| 1K tokens | 19.97 GB | 19.79 GB | 186 MB |
| 4K tokens | 21.23 GB | 20.42 GB | 833 MB |
| 8K tokens | 23.16 GB | 21.41 GB | 1,791 MB |
| 32K tokens | ~27.5 GB | ~21.8 GB | ~5,700 MB (projected) |
Quickstart
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
# Auto-detect outlier layers, create compressed cache
skip = TurboQuantCache.calibrate_skip_layers(model, tokenizer)
cache = TurboQuantCache(model.config, nbits=4, skip_layers=skip)
# Use exactly like default cache
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100, past_key_values=cache)
How It Works
TurboQuant implements Algorithm 1 (TurboQuant_mse) from the paper:
- Random rotation (QR decomposition): transforms each KV vector so coordinates follow a known Beta distribution
- Optimal scalar quantization (Lloyd-Max): quantizes each coordinate to 4 bits using precomputed codebook
- Bit packing: stores 128-dim vectors as 64 bytes (uint4) + 2 bytes (norm) = 66 bytes vs 256 bytes BF16
Theoretical guarantee: MSE distortion β€ 0.009 at 4-bit, within 2.7x of information-theoretic optimum (Shannon lower bound).
Our measured MSE: 0.0093 β matches the paper.
What We Found Beyond the Paper
Outlier Layer Norms
The paper mentions "splitting channels into outlier and non-outlier sets" without specifying how. We discovered:
- Qwen2.5-7B: Layer 0 key norms = 273.8 (16.2x median). Layer 27 = outlier too.
- Qwen2.5-32B: Layer 0 = 37.8 (2.35x median). Mild, no skip needed.
- Llama-3.1-8B: Max/median ratio = 1.18x. No outliers at all.
- Gemma-2-9B: Max/median ratio = 1.19x. No outliers.
- Phi-4-14B: Max/median ratio = 1.38x. No outliers.
Finding: Smaller Qwen models have severe outlier layers. Larger models and non-Qwen architectures are well-balanced. Our calibrate_skip_layers() auto-detects outliers and keeps them in full precision.
head_dim Compatibility
The paper only tested head_dim=128 (Llama, Mistral). We verified TurboQuant works with head_dim=256 (Gemma-2) β the Lloyd-Max codebook adapts to any dimension since it's computed from the Beta distribution parameterized by d.
Architecture Coverage
| Architecture | Paper Tested | We Tested | Works |
|---|---|---|---|
| Llama | Llama-3.1-8B | Llama-3.1-8B, 3.3-70B | Yes |
| Mistral | Ministral-7B | β | β |
| Qwen | β | Qwen2.5-7B, 32B | Yes (with outlier handling) |
| Gemma | β | Gemma-2-9B | Yes (head_dim=256) |
| Phi | β | Phi-4-14B | Yes |
Files
turboquant/
βββ __init__.py # Public API
βββ codebook.py # Lloyd-Max solver for Beta distribution
βββ quantizer.py # Core TurboQuantizer: rotate β quantize β pack
βββ packing.py # uint4/uint2 bit packing
βββ cache.py # TurboQuantCache for HF Transformers
scripts/
βββ verify.py # Unit tests (MSE bounds, packing, fixed-point)
βββ test_cache.py # Cache API integration tests
βββ benchmark_models.py # Multi-model benchmark suite
βββ run_inference.py # Interactive inference demo
benchmark_results.json # Raw benchmark data (all 5 models)
Verified Against Paper
| Metric | Paper | Ours |
|---|---|---|
| MSE at 4-bit (unit vectors) | β€ 0.009 | 0.0093 |
| MSE at 2-bit (unit vectors) | β€ 0.117 | 0.116 |
| Compression ratio (per-vector) | ~4x | 3.88x |
| System compression @8K+ | 4-7x | 7.2x |
| Prefill fidelity | "quality neutral" | exact (0.0 logit diff) |
| Double quantization | fixed point | verified (indices identical) |
Requirements
- Python 3.10+
- PyTorch 2.7+ (CUDA 12.8 compatible)
- HuggingFace Transformers 5.0+
- scipy (for codebook computation)
- bitsandbytes (optional, for 4-bit model loading)
Citation
If you use this implementation, please cite the original paper:
@article{zandieh2025turboquant,
title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
journal={arXiv preprint arXiv:2504.19874},
year={2025}
}
License
This implementation is released under MIT License. The TurboQuant algorithm is described in the paper above.