YAML Metadata Warning:empty or missing yaml metadata in repo card

Check out the documentation for more information.

TurboQuant: First Open-Source Implementation

First open-source implementation of TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni β€” Google Research / Google DeepMind / NYU, April 2025).

TurboQuant compresses LLM KV caches 4-7x at inference time using random rotation + optimal scalar quantization, with near-zero quality loss. No training, no calibration data, fully data-oblivious. Drop-in replacement for HuggingFace Transformers cache.

Key Results

Benchmarked across 5 model families, 6 models (7B to 70B) on NVIDIA H100 NVL (96GB):

Model Architecture KV Heads head_dim Outlier Layers Prefill Fidelity Saved @8K
Qwen2.5-7B 28L, qwen2 4 128 layers 0, 27 exact 380 MB
Llama-3.1-8B 32L, llama 8 128 none exact 890 MB
Gemma-2-9B 42L, gemma2 8 256 none exact 2,323 MB
Phi-4-14B 40L, phi3 10 128 none exact 1,392 MB
Qwen2.5-32B 64L, qwen2 8 128 none exact 1,791 MB
Llama-3.3-70B 80L, llama 8 128 none exact 501 MB (@2K)

Prefill logits are bit-identical (0.0 difference) across all 6 tested models. Output quality is coherent and semantically correct β€” divergence from uncompressed output is purely greedy-decoding drift, not quality degradation.

Needle-in-a-Haystack: 100% Recall

Tested on Qwen2.5-7B across 5 context lengths (1K-16K) and 3 needle positions (25%, 50%, 75%):

Default Cache TurboQuant Cache
Recall 15/15 (100%) 15/15 (100%)

TurboQuant preserves retrieval quality perfectly, matching the paper's 0.997 recall claim.

Memory Savings Scale with Context

Qwen2.5-32B (4-bit weights) on H100:

Context Default KV TurboQuant KV Saved
1K tokens 19.97 GB 19.79 GB 186 MB
4K tokens 21.23 GB 20.42 GB 833 MB
8K tokens 23.16 GB 21.41 GB 1,791 MB
32K tokens ~27.5 GB ~21.8 GB ~5,700 MB (projected)

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

# Auto-detect outlier layers, create compressed cache
skip = TurboQuantCache.calibrate_skip_layers(model, tokenizer)
cache = TurboQuantCache(model.config, nbits=4, skip_layers=skip)

# Use exactly like default cache
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100, past_key_values=cache)

How It Works

TurboQuant implements Algorithm 1 (TurboQuant_mse) from the paper:

  1. Random rotation (QR decomposition): transforms each KV vector so coordinates follow a known Beta distribution
  2. Optimal scalar quantization (Lloyd-Max): quantizes each coordinate to 4 bits using precomputed codebook
  3. Bit packing: stores 128-dim vectors as 64 bytes (uint4) + 2 bytes (norm) = 66 bytes vs 256 bytes BF16

Theoretical guarantee: MSE distortion ≀ 0.009 at 4-bit, within 2.7x of information-theoretic optimum (Shannon lower bound).

Our measured MSE: 0.0093 β€” matches the paper.

What We Found Beyond the Paper

Outlier Layer Norms

The paper mentions "splitting channels into outlier and non-outlier sets" without specifying how. We discovered:

  • Qwen2.5-7B: Layer 0 key norms = 273.8 (16.2x median). Layer 27 = outlier too.
  • Qwen2.5-32B: Layer 0 = 37.8 (2.35x median). Mild, no skip needed.
  • Llama-3.1-8B: Max/median ratio = 1.18x. No outliers at all.
  • Gemma-2-9B: Max/median ratio = 1.19x. No outliers.
  • Phi-4-14B: Max/median ratio = 1.38x. No outliers.

Finding: Smaller Qwen models have severe outlier layers. Larger models and non-Qwen architectures are well-balanced. Our calibrate_skip_layers() auto-detects outliers and keeps them in full precision.

head_dim Compatibility

The paper only tested head_dim=128 (Llama, Mistral). We verified TurboQuant works with head_dim=256 (Gemma-2) β€” the Lloyd-Max codebook adapts to any dimension since it's computed from the Beta distribution parameterized by d.

Architecture Coverage

Architecture Paper Tested We Tested Works
Llama Llama-3.1-8B Llama-3.1-8B, 3.3-70B Yes
Mistral Ministral-7B β€” β€”
Qwen β€” Qwen2.5-7B, 32B Yes (with outlier handling)
Gemma β€” Gemma-2-9B Yes (head_dim=256)
Phi β€” Phi-4-14B Yes

Files

turboquant/
β”œβ”€β”€ __init__.py          # Public API
β”œβ”€β”€ codebook.py          # Lloyd-Max solver for Beta distribution
β”œβ”€β”€ quantizer.py         # Core TurboQuantizer: rotate β†’ quantize β†’ pack
β”œβ”€β”€ packing.py           # uint4/uint2 bit packing
β”œβ”€β”€ cache.py             # TurboQuantCache for HF Transformers
scripts/
β”œβ”€β”€ verify.py            # Unit tests (MSE bounds, packing, fixed-point)
β”œβ”€β”€ test_cache.py        # Cache API integration tests
β”œβ”€β”€ benchmark_models.py  # Multi-model benchmark suite
β”œβ”€β”€ run_inference.py     # Interactive inference demo
benchmark_results.json   # Raw benchmark data (all 5 models)

Verified Against Paper

Metric Paper Ours
MSE at 4-bit (unit vectors) ≀ 0.009 0.0093
MSE at 2-bit (unit vectors) ≀ 0.117 0.116
Compression ratio (per-vector) ~4x 3.88x
System compression @8K+ 4-7x 7.2x
Prefill fidelity "quality neutral" exact (0.0 logit diff)
Double quantization fixed point verified (indices identical)

Requirements

  • Python 3.10+
  • PyTorch 2.7+ (CUDA 12.8 compatible)
  • HuggingFace Transformers 5.0+
  • scipy (for codebook computation)
  • bitsandbytes (optional, for 4-bit model loading)

Citation

If you use this implementation, please cite the original paper:

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

License

This implementation is released under MIT License. The TurboQuant algorithm is described in the paper above.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Paper for vivekvar/turboquant