# RotorQuant: Weights + Runtime Quantization for Qwen2.5-0.5B-Instruct
This repository implements and benchmarks multiple quantization/deployment paths for `Qwen/Qwen2.5-0.5B-Instruct`, including:
- RotorQuant-style 3-bit weight quantization (custom codec)
- Quantized model loading + text generation with proper Qwen chat template
- Validation against FP32 baseline
- Runtime acceleration paths:
  - RotorQuant fused runtime (packed-weight linear)
  - Dynamic INT8 runtime baseline (for speed comparison)
The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.
## Model Reference
- Hugging Face model card: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- This repo uses the official chat template via `tokenizer.apply_chat_template(...)`.
## Repository Layout

- `rotorquant.py`: Original RotorQuant core classes (`RotorQuantMSE`, `RotorQuantProd`, `RotorQuantKVCache`) using Clifford algebra + Lloyd-Max.
- `clifford.py`: Cl(3,0) multivector algebra and rotor operations.
- `lloyd_max.py`: Lloyd-Max codebook solver (SciPy integration) and utilities.
- `rotorquant_weights.py`: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
- `quantize_qwen.py`: Quantize a Hugging Face model checkpoint into the custom package format.
- `run_inference.py`: Load a quantized package, reconstruct the model, run generation.
- `validate_quantization.py`: Baseline vs. quantized quality checks (logit cosine + token match).
- `benchmark_scenarios.py`: Benchmark the baseline and all quantized artifacts in `artifacts/`.
- `runtime_int8.py`: Build/load dynamic INT8 runtime models (full or selective modules).
- `runtime_rotor_fused.py`: Fused RotorQuant runtime path using a packed-weight linear module.
- `benchmark_runtime_vs_rotor.py`: Unified benchmark: FP32 vs. RotorQuant package vs. Rotor fused vs. INT8 runtime.
- `artifacts/*.json`: Saved reports/metrics from experiments.
## Environment Setup

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil
```

Or:

```bash
pip install -r requirements.txt
```
## How Quantization Is Done

### 1) RotorQuant-style 3-bit weight codec (`rotorquant_weights.py`)
For selected float tensors (typically `Linear.weight`):

- Optional rotor transform over triples of values (deterministic per tensor)
- Block-wise normalization:
  - mean-center each block
  - scale by the block's max absolute value
- 3-bit scalar quantization to an 8-level codebook
- Packed 3-bit index serialization
- Stored metadata for dequantization (scales, centers, codebook, shape info)
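The block-wise steps above can be sketched in plain Python. This is a conceptual illustration only: the function names are hypothetical, and a fixed uniform 8-level codebook stands in for the Lloyd-Max codebook (and optional rotor transform) that `rotorquant_weights.py` actually uses.

```python
# Illustrative block-wise 3-bit quantization sketch (pure Python).
# Assumption: a uniform codebook; the real codec fits levels via Lloyd-Max.
CODEBOOK = [-1.0, -5/7, -3/7, -1/7, 1/7, 3/7, 5/7, 1.0]

def quantize_block(block):
    """Mean-center, scale by max-abs, map each value to a 3-bit index."""
    center = sum(block) / len(block)
    centered = [x - center for x in block]
    scale = max(abs(x) for x in centered) or 1.0
    idxs = [min(range(8), key=lambda i: abs(x / scale - CODEBOOK[i]))
            for x in centered]
    return idxs, scale, center

def dequantize_block(idxs, scale, center):
    """Invert quantization using the stored scale/center metadata."""
    return [CODEBOOK[i] * scale + center for i in idxs]

def pack3(idxs):
    """Serialize 3-bit indices into bytes (little-endian bit order)."""
    bits, nbits, out = 0, 0, bytearray()
    for i in idxs:
        bits |= i << nbits
        nbits += 3
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        out.append(bits & 0xFF)
    return bytes(out)
```

Packing four weights costs 12 bits (two bytes) plus the per-block scale and center, which is where the ~3.5 effective bits/weight reported below comes from.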
Supported options include:

- `--block-size` (example: 128 or 64)
- `--rowwise`
- `--include-name-contains` / `--skip-name` selection
- `--lowrank-rank` residual correction
- `--outlier-frac` residual outlier preservation
### 2) Rotor fused runtime (`runtime_rotor_fused.py`)

- Replaces selected `nn.Linear` modules with `FusedRotorLinear`.
- Reads packed 3-bit weights directly.
- Decodes on demand and caches the decoded weight for repeated use.
- Avoids eagerly dequantizing all quantized layers at load time.
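The decode-on-demand caching pattern can be shown with a minimal stand-in class. Everything here is illustrative: the real `FusedRotorLinear` wraps a `torch.nn.Linear`-shaped module, whereas this sketch uses plain Python lists and a caller-supplied codebook.

```python
# Minimal sketch of decode-on-demand with caching (pure Python).
class LazyPackedLinear:
    """Holds packed 3-bit indices; decodes to floats on first use only."""

    def __init__(self, packed, n, scale, center, codebook):
        self.packed = packed      # bytes of packed 3-bit indices
        self.n = n                # number of weights
        self.scale = scale
        self.center = center
        self.codebook = codebook
        self._weights = None      # decoded cache, filled lazily
        self.decode_calls = 0

    def _decode(self):
        bits = int.from_bytes(self.packed, "little")
        idxs = [(bits >> (3 * i)) & 0b111 for i in range(self.n)]
        return [self.codebook[i] * self.scale + self.center for i in idxs]

    def weights(self):
        if self._weights is None:  # decode once, reuse on later calls
            self.decode_calls += 1
            self._weights = self._decode()
        return self._weights

    def forward(self, x):
        # Dot product against the (cached) decoded weight vector.
        return sum(w * xi for w, xi in zip(self.weights(), x))
```

This is why the fused runtime loads faster in the benchmark below: only layers that actually run pay the decode cost, and each pays it once.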
### 3) Dynamic INT8 runtime (`runtime_int8.py`)

- Uses PyTorch dynamic quantization (`nn.Linear` -> qint8) as a runtime speed baseline.
- Supports full-model or selective-module quantization (`--include-name-contains mlp.`).
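Conceptually, dynamic INT8 stores each linear layer's weights as int8 with a per-tensor scale and dequantizes during the matmul. The sketch below shows the symmetric per-tensor scheme in plain Python; PyTorch's `quantize_dynamic` does this (plus on-the-fly activation quantization) inside optimized kernels.

```python
# Conceptual sketch of symmetric per-tensor INT8 weight quantization.
def quantize_int8(weights):
    """Pick a scale so the max-abs weight maps to 127, then round."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def int8_dot(q_weights, scale, x):
    """Dot product that dequantizes weights on the fly (floats for x)."""
    return sum(qw * scale * xi for qw, xi in zip(q_weights, x))
```

With 8 bits per weight instead of 3, this baseline trades compression for much cheaper decoding, which shows up as the fastest decode throughput in the table below.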
## Reproducible Commands

### A) Quantize (RotorQuant package)

Example: MLP-only, 3-bit, block size 64:
```bash
python quantize_qwen.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
  --bits 3 \
  --block-size 64 \
  --dtype float32 \
  --include-name-contains mlp.
```
### B) Inference from quantized package

```bash
python run_inference.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```
### C) Quality validation

```bash
python validate_quantization.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --max-new-tokens 48 \
  --dtype float32
```
### D) Build dynamic INT8 runtime models

Full dynamic INT8:

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json
```

Selective dynamic INT8 (MLP-only):

```bash
python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
  --include-name-contains mlp.
```
### E) Run Rotor fused runtime

```bash
python runtime_rotor_fused.py run \
  --pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64
```
### F) Benchmark all quantized artifacts in `artifacts/`

```bash
python benchmark_scenarios.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --artifacts-dir artifacts \
  --max-new-tokens 64 \
  --dtype float32 \
  --out artifacts/benchmark_results.json
```
### G) Unified runtime benchmark (FP32 vs. Rotor pkg vs. Rotor fused vs. INT8)

```bash
python benchmark_runtime_vs_rotor.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
  --fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --max-new-tokens 64 \
  --out artifacts/runtime_benchmark_with_fused.json
```
## Reported Metrics

### Quality metric example (RotorQuant package)

For `qwen2.5-0.5b-rotorq3-mlp-only-b64` (~3.50 bits/weight on quantized tensors):
- Mean cosine similarity (last-token logits): 0.868771
- Mean greedy token-match ratio: 0.0781

(From a `validate_quantization.py` run on 4 prompts.)
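For reference, the two metrics are straightforward to define. The helper names below are illustrative, not the actual functions in `validate_quantization.py`, which computes them over real model logits and generated token ids.

```python
# Sketch of the two quality metrics (pure Python).
import math

def cosine_similarity(a, b):
    """Cosine between baseline and quantized last-token logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def token_match_ratio(baseline_ids, quantized_ids):
    """Fraction of greedy-decoded positions where both models agree."""
    n = min(len(baseline_ids), len(quantized_ids))
    matches = sum(1 for i in range(n) if baseline_ids[i] == quantized_ids[i])
    return matches / n if n else 0.0
```

Note the gap between the two numbers above: logits can stay directionally similar (cosine ~0.87) while greedy decoding diverges early, because a single flipped argmax changes every subsequent token.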
### Runtime benchmark summary

From `artifacts/runtime_benchmark_with_fused.json` (CPU, `max_new_tokens=64`, 4 prompts):
| Scenario | Load (s) | First Token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token Match vs FP32 |
|---|---|---|---|---|---|---|
| FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 |
| RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 |
| RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 |
| Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 |