
RotorQuant Weight + Runtime Quantization for Qwen2.5-0.5B-Instruct

This repository implements and benchmarks multiple quantization/deployment paths for Qwen/Qwen2.5-0.5B-Instruct, including:

  1. RotorQuant-style 3-bit weight quantization (custom codec)
  2. Quantized model loading + text generation with proper Qwen chat template
  3. Validation against FP32 baseline
  4. Runtime acceleration paths:
     • RotorQuant fused runtime (packed-weight linear)
     • Dynamic INT8 runtime baseline (for speed comparison)

The goal is to evaluate low-bit compression and practical inference/runtime tradeoffs in a reproducible way.

Model Reference

  • Base model: Qwen/Qwen2.5-0.5B-Instruct

Repository Layout

  • rotorquant.py: Original RotorQuant core classes (RotorQuantMSE, RotorQuantProd, RotorQuantKVCache) using Clifford algebra + Lloyd-Max.
  • clifford.py: Cl(3,0) multivector algebra and rotor operations.
  • lloyd_max.py: Lloyd-Max codebook solver (SciPy integration) and utilities.
  • rotorquant_weights.py: Custom weight quantization codec for model tensors (3-bit packing, dequantization, reports).
  • quantize_qwen.py: Quantize a Hugging Face model checkpoint into custom package format.
  • run_inference.py: Load quantized package, reconstruct model, run generation.
  • validate_quantization.py: Baseline vs quantized quality checks (logit cosine + token match).
  • benchmark_scenarios.py: Benchmark baseline and all quantized artifacts in artifacts/.
  • runtime_int8.py: Build/load dynamic INT8 runtime models (full or selective modules).
  • runtime_rotor_fused.py: Fused RotorQuant runtime path using packed-weight linear module.
  • benchmark_runtime_vs_rotor.py: Unified benchmark: FP32 vs RotorQuant package vs Rotor fused vs INT8 runtime.
  • artifacts/*.json: Saved reports/metrics from experiments.

Environment Setup

python3 -m venv .venv
source .venv/bin/activate
pip install --index-url https://download.pytorch.org/whl/cpu torch
pip install transformers accelerate safetensors huggingface_hub datasets scipy psutil

Or:

pip install -r requirements.txt

How Quantization Is Done

1) RotorQuant-style 3-bit weight codec (rotorquant_weights.py)

For selected float tensors (typically Linear.weight):

  1. Optional rotor transform over triples of values (deterministic per tensor)
  2. Block-wise normalization:
     • mean-center each block
     • scale by max-abs per block
  3. 3-bit scalar quantization to an 8-level codebook
  4. Packed 3-bit index serialization
  5. Stored metadata for dequantization (scales, centers, codebook, shape info)
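The normalization, quantization, and packing steps above can be sketched in NumPy as follows. This is an illustrative round-trip, not the codec in rotorquant_weights.py: the rotor transform and Lloyd-Max codebook fitting are omitted, and the uniform 8-level codebook here is a stand-in for the fitted one.

```python
import numpy as np

# Uniform 8-level stand-in; the real codec fits this with Lloyd-Max.
CODEBOOK = np.linspace(-1.0, 1.0, 8)

def quantize_3bit(x, block_size=64):
    """Block-wise 3-bit quantization: mean-center each block, scale by
    per-block max-abs, snap to the nearest codebook level, then pack
    the 3-bit indices into bytes."""
    pad = (-x.size) % block_size
    blocks = np.pad(x.astype(np.float32), (0, pad)).reshape(-1, block_size)
    centers = blocks.mean(axis=1, keepdims=True)
    centered = blocks - centers
    scales = np.abs(centered).max(axis=1, keepdims=True)
    scales = np.where(scales == 0, 1.0, scales)
    # Nearest-codebook index per value: shape (n_blocks, block_size).
    idx = np.abs(centered[..., None] / scales[..., None] - CODEBOOK).argmin(axis=-1)
    # Serialize each index as 3 bits (MSB first) and pack into bytes.
    bits = ((idx.reshape(-1, 1) >> np.array([2, 1, 0])) & 1).astype(np.uint8)
    packed = np.packbits(bits.ravel())
    return packed, centers, scales, x.size

def dequantize_3bit(packed, centers, scales, size, block_size=64):
    """Invert quantize_3bit using the stored per-block metadata."""
    total = size + (-size) % block_size
    bits = np.unpackbits(packed)[: total * 3]
    idx = bits.reshape(-1, 3) @ np.array([4, 2, 1])
    blocks = CODEBOOK[idx].reshape(-1, block_size) * scales + centers
    return blocks.ravel()[:size]
```

At 3 bits per index the packed payload is roughly a tenth of the FP32 bytes, before counting the per-block centers and scales that must also be stored.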

Supported options include:

  • --block-size (example: 128 or 64)
  • --rowwise
  • --include-name-contains / --skip-name selection
  • --lowrank-rank residual correction
  • --outlier-frac residual outlier preservation

2) Rotor fused runtime (runtime_rotor_fused.py)

  • Replaces selected nn.Linear modules with FusedRotorLinear.
  • Reads packed 3-bit weights directly.
  • Decodes on demand and caches decoded weight for repeated use.
  • Avoids full eager dequantization at load for quantized layers.
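The decode-on-demand pattern can be sketched as a minimal module. This is not the actual FusedRotorLinear: `decode_fn` is a placeholder for the 3-bit unpack + dequantize step, and the class only illustrates lazy decoding with a cached weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LazyDecodeLinear(nn.Module):
    """Illustrative packed-weight linear: holds an opaque packed buffer
    plus a decode function, decodes on first forward, and caches the
    decoded weight for subsequent calls."""

    def __init__(self, packed, decode_fn, bias=None):
        super().__init__()
        self.register_buffer("packed", packed)
        self.decode_fn = decode_fn  # e.g. 3-bit unpack + dequantize
        self.bias = bias
        self._weight = None         # decoded-weight cache

    def forward(self, x):
        if self._weight is None:
            # Decode exactly once; later calls reuse the cached tensor.
            self._weight = self.decode_fn(self.packed)
        return F.linear(x, self._weight, self.bias)
```

Compared with eager dequantization at load, this defers decode cost to the first forward pass of each layer, which is one way to explain the lower load time reported for the fused runtime.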

3) Dynamic INT8 runtime (runtime_int8.py)

  • Uses PyTorch dynamic quantization (nn.Linear -> qint8) as a runtime speed baseline.
  • Supports full-model or selective-module quantization (--include-name-contains mlp.).
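The INT8 baseline rests on PyTorch's stock dynamic quantization API. A minimal sketch on a toy model (the Sequential here is illustrative; runtime_int8.py applies the same transform to the Hugging Face Qwen model):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Toy stand-in model with the same module type the repo targets.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Swap every nn.Linear for a dynamically quantized counterpart:
# qint8 weights, activations quantized on the fly per forward pass.
qmodel = quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = qmodel(torch.randn(3, 16))
```

Selective quantization (the --include-name-contains mlp. path) can be expressed with quantize_dynamic as well, since its qconfig_spec argument also accepts a dict keyed by submodule name rather than the {nn.Linear} type set.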

Reproducible Commands

A) Quantize (RotorQuant package)

Example: MLP-only, 3-bit, block size 64:

python quantize_qwen.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --output artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --report artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64-report.json \
  --bits 3 \
  --block-size 64 \
  --dtype float32 \
  --include-name-contains mlp.

B) Inference from quantized package

python run_inference.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64

C) Quality validation

python validate_quantization.py \
  --quantized artifacts/qwen2.5-0.5b-rotorq3-mlp-only-b64.pt \
  --max-new-tokens 48 \
  --dtype float32

D) Build dynamic INT8 runtime models

Full dynamic INT8:

python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-meta.json

Selective dynamic INT8 (MLP-only):

python runtime_int8.py build \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --out artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --meta artifacts/qwen2.5-0.5b-dynamic-int8-mlp-meta.json \
  --include-name-contains mlp.

E) Run Rotor fused runtime

python runtime_rotor_fused.py run \
  --pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --prompt "Explain quantization in one paragraph." \
  --max-new-tokens 64

F) Benchmark all quantized artifacts in artifacts/

python benchmark_scenarios.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --artifacts-dir artifacts \
  --max-new-tokens 64 \
  --dtype float32 \
  --out artifacts/benchmark_results.json

G) Unified runtime benchmark (FP32 vs Rotor pkg vs Rotor fused vs INT8)

python benchmark_runtime_vs_rotor.py \
  --model-id Qwen/Qwen2.5-0.5B-Instruct \
  --rotor-pkg artifacts/qwen2.5-0.5b-rotorq3-mlp-only.pt \
  --fused-pkg artifacts/qwen2.5-0.5b-rotorq3-rowwise-skipemb.pt \
  --int8-model artifacts/qwen2.5-0.5b-dynamic-int8-mlp.pt \
  --max-new-tokens 64 \
  --out artifacts/runtime_benchmark_with_fused.json

Reported Metrics

Quality metric example (RotorQuant package)

For qwen2.5-0.5b-rotorq3-mlp-only-b64 (~3.50 bits/weight on quantized tensors):

  • Mean cosine similarity (last-token logits): 0.868771
  • Mean greedy token-match ratio: 0.0781

(From validate_quantization.py run on 4 prompts.)
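The two quality metrics can be computed roughly as follows. This is a sketch of what validate_quantization.py is described as reporting; the function names are illustrative and the script's exact details may differ.

```python
import numpy as np

def logit_cosine(logits_fp32, logits_quant):
    """Cosine similarity between the baseline and quantized models'
    last-token logit vectors (assumed to have identical shapes)."""
    a, b = np.ravel(logits_fp32), np.ravel(logits_quant)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def token_match_ratio(base_tokens, quant_tokens):
    """Fraction of positions where greedy decoding from both models
    produced the same token id."""
    n = min(len(base_tokens), len(quant_tokens))
    return sum(x == y for x, y in zip(base_tokens[:n], quant_tokens[:n])) / n
```

Note that token match is a harsh metric: once the two greedy decodes diverge at a single position, every subsequent token tends to differ, so a low ratio can coexist with a high logit cosine.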

Runtime benchmark summary

From artifacts/runtime_benchmark_with_fused.json (CPU, max_new_tokens=64, 4 prompts):

| Scenario | Load (s) | First token (s) | Generate (s) | Decode tok/s | RSS after load (GB) | Token match vs FP32 |
|---|---|---|---|---|---|---|
| FP32 baseline | 6.495 | 0.159 | 6.547 | 9.78 | 2.28 | 1.0000 |
| RotorQuant package (dequantized load) | 6.533 | 0.152 | 6.707 | 9.55 | 2.71 | 0.0820 |
| RotorQuant fused runtime | 3.895 | 0.162 | 6.190 | 10.34 | 2.89 | 0.0039 |
| Dynamic INT8 runtime (MLP-only) | 4.646 | 0.097 | 4.059 | 15.77 | 2.65 | 0.0156 |