# Whisper Large v3 Turbo - ONNX Dynamic INT8

A dynamically quantized ONNX version of [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo).
## Quantization Details
| Property | Value |
|---|---|
| Method | Dynamic INT8 via ONNX Runtime |
| Target ops | MatMul-family graph ops selected by ONNX Runtime quantizer |
| Excluded ops | Conv |
| Calibration | None (dynamic quantization) |
| Tools | HuggingFace Optimum + ONNX Runtime |
| Compression ratio | 2.165x |
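The overall ratio is below the 4x you would get if every byte went from FP32 to INT8, since only MatMul-family weights are quantized. A quick back-of-the-envelope check of what fraction of model bytes that implies (assuming quantized tensors shrink exactly 4x and everything else is untouched; metadata overhead is ignored):

```python
# Estimate the fraction f of model bytes that were INT8-quantized, given the
# observed compression ratio. Model: size_after/size_before = (1 - f) + f/4.
ratio = 2.165  # observed compression ratio from the table above

# Solve 1/ratio = (1 - f) + f/4 for f
f = (1 - 1 / ratio) / (1 - 1 / 4)
print(f"~{f:.0%} of model bytes quantized")  # roughly 72%
```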
## Usage
```python
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

# Load the quantized ONNX model and its processor
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "nkarv/whisper-large-v3-turbo-onnx-int8"
)
processor = AutoProcessor.from_pretrained(
    "nkarv/whisper-large-v3-turbo-onnx-int8"
)

# Build a standard transformers ASR pipeline on top of the ONNX Runtime model
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

result = pipe("audio.mp3", return_timestamps=True)
print(result["text"])
```
## Why Dynamic INT8?
Dynamic quantization converts the weights of MatMul/Linear ops to INT8 offline, while activation scales are computed on the fly at inference time. This removes the need for a calibration dataset while still delivering significant compression and practical latency gains.
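The scheme can be sketched in NumPy: weights get a fixed INT8 scale once, offline, while the activation scale is derived from each incoming tensor at run time. This is a minimal per-tensor symmetric sketch; real ONNX Runtime kernels also handle zero-points and per-channel scales.

```python
import numpy as np

def quantize_sym(x):
    """Symmetric per-tensor INT8 quantization: returns int8 tensor + scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32)).astype(np.float32)  # weights: known offline
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations: arrive at inference

W_q, w_scale = quantize_sym(W)  # done once, offline
x_q, x_scale = quantize_sym(x)  # scale computed on the fly ("dynamic")

# Integer matmul in INT32 accumulators, then rescale back to float
y_int = x_q.astype(np.int32) @ W_q.astype(np.int32)
y = y_int.astype(np.float32) * (x_scale * w_scale)

err = np.abs(y - x @ W).max()  # small quantization error vs. the FP32 matmul
```

Because `x_scale` is recomputed per input, no representative calibration set is needed, at the cost of a little extra work per inference call.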
## Model tree

openai/whisper-large-v3 → (fine-tuned) openai/whisper-large-v3-turbo → (dynamic INT8 quantization) nkarv/whisper-large-v3-turbo-onnx-int8