# Whisper Large v3 Turbo - ONNX Dynamic INT8

Dynamically quantized ONNX version of `openai/whisper-large-v3-turbo`.

## Quantization Details

| Property | Value |
|---|---|
| Method | Dynamic INT8 via ONNX Runtime |
| Target ops | MatMul-family graph ops selected by the ONNX Runtime quantizer |
| Excluded ops | Conv |
| Calibration | None (dynamic quantization) |
| Tools | Hugging Face Optimum + ONNX Runtime |
| Compression ratio | 2.165x |

## Usage

```python
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "nkarv/whisper-large-v3-turbo-onnx-int8"
)
processor = AutoProcessor.from_pretrained(
    "nkarv/whisper-large-v3-turbo-onnx-int8"
)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

result = pipe("audio.mp3", return_timestamps=True)
print(result["text"])
```

## Why Dynamic INT8?

Dynamic quantization converts linear-layer (MatMul) weights to INT8 once, offline, while activation scales are computed on the fly from each live tensor at inference time. This avoids the need for a calibration dataset while still achieving significant compression and practical latency gains.
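The mechanism above can be illustrated with a minimal pure-Python sketch of symmetric INT8 quantization for a single dot product. This is an illustrative toy, not ONNX Runtime's actual kernel; all names and values here are hypothetical:

```python
# Toy sketch of dynamic INT8 quantization for one linear-layer dot product.
# Illustrative only; ONNX Runtime's real quantizer differs in detail.

def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using a single symmetric scale."""
    qmax = 2 ** (num_bits - 1) - 1                     # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero input
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

# Offline step: weights are quantized once and stored as INT8.
weights = [0.5, -1.2, 0.8, 2.0]
w_q, w_scale = quantize_symmetric(weights)

# Inference step: the activation scale is derived from the live tensor,
# so no calibration dataset is needed.
activations = [0.1, 0.4, -0.3, 0.9]
a_q, a_scale = quantize_symmetric(activations)

# Integer dot product, then dequantize back to float.
acc = sum(a * w for a, w in zip(a_q, w_q))
approx = acc * a_scale * w_scale
exact = sum(a * w for a, w in zip(activations, weights))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The key point is that only the weight quantization happens ahead of time; the activation scale is recomputed per input, which trades a little runtime work for calibration-free deployment.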
