# Whisper Large v3 Turbo - ONNX Dynamic INT8

Dynamically quantized ONNX version of `openai/whisper-large-v3-turbo`.

## Quantization Details

| Property | Value |
|---|---|
| Method | Dynamic INT8 via ONNX Runtime |
| Target ops | MatMul-family graph ops selected by the ONNX Runtime quantizer |
| Excluded ops | Conv |
| Calibration | None (dynamic quantization) |
| Tools | Hugging Face Optimum + ONNX Runtime |
| Compression ratio | 2.165x |

## Usage

```python
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

model = ORTModelForSpeechSeq2Seq.from_pretrained(
    "nkarv/whisper-large-v3-turbo-onnx-int8"
)
processor = AutoProcessor.from_pretrained(
    "nkarv/whisper-large-v3-turbo-onnx-int8"
)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)

result = pipe("audio.mp3", return_timestamps=True)
print(result["text"])
```

## Why Dynamic INT8?

Dynamic quantization converts linear-layer (MatMul) weights to INT8 once, offline, while activation scales are computed on the fly from each live tensor at inference time. This avoids the need for a calibration dataset while still achieving significant compression and practical latency gains.
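The mechanism above can be illustrated with a minimal pure-Python sketch of symmetric INT8 quantization for a single dot product. This is an illustrative toy, not ONNX Runtime's actual kernel; all names and values here are hypothetical:

```python
# Toy sketch of dynamic INT8 quantization for one linear-layer dot product.
# Illustrative only; ONNX Runtime's real quantizer differs in detail.

def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using a single symmetric scale."""
    qmax = 2 ** (num_bits - 1) - 1                     # 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0  # guard all-zero input
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

# Offline step: weights are quantized once and stored as INT8.
weights = [0.5, -1.2, 0.8, 2.0]
w_q, w_scale = quantize_symmetric(weights)

# Inference step: the activation scale is derived from the live tensor,
# so no calibration dataset is needed.
activations = [0.1, 0.4, -0.3, 0.9]
a_q, a_scale = quantize_symmetric(activations)

# Integer dot product, then dequantize back to float.
acc = sum(a * w for a, w in zip(a_q, w_q))
approx = acc * a_scale * w_scale
exact = sum(a * w for a, w in zip(activations, weights))
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The key point is that only the weight quantization happens ahead of time; the activation scale is recomputed per input, which trades a little runtime work for calibration-free deployment.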
