Capacit-ai/saga

Capacit-ai/saga is a state-of-the-art Danish automatic speech recognition model based on Qwen/Qwen3-ASR-1.7B.

The model is optimized for fast inference: unlike competing models, it uses aggressive input downsampling and variable chunk sizing. This enables Saga to achieve state-of-the-art performance while being significantly more efficient.

The model was trained on an NVIDIA B200 using the CoRal dataset family, courtesy of the Danish Innovation Fund and the Alexandra Institute.

This repository is intended for Danish transcription only. The underlying Qwen3-ASR base model is multilingual, but this finetuned checkpoint is Danish-focused and has unlearned most of its multilingual capabilities.

Model Summary

  • Base model: Qwen/Qwen3-ASR-1.7B
  • Task: automatic speech recognition
  • Primary language: Danish
  • Input audio: 16 kHz mono waveform
  • Model size: ~2B parameters (BF16 safetensors)
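Since the model expects a 16 kHz mono waveform, arbitrary input audio must be downmixed and resampled first. The sketch below illustrates the idea in pure Python (downmix by channel averaging, then linear-interpolation resampling); `to_mono_16k` is a hypothetical helper for illustration only, and real pipelines should use the bundled `processor.load_audio` or a library such as librosa or torchaudio.

```python
def to_mono_16k(samples, channels, orig_sr, target_sr=16_000):
    """Illustrative preprocessing sketch: downmix interleaved
    multi-channel audio to mono, then linearly resample to target_sr.
    Not production quality -- no anti-aliasing filter is applied."""
    # Downmix: average the channels of each interleaved frame.
    mono = [
        sum(samples[i:i + channels]) / channels
        for i in range(0, len(samples), channels)
    ]
    if orig_sr == target_sr:
        return mono
    # Linear-interpolation resample to the target rate.
    ratio = orig_sr / target_sr
    n_out = int(len(mono) * target_sr / orig_sr)
    out = []
    for j in range(n_out):
        pos = j * ratio
        i = int(pos)
        frac = pos - i
        nxt = mono[i + 1] if i + 1 < len(mono) else mono[-1]
        out.append(mono[i] * (1 - frac) + frac * nxt)
    return out
```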

Quickstart

Install the packages:

pip install -U transformers soundfile torch qwen-asr

Then load the model with transformers:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "capacit-ai/saga"
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16 if DEVICE.startswith("cuda") else torch.float32

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=DTYPE,
)
model.to(DEVICE)
model.eval()

audio = processor.load_audio("audio.wav")
text = model.transcribe(audio, processor)
print(text)

Long-Form Audio

The base Qwen3-ASR architecture supports long inputs, but the most stable long-form decoding in this project came from accumulated-audio continuation decoding rather than a single naive generate call. The model.transcribe() method already implements this strategy: it walks through the audio in step_seconds chunks, re-feeding the accumulated waveform together with previously decoded text so the model keeps prior context. The step_seconds, rollback_tokens, and max_new_tokens parameters can be tuned for your use case.
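The control flow of this strategy can be sketched as follows. Here `decode_step` is a hypothetical stand-in for one conditioned generate call; the real implementation lives inside model.transcribe() and differs in detail.

```python
def continuation_transcribe(audio, sr, decode_step,
                            step_seconds=15.0, rollback_tokens=8):
    """Sketch of accumulated-audio continuation decoding.

    decode_step(accumulated_audio, prefix_tokens) is a hypothetical
    stand-in for one generate call conditioned on all audio seen so
    far plus the previously decoded token prefix.
    """
    step = int(step_seconds * sr)
    tokens = []
    for end in range(step, len(audio) + step, step):
        chunk_end = min(end, len(audio))
        # Re-feed ALL audio up to this point, not just the new chunk,
        # so the model keeps the full acoustic context.
        accumulated = audio[:chunk_end]
        # Roll back a few tokens so the model can re-commit the tail
        # of the previous step now that it has more audio context.
        prefix = tokens[:-rollback_tokens] if rollback_tokens else tokens
        tokens = prefix + decode_step(accumulated, prefix)
    return tokens
```

Larger step_seconds means fewer (but longer) generate calls; rollback_tokens trades a little redundant decoding for smoother chunk boundaries.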

The processor.load_audio and model.transcribe methods accept the following parameters:

# Load and resample any audio file to a mono float32 waveform
audio = processor.load_audio(
    path="audio.wav",
    target_sr=16_000,          # target sample rate (default: 16 000)
)

# Transcribe with accumulated-audio continuation decoding
text = model.transcribe(
    audio,
    processor,
    language="Danish",         # language tag in the prompt (default: "Danish")
    target_sr=16_000,          # must match load_audio target_sr (default: 16 000)
    step_seconds=15.0,         # seconds of new audio per continuation step (default: 15.0)
    rollback_tokens=8,         # token rollback for prefix overlap (default: 8)
    max_new_tokens=2048,       # generation budget per step (default: 2048)
)

🚀 Fast inference with vLLM 🚀

Install the vLLM backend and (optionally) FlashAttention:

pip install -U "qwen-asr[vllm]"
MAX_JOBS=4 pip install -U flash-attn --no-build-isolation

Then run:
import librosa
from qwen_asr import Qwen3ASRModel

def transcribe_single_file(audio_path, model_id="capacit-ai/saga"):
    # Loading the model is expensive; reuse the instance when
    # transcribing multiple files instead of rebuilding it per call.
    model = Qwen3ASRModel.LLM(model=model_id, gpu_memory_utilization=0.92)
    audio, _ = librosa.load(audio_path, sr=16_000)
    output = model.transcribe(audio=[(audio, 16_000)], language=["Danish"])
    return output[0].text

if __name__ == "__main__":
    print(transcribe_single_file("audio.wav"))

Evaluation

All of the finetuned models have been trained on CoRal data, as it is the most comprehensive, high-quality open-source Danish ASR dataset family, so we evaluated them on CoRal. All Qwen-based models were evaluated with one shared script, and all Whisper-based models with another.
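The CER and WER figures below are standard Levenshtein-distance metrics. A minimal sketch of how they are computed (not the project's actual evaluation script) looks like this:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over two sequences (words or characters),
    using a rolling single-row dynamic-programming table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edits / reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```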

Upcoming: More unseen datasets and performance metrics on the way!

Dataset                  Model                                Samples  CER    WER
CoRal read_aloud (test)  capacit-ai/saga                      8000     6.7%   15.6%
CoRal read_aloud (test)  Qwen/Qwen3-ASR-1.7B                  8000     15.0%  33.6%
CoRal read_aloud (test)  pluttodk/milo-asr                    8000     7.6%   16.8%
CoRal read_aloud (test)  openai/whisper-large-v3              8000     10.3%  25.2%
CoRal read_aloud (test)  CoRal-project/roest-v3-whisper-1.5b  8000     4.7%   11.6%
CoRal read_aloud (test)  syvai/hviske-v3-conversation         8000     7.7%   18.2%


Model                    RTFx
capacit-ai/saga          470
Qwen/Qwen3-ASR-1.7B      585
openai/whisper-large-v3  50

  • RTFx figures are measured with vLLM and FlashAttention enabled for the Qwen backends. We successfully ran pluttodk/milo-asr with a vLLM backend and saw no significant drop in WER or CER.
  • All evaluation metrics were produced on a single RTX 5090 instance.
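RTFx here follows the common inverted real-time-factor convention: seconds of audio transcribed per second of wall-clock compute, so higher is faster. A quick sketch of the calculation:

```python
def rtfx(audio_seconds, wall_seconds):
    """Inverted real-time factor: seconds of audio processed per
    second of wall-clock time. An RTFx of 470 means the model
    transcribes 470 s of audio in 1 s of compute."""
    return audio_seconds / wall_seconds

# Example: in practice you would wrap the transcribe call with
# time.perf_counter() and divide the audio duration by the elapsed time.
```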

Acknowledgements

Credit to the talented Qwen team for building efficient, accurate models and open-sourcing them.

And credit to the Danish Innovation Fund, the Alexandra Institute, and their partners for the CoRal datasets.

Creator

This model was finetuned, and this model card authored, by Andreas Eefsen, Capacit A/S, Copenhagen.
