# Qwen3-ASR LoRA

This repository contains a fine-tuned LoRA adapter and a complete set of scripts to train, evaluate, and deploy Qwen3-ASR, so that anyone with a good dataset can fine-tune the model for any language.

By leveraging Parameter-Efficient Fine-Tuning (PEFT) via LoRA, this project adapts the 1.7B-parameter Qwen3-ASR model to produce highly accurate transcriptions for Bengali audio while keeping VRAM usage low.
## Key Features
- Efficient LoRA Training: Optimized for an A100 (80GB) using Flash Attention 2, bfloat16, TF32, and deep integration with PyTorch 2.9.
- Robust Pipeline: End-to-end scripts for data preparation, training, merging, and evaluation.
- Web UI & Inference: Includes a customized Gradio app (`web_asr.py`) for quick streaming and offline inference testing.
## Repository Structure

### Data & Checkpoints

- `data_local/` → Where your processed JSONL data (train, validation, test) and 16 kHz WAVs live.
- `models/` → Original base model weights (`Qwen3-ASR-1.7B` or `0.6B`).
- `checkpoints_bangla_lora/` → The output directory for LoRA adapters during training.

### Pipeline Scripts

- `prep_data.py` → Handles dataset downloading, preprocessing, resampling to 16 kHz (mono), and strictly normalizing Bengali Unicode strings via `bnunicodenormalizer`. It generates the `train.jsonl` and `validation.jsonl` files.
- `train_qwen3_lora.py` → The core training script. It patches the internal Qwen3 thinker forward pass, automatically manages LoRA wrapping, uses custom data collators (preventing audio duplication in cache), and avoids OOM issues with smart batching/cleanup callbacks.
- `merge_lora.py` → Fuses the trained LoRA adapter weights back into the base `Qwen3-ASR-1.7B` model, creating a standalone, fully capable inference model that doesn't rely on dynamically loading adapters.
- `evaluate_lora.py` → Computes metrics (Word Error Rate / Character Error Rate) against your test dataset to measure how well the fine-tuning performed.
- `test_inference.py` → A lightweight script for running quick, pure-Python terminal transcriptions to test adapters on the fly.
- `web_asr.py` → A responsive web application that loads your model and exposes an interactive GUI for microphone recordings and file uploads.
## Usage & Quick Start

### 1. Environment Setup

We recommend using uv. Important dependencies are listed in `pyproject.toml`.

```shell
uv sync  # Install dependencies (Transformers, PEFT, Flash Attention)
```
### 2. Prepare Data

Put your raw data in place and run the prep script. This ensures audio is strictly 16 kHz and text is normalized.

```shell
uv run prep_data.py
```
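For reference, here is a minimal sketch of the JSONL manifest format that a prep step like this typically emits: one JSON object per utterance pairing a 16 kHz WAV path with its normalized transcript. The `audio`/`text` key names here are an assumption for illustration; check `prep_data.py` for the exact schema it writes.

```python
import json
import os
import tempfile

# Hypothetical manifest rows (key names "audio"/"text" are an assumption,
# not confirmed by the repo): each row pairs a 16 kHz mono WAV path with
# its normalized Bengali transcript.
samples = [
    {"audio": "data_local/wavs/utt_0001.wav", "text": "বাংলা বাক্য"},
    {"audio": "data_local/wavs/utt_0002.wav", "text": "আরেকটি বাক্য"},
]

def write_manifest(path, rows):
    """Write one JSON object per line (JSONL), keeping Bengali text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

def read_manifest(path):
    """Parse a JSONL manifest back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

manifest_path = os.path.join(tempfile.mkdtemp(), "train.jsonl")
write_manifest(manifest_path, samples)
print(len(read_manifest(manifest_path)))  # prints 2
```

`ensure_ascii=False` matters here: without it, `json.dumps` would escape every Bengali character into `\uXXXX` sequences, making the manifests hard to inspect by eye.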
### 3. Training

If you want to train on an A100 setup, simply run:

```shell
uv run train_qwen3_lora.py
```

Note: Configs like `BATCH_SIZE`, `LORA_RANK` (16), and `LR` (2e-4) can be adjusted right at the top of the file depending on your VRAM limits.
### 4. Evaluation & Merging

Once training completes:

```shell
# Evaluate the adapter's performance (WER/CER)
uv run evaluate_lora.py

# Merge the adapter permanently if you are satisfied
uv run merge_lora.py
```
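For intuition, WER and CER are both normalized edit distances, computed over words and characters respectively. A minimal pure-Python sketch of the metrics (`evaluate_lora.py` may well use a dedicated library such as `jiwer` instead):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (free if equal)
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)

def cer(ref, hyp):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

print(round(wer("the cat sat", "the cat sit"), 3))  # prints 0.333
```

Note that CER is usually much lower than WER for Bengali (as in the table below): a single wrong character makes the whole word count as an error for WER, but only one character for CER.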
### 5. Web UI

Instantly spin up the browser-based transcription interface:

```shell
uv run web_asr.py
```
### 6. Python Inference Example

Prerequisite: the `qwen_asr` package already ships inside this repo, so it does not need to be added to `pyproject.toml`.

You can then run the following code to load the model:
```python
from qwen_asr.inference.qwen3_asr import Qwen3ASRModel

# Load the merged model (or base model)
wrapper = Qwen3ASRModel.from_pretrained(
    "./models/Qwen3-ASR-1.7B-Bengali-Merged",
    max_inference_batch_size=32,
)

# Run transcription in Bengali
predictions = wrapper.transcribe(
    audio=["./test_audio.wav"],
    language="Bengali",
)
print(predictions[0].text)
```
## Dataset Details
The models in this repository were trained and evaluated on highly diverse, multi-dialect Bengali audio:
- Data Filtering (Transparency): To ensure optimal batching, memory efficiency, and learning stability, any audio clips longer than 15 seconds were strictly discarded during data preparation.
- Training Set: 61,000 cleaned and normalized audio samples (all <15 s). The model was trained via LoRA for exactly 1 epoch.
- Evaluation Set: A strictly held-out subset of 2,000 randomly selected Bengali audio clips. These were used exclusively in `evaluate_lora.py` to calculate the final Word Error Rate (WER) and Character Error Rate (CER), ensuring no training-data leakage.
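The 15-second cutoff described above can be checked cheaply from WAV headers alone, without decoding any audio. A sketch of such a filter, assuming 16 kHz mono WAV inputs as produced by `prep_data.py` (the actual filtering logic in the repo may differ):

```python
import wave

def duration_seconds(path):
    """Length of a WAV file in seconds, read from the header only."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

def keep_clip(path, max_seconds=15.0):
    """Mirror the dataset filter: keep clips of at most 15 seconds."""
    return duration_seconds(path) <= max_seconds
```

Because `getnframes()` and `getframerate()` come straight from the RIFF header, filtering tens of thousands of clips this way takes seconds rather than minutes.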
## Evaluation Benchmarks
This fine-tuned adapter (1.7B) was evaluated on the held-out test set of 2,000 randomized Bengali audio samples against the original Base model.
| Model | WER (%) | CER (%) | RTFx | Note |
|---|---|---|---|---|
| Qwen3-Bengali-LoRA (Ours) | 20.70% | 7.61% | 15.60x | Highly accurate transcriber across local dialects. |
| Base Qwen3-ASR (1.7B) | 72.25% | 41.79% | 28.44x | Failed to transcribe properly. Often hallucinated Hindi instead of Bengali. |
| Whisper Large-v3 | 73.11% | 29.31% | 4.41x | Evaluated using --language bn flag. Failed significantly on multi-dialect audio. |
## Results Analysis & Methodology

### Key Findings
- Superior Accuracy: Our fine-tuned Qwen3-Bengali-LoRA achieved a 20.70% WER, representing a ~71.3% relative error reduction compared to the base model.
- Dialect Resilience: While Whisper Large-v3 struggled significantly (73.11% WER) with the diverse and often noisy multi-dialect audio in this dataset, the LoRA adapter successfully mapped these variations to clean Bengali text.
- Base Model Hallucinations: The original Qwen3-ASR Base model tended to hallucinate Hindi or English characters when encountering complex Bengali phonemes, a behavior almost entirely eliminated after just 1 epoch of training.
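The relative error reduction quoted above follows directly from the two WER figures in the benchmark table:

```python
base_wer, lora_wer = 72.25, 20.70

# Relative reduction = (base - ours) / base
relative_reduction = (base_wer - lora_wer) / base_wer
print(f"{relative_reduction:.1%}")  # prints 71.3%
```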
### Evaluation Methodology (Fairness First)
To ensure a rigorous and unbiased comparison, we avoided "zero-shot" or "auto-detect" modes which often penalize models for routing errors. Instead:
- Qwen3 Models: Both Base and LoRA were evaluated with the explicit `language="Bengali"` parameter in the `transcribe` call.
- Whisper Large-v3: Serving as the primary industry baseline, it was forced to transcribe in Bengali using the `language="bn"` generation flag.
- Consistency: All 2,000 samples were identical across all three test runs (seed 42), spanning diverse audio lengths (up to 15 s) and multiple local dialects.
## Citation

```bibtex
@article{Qwen3ASR,
  title={Qwen3-ASR: A Unified Speech Recognition and Forced Alignment Model},
  author={Qwen Team},
  year={2026}
}
```