Qwen3-ASR LoRA

This repository contains a fine-tuned LoRA adapter and a complete set of scripts to train, evaluate, and deploy Qwen3-ASR, so that anyone can fine-tune the model for any language given a good dataset.

By leveraging Parameter-Efficient Fine-Tuning (PEFT) via LoRA, this project adapts the 1.7B-parameter Qwen3-ASR model to produce highly accurate Bengali transcriptions while keeping VRAM usage low.

πŸš€ Key Features

  • Efficient LoRA Training: Optimized for an A100 (80GB) using Flash Attention 2, bfloat16, TF32, and deep integration with PyTorch 2.9.
  • Robust Pipeline: End-to-end scripts for data preparation, training, merging, and evaluation.
  • Web UI & Inference: Includes a customized Gradio app (web_asr.py) for quick streaming and offline inference testing.

πŸ“ Repository Structure

Data & Checkpoints

  • data_local/ β€” Where your processed JSONL data (train, validation, test) and 16kHz WAVs live.
  • models/ β€” Original base model weights (Qwen3-ASR-1.7B or 0.6B).
  • checkpoints_bangla_lora/ β€” The output directory for LoRA adapters during training.

Pipeline Scripts

  1. prep_data.py β€” Handles dataset downloading, preprocessing, resampling to 16 kHz mono, and strict normalization of Bengali Unicode strings via bnunicodenormalizer. It generates the train.jsonl and validation.jsonl files (a data-prep sketch follows this list).
  2. train_qwen3_lora.py β€” The core training script. It patches the internal Qwen3 thinker forward pass, automatically manages LoRA wrapping, uses custom data collators (preventing audio duplication in the cache), and avoids OOM issues with smart batching and cleanup callbacks.
  3. merge_lora.py β€” Fuses the trained LoRA adapter weights back into the base Qwen3-ASR-1.7B model, creating a standalone inference model that does not rely on dynamically loaded adapters.
  4. evaluate_lora.py β€” Computes metrics (Word Error Rate / Character Error Rate) against your test dataset to measure how well the fine-tuning performed.
  5. test_inference.py β€” A lightweight script for running quick, pure-Python terminal transcriptions to test adapters on the fly.
  6. web_asr.py β€” A responsive web application that loads your model and exposes an interactive GUI for microphone recordings and file uploads.
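
The snippet below is a minimal sketch of what prep_data.py does per example, assuming a simple JSONL schema with "audio" and "text" fields (the field names and paths here are assumptions; the authoritative schema lives in prep_data.py):

import json

import librosa
import soundfile as sf
from bnunicodenormalizer import Normalizer

bnorm = Normalizer()

def normalize_bengali(text: str) -> str:
    # bnunicodenormalizer operates word by word; words it cannot
    # normalize come back as None and are dropped here
    normalized = (bnorm(word)["normalized"] for word in text.split())
    return " ".join(word for word in normalized if word)

def prepare_example(src_wav: str, dst_wav: str, transcript: str) -> dict:
    # Resample to 16 kHz mono, matching what the training pipeline expects
    audio, sr = librosa.load(src_wav, sr=16000, mono=True)
    sf.write(dst_wav, audio, sr)
    return {"audio": dst_wav, "text": normalize_bengali(transcript)}

with open("data_local/train.jsonl", "w", encoding="utf-8") as f:
    record = prepare_example("raw/clip_0001.wav", "data_local/clip_0001.wav", "আমার সোনার বাংলা")
    f.write(json.dumps(record, ensure_ascii=False) + "\n")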

πŸ› οΈ Usage & Quick Start

1. Environment Setup

We recommend using uv. Important dependencies are listed in pyproject.toml.

uv sync  # Install dependencies (Transformers, PEFT, Flash Attention)

2. Prepare Data

Put your raw data into place and run the prep script. This ensures audio is strictly 16kHz and text is normalized.

uv run prep_data.py

3. Training

If you want to train on an A100 setup, simply run:

uv run train_qwen3_lora.py

Note: Configs such as BATCH_SIZE, LORA_RANK (16), and LR (2e-4) can be adjusted right at the top of the file depending on your VRAM limits; a configuration sketch follows.
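
For reference, the LoRA wrapping with these values might look like the following PEFT sketch (lora_alpha, lora_dropout, and target_modules are assumptions; the authoritative values live at the top of train_qwen3_lora.py):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                  # LORA_RANK
    lora_alpha=32,         # assumption: commonly set to 2x the rank
    lora_dropout=0.05,     # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
# model = get_peft_model(base_model, lora_config)  # wraps the frozen base model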

4. Evaluation & Merging

Once training completes:

# Evaluate the adapter's performance (WER/CER)
uv run evaluate_lora.py

# Merge the adapter permanently if you are satisfied
uv run merge_lora.py
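
Under the hood, merging with PEFT typically reduces to merge_and_unload(); a minimal sketch, assuming the model loads via AutoModel (the actual class used by merge_lora.py may differ):

from peft import PeftModel
from transformers import AutoModel

base = AutoModel.from_pretrained("./models/Qwen3-ASR-1.7B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "./checkpoints_bangla_lora")
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("./models/Qwen3-ASR-1.7B-Bengali-Merged")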

5. Web UI

Instantly spin up the browser-based transcription interface:

uv run web_asr.py
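
A stripped-down equivalent of web_asr.py could be as small as the Gradio sketch below (the wiring is an assumption; the actual app adds streaming support and UI polish):

import gradio as gr
from qwen_asr.inference.qwen3_asr import Qwen3ASRModel

wrapper = Qwen3ASRModel.from_pretrained("./models/Qwen3-ASR-1.7B-Bengali-Merged")

def transcribe(audio_path: str) -> str:
    predictions = wrapper.transcribe(audio=[audio_path], language="Bengali")
    return predictions[0].text

# One Audio component handles both microphone recordings and file uploads
demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
)
demo.launch()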

6. Python Inference Example

Prerequisite: the qwen_asr package must already be present in this repository, so there is no need to add it to pyproject.toml.

You can then load and run the model as follows:

from qwen_asr.inference.qwen3_asr import Qwen3ASRModel

# Load the merged model (or base model)
wrapper = Qwen3ASRModel.from_pretrained(
    "./models/Qwen3-ASR-1.7B-Bengali-Merged", 
    max_inference_batch_size=32
)

# Run Transcription in Bengali
predictions = wrapper.transcribe(
    audio=["./test_audio.wav"],
    language="Bengali"
)
print(predictions[0].text)

πŸ“š Dataset Details

The models in this repository were trained and evaluated on highly diverse, multi-dialect Bengali audio:

  • Data Filtering (Transparency): To ensure optimal batching, memory efficiency, and learning stability, any audio clip longer than 15 seconds was strictly discarded during data preparation (a filter sketch follows this list).
  • Training Set: 61,000 cleaned and normalized audio samples (all <15s). The model was trained via LoRA for exactly 1 epoch.
  • Evaluation Set: A strictly held-out subset of 2,000 randomly selected Bengali audio clips, used exclusively in evaluate_lora.py to compute the final Word Error Rate (WER) and Character Error Rate (CER), ensuring no training-data leakage.
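
The 15-second cutoff can be checked cheaply from file headers; a minimal sketch using soundfile (the exact filtering code in prep_data.py may differ):

import soundfile as sf

MAX_SECONDS = 15.0

def keep_clip(path: str) -> bool:
    # sf.info reads only the header, so scanning a large corpus stays fast
    return sf.info(path).duration < MAX_SECONDS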

πŸ“Š Evaluation Benchmarks

This fine-tuned adapter (1.7B) was evaluated on the held-out test set of 2,000 randomized Bengali audio samples, alongside the original base model and Whisper Large-v3.

| Model | WER (%) | CER (%) | RTFx | Notes |
|---|---|---|---|---|
| Qwen3-Bengali-LoRA (ours) | 20.70 | 7.61 | 15.60x | Highly accurate across local dialects. |
| Base Qwen3-ASR (1.7B) | 72.25 | 41.79 | 28.44x | Failed to transcribe properly; often hallucinated Hindi instead of Bengali. |
| Whisper Large-v3 | 73.11 | 29.31 | 4.41x | Evaluated with the language="bn" flag; failed significantly on multi-dialect audio. |
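
WER and CER are standard edit-distance metrics; assuming evaluate_lora.py uses a library such as jiwer, the computation reduces to:

import jiwer

references = ["আমার সোনার বাংলা"]  # ground-truth transcripts
hypotheses = ["আমার সোনার বাংলা"]  # model outputs
print(f"WER: {jiwer.wer(references, hypotheses):.2%}")
print(f"CER: {jiwer.cer(references, hypotheses):.2%}")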

πŸ“ˆ Results Analysis & Methodology

Key Findings

  • Superior Accuracy: Our fine-tuned Qwen3-Bengali-LoRA achieved a 20.70% WER, a relative error reduction of roughly 71% compared to the base model ((72.25 - 20.70) / 72.25 β‰ˆ 71.3%).
  • Dialect Resilience: While Whisper Large-v3 struggled significantly (73.11% WER) with the diverse and often noisy multi-dialect audio in this dataset, the LoRA adapter successfully mapped these variations to clean Bengali text.
  • Base Model Hallucinations: The original Qwen3-ASR base model tended to hallucinate Hindi or English characters when encountering complex Bengali phonemes, a behavior almost entirely eliminated after just 1 epoch of training.

Evaluation Methodology (Fairness First)

To ensure a rigorous and unbiased comparison, we avoided "zero-shot" and "auto-detect" modes, which often penalize models for language-routing errors. Instead:

  • Qwen3 Models: Both the base and LoRA models were evaluated with the explicit language="Bengali" parameter in the transcribe call.
  • Whisper Large-v3: Serving as the primary industry baseline, it was forced to transcribe in Bengali via the language="bn" generation flag (a sketch follows this list).
  • Consistency: All 2,000 samples were identical across all three test runs (seed 42), spanning diverse audio lengths (up to 15s) and multiple local dialects.
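
For reproducibility, forcing Whisper Large-v3 into Bengali looks like this with the transformers ASR pipeline (a sketch; the evaluation may have used a different Whisper frontend):

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
result = asr(
    "./test_audio.wav",
    generate_kwargs={"language": "bn", "task": "transcribe"},
)
print(result["text"])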

Citation

@article{Qwen3ASR,
  title={Qwen3-ASR: A Unified Speech Recognition and Forced Alignment Model},
  author={Qwen Team},
  year={2026}
}