# Whisper Large-v3 Egyptian Arabic v4 (LoRA Fine-tune)
A LoRA fine-tuned version of `openai/whisper-large-v3` specialized for Egyptian Arabic (EGY) dialect speech recognition, trained on the `arabic-egy-cleaned` dataset (~72 hours of Egyptian Arabic speech).
## Model Details
| Property | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT) |
| Language | Egyptian Arabic |
| Task | Automatic Speech Recognition |
| Training dataset | MAdel121/arabic-egy-cleaned |
| Training samples | ~82,900 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA target modules | q_proj, v_proj, k_proj, out_proj |
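For reference, a PEFT adapter configuration matching the table above would look roughly like the sketch below. The rank, alpha, and target modules come from the table; `lora_dropout` and `bias` are assumptions, since the exact training config is not published.

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the table above.
# lora_dropout and bias are assumptions; r, lora_alpha, and
# target_modules are taken directly from the table.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    lora_dropout=0.05,
    bias="none",
)
```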
## Evaluation Results
| Metric | Score |
|---|---|
| WER (arabic-egy-cleaned test split) | 36.92% |
Evaluation was performed with Arabic text normalization applied (alef variants normalized; see the Text Normalization section below).
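For reproduction, a minimal sketch of the WER computation with the 🤗 `evaluate` library, assuming `predictions` and `references` hold transcripts from the test split that have both been passed through the same normalization:

```python
import evaluate

wer_metric = evaluate.load("wer")

# predictions / references: lists of transcript strings, both normalized
# as described under "Text Normalization" below.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```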
## Usage
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

BASE_MODEL_ID = "openai/whisper-large-v3"
LORA_MODEL_ID = "maryamas222/whisper-large-v3-egyptian-lora-v4"

processor = WhisperProcessor.from_pretrained(
    BASE_MODEL_ID, language="Arabic", task="transcribe"
)
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Load the LoRA adapter and merge it into the base weights for faster inference
model = PeftModel.from_pretrained(base_model, LORA_MODEL_ID)
model = model.merge_and_unload()
model.eval()

# Load audio at Whisper's expected 16 kHz sampling rate
audio_array, _ = librosa.load("audio.wav", sr=16000)

# Transcribe audio
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features
inputs = inputs.to(model.device, dtype=torch.float16)

with torch.no_grad():
    predicted_ids = model.generate(
        inputs,
        language="arabic",
        task="transcribe",
        max_new_tokens=225,
    )

transcription = processor.tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
print(transcription)
```
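Note that `merge_and_unload()` folds the adapter weights into the base model so inference runs at the speed of a plain Whisper model; skip that call if you prefer to keep the adapter separate (for example, to toggle it on and off at runtime).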
## Training Details
- Learning rate: 6e-5
- Batch size: 64
- Epochs: 5 (with early stopping, patience=3)
- Best checkpoint: step 2500
- Warmup steps: 200
- Precision: bfloat16
- Hardware: NVIDIA A100 80GB
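These hyperparameters map onto 🤗 `Seq2SeqTrainingArguments` roughly as follows. This is a sketch: the output directory, the per-device/accumulation split of the batch size, and the eval/save cadence are assumptions not documented in the card.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-egy-lora-v4",  # assumption
    learning_rate=6e-5,
    per_device_train_batch_size=16,   # assumption: 16 x 4 accumulation = 64
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    warmup_steps=200,
    bf16=True,
    eval_strategy="steps",            # cadence below is an assumption
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="wer",
    greater_is_better=False,
)
```

Early stopping with patience 3 would then be added via `transformers.EarlyStoppingCallback(early_stopping_patience=3)` when constructing the trainer.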
## Data Filtering
Training samples were filtered to:
- Audio duration between 2 and 25 seconds
- Transcript length between 6 and 400 characters
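A sketch of this filter with 🤗 Datasets; the `audio` and `sentence` column names are assumptions about the dataset schema:

```python
def keep_sample(example):
    # Bounds from the list above: 2-25 s of audio, 6-400 transcript characters.
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return 2.0 <= duration <= 25.0 and 6 <= len(example["sentence"]) <= 400

# dataset = dataset.filter(keep_sample)
```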
## Text Normalization
Alef variants (Ψ₯, Ψ£, Ψ’) were normalized to bare alef (Ψ§) for consistency between training labels and evaluation.
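A minimal sketch of this normalization (the exact function used during training is not published):

```python
import re

def normalize_alef(text: str) -> str:
    # Collapse hamza-above, hamza-below, and madda alef variants to bare alef.
    return re.sub("[Ψ₯Ψ£Ψ’]", "Ψ§", text)
```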
## Limitations
- Egyptian Arabic only: optimized specifically for the Egyptian dialect; performance on Modern Standard Arabic (MSA) or other dialects will be significantly lower.
- Training was interrupted: the run was affected by GPU preemptions and timeouts, which may have impacted the final WER.
- Not suitable for MSA: the base Whisper model handles MSA better without this adapter.
- Male speaker bias: roughly 85% of the training data comes from male speakers.
## Intended Use
This model is intended for academic research and experimentation with Egyptian Arabic ASR. It is not recommended for production use without further evaluation on domain-specific data.
## Related Models
- `maryamas222/whisper-large-v3-egyptian-lora`: the v3 model, trained on MGB-3 (31% WER; smaller dataset but a cleaner training run)
## Citation
```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```