Whisper Large-v3 – Egyptian Arabic v4 (LoRA Fine-tune)

A LoRA fine-tuned version of openai/whisper-large-v3 specialized for Egyptian Arabic (EGY) dialect speech recognition, trained on the arabic-egy-cleaned dataset (~72 hours of Egyptian Arabic speech).

Model Details

  • Base model: openai/whisper-large-v3
  • Fine-tuning method: LoRA (PEFT)
  • Language: Egyptian Arabic
  • Task: Automatic Speech Recognition
  • Training dataset: MAdel121/arabic-egy-cleaned
  • Training samples: ~82,900
  • LoRA rank: 32
  • LoRA alpha: 64
  • LoRA target modules: q_proj, v_proj, k_proj, out_proj

Evaluation Results

  • WER (arabic-egy-cleaned test split): 36.92%

Evaluation was performed with Arabic text normalization applied (alef variants normalized).
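
A minimal sketch of how the normalized WER could be computed. The exact evaluation script is not published, so the use of jiwer and the normalization regex are assumptions consistent with the Text Normalization section below:

import re
import jiwer

ALEF_VARIANTS = re.compile("[إأآ]")

def normalize(text: str) -> str:
    # Map alef variants (إ, أ, آ) to bare alef (ا), per Text Normalization below
    return ALEF_VARIANTS.sub("ا", text)

def normalized_wer(references, predictions):
    # Word error rate over normalized references and hypotheses
    return jiwer.wer(
        [normalize(r) for r in references],
        [normalize(p) for p in predictions],
    )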

Usage

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

BASE_MODEL_ID = "openai/whisper-large-v3"
LORA_MODEL_ID = "maryamas222/whisper-large-v3-egyptian-lora-v4"

processor = WhisperProcessor.from_pretrained(BASE_MODEL_ID, language="Arabic", task="transcribe")

base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
# Attach the LoRA adapter and merge it into the base weights for faster inference
model = PeftModel.from_pretrained(base_model, LORA_MODEL_ID)
model = model.merge_and_unload()
model.eval()

# Transcribe audio
# audio_array: 1-D float waveform sampled at 16 kHz (see the loading example below)
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features
inputs = inputs.to(model.device, torch.float16)

with torch.no_grad():
    predicted_ids = model.generate(
        inputs,
        language="arabic",
        task="transcribe",
        max_new_tokens=225,
    )

transcription = processor.tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
print(transcription)
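
The snippet above assumes audio_array already holds a waveform. One way to produce it from a file, using librosa (the loading library is an assumption; any 16 kHz mono float array works):

import librosa

# Decode the file and resample to the 16 kHz mono input Whisper expects
audio_array, _ = librosa.load("speech.wav", sr=16000, mono=True)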

Training Details

  • Learning rate: 6e-5
  • Batch size: 64
  • Epochs: 5 (with early stopping, patience=3)
  • Best checkpoint: step 2500
  • Warmup steps: 200
  • Precision: bfloat16
  • Hardware: NVIDIA A100 80GB
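
A minimal sketch of a PEFT LoRA setup matching the hyperparameters listed above; the dropout and bias values are assumptions not stated in this card:

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

lora_config = LoraConfig(
    r=32,               # LoRA rank (see Model Details)
    lora_alpha=64,      # LoRA alpha (see Model Details)
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    lora_dropout=0.05,  # assumption: not stated in this card
    bias="none",        # assumption: not stated in this card
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()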

Data Filtering

Training samples were filtered to:

  • Audio duration between 2–25 seconds
  • Transcript length between 6–400 characters
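
A sketch of how such filters could be applied with the datasets library; the column names ("audio", "text") are assumptions about the dataset schema:

from datasets import load_dataset

ds = load_dataset("MAdel121/arabic-egy-cleaned", split="train")

def keep(example):
    # Keep clips of 2-25 seconds with transcripts of 6-400 characters
    duration = len(example["audio"]["array"]) / example["audio"]["sampling_rate"]
    return 2.0 <= duration <= 25.0 and 6 <= len(example["text"]) <= 400

ds = ds.filter(keep)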

Text Normalization

Alef variants (إ, أ, آ) are normalized to bare alef (ا) for consistency between training labels and evaluation.

Limitations

  • Egyptian Arabic only: optimized specifically for the Egyptian dialect; performance on Modern Standard Arabic (MSA) or other dialects will be significantly lower
  • Training was interrupted: the run was affected by GPU preemptions and timeouts, which may have impacted the final WER
  • Not suitable for MSA: the base Whisper model handles MSA better without this adapter
  • Male speaker bias: ~85% of the training data is from male speakers

Intended Use

This model is intended for academic research and experimentation with Egyptian Arabic ASR. It is not recommended for production use without further evaluation on domain-specific data.

Citation

@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv}
}