# Whisper Large-v3 Egyptian Arabic v4 (LoRA Fine-tune)
A LoRA fine-tuned version of `openai/whisper-large-v3` specialized for Egyptian Arabic (EGY) dialect speech recognition, trained on the `arabic-egy-cleaned` dataset (~72 hours of Egyptian Arabic speech).
## Model Details
| Property | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT) |
| Language | Egyptian Arabic |
| Task | Automatic Speech Recognition |
| Training dataset | MAdel121/arabic-egy-cleaned |
| Training samples | ~82,900 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA target modules | q_proj, v_proj, k_proj, out_proj |
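For reference, a PEFT adapter configuration matching the table above would look roughly like the sketch below. The rank, alpha, and target modules come from the table; `lora_dropout` and `bias` are assumptions, since the exact training config is not published.

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the table above.
# lora_dropout and bias are assumptions; r, lora_alpha, and
# target_modules are taken directly from the table.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj", "k_proj", "out_proj"],
    lora_dropout=0.05,
    bias="none",
)
```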
## Evaluation Results
| Metric | Score |
|---|---|
| WER (arabic-egy-cleaned test split) | 36.92% |
Evaluation was performed with Arabic text normalization applied (alef variants normalized; see the Text Normalization section below).
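For reproduction, a minimal sketch of the WER computation with the 🤗 `evaluate` library, assuming `predictions` and `references` hold transcripts from the test split that have both been passed through the same normalization:

```python
import evaluate

wer_metric = evaluate.load("wer")

# predictions / references: lists of transcript strings, both normalized
# as described under "Text Normalization" below.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f}%")
```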
## Usage
```python
import torch
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel

BASE_MODEL_ID = "openai/whisper-large-v3"
LORA_MODEL_ID = "maryamas222/whisper-large-v3-egyptian-lora-v4"

processor = WhisperProcessor.from_pretrained(
    BASE_MODEL_ID, language="Arabic", task="transcribe"
)
base_model = WhisperForConditionalGeneration.from_pretrained(
    BASE_MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Load the LoRA adapter and merge it into the base weights for faster inference
model = PeftModel.from_pretrained(base_model, LORA_MODEL_ID)
model = model.merge_and_unload()
model.eval()

# Load audio at Whisper's expected 16 kHz sampling rate
audio_array, _ = librosa.load("audio.wav", sr=16000)

# Transcribe audio
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt").input_features
inputs = inputs.to(model.device, dtype=torch.float16)

with torch.no_grad():
    predicted_ids = model.generate(
        inputs,
        language="arabic",
        task="transcribe",
        max_new_tokens=225,
    )

transcription = processor.tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
print(transcription)
```
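Note that `merge_and_unload()` folds the adapter weights into the base model so inference runs at the speed of a plain Whisper model; skip that call if you prefer to keep the adapter separate (for example, to toggle it on and off at runtime).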
## Training Details
- Learning rate: 6e-5
- Batch size: 64
- Epochs: 5 (with early stopping, patience=3)
- Best checkpoint: step 2500
- Warmup steps: 200
- Precision: bfloat16
- Hardware: NVIDIA A100 80GB
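These hyperparameters map onto 🤗 `Seq2SeqTrainingArguments` roughly as follows. This is a sketch: the output directory, the per-device/accumulation split of the batch size, and the eval/save cadence are assumptions not documented in the card.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-egy-lora-v4",  # assumption
    learning_rate=6e-5,
    per_device_train_batch_size=16,   # assumption: 16 x 4 accumulation = 64
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    warmup_steps=200,
    bf16=True,
    eval_strategy="steps",            # cadence below is an assumption
    save_strategy="steps",
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="wer",
    greater_is_better=False,
)
```

Early stopping with patience 3 would then be added via `transformers.EarlyStoppingCallback(early_stopping_patience=3)` when constructing the trainer.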
## Data Filtering
Training samples were filtered to:
- Audio duration between 2 and 25 seconds
- Transcript length between 6 and 400 characters
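A sketch of this filter with 🤗 Datasets; the `audio` and `sentence` column names are assumptions about the dataset schema:

```python
def keep_sample(example):
    # Bounds from the list above: 2-25 s of audio, 6-400 transcript characters.
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return 2.0 <= duration <= 25.0 and 6 <= len(example["sentence"]) <= 400

# dataset = dataset.filter(keep_sample)
```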
## Text Normalization
Alef variants (Ψ₯, Ψ£, Ψ’) were normalized to bare alef (Ψ§) for consistency between training labels and evaluation.
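A minimal sketch of this normalization (the exact function used during training is not published):

```python
import re

def normalize_alef(text: str) -> str:
    # Collapse hamza-above, hamza-below, and madda alef variants to bare alef.
    return re.sub("[Ψ₯Ψ£Ψ’]", "Ψ§", text)
```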
## Limitations
- Egyptian Arabic only: optimized specifically for the Egyptian dialect; performance on Modern Standard Arabic (MSA) or other dialects will be significantly lower.
- Training was interrupted: the run was affected by GPU preemptions and timeouts, which may have impacted the final WER.
- Not suitable for MSA: the base Whisper model handles MSA better without this adapter.
- Male speaker bias: roughly 85% of the training data comes from male speakers.
## Intended Use
This model is intended for academic research and experimentation with Egyptian Arabic ASR. It is not recommended for production use without further evaluation on domain-specific data.
## Related Models
- `maryamas222/whisper-large-v3-egyptian-lora`: the v3 model, trained on MGB-3 (31% WER; smaller dataset but a cleaner training run)
## Citation
```bibtex
@misc{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  year={2022},
  eprint={2212.04356},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}
```