Model Card for MMS TTS Urdu

Model Details

Model Description

This is an Urdu text-to-speech (TTS) model fine-tuned from Facebook's MMS (Massively Multilingual Speech) VITS checkpoint for Urdu written in Arabic script.

The model converts Urdu text into natural-sounding speech using a non-autoregressive VITS-based architecture.

  • Developed by: hamza-amin
  • Model type: Text-to-Speech (TTS)
  • Language(s): Urdu (ur)
  • Finetuned from model: facebook/mms-tts-urd-script_arabic
  • Framework: 🤗 Transformers

Uses

Direct Use

  • Generating Urdu speech from Unicode Urdu text
  • Offline or server-side Urdu TTS inference
  • Research and educational purposes

Out-of-Scope Use

  • Roman Urdu input
  • Languages other than Urdu
  • Voice cloning or speaker identification
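Since Roman Urdu input is out of scope, callers may want to reject text that is not written in Arabic script before synthesis. A rough heuristic is sketched below; it is not part of the model, and the single Unicode range check ignores presentation forms and extended Arabic blocks:

```python
def looks_like_arabic_script(text: str) -> bool:
    """Rough check: every alphabetic character falls in the Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all("\u0600" <= ch <= "\u06ff" for ch in letters)

print(looks_like_arabic_script("میں اردو بول سکتا ہوں"))  # True
print(looks_like_arabic_script("Roman Urdu text"))         # False
```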

Bias, Risks, and Limitations

  • Output quality depends heavily on training data quality.
  • The model reflects linguistic and pronunciation patterns present in the dataset.
  • Not evaluated for fairness across regional accents or dialects.

Recommendations

Users should perform qualitative evaluation before deployment in user-facing applications.


How to Get Started with the Model

from transformers import VitsModel, AutoTokenizer, AutoFeatureExtractor
import torch
import soundfile as sf

model_id = "hamza-amin/mms-tts-urd-fine-tuned"

# Load the tokenizer, feature extractor, and generator-only VITS model
tokenizer = AutoTokenizer.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VitsModel.from_pretrained(model_id)

# Tokenize Urdu text written in Arabic script
inputs = tokenizer("میں اردو بول سکتا ہوں۔", return_tensors="pt")

# Generate the waveform in a single non-autoregressive forward pass
with torch.no_grad():
    output = model(**inputs)

# Save the audio at the model's sampling rate
sf.write(
    "output.wav",
    output.waveform.squeeze().numpy(),
    feature_extractor.sampling_rate,
)
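soundfile accepts float waveforms directly; if a downstream sink needs 16-bit PCM instead, a small conversion helper (illustrative only, not part of the model's pipeline) might look like:

```python
import numpy as np

def to_int16_pcm(waveform: np.ndarray) -> np.ndarray:
    """Peak-normalize a float waveform and scale it to the 16-bit PCM range."""
    peak = float(np.max(np.abs(waveform)))
    if peak == 0.0:
        peak = 1.0  # avoid division by zero for silent audio
    return (waveform / peak * 32767.0).astype(np.int16)

pcm = to_int16_pcm(np.array([0.0, 0.5, -1.0]))
print(pcm.dtype)  # int16, values now within [-32767, 32767]
```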

Training Details

Training Data

  • Dataset: codewithdark/urdu-tts
  • Language: Urdu (Arabic script)

Training Procedure

  • Fine-tuned using the Hugging Face finetune-hf-vits training scripts
  • Only the generator (the inference-time component of VITS) was exported

Training Hyperparameters

  • Training regime: fp32

Evaluation

No formal quantitative evaluation was performed; quality was assessed through qualitative listening tests.


Technical Specifications

Model Architecture and Objective

  • VITS (Variational Inference with Adversarial Learning for End-to-End Text-to-Speech)
  • Objective: text-to-waveform generation
  • Size: ~36.3M parameters (F32)
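Because the model emits a raw waveform at a fixed sampling rate (16 kHz for MMS TTS checkpoints), the audio duration follows directly from the number of output samples:

```python
sampling_rate = 16_000   # Hz, standard for MMS TTS checkpoints
num_samples = 48_000     # length of a hypothetical output waveform
duration_s = num_samples / sampling_rate
print(duration_s)  # 3.0 seconds of audio
```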

Software

  • transformers
  • PyTorch

Citation

If you use this model, please also cite the MMS and VITS papers:

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

Model Card Authors

hamza-amin

Model Card Contact

Via Hugging Face profile: https://huggingface.co/hamza-amin
