Model Card for MMS TTS Urdu

Model Details

Model Description

This is an Urdu text-to-speech (TTS) model fine-tuned from Facebook's MMS (Massively Multilingual Speech) VITS checkpoint for Urdu written in Arabic script.

The model converts Urdu text into natural-sounding speech using a non-autoregressive VITS-based architecture.

  • Developed by: hamza-amin
  • Model type: Text-to-Speech (TTS)
  • Language(s): Urdu (ur)
  • Finetuned from model: facebook/mms-tts-urd-script_arabic
  • Framework: 🤗 Transformers

Uses

Direct Use

  • Generating Urdu speech from Unicode Urdu text
  • Offline or server-side Urdu TTS inference
  • Research and educational purposes

Out-of-Scope Use

  • Roman Urdu input
  • Languages other than Urdu
  • Voice cloning or speaker identification
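Since Roman Urdu input is out of scope, callers may want to reject text that is not written in Arabic script before synthesis. A rough heuristic is sketched below; it is not part of the model, and the single Unicode range check ignores presentation forms and extended Arabic blocks:

```python
def looks_like_arabic_script(text: str) -> bool:
    """Rough check: every alphabetic character falls in the Arabic block (U+0600-U+06FF)."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all("\u0600" <= ch <= "\u06ff" for ch in letters)

print(looks_like_arabic_script("میں اردو بول سکتا ہوں"))  # True
print(looks_like_arabic_script("Roman Urdu text"))         # False
```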

Bias, Risks, and Limitations

  • Output quality depends heavily on training data quality.
  • The model reflects linguistic and pronunciation patterns present in the dataset.
  • Not evaluated for fairness across regional accents or dialects.

Recommendations

Users should perform qualitative evaluation before deployment in user-facing applications.


How to Get Started with the Model

from transformers import VitsModel, AutoTokenizer, AutoFeatureExtractor
import torch
import soundfile as sf

model_id = "hamza-amin/mms-tts-urd-fine-tuned"

# Load the tokenizer, feature extractor, and generator-only VITS model
tokenizer = AutoTokenizer.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VitsModel.from_pretrained(model_id)

# Tokenize Urdu text written in Arabic script
inputs = tokenizer("میں اردو بول سکتا ہوں۔", return_tensors="pt")

# Generate the waveform in a single non-autoregressive forward pass
with torch.no_grad():
    output = model(**inputs)

# Save the audio at the model's sampling rate
sf.write(
    "output.wav",
    output.waveform.squeeze().numpy(),
    feature_extractor.sampling_rate,
)
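soundfile accepts float waveforms directly; if a downstream sink needs 16-bit PCM instead, a small conversion helper (illustrative only, not part of the model's pipeline) might look like:

```python
import numpy as np

def to_int16_pcm(waveform: np.ndarray) -> np.ndarray:
    """Peak-normalize a float waveform and scale it to the 16-bit PCM range."""
    peak = float(np.max(np.abs(waveform)))
    if peak == 0.0:
        peak = 1.0  # avoid division by zero for silent audio
    return (waveform / peak * 32767.0).astype(np.int16)

pcm = to_int16_pcm(np.array([0.0, 0.5, -1.0]))
print(pcm.dtype)  # int16, values now within [-32767, 32767]
```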

Training Details

Training Data

  • Dataset: codewithdark/urdu-tts
  • Language: Urdu (Arabic script)

Training Procedure

  • Fine-tuned using the Hugging Face finetune-hf-vits training scripts
  • Only the generator (the inference-time component of VITS) was exported

Training Hyperparameters

  • Training regime: fp32

Evaluation

No formal quantitative evaluation was performed; quality was assessed through qualitative listening tests.


Technical Specifications

Model Architecture and Objective

  • VITS (Variational Inference with Adversarial Learning for End-to-End Text-to-Speech)
  • Objective: text-to-waveform generation
  • Size: ~36.3M parameters (F32)
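Because the model emits a raw waveform at a fixed sampling rate (16 kHz for MMS TTS checkpoints), the audio duration follows directly from the number of output samples:

```python
sampling_rate = 16_000   # Hz, standard for MMS TTS checkpoints
num_samples = 48_000     # length of a hypothetical output waveform
duration_s = num_samples / sampling_rate
print(duration_s)  # 3.0 seconds of audio
```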

Software

  • transformers
  • PyTorch

Citation

If you use this model, please also cite the MMS and VITS papers:

@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap, Vineel and others},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}

Model Card Authors

hamza-amin

Model Card Contact

Via Hugging Face profile: https://huggingface.co/hamza-amin
