This is an Urdu text-to-speech (TTS) model fine-tuned from the Facebook MMS (Massively Multilingual Speech) VITS checkpoint for Urdu written in Arabic script.
The model converts Urdu text into natural-sounding speech using a non-autoregressive VITS-based architecture.
Users should perform qualitative evaluation before deployment in user-facing applications.
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoTokenizer, VitsModel

model_id = "hamza-amin/mms-tts-urd-fine-tuned"

# Load the tokenizer, feature extractor, and fine-tuned VITS model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VitsModel.from_pretrained(model_id)

# Tokenize the Urdu input ("I can speak Urdu.").
inputs = tokenizer("میں اردو بول سکتا ہوں۔", return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    output = model(**inputs)

# Save the mono waveform at the model's sampling rate.
sf.write(
    "output.wav",
    output.waveform.squeeze().numpy(),
    feature_extractor.sampling_rate,
)
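Before listening to a generated file, a few automatic checks can catch obvious failures such as empty output, non-finite samples, or clipping. The following is a minimal sketch, not part of the original model card: the helper name `waveform_report` is hypothetical, and the 16 kHz rate and sine tone are stand-in assumptions so the example runs without downloading the model.

```python
import numpy as np


def waveform_report(waveform, sampling_rate):
    """Summarize a mono waveform of float samples (nominally in [-1, 1])."""
    samples = np.asarray(waveform, dtype=np.float64).reshape(-1)
    if samples.size == 0:
        raise ValueError("waveform is empty")

    finite_mask = np.isfinite(samples)
    finite_samples = samples[finite_mask]
    if finite_samples.size:
        peak = float(np.max(np.abs(finite_samples)))
        clipped_fraction = float(np.mean(np.abs(finite_samples) >= 1.0))
    else:
        peak = float("nan")
        clipped_fraction = 0.0

    return {
        "duration_s": samples.size / float(sampling_rate),
        "peak": peak,
        "all_finite": bool(finite_mask.all()),
        "clipped_fraction": clipped_fraction,
    }


# Stand-in signal: one second of a 220 Hz tone at an assumed 16 kHz rate.
rate = 16000
t = np.arange(rate) / rate
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
report = waveform_report(tone, rate)
print(report)
```

With the real model, the same function could be called on `output.waveform.squeeze().numpy()` together with the sampling rate used when writing `output.wav`. These checks do not replace listening; they only flag outputs that are clearly broken.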
Fine-tuning framework: finetune-hf-vits
No formal quantitative evaluation was performed. Evaluation was based on qualitative listening tests.
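Since evaluation here relies on listening, anyone repeating it may find a simple rating sheet useful for keeping judgments comparable across listeners. The sketch below is one possible layout and does not describe how the original listening tests were run: the function names, column names, 1 to 5 scales, listener labels, and the second test sentence are all illustrative assumptions.

```python
import csv
import io

# Hypothetical column layout for a listening-test sheet.
FIELDNAMES = [
    "wav_file",
    "text",
    "listener",
    "naturalness_1_to_5",
    "intelligibility_1_to_5",
    "notes",
]


def build_rating_sheet(sentences, listeners, prefix="sample"):
    """Create one blank rating row per (sentence, listener) pair."""
    rows = []
    for index, text in enumerate(sentences, start=1):
        wav_name = f"{prefix}_{index:03d}.wav"
        for listener in listeners:
            rows.append({
                "wav_file": wav_name,
                "text": text,
                "listener": listener,
                "naturalness_1_to_5": "",
                "intelligibility_1_to_5": "",
                "notes": "",
            })
    return rows


def to_csv(rows):
    """Serialize the rating rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative inputs: two short Urdu sentences and two listener labels.
sentences = ["میں اردو بول سکتا ہوں۔", "آپ کا شکریہ۔"]
listeners = ["listener_a", "listener_b"]
sheet = build_rating_sheet(sentences, listeners)
print(to_csv(sheet))
```

Each `wav_file` name is meant to match a file synthesized from the corresponding sentence, so listeners can fill in scores while playing the audio.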
If you use this model, please also cite the MMS and VITS papers:
@article{pratap2023mms,
  title={Scaling Speech Technology to 1,000+ Languages},
  author={Pratap et al.},
  journal={arXiv preprint arXiv:2305.13516},
  year={2023}
}
hamza-amin
Via Hugging Face profile: https://huggingface.co/hamza-amin
Base model
facebook/mms-tts-urd-script_arabic