speecht5_finetuned_sindhi_digits
A fine-tuned version of microsoft/speecht5_tts for Sindhi spoken digit synthesis (0–9).
- Eval Loss: 0.5491
- Dataset: DanishMahdi/sindhi-spoken-digits-tts
- Speakers: Multiple speakers conditioned via x-vectors
- Audio: 16 kHz mono
How to Use
import torch
import numpy as np
import soundfile as sf
from IPython.display import Audio as IPythonAudio, display
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor
# ── Load model ──────────────────────────────────────────────────────────────
model_id = "DanishMahdi/speecht5_finetuned_sindhi_digits"
processor = SpeechT5Processor.from_pretrained(model_id)
model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
# ── Speaker embedding ───────────────────────────────────────────────────────
# speaker_mean_embeddings: dict of {speaker_id: 512-dim numpy array}
# obtained by averaging x-vectors across all utterances of each speaker
target_speaker = list(speaker_mean_embeddings.keys())[0]
speaker_emb = torch.tensor(
np.array(speaker_mean_embeddings[target_speaker], dtype=np.float32)
).unsqueeze(0) # shape: (1, 512)
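The `speaker_mean_embeddings` dict above is assumed to exist already. As a minimal sketch of how it could be built (assuming you have extracted per-utterance 512-dim x-vectors, e.g. with the speechbrain/spkrec-xvect-voxceleb encoder, and grouped them by speaker):

```python
import numpy as np

def mean_speaker_embeddings(utterance_xvectors):
    """Average per-utterance x-vectors into one embedding per speaker.

    utterance_xvectors: dict of {speaker_id: list of 512-dim numpy arrays}.
    Returns a dict of {speaker_id: 512-dim float32 numpy array}.
    """
    return {
        spk: np.mean(np.stack(vecs), axis=0).astype(np.float32)
        for spk, vecs in utterance_xvectors.items()
    }
```

The function names and the grouping of x-vectors by speaker are illustrative assumptions; only the averaging itself is what the card describes.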
# ── Normalize Sindhi script ─────────────────────────────────────────────────
text = normalize_text("ھڪ")  # normalize_text is not defined in this card; supply your own
print(f"Input: 'ھڪ' → normalized: '{text}'")
# ── Generate speech ─────────────────────────────────────────────────────────
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_emb, vocoder=vocoder)
display(IPythonAudio(speech.numpy(), rate=16000))
sf.write("output.wav", speech.numpy(), samplerate=16000)
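The card calls a `normalize_text` helper without defining it. A minimal, hypothetical stand-in (assumption: Unicode NFC normalization plus whitespace cleanup, which is a common baseline for Arabic-script text; the model's actual training-time normalization may differ):

```python
import unicodedata

def normalize_text(text: str) -> str:
    # Canonically compose characters (Arabic-script letters are sometimes
    # stored decomposed) and collapse runs of whitespace to single spaces.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())
```

For best results, match whatever normalization was applied to the dataset transcripts during fine-tuning.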
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.6538 | 1.0929 | 200 | 0.8015 |
| 1.4826 | 2.1858 | 400 | 0.7021 |
| 1.3317 | 3.2787 | 600 | 0.6665 |
| 1.2469 | 4.3716 | 800 | 0.6149 |
| 1.2163 | 5.4645 | 1000 | 0.6035 |
| 1.1473 | 6.5574 | 1200 | 0.5750 |
| 1.1213 | 7.6503 | 1400 | 0.5863 |
| 1.1055 | 8.7432 | 1600 | 0.5584 |
| 1.0704 | 9.8361 | 1800 | 0.5591 |
| 1.0603 | 10.9290 | 2000 | 0.5491 |
Limitations
- Digits 0–9 only; does not generalize to arbitrary Sindhi text
- Best quality for speakers seen during training
Citation
@inproceedings{ao-etal-2022-speecht5,
title = {SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
author = {Ao, Junyi and Wang, Rui and Zhou, Long and others},
booktitle = {Proceedings of ACL 2022},
year = {2022}
}
@misc{mahdi-2026-sindhi-digits-tts,
title = {SpeechT5 Fine-tuned for Sindhi Spoken Digit Synthesis},
author = {Danish Mahdi},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/DanishMahdi/speecht5_finetuned_sindhi_digits}
}