speecht5_finetuned_sindhi_digits
A fine-tuned version of microsoft/speecht5_tts for Sindhi spoken digit synthesis (0–9).
- Eval Loss: 0.5491
- Dataset: DanishMahdi/sindhi-spoken-digits-tts
- Speakers: Multiple speakers conditioned via x-vectors
- Audio: 16 kHz mono
How to Use
import torch
import numpy as np
import soundfile as sf
from IPython.display import Audio as IPythonAudio, display
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor
# ── Load model ──────────────────────────────────────────────────────────────
model_id = "DanishMahdi/speecht5_finetuned_sindhi_digits"
processor = SpeechT5Processor.from_pretrained(model_id)
model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
# ── Speaker embedding ───────────────────────────────────────────────────────
# speaker_mean_embeddings: dict of {speaker_id: 512-dim numpy array}
# obtained by averaging x-vectors across all utterances of each speaker
target_speaker = list(speaker_mean_embeddings.keys())[0]
speaker_emb = torch.tensor(
np.array(speaker_mean_embeddings[target_speaker], dtype=np.float32)
).unsqueeze(0) # shape: (1, 512)
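The `speaker_mean_embeddings` dict above is assumed to exist already. As a minimal sketch of how it could be built (assuming you have extracted per-utterance 512-dim x-vectors, e.g. with the speechbrain/spkrec-xvect-voxceleb encoder, and grouped them by speaker):

```python
import numpy as np

def mean_speaker_embeddings(utterance_xvectors):
    """Average per-utterance x-vectors into one embedding per speaker.

    utterance_xvectors: dict of {speaker_id: list of 512-dim numpy arrays}.
    Returns a dict of {speaker_id: 512-dim float32 numpy array}.
    """
    return {
        spk: np.mean(np.stack(vecs), axis=0).astype(np.float32)
        for spk, vecs in utterance_xvectors.items()
    }
```

The function names and the grouping of x-vectors by speaker are illustrative assumptions; only the averaging itself is what the card describes.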
# ── Normalize Sindhi script ─────────────────────────────────────────────────
text = normalize_text("ھڪ")  # normalize_text is not defined in this card; supply your own
print(f"Input: 'ھڪ' → normalized: '{text}'")
# ── Generate speech ─────────────────────────────────────────────────────────
inputs = processor(text=text, return_tensors="pt")
speech = model.generate_speech(inputs["input_ids"], speaker_emb, vocoder=vocoder)
display(IPythonAudio(speech.numpy(), rate=16000))
sf.write("output.wav", speech.numpy(), samplerate=16000)
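The card calls a `normalize_text` helper without defining it. A minimal, hypothetical stand-in (assumption: Unicode NFC normalization plus whitespace cleanup, which is a common baseline for Arabic-script text; the model's actual training-time normalization may differ):

```python
import unicodedata

def normalize_text(text: str) -> str:
    # Canonically compose characters (Arabic-script letters are sometimes
    # stored decomposed) and collapse runs of whitespace to single spaces.
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())
```

For best results, match whatever normalization was applied to the dataset transcripts during fine-tuning.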
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 1.6538 | 1.0929 | 200 | 0.8015 |
| 1.4826 | 2.1858 | 400 | 0.7021 |
| 1.3317 | 3.2787 | 600 | 0.6665 |
| 1.2469 | 4.3716 | 800 | 0.6149 |
| 1.2163 | 5.4645 | 1000 | 0.6035 |
| 1.1473 | 6.5574 | 1200 | 0.5750 |
| 1.1213 | 7.6503 | 1400 | 0.5863 |
| 1.1055 | 8.7432 | 1600 | 0.5584 |
| 1.0704 | 9.8361 | 1800 | 0.5591 |
| 1.0603 | 10.9290 | 2000 | 0.5491 |
Limitations
- Digits 0–9 only; does not generalize to arbitrary Sindhi text
- Best quality for speakers seen during training
Citation
@inproceedings{ao-etal-2022-speecht5,
title = {SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
author = {Ao, Junyi and Wang, Rui and Zhou, Long and others},
booktitle = {Proceedings of ACL 2022},
year = {2022}
}
@misc{mahdi-2026-sindhi-digits-tts,
title = {SpeechT5 Fine-tuned for Sindhi Spoken Digit Synthesis},
author = {Danish Mahdi},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/DanishMahdi/speecht5_finetuned_sindhi_digits}
}