speecht5_finetuned_sindhi_digits

A fine-tuned version of microsoft/speecht5_tts for Sindhi spoken digit synthesis (0–9).

How to Use


import torch
import numpy as np
import soundfile as sf
from IPython.display import Audio as IPythonAudio, display
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

# ── Load model ──────────────────────────────────────────────────────────────
model_id  = "DanishMahdi/speecht5_finetuned_sindhi_digits"
processor = SpeechT5Processor.from_pretrained(model_id)
model     = SpeechT5ForTextToSpeech.from_pretrained(model_id)
vocoder   = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# ── Speaker embedding ───────────────────────────────────────────────────────
# speaker_mean_embeddings: dict of {speaker_id: 512-dim numpy array}
# obtained by averaging x-vectors across all utterances of each speaker
target_speaker = list(speaker_mean_embeddings.keys())[0]
speaker_emb = torch.tensor(
    np.array(speaker_mean_embeddings[target_speaker], dtype=np.float32)
).unsqueeze(0)  # shape: (1, 512)
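The `speaker_mean_embeddings` dict referenced above is obtained by averaging per-utterance x-vectors. A minimal sketch of that averaging step, assuming you already have 512-dim x-vectors per utterance (in SpeechT5 recipes these typically come from a speaker encoder such as speechbrain's spkrec-xvect-voxceleb; the random vectors below are stand-ins):

```python
import numpy as np

def mean_speaker_embeddings(utterance_xvectors):
    """Average per-utterance x-vectors into one embedding per speaker.

    utterance_xvectors: dict {speaker_id: list of 512-dim numpy arrays}
    returns:            dict {speaker_id: 512-dim float32 numpy array}
    """
    return {
        spk: np.stack(vecs).mean(axis=0).astype(np.float32)
        for spk, vecs in utterance_xvectors.items()
    }

# Hypothetical stand-in x-vectors; in practice each comes from a speaker encoder.
rng = np.random.default_rng(0)
utts = {"spk_01": [rng.standard_normal(512) for _ in range(3)]}
speaker_mean_embeddings = mean_speaker_embeddings(utts)
print(speaker_mean_embeddings["spk_01"].shape)  # (512,)
```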

# ── Normalize Sindhi script ─────────────────────────────────────────────────
# normalize_text is the normalization helper used during training; it is not
# bundled with the checkpoint, so define or import it before running this cell.
text = normalize_text("ھڪ")          # Sindhi word for digit 1
print(f"Input: 'ھڪ' → normalized: '{text}'")
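If the training-time `normalize_text` helper is not available, a hypothetical minimal stand-in (Unicode NFC normalization plus whitespace collapse — an assumption, not the card author's actual implementation) is:

```python
import unicodedata

def normalize_text(text):
    """Hypothetical stand-in normalizer: Unicode NFC + whitespace collapse."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

print(normalize_text("  ھڪ   ھڪ "))  # "ھڪ ھڪ"
```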

# ── Generate speech ─────────────────────────────────────────────────────────
inputs = processor(text=text, return_tensors="pt")
speech  = model.generate_speech(inputs["input_ids"], speaker_emb, vocoder=vocoder)

display(IPythonAudio(speech.numpy(), rate=16000))
sf.write("output.wav", speech.numpy(), samplerate=16000)
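Before writing the file, you may want to peak-normalize the waveform so it never exceeds ±1 and clips; `peak_normalize` below is a hypothetical helper, not part of the model or its recipe:

```python
import numpy as np

def peak_normalize(wav, peak=0.95):
    """Scale the waveform so its largest absolute sample equals `peak`."""
    wav = np.asarray(wav, dtype=np.float32)
    m = float(np.abs(wav).max())
    return wav if m == 0.0 else wav * (peak / m)

x = np.array([0.1, -0.5, 0.25], dtype=np.float32)
y = peak_normalize(x)
print(np.abs(y).max())  # ~0.95
```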

Training results

| Training Loss | Epoch   | Step | Validation Loss |
|:-------------:|:-------:|:----:|:---------------:|
| 1.6538        | 1.0929  | 200  | 0.8015          |
| 1.4826        | 2.1858  | 400  | 0.7021          |
| 1.3317        | 3.2787  | 600  | 0.6665          |
| 1.2469        | 4.3716  | 800  | 0.6149          |
| 1.2163        | 5.4645  | 1000 | 0.6035          |
| 1.1473        | 6.5574  | 1200 | 0.5750          |
| 1.1213        | 7.6503  | 1400 | 0.5863          |
| 1.1055        | 8.7432  | 1600 | 0.5584          |
| 1.0704        | 9.8361  | 1800 | 0.5591          |
| 1.0603        | 10.9290 | 2000 | 0.5491          |

Limitations

  • Digits 0–9 only β€” does not generalize to arbitrary Sindhi text
  • Best quality for speakers seen during training

Citation

@inproceedings{ao-etal-2022-speecht5,
    title     = {SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author    = {Ao, Junyi and Wang, Rui and Zhou, Long and others},
    booktitle = {Proceedings of ACL 2022},
    year      = {2022}
}

@misc{mahdi-2026-sindhi-digits-tts,
    title     = {SpeechT5 Fine-tuned for Sindhi Spoken Digit Synthesis},
    author    = {Danish Mahdi},
    year      = {2026},
    publisher = {Hugging Face},
    url       = {https://huggingface.co/DanishMahdi/speecht5_finetuned_sindhi_digits}
}