Code?

#1
by yukiarimo - opened

Where is the architecture, inference, and training code?

Thank you for asking.
The full model architecture and training notebooks are available in my GitHub repository.
If needed, I can share the link.

Thanks, I’ve found it.

https://github.com/Satyam0775/Lightweight-Phoneme-TTS

Gonna do some experiments. I have a few questions:

  1. At minimum, how many hours/minutes do you need for this to do at least something?
  2. Do you think your text -> mel generator -> audio is better/worse & requires less/more data than text -> AE (full audio vector, non-autoregressive) -> audio?
  3. What is the fastest neural vocoder (faster than ISTFT) that I can train with an LJSpeech-sized dataset or less (bonus: supports music, too)?

Thanks for checking out the repo and for the thoughtful questions. Happy to clarify based on my current setup and experiments.

  1. Minimum time to get “something working”
    With the current lightweight Conv1D + Griffin-Lim setup, I can usually get the system to produce intelligible (though robotic) audio within 30–60 minutes on a single GPU/CPU. Meaningful results (stable loss, multiple samples, basic evaluation) typically take 3–6 hours. This is mainly because the model is small, non-autoregressive, and does not require training a neural vocoder.

  2. Text → Mel → Audio vs Text → AE (raw audio, non-autoregressive)
    In my experience, text → phoneme → mel → audio is more data-efficient and stable than directly predicting raw audio vectors. Mel-spectrograms act as a structured, compressed intermediate representation, which reduces learning complexity and helps with convergence on small datasets.
    Direct text → raw-audio autoencoding can potentially produce richer outputs, but it usually requires significantly more data and is harder to train because of the much higher dimensionality and the need to model phase. For a lightweight, fast-iteration setup, mel-based modeling works better; the sketch after this list shows just how much more compact the mel target is.

  3. Fastest neural vocoder trainable on LJSpeech-scale data
    The most practical option is HiFi-GAN (V1/V2): it runs well above real time on GPU, can be trained on an LJSpeech-sized dataset, and is widely used in production TTS systems.
    MelGAN is even faster but lower quality, while WaveGlow / diffusion-based vocoders are heavier. For mixed speech-music support, HiFi-GAN variants trained on broader datasets perform reasonably well, though pure speech training will limit music generalization.
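
To make the compactness point in (2) concrete, here is a minimal mel-extraction sketch with torchaudio. The file name and the n_fft / hop_length / n_mels values are illustrative assumptions, not necessarily what the repo uses:

```python
# Illustrative mel-extraction setup (torchaudio); the repo's actual
# config may differ, so treat these numbers as assumptions.
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,  # LJSpeech's native rate
    n_fft=1024,
    hop_length=256,
    n_mels=80,          # a common TTS mel resolution
)

waveform, sr = torchaudio.load("LJ001-0001.wav")  # (1, num_samples)
mel = mel_transform(waveform)                     # (1, 80, num_frames)

# One second of audio is 22050 raw samples but only ~86 frames x 80
# mel bins: a far smaller, smoother target for a phoneme model to
# regress onto, which is the data-efficiency argument above.
print(waveform.shape, mel.shape)
```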

Overall, Griffin-Lim was chosen intentionally for fast validation of phoneme-to-mel learning, and swapping in a neural vocoder like HiFi-GAN is the natural next step once alignment and reconstruction are verified.
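
For reference, the Griffin-Lim step amounts to a single library call, which is what makes it convenient as a stand-in vocoder. Here is a minimal sketch using librosa's built-in inverse; the parameters are again illustrative, and `predicted_mel` stands for whatever the acoustic model outputs:

```python
# Minimal mel -> audio sketch with Griffin-Lim via librosa.
# Assumes a power mel-spectrogram of shape (n_mels, num_frames);
# n_fft / hop_length / n_iter are illustrative values.
import numpy as np
import librosa
import soundfile as sf

def mel_to_wav(mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    # librosa inverts the mel filterbank back to a linear spectrogram,
    # then runs Griffin-Lim to estimate phase: no trained vocoder
    # needed, so phoneme-to-mel learning can be checked by ear early.
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=1024, hop_length=256, n_iter=32
    )

# Usage (predicted_mel is a hypothetical model output):
# wav = mel_to_wav(predicted_mel)
# sf.write("sample_griffinlim.wav", wav, 22050)
```

Swapping in HiFi-GAN later just means replacing this one function with a generator forward pass; the rest of the pipeline stays the same.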
