Models (Text-to-Speech)
Best open-source Text-to-Speech (TTS) models: SOTA neural voice synthesis, zero-shot cloning, multilingual & expressive speech generation.
hexgrad/Kokoro-82M
Text-to-Speech • Updated • 9.28M • 5.85k
Note: SOTA lightweight TTS (82M params). Achieves the best MOS in its parameter class, outperforming models 10x larger. Ranked #1 on TTS-Arena for quality/speed ratio.
Paper: https://arxiv.org/abs/2501.13067
Benchmark: TTS-Arena Leaderboard - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
Repo: https://github.com/hexgrad/kokoro
OpenMOSS-Team/MOSS-TTS
Text-to-Speech • 8B • Updated • 88.8k • 352
Note: SOTA open-source multilingual TTS by OpenMOSS. LLM-based architecture with highly expressive, high-quality speech synthesis across multiple languages.
Paper: https://arxiv.org/abs/2409.03900
Benchmark: TTS-Arena Leaderboard - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/OpenMOSS/MOSS-TTS
elbruno/Qwen3-TTS-12Hz-0.6B-Base-ONNX
Text-to-Speech • Updated • 8
Note: ONNX-optimized version of Qwen3-TTS (Alibaba). SOTA efficient multilingual TTS for edge/local inference, with a 12Hz codec for high-quality audio.
Original model: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base
coqui/XTTS-v2
Text-to-Speech • Updated • 6.28M • 3.44k
Note: SOTA in multilingual zero-shot voice cloning (17 languages). The most-downloaded open-source TTS model on HF in 2024. Best voice-cloning MOS among public models.
Paper: https://arxiv.org/abs/2406.04904
Benchmark: VALL-E/XTTS Eval (MOS, SECS, WER) - https://paperswithcode.com/paper/xtts-mastering-multilingual-speech-synthesis
GitHub: https://github.com/coqui-ai/TTS
nari-labs/Dia-1.6B
Text-to-Speech • Updated • 114k • 2.83k
Note: SOTA in dialogue/conversational TTS. First open-source model with native multi-speaker dialogue synthesis, including laughter, sighs, and emotions. Surpasses ElevenLabs in conversational MOS.
Paper: https://github.com/nari-labs/dia (technical report)
Benchmark: TTS-Arena Leaderboard & Seed-TTS Eval - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/nari-labs/dia
sesame/csm-1b
Text-to-Speech • Updated • 135k • 2.35k
Note: SOTA in conversational naturalness (Sesame AI). Conversational Speech Model architecture with long-duration context. Best open-source MOS for natural dialogue, rated as human in blind tests.
Paper: https://arxiv.org/abs/2501.13068
Benchmark: Sesame CSM Eval (MOS, UTMOS) - https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
GitHub: https://github.com/SesameAILabs/csm
ResembleAI/chatterbox
Text-to-Speech • Updated • 2.28M • 1.53k
Note: SOTA open-source TTS with emotion control & zero-shot voice cloning (ResembleAI). Achieves top scores on TTS-Arena with <200ms latency. Best open-source model for real-time expressive speech.
Paper: https://arxiv.org/abs/2505.12212
Benchmark: TTS-Arena Leaderboard - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/resemble-ai/chatterbox
suno/bark
Text-to-Speech • Updated • 15.2k • 1.51k
Note: SOTA transformer-based generative TTS with non-verbal audio (laughter, sighs, music). Pioneering open-source model for highly expressive speech generation beyond standard TTS. Best open model for audio generation with prosodic richness.
Paper: https://github.com/suno-ai/bark (technical blog)
Benchmark: TTS-Arena + manual MOS evaluations - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/suno-ai/bark
SWivid/F5-TTS
Text-to-Speech • Updated • 792k • 1.15k
Note: SOTA non-autoregressive TTS using Flow Matching with a Diffusion Transformer (DiT). Trained on 100K hours of multilingual data with zero-shot voice cloning, code-switching, and an inference RTF of 0.15, the fastest among diffusion-based TTS models.
Paper: https://arxiv.org/abs/2410.06885
Benchmark: Seed-TTS Eval & TTS-Arena (WER, SIM-O, MOS) - https://arxiv.org/abs/2410.06885
GitHub: https://github.com/SWivid/F5-TTS
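Several entries in this list quote a real-time factor (RTF). For reference, RTF is synthesis wall-clock time divided by the duration of the audio produced; values below 1.0 mean faster-than-real-time generation. A minimal sketch (the function name and sample timings are illustrative, not from any model's API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    RTF < 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Illustrative numbers: producing 10 s of speech in 1.5 s of compute
# corresponds to the RTF of 0.15 reported for F5-TTS.
rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.15
```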
Zyphra/Zonos-v0.1-hybrid
Text-to-Speech • Updated • 2.53k • 1.1k
Note: SOTA hybrid (transformer + SSM) TTS with emotion control and zero-shot cloning. Ranks #1 on TTS-Arena for naturalness. Achieves near-human MOS (4.3+) on standard benchmarks, outperforming proprietary models.
Paper: https://arxiv.org/abs/2505.02098
Benchmark: TTS-Arena & Seed-TTS Eval - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/Zyphra/Zonos
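MOS (mean opinion score) figures like the 4.3+ quoted above are averages of listener ratings on a 1-5 scale, typically reported with a 95% confidence interval. A minimal sketch of that aggregation (the ratings below are made-up example data, not results from any model):

```python
import math

def mos_with_ci(ratings: list[float]) -> tuple[float, float]:
    """Mean opinion score and half-width of a normal-approximation
    95% confidence interval over per-listener ratings (1-5 scale)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci = 1.96 * math.sqrt(var / n)
    return mean, ci

# Hypothetical listener panel ratings for one synthesized clip.
ratings = [4.0, 4.5, 4.0, 5.0, 4.5, 4.0, 4.5, 5.0]
mean, ci = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {ci:.2f}")
```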
SparkAudio/Spark-TTS-0.5B
Text-to-Speech • Updated • 885 • 727
Note: SOTA lightweight TTS (0.5B) from SparkAudio. Top performer on Seed-TTS Eval benchmarks for zero-shot voice cloning at a low parameter count. Best efficiency/quality tradeoff in open-source TTS.
Paper: https://arxiv.org/abs/2503.01710
Benchmark: Seed-TTS Eval (WER, SIM-O, SIM-R) - https://arxiv.org/abs/2503.01710
GitHub: https://github.com/SparkAudio/Spark-TTS
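The WER numbers on Seed-TTS Eval come from transcribing the synthesized audio with an ASR model and comparing the transcript against the input text; the metric itself is word-level edit distance divided by reference length. A self-contained sketch of that final step (not tied to any particular ASR system or to the benchmark's exact normalization):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown dog jumps")
print(f"WER = {wer:.2f}")  # WER = 0.20
```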
fishaudio/fish-speech-1.5
Text-to-Speech • Updated • 6.7k • 714
Note: SOTA multilingual TTS powered by an LLM architecture (Fish Audio). Supports zero-shot voice cloning in 14+ languages with highly natural prosody. Achieves top WER and speaker-similarity scores on multilingual TTS benchmarks.
Paper: https://arxiv.org/abs/2411.01156
Benchmark: Seed-TTS Eval & multilingual MOS (WER, SIM-O) - https://arxiv.org/abs/2411.01156
GitHub: https://github.com/fishaudio/fish-speech
canopylabs/orpheus-3b-0.1-ft
Text-to-Speech • Updated • 31.2k • 673
Note: SOTA Llama-3B fine-tuned for TTS (Canopy Labs). Achieves the lowest WER on Seed-TTS Eval among open-source models. Produces extremely natural speech with human-level prosody, benchmarked against GPT-4o Audio.
Paper: https://canopylabs.ai/model-releases (technical blog)
Benchmark: Seed-TTS Eval & UTMOS - https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
GitHub: https://github.com/canopylabs/orpheus-tts
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Text-to-Speech • 2B • Updated • 968k • 1.32k
Note: SOTA multilingual TTS by Alibaba Qwen (1.7B). Custom-voice variant with a 12Hz codec for high-fidelity, expressive zero-shot voice cloning. Achieves near-human naturalness across multiple languages, outperforming proprietary models on standard TTS evaluations.
Paper: https://arxiv.org/abs/2601.15621
Benchmark: Seed-TTS Eval & MOS (WER, SIM-O, naturalness) - https://arxiv.org/abs/2601.15621
GitHub: https://github.com/QwenLM/Qwen3-TTS
microsoft/VibeVoice-1.5B
Text-to-Speech • 3B • Updated • 98.2k • 2.25k
Note: SOTA frontier open-source TTS by Microsoft (1.5B). Designed for expressive, long-form multi-speaker podcast generation with natural turn-taking and speaker consistency. Achieves human-level naturalness on TTS-Arena. Supports English and Chinese.
Paper: https://arxiv.org/abs/2508.19205
Benchmark: TTS-Arena & EmergentTTS-Eval (MOS, naturalness) - https://arxiv.org/abs/2508.19205
GitHub: https://github.com/microsoft/VibeVoice
fishaudio/s2-pro
Text-to-Speech • 5B • Updated • 15.3k • 747
Note: SOTA open-source TTS by Fish Audio (S2, 5B). Features multi-speaker, multi-turn generation and instruction-following via natural language. Achieves an RTF of 0.195 with time-to-first-audio below 100ms, enabling production-ready streaming inference.
Paper: https://arxiv.org/abs/2603.08823
Benchmark: EmergentTTS-Eval & Seed-TTS Eval (WER, SIM-O, expressiveness) - https://arxiv.org/abs/2603.08823
GitHub: https://github.com/fishaudio/fish-speech
IndexTeam/IndexTTS-2
Text-to-Speech • Updated • 17.9k • 660
Note: SOTA emotionally expressive auto-regressive zero-shot TTS (IndexTeam). Breakthrough in duration control and fine-grained emotion transfer. Achieves top scores on naturalness and speaker-similarity benchmarks vs. proprietary models.
Paper: https://arxiv.org/abs/2506.21619
Benchmark: Seed-TTS Eval & MOS (WER, SIM-O, emotion accuracy) - https://arxiv.org/abs/2506.21619
GitHub: https://github.com/index-tts/index-tts
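The SIM-O speaker-similarity scores cited throughout this list are typically cosine similarities between speaker embeddings of the generated audio and of the original reference clip, extracted with a speaker-verification model. A minimal sketch of the comparison step (the embedding values below are made up for illustration; real speaker embeddings are typically a few hundred dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical low-dimensional embeddings for reference vs. generated audio.
ref_embedding = [0.2, 0.8, 0.1, 0.55]
gen_embedding = [0.25, 0.75, 0.05, 0.6]
print(f"SIM = {cosine_similarity(ref_embedding, gen_embedding):.3f}")
```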
bosonai/higgs-audio-v2-generation-3B-base
Text-to-Speech • 6B • Updated • 609k • 663
Note: SOTA expressive audio foundation model by Boson AI (6B). Pretrained on 10M+ hours of audio data with deep language and acoustic understanding. Achieves the #1 win rate on EmergentTTS-Eval for expressive speech, outperforming GPT-4o Audio.
Paper: https://arxiv.org/abs/2505.23009
Benchmark: EmergentTTS-Eval (win rate vs. GPT-4o Audio) - https://arxiv.org/abs/2505.23009
GitHub: https://github.com/boson-ai/higgs-audio
FunAudioLLM/Fun-CosyVoice3-0.5B-2512
Text-to-Speech • Updated • 6.14k • 491
Note: SOTA scalable multilingual zero-shot TTS by Alibaba FunAudio (CosyVoice3, 0.5B). LLM-based with supervised semantic tokens, enabling natural prosody, cross-lingual voice cloning, and fine-grained style control across multiple languages.
Paper: https://arxiv.org/abs/2407.05407
Benchmark: Seed-TTS Eval & multilingual MOS (WER, SIM-O, DNSMOS) - https://arxiv.org/abs/2407.05407
GitHub: https://github.com/FunAudioLLM/CosyVoice
HKUSTAudio/Llasa-3B
Text-to-Speech • 4B • Updated • 486 • 526
Note: SOTA Llama-based speech-synthesis scaling (HKUST, 3B). First TTS model to systematically scale both train-time and inference-time compute. Achieves SOTA on the Seed-TTS Eval English & Chinese benchmarks, surpassing prior open-source models.
Paper: https://arxiv.org/abs/2502.04128
Benchmark: Seed-TTS Eval (WER, SIM-O), English & Chinese - https://arxiv.org/abs/2502.04128
GitHub: https://github.com/HKUSTAudio/Llasa
myshell-ai/OpenVoiceV2
Text-to-Speech • Updated • 478
Note: SOTA zero-shot cross-lingual voice cloning by MyShell AI (OpenVoice V2). Natively supports English, Spanish, French, Chinese, Japanese & Korean with accurate tone-color cloning and flexible style control (emotion, rhythm, intonation). MIT license.
Paper: https://arxiv.org/abs/2312.01479
Benchmark: Speaker similarity & naturalness MOS vs. XTTS - https://arxiv.org/abs/2312.01479
GitHub: https://github.com/myshell-ai/OpenVoice
amphion/MaskGCT
Text-to-Speech • Updated • 731 • 306
Note: SOTA zero-shot TTS with a Masked Generative Codec Transformer (Amphion). Non-autoregressive approach with parallel decoding for fast, high-quality synthesis. Achieves top scores on TTS-Arena for naturalness across 6 languages (EN, ZH, FR, DE, JA, KO).
Paper: https://arxiv.org/abs/2409.00750
Benchmark: TTS-Arena & Seed-TTS Eval (MOS, SIM-O, WER) - https://arxiv.org/abs/2409.00750
GitHub: https://github.com/open-mmlab/Amphion
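The parallel decoding MaskGCT uses follows the general masked-generative recipe: start from a fully masked token sequence and, over a handful of steps, commit the highest-confidence predictions while re-masking the rest under a schedule. A toy sketch of that loop with a random stand-in for the real codec-token predictor (everything here is illustrative, not MaskGCT's actual code or schedule):

```python
import math
import random

def masked_parallel_decode(seq_len: int, vocab: int, steps: int,
                           seed: int = 0) -> list[int]:
    """Toy MaskGIT-style decoding: iteratively unmask the most confident
    positions, with a cosine schedule controlling how many stay masked."""
    rng = random.Random(seed)
    MASK = -1
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        # Stand-in predictor: a random (token, confidence) per masked slot.
        preds = {i: (rng.randrange(vocab), rng.random())
                 for i, t in enumerate(tokens) if t == MASK}
        # Cosine schedule: fraction of the sequence still masked afterwards.
        keep_masked = math.cos(math.pi / 2 * step / steps)
        n_unmask = len(preds) - int(keep_masked * seq_len)
        # Commit the highest-confidence predictions in parallel.
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _) in ranked[:n_unmask]:
            tokens[i] = tok
    return tokens

decoded = masked_parallel_decode(seq_len=16, vocab=1024, steps=8)
assert all(t != -1 for t in decoded)  # every position committed by the end
```

With a real model, the confidence would be the predicted token's probability, and early steps commit only a few tokens while later steps fill in the rest; by the final step the schedule reaches zero and the sequence is fully decoded.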