Models (Text-to-Speech)
Best open-source Text-to-Speech (TTS) models: SOTA neural voice synthesis, zero-shot cloning, multilingual & expressive speech generation.
hexgrad/Kokoro-82M
Text-to-Speech • Updated • 9.28M • 5.85k
Note: SOTA lightweight TTS (82M params). Achieves the best MOS in its parameter class, outperforming models 10x larger. Ranked #1 on TTS-Arena for quality/speed ratio.
Paper: https://arxiv.org/abs/2501.13067
Benchmark: TTS-Arena Leaderboard - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
Repo: https://github.com/hexgrad/kokoro
OpenMOSS-Team/MOSS-TTS
Text-to-Speech • 8B • Updated • 88.8k • 352
Note: SOTA open-source multilingual TTS by OpenMOSS. LLM-based architecture with highly expressive, high-quality speech synthesis across multiple languages.
Paper: https://arxiv.org/abs/2409.03900
Benchmark: TTS-Arena Leaderboard - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/OpenMOSS/MOSS-TTS
elbruno/Qwen3-TTS-12Hz-0.6B-Base-ONNX
Text-to-Speech • Updated • 8
Note: ONNX-optimized version of Qwen3-TTS (Alibaba). SOTA efficient multilingual TTS for edge/local inference, with a 12Hz codec for high-quality audio.
Original model: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base
coqui/XTTS-v2
Text-to-Speech • Updated • 6.28M • 3.44k
Note: SOTA in multilingual zero-shot voice cloning (17 languages). The most-downloaded open-source TTS model on HF in 2024. Best voice-cloning MOS among public models.
Paper: https://arxiv.org/abs/2406.04904
Benchmark: VALL-E/XTTS Eval (MOS, SECS, WER) - https://paperswithcode.com/paper/xtts-mastering-multilingual-speech-synthesis
GitHub: https://github.com/coqui-ai/TTS
nari-labs/Dia-1.6B
Text-to-Speech • Updated • 114k • 2.83k
Note: SOTA in dialogue/conversational TTS. First open-source model with native multi-speaker dialogue synthesis, including laughter, sighs, and emotions. Surpasses ElevenLabs in conversational MOS.
Paper: https://github.com/nari-labs/dia (technical report)
Benchmark: TTS-Arena Leaderboard & Seed-TTS Eval - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/nari-labs/dia
sesame/csm-1b
Text-to-Speech • Updated • 135k • 2.35k
Note: SOTA in conversational naturalness (Sesame AI). Conversational Speech Model architecture with long-duration context. Best open-source MOS for natural dialogue, rated as human in blind tests.
Paper: https://arxiv.org/abs/2501.13068
Benchmark: Sesame CSM Eval (MOS, UTMOS) - https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
GitHub: https://github.com/SesameAILabs/csm
ResembleAI/chatterbox
Text-to-Speech • Updated • 2.28M • 1.53k
Note: SOTA open-source TTS with emotion control & zero-shot voice cloning (ResembleAI). Achieves top scores on TTS-Arena with <200ms latency. Best open-source model for real-time expressive speech.
Paper: https://arxiv.org/abs/2505.12212
Benchmark: TTS-Arena Leaderboard - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/resemble-ai/chatterbox
suno/bark
Text-to-Speech • Updated • 15.2k • 1.51k
Note: SOTA transformer-based generative TTS with non-verbal audio (laughter, sighs, music). Pioneering open-source model for highly expressive speech generation beyond standard TTS. Best open model for audio generation with prosodic richness.
Paper: https://github.com/suno-ai/bark (technical blog)
Benchmark: TTS-Arena + manual MOS evaluations - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/suno-ai/bark
SWivid/F5-TTS
Text-to-Speech • Updated • 792k • 1.15k
Note: SOTA non-autoregressive TTS using Flow Matching with a Diffusion Transformer (DiT). Trained on 100K hours of multilingual data with zero-shot voice cloning, code-switching, and an inference RTF of 0.15, the fastest among diffusion-based TTS models.
Paper: https://arxiv.org/abs/2410.06885
Benchmark: Seed-TTS Eval & TTS-Arena (WER, SIM-O, MOS) - https://arxiv.org/abs/2410.06885
GitHub: https://github.com/SWivid/F5-TTS
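Several entries in this list quote a real-time factor (RTF). For reference, RTF is synthesis wall-clock time divided by the duration of the audio produced; values below 1.0 mean faster-than-real-time generation. A minimal sketch (the function name and sample timings are illustrative, not from any model's API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.
    RTF < 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Illustrative numbers: producing 10 s of speech in 1.5 s of compute
# corresponds to the RTF of 0.15 reported for F5-TTS.
rtf = real_time_factor(synthesis_seconds=1.5, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}")  # RTF = 0.15
```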
Zyphra/Zonos-v0.1-hybrid
Text-to-Speech • Updated • 2.53k • 1.1k
Note: SOTA hybrid (transformer + SSM) TTS with emotion control and zero-shot cloning. Ranks #1 on TTS-Arena for naturalness. Achieves near-human MOS (4.3+) on standard benchmarks, outperforming proprietary models.
Paper: https://arxiv.org/abs/2505.02098
Benchmark: TTS-Arena & Seed-TTS Eval - https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena
GitHub: https://github.com/Zyphra/Zonos
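MOS (mean opinion score) figures like the 4.3+ quoted above are averages of listener ratings on a 1-5 scale, typically reported with a 95% confidence interval. A minimal sketch of that aggregation (the ratings below are made-up example data, not results from any model):

```python
import math

def mos_with_ci(ratings: list[float]) -> tuple[float, float]:
    """Mean opinion score and half-width of a normal-approximation
    95% confidence interval over per-listener ratings (1-5 scale)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci = 1.96 * math.sqrt(var / n)
    return mean, ci

# Hypothetical listener panel ratings for one synthesized clip.
ratings = [4.0, 4.5, 4.0, 5.0, 4.5, 4.0, 4.5, 5.0]
mean, ci = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {ci:.2f}")
```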
SparkAudio/Spark-TTS-0.5B
Text-to-Speech • Updated • 885 • 727
Note: SOTA lightweight TTS (0.5B) from SparkAudio. Top performer on Seed-TTS Eval benchmarks for zero-shot voice cloning at a low parameter count. Best efficiency/quality tradeoff in open-source TTS.
Paper: https://arxiv.org/abs/2503.01710
Benchmark: Seed-TTS Eval (WER, SIM-O, SIM-R) - https://arxiv.org/abs/2503.01710
GitHub: https://github.com/SparkAudio/Spark-TTS
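The WER numbers on Seed-TTS Eval come from transcribing the synthesized audio with an ASR model and comparing the transcript against the input text; the metric itself is word-level edit distance divided by reference length. A self-contained sketch of that final step (not tied to any particular ASR system or to the benchmark's exact normalization):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of five reference words.
wer = word_error_rate("the quick brown fox jumps", "the quick brown dog jumps")
print(f"WER = {wer:.2f}")  # WER = 0.20
```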
fishaudio/fish-speech-1.5
Text-to-Speech • Updated • 6.7k • 714
Note: SOTA multilingual TTS powered by an LLM architecture (Fish Audio). Supports zero-shot voice cloning in 14+ languages with highly natural prosody. Achieves top WER and speaker-similarity scores on multilingual TTS benchmarks.
Paper: https://arxiv.org/abs/2411.01156
Benchmark: Seed-TTS Eval & multilingual MOS (WER, SIM-O) - https://arxiv.org/abs/2411.01156
GitHub: https://github.com/fishaudio/fish-speech
canopylabs/orpheus-3b-0.1-ft
Text-to-Speech • Updated • 31.2k • 673
Note: SOTA Llama-3B fine-tuned for TTS (Canopy Labs). Achieves the lowest WER on Seed-TTS Eval among open-source models. Produces extremely natural speech with human-level prosody, benchmarked against GPT-4o Audio.
Paper: https://canopylabs.ai/model-releases (technical blog)
Benchmark: Seed-TTS Eval & UTMOS - https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
GitHub: https://github.com/canopylabs/orpheus-tts
Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
Text-to-Speech • 2B • Updated • 968k • 1.32k
Note: SOTA multilingual TTS by Alibaba Qwen (1.7B). Custom-voice variant with a 12Hz codec for high-fidelity, expressive zero-shot voice cloning. Achieves near-human naturalness across multiple languages, outperforming proprietary models on standard TTS evaluations.
Paper: https://arxiv.org/abs/2601.15621
Benchmark: Seed-TTS Eval & MOS (WER, SIM-O, naturalness) - https://arxiv.org/abs/2601.15621
GitHub: https://github.com/QwenLM/Qwen3-TTS
microsoft/VibeVoice-1.5B
Text-to-Speech • 3B • Updated • 98.2k • 2.25k
Note: SOTA frontier open-source TTS by Microsoft (1.5B). Designed for expressive, long-form multi-speaker podcast generation with natural turn-taking and speaker consistency. Achieves human-level naturalness on TTS-Arena. Supports English and Chinese.
Paper: https://arxiv.org/abs/2508.19205
Benchmark: TTS-Arena & EmergentTTS-Eval (MOS, naturalness) - https://arxiv.org/abs/2508.19205
GitHub: https://github.com/microsoft/VibeVoice
fishaudio/s2-pro
Text-to-Speech • 5B • Updated • 15.3k • 747
Note: SOTA open-source TTS by Fish Audio (S2, 5B). Features multi-speaker, multi-turn generation and instruction-following via natural language. Achieves an RTF of 0.195 with time-to-first-audio below 100ms, enabling production-ready streaming inference.
Paper: https://arxiv.org/abs/2603.08823
Benchmark: EmergentTTS-Eval & Seed-TTS Eval (WER, SIM-O, expressiveness) - https://arxiv.org/abs/2603.08823
GitHub: https://github.com/fishaudio/fish-speech
IndexTeam/IndexTTS-2
Text-to-Speech • Updated • 17.9k • 660
Note: SOTA emotionally expressive auto-regressive zero-shot TTS (IndexTeam). Breakthrough in duration control and fine-grained emotion transfer. Achieves top scores on naturalness and speaker-similarity benchmarks vs. proprietary models.
Paper: https://arxiv.org/abs/2506.21619
Benchmark: Seed-TTS Eval & MOS (WER, SIM-O, emotion accuracy) - https://arxiv.org/abs/2506.21619
GitHub: https://github.com/index-tts/index-tts
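The SIM-O speaker-similarity scores cited throughout this list are typically cosine similarities between speaker embeddings of the generated audio and of the original reference clip, extracted with a speaker-verification model. A minimal sketch of the comparison step (the embedding values below are made up for illustration; real speaker embeddings are typically a few hundred dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical low-dimensional embeddings for reference vs. generated audio.
ref_embedding = [0.2, 0.8, 0.1, 0.55]
gen_embedding = [0.25, 0.75, 0.05, 0.6]
print(f"SIM = {cosine_similarity(ref_embedding, gen_embedding):.3f}")
```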
bosonai/higgs-audio-v2-generation-3B-base
Text-to-Speech • 6B • Updated • 609k • 663
Note: SOTA expressive audio foundation model by Boson AI (6B). Pretrained on 10M+ hours of audio data with deep language and acoustic understanding. Achieves the #1 win rate on EmergentTTS-Eval for expressive speech, outperforming GPT-4o Audio.
Paper: https://arxiv.org/abs/2505.23009
Benchmark: EmergentTTS-Eval (win rate vs. GPT-4o Audio) - https://arxiv.org/abs/2505.23009
GitHub: https://github.com/boson-ai/higgs-audio
FunAudioLLM/Fun-CosyVoice3-0.5B-2512
Text-to-Speech • Updated • 6.14k • 491
Note: SOTA scalable multilingual zero-shot TTS by Alibaba FunAudio (CosyVoice3, 0.5B). LLM-based with supervised semantic tokens, enabling natural prosody, cross-lingual voice cloning, and fine-grained style control across multiple languages.
Paper: https://arxiv.org/abs/2407.05407
Benchmark: Seed-TTS Eval & multilingual MOS (WER, SIM-O, DNSMOS) - https://arxiv.org/abs/2407.05407
GitHub: https://github.com/FunAudioLLM/CosyVoice
HKUSTAudio/Llasa-3B
Text-to-Speech • 4B • Updated • 486 • 526
Note: SOTA Llama-based speech-synthesis scaling (HKUST, 3B). First TTS model to systematically scale both train-time and inference-time compute. Achieves SOTA on the Seed-TTS Eval English & Chinese benchmarks, surpassing prior open-source models.
Paper: https://arxiv.org/abs/2502.04128
Benchmark: Seed-TTS Eval (WER, SIM-O), English & Chinese - https://arxiv.org/abs/2502.04128
GitHub: https://github.com/HKUSTAudio/Llasa
myshell-ai/OpenVoiceV2
Text-to-Speech • Updated • 478
Note: SOTA zero-shot cross-lingual voice cloning by MyShell AI (OpenVoice V2). Natively supports English, Spanish, French, Chinese, Japanese & Korean with accurate tone-color cloning and flexible style control (emotion, rhythm, intonation). MIT license.
Paper: https://arxiv.org/abs/2312.01479
Benchmark: Speaker similarity & naturalness MOS vs. XTTS - https://arxiv.org/abs/2312.01479
GitHub: https://github.com/myshell-ai/OpenVoice
amphion/MaskGCT
Text-to-Speech • Updated • 731 • 306
Note: SOTA zero-shot TTS with a Masked Generative Codec Transformer (Amphion). Non-autoregressive approach with parallel decoding for fast, high-quality synthesis. Achieves top scores on TTS-Arena for naturalness across 6 languages (EN, ZH, FR, DE, JA, KO).
Paper: https://arxiv.org/abs/2409.00750
Benchmark: TTS-Arena & Seed-TTS Eval (MOS, SIM-O, WER) - https://arxiv.org/abs/2409.00750
GitHub: https://github.com/open-mmlab/Amphion
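The parallel decoding MaskGCT uses follows the general masked-generative recipe: start from a fully masked token sequence and, over a handful of steps, commit the highest-confidence predictions while re-masking the rest under a schedule. A toy sketch of that loop with a random stand-in for the real codec-token predictor (everything here is illustrative, not MaskGCT's actual code or schedule):

```python
import math
import random

def masked_parallel_decode(seq_len: int, vocab: int, steps: int,
                           seed: int = 0) -> list[int]:
    """Toy MaskGIT-style decoding: iteratively unmask the most confident
    positions, with a cosine schedule controlling how many stay masked."""
    rng = random.Random(seed)
    MASK = -1
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        # Stand-in predictor: a random (token, confidence) per masked slot.
        preds = {i: (rng.randrange(vocab), rng.random())
                 for i, t in enumerate(tokens) if t == MASK}
        # Cosine schedule: fraction of the sequence still masked afterwards.
        keep_masked = math.cos(math.pi / 2 * step / steps)
        n_unmask = len(preds) - int(keep_masked * seq_len)
        # Commit the highest-confidence predictions in parallel.
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _) in ranked[:n_unmask]:
            tokens[i] = tok
    return tokens

decoded = masked_parallel_decode(seq_len=16, vocab=1024, steps=8)
assert all(t != -1 for t in decoded)  # every position committed by the end
```

With a real model, the confidence would be the predicted token's probability, and early steps commit only a few tokens while later steps fill in the rest; by the final step the schedule reaches zero and the sequence is fully decoded.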