| <!DOCTYPE html> |
| <html lang="en"> |
|
|
| <head> |
| <meta charset="UTF-8" /> |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /> |
| <title>MiniMax-Speech Tech Report | Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder</title> |
| <meta name="description" |
| content="MiniMax-Speech is an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech." /> |
| <meta name="keywords" content="MiniMax-Speech,text-to-speech,TTS,zero-shot voice cloning,speaker encoder,Flow-VAE" /> |
| <meta property="og:title" |
| content="MiniMax-Speech Tech Report | Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder" /> |
| <meta property="og:url" content="https://minimax-ai.github.io/tts_tech_report" /> |
| <meta property="og:description" |
| content="MiniMax-Speech is an autoregressive Transformer-based Text-to-Speech (TTS) model that generates high-quality speech." /> |
| <meta property="og:type" content="website" /> |
|
|
| <link rel="stylesheet" href="style.css" /> |
| </head> |
|
|
| <body id="top" class="text-justify"> |
| <header |
| style="background-image: url('assets/images/header-bg.jpeg'); background-size: cover; background-position: center; padding: 1rem 0; border-radius: 1rem;"> |
| <h1>MiniMax-Speech</h1> |
| <h4 style="font-size: 1.3rem; line-height: 1; text-align: center;">Intrinsic Zero-Shot Text-to-Speech |
| with a |
| Learnable Speaker |
| Encoder</h4> |
| <p class="author"> |
| MiniMax Team <span class="date">May 2025</span><br /> |
| <a style="font-size: 1.1rem;" target="_blank" href="https://arxiv.org/abs/2505.07916">[Tech |
| Report]</a> |
| <a style="font-size: 1.1rem; margin-left: 1rem;" target="_blank" |
| href="https://huggingface.co/datasets/MiniMaxAI/TTS-Multilingual-Test-Set">[Multilingual Test Set]</a> |
| <a style="font-size: 1.1rem; margin-left: 1rem;" target="_blank" href="https://github.com/MiniMax-AI">[GitHub]</a> |
| </p> |
| </header> |
|
|
| <div class="abstract"> |
| <h2>Abstract</h2> |
| <p style="text-align: left;"> |
| We introduce MiniMax-Speech, an autoregressive Transformer-based Text-to-Speech (TTS) model that generates |
| high-quality |
| speech. A key innovation is our learnable speaker encoder, which extracts timbre features from a reference audio |
| without |
| requiring its transcription. This enables MiniMax-Speech to produce highly expressive speech with timbre |
| consistent with |
| the reference in a zero-shot manner, while also supporting one-shot voice cloning with exceptionally high |
| similarity to |
| the reference voice. In addition, the overall quality of the synthesized audio is enhanced through the proposed |
| Flow-VAE. Our model supports 32 languages and demonstrates excellent performance across multiple objective and |
| subjective evaluation metrics. Notably, it achieves state-of-the-art (SOTA) results on objective voice cloning |
| metrics |
| (Word Error Rate and Speaker Similarity) and has secured the top position on the public TTS Arena leaderboard. |
| Another |
| key strength of MiniMax-Speech, granted by the robust and disentangled representations from the speaker encoder, |
| is its |
| extensibility without modifying the base model, enabling various applications such as: arbitrary voice emotion |
| control |
| via LoRA; text to voice (T2V) by synthesizing timbre features directly from text description; and professional |
| voice |
| cloning (PVC) by fine-tuning timbre features with additional data. |
| </p> |
| </div> |
|
|
| <nav role="navigation" class="toc"> |
| <h2>Explore MiniMax-Speech</h2> |
| <p>Visit |
| <a href="https://www.minimax.io/audio">MiniMax Audio</a> to |
| explore our TTS features. |
| </p> |
| <h2>Contents</h2> |
| <ol> |
| <li> |
| <a href="#architecture-overview">Architecture Overview</a> |
| </li> |
| <li> |
| <a href="#expressiveness-demonstrations">Expressiveness Demonstrations</a> |
| <ol> |
| <li><a href="#showcase-with-high-versatility">Showcase with High Versatility</a></li> |
| <li><a href="#showcase-with-multiple-generation-attempts">Showcase with Multiple Generation Attempts</a></li> |
| </ol> |
| </li> |
| <li><a href="#zero-shot-vs-one-shot-demonstrations">Zero-Shot vs. One-Shot Demonstrations</a></li> |
| <li><a href="#multilingual-and-cross-lingual-capabilities-demonstrations">Multilingual and Cross-Lingual |
| Capabilities Demonstrations</a></li> |
| <li><a href="#flow-vae-vs-vae-comparisons">Flow-VAE vs. VAE Comparisons</a></li> |
| <li><a href="#professional-voice-clone-pvc-demonstrations">Professional Voice Clone (PVC) Demonstrations</a></li> |
| <li><a href="#emotion-control-demonstrations">Emotion Control Demonstrations</a></li> |
| <li><a href="#text-prompted-voice-generation-demonstrations">Text-Prompted Voice Generation Demonstrations</a> |
| </li> |
| <li><a href="#comparison-of-voice-naturalness">Comparison of Voice |
| Naturalness with Previous-Generation Products</a></li> |
| <li><a href="#citation">Citation</a></li> |
| </ol> |
| </nav> |
|
|
| <main> |
| <article> |
| <div class="article-block"> |
| <h2 id="architecture-overview">Architecture Overview</h2> |
| <figure> |
| <img src="assets/images/system-overview.jpg" loading="lazy" alt="System Architecture" width="100%" |
| height="auto" /> |
| <figcaption> |
| An overview of the architecture of MiniMax-Speech. |
| </figcaption> |
| </figure> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="expressiveness-demonstrations">Expressiveness Demonstrations</h2> |
| <h3 id="showcase-with-high-versatility">Showcase with High Versatility</h3> |
| <div class="scroll-wrapper"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col" style="width: 40%;">Description</th> |
| <th scope="col" style="width: 30%; text-align: center;">Source Audio</th> |
| <th scope="col" style="width: 30%; text-align: center;">Generated Audio</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| A Compelling and Persuasive Speaker Voice |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Marketing_Voice_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Compelling%20and%20Persuasive.wav" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| A Clear and Explanatory Voice with Broad Emotional Dynamics Across Different Texts |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Science_Voice_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Explanatory%20Broad%20Emotional.wav" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| Another Explanatory Voice with Supernatural Prosody, <br> |
| Featuring Distinct Ethnic and Age Characteristics |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Sociology_Sourse.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Explanatory%20Supernatural%20Prosody.MP3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| A Warm and Magnetic Voice that Brings Comfort |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Warm%20and%20Magnetic_Sourse.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Warm%20and%20Magnetic.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| An ASMR Whispering Voice with Generated Breathing and Sound Effects |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Breathy%20ASMR_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Breathy%20ASMR.MP3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| A Robotic Voice with Rich Bass Resonance and Spatial Presence |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Lucky%20Robot_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Lucky%20Robot.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| A Sardonic Mature Female Voice |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Onee-san_Sourse.MP3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Onee-san.wav" controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
|
|
| <h3 id="showcase-with-multiple-generation-attempts">Showcase with Multiple Generation Attempts, Post-Processing |
| Audio Effects and Added Sound Effects</h3> |
| <div class="scroll-wrapper"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col" style="width: 50%;">Description</th> |
| <th scope="col" style="width: 50%; text-align: center;">Generated Audio</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| A Husky Male Voice: From Soft Murmur to Excitement to Anger, then to Whispers |
| </td> |
| <td> |
| <audio class="audio-lg" src="assets/audios/Murmur-Excitement-Anger-%20Whispers.MP3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| An Angry Female Voice: From Soft Murmur to Rage to Reminiscence, then to Weeping |
| </td> |
| <td> |
| <audio class="audio-lg" src="assets/audios/Neutral-Rage-Reminiscence-Weeping.MP3" controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="zero-shot-vs-one-shot-demonstrations">Zero-Shot vs. One-Shot Demonstrations</h2> |
| <p> |
| Zero-Shot maintains speaker identity while generating more natural emotions, pauses, and other expressive |
| features based on the text content, whereas One-Shot adheres more strictly to the reference speaker's |
| characteristics (prosody, speech rate, emotions, etc.). For details of Zero-Shot and One-Shot, refer to the <a |
| href="https://arxiv.org/abs/2505.07916" target="_blank">technical report</a>. |
| </p> |
| <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col">Source Audio</th> |
| <th scope="col">Text</th> |
| <th scope="col">Zero-Shot Version</th> |
| <th scope="col">One-Shot Version</th> |
| <th scope="col">ElevenLabs Multilingual_v2</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_Prompt.WAV" controls></audio> |
| </td> |
| <td> |
| 命运就算颠沛流离,<br> |
| 命运就算曲折离奇,<br> |
| 命运就算恐吓着你,<br> |
| 做人没趣味。<br> |
| 别流泪,心酸,更不应舍弃。<br> |
| 我愿能,一生永远陪伴你。 |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_ZeroShot.mp3" controls></audio> |
| Preserving Distinctive Voice<br> |
| Timbre and Expressive <br> |
| Prosody with Regularized <br> |
| Pausing and Speech Rate |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Lyrical%20Cantonese_Oneshot.mp3" controls></audio> |
| Better Reproduction of<br> |
| Prompt's Exaggerated Speech<br> |
| Rate and Characteristic<br> |
| Phrase-Initial Pauses |
| </td> |
| <td> |
| Cantonese not supported |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_Prompt.WAV" controls></audio> |
| </td> |
| <td> |
| 你们这些躲在道德高地的懦夫,<br> |
| 敢承认自己对本我的恐惧吗?<br> |
| 回答我!嗯?你回答我!<br> |
| Look in my eyes!<br> |
| 老子写梦的解析时<br> |
| 你们还在玩泥巴,<br> |
| 我精神分析引论每个字母都能<br> |
| 刺穿文明社会的虚伪面具,<br> |
| 我解剖潜意识就像<br> |
| 外科医生划开皮肤。<br> |
| 是不是啊?说话! |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_ZeroShot.mp3" controls></audio> |
| Capable of Generating<br> |
| Relatively Calmer Emotions<br> |
| while Preserving Voice<br> |
| Identity |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Breaking%20Down%20Mandarin_OneShot.mp3" controls></audio> |
| Consistently Reproducing the<br> |
| Angry Emotion from Prompt<br> |
| in Every Utterance |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Breaking%20Down%20Mandarin.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_Prompt.MP3" controls></audio> |
| </td> |
| <td> |
| Would you believe what happened at the<br> |
| grocery store today? My goodness! The<br> |
| avocados were on sale - half price! Half<br> |
| price! I bought twenty of them! |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_ZeroShot.MP3" controls></audio> |
| Effectively follows textual cues<br> |
| for both longer and shorter<br> |
| inter-sentence pauses |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Quirky%20Female%20English_OneShot.MP3" controls></audio> |
| Better reproduces the<br> |
| exaggerated high pitch<br> |
| characteristic of anime voices<br> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Quirky%20Female%20English.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_Prompt.MP3" controls></audio> |
| </td> |
| <td> |
| Oh my gosh, like, I literally can't believe<br> |
| what just happened! Um, so basically, I was,<br> |
| you know, just sitting there in class,<br> |
| right? And then, ugh, this totally weird<br> |
| thing happened - like, seriously weird! Wait,<br> |
| wait... Should I even be talking about this?<br> |
| Ugh, whatever. |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_ZeroShot.MP3" |
| controls></audio> |
| Effectively follows textual cues<br> |
| for both longer and shorter<br> |
| inter-sentence pauses |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Neurotic%20Teenage%20English_OneShot.MP3" controls></audio> |
| Better reproduces the<br> |
| exaggerated high pitch<br> |
| characteristic of anime voices<br> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Neurotic%20Teenage%20English.mp3" |
| controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="multilingual-and-cross-lingual-capabilities-demonstrations">Multilingual and Cross-Lingual Capabilities |
| Demonstrations</h2> |
| <p>Speech-02-HD maintains high naturalness in less common languages while demonstrating significant advantages |
| in |
| Standard |
| Chinese pronunciation accuracy.</p> |
| <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col">Languages</th> |
| <th scope="col">Source Audio</th> |
| <th scope="col">Text</th> |
| <th scope="col">MiniMax<br>Speech_02_HD</th> |
| <th scope="col">ElevenLabs<br>Multilingual_v2</th> |
| <th scope="col">OpenAI<br>TTS_1_HD<br>(*not cloned voice)</th> |
| </tr> |
| |
| <tr class="border-bottom-thin"> |
| <th>Thai</th> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Thai_Male_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| สวัสดีค่ะ วันนี้อากาศดีมากเลย<br> |
| คุณจะไปทานอาหารกลางวันที่ไหนคะ<br> |
| ฉันกำลังคิดว่าจะไปร้านอาหารไทยแถวนี้<br> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Thai.mp3" controls></audio> |
| </td> |
| <td> |
| Thai not perfectly supported |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Thai.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/OpenAI_Thai.mp3" controls></audio> |
| </td> |
| </tr> |
| |
| <tr class="border-bottom-thin"> |
| <th>Vietnamese</th> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Vietnamese_Female_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| Tôi đang đọc một cuốn sách rất hay về lịch sử Việt Nam.<br> |
| Những câu chuyện về văn hóa truyền<br> |
| thống thật sự rất thú vị.<br> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Vietnamese.mp3" controls></audio> |
| </td> |
| <td> |
| Vietnamese not perfectly supported |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Vietnamese.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/OpenAI_Vietnamese.mp3" controls></audio> |
| </td> |
| </tr> |
| |
| <tr class="border-bottom-thin"> |
| <th>Czech</th> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Czech_Female_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| Ranní mlha se pomalu zvedá nad řekou,<br> |
| zatímco první paprsky slunce prosvítají mezi stromy.<br> |
| Ptáci začínají svůj ranní koncert.<br> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Czech.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Czech.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/OpenAI_Czech.mp3" controls></audio> |
| </td> |
| </tr> |
| |
| <tr class="border-bottom-thin"> |
| <th>Polish</th> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Polish_Male_Sourse.wav" controls></audio> |
| </td> |
| <td> |
| Młoda sowa siedzi cicho na gałęzi sosny,<br> |
| obserwując leśną polanę w świetle księżyca.<br> |
| Wiatr delikatnie porusza liśćmi drzew.<br> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Polish.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Polish.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/OpenAI_Polish.mp3" controls></audio> |
| </td> |
| </tr> |
| |
| <tr class="border-bottom-thin"> |
| <th>Japanese</th> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Japanese_DominantMan_Sourse.mp3" controls></audio> |
| </td> |
| <td> |
| 電車が遅延している影響で、渋谷駅がとても混雑<br> |
| しています。次の山手線は約10分後に到着<br> |
| 予定です。お急ぎのお客様は、他の路線も<br> |
| ご利用ください。 |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Japanese.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ElevenLabs_Japanese_Dominant_Man.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/OpenAI_Japanese.mp3" controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <p style="margin-top: 4rem;">Speech-02-HD has superior performance in zero-shot cross-lingual scenarios.</p> |
| <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col">Original Language</th> |
| <th scope="col">Source Audio</th> |
| <th scope="col">Mixed Language</th> |
| <th scope="col">Text</th> |
| <th scope="col">MiniMax<br>Speech_02_HD</th> |
| <th scope="col">ElevenLabs<br>Multilingual_v2</th> |
| <th scope="col">OpenAI<br>TTS_1_HD<br>(*not cloned voice)</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td>English</td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Wong_Sourse.mp3" controls></audio> |
| </td> |
| <td>English + Mandarin</td> |
| <td> |
| Kiddo! Come come come, 学如逆水行舟,不进则退。<br> |
| I see you're using AI tools already - so smart!<br> |
| But eh, cannot just rely on tools only lah!<br> |
| The future belongs to those who can work alongside AI,<br> |
| not those scared of it. |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/English-Mandarin.wav" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/ElevenLabs_English-Mandarin.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/OpenAI_English-Mandarin.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td>Mandarin</td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ShiBanYu_Sourse.mp3" controls></audio> |
| </td> |
| <td>Mandarin + Cantonese</td> |
| <td> |
| 老铁啊,多谢晒你送我呢本,广州话正音字典,咁好嘢喎!<br> |
| 我呢个大老爷们儿学广州话真系好难㗎!成日都分唔清声调啊。<br> |
| 嗱,而家有咗呢本书,什么都好啦。 |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Mandarin-Cantonese.MP3" controls></audio> |
| </td> |
| <td> |
| Cantonese not supported |
| </td> |
| <td> |
| Cantonese not supported |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td>Mandarin</td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/ShuanQ_Sourse.mp3" controls></audio> |
| </td> |
| <td>Mandarin + English</td> |
| <td> |
| The people said, 桂林's scenery is the first under heaven.<br> |
| Yet in my opinion, 阳朔 scenery is better than 桂林。<br> |
| 群峰倒影山浮水,无水无山不入神。 |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Mandarin-English.WAV" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/ElevenLabs_Mandarin-English.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/OpenAI_Mandarin-English.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td>English</td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/CoCo_Sourse.mp3" controls></audio> |
| </td> |
| <td>English + Spanish</td> |
| <td> |
| Mi abuelita always told me "el que persevera, alcanza".<br> |
| If you persevere, you'll achieve your dreams!<br> |
| Guess what! They choose me to play the lead role in our BIG show! |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/English-Spanish.wav" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/ElevenLabs_English-Spanish.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/OpenAI_English-Spanish.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td>Japanese</td> |
| <td> |
| <audio class="audio-sm" src="assets/audios/Powerful_Girl_Sourse.mp3" controls></audio> |
| </td> |
| <td>Japanese + Korean</td> |
| <td> |
| 最近の天気予報によりますと、今週末は桜の開花に最適<br> |
| な気温になる予定です。<br> |
| 東京都内の各公園では花見客で賑わうことが予想されますが、<br> |
| 서울에서도 벚꽃이 피기 시작했다고 하네요.<br> |
| 이번 주말에는 여의도 공원에서 벚꽃 축제가 열린다고 하니<br> |
| 많은 분들이 찾아오실 것 같습니다. |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Japanese-Korean.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/ElevenLabs_Japanese-Korean.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/OpenAI_Japanese-Korean.mp3" controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| <p>*OpenAI does not currently support voice cloning; its samples are included in the comparative listening |
| tests as a reference for naturalness.</p> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="flow-vae-vs-vae-comparisons">Flow-VAE vs. VAE Comparisons</h2> |
| <p>Flow-VAE is less likely to produce the following instabilities.</p> |
| <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col" style="text-align: center;">Source Audio</th> |
| <th scope="col" style="text-align: center;">Flow-VAE</th> |
| <th scope="col" style="text-align: center;">VAE</th> |
| <th scope="col" style="text-align: center;">Differences</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td style="width: 25%"> |
| <audio src="assets/audios/Condition1.wav" controls></audio> |
| </td> |
| <td style="width: 25%"> |
| <audio src="assets/audios/FlowVAE1.wav" controls></audio> |
| </td> |
| <td style="width: 25%"> |
| <audio src="assets/audios/VAE1.wav" controls></audio> |
| </td> |
| <td> |
| Flow-VAE reproduces more continuous<br> |
| and natural reverberation |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio src="assets/audios/Condition2.wav" controls></audio> |
| </td> |
| <td> |
| <audio src="assets/audios/FlowVAE2.wav" controls></audio> |
| </td> |
| <td> |
| <audio src="assets/audios/VAE2.wav" controls></audio> |
| </td> |
| <td> |
| VAE introduces unwanted<br> |
| high-frequency components |
| </td> |
| </tr> |
| <tr> |
| <td> |
| <audio src="assets/audios/Conditon3.wav" controls></audio> |
| </td> |
| <td> |
| <audio src="assets/audios/FlowVAE3.wav" controls></audio> |
| </td> |
| <td> |
| <audio src="assets/audios/VAE3.wav" controls></audio> |
| </td> |
| <td> |
| VAE produces electronic-sounding<br> |
| artifacts at the beginning |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="professional-voice-clone-pvc-demonstrations">Professional Voice Clone (PVC) Demonstrations</h2> |
| <p>For more complex dialectal accents and tonal characteristics, PVC can reproduce these features while |
| maintaining high |
| naturalness based on the text content.</p> |
| <div class="scroll-wrapper" style="margin-top: 2rem;"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col" style="text-align: center;">Source Audio</th> |
| <th scope="col" style="text-align: center;">Zero-Shot</th> |
| <th scope="col" style="text-align: center;">PVC</th> |
| <th scope="col" style="text-align: center;">Differences</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td style="width: 25%"> |
| <audio src="assets/audios/JosephBrodsky_Source.wav" controls></audio> |
| </td> |
| <td style="width: 25%"> |
| <audio src="assets/audios/JosephBrodsky_Fast.mp3" controls></audio> |
| </td> |
| <td style="width: 25%"> |
| <audio src="assets/audios/JosephBrodsky_PVC.mp3" controls></audio> |
| </td> |
| <td> |
| Like the Zero-Shot version, the PVC<br> |
| version has rising sentence-final intonation,<br> |
| but distinctively sustains this<br> |
| elevated pitch instead of the typical<br> |
| pitch declination found in common<br> |
| declarative sentences |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio src="assets/audios/TianJin_Source.wav" controls></audio> |
| </td> |
| <td> |
| <audio src="assets/audios/TianJin_Fast.mp3" controls></audio> |
| </td> |
| <td> |
| <audio src="assets/audios/TianJin_PVC.mp3" controls></audio> |
| </td> |
| <td> |
| With more materials, the model not only<br> |
| reproduces the speaker's voice characteristics<br> |
| but also accurately captures more<br> |
| dialectal features |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="emotion-control-demonstrations">Emotion Control Demonstrations</h2> |
| <h3>Source Audio for Refreshing Young Man</h3> |
| <audio src="assets/audios/Mandarin_Refreshing_Young_Man_Sourse.mp3" controls></audio> |
| <h3>DEMO</h3> |
| <div class="scroll-wrapper"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col">Neutral</th> |
| <th scope="col" style="min-width: 120px;">Emotion</th> |
| <th scope="col">Text</th> |
| <th scope="col">Emotion Control Audio</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-md" src="assets/audios/Neutral1.mp3" controls></audio> |
| </td> |
| <td> |
| Surprised |
| </td> |
| <td> |
| 天哪!我完全没想到会在这里遇见你,<br> |
| 都过去这么多年了,你一点都没变! |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Surprised.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-md" src="assets/audios/Neutral2.mp3" controls></audio> |
| </td> |
| <td> |
| Disgusted |
| </td> |
| <td> |
| 这个地方实在太脏乱了,到处都是垃圾和难闻的气味儿,<br> |
| 我一秒钟都不想多待。 |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Disgusted.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-md" src="assets/audios/Neutral3.mp3" controls></audio> |
| </td> |
| <td> |
| Fearful |
| </td> |
| <td> |
| 深夜回家的路上,我清楚地听见身后有脚步声在跟着我,<br> |
| 可是回头却什么都看不见。 |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Fearful.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-md" src="assets/audios/Neutral4.mp3" controls></audio> |
| </td> |
| <td> |
| Angry |
| </td> |
| <td> |
| 我付出了这么多,换来的却是这样的背叛!<br> |
| 你怎么可以这样对待我的信任! |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Angry.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-md" src="assets/audios/Neutral5.mp3" controls></audio> |
| </td> |
| <td> |
| Sad |
| </td> |
| <td> |
| 躺在床上翻来覆去,心里压着说不出的难过和沮丧,<br> |
| 昨天晚上又失眠了。 |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Sad.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| <audio class="audio-md" src="assets/audios/Neutral6.mp3" controls></audio> |
| </td> |
| <td> |
| Happy |
| </td> |
| <td> |
| 和好朋友一起在院子里烧烤,聊着有趣的故事,<br> |
| 享受着美食和欢乐的时光。 |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Happy.mp3" controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="text-prompted-voice-generation-demonstrations">Text-Prompted Voice Generation Demonstrations</h2> |
| <div class="scroll-wrapper"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col">Prompt</th> |
| <th scope="col">Text</th> |
| <th scope="col" style="text-align: center;">Audio</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| 男性中年声音,说中文,音色浑厚醇厚,带有自然的磁性,<br> |
| 语速偏慢,音量适中,音调偏低沉。声音整体给人沉稳可靠的感觉,<br> |
| 在深度访谈场景中表现出专业性和亲和力,音质清晰,吐字规整有力。 |
| </td> |
| <td> |
| 在这个安静的夜晚,让我们一起走进《人生笔记》这本书。<br> |
| 作者用平实的文字记录下生活中的点点滴滴,<br> |
| 让我们看到平凡中的真善美。<br> |
| 今天,我们先来读第一章:'生活的痕迹'...... |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/深度访谈男中年.wav" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| 说中文的女青年,音色偏甜美,语速比较快,<br> |
| 说话时带着一种轻快的感觉,整体音调较高,像是在直播带货,<br> |
| 整体氛围比较活跃,声音清晰,听起来很有亲和力。 |
| </td> |
| <td> |
| 亲爱的宝宝们,等了好久的神仙面霜终于到货啦!<br> |
| 你们看这个包装是不是超级精致?<br> |
| 我自己已经用了一个月了,效果真的绝绝子!<br> |
| 而且这次活动价真的太划算了,错过真的会后悔的哦~ |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/直播带货女青年.wav" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| 中国男性声音,听着像是青年,音色清亮,语速比较快,<br> |
| 说话很有激情,像是在解说比赛,声音中带着紧张和兴奋的感觉。 |
| </td> |
| <td> |
| 漂亮!这个进攻太精彩了!张伟突破防线,<br> |
| 一个漂亮的转身,球传到禁区,王超跟上,射门!<br> |
| 球进了!难以置信的精彩配合,现场观众都沸腾了! |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/体育解说男青年.wav" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| 中国女青年的声音,音色清脆,说话速度偏快,语调活泼,<br> |
| 像是在做游戏直播,声音中带着愉快的感觉,整体音调较高,<br> |
| 整体氛围比较轻松。 |
| </td> |
| <td> |
| 啊!这里有个宝箱!让我们看看里面是什么~<br> |
| 哇!是传说中的紫色装备!运气也太好了吧!<br> |
| 谢谢小伙伴们的打赏,我们继续往前探索...... |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/游戏主播女青年.wav" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| English-speaking female voice, sounding relatively young,<br> |
| with a sweet and pleasant tone. Speaking at a moderate pace<br> |
| with a touch of energy, similar to someone narrating a<br> |
| beauty/makeup tutorial video. The overall atmosphere is<br> |
| relaxed and cheerful. |
| </td> |
| <td> |
| Hi everyone! Today I'll be sharing a soft, romantic<br> |
| makeup look that's perfect for dates. Many of you have <br> |
| been asking how to apply this eyeshadow naturally - the<br> |
| key is using gentle techniques. Let's go through the<br> |
| steps together... |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/美妆女博主.wav" controls></audio> |
| </td> |
| </tr> |
| <tr> |
| <td> |
| English-speaking middle-aged male voice, slightly husky, <br> |
| speaking at a moderate-to-slow pace with a deep tone. Like<br> |
| someone telling an old story, conveying a nostalgic feeling,<br> |
| with a relaxed and composed manner of speaking. |
| </td> |
| <td> |
| That was back in the late 1970s. I remember when our <br> |
| village first got electricity - everyone was so excited. <br> |
| In the evenings, people would bring their stools and <br> |
| gather under the big banyan tree by the village committee <br> |
| office to watch movies projected on the wall. Even now, <br> |
| thinking back to those moments still fills me with warmth. |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/回忆男中年.wav" controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="comparison-of-voice-naturalness">Comparison of Voice Naturalness |
| with Previous-Generation Products</h2> |
| <p>The new model demonstrates significant advantages in naturalness over previous-generation products.</p> |
| <h3 style="margin-top: 2rem;">Source Audio for Radiant_Girl</h3> |
| <audio src="assets/audios/English_Radiant_Girl_Sourse.wav" controls></audio> |
| <h3>DEMO</h3> |
| <div class="scroll-wrapper"> |
| <table style="width: 100%;"> |
| <tbody> |
| <tr class="border-bottom-thin"> |
| <th scope="col">Text</th> |
| <th scope="col" style="text-align: center;">MiniMax<br>Speech_02_HD</th> |
| <th scope="col" style="text-align: center;">Microsoft<br>Azure TTS</th> |
| <th scope="col" style="text-align: center;">AWS<br>Polly</th> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| I sat alone in the empty room, staring at the old photographs,<br> |
| wondering how everything could change so quickly,<br> |
| how a lifetime of memories could fade away just like that. |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Radiant_Girl_1.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Emma_1.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Joanna_1.mp3" controls></audio> |
| </td> |
| </tr> |
| <tr class="border-bottom-thin"> |
| <td> |
| The moment I held my acceptance letter, my heart burst with joy - <br> |
| all those sleepless nights finally paid off, and I couldn't stop<br> |
| dancing around the room, calling everyone I knew to share this amazing news! |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Radiant_Girl_2.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Emma_2.mp3" controls></audio> |
| </td> |
| <td> |
| <audio class="audio-md" src="assets/audios/Joanna_2.mp3" controls></audio> |
| </td> |
| </tr> |
| </tbody> |
| </table> |
| </div> |
| </div> |
|
|
| <div class="article-block"> |
| <h2 id="citation">Citation</h2> |
| <div> |
| <pre> |
| <code> |
| @misc{minimax2025minimaxspeechintrinsiczeroshottexttospeech, |
| title={MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder}, |
| author={Bowen Zhang and Congchao Guo and Geng Yang and Hang Yu and Haozhe Zhang and Heidi Lei and Jialong Mai and Junjie Yan and Kaiyue Yang and |
| Mingqi Yang and Peikai Huang and Ruiyang Jin and Sitan Jiang and Weihua Cheng and Yawei Li and Yichen Xiao and Yiying Zhou and Yongmao Zhang and |
| Yuan Lu and Yucen He}, |
| year={2025}, |
| eprint={2505.07916}, |
| archivePrefix={arXiv}, |
| primaryClass={eess.AS}, |
| url={https://arxiv.org/abs/2505.07916}, |
| }</code> |
| </pre> |
| </div> |
| </div> |
| </article> |
| </main> |
|
|
| <script> |
| MathJax = { |
| tex: { |
| inlineMath: [['$', '$'],], |
| }, |
| } |
| |
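| // This page has no element with id "dark-mode-toggle", so attach the handler only if one exists. |
| const darkModeToggle = document.getElementById('dark-mode-toggle') |
| if (darkModeToggle) { |
| darkModeToggle.addEventListener('click', () => { |
| document.body.classList.toggle('latex-dark') |
| }) |
| } |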
| </script> |
| <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script> |
| </body> |
|
|
| </html> |