| --- |
| language: |
| - fr |
| - es |
| - pt |
| - de |
| - en |
| metrics: |
| - bleu |
| - comet |
| base_model: |
| - kyutai/hibiki-zero-3b-pytorch-bf16 |
| pipeline_tag: audio-to-audio |
| --- |
| # Hibiki-Zero |
|
|
| [Hibiki-Zero](https://github.com/kyutai-labs/hibiki-zero) is a model for **simultaneous speech translation**. Traditional approaches to building simultaneous translation systems rely on supervised training with word-level aligned data between the source and the target content. Hibiki-Zero eliminates the need for word-level alignments entirely, which fundamentally simplifies the training pipeline and enables **seamless scaling to multiple languages** with varying grammatical structures. |
|
|
| Hibiki-Zero supports translation from 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese and 🇩🇪 German to 🇬🇧 English. At inference, Hibiki-Zero **adapts its flow** to accumulate just enough context to produce a real-time, natural-sounding speech translation with voice transfer, along with a text translation. Hibiki-Zero can also be **adapted to a new input language with less than 1,000 hours of speech data**. |
|
|
| --- |
|
|
| ## Model Details |
|
|
| This is the model simply referred to as *Hibiki-Zero* in our [paper][paper], a 3B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio generated at a 2.2kbps bitrate. |
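|
| As a sanity check on these figures, the bitrate follows directly from the framerate, the number of codebooks, and the codebook size. The sketch below assumes 16 codebooks of 2048 entries each, plausible Mimi-style settings rather than values stated on this card: |
|
| ```python |
| import math |
|
| # Assumed codec settings (not stated on this card): 16 codebooks of 2048 entries. |
| framerate_hz = 12.5   # token frames per second |
| num_codebooks = 16    # residual quantizers used during generation |
| codebook_size = 2048  # entries per codebook -> 11 bits per token |
|
| bits_per_frame = num_codebooks * math.log2(codebook_size)  # 16 * 11 = 176 bits |
| bitrate_bps = framerate_hz * bits_per_frame                # 12.5 * 176 = 2200 bps |
| print(f"{bitrate_bps / 1000:.1f} kbps")                    # -> 2.2 kbps, as above |
| ``` |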
|
|
| ### Model Description |
|
|
| Hibiki-Zero is a decoder-only model that can receive and generate audio tokens produced by the streaming neural audio codec [Mimi](https://huggingface.co/kyutai/mimi). It leverages the same **multistream** architecture as [Moshi](https://arxiv.org/abs/2410.00037) or [Hibiki](https://arxiv.org/abs/2502.03382) to model source and target speech jointly. This allows Hibiki-Zero to continuously process the input stream while generating the target speech and text tokens at a constant framerate of 12.5Hz, producing a **continuous output audio stream** along with a timestamped text translation. Hibiki-Zero consists of a main backbone of 3 billion parameters. |
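|
| Since Mimi is available through the `transformers` library, the tokenization step can be tried independently of Hibiki-Zero. Below is a minimal sketch of encoding a waveform into the discrete tokens the model operates on and decoding tokens back to audio; the choice of 16 quantizers is an assumption for illustration, not a value stated on this card: |
|
| ```python |
| import numpy as np |
| import torch |
| from transformers import AutoFeatureExtractor, MimiModel |
|
| # Load the Mimi codec that produces Hibiki-Zero's audio tokens. |
| feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi") |
| codec = MimiModel.from_pretrained("kyutai/mimi") |
|
| # One second of silent mono audio at Mimi's native 24kHz sampling rate. |
| audio = np.zeros(24000, dtype=np.float32) |
| inputs = feature_extractor( |
|     raw_audio=audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt" |
| ) |
|
| with torch.no_grad(): |
|     # Encode to discrete tokens; 16 quantizers is an assumed setting. |
|     codes = codec.encode(inputs["input_values"], num_quantizers=16).audio_codes |
|     print(codes.shape)  # (batch, num_quantizers, frames) at 12.5 frames/second |
|
|     # Decode the tokens back to a 24kHz waveform. |
|     audio_out = codec.decode(codes).audio_values |
| ``` |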
|
|
| At inference, Hibiki-Zero continuously encodes the input user speech and produces **real-time speech and text translation**. Our model relies on simple temperature sampling and is thus compatible with batching, unlike models that use complex inference policies. It is also possible to run **batched inference 3x faster than real-time** on a single H100 GPU, as demonstrated by our [inference code](https://github.com/kyutai-labs/hibiki-zero). Hibiki-Zero only supports a single speaker in a single language per session. However, it shows zero-shot capabilities for translating, with voice transfer, audio that contains multiple speakers in different languages. |
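|
| To make the batching claim concrete, here is a minimal sketch of the kind of plain temperature sampling involved; the temperature value and tensor shapes are assumptions for illustration, and the actual decoding loop lives in the [inference code](https://github.com/kyutai-labs/hibiki-zero): |
|
| ```python |
| import torch |
|
| def sample_frame(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor: |
|     """Plain temperature sampling for one decoding step. |
|
|     No beam search, lookahead, or re-ranking is involved, so a whole batch |
|     of independent translation streams can be advanced in a single call. |
|     """ |
|     probs = torch.softmax(logits / temperature, dim=-1) |
|     return torch.multinomial(probs, num_samples=1) |
|
| # Dummy logits standing in for one 12.5Hz decoding step over a batch of |
| # 64 streams with a 2048-way codebook (both sizes are assumptions). |
| logits = torch.randn(64, 2048) |
| next_tokens = sample_frame(logits)  # shape (64, 1): one token per stream |
| ``` |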
|
|
|
|
| - **Developed by:** Kyutai |
| - **Model type:** Simultaneous speech-to-speech and speech-to-text translation. |
| - **Languages:** {French,Spanish,Portuguese,German}-to-English |
| - **License:** CC BY-NC-SA 4.0 |
|
|
| ### Model Sources |
|
|
| - **Paper:** [Simultaneous Speech-to-Speech Translation Without Aligned Data][paper] |
| - **Inference code:** [github.com/kyutai-labs/hibiki-zero](https://github.com/kyutai-labs/hibiki-zero) |
| - **Examples:** [huggingface.co/spaces/kyutai/hibiki-zero-samples](https://huggingface.co/spaces/kyutai/hibiki-zero-samples) |
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Direct Use |
|
|
| The model can be used for streaming translation from French, Spanish, Portuguese and German to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds. |
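|
| Because training sequences were capped at 120 seconds, one reasonable way to handle longer recordings is to split them beforehand. A minimal sketch, assuming each chunk is then translated independently (how to treat chunk boundaries is not specified on this card): |
|
| ```python |
| import numpy as np |
|
| SAMPLE_RATE = 24000  # Mimi's sampling rate |
| MAX_SECONDS = 120    # longest sequence length seen during training |
|
| def chunk_audio(waveform: np.ndarray, max_seconds: int = MAX_SECONDS) -> list: |
|     """Split a long mono waveform into chunks no longer than the training cap.""" |
|     max_samples = max_seconds * SAMPLE_RATE |
|     return [waveform[i : i + max_samples] for i in range(0, len(waveform), max_samples)] |
|
| # A 5-minute recording becomes two 120s chunks plus one 60s remainder. |
| five_minutes = np.zeros(5 * 60 * SAMPLE_RATE, dtype=np.float32) |
| chunks = chunk_audio(five_minutes) |
| print([len(c) / SAMPLE_RATE for c in chunks])  # [120.0, 120.0, 60.0] |
| ``` |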
|
|
|
|
| ### Downstream Use |
|
|
| Some components of the model can be used independently or repurposed relatively easily. For instance, the [Mimi](https://huggingface.co/kyutai/mimi) codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5Hz and a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems. Regarding the main Hibiki-Zero architecture, we demonstrated that it is possible to finetune it for a new input language with less than 1,000 hours of speech, and we detail the method in our [paper][paper]. |
|
|
|
|
| ### Out-of-Scope Use |
|
|
| The model is not intended to be used to impersonate other people, or for any malicious use of any kind. |
|
|
|
|
| ## How to Get Started with the Model |
|
|
| See the [README](https://github.com/kyutai-labs/hibiki-zero) file for the inference code. |
|
|
| --- |
|
|
| ## Training Details |
|
|
| ### Training Data |
|
|
| - *Textual data:* The underlying text LLM, [Helium-1-2B](https://huggingface.co/kyutai/helium-1-2b), is trained on a mix of data including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl. |
|
|
| - *Audio data:* |
| - **Unsupervised audio dataset:** This dataset, used for audio pretraining, is a large collection of readily available audio content in French, Spanish, Portuguese, German and English. Our data mixture contains approximately 12% audio in each source language, 50% English and less than 2% Italian (see [Section 4.2.2][paper]); see the sketch after this list. |
| - **Speech translation dataset:** This dataset, used for speech translation training and reinforcement, contains around 40k hours of real speech data for each source language, paired with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]). |
| - **Speech translation fine-tuning dataset:** This dataset is a small 200h resynthesized subset of the *speech translation dataset* with natural pauses, used to improve audio quality and speech naturalness (see [Section 4.2.5][paper]). |
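|
| As a rough illustration of the pretraining mixture above, the following sketch draws the language of each pretraining example with the stated proportions. The exact weights and the sampling mechanism are assumptions chosen for illustration; only the approximate percentages come from this card: |
|
| ```python |
| import random |
|
| # Approximate pretraining language mixture from this card; exact weights are |
| # assumptions chosen so the proportions sum to 1.0. |
| MIXTURE = {"fr": 0.12, "es": 0.12, "pt": 0.12, "de": 0.12, "en": 0.50, "it": 0.02} |
| assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9 |
|
| def sample_batch_languages(batch_size: int = 8) -> list: |
|     """Draw the language of each pretraining example in one batch.""" |
|     return random.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=batch_size) |
|
| print(sample_batch_languages())  # e.g. ['en', 'fr', 'en', 'en', 'de', 'es', 'en', 'pt'] |
| ``` |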
|
|
| ### Training Procedure and Hyper-Parameters |
|
|
| The different training stages, along with the hyper-parameters, are detailed in the [paper][paper]. |
|
|
| ### Compute Infrastructure |
|
|
| The final model was trained on 48 Nvidia H100 GPUs. |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @unpublished{hibikizero2026, |
| title={Simultaneous Speech-to-Speech Translation Without Aligned Data}, |
| author={Tom Labiausse and Romain Fabre and Yannick Estève and Alexandre Défossez and Neil Zeghidour}, |
| note={Preprint}, |
| year={2026}, |
| url={https://arxiv.org/abs/2602.11072v1} |
| } |
| ``` |
|
|
| [paper]: https://arxiv.org/abs/2602.11072v1 |