---
language:
- fr
- es
- pt
- de
- en
metrics:
- bleu
- comet
base_model:
- kyutai/hibiki-zero-3b-pytorch-bf16
pipeline_tag: audio-to-audio
---
# Hibiki-Zero
[Hibiki-Zero](https://github.com/kyutai-labs/hibiki-zero) is a model for **simultaneous speech translation**. Traditional approaches to building simultaneous translation systems rely on supervised training with word-level aligned data between the source and the target content. Hibiki-Zero eliminates the need for word-level alignments entirely, which fundamentally simplifies the training pipeline and enables **seamless scaling to multiple languages** with varying grammatical structures.
Hibiki-Zero supports translation from 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese and 🇩🇪 German to 🇬🇧 English. At inference, Hibiki-Zero **adapts its flow**, accumulating just enough context to produce real-time, natural speech translation with voice transfer, along with a text translation. Hibiki-Zero can also be **adapted to a new input language with less than 1000h of speech data**.
---
## Model Details
This is the model simply referred to as *Hibiki-Zero* in our [paper][paper], a 3B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio being generated at a 2.2kbps bitrate.
### Model Description
Hibiki-Zero is a decoder-only model that can receive and generate audio tokens produced by the streaming neural audio codec [Mimi](https://huggingface.co/kyutai/mimi). It leverages the same **multistream** architecture as [Moshi](https://arxiv.org/abs/2410.00037) or [Hibiki](https://arxiv.org/abs/2502.03382) to model source and target speech jointly. This allows Hibiki-Zero to continuously process the input stream while generating target speech and text tokens at a constant framerate of 12.5Hz, producing a **continuous output audio stream** along with a timestamped text translation. Hibiki-Zero consists of a main backbone of 3 billion parameters.
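For intuition on these numbers, here is a back-of-the-envelope token budget per frame. The frame rate and bitrate come from this card; the 2048-entry codebook size is an assumption carried over from Mimi's public configuration, not something stated here.

```python
# Back-of-the-envelope token budget for one generated frame.
# ASSUMPTION: Mimi codebooks hold 2048 entries (11 bits per audio token);
# the frame rate and bitrate are taken from the model card above.
import math

FRAME_RATE_HZ = 12.5          # generated frames per second
AUDIO_BITRATE_BPS = 2200      # 2.2 kbps output audio
CODEBOOK_SIZE = 2048          # assumed Mimi codebook size

bits_per_frame = AUDIO_BITRATE_BPS / FRAME_RATE_HZ        # 176.0 bits per frame
bits_per_token = math.log2(CODEBOOK_SIZE)                 # 11 bits per audio token
audio_tokens_per_frame = bits_per_frame / bits_per_token  # 16 audio tokens per frame
print(bits_per_frame, audio_tokens_per_frame)             # 176.0 16.0
```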
At inference, Hibiki-Zero continuously encodes the input user speech and produces **real-time speech and text translation**. Our model relies on simple temperature sampling and is thus compatible with batching, unlike models using complex inference policies. It can also run **batched inference 3x faster than real-time** on a single H100 GPU, as demonstrated by our [inference code](https://github.com/kyutai-labs/hibiki-zero). Hibiki-Zero only supports a single speaker in a single language per session. However, it shows zero-shot capabilities for translation with voice transfer of multiple speakers with different languages in the same audio.
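As a minimal sketch of what "simple temperature sampling" looks like (and why it batches trivially): the function below is generic PyTorch, and the commented streaming loop uses hypothetical names (`model.step`, `mimi_stream`, `playback`), not the actual hibiki-zero API.

```python
import torch

def sample_frame(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """Plain temperature sampling over one frame of logits.

    `logits` has shape (batch, vocab). Because every sequence in the batch
    advances by exactly one 80 ms frame per step, the same call serves
    real-time and batched inference alike.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Hypothetical streaming loop -- illustrative placeholders only:
#
#   for input_frame in mimi_stream(microphone):    # encode input at 12.5 Hz
#       logits = model.step(input_frame)           # one forward pass per frame
#       output_frame = sample_frame(logits)        # text + audio tokens
#       playback.write(mimi.decode(output_frame))  # continuous output audio
```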
- **Developed by:** Kyutai
- **Model type:** Simultaneous speech-to-speech and speech-to-text translation.
- **Languages:** {French,Spanish,Portuguese,German}-to-English
- **License:** CC BY-NC-SA 4.0
### Model Sources
- **Paper:** [Simultaneous Speech-to-Speech Translation Without Aligned Data][paper]
- **Inference code:** [github.com/kyutai-labs/hibiki-zero](https://github.com/kyutai-labs/hibiki-zero)
- **Examples:** [huggingface.co/spaces/kyutai/hibiki-zero-samples](https://huggingface.co/spaces/kyutai/hibiki-zero-samples)
---
## Usage
### Direct Use
The model can be used for streaming translation from French, Spanish, Portuguese and German to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds.
### Downstream Use
Some components of the model can be used independently or repurposed relatively easily. For instance, the [Mimi](https://huggingface.co/kyutai/mimi) codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5Hz with a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems. A sketch of using Mimi on its own is given below. Regarding the main Hibiki-Zero architecture, we demonstrated that it is possible to fine-tune it to a new input language with less than 1000h of speech; the method is described in our [paper][paper].
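The sketch below reuses Mimi alone through its `transformers` integration; the 8-quantizer setting corresponds to the 1.1kbps configuration mentioned above.

```python
# Minimal sketch: encoding and decoding speech with the Mimi codec alone,
# via its `transformers` integration (pip install transformers torch).
import numpy as np
from transformers import AutoFeatureExtractor, MimiModel

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
model = MimiModel.from_pretrained("kyutai/mimi")

# One second of silence at Mimi's 24 kHz sample rate; replace with real speech.
audio = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(
    raw_audio=audio,
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

# 8 quantizers is the 1.1 kbps configuration mentioned above.
codes = model.encode(inputs["input_values"], num_quantizers=8).audio_codes
audio_out = model.decode(codes).audio_values  # shape: (batch, channels, samples)
print(codes.shape)  # (batch, num_quantizers, frames) at 12.5 frames per second
```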
### Out-of-Scope Use
The model is not intended to be used to impersonate other people, nor for any malicious use of any kind.
## How to Get Started with the Model
See the [README](https://github.com/kyutai-labs/hibiki-zero) file for the inference code.
---
## Training Details
### Training Data
- *Textual data:* The underlying text LLM [Helium-1-2B](https://huggingface.co/kyutai/helium-1-2b) is trained on a data mix including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.
- *Audio data:*
- **Unsupervised audio dataset:** This dataset, used for audio pretraining, is a large collection of readily available audio content in French, Spanish, Portuguese, German and English. Our data mixture contains approximately 12% audio in each source language, 50% English, and less than 2% Italian (see [Section 4.2.2][paper]).
- **Speech translation dataset:** This dataset, used for speech translation training and reinforcement, contains around 40k hours of real speech data for each source language, paired with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]).
- **Speech translation fine-tuning dataset:** This dataset is a small 200h resynthesized subset of the *speech translation dataset* with natural pauses to improve audio quality and speech naturalness (see [Section 4.2.5][paper]).
### Training procedure and hyper-parameters
The different training stages along with the hyper-parameters are detailed in the [paper][paper].
### Compute Infrastructure
The final model was trained on 48 H100 Nvidia GPUs.
---
## Citation
If you use this model, please cite:
```bibtex
@unpublished{hibikizero2026,
  title={Simultaneous Speech-to-Speech Translation Without Aligned Data},
  author={Tom Labiausse and Romain Fabre and Yannick Estève and Alexandre Défossez and Neil Zeghidour},
  note={Preprint},
  year={2026},
  url={https://arxiv.org/abs/2602.11072v1}
}
```
[paper]: https://arxiv.org/abs/2602.11072v1