
Hibiki-Zero

Hibiki-Zero is a model for simultaneous speech translation. Traditional approaches to building simultaneous translation systems rely on supervised training with word-level alignments between the source and target content. Hibiki-Zero eliminates the need for word-level alignments entirely, which fundamentally simplifies the training pipeline and enables seamless scaling to multiple languages with varying grammatical structures.

Hibiki-Zero supports translation from 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese and 🇩🇪 German to 🇬🇧 English. At inference, Hibiki-Zero adapts its flow to accumulate just enough context to produce real-time, natural-sounding speech translation with voice transfer, along with a text translation. Hibiki-Zero can also be adapted to a new input language with less than 1,000 hours of speech data.


Model Details

This is the model referred to simply as Hibiki-Zero in our paper: a 3B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5 Hz, with audio generated at a bitrate of 2.2 kbps.
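
As a rough consistency check on these numbers (a minimal sketch; the 2048-entry codebook size, i.e. 11 bits per audio token, is an assumption taken from the Mimi codec design and is not stated in this card):

import math

frame_rate_hz = 12.5                       # model framerate: one step every 80 ms
codebook_size = 2048                       # assumed Mimi codebook size
bits_per_token = math.log2(codebook_size)  # = 11 bits per audio token

bitrate_bps = 2200                                   # 2.2 kbps, as stated above
bits_per_frame = bitrate_bps / frame_rate_hz         # 176 bits per frame
tokens_per_frame = bits_per_frame / bits_per_token   # = 16 audio tokens per step

print(f"{bits_per_frame:.0f} bits/frame -> {tokens_per_frame:.0f} audio tokens per 80 ms step")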

Model Description

Hibiki-Zero is a decoder-only model that receives and generates audio tokens produced by the streaming neural audio codec Mimi. It leverages the same multistream architecture as Moshi and Hibiki to model source and target speech jointly. This allows Hibiki-Zero to continuously process the input stream while generating target speech and text tokens at a constant framerate of 12.5 Hz, producing a continuous output audio stream along with a timestamped text translation. Hibiki-Zero consists of a main backbone of 3 billion parameters.
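
To make the streaming behavior concrete, here is a conceptual sketch of the loop described above. The helper names (init_state, step, encode_frame, decode_frame) are hypothetical placeholders, not the actual moshi / Hibiki-Zero API; see the repository README for the real inference code.

FRAME_MS = 80  # 12.5 Hz framerate -> one decoding step every 80 ms

def translate_stream(source_frames, model, mimi):
    """Consume source speech frame by frame, yield translated audio and text."""
    state = model.init_state()                           # hypothetical decoder state
    for frame in source_frames:                          # one 80 ms chunk of input audio
        src_tokens = mimi.encode_frame(frame)            # source Mimi tokens for this step
        out = model.step(state, src_tokens)              # jointly samples the target streams
        tgt_audio = mimi.decode_frame(out.audio_tokens)  # 80 ms of translated speech
        yield tgt_audio, out.text_token                  # timestamped text comes out alongside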

At inference, Hibiki-Zero continuously encodes the input user speech and produces real-time speech and text translation. The model relies on simple temperature sampling and is therefore compatible with batching, unlike models that use complex inference policies. Batched inference runs 3x faster than real-time on a single H100 GPU, as demonstrated by our inference code. Hibiki-Zero only supports a single speaker in a single language per session; however, it shows zero-shot capabilities for translation with voice transfer when multiple speakers with different languages appear in the same audio.
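
The sampling step itself is plain temperature sampling over next-token logits, which is trivially batchable since it operates row by row. A minimal PyTorch sketch (the default temperature value here is illustrative, not the one used by our inference code):

import torch

def sample_with_temperature(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """Sample one token id per row from [batch, vocab_size] logits."""
    if temperature <= 0:
        return logits.argmax(dim=-1)                     # greedy fallback
    probs = torch.softmax(logits / temperature, dim=-1)  # temperature-scaled distribution
    return torch.multinomial(probs, num_samples=1).squeeze(-1)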

  • Developed by: Kyutai
  • Model type: Simultaneous speech-to-speech and speech-to-text translation.
  • Languages: {French,Spanish,Portuguese,German}-to-English
  • License: CC BY-NC-SA 4.0

Model Sources

  • Paper: https://arxiv.org/abs/2602.11072v1

Usage

Direct Use

The model can be used for streaming translation from French, Spanish, Portuguese and German to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds.

Downstream Use

Some components of the model can be used independently or repurposed relatively easily. For instance, the Mimi codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at a framerate of 12.5 Hz and a bitrate of 1.1 kbps, which makes it particularly well suited to training speech language models or text-to-speech systems. Regarding the main Hibiki-Zero architecture, we demonstrated that it can be fine-tuned to adapt to a new input language with less than 1,000 hours of speech, and we describe the method in our paper.
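
As an illustration of reusing Mimi on its own, here is a sketch based on the Mimi integration in Hugging Face Transformers (the "kyutai/mimi" checkpoint id and the exact encode/decode signatures are assumptions; check the Mimi model card for the exact API):

import torch
from transformers import AutoFeatureExtractor, MimiModel

model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

# One second of silent 24 kHz audio as a stand-in for real speech.
waveform = torch.zeros(feature_extractor.sampling_rate).numpy()
inputs = feature_extractor(raw_audio=waveform,
                           sampling_rate=feature_extractor.sampling_rate,
                           return_tensors="pt")

codes = model.encode(inputs["input_values"]).audio_codes  # discrete audio tokens
audio = model.decode(codes).audio_values                  # waveform reconstructed from tokens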

Out-of-Scope Use

The model is not intended to be used to impersonate other people, nor for any other malicious use of any kind.

How to Get Started with the Model

See the README file for the inference code.
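
For reference, the checkpoint itself can be fetched from the Hub with huggingface_hub; the actual streaming and batched inference entry points are documented in the README:

from huggingface_hub import snapshot_download

# Download the Hibiki-Zero weights and config files locally.
local_dir = snapshot_download("kyutai/hibiki-zero-3b-pytorch-bf16")
print(local_dir)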


Training Details

Training Data

  • Textual data: The underlying text LLM, Helium-1-2B, is trained on a mix of data including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.

  • Audio data:

    • Unsupervised audio dataset: This dataset, used for audio pretraining, is a large collection of readily available audio content in French, Spanish, Portuguese, German and English. Our data mixture contains approximately 12% of audio in each source language, 50% English, and less than 2% Italian (see Section 4.2.2).
    • Speech translation dataset: This dataset, used for speech translation training and reinforcement, contains around 40k hours of real speech data for each source language, paired with synthetic sentence-level aligned speech in English (see Sections 4.2.3 and 4.2.4).
    • Speech translation fine-tuning dataset: This dataset is a small 200h resynthesized subset of the speech translation dataset with natural pauses to improve audio quality and speech naturalness (see Section 4.2.5).

Training procedure and hyper-parameters

The different training stages along with the hyper-parameters are detailed in the paper.

Compute Infrastructure

The final model was trained on 48 H100 Nvidia GPUs.


Citation

If you use this model, please cite:

@unpublished{hibikizero2026,
  title={Simultaneous Speech-to-Speech Translation Without Aligned Data},
  author={Tom Labiausse and Romain Fabre and Yannick Estève and Alexandre Défossez and Neil Zeghidour},
  note={Preprint},
  year={2026},
  url={https://arxiv.org/abs/2602.11072v1}
}