---
language:
- fr
- es
- pt
- de
- en
metrics:
- bleu
- comet
base_model:
- kyutai/hibiki-zero-3b-pytorch-bf16
pipeline_tag: audio-to-audio
---
# Hibiki-Zero
[Hibiki-Zero](https://github.com/kyutai-labs/hibiki-zero) is a model for **simultaneous speech translation**. Traditional approaches to building simultaneous translation systems rely on supervised training with word-level aligned data between the source and the target content. Hibiki-Zero eliminates the need for word-level alignments entirely, which fundamentally simplifies the training pipeline and enables **seamless scaling to multiple languages** with varying grammatical structures.
Hibiki-Zero supports translation from 🇫🇷 French, 🇪🇸 Spanish, 🇵🇹 Portuguese and 🇩🇪 German to 🇬🇧 English. At inference, Hibiki-Zero **adapts its flow**, accumulating just enough context to produce real-time, natural speech translation with voice transfer, along with a text translation. Hibiki-Zero can also be **adapted to a new input language with less than 1000h of speech data**.
---
## Model Details
This is the model simply referred to as *Hibiki-Zero* in our [paper][paper], a 3B-parameter hierarchical Transformer producing speech and text tokens at a framerate of 12.5Hz, with audio being generated at a 2.2kbps bitrate.
### Model Description
Hibiki-Zero is a decoder-only model that can receive and generate audio tokens produced by the streaming neural audio codec [Mimi](https://huggingface.co/kyutai/mimi). It leverages the same **multistream** architecture as [Moshi](https://arxiv.org/abs/2410.00037) or [Hibiki](https://arxiv.org/abs/2502.03382) to model source and target speech jointly. This allows Hibiki-Zero to continuously process the input stream while generating target speech and text tokens at a constant framerate of 12.5Hz, producing a **continuous output audio stream** along with a timestamped text translation. Hibiki-Zero consists of a main backbone of 3 billion parameters.
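For intuition on these numbers, here is a back-of-the-envelope token budget per frame. The frame rate and bitrate come from this card; the 2048-entry codebook size is an assumption carried over from Mimi's public configuration, not something stated here.

```python
# Back-of-the-envelope token budget for one generated frame.
# ASSUMPTION: Mimi codebooks hold 2048 entries (11 bits per audio token);
# the frame rate and bitrate are taken from the model card above.
import math

FRAME_RATE_HZ = 12.5          # generated frames per second
AUDIO_BITRATE_BPS = 2200      # 2.2 kbps output audio
CODEBOOK_SIZE = 2048          # assumed Mimi codebook size

bits_per_frame = AUDIO_BITRATE_BPS / FRAME_RATE_HZ        # 176.0 bits per frame
bits_per_token = math.log2(CODEBOOK_SIZE)                 # 11 bits per audio token
audio_tokens_per_frame = bits_per_frame / bits_per_token  # 16 audio tokens per frame
print(bits_per_frame, audio_tokens_per_frame)             # 176.0 16.0
```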
At inference, Hibiki-Zero continuously encodes the input user speech and produces **real-time speech and text translation**. Our model relies on simple temperature sampling and is thus compatible with batching, unlike models using complex inference policies. It can also run **batched inference 3x faster than real-time** on a single H100 GPU, as demonstrated by our [inference code](https://github.com/kyutai-labs/hibiki-zero). Hibiki-Zero only supports a single speaker in a single language per session. However, it shows zero-shot capabilities for translation with voice transfer of multiple speakers with different languages in the same audio.
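As a minimal sketch of what "simple temperature sampling" looks like (and why it batches trivially): the function below is generic PyTorch, and the commented streaming loop uses hypothetical names (`model.step`, `mimi_stream`, `playback`), not the actual hibiki-zero API.

```python
import torch

def sample_frame(logits: torch.Tensor, temperature: float = 0.8) -> torch.Tensor:
    """Plain temperature sampling over one frame of logits.

    `logits` has shape (batch, vocab). Because every sequence in the batch
    advances by exactly one 80 ms frame per step, the same call serves
    real-time and batched inference alike.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

# Hypothetical streaming loop -- illustrative placeholders only:
#
#   for input_frame in mimi_stream(microphone):    # encode input at 12.5 Hz
#       logits = model.step(input_frame)           # one forward pass per frame
#       output_frame = sample_frame(logits)        # text + audio tokens
#       playback.write(mimi.decode(output_frame))  # continuous output audio
```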
- **Developed by:** Kyutai
- **Model type:** Simultaneous speech-to-speech and speech-to-text translation.
- **Languages:** {French,Spanish,Portuguese,German}-to-English
- **License:** CC BY-NC-SA 4.0
### Model Sources
- **Paper:** [Simultaneous Speech-to-Speech Translation Without Aligned Data][paper]
- **Inference code:** [github.com/kyutai-labs/hibiki-zero](https://github.com/kyutai-labs/hibiki-zero)
- **Examples:** [huggingface.co/spaces/kyutai/hibiki-zero-samples](https://huggingface.co/spaces/kyutai/hibiki-zero-samples)
---
## Usage
### Direct Use
The model can be used for streaming translation from French, Spanish, Portuguese and German to English in real-time settings, or for batched simultaneous translation of many input sequences. It is robust to noisy conditions and was trained on sequences of up to 120 seconds.
### Downstream Use
Some components of the model can be used independently or repurposed relatively easily. For instance, the [Mimi](https://huggingface.co/kyutai/mimi) codec is a state-of-the-art neural audio codec that combines semantic and acoustic information into audio tokens running at 12.5Hz with a bitrate of 1.1kbps, which makes it particularly well suited for training speech language models or text-to-speech systems. A sketch of using Mimi on its own is given below. Regarding the main Hibiki-Zero architecture, we demonstrated that it is possible to fine-tune it to a new input language with less than 1000h of speech; the method is described in our [paper][paper].
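The sketch below reuses Mimi alone through its `transformers` integration; the 8-quantizer setting corresponds to the 1.1kbps configuration mentioned above.

```python
# Minimal sketch: encoding and decoding speech with the Mimi codec alone,
# via its `transformers` integration (pip install transformers torch).
import numpy as np
from transformers import AutoFeatureExtractor, MimiModel

feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
model = MimiModel.from_pretrained("kyutai/mimi")

# One second of silence at Mimi's 24 kHz sample rate; replace with real speech.
audio = np.zeros(feature_extractor.sampling_rate, dtype=np.float32)
inputs = feature_extractor(
    raw_audio=audio,
    sampling_rate=feature_extractor.sampling_rate,
    return_tensors="pt",
)

# 8 quantizers is the 1.1 kbps configuration mentioned above.
codes = model.encode(inputs["input_values"], num_quantizers=8).audio_codes
audio_out = model.decode(codes).audio_values  # shape: (batch, channels, samples)
print(codes.shape)  # (batch, num_quantizers, frames) at 12.5 frames per second
```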
### Out-of-Scope Use
The model is not intended to be used to impersonate other people, nor for any malicious use of any kind.
## How to Get Started with the Model
See the [README](https://github.com/kyutai-labs/hibiki-zero) file for the inference code.
---
## Training Details
### Training Data
- *Textual data:* The underlying text LLM [Helium-1-2B](https://huggingface.co/kyutai/helium-1-2b) is trained on a data mix including Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.
- *Audio data:*
- **Unsupervised audio dataset:** This dataset, used for audio pretraining, is a large collection of readily available audio content in French, Spanish, Portuguese, German and English. Our data mixture contains approximately 12% audio in each source language, 50% English, and less than 2% Italian (see [Section 4.2.2][paper]).
- **Speech translation dataset:** This dataset, used for speech translation training and reinforcement, contains around 40k hours of real speech data for each source language, paired with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]).
- **Speech translation fine-tuning dataset:** This dataset is a small 200h resynthesized subset of the *speech translation dataset* with natural pauses to improve audio quality and speech naturalness (see [Section 4.2.5][paper]).
### Training procedure and hyper-parameters
The different training stages along with the hyper-parameters are detailed in the [paper][paper].
### Compute Infrastructure
The final model was trained on 48 H100 Nvidia GPUs.
---
## Citation
If you use this model, please cite:
```bibtex
@unpublished{hibikizero2026,
  title={Simultaneous Speech-to-Speech Translation Without Aligned Data},
  author={Tom Labiausse and Romain Fabre and Yannick Estève and Alexandre Défossez and Neil Zeghidour},
  note={Preprint},
  year={2026},
  url={https://arxiv.org/abs/2602.11072v1}
}
```
[paper]: https://arxiv.org/abs/2602.11072v1