Audio-to-Audio · hibiki
t0m1ab committed · verified
Commit 73175ce · 1 parent: 44f8b4a
Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -74,13 +74,13 @@ See the [README](https://github.com/kyutai-labs/hibiki-zero) file for the infere
 - *Textual data:* The underlying text LLM model [Helium-1-2B](https://huggingface.co/kyutai/helium-1-2b) is trained on a mix of data including: Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.
 
 - *Audio data:*
-  - **Unsupervised audio dataset:** This dataset used for audio pretraining is a large collection readily available audio content in French, Spanish, Portuguese, German and English following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037). Our data mixture contains approximately 12% of audio in each source language, 50% of English and less than 2% of Italian (see [Section 4.2.2][paper]).
-  - **Speech translation dataset:** This dataset used for speech translation training and reinforcement and contains around 40k hours of real speech data for each source language with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]).
+  - **Unsupervised audio dataset:** This dataset, used for audio pretraining, is a large collection of readily available audio content in French, Spanish, Portuguese, German and English. Our data mixture contains approximately 12% of audio in each source language, 50% in English and less than 2% in Italian (see [Section 4.2.2][paper]).
+  - **Speech translation dataset:** This dataset, used for speech translation training and reinforcement, contains around 40k hours of real speech data for each source language with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]).
   - **Speech translation fine-tuning dataset:** This dataset is a small 200h resynthesized subset of the *speech translation dataset* with natural pauses to improve audio quality and speech naturalness (see [Section 4.2.5][paper]).
 
 ### Training procedure and hyper-parameters
 
-The different stages of the training procedure are detailled in the [paper][paper] along with the hyper-parameters.
+The different training stages, along with the hyper-parameters, are detailed in the [paper][paper].
 
 ### Compute Infrastructure
 
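The mixture percentages in the *unsupervised audio dataset* bullet cover the whole corpus: 4 × 12% for the source languages, plus 50% English and roughly 2% Italian. As a purely editorial illustration, here is a minimal Python sketch of those proportions expressed as per-language sampling weights; the `MIXTURE` table and `sample_language` helper are hypothetical and not part of Kyutai's training code:

```python
import random

# Per-language weights matching the mixture described in the card
# (Section 4.2.2 of the paper): ~12% per source language, 50% English,
# and <2% Italian. Illustrative only; not the actual training pipeline.
MIXTURE = {
    "fr": 0.12,
    "es": 0.12,
    "pt": 0.12,
    "de": 0.12,
    "en": 0.50,
    "it": 0.02,
}

# The weights should account for (approximately) the entire corpus.
assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9

def sample_language(rng: random.Random) -> str:
    """Draw the language of the next audio sample according to the mixture."""
    langs, weights = zip(*MIXTURE.items())
    return rng.choices(langs, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_language(rng) for _ in range(10)])  # e.g. mostly "en" draws
```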