Audio-to-Audio · hibiki
t0m1ab committed · verified
Commit 73175ce · 1 parent: 44f8b4a
Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -74,13 +74,13 @@ See the [README](https://github.com/kyutai-labs/hibiki-zero) file for the infere
 - *Textual data:* The underlying text LLM model [Helium-1-2B](https://huggingface.co/kyutai/helium-1-2b) is trained on a mix of data including: Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.
 
 - *Audio data:*
-  - **Unsupervised audio dataset:** This dataset used for audio pretraining is a large collection readily available audio content in French, Spanish, Portuguese, German and English following the preprocessing and recipe of [Moshi](https://arxiv.org/abs/2410.00037). Our data mixture contains approximately 12% of audio in each source language, 50% of English and less than 2% of Italian (see [Section 4.2.2][paper]).
-  - **Speech translation dataset:** This dataset used for speech translation training and reinforcement and contains around 40k hours of real speech data for each source language with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]).
+  - **Unsupervised audio dataset:** This dataset, used for audio pretraining, is a large collection of readily available audio content in French, Spanish, Portuguese, German and English. Our data mixture contains approximately 12% of audio in each source language, 50% in English and less than 2% in Italian (see [Section 4.2.2][paper]).
+  - **Speech translation dataset:** This dataset, used for speech translation training and reinforcement, contains around 40k hours of real speech data for each source language with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]).
   - **Speech translation fine-tuning dataset:** This dataset is a small 200h resynthesized subset of the *speech translation dataset* with natural pauses to improve audio quality and speech naturalness (see [Section 4.2.5][paper]).
 
 ### Training procedure and hyper-parameters
 
-The different stages of the training procedure are detailled in the [paper][paper] along with the hyper-parameters.
+The different training stages, along with the hyper-parameters, are detailed in the [paper][paper].
 
 ### Compute Infrastructure
 
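The mixture percentages in the *unsupervised audio dataset* bullet cover the whole corpus: 4 × 12% for the source languages, plus 50% English and roughly 2% Italian. As a purely editorial illustration, here is a minimal Python sketch of those proportions expressed as per-language sampling weights; the `MIXTURE` table and `sample_language` helper are hypothetical and not part of Kyutai's training code:

```python
import random

# Per-language weights matching the mixture described in the card
# (Section 4.2.2 of the paper): ~12% per source language, 50% English,
# and <2% Italian. Illustrative only; not the actual training pipeline.
MIXTURE = {
    "fr": 0.12,
    "es": 0.12,
    "pt": 0.12,
    "de": 0.12,
    "en": 0.50,
    "it": 0.02,
}

# The weights should account for (approximately) the entire corpus.
assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9

def sample_language(rng: random.Random) -> str:
    """Draw the language of the next audio sample according to the mixture."""
    langs, weights = zip(*MIXTURE.items())
    return rng.choices(langs, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_language(rng) for _ in range(10)])  # e.g. mostly "en" draws
```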