update
README.md CHANGED
@@ -74,13 +74,13 @@ See the [README](https://github.com/kyutai-labs/hibiki-zero) file for the infere
 - *Textual data:* The underlying text LLM model [Helium-1-2B](https://huggingface.co/kyutai/helium-1-2b) is trained on a mix of data including: Wikipedia, Stack Exchange, open-access scientific articles (from peS2o) and Common Crawl.

 - *Audio data:*
-  - **Unsupervised audio dataset:** This dataset used for audio pretraining is a large collection readily available audio content in French, Spanish, Portuguese, German and English
+  - **Unsupervised audio dataset:** This dataset, used for audio pretraining, is a large collection of readily available audio content in French, Spanish, Portuguese, German and English. Our data mixture contains approximately 12% of audio in each source language, 50% of English and less than 2% of Italian (see [Section 4.2.2][paper]).
-  - **Speech translation dataset:** This dataset used for speech translation training and reinforcement
+  - **Speech translation dataset:** This dataset, used for speech translation training and reinforcement, contains around 40k hours of real speech data for each source language with synthetic sentence-level aligned speech in English (see [Sections 4.2.3 and 4.2.4][paper]).
   - **Speech translation fine-tuning dataset:** This dataset is a small 200h resynthesized subset of the *speech translation dataset* with natural pauses to improve audio quality and speech naturalness (see [Section 4.2.5][paper]).

 ### Training procedure and hyper-parameters

-The different stages
+The different training stages, along with the hyper-parameters, are detailed in the [paper][paper].

 ### Compute Infrastructure
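For readers who want to sanity-check the percentages added in the **Unsupervised audio dataset** entry, here is a minimal sketch of that mixture written out as a plain Python mapping. The language codes and the exact Italian share (stated above only as "less than 2%") are illustrative assumptions; only the quoted percentages come from the diff.

```python
# Hypothetical sketch of the audio pretraining mixture described above
# (Section 4.2.2 of the paper). Language codes and the exact Italian
# share are assumptions for illustration, not values from the repository.
audio_mixture = {
    "en": 0.50,  # English, ~50%
    "fr": 0.12,  # French, ~12%
    "es": 0.12,  # Spanish, ~12%
    "pt": 0.12,  # Portuguese, ~12%
    "de": 0.12,  # German, ~12%
    "it": 0.02,  # Italian, reported as less than 2%
}

# Under this reading the quoted shares account for the whole mixture
# (4 * 12% + 50% + 2% = 100%).
assert abs(sum(audio_mixture.values()) - 1.0) < 1e-9
```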