Speech Data
Selected open-source speech data
MLCommons/peoples_speech • Updated Nov 20, 2024 • 8.05M • 26.9k • 263
speechcolab/gigaspeech • Updated Feb 7 • 11.9M • 13.2k • 155
keithito/lj_speech • Updated Aug 14, 2024 • 1.02k • 60
legacy-datasets/common_voice • Updated Aug 22, 2024 • 1.84k • 144
text-pretrain-data
Some pretraining datasets for LLMs
allenai/MADLAD-400 • Updated Sep 9, 2024 • 628k • 159
CASIA-LM/ChineseWebText • Updated Nov 13, 2023 • 1k • 3.39k • 44
allenai/dolma • Updated Apr 17, 2024 • 3.37k • 1.01k
allenai/peS2o • Updated Oct 13, 2024 • 6.33k • 196