---
license: mit
language:
- bn
- hi
- mr
- gu
- ta
- te
base_model: ResembleAI/chatterbox
tags:
- text-to-speech
- tts
- bengali
- hindi
- marathi
- gujarati
- tamil
- telugu
- indic
- chatterbox
- fine-tuned
- zero-shot-tts
- speech
- speech-synthesis
datasets:
- ai4bharat/Shrutilipi
- ai4bharat/Rasa
- SPRINGLab/IndicTTS_Bengali
- SPRINGLab/IndicTTS_Gujarati
- SPRINGLab/IndicTTS_Marathi
- SPRINGLab/IndicTTS_Tamil
- SPRINGLab/IndicTTS_Telugu
---

# ChatterBox Desi

A fine-tuned version of [ResembleAI/chatterbox](https://huggingface.co/ResembleAI/chatterbox) for text-to-speech synthesis in **6 Indic languages**: Bengali, Hindi, Marathi, Gujarati, Tamil, and Telugu.

## Zero-shot TTS Output

| Language | Reference | Output | Text |
|--|--|--|--|
| Bengali (বাংলা) | | | আমরা কেউ মাষ্টার হতে চেয়েছিলাম, কেউ ডাক্তার, কেউ উকিল। অমলকান্তি সে-সব কিছু হতে চায়নি। সে রোদ্দুর হতে চেয়েছিল! |
| Hindi (हिंदी) | | | हम में से कुछ मास्टर बनना चाहते थे, कुछ डॉक्टर, कुछ वकील। अमलकांति उन सब कुछ बनना नहीं चाहता था। वह धूप बनना चाहता था! |
| Marathi (मराठी) | | | आम्ही कोणीतरी मास्टर होऊ इच्छित होतो, कोणीतरी डॉक्टर, कोणीतरी वकील. अमलकांती त्या सगळ्या काही होऊ इच्छित नव्हता. तो सूर्य होऊ इच्छित होता! |
| Gujarati (ગુજરાતી) | | | અમલકાંતિ તે બધું બનવું નથી માંગતો હતો. તે ધૂપ બનવું માંગતો હતો! |
| Tamil (தமிழ்) | | | நாங்கள் யாரும் மாஸ்டர் ஆக விரும்பவில்லை, யாரும் டாக்டர் ஆக விரும்பவில்லை, யாரும் வக்கீல் ஆக விரும்பவில்லை. அமல்காந்தி அந்த எல்லாவற்றையும் ஆக விரும்பவில்லை. அவன் வெயிலாக இருக்க விரும்பினான்! |
| Telugu (తెలుగు) | | | మనలో కొందరు మాస్టర్ కావాలని కోరుకున్నారు, కొందరు డాక్టర్ కావాలని కోరుకున్నారు, కొందరు వకీల్ కావాలని కోరుకున్నారు. అమల్కాంతి ఆ అన్ని కావాలని కోరుకోలేదు. అతను సూర్యుడిగా ఉండాలని కోరుకున్నాడు! |

## Model Details

- **Base model**: ResembleAI/chatterbox, the multilingual ChatterBox (supports 23 languages)
- **Fine-tuned on**: a speech corpus covering 6 Indic languages (~424 hours, 216,819 samples)
  - ai4bharat/Shrutilipi (Bengali, Hindi splits)
  - ai4bharat/Rasa (Bengali, Hindi, Marathi, Gujarati, Tamil, Telugu splits)
  - SPRINGLab/IndicTTS (Bengali, Gujarati, Marathi, Tamil, Telugu)
- **Training steps**: 10,000
- **Architecture**: T3 (Text-to-Token Transformer) + HiFT-GAN vocoder
- **Vocabulary**: Extended from 2,530 to 2,820 tokens to cover the scripts of all 6 languages
- **Language tagging**: Text must be prefixed with a language tag (e.g.
`[bn]`, `[hi]`, `[mr]`, `[gu]`, `[ta]`, `[te]`)

### Training Data by Language

| Language | Code | Samples | Hours |
|----------|------|---------|-------|
| Bengali | bn | 58,820 | 99.95 |
| Gujarati | gu | 32,604 | 73.17 |
| Hindi | hi | 12,116 | 21.55 |
| Marathi | mr | 37,899 | 72.70 |
| Tamil | ta | 39,437 | 72.74 |
| Telugu | te | 35,943 | 84.04 |
| **Total** | | **216,819** | **424.15** |

## Requirements

```bash
git clone https://github.com/gokhaneraslan/chatterbox-finetuning
cd chatterbox-finetuning
pip install -r requirements.txt
```

### One-time patch (upstream vocab resize fix)

The upstream `chatterbox-finetuning` repo initialises T3 with a hard-coded 704-token vocabulary, which causes a size mismatch when loading this model (vocab=2820). Apply this patch once, before running inference:

```bash
# Run from inside the cloned chatterbox-finetuning directory
python - <<'EOF'
import pathlib
import re

f = pathlib.Path("src/chatterbox_/tts.py")
txt = f.read_text()

# Match the two upstream lines whatever their indentation is.
pattern = re.compile(
    r'^(?P<indent>[ \t]*)t3 = T3\(\)\n'
    r'[ \t]*t3_state = load_file\(ckpt_dir / "t3_cfg\.safetensors"\)',
    re.MULTILINE,
)

def patched(m):
    # Load the checkpoint first, then size T3 from its text embedding table.
    indent = m.group("indent")
    return (
        f'{indent}t3_state = load_file(ckpt_dir / "t3_cfg.safetensors")\n'
        f'{indent}from .models.t3.modules.t3_config import T3Config\n'
        f'{indent}t3 = T3(hp=T3Config(text_tokens_dict_size=t3_state["text_emb.weight"].shape[0]))'
    )

new_txt, count = pattern.subn(patched, txt, count=1)
if count:
    f.write_text(new_txt)
    print("Patched tts.py")
else:
    print("Pattern not found; tts.py left unchanged")
EOF
```

## Usage

**Important**: Text must be prefixed with a language tag: `[bn]`, `[hi]`, `[mr]`, `[gu]`, `[ta]`, or `[te]`.
```python
import sys

# Make the cloned (and patched) chatterbox-finetuning repo importable
sys.path.insert(0, "/path/to/chatterbox-finetuning")

import torchaudio
from huggingface_hub import snapshot_download
from src.chatterbox_.tts import ChatterboxTTS

# Download the model files and load them onto the GPU
model_dir = snapshot_download("BosonLab/chatterbox-desi")
model = ChatterboxTTS.from_local(model_dir, device="cuda")

# Bengali
text = "[bn] আমি বাংলায় কথা বলতে পারি। এটি একটি পরীক্ষামূলক বাক্য।"
wav = model.generate(text)
torchaudio.save("output_bn.wav", wav, model.sr)

# Hindi
text = "[hi] मैं हिंदी में बोल सकता हूँ। यह एक परीक्षण वाक्य है।"
wav = model.generate(text)
torchaudio.save("output_hi.wav", wav, model.sr)

# Marathi
text = "[mr] मी मराठीत बोलू शकतो. हे एक चाचणी वाक्य आहे."
wav = model.generate(text)
torchaudio.save("output_mr.wav", wav, model.sr)

# Gujarati
text = "[gu] હું ગુજરાતીમાં બોલી શકું છું. આ એક પ્રાયોગિક વાક્ય છે."
wav = model.generate(text)
torchaudio.save("output_gu.wav", wav, model.sr)

# Tamil
text = "[ta] நான் தமிழில் பேச முடியும். இது ஒரு சோதனை வாக்கியம்."
wav = model.generate(text)
torchaudio.save("output_ta.wav", wav, model.sr)

# Telugu
text = "[te] నేను తెలుగులో మాట్లాడగలను. ఇది ఒక పరీక్ష వాక్యం."
wav = model.generate(text)
torchaudio.save("output_te.wav", wav, model.sr)
```

## With Voice Cloning

```python
# Pass a reference recording to clone its voice
wav = model.generate(text, audio_prompt_path="reference.wav")
```

## Files

| File | Description |
|------|-------------|
| `t3_cfg.safetensors` | Fine-tuned T3 text-to-token transformer (6 Indic langs, vocab=2820) |
| `s3gen.safetensors` | Speech codec decoder (unchanged from base) |
| `ve.safetensors` | Voice encoder (unchanged from base) |
| `conds.pt` | Conditioning embeddings (unchanged from base) |
| `tokenizer.json` | Tokenizer extended with the scripts of the 6 Indic languages |

## Training Data

All audio was resampled to 16 kHz. Text was cleaned, normalized, and prefixed with language tags. The datasets come from the public AI4Bharat and SPRINGLab releases (CC BY 4.0).
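
The text preparation described above might look roughly like the sketch below. The exact cleaning rules used for this model are not documented here, so the NFC normalization, the whitespace collapsing, and the helper name `prepare_text` are assumptions made for illustration, not the actual training pipeline.

```python
import re
import unicodedata

# Language codes listed in this model card.
SUPPORTED_LANGS = ("bn", "hi", "mr", "gu", "ta", "te")


def prepare_text(text: str, lang: str) -> str:
    """Hypothetical helper: clean a transcript and prefix its language tag."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    # Assumption: NFC normalization and whitespace collapsing stand in for
    # the undocumented "cleaned, normalized" step.
    cleaned = unicodedata.normalize("NFC", text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    if not cleaned:
        raise ValueError("text is empty after cleaning")
    return f"[{lang}] {cleaned}"
```

The same helper could be reused at inference time so that prompts are shaped like the training transcripts, for example `model.generate(prepare_text(raw_text, "bn"))`.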
## Language Tags

Prefix your text with the appropriate language tag for best results:

| Language | Tag | Script |
|----------|-----|--------|
| Bengali | `[bn]` | Bengali |
| Hindi | `[hi]` | Devanagari |
| Marathi | `[mr]` | Devanagari |
| Gujarati | `[gu]` | Gujarati |
| Tamil | `[ta]` | Tamil |
| Telugu | `[te]` | Telugu |

## Limitations

- Optimized for 6 Indic languages; output quality for other languages may be degraded
- The language tag prefix is required for correct language identification
- Best results with clear, well-punctuated text
- Emotion control is inherited from the base ChatterBox multilingual model
- Requires the `chatterbox-finetuning` kit because of the extended vocabulary
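
Because a missing or mismatched tag is an easy mistake, input can be checked before it reaches `model.generate`. The following is a minimal sketch, not part of the model or the fine-tuning kit: the helper name `check_tagged_text` and the "at least half of the letters" threshold are assumptions, while the tag-to-script mapping comes from the Language Tags table and the code point ranges are the standard Unicode blocks for those scripts.

```python
import re

# Unicode block ranges for the scripts in the Language Tags table.
SCRIPT_RANGES = {
    "Bengali": (0x0980, 0x09FF),
    "Devanagari": (0x0900, 0x097F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}

TAG_TO_SCRIPT = {
    "bn": "Bengali",
    "hi": "Devanagari",
    "mr": "Devanagari",
    "gu": "Gujarati",
    "ta": "Tamil",
    "te": "Telugu",
}

TAG_RE = re.compile(r"^\[([a-z]{2})\]\s*(.*)$", re.DOTALL)


def check_tagged_text(tagged: str) -> str:
    """Hypothetical helper: return the language code if tag and script agree."""
    match = TAG_RE.match(tagged)
    if match is None:
        raise ValueError("text must start with a language tag such as [bn]")
    lang, body = match.groups()
    if lang not in TAG_TO_SCRIPT:
        raise ValueError(f"unsupported language tag: [{lang}]")
    lo, hi = SCRIPT_RANGES[TAG_TO_SCRIPT[lang]]
    # Only letters are counted; digits, punctuation and vowel signs are ignored.
    letters = [ch for ch in body if ch.isalpha()]
    in_script = sum(lo <= ord(ch) <= hi for ch in letters)
    # Assumption: at least half of the letters should be in the tag's script,
    # which still tolerates a few embedded Latin-script words.
    if not letters or in_script * 2 < len(letters):
        raise ValueError(f"text does not look like {TAG_TO_SCRIPT[lang]} script")
    return lang
```

This only catches script mismatches. Hindi and Marathi share Devanagari, so a `[hi]` tag on Marathi text would pass the check.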