IndexTTS-Rust/  (Complete Directory Structure)
│
├── indextts/                          # Main Python package (194 files)
│   │
│   ├── __init__.py                    # Package initialization
│   ├── cli.py                         # Command-line interface (64 lines)
│   ├── infer.py                       # Original inference (v1) - 690 lines
│   ├── infer_v2.py                    # Main inference v2 - 739 lines ★★★
│   │
│   ├── gpt/                           # GPT-based TTS model (9 files, 16,953 lines)
│   │   ├── __init__.py
│   │   ├── model.py                   # Original UnifiedVoice (713L)
│   │   ├── model_v2.py                # UnifiedVoice v2 ★★★ (747L)
│   │   ├── conformer_encoder.py       # Conformer encoder ★★ (520L)
│   │   ├── perceiver.py               # Perceiver resampler (317L)
│   │   ├── transformers_gpt2.py       # GPT-2 implementation (1,878L)
│   │   ├── transformers_generation_utils.py  # Generation utilities (4,747L)
│   │   ├── transformers_beam_search.py       # Beam search (1,013L)
│   │   └── transformers_modeling_utils.py    # Model utilities (5,525L)
│   │
│   ├── BigVGAN/                       # Neural vocoder (6+ files, ~1,000+ lines)
│   │   ├── __init__.py
│   │   ├── models.py                  # BigVGAN architecture ★★★
│   │   ├── ECAPA_TDNN.py              # Speaker encoder
│   │   ├── activations.py             # Snake, SnakeBeta activations
│   │   ├── utils.py                   # Helper functions
│   │   │
│   │   ├── alias_free_activation/     # CUDA kernel variants
│   │   │   ├── cuda/
│   │   │   │   ├── activation1d.py    # CUDA kernel loader
│   │   │   │   └── load.py
│   │   │   └── torch/
│   │   │       ├── act.py             # PyTorch activation
│   │   │       ├── filter.py          # Anti-aliasing filter
│   │   │       └── resample.py        # Resampling
│   │   │
│   │   ├── alias_free_torch/          # PyTorch-only fallback
│   │   │   ├── act.py
│   │   │   ├── filter.py
│   │   │   └── resample.py
│   │   │
│   │   └── nnet/                      # Network modules
│   │       ├── linear.py
│   │       ├── normalization.py
│   │       └── CNN.py
│   │
│   ├── s2mel/                         # Semantic-to-mel models (~500+ lines)
│   │   ├── modules/                   # Core modules (10+ files)
│   │   │   ├── audio.py               # Mel-spectrogram computation ★
│   │   │   ├── commons.py             # Common utilities (21KB)
│   │   │   ├── layers.py              # NN layers (13KB)
│   │   │   ├── length_regulator.py    # Duration modeling
│   │   │   ├── flow_matching.py       # Continuous flow matching
│   │   │   ├── diffusion_transformer.py  # Diffusion model
│   │   │   ├── rmvpe.py               # Pitch extraction (22KB)
│   │   │   ├── quantize.py            # Quantization
│   │   │   ├── encodec.py             # EnCodec codec
│   │   │   ├── wavenet.py             # WaveNet implementation
│   │   │   │
│   │   │   ├── bigvgan/               # BigVGAN vocoder
│   │   │   │   ├── modules.py
│   │   │   │   ├── config.json
│   │   │   │   ├── bigvgan.py
│   │   │   │   ├── alias_free_activation/  # Variants
│   │   │   │   └── models.py
│   │   │   │
│   │   │   ├── vocos/                 # Vocos codec
│   │   │   ├── hifigan/               # HiFiGAN vocoder
│   │   │   ├── openvoice/             # OpenVoice components (11 files)
│   │   │   ├── campplus/              # CAMPPlus speaker encoder
│   │   │   │   └── DTDNN.py           # DTDNN architecture
│   │   │   └── gpt_fast/              # Fast GPT inference
│   │   │
│   │   ├── dac/                       # DAC codec
│   │   │   ├── model/
│   │   │   ├── nn/
│   │   │   └── utils/
│   │   │
│   │   └── (other s2mel implementations)
│   │
│   ├── utils/                         # Text & feature utils (12+ files, ~500L)
│   │   ├── __init__.py
│   │   ├── front.py                   # TextNormalizer, TextTokenizer ★★★ (700L)
│   │   ├── maskgct_utils.py           # Semantic codec builders (250L)
│   │   ├── arch_util.py               # AttentionBlock, utilities
│   │   ├── checkpoint.py              # Model loading
│   │   ├── xtransformers.py           # Transformer utils (1,600L)
│   │   ├── feature_extractors.py      # MelSpectrogramFeatures
│   │   ├── common.py                  # Common functions
│   │   ├── text_utils.py              # Text utilities
│   │   ├── typical_sampling.py        # TypicalLogitsWarper sampling
│   │   ├── utils.py                   # General utils
│   │   ├── webui_utils.py             # Web UI helpers
│   │   ├── tagger_cache/              # Text normalization cache
│   │   │
│   │   └── maskgct/                   # MaskGCT codec (100+ files, 10KB+)
│   │       ├── models/
│   │       ├── codec/                 # Multiple codec implementations
│   │       │   ├── amphion_codec/     # Amphion codec
│   │       │   │   ├── codec.py
│   │       │   │   ├── vocos.py
│   │       │   │   └── quantize/      # Quantization
│   │       │   │       ├── vector_quantize.py
│   │       │   │       ├── residual_vq.py
│   │       │   │       ├── factorized_vector_quantize.py
│   │       │   │       └── lookup_free_quantize.py
│   │       │   │
│   │       │   ├── facodec/           # FACodec variant
│   │       │   │   ├── facodec_inference.py
│   │       │   │   ├── modules/
│   │       │   │   │   ├── commons.py
│   │       │   │   │   ├── attentions.py
│   │       │   │   │   ├── layers.py
│   │       │   │   │   ├── quantize.py
│   │       │   │   │   ├── wavenet.py
│   │       │   │   │   ├── style_encoder.py
│   │       │   │   │   ├── gradient_reversal.py
│   │       │   │   │   └── JDC/       # Pitch detection
│   │       │   │   └── alias_free_torch/  # Anti-aliasing
│   │       │   │
│   │       │   ├── speechtokenizer/   # SpeechTokenizer codec
│   │       │   │   ├── model.py
│   │       │   │   └── modules/
│   │       │   │       ├── seanet.py
│   │       │   │       ├── lstm.py
│   │       │   │       ├── norm.py
│   │       │   │       ├── conv.py
│   │       │   │       └── quantization/
│   │       │   │
│   │       │   ├── ns3_codec/         # NS3 codec variant
│   │       │   ├── vevo/              # VEVo codec
│   │       │   ├── kmeans/            # KMeans codec
│   │       │   ├── melvqgan/          # MelVQ-GAN codec
│   │       │   │
│   │       │   ├── codec_inference.py
│   │       │   ├── codec_sampler.py
│   │       │   ├── codec_trainer.py
│   │       │   └── codec_dataset.py
│   │       │
│   │       └── tts/
│   │           └── maskgct/
│   │               ├── maskgct_s2a.py  # Semantic-to-acoustic
│   │               └── ckpt/
│   │
│   └── vqvae/                         # Vector-quantized VAE
│       ├── xtts_dvae.py               # Discrete VAE (currently disabled)
│       └── (other VAE components)
│
├── examples/                          # Sample data & test cases
│   ├── cases.jsonl                    # Example test cases
│   ├── voice_*.wav                    # Sample voice prompts (12 files)
│   ├── emo_*.wav                      # Emotion reference samples (2 files)
│   └── sample_prompt.wav              # Default prompt (implied)
│
├── tests/                             # Test suite
│   ├── regression_test.py             # Main regression tests ★
│   └── padding_test.py                # Padding/batch tests
│
├── tools/                             # Utility scripts & i18n
│   ├── download_files.py              # Model downloading from HF
│   └── i18n/                          # Internationalization
│       ├── i18n.py                    # Translation system
│       ├── scan_i18n.py               # i18n scanner
│       └── locale/
│           ├── en_US.json             # English translations
│           └── zh_CN.json             # Chinese translations
│
├── archive/                           # Historical docs
│   └── README_INDEXTTS_1_5.md         # IndexTTS 1.5 documentation
│
├── webui.py                           # Gradio Web UI ★★★ (18KB)
├── cli.py                             # Command-line interface
├── requirements.txt                   # Python dependencies
├── MANIFEST.in                        # Package manifest
├── .gitignore                         # Git ignore rules
├── .gitattributes                     # Git attributes
└── LICENSE                            # Apache 2.0 License

────────────────────────────────────────────────────────────────────────────────
KEY FILES BY IMPORTANCE:
────────────────────────────────────────────────────────────────────────────────

★★★ CRITICAL (Core Logic - MUST Convert First)
  1. indextts/infer_v2.py            - Main inference pipeline (739L)
  2. indextts/gpt/model_v2.py        - UnifiedVoice GPT model (747L)
  3. indextts/utils/front.py         - Text processing (700L)
  4. indextts/BigVGAN/models.py      - Vocoder (1,000+L)
  5. indextts/s2mel/modules/audio.py - Mel-spectrogram (83L, critical DSP)
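The mel-spectrogram module is small but DSP-critical: everything downstream consumes its output. As a reference point for the conversion, the Hz↔mel mapping at its core can be sketched as follows. This is a minimal sketch assuming the common HTK-style formula; the project's exact formula, parameter names, and `mel_band_edges` helper are illustrative assumptions, not its actual code.

```python
import math

def hz_to_mel(hz: float) -> float:
    """Map a frequency in Hz onto the perceptual mel scale (HTK formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels: int = 80, fmin: float = 0.0, fmax: float = 11025.0):
    """Edges of n_mels triangular bands spaced evenly on the mel scale.

    Defaults mirror the stats above: 80 bins, fmax at Nyquist for 22,050 Hz.
    Returns n_mels + 2 edge frequencies in Hz.
    """
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]
```

Because the forward and inverse maps must round-trip exactly, this pair is a good candidate for a cross-language regression test between the Python and Rust implementations.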

★★ HIGH PRIORITY (Major Components)
  1. indextts/gpt/conformer_encoder.py - Conformer blocks (520L)
  2. indextts/gpt/perceiver.py         - Perceiver attention (317L)
  3. indextts/utils/maskgct_utils.py   - Codec builders (250L)
  4. indextts/s2mel/modules/commons.py - Common utilities (21KB)
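The perceiver resampler's job is to turn a variable-length feature sequence into a fixed-length one: a fixed set of learned latent queries cross-attends over the input. A dependency-free sketch of that idea (single-head attention, tiny dimensions, random "learned" latents — all illustrative assumptions, not the module's real architecture):

```python
import math
import random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def perceiver_resample(features, num_latents=4, dim=8, seed=0):
    """Cross-attend num_latents query vectors over `features`, a list of
    dim-sized vectors of ANY length; output length is always num_latents."""
    rng = random.Random(seed)
    # Stand-in for learned latent queries.
    latents = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_latents)]
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in latents:
        # Attention weights of this query over every input position.
        weights = softmax([dot(q, f) * scale for f in features])
        # Weighted sum of the input vectors.
        out.append([sum(w * f[d] for w, f in zip(weights, features))
                    for d in range(dim)])
    return out
```

The invariant worth preserving in a port is exactly this: output shape depends only on the latent count, never on the input length.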

★ MEDIUM PRIORITY (Utilities & Optimization)
  1. indextts/utils/xtransformers.py  - Transformer utils (1,600L)
  2. indextts/BigVGAN/activations.py  - Activation functions
  3. indextts/s2mel/modules/rmvpe.py  - Pitch extraction (22KB)
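The Snake and SnakeBeta activations in activations.py follow the published BigVGAN formulation, f(x) = x + sin²(αx)/α, which injects a periodic bias while staying monotone-ish for small α. A scalar sketch of what the module applies elementwise to tensors (the functions below are simplified stand-ins, not the project's tensor code):

```python
import math

def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: identity plus a periodic term, f(x) = x + sin^2(ax)/a."""
    return x + (1.0 / alpha) * math.sin(alpha * x) ** 2

def snake_beta(x: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """SnakeBeta: decouples the frequency (alpha) from the magnitude (1/beta)
    of the periodic term."""
    return x + (1.0 / beta) * math.sin(alpha * x) ** 2
```

Two properties make for cheap port-verification tests: f(0) = 0, and f(x) = x whenever αx is a multiple of π (the sine term vanishes).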

OPTIONAL (Web UI, Tools)
  1. webui.py                - Gradio interface
  2. tools/download_files.py - Model downloading

────────────────────────────────────────────────────────────────────────────────
TOTAL STATISTICS:
────────────────────────────────────────────────────────────────────────────────
Total Python Files:   194
Total Lines of Code:  ~25,000+
  GPT Module:         16,953 lines
  MaskGCT Codecs:     ~10,000+ lines
  S2Mel Models:       ~2,000+ lines
  BigVGAN:            ~1,000+ lines
  Utils:              ~500 lines
  Tests:              ~100 lines

Models Supported:     6 major HuggingFace models
Languages:            Chinese (full), English (full), mixed
Emotion Dimensions:   8-dimensional emotion control
Audio Sample Rate:    22,050 Hz (primary)
Max Text Tokens:      120
Max Mel Tokens:       250
Mel Spectrogram Bins: 80
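The headline constants above, gathered into one place for reference during the port. The values come from this document; the class and field names are illustrative, not the project's actual configuration keys.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexTTSConfig:
    """Key inference constants (names are hypothetical; values from the stats above)."""
    sample_rate: int = 22_050   # Hz, primary output rate
    n_mels: int = 80            # mel-spectrogram bins
    max_text_tokens: int = 120
    max_mel_tokens: int = 250
    emotion_dims: int = 8       # size of the emotion-control vector

cfg = IndexTTSConfig()
```

Centralizing these in one frozen struct (or a Rust `const` block) avoids the magic numbers currently scattered across the Python modules.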