Training Guide

This guide covers preparing the data, configuring the training setup, and launching GRPO training for the ThinkSound model. Read through all steps before starting.


Step 1: Prepare the Dataset

Before training, you must preprocess the dataset following the instructions in Dataset.md. This includes:

  1. Converting raw videos and CoT annotations into structured feature .npz files.
  2. Constructing a valid dataset configuration JSON that points to all precomputed features.

Make sure your extracted dataset includes all required modalities and is organized correctly.
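Before moving on, it can help to confirm that every feature file referenced by the dataset configuration actually exists. The sketch below is one way to do that under stated assumptions: the schema of the configuration JSON is not described here, so the code walks the whole structure and treats any string ending in .npz or .json as a path. The helper names and the demo keys (datasets, id, features) are hypothetical.

```python
import os
import tempfile

# Assumption: referenced files can be recognized by these suffixes.
PATH_SUFFIXES = (".npz", ".json")


def collect_paths(node):
    """Recursively yield every string in a parsed JSON value that ends with a known suffix."""
    if isinstance(node, dict):
        for value in node.values():
            yield from collect_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_paths(item)
    elif isinstance(node, str) and node.endswith(PATH_SUFFIXES):
        yield node


def find_missing_paths(config):
    """Return the referenced paths that do not exist on disk, sorted."""
    return sorted(p for p in set(collect_paths(config)) if not os.path.exists(p))


# Demo with a throwaway config; the keys below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "clip_0001.npz")
    absent = os.path.join(tmp, "clip_0002.npz")
    open(present, "wb").close()
    demo_config = {"datasets": [{"id": "demo", "features": [present, absent]}]}
    demo_missing = find_missing_paths(demo_config)
    print([os.path.basename(p) for p in demo_missing])
```

For a real run, load the configuration with json.load and pass the result to find_missing_paths. If the configuration points at directories rather than individual files, the suffix filter would need to be adapted.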


Step 2: Configure Training Script

Open scripts/PrismAudio/grpo_1node8gpus.sh and customize the following items:

Under the grpo/config section, set the paths to your model and configuration files:

  • model_config: Path to the model architecture config (e.g., ThinkSound/configs/model_configs/prismaudio.json)
  • pretransform_ckpt_path: Path to the pretrained model checkpoint (e.g., ckpts/prismaudio.ckpt)
  • dataset_config: Path to your dataset configuration JSON prepared in Step 1
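A quick pre-flight check on these three paths can catch typos before a multi-GPU job is launched. The following is a minimal sketch, not part of the ThinkSound codebase: check_training_inputs is a hypothetical helper that verifies each file exists and that the two JSON configs parse. It does not inspect the checkpoint contents.

```python
import json
import os
import tempfile


def check_training_inputs(model_config, pretransform_ckpt_path, dataset_config):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    named_paths = {
        "model_config": model_config,
        "pretransform_ckpt_path": pretransform_ckpt_path,
        "dataset_config": dataset_config,
    }
    for name, path in named_paths.items():
        if not os.path.isfile(path):
            problems.append(f"{name}: file not found: {path}")
    # Both configs are JSON files, so a parse failure is worth catching early.
    for name in ("model_config", "dataset_config"):
        path = named_paths[name]
        if os.path.isfile(path):
            try:
                with open(path, encoding="utf-8") as handle:
                    json.load(handle)
            except json.JSONDecodeError as err:
                problems.append(f"{name}: invalid JSON ({err.msg}, line {err.lineno})")
    return problems


# Demo with temporary files standing in for the real paths.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "model.json")
    bad = os.path.join(tmp, "dataset.json")
    with open(good, "w", encoding="utf-8") as handle:
        handle.write("{}")
    with open(bad, "w", encoding="utf-8") as handle:
        handle.write("{not json")
    demo_problems = check_training_inputs(good, os.path.join(tmp, "missing.ckpt"), bad)
    for line in demo_problems:
        print(line)
```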

Also modify distributed training settings as needed:

  • num_gpus, num_nodes, node_rank, MASTER_PORT, etc.

  • (Optional) Enable debug mode by adding the --debug flag when running the script.

🔍 Tip

If you're using a multi-GPU setup, ensure the WORLD_SIZE, NODE_RANK, and MASTER_PORT are correctly set for your environment. These are critical for DistributedDataParallel (DDP) training.
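As a sanity check on these values, the sketch below shows how they usually relate, assuming the common convention of one training process per GPU with ranks assigned node by node. Whether the launch script follows exactly this convention is an assumption. All three function names are illustrative, and the port number in the demo is only an example.

```python
import socket


def world_size(num_nodes, gpus_per_node):
    """Total number of training processes, assuming one process per GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_rank, gpus_per_node, local_rank):
    """Global rank of a process, assuming ranks are assigned node by node."""
    return node_rank * gpus_per_node + local_rank


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to host:port."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.bind((host, port))
    except OSError:
        return False
    return True


# One node with eight GPUs, matching the name of grpo_1node8gpus.sh.
print(world_size(num_nodes=1, gpus_per_node=8))
# Third GPU (local_rank=2) on the second node (node_rank=1) of an 8-GPU-per-node cluster.
print(global_rank(node_rank=1, gpus_per_node=8, local_rank=2))
# Example port only; use the value you plan to assign to MASTER_PORT.
print(port_is_free(29500))
```

Run the port check on the node that will act as master, shortly before launching, since a port that is free now can still be taken later.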


Step 3: Configure Reward Functions (Optional)

ThinkSound supports two optional reward functions during GRPO training. To enable them, provide the corresponding reference paths when extracting features (see Dataset.md):

| Reward | Required Argument | Description |
|---|---|---|
| Synchformer | --add_video_path | Enables audio-visual synchronization reward |
| ITD | --add_audio_path | Enables inter-track distance reward using reference audio |

These paths are embedded into the extracted feature files during dataset preparation. If they are present, GRPO training uses them automatically.
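The "used if present" behavior can be pictured as a simple lookup on each feature record. The sketch below is illustrative only: the field names video_path and audio_path are hypothetical stand-ins for whatever keys the extraction script actually writes, and enabled_rewards is not a function from the ThinkSound codebase.

```python
# Hypothetical field names; check the feature extraction code for the real ones.
REWARD_REFERENCE_KEYS = {
    "Synchformer": "video_path",
    "ITD": "audio_path",
}


def enabled_rewards(feature_record):
    """Return the optional rewards whose reference path is present and non-empty."""
    return [
        reward
        for reward, key in REWARD_REFERENCE_KEYS.items()
        if feature_record.get(key)
    ]


# A record extracted with a video reference only.
print(enabled_rewards({"video_path": "videos/demo.mp4"}))
# A record extracted with both references.
print(enabled_rewards({"video_path": "videos/demo.mp4", "audio_path": "audio/demo.wav"}))
```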


Step 4: Launch Training

Make the script executable (if not already) and start training:

chmod +x scripts/PrismAudio/grpo_1node8gpus.sh
./scripts/PrismAudio/grpo_1node8gpus.sh

Logs will be written to the specified log directory (log_dir).
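To follow a run without knowing the exact log file name, one option is to look up the most recently modified file under log_dir and print its last lines. This is a generic sketch that assumes nothing about the log format or file naming. latest_file and tail_lines are hypothetical helpers.

```python
import os
import tempfile
from collections import deque


def latest_file(log_dir):
    """Return the most recently modified regular file under log_dir, or None if empty."""
    newest_path, newest_mtime = None, float("-inf")
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            mtime = os.path.getmtime(path)
            if mtime > newest_mtime:
                newest_path, newest_mtime = path, mtime
    return newest_path


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file without trailing newlines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


# Demo on a temporary directory standing in for log_dir.
with tempfile.TemporaryDirectory() as tmp:
    older = os.path.join(tmp, "run_a.log")
    newer = os.path.join(tmp, "nested", "run_b.log")
    os.makedirs(os.path.dirname(newer))
    for path, lines in ((older, ["old\n"]), (newer, ["step 1\n", "step 2\n", "step 3\n"])):
        with open(path, "w", encoding="utf-8") as handle:
            handle.writelines(lines)
    os.utime(older, (1_000, 1_000))  # force distinct modification times
    os.utime(newer, (2_000, 2_000))
    demo_latest = os.path.relpath(latest_file(tmp), tmp)
    demo_tail = tail_lines(latest_file(tmp), count=2)
    demo_empty = latest_file(os.path.join(tmp, "nested", "..", "nested")) is not None
    print(demo_latest, demo_tail)
```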


Step 5: Customize Model and Training Parameters

To modify model architecture or training strategy, open the model config file specified in grpo/config. You can adjust a wide range of parameters, such as:

  • Number of model parameters
  • Optimizer type
  • Learning rate
  • Latent dimension
  • GRPO-specific reward weights
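To give a sense of what the reward weights influence, the sketch below shows the general GRPO idea: component rewards are combined into one scalar per sample, then standardized within the group of samples generated from the same input. This follows the commonly described form of GRPO and is an assumption about the overall shape, not a copy of the ThinkSound implementation. The reward names, the weights, and the choice of population standard deviation are illustrative.

```python
import statistics


def combine_rewards(rewards, weights):
    """Weighted sum of per-sample reward components, keyed by reward name."""
    return sum(weights[name] * value for name, value in rewards.items())


def group_relative_advantages(group_rewards, eps=1e-8):
    """Standardize scalar rewards within one group of samples from the same input."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    # eps keeps the division defined when every sample in the group scores the same.
    return [(reward - mean) / (std + eps) for reward in group_rewards]


# Hypothetical weights and component scores for a group of three samples.
weights = {"sync": 2.0, "itd": 1.0}
group = [
    {"sync": 0.2, "itd": 0.6},
    {"sync": 0.5, "itd": 0.5},
    {"sync": 0.8, "itd": 0.4},
]
scalars = [combine_rewards(sample, weights) for sample in group]
advantages = group_relative_advantages(scalars)
print([round(a, 3) for a in advantages])
```

Raising one weight relative to the others makes that reward dominate the ranking inside each group, which is why weight changes tend to shift what the policy optimizes for rather than simply scaling the loss.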

Be sure to keep a backup of your config for reproducibility.
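One lightweight way to keep such backups is to copy the config next to the run with a timestamp and a short content hash in the file name, so that two backups with the same hash are known to be identical. The sketch below assumes nothing about the config beyond it being a file. backup_config and the backups directory name are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile
import time


def backup_config(config_path, backup_dir):
    """Copy config_path into backup_dir, tagging the name with a timestamp and content hash."""
    with open(config_path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()[:12]
    stamp = time.strftime("%Y%m%d-%H%M%S")
    stem, ext = os.path.splitext(os.path.basename(config_path))
    os.makedirs(backup_dir, exist_ok=True)
    target = os.path.join(backup_dir, f"{stem}.{stamp}.{digest}{ext}")
    shutil.copy2(config_path, target)  # copy2 also preserves file metadata
    return target


# Demo with a temporary file standing in for the model config.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "prismaudio.json")
    with open(source, "w", encoding="utf-8") as handle:
        handle.write('{"demo": true}')
    demo_backup = backup_config(source, os.path.join(tmp, "backups"))
    demo_backup_exists = os.path.isfile(demo_backup)
    print(os.path.basename(demo_backup))
```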


Happy training! 🚀
If you run into any issues, consider opening an issue or checking the documentation for detailed help.