stereo2spatial-v1

stereo2spatial-v1 is a DiT model for converting mono or stereo audio into 12-channel 7.1.4 spatial audio at 48 kHz.

The model is intended to be used with the stereo2spatial codebase. The bundle contains the model weights, runtime config, and bundled EAR-VAE assets needed for inference.

Model Summary

  • Architecture: SpatialDiT
  • Sample rate: 48,000 Hz
  • Latent FPS: 50
  • Output layout: 7.1.4
  • Output channels: 12
  • Channel order: FL, FR, FC, LFE, BL, BR, SL, SR, TFL, TFR, TBL, TBR
  • Hidden size: 1024
  • Layers: 12
  • Heads: 16
  • Latent dim: 64
  • Memory tokens: 32
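The channel order above maps the 12 output channels to 7.1.4 speaker positions. As a small illustration of working with that layout, here is a hypothetical helper (not part of the stereo2spatial codebase) that splits a flat, interleaved sample sequence into named channels:

```python
# Channel order as documented in the model summary above.
CHANNEL_ORDER = ["FL", "FR", "FC", "LFE", "BL", "BR",
                 "SL", "SR", "TFL", "TFR", "TBL", "TBR"]

def deinterleave(frames, num_channels=12):
    """Split a flat interleaved sample list into {channel_name: [samples]}.

    frames is assumed to be interleaved frame-by-frame, i.e.
    [FL0, FR0, ..., TBR0, FL1, FR1, ...].
    """
    assert len(frames) % num_channels == 0, "not a whole number of frames"
    return {
        name: list(frames[i::num_channels])
        for i, name in enumerate(CHANNEL_ORDER[:num_channels])
    }
```

This is only a sketch of the documented ordering; the inference CLI writes a standard multichannel WAV, so most audio tools will handle the interleaving for you.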

Training Summary

This v1 release was trained for 440,000 total steps:

  • Stage 1, part 1: 200,000 steps without GAN
  • Stage 1, part 2: 200,000 additional steps with GAN enabled
  • Stage 2: 40,000 steps with GAN enabled

Intended Use

This model is intended for:

  • research and experimentation in stereo-to-spatial generation
  • local inference workflows that render mono/stereo audio to 7.1.4
  • prototyping multichannel music and immersive-audio pipelines

This model is not a drop-in replacement for professional mastering, QC, or broadcast authoring workflows.

Limitations

  • The model is trained for a 7.1.4 output layout; do not expect other layouts to work without retraining or exporting a different target-channel setup.
  • Results are input-dependent and may introduce artifacts, unstable imaging, or balance issues on difficult material.
  • Each output channel passes through a VAE that was trained on stereo content rather than on individual channels from spatial tracks, so results can occasionally sound subpar. At some point I may fine-tune the VAE on per-channel outputs to improve quality without having to retrain this model.
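When checking rendered output for the imaging and balance issues noted above, a quick stereo fold-down can help. The sketch below is illustrative only: the gains are common rule-of-thumb values (about -3 dB for center, LFE, surround, and height channels), not a standards-compliant downmix and not anything the model or CLI specifies:

```python
import math

# Channel order as documented in the model summary.
ORDER = ["FL", "FR", "FC", "LFE", "BL", "BR",
         "SL", "SR", "TFL", "TFR", "TBL", "TBR"]

G = 1 / math.sqrt(2)  # ~ -3 dB; an assumption, not the model's spec

# Per-channel contribution to the left and right fold-down buses.
LEFT = {"FL": 1.0, "FC": G, "LFE": G, "BL": G, "SL": G, "TFL": G, "TBL": G}
RIGHT = {"FR": 1.0, "FC": G, "LFE": G, "BR": G, "SR": G, "TFR": G, "TBR": G}

def downmix_frame(frame):
    """Fold one 12-sample frame (ordered as ORDER) down to an (L, R) pair."""
    s = dict(zip(ORDER, frame))
    left = sum(g * s[ch] for ch, g in LEFT.items())
    right = sum(g * s[ch] for ch, g in RIGHT.items())
    return left, right
```

For serious QC, use a proper renderer or monitor on an actual 7.1.4 setup; this is only a convenience for a first listen.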

Quick Start

From a local checkout of the stereo2spatial code repository:

python -m venv .venv
. .venv/bin/activate  # Windows PowerShell: .\.venv\Scripts\Activate.ps1
pip install -e .
python -m pip install -U "huggingface_hub[cli]"
hf download francislabounty/stereo2spatial-v1 --local-dir checkpoints/stereo2spatial-v1
python infer.py --checkpoint checkpoints/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda --show-progress

The recommended usage is to point --checkpoint at the downloaded bundle directory. The inference CLI will:

  • read config.json
  • load weights from model.safetensors
  • auto-discover the bundled EAR-VAE files under vae/
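Before running inference, it can be worth sanity-checking that the download completed. A minimal check, assuming only the bundle contents listed above (config.json, model.safetensors, and a vae/ directory; any further structure is not assumed):

```python
import json
from pathlib import Path

def check_bundle(checkpoint_dir):
    """Verify the expected bundle files exist and return the parsed config."""
    root = Path(checkpoint_dir)
    missing = [name for name in ("config.json", "model.safetensors", "vae")
               if not (root / name).exists()]
    if missing:
        raise FileNotFoundError(f"bundle incomplete, missing: {missing}")
    return json.loads((root / "config.json").read_text())
```

If this raises, re-run the `hf download` command from the quick start.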

Example Inference Command

python infer.py --checkpoint checkpoints/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda --show-progress --report-json outputs/report.json

Useful flags:

  • --device cpu to run on CPU
  • --solver auto|heun|euler|unipc|... to change the latent solver
  • --normalize-peak to normalize the rendered WAV before writing

License

This model is released under the Apache 2.0 license.
