Foundation-1 Training & Dataset Notes
This page provides a high-level overview of the training setup, dataset composition, and design philosophy behind Foundation-1.
It is intended as a companion to the main model page, which focuses on capabilities, prompting, and audio examples.
Training Summary
Foundation-1 was trained as a structured text-to-sample diffusion model designed for music production workflows, rather than general-purpose music captioning.
Hardware
- GPUs: 2 × NVIDIA RTX A6000
- System RAM: 128 GB
Training Run
- Training stopped at step: 183,474
Audio Configuration
- Sample Rate: 44,100 Hz
- Bit Depth: 16-bit
- Channels: Stereo
- Sample Length: 882,000 samples (20 seconds at 44.1 kHz)
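These values are self-consistent; a quick check of the arithmetic:

```python
# Sanity check on the audio configuration above.
sample_rate = 44_100       # Hz
num_samples = 882_000      # samples per training window
channels = 2               # stereo
bytes_per_sample = 2       # 16-bit PCM

duration = num_samples / sample_rate                   # 20.0 seconds exactly
raw_bytes = num_samples * channels * bytes_per_sample  # 3,528,000 bytes (~3.4 MiB) of raw PCM
print(f"{duration} s, {raw_bytes} bytes per window")
```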
Technical Training Notes
Foundation-1 was fine-tuned from stabilityai/stable-audio-open-1.0 using a diffusion transformer architecture with structured prompt conditioning.
Model Architecture
- Backbone: Diffusion Transformer (DiT)
- Transformer Depth: 24 layers
- Attention Heads: 24
- Embedding Dimension: 1536
- Conditioning Dimension: 768
- Text Encoder: t5-base
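As a rough sketch, the hyperparameters above fit together as follows. The field names here are illustrative, not the exact stable-audio-tools config schema:

```python
# Illustrative summary of the architecture hyperparameters listed above.
# Field names are hypothetical; the real config schema differs.
foundation1_dit = {
    "backbone": "diffusion_transformer",
    "depth": 24,                # transformer layers
    "num_heads": 24,            # attention heads per layer
    "embed_dim": 1536,          # token width (1536 / 24 = 64 dims per head)
    "cond_dim": 768,            # conditioning width, matching t5-base's
    "text_encoder": "t5-base",  # t5-base hidden size is 768, hence cond_dim
}
```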
Optimization
- Optimizer: AdamW
- Learning Rate: 5e-5
- Weight Decay: 1e-3
- Scheduler: InverseLR
- EMA: Enabled
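A hedged sketch of this setup in PyTorch. The InverseLR schedule is approximated with a LambdaLR, since its exact parameters (inv_gamma, power, warmup) were not published, and the EMA decay value is assumed:

```python
import copy
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the DiT backbone

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-3)

# InverseLR-style decay: lr(step) = base_lr * (1 + step / inv_gamma) ** -power.
# inv_gamma and power are assumed values, not Foundation-1's actual settings.
inv_gamma, power = 1_000_000, 0.5
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: (1 + step / inv_gamma) ** -power
)

# Simple EMA of the weights, updated after each optimizer step (decay assumed).
ema_model = copy.deepcopy(model)
ema_decay = 0.999

@torch.no_grad()
def update_ema():
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - ema_decay)
```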
Dataset Overview
Dataset Totals
- Total WAV files: 3,784,862
- Total Dataset Size: 7.100 TiB
Important Note on Dataset Scale
While the dataset is large, its scale reflects the post-augmentation training set, not a flat count of unique, unrelated one-off melodic phrases.
All samples were hand-labeled first, then processed through a controlled augmentation pipeline designed to expand sonic and conditioning coverage.
As a result, a single melodic phrase may be represented across many different contexts, including variations in:
- timbre
- FX treatment
- key
- BPM
- loop structure
- tonal emphasis
This design was intentional. The goal was not simply to maximize the number of isolated phrases, but to teach the model how musical ideas translate across different sonic identities and production scenarios.
Dataset and Training Philosophy
Foundation-1 was built around a structured sample-generation philosophy, rather than generic or genre-based audio captioning.
The dataset consists entirely of hand-labeled audio, organized around a layered prompt and conditioning system designed to reflect how producers actually think about sound.
At a high level, the training design emphasizes:
- structured musical loops
- instrument hierarchy
- explicit timbre representation
- dedicated FX descriptors
- notation-aware prompt terms
- key / tempo / bar-aware looping
- strong production relevance
- broad reuse for compositional workflows
This design is central to the model’s musical coherence, prompt controllability, and production-facing behavior.
Why the Dataset Was Structured This Way
Most audio generation systems treat prompts as broad descriptive captions.
Foundation-1 was instead trained to understand sound as a layered system with separable controls.
That structure includes:
Instrument Identity
Broad family and sub-family control over what kind of sound is being generated.
Timbre
Direct conditioning over tonal character, texture, density, brightness, width, grit, warmth, and other sonic traits.
FX Context
Dedicated processing descriptors such as reverb, delay, distortion, phasing, and bitcrushing.
Musical Structure
Notation-aware terms that encourage coherent phrasing, melodic behavior, rhythmic structure, and harmonic motion.
Timing and Tonality
Explicit support for BPM, bar count, and key-aware sample generation.
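To make the layering concrete, a hypothetical prompt can be assembled from one term per layer. The actual tag vocabulary is documented on the main model page, so these exact strings are illustrative:

```python
# Hypothetical layered prompt; treat the specific tags as placeholders.
prompt_layers = {
    "instrument": "pluck synth",                     # instrument identity (family / sub-family)
    "timbre": "warm, wide, slightly gritty",         # tonal character
    "fx": "long reverb, ping-pong delay",            # dedicated FX descriptors
    "structure": "arpeggiated 16th-note phrase",     # notation-aware terms
    "timing_tonality": "124 BPM, 4 bars, A minor",   # BPM / bar / key conditioning
}
prompt = ", ".join(prompt_layers.values())
```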
This layered design is one of the main reasons Foundation-1 can produce outputs that feel both musically structured and sonically steerable.
Coverage and Augmentation
The augmentation strategy was designed to improve coverage, control, and generalization.
Rather than treating each source phrase as a single static example, labeled material was expanded across multiple sonic and musical contexts.
This helps the model learn relationships between:
- melody and timbre
- phrase behavior and instrumentation
- sound design and FX treatment
- tonal setting and loop structure
- tempo and musical feel
In practice, this encourages the model to learn reusable musical relationships, rather than simply memorizing isolated recordings.
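A minimal sketch of the expansion idea, assuming a simple Cartesian product over a few conditioning axes (the real pipeline and its axes are not published in this detail):

```python
from itertools import product

# Hypothetical conditioning axes; the actual augmentation pipeline covers
# timbre, FX treatment, key, BPM, loop structure, and tonal emphasis.
keys = ["A minor", "C major", "F# minor"]
bpms = [90, 124, 140]
fx_chains = ["dry", "plate reverb", "bitcrush + delay"]

def expand(phrase_path):
    """Yield one training example per (key, bpm, fx) context for a source phrase."""
    for key, bpm, fx in product(keys, bpms, fx_chains):
        yield {"source": phrase_path, "key": key, "bpm": bpm, "fx": fx}

# One hand-labeled phrase becomes 3 * 3 * 3 = 27 examples here, which is how
# a hand-labeled core can grow into a multi-million-file training set.
examples = list(expand("phrases/pluck_01.wav"))
```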
Generalization and Overfitting
Because structured augmentation can increase dataset size rapidly, special care was taken to reduce melodic overfitting and encourage broader generalization.
The objective was to help the model learn:
- how similar phrase structures can exist across many timbral identities
- how sound design changes affect musical material
- how prompts can steer sonic outcomes without collapsing variety
- how the same conditioning vocabulary can remain useful across many production contexts
This is one reason Foundation-1 can produce multiple distinct outputs from the same prompt while still preserving the requested timbral and structural identity.
What the Dataset Size Represents
The full size of the dataset should be understood as a measure of coverage and structured variation, not simply a count of unrelated melodies.
Foundation-1 was not designed around “more files for the sake of more files,” which is often seen with large scraped or loosely structured datasets.
Instead, the dataset was built from the ground up to teach the model how:
- phrases behave across different instruments
- timbral tags influence sonic identity
- FX descriptors shape the output
- timing and tonality influence loop behavior
- musical structure can remain coherent under many sonic conditions
In other words, the dataset size reflects the breadth of the conditioning system as much as it reflects the raw quantity of audio.
Charts at a Glance
The following charts illustrate the distribution of key components of the Foundation-1 training dataset.
These visualizations provide a quick overview of the dataset’s coverage across:
- instrument families
- instrument sub-families
- timbral descriptors
- FX conditioning tags
Scope
Foundation-1 is focused specifically on sample-generation workflows.
It was trained to generate:
- musical loops
- melodic phrases
- chordal material
- arps
- top-line ideas
- basslines
- textures
- production-ready instrumental content
It was not designed as:
- a full song generator
- a drum generator
- a general-purpose music captioning model
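Since Foundation-1 was fine-tuned from stabilityai/stable-audio-open-1.0, inference presumably follows the same stable-audio-tools pattern. A sketch under that assumption, with a placeholder repo id:

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical repo id; substitute the actual Foundation-1 checkpoint.
model, model_config = get_pretrained_model("your-org/Foundation-1")
model = model.to(device)

conditioning = [{
    "prompt": "pluck synth, warm, long reverb, 124 BPM, 4 bars, A minor",
    "seconds_start": 0,
    "seconds_total": 20,  # matches the 20-second training window
}]

output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=model_config["sample_size"],
    device=device,
)

# Collapse batch, peak-normalize, and write a 16-bit stereo WAV.
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32)
output = output.div(output.abs().max()).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("foundation1_output.wav", output, model_config["sample_rate"])
```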
Final Note
Foundation-1 was built to give producers structured control over:
- what the sound is
- how it behaves musically
- how it feels sonically
- how it sits in a production context
The dataset and training design were built around that exact goal.
For examples, prompting guidance, and model capabilities, see the main repository page.
Additional information on the model and design philosophy can be found in the companion video:
🎥 Watch the Foundation-1 overview and design philosophy video