Foundation-1 Training & Dataset Notes
This page provides a high-level overview of the training setup, dataset composition, and design philosophy behind Foundation-1.
It is intended as a companion to the main model page, which focuses on capabilities, prompting, and audio examples.
Training Summary
Foundation-1 was trained as a structured text-to-sample diffusion model designed for music production workflows, rather than general-purpose music captioning.
Hardware
- GPUs: 2 × NVIDIA RTX A6000
- System RAM: 128 GB
Training Run
- Training stopped at step: 183,474
Audio Configuration
- Sample Rate: 44,100 Hz
- Bit Depth: 16-bit
- Channels: Stereo
- Sample Length: 882,000 samples (20 seconds at 44.1 kHz)
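These values are self-consistent; a quick check of the arithmetic:

```python
# Sanity check on the audio configuration above.
sample_rate = 44_100       # Hz
num_samples = 882_000      # samples per training window
channels = 2               # stereo
bytes_per_sample = 2       # 16-bit PCM

duration = num_samples / sample_rate                   # 20.0 seconds exactly
raw_bytes = num_samples * channels * bytes_per_sample  # 3,528,000 bytes (~3.4 MiB) of raw PCM
print(f"{duration} s, {raw_bytes} bytes per window")
```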
Technical Training Notes
Foundation-1 was fine-tuned from stabilityai/stable-audio-open-1.0 using a diffusion transformer architecture with structured prompt conditioning.
Model Architecture
- Backbone: Diffusion Transformer (DiT)
- Transformer Depth: 24 layers
- Attention Heads: 24
- Embedding Dimension: 1536
- Conditioning Dimension: 768
- Text Encoder: t5-base
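As a rough sketch, the hyperparameters above fit together as follows. The field names here are illustrative, not the exact stable-audio-tools config schema:

```python
# Illustrative summary of the architecture hyperparameters listed above.
# Field names are hypothetical; the real config schema differs.
foundation1_dit = {
    "backbone": "diffusion_transformer",
    "depth": 24,                # transformer layers
    "num_heads": 24,            # attention heads per layer
    "embed_dim": 1536,          # token width (1536 / 24 = 64 dims per head)
    "cond_dim": 768,            # conditioning width, matching t5-base's
    "text_encoder": "t5-base",  # t5-base hidden size is 768, hence cond_dim
}
```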
Optimization
- Optimizer: AdamW
- Learning Rate: 5e-5
- Weight Decay: 1e-3
- Scheduler: InverseLR
- EMA: Enabled
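A hedged sketch of this setup in PyTorch. The InverseLR schedule is approximated with a LambdaLR, since its exact parameters (inv_gamma, power, warmup) were not published, and the EMA decay value is assumed:

```python
import copy
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the DiT backbone

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-3)

# InverseLR-style decay: lr(step) = base_lr * (1 + step / inv_gamma) ** -power.
# inv_gamma and power are assumed values, not Foundation-1's actual settings.
inv_gamma, power = 1_000_000, 0.5
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: (1 + step / inv_gamma) ** -power
)

# Simple EMA of the weights, updated after each optimizer step (decay assumed).
ema_model = copy.deepcopy(model)
ema_decay = 0.999

@torch.no_grad()
def update_ema():
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.lerp_(p, 1.0 - ema_decay)
```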
Dataset Overview
Dataset Totals
- Total WAV files: 3,784,862
- Total Dataset Size: 7.100 TiB
Important Note on Dataset Scale
While the dataset is large, its scale reflects the post-augmentation training set, not a flat count of unique, unrelated one-off melodic phrases.
All samples were hand-labeled first, then processed through a controlled augmentation pipeline designed to expand sonic and conditioning coverage.
As a result, a single melodic phrase may be represented across many different contexts, including variations in:
- timbre
- FX treatment
- key
- BPM
- loop structure
- tonal emphasis
This design was intentional. The goal was not simply to maximize the number of isolated phrases, but to teach the model how musical ideas translate across different sonic identities and production scenarios.
Dataset and Training Philosophy
Foundation-1 was built around a structured sample-generation philosophy, rather than generic or genre-based audio captioning.
The dataset consists entirely of hand-labeled audio, organized around a layered prompt and conditioning system designed to reflect how producers actually think about sound.
At a high level, the training design emphasizes:
- structured musical loops
- instrument hierarchy
- explicit timbre representation
- dedicated FX descriptors
- notation-aware prompt terms
- key / tempo / bar-aware looping
- strong production relevance
- broad reuse for compositional workflows
This design is central to the model’s musical coherence, prompt controllability, and production-facing behavior.
Why the Dataset Was Structured This Way
Most audio generation systems treat prompts as broad descriptive captions.
Foundation-1 was instead trained to understand sound as a layered system with separable controls.
That structure includes:
Instrument Identity
Broad family and sub-family control over what kind of sound is being generated.
Timbre
Direct conditioning over tonal character, texture, density, brightness, width, grit, warmth, and other sonic traits.
FX Context
Dedicated processing descriptors such as reverb, delay, distortion, phasing, and bitcrushing.
Musical Structure
Notation-aware terms that encourage coherent phrasing, melodic behavior, rhythmic structure, and harmonic motion.
Timing and Tonality
Explicit support for BPM, bar count, and key-aware sample generation.
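To make the layering concrete, a hypothetical prompt can be assembled from one term per layer. The actual tag vocabulary is documented on the main model page, so these exact strings are illustrative:

```python
# Hypothetical layered prompt; treat the specific tags as placeholders.
prompt_layers = {
    "instrument": "pluck synth",                     # instrument identity (family / sub-family)
    "timbre": "warm, wide, slightly gritty",         # tonal character
    "fx": "long reverb, ping-pong delay",            # dedicated FX descriptors
    "structure": "arpeggiated 16th-note phrase",     # notation-aware terms
    "timing_tonality": "124 BPM, 4 bars, A minor",   # BPM / bar / key conditioning
}
prompt = ", ".join(prompt_layers.values())
```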
This layered design is one of the main reasons Foundation-1 can produce outputs that feel both musically structured and sonically steerable.
Coverage and Augmentation
The augmentation strategy was designed to improve coverage, control, and generalization.
Rather than treating each source phrase as a single static example, labeled material was expanded across multiple sonic and musical contexts.
This helps the model learn relationships between:
- melody and timbre
- phrase behavior and instrumentation
- sound design and FX treatment
- tonal setting and loop structure
- tempo and musical feel
In practice, this encourages the model to learn reusable musical relationships, rather than simply memorizing isolated recordings.
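A minimal sketch of the expansion idea, assuming a simple Cartesian product over a few conditioning axes (the real pipeline and its axes are not published in this detail):

```python
from itertools import product

# Hypothetical conditioning axes; the actual augmentation pipeline covers
# timbre, FX treatment, key, BPM, loop structure, and tonal emphasis.
keys = ["A minor", "C major", "F# minor"]
bpms = [90, 124, 140]
fx_chains = ["dry", "plate reverb", "bitcrush + delay"]

def expand(phrase_path):
    """Yield one training example per (key, bpm, fx) context for a source phrase."""
    for key, bpm, fx in product(keys, bpms, fx_chains):
        yield {"source": phrase_path, "key": key, "bpm": bpm, "fx": fx}

# One hand-labeled phrase becomes 3 * 3 * 3 = 27 examples here, which is how
# a hand-labeled core can grow into a multi-million-file training set.
examples = list(expand("phrases/pluck_01.wav"))
```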
Generalization and Overfitting
Because structured augmentation can increase dataset size rapidly, special care was taken to reduce melodic overfitting and encourage broader generalization.
The objective was to help the model learn:
- how similar phrase structures can exist across many timbral identities
- how sound design changes affect musical material
- how prompts can steer sonic outcomes without collapsing variety
- how the same conditioning vocabulary can remain useful across many production contexts
This is one reason Foundation-1 can produce multiple distinct outputs from the same prompt while still preserving the requested timbral and structural identity.
What the Dataset Size Represents
The full size of the dataset should be understood as a measure of coverage and structured variation, not simply a count of unrelated melodies.
Foundation-1 was not designed around “more files for the sake of more files,” which is often seen with large scraped or loosely structured datasets.
Instead, the dataset was built from the ground up to teach the model how:
- phrases behave across different instruments
- timbral tags influence sonic identity
- FX descriptors shape the output
- timing and tonality influence loop behavior
- musical structure can remain coherent under many sonic conditions
In other words, the dataset size reflects the breadth of the conditioning system as much as it reflects the raw quantity of audio.
Charts at a Glance
The following charts illustrate the distribution of key components of the Foundation-1 training dataset.
These visualizations provide a quick overview of the dataset’s coverage across:
- instrument families
- instrument sub-families
- timbral descriptors
- FX conditioning tags
Scope
Foundation-1 is focused specifically on sample-generation workflows.
It was trained to generate:
- musical loops
- melodic phrases
- chordal material
- arps
- top-line ideas
- basslines
- textures
- production-ready instrumental content
It was not designed as:
- a full song generator
- a drum generator
- a general-purpose music captioning model
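Since Foundation-1 was fine-tuned from stabilityai/stable-audio-open-1.0, inference presumably follows the same stable-audio-tools pattern. A sketch under that assumption, with a placeholder repo id:

```python
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical repo id; substitute the actual Foundation-1 checkpoint.
model, model_config = get_pretrained_model("your-org/Foundation-1")
model = model.to(device)

conditioning = [{
    "prompt": "pluck synth, warm, long reverb, 124 BPM, 4 bars, A minor",
    "seconds_start": 0,
    "seconds_total": 20,  # matches the 20-second training window
}]

output = generate_diffusion_cond(
    model,
    steps=100,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=model_config["sample_size"],
    device=device,
)

# Collapse batch, peak-normalize, and write a 16-bit stereo WAV.
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32)
output = output.div(output.abs().max()).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("foundation1_output.wav", output, model_config["sample_rate"])
```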
Final Note
Foundation-1 was built to give producers structured control over:
- what the sound is
- how it behaves musically
- how it feels sonically
- how it sits in a production context
The dataset and training design were built around that exact goal.
For examples, prompting guidance, and model capabilities, see the main repository page.
Additional information on the model and design philosophy can be found in the companion video:
🎥 Watch the Foundation-1 overview and design philosophy video