Beta-VAE with Controlled Capacity Increase
A β-VAE trained on the CelebA dataset to learn disentangled latent representations of human faces. The model can generate faces, reconstruct images, and perform controllable latent editing of attributes such as smile and pose.
This model is a β-Variational Autoencoder trained on 64×64 CelebA images with a 32-dimensional latent space and capacity annealing (γ=1000, C_max=50). The encoder maps images into a structured latent representation, while the decoder generates realistic face reconstructions and samples from latent vectors drawn from a Gaussian prior. The learned latent space exhibits partially disentangled factors corresponding to interpretable facial attributes such as smiling, face shape, background color, and head rotation. The model supports unconditional image generation, latent interpolation, feature extraction, and controllable latent editing. It is based on the paper Understanding Disentanglement in β-VAE, with modifications to the architecture and training procedure.
Training code and the model file are available on GitHub.
View the latent-editing demo on Hugging Face Spaces.
Architecture
Instead of simple convolutional blocks, the encoder and decoder are built from Residual Blocks. The residual connections improve gradient flow during training, allowing the network to learn deeper representations while stabilizing optimization and improving reconstruction quality. Batch Normalization is avoided because its batch-dependent statistics interfere with training; Group Normalization is used instead to stabilize optimization. Without normalization, training is unstable and prone to exploding gradients and NaN losses.
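A minimal sketch of the kind of residual block described above, assuming PyTorch-style modules; the number of groups, channel count, and activation function are illustrative assumptions, and the exact blocks are defined in the GitHub repository.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv block with Group Normalization and a skip connection."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(groups, channels),   # GroupNorm instead of BatchNorm
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity shortcut improves gradient flow through deep encoders/decoders.
        return x + self.block(x)
```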
Training
The AdamW optimizer is used with weight decay set to 0. The learning rate should not exceed 3e-5; higher values make training unstable and cause exploding gradients.
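A one-line sketch of that optimizer configuration (the model below is a stand-in for the β-VAE defined in the GitHub repository):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in for the actual β-VAE
# Weight decay disabled; learning rates above 3e-5 were observed to destabilize training.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.0)
```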
Hyperparameters:
- Latent Dimensions: 32
- C_max: 50.0
- Gamma: 1000.0
- Epochs: 200
- Capacity increase schedule: 80 epochs
- Batch size: 128
- Loss: BCE with logits, averaged over the batch (see the capacity-annealing sketch below)
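The hyperparameters above correspond to the capacity-annealed β-VAE objective, in which the target KL capacity C grows linearly from 0 to C_max over the capacity schedule and deviations from it are penalized with weight γ. The sketch below shows one plausible implementation; the reduction details and variable names are assumptions, and the exact training code is on GitHub.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon_logits, target, mu, logvar, epoch,
                  gamma=1000.0, c_max=50.0, c_stop_epoch=80):
    # Reconstruction: BCE with logits, summed over pixels and averaged over the batch.
    recon = F.binary_cross_entropy_with_logits(
        recon_logits, target, reduction="none"
    ).sum(dim=(1, 2, 3)).mean()

    # KL divergence to the unit Gaussian prior, summed over latent dims, batch-averaged.
    kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(dim=1).mean()

    # Capacity C rises linearly from 0 to C_max over the first c_stop_epoch epochs.
    c = min(c_max, c_max * epoch / c_stop_epoch)

    # Penalize the distance between the current KL and the allowed capacity.
    return recon + gamma * (kl - c).abs()
```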
Dataset used: jessicali9530/celeba-dataset (Kaggle). The images are center-cropped to 148x148 and then resized to 64x64.
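A minimal preprocessing sketch using torchvision, matching the crop-and-resize described above (the exact pipeline used during training is in the GitHub repository):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.CenterCrop(148),   # crop the aligned CelebA face region
    transforms.Resize(64),        # downscale to the 64x64 model resolution
    transforms.ToTensor(),        # scale pixels to [0, 1] for the BCE reconstruction loss
])
```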
Directions
The directions/ folder contains latent direction tensors corresponding to semantic facial attributes. Each tensor has shape (32,), matching the dimensionality of the latent space. These directions are computed by taking the difference between the mean latent vectors of positive and negative samples for a given attribute in the CelebA annotations.
For example, a direction for smile is computed as:
direction = mean(latents_smiling) - mean(latents_not_smiling)
Moving along these directions modifies the corresponding attribute while keeping most other features relatively unchanged.
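A minimal sketch of how such a direction can be computed from encoded latents and binary CelebA attribute labels; the function below is illustrative, and the tensors shipped in directions/ were precomputed.

```python
import torch

def attribute_direction(latents: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """latents: (N, 32) posterior means; labels: (N,) with 1 = attribute present, 0 = absent."""
    positive_mean = latents[labels == 1].mean(dim=0)
    negative_mean = latents[labels == 0].mean(dim=0)
    return positive_mean - negative_mean   # shape (32,), e.g. the smile direction
```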
To edit an attribute, add the direction to the latent vector:
edited_latent = latent + strength * direction
where strength controls the intensity of the attribute change.
Positive values increase the attribute (e.g., more smile), while negative values reduce it.
Usage
Typical workflow:
- Encode a 64x64 image to obtain its latent vector. Make sure the face is centered and there is minimal background.
- Add a scaled attribute direction to the latent.
- Decode the modified latent to generate the edited image.
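A minimal sketch of this workflow, assuming `model.encode` returns (mu, logvar) and `model.decode` returns image logits; the actual class and method names are defined in the GitHub repository.

```python
import torch

@torch.no_grad()
def edit_image(model, image, direction, strength=2.0):
    """image: (1, 3, 64, 64) tensor in [0, 1]; direction: (32,) latent direction tensor."""
    mu, logvar = model.encode(image)             # use the posterior mean as the latent
    edited_latent = mu + strength * direction    # move along the attribute direction
    logits = model.decode(edited_latent)
    return torch.sigmoid(logits)                 # decoder outputs logits (BCE-with-logits training)
```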
This enables interactive latent editing such as:
- increasing or decreasing smile
- changing head pose
- modifying face shape
- altering background or lighting characteristics
- and other attributes provided in the directions/ folder
Multiple directions can also be combined to perform compound edits.
You can also:
- sample a random latent vector from the Gaussian prior to generate a new face
- sample nearby vectors to generate similar images
- encode two different images and interpolate between their latents (latent interpolation), so the first image smoothly transforms into the second, as sketched below
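A minimal sketch of latent interpolation under the same assumed `model.encode` / `model.decode` interface:

```python
import torch

@torch.no_grad()
def interpolate(model, image_a, image_b, steps=8):
    mu_a, _ = model.encode(image_a)
    mu_b, _ = model.encode(image_b)
    frames = []
    for t in torch.linspace(0, 1, steps):
        z = (1 - t) * mu_a + t * mu_b                 # linear interpolation in latent space
        frames.append(torch.sigmoid(model.decode(z)))
    return torch.cat(frames, dim=0)                   # (steps, 3, 64, 64) image sequence
```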
The model architecture is provided in the GitHub repository. To run the model locally, clone the repository and use the architecture implementation from there, then load the provided model weights and direction tensors from this Hugging Face repository.
Observations
Active dimensions are identified from the per-dimension KL divergence of the latent activations, averaged across the dataset. Dimensions with higher KL are considered to encode meaningful factors of variation.
- Active dimensions (KL > 0.1): 19
- Strong dimensions (KL > 1.0): 17
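A minimal sketch of how these counts can be obtained from posterior parameters collected over the dataset; the measurement script itself is an assumption.

```python
import torch

def count_active_dims(mu: torch.Tensor, logvar: torch.Tensor, threshold: float = 0.1) -> int:
    """mu, logvar: (N, 32) posterior parameters collected across the dataset."""
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())   # (N, 32)
    mean_kl = kl_per_dim.mean(dim=0)                              # average KL per dimension
    return int((mean_kl > threshold).sum())                       # e.g. threshold=0.1 or 1.0
```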
Limitations
- The latent space is partially disentangled, so some attributes may influence others.
- Large edit strengths may produce unrealistic faces.
- The model is trained on CelebA, so performance on non-face images or heavily out-of-distribution faces will degrade.
Bias and Dataset Considerations
CelebA contains celebrity images and therefore does not represent the full diversity of human faces. As a result, the model may inherit biases present in the dataset and may perform unevenly across different demographics.
Evaluation results
- BCE with logits on the CelebA test set (self-reported): 6,521.18