Beta-VAE with Controlled Capacity Increase
A β-VAE trained on the CelebA dataset to learn disentangled latent representations of human faces. The model can generate faces, reconstruct images, and perform controllable latent editing of attributes such as smile and pose.
This model is a β-Variational Autoencoder trained on 64×64 CelebA images with a 32-dimensional latent space and capacity annealing (γ=1000, C_max=50). The encoder maps images into a structured latent representation, while the decoder generates realistic face reconstructions and samples from latent vectors drawn from a Gaussian prior. The learned latent space exhibits partially disentangled factors corresponding to interpretable facial attributes such as smiling, face shape, background color, and head rotation. The model supports unconditional image generation, latent interpolation, feature extraction, and controllable latent editing. It is based on the paper Understanding Disentanglement in β-VAE, with modifications to the architecture and training procedure.
Training code and the model file are available on GitHub.
View the latent-editing demo on Hugging Face Spaces.
Architecture
Instead of simple convolutional blocks, the encoder and decoder are built from Residual Blocks. The residual connections improve gradient flow during training, allowing the network to learn deeper representations while stabilizing optimization and improving reconstruction quality. Batch Normalization is avoided because its batch-dependent statistics interfere with training; Group Normalization is used instead to stabilize optimization. Without normalization, training is unstable and prone to exploding gradients and NaN losses.
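A minimal sketch of the kind of residual block described above, assuming PyTorch-style modules; the number of groups, channel count, and activation function are illustrative assumptions, and the exact blocks are defined in the GitHub repository.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv block with Group Normalization and a skip connection."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.block = nn.Sequential(
            nn.GroupNorm(groups, channels),   # GroupNorm instead of BatchNorm
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity shortcut improves gradient flow through deep encoders/decoders.
        return x + self.block(x)
```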
Training
The AdamW optimizer is used with weight decay set to 0. The learning rate should not exceed 3e-5; higher values make training unstable and cause exploding gradients.
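A one-line sketch of that optimizer configuration (the model below is a stand-in for the β-VAE defined in the GitHub repository):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)  # stand-in for the actual β-VAE
# Weight decay disabled; learning rates above 3e-5 were observed to destabilize training.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.0)
```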
Hyperparameters:
- Latent Dimensions: 32
- C_max: 50.0
- Gamma: 1000.0
- Epochs: 200
- Capacity increase schedule: 80 epochs
- Batch size: 128
- Loss: BCE with logits, averaged over the batch (see the capacity-annealing sketch below)
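The hyperparameters above correspond to the capacity-annealed β-VAE objective, in which the target KL capacity C grows linearly from 0 to C_max over the capacity schedule and deviations from it are penalized with weight γ. The sketch below shows one plausible implementation; the reduction details and variable names are assumptions, and the exact training code is on GitHub.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(recon_logits, target, mu, logvar, epoch,
                  gamma=1000.0, c_max=50.0, c_stop_epoch=80):
    # Reconstruction: BCE with logits, summed over pixels and averaged over the batch.
    recon = F.binary_cross_entropy_with_logits(
        recon_logits, target, reduction="none"
    ).sum(dim=(1, 2, 3)).mean()

    # KL divergence to the unit Gaussian prior, summed over latent dims, batch-averaged.
    kl = (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(dim=1).mean()

    # Capacity C rises linearly from 0 to C_max over the first c_stop_epoch epochs.
    c = min(c_max, c_max * epoch / c_stop_epoch)

    # Penalize the distance between the current KL and the allowed capacity.
    return recon + gamma * (kl - c).abs()
```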
Dataset used: jessicali9530/celeba-dataset (Kaggle). The images are center-cropped to 148x148 and then resized to 64x64.
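A minimal preprocessing sketch using torchvision, matching the crop-and-resize described above (the exact pipeline used during training is in the GitHub repository):

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.CenterCrop(148),   # crop the aligned CelebA face region
    transforms.Resize(64),        # downscale to the 64x64 model resolution
    transforms.ToTensor(),        # scale pixels to [0, 1] for the BCE reconstruction loss
])
```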
Directions
The directions/ folder contains latent direction tensors corresponding to semantic facial attributes. Each tensor has shape (32,), matching the dimensionality of the latent space. These directions are computed by taking the difference between the mean latent vectors of positive and negative samples for a given attribute in the CelebA annotations.
For example, a direction for smile is computed as:
direction = mean(latents_smiling) - mean(latents_not_smiling)
Moving along these directions modifies the corresponding attribute while keeping most other features relatively unchanged.
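A minimal sketch of how such a direction can be computed from encoded latents and binary CelebA attribute labels; the function below is illustrative, and the tensors shipped in directions/ were precomputed.

```python
import torch

def attribute_direction(latents: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """latents: (N, 32) posterior means; labels: (N,) with 1 = attribute present, 0 = absent."""
    positive_mean = latents[labels == 1].mean(dim=0)
    negative_mean = latents[labels == 0].mean(dim=0)
    return positive_mean - negative_mean   # shape (32,), e.g. the smile direction
```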
To edit an attribute, add the direction to the latent vector:
edited_latent = latent + strength * direction
where strength controls the intensity of the attribute change.
Positive values increase the attribute (e.g., more smile), while negative values reduce it.
Usage
Typical workflow:
- Encode a 64x64 image to obtain its latent vector. Make sure the face is centered and there is minimal background.
- Add a scaled attribute direction to the latent.
- Decode the modified latent to generate the edited image.
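A minimal sketch of this workflow, assuming `model.encode` returns (mu, logvar) and `model.decode` returns image logits; the actual class and method names are defined in the GitHub repository.

```python
import torch

@torch.no_grad()
def edit_image(model, image, direction, strength=2.0):
    """image: (1, 3, 64, 64) tensor in [0, 1]; direction: (32,) latent direction tensor."""
    mu, logvar = model.encode(image)             # use the posterior mean as the latent
    edited_latent = mu + strength * direction    # move along the attribute direction
    logits = model.decode(edited_latent)
    return torch.sigmoid(logits)                 # decoder outputs logits (BCE-with-logits training)
```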
This enables interactive latent editing such as:
- increasing or decreasing smile
- changing head pose
- modifying face shape
- altering background or lighting characteristics
- and other attributes provided in the directions/ folder
Multiple directions can also be combined to perform compound edits.
You can also:
- sample a random latent vector from the Gaussian prior to generate a new face
- sample nearby vectors to generate similar images
- encode two different images and interpolate between their latents (latent interpolation), so the first image smoothly transforms into the second, as sketched below
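A minimal sketch of latent interpolation under the same assumed `model.encode` / `model.decode` interface:

```python
import torch

@torch.no_grad()
def interpolate(model, image_a, image_b, steps=8):
    mu_a, _ = model.encode(image_a)
    mu_b, _ = model.encode(image_b)
    frames = []
    for t in torch.linspace(0, 1, steps):
        z = (1 - t) * mu_a + t * mu_b                 # linear interpolation in latent space
        frames.append(torch.sigmoid(model.decode(z)))
    return torch.cat(frames, dim=0)                   # (steps, 3, 64, 64) image sequence
```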
The model architecture is provided in the GitHub repository. To run the model locally, clone the repository and use the architecture implementation from there, then load the provided model weights and direction tensors from this Hugging Face repository.
Observations
Active dimensions are identified from the per-dimension KL divergence of the latent activations, averaged across the dataset. Dimensions with higher KL are considered to encode meaningful factors of variation.
- Active dimensions (KL > 0.1): 19
- Strong dimensions (KL > 1.0): 17
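A minimal sketch of how these counts can be obtained from posterior parameters collected over the dataset; the measurement script itself is an assumption.

```python
import torch

def count_active_dims(mu: torch.Tensor, logvar: torch.Tensor, threshold: float = 0.1) -> int:
    """mu, logvar: (N, 32) posterior parameters collected across the dataset."""
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())   # (N, 32)
    mean_kl = kl_per_dim.mean(dim=0)                              # average KL per dimension
    return int((mean_kl > threshold).sum())                       # e.g. threshold=0.1 or 1.0
```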
Limitations
- The latent space is partially disentangled, so some attributes may influence others.
- Large edit strengths may produce unrealistic faces.
- The model is trained on CelebA, so performance on non-face images or heavily out-of-distribution faces will degrade.
Bias and Dataset Considerations
CelebA contains celebrity images and therefore does not represent the full diversity of human faces. As a result, the model may inherit biases present in the dataset and may perform unevenly across different demographics.
Evaluation results
- BCE with logits on the CelebA test set (self-reported): 6,521.18