---
license: mit
library_name: pytorch
tags:
- image-feature-extraction
- feature-upsampling
- pixel-dense-features
- computer-vision
- stable-diffusion
- vae
- image-upsampling
- uplift
datasets:
- unsplash/lite
---

# UPLiFT for Stable Diffusion 1.5 VAE

| Input Image | UPLiFT Upsampled Output |
|:-----------:|:-----------------------:|
| ![Input](Gigi_3_512.png) | ![UPLiFT Output](Gigi_3_512.png_uplift_sd1.5vae-2.png) |

This is the official pretrained **UPLiFT** (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the **Stable Diffusion 1.5 VAE** encoder. UPLiFT is a lightweight method for upscaling features from pretrained vision backbones into pixel-dense feature maps. Applied to the SD 1.5 VAE, it enables high-quality image upsampling by operating in the VAE's latent space.

## Model Details

| Property | Value |
|----------|-------|
| **Backbone** | Stable Diffusion 1.5 VAE (`stable-diffusion-v1-5/stable-diffusion-v1-5`) |
| **Latent Channels** | 4 |
| **Patch Size** | 8 |
| **Upsampling Factor** | 2x per iteration |
| **Local Attender Size** | N=17 |
| **Training Dataset** | Unsplash-Lite |
| **Training Image Size** | 1024x1024 |
| **License** | MIT |

## Links

- **Paper**: [https://arxiv.org/abs/2601.17950](https://arxiv.org/abs/2601.17950)
- **GitHub**: [https://github.com/mwalmer-umd/UPLiFT](https://github.com/mwalmer-umd/UPLiFT)
- **Project Website**: [https://www.cs.umd.edu/~mwalmer/uplift/](https://www.cs.umd.edu/~mwalmer/uplift/)

## Installation

```bash
pip install 'uplift[sd-vae] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```

## Quick Start

```python
import torch
from PIL import Image

# Load model (weights auto-download from Hugging Face)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae')

# Run inference - upsamples the image
image = Image.open('your_image.jpg')
upsampled_image = model(image)
```

## Usage Options

### Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 2 for VAE). Each iteration upsamples by 2x, so the default gives 4x overall:

```python
# Fewer iterations = lower memory usage (and a smaller overall upsampling factor)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', iters=1)
```

### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the SD VAE:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_sd15_vae', include_extractor=False)
```

**Note:** We do not recommend running the model this way: extracting and handling features from a Diffusers pipeline VAE yourself adds complexity and can introduce feature-handling errors. Loading the model with the backbone included handles the features correctly.

## Architecture

This UPLiFT variant is designed specifically for VAE latent upsampling and includes:

1. **Encoder**: Processes the input image with a series of convolutional blocks to produce dense representations that guide feature upsampling
2. **Decoder**: Upsamples latent features, with noise channels concatenated for stochastic refinement
3. **Local Attender**: A local-neighborhood attention pooling module that keeps the upsampled features semantically consistent with the original features
4. **Refiner**: An additional 12-layer refinement block with noise injection that enhances output quality

Key differences from ViT-based UPLiFT models:

- Uses layer normalization instead of batch normalization
- Concatenates noise channels (4 channels) in the decoder and refiner
- Adds a dedicated refiner module for enhanced image quality
- Trained with latent-space noise augmentation
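To make the data flow above concrete, here is a minimal, illustrative PyTorch sketch. Every module in it is a toy stand-in (the N=17 local attender is omitted for brevity), with shapes following the table above (4 latent channels, patch size 8, 2x per iteration). This is not the released implementation; see the GitHub repository for the actual code.

```python
import torch
import torch.nn as nn

class UpliftSketch(nn.Module):
    """Toy illustration of the UPLiFT data flow; NOT the released implementation."""

    def __init__(self, latent_channels=4, noise_channels=4, iters=2):
        super().__init__()
        self.iters = iters
        self.noise_channels = noise_channels
        # Stand-in encoder: convolutional blocks that turn the image into
        # dense guidance features (reduced here to a single conv).
        self.encoder = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        # Stand-in decoder: consumes the latent plus concatenated noise
        # channels and upsamples 2x per iteration via pixel shuffle.
        self.decoder = nn.Sequential(
            nn.Conv2d(latent_channels + noise_channels, latent_channels * 4,
                      kernel_size=3, padding=1),
            nn.PixelShuffle(2),
        )
        # Stand-in refiner: the real model uses a 12-layer refinement block
        # with noise injection; the local attender is omitted entirely.
        self.refiner = nn.Conv2d(latent_channels + noise_channels,
                                 latent_channels, kernel_size=3, padding=1)

    def forward(self, image, latent):
        guidance = self.encoder(image)  # guidance features (unused by this toy decoder)
        for _ in range(self.iters):     # each iteration doubles the latent resolution
            noise = torch.randn(latent.shape[0], self.noise_channels,
                                *latent.shape[2:], device=latent.device)
            latent = self.decoder(torch.cat([latent, noise], dim=1))
        noise = torch.randn(latent.shape[0], self.noise_channels,
                            *latent.shape[2:], device=latent.device)
        return self.refiner(torch.cat([latent, noise], dim=1))

# A 512x512 image gives a 64x64 SD 1.5 VAE latent (patch size 8); two 2x
# iterations produce a 256x256 latent, i.e. a 4x denser feature map.
sketch = UpliftSketch()
image = torch.randn(1, 3, 512, 512)
latent = torch.randn(1, 4, 64, 64)
print(sketch(image, latent).shape)  # torch.Size([1, 4, 256, 256])
```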
## Intended Use

This model is designed for:

- High-quality image upsampling using Stable Diffusion's VAE
- Super-resolution tasks
- Enhancing image resolution while preserving details
- Research on diffusion model components

## Limitations

- Optimized specifically for the Stable Diffusion 1.5 VAE; may not work with other VAE architectures
- Output quality depends on input image characteristics
- Requires more computation than simpler upsampling methods
- Best results are achieved on images that match the training distribution (natural photographs)

## Citation

If you use UPLiFT in your research, please cite our paper.

```
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```

## Acknowledgements

This work builds upon:

- [Stable Diffusion](https://github.com/CompVis/stable-diffusion) by Stability AI and CompVis
- [Diffusers](https://github.com/huggingface/diffusers) by Hugging Face
- [Unsplash](https://unsplash.com/) for the training dataset