---
license: mit
library_name: pytorch
tags:
- image-feature-extraction
- feature-upsampling
- pixel-dense-features
- computer-vision
- dinov2
- vision-transformer
- uplift
datasets:
- ILSVRC/imagenet-1k
---

# UPLiFT for DINOv2-S/14

| Input Image | Base DINOv2 Features | UPLiFT Upsampled Features |
|:-----------:|:--------------------:|:-------------------------:|
| ![Input](Gigi_2_448.png) | ![Base Features](Gigi_2_448.png_uplift_dinov2-s14-base-feature-PCA.png) | ![UPLiFT Features](Gigi_2_448.png_uplift_dinov2-s14-4-PCA.png) |

This is the official pretrained **UPLiFT** (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the **DINOv2-S/14** backbone. UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. It uses Local Attenders to efficiently upsample low-resolution backbone features while preserving semantic information.

## Model Details

| Property | Value |
|----------|-------|
| **Backbone** | DINOv2-S/14 (`vit_small_patch14_dinov2.lvd142m`) |
| **Backbone Channels** | 384 |
| **Patch Size** | 14 |
| **Upsampling Factor** | 2x per iteration |
| **Local Attender Size** | N=17 |
| **Training Dataset** | ImageNet |
| **Training Image Size** | 448x448 |
| **License** | MIT |

## Links

- **Paper**: [https://arxiv.org/abs/2601.17950](https://arxiv.org/abs/2601.17950)
- **GitHub**: [https://github.com/mwalmer-umd/UPLiFT](https://github.com/mwalmer-umd/UPLiFT)
- **Project Website**: [https://www.cs.umd.edu/~mwalmer/uplift/](https://www.cs.umd.edu/~mwalmer/uplift/)

## Installation

```bash
pip install 'uplift[vit] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```

## Quick Start

```python
import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14')

# Run inference
image = Image.open('your_image.jpg')
features = model(image)  # Returns pixel-dense features
```

## Usage Options

### Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 4):

```python
# e.g. 2 iterations gives 4x upsampling and uses less memory than the default 4
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14', iters=2)
```

### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the DINOv2 backbone:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14', include_extractor=False)
```

### Return Base Features

Get both the upsampled and the original backbone features:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14', return_base_feat=True)
upsampled_features, base_features = model(image)
```

## Architecture

UPLiFT consists of:

1. **Encoder**: Processes the input image with a series of convolutional blocks to create dense representations that guide feature upsampling
2. **Decoder**: Upsamples features using transposed convolutions with bilinear residual connections
3. **Local Attender**: A local-neighborhood-based attention-pooling module that maintains semantic consistency with the original features

The model uses encoder sharing: a single encoder pass is reused across all upsampling iterations for efficiency.

## Intended Use

This model is designed for:

- Creating pixel-dense feature maps from DINOv2 features
- Dense prediction tasks (semantic segmentation, depth estimation, etc.)
- Feature visualization and analysis
- Research on vision foundation models

## Limitations

- Optimized specifically for DINOv2-S/14 features; may not generalize to other backbones without retraining
- Performance depends on the quality of the underlying DINOv2 features
- Higher iteration counts increase computation time
- DINOv2 uses a patch size of 14, so 14x upsampling is required to produce pixel-dense features. UPLiFT with 4 iterations performs 16x upsampling, slightly over-sampling the features. If exactly pixel-dense features are required, we recommend downsampling the over-sampled output to the correct size with bilinear or bicubic interpolation (see the sketch below).
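
The snippet below is a minimal sketch of that correction. It assumes the upsampled output is a `(B, C, H', W')` tensor (e.g. 384 channels for DINOv2-S/14) and that the target resolution is the input image size; these shapes are assumptions, so check the actual output format of the release before relying on them.

```python
import torch.nn.functional as F

# Hypothetical example: `features` is assumed to be a (B, C, H', W') tensor of
# 16x-upsampled UPLiFT features, and (448, 448) is the input image resolution.
# Bilinear interpolation resizes it to an exactly pixel-dense feature map.
pixel_dense = F.interpolate(features, size=(448, 448), mode="bilinear", align_corners=False)
```

`mode="bicubic"` works the same way if bicubic interpolation is preferred.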
## Citation

If you use UPLiFT in your research, please cite our paper.

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```

## Acknowledgements

This work builds upon:

- [DINOv2](https://github.com/facebookresearch/dinov2) by Meta AI
- [timm](https://github.com/huggingface/pytorch-image-models) for model loading