---
license: mit
library_name: pytorch
tags:
- image-feature-extraction
- feature-upsampling
- pixel-dense-features
- computer-vision
- dinov2
- vision-transformer
- uplift
datasets:
- ILSVRC/imagenet-1k
---

# UPLiFT for DINOv2-S/14

| Input Image | Base DINOv2 Features | UPLiFT Upsampled Features |
|:-----------:|:--------------------:|:-------------------------:|
| ![Input](Gigi_2_448.png) | ![Base Features](Gigi_2_448.png_uplift_dinov2-s14-base-feature-PCA.png) | ![UPLiFT Features](Gigi_2_448.png_uplift_dinov2-s14-4-PCA.png) |

This is the official pretrained **UPLiFT** (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the **DINOv2-S/14** backbone. UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. It uses Local Attenders to efficiently upsample low-resolution backbone features while preserving semantic information.

## Model Details

| Property | Value |
|----------|-------|
| **Backbone** | DINOv2-S/14 (`vit_small_patch14_dinov2.lvd142m`) |
| **Backbone Channels** | 384 |
| **Patch Size** | 14 |
| **Upsampling Factor** | 2x per iteration |
| **Local Attender Size** | N=17 |
| **Training Dataset** | ImageNet |
| **Training Image Size** | 448x448 |
| **License** | MIT |

## Links

- **Paper**: [https://arxiv.org/abs/2601.17950](https://arxiv.org/abs/2601.17950)
- **GitHub**: [https://github.com/mwalmer-umd/UPLiFT](https://github.com/mwalmer-umd/UPLiFT)
- **Project Website**: [https://www.cs.umd.edu/~mwalmer/uplift/](https://www.cs.umd.edu/~mwalmer/uplift/)

## Installation

```bash
pip install 'uplift[vit] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
```

## Quick Start

```python
import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14')

# Run inference
image = Image.open('your_image.jpg')
features = model(image)  # Returns pixel-dense features
```

## Usage Options

### Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 4):

```python
# e.g. 2 iterations gives 4x upsampling and uses less memory than the default 4
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14', iters=2)
```

### Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module, without the DINOv2 backbone:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14', include_extractor=False)
```

### Return Base Features

Get both the upsampled and the original backbone features:

```python
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov2_s14', return_base_feat=True)
upsampled_features, base_features = model(image)
```

## Architecture

UPLiFT consists of:

1. **Encoder**: Processes the input image with a series of convolutional blocks to create dense representations that guide feature upsampling
2. **Decoder**: Upsamples features using transposed convolutions with bilinear residual connections
3. **Local Attender**: A local-neighborhood-based attention-pooling module that maintains semantic consistency with the original features

The model uses encoder sharing: a single encoder pass is reused across all upsampling iterations for efficiency.

## Intended Use

This model is designed for:

- Creating pixel-dense feature maps from DINOv2 features
- Dense prediction tasks (semantic segmentation, depth estimation, etc.)
- Feature visualization and analysis
- Research on vision foundation models

## Limitations

- Optimized specifically for DINOv2-S/14 features; may not generalize to other backbones without retraining
- Performance depends on the quality of the underlying DINOv2 features
- Higher iteration counts increase computation time
- DINOv2 uses a patch size of 14, so 14x upsampling is required to produce pixel-dense features. UPLiFT with 4 iterations performs 16x upsampling, slightly over-sampling the features. If exactly pixel-dense features are required, we recommend downsampling the over-sampled output to the correct size with bilinear or bicubic interpolation (see the sketch below).
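
The snippet below is a minimal sketch of that correction. It assumes the upsampled output is a `(B, C, H', W')` tensor (e.g. 384 channels for DINOv2-S/14) and that the target resolution is the input image size; these shapes are assumptions, so check the actual output format of the release before relying on them.

```python
import torch.nn.functional as F

# Hypothetical example: `features` is assumed to be a (B, C, H', W') tensor of
# 16x-upsampled UPLiFT features, and (448, 448) is the input image resolution.
# Bilinear interpolation resizes it to an exactly pixel-dense feature map.
pixel_dense = F.interpolate(features, size=(448, 448), mode="bilinear", align_corners=False)
```

`mode="bicubic"` works the same way if bicubic interpolation is preferred.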
## Citation

If you use UPLiFT in your research, please cite our paper.

```bibtex
@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}
```

## Acknowledgements

This work builds upon:

- [DINOv2](https://github.com/facebookresearch/dinov2) by Meta AI
- [timm](https://github.com/huggingface/pytorch-image-models) for model loading