metadata
license: mit
library_name: pytorch
tags:
  - image-feature-extraction
  - feature-upsampling
  - pixel-dense-features
  - computer-vision
  - dinov3
  - vision-transformer
  - uplift
datasets:
  - ILSVRC/imagenet-1k

UPLiFT for DINOv3-S+/16

Figure: side-by-side comparison of the input image, the base DINOv3 features, and the UPLiFT upsampled features.

This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the DINOv3-S+/16 backbone.

UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. It uses Local Attenders to efficiently upsample low-resolution backbone features while preserving semantic information.

Model Details

Property              Value
Backbone              DINOv3-S+/16 (vit_small_plus_patch16_dinov3.lvd1689m)
Backbone Channels     384
Patch Size            16
Upsampling Factor     2x per iteration
Local Attender Size   N=17
Training Dataset      ImageNet
Training Image Size   448x448
License               MIT
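
As a quick sanity check on these numbers: a 448x448 image with patch size 16 gives a 28x28 backbone grid, and four 2x iterations (the default used later in this card) bring it back to full 448x448 resolution. A minimal sketch of that arithmetic, assuming the channel dimension stays at 384 through upsampling (the card implies but does not state this):

# Feature-map size arithmetic for DINOv3-S+/16 + UPLiFT (values from the table above).
image_size = 448      # training image size (pixels)
patch_size = 16       # DINOv3-S+/16 patch size
channels = 384        # backbone channels (assumed unchanged by upsampling)
iters = 4             # default number of 2x upsampling iterations

base_grid = image_size // patch_size       # 28 x 28 backbone tokens
dense_grid = base_grid * (2 ** iters)      # 448 x 448 after four doublings

print(f"base features:  {channels} x {base_grid} x {base_grid}")    # 384 x 28 x 28
print(f"dense features: {channels} x {dense_grid} x {dense_grid}")  # 384 x 448 x 448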

Links

  • Code: https://github.com/mwalmer-umd/UPLiFT
  • Paper: arXiv:2601.17950

Installation

pip install 'uplift[vit] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
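
To verify the install, a quick import check; note that the top-level module name uplift is assumed from the pip package name above and is not confirmed by this card:

# Sanity check that torch and the uplift package import cleanly.
import torch
import uplift  # assumed module name; adjust if the package layout differs

print(torch.__version__)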

Quick Start

import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16')

# Run inference
image = Image.open('your_image.jpg')
features = model(image)  # Returns pixel-dense features
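
A quick way to sanity-check the output, assuming the hub wrapper behaves like a standard nn.Module and returns a feature tensor in [batch, channels, height, width] layout (an assumption, not something this card states):

model.eval()                  # inference mode
with torch.no_grad():         # no gradients needed for feature extraction
    features = model(image)

# Assumed layout: [batch, 384, H, W]; adjust if the wrapper returns something else.
print(features.shape)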

Usage Options

Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 4):

# Fewer iterations than the default of 4 = lower memory usage, lower output resolution
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16', iters=2)
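
To put the memory trade-off in rough numbers, here is a back-of-the-envelope estimate of the dense feature map size at different iteration counts, assuming a 448x448 input, float32 features, and the 384-channel width kept through upsampling (all assumptions, not measurements):

# Rough per-image memory of the output feature map (float32, 4 bytes per element).
channels = 384
base_grid = 448 // 16                      # 28 x 28 backbone grid
for iters in (2, 3, 4):
    side = base_grid * 2 ** iters
    mib = channels * side * side * 4 / 2 ** 20
    print(f"iters={iters}: {side}x{side} map, ~{mib:.0f} MiB")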

Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module without the DINOv3 backbone:

model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       include_extractor=False)

Return Base Features

Get both upsampled and original backbone features:

model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       return_base_feat=True)
upsampled_features, base_features = model(image)
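
A small follow-up comparing the two outputs, again assuming both are tensors in [batch, channels, height, width] layout:

# Compare spatial resolutions of the base and upsampled feature maps.
up_h, up_w = upsampled_features.shape[-2:]
base_h, base_w = base_features.shape[-2:]
print(f"base: {base_h}x{base_w} -> upsampled: {up_h}x{up_w} "
      f"({up_h // base_h}x spatial upsampling)")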

Architecture

UPLiFT consists of:

  1. Encoder: Processes the input image with a series of convolutional blocks, producing dense representations that guide feature upsampling
  2. Decoder: Upsamples features using transposed convolutions with bilinear residual connections
  3. Local Attender: A local-neighborhood-based attention pooling module that maintains semantic consistency with the original features

The model uses encoder sharing, meaning a single encoder pass is used across all upsampling iterations for efficiency.
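
The control flow described above can be sketched as follows. This is a purely illustrative pseudo-implementation showing where encoder sharing, the bilinear residual, and the Local Attender sit in the loop; the module names and call signatures are hypothetical, not the actual UPLiFT code:

import torch.nn as nn
import torch.nn.functional as F

class IterativeUpsampler(nn.Module):
    """Illustrative sketch only; hypothetical modules, not the real UPLiFT code."""
    def __init__(self, encoder, decoder, local_attender, iters=4):
        super().__init__()
        self.encoder = encoder                 # convolutional image encoder (shared)
        self.decoder = decoder                 # transposed-conv upsampler
        self.local_attender = local_attender   # local-neighborhood attention pooling
        self.iters = iters

    def forward(self, image, base_feat):
        guide = self.encoder(image)            # single encoder pass, reused every iteration
        feat = base_feat
        for _ in range(self.iters):
            up = self.decoder(feat, guide)                    # learned 2x upsampling
            up = up + F.interpolate(feat, scale_factor=2,     # bilinear residual connection
                                    mode='bilinear', align_corners=False)
            feat = self.local_attender(up, base_feat)         # keep semantics consistent
        return feat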

Intended Use

This model is designed for:

  • Creating pixel-dense feature maps from DINOv3 features
  • Dense prediction tasks (semantic segmentation, depth estimation, etc.); see the sketch after this list
  • Feature visualization and analysis
  • Research on vision foundation models
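
For the dense prediction use case mentioned above, a minimal linear-probe sketch on top of the pixel-dense features; the 384-channel [batch, channels, height, width] layout is assumed, and the segmentation head and class count are placeholders, not part of UPLiFT:

import torch
import torch.nn as nn

NUM_CLASSES = 21                                     # placeholder class count
probe = nn.Conv2d(384, NUM_CLASSES, kernel_size=1)   # per-pixel linear classifier

features = torch.randn(1, 384, 448, 448)             # stand-in for model(image) output
logits = probe(features)                             # [1, NUM_CLASSES, 448, 448]
pred = logits.argmax(dim=1)                          # per-pixel class map
print(pred.shape)                                    # torch.Size([1, 448, 448])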

Limitations

  • Optimized specifically for DINOv3-S+/16 features; may not generalize to other backbones without retraining
  • Performance depends on the quality of the underlying DINOv3 features
  • Higher iteration counts increase computation time

Citation

If you use UPLiFT in your research, please cite our paper.

@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}

Acknowledgements

This work builds upon: