metadata
license: mit
library_name: pytorch
tags:
  - image-feature-extraction
  - feature-upsampling
  - pixel-dense-features
  - computer-vision
  - dinov3
  - vision-transformer
  - uplift
datasets:
  - ILSVRC/imagenet-1k

UPLiFT for DINOv3-S+/16

Figure: side-by-side comparison of the input image, the base DINOv3 features, and the UPLiFT upsampled features.

This is the official pretrained UPLiFT (Efficient Pixel-Dense Feature Upsampling with Local Attenders) model for the DINOv3-S+/16 backbone.

UPLiFT is a lightweight method for upsampling features from pretrained vision backbones into pixel-dense feature maps. It uses Local Attenders to efficiently upsample low-resolution backbone features while preserving semantic information.

Model Details

Property              Value
Backbone              DINOv3-S+/16 (vit_small_plus_patch16_dinov3.lvd1689m)
Backbone Channels     384
Patch Size            16
Upsampling Factor     2x per iteration
Local Attender Size   N=17
Training Dataset      ImageNet
Training Image Size   448x448
License               MIT
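
As a quick sanity check on these numbers: a 448x448 image with patch size 16 gives a 28x28 backbone grid, and four 2x iterations (the default used later in this card) bring it back to full 448x448 resolution. A minimal sketch of that arithmetic, assuming the channel dimension stays at 384 through upsampling (the card implies but does not state this):

# Feature-map size arithmetic for DINOv3-S+/16 + UPLiFT (values from the table above).
image_size = 448      # training image size (pixels)
patch_size = 16       # DINOv3-S+/16 patch size
channels = 384        # backbone channels (assumed unchanged by upsampling)
iters = 4             # default number of 2x upsampling iterations

base_grid = image_size // patch_size       # 28 x 28 backbone tokens
dense_grid = base_grid * (2 ** iters)      # 448 x 448 after four doublings

print(f"base features:  {channels} x {base_grid} x {base_grid}")    # 384 x 28 x 28
print(f"dense features: {channels} x {dense_grid} x {dense_grid}")  # 384 x 448 x 448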

Links

  • Code: https://github.com/mwalmer-umd/UPLiFT
  • Paper: arXiv:2601.17950

Installation

pip install 'uplift[vit] @ git+https://github.com/mwalmer-umd/UPLiFT.git'
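
To verify the install, a quick import check; note that the top-level module name uplift is assumed from the pip package name above and is not confirmed by this card:

# Sanity check that torch and the uplift package import cleanly.
import torch
import uplift  # assumed module name; adjust if the package layout differs

print(torch.__version__)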

Quick Start

import torch
from PIL import Image

# Load model (weights auto-download from HuggingFace)
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16')

# Run inference
image = Image.open('your_image.jpg')
features = model(image)  # Returns pixel-dense features
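
A quick way to sanity-check the output, assuming the hub wrapper behaves like a standard nn.Module and returns a feature tensor in [batch, channels, height, width] layout (an assumption, not something this card states):

model.eval()                  # inference mode
with torch.no_grad():         # no gradients needed for feature extraction
    features = model(image)

# Assumed layout: [batch, 384, H, W]; adjust if the wrapper returns something else.
print(features.shape)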

Usage Options

Adjust Upsampling Iterations

Control the number of iterative upsampling steps (default: 4):

# Fewer iterations than the default of 4 = lower memory usage, lower output resolution
model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16', iters=2)
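
To put the memory trade-off in rough numbers, here is a back-of-the-envelope estimate of the dense feature map size at different iteration counts, assuming a 448x448 input, float32 features, and the 384-channel width kept through upsampling (all assumptions, not measurements):

# Rough per-image memory of the output feature map (float32, 4 bytes per element).
channels = 384
base_grid = 448 // 16                      # 28 x 28 backbone grid
for iters in (2, 3, 4):
    side = base_grid * 2 ** iters
    mib = channels * side * side * 4 / 2 ** 20
    print(f"iters={iters}: {side}x{side} map, ~{mib:.0f} MiB")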

Raw UPLiFT Model (Without Backbone)

Load only the UPLiFT upsampling module without the DINOv3 backbone:

model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       include_extractor=False)

Return Base Features

Get both upsampled and original backbone features:

model = torch.hub.load('mwalmer-umd/UPLiFT', 'uplift_dinov3_splus16',
                       return_base_feat=True)
upsampled_features, base_features = model(image)
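
A small follow-up comparing the two outputs, again assuming both are tensors in [batch, channels, height, width] layout:

# Compare spatial resolutions of the base and upsampled feature maps.
up_h, up_w = upsampled_features.shape[-2:]
base_h, base_w = base_features.shape[-2:]
print(f"base: {base_h}x{base_w} -> upsampled: {up_h}x{up_w} "
      f"({up_h // base_h}x spatial upsampling)")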

Architecture

UPLiFT consists of:

  1. Encoder: Processes the input image with a series of convolutional blocks, producing dense representations that guide feature upsampling
  2. Decoder: Upsamples features using transposed convolutions with bilinear residual connections
  3. Local Attender: A local-neighborhood-based attention pooling module that maintains semantic consistency with the original features

The model uses encoder sharing, meaning a single encoder pass is used across all upsampling iterations for efficiency.
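
The control flow described above can be sketched as follows. This is a purely illustrative pseudo-implementation showing where encoder sharing, the bilinear residual, and the Local Attender sit in the loop; the module names and call signatures are hypothetical, not the actual UPLiFT code:

import torch.nn as nn
import torch.nn.functional as F

class IterativeUpsampler(nn.Module):
    """Illustrative sketch only; hypothetical modules, not the real UPLiFT code."""
    def __init__(self, encoder, decoder, local_attender, iters=4):
        super().__init__()
        self.encoder = encoder                 # convolutional image encoder (shared)
        self.decoder = decoder                 # transposed-conv upsampler
        self.local_attender = local_attender   # local-neighborhood attention pooling
        self.iters = iters

    def forward(self, image, base_feat):
        guide = self.encoder(image)            # single encoder pass, reused every iteration
        feat = base_feat
        for _ in range(self.iters):
            up = self.decoder(feat, guide)                    # learned 2x upsampling
            up = up + F.interpolate(feat, scale_factor=2,     # bilinear residual connection
                                    mode='bilinear', align_corners=False)
            feat = self.local_attender(up, base_feat)         # keep semantics consistent
        return feat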

Intended Use

This model is designed for:

  • Creating pixel-dense feature maps from DINOv3 features
  • Dense prediction tasks (semantic segmentation, depth estimation, etc.); see the sketch after this list
  • Feature visualization and analysis
  • Research on vision foundation models
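
For the dense prediction use case mentioned above, a minimal linear-probe sketch on top of the pixel-dense features; the 384-channel [batch, channels, height, width] layout is assumed, and the segmentation head and class count are placeholders, not part of UPLiFT:

import torch
import torch.nn as nn

NUM_CLASSES = 21                                     # placeholder class count
probe = nn.Conv2d(384, NUM_CLASSES, kernel_size=1)   # per-pixel linear classifier

features = torch.randn(1, 384, 448, 448)             # stand-in for model(image) output
logits = probe(features)                             # [1, NUM_CLASSES, 448, 448]
pred = logits.argmax(dim=1)                          # per-pixel class map
print(pred.shape)                                    # torch.Size([1, 448, 448])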

Limitations

  • Optimized specifically for DINOv3-S+/16 features; may not generalize to other backbones without retraining
  • Performance depends on the quality of the underlying DINOv3 features
  • Higher iteration counts increase computation time

Citation

If you use UPLiFT in your research, please cite our paper.

@article{walmer2026uplift,
  title={UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author={Walmer, Matthew and Suri, Saksham and Aggarwal, Anirud and Shrivastava, Abhinav},
  journal={arXiv preprint arXiv:2601.17950},
  year={2026}
}

Acknowledgements

This work builds upon: