nanoVLM - devilops/nanoVLM-222M

This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.

Model Description

The model consists of three main components:

  • Vision Backbone: pretrained google/siglip-base-patch16-224
  • Language Model: pretrained HuggingFaceTB/SmolLM2-135M
  • Modality Projector: a learnable linear layer with pixel-shuffle token reduction

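To illustrate how the modality projector works, here is a minimal PyTorch sketch. It is an assumption-laden illustration, not the repository's actual code: the dimensions (768 for SigLIP-base, 576 for SmolLM2-135M), the 2x2 shuffle factor, and the class name are assumed for the example.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Hypothetical sketch of a pixel-shuffle projector (dims/names assumed)."""
    def __init__(self, vit_dim=768, lm_dim=576, scale=2):
        super().__init__()
        self.scale = scale
        # After grouping scale x scale neighboring patches, each token carries
        # scale^2 * vit_dim features, which one linear layer maps to the LM width.
        self.proj = nn.Linear(vit_dim * scale * scale, lm_dim)

    def forward(self, x):
        # x: (batch, num_patches, vit_dim); num_patches must form a square grid
        b, n, d = x.shape
        s = int(n ** 0.5)   # side length of the patch grid
        r = self.scale
        x = x.view(b, s, s, d)
        # split each spatial axis into (coarse, fine) so each 2x2 block is grouped
        x = x.view(b, s // r, r, s // r, r, d)
        # move the fine axes next to the channel axis, then flatten them into it
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (s // r) ** 2, r * r * d)
        return self.proj(x)

mp = ModalityProjector()
patches = torch.randn(2, 196, 768)   # SigLIP-base at 224px: 14 x 14 = 196 patches
img_tokens = mp(patches)
print(img_tokens.shape)              # torch.Size([2, 49, 576])
```

The shuffle cuts 196 patch embeddings down to 49 image tokens, a 4x reduction in the sequence length the language model must attend over.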
Usage

You can load this model with the VisionLanguageModel class from the nanoVLM repository.

from models.vision_language_model import VisionLanguageModel
import torch

# Use the GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("devilops/nanoVLM-222M").to(device)
model.eval()  # inference mode
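As a sanity check on the name, the three components roughly add up to 222M parameters. The figures below are approximate public sizes (assumptions), not values read from this checkpoint:

```python
# Back-of-the-envelope parameter count; component sizes are approximate
# public figures (assumptions), not values read from this checkpoint.
vision = 86_000_000          # SigLIP-base vision tower (ViT-B-sized encoder)
language = 135_000_000       # SmolLM2-135M
projector = (768 * 4) * 576  # linear layer after a 2x2 pixel shuffle (assumed dims)
total = vision + language + projector
print(f"~{total / 1e6:.0f}M parameters")  # close to the 222M in the model name
```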