nanoVLM - devilops/nanoVLM-222M

This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.

Model Description

The model consists of three main components:

  • Vision Backbone: pretrained google/siglip-base-patch16-224
  • Language Model: pretrained HuggingFaceTB/SmolLM2-135M
  • Modality Projector: a learnable linear layer with pixel-shuffle token reduction

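To illustrate how the modality projector works, here is a minimal PyTorch sketch. It is an assumption-laden illustration, not the repository's actual code: the dimensions (768 for SigLIP-base, 576 for SmolLM2-135M), the 2x2 shuffle factor, and the class name are assumed for the example.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Hypothetical sketch of a pixel-shuffle projector (dims/names assumed)."""
    def __init__(self, vit_dim=768, lm_dim=576, scale=2):
        super().__init__()
        self.scale = scale
        # After grouping scale x scale neighboring patches, each token carries
        # scale^2 * vit_dim features, which one linear layer maps to the LM width.
        self.proj = nn.Linear(vit_dim * scale * scale, lm_dim)

    def forward(self, x):
        # x: (batch, num_patches, vit_dim); num_patches must form a square grid
        b, n, d = x.shape
        s = int(n ** 0.5)   # side length of the patch grid
        r = self.scale
        x = x.view(b, s, s, d)
        # split each spatial axis into (coarse, fine) so each 2x2 block is grouped
        x = x.view(b, s // r, r, s // r, r, d)
        # move the fine axes next to the channel axis, then flatten them into it
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (s // r) ** 2, r * r * d)
        return self.proj(x)

mp = ModalityProjector()
patches = torch.randn(2, 196, 768)   # SigLIP-base at 224px: 14 x 14 = 196 patches
img_tokens = mp(patches)
print(img_tokens.shape)              # torch.Size([2, 49, 576])
```

The shuffle cuts 196 patch embeddings down to 49 image tokens, a 4x reduction in the sequence length the language model must attend over.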
Usage

You can load this model with the VisionLanguageModel class from the nanoVLM repository.

from models.vision_language_model import VisionLanguageModel
import torch

# Use the GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("devilops/nanoVLM-222M").to(device)
model.eval()  # inference mode
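As a sanity check on the name, the three components roughly add up to 222M parameters. The figures below are approximate public sizes (assumptions), not values read from this checkpoint:

```python
# Back-of-the-envelope parameter count; component sizes are approximate
# public figures (assumptions), not values read from this checkpoint.
vision = 86_000_000          # SigLIP-base vision tower (ViT-B-sized encoder)
language = 135_000_000       # SmolLM2-135M
projector = (768 * 4) * 576  # linear layer after a 2x2 pixel shuffle (assumed dims)
total = vision + language + projector
print(f"~{total / 1e6:.0f}M parameters")  # close to the 222M in the model name
```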