# nanoVLM (devilops/nanoVLM-222M)
This is a nano Vision-Language Model (nanoVLM) trained as part of the COM-304 course.
## Model Description
The model consists of three main components:
- Vision Backbone: pretrained `google/siglip-base-patch16-224`
- Language Model: pretrained `HuggingFaceTB/SmolLM2-135M`
- Modality Projector: a learnable linear layer with a pixel shuffle reduction
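The pixel shuffle reduction folds neighboring image patches into the channel dimension, cutting the number of visual tokens by a factor of r² before the linear projection. A minimal sketch of the idea, assuming square patch grids; the function name and the factor `r` are illustrative, not the repository's exact implementation:

```python
import torch

def pixel_shuffle_reduce(x: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Group r x r neighboring patch embeddings into one token.

    x: (batch, num_patches, dim), where num_patches is a perfect square.
    Returns: (batch, num_patches / r**2, dim * r**2).
    """
    b, n, d = x.shape
    s = int(n ** 0.5)                      # side length of the patch grid
    x = x.view(b, s, s, d)                 # restore the 2D patch layout
    x = x.view(b, s // r, r, s // r, r, d) # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5)        # bring the r x r block dims together
    return x.reshape(b, (s // r) ** 2, d * r * r)

# With SigLIP-base at 224px / patch 16 there are 14 x 14 = 196 patches;
# r = 2 reduces them to 49 tokens of 4x the embedding width.
tokens = torch.randn(1, 196, 64)
reduced = pixel_shuffle_reduce(tokens, r=2)  # shape (1, 49, 256)
```

The linear layer of the modality projector then maps these widened tokens into the language model's embedding space.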
## Usage
You can load this model with the `VisionLanguageModel` class from the nanoVLM repository:
```python
import torch
from models.vision_language_model import VisionLanguageModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = VisionLanguageModel.from_pretrained("devilops/nanoVLM-222M").to(device)
```