CLIP-ViT-IJEPA-VLMs-0.5B β€” Vision Encoder Stitching Comparison

Which frozen vision encoder produces the best embeddings for a VLM?

This repo contains trained projector weights + LoRA adapters from a controlled experiment comparing three frozen vision encoders stitched into Qwen2.5-0.5B-Instruct. Each variant was trained on COCO-QA (46k Q&A pairs) for 2,000 steps.

See also: 1.5B version (VQAv2 training)


The Experiment

[Frozen Vision Encoder] ──► [Trainable Projector] ──► [Qwen-0.5B + LoRA]
         β”‚                          β”‚                         β”‚
    CLIP / ViT / I-JEPA      LayerNorm+MLP           rank=16, q_proj+v_proj

Vision Encoders Compared

| Encoder | Model | Hidden Dim | Patches | Params | Pre-training |
|---|---|---|---|---|---|
| ViT-L/16 | google/vit-large-patch16-224 | 1024 | 196 | 304M | Supervised (ImageNet-21k) |
| CLIP ViT-L/14 | openai/clip-vit-large-patch14 | 1024 | 256 | 304M | Contrastive (400M image-text pairs) |
| I-JEPA ViT-H/14 | facebook/ijepa_vith14_1k | 1280 | 256 | 632M | Self-supervised (predict masked patches) |
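(Patch counts follow from the 224-px input resolution: (224/16)² = 196 patches for ViT-L/16, and (224/14)² = 256 for the two /14 models.)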

Training Setup

| Component | Details |
|---|---|
| LLM | Qwen/Qwen2.5-0.5B-Instruct (frozen + LoRA rank=16, q_proj + v_proj) |
| Projector | LayerNorm → Linear → GELU → Linear |
| Dataset | HuggingFaceM4/the_cauldron (cocoqa subset, 46k Q&A pairs) |
| Optimizer | AdamW, lr=1e-4, linear warmup 100 steps |
| Steps | 2,000 optimizer steps, effective batch size 16 |
| Hardware | Single A100 GPU (~18 min per encoder) |
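For reference, a minimal sketch of this setup with peft and transformers. Rank 16 on q_proj/v_proj, AdamW at lr=1e-4, and the 100-step linear warmup come from the table above; lora_alpha and lora_dropout are assumptions, since the card does not list them.

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct",
                                           torch_dtype=torch.bfloat16)
lora_cfg = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=32,      # assumed; not stated in this card
    lora_dropout=0.0,   # assumed; not stated in this card
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)  # base weights frozen, only LoRA A/B matrices train

# Trainable tensors = LoRA matrices (+ the projector, built separately)
trainable = [p for p in llm.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=2000)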

Results

Training Loss

| Encoder | Final Loss | Min Loss | Trainable Params | Total Params |
|---|---|---|---|---|
| CLIP | 0.6815 | 0.1747 | 2,805,504 (0.35%) | 800,018,048 |
| ViT | 0.5221 | 0.2141 | 2,805,504 (0.35%) | 801,189,504 |
| I-JEPA | 0.4947 | 0.2282 | 3,035,392 (0.27%) | 1,127,830,400 |
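The trainable counts are consistent with the architecture: for CLIP/ViT, the projector (LayerNorm over 1024 dims, Linear 1024→896, Linear 896→896, with Qwen2.5-0.5B's hidden size of 896) contributes 1,724,160 parameters, and the rank-16 LoRA pairs on q_proj (896→896) and v_proj (896→128) across 24 layers contribute 24 × (2·16·896 + 16·896 + 16·128) = 1,081,344, summing to exactly 2,805,504.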

Evaluation Accuracy (exact match, 1K samples per benchmark)

| Encoder | CLEVR | COCO-QA | TallyQA | Average |
|---|---|---|---|---|
| CLIP | 5.0% | 55.8% | 0.2% | 20.3% |
| I-JEPA | 10.4% | 51.2% | 0.0% | 20.5% |
| ViT | 10.6% | 51.0% | 0.0% | 20.5% |
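"Exact match" here presumably means a normalized string comparison between the generated answer and the gold answer. A minimal sketch of such a scorer; the normalization details are an assumption, not necessarily what the repo's eval.py does:

def exact_match(pred: str, gold: str) -> bool:
    normalize = lambda s: s.strip().lower().rstrip(".")
    return normalize(pred) == normalize(gold)

# accuracy over one benchmark's 1K samples
accuracy = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)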

Sample Inference (Golden Retriever Image)

| Question | CLIP | ViT | I-JEPA |
|---|---|---|---|
| "Describe the image?" | Dog | Dog | Dog |
| "Is the dog's tongue out?" | Yes | Yes | Yes |
| "What color is the fur?" | Dog | Yellow | Brown |
| "What color is the dog?" | Fur | Brown | Brown |
| "What animal is this?" | Dog | Dog | Dog |

Key Findings (0.5B)

  • CLIP excels on COCO-QA (55.8%) β€” its text-aligned embeddings match the Q&A format well
  • I-JEPA and ViT lead on CLEVR (~10%) β€” better at compositional/spatial reasoning
  • All three struggle on TallyQA (counting) with only a 0.5B LLM and short training
  • Answers are short (one word) because COCO-QA uses a short Q&A format

Repo Structure

β”œβ”€β”€ clip/
β”‚   β”œβ”€β”€ projector.pt           # Trained MLP bridge weights
β”‚   β”œβ”€β”€ lora_adapter/          # LoRA adapter (loadable via peft)
β”‚   β”œβ”€β”€ loss_history.json      # Per-step training loss
β”‚   └── config.json            # Full experiment config
β”œβ”€β”€ vit/
β”‚   └── ... (same structure)
└── ijepa/
    └── ... (same structure)

Quick Start

Install

pip install torch transformers peft Pillow huggingface-hub

Inference (standalone β€” no repo clone needed)

import json, torch
from PIL import Image
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from huggingface_hub import hf_hub_download, snapshot_download

REPO = "Teen-Different/CLIP-ViT-IJEPA-VLMs-0.5B"
ENCODER = "clip"  # or "vit" or "ijepa"

# Download weights
cfg = json.load(open(hf_hub_download(REPO, f"{ENCODER}/config.json")))
proj_path = hf_hub_download(REPO, f"{ENCODER}/projector.pt")
lora_dir = snapshot_download(REPO, allow_patterns=f"{ENCODER}/lora_adapter/*")
lora_path = f"{lora_dir}/{ENCODER}/lora_adapter"

# Load encoder (example for CLIP; ViT and I-JEPA swaps are sketched after this block)
from transformers import CLIPVisionModel, CLIPImageProcessor
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14",
                                           torch_dtype=torch.bfloat16).cuda().eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Load projector
import torch.nn as nn
class Projector(nn.Module):
    def __init__(self, vd, ld):
        super().__init__()
        self.norm = nn.LayerNorm(vd)
        self.fc1 = nn.Linear(vd, ld)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(ld, ld)
    def forward(self, x):
        return self.fc2(self.act(self.fc1(self.norm(x))))

llm_dim = AutoConfig.from_pretrained(cfg["llm_model_id"]).hidden_size
projector = Projector(1024, llm_dim).to(torch.bfloat16).cuda()  # 1024 for CLIP/ViT, 1280 for I-JEPA
projector.load_state_dict(torch.load(proj_path, map_location="cuda", weights_only=True))
projector.eval()

# Load LLM + LoRA
tokenizer = AutoTokenizer.from_pretrained(cfg["llm_model_id"])
llm = AutoModelForCausalLM.from_pretrained(cfg["llm_model_id"],
                                            torch_dtype=torch.bfloat16).cuda()
llm = PeftModel.from_pretrained(llm, lora_path).cuda().eval()

# Run inference
image = Image.open("photo.jpg").convert("RGB")
pixels = processor(images=[image], return_tensors="pt")["pixel_values"].cuda()

with torch.no_grad():
    vis = encoder(pixel_values=pixels).last_hidden_state[:, 1:, :]  # drop CLS (CLIP/ViT; I-JEPA has no CLS)
    vis = projector(vis)
    tok = tokenizer("<image>\nWhat is in this image?", return_tensors="pt").to("cuda")
    txt = llm.get_input_embeddings()(tok["input_ids"])  # token embeddings via the PEFT wrapper
    embeds = torch.cat([vis, txt], dim=1)
    mask = torch.ones(embeds.shape[:2], dtype=torch.long, device="cuda")
    out = llm.generate(inputs_embeds=embeds, attention_mask=mask,
                       max_new_tokens=50, min_new_tokens=3, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
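To run the vit or ijepa variants, swap out the encoder block above. A sketch using standard transformers classes (assumed to match how the checkpoints were produced; note the I-JEPA differences called out in the comments):

# ViT: same 1024-dim, CLS-prefixed output as CLIP
from transformers import ViTModel, ViTImageProcessor
encoder = ViTModel.from_pretrained("google/vit-large-patch16-224",
                                   torch_dtype=torch.bfloat16).cuda().eval()
processor = ViTImageProcessor.from_pretrained("google/vit-large-patch16-224")

# I-JEPA: 1280-dim output with no CLS token, so build Projector(1280, llm_dim)
# and use last_hidden_state as-is (skip the [:, 1:, :] slice)
from transformers import AutoModel, AutoImageProcessor
encoder = AutoModel.from_pretrained("facebook/ijepa_vith14_1k",
                                    torch_dtype=torch.bfloat16).cuda().eval()
processor = AutoImageProcessor.from_pretrained("facebook/ijepa_vith14_1k")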

Using the CLI (from source repo)

git clone https://github.com/REDDITARUN/CLIP-ViT-IJEPA-VLM.git && cd CLIP-ViT-IJEPA-VLM
pip install -r requirements.txt

# Inference
python inference.py --encoder clip --image photo.jpg --from-hub Teen-Different/CLIP-ViT-IJEPA-VLMs-0.5B

# Evaluate
python eval.py --from-hub Teen-Different/CLIP-ViT-IJEPA-VLMs-0.5B

# Interactive
python inference.py --encoder clip --image photo.jpg --from-hub Teen-Different/CLIP-ViT-IJEPA-VLMs-0.5B --interactive

What Makes Each Encoder Different?

| Encoder | What the embeddings encode | Trained on | Best for |
|---|---|---|---|
| CLIP | Language-aligned concepts ("dog", "red", "beach") | 400M image-text pairs | Fast convergence, concept questions |
| I-JEPA | Spatial structure, part-whole relations, layout | ImageNet (self-supervised) | Spatial reasoning, compositional tasks |
| ViT | ImageNet class-discriminative features | ImageNet-21k (supervised) | Classification-style tasks |

Limitations

  • 0.5B LLM β€” small language model limits answer quality and length
  • Short answers β€” trained on COCO-QA which has 1-3 word answers
  • 2,000 steps β€” relatively short training; longer training would improve all encoders
  • Single dataset β€” trained only on COCO-QA; multi-dataset training would improve generalization

Citation

@misc{clip-vit-ijepa-vlm-2026,
  title={CLIP vs ViT vs I-JEPA: Vision Encoder Stitching Comparison},
  author={Tarun Reddi},
  year={2026},
  url={https://github.com/REDDITARUN/CLIP-ViT-IJEPA-VLM}
}