---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- gemma
- clip
- llava
- pytorch
- lightning
datasets:
- liuhaotian/LLaVA-Instruct-150K
pipeline_tag: image-to-text
---

# Multimodal Gemma-270M

A **Multimodal Vision-Language Model** combining Google Gemma-270M with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.

## 🎯 Model Inference Examples

Here are real inference results from our trained model:

### 🐱 Animal Detection

| Cats on Couch | White Cat Sleeping |
|---------------|--------------------|
| ![Cat Prediction](inference_results/sample_001_prediction.png) | ![White Cat](inference_results/sample_009_prediction.png) |

### 🐕 Dog Recognition

| Golden Retriever in Park |
|--------------------------|
| ![Dog Prediction](inference_results/sample_007_prediction.png) |

### 🏠 Room & Scene Understanding

| Modern Kitchen | Clean Kitchen |
|----------------|---------------|
| ![Kitchen 1](inference_results/sample_003_prediction.png) | ![Kitchen 2](inference_results/sample_004_prediction.png) |

### 🍕 Food & Objects

| Food Scene | Apple on Table |
|------------|----------------|
| ![Food](inference_results/sample_002_prediction.png) | ![Apple](inference_results/sample_008_prediction.png) |

### 🛹 Activity & People

| Skate Park | Family Dining |
|------------|---------------|
| ![Skate Park](inference_results/sample_005_prediction.png) | ![Family](inference_results/sample_006_prediction.png) |

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| **Training Samples** | 157,712 (full LLaVA dataset) |
| **Epochs** | 3 |
| **Final Training Loss** | 1.333 |
| **Final Validation Loss** | 1.430 |
| **Total Parameters** | 539M |
| **Trainable Parameters** | 18.6M (3.4%) |
| **GPU** | NVIDIA A100 40GB |
| **Training Time** | ~9 hours |
| **Batch Size** | 20 (effective: 40) |
| **Precision** | bf16-mixed |

## 📈 Benchmark Results

| Benchmark | Score |
|-----------|-------|
| **Basic VQA** | 53.8% (7/13 correct) |
| **POPE Hallucination** | 20.0% |

### VQA Breakdown

- ✅ Animal identification (cats, dogs)
- ✅ Room identification (kitchen, living room)
- ✅ Object presence detection
- ⚠️ Color identification (moderate)
- ⚠️ Detailed attributes (needs improvement)

## 🏗️ Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | Google Gemma-3-270M with LoRA adapters |
| **Vision Encoder** | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| **Vision Projector** | MLP (3.4M params) |
| **LoRA** | r=16, alpha=32, dropout=0.1 |
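The vision projector is the piece that maps frozen CLIP patch features into Gemma's embedding space. Below is a minimal sketch of a LLaVA-style MLP projector; the layer widths are assumptions, not values read from the training code (1024 matches CLIP ViT-Large/14 hidden states, and a 1024 → 2048 → 640 MLP lands at roughly the 3.4M parameters reported above).

```python
import torch.nn as nn

# Hypothetical LLaVA-style projector: CLIP ViT-L/14 hidden states (1024-dim)
# -> Gemma embedding space. The 2048 intermediate width and 640 output width
# are illustrative assumptions; together they give ~3.4M parameters.
class VisionProjector(nn.Module):
    def __init__(self, clip_dim=1024, hidden_dim=2048, gemma_dim=640):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, gemma_dim),
        )

    def forward(self, clip_features):
        # clip_features: (batch, num_patches, clip_dim) from the frozen encoder
        return self.mlp(clip_features)
```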
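The LoRA row of the table maps directly onto a Hugging Face `peft` configuration. The sketch below shows how such adapters could be attached to the language model; the `target_modules` list and the `google/gemma-3-270m` checkpoint id are assumptions (a typical setup for Gemma attention projections), not confirmed details of this repo.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# r, lora_alpha and lora_dropout come from the table above; target_modules is
# an assumed (typical) choice for Gemma attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_lm = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
lm = get_peft_model(base_lm, lora_config)
lm.print_trainable_parameters()  # only the LoRA adapters remain trainable
```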
## 🚀 Usage

```python
import torch
from PIL import Image

from src.models.multimodal_gemma import MultimodalGemma

# Build the model config first (see the GitHub repo for the training
# configuration), then restore the trained weights from the checkpoint.
model = MultimodalGemma(config)
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Inference
image = Image.open("your_image.jpg")
prompt = "What do you see in this image?"
response = model.generate(image, prompt)
print(response)
```

## 📁 Files

| File | Size | Description |
|------|------|-------------|
| `final_model.ckpt` | 1.2GB | Full model checkpoint |
| `inference_results/` | 13.8MB | Example predictions with images |

## 🔗 Links

- **GitHub**: [sagar431/multimodal-gemma-270m](https://github.com/sagar431/multimodal-gemma-270m)
- **Demo**: [HuggingFace Space](https://huggingface.co/spaces/sagar007/Multimodal-Gemma)

## 📚 References

- [LLaVA Paper](https://arxiv.org/abs/2304.08485)
- [Gemma Technical Report](https://arxiv.org/abs/2403.08295)

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

- Google for Gemma models
- OpenAI for CLIP
- LLaVA team for multimodal architecture inspiration
- PyTorch Lightning team