---
license: apache-2.0
language:
- en
tags:
- multimodal
- vision-language
- gemma
- clip
- llava
- pytorch
- lightning
datasets:
- liuhaotian/LLaVA-Instruct-150K
pipeline_tag: image-to-text
---

# Multimodal Gemma-270M

A **Multimodal Vision-Language Model** combining Google Gemma-270M with a CLIP vision encoder, trained on the full LLaVA-Instruct-150K dataset.

## 🎯 Model Inference Examples

Here are real inference results from our trained model:

### 🐱 Animal Detection

| Cats on Couch | White Cat Sleeping |
|---------------|--------------------|
| ![Cat Prediction](inference_results/sample_001_prediction.png) | ![White Cat](inference_results/sample_009_prediction.png) |

### 🐕 Dog Recognition

| Golden Retriever in Park |
|--------------------------|
| ![Dog Prediction](inference_results/sample_007_prediction.png) |

### 🏠 Room & Scene Understanding

| Modern Kitchen | Clean Kitchen |
|----------------|---------------|
| ![Kitchen 1](inference_results/sample_003_prediction.png) | ![Kitchen 2](inference_results/sample_004_prediction.png) |

### 🍕 Food & Objects

| Food Scene | Apple on Table |
|------------|----------------|
| ![Food](inference_results/sample_002_prediction.png) | ![Apple](inference_results/sample_008_prediction.png) |

### 🛹 Activity & People

| Skate Park | Family Dining |
|------------|---------------|
| ![Skate Park](inference_results/sample_005_prediction.png) | ![Family](inference_results/sample_006_prediction.png) |

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| **Training Samples** | 157,712 (full LLaVA dataset) |
| **Epochs** | 3 |
| **Final Training Loss** | 1.333 |
| **Final Validation Loss** | 1.430 |
| **Total Parameters** | 539M |
| **Trainable Parameters** | 18.6M (3.4%) |
| **GPU** | NVIDIA A100 40GB |
| **Training Time** | ~9 hours |
| **Batch Size** | 20 (effective: 40) |
| **Precision** | bf16-mixed |

## 📈 Benchmark Results

| Benchmark | Score |
|-----------|-------|
| **Basic VQA** | 53.8% (7/13 correct) |
| **POPE Hallucination** | 20.0% |

### VQA Breakdown

- ✅ Animal identification (cats, dogs)
- ✅ Room identification (kitchen, living room)
- ✅ Object presence detection
- ⚠️ Color identification (moderate)
- ⚠️ Detailed attributes (needs improvement)

## 🏗️ Architecture

| Component | Details |
|-----------|---------|
| **Language Model** | Google Gemma-3-270M with LoRA adapters |
| **Vision Encoder** | OpenAI CLIP ViT-Large/14 (frozen, 428M params) |
| **Vision Projector** | MLP (3.4M params) |
| **LoRA** | r=16, alpha=32, dropout=0.1 |
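The vision projector is the piece that maps frozen CLIP patch features into Gemma's embedding space. Below is a minimal sketch of a LLaVA-style MLP projector; the layer widths are assumptions, not values read from the training code (1024 matches CLIP ViT-Large/14 hidden states, and a 1024 → 2048 → 640 MLP lands at roughly the 3.4M parameters reported above).

```python
import torch.nn as nn

# Hypothetical LLaVA-style projector: CLIP ViT-L/14 hidden states (1024-dim)
# -> Gemma embedding space. The 2048 intermediate width and 640 output width
# are illustrative assumptions; together they give ~3.4M parameters.
class VisionProjector(nn.Module):
    def __init__(self, clip_dim=1024, hidden_dim=2048, gemma_dim=640):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, gemma_dim),
        )

    def forward(self, clip_features):
        # clip_features: (batch, num_patches, clip_dim) from the frozen encoder
        return self.mlp(clip_features)
```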
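The LoRA row of the table maps directly onto a Hugging Face `peft` configuration. The sketch below shows how such adapters could be attached to the language model; the `target_modules` list and the `google/gemma-3-270m` checkpoint id are assumptions (a typical setup for Gemma attention projections), not confirmed details of this repo.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# r, lora_alpha and lora_dropout come from the table above; target_modules is
# an assumed (typical) choice for Gemma attention projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_lm = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")
lm = get_peft_model(base_lm, lora_config)
lm.print_trainable_parameters()  # only the LoRA adapters remain trainable
```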
## 🚀 Usage

```python
import torch
from PIL import Image

from src.models.multimodal_gemma import MultimodalGemma

# Build the model config first (see the GitHub repo for the training
# configuration), then restore the trained weights from the checkpoint.
model = MultimodalGemma(config)
checkpoint = torch.load("final_model.ckpt", map_location="cpu")
model.load_state_dict(checkpoint["state_dict"])
model.eval()

# Inference
image = Image.open("your_image.jpg")
prompt = "What do you see in this image?"
response = model.generate(image, prompt)
print(response)
```

## 📁 Files

| File | Size | Description |
|------|------|-------------|
| `final_model.ckpt` | 1.2GB | Full model checkpoint |
| `inference_results/` | 13.8MB | Example predictions with images |

## 🔗 Links

- **GitHub**: [sagar431/multimodal-gemma-270m](https://github.com/sagar431/multimodal-gemma-270m)
- **Demo**: [HuggingFace Space](https://huggingface.co/spaces/sagar007/Multimodal-Gemma)

## 📚 References

- [LLaVA Paper](https://arxiv.org/abs/2304.08485)
- [Gemma Technical Report](https://arxiv.org/abs/2403.08295)

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

- Google for Gemma models
- OpenAI for CLIP
- LLaVA team for multimodal architecture inspiration
- PyTorch Lightning team