---
library_name: transformers
tags:
- vision
- image-text
- clip
- zero-shot
---

<div align="center">
  <img class="block dark:hidden" src="assets/Raon-VisionEncoder-Gradient-Black.png" alt="Raon VisionEncoder" width="600">
  <img class="hidden dark:block" src="assets/Raon-VisionEncoder-Gradient-White.png" alt="Raon VisionEncoder" width="600">
</div>

<p align="center">
  <a href="https://www.krafton.ai/ko/"><img src="https://img.shields.io/badge/Homepage-KRAFTON%20AI-blue?style=flat&logo=google-chrome&logoColor=white" alt="Homepage"></a>
  <br>
  <a href="https://huggingface.co/KRAFTON"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KRAFTON-yellow?style=flat" alt="Hugging Face"></a>
  <a href="https://x.com/Krafton_AI"><img src="https://img.shields.io/badge/X-KRAFTON%20AI-white?style=flat&logo=x&logoColor=black" alt="X"></a>
  <br>
  <a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey?style=flat" alt="License"></a>
</p>

**Raon-VisionEncoder** is a 1.14B-parameter vision-language foundation model from [KRAFTON](https://www.krafton.com) for image and text feature extraction.
It supports zero-shot image classification, image-text retrieval, and native-aspect-ratio inference via NaFlex.
It is built on [OpenCLIP](https://github.com/mlfoundations/open_clip) with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.

## Pretrained Models

| Model | Params (Inference) | Vision | Text | Patch Size | NaFlex Default Patches |
|-------|--------------------|--------|------|------------|------------------------|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |

## Requirements

```bash
pip install torch torchvision timm transformers huggingface-hub safetensors ftfy
```

## Quick Start

```python
import torch
from transformers import AutoModel
from PIL import Image

# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Encode image and text
img_inputs = processor(images=Image.open("assets/photo.jpg"))
txt_inputs = processor(text=["a cat", "a dog"])

with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

    # Compute similarity with learned scale and bias
    logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
    probs = logits.softmax(dim=-1)
    print(probs)
```

## API Reference

| Method | Input | Output |
|--------|-------|--------|
| `model.encode_image(**inputs)` | Processor output (image) | `[B, 1152]` normalized image features |
| `model.encode_text(**inputs)` | Processor output (text) | `[B, 1152]` normalized text features |
| `model.logit_scale` | - | Learned temperature parameter |
| `model.logit_bias` | - | Learned bias parameter |
| `model.get_processor(repo_id)` | HuggingFace repo ID | Processor instance |
| `processor(images=img)` | PIL Image | Preprocessed image dict |
| `processor(text=["a cat"])` | list of strings | Tokenized text dict |
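The similarity step from Quick Start can be sketched without downloading the model, using dummy normalized features with the `[B, 1152]` shapes from the table above. The `logit_scale` and `logit_bias` values below are hypothetical stand-ins for the model's learned parameters, for illustration only:

```python
import torch

# Dummy L2-normalized features: 2 images, 3 text prompts, 1152-dim embeddings
img_feat = torch.nn.functional.normalize(torch.randn(2, 1152), dim=-1)
txt_feat = torch.nn.functional.normalize(torch.randn(3, 1152), dim=-1)

# Hypothetical stand-ins for the learned parameters; logit_scale is stored in
# log space, so exp() recovers the actual temperature
logit_scale = torch.tensor(4.6)
logit_bias = torch.tensor(-10.0)

# Scaled, biased cosine similarities, then a per-image softmax over texts
logits = logit_scale.exp() * (img_feat @ txt_feat.T) + logit_bias
probs = logits.softmax(dim=-1)

print(probs.shape)  # torch.Size([2, 3])
```

Each row of `probs` is a distribution over the text prompts for one image; `argmax(dim=-1)` gives the zero-shot label prediction.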

## License

This repository is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
Third-party notices are listed in [NOTICE](NOTICE).

© 2026 KRAFTON