---
library_name: transformers
tags:
- vision
- image-text
- clip
- zero-shot
---
<div align="center">
<img class="block dark:hidden" src="assets/Raon-VisionEncoder-Gradient-Black.png" alt="Raon VisionEncoder" width="600">
<img class="hidden dark:block" src="assets/Raon-VisionEncoder-Gradient-White.png" alt="Raon VisionEncoder" width="600">
</div>
<p align="center">
<a href="https://www.krafton.ai/ko/"><img src="https://img.shields.io/badge/Homepage-KRAFTON%20AI-blue?style=flat&logo=google-chrome&logoColor=white" alt="Homepage"></a>
<br>
<a href="https://huggingface.co/KRAFTON"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-KRAFTON-yellow?style=flat" alt="Hugging Face"></a>
<a href="https://x.com/Krafton_AI"><img src="https://img.shields.io/badge/X-KRAFTON%20AI-white?style=flat&logo=x&logoColor=black" alt="X"></a>
<br>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-lightgrey?style=flat" alt="License"></a>
</p>
**Raon-VisionEncoder** is a 1.14B-parameter vision-language foundation model by [KRAFTON](https://www.krafton.com) for image and text feature extraction.
It supports zero-shot image classification, image-text retrieval, and native-aspect-ratio inference via NaFlex.
It is built on [OpenCLIP](https://github.com/mlfoundations/open_clip) with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.
## Pretrained Models
| Model | Params (Inference) | Vision | Text | Patch Size | NaFlex Default Patches |
|-------|--------------------|--------|------|------------|------------------------|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |
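NaFlex processes images at their native aspect ratio under a fixed patch budget rather than forcing a square resize. A minimal sketch of how a 256-patch budget of 16x16 patches constrains the effective resolution; the exact resize rule used by the model's processor is an assumption here, this only illustrates the aspect-ratio-preserving arithmetic:

```python
import math

PATCH = 16     # patch size from the table above
BUDGET = 256   # NaFlex default patch budget

def naflex_grid(height: int, width: int, budget: int = BUDGET, patch: int = PATCH):
    """Illustrative resize: scale the image so its patch grid fits within
    the patch budget while preserving the aspect ratio."""
    # Choose scale s so that (h*s/patch) * (w*s/patch) == budget, then floor
    scale = math.sqrt(budget * patch * patch / (height * width))
    gh = max(1, math.floor(height * scale / patch))
    gw = max(1, math.floor(width * scale / patch))
    return gh, gw, gh * gw

# A 640x480 photo keeps its 4:3 shape in the patch grid
print(naflex_grid(480, 640))  # (13, 18, 234) -- 234 patches, under the 256 budget
```

Because the grid dimensions are floored, the patch count never exceeds the budget, and wide or tall images simply get non-square grids instead of being distorted.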
## Requirements
```bash
pip install torch torchvision timm transformers huggingface-hub safetensors ftfy
```
## Quick Start
```python
import torch
from transformers import AutoModel
from PIL import Image
# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")
# Encode image and text
img_inputs = processor(images=Image.open("assets/photo.jpg"))
txt_inputs = processor(text=["a cat", "a dog"])
with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)
# Compute similarity with learned scale and bias
logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
probs = logits.softmax(dim=-1)
print(probs)
```
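The similarity step above reduces to a scaled, biased cosine similarity over unit-norm features. A self-contained sketch with dummy 3-dimensional features standing in for the 1152-dimensional `encode_image`/`encode_text` outputs; the scale and bias values are made up for illustration, not the learned parameters:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Dummy L2-normalized features (real features are 1152-dim)
img_feat = [1.0, 0.0, 0.0]
txt_feats = [[0.8, 0.6, 0.0],   # "a cat" -- closer to the image
             [0.0, 0.6, 0.8]]   # "a dog" -- orthogonal to the image

logit_scale, logit_bias = 4.0, -1.0   # illustrative values only

# logits = scale * (img @ txt.T) + bias, as in the Quick Start
logits = [logit_scale * sum(a * b for a, b in zip(img_feat, t)) + logit_bias
          for t in txt_feats]
probs = softmax(logits)
print(probs)   # the "cat" caption gets most of the probability mass
```

Because both feature sets are L2-normalized, the dot product is a cosine similarity in [-1, 1]; the learned scale sharpens it and the bias shifts the operating point before the softmax.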
## API Reference
| Method | Input | Output |
|--------|-------|--------|
| `model.encode_image(**inputs)` | Processor output (image) | `[B, 1152]` normalized image features |
| `model.encode_text(**inputs)` | Processor output (text) | `[B, 1152]` normalized text features |
| `model.logit_scale` | - | Learned temperature parameter |
| `model.logit_bias` | - | Learned bias parameter |
| `model.get_processor(repo_id)` | HuggingFace repo ID | Processor instance |
| `processor(images=img)` | PIL Image | Preprocessed image dict |
| `processor(text=["a cat"])` | list of strings | Tokenized text dict |
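Image-text retrieval with the API above amounts to ranking a gallery by dot products of the normalized features. A minimal sketch over precomputed dummy features (real features would come from `model.encode_image` and `model.encode_text`; the 2-dimensional vectors here are placeholders):

```python
def rank_gallery(txt_feat, img_feats):
    """Return gallery indices sorted by descending cosine similarity
    to the text query (features assumed L2-normalized)."""
    sims = [sum(a * b for a, b in zip(txt_feat, f)) for f in img_feats]
    return sorted(range(len(img_feats)), key=lambda i: -sims[i])

gallery = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # three dummy image features
query = [0.0, 1.0]                               # dummy text feature

print(rank_gallery(query, gallery))   # [1, 2, 0] -- best match first
```

For large galleries the image features can be encoded once and cached; only the text query needs a forward pass at search time.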
## License
This repository is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
Third-party notices in [NOTICE](NOTICE).
© 2026 KRAFTON