---
library_name: transformers
tags:
- vision
- image-text
- clip
- zero-shot
---

**Raon-VisionEncoder** is a 1.14B-parameter vision-language foundation model by [KRAFTON](https://www.krafton.com) for image and text feature extraction. It supports zero-shot image classification, image-text retrieval, and native aspect ratio inference via NaFlex. It is built on [OpenCLIP](https://github.com/mlfoundations/open_clip) with a LocCa (Localized CoCa) architecture and a ViT-SO400M vision encoder.

## Pretrained Models

| Model | Params (Inference) | Vision | Text | Patch Size | NaFlex Default Patches |
|-------|--------------------|--------|------|------------|------------------------|
| LocCa ViT-SO400M-16-SigLIP2 | 1.14B | 0.43B | 0.71B | 16x16 | 256 |

## Requirements

```bash
pip install torch torchvision timm transformers huggingface-hub safetensors ftfy
```

## Quick Start

```python
import torch
from transformers import AutoModel
from PIL import Image

# Load model + processor
model = AutoModel.from_pretrained("KRAFTON/Raon-VisionEncoder", trust_remote_code=True)
model = model.to(dtype=torch.bfloat16).eval()
processor = model.get_processor("KRAFTON/Raon-VisionEncoder")

# Encode image and text
img_inputs = processor(images=Image.open("assets/photo.jpg"))
txt_inputs = processor(text=["a cat", "a dog"])

with torch.no_grad():
    img_feat = model.encode_image(**img_inputs)
    txt_feat = model.encode_text(**txt_inputs)

# Compute similarity with learned scale and bias
logits = model.logit_scale.exp() * (img_feat @ txt_feat.T) + model.logit_bias
probs = logits.softmax(dim=-1)
print(probs)
```

## API Reference

| Method | Input | Output |
|--------|-------|--------|
| `model.encode_image(**inputs)` | Processor output (image) | `[B, 1152]` normalized image features |
| `model.encode_text(**inputs)` | Processor output (text) | `[B, 1152]` normalized text features |
| `model.logit_scale` | - | Learned temperature parameter |
| `model.logit_bias` | - | Learned bias parameter |
| `model.get_processor(repo_id)` | Hugging Face repo ID | Processor instance |
| `processor(images=img)` | PIL Image | Preprocessed image dict |
| `processor(text=["a cat"])` | List of strings | Tokenized text dict |

## License

This repository is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). Third-party notices are in [NOTICE](NOTICE).

© 2026 KRAFTON
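
## Appendix: Similarity Computation Sketch

The Quick Start scores an image against candidate captions as `exp(logit_scale) * cosine_similarity + logit_bias`, then applies a softmax over the captions. The dependency-free sketch below walks through that arithmetic on toy 3-dimensional vectors in place of the real `[B, 1152]` features. The feature values, `logit_scale`, and `logit_bias` are all made-up illustrative numbers, not values from the released checkpoint. It also shows that a scalar bias does not change softmax probabilities, and that a sigmoid gives independent per-pair scores; whether sigmoid scoring matches how this model was trained is an assumption based on the SigLIP2 naming, not something this card states.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length, as encode_image / encode_text outputs are described.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy stand-ins for one image feature and two text features (hypothetical values).
img_feat = l2_normalize([0.9, 0.1, 0.3])
txt_feats = [l2_normalize([1.0, 0.0, 0.2]), l2_normalize([0.0, 1.0, 0.1])]

# Hypothetical learned parameters; the scale is stored in log space, hence exp().
logit_scale = math.log(10.0)
logit_bias = -5.0

scale = math.exp(logit_scale)
cosines = [dot(img_feat, t) for t in txt_feats]
logits = [scale * c + logit_bias for c in cosines]

# Softmax over captions: relative probabilities for zero-shot classification.
probs = softmax(logits)

# Softmax is shift-invariant, so dropping the scalar bias gives the same probabilities.
probs_no_bias = softmax([scale * c for c in cosines])

# Sigmoid: an independent match score per image-text pair; here the bias does matter.
pair_scores = [sigmoid(z) for z in logits]

print(probs, pair_scores)
```

With softmax, the scores answer "which of these captions fits best", so they always sum to 1 even if no caption is a good match. Per-pair sigmoid scores do not have that constraint, which can be more useful for retrieval-style thresholding.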