raon-vision-encoder
Copyright 2024-2026 Raon Vision Team

This product includes software derived from the following projects:

===============================================================================
OpenCLIP
https://github.com/mlfoundations/open_clip
Licensed under the MIT License (see LICENSES/MIT-OpenCLIP.txt)

Copyright (c) 2012-2021 Gabriel Ilharco, Mitchell Wortsman,
Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar,
John Miller, Hongseok Namkoong, Hannaneh Hajishirzi, Ali Farhadi,
Ludwig Schmidt

Used in: model/ and train/ packages (LocCa, CLIP, loss, factory,
transformer, data pipeline, training loop, etc.)

===============================================================================
OpenAI CLIP
https://github.com/openai/CLIP
Licensed under the MIT License (see LICENSES/MIT-OpenAI-CLIP.txt)

Copyright (c) 2021 OpenAI

Used in: model/tokenizer.py, model/bpe_simple_vocab_16e6.txt.gz

===============================================================================
Meta Platforms, Inc. (MAE / MoCo v3)
Licensed under the MIT License via OpenCLIP

Copyright (c) Meta Platforms, Inc. and affiliates

Used in: model/pos_embed.py (sincos position embedding utilities)

===============================================================================
timm (pytorch-image-models)
https://github.com/huggingface/pytorch-image-models
Licensed under the Apache License 2.0

Copyright (c) Ross Wightman

Used in: model/transform.py (ResizeKeepRatio)

===============================================================================
References

The following papers informed the design and implementation of features
in this software. Code was independently implemented unless noted above.

- CoCa: Yu et al., "CoCa: Contrastive Captioners are Image-Text Foundation Models", 2022
- SigLIP: Zhai et al., "Sigmoid Loss for Language Image Pre-Training", 2023
- SigLIP2: Tschannen et al., "SigLIP 2: Multilingual Vision-Language Encoders", 2025
- DINO: Caron et al., "Emerging Properties in Self-Supervised Vision Transformers", 2021
- DINOv2: Oquab et al., "DINOv2: Learning Robust Visual Features without Supervision", 2024
- SILC: Naeem et al., "SILC: Improving Vision Language Pretraining with Self-Distillation", 2023
- TIPS: Huang et al., "TIPS: Text-Image Pretraining with Spatial Awareness", 2024
- KoLeo: Sablayrolles et al., "Spreading vectors for similarity search", ICLR 2019
- Gram Anchoring: Simeoni et al., "DINOv3", 2025 (independently implemented)
- NaFlex: from SigLIP2 / PaLI (independently implemented in PyTorch)