# CoME-VL: Scaling Complementary Multi-Encoder Vision-Language

## Overview
CoME-VL is a complementary multi-encoder vision-language framework that fuses contrastively trained and self-supervised visual representations to improve both visual understanding and grounding. Built on top of Molmo (Ai2), CoME-VL introduces three key architectural innovations:
- Entropy-guided layer selection to identify and select complementary layer ranges from SigLIP2 and DINOv3
- Orthogonality-regularized multi-layer mixing (OL) to reduce redundancy and promote complementary feature fusion
- RoPE-enhanced cross-attention (RGCA) to spatially align heterogeneous token grids across encoders
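To make the second point concrete, here is a minimal numpy sketch of one plausible formulation of an orthogonality regularizer: penalizing the squared cosine similarity between paired tokens from the two encoders. The function name and the exact penalty are illustrative assumptions, not the released CoME-VL implementation.

```python
import numpy as np

def orthogonality_penalty(siglip_tokens, dino_tokens, eps=1e-8):
    """Hypothetical orthogonality regularizer (sketch, not the paper's code):
    mean squared cosine similarity between paired tokens from the two
    encoders -- 0 when the streams are orthogonal, 1 when identical."""
    a = siglip_tokens / (np.linalg.norm(siglip_tokens, axis=-1, keepdims=True) + eps)
    b = dino_tokens / (np.linalg.norm(dino_tokens, axis=-1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=-1)        # per-token cosine similarity
    return float(np.mean(cos ** 2))     # scalar penalty in [0, 1]

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))
orthogonality_penalty(tokens, tokens)   # identical streams: penalty close to 1.0
```

Adding such a term to the training loss pushes the fused streams apart, which is one way to encourage the "complementary" rather than redundant features the framework targets.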
*Overview of CoME-VL: dual encoders (SigLIP2 + DINOv3) fused via orthogonality-regularized mixing and RoPE-based cross-attention, injected into a decoder-only LLM.*
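The RoPE-based cross-attention relies on the standard rotary-embedding property that rotated queries and keys compare by relative position. The sketch below shows 1-D rotary rotation in numpy; it is a generic RoPE illustration under our own naming, not the RGCA block from this repository, and how the two encoders' grids are mapped to shared coordinates is an assumption.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate feature pairs by position-dependent angles (1-D RoPE sketch).

    x: (num_tokens, dim) with even dim; positions: (num_tokens,).
    After rotation, the dot product of a query and a key depends only on
    their relative position, so token grids of different sizes can be
    compared once their positions are placed on a shared coordinate scale.
    """
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = positions[:, None] * freqs[None, :]   # (n, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

For heterogeneous grids (e.g. SigLIP2 vs. DINOv3 patch layouts), positions from each encoder could be normalized to a common range before rotation, so that cross-attention scores reflect true spatial offsets between the two token grids.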
## Installation

Python 3.10 is recommended. First install PyTorch for your platform, then:

```bash
git clone https://github.com/ankan8145/COME-VL.git
cd COME-VL
pip install -e .[all]
```
### Environment Setup

```bash
export MOLMO_DATA_DIR=/path/to/data
export HF_HOME=/path/to/huggingface/cache
```
## Training / Fine-tuning

Fine-tune starting from a pretrained checkpoint:

```bash
HF_HUB_OFFLINE=1 \
TRANSFORMERS_OFFLINE=1 \
WANDB_MODE=offline \
WANDB_API_KEY="<your_wandb_key>" \
WANDB_PROJECT="come-vl" \
WANDB_ENTITY="<your_entity>" \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  launch_scripts/train_multitask_model.py \
  3.2-synthetic \
  checkpoint_folder \
  --save_folder=output_folder \
  --save_overwrite
```
Notes:
- `checkpoint_folder` should point to your starting model checkpoint directory.
- `--save_folder` should use a short, descriptive name; avoid long paths with special characters.
- `3.2-synthetic` specifies the training data mixture.
- `--save_overwrite` allows overwriting an existing save folder.
## Evaluation

```bash
torchrun --nproc-per-node 1 --master_port 29504 \
  launch_scripts/eval_downstream.py \
  checkpoint_folder \
  "test-low-res" \
  --save_to_checkpoint_dir
```
Notes:
- `test-low-res` evaluates at standard resolution on the test split.
- Use `test-high-res` for high-resolution evaluation (add the `--fsdp --high_res` flags).
- Results and predictions are saved into the checkpoint directory.
- Add `--overwrite` to re-run and replace cached metrics.
## Model Architecture

CoME-VL uses:
- Language backbone: Qwen2-7B
- Contrastive encoder: SigLIP2-SO400M → semantic alignment
- Self-supervised encoder: DINOv3-Large → spatial grounding
- Selected layers: SigLIP2 layers 0–27 (all) + DINOv3 layers 10–23 (entropy-guided)
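Entropy-guided selection can be illustrated with a small numpy sketch: score each candidate layer by the entropy of its activation distribution and keep the most informative ones. The histogram-based proxy and the function names below are hypothetical stand-ins, not the paper's exact criterion.

```python
import numpy as np

def layer_entropy(features, n_bins=32):
    """Shannon entropy of a layer's activation histogram (illustrative proxy)."""
    hist, _ = np.histogram(features, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def select_layers(per_layer_features, keep_ratio=0.5):
    """Keep the highest-entropy fraction of layers, returned in index order."""
    ents = [layer_entropy(f) for f in per_layer_features]
    k = max(1, round(len(ents) * keep_ratio))
    top = np.argsort(ents)[::-1][:k]          # indices of the top-k entropies
    return sorted(int(i) for i in top)

# A near-constant layer (entropy ~0) is dropped; varied layers are kept.
rng = np.random.default_rng(0)
layers = [np.zeros((16, 8)), rng.standard_normal((16, 8)), rng.standard_normal((16, 8))]
select_layers(layers, keep_ratio=2 / 3)
```

Under this kind of criterion, a contiguous high-entropy band (such as the DINOv3 10–23 range listed above) would emerge when mid-to-late layers carry the most varied activations.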
## Data

Most data is managed via HuggingFace Datasets. Training uses the PixMo dataset and RefCOCO.

Download all datasets:

```bash
python3 scripts/download.py all --n_proc 12
```

Download a specific dataset:

```bash
python3 scripts/download_data.py pixmo_count_counting --n_proc 12
```
## Pretrained Model Initialization

Convert HuggingFace weights before training from scratch:

```bash
python3 scripts/convert_hf_to_molmo.py qwen2_7b
python3 scripts/convert_hf_to_molmo.py openai
```
## Citation

If you find CoME-VL useful in your research, please consider citing:

```bibtex
@article{comevl2026,
  title={CoME-VL: Scaling Complementary Multi-Encoder Vision-Language},
  author={Deria, Ankan and Kumar, Komal and He, Xilin and Razzak, Imran and Cholakkal, Hisham and Khan, Fahad Shahbaz and Khan, Salman},
  journal={arXiv preprint},
  year={2026}
}
```
## Acknowledgements
This codebase is built on top of Molmo by the Allen Institute for AI (Ai2). We thank the Ai2 team for open-sourcing their work.