---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
datasets:
- HuggingFaceM4/FineVision
- mvp-lab/LLaVA-OneVision-1.5-Instruct-Data
language:
- en
license: cc-by-nc-sa-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# CASA-Qwen2_5-VL-3B
This repository contains the model weights for **CASA-Qwen2_5-VL-3B**, introduced in the paper [CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion](https://huggingface.co/papers/2512.19535).
CASA is a vision-language fusion paradigm that improves on cross-attention while preserving its scalability. This model is a [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) model whose fusion mechanism has been adapted from token insertion to a cross-attention-based architecture built on CASA layers.
- **Paper:** [CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion](https://arxiv.org/abs/2512.19535)
- **Project Page:** [kyutai.org/casa](https://kyutai.org/casa)
- **Code:** [github.com/kyutai-labs/casa](https://github.com/kyutai-labs/casa)
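The name hints at the core idea: rather than inserting image tokens into the text sequence, text tokens attend to visual features through cross-attention expressed with the self-attention machinery. The snippet below is a purely illustrative sketch of that idea, not the actual CASA implementation (see the official repository for the real layers); the dimensions, projection names, and the omission of multi-head splitting, RoPE, and masking are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: cross-attention over image features computed with a
# decoder layer's existing self-attention projections. All sizes are made up.
hidden_size, num_text_tokens, num_image_tokens = 2048, 16, 256
q_proj = torch.nn.Linear(hidden_size, hidden_size)  # the layer's query projection
k_proj = torch.nn.Linear(hidden_size, hidden_size)  # the layer's key projection
v_proj = torch.nn.Linear(hidden_size, hidden_size)  # the layer's value projection

text_hidden = torch.randn(1, num_text_tokens, hidden_size)    # decoder hidden states
image_hidden = torch.randn(1, num_image_tokens, hidden_size)  # projected vision features

# Queries come from the text stream, keys/values from the image features:
# cross-attention, but reusing the projections the layer already has for
# self-attention, and without growing the text sequence with image tokens.
q = q_proj(text_hidden)
k = k_proj(image_hidden)
v = v_proj(image_hidden)
fused = F.scaled_dot_product_attention(q, k, v)  # (1, num_text_tokens, hidden_size)
```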
## Sample Usage
This model requires `trust_remote_code=True` to load its custom architecture. The snippet below shows how to run inference with `transformers`.
```python
import torch
from transformers import AutoModel, AutoProcessor

model_id = "kyutai/CASA-Qwen2_5-VL-3B"

# Load the model and processor; trust_remote_code is required for the custom
# CASA architecture and processor shipped with the repository.
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).cuda()
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# A single-turn conversation with one image and one text prompt.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.png",
            },
            {
                "type": "text",
                "text": "Describe this image.",
            },
        ],
    },
]

# Tokenize the conversation and move the tensors to the model's device.
inputs = processor.tokenize_messages(messages=conversation)
inputs = inputs.to(model.device)
input_len = inputs["input_ids"].shape[1]

# Generate, then keep only the newly generated tokens.
output_ids = model.generate_from_image(
    **inputs,
    max_new_tokens=512,
    pre_image_tokens=processor.pre_image_tokens,
    post_image_tokens=processor.post_image_tokens,
    eos_token_id=model.generation_config.eos_token_id,
)[0, input_len:]

response = processor.tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```
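The snippet above uses `attn_implementation="flash_attention_2"`, which requires the `flash-attn` package and a supported GPU. If flash attention is not available, loading with PyTorch's built-in SDPA kernel should work as a drop-in replacement, assuming the custom CASA code accepts the standard `attn_implementation` values like regular `transformers` models do:

```python
# Fallback without flash-attn (assumption: the custom modeling code accepts "sdpa"
# like standard transformers models).
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    trust_remote_code=True,
).cuda()
```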
## Citation
```bibtex
@article{kyutai2025casa,
  author  = {Moritz B\"ohle and Am\'elie Royer and Juliette Marrie and Edouard Grave and Patrick P\'erez},
  title   = {CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion},
  journal = {arXiv preprint arXiv:2512.19535},
  year    = {2025},
  url     = {https://arxiv.org/abs/2512.19535}
}
```
## License
The code in the official repository is provided under the **MIT license**. The weights for this model are released under the **CC-BY-NC-SA 4.0 license**. Additionally, as this model includes weights from Qwen2.5-VL-3B-Instruct, it is subject to the [Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE).