---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
- multimodal
- vision-language
- spatial-reasoning
- latent-reasoning
pipeline_tag: image-text-to-text
---

# Mull-Tokens: Modality-Agnostic Latent Thinking

This is the model for the paper **"Mull-Tokens: Modality-Agnostic Latent Thinking"**.

[[Paper]](https://arxiv.org/abs/2512.10941) | [[Project Page]](https://arijitray1993.github.io/mulltokens/) | [[Code]](https://github.com/arijitray1993/mull)

## Overview

Mull-Tokens are latent tokens that are pre-trained to hold intermediate information in either the image or text modality, letting the model think its way toward the correct answer. Across four challenging spatial-reasoning benchmarks, Mull-Tokens achieve a **+3% average improvement**, and up to **+16%** on reasoning-heavy splits, over the strongest baseline.

## Available Models

| Model | Description |
|---|---|
| [array/Qwen2.5-VL-Mull](https://huggingface.co/array/Qwen2.5-VL-Mull) | Mull-Tokens with multimodal warm-up |
| [array/Qwen2.5-VL-MullGRPO](https://huggingface.co/array/Qwen2.5-VL-MullGRPO) | Mull-Tokens + GRPO reinforcement learning |

## Quick Start

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Choose model: "array/Qwen2.5-VL-Mull" or "array/Qwen2.5-VL-MullGRPO"
MODEL_ID = "array/Qwen2.5-VL-Mull"
NUM_LATENTS = 20  # number of latent thinking tokens inserted before the answer

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Prepare your question
image_path = "path/to/your/image.jpg"
question = "If you stand at the X marked point and turn left, will the table be to your left or right? Please choose between the following answer choices: A. left. B. right. "
question_type = "multiple choice"

QUESTION_TEMPLATE_LATENT = (
    "{Question}\n"
    "Please think about this question deeply. "
    "It's encouraged to include self-reflection or verification in the reasoning process. "
    "Provide your final answer between the tags."
)

TYPE_TEMPLATE = {
    "multiple choice": " Please provide only the single option letter (e.g., A, B, C, D, etc.) within the tags.",
    "numerical": " Please provide the numerical value (e.g., 42 or 3.14) within the tags.",
    "OCR": " Please transcribe text from the image/video clearly and provide your text answer within the tags.",
    "free-form": " Please provide your text answer within the tags.",
    "regression": " Please provide the numerical value (e.g., 42 or 3.14) within the tags.",
}

prompt = QUESTION_TEMPLATE_LATENT.format(Question=question) + TYPE_TEMPLATE[question_type]

# Build messages with latent thinking tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ],
    },
    # IMPORTANT: Mull-Tokens requires latent thinking tokens before answer generation
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "" + "<|latent_pad|>" * NUM_LATENTS + "\n",
            }
        ],
    },
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
text = text.replace("<|im_end|>\n", "")  # Remove end token so the model continues generating
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate response
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
    )

# Decode output (skip input tokens)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
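The prompt templates above instruct the model to place its final answer "between the tags", so the decoded `response` typically needs one small post-processing step. Below is a minimal sketch of that step; the `<answer>...</answer>` tag name is an assumption made for illustration (check which tags your checkpoint actually emits), and the helper falls back to the raw response when no tags are found.

```python
import re

def extract_final_answer(response: str, tag: str = "answer") -> str:
    """Return the text between <tag>...</tag>, or the raw response if no tags are found."""
    # NOTE: the tag name "answer" is an assumption for illustration; replace it with
    # whatever tag your checkpoint actually produces in its outputs.
    match = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else response.strip()

print(extract_final_answer(response))  # e.g. "A" for the multiple-choice example above
```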
## Citation

```bibtex
@misc{ray2025mulltokensmodalityagnosticlatentthinking,
      title={Mull-Tokens: Modality-Agnostic Latent Thinking},
      author={Arijit Ray and Ahmed Abdelkader and Chengzhi Mao and Bryan A. Plummer and Kate Saenko and Ranjay Krishna and Leonidas Guibas and Wen-Sheng Chu},
      year={2025},
      eprint={2512.10941},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10941},
}
```