Abstract
Vision-language models exhibit selective blindness: linguistic framing alters their visual attention patterns and degrades performance, a failure that a lightweight learnable prompt-tuning method can mitigate.
Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind: they modulate how much attention they apply to visual inputs based on linguistic framing, even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and the distribution of attention over the image. Constrained framings, such as multiple-choice and yes/no, induce substantially lower attention to image context than open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method that uses learnable tokens to encourage the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
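The attention probe the abstract describes can be illustrated with a short sketch. The following is a minimal illustration, not the authors' code: it loads an off-the-shelf LLaVA checkpoint through Hugging Face transformers and measures the fraction of last-layer attention mass that the final text position places on image tokens, comparing an open-ended and a yes/no framing of the same question. The checkpoint, the prompt templates, the image path, and the head-averaged, single-layer aggregation are all assumptions for illustration; the paper may aggregate attention differently.

```python
# Minimal sketch of the attention probe (illustrative; not the paper's code).
# Assumes a recent transformers version in which the processor expands the
# <image> placeholder, so positions in input_ids align with attention positions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # eager attention so weights are materialized
)

def image_attention_mass(image, prompt):
    """Fraction of last-layer attention from the final text position
    that lands on image tokens, averaged over heads."""
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)  # casts only float tensors
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    # Last layer, batch 0, head-averaged, attention row of the last query.
    attn = out.attentions[-1][0].mean(dim=0)[-1]  # (seq,)
    image_positions = inputs["input_ids"][0] == model.config.image_token_index
    return attn[image_positions].sum().item()  # in [0, 1]: rows sum to 1

image = Image.open("car.jpg")  # hypothetical example image
open_ended = "USER: <image>\nWhat color is the car? ASSISTANT:"
yes_no = "USER: <image>\nIs the car red? Answer yes or no. ASSISTANT:"
print("open-ended:", image_attention_mass(image, open_ended))
print("yes/no:    ", image_attention_mass(image, yes_no))
```

Under the paper's finding, the yes/no framing should report noticeably less attention mass on the image than the open-ended framing. The prompt-tuning remedy can be sketched similarly: prepend a small number of learnable embeddings to the text embeddings and train only those parameters, for instance against an objective that encourages open-ended-style attention. Names, shapes, and the training objective below are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable tokens prepended to text embeddings; only these are trained,
    while the VLM backbone stays frozen (a hypothetical sketch of the idea)."""
    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

    def forward(self, text_embeds):  # text_embeds: (batch, seq, d_model)
        soft = self.tokens.unsqueeze(0).expand(text_embeds.size(0), -1, -1)
        return torch.cat([soft, text_embeds], dim=1)
```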
Community
Turns out VLMs aren’t inherently bad at vision; they’re selectively blind.
Ask the wrong way, and they won't look at the image!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models (2026)
- InViC: Intent-aware Visual Cues for Medical Visual Question Answering (2026)
- When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition (2026)
- ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering (2026)
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs (2026)
- Selective Training for Large Vision Language Models via Visual Information Gain (2026)
- ClueTracer: Question-to-Vision Clue Tracing for Training-Free Hallucination Suppression in Multimodal Reasoning (2026)