arxiv:2603.16932

Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Published on Mar 14


Abstract

AwaRes is a spatial-on-demand framework for vision-language models that dynamically retrieves high-resolution image segments based on query needs, using tool-calling and multi-turn reinforcement learning with composite rewards.

AI-generated summary

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs are efficient but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
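To make the composite reward concrete, here is a minimal sketch of how answer correctness and a crop-cost penalty might be combined and then turned into group-relative advantages, as GRPO does within a group of sampled rollouts. The abstract does not give the functional form, so the linear penalty, the `crop_cost` weight, and the `Rollout` container are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled multi-turn trajectory (hypothetical container)."""
    answer_score: float  # semantic correctness in [0, 1], e.g. from a judge (assumed)
    num_crops: int       # high-resolution crops requested through tool calls


def composite_reward(rollout: Rollout, crop_cost: float = 0.1) -> float:
    """Correctness minus a per-crop penalty (assumed linear form)."""
    return rollout.answer_score - crop_cost * rollout.num_crops


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group: (r - mean) / std, the GRPO-style advantage."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std < eps:
        # All rollouts scored the same, so none is preferred over another.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    group = [
        Rollout(answer_score=1.0, num_crops=0),  # correct without cropping
        Rollout(answer_score=1.0, num_crops=2),  # correct, but paid for two crops
        Rollout(answer_score=0.0, num_crops=0),  # wrong, no crops
    ]
    rewards = [composite_reward(r) for r in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumptions, a rollout that answers correctly from the low-resolution view alone scores higher than one that answers correctly after requesting crops, which is the behavior the crop-cost penalty is meant to encourage.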


Community

NimrodShabtay1986

Paper submitter 35 minutes ago

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs promote efficiency but potentially miss critical visual information like small text.

We present AwaRes, a spatial-on-demand framework that resolves this accuracy–efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward combining semantic answer correctness with explicit crop-cost penalties.

AwaRes provides a practical, deployment-friendly path to high-detail VLM reasoning under tight compute budgets.
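The automatic data construction described above (label whether cropping is needed, then map the localized evidence to a discrete crop set) could be sketched roughly as follows. The page does not specify the labeling rule or the crop set, so this sketch assumes that cropping is "needed" when the high-resolution answer is judged correct and the low-resolution answer is not, and that the crop set is a 2x2 grid of cells over normalized image coordinates. Both choices and all function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def needs_crop(low_res_correct: bool, high_res_correct: bool) -> bool:
    """Assumed rule: crop only when high resolution fixes a low-resolution failure."""
    return high_res_correct and not low_res_correct


def box_to_crops(box: Box, grid: int = 2) -> List[Tuple[int, int]]:
    """Return the (row, col) grid cells that overlap the evidence box."""
    x0, y0, x1, y1 = box
    crops = []
    for row in range(grid):
        for col in range(grid):
            cx0, cy0 = col / grid, row / grid
            cx1, cy1 = (col + 1) / grid, (row + 1) / grid
            # Strict inequalities: touching only at an edge does not count as overlap.
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:
                crops.append((row, col))
    return crops


def build_target(low_ok: bool, high_ok: bool, evidence: Box) -> List[Tuple[int, int]]:
    """Crops the model should request for this sample; empty means answer directly."""
    return box_to_crops(evidence) if needs_crop(low_ok, high_ok) else []


if __name__ == "__main__":
    # Small text near the top edge, straddling the vertical midline.
    print(build_target(False, True, (0.4, 0.1, 0.6, 0.3)))
    # The low-resolution answer was already correct, so no crop is requested.
    print(build_target(True, True, (0.4, 0.1, 0.6, 0.3)))
```

A grid is only one possible discretization; any fixed set of candidate regions would fit the same pattern, with the overlap test selecting the subset to request.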
