RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details
Abstract
A multimodal diffusion-based model called RefineAnything is presented for region-specific image refinement that preserves backgrounds while enhancing local details, using a focus-and-refine strategy and boundary-aware loss functions.
We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency. On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
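The Focus-and-Refine strategy described above (crop a padded box around the target region, refine the crop at full input resolution, then paste back under a mask so non-edited pixels stay untouched) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `refine_fn` is a hypothetical stand-in for the diffusion refinement step, and the hard-mask blend shown here is the simplest way to guarantee strict background preservation.

```python
import numpy as np

def focus_and_refine(image, mask, refine_fn, margin=16):
    """Sketch of a Focus-and-Refine pipeline (illustrative, not the paper's code):
    crop a padded box around the user mask, refine the crop, then paste it back
    under the mask so every pixel outside the mask is bit-exact unchanged."""
    ys, xs = np.nonzero(mask)
    y0 = max(ys.min() - margin, 0)
    y1 = min(ys.max() + margin + 1, image.shape[0])
    x0 = max(xs.min() - margin, 0)
    x1 = min(xs.max() + margin + 1, image.shape[1])

    crop = image[y0:y1, x0:x1]
    # Hypothetical refinement step; in the paper this would be diffusion-based
    # refinement with the resolution budget reallocated to the cropped region.
    refined_crop = refine_fn(crop)

    # Hard paste-back mask: 1 inside the user mask, 0 elsewhere. A feathered
    # (blended) weight would soften seams; keeping it hard guarantees that
    # background pixels are strictly preserved.
    w = mask[y0:y1, x0:x1].astype(np.float32)[..., None]
    out = image.astype(np.float32).copy()
    out[y0:y1, x0:x1] = w * refined_crop.astype(np.float32) + (1 - w) * crop
    return out.astype(image.dtype)
```

In practice the paste-back weight near the mask boundary is where seam artifacts arise, which is what the boundary-aware loss targets.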
Community
The crop-and-resize trick under a fixed VAE resolution is a surprisingly clean way to reallocate the budget to the edited region and boost micro-detail fidelity. It's counterintuitive because no new information is added, yet zooming into the target lets the denoiser allocate capacity where it matters most. I'd be curious about boundary sensitivity: how small a margin can they tolerate before seams creep in, and does the boundary-consistency loss fully quell that without hurting elsewhere? The arxivlens breakdown helped me parse these choices, and it aligns with what they describe there (https://arxivlens.com/PaperView/Details/refineanything-multimodal-region-specific-refinement-for-perfect-local-details-3647-406bb3a5)
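The exact form of the Boundary Consistency Loss is not given on this page, but one plausible sketch (an assumption, not the paper's formulation) is an L1 penalty restricted to a thin band around the mask boundary, discouraging discrepancies between the pasted-back result and its surroundings:

```python
import numpy as np

def boundary_consistency_loss(refined, original, mask, band=3):
    """Hypothetical boundary-consistency penalty (illustrative only; the
    paper's exact loss is not specified here): mean L1 difference between
    the refined and original images, restricted to a thin band around the
    mask boundary where seam artifacts would appear."""
    m = mask.astype(bool)
    dil, ero = m.copy(), m.copy()
    # Grow/shrink the mask by `band` pixels with simple 4-neighbor
    # dilation/erosion to obtain a boundary band.
    for _ in range(band):
        d = np.zeros_like(dil)
        d[1:] |= dil[:-1]; d[:-1] |= dil[1:]
        d[:, 1:] |= dil[:, :-1]; d[:, :-1] |= dil[:, 1:]
        dil = dil | d
        e = np.ones_like(ero)
        e[1:] &= ero[:-1]; e[:-1] &= ero[1:]
        e[:, 1:] &= ero[:, :-1]; e[:, :-1] &= ero[:, 1:]
        ero = ero & e
    band_mask = dil & ~ero
    if not band_mask.any():
        return 0.0
    diff = np.abs(refined.astype(np.float32) - original.astype(np.float32))
    return float(diff[band_mask].mean())
```

Under this reading, the margin question above becomes a trade-off: a wider crop margin gives the loss more boundary context to match, at the cost of resolution spent outside the target region.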