arxiv:2510.14605

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

Published on Oct 16, 2025
· Submitted by
hongyuyang
on Oct 21, 2025

Abstract

Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) has achieved significant advances on this task by incorporating knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, comprising Processing, Retrieval, and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and condensation of the retrieved results. To this end, we introduce a visual language model trained with answer accuracy and format consistency as reward signals in a reinforcement learning manner. This enhances the model's reasoning, its tool invocation for accurate queries, and its filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF

AI-generated summary

A novel three-stage method, Wiki-PRF, enhances knowledge-based visual question answering by improving multimodal query quality and relevance through visual language models and reinforcement learning.
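To make the Processing-Retrieval-Filtering flow concrete, below is a minimal, self-contained sketch of a pipeline with this shape. Everything in it (function names, the keyword-overlap retriever and filter, the reward weighting) is an illustrative assumption, not the authors' implementation: in Wiki-PRF the retrieval is multimodal and the filtering/answering model is a VLM trained with reinforcement learning rather than the toy heuristics used here.

```python
# Illustrative sketch of a Processing -> Retrieval -> Filtering pipeline,
# plus the accuracy + format-consistency reward described in the abstract.
# All names and toy implementations are assumptions, not the Wiki-PRF API.

from dataclasses import dataclass, field


@dataclass
class MultimodalQuery:
    question: str                                      # natural-language question
    visual_cues: dict = field(default_factory=dict)    # tool outputs (captions, OCR, ...)


def processing_stage(image, question, visual_tools):
    """Processing: invoke visual tools (captioning, OCR, grounding, ...) to turn
    the raw image + question into a precise multimodal query."""
    cues = {name: tool(image) for name, tool in visual_tools.items()}
    return MultimodalQuery(question=question, visual_cues=cues)


def retrieval_stage(query: MultimodalQuery, knowledge_base, top_k=5):
    """Retrieval: match the query against the knowledge base. A toy lexical
    overlap score stands in for the real joint visual-text retriever."""
    query_text = query.question + " " + " ".join(map(str, query.visual_cues.values()))
    query_terms = set(query_text.lower().split())
    scored = [(len(query_terms & set(doc.lower().split())), doc) for doc in knowledge_base]
    return [doc for score, doc in sorted(scored, reverse=True)[:top_k] if score > 0]


def filtering_stage(query: MultimodalQuery, passages, min_overlap=2):
    """Filtering: keep only passages relevant to the question. In the paper this
    is done by the RL-trained VLM, not by this keyword heuristic."""
    query_terms = set(query.question.lower().split())
    return [p for p in passages if len(query_terms & set(p.lower().split())) >= min_overlap]


def reward(prediction: str, gold: str, well_formatted: bool) -> float:
    """Reward sketch: answer accuracy plus a format-consistency bonus
    (the 0.1 weighting is an assumption)."""
    accuracy = 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0
    return accuracy + (0.1 if well_formatted else 0.0)


if __name__ == "__main__":
    tools = {"caption": lambda img: "a red-brick clock tower", "ocr": lambda img: ""}
    kb = ["The clock tower in the photo is part of the old town hall, built in 1893.",
          "Red pandas are native to the eastern Himalayas."]
    q = processing_stage(image=None, question="When was this clock tower built?", visual_tools=tools)
    docs = filtering_stage(q, retrieval_stage(q, kb))
    print(docs)                                         # the town-hall passage survives
    print(reward("1893", "1893", well_formatted=True))  # -> 1.1
```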

Community

Paper author · Paper submitter • edited 13 days ago

Wiki-PRF: A Three-Stage Framework for Knowledge-Based Visual Question Answering (NeurIPS 2025)

📄 Paper (arXiv): https://arxiv.org/abs/2510.14605
🤗 Model: https://huggingface.co/hongyuyang23casia/Wiki-PRF-7B-Infoseek
💻 Code: https://github.com/cqu-student/Wiki-PRF

Wiki-PRF introduces a Processing-Retrieval-Filtering pipeline to address noise and query misalignment in KB-VQA:

🔍 Processing: extracts precise multimodal cues via visual tools.

📚 Retrieval: performs joint visual-text knowledge fetching.

🧹 Filtering: uses RL-based rewards to strip irrelevant results and boost accuracy.

Achieves SOTA on InfoSeek (42.8%) and E-VQA (39.2%). Check out our open-source code and weights!

