EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper
• 2402.04252
• Published
• 29
Vision Superalignment: Weak-to-Strong Generalization for Vision
Foundation Models
Paper
• 2402.03749
• Published
• 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper
• 2402.04615
• Published
• 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance
Loss
Paper
• 2402.05008
• Published
• 23
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Paper
• 2402.05930
• Published
• 39
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large
Language Models
Paper
• 2402.05935
• Published
• 17
ViGoR: Improving Visual Grounding of Large Vision Language Models with
Fine-Grained Reward Modeling
Paper
• 2402.06118
• Published
• 15
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Paper
• 2402.07456
• Published
• 46
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Paper
• 2402.07872
• Published
• 16
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned
Language Models
Paper
• 2402.07865
• Published
• 15
World Model on Million-Length Video And Language With RingAttention
Paper
• 2402.08268
• Published
• 40
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong
Vision-language Adapter
Paper
• 2402.10896
• Published
• 16
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language
Models
Paper
• 2402.10986
• Published
• 81
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Paper
• 2402.12226
• Published
• 45
CoLLaVO: Crayon Large Language and Vision mOdel
Paper
• 2402.11248
• Published
• 22
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
Paper
• 2402.11690
• Published
• 9
VideoPrism: A Foundational Visual Encoder for Video Understanding
Paper
• 2402.13217
• Published
• 38
Video ReCap: Recursive Captioning of Hour-Long Videos
Paper
• 2402.13250
• Published
• 26
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
• 2402.13232
• Published
• 16
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on
Deceptive Prompts
Paper
• 2402.13220
• Published
• 14
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large
Vision-Language Models
Paper
• 2402.13577
• Published
• 9
PALO: A Polyglot Large Multimodal Model for 5B People
Paper
• 2402.14818
• Published
• 23
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper
• 2402.14289
• Published
• 20
Sora: A Review on Background, Technology, Limitations, and Opportunities
of Large Vision Models
Paper
• 2402.17177
• Published
• 88
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper
• 2402.19479
• Published
• 35
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
Paper
• 2403.01422
• Published
• 30
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Paper
• 2403.01487
• Published
• 16
Finetuned Multimodal Language Models Are High-Quality Image-Text Data
Filters
Paper
• 2403.02677
• Published
• 18
Modeling Collaborator: Enabling Subjective Vision Classification With
Minimal Human Effort via LLM Tool-Use
Paper
• 2403.02626
• Published
• 11
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
Paper
• 2403.03194
• Published
• 15
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
Paper
• 2403.03003
• Published
• 11
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
• 2403.09611
• Published
• 129
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
• 2403.07508
• Published
• 77
Synth^2: Boosting Visual-Language Models with Synthetic Captions and
Image Embeddings
Paper
• 2403.07750
• Published
• 23
DragAnything: Motion Control for Anything using Entity Representation
Paper
• 2403.07420
• Published
• 14
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference
Acceleration for Large Vision-Language Models
Paper
• 2403.06764
• Published
• 27
VideoMamba: State Space Model for Efficient Video Understanding
Paper
• 2403.06977
• Published
• 29
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper
• 2403.05135
• Published
• 45
Gemini 1.5: Unlocking multimodal understanding across millions of tokens
of context
Paper
• 2403.05530
• Published
• 65
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
• 2403.05525
• Published
• 49
VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models
Paper
• 2403.05438
• Published
• 20
Uni-SMART: Universal Science Multimodal Analysis and Research
Transformer
Paper
• 2403.10301
• Published
• 54
VideoAgent: Long-form Video Understanding with Large Language Model as
Agent
Paper
• 2403.10517
• Published
• 37
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
• 2403.11703
• Published
• 17
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper
• 2403.11481
• Published
• 13
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document
Understanding
Paper
• 2403.12895
• Published
• 32
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper
• 2403.12596
• Published
• 11
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual
Math Problems?
Paper
• 2403.14624
• Published
• 53
Can large language models explore in-context?
Paper
• 2403.15371
• Published
• 33
InternVideo2: Scaling Video Foundation Models for Multimodal Video
Understanding
Paper
• 2403.15377
• Published
• 28
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate
Time series
Paper
• 2403.15360
• Published
• 13
VidLA: Video-Language Alignment at Scale
Paper
• 2403.14870
• Published
• 15
ViTAR: Vision Transformer with Any Resolution
Paper
• 2403.18361
• Published
• 55
Mini-Gemini: Mining the Potential of Multi-modality Vision Language
Models
Paper
• 2403.18814
• Published
• 48
sDPO: Don't Use Your Data All at Once
Paper
• 2403.19270
• Published
• 41
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper
• 2403.18978
• Published
• 15
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision
Language Models
Paper
• 2403.20331
• Published
• 16
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
• 2404.01197
• Published
• 31
Direct Preference Optimization of Video Large Multimodal Models from
Language Model Reward
Paper
• 2404.01258
• Published
• 12
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with
Interleaved Visual-Textual Tokens
Paper
• 2404.03413
• Published
• 27
LVLM-Intrepret: An Interpretability Tool for Large Vision-Language
Models
Paper
• 2404.03118
• Published
• 25
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept
Matching
Paper
• 2404.03653
• Published
• 35
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper
• 2404.05719
• Published
• 83
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video
Understanding
Paper
• 2404.05726
• Published
• 23
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Paper
• 2404.05674
• Published
• 15
Koala: Key frame-conditioned long video-LLM
Paper
• 2404.04346
• Published
• 7
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model
Handling Resolutions from 336 Pixels to 4K HD
Paper
• 2404.06512
• Published
• 30
Adapting LLaMA Decoder to Vision Transformer
Paper
• 2404.06773
• Published
• 18
BRAVE: Broadening the visual encoding of vision-language models
Paper
• 2404.07204
• Published
• 19
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Paper
• 2404.07448
• Published
• 12
Ferret-v2: An Improved Baseline for Referring and Grounding with Large
Language Models
Paper
• 2404.07973
• Published
• 32
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Paper
• 2404.09990
• Published
• 14
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal
Large Language Models
Paper
• 2404.09204
• Published
• 11
On Speculative Decoding for Multimodal Large Language Models
Paper
• 2404.08856
• Published
• 13
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models
Paper
• 2404.12387
• Published
• 39
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
• 2404.12390
• Published
• 26
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper
• 2404.14239
• Published
• 9
A Multimodal Automated Interpretability Agent
Paper
• 2404.14394
• Published
• 23
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper
• 2404.12803
• Published
• 30
Groma: Localized Visual Tokenization for Grounding Multimodal Large
Language Models
Paper
• 2404.13013
• Published
• 31
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster
Pre-training on Web-scale Image-Text Data
Paper
• 2404.15653
• Published
• 29
Editable Image Elements for Controllable Synthesis
Paper
• 2404.16029
• Published
• 12
MoDE: CLIP Data Experts via Clustering
Paper
• 2404.16030
• Published
• 15
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with
Text-Rich Visual Comprehension
Paper
• 2404.16790
• Published
• 10
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
Models with Open-Source Suites
Paper
• 2404.16821
• Published
• 59
List Items One by One: A New Data Source and Learning Paradigm for
Multimodal LLMs
Paper
• 2404.16375
• Published
• 18
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
• 2404.16994
• Published
• 37
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring
Unconstrained Photo Collections
Paper
• 2404.16845
• Published
• 7
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Paper
• 2404.17672
• Published
• 19
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual
and Action Representations
Paper
• 2404.17521
• Published
• 13
Automatic Creative Selection with Cross-Modal Matching
Paper
• 2405.00029
• Published
• 9
What matters when building vision-language models?
Paper
• 2405.02246
• Published
• 103
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large
Language Models in Code Generation from Scientific Plots
Paper
• 2405.07990
• Published
• 20
No Time to Waste: Squeeze Time into Channel for Mobile Video
Understanding
Paper
• 2405.08344
• Published
• 15
Understanding the performance gap between online and offline alignment
algorithms
Paper
• 2405.08448
• Published
• 18
SpeechVerse: A Large-scale Generalizable Audio Language Model
Paper
• 2405.08295
• Published
• 19
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large
Language Models
Paper
• 2405.08317
• Published
• 12
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Paper
• 2405.09215
• Published
• 22
LoRA Learns Less and Forgets Less
Paper
• 2405.09673
• Published
• 91
Many-Shot In-Context Learning in Multimodal Foundation Models
Paper
• 2405.09798
• Published
• 32
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper
• 2405.09818
• Published
• 132
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper
• 2405.10300
• Published
• 30
Toon3D: Seeing Cartoons from a New Perspective
Paper
• 2405.10320
• Published
• 22
Octo: An Open-Source Generalist Robot Policy
Paper
• 2405.12213
• Published
• 29
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Paper
• 2405.12107
• Published
• 29
Your Transformer is Secretly Linear
Paper
• 2405.12250
• Published
• 157
Diffusion for World Modeling: Visual Details Matter in Atari
Paper
• 2405.12399
• Published
• 30
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment
Capability
Paper
• 2405.14129
• Published
• 14
CamViG: Camera Aware Image-to-Video Generation with Multimodal
Transformers
Paper
• 2405.13195
• Published
• 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
Models
Paper
• 2405.15574
• Published
• 55
Denoising LM: Pushing the Limits of Error Correction Models for Speech
Recognition
Paper
• 2405.15216
• Published
• 15
An Introduction to Vision-Language Modeling
Paper
• 2405.17247
• Published
• 90
Matryoshka Multimodal Models
Paper
• 2405.17430
• Published
• 34
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding
Models
Paper
• 2405.17428
• Published
• 20
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models
Paper
• 2405.15738
• Published
• 46
Dense Connector for MLLMs
Paper
• 2405.13800
• Published
• 24
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Paper
• 2405.14598
• Published
• 13
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper
• 2405.20204
• Published
• 37
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper
• 2405.18669
• Published
• 12
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper
• 2405.20340
• Published
• 20
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of
Multi-modal LLMs in Video Analysis
Paper
• 2405.21075
• Published
• 26
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Paper
• 2406.00888
• Published
• 33
Parrot: Multilingual Visual Instruction Tuning
Paper
• 2406.02539
• Published
• 36
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with
LLM
Paper
• 2406.02884
• Published
• 18
ShareGPT4Video: Improving Video Understanding and Generation with Better
Captions
Paper
• 2406.04325
• Published
• 74
AgentGym: Evolving Large Language Model-based Agents across Diverse
Environments
Paper
• 2406.04151
• Published
• 24
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective
Navigation via Multi-Agent Collaboration
Paper
• 2406.01014
• Published
• 33
Vript: A Video Is Worth Thousands of Words
Paper
• 2406.06040
• Published
• 28
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper
• 2406.07550
• Published
• 60
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper
• 2406.06911
• Published
• 12
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio
Understanding in Video-LLMs
Paper
• 2406.07476
• Published
• 36
What If We Recaption Billions of Web Images with LLaMA-3?
Paper
• 2406.08478
• Published
• 43
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
in Videos
Paper
• 2406.08407
• Published
• 28
Needle In A Multimodal Haystack
Paper
• 2406.07230
• Published
• 54
mDPO: Conditional Preference Optimization for Multimodal Large Language
Models
Paper
• 2406.11839
• Published
• 40
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
• 2406.11816
• Published
• 26
TroL: Traversal of Layers for Large Language and Vision Models
Paper
• 2406.12246
• Published
• 36
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper
• 2406.12275
• Published
• 31
Benchmarking Multi-Image Understanding in Vision and Language Models:
Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Paper
• 2406.12742
• Published
• 15
Adversarial Attacks on Multimodal Agents
Paper
• 2406.12814
• Published
• 4
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of
Multimodal Large Language Models
Paper
• 2406.11230
• Published
• 33
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations
for Vision Foundation Models
Paper
• 2406.12649
• Published
• 16
Understanding Hallucinations in Diffusion Models through Mode
Interpolation
Paper
• 2406.09358
• Published
• 5
CMC-Bench: Towards a New Paradigm of Visual Signal Compression
Paper
• 2406.09356
• Published
• 6
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper
• 2406.09406
• Published
• 15
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
Language Models
Paper
• 2406.09403
• Published
• 23
MuirBench: A Comprehensive Benchmark for Robust Multi-image
Understanding
Paper
• 2406.09411
• Published
• 19
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
• 2406.08707
• Published
• 17
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal
Prompts
Paper
• 2406.09162
• Published
• 14
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
Interleaved with Text
Paper
• 2406.08418
• Published
• 32
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on
Mobile Devices
Paper
• 2406.08451
• Published
• 26
Paper
• 2406.04127
• Published
• 39
NaRCan: Natural Refined Canonical Image with Integration of Diffusion
Prior for Video Editing
Paper
• 2406.06523
• Published
• 53
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Paper
• 2406.08487
• Published
• 14
VCR: Visual Caption Restoration
Paper
• 2406.06462
• Published
• 13
An Image is Worth More Than 16x16 Patches: Exploring Transformers on
Individual Pixels
Paper
• 2406.09415
• Published
• 51
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
• 2406.09246
• Published
• 43
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Paper
• 2406.08552
• Published
• 25
Physics3D: Learning Physical Properties of 3D Gaussians via Video
Diffusion
Paper
• 2406.04338
• Published
• 39
Hibou: A Family of Foundational Vision Transformers for Pathology
Paper
• 2406.05074
• Published
• 10
Make It Count: Text-to-Image Generation with an Accurate Number of
Objects
Paper
• 2406.10210
• Published
• 78
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context
Reinforcement Learning
Paper
• 2406.08973
• Published
• 89
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs
Paper
• 2406.11833
• Published
• 62
Exploring the Role of Large Language Models in Prompt Encoding for
Diffusion Models
Paper
• 2406.11831
• Published
• 22
From Pixels to Prose: A Large Dataset of Dense Image Captions
Paper
• 2406.10328
• Published
• 18
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper
• 2406.14544
• Published
• 35
WildVision: Evaluating Vision-Language Models in the Wild with Human
Preferences
Paper
• 2406.11069
• Published
• 14
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
• 2406.11271
• Published
• 21
Paper
• 2406.11775
• Published
• 9
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper
• 2406.11251
• Published
• 11
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN
Inversion and High Quality Image Editing
Paper
• 2406.10601
• Published
• 70
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video
Understanding
Paper
• 2406.14515
• Published
• 33
Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation
Modelling in Large Multimodal Models
Paper
• 2406.14035
• Published
• 13
ICAL: Continual Learning of Multimodal Agents by Transforming
Trajectories into Actionable Insights
Paper
• 2406.14596
• Published
• 5
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical
Report
Paper
• 2406.11403
• Published
• 4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in
Large Video-Language Models
Paper
• 2406.16338
• Published
• 26
Long Context Transfer from Language to Vision
Paper
• 2406.16852
• Published
• 33
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
• 2406.16860
• Published
• 63
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Paper
• 2406.17770
• Published
• 19
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper
• 2406.15704
• Published
• 6
Octo-planner: On-device Language Model for Planner-Action Agents
Paper
• 2406.18082
• Published
• 48
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal
LLMs
Paper
• 2406.18521
• Published
• 30
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper
• 2406.15334
• Published
• 9
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large
Language Models
Paper
• 2406.17294
• Published
• 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
• 2406.19389
• Published
• 54
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of
LLMs
Paper
• 2406.18629
• Published
• 42
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Paper
• 2406.18790
• Published
• 34
Simulating Classroom Education with LLM-Empowered Agents
Paper
• 2406.19226
• Published
• 32
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for
Vision-Language Models
Paper
• 2406.10900
• Published
• 11
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Paper
• 2406.20095
• Published
• 18
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything
Model
Paper
• 2406.20076
• Published
• 10
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
• 2406.17720
• Published
• 8
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
• 2407.01284
• Published
• 81
ROS-LLM: A ROS framework for embodied AI with task feedback and
structured reasoning
Paper
• 2406.19741
• Published
• 60
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and
Efficient Evaluation
Paper
• 2407.00468
• Published
• 35
ColPali: Efficient Document Retrieval with Vision Language Models
Paper
• 2407.01449
• Published
• 51
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables
Open-World Instruction Following Agents
Paper
• 2407.00114
• Published
• 13
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
• 2407.02477
• Published
• 24
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output
Paper
• 2407.03320
• Published
• 94
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
• 2407.02392
• Published
• 23
Unveiling Encoder-Free Vision-Language Models
Paper
• 2406.11832
• Published
• 54
Flash-VStream: Memory-Based Real-Time Understanding for Long Video
Streams
Paper
• 2406.08085
• Published
• 17
Granular Privacy Control for Geolocation with Vision Language Models
Paper
• 2407.04952
• Published
• 7
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
• 2407.06135
• Published
• 22
Multi-Object Hallucination in Vision-Language Models
Paper
• 2407.06192
• Published
• 12
Vision language models are blind
Paper
• 2407.06581
• Published
• 84
VIMI: Grounding Video Generation through Multi-modal Instruction
Paper
• 2407.06304
• Published
• 10
Video-to-Audio Generation with Hidden Alignment
Paper
• 2407.07464
• Published
• 17
Stark: Social Long-Term Multi-Modal Conversation with Persona
Commonsense Knowledge
Paper
• 2407.03958
• Published
• 21
Understanding Visual Feature Reliance through the Lens of Complexity
Paper
• 2407.06076
• Published
• 6
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting
Region Captions
Paper
• 2407.06723
• Published
• 11
PaliGemma: A versatile 3B VLM for transfer
Paper
• 2407.07726
• Published
• 72
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
• 2407.07895
• Published
• 42
Do Vision and Language Models Share Concepts? A Vector Space Alignment
Study
Paper
• 2302.06555
• Published
• 9
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal
Perception
Paper
• 2407.08303
• Published
• 19
The Synergy between Data and Multi-Modal Large Language Models: A Survey
from Co-Development Perspective
Paper
• 2407.08583
• Published
• 13
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published
• 47
E5-V: Universal Embeddings with Multimodal Large Language Models
Paper
• 2407.12580
• Published
• 42
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper
• 2407.12679
• Published
• 8
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
Paper
• 2407.09018
• Published
• 5
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in
Clutter
Paper
• 2407.11298
• Published
• 6
NavGPT-2: Unleashing Navigational Reasoning Capability for Large
Vision-Language Models
Paper
• 2407.12366
• Published
• 4
Benchmarking Trustworthiness of Multimodal Large Language Models: A
Comprehensive Study
Paper
• 2406.07057
• Published
• 17
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper
• 2407.14177
• Published
• 45
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document
Understanding
Paper
• 2407.12594
• Published
• 19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models
Paper
• 2407.15841
• Published
• 40
VideoGameBunny: Towards vision assistants for video games
Paper
• 2407.15295
• Published
• 23
CGB-DM: Content and Graphic Balance Layout Generation with
Transformer-based Diffusion Model
Paper
• 2407.15233
• Published
• 7
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any
Person
Paper
• 2407.16224
• Published
• 29
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Paper
• 2407.16655
• Published
• 30
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
Large Language Model
Paper
• 2407.16198
• Published
• 13
VILA^2: VILA Augmented VILA
Paper
• 2407.17453
• Published
• 41
Learning to Manipulate Anywhere: A Visual Generalizable Framework For
Reinforcement Learning
Paper
• 2407.15815
• Published
• 14
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Paper
• 2407.17490
• Published
• 31
Efficient Inference of Vision Instruction-Following Models with Elastic
Cache
Paper
• 2407.18121
• Published
• 17
VSSD: Vision Mamba with Non-Causal State Space Duality
Paper
• 2407.18559
• Published
• 20
Wolf: Captioning Everything with a World Summarization Framework
Paper
• 2407.18908
• Published
• 32
Diffusion Feedback Helps CLIP See Better
Paper
• 2407.20171
• Published
• 36
VolDoGer: LLM-assisted Datasets for Domain Generalization in
Vision-Language Tasks
Paper
• 2407.19795
• Published
• 11
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper
• 2407.19985
• Published
• 37
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware
Experts
Paper
• 2407.21770
• Published
• 22
Towards Achieving Human Parity on End-to-end Simultaneous Speech
Translation via LLM Agent
Paper
• 2407.21646
• Published
• 18
ShieldGemma: Generative AI Content Moderation Based on Gemma
Paper
• 2407.21772
• Published
• 14
Open-Vocabulary Audio-Visual Semantic Segmentation
Paper
• 2407.21721
• Published
• 9
SAM 2: Segment Anything in Images and Videos
Paper
• 2408.00714
• Published
• 120
OmniParser for Pure Vision Based GUI Agent
Paper
• 2408.00203
• Published
• 24
Generalized Out-of-Distribution Detection and Beyond in Vision Language
Model Era: A Survey
Paper
• 2407.21794
• Published
• 6
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
• 2408.01800
• Published
• 92
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation
with Multimodal Generative Pretraining
Paper
• 2408.02657
• Published
• 35
Language Model Can Listen While Speaking
Paper
• 2408.02622
• Published
• 40
ExoViP: Step-by-step Verification and Exploration with Exoskeleton
Modules for Compositional Visual Reasoning
Paper
• 2408.02210
• Published
• 9
Operationalizing Contextual Integrity in Privacy-Conscious Assistants
Paper
• 2408.02373
• Published
• 5
LLaVA-OneVision: Easy Visual Task Transfer
Paper
• 2408.03326
• Published
• 61
Diffusion Models as Data Mining Tools
Paper
• 2408.02752
• Published
• 15
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual
Segmentation
Paper
• 2408.01708
• Published
• 4
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in
Long-Horizon Tasks
Paper
• 2408.03615
• Published
• 31
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware
Open-domain Visual Storytelling
Paper
• 2408.03695
• Published
• 13
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Paper
• 2408.03900
• Published
• 10
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from
User's Casual Sketches
Paper
• 2408.04567
• Published
• 26
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
Models
Paper
• 2408.04594
• Published
• 14
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior
for Part-Level Dynamics
Paper
• 2408.04631
• Published
• 9
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
• 2408.05211
• Published
• 50
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
• 2408.04840
• Published
• 33
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond
Scaling
Paper
• 2408.04810
• Published
• 24
ControlNeXt: Powerful and Efficient Control for Image and Video
Generation
Paper
• 2408.06070
• Published
• 55
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation
Agents
Paper
• 2408.06327
• Published
• 17
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization
Paper
• 2408.05939
• Published
• 14
Imagen 3
Paper
• 2408.07009
• Published
• 62
Amuro & Char: Analyzing the Relationship between Pre-Training and
Fine-Tuning of Large Language Models
Paper
• 2408.06663
• Published
• 16
Paper
• 2408.05366
• Published
• 14
Towards flexible perception with visual memory
Paper
• 2408.08172
• Published
• 23
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
• 2408.08872
• Published
• 101
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
Paper
• 2408.08459
• Published
• 45
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Paper
• 2408.08441
• Published
• 8
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
• 2408.10188
• Published
• 52
MegaFusion: Extend Diffusion Models towards Higher-resolution Image
Generation without Further Tuning
Paper
• 2408.11001
• Published
• 13
Factorized-Dreamer: Training A High-Quality Video Generator with Limited
and Low-Quality Data
Paper
• 2408.10119
• Published
• 17
Transfusion: Predict the Next Token and Diffuse Images with One
Multi-Modal Model
Paper
• 2408.11039
• Published
• 63
NeCo: Improving DINOv2's spatial representations in 19 GPU hours with
Patch Neighbor Consistency
Paper
• 2408.11054
• Published
• 14
Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion
for Efficient Inference Intervention in Large Language Model
Paper
• 2408.10764
• Published
• 9
Audio Match Cutting: Finding and Creating Matching Audio Transitions in
Movies and Videos
Paper
• 2408.10998
• Published
• 9
MambaEVT: Event Stream based Visual Object Tracking using State Space
Model
Paper
• 2408.10487
• Published
• 7
FocusLLM: Scaling LLM's Context by Parallel Decoding
Paper
• 2408.11745
• Published
• 25
TWLV-I: Analysis and Insights from Holistic Evaluation on Video
Foundation Models
Paper
• 2408.11318
• Published
• 56
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
• 2408.11817
• Published
• 9
FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive
Prompt Weighting
Paper
• 2408.11706
• Published
• 7
TrackGo: A Flexible and Efficient Method for Controllable Video
Generation
Paper
• 2408.11475
• Published
• 18
Out-of-Distribution Detection with Attention Head Masking for Multimodal
Document Classification
Paper
• 2408.11237
• Published
• 6
Iterative Object Count Optimization for Text-to-image Diffusion Models
Paper
• 2408.11721
• Published
• 6
Sapiens: Foundation for Human Vision Models
Paper
• 2408.12569
• Published
• 94
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
• 2408.12528
• Published
• 51
Open-FinLLMs: Open Multimodal Large Language Models for Financial
Applications
Paper
• 2408.11878
• Published
• 63
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed
Representations
Paper
• 2408.12590
• Published
• 35
Scalable Autoregressive Image Generation with Mamba
Paper
• 2408.12245
• Published
• 26
Real-Time Video Generation with Pyramid Attention Broadcast
Paper
• 2408.12588
• Published
• 17
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for
Large-scale Vision-Language Models
Paper
• 2408.12114
• Published
• 15
Anim-Director: A Large Multimodal Model Powered Agent for Controllable
Animation Video Generation
Paper
• 2408.09787
• Published
• 10
Building and better understanding vision-language models: insights and
future directions
Paper
• 2408.12637
• Published
• 133
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution
Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published
• 26
CustomCrafter: Customized Video Generation with Preserving Motion and
Concept Composition Abilities
Paper
• 2408.13239
• Published
• 11
Foundation Models for Music: A Survey
Paper
• 2408.14340
• Published
• 44
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Paper
• 2408.13402
• Published
• 18
TVG: A Training-free Transition Video Generation Method with Diffusion
Models
Paper
• 2408.13413
• Published
• 14
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and
Deduplication by Introducing a Competitive Large Language Model Baseline
Paper
• 2408.15079
• Published
• 54
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published
• 95
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published
• 57
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling
Paper
• 2408.16532
• Published
• 50
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
• 2408.16725
• Published
• 53
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time
Series Forecasters
Paper
• 2408.17253
• Published
• 39
TableBench: A Comprehensive and Complex Benchmark for Table Question
Answering
Paper
• 2408.09174
• Published
• 52
VideoLLaMB: Long-context Video Understanding with Recurrent Memory
Bridges
Paper
• 2409.01071
• Published
• 27
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world
Videos
Paper
• 2409.02095
• Published
• 37
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper
• 2409.02097
• Published
• 34
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
• 2409.02889
• Published
• 54
Attention Heads of Large Language Models: A Survey
Paper
• 2409.03752
• Published
• 92
Open-MAGVIT2: An Open-Source Project Toward Democratizing
Auto-regressive Visual Generation
Paper
• 2409.04410
• Published
• 25
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper
• 2409.05840
• Published
• 49
Towards a Unified View of Preference Learning for Large Language Models:
A Survey
Paper
• 2409.02795
• Published
• 72
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper
• 2409.04828
• Published
• 24
Benchmarking Chinese Knowledge Rectification in Large Language Models
Paper
• 2409.05806
• Published
• 15
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
• 2409.06666
• Published
• 60
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Paper
• 2409.06135
• Published
• 16
PingPong: A Benchmark for Role-Playing Language Models with User
Emulation and Multi-Model Evaluation
Paper
• 2409.06820
• Published
• 68
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View
Synthesis
Paper
• 2409.07129
• Published
• 8
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Paper
• 2409.07239
• Published
• 15
Ferret: Federated Full-Parameter Tuning at Scale for Large Language
Models
Paper
• 2409.06277
• Published
• 15
Guiding Vision-Language Model Selection for Visual Question-Answering
Across Tasks, Domains, and Knowledge Types
Paper
• 2409.09269
• Published
• 8
One missing piece in Vision and Language: A Survey on Comics
Understanding
Paper
• 2409.09502
• Published
• 24
NVLM: Open Frontier-Class Multimodal LLMs
Paper
• 2409.11402
• Published
• 74
OmniGen: Unified Image Generation
Paper
• 2409.11340
• Published
• 115
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper
• 2409.11355
• Published
• 30
OSV: One Step is Enough for High-Quality Image to Video Generation
Paper
• 2409.11367
• Published
• 14
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
Document Understanding
Paper
• 2409.03420
• Published
• 26
InstantDrag: Improving Interactivity in Drag-based Image Editing
Paper
• 2409.08857
• Published
• 34
AudioBERT: Audio Knowledge Augmented Language Model
Paper
• 2409.08199
• Published
• 5
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Paper
• 2409.08554
• Published
• 3
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution
Paper
• 2409.12191
• Published
• 78
Qwen2.5-Coder Technical Report
Paper
• 2409.12186
• Published
• 153
Preference Tuning with Human Feedback on Language, Speech, and Vision
Tasks: A Survey
Paper
• 2409.11564
• Published
• 20
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Paper
• 2409.12139
• Published
• 12
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary
Resolution
Paper
• 2409.12961
• Published
• 25
StoryMaker: Towards Holistic Consistent Characters in Text-to-image
Generation
Paper
• 2409.12576
• Published
• 16
Imagine yourself: Tuning-Free Personalized Image Generation
Paper
• 2409.13346
• Published
• 69
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating
Satire Comprehension capability of Vision-Language Models
Paper
• 2409.13592
• Published
• 50
Portrait Video Editing Empowered by Multimodal Generative Priors
Paper
• 2409.13591
• Published
• 16
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language
Instructions
Paper
• 2409.15278
• Published
• 24
Phantom of Latent for Large Language and Vision Models
Paper
• 2409.14713
• Published
• 29
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror
Reflections
Paper
• 2409.14677
• Published
• 15
MIMO: Controllable Character Video Synthesis with Spatial Decomposed
Modeling
Paper
• 2409.16160
• Published
• 34
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper
• 2409.16280
• Published
• 18
Seeing Faces in Things: A Model and Dataset for Pareidolia
Paper
• 2409.16143
• Published
• 17
Attention Prompting on Image for Large Vision-Language Models
Paper
• 2409.17143
• Published
• 7
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
Multimodal Models
Paper
• 2409.17146
• Published
• 121
MIO: A Foundation Model on Multimodal Tokens
Paper
• 2409.17692
• Published
• 53
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper
• 2409.20566
• Published
• 55
Visual Question Decomposition on Multimodal Large Language Models
Paper
• 2409.19339
• Published
• 8
Loong: Generating Minute-level Long Videos with Autoregressive Language
Models
Paper
• 2410.02757
• Published
• 36
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models
Paper
• 2410.02740
• Published
• 54
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
• 2410.02712
• Published
• 37
Interpreting and Editing Vision-Language Representations to Mitigate
Hallucinations
Paper
• 2410.02762
• Published
• 9
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short
Videos
Paper
• 2410.02763
• Published
• 7
Addition is All You Need for Energy-efficient Language Models
Paper
• 2410.00907
• Published
• 151
VideoGuide: Improving Video Diffusion Models without Training Through a
Teacher's Guide
Paper
• 2410.04364
• Published
• 29
Navigating the Digital World as Humans Do: Universal Visual Grounding
for GUI Agents
Paper
• 2410.05243
• Published
• 20
UniMuMo: Unified Text, Music and Motion Generation
Paper
• 2410.04534
• Published
• 19
TLDR: Token-Level Detective Reward Model for Large Vision Language
Models
Paper
• 2410.04734
• Published
• 18
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal
Instruction
Paper
• 2410.04932
• Published
• 9
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive
Transformer for Efficient Finegrained Image Generation
Paper
• 2410.01912
• Published
• 14
ControlAR: Controllable Image Generation with Autoregressive Models
Paper
• 2410.02705
• Published
• 11
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video
Large Language Models
Paper
• 2410.03290
• Published
• 7
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
• 2410.05993
• Published
• 111
Personalized Visual Instruction Tuning
Paper
• 2410.07113
• Published
• 70
Pixtral 12B
Paper
• 2410.07073
• Published
• 69
IterComp: Iterative Composition-Aware Feedback Learning from Model
Gallery for Text-to-Image Generation
Paper
• 2410.07171
• Published
• 43
Deciphering Cross-Modal Alignment in Large Vision-Language Models with
Modality Integration Rate
Paper
• 2410.07167
• Published
• 39
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation
Learning
Paper
• 2410.06373
• Published
• 36
Pyramidal Flow Matching for Efficient Video Generative Modeling
Paper
• 2410.05954
• Published
• 40
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark
for Video Generation
Paper
• 2410.05363
• Published
• 45
Story-Adapter: A Training-free Iterative Framework for Long Story
Visualization
Paper
• 2410.06244
• Published
• 20
MM-Ego: Towards Building Egocentric Multimodal LLMs
Paper
• 2410.07177
• Published
• 22
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based
Image/Video Generation
Paper
• 2410.05591
• Published
• 13
Temporal Reasoning Transfer from Text to Video
Paper
• 2410.06166
• Published
• 13
MLLM as Retriever: Interactively Learning Multimodal Retrieval for
Embodied Agents
Paper
• 2410.03450
• Published
• 36
Intriguing Properties of Large Language and Vision Models
Paper
• 2410.04751
• Published
• 16
Progressive Autoregressive Video Diffusion Models
Paper
• 2410.08151
• Published
• 16
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving
Vision-Linguistic Compositionality
Paper
• 2410.05210
• Published
• 11
Self-Boosting Large Language Models with Synthetic Preference Data
Paper
• 2410.06961
• Published
• 16
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM
Agents
Paper
• 2410.07484
• Published
• 51
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper
• 2410.08164
• Published
• 26
GLOV: Guided Large Language Models as Implicit Optimizers for Vision
Language Models
Paper
• 2410.06154
• Published
• 16
Baichuan-Omni Technical Report
Paper
• 2410.08565
• Published
• 87
From Generalist to Specialist: Adapting Vision Language Models via
Task-Specific Visual Instruction Tuning
Paper
• 2410.06456
• Published
• 37
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large
Vision-Language Models
Paper
• 2410.07133
• Published
• 19
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large
Vision-Language Models
Paper
• 2410.10139
• Published
• 51
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality
Documents
Paper
• 2410.10594
• Published
• 29
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Paper
• 2410.11779
• Published
• 26
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
Paper
• 2410.10816
• Published
• 21
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Paper
• 2410.11817
• Published
• 15
OMCAT: Omni Context Aware Transformer
Paper
• 2410.12109
• Published
• 4
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for
Embodied AI
Paper
• 2410.11623
• Published
• 49
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex
Diagrams in Coding Tasks
Paper
• 2410.12381
• Published
• 43
The Curse of Multi-Modalities: Evaluating Hallucinations of Large
Multimodal Models across Language, Visual, and Audio
Paper
• 2410.12787
• Published
• 30
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation
Paper
• 2410.13848
• Published
• 35
Harnessing Webpage UIs for Text-Rich Visual Understanding
Paper
• 2410.13824
• Published
• 30
WorldCuisines: A Massive-Scale Benchmark for Multilingual and
Multicultural Visual Question Answering on Global Cuisines
Paper
• 2410.12705
• Published
• 32
Fluid: Scaling Autoregressive Text-to-image Generative Models with
Continuous Tokens
Paper
• 2410.13863
• Published
• 37
MobA: A Two-Level Agent System for Efficient Mobile Task Automation
Paper
• 2410.13757
• Published
• 32
Roadmap towards Superhuman Speech Understanding using Large Language
Models
Paper
• 2410.13268
• Published
• 33
Movie Gen: A Cast of Media Foundation Models
Paper
• 2410.13720
• Published
• 100
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise
Motion Control
Paper
• 2410.13830
• Published
• 26
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language
Models
Paper
• 2410.13085
• Published
• 24
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Paper
• 2410.13639
• Published
• 19
VidPanos: Generative Panoramic Videos from Casual Panning Videos
Paper
• 2410.13832
• Published
• 13
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts
as Your Personalized Assistant
Paper
• 2410.13360
• Published
• 9
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large
Language Models
Paper
• 2410.13859
• Published
• 8
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Paper
• 2410.13854
• Published
• 12
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion
Model
Paper
• 2410.13925
• Published
• 24
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Paper
• 2410.11190
• Published
• 22
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Paper
• 2410.14745
• Published
• 47
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a
Training-Free Memory Tree
Paper
• 2410.16268
• Published
• 69
Baichuan Alignment Technical Report
Paper
• 2410.14940
• Published
• 51
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published
• 56
Toward Guidance-Free AR Visual Generation via Condition Contrastive
Alignment
Paper
• 2410.09347
• Published
• 5
AutoTrain: No-code training for state-of-the-art models
Paper
• 2410.15735
• Published
• 59
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety
and Style
Paper
• 2410.16184
• Published
• 25
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper
• 2410.15316
• Published
• 12
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid
Visual Redundancy Reduction
Paper
• 2410.17247
• Published
• 47
Aligning Large Language Models via Self-Steering Optimization
Paper
• 2410.17131
• Published
• 24
Improve Vision Language Model Chain-of-thought Reasoning
Paper
• 2410.16198
• Published
• 26
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video
Even in VLMs
Paper
• 2410.16267
• Published
• 18
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large
Vision-Language Models
Paper
• 2410.17637
• Published
• 35
Can Knowledge Editing Really Correct Hallucinations?
Paper
• 2410.16251
• Published
• 55
LOGO -- Long cOntext aliGnment via efficient preference Optimization
Paper
• 2410.18533
• Published
• 43
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Paper
• 2410.18798
• Published
• 21
Infinity-MM: Scaling Multimodal Performance with Large-Scale and
High-Quality Instruction Data
Paper
• 2410.18558
• Published
• 18
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
Tuning
Paper
• 2410.17779
• Published
• 8
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context
Prompting
Paper
• 2410.17856
• Published
• 52
Continuous Speech Synthesis using per-token Latent Diffusion
Paper
• 2410.16048
• Published
• 29
GPT-4o System Card
Paper
• 2410.21276
• Published
• 87
Vision Search Assistant: Empower Vision-Language Models as Multimodal
Search Engines
Paper
• 2410.21220
• Published
• 11
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper
• 2410.18057
• Published
• 209
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Paper
• 2410.22587
• Published
• 10
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Paper
• 2410.23287
• Published
• 19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
• 2410.23218
• Published
• 49
Personalization of Large Language Models: A Survey
Paper
• 2411.00027
• Published
• 33
Randomized Autoregressive Visual Generation
Paper
• 2411.00776
• Published
• 18
Face Anonymization Made Simple
Paper
• 2411.00762
• Published
• 9
AndroidLab: Training and Systematic Benchmarking of Android Autonomous
Agents
Paper
• 2410.24024
• Published
• 49
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum
Reinforcement Learning
Paper
• 2411.02337
• Published
• 36
How Far is Video Generation from World Model: A Physical Law Perspective
Paper
• 2411.02385
• Published
• 34
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated
Parameters by Tencent
Paper
• 2411.02265
• Published
• 25
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Paper
• 2411.02397
• Published
• 23
AutoVFX: Physically Realistic Video Editing from Natural Language
Instructions
Paper
• 2411.02394
• Published
• 16
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for
Efficient Robot Execution
Paper
• 2411.02359
• Published
• 14
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM
Data Contamination
Paper
• 2411.03823
• Published
• 49
Adaptive Length Image Tokenization via Recurrent Allocation
Paper
• 2411.02393
• Published
• 13
ReCapture: Generative Video Camera Controls for User-Provided Videos
using Masked Video Fine-Tuning
Paper
• 2411.05003
• Published
• 71
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for
Image-to-Video Generation
Paper
• 2411.04709
• Published
• 26
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page
Multi-document Understanding
Paper
• 2411.04952
• Published
• 29
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale
Haystacks?
Paper
• 2411.05000
• Published
• 22
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in
Videos
Paper
• 2411.04923
• Published
• 23
Analyzing The Language of Visual Tokens
Paper
• 2411.05001
• Published
• 24
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper
• 2411.04997
• Published
• 39
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned
Vision-Language Models
Paper
• 2411.04097
• Published
• 5
OmniEdit: Building Image Editing Generalist Models Through Specialist
Supervision
Paper
• 2411.07199
• Published
• 50
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language
Models
Paper
• 2411.07140
• Published
• 35
Edify Image: High-Quality Image Generation with Pixel Space Laplacian
Diffusion Models
Paper
• 2411.07126
• Published
• 30
Add-it: Training-Free Object Insertion in Images With Pretrained
Diffusion Models
Paper
• 2411.07232
• Published
• 68
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published
• 31
Autoregressive Models in Vision: A Survey
Paper
• 2411.05902
• Published
• 19
MagicQuill: An Intelligent Interactive Image Editing System
Paper
• 2411.09703
• Published
• 80
Sharingan: Extract User Action Sequence from Desktop Recordings
Paper
• 2411.08768
• Published
• 9
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published
• 129
Region-Aware Text-to-Image Generation via Hard Binding and Soft
Refinement
Paper
• 2411.06558
• Published
• 36
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer
Use
Paper
• 2411.10323
• Published
• 34
Number it: Temporal Grounding Videos like Flipping Manga
Paper
• 2411.10332
• Published
• 14
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices
Paper
• 2411.10640
• Published
• 46
Generative World Explorer
Paper
• 2411.11844
• Published
• 77
AnimateAnything: Consistent and Controllable Animation for Video
Generation
Paper
• 2411.10836
• Published
• 24
SlimLM: An Efficient Small Language Model for On-Device Document
Assistance
Paper
• 2411.09944
• Published
• 12
Adaptive Decoding via Latent Preference Optimization
Paper
• 2411.09661
• Published
• 10
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
Paper
• 2411.11045
• Published
• 11
RedPajama: an Open Dataset for Training Large Language Models
Paper
• 2411.12372
• Published
• 56
SymDPO: Boosting In-Context Learning of Large Multimodal Models with
Symbol Demonstration Direct Preference Optimization
Paper
• 2411.11909
• Published
• 22
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper
• 2411.10818
• Published
• 26
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text,
and Architectural Enhancements
Paper
• 2411.12044
• Published
• 14
Continuous Speculative Decoding for Autoregressive Image Generation
Paper
• 2411.11925
• Published
• 16
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
• 2411.10442
• Published
• 87
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published
• 47
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
• 2411.14432
• Published
• 25
Large Multi-modal Models Can Interpret Features in Large Multi-modal
Models
Paper
• 2411.14982
• Published
• 19
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple
Distillation, Big Progress or Bitter Lesson?
Paper
• 2411.16489
• Published
• 45
One Diffusion to Generate Them All
Paper
• 2411.16318
• Published
• 28
DreamRunner: Fine-Grained Storytelling Video Generation with
Retrieval-Augmented Motion Adaptation
Paper
• 2411.16657
• Published
• 19
Factorized Visual Tokenization and Generation
Paper
• 2411.16681
• Published
• 19
TEXGen: a Generative Diffusion Model for Mesh Textures
Paper
• 2411.14740
• Published
• 17
ROICtrl: Boosting Instance Control for Visual Generation
Paper
• 2411.17949
• Published
• 87
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published
• 89
SketchAgent: Language-Driven Sequential Sketch Generation
Paper
• 2411.17673
• Published
• 18
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for
Training-Free Acceleration
Paper
• 2411.17686
• Published
• 19
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper
• 2411.15296
• Published
• 21
Large Language Model-Brained GUI Agents: A Survey
Paper
• 2411.18279
• Published
• 30
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
• 2411.17991
• Published
• 5
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper
• 2411.18203
• Published
• 40
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published
• 31
Yi-Lightning Technical Report
Paper
• 2412.01253
• Published
• 28
X-Prompt: Towards Universal In-Context Image Generation in
Auto-Regressive Vision Language Foundation Models
Paper
• 2412.01824
• Published
• 64
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation
Paper
• 2412.00927
• Published
• 29
Open-Sora Plan: Open-Source Large Video Generation Model
Paper
• 2412.00131
• Published
• 33
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction
with 3D Autonomous Characters
Paper
• 2412.00174
• Published
• 23
VisOnlyQA: Large Vision Language Models Still Struggle with Visual
Perception of Geometric Information
Paper
• 2412.00947
• Published
• 8
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
Audio-Visual Information?
Paper
• 2412.02611
• Published
• 25
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
• 2412.03555
• Published
• 133
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation
Paper
• 2412.03069
• Published
• 34
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene
Understanding
Paper
• 2412.00493
• Published
• 17
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual
Prompt Instruction Tuning
Paper
• 2412.03565
• Published
• 10
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
• 2412.04467
• Published
• 117
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published
• 62
NVILA: Efficient Frontier Visual Language Models
Paper
• 2412.04468
• Published
• 60
Negative Token Merging: Image-based Adversarial Feature Guidance
Paper
• 2412.01339
• Published
• 22
Personalized Multimodal Large Language Models: A Survey
Paper
• 2412.02142
• Published
• 13
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper
• 2412.01169
• Published
• 13
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Paper
• 2412.04449
• Published
• 7
Scaling Inference-Time Search with Vision Value Model for Improved
Visual Comprehension
Paper
• 2412.03704
• Published
• 6
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling
Paper
• 2412.05271
• Published
• 160
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at
Scale
Paper
• 2412.05237
• Published
• 46
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Paper
• 2412.04814
• Published
• 46
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step
Diffusion
Paper
• 2412.04301
• Published
• 40
CompCap: Improving Multimodal Large Language Models with Composite
Captions
Paper
• 2412.05243
• Published
• 20
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Paper
• 2412.05263
• Published
• 10
BigDocs: An Open and Permissively-Licensed Dataset for Training
Multimodal Models on Document and Code Tasks
Paper
• 2412.04626
• Published
• 13
Training Large Language Models to Reason in a Continuous Latent Space
Paper
• 2412.06769
• Published
• 94
Around the World in 80 Timesteps: A Generative Approach to Global Visual
Geolocation
Paper
• 2412.06781
• Published
• 23
Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper
• 2412.07112
• Published
• 28
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Paper
• 2412.04432
• Published
• 16
Exploring Multi-Grained Concept Annotations for Multimodal Large
Language Models
Paper
• 2412.05939
• Published
• 15
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for
Customized Manga Generation
Paper
• 2412.07589
• Published
• 48
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Paper
• 2412.03548
• Published
• 17
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published
• 38
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex
Image-Text Models with Structural Annotations
Paper
• 2412.08580
• Published
• 45
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Paper
• 2412.07147
• Published
• 5
StreamChat: Chatting with Streaming Video
Paper
• 2412.08646
• Published
• 18
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions
Paper
• 2412.09596
• Published
• 97
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published
• 54
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Paper
• 2412.09501
• Published
• 48
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
• 2412.08635
• Published
• 49
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via
Multimodal LLM
Paper
• 2412.09618
• Published
• 21
VisionArena: 230K Real World User-VLM Conversations with Preference
Labels
Paper
• 2412.08687
• Published
• 13
Arbitrary-steps Image Super-resolution via Diffusion Inversion
Paper
• 2412.09013
• Published
• 13
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
• 2412.10360
• Published
• 147
GenEx: Generating an Explorable World
Paper
• 2412.09624
• Published
• 98
InstanceCap: Improving Text-to-Video Generation via Instance-aware
Structured Caption
Paper
• 2412.09283
• Published
• 19
Multimodal Music Generation with Explicit Bridges and Retrieval
Augmentation
Paper
• 2412.09428
• Published
• 7
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
• 2412.09604
• Published
• 38
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper
• 2412.09871
• Published
• 108
BrushEdit: All-In-One Image Inpainting and Editing
Paper
• 2412.10316
• Published
• 36
VidTok: A Versatile and Open-Source Video Tokenizer
Paper
• 2412.13061
• Published
• 8
Paper
• 2412.13501
• Published
• 29
Progressive Multimodal Reasoning via Active Retrieval
Paper
• 2412.14835
• Published
• 73
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper
• 2412.14475
• Published
• 57
Descriptive Caption Enhancement with Visual Specialists for Multimodal
Perception
Paper
• 2412.14233
• Published
• 6
Large Motion Video Autoencoding with Cross-modal Video VAE
Paper
• 2412.17805
• Published
• 24
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation
Understanding
Paper
• 2412.17295
• Published
• 9
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper
• 2412.15213
• Published
• 28
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
Paper
• 2412.14462
• Published
• 15
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal
Audio-Video Generation
Paper
• 2412.15191
• Published
• 5
Parallelized Autoregressive Visual Generation
Paper
• 2412.15119
• Published
• 53
Taming Multimodal Joint Training for High-Quality Video-to-Audio
Synthesis
Paper
• 2412.15322
• Published
• 20
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
Paper
• 2412.11525
• Published
• 11
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published
• 42
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models
with Flow Matching
Paper
• 2412.17153
• Published
• 39
NILE: Internal Consistency Alignment in Large Language Models
Paper
• 2412.16686
• Published
• 8
DepthLab: From Partial to Complete
Paper
• 2412.18153
• Published
• 36
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D
Scene Understanding
Paper
• 2412.18450
• Published
• 36
Fourier Position Embedding: Enhancing Attention's Periodic Extension for
Length Generalization
Paper
• 2412.17739
• Published
• 41
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion
Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Paper
• 2412.18597
• Published
• 20
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation
System?
Paper
• 2412.18495
• Published
• 9
Video-Panda: Parameter-efficient Alignment for Encoder-free
Video-Language Models
Paper
• 2412.18609
• Published
• 17
Bridging the Data Provenance Gap Across Text, Speech and Video
Paper
• 2412.17847
• Published
• 10
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
• 2412.18319
• Published
• 39
YuLan-Mini: An Open Data-efficient Language Model
Paper
• 2412.17743
• Published
• 66
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
• 2412.18072
• Published
• 18
Molar: Multimodal LLMs with Collaborative Filtering Alignment for
Enhanced Sequential Recommendation
Paper
• 2412.18176
• Published
• 16
Paper
• 2412.18653
• Published
• 86
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Paper
• 2412.18619
• Published
• 60
Task Preference Optimization: Improving Multimodal Large Language Models
with Vision Task Alignment
Paper
• 2412.19326
• Published
• 18
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Paper
• 2412.19512
• Published
• 9
Explanatory Instructions: Towards Unified Vision Tasks Understanding and
Zero-shot Generalization
Paper
• 2412.18525
• Published
• 74
Edicho: Consistent Image Editing in the Wild
Paper
• 2412.21079
• Published
• 22
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow
Matching and Clap-Ranked Preference Optimization
Paper
• 2412.21037
• Published
• 24
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
Paper
• 2412.20750
• Published
• 20
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining
Paper
• 2501.00958
• Published
• 109
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion
Control
Paper
• 2501.01427
• Published
• 53
LTX-Video: Realtime Video Latent Diffusion
Paper
• 2501.00103
• Published
• 50
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with
Video LLM
Paper
• 2501.00599
• Published
• 46
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper
• 2501.00192
• Published
• 31
A3: Android Agent Arena for Mobile GUI Agents
Paper
• 2501.01149
• Published
• 22
Unifying Specialized Visual Encoders for Video Language Models
Paper
• 2501.01426
• Published
• 20
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper
• 2501.01957
• Published
• 47
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token
Paper
• 2501.03895
• Published
• 52
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
• 2501.02955
• Published
• 44
Cosmos World Foundation Model Platform for Physical AI
Paper
• 2501.03575
• Published
• 82
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language
Models
Paper
• 2501.03262
• Published
• 104
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of
Images and Videos
Paper
• 2501.04001
• Published
• 47
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment
across Language with Real-time Self-Aware Emotional Speech Synthesis
Paper
• 2501.04561
• Published
• 17
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning
and Reflection
Paper
• 2501.04575
• Published
• 25
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Paper
• 2501.05366
• Published
• 102
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich
Paradigm for Direct Preference Optimization
Paper
• 2501.03271
• Published
• 10
The GAN is dead; long live the GAN! A Modern GAN Baseline
Paper
• 2501.05441
• Published
• 95
Enhancing Human-Like Responses in Large Language Models
Paper
• 2501.05032
• Published
• 61
An Empirical Study of Autoregressive Pre-training from Videos
Paper
• 2501.05453
• Published
• 41
Centurio: On Drivers of Multilingual Ability of Large Vision-Language
Model
Paper
• 2501.05122
• Published
• 19
On Computational Limits and Provably Efficient Criteria of Visual
Autoregressive Models: A Fine-Grained Complexity Analysis
Paper
• 2501.04377
• Published
• 14
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper
• 2501.05874
• Published
• 75
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in
Multimodal Large Language Models
Paper
• 2501.05767
• Published
• 29
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published
• 44
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Paper
• 2501.06282
• Published
• 53
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token
Marks
Paper
• 2501.08326
• Published
• 34
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale
Pre-Training
Paper
• 2501.07556
• Published
• 7
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published
• 30
RepVideo: Rethinking Cross-Layer Representation for Video Generation
Paper
• 2501.08994
• Published
• 15
ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Paper
• 2501.05452
• Published
• 15
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Paper
• 2501.05707
• Published
• 20
VideoAuteur: Towards Long Narrative Video Generation
Paper
• 2501.06173
• Published
• 31
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Paper
• 2501.06842
• Published
• 16
Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Paper
• 2501.06708
• Published
• 5
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper
• 2501.08313
• Published
• 300
Democratizing Text-to-Image Masked Generative Models with Compact
Text-Aware One-Dimensional Tokens
Paper
• 2501.07730
• Published
• 18
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Paper
• 2501.08292
• Published
• 17
Tarsier2: Advancing Large Vision-Language Models from Detailed Video
Description to Comprehensive Video Understanding
Paper
• 2501.07888
• Published
• 15
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for
LLM Training
Paper
• 2501.08197
• Published
• 9
Parameter-Inverted Image Pyramid Networks for Visual Perception and
Multimodal Understanding
Paper
• 2501.07783
• Published
• 8
MINIMA: Modality Invariant Image Matching
Paper
• 2412.19412
• Published
• 4
OmniThink: Expanding Knowledge Boundaries in Machine Writing through
Thinking
Paper
• 2501.09751
• Published
• 46
Learnings from Scaling Visual Tokenizers for Reconstruction and
Generation
Paper
• 2501.09755
• Published
• 35
Do generative video models learn physical principles from watching
videos?
Paper
• 2501.09038
• Published
• 34
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
• 2501.09747
• Published
• 28
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published
• 27
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
• 2501.12380
• Published
• 84
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published
• 28
Can We Generate Images with CoT? Let's Verify and Reinforce Image
Generation Step by Step
Paper
• 2501.13926
• Published
• 43
Baichuan-Omni-1.5 Technical Report
Paper
• 2501.15368
• Published
• 60
Qwen2.5-1M Technical Report
Paper
• 2501.15383
• Published
• 72
Towards General-Purpose Model-Free Reinforcement Learning
Paper
• 2501.16142
• Published
• 31
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for
Speech Generation
Paper
• 2501.15907
• Published
• 17
Are Vision Language Models Texture or Shape Biased and Can We Steer
Them?
Paper
• 2403.09193
• Published
• 9
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model
Post-training
Paper
• 2501.17161
• Published
• 124
PixelWorld: Towards Perceiving Everything as Pixels
Paper
• 2501.19339
• Published
• 17
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human
Animation Models
Paper
• 2502.01061
• Published
• 223
Process Reinforcement through Implicit Rewards
Paper
• 2502.01456
• Published
• 62
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal
Understanding
Paper
• 2502.01341
• Published
• 39
Paper
• 2501.14249
• Published
• 77
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding
Paper
• 2501.13106
• Published
• 90
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Paper
• 2501.12599
• Published
• 126
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative
Textual Feedback
Paper
• 2501.12895
• Published
• 61
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
Paper
• 2501.12948
• Published
• 440
Token Assorted: Mixing Latent and Text Tokens for Improved Language
Model Reasoning
Paper
• 2502.03275
• Published
• 18
Analyze Feature Flow to Enhance Interpretation and Steering in Language
Models
Paper
• 2502.03032
• Published
• 60
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive
Modality Alignment
Paper
• 2502.04328
• Published
• 29
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
• 2502.05173
• Published
• 64
Fast Video Generation with Sliding Tile Attention
Paper
• 2502.04507
• Published
• 51
Goku: Flow Based Video Generative Foundation Models
Paper
• 2502.04896
• Published
• 106
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth
Approach
Paper
• 2502.05171
• Published
• 152
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive
Multimodal Understanding and Generation
Paper
• 2502.05178
• Published
• 10
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for
Mobile Devices
Paper
• 2502.04363
• Published
• 12
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time
Scaling
Paper
• 2502.06703
• Published
• 152
Scaling Pre-training to One Hundred Billion Data for Vision Language
Models
Paper
• 2502.07617
• Published
• 29
Expect the Unexpected: FailSafe Long Context QA for Finance
Paper
• 2502.06329
• Published
• 133
Magic 1-For-1: Generating One Minute Video Clips within One Minute
Paper
• 2502.07701
• Published
• 36
Light-A-Video: Training-free Video Relighting via Progressive Light
Fusion
Paper
• 2502.08590
• Published
• 42
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Paper
• 2502.07870
• Published
• 45
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Paper
• 2502.08047
• Published
• 28
TransMLA: Multi-head Latent Attention Is All You Need
Paper
• 2502.07864
• Published
• 57
mmE5: Improving Multimodal Multilingual Embeddings via High-quality
Synthetic Data
Paper
• 2502.08468
• Published
• 16
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of
Physical Concept Understanding
Paper
• 2502.08946
• Published
• 191
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient
Text-to-Image Generation
Paper
• 2502.08690
• Published
• 43
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published
• 35
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published
• 43
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of
Video Foundation Model
Paper
• 2502.10248
• Published
• 57
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper
• 2502.10391
• Published
• 34
Large Language Diffusion Models
Paper
• 2502.09992
• Published
• 126
Learning Getting-Up Policies for Real-World Humanoid Robots
Paper
• 2502.12152
• Published
• 42
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention
Paper
• 2502.11089
• Published
• 167
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on
Continual Pre-Training
Paper
• 2502.11196
• Published
• 23
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning
in Diffusion Models
Paper
• 2502.10458
• Published
• 38
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and
Generation
Paper
• 2502.12148
• Published
• 17
Intuitive physics understanding emerges from self-supervised pretraining
on natural videos
Paper
• 2502.11831
• Published
• 20
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper
• 2502.11775
• Published
• 9
Ask in Any Modality: A Comprehensive Survey on Multimodal
Retrieval-Augmented Generation
Paper
• 2502.08826
• Published
• 17
ILIAS: Instance-Level Image retrieval At Scale
Paper
• 2502.11748
• Published
• 4
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper
• 2502.12900
• Published
• 86
Continuous Diffusion Model for Language Modeling
Paper
• 2502.11564
• Published
• 53
Phantom: Subject-consistent video generation via cross-modal alignment
Paper
• 2502.11079
• Published
• 59
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published
• 58
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance
Software Engineering?
Paper
• 2502.12115
• Published
• 46
Multimodal Mamba: Decoder-only Multimodal State Space Model via
Quadratic to Linear Distillation
Paper
• 2502.13145
• Published
• 38
RealSyn: An Effective and Scalable Multimodal Interleaved Document
Transformation Paradigm
Paper
• 2502.12513
• Published
• 16
Harnessing Vision Models for Time Series Analysis: A Survey
Paper
• 2502.08869
• Published
• 2
Qwen2.5-VL Technical Report
Paper
• 2502.13923
• Published
• 214
On the Trustworthiness of Generative Foundation Models: Guideline,
Assessment, and Perspective
Paper
• 2502.14296
• Published
• 45
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic
Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published
• 158
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Paper
• 2502.14502
• Published
• 91
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in
Vision-Language Models
Paper
• 2502.14834
• Published
• 24
Does Time Have Its Place? Temporal Heads: Where Language Models Recall
Time-specific Information
Paper
• 2502.14258
• Published
• 26
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex
Task Automation on PC
Paper
• 2502.14282
• Published
• 29
How to Get Your LLM to Generate Challenging Problems for Evaluation
Paper
• 2502.14678
• Published
• 18
Dynamic Concepts Personalization from Single Videos
Paper
• 2502.14844
• Published
• 16
Scaling Text-Rich Image Understanding via Code-Guided Synthetic
Multimodal Data Generation
Paper
• 2502.14846
• Published
• 14
NAVIG: Natural Language-guided Analysis with Vision Language Models for
Image Geo-localization
Paper
• 2502.14638
• Published
• 11
From RAG to Memory: Non-Parametric Continual Learning for Large Language
Models
Paper
• 2502.14802
• Published
• 13
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the
Limits of Embedding Space Capacity
Paper
• 2502.13063
• Published
• 74
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit
Matching Visual Cues
Paper
• 2502.12084
• Published
• 34
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context
Memory of Transformers
Paper
• 2502.15007
• Published
• 174
SurveyX: Academic Survey Automation via Large Language Models
Paper
• 2502.14776
• Published
• 100
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
Paper
• 2502.14397
• Published
• 41
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper
• 2502.17157
• Published
• 52
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published
• 18
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon
Robotic Manipulation
Paper
• 2502.16707
• Published
• 13
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper
• 2502.18411
• Published
• 74
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper
• 2502.18137
• Published
• 60
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open
Software Evolution
Paper
• 2502.18449
• Published
• 75
KV-Edit: Training-Free Image Editing for Precise Background Preservation
Paper
• 2502.17363
• Published
• 37
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent
Image Generation
Paper
• 2502.18364
• Published
• 36
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
Paper
• 2502.18461
• Published
• 17
Introducing Visual Perception Token into Multimodal Large Language Model
Paper
• 2502.17425
• Published
• 16
MLLMs Know Where to Look: Training-free Perception of Small Visual
Details with Multimodal LLMs
Paper
• 2502.17422
• Published
• 7
LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven
Language Representation
Paper
• 2502.18302
• Published
• 5
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
Paper
• 2502.17092
• Published
• 3
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
• 2502.19400
• Published
• 47
Towards an AI co-scientist
Paper
• 2502.18864
• Published
• 52
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language
Models (VLMs) via Reinforcement Learning
Paper
• 2502.19634
• Published
• 63
UniTok: A Unified Tokenizer for Visual Generation and Understanding
Paper
• 2502.20321
• Published
• 30
Multimodal Representation Alignment for Image Generation: Text-Image
Interleaved Control Is Easier Than You Think
Paper
• 2502.20172
• Published
• 29
HAIC: Improving Human Action Understanding and Generation with Better
Captions for Multi-modal Large Language Models
Paper
• 2502.20811
• Published
• 3
Chain of Draft: Thinking Faster by Writing Less
Paper
• 2502.18600
• Published
• 50
Tell me why: Visual foundation models as self-explainable classifiers
Paper
• 2502.19577
• Published
• 11
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
Paper
• 2502.20545
• Published
• 22
MIGE: A Unified Framework for Multimodal Instruction-Based Image
Generation and Editing
Paper
• 2502.21291
• Published
• 5
Predictive Data Selection: The Data That Predicts Is the Data That
Teaches
Paper
• 2503.00808
• Published
• 56
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
• 2503.01785
• Published
• 86
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
• 2503.01743
• Published
• 89
Qilin: A Multimodal Information Retrieval Dataset with APP-level User
Sessions
Paper
• 2503.00501
• Published
• 12
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open
Language Models
Paper
• 2402.03300
• Published
• 140
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in
Multimodal Cycles
Paper
• 2503.03651
• Published
• 16
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended
Language Interface
Paper
• 2503.01342
• Published
• 8
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence
Generation up to 100K Tokens
Paper
• 2502.18890
• Published
• 30
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from
Inputs
Paper
• 2503.02003
• Published
• 48
Process-based Self-Rewarding Language Models
Paper
• 2503.03746
• Published
• 39
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time
Cognitive Task Solving and Reasoning in UAVs
Paper
• 2503.01378
• Published
• 5
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published
• 96
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Paper
• 2503.04724
• Published
• 72
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding
and Expert Reasoning Abilities
Paper
• 2503.03983
• Published
• 27
How to Steer LLM Latents for Hallucination Detection?
Paper
• 2503.01917
• Published
• 11
The Best of Both Worlds: Integrating Language Models and Diffusion
Models for Video Generation
Paper
• 2503.04606
• Published
• 9
Unified Reward Model for Multimodal Understanding and Generation
Paper
• 2503.05236
• Published
• 123
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Paper
• 2503.05132
• Published
• 57
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive
Cognitive-Inspired Sketching
Paper
• 2503.05179
• Published
• 46
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following
with Paralinguistic Information
Paper
• 2503.05085
• Published
• 47
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with
Reinforcing Learning
Paper
• 2503.05379
• Published
• 38
VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play
Context Control
Paper
• 2503.05639
• Published
• 26
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos
via Diffusion Models
Paper
• 2503.05638
• Published
• 20
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale
Reinforcement Learning
Paper
• 2503.07365
• Published
• 61
Automated Movie Generation via Multi-Agent CoT Planning
Paper
• 2503.07314
• Published
• 44
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue
Learning
Paper
• 2503.07002
• Published
• 39
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large
Language Models
Paper
• 2503.06749
• Published
• 31
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural
Vision-Language Dataset for Southeast Asia
Paper
• 2503.07920
• Published
• 101
MagicInfinite: Generating Infinite Talking Videos with Your Words and
Voice
Paper
• 2503.05978
• Published
• 36
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
• 2503.07536
• Published
• 88
Video Action Differencing
Paper
• 2503.07860
• Published
• 33
UniF^2ace: Fine-grained Face Understanding and Generation
with Unified Multimodal Models
Paper
• 2503.08120
• Published
• 32
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories
Paper
• 2503.08625
• Published
• 27
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Paper
• 2503.07604
• Published
• 23
LightGen: Efficient Image Generation through Knowledge Distillation and
Direct Preference Optimization
Paper
• 2503.08619
• Published
• 20
EasyControl: Adding Efficient and Flexible Control for Diffusion
Transformer
Paper
• 2503.07027
• Published
• 29
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted
Contrastive Learning
Paper
• 2503.04812
• Published
• 17
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Paper
• 2503.02199
• Published
• 8
Seedream 2.0: A Native Chinese-English Bilingual Image Generation
Foundation Model
Paper
• 2503.07703
• Published
• 37
Gemini Embedding: Generalizable Embeddings from Gemini
Paper
• 2503.07891
• Published
• 45
OmniMamba: Efficient and Unified Multimodal Understanding and Generation
via State Space Models
Paper
• 2503.08686
• Published
• 19
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic
Audiovisual Narrative Processing
Paper
• 2503.06940
• Published
• 11
Transformers without Normalization
Paper
• 2503.10622
• Published
• 170
Charting and Navigating Hugging Face's Model Atlas
Paper
• 2503.10633
• Published
• 93
World Modeling Makes a Better Planner: Dual Preference Optimization for
Embodied Task Planning
Paper
• 2503.10480
• Published
• 56
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model
for Visual Generation and Editing
Paper
• 2503.10639
• Published
• 53
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published
• 36
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large
Language Models
Paper
• 2503.10437
• Published
• 34
CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
Paper
• 2503.09662
• Published
• 33
OmniPaint: Mastering Object-Oriented Editing via Disentangled
Insertion-Removal Inpainting
Paper
• 2503.08677
• Published
• 29
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and
Beyond
Paper
• 2503.10460
• Published
• 30
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Paper
• 2503.10596
• Published
• 18
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published
• 17
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in
$200k
Paper
• 2503.09642
• Published
• 20
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference
Time by Leveraging Sparsity
Paper
• 2503.07677
• Published
• 86
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Paper
• 2503.11647
• Published
• 146
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories
Generation in End-to-End Autonomous Driving
Paper
• 2503.05689
• Published
• 3
SmolDocling: An ultra-compact vision-language model for end-to-end
multi-modal document conversion
Paper
• 2503.11576
• Published
• 147
Large-scale Pre-training for Grounded Video Caption Generation
Paper
• 2503.10781
• Published
• 16
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model
with Interleaved Multimodal Generation via Asymmetric Synergy
Paper
• 2503.06542
• Published
• 7
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal
Consistent Video Generation
Paper
• 2503.06053
• Published
• 138
Being-0: A Humanoid Robotic Agent with Vision-Language Models and
Modular Skills
Paper
• 2503.12533
• Published
• 68
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale
Text-to-Image Models
Paper
• 2503.12885
• Published
• 43
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Paper
• 2503.13327
• Published
• 29
BlobCtrl: A Unified and Flexible Framework for Element-level Image
Generation and Editing
Paper
• 2503.13434
• Published
• 27
R1-VL: Learning to Reason with Multimodal Large Language Models via
Step-wise Group Relative Policy Optimization
Paper
• 2503.12937
• Published
• 30
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper
• 2503.12605
• Published
• 35
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published
• 32
Aligning Multimodal LLM with Human Preference: A Survey
Paper
• 2503.14504
• Published
• 26
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal
Control
Paper
• 2503.14492
• Published
• 20
TULIP: Towards Unified Language-Image Pretraining
Paper
• 2503.15485
• Published
• 49
φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time
Exploration and Exploitation
Paper
• 2503.13288
• Published
• 51
Temporal Regularization Makes Your Video Generator Stronger
Paper
• 2503.15417
• Published
• 22
VERIFY: A Benchmark of Visual Explanation and Reasoning for
Investigating Multimodal Reasoning Fidelity
Paper
• 2503.11557
• Published
• 22
Stop Overthinking: A Survey on Efficient Reasoning for Large Language
Models
Paper
• 2503.16419
• Published
• 77
Unleashing Vecset Diffusion Model for Fast Shape Generation
Paper
• 2503.16302
• Published
• 43
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Paper
• 2503.14487
• Published
• 28
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play
Visual Games with Keyboards and Mouse
Paper
• 2503.16365
• Published
• 41
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper
• 2503.16418
• Published
• 36
Ultra-Resolution Adaptation with Ease
Paper
• 2503.16322
• Published
• 13
M3: 3D-Spatial MultiModal Memory
Paper
• 2503.16413
• Published
• 15
See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language
Balance to Mitigate Dominant Modality Bias
Paper
• 2503.13834
• Published
• 5
Expert Race: A Flexible Routing Strategy for Scaling Diffusion
Transformer with Mixture of Experts
Paper
• 2503.16057
• Published
• 14
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Paper
• 2503.14476
• Published
• 144
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper
• 2503.14456
• Published
• 153
Paper
• 2503.14378
• Published
• 61
Reinforcement Learning for Reasoning in Small LLMs: What Works and What
Doesn't
Paper
• 2503.16219
• Published
• 52
Inside-Out: Hidden Factual Knowledge in LLMs
Paper
• 2503.15299
• Published
• 56
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published
• 50
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
• 2503.13891
• Published
• 8
MAPS: A Multi-Agent Framework Based on Big Seven Personality and
Socratic Guidance for Multimodal Scientific Problem Solving
Paper
• 2503.16905
• Published
• 54
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning
via Iterative Self-Improvement
Paper
• 2503.17352
• Published
• 24
Bridging Continuous and Discrete Tokens for Autoregressive Visual
Generation
Paper
• 2503.16430
• Published
• 34
When Preferences Diverge: Aligning Diffusion Models with Minority-Aware
Adaptive DPO
Paper
• 2503.16921
• Published
• 6
From Head to Tail: Towards Balanced Representation in Large
Vision-Language Models through Adaptive Data Calibration
Paper
• 2503.12821
• Published
• 10
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical
Problems
Paper
• 2503.16549
• Published
• 15
Why Do Multi-Agent LLM Systems Fail?
Paper
• 2503.13657
• Published
• 48
When Less is Enough: Adaptive Token Reduction for Efficient Image
Representation
Paper
• 2503.16660
• Published
• 72
Can Large Vision Language Models Read Maps Like a Human?
Paper
• 2503.14607
• Published
• 10
GAEA: A Geolocation Aware Conversational Model
Paper
• 2503.16423
• Published
• 6
I Have Covered All the Bases Here: Interpreting Reasoning Features in
Large Language Models via Sparse Autoencoders
Paper
• 2503.18878
• Published
• 119
Video-T1: Test-Time Scaling for Video Generation
Paper
• 2503.18942
• Published
• 90
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for
Open Base Models in the Wild
Paper
• 2503.18892
• Published
• 31
Aether: Geometric-Aware Unified World Modeling
Paper
• 2503.18945
• Published
• 28
Judge Anything: MLLM as a Judge Across Any Modality
Paper
• 2503.17489
• Published
• 23
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published
• 20
Mind with Eyes: from Language Reasoning to Multimodal Reasoning
Paper
• 2503.18071
• Published
• 3
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
• 2503.19622
• Published
• 31
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper
• 2503.18931
• Published
• 30
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Paper
• 2503.19325
• Published
• 73
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection
with Artifact Explanation
Paper
• 2503.14905
• Published
• 20
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only
Training For Human-Centered Decision Making
Paper
• 2503.16965
• Published
• 4
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
• 2503.19990
• Published
• 35
Dita: Scaling Diffusion Transformer for Generalist
Vision-Language-Action Policy
Paper
• 2503.19757
• Published
• 51
GenHancer: Imperfect Generative Models are Secretly Strong
Vision-Centric Enhancers
Paper
• 2503.19480
• Published
• 16
Qwen2.5-Omni Technical Report
Paper
• 2503.20215
• Published
• 170
Wan: Open and Advanced Large-Scale Video Generative Models
Paper
• 2503.20314
• Published
• 59
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Paper
• 2503.20201
• Published
• 48
Beyond Words: Advancing Long-Text Image Generation via Multimodal
Autoregressive Models
Paper
• 2503.20198
• Published
• 4
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
• 2503.21776
• Published
• 79
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
• 2503.21620
• Published
• 62
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
• 2503.21696
• Published
• 23
A Survey of Efficient Reasoning for Large Reasoning Models: Language,
Multimodality, and Beyond
Paper
• 2503.21614
• Published
• 43
OThink-MR1: Stimulating multimodal generalized reasoning capabilities
via dynamic reinforcement learning
Paper
• 2503.16081
• Published
• 28
Your ViT is Secretly an Image Segmentation Model
Paper
• 2503.19108
• Published
• 25
On Large Multimodal Models as Open-World Image Classifiers
Paper
• 2503.21851
• Published
• 5
TextCrafter: Accurately Rendering Multiple Texts in Complex Visual
Scenes
Paper
• 2503.23461
• Published
• 94
Any2Caption: Interpreting Any Condition to Caption for Controllable Video
Generation
Paper
• 2503.24379
• Published
• 76
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
• 2503.24376
• Published
• 38
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal
LLMs on Academic Resources
Paper
• 2504.00595
• Published
• 37
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on
Elementary School-Level Reasoning Problems?
Paper
• 2504.00509
• Published
• 24
MoCha: Towards Movie-Grade Talking Character Synthesis
Paper
• 2503.23307
• Published
• 139
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement
Learning on the Base Model
Paper
• 2503.24290
• Published
• 62
Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Paper
• 2503.22655
• Published
• 39
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through
Task Tokenization
Paper
• 2503.19901
• Published
• 41
Expanding RL with Verifiable Rewards Across Diverse Domains
Paper
• 2503.23829
• Published
• 24
Paper
• 2504.00927
• Published
• 56
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming
Video Contexts
Paper
• 2503.22952
• Published
• 17
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Paper
• 2504.00557
• Published
• 15
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Paper
• 2504.00072
• Published
• 6
MergeVQ: A Unified Framework for Visual Generation and Representation
with Disentangled Token Merging and Quantization
Paper
• 2504.00999
• Published
• 95
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper
• 2504.00883
• Published
• 67
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation
with Hybrid Guidance
Paper
• 2504.01724
• Published
• 68
AnimeGamer: Infinite Anime Life Simulation with Next Game State
Prediction
Paper
• 2504.01014
• Published
• 70
Towards Physically Plausible Video Generation via VLM Planning
Paper
• 2503.23368
• Published
• 40
Understanding R1-Zero-Like Training: A Critical Perspective
Paper
• 2503.20783
• Published
• 59
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and
Diffusion Refinement
Paper
• 2504.01934
• Published
• 22
Articulated Kinematics Distillation from Video Diffusion Models
Paper
• 2504.01204
• Published
• 23
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to
Gaussian Noise in Perturbation-based Attacks
Paper
• 2504.01308
• Published
• 14
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Paper
• 2503.23573
• Published
• 12
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal
Representations
Paper
• 2503.18817
• Published
• 3
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual
Editing
Paper
• 2504.02826
• Published
• 68
WikiVideo: Article Generation from Multiple Videos
Paper
• 2504.00939
• Published
• 37
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
• 2504.02782
• Published
• 57
Inference-Time Scaling for Generalist Reward Modeling
Paper
• 2504.02495
• Published
• 58
Rethinking RL Scaling for Vision Language Models: A Transparent,
From-Scratch Framework and Comprehensive Evaluation Scheme
Paper
• 2504.02587
• Published
• 32
ShortV: Efficient Multimodal Large Language Models by Freezing Visual
Tokens in Ineffective Layers
Paper
• 2504.00502
• Published
• 26
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via
Iterative Instruction Tuning and Reinforcement Learning
Paper
• 2504.02949
• Published
• 21
MME-Unify: A Comprehensive Benchmark for Unified Multimodal
Understanding and Generation Models
Paper
• 2504.03641
• Published
• 14
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Paper
• 2504.01328
• Published
• 7
URECA: Unique Region Caption Anything
Paper
• 2504.05305
• Published
• 35
Concept Lancet: Image Editing with Compositional Representation
Transplant
Paper
• 2504.02828
• Published
• 16
LiveVQA: Live Visual Knowledge Seeking
Paper
• 2504.05288
• Published
• 15
SmolVLM: Redefining small and efficient multimodal models
Paper
• 2504.05299
• Published
• 205
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(v1)
Paper
• 2504.03151
• Published
• 15
Tuning-Free Image Editing with Fidelity and Editability via Unified
Latent Diffusion Model
Paper
• 2504.05594
• Published
• 11
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
• 2504.05599
• Published
• 85
Rethinking Reflection in Pre-Training
Paper
• 2504.04022
• Published
• 80
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language
Models for Domain-Generalized Semantic Segmentation
Paper
• 2504.03193
• Published
• 4
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Paper
• 2504.06263
• Published
• 183
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric
Capabilities in Multimodal Large Language Models
Paper
• 2504.06148
• Published
• 13
OmniCaptioner: One Captioner to Rule Them All
Paper
• 2504.07089
• Published
• 20
Caption Anything in Video: Fine-grained Object-centric Captioning via
Spatiotemporal Multimodal Prompting
Paper
• 2504.05541
• Published
• 15
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
• 2504.06958
• Published
• 13
Paper
• 2504.07491
• Published
• 137
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Paper
• 2504.07128
• Published
• 87
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
• 2504.07956
• Published
• 46
VisualCloze: A Universal Image Generation Framework via Visual
In-Context Learning
Paper
• 2504.07960
• Published
• 50
MM-IFEngine: Towards Multimodal Instruction Following
Paper
• 2504.07957
• Published
• 35
Scaling Laws for Native Multimodal Models
Paper
• 2504.07951
• Published
• 30
Towards Visual Text Grounding of Multimodal Large Language Model
Paper
• 2504.04974
• Published
• 17
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Paper
• 2504.08685
• Published
• 130
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for
Autoregressive Image Generation
Paper
• 2504.08736
• Published
• 46
MineWorld: a Real-Time and Open-Source Interactive World Model on
Minecraft
Paper
• 2504.08388
• Published
• 42
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Paper
• 2504.07615
• Published
• 35
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models
with Reinforcement Learning
Paper
• 2504.08837
• Published
• 43
FUSION: Fully Integration of Vision-Language Representations for Deep
Cross-Modal Understanding
Paper
• 2504.09925
• Published
• 39
Have we unified image generation and understanding yet? An empirical
study of GPT-4o's image generation ability
Paper
• 2504.08003
• Published
• 49
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
Mavors: Multi-granularity Video Representation for Multimodal Large
Language Model
Paper
• 2504.10068
• Published
• 30
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
• 2504.09641
• Published
• 16
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper
• 2504.09130
• Published
• 12
Reasoning Models Can Be Effective Without Thinking
Paper
• 2504.09858
• Published
• 12
The Scalability of Simplicity: Empirical Analysis of Vision-Language
Learning with a Single Transformer
Paper
• 2504.10462
• Published
• 15
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper
• 2504.10465
• Published
• 27
Generate, but Verify: Reducing Hallucination in Vision-Language Models
with Retrospective Resampling
Paper
• 2504.13169
• Published
• 39
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference
Optimization for Large Video Models
Paper
• 2504.13122
• Published
• 20
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large
Vision-Language Models
Paper
• 2504.11468
• Published
• 30
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain
Knowledge
Paper
• 2504.10342
• Published
• 11
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Paper
• 2504.10443
• Published
• 3
Summarization of Multimodal Presentations with Vision-Language Models:
Study of the Effect of Modalities and Structure
Paper
• 2504.10049
• Published
• 2
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published
• 48
Perception Encoder: The best visual embeddings are not at the output of
the network
Paper
• 2504.13181
• Published
• 35
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Paper
• 2504.13055
• Published
• 19
DMM: Building a Versatile Image Generation Model via Distillation-Based
Model Merging
Paper
• 2504.12364
• Published
• 22
PerceptionLM: Open-Access Data and Models for Detailed Visual
Understanding
Paper
• 2504.13180
• Published
• 20
Could Thinking Multilingually Empower LLM Reasoning?
Paper
• 2504.11833
• Published
• 29
Does Reinforcement Learning Really Incentivize Reasoning Capacity in
LLMs Beyond the Base Model?
Paper
• 2504.13837
• Published
• 139
UFO2: The Desktop AgentOS
Paper
• 2504.14603
• Published
• 29
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration
Benchmark
Paper
• 2504.13805
• Published
• 11
Vidi: Large Multimodal Models for Video Understanding and Editing
Paper
• 2504.15681
• Published
• 14
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper
• 2504.16030
• Published
• 36
Seeing from Another Perspective: Evaluating Multi-View Understanding in
MLLMs
Paper
• 2504.15280
• Published
• 25
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to
Deliberative Reasoners
Paper
• 2504.14239
• Published
• 14
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning
in Multimodal LLMs
Paper
• 2504.15415
• Published
• 23
Describe Anything: Detailed Localized Image and Video Captioning
Paper
• 2504.16072
• Published
• 64
Eagle 2.5: Boosting Long-Context Post-Training for Frontier
Vision-Language Models
Paper
• 2504.15271
• Published
• 67
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View
Synthesis
Paper
• 2504.13157
• Published
• 20
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls
for Video Generation
Paper
• 2504.14899
• Published
• 20
An LMM for Efficient Video Understanding via Reinforced Compression of
Video Cubes
Paper
• 2504.15270
• Published
• 9
BookWorld: From Novels to Interactive Agent Societies for Creative Story
Generation
Paper
• 2504.14538
• Published
• 30
Personalized Text-to-Image Generation with Auto-Regressive Models
Paper
• 2504.13162
• Published
• 18
From Reflection to Perfection: Scaling Inference-Time Optimization for
Text-to-Image Diffusion Models via Reflection Tuning
Paper
• 2504.16080
• Published
• 15
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Paper
• 2504.16082
• Published
• 5
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
• 2504.15279
• Published
• 78
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via
Triplet ID Group Learning
Paper
• 2504.14509
• Published
• 53
Trillion 7B Technical Report
Paper
• 2504.15431
• Published
• 38
I-Con: A Unifying Framework for Representation Learning
Paper
• 2504.16929
• Published
• 30
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in
Large Language Models
Paper
• 2504.16074
• Published
• 36
DreamO: A Unified Framework for Image Customization
Paper
• 2504.16915
• Published
• 24
Progressive Language-guided Visual Learning for Multi-Task Visual
Grounding
Paper
• 2504.16145
• Published
• 2
Paper2Code: Automating Code Generation from Scientific Papers in Machine
Learning
Paper
• 2504.17192
• Published
• 123
Step1X-Edit: A Practical Framework for General Image Editing
Paper
• 2504.17761
• Published
• 92
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image
Generation
Paper
• 2504.17502
• Published
• 55
Breaking the Modality Barrier: Universal Embedding Learning with
Multimodal LLMs
Paper
• 2504.17432
• Published
• 40
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery
Simulation
Paper
• 2504.17207
• Published
• 30
Token-Shuffle: Towards High-Resolution Image Generation with
Autoregressive Models
Paper
• 2504.17789
• Published
• 23
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Paper
• 2504.17040
• Published
• 13
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper
• 2504.16064
• Published
• 14
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming
Videos
Paper
• 2504.17343
• Published
• 13
ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting
Paper
• 2504.15921
• Published
• 7
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Paper
• 2504.16656
• Published
• 57
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published
• 155
Can Large Language Models Help Multimodal Language Analysis? MMLA: A
Comprehensive Benchmark
Paper
• 2504.16427
• Published
• 18
DC-SAM: In-Context Segment Anything in Images and Videos via Dual
Consistency
Paper
• 2504.12080
• Published
• 8
Contrastive Localized Language-Image Pre-Training
Paper
• 2410.02746
• Published
• 37
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified
Multiplet Upcycling
Paper
• 2409.19291
• Published
• 21
GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning
LLMs
Paper
• 2410.03645
• Published
• 3
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and
Prospects
Paper
• 2504.19838
• Published
• 23
RepText: Rendering Visual Text via Replicating
Paper
• 2504.19724
• Published
• 31
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual
Dependency
Paper
• 2504.18589
• Published
• 13
Clinical knowledge in LLMs does not translate to human interactions
Paper
• 2504.18919
• Published
• 26
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
Paper
• 2504.19162
• Published
• 18
MMInference: Accelerating Pre-filling for Long-Context VLMs via
Modality-Aware Permutation Sparse Attention
Paper
• 2504.16083
• Published
• 8
NORA: A Small Open-Sourced Generalist Vision Language Action Model for
Embodied Tasks
Paper
• 2504.19854
• Published
• 7
Reinforcement Learning for Reasoning in Large Language Models with One
Training Example
Paper
• 2504.20571
• Published
• 98
Paper
• 2504.20879
• Published
• 72
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with
Diverse Modalities and Granularities
Paper
• 2504.20734
• Published
• 62
YoChameleon: Personalized Vision and Language Generation
Paper
• 2504.20998
• Published
• 12
X-Fusion: Introducing New Modality to Frozen Large Language Models
Paper
• 2504.20996
• Published
• 13
A Review of 3D Object Detection with Vision-Language Models
Paper
• 2504.18738
• Published
• 2
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language
Models in Math
Paper
• 2504.21233
• Published
• 49
100 Days After DeepSeek-R1: A Survey on Replication Studies and More
Directions for Reasoning Language Models
Paper
• 2505.00551
• Published
• 36
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Paper
• 2504.21850
• Published
• 27
ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D
Physics Modeling for Complex Motion and Interaction
Paper
• 2504.21855
• Published
• 13
A Survey of Interactive Generative Video
Paper
• 2504.21853
• Published
• 46
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level
and Token-level CoT
Paper
• 2505.00703
• Published
• 44
PixelHacker: Image Inpainting with Structural and Semantic Consistency
Paper
• 2504.20438
• Published
• 44
Improving Editability in Image Generation with Layer-wise Memory
Paper
• 2505.01079
• Published
• 29
Voila: Voice-Language Foundation Models for Real-Time Autonomous
Interaction and Voice Role-Play
Paper
• 2505.02707
• Published
• 85
RM-R1: Reward Modeling as Reasoning
Paper
• 2505.02387
• Published
• 81
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement
Learning
Paper
• 2505.02835
• Published
• 28
Ming-Lite-Uni: Advancements in Unified Architecture for Natural
Multimodal Interaction
Paper
• 2505.02471
• Published
• 15
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based
Image Editing
Paper
• 2505.02370
• Published
• 14
Agentic Reasoning and Tool Integration for LLMs via Reinforcement
Learning
Paper
• 2505.01441
• Published
• 39
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive
Streaming Speech Synthesis
Paper
• 2505.02625
• Published
• 23
HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene
Generation
Paper
• 2504.21650
• Published
• 16
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement
Fine-Tuning
Paper
• 2505.03318
• Published
• 92
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Paper
• 2505.03570
• Published
• 8
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue
Resolution
Paper
• 2505.04606
• Published
• 9
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision
Encoders for Multimodal Learning
Paper
• 2505.04601
• Published
• 29
Beyond Recognition: Evaluating Visual Perspective Taking in Vision
Language Models
Paper
• 2505.03821
• Published
• 24
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video
Generation
Paper
• 2505.04512
• Published
• 36
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Paper
• 2505.04588
• Published
• 65
Unified Multimodal Understanding and Generation Models: Advances,
Challenges, and Opportunities
Paper
• 2505.02567
• Published
• 80
Scenethesis: A Language and Vision Agentic Framework for 3D Scene
Generation
Paper
• 2505.02836
• Published
• 8
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient
Large Speech-Language Model
Paper
• 2505.03739
• Published
• 9
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
• 2505.04921
• Published
• 186
On Path to Multimodal Generalist: General-Level and General-Bench
Paper
• 2505.04620
• Published
• 82
Flow-GRPO: Training Flow Matching Models via Online RL
Paper
• 2505.05470
• Published
• 88
FG-CLIP: Fine-Grained Visual and Textual Alignment
Paper
• 2505.05071
• Published
• 18
X-Reasoner: Towards Generalizable Reasoning Across Modalities and
Domains
Paper
• 2505.03981
• Published
• 15
Vision-Language-Action Models: Concepts, Progress, Applications and
Challenges
Paper
• 2505.04769
• Published
• 10
Bielik v3 Small: Technical Report
Paper
• 2505.02550
• Published
• 68
Bielik 11B v2 Technical Report
Paper
• 2505.02410
• Published
• 54
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published
• 155
Unified Continuous Generative Models
Paper
• 2505.07447
• Published
• 42
DanceGRPO: Unleashing GRPO on Visual Generation
Paper
• 2505.07818
• Published
• 32
Skywork-VL Reward: An Effective Reward Model for Multimodal
Understanding and Reasoning
Paper
• 2505.07263
• Published
• 30
H³DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
Paper
• 2505.07819
• Published
• 5
MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills
Paper
• 2505.06176
• Published
• 12
DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for
Dynamic Reranking in Retrieval-Augmented Generation
Paper
• 2505.07233
• Published
• 8
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable
Speaker Encoder
Paper
• 2505.07916
• Published
• 134
Fast Text-to-Audio Generation with Adversarial Post-Training
Paper
• 2505.08175
• Published
• 25
Bring Reason to Vision: Understanding Perception and Reasoning through
Model Merging
Paper
• 2505.05464
• Published
• 11
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Paper
• 2505.08751
• Published
• 13
SkillFormer: Unified Multi-View Video Understanding for Proficiency
Estimation
Paper
• 2505.08665
• Published
• 5
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,
Training and Dataset
Paper
• 2505.09568
• Published
• 99
Insights into DeepSeek-V3: Scaling Challenges and Reflections on
Hardware for AI Architectures
Paper
• 2505.09343
• Published
• 76
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal
Mathematical Reasoning
Paper
• 2505.10557
• Published
• 47
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Paper
• 2505.04410
• Published
• 44
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
Paper
• 2505.09558
• Published
• 10
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Paper
• 2505.09439
• Published
• 10
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large
Video Language Models
Paper
• 2505.08455
• Published
• 5
Understanding and Mitigating Toxicity in Image-Text Pretraining
Datasets: A Case Study on LLaVA
Paper
• 2505.06356
• Published
• 3
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large
Reasoning Models
Paper
• 2505.10554
• Published
• 120
OpenThinkIMG: Learning to Think with Images via Visual Tool
Reinforcement Learning
Paper
• 2505.08617
• Published
• 42
WorldPM: Scaling Human Preference Modeling
Paper
• 2505.10527
• Published
• 34
End-to-End Vision Tokenizer Tuning
Paper
• 2505.10562
• Published
• 22
Exploring the Deep Fusion of Large Language Models and Diffusion
Transformers for Text-to-Image Synthesis
Paper
• 2505.10046
• Published
• 9
AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Paper
• 2505.09926
• Published
• 6
Paper
• 2505.09388
• Published
• 334
MMLongBench: Benchmarking Long-Context Vision-Language Models
Effectively and Thoroughly
Paper
• 2505.10610
• Published
• 55
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Paper
• 2505.11049
• Published
• 61
Visual Planning: Let's Think Only with Images
Paper
• 2505.11409
• Published
• 57
Simple Semi-supervised Knowledge Distillation from Vision-Language
Models via Dual-Head Optimization
Paper
• 2505.07675
• Published
• 21
Chain-of-Model Learning for Language Model
Paper
• 2505.11820
• Published
• 121
AdaptThink: Reasoning Models Can Learn When to Think
Paper
• 2505.13417
• Published
• 83
Model Merging in Pre-training of Large Language Models
Paper
• 2505.12082
• Published
• 40
Through the Looking Glass: Common Sense Consistency Evaluation of Weird
Images
Paper
• 2505.07704
• Published
• 29
Faster Video Diffusion with Trainable Sparse Attention
Paper
• 2505.13389
• Published
• 38
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and
Vision-Language Models
Paper
• 2505.13180
• Published
• 13
VisionReasoner: Unified Visual Perception and Reasoning via
Reinforcement Learning
Paper
• 2505.12081
• Published
• 18
R3: Robust Rubric-Agnostic Reward Models
Paper
• 2505.13388
• Published
• 11
Efficient Speech Language Modeling via Energy Distance in Continuous
Latent Space
Paper
• 2505.13181
• Published
• 9
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation
through Low-Rank Clone
Paper
• 2505.12781
• Published
• 2
Emerging Properties in Unified Multimodal Pretraining
Paper
• 2505.14683
• Published
• 133
Paper
• 2505.14674
• Published
• 37
Visual Agentic Reinforcement Fine-Tuning
Paper
• 2505.14246
• Published
• 32
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via
Reinforcement Learning to Rank
Paper
• 2505.14460
• Published
• 33
Think Only When You Need with Large Hybrid-Reasoning Models
Paper
• 2505.14631
• Published
• 20
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with
Reinforcement Learning
Paper
• 2505.14677
• Published
• 15
Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
Paper
• 2505.14135
• Published
• 16
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Paper
• 2505.14640
• Published
• 16
Two Experts Are All You Need for Steering Thinking: Reinforcing
Cognitive Effort in MoE Reasoning Models Without Additional Training
Paper
• 2505.14681
• Published
• 10
Visual Instruction Bottleneck Tuning
Paper
• 2505.13946
• Published
• 10
Not All Correct Answers Are Equal: Why Your Distillation Source Matters
Paper
• 2505.14464
• Published
• 10
Lessons from Defending Gemini Against Indirect Prompt Injections
Paper
• 2505.14534
• Published
• 8
The Hallucination Tax of Reinforcement Finetuning
Paper
• 2505.13988
• Published
• 8
Incorporating brain-inspired mechanisms for multimodal learning in
artificial intelligence
Paper
• 2505.10176
• Published
• 3
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Paper
• 2505.15277
• Published
• 104
MMaDA: Multimodal Large Diffusion Language Models
Paper
• 2505.15809
• Published
• 98
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement
Learning
Paper
• 2505.14231
• Published
• 53
Diffusion vs. Autoregressive Language Models: A Text Embedding
Perspective
Paper
• 2505.15045
• Published
• 55
Vid2World: Crafting Video Diffusion Models to Interactive World Models
Paper
• 2505.14357
• Published
• 27
When to Continue Thinking: Adaptive Thinking Mode Switching for
Efficient Reasoning
Paper
• 2505.15400
• Published
• 23
lmgame-Bench: How Good are LLMs at Playing Games?
Paper
• 2505.15146
• Published
• 20
IA-T2I: Internet-Augmented Text-to-Image Generation
Paper
• 2505.15779
• Published
• 14
Deliberation on Priors: Trustworthy Reasoning of Large Language Models
on Knowledge Graphs
Paper
• 2505.15210
• Published
• 18
RLVR-World: Training World Models with Reinforcement Learning
Paper
• 2505.13934
• Published
• 16
ConvSearch-R1: Enhancing Query Reformulation for Conversational Search
with Reasoning via Reinforcement Learning
Paper
• 2505.15776
• Published
• 11
HumaniBench: A Human-Centric Framework for Large Multimodal Models
Evaluation
Paper
• 2505.11454
• Published
• 5
QuickVideo: Real-Time Long Video Understanding with System Algorithm
Co-Design
Paper
• 2505.16175
• Published
• 42
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Paper
• 2505.16933
• Published
• 34
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation
with Reinforcement Learning
Paper
• 2505.17022
• Published
• 27
Risk-Averse Reinforcement Learning with Itakura-Saito Loss
Paper
• 2505.16925
• Published
• 26
Understanding Generative AI Capabilities in Everyday Image Editing Tasks
Paper
• 2505.16181
• Published
• 24
Training-Free Efficient Video Generation via Dynamic Token Carving
Paper
• 2505.16864
• Published
• 24
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel
Decoding
Paper
• 2505.16990
• Published
• 22
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game
Quality Assurance
Paper
• 2505.15952
• Published
• 20
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Paper
• 2505.17018
• Published
• 15
Backdoor Cleaning without External Guidance in MLLM Fine-tuning
Paper
• 2505.16916
• Published
• 17
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement
Learning
Paper
• 2505.16421
• Published
• 19
LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Paper
• 2505.16839
• Published
• 13
GRIT: Teaching MLLMs to Think with Images
Paper
• 2505.15879
• Published
• 13
Think or Not? Selective Reasoning via Reinforcement Learning for
Vision-Language Models
Paper
• 2505.16854
• Published
• 11
OViP: Online Vision-Language Preference Learning
Paper
• 2505.15963
• Published
• 9
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal
Large Language Models
Paper
• 2505.17015
• Published
• 9
VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced
Multimodal Chain-of-Thought
Paper
• 2505.16192
• Published
• 12
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot
Manipulation Datasets
Paper
• 2505.15517
• Published
• 4
How Do Large Vision-Language Models See Text in Image? Unveiling the
Distinctive Role of OCR Heads
Paper
• 2505.15865
• Published
• 5
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture
Understanding
Paper
• 2505.14462
• Published
• 4
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper
• 2505.18129
• Published
• 62
Teaching with Lies: Curriculum DPO on Synthetic Negatives for
Hallucination Detection
Paper
• 2505.17558
• Published
• 15
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large
Language Models
Paper
• 2505.16211
• Published
• 18
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark
Study
Paper
• 2505.15389
• Published
• 8
Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal
Large Language Models
Paper
• 2505.18536
• Published
• 18
QwenLong-L1: Towards Long-Context Large Reasoning Models with
Reinforcement Learning
Paper
• 2505.17667
• Published
• 88
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in
Reasoning Models
Paper
• 2505.17225
• Published
• 64
QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization
Paper
• 2505.18092
• Published
• 43
RBench-V: A Primary Assessment for Visual Reasoning Models with
Multi-modal Outputs
Paper
• 2505.16770
• Published
• 12
Interactive Post-Training for Vision-Language-Action Models
Paper
• 2505.17016
• Published
• 6
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language
Model via Reinforcement Learning
Paper
• 2505.13426
• Published
• 13
Error Typing for Smarter Rewards: Improving Process Reward Models with
Error-Aware Hierarchical Supervision
Paper
• 2505.19706
• Published
• 3
RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation
via Reinforcement Learning
Paper
• 2505.17540
• Published
• 7
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Paper
• 2505.19147
• Published
• 145
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual
Reasoning from Transit Maps
Paper
• 2505.18675
• Published
• 26
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System
Collaboration
Paper
• 2505.20256
• Published
• 19
REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
Paper
• 2505.20046
• Published
• 18
Hard Negative Contrastive Learning for Fine-Grained Geometric
Understanding in Large Multimodal Models
Paper
• 2505.20152
• Published
• 11
Interleaved Reasoning for Large Language Models via Reinforcement
Learning
Paper
• 2505.19640
• Published
• 15
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer
Interaction
Paper
• 2505.10887
• Published
• 10
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Paper
• 2505.15804
• Published
• 10
Jodi: Unification of Visual Generation and Understanding via Joint
Modeling
Paper
• 2505.19084
• Published
• 20
Towards Holistic Evaluation of Large Audio-Language Models: A
Comprehensive Survey
Paper
• 2505.15957
• Published
• 3
Seeing is Believing, but How Much? A Comprehensive Analysis of
Verbalized Calibration in Vision-Language Models
Paper
• 2505.20236
• Published
• 3
Textual Steering Vectors Can Improve Visual Understanding in Multimodal
Large Language Models
Paper
• 2505.14071
• Published
• 1
Paper2Poster: Towards Multimodal Poster Automation from Scientific
Papers
Paper
• 2505.21497
• Published
• 109
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Paper
• 2505.21374
• Published
• 28
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in
Video Scenarios
Paper
• 2505.21333
• Published
• 38
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Paper
• 2505.21327
• Published
• 83
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
Paper
• 2505.16459
• Published
• 45
SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics
Reasoning
Paper
• 2505.19099
• Published
• 7
Active-O3: Empowering Multimodal Large Language Models with Active
Perception via GRPO
Paper
• 2505.21457
• Published
• 16
UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based
Mobile GUI Agents
Paper
• 2505.21496
• Published
• 38
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in
Vision-Language Models
Paper
• 2505.21500
• Published
• 13
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic
Scientific Workflows
Paper
• 2505.19897
• Published
• 104
MLLMs are Deeply Affected by Modality Bias
Paper
• 2505.18657
• Published
• 5
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper
• 2505.22453
• Published
• 46
Advancing Multimodal Reasoning via Reinforcement Learning with Cold
Start
Paper
• 2505.22334
• Published
• 36
The Entropy Mechanism of Reinforcement Learning for Reasoning Language
Models
Paper
• 2505.22617
• Published
• 131
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large
Model Token Routing
Paper
• 2505.21600
• Published
• 71
Skywork Open Reasoner 1 Technical Report
Paper
• 2505.22312
• Published
• 54
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Paper
• 2505.22651
• Published
• 48
Fostering Video Reasoning via Next-Event Prediction
Paper
• 2505.22457
• Published
• 29
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Paper
• 2505.22019
• Published
• 11
RICO: Improving Accuracy and Completeness in Image Recaptioning via
Visual Reconstruction
Paper
• 2505.22613
• Published
• 9
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Paper
• 2505.22664
• Published
• 7
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal
Manga Understanding
Paper
• 2505.20298
• Published
• 9
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial
Intelligence
Paper
• 2505.23747
• Published
• 69
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Paper
• 2505.23762
• Published
• 45
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in
Learning to Reason
Paper
• 2505.22653
• Published
• 43
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC
Videos
Paper
• 2505.23693
• Published
• 53
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video
Reasoning?
Paper
• 2505.23359
• Published
• 38
To Trust Or Not To Trust Your Vision-Language Model's Prediction
Paper
• 2505.23745
• Published
• 4
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of
Pre-trained Multimodal Representation via Text Updates
Paper
• 2505.22943
• Published
• 3
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich
Document Understanding
Paper
• 2505.17330
• Published
• 22
HoPE: Hybrid of Position Embedding for Length Generalization in
Vision-Language Models
Paper
• 2505.20444
• Published
• 5
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement
Learning
Paper
• 2505.22914
• Published
• 37
Are Reasoning Models More Prone to Hallucination?
Paper
• 2505.23646
• Published
• 24
Multi-Domain Explainability of Preferences
Paper
• 2505.20088
• Published
• 20
REOrdering Patches Improves Vision Models
Paper
• 2505.23751
• Published
• 15
Re-ttention: Ultra Sparse Visual Generation via Attention Statistical
Reshape
Paper
• 2505.22918
• Published
• 6
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
Paper
• 2505.23759
• Published
• 5
A Graph Perspective to Probe Structural Patterns of Knowledge in Large
Language Models
Paper
• 2505.19286
• Published
• 3
Grounded Reinforcement Learning for Visual Reasoning
Paper
• 2505.23678
• Published
• 2
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Paper
• 2505.24867
• Published
• 82
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Paper
• 2505.24863
• Published
• 97
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in
Large Language Models
Paper
• 2505.24864
• Published
• 143
Large Language Models for Data Synthesis
Paper
• 2505.14752
• Published
• 49
Don't Look Only Once: Towards Multimodal Interactive Reasoning with
Selective Visual Revisitation
Paper
• 2505.18842
• Published
• 36
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Paper
• 2505.24862
• Published
• 30
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Paper
• 2505.24025
• Published
• 27
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement
Learning
Paper
• 2505.24871
• Published
• 23
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and
Benchmarking Multimodal LLM Agents
Paper
• 2505.24878
• Published
• 23
Vision Language Models are Biased
Paper
• 2505.23941
• Published
• 23
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
• 2505.21523
• Published
• 13
Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual
Large Language Models
Paper
• 2505.20873
• Published
• 9
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT
and RL
Paper
• 2505.24875
• Published
• 10
un²CLIP: Improving CLIP's Visual Detail Capturing Ability via
Inverting unCLIP
Paper
• 2505.24517
• Published
• 5
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient
Robotics
Paper
• 2506.01844
• Published
• 151
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion
Models
Paper
• 2506.00996
• Published
• 40
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with
Jigsaw Puzzles
Paper
• 2505.23590
• Published
• 25
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon
Embodied Tasks
Paper
• 2506.00411
• Published
• 31
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware
Reinforcement Learning
Paper
• 2506.01713
• Published
• 48
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation
with Large Multimodal Models
Paper
• 2506.01667
• Published
• 21
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
Paper
• 2505.23977
• Published
• 10
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision
Geometry Priors
Paper
• 2505.24625
• Published
• 9
OmniResponse: Online Multimodal Conversational Response Generation in
Dyadic Interactions
Paper
• 2505.21724
• Published
• 5
Aligning VLM Assistants with Personalized Situated Cognition
Paper
• 2506.00930
• Published
• 2
MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal
LLMs
Paper
• 2506.01674
• Published
• 28
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Paper
• 2506.02096
• Published
• 52
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in
Multi-Agent Environments
Paper
• 2506.02387
• Published
• 58
UniWorld: High-Resolution Semantic Encoders for Unified Visual
Understanding and Generation
Paper
• 2506.03147
• Published
• 58
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning
Capabilities of VLMs
Paper
• 2505.24120
• Published
• 49
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for
Vision Language Models
Paper
• 2506.03135
• Published
• 40
Visual Embodied Brain: Let Multimodal Large Language Models See, Think,
and Control in Spaces
Paper
• 2506.00123
• Published
• 35
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Paper
• 2506.03143
• Published
• 53
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Paper
• 2506.03096
• Published
• 4
TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning
for Enhancing LLMs' Social Intelligence
Paper
• 2505.24500
• Published
• 12
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged
Reinforcement Learning
Paper
• 2506.04207
• Published
• 48
Paper
• 2506.03569
• Published
• 80
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in
Videos
Paper
• 2506.04141
• Published
• 29
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video
Reasoning
Paper
• 2506.03525
• Published
• 6
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language
Models for Robotics
Paper
• 2506.04308
• Published
• 43
Qwen3 Embedding: Advancing Text Embedding and Reranking Through
Foundation Models
Paper
• 2506.05176
• Published
• 79
EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an
Egocentric World?
Paper
• 2506.05287
• Published
• 14
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Paper
• 2506.05344
• Published
• 17
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal
Contextual Fusion
Paper
• 2506.01111
• Published
• 31
Is Extending Modality The Right Path Towards Omni-Modality?
Paper
• 2506.01872
• Published
• 24
Reinforcement Pre-Training
Paper
• 2506.08007
• Published
• 263
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal
Learning
Paper
• 2506.06205
• Published
• 30
Image Reconstruction as a Tool for Feature Analysis
Paper
• 2506.07803
• Published
• 29
Bootstrapping World Models from Dynamics Models in Multimodal Foundation
Models
Paper
• 2506.06006
• Published
• 14
Vision Transformers Don't Need Trained Registers
Paper
• 2506.08010
• Published
• 22
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical
Understanding and Reasoning
Paper
• 2506.07044
• Published
• 113
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for
Parameter-Efficient Video-Text Retrieval
Paper
• 2506.08887
• Published
• 4
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand
Better
Paper
• 2506.09040
• Published
• 34
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Paper
• 2506.09113
• Published
• 107
MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal
Large Language Models
Paper
• 2506.04688
• Published
• 3
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
Paper
• 2506.06395
• Published
• 133
Hidden in plain sight: VLMs overlook their visual representations
Paper
• 2506.08008
• Published
• 7
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math
Reasoning
Paper
• 2506.09736
• Published
• 9
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Paper
• 2506.10857
• Published
• 30
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable
Task Experts
Paper
• 2506.10357
• Published
• 21
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Paper
• 2506.09937
• Published
• 9
Paper
• 2506.10910
• Published
• 66
Ming-Omni: A Unified Multimodal Model for Perception and Generation
Paper
• 2506.09344
• Published
• 31
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Paper
• 2506.10821
• Published
• 19
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal
Gaussian Splatting
Paper
• 2506.09952
• Published
• 6
Aligned Novel View Image and Geometry Synthesis via Cross-modal
Attention Instillation
Paper
• 2506.11924
• Published
• 34
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual
Perception in VLMs
Paper
• 2506.10128
• Published
• 22
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning
Attention
Paper
• 2506.13585
• Published
• 273
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning
with Vision-Language Models
Paper
• 2506.07961
• Published
• 11
Discrete Diffusion in Large Language and Multimodal Models: A Survey
Paper
• 2506.13759
• Published
• 43
Stream-Omni: Simultaneous Multimodal Interactions with Large
Language-Vision-Speech Model
Paper
• 2506.13642
• Published
• 27
VGR: Visual Grounded Reasoning
Paper
• 2506.11991
• Published
• 20
AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
Paper
• 2506.06962
• Published
• 28
DoTA-RAG: Dynamic of Thought Aggregation RAG
Paper
• 2506.12571
• Published
• 50
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
Paper
• 2506.13654
• Published
• 43
Scientists' First Exam: Probing Cognitive Abilities of MLLM via
Perception, Understanding, and Reasoning
Paper
• 2506.10521
• Published
• 73
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Paper
• 2506.14429
• Published
• 44
EfficientVLA: Training-Free Acceleration and Compression for
Vision-Language-Action Models
Paper
• 2506.10100
• Published
• 10
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Paper
• 2506.05336
• Published
• 9
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark
for Financial LLM Evaluation
Paper
• 2506.14028
• Published
• 93
Sekai: A Video Dataset towards World Exploration
Paper
• 2506.15675
• Published
• 66
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning
in LLMs
Paper
• 2506.15211
• Published
• 39
GenRecal: Generation after Recalibration from Large to Small
Vision-Language Models
Paper
• 2506.15681
• Published
• 42
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim
Verification
Paper
• 2506.15569
• Published
• 12
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal
Large Language Models
Paper
• 2506.14824
• Published
• 8
CoMemo: LVLMs Need Image Context with Image Memory
Paper
• 2506.06279
• Published
• 8
Show-o2: Improved Native Unified Multimodal Models
Paper
• 2506.15564
• Published
• 29
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal
Document Understanding
Paper
• 2506.16035
• Published
• 89
PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and
Quantized Attention in Visual Generation Models
Paper
• 2506.16054
• Published
• 60
Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with
Hybrid History Condition
Paper
• 2506.17201
• Published
• 57
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual
Tokens
Paper
• 2506.17218
• Published
• 29
UniFork: Exploring Modality Alignment for Unified Multimodal
Understanding and Generation
Paper
• 2506.17202
• Published
• 10
Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with
Production-Ready PBR Material
Paper
• 2506.15442
• Published
• 15
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video
Understanding
Paper
• 2506.15745
• Published
• 14
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert
Aggregation
Paper
• 2506.17113
• Published
• 5
OmniGen2: Exploration to Advanced Multimodal Generation
Paper
• 2506.18871
• Published
• 78
Vision as a Dialect: Unifying Visual Understanding and Generation via
Text-Aligned Representations
Paper
• 2506.18898
• Published
• 34
From Intention to Execution: Probing the Generalization Boundaries of
Vision-Language-Action Models
Paper
• 2506.09930
• Published
• 8
USAD: Universal Speech and Audio Representation via Distillation
Paper
• 2506.18843
• Published
• 12
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality
Debiasing
Paper
• 2506.19848
• Published
• 26
Unified Vision-Language-Action Model
Paper
• 2506.19850
• Published
• 27
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal
Reasoning
Paper
• 2506.16141
• Published
• 27
Phantom-Data: Towards a General Subject-Consistent Video Generation
Dataset
Paper
• 2506.18851
• Published
• 30
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image
Generation
Paper
• 2506.18095
• Published
• 66
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Paper
• 2506.20512
• Published
• 47
MMSearch-R1: Incentivizing LMMs to Search
Paper
• 2506.20670
• Published
• 64
WorldVLA: Towards Autoregressive Action World Model
Paper
• 2506.21539
• Published
• 40
FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient
Multi-turn Image Editing
Paper
• 2506.20911
• Published
• 41
LLaVA-Scissor: Token Compression with Semantic Connected Components for
Video LLMs
Paper
• 2506.21862
• Published
• 36
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Paper
• 2506.21656
• Published
• 16
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper
• 2506.22434
• Published
• 10
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
Paper
• 2506.17450
• Published
• 64
ShotBench: Expert-Level Cinematic Understanding in Vision-Language
Models
Paper
• 2506.21356
• Published
• 22
Audio-FLAN: A Preliminary Release
Paper
• 2502.16584
• Published
• 36
Do Vision-Language Models Have Internal World Models? Towards an Atomic
Evaluation
Paper
• 2506.21876
• Published
• 28
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in
Inference-time Scaling?
Paper
• 2506.17417
• Published
• 11
Paper
• 2506.23044
• Published
• 61
Listener-Rewarded Thinking in VLMs for Image Preferences
Paper
• 2506.22832
• Published
• 23
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable
Reinforcement Learning
Paper
• 2507.01006
• Published
• 251
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional
Multimodal Embeddings
Paper
• 2506.23115
• Published
• 36
MusiXQA: Advancing Visual Music Understanding in Multimodal Large
Language Models
Paper
• 2506.23009
• Published
• 11
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
A Survey on Vision-Language-Action Models: An Action Tokenization
Perspective
Paper
• 2507.01925
• Published
• 39
LongAnimation: Long Animation Generation with Dynamic Global-Local
Memory
Paper
• 2507.01945
• Published
• 76
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
• 2506.23918
• Published
• 90
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation
Models on Standard Computer Vision Tasks
Paper
• 2507.01955
• Published
• 36
MemOS: A Memory OS for AI System
Paper
• 2507.03724
• Published
• 159
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive
World Knowledge
Paper
• 2507.04447
• Published
• 45
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning
Dataset
Paper
• 2507.03483
• Published
• 24
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code
Generation Evaluation
Paper
• 2507.04952
• Published
• 11
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and
Visual Documents
Paper
• 2507.04590
• Published
• 17
Perception-Aware Policy Optimization for Multimodal Reasoning
Paper
• 2507.06448
• Published
• 48
4KAgent: Agentic Any Image to 4K Super-Resolution
Paper
• 2507.07105
• Published
• 106
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Paper
• 2507.07095
• Published
• 56
Scaling RL to Long Videos
Paper
• 2507.07966
• Published
• 160
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and
Methodology
Paper
• 2507.07999
• Published
• 50
PyVision: Agentic Vision with Dynamic Tooling
Paper
• 2507.07998
• Published
• 33
Multi-Granular Spatio-Temporal Token Merging for Training-Free
Acceleration of Video LLMs
Paper
• 2507.07990
• Published
• 46
A Survey on Long-Video Storytelling Generation: Architectures,
Consistency, and Cinematic Quality
Paper
• 2507.07202
• Published
• 25
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for
Visual Reasoning
Paper
• 2507.05255
• Published
• 75
Vision Foundation Models as Effective Visual Tokenizers for
Autoregressive Image Generation
Paper
• 2507.08441
• Published
• 62
Robust Multimodal Large Language Models Against Modality Conflict
Paper
• 2507.07151
• Published
• 6
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality,
Long Context, and Next Generation Agentic Capabilities
Paper
• 2507.06261
• Published
• 67
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large
Audio Language Models
Paper
• 2507.08128
• Published
• 13
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual
Dyadic Interactive Human Generation
Paper
• 2507.09862
• Published
• 51
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation
from Diffusion Models
Paper
• 2507.07104
• Published
• 46
Can Multimodal Foundation Models Understand Schematic Diagrams? An
Empirical Study on Information-Seeking QA over Scientific Papers
Paper
• 2507.10787
• Published
• 13
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New
Benchmarks
Paper
• 2507.11336
• Published
• 7
MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior
Understanding
Paper
• 2507.12463
• Published
• 27
VisionThink: Smart and Efficient Vision Language Model via Reinforcement
Learning
Paper
• 2507.13348
• Published
• 79
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos
with Spatio-Temporal Diffusion Models
Paper
• 2507.13344
• Published
• 59
AnyCap Project: A Unified Framework, Dataset, and Benchmark for
Controllable Omni-modal Captioning
Paper
• 2507.12841
• Published
• 42
π³: Scalable Permutation-Equivariant Visual Geometry Learning
Paper
• 2507.13347
• Published
• 66
Automating Steering for Safe Multimodal Large Language Models
Paper
• 2507.13255
• Published
• 4
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal
Large Language Models
Paper
• 2507.12566
• Published
• 15
Franca: Nested Matryoshka Clustering for Scalable Visual Representation
Learning
Paper
• 2507.14137
• Published
• 35
GUI-G²: Gaussian Reward Modeling for GUI Grounding
Paper
• 2507.15846
• Published
• 133
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human
Videos
Paper
• 2507.15597
• Published
• 34
Latent Denoising Makes Good Visual Tokenizers
Paper
• 2507.15856
• Published
• 12
Step-Audio 2 Technical Report
Paper
• 2507.16632
• Published
• 74
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Paper
• 2507.16746
• Published
• 34
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking
Reasoning
Paper
• 2507.16814
• Published
• 21
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent
Planning
Paper
• 2507.16815
• Published
• 42
Experience is the Best Teacher: Grounding VLMs for Robotics through
Self-Generated Memory
Paper
• 2507.16713
• Published
• 21
Pixels, Patterns, but No Poetry: To See The World like Humans
Paper
• 2507.16863
• Published
• 69
Yume: An Interactive World Generation Model
Paper
• 2507.17744
• Published
• 91
Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less
Local Than Assumed
Paper
• 2507.16880
• Published
• 7
TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive
Generation
Paper
• 2507.18537
• Published
• 18
Iwin Transformer: Hierarchical Vision Transformer using Interleaved
Windows
Paper
• 2507.18405
• Published
• 5
Chat with AI: The Surprising Turn of Real-time Video Communication from
Human to AI
Paper
• 2507.10510
• Published
• 5
Deep Researcher with Test-Time Diffusion
Paper
• 2507.16075
• Published
• 68
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token
Compression across Images, Videos, and Audios
Paper
• 2507.20198
• Published
• 28
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World
Shorts
Paper
• 2507.20939
• Published
• 57
Region-based Cluster Discrimination for Visual Representation Learning
Paper
• 2507.20025
• Published
• 19
Reconstructing 4D Spatial Intelligence: A Survey
Paper
• 2507.21045
• Published
• 38
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Paper
• 2507.21033
• Published
• 23
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image
Generative Models Great Again
Paper
• 2507.22058
• Published
• 40
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Paper
• 2507.21183
• Published
• 15
ForCenNet: Foreground-Centric Network for Document Image Rectification
Paper
• 2507.19804
• Published
• 12
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual
Segmentation
Paper
• 2507.22886
• Published
• 10
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D
Worlds from Words or Pixels
Paper
• 2507.21809
• Published
• 140
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced
Multimodal Reasoning
Paper
• 2507.22607
• Published
• 47
Scalable Multi-Task Reinforcement Learning for Generalizable Spatial
Intelligence in Visuomotor Agents
Paper
• 2507.23698
• Published
• 11
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Paper
• 2507.23779
• Published
• 45
C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring
Challenges in Complex Conversations
Paper
• 2507.22968
• Published
• 25
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action
Models
Paper
• 2507.23682
• Published
• 24
Multimodal Referring Segmentation: A Survey
Paper
• 2508.00265
• Published
• 9
PixNerd: Pixel Neural Field Diffusion
Paper
• 2507.23268
• Published
• 52
Qwen-Image Technical Report
Paper
• 2508.02324
• Published
• 272
VeOmni: Scaling Any Modality Model Training with Model-Centric
Distributed Recipe Zoo
Paper
• 2508.02317
• Published
• 22
InstructVLA: Vision-Language-Action Instruction Tuning from
Understanding to Manipulation
Paper
• 2507.17520
• Published
• 15
A Glimpse to Compress: Dynamic Visual Token Pruning for Large
Vision-Language Models
Paper
• 2508.01548
• Published
• 14
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding
and Generation
Paper
• 2508.03320
• Published
• 63
LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Paper
• 2508.03694
• Published
• 52
Representation Shift: Unifying Token Compression with FlashAttention
Paper
• 2508.00367
• Published
• 16
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and
Outcome Reward
Paper
• 2508.03686
• Published
• 39
The Promise of RL for Autoregressive Image Editing
Paper
• 2508.01119
• Published
• 11
HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and
Decision in Embodied Agents
Paper
• 2508.02629
• Published
• 6
LAMIC: Layout-Aware Multi-Image Composition via Scalability of
Multimodal Diffusion Transformer
Paper
• 2508.00477
• Published
• 11
ChartCap: Mitigating Hallucination of Dense Chart Captioning
Paper
• 2508.03164
• Published
• 7
Sotopia-RL: Reward Design for Social Intelligence
Paper
• 2508.03905
• Published
• 23
HPSv3: Towards Wide-Spectrum Human Preference Score
Paper
• 2508.03789
• Published
• 20
On the Generalization of SFT: A Reinforcement Learning Perspective with
Reward Rectification
Paper
• 2508.05629
• Published
• 184
Enhancing Vision-Language Model Training with Reinforcement Learning in
Synthetic Worlds for Real-World Success
Paper
• 2508.04280
• Published
• 35
CM³: Calibrating Multimodal Recommendation
Paper
• 2508.01226
• Published
• 1
DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Paper
• 2508.05405
• Published
• 64
Can Large Multimodal Models Actively Recognize Faulty Inputs? A
Systematic Evaluation Framework of Their Input Scrutiny Ability
Paper
• 2508.04017
• Published
• 11
Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Paper
• 2508.05547
• Published
• 11
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional
Virtual Try-On and Try-Off
Paper
• 2508.04825
• Published
• 60
Reinforcement Learning in Vision: A Survey
Paper
• 2508.08189
• Published
• 30
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Paper
• 2508.05748
• Published
• 141
Adversarial Video Promotion Against Text-to-Video Retrieval
Paper
• 2508.06964
• Published
• 9
MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math
Reasoning in Multimodal Large Language Models
Paper
• 2508.06009
• Published
• 16
Story2Board: A Training-Free Approach for Expressive Storyboard
Generation
Paper
• 2508.09983
• Published
• 70
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video
Generation
Paper
• 2508.07901
• Published
• 40
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with
Long-Term Memory
Paper
• 2508.09736
• Published
• 58
IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
Paper
• 2508.09456
• Published
• 8
NextStep-1: Toward Autoregressive Image Generation with Continuous
Tokens at Scale
Paper
• 2508.10711
• Published
• 145
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual
Mathematical Reasoning
Paper
• 2508.10433
• Published
• 144
HumanSense: From Multimodal Perception to Empathetic Context-Aware
Responses through Reasoning MLLMs
Paper
• 2508.10576
• Published
• 8
Processing and acquisition traces in visual encoders: What does CLIP
know about your camera?
Paper
• 2508.10637
• Published
• 8
Thyme: Think Beyond Images
Paper
• 2508.11630
• Published
• 81
Paper
• 2508.10104
• Published
• 298
Controlling Multimodal LLMs via Reward-guided Decoding
Paper
• 2508.11616
• Published
• 7
Paper
• 2508.11737
• Published
• 112
Next Visual Granularity Generation
Paper
• 2508.12811
• Published
• 49
OmniTry: Virtual Try-On Anything without Masks
Paper
• 2508.13632
• Published
• 15
MultiRef: Controllable Image Generation with Multiple Visual References
Paper
• 2508.06905
• Published
• 21
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion
Transformer
Paper
• 2508.09131
• Published
• 16
RynnEC: Bringing MLLMs into Embodied World
Paper
• 2508.14160
• Published
• 20
Intern-S1: A Scientific Multimodal Foundation Model
Paper
• 2508.15763
• Published
• 269
Waver: Wave Your Way to Lifelike Video Generation
Paper
• 2508.15761
• Published
• 36
Visual Autoregressive Modeling for Instruction-Guided Image Editing
Paper
• 2508.15772
• Published
• 9
When and What: Diffusion-Grounded VideoLLM with Entity Aware
Segmentation for Long Video Understanding
Paper
• 2508.15641
• Published
• 4
LLaSO: A Foundational Framework for Reproducible Research in Large
Language and Speech Model
Paper
• 2508.15418
• Published
• 8
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper
• 2508.16153
• Published
• 160
PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent
LLMs
Paper
• 2508.17188
• Published
• 17
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive
Simulation
Paper
• 2508.19209
• Published
• 42
Autoregressive Universal Video Segmentation Model
Paper
• 2508.19242
• Published
• 29
MovieCORE: COgnitive REasoning in Movies
Paper
• 2508.19026
• Published
• 7
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility,
Reasoning, and Efficiency
Paper
• 2508.18265
• Published
• 214
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Paper
• 2508.18264
• Published
• 25
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Paper
• 2508.17298
• Published
• 4
Deep Think with Confidence
Paper
• 2508.15260
• Published
• 90
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Paper
• 2508.19652
• Published
• 84
Mixture of Contexts for Long Video Generation
Paper
• 2508.21058
• Published
• 35
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable
Text-to-Image Reinforcement Learning
Paper
• 2508.20751
• Published
• 89
Beyond Transcription: Mechanistic Interpretability in ASR
Paper
• 2508.15882
• Published
• 87
MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time
Autoregressive Video Generation
Paper
• 2508.19320
• Published
• 29
SEAM: Semantically Equivalent Across Modalities Benchmark for
Vision-Language Models
Paper
• 2508.18179
• Published
• 9
CogVLA: Cognition-Aligned Vision-Language-Action Model via
Instruction-Driven Routing & Sparsification
Paper
• 2508.21046
• Published
• 9
OneReward: Unified Mask-Guided Image Generation via Multi-Task Human
Preference Learning
Paper
• 2508.21066
• Published
• 13
Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and
Voice
Paper
• 2508.17502
• Published
• 1
AudioStory: Generating Long-Form Narrative Audio with Large Language
Models
Paper
• 2508.20088
• Published
• 21
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
• 2508.21113
• Published
• 110
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for
General Robot Control
Paper
• 2508.21112
• Published
• 77
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn
Reinforcement Learning
Paper
• 2509.02544
• Published
• 125
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models
for Document Conversion
Paper
• 2509.01215
• Published
• 51
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long
Video Understanding
Paper
• 2508.21496
• Published
• 55
Kwai Keye-VL 1.5 Technical Report
Paper
• 2509.01563
• Published
• 38
OpenVision 2: A Family of Generative Pretrained Visual Encoders for
Multimodal Learning
Paper
• 2509.01644
• Published
• 34
Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Paper
• 2508.20470
• Published
• 75
Planning with Reasoning using Vision Language World Model
Paper
• 2509.02722
• Published
• 24
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL
Training
Paper
• 2509.03403
• Published
• 23
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Paper
• 2509.03867
• Published
• 211
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Paper
• 2508.20478
• Published
• 18
Why Language Models Hallucinate
Paper
• 2509.04664
• Published
• 196
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Paper
• 2509.04744
• Published
• 12
LatticeWorld: A Multimodal Large Language Model-Empowered Framework for
Interactive Complex World Generation
Paper
• 2509.05263
• Published
• 11
Reinforced Visual Perception with Tools
Paper
• 2509.01656
• Published
• 32
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
Paper
• 2509.06461
• Published
• 20
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
Paper
• 2509.06155
• Published
• 14
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage,
but Not Direct the Play?
Paper
• 2509.03516
• Published
• 12
D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
Paper
• 2509.06771
• Published
• 6
Visual Representation Alignment for Multimodal Large Language Models
Paper
• 2509.07979
• Published
• 84
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual
Search
Paper
• 2509.07969
• Published
• 59
Reconstruction Alignment Improves Unified Multimodal Models
Paper
• 2509.07295
• Published
• 40
A Survey of Reinforcement Learning for Large Reasoning Models
Paper
• 2509.08827
• Published
• 190
RewardDance: Reward Scaling in Visual Generation
Paper
• 2509.08826
• Published
• 73
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action
Model
Paper
• 2509.09372
• Published
• 246
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal
Conditioning
Paper
• 2509.08519
• Published
• 128
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Paper
• 2509.09674
• Published
• 80
Can Understanding and Generation Truly Benefit Together – or Just
Coexist?
Paper
• 2509.09666
• Published
• 34
Kling-Avatar: Grounding Multimodal Instructions for Cascaded
Long-Duration Avatar Animation Synthesis
Paper
• 2509.09595
• Published
• 48
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Paper
• 2509.09676
• Published
• 35
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in
LLMs
Paper
• 2509.09677
• Published
• 35
Learning to Optimize Multi-Objective Alignment Through Dynamic Reward
Weighting
Paper
• 2509.11452
• Published
• 14
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
Paper
• 2509.11362
• Published
• 5
Lost in Embeddings: Information Loss in Vision-Language Models
Paper
• 2509.11986
• Published
• 29
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Paper
• 2509.12201
• Published
• 106
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
Paper
• 2509.11543
• Published
• 49
LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion
Transformers via Explicit Correspondence
Paper
• 2509.12203
• Published
• 20
Multimodal Reasoning for Science: Technical Report and 1st Place
Solution to the ICML 2025 SeePhys Challenge
Paper
• 2509.06079
• Published
• 6
3D Aware Region Prompted Vision Language Model
Paper
• 2509.13317
• Published
• 14
Hunyuan3D Studio: End-to-End AI Pipeline for Game-Ready 3D Asset
Generation
Paper
• 2509.12815
• Published
• 40
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language
Models
Paper
• 2509.12132
• Published
• 7
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose
Video Hallucination by Fine-grained Spatial-Temporal Grounding
Paper
• 2509.11866
• Published
• 2
Scaling Agents via Continual Pre-training
Paper
• 2509.13310
• Published
• 117
SAIL-VL2 Technical Report
Paper
• 2509.14033
• Published
• 44
FlowRL: Matching Reward Distributions for LLM Reasoning
Paper
• 2509.15207
• Published
• 116
WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model
via Training-Free Guidance
Paper
• 2509.15130
• Published
• 30
MultiEdit: Advancing Instruction-based Image Editing on Diverse and
Challenging Tasks
Paper
• 2509.14638
• Published
• 13
AToken: A Unified Tokenizer for Vision
Paper
• 2509.14476
• Published
• 36
Unleashing the Potential of Multimodal LLMs for Zero-Shot
Spatio-Temporal Video Grounding
Paper
• 2509.15178
• Published
• 6
PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
Paper
• 2509.12989
• Published
• 28
LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Paper
• 2509.13642
• Published
• 9
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid
Vision Tokenizer
Paper
• 2509.16197
• Published
• 58
Lynx: Towards High-Fidelity Personalized Video Generation
Paper
• 2509.15496
• Published
• 13
BaseReward: A Strong Baseline for Multimodal Reward Model
Paper
• 2509.16127
• Published
• 21
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Paper
• 2509.15566
• Published
• 14
Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided
Role-playing Agents
Paper
• 2509.15233
• Published
• 2
A Vision-Language-Action-Critic Model for Robotic Real-World
Reinforcement Learning
Paper
• 2509.15937
• Published
• 20
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion
Transformer Models
Paper
• 2509.17627
• Published
• 66
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning
for Video LLMs
Paper
• 2509.18056
• Published
• 27
Qwen3-Omni Technical Report
Paper
• 2509.17765
• Published
• 149
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric
Reasoning
Paper
• 2509.17437
• Published
• 17
VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video
Diffusion Models
Paper
• 2509.17985
• Published
• 26
ContextFlow: Training-Free Video Object Editing via Adaptive Context
Enrichment
Paper
• 2509.17818
• Published
• 8
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late
Interaction
Paper
• 2509.18095
• Published
• 9
When Big Models Train Small Ones: Label-Free Model Parity Alignment for
Efficient Visual Question Answering using Small VLMs
Paper
• 2509.16633
• Published
• 2
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and
Training Recipe
Paper
• 2509.18154
• Published
• 55
Reinforcement Learning on Pre-Training Data
Paper
• 2509.19249
• Published
• 67
Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model
Self-Distillation
Paper
• 2509.19296
• Published
• 25
Hyper-Bagel: A Unified Acceleration Framework for Multimodal
Understanding and Generation
Paper
• 2509.18824
• Published
• 23
UniPixel: Unified Object Referring and Segmentation for Pixel-Level
Visual Reasoning
Paper
• 2509.18094
• Published
• 4
MAPO: Mixed Advantage Policy Optimization
Paper
• 2509.18849
• Published
• 27
EmbeddingGemma: Powerful and Lightweight Text Representations
Paper
• 2509.20354
• Published
• 47
Advancing Speech Understanding in Speech-Aware Language Models with GRPO
Paper
• 2509.16990
• Published
• 21
Video models are zero-shot learners and reasoners
Paper
• 2509.20328
• Published
• 100
EditVerse: Unifying Image and Video Editing and Generation with
In-Context Learning
Paper
• 2509.20360
• Published
• 18
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Paper
• 2509.20427
• Published
• 82
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and
Open Resources
Paper
• 2509.21268
• Published
• 104
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D
Assets
Paper
• 2509.21245
• Published
• 39
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven
Perspective
Paper
• 2509.18905
• Published
• 30
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal
Understanding and Generation
Paper
• 2509.19244
• Published
• 12
AutoIntent: AutoML for Text Classification
Paper
• 2509.21138
• Published
• 36
CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling
Paper
• 2509.21114
• Published
• 16
V-GameGym: Visual Game Generation for Code Large Language Models
Paper
• 2509.20136
• Published
• 9
Interactive Recommendation Agent with Active User Commands
Paper
• 2509.21317
• Published
• 7
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for
Video Temporal Reasoning
Paper
• 2509.21113
• Published
• 6
LongLive: Real-time Interactive Long Video Generation
Paper
• 2509.22622
• Published
• 188
VoiceAssistant-Eval: Benchmarking AI Assistants across Listening,
Speaking, and Viewing
Paper
• 2509.22651
• Published
• 23
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement
Learning
Paper
• 2509.22647
• Published
• 33
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Paper
• 2509.21760
• Published
• 15
Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on
Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval
Paper
• 2509.21710
• Published
• 19
MinerU2.5: A Decoupled Vision-Language Model for Efficient
High-Resolution Document Parsing
Paper
• 2509.22186
• Published
• 146
Where MLLMs Attend and What They Rely On: Explaining Autoregressive
Token Generation
Paper
• 2509.22496
• Published
• 4
X-Streamer: Unified Human World Modeling with Audiovisual Interaction
Paper
• 2509.21574
• Published
• 8
Visual Jigsaw Post-Training Improves MLLMs
Paper
• 2509.25190
• Published
• 37
LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in
Mechanism via Multi-Step Reasoning
Paper
• 2509.24786
• Published
• 7
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Paper
• 2509.25131
• Published
• 16
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient
SpeechLLMs
Paper
• 2509.22220
• Published
• 65
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation
and Editing
Paper
• 2509.24900
• Published
• 53
GSM8K-V: Can Vision Language Models Solve Grade School Math Word
Problems in Visual Contexts
Paper
• 2509.25160
• Published
• 32
EditScore: Unlocking Online RL for Image Editing via High-Fidelity
Reward Modeling
Paper
• 2509.23909
• Published
• 33
HunyuanImage 3.0 Technical Report
Paper
• 2509.23951
• Published
• 25
MMPB: It's Time for Multi-Modal Personalization
Paper
• 2509.22820
• Published
• 15
BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation
Engine for Monocular Depth Estimation
Paper
• 2509.25077
• Published
• 15
Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
Paper
• 2509.23050
• Published
• 15
UniMIC: Token-Based Multimodal Interactive Coding for Human-AI
Collaboration
Paper
• 2509.22570
• Published
• 4
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified
Self-Play
Paper
• 2509.25541
• Published
• 140
LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks
for Multimodal Large Language Models
Paper
• 2510.03232
• Published
• 1
Efficient Test-Time Scaling for Small Vision-Language Models
Paper
• 2510.03574
• Published
• 10
Self-Improvement in Multimodal Large Language Models: A Survey
Paper
• 2510.02665
• Published
• 21
Efficient Multi-modal Large Language Models via Progressive Consistency
Distillation
Paper
• 2510.00515
• Published
• 42
Paper
• 2510.01141
• Published
• 121
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Paper
• 2510.01623
• Published
• 12
VideoNSA: Native Sparse Attention Scales Video Understanding
Paper
• 2510.02295
• Published
• 10
Visual Multi-Agent System: Mitigating Hallucination Snowballing via
Visual Flow
Paper
• 2509.21789
• Published
• 9
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via
Multi-Stage Reinforcement Learning
Paper
• 2510.02240
• Published
• 18
Learning to Reason for Hallucination Span Detection
Paper
• 2510.02173
• Published
• 19
VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal
Reasoning
Paper
• 2510.01444
• Published
• 20
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Paper
• 2510.01284
• Published
• 37
ModernVBERT: Towards Smaller Visual Document Retrievers
Paper
• 2510.01149
• Published
• 32
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided
Illusions
Paper
• 2510.02314
• Published
• 60
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper
• 2510.02283
• Published
• 96
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained
Perception in VLMs
Paper
• 2509.25916
• Published
• 6
Training Vision-Language Process Reward Models for Test-Time Scaling in
Multimodal Reasoning: Key Insights and Lessons Learned
Paper
• 2509.23250
• Published
• 6
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image
Editing
Paper
• 2509.26346
• Published
• 19
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified
Rewards in World Simulators
Paper
• 2510.00406
• Published
• 66
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Paper
• 2509.26231
• Published
• 18
MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
Paper
• 2509.26391
• Published
• 22
DC-VideoGen: Efficient Video Generation with Deep Compression Video
Autoencoder
Paper
• 2509.25182
• Published
• 39
Learning to See Before Seeing: Demystifying LLM Visual Priors from
Language Pre-training
Paper
• 2509.26625
• Published
• 43
The Dragon Hatchling: The Missing Link between the Transformer and
Models of the Brain
Paper
• 2509.26507
• Published
• 547
More Thought, Less Accuracy? On the Dual Nature of Reasoning in
Vision-Language Models
Paper
• 2509.25848
• Published
• 80
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token
Pruning for Efficient Supervised Fine-Tuning
Paper
• 2509.23873
• Published
• 67
Heptapod: Language Modeling on Visual Signals
Paper
• 2510.06673
• Published
• 5
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in
MLLMs
Paper
• 2510.01954
• Published
• 14
Are We Using the Right Benchmark: An Evaluation Framework for Visual
Token Compression Methods
Paper
• 2510.07143
• Published
• 13
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Paper
• 2510.06783
• Published
• 12
MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Paper
• 2510.07310
• Published
• 36
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal
Generation and Understanding
Paper
• 2510.06308
• Published
• 55
Ming-UniVision: Joint Image Understanding and Generation with a Unified
Continuous Tokenizer
Paper
• 2510.06590
• Published
• 77
Cache-to-Cache: Direct Semantic Communication Between Large Language
Models
Paper
• 2510.03215
• Published
• 98
SAEdit: Token-level control for continuous image editing via Sparse
AutoEncoder
Paper
• 2510.05081
• Published
• 5
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
Paper
• 2510.04136
• Published
• 4
Character Mixing for Video Generation
Paper
• 2510.05093
• Published
• 7
ChronoEdit: Towards Temporal Reasoning for Image Editing and World
Simulation
Paper
• 2510.04290
• Published
• 20
Factuality Matters: When Image Generation and Editing Meet Structured
Visuals
Paper
• 2510.05091
• Published
• 20
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Paper
• 2510.05094
• Published
• 38
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large
Multimodal Models
Paper
• 2510.05034
• Published
• 51
Paper2Video: Automatic Video Generation from Scientific Papers
Paper
• 2510.05096
• Published
• 119
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with
Holistic Platform and Adaptive Hybrid Policy Optimization
Paper
• 2510.08540
• Published
• 109
DreamOmni2: Multimodal Instruction-based Editing and Generation
Paper
• 2510.06679
• Published
• 73
UniVideo: Unified Understanding, Generation, and Editing for Videos
Paper
• 2510.08377
• Published
• 81
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal
Patches via In-Context Conditioning
Paper
• 2510.08555
• Published
• 64
ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D
Reconstruction with Structured Scene Representation
Paper
• 2510.08551
• Published
• 34
PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
Paper
• 2510.07546
• Published
• 22
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video
Super-Resolution
Paper
• 2510.08143
• Published
• 20
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language
Models under Data Constraints
Paper
• 2510.08565
• Published
• 21
InstructX: Towards Unified Visual Editing with MLLM Guidance
Paper
• 2510.08485
• Published
• 18
Thinking with Camera: A Unified Multimodal Model for Camera-Centric
Understanding and Generation
Paper
• 2510.08673
• Published
• 126
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to
Embodied AI
Paper
• 2510.05684
• Published
• 143
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for
MLLMs
Paper
• 2510.09201
• Published
• 50
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Paper
• 2510.09608
• Published
• 51
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic
Embodied Capabilities
Paper
• 2510.08759
• Published
• 46
SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Paper
• 2510.09606
• Published
• 18
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level
Entropy Shaping
Paper
• 2510.08457
• Published
• 13
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni
MLLMs
Paper
• 2510.10689
• Published
• 47
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Paper
• 2510.10395
• Published
• 31
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Paper
• 2510.11712
• Published
• 31
Spotlight on Token Perception for Multimodal Reinforcement Learning
Paper
• 2510.09285
• Published
• 37
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in
Large Vision-Language Models
Paper
• 2510.09008
• Published
• 16
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Paper
• 2510.11027
• Published
• 23
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video
Super-Resolution
Paper
• 2510.12747
• Published
• 39
ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning
and Online Reinforcement Learning
Paper
• 2510.12693
• Published
• 28
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Paper
• 2510.12801
• Published
• 13
UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Paper
• 2510.12789
• Published
• 19
SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Paper
• 2510.12709
• Published
• 13
SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Paper
• 2510.12784
• Published
• 20
Spatial Forcing: Implicit Spatial Representation Alignment for
Vision-language-action Model
Paper
• 2510.12276
• Published
• 147
Scaling Language-Centric Omnimodal Representation Learning
Paper
• 2510.11693
• Published
• 104
HoneyBee: Data Recipes for Vision-Language Reasoners
Paper
• 2510.12225
• Published
• 11
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity
MoE
Paper
• 2510.13344
• Published
• 63
FlashWorld: High-quality 3D Scene Generation within Seconds
Paper
• 2510.13678
• Published
• 73
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn
Dialogue
Paper
• 2510.13747
• Published
• 30
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully
Open MLLMs
Paper
• 2510.13795
• Published
• 59
Generative Universal Verifier as Multimodal Meta-Reasoner
Paper
• 2510.13804
• Published
• 27
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Paper
• 2510.13515
• Published
• 12
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for
Generalist Robot Policy
Paper
• 2510.13778
• Published
• 17
FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Paper
• 2510.10921
• Published
• 11
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment
Vision-Language-Action Model
Paper
• 2510.10274
• Published
• 16
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action
Models
Paper
• 2510.13626
• Published
• 46
PhysMaster: Mastering Physical Representation for Video Generation via
Reinforcement Learning
Paper
• 2510.13809
• Published
• 38
Trace Anything: Representing Any Video in 4D via Trajectory Fields
Paper
• 2510.13802
• Published
• 31
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
Paper
• 2510.13759
• Published
• 11
When Models Lie, We Learn: Multilingual Span-Level Hallucination
Detection with PsiloQA
Paper
• 2510.04849
• Published
• 115
WithAnyone: Towards Controllable and ID Consistent Image Generation
Paper
• 2510.14975
• Published
• 85
From Pixels to Words -- Towards Native Vision-Language Primitives at
Scale
Paper
• 2510.14979
• Published
• 67
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper
• 2510.14528
• Published
• 118
MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal
Mathematical Reasoning
Paper
• 2510.14958
• Published
• 23
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image
Reasoning
Paper
• 2510.10518
• Published
• 19
VLA^2: Empowering Vision-Language-Action Models with an Agentic
Framework for Unseen Concept Manipulation
Paper
• 2510.14902
• Published
• 17
Qwen3Guard Technical Report
Paper
• 2510.14276
• Published
• 15
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding
LLM
Paper
• 2510.15870
• Published
• 91
Scaling Instruction-Based Video Editing with a High-Quality Synthetic
Dataset
Paper
• 2510.15742
• Published
• 51
PICABench: How Far Are We from Physically Realistic Image Editing?
Paper
• 2510.17681
• Published
• 64
Glyph: Scaling Context Windows via Visual-Text Compression
Paper
• 2510.17800
• Published
• 68
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented
Generation
Paper
• 2510.17354
• Published
• 35
FineVision: Open Data Is All You Need
Paper
• 2510.17269
• Published
• 75
RL makes MLLMs see better than SFT
Paper
• 2510.16333
• Published
• 49
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware
Finetuning and MLLM Implicit Feedback
Paper
• 2510.16888
• Published
• 22
Visual Autoregressive Models Beat Diffusion Models on Inference Time
Scaling
Paper
• 2510.16751
• Published
• 21
ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
Paper
• 2510.17803
• Published
• 14
Executable Knowledge Graphs for Replicating AI Research
Paper
• 2510.17795
• Published
• 15
Knowledge-based Visual Question Answer with Multimodal Processing,
Retrieval and Filtering
Paper
• 2510.14605
• Published
• 5
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large
Vision and Language Models
Paper
• 2510.16641
• Published
• 5
Paper2Web: Let's Make Your Paper Alive!
Paper
• 2510.15842
• Published
• 27
Towards Faithful and Controllable Personalization via Critique-Post-Edit
Reinforcement Learning
Paper
• 2510.18849
• Published
• 21
World-in-World: World Models in a Closed-Loop World
Paper
• 2510.18135
• Published
• 77
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for
Multimodal LLMs
Paper
• 2510.18876
• Published
• 37
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Paper
• 2510.18795
• Published
• 11
IF-VidCap: Can Video Caption Models Follow Instructions?
Paper
• 2510.18726
• Published
• 26
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating
Multimodal LLMs in Multi-Turn Dialogues
Paper
• 2510.17722
• Published
• 20
olmOCR 2: Unit Test Rewards for Document OCR
Paper
• 2510.19817
• Published
• 16
DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile
Phone Agents
Paper
• 2510.19336
• Published
• 17
Unified Reinforcement and Imitation Learning for Vision-Language Models
Paper
• 2510.19307
• Published
• 32
OmniNWM: Omniscient Driving Navigation World Models
Paper
• 2510.18313
• Published
• 14
MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Paper
• 2510.19457
• Published
• 9
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Paper
• 2510.19808
• Published
• 30
KORE: Enhancing Knowledge Injection for Large Multimodal Models via
Knowledge-Oriented Augmentations and Constraints
Paper
• 2510.19316
• Published
• 12
VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Paper
• 2510.19488
• Published
• 20
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Paper
• 2510.19430
• Published
• 52
From Charts to Code: A Hierarchical Benchmark for Multimodal Models
Paper
• 2510.17932
• Published
• 8
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal
Evidence
Paper
• 2510.20579
• Published
• 56
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video
Narratives
Paper
• 2510.20822
• Published
• 41
ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
Paper
• 2510.20803
• Published
• 12
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale
Visual Evidence
Paper
• 2510.20470
• Published
• 12
LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered
Canvas
Paper
• 2510.20820
• Published
• 11
Directional Reasoning Injection for Fine-Tuning MLLMs
Paper
• 2510.15050
• Published
• 12
See the Text: From Tokenization to Visual Reading
Paper
• 2510.18840
• Published
• 4
Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text
Inputs in Multimodal LLMs
Paper
• 2510.18279
• Published
• 6
Every Question Has Its Own Value: Reinforcement Learning with Explicit
Human Values
Paper
• 2510.20187
• Published
• 19
SAKE: Towards Editing Auditory Attribute Knowledge of Large
Audio-Language Models
Paper
• 2510.16917
• Published
• 20
UI-Ins: Enhancing GUI Grounding with Multi-Perspective
Instruction-as-Reasoning
Paper
• 2510.20286
• Published
• 24
Video-As-Prompt: Unified Semantic Control for Video Generation
Paper
• 2510.20888
• Published
• 50
From Denoising to Refining: A Corrective Framework for Vision-Language
Diffusion Model
Paper
• 2510.19871
• Published
• 30
Model Merging with Functional Dual Anchors
Paper
• 2510.21223
• Published
• 13
Visual Diffusion Models are Geometric Solvers
Paper
• 2510.21697
• Published
• 20
Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image
Generation
Paper
• 2510.21583
• Published
• 31
Reasoning with Sampling: Your Base Model is Smarter Than You Think
Paper
• 2510.14901
• Published
• 48
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via
Hierarchical Model Merging
Paper
• 2510.20479
• Published
• 12
Taming Modality Entanglement in Continual Audio-Visual Segmentation
Paper
• 2510.17234
• Published
• 5
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language
Models in Physical Environments
Paper
• 2510.21111
• Published
• 3
Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with
Free-Form Preferences
Paper
• 2510.23451
• Published
• 28
LightBagel: A Light-weighted, Double Fusion Framework for Unified
Multimodal Understanding and Generation
Paper
• 2510.22946
• Published
• 18
LongCat-Video Technical Report
Paper
• 2510.22200
• Published
• 33
Tongyi DeepResearch Technical Report
Paper
• 2510.24701
• Published
• 101
Generalization or Memorization: Dynamic Decoding for Mode Steering
Paper
• 2510.22099
• Published
• 4
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a
Unified Concept Set
Paper
• 2510.21323
• Published
• 4
Rethinking Visual Intelligence: Insights from Video Pretraining
Paper
• 2510.24448
• Published
• 6
Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human
Animation
Paper
• 2510.23581
• Published
• 42
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing,
Speaking, and Acting
Paper
• 2510.21817
• Published
• 42
Open Multimodal Retrieval-Augmented Factual Image Generation
Paper
• 2510.22521
• Published
• 31
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal
Understanding
Paper
• 2510.23479
• Published
• 16
Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal
Reasoning in MLLMs
Paper
• 2510.24514
• Published
• 22
Emu3.5: Native Multimodal Models are World Learners
Paper
• 2510.26583
• Published
• 111
The End of Manual Decoding: Towards Truly End-to-End Language Models
Paper
• 2510.26697
• Published
• 117
Kimi Linear: An Expressive, Efficient Attention Architecture
Paper
• 2510.26692
• Published
• 125
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with
the MME-CoF Benchmark
Paper
• 2510.26802
• Published
• 34
OmniX: From Unified Panoramic Generation and Perception to
Graphics-Ready 3D Scenes
Paper
• 2510.26800
• Published
• 22
OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal
Document Layout Generation
Paper
• 2510.26213
• Published
• 10
ThinkMorph: Emergent Properties in Multimodal Interleaved
Chain-of-Thought Reasoning
Paper
• 2510.27492
• Published
• 86
π_RL: Online RL Fine-tuning for Flow-based
Vision-Language-Action Models
Paper
• 2510.25889
• Published
• 66
Continuous Autoregressive Language Models
Paper
• 2510.27688
• Published
• 73
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised
Reinforcement Learning
Paper
• 2510.27606
• Published
• 31
Revisiting Multimodal Positional Encoding in Vision-Language Models
Paper
• 2510.23095
• Published
• 22
SemCoT: Accelerating Chain-of-Thought Reasoning through
Semantically-Aligned Implicit Tokens
Paper
• 2510.24940
• Published
• 18
Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive
Trigger Learning
Paper
• 2510.27623
• Published
• 13
The Underappreciated Power of Vision Models for Graph Structural
Understanding
Paper
• 2510.24788
• Published
• 36
LongCat-Flash-Omni Technical Report
Paper
• 2511.00279
• Published
• 26
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement
Reading with MeasureBench
Paper
• 2510.26865
• Published
• 12
$\left|\,\circlearrowright\,\text{BUS}\,\right|$: A Large and Diverse Multimodal
Benchmark for evaluating the ability of Vision-Language Models to understand
Rebus Puzzles
Paper
• 2511.01340
• Published
• 13
Don't Blind Your VLA: Aligning Visual Representations for OOD
Generalization
Paper
• 2510.25616
• Published
• 105
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual
Representation
Paper
• 2511.02778
• Published
• 102
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for
Visual Chain-of-Thought
Paper
• 2511.02779
• Published
• 59
When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs
Preference Dynamics in MLLMs
Paper
• 2511.02243
• Published
• 25
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for
Large Multimodal Models
Paper
• 2511.02650
• Published
• 10
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive
Capacity
Paper
• 2511.03146
• Published
• 8
Thinking with Video: Video Generation as a Promising Multimodal
Reasoning Paradigm
Paper
• 2511.04570
• Published
• 240
V-Thinker: Interactive Thinking with Images
Paper
• 2511.04460
• Published
• 97
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable
Non-Visual Shortcuts
Paper
• 2511.04655
• Published
• 8
NVIDIA Nemotron Nano V2 VL
Paper
• 2511.03929
• Published
• 30
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal
Interactions
Paper
• 2511.03334
• Published
• 53
Cambrian-S: Towards Spatial Supersensing in Video
Paper
• 2511.04670
• Published
• 38
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
Paper
• 2511.03774
• Published
• 13
Paper
• 2511.05491
• Published
• 52
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Paper
• 2511.04962
• Published
• 57
Paper
• 2511.05369
• Published
• 10
DeepEyesV2: Toward Agentic Multimodal Model
Paper
• 2511.05271
• Published
• 45
HaluMem: Evaluating Hallucinations in Memory Systems of Agents
Paper
• 2511.03506
• Published
• 94
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal
LLMs
Paper
• 2511.07250
• Published
• 18
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
Paper
• 2511.05705
• Published
• 8
Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs
Paper
• 2511.05933
• Published
• 9
MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
Paper
• 2511.07137
• Published
• 6
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware
Large Language Models
Paper
• 2511.07299
• Published
• 8
Wasm: A Pipeline for Constructing Structured Arabic Interleaved
Multimodal Corpora
Paper
• 2511.07080
• Published
• 32
VideoSSR: Video Self-Supervised Reinforcement Learning
Paper
• 2511.06281
• Published
• 25
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Paper
• 2511.05489
• Published
• 3
WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Paper
• 2511.09515
• Published
• 20
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning
Paper
• 2511.06805
• Published
• 13
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
Paper
• 2511.08892
• Published
• 210
TiDAR: Think in Diffusion, Talk in Autoregression
Paper
• 2511.08923
• Published
• 128
Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
Paper
• 2511.08633
• Published
• 55
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Paper
• 2511.11134
• Published
• 32
DoPE: Denoising Rotary Position Embedding
Paper
• 2511.09146
• Published
• 97
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Paper
• 2511.11434
• Published
• 45
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Paper
• 2511.08195
• Published
• 34
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Paper
• 2511.07403
• Published
• 14
Simulating the Visual World with Artificial Intelligence: A Roadmap
Paper
• 2511.08585
• Published
• 30
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
Paper
• 2511.11002
• Published
• 4
Depth Anything 3: Recovering the Visual Space from Any Views
Paper
• 2511.10647
• Published
• 99
Music Flamingo: Scaling Music Understanding in Audio Language Models
Paper
• 2511.10289
• Published
• 17
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Paper
• 2511.10017
• Published
• 7
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
Paper
• 2511.11793
• Published
• 187
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Paper
• 2511.09611
• Published
• 70
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Paper
• 2511.13647
• Published
• 71
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Paper
• 2511.13704
• Published
• 43
Back to Basics: Let Denoising Generative Models Denoise
Paper
• 2511.13720
• Published
• 69
NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards
Paper
• 2511.14659
• Published
• 13
VIDEOP2R: Video Understanding from Perception to Reasoning
Paper
• 2511.11113
• Published
• 111
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Paper
• 2511.10555
• Published
• 62
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Paper
• 2511.13853
• Published
• 36
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Paper
• 2511.14159
• Published
• 25
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Paper
• 2511.14582
• Published
• 19
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Paper
• 2511.13026
• Published
• 26
General Agentic Memory Via Deep Research
Paper
• 2511.18423
• Published
• 167
SAM 3: Segment Anything with Concepts
Paper
• 2511.16719
• Published
• 129
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Paper
• 2511.16334
• Published
• 93
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Paper
• 2511.19418
• Published
• 29
HunyuanVideo 1.5 Technical Report
Paper
• 2511.18870
• Published
• 28
HunyuanOCR Technical Report
Paper
• 2511.19575
• Published
• 22
RynnVLA-002: A Unified Vision-Language-Action and World Model
Paper
• 2511.17502
• Published
• 28
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Paper
• 2511.13593
• Published
• 27
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Paper
• 2511.17490
• Published
• 22
Loomis Painter: Reconstructing the Painting Process
Paper
• 2511.17344
• Published
• 15
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Paper
• 2511.16175
• Published
• 12
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Paper
• 2511.11007
• Published
• 15
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Paper
• 2511.17487
• Published
• 12
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
Paper
• 2511.17199
• Published
• 8
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Paper
• 2511.19365
• Published
• 64
In-Video Instructions: Visual Signals as Generative Control
Paper
• 2511.19401
• Published
• 32
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Paper
• 2511.20256
• Published
• 28
AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
Paper
• 2511.16397
• Published
• 10
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Paper
• 2511.20561
• Published
• 32
Soft Adaptive Policy Optimization
Paper
• 2511.20347
• Published
• 42
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow
Paper
• 2511.20462
• Published
• 32
MagicWorld: Interactive Geometry-driven Video World Exploration
Paper
• 2511.18886
• Published
• 19
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Paper
• 2511.19413
• Published
• 20
Monet: Reasoning in Latent Visual Space Beyond Images and Language
Paper
• 2511.21395
• Published
• 18
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Paper
• 2511.21688
• Published
• 8
Video Generation Models Are Good Latent Reward Models
Paper
• 2511.21541
• Published
• 45
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Paper
• 2511.21691
• Published
• 36
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Paper
• 2511.22625
• Published
• 47
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Paper
• 2512.01342
• Published
• 18
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Paper
• 2511.16595
• Published
• 10
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Paper
• 2511.16669
• Published
• 32
MiMo-Embodied: X-Embodied Foundation Model Technical Report
Paper
• 2511.16518
• Published
• 26
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Paper
• 2511.15605
• Published
• 24
Scaling Spatial Intelligence with Multimodal Foundation Models
Paper
• 2511.13719
• Published
• 47
Step-Audio-R1 Technical Report
Paper
• 2511.15848
• Published
• 58
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Paper
• 2511.16668
• Published
• 55
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
Paper
• 2511.14349
• Published
• 18
VisPlay: Self-Evolving Vision-Language Models from Images
Paper
• 2511.15661
• Published
• 43
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published
• 77
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper
• 2511.14993
• Published
• 231
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Paper
• 2512.03041
• Published
• 66
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Paper
• 2512.01248
• Published
• 12
OneThinker: All-in-one Reasoning Model for Image and Video
Paper
• 2512.03043
• Published
• 33
ViDiC: Video Difference Captioning
Paper
• 2512.03405
• Published
• 28
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Paper
• 2512.03746
• Published
• 17
Qwen3-VL Technical Report
Paper
• 2511.21631
• Published
• 158
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Paper
• 2512.03794
• Published
• 5
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Paper
• 2512.04677
• Published
• 171
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Paper
• 2512.05081
• Published
• 31
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Paper
• 2511.22826
• Published
• 8
Guided Self-Evolving LLMs with Minimal Human Supervision
Paper
• 2512.02472
• Published
• 55
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Paper
• 2512.02395
• Published
• 49
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Paper
• 2512.02425
• Published
• 25
Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation
Paper
• 2512.02457
• Published
• 14
MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
Paper
• 2512.03046
• Published
• 12
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Paper
• 2511.22586
• Published
• 7
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Paper
• 2512.04678
• Published
• 42
Relational Visual Similarity
Paper
• 2512.07833
• Published
• 25
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Paper
• 2512.07829
• Published
• 23
LongCat-Image Technical Report
Paper
• 2512.07584
• Published
• 23
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Paper
• 2512.08294
• Published
• 18
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Paper
• 2512.07831
• Published
• 17
Paper
• 2511.23469
• Published
• 16
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Paper
• 2512.08765
• Published
• 132
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Paper
• 2512.07951
• Published
• 50
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Paper
• 2512.07802
• Published
• 46
DeepCode: Open Agentic Coding
Paper
• 2512.07921
• Published
• 33
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Paper
• 2512.02231
• Published
• 9
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Paper
• 2512.08829
• Published
• 21
Rethinking Chain-of-Thought Reasoning for Videos
Paper
• 2512.09616
• Published
• 19
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Paper
• 2511.23386
• Published
• 16
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Paper
• 2512.13604
• Published
• 74
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Paper
• 2512.12967
• Published
• 108
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Paper
• 2512.13080
• Published
• 17
MMGR: Multi-Modal Generative Reasoning
Paper
• 2512.14691
• Published
• 119
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Paper
• 2512.12675
• Published
• 41
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Paper
• 2512.14336
• Published
• 31
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
• 2512.14051
• Published
• 46
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
Paper
• 2512.13303
• Published
• 17
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
Paper
• 2512.14273
• Published
• 10
Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
Paper
• 2512.15603
• Published
• 66
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Paper
• 2512.15713
• Published
• 17
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Paper
• 2512.14614
• Published
• 71
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Paper
• 2512.13281
• Published
• 64
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Paper
• 2512.14699
• Published
• 28
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Paper
• 2512.14698
• Published
• 21
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Paper
• 2512.14052
• Published
• 42
Puzzle Curriculum GRPO for Vision-Centric Reasoning
Paper
• 2512.14944
• Published
• 36
In Pursuit of Pixel Supervision for Visual Pre-training
Paper
• 2512.15715
• Published
• 11
Kling-Omni Technical Report
Paper
• 2512.16776
• Published
• 170
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Paper
• 2512.13507
• Published
• 40
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Paper
• 2512.16899
• Published
• 14
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Paper
• 2512.17495
• Published
• 20
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Paper
• 2512.17532
• Published
• 67
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Paper
• 2512.16793
• Published
• 75
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Paper
• 2512.17909
• Published
• 37
MMFormalizer: Multimodal Autoformalization in the Wild
Paper
• 2601.03017
• Published
• 105
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Paper
• 2601.04720
• Published
• 55
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Paper
• 2601.05175
• Published
• 36
Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
Paper
• 2512.21815
• Published
• 22
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Paper
• 2512.22905
• Published
• 20
InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Paper
• 2512.18745
• Published
• 12
Latent Implicit Visual Reasoning
Paper
• 2512.21218
• Published
• 69
How Much 3D Do Video Foundation Models Encode?
Paper
• 2512.19949
• Published
• 10
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Paper
• 2512.20557
• Published
• 50
Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Paper
• 2512.21337
• Published
• 31
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Paper
• 2512.20618
• Published
• 55
LTX-2: Efficient Joint Audio-Visual Foundation Model
Paper
• 2601.03233
• Published
• 154
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Paper
• 2601.03193
• Published
• 47
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Paper
• 2512.23412
• Published
• 41
MiMo-V2-Flash Technical Report
Paper
• 2601.02780
• Published
• 35
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Paper
• 2601.02204
• Published
• 62
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Paper
• 2512.24271
• Published
• 63