EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper
• 2402.04252
• Published
• 29
Vision Superalignment: Weak-to-Strong Generalization for Vision
Foundation Models
Paper
• 2402.03749
• Published
• 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper
• 2402.04615
• Published
• 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance
Loss
Paper
• 2402.05008
• Published
• 23
WebLINX: Real-World Website Navigation with Multi-Turn Dialogue
Paper
• 2402.05930
• Published
• 39
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large
Language Models
Paper
• 2402.05935
• Published
• 17
ViGoR: Improving Visual Grounding of Large Vision Language Models with
Fine-Grained Reward Modeling
Paper
• 2402.06118
• Published
• 15
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Paper
• 2402.07456
• Published
• 46
PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
Paper
• 2402.07872
• Published
• 16
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned
Language Models
Paper
• 2402.07865
• Published
• 15
World Model on Million-Length Video And Language With RingAttention
Paper
• 2402.08268
• Published
• 40
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong
Vision-language Adapter
Paper
• 2402.10896
• Published
• 16
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language
Models
Paper
• 2402.10986
• Published
• 81
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Paper
• 2402.12226
• Published
• 45
CoLLaVO: Crayon Large Language and Vision mOdel
Paper
• 2402.11248
• Published
• 22
Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning
Paper
• 2402.11690
• Published
• 9
VideoPrism: A Foundational Visual Encoder for Video Understanding
Paper
• 2402.13217
• Published
• 38
Video ReCap: Recursive Captioning of Hour-Long Videos
Paper
• 2402.13250
• Published
• 26
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
• 2402.13232
• Published
• 16
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on
Deceptive Prompts
Paper
• 2402.13220
• Published
• 14
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large
Vision-Language Models
Paper
• 2402.13577
• Published
• 9
PALO: A Polyglot Large Multimodal Model for 5B People
Paper
• 2402.14818
• Published
• 23
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper
• 2402.14289
• Published
• 20
Sora: A Review on Background, Technology, Limitations, and Opportunities
of Large Vision Models
Paper
• 2402.17177
• Published
• 88
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper
• 2402.19479
• Published
• 35
MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
Paper
• 2403.01422
• Published
• 30
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Paper
• 2403.01487
• Published
• 16
Finetuned Multimodal Language Models Are High-Quality Image-Text Data
Filters
Paper
• 2403.02677
• Published
• 18
Modeling Collaborator: Enabling Subjective Vision Classification With
Minimal Human Effort via LLM Tool-Use
Paper
• 2403.02626
• Published
• 11
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal
Datasets
Paper
• 2403.03194
• Published
• 15
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large
Language Models
Paper
• 2403.03003
• Published
• 11
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper
• 2403.09611
• Published
• 129
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
• 2403.07508
• Published
• 77
Synth^2: Boosting Visual-Language Models with Synthetic Captions and
Image Embeddings
Paper
• 2403.07750
• Published
• 23
DragAnything: Motion Control for Anything using Entity Representation
Paper
• 2403.07420
• Published
• 14
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference
Acceleration for Large Vision-Language Models
Paper
• 2403.06764
• Published
• 27
VideoMamba: State Space Model for Efficient Video Understanding
Paper
• 2403.06977
• Published
• 29
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper
• 2403.05135
• Published
• 45
Gemini 1.5: Unlocking multimodal understanding across millions of tokens
of context
Paper
• 2403.05530
• Published
• 65
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
• 2403.05525
• Published
• 49
VideoElevator: Elevating Video Generation Quality with Versatile
Text-to-Image Diffusion Models
Paper
• 2403.05438
• Published
• 20
Uni-SMART: Universal Science Multimodal Analysis and Research
Transformer
Paper
• 2403.10301
• Published
• 54
VideoAgent: Long-form Video Understanding with Large Language Model as
Agent
Paper
• 2403.10517
• Published
• 37
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
• 2403.11703
• Published
• 17
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
Paper
• 2403.11481
• Published
• 13
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document
Understanding
Paper
• 2403.12895
• Published
• 32
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper
• 2403.12596
• Published
• 11
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual
Math Problems?
Paper
• 2403.14624
• Published
• 53
Can large language models explore in-context?
Paper
• 2403.15371
• Published
• 33
InternVideo2: Scaling Video Foundation Models for Multimodal Video
Understanding
Paper
• 2403.15377
• Published
• 28
SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate
Time series
Paper
• 2403.15360
• Published
• 13
VidLA: Video-Language Alignment at Scale
Paper
• 2403.14870
• Published
• 15
ViTAR: Vision Transformer with Any Resolution
Paper
• 2403.18361
• Published
• 55
Mini-Gemini: Mining the Potential of Multi-modality Vision Language
Models
Paper
• 2403.18814
• Published
• 48
sDPO: Don't Use Your Data All at Once
Paper
• 2403.19270
• Published
• 41
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper
• 2403.18978
• Published
• 15
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision
Language Models
Paper
• 2403.20331
• Published
• 16
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper
• 2404.01197
• Published
• 31
Direct Preference Optimization of Video Large Multimodal Models from
Language Model Reward
Paper
• 2404.01258
• Published
• 12
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with
Interleaved Visual-Textual Tokens
Paper
• 2404.03413
• Published
• 27
LVLM-Intrepret: An Interpretability Tool for Large Vision-Language
Models
Paper
• 2404.03118
• Published
• 25
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept
Matching
Paper
• 2404.03653
• Published
• 35
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Paper
• 2404.05719
• Published
• 83
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video
Understanding
Paper
• 2404.05726
• Published
• 23
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
Paper
• 2404.05674
• Published
• 15
Koala: Key frame-conditioned long video-LLM
Paper
• 2404.04346
• Published
• 7
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model
Handling Resolutions from 336 Pixels to 4K HD
Paper
• 2404.06512
• Published
• 30
Adapting LLaMA Decoder to Vision Transformer
Paper
• 2404.06773
• Published
• 18
BRAVE: Broadening the visual encoding of vision-language models
Paper
• 2404.07204
• Published
• 19
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Paper
• 2404.07448
• Published
• 12
Ferret-v2: An Improved Baseline for Referring and Grounding with Large
Language Models
Paper
• 2404.07973
• Published
• 32
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Paper
• 2404.09990
• Published
• 14
TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal
Large Language Models
Paper
• 2404.09204
• Published
• 11
On Speculative Decoding for Multimodal Large Language Models
Paper
• 2404.08856
• Published
• 13
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models
Paper
• 2404.12387
• Published
• 39
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
• 2404.12390
• Published
• 26
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Paper
• 2404.14239
• Published
• 9
A Multimodal Automated Interpretability Agent
Paper
• 2404.14394
• Published
• 23
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
Paper
• 2404.12803
• Published
• 30
Groma: Localized Visual Tokenization for Grounding Multimodal Large
Language Models
Paper
• 2404.13013
• Published
• 31
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster
Pre-training on Web-scale Image-Text Data
Paper
• 2404.15653
• Published
• 29
Editable Image Elements for Controllable Synthesis
Paper
• 2404.16029
• Published
• 12
MoDE: CLIP Data Experts via Clustering
Paper
• 2404.16030
• Published
• 15
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with
Text-Rich Visual Comprehension
Paper
• 2404.16790
• Published
• 10
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
Models with Open-Source Suites
Paper
• 2404.16821
• Published
• 59
List Items One by One: A New Data Source and Learning Paradigm for
Multimodal LLMs
Paper
• 2404.16375
• Published
• 18
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
• 2404.16994
• Published
• 37
HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring
Unconstrained Photo Collections
Paper
• 2404.16845
• Published
• 7
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models
Paper
• 2404.17672
• Published
• 19
Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual
and Action Representations
Paper
• 2404.17521
• Published
• 13
Automatic Creative Selection with Cross-Modal Matching
Paper
• 2405.00029
• Published
• 9
What matters when building vision-language models?
Paper
• 2405.02246
• Published
• 103
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large
Language Models in Code Generation from Scientific Plots
Paper
• 2405.07990
• Published
• 20
No Time to Waste: Squeeze Time into Channel for Mobile Video
Understanding
Paper
• 2405.08344
• Published
• 15
Understanding the performance gap between online and offline alignment
algorithms
Paper
• 2405.08448
• Published
• 18
SpeechVerse: A Large-scale Generalizable Audio Language Model
Paper
• 2405.08295
• Published
• 19
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large
Language Models
Paper
• 2405.08317
• Published
• 12
Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
Paper
• 2405.09215
• Published
• 22
LoRA Learns Less and Forgets Less
Paper
• 2405.09673
• Published
• 91
Many-Shot In-Context Learning in Multimodal Foundation Models
Paper
• 2405.09798
• Published
• 32
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Paper
• 2405.09818
• Published
• 132
Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection
Paper
• 2405.10300
• Published
• 30
Toon3D: Seeing Cartoons from a New Perspective
Paper
• 2405.10320
• Published
• 22
Octo: An Open-Source Generalist Robot Policy
Paper
• 2405.12213
• Published
• 29
Imp: Highly Capable Large Multimodal Models for Mobile Devices
Paper
• 2405.12107
• Published
• 29
Your Transformer is Secretly Linear
Paper
• 2405.12250
• Published
• 157
Diffusion for World Modeling: Visual Details Matter in Atari
Paper
• 2405.12399
• Published
• 30
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment
Capability
Paper
• 2405.14129
• Published
• 14
CamViG: Camera Aware Image-to-Video Generation with Multimodal
Transformers
Paper
• 2405.13195
• Published
• 12
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision
Models
Paper
• 2405.15574
• Published
• 55
Denoising LM: Pushing the Limits of Error Correction Models for Speech
Recognition
Paper
• 2405.15216
• Published
• 15
An Introduction to Vision-Language Modeling
Paper
• 2405.17247
• Published
• 90
Matryoshka Multimodal Models
Paper
• 2405.17430
• Published
• 34
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding
Models
Paper
• 2405.17428
• Published
• 20
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models
Paper
• 2405.15738
• Published
• 46
Dense Connector for MLLMs
Paper
• 2405.13800
• Published
• 24
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Paper
• 2405.14598
• Published
• 13
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Paper
• 2405.20204
• Published
• 37
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper
• 2405.18669
• Published
• 12
MotionLLM: Understanding Human Behaviors from Human Motions and Videos
Paper
• 2405.20340
• Published
• 20
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of
Multi-modal LLMs in Video Analysis
Paper
• 2405.21075
• Published
• 26
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback
Paper
• 2406.00888
• Published
• 33
Parrot: Multilingual Visual Instruction Tuning
Paper
• 2406.02539
• Published
• 36
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with
LLM
Paper
• 2406.02884
• Published
• 18
ShareGPT4Video: Improving Video Understanding and Generation with Better
Captions
Paper
• 2406.04325
• Published
• 74
AgentGym: Evolving Large Language Model-based Agents across Diverse
Environments
Paper
• 2406.04151
• Published
• 24
Mobile-Agent-v2: Mobile Device Operation Assistant with Effective
Navigation via Multi-Agent Collaboration
Paper
• 2406.01014
• Published
• 33
Vript: A Video Is Worth Thousands of Words
Paper
• 2406.06040
• Published
• 28
An Image is Worth 32 Tokens for Reconstruction and Generation
Paper
• 2406.07550
• Published
• 60
AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Paper
• 2406.06911
• Published
• 12
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio
Understanding in Video-LLMs
Paper
• 2406.07476
• Published
• 36
What If We Recaption Billions of Web Images with LLaMA-3?
Paper
• 2406.08478
• Published
• 43
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation
in Videos
Paper
• 2406.08407
• Published
• 28
Needle In A Multimodal Haystack
Paper
• 2406.07230
• Published
• 54
mDPO: Conditional Preference Optimization for Multimodal Large Language
Models
Paper
• 2406.11839
• Published
• 40
VideoLLM-online: Online Video Large Language Model for Streaming Video
Paper
• 2406.11816
• Published
• 26
TroL: Traversal of Layers for Large Language and Vision Models
Paper
• 2406.12246
• Published
• 36
VoCo-LLaMA: Towards Vision Compression with Large Language Models
Paper
• 2406.12275
• Published
• 31
Benchmarking Multi-Image Understanding in Vision and Language Models:
Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
Paper
• 2406.12742
• Published
• 15
Adversarial Attacks on Multimodal Agents
Paper
• 2406.12814
• Published
• 4
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of
Multimodal Large Language Models
Paper
• 2406.11230
• Published
• 33
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations
for Vision Foundation Models
Paper
• 2406.12649
• Published
• 16
Understanding Hallucinations in Diffusion Models through Mode
Interpolation
Paper
• 2406.09358
• Published
• 5
CMC-Bench: Towards a New Paradigm of Visual Signal Compression
Paper
• 2406.09356
• Published
• 6
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
Paper
• 2406.09406
• Published
• 15
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
Language Models
Paper
• 2406.09403
• Published
• 23
MuirBench: A Comprehensive Benchmark for Robust Multi-image
Understanding
Paper
• 2406.09411
• Published
• 19
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
• 2406.08707
• Published
• 17
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal
Prompts
Paper
• 2406.09162
• Published
• 14
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
Interleaved with Text
Paper
• 2406.08418
• Published
• 32
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on
Mobile Devices
Paper
• 2406.08451
• Published
• 26
Paper
• 2406.04127
• Published
• 39
NaRCan: Natural Refined Canonical Image with Integration of Diffusion
Prior for Video Editing
Paper
• 2406.06523
• Published
• 53
Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
Paper
• 2406.08487
• Published
• 14
VCR: Visual Caption Restoration
Paper
• 2406.06462
• Published
• 13
An Image is Worth More Than 16x16 Patches: Exploring Transformers on
Individual Pixels
Paper
• 2406.09415
• Published
• 51
OpenVLA: An Open-Source Vision-Language-Action Model
Paper
• 2406.09246
• Published
• 43
DiTFastAttn: Attention Compression for Diffusion Transformer Models
Paper
• 2406.08552
• Published
• 25
Physics3D: Learning Physical Properties of 3D Gaussians via Video
Diffusion
Paper
• 2406.04338
• Published
• 39
Hibou: A Family of Foundational Vision Transformers for Pathology
Paper
• 2406.05074
• Published
• 10
Make It Count: Text-to-Image Generation with an Accurate Number of
Objects
Paper
• 2406.10210
• Published
• 78
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context
Reinforcement Learning
Paper
• 2406.08973
• Published
• 89
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs
Paper
• 2406.11833
• Published
• 62
Exploring the Role of Large Language Models in Prompt Encoding for
Diffusion Models
Paper
• 2406.11831
• Published
• 22
From Pixels to Prose: A Large Dataset of Dense Image Captions
Paper
• 2406.10328
• Published
• 18
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Paper
• 2406.14544
• Published
• 35
WildVision: Evaluating Vision-Language Models in the Wild with Human
Preferences
Paper
• 2406.11069
• Published
• 14
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
• 2406.11271
• Published
• 21
Paper
• 2406.11775
• Published
• 9
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper
• 2406.11251
• Published
• 11
The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN
Inversion and High Quality Image Editing
Paper
• 2406.10601
• Published
• 70
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video
Understanding
Paper
• 2406.14515
• Published
• 33
Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation
Modelling in Large Multimodal Models
Paper
• 2406.14035
• Published
• 13
ICAL: Continual Learning of Multimodal Agents by Transforming
Trajectories into Actionable Insights
Paper
• 2406.14596
• Published
• 5
Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical
Report
Paper
• 2406.11403
• Published
• 4
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in
Large Video-Language Models
Paper
• 2406.16338
• Published
• 26
Long Context Transfer from Language to Vision
Paper
• 2406.16852
• Published
• 33
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
• 2406.16860
• Published
• 63
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Paper
• 2406.17770
• Published
• 19
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
Paper
• 2406.15704
• Published
• 6
Octo-planner: On-device Language Model for Planner-Action Agents
Paper
• 2406.18082
• Published
• 48
CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal
LLMs
Paper
• 2406.18521
• Published
• 30
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper
• 2406.15334
• Published
• 9
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large
Language Models
Paper
• 2406.17294
• Published
• 11
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and
Understanding
Paper
• 2406.19389
• Published
• 54
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of
LLMs
Paper
• 2406.18629
• Published
• 42
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Paper
• 2406.18790
• Published
• 34
Simulating Classroom Education with LLM-Empowered Agents
Paper
• 2406.19226
• Published
• 32
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for
Vision-Language Models
Paper
• 2406.10900
• Published
• 11
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Paper
• 2406.20095
• Published
• 18
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything
Model
Paper
• 2406.20076
• Published
• 10
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
Paper
• 2406.17720
• Published
• 8
We-Math: Does Your Large Multimodal Model Achieve Human-like
Mathematical Reasoning?
Paper
• 2407.01284
• Published
• 81
ROS-LLM: A ROS framework for embodied AI with task feedback and
structured reasoning
Paper
• 2406.19741
• Published
• 60
MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and
Efficient Evaluation
Paper
• 2407.00468
• Published
• 35
ColPali: Efficient Document Retrieval with Vision Language Models
Paper
• 2407.01449
• Published
• 51
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables
Open-World Instruction Following Agents
Paper
• 2407.00114
• Published
• 13
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
• 2407.02477
• Published
• 24
InternLM-XComposer-2.5: A Versatile Large Vision Language Model
Supporting Long-Contextual Input and Output
Paper
• 2407.03320
• Published
• 94
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
• 2407.02392
• Published
• 23
Unveiling Encoder-Free Vision-Language Models
Paper
• 2406.11832
• Published
• 54
Flash-VStream: Memory-Based Real-Time Understanding for Long Video
Streams
Paper
• 2406.08085
• Published
• 17
Granular Privacy Control for Geolocation with Vision Language Models
Paper
• 2407.04952
• Published
• 7
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
• 2407.06135
• Published
• 22
Multi-Object Hallucination in Vision-Language Models
Paper
• 2407.06192
• Published
• 12
Vision language models are blind
Paper
• 2407.06581
• Published
• 84
VIMI: Grounding Video Generation through Multi-modal Instruction
Paper
• 2407.06304
• Published
• 10
Video-to-Audio Generation with Hidden Alignment
Paper
• 2407.07464
• Published
• 17
Stark: Social Long-Term Multi-Modal Conversation with Persona
Commonsense Knowledge
Paper
• 2407.03958
• Published
• 21
Understanding Visual Feature Reliance through the Lens of Complexity
Paper
• 2407.06076
• Published
• 6
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting
Region Captions
Paper
• 2407.06723
• Published
• 11
PaliGemma: A versatile 3B VLM for transfer
Paper
• 2407.07726
• Published
• 72
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
• 2407.07895
• Published
• 42
Do Vision and Language Models Share Concepts? A Vector Space Alignment
Study
Paper
• 2302.06555
• Published
• 9
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal
Perception
Paper
• 2407.08303
• Published
• 19
The Synergy between Data and Multi-Modal Large Language Models: A Survey
from Co-Development Perspective
Paper
• 2407.08583
• Published
• 13
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published
• 47
E5-V: Universal Embeddings with Multimodal Large Language Models
Paper
• 2407.12580
• Published
• 42
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
Paper
• 2407.12679
• Published
• 8
AUITestAgent: Automatic Requirements Oriented GUI Function Testing
Paper
• 2407.09018
• Published
• 5
ThinkGrasp: A Vision-Language System for Strategic Part Grasping in
Clutter
Paper
• 2407.11298
• Published
• 6
NavGPT-2: Unleashing Navigational Reasoning Capability for Large
Vision-Language Models
Paper
• 2407.12366
• Published
• 4
Benchmarking Trustworthiness of Multimodal Large Language Models: A
Comprehensive Study
Paper
• 2406.07057
• Published
• 17
EVLM: An Efficient Vision-Language Model for Visual Understanding
Paper
• 2407.14177
• Published
• 45
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document
Understanding
Paper
• 2407.12594
• Published
• 19
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language
Models
Paper
• 2407.15841
• Published
• 40
VideoGameBunny: Towards vision assistants for video games
Paper
• 2407.15295
• Published
• 23
CGB-DM: Content and Graphic Balance Layout Generation with
Transformer-based Diffusion Model
Paper
• 2407.15233
• Published
• 7
OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any
Person
Paper
• 2407.16224
• Published
• 29
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Paper
• 2407.16655
• Published
• 30
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
Large Language Model
Paper
• 2407.16198
• Published
• 13
VILA^2: VILA Augmented VILA
Paper
• 2407.17453
• Published
• 41
Learning to Manipulate Anywhere: A Visual Generalizable Framework For
Reinforcement Learning
Paper
• 2407.15815
• Published
• 14
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Paper
• 2407.17490
• Published
• 31
Efficient Inference of Vision Instruction-Following Models with Elastic
Cache
Paper
• 2407.18121
• Published
• 17
VSSD: Vision Mamba with Non-Causal State Space Duality
Paper
• 2407.18559
• Published
• 20
Wolf: Captioning Everything with a World Summarization Framework
Paper
• 2407.18908
• Published
• 32
Diffusion Feedback Helps CLIP See Better
Paper
• 2407.20171
• Published
• 36
VolDoGer: LLM-assisted Datasets for Domain Generalization in
Vision-Language Tasks
Paper
• 2407.19795
• Published
• 11
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
Paper
• 2407.19985
• Published
• 37
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware
Experts
Paper
• 2407.21770
• Published
• 22
Towards Achieving Human Parity on End-to-end Simultaneous Speech
Translation via LLM Agent
Paper
• 2407.21646
• Published
• 18
ShieldGemma: Generative AI Content Moderation Based on Gemma
Paper
• 2407.21772
• Published
• 14
Open-Vocabulary Audio-Visual Semantic Segmentation
Paper
• 2407.21721
• Published
• 9
SAM 2: Segment Anything in Images and Videos
Paper
• 2408.00714
• Published
• 120
OmniParser for Pure Vision Based GUI Agent
Paper
• 2408.00203
• Published
• 24
Generalized Out-of-Distribution Detection and Beyond in Vision Language
Model Era: A Survey
Paper
• 2407.21794
• Published
• 6
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
• 2408.01800
• Published
• 92
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation
with Multimodal Generative Pretraining
Paper
• 2408.02657
• Published
• 35
Language Model Can Listen While Speaking
Paper
• 2408.02622
• Published
• 40
ExoViP: Step-by-step Verification and Exploration with Exoskeleton
Modules for Compositional Visual Reasoning
Paper
• 2408.02210
• Published
• 9
Operationalizing Contextual Integrity in Privacy-Conscious Assistants
Paper
• 2408.02373
• Published
• 5
LLaVA-OneVision: Easy Visual Task Transfer
Paper
• 2408.03326
• Published
• 61
Diffusion Models as Data Mining Tools
Paper
• 2408.02752
• Published
• 15
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual
Segmentation
Paper
• 2408.01708
• Published
• 4
Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in
Long-Horizon Tasks
Paper
• 2408.03615
• Published
• 31
Openstory++: A Large-scale Dataset and Benchmark for Instance-aware
Open-domain Visual Storytelling
Paper
• 2408.03695
• Published
• 13
Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond
Paper
• 2408.03900
• Published
• 10
Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes from
User's Casual Sketches
Paper
• 2408.04567
• Published
• 26
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language
Models
Paper
• 2408.04594
• Published
• 14
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior
for Part-Level Dynamics
Paper
• 2408.04631
• Published
• 9
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
• 2408.05211
• Published
• 50
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
• 2408.04840
• Published
• 33
UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond
Scaling
Paper
• 2408.04810
• Published
• 24
ControlNeXt: Powerful and Efficient Control for Image and Video
Generation
Paper
• 2408.06070
• Published
• 55
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation
Agents
Paper
• 2408.06327
• Published
• 17
UniPortrait: A Unified Framework for Identity-Preserving Single- and
Multi-Human Image Personalization
Paper
• 2408.05939
• Published
• 14
Imagen 3
Paper
• 2408.07009
• Published
• 62
Amuro & Char: Analyzing the Relationship between Pre-Training and
Fine-Tuning of Large Language Models
Paper
• 2408.06663
• Published
• 16
Paper
• 2408.05366
• Published
• 14
Towards flexible perception with visual memory
Paper
• 2408.08172
• Published
• 23
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
• 2408.08872
• Published
• 101
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations
Paper
• 2408.08459
• Published
• 45
D5RL: Diverse Datasets for Data-Driven Deep Reinforcement Learning
Paper
• 2408.08441
• Published
• 8
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
• 2408.10188
• Published
• 52
MegaFusion: Extend Diffusion Models towards Higher-resolution Image
Generation without Further Tuning
Paper
• 2408.11001
• Published
• 13
Factorized-Dreamer: Training A High-Quality Video Generator with Limited
and Low-Quality Data
Paper
• 2408.10119
• Published
• 17
Transfusion: Predict the Next Token and Diffuse Images with One
Multi-Modal Model
Paper
• 2408.11039
• Published
• 63
NeCo: Improving DINOv2's spatial representations in 19 GPU hours with
Patch Neighbor Consistency
Paper
• 2408.11054
• Published
• 14
Predicting Rewards Alongside Tokens: Non-disruptive Parameter Insertion
for Efficient Inference Intervention in Large Language Model
Paper
• 2408.10764
• Published
• 9
Audio Match Cutting: Finding and Creating Matching Audio Transitions in
Movies and Videos
Paper
• 2408.10998
• Published
• 9
MambaEVT: Event Stream based Visual Object Tracking using State Space
Model
Paper
• 2408.10487
• Published
• 7
FocusLLM: Scaling LLM's Context by Parallel Decoding
Paper
• 2408.11745
• Published
• 25
TWLV-I: Analysis and Insights from Holistic Evaluation on Video
Foundation Models
Paper
• 2408.11318
• Published
• 56
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
• 2408.11817
• Published
• 9
FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive
Prompt Weighting
Paper
• 2408.11706
• Published
• 7
TrackGo: A Flexible and Efficient Method for Controllable Video
Generation
Paper
• 2408.11475
• Published
• 18
Out-of-Distribution Detection with Attention Head Masking for Multimodal
Document Classification
Paper
• 2408.11237
• Published
• 6
Iterative Object Count Optimization for Text-to-image Diffusion Models
Paper
• 2408.11721
• Published
• 6
Sapiens: Foundation for Human Vision Models
Paper
• 2408.12569
• Published
• 94
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
• 2408.12528
• Published
• 51
Open-FinLLMs: Open Multimodal Large Language Models for Financial
Applications
Paper
• 2408.11878
• Published
• 63
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed
Representations
Paper
• 2408.12590
• Published
• 35
Scalable Autoregressive Image Generation with Mamba
Paper
• 2408.12245
• Published
• 26
Real-Time Video Generation with Pyramid Attention Broadcast
Paper
• 2408.12588
• Published
• 17
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for
Large-scale Vision-Language Models
Paper
• 2408.12114
• Published
• 15
Anim-Director: A Large Multimodal Model Powered Agent for Controllable
Animation Video Generation
Paper
• 2408.09787
• Published
• 10
Building and better understanding vision-language models: insights and
future directions
Paper
• 2408.12637
• Published
• 133
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution
Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published
• 26
CustomCrafter: Customized Video Generation with Preserving Motion and
Concept Composition Abilities
Paper
• 2408.13239
• Published
• 11
Foundation Models for Music: A Survey
Paper
• 2408.14340
• Published
• 44
LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!
Paper
• 2408.13402
• Published
• 18
TVG: A Training-free Transition Video Generation Method with Diffusion
Models
Paper
• 2408.13413
• Published
• 14
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and
Deduplication by Introducing a Competitive Large Language Model Baseline
Paper
• 2408.15079
• Published
• 54
Law of Vision Representation in MLLMs
Paper
• 2408.16357
• Published
• 95
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
• 2408.16500
• Published
• 57
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling
Paper
• 2408.16532
• Published
• 50
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper
• 2408.16725
• Published
• 53
VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time
Series Forecasters
Paper
• 2408.17253
• Published
• 39
TableBench: A Comprehensive and Complex Benchmark for Table Question
Answering
Paper
• 2408.09174
• Published
• 52
VideoLLaMB: Long-context Video Understanding with Recurrent Memory
Bridges
Paper
• 2409.01071
• Published
• 27
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world
Videos
Paper
• 2409.02095
• Published
• 37
LinFusion: 1 GPU, 1 Minute, 16K Image
Paper
• 2409.02097
• Published
• 34
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
• 2409.02889
• Published
• 54
Attention Heads of Large Language Models: A Survey
Paper
• 2409.03752
• Published
• 92
Open-MAGVIT2: An Open-Source Project Toward Democratizing
Auto-regressive Visual Generation
Paper
• 2409.04410
• Published
• 25
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper
• 2409.05840
• Published
• 49
Towards a Unified View of Preference Learning for Large Language Models:
A Survey
Paper
• 2409.02795
• Published
• 72
POINTS: Improving Your Vision-language Model with Affordable Strategies
Paper
• 2409.04828
• Published
• 24
Benchmarking Chinese Knowledge Rectification in Large Language Models
Paper
• 2409.05806
• Published
• 15
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
• 2409.06666
• Published
• 60
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Paper
• 2409.06135
• Published
• 16
PingPong: A Benchmark for Role-Playing Language Models with User
Emulation and Multi-Model Evaluation
Paper
• 2409.06820
• Published
• 68
MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View
Synthesis
Paper
• 2409.07129
• Published
• 8
PiTe: Pixel-Temporal Alignment for Large Video-Language Model
Paper
• 2409.07239
• Published
• 15
Ferret: Federated Full-Parameter Tuning at Scale for Large Language
Models
Paper
• 2409.06277
• Published
• 15
Guiding Vision-Language Model Selection for Visual Question-Answering
Across Tasks, Domains, and Knowledge Types
Paper
• 2409.09269
• Published
• 8
One missing piece in Vision and Language: A Survey on Comics
Understanding
Paper
• 2409.09502
• Published
• 24
NVLM: Open Frontier-Class Multimodal LLMs
Paper
• 2409.11402
• Published
• 74
OmniGen: Unified Image Generation
Paper
• 2409.11340
• Published
• 115
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
Paper
• 2409.11355
• Published
• 30
OSV: One Step is Enough for High-Quality Image to Video Generation
Paper
• 2409.11367
• Published
• 14
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
Document Understanding
Paper
• 2409.03420
• Published
• 26
InstantDrag: Improving Interactivity in Drag-based Image Editing
Paper
• 2409.08857
• Published
• 34
AudioBERT: Audio Knowledge Augmented Language Model
Paper
• 2409.08199
• Published
• 5
LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study
Paper
• 2409.08554
• Published
• 3
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution
Paper
• 2409.12191
• Published
• 78
Qwen2.5-Coder Technical Report
Paper
• 2409.12186
• Published
• 153
Preference Tuning with Human Feedback on Language, Speech, and Vision
Tasks: A Survey
Paper
• 2409.11564
• Published
• 20
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Paper
• 2409.12139
• Published
• 12
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary
Resolution
Paper
• 2409.12961
• Published
• 25
StoryMaker: Towards Holistic Consistent Characters in Text-to-image
Generation
Paper
• 2409.12576
• Published
• 16
Imagine yourself: Tuning-Free Personalized Image Generation
Paper
• 2409.13346
• Published
• 69
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating
Satire Comprehension capability of Vision-Language Models
Paper
• 2409.13592
• Published
• 50
Portrait Video Editing Empowered by Multimodal Generative Priors
Paper
• 2409.13591
• Published
• 16
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language
Instructions
Paper
• 2409.15278
• Published
• 24
Phantom of Latent for Large Language and Vision Models
Paper
• 2409.14713
• Published
• 29
Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror
Reflections
Paper
• 2409.14677
• Published
• 15
MIMO: Controllable Character Video Synthesis with Spatial Decomposed
Modeling
Paper
• 2409.16160
• Published
• 34
MonoFormer: One Transformer for Both Diffusion and Autoregression
Paper
• 2409.16280
• Published
• 18
Seeing Faces in Things: A Model and Dataset for Pareidolia
Paper
• 2409.16143
• Published
• 17
Attention Prompting on Image for Large Vision-Language Models
Paper
• 2409.17143
• Published
• 7
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
Multimodal Models
Paper
• 2409.17146
• Published
• 121
MIO: A Foundation Model on Multimodal Tokens
Paper
• 2409.17692
• Published
• 53
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper
• 2409.20566
• Published
• 55
Visual Question Decomposition on Multimodal Large Language Models
Paper
• 2409.19339
• Published
• 8
Loong: Generating Minute-level Long Videos with Autoregressive Language
Models
Paper
• 2410.02757
• Published
• 36
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal
Foundation Models
Paper
• 2410.02740
• Published
• 54
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
• 2410.02712
• Published
• 37
Interpreting and Editing Vision-Language Representations to Mitigate
Hallucinations
Paper
• 2410.02762
• Published
• 9
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short
Videos
Paper
• 2410.02763
• Published
• 7
Addition is All You Need for Energy-efficient Language Models
Paper
• 2410.00907
• Published
• 151
VideoGuide: Improving Video Diffusion Models without Training Through a
Teacher's Guide
Paper
• 2410.04364
• Published
• 29
Navigating the Digital World as Humans Do: Universal Visual Grounding
for GUI Agents
Paper
• 2410.05243
• Published
• 20
UniMuMo: Unified Text, Music and Motion Generation
Paper
• 2410.04534
• Published
• 19
TLDR: Token-Level Detective Reward Model for Large Vision Language
Models
Paper
• 2410.04734
• Published
• 18
OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal
Instruction
Paper
• 2410.04932
• Published
• 9
A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive
Transformer for Efficient Finegrained Image Generation
Paper
• 2410.01912
• Published
• 14
ControlAR: Controllable Image Generation with Autoregressive Models
Paper
• 2410.02705
• Published
• 11
Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video
Large Language Models
Paper
• 2410.03290
• Published
• 7
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
• 2410.05993
• Published
• 111
Personalized Visual Instruction Tuning
Paper
• 2410.07113
• Published
• 70
Pixtral 12B
Paper
• 2410.07073
• Published
• 69
IterComp: Iterative Composition-Aware Feedback Learning from Model
Gallery for Text-to-Image Generation
Paper
• 2410.07171
• Published
• 43
Deciphering Cross-Modal Alignment in Large Vision-Language Models with
Modality Integration Rate
Paper
• 2410.07167
• Published
• 39
Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation
Learning
Paper
• 2410.06373
• Published
• 36
Pyramidal Flow Matching for Efficient Video Generative Modeling
Paper
• 2410.05954
• Published
• 40
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark
for Video Generation
Paper
• 2410.05363
• Published
• 45
Story-Adapter: A Training-free Iterative Framework for Long Story
Visualization
Paper
• 2410.06244
• Published
• 20
MM-Ego: Towards Building Egocentric Multimodal LLMs
Paper
• 2410.07177
• Published
• 22
TweedieMix: Improving Multi-Concept Fusion for Diffusion-based
Image/Video Generation
Paper
• 2410.05591
• Published
• 13
Temporal Reasoning Transfer from Text to Video
Paper
• 2410.06166
• Published
• 13
MLLM as Retriever: Interactively Learning Multimodal Retrieval for
Embodied Agents
Paper
• 2410.03450
• Published
• 36
Intriguing Properties of Large Language and Vision Models
Paper
• 2410.04751
• Published
• 16
Progressive Autoregressive Video Diffusion Models
Paper
• 2410.08151
• Published
• 16
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving
Vision-Linguistic Compositionality
Paper
• 2410.05210
• Published
• 11
Self-Boosting Large Language Models with Synthetic Preference Data
Paper
• 2410.06961
• Published
• 16
WALL-E: World Alignment by Rule Learning Improves World Model-based LLM
Agents
Paper
• 2410.07484
• Published
• 51
Agent S: An Open Agentic Framework that Uses Computers Like a Human
Paper
• 2410.08164
• Published
• 26
GLOV: Guided Large Language Models as Implicit Optimizers for Vision
Language Models
Paper
• 2410.06154
• Published
• 16
Baichuan-Omni Technical Report
Paper
• 2410.08565
• Published
• 87
From Generalist to Specialist: Adapting Vision Language Models via
Task-Specific Visual Instruction Tuning
Paper
• 2410.06456
• Published
• 37
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large
Vision-Language Models
Paper
• 2410.07133
• Published
• 19
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large
Vision-Language Models
Paper
• 2410.10139
• Published
• 51
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality
Documents
Paper
• 2410.10594
• Published
• 29
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Paper
• 2410.11779
• Published
• 26
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions
Paper
• 2410.10816
• Published
• 21
Improving Long-Text Alignment for Text-to-Image Diffusion Models
Paper
• 2410.11817
• Published
• 15
OMCAT: Omni Context Aware Transformer
Paper
• 2410.12109
• Published
• 4
VidEgoThink: Assessing Egocentric Video Understanding Capabilities for
Embodied AI
Paper
• 2410.11623
• Published
• 49
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex
Diagrams in Coding Tasks
Paper
• 2410.12381
• Published
• 43
The Curse of Multi-Modalities: Evaluating Hallucinations of Large
Multimodal Models across Language, Visual, and Audio
Paper
• 2410.12787
• Published
• 30
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation
Paper
• 2410.13848
• Published
• 35
Harnessing Webpage UIs for Text-Rich Visual Understanding
Paper
• 2410.13824
• Published
• 30
WorldCuisines: A Massive-Scale Benchmark for Multilingual and
Multicultural Visual Question Answering on Global Cuisines
Paper
• 2410.12705
• Published
• 32
Fluid: Scaling Autoregressive Text-to-image Generative Models with
Continuous Tokens
Paper
• 2410.13863
• Published
• 37
MobA: A Two-Level Agent System for Efficient Mobile Task Automation
Paper
• 2410.13757
• Published
• 32
Roadmap towards Superhuman Speech Understanding using Large Language
Models
Paper
• 2410.13268
• Published
• 33
Movie Gen: A Cast of Media Foundation Models
Paper
• 2410.13720
• Published
• 100
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise
Motion Control
Paper
• 2410.13830
• Published
• 26
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language
Models
Paper
• 2410.13085
• Published
• 24
A Comparative Study on Reasoning Patterns of OpenAI's o1 Model
Paper
• 2410.13639
• Published
• 19
VidPanos: Generative Panoramic Videos from Casual Panning Videos
Paper
• 2410.13832
• Published
• 13
Remember, Retrieve and Generate: Understanding Infinite Visual Concepts
as Your Personalized Assistant
Paper
• 2410.13360
• Published
• 9
γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large
Language Models
Paper
• 2410.13859
• Published
• 8
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Paper
• 2410.13854
• Published
• 12
FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion
Model
Paper
• 2410.13925
• Published
• 24
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Paper
• 2410.11190
• Published
• 22
SemiEvol: Semi-supervised Fine-tuning for LLM Adaptation
Paper
• 2410.14745
• Published
• 47
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a
Training-Free Memory Tree
Paper
• 2410.16268
• Published
• 69
Baichuan Alignment Technical Report
Paper
• 2410.14940
• Published
• 51
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
• 2410.13861
• Published
• 56
Toward Guidance-Free AR Visual Generation via Condition Contrastive
Alignment
Paper
• 2410.09347
• Published
• 5
AutoTrain: No-code training for state-of-the-art models
Paper
• 2410.15735
• Published
• 59
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety
and Style
Paper
• 2410.16184
• Published
• 25
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper
• 2410.15316
• Published
• 12
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid
Visual Redundancy Reduction
Paper
• 2410.17247
• Published
• 47
Aligning Large Language Models via Self-Steering Optimization
Paper
• 2410.17131
• Published
• 24
Improve Vision Language Model Chain-of-thought Reasoning
Paper
• 2410.16198
• Published
• 26
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video
Even in VLMs
Paper
• 2410.16267
• Published
• 18
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large
Vision-Language Models
Paper
• 2410.17637
• Published
• 35
Can Knowledge Editing Really Correct Hallucinations?
Paper
• 2410.16251
• Published
• 55
LOGO -- Long cOntext aliGnment via efficient preference Optimization
Paper
• 2410.18533
• Published
• 43
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Paper
• 2410.18798
• Published
• 21
Infinity-MM: Scaling Multimodal Performance with Large-Scale and
High-Quality Instruction Data
Paper
• 2410.18558
• Published
• 18
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
Tuning
Paper
• 2410.17779
• Published
• 8
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context
Prompting
Paper
• 2410.17856
• Published
• 52
Continuous Speech Synthesis using per-token Latent Diffusion
Paper
• 2410.16048
• Published
• 29
GPT-4o System Card
Paper
• 2410.21276
• Published
• 87
Vision Search Assistant: Empower Vision-Language Models as Multimodal
Search Engines
Paper
• 2410.21220
• Published
• 11
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper
• 2410.18057
• Published
• 209
Toxicity of the Commons: Curating Open-Source Pre-Training Data
Paper
• 2410.22587
• Published
• 10
ReferEverything: Towards Segmenting Everything We Can Speak of in Videos
Paper
• 2410.23287
• Published
• 19
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
• 2410.23218
• Published
• 49
Personalization of Large Language Models: A Survey
Paper
• 2411.00027
• Published
• 33
Randomized Autoregressive Visual Generation
Paper
• 2411.00776
• Published
• 18
Face Anonymization Made Simple
Paper
• 2411.00762
• Published
• 9
AndroidLab: Training and Systematic Benchmarking of Android Autonomous
Agents
Paper
• 2410.24024
• Published
• 49
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum
Reinforcement Learning
Paper
• 2411.02337
• Published
• 36
How Far is Video Generation from World Model: A Physical Law Perspective
Paper
• 2411.02385
• Published
• 34
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated
Parameters by Tencent
Paper
• 2411.02265
• Published
• 25
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Paper
• 2411.02397
• Published
• 23
AutoVFX: Physically Realistic Video Editing from Natural Language
Instructions
Paper
• 2411.02394
• Published
• 16
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for
Efficient Robot Execution
Paper
• 2411.02359
• Published
• 14
Both Text and Images Leaked! A Systematic Analysis of Multimodal LLM
Data Contamination
Paper
• 2411.03823
• Published
• 49
Adaptive Length Image Tokenization via Recurrent Allocation
Paper
• 2411.02393
• Published
• 13
ReCapture: Generative Video Camera Controls for User-Provided Videos
using Masked Video Fine-Tuning
Paper
• 2411.05003
• Published
• 71
TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for
Image-to-Video Generation
Paper
• 2411.04709
• Published
• 26
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page
Multi-document Understanding
Paper
• 2411.04952
• Published
• 29
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale
Haystacks?
Paper
• 2411.05000
• Published
• 22
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in
Videos
Paper
• 2411.04923
• Published
• 23
Analyzing The Language of Visual Tokens
Paper
• 2411.05001
• Published
• 24
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper
• 2411.04997
• Published
• 39
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned
Vision-Language Models
Paper
• 2411.04097
• Published
• 5
OmniEdit: Building Image Editing Generalist Models Through Specialist
Supervision
Paper
• 2411.07199
• Published
• 50
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language
Models
Paper
• 2411.07140
• Published
• 35
Edify Image: High-Quality Image Generation with Pixel Space Laplacian
Diffusion Models
Paper
• 2411.07126
• Published
• 30
Add-it: Training-Free Object Insertion in Images With Pretrained
Diffusion Models
Paper
• 2411.07232
• Published
• 68
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published
• 31
Autoregressive Models in Vision: A Survey
Paper
• 2411.05902
• Published
• 19
MagicQuill: An Intelligent Interactive Image Editing System
Paper
• 2411.09703
• Published
• 80
Sharingan: Extract User Action Sequence from Desktop Recordings
Paper
• 2411.08768
• Published
• 9
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published
• 129
Region-Aware Text-to-Image Generation via Hard Binding and Soft
Refinement
Paper
• 2411.06558
• Published
• 36
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer
Use
Paper
• 2411.10323
• Published
• 34
Number it: Temporal Grounding Videos like Flipping Manga
Paper
• 2411.10332
• Published
• 14
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices
Paper
• 2411.10640
• Published
• 46
Generative World Explorer
Paper
• 2411.11844
• Published
• 77
AnimateAnything: Consistent and Controllable Animation for Video
Generation
Paper
• 2411.10836
• Published
• 24
SlimLM: An Efficient Small Language Model for On-Device Document
Assistance
Paper
• 2411.09944
• Published
• 12
Adaptive Decoding via Latent Preference Optimization
Paper
• 2411.09661
• Published
• 10
StableV2V: Stablizing Shape Consistency in Video-to-Video Editing
Paper
• 2411.11045
• Published
• 11
RedPajama: an Open Dataset for Training Large Language Models
Paper
• 2411.12372
• Published
• 56
SymDPO: Boosting In-Context Learning of Large Multimodal Models with
Symbol Demonstration Direct Preference Optimization
Paper
• 2411.11909
• Published
• 22
FlipSketch: Flipping Static Drawings to Text-Guided Sketch Animations
Paper
• 2411.10818
• Published
• 26
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text,
and Architectural Enhancements
Paper
• 2411.12044
• Published
• 14
Continuous Speculative Decoding for Autoregressive Image Generation
Paper
• 2411.11925
• Published
• 16
Enhancing the Reasoning Ability of Multimodal Large Language Models via
Mixed Preference Optimization
Paper
• 2411.10442
• Published
• 87
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published
• 47
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
• 2411.14432
• Published
• 25
Large Multi-modal Models Can Interpret Features in Large Multi-modal
Models
Paper
• 2411.14982
• Published
• 19
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple
Distillation, Big Progress or Bitter Lesson?
Paper
• 2411.16489
• Published
• 45
One Diffusion to Generate Them All
Paper
• 2411.16318
• Published
• 28
DreamRunner: Fine-Grained Storytelling Video Generation with
Retrieval-Augmented Motion Adaptation
Paper
• 2411.16657
• Published
• 19
Factorized Visual Tokenization and Generation
Paper
• 2411.16681
• Published
• 19
TEXGen: a Generative Diffusion Model for Mesh Textures
Paper
• 2411.14740
• Published
• 17
ROICtrl: Boosting Instance Control for Visual Generation
Paper
• 2411.17949
• Published
• 87
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
• 2411.17465
• Published
• 89
SketchAgent: Language-Driven Sequential Sketch Generation
Paper
• 2411.17673
• Published
• 18
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for
Training-Free Acceleration
Paper
• 2411.17686
• Published
• 19
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper
• 2411.15296
• Published
• 21
Large Language Model-Brained GUI Agents: A Survey
Paper
• 2411.18279
• Published
• 30
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
• 2411.17991
• Published
• 5
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper
• 2411.18203
• Published
• 40
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
• 2411.19930
• Published
• 31
Yi-Lightning Technical Report
Paper
• 2412.01253
• Published
• 28
X-Prompt: Towards Universal In-Context Image Generation in
Auto-Regressive Vision Language Foundation Models
Paper
• 2412.01824
• Published
• 64
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation
Paper
• 2412.00927
• Published
• 29
Open-Sora Plan: Open-Source Large Video Generation Model
Paper
• 2412.00131
• Published
• 33
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction
with 3D Autonomous Characters
Paper
• 2412.00174
• Published
• 23
VisOnlyQA: Large Vision Language Models Still Struggle with Visual
Perception of Geometric Information
Paper
• 2412.00947
• Published
• 8
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
Audio-Visual Information?
Paper
• 2412.02611
• Published
• 25
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
• 2412.03555
• Published
• 133
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation
Paper
• 2412.03069
• Published
• 34
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene
Understanding
Paper
• 2412.00493
• Published
• 17
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual
Prompt Instruction Tuning
Paper
• 2412.03565
• Published
• 10
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper
• 2412.04467
• Published
• 117
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
• 2412.04424
• Published
• 62
NVILA: Efficient Frontier Visual Language Models
Paper
• 2412.04468
• Published
• 60
Negative Token Merging: Image-based Adversarial Feature Guidance
Paper
• 2412.01339
• Published
• 22
Personalized Multimodal Large Language Models: A Survey
Paper
• 2412.02142
• Published
• 13
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
Paper
• 2412.01169
• Published
• 13
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay
Paper
• 2412.04449
• Published
• 7
Scaling Inference-Time Search with Vision Value Model for Improved
Visual Comprehension
Paper
• 2412.03704
• Published
• 6
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling
Paper
• 2412.05271
• Published
• 160
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at
Scale
Paper
• 2412.05237
• Published
• 46
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
Paper
• 2412.04814
• Published
• 46
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step
Diffusion
Paper
• 2412.04301
• Published
• 40
CompCap: Improving Multimodal Large Language Models with Composite
Captions
Paper
• 2412.05243
• Published
• 20
Mind the Time: Temporally-Controlled Multi-Event Video Generation
Paper
• 2412.05263
• Published
• 10
BigDocs: An Open and Permissively-Licensed Dataset for Training
Multimodal Models on Document and Code Tasks
Paper
• 2412.04626
• Published
• 13
Training Large Language Models to Reason in a Continuous Latent Space
Paper
• 2412.06769
• Published
• 94
Around the World in 80 Timesteps: A Generative Approach to Global Visual
Geolocation
Paper
• 2412.06781
• Published
• 23
Maya: An Instruction Finetuned Multilingual Multimodal Model
Paper
• 2412.07112
• Published
• 28
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Paper
• 2412.04432
• Published
• 16
Exploring Multi-Grained Concept Annotations for Multimodal Large
Language Models
Paper
• 2412.05939
• Published
• 15
DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for
Customized Manga Generation
Paper
• 2412.07589
• Published
• 48
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Paper
• 2412.03548
• Published
• 17
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
• 2412.08443
• Published
• 38
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex
Image-Text Models with Structural Annotations
Paper
• 2412.08580
• Published
• 45
MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
Paper
• 2412.07147
• Published
• 5
StreamChat: Chatting with Streaming Video
Paper
• 2412.08646
• Published
• 18
InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
Long-term Streaming Video and Audio Interactions
Paper
• 2412.09596
• Published
• 97
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published
• 54
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Paper
• 2412.09501
• Published
• 48
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
• 2412.08635
• Published
• 49
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via
Multimodal LLM
Paper
• 2412.09618
• Published
• 21
VisionArena: 230K Real World User-VLM Conversations with Preference
Labels
Paper
• 2412.08687
• Published
• 13
Arbitrary-steps Image Super-resolution via Diffusion Inversion
Paper
• 2412.09013
• Published
• 13
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
• 2412.10360
• Published
• 147
GenEx: Generating an Explorable World
Paper
• 2412.09624
• Published
• 98
InstanceCap: Improving Text-to-Video Generation via Instance-aware
Structured Caption
Paper
• 2412.09283
• Published
• 19
Multimodal Music Generation with Explicit Bridges and Retrieval
Augmentation
Paper
• 2412.09428
• Published
• 7
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
• 2412.09604
• Published
• 38
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper
• 2412.09871
• Published
• 108
BrushEdit: All-In-One Image Inpainting and Editing
Paper
• 2412.10316
• Published
• 36
VidTok: A Versatile and Open-Source Video Tokenizer
Paper
• 2412.13061
• Published
• 8
Paper
• 2412.13501
• Published
• 29
Progressive Multimodal Reasoning via Active Retrieval
Paper
• 2412.14835
• Published
• 73
MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
Paper
• 2412.14475
• Published
• 57
Descriptive Caption Enhancement with Visual Specialists for Multimodal
Perception
Paper
• 2412.14233
• Published
• 6
Large Motion Video Autoencoding with Cross-modal Video VAE
Paper
• 2412.17805
• Published
• 24
Friends-MMC: A Dataset for Multi-modal Multi-party Conversation
Understanding
Paper
• 2412.17295
• Published
• 9
Flowing from Words to Pixels: A Framework for Cross-Modality Evolution
Paper
• 2412.15213
• Published
• 28
Affordance-Aware Object Insertion via Mask-Aware Dual Diffusion
Paper
• 2412.14462
• Published
• 15
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal
Audio-Video Generation
Paper
• 2412.15191
• Published
• 5
Parallelized Autoregressive Visual Generation
Paper
• 2412.15119
• Published
• 53
Taming Multimodal Joint Training for High-Quality Video-to-Audio
Synthesis
Paper
• 2412.15322
• Published
• 20
Sequence Matters: Harnessing Video Models in 3D Super-Resolution
Paper
• 2412.11525
• Published
• 11
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published
• 42
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models
with Flow Matching
Paper
• 2412.17153
• Published
• 39
NILE: Internal Consistency Alignment in Large Language Models
Paper
• 2412.16686
• Published
• 8
DepthLab: From Partial to Complete
Paper
• 2412.18153
• Published
• 36
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D
Scene Understanding
Paper
• 2412.18450
• Published
• 36
Fourier Position Embedding: Enhancing Attention's Periodic Extension for
Length Generalization
Paper
• 2412.17739
• Published
• 41
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion
Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Paper
• 2412.18597
• Published
• 20
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation
System?
Paper
• 2412.18495
• Published
• 9
Video-Panda: Parameter-efficient Alignment for Encoder-free
Video-Language Models
Paper
• 2412.18609
• Published
• 17
Bridging the Data Provenance Gap Across Text, Speech and Video
Paper
• 2412.17847
• Published
• 10
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
• 2412.18319
• Published
• 39
YuLan-Mini: An Open Data-efficient Language Model
Paper
• 2412.17743
• Published
• 66
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
• 2412.18072
• Published
• 18
Molar: Multimodal LLMs with Collaborative Filtering Alignment for
Enhanced Sequential Recommendation
Paper
• 2412.18176
• Published
• 16
Paper
• 2412.18653
• Published
• 86
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Paper
• 2412.18619
• Published
• 60
Task Preference Optimization: Improving Multimodal Large Language Models
with Vision Task Alignment
Paper
• 2412.19326
• Published
• 18
Safeguard Fine-Tuned LLMs Through Pre- and Post-Tuning Model Merging
Paper
• 2412.19512
• Published
• 9
Explanatory Instructions: Towards Unified Vision Tasks Understanding and
Zero-shot Generalization
Paper
• 2412.18525
• Published
• 74
Edicho: Consistent Image Editing in the Wild
Paper
• 2412.21079
• Published
• 22
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow
Matching and Clap-Ranked Preference Optimization
Paper
• 2412.21037
• Published
• 24
Are Vision-Language Models Truly Understanding Multi-vision Sensor?
Paper
• 2412.20750
• Published
• 20
2.5 Years in Class: A Multimodal Textbook for Vision-Language
Pretraining
Paper
• 2501.00958
• Published
• 109
VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion
Control
Paper
• 2501.01427
• Published
• 53
LTX-Video: Realtime Video Latent Diffusion
Paper
• 2501.00103
• Published
• 50
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with
Video LLM
Paper
• 2501.00599
• Published
• 46
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper
• 2501.00192
• Published
• 31
A3: Android Agent Arena for Mobile GUI Agents
Paper
• 2501.01149
• Published
• 22
Unifying Specialized Visual Encoders for Video Language Models
Paper
• 2501.01426
• Published
• 20
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Paper
• 2501.01957
• Published
• 47
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One
Vision Token
Paper
• 2501.03895
• Published
• 52
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
• 2501.02955
• Published
• 44
Cosmos World Foundation Model Platform for Physical AI
Paper
• 2501.03575
• Published
• 82
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language
Models
Paper
• 2501.03262
• Published
• 104
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of
Images and Videos
Paper
• 2501.04001
• Published
• 47
OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment
across Language with Real-time Self-Aware Emotional Speech Synthesis
Paper
• 2501.04561
• Published
• 17
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning
and Reflection
Paper
• 2501.04575
• Published
• 25
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Paper
• 2501.05366
• Published
• 102
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich
Paradigm for Direct Preference Optimization
Paper
• 2501.03271
• Published
• 10
The GAN is dead; long live the GAN! A Modern GAN Baseline
Paper
• 2501.05441
• Published
• 95
Enhancing Human-Like Responses in Large Language Models
Paper
• 2501.05032
• Published
• 61
An Empirical Study of Autoregressive Pre-training from Videos
Paper
• 2501.05453
• Published
• 41
Centurio: On Drivers of Multilingual Ability of Large Vision-Language
Model
Paper
• 2501.05122
• Published
• 19
On Computational Limits and Provably Efficient Criteria of Visual
Autoregressive Models: A Fine-Grained Complexity Analysis
Paper
• 2501.04377
• Published
• 14
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Paper
• 2501.05874
• Published
• 75
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in
Multimodal Large Language Models
Paper
• 2501.05767
• Published
• 29
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published
• 44
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Paper
• 2501.06282
• Published
• 53
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token
Marks
Paper
• 2501.08326
• Published
• 34
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale
Pre-Training
Paper
• 2501.07556
• Published
• 7
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published
• 30
RepVideo: Rethinking Cross-Layer Representation for Video Generation
Paper
• 2501.08994
• Published
• 15
ReFocus: Visual Editing as a Chain of Thought for Structured Image
Understanding
Paper
• 2501.05452
• Published
• 15
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Paper
• 2501.05707
• Published
• 20
VideoAuteur: Towards Long Narrative Video Generation
Paper
• 2501.06173
• Published
• 31
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Paper
• 2501.06842
• Published
• 16
Evaluating Sample Utility for Data Selection by Mimicking Model Weights
Paper
• 2501.06708
• Published
• 5
MiniMax-01: Scaling Foundation Models with Lightning Attention
Paper
• 2501.08313
• Published
• 300
Democratizing Text-to-Image Masked Generative Models with Compact
Text-Aware One-Dimensional Tokens
Paper
• 2501.07730
• Published
• 18
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Paper
• 2501.08292
• Published
• 17
Tarsier2: Advancing Large Vision-Language Models from Detailed Video
Description to Comprehensive Video Understanding
Paper
• 2501.07888
• Published
• 15
OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for
LLM Training
Paper
• 2501.08197
• Published
• 9
Parameter-Inverted Image Pyramid Networks for Visual Perception and
Multimodal Understanding
Paper
• 2501.07783
• Published
• 8
MINIMA: Modality Invariant Image Matching
Paper
• 2412.19412
• Published
• 4
OmniThink: Expanding Knowledge Boundaries in Machine Writing through
Thinking
Paper
• 2501.09751
• Published
• 46
Learnings from Scaling Visual Tokenizers for Reconstruction and
Generation
Paper
• 2501.09755
• Published
• 35
Do generative video models learn physical principles from watching
videos?
Paper
• 2501.09038
• Published
• 34
FAST: Efficient Action Tokenization for Vision-Language-Action Models
Paper
• 2501.09747
• Published
• 28
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published
• 27
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
• 2501.12380
• Published
• 84
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published
• 28
Can We Generate Images with CoT? Let's Verify and Reinforce Image
Generation Step by Step
Paper
• 2501.13926
• Published
• 43
Baichuan-Omni-1.5 Technical Report
Paper
• 2501.15368
• Published
• 60
Qwen2.5-1M Technical Report
Paper
• 2501.15383
• Published
• 72
Towards General-Purpose Model-Free Reinforcement Learning
Paper
• 2501.16142
• Published
• 31
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for
Speech Generation
Paper
• 2501.15907
• Published
• 17
Are Vision Language Models Texture or Shape Biased and Can We Steer
Them?
Paper
• 2403.09193
• Published
• 9
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model
Post-training
Paper
• 2501.17161
• Published
• 124
PixelWorld: Towards Perceiving Everything as Pixels
Paper
• 2501.19339
• Published
• 17
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human
Animation Models
Paper
• 2502.01061
• Published
• 223
Process Reinforcement through Implicit Rewards
Paper
• 2502.01456
• Published
• 62
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal
Understanding
Paper
• 2502.01341
• Published
• 39
Paper
• 2501.14249
• Published
• 77
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video
Understanding
Paper
• 2501.13106
• Published
• 90
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Paper
• 2501.12599
• Published
• 126
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative
Textual Feedback
Paper
• 2501.12895
• Published
• 61
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning
Paper
• 2501.12948
• Published
• 440
Token Assorted: Mixing Latent and Text Tokens for Improved Language
Model Reasoning
Paper
• 2502.03275
• Published
• 18
Analyze Feature Flow to Enhance Interpretation and Steering in Language
Models
Paper
• 2502.03032
• Published
• 60
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive
Modality Alignment
Paper
• 2502.04328
• Published
• 29
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
Paper
• 2502.05173
• Published
• 64
Fast Video Generation with Sliding Tile Attention
Paper
• 2502.04507
• Published
• 51
Goku: Flow Based Video Generative Foundation Models
Paper
• 2502.04896
• Published
• 106
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth
Approach
Paper
• 2502.05171
• Published
• 152
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive
Multimodal Understanding and Generation
Paper
• 2502.05178
• Published
• 10
On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for
Mobile Devices
Paper
• 2502.04363
• Published
• 12
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time
Scaling
Paper
• 2502.06703
• Published
• 152
Scaling Pre-training to One Hundred Billion Data for Vision Language
Models
Paper
• 2502.07617
• Published
• 29
Expect the Unexpected: FailSafe Long Context QA for Finance
Paper
• 2502.06329
• Published
• 133
Magic 1-For-1: Generating One Minute Video Clips within One Minute
Paper
• 2502.07701
• Published
• 36
Light-A-Video: Training-free Video Relighting via Progressive Light
Fusion
Paper
• 2502.08590
• Published
• 42
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation
Paper
• 2502.07870
• Published
• 45
WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation
Paper
• 2502.08047
• Published
• 28
TransMLA: Multi-head Latent Attention Is All You Need
Paper
• 2502.07864
• Published
• 57
mmE5: Improving Multimodal Multilingual Embeddings via High-quality
Synthetic Data
Paper
• 2502.08468
• Published
• 16
The Stochastic Parrot on LLM's Shoulder: A Summative Assessment of
Physical Concept Understanding
Paper
• 2502.08946
• Published
• 191
Skrr: Skip and Re-use Text Encoder Layers for Memory Efficient
Text-to-Image Generation
Paper
• 2502.08690
• Published
• 43
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published
• 35
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published
• 43
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of
Video Foundation Model
Paper
• 2502.10248
• Published
• 57
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper
• 2502.10391
• Published
• 34
Large Language Diffusion Models
Paper
• 2502.09992
• Published
• 126
Learning Getting-Up Policies for Real-World Humanoid Robots
Paper
• 2502.12152
• Published
• 42
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse
Attention
Paper
• 2502.11089
• Published
• 167
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on
Continual Pre-Training
Paper
• 2502.11196
• Published
• 23
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning
in Diffusion Models
Paper
• 2502.10458
• Published
• 38
HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and
Generation
Paper
• 2502.12148
• Published
• 17
Intuitive physics understanding emerges from self-supervised pretraining
on natural videos
Paper
• 2502.11831
• Published
• 20
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper
• 2502.11775
• Published
• 9
Ask in Any Modality: A Comprehensive Survey on Multimodal
Retrieval-Augmented Generation
Paper
• 2502.08826
• Published
• 17
ILIAS: Instance-Level Image retrieval At Scale
Paper
• 2502.11748
• Published
• 4
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper
• 2502.12900
• Published
• 86
Continuous Diffusion Model for Language Modeling
Paper
• 2502.11564
• Published
• 53
Phantom: Subject-consistent video generation via cross-modal alignment
Paper
• 2502.11079
• Published
• 59
Magma: A Foundation Model for Multimodal AI Agents
Paper
• 2502.13130
• Published
• 58
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance
Software Engineering?
Paper
• 2502.12115
• Published
• 46
Multimodal Mamba: Decoder-only Multimodal State Space Model via
Quadratic to Linear Distillation
Paper
• 2502.13145
• Published
• 38
RealSyn: An Effective and Scalable Multimodal Interleaved Document
Transformation Paradigm
Paper
• 2502.12513
• Published
• 16
Harnessing Vision Models for Time Series Analysis: A Survey
Paper
• 2502.08869
• Published
• 2
Qwen2.5-VL Technical Report
Paper
• 2502.13923
• Published
• 214
On the Trustworthiness of Generative Foundation Models: Guideline,
Assessment, and Perspective
Paper
• 2502.14296
• Published
• 45
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic
Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published
• 158
How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM?
Paper
• 2502.14502
• Published
• 91
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in
Vision-Language Models
Paper
• 2502.14834
• Published
• 24
Does Time Have Its Place? Temporal Heads: Where Language Models Recall
Time-specific Information
Paper
• 2502.14258
• Published
• 26
PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex
Task Automation on PC
Paper
• 2502.14282
• Published
• 29
How to Get Your LLM to Generate Challenging Problems for Evaluation
Paper
• 2502.14678
• Published
• 18
Dynamic Concepts Personalization from Single Videos
Paper
• 2502.14844
• Published
• 16
Scaling Text-Rich Image Understanding via Code-Guided Synthetic
Multimodal Data Generation
Paper
• 2502.14846
• Published
• 14
NAVIG: Natural Language-guided Analysis with Vision Language Models for
Image Geo-localization
Paper
• 2502.14638
• Published
• 11
From RAG to Memory: Non-Parametric Continual Learning for Large Language
Models
Paper
• 2502.14802
• Published
• 13
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the
Limits of Embedding Space Capacity
Paper
• 2502.13063
• Published
• 74
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit
Matching Visual Cues
Paper
• 2502.12084
• Published
• 34
LLM-Microscope: Uncovering the Hidden Role of Punctuation in Context
Memory of Transformers
Paper
• 2502.15007
• Published
• 174
SurveyX: Academic Survey Automation via Large Language Models
Paper
• 2502.14776
• Published
• 100
PhotoDoodle: Learning Artistic Image Editing from Few-Shot Pairwise Data
Paper
• 2502.14397
• Published
• 41
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper
• 2502.17157
• Published
• 52
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published
• 18
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon
Robotic Manipulation
Paper
• 2502.16707
• Published
• 13
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper
• 2502.18411
• Published
• 74
SpargeAttn: Accurate Sparse Attention Accelerating Any Model Inference
Paper
• 2502.18137
• Published
• 60
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open
Software Evolution
Paper
• 2502.18449
• Published
• 75
KV-Edit: Training-Free Image Editing for Precise Background Preservation
Paper
• 2502.17363
• Published
• 37
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent
Image Generation
Paper
• 2502.18364
• Published
• 36
K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs
Paper
• 2502.18461
• Published
• 17
Introducing Visual Perception Token into Multimodal Large Language Model
Paper
• 2502.17425
• Published
• 16
MLLMs Know Where to Look: Training-free Perception of Small Visual
Details with Multimodal LLMs
Paper
• 2502.17422
• Published
• 7
LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven
Language Representation
Paper
• 2502.18302
• Published
• 5
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
Paper
• 2502.17092
• Published
• 3
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
• 2502.19400
• Published
• 47
Towards an AI co-scientist
Paper
• 2502.18864
• Published
• 52
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language
Models (VLMs) via Reinforcement Learning
Paper
• 2502.19634
• Published
• 63
UniTok: A Unified Tokenizer for Visual Generation and Understanding
Paper
• 2502.20321
• Published
• 30
Multimodal Representation Alignment for Image Generation: Text-Image
Interleaved Control Is Easier Than You Think
Paper
• 2502.20172
• Published
• 29
HAIC: Improving Human Action Understanding and Generation with Better
Captions for Multi-modal Large Language Models
Paper
• 2502.20811
• Published
• 3
Chain of Draft: Thinking Faster by Writing Less
Paper
• 2502.18600
• Published
• 50
Tell me why: Visual foundation models as self-explainable classifiers
Paper
• 2502.19577
• Published
• 11
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers
Paper
• 2502.20545
• Published
• 22
MIGE: A Unified Framework for Multimodal Instruction-Based Image
Generation and Editing
Paper
• 2502.21291
• Published
• 5
Predictive Data Selection: The Data That Predicts Is the Data That
Teaches
Paper
• 2503.00808
• Published
• 56
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
• 2503.01785
• Published
• 86
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
• 2503.01743
• Published
• 89
Qilin: A Multimodal Information Retrieval Dataset with APP-level User
Sessions
Paper
• 2503.00501
• Published
• 12
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open
Language Models
Paper
• 2402.03300
• Published
• 140
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in
Multimodal Cycles
Paper
• 2503.03651
• Published
• 16
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended
Language Interface
Paper
• 2503.01342
• Published
• 8
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence
Generation up to 100K Tokens
Paper
• 2502.18890
• Published
• 30
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from
Inputs
Paper
• 2503.02003
• Published
• 48
Process-based Self-Rewarding Language Models
Paper
• 2503.03746
• Published
• 39
CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time
Cognitive Task Solving and Reasoning in UAVs
Paper
• 2503.01378
• Published
• 5
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper
• 2503.04130
• Published
• 96
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Paper
• 2503.04724
• Published
• 72
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding
and Expert Reasoning Abilities
Paper
• 2503.03983
• Published
• 27
How to Steer LLM Latents for Hallucination Detection?
Paper
• 2503.01917
• Published
• 11
The Best of Both Worlds: Integrating Language Models and Diffusion
Models for Video Generation
Paper
• 2503.04606
• Published
• 9
Unified Reward Model for Multimodal Understanding and Generation
Paper
• 2503.05236
• Published
• 123
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model
Paper
• 2503.05132
• Published
• 57
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive
Cognitive-Inspired Sketching
Paper
• 2503.05179
• Published
• 46
S2S-Arena, Evaluating Speech2Speech Protocols on Instruction Following
with Paralinguistic Information
Paper
• 2503.05085
• Published
• 47
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with
Reinforcing Learning
Paper
• 2503.05379
• Published
• 38
VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play
Context Control
Paper
• 2503.05639
• Published
• 26
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos
via Diffusion Models
Paper
• 2503.05638
• Published
• 20
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale
Reinforcement Learning
Paper
• 2503.07365
• Published
• 61
Automated Movie Generation via Multi-Agent CoT Planning
Paper
• 2503.07314
• Published
• 44
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue
Learning
Paper
• 2503.07002
• Published
• 39
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large
Language Models
Paper
• 2503.06749
• Published
• 31
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural
Vision-Language Dataset for Southeast Asia
Paper
• 2503.07920
• Published
• 101
MagicInfinite: Generating Infinite Talking Videos with Your Words and
Voice
Paper
• 2503.05978
• Published
• 36
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
• 2503.07536
• Published
• 88
Video Action Differencing
Paper
• 2503.07860
• Published
• 33
UniF^2ace: Fine-grained Face Understanding and Generation
with Unified Multimodal Models
Paper
• 2503.08120
• Published
• 32
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories
Paper
• 2503.08625
• Published
• 27
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Paper
• 2503.07604
• Published
• 23
LightGen: Efficient Image Generation through Knowledge Distillation and
Direct Preference Optimization
Paper
• 2503.08619
• Published
• 20
EasyControl: Adding Efficient and Flexible Control for Diffusion
Transformer
Paper
• 2503.07027
• Published
• 29
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted
Contrastive Learning
Paper
• 2503.04812
• Published
• 17
Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
Paper
• 2503.02199
• Published
• 8
Seedream 2.0: A Native Chinese-English Bilingual Image Generation
Foundation Model
Paper
• 2503.07703
• Published
• 37
Gemini Embedding: Generalizable Embeddings from Gemini
Paper
• 2503.07891
• Published
• 45
OmniMamba: Efficient and Unified Multimodal Understanding and Generation
via State Space Models
Paper
• 2503.08686
• Published
• 19
CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic
Audiovisual Narrative Processing
Paper
• 2503.06940
• Published
• 11
Transformers without Normalization
Paper
• 2503.10622
• Published
• 170
Charting and Navigating Hugging Face's Model Atlas
Paper
• 2503.10633
• Published
• 93
World Modeling Makes a Better Planner: Dual Preference Optimization for
Embodied Task Planning
Paper
• 2503.10480
• Published
• 56
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model
for Visual Generation and Editing
Paper
• 2503.10639
• Published
• 53
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published
• 36
4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large
Language Models
Paper
• 2503.10437
• Published
• 34
CoRe^2: Collect, Reflect and Refine to Generate Better and Faster
Paper
• 2503.09662
• Published
• 33
OmniPaint: Mastering Object-Oriented Editing via Disentangled
Insertion-Removal Inpainting
Paper
• 2503.08677
• Published
• 29
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and
Beyond
Paper
• 2503.10460
• Published
• 30
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Paper
• 2503.10596
• Published
• 18
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published
• 17
Open-Sora 2.0: Training a Commercial-Level Video Generation Model in
$200k
Paper
• 2503.09642
• Published
• 20
PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference
Time by Leveraging Sparsity
Paper
• 2503.07677
• Published
• 86
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Paper
• 2503.11647
• Published
• 146
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories
Generation in End-to-End Autonomous Driving
Paper
• 2503.05689
• Published
• 3
SmolDocling: An ultra-compact vision-language model for end-to-end
multi-modal document conversion
Paper
• 2503.11576
• Published
• 147
Large-scale Pre-training for Grounded Video Caption Generation
Paper
• 2503.10781
• Published
• 16
ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model
with Interleaved Multimodal Generation via Asymmetric Synergy
Paper
• 2503.06542
• Published
• 7
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal
Consistent Video Generation
Paper
• 2503.06053
• Published
• 138
Being-0: A Humanoid Robotic Agent with Vision-Language Models and
Modular Skills
Paper
• 2503.12533
• Published
• 68
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale
Text-to-Image Models
Paper
• 2503.12885
• Published
• 43
Edit Transfer: Learning Image Editing via Vision In-Context Relations
Paper
• 2503.13327
• Published
• 29
BlobCtrl: A Unified and Flexible Framework for Element-level Image
Generation and Editing
Paper
• 2503.13434
• Published
• 27
R1-VL: Learning to Reason with Multimodal Large Language Models via
Step-wise Group Relative Policy Optimization
Paper
• 2503.12937
• Published
• 30
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper
• 2503.12605
• Published
• 35
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published
• 32
Aligning Multimodal LLM with Human Preference: A Survey
Paper
• 2503.14504
• Published
• 26
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal
Control
Paper
• 2503.14492
• Published
• 20
TULIP: Towards Unified Language-Image Pretraining
Paper
• 2503.15485
• Published
• 49
φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time
Exploration and Exploitation
Paper
• 2503.13288
• Published
• 51
Temporal Regularization Makes Your Video Generator Stronger
Paper
• 2503.15417
• Published
• 22
VERIFY: A Benchmark of Visual Explanation and Reasoning for
Investigating Multimodal Reasoning Fidelity
Paper
• 2503.11557
• Published
• 22
Stop Overthinking: A Survey on Efficient Reasoning for Large Language
Models
Paper
• 2503.16419
• Published
• 77
Unleashing Vecset Diffusion Model for Fast Shape Generation
Paper
• 2503.16302
• Published
• 43
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers
Paper
• 2503.14487
• Published
• 28
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play
Visual Games with Keyboards and Mouse
Paper
• 2503.16365
• Published
• 41
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Paper
• 2503.16418
• Published
• 36
Ultra-Resolution Adaptation with Ease
Paper
• 2503.16322
• Published
• 13
M3: 3D-Spatial MultiModal Memory
Paper
• 2503.16413
• Published
• 15
See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language
Balance to Mitigate Dominant Modality Bias
Paper
• 2503.13834
• Published
• 5
Expert Race: A Flexible Routing Strategy for Scaling Diffusion
Transformer with Mixture of Experts
Paper
• 2503.16057
• Published
• 14
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Paper
• 2503.14476
• Published
• 144
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper
• 2503.14456
• Published
• 153
Paper
• 2503.14378
• Published
• 61
Reinforcement Learning for Reasoning in Small LLMs: What Works and What
Doesn't
Paper
• 2503.16219
• Published
• 52
Inside-Out: Hidden Factual Knowledge in LLMs
Paper
• 2503.15299
• Published
• 56
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published
• 50
Where do Large Vision-Language Models Look at when Answering Questions?
Paper
• 2503.13891
• Published
• 8
MAPS: A Multi-Agent Framework Based on Big Seven Personality and
Socratic Guidance for Multimodal Scientific Problem Solving
Paper
• 2503.16905
• Published
• 54
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning
via Iterative Self-Improvement
Paper
• 2503.17352
• Published
• 24
Bridging Continuous and Discrete Tokens for Autoregressive Visual
Generation
Paper
• 2503.16430
• Published
• 34
When Preferences Diverge: Aligning Diffusion Models with Minority-Aware
Adaptive DPO
Paper
• 2503.16921
• Published
• 6
From Head to Tail: Towards Balanced Representation in Large
Vision-Language Models through Adaptive Data Calibration
Paper
• 2503.12821
• Published
• 10
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical
Problems
Paper
• 2503.16549
• Published
• 15
Why Do Multi-Agent LLM Systems Fail?
Paper
• 2503.13657
• Published
• 48
When Less is Enough: Adaptive Token Reduction for Efficient Image
Representation
Paper
• 2503.16660
• Published
• 72
Can Large Vision Language Models Read Maps Like a Human?
Paper
• 2503.14607
• Published
• 10
GAEA: A Geolocation Aware Conversational Model
Paper
• 2503.16423
• Published
• 6
I Have Covered All the Bases Here: Interpreting Reasoning Features in
Large Language Models via Sparse Autoencoders
Paper
• 2503.18878
• Published
• 119
Video-T1: Test-Time Scaling for Video Generation
Paper
• 2503.18942
• Published
• 90
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for
Open Base Models in the Wild
Paper
• 2503.18892
• Published
• 31
Aether: Geometric-Aware Unified World Modeling
Paper
• 2503.18945
• Published
• 28
Judge Anything: MLLM as a Judge Across Any Modality
Paper
• 2503.17489
• Published
• 23
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published
• 20
Mind with Eyes: from Language Reasoning to Multimodal Reasoning
Paper
• 2503.18071
• Published
• 3
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
• 2503.19622
• Published
• 31
CoMP: Continual Multimodal Pre-training for Vision Foundation Models
Paper
• 2503.18931
• Published
• 30
Long-Context Autoregressive Video Modeling with Next-Frame Prediction
Paper
• 2503.19325
• Published
• 73
Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection
with Artifact Explanation
Paper
• 2503.14905
• Published
• 20
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only
Training For Human-Centered Decision Making
Paper
• 2503.16965
• Published
• 4
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
• 2503.19990
• Published
• 35
Dita: Scaling Diffusion Transformer for Generalist
Vision-Language-Action Policy
Paper
• 2503.19757
• Published
• 51
GenHancer: Imperfect Generative Models are Secretly Strong
Vision-Centric Enhancers
Paper
• 2503.19480
• Published
• 16
Qwen2.5-Omni Technical Report
Paper
• 2503.20215
• Published
• 170
Wan: Open and Advanced Large-Scale Video Generative Models
Paper
• 2503.20314
• Published
• 59
Open Deep Search: Democratizing Search with Open-source Reasoning Agents
Paper
• 2503.20201
• Published
• 48
Beyond Words: Advancing Long-Text Image Generation via Multimodal
Autoregressive Models
Paper
• 2503.20198
• Published
• 4
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
• 2503.21776
• Published
• 79
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
• 2503.21620
• Published
• 62
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
• 2503.21696
• Published
• 23
A Survey of Efficient Reasoning for Large Reasoning Models: Language,
Multimodality, and Beyond
Paper
• 2503.21614
• Published
• 43
OThink-MR1: Stimulating multimodal generalized reasoning capabilities
via dynamic reinforcement learning
Paper
• 2503.16081
• Published
• 28
Your ViT is Secretly an Image Segmentation Model
Paper
• 2503.19108
• Published
• 25
On Large Multimodal Models as Open-World Image Classifiers
Paper
• 2503.21851
• Published
• 5
TextCrafter: Accurately Rendering Multiple Texts in Complex Visual
Scenes
Paper
• 2503.23461
• Published
• 94
Any2Caption: Interpreting Any Condition to Caption for Controllable Video
Generation
Paper
• 2503.24379
• Published
• 76
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
• 2503.24376
• Published
• 38
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal
LLMs on Academic Resources
Paper
• 2504.00595
• Published
• 37
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on
Elementary School-Level Reasoning Problems?
Paper
• 2504.00509
• Published
• 24
MoCha: Towards Movie-Grade Talking Character Synthesis
Paper
• 2503.23307
• Published
• 139
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement
Learning on the Base Model
Paper
• 2503.24290
• Published
• 62
Unicorn: Text-Only Data Synthesis for Vision Language Model Training
Paper
• 2503.22655
• Published
• 39
TokenHSI: Unified Synthesis of Physical Human-Scene Interactions through
Task Tokenization
Paper
• 2503.19901
• Published
• 41
Expanding RL with Verifiable Rewards Across Diverse Domains
Paper
• 2503.23829
• Published
• 24
Paper
• 2504.00927
• Published
• 56
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming
Video Contexts
Paper
• 2503.22952
• Published
• 17
Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Paper
• 2504.00557
• Published
• 15
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Paper
• 2504.00072
• Published
• 6
MergeVQ: A Unified Framework for Visual Generation and Representation
with Disentangled Token Merging and Quantization
Paper
• 2504.00999
• Published
• 95
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper
• 2504.00883
• Published
• 67
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation
with Hybrid Guidance
Paper
• 2504.01724
• Published
• 68
AnimeGamer: Infinite Anime Life Simulation with Next Game State
Prediction
Paper
• 2504.01014
• Published
• 70
Towards Physically Plausible Video Generation via VLM Planning
Paper
• 2503.23368
• Published
• 40
Understanding R1-Zero-Like Training: A Critical Perspective
Paper
• 2503.20783
• Published
• 59
ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and
Diffusion Refinement
Paper
• 2504.01934
• Published
• 22
Articulated Kinematics Distillation from Video Diffusion Models
Paper
• 2504.01204
• Published
• 23
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to
Gaussian Noise in Perturbation-based Attacks
Paper
• 2504.01308
• Published
• 14
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
Paper
• 2503.23573
• Published
• 12
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal
Representations
Paper
• 2503.18817
• Published
• 3
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual
Editing
Paper
• 2504.02826
• Published
• 68
WikiVideo: Article Generation from Multiple Videos
Paper
• 2504.00939
• Published
• 37
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
• 2504.02782
• Published
• 57
Inference-Time Scaling for Generalist Reward Modeling
Paper
• 2504.02495
• Published
• 58
Rethinking RL Scaling for Vision Language Models: A Transparent,
From-Scratch Framework and Comprehensive Evaluation Scheme
Paper
• 2504.02587
• Published
• 32
ShortV: Efficient Multimodal Large Language Models by Freezing Visual
Tokens in Ineffective Layers
Paper
• 2504.00502
• Published
• 26
VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via
Iterative Instruction Tuning and Reinforcement Learning
Paper
• 2504.02949
• Published
• 21
MME-Unify: A Comprehensive Benchmark for Unified Multimodal
Understanding and Generation Models
Paper
• 2504.03641
• Published
• 14
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Paper
• 2504.01328
• Published
• 7
URECA: Unique Region Caption Anything
Paper
• 2504.05305
• Published
• 35
Concept Lancet: Image Editing with Compositional Representation
Transplant
Paper
• 2504.02828
• Published
• 16
LiveVQA: Live Visual Knowledge Seeking
Paper
• 2504.05288
• Published
• 15
SmolVLM: Redefining small and efficient multimodal models
Paper
• 2504.05299
• Published
• 205
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(v1)
Paper
• 2504.03151
• Published
• 15
Tuning-Free Image Editing with Fidelity and Editability via Unified
Latent Diffusion Model
Paper
• 2504.05594
• Published
• 11
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
• 2504.05599
• Published
• 85
Rethinking Reflection in Pre-Training
Paper
• 2504.04022
• Published
• 80
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language
Models for Domain-Generalized Semantic Segmentation
Paper
• 2504.03193
• Published
• 4
OmniSVG: A Unified Scalable Vector Graphics Generation Model
Paper
• 2504.06263
• Published
• 183
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric
Capabilities in Multimodal Large Language Models
Paper
• 2504.06148
• Published
• 13
OmniCaptioner: One Captioner to Rule Them All
Paper
• 2504.07089
• Published
• 20
Caption Anything in Video: Fine-grained Object-centric Captioning via
Spatiotemporal Multimodal Prompting
Paper
• 2504.05541
• Published
• 15
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
• 2504.06958
• Published
• 13
Paper
• 2504.07491
• Published
• 137
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning
Paper
• 2504.07128
• Published
• 87
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
• 2504.07956
• Published
• 46
VisualCloze: A Universal Image Generation Framework via Visual
In-Context Learning
Paper
• 2504.07960
• Published
• 50
MM-IFEngine: Towards Multimodal Instruction Following
Paper
• 2504.07957
• Published
• 35
Scaling Laws for Native Multimodal Models
Paper
• 2504.07951
• Published
• 30
Towards Visual Text Grounding of Multimodal Large Language Model
Paper
• 2504.04974
• Published
• 17
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
Paper
• 2504.08685
• Published
• 130
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for
Autoregressive Image Generation
Paper
• 2504.08736
• Published
• 46
MineWorld: a Real-Time and Open-Source Interactive World Model on
Minecraft
Paper
• 2504.08388
• Published
• 42
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Paper
• 2504.07615
• Published
• 35
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models
with Reinforcement Learning
Paper
• 2504.08837
• Published
• 43
FUSION: Fully Integration of Vision-Language Representations for Deep
Cross-Modal Understanding
Paper
• 2504.09925
• Published
• 39
Have we unified image generation and understanding yet? An empirical
study of GPT-4o's image generation ability
Paper
• 2504.08003
• Published
• 49
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
Mavors: Multi-granularity Video Representation for Multimodal Large
Language Model
Paper
• 2504.10068
• Published
• 30
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
• 2504.09641
• Published
• 16
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper
• 2504.09130
• Published
• 12
Reasoning Models Can Be Effective Without Thinking
Paper
• 2504.09858
• Published
• 12
The Scalability of Simplicity: Empirical Analysis of Vision-Language
Learning with a Single Transformer
Paper
• 2504.10462
• Published
• 15
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
Paper
• 2504.10465
• Published
• 27
Generate, but Verify: Reducing Hallucination in Vision-Language Models
with Retrospective Resampling
Paper
• 2504.13169
• Published
• 39
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference
Optimization for Large Video Models
Paper
• 2504.13122
• Published
• 20
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large
Vision-Language Models
Paper
• 2504.11468
• Published
• 30
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain
Knowledge
Paper
• 2504.10342
• Published
• 11
Multimodal Long Video Modeling Based on Temporal Dynamic Context
Paper
• 2504.10443
• Published
• 3
Summarization of Multimodal Presentations with Vision-Language Models:
Study of the Effect of Modalities and Structure
Paper
• 2504.10049
• Published
• 2
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published
• 48
Perception Encoder: The best visual embeddings are not at the output of
the network
Paper
• 2504.13181
• Published
• 35
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Paper
• 2504.13055
• Published
• 19
DMM: Building a Versatile Image Generation Model via Distillation-Based
Model Merging
Paper
• 2504.12364
• Published
• 22
PerceptionLM: Open-Access Data and Models for Detailed Visual
Understanding
Paper
• 2504.13180
• Published
• 20
Could Thinking Multilingually Empower LLM Reasoning?
Paper
• 2504.11833
• Published
• 29
Does Reinforcement Learning Really Incentivize Reasoning Capacity in
LLMs Beyond the Base Model?
Paper
• 2504.13837
• Published
• 139
UFO2: The Desktop AgentOS
Paper
• 2504.14603
• Published
• 29
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration
Benchmark
Paper
• 2504.13805
• Published
• 11
Vidi: Large Multimodal Models for Video Understanding and Editing
Paper
• 2504.15681
• Published
• 14
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
Paper
• 2504.16030
• Published
• 36
Seeing from Another Perspective: Evaluating Multi-View Understanding in
MLLMs
Paper
• 2504.15280
• Published
• 25
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to
Deliberative Reasoners
Paper
• 2504.14239
• Published
• 14
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning
in Multimodal LLMs
Paper
• 2504.15415
• Published
• 23
Describe Anything: Detailed Localized Image and Video Captioning
Paper
• 2504.16072
• Published
• 64
Eagle 2.5: Boosting Long-Context Post-Training for Frontier
Vision-Language Models
Paper
• 2504.15271
• Published
• 67
AerialMegaDepth: Learning Aerial-Ground Reconstruction and View
Synthesis
Paper
• 2504.13157
• Published
• 20
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls
for Video Generation
Paper
• 2504.14899
• Published
• 20
An LMM for Efficient Video Understanding via Reinforced Compression of
Video Cubes
Paper
• 2504.15270
• Published
• 9
BookWorld: From Novels to Interactive Agent Societies for Creative Story
Generation
Paper
• 2504.14538
• Published
• 30
Personalized Text-to-Image Generation with Auto-Regressive Models
Paper
• 2504.13162
• Published
• 18
From Reflection to Perfection: Scaling Inference-Time Optimization for
Text-to-Image Diffusion Models via Reflection Tuning
Paper
• 2504.16080
• Published
• 15
MR. Video: "MapReduce" is the Principle for Long Video Understanding
Paper
• 2504.16082
• Published
• 5
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
• 2504.15279
• Published
• 78
DreamID: High-Fidelity and Fast diffusion-based Face Swapping via
Triplet ID Group Learning
Paper
• 2504.14509
• Published
• 53
Trillion 7B Technical Report
Paper
• 2504.15431
• Published
• 38
I-Con: A Unifying Framework for Representation Learning
Paper
• 2504.16929
• Published
• 30
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in
Large Language Models
Paper
• 2504.16074
• Published
• 36
DreamO: A Unified Framework for Image Customization
Paper
• 2504.16915
• Published
• 24
Progressive Language-guided Visual Learning for Multi-Task Visual
Grounding
Paper
• 2504.16145
• Published
• 2
Paper2Code: Automating Code Generation from Scientific Papers in Machine
Learning
Paper
• 2504.17192
• Published
• 123
Step1X-Edit: A Practical Framework for General Image Editing
Paper
• 2504.17761
• Published
• 92
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image
Generation
Paper
• 2504.17502
• Published
• 55
Breaking the Modality Barrier: Universal Embedding Learning with
Multimodal LLMs
Paper
• 2504.17432
• Published
• 40
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery
Simulation
Paper
• 2504.17207
• Published
• 30
Token-Shuffle: Towards High-Resolution Image Generation with
Autoregressive Models
Paper
• 2504.17789
• Published
• 23
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs
Paper
• 2504.17040
• Published
• 13
Boosting Generative Image Modeling via Joint Image-Feature Synthesis
Paper
• 2504.16064
• Published
• 14
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming
Videos
Paper
• 2504.17343
• Published
• 13
ViSMaP: Unsupervised Hour-long Video Summarisation by Meta-Prompting
Paper
• 2504.15921
• Published
• 7
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Paper
• 2504.16656
• Published
• 57
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published
• 155
Can Large Language Models Help Multimodal Language Analysis? MMLA: A
Comprehensive Benchmark
Paper
• 2504.16427
• Published
• 18
DC-SAM: In-Context Segment Anything in Images and Videos via Dual
Consistency
Paper
• 2504.12080
• Published
• 8
Contrastive Localized Language-Image Pre-Training
Paper
• 2410.02746
• Published
• 37
CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified
Multiplet Upcycling
Paper
• 2409.19291
• Published
• 21
GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning
LLMs
Paper
• 2410.03645
• Published
• 3
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and
Prospects
Paper
• 2504.19838
• Published
• 23
RepText: Rendering Visual Text via Replicating
Paper
• 2504.19724
• Published
• 31
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual
Dependency
Paper
• 2504.18589
• Published
• 13
Clinical knowledge in LLMs does not translate to human interactions
Paper
• 2504.18919
• Published
• 26
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
Paper
• 2504.19162
• Published
• 18
MMInference: Accelerating Pre-filling for Long-Context VLMs via
Modality-Aware Permutation Sparse Attention
Paper
• 2504.16083
• Published
• 8
NORA: A Small Open-Sourced Generalist Vision Language Action Model for
Embodied Tasks
Paper
• 2504.19854
• Published
• 7
Reinforcement Learning for Reasoning in Large Language Models with One
Training Example
Paper
• 2504.20571
• Published
• 98
Paper
• 2504.20879
• Published
• 72
UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with
Diverse Modalities and Granularities
Paper
• 2504.20734
• Published
• 62
YoChameleon: Personalized Vision and Language Generation
Paper
• 2504.20998
• Published
• 12
X-Fusion: Introducing New Modality to Frozen Large Language Models
Paper
• 2504.20996
• Published
• 13
A Review of 3D Object Detection with Vision-Language Models
Paper
• 2504.18738
• Published
• 2
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language
Models in Math
Paper
• 2504.21233
• Published
• 49
100 Days After DeepSeek-R1: A Survey on Replication Studies and More
Directions for Reasoning Language Models
Paper
• 2505.00551
• Published
• 36
COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Paper
• 2504.21850
• Published
• 27
ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D
Physics Modeling for Complex Motion and Interaction
Paper
• 2504.21855
• Published
• 13
A Survey of Interactive Generative Video
Paper
• 2504.21853
• Published
• 46
T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level
and Token-level CoT
Paper
• 2505.00703
• Published
• 44
PixelHacker: Image Inpainting with Structural and Semantic Consistency
Paper
• 2504.20438
• Published
• 44
Improving Editability in Image Generation with Layer-wise Memory
Paper
• 2505.01079
• Published
• 29
Voila: Voice-Language Foundation Models for Real-Time Autonomous
Interaction and Voice Role-Play
Paper
• 2505.02707
• Published
• 85
RM-R1: Reward Modeling as Reasoning
Paper
• 2505.02387
• Published
• 81
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement
Learning
Paper
• 2505.02835
• Published
• 28
Ming-Lite-Uni: Advancements in Unified Architecture for Natural
Multimodal Interaction
Paper
• 2505.02471
• Published
• 15
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based
Image Editing
Paper
• 2505.02370
• Published
• 14
Agentic Reasoning and Tool Integration for LLMs via Reinforcement
Learning
Paper
• 2505.01441
• Published
• 39
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive
Streaming Speech Synthesis
Paper
• 2505.02625
• Published
• 23
HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene
Generation
Paper
• 2504.21650
• Published
• 16
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement
Fine-Tuning
Paper
• 2505.03318
• Published
• 92
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Paper
• 2505.03570
• Published
• 8
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue
Resolution
Paper
• 2505.04606
• Published
• 9
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision
Encoders for Multimodal Learning
Paper
• 2505.04601
• Published
• 29
Beyond Recognition: Evaluating Visual Perspective Taking in Vision
Language Models
Paper
• 2505.03821
• Published
• 24
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video
Generation
Paper
• 2505.04512
• Published
• 36
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Paper
• 2505.04588
• Published
• 65
Unified Multimodal Understanding and Generation Models: Advances,
Challenges, and Opportunities
Paper
• 2505.02567
• Published
• 80
Scenethesis: A Language and Vision Agentic Framework for 3D Scene
Generation
Paper
• 2505.02836
• Published
• 8
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient
Large Speech-Language Model
Paper
• 2505.03739
• Published
• 9
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
• 2505.04921
• Published
• 186
On Path to Multimodal Generalist: General-Level and General-Bench
Paper
• 2505.04620
• Published
• 82
Flow-GRPO: Training Flow Matching Models via Online RL
Paper
• 2505.05470
• Published
• 88
FG-CLIP: Fine-Grained Visual and Textual Alignment
Paper
• 2505.05071
• Published
• 18
X-Reasoner: Towards Generalizable Reasoning Across Modalities and
Domains
Paper
• 2505.03981
• Published
• 15
Vision-Language-Action Models: Concepts, Progress, Applications and
Challenges
Paper
• 2505.04769
• Published
• 10
Bielik v3 Small: Technical Report
Paper
• 2505.02550
• Published
• 68
Bielik 11B v2 Technical Report
Paper
• 2505.02410
• Published
• 54
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published
• 155
Unified Continuous Generative Models
Paper
• 2505.07447
• Published
• 42
DanceGRPO: Unleashing GRPO on Visual Generation
Paper
• 2505.07818
• Published
• 32
Skywork-VL Reward: An Effective Reward Model for Multimodal
Understanding and Reasoning
Paper
• 2505.07263
• Published
• 30
H³DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
Paper
• 2505.07819
• Published
• 5
MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills
Paper
• 2505.06176
• Published
• 12
DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for
Dynamic Reranking in Retrieval-Augmented Generation
Paper
• 2505.07233
• Published
• 8
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable
Speaker Encoder
Paper
• 2505.07916
• Published
• 134
Fast Text-to-Audio Generation with Adversarial Post-Training
Paper
• 2505.08175
• Published
• 25
Bring Reason to Vision: Understanding Perception and Reasoning through
Model Merging
Paper
• 2505.05464
• Published
• 11
Aya Vision: Advancing the Frontier of Multilingual Multimodality
Paper
• 2505.08751
• Published
• 13
SkillFormer: Unified Multi-View Video Understanding for Proficiency
Estimation
Paper
• 2505.08665
• Published
• 5
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture,
Training and Dataset
Paper
• 2505.09568
• Published
• 99
Insights into DeepSeek-V3: Scaling Challenges and Reflections on
Hardware for AI Architectures
Paper
• 2505.09343
• Published
• 76
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal
Mathematical Reasoning
Paper
• 2505.10557
• Published
• 47
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Paper
• 2505.04410
• Published
• 44
WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
Paper
• 2505.09558
• Published
• 10
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Paper
• 2505.09439
• Published
• 10
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large
Video Language Models
Paper
• 2505.08455
• Published
• 5
Understanding and Mitigating Toxicity in Image-Text Pretraining
Datasets: A Case Study on LLaVA
Paper
• 2505.06356
• Published
• 3
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large
Reasoning Models
Paper
• 2505.10554
• Published
• 120
OpenThinkIMG: Learning to Think with Images via Visual Tool
Reinforcement Learning
Paper
• 2505.08617
• Published
• 42
WorldPM: Scaling Human Preference Modeling
Paper
• 2505.10527
• Published
• 34
End-to-End Vision Tokenizer Tuning
Paper
• 2505.10562
• Published
• 22
Exploring the Deep Fusion of Large Language Models and Diffusion
Transformers for Text-to-Image Synthesis
Paper
• 2505.10046
• Published
• 9
AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection
Paper
• 2505.09926
• Published
• 6
Paper
• 2505.09388
• Published
• 334
MMLongBench: Benchmarking Long-Context Vision-Language Models
Effectively and Thoroughly
Paper
• 2505.10610
• Published
• 55
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Paper
• 2505.11049
• Published
• 61
Visual Planning: Let's Think Only with Images
Paper
• 2505.11409
• Published
• 57
Simple Semi-supervised Knowledge Distillation from Vision-Language
Models via Dual-Head Optimization
Paper
• 2505.07675
• Published
• 21
Chain-of-Model Learning for Language Model
Paper
• 2505.11820
• Published
• 121
AdaptThink: Reasoning Models Can Learn When to Think
Paper
• 2505.13417
• Published
• 83
Model Merging in Pre-training of Large Language Models
Paper
• 2505.12082
• Published
• 40
Through the Looking Glass: Common Sense Consistency Evaluation of Weird
Images
Paper
• 2505.07704
• Published
• 29
Faster Video Diffusion with Trainable Sparse Attention
Paper
• 2505.13389
• Published
• 38
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and
Vision-Language Models
Paper
• 2505.13180
• Published
• 13
VisionReasoner: Unified Visual Perception and Reasoning via
Reinforcement Learning
Paper
• 2505.12081
• Published
• 18
R3: Robust Rubric-Agnostic Reward Models
Paper
• 2505.13388
• Published
• 11
Efficient Speech Language Modeling via Energy Distance in Continuous
Latent Space
Paper
• 2505.13181
• Published
• 9
A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation
through Low-Rank Clone
Paper
• 2505.12781
• Published
• 2
Emerging Properties in Unified Multimodal Pretraining
Paper
• 2505.14683
• Published
• 133
Paper
• 2505.14674
• Published
• 37
Visual Agentic Reinforcement Fine-Tuning
Paper
• 2505.14246
• Published
• 32
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via
Reinforcement Learning to Rank
Paper
• 2505.14460
• Published
• 33
Think Only When You Need with Large Hybrid-Reasoning Models
Paper
• 2505.14631
• Published
• 20
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with
Reinforcement Learning
Paper
• 2505.14677
• Published
• 15
Hunyuan-Game: Industrial-grade Intelligent Game Creation Model
Paper
• 2505.14135
• Published
• 16
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Paper
• 2505.14640
• Published
• 16
Two Experts Are All You Need for Steering Thinking: Reinforcing
Cognitive Effort in MoE Reasoning Models Without Additional Training
Paper
• 2505.14681
• Published
• 10
Visual Instruction Bottleneck Tuning
Paper
• 2505.13946
• Published
• 10
Not All Correct Answers Are Equal: Why Your Distillation Source Matters
Paper
• 2505.14464
• Published
• 10
Lessons from Defending Gemini Against Indirect Prompt Injections
Paper
• 2505.14534
• Published
• 8
The Hallucination Tax of Reinforcement Finetuning
Paper
• 2505.13988
• Published
• 8
Incorporating brain-inspired mechanisms for multimodal learning in
artificial intelligence
Paper
• 2505.10176
• Published
• 3
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents
Paper
• 2505.15277
• Published
• 104
MMaDA: Multimodal Large Diffusion Language Models
Paper
• 2505.15809
• Published
• 98
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement
Learning
Paper
• 2505.14231
• Published
• 53
Diffusion vs. Autoregressive Language Models: A Text Embedding
Perspective
Paper
• 2505.15045
• Published
• 55
Vid2World: Crafting Video Diffusion Models to Interactive World Models
Paper
• 2505.14357
• Published
• 27
When to Continue Thinking: Adaptive Thinking Mode Switching for
Efficient Reasoning
Paper
• 2505.15400
• Published
• 23
lmgame-Bench: How Good are LLMs at Playing Games?
Paper
• 2505.15146
• Published
• 20
IA-T2I: Internet-Augmented Text-to-Image Generation
Paper
• 2505.15779
• Published
• 14
Deliberation on Priors: Trustworthy Reasoning of Large Language Models
on Knowledge Graphs
Paper
• 2505.15210
• Published
• 18
RLVR-World: Training World Models with Reinforcement Learning
Paper
• 2505.13934
• Published
• 16
ConvSearch-R1: Enhancing Query Reformulation for Conversational Search
with Reasoning via Reinforcement Learning
Paper
• 2505.15776
• Published
• 11
HumaniBench: A Human-Centric Framework for Large Multimodal Models
Evaluation
Paper
• 2505.11454
• Published
• 5
QuickVideo: Real-Time Long Video Understanding with System Algorithm
Co-Design
Paper
• 2505.16175
• Published
• 42
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Paper
• 2505.16933
• Published
• 34
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation
with Reinforcement Learning
Paper
• 2505.17022
• Published
• 27
Risk-Averse Reinforcement Learning with Itakura-Saito Loss
Paper
• 2505.16925
• Published
• 26
Understanding Generative AI Capabilities in Everyday Image Editing Tasks
Paper
• 2505.16181
• Published
• 24
Training-Free Efficient Video Generation via Dynamic Token Carving
Paper
• 2505.16864
• Published
• 24
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel
Decoding
Paper
• 2505.16990
• Published
• 22
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game
Quality Assurance
Paper
• 2505.15952
• Published
• 20
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Paper
• 2505.17018
• Published
• 15
Backdoor Cleaning without External Guidance in MLLM Fine-tuning
Paper
• 2505.16916
• Published
• 17
WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement
Learning
Paper
• 2505.16421
• Published
• 19
LaViDa: A Large Diffusion Language Model for Multimodal Understanding
Paper
• 2505.16839
• Published
• 13
GRIT: Teaching MLLMs to Think with Images
Paper
• 2505.15879
• Published
• 13
Think or Not? Selective Reasoning via Reinforcement Learning for
Vision-Language Models
Paper
• 2505.16854
• Published
• 11
OViP: Online Vision-Language Preference Learning
Paper
• 2505.15963
• Published
• 9
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal
Large Language Models
Paper
• 2505.17015
• Published
• 9
VLM-R³: Region Recognition, Reasoning, and Refinement for Enhanced
Multimodal Chain-of-Thought
Paper
• 2505.16192
• Published
• 12
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot
Manipulation Datasets
Paper
• 2505.15517
• Published
• 4
How Do Large Vision-Language Models See Text in Image? Unveiling the
Distinctive Role of OCR Heads
Paper
• 2505.15865
• Published
• 5
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture
Understanding
Paper
• 2505.14462
• Published
• 4
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper
• 2505.18129
• Published
• 62
Teaching with Lies: Curriculum DPO on Synthetic Negatives for
Hallucination Detection
Paper
• 2505.17558
• Published
• 15
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large
Language Models
Paper
• 2505.16211
• Published
• 18
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark
Study
Paper
• 2505.15389
• Published
• 8
Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal
Large Language Models
Paper
• 2505.18536
• Published
• 18
QwenLong-L1: Towards Long-Context Large Reasoning Models with
Reinforcement Learning
Paper
• 2505.17667
• Published
• 88
Reasoning Model is Stubborn: Diagnosing Instruction Overriding in
Reasoning Models
Paper
• 2505.17225
• Published
• 64
QwenLong-CPRS: Towards ∞-LLMs with Dynamic Context Optimization
Paper
• 2505.18092
• Published
• 43
RBench-V: A Primary Assessment for Visual Reasoning Models with
Multi-modal Outputs
Paper
• 2505.16770
• Published
• 12
Interactive Post-Training for Vision-Language-Action Models
Paper
• 2505.17016
• Published
• 6
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language
Model via Reinforcement Learning
Paper
• 2505.13426
• Published
• 13
Error Typing for Smarter Rewards: Improving Process Reward Models with
Error-Aware Hierarchical Supervision
Paper
• 2505.19706
• Published
• 3
RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation
via Reinforcement Learning
Paper
• 2505.17540
• Published
• 7
Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Paper
• 2505.19147
• Published
• 145
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual
Reasoning from Transit Maps
Paper
• 2505.18675
• Published
• 26
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System
Collaboration
Paper
• 2505.20256
• Published
• 19
REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
Paper
• 2505.20046
• Published
• 18
Hard Negative Contrastive Learning for Fine-Grained Geometric
Understanding in Large Multimodal Models
Paper
• 2505.20152
• Published
• 11
Interleaved Reasoning for Large Language Models via Reinforcement
Learning
Paper
• 2505.19640
• Published
• 15
InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer
Interaction
Paper
• 2505.10887
• Published
• 10
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Paper
• 2505.15804
• Published
• 10
Jodi: Unification of Visual Generation and Understanding via Joint
Modeling
Paper
• 2505.19084
• Published
• 20
Towards Holistic Evaluation of Large Audio-Language Models: A
Comprehensive Survey
Paper
• 2505.15957
• Published
• 3
Seeing is Believing, but How Much? A Comprehensive Analysis of
Verbalized Calibration in Vision-Language Models
Paper
• 2505.20236
• Published
• 3
Textual Steering Vectors Can Improve Visual Understanding in Multimodal
Large Language Models
Paper
• 2505.14071
• Published
• 1
Paper2Poster: Towards Multimodal Poster Automation from Scientific
Papers
Paper
• 2505.21497
• Published
• 109
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Paper
• 2505.21374
• Published
• 28
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in
Video Scenarios
Paper
• 2505.21333
• Published
• 38
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Paper
• 2505.21327
• Published
• 83
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
Paper
• 2505.16459
• Published
• 45
SeePhys: Does Seeing Help Thinking? – Benchmarking Vision-Based Physics
Reasoning
Paper
• 2505.19099
• Published
• 7
Active-O3: Empowering Multimodal Large Language Models with Active
Perception via GRPO
Paper
• 2505.21457
• Published
• 16
UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based
Mobile GUI Agents
Paper
• 2505.21496
• Published
• 38
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in
Vision-Language Models
Paper
• 2505.21500
• Published
• 13
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic
Scientific Workflows
Paper
• 2505.19897
• Published
• 104
MLLMs are Deeply Affected by Modality Bias
Paper
• 2505.18657
• Published
• 5
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper
• 2505.22453
• Published
• 46
Advancing Multimodal Reasoning via Reinforcement Learning with Cold
Start
Paper
• 2505.22334
• Published
• 36
The Entropy Mechanism of Reinforcement Learning for Reasoning Language
Models
Paper
• 2505.22617
• Published
• 131
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large
Model Token Routing
Paper
• 2505.21600
• Published
• 71
Skywork Open Reasoner 1 Technical Report
Paper
• 2505.22312
• Published
• 54
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Paper
• 2505.22651
• Published
• 48
Fostering Video Reasoning via Next-Event Prediction
Paper
• 2505.22457
• Published
• 29
VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Paper
• 2505.22019
• Published
• 11
RICO: Improving Accuracy and Completeness in Image Recaptioning via
Visual Reconstruction
Paper
• 2505.22613
• Published
• 9
Zero-Shot Vision Encoder Grafting via LLM Surrogates
Paper
• 2505.22664
• Published
• 7
MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal
Manga Understanding
Paper
• 2505.20298
• Published
• 9
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial
Intelligence
Paper
• 2505.23747
• Published
• 69
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Paper
• 2505.23762
• Published
• 45
The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in
Learning to Reason
Paper
• 2505.22653
• Published
• 43
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC
Videos
Paper
• 2505.23693
• Published
• 53
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video
Reasoning?
Paper
• 2505.23359
• Published
• 38
To Trust Or Not To Trust Your Vision-Language Model's Prediction
Paper
• 2505.23745
• Published
• 4
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of
Pre-trained Multimodal Representation via Text Updates
Paper
• 2505.22943
• Published
• 3
FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich
Document Understanding
Paper
• 2505.17330
• Published
• 22
HoPE: Hybrid of Position Embedding for Length Generalization in
Vision-Language Models
Paper
• 2505.20444
• Published
• 5
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement
Learning
Paper
• 2505.22914
• Published
• 37
Are Reasoning Models More Prone to Hallucination?
Paper
• 2505.23646
• Published
• 24
Multi-Domain Explainability of Preferences
Paper
• 2505.20088
• Published
• 20
REOrdering Patches Improves Vision Models
Paper
• 2505.23751
• Published
• 15
Re-ttention: Ultra Sparse Visual Generation via Attention Statistical
Reshape
Paper
• 2505.22918
• Published
• 6
Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
Paper
• 2505.23759
• Published
• 5
A Graph Perspective to Probe Structural Patterns of Knowledge in Large
Language Models
Paper
• 2505.19286
• Published
• 3
Grounded Reinforcement Learning for Visual Reasoning
Paper
• 2505.23678
• Published
• 2
Time Blindness: Why Video-Language Models Can't See What Humans Can?
Paper
• 2505.24867
• Published
• 82
AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
Paper
• 2505.24863
• Published
• 97
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in
Large Language Models
Paper
• 2505.24864
• Published
• 143
Large Language Models for Data Synthesis
Paper
• 2505.14752
• Published
• 49
Don't Look Only Once: Towards Multimodal Interactive Reasoning with
Selective Visual Revisitation
Paper
• 2505.18842
• Published
• 36
ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Paper
• 2505.24862
• Published
• 30
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models
Paper
• 2505.24025
• Published
• 27
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement
Learning
Paper
• 2505.24871
• Published
• 23
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and
Benchmarking Multimodal LLM Agents
Paper
• 2505.24878
• Published
• 23
Vision Language Models are Biased
Paper
• 2505.23941
• Published
• 23
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
• 2505.21523
• Published
• 13
Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual
Large Language Models
Paper
• 2505.20873
• Published
• 9
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT
and RL
Paper
• 2505.24875
• Published
• 10
un²CLIP: Improving CLIP's Visual Detail Capturing Ability via
Inverting unCLIP
Paper
• 2505.24517
• Published
• 5
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient
Robotics
Paper
• 2506.01844
• Published
• 151
Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion
Models
Paper
• 2506.00996
• Published
• 40
Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with
Jigsaw Puzzles
Paper
• 2505.23590
• Published
• 25
LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon
Embodied Tasks
Paper
• 2506.00411
• Published
• 31
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware
Reinforcement Learning
Paper
• 2506.01713
• Published
• 48
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation
with Large Multimodal Models
Paper
• 2506.01667
• Published
• 21
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL
Paper
• 2505.23977
• Published
• 10
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision
Geometry Priors
Paper
• 2505.24625
• Published
• 9
OmniResponse: Online Multimodal Conversational Response Generation in
Dyadic Interactions
Paper
• 2505.21724
• Published
• 5
Aligning VLM Assistants with Personalized Situated Cognition
Paper
• 2506.00930
• Published
• 2
MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal
LLMs
Paper
• 2506.01674
• Published
• 28
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis
Paper
• 2506.02096
• Published
• 52
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in
Multi-Agent Environments
Paper
• 2506.02387
• Published
• 58
UniWorld: High-Resolution Semantic Encoders for Unified Visual
Understanding and Generation
Paper
• 2506.03147
• Published
• 58
CSVQA: A Chinese Multimodal Benchmark for Evaluating STEM Reasoning
Capabilities of VLMs
Paper
• 2505.24120
• Published
• 49
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for
Vision Language Models
Paper
• 2506.03135
• Published
• 40
Visual Embodied Brain: Let Multimodal Large Language Models See, Think,
and Control in Spaces
Paper
• 2506.00123
• Published
• 35
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Paper
• 2506.03143
• Published
• 53
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens
Paper
• 2506.03096
• Published
• 4
TimeHC-RL: Temporal-aware Hierarchical Cognitive Reinforcement Learning
for Enhancing LLMs' Social Intelligence
Paper
• 2505.24500
• Published
• 12
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged
Reinforcement Learning
Paper
• 2506.04207
• Published
• 48
Paper
• 2506.03569
• Published
• 80
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in
Videos
Paper
• 2506.04141
• Published
• 29
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video
Reasoning
Paper
• 2506.03525
• Published
• 6
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language
Models for Robotics
Paper
• 2506.04308
• Published
• 43
Qwen3 Embedding: Advancing Text Embedding and Reranking Through
Foundation Models
Paper
• 2506.05176
• Published
• 79
EOC-Bench: Can MLLMs Identify, Recall, and Forecast Objects in an
Egocentric World?
Paper
• 2506.05287
• Published
• 14
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Paper
• 2506.05344
• Published
• 17
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal
Contextual Fusion
Paper
• 2506.01111
• Published
• 31
Is Extending Modality The Right Path Towards Omni-Modality?
Paper
• 2506.01872
• Published
• 24
Reinforcement Pre-Training
Paper
• 2506.08007
• Published
• 263
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal
Learning
Paper
• 2506.06205
• Published
• 30
Image Reconstruction as a Tool for Feature Analysis
Paper
• 2506.07803
• Published
• 29
Bootstrapping World Models from Dynamics Models in Multimodal Foundation
Models
Paper
• 2506.06006
• Published
• 14
Vision Transformers Don't Need Trained Registers
Paper
• 2506.08010
• Published
• 22
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical
Understanding and Reasoning
Paper
• 2506.07044
• Published
• 113
DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for
Parameter-Efficient Video-Text Retrieval
Paper
• 2506.08887
• Published
• 4
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand
Better
Paper
• 2506.09040
• Published
• 34
Seedance 1.0: Exploring the Boundaries of Video Generation Models
Paper
• 2506.09113
• Published
• 107
MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal
Large Language Models
Paper
• 2506.04688
• Published
• 3
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
Paper
• 2506.06395
• Published
• 133
Hidden in plain sight: VLMs overlook their visual representations
Paper
• 2506.08008
• Published
• 7
Vision Matters: Simple Visual Perturbations Can Boost Multimodal Math
Reasoning
Paper
• 2506.09736
• Published
• 9
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Paper
• 2506.10857
• Published
• 30
Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable
Task Experts
Paper
• 2506.10357
• Published
• 21
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Paper
• 2506.09937
• Published
• 9
Paper
• 2506.10910
• Published
• 66
Ming-Omni: A Unified Multimodal Model for Perception and Generation
Paper
• 2506.09344
• Published
• 31
VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Paper
• 2506.10821
• Published
• 19
UniPre3D: Unified Pre-training of 3D Point Cloud Models with Cross-Modal
Gaussian Splatting
Paper
• 2506.09952
• Published
• 6
Aligned Novel View Image and Geometry Synthesis via Cross-modal
Attention Instillation
Paper
• 2506.11924
• Published
• 34
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual
Perception in VLMs
Paper
• 2506.10128
• Published
• 22
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning
Attention
Paper
• 2506.13585
• Published
• 273
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning
with Vision-Language Models
Paper
• 2506.07961
• Published
• 11
Discrete Diffusion in Large Language and Multimodal Models: A Survey
Paper
• 2506.13759
• Published
• 43
Stream-Omni: Simultaneous Multimodal Interactions with Large
Language-Vision-Speech Model
Paper
• 2506.13642
• Published
• 27
VGR: Visual Grounded Reasoning
Paper
• 2506.11991
• Published
• 20
AR-RAG: Autoregressive Retrieval Augmentation for Image Generation
Paper
• 2506.06962
• Published
• 28
DoTA-RAG: Dynamic of Thought Aggregation RAG
Paper
• 2506.12571
• Published
• 50
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
Paper
• 2506.13654
• Published
• 43
Scientists' First Exam: Probing Cognitive Abilities of MLLM via
Perception, Understanding, and Reasoning
Paper
• 2506.10521
• Published
• 73
LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Paper
• 2506.14429
• Published
• 44
EfficientVLA: Training-Free Acceleration and Compression for
Vision-Language-Action Models
Paper
• 2506.10100
• Published
• 10
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Paper
• 2506.05336
• Published
• 9
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark
for Financial LLM Evaluation
Paper
• 2506.14028
• Published
• 93
Sekai: A Video Dataset towards World Exploration
Paper
• 2506.15675
• Published
• 66
ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning
in LLMs
Paper
• 2506.15211
• Published
• 39
GenRecal: Generation after Recalibration from Large to Small
Vision-Language Models
Paper
• 2506.15681
• Published
• 42
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim
Verification
Paper
• 2506.15569
• Published
• 12
FedNano: Toward Lightweight Federated Tuning for Pretrained Multimodal
Large Language Models
Paper
• 2506.14824
• Published
• 8
CoMemo: LVLMs Need Image Context with Image Memory
Paper
• 2506.06279
• Published
• 8
Show-o2: Improved Native Unified Multimodal Models
Paper
• 2506.15564
• Published
• 29
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal
Document Understanding
Paper
• 2506.16035
• Published
• 89
PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and
Quantized Attention in Visual Generation Models
Paper
• 2506.16054
• Published
• 60
Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with
Hybrid History Condition
Paper
• 2506.17201
• Published
• 57
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual
Tokens
Paper
• 2506.17218
• Published
• 29
UniFork: Exploring Modality Alignment for Unified Multimodal
Understanding and Generation
Paper
• 2506.17202
• Published
• 10
Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with
Production-Ready PBR Material
Paper
• 2506.15442
• Published
• 15
InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video
Understanding
Paper
• 2506.15745
• Published
• 14
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert
Aggregation
Paper
• 2506.17113
• Published
• 5
OmniGen2: Exploration to Advanced Multimodal Generation
Paper
• 2506.18871
• Published
• 78
Vision as a Dialect: Unifying Visual Understanding and Generation via
Text-Aligned Representations
Paper
• 2506.18898
• Published
• 34
From Intention to Execution: Probing the Generalization Boundaries of
Vision-Language-Action Models
Paper
• 2506.09930
• Published
• 8
USAD: Universal Speech and Audio Representation via Distillation
Paper
• 2506.18843
• Published
• 12
ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality
Debiasing
Paper
• 2506.19848
• Published
• 26
Unified Vision-Language-Action Model
Paper
• 2506.19850
• Published
• 27
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal
Reasoning
Paper
• 2506.16141
• Published
• 27
Phantom-Data: Towards a General Subject-Consistent Video Generation
Dataset
Paper
• 2506.18851
• Published
• 30
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image
Generation
Paper
• 2506.18095
• Published
• 66
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Paper
• 2506.20512
• Published
• 47
MMSearch-R1: Incentivizing LMMs to Search
Paper
• 2506.20670
• Published
• 64
WorldVLA: Towards Autoregressive Action World Model
Paper
• 2506.21539
• Published
• 40
FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient
Multi-turn Image Editing
Paper
• 2506.20911
• Published
• 41
LLaVA-Scissor: Token Compression with Semantic Connected Components for
Video LLMs
Paper
• 2506.21862
• Published
• 36
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Paper
• 2506.21656
• Published
• 16
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Paper
• 2506.22434
• Published
• 10
BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing
Paper
• 2506.17450
• Published
• 64
ShotBench: Expert-Level Cinematic Understanding in Vision-Language
Models
Paper
• 2506.21356
• Published
• 22
Audio-FLAN: A Preliminary Release
Paper
• 2502.16584
• Published
• 36
Do Vision-Language Models Have Internal World Models? Towards an Atomic
Evaluation
Paper
• 2506.21876
• Published
• 28
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in
Inference-time Scaling?
Paper
• 2506.17417
• Published
• 11
Paper
• 2506.23044
• Published
• 61
Listener-Rewarded Thinking in VLMs for Image Preferences
Paper
• 2506.22832
• Published
• 23
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable
Reinforcement Learning
Paper
• 2507.01006
• Published
• 251
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional
Multimodal Embeddings
Paper
• 2506.23115
• Published
• 36
MusiXQA: Advancing Visual Music Understanding in Multimodal Large
Language Models
Paper
• 2506.23009
• Published
• 11
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
A Survey on Vision-Language-Action Models: An Action Tokenization
Perspective
Paper
• 2507.01925
• Published
• 39
LongAnimation: Long Animation Generation with Dynamic Global-Local
Memory
Paper
• 2507.01945
• Published
• 76
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
• 2506.23918
• Published
• 90
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation
Models on Standard Computer Vision Tasks
Paper
• 2507.01955
• Published
• 36
MemOS: A Memory OS for AI System
Paper
• 2507.03724
• Published
• 159
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive
World Knowledge
Paper
• 2507.04447
• Published
• 45
BMMR: A Large-Scale Bilingual Multimodal Multi-Discipline Reasoning
Dataset
Paper
• 2507.03483
• Published
• 24
ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code
Generation Evaluation
Paper
• 2507.04952
• Published
• 11
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and
Visual Documents
Paper
• 2507.04590
• Published
• 17
Perception-Aware Policy Optimization for Multimodal Reasoning
Paper
• 2507.06448
• Published
• 48
4KAgent: Agentic Any Image to 4K Super-Resolution
Paper
• 2507.07105
• Published
• 106
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Paper
• 2507.07095
• Published
• 56
Scaling RL to Long Videos
Paper
• 2507.07966
• Published
• 160
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and
Methodology
Paper
• 2507.07999
• Published
• 50
PyVision: Agentic Vision with Dynamic Tooling
Paper
• 2507.07998
• Published
• 33
Multi-Granular Spatio-Temporal Token Merging for Training-Free
Acceleration of Video LLMs
Paper
• 2507.07990
• Published
• 46
A Survey on Long-Video Storytelling Generation: Architectures,
Consistency, and Cinematic Quality
Paper
• 2507.07202
• Published
• 25
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for
Visual Reasoning
Paper
• 2507.05255
• Published
• 75
Vision Foundation Models as Effective Visual Tokenizers for
Autoregressive Image Generation
Paper
• 2507.08441
• Published
• 62
Robust Multimodal Large Language Models Against Modality Conflict
Paper
• 2507.07151
• Published
• 6
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality,
Long Context, and Next Generation Agentic Capabilities
Paper
• 2507.06261
• Published
• 67
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large
Audio Language Models
Paper
• 2507.08128
• Published
• 13
SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual
Dyadic Interactive Human Generation
Paper
• 2507.09862
• Published
• 51
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation
from Diffusion Models
Paper
• 2507.07104
• Published
• 46
Can Multimodal Foundation Models Understand Schematic Diagrams? An
Empirical Study on Information-Seeking QA over Scientific Papers
Paper
• 2507.10787
• Published
• 13
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New
Benchmarks
Paper
• 2507.11336
• Published
• 7
MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior
Understanding
Paper
• 2507.12463
• Published
• 27
VisionThink: Smart and Efficient Vision Language Model via Reinforcement
Learning
Paper
• 2507.13348
• Published
• 79
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos
with Spatio-Temporal Diffusion Models
Paper
• 2507.13344
• Published
• 59
AnyCap Project: A Unified Framework, Dataset, and Benchmark for
Controllable Omni-modal Captioning
Paper
• 2507.12841
• Published
• 42
π³: Scalable Permutation-Equivariant Visual Geometry Learning
Paper
• 2507.13347
• Published
• 66
Automating Steering for Safe Multimodal Large Language Models
Paper
• 2507.13255
• Published
• 4
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal
Large Language Models
Paper
• 2507.12566
• Published
• 15
Franca: Nested Matryoshka Clustering for Scalable Visual Representation
Learning
Paper
• 2507.14137
• Published
• 35
GUI-G²: Gaussian Reward Modeling for GUI Grounding
Paper
• 2507.15846
• Published
• 133
Being-H0: Vision-Language-Action Pretraining from Large-Scale Human
Videos
Paper
• 2507.15597
• Published
• 34
Latent Denoising Makes Good Visual Tokenizers
Paper
• 2507.15856
• Published
• 12
Step-Audio 2 Technical Report
Paper
• 2507.16632
• Published
• 74
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Paper
• 2507.16746
• Published
• 34
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking
Reasoning
Paper
• 2507.16814
• Published
• 21
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent
Planning
Paper
• 2507.16815
• Published
• 42
Experience is the Best Teacher: Grounding VLMs for Robotics through
Self-Generated Memory
Paper
• 2507.16713
• Published
• 21
Pixels, Patterns, but No Poetry: To See The World like Humans
Paper
• 2507.16863
• Published
• 69
Yume: An Interactive World Generation Model
Paper
• 2507.17744
• Published
• 91
Finding Dori: Memorization in Text-to-Image Diffusion Models Is Less
Local Than Assumed
Paper
• 2507.16880
• Published
• 7
TTS-VAR: A Test-Time Scaling Framework for Visual Auto-Regressive
Generation
Paper
• 2507.18537
• Published
• 18
Iwin Transformer: Hierarchical Vision Transformer using Interleaved
Windows
Paper
• 2507.18405
• Published
• 5
Chat with AI: The Surprising Turn of Real-time Video Communication from
Human to AI
Paper
• 2507.10510
• Published
• 5
Deep Researcher with Test-Time Diffusion
Paper
• 2507.16075
• Published
• 68
When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token
Compression across Images, Videos, and Audios
Paper
• 2507.20198
• Published
• 28
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World
Shorts
Paper
• 2507.20939
• Published
• 57
Region-based Cluster Discrimination for Visual Representation Learning
Paper
• 2507.20025
• Published
• 19
Reconstructing 4D Spatial Intelligence: A Survey
Paper
• 2507.21045
• Published
• 38
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset
Paper
• 2507.21033
• Published
• 23
X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image
Generative Models Great Again
Paper
• 2507.22058
• Published
• 40
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Paper
• 2507.21183
• Published
• 15
ForCenNet: Foreground-Centric Network for Document Image Rectification
Paper
• 2507.19804
• Published
• 12
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual
Segmentation
Paper
• 2507.22886
• Published
• 10
HunyuanWorld 1.0: Generating Immersive, Explorable, and Interactive 3D
Worlds from Words or Pixels
Paper
• 2507.21809
• Published
• 140
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced
Multimodal Reasoning
Paper
• 2507.22607
• Published
• 47
Scalable Multi-Task Reinforcement Learning for Generalizable Spatial
Intelligence in Visuomotor Agents
Paper
• 2507.23698
• Published
• 11
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Paper
• 2507.23779
• Published
• 45
C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring
Challenges in Complex Conversations
Paper
• 2507.22968
• Published
• 25
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action
Models
Paper
• 2507.23682
• Published
• 24
Multimodal Referring Segmentation: A Survey
Paper
• 2508.00265
• Published
• 9
PixNerd: Pixel Neural Field Diffusion
Paper
• 2507.23268
• Published
• 52
Qwen-Image Technical Report
Paper
• 2508.02324
• Published
• 272
VeOmni: Scaling Any Modality Model Training with Model-Centric
Distributed Recipe Zoo
Paper
• 2508.02317
• Published
• 22
InstructVLA: Vision-Language-Action Instruction Tuning from
Understanding to Manipulation
Paper
• 2507.17520
• Published
• 15
A Glimpse to Compress: Dynamic Visual Token Pruning for Large
Vision-Language Models
Paper
• 2508.01548
• Published
• 14
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding
and Generation
Paper
• 2508.03320
• Published
• 63
LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation
Paper
• 2508.03694
• Published
• 52
Representation Shift: Unifying Token Compression with FlashAttention
Paper
• 2508.00367
• Published
• 16
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and
Outcome Reward
Paper
• 2508.03686
• Published
• 39
The Promise of RL for Autoregressive Image Editing
Paper
• 2508.01119
• Published
• 11
HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and
Decision in Embodied Agents
Paper
• 2508.02629
• Published
• 6
LAMIC: Layout-Aware Multi-Image Composition via Scalability of
Multimodal Diffusion Transformer
Paper
• 2508.00477
• Published
• 11
ChartCap: Mitigating Hallucination of Dense Chart Captioning
Paper
• 2508.03164
• Published
• 7
Sotopia-RL: Reward Design for Social Intelligence
Paper
• 2508.03905
• Published
• 23
HPSv3: Towards Wide-Spectrum Human Preference Score
Paper
• 2508.03789
• Published
• 20
On the Generalization of SFT: A Reinforcement Learning Perspective with
Reward Rectification
Paper
• 2508.05629
• Published
• 184
Enhancing Vision-Language Model Training with Reinforcement Learning in
Synthetic Worlds for Real-World Success
Paper
• 2508.04280
• Published
• 35
CM³: Calibrating Multimodal Recommendation
Paper
• 2508.01226
• Published
• 1
DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Paper
• 2508.05405
• Published
• 64
Can Large Multimodal Models Actively Recognize Faulty Inputs? A
Systematic Evaluation Framework of Their Input Scrutiny Ability
Paper
• 2508.04017
• Published
• 11
Adapting Vision-Language Models Without Labels: A Comprehensive Survey
Paper
• 2508.05547
• Published
• 11
Voost: A Unified and Scalable Diffusion Transformer for Bidirectional
Virtual Try-On and Try-Off
Paper
• 2508.04825
• Published
• 60
Reinforcement Learning in Vision: A Survey
Paper
• 2508.08189
• Published
• 30
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Paper
• 2508.05748
• Published
• 141
Adversarial Video Promotion Against Text-to-Video Retrieval
Paper
• 2508.06964
• Published
• 9
MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math
Reasoning in Multimodal Large Language Models
Paper
• 2508.06009
• Published
• 16
Story2Board: A Training-Free Approach for Expressive Storyboard
Generation
Paper
• 2508.09983
• Published
• 70
Stand-In: A Lightweight and Plug-and-Play Identity Control for Video
Generation
Paper
• 2508.07901
• Published
• 40
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with
Long-Term Memory
Paper
• 2508.09736
• Published
• 58
IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding
Paper
• 2508.09456
• Published
• 8
NextStep-1: Toward Autoregressive Image Generation with Continuous
Tokens at Scale
Paper
• 2508.10711
• Published
• 145
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual
Mathematical Reasoning
Paper
• 2508.10433
• Published
• 144
HumanSense: From Multimodal Perception to Empathetic Context-Aware
Responses through Reasoning MLLMs
Paper
• 2508.10576
• Published
• 8
Processing and acquisition traces in visual encoders: What does CLIP
know about your camera?
Paper
• 2508.10637
• Published
• 8
Thyme: Think Beyond Images
Paper
• 2508.11630
• Published
• 81
Paper
• 2508.10104
• Published
• 298
Controlling Multimodal LLMs via Reward-guided Decoding
Paper
• 2508.11616
• Published
• 7
Paper
• 2508.11737
• Published
• 112
Next Visual Granularity Generation
Paper
• 2508.12811
• Published
• 49
OmniTry: Virtual Try-On Anything without Masks
Paper
• 2508.13632
• Published
• 15
MultiRef: Controllable Image Generation with Multiple Visual References
Paper
• 2508.06905
• Published
• 21
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion
Transformer
Paper
• 2508.09131
• Published
• 16
RynnEC: Bringing MLLMs into Embodied World
Paper
• 2508.14160
• Published
• 20
Intern-S1: A Scientific Multimodal Foundation Model
Paper
• 2508.15763
• Published
• 269
Waver: Wave Your Way to Lifelike Video Generation
Paper
• 2508.15761
• Published
• 36
Visual Autoregressive Modeling for Instruction-Guided Image Editing
Paper
• 2508.15772
• Published
• 9
When and What: Diffusion-Grounded VideoLLM with Entity Aware
Segmentation for Long Video Understanding
Paper
• 2508.15641
• Published
• 4
LLaSO: A Foundational Framework for Reproducible Research in Large
Language and Speech Model
Paper
• 2508.15418
• Published
• 8
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Paper
• 2508.16153
• Published
• 160
PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent
LLMs
Paper
• 2508.17188
• Published
• 17
OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive
Simulation
Paper
• 2508.19209
• Published
• 42
Autoregressive Universal Video Segmentation Model
Paper
• 2508.19242
• Published
• 29
MovieCORE: COgnitive REasoning in Movies
Paper
• 2508.19026
• Published
• 7
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility,
Reasoning, and Efficiency
Paper
• 2508.18265
• Published
• 214
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
Paper
• 2508.18264
• Published
• 25
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Paper
• 2508.17298
• Published
• 4
Deep Think with Confidence
Paper
• 2508.15260
• Published
• 90
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Paper
• 2508.19652
• Published
• 84
Mixture of Contexts for Long Video Generation
Paper
• 2508.21058
• Published
• 35
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable
Text-to-Image Reinforcement Learning
Paper
• 2508.20751
• Published
• 89
Beyond Transcription: Mechanistic Interpretability in ASR
Paper
• 2508.15882
• Published
• 87
MIDAS: Multimodal Interactive Digital-human Synthesis via Real-time
Autoregressive Video Generation
Paper
• 2508.19320
• Published
• 29
SEAM: Semantically Equivalent Across Modalities Benchmark for
Vision-Language Models
Paper
• 2508.18179
• Published
• 9
CogVLA: Cognition-Aligned Vision-Language-Action Model via
Instruction-Driven Routing & Sparsification
Paper
• 2508.21046
• Published
• 9
OneReward: Unified Mask-Guided Image Generation via Multi-Task Human
Preference Learning
Paper
• 2508.21066
• Published
• 13
Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and
Voice
Paper
• 2508.17502
• Published
• 1
AudioStory: Generating Long-Form Narrative Audio with Large Language
Models
Paper
• 2508.20088
• Published
• 21
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
• 2508.21113
• Published
• 110
EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for
General Robot Control
Paper
• 2508.21112
• Published
• 77
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn
Reinforcement Learning
Paper
• 2509.02544
• Published
• 125
POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models
for Document Conversion
Paper
• 2509.01215
• Published
• 51
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long
Video Understanding
Paper
• 2508.21496
• Published
• 55
Kwai Keye-VL 1.5 Technical Report
Paper
• 2509.01563
• Published
• 38
OpenVision 2: A Family of Generative Pretrained Visual Encoders for
Multimodal Learning
Paper
• 2509.01644
• Published
• 34
Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation
Paper
• 2508.20470
• Published
• 75
Planning with Reasoning using Vision Language World Model
Paper
• 2509.02722
• Published
• 24
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL
Training
Paper
• 2509.03403
• Published
• 23
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
Paper
• 2509.03867
• Published
• 211
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding
Paper
• 2508.20478
• Published
• 18
Why Language Models Hallucinate
Paper
• 2509.04664
• Published
• 196
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Paper
• 2509.04744
• Published
• 12
LatticeWorld: A Multimodal Large Language Model-Empowered Framework for
Interactive Complex World Generation
Paper
• 2509.05263
• Published
• 11
Reinforced Visual Perception with Tools
Paper
• 2509.01656
• Published
• 32
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
Paper
• 2509.06461
• Published
• 20
UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
Paper
• 2509.06155
• Published
• 14
Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage,
but Not Direct the Play?
Paper
• 2509.03516
• Published
• 12
D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning
Paper
• 2509.06771
• Published
• 6
Visual Representation Alignment for Multimodal Large Language Models
Paper
• 2509.07979
• Published
• 84
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual
Search
Paper
• 2509.07969
• Published
• 59
Reconstruction Alignment Improves Unified Multimodal Models
Paper
• 2509.07295
• Published
• 40
A Survey of Reinforcement Learning for Large Reasoning Models
Paper
• 2509.08827
• Published
• 190
RewardDance: Reward Scaling in Visual Generation
Paper
• 2509.08826
• Published
• 73
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action
Model
Paper
• 2509.09372
• Published
• 246
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal
Conditioning
Paper
• 2509.08519
• Published
• 128
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
Paper
• 2509.09674
• Published
• 80
Can Understanding and Generation Truly Benefit Together – or Just
Coexist?
Paper
• 2509.09666
• Published
• 34
Kling-Avatar: Grounding Multimodal Instructions for Cascaded
Long-Duration Avatar Animation Synthesis
Paper
• 2509.09595
• Published
• 48
SpatialVID: A Large-Scale Video Dataset with Spatial Annotations
Paper
• 2509.09676
• Published
• 35
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in
LLMs
Paper
• 2509.09677
• Published
• 35
Learning to Optimize Multi-Objective Alignment Through Dynamic Reward
Weighting
Paper
• 2509.11452
• Published
• 14
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
Paper
• 2509.11362
• Published
• 5
Lost in Embeddings: Information Loss in Vision-Language Models
Paper
• 2509.11986
• Published
• 29
OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling
Paper
• 2509.12201
• Published
• 106
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
Paper
• 2509.11543
• Published
• 49
LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion
Transformers via Explicit Correspondence
Paper
• 2509.12203
• Published
• 20
Multimodal Reasoning for Science: Technical Report and 1st Place
Solution to the ICML 2025 SeePhys Challenge
Paper
• 2509.06079
• Published
• 6
3D Aware Region Prompted Vision Language Model
Paper
• 2509.13317
• Published
• 14
Hunyuan3D Studio: End-to-End AI Pipeline for Game-Ready 3D Asset
Generation
Paper
• 2509.12815
• Published
• 40
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language
Models
Paper
• 2509.12132
• Published
• 7
Dr.V: A Hierarchical Perception-Temporal-Cognition Framework to Diagnose
Video Hallucination by Fine-grained Spatial-Temporal Grounding
Paper
• 2509.11866
• Published
• 2
Scaling Agents via Continual Pre-training
Paper
• 2509.13310
• Published
• 117
SAIL-VL2 Technical Report
Paper
• 2509.14033
• Published
• 44
FlowRL: Matching Reward Distributions for LLM Reasoning
Paper
• 2509.15207
• Published
• 116
WorldForge: Unlocking Emergent 3D/4D Generation in Video Diffusion Model
via Training-Free Guidance
Paper
• 2509.15130
• Published
• 30
MultiEdit: Advancing Instruction-based Image Editing on Diverse and
Challenging Tasks
Paper
• 2509.14638
• Published
• 13
AToken: A Unified Tokenizer for Vision
Paper
• 2509.14476
• Published
• 36
Unleashing the Potential of Multimodal LLMs for Zero-Shot
Spatio-Temporal Video Grounding
Paper
• 2509.15178
• Published
• 6
PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era
Paper
• 2509.12989
• Published
• 28
LLM-I: LLMs are Naturally Interleaved Multimodal Creators
Paper
• 2509.13642
• Published
• 9
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid
Vision Tokenizer
Paper
• 2509.16197
• Published
• 58
Lynx: Towards High-Fidelity Personalized Video Generation
Paper
• 2509.15496
• Published
• 13
BaseReward: A Strong Baseline for Multimodal Reward Model
Paper
• 2509.16127
• Published
• 21
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Paper
• 2509.15566
• Published
• 14
Video2Roleplay: A Multimodal Dataset and Framework for Video-Guided
Role-playing Agents
Paper
• 2509.15233
• Published
• 2
A Vision-Language-Action-Critic Model for Robotic Real-World
Reinforcement Learning
Paper
• 2509.15937
• Published
• 20
OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion
Transformer Models
Paper
• 2509.17627
• Published
• 66
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning
for Video LLMs
Paper
• 2509.18056
• Published
• 27
Qwen3-Omni Technical Report
Paper
• 2509.17765
• Published
• 149
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric
Reasoning
Paper
• 2509.17437
• Published
• 17
VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video
Diffusion Models
Paper
• 2509.17985
• Published
• 26
ContextFlow: Training-Free Video Object Editing via Adaptive Context
Enrichment
Paper
• 2509.17818
• Published
• 8
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late
Interaction
Paper
• 2509.18095
• Published
• 9
When Big Models Train Small Ones: Label-Free Model Parity Alignment for
Efficient Visual Question Answering using Small VLMs
Paper
• 2509.16633
• Published
• 2
MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and
Training Recipe
Paper
• 2509.18154
• Published
• 55
Reinforcement Learning on Pre-Training Data
Paper
• 2509.19249
• Published
• 67
Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model
Self-Distillation
Paper
• 2509.19296
• Published
• 25
Hyper-Bagel: A Unified Acceleration Framework for Multimodal
Understanding and Generation
Paper
• 2509.18824
• Published
• 23
UniPixel: Unified Object Referring and Segmentation for Pixel-Level
Visual Reasoning
Paper
• 2509.18094
• Published
• 4
MAPO: Mixed Advantage Policy Optimization
Paper
• 2509.18849
• Published
• 27
EmbeddingGemma: Powerful and Lightweight Text Representations
Paper
• 2509.20354
• Published
• 47
Advancing Speech Understanding in Speech-Aware Language Models with GRPO
Paper
• 2509.16990
• Published
• 21
Video models are zero-shot learners and reasoners
Paper
• 2509.20328
• Published
• 100
EditVerse: Unifying Image and Video Editing and Generation with
In-Context Learning
Paper
• 2509.20360
• Published
• 18
Seedream 4.0: Toward Next-generation Multimodal Image Generation
Paper
• 2509.20427
• Published
• 82
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and
Open Resources
Paper
• 2509.21268
• Published
• 104
Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D
Assets
Paper
• 2509.21245
• Published
• 39
How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven
Perspective
Paper
• 2509.18905
• Published
• 30
Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal
Understanding and Generation
Paper
• 2509.19244
• Published
• 12
AutoIntent: AutoML for Text Classification
Paper
• 2509.21138
• Published
• 36
CHARM: Control-point-based 3D Anime Hairstyle Auto-Regressive Modeling
Paper
• 2509.21114
• Published
• 16
V-GameGym: Visual Game Generation for Code Large Language Models
Paper
• 2509.20136
• Published
• 9
Interactive Recommendation Agent with Active User Commands
Paper
• 2509.21317
• Published
• 7
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for
Video Temporal Reasoning
Paper
• 2509.21113
• Published
• 6
LongLive: Real-time Interactive Long Video Generation
Paper
• 2509.22622
• Published
• 188
VoiceAssistant-Eval: Benchmarking AI Assistants across Listening,
Speaking, and Viewing
Paper
• 2509.22651
• Published
• 23
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement
Learning
Paper
• 2509.22647
• Published
• 33
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
Paper
• 2509.21760
• Published
• 15
Think-on-Graph 3.0: Efficient and Adaptive LLM Reasoning on
Heterogeneous Graphs via Multi-Agent Dual-Evolving Context Retrieval
Paper
• 2509.21710
• Published
• 19
MinerU2.5: A Decoupled Vision-Language Model for Efficient
High-Resolution Document Parsing
Paper
• 2509.22186
• Published
• 146
Where MLLMs Attend and What They Rely On: Explaining Autoregressive
Token Generation
Paper
• 2509.22496
• Published
• 4
X-Streamer: Unified Human World Modeling with Audiovisual Interaction
Paper
• 2509.21574
• Published
• 8
Visual Jigsaw Post-Training Improves MLLMs
Paper
• 2509.25190
• Published
• 37
LOVE-R1: Advancing Long Video Understanding with an Adaptive Zoom-in
Mechanism via Multi-Step Reasoning
Paper
• 2509.24786
• Published
• 7
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
Paper
• 2509.25131
• Published
• 16
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient
SpeechLLMs
Paper
• 2509.22220
• Published
• 65
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation
and Editing
Paper
• 2509.24900
• Published
• 53
GSM8K-V: Can Vision Language Models Solve Grade School Math Word
Problems in Visual Contexts
Paper
• 2509.25160
• Published
• 32
EditScore: Unlocking Online RL for Image Editing via High-Fidelity
Reward Modeling
Paper
• 2509.23909
• Published
• 33
HunyuanImage 3.0 Technical Report
Paper
• 2509.23951
• Published
• 25
MMPB: It's Time for Multi-Modal Personalization
Paper
• 2509.22820
• Published
• 15
BRIDGE - Building Reinforcement-Learning Depth-to-Image Data Generation
Engine for Monocular Depth Estimation
Paper
• 2509.25077
• Published
• 15
Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding
Paper
• 2509.23050
• Published
• 15
UniMIC: Token-Based Multimodal Interactive Coding for Human-AI
Collaboration
Paper
• 2509.22570
• Published
• 4
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified
Self-Play
Paper
• 2509.25541
• Published
• 140
LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks
for Multimodal Large Language Models
Paper
• 2510.03232
• Published
• 1
Efficient Test-Time Scaling for Small Vision-Language Models
Paper
• 2510.03574
• Published
• 10
Self-Improvement in Multimodal Large Language Models: A Survey
Paper
• 2510.02665
• Published
• 21
Efficient Multi-modal Large Language Models via Progressive Consistency
Distillation
Paper
• 2510.00515
• Published
• 42
Paper
• 2510.01141
• Published
• 121
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Paper
• 2510.01623
• Published
• 12
VideoNSA: Native Sparse Attention Scales Video Understanding
Paper
• 2510.02295
• Published
• 10
Visual Multi-Agent System: Mitigating Hallucination Snowballing via
Visual Flow
Paper
• 2509.21789
• Published
• 9
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via
Multi-Stage Reinforcement Learning
Paper
• 2510.02240
• Published
• 18
Learning to Reason for Hallucination Span Detection
Paper
• 2510.02173
• Published
• 19
VOGUE: Guiding Exploration with Visual Uncertainty Improves Multimodal
Reasoning
Paper
• 2510.01444
• Published
• 20
Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Paper
• 2510.01284
• Published
• 37
ModernVBERT: Towards Smaller Visual Document Retrievers
Paper
• 2510.01149
• Published
• 32
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided
Illusions
Paper
• 2510.02314
• Published
• 60
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
Paper
• 2510.02283
• Published
• 96
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained
Perception in VLMs
Paper
• 2509.25916
• Published
• 6
Training Vision-Language Process Reward Models for Test-Time Scaling in
Multimodal Reasoning: Key Insights and Lessons Learned
Paper
• 2509.23250
• Published
• 6
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image
Editing
Paper
• 2509.26346
• Published
• 19
VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified
Rewards in World Simulators
Paper
• 2510.00406
• Published
• 66
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Paper
• 2509.26231
• Published
• 18
MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation
Paper
• 2509.26391
• Published
• 22
DC-VideoGen: Efficient Video Generation with Deep Compression Video
Autoencoder
Paper
• 2509.25182
• Published
• 39
Learning to See Before Seeing: Demystifying LLM Visual Priors from
Language Pre-training
Paper
• 2509.26625
• Published
• 43
The Dragon Hatchling: The Missing Link between the Transformer and
Models of the Brain
Paper
• 2509.26507
• Published
• 547
More Thought, Less Accuracy? On the Dual Nature of Reasoning in
Vision-Language Models
Paper
• 2509.25848
• Published
• 80
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token
Pruning for Efficient Supervised Fine-Tuning
Paper
• 2509.23873
• Published
• 67
Heptapod: Language Modeling on Visual Signals
Paper
• 2510.06673
• Published
• 5
Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in
MLLMs
Paper
• 2510.01954
• Published
• 14
Are We Using the Right Benchmark: An Evaluation Framework for Visual
Token Compression Methods
Paper
• 2510.07143
• Published
• 13
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Paper
• 2510.06783
• Published
• 12
MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Paper
• 2510.07310
• Published
• 36
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal
Generation and Understanding
Paper
• 2510.06308
• Published
• 55
Ming-UniVision: Joint Image Understanding and Generation with a Unified
Continuous Tokenizer
Paper
• 2510.06590
• Published
• 77
Cache-to-Cache: Direct Semantic Communication Between Large Language
Models
Paper
• 2510.03215
• Published
• 98
SAEdit: Token-level control for continuous image editing via Sparse
AutoEncoder
Paper
• 2510.05081
• Published
• 5
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition
Paper
• 2510.04136
• Published
• 4
Character Mixing for Video Generation
Paper
• 2510.05093
• Published
• 7
ChronoEdit: Towards Temporal Reasoning for Image Editing and World
Simulation
Paper
• 2510.04290
• Published
• 20
Factuality Matters: When Image Generation and Editing Meet Structured
Visuals
Paper
• 2510.05091
• Published
• 20
VChain: Chain-of-Visual-Thought for Reasoning in Video Generation
Paper
• 2510.05094
• Published
• 38
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large
Multimodal Models
Paper
• 2510.05034
• Published
• 51
Paper2Video: Automatic Video Generation from Scientific Papers
Paper
• 2510.05096
• Published
• 119
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with
Holistic Platform and Adaptive Hybrid Policy Optimization
Paper
• 2510.08540
• Published
• 109
DreamOmni2: Multimodal Instruction-based Editing and Generation
Paper
• 2510.06679
• Published
• 73
UniVideo: Unified Understanding, Generation, and Editing for Videos
Paper
• 2510.08377
• Published
• 81
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal
Patches via In-Context Conditioning
Paper
• 2510.08555
• Published
• 64
ARTDECO: Towards Efficient and High-Fidelity On-the-Fly 3D
Reconstruction with Structured Scene Representation
Paper
• 2510.08551
• Published
• 34
PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
Paper
• 2510.07546
• Published
• 22
UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video
Super-Resolution
Paper
• 2510.08143
• Published
• 20
NaViL: Rethinking Scaling Properties of Native Multimodal Large Language
Models under Data Constraints
Paper
• 2510.08565
• Published
• 21
InstructX: Towards Unified Visual Editing with MLLM Guidance
Paper
• 2510.08485
• Published
• 18
Thinking with Camera: A Unified Multimodal Model for Camera-Centric
Understanding and Generation
Paper
• 2510.08673
• Published
• 126
D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to
Embodied AI
Paper
• 2510.05684
• Published
• 143
Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for
MLLMs
Paper
• 2510.09201
• Published
• 50
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Paper
• 2510.09608
• Published
• 51
BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic
Embodied Capabilities
Paper
• 2510.08759
• Published
• 46
SpaceVista: All-Scale Visual Spatial Reasoning from mm to km
Paper
• 2510.09606
• Published
• 18
ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level
Entropy Shaping
Paper
• 2510.08457
• Published
• 13
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni
MLLMs
Paper
• 2510.10689
• Published
• 47
AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
Paper
• 2510.10395
• Published
• 31
DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training
Paper
• 2510.11712
• Published
• 31
Spotlight on Token Perception for Multimodal Reinforcement Learning
Paper
• 2510.09285
• Published
• 37
On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in
Large Vision-Language Models
Paper
• 2510.09008
• Published
• 16
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
Paper
• 2510.11027
• Published
• 23
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video
Super-Resolution
Paper
• 2510.12747
• Published
• 39
ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning
and Online Reinforcement Learning
Paper
• 2510.12693
• Published
• 28
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Paper
• 2510.12801
• Published
• 13
UniFusion: Vision-Language Model as Unified Encoder in Image Generation
Paper
• 2510.12789
• Published
• 19
SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
Paper
• 2510.12709
• Published
• 13
SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Paper
• 2510.12784
• Published
• 20
Spatial Forcing: Implicit Spatial Representation Alignment for
Vision-language-action Model
Paper
• 2510.12276
• Published
• 147
Scaling Language-Centric Omnimodal Representation Learning
Paper
• 2510.11693
• Published
• 104
HoneyBee: Data Recipes for Vision-Language Reasoners
Paper
• 2510.12225
• Published
• 11
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity
MoE
Paper
• 2510.13344
• Published
• 63
FlashWorld: High-quality 3D Scene Generation within Seconds
Paper
• 2510.13678
• Published
• 73
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn
Dialogue
Paper
• 2510.13747
• Published
• 30
Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully
Open MLLMs
Paper
• 2510.13795
• Published
• 59
Generative Universal Verifier as Multimodal Meta-Reasoner
Paper
• 2510.13804
• Published
• 27
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
Paper
• 2510.13515
• Published
• 12
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for
Generalist Robot Policy
Paper
• 2510.13778
• Published
• 17
FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Paper
• 2510.10921
• Published
• 11
X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment
Vision-Language-Action Model
Paper
• 2510.10274
• Published
• 16
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action
Models
Paper
• 2510.13626
• Published
• 46
PhysMaster: Mastering Physical Representation for Video Generation via
Reinforcement Learning
Paper
• 2510.13809
• Published
• 38
Trace Anything: Representing Any Video in 4D via Trajectory Fields
Paper
• 2510.13802
• Published
• 31
Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark
Paper
• 2510.13759
• Published
• 11
When Models Lie, We Learn: Multilingual Span-Level Hallucination
Detection with PsiloQA
Paper
• 2510.04849
• Published
• 115
WithAnyone: Towards Controllable and ID Consistent Image Generation
Paper
• 2510.14975
• Published
• 85
From Pixels to Words -- Towards Native Vision-Language Primitives at
Scale
Paper
• 2510.14979
• Published
• 67
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper
• 2510.14528
• Published
• 118
MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal
Mathematical Reasoning
Paper
• 2510.14958
• Published
• 23
VR-Thinker: Boosting Video Reward Models through Thinking-with-Image
Reasoning
Paper
• 2510.10518
• Published
• 19
VLA^2: Empowering Vision-Language-Action Models with an Agentic
Framework for Unseen Concept Manipulation
Paper
• 2510.14902
• Published
• 17
Qwen3Guard Technical Report
Paper
• 2510.14276
• Published
• 15
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding
LLM
Paper
• 2510.15870
• Published
• 91
Scaling Instruction-Based Video Editing with a High-Quality Synthetic
Dataset
Paper
• 2510.15742
• Published
• 51
PICABench: How Far Are We from Physically Realistic Image Editing?
Paper
• 2510.17681
• Published
• 64
Glyph: Scaling Context Windows via Visual-Text Compression
Paper
• 2510.17800
• Published
• 68
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented
Generation
Paper
• 2510.17354
• Published
• 35
FineVision: Open Data Is All You Need
Paper
• 2510.17269
• Published
• 75
RL makes MLLMs see better than SFT
Paper
• 2510.16333
• Published
• 49
Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware
Finetuning and MLLM Implicit Feedback
Paper
• 2510.16888
• Published
• 22
Visual Autoregressive Models Beat Diffusion Models on Inference Time
Scaling
Paper
• 2510.16751
• Published
• 21
ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
Paper
• 2510.17803
• Published
• 14
Executable Knowledge Graphs for Replicating AI Research
Paper
• 2510.17795
• Published
• 15
Knowledge-based Visual Question Answer with Multimodal Processing,
Retrieval and Filtering
Paper
• 2510.14605
• Published
• 5
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large
Vision and Language Models
Paper
• 2510.16641
• Published
• 5
Paper2Web: Let's Make Your Paper Alive!
Paper
• 2510.15842
• Published
• 27
Towards Faithful and Controllable Personalization via Critique-Post-Edit
Reinforcement Learning
Paper
• 2510.18849
• Published
• 21
World-in-World: World Models in a Closed-Loop World
Paper
• 2510.18135
• Published
• 77
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for
Multimodal LLMs
Paper
• 2510.18876
• Published
• 37
ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Paper
• 2510.18795
• Published
• 11
IF-VidCap: Can Video Caption Models Follow Instructions?
Paper
• 2510.18726
• Published
• 26
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating
Multimodal LLMs in Multi-Turn Dialogues
Paper
• 2510.17722
• Published
• 20
olmOCR 2: Unit Test Rewards for Document OCR
Paper
• 2510.19817
• Published
• 16
DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile
Phone Agents
Paper
• 2510.19336
• Published
• 17
Unified Reinforcement and Imitation Learning for Vision-Language Models
Paper
• 2510.19307
• Published
• 32
OmniNWM: Omniscient Driving Navigation World Models
Paper
• 2510.18313
• Published
• 14
MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Paper
• 2510.19457
• Published
• 9
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Paper
• 2510.19808
• Published
• 30
KORE: Enhancing Knowledge Injection for Large Multimodal Models via
Knowledge-Oriented Augmentations and Constraints
Paper
• 2510.19316
• Published
• 12
VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Paper
• 2510.19488
• Published
• 20
GigaBrain-0: A World Model-Powered Vision-Language-Action Model
Paper
• 2510.19430
• Published
• 52
From Charts to Code: A Hierarchical Benchmark for Multimodal Models
Paper
• 2510.17932
• Published
• 8
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal
Evidence
Paper
• 2510.20579
• Published
• 56
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video
Narratives
Paper
• 2510.20822
• Published
• 41
ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
Paper
• 2510.20803
• Published
• 12
Conan: Progressive Learning to Reason Like a Detective over Multi-Scale
Visual Evidence
Paper
• 2510.20470
• Published
• 12
LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered
Canvas
Paper
• 2510.20820
• Published
• 11
Directional Reasoning Injection for Fine-Tuning MLLMs
Paper
• 2510.15050
• Published
• 12
See the Text: From Tokenization to Visual Reading
Paper
• 2510.18840
• Published
• 4
Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text
Inputs in Multimodal LLMs
Paper
• 2510.18279
• Published
• 6
Every Question Has Its Own Value: Reinforcement Learning with Explicit
Human Values
Paper
• 2510.20187
• Published
• 19
SAKE: Towards Editing Auditory Attribute Knowledge of Large
Audio-Language Models
Paper
• 2510.16917
• Published
• 20
UI-Ins: Enhancing GUI Grounding with Multi-Perspective
Instruction-as-Reasoning
Paper
• 2510.20286
• Published
• 24
Video-As-Prompt: Unified Semantic Control for Video Generation
Paper
• 2510.20888
• Published
• 50
From Denoising to Refining: A Corrective Framework for Vision-Language
Diffusion Model
Paper
• 2510.19871
• Published
• 30
Model Merging with Functional Dual Anchors
Paper
• 2510.21223
• Published
• 13
Visual Diffusion Models are Geometric Solvers
Paper
• 2510.21697
• Published
• 20
Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image
Generation
Paper
• 2510.21583
• Published
• 31
Reasoning with Sampling: Your Base Model is Smarter Than You Think
Paper
• 2510.14901
• Published
• 48
RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via
Hierarchical Model Merging
Paper
• 2510.20479
• Published
• 12
Taming Modality Entanglement in Continual Audio-Visual Segmentation
Paper
• 2510.17234
• Published
• 5
PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language
Models in Physical Environments
Paper
• 2510.21111
• Published
• 3
Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with
Free-Form Preferences
Paper
• 2510.23451
• Published
• 28
LightBagel: A Light-weighted, Double Fusion Framework for Unified
Multimodal Understanding and Generation
Paper
• 2510.22946
• Published
• 18
LongCat-Video Technical Report
Paper
• 2510.22200
• Published
• 33
Tongyi DeepResearch Technical Report
Paper
• 2510.24701
• Published
• 101
Generalization or Memorization: Dynamic Decoding for Mode Steering
Paper
• 2510.22099
• Published
• 4
VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a
Unified Concept Set
Paper
• 2510.21323
• Published
• 4
Rethinking Visual Intelligence: Insights from Video Pretraining
Paper
• 2510.24448
• Published
• 6
Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human
Animation
Paper
• 2510.23581
• Published
• 42
VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing,
Speaking, and Acting
Paper
• 2510.21817
• Published
• 42
Open Multimodal Retrieval-Augmented Factual Image Generation
Paper
• 2510.22521
• Published
• 31
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal
Understanding
Paper
• 2510.23479
• Published
• 16
Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal
Reasoning in MLLMs
Paper
• 2510.24514
• Published
• 22
Emu3.5: Native Multimodal Models are World Learners
Paper
• 2510.26583
• Published
• 111
The End of Manual Decoding: Towards Truly End-to-End Language Models
Paper
• 2510.26697
• Published
• 117
Kimi Linear: An Expressive, Efficient Attention Architecture
Paper
• 2510.26692
• Published
• 125
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with
the MME-CoF Benchmark
Paper
• 2510.26802
• Published
• 34
OmniX: From Unified Panoramic Generation and Perception to
Graphics-Ready 3D Scenes
Paper
• 2510.26800
• Published
• 22
OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal
Document Layout Generation
Paper
• 2510.26213
• Published
• 10
ThinkMorph: Emergent Properties in Multimodal Interleaved
Chain-of-Thought Reasoning
Paper
• 2510.27492
• Published
• 86
π_RL: Online RL Fine-tuning for Flow-based
Vision-Language-Action Models
Paper
• 2510.25889
• Published
• 66
Continuous Autoregressive Language Models
Paper
• 2510.27688
• Published
• 73
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised
Reinforcement Learning
Paper
• 2510.27606
• Published
• 31
Revisiting Multimodal Positional Encoding in Vision-Language Models
Paper
• 2510.23095
• Published
• 22
SemCoT: Accelerating Chain-of-Thought Reasoning through
Semantically-Aligned Implicit Tokens
Paper
• 2510.24940
• Published
• 18
Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive
Trigger Learning
Paper
• 2510.27623
• Published
• 13
The Underappreciated Power of Vision Models for Graph Structural
Understanding
Paper
• 2510.24788
• Published
• 36
LongCat-Flash-Omni Technical Report
Paper
• 2511.00279
• Published
• 26
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement
Reading with MeasureBench
Paper
• 2510.26865
• Published
• 12
$\left|\,\circlearrowright\,\text{BUS}\,\right|$: A Large and Diverse Multimodal
Benchmark for evaluating the ability of Vision-Language Models to understand
Rebus Puzzles
Paper
• 2511.01340
• Published
• 13
Don't Blind Your VLA: Aligning Visual Representations for OOD
Generalization
Paper
• 2510.25616
• Published
• 105
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual
Representation
Paper
• 2511.02778
• Published
• 102
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for
Visual Chain-of-Thought
Paper
• 2511.02779
• Published
• 59
When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs
Preference Dynamics in MLLMs
Paper
• 2511.02243
• Published
• 25
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for
Large Multimodal Models
Paper
• 2511.02650
• Published
• 10
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive
Capacity
Paper
• 2511.03146
• Published
• 8
Thinking with Video: Video Generation as a Promising Multimodal
Reasoning Paradigm
Paper
• 2511.04570
• Published
• 240
V-Thinker: Interactive Thinking with Images
Paper
• 2511.04460
• Published
• 97
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable
Non-Visual Shortcuts
Paper
• 2511.04655
• Published
• 8
NVIDIA Nemotron Nano V2 VL
Paper
• 2511.03929
• Published
• 30
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal
Interactions
Paper
• 2511.03334
• Published
• 53
Cambrian-S: Towards Spatial Supersensing in Video
Paper
• 2511.04670
• Published
• 38
Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
Paper
• 2511.03774
• Published
• 13
Paper
• 2511.05491
• Published
• 52
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Paper
• 2511.04962
• Published
• 57
Paper
• 2511.05369
• Published
• 10
DeepEyesV2: Toward Agentic Multimodal Model
Paper
• 2511.05271
• Published
• 45
HaluMem: Evaluating Hallucinations in Memory Systems of Agents
Paper
• 2511.03506
• Published
• 94
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal
LLMs
Paper
• 2511.07250
• Published
• 18
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
Paper
• 2511.05705
• Published
• 8
Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs
Paper
• 2511.05933
• Published
• 9
MPJudge: Towards Perceptual Assessment of Music-Induced Paintings
Paper
• 2511.07137
• Published
• 6
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware
Large Language Models
Paper
• 2511.07299
• Published
• 8
Wasm: A Pipeline for Constructing Structured Arabic Interleaved
Multimodal Corpora
Paper
• 2511.07080
• Published
• 32
VideoSSR: Video Self-Supervised Reinforcement Learning
Paper
• 2511.06281
• Published
• 25
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Paper
• 2511.05489
• Published
• 3
WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Paper
• 2511.09515
• Published
• 20
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning
Paper
• 2511.06805
• Published
• 13
Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
Paper
• 2511.08892
• Published
• 210
TiDAR: Think in Diffusion, Talk in Autoregression
Paper
• 2511.08923
• Published
• 128
Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising
Paper
• 2511.08633
• Published
• 55
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Paper
• 2511.11134
• Published
• 32
DoPE: Denoising Rotary Position Embedding
Paper
• 2511.09146
• Published
• 97
WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Paper
• 2511.11434
• Published
• 45
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation
Paper
• 2511.08195
• Published
• 34
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Paper
• 2511.07403
• Published
• 14
Simulating the Visual World with Artificial Intelligence: A Roadmap
Paper
• 2511.08585
• Published
• 30
EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation
Paper
• 2511.11002
• Published
• 4
Depth Anything 3: Recovering the Visual Space from Any Views
Paper
• 2511.10647
• Published
• 99
Music Flamingo: Scaling Music Understanding in Audio Language Models
Paper
• 2511.10289
• Published
• 17
AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models
Paper
• 2511.10017
• Published
• 7
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
Paper
• 2511.11793
• Published
• 187
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Paper
• 2511.09611
• Published
• 70
Part-X-MLLM: Part-aware 3D Multimodal Large Language Model
Paper
• 2511.13647
• Published
• 71
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Paper
• 2511.13704
• Published
• 43
Back to Basics: Let Denoising Generative Models Denoise
Paper
• 2511.13720
• Published
• 69
NORA-1.5: A Vision-Language-Action Model Trained using World Model- and Action-based Preference Rewards
Paper
• 2511.14659
• Published
• 13
VIDEOP2R: Video Understanding from Perception to Reasoning
Paper
• 2511.11113
• Published
• 111
A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space
Paper
• 2511.10555
• Published
• 62
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Paper
• 2511.13853
• Published
• 36
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Paper
• 2511.14159
• Published
• 25
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
Paper
• 2511.14582
• Published
• 19
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Paper
• 2511.13026
• Published
• 26
General Agentic Memory Via Deep Research
Paper
• 2511.18423
• Published
• 167
SAM 3: Segment Anything with Concepts
Paper
• 2511.16719
• Published
• 129
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Paper
• 2511.16334
• Published
• 93
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
Paper
• 2511.19418
• Published
• 29
HunyuanVideo 1.5 Technical Report
Paper
• 2511.18870
• Published
• 28
HunyuanOCR Technical Report
Paper
• 2511.19575
• Published
• 22
RynnVLA-002: A Unified Vision-Language-Action and World Model
Paper
• 2511.17502
• Published
• 28
O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Paper
• 2511.13593
• Published
• 27
Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination
Paper
• 2511.17490
• Published
• 22
Loomis Painter: Reconstructing the Painting Process
Paper
• 2511.17344
• Published
• 15
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
Paper
• 2511.16175
• Published
• 12
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
Paper
• 2511.11007
• Published
• 15
Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
Paper
• 2511.17487
• Published
• 12
VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation
Paper
• 2511.17199
• Published
• 8
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Paper
• 2511.19365
• Published
• 64
In-Video Instructions: Visual Signals as Generative Control
Paper
• 2511.19401
• Published
• 32
The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation
Paper
• 2511.20256
• Published
• 28
AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser
Paper
• 2511.16397
• Published
• 10
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Paper
• 2511.20561
• Published
• 32
Soft Adaptive Policy Optimization
Paper
• 2511.20347
• Published
• 42
STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow
Paper
• 2511.20462
• Published
• 32
MagicWorld: Interactive Geometry-driven Video World Exploration
Paper
• 2511.18886
• Published
• 19
UniGame: Turning a Unified Multimodal Model Into Its Own Adversary
Paper
• 2511.19413
• Published
• 20
Monet: Reasoning in Latent Visual Space Beyond Images and Language
Paper
• 2511.21395
• Published
• 18
G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
Paper
• 2511.21688
• Published
• 8
Video Generation Models Are Good Latent Reward Models
Paper
• 2511.21541
• Published
• 45
Canvas-to-Image: Compositional Image Generation with Multimodal Controls
Paper
• 2511.21691
• Published
• 36
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Paper
• 2511.22625
• Published
• 47
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Paper
• 2512.01342
• Published
• 18
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
Paper
• 2511.16595
• Published
• 10
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
Paper
• 2511.16669
• Published
• 32
MiMo-Embodied: X-Embodied Foundation Model Technical Report
Paper
• 2511.16518
• Published
• 26
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Paper
• 2511.15605
• Published
• 24
Scaling Spatial Intelligence with Multimodal Foundation Models
Paper
• 2511.13719
• Published
• 47
Step-Audio-R1 Technical Report
Paper
• 2511.15848
• Published
• 58
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Paper
• 2511.16668
• Published
• 55
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
Paper
• 2511.14349
• Published
• 18
VisPlay: Self-Evolving Vision-Language Models from Images
Paper
• 2511.15661
• Published
• 43
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published
• 77
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
Paper
• 2511.14993
• Published
• 231
MultiShotMaster: A Controllable Multi-Shot Video Generation Framework
Paper
• 2512.03041
• Published
• 66
TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition
Paper
• 2512.01248
• Published
• 12
OneThinker: All-in-one Reasoning Model for Image and Video
Paper
• 2512.03043
• Published
• 33
ViDiC: Video Difference Captioning
Paper
• 2512.03405
• Published
• 28
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Paper
• 2512.03746
• Published
• 17
Qwen3-VL Technical Report
Paper
• 2511.21631
• Published
• 158
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Paper
• 2512.03794
• Published
• 5
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Paper
• 2512.04677
• Published
• 171
Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression
Paper
• 2512.05081
• Published
• 31
Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs
Paper
• 2511.22826
• Published
• 8
Guided Self-Evolving LLMs with Minimal Human Supervision
Paper
• 2512.02472
• Published
• 55
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Paper
• 2512.02395
• Published
• 49
WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning
Paper
• 2512.02425
• Published
• 25
Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation
Paper
• 2512.02457
• Published
• 14
MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
Paper
• 2512.03046
• Published
• 12
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Paper
• 2511.22586
• Published
• 7
Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation
Paper
• 2512.04678
• Published
• 42
Relational Visual Similarity
Paper
• 2512.07833
• Published
• 25
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
Paper
• 2512.07829
• Published
• 23
LongCat-Image Technical Report
Paper
• 2512.07584
• Published
• 23
OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation
Paper
• 2512.08294
• Published
• 18
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Paper
• 2512.07831
• Published
• 17
Paper
• 2511.23469
• Published
• 16
Wan-Move: Motion-controllable Video Generation via Latent Trajectory Guidance
Paper
• 2512.08765
• Published
• 132
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
Paper
• 2512.07951
• Published
• 50
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory
Paper
• 2512.07802
• Published
• 46
DeepCode: Open Agentic Coding
Paper
• 2512.07921
• Published
• 33
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
Paper
• 2512.02231
• Published
• 9
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Paper
• 2512.08829
• Published
• 21
Rethinking Chain-of-Thought Reasoning for Videos
Paper
• 2512.09616
• Published
• 19
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
Paper
• 2511.23386
• Published
• 16
LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Paper
• 2512.13604
• Published
• 74
QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management
Paper
• 2512.12967
• Published
• 108
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Paper
• 2512.13080
• Published
• 17
MMGR: Multi-Modal Generative Reasoning
Paper
• 2512.14691
• Published
• 119
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Paper
• 2512.12675
• Published
• 41
Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure
Paper
• 2512.14336
• Published
• 31
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Paper
• 2512.14051
• Published
• 46
ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement
Paper
• 2512.13303
• Published
• 17
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
Paper
• 2512.14273
• Published
• 10
Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition
Paper
• 2512.15603
• Published
• 66
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Paper
• 2512.15713
• Published
• 17
WorldPlay: Towards Long-Term Geometric Consistency for Real-Time Interactive World Modeling
Paper
• 2512.14614
• Published
• 71
Video Reality Test: Can AI-Generated ASMR Videos fool VLMs and Humans?
Paper
• 2512.13281
• Published
• 64
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Paper
• 2512.14699
• Published
• 28
TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Paper
• 2512.14698
• Published
• 21
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Paper
• 2512.14052
• Published
• 42
Puzzle Curriculum GRPO for Vision-Centric Reasoning
Paper
• 2512.14944
• Published
• 36
In Pursuit of Pixel Supervision for Visual Pre-training
Paper
• 2512.15715
• Published
• 11
Kling-Omni Technical Report
Paper
• 2512.16776
• Published
• 170
Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model
Paper
• 2512.13507
• Published
• 40
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Paper
• 2512.16899
• Published
• 14
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Paper
• 2512.17495
• Published
• 20
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Paper
• 2512.17532
• Published
• 67
PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence
Paper
• 2512.16793
• Published
• 75
Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
Paper
• 2512.17909
• Published
• 37
MMFormalizer: Multimodal Autoformalization in the Wild
Paper
• 2601.03017
• Published
• 105
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Paper
• 2601.04720
• Published
• 55
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Paper
• 2601.05175
• Published
• 36
Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
Paper
• 2512.21815
• Published
• 22
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Paper
• 2512.22905
• Published
• 20
InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Paper
• 2512.18745
• Published
• 12
Latent Implicit Visual Reasoning
Paper
• 2512.21218
• Published
• 69
How Much 3D Do Video Foundation Models Encode?
Paper
• 2512.19949
• Published
• 10
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Paper
• 2512.20557
• Published
• 50
Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Paper
• 2512.21337
• Published
• 31
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Paper
• 2512.20618
• Published
• 55
LTX-2: Efficient Joint Audio-Visual Foundation Model
Paper
• 2601.03233
• Published
• 154
UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision
Paper
• 2601.03193
• Published
• 47
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Paper
• 2512.23412
• Published
• 41
MiMo-V2-Flash Technical Report
Paper
• 2601.02780
• Published
• 35
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Paper
• 2601.02204
• Published
• 62
Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation
Paper
• 2512.24271
• Published
• 63