GEMS: Agent-Native Multimodal Generation with Memory and Skills • Paper • 2603.28088 • Published 4 days ago • 74
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing • Paper • 2603.19228 • Published 15 days ago • 67
A Matter of Time: Revealing the Structure of Time in Vision-Language Models • Paper • 2510.19559 • Published Oct 22, 2025 • 1
Gen-Searcher: Reinforcing Agentic Search for Image Generation • Paper • 2603.28767 • Published 4 days ago • 51
Agent READMEs: An Empirical Study of Context Files for Agentic Coding • Paper • 2511.12884 • Published Nov 17, 2025 • 28
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens • Paper • 2603.23516 • Published 28 days ago • 44
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels • Paper • 2603.19312 • Published 20 days ago • 18
RealRestorer: Towards Generalizable Real-World Image Restoration with Large-Scale Image Editing Models • Paper • 2603.25502 • Published 8 days ago • 55
Repurposing Geometric Foundation Models for Multi-view Diffusion • Paper • 2603.22275 • Published 11 days ago • 46
WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG • Paper • 2603.23497 • Published 10 days ago • 90
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model • Paper • 2603.21986 • Published 11 days ago • 120
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions • Paper • 2509.06951 • Published Sep 8, 2025 • 33
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding • Paper • 2603.19235 • Published 15 days ago • 94
Grounding World Simulation Models in a Real-World Metropolis • Paper • 2603.15583 • Published 18 days ago • 152
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding • Paper • 2601.10611 • Published Jan 15 • 32