Small Vision-Language Models are Smart Compressors for Long Video Understanding Paper • 2604.08120 • Published 6 days ago • 20
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web Paper • 2604.08516 • Published 6 days ago • 41
When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models Paper • 2604.08546 • Published 6 days ago • 112
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver Paper • 2604.08377 • Published 6 days ago • 273
ClawBench: Can AI Agents Complete Everyday Online Tasks? Paper • 2604.08523 • Published 6 days ago • 253
ELT: Elastic Looped Transformers for Visual Generation Paper • 2604.09168 • Published 5 days ago • 18
WildDet3D: Scaling Promptable 3D Detection in the Wild Paper • 2604.08626 • Published 6 days ago • 226
Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music Paper • 2604.10905 • Published 2 days ago • 20
Strips as Tokens: Artist Mesh Generation with Native UV Segmentation Paper • 2604.09132 • Published 5 days ago • 46
The Ultra-Scale Playbook 🌌: The ultimate guide to training LLMs on large GPU clusters Space • Running • 3.78k