Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory Paper • 2602.18434 • Published 26 days ago
MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos Paper • 2603.14145 • Published 4 days ago • 9
Do text-free diffusion models learn discriminative visual representations? Paper • 2311.17921 • Published Nov 29, 2023 • 1
LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation Paper • 2409.06703 • Published Sep 10, 2024 • 3
Towards Multimodal Understanding via Stable Diffusion as a Task-Aware Feature Extractor Paper • 2507.07106 • Published Jul 9, 2025 • 2