RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
Abstract
RoboMME presents a large-scale standardized benchmark for evaluating vision-language-action models in long-horizon robotic manipulation tasks requiring memory mechanisms.
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings, which limits systematic understanding, fair comparison, and measurement of progress. To address these challenges, we introduce RoboMME: a large-scale, standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy covering temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations. Videos and code are available at https://robomme.github.io.
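To make the idea of a memory "integration strategy" concrete, below is a minimal sketch (not the paper's implementation) of the simplest possible memory representation: a sliding window of past observations handed to the policy together with the current frame. The `SlidingWindowMemoryPolicy` wrapper, the base policy object, and its `act()` signature are hypothetical placeholders, not the π0.5 interface.

```python
# Minimal illustrative sketch, not RoboMME's code: keep a fixed-length buffer
# of past observations and pass it to a base policy alongside the current one.
# `base_policy` and its act() signature are hypothetical placeholders.
from collections import deque


class SlidingWindowMemoryPolicy:
    """Wraps a base policy with a rolling observation history."""

    def __init__(self, base_policy, history_len=16):
        self.base_policy = base_policy
        self.history = deque(maxlen=history_len)  # oldest frames drop out first

    def reset(self):
        # Call at the start of each episode so memory does not leak across tasks.
        self.history.clear()

    def act(self, observation, instruction):
        # The memory "representation" here is just raw past observations;
        # richer representations could fill the same `history` slot.
        self.history.append(observation)
        return self.base_policy.act(
            observation=observation,
            instruction=instruction,
            history=list(self.history),
        )
```

Other memory representations would replace the raw frame buffer while keeping the same interface, which is roughly what it means to vary the representation while holding the integration strategy fixed.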
Community
TL;DR: RoboMME is a new benchmark for memory-augmented robotic manipulation, evaluating how well models remember, reason, and act across temporal, spatial, object, and procedural memory.
We have also prepared a Gradio online demo you can play with: https://huggingface.co/spaces/HongzeFu/RoboMME
Come test your memory!
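If you prefer to poke at the demo from code rather than the browser, here is a minimal sketch using the gradio_client package. The Space id comes from the link above; the demo's endpoint names and inputs are not documented here, so list them with view_api() before calling anything.

```python
# Minimal sketch: connect to the RoboMME demo Space and list its endpoints.
# The Space id is taken from the link above; endpoint names and inputs are
# unknown here, so inspect them with view_api() before calling predict().
from gradio_client import Client

client = Client("HongzeFu/RoboMME")
client.view_api()  # prints the Space's callable endpoints and their parameters
```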
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- RoboInter: A Holistic Intermediate Representation Suite Towards Robotic Manipulation (2026)
- Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation (2026)
- RMBench: Memory-Dependent Robotic Manipulation Benchmark with Insights into Policy Design (2026)
- VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents (2026)
- A Pragmatic VLA Foundation Model (2026)
- Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks (2026)
- MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments (2026)