PRIMO R1: Process Reasoning Induced Monitoring

PRIMO R1 (Process Reasoning Induced Monitoring) is a 7B video multimodal large language model (MLLM) framework designed for accurate process supervision in long-horizon robotic manipulation. It was introduced in the paper "From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation".

Model Description

Current video MLLMs often function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. PRIMO R1 transforms these models into active "Critics" by:

Reinforcement Learning: Leveraging outcome-based RL to incentivize explicit Chain-of-Thought (CoT) generation for progress estimation.
Temporal Anchoring: Constructing a structured temporal input that explicitly anchors the video sequence between initial and current state images.
Process Reasoning: Focusing on evaluating the current state against the intended task goal to detect failures and track progress.
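
The temporal anchoring and outcome-based reward ideas above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the PRIMO R1 code: the `AnchoredInput` container, the uniform frame subsampling, the `<answer>` tag format, and the linear reward shape are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnchoredInput:
    """Hypothetical container for one temporally anchored sample."""
    task_goal: str
    initial_image: str        # path to the initial-state image (first anchor)
    video_frames: List[str]   # paths to frames between the two anchors
    current_image: str        # path to the current-state image (second anchor)


def build_anchored_input(task_goal: str, frames: List[str],
                         max_frames: int = 8) -> AnchoredInput:
    """Anchor a frame sequence between its first and last frames.

    Uniform subsampling of the middle frames is an assumption of this sketch.
    """
    if len(frames) < 2:
        raise ValueError("need at least an initial and a current frame")
    middle = frames[1:-1]
    if len(middle) > max_frames:
        step = len(middle) / max_frames
        middle = [middle[int(i * step)] for i in range(max_frames)]
    return AnchoredInput(task_goal, frames[0], middle, frames[-1])


def parse_progress(response: str) -> Optional[float]:
    """Extract a progress percentage from a hypothetical <answer>...</answer> tag."""
    match = re.search(r"<answer>\s*(\d+(?:\.\d+)?)\s*%?\s*</answer>", response)
    if match is None:
        return None
    value = float(match.group(1))
    return value if 0.0 <= value <= 100.0 else None


def outcome_reward(response: str, true_progress: float) -> float:
    """Score only the final estimate; the reasoning text itself is not graded.

    Returns 1.0 for an exact match and decays linearly to 0.0 at an error of
    100 percentage points. Unparseable responses receive 0.0.
    """
    predicted = parse_progress(response)
    if predicted is None:
        return 0.0
    return 1.0 - abs(predicted - true_progress) / 100.0
```

Because the reward depends only on the parsed final answer, any chain of thought that precedes it is encouraged only indirectly, through its effect on the estimate. That is the sense of "outcome-based" assumed here.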

Performance

PRIMO R1 achieves state-of-the-art performance across several benchmarks:

RoboFail Benchmark: Achieves 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0 percentage points.
Error Reduction: Demonstrates a 50% reduction in the mean absolute error of progress estimation compared to specialized reasoning baselines.
Generalization: Shows strong zero-shot generalization on difficult failure detection tasks in out-of-domain real-world humanoid scenarios.
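
For readers unfamiliar with the error metric quoted above, the following sketch shows how a mean absolute error and a relative reduction between two models could be computed. The function names are hypothetical and the numbers in the usage lines are invented for illustration only; they are not results from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute difference between predicted and true progress values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(baseline_mae: float, model_mae: float) -> float:
    """Fraction by which model_mae is lower than baseline_mae (0.5 means 50%)."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - model_mae) / baseline_mae


# Invented example values, not measurements from any model.
true_progress = [20.0, 50.0, 70.0]
baseline_pred = [40.0, 50.0, 30.0]
model_pred = [10.0, 50.0, 90.0]

baseline_mae = mean_absolute_error(baseline_pred, true_progress)
model_mae = mean_absolute_error(model_pred, true_progress)
reduction = relative_reduction(baseline_mae, model_mae)
```

A reduction of 0.5 under this definition corresponds to halving the baseline's average error.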

Citations

If you find our work helpful for your research, please consider citing it.

@misc{liu2026passiveobserveractivecritic,
      title={From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation}, 
      author={Yibin Liu and Yaxing Lyu and Daqi Gao and Zhixuan Liang and Weiliang Tang and Shilong Mu and Xiaokang Yang and Yao Mu},
      year={2026},
      eprint={2603.15600},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.15600}, 
}