On the Direction of RLVR Updates for LLM Reasoning: Identification and Exploitation Paper • 2603.22117 • Published Mar 2026 • 29
EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning Paper • 2509.22576 • Published Sep 26, 2025 • 137
Aligning Multimodal LLM with Human Preference: A Survey Paper • 2503.14504 • Published Mar 18, 2025 • 26
Robust Preference Optimization via Dynamic Target Margins Paper • 2506.03690 • Published Jun 4, 2025 • 2
Quantile Advantage Estimation for Entropy-Safe Reasoning Paper • 2509.22611 • Published Sep 26, 2025 • 120
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning Paper • 2503.07572 • Published Mar 10, 2025 • 48
Direct Multi-Turn Preference Optimization for Language Agents Paper • 2406.14868 • Published Jun 21, 2024
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment Paper • 2502.10391 • Published Feb 14, 2025 • 34
Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization Paper • 2407.07880 • Published Jul 10, 2024