arXiv:2602.10224

Internalizing Meta-Experience into Memory for Guided Reinforcement Learning in Large Language Models

Published on Feb 10 · Submitted by Yu Zeng on Feb 12
Authors: Yu Zeng et al.
Abstract

AI-generated summary: Meta-Experience Learning enhances LLM reasoning by incorporating self-distilled error representations into parametric memory through contrastive trajectory analysis and language-modeled reward signals.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for enhancing the reasoning capabilities of Large Language Models (LLMs). Despite its efficacy, RLVR faces a meta-learning bottleneck: beyond practice and verification, it lacks the error-attribution and experience-internalization mechanisms intrinsic to the human learning cycle, which limits fine-grained credit assignment and the formation of reusable knowledge. We refer to such reusable knowledge representations derived from past errors as meta-experience. Based on this insight, we propose Meta-Experience Learning (MEL), a novel framework that incorporates self-distilled meta-experience into the model's parametric memory. Building on standard RLVR, we introduce an additional stage that leverages the LLM's self-verification capability to conduct contrastive analysis on paired correct and incorrect trajectories, identify the precise bifurcation points where reasoning errors arise, and summarize them into generalizable meta-experience. The meta-experience is then internalized into the LLM's parametric memory by minimizing its negative log-likelihood, which induces a language-modeled reward signal that bridges correct and incorrect reasoning trajectories and facilitates effective knowledge reuse. Experimental results demonstrate that MEL achieves consistent improvements across benchmarks, yielding 3.92%–4.73% Pass@1 gains across model sizes.
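
To make the abstract's two-part objective concrete, here is a hedged sketch in notation chosen purely for illustration; the symbols $\mathcal{L}_{\text{RLVR}}$, $e$, $x$, and $\lambda$ are assumptions, not the paper's own:

$$
\mathcal{L}_{\text{MEL}}(\theta) \;=\; \mathcal{L}_{\text{RLVR}}(\theta) \;+\; \lambda\, \mathbb{E}_{(x,\,e)}\big[-\log \pi_\theta(e \mid x)\big],
$$

where $e$ denotes the meta-experience distilled from contrastive analysis of a correct/incorrect trajectory pair for question $x$, $\pi_\theta$ is the LLM policy, and $\lambda$ trades off internalization against the verifiable-reward objective.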

Community

Paper author and submitter

We propose Meta-Experience Learning (MEL), which breaks the meta-learning and credit-assignment bottleneck of standard RLVR by explicitly modeling and internalizing reusable error-based knowledge. MEL exploits an LLM's self-verification ability to perform contrastive analysis over correct and incorrect trajectories, pinpointing bifurcation points where reasoning goes wrong and abstracting them into generalizable meta-experiences. These meta-experiences are then distilled into the model's parametric memory via NLL minimization, inducing a language-modeled reward signal that bridges correct and incorrect reasoning paths and enables effective knowledge reuse.
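
A minimal sketch of the two extra steps described above, contrastive analysis and NLL internalization, assuming a standard Hugging Face causal-LM setup. The model name, prompt templates, and helper names below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of MEL's added steps on top of RLVR (illustrative assumptions only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # placeholder policy model (assumption)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)


def contrastive_analysis_prompt(question: str, correct: str, incorrect: str) -> str:
    """Prompt the policy itself to compare a correct/incorrect trajectory pair,
    locate the bifurcation point, and distill a reusable meta-experience."""
    return (
        f"Question:\n{question}\n\n"
        f"Correct solution:\n{correct}\n\n"
        f"Incorrect solution:\n{incorrect}\n\n"
        "Find the first step where the incorrect solution diverges from the "
        "correct one, explain the underlying error, and state one short, "
        "generalizable lesson (meta-experience) that would prevent it."
    )


def meta_experience_nll(question: str, meta_experience: str) -> torch.Tensor:
    """Negative log-likelihood of the distilled meta-experience given the
    question; minimizing it internalizes the lesson into parametric memory."""
    prompt = f"Question:\n{question}\n\nLesson learned from past mistakes:\n"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + meta_experience, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the meta-experience tokens
    return model(input_ids=full_ids, labels=labels).loss
```

In training, the meta-experience text would come from the model's own response to the contrastive-analysis prompt, and the NLL term would be scaled and added to the usual RLVR policy loss before the backward pass.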

