Papers
arxiv:2604.06628

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Published on Apr 8 · Submitted by Qihan Ren on Apr 10
Authors:
Abstract

Supervised finetuning and reinforcement learning exhibit conditional cross-domain generalization in reasoning tasks, influenced by optimization dynamics, data quality, and model capability, with asymmetric outcomes between reasoning improvement and safety degradation.

AI-generated summary

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking) even from a toy arithmetic game, while weaker ones imitate surface verbosity. This generalization is asymmetric, however: reasoning improves while safety degrades, reframing the question from whether reasoning SFT generalizes to under what conditions and at what cost.
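The dip-and-recovery pattern described above can be checked mechanically when cross-domain scores are logged at successive checkpoints. The sketch below is illustrative only and is not code from the paper; the function name, the tolerance parameter, and the example scores are hypothetical.

```python
def dip_and_recovery(scores, tol=0.0):
    """Return True if cross-domain scores first drop below the initial
    value and later recover to exceed it (the dip-and-recovery pattern).

    scores: per-checkpoint evaluation scores, in training order.
    tol: hypothetical slack to ignore tiny fluctuations.
    """
    if len(scores) < 3:
        return False
    start = scores[0]
    # Index of the first checkpoint that dips below the starting score.
    dip_idx = next((i for i, s in enumerate(scores) if s < start - tol), None)
    if dip_idx is None:
        return False  # never dipped, so no dip-and-recovery
    # Recovery: some later checkpoint exceeds the starting score.
    return any(s > start + tol for s in scores[dip_idx + 1:])

# Illustrative checkpoint curves (made-up numbers):
print(dip_and_recovery([0.42, 0.35, 0.33, 0.40, 0.47]))  # True: dips, then recovers
print(dip_and_recovery([0.42, 0.44, 0.46]))              # False: monotone improvement
```

A curve that would be flagged as a failure by an early checkpoint (e.g., the 0.33 above) ends up above its starting point with extended training, which is why the abstract cautions against evaluating generalization from short-training checkpoints alone.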

Community

Paper submitter

We have open-sourced all our models and datasets: https://huggingface.co/collections/jasonrqh/rethink-sft-generalization
They are also available on ModelScope: https://modelscope.cn/collections/nebularaid/Rethink_SFT_generalization


Models citing this paper 0


Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 1