arxiv:2604.01161

Reasoning Shift: How Context Silently Shortens LLM Reasoning

Published on Apr 1
Submitted by Gleb on Apr 2
Abstract

Reasoning behaviors in large language models compress under varied contextual conditions, potentially impacting performance on complex tasks despite maintaining accuracy on simpler ones.

AI-generated summary

Large language models (LLMs) exhibiting test-time scaling behavior, such as extended reasoning traces and self-verification, have demonstrated remarkable performance on complex, long-term reasoning tasks. However, the robustness of these reasoning behaviors remains underexplored. To investigate this, we conduct a systematic evaluation of multiple reasoning models across three scenarios: (1) problems augmented with lengthy, irrelevant context; (2) multi-turn conversational settings with independent tasks; and (3) problems presented as a subtask within a complex task. We observe an interesting phenomenon: reasoning models tend to produce much shorter reasoning traces (up to 50%) for the same problem under different context conditions compared to the traces produced when the problem is presented in isolation. A finer-grained analysis reveals that this compression is associated with a decrease in self-verification and uncertainty management behaviors, such as double-checking. While this behavioral shift does not compromise performance on straightforward problems, it might affect performance on more challenging tasks. We hope our findings draw additional attention to both the robustness of reasoning models and the problem of context management for LLMs and LLM-based agents.
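The core measurement behind the "up to 50%" figure can be sketched as a trace-length comparison between the isolated and contextualized presentations of the same problem. The sketch below is illustrative, not the paper's code: the traces are hand-written stand-ins for model outputs, and whitespace tokenization stands in for the model's real tokenizer.

```python
# Hypothetical sketch of measuring reasoning-trace compression.
# In the actual experiments, `trace_isolated` and `trace_in_context` would be
# the reasoning traces a model emits for the same problem presented alone
# vs. embedded in long, irrelevant context.

def token_count(trace: str) -> int:
    """Crude whitespace tokenization; a real setup would use the model tokenizer."""
    return len(trace.split())

def compression(trace_isolated: str, trace_in_context: str) -> float:
    """Fraction of reasoning tokens lost when the problem is placed in context."""
    return 1.0 - token_count(trace_in_context) / token_count(trace_isolated)

# Toy traces: the contextualized one skips planning and double-checking.
trace_isolated = (
    "Let me plan. First compute the sum ... wait, let me double-check "
    "that step ... so the answer is 42."
)
trace_in_context = "Compute the sum directly ... the answer is 42."

print(f"compression: {compression(trace_isolated, trace_in_context):.0%}")
```

Averaging this ratio over a problem set, per context condition, would reproduce the kind of aggregate compression statistic the abstract reports.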

Community

Paper author · Paper submitter

In reasoning, high-level behavioral patterns like uncertainty management and self-verification are fragile and can be suppressed by irrelevant context, leading to significantly shorter reasoning traces (up to 50%). While this reduction in overthinking helps on easy problems, it degrades performance on more challenging tasks.

A striking part of this work is that the same problem gets a much shorter reasoning trace once you drop it into long, irrelevant context. It's not just fewer tokens; there's a real shift away from planning and self-verification, which could hurt on harder problems. The arxivlens breakdown helped me parse the method details, and the walkthrough at https://arxivlens.com/PaperView/Details/reasoning-shift-how-context-silently-shortens-llm-reasoning-9685-fdacec50 does a nice job unpacking how the authors code and analyze the reasoning traces. One concrete follow-up: would adding an explicit verifier or a bounded verification budget rescue the deeper steps under long-context prompts, or is the compression inherent to context management?
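The "coding" of reasoning traces mentioned above can be approximated, very roughly, by flagging surface cues of self-verification and uncertainty management. This is only a sketch under my own assumptions (the cue list and helper name are invented; the paper's analysis is presumably more careful), but it shows the basic idea of counting such behaviors per trace.

```python
import re

# Illustrative cue patterns for self-verification / uncertainty management.
# These phrases are my guesses at typical cues, not the paper's coding scheme.
VERIFICATION_CUES = [
    r"\bdouble-check\b",
    r"\blet me verify\b",
    r"\bwait\b",
    r"\bhmm\b",
]

def count_verification_cues(trace: str) -> int:
    """Count occurrences of verification/uncertainty cues in a reasoning trace."""
    text = trace.lower()
    return sum(len(re.findall(pattern, text)) for pattern in VERIFICATION_CUES)

# Toy traces: the compressed one drops the checking behavior entirely.
short_trace = "Compute 2+2. The answer is 4."
long_trace = (
    "Compute 2+2. Hmm, 2+2 is 4. Wait, let me verify: 4 - 2 = 2. "
    "Double-check passed."
)

print(count_verification_cues(short_trace), count_verification_cues(long_trace))
```

Comparing cue counts between isolated and in-context traces would surface the behavioral shift the authors describe, independently of raw trace length.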

