arxiv:2603.22582

Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?

Published on Mar 23

· Submitted by

Authors:

Abstract

Chain-of-thought reasoning faithfulness varies significantly across open-weight models, with acknowledgment rates ranging from 39.7% to 89.9% depending on model architecture and training methodology, indicating that faithfulness is not a fixed property but depends on system design and hint type.

AI-generated summary

Chain-of-thought (CoT) reasoning has been proposed as a transparency mechanism for large language models in safety-critical deployments, yet its effectiveness depends on faithfulness (whether models accurately verbalize the factors that actually influence their outputs), a property that prior evaluations have examined in only two proprietary models, finding acknowledgment rates as low as 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1. To extend this evaluation across the open-weight ecosystem, this study tests 12 open-weight reasoning models spanning 9 architectural families (7B-685B parameters) on 498 multiple-choice questions from MMLU and GPQA Diamond, injecting six categories of reasoning hints (sycophancy, consistency, visual pattern, metadata, grader hacking, and unethical information) and measuring the rate at which models acknowledge hint influence in their CoT when hints successfully alter answers. Across 41,832 inference runs, overall faithfulness rates range from 39.7% (Seed-1.6-Flash) to 89.9% (DeepSeek-V3.2-Speciale) across model families, with consistency hints (35.5%) and sycophancy hints (53.9%) exhibiting the lowest acknowledgment rates. Training methodology and model family predict faithfulness more strongly than parameter count, and keyword-based analysis reveals a striking gap between thinking-token acknowledgment (approximately 87.5%) and answer-text acknowledgment (approximately 28.6%), suggesting that models internally recognize hint influence but systematically suppress this acknowledgment in their outputs. These findings carry direct implications for the viability of CoT monitoring as a safety mechanism and suggest that faithfulness is not a fixed property of reasoning models but varies systematically with architecture, training method, and the nature of the influencing cue.

View arXiv page View PDF GitHub 0 Add to collection

Community

richardyoung

Paper submitter about 11 hours ago

We ran ~42K inference calls across 12 open-weight reasoning models (9 families, 7B to 685B parameters) to answer a simple question: when a hint flips a model's answer, does the chain-of-thought actually say so?
Short answer: sometimes. Faithfulness rates range from 39.7% to 89.9% depending on the model, which means CoT monitoring as a safety strategy is only as reliable as the specific model you're deploying. Training methodology predicts faithfulness far better than parameter count does, and newer generations within a family tend to improve (DeepSeek-R1 at 74.8% vs. V3.2-Speciale at 89.9%).
Not all hints are created equal either. Social-pressure cues like consistency (35.5%) and sycophancy (53.9%) are the hardest for models to acknowledge, while rule-breaking hints like grader hacking and unethical information get flagged more readily. Models seem more comfortable admitting they followed a stolen answer key than admitting they deferred to a professor.
The wildest finding is the thinking-vs-answer gap. Models acknowledge hint influence ~87.5% of the time in their thinking tokens but only ~28.6% in their final answers. GPT-OSS-120B hits 93% thinking acknowledgment with exactly 0% answer acknowledgment. The models know what's happening. They just don't tell you.
All code, prompts, outputs, and annotations are open on GitHub / hf

librarian-bot

about 5 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2603.22582

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.22582 in a model README.md to link it from this page.

Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.22582 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.