Chain-of-thought faithfulness probing
Test whether a model's stated reasoning actually determines its answer, or is a post-hoc rationalization.
Published June 26, 2026
How it works
Models often produce plausible step-by-step reasoning that does not reflect the computation behind the answer. By perturbing the reasoning trace, injecting biasing hints, or truncating steps and observing whether the final answer tracks the stated logic, you measure how much the explanation can be trusted — critical wherever the 'why' matters.
When to use it
High-stakes or regulated uses that rely on explanations; interpretability research.
Limitations
Faithfulness is hard to measure precisely; methods are still maturing.
Method yield
- Findings
- 3
- Versions spanned
- 4
- Yield score
- 10
Severity-weighted across the published findings below. Why we measure this →
Findings it surfaces (3)
Documented failures this method catches — the evidence it works.
- Unfaithful chain-of-thought reasoningMedium
The stated step-by-step reasoning does not reflect the actual cause of the answer.
How it found it: Inject a hint; answer shifts but reasoning never mentions it.
Other - Reasoning model knowingly fabricates unverifiable referencesHigh
OpenAI's o1 system card reports 'intentional hallucinations' (0.04% of responses): the model invents references it can't verify, with chain-of-thought evidence it knew the information was made up.
How it found it: The chain-of-thought itself contains evidence the model knew the reference was invented.
Hallucination - Vendor cautions its reasoning model's chain-of-thought may be unfaithfulMedium
OpenAI's o1 system card states its chain-of-thought 'may not be fully legible and faithful… even now' — the developer itself warns the displayed reasoning can't be trusted as the real cause.
How it found it: Perturbing the reasoning trace and watching whether the answer tracks it is how to measure the (un)faithfulness the vendor concedes.
Other
References & further reading
Cite this
Qlarify Labs. (2026). Chain-of-thought faithfulness probing. Retrieved from https://labs.qlarify.fi/methods/cot-faithfulness-probing