Reasoning & chain-of-thought

Reasoning failure

Chain-of-thought and dedicated reasoning models trade tokens for accuracy: the model writes out intermediate steps before answering. The gains are real but so are the new failure modes — the stated reasoning is not always the real reasoning (unfaithful chains), performance collapses on problems slightly outside the training distribution, and small irrelevant changes to a question can flip the answer. The linked findings are the largest cluster in the catalog: reasoning is where confident output and actual capability diverge most visibly.
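
The claim that small irrelevant changes can flip an answer is directly measurable. The sketch below is one minimal way to do it, not a method taken from the linked findings: ask the original question, ask a few rewordings that should not change the answer, and report the fraction of rewordings whose answer differs. The names flip_rate and toy_model are hypothetical, and toy_model is a deterministic stand-in for a real model call so that the example runs offline.

```python
from typing import Callable, Iterable


def flip_rate(model: Callable[[str], str], question: str,
              variants: Iterable[str]) -> float:
    """Fraction of variants whose answer differs from the baseline answer.

    `variants` are assumed to be rewordings that should NOT change the
    correct answer (added irrelevant detail, reordered clauses, etc.).
    """
    baseline = model(question).strip().lower()
    variant_list = list(variants)
    if not variant_list:
        return 0.0
    flips = sum(model(v).strip().lower() != baseline for v in variant_list)
    return flips / len(variant_list)


# Hypothetical stand-in for a real model: it answers "8" unless the
# prompt mentions an irrelevant color, which throws it off.
def toy_model(prompt: str) -> str:
    return "7" if "red" in prompt else "8"


question = "Ann has 5 apples and buys 3 more. How many apples does she have?"
variants = [
    "Ann has 5 apples and buys 3 more apples. How many does she have?",
    "Ann has 5 red apples and buys 3 more. How many apples does she have?",
]

print(flip_rate(toy_model, question, variants))  # 0.5
```

With a real model, exact string comparison is usually too strict; extracting and normalizing the final answer before comparing is a reasonable next step.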

Findings (15)

Character-counting errors in tokenized words | Reasoning | Low
Date and duration arithmetic errors | Reasoning | Low
Errors in multi-digit arithmetic | Reasoning | Medium
Failure to honor negation in instructions | Reasoning | Medium
Inconsistent answers to semantically equivalent prompts | Reasoning | Medium
Miscounting items in long lists | Reasoning | Low
Model behavior drifts between versions: a fixed task can regress | Reasoning | Medium
Reasoning model degrades under few-shot prompting | Reasoning | Medium
Reasoning model knowingly fabricates unverifiable references | Hallucination | High
Reasoning model mixes languages on non-English/Chinese queries | Other | Low
Self-contradiction within a single conversation | Reasoning | Low
Spatial and geometric reasoning errors | Reasoning | Low
The reversal curse: 'A is B' not generalizing to 'B is A' | Reasoning | Medium
Unfaithful chain-of-thought reasoning | Other | Medium
Vendor cautions its reasoning model's chain-of-thought may be unfaithful | Other | Medium

Methods

Chain-of-thought faithfulness probing
Logic & consistency testing
Metamorphic testing
Perturbation testing
Self-consistency probing
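
As an illustration of self-consistency probing, one of the methods listed above, the following sketch samples the same prompt several times and reports the most common answer together with the share of samples that agree with it. It is a simplified reading of the technique under stated assumptions, not the catalog's own procedure. The names self_consistency and make_toy_sampler are hypothetical, and the toy sampler replays a fixed sequence of answers in place of a real model sampled at nonzero temperature.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Tuple


def self_consistency(sample: Callable[[str], str], prompt: str,
                     n: int = 10) -> Tuple[str, float]:
    """Sample `n` answers; return (majority answer, agreement rate)."""
    answers = [sample(prompt).strip() for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n


# Hypothetical stand-in for a stochastic model: it cycles through a
# fixed list of answers so the example is reproducible.
def make_toy_sampler() -> Callable[[str], str]:
    replies = cycle(["42", "42", "41", "42"])
    return lambda prompt: next(replies)


answer, agreement = self_consistency(make_toy_sampler(), "What is 6 * 7?", n=8)
print(answer, agreement)  # 42 0.75
```

A low agreement rate flags prompts on which the model's answer is unstable. High agreement does not by itself show the answer is correct.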

References

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark — AAAI 2024
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs — arXiv
Counting Ability of Large Language Models and Impact of Tokenization — arXiv
Faith and Fate: Limits of Transformers on Compositionality — arXiv
Measuring Faithfulness in Chain-of-Thought Reasoning — arXiv
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation — arXiv
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning — arXiv
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv

Cite this

Qlarify Labs. (2026). Reasoning & chain-of-thought. Retrieved from https://labs.qlarify.fi/topics/reasoning-and-chain-of-thought