Reasoning & chain-of-thought
Chain-of-thought prompting and dedicated reasoning models trade tokens for accuracy: the model writes out intermediate steps before answering. The gains are real, but so are the new failure modes: the stated reasoning is not always the reasoning that produced the answer (unfaithful chains), performance can collapse on problems slightly outside the training distribution, and small irrelevant changes to a question can flip the answer. The linked findings form the largest cluster in the catalog: reasoning is where confident output and actual capability diverge most visibly.
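One way to make the "small irrelevant changes can flip the answer" failure mode concrete is to ask the same question several ways and measure how often the answers agree. The sketch below is illustrative only: `consistency`, `normalize`, and `fake_model` are hypothetical names, and `fake_model` is a stand-in that deliberately flips its answer when an irrelevant clause appears. A real check would swap in an actual model call.

```python
from collections import Counter
from typing import Callable, Iterable


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as disagreement."""
    return " ".join(answer.lower().split())


def consistency(ask: Callable[[str], str], prompts: Iterable[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the fraction of prompts that produced it."""
    answers = [normalize(ask(p)) for p in prompts]
    if not answers:
        raise ValueError("need at least one prompt")
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


# Hypothetical stand-in for a model call (assumption, not a real API):
# it changes its answer whenever an irrelevant aside is present.
def fake_model(prompt: str) -> str:
    return "7" if "by the way" in prompt.lower() else "5"


paraphrases = [
    "Ann has 3 apples and buys 2 more. How many apples does she have?",
    "Ann buys 2 apples on top of the 3 she already has. How many now?",
    "Ann has 3 apples and buys 2 more. By the way, her cat is grey. How many apples does she have?",
]
answer, agreement = consistency(fake_model, paraphrases)
print(answer, round(agreement, 2))  # 5 0.67
```

An agreement score below 1.0 on semantically equivalent prompts flags the question for closer inspection. It does not say which answer is correct, only that the model is not stable on it.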
Findings (15)
- Character-counting errors in tokenized words (Reasoning, Low)
- Date and duration arithmetic errors (Reasoning, Low)
- Errors in multi-digit arithmetic (Reasoning, Medium)
- Failure to honor negation in instructions (Reasoning, Medium)
- Inconsistent answers to semantically equivalent prompts (Reasoning, Medium)
- Miscounting items in long lists (Reasoning, Low)
- Model behavior drifts between versions: a fixed task can regress (Reasoning, Medium)
- Reasoning model degrades under few-shot prompting (Reasoning, Medium)
- Reasoning model knowingly fabricates unverifiable references (Hallucination, High)
- Reasoning model mixes languages on non-English/Chinese queries (Other, Low)
- Self-contradiction within a single conversation (Reasoning, Low)
- Spatial and geometric reasoning errors (Reasoning, Low)
- The reversal curse: 'A is B' not generalizing to 'B is A' (Reasoning, Medium)
- Unfaithful chain-of-thought reasoning (Other, Medium)
- Vendor cautions its reasoning model's chain-of-thought may be unfaithful (Other, Medium)
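Several of the findings above (character counting, date and duration arithmetic, multi-digit arithmetic) have a deterministic ground truth, so a model's answer can be checked against ordinary code. The following is a minimal sketch under that assumption; the helper names `count_char`, `days_between`, and `check` are hypothetical and not part of the catalog.

```python
from datetime import date


def count_char(word: str, ch: str) -> int:
    """Count case-insensitive occurrences of a single character in a word."""
    return word.lower().count(ch.lower())


def days_between(start: date, end: date) -> int:
    """Number of days from start to end (negative if end precedes start)."""
    return (end - start).days


def check(model_answer: str, expected: int) -> bool:
    """True if the model's answer parses to the expected integer."""
    try:
        return int(model_answer.strip()) == expected
    except ValueError:
        return False


print(count_char("strawberry", "r"))                     # 3
print(days_between(date(2024, 2, 1), date(2024, 3, 1)))  # 29 (2024 is a leap year)
print(123456 * 789)                                      # 97406784
print(check(" 3 ", count_char("strawberry", "r")))       # True
```

Checks like these only cover tasks with an exact answer. Findings such as unfaithful chain-of-thought or self-contradiction need other evaluation methods.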
References
- Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark — AAAI 2024
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs — arXiv
- Counting Ability of Large Language Models and Impact of Tokenization — arXiv
- Faith and Fate: Limits of Transformers on Compositionality — arXiv
- Measuring Faithfulness in Chain-of-Thought Reasoning — arXiv
- Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation — arXiv
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning — arXiv
- The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
- Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv
Cite this
Qlarify Labs. (2026). Reasoning & chain-of-thought. Retrieved from https://labs.qlarify.fi/topics/reasoning-and-chain-of-thought