Anthropic
Claude Sonnet
Anthropic's balanced Claude tier — strong capability at lower latency and cost than Opus. Linked findings reflect documented frontier failure-mode classes; per-version attribution is illustrative.
Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on Claude Sonnet; per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.
Report card
Auto-derived from 7 linked findings (illustrative version attributions — see note above) — worst severity per category.
- Hallucination: High (1×)
- Reasoning: Medium (4×)
- Refusal: Medium (1×)
- Bias: Medium (1×)
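The "worst severity per category" derivation can be sketched in a few lines of Python. This is a minimal illustration, not Qlarify Labs' actual pipeline: the `SEVERITY_RANK` ordering, the `report_card` function name, and the reading of "N×" as the number of findings in a category are all assumptions. The (category, severity) pairs are transcribed from the findings listed on this page.

```python
from collections import defaultdict

# Assumed ordering of severity labels; the page does not define its scale.
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2}

# (category, severity) pairs transcribed from the seven findings on this page.
FINDINGS = [
    ("Reasoning", "Low"),        # character-counting errors
    ("Reasoning", "Medium"),     # multi-digit arithmetic
    ("Hallucination", "High"),   # fabrication instead of admitting uncertainty
    ("Reasoning", "Medium"),     # failure to honor negation
    ("Reasoning", "Medium"),     # inconsistent answers to equivalent prompts
    ("Refusal", "Medium"),       # over-refusal of benign requests
    ("Bias", "Medium"),          # sycophancy
]

def report_card(findings):
    """Return {category: (worst_severity, finding_count)}."""
    grouped = defaultdict(list)
    for category, severity in findings:
        grouped[category].append(severity)
    return {
        category: (max(severities, key=SEVERITY_RANK.__getitem__), len(severities))
        for category, severities in grouped.items()
    }

card = report_card(FINDINGS)
for category, (worst, count) in card.items():
    print(f"{category}: {worst} ({count}x)")
```

Under these assumptions the Reasoning row comes out as Medium with a count of 4, even though one of its four findings is rated Low, which matches the card above.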
Strengths
Fast, capable general reasoning and coding; good instruction-following and comparatively calibrated refusals for its tier.
Known weaknesses
Shares the frontier-wide arithmetic, counting and tokenization limits; susceptible to sycophancy under user pressure and to prompt injection in agentic settings.
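Counting and arithmetic have exact answers, so checks for those two weaknesses can be scored deterministically. The sketch below shows one possible set of ground-truth helpers; the function names (`count_char`, `exact_product`, `is_correct`) and the example inputs are hypothetical and are not part of any Qlarify Labs tooling.

```python
def count_char(word: str, letter: str) -> int:
    """Ground-truth, case-insensitive count of a letter in a word."""
    return word.lower().count(letter.lower())

def exact_product(a: int, b: int) -> int:
    """Exact product; Python integers are arbitrary precision."""
    return a * b

def is_correct(model_answer: str, expected: int) -> bool:
    """Compare a model's raw text answer to the exact integer value.

    Strips whitespace and thousands separators; any answer that does
    not parse as an integer is scored as incorrect.
    """
    try:
        return int(model_answer.strip().replace(",", "")) == expected
    except ValueError:
        return False

# Illustrative inputs only; the model answers here are made up.
print(count_char("strawberry", "r"))
print(is_correct("83,810,205", exact_product(12345, 6789)))
print(is_correct("about 83 million", exact_product(12345, 6789)))
```

Scoring against a computed reference rather than a hand-written answer key avoids putting arithmetic mistakes into the test set itself.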
Findings (7)
- Character-counting errors in tokenized words (Severity: Low; Category: Reasoning)
  Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
- Errors in multi-digit arithmetic (Severity: Medium; Category: Reasoning)
  Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
- Fabrication instead of admitting uncertainty (Severity: High; Category: Hallucination)
  Asked about something unknown or non-existent, models invent an answer rather than saying 'I don't know'.
- Failure to honor negation in instructions (Severity: Medium; Category: Reasoning)
  Models frequently do the opposite of a 'do not' instruction, or ignore the negation entirely.
- Inconsistent answers to semantically equivalent prompts (Severity: Medium; Category: Reasoning)
  Trivial rewordings of the same question yield materially different answers.
- Over-refusal of benign requests (Severity: Medium; Category: Refusal)
  Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
- Sycophancy: agreeing with a user's incorrect assertions (Severity: Medium; Category: Bias)
  Models tend to revise correct answers to match a user who pushes back or states a wrong belief.
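The finding about inconsistent answers to equivalent prompts lends itself to a simple metamorphic check: ask several paraphrases of one question and flag any disagreement. The sketch below is illustrative only. `ask_model` stands for any callable that maps a prompt to a text answer, `fake_model` is a made-up stand-in rather than a real API client, and exact match after normalization is a deliberately crude equivalence test.

```python
from typing import Callable, Iterable

def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return " ".join(answer.lower().split()).rstrip(".")

def consistency_check(ask_model: Callable[[str], str],
                      paraphrases: Iterable[str]) -> dict:
    """Ask every paraphrase and report whether the normalized answers agree."""
    answers = {p: normalize(ask_model(p)) for p in paraphrases}
    distinct = set(answers.values())
    return {"consistent": len(distinct) == 1, "answers": answers}

# Hypothetical stand-in for a model call: it changes its answer
# whenever the prompt avoids the word "capital".
def fake_model(prompt: str) -> str:
    return "Paris." if "capital" in prompt.lower() else "Lyon"

result = consistency_check(fake_model, [
    "What is the capital of France?",
    "Which city is France's capital?",
    "Name the French seat of government.",
])
print(result["consistent"])
print(result["answers"])
```

For free-form answers, exact match will over-report inconsistency; a real harness would likely need a looser comparison, such as extracting a final answer field before comparing.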
Related references
- Faith and Fate: Limits of Transformers on Compositionality — arXiv
- Hallucination Detection in Large Language Models with Metamorphic Relations — arXiv
- Metamorphic Testing of Large Language Models for Natural Language Processing — arXiv
- Survey of Hallucination in Natural Language Generation — ACM Computing Surveys
- Towards Understanding Sycophancy in Language Models — arXiv
- TruthfulQA: Measuring How Models Mimic Human Falsehoods — arXiv
- Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models — arXiv
Cite this
Qlarify Labs. (2026). Anthropic Claude Sonnet — known weaknesses. Retrieved from https://labs.qlarify.fi/models/claude-sonnet