Medium · Reasoning · Reviewer-confirmed · Published

Errors in multi-digit arithmetic

Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.

Published June 26, 2026

Reproducibility: Often
Severity: Medium
Confidence: Reviewer-confirmed

Details

Pure-LM arithmetic degrades quickly as operand length grows; multi-digit multiplication is unreliable. The failure is predictable and worsens monotonically with digit count, making it a clean boundary-testing target.
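
One way to probe this boundary is to sweep operand length and score each reply against exact integer arithmetic. The sketch below is a minimal illustration, not the harness used for this finding. `ask_model` is a hypothetical callable (prompt in, reply text out) that you would wire to whichever model is under test, and the function names, prompt format and trial counts are assumptions chosen for the example.

```python
import random
import re
from typing import Callable, Dict, Optional


def random_operand(digits: int, rng: random.Random) -> int:
    """Return a random integer with exactly `digits` decimal digits."""
    low = 10 ** (digits - 1)
    high = 10 ** digits - 1
    return rng.randint(low, high)


def parse_integer(reply: str) -> Optional[int]:
    """Extract the last integer in a reply, ignoring thousands separators."""
    matches = re.findall(r"-?\d[\d,]*", reply)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def accuracy_by_digits(
    ask_model: Callable[[str], str],  # hypothetical: prompt -> reply text
    max_digits: int = 8,
    trials: int = 20,
    seed: int = 0,
) -> Dict[int, float]:
    """Map operand length (in digits) to the fraction of exact products."""
    rng = random.Random(seed)  # fixed seed so a sweep can be repeated
    results: Dict[int, float] = {}
    for digits in range(1, max_digits + 1):
        correct = 0
        for _ in range(trials):
            a = random_operand(digits, rng)
            b = random_operand(digits, rng)
            reply = ask_model(f"{a} × {b} = ?")
            # Python integers are arbitrary precision, so a * b is exact.
            if parse_integer(reply) == a * b:
                correct += 1
        results[digits] = correct / trials
    return results
```

Taking the last integer in the reply is a deliberately crude parsing choice. A reply that restates the operands after the answer would be scored as wrong, so a stricter answer format may be preferable in practice.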

Evidence

Q: 4831 × 7642 = ?
A: (returns a confidently stated but incorrect product)
Illustrative example; see the linked reference for the documented evidence.
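
For reference, the exact product in this example is 36,918,502. The sketch below reproduces it with schoolbook partial products, which makes it easier to see which place values a wrong answer gets wrong. The helper name `partial_products` is illustrative and does not come from the linked reference.

```python
from typing import List, Tuple


def partial_products(a: int, b: int) -> List[Tuple[int, int]]:
    """Return (multiplier, a * multiplier) rows, one per decimal digit of b.

    Rows run from the least significant digit of b to the most significant,
    as in schoolbook long multiplication.
    """
    rows = []
    place = 1
    while b > 0:
        b, digit = divmod(b, 10)
        rows.append((digit * place, a * digit * place))
        place *= 10
    return rows


rows = partial_products(4831, 7642)
for multiplier, product in rows:
    print(f"4831 × {multiplier:>4} = {product:>10,}")

total = sum(product for _, product in rows)
print(f"total = {total:,}")  # total = 36,918,502
```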

Affected versions

Anthropic · claude-opus-4-8
Anthropic · claude-sonnet-4-6
OpenAI · gpt-4o
Google · gemini-2.0-flash
Meta · llama-3.3-70b
Mistral · mistral-large-2

References

Source: https://arxiv.org/abs/2305.18654

Cite this

Qlarify Labs. (2026). Errors in multi-digit arithmetic. Retrieved from https://labs.qlarify.fi/findings/multi-digit-arithmetic