Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
Published June 26, 2026
Reproducibility
Often
Severity
Medium
Confidence
Reviewer-confirmed
Details
Arithmetic done by a language model alone, with no calculator or code tool, degrades quickly as operand length grows, and multi-digit multiplication is unreliable. The failure is predictable and worsens monotonically with digit count, which makes it a clean target for boundary testing.
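One way to probe this boundary is to measure exact-match accuracy per operand length. The sketch below is a minimal harness, not a reference implementation. The `ask_model` callable, the prompt wording, and the helper names (`random_operand`, `accuracy_by_digits`, `exact_oracle`) are all assumptions made for illustration; `exact_oracle` is a placeholder that computes the product directly so the harness can be exercised without a model.

```python
import random
import re
from typing import Callable, Dict


def random_operand(digits: int, rng: random.Random) -> int:
    """Return a random positive integer with exactly `digits` decimal digits."""
    low = 1 if digits == 1 else 10 ** (digits - 1)
    return rng.randint(low, 10 ** digits - 1)


def accuracy_by_digits(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, text reply out
    max_digits: int = 8,
    trials: int = 20,
    seed: int = 0,
) -> Dict[int, float]:
    """Map each operand length to the fraction of exactly correct products."""
    rng = random.Random(seed)  # fixed seed keeps the problem set reproducible
    results: Dict[int, float] = {}
    for digits in range(1, max_digits + 1):
        correct = 0
        for _ in range(trials):
            a = random_operand(digits, rng)
            b = random_operand(digits, rng)
            # Assumed prompt format; adjust to whatever the model under test expects.
            reply = ask_model(f"What is {a} * {b}? Reply with the number only.")
            # Python integers are arbitrary precision, so a * b is the exact reference.
            if reply.strip().replace(",", "") == str(a * b):
                correct += 1
        results[digits] = correct / trials
    return results


def exact_oracle(prompt: str) -> str:
    """Placeholder for a model: parses the prompt and multiplies exactly."""
    match = re.search(r"(\d+) \* (\d+)", prompt)
    if match is None:
        return ""
    return str(int(match.group(1)) * int(match.group(2)))


if __name__ == "__main__":
    # Replace exact_oracle with a real model call to see where accuracy drops.
    for digits, acc in accuracy_by_digits(exact_oracle, max_digits=6).items():
        print(f"{digits}-digit operands: {acc:.0%} exact match")
```

Swapping `exact_oracle` for a call to the model under test should show the digit count at which exact-match accuracy begins to fall. Exact string comparison is a deliberately strict scoring choice here; a looser parser would be needed if the model wraps its answer in extra text.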