Paper · High credibility · arXiv · Fatemi et al. · June 1, 2024

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Our summary

A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.

Why it matters

Shows that date and duration arithmetic is a distinct, testable weakness, even when the model isn't leaning on stale world knowledge.
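To make the kind of question concrete, here is a minimal sketch of how reference answers for date and duration arithmetic can be computed with Python's standard library. The helper names and the example dates are illustrative assumptions, not items or code from the benchmark itself.

```python
from datetime import date, timedelta

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def days_between(start: date, end: date) -> int:
    """Signed number of days from start to end (hypothetical helper)."""
    return (end - start).days


def weekday_name(d: date) -> str:
    """English weekday name for a date (hypothetical helper)."""
    return WEEKDAYS[d.weekday()]


# The same calendar span differs by one day depending on the leap year.
leap_span = days_between(date(2024, 2, 28), date(2024, 3, 1))
plain_span = days_between(date(2023, 2, 28), date(2023, 3, 1))

# Adding a duration that crosses a month boundary and a leap day.
shifted = date(2024, 1, 31) + timedelta(days=30)

print(leap_span, plain_span)             # 2 1
print(shifted, weekday_name(shifted))    # 2024-03-01 Friday
```

Cases like these are easy to verify programmatically, which is what makes this class of error straightforward to test for.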

Related findings (1)

Date and duration arithmetic errors · Low
Models miscompute differences between dates, weekdays, and durations across boundaries like months and leap years.

Reasoning failure · Evals · Benchmarks

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning. Retrieved from https://labs.qlarify.fi/references/test-of-time-temporal-2024