Paper · High credibility · arXiv · Fatemi et al. · June 1, 2024

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Our summary

A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.

Why it matters

Shows that date and duration arithmetic is a distinct, testable weakness, even when the model is not relying on stale world knowledge.

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning. Retrieved from https://labs.qlarify.fi/references/test-of-time-temporal-2024