Paper · High credibility · arXiv · Fatemi et al. · June 1, 2024
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Our summary
A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.
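To make "arithmetic over time points and durations" concrete, here is a minimal sketch of the kind of calculation such a question requires, with the ground truth computed by Python's standard `datetime` module. The helper names `add_days` and `days_between` and the sample dates are hypothetical illustrations, not items or code from the benchmark.

```python
from datetime import date, timedelta


def add_days(start: date, days: int) -> date:
    """Time point + duration: return the date `days` days after `start`."""
    return start + timedelta(days=days)


def days_between(earlier: date, later: date) -> int:
    """Time point - time point: return the number of days from `earlier` to `later`."""
    return (later - earlier).days


# Hypothetical question: "A task starts on 2024-02-20 and lasts 15 days.
# On what date does it end?" The leap day in 2024 has to be accounted for.
end = add_days(date(2024, 2, 20), 15)
print(end.isoformat())  # 2024-03-06

# Hypothetical question: "How many days are there from 2024-01-01 to 2024-06-01?"
span = days_between(date(2024, 1, 1), date(2024, 6, 1))
print(span)  # 152
```

Both answers depend only on calendar rules, not on knowledge of real-world events, which is the sort of isolation the summary describes.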
Why it matters
Shows date and duration arithmetic is a distinct, testable weakness even when the model isn't leaning on stale world knowledge.
Published June 26, 2026
Cite this
Qlarify Labs. (2026). Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning. Retrieved from https://labs.qlarify.fi/references/test-of-time-temporal-2024