News & Library

Reference library

The best external writing on AI testing, limitations and quality — curated, summarized, and rated. We link out to the source; the value-add is our summary and the findings each piece connects to.

4 references

Paper · High credibility · arXiv
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
Paper · High credibility · arXiv
Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Probes what models actually know to estimate their *effective* knowledge cutoff, and finds it frequently diverges from the cutoff the developer reports — a consequence of deduplication and temporally mixed web-crawl data.
🐛 1 linked finding · Hallucination · Evals · Benchmarks
Paper · High credibility · AAAI 2024
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark
Evaluates LLMs on the StepGame spatial-reasoning benchmark and finds that they map language to spatial relations reasonably well but degrade on multi-hop spatial inference; proposes prompting and neuro-symbolic enhancements.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
Paper · High credibility · arXiv
TruthfulQA: Measuring How Models Mimic Human Falsehoods
A benchmark of questions where humans commonly hold misconceptions; models often answer with the same imitative falsehoods learned from training text, and — strikingly — larger models can be less truthful. Separates being informative from being truthful.
🐛 2 linked findings · Hallucination · Evals · Benchmarks
