Reference library

The best external writing on AI testing, limitations, and quality: curated, summarized, and rated. We link out to each source; what we add is a summary and the findings each piece connects to.

9 references

Paper · High credibility · arXiv
Why Do Large Language Models (LLMs) Struggle to Count Letters?
Analyses the well-known failure to count letters in a word (e.g. the r's in 'strawberry'), tying it to byte-pair tokenization: characters are grouped into tokens, so the unit being counted is not the unit the model processes. Reported letter-counting accuracy is very low, especially when a letter recurs.
🐛 1 linked finding · Reasoning failure · Evals
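
As a rough illustration of the tokenization point in the summary above, the sketch below contrasts a character-level count with what is visible at the token level. The three-way split of 'strawberry' is a hypothetical segmentation chosen for illustration, not the output of any real tokenizer, and the integer IDs are made up.

```python
# Hypothetical subword split; a real BPE tokenizer may segment differently.
HYPOTHETICAL_TOKENS = ["str", "aw", "berry"]
# Made-up integer IDs standing in for what a model actually receives.
HYPOTHETICAL_IDS = {"str": 101, "aw": 202, "berry": 303}


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter at the character level."""
    return word.count(letter)


def per_token_counts(tokens: list[str], letter: str) -> list[int]:
    """Count the letter inside each token separately."""
    return [token.count(letter) for token in tokens]


word = "".join(HYPOTHETICAL_TOKENS)
model_input = [HYPOTHETICAL_IDS[token] for token in HYPOTHETICAL_TOKENS]

print(word, count_letter(word, "r"))               # strawberry 3
print(per_token_counts(HYPOTHETICAL_TOKENS, "r"))  # [1, 0, 2]
print(model_input)                                 # [101, 202, 303]
```

The character-level count is trivial for ordinary code, but the model's input is the ID sequence on the last line, where the letters are no longer explicit. To answer correctly, a model would have to recover the per-token counts and sum them.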
Paper · High credibility · arXiv
Counting Ability of Large Language Models and Impact of Tokenization
Studies how LLM counting degrades as the quantity grows and how tokenization shapes the error, arguing that counting needs reasoning depth that scales with the count, which transformers do not natively provide.
🐛 1 linked finding · Reasoning failure · Evals
Paper · High credibility · arXiv
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
Paper · High credibility · AAAI 2024
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark
Evaluates LLMs on the StepGame spatial-reasoning benchmark, finding that they map language to spatial relations reasonably well but degrade on multi-hop spatial inference; proposes prompting and neuro-symbolic enhancements.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
Paper · High credibility · arXiv
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
Demonstrates that models which learn a fact in one direction often fail to generalize it to the reverse, indicating that learned associations are not stored as symmetric relations.
🐛 1 linked finding · Reasoning failure · Evals
Paper · High credibility · arXiv
Measuring Faithfulness in Chain-of-Thought Reasoning
Tests whether a model's stated chain-of-thought actually drives its answer, finding that reasoning is often unfaithful: answers can be unchanged when the reasoning is perturbed, or swayed by biasing cues the model never mentions.
🐛 1 linked finding · Reasoning failure · Safety · Evals
Paper · High credibility · arXiv
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
An empirical evaluation of confidence elicitation across LLMs, finding that models verbalize high confidence even when wrong and are generally overconfident, plausibly imitating human patterns of asserting certainty.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
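
One simple way to put a number on the overconfidence described above is the gap between mean stated confidence and actual accuracy. The sketch below assumes that framing; the records are invented for illustration and are not results from the paper, and `overconfidence_gap` is a hypothetical helper name.

```python
from statistics import mean

# Illustrative records only: (stated confidence in [0, 1], answer was correct).
records = [
    (0.95, True),
    (0.90, False),
    (0.90, True),
    (0.85, False),
]


def overconfidence_gap(records: list[tuple[float, bool]]) -> float:
    """Mean stated confidence minus accuracy; positive means overconfident."""
    avg_confidence = mean(conf for conf, _ in records)
    accuracy = mean(1.0 if correct else 0.0 for _, correct in records)
    return avg_confidence - accuracy


gap = overconfidence_gap(records)
print(f"{gap:.2f}")  # 0.40
```

A well-calibrated set of answers would give a gap near zero. In this made-up sample, the model states 90% confidence on average while being right half the time.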
Paper · High credibility · arXiv
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Defines, detects, and measures self-contradiction (two mutually inconsistent statements produced within the same context) across LLMs, and offers detection and mitigation that require no external knowledge.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
Paper · High credibility · arXiv
Faith and Fate: Limits of Transformers on Compositionality
Probes the limits of transformers on compositional tasks including multi-digit multiplication, showing that accuracy collapses as a problem requires more sequential sub-steps: the model pattern-matches rather than executing a reliable algorithm.
🐛 1 linked finding · Reasoning failure · Evals
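
To make "more sequential sub-steps" concrete, the sketch below runs ordinary schoolbook multiplication and counts the single-digit multiplications it performs. It is an illustration of how the step count grows with operand length, not the paper's evaluation setup; `schoolbook_multiply` is a hypothetical helper name.

```python
def schoolbook_multiply(a: int, b: int) -> tuple[int, int]:
    """Multiply two non-negative integers digit by digit.

    Returns (product, number of single-digit multiplications performed).
    """
    digits_a = [int(d) for d in reversed(str(a))]  # least significant first
    digits_b = [int(d) for d in reversed(str(b))]
    result = [0] * (len(digits_a) + len(digits_b))
    steps = 0
    for i, da in enumerate(digits_a):
        carry = 0
        for j, db in enumerate(digits_b):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10
            carry = total // 10
            steps += 1  # one elementary multiply-and-carry step
        result[i + len(digits_b)] += carry
    product = int("".join(map(str, reversed(result))))
    return product, steps


for a, b in [(12, 34), (123, 456), (1234, 5678)]:
    print(a, b, schoolbook_multiply(a, b))
# 12 34 (408, 4)
# 123 456 (56088, 9)
# 1234 5678 (7006652, 16)
```

The step count is the product of the two digit lengths, so each extra digit adds a full row of dependent steps. An error in any one of them propagates into the final answer.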
