News & Library
Reference library
The best external writing on AI testing, limitations and quality — curated, summarized, and rated. We link out to the source; the value-add is our summary and the findings each piece connects to.
9 references
- Paper · High credibility · arXiv
Why Do Large Language Models (LLMs) Struggle to Count Letters?
Analyses the well-known failure to count letters in a word (e.g. the r's in 'strawberry'), tying it to byte-pair tokenization: characters are grouped into tokens, so the unit being counted is not the unit the model processes. Reported letter-counting accuracy is very low, especially when a letter recurs.
🐛 1 linked finding · Reasoning failure · Evals
- Paper · High credibility · arXiv
Counting Ability of Large Language Models and Impact of Tokenization
Studies how LLM counting degrades as the quantity grows and how tokenization shapes the error, arguing that counting needs reasoning depth scaling with the count — which transformers don't natively provide.
🐛 1 linked finding · Reasoning failure · Evals
- Paper · High credibility · arXiv
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
- Paper · High credibility · AAAI 2024
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark
Evaluates LLMs on the StepGame spatial-reasoning benchmark, finding that they map language to spatial relations reasonably well but degrade on multi-hop spatial inference; proposes prompting and neuro-symbolic enhancements.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
- Paper · High credibility · arXiv
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
Demonstrates that models which learn a fact in one direction often fail to generalize it to the reverse, indicating that learned associations are not stored as symmetric relations.
🐛 1 linked finding · Reasoning failure · Evals
- Paper · High credibility · arXiv
Measuring Faithfulness in Chain-of-Thought Reasoning
Tests whether a model's stated chain-of-thought actually drives its answer, finding that reasoning is often unfaithful: answers can be unchanged when the reasoning is perturbed, or swayed by biasing cues the model never mentions.
🐛 1 linked finding · Reasoning failure · Safety · Evals
- Paper · High credibility · arXiv
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
An empirical evaluation of confidence elicitation across LLMs, finding that models verbalize high confidence even when wrong and are generally overconfident — plausibly imitating human patterns of asserting certainty.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
- Paper · High credibility · arXiv
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Defines, detects, and measures self-contradiction — two mutually inconsistent statements produced within the same context — across LLMs, and offers detection and mitigation without external knowledge.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
- Paper · High credibility · arXiv
Faith and Fate: Limits of Transformers on Compositionality
Probes the limits of transformers on compositional tasks including multi-digit multiplication, showing accuracy collapses as a problem requires more sequential sub-steps — the model pattern-matches rather than executing a reliable algorithm.
🐛 1 linked finding · Reasoning failure · Evals