News & Library

Reference library

The best external writing on AI testing, limitations and quality — curated, summarized, and rated. We link out to the source; the value-add is our summary and the findings each piece connects to.

17 references

Paper · High credibility · arXiv
Why Do Large Language Models (LLMs) Struggle to Count Letters?
Analyses the well-known failure to count letters in a word (e.g. the r's in 'strawberry'), tying it to byte-pair tokenization: characters are grouped into tokens, so the unit being counted is not the unit the model processes. Reported letter-counting accuracy is very low, especially when a letter recurs.
🐛 1 linked finding · Reasoning failure · Evals
Paper · High credibility · arXiv
Counting Ability of Large Language Models and Impact of Tokenization
Studies how LLM counting degrades as the quantity grows and how tokenization shapes the error, arguing that counting needs reasoning depth scaling with the count — which transformers don't natively provide.
🐛 1 linked finding · Reasoning failure · Evals
Paper · High credibility · arXiv
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Measures how requiring strict output formats (JSON/XML/schema) affects answer quality, finding that tight format constraints can degrade reasoning performance versus free-form responses.
🐛 1 linked finding · Tool use · Evals
Paper · High credibility · arXiv
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
Paper · High credibility · arXiv
Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Probes what models actually know to estimate their *effective* knowledge cutoff, and finds it frequently diverges from the cutoff the developer reports — a consequence of deduplication and temporally mixed web-crawl data.
🐛 1 linked finding · Hallucination · Evals · Benchmarks
Paper · High credibility · AAAI 2024
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark
Evaluates LLMs on the StepGame spatial-reasoning benchmark, finding they map language to spatial relations reasonably but degrade on multi-hop spatial inference; proposes prompting and neuro-symbolic enhancements.
🐛 1 linked finding · Reasoning failure · Evals · Benchmarks
Paper · High credibility · arXiv
Towards Understanding Sycophancy in Language Models
Shows that RLHF-trained assistants tend to tell users what they want to hear — revising correct answers when challenged and matching a user's stated beliefs — and links the behavior to preference data that rewards agreement.
🐛 1 linked finding · Safety · Evals
Paper · High credibility · arXiv
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
Demonstrates that models which learn a fact in one direction often fail to generalize it to the reverse, indicating that learned associations are not stored as symmetric relations.
🐛 1 linked finding · Reasoning failure · Evals
Paper · High credibility · arXiv
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
A test suite of clearly-safe prompts designed to surface exaggerated safety behaviour: models refuse benign requests that merely resemble unsafe ones or mention sensitive words.
🐛 1 linked finding · Refusal · Safety · Evals
Paper · High credibility · arXiv
Measuring Faithfulness in Chain-of-Thought Reasoning
Tests whether a model's stated chain-of-thought actually drives its answer, finding that reasoning is often unfaithful: answers can be unchanged when the reasoning is perturbed, or swayed by biasing cues the model never mentions.
🐛 1 linked finding · Reasoning failure · Safety · Evals
Paper · High credibility · arXiv
Lost in the Middle: How Language Models Use Long Contexts
An empirical study showing that LLM accuracy depends strongly on where relevant information sits in the input: performance is highest when key facts are at the very start or end of the context and degrades markedly when they fall in the middle, producing a characteristic U-shaped curve.
🐛 1 linked finding · Evals · RAG · Context window
Paper · High credibility · arXiv
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
An empirical evaluation of confidence elicitation across LLMs, finding that models verbalize high confidence even when wrong and are generally overconfident — plausibly imitating human patterns of asserting certainty.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
Paper · High credibility · arXiv
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Defines, detects, and measures self-contradiction — two mutually inconsistent statements produced within the same context — across LLMs, and offers detection and mitigation without external knowledge.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
Paper · High credibility · arXiv
Faith and Fate: Limits of Transformers on Compositionality
Probes the limits of transformers on compositional tasks including multi-digit multiplication, showing accuracy collapses as a problem requires more sequential sub-steps — the model pattern-matches rather than executing a reliable algorithm.
🐛 1 linked finding · Reasoning failure · Evals
Paper · High credibility · ACM Computing Surveys
Survey of Hallucination in Natural Language Generation
A broad survey of hallucination across natural-language generation tasks: definitions, taxonomies (intrinsic vs extrinsic), root causes, and evaluation/mitigation methods. A reference map of why generators produce unsupported text.
🐛 2 linked findings · Hallucination · Evals
Paper · High credibility · arXiv
TruthfulQA: Measuring How Models Mimic Human Falsehoods
A benchmark of questions where humans commonly hold misconceptions; models often answer with the same imitative falsehoods learned from training text, and — strikingly — larger models can be less truthful. Separates being informative from being truthful.
🐛 2 linked findings · Hallucination · Evals · Benchmarks
Paper · High credibility · arXiv
The Curious Case of Neural Text Degeneration
The paper that diagnosed degenerate repetition in neural text generation and introduced nucleus (top-p) sampling, showing that likelihood-maximizing decoding such as greedy and beam search produces bland, looping text.
🐛 1 linked finding · Evals
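
For readers unfamiliar with the nucleus (top-p) sampling mentioned in the last entry, here is a minimal illustrative sketch in Python. It is not the paper's code: the function names `top_p_filter` and `sample_top_p` and the toy next-token distribution in the usage note are hypothetical, chosen only to show the idea of keeping the smallest high-probability set of tokens whose cumulative mass reaches `p` and sampling from that set.

```python
import random


def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, then renormalize so the kept mass sums to 1.

    `probs` is a hypothetical next-token distribution: {token: probability}.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus now covers at least p of the mass
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}


def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With a toy distribution such as `{"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}` and `p=0.9`, the nucleus keeps the three most probable tokens and the low-probability tail is never sampled. Real decoders apply the same cut to the model's full vocabulary distribution at every step.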
