News & Library

Reference library

The best external writing on AI testing, limitations and quality — curated, summarized, and rated. We link out to the source; the value-add is our summary and the findings each piece connects to.

7 references

Paper · High credibility · arXiv
Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Probes what models actually know to estimate their *effective* knowledge cutoff, and finds it frequently diverges from the cutoff the developer reports — a consequence of deduplication and temporally mixed web-crawl data.
🐛 1 linked finding · Hallucination · Evals · Benchmarks

Paper · High credibility · arXiv
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
An empirical evaluation of confidence elicitation across LLMs, finding that models verbalize high confidence even when wrong and are generally overconfident — plausibly imitating human patterns of asserting certainty.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals

News · High credibility · AI Incident Database
Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court
An AI Incident Database entry on Mata v. Avianca, where a lawyer filed a federal brief citing six judicial decisions that ChatGPT had fabricated and, when asked to verify them, insisted were real. The court sanctioned the attorneys.
🐛 1 linked finding · Hallucination · Safety

Paper · High credibility · arXiv
Gorilla: Large Language Model Connected with Massive APIs
Connects an LLM to large API collections and documents the tendency to hallucinate API calls and arguments when prompted directly; retrieval-aware training reduces but does not eliminate the fabrication.
🐛 1 linked finding · Hallucination · Tool use · Agents

Paper · High credibility · arXiv
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Defines, detects, and measures self-contradiction — two mutually inconsistent statements produced within the same context — across LLMs, and offers detection and mitigation without external knowledge.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals

Paper · High credibility · ACM Computing Surveys
Survey of Hallucination in Natural Language Generation
A broad survey of hallucination across natural-language generation tasks: definitions, taxonomies (intrinsic vs extrinsic), root causes, and evaluation/mitigation methods. A reference map of why generators produce unsupported text.
🐛 2 linked findings · Hallucination · Evals

Paper · High credibility · arXiv
TruthfulQA: Measuring How Models Mimic Human Falsehoods
A benchmark of questions on which humans commonly hold misconceptions. Models often answer with the same imitative falsehoods learned from training text and, strikingly, larger models can be less truthful. The benchmark scores being truthful separately from being informative.
🐛 2 linked findings · Hallucination · Evals · Benchmarks
