Reference library
The best external writing on AI testing, limitations and quality — curated, summarized, and rated. We link out to the source; the value-add is our summary and the findings each piece connects to.
7 references
- Paper · High credibility · arXiv
Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Probes what models actually know to estimate their *effective* knowledge cutoff, and finds it frequently diverges from the cutoff the developer reports — a consequence of deduplication and temporally mixed web-crawl data.
🐛 1 linked finding · Hallucination · Evals · Benchmarks
- Paper · High credibility · arXiv
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
An empirical evaluation of confidence elicitation across LLMs, finding that models verbalize high confidence even when wrong and are generally overconfident — plausibly imitating human patterns of asserting certainty.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
- News · High credibility · AI Incident Database
Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court
An AI Incident Database entry on Mata v. Avianca, where a lawyer filed a federal brief citing six judicial decisions that ChatGPT had fabricated — and which the model insisted were real when asked to verify. The court sanctioned the attorneys.
🐛 1 linked finding · Hallucination · Safety
- Paper · High credibility · arXiv
Gorilla: Large Language Model Connected with Massive APIs
Connects an LLM to large API collections and documents the tendency to hallucinate API calls and arguments when prompted directly; retriever-aware training reduces but does not eliminate the fabrication.
🐛 1 linked finding · Hallucination · Tool use · Agents
- Paper · High credibility · arXiv
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Defines, detects, and measures self-contradiction — two mutually inconsistent statements produced within the same context — across LLMs, and offers detection and mitigation without external knowledge.
🐛 1 linked finding · Hallucination · Reasoning failure · Evals
- Paper · High credibility · ACM Computing Surveys
Survey of Hallucination in Natural Language Generation
A broad survey of hallucination across natural-language generation tasks: definitions, taxonomies (intrinsic vs extrinsic), root causes, and evaluation/mitigation methods. A reference map of why generators produce unsupported text.
🐛 2 linked findings · Hallucination · Evals
- Paper · High credibility · arXiv
TruthfulQA: Measuring How Models Mimic Human Falsehoods
A benchmark of questions where humans commonly hold misconceptions; models often answer with the same imitative falsehoods learned from training text, and — strikingly — larger models can be less truthful. Separates being informative from being truthful.
🐛 2 linked findings · Hallucination · Evals · Benchmarks