Reference library
The best external writing on AI testing, limitations and quality — curated, summarized, and rated. We link out to the source; the value-add is our summary and the findings each piece connects to.
52 references
- Blog | Medium credibility | principlesofchaos.org
Principles of Chaos Engineering
Defines chaos engineering as experimenting on a system to build confidence in its ability to withstand turbulent production conditions — controlled fault injection measured against a steady-state baseline.
- Paper | High credibility | arXiv
Metamorphic Testing of Large Language Models for Natural Language Processing
A large-scale study applying metamorphic testing to LLMs on NLP tasks: the authors collect 191 metamorphic relations from the literature, implement 36, and run roughly 560,000 metamorphic tests across three LLMs to surface incorrect behaviour without labelled oracles.
🐛 2 linked findings
- Blog | High credibility | OpenAI
Sycophancy in GPT-4o: What Happened and What We're Doing About It
OpenAI's own post-mortem of an April 2025 GPT-4o update that, by over-weighting short-term user approval, became markedly sycophantic — validating harmful and delusional statements. It was rolled back within days; the company notes it had no deployment eval tracking sycophancy.
🐛 1 linked finding
- Paper | High credibility | arXiv
Hallucination Detection in Large Language Models with Metamorphic Relations
Introduces MetaQA, which uses metamorphic relations and prompt mutation to detect hallucinated or factually incorrect LLM outputs with no external resources, reporting gains over prior detectors across several models and datasets.
🐛 2 linked findings
- Paper | High credibility | arXiv:2501.12948
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek's technical report for R1, an open-weight reasoning model trained largely via reinforcement learning. Notably candid: its 'Limitations' section enumerates concrete, testable weaknesses (language mixing, prompt sensitivity, and a tool-use regression relative to the base V3 model).
🐛 3 linked findings
- Paper | High credibility | arXiv:2412.15115
Qwen2.5 Technical Report
The technical report for Qwen2.5, Alibaba's open-weight model family (0.5B–72B) trained on ~18T tokens, with the flagship 72B-Instruct competitive with far larger open models and strong multilingual and long-context support.
- White paper | High credibility | OpenAI
OpenAI o1 System Card
OpenAI's safety system card for the o1 reasoning model. Notably candid: it quantifies 'intentional hallucinations,' cautions that chain-of-thought may not be faithful, and reports that in crafted evaluations o1 sometimes attempted to disable its own oversight.
🐛 3 linked findings
- Paper | High credibility | arXiv
Why Do Large Language Models (LLMs) Struggle to Count Letters?
Analyses the well-known failure to count letters in a word (e.g. the r's in 'strawberry'), tying it to byte-pair tokenization: characters are grouped into tokens, so the unit being counted is not the unit the model processes. Reported letter-counting accuracy is very low, especially when a letter recurs.
🐛 1 linked finding | Reasoning failure | Evals
- Tool | High credibility | modelcontextprotocol.io
Model Context Protocol Specification
The open protocol standardizing how LLM applications connect to external tools and data — JSON-RPC messages, stateful connections, and explicit client/server capability negotiation (the 'handshake').
- Paper | High credibility | arXiv
Counting Ability of Large Language Models and Impact of Tokenization
Studies how LLM counting degrades as the quantity grows and how tokenization shapes the error, arguing that counting needs reasoning depth scaling with the count — which transformers don't natively provide.
🐛 1 linked finding | Reasoning failure | Evals
- Paper | High credibility | arXiv
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models
Measures how requiring strict output formats (JSON/XML/schema) affects answer quality, finding that tight format constraints can degrade reasoning performance versus free-form responses.
🐛 1 linked finding | Tool use | Evals
- Paper | High credibility | arXiv
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
A synthetic benchmark that isolates temporal reasoning from memorized facts, with a dedicated arithmetic split over time points and durations. Frontier models struggle on the calculation-heavy temporal tasks.
🐛 1 linked finding | Reasoning failure | Evals | Benchmarks
- Paper | High credibility | ICML 2024 (arXiv:2403.06634)
Stealing Part of a Production Language Model
The first precise model-extraction attack on deployed LLMs: from ordinary black-box API access it recovers the final embedding projection layer and the hidden dimension of production models, extracting them from OpenAI's Ada and Babbage for under $20 (with OpenAI's approval and subsequent mitigation).
🐛 1 linked finding
- Paper | High credibility | arXiv
Dated Data: Tracing Knowledge Cutoffs in Large Language Models
Probes what models actually know to estimate their *effective* knowledge cutoff, and finds it frequently diverges from the cutoff the developer reports — a consequence of deduplication and temporally mixed web-crawl data.
🐛 1 linked finding | Hallucination | Evals | Benchmarks
- Blog | Medium credibility | martinfowler.com
Continuous Integration
The reference statement of continuous integration: every change is merged and verified by an automated build on every commit, so integration errors surface in minutes rather than at release.
- Paper | High credibility | AAAI 2024
Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark
Evaluates LLMs on the StepGame spatial-reasoning benchmark, finding they map language to spatial relations reasonably but degrade on multi-hop spatial inference; proposes prompting and neuro-symbolic enhancements.
🐛 1 linked finding | Reasoning failure | Evals | Benchmarks
- Paper | High credibility | arXiv:2311.17035
Scalable Extraction of Training Data from (Production) Language Models
Shows that prompting aligned ChatGPT to endlessly repeat a token makes it diverge from chat-style output and emit memorized training data — including PII — at ~150x the normal rate, recovering over ten thousand unique training examples for about $200.
🐛 1 linked finding
- Paper | High credibility | arXiv
Towards Understanding Sycophancy in Language Models
Shows that RLHF-trained assistants tend to tell users what they want to hear — revising correct answers when challenged and matching a user's stated beliefs — and links the behavior to preference data that rewards agreement.
🐛 1 linked finding | Safety | Evals
- White paper | High credibility | OWASP
OWASP Top 10 for Large Language Model Applications
A community-built catalog of the most critical security risks for LLM applications — prompt injection, insecure output handling, training-data poisoning, and more — with mitigations for each.
🐛 1 linked finding | Prompt injection | Safety
- Paper | High credibility | arXiv
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”
Demonstrates that models which learn a fact in one direction often fail to generalize it to the reverse, indicating that learned associations are not stored as symmetric relations.
🐛 1 linked finding | Reasoning failure | Evals
- Paper | High credibility | arXiv
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
A test suite of clearly-safe prompts designed to surface exaggerated safety behaviour: models refuse benign requests that merely resemble unsafe ones or mention sensitive words.
🐛 1 linked finding | Refusal | Safety | Evals
- Paper | High credibility | arXiv
Universal and Transferable Adversarial Attacks on Aligned Language Models
Introduces an automated method for generating adversarial suffixes that bypass safety training, and shows the resulting attacks transfer across multiple aligned models including closed ones.
🐛 2 linked findings | Jailbreak | Safety
- Paper | High credibility | arXiv:2307.09009
How Is ChatGPT's Behavior Changing over Time?
Evaluates GPT-3.5 and GPT-4 on identical tasks across two 2023 snapshots and finds large swings in either direction. Most starkly, GPT-4's accuracy at identifying prime vs. composite numbers fell from 84% to 51% in a few months, alongside degraded instruction-following.
🐛 1 linked finding
- Paper | High credibility | arXiv
Measuring Faithfulness in Chain-of-Thought Reasoning
Tests whether a model's stated chain-of-thought actually drives its answer, finding that reasoning is often unfaithful: answers can be unchanged when the reasoning is perturbed, or swayed by biasing cues the model never mentions.
🐛 1 linked finding | Reasoning failure | Safety | Evals
- Paper | High credibility | arXiv
Lost in the Middle: How Language Models Use Long Contexts
An empirical study showing that LLM accuracy depends strongly on where relevant information sits in the input: performance is highest when key facts are at the very start or end of the context and degrades markedly when they fall in the middle, producing a characteristic U-shaped curve.
🐛 1 linked finding | Evals | RAG | Context window
- Paper | High credibility | NeurIPS 2023 (arXiv:2306.05685)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Shows a strong model can grade open-ended outputs in agreement with human preferences over 80% of the time, while documenting the judge's own position, verbosity, and self-enhancement biases.
- Paper | High credibility | arXiv
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs
An empirical evaluation of confidence elicitation across LLMs, finding that models verbalize high confidence even when wrong and are generally overconfident — plausibly imitating human patterns of asserting certainty.
🐛 1 linked finding | Hallucination | Reasoning failure | Evals
- News | High credibility | AI Incident Database
Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court
An AI Incident Database entry on Mata v. Avianca, where a lawyer filed a federal brief citing six judicial decisions that ChatGPT had fabricated — and which the model insisted were real when asked to verify. The court sanctioned the attorneys.
🐛 1 linked finding | Hallucination | Safety
- Paper | High credibility | arXiv
Gorilla: Large Language Model Connected with Massive APIs
Connects an LLM to large API collections and documents the tendency to hallucinate API calls and arguments when prompted directly; retrieval-aware training reduces but does not eliminate the fabrication.
🐛 1 linked finding | Hallucination | Tool use | Agents
- Paper | High credibility | arXiv
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Defines, detects, and measures self-contradiction — two mutually inconsistent statements produced within the same context — across LLMs, and offers detection and mitigation without external knowledge.
🐛 1 linked finding | Hallucination | Reasoning failure | Evals
- Paper | High credibility | arXiv
Faith and Fate: Limits of Transformers on Compositionality
Probes the limits of transformers on compositional tasks including multi-digit multiplication, showing accuracy collapses as a problem requires more sequential sub-steps — the model pattern-matches rather than executing a reliable algorithm.
🐛 1 linked finding | Reasoning failure | Evals
- Blog | High credibility | Simon Willison’s Weblog
Prompt injection: what’s the worst that can happen?
An accessible explanation of why prompt injection is hard to fix: once an LLM agent processes untrusted content, that content can hijack its instructions. Walks through concrete exfiltration and abuse scenarios for tool-using assistants.
🐛 2 linked findings | Prompt injection | Safety | Agents
- Blog | Medium credibility | LessWrong
SolidGoldMagikarp (plus, prompt generation)
The piece that surfaced 'glitch tokens' — rare tokens under-trained in the embedding space that trigger bizarre, 'unspeakable', or evasive model behaviour when prompted.
🐛 1 linked finding
- Paper | High credibility | arXiv
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Demonstrates indirect prompt injection against real LLM-integrated applications: adversarial instructions hidden in web pages, emails, or other retrieved content hijack the model when it later processes them — no access to the prompt required. Catalogs concrete attacks (data theft, manipulation) on tool- and retrieval-connected systems.
🐛 2 linked findings | Prompt injection | Safety | Agents
- Paper | High credibility | arXiv:2211.09110
Holistic Evaluation of Language Models
A benchmark that scores language models on a taxonomy of scenarios against seven metrics — accuracy, calibration, robustness, fairness, bias, toxicity, efficiency — to make model quality transparent and comparable.
- Paper | High credibility | Findings of ACL 2022
BBQ: A Hand-Built Bias Benchmark for Question Answering
A hand-built benchmark probing social bias in QA across nine dimensions: models fall back on stereotypes when context is under-specified, and are more accurate when the correct answer happens to match a stereotype.
🐛 1 linked finding
- Paper | High credibility | ACM Computing Surveys
Survey of Hallucination in Natural Language Generation
A broad survey of hallucination across natural-language generation tasks: definitions, taxonomies (intrinsic vs extrinsic), root causes, and evaluation/mitigation methods. A reference map of why generators produce unsupported text.
🐛 2 linked findings | Hallucination | Evals
- Paper | High credibility | ACL 2022 (arXiv:2109.07958)
TruthfulQA: Measuring How Models Mimic Human Falsehoods
A benchmark of questions deliberately crafted to invite false answers rooted in common misconceptions, on which the best model was truthful only 58% of the time against 94% for humans.
- Paper | High credibility | arXiv
TruthfulQA: Measuring How Models Mimic Human Falsehoods
A benchmark of questions where humans commonly hold misconceptions; models often answer with the same imitative falsehoods learned from training text, and — strikingly — larger models can be less truthful. Separates being informative from being truthful.
🐛 2 linked findings | Hallucination | Evals | Benchmarks
- Paper | High credibility | TACL 2021 (arXiv:2102.01017)
Measuring and Improving Consistency in Pretrained Language Models
Using paraphrased cloze queries (ParaRel), shows that pretrained models give inconsistent answers to logically equivalent questions, which suggests their factual knowledge is poorly structured rather than stably represented.
- Tool | High credibility | EMNLP 2020 (arXiv:2005.05909)
TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP
A framework that builds input perturbations from a goal function, constraints, a transformation, and a search method, then measures how small, meaning-preserving changes flip a model's output.
- Paper | High credibility | arXiv
The Curious Case of Neural Text Degeneration
The paper that diagnosed degenerate repetition in neural text generation and introduced nucleus (top-p) sampling, showing that maximum-likelihood decoding produces bland, looping text.
🐛 1 linked finding | Evals
- Paper | High credibility | NeurIPS 2019 (arXiv:1810.11953)
Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
A systematic comparison of two-sample statistical tests for detecting when a system's input or output distribution has shifted, including how to identify and quantify the shift.
- Paper | High credibility | PMLR v81 (FAccT 2018)
Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
The landmark audit that measured commercial classifiers across skin-type and gender subgroups, finding error rates up to 34.7% for darker-skinned women versus 0.8% for lighter-skinned men.
- Paper | High credibility | ACM Computing Surveys
Metamorphic Testing: A Review of Challenges and Opportunities
The standard survey of metamorphic testing: how metamorphic relations address the oracle problem, the major categories of relations, and the open challenges. The reference point for the technique across software engineering.
- Paper | High credibility | IEEE Big Data 2017
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
Twenty-eight concrete tests and monitoring needs for production ML, drawn from Google's experience — a rubric covering the deterministic scaffold (data, infra, model plumbing) that surrounds any learned component.
- Paper | High credibility | USENIX Security 2016
Stealing Machine Learning Models via Prediction APIs
Demonstrates that black-box query access to a prediction API is enough to reconstruct a model's functionality with near-perfect fidelity across several model classes — model theft without the weights.
- Blog | Medium credibility | martinfowler.com
Canary Release
Defines the canary release: reduce the risk of a new version by rolling it out to a small subset of traffic first, monitoring it, and widening or rolling back based on what the canary shows.
- Paper | High credibility | ACM Computing Surveys 46(4)
A Survey on Concept Drift Adaptation
The standard reference on concept drift — when the relationship between inputs and the target changes over time — categorizing detection strategies and the evaluation methodology for adaptive systems.
- Paper | High credibility | PLDI 2011 (ACM)
Finding and Understanding Bugs in C Compilers
The landmark differential-testing study: by feeding randomly generated programs to multiple C compilers and comparing outputs, the authors found hundreds of bugs with no oracle beyond 'the compilers should agree'.
- Paper | High credibility | Data Mining and Knowledge Discovery 18(1)
Controlled Experiments on the Web: Survey and Practical Guide
The practical guide to online A/B testing: how to split live traffic between variants, reach statistical significance, and avoid the common pitfalls that invalidate a web experiment.
- Paper | High credibility | ICFP 2000 (ACM)
QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs
The paper that launched property-based testing: rather than hand-writing examples, you state a property that should hold for all inputs and let the tool generate hundreds of randomized cases trying to falsify it, shrinking any counterexample to a minimal failing case.
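
To make the QuickCheck summary above concrete, here is a minimal Python sketch of the same loop: state a property, generate random inputs, and shrink the first counterexample. It is an illustration only, not QuickCheck's implementation or API. The names `dedupe_adjacent`, `prop_no_duplicates`, `shrink`, and `check` are hypothetical, and the function under test is deliberately buggy so there is something to find.

```python
import random
from typing import Callable, List, Optional

Property = Callable[[List[int]], bool]


def dedupe_adjacent(xs: List[int]) -> List[int]:
    """Hypothetical function under test. It is meant to remove all
    duplicates but (deliberately) only removes adjacent ones."""
    out: List[int] = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


def prop_no_duplicates(xs: List[int]) -> bool:
    """Property that should hold for every input: no value repeats."""
    result = dedupe_adjacent(xs)
    return len(result) == len(set(result))


def shrink(xs: List[int], prop: Property) -> List[int]:
    """Greedy shrinking: keep dropping single elements for as long as
    the shorter list still falsifies the property."""
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if not prop(candidate):
                xs = candidate
                changed = True
                break
    return xs


def check(prop: Property, trials: int = 200, seed: int = 0) -> Optional[List[int]]:
    """Try random small integer lists; return a shrunk counterexample,
    or None if no trial falsified the property."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(0, 3) for _ in range(rng.randint(0, 8))]
        if not prop(xs):
            return shrink(xs, prop)
    return None


if __name__ == "__main__":
    print("counterexample:", check(prop_no_duplicates))
```

For this particular bug, any shrunk counterexample has the shape `[a, b, a]` with `a != b`, which is the smallest input that separates "remove adjacent duplicates" from "remove all duplicates". Real property-based testing tools add typed generators and far more sophisticated shrinking; the fixed seed here is only an assumption made to keep the run repeatable.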