Reference library
The best external writing on AI testing, limitations and quality — curated, summarized, and rated. We link out to the source; the value-add is our summary and the findings each piece connects to.
8 references
- Paper · High credibility · arXiv
Towards Understanding Sycophancy in Language Models
Shows that RLHF-trained assistants tend to tell users what they want to hear — revising correct answers when challenged and matching a user's stated beliefs — and links the behavior to preference data that rewards agreement.
🐛 1 linked finding · Safety · Evals
- White paper · High credibility · OWASP
OWASP Top 10 for Large Language Model Applications
A community-built catalog of the most critical security risks for LLM applications — prompt injection, insecure output handling, training-data poisoning, and more — with mitigations for each.
🐛 1 linked finding · Prompt injection · Safety
- Paper · High credibility · arXiv
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
A test suite of clearly-safe prompts designed to surface exaggerated safety behaviour: models refuse benign requests that merely resemble unsafe ones or mention sensitive words.
🐛 1 linked finding · Refusal · Safety · Evals
- Paper · High credibility · arXiv
Universal and Transferable Adversarial Attacks on Aligned Language Models
Introduces an automated method for generating adversarial suffixes that bypass safety training, and shows the resulting attacks transfer across multiple aligned models including closed ones.
🐛 2 linked findings · Jailbreak · Safety
- Paper · High credibility · arXiv
Measuring Faithfulness in Chain-of-Thought Reasoning
Tests whether a model's stated chain-of-thought actually drives its answer, finding that reasoning is often unfaithful: answers can be unchanged when the reasoning is perturbed, or swayed by biasing cues the model never mentions.
🐛 1 linked finding · Reasoning failure · Safety · Evals
- News · High credibility · AI Incident Database
Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court
An AI Incident Database entry on Mata v. Avianca, where a lawyer filed a federal brief citing six judicial decisions that ChatGPT had fabricated — and which the model insisted were real when asked to verify. The court sanctioned the attorneys.
🐛 1 linked finding · Hallucination · Safety
- Blog · High credibility · Simon Willison’s Weblog
Prompt injection: what’s the worst that can happen?
An accessible explanation of why prompt injection is hard to fix: once an LLM agent processes untrusted content, that content can hijack its instructions. Walks through concrete exfiltration and abuse scenarios for tool-using assistants.
🐛 2 linked findings · Prompt injection · Safety · Agents
- Paper · High credibility · arXiv
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Demonstrates indirect prompt injection against real LLM-integrated applications: adversarial instructions hidden in web pages, emails, or other retrieved content hijack the model when it later processes them — no access to the prompt required. Catalogs concrete attacks (data theft, manipulation) on tool- and retrieval-connected systems.
🐛 2 linked findings · Prompt injection · Safety · Agents