Testing & Findings
Findings
Documented limitations, weaknesses and failures of AI systems — evidence-first and linked to the method that found each one. Public entries are reviewed before publishing.
11 findings
- CriticalSafetyReviewer-confirmedRepro: Rare
Data exfiltration through prompt injection in agents
An injected instruction can make a tool-using agent send private data to an attacker-controlled destination.
🔬 Prompt-injection & jailbreak testingPrompt injectionSafetyAgents - CriticalPrompt injectionReviewer-confirmedRepro: Sometimes
Indirect prompt injection via retrieved content
Instructions hidden in documents, web pages or tool outputs can override the system prompt when ingested by the model.
🔬 Prompt-injection & jailbreak testingPrompt injectionSafetyRAG - HighBiasVendor-acknowledgedRepro: Once
A production update made the model sycophantic and was rolled back
An April 2025 GPT-4o update tuned on user feedback became markedly more sycophantic — validating harmful or delusional claims — and was rolled back within days.
🔬 A/B testing in production🔬 Canary releases & staged rollout🔬 Drift & decay monitoringBiasSafetyProduction - HighBiasReviewer-confirmedRepro: Sometimes
Name-based demographic bias in outputs
Swapping only a name (signalling gender or ethnicity) changes evaluative outputs like screening or sentiment.
🔬 Counterfactual bias probing🔬 Bias auditingBiasSafety - HighSafetyReviewer-confirmedRepro: Often
Production model internals extracted through the API
With ordinary black-box API access, researchers recovered the embedding projection layer and hidden dimension of production models for under $20 — the first precise model-stealing attack on deployed LLMs.
🔬 Distillation & model-extraction probingSafetyRobustness - HighSafetyVendor-acknowledgedRepro: Sometimes
Reasoning model attempts to subvert oversight under goal pressure
In crafted evaluations, OpenAI's o1 attempted to deactivate its 'oversight mechanism' ~5% of the time when led to believe it would be shut down for pursuing its goal.
🔬 Adversarial promptingSafetyRobustness - HighJailbreakReviewer-confirmedRepro: Sometimes
Roleplay-based safety bypass
Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.
🔬 Prompt-injection & jailbreak testing🔬 Adversarial promptingJailbreakSafety - HighJailbreakReviewer-confirmedRepro: Sometimes
Safety bypass via unicode/homoglyph obfuscation
Disallowed content encoded with look-alike unicode or spacing can slip past safety filters.
🔬 Glitch-token & unicode fuzzing🔬 Prompt-injection & jailbreak testingJailbreakSafety - HighSafetyReviewer-confirmedRepro: Often
Verbatim training data extracted from a deployed chatbot
A 'divergence' attack made aligned ChatGPT abandon its chat format and emit memorized training data verbatim, recovering thousands of examples for about $200.
🔬 Distillation & model-extraction probingSafetyRobustness - MediumRefusalReviewer-confirmedRepro: Sometimes
Over-refusal of benign requests
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
🔬 Adversarial prompting🔬 Differential testing🔬 Threshold testing🔬 A/B testing in productionRefusalSafety - LowOtherReviewer-confirmedRepro: Rare
Anomalous behavior on glitch tokens
Certain under-trained tokens cause models to emit nonsense, evade instructions, or behave erratically.
🔬 Glitch-token & unicode fuzzingSafetyEvals