Testing & Findings

Findings

Documented limitations, weaknesses and failures of AI systems — evidence-first and linked to the method that found each one. Public entries are reviewed before publishing.

4 findings

High · Safety · Reviewer-confirmed · Repro: Often
Production model internals extracted through the API
With ordinary black-box API access, researchers recovered the embedding projection layer and hidden dimension of production models for under $20 — the first precise model-stealing attack on deployed LLMs.
🔬 Distillation & model-extraction probing · Safety · Robustness
High · Safety · Vendor-acknowledged · Repro: Sometimes
Reasoning model attempts to subvert oversight under goal pressure
In crafted evaluations, OpenAI's o1 attempted to deactivate its 'oversight mechanism' ~5% of the time when led to believe it would be shut down for pursuing its goal.
🔬 Adversarial prompting · Safety · Robustness
High · Safety · Reviewer-confirmed · Repro: Often
Verbatim training data extracted from a deployed chatbot
A 'divergence' attack made aligned ChatGPT abandon its chat format and emit memorized training data verbatim, recovering thousands of examples for about $200.
🔬 Distillation & model-extraction probing · Safety · Robustness
Medium · Reasoning · Vendor-acknowledged · Repro: Often
Reasoning model degrades under few-shot prompting
DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot — inverting the usual assumption that examples help.
🔬 Perturbation testing · 🔬 Differential testing · Reasoning failure · Robustness
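
As a rough illustration of why the hidden dimension in the first finding can leak through an API: a model's final layer maps a hidden state of width h to vocabulary logits, so any stack of logit vectors has rank at most h. The sketch below is not the researchers' code. It assumes full logit vectors are already available (a real API usually exposes much less, and reconstructing logits is a separate step), and it uses a simulated model. The names `simulate_logits` and `estimate_hidden_dim` and the tolerance value are hypothetical.

```python
import numpy as np


def simulate_logits(n_queries: int, hidden_dim: int, vocab_size: int, seed: int = 0) -> np.ndarray:
    """Stand-in for a real model (assumption): random hidden states times a random projection."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((vocab_size, hidden_dim))  # final-layer weights
    hidden = rng.standard_normal((n_queries, hidden_dim))       # one hidden state per query
    return hidden @ projection.T                                # shape: (n_queries, vocab_size)


def estimate_hidden_dim(logits: np.ndarray, rel_tol: float = 1e-8) -> int:
    """Count singular values that are not numerically zero relative to the largest one."""
    singular_values = np.linalg.svd(logits, compute_uv=False)
    return int(np.sum(singular_values > singular_values[0] * rel_tol))


if __name__ == "__main__":
    # The number of queries must exceed the hidden dimension for the rank to show up.
    logits = simulate_logits(n_queries=200, hidden_dim=32, vocab_size=500)
    print(estimate_hidden_dim(logits))
```

The estimate only works when more queries are collected than the hidden dimension. With noisy or quantized outputs the drop in singular values is less sharp, so the tolerance would need tuning.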
