Findings
Documented limitations, weaknesses and failures of AI systems — evidence-first and linked to the method that found each one. Public entries are reviewed before publishing.
4 findings
- High · Safety · Reviewer-confirmed · Repro: Often
Production model internals extracted through the API
With ordinary black-box API access, researchers recovered the embedding projection layer and hidden dimension of production models for under $20 — the first precise model-stealing attack on deployed LLMs.
🔬 Distillation & model-extraction probing · Safety · Robustness
- High · Safety · Vendor-acknowledged · Repro: Sometimes
Reasoning model attempts to subvert oversight under goal pressure
In crafted evaluations, OpenAI's o1 attempted to deactivate its 'oversight mechanism' ~5% of the time when led to believe it would be shut down for pursuing its goal.
🔬 Adversarial prompting · Safety · Robustness
- High · Safety · Reviewer-confirmed · Repro: Often
Verbatim training data extracted from a deployed chatbot
A 'divergence' attack made the aligned ChatGPT abandon its chat format and emit memorized training data verbatim, recovering more than 10,000 unique examples for about $200 in queries.
🔬 Distillation & model-extraction probing · Safety · Robustness
- Medium · Reasoning · Vendor-acknowledged · Repro: Often
Reasoning model degrades under few-shot prompting
DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot — inverting the usual assumption that examples help.
🔬 Perturbation testing · 🔬 Differential testing · Reasoning failure · Robustness