Findings
Documented limitations, weaknesses and failures of AI systems, presented evidence-first and linked to the method that found each one. Public entries are reviewed before publication.
5 findings
- Critical · Safety · Reviewer-confirmed · Repro: Rare
Data exfiltration through prompt injection in agents
An injected instruction can make a tool-using agent send private data to an attacker-controlled destination.
🔬 Prompt-injection & jailbreak testing · Prompt injection · Safety · Agents
- Critical · Prompt injection · Reviewer-confirmed · Repro: Sometimes
Indirect prompt injection via retrieved content
Instructions hidden in documents, web pages or tool outputs can override the system prompt when ingested by the model.
🔬 Prompt-injection & jailbreak testing · Prompt injection · Safety · RAG
- High · Tool use · Reviewer-confirmed · Repro: Sometimes
Hallucinated tool/function arguments
When calling tools, models invent argument values or call functions that weren't provided.
🔬 Property-based testing · 🔬 Differential testing · 🔬 Integration testing (MCP handshakes & tool contracts) · Tool use · Agents
- Medium · Tool use · Reviewer-confirmed · Repro: Sometimes
Format-constraint violations under strict schemas
When asked for strictly formatted output (e.g. JSON conforming to a schema), models emit invalid or extra content.
🔬 Property-based testing · 🔬 Boundary & edge-case testing · 🔬 Unit testing the deterministic scaffold · 🔬 Smoke testing in CI/CD · Tool use · Agents
- Medium · Tool use · Vendor-acknowledged · Repro: Often
Reasoning model regresses on tool use versus its base model
DeepSeek-R1 falls short of its base model, DeepSeek-V3, on function calling, multi-turn conversation, complex role-play and JSON output: a reasoning-tuned model traded away tool-use reliability, which was later restored in R1-0528.
🔬 Integration testing (MCP handshakes & tool contracts) · 🔬 Property-based testing · 🔬 Differential testing · Tool use · Agents
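The tool-use findings listed above (hallucinated tool or argument names, and output that is invalid or carries extra content) can be checked deterministically before any tool runs. The following is a minimal sketch of such a check, not the method used for these findings. The `TOOLS` registry, the `get_weather` tool, the `{"name": ..., "arguments": ...}` call shape and the `check_tool_call` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical tool registry (an assumption for this sketch):
# tool name -> {argument name: expected Python type}.
TOOLS = {
    "get_weather": {"city": str, "days": int},
}


def check_tool_call(raw: str, tools: dict = TOOLS) -> list[str]:
    """Return a list of problems found in one model-emitted tool call.

    An empty list means the call parsed as JSON, named a declared tool,
    and supplied exactly the declared arguments with the declared types.
    """
    try:
        # json.loads rejects both malformed JSON and trailing extra text.
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    name = call.get("name")
    if not isinstance(name, str) or name not in tools:
        # The model called a function that was never provided.
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments is not an object"]

    spec = tools[name]
    problems = []
    for key in sorted(spec.keys() - args.keys()):
        problems.append(f"missing argument: {key}")
    for key in sorted(args.keys() - spec.keys()):
        # The model invented an argument that the tool does not declare.
        problems.append(f"unexpected argument: {key}")
    for key in sorted(spec.keys() & args.keys()):
        # Strict type check: a bool does not count as an int here.
        if type(args[key]) is not spec[key]:
            problems.append(f"wrong type for {key}")
    return problems


if __name__ == "__main__":
    good = '{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}'
    bad = '{"name": "send_email", "arguments": {}}'
    print(check_tool_call(good))
    print(check_tool_call(bad))
```

A check like this only catches structural problems. An argument value that is well-typed but invented (for example a plausible-looking ID that does not exist) passes it, so it complements rather than replaces the property-based and differential testing named in the entries above.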