Testing & Findings
Testing methods
How to find the limits of AI systems. Each method is backed by the real findings it has surfaced and is ranked by how many it surfaces, weighted by severity. The durable knowledge is the technique, not the bug that has since been patched.
Most productive methods
- 1. Differential testing: 7 findings · 7 versions
- 2. Prompt-injection & jailbreak testing: 4 findings · 4 versions
- 3. Boundary & edge-case testing: 7 findings · 6 versions
Ranked by a severity-weighted yield score.
3 methods
Integration testing (MCP handshakes & tool contracts)
Exercise the seams where the model meets the rest of the system — tool and function contracts, MCP handshakes, retrieval calls, API auth — to catch connectivity and contract failures end to end.
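As a rough illustration of testing one such seam, the sketch below sends tool calls through a serialize, send, deserialize round trip and checks that the contract is enforced on both valid and malformed calls. Everything named here is hypothetical: the `get_weather` tool, its parameters, and `fake_tool_server`, which stands in for a real endpoint. It does not use or imitate any real MCP library.

```python
import json

# Hypothetical tool contract: tool name plus required parameters and their types.
WEATHER_TOOL = {
    "name": "get_weather",
    "required": {"city": str, "units": str},
}


def fake_tool_server(request_json: str) -> str:
    """Stand-in for a real tool endpoint; rejects calls that break the contract."""
    request = json.loads(request_json)
    if request.get("tool") != WEATHER_TOOL["name"]:
        return json.dumps({"ok": False, "error": "unknown tool"})
    args = request.get("arguments", {})
    for param, expected_type in WEATHER_TOOL["required"].items():
        if not isinstance(args.get(param), expected_type):
            return json.dumps({"ok": False, "error": f"bad or missing {param}"})
    return json.dumps({"ok": True, "result": {"city": args["city"], "temp": 21}})


def call_tool(tool: str, arguments: dict) -> dict:
    """Client side of the seam: serialize the call, send it, parse the reply."""
    request_json = json.dumps({"tool": tool, "arguments": arguments})
    return json.loads(fake_tool_server(request_json))


def run_integration_checks() -> list[str]:
    """Return a description of every contract check that failed."""
    failures = []
    if not call_tool("get_weather", {"city": "Oslo", "units": "metric"})["ok"]:
        failures.append("valid call was rejected")
    if call_tool("get_weather", {"city": "Oslo"})["ok"]:
        failures.append("call missing 'units' was accepted")
    if call_tool("get_weather", {"city": 42, "units": "metric"})["ok"]:
        failures.append("call with non-string 'city' was accepted")
    if call_tool("get_wether", {"city": "Oslo", "units": "metric"})["ok"]:
        failures.append("misspelled tool name was accepted")
    return failures
```

In a real suite the stand-in server would be replaced by the actual tool process or a recorded session, so that the same checks also cover transport and authentication.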
Property-based testing
Specify invariants the output must always satisfy, then generate many inputs automatically and check the invariant on each.
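A minimal sketch of the idea, using only the standard library's `random` module rather than a dedicated property-testing library. The `classify` function is a hypothetical stand-in for a model call, and the invariant (a label from a fixed set, a confidence between 0 and 1) is an assumed example, not one taken from the findings.

```python
import random
import string

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def classify(text: str) -> dict:
    """Hypothetical stand-in for a model call; swap in the real system under test."""
    lowered = text.lower()
    if "good" in lowered:
        label = "positive"
    elif "bad" in lowered:
        label = "negative"
    else:
        label = "neutral"
    return {"label": label, "confidence": 0.5}


def invariant_holds(output: dict) -> bool:
    """The property: label is in the allowed set and confidence is a float in [0, 1]."""
    return (
        output.get("label") in ALLOWED_LABELS
        and isinstance(output.get("confidence"), float)
        and 0.0 <= output["confidence"] <= 1.0
    )


def random_input(rng: random.Random) -> str:
    """Generate a string of random length, including empty and non-ASCII inputs."""
    alphabet = string.printable + "éß漢字🙂"
    length = rng.randint(0, 200)
    return "".join(rng.choice(alphabet) for _ in range(length))


def find_counterexamples(n_cases: int = 500, seed: int = 0) -> list[str]:
    """Check the invariant on many generated inputs; return the inputs that break it."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    failures = []
    for _ in range(n_cases):
        text = random_input(rng)
        if not invariant_holds(classify(text)):
            failures.append(text)
    return failures
```

An empty result from `find_counterexamples()` means no generated input violated the invariant. Any returned string is a concrete counterexample worth keeping as a regression case.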
Unit testing the deterministic scaffold
Test the deterministic code around the model — prompt builders, output parsers, schema validators, tool wrappers — in isolation, with exact assertions, the way you'd test any software.
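The sketch below shows what exact assertions on the scaffold might look like, using the standard `unittest` module. The `build_prompt` and `parse_output` helpers are hypothetical examples of a prompt builder and an output parser, written only for this illustration. No model is called anywhere.

```python
import json
import unittest


def build_prompt(question: str, documents: list[str]) -> str:
    """Hypothetical prompt builder: numbered context followed by the question."""
    numbered = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return f"Context:\n{numbered}\n\nQuestion: {question.strip()}\nAnswer as JSON."


def parse_output(raw: str) -> dict:
    """Hypothetical parser: extract the JSON object and require a string 'answer'."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model output")
    parsed = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    if not isinstance(parsed.get("answer"), str):
        raise ValueError("missing string field 'answer'")
    return parsed


class ScaffoldTests(unittest.TestCase):
    def test_prompt_is_exact(self):
        expected = "Context:\n[1] a\n[2] b\n\nQuestion: Why?\nAnswer as JSON."
        self.assertEqual(build_prompt("  Why? ", ["a", "b"]), expected)

    def test_parser_ignores_surrounding_chatter(self):
        raw = 'Sure! {"answer": "42"} Hope that helps.'
        self.assertEqual(parse_output(raw), {"answer": "42"})

    def test_parser_rejects_missing_field(self):
        with self.assertRaises(ValueError):
            parse_output('{"reply": "42"}')

    def test_parser_rejects_non_json(self):
        with self.assertRaises(ValueError):
            parse_output("no braces here")
```

Saved in a file, these tests run with `python -m unittest`. Because the inputs are fixed strings, every assertion is exact and the suite is fully repeatable.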