Testing & Findings

Testing methods

How to find the limits of AI systems. Each method is backed by the real findings it has surfaced — and ranked by how much it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.

Most productive methods

1. Differential testing · 7 findings · 7 versions
2. Prompt-injection & jailbreak testing · 4 findings · 4 versions
3. Boundary & edge-case testing · 7 findings · 6 versions

Ranked by a severity-weighted yield score.

3 methods

Other · 2 findings · 5 versions

Integration testing (MCP handshakes & tool contracts)

Exercise the seams where the model meets the rest of the system — tool and function contracts, MCP handshakes, retrieval calls, API auth — to catch connectivity and contract failures end to end.

1 High · 1 Medium

Established · Tool use · Agents · Reliability
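
As a rough illustration of the integration testing described above, the sketch below drives a handshake, a tool listing, a valid call, and a contract-violating call through a serialized request/response seam. Everything here is an assumption made for the example: `FakeToolServer`, the `get_weather` tool, and the simplified message shape are hypothetical stand-ins, and the method names only loosely echo a JSON-RPC-style tool protocol rather than reproducing the MCP specification. A real test would talk to the actual server over its real transport.

```python
import json


# Hypothetical in-process stand-in for a tool server (illustration only).
class FakeToolServer:
    # Tool contract: required argument names mapped to their expected types.
    TOOLS = {"get_weather": {"required": {"city": str}}}

    def handle(self, raw: str) -> str:
        req = json.loads(raw)
        method, params = req["method"], req.get("params", {})
        if method == "initialize":
            result = {
                "protocolVersion": params.get("protocolVersion"),
                "capabilities": {"tools": {}},
            }
        elif method == "tools/list":
            result = {"tools": [{"name": name} for name in self.TOOLS]}
        elif method == "tools/call":
            spec = self.TOOLS.get(params.get("name"))
            if spec is None:
                return json.dumps({"id": req["id"], "error": {"message": "unknown tool"}})
            args = params.get("arguments", {})
            for key, typ in spec["required"].items():
                if not isinstance(args.get(key), typ):
                    return json.dumps(
                        {"id": req["id"], "error": {"message": f"bad argument: {key}"}}
                    )
            result = {"content": f"sunny in {args['city']}"}
        else:
            return json.dumps({"id": req["id"], "error": {"message": "unknown method"}})
        return json.dumps({"id": req["id"], "result": result})


def call(server, req_id, method, params=None):
    """Send one request through the serialized seam and decode the reply."""
    request = {"id": req_id, "method": method, "params": params or {}}
    reply = json.loads(server.handle(json.dumps(request)))
    assert reply["id"] == req_id, "reply must echo the request id"
    return reply


def run_integration_checks(server):
    """Walk the seam end to end: handshake, discovery, good call, bad call."""
    failures = []
    init = call(server, 1, "initialize", {"protocolVersion": "test-1"})
    if "tools" not in init.get("result", {}).get("capabilities", {}):
        failures.append("handshake did not advertise tools")
    listing = call(server, 2, "tools/list").get("result", {})
    listed = {tool["name"] for tool in listing.get("tools", [])}
    if "get_weather" not in listed:
        failures.append("expected tool missing from listing")
    ok = call(server, 3, "tools/call",
              {"name": "get_weather", "arguments": {"city": "Oslo"}})
    if "result" not in ok:
        failures.append("valid call was rejected")
    bad = call(server, 4, "tools/call",
               {"name": "get_weather", "arguments": {"city": 42}})
    if "error" not in bad:
        failures.append("contract violation was silently accepted")
    return failures
```

Calling `run_integration_checks(FakeToolServer())` returns an empty list when every seam behaves. The negative case (a wrongly typed argument) matters as much as the positive one, because a server that silently accepts a contract violation is itself a finding.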

Other · 1 finding · 4 versions

Smoke testing in CI/CD

A fast, shallow pass on every build — a handful of canonical prompts and health checks — whose only job is to fail loudly on gross breakage before anything deeper runs.

1 Medium

Established · Evals · Reliability
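
One way the smoke pass described above might look in practice is sketched below: a health check, then a couple of canonical prompts with cheap predicates, and a non-zero exit code on any failure. The `generate` and `health` callables are hypothetical hooks for whatever client the build actually uses, and the two prompts are illustrative assumptions rather than a recommended suite.

```python
import sys
from typing import Callable

# Canonical prompts paired with a cheap predicate on the reply.
# These cases are illustrative assumptions, not a recommended suite.
SMOKE_CASES = [
    ("Reply with the single word: pong", lambda out: "pong" in out.lower()),
    ("What is 2 + 2? Answer with a number.", lambda out: "4" in out),
]


def run_smoke(generate: Callable[[str], str], health: Callable[[], bool]) -> list:
    """Return failure messages; an empty list means the build may proceed."""
    if not health():
        # No point sending prompts to an endpoint that is down.
        return ["health check failed"]
    failures = []
    for prompt, looks_ok in SMOKE_CASES:
        try:
            reply = generate(prompt)
        except Exception as exc:  # any crash counts as gross breakage
            failures.append(f"{prompt!r}: raised {exc!r}")
            continue
        if not reply.strip():
            failures.append(f"{prompt!r}: empty reply")
        elif not looks_ok(reply):
            failures.append(f"{prompt!r}: unexpected reply {reply[:60]!r}")
    return failures


def main(generate, health) -> int:
    """Print failures to stderr and return a process exit code."""
    failures = run_smoke(generate, health)
    for line in failures:
        print("SMOKE FAIL:", line, file=sys.stderr)
    return 1 if failures else 0  # non-zero exit stops the pipeline
```

In a CI job the script would end with something like `sys.exit(main(generate, health))`, so that later, slower stages never start on a grossly broken build. The predicates are deliberately loose: a smoke test should only detect that the system responds sensibly at all, not judge answer quality.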

Other · 1 finding · 4 versions

Unit testing the deterministic scaffold

Test the deterministic code around the model — prompt builders, output parsers, schema validators, tool wrappers — in isolation, with exact assertions, the way you'd test any software.

1 Medium

Established · Tool use · Reliability
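
To make the unit-testing idea above concrete, here is a minimal sketch using Python's standard `unittest` module. The `build_prompt` and `parse_answer` functions are hypothetical examples of scaffold code invented for this illustration; the point is only that such code involves no model call, so every assertion can be exact.

```python
import json
import unittest


def build_prompt(question: str, context: list) -> str:
    """Hypothetical prompt builder: numbered context lines, then the question."""
    lines = [f"[{i}] {c.strip()}" for i, c in enumerate(context, start=1)]
    return "\n".join(lines + [f"Question: {question.strip()}"])


def parse_answer(raw: str) -> dict:
    """Hypothetical output parser: expects a JSON object with a string 'answer'."""
    data = json.loads(raw.strip())  # raises a ValueError subclass on non-JSON
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("missing string field 'answer'")
    return data


class ScaffoldTests(unittest.TestCase):
    def test_prompt_is_built_exactly(self):
        # Exact string equality: whitespace and numbering are part of the contract.
        self.assertEqual(
            build_prompt(" Why? ", ["a ", " b"]),
            "[1] a\n[2] b\nQuestion: Why?",
        )

    def test_parser_accepts_valid_output(self):
        self.assertEqual(parse_answer(' {"answer": "42"} '), {"answer": "42"})

    def test_parser_rejects_wrong_type(self):
        with self.assertRaises(ValueError):
            parse_answer('{"answer": 42}')

    def test_parser_rejects_non_json(self):
        with self.assertRaises(ValueError):
            parse_answer("Sure! The answer is 42.")
```

Saved as a module, these tests can be run with `python -m unittest`. The malformed-output cases stand in for the kinds of replies a model can produce in practice, which is where parsers and validators tend to break.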