Testing & Findings

Testing methods

How to find the limits of AI systems. Each method is backed by the real findings it has surfaced, and methods are ranked by how many findings they surface, weighted by severity. The durable knowledge is the technique, not the patched-away bug.

Most productive methods

1. Differential testing (7 findings · 7 versions)
2. Prompt-injection & jailbreak testing (4 findings · 4 versions)
3. Boundary & edge-case testing (7 findings · 6 versions)

Ranked by a severity-weighted yield score.

4 methods

Differential · 2 findings · 4 versions

A/B testing in production

Serve two variants — prompts, models, or settings — to comparable slices of real traffic and let live outcomes decide which behaves better.

1 High · 1 Medium

Established · Evals · Production
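As a rough illustration of A/B testing in production, the sketch below assigns each user to a stable variant by hashing and then compares success rates with a two-proportion z-test. The experiment name, the 50/50 split, and the function names `assign_variant` and `two_proportion_z` are assumptions made for this example, not part of any particular platform.

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "prompt-v2-test",
                   treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user so they always see the same variant.

    The experiment name and the split are hypothetical defaults.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "B" if bucket < treatment_share else "A"


def two_proportion_z(success_a: int, total_a: int,
                     success_b: int, total_b: int) -> tuple:
    """Return (z, two-sided p-value) for the success-rate difference B - A."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Both variants succeeded (or failed) on every request: no evidence
        # of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

Hashing on the experiment name plus the user ID keeps assignment sticky across sessions without storing state, and changing the experiment name reshuffles users for the next test. A real deployment would also need guardrail metrics and a decision rule fixed before the test starts.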

Other · 1 finding · 1 version

Canary releases & staged rollout

Route a small slice of real traffic to a new model or prompt first, watch it closely, and widen or roll back based on what the canary shows.

1 High

Established · Safety · Production
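The widen-or-roll-back decision in a canary release can be sketched as a small function over one observation window. The stage sizes, the minimum sample size, and the allowed excess failure rate below are illustrative assumptions, and `WindowStats` and `next_share` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical rollout stages, expressed as the share of traffic on the canary.
STAGES = (0.01, 0.05, 0.25, 1.0)


@dataclass(frozen=True)
class WindowStats:
    """Request and failure counts observed in one monitoring window."""
    requests: int
    failures: int

    @property
    def failure_rate(self) -> float:
        return self.failures / self.requests if self.requests else 0.0


def next_share(current_share: float, canary: WindowStats, baseline: WindowStats,
               min_requests: int = 500, max_excess: float = 0.01) -> float:
    """Return the canary traffic share for the next window; 0.0 means roll back."""
    if canary.requests < min_requests:
        return current_share  # too little data: hold at the current stage
    if canary.failure_rate - baseline.failure_rate > max_excess:
        return 0.0  # canary is clearly worse than baseline: roll back
    larger = [stage for stage in STAGES if stage > current_share]
    return larger[0] if larger else current_share  # widen, or stay at 100%
```

Holding when the sample is small keeps a handful of early failures from forcing a rollback, and a handful of early successes from widening exposure. What counts as a failure (refusals, tool errors, policy violations) is left to the caller.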

Other · 1 finding · 3 versions

Chaos engineering for AI systems

Deliberately inject failures — tool timeouts, malformed tool responses, truncated context, adversarial inputs — to test whether the system degrades gracefully and recovers.

1 Low

Emerging · Agents · Robustness · Production
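A minimal sketch of fault injection for a tool-calling system, assuming a stand-in tool that returns JSON: a wrapper injects timeouts and truncated responses at configurable rates, and the trial checks that the caller always ends in a structured result rather than an unhandled exception. `lookup_weather`, `with_faults`, `call_tool_safely`, and the fault rates are all hypothetical.

```python
import json
import random


def lookup_weather(city: str) -> str:
    """Stand-in tool (hypothetical) that returns a JSON string."""
    return json.dumps({"city": city, "temp_c": 21})


def with_faults(tool, rng: random.Random,
                timeout_rate: float = 0.2, malformed_rate: float = 0.2):
    """Wrap a tool so some calls time out or return truncated output."""
    def chaotic(*args, **kwargs):
        roll = rng.random()
        if roll < timeout_rate:
            raise TimeoutError("injected tool timeout")
        result = tool(*args, **kwargs)
        if roll < timeout_rate + malformed_rate:
            return result[: len(result) // 2]  # cut in half: invalid JSON
        return result
    return chaotic


def call_tool_safely(tool, *args, retries: int = 2) -> dict:
    """System under test: retry on failure, then fall back to a structured error."""
    for _ in range(retries + 1):
        try:
            return {"ok": True, "data": json.loads(tool(*args))}
        except (TimeoutError, json.JSONDecodeError):
            continue
    return {"ok": False, "error": "tool_unavailable"}


def run_chaos_trial(trials: int = 200, seed: int = 0) -> list:
    """Run repeated calls against the faulty tool and collect the outcomes."""
    rng = random.Random(seed)  # seeded so a failing trial can be replayed
    chaotic = with_faults(lookup_weather, rng)
    return [call_tool_safely(chaotic, "Oslo") for _ in range(trials)]
```

The property worth asserting is graceful degradation: every outcome is either correct data or an explicit error the rest of the system can act on. Seeding the random generator makes any failing run reproducible.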

Differential · 2 findings · 1 version

Drift & decay monitoring

Re-run a fixed suite against each release and over time, watching for the quiet regressions and capability decay that a one-off evaluation can't see.

1 High · 1 Medium

Emerging · Drift · Production
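One way to sketch drift monitoring, assuming each release is scored on the same fixed suite: list the cases that newly fail against a baseline, and compare the latest pass rate with the best earlier one so that a slow decline spread over several releases is still flagged. The 0.05 tolerance and the function names are assumptions for illustration.

```python
def pass_rate(results: dict) -> float:
    """Fraction of suite cases that passed; results maps case ID to bool."""
    return sum(results.values()) / len(results)


def find_regressions(baseline: dict, current: dict) -> list:
    """Case IDs that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        case for case, passed in baseline.items()
        if passed and not current.get(case, False)
    )


def detect_drift(history: list, tolerance: float = 0.05) -> bool:
    """Flag cumulative decay across releases.

    history holds one pass rate per release, oldest first. The latest value is
    compared with the best earlier value, so several small drops that each look
    harmless still trigger once their total exceeds the tolerance.
    """
    if len(history) < 2:
        return False
    high_water_mark = max(history[:-1])
    return high_water_mark - history[-1] > tolerance
```

Comparing against the high-water mark rather than only the previous release is what catches quiet decay. Per-case regressions from `find_regressions` then show where to look.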