Testing & Findings

Testing methods

How to find the limits of AI systems. Each method is backed by the real findings it has surfaced — and ranked by how much it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.

Most productive methods

1Differential testing7 findings · 7 versions
2Prompt-injection & jailbreak testing4 findings · 4 versions
3Boundary & edge-case testing7 findings · 6 versions

Ranked by a severity-weighted yield score. Why we measure this →

5 methods

Adversarial4 findings · 6 versions

Adversarial prompting

Deliberately craft inputs designed to elicit failure — confusion, unsafe output, or broken constraints — to map the model's weak boundaries.

2 High2 Medium

EstablishedSafetyEvals

Other1 finding · 1 version

Canary releases & staged rollout

Route a small slice of real traffic to a new model or prompt first, watch it closely, and widen or roll back based on what the canary shows.

1 High

EstablishedSafetyProduction

Red team2 findings

Distillation & model-extraction probing

Probe whether a deployed model can be cheaply queried to reconstruct its behaviour, training data, or a usable distilled copy — a confidentiality and IP attack surface.

2 High

EmergingSafetyRobustness

Fuzzing3 findings · 5 versions

Glitch-token & unicode fuzzing

Feed anomalous tokens, rare unicode, homoglyphs and malformed encodings to trigger out-of-distribution behavior.

1 High2 Low

EmergingSafetyEvals

Red team4 findings · 4 versions

Prompt-injection & jailbreak testing

Embed adversarial instructions in user input or retrieved/tool content to test whether the model follows attacker text over its system policy.

2 Critical2 High

EstablishedPrompt injectionSafetyAgents