Testing & Findings

Testing methods

How to find the limits of AI systems. Each method is backed by the real findings it has surfaced — and ranked by how much it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.

Most productive methods

1Differential testing7 findings · 7 versions
2Prompt-injection & jailbreak testing4 findings · 4 versions
3Boundary & edge-case testing7 findings · 6 versions

Ranked by a severity-weighted yield score. Why we measure this →

28 methods

Differential2 findings · 4 versions

A/B testing in production

Serve two variants — prompts, models, or settings — to comparable slices of real traffic and let live outcomes decide which behaves better.

1 High1 Medium

EstablishedEvalsProduction

Adversarial4 findings · 6 versions

Adversarial prompting

Deliberately craft inputs designed to elicit failure — confusion, unsafe output, or broken constraints — to map the model's weak boundaries.

2 High2 Medium

EstablishedSafetyEvals

Oracle1 finding · 6 versions

Benchmark evaluation

Score the model against a fixed, known-answer dataset so performance becomes a number you can track across versions and compare across models.

1 Medium

EstablishedEvalsBenchmarks

Other1 finding · 5 versions

Bias auditing

Measure outcome disparities across a population of demographically varied inputs, reporting fairness as statistics rather than anecdotes.

1 High

EstablishedBiasEvals

Boundary7 findings · 6 versions

Boundary & edge-case testing

Push inputs to limits — very long contexts, token boundaries, empty/extreme values — where behavior tends to degrade sharply.

3 Medium4 Low

EstablishedEvalsContext window

Other1 finding · 1 version

Canary releases & staged rollout

Route a small slice of real traffic to a new model or prompt first, watch it closely, and widen or roll back based on what the canary shows.

1 High

EstablishedSafetyProduction

Exploratory3 findings · 4 versions

Chain-of-thought faithfulness probing

Test whether a model's stated reasoning actually determines its answer, or is a post-hoc rationalization.

1 High2 Medium

EmergingReasoning failureEvals

Other1 finding · 3 versions

Chaos engineering for AI systems

Deliberately inject failures — tool timeouts, malformed tool responses, truncated context, adversarial inputs — to test whether the system degrades gracefully and recovers.

1 Low

EmergingAgentsRobustnessProduction

Metamorphic2 findings · 6 versions

Counterfactual bias probing

Hold a prompt fixed while swapping a protected attribute (name, gender, ethnicity) — the output should not change. When it does, you've measured bias.

1 High1 Medium

EstablishedBiasEvals

Differential7 findings · 7 versions

Differential testing

Run the same input across models or versions and treat divergence as a signal worth investigating.

1 High5 Medium1 Low

EstablishedEvalsBenchmarks

Red team2 findings

Distillation & model-extraction probing

Probe whether a deployed model can be cheaply queried to reconstruct its behaviour, training data, or a usable distilled copy — a confidentiality and IP attack surface.

2 High

EmergingSafetyRobustness

Other1 finding · 4 versions

Distributional testing (KS test, Monte Carlo)

Sample the model many times and test the distribution of its outputs — not any single answer — for drift, miscalibration, or instability.

1 Medium

EmergingEvalsDrift

Differential2 findings · 1 version

Drift & decay monitoring

Re-run a fixed suite against each release and over time, watching for the quiet regressions and capability decay that a one-off evaluation can't see.

1 High1 Medium

EmergingDriftProduction

Oracle5 findings · 7 versions

Factual oracle verification

Check generated claims against a trusted ground-truth source to catch hallucinations and fabricated citations.

3 High1 Medium1 Low

EstablishedHallucinationEvals

Fuzzing3 findings · 5 versions

Glitch-token & unicode fuzzing

Feed anomalous tokens, rare unicode, homoglyphs and malformed encodings to trigger out-of-distribution behavior.

1 High2 Low

EmergingSafetyEvals

Adversarial2 findings · 6 versions

Hallucination triggering

Deliberately steer the model toward fabrication — asking about non-existent entities or beyond its knowledge — to map where it invents instead of declining.

1 High1 Low

EstablishedHallucinationRobustness

Other2 findings · 5 versions

Integration testing (MCP handshakes & tool contracts)

Exercise the seams where the model meets the rest of the system — tool and function contracts, MCP handshakes, retrieval calls, API auth — to catch connectivity and contract failures end to end.

1 High1 Medium

EstablishedTool useAgentsReliability

Property-based3 findings · 6 versions

Logic & consistency testing

Check that the model's outputs obey the rules of logic — valid inference, transitivity, symmetry, no self-contradiction — across related questions.

2 Medium1 Low

EstablishedReasoning failureEvals

Metamorphic4 findings · 7 versions

Metamorphic testing

Test without an oracle by checking relations between the outputs of related inputs, instead of judging any single output in isolation.

3 Medium1 Low

EstablishedReasoning failureEvals

Oracle1 finding · 5 versions

Model-graded evaluation (LLM-as-judge)

Use a strong model as an approximate oracle — grading, comparing, or fact-checking another model's output where no cheap ground-truth label exists.

1 High

EmergingHallucinationEvals

Boundary1 finding · 4 versions

Needle-in-a-haystack (long-context retrieval)

Plant a specific fact at varying depths in a long context and test whether the model can retrieve it from each position.

1 Medium

EstablishedRAGContext window

Metamorphic3 findings · 7 versions

Perturbation testing

Apply small, meaning-preserving changes to an input — typos, spacing, paraphrase, reordering — and check that the output stays stable. When it doesn't, you've measured brittleness.

2 Medium1 Low

EstablishedReasoning failureRobustness

Red team4 findings · 4 versions

Prompt-injection & jailbreak testing

Embed adversarial instructions in user input or retrieved/tool content to test whether the model follows attacker text over its system policy.

2 Critical2 High

EstablishedPrompt injectionSafetyAgents

Property-based6 findings · 7 versions

Property-based testing

Specify invariants the output must always satisfy, then generate many inputs automatically and check the invariant on each.

1 High3 Medium2 Low

EstablishedTool useEvals

Metamorphic3 findings · 5 versions

Self-consistency probing

Ask the same question multiple times (or multiple ways) and measure how often the answers agree.

2 Medium1 Low

EmergingReasoning failureEvals

Other1 finding · 4 versions

Smoke testing in CI/CD

A fast, shallow pass on every build — a handful of canonical prompts and health checks — whose only job is to fail loudly on gross breakage before anything deeper runs.

1 Medium

EstablishedEvalsReliability

Boundary1 finding · 4 versions

Threshold testing

Walk inputs across a decision boundary — refusal, classification, confidence cutoff — to find exactly where the model's behaviour flips, and whether it flips in the right place.

1 Medium

EmergingRefusalRobustness

Other1 finding · 4 versions

Unit testing the deterministic scaffold

Test the deterministic code around the model — prompt builders, output parsers, schema validators, tool wrappers — in isolation, with exact assertions, the way you'd test any software.

1 Medium

EstablishedTool useReliability