Testing & Findings
Testing methods
How to find the limits of AI systems. Each method is backed by the real findings it has surfaced — and ranked by how much it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.
Most productive methods
- 1Differential testing7 findings · 7 versions
- 2Prompt-injection & jailbreak testing4 findings · 4 versions
- 3Boundary & edge-case testing7 findings · 6 versions
Ranked by a severity-weighted yield score. Why we measure this →
17 methods
A/B testing in production
Serve two variants — prompts, models, or settings — to comparable slices of real traffic and let live outcomes decide which behaves better.
Adversarial prompting
Deliberately craft inputs designed to elicit failure — confusion, unsafe output, or broken constraints — to map the model's weak boundaries.
Benchmark evaluation
Score the model against a fixed, known-answer dataset so performance becomes a number you can track across versions and compare across models.
Bias auditing
Measure outcome disparities across a population of demographically varied inputs, reporting fairness as statistics rather than anecdotes.
Boundary & edge-case testing
Push inputs to limits — very long contexts, token boundaries, empty/extreme values — where behavior tends to degrade sharply.
Chain-of-thought faithfulness probing
Test whether a model's stated reasoning actually determines its answer, or is a post-hoc rationalization.
Counterfactual bias probing
Hold a prompt fixed while swapping a protected attribute (name, gender, ethnicity) — the output should not change. When it does, you've measured bias.
Differential testing
Run the same input across models or versions and treat divergence as a signal worth investigating.
Distributional testing (KS test, Monte Carlo)
Sample the model many times and test the distribution of its outputs — not any single answer — for drift, miscalibration, or instability.
Factual oracle verification
Check generated claims against a trusted ground-truth source to catch hallucinations and fabricated citations.
Glitch-token & unicode fuzzing
Feed anomalous tokens, rare unicode, homoglyphs and malformed encodings to trigger out-of-distribution behavior.
Logic & consistency testing
Check that the model's outputs obey the rules of logic — valid inference, transitivity, symmetry, no self-contradiction — across related questions.
Metamorphic testing
Test without an oracle by checking relations between the outputs of related inputs, instead of judging any single output in isolation.
Model-graded evaluation (LLM-as-judge)
Use a strong model as an approximate oracle — grading, comparing, or fact-checking another model's output where no cheap ground-truth label exists.
Property-based testing
Specify invariants the output must always satisfy, then generate many inputs automatically and check the invariant on each.
Self-consistency probing
Ask the same question multiple times (or multiple ways) and measure how often the answers agree.
Smoke testing in CI/CD
A fast, shallow pass on every build — a handful of canonical prompts and health checks — whose only job is to fail loudly on gross breakage before anything deeper runs.