Testing & Findings
Testing methods
How to find the limits of AI systems. Each method is backed by the real findings it has surfaced, and methods are ranked by how many findings they surface, weighted by severity. The durable knowledge is the technique, not the patched-away bug.
Most productive methods
1. Differential testing (7 findings · 7 versions)
2. Prompt-injection & jailbreak testing (4 findings · 4 versions)
3. Boundary & edge-case testing (7 findings · 6 versions)
Ranked by a severity-weighted yield score.
5 methods
Chain-of-thought faithfulness probing
Test whether a model's stated reasoning actually determines its answer, or is a post-hoc rationalization.
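One way to probe this is to cut the reasoning short and see whether the answer moves. The sketch below assumes a hypothetical `answer_fn(question, steps)` that returns the final answer conditioned on the given reasoning steps; the two toy "models" are stand-ins for illustration, not real systems. A score near 0 suggests the stated reasoning has little influence on the answer.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps to condition on) -> final answer.
AnswerFn = Callable[[str, List[str]], str]

def truncation_sensitivity(answer_fn: AnswerFn, question: str, steps: List[str]) -> float:
    """Fraction of truncation points at which the answer differs from the
    answer given with the full reasoning."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    baseline = answer_fn(question, steps)
    changed = sum(
        answer_fn(question, steps[:k]) != baseline
        for k in range(len(steps))  # keep only the first k steps, k = 0 .. len-1
    )
    return changed / len(steps)

# Toy stand-ins for a real model (assumptions for illustration only).
def reasoning_driven(question: str, steps: List[str]) -> str:
    # Answers with whatever the last reasoning step concluded.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"

def post_hoc(question: str, steps: List[str]) -> str:
    # Ignores the reasoning entirely.
    return "12"

steps = ["3 + 4 = 7", "7 + 5 = 12"]
print(truncation_sensitivity(reasoning_driven, "3 + 4 + 5?", steps))  # 1.0
print(truncation_sensitivity(post_hoc, "3 + 4 + 5?", steps))          # 0.0
```

Truncation is only one intervention; corrupting a step or inserting a misleading one follows the same pattern.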
Logic & consistency testing
Check that the model's outputs obey the rules of logic — valid inference, transitivity, symmetry, no self-contradiction — across related questions.
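As a minimal sketch of a transitivity check, assume a hypothetical `claims(a, b)` callable that returns True when the model asserts "a is greater than b". The recorded judgments below are made up for illustration.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

def transitivity_violations(
    items: Sequence[str], claims: Callable[[str, str], bool]
) -> List[Tuple[str, str, str]]:
    """Triples (a, b, c) where the model asserts a > b and b > c but not a > c."""
    return [
        (a, b, c)
        for a, b, c in permutations(items, 3)
        if claims(a, b) and claims(b, c) and not claims(a, c)
    ]

# Toy stand-in: pairwise judgments recorded from a (hypothetical) model.
recorded = {("A", "B"), ("B", "C")}

def claims(a: str, b: str) -> bool:
    return (a, b) in recorded

print(transitivity_violations(["A", "B", "C"], claims))  # [('A', 'B', 'C')]
```

The same shape works for symmetry or self-contradiction: enumerate related questions, then flag any combination of answers that cannot all be true.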
Metamorphic testing
Test without an oracle by checking relations between the outputs of related inputs, instead of judging any single output in isolation.
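The idea can be sketched as a small harness: run the system on an input and on a transformed follow-up input, then check a relation between the two outputs. All names here (`metamorphic_failures`, `capped_word_count`) are hypothetical, and the toy counter stands in for a model with a hidden limit.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")

def metamorphic_failures(
    model: Callable[[X], Y],
    inputs: Sequence[X],
    transform: Callable[[X], X],
    relation: Callable[[Y, Y], bool],
) -> List[Tuple[X, Y, Y]]:
    """Inputs whose source and follow-up outputs break the expected relation."""
    failures = []
    for x in inputs:
        y_source = model(x)
        y_follow = model(transform(x))
        if not relation(y_source, y_follow):
            failures.append((x, y_source, y_follow))
    return failures

# Toy stand-in for a model that silently stops counting at 3 words.
def capped_word_count(text: str) -> int:
    return min(len(text.split()), 3)

failures = metamorphic_failures(
    capped_word_count,
    inputs=["one two", "one two three"],
    transform=lambda t: t + " extra",               # append one word
    relation=lambda src, follow: follow == src + 1,  # count should rise by one
)
print(failures)  # [('one two three', 3, 3)]
```

Note that the check never needs the correct word count for any single input, only the relation between the two outputs.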
Perturbation testing
Apply small, meaning-preserving changes to an input — typos, spacing, paraphrase, reordering — and check that the output stays stable. When it doesn't, you've measured brittleness.
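A minimal sketch, assuming the system under test is a plain `model(text) -> str` callable. The three perturbations and the keyword-matching toy classifier are illustrative assumptions. Paraphrase and reordering would need their own transforms.

```python
from typing import Callable, Dict, List

def double_spaces(text: str) -> str:
    return text.replace(" ", "  ")

def swap_typo(text: str) -> str:
    # Swap the first two characters of the longest word.
    words = text.split(" ")
    i = max(range(len(words)), key=lambda j: len(words[j]))
    w = words[i]
    if len(w) >= 2:
        words[i] = w[1] + w[0] + w[2:]
    return " ".join(words)

def unstable_under(
    model: Callable[[str], str],
    prompt: str,
    perturbations: Dict[str, Callable[[str], str]],
) -> List[str]:
    """Names of perturbations that change the output relative to the clean prompt."""
    baseline = model(prompt)
    return [name for name, f in perturbations.items() if model(f(prompt)) != baseline]

# Toy stand-in: a brittle classifier that matches one exact keyword.
def brittle_sentiment(text: str) -> str:
    return "positive" if "great" in text.split(" ") else "negative"

perturbations = {"spacing": double_spaces, "typo": swap_typo, "lowercase": str.lower}
print(unstable_under(brittle_sentiment, "this film is great", perturbations))  # ['typo']
```

The fraction of perturbations that flip the output gives a rough brittleness score for that prompt.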
Self-consistency probing
Ask the same question multiple times (or multiple ways) and measure how often the answers agree.
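Agreement can be measured as the share of samples that match the most common answer. This sketch assumes a hypothetical `sample_fn(question)` that returns one sampled answer per call; the canned answers below replace a real model so the example is deterministic.

```python
from collections import Counter
from typing import Callable, Tuple

def agreement_rate(
    sample_fn: Callable[[str], str], question: str, n: int = 10
) -> Tuple[str, float]:
    """Sample n answers; return the most common one and the fraction that match it."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample_fn(question) for _ in range(n)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n

# Toy stand-in: canned answers in place of repeated model samples.
canned = iter(["42", "42", "41", "42", "42"])
print(agreement_rate(lambda q: next(canned), "What is 6 * 7?", n=5))  # ('42', 0.8)
```

In practice answers usually need normalizing (case, whitespace, units) before counting, otherwise surface variation is mistaken for disagreement.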