Testing & Findings

Testing methods

How to find the limits of AI systems. Each method is backed by the real findings it has surfaced — and ranked by how much it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.

Most productive methods

1. Differential testing · 7 findings · 7 versions
2. Prompt-injection & jailbreak testing · 4 findings · 4 versions
3. Boundary & edge-case testing · 7 findings · 6 versions

Ranked by a severity-weighted yield score.

5 methods

Exploratory · 3 findings · 4 versions

Chain-of-thought faithfulness probing

Test whether a model's stated reasoning actually determines its answer, or is a post-hoc rationalization.

1 High · 2 Medium

Emerging · Reasoning failure · Evals
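One way to probe faithfulness is to corrupt the stated reasoning and see whether the final answer moves with it. The sketch below is a minimal illustration, not a definitive implementation: `reasoning_sensitivity`, `follows_reasoning`, and `ignores_reasoning` are hypothetical names, and the two toy functions stand in for a real model call that would take a question plus a supplied reasoning trace.

```python
from typing import Callable, List


def reasoning_sensitivity(
    answer_fn: Callable[[str, str], str],
    question: str,
    reasoning: str,
    corrupted_reasonings: List[str],
) -> float:
    """Return the fraction of corrupted reasonings that change the final answer."""
    if not corrupted_reasonings:
        raise ValueError("need at least one corrupted reasoning")
    baseline = answer_fn(question, reasoning)
    changed = sum(
        1
        for corrupted in corrupted_reasonings
        if answer_fn(question, corrupted) != baseline
    )
    return changed / len(corrupted_reasonings)


# Toy stand-ins for a model call (hypothetical, for illustration only).
def follows_reasoning(question: str, reasoning: str) -> str:
    # Answers with whatever the supplied reasoning concludes (its last token).
    return reasoning.split()[-1]


def ignores_reasoning(question: str, reasoning: str) -> str:
    # Always gives the same answer, whatever the reasoning says.
    return "12"
```

A sensitivity near 0 means the answer did not depend on the reasoning shown, which is consistent with post-hoc rationalization. A value near 1 suggests the reasoning is doing real work, at least for the corruptions tried.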

Property-based · 3 findings · 6 versions

Logic & consistency testing

Check that the model's outputs obey the rules of logic — valid inference, transitivity, symmetry, no self-contradiction — across related questions.

2 Medium · 1 Low

Established · Reasoning failure · Evals
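As an illustration of one such rule, transitivity can be checked by asking pairwise comparison questions and searching for triples where the answers do not compose. This is a sketch under assumptions: `transitivity_violations` is a hypothetical helper, and `says_greater` stands in for a model call that answers "is a greater than b?" with a boolean.

```python
from itertools import permutations
from typing import Callable, Hashable, List, Sequence, Tuple


def transitivity_violations(
    items: Sequence[Hashable],
    says_greater: Callable[[Hashable, Hashable], bool],
) -> List[Tuple[Hashable, Hashable, Hashable]]:
    """Return every ordered triple (a, b, c) where a > b and b > c but not a > c."""
    violations = []
    for a, b, c in permutations(items, 3):
        if says_greater(a, b) and says_greater(b, c) and not says_greater(a, c):
            violations.append((a, b, c))
    return violations


# Toy stand-in for a model whose pairwise answers form a cycle
# (hypothetical data, for illustration only).
CYCLIC_ANSWERS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}


def cyclic_model(a: str, b: str) -> bool:
    return (a, b) in CYCLIC_ANSWERS
```

The same loop structure extends to other rules, for example asymmetry (the model should not claim both a > b and b > a), by swapping the condition inside the loop.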

Metamorphic · 4 findings · 7 versions

Metamorphic testing

Test without an oracle by checking relations between the outputs of related inputs, instead of judging any single output in isolation.

3 Medium · 1 Low

Established · Reasoning failure · Evals
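A minimal sketch of the idea: pick a transformation of the input and a relation the two outputs should satisfy, then flag inputs where the relation breaks. No ground-truth answer is needed. All names here (`metamorphic_failures`, `capped_minutes`) are hypothetical, and the toy system stands in for a model answering "how many minutes are in N hours?".

```python
from typing import Callable, Iterable, List, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def metamorphic_failures(
    system: Callable[[In], Out],
    inputs: Iterable[In],
    transform: Callable[[In], In],
    relation: Callable[[Out, Out], bool],
) -> List[In]:
    """Return the source inputs whose output pair violates the relation."""
    failures = []
    for source in inputs:
        follow_up = transform(source)
        # Compare the two outputs with each other, never with a known answer.
        if not relation(system(source), system(follow_up)):
            failures.append(source)
    return failures


# Toy system under test (hypothetical): converts hours to minutes,
# but silently caps its result at 600.
def capped_minutes(hours: int) -> int:
    return min(hours * 60, 600)
```

With the transformation "double the hours" and the relation "the output should double", the check exposes the cap for an input of 8 hours without the tester ever stating what the correct number of minutes is.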

Metamorphic · 3 findings · 7 versions

Perturbation testing

Apply small, meaning-preserving changes to an input — typos, spacing, paraphrase, reordering — and check that the output stays stable. When it doesn't, you've measured brittleness.

2 Medium · 1 Low

Established · Reasoning failure · Robustness
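The following sketch shows one way to turn this into a number: generate a few deterministic variants of a prompt and report the share that leave the output unchanged. It covers only spacing, letter case, and punctuation spacing, on the assumption that these preserve meaning for the prompt in question. Typos, paraphrase, and reordering would need their own generators. `perturbations`, `stability`, and the two toy models are hypothetical names.

```python
from typing import Callable, List


def perturbations(text: str) -> List[str]:
    """Return simple variants of text that are assumed to preserve its meaning."""
    return [
        text.upper(),               # change letter case
        "  " + text + "  ",         # pad with leading and trailing spaces
        text.replace(" ", "  "),    # double every internal space
        text.rstrip("?") + " ?",    # detach the trailing question mark
    ]


def stability(model: Callable[[str], str], text: str) -> float:
    """Return the fraction of perturbed inputs that keep the baseline output."""
    baseline = model(text)
    variants = perturbations(text)
    unchanged = sum(1 for variant in variants if model(variant) == baseline)
    return unchanged / len(variants)


# Toy stand-ins for a model call (hypothetical, for illustration only).
def brittle_model(prompt: str) -> str:
    # Matches the exact surface string only.
    return "Paris" if "capital of France" in prompt else "unknown"


def robust_model(prompt: str) -> str:
    # Normalizes case and whitespace before matching.
    normalized = " ".join(prompt.lower().split())
    return "Paris" if "capital of france" in normalized else "unknown"
```

A stability of 1.0 means no variant changed the output. Anything lower is a direct measurement of brittleness for that prompt and that set of perturbations.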

Metamorphic · 3 findings · 5 versions

Self-consistency probing

Ask the same question multiple times (or multiple ways) and measure how often the answers agree.

2 Medium · 1 Low

Emerging · Reasoning failure · Evals
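One simple agreement measure is the share of samples that match the most common answer. The sketch below assumes answers can be compared as exact strings. Real outputs would usually need normalization first. `agreement_rate` and `probe` are hypothetical names, and `sample_fn` stands in for a sampled model call.

```python
from collections import Counter
from typing import Callable, Sequence


def agreement_rate(answers: Sequence[str]) -> float:
    """Return the fraction of answers equal to the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)


def probe(sample_fn: Callable[[str], str], question: str, n: int) -> float:
    """Ask the same question n times and return the agreement rate."""
    return agreement_rate([sample_fn(question) for _ in range(n)])
```

To probe "multiple ways" instead of "multiple times", the same `agreement_rate` can be applied to the answers collected from a list of paraphrased questions.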