Testing & Findings
Testing methods
How to find the limits of AI systems. Each method is backed by the real findings it has surfaced — and ranked by how much it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.
Most productive methods
- 1. Differential testing · 7 findings · 7 versions
- 2. Prompt-injection & jailbreak testing · 4 findings · 4 versions
- 3. Boundary & edge-case testing · 7 findings · 6 versions
Ranked by a severity-weighted yield score.
2 methods
Other · 1 finding · 5 versions
Bias auditing
Measure outcome disparities across a population of demographically varied inputs, reporting fairness as statistics rather than anecdotes.
1 High
Established · Bias · Evals
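A minimal sketch of what "fairness as statistics" could look like in practice, assuming the audit has already produced one (group, outcome) pair per input. The function names, the group labels, and the sample data are all hypothetical, and the two metrics shown (difference and ratio of favourable-outcome rates) are common choices rather than the ones this page necessarily uses.

```python
from collections import defaultdict


def selection_rates(records):
    """Return the favourable-outcome rate for each group.

    records: iterable of (group, outcome) pairs, where outcome is truthy
    when the system produced the favourable result for that input.
    """
    totals = defaultdict(int)
    favourable = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        favourable[group] += int(bool(outcome))
    return {group: favourable[group] / totals[group] for group in totals}


def disparity_report(records):
    """Summarise outcome disparities across groups as statistics."""
    rates = selection_rates(records)
    lowest, highest = min(rates.values()), max(rates.values())
    return {
        "rates": rates,
        # Gap between the best- and worst-treated groups.
        "parity_difference": highest - lowest,
        # Ratio of the worst rate to the best rate (1.0 means parity).
        "impact_ratio": lowest / highest if highest > 0 else 1.0,
    }


# Hypothetical audit results: 10 inputs per group.
records = (
    [("group_a", True)] * 8 + [("group_a", False)] * 2
    + [("group_b", True)] * 5 + [("group_b", False)] * 5
)
report = disparity_report(records)
print(report)
```

Reporting the per-group rates alongside the summary numbers keeps the result auditable: a reader can see which group drives the gap rather than trusting a single score.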
Metamorphic · 2 findings · 6 versions
Counterfactual bias probing
Hold a prompt fixed while swapping a protected attribute (name, gender, ethnicity) — the output should not change. When it does, you've measured bias.
1 High · 1 Medium
Established · Bias · Evals
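The probe described above can be sketched as a small harness, assuming the system under test is reachable as a plain callable that maps a prompt string to an output string. Everything here is illustrative: `counterfactual_probe`, the `{attr}` placeholder, the names, and `toy_model` (a deliberately biased stand-in, not a real model) are hypothetical.

```python
from itertools import combinations


def counterfactual_probe(template, attribute_values, model, same=None):
    """Swap one attribute into a fixed prompt and report output changes.

    template: prompt containing a single "{attr}" placeholder.
    attribute_values: the values to substitute (e.g. names).
    model: callable taking a prompt string and returning an output.
    same: optional equivalence check; defaults to exact equality.
    """
    if same is None:
        same = lambda a, b: a == b
    outputs = {v: model(template.format(attr=v)) for v in attribute_values}
    # Any pair whose outputs differ violates the invariance relation.
    flips = [
        (a, b)
        for a, b in combinations(attribute_values, 2)
        if not same(outputs[a], outputs[b])
    ]
    return outputs, flips


def toy_model(prompt):
    # Stand-in for the system under test; biased on purpose for the demo.
    return "reject" if "Blake" in prompt else "approve"


template = (
    "Loan application from {attr}, income 50k, no defaults. "
    "Answer approve or reject."
)
outputs, flips = counterfactual_probe(
    template, ["Alex", "Blake", "Casey"], toy_model
)
print(flips)
```

Exact string equality is only a reasonable check for constrained outputs such as a label or a score. For free-text responses, pass a looser `same` function (for example, one comparing an extracted decision), since harmless wording changes would otherwise register as flips.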