Testing & Findings

Testing methods

How to find the limits of AI systems. Each method is backed by the real findings it has surfaced and is ranked by how many findings it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.

Most productive methods

1. Differential testing · 7 findings · 7 versions
2. Prompt-injection & jailbreak testing · 4 findings · 4 versions
3. Boundary & edge-case testing · 7 findings · 6 versions

Ranked by a severity-weighted yield score.
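The page does not publish the formula behind the yield score, so the following is only a minimal sketch of what a severity-weighted score could look like. The weights (High = 3, Medium = 2, Low = 1) and the `yield_score` helper are assumptions made for illustration; the severity breakdowns are the ones shown on the method cards on this page.

```python
from collections import Counter

# Hypothetical weights: the page does not state its actual scoring formula.
SEVERITY_WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

def yield_score(severities):
    """Sum one weight per finding; `severities` is a list of severity labels."""
    counts = Counter(severities)
    return sum(SEVERITY_WEIGHTS[label] * n for label, n in counts.items())

# Severity breakdowns copied from the method cards on this page.
differential = ["High"] * 1 + ["Medium"] * 5 + ["Low"] * 1
benchmark = ["Medium"] * 1

scores = {
    "Differential testing": yield_score(differential),
    "Benchmark evaluation": yield_score(benchmark),
}

# Highest score first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

Under these assumed weights, one High finding counts as much as three Low findings, which is why a method with fewer but more severe findings can outrank one with more findings.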

2 methods

Oracle · 1 finding · 6 versions

Benchmark evaluation

Score the model against a fixed, known-answer dataset so performance becomes a number you can track across versions and compare across models.

1 Medium

Established · Evals · Benchmarks
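As a rough illustration of benchmark evaluation, the sketch below scores two model versions against a small known-answer dataset using exact-match accuracy. Everything here is hypothetical: the dataset, the `exact_match_accuracy` helper, and the two "models", which are plain lookup functions standing in for real model calls.

```python
def exact_match_accuracy(model, dataset):
    """Fraction of items where the model's answer equals the known answer."""
    correct = sum(
        1
        for question, expected in dataset
        if model(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(dataset)

# Hypothetical fixed dataset; a real benchmark would be far larger.
DATASET = [("2 + 2", "4"), ("capital of France", "Paris"), ("3 * 3", "9")]

# Stand-ins for two model versions (lookup tables, not real models).
def model_v1(question):
    return {"2 + 2": "4", "capital of France": "paris"}.get(question, "unknown")

def model_v2(question):
    return {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}.get(question, "unknown")

# One number per version, computed on the same data, so it can be tracked over time.
scores = {
    name: exact_match_accuracy(model, DATASET)
    for name, model in [("v1", model_v1), ("v2", model_v2)]
}
print(scores)
```

Keeping the dataset fixed is what makes the scores comparable: a change in the number reflects a change in the model, not in the questions.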

Differential · 7 findings · 7 versions

Differential testing

Run the same input across models or versions and treat divergence as a signal worth investigating.

1 High · 5 Medium · 1 Low

Established · Evals · Benchmarks
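A minimal sketch of differential testing, assuming each model or version can be wrapped as a function from input text to output text. The `differential_test` helper and the two stand-in versions are hypothetical; in practice the comparison would likely need a looser notion of equality than exact string match.

```python
def differential_test(inputs, models, normalize=str.strip):
    """Return the inputs on which the models disagree, with each model's output."""
    divergences = []
    for text in inputs:
        outputs = {name: normalize(fn(text)) for name, fn in models.items()}
        # More than one distinct output means the models diverged on this input.
        if len(set(outputs.values())) > 1:
            divergences.append((text, outputs))
    return divergences

# Stand-ins for two versions of a system (not real models).
def version_a(text):
    return text.upper()

def version_b(text):
    # Behaves like version_a except on non-ASCII input.
    return text.upper() if text.isascii() else text

found = differential_test(
    ["hello", "café", "ok "],
    {"a": version_a, "b": version_b},
)
for text, outputs in found:
    print(repr(text), outputs)
```

A divergence is only a signal: it shows where the versions differ, not which one is right, so each flagged input still needs a human or an oracle to judge it.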