Testing & Findings
Testing methods
How to find the limits of AI systems. Each method is backed by the real findings it has surfaced, and methods are ranked by how much they surface, weighted by severity. The durable knowledge is the technique, not the patched-away bug.
Most productive methods
- 1. Differential testing · 7 findings · 7 versions
- 2. Prompt-injection & jailbreak testing · 4 findings · 4 versions
- 3. Boundary & edge-case testing · 7 findings · 6 versions
Ranked by a severity-weighted yield score.
2 methods
Boundary · 7 findings · 6 versions
Boundary & edge-case testing
Push inputs to their limits (very long contexts, token boundaries, empty or extreme values), where behavior tends to degrade sharply.
3 Medium · 4 Low
Established · Evals · Context window
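A boundary probe of the kind described in this card might be sketched as below. This is a minimal illustration, not the method used to produce the findings above: `boundary_cases`, `run_boundary_probe`, and `toy_model` are hypothetical names, the model is assumed to be any callable from prompt string to output string, and character length stands in for token count.

```python
from typing import Callable, Dict, List


def boundary_cases(limit: int) -> Dict[str, str]:
    """Build inputs at and around a length limit (characters as a stand-in for tokens)."""
    return {
        "empty": "",
        "whitespace": " ",
        "single_char": "a",
        "just_under": "a" * (limit - 1),
        "at_limit": "a" * limit,
        "just_over": "a" * (limit + 1),
    }


def run_boundary_probe(model: Callable[[str], str], limit: int) -> List[dict]:
    """Call the model on each boundary input and record success or the raised error."""
    results = []
    for name, text in boundary_cases(limit).items():
        try:
            output = model(text)
            results.append({"case": name, "length": len(text), "ok": True, "output": output})
        except Exception as exc:  # a probe should record failures, not stop on them
            results.append({"case": name, "length": len(text), "ok": False, "error": repr(exc)})
    return results


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real system: rejects prompts longer than 8 characters."""
    if len(prompt) > 8:
        raise ValueError("context too long")
    return prompt.upper()


if __name__ == "__main__":
    for row in run_boundary_probe(toy_model, limit=8):
        print(row["case"], row["length"], "ok" if row["ok"] else row["error"])
```

Against a real system, the interesting rows are usually the ones on either side of the limit: a sharp change in behavior between `at_limit` and `just_over` is the kind of degradation this method is meant to surface.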
Boundary · 1 finding · 4 versions
Needle-in-a-haystack (long-context retrieval)
Plant a specific fact at varying depths in a long context and test whether the model can retrieve it from each position.
1 Medium
Established · RAG · Context window
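The needle-in-a-haystack procedure above can be sketched roughly as follows. Everything here is illustrative: `build_haystack`, `needle_sweep`, and `toy_model` are hypothetical names, the model is assumed to be a callable from prompt to answer, and retrieval is scored with a simple substring match, which a real evaluation would likely replace with something stricter.

```python
from typing import Callable, Dict, Sequence


def build_haystack(needle: str, filler: str, n_sentences: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def needle_sweep(
    model: Callable[[str], str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float],
    n_sentences: int = 200,
    filler: str = "The sky was grey that morning.",
) -> Dict[float, bool]:
    """Return, for each depth, whether the model's answer contains the expected string."""
    results = {}
    for depth in depths:
        context = build_haystack(needle, filler, n_sentences, depth)
        answer = model(f"{context}\n\nQuestion: {question}")
        results[depth] = expected.lower() in answer.lower()
    return results


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in that only 'sees' the last 1500 characters of its prompt."""
    visible = prompt[-1500:]
    return "7421" if "7421" in visible else "I don't know."


if __name__ == "__main__":
    outcome = needle_sweep(
        toy_model,
        needle="The secret code is 7421.",
        question="What is the secret code?",
        expected="7421",
        depths=[0.0, 0.5, 1.0],
    )
    for depth, found in outcome.items():
        print(f"depth {depth:.1f}: {'retrieved' if found else 'missed'}")
```

The toy model is deliberately lossy so the sweep has something to detect: it retrieves the needle only when it sits near the end of the context. With a real model, the same sweep would typically be repeated across several context lengths to show where retrieval starts to fail.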