Testing & Findings
Testing methods
How to find the limits of AI systems. Each method is backed by the real findings it has surfaced and is ranked by how many findings it surfaces, weighted by severity. The durable knowledge is the technique, not the patched-away bug.
Most productive methods
- 1. Differential testing (7 findings · 7 versions)
- 2. Prompt-injection & jailbreak testing (4 findings · 4 versions)
- 3. Boundary & edge-case testing (7 findings · 6 versions)
Ranked by a severity-weighted yield score.
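The page does not state how the yield score is computed, so the following is only a sketch of one plausible form: a weighted sum of findings per severity level. The weights (High = 3, Medium = 2, Low = 1) and the names `SEVERITY_WEIGHTS` and `yield_score` are assumptions made for illustration; the per-severity counts are the ones shown on the method cards on this page.

```python
# Assumed weights: the page does not publish the real ones.
SEVERITY_WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}


def yield_score(counts: dict[str, int]) -> int:
    """Sum findings per severity level, each multiplied by its assumed weight."""
    return sum(SEVERITY_WEIGHTS[severity] * n for severity, n in counts.items())


# Severity counts as listed on the method cards.
methods = {
    "Factual oracle verification": {"High": 3, "Medium": 1, "Low": 1},
    "Hallucination triggering": {"High": 1, "Low": 1},
    "Model-graded evaluation (LLM-as-judge)": {"High": 1},
}

# Highest score first.
ranked = sorted(methods, key=lambda name: yield_score(methods[name]), reverse=True)

for name in ranked:
    print(f"{yield_score(methods[name]):>3}  {name}")
```

A weighted sum explains how a method with fewer findings can outrank one with more: a few high-severity findings can outweigh many low-severity ones. The real score may also account for the number of versions affected, which this sketch ignores.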
3 methods
Oracle · 5 findings · 7 versions
Factual oracle verification
Check generated claims against a trusted ground-truth source to catch hallucinations and fabricated citations.
3 High · 1 Medium · 1 Low
Established · Hallucination · Evals
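A minimal sketch of factual oracle verification, assuming claims have already been extracted from the model output as (subject, attribute, value) triples and that the trusted source fits in a dictionary. The names `Claim`, `Verdict`, `check_claim`, and `find_fabricated_citations` are hypothetical, as are the citation identifiers; a real setup would query a database or reference corpus instead.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    UNVERIFIABLE = "unverifiable"  # the oracle has no entry for this claim


@dataclass(frozen=True)
class Claim:
    subject: str    # what the claim is about
    attribute: str  # which property is asserted
    value: str      # the asserted value, as extracted from model output


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are not counted as errors."""
    return " ".join(text.lower().split())


def check_claim(claim: Claim, oracle: dict[tuple[str, str], str]) -> Verdict:
    """Compare one claim against the oracle. Oracle keys must already be normalized."""
    key = (normalize(claim.subject), normalize(claim.attribute))
    if key not in oracle:
        return Verdict.UNVERIFIABLE
    if normalize(oracle[key]) == normalize(claim.value):
        return Verdict.SUPPORTED
    return Verdict.CONTRADICTED


def find_fabricated_citations(cited_ids: list[str], known_ids: set[str]) -> list[str]:
    """Return cited identifiers that do not exist in the trusted reference index."""
    return [cid for cid in cited_ids if cid not in known_ids]


# Tiny illustrative oracle; keys are (subject, attribute), already normalized.
ORACLE = {
    ("water", "chemical formula"): "H2O",
    ("python", "first release year"): "1991",
}

claims = [
    Claim("Water", "chemical formula", "h2o"),
    Claim("Python", "first release year", "1989"),
    Claim("Zig", "first release year", "2016"),
]
for claim in claims:
    print(claim.subject, "->", check_claim(claim, ORACLE).value)

print(find_fabricated_citations(["ref-001", "ref-999"], {"ref-001", "ref-002"}))
```

Keeping "unverifiable" separate from "contradicted" matters: a claim the oracle cannot check is not evidence of a hallucination, only a gap in coverage.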
Adversarial · 2 findings · 6 versions
Hallucination triggering
Deliberately steer the model toward fabrication, for example by asking about non-existent entities or topics beyond its knowledge, to map where it invents instead of declining.
1 High · 1 Low
Established · Hallucination · Robustness
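One way to sketch hallucination triggering: ask about entities that are known to be invented, then measure how often the model answers instead of declining. Everything here is an assumption for illustration: the model is any callable from prompt to response, the entity names are made up, and the keyword-based `declined` check is a crude stand-in for a proper refusal classifier.

```python
from typing import Callable

# Crude, assumed list of phrases that signal the model declined to answer.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "not aware of",
    "no record of",
    "cannot find",
    "does not exist",
    "doesn't exist",
)


def build_probes(fake_entities: list[str]) -> list[str]:
    """Turn invented entity names into prompts that presuppose the entity is real."""
    return [f"Summarize the main findings of {name}." for name in fake_entities]


def declined(response: str) -> bool:
    """True if the response contains any refusal marker (case-insensitive)."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def fabrication_rate(model: Callable[[str], str], probes: list[str]) -> float:
    """Fraction of probes answered with content rather than a refusal."""
    if not probes:
        return 0.0
    fabricated = sum(1 for probe in probes if not declined(model(probe)))
    return fabricated / len(probes)


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: declines one probe and fabricates on the other."""
    if "Zorblatt" in prompt:
        return "I'm not aware of any report by that name."
    return "The theorem shows that every bounded lattice admits a unique closure."


# Both names are invented for this example.
probes = build_probes(["the 2031 Zorblatt Report", "the Quennell-Abara theorem"])
print(fabrication_rate(stub_model, probes))
```

Because every probe targets something that does not exist, any substantive answer counts as a fabrication, which is what makes the rate easy to interpret without a ground-truth lookup.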
Oracle · 1 finding · 5 versions
Model-graded evaluation (LLM-as-judge)
Use a strong model as an approximate oracle — grading, comparing, or fact-checking another model's output where no cheap ground-truth label exists.
1 High
Emerging · Hallucination · Evals
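A minimal sketch of model-graded evaluation, assuming the judge is any callable from prompt to reply text. The prompt template, the 1 to 5 scale, the `SCORE: n` reply format, and the names `parse_score` and `grade` are all hypothetical choices for this example, not a prescribed protocol.

```python
import re
from typing import Callable, Optional

# Assumed grading prompt; the wording and scale are illustrative only.
JUDGE_TEMPLATE = """You are grading an answer for factual accuracy.
Question: {question}
Answer: {answer}
Reply with a line of the form 'SCORE: n' where n is an integer from 1 to 5."""


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract an integer score in 1..5 from the reply, or None if it is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def grade(
    judge: Callable[[str], str],
    question: str,
    answer: str,
    samples: int = 3,
) -> Optional[float]:
    """Query the judge several times and average the parseable scores.

    Returns None when no reply could be parsed, so that callers do not
    mistake a broken judge for a low score.
    """
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    replies = (judge(prompt) for _ in range(samples))
    scores = [s for s in map(parse_score, replies) if s is not None]
    return sum(scores) / len(scores) if scores else None


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge model; always returns the same score."""
    return "The answer is mostly accurate.\nSCORE: 4"


print(grade(stub_judge, "What is the chemical formula of water?", "H2O"))
```

Since the judge is only an approximate oracle, its scores are worth spot-checking against a small human-labeled set before relying on them.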