How to read the catalog
Our approach
We are not collecting AI bugs as trophies. We are studying how to put AI to the test — and the findings are the evidence that the methods work.
AI moves fast. A finding that holds today can be patched next week, and a sharp reference can read as dated within a season. If the goal were a catalog of current bugs, that churn would be fatal. It is not our goal.
The durable thing — the part worth citing a year from now — is the method: the repeatable technique that surfaces a class of failure no matter which model you point it at. The bug is how we prove the method works. Read this site that way and the pieces fall into place.
The method is the product. The finding is the proof.
A bug is an artifact of one model version at one moment — it gets patched, and the trophy loses its shine. The method that surfaced it is durable, transferable knowledge: it still works on the next model, the next version, the next vendor. So methods get first-class pages here, and every finding points back to the method that found it.
Findings and references are time-stamped evidence — by design.
We date everything and pin every finding to specific model versions. When a reference looks a little old, that is not rot we are hiding; it is the record doing its job. A claim about AI is a claim about a particular system at a particular time. We would rather show you a dated, honest snapshot than pretend any result is evergreen.
Models change; the knowledge about how to test them compounds.
Capabilities improve, but fixes come undone and old failure modes resurface in new releases. A result you trusted six months ago may simply be wrong now — in either direction. So we pin every finding to the model versions where it was seen, and where it stopped reproducing. The finding stays; the version context tells you how far to trust it today.
A real harness blends novel and established methods.
The flashy adversarial prompt gets the headlines, but a serious testing harness leans just as hard on the boring, reliable disciplines — regression suites, differential and metamorphic testing, property-based checks. Established methods earn their place precisely because models regress. We give them equal billing, not an afterthought tacked on behind the jailbreaks.
What this means for you
- Every finding is dated and versioned. You will see when it was observed, the model versions it affects, and the version where it stopped reproducing. Treat it as a snapshot, not a standing claim.
- A “fixed” finding is not a dead one. It stays as evidence that the method surfaced a real failure — and that the same class of issue can return in a later version. We study the method, not the ticket; the finding never closes.
- Older references still earn their place. We keep, date, and archive them because the technique or insight outlives the specific result — and because watching a claim age is itself evidence about how the field moves.
- We will not pretend results are evergreen. Where something is uncertain, contested, or likely to have shifted, we say so. Honest and dated beats confident and stale.
Why we are vocal about this
Plenty of AI content is written as if each result were a permanent truth. In a field that re-bases itself every few months, that ages badly and quietly misleads. We would rather state the obvious out loud: this is a study of testing methods, backed by time-stamped evidence, maintained with the assumption that models will keep changing under us. That stance is the whole point — so we are putting it on its own page.