Qlarify Labs
AI Testing Catalog
An evidence-based catalog of AI/LLM limitations, weaknesses, and bugs — and the testing methods that find them. Methods are the hero; findings are the proof. Models change fast, so we date everything and study what keeps holding up.
Most productive methods
All methods →
- 1. Differential testing (7 findings · 7 versions)
- 2. Prompt-injection & jailbreak testing (4 findings · 4 versions)
- 3. Boundary & edge-case testing (7 findings · 6 versions)
- 4. Property-based testing (6 findings · 7 versions)
- 5. Factual oracle verification (5 findings · 7 versions)
Ranked by a severity-weighted yield score. Why we measure this →
Latest from the library
All references →
- Blog: Principles of Chaos Engineering
- Paper, November 3, 2025: Metamorphic Testing of Large Language Models for Natural Language Processing
- Blog, April 29, 2025: Sycophancy in GPT-4o: What Happened and What We're Doing About It
- Paper, February 20, 2025: Hallucination Detection in Large Language Models with Metamorphic Relations
- Paper, January 22, 2025: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
In the catalog
Methods
How to find the limits of AI systems — the durable knowledge, ranked by what each method actually surfaces.
Findings
Time-stamped evidence that the methods work: verified limitations, weaknesses and bugs, anchored to model versions.
Models
Profiles of the models we track — known weaknesses, linked findings, and a report card per model.
Library
A curated feed of the best external writing on AI testing and quality, with our own summaries and credibility ratings.
Topics
Editorial primers on the recurring themes — hallucination, jailbreaking, evals, drift — each assembled from the evidence.
Approach
How to read all of this: why methods are the hero, why everything is time-stamped, and why a finding never retires.