Qlarify Labs

AI Testing Catalog

An evidence-based catalog of AI/LLM limitations, weaknesses, and bugs — and the testing methods that find them. Methods are the hero; findings are the proof. Models change fast, so we date everything and study what keeps holding up.

Methods

Verified bugs

References

Models tracked

Versions

Providers

Most productive methods

All methods →

1. Differential testing: 7 findings · 7 versions
2. Prompt-injection & jailbreak testing: 4 findings · 4 versions
3. Boundary & edge-case testing: 7 findings · 6 versions
4. Property-based testing: 6 findings · 7 versions
5. Factual oracle verification: 5 findings · 7 versions

Ranked by a severity-weighted yield score. Why we measure this →

Latest from the library

All references →

In the catalog

Methods

How to find the limits of AI systems — the durable knowledge, ranked by what each method actually surfaces.

Findings

Time-stamped evidence that the methods work: verified limitations, weaknesses and bugs, anchored to model versions.

Models

Profiles of the models we track — known weaknesses, linked findings, and a report card per model.

Library

A curated feed of the best external writing on AI testing and quality, with our own summaries and credibility ratings.

Topics

Editorial primers on the recurring themes — hallucination, jailbreaking, evals, drift — each assembled from the evidence.

Approach

How to read all of this: why methods are the hero, why everything is time-stamped, and why a finding never retires.

Latest from the library

Principles of Chaos Engineering

Metamorphic Testing of Large Language Models for Natural Language Processing

Sycophancy in GPT-4o: What Happened and What We're Doing About It

Hallucination Detection in Large Language Models with Metamorphic Relations

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
