Evals & benchmarks

Evals are how the field measures progress, and they are easy to fool, including by accident. Benchmark data leaks into training sets (contamination), scores saturate while real-world capability lags, and optimizing directly for a metric erodes its value as a measure (Goodhart's law). A reported score is a claim about performance on one dataset, not about your use case. The linked methods cover how to evaluate probabilistic systems honestly; the findings document cases where headline numbers and observed behavior came apart.
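
As a rough illustration of the contamination problem described above, the sketch below flags benchmark items that share a long word n-gram with a training corpus. It is a minimal example, not a method taken from this page: the function names `ngrams` and `contamination_rate`, the default window of 8 words, and the exact-match criterion are all assumptions chosen for illustration.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Texts shorter than n words yield an empty set and can never be flagged.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Return (rate, flagged): the fraction of benchmark items that share
    at least one word n-gram with any training document, plus those items.

    Hypothetical helper: n=8 is an arbitrary illustrative window size.
    """
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    rate = len(flagged) / len(benchmark_items) if benchmark_items else 0.0
    return rate, flagged
```

Exact n-gram overlap only catches verbatim leaks. Paraphrased or translated benchmark items pass this check untouched, so a rate of zero is weak evidence of a clean benchmark, while a nonzero rate is a strong reason to distrust the score.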

Findings (8)

Methods

🔬 A/B testing in production
🔬 Adversarial prompting
🔬 Benchmark evaluation
🔬 Bias auditing
🔬 Boundary & edge-case testing
🔬 Chain-of-thought faithfulness probing
🔬 Counterfactual bias probing
🔬 Differential testing
🔬 Distributional testing (KS test, Monte Carlo)
🔬 Factual oracle verification
🔬 Glitch-token & unicode fuzzing
🔬 Logic & consistency testing
🔬 Metamorphic testing
🔬 Model-graded evaluation (LLM-as-judge)
🔬 Property-based testing
🔬 Self-consistency probing
🔬 Smoke testing in CI/CD

Cite this

Qlarify Labs. (2026). Evals & benchmarks. Retrieved from https://labs.qlarify.fi/topics/evals-and-benchmarks