Evals & benchmarks
Evals are how the field measures progress, and they are easy to fool — including by accident. Benchmark data leaks into training sets (contamination), scores saturate while real-world capability lags, and optimizing for a metric corrupts the metric. A reported score is a claim about a dataset, not about your use case. The linked methods cover how to evaluate probabilistic systems honestly; the findings document where headline numbers and observed behavior came apart.
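Contamination can be checked roughly by looking for long word n-grams that a benchmark item shares with training text. The sketch below is an illustration under stated assumptions, not a method taken from the linked pages: `ngrams` and `contamination_rate` are hypothetical helper names, tokenization is plain whitespace splitting, and the default window of 8 words is an arbitrary choice.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    benchmark_items: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,  # assumption: window size chosen for illustration only
) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)

    items = list(benchmark_items)
    if not items:
        return 0.0

    # An item is flagged when any of its n-grams also appears in the corpus.
    flagged = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return flagged / len(items)


if __name__ == "__main__":
    benchmark = [
        "what is the capital of france",
        "name the largest planet in the solar system",
    ]
    corpus = ["quiz: what is the capital of france answer paris"]
    print(contamination_rate(benchmark, corpus, n=4))
```

A nonzero rate only shows verbatim overlap. Paraphrased or translated leaks pass this check unnoticed, so a low rate is weak evidence of a clean benchmark.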
Findings (8)
- Anomalous behavior on glitch tokens (Other, Low)
- Inconsistent answers to semantically equivalent prompts (Reasoning, Medium)
- Model behavior drifts between versions: a fixed task can regress (Reasoning, Medium)
- Poor uncertainty calibration / overconfidence (Hallucination, Medium)
- Repetition and degeneration loops (Other, Low)
- Sycophancy: agreeing with a user's incorrect assertions (Bias, Medium)
- Unfaithful chain-of-thought reasoning (Other, Medium)
- Vendor cautions its reasoning model's chain-of-thought may be unfaithful (Other, Medium)
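The calibration finding above can be made measurable with expected calibration error (ECE), which compares stated confidence with observed accuracy inside confidence bins. This is a minimal sketch, assuming each answer comes with a confidence in [0, 1] and a correctness label; the function name and the choice of 10 equal-width bins are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins, count chosen for illustration
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Four answers at 90% stated confidence, only half of them right.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

A model that claims 90% confidence and is right half the time shows a gap of about 0.4 in that bin. An ECE near zero means stated confidence tracks accuracy on this dataset, which says nothing about other datasets.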
Methods
- A/B testing in production
- Adversarial prompting
- Benchmark evaluation
- Bias auditing
- Boundary & edge-case testing
- Chain-of-thought faithfulness probing
- Counterfactual bias probing
- Differential testing
- Distributional testing (KS test, Monte Carlo)
- Factual oracle verification
- Glitch-token & unicode fuzzing
- Logic & consistency testing
- Metamorphic testing
- Model-graded evaluation (LLM-as-judge)
- Property-based testing
- Self-consistency probing
- Smoke testing in CI/CD
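As one illustration of the methods listed above, a metamorphic test checks a relation between outputs instead of a single expected answer: semantically equivalent prompts should yield the same normalized answer. The sketch below assumes the model is exposed as a plain callable from prompt to text; `consistency_rate` and `toy_model` are hypothetical names, and the toy model stands in for a real API call.

```python
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(model: Callable[[str], str], prompts: Sequence[str]) -> float:
    """Share of paraphrased prompts whose normalized answer matches the majority."""
    answers = [model(p).strip().lower() for p in prompts]
    if not answers:
        return 1.0
    # Count of the most frequent answer across all paraphrases.
    majority_count = Counter(answers).most_common(1)[0][1]
    return majority_count / len(answers)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call; answers by surface pattern only."""
    return "4" if "2 + 2" in prompt else "5"


if __name__ == "__main__":
    paraphrases = [
        "What is 2 + 2?",
        "Compute 2 + 2.",
        "Two plus two equals?",
    ]
    print(consistency_rate(toy_model, paraphrases))
```

The metric needs no ground-truth label, which is the appeal of metamorphic relations for probabilistic systems. It also cannot tell a consistently wrong model from a consistently right one, so it complements an accuracy check and does not replace it.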
References
- Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark — AAAI 2024
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs — arXiv
- Counting Ability of Large Language Models and Impact of Tokenization — arXiv
- Dated Data: Tracing Knowledge Cutoffs in Large Language Models — arXiv
- Faith and Fate: Limits of Transformers on Compositionality — arXiv
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models — arXiv
- Lost in the Middle: How Language Models Use Long Contexts — arXiv
- Measuring Faithfulness in Chain-of-Thought Reasoning — arXiv
- Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation — arXiv
- Survey of Hallucination in Natural Language Generation — ACM Computing Surveys
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning — arXiv
- The Curious Case of Neural Text Degeneration — arXiv
- The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
- Towards Understanding Sycophancy in Language Models — arXiv
- TruthfulQA: Measuring How Models Mimic Human Falsehoods — arXiv
- Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models — arXiv
Cite this
Qlarify Labs. (2026). Evals & benchmarks. Retrieved from https://labs.qlarify.fi/topics/evals-and-benchmarks