Paper · High credibility · arXiv:2211.09110 · Percy Liang, Rishi Bommasani, Tony Lee, et al. · November 16, 2022

Holistic Evaluation of Language Models

Our summary

A benchmark that scores language models across a taxonomy of scenarios on seven metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) to make model quality transparent and comparable.

Why it matters

The exemplar of benchmark evaluation as a standing, comparable measurement rather than a one-off score: a longitudinal lens for tracking models across versions.

Cited by these methods

🔬 Benchmark evaluation

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Holistic Evaluation of Language Models. Retrieved from https://labs.qlarify.fi/references/helm-2022