Paper · High credibility · arXiv:2211.09110 · Percy Liang, Rishi Bommasani, Tony Lee, et al. · November 16, 2022
Holistic Evaluation of Language Models
Our summary
A benchmark that scores language models across a taxonomy of scenarios on seven metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) to make model quality transparent and comparable.
Why it matters
The exemplar of benchmark evaluation as a standing, comparable measurement rather than a one-off score, which makes it the longitudinal lens for tracking models across versions.
Cited by these methods
Published June 26, 2026
Cite this
Qlarify Labs. (2026). Holistic Evaluation of Language Models. Retrieved from https://labs.qlarify.fi/references/helm-2022