Paper · High credibility · NeurIPS 2023 (arXiv:2306.05685) · Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. · June 9, 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Our summary

Shows that a strong LLM judge such as GPT-4 can grade open-ended responses in agreement with human preferences over 80% of the time, about the same level of agreement as between human raters, while documenting the judge's own position, verbosity, and self-enhancement biases.
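
To make the two ideas in this summary concrete, here is a minimal Python sketch of (a) the agreement rate between judge and human verdicts and (b) a position-swap check that counts a win only when the judge prefers the same answer in both orderings. All names (`agreement_rate`, `position_consistent_verdict`, `judge_fn`) are hypothetical and not taken from the paper's code. `judge_fn` stands in for whatever call returns a pairwise verdict of "A", "B", or "tie". Treating an inconsistent pair as a tie is one conservative convention, assumed here for illustration.

```python
from typing import Callable, Sequence

# A pairwise verdict: "A" (first answer wins), "B" (second wins), or "tie".
Verdict = str
# Hypothetical judge signature: (question, first_answer, second_answer) -> Verdict
JudgeFn = Callable[[str, str, str], Verdict]


def agreement_rate(judge: Sequence[Verdict], human: Sequence[Verdict]) -> float:
    """Fraction of comparisons where the judge and the human gave the same verdict."""
    if len(judge) != len(human):
        raise ValueError("judge and human verdict lists must have the same length")
    if not judge:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def swap_verdict(verdict: Verdict) -> Verdict:
    """Map a verdict from the swapped ordering back to the original ordering."""
    return {"A": "B", "B": "A"}.get(verdict, verdict)


def position_consistent_verdict(
    judge_fn: JudgeFn, question: str, answer_1: str, answer_2: str
) -> Verdict:
    """Query the judge in both orderings; keep the verdict only if both agree.

    Assumption: an inconsistent pair is reported as "tie".
    """
    first = judge_fn(question, answer_1, answer_2)
    second = swap_verdict(judge_fn(question, answer_2, answer_1))
    return first if first == second else "tie"


if __name__ == "__main__":
    # Toy judge with pure position bias: always prefers whichever answer is shown first.
    def always_first(question: str, a: str, b: str) -> Verdict:
        return "A"

    print(position_consistent_verdict(always_first, "q", "x", "y"))  # tie
    print(agreement_rate(["A", "B", "tie", "A"], ["A", "B", "A", "A"]))  # 0.75
```

A judge that always prefers the first-shown answer collapses to "tie" under the swap check, which is how the check exposes position bias. It does not address verbosity or self-enhancement bias.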

Why it matters

The foundational study for model-graded evaluation — and an honest catalog of why the judge approximates an oracle rather than being one.

Cited by these methods

🔬 Model-graded evaluation (LLM-as-judge)

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Retrieved from https://labs.qlarify.fi/references/llm-as-judge-mt-bench-2023