Paper · High credibility · NeurIPS 2023 (arXiv:2306.05685) · Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. · June 9, 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Our summary
Shows that a strong LLM judge such as GPT-4 can grade open-ended outputs in agreement with human preferences over 80% of the time, roughly the level at which humans agree with each other, while documenting the judge's own position, verbosity, and self-enhancement biases.
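To make the two headline quantities concrete, here is a minimal sketch of how judge-human agreement and a position-bias check could be computed over pairwise verdicts. It is a simplified illustration, not the paper's exact evaluation protocol: the function names, the "A"/"B"/"tie" verdict labels, and the toy data are all assumptions made for this example.

```python
from typing import Sequence

# Hypothetical verdict labels: "A" (first answer wins), "B" (second wins), "tie".


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of pairwise comparisons where judge and human give the same verdict."""
    if len(judge) != len(human) or not judge:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def unswap(verdict: str) -> str:
    """Map a verdict given on a reversed (B, A) presentation back to the original order."""
    return {"A": "B", "B": "A"}.get(verdict, verdict)


def position_consistency(original: Sequence[str], swapped: Sequence[str]) -> float:
    """Fraction of pairs where the judge prefers the same answer in both orderings.

    `swapped` holds the verdicts obtained when the same two answers are shown
    in reversed order. A judge with no position bias would score 1.0.
    """
    if len(original) != len(swapped) or not original:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(o == unswap(s) for o, s in zip(original, swapped)) / len(original)


# Toy data, invented for illustration only.
judge_verdicts = ["A", "B", "A", "tie", "B"]
human_verdicts = ["A", "B", "B", "tie", "B"]
judge_swapped = ["B", "A", "A", "tie", "A"]  # third pair: judge picks the first slot again

print(agreement_rate(judge_verdicts, human_verdicts))       # 0.8
print(position_consistency(judge_verdicts, judge_swapped))  # 0.8
```

In the toy data the third comparison is the inconsistent one: the judge favors whichever answer appears first, which is the pattern a position-swap check is meant to surface.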
Why it matters
The foundational study for model-graded evaluation, and an honest catalog of why the judge approximates an oracle rather than being one.
Cited by these methods
Published June 26, 2026
Cite this
Qlarify Labs. (2026). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Retrieved from https://labs.qlarify.fi/references/llm-as-judge-mt-bench-2023