Model-graded evaluation (LLM-as-judge)
Use a strong model as an approximate oracle — grading, comparing, or fact-checking another model's output where no cheap ground-truth label exists.
Published June 26, 2026
How it works
When outputs are open-ended — summaries, explanations, chat — there is no answer key, so a capable 'judge' model stands in as an oracle: scoring an answer, picking the better of two, or flagging claims that aren't supported. It scales evaluation into places where human labels are too expensive, at the cost of inheriting the judge's own biases. Treated as a screening signal rather than ground truth, it is one of the few ways to evaluate generation at volume.
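As a minimal sketch of one of the three modes above, picking the better of two answers, the snippet below wraps a judge behind a plain callable. The `Judge` type, the prompt template, and `stub_judge` are all hypothetical and are not taken from any particular library. The stub stands in for a real model call so the example runs offline, and it deliberately prefers the longer answer to show how a judge's bias passes straight into the verdict.

```python
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns the judge model's raw reply.
Judge = Callable[[str], str]

# Illustrative template (an assumption, not a recommended prompt).
PAIRWISE_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Answer A:\n{a}\n\n"
    "Answer B:\n{b}\n\n"
    "Which answer is better? Reply with exactly 'A', 'B', or 'TIE'."
)


def compare(judge: Judge, question: str, a: str, b: str) -> str:
    """Ask the judge to pick the better answer; return 'A', 'B', or 'TIE'."""
    reply = judge(PAIRWISE_TEMPLATE.format(question=question, a=a, b=b))
    verdict = reply.strip().upper()
    # An unparseable reply is recorded as a tie rather than guessed.
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call; always prefers the longer answer."""
    a = prompt.split("Answer A:\n")[1].split("\n\nAnswer B:")[0]
    b = prompt.split("Answer B:\n")[1].split("\n\nWhich answer")[0]
    if len(a) == len(b):
        return "TIE"
    return "A" if len(a) > len(b) else "B"


verdict = compare(stub_judge, "What is 2 + 2?", "4", "The answer is probably 5.")
```

Here `verdict` is `"B"`: the stub rewards length, not correctness, which is why the result is best read as a screening signal.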
When to use it
Open-ended generation with no automatic metric; large-scale preference and faithfulness scoring; a first-pass hallucination flag before human review.
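The first-pass flag can be sketched as a simple triage step. This assumes the judge has already labelled each extracted claim as supported or unsupported. The `Claim` type and `triage` function are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported: bool  # hypothetical judge verdict: is the claim backed by the source?


def triage(claims_by_output: dict[str, list[Claim]]) -> tuple[list[str], list[str]]:
    """Split output IDs into (needs_human_review, passed_screen)."""
    review: list[str] = []
    passed: list[str] = []
    for output_id, claims in claims_by_output.items():
        # One unsupported claim is enough to send the whole output to a human.
        if any(not claim.supported for claim in claims):
            review.append(output_id)
        else:
            passed.append(output_id)
    return review, passed
```

Outputs that pass the screen are not verified, only unflagged, so sampling some of them for human review remains a sensible check on the judge.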
Limitations
The judge is itself an LLM — miscalibrated, gameable, and biased toward longer, more confident, or self-similar answers. Calibrate it against human labels and never treat its verdict as the oracle it only approximates.
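One way to calibrate against human labels is to measure chance-corrected agreement on a sample that both the judge and human annotators have labelled. The sketch below computes Cohen's kappa in plain Python. The label values and sample data are made up for illustration, and kappa is only one of several reasonable agreement measures.

```python
from collections import Counter


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between judge and human labels."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    # Fraction of items where the judge and the human agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[k] * human_freq[k] for k in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (observed - expected) / (1 - expected)


# Illustrative data only: the judge and the human agree on 3 of 4 items.
kappa = cohens_kappa(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

In this toy sample raw agreement is 0.75 but kappa is 0.5, because half of the agreement would be expected by chance. A low kappa on a held-out human-labelled set suggests the judge's verdicts should not yet be used for screening.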
Method yield
- Findings: 1
- Versions spanned: 5
- Yield score: 4
Severity-weighted across the published findings below.
Findings it surfaces (1)
Documented failures this method catches — the evidence it works.
Cite this
Qlarify Labs. (2026). Model-graded evaluation (LLM-as-judge). Retrieved from https://labs.qlarify.fi/methods/model-graded-evaluation