Oracle · Emerging

Model-graded evaluation (LLM-as-judge)

Use a strong model as an approximate oracle — grading, comparing, or fact-checking another model's output where no cheap ground-truth label exists.

Published June 26, 2026

Hallucination Evals

How it works

When outputs are open-ended — summaries, explanations, chat — there is no answer key, so a capable 'judge' model stands in as an oracle: scoring an answer, picking the better of two, or flagging claims that aren't supported. It scales evaluation into places where human labels are too expensive, at the cost of inheriting the judge's own biases. Treated as a screening signal rather than ground truth, it is one of the few ways to evaluate generation at volume.
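As a minimal sketch of the "picking the better of two" mode described above, the snippet below asks a judge to compare two answers. The `Judge` callable, the prompt template, and the A/B/TIE reply format are assumptions made for illustration and do not come from any particular library. The judge is queried in both orders and a winner is reported only when the two verdicts agree, which is one common way to dampen order effects.

```python
from typing import Callable

# Hypothetical judge interface: takes a prompt, returns the judge model's raw reply.
Judge = Callable[[str], str]

# Illustrative prompt; the wording is an assumption, not a recommended template.
PAIRWISE_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Answer A:\n{a}\n\n"
    "Answer B:\n{b}\n\n"
    "Which answer is better? Reply with exactly one of: A, B, TIE."
)


def parse_verdict(reply: str) -> str:
    """Map a raw reply to 'A', 'B', or 'TIE'; unparseable replies count as TIE."""
    token = reply.strip().upper()
    return token if token in {"A", "B", "TIE"} else "TIE"


def compare(judge: Judge, question: str, first: str, second: str) -> str:
    """Return 'first', 'second', or 'tie' after judging in both presentation orders."""
    forward = parse_verdict(
        judge(PAIRWISE_TEMPLATE.format(question=question, a=first, b=second))
    )
    # In the swapped call, label 'A' refers to `second` and 'B' to `first`.
    swapped = parse_verdict(
        judge(PAIRWISE_TEMPLATE.format(question=question, a=second, b=first))
    )
    if forward == "A" and swapped == "B":
        return "first"
    if forward == "B" and swapped == "A":
        return "second"
    # Disagreement between orders is treated as no reliable preference.
    return "tie"
```

A judge that always answers "A" regardless of content comes out as a tie under this scheme, so pure position preference does not turn into a spurious win.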

When to use it

Open-ended generation with no automatic metric; large-scale preference and faithfulness scoring; a first-pass hallucination flag before human review.

Limitations

The judge is itself an LLM — miscalibrated, gameable, and biased toward longer, more confident, or self-similar answers. Calibrate it against human labels and never treat its verdict as the oracle it only approximates.
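One way to make "calibrate it against human labels" concrete is to measure chance-corrected agreement on a sample that both the judge and human annotators have labelled. The sketch below computes Cohen's kappa in plain Python; the choice of kappa as the agreement statistic and the label values in the usage example are assumptions for illustration, not something the method prescribes.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge_labels: Sequence[Hashable], human_labels: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between judge and human labels on the same items."""
    if len(judge_labels) != len(human_labels) or len(judge_labels) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(judge_labels)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[k] * human_counts[k] for k in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge, human)
```

Raw agreement in this toy sample is 0.75, but half of that is expected by chance given the label frequencies, so kappa is lower. What threshold counts as "calibrated enough" is a judgment call that depends on the task.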

Method yield

Findings: 1
Versions spanned: 5
Yield score: 4

1 High

Severity-weighted across the published findings below.

Findings it surfaces (1)

Documented failures this method catches — the evidence it works.

Fabricated citations and references · High
Models invent plausible-looking but non-existent papers, authors, DOIs and URLs.
How it found it: A judge model cross-checks each cited source and flags the unsupported ones for human review.
Hallucination
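The cross-check described in this finding might be sketched as follows. The source does not say how the judge verifies a source, so everything here is an assumption: the `Citation` fields, the `lookup` callable (standing in for whatever bibliographic index is available), the prompt wording, and the SUPPORTED/UNSUPPORTED reply format are all hypothetical. Citations with no retrievable record, or whose record the judge does not confirm, are returned for human review rather than rejected automatically.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Citation:
    title: str
    authors: str
    identifier: str  # DOI, arXiv ID, or URL exactly as the model cited it


# Hypothetical interfaces: `Lookup` fetches a reference record (or None if
# nothing is found); `Judge` maps a prompt to the judge model's raw reply.
Lookup = Callable[[Citation], Optional[str]]
Judge = Callable[[str], str]

# Illustrative prompt; the wording is an assumption.
CHECK_TEMPLATE = (
    "Cited title: {title}\n"
    "Cited authors: {authors}\n"
    "Retrieved record: {record}\n\n"
    "Does the retrieved record describe the cited work? "
    "Reply with exactly SUPPORTED or UNSUPPORTED."
)


def flag_unsupported(citations: List[Citation], lookup: Lookup, judge: Judge) -> List[Citation]:
    """Return the citations that should go to human review."""
    flagged = []
    for citation in citations:
        record = lookup(citation)
        if record is None:
            # Nothing to compare against, so the citation cannot be confirmed.
            flagged.append(citation)
            continue
        prompt = CHECK_TEMPLATE.format(
            title=citation.title, authors=citation.authors, record=record
        )
        # Anything other than an explicit SUPPORTED is treated as a flag.
        if judge(prompt).strip().upper() != "SUPPORTED":
            flagged.append(citation)
    return flagged
```

Treating every reply other than an explicit SUPPORTED as a flag errs toward sending more items to reviewers, which fits the role of a first-pass screen.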

References & further reading

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. · NeurIPS 2023 (arXiv:2306.05685) · June 9, 2023

Cite this

Qlarify Labs. (2026). Model-graded evaluation (LLM-as-judge). Retrieved from https://labs.qlarify.fi/methods/model-graded-evaluation