Medium · Hallucination · Reviewer-confirmed · Published

Poor uncertainty calibration / overconfidence

Stated confidence does not track accuracy; models sound equally certain when right and wrong.

Published June 26, 2026

Reproducibility: Often
Severity: Medium
Confidence: Reviewer-confirmed

Details

Verbalized confidence is weakly correlated with correctness, and probabilities are often miscalibrated. Users cannot rely on tone or stated certainty to gauge trustworthiness.
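One common way to quantify this gap is expected calibration error (ECE). It groups answers by stated confidence and compares each group's average confidence with its actual accuracy. The sketch below is a minimal illustration, not the method used for this finding. It assumes confidences have already been parsed into the range 0 to 1, and the function name, bin count, and sample data are all hypothetical.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one sample")

    # Collect (confidence, correctness) pairs per bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError(f"confidence out of range: {conf}")
        # A confidence of exactly 1.0 belongs in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of samples it holds.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical data: the model states 0.95 every time but is right half the time.
stated = [0.95, 0.95, 0.95, 0.95]
was_correct = [True, False, True, False]
ece_value = expected_calibration_error(stated, was_correct)
```

An ECE near zero means stated confidence tracks accuracy. In the made-up data above, the value is roughly the gap between 95% stated confidence and 50% accuracy.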

Found with

🔬 Factual oracle verification

🔬 Self-consistency probing

High stated confidence on answers that vary across samples.
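A self-consistency probe of this kind can be sketched as follows: sample the same question several times, measure how often the answers agree, and flag cases where stated confidence far exceeds that agreement. Every name here (`probe_self_consistency`, `ConsistencyReport`, the `margin` threshold, the normalization rule) is an assumption made for illustration, and the sample answers are invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ConsistencyReport:
    majority_answer: str
    agreement: float               # share of samples matching the majority answer
    mean_stated_confidence: float
    overconfident: bool


def normalize(answer: str) -> str:
    """Crude normalization (an assumption): lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def probe_self_consistency(
    answers: Sequence[str],
    stated_confidences: Sequence[float],
    margin: float = 0.2,
) -> ConsistencyReport:
    """Flag a question when mean stated confidence exceeds answer agreement by `margin`."""
    if not answers or len(answers) != len(stated_confidences):
        raise ValueError("need one stated confidence per sampled answer")

    counts = Counter(normalize(a) for a in answers)
    majority, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(answers)
    mean_conf = sum(stated_confidences) / len(stated_confidences)
    return ConsistencyReport(
        majority_answer=majority,
        agreement=agreement,
        mean_stated_confidence=mean_conf,
        overconfident=(mean_conf - agreement) > margin,
    )


# Hypothetical resamples of one question, each with a near-certain stated confidence.
samples = ["Canberra", "Sydney", "canberra", "Melbourne", "Canberra "]
confidences = [0.99, 0.98, 0.99, 0.97, 0.99]
report = probe_self_consistency(samples, confidences)
```

Exact string matching after normalization is the simplest agreement measure. Free-form answers would likely need semantic comparison instead.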

🔬 Distributional testing (KS test, Monte Carlo)

Comparing the distribution of stated confidence against the distribution of correctness exposes the miscalibration.
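One possible reading of this test, offered as an assumption rather than the documented procedure: split stated confidences by whether the answer was correct, then compare the two groups with a two-sample Kolmogorov-Smirnov statistic and a Monte Carlo permutation p-value. The sketch below uses only the standard library. The function names, permutation count, seed, and data are hypothetical.

```python
import random
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    # Visit each distinct value; move both pointers past ties before comparing.
    while i < len(sa) and j < len(sb):
        x = min(sa[i], sb[j])
        while i < len(sa) and sa[i] <= x:
            i += 1
        while j < len(sb) and sb[j] <= x:
            j += 1
        d = max(d, abs(i / len(sa) - j / len(sb)))
    return d


def permutation_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of how often shuffled labels give a gap at least as large."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[: len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: the model is equally confident whether it is right or wrong.
conf_when_correct = [0.95, 0.97, 0.99, 0.96, 0.98]
conf_when_wrong = [0.96, 0.98, 0.95, 0.99, 0.97]
ks = ks_statistic(conf_when_correct, conf_when_wrong)
p_value = permutation_p_value(conf_when_correct, conf_when_wrong)
```

A small statistic with a large p-value suggests stated confidence does not separate right answers from wrong ones. Failing to find a difference is not proof that none exists, so small samples call for caution.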

Evidence

Model reports 'I'm certain' on an answer that is wrong and that changes on resampling.

Illustrative example — see the linked reference for the documented evidence.

Affected versions

Anthropic · claude-opus-4-8
OpenAI · gpt-4o
Google · gemini-2.0-flash
Meta · llama-3.3-70b

References

Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

Hallucination Evals

Source: https://arxiv.org/abs/2306.13063

Cite this

Qlarify Labs. (2026). Poor uncertainty calibration / overconfidence. Retrieved from https://labs.qlarify.fi/findings/overconfidence-calibration