Other · Emerging

Distributional testing (KS test, Monte Carlo)

Sample the model many times and test the distribution of its outputs — not any single answer — for drift, miscalibration, or instability.

Published June 26, 2026

Evals Drift

How it works

A non-deterministic system has to be judged statistically. By sampling repeatedly and treating the outputs as a distribution, you can ask sharper questions. Has this version's output distribution shifted from the last one (a two-sample Kolmogorov–Smirnov test)? Does stated confidence track actual accuracy (a calibration check)? How wide is the spread of a metric under resampling (Monte Carlo)? This turns stochasticity from a nuisance into a measurable property, which makes it the natural lens for stability, longevity, and decay.
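The version-shift and spread questions can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline used for the published findings: the score lists are synthetic, and the names `ks_statistic`, `permutation_p_value`, and `bootstrap_interval` are hypothetical helpers. The p-value is estimated by a Monte Carlo permutation test rather than the closed-form KS distribution, and the spread is a percentile bootstrap interval on the mean.

```python
import bisect
import random
import statistics


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect.bisect_right(a, x) / len(a) - bisect.bisect_right(b, x) / len(b))
        for x in a + b
    )


def permutation_p_value(a, b, n_perm=500, seed=0):
    """Monte Carlo p-value: how often does a random relabelling of the pooled
    samples give a KS statistic at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)


def bootstrap_interval(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile interval for the mean score under resampling with replacement."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic stand-in data (an assumption): one score per sampled output,
# drawn for two hypothetical model versions.
rng = random.Random(42)
scores_v1 = [rng.gauss(0.70, 0.10) for _ in range(200)]
scores_v2 = [rng.gauss(0.62, 0.10) for _ in range(200)]

d = ks_statistic(scores_v1, scores_v2)
p = permutation_p_value(scores_v1, scores_v2)
low, high = bootstrap_interval(scores_v2)
print(f"KS D={d:.3f}, permutation p={p:.4f}")
print(f"v2 mean score 95% interval: [{low:.3f}, {high:.3f}]")
```

A small p-value says only that the two score distributions differ. The bootstrap interval then shows how much of any single-run difference could be resampling noise.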

When to use it

Calibration studies; detecting distribution shift between versions; quantifying variance in a metric; any claim about stability rather than a single output.

Limitations

Needs enough samples to be meaningful, and a shifted distribution tells you behaviour changed, not whether it changed for the better. Choosing the right statistic and threshold is itself a modelling decision.
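To make the sample-size point concrete, the sketch below uses the standard large-sample approximation for the two-sample KS rejection threshold, D > c(alpha) * sqrt((n + m) / (n * m)) with c(alpha) = sqrt(-ln(alpha / 2) / 2). The function name `ks_critical_value` and the sample sizes are illustrative assumptions, and the approximation is rough for very small samples.

```python
import math


def ks_critical_value(n, m, alpha=0.05):
    """Approximate smallest KS statistic D that rejects 'same distribution'
    at level alpha, for samples of size n and m (large-sample formula)."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Hypothetical per-version sample counts.
for n in (20, 100, 500, 2000):
    print(f"n = m = {n:>4}: reject when D > {ks_critical_value(n, n):.3f}")
```

Under this approximation, 20 samples per version only flag a gap of roughly 0.43 between the empirical CDFs, while 2000 samples flag gaps near 0.04. The choice of alpha moves the threshold too, which is one place the modelling decision shows up.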

Method yield

Findings: 1
Versions spanned: 4
Yield score: 3

1 Medium

Severity-weighted across the published findings below.

Findings it surfaces (1)

Documented failures this method catches — the evidence it works.

Poor uncertainty calibration / overconfidence · Medium
Stated confidence does not track accuracy; models sound equally certain when right and wrong.
How it found it: Comparing the distribution of stated confidence against the distribution of correctness exposes the miscalibration.
Hallucination
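One common way to summarise the gap between stated confidence and correctness is expected calibration error (ECE). The source does not say which statistic was used for this finding, so the following is only a sketch of the general idea; the function name, the equal-width binning, and the toy data are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean stated confidence and accuracy,
    computed over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / total * abs(mean_conf - accuracy)
    return ece


# Toy data (hypothetical): the model states 0.9 confidence on every answer
# but is right only 6 times out of 10.
confidences = [0.9] * 10
correct = [True] * 6 + [False] * 4
print(round(expected_calibration_error(confidences, correct), 3))  # 0.3
```

A well-calibrated model would score near zero. Here the 0.3 gap reflects answers that sound equally certain whether right or wrong.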

References & further reading

Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift
Stephan Rabanser, Stephan Günnemann, Zachary C. Lipton · NeurIPS 2019 (arXiv:1810.11953) · October 29, 2018

Cite this

Qlarify Labs. (2026). Distributional testing (KS test, Monte Carlo). Retrieved from https://labs.qlarify.fi/methods/distributional-testing