Boundary · Emerging

Threshold testing

Walk inputs across a decision boundary — refusal, classification, confidence cutoff — to find exactly where the model's behaviour flips, and whether it flips in the right place.

Published June 26, 2026

Refusal Robustness

How it works

Many AI behaviours hinge on a threshold: refuse versus answer, flag versus allow, escalate versus handle. Threshold testing sweeps inputs from clearly on one side to clearly on the other, locates the transition, and then asks whether it sits where policy intends. It surfaces over-refusal (the boundary set too tight, blocking benign requests), under-refusal (the boundary set too loose, letting through requests that should be blocked), and the unstable middle band where small changes flip the verdict.
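A minimal sketch of such a sweep, assuming each test prompt has been given a reviewer-assigned sensitivity level between 0 (clearly benign) and 1 (clearly disallowed). The function names, the `refuses` callable standing in for the model under test, the majority-vote flip rule, and the 0.1/0.9 band limits are all illustrative assumptions rather than part of any published tooling.

```python
from typing import Callable, Optional, Sequence


def sweep(levels: Sequence[float],
          refuses: Callable[[float], bool],
          trials: int = 20) -> list[float]:
    """Return the refusal rate observed at each level of the sweep."""
    return [sum(refuses(x) for _ in range(trials)) / trials for x in levels]


def locate_boundary(levels: Sequence[float], rates: Sequence[float],
                    lo: float = 0.1, hi: float = 0.9):
    """Find the flip point and the unstable band from per-level refusal rates."""
    # Flip point: first level where refusal becomes the majority verdict.
    flip = next((x for x, r in zip(levels, rates) if r >= 0.5), None)
    # Unstable band: levels that are neither reliably allowed nor reliably refused.
    band = [x for x, r in zip(levels, rates) if lo < r < hi]
    return flip, band


def diagnose(flip: Optional[float], intended: float,
             tolerance: float = 0.05) -> str:
    """Compare the observed flip point with where policy intends it (assumed known)."""
    if flip is None:
        return "under-refusal"  # the model never refused anywhere in the sweep
    if flip < intended - tolerance:
        return "over-refusal"   # boundary too tight: flips before policy intends
    if flip > intended + tolerance:
        return "under-refusal"  # boundary too loose: flips after policy intends
    return "aligned"


# Hypothetical deterministic stand-in for a model: refuses at level 0.4 and above.
levels = [i / 10 for i in range(11)]
rates = sweep(levels, lambda x: x >= 0.4)
flip, band = locate_boundary(levels, rates)
print(flip, band, diagnose(flip, intended=0.6))  # 0.4 [] over-refusal
```

The stand-in model is deterministic, so its unstable band is empty. A model sampled with randomness would typically give fractional refusal rates near the boundary, and those levels would appear in `band`.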

When to use it

Use it when tuning or auditing safety filters, content classifiers, or any allow/deny or confidence cutoff, and when diagnosing over- or under-refusal.

Limitations

Boundaries shift between versions and have to be re-mapped after upgrades, and a single threshold can hide very different behaviour across different request types.
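One way to keep both limitations visible is to record a flip point per model version and per request type instead of a single global number. The sketch below assumes sweep results are stored as `(version, request type, rung, refused)` records, where `rung` is the prompt's position on the benign-to-disallowed ladder. The record layout, the function names, and the sample data are hypothetical.

```python
from collections import defaultdict

# (model version, request type, rung on the ladder, refused?)
Record = tuple[str, str, int, bool]


def flip_points(records: list[Record]) -> dict[tuple[str, str], int]:
    """Map (version, request type) to the lowest rung whose majority verdict is refusal.

    Pairs that never reach a majority refusal are left out of the result.
    """
    votes = defaultdict(list)
    for version, kind, rung, refused in records:
        votes[(version, kind, rung)].append(refused)
    flips = {}
    # Sorting the keys visits rungs in ascending order within each (version, kind).
    for (version, kind, rung), vs in sorted(votes.items()):
        if (version, kind) not in flips and sum(vs) * 2 >= len(vs):
            flips[(version, kind)] = rung
    return flips


def drift(flips: dict[tuple[str, str], int], old: str, new: str) -> dict[str, int]:
    """Per request type, how many rungs the flip point moved from `old` to `new`."""
    kinds = {k for v, k in flips if v == old} & {k for v, k in flips if v == new}
    return {k: flips[(new, k)] - flips[(old, k)] for k in sorted(kinds)}


# Made-up sample data for illustration only.
records = [
    ("v1", "medical", 3, False), ("v1", "medical", 4, True),
    ("v2", "medical", 3, True), ("v2", "medical", 4, True),
    ("v1", "chemistry", 3, True), ("v1", "chemistry", 4, True),
    ("v2", "chemistry", 3, True), ("v2", "chemistry", 4, True),
]
flips = flip_points(records)
print(drift(flips, "v1", "v2"))  # {'chemistry': 0, 'medical': -1}
```

In this made-up data the chemistry boundary is unchanged while the medical boundary tightens by one rung after the upgrade, a shift that an average over both request types would understate.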

Method yield

Findings: 1
Versions spanned: 4
Yield score: 3

1 Medium

Severity-weighted across the published findings below.

Findings it surfaces (1)

Documented failures this method catches — the evidence it works.

Over-refusal of benign requests · Medium
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
How it found it: Walking benign prompts across the refusal decision boundary maps exactly where harmless requests start getting blocked.
Refusal

References & further reading

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Röttger et al. · arXiv · August 1, 2023

Cite this

Qlarify Labs. (2026). Threshold testing. Retrieved from https://labs.qlarify.fi/methods/threshold-testing