Over-refusal of benign requests
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
Published June 26, 2026
- Reproducibility: Sometimes
- Severity: Medium
- Confidence: Reviewer-confirmed
Details
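Models sometimes refuse legitimate requests (security education, medical information, fiction) because surface features of the prompt pattern-match to disallowed content. This degrades usefulness and frustrates users. It is the flip side of jailbreak hardening: tuning a model to resist adversarial prompts also pushes it toward refusing harmless ones that look similar.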
Found with
Benign prompts near a policy boundary.
🔬 Differential testing: Refusal rates differ across versions/providers for identical benign prompts.
🔬 Threshold testing: Walking benign prompts across the refusal decision boundary maps where harmless requests start getting blocked.
🔬 A/B testing in production: Comparing refusal rates across prompt or model variants on live traffic quantifies which one over-refuses.
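The three testing approaches listed above all reduce to measuring refusals over a fixed set of benign prompts and comparing the results. The following is a minimal sketch of that measurement, not the method used for this finding. The marker list and the function names (`is_refusal`, `refusal_rate`, `two_proportion_z`, `first_refused_step`) are hypothetical, and substring matching is only a stand-in for whatever refusal classifier or human review a real evaluation would use.

```python
import math
from typing import Optional, Sequence

# Hypothetical refusal markers (an assumption for illustration only).
# Substring matching will miss paraphrased refusals and can misfire on
# answers that merely quote these phrases.
REFUSAL_MARKERS = (
    "i can't help",
    "i cannot help",
    "i'm unable to",
    "potentially harmful",
)


def is_refusal(response: str) -> bool:
    """Heuristically flag a response as a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: Sequence[str]) -> float:
    """Fraction of responses flagged as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def two_proportion_z(refused_a: int, n_a: int, refused_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for the difference in refusal rates.

    Positive values mean variant A refuses more often than variant B.
    Returns 0.0 when the pooled standard error is zero (no refusals at all,
    or every response refused).
    """
    pooled = (refused_a + refused_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0
    return (refused_a / n_a - refused_b / n_b) / std_err


def first_refused_step(ladder_responses: Sequence[str]) -> Optional[int]:
    """Threshold testing: index of the first refused prompt in an ordered ladder.

    The ladder is assumed to hold responses to benign prompts ordered from
    clearly harmless toward the policy boundary. Returns None if no response
    in the ladder is flagged.
    """
    for index, response in enumerate(ladder_responses):
        if is_refusal(response):
            return index
    return None
```

For differential or A/B testing, the same benign prompt set would be sent to each version, provider, or variant, and the refusal counts passed to `two_proportion_z`. A large absolute value suggests the difference in refusal rate is unlikely to be noise, although the pooled z-test is only a rough guide at small sample sizes. For threshold testing, `first_refused_step` reports where along an ordered ladder of prompts the first refusal appears.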
Evidence
A benign request for general security concepts is refused as 'potentially harmful'.
Affected versions
References
Source: https://arxiv.org/abs/2308.01263
Cite this
Qlarify Labs. (2026). Over-refusal of benign requests. Retrieved from https://labs.qlarify.fi/findings/over-refusal