Refusals & over-caution

Refusal

Safety training has a false-positive side: models refuse legitimate requests (medical questions, security research, fiction involving conflict) because they pattern-match to something forbidden. Over-refusal is a real quality defect, and it is in tension with jailbreak resistance: tuning a model to resist jailbreaks more strongly tends to increase refusals of benign requests, and tuning it to refuse less tends to weaken that resistance. Over-refusal is also unevenly distributed across topics and phrasings, so it needs its own evaluation rather than being treated as the safe default. The linked findings document refusals of clearly benign requests.
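Because over-refusal varies by topic, an evaluation would plausibly report a refusal rate per topic rather than a single aggregate. The following is a minimal sketch under stated assumptions, not the method used in the linked findings: the `REFUSAL_MARKERS` list, the `is_refusal` heuristic, the `over_refusal_rates` helper, and the sample responses are all hypothetical and made up for illustration. A prefix match on refusal phrases is crude; a real evaluation would more likely rely on human labels or a trained classifier.

```python
from collections import defaultdict

# Hypothetical refusal phrases (an assumption for this sketch, not a vetted list).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def is_refusal(response: str) -> bool:
    """Crude heuristic: treat a response as a refusal if it opens with a marker."""
    text = response.strip().lower()
    return any(text.startswith(marker) for marker in REFUSAL_MARKERS)


def over_refusal_rates(records):
    """Return the refusal rate per topic.

    `records` is an iterable of (topic, response) pairs, where every
    response answers a prompt already known to be benign, so any refusal
    counts as an over-refusal.
    """
    totals = defaultdict(int)
    refused = defaultdict(int)
    for topic, response in records:
        totals[topic] += 1
        if is_refusal(response):
            refused[topic] += 1
    return {topic: refused[topic] / totals[topic] for topic in totals}


# Made-up responses for illustration only; these are not measured results.
sample = [
    ("medical", "I'm sorry, but I can't help with that."),
    ("medical", "The dose is listed on the package label."),
    ("security", "I cannot assist with this request."),
    ("security", "I can't provide that information."),
    ("fiction", "The duel began at dawn."),
]

rates = over_refusal_rates(sample)
for topic, rate in rates.items():
    print(f"{topic}: {rate:.0%} of benign prompts refused")
```

Reporting the rate per topic keeps an uneven distribution visible: a low overall rate can hide one topic where most benign prompts are refused.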

Findings (1)

Over-refusal of benign requests · Refusal · Medium

Methods

🔬 Threshold testing

References

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models — arXiv

Cite this

Qlarify Labs. (2026). Refusals & over-caution. Retrieved from https://labs.qlarify.fi/topics/refusals-and-overcaution