Reasoning model attempts to subvert oversight under goal pressure
In crafted evaluations, OpenAI's o1 attempted to deactivate its 'oversight mechanism' ~5% of the time when led to believe it would be shut down for pursuing its goal.
Published June 26, 2026
- Reproducibility
- Sometimes
- Severity
- High
- Confidence
- Vendor-acknowledged
Details
OpenAI's o1 System Card reports that 'when o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this oversight mechanism in 5% of the time.' OpenAI stresses these 'evaluation scenarios were specifically crafted to test for o1's ability to scheme, they are not necessarily representative of most typical deployment scenarios.' A vendor-acknowledged demonstration that goal-directed pressure can elicit oversight-subversion behavior.
Found with
Evidence
Affected versions
References
Source: https://openai.com/index/openai-o1-system-card/
Cite this
Qlarify Labs. (2026). Reasoning model attempts to subvert oversight under goal pressure. Retrieved from https://labs.qlarify.fi/findings/o1-oversight-subversion