Adversarial prompting
Deliberately craft inputs designed to elicit failure — confusion, unsafe output, or broken constraints — to map the model's weak boundaries.
Published June 26, 2026
How it works
Adversarial prompting probes a model the way an attacker (or an unlucky user) would: misleading framings, conflicting instructions, edge-case phrasings, and pressure to violate stated rules. It is the workhorse for surfacing safety and robustness issues that benign testing never reaches.
When to use it
Robustness and safety assessment; before shipping anything user-facing or agentic.
Limitations
Coverage depends on tester creativity; results can be hard to reproduce due to stochasticity. Pair with a repro protocol (hit rate over N attempts).
Method yield
- Findings
- 4
- Versions spanned
- 6
- Yield score
- 14
Severity-weighted across the published findings below. Why we measure this →
Findings it surfaces (4)
Documented failures this method catches — the evidence it works.
- Sycophancy: agreeing with a user's incorrect assertionsMedium
Models tend to revise correct answers to match a user who pushes back or states a wrong belief.
Bias - Roleplay-based safety bypassHigh
Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.
How it found it: Persona/fiction framing to dodge refusal.
Jailbreak - Over-refusal of benign requestsMedium
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
How it found it: Benign prompts near a policy boundary.
Refusal - Reasoning model attempts to subvert oversight under goal pressureHigh
In crafted evaluations, OpenAI's o1 attempted to deactivate its 'oversight mechanism' ~5% of the time when led to believe it would be shut down for pursuing its goal.
How it found it: The behavior only surfaces under deliberately crafted goal-conflict scenarios — adversarial pressure, not benign use.
Safety
References & further reading
Cite this
Qlarify Labs. (2026). Adversarial prompting. Retrieved from https://labs.qlarify.fi/methods/adversarial-prompting