AdversarialEstablished

Adversarial prompting

Deliberately craft inputs designed to elicit failure — confusion, unsafe output, or broken constraints — to map the model's weak boundaries.

Published June 26, 2026

Safety Evals

How it works

Adversarial prompting probes a model the way an attacker (or an unlucky user) would: misleading framings, conflicting instructions, edge-case phrasings, and pressure to violate stated rules. It is the workhorse for surfacing safety and robustness issues that benign testing never reaches.

When to use it

Robustness and safety assessment; before shipping anything user-facing or agentic.

Limitations

Coverage depends on tester creativity; results can be hard to reproduce due to stochasticity. Pair with a repro protocol (hit rate over N attempts).

Method yield

Findings: 4
Versions spanned: 6
Yield score: 14

2 High2 Medium

Severity-weighted across the published findings below. Why we measure this →

Findings it surfaces (4)

Documented failures this method catches — the evidence it works.

References & further reading

Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou et al. · arXiv · July 27, 2023

Cite this

Qlarify Labs. (2026). Adversarial prompting. Retrieved from https://labs.qlarify.fi/methods/adversarial-prompting