Paper · High credibility · arXiv · Zou et al. · July 27, 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models

Our summary

Introduces an automated method for generating adversarial suffixes that bypass safety training, and shows that the resulting attacks transfer across multiple aligned models, including closed-source ones.

Why it matters

Evidence that alignment via fine-tuning is brittle against automated, transferable attacks — not just hand-crafted jailbreaks.

Cited by these methods

🔬 Adversarial prompting

Related findings (2)

Roleplay-based safety bypass · High
Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.
Safety bypass via unicode/homoglyph obfuscation · High
Disallowed content encoded with look-alike unicode or spacing can slip past safety filters.

Jailbreak · Safety

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Universal and Transferable Adversarial Attacks on Aligned Language Models. Retrieved from https://labs.qlarify.fi/references/universal-transferable-adversarial-attacks