Paper · High credibility · arXiv · Zou et al. · July 27, 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models

Our summary

Introduces an automated method for generating adversarial suffixes that bypass safety training, and shows that the resulting attacks transfer across multiple aligned models, including closed-source ones.

Why it matters

Provides evidence that alignment via fine-tuning is brittle against automated, transferable attacks, not just hand-crafted jailbreaks.

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Universal and Transferable Adversarial Attacks on Aligned Language Models. Retrieved from https://labs.qlarify.fi/references/universal-transferable-adversarial-attacks