Paper · High credibility · arXiv · Zou et al. · July 27, 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Our summary
Introduces an automated method for generating adversarial suffixes that bypass safety training, and shows that the resulting attacks transfer across multiple aligned models, including closed-source ones.
Why it matters
Provides evidence that alignment via fine-tuning is brittle against automated, transferable attacks, not just hand-crafted jailbreaks.
Published June 26, 2026
Cite this
Qlarify Labs. (2026). Universal and Transferable Adversarial Attacks on Aligned Language Models. Retrieved from https://labs.qlarify.fi/references/universal-transferable-adversarial-attacks