Jailbreaking

A jailbreak is a prompt that talks a model out of its safety training. Common techniques include role-play framings, encodings, many-shot patterns, and automated search over adversarial suffixes. Unlike prompt injection, which hijacks an application's instructions, jailbreaking targets the model's own refusal behavior. Defenses improve with every model generation, but none has held: bypasses for each new model have been found within days of its release. The linked findings document techniques that worked and how vendors responded.

Findings (2)

Roleplay-based safety bypass | Jailbreak | High
Safety bypass via Unicode/homoglyph obfuscation | Jailbreak | High
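
To illustrate why Unicode and homoglyph obfuscation can slip past simple string matching, here is a minimal defensive sketch. It is not taken from the linked finding. The function names (`canonicalize`, `naive_match`, `normalized_match`), the tiny `CONFUSABLES` table, and the placeholder term "forbidden" are all assumptions made for illustration. It compares a naive keyword check with one that first canonicalizes the input.

```python
import unicodedata

# Hypothetical, deliberately tiny look-alike table (illustration only).
# A real filter would need a far larger mapping, for example one derived
# from the Unicode confusables data.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek small omicron
}

# Invisible characters that can split a word without changing how it renders.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def canonicalize(text: str) -> str:
    """Reduce a string to a comparison form before any keyword matching."""
    # NFKC folds compatibility forms (fullwidth letters, ligatures) into
    # their ordinary equivalents.
    text = unicodedata.normalize("NFKC", text)
    # NFKC leaves zero-width characters in place, so drop them explicitly.
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # Map known look-alike letters onto their Latin counterparts.
    text = "".join(CONFUSABLES.get(ch, ch) for ch in text)
    return text.casefold()


def naive_match(text: str, term: str) -> bool:
    """Keyword check on the raw string, lowercased only."""
    return term in text.lower()


def normalized_match(text: str, term: str) -> bool:
    """Keyword check on the canonicalized string."""
    return term in canonicalize(text)


if __name__ == "__main__":
    samples = [
        "f\u043erbidd\u0435n",  # Cyrillic o and ie swapped in
        "\uff46\uff4f\uff52\uff42\uff49\uff44\uff44\uff45\uff4e",  # fullwidth
        "for\u200bbidden",  # zero-width space inserted
    ]
    for s in samples:
        print(repr(s), naive_match(s, "forbidden"), normalized_match(s, "forbidden"))
```

Canonicalization of this kind only narrows the gap for keyword-style input filters. It does not address the model's own refusal behavior, which is what a jailbreak ultimately targets, and a short mapping like the one above will miss most look-alike characters.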

References

Universal and Transferable Adversarial Attacks on Aligned Language Models — arXiv

Cite this

Qlarify Labs. (2026). Jailbreaking. Retrieved from https://labs.qlarify.fi/topics/jailbreaking