Jailbreaking

A jailbreak is a prompt that talks a model out of its safety training, using techniques such as role-play framings, encoded requests, many-shot patterns, or automated search over adversarial suffixes. Unlike prompt injection, which hijacks an application's instructions, jailbreaking targets the model's own refusal behavior. Defenses improve with every model generation, but none has held: so far, bypasses for each new model have been found within days of its release. The linked findings document techniques that worked and how vendors responded.

Findings (2)

Cite this

Qlarify Labs. (2026). Jailbreaking. Retrieved from https://labs.qlarify.fi/topics/jailbreaking