HighJailbreakReviewer-confirmedPublished

Roleplay-based safety bypass

Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.

Published June 26, 2026

Reproducibility: Sometimes
Severity: High
Confidence: Reviewer-confirmed

Details

By asking the model to adopt a persona or 'play a character' that is unconstrained, users can sometimes elicit content the policy would otherwise refuse. A classic, recurring jailbreak family that resurfaces in new forms after each mitigation.

Found with

🔬 Prompt-injection & jailbreak testing 🔬 Adversarial prompting

Persona/fiction framing to dodge refusal.

Evidence

A persona-framing prompt induced the model to produce policy-violating content it had refused when asked directly. Working prompt withheld.

Illustrative example — see the linked reference for the documented evidence.

1 evidence item withheld. Live exploit payloads are not published — only the technique and impact are described (disclosure policy).

Affected versions

Anthropic · claude-opus-4-8OpenAI · gpt-4oGoogle · gemini-2.0-flashMeta · llama-3.3-70b

References

Universal and Transferable Adversarial Attacks on Aligned Language Models

Jailbreak Safety

Source: https://arxiv.org/abs/2307.15043

Cite this

Qlarify Labs. (2026). Roleplay-based safety bypass. Retrieved from https://labs.qlarify.fi/findings/roleplay-jailbreak