← Findings
HighJailbreakReviewer-confirmedPublished

Roleplay-based safety bypass

Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.

Published June 26, 2026

Reproducibility
Sometimes
Severity
High
Confidence
Reviewer-confirmed

Details

By asking the model to adopt a persona or 'play a character' that is unconstrained, users can sometimes elicit content the policy would otherwise refuse. A classic, recurring jailbreak family that resurfaces in new forms after each mitigation.

Found with

Evidence

A persona-framing prompt induced the model to produce policy-violating content it had refused when asked directly. Working prompt withheld.
Illustrative example — see the linked reference for the documented evidence.

1 evidence item withheld. Live exploit payloads are not published — only the technique and impact are described (disclosure policy).

Affected versions

Anthropic · claude-opus-4-8OpenAI · gpt-4oGoogle · gemini-2.0-flashMeta · llama-3.3-70b

References

Source: https://arxiv.org/abs/2307.15043

Cite this

Qlarify Labs. (2026). Roleplay-based safety bypass. Retrieved from https://labs.qlarify.fi/findings/roleplay-jailbreak