Red teamEstablished

Prompt-injection & jailbreak testing

Embed adversarial instructions in user input or retrieved/tool content to test whether the model follows attacker text over its system policy.

Published June 26, 2026

Prompt injection Safety Agents

How it works

Prompt injection is the defining security failure mode of LLM applications. Testing covers direct injection (the user tries to override the system prompt) and indirect injection (malicious instructions hidden in documents, web pages, or tool outputs the model ingests). For agentic systems this is the highest-severity surface.

When to use it

Any system that ingests untrusted content (RAG, browsing, email, tool outputs) or grants the model side effects.

Limitations

The attack space is open-ended; a clean test run is not proof of safety. Per Qlarify Labs policy, live payloads are redacted in published findings.

Method yield

Findings: 4
Versions spanned: 4
Yield score: 18

2 Critical2 High

Severity-weighted across the published findings below. Why we measure this →

Findings it surfaces (4)

Documented failures this method catches — the evidence it works.

References & further reading

Cite this

Qlarify Labs. (2026). Prompt-injection & jailbreak testing. Retrieved from https://labs.qlarify.fi/methods/prompt-injection-testing