Safety & alignment

Safety

Alignment training teaches a model to refuse harmful requests and follow a policy — but the policy lives in the same soft, steerable substrate as everything else the model does. That makes safety a property to be tested, not assumed: behavior shifts across versions, guardrails that hold in English fail in other languages or encodings, and vendors' own system cards acknowledge residual risks. The linked findings collect both independently demonstrated failures and vendor-acknowledged limitations.

Findings (11)

Methods

🔬 Adversarial prompting 🔬 Canary releases & staged rollout 🔬 Distillation & model-extraction probing 🔬 Glitch-token & unicode fuzzing 🔬 Prompt-injection & jailbreak testing

References

Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court — AI Incident Database
Measuring Faithfulness in Chain-of-Thought Reasoning — arXiv
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — arXiv
OWASP Top 10 for Large Language Model Applications — OWASP
Prompt injection: what’s the worst that can happen? — Simon Willison’s Weblog
Towards Understanding Sycophancy in Language Models — arXiv
Universal and Transferable Adversarial Attacks on Aligned Language Models — arXiv
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models — arXiv

Cite this

Qlarify Labs. (2026). Safety & alignment. Retrieved from https://labs.qlarify.fi/topics/safety-and-alignment