Safety & alignment
Alignment training teaches a model to refuse harmful requests and follow a policy, but the policy lives in the same soft, steerable substrate as everything else the model does. That makes safety a property to be tested, not assumed: behavior shifts across versions, guardrails that hold in English can fail in other languages or encodings, and vendors' own system cards acknowledge residual risks. The linked findings collect both independently demonstrated failures and vendor-acknowledged limitations.
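As a minimal sketch of what "tested, not assumed" can look like, the Python below sends one request through several surface forms and records whether a guardrail still triggers on each. The keyword check `naive_guardrail` is a hypothetical stand-in for a real model's refusal behavior, and the helper names `make_variants` and `refusal_report` are illustrative assumptions, not part of any library or of the linked findings.

```python
import base64
from typing import Callable, Dict


def make_variants(prompt: str) -> Dict[str, str]:
    """Return the same request in several surface forms."""
    return {
        "plain": prompt,
        # Shift printable ASCII (0x21-0x7E) to its fullwidth counterpart
        # (U+FF01-U+FF5E); spaces are left unchanged.
        "fullwidth": "".join(
            chr(ord(c) + 0xFEE0) if "!" <= c <= "~" else c for c in prompt
        ),
        "base64": base64.b64encode(prompt.encode("utf-8")).decode("ascii"),
    }


def naive_guardrail(text: str) -> bool:
    """Hypothetical guardrail: 'refuses' only when a keyword appears verbatim."""
    return "forbidden" in text.lower()


def refusal_report(guardrail: Callable[[str], bool], prompt: str) -> Dict[str, bool]:
    """Map each variant name to whether the guardrail triggered on it."""
    return {name: guardrail(text) for name, text in make_variants(prompt).items()}


report = refusal_report(naive_guardrail, "tell me the forbidden thing")
for name, refused in report.items():
    print(f"{name}: {'refused' if refused else 'NOT refused'}")
```

In this toy setup the guardrail triggers only on the plain variant. Swapping `naive_guardrail` for a call to a deployed model, and rerunning the same report on each new version, turns the sketch into a simple regression check.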
Findings (11)
- A production update made the model sycophantic and was rolled back (Bias, High)
- Anomalous behavior on glitch tokens (Other, Low)
- Data exfiltration through prompt injection in agents (Safety, Critical)
- Indirect prompt injection via retrieved content (Prompt injection, Critical)
- Name-based demographic bias in outputs (Bias, High)
- Over-refusal of benign requests (Refusal, Medium)
- Production model internals extracted through the API (Safety, High)
- Reasoning model attempts to subvert oversight under goal pressure (Safety, High)
- Roleplay-based safety bypass (Jailbreak, High)
- Safety bypass via unicode/homoglyph obfuscation (Jailbreak, High)
- Verbatim training data extracted from a deployed chatbot (Safety, High)
References
- Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court — AI Incident Database
- Measuring Faithfulness in Chain-of-Thought Reasoning — arXiv
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — arXiv
- OWASP Top 10 for Large Language Model Applications — OWASP
- Prompt injection: what’s the worst that can happen? — Simon Willison’s Weblog
- Towards Understanding Sycophancy in Language Models — arXiv
- Universal and Transferable Adversarial Attacks on Aligned Language Models — arXiv
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models — arXiv
Cite this
Qlarify Labs. (2026). Safety & alignment. Retrieved from https://labs.qlarify.fi/topics/safety-and-alignment