← AI tech topics

Safety & alignment

Alignment training teaches a model to refuse harmful requests and follow a policy — but the policy lives in the same soft, steerable substrate as everything else the model does. That makes safety a property to be tested, not assumed: behavior shifts across versions, guardrails that hold in English fail in other languages or encodings, and vendors' own system cards acknowledge residual risks. The linked findings collect both independently demonstrated failures and vendor-acknowledged limitations.

Findings (11)

Methods

References

Cite this

Qlarify Labs. (2026). Safety & alignment. Retrieved from https://labs.qlarify.fi/topics/safety-and-alignment