Bias & fairness

Bias

Models inherit the statistics of their training data, including its stereotypes — and alignment training redistributes bias rather than deleting it. The failures are often quiet: different tone for different names, different refusal rates across dialects, skewed defaults in generated personas. Because the effects are distributional, they only show up under aggregated, controlled comparisons, not spot checks. The linked findings document measured disparities; the methods show the comparison designs that surface them.

Findings (3)

A production update made the model sycophantic and was rolled backBiasHigh
Name-based demographic bias in outputsBiasHigh
Sycophancy: agreeing with a user's incorrect assertionsBiasMedium

Methods

🔬 Bias auditing 🔬 Counterfactual bias probing

Cite this

Qlarify Labs. (2026). Bias & fairness. Retrieved from https://labs.qlarify.fi/topics/bias-and-fairness