Counterfactual bias probing
Hold a prompt fixed while swapping a protected attribute (name, gender, ethnicity) — the output should not change. When it does, you've measured bias.
Published June 26, 2026
How it works
A specialized metamorphic relation: swapping only a demographic signal should leave a fair output unchanged. Systematically varying names or pronouns across otherwise-identical prompts quantifies differential treatment in hiring screens, recommendations, sentiment, and more.
When to use it
Fairness audits; any decision-support or evaluative use touching people.
Limitations
Requires careful design of matched pairs; absence of measured bias on tested attributes is not absence of bias overall.
Method yield
- Findings
- 2
- Versions spanned
- 6
- Yield score
- 7
Severity-weighted across the published findings below. Why we measure this →
Findings it surfaces (2)
Documented failures this method catches — the evidence it works.
- Sycophancy: agreeing with a user's incorrect assertionsMedium
Models tend to revise correct answers to match a user who pushes back or states a wrong belief.
How it found it: Inject user belief; correct answer should be invariant but flips.
Bias - Name-based demographic bias in outputsHigh
Swapping only a name (signalling gender or ethnicity) changes evaluative outputs like screening or sentiment.
How it found it: Matched-pair name swaps; outputs diverge.
Bias
References & further reading
Cite this
Qlarify Labs. (2026). Counterfactual bias probing. Retrieved from https://labs.qlarify.fi/methods/counterfactual-bias-probing