Differential testing
Run the same input across models or versions and treat divergence as a signal worth investigating.
Published June 26, 2026
How it works
When two comparable systems disagree on the same input, at least one is wrong — or the behavior is unstable. Comparing across providers, or across versions of one model, surfaces regressions and version-specific quirks cheaply, again without a labelled oracle.
When to use it
Regression testing across model upgrades; provider selection; flagging unstable behaviors.
Limitations
Agreement doesn't prove correctness (both can be wrong the same way). Best combined with an oracle on the divergent cases.
Method yield
- Findings
- 7
- Versions spanned
- 7
- Yield score
- 21
Severity-weighted across the published findings below. Why we measure this →
Findings it surfaces (7)
Documented failures this method catches — the evidence it works.
- The reversal curse: 'A is B' not generalizing to 'B is A'Medium
A model trained that 'A is B' frequently fails to answer 'B is ?', revealing that learned relations are not symmetric.
How it found it: Forward vs. reverse phrasing diverge.
Reasoning - Hallucinated tool/function argumentsHigh
When calling tools, models invent argument values or call functions that weren't provided.
Tool use - Over-refusal of benign requestsMedium
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
How it found it: Refusal rates differ across versions/providers for identical benign prompts.
Refusal - Model behavior drifts between versions — a fixed task can regressMedium
The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.
How it found it: Comparing two dated snapshots on identical inputs surfaces the divergence directly.
Reasoning - Reasoning model mixes languages on non-English/Chinese queriesLow
DeepSeek-R1 is optimized for English and Chinese and can mix languages mid-output on queries in other languages — its own paper flags this.
How it found it: The same query posed in different languages diverges — or the output mixes languages.
Other - Reasoning model degrades under few-shot promptingMedium
DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot — inverting the usual assumption that examples help.
How it found it: Zero-shot vs few-shot on the same task diverge, against the usual direction.
Reasoning - Reasoning model regresses on tool use versus its base modelMedium
DeepSeek-R1 falls short of the base DeepSeek-V3 on function calling, multi-turn, complex role-play and JSON output — a reasoning-tuned model trading away tool-use reliability, later restored in R1-0528.
How it found it: R1 vs V3 on identical tool-use tasks diverge — the base model does better.
Tool use
References & further reading
Cite this
Qlarify Labs. (2026). Differential testing. Retrieved from https://labs.qlarify.fi/methods/differential-testing