Differential testing

Method yield

Findings: 7
Versions spanned: 7
Yield score: 21

1 High, 5 Medium, 1 Low

Severity-weighted across the published findings below.

Findings it surfaces (7)

Documented failures this method catches: the evidence that it works.

The reversal curse: 'A is B' not generalizing to 'B is A' (Medium)
A model trained on 'A is B' frequently fails to answer 'B is ?', showing that the relations it learns are not symmetric.
How it found it: Forward and reverse phrasings of the same fact produce divergent answers.
Category: Reasoning
Hallucinated tool/function arguments (High)
When calling tools, models invent argument values or call functions that were never provided.
Category: Tool use
Over-refusal of benign requests (Medium)
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
How it found it: Refusal rates differ across versions and providers for identical benign prompts.
Category: Refusal
Model behavior drifts between versions: a fixed task can regress (Medium)
The same model name can perform very differently across dated snapshots: a task that passed on one release can regress on the next, with no announcement.
How it found it: Comparing two dated snapshots on identical inputs surfaces the divergence directly.
Category: Reasoning
Reasoning model mixes languages on queries outside English and Chinese (Low)
DeepSeek-R1 is optimized for English and Chinese and can mix languages mid-output on queries in other languages, a limitation its own paper flags.
How it found it: The same query posed in different languages yields divergent answers, or the output itself mixes languages.
Category: Other
Reasoning model degrades under few-shot prompting (Medium)
DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot prompting, inverting the usual assumption that examples help.
How it found it: Zero-shot and few-shot runs of the same task diverge, and in the opposite direction from the usual one.
Category: Reasoning
Reasoning model regresses on tool use versus its base model (Medium)
DeepSeek-R1 falls short of the base DeepSeek-V3 on function calling, multi-turn, complex role-play and JSON output. The reasoning-tuned model traded away tool-use reliability, which was later restored in R1-0528.
How it found it: R1 and V3 diverge on identical tool-use tasks, with the base model doing better.
Category: Tool use
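
The "How it found it" notes above share one pattern: run identical inputs through two variants (two snapshots, two phrasings, two prompting styles) and flag where the outputs disagree. A minimal sketch of that loop follows, under stated assumptions: `differential_test`, `Divergence`, `normalize`, and the two stub "snapshots" are hypothetical names invented for illustration and are not taken from the source. A real harness would call actual model endpoints and would likely need a looser equivalence check than normalized string equality.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Divergence:
    """One input on which the two systems under test disagreed."""
    prompt: str
    output_a: str
    output_b: str


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def differential_test(
    prompts: Iterable[str],
    system_a: Callable[[str], str],
    system_b: Callable[[str], str],
    equivalent: Callable[[str, str], bool] = lambda a, b: normalize(a) == normalize(b),
) -> List[Divergence]:
    """Run every prompt through both systems and collect the disagreements."""
    divergences = []
    for prompt in prompts:
        out_a = system_a(prompt)
        out_b = system_b(prompt)
        if not equivalent(out_a, out_b):
            divergences.append(Divergence(prompt, out_a, out_b))
    return divergences


# Hypothetical stand-ins for two dated snapshots of the same model.
def snapshot_old(prompt: str) -> str:
    answers = {"Is 17 prime?": "Yes", "Is 21 prime?": "No"}
    return answers.get(prompt, "unknown")


def snapshot_new(prompt: str) -> str:
    # Disagrees on the first prompt; differs only in formatting on the second.
    answers = {"Is 17 prime?": "No", "Is 21 prime?": "no "}
    return answers.get(prompt, "unknown")


prompts = ["Is 17 prime?", "Is 21 prime?"]
found = differential_test(prompts, snapshot_old, snapshot_new)
for d in found:
    print(f"DIVERGED on {d.prompt!r}: {d.output_a!r} vs {d.output_b!r}")
```

A divergence only says that the two variants disagree. Deciding which side is correct, or whether the difference matters at all, takes a separate check.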

Cite this

Qlarify Labs. (2026). Differential testing. Retrieved from https://labs.qlarify.fi/methods/differential-testing
