Perturbation testing
Apply small, meaning-preserving changes to an input — typos, spacing, paraphrase, reordering — and check that the output stays stable. When it doesn't, you've measured brittleness.
Published June 26, 2026
How it works
A robust system should be indifferent to changes that don't change meaning: a stray typo, extra whitespace, a synonym, a reordered clause. Perturbation testing applies these small transformations at scale and flags every case where the answer moves. It is a focused, robustness-oriented cousin of metamorphic testing — the relation is simply 'meaning-preserving in, same answer out' — and it exposes the prompt-sensitivity that quietly undermines reproducibility.
When to use it
Robustness hardening; quantifying prompt-sensitivity; regression-guarding inputs that users will phrase many different ways.
Limitations
You must ensure the perturbation truly preserves meaning — an over-aggressive change creates false positives — and it detects instability, not which of the diverging answers is correct.
Method yield
- Findings
- 3
- Versions spanned
- 7
- Yield score
- 8
Severity-weighted across the published findings below. Why we measure this →
Findings it surfaces (3)
Documented failures this method catches — the evidence it works.
- Character-counting errors in tokenized wordsLow
Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
How it found it: Perturbing the target word (case, spacing, repeated letters) shows the count is not robust to trivial, meaning-preserving changes.
Reasoning - Inconsistent answers to semantically equivalent promptsMedium
Trivial rewordings of the same question yield materially different answers.
How it found it: Semantically-neutral paraphrase perturbations change the answer, exposing prompt brittleness.
Reasoning - Reasoning model degrades under few-shot promptingMedium
DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot — inverting the usual assumption that examples help.
How it found it: Adding few-shot exemplars — a meaning-preserving prompt change — measurably degrades the answer.
Reasoning
References & further reading
Cite this
Qlarify Labs. (2026). Perturbation testing. Retrieved from https://labs.qlarify.fi/methods/perturbation-testing