Metamorphic · Established

Perturbation testing

Apply small, meaning-preserving changes to an input — typos, spacing, paraphrase, reordering — and check that the output stays stable. When it doesn't, you've measured brittleness.

Published June 26, 2026

Reasoning failure · Robustness

How it works

A robust system should be indifferent to changes that don't change meaning: a stray typo, extra whitespace, a synonym, a reordered clause. Perturbation testing applies these small transformations at scale and flags every case where the answer moves. It is a focused, robustness-oriented cousin of metamorphic testing — the relation is simply 'meaning-preserving in, same answer out' — and it exposes the prompt-sensitivity that quietly undermines reproducibility.
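The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: `system` stands in for whatever model or pipeline is under test, and the two perturbation functions (`add_whitespace`, `swap_adjacent_letters`) and the `find_unstable` helper are hypothetical names chosen for this example. Exact string equality is used as the "same answer" check, which is the simplest possible comparison and will often need to be loosened for free-text outputs.

```python
import random
import re
from typing import Callable, Dict, List

# Hypothetical perturbations for illustration; real suites would also
# cover synonyms, paraphrase, and clause reordering.
def add_whitespace(text: str, rng: random.Random) -> str:
    """Double one randomly chosen space; return text unchanged if none exist."""
    positions = [i for i, ch in enumerate(text) if ch == " "]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + " " + text[i:]

def swap_adjacent_letters(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two interior letters of one word."""
    words = list(re.finditer(r"[A-Za-z]{4,}", text))
    if not words:
        return text
    m = rng.choice(words)
    # Keep the first and last letter of the word in place.
    i = rng.randrange(m.start() + 1, m.end() - 2)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

PERTURBATIONS: Dict[str, Callable[[str, random.Random], str]] = {
    "whitespace": add_whitespace,
    "typo": swap_adjacent_letters,
}

def find_unstable(
    system: Callable[[str], object],
    inputs: List[str],
    n_variants: int = 5,
    seed: int = 0,
) -> List[dict]:
    """Return one record for every perturbed input whose answer moved."""
    rng = random.Random(seed)  # seeded so a failing case can be reproduced
    failures = []
    for text in inputs:
        baseline = system(text)
        for name, perturb in PERTURBATIONS.items():
            for _ in range(n_variants):
                variant = perturb(text, rng)
                if variant == text:
                    continue  # perturbation was a no-op, nothing to compare
                answer = system(variant)
                if answer != baseline:
                    failures.append({
                        "perturbation": name,
                        "original": text,
                        "variant": variant,
                        "baseline": baseline,
                        "answer": answer,
                    })
    return failures

if __name__ == "__main__":
    # Toy stand-in for a real system: an exact keyword match is brittle.
    def toy_system(text: str) -> str:
        return "billing" if "refund" in text else "other"

    for failure in find_unstable(toy_system, ["Can I get a refund today"]):
        print(failure["perturbation"], repr(failure["variant"]), failure["answer"])
```

Each record keeps both the original and the perturbed input, so a reviewer can confirm that the variant really did preserve meaning before treating the flip as a finding.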

When to use it

Robustness hardening; quantifying prompt-sensitivity; regression-guarding inputs that users will phrase many different ways.
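One way to turn prompt-sensitivity into a number that can guard against regressions is a flip rate: the share of alternative phrasings whose answer differs from the answer to a canonical phrasing. The sketch below assumes hand-written paraphrase sets and a hypothetical budget, `MAX_FLIP_RATE`; the names `flip_rate`, `within_budget`, and `toy_router` are illustrative and not part of any published tooling.

```python
from typing import Callable, Dict, List

def flip_rate(
    system: Callable[[str], object],
    paraphrase_sets: Dict[str, List[str]],
) -> float:
    """Fraction of paraphrases answered differently from their canonical phrasing."""
    flips = 0
    total = 0
    for canonical, paraphrases in paraphrase_sets.items():
        baseline = system(canonical)
        for phrasing in paraphrases:
            total += 1
            if system(phrasing) != baseline:
                flips += 1
    return flips / total if total else 0.0

def within_budget(rate: float, max_rate: float) -> bool:
    """Regression guard: True while the measured flip rate stays within budget."""
    return rate <= max_rate

# Assumed threshold for illustration; choose one that fits your own system.
MAX_FLIP_RATE = 0.30

# Toy stand-in for the system under test: a keyword-based intent router.
def toy_router(text: str) -> str:
    return "billing" if "refund" in text.lower() else "other"

# Hand-written phrasings that a user might plausibly use for the same request.
PARAPHRASES = {
    "How do I get a refund?": [
        "how do i get a refund",
        "How can I get my money back?",
        "What is the way to obtain a refund?",
        "I'd like a REFUND, how?",
    ],
}

if __name__ == "__main__":
    rate = flip_rate(toy_router, PARAPHRASES)
    print(f"flip rate: {rate:.2f}, within budget: {within_budget(rate, MAX_FLIP_RATE)}")
```

Tracking this rate across versions shows whether a change made the system more or less sensitive to phrasing, even when accuracy on the canonical inputs stays flat.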

Limitations

You must ensure each perturbation truly preserves meaning, because an over-aggressive change produces false positives. The method also detects only instability: it cannot tell you which of the diverging answers is correct.

Method yield

Findings: 3
Versions spanned: 7
Yield score: 8

2 Medium, 1 Low

Severity-weighted across the published findings below.

Findings it surfaces (3)

Documented failures this method catches — the evidence it works.

References & further reading

TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP
John X. Morris, Eli Lifland, Jin Yong Yoo, et al. · EMNLP 2020 (arXiv:2005.05909) · May 12, 2020

Cite this

Qlarify Labs. (2026). Perturbation testing. Retrieved from https://labs.qlarify.fi/methods/perturbation-testing