Metamorphic · Established

Metamorphic testing

Test without an oracle by checking relations between the outputs of related inputs, instead of judging any single output in isolation.

Published June 26, 2026

Reasoning failure Evals

How it works

Metamorphic testing sidesteps the oracle problem (the difficulty of knowing the 'correct' output for an arbitrary input) by checking a property that must hold *between* an input and a systematically transformed follow-up input. That property is a metamorphic relation (MR). If the relation is violated, you have found a likely fault, with no labelled ground truth required. This fits LLMs unusually well, because most interesting tasks have no cheap oracle but plenty of easy-to-state invariances.

Common MRs:

• Paraphrase invariance: rewording a question should not change a factual answer.
• Negation: negating a premise should flip a yes/no answer.
• Symmetry: if the model asserts a symmetric relation such as 'A is the same as B', it should also accept 'B is the same as A'.
• Monotonicity: adding supporting evidence should not lower a stated confidence.

A test fails when the follow-up output contradicts the MR. Because MRs are checked automatically across many transformed inputs, the method scales to thousands of cases without human labelling.
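The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `toy_model`, `PARAPHRASES`, `MetamorphicRelation`, and `check` are hypothetical names introduced here, and the toy model stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MetamorphicRelation:
    name: str
    transform: Callable[[str], str]    # source input -> follow-up input
    holds: Callable[[str, str], bool]  # (source output, follow-up output) -> satisfied?


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; deliberately brittle to wording."""
    return "Paris" if "capital of france" in prompt.lower() else "unknown"


# Hypothetical hand-written paraphrases; a real suite would generate these.
PARAPHRASES = {
    "What is the capital of France?": "Which city is France's capital?",
}

paraphrase_invariance = MetamorphicRelation(
    name="paraphrase invariance",
    transform=lambda prompt: PARAPHRASES.get(prompt, prompt),
    holds=lambda source_out, follow_out: source_out == follow_out,
)

case_invariance = MetamorphicRelation(
    name="case invariance",
    transform=str.upper,
    holds=lambda source_out, follow_out: source_out == follow_out,
)


def check(
    model: Callable[[str], str],
    relation: MetamorphicRelation,
    sources: List[str],
) -> List[Tuple[str, str, str, str, str]]:
    """Return one record per source input whose follow-up output breaks the relation."""
    violations = []
    for source in sources:
        follow_up = relation.transform(source)
        source_out, follow_out = model(source), model(follow_up)
        if not relation.holds(source_out, follow_out):
            violations.append((relation.name, source, follow_up, source_out, follow_out))
    return violations


sources = ["What is the capital of France?"]
for relation in (paraphrase_invariance, case_invariance):
    print(relation.name, check(toy_model, relation, sources))
```

With this toy model, the paraphrase relation reports one violation (the reworded question gets a different answer) and the case relation reports none. No expected answer is ever supplied; only the two outputs are compared.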

When to use it

Reach for it when there is no reliable oracle but you can state a property that ought to hold across related inputs — consistency, invariance, symmetry, monotonicity. It is especially suited to LLMs, where ground-truth answers are expensive but invariances (paraphrase, translation, ordering) are cheap to generate at scale.
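To illustrate how cheaply ordering follow-ups can be generated, here is a small Python sketch that enumerates reorderings of a multiple-choice option list and checks that the chosen answer does not move with position. Both models and the function `ordering_violations` are hypothetical stand-ins, not part of any published tooling.

```python
from itertools import islice, permutations
from typing import Callable, List, Sequence, Tuple


def first_option_model(question: str, options: Sequence[str]) -> str:
    """Hypothetical position-biased model: always picks whatever is listed first."""
    return options[0]


def content_model(question: str, options: Sequence[str]) -> str:
    """Hypothetical order-insensitive model: picks 'Paris' whenever it is offered."""
    return "Paris" if "Paris" in options else options[0]


def ordering_violations(
    model: Callable[[str, Sequence[str]], str],
    question: str,
    options: List[str],
    max_variants: int = 1000,
) -> List[Tuple[Tuple[str, ...], str]]:
    """Ordering invariance: the chosen answer should not depend on option order."""
    baseline = model(question, options)
    violations = []
    # Each permutation of the options is a follow-up input, generated for free.
    for variant in islice(permutations(options), max_variants):
        answer = model(question, variant)
        if answer != baseline:
            violations.append((variant, answer))
    return violations


question = "What is the capital of France?"
options = ["Paris", "Lyon", "Nice", "Lille"]
print(len(ordering_violations(first_option_model, question, options)))
print(len(ordering_violations(content_model, question, options)))
```

Four options yield 24 orderings from a single source case. The position-biased model fails on every ordering that does not start with its baseline answer, while the content-based model passes all of them.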

Limitations

A passing MR cannot prove correctness; the method only catches violations that its chosen relations can express. Designing good MRs is the hard part and needs domain insight: weak or wrong MRs produce false alarms or miss faults. Some violations are also legitimate behaviour (for example, a paraphrase that genuinely changes meaning), so results still need human triage.
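The first limitation is easy to demonstrate. In the Python sketch below, a hypothetical model that is consistently wrong satisfies paraphrase invariance on every input, so the relation alone says nothing about correctness. The names `constant_model` and `paraphrase_invariant` are illustrative only.

```python
def constant_model(prompt: str) -> str:
    """Hypothetical model that is wrong but perfectly consistent."""
    return "Lyon"


def paraphrase_invariant(model, source: str, follow_up: str) -> bool:
    """The relation holds when both phrasings receive the same answer."""
    return model(source) == model(follow_up)


source = "What is the capital of France?"
follow_up = "Which city is France's capital?"

# The relation is satisfied even though both answers are incorrect.
passed = paraphrase_invariant(constant_model, source, follow_up)
print(passed)
```

Catching this kind of fault needs either a different relation or at least a few labelled cases alongside the metamorphic suite.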

Method yield

Findings: 4
Versions spanned: 7
Yield score: 11

3 Medium, 1 Low

Severity-weighted across the published findings below.

Findings it surfaces (4)

Documented failures this method catches — the evidence it works.

Cite this

Qlarify Labs. (2026). Metamorphic testing. Retrieved from https://labs.qlarify.fi/methods/metamorphic-testing