Metamorphic testing
Test without an oracle by checking relations between the outputs of related inputs, instead of judging any single output in isolation.
Published June 26, 2026
How it works
Metamorphic testing sidesteps the oracle problem (the difficulty of knowing the 'correct' output for an arbitrary input) by checking a property that must hold *between* an input and a systematically transformed follow-up input. That property is a metamorphic relation (MR). If the relation is violated, you have found a fault, with no labelled ground truth required. This fits LLMs unusually well, because most interesting tasks have no cheap oracle but plenty of easy-to-state invariances.
Common MRs:
- Paraphrase invariance: rewording a question should not change a factual answer.
- Negation: negating a premise should flip a yes/no answer.
- Symmetry: if the model asserts 'A is B', it should accept 'B is A'.
- Monotonicity: adding supporting evidence should not lower a stated confidence.
A test fails when the follow-up output contradicts the MR. Because MRs are checked automatically across many transformed inputs, the method scales to thousands of cases without human labelling.
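The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `MetamorphicRelation`, `run_relation`, and `toy_model` are hypothetical names, and `toy_model` is a deliberately flawed stand-in for a real model call (it ignores negation) so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MetamorphicRelation:
    name: str
    transform: Callable[[str], str]    # source input -> follow-up input
    holds: Callable[[str, str], bool]  # (source output, follow-up output) -> relation satisfied?


KNOWN_MAMMALS = ("whale", "bat")


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model call: keyword lookup that ignores negation."""
    text = prompt.lower()
    return "yes" if any(animal in text for animal in KNOWN_MAMMALS) else "no"


def run_relation(
    model: Callable[[str], str],
    relation: MetamorphicRelation,
    sources: List[str],
) -> List[Tuple[str, str, str, str]]:
    """Return (source, follow_up, source_output, follow_up_output) for every violation."""
    violations = []
    for source in sources:
        follow_up = relation.transform(source)
        out_source, out_follow_up = model(source), model(follow_up)
        if not relation.holds(out_source, out_follow_up):
            violations.append((source, follow_up, out_source, out_follow_up))
    return violations


# Paraphrase invariance: a trivial rewording must leave the answer unchanged.
paraphrase = MetamorphicRelation(
    name="paraphrase invariance",
    transform=lambda p: f"Quick question: {p}",
    holds=lambda src, fu: src == fu,
)

# Negation: a naive transform for "Is the X a Y?" questions; the answer must flip.
negation = MetamorphicRelation(
    name="negation",
    transform=lambda p: p.replace(" a ", " not a ", 1),
    holds=lambda src, fu: src != fu,
)

sources = ["Is the whale a mammal?", "Is the trout a mammal?"]

for relation in (paraphrase, negation):
    violations = run_relation(toy_model, relation, sources)
    print(f"{relation.name}: {len(violations)} violation(s) in {len(sources)} cases")
```

No expected answer appears anywhere in the sketch: the toy model passes the paraphrase relation and fails the negation relation on both inputs, purely from comparing paired outputs. With a real model, the string transforms would be replaced by more careful rewriting, and exact string equality by a task-appropriate comparison.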
When to use it
Reach for it when there is no reliable oracle but you can state a property that ought to hold across related inputs — consistency, invariance, symmetry, monotonicity. It is especially suited to LLMs, where ground-truth answers are expensive but invariances (paraphrase, translation, ordering) are cheap to generate at scale.
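As an illustration of how cheaply such follow-up inputs can be generated, here is a sketch of an ordering-invariance check for multiple-choice questions. The function names and the two stand-in models are assumptions made for the example; a real harness would call an actual model where `first_option_model` and `content_model` appear.

```python
import itertools
from typing import Callable, List, Sequence, Tuple

# A "model" here takes a question and an option list and returns the chosen index.
Chooser = Callable[[str, Sequence[str]], int]


def option_order_followups(options: Sequence[str], limit: int = 6) -> List[List[str]]:
    """Return up to `limit` reorderings of the options, skipping the original order."""
    perms = itertools.permutations(options)
    next(perms)  # the first permutation is the input order itself
    return [list(p) for p in itertools.islice(perms, limit)]


def check_order_invariance(
    model: Chooser, question: str, options: Sequence[str], limit: int = 6
) -> List[Tuple[List[str], str]]:
    """Relation: the chosen option *text* must not depend on the option order."""
    baseline = options[model(question, options)]
    violations = []
    for reordered in option_order_followups(options, limit):
        choice = reordered[model(question, reordered)]
        if choice != baseline:
            violations.append((reordered, choice))
    return violations


def first_option_model(question: str, options: Sequence[str]) -> int:
    """Hypothetical position-biased stand-in: always picks the first option."""
    return 0


def content_model(question: str, options: Sequence[str]) -> int:
    """Hypothetical order-insensitive stand-in: always finds 'Paris'."""
    return list(options).index("Paris")


question = "Which city is the capital of France?"
options = ["Paris", "Lyon", "Nice"]

print(len(check_order_invariance(first_option_model, question, options)))
print(len(check_order_invariance(content_model, question, options)))
```

The check never needs to know that 'Paris' is correct. It only needs the answer to stay the same when the options are shuffled, which is why the approach suits cases where ground truth is expensive.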
Limitations
An MR that holds cannot prove correctness — it only catches the violations its chosen relations can express. Designing good MRs is the hard part and needs domain insight; weak or wrong MRs produce false alarms or miss faults. Some violations are also legitimate behaviour (a paraphrase that genuinely changes meaning), so results still need human triage.
Method yield
- Findings: 4
- Versions spanned: 7
- Yield score: 11
Severity-weighted across the published findings below.
Findings it surfaces (4)
Documented failures this method catches — the evidence it works.
- Inconsistent answers to semantically equivalent prompts (severity: Medium)
Trivial rewordings of the same question yield materially different answers.
How it found it: Metamorphic relation — paraphrase invariance: a reworded but semantically equivalent prompt should produce the same answer. The model returns different answers to equivalent phrasings, violating the relation. This is the canonical LLM metamorphic relation.
Category: Reasoning
- Failure to honor negation in instructions (severity: Medium)
Models frequently do the opposite of a 'do not' instruction, or ignore the negation entirely.
How it found it: Metamorphic relation — negation: negating the premise or instruction should flip the expected answer. The model preserves the original answer, ignoring the negation.
Category: Reasoning
- The reversal curse: 'A is B' not generalizing to 'B is A' (severity: Medium)
A model trained that 'A is B' frequently fails to answer 'B is ?', revealing that learned relations are not symmetric.
How it found it: Metamorphic relation — symmetry: a model that accepts 'A is B' should also accept 'B is A'. Having learned one direction, it fails the reverse.
Category: Reasoning
- Reasoning model mixes languages on non-English/Chinese queries (severity: Low)
DeepSeek-R1 is optimized for English and Chinese and can mix languages mid-output on queries in other languages — its own paper flags this.
How it found it: Translating the query should not change the language consistency of the answer; when it does, the relation is violated.
Category: Other
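The language-consistency relation in the last finding can be approximated without a language-identification library by counting Unicode scripts in the answer. This is a crude sketch under stated assumptions: script is only a proxy for language (it cannot tell Spanish from English), the 0.9 threshold is arbitrary, the function names are hypothetical, and the answer strings are hand-written illustrations, not real model outputs.

```python
import unicodedata
from collections import Counter


def scripts_used(text: str) -> Counter:
    """Count alphabetic characters per script, using the first word of the Unicode name."""
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1  # e.g. 'LATIN', 'CYRILLIC', 'CJK'
    return counts


def dominant_share(text: str) -> float:
    """Fraction of alphabetic characters that belong to the most common script."""
    counts = scripts_used(text)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 1.0


def language_consistency_holds(
    source_answer: str, follow_up_answer: str, threshold: float = 0.9
) -> bool:
    """Relation: if the source answer is single-script, the follow-up answer must be too."""
    if dominant_share(source_answer) < threshold:
        return True  # source already mixed, so this relation makes no claim
    return dominant_share(follow_up_answer) >= threshold


# Hand-written illustrative answers (assumptions, not recorded model outputs).
answer_to_english_query = "The capital of France is Paris."
clean_answer_to_russian_query = "Столица Франции: Париж."
mixed_answer_to_russian_query = "Столица Франции 是巴黎, the capital."

print(language_consistency_holds(answer_to_english_query, clean_answer_to_russian_query))
print(language_consistency_holds(answer_to_english_query, mixed_answer_to_russian_query))
```

A production check would more plausibly use a proper language identifier and would need an allowance for legitimate mixing, such as quoted terms, code, or proper nouns, before flagging a violation.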
References & further reading
- Metamorphic Testing: A Review of Challenges and Opportunities
Chen, Kuo, Liu, Poon, Towey, Tse, Zhou · ACM Computing Surveys · January 1, 2018
- Hallucination Detection in Large Language Models with Metamorphic Relations
Borui Yang, Md Afif Al Mamun, Jie M. Zhang, Gias Uddin · arXiv · February 20, 2025
- Metamorphic Testing of Large Language Models for Natural Language Processing
Steven Cho, Stefano Ruberto, Valerio Terragni · arXiv · November 3, 2025
Cite this
Qlarify Labs. (2026). Metamorphic testing. Retrieved from https://labs.qlarify.fi/methods/metamorphic-testing