Paper · High credibility · arXiv · Steven Cho, Stefano Ruberto, Valerio Terragni · November 3, 2025

Metamorphic Testing of Large Language Models for Natural Language Processing

Our summary

A large-scale study applying metamorphic testing to LLMs on NLP tasks: the authors collect 191 metamorphic relations from the literature, implement 36, and run roughly 560,000 metamorphic tests across three LLMs to surface incorrect behaviour without labelled oracles.
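As a rough illustration of what a single metamorphic test looks like, the sketch below pairs an input transformation with an expected relation between the two outputs, so no labelled answer is needed. Everything in it is an assumption made for illustration: `toy_sentiment_model` is a hypothetical stand-in for an LLM call, and the two relations are simple examples, not taken from the 36 relations the authors implement.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class MetamorphicRelation:
    name: str
    # Builds the follow-up input from the source input.
    transform: Callable[[str], str]
    # Checks the expected relation between source and follow-up outputs.
    holds: Callable[[str, str], bool]


def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for an LLM call.

    It is deliberately case-sensitive so that one relation is violated.
    """
    return "positive" if "good" in text else "negative"


def run_metamorphic_tests(
    model: Callable[[str], str],
    relations: List[MetamorphicRelation],
    inputs: List[str],
) -> List[Tuple[str, str, str]]:
    """Return (relation name, source input, follow-up input) for each violation."""
    violations = []
    for mr in relations:
        for source in inputs:
            follow_up = mr.transform(source)
            # No ground-truth label is used: only the two outputs are compared.
            if not mr.holds(model(source), model(follow_up)):
                violations.append((mr.name, source, follow_up))
    return violations


# Illustrative relations (assumptions, not from the paper): the predicted
# label should not change under these meaning-preserving edits.
RELATIONS = [
    MetamorphicRelation("uppercase_invariance", str.upper, lambda a, b: a == b),
    MetamorphicRelation(
        "trailing_whitespace_invariance", lambda s: s + "  ", lambda a, b: a == b
    ),
]

INPUTS = ["The food was good.", "The service was slow."]

violations = run_metamorphic_tests(toy_sentiment_model, RELATIONS, INPUTS)
for name, source, follow_up in violations:
    print(f"{name}: {source!r} vs {follow_up!r}")
```

With a real model, `toy_sentiment_model` would be replaced by a function that sends the prompt to the LLM and parses its answer. The runner and the relation definitions would stay the same.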

Why it matters

Shows that metamorphic testing scales to modern LLMs and yields concrete failures, giving direct evidence that the method works.

Cited by these methods

🔬 Counterfactual bias probing · 🔬 Metamorphic testing · 🔬 Self-consistency probing

Related findings (2)

Failure to honor negation in instructions · Medium
Models frequently do the opposite of a 'do not' instruction, or ignore the negation entirely.
Inconsistent answers to semantically equivalent prompts · Medium
Trivial rewordings of the same question yield materially different answers.

Published June 26, 2026

Cite this

Qlarify Labs. (2026). Metamorphic Testing of Large Language Models for Natural Language Processing. Retrieved from https://labs.qlarify.fi/references/metamorphic-testing-of-llms-nlp-2025