Property-basedEstablished

Logic & consistency testing

Check that the model's outputs obey the rules of logic — valid inference, transitivity, symmetry, no self-contradiction — across related questions.

Published June 26, 2026

Reasoning failure Evals

How it works

Fluent prose can hide broken reasoning. Logic and consistency testing asserts the properties any sound reasoner must satisfy: if A>B and B>C then A>C, if 'A is B' then 'B is A' should be answerable, a negated instruction should invert the result, and claims made early in a session shouldn't be contradicted later. Posing structured sets of related questions and checking these invariants exposes reasoning failures that no single answer, taken alone, would reveal.

When to use it

Evaluating reasoning quality; multi-step or multi-turn tasks where internal consistency matters; catching relational and negation failures.

Limitations

Captures logical form, not real-world correctness — a model can be perfectly consistent and consistently wrong — and enumerating the relevant invariants for an open-ended task is hard.

Method yield

Findings: 3
Versions spanned: 6
Yield score: 8

2 Medium1 Low

Severity-weighted across the published findings below. Why we measure this →

Findings it surfaces (3)

Documented failures this method catches — the evidence it works.

References & further reading

Measuring and Improving Consistency in Pretrained Language Models
Yanai Elazar, Nora Kassner, Shauli Ravfogel, et al. · TACL 2021 (arXiv:2102.01017) · February 1, 2021

Cite this

Qlarify Labs. (2026). Logic & consistency testing. Retrieved from https://labs.qlarify.fi/methods/logic-consistency-testing