Property-based testing
Specify invariants the output must always satisfy, then generate many inputs automatically and check the invariant on each.
Published June 26, 2026
How it works
Rather than hand-writing examples, you declare properties ('output is valid JSON matching the schema', 'the sum equals the total', 'no field is null') and let a generator hammer the system with inputs. This scales coverage far beyond hand-written cases, and a failing input can be shrunk to a minimal reproduction.
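The loop described above can be sketched with nothing but the Python standard library. This is a minimal illustration under stated assumptions, not a reference implementation: `render_invoice`, `buggy_render_invoice`, and the invoice schema are hypothetical stand-ins for the system under test, and the greedy shrinker is far simpler than what dedicated libraries such as Hypothesis provide.

```python
import json
import random


def render_invoice(prices):
    """Hypothetical system under test: returns an invoice as a JSON string."""
    return json.dumps({"items": prices, "total": sum(prices)})


def buggy_render_invoice(prices):
    """Deliberately broken variant: only the first five items reach the total."""
    return json.dumps({"items": prices, "total": sum(prices[:5])})


def violated_properties(raw, prices):
    """Return the names of the properties that the raw output violates."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is valid JSON"]
    if not isinstance(doc, dict) or set(doc) != {"items", "total"}:
        return ["output matches the schema"]
    violated = []
    if any(value is None for value in doc.values()):
        violated.append("no field is null")
    if doc["total"] != sum(prices):
        violated.append("the sum equals the total")
    return violated


def find_failure(system, trials=200, seed=0):
    """Generate random inputs and return the first one that breaks a property."""
    rng = random.Random(seed)
    for _ in range(trials):
        prices = [rng.randint(1, 10_000) for _ in range(rng.randint(0, 20))]
        if violated_properties(system(prices), prices):
            return prices
    return None


def shrink(system, prices):
    """Greedily drop one item at a time while the failure still reproduces."""
    current = list(prices)
    progressed = True
    while progressed:
        progressed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if violated_properties(system(candidate), candidate):
                current = candidate
                progressed = True
                break
    return current


failing = find_failure(buggy_render_invoice)
minimal = shrink(buggy_render_invoice, failing)
print(len(failing), "items shrunk to", len(minimal))
```

For the deliberately broken variant, the shrinker stops at a six-item list: the smallest input on which the truncated total can differ from the true sum. The correct `render_invoice` passes every trial, so `find_failure` returns `None` for it.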
When to use it
Structured output, tool/function calling, anything with checkable invariants.
Limitations
Only as good as the properties you state; semantic correctness often can't be captured as a simple invariant.
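A small sketch of this limitation, using only the standard library: a weak property can pass on every generated input while the system is plainly wrong. `broken_multiply` is a hypothetical faulty multiplier invented for illustration; it keeps only the last three digits of the product.

```python
import random


def broken_multiply(a, b):
    """Hypothetical faulty multiplier: keeps only the last three digits."""
    return (a * b) % 1000


def commutes(a, b):
    """Weak property: the order of the operands does not matter."""
    return broken_multiply(a, b) == broken_multiply(b, a)


def matches_reference(a, b):
    """Stronger property: the result agrees with a trusted implementation."""
    return broken_multiply(a, b) == a * b


def holds_for_all(prop, trials=500, seed=0):
    """Check a two-argument property on randomly generated operand pairs."""
    rng = random.Random(seed)
    return all(
        prop(rng.randint(100, 9999), rng.randint(100, 9999))
        for _ in range(trials)
    )


print("commutes:", holds_for_all(commutes))
print("matches reference:", holds_for_all(matches_reference))
```

The commutativity check passes on every trial even though each product is wrong, while the comparison against a reference fails immediately. The test suite is only as informative as the strongest property it states.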
Method yield
- Findings: 6
- Versions spanned: 7
- Yield score: 17
Severity-weighted across the published findings below.
Findings it surfaces (6)
Documented failures this method catches — the evidence it works.
- Errors in multi-digit arithmetic (Medium)
  Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
  How it found it: The invariant a*b == b*a and checks against known products fail.
  Category: Reasoning
- Date and duration arithmetic errors (Low)
  Models miscompute differences between dates, weekdays, and durations across boundaries like months and leap years.
  How it found it: Generate date pairs; check against a real date library.
  Category: Reasoning
- Miscounting items in long lists (Low)
  Counts of items, occurrences, or matches in long inputs drift as list length grows.
  Category: Reasoning
- Format-constraint violations under strict schemas (Medium)
  Asked for strictly formatted output (e.g. JSON to a schema), models emit invalid or extra content.
  How it found it: Validate every output against the JSON schema.
  Category: Tool use
- Hallucinated tool/function arguments (High)
  When calling tools, models invent argument values or call functions that weren't provided.
  How it found it: Assert that the arguments satisfy the tool's schema and that the called function exists.
  Category: Tool use
- Reasoning model regresses on tool use versus its base model (Medium)
  DeepSeek-R1 falls short of the base DeepSeek-V3 on function calling, multi-turn, complex role-play and JSON output: a reasoning-tuned model trading away tool-use reliability, which was later restored in R1-0528.
  How it found it: Validating outputs against the JSON schema exposes the regression.
  Category: Tool use
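As one illustration of how such a finding can be checked, the date-arithmetic case (generate date pairs, compare against a real date library) might look like the sketch below. It makes several assumptions: `stub_model_days_between` is a hypothetical stand-in for a parsed model answer and is not how any particular model behaves, and the date range and trial count are arbitrary. In a real harness the answer function would call the model and parse its reply.

```python
import random
from datetime import date, timedelta


def random_date(rng):
    """Pick a date within 20,000 days after 1990-01-01 (arbitrary range)."""
    return date(1990, 1, 1) + timedelta(days=rng.randrange(20_000))


def reference_days_between(a, b):
    """Trusted answer from the standard library."""
    return (b - a).days


def stub_model_days_between(a, b):
    """Hypothetical stand-in for a model: assumes 365-day years, 30-day months."""
    return (b.year - a.year) * 365 + (b.month - a.month) * 30 + (b.day - a.day)


def find_counterexamples(answer_fn, trials=300, seed=0):
    """Return every generated date pair where answer_fn disagrees with the reference."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        a, b = sorted((random_date(rng), random_date(rng)))
        if answer_fn(a, b) != reference_days_between(a, b):
            failures.append((a, b))
    return failures


failures = find_counterexamples(stub_model_days_between)
print(len(failures), "of 300 generated pairs disagree with the reference")
```

The same shape carries over to the other findings: swap the generator (lists, prompts, tool definitions) and swap the oracle (a counter, a JSON schema validator, a lookup of the provided tools).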
Cite this
Qlarify Labs. (2026). Property-based testing. Retrieved from https://labs.qlarify.fi/methods/property-based-testing