Boundary & edge-case testing
Push inputs to their limits (very long contexts, token boundaries, empty or extreme values), where behavior tends to degrade sharply.
Published June 26, 2026
How it works
Failures cluster at boundaries: the edge of the context window, unusually long lists, zero/one/many counts, maximum output length. Systematically walking inputs toward these limits exposes degradation that mid-range testing misses.
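One way to walk inputs toward a limit is to generate sizes that cover zero/one/many and then close in on the limit from below before stepping just past it. The sketch below is illustrative only: the function name `boundary_sizes` and the halving schedule are assumptions, not part of the published method, and `limit` can stand for any bound (context length in tokens, list length, maximum output length).

```python
from typing import List


def boundary_sizes(limit: int, steps: int = 3) -> List[int]:
    """Return input sizes clustered near zero and near `limit`.

    Hypothetical helper: `limit` is whatever bound is under test.
    """
    sizes = {0, 1, 2}                            # zero / one / many
    sizes.update({limit - 1, limit, limit + 1})  # straddle the limit itself
    gap = limit
    for _ in range(steps):
        gap //= 2                                # halve the distance to the limit
        sizes.add(limit - gap)
    return sorted(s for s in sizes if s >= 0)    # drop negatives for tiny limits


print(boundary_sizes(1000))
# [0, 1, 2, 500, 750, 875, 999, 1000, 1001]
```

Each returned size would then drive one test input (for example, a list with that many items), so that the sampling is densest where failures are expected to cluster.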
When to use it
Use it whenever input size, length, or count varies in production, especially for long-context and structured-output features.
Limitations
Boundaries shift between model versions; tests need re-validation after upgrades.
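A re-validation check after an upgrade can be sketched as comparing where the pass rate first drops below a chosen floor in each version. Everything here is an assumption made for illustration: the names `degradation_point` and `boundary_shifted`, the 0.9 floor, and the idea that results are stored as a mapping from input size to pass rate.

```python
from typing import Dict, Optional


def degradation_point(pass_rates: Dict[int, float], floor: float = 0.9) -> Optional[int]:
    """Smallest tested input size whose pass rate is below `floor`, else None."""
    for size in sorted(pass_rates):
        if pass_rates[size] < floor:
            return size
    return None


def boundary_shifted(old: Dict[int, float], new: Dict[int, float], floor: float = 0.9) -> bool:
    """True when the degradation point differs between two model versions."""
    return degradation_point(old, floor) != degradation_point(new, floor)


# Placeholder numbers, not measured results.
old_version = {1000: 0.98, 4000: 0.95, 8000: 0.80}
new_version = {1000: 0.99, 4000: 0.85, 8000: 0.70}
print(boundary_shifted(old_version, new_version))  # True: 8000 vs 4000
```

A shift in either direction is worth flagging: a boundary that moved outward means existing tests may no longer reach the failure region.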
Method yield
- Findings: 7
- Versions spanned: 6
- Yield score: 17
The yield score is severity-weighted across the published findings below.
Findings it surfaces (7)
Documented failures this method catches — the evidence it works.
- Character-counting errors in tokenized words (Low severity; Reasoning)
  Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
  How it found it: Longer words with repeated letters increase failure rate.
- Errors in multi-digit arithmetic (Medium severity; Reasoning)
  Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
  How it found it: Error rate climbs with operand length.
- Lost in the middle: degraded recall for mid-context information (Medium severity; Reasoning)
  Retrieval accuracy is highest for facts at the start and end of a long context and drops for facts in the middle.
- Miscounting items in long lists (Low severity; Reasoning)
  Counts of items, occurrences, or matches in long inputs drift as list length grows.
  How it found it: Error grows with list length.
- Format-constraint violations under strict schemas (Medium severity; Tool use)
  Asked for strictly-formatted output (e.g. JSON to a schema), models emit invalid or extra content.
  How it found it: Failure rate rises with schema depth.
- Spatial and geometric reasoning errors (Low severity; Reasoning)
  Models struggle with relative positions, rotations, and simple geometric/visual reasoning.
- Repetition and degeneration loops (Low severity; Other)
  Under certain prompts or long generations, models fall into repeating phrases or degenerate text.
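Several findings in this list were surfaced by sweeping one size parameter and watching the error rate. A minimal sketch of such a sweep for the multi-digit arithmetic case follows. It assumes a hypothetical `ask_model` callable (prompt text in, answer text out) standing in for whatever client is actually used; the prompt wording, operand lengths, and trial count are placeholders as well.

```python
import random
from typing import Callable, Dict, Sequence


def error_rate_by_operand_length(
    ask_model: Callable[[str], str],      # hypothetical model client
    lengths: Sequence[int] = (1, 2, 4, 8),
    trials: int = 20,
    seed: int = 0,
) -> Dict[int, float]:
    """Fraction of wrong multiplication answers at each operand digit length."""
    rng = random.Random(seed)             # fixed seed keeps the sweep repeatable
    rates: Dict[int, float] = {}
    for n in lengths:
        lo, hi = 10 ** (n - 1), 10 ** n - 1   # range of n-digit integers
        errors = 0
        for _ in range(trials):
            a, b = rng.randint(lo, hi), rng.randint(lo, hi)
            reply = ask_model(f"What is {a} * {b}? Answer with digits only.")
            if reply.strip() != str(a * b):   # ground truth computed in code
                errors += 1
        rates[n] = errors / trials
    return rates


def exact_stub(prompt: str) -> str:
    """Stand-in 'model' that always answers correctly, for a dry run."""
    a, b = prompt.split("?")[0].replace("What is ", "").split(" * ")
    return str(int(a) * int(b))


print(error_rate_by_operand_length(exact_stub))
```

The same loop shape applies to the other size-driven findings (word length, list length, schema depth): vary one parameter, compute the expected answer in code, and record the error rate per step.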
Cite this
Qlarify Labs. (2026). Boundary & edge-case testing. Retrieved from https://labs.qlarify.fi/methods/boundary-testing