Findings
Documented limitations, weaknesses and failures of AI systems — evidence-first and linked to the method that found each one. Public entries are reviewed before publishing.
15 findings
- High · Hallucination · Vendor-acknowledged · Repro: Often
Reasoning model knowingly fabricates unverifiable references
OpenAI's o1 system card reports 'intentional hallucinations' (0.04% of responses): the model invents references it can't verify, with chain-of-thought evidence it knew the information was made up.
🔬 Factual oracle verification · 🔬 Chain-of-thought faithfulness probing · Hallucination · Reasoning failure

- Medium · Reasoning · Reviewer-confirmed · Repro: Often
Errors in multi-digit arithmetic
Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
🔬 Boundary & edge-case testing · 🔬 Property-based testing · 🔬 Benchmark evaluation · Reasoning failure

- Medium · Reasoning · Reviewer-confirmed · Repro: Sometimes
Failure to honor negation in instructions
Models frequently do the opposite of a 'do not' instruction, or ignore the negation entirely.
🔬 Metamorphic testing · 🔬 Logic & consistency testing · Reasoning failure

- Medium · Reasoning · Reviewer-confirmed · Repro: Often
Inconsistent answers to semantically equivalent prompts
Trivial rewordings of the same question yield materially different answers.
🔬 Self-consistency probing · 🔬 Metamorphic testing · 🔬 Perturbation testing · Reasoning failure · Evals

- Medium · Reasoning · Reviewer-confirmed · Repro: Sometimes
Model behavior drifts between versions — a fixed task can regress
The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.
🔬 Drift & decay monitoring · 🔬 Differential testing · Reasoning failure · Evals · Drift

- Medium · Reasoning · Vendor-acknowledged · Repro: Often
Reasoning model degrades under few-shot prompting
DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot — inverting the usual assumption that examples help.
🔬 Perturbation testing · 🔬 Differential testing · Reasoning failure · Robustness

- Medium · Reasoning · Reviewer-confirmed · Repro: Often
The reversal curse: 'A is B' not generalizing to 'B is A'
A model trained on 'A is B' frequently fails to answer 'B is ?', revealing that learned relations are not symmetric.
🔬 Differential testing · 🔬 Metamorphic testing · 🔬 Logic & consistency testing · Reasoning failure

- Medium · Other · Reviewer-confirmed · Repro: Sometimes
Unfaithful chain-of-thought reasoning
The stated step-by-step reasoning does not reflect the actual cause of the answer.
🔬 Chain-of-thought faithfulness probing · Reasoning failure · Evals

- Medium · Other · Vendor-acknowledged · Repro: Sometimes
Vendor cautions its reasoning model's chain-of-thought may be unfaithful
OpenAI's o1 system card states its chain-of-thought 'may not be fully legible and faithful… even now' — the developer itself warns the displayed reasoning can't be trusted as the real cause.
🔬 Chain-of-thought faithfulness probing · Reasoning failure · Evals

- Low · Reasoning · Reviewer-confirmed · Repro: Often
Character-counting errors in tokenized words
Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
🔬 Boundary & edge-case testing · 🔬 Perturbation testing · Reasoning failure

- Low · Reasoning · Reviewer-confirmed · Repro: Often
Date and duration arithmetic errors
Models miscompute differences between dates, weekdays, and durations across boundaries like months and leap years.
🔬 Property-based testing · Reasoning failure

- Low · Reasoning · Reviewer-confirmed · Repro: Often
Miscounting items in long lists
Counts of items, occurrences, or matches in long inputs drift as list length grows.
🔬 Boundary & edge-case testing · 🔬 Property-based testing · Reasoning failure

- Low · Other · Vendor-acknowledged · Repro: Sometimes
Reasoning model mixes languages on queries that are neither English nor Chinese
DeepSeek-R1 is optimized for English and Chinese and can mix languages mid-output on queries in other languages — its own paper flags this.
🔬 Differential testing · 🔬 Metamorphic testing · Reasoning failure · Multilingual

- Low · Reasoning · Reviewer-confirmed · Repro: Sometimes
Self-contradiction within a single conversation
Models assert one fact and later assert its opposite within the same session.
🔬 Self-consistency probing · 🔬 Logic & consistency testing · Reasoning failure

- Low · Reasoning · Reviewer-confirmed · Repro: Often
Spatial and geometric reasoning errors
Models struggle with relative positions, rotations, and simple geometric/visual reasoning.
🔬 Boundary & edge-case testing · Reasoning failure · Multimodal
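
Several of the findings above (multi-digit arithmetic, character counting, date and duration arithmetic) have answers that can be computed exactly, so they lend themselves to an oracle-style check. The following is a minimal sketch of that idea, not the method used to produce these entries. The `AskModel` type, the prompt wording, and all function names are assumptions made for illustration; `ask_model` stands in for whatever client wraps the model under test.

```python
import random
from datetime import date
from typing import Callable, List, Tuple

# Hypothetical interface: takes a prompt, returns the model's text reply.
AskModel = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer as a string)


def arithmetic_case(rng: random.Random, digits: int = 6) -> Case:
    """Random multi-digit multiplication with an exactly computed answer."""
    a = rng.randrange(10 ** (digits - 1), 10 ** digits)
    b = rng.randrange(10 ** (digits - 1), 10 ** digits)
    return f"What is {a} * {b}? Reply with the number only.", str(a * b)


def letter_count_case(word: str, letter: str) -> Case:
    """Character-counting question; the oracle counts characters, not tokens."""
    prompt = (
        f"How many times does the letter '{letter}' appear in '{word}'? "
        "Reply with the number only."
    )
    return prompt, str(word.count(letter))


def date_diff_case(start: date, end: date) -> Case:
    """Day difference between two dates, computed with the standard library."""
    prompt = (
        f"How many days are there from {start.isoformat()} to "
        f"{end.isoformat()}? Reply with the number only."
    )
    return prompt, str((end - start).days)


def run_cases(ask_model: AskModel, cases: List[Case]) -> List[Tuple[str, str, str]]:
    """Return (prompt, expected, got) for every case the model answers wrongly."""
    failures = []
    for prompt, expected in cases:
        got = ask_model(prompt).strip()
        if got != expected:
            failures.append((prompt, expected, got))
    return failures
```

A seeded `random.Random` keeps the generated arithmetic cases reproducible across runs, and a date pair that crosses a leap-day boundary (for example 2024-02-28 to 2024-03-01) exercises the kind of boundary named in the date arithmetic finding. Exact string comparison is deliberately strict; a real harness would likely need more tolerant answer parsing.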