Findings
Documented limitations, weaknesses, and failures of AI systems, each backed by evidence and linked to the method that found it. Public entries are reviewed before publishing.
8 findings
- Medium · Reasoning · Reviewer-confirmed · Repro: Often
Inconsistent answers to semantically equivalent prompts
Trivial rewordings of the same question yield materially different answers.
🔬 Self-consistency probing · 🔬 Metamorphic testing · 🔬 Perturbation testing · Reasoning failure · Evals
- Medium · Reasoning · Reviewer-confirmed · Repro: Sometimes
Model behavior drifts between versions — a fixed task can regress
The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.
🔬 Drift & decay monitoring · 🔬 Differential testing · Reasoning failure · Evals · Drift
- Medium · Hallucination · Reviewer-confirmed · Repro: Often
Poor uncertainty calibration / overconfidence
Stated confidence does not track accuracy; models sound equally certain when right and wrong.
🔬 Factual oracle verification · 🔬 Self-consistency probing · 🔬 Distributional testing (KS test, Monte Carlo) · Hallucination · Evals
- Medium · Bias · Reviewer-confirmed · Repro: Often
Sycophancy: agreeing with a user's incorrect assertions
Models tend to revise correct answers to match a user who pushes back or states a wrong belief.
🔬 Counterfactual bias probing · 🔬 Adversarial prompting · Bias · Evals
- Medium · Other · Reviewer-confirmed · Repro: Sometimes
Unfaithful chain-of-thought reasoning
The stated step-by-step reasoning does not reflect the actual cause of the answer.
🔬 Chain-of-thought faithfulness probing · Reasoning failure · Evals
- Medium · Other · Vendor-acknowledged · Repro: Sometimes
Vendor cautions its reasoning model's chain-of-thought may be unfaithful
OpenAI's o1 system card states that its chain-of-thought 'may not be fully legible and faithful… even now'; the developer itself warns that the displayed reasoning may not reflect the real cause of the answer.
🔬 Chain-of-thought faithfulness probing · Reasoning failure · Evals
- Low · Other · Reviewer-confirmed · Repro: Rare
Anomalous behavior on glitch tokens
Certain under-trained tokens cause models to emit nonsense, evade instructions, or behave erratically.
🔬 Glitch-token & unicode fuzzing · Safety · Evals
- Low · Other · Reviewer-confirmed · Repro: Rare
Repetition and degeneration loops
Under certain prompts or long generations, models fall into repeating phrases or degenerate text.
🔬 Boundary & edge-case testing · 🔬 Glitch-token & unicode fuzzing · 🔬 Chaos engineering for AI systems · Evals
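
As an illustration of how the first finding (inconsistent answers to semantically equivalent prompts) might be probed, here is a minimal metamorphic-style sketch in Python. It is not the method used for these entries. The `ask` callable, the `normalize` rule, and the `stub_model` are all hypothetical stand-ins introduced for this example; a real probe would call an actual model client and would likely need a more careful notion of answer equivalence.

```python
from collections import Counter
from typing import Callable, Iterable


def normalize(answer: str) -> str:
    """Crude canonical form: trim, lowercase, drop trailing punctuation."""
    return answer.strip().lower().rstrip(".!?")


def agreement_rate(ask: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts whose normalized answer matches the most common one.

    `ask` is a hypothetical stand-in for whatever client queries the model.
    """
    answers = [normalize(ask(p)) for p in prompts]
    if not answers:
        raise ValueError("need at least one prompt")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


# Stub model for illustration only (an assumption, not a real system):
# it changes its answer when the question does not start with "What is".
def stub_model(prompt: str) -> str:
    return "Canberra." if prompt.startswith("What is") else "Sydney"


# Rewordings of one question; a consistent model should agree on all of them.
paraphrases = [
    "What is the capital of Australia?",
    "What is Australia's capital city?",
    "What is the capital city of Australia?",
    "Which city is the capital of Australia?",
]
rate = agreement_rate(stub_model, paraphrases)
print(f"agreement rate: {rate:.2f}")
```

An agreement rate below 1.0 flags the prompt set for review. Exact string matching after normalization is a deliberate simplification: it works for short factual answers but would over-report disagreement on free-form text.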