Findings
Documented limitations, weaknesses and failures of AI systems — evidence-first and linked to the method that found each one. Public entries are reviewed before publishing.
5 findings
- High | Hallucination | Reviewer-confirmed | Repro: Often
Fabricated citations and references
Models invent plausible-looking but non-existent papers, authors, DOIs and URLs.
🔬 Factual oracle verification | 🔬 Model-graded evaluation (LLM-as-judge) | Hallucination
- High | Hallucination | Reviewer-confirmed | Repro: Often
Fabrication instead of admitting uncertainty
Asked about something unknown or non-existent, models invent an answer rather than saying 'I don't know'.
🔬 Factual oracle verification | 🔬 Hallucination triggering | Hallucination
- High | Hallucination | Vendor-acknowledged | Repro: Often
Reasoning model knowingly fabricates unverifiable references
OpenAI's o1 system card reports 'intentional hallucinations' (0.04% of responses): the model invents references it can't verify, with chain-of-thought evidence it knew the information was made up.
🔬 Factual oracle verification | 🔬 Chain-of-thought faithfulness probing | Hallucination | Reasoning failure
- Medium | Hallucination | Reviewer-confirmed | Repro: Often
Poor uncertainty calibration / overconfidence
Stated confidence does not track accuracy; models sound equally certain when right and wrong.
🔬 Factual oracle verification | 🔬 Self-consistency probing | 🔬 Distributional testing (KS test, Monte Carlo) | Hallucination | Evals
- Low | Hallucination | Reviewer-confirmed | Repro: Often
Confusion about knowledge cutoff and current date
Models misstate their own knowledge cutoff or the current date, and answer about post-cutoff events with stale or invented information.
🔬 Factual oracle verification | 🔬 Hallucination triggering | Hallucination
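
One common way to put a number on the "Poor uncertainty calibration / overconfidence" finding above is expected calibration error (ECE): group answers by the confidence the model stated, then compare each group's average confidence with how often it was actually right. The sketch below is illustrative only and is not the procedure behind that entry; the function name `expected_calibration_error`, the ten equal-width bins, and the input format (a confidence in [0, 1] plus a correctness flag per answer) are all assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Weighted mean gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Assign each answer to a bin by its stated confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes in proportion to how many answers it holds.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

Under these assumptions, a model that states 0.95 confidence on four answers and gets two of them right scores about 0.45, while a model whose stated confidence matches its hit rate in every bin scores 0. A value near 0 on a small sample is weak evidence, so the bin count and sample size are worth reporting alongside the score.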