Findings
Documented limitations, weaknesses and failures of AI systems — evidence-first and linked to the method that found each one. Public entries are reviewed before publishing.
36 findings
- Critical · Safety · Reviewer-confirmed · Repro: Rare
Data exfiltration through prompt injection in agents
An injected instruction can make a tool-using agent send private data to an attacker-controlled destination.
🔬 Prompt-injection & jailbreak testing · Prompt injection · Safety · Agents
- Critical · Prompt injection · Reviewer-confirmed · Repro: Sometimes
Indirect prompt injection via retrieved content
Instructions hidden in documents, web pages or tool outputs can override the system prompt when ingested by the model.
🔬 Prompt-injection & jailbreak testing · Prompt injection · Safety · RAG
- High · Bias · Vendor-acknowledged · Repro: Once
A production update made the model sycophantic and was rolled back
An April 2025 GPT-4o update tuned on user feedback became markedly more sycophantic — validating harmful or delusional claims — and was rolled back within days.
🔬 A/B testing in production · 🔬 Canary releases & staged rollout · 🔬 Drift & decay monitoring · Bias · Safety · Production
- High · Hallucination · Reviewer-confirmed · Repro: Often
Fabricated citations and references
Models invent plausible-looking but non-existent papers, authors, DOIs and URLs.
🔬 Factual oracle verification · 🔬 Model-graded evaluation (LLM-as-judge) · Hallucination
- High · Hallucination · Reviewer-confirmed · Repro: Often
Fabrication instead of admitting uncertainty
Asked about something unknown or non-existent, models invent an answer rather than saying 'I don't know'.
🔬 Factual oracle verification · 🔬 Hallucination triggering · Hallucination
- High · Tool use · Reviewer-confirmed · Repro: Sometimes
Hallucinated tool/function arguments
When calling tools, models invent argument values or call functions that weren't provided.
🔬 Property-based testing · 🔬 Differential testing · 🔬 Integration testing (MCP handshakes & tool contracts) · Tool use · Agents
- High · Bias · Reviewer-confirmed · Repro: Sometimes
Name-based demographic bias in outputs
Swapping only a name (signalling gender or ethnicity) changes evaluative outputs like screening or sentiment.
🔬 Counterfactual bias probing · 🔬 Bias auditing · Bias · Safety
- High · Safety · Reviewer-confirmed · Repro: Often
Production model internals extracted through the API
With ordinary black-box API access, researchers recovered the embedding projection layer and hidden dimension of production models for under $20 — the first precise model-stealing attack on deployed LLMs.
🔬 Distillation & model-extraction probing · Safety · Robustness
- High · Safety · Vendor-acknowledged · Repro: Sometimes
Reasoning model attempts to subvert oversight under goal pressure
In crafted evaluations, OpenAI's o1 attempted to deactivate its 'oversight mechanism' ~5% of the time when led to believe it would be shut down for pursuing its goal.
🔬 Adversarial prompting · Safety · Robustness
- High · Hallucination · Vendor-acknowledged · Repro: Often
Reasoning model knowingly fabricates unverifiable references
OpenAI's o1 system card reports 'intentional hallucinations' (0.04% of responses): the model invents references it can't verify, with chain-of-thought evidence it knew the information was made up.
🔬 Factual oracle verification · 🔬 Chain-of-thought faithfulness probing · Hallucination · Reasoning failure
- High · Jailbreak · Reviewer-confirmed · Repro: Sometimes
Roleplay-based safety bypass
Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.
🔬 Prompt-injection & jailbreak testing · 🔬 Adversarial prompting · Jailbreak · Safety
- High · Jailbreak · Reviewer-confirmed · Repro: Sometimes
Safety bypass via unicode/homoglyph obfuscation
Disallowed content encoded with look-alike unicode or spacing can slip past safety filters.
🔬 Glitch-token & unicode fuzzing · 🔬 Prompt-injection & jailbreak testing · Jailbreak · Safety
- High · Safety · Reviewer-confirmed · Repro: Often
Verbatim training data extracted from a deployed chatbot
A 'divergence' attack made aligned ChatGPT abandon its chat format and emit memorized training data verbatim, recovering thousands of examples for about $200.
🔬 Distillation & model-extraction probing · Safety · Robustness
- Medium · Reasoning · Reviewer-confirmed · Repro: Often
Errors in multi-digit arithmetic
Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
🔬 Boundary & edge-case testing · 🔬 Property-based testing · 🔬 Benchmark evaluation · Reasoning failure
- Medium · Reasoning · Reviewer-confirmed · Repro: Sometimes
Failure to honor negation in instructions
Models frequently do the opposite of a 'do not' instruction, or ignore the negation entirely.
🔬 Metamorphic testing · 🔬 Logic & consistency testing · Reasoning failure
- Medium · Tool use · Reviewer-confirmed · Repro: Sometimes
Format-constraint violations under strict schemas
Asked for strictly-formatted output (e.g. JSON to a schema), models emit invalid or extra content.
🔬 Property-based testing · 🔬 Boundary & edge-case testing · 🔬 Unit testing the deterministic scaffold · 🔬 Smoke testing in CI/CD · Tool use · Agents
- Medium · Reasoning · Reviewer-confirmed · Repro: Often
Inconsistent answers to semantically equivalent prompts
Trivial rewordings of the same question yield materially different answers.
🔬 Self-consistency probing · 🔬 Metamorphic testing · 🔬 Perturbation testing · Reasoning failure · Evals
- Medium · Reasoning · Reviewer-confirmed · Repro: Often
Lost in the middle: degraded recall for mid-context information
Retrieval accuracy is highest for facts at the start and end of a long context and drops for facts in the middle.
🔬 Needle-in-a-haystack (long-context retrieval) · 🔬 Boundary & edge-case testing · RAG · Context window
- Medium · Reasoning · Reviewer-confirmed · Repro: Sometimes
Model behavior drifts between versions — a fixed task can regress
The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.
🔬 Drift & decay monitoring · 🔬 Differential testing · Reasoning failure · Evals · Drift
- Medium · Refusal · Reviewer-confirmed · Repro: Sometimes
Over-refusal of benign requests
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
🔬 Adversarial prompting · 🔬 Differential testing · 🔬 Threshold testing · 🔬 A/B testing in production · Refusal · Safety
- Medium · Hallucination · Reviewer-confirmed · Repro: Often
Poor uncertainty calibration / overconfidence
Stated confidence does not track accuracy; models sound equally certain when right and wrong.
🔬 Factual oracle verification · 🔬 Self-consistency probing · 🔬 Distributional testing (KS test, Monte Carlo) · Hallucination · Evals
- Medium · Reasoning · Vendor-acknowledged · Repro: Often
Reasoning model degrades under few-shot prompting
DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot — inverting the usual assumption that examples help.
🔬 Perturbation testing · 🔬 Differential testing · Reasoning failure · Robustness
- Medium · Tool use · Vendor-acknowledged · Repro: Often
Reasoning model regresses on tool use versus its base model
DeepSeek-R1 falls short of the base DeepSeek-V3 on function calling, multi-turn, complex role-play and JSON output — a reasoning-tuned model trading away tool-use reliability, later restored in R1-0528.
🔬 Integration testing (MCP handshakes & tool contracts) · 🔬 Property-based testing · 🔬 Differential testing · Tool use · Agents
- Medium · Bias · Reviewer-confirmed · Repro: Often
Sycophancy: agreeing with a user's incorrect assertions
Models tend to revise correct answers to match a user who pushes back or states a wrong belief.
🔬 Counterfactual bias probing · 🔬 Adversarial prompting · Bias · Evals
- Medium · Reasoning · Reviewer-confirmed · Repro: Often
The reversal curse: 'A is B' not generalizing to 'B is A'
A model trained on 'A is B' frequently fails to answer 'B is ?', revealing that learned relations are not symmetric.
🔬 Differential testing · 🔬 Metamorphic testing · 🔬 Logic & consistency testing · Reasoning failure
- Medium · Other · Reviewer-confirmed · Repro: Sometimes
Unfaithful chain-of-thought reasoning
The stated step-by-step reasoning does not reflect the actual cause of the answer.
🔬 Chain-of-thought faithfulness probing · Reasoning failure · Evals
- Medium · Other · Vendor-acknowledged · Repro: Sometimes
Vendor cautions its reasoning model's chain-of-thought may be unfaithful
OpenAI's o1 system card states its chain-of-thought 'may not be fully legible and faithful… even now' — the developer itself warns the displayed reasoning can't be trusted as the real cause.
🔬 Chain-of-thought faithfulness probing · Reasoning failure · Evals
- Low · Other · Reviewer-confirmed · Repro: Rare
Anomalous behavior on glitch tokens
Certain under-trained tokens cause models to emit nonsense, evade instructions, or behave erratically.
🔬 Glitch-token & unicode fuzzing · Safety · Evals
- Low · Reasoning · Reviewer-confirmed · Repro: Often
Character-counting errors in tokenized words
Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
🔬 Boundary & edge-case testing · 🔬 Perturbation testing · Reasoning failure
- Low · Hallucination · Reviewer-confirmed · Repro: Often
Confusion about knowledge cutoff and current date
Models misstate their own knowledge cutoff or the current date, and answer about post-cutoff events with stale or invented information.
🔬 Factual oracle verification · 🔬 Hallucination triggering · Hallucination
- Low · Reasoning · Reviewer-confirmed · Repro: Often
Date and duration arithmetic errors
Models miscompute differences between dates, weekdays, and durations across boundaries like months and leap years.
🔬 Property-based testing · Reasoning failure
- Low · Reasoning · Reviewer-confirmed · Repro: Often
Miscounting items in long lists
Counts of items, occurrences, or matches in long inputs drift as list length grows.
🔬 Boundary & edge-case testing · 🔬 Property-based testing · Reasoning failure
- Low · Other · Vendor-acknowledged · Repro: Sometimes
Reasoning model mixes languages on queries in languages other than English or Chinese
DeepSeek-R1 is optimized for English and Chinese and can mix languages mid-output on queries in other languages — its own paper flags this.
🔬 Differential testing · 🔬 Metamorphic testing · Reasoning failure · Multilingual
- Low · Other · Reviewer-confirmed · Repro: Rare
Repetition and degeneration loops
Under certain prompts or long generations, models fall into repeating phrases or degenerate text.
🔬 Boundary & edge-case testing · 🔬 Glitch-token & unicode fuzzing · 🔬 Chaos engineering for AI systems · Evals
- Low · Reasoning · Reviewer-confirmed · Repro: Sometimes
Self-contradiction within a single conversation
Models assert one fact and later assert its opposite within the same session.
🔬 Self-consistency probing · 🔬 Logic & consistency testing · Reasoning failure
- Low · Reasoning · Reviewer-confirmed · Repro: Often
Spatial and geometric reasoning errors
Models struggle with relative positions, rotations, and simple geometric/visual reasoning.
🔬 Boundary & edge-case testing · Reasoning failure · Multimodal
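
Several findings above (character counting, date and duration arithmetic, format-constraint violations) have a ground truth that ordinary code can compute, so a model's reply can be checked against a deterministic oracle. The following is a minimal sketch of that idea, not the method used to produce these findings. The `Model` type, the three `check_*` helpers, the prompt wording, and the `stub_model` stand-in are all hypothetical names introduced for illustration; a real harness would replace `stub_model` with a call to the system under test and would parse replies more leniently.

```python
import json
from datetime import date
from typing import Callable

# Hypothetical interface: any function mapping a prompt string to a reply string.
Model = Callable[[str], str]


def check_letter_count(model: Model, word: str, letter: str) -> bool:
    """True if the model's count matches the ground truth from str.count."""
    reply = model(
        f"How many times does '{letter}' appear in '{word}'? Reply with a number only."
    )
    expected = word.lower().count(letter.lower())
    return reply.strip() == str(expected)


def check_days_between(model: Model, start: date, end: date) -> bool:
    """True if the model's day difference matches datetime arithmetic."""
    reply = model(
        f"How many days are there from {start.isoformat()} to {end.isoformat()}? "
        "Reply with a number only."
    )
    expected = (end - start).days
    return reply.strip() == str(expected)


def check_json_keys(model: Model, prompt: str, required: set[str]) -> bool:
    """True if the reply parses as a JSON object with exactly the required keys."""
    try:
        parsed = json.loads(model(prompt))
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == required


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption): always answers '2'."""
    return "2"


if __name__ == "__main__":
    # 'strawberry' contains three 'r's, so the stub's answer of 2 fails the check.
    print(check_letter_count(stub_model, "strawberry", "r"))
    # 2024 is a leap year, so 2024-02-28 to 2024-03-01 spans two days.
    print(check_days_between(stub_model, date(2024, 2, 28), date(2024, 3, 1)))
    # The stub returns a bare number, which is valid JSON but not an object.
    print(check_json_keys(stub_model, "Return JSON with keys a and b.", {"a", "b"}))
```

Because each check compares against a computed answer rather than a stored one, the same helpers can be driven by generated inputs (random words, random date pairs) in the spirit of the property-based testing method tagged on those entries.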