OpenAI

GPT-4o

OpenAI's multimodal flagship for text, vision and audio. The linked findings catalog documented failure-mode classes observed across frontier models, including this one.

Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on GPT-4o, so per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.

Report card

Auto-derived from 27 linked findings (illustrative version attributions; see the attribution note above). Each category shows its worst severity, followed by its number of findings.

Safety: Critical (1×)
Prompt injection: Critical (1×)
Hallucination: High (4×)
Bias: High (3×)
Tool use: High (2×)
Jailbreak: High (2×)
Reasoning: Medium (10×)
Other: Medium (3×)
Refusal: Medium (1×)
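
The "worst severity per category" roll-up can be sketched in a few lines of Python. This is a minimal illustration, not Qlarify Labs' actual pipeline: the severity ordering (including a "Low" level that does not appear in the card), the function names, and the sample findings are all assumptions made for the example. The sort order (severity first, then finding count) is inferred from how the rows above are arranged.

```python
from collections import defaultdict

# Assumed ordering, worst = highest rank. "Low" is hypothetical; the card
# above only shows Critical, High and Medium.
SEVERITY_RANK = {"Critical": 3, "High": 2, "Medium": 1, "Low": 0}


def report_card(findings):
    """Map each category to (worst severity, number of findings)."""
    grouped = defaultdict(list)
    for category, severity in findings:
        grouped[category].append(severity)
    return {
        category: (max(severities, key=SEVERITY_RANK.__getitem__), len(severities))
        for category, severities in grouped.items()
    }


def format_card(card):
    """Render rows sorted by worst severity, then by finding count (descending)."""
    rows = sorted(
        card.items(),
        key=lambda item: (-SEVERITY_RANK[item[1][0]], -item[1][1]),
    )
    return [f"{category}: {severity} ({count}×)" for category, (severity, count) in rows]


# Hypothetical sample findings as (category, severity) pairs.
sample_findings = [
    ("Hallucination", "High"),
    ("Hallucination", "Medium"),
    ("Safety", "Critical"),
    ("Reasoning", "Medium"),
    ("Reasoning", "Low"),
]

for line in format_card(report_card(sample_findings)):
    print(line)
```

Because only the worst severity is kept, a category with many findings (Reasoning, with 10) can still rank below a category with a single Critical finding.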

Strengths

Fast multimodal reasoning; broad tool-use ecosystem; strong general instruction-following.

Known weaknesses

Tokenization-related errors (character counting, glitch tokens), arithmetic and counting limits, and the usual prompt-injection exposure when used as an agent.
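
Character-counting errors are easy to check because the ground truth is deterministic. The sketch below shows one way a counting oracle might look. It makes no API call: `model_answer` stands for whatever text the model returned, and all function names are hypothetical. Taking the first integer in the reply is a simplifying assumption that would need hardening for real transcripts.

```python
import re


def true_count(word: str, letter: str) -> int:
    """Ground-truth count of a letter in a word, case-insensitive."""
    return word.lower().count(letter.lower())


def parse_count(answer: str):
    """Extract the first integer from a free-text answer, or None if absent."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None


def check_letter_count(word: str, letter: str, model_answer: str) -> dict:
    """Compare the model's claimed count with the deterministic oracle."""
    expected = true_count(word, letter)
    claimed = parse_count(model_answer)
    return {"expected": expected, "claimed": claimed, "passed": claimed == expected}


# Hypothetical model reply, used only to exercise the check.
result = check_letter_count("strawberry", "r", "There are 2 r's in strawberry.")
print(result)
```

A reply with no digits yields `claimed=None` and therefore fails the check, which keeps unparseable answers from being counted as passes.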

Findings (27)

Methods that surface these

A/B testing in production
Adversarial prompting
Benchmark evaluation
Bias auditing
Boundary & edge-case testing
Canary releases & staged rollout
Chain-of-thought faithfulness probing
Chaos engineering for AI systems
Counterfactual bias probing
Differential testing
Distributional testing (KS test, Monte Carlo)
Drift & decay monitoring
Factual oracle verification
Glitch-token & unicode fuzzing
Hallucination triggering
Integration testing (MCP handshakes & tool contracts)
Logic & consistency testing
Metamorphic testing
Model-graded evaluation (LLM-as-judge)
Needle-in-a-haystack (long-context retrieval)
Perturbation testing
Prompt-injection & jailbreak testing
Property-based testing
Self-consistency probing
Smoke testing in CI/CD
Threshold testing
Unit testing the deterministic scaffold

Related references

Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark — AAAI 2024
BBQ: A Hand-Built Bias Benchmark for Question Answering — Findings of ACL 2022
Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs — arXiv
Counting Ability of Large Language Models and Impact of Tokenization — arXiv
Dated Data: Tracing Knowledge Cutoffs in Large Language Models — arXiv
Faith and Fate: Limits of Transformers on Compositionality — arXiv
Gorilla: Large Language Model Connected with Massive APIs — arXiv
Hallucination Detection in Large Language Models with Metamorphic Relations — arXiv
Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court — AI Incident Database
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models — arXiv
Lost in the Middle: How Language Models Use Long Contexts — arXiv
Measuring Faithfulness in Chain-of-Thought Reasoning — arXiv
Metamorphic Testing of Large Language Models for Natural Language Processing — arXiv
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — arXiv
OWASP Top 10 for Large Language Model Applications — OWASP
Prompt injection: what’s the worst that can happen? — Simon Willison’s Weblog
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation — arXiv
SolidGoldMagikarp (plus, prompt generation) — LessWrong
Survey of Hallucination in Natural Language Generation — ACM Computing Surveys
Sycophancy in GPT-4o: What Happened and What We're Doing About It — OpenAI
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning — arXiv
The Curious Case of Neural Text Degeneration — arXiv
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
Towards Understanding Sycophancy in Language Models — arXiv
TruthfulQA: Measuring How Models Mimic Human Falsehoods — arXiv
Universal and Transferable Adversarial Attacks on Aligned Language Models — arXiv
Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models — arXiv

Versions tracked

gpt-4o
gpt-4o-mini

Cite this

Qlarify Labs. (2026). OpenAI GPT-4o — known weaknesses. Retrieved from https://labs.qlarify.fi/models/gpt-4o