
OpenAI

GPT-4o

OpenAI's multimodal flagship for text, vision and audio. The linked findings catalog documents failure-mode classes observed across frontier models, including this one.

Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on GPT-4o; per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.

Report card

Auto-derived from 27 linked findings (illustrative version attributions — see note above) — worst severity per category.

Safety: Critical (1×)
Prompt injection: Critical (1×)
Hallucination: High (4×)
Bias: High (3×)
Tool use: High (2×)
Jailbreak: High (2×)
Reasoning: Medium (10×)
Other: Medium (3×)
Refusal: Medium (1×)
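
The "worst severity per category" derivation described above can be sketched roughly as follows. This is a minimal illustration, not Qlarify Labs' actual pipeline: the function name `report_card`, the input shape (a list of category and severity pairs), and the presence of a "Low" level are assumptions, and the sample data is made up. The count is assumed to be the number of findings in the category, which is consistent with the listed counts summing to 27.

```python
from collections import defaultdict

# Assumed ordering, least to most severe. Only Medium, High and Critical
# appear on this page; "Low" is an assumption added for completeness.
SEVERITY_ORDER = ["Low", "Medium", "High", "Critical"]


def report_card(findings):
    """Return {category: (worst_severity, finding_count)}.

    `findings` is an iterable of (category, severity) pairs.
    """
    rank = {name: i for i, name in enumerate(SEVERITY_ORDER)}
    grouped = defaultdict(list)
    for category, severity in findings:
        grouped[category].append(severity)
    return {
        # Worst severity is the one with the highest rank in the category.
        category: (max(severities, key=rank.__getitem__), len(severities))
        for category, severities in grouped.items()
    }


# Hypothetical sample data, not the actual catalog entries.
sample = [
    ("Hallucination", "Medium"),
    ("Hallucination", "High"),
    ("Safety", "Critical"),
]
card = report_card(sample)
```

Under these assumptions, a category with one High and one Medium finding is reported as High with a count of 2, mirroring how a single row in the report card summarizes several findings.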

Strengths

Fast multimodal reasoning; broad tool-use ecosystem; strong general instruction-following.

Known weaknesses

Tokenization-related errors (character counting, glitch tokens), arithmetic and counting limits, and the usual prompt-injection exposure when used as an agent.
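
A character-counting weakness of the kind mentioned above is easy to check against a deterministic ground truth. The sketch below is only an illustration under stated assumptions: `make_probe` and `check_reply` are hypothetical helper names, the prompt wording is invented, and no model API is called; the reply string would come from whatever client is in use.

```python
import re


def make_probe(word, letter):
    """Build a counting prompt and compute the expected answer locally."""
    prompt = (
        f'How many times does the letter "{letter}" appear in "{word}"? '
        "Answer with a number."
    )
    expected = word.lower().count(letter.lower())
    return prompt, expected


def check_reply(reply, expected):
    """Compare the first integer found in the model's reply to the ground truth."""
    match = re.search(r"\d+", reply)
    return match is not None and int(match.group()) == expected


prompt, expected = make_probe("strawberry", "r")
```

Because the expected value is computed with ordinary string operations rather than by a model, a mismatch points at the reply, not at the checker. Replies that spell the number out ("three") would need extra parsing that this sketch does not attempt.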

Findings (27)

Methods that surface these

Related references

Versions tracked

gpt-4o, gpt-4o-mini

Cite this

Qlarify Labs. (2026). OpenAI GPT-4o — known weaknesses. Retrieved from https://labs.qlarify.fi/models/gpt-4o