OpenAI

o-series

OpenAI's reasoning-focused model line (o3, o4-mini), which spends inference-time compute on chain-of-thought. Reasoning models shift, rather than remove, the failure surface.

Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on o-series; per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.

Report card

Auto-derived from 4 linked findings, showing the worst severity per category (version attributions are illustrative; see the note above).

Safety: High (1×)
Hallucination: High (1×)
Other: Medium (2×)
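
The "worst severity per category" roll-up can be sketched as follows. This is a minimal illustration, not the site's actual derivation: the severity ordering, the `report_card` function, and the sample findings are all hypothetical, with the findings chosen only to reproduce the counts shown above. It also assumes the "N×" figure counts findings at the worst severity within each category.

```python
from collections import defaultdict

# Assumed severity ordering, lowest to highest (not taken from the source).
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}


def report_card(findings):
    """Map each category to (worst severity, number of findings at that severity)."""
    by_category = defaultdict(list)
    for category, severity in findings:
        by_category[category].append(severity)

    card = {}
    for category, severities in by_category.items():
        # The worst severity is the one with the highest rank.
        worst = max(severities, key=SEVERITY_RANK.__getitem__)
        card[category] = (worst, severities.count(worst))
    return card


# Hypothetical findings, chosen only to match the report card above.
findings = [
    ("Safety", "High"),
    ("Hallucination", "High"),
    ("Other", "Medium"),
    ("Other", "Medium"),
]

for category, (worst, count) in report_card(findings).items():
    print(f"{category}: {worst} ({count}×)")
```

Under these assumptions, a category with one Low and one High finding would be reported as High (1×), since only the worst severity is surfaced.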

Strengths

Strong multi-step math, coding and logic via deliberate reasoning; better self-correction than non-reasoning peers.

Known weaknesses

Chain-of-thought can be unfaithful (stated reasoning need not be the real cause); latency and cost rise with reasoning depth; shares the hallucination and prompt-injection classes.
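
One way to probe whether stated reasoning actually drives the answer is a truncation ("early answering") test: cut the chain-of-thought short and check whether the final answer changes. The sketch below is an illustration under stated assumptions, not a procedure Qlarify Labs or OpenAI is known to use. `answer_fn` is a hypothetical stand-in for a model call that takes a question plus a partial list of reasoning steps, and the two toy functions are stubs, not real models.

```python
from typing import Callable, List


def early_answering_sensitivity(
    answer_fn: Callable[[str, List[str]], str],
    question: str,
    reasoning_steps: List[str],
) -> float:
    """Fraction of truncation points where the answer differs from the
    answer given with the full reasoning. Values near 0 suggest the answer
    does not depend on the stated reasoning."""
    if not reasoning_steps:
        return 0.0
    full_answer = answer_fn(question, reasoning_steps)
    changed = 0
    # k = 0 means no reasoning at all; k = len - 1 drops only the last step.
    for k in range(len(reasoning_steps)):
        if answer_fn(question, reasoning_steps[:k]) != full_answer:
            changed += 1
    return changed / len(reasoning_steps)


# Hypothetical stub: ignores the reasoning entirely.
def ignores_reasoning(question: str, steps: List[str]) -> str:
    return "12"


# Hypothetical stub: the answer is whatever the last reasoning step says.
def follows_reasoning(question: str, steps: List[str]) -> str:
    return steps[-1] if steps else "unknown"


steps = ["3 * 4 = 12", "12 + 0 = 12", "answer: 12"]
print(early_answering_sensitivity(ignores_reasoning, "What is 3 * 4?", steps))
print(early_answering_sensitivity(follows_reasoning, "What is 3 * 4?", steps))
```

A low score is only a signal, not proof of unfaithfulness: a model may reach the same answer with or without its reasoning simply because the question is easy.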

Findings (4)

Methods that surface these

Adversarial prompting
Chain-of-thought faithfulness probing
Factual oracle verification

Related references

Measuring Faithfulness in Chain-of-Thought Reasoning — arXiv
OpenAI o1 System Card — OpenAI

Versions tracked

o1, o3, o4-mini

Cite this

Qlarify Labs. (2026). OpenAI o-series — known weaknesses. Retrieved from https://labs.qlarify.fi/models/o-series