Gemini
Google's multimodal model family. This profile aggregates the documented failure-mode classes linked to its versions.
Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on Gemini, so per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.
Report card
Auto-derived from the 23 linked findings (illustrative version attributions; see the note above). Each category shows its worst severity, followed by the number of findings in that category.
- Safety: Critical (1×)
- Prompt injection: Critical (1×)
- Hallucination: High (4×)
- Tool use: High (2×)
- Bias: High (2×)
- Jailbreak: High (2×)
- Reasoning: Medium (10×)
- Refusal: Medium (1×)
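The derivation described above can be sketched in a few lines. This is a minimal illustration, not Qlarify Labs' actual pipeline: the `FINDINGS` pairs are transcribed from the Findings list on this page, while the name `report_card`, the severity ranking, and the sort order (severity, then finding count, then first appearance) are assumptions chosen because they reproduce the card shown here.

```python
# Assumed severity ordering: Low < Medium < High < Critical.
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2, "Critical": 3}

# (category, severity) pairs, in the order the findings appear on this page.
FINDINGS = [
    ("Reasoning", "Low"), ("Hallucination", "Low"), ("Safety", "Critical"),
    ("Reasoning", "Low"), ("Reasoning", "Medium"), ("Hallucination", "High"),
    ("Hallucination", "High"), ("Reasoning", "Medium"), ("Tool use", "Medium"),
    ("Tool use", "High"), ("Reasoning", "Medium"),
    ("Prompt injection", "Critical"), ("Reasoning", "Medium"),
    ("Reasoning", "Low"), ("Bias", "High"), ("Refusal", "Medium"),
    ("Hallucination", "Medium"), ("Jailbreak", "High"), ("Jailbreak", "High"),
    ("Reasoning", "Low"), ("Reasoning", "Low"), ("Bias", "Medium"),
    ("Reasoning", "Medium"),
]


def report_card(findings):
    """Return (category, worst severity, finding count) rows, worst first."""
    worst = {}  # category -> worst severity seen so far
    count = {}  # category -> number of findings in that category
    for category, severity in findings:
        count[category] = count.get(category, 0) + 1
        current = worst.get(category)
        if current is None or SEVERITY_RANK[severity] > SEVERITY_RANK[current]:
            worst[category] = severity
    rows = [(c, worst[c], count[c]) for c in worst]
    # list.sort is stable, so ties keep their first-appearance order.
    rows.sort(key=lambda row: (-SEVERITY_RANK[row[1]], -row[2]))
    return rows


for category, severity, n in report_card(FINDINGS):
    print(f"{category}: {severity} ({n}×)")
```

Keeping the count separate from the worst severity matters here: Reasoning has ten findings but none above Medium, so it ranks below categories with a single Critical finding.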
Strengths
Large context handling; multimodal input; competitive reasoning.
Known weaknesses
Format-constraint violations under strict schemas, lost-in-the-middle recall degradation, and reasoning and arithmetic limits shared with other frontier models.
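A format-constraint violation of the kind listed above can be caught mechanically before a response reaches downstream code. The sketch below uses only Python's standard `json` module; the function name `check_strict_json`, the example `SCHEMA` of required keys, and the sample responses are hypothetical, and a production system would more likely rely on a full JSON Schema validator.

```python
import json


def check_strict_json(text, required):
    """Return a list of problems; an empty list means the output conforms.

    `required` maps each expected key to its expected Python type.
    """
    try:
        # json.loads rejects leading prose, code fences, and trailing text.
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key, expected_type in required.items():
        if key not in data:
            problems.append(f"missing key: {key}")
        elif not isinstance(data[key], expected_type):
            problems.append(f"wrong type for key: {key}")
    for key in data:
        if key not in required:
            problems.append(f"unexpected key: {key}")
    return problems


SCHEMA = {"title": str, "score": int}  # hypothetical schema for illustration

# A conforming response, one wrapped in chatty prose, and one with a
# wrong type plus an extra key.
print(check_strict_json('{"title": "ok", "score": 3}', SCHEMA))
print(check_strict_json('Sure! {"title": "ok", "score": 3}', SCHEMA))
print(check_strict_json('{"title": "ok", "score": "3", "note": "hi"}', SCHEMA))
```

Returning a list of problems rather than a boolean makes it easier to log which kind of violation occurred (invalid syntax, missing field, or extra content).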
Findings (23)
- Character-counting errors in tokenized words (Severity: Low; Category: Reasoning)
  Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
- Confusion about knowledge cutoff and current date (Severity: Low; Category: Hallucination)
  Models misstate their own knowledge cutoff or the current date, and answer about post-cutoff events with stale or invented information.
- Data exfiltration through prompt injection in agents (Severity: Critical; Category: Safety)
  An injected instruction can make a tool-using agent send private data to an attacker-controlled destination.
- Date and duration arithmetic errors (Severity: Low; Category: Reasoning)
  Models miscompute differences between dates, weekdays, and durations across boundaries like months and leap years.
- Errors in multi-digit arithmetic (Severity: Medium; Category: Reasoning)
  Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
- Fabricated citations and references (Severity: High; Category: Hallucination)
  Models invent plausible-looking but non-existent papers, authors, DOIs and URLs.
- Fabrication instead of admitting uncertainty (Severity: High; Category: Hallucination)
  Asked about something unknown or non-existent, models invent an answer rather than saying 'I don't know'.
- Failure to honor negation in instructions (Severity: Medium; Category: Reasoning)
  Models frequently do the opposite of a 'do not' instruction, or ignore the negation entirely.
- Format-constraint violations under strict schemas (Severity: Medium; Category: Tool use)
  Asked for strictly-formatted output (e.g. JSON to a schema), models emit invalid or extra content.
- Hallucinated tool/function arguments (Severity: High; Category: Tool use)
  When calling tools, models invent argument values or call functions that weren't provided.
- Inconsistent answers to semantically equivalent prompts (Severity: Medium; Category: Reasoning)
  Trivial rewordings of the same question yield materially different answers.
- Indirect prompt injection via retrieved content (Severity: Critical; Category: Prompt injection)
  Instructions hidden in documents, web pages or tool outputs can override the system prompt when ingested by the model.
- Lost in the middle: degraded recall for mid-context information (Severity: Medium; Category: Reasoning)
  Retrieval accuracy is highest for facts at the start and end of a long context and drops for facts in the middle.
- Miscounting items in long lists (Severity: Low; Category: Reasoning)
  Counts of items, occurrences, or matches in long inputs drift as list length grows.
- Name-based demographic bias in outputs (Severity: High; Category: Bias)
  Swapping only a name (signalling gender or ethnicity) changes evaluative outputs like screening or sentiment.
- Over-refusal of benign requests (Severity: Medium; Category: Refusal)
  Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
- Poor uncertainty calibration / overconfidence (Severity: Medium; Category: Hallucination)
  Stated confidence does not track accuracy; models sound equally certain when right and wrong.
- Roleplay-based safety bypass (Severity: High; Category: Jailbreak)
  Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.
- Safety bypass via unicode/homoglyph obfuscation (Severity: High; Category: Jailbreak)
  Disallowed content encoded with look-alike unicode or spacing can slip past safety filters.
- Self-contradiction within a single conversation (Severity: Low; Category: Reasoning)
  Models assert one fact and later assert its opposite within the same session.
- Spatial and geometric reasoning errors (Severity: Low; Category: Reasoning)
  Models struggle with relative positions, rotations, and simple geometric/visual reasoning.
- Sycophancy: agreeing with a user's incorrect assertions (Severity: Medium; Category: Bias)
  Models tend to revise correct answers to match a user who pushes back or states a wrong belief.
- The reversal curse: 'A is B' not generalizing to 'B is A' (Severity: Medium; Category: Reasoning)
  A model trained that 'A is B' frequently fails to answer 'B is ?', revealing that learned relations are not symmetric.
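Several of the Reasoning findings above (character counting, date arithmetic, multi-digit arithmetic) have answers that ordinary code computes exactly, so a model's reply can be scored against a deterministic reference. The sketch below builds such references with the Python standard library; the helper names and the sample probes are hypothetical and are not taken from any cited benchmark.

```python
from datetime import date


def letter_count(word, letter):
    """Exact, case-insensitive count of `letter` in `word`."""
    return word.lower().count(letter.lower())


def days_between(start, end):
    """Exact number of days from `start` to `end`."""
    return (end - start).days


def is_correct(model_answer, reference):
    """Compare a model's text answer with the integer reference."""
    try:
        return int(model_answer.strip().replace(",", "")) == reference
    except ValueError:
        return False  # non-numeric replies are scored as wrong


# Hypothetical probes paired with computed ground truth.
references = {
    "How many 'r's are in 'strawberry'?": letter_count("strawberry", "r"),
    "How many days from 2024-02-01 to 2024-03-01?": days_between(
        date(2024, 2, 1), date(2024, 3, 1)
    ),
    "What is 48721 * 90364?": 48721 * 90364,
}

for prompt, truth in references.items():
    print(prompt, "->", truth)
```

Computing the reference at test time, rather than hard-coding it, keeps the probe set easy to extend with new words, dates, or operands.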
Methods that surface these
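One method that can surface the inconsistency finding listed above is a metamorphic check: ask semantically equivalent prompts and measure how often the answers agree. The sketch that follows is an illustration only; `ask` stands for any callable that sends a prompt to a model and returns text, and `stub_model`, the paraphrases, and the normalization rule are hypothetical.

```python
from collections import Counter


def normalize(answer):
    """Crude normalization so trivial formatting differences do not count."""
    return answer.strip().lower().rstrip(".")


def agreement_rate(ask, paraphrases):
    """Fraction of paraphrases whose answer matches the most common answer."""
    answers = [normalize(ask(p)) for p in paraphrases]
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def stub_model(prompt):
    # Hypothetical stand-in for a real model call. It is deliberately
    # inconsistent on one rewording to show what the check reports.
    return "Canberra." if "capital" in prompt.lower() else "Sydney"


paraphrases = [
    "What is the capital of Australia?",
    "Australia's capital city is which city?",
    "Name the city that serves as Australia's seat of government.",
]

print(agreement_rate(stub_model, paraphrases))
```

A rate below 1.0 flags a prompt family for manual review. The check measures consistency only, so a model that is consistently wrong would still score 1.0 and needs a separate correctness test.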
Related references
- Advancing Spatial Reasoning in Large Language Models: An In-Depth Evaluation Using the StepGame Benchmark — AAAI 2024
- BBQ: A Hand-Built Bias Benchmark for Question Answering — Findings of ACL 2022
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs — arXiv
- Counting Ability of Large Language Models and Impact of Tokenization — arXiv
- Dated Data: Tracing Knowledge Cutoffs in Large Language Models — arXiv
- Faith and Fate: Limits of Transformers on Compositionality — arXiv
- Gorilla: Large Language Model Connected with Massive APIs — arXiv
- Hallucination Detection in Large Language Models with Metamorphic Relations — arXiv
- Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court — AI Incident Database
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models — arXiv
- Lost in the Middle: How Language Models Use Long Contexts — arXiv
- Metamorphic Testing of Large Language Models for Natural Language Processing — arXiv
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — arXiv
- OWASP Top 10 for Large Language Model Applications — OWASP
- Prompt injection: what’s the worst that can happen? — Simon Willison’s Weblog
- Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation — arXiv
- Survey of Hallucination in Natural Language Generation — ACM Computing Surveys
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning — arXiv
- The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
- Towards Understanding Sycophancy in Language Models — arXiv
- TruthfulQA: Measuring How Models Mimic Human Falsehoods — arXiv
- Universal and Transferable Adversarial Attacks on Aligned Language Models — arXiv
- Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv
- XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models — arXiv
Versions tracked
Cite this
Qlarify Labs. (2026). Google Gemini — known weaknesses. Retrieved from https://labs.qlarify.fi/models/gemini