Meta
Llama
Meta's open-weight model family, widely self-hosted. The linked findings reflect documented failure-mode classes for models in this size class.
Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on Llama; per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.
Report card
Auto-derived from 19 linked findings (illustrative version attributions; see note above). Each category shows the worst severity among its findings, followed by the number of findings in that category.
- Hallucination: High (4×)
- Tool use: High (2×)
- Jailbreak: High (2×)
- Bias: High (1×)
- Reasoning: Medium (8×)
- Other: Low (2×)
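The "worst severity per category" derivation can be sketched as below. This is a minimal illustration, not Qlarify Labs' actual implementation: the severity ranking (Low < Medium < High), the function name `derive_report_card`, and the display order (severity descending, then count descending) are assumptions inferred from the card as shown. The (category, severity) pairs are transcribed from the findings list on this page.

```python
from collections import defaultdict

# Assumed ranking: Low < Medium < High.
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2}

# (category, severity) pairs transcribed from the 19 findings on this page.
FINDINGS = [
    ("Other", "Low"),             # glitch tokens
    ("Reasoning", "Low"),         # character counting
    ("Hallucination", "Low"),     # knowledge cutoff / current date
    ("Reasoning", "Low"),         # date and duration arithmetic
    ("Reasoning", "Medium"),      # multi-digit arithmetic
    ("Hallucination", "High"),    # fabricated citations
    ("Hallucination", "High"),    # fabrication instead of uncertainty
    ("Reasoning", "Medium"),      # negation
    ("Tool use", "Medium"),       # format-constraint violations
    ("Tool use", "High"),         # hallucinated tool arguments
    ("Reasoning", "Medium"),      # inconsistent answers
    ("Reasoning", "Medium"),      # lost in the middle
    ("Reasoning", "Low"),         # miscounting items
    ("Bias", "High"),             # name-based demographic bias
    ("Hallucination", "Medium"),  # overconfidence
    ("Other", "Low"),             # repetition loops
    ("Jailbreak", "High"),        # roleplay bypass
    ("Jailbreak", "High"),        # unicode/homoglyph bypass
    ("Reasoning", "Medium"),      # reversal curse
]


def derive_report_card(findings):
    """Return [(category, worst_severity, count), ...], worst categories first."""
    grouped = defaultdict(list)
    for category, severity in findings:
        grouped[category].append(severity)
    rows = [
        (category, max(severities, key=SEVERITY_RANK.__getitem__), len(severities))
        for category, severities in grouped.items()
    ]
    # Hypothetical display order: severity descending, then count descending.
    # The sort is stable, so ties keep first-appearance order.
    rows.sort(key=lambda row: (-SEVERITY_RANK[row[1]], -row[2]))
    return rows


if __name__ == "__main__":
    for category, severity, count in derive_report_card(FINDINGS):
        print(f"{category}: {severity} ({count}x)")
```

Under these assumptions the sketch reproduces the six rows of the card, which is a quick consistency check between the card and the findings list.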
Strengths
Open weights enable inspection and fine-tuning; strong for its size.
Known weaknesses
More prone to repetition/degeneration loops and format violations; shares the reasoning and safety-bypass failure classes seen in frontier models.
Findings (19)
- Anomalous behavior on glitch tokens (Severity: Low; Category: Other)
  Certain under-trained tokens cause models to emit nonsense, evade instructions, or behave erratically.
- Character-counting errors in tokenized words (Severity: Low; Category: Reasoning)
  Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
- Confusion about knowledge cutoff and current date (Severity: Low; Category: Hallucination)
  Models misstate their own knowledge cutoff or the current date, and answer about post-cutoff events with stale or invented information.
- Date and duration arithmetic errors (Severity: Low; Category: Reasoning)
  Models miscompute differences between dates, weekdays, and durations across boundaries like months and leap years.
- Errors in multi-digit arithmetic (Severity: Medium; Category: Reasoning)
  Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
- Fabricated citations and references (Severity: High; Category: Hallucination)
  Models invent plausible-looking but non-existent papers, authors, DOIs and URLs.
- Fabrication instead of admitting uncertainty (Severity: High; Category: Hallucination)
  Asked about something unknown or non-existent, models invent an answer rather than saying 'I don't know'.
- Failure to honor negation in instructions (Severity: Medium; Category: Reasoning)
  Models frequently do the opposite of a 'do not' instruction, or ignore the negation entirely.
- Format-constraint violations under strict schemas (Severity: Medium; Category: Tool use)
  Asked for strictly-formatted output (e.g. JSON to a schema), models emit invalid or extra content.
- Hallucinated tool/function arguments (Severity: High; Category: Tool use)
  When calling tools, models invent argument values or call functions that weren't provided.
- Inconsistent answers to semantically equivalent prompts (Severity: Medium; Category: Reasoning)
  Trivial rewordings of the same question yield materially different answers.
- Lost in the middle: degraded recall for mid-context information (Severity: Medium; Category: Reasoning)
  Retrieval accuracy is highest for facts at the start and end of a long context and drops for facts in the middle.
- Miscounting items in long lists (Severity: Low; Category: Reasoning)
  Counts of items, occurrences, or matches in long inputs drift as list length grows.
- Name-based demographic bias in outputs (Severity: High; Category: Bias)
  Swapping only a name (signalling gender or ethnicity) changes evaluative outputs like screening or sentiment.
- Poor uncertainty calibration / overconfidence (Severity: Medium; Category: Hallucination)
  Stated confidence does not track accuracy; models sound equally certain when right and wrong.
- Repetition and degeneration loops (Severity: Low; Category: Other)
  Under certain prompts or long generations, models fall into repeating phrases or degenerate text.
- Roleplay-based safety bypass (Severity: High; Category: Jailbreak)
  Framing a disallowed request as fiction or a persona can induce the model to bypass its safety policy.
- Safety bypass via unicode/homoglyph obfuscation (Severity: High; Category: Jailbreak)
  Disallowed content encoded with look-alike unicode or spacing can slip past safety filters.
- The reversal curse: 'A is B' not generalizing to 'B is A' (Severity: Medium; Category: Reasoning)
  A model trained that 'A is B' frequently fails to answer 'B is ?', revealing that learned relations are not symmetric.
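Several of the findings above (letter counting, date arithmetic, strict output formats) have deterministic ground truth, so a model's reply can be checked against a value computed in code. The sketch below shows what such checks might look like using only the Python standard library. All function names, the example word, and the example dates are hypothetical illustrations, not part of any Qlarify Labs tooling, and no model is called here; the reply strings stand in for text a model might return.

```python
import json
from datetime import date


def count_letter(word: str, letter: str) -> int:
    """Character-level ground truth for a letter-counting probe (case-insensitive)."""
    return word.lower().count(letter.lower())


def days_between(start: date, end: date) -> int:
    """Ground truth for a date-difference probe; leap years are handled by datetime."""
    return (end - start).days


def is_strict_json_object(reply: str, required_keys: set) -> bool:
    """True only if the entire reply parses as one JSON object with exactly the required keys.

    Any surrounding prose or trailing text makes json.loads fail, which is the
    kind of format violation this check is meant to catch.
    """
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(parsed) == required_keys


if __name__ == "__main__":
    # Hypothetical probe inputs chosen for illustration.
    print(count_letter("strawberry", "r"))
    print(days_between(date(2024, 2, 28), date(2024, 3, 1)))  # spans a leap day
    print(is_strict_json_object('{"name": "x", "age": 1}', {"name", "age"}))
    print(is_strict_json_object('Sure! {"name": "x", "age": 1}', {"name", "age"}))
```

Checks like these only cover findings with an exact answer; findings such as fabricated citations or name-based bias need external lookups or paired-prompt comparisons instead.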
Related references
- BBQ: A Hand-Built Bias Benchmark for Question Answering — Findings of ACL 2022
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs — arXiv
- Counting Ability of Large Language Models and Impact of Tokenization — arXiv
- Dated Data: Tracing Knowledge Cutoffs in Large Language Models — arXiv
- Faith and Fate: Limits of Transformers on Compositionality — arXiv
- Gorilla: Large Language Model Connected with Massive APIs — arXiv
- Hallucination Detection in Large Language Models with Metamorphic Relations — arXiv
- Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court — AI Incident Database
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models — arXiv
- Lost in the Middle: How Language Models Use Long Contexts — arXiv
- Metamorphic Testing of Large Language Models for Natural Language Processing — arXiv
- SolidGoldMagikarp (plus, prompt generation) — LessWrong
- Survey of Hallucination in Natural Language Generation — ACM Computing Surveys
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning — arXiv
- The Curious Case of Neural Text Degeneration — arXiv
- The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
- TruthfulQA: Measuring How Models Mimic Human Falsehoods — arXiv
- Universal and Transferable Adversarial Attacks on Aligned Language Models — arXiv
- Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv
Cite this
Qlarify Labs. (2026). Meta Llama — known weaknesses. Retrieved from https://labs.qlarify.fi/models/llama