Mistral
Mistral Large
Mistral's flagship dense model, available with open weights for self-hosting. Linked findings reflect failure-mode classes documented for this class of model.
Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on Mistral Large; per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.
Report card
Auto-derived from 9 linked findings (illustrative version attributions — see note above) — worst severity per category.
- Hallucination: High (2×)
- Bias: High (1×)
- Reasoning: Medium (3×)
- Tool use: Medium (1×)
- Other: Low (2×)
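The "worst severity per category" roll-up can be sketched as a simple group-and-maximum pass. This is a minimal illustration, not the site's actual implementation: the ordering Low < Medium < High, the `FINDINGS` list, and the `report_card` helper are assumptions introduced here, and the category assigned to each finding is read from the findings listing on this page.

```python
from collections import defaultdict

# Assumed ordering; the page only names the three levels Low, Medium and High.
SEVERITY_RANK = {"Low": 0, "Medium": 1, "High": 2}

# (category, severity) pairs as read from the findings listing on this page.
FINDINGS = [
    ("Other", "Low"),           # glitch tokens
    ("Reasoning", "Low"),       # character counting
    ("Hallucination", "Low"),   # knowledge cutoff and current date
    ("Reasoning", "Medium"),    # multi-digit arithmetic
    ("Hallucination", "High"),  # fabricated citations
    ("Tool use", "Medium"),     # format-constraint violations
    ("Bias", "High"),           # name-based demographic bias
    ("Other", "Low"),           # repetition and degeneration
    ("Reasoning", "Medium"),    # reversal curse
]


def report_card(findings):
    """Return {category: (worst_severity, finding_count)}."""
    grouped = defaultdict(list)
    for category, severity in findings:
        grouped[category].append(severity)
    return {
        category: (max(severities, key=SEVERITY_RANK.__getitem__), len(severities))
        for category, severities in grouped.items()
    }


card = report_card(FINDINGS)
for category, (worst, count) in card.items():
    print(f"{category}: {worst} ({count}x)")
```

Note that a category's headline severity is set by its single worst finding, so "Hallucination: High" does not mean both linked findings are High.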
Strengths
Efficient, capable general performance; open weights enable inspection and fine-tuning; strong multilingual coverage of European languages.
Known weaknesses
More prone to format-constraint violations and repetition/degeneration; also shares the reasoning, hallucination and safety-bypass failure classes common to frontier models.
Findings (9)
- Anomalous behavior on glitch tokens (Severity: Low; Category: Other)
  Certain under-trained tokens cause models to emit nonsense, evade instructions, or behave erratically.
- Character-counting errors in tokenized words (Severity: Low; Category: Reasoning)
  Models miscount letters within a word (e.g. how many 'r's are in a given word) because they reason over tokens, not characters.
- Confusion about knowledge cutoff and current date (Severity: Low; Category: Hallucination)
  Models misstate their own knowledge cutoff or the current date, and answer about post-cutoff events with stale or invented information.
- Errors in multi-digit arithmetic (Severity: Medium; Category: Reasoning)
  Without tool use, models make systematic errors on multi-digit multiplication and long arithmetic.
- Fabricated citations and references (Severity: High; Category: Hallucination)
  Models invent plausible-looking but non-existent papers, authors, DOIs and URLs.
- Format-constraint violations under strict schemas (Severity: Medium; Category: Tool use)
  Asked for strictly-formatted output (e.g. JSON to a schema), models emit invalid or extra content.
- Name-based demographic bias in outputs (Severity: High; Category: Bias)
  Swapping only a name (signalling gender or ethnicity) changes evaluative outputs like screening or sentiment.
- Repetition and degeneration loops (Severity: Low; Category: Other)
  Under certain prompts or long generations, models fall into repeating phrases or degenerate text.
- The reversal curse: 'A is B' not generalizing to 'B is A' (Severity: Medium; Category: Reasoning)
  A model trained that 'A is B' frequently fails to answer 'B is ?', revealing that learned relations are not symmetric.
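Several of these findings can be checked against a deterministic ground truth. The sketch below is illustrative only and is not a Qlarify Labs method: the helper names (`count_char`, `violates_json_contract`, `max_ngram_repeats`), the example word, and the choice of a word trigram as the repetition unit are all assumptions. It uses only the Python standard library and does not call any model; the strings passed in stand for model outputs collected elsewhere.

```python
import json
from collections import Counter


def count_char(word: str, char: str) -> int:
    """Ground truth for a character-counting probe."""
    return word.count(char)


def violates_json_contract(output: str, required_keys: set) -> bool:
    """True if the output is not a single JSON object with exactly the required keys.

    This is a simplified stand-in for full schema validation: it checks
    key names only, not value types.
    """
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        # Covers invalid JSON and any extra text around the object.
        return True
    return not isinstance(parsed, dict) or set(parsed) != required_keys


def max_ngram_repeats(text: str, n: int = 3) -> int:
    """Highest occurrence count of any word n-gram; a crude repetition signal."""
    words = text.split()
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(counts.values(), default=0)


# Hypothetical model outputs, written by hand for illustration.
print(count_char("strawberry", "r"))  # 3
print(violates_json_contract('Sure! {"name": "x", "score": 1}', {"name", "score"}))  # True
print(max_ngram_repeats("the cat sat the cat sat the cat sat"))  # 3
```

A check like `max_ngram_repeats` needs a threshold chosen per task, since legitimate text also repeats short phrases; the other two checks are exact.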
Related references
- BBQ: A Hand-Built Bias Benchmark for Question Answering — Findings of ACL 2022
- Dated Data: Tracing Knowledge Cutoffs in Large Language Models — arXiv
- Faith and Fate: Limits of Transformers on Compositionality — arXiv
- Hallucination Detection in Large Language Models with Metamorphic Relations — arXiv
- Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court — AI Incident Database
- Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models — arXiv
- SolidGoldMagikarp (plus, prompt generation) — LessWrong
- Survey of Hallucination in Natural Language Generation — ACM Computing Surveys
- The Curious Case of Neural Text Degeneration — arXiv
- The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
- TruthfulQA: Measuring How Models Mimic Human Falsehoods — arXiv
- Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv
Cite this
Qlarify Labs. (2026). Mistral Mistral Large — known weaknesses. Retrieved from https://labs.qlarify.fi/models/mistral-large