Mistral

Mistral Large

Mistral's flagship dense model, available with open weights for self-hosting. Linked findings reflect documented failure-mode classes for this class of model.

Attribution note. These are documented failure-mode classes observed across frontier models and grounded in each finding's cited source; their attribution to this specific version is illustrative. Qlarify Labs has not independently reproduced each finding on Mistral Large; per-version confidence requires reproduction (VERIFICATION §2–4). Open any finding to see its source.

Report card

Auto-derived from the 9 linked findings (version attributions are illustrative; see the note above). Each category shows its worst severity, followed by the number of findings in that category.

Hallucination: High (2×)
Bias: High (1×)
Reasoning: Medium (3×)
Tool use: Medium (1×)
Other: Low (2×)
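
The "worst severity per category" rule can be sketched in a few lines of Python. This is a minimal illustration, not Qlarify Labs' actual pipeline: the function name `report_card`, the `(category, severity)` tuple layout, and the sample records are all assumptions made for the example, and the sample is hypothetical data rather than the nine findings linked from this page.

```python
from collections import defaultdict

# Severity labels as they appear on the report card, ranked lowest to highest.
SEVERITY_RANK = {"Low": 1, "Medium": 2, "High": 3}

def report_card(findings):
    """Return {category: (worst_severity, finding_count)}.

    `findings` is an iterable of (category, severity) pairs; this layout is
    an assumption for the sketch, not a documented schema.
    """
    grouped = defaultdict(list)
    for category, severity in findings:
        grouped[category].append(severity)
    return {
        category: (max(severities, key=lambda s: SEVERITY_RANK[s]), len(severities))
        for category, severities in grouped.items()
    }

# Hypothetical sample records, NOT the findings linked from this page.
sample = [
    ("Hallucination", "Medium"),
    ("Hallucination", "High"),
    ("Bias", "High"),
    ("Other", "Low"),
]

card = report_card(sample)
for category, (worst, count) in card.items():
    print(f"{category}: {worst} ({count} findings)")
```

Reporting the worst severity rather than an average keeps one serious finding from being diluted by several minor ones in the same category.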

Strengths

Efficient, capable general performance; open weights enable inspection and fine-tuning; strong multilingual coverage of European languages.

Known weaknesses

More prone to format-constraint violations and repetition/degeneration; it also shares the reasoning, hallucination, and safety-bypass failure classes common to frontier models.
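
For self-hosted deployments, the two weaknesses named first can be screened with simple output checks. The sketch below is one possible approach under stated assumptions: both function names are hypothetical, the format constraint is assumed to be "a JSON object with certain keys", and the repeated-n-gram fraction is a crude proxy for degeneration rather than the metric used in any cited paper.

```python
import json

def violates_json_format(output, required_keys):
    """Return True if `output` is not a JSON object holding every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return True  # extra prose around the JSON, truncation, etc.
    return not (isinstance(parsed, dict) and set(required_keys).issubset(parsed))

def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams that duplicate an earlier n-gram (0.0 = none)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)

# Illustrative strings, not real model outputs.
print(violates_json_format('{"answer": "42"}', ["answer"]))
print(violates_json_format('Sure! {"answer": "42"}', ["answer"]))
print(repeated_ngram_fraction("the cat the cat the cat the cat"))
```

Any alert threshold on the repetition score (for example, flagging outputs above 0.5) is a tuning choice that would need calibrating against real traffic.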

Findings (9)

Methods that surface these

Benchmark evaluation
Bias auditing
Boundary & edge-case testing
Chaos engineering for AI systems
Counterfactual bias probing
Differential testing
Factual oracle verification
Glitch-token & unicode fuzzing
Hallucination triggering
Logic & consistency testing
Metamorphic testing
Model-graded evaluation (LLM-as-judge)
Perturbation testing
Property-based testing
Smoke testing in CI/CD
Unit testing the deterministic scaffold

Related references

BBQ: A Hand-Built Bias Benchmark for Question Answering — Findings of ACL 2022
Dated Data: Tracing Knowledge Cutoffs in Large Language Models — arXiv
Faith and Fate: Limits of Transformers on Compositionality — arXiv
Hallucination Detection in Large Language Models with Metamorphic Relations — arXiv
Incident 541: ChatGPT Produced False Court Case Law Presented by Legal Counsel in Court — AI Incident Database
Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models — arXiv
SolidGoldMagikarp (plus, prompt generation) — LessWrong
Survey of Hallucination in Natural Language Generation — ACM Computing Surveys
The Curious Case of Neural Text Degeneration — arXiv
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A” — arXiv
TruthfulQA: Measuring How Models Mimic Human Falsehoods — arXiv
Why Do Large Language Models (LLMs) Struggle to Count Letters? — arXiv

Versions tracked

mistral-large-2

Cite this

Qlarify Labs. (2026). Mistral Large: known weaknesses. Retrieved from https://labs.qlarify.fi/models/mistral-large