Medium · Reasoning · Vendor-acknowledged · Published

Reasoning model degrades under few-shot prompting

DeepSeek-R1's own paper reports that few-shot prompting 'consistently degrades its performance' and recommends zero-shot — inverting the usual assumption that examples help.

Published June 26, 2026

Reproducibility: Often
Severity: Medium
Confidence: Vendor-acknowledged

Details

DeepSeek's R1 report states that the model 'is sensitive to prompts' and that 'few-shot prompting consistently degrades its performance.' It recommends describing the problem directly and specifying the output format in a zero-shot setting for optimal results. This inverts a common prompting habit and makes the model brittle to prompt format, which is exactly the kind of fragility perturbation testing is built to quantify.

Found with

🔬 Perturbation testing

Adding few-shot exemplars — a meaning-preserving prompt change — measurably degrades the answer.
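To make the perturbation concrete, here is a minimal sketch of how such a prompt pair could be built and run. It is an illustration under stated assumptions, not the harness used for this finding: `zero_shot_prompt`, `few_shot_prompt`, `perturbation_pair`, and `run_perturbation` are hypothetical helper names, the "Q:/A:" exemplar layout is one arbitrary choice, and `model` stands for any callable that maps a prompt string to the model's text output.

```python
from typing import Callable, Sequence, Tuple

Exemplar = Tuple[str, str]  # (question, answer) pair used as a demonstration


def zero_shot_prompt(question: str) -> str:
    """Baseline: state the problem directly and specify the output format."""
    return f"{question}\nGive only the final answer on the last line."


def few_shot_prompt(question: str, exemplars: Sequence[Exemplar]) -> str:
    """Perturbed variant: the same baseline prompt, preceded by worked exemplars."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\n{zero_shot_prompt(question)}"


def perturbation_pair(
    question: str, exemplars: Sequence[Exemplar]
) -> Tuple[str, str]:
    """Return (baseline, perturbed) prompts that ask the identical question."""
    return zero_shot_prompt(question), few_shot_prompt(question, exemplars)


def run_perturbation(
    model: Callable[[str], str],  # assumption: any prompt -> text callable
    question: str,
    exemplars: Sequence[Exemplar],
) -> Tuple[str, str]:
    """Query the model with both prompts and return both raw outputs."""
    baseline, perturbed = perturbation_pair(question, exemplars)
    return model(baseline), model(perturbed)
```

Because the perturbed prompt contains the baseline prompt verbatim, any change in the answer can be attributed to the added exemplars rather than to a reworded question.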

🔬 Differential testing

Zero-shot and few-shot prompts for the same task produce diverging results, and in the opposite of the usual direction: the few-shot variant scores worse.
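A differential comparison of this kind could be sketched as follows. This is a hedged illustration rather than the evaluation behind the finding: `DiffResult`, `accuracy`, and `differential_test` are hypothetical names, scoring is a simple exact match on the last line of the output (real benchmarks usually need a more forgiving checker), and `model` is assumed to be any callable from prompt string to output text.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Item = Tuple[str, str]  # (question, expected final answer)


@dataclass
class DiffResult:
    baseline_acc: float  # accuracy with the baseline (e.g. zero-shot) prompt
    variant_acc: float   # accuracy with the variant (e.g. few-shot) prompt

    @property
    def delta(self) -> float:
        """Variant minus baseline; negative means the variant scored worse."""
        return self.variant_acc - self.baseline_acc


def accuracy(
    model: Callable[[str], str],
    items: Sequence[Item],
    build_prompt: Callable[[str], str],
) -> float:
    """Fraction of items whose last output line exactly matches the expected answer."""
    if not items:
        raise ValueError("items must not be empty")
    correct = 0
    for question, expected in items:
        lines = model(build_prompt(question)).strip().splitlines()
        final = lines[-1].strip() if lines else ""
        correct += final == expected
    return correct / len(items)


def differential_test(
    model: Callable[[str], str],
    items: Sequence[Item],
    build_baseline: Callable[[str], str],
    build_variant: Callable[[str], str],
) -> DiffResult:
    """Score the same items under two prompt builders and report both accuracies."""
    return DiffResult(
        baseline_acc=accuracy(model, items, build_baseline),
        variant_acc=accuracy(model, items, build_variant),
    )
```

Passing the two prompt builders as parameters keeps the items and scoring fixed, so the only thing that differs between the two runs is the prompt format. A negative `delta` with a few-shot variant would correspond to the direction reported here.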

Evidence

https://arxiv.org/abs/2501.12948

DeepSeek-AI, 'DeepSeek-R1' (2025), Limitations section.

Affected versions

DeepSeek · deepseek-r1

References

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Reasoning failure · Robustness

Source: https://arxiv.org/abs/2501.12948

Cite this

Qlarify Labs. (2026). Reasoning model degrades under few-shot prompting. Retrieved from https://labs.qlarify.fi/findings/deepseek-r1-prompt-sensitivity