Reasoning model regresses on tool use versus its base model
DeepSeek-R1 falls short of its base model, DeepSeek-V3, on function calling, multi-turn conversation, complex role-play, and JSON output: a reasoning-tuned model traded away tool-use reliability, which R1-0528 later restored.
Published June 26, 2026
- Reproducibility: Often
- Severity: Medium
- Confidence: Vendor-acknowledged
Details
DeepSeek's R1 report states 'the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output.' The reasoning-focused training regressed exactly the structured and agentic capabilities the base model had — a documented capability trade-off, which the later R1-0528 update restored by re-adding function calling and JSON output support.
Found with
A tool-contract test catches the weaker function-calling and JSON behavior at the seam.
🔬 Property-based testing: Validating outputs against the JSON schema exposes the regression.
🔬 Differential testing: R1 and V3 diverge on identical tool-use tasks, and the base model does better.
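The checks above can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for this finding: the `get_weather` contract, the helper names, and the sample strings are all invented for the example, and the sample strings are not real model outputs. It validates each raw output against a tool contract, then compares two models on the same prompts.

```python
import json

# Hypothetical tool contract, invented for illustration.
WEATHER_CONTRACT = {
    "name": "get_weather",
    "required": {"city": str, "unit": str},
    "enums": {"unit": {"celsius", "fahrenheit"}},
}


def contract_violations(raw_output, contract):
    """Return the contract violations for one raw model output (empty list = pass)."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    problems = []
    if call.get("name") != contract["name"]:
        problems.append(f"unexpected tool name: {call.get('name')!r}")
    args = call.get("arguments")
    if not isinstance(args, dict):
        return problems + ["'arguments' is missing or not an object"]
    for key, expected_type in contract["required"].items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected_type):
            problems.append(f"wrong type for argument: {key}")
    for key, allowed in contract["enums"].items():
        if key in args and isinstance(args[key], str) and args[key] not in allowed:
            problems.append(f"value not allowed for argument: {key}")
    return problems


def pass_rate(outputs, contract):
    """Fraction of outputs that satisfy the contract."""
    passed = sum(1 for raw in outputs if not contract_violations(raw, contract))
    return passed / len(outputs)


def differential(outputs_a, outputs_b, contract):
    """Indices of prompts where model A passes the contract and model B does not."""
    return [
        i
        for i, (a, b) in enumerate(zip(outputs_a, outputs_b))
        if not contract_violations(a, contract) and contract_violations(b, contract)
    ]


# Made-up strings standing in for two models' replies to the same three prompts.
base_outputs = [
    '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}',
    '{"name": "get_weather", "arguments": {"city": "Lima", "unit": "fahrenheit"}}',
    '{"name": "get_weather", "arguments": {"city": "Pune", "unit": "celsius"}}',
]
reasoning_outputs = [
    '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}',
    "Sure! I will call get_weather for Lima.",
    '{"name": "get_weather", "arguments": {"city": "Pune", "unit": "kelvin"}}',
]

if __name__ == "__main__":
    print(differential(base_outputs, reasoning_outputs, WEATHER_CONTRACT))
    print(pass_rate(base_outputs, WEATHER_CONTRACT))
    print(pass_rate(reasoning_outputs, WEATHER_CONTRACT))
```

On these made-up samples, `differential` flags the second and third prompts (indices 1 and 2): one reply is prose instead of JSON, and the other uses a `unit` value outside the contract. In a real run the two lists would hold the recorded outputs of each model version on an identical prompt set, and a full JSON Schema validator could replace the hand-written checks.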
Evidence
Affected versions
Across model versions
- First observed in: DeepSeek-R1 · deepseek-r1
- Fixed in: DeepSeek-R1 · deepseek-r1-0528
A finding is a claim about a specific model version at a point in time. Fixes can come undone; the method that found it is how you’d catch it again.
References
Source: https://arxiv.org/abs/2501.12948
Cite this
Qlarify Labs. (2026). Reasoning model regresses on tool use versus its base model. Retrieved from https://labs.qlarify.fi/findings/deepseek-r1-tool-use-regression