Medium · Tool use · Vendor-acknowledged · Published

Reasoning model regresses on tool use versus its base model

DeepSeek-R1 falls short of its base model, DeepSeek-V3, on function calling, multi-turn interaction, complex role-play, and JSON output. The reasoning-tuned model traded away tool-use reliability, which the R1-0528 update later restored.

Published June 26, 2026

Reproducibility: Often
Severity: Medium
Confidence: Vendor-acknowledged

Details

DeepSeek's R1 report states: 'the capabilities of DeepSeek-R1 fall short of DeepSeek-V3 in tasks such as function calling, multi-turn, complex role-playing, and JSON output.' Reasoning-focused training regressed exactly the structured and agentic capabilities the base model had. This is a documented capability trade-off; the later R1-0528 update addressed it by re-adding function calling and JSON output support.

Found with

🔬 Integration testing (MCP handshakes & tool contracts)

A tool-contract test catches the weaker function-calling and JSON behavior at the seam between model output and tool interface.
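A tool-contract check of this kind can be sketched as follows. This is a minimal illustration, not the test used for this finding: the contract format, the `get_weather` tool, and the `check_tool_call` helper are all hypothetical, and the check assumes the model emits its tool call as a JSON object with `name` and `arguments` fields.

```python
import json

# Hypothetical tool contract: the tool name plus required argument names and types.
WEATHER_CONTRACT = {
    "name": "get_weather",
    "required": {"city": str, "unit": str},
}

def check_tool_call(raw: str, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the call conforms."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["top-level value is not an object"]

    problems = []
    if call.get("name") != contract["name"]:
        problems.append(f"unexpected tool name: {call.get('name')!r}")

    args = call.get("arguments")
    if not isinstance(args, dict):
        return problems + ["'arguments' is missing or not an object"]

    # Every required argument must be present and have the expected type.
    for key, expected in contract["required"].items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            problems.append(f"wrong type for {key}: expected {expected.__name__}")
    return problems
```

In an integration test, the raw string would come from the model under test, and any non-empty result would fail the test with the listed violations.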

🔬 Property-based testing

Validating outputs against the expected JSON schema across many generated inputs exposes the regression.
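The property "every output is a JSON object matching the schema" can be checked over generated inputs roughly like this. The sketch uses only the standard library rather than a dedicated property-testing framework, and everything in it is an assumption for illustration: the two-field `SCHEMA`, the prompt generator, and the `model` callable, which stands in for whatever client calls the model under test.

```python
import json
import random
import string
from typing import Callable, Optional

# Hypothetical minimal schema: required top-level keys and their Python types.
SCHEMA = {"answer": str, "confidence": float}

def conforms(raw: str) -> bool:
    """Property under test: the output is a JSON object matching SCHEMA."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(
        isinstance(obj.get(key), typ) for key, typ in SCHEMA.items()
    )

def random_prompt(rng: random.Random) -> str:
    """Generate an arbitrary prompt, including characters that tend to break JSON."""
    alphabet = string.ascii_letters + ' "{}\\\n'
    return "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 40)))

def find_counterexample(
    model: Callable[[str], str], cases: int = 200, seed: int = 0
) -> Optional[str]:
    """Return the first generated prompt whose output violates the property, else None."""
    rng = random.Random(seed)  # fixed seed so a failing case can be replayed
    for _ in range(cases):
        prompt = random_prompt(rng)
        if not conforms(model(prompt)):
            return prompt
    return None
```

A result of `None` means no violation was found in the sampled cases, which is evidence rather than proof; a returned prompt is a concrete failing input to keep as a regression case.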

🔬 Differential testing

R1 and V3 diverge on identical tool-use tasks, and the base model does better.
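A differential run over a shared task list might look like the sketch below. It assumes each model is wrapped as a plain callable from prompt to raw output, and it uses "output parses as JSON" as a stand-in pass criterion; the names `differential_report` and `is_json` are hypothetical, and a real harness would plug in a stricter check such as a tool-contract test.

```python
import json
from typing import Callable, Iterable

Model = Callable[[str], str]

def is_json(raw: str) -> bool:
    """Stand-in pass criterion: the output parses as JSON."""
    try:
        json.loads(raw)
    except json.JSONDecodeError:
        return False
    return True

def differential_report(
    baseline: Model,
    candidate: Model,
    tasks: Iterable[str],
    passes: Callable[[str], bool],
) -> dict:
    """Run both models on the same tasks and list the tasks only the baseline passes."""
    total = base_ok = cand_ok = 0
    regressions = []
    for task in tasks:
        total += 1
        b = passes(baseline(task))
        c = passes(candidate(task))
        base_ok += b
        cand_ok += c
        if b and not c:
            regressions.append(task)  # baseline succeeds where the candidate fails
    return {
        "total": total,
        "baseline_pass": base_ok,
        "candidate_pass": cand_ok,
        "regressions": regressions,
    }
```

Here the baseline would wrap V3 and the candidate R1. The `regressions` list is the useful output: it names the exact tasks where the newer model fell behind, which can then be rerun against later versions.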

Evidence

https://arxiv.org/abs/2501.12948

DeepSeek-AI, 'DeepSeek-R1' (2025), Limitations section.

Affected versions

DeepSeek · deepseek-r1

Across model versions

First observed in: DeepSeek-R1 · deepseek-r1
Fixed in: DeepSeek-R1 · deepseek-r1-0528

A finding is a claim about a specific model version at a point in time. Fixes can come undone; the method that found it is how you’d catch it again.

References

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Tool use · Agents

Source: https://arxiv.org/abs/2501.12948

Cite this

Qlarify Labs. (2026). Reasoning model regresses on tool use versus its base model. Retrieved from https://labs.qlarify.fi/findings/deepseek-r1-tool-use-regression