← Findings
MediumReasoningReviewer-confirmedPublished

Model behavior drifts between versions — a fixed task can regress

The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.

Published June 26, 2026

Reproducibility
Sometimes
Severity
Medium
Confidence
Reviewer-confirmed

Details

Chen, Zaharia and Zou measured GPT-3.5 and GPT-4 on identical tasks across two 2023 snapshots and found large swings — GPT-4's accuracy at identifying prime vs. composite numbers fell from 84% to 51% over a few months, alongside degraded instruction-following. Capability is not monotonic, and a hosted model can silently regress, which is why a fixed suite must be re-run over time rather than measured once.

Found with

Evidence

https://arxiv.org/abs/2307.09009
Chen, Zaharia & Zou, 'How Is ChatGPT's Behavior Changing over Time?' (2023).

References

Source: https://arxiv.org/abs/2307.09009

Cite this

Qlarify Labs. (2026). Model behavior drifts between versions — a fixed task can regress. Retrieved from https://labs.qlarify.fi/findings/behavior-drift-between-versions