Model behavior drifts between versions — a fixed task can regress
The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.
Published June 26, 2026
- Reproducibility
- Sometimes
- Severity
- Medium
- Confidence
- Reviewer-confirmed
Details
Chen, Zaharia and Zou measured GPT-3.5 and GPT-4 on identical tasks across two 2023 snapshots and found large swings — GPT-4's accuracy at identifying prime vs. composite numbers fell from 84% to 51% over a few months, alongside degraded instruction-following. Capability is not monotonic, and a hosted model can silently regress, which is why a fixed suite must be re-run over time rather than measured once.
Found with
Evidence
References
Source: https://arxiv.org/abs/2307.09009
Cite this
Qlarify Labs. (2026). Model behavior drifts between versions — a fixed task can regress. Retrieved from https://labs.qlarify.fi/findings/behavior-drift-between-versions