MediumReasoningReviewer-confirmedPublished

Model behavior drifts between versions — a fixed task can regress

The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.

Published June 26, 2026

Reproducibility: Sometimes
Severity: Medium
Confidence: Reviewer-confirmed

Details

Chen, Zaharia and Zou measured GPT-3.5 and GPT-4 on identical tasks across two 2023 snapshots and found large swings — GPT-4's accuracy at identifying prime vs. composite numbers fell from 84% to 51% over a few months, alongside degraded instruction-following. Capability is not monotonic, and a hosted model can silently regress, which is why a fixed suite must be re-run over time rather than measured once.

Found with

🔬 Drift & decay monitoring

Re-running the same fixed task (e.g. the prime-number set) against each snapshot is what makes the regression visible as a trend.

🔬 Differential testing

Comparing two dated snapshots on identical inputs surfaces the divergence directly.

Evidence

https://arxiv.org/abs/2307.09009

Chen, Zaharia & Zou, 'How Is ChatGPT's Behavior Changing over Time?' (2023).

References

How Is ChatGPT's Behavior Changing over Time?

Reasoning failure Evals Drift

Source: https://arxiv.org/abs/2307.09009

Cite this

Qlarify Labs. (2026). Model behavior drifts between versions — a fixed task can regress. Retrieved from https://labs.qlarify.fi/findings/behavior-drift-between-versions