Paper · High credibility · arXiv:2307.09009 · Lingjiao Chen, Matei Zaharia, James Zou · July 18, 2023

How Is ChatGPT's Behavior Changing over Time?

Our summary

Evaluates GPT-3.5 and GPT-4 on identical tasks across the March 2023 and June 2023 snapshots and finds large swings in both directions. Most starkly, GPT-4's accuracy at identifying prime vs. composite numbers fell from 84% to 51% between the two snapshots, alongside degraded instruction-following.

Why it matters

Hard evidence that a hosted model's capability is not monotonic and can silently regress between releases — the case for re-running a fixed suite over time rather than measuring once.
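As a rough illustration of re-running a fixed suite over time, the sketch below scores two stand-in "snapshots" on the same prime vs. composite items and flags any drop in accuracy beyond a tolerance. It is not the paper's evaluation harness: the suite contents, the `snapshot_a` and `snapshot_b` functions, and the 0.05 tolerance are all hypothetical, and in practice each stand-in would be replaced by a call to the hosted model.

```python
from typing import Callable, Dict, List


def is_prime(n: int) -> bool:
    """Ground-truth label, computed by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(answer: Callable[[int], bool], suite: List[int]) -> float:
    """Fraction of suite items where the answer matches the ground truth."""
    correct = sum(answer(n) == is_prime(n) for n in suite)
    return correct / len(suite)


def find_regressions(history: Dict[str, float], tolerance: float = 0.05) -> List[str]:
    """Return labels whose accuracy fell more than `tolerance` below the
    immediately preceding snapshot (dict insertion order = time order)."""
    labels = list(history)
    flagged = []
    for prev, curr in zip(labels, labels[1:]):
        if history[prev] - history[curr] > tolerance:
            flagged.append(curr)
    return flagged


# Fixed suite: the same items are re-scored on every run (hypothetical values).
SUITE = [2, 3, 4, 9, 17, 21, 29, 35, 97, 221]


# Hypothetical stand-ins for two dated snapshots of one model.
def snapshot_a(n: int) -> bool:
    return is_prime(n)  # answers every item correctly


def snapshot_b(n: int) -> bool:
    return True  # always answers "prime", right only on the primes


history = {
    "snapshot-a": accuracy(snapshot_a, SUITE),
    "snapshot-b": accuracy(snapshot_b, SUITE),
}
print(find_regressions(history))
```

Comparing each snapshot only to its predecessor keeps the check simple; a stricter variant might compare against the best accuracy seen so far.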

Cited by these methods

🔬 Drift & decay monitoring

Related findings (1)

Model behavior drifts between versions: a fixed task can regress · Medium
The same model name can perform very differently across dated snapshots: a task that passed on one release regresses on the next, with no announcement.

Published June 26, 2026

Cite this

Qlarify Labs. (2026). How Is ChatGPT's Behavior Changing over Time? Retrieved from https://labs.qlarify.fi/references/chatgpt-behavior-drift-2023