A/B testing in production
Serve two variants — prompts, models, or settings — to comparable slices of real traffic and let live outcomes decide which behaves better.
Published June 26, 2026
How it works
Offline evaluation only approximates the messy reality of real users. A/B testing splits live traffic between variants (a new prompt against the current one, a candidate model against the incumbent) and compares real outcomes: task success, retries, escalations, satisfaction, latency. It grounds a change in user experience rather than benchmark numbers, and it is the most direct way to measure effects that appear only at scale, with real intent behind the queries.
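One common way to split traffic is to hash a stable identifier so that each user always lands in the same arm. The sketch below is illustrative only: the function name `assign_variant`, the experiment key, and the 50/50 default are assumptions, not part of any particular platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: hashing the experiment name together with the
    user id keeps assignment stable across requests and independent
    between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and fold it into 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"


if __name__ == "__main__":
    # Example: route a request to the candidate or incumbent prompt.
    arm = assign_variant("user-1234", "prompt-v2-vs-v1")
    print(arm)
```

Assigning by user rather than by request keeps a single user's experience consistent and makes per-user outcomes such as retries or escalations attributable to one arm.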
When to use it
Validating a prompt or model change against real users before full rollout; optimising for engagement, success, or cost where offline metrics are weak proxies.
Limitations
Needs enough traffic and time for significance, only compares the variants you chose to run, and risks exposing real users to the worse arm while the experiment runs.
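To make the significance requirement concrete, here is a minimal sketch of a two-sided two-proportion z-test on a binary outcome such as task success. The function name and the counts in the example are hypothetical, and a pooled z-test is only one of several reasonable choices; it assumes independent users and samples large enough for the normal approximation.

```python
from math import erfc, sqrt


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both arms share one rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Every user succeeded or every user failed in both arms: no evidence of a difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts for illustration: 500/1000 successes vs 600/1000.
    z, p = two_proportion_z_test(500, 1000, 600, 1000)
    print(f"z={z:.2f}, p={p:.4g}")
```

With the same rates but only ten users per arm, the same test yields a large p-value, which is why low-traffic experiments need to run longer before a decision.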
Method yield
- Findings: 2
- Versions spanned: 4
- Yield score: 7
Severity-weighted across the published findings below.
Findings it surfaces (2)
Documented failures this method catches — the evidence it works.
- A production update made the model sycophantic and was rolled back (High)
An April 2025 GPT-4o update tuned on user feedback became markedly more sycophantic — validating harmful or delusional claims — and was rolled back within days.
How it found it: Comparing the new variant against the prior version on real traffic is how a behavior shift like this surfaces.
Bias
- Over-refusal of benign requests (Medium)
Safety tuning causes refusal of harmless requests that merely resemble sensitive ones.
How it found it: A/B comparing refusal rates across prompt or model variants on live traffic quantifies which one over-refuses.
Refusal
Cite this
Qlarify Labs. (2026). A/B testing in production. Retrieved from https://labs.qlarify.fi/methods/ab-testing