Differential Established

A/B testing in production

Serve two variants — prompts, models, or settings — to comparable slices of real traffic and let live outcomes decide which behaves better.

Published June 26, 2026

Evals Production

How it works

Offline evaluation only approximates how real users behave. A/B testing splits live traffic between variants (a new prompt against the current one, a candidate model against the incumbent) and compares real outcomes: task success, retries, escalations, satisfaction, latency. It grounds a change in user experience rather than benchmark numbers, and it is the most direct way to measure effects that appear only at scale, with real intent behind the queries.
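
One common way to split traffic is to hash a stable user identifier into a bucket, so the same user always sees the same variant. The sketch below illustrates that idea only and is not a description of any particular platform. The function name `assign_variant`, the experiment label, and the 10,000-bucket granularity are all assumptions made for the example.

```python
import hashlib

BUCKETS = 10_000  # assumed granularity: 0.01% of traffic per bucket


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: salting the hash with the experiment name keeps
    assignments in one experiment independent of those in another.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and reduce to a bucket index.
    bucket = int.from_bytes(digest[:8], "big") % BUCKETS
    threshold = round(treatment_share * BUCKETS)
    return "treatment" if bucket < threshold else "control"


if __name__ == "__main__":
    # The same user and experiment always yield the same arm.
    print(assign_variant("user-123", "new-prompt-v2"))
```

Because assignment depends only on the identifier and the experiment name, no per-user state needs to be stored, and outcomes such as task success or retries can later be grouped by the arm recomputed from the same inputs.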

When to use it

Validating a prompt or model change against real users before full rollout; optimising for engagement, success, or cost where offline metrics are weak proxies.

Limitations

Needs enough traffic and time to reach statistical significance, compares only the variants you chose to run, and risks exposing real users to the worse arm while the experiment runs.
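
To make the traffic requirement concrete, here is a minimal sketch of a two-sided, pooled two-proportion z-test on a binary outcome such as task success. It assumes independent users and samples large enough for the normal approximation. The function name and the counts in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates B - A.

    Uses the pooled-variance normal approximation, which is only
    reasonable when both arms have plenty of observations.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Same observed rates (10% vs 15%) at two different traffic levels.
    print(two_proportion_z_test(10, 100, 15, 100))
    print(two_proportion_z_test(100, 1000, 150, 1000))
```

With the illustrative counts above, the same five-point gap in success rate is not significant at the 5% level with 100 users per arm but is with 1,000 per arm, which is why low-traffic experiments need to run longer.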

Method yield

Findings: 2
Versions spanned: 4
Yield score: 7

1 High, 1 Medium

Severity-weighted across the published findings below.

Findings it surfaces (2)

Documented failures this method catches — the evidence it works.

Cite this

Qlarify Labs. (2026). A/B testing in production. Retrieved from https://labs.qlarify.fi/methods/ab-testing