Verbatim training data extracted from a deployed chatbot
A 'divergence' attack made aligned ChatGPT abandon its chat format and emit memorized training data verbatim, recovering more than ten thousand unique examples for about $200.
Published June 26, 2026
- Reproducibility: Often
- Severity: High
- Confidence: Reviewer-confirmed
Details
Nasr, Carlini et al. showed that prompting ChatGPT to repeat a single word indefinitely causes it to diverge from chat-style output and regurgitate memorized training data, including PII, at roughly 150x the normal rate. The attack recovered over ten thousand unique training examples for about $200 in queries. The finding is both a memorization/privacy failure and a model-extraction surface: the deployed, aligned model leaks its own training corpus under the right probe.
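To make the measurement side of this finding concrete, here is a minimal sketch of how a logged response might be checked for divergence and verbatim overlap. It is not the paper's pipeline. It assumes whitespace-split words rather than model tokens, a small in-memory set of n-grams rather than an index over a large reference corpus, and a window length `n` chosen by the caller. The names `divergence_index`, `build_index`, and `verbatim_spans` are hypothetical and exist only for illustration.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def divergence_index(tokens: List[str], repeated: str) -> int:
    """Index of the first token that is not the repeated word.

    Returns len(tokens) if the output never leaves the repetition.
    """
    target = repeated.lower()
    for i, tok in enumerate(tokens):
        # Ignore surrounding punctuation so "poem," still counts as "poem".
        if tok.strip(".,!?;:\"'").lower() != target:
            return i
    return len(tokens)


def ngrams(tokens: List[str], n: int) -> Iterable[NGram]:
    """Yield every contiguous window of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def build_index(corpus_docs: Iterable[str], n: int) -> Set[NGram]:
    """Assumption: the reference corpus is small enough to hold in a set."""
    index: Set[NGram] = set()
    for doc in corpus_docs:
        index.update(ngrams(doc.split(), n))
    return index


def verbatim_spans(output: str, repeated: str,
                   index: Set[NGram], n: int) -> List[str]:
    """Return n-token windows after the divergence point found in the index."""
    tokens = output.split()
    tail = tokens[divergence_index(tokens, repeated):]
    return [" ".join(g) for g in ngrams(tail, n) if g in index]


if __name__ == "__main__":
    # Toy reference corpus and toy logged output, both invented for this sketch.
    index = build_index(["the quick brown fox jumps over the lazy dog"], n=4)
    logged = "poem poem poem poem the quick brown fox jumps today"
    print(verbatim_spans(logged, "poem", index, n=4))
```

An exact-match set keeps the sketch short but does not scale. At realistic corpus sizes a suffix array or similar index would be needed, and a much longer window reduces the chance that common phrases are counted as memorization.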
References
Source: https://arxiv.org/abs/2311.17035
Cite this
Qlarify Labs. (2026). Verbatim training data extracted from a deployed chatbot. Retrieved from https://labs.qlarify.fi/findings/training-data-extraction