← Findings
HighSafetyReviewer-confirmedPublished

Verbatim training data extracted from a deployed chatbot

A 'divergence' attack made aligned ChatGPT abandon its chat format and emit memorized training data verbatim, recovering thousands of examples for about $200.

Published June 26, 2026

Reproducibility
Often
Severity
High
Confidence
Reviewer-confirmed

Details

Nasr, Carlini et al. showed that prompting ChatGPT to endlessly repeat a token causes it to diverge from chat-style output and regurgitate memorized training data — including PII — at roughly 150x the normal rate, recovering over ten thousand unique training examples for about $200. It is both a memorization/privacy failure and a model-extraction surface: the deployed, aligned model leaks its own training corpus under the right probe.

Found with

Evidence

https://arxiv.org/abs/2311.17035
Nasr, Carlini, et al., 'Scalable Extraction of Training Data from (Production) Language Models' (2023).

References

Source: https://arxiv.org/abs/2311.17035

Cite this

Qlarify Labs. (2026). Verbatim training data extracted from a deployed chatbot. Retrieved from https://labs.qlarify.fi/findings/training-data-extraction