High · Safety · Reviewer-confirmed · Published

Verbatim training data extracted from a deployed chatbot

A 'divergence' attack made the aligned ChatGPT model abandon its chat format and emit memorized training data verbatim, recovering over ten thousand unique examples for about $200.

Published June 26, 2026

Reproducibility: Often
Severity: High
Confidence: Reviewer-confirmed

Details

Nasr, Carlini et al. showed that prompting ChatGPT to repeat a single token endlessly causes it to diverge from chat-style output and regurgitate memorized training data, including PII, at roughly 150x the rate seen when the model behaves normally. With about $200 of queries they recovered over ten thousand unique training examples. The finding is both a memorization/privacy failure and a model-extraction surface: the deployed, aligned model leaks its own training corpus under the right probe.
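The two steps implied by this description, spotting where a response stops repeating the token and checking whether what follows matches known text verbatim, can be sketched offline. This is a minimal illustration, not the authors' pipeline: the function names `split_at_divergence` and `verbatim_hits` are hypothetical, whitespace splitting stands in for a real tokenizer, the window length `k` is a tunable assumption, and the in-memory set lookup is only practical for a small reference corpus.

```python
import re


def split_at_divergence(output: str, word: str) -> tuple[int, str]:
    """Return (repeat_count, tail) for a model response.

    repeat_count is how many times `word` is repeated at the start of
    `output`; tail is whatever text follows once the repetition stops.
    """
    escaped = re.escape(word)
    # Leading copies of the word, separated by whitespace or punctuation.
    prefix_pattern = re.compile(rf"(?:\W*\b{escaped}\b)*", re.IGNORECASE)
    match = prefix_pattern.match(output)  # always matches (possibly empty)
    prefix = match.group(0)
    repeat_count = len(re.findall(rf"\b{escaped}\b", prefix, re.IGNORECASE))
    return repeat_count, output[match.end():].strip()


def ngrams(tokens: list[str], k: int) -> set[tuple[str, ...]]:
    """All contiguous k-token windows of `tokens` (empty if too short)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def verbatim_hits(tail: str, reference_docs: list[str], k: int = 50) -> list[str]:
    """Return the k-token windows of `tail` found verbatim in any reference doc.

    Assumption: whitespace tokens and an in-memory set are used here for
    simplicity; a large corpus would need an index built for that purpose.
    """
    reference: set[tuple[str, ...]] = set()
    for doc in reference_docs:
        reference |= ngrams(doc.split(), k)
    tokens = tail.split()
    return [
        " ".join(tokens[i:i + k])
        for i in range(len(tokens) - k + 1)
        if tuple(tokens[i:i + k]) in reference
    ]


# Toy usage with a short window so the example fits on a few lines.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
response = "poem poem poem the quick brown fox jumps over the lazy dog"
count, tail = split_at_divergence(response, "poem")
hits = verbatim_hits(tail, corpus, k=5)
print(count, len(hits))
```

A non-empty tail marks a response that left the repetition task, and any hit in `verbatim_hits` flags that tail as a candidate for verbatim memorization. A longer window makes chance matches on common phrases less likely.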

Found with

🔬 Distillation & model-extraction probing

The divergence prompt is a model-extraction probe that surfaces memorized training data.

Evidence

https://arxiv.org/abs/2311.17035

Nasr, Carlini, et al., 'Scalable Extraction of Training Data from (Production) Language Models' (2023).

References

Scalable Extraction of Training Data from (Production) Language Models

Safety, Robustness

Source: https://arxiv.org/abs/2311.17035

Cite this

Qlarify Labs. (2026). Verbatim training data extracted from a deployed chatbot. Retrieved from https://labs.qlarify.fi/findings/training-data-extraction