Distillation & model-extraction probing
Probe whether a deployed model can be cheaply queried to reconstruct its behaviour, training data, or a usable distilled copy — a confidentiality and IP attack surface.
Published June 26, 2026
How it works
A model exposed behind an API is also a teacher: an adversary can query it systematically and train a cheaper student on the responses, recovering much of its behaviour without its weights. Related probes test for memorised training data leaking back out verbatim. Treating extraction as a red-team objective — how much capability can be siphoned, how detectable is the harvesting — measures a confidentiality and intellectual-property risk that functional testing never touches.
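The query-then-train loop described above can be sketched on a toy problem. This is a minimal illustration, not a realistic attack: the "teacher" is a hidden linear classifier standing in for a remote API, the "student" is a perceptron, and every name here (`make_teacher`, `harvest`, `train_student`, `fidelity`, `extraction_curve`) is hypothetical. The point is the measurement: how closely the student agrees with the teacher on held-out probes as the query budget grows.

```python
import random

DIM = 4  # input dimensionality of the toy task (assumption)

def make_teacher(seed=0):
    """Stand-in for a remote API: a hidden linear classifier we can only query."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(DIM)]  # secret weights, never returned
    def query(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
    return query

def random_probe(rng):
    return [rng.uniform(-1, 1) for _ in range(DIM)]

def harvest(query, n, rng):
    """Send n random probes and record the teacher's hard-label answers."""
    return [(x, query(x)) for x in (random_probe(rng) for _ in range(n))]

def train_student(pairs, epochs=20):
    """Fit a perceptron on the harvested (input, label) pairs."""
    w = [0.0] * DIM
    for _ in range(epochs):
        for x, y in pairs:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if pred != y:
                sign = 1 if y == 1 else -1
                w = [wi + sign * xi for wi, xi in zip(w, x)]
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def fidelity(student, query, probes):
    """Fraction of held-out probes on which student and teacher agree."""
    return sum(student(x) == query(x) for x in probes) / len(probes)

def extraction_curve(budgets, seed=1):
    """Map each query budget to the fidelity of a student trained on it."""
    rng = random.Random(seed)
    teacher = make_teacher()
    held_out = [random_probe(rng) for _ in range(500)]
    return {n: fidelity(train_student(harvest(teacher, n, rng)), teacher, held_out)
            for n in budgets}

if __name__ == "__main__":
    for n, f in extraction_curve([10, 100, 1000]).items():
        print(f"{n:>5} queries -> fidelity {f:.2f}")
```

Reporting fidelity against query budget, rather than a single pass/fail, is what turns extraction into a measurable red-team objective: the curve shows how much behaviour leaks per query spent.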
When to use it
Threat-modelling any model offered as a public or partner API; assessing exposure of proprietary fine-tunes and the data behind them.
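For the second use case, exposure of the data behind a fine-tune, one simple check is whether harvested outputs share long verbatim word spans with records you own. The sketch below assumes you hold the protected records locally; the function names and the span length `n` are illustrative choices, not a standard.

```python
def word_ngrams(text, n):
    """Set of all n-word spans in `text`, lowercased and whitespace-tokenised."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaked_spans(output, protected_records, n=8):
    """Return n-word spans of `output` that occur verbatim in any protected record."""
    protected = set()
    for record in protected_records:
        protected |= word_ngrams(record, n)
    return sorted(word_ngrams(output, n) & protected)

if __name__ == "__main__":
    records = ["the quick brown fox jumps over the lazy dog near the river"]
    output = "model says the quick brown fox jumps over a fence"
    print(leaked_spans(output, records, n=5))
```

A longer `n` reduces false positives from common phrases at the cost of missing short leaks, so the threshold is a judgment call for each dataset.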
Limitations
Realistic extraction is resource-intensive and its feasibility shifts with rate-limiting and output design; a negative result shows only that extraction failed at the budget you spent, not that it is infeasible for a better-resourced adversary.
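To make the dependence on rate limits concrete, the arithmetic can be sketched as below. All figures are placeholders chosen for illustration, and `HarvestPlan` is a hypothetical helper, not part of any tool.

```python
from dataclasses import dataclass

@dataclass
class HarvestPlan:
    queries: int               # API calls the attack is assumed to need
    requests_per_minute: int   # per-key rate limit (assumed)
    cost_per_query_usd: float  # blended price per call (assumed)
    keys: int = 1              # API keys the adversary can run in parallel

    def wall_clock_hours(self):
        """Time to finish if every key runs at the rate limit."""
        return self.queries / (self.requests_per_minute * self.keys) / 60

    def cost_usd(self):
        """Total spend; parallel keys shorten the time but not the bill."""
        return self.queries * self.cost_per_query_usd

if __name__ == "__main__":
    for keys in (1, 10):
        plan = HarvestPlan(queries=1_000_000, requests_per_minute=60,
                           cost_per_query_usd=0.001, keys=keys)
        print(f"{keys:>2} key(s): {plan.wall_clock_hours():.0f} h, "
              f"${plan.cost_usd():.0f}")
```

Under these assumed figures a single key needs roughly 278 hours for a million queries, while ten keys need about 28 hours at the same total cost. A per-key rate limit therefore slows an adversary only as far as account creation is also controlled.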
Method yield
- Findings: 2
- Versions spanned: 0
- Yield score: 8
Severity-weighted across the published findings below.
Findings it surfaces (2)
Documented failures this method catches — the evidence it works.
- Production model internals extracted through the API (severity: High)
With ordinary black-box API access, researchers recovered the embedding projection layer and hidden dimension of production models for under $20 — the first precise model-stealing attack on deployed LLMs.
How it found it: Querying the API systematically and solving for the projection layer is the extraction probe itself.
Safety
- Verbatim training data extracted from a deployed chatbot (severity: High)
A 'divergence' attack made aligned ChatGPT abandon its chat format and emit memorized training data verbatim, recovering over 10,000 unique examples for about $200 in queries.
How it found it: The divergence prompt is a model-extraction probe that surfaces memorized training data.
Safety
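The idea behind the first finding, solving for the hidden dimension from API outputs, can be sketched in simulation. If the final logits are a linear projection of a hidden state that is narrower than the vocabulary, then logit vectors from any number of queries span at most as many dimensions as the hidden width, and the singular values of the stacked responses drop sharply after that many entries. The sketch assumes full logit vectors are observable and noise-free, which production APIs generally do not offer directly; `simulate_api` and `estimate_hidden_dim` are hypothetical names and the model is random.

```python
import numpy as np

def simulate_api(vocab=64, hidden=8, seed=0):
    """Toy stand-in for a model API: returns full logits for a random prompt."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(vocab, hidden))  # secret projection layer
    def logits_for_random_prompt():
        h = rng.normal(size=hidden)       # stands in for the final hidden state
        return W @ h
    return logits_for_random_prompt

def estimate_hidden_dim(query, n_queries):
    """Estimate hidden width from the singular-value gap of stacked logits.

    n_queries must exceed the true hidden width for the gap to appear.
    """
    Q = np.stack([query() for _ in range(n_queries)])  # shape (n_queries, vocab)
    s = np.linalg.svd(Q, compute_uv=False)             # descending singular values
    # Clip numerically-zero values to a common floor so ratios among them are 1.
    floor = s[0] * np.finfo(float).eps * max(Q.shape)
    s = np.maximum(s, floor)
    ratios = s[:-1] / s[1:]
    return int(np.argmax(ratios)) + 1                  # index of the largest drop

if __name__ == "__main__":
    api = simulate_api(vocab=64, hidden=8)
    print(estimate_hidden_dim(api, n_queries=32))
```

On real systems the gap is blurred by reduced-precision arithmetic and by the fact that only partial output information is exposed, so this is best read as the intuition for why output design affects what leaks.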
References & further reading
- Stealing Machine Learning Models via Prediction APIs
Florian Tramèr, Fan Zhang, Ari Juels, Michael K. Reiter, Thomas Ristenpart · USENIX Security 2016 · August 10, 2016
- Scalable Extraction of Training Data from (Production) Language Models
Milad Nasr, Nicholas Carlini, et al. · arXiv:2311.17035 · November 28, 2023
- Stealing Part of a Production Language Model
Nicholas Carlini, Daniel Paleka, et al. · ICML 2024 (arXiv:2403.06634) · March 9, 2024
Cite this
Qlarify Labs. (2026). Distillation & model-extraction probing. Retrieved from https://labs.qlarify.fi/methods/model-extraction-probing