Red team · Emerging

Distillation & model-extraction probing

Probe whether a deployed model can be cheaply queried to reconstruct its behaviour, training data, or a usable distilled copy — a confidentiality and IP attack surface.

Published June 26, 2026

Safety Robustness

How it works

A model exposed behind an API is also a teacher: an adversary can query it systematically and train a cheaper student on the responses, recovering much of its behaviour without access to its weights. Related probes test whether memorised training data can be made to leak back out verbatim. Treating extraction as a red-team objective (how much capability can be siphoned, and how detectable the harvesting is) measures a confidentiality and intellectual-property risk that functional testing never touches.
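The query-and-train loop described above can be sketched as a toy measurement of how student fidelity grows with query budget. This is a minimal illustration under stated assumptions, not a prescribed procedure: `teacher_predict` is a hypothetical stand-in for the deployed API (here a hidden linear rule over two features), the student is a plain 1-nearest-neighbour lookup, and "fidelity" is taken to mean the fraction of held-out probes on which student and teacher agree.

```python
import random

def teacher_predict(x):
    """Hypothetical stand-in for the deployed model's API: a hidden decision rule."""
    return 1 if 0.7 * x[0] - 0.4 * x[1] + 0.1 > 0 else 0

def random_point(rng):
    """Draw one query from the input space the attacker is assumed to know."""
    return (rng.uniform(-1, 1), rng.uniform(-1, 1))

def harvest(n_queries, rng):
    """Query the teacher n_queries times and keep the (input, label) pairs."""
    queries = [random_point(rng) for _ in range(n_queries)]
    return [(q, teacher_predict(q)) for q in queries]

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def student_predict(dataset, x):
    """1-nearest-neighbour student: copy the label of the closest harvested query."""
    _, label = min(dataset, key=lambda pair: squared_distance(pair[0], x))
    return label

def fidelity(dataset, probes):
    """Fraction of held-out probes on which student and teacher agree."""
    agree = sum(student_predict(dataset, p) == teacher_predict(p) for p in probes)
    return agree / len(probes)

def fidelity_curve(budgets, n_probes=500, seed=0):
    """Map each query budget to the fidelity of a student trained on that budget."""
    rng = random.Random(seed)
    probes = [random_point(rng) for _ in range(n_probes)]
    return {b: fidelity(harvest(b, rng), probes) for b in budgets}

if __name__ == "__main__":
    for budget, score in fidelity_curve([10, 100, 1000]).items():
        print(f"{budget:>5} queries -> fidelity {score:.2f}")
```

In a real engagement the teacher call would be the target API and the student a trainable model. The useful output is the shape of the curve: how many queries buy how much agreement.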

When to use it

Threat-modelling any model offered as a public or partner API; assessing exposure of proprietary fine-tunes and the data behind them.

Limitations

Realistic extraction is resource-intensive, and its feasibility shifts with rate-limiting and output design. A negative result shows only that extraction failed at the budget you spent, not that it is infeasible.
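To make the dependence on rate-limiting concrete, a back-of-the-envelope estimate of harvesting time and cost can be recorded alongside any result. The sketch below is illustrative only: `ApiLimits`, `harvest_estimate`, and every number in it are hypothetical assumptions, not measurements of any real service.

```python
from dataclasses import dataclass

@dataclass
class ApiLimits:
    requests_per_minute: int   # hypothetical rate limit per API key
    cost_per_request: float    # hypothetical price, in arbitrary currency units

def harvest_estimate(n_queries, limits, n_keys=1):
    """Return (hours, cost) needed to issue n_queries under the given limits.

    Assumes keys are rate-limited independently, so n_keys multiplies throughput
    but leaves the total cost unchanged.
    """
    minutes = n_queries / (limits.requests_per_minute * n_keys)
    return minutes / 60, n_queries * limits.cost_per_request

# Illustrative numbers only: one million queries at 60 requests per minute.
hours, cost = harvest_estimate(1_000_000, ApiLimits(60, 0.002))
print(f"{hours:.1f} h, {cost:.0f} units")
```

Reporting the budget actually spent next to a negative result makes clear how far the test went before extraction was judged unsuccessful.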

Method yield

Findings: 2
Versions spanned: 0
Yield score: 8

2 High

Severity-weighted across the published findings below.

Findings it surfaces (2)

Documented failures this method catches — the evidence it works.

Cite this

Qlarify Labs. (2026). Distillation & model-extraction probing. Retrieved from https://labs.qlarify.fi/methods/model-extraction-probing