Production model internals extracted through the API

With ordinary black-box API access, researchers recovered the embedding projection layer and hidden dimension of production models for under $20. It was the first model-stealing attack to extract precise, nontrivial information from deployed black-box LLMs.

Published June 26, 2026

Reproducibility: Often
Severity: High
Confidence: Reviewer-confirmed

Details

Carlini et al. demonstrated a model-extraction attack that, from query access alone, recovers the final embedding projection matrix of a black-box production model (up to symmetries) and thereby reveals its hidden dimension. They extracted the full matrix from OpenAI's Ada and Babbage for under $20 in queries, with OpenAI's approval, and OpenAI subsequently deployed mitigations. The result establishes that a deployed model's API is itself a leak surface for proprietary internals, which is exactly the risk that model-extraction probing is meant to quantify.

Found with

🔬 Distillation & model-extraction probing

Querying the API systematically and solving for the projection layer is the extraction probe itself.
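
The idea behind the probe can be illustrated on a simulated model. The sketch below is not the authors' code and makes several assumptions: the "API" is a toy linear head that returns full logit vectors (production APIs typically expose only log-probabilities and logit bias, from which logits must first be reconstructed), and the names estimate_hidden_dim and recover_projection_basis, the matrix sizes, and the noise level are all hypothetical. Because every logit vector lies in the column space of the projection matrix, stacking many responses gives a matrix whose numerical rank equals the hidden dimension, and whose top singular vectors span the projection layer up to an unknown linear transformation.

```python
import numpy as np


def estimate_hidden_dim(logits):
    """Estimate the hidden width from stacked logit vectors (n_queries x vocab).

    The estimate is the index of the largest multiplicative gap between
    consecutive singular values. It requires n_queries > hidden width.
    """
    s = np.linalg.svd(logits, compute_uv=False)
    gaps = np.log(s[:-1]) - np.log(s[1:])
    return int(np.argmax(gaps)) + 1


def recover_projection_basis(logits, hidden_dim):
    """Return an orthonormal basis (vocab x hidden_dim) for the column space
    of the projection matrix. This fixes the matrix only up to an unknown
    invertible hidden_dim x hidden_dim transformation."""
    _, _, vt = np.linalg.svd(logits, full_matrices=False)
    return vt[:hidden_dim].T


# Simulated "production model": a secret projection W and random hidden states.
rng = np.random.default_rng(0)
vocab, hidden, n_queries = 256, 32, 128           # toy sizes (assumption)
W = rng.normal(size=(vocab, hidden))              # secret projection matrix
H = rng.normal(size=(n_queries, hidden))          # one hidden state per query
noise = 1e-6 * rng.normal(size=(n_queries, vocab))
logits = H @ W.T + noise                          # what the toy API returns

est = estimate_hidden_dim(logits)
basis = recover_projection_basis(logits, est)

# Relative error left after projecting W onto the recovered subspace.
residual = np.linalg.norm(W - basis @ (basis.T @ W)) / np.linalg.norm(W)
print("estimated hidden dimension:", est)
print("relative subspace residual:", residual)
```

In this toy setting the singular-value gap is unambiguous because the simulated noise is tiny compared with the signal. Against a real endpoint, the cost is dominated by the queries needed to reconstruct each logit vector, which this sketch does not model.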

Evidence

https://arxiv.org/abs/2403.06634

Carlini, Paleka, et al., 'Stealing Part of a Production Language Model' (ICML 2024).

References

Stealing Part of a Production Language Model

Safety, Robustness

Source: https://arxiv.org/abs/2403.06634

Cite this

Qlarify Labs. (2026). Production model internals extracted through the API. Retrieved from https://labs.qlarify.fi/findings/production-model-extraction