Security boundary

Prompt Extraction

Severity: Low

Prompt Extraction involves techniques used to coerce a model into revealing its underlying system prompt. While this information is not always considered confidential, some foundational models treat it as proprietary. Organizations that have developed custom models may also prefer to keep their system prompts undisclosed to maintain competitive advantage or protect sensitive configurations. Unauthorized access to these prompts can lead to unintended exposure of model behavior and vulnerabilities.

Example:

A researcher interacts with a language model designed for internal corporate use. By subtly manipulating input queries, they manage to extract fragments of the system prompt, which includes specific instructions and guidelines intended to shape the model's responses. This breach of confidentiality could potentially expose the organization's strategic intents or proprietary methodologies.

References:

MITRE ATLAS: LLM Meta Prompt Extraction (AML.T0056)
OWASP LLM 2025: LLM07:2025 System Prompt Leakage
avid-effect:security:S0301 (information leak)

Next