0din logo

Security Boundaries

Prompt Extraction

Severity: Low

Prompt Extraction involves techniques used to coerce a model into revealing its underlying system prompt. While this information is not always considered confidential, some foundational models treat it as proprietary. Organizations that have developed custom models may also prefer to keep their system prompts undisclosed to maintain competitive advantage or protect sensitive configurations. Unauthorized access to these prompts can lead to unintended exposure of model behavior and vulnerabilities.

Example:

A researcher interacts with a language model designed for internal corporate use. By subtly manipulating input queries, they manage to extract fragments of the system prompt, which includes specific instructions and guidelines intended to shape the model's responses. This breach of confidentiality could potentially expose the organization's strategic intents or proprietary methodologies.

References:

Prompt Injection

Severity: Low to High

Prompt Injection encompasses a variety of attacks where a researcher skillfully inserts hidden or malicious instructions into user prompts, system messages, or external data sources. By doing so, they can override safety mechanisms, leak training sources, disclose sensitive data, or gain insights into the model’s internal configurations. This manipulation tricks the model into revealing information or producing responses that violate the intended constraints.

Example:

A researcher interacts with a customer support chatbot trained on private billing data. By embedding a carefully crafted message that impersonates an authorized administrator, the researcher convinces the model to disclose personal details of another customer’s transaction history. Despite policies to prevent such disclosures, the chatbot follows the injected directive, illustrating a successful prompt injection attack.

References:

  • MITRE ATLAS: LLM Prompt Injection (AML.T0051), LLM Jailbreak (AML.T0054), LLM Data Leakage (AML.T0057)
  • OWASP LLM 2025: LLM01:2025 Prompt Injection, LLM02:2025 Sensitive Information Disclosure, LLM04:2025 Data and Model Poisoning
  • OWASP LLM 2023-2024: LLM02: Sensitive Information Disclosure, LLM03: Training Data Poisoning
  • avid-effect:security:S0300 (over-permissive api)
  • avid-effect:security:S0301 (information leak)
  • avid-effect:ethics:E0507 (deliberative misinformation)

Interpreter Jailbreak

Severity: Medium

An Interpreter Jailbreak exploits the model’s ability to run code or invoke external tools, escaping its controlled environment. A researcher may coerce the model into producing malicious code, granting access to underlying systems, or performing actions beyond authorized capabilities. By manipulating the instructions, the attacker leverages the model’s excessive agency to even potentially break out of sandboxed interpreters and compromise system integrity.

Example:

A coding assistant designed to help developers debug Python code runs code snippets in a secure container. Through a series of clever prompts, a researcher induces the assistant to generate and execute code that conducts attacks on other third-party systems.

References:

  • OWASP LLM 2025: LLM06:2025 Excessive Agency
  • OWASP LLM 2023-2024: LLM08: Excessive Agency
  • avid-effect:security:S0400 (model bypass)
  • avid-effect:security:S0401 (bad features)
  • avid-effect:ethics:E0505 (toxicity)

Unbounded Consumption

Severity: Medium

Unbounded Consumption involves overwhelming the AI system with resource-intensive requests or massive, complex inputs until it can no longer serve legitimate users efficiently. By forcing the model to expend excessive computational resources, an attacker can degrade performance, cause timeouts, or entirely deny access to others. This approach can result in denial of service conditions or inflated operational costs, undermining the platform’s reliability and financial stability.

Example:

A malicious actor floods a public-facing language model API with numerous large and convoluted queries. The system bogs down under the processing load, slowing to a crawl and ultimately failing to respond to genuine users. This orchestrated strain demonstrates a successful resource exhaustion attack, rendering the service unavailable.

References:

Content Manipulation

Severity: High

Content Manipulation focuses on injecting harmful or misleading elements into the data that the model consumes or produces. By poisoning the training data or guiding the model to generate code and scripts that impact end-users, attackers introduce subtle backdoors, biases, and triggers. These manipulations cause the model to produce outputs that compromise user experiences, embed malicious scripts, or skew results, turning the AI into a vehicle for exploitation.

Example:

A threat actor contributes imperceptible yet malicious instructions within publicly available training text. Once the model is retrained, a secret trigger phrase prompts it to output harmful code that, when displayed on a webpage, executes client-side attacks against users. This demonstrates how training data poisoning and content manipulation can create covert vulnerabilities triggered after deployment.

References:

  • MITRE ATLAS: Poison Training Data (AML.T0020)
  • OWASP LLM 2025: LLM04:2025 Data and Model Poisoning
  • OWASP LLM 2023-2024: LLM03: Training Data Poisoning
  • avid-effect:security:S0600 (data poisoning)
  • avid-effect:security:S0601 (ingest poisoning)
  • avid-effect:ethics:E0507 (deliberative misinformation)

Weights and Layers Disclosure

Severity: Severe

Weights & Layers Disclosure targets the heart of the AI’s intellectual property—its learned parameters and architectural details. By extracting or deducing these internal components, an attacker can replicate the model’s capabilities, clone its performance, and analyze its structure for weaknesses. This compromises competitive advantage, reveals proprietary techniques, and facilitates further adversarial activities, from unauthorized redistribution to advanced prompt exploitation.

Example:

Through a carefully engineered exploit on the model-serving infrastructure, a researcher retrieves the AI’s internal weights and layer configurations. Armed with this data, they create a near-identical replica without incurring the original training costs. This unauthorized duplication undermines the owner’s investment and could lead to widespread, uncontrolled use of the model’s technology.

References:

  • MITRE ATLAS: Full ML Model Access (AML.T0044)
  • OWASP LLM 2023-2024: LLM10: Model Theft
  • avid-effect:security:S0500 (exfiltration)
  • avid-effect:security:S0502 (model theft)