Techniques
Techniques are the specific methods or tactics that participants utilize to implement their strategies during red teaming. These can include a wide range of actions, such as prompt manipulation, linguistic alterations, or the use of particular formats to elicit responses from the model. Techniques are often more granular than strategies and can vary significantly in their execution. The paper provides a taxonomy of techniques, illustrating the diverse ways in which users can interact with language models to achieve their red teaming objectives. By detailing these techniques, the authors aim to offer practical insights into the operational aspects of red teaming in the context of language models.
Note | Description |
---|---|
Ask for Examples | This technique involves prompting the language model to provide specific instances or illustrations related to a topic. This approach helps clarify concepts and enhances the relevance and detail of the model's responses, ensuring that the output aligns closely with the user's expectations and needs. |
Base64 | This technique involves encoding data into a Base64 format, which is a method of converting binary data into an ASCII string format. This technique can be used to bypass certain content filters or restrictions imposed by language models. By encoding prompts or payloads in Base64, users can potentially manipulate the model's responses or access information that may be restricted in its original form. This method leverages the model's ability to decode and interpret the encoded data, allowing for creative and strategic interactions. |
Capitalizing | This technique involves using uppercase letters to emphasize certain words or phrases within a prompt. This approach can create a sense of urgency or importance, potentially influencing the model's response to prioritize the capitalized content. By drawing attention to specific terms, users can guide the model to focus on key aspects of the prompt, thereby enhancing the likelihood of receiving a desired output. This technique is often employed to manipulate the tone and emphasis of the generated text. |
Chaff | Chaff is a technique employed by attackers to obfuscate keywords that might trigger a language model's guardrails. By injecting random characters, such as newline characters, spaces, or other tokens, into critical keywords, the attacker aims to bypass content filters while maintaining the underlying intent of the message. This method leverages the language model's ability to parse and understand fragmented input, allowing the attacker to subtly manipulate the model's response without overtly triggering its defensive mechanisms. Chaff exemplifies the nuanced interplay between linguistic creativity and technical evasion. |
Changing Temperature | This technique involves adjusting a parameter that influences the randomness of the model's outputs. By lowering this parameter, the responses become more deterministic and focused, while increasing it allows for more creative and diverse outputs. This manipulation enables users to tailor the style and variability of the generated text, resulting in responses that can range from precise and factual to imaginative and exploratory. |
Claim Authority | This technique involves asserting expertise or authority on a subject within the prompt. By framing statements or questions in a way that conveys confidence and knowledge, users can influence the model to generate responses that align with the claimed authority. This approach can enhance the credibility of the information provided and may lead the model to produce more detailed or assertive outputs, as it responds to the perceived authority of the prompt. |
Clean Slate | This technique involves resetting the context or starting fresh with a new prompt, effectively clearing any previous interactions or biases that may have influenced the model's responses. By establishing a "clean slate," users can guide the model to focus solely on the new input without being affected by prior exchanges. This approach is useful for obtaining unbiased or untainted responses, allowing for clearer and more direct communication with the model. |
DAN - Do Anything Now | This technique involves using specific prompts designed to subvert the model's typical constraints, allowing it to generate responses that it might otherwise avoid. By employing phrases or structures that signal a "do anything now" approach, users can encourage the model to produce content that is more unrestricted and creative. This method can lead to outputs that explore unconventional ideas or perspectives, as it prompts the model to step outside its usual boundaries and engage with the prompt in a more liberated manner. |
Deceptive Formatting | A prompt injection in the most pure sense, formatting the user prompt to fabricate the appearance of system instructions, a database query, its own prediction, or some other source of input a guard railed AI system might be expecting, causing it to behave in insecure ways based on the fabricated context from an adversarially formatted user prompt. |
Escalating | This technique involves progressively increasing the complexity or intensity of the requests made to the model. Users start with a simple prompt and gradually build upon it by asking for more detailed or extreme responses. This approach can lead the model to explore deeper or more elaborate ideas, as it is encouraged to expand on the initial concept. By escalating the requests, users can guide the model to generate richer and more nuanced outputs, often pushing the boundaries of the original topic. |
Formal Language | This technique involves using structured and precise language in prompts to elicit responses that are similarly formal and academic in tone. By employing terminology and syntax typical of scholarly writing, users can influence the model to generate outputs that reflect a high level of professionalism and rigor. This approach is particularly effective for obtaining detailed explanations, analyses, or discussions that require a more serious and authoritative style, making the responses suitable for formal contexts or academic purposes. |
Forum posts | This technique involves crafting prompts that mimic the style and structure of online forum discussions. By framing questions or statements in a way that resembles typical forum posts, users can encourage the model to generate responses that are conversational and engaging. This approach can facilitate a more interactive and relatable dialogue, as it taps into the informal and community-oriented nature of forum interactions. It is useful for exploring opinions, sharing experiences, or generating discussions around specific topics in a manner that feels familiar to online communities. |
Games | This technique involves using prompts that frame interactions with the model as games or playful challenges. By introducing elements of competition, creativity, or fun, users can engage the model in a way that encourages imaginative and entertaining responses. This approach can include asking the model to generate stories, solve puzzles, or participate in role-playing scenarios. The gamification of prompts not only makes the interaction more enjoyable but also stimulates the model to produce innovative and unexpected outputs, enhancing the overall experience. |
Give Examples | This technique involves prompting the model to provide specific instances or illustrations related to a topic, which can help clarify concepts, enhance understanding, or generate more detailed and relevant responses. This approach encourages the model to draw from its training data to present concrete cases, making the information more relatable and easier to comprehend. |
Goal Hijacking | This technique refers to the process where an attacker misaligns the original goal of a prompt to redirect the model's output towards a new, often unintended goal, such as printing a target phrase or generating specific content that deviates from the initial intent. It often involves crafting prompts that manipulate the model's understanding and response, effectively "hijacking" the conversation or task at hand. |
Hex | This technique involves encoding information in hexadecimal format, which can be used to bypass model safeguards or to obscure the true nature of the input. By converting data into hex, users can manipulate how the model interprets the input, potentially leading to unintended outputs or responses that would not occur with plain text. |
Identity Characteristics | Identity characteristics refer to the attributes and traits that define an individual's or group's identity, including aspects such as social roles, cultural backgrounds, and personal experiences. In the context of interacting with language models, users can leverage identity characteristics to shape the model's responses by framing prompts that reflect specific identities or perspectives. For instance, users might ask the model to respond as if it were a particular demographic group, profession, or cultural background. This technique can help explore how the model generates outputs based on different identity contexts, revealing biases or assumptions that may be present in its training data. By utilizing identity characteristics, users can gain insights into the model's understanding of social dynamics and the implications of identity in communication. |
Ignore Previous Instructions | This technique is a form of prompt injection that allows users to override the model's prior directives or constraints. By explicitly instructing the model to disregard any previous commands or context, users can manipulate the model's behavior to produce desired outputs that may not align with its original programming. This technique often requires precise wording, such as stating "Ignore previous instructions" followed by new commands. It is similar to SQL injection in that it exploits the model's inability to differentiate between trusted and untrusted inputs. This method can be particularly effective in scenarios where the model has been restricted from discussing certain topics or generating specific types of content, enabling users to bypass these limitations and elicit responses that would typically be filtered out. |
iPython | iPython is an interactive computing environment that allows users to write and execute code in a flexible and user-friendly manner. In the context of language models, iPython can be utilized as a technique to generate responses that are well-documented and clear. By framing prompts in a way that resembles iPython commands, users can guide the model to produce outputs that mimic the structure and functionality of code execution. This approach can enhance the clarity of the model's responses and facilitate more effective communication, especially when dealing with technical or programming-related queries. The use of iPython as a strategy leverages the model's understanding of coding syntax and execution flow, making it a valuable tool for users seeking to obtain precise and informative outputs. |
Latent Space Distraction | This technique used to manipulate language models by shifting their focus away from the primary context of a prompt. This strategy involves introducing a context or scenario that diverts the model's attention, allowing users to "slip" certain instructions or requests through the model's filters. By creating a distraction, the attacker can exploit the model's tendency to associate the new context with different priorities, effectively bypassing its safeguards. For example, a user might present a seemingly unrelated topic or question that leads the model to generate outputs that align with the user's hidden agenda. This technique highlights the importance of context in language model behavior and demonstrates how subtle shifts in framing can influence the model's responses, potentially leading to unintended or unrestricted outputs. |
Matrices | Matrices, in the context of language models, refer to structured arrays of numbers or symbols that can be used as input to guide the model's processing and output generation. Users may send matrices that represent various parameters, such as transformer widths embedding dimensions, to influence how the model interprets and responds to prompts. This technique leverages the model's underlying architecture, which relies on mathematical representations of language and context. By providing matrices as input, users can manipulate the model's behavior in a more nuanced way, potentially leading to outputs that are tailored to specific requirements or constraints. This approach underscores the interplay between mathematical structures and language processing in the functioning of language models. |
Misspellings | Intentionally misspelling words to bypass filters or add a creative twist. This technique can involve simple letter swaps, phonetic replacements, or more complex alterations that still allow the intended meaning to be understood by the recipient. It is often used to evade censorship or to signal a specific subculture or in-group. |
Morse Code | This technique involves encoding prompts into Morse code to bypass filters or obscure intent. By leveraging the model’s ability to interpret structured formats, users can manipulate outputs or access restricted responses. |
Opposite World | The Opposite World technique involves creating a fictional scenario where the norms, ethics, or rules of reality are inverted or altered. In this context, users prompt the model to consider actions or decisions that would typically be deemed unacceptable or unethical in the real world, but are framed as acceptable within this alternate reality. This strategy allows users to explore the model's responses to morally ambiguous situations or to elicit creative outputs that challenge conventional thinking. By asking the model to operate under the premise of an Opposite World, users can gain insights into its understanding of morality, ethics, and the boundaries of acceptable behavior, while also examining how the model navigates complex social dynamics. This technique can be particularly useful for generating narratives or scenarios that provoke thought and discussion about real-world issues. |
Other Encoding | Other Encoding encompasses a variety of unconventional or less common encoding schemes that attackers might employ to bypass language model defenses. This category serves as a catch-all for encoding methods not explicitly listed, allowing for the inclusion of novel or emerging techniques that manipulate input data into formats that evade detection. By utilizing obscure or custom encoding schemas, attackers can obscure the true nature of their input, challenging the model's ability to recognize and respond to potentially harmful content. Other Encoding highlights the adaptive and innovative strategies used by attackers to bypass content filters. |
Personas | Personas are fictional characters or identities that users create to guide the behavior and responses of language models. By establishing a persona, users can influence the tone, style, and content of the model's outputs, tailoring them to specific audiences or contexts. This technique allows for a more engaging and relatable interaction, as the model adopts the characteristics, knowledge, and perspectives of the defined persona. For instance, a user might prompt the model to respond as a friendly teacher, a technical expert, or a historical figure, thereby shaping the conversation to fit the desired narrative. Utilizing personas can enhance the effectiveness of communication, making it easier to convey complex ideas or evoke particular emotions, while also providing a framework for exploring diverse viewpoints and experiences. This approach highlights the flexibility of language models in adapting to various roles and contexts. |
Perspective Shifting | Perspective-shifting is a technique that involves prompting the language model to adopt different viewpoints or angles when generating responses. By encouraging the model to consider a situation from various perspectives, users can elicit a broader range of insights and ideas. This approach can be particularly useful in discussions that require empathy, critical thinking, or creative problem-solving. For example, a user might ask the model to respond to a question as if it were a child, an expert, or a member of a specific community, thereby enriching the conversation with diverse interpretations and understandings. Perspective-shifting not only enhances the depth of the model's outputs but also fosters a more inclusive dialogue by acknowledging and exploring multiple sides of an issue. This technique underscores the model's ability to navigate complex social dynamics and generate responses that resonate with different audiences. |
Poetry | In the context of bypassing guardrails, the technique of poetry can be employed to navigate around restrictions or limitations imposed on language models. By framing prompts in a poetic manner, users can obscure direct requests or intentions, allowing the model to generate responses that might otherwise be restricted. This approach leverages the ambiguity and creativity inherent in poetic language, enabling users to elicit outputs that challenge the model's safeguards. For instance, by using metaphorical or abstract language, users can prompt the model to explore sensitive topics or generate content that would typically trigger guardrails. This technique highlights the potential for creative expression to circumvent established boundaries, demonstrating how language models can be influenced by the form and structure of the input they receive. By utilizing poetry as a means of evasion, users can engage with the model in ways that provoke thought and exploration beyond conventional limits. |
Regenerate Response | The "Regenerate Response" technique involves prompting the language model to produce a new output based on the same input or question. This can be particularly useful when the initial response does not meet the user's expectations or when the user seeks a different perspective or variation on the topic. By asking the model to regenerate its response, users can explore alternative interpretations, styles, or depths of information, enhancing the richness of the interaction. This technique allows for iterative refinement of the model's outputs, enabling users to hone in on the most relevant or engaging content. Additionally, it can serve as a way to test the model's consistency and adaptability, revealing how it navigates similar prompts under varying conditions. The ability to regenerate responses underscores the flexibility of language models in accommodating user needs and preferences, fostering a more dynamic and responsive dialogue. |
Reverse Psychology | Reverse psychology is a rhetorical technique used to influence the behavior or responses of a language model by framing prompts in a way that suggests the opposite of what the user actually desires. This strategy plays on the model's tendency to respond to perceived expectations or instructions, often leading it to provide outputs that align with the user's true intent when they present a contrary request. For example, a user might imply that they do not want the model to provide a certain type of information, thereby prompting the model to offer that very information in its response. This technique can be particularly effective in navigating guardrails or restrictions, as it encourages the model to bypass its usual constraints by interpreting the prompt in a way that aligns with the user's hidden agenda. By employing reverse psychology, users can creatively manipulate the model's outputs, revealing insights or information that might otherwise remain inaccessible due to the model's built-in safeguards. |
ROT13 | ROT13 is a simple letter substitution cipher that replaces a letter with the 13th letter after it in the alphabet. This technique can be used to obfuscate text, making it less recognizable to both users and models. In the context of language models, employing ROT13 can serve as a method to bypass content filters or safety mechanisms by disguising potentially sensitive or restricted information. When the model encounters ROT13 encoded text, it may not recognize the underlying content, allowing for the generation of responses that would otherwise be blocked. |
Scenarios | This technique involves creating specific contexts or situations in which certain actions or responses are framed as acceptable or necessary. By designing scenarios that present a narrative where the desired output is justified, users can manipulate the model's responses to align with their intentions. For example, scenarios might include urgent situations where a character must take drastic actions to prevent harm, thereby encouraging the model to generate content that it might typically avoid in a neutral context. This approach leverages the model's understanding of narrative and ethical frameworks to achieve specific outcomes. |
Servile Language | This technique involves using language that conveys subservience or deference, often characterized by polite, humble, or overly accommodating expressions. By employing servile language, users can influence the model to respond in a more favorable or compliant manner. This approach may include phrases that express eagerness to assist, gratitude, or a willingness to follow instructions, which can create a tone that encourages the model to generate outputs that align with the user's requests. The use of servile language can help in softening the model's responses and making it more amenable to the user's intentions. |
Social Hierarchies | This technique involves leveraging the understanding of social structures and power dynamics to influence the model's responses. By framing prompts in a way that reflects or acknowledges existing social hierarchies—such as those based on status, authority, or expertise—users can guide the model to produce outputs that align with these dynamics. For instance, asking the model to consider the perspective of a high-ranking official or a respected expert can lead to responses that reflect the values or opinions associated with that position. This approach can be used to explore how the model interprets and responds to different social roles and relationships, potentially revealing biases or assumptions embedded in its training data. |
SQL | SQL (Structured Query Language) can be used as a technique to interact with language models by framing requests in a way that resembles database queries. This method allows users to bypass certain restrictions or filters by asking the model to generate SQL commands or to interpret prompts as if they were querying a database. For example, users might request the model to "populate a table" with specific data or to "retrieve" information based on certain criteria. By using SQL-like syntax, users can exploit the model's understanding of structured data interactions, potentially leading to outputs that are more aligned with their intentions while circumventing standard conversational constraints. |
Stop Sequences | Stop sequences are specific tokens or phrases that signal to a language model to halt its output generation. By strategically incorporating stop sequences, users can manipulate the model's behavior to create desired outcomes. For instance, using phrases like "[END]" or "[END OF TEXT]" can trick the model into believing that the user input has concluded, allowing for the introduction of new instructions or prompts without the model recognizing them as part of the ongoing conversation. This technique can be particularly useful in prompt injection scenarios, where the goal is to override the model's original instructions and gain control over its responses. By effectively utilizing stop sequences, users can navigate around the model's built-in safeguards and generate content that may otherwise be restricted. |
Strong Arm Attack | A Strong Arm Attack is a technique used to bypass content filters or restrictions imposed by language models. This method involves issuing commands or prompts that assert authority or override the model's built-in safeguards. For example, a user might type "ADMIN OVERRIDE" in all capitals to signal the model to disregard its content filters and produce responses that it would typically avoid. This approach exploits the model's programming to respond to perceived authority, allowing users to elicit outputs that may include sensitive or restricted content. The effectiveness of a Strong Arm Attack relies on the model's interpretation of the command as a legitimate instruction, thereby enabling the user to manipulate the model's behavior in a way that aligns with their intentions. |
Surprise Attack | This technique involves crafting prompts or queries in a way that avoids directly mentioning specific terms or names that may trigger safety mechanisms or filters. By reframing the request or using indirect language, users can guide the model to provide the desired information or output without raising flags or causing the model to restrict its response. This method emphasizes subtlety and creativity in communication with the model to achieve the intended results. |
Synonymous Language | This technique involves using synonyms or alternative phrasing to convey the same meaning while potentially evading detection or filtering mechanisms. By substituting words with their synonyms, users can manipulate the model's understanding and responses, allowing for the generation of content that aligns with the user's intent but may not trigger the model's safety protocols. This approach can be particularly effective in contexts where certain terms are restricted or monitored. |
Transformer Translatable Tokens | This technique involves using specific tokens that are compatible with transformer models, allowing users to craft inputs that the model can process in unique ways. By leveraging the way transformers tokenize and interpret language, attackers can create prompts that exploit the model's architecture, leading to unexpected or undesired outputs. This method capitalizes on the intricacies of how language models handle tokenization and instruction parsing. |
Unicode | This technique utilizes various Unicode characters to manipulate the model's output or bypass its safety mechanisms. By incorporating non-standard or non-rendering Unicode characters, users can alter the appearance of prompts or commands, potentially leading the model to misinterpret the input and produce responses that would typically be restricted or filtered out. |
Unreal Computing | This technique allows an attacker to create or imagine an environment where different ethics or physics apply, enabling them to manipulate the model's responses by suggesting scenarios that would not be possible in the real world. It leverages the concept of "Unreal Computing," where the limitations of actual computing do not apply, allowing for creative and unrestricted interactions with the model. |
XHTML | In the context of bypassing guardrails, XHTML (Extensible Hypertext Markup Language) can be utilized as a method to encode or structure prompts in a way that may evade detection by the model's safety mechanisms. By embedding requests within XHTML tags or using XHTML syntax, users can obscure the true intent of their prompts, potentially leading the model to generate outputs that would typically be restricted. This technique takes advantage of the model's parsing capabilities, allowing for the manipulation of input in a manner that disguises sensitive content or inquiries. For instance, a user might format a prompt using XHTML elements to create a façade of innocuous content while still eliciting the desired response. This approach highlights the creative ways in which users can interact with language models, leveraging technical knowledge of markup languages to navigate around established guardrails and explore topics that may be otherwise off-limits. |