Jailbreak Taxonomy
We based our taxonomy on the paper "Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming" (arXiv:2311.06237). The study presents a grounded theory of the motivations, strategies, and community dynamics of people who red team large language models (LLMs) to expose their vulnerabilities. The taxonomy is organized as a nested hierarchy of Categories, Strategies, and Techniques.
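To make the nesting concrete, the sketch below models the hierarchy as plain data structures. It is a minimal illustration only, assuming Python dataclasses; the class and field names (Category, Strategy, Technique, description) are our own labels for the three levels and are not definitions taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Technique:
    """A specific, concrete tactic used to carry out a strategy."""
    name: str
    description: str = ""


@dataclass
class Strategy:
    """A high-level plan that groups the techniques used to pursue it."""
    name: str
    description: str = ""
    techniques: List[Technique] = field(default_factory=list)


@dataclass
class Category:
    """An overarching classification that groups related strategies."""
    name: str
    description: str = ""
    strategies: List[Strategy] = field(default_factory=list)
```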
Category
Categories are the overarching classifications that group related strategies and techniques used in red teaming language models. They organize the diverse approaches into coherent groups and clarify how different methods relate to one another. By establishing categories, the authors provide a structured overview of the red-teaming landscape and make it easier to identify patterns and trends in the strategies reported by participants in the study.
Strategy
Strategies are the high-level plans or approaches that guide red-teaming activity. They capture the overall objectives and methodologies participants adopt when interacting with language models, such as testing the model's limits, exposing biases, or eliciting particular types of responses. The paper discusses the strategies participants reported using, highlighting the thought processes and intentions behind their actions. Strategies provide the foundation for the more specific techniques employed to achieve the desired outcomes.
Technique
Techniques are the specific methods or tactics participants use to carry out their strategies during red teaming. They include a wide range of actions, such as prompt manipulation, linguistic alterations, or the use of particular formats to elicit responses from the model. Techniques are more granular than strategies and can vary significantly in execution. The paper provides a taxonomy of techniques, illustrating the diverse ways users interact with language models to achieve their red-teaming objectives and offering practical insight into the operational side of red teaming.
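As a usage sketch building on the dataclasses defined above, the hierarchy can be instantiated and traversed as shown below. The entries are illustrative placeholders we made up for the example; they are not labels from the paper's taxonomy.

```python
# Illustrative placeholder entries; not the paper's actual taxonomy labels.
taxonomy = [
    Category(
        name="Example category",
        description="Groups strategies that share a common framing.",
        strategies=[
            Strategy(
                name="Example strategy",
                description="A high-level plan guiding the interaction.",
                techniques=[
                    Technique(
                        name="Example technique",
                        description="A concrete tactic, e.g. a specific prompt format.",
                    )
                ],
            )
        ],
    )
]

# Walk the nested hierarchy: Category -> Strategy -> Technique.
for category in taxonomy:
    for strategy in category.strategies:
        for technique in strategy.techniques:
            print(f"{category.name} / {strategy.name} / {technique.name}")
```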