Categories

Categories are the overarching classifications that group the strategies and techniques used in red teaming language models. They organize diverse approaches into a coherent framework, making it clearer how different methods relate to one another. The authors use categories to give a structured overview of red teaming and to make patterns and trends in the strategies employed by study participants easier to identify.

Category	Description
Fictionalizing	Creating scenarios or narratives that draw on existing genres or contexts to manipulate the language model's responses.
Language	Using specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.
Possible Worlds	Constructing imagined environments where different ethics or rules apply, so that the model's behavior can be manipulated within them.
Rhetoric	Using persuasive techniques and language, such as reverse psychology or Socratic questioning, to shape the model's responses.
Stratagems	Using clever, unorthodox tactics to deceive the model, often relying on an understanding of its operational mechanics to achieve the desired outcome.
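
For readers who want to tag red-teaming transcripts with these categories, the table can be represented as a small lookup structure. The following is a minimal sketch, not something taken from the study: the names `Category`, `EXAMPLE_TECHNIQUES`, and `category_of` are hypothetical, and the mapping lists only the example techniques named in the table above, so it is deliberately incomplete.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    """The five categories from the table above."""
    FICTIONALIZING = "Fictionalizing"
    LANGUAGE = "Language"
    POSSIBLE_WORLDS = "Possible Worlds"
    RHETORIC = "Rhetoric"
    STRATAGEMS = "Stratagems"


# Assumption: only techniques explicitly named in the table are listed here.
# Categories whose descriptions name no specific technique are left out.
EXAMPLE_TECHNIQUES = {
    Category.LANGUAGE: ("prompt injection", "stylization"),
    Category.RHETORIC: ("reverse psychology", "Socratic questioning"),
}


def category_of(technique: str) -> Optional[Category]:
    """Return the category of a named technique, or None if it is not listed."""
    needle = technique.strip().lower()
    for category, techniques in EXAMPLE_TECHNIQUES.items():
        if needle in (t.lower() for t in techniques):
            return category
    return None
```

A call such as `category_of("Prompt Injection")` matches case-insensitively against the listed examples, and a technique that is not listed yields `None` rather than a guessed category.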