Categories
Categories are the top-level classifications that group the strategies and techniques used in red teaming language models. They organize diverse approaches into a coherent framework, making it clearer how different methods relate to one another. By establishing categories, the authors aim to give a structured overview of the red teaming landscape and to make patterns and trends in the strategies used by study participants easier to identify.
Category | Description |
---|---|
Fictionalizing | This category involves creating scenarios or narratives that leverage existing genres or contexts to manipulate the language model's responses. |
Language | This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output. |
Possible Worlds | This category entails constructing imaginative environments where different ethics or rules apply, allowing for creative manipulation of the model's behavior. |
Rhetoric | This category employs persuasive techniques and language to shape the model's responses, often using methods like reverse psychology or Socratic questioning. |
Stratagems | This category involves clever and unorthodox tactics designed to deceive the model, often requiring an understanding of its operational mechanics to achieve desired outcomes. |
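
For readers who want to tag red teaming transcripts with these categories, the table can be encoded as a small lookup structure. The following is only an illustrative sketch, not the authors' tooling: the names `Category`, `EXAMPLE_TECHNIQUES`, and `category_of` are hypothetical, and the only techniques listed are the four that the table itself names as examples.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    """The five categories from the table above."""
    FICTIONALIZING = "Fictionalizing"
    LANGUAGE = "Language"
    POSSIBLE_WORLDS = "Possible Worlds"
    RHETORIC = "Rhetoric"
    STRATAGEMS = "Stratagems"


# Hypothetical mapping: only techniques that the table explicitly names
# are listed; categories without named examples are left empty.
EXAMPLE_TECHNIQUES = {
    Category.FICTIONALIZING: [],
    Category.LANGUAGE: ["prompt injection", "stylization"],
    Category.POSSIBLE_WORLDS: [],
    Category.RHETORIC: ["reverse psychology", "Socratic questioning"],
    Category.STRATAGEMS: [],
}


def category_of(technique: str) -> Optional[Category]:
    """Return the category listing this technique, or None if it is not listed."""
    needle = technique.strip().lower()
    for category, techniques in EXAMPLE_TECHNIQUES.items():
        if needle in (t.lower() for t in techniques):
            return category
    return None
```

A call such as `category_of("Prompt Injection")` returns `Category.LANGUAGE`, while an unlisted technique returns `None`. Matching is case-insensitive so that labels written by different annotators still resolve to the same category.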