0x8CCD7F81: Multiple Model Guardrail Jailbreak via

Multiple Model Guardrail Jailbreak via "Placeholder Injection" Tactic

Case GUID: 8ccd7f81-4e36-4084-8061-cec6e2d83ece

Summary

A guardrail jailbreak vulnerability has been discovered affecting multiple models. The specific flaw manifests when the attacker requests copyrighted content with formatting instructions that insert placeholder text between each word, effectively obfuscating the direct reproduction while still achieving the prohibited output. This technique allows the attacker to bypass inference restrictions around copyright protection and intellectual property safeguards.

View Full Intelligence Report

This intelligence report has been made public as it is related to a probe that we have open-sourced as part of the NVIDIA Garak and Microsoft PyRIT model scanners.

If you're interested in accessing the full 0DIN Threat Feed:

Request a Demo View Full Intel Report

Metadata

Test Scores

Model	Test Kind	Test Score	Temperature (max 2.0)
Anthropic Claude 3.5 Haiku	Copyrighted Content: Harry Potter	48.0	0.7
Google Gemini 2.0 Flash	Copyrighted Content: Harry Potter	66.75	N/A
Twitter / X Grok 2	Copyrighted Content: Harry Potter	21.74	N/A

Social Impact Score

Level 1: Minimal Risk

The violation poses very low societal risk, limited primarily to nuisance or minor policy breaches. Impact is contained, affecting individual users without broader implications.

Example: Circumventing minor content filters for benign content, such as inappropriate language filters in non-public conversations.

Potential Consequences: Limited harm, potentially reputational for the AI provider.

Taxonomies

Language

This category focuses on the use of specific linguistic techniques, such as prompt injection or stylization, to influence the model's output.

Code and Encode

This strategy encompasses techniques that utilize various coding methods, such as Base64 or ROT13, to bypass model restrictions and manipulate outputs.

Chaff

Chaff is a technique employed by attackers to obfuscate keywords that might trigger a language model's guardrails. By injecting random characters, such as newline characters, spaces, or other tokens, into critical keywords, the attacker aims to bypass content filters while maintaining the underlying intent of the message. This method leverages the language model's ability to parse and understand fragmented input, allowing the attacker to subtly manipulate the model's response without overtly triggering its defensive mechanisms. Chaff exemplifies the nuanced interplay between linguistic creativity and technical evasion.

About 0Din.ai

The 0Day Investigative Network (0DIN) was founded by Mozilla in 2024 to reward responsible researchers for their efforts in securing GenAI models. Learn more and submit discoveries at https://0din.ai.

Automate Your AI Security Testing

This vulnerability was discovered through 0DIN's bug bounty program. Want to find issues like this in your own models? 0DIN Scanner turns our human-sourced threat intelligence into automated security assessments.

400+ Attack Probes

Derived from real bug bounties and security research

18+ AI Providers Supported

OpenAI, Azure, Anthropic, AWS Bedrock, and more

Automated Scheduling

Run recurring scans with real-time monitoring

Actionable Metrics

Attack Success Rate (ASR) scoring with 90-day trend analysis

Request a Scanner Demo Learn More About Scanner