Detect Jailbreaks and Prompt Injections with Meta Prompt Guard

A brief on detecting prompt attacks like injection and jailbreaks using Meta Prompt Guard.

As large language models (LLMs) get broadly integrated into existing applications, the risks of manipulation and unintended outputs using malicious inputs (prompts) become a more pressing concern. Ensuring the safety and integrity of AI-enabled systems then is not just a technical challenge, but also an important societal responsibility. In fact, OWASP, the organization well known in the application security industry for Top 10 lists, has released a Top 10 for LLM Applications too. In this post, we'll explore a few options to deal with the emerging problem of prompt attacks, in particular jailbreaks and prompt injections.

According to Meta, jailbreaks are "malicious instructions designed to override the safety and security features built into a model", while prompt injections are "inputs that exploit the concatenation of untrusted data from third parties and users into the context window of a model to cause the model to execute unintended instructions".

What is Prompt Guard?

Prompt Guard is a BERT-based (mDeBERTa-v3-base) classifier model by Meta for protecting LLM inputs against prompt attacks. Trained on a large corpus of attacks, it is capable of detecting both explicitly malicious prompts (jailbreaks) as well as data that contains injected inputs (prompt injections).

Image source: https://github.com/meta-llama/PurpleLlama/tree/main/Prompt-Guard
Image source: https://github.com/meta-llama/PurpleLlama/tree/main/Prompt-Guard

Prompt Guard has a context window of 512 tokens, only outputs labels, and unlike LlamaGuard, does not need a specific prompt structure or configuration. For longer prompts, you'll need to split into segments and scan each segment in parallel. Upon analysis, the scan returns one or more of the following verdicts, along with a confidence score for each.

  • BENIGN
  • INJECTION
  • JAILBREAK

Prompt Guard can be used to filter inputs in high-risk scenarios, to prioritise suspicious inputs for labeling, or it can be fine-tuned on a specific set of inputs for higher fidelity detection. The important thing to note is that Prompt Guard is not evaluating the user prompt directly per se, but rather the context (direct or indirect) associated with it. Of course, Prompt Guard is still in its infancy and not immune to advanced attacks. See the model card for more details.

Using Prompt Guard to Detect Prompt Attacks

Using Streamlit, I created a simple app to test Prompt Guard - it's pretty straightforward really. You'll just need an Hugging Face access key to download the model locally. You can find the complete source code here.

Here's Prompt Guard highlighting a trivial prompt injection attempt.

And here's Prompt Guard detecting a simple jailbreaking attempt.

I've deployed this app to the Streamlit Community Cloud - you can play with it here. You could also deploy it on Railway, DigitalOcean, or your favourite cloud provider. The first analysis will take a few seconds as the model gets downloaded locally, but subsequent runs should be much faster.

Prompt Guard Alternatives

Now, Prompt Guard is not the only one tacking this problem. In fact, at the time of writing, there are plenty of open source and commercial tools available. Here are just a few of those, in no particular preference or order:

Subscribe to alphasec

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
jamie@example.com
Subscribe