By alphasec in AI/ML — 18 Nov 2024

Detect Jailbreaks and Prompt Injections with Meta Prompt Guard

A brief on detecting prompt attacks like injection and jailbreaks using Meta Prompt Guard.

As large language models (LLMs) get broadly integrated into existing applications, the risks of manipulation and unintended outputs using malicious inputs (prompts) become a more pressing concern. Ensuring the safety and integrity of AI-enabled systems then is not just a technical challenge, but also an important societal responsibility. In fact, OWASP, the organization well known in the application security industry for Top 10 lists, has released a Top 10 for LLM Applications too. In this post, we'll explore a few options to deal with the emerging problem of prompt attacks, in particular jailbreaks and prompt injections.

According to Meta, jailbreaks are "malicious instructions designed to override the safety and security features built into a model", while prompt injections are "inputs that exploit the concatenation of untrusted data from third parties and users into the context window of a model to cause the model to execute unintended instructions".

What is Prompt Guard?

Prompt Guard is a BERT-based (mDeBERTa-v3-base) classifier model by Meta for protecting LLM inputs against prompt attacks. Trained on a large corpus of attacks, it is capable of detecting both explicitly malicious prompts (jailbreaks) as well as data that contains injected inputs (prompt injections).

Image source: https://github.com/meta-llama/PurpleLlama/tree/main/Prompt-Guard

Prompt Guard has a context window of 512 tokens, only outputs labels, and unlike Llama Guard, does not need a specific prompt structure or configuration. For longer prompts, you'll need to split into segments and scan each segment in parallel. Upon analysis, the scan returns one or more of the following verdicts, along with a confidence score for each.

BENIGN
INJECTION
JAILBREAK

Prompt Guard can be used to filter inputs in high-risk scenarios, to prioritise suspicious inputs for labeling, or it can be fine-tuned on a specific set of inputs for higher fidelity detection. The important thing to note is that Prompt Guard is not evaluating the user prompt directly per se, but rather the context (direct or indirect) associated with it. See the model card for more details.

Update: Meta has now released Prompt Guard 2 to support the Llama 4 line of models, and can be used as a drop-in replacement for the earlier version. Prompt Guard 2 comes in two model sizes, 86M and 22M - the former has been trained on both English and non-English attacks, while the latter focuses only on English text and is better suited to resource constrained environments. With this update, Prompt Guard has shifted from a multi-label classifier to a binary classifier:

LABEL_0: benign (non-malicious input)
LABEL_1: malicious (prompt injection or jailbreak attempt)

Using Prompt Guard to Detect Prompt Attacks

Using Streamlit, I created a simple app (source code here) to test Prompt Guard - you'll just need an Hugging Face access key to download the model locally. You can deploy the app on Railway, DigitalOcean, or your favourite cloud provider. The first analysis will take a few seconds as the model gets downloaded locally, but subsequent runs should be much faster.

Here's Prompt Guard highlighting a trivial prompt injection attempt.

Prompt Guard detecting a prompt injection attempt

And here's Prompt Guard detecting a simple jailbreaking attempt.

Prompt Guard detecting a jailbreak attempt

Prompt Guard Alternatives

Now, Prompt Guard is not the only one tacking this problem. In fact, at the time of writing, there are plenty of open source and commercial tools available. Here are just a few of those, in no particular preference or order:

What is Prompt Guard?

Using Prompt Guard to Detect Prompt Attacks

Prompt Guard Alternatives

Subscribe to alphasec