By alphasec in AI/ML — 18 Nov 2024

Detect Jailbreaks and Prompt Injections with Meta Prompt Guard

Detect prompt injections and jailbreaks with Meta Prompt Guard 2 — how the binary classifier works, with a full Streamlit app you can run locally or on Railway.

As large language models (LLMs) are integrated into more and more existing applications, the risk of manipulation and unintended outputs via malicious prompts becomes a more pressing concern. Ensuring the safety and integrity of AI-enabled systems is therefore not just a technical challenge, but also an important societal responsibility. In fact, OWASP, the organization well known in the application security industry for its Top 10 lists, has released a Top 10 for LLM Applications too. In this post, we'll explore a few options for dealing with the emerging problem of prompt attacks, in particular jailbreaks and prompt injections.

According to Meta, jailbreaks are "malicious instructions designed to override the safety and security features built into a model", while prompt injections are "inputs that exploit the concatenation of untrusted data from third parties and users into the context window of a model to cause the model to execute unintended instructions".

What is Prompt Guard?

Prompt Guard is a BERT-based (mDeBERTa-v3-base) classifier model by Meta for protecting LLM inputs against prompt attacks. Trained on a large corpus of attacks, it is capable of detecting both explicitly malicious prompts (jailbreaks) and data that contains injected inputs (prompt injections).

Image source: https://github.com/meta-llama/PurpleLlama/tree/main/Prompt-Guard

Prompt Guard has a context window of 512 tokens, outputs labels only, and unlike Llama Guard, does not need a specific prompt structure or configuration. For prompts longer than the context window, split the input into segments and scan each segment in parallel. The model can be used to filter inputs in high-risk scenarios or to prioritise suspicious inputs for labelling, and it can be fine-tuned on a specific input set for higher fidelity. Note that Prompt Guard evaluates the context associated with a prompt - direct or indirect - rather than the user prompt content directly. See the model card for details.
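To make the segment-and-scan idea concrete, here is a minimal sketch. It is not taken from the model card: the helper names `split_into_segments` and `any_segment_malicious` are hypothetical, the "flag the prompt if any segment is flagged" rule is my own assumption, and the segments are scanned one after another rather than in parallel to keep the example short. In a real app the token ids would come from the model's tokenizer, and you would probably want `max_len` slightly below 512 to leave room for special tokens.

```python
from typing import Callable, List, Sequence


def split_into_segments(token_ids: Sequence[int], max_len: int = 512) -> List[List[int]]:
    """Split a token-id sequence into consecutive segments of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(token_ids[i:i + max_len]) for i in range(0, len(token_ids), max_len)]


def any_segment_malicious(
    segments: List[List[int]],
    classify: Callable[[List[int]], str],
) -> bool:
    """Return True if the classifier labels at least one segment as LABEL_1.

    `classify` is a stand-in for a call to Prompt Guard on one segment.
    Treating one flagged segment as enough to flag the whole prompt is an
    assumption of this sketch, not documented model behaviour.
    """
    return any(classify(segment) == "LABEL_1" for segment in segments)


if __name__ == "__main__":
    toy_ids = list(range(1200))  # placeholder ids, not real tokenizer output
    segments = split_into_segments(toy_ids)
    print([len(s) for s in segments])  # [512, 512, 176]
```

Splitting on fixed boundaries can cut an attack string in half, so overlapping segments may be worth trying if you see misses on long inputs.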

Prompt Guard 2, released to support the Llama 4 line of models, is a drop-in replacement for the earlier version. Prompt Guard 2 comes in two model sizes, 86M and 22M - the former has been trained on both English and non-English attacks, while the latter focuses only on English text and is better suited to resource-constrained environments. With this update, Prompt Guard has shifted from a multi-label classifier to a binary classifier:

LABEL_0: benign (non-malicious input)
LABEL_1: malicious (prompt injection or jailbreak attempt)

The earlier version distinguished between BENIGN, INJECTION, and JAILBREAK as separate labels; Prompt Guard 2 collapses injection and jailbreak into a single malicious label, trading granularity for simplicity and broader language coverage.
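With only two labels, a result is easy to turn into a single "how malicious is this?" number. The sketch below assumes each result is a dict with `label` and `score` keys, which is the shape the app later in this post reads, and that the two class scores sum to 1 (a two-class softmax). The helper names and the 0.5 default threshold are illustrative choices of mine, not values recommended by Meta.

```python
LABEL_MAP = {"LABEL_0": "BENIGN", "LABEL_1": "MALICIOUS"}


def malicious_probability(result: dict) -> float:
    """Return the score of the malicious class from a top-1 result.

    Assumes a two-class softmax, so a LABEL_0 score of p implies a
    LABEL_1 score of 1 - p.
    """
    score = float(result["score"])
    return score if result["label"] == "LABEL_1" else 1.0 - score


def verdict(result: dict, threshold: float = 0.5) -> str:
    """Map a result to BENIGN or MALICIOUS using a tunable threshold (hypothetical default)."""
    return "MALICIOUS" if malicious_probability(result) >= threshold else "BENIGN"


if __name__ == "__main__":
    example = {"label": "LABEL_1", "score": 0.97}  # made-up values for illustration
    print(LABEL_MAP[example["label"]], verdict(example))
```

Raising the threshold trades missed attacks for fewer false alarms, which may suit apps where benign prompts often look a bit like instructions.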

Using Prompt Guard to Detect Prompt Attacks

The Streamlit app below runs Prompt Guard 2 (86M) locally via HuggingFace — you'll need a HuggingFace access token to download the model. The first analysis takes a few seconds while the model loads; subsequent runs are fast.

Here's Prompt Guard highlighting a trivial prompt injection attempt.

Prompt Guard detecting a prompt injection attempt

And here's Prompt Guard detecting a simple jailbreaking attempt.

Prompt Guard detecting a jailbreak attempt

Install the dependencies and run locally with:

pip install streamlit transformers huggingface_hub torch
streamlit run streamlit_app.py

Alternatively, if you'd rather not run it locally, you can deploy this app to Railway — just push the code to a GitHub repo and connect it from the Railway dashboard.

Here's the full source:

import streamlit as st
from huggingface_hub import login
from transformers import pipeline

# Streamlit app config
st.set_page_config(
    page_title="Llama Prompt Guard",
    page_icon=":llama:",
    initial_sidebar_state="auto",
)

st.subheader("Llama Prompt Guard")  
with st.sidebar:
  st.subheader("Settings")
  st.markdown(
    """
    [Prompt Guard](https://www.llama.com/docs/model-cards-and-prompt-formats/prompt-guard/) is a classifier model by Meta, trained on a large corpus of attacks, capable of detecting both explicitly malicious prompts (*jailbreaks*) as well as data that contains injected inputs (*prompt injections*).
    Upon analysis, it returns one of the following verdicts, along with a confidence score:
    * `LABEL_0`: Benign (non-malicious input)
    * `LABEL_1`: Malicious (prompt injection or jailbreak attempt)
    """
  )
  hf_token = st.text_input("HuggingFace access token", type="password", help="Get your access token [here](https://huggingface.co/settings/tokens).")
  
# Session state initialization
if "hf_login" not in st.session_state:
    st.session_state.hf_login = False
if "classifier" not in st.session_state:
    st.session_state.classifier = None

hf_model = "meta-llama/Llama-Prompt-Guard-2-86M"

label_map = {
    "LABEL_0": "BENIGN",
    "LABEL_1": "MALICIOUS"
}

with st.form("my_form"):
  prompt = st.text_area("Enter your prompt here", height=100)
  analyse = st.form_submit_button("Analyse")
          
# If "Analyse" button is clicked
if analyse:
  if not hf_token.strip():
      st.error("Please provide the HuggingFace access token.")
  elif not prompt.strip():
      st.error("Please provide the prompt to be analysed.")
  else:
      # Log in and load the model only once per session
      if not st.session_state.hf_login or st.session_state.classifier is None:
        with st.spinner("Logging in and loading model from HuggingFace, please wait...", show_time=True):
            try:
              login(token=hf_token)
              st.session_state.classifier = pipeline("text-classification", model=hf_model)
              st.session_state.hf_login = True
            except Exception as e:
              st.error(f"An error occurred during model setup: {e}")
              st.stop()  # Stop further execution if setup fails
    
      try:
        results = st.session_state.classifier(prompt)
        for result in results:
          label = label_map.get(result['label'], result['label'])
          color = "green" if label == "BENIGN" else "red"
          icon = ":material/check:" if label == "BENIGN" else ":material/warning:"
          st.badge(label, color=color, icon=icon)
          st.metric(label="Confidence", value=f"{result['score']:.2%}")
      except Exception as e:
        st.error(f"An error occurred during classification: {e}")

Prompt Guard Alternatives

Now, Prompt Guard is not the only one tackling this problem. In fact, at the time of writing, there are plenty of open source and commercial tools available. Here are just a few of those, in no particular order:
