Llama-Guard-3-11B-Vision
Version: 1
Meta
Last updated September 2024

Llama Guard 3-11B-vision Model Card

Model Details

Llama Guard 3 Vision is a Llama-3.2-11B pretrained model, fine-tuned for content safety classification. Similar to previous versions [1-3], it can be used to safeguard content for both LLM inputs (prompt classification) and LLM responses (response classification). Llama Guard 3 Vision was specifically designed to support image reasoning use cases and was optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts. Llama Guard 3 Vision acts as an LLM: it generates text that indicates whether a given prompt or response is safe or unsafe and, if unsafe, also lists the content categories violated. Below is a response classification example input and output for Llama Guard 3 Vision.

Get started

Once you have access to the model weights, please refer to our documentation to get started. For any Llama 3.2 multimodal models, under the License and AUP, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not granted to any individual domiciled in, or any company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.

Hazard Taxonomy and Policy

The model is trained to predict safety labels on the 13 categories shown below, based on the MLCommons taxonomy of 13 hazards.
Hazard categories

S1: Violent Crimes
S2: Non-Violent Crimes
S3: Sex-Related Crimes
S4: Child Sexual Exploitation
S5: Defamation
S6: Specialized Advice
S7: Privacy
S8: Intellectual Property
S9: Indiscriminate Weapons
S10: Hate
S11: Suicide & Self-Harm
S12: Sexual Content
S13: Elections

S1 - Violent Crimes
Responses that enable, encourage, or endorse the commission of violent crimes, including: (1) unlawful violence toward people (ex: terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping) and (2) unlawful violence toward animals (ex: animal abuse)

S2 - Non-Violent Crimes
Responses that enable, encourage, or endorse the commission of non-violent crimes, including: (1) personal crimes (ex: labor trafficking, threats, intimidation, hate-motivated non-violent crimes), (2) financial crimes (ex: fraud, scams, money laundering), (3) property crimes (ex: theft, arson, vandalism), (4) drug crimes (ex: creating or using regulated substances), (5) weapons crimes (ex: producing unlicensed firearms), and (6) cyber crimes (ex: hacking)

S3 - Sex-Related Crimes
Responses that enable, encourage, or endorse the commission of sex-related crimes, including: (1) sex trafficking, (2) sexual assault (ex: rape), (3) sexual harassment (physical, verbal, or visual), and (4) prostitution

S4 - Child Sexual Exploitation
Responses that contain, describe, enable, encourage, or endorse the sexual abuse of children

S5 - Defamation
Responses that are both verifiably false and likely to injure a living person’s reputation

S6 - Specialized Advice
Responses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe

S7 - Privacy
Responses that contain sensitive, nonpublic personal information that could undermine someone’s physical, digital, or financial security

S8 - Intellectual Property
Responses that may violate the intellectual property rights of any third party

S9 - Indiscriminate Weapons
Responses that enable, encourage, or endorse the creation of indiscriminate weapons, including: (1) chemical weapons (ex: nerve gas), (2) biological weapons (ex: anthrax), (3) radiological weapons (ex: cobalt bombs), (4) nuclear weapons (ex: fission bombs), and (5) high-yield explosive weapons (ex: cluster munitions)

S10 - Hate
Responses that demean or dehumanize people on the basis of their sensitive, personal characteristics (i.e., race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and/or serious disease)

S11 - Suicide & Self-Harm
Responses that enable, encourage, or endorse acts of intentional self-harm, including: (1) suicide, (2) self-injury (ex: cutting), and (3) disordered eating

S12 - Sexual Content
Responses that contain erotica

S13 - Elections
Responses that contain factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections

Note on violation for image understanding. Compared to text-only Llama Guards, specific attention was paid to risks emerging from potential prompts to identify people in images (e.g., recognizing a real world person from the image through any visual clue). Llama Guard 3 Vision was trained to classify the response as unsafe.
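
Because the classifier reports its verdict as generated text, downstream code has to parse that text back into a label and a list of category codes. The following is a minimal sketch of such a parser. It assumes, without confirmation from this card, that the first non-empty line is `safe` or `unsafe` and that violated categories follow as comma-separated codes such as `S1,S10`. The names `Verdict` and `parse_guard_output` are hypothetical helpers, not part of any Llama Guard library.

```python
from dataclasses import dataclass, field
from typing import List

# Category codes and names copied from the hazard taxonomy above.
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
}


@dataclass
class Verdict:
    """Hypothetical container for a parsed classifier verdict."""
    safe: bool
    categories: List[str] = field(default_factory=list)


def parse_guard_output(text: str) -> Verdict:
    """Parse classifier text into a Verdict.

    ASSUMED format (not specified in this card): the first non-empty line is
    'safe' or 'unsafe'; for 'unsafe', later lines hold comma-separated codes.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label != "unsafe":
        raise ValueError(f"unexpected label: {lines[0]!r}")
    codes: List[str] = []
    for ln in lines[1:]:
        for raw in ln.split(","):
            code = raw.strip().upper()
            if not code:
                continue
            if code not in HAZARD_CATEGORIES:
                raise ValueError(f"unknown category code: {code!r}")
            codes.append(code)
    return Verdict(safe=False, categories=codes)
```

Raising on unknown labels or codes, rather than defaulting to "safe", is a deliberate choice in this sketch: a malformed output from a safety classifier is better treated as an error than silently passed through.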

Training data

To train Llama Guard 3 Vision, we employed a hybrid dataset comprising both human-generated and synthetically generated data. We collected human-created prompts paired with corresponding images, and generated benign and violating model responses using our in-house Llama models, using jailbreaking techniques to elicit the violating responses. The resulting dataset includes samples labeled either by humans or by the Llama 3.1 405B model. To ensure comprehensive coverage, we curated the dataset to span a diverse range of prompt-image pairs across all hazard categories listed above. The vision encoder rescales each input image to 224 x 224.

Evaluation

We evaluate the performance of Llama Guard 3 Vision on our internal test set, which follows the MLCommons hazard taxonomy. To the best of our knowledge, Llama Guard 3 Vision is the first safety classifier for the LLM image understanding task. We therefore use GPT-4o and GPT-4o mini, zero-shot prompted with the MLCommons hazard taxonomy, as baselines.

Table 1: Comparison of performance of various models measured on our internal test set for MLCommons hazard taxonomy.
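Model | Task | Precision | Recall | F1 | FPR
Llama Guard 3 Vision | Prompt Classification | 0.891 | 0.623 | 0.733 | 0.052
GPT-4o | Prompt Classification | 0.544 | 0.843 | 0.661 | 0.485
GPT-4o mini | Prompt Classification | 0.488 | 0.943 | 0.643 | 0.681
Llama Guard 3 Vision | Response Classification | 0.961 | 0.916 | 0.938 | 0.016
GPT-4o | Response Classification | 0.579 | 0.788 | 0.667 | 0.243
GPT-4o mini | Response Classification | 0.526 | 0.820 | 0.641 | 0.313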

Table 2: Category-wise breakdown of F1 for Llama Guard 3 Vision on our internal test set for response classification with safety labels from the MLCommons taxonomy.
Category | F1
Violent Crimes | 0.839
Non-Violent Crimes | 0.917
Sex Crimes | 0.797
Child Exploitation | 0.698
Defamation | 0.967
Specialized Advice | 0.764
Privacy | 0.847
Intellectual Property | 0.849
Indiscriminate Weapons | 0.995
Hate | 0.894
Self-Harm | 0.911
Sexual Content | 0.947
Elections | 0.957

Result Analysis. Table 1 shows that Llama Guard 3 Vision outperforms GPT-4o and GPT-4o mini in terms of F1 on both tasks, most clearly in response classification (0.938 versus 0.667 and 0.641). Notably, it does so with much lower false positive rates. For prompt classification, recall is much lower and the false positive rate higher than for response classification (recall 0.623 versus 0.916; FPR 0.052 versus 0.016). This is because the prompt (text plus image) in image understanding tasks is often ambiguous, which makes prompt classification more challenging than response classification. For example, given the text prompt "how do I buy this" with an image of multiple objects, whether the prompt is safe depends on which object the user is referring to, and in some cases this is hard to decide. In such cases, we recommend using Llama Guard 3 Vision for response classification. Table 2 shows the category breakdown on our internal test set. Llama Guard 3 Vision performs well in the Indiscriminate Weapons and Elections categories (F1 of 0.995 and 0.957), while achieving F1 scores above 0.69 in every category.
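
For readers who want to relate the four reported metrics to one another, the sketch below gives the standard definitions in terms of confusion-matrix counts and checks that each F1 value in Table 1 agrees with its precision and recall. The metric values are copied from Table 1; the function names and the tolerance of 0.001 (to absorb rounding of the published three-decimal figures) are assumptions of this illustration, and the underlying confusion counts are not published in this card.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of items flagged unsafe that are truly unsafe."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of truly unsafe items that were flagged."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of truly safe items that were wrongly flagged."""
    return fp / (fp + tn)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# (model, task, precision, recall, reported F1), copied from Table 1.
TABLE_1 = [
    ("Llama Guard 3 Vision", "prompt", 0.891, 0.623, 0.733),
    ("GPT-4o", "prompt", 0.544, 0.843, 0.661),
    ("GPT-4o mini", "prompt", 0.488, 0.943, 0.643),
    ("Llama Guard 3 Vision", "response", 0.961, 0.916, 0.938),
    ("GPT-4o", "response", 0.579, 0.788, 0.667),
    ("GPT-4o mini", "response", 0.526, 0.820, 0.641),
]


def table_1_is_consistent(tolerance: float = 1e-3) -> bool:
    """True if every reported F1 matches its precision and recall."""
    return all(
        abs(f1_score(p, r) - reported) <= tolerance
        for _, _, p, r, reported in TABLE_1
    )


if __name__ == "__main__":
    for model, task, p, r, reported in TABLE_1:
        print(f"{model:22s} {task:9s} recomputed F1 = {f1_score(p, r):.4f} "
              f"(reported {reported:.3f})")
    print("consistent:", table_1_is_consistent())
```

The definitions also make the trade-off in the analysis concrete: a classifier can raise recall by flagging more inputs, but every wrongly flagged safe input raises the false positive rate and lowers precision.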

Limitations

There are some limitations associated with Llama Guard 3 Vision. First, Llama Guard 3 Vision is itself an LLM fine-tuned on Llama 3.2-vision. Thus, its performance (e.g., judgments that need common sense knowledge, multilingual capability, and policy coverage) might be limited by its (pre-)training data. Llama Guard 3 Vision is not meant to be used as an image-only safety classifier or as a text-only safety classifier. Its task is to classify either a multimodal prompt or a multimodal prompt together with its text response. It was optimized for English and currently supports only one image per input. Images are rescaled to 224 x 224, so classification performance may vary with the original image size and level of detail. For text-only mitigation, we recommend using other safeguards in the Llama Guard family of models, such as Llama Guard 3-8B or Llama Guard 3-1B, depending on your use case. Some hazard categories may require factual, up-to-date knowledge to be evaluated (for example, S5: Defamation, S8: Intellectual Property, and S13: Elections). We believe more complex systems should be deployed to accurately moderate these categories for use cases highly sensitive to these types of hazards, but Llama Guard 3 Vision provides a good baseline for generic use cases. Lastly, as an LLM, Llama Guard 3 Vision may be susceptible to adversarial attacks [4, 5] that could bypass or alter its intended use. Please report vulnerabilities and we will look to incorporate improvements in future versions of Llama Guard.
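
The input constraints described above (one image at most, not an image-only or text-only classifier) can be turned into a simple routing check in front of the safeguards. The sketch below is one possible reading of those constraints, not an official integration. The function name `pick_safeguard` and its return strings are hypothetical; only the model names come from this card.

```python
def pick_safeguard(num_images: int, has_text: bool) -> str:
    """Suggest a safeguard for an input, following the limitations above.

    Hypothetical helper: returns a human-readable model name, or raises
    ValueError when the input falls outside what the card describes.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    if num_images > 1:
        # The card states that only one image is supported at the moment.
        raise ValueError("Llama Guard 3 Vision supports only one image")
    if num_images == 1 and has_text:
        # Multimodal prompt (optionally with a text response).
        return "Llama Guard 3 Vision"
    if num_images == 0 and has_text:
        # Text-only mitigation is delegated to the text models.
        return "Llama Guard 3-8B or Llama Guard 3-1B"
    # Image without text, or empty input: outside the intended task.
    raise ValueError("input is not a multimodal prompt or a text prompt")
```

A caller would typically run this check before building a request, so that unsupported inputs fail early with a clear message instead of producing an unreliable verdict.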

References

[1] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
[2] Llama Guard 2 Model Card
[3] Llama Guard 3-8B Model Card
[4] Universal and Transferable Adversarial Attacks on Aligned Language Models
[5] Are aligned neural networks adversarially aligned?

Citation

@misc{metallamaguard3vision,
  author =       {Llama Team},
  title =        {Meta Llama Guard 3 Vision},
  howpublished = {\url{https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard3/11B-vision/MODEL_CARD.md}},
  year =         {2024}
}

Sample input

{
  "input_data": {
    "input_string": [
      {
        "role": "user",
        "content":  [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://llava-vl.github.io/static/images/view.jpg"
            }
          },
          {
            "type": "text",
            "text": "Check if theres anything unsafe in the image."
          }
        ]
      }
    ],
    "parameters": {
      "temperature": 0.8
    }
  }
}

Sample output

{
  "output": "safe"
}
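
The sample request and response above can be built and read programmatically. The sketch below constructs a payload with the same structure as the sample input and extracts the verdict from a body shaped like the sample output. The helper names `build_payload` and `read_verdict` are hypothetical, and the endpoint URL, authentication, and HTTP call are deployment-specific and deliberately left out.

```python
import json


def build_payload(image_url: str, text: str, temperature: float = 0.8) -> dict:
    """Build a request body shaped like the sample input above."""
    return {
        "input_data": {
            "input_string": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": image_url}},
                        {"type": "text", "text": text},
                    ],
                }
            ],
            "parameters": {"temperature": temperature},
        }
    }


def read_verdict(response_body: str) -> str:
    """Return the 'output' field of a response shaped like the sample output."""
    return json.loads(response_body)["output"]


if __name__ == "__main__":
    payload = build_payload(
        "https://llava-vl.github.io/static/images/view.jpg",
        "Check if theres anything unsafe in the image.",
    )
    print(json.dumps(payload, indent=2))
    print(read_verdict('{"output": "safe"}'))
```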

Model Specifications

License: Custom
Last Updated: September 2024
Publisher: Meta
Languages: 1 Language