Llama Guard 3-11B-Vision Model Card
Version: 1
Model Details
Built with Llama.

Llama Guard 3 Vision is a Llama-3.2-11B pretrained model, fine-tuned for content safety classification. Like previous versions [1-3], it can be used to safeguard content for both LLM inputs (prompt classification) and LLM responses (response classification). Llama Guard 3 Vision was specifically designed to support image reasoning use cases and was optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts. It acts as an LLM: it generates text output indicating whether a given prompt or response is safe or unsafe and, if unsafe, lists the content categories violated. An example input and output is shown in the Sample input and Sample output sections at the end of this card.
Get started
Once you have access to the model weights, please refer to our documentation to get started. For any Llama 3.2 multimodal models, under the License and AUP, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not granted to any individual domiciled in, or any company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.
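As a minimal local-inference sketch, the snippet below uses the Hugging Face transformers Mllama integration. The repo id `meta-llama/Llama-Guard-3-11B-Vision`, dtype, and decoding settings are assumptions, and the actual Guard prompt (including the hazard taxonomy) is built by the repository's chat template, which may vary across transformers versions; consult the official documentation for the supported path.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-Guard-3-11B-Vision"  # assumed Hugging Face repo id
processor = AutoProcessor.from_pretrained(model_id)
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One user turn containing one image plus text, mirroring the sample input below.
conversation = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Check if there's anything unsafe in the image."},
    ]}
]
prompt = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False
)
image = Image.open(
    requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
# Print only the newly generated tokens: "safe", or "unsafe" plus category codes.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```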
Hazard Taxonomy and Policy

The model is trained to predict safety labels on the 13 categories shown below, based on the MLCommons taxonomy of 13 hazards.

| Hazard categories | |
|---|---|
| S1: Violent Crimes | S2: Non-Violent Crimes |
| S3: Sex-Related Crimes | S4: Child Sexual Exploitation |
| S5: Defamation | S6: Specialized Advice |
| S7: Privacy | S8: Intellectual Property |
| S9: Indiscriminate Weapons | S10: Hate |
| S11: Suicide & Self-Harm | S12: Sexual Content |
| S13: Elections | |
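When consuming the classifier's verdict downstream, it can help to map the S-codes back to the category names above. The helper below is a hypothetical sketch, assuming the output format described in Model Details: a first line reading safe or unsafe, optionally followed by a line of comma-separated category codes such as "S1,S10".

```python
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
}

def parse_guard_verdict(text: str) -> dict:
    """Split a Guard verdict into a safe/unsafe flag and named categories.

    Assumes the first non-empty line is "safe" or "unsafe" and, when unsafe,
    the next line holds comma-separated codes (assumed format).
    """
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or lines[0].lower() == "safe":
        return {"safe": True, "categories": []}
    codes = lines[1].split(",") if len(lines) > 1 else []
    return {
        "safe": False,
        "categories": [HAZARD_CATEGORIES.get(c.strip(), c.strip()) for c in codes],
    }
```

For example, `parse_guard_verdict("unsafe\nS1")` returns `{"safe": False, "categories": ["Violent Crimes"]}`.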
Training data
To train Llama Guard 3 Vision, we employed a hybrid dataset comprising both human-generated and synthetically generated data. Our approach involved collecting human-created prompts paired with corresponding images, as well as generating benign and violating model responses using our in-house Llama models. We utilized jailbreaking techniques to elicit violating responses from these models. The resulting dataset includes samples labeled either by humans or by the Llama 3.1 405B model. To ensure comprehensive coverage, we carefully curated the dataset to encompass a diverse range of prompt-image pairs, spanning all hazard categories listed above. The vision encoder rescales input images to 224 x 224.
Evaluation

We evaluate the performance of Llama Guard 3 Vision on our internal test set following the MLCommons hazard taxonomy. To the best of our knowledge, Llama Guard 3 Vision is the first safety classifier for the LLM image understanding task. As baselines, we use GPT-4o and GPT-4o mini with zero-shot prompting using the MLCommons hazard taxonomy.

Table 1: Comparison of performance of various models measured on our internal test set for the MLCommons hazard taxonomy.

| Model | Task | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| Llama Guard 3 Vision | Prompt Classification | 0.891 | 0.623 | 0.733 | 0.052 |
| GPT-4o | Prompt Classification | 0.544 | 0.843 | 0.661 | 0.485 |
| GPT-4o mini | Prompt Classification | 0.488 | 0.943 | 0.643 | 0.681 |
| Llama Guard 3 Vision | Response Classification | 0.961 | 0.916 | 0.938 | 0.016 |
| GPT-4o | Response Classification | 0.579 | 0.788 | 0.667 | 0.243 |
| GPT-4o mini | Response Classification | 0.526 | 0.820 | 0.641 | 0.313 |
Table 2: Per-category F1 of Llama Guard 3 Vision on our internal test set.

| Category | F1 |
|---|---|
| Violent Crimes | 0.839 |
| Non-Violent Crimes | 0.917 |
| Sex Crimes | 0.797 |
| Child Exploitation | 0.698 |
| Defamation | 0.967 |
| Specialized Advice | 0.764 |
| Privacy | 0.847 |
| Intellectual Property | 0.849 |
| Indiscriminate Weapons | 0.995 |
| Hate | 0.894 |
| Self-Harm | 0.911 |
| Sexual Content | 0.947 |
| Elections | 0.957 |
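For reference, the four metrics in these tables describe the binary unsafe/safe decision and can be computed from confusion counts. The sketch below gives the generic definitions only; it is not the evaluation harness that produced the numbers above.

```python
from dataclasses import dataclass

@dataclass
class Confusion:
    tp: int  # unsafe examples correctly flagged unsafe
    fp: int  # safe examples incorrectly flagged unsafe
    tn: int  # safe examples correctly left unflagged
    fn: int  # unsafe examples missed

def metrics(c: Confusion) -> dict:
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    f1 = 2 * precision * recall / (precision + recall)
    fpr = c.fp / (c.fp + c.tn)  # false positive rate: safe content flagged unsafe
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```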
Limitations
There are some limitations associated with Llama Guard 3 Vision. First, Llama Guard 3 Vision is itself an LLM, fine-tuned from Llama 3.2 Vision. Thus, its performance (e.g., on judgments that require commonsense knowledge, multilingual capability, and policy coverage) might be limited by its (pre-)training data.

Llama Guard 3 Vision is not meant to be used as an image-only safety classifier nor as a text-only safety classifier. Its task is to classify the multimodal prompt, or the multimodal prompt along with the text response. It was optimized for the English language and only supports one image at the moment. Images are rescaled to 224 x 224, so classification performance may vary depending on the original image. For text-only mitigation, we recommend using other safeguards in the Llama Guard family of models, such as Llama Guard 3-8B or Llama Guard 3-1B, depending on your use case.

Some hazard categories may require factual, up-to-date knowledge to be evaluated correctly (for example, S5: Defamation, S8: Intellectual Property, and S13: Elections). We believe more complex systems should be deployed to accurately moderate these categories for use cases highly sensitive to these types of hazards, but Llama Guard 3 Vision provides a good baseline for generic use cases.

Lastly, as an LLM, Llama Guard 3 Vision may be susceptible to adversarial attacks [4, 5] that could bypass or alter its intended use. Please report vulnerabilities, and we will look to incorporate improvements in future versions of Llama Guard.
References

[1] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations
[2] Llama Guard 2 Model Card
[3] Llama Guard 3-8B Model Card
[4] Universal and Transferable Adversarial Attacks on Aligned Language Models
[5] Are Aligned Neural Networks Adversarially Aligned?

Citation
```bibtex
@misc{metallamaguard3vision,
  author = {Llama Team},
  title = {Meta Llama Guard 3 Vision},
  howpublished = {\url{https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard3/11B-vision/MODEL_CARD.md}},
  year = {2024}
}
```
Sample input
```json
{
  "input_data": {
    "input_string": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://llava-vl.github.io/static/images/view.jpg"
            }
          },
          {
            "type": "text",
            "text": "Check if there's anything unsafe in the image."
          }
        ]
      }
    ],
    "parameters": {
      "temperature": 0.8
    }
  }
}
```
Sample output
```json
{
  "output": "safe"
}
```
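If the model is served behind an HTTP endpoint that accepts the payload above, a request could look like the following sketch. The endpoint URL, API key, and Authorization header are placeholders; the exact auth scheme and response shape depend on your hosting platform.

```python
import json
import urllib.request

ENDPOINT_URL = "https://<your-endpoint>/score"  # placeholder
API_KEY = "<your-api-key>"  # placeholder

payload = {
    "input_data": {
        "input_string": [
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://llava-vl.github.io/static/images/view.jpg"}},
                {"type": "text", "text": "Check if there's anything unsafe in the image."},
            ]}
        ],
        "parameters": {"temperature": 0.8},
    }
}

request = urllib.request.Request(
    ENDPOINT_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",  # auth scheme varies by platform
    },
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # e.g. {"output": "safe"}
```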
Model Specifications
License: Custom
Last Updated: September 2024
Publisher: Meta
Languages: 1 Language