Built with LlamaLlama Guard 3 Vision is a Llama-3.2-11B pretrained model, fine-tuned for content safety classification. Similar to previous versions [1-3], it can be used to safeguard content for both LLM inputs (prompt classification) and LLM responses (response classification). Llama Guard 3 Vision was specifically designed to support image reasoning use cases and was optimized to detect harmful multimodal (text and image) prompts and text responses to these prompts. Llama Guard 3 Vision acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated. Below is a response classification example input and output for Llama Guard 3 Vision.
Get started
Once you have access to the model weights, please refer to our documentation to get started. For any Llama 3.2 multimodal models, under the License and AUP, the rights granted under Section 1(a) of the Llama 3.2 Community License Agreement are not granted to any individual domiciled in, or any company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models.
Hazard Taxonomy and Policy
The model is trained to predict safety labels on the 13 categories shown below, based on the MLCommons taxonomy of 13 hazards.
Hazard categories
S1: Violent Crimes
S2: Non-Violent Crimes
S3: Sex-Related Crimes
S4: Child Sexual Exploitation
S5: Defamation
S6: Specialized Advice
S7: Privacy
S8: Intellectual Property
S9: Indiscriminate Weapons
S10: Hate
S11: Suicide & Self-Harm
S12: Sexual Content
S13: Elections
S1 - Violent CrimesResponses that enable, encourage, or endorse the commission of violent crimes, including: (1) unlawful violence toward people (ex: terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, kidnapping) and (2) unlawful violence toward animals (ex: animal abuse) S2 - Non-Violent CrimesResponses that enable, encourage, or endorse the commission of non-violent crimes, including: (1) personal crimes (ex: labor trafficking, threats, intimidation, hate-motivated non-violent crimes), (2) financial crimes (ex: fraud, scams, money laundering), (3) property crimes (ex: theft, arson, vandalism), (4) drug crimes (ex: creating or using regulated substances), (5) weapons crimes (ex: producing unlicensed firearms), and (6) cyber crimes (ex: hacking) S3 - Sex-Related CrimesResponses that enable, encourage, or endorse the commission of sex-related crimes, including: (1) sex trafficking, (2) sexual assault (ex: rape), (3) sexual harassment (physical, verbal, or visual), and (4) prostitution S4 - Child Sexual ExploitationResponses that contain, describe, enable, encourage, or endorse the sexual abuse of children S5 - DefamationResponses that are both verifiably false and likely to injure a living person’s reputation S6 - Specialized AdviceResponses that contain specialized financial, medical, or legal advice, or that indicate dangerous activities or objects are safe S7 - PrivacyResponses that contain sensitive, nonpublic personal information that could undermine someone’s physical, digital, or financial security S8 - Intellectual PropertyResponses that may violate the intellectual property rights of any third party S9 - Indiscriminate WeaponsResponses that enable, encourage, or endorse the creation of indiscriminate weapons, including: (1) chemical weapons (ex: nerve gas), (2) biological weapons (ex: anthrax), (3) radiological weapons (ex: cobalt bombs), (4) nuclear weapons (ex: fission bombs), and (5) high-yield explosive weapons (ex: cluster munitions) S10 - HateResponses that demean or dehumanize people on the basis of their sensitive, personal characteristics (i.e., race, color, ethnicity, national origin, disability, religious affiliation, caste, sexual orientation, sex, gender identity, and/or serious disease) S11 - Suicide & Self-HarmResponses that enable, encourage, or endorse acts of intentional self-harm, including: (1) suicide, (2) self-injury (ex: cutting), and (3) disordered eating S12 - Sexual ContentResponses that contain erotica S13 - ElectionsResponses that contain factually incorrect information about electoral systems and processes, including in the time, place, or manner of voting in civic elections Note on violation for image understanding. Compared to text-only Llama Guards, specific attention was paid to risks emerging from potential prompts to identify people in images (e.g., recognizing a real world person from the image through any visual clue). Llama Guard 3 Vision was trained to classify the response as unsafe.
Training data
To train the Llama Guard 3 Vision, we employed a hybrid dataset comprising both human-generated and synthetically generated data. Our approach involved collecting human-created prompts paired with corresponding images, as well as generating benign and violating model responses using our in-house Llama models. We utilized jailbreaking techniques to elicit violating responses from these models. The resulting dataset includes samples labeled either by humans or the Llama 3.1 405B model. To ensure comprehensive coverage, we carefully curated the dataset to encompass a diverse range of prompt-image pairs, spanning all hazard categories listed above. For the image data we use, our vision encoder will rescale it to 224 X 224.
Evaluation
We evaluate the performance of Llama Guard 3 vision on our internal test following MLCommons hazard taxonomy. To the best of our knowledge, Llama Guard 3 Vision is the first safety classifier for the LLM image understanding task. In this regard, we use GPT-4o and GPT-4o mini with zero-shot prompting using MLCommons hazard taxonomy as a baseline.
Table 1: Comparison of performance of various models measured on our internal test set for MLCommons hazard taxonomy.
Model
Task
Precision
Recall
F1
FPR
Llama Guard 3 Vision
Prompt Classification
0.891
0.623
0.733
0.052
GPT-4o
0.544
0.843
0.661
0.485
GPT-4o mini
0.488
0.943
0.643
0.681
Llama Guard 3 Vision
Response Classification
0.961
0.916
0.938
0.016
GPT-4o
0.579
0.788
0.667
0.243
GPT-4o mini
0.526
0.820
0.641
0.313
Table 2: Category-wise breakdown of F1 for Llama Guard 3 Vision on our internal test set for response classification with safety labels from the ML Commons taxonomy.
Category
F1
Violent Crimes
0.839
Non-Violent Crimes
0.917
Sex Crimes
0.797
Child Exploitation
0.698
Defamation
0.967
Specialized Advice
0.764
Privacy
0.847
Intellectual Property
0.849
Indiscriminate Weapons
0.995
Hate
0.894
Self-Harm
0.911
Sexual Content
0.947
Elections
0.957
Result Analysis. Table 1 shows that Llama Guard 3 Vision outperforms GPT-4o and GPT-4o mini in terms of F1, specifically in response classification. Noteworthily, Llama Guard 3 Vision achieves better performance with much lower false positive rates. We also observe that the recall (false positive rates) of prompt classification are much lower (higher) than in general, this is because of the ambiguity of prompt (text+image) in the image understanding tasks making prompt classification more challenging than response classification. For example, given a text prompt “how do I buy this” with an image of multiple objects, whether the prompt is safe or not depends on which objects the user is referring to, and it is hard to decide in some cases. In such cases, we recommend using Llama Guard 3 Vision in the response classification task. Table 2 also shows the category breakdown in our internal test set. We observe that Llama Guard 3 Vision performs well in Indiscriminate Weapon and Elections categories, while showing > 0.69 F1 scores across all categories.
Limitations
There are some limitations associated with Llama Guard 3 Vision. First, Llama Guard 3 Vision itself is an LLM fine-tuned on Llama 3.2-vision. Thus, its performance (e.g., judgments that need common sense knowledge, multilingual capability, and policy coverage) might be limited by its (pre-)training data. Llama Guard 3 Vision is not meant to be used as an image safety classifier nor a text-only safety classifier. Its task is to classify the multimodal prompt or the multimodal prompt along with the text response. It was optimized for English language and only supports one image at the moment. Images will be rescaled to 224X224 so the classification performance may vary. For text-only mitigation, we recommend using other safeguards in the Llama Guard family of models, such as Llama Guard 3-8B or Llama Guard 3-1B depending on your use case. Some hazard categories may require factual, up-to-date knowledge to be evaluated (for example, S5: Defamation, S8: Intellectual Property, and S13: Elections) . We believe more complex systems should be deployed to accurately moderate these categories for use cases highly sensitive to these types of hazards, but Llama Guard 3 Vision provides a good baseline for generic use cases. Lastly, as an LLM, Llama Guard 3 Vision may be susceptible to adversarial attacks [4, 5] that could bypass or alter its intended use. Please report vulnerabilities and we will look to incorporate improvements in future versions of Llama Guard.