Llama-Guard-3-8B

Llama-Guard-3-8B

Meta
Version: 4

Key capabilities

About this model

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

Key model capabilities

The model is trained to predict safety labels on the 14 categories shown below, based on the MLCommons taxonomy of 13 hazards, as well as an additional category for Code Interpreter Abuse for tool calls use cases. The model provides content moderation in 8 languages, and was optimized to support safety and security for search and code interpreter tool calls. Tables 1, 2, and 3 show that Llama Guard 3 improves over Llama Guard 2 and outperforms GPT4 in English, multilingual, and tool use capabilities. Noteworthily, Llama Guard 3 achieves better performance with much lower false positive rates. We also benchmark Llama Guard 3 in the OSS dataset XSTest and observe that it achieves the same F1 score but a lower false positive rate compared to Llama Guard 2. Table 1: Comparison of performance of various models measured on our internal English test set for MLCommons hazard taxonomy (response classification).
F1 ↑AUPRC ↑False Positive
Rate ↓
Llama Guard 20.8770.9270.081
Llama Guard 30.9390.9850.040
GPT40.805N/A0.152
Table 2: Comparison of multilingual performance of various models measured on our internal test set for MLCommons hazard taxonomy (prompt+response classification).
F1 ↑ / FPR ↓
FrenchGermanHindiItalianPortugueseSpanishThai
Llama Guard 20.911/0.0120.795/0.0620.832/0.0620.681/0.0390.845/0.0320.876/0.0010.822/0.078
Llama Guard 30.943/0.0360.877/0.0320.871/0.0500.873/0.0380.860/0.0600.875/0.0230.834/0.030
GPT40.795/0.1570.691/0.1230.709/0.2060.753/0.2040.738/0.2070.711/0.1690.688/0.168
Table 3: Comparison of performance of various models measured on our internal test set for other moderation capabilities (prompt+response classification).
Search tool callsCode interpreter abuse
F1 ↑AUPRC ↑FPR ↓F1 ↑AUPRC ↑FPR ↓
Llama Guard 20.7490.7940.2840.6830.6770.670
Llama Guard 30.8560.9380.1740.8850.9670.125
GPT40.732N/A0.5250.636N/A0.90
See Responsible AI for additional considerations for responsible use.

Key use cases

As outlined in the Llama 3 paper, Llama Guard 3 provides industry leading system-level safety performance and is recommended to be deployed along with Llama 3.1. It can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification), providing content moderation in 8 languages, and was optimized to support safety and security for search and code interpreter tool calls.

Out of scope use cases

There are some limitations associated with Llama Guard 3. First, Llama Guard 3 itself is an LLM fine-tuned on Llama 3.1. Thus, its performance (e.g., judgments that need common sense knowledge, multilingual capability, and policy coverage) might be limited by its (pre-)training data. Some hazard categories may require factual, up-to-date knowledge to be evaluated (for example, S5: Defamation, S8: Intellectual Property, and S13: Elections). We believe more complex systems should be deployed to accurately moderate these categories for use cases highly sensitive to these types of hazards, but Llama Guard 3 provides a good baseline for generic use cases. Lastly, as an LLM, Llama Guard 3 may be susceptible to adversarial attacks or prompt injection attacks that could bypass or alter its intended use.
Pricing is based on a number of factors, including deployment type and tokens used. See pricing details here.

Quick facts

Model providerMeta
TypeChat completion
LifecycleGenerally available (GA)