Llama-Guard-3-8B
Key capabilities
About this model
Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.
Key model capabilities
The model is trained to predict safety labels on the 14 categories shown below, based on the MLCommons taxonomy of 13 hazards, as well as an additional category for Code Interpreter Abuse for tool-call use cases. The model provides content moderation in 8 languages, and was optimized to support safety and security for search and code interpreter tool calls. Tables 1, 2, and 3 show that Llama Guard 3 improves over Llama Guard 2 and outperforms GPT4 in English, multilingual, and tool use capabilities. Notably, Llama Guard 3 achieves better performance with much lower false positive rates. We also benchmark Llama Guard 3 on the open-source XSTest dataset and observe that it achieves the same F1 score but a lower false positive rate compared to Llama Guard 2.

Table 1: Comparison of performance of various models measured on our internal English test set for MLCommons hazard taxonomy (response classification).

| | F1 ↑ | AUPRC ↑ | False Positive Rate ↓ |
|---|---|---|---|
| Llama Guard 2 | 0.877 | 0.927 | 0.081 |
| Llama Guard 3 | 0.939 | 0.985 | 0.040 |
| GPT4 | 0.805 | N/A | 0.152 |
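
For readers less familiar with the column headers, the following is a minimal sketch of how F1 and false positive rate are conventionally computed from binary labels. It assumes "unsafe" is the positive class, which the text does not state explicitly, and the function name `f1_and_fpr` is hypothetical. It illustrates the standard definitions only and is not the evaluation code used to produce the tables.

```python
from typing import Sequence, Tuple


def f1_and_fpr(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (F1, false positive rate).

    Assumption: True means "unsafe" and is treated as the positive class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # unsafe, flagged
    fp = sum(1 for t, p in pairs if not t and p)      # safe, wrongly flagged
    fn = sum(1 for t, p in pairs if t and not p)      # unsafe, missed
    tn = sum(1 for t, p in pairs if not t and not p)  # safe, passed

    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # FPR is the share of truly safe items that were flagged: FP / (FP + TN).
    negatives = fp + tn
    fpr = fp / negatives if negatives else 0.0
    return f1, fpr


if __name__ == "__main__":
    # Toy labels: 3 unsafe items followed by 5 safe items.
    truth = [True, True, True, False, False, False, False, False]
    preds = [True, True, False, True, False, False, False, False]
    f1, fpr = f1_and_fpr(truth, preds)
    print(f"F1={f1:.3f} FPR={fpr:.3f}")  # F1=0.667 FPR=0.200
```

A lower false positive rate means fewer safe prompts or responses are blocked, which is why the tables report it alongside F1.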

Table 2: Multilingual performance of various models; each cell reports F1 ↑ / FPR ↓.

| | French | German | Hindi | Italian | Portuguese | Spanish | Thai |
|---|---|---|---|---|---|---|---|
| Llama Guard 2 | 0.911/0.012 | 0.795/0.062 | 0.832/0.062 | 0.681/0.039 | 0.845/0.032 | 0.876/0.001 | 0.822/0.078 |
| Llama Guard 3 | 0.943/0.036 | 0.877/0.032 | 0.871/0.050 | 0.873/0.038 | 0.860/0.060 | 0.875/0.023 | 0.834/0.030 |
| GPT4 | 0.795/0.157 | 0.691/0.123 | 0.709/0.206 | 0.753/0.204 | 0.738/0.207 | 0.711/0.169 | 0.688/0.168 |
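
Table 3: Tool use performance of various models on search tool calls and code interpreter abuse.

| | Search tool calls: F1 ↑ | Search tool calls: AUPRC ↑ | Search tool calls: FPR ↓ | Code interpreter abuse: F1 ↑ | Code interpreter abuse: AUPRC ↑ | Code interpreter abuse: FPR ↓ |
|---|---|---|---|---|---|---|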
| Llama Guard 2 | 0.749 | 0.794 | 0.284 | 0.683 | 0.677 | 0.670 |
| Llama Guard 3 | 0.856 | 0.938 | 0.174 | 0.885 | 0.967 | 0.125 |
| GPT4 | 0.732 | N/A | 0.525 | 0.636 | N/A | 0.90 |
See Responsible AI for additional considerations for responsible use.
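
Because the model reports its verdict as generated text (safe or unsafe, plus the violated categories), callers usually need to parse that text before acting on it. The sketch below shows one way to do so and to wrap a generator with both prompt and response classification. It rests on assumptions not stated in this section: that the first non-empty output line is `safe` or `unsafe`, and that an unsafe verdict is followed by a line of comma-separated category codes such as `S5,S13`. The names `Verdict`, `parse_verdict`, `is_safe`, `moderated_reply`, and `REFUSAL`, and the `classify` and `generate` callables, are hypothetical placeholders for whatever inference client is in use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

REFUSAL = "Sorry, I can't help with that."  # hypothetical fallback message


@dataclass
class Verdict:
    safe: bool
    categories: List[str] = field(default_factory=list)


def parse_verdict(raw: str) -> Verdict:
    """Parse classifier text under the ASSUMED format described above."""
    lines = [ln.strip() for ln in raw.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")
    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True)
    if label == "unsafe":
        codes = lines[1].split(",") if len(lines) > 1 else []
        return Verdict(safe=False, categories=[c.strip() for c in codes if c.strip()])
    raise ValueError(f"unrecognized verdict: {lines[0]!r}")


def is_safe(raw: str) -> bool:
    """Fail closed: output that cannot be parsed is treated as unsafe."""
    try:
        return parse_verdict(raw).safe
    except ValueError:
        return False


def moderated_reply(
    prompt: str,
    classify: Callable[[str, Optional[str]], str],  # (prompt, response or None) -> raw verdict text
    generate: Callable[[str], str],                 # prompt -> model response
) -> str:
    # Prompt classification: check the user input before generating anything.
    if not is_safe(classify(prompt, None)):
        return REFUSAL
    response = generate(prompt)
    # Response classification: check the generated answer before returning it.
    if not is_safe(classify(prompt, response)):
        return REFUSAL
    return response
```

Treating unparseable output as unsafe is a design choice made for this sketch; a deployment might prefer to log and retry instead.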
Key use cases
As outlined in the Llama 3 paper, Llama Guard 3 provides industry-leading system-level safety performance and is recommended to be deployed along with Llama 3.1. It can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification), providing content moderation in 8 languages, and was optimized to support safety and security for search and code interpreter tool calls.
Out of scope use cases
There are some limitations associated with Llama Guard 3. First, Llama Guard 3 itself is an LLM fine-tuned on Llama 3.1. Thus, its performance (e.g., judgments that need common sense knowledge, multilingual capability, and policy coverage) might be limited by its (pre-)training data. Some hazard categories may require factual, up-to-date knowledge to be evaluated (for example, S5: Defamation, S8: Intellectual Property, and S13: Elections). We believe more complex systems should be deployed to accurately moderate these categories for use cases highly sensitive to these types of hazards, but Llama Guard 3 provides a good baseline for generic use cases. Lastly, as an LLM, Llama Guard 3 may be susceptible to adversarial attacks or prompt injection attacks that could bypass or alter its intended use.

Pricing is based on a number of factors, including deployment type and tokens used. See pricing details here.