Mistral Small 3.1
Version: 1
Mistral Small 3.1 (25.03) is the enhanced version of Mistral Small 3 (25.01), featuring multimodal capabilities and an extended context length of up to 128k tokens. It can now process and understand visual inputs as well as long documents, further expanding its range of applications. Like its predecessor, Mistral Small 3.1 (25.03) is a versatile model designed for tasks such as programming, mathematical reasoning, document understanding, and dialogue. Mistral Small 3.1 (25.03) was designed with low-latency applications in mind and delivers best-in-class efficiency compared to models of the same quality.
Mistral Small 3.1 (25.03) has undergone a full post-training process to align the model with human preferences and needs, so it is suitable out-of-the-box for applications that require chat or precise instruction following.
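As a quick illustration of that out-of-the-box chat behavior, the sketch below sends a simple instruction through Mistral's Python client. This is a minimal sketch, not an official recipe: the `mistralai` package, the `MISTRAL_API_KEY` environment variable, and the `mistral-small-latest` model ID are assumptions that may differ in your deployment.

```python
import os

from mistralai import Mistral  # assumed: the official `mistralai` SDK (v1+)

# Assumed setup: an API key in the environment and a model ID that routes
# to Mistral Small 3.1 (25.03); adjust both for your deployment.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-small-latest",  # assumed alias for Small 3.1 (25.03)
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarize the key idea of binary search in two sentences."},
    ],
)

print(response.choices[0].message.content)
```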
Intended Use
Primary Use Cases
Mistral Small 3.1 (25.03) is a strong, versatile model for tasks such as the following (see the usage sketch after this list):
- Programming
- Math reasoning
- Dialogue
- Long document understanding
- Visual understanding
- Summarization
- Low-latency applications
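To make the visual-understanding use case concrete, here is a hedged sketch of a multimodal request. The message content mixes a text part with an `image_url` part, following the pattern Mistral's API documentation uses for vision models; the image URL is a placeholder and the model ID is assumed.

```python
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Mixed text-and-image content; the URL is a placeholder, and a base64
# data URI can be used in its place.
response = client.chat.complete(
    model="mistral-small-latest",  # assumed alias for Small 3.1 (25.03)
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the chart in this image."},
                {"type": "image_url", "image_url": "https://example.com/chart.png"},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```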
Vision Evals
With our improved training methodologies, we observe strong vision capabilities in the model. Not only does Mistral Small 3.1 (25.03) outperform lightweight models like GPT4o mini, it also rivals larger models like Qwen2-VL 72B on visual knowledge and reasoning tasks. Moreover, with our innovations of the past few months, Mistral Small 3.1 (25.03) performs on par with Pixtral Large, which we released last year, across the board.
NOTE: For competitors, we ran all evals with our own stack, as they did not report these results.
| Model | MMMU | MMMU Pro | MathVista | ChartQA | DocVQA | AI2D |
|---|---|---|---|---|---|---|
| Mistral Small 3.1 (25.03) Instruct | 64.00 | 49.25 | 68.91 | 86.24 | 94.08 | 93.72 |
| GPT4o mini | 60.00 | 37.60 | 52.50 | - | - | - |
| Qwen2-VL 7B | 54.10 | 30.50 | 58.20 | 83.00 | 94.50 | 83.00 |
| Qwen2.5-VL 7B | 58.60 | 38.30 | 68.20 | 87.30 | 95.70 | 83.90 |
| Qwen2-VL 72B | 64.50 | 46.20 | 70.50 | 88.30 | 96.50 | 88.10 |
| Qwen2.5-VL 72B | 70.20 | 51.10 | 74.80 | 89.50 | 96.40 | 88.70 |
| Claude 3.5 Haiku | 60.50 | - | 61.60 | 87.20 | 90.00 | 92.10 |
| Gemini 2.0 Flash-Lite | 68.00 | - | - | - | - | - |
| Pixtral Large | 64.00 | - | 69.40 | 88.10 | 93.30 | 93.80 |
Text Pretrain Evals
| Model | MMLU (5-shot) | MMLU Pro (5-shot CoT) | GPQA Main (5-shot CoT) | TriviaQA (5-shot) |
|---|---|---|---|---|
| Mistral Small 3.1 (25.03) Base | 81.01% | 56.03% | 37.50% | 80.50% |
| Mistral Small 3 Base | 80.73% | 54.37% | 34.37% | 80.32% |
| Gemma 2 27B | 75.20% | - | - | 83.70% |
| Qwen 2.5 32B | 83.30% | 55.10% | 48.00% | - |
| Llama 3.1 70B | 79.30% | 53.80% | - | - |
Text Instruct Evals
In addition to its strong multimodal capabilities, Mistral Small 3.1 (25.03) retains the robust text performance of Mistral Small 3. It excels at knowledge benchmarks (MMLU, MMLU-Pro), graduate-level question answering (GPQA), reading comprehension (TriviaQA), and math and coding tasks (MATH, HumanEval). Mistral Small 3.1 (25.03) often matches or outperforms much larger models, including 70B-parameter Llama models, as well as closed-source models like GPT4o mini and Claude 3.5 Haiku.
| Model | MMLU Pro (5-shot CoT) | MATH | HumanEval | GPQA Main (5-shot CoT) |
|---|---|---|---|---|
| Mistral Small 3.1 (25.03) Instruct | 66.76% | 69.30% | 88.41% | 44.42% |
| Mistral Small 3 (25.01) Instruct | 66.30% | 70.60% | 84.80% | 45.30% |
| Gemma 2 27B Instruct | - | - | - | - |
| Qwen2.5 32B Instruct | 69.00% | 83.10% | 88.40% | 49.50% |
| Llama 3.3 70B Instruct | 68.90% | 77.00% | 88.40% | - |
| GPT4o mini | - | 70.20% | 87.20% | 40.20% |
| Claude 3.5 Haiku | 65.00% | 69.40% | 88.10% | - |
| Gemini 2.0 Flash-Lite | 71.60% | 86.80% | - | - |
Long-context Evals
Mistral Small 3.1 (25.03) is our best generalist model for long-context tasks. It demonstrates 100% retrieval on passkey evaluations up to a 128k context. Compared to both closed-source and open-source competitors, Mistral Small 3.1 (25.03) excels at question answering over long documents (LongBench v2) and at reasoning over entire contexts with challenging latent structure (Michelangelo Latent List). Most notably, Mistral Small 3.1 (25.03) improves upon Mistral Large in long-context capabilities. A sketch of the passkey protocol follows the table.
| Model | Michelangelo Latent List 128k | LongBench v2 128k |
|---|---|---|
| Mistral Small 3.1 (25.03) Instruct | 23.59% | 36.78% |
| GPT4o mini (up to 128k context) | 10.53% | 29.30% |
| Gemini 2.0 Flash-Lite (up to 1M context) | 25.22% | - |
| Qwen2.5 7B Instruct (with YaRN) (up to 128k context) | - | 30.00% |
| Qwen2.5 32B Instruct (with YaRN) (up to 128k context) | 31.66% | - |
| Qwen2.5 72B Instruct (with YaRN) (up to 128k context) | - | 42.10% |
| Llama 3.3 70B Instruct (up to 128k context) | 3.98% | 29.80% |
| Mistral Large (24.11) | 15.85% | 34.40% |
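The passkey evaluation referenced above follows a simple protocol: bury a random code deep inside filler text, send the full context to the model, and check whether the code comes back. The sketch below is an illustrative approximation under stated assumptions (the filler text, prompt wording, and model ID are all invented here), not Mistral's actual harness.

```python
import os
import random

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Build a long "haystack" with a passkey buried in the middle. Scale the
# repeat count up toward the 128k-token window for a stricter test.
passkey = str(random.randint(10_000, 99_999))
filler = "The grass is green. The sky is blue. The sun is bright. " * 2_000
midpoint = len(filler) // 2
haystack = filler[:midpoint] + f" The passkey is {passkey}. " + filler[midpoint:]

response = client.chat.complete(
    model="mistral-small-latest",  # assumed alias for Small 3.1 (25.03)
    messages=[{"role": "user", "content": haystack + "\nWhat is the passkey?"}],
)

answer = response.choices[0].message.content
print("retrieved" if passkey in answer else "missed", "->", answer)
```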
Model Specifications
- Context Length: 128,000 tokens
- Quality Index: 0.70
- License: Custom
- Training Data: Oct 2023
- Last Updated: March 2025
- Input Type: Text, Image
- Output Type: Text
- Publisher: Mistral AI
- Languages: 27 Languages
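Because the 128,000-token window is a hard limit, it can help to estimate prompt size before sending a long document. The check below uses a rough four-characters-per-token heuristic, which is an assumption for English text rather than the model's real tokenization; a proper tokenizer (for example, from Mistral's `mistral-common` package) gives exact counts.

```python
CONTEXT_LIMIT = 128_000  # tokens, per the specifications above


def fits_in_context(text: str, reply_budget: int = 4_000) -> bool:
    """Roughly check that a prompt plus a reply budget fits the 128k window.

    Uses a crude ~4 characters/token estimate (an assumption for English
    text); swap in a real tokenizer for exact counts.
    """
    estimated_tokens = len(text) // 4
    return estimated_tokens + reply_budget <= CONTEXT_LIMIT


# A ~600,000-character document is roughly 150k tokens and will not fit.
print(fits_in_context("x" * 600_000))  # False
```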