Mistral Medium 3 (25.05)
Version: 1
Mistral Medium 3 is a state-of-the-art, versatile model designed for a wide range of tasks, including programming, mathematical reasoning, long-document understanding, summarization, and dialogue.
It is multimodal, able to process visual inputs alongside text, and it supports dozens of natural languages as well as more than 80 programming languages. It also supports function calling and agentic workflows.
Mistral Medium 3 is optimized for single-node inference, particularly for long-context applications; its size allows it to sustain high throughput on a single node.
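To make the function-calling capability concrete, below is a minimal request sketch. It assumes an OpenAI-compatible chat completions endpoint; the endpoint URL, API key, model identifier (`mistral-medium-2505`), and the `get_weather` tool are illustrative placeholders, not values taken from this card.

```python
# Minimal sketch of a function-calling request against an OpenAI-compatible
# chat completions endpoint. ENDPOINT, API_KEY, the model id, and the
# get_weather tool are placeholders -- substitute your deployment's values.
import os
import requests

ENDPOINT = os.environ["ENDPOINT"]  # e.g. https://<your-deployment>/v1/chat/completions
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "mistral-medium-2505",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "What is the weather in Paris right now?"}
    ],
    # A single tool definition; the model may answer with a tool call
    # instead of a plain text message.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
message = resp.json()["choices"][0]["message"]
# Either a tool call the caller should execute, or a direct text answer.
print(message.get("tool_calls") or message["content"])
```

In an agentic loop, a returned `tool_calls` entry would be executed locally and its result sent back as a `tool` message in a follow-up request.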
Intended Use
Primary Use Cases
Mistral Medium 3 (25.05) is a strong, versatile model for tasks such as the following (see the multimodal request sketch after the list):
- Programming
- Math reasoning
- Dialogue
- Long document understanding
- Visual understanding
- Summarization
- Low-latency applications
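As referenced above, here is a minimal sketch of a multimodal request, pairing an image with a text prompt via the OpenAI-style content-parts format. The endpoint, API key, model identifier, and image URL are again illustrative placeholders.

```python
# Minimal sketch of a multimodal (image + text) chat request against an
# OpenAI-compatible endpoint. All identifiers and URLs are placeholders.
import os
import requests

ENDPOINT = os.environ["ENDPOINT"]
API_KEY = os.environ["API_KEY"]

payload = {
    "model": "mistral-medium-2505",  # placeholder model identifier
    "messages": [
        {
            "role": "user",
            # Content parts: one text part and one image part.
            "content": [
                {"type": "text", "text": "Summarize the chart in this image."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
            ],
        }
    ],
}

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```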
Academic Evals
Coding
Benchmark | Mistral Medium 3 | Llama 4 Maverick | GPT-4o | Claude Sonnet 3.7 | Command-A | DeepSeek 3.1 |
---|---|---|---|---|---|---|
HumanEval 0-shot | 0.921 | 0.854 | 0.915 | 0.921 | 0.829 | 0.933 |
LiveCodeBench (v6) 0-shot | 0.303 | 0.287 | 0.314 | 0.360 | 0.263 | 0.429 |
MultiPL-E average 0-shot | 0.814 | 0.764 | 0.798 | 0.834 | 0.731 | 0.849 |
Instruction Following
Benchmark | Mistral Medium 3 | Llama 4 Maverick | GPT-4o | Claude Sonnet 3.7 | Command-A | DeepSeek 3.1 |
---|---|---|---|---|---|---|
ArenaHard 0-shot | 0.971 | 0.918 | 0.954 | 0.932 | 0.951 | 0.973 |
IFEval 0-shot | 0.894 | 0.889 | 0.872 | 0.918 | 0.897 | 0.891 |
Math
Benchmark | Mistral Medium 3 | Llama 4 Maverick | GPT-4o | Claude Sonnet 3.7 | Command-A | DeepSeek 3.1 |
---|---|---|---|---|---|---|
Math500 Instruct 0-shot | 0.910 | 0.900 | 0.764 | 0.830 | 0.820 | 0.938 |
Knowledge
Benchmark | Mistral Medium 3 | Llama 4 Maverick | GPT-4o | Claude Sonnet 3.7 | Command-A | DeepSeek 3.1 |
---|---|---|---|---|---|---|
GPQA Diamond 5-shot CoT | 0.571 | 0.611 | 0.525 | 0.697 | 0.465 | 0.611 |
MMLU Pro 5-shot CoT | 0.772 | 0.804 | 0.758 | 0.800 | 0.689 | 0.811 |
Long Context
Benchmark | Mistral Medium 3 | Llama 4 Maverick | GPT-4o | Claude Sonnet 3.7 | Command-A | DeepSeek 3.1 |
---|---|---|---|---|---|---|
RULER 32K | 0.960 | 0.948 | 0.960 | 0.957 | 0.956 | 0.958 |
RULER 128K | 0.902 | 0.867 | 0.889 | 0.938 | 0.912 | 0.919 |
Multimodal
Benchmark | Mistral Medium 3 | Llama 4 Maverick | GPT-4o | Claude Sonnet 3.7 | Command-A | DeepSeek 3.1 |
---|---|---|---|---|---|---|
MMMU 0-shot | 0.661 | 0.718 | 0.661 | 0.713 | - | - |
DocVQA 0-shot | 0.953 | 0.941 | 0.859 | 0.843 | - | - |
AI2D 0-shot | 0.937 | 0.844 | 0.933 | 0.788 | - | - |
ChartQA 0-shot | 0.826 | 0.904 | 0.860 | 0.763 | - | - |
Human Evals
Mistral Win Rates vs Llama 4 Maverick
Domain | Mistral Win Rate (%) | Llama 4 Maverick Win Rate (%) |
---|---|---|
Coding | 81.82 | 18.18 |
Multimodal | 53.85 | 46.15 |
English | 66.67 | 33.33 |
French | 71.43 | 28.57 |
Spanish | 73.33 | 26.67 |
German | 62.50 | 37.50 |
Arabic | 64.71 | 35.29 |
Mistral Win Rates vs Competitors (Coding)
Model | Mistral Wins (%) | Other Model Wins (%) |
---|---|---|
claude_3_7 | 40.00 | 60.00 |
deepseek_v3_1 | 37.50 | 62.50 |
gpt_4o | 50.00 | 50.00 |
command_a | 69.23 | 30.77 |
llama_4_maverick | 81.82 | 18.18 |
Model Specifications
Context Length: 128,000 tokens
Quality Index: 0.77
License: Custom
Last Updated: May 2025
Input Type: Text, Image
Output Type: Text
Publisher: Mistral AI
Languages: 27 languages