Phi-4-mini-reasoning
Lightweight math reasoning model optimized for multi-step problem solving
Overall, Phi-4-mini-reasoning demonstrates competitive reasoning capabilities across these benchmarks, excelling particularly in mathematical reasoning tasks. The model outperforms comparable models in multiple benchmarks, reinforcing its suitability for applications requiring high-quality mathematical problem-solving.
Models from Microsoft, Partners, and Community models are a select portfolio of curated models both general-purpose and niche models across diverse scenarios by developed by Microsoft teams, partners, and community contributors
- Managed by Microsoft: Purchase and manage models directly through Azure with a single license, world class support and enterprise grade Azure infrastructure
- Validated by providers: Each model is validated and maintained by its respective provider, with Azure offering integration and deployment guidance.
- Innovation and agility: Combines Microsoft research models with rapid, community-driven advancements.
- Seamless Azure integration: Standard Microsoft Foundry experience, with support managed by the model provider.
- Flexible deployment: Deployable as Managed Compute or Serverless API, based on provider preference.
About this model
Built on synthetic and high-quality math datasets, the model leverages advanced fine-tuning techniques such as supervised fine-tuning and preference modeling to enhance reasoning capabilities. Its training incorporates safety and alignment protocols, ensuring robust and reliable performance across supported use cases.Key model capabilities
Phi-4-mini-reasoning is designed for multi-step, logic-intensive mathematical problem-solving tasks. Some of the use cases include formal proof generation, symbolic computation, advanced word problems, and a wide range of mathematical reasoning scenarios. These models excel at maintaining context across steps, applying structured logic, and delivering accurate, reliable solutions in domains that require deep analytical thinking. To understand the capabilities, the Phi-4-mini-reasoning model was evaluated using a variety of popular math reasoning benchmarks. These benchmarks assess the model's performance on complex mathematical reasoning and problem-solving tasks, particularly focused on multi-step, logic-intensive questions. The model demonstrated strong performance in reasoning and problem-solving, particularly in contexts requiring advanced mathematical understanding. We evaluate the model with three of the most popular math benchmarks where the strongest reasoning models are competing together. Specifically:- Math-500: This benchmark consists of 500 challenging math problems designed to test the model's ability to perform complex mathematical reasoning and problem-solving.
- AIME 2024: The American Invitational Mathematics Examination (AIME) is a highly regarded math competition that features a series of difficult problems aimed at assessing advanced mathematical skills and logical reasoning.
- GPQA Diamond: The Graduate-Level Google-Proof Q&A (GPQA) Diamond benchmark focuses on evaluating the model's ability to understand and solve a wide range of mathematical questions, including both straightforward calculations and more intricate problem-solving tasks.
| Benchmark | o1-mini* | DeepSeek-R1-Distill-Qwen-7B | DeepSeek-R1-Distill-Llama-8B | Bespoke-Stratos-7B* | OpenThinker-7B* | Llama-3.2-3B-Instruct | Phi-4-Mini (base model, 3.8B) | Phi-4-mini-reasoning (3.8B) | |
|---|---|---|---|---|---|---|---|---|---|
| AIME | 63.6 | 53.3 | 43.3 | 20.0 | 31.3 | 6.7 | 10.0 | 57.5 | |
| MATH-500 | 90.0 | 91.4 | 86.9 | 82.0 | 83.0 | 44.4 | 71.8 | 94.6 | |
| GPQA Diamond | 60.0 | 49.5 | 47.3 | 37.8 | 42.4 | 25.3 | 36.9 | 52.0 |
Quick facts
Model providerMicrosoft
TypeChat completion
LifecycleGenerally available (GA)
Input typetext
Output typetext
Context window128k
Token limits128k output
PricingView pricing