DeepSeek-R1-0528
Version: 1
Provider: DeepSeek | Last updated: December 2025
The DeepSeek R1 0528 model has improved reasoning capabilities. This version also offers a reduced hallucination rate, enhanced support for function calling, and a better experience for vibe coding.
Reasoning
Coding
Agents

Direct from Azure models

Direct from Azure models are a select portfolio curated for their market-differentiated capabilities:
  • Secure and managed by Microsoft: Purchase and manage models directly through Azure with a single license, consistent support, and no third-party dependencies, backed by Azure's enterprise-grade infrastructure.
  • Streamlined operations: Benefit from unified billing, governance, and seamless PTU portability across models hosted on Azure - all as part of one Azure AI Foundry platform.
  • Future-ready flexibility: Access the latest models as they become available, and easily test, deploy, or switch between them within Azure AI Foundry, reducing integration effort.
  • Cost control and optimization: Scale on demand with pay-as-you-go flexibility or reserve PTUs for predictable performance and savings.
Learn more about Direct from Azure models.

Key capabilities

About this model

Compared to the previous version, the upgraded model shows significant improvements in handling complex reasoning tasks. For instance, on the AIME 2025 test, the model's accuracy increased from 70% in the previous version to 87.5% in the current version. This advancement stems from greater thinking depth during the reasoning process: on the AIME test set, the previous model used an average of 12K tokens per question, whereas the new version averages 23K tokens per question.

Key model capabilities

Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and a better experience for vibe coding.
Category | Benchmark (Metric) | DeepSeek R1 | DeepSeek R1 0528
General | MMLU-Redux (EM) | 92.9 | 93.4
General | MMLU-Pro (EM) | 84.0 | 85.0
General | GPQA-Diamond (Pass@1) | 71.5 | 81.0
General | SimpleQA (Correct) | 30.1 | 27.8
General | FRAMES (Acc.) | 82.5 | 83.0
General | Humanity's Last Exam (Pass@1) | 8.5 | 17.7
Code | LiveCodeBench (2408-2505) (Pass@1) | 63.5 | 73.3
Code | Codeforces-Div1 (Rating) | 1530 | 1930
Code | SWE Verified (Resolved) | 49.2 | 57.6
Code | Aider-Polyglot (Acc.) | 53.3 | 71.6
Math | AIME 2024 (Pass@1) | 79.8 | 91.4
Math | AIME 2025 (Pass@1) | 70.0 | 87.5
Math | HMMT 2025 (Pass@1) | 41.7 | 79.4
Math | CNMO 2024 (Pass@1) | 78.8 | 86.9
Tools | BFCL_v3_MultiTurn (Acc) | - | 37.0
Tools | Tau-Bench (Pass@1) | - | 53.5 (Airline) / 63.9 (Retail)
Note: We use the Agentless framework to evaluate model performance on SWE Verified. We evaluate only text-only prompts in the HLE test set. GPT-4.1 is employed to play the user role in the Tau-Bench evaluation.

Use cases

See Responsible AI for additional considerations for responsible use.

Key use cases

The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic.

Out of scope use cases

Microsoft and external researchers have found DeepSeek R1 to be less aligned than other models, meaning it appears to have undergone less refinement designed to make its behavior and outputs safe and appropriate for users. This results in (i) higher risk that the model will produce potentially harmful content and (ii) lower scores on safety and jailbreak benchmarks. We recommend that customers use Azure AI Content Safety in conjunction with this model, conduct their own evaluations on production systems, and consider suppressing the model's reasoning output in production, since the reasoning output may contain more harmful content than the final response. See Responsible AI considerations below for details.
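If you choose to suppress the reasoning output before displaying responses to users, a minimal sketch follows. It assumes the reasoning is wrapped in `<think>…</think>` tags; the exact tag name and response format depend on your deployment, so treat the tag as an assumption to verify.

```python
import re

def strip_reasoning(text: str, tag: str = "think") -> str:
    """Remove reasoning spans before display.

    Assumes reasoning is wrapped in <think>...</think> tags; the
    actual tag name may differ depending on deployment and SDK.
    """
    return re.sub(rf"<{tag}>.*?</{tag}>\s*", "", text, flags=re.DOTALL)

raw = "<think>Chain of thought that may be unsafe to show.</think>Final answer: 42"
print(strip_reasoning(raw))  # Final answer: 42
```

In production you would apply this (or an equivalent structured-field filter, if your SDK returns reasoning separately) before logging or rendering model output.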

Pricing

Pricing is based on a number of factors, including deployment type and tokens used. See pricing details here.

Technical specs

For all our models, the maximum generation length is set to 64K tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 16 responses per query to estimate pass@1.
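Under this sampling scheme, the pass@1 estimate for a query is simply the fraction of the 16 sampled responses judged correct, averaged over queries. A minimal illustration (the grading flags below are made up for demonstration):

```python
def pass_at_1(correct_flags):
    """Estimate pass@1 as the fraction of sampled responses
    that are graded correct for a single query."""
    return sum(correct_flags) / len(correct_flags)

# 16 sampled responses for one query, graded correct (1) or incorrect (0)
samples = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
print(pass_at_1(samples))  # 0.8125 (13/16)
```

The benchmark-level score is then the mean of these per-query estimates.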

Training cut-off date

The provider has not supplied this information.

Training time

The provider has not supplied this information.

Input formats

The provider has not supplied this information.

Output formats

The provider has not supplied this information.

Supported languages

The provider has not supplied this information.

Sample JSON response

The provider has not supplied this information.

Model architecture

The provider has not supplied this information.

Long context

For all our models, the maximum generation length is set to 64K tokens.

Optimizing model performance

We recommend the following configurations when using DeepSeek-R1-0528 series models, including for benchmarking, to achieve the expected performance:
  • Avoid adding a system prompt; all instructions should be contained within the user prompt.
  • For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
  • When evaluating model performance, it is recommended to conduct multiple tests and average the results.
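The recommendations above can be sketched as a request payload. The field names (messages, temperature, top_p, n) follow the common OpenAI-style chat completions shape and are illustrative assumptions, not the exact Azure AI Foundry API:

```python
def build_request(problem: str, runs: int = 4) -> dict:
    """Build an illustrative chat request following the guidance above:
    no system prompt, all instructions in the user turn, and the
    recommended math directive appended."""
    directive = ("Please reason step by step, and put your final answer "
                 "within \\boxed{}.")
    return {
        "messages": [  # no system message, per the recommendation
            {"role": "user", "content": f"{problem}\n\n{directive}"}
        ],
        "temperature": 0.6,  # sampling settings used for benchmarking
        "top_p": 0.95,
        "n": runs,           # multiple samples, to be averaged when evaluating
    }

req = build_request("Compute 2^10.")
```

Averaging results over the `n` sampled responses corresponds to the "conduct multiple tests and average the results" recommendation.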

Additional assets

Learn more: original model announcement.
Extract from the original model evaluation

Training disclosure

Training, testing and validation

The provider has not supplied this information.

Distribution

Distribution channels

The provider has not supplied this information.

More information

The provider has not supplied this information.

Responsible AI considerations

Safety techniques

Microsoft and external researchers have found DeepSeek R1 to be less aligned than other models, meaning it appears to have undergone less refinement designed to make its behavior and outputs safe and appropriate for users. This results in (i) higher risk that the model will produce potentially harmful content and (ii) lower scores on safety and jailbreak benchmarks. We recommend that customers use Azure AI Content Safety in conjunction with this model and conduct their own evaluations on production systems. The model's reasoning output (contained within the model's reasoning tags) may contain more harmful content than its final response. Consider how your application will use or display the reasoning output; you may want to suppress it in a production setting. When deployed via Azure AI Foundry, prompts and completions are passed through a default configuration of Azure AI Content Safety classification models to detect and prevent the output of harmful content. Learn more about Azure AI Content Safety. Configuration options for content filtering vary when you deploy a model for production in Azure AI; learn more.

Safety evaluations

The provider has not supplied this information.

Known limitations

See Responsible AI considerations for the alignment-related limitations identified by Microsoft and external researchers.

Acceptable use

Acceptable use policy

The provider has not supplied this information.

Quality and performance evaluations

Source: DeepSeek. For all models, the maximum generation length is set to 64K tokens. For benchmarks requiring sampling, a temperature of 0.6, a top-p value of 0.95, and 16 responses per query are used to estimate pass@1. The full benchmark table appears under Key model capabilities above.

Benchmarking methodology

Source: DeepSeek. See the recommended configurations under Optimizing model performance above.

Public data summary

See Safety techniques under Responsible AI considerations above.
Model specifications
  • Context length: 128,000 tokens
  • Quality index: 0.87
  • License: MIT
  • Last updated: December 2025
  • Input type: Text
  • Output type: Text
  • Provider: DeepSeek
  • Languages: 2 languages