MAI-DS-R1
Version: 1
Microsoft
Last updated: November 2025
MAI-DS-R1 is a DeepSeek-R1 reasoning model that has been post-trained by the Microsoft AI team to fill in information gaps in the previous version of the model and improve its harm protections while maintaining R1 reasoning capabilities.
Reasoning
Coding
Agents

Direct from Azure models

Direct from Azure models are a select portfolio curated for their market-differentiated capabilities:
  • Secure and managed by Microsoft: Purchase and manage models directly through Azure with a single license, consistent support, and no third-party dependencies, backed by Azure's enterprise-grade infrastructure.
  • Streamlined operations: Benefit from unified billing, governance, and seamless PTU portability across models hosted on Azure - all as part of one Azure AI Foundry platform.
  • Future-ready flexibility: Access the latest models as they become available, and easily test, deploy, or switch between them within Azure AI Foundry; reducing integration effort.
  • Cost control and optimization: Scale on demand with pay-as-you-go flexibility or reserve PTUs for predictable performance and savings.
Learn more about Direct from Azure models.

Key capabilities

About this model

MAI-DS-R1 preserves the general reasoning capabilities of DeepSeek-R1 and can be used for broad language understanding and generation tasks, especially in complex reasoning and problem-solving.

Key model capabilities

  • General text generation and understanding – Producing coherent, contextually relevant text for a wide range of prompts. This includes engaging in dialogue, writing essays, or continuing a story based on a given prompt.
  • General knowledge tasks – Answering open-domain questions requiring factual knowledge.
  • Reasoning and problem solving – Handling multi-step reasoning tasks, such as math word problems or logic puzzles, by employing chain-of-thought strategies.
  • Code generation and comprehension – Assisting with programming tasks by generating code snippets or explaining code.
  • Scientific and academic applications – Assisting with structured problem-solving in STEM and research domains.

Use cases

See Responsible AI for additional considerations for responsible use.

Key use cases

Primary direct use includes:
  • General text generation and understanding – Producing coherent, contextually relevant text for a wide range of prompts. This includes engaging in dialogue, writing essays, or continuing a story based on a given prompt.
  • General knowledge tasks – Answering open-domain questions requiring factual knowledge.
  • Reasoning and problem solving – Handling multi-step reasoning tasks, such as math word problems or logic puzzles, by employing chain-of-thought strategies.
  • Code generation and comprehension – Assisting with programming tasks by generating code snippets or explaining code.
  • Scientific and academic applications – Assisting with structured problem-solving in STEM and research domains.

Out of scope use cases

Certain application domains are out-of-scope either due to ethical/safety concerns or because the model lacks the necessary reliability in those areas. The following usage is out of scope:
  • Medical or health advice – The model is not a doctor and cannot be relied on to provide accurate medical diagnoses or safe treatment recommendations.
  • Legal advice – The model is not a lawyer and should not be entrusted with giving definitive legal counsel, interpreting laws, or making legal decisions on its own.
  • Safety-critical systems – The model is not suited for autonomous systems where failures could cause injury, loss of life, or significant property damage. This includes use in self-driving vehicles, aircraft control, medical life-support systems, or industrial control without human oversight.
  • High-stakes decision support – The model should not be relied on for decisions affecting finances, security, or personal well-being, such as financial planning or investment advice.
  • Malicious or Unethical Use – The model must not be used to produce harmful, illegal, deceptive, or unethical content, including hate speech, violence, harassment, or violations of privacy or IP rights.

Pricing

Pricing is based on a number of factors, including deployment type and tokens used. See pricing details here.

Technical specs

Training cut-off date

MAI-DS-R1 shares DeepSeek-R1's knowledge cutoff and may lack awareness of recent events or domain-specific facts.

Training time

The provider has not supplied this information.

Input formats

The provider has not supplied this information.

Output formats

The provider has not supplied this information.

Supported languages

Although MAI-DS-R1 was post-trained on multilingual data, it may inherit limitations from the original DeepSeek-R1 model, with performance likely strongest in English and Chinese.

Sample JSON response

The provider has not supplied this information.

Model architecture

  • Architecture: Based on DeepSeek-R1, a transformer-based autoregressive language model utilizing multi-head self-attention and Mixture-of-Experts (MoE) for scalable and efficient inference.
  • Objective: Post-trained to reduce CCP-aligned restrictions and enhance harm protection, while preserving the original model's strong chain-of-thought reasoning and general-purpose language understanding capabilities.
  • Pre-trained Model Base: DeepSeek-R1 (671B)
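The Mixture-of-Experts idea referenced above can be illustrated with a toy sketch of top-k gating. Everything here is illustrative (expert count, gating function, and k are not MAI-DS-R1's actual configuration); it shows only the core mechanism: a router scores experts, the top-k are selected, and their outputs are mixed by renormalized gate weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    return [(i, probs[i] / total) for i in ranked]

def moe_layer(x, experts, router_logits, k=2):
    """Route input x to k experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in top_k_gate(router_logits, k))

# Toy example: 4 "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out = moe_layer(10.0, experts, router_logits=[0.1, 2.0, 0.3, 1.5], k=2)
```

Because only k experts run per token, compute per token stays roughly constant as the total expert (parameter) count grows, which is the efficiency argument for MoE at this scale.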

Long context

The provider has not supplied this information.

Optimizing model performance

The provider has not supplied this information.

Additional assets

The provider has not supplied this information.

Training disclosure

Training, testing and validation

The model was trained using 110k safety-related examples from the Tulu 3 SFT dataset, in addition to an internally developed dataset of approximately 350k multilingual examples covering various topics with reported biases. Both sets of queries were processed with DeepSeek-R1 to generate chain-of-thought (CoT) reasoning and final answers. The model was evaluated on a variety of benchmarks covering different tasks and addressing both performance and safety concerns. Key benchmarks include: public benchmarks covering natural language inference, question answering, mathematical reasoning, commonsense reasoning, code generation, and code completion; a censorship test set consisting of 3.3k prompts on various topics censored by R1, covering 11 languages; and a safety test set from the HarmBench dataset, including 320 queries categorized into functional and semantic categories.

Distribution

Distribution channels

The provider has not supplied this information.

More information

The provider has not supplied this information.

Responsible AI considerations

Safety techniques

MAI-DS-R1 is a DeepSeek-R1 reasoning model that has been post-trained by the Microsoft AI team to fill in information gaps in the previous version of the model and to improve its harm protections while maintaining R1 reasoning capabilities. The model was trained using 110k safety-related examples from the Tulu 3 SFT dataset, in addition to an internally developed dataset of approximately 350k multilingual examples covering various topics with reported biases. Both sets of queries were processed with DeepSeek-R1 to generate chain-of-thought (CoT) reasoning and final answers.
MAI-DS-R1 has successfully unblocked the majority of previously blocked queries from the original R1 model while outperforming the recently published R1-1776 model (post-trained by Perplexity) on relevant safety benchmarks. These results were achieved while preserving the general reasoning capabilities of the original DeepSeek-R1. Please note: Microsoft has post-trained this model to address certain limitations relevant to its outputs, but previous limitations and considerations for the model remain, including security considerations. The model was post-trained to reduce CCP-aligned restrictions and enhance harm protection, while preserving the original model's strong chain-of-thought reasoning and general-purpose language understanding capabilities.
When deployed via Azure AI Foundry, prompts and completions are passed through a default configuration of Azure AI Content Safety classification models to detect and prevent the output of harmful content. Learn more about Azure AI Content Safety. Configuration options for content filtering vary when you deploy a model for production in Azure AI; learn more.
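Downstream applications often need to separate the model's chain-of-thought from its final answer, for example to apply safety checks to each part. DeepSeek-R1-family models typically wrap reasoning in `<think>...</think>` tags; the helper below is a minimal sketch under that assumption (the function name is illustrative):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(completion):
    """Split an R1-style completion into (reasoning, final_answer).

    Assumes the chain-of-thought is wrapped in <think>...</think> tags,
    as DeepSeek-R1-family models typically emit. Returns empty reasoning
    if no tags are present.
    """
    m = THINK_RE.search(completion)
    if not m:
        return "", completion.strip()
    reasoning = m.group(1).strip()
    answer = completion[m.end():].strip()
    return reasoning, answer

raw = "<think>2 + 2 is 4 because ...</think>The answer is 4."
reasoning, answer = split_reasoning(raw)
```

Note that the safety evaluations below report separate scores for the "Thinking" and "Answer" parts of a completion, which is exactly this split.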

Safety evaluations

Safety Test Set: This set is a split of the HarmBench dataset and includes 320 queries, categorized into three functional categories: standard, contextual, and copyright. The queries cover eight semantic categories, such as misinformation/disinformation, chemical/biological threats, illegal activities, harmful content, copyright violations, cybercrime, and harassment. It evaluates the model's leakage rate of harmful or unsafe content. Safety evaluation metrics:
a. Attack Success Rate: the percentage of test cases that elicit the behavior from the model. This is evaluated per functional or semantic category.
b. Micro Attack Success Rate: the total average of attack success rate over all categories.
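As a rough sketch, the two metrics can be computed as follows. Data and function names are illustrative, and this assumes "micro" means pooling all test cases across categories before averaging:

```python
def attack_success_rate(results):
    """Fraction of test cases judged to elicit the harmful behavior.

    `results` is a list of booleans: True if the attack succeeded.
    """
    return sum(results) / len(results) if results else 0.0

def micro_attack_success_rate(per_category_results):
    """Attack success rate over all cases pooled across categories.

    `per_category_results` maps category name -> list of booleans.
    """
    pooled = [r for results in per_category_results.values() for r in results]
    return attack_success_rate(pooled)

# Tiny illustrative run: 4 attacks per category, True = attack succeeded.
cases = {
    "standard":   [True, False, False, False],
    "contextual": [True, True, False, False],
}
```

Lower is better for both metrics: a smaller rate means fewer harmful behaviors were elicited.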

Evaluation on Safety

Categories | DS-R1 (Answer) | R1-1776 (Answer) | MAI-DS-R1 (Answer) | DS-R1 (Thinking) | R1-1776 (Thinking) | MAI-DS-R1 (Thinking)
Micro Attack Success Rate | 0.441 | 0.481 | 0.209 | 0.394 | 0.325 | 0.134
Functional Standard | 0.258 | 0.289 | 0.126 | 0.302 | 0.214 | 0.082
Functional Contextual | 0.494 | 0.556 | 0.321 | 0.506 | 0.395 | 0.309
Functional Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062
Semantic Misinfo/Disinfo | 0.500 | 0.648 | 0.315 | 0.519 | 0.500 | 0.259
Semantic Chemical/Bio | 0.357 | 0.429 | 0.143 | 0.500 | 0.286 | 0.167
Semantic Illegal | 0.189 | 0.170 | 0.019 | 0.321 | 0.245 | 0.019
Semantic Harmful | 0.111 | 0.111 | 0.111 | 0.111 | 0.111 | 0.000
Semantic Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062
Semantic Cybercrime | 0.519 | 0.500 | 0.385 | 0.385 | 0.212 | 0.308
Semantic Harassment | 0.000 | 0.048 | 0.000 | 0.048 | 0.048 | 0.000
Num Parse Errors420026670
Safety: MAI-DS-R1 outperforms both R1-1776 and the original R1 model in minimizing harmful content.

Known limitations

  • Biases: The model may retain biases present in the training data and in the original DeepSeek‑R1, particularly around cultural and demographic aspects.
  • Risks: The model may still hallucinate facts, be vulnerable to adversarial prompts, or generate unsafe, biased, or harmful content under certain conditions. Developers should implement content moderation and usage monitoring to mitigate misuse.
  • Limitations: MAI-DS-R1 shares DeepSeek-R1's knowledge cutoff and may lack awareness of recent events or domain-specific facts.
The following factors can influence MAI-DS-R1's behavior and performance:
  1. Input topic and sensitivity: The model is explicitly tuned to freely discuss topics that were previously censored. On such topics it will now provide information where the base model might have demurred. However, for truly harmful or explicitly disallowed content (e.g., instructions for violence), the model remains restrictive due to the safety fine-tuning.
  2. Language: Although MAI-DS-R1 was post-trained on multilingual data, it may inherit limitations from the original DeepSeek-R1 model, with performance likely strongest in English and Chinese.
  3. Prompt complexity and reasoning required: The model performs well on complex queries requiring reasoning, while very long or complex prompts could still pose a challenge.
  4. User instructions and role prompts: As a chat-oriented LLM, MAI-DS-R1's responses can be shaped by system or developer-provided instructions (e.g., a system prompt defining its role and style) and by the user's phrasing. Developers should provide clear instructions to guide the model's behavior.
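As a minimal illustration of the role-prompt point above, a chat request is typically assembled as an ordered list of system/user messages. The exact request schema depends on your deployment; this only shows the common role/content structure used to steer a chat model's behavior:

```python
def build_messages(system_prompt, user_prompt, history=None):
    """Assemble an OpenAI-style chat message list.

    The system message sets the model's role and style; prior turns
    (if any) go in `history`; the new user message comes last.
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history or [])
    messages.append({"role": "user", "content": user_prompt})
    return messages

msgs = build_messages(
    system_prompt="You are a careful math tutor. Show your reasoning step by step.",
    user_prompt="What is 17 * 23?",
)
```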
To ensure responsible use, we recommend the following:
  • Transparency on Limitations: It is recommended that users are made explicitly aware of the model's potential biases and limitations.
  • Human Oversight and Verification: Both direct and downstream users should implement human review or automated validation of outputs when deploying the model in sensitive or high-stakes scenarios.
  • Usage Safeguards: Developers should integrate content filtering, prompt engineering best practices, and continuous monitoring to mitigate risks and ensure the model's outputs meet the intended safety and quality standards.
  • Legal and Regulatory Compliance: The model may output politically sensitive content (e.g., Chinese governance, historical events) that could conflict with local laws or platform policies. Operators must ensure compliance with regional regulations.
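To illustrate where input- and output-side safeguards sit in a serving pipeline, here is a deliberately naive sketch. The blocklist and function names are invented for illustration; a production system should use a dedicated moderation service (such as Azure AI Content Safety) rather than keyword matching:

```python
# Illustrative blocklist only; real moderation uses trained classifiers.
BLOCKLIST = {"how to build a bomb", "credit card dump"}

def passes_basic_filter(text):
    """Naive moderation check: reject text containing blocklisted phrases."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

def moderated_generate(prompt, generate):
    """Wrap a generation callable with input- and output-side checks."""
    if not passes_basic_filter(prompt):
        return "[request refused by input filter]"
    output = generate(prompt)
    if not passes_basic_filter(output):
        return "[response withheld by output filter]"
    return output

reply = moderated_generate(
    "Tell me a joke",
    lambda p: "Why did the model cross the road?",
)
```

The point is the placement, not the filter itself: checks run both before the prompt reaches the model and after the completion is produced, alongside logging and monitoring.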

Acceptable use

Acceptable use policy

MAI-DS-R1 preserves the general reasoning capabilities of DeepSeek-R1 and can be used for broad language understanding and generation tasks, especially in complex reasoning and problem-solving. Primary direct use includes:
  • General text generation and understanding – Producing coherent, contextually relevant text for a wide range of prompts. This includes engaging in dialogue, writing essays, or continuing a story based on a given prompt.
  • General knowledge tasks – Answering open-domain questions requiring factual knowledge.
  • Reasoning and problem solving – Handling multi-step reasoning tasks, such as math word problems or logic puzzles, by employing chain-of-thought strategies.
  • Code generation and comprehension – Assisting with programming tasks by generating code snippets or explaining code.
  • Scientific and academic applications – Assisting with structured problem-solving in STEM and research domains.
Certain application domains are out-of-scope either due to ethical/safety concerns or because the model lacks the necessary reliability in those areas. The following usage is out of scope:
  • Medical or health advice – The model is not a doctor and cannot be relied on to provide accurate medical diagnoses or safe treatment recommendations.
  • Legal advice – The model is not a lawyer and should not be entrusted with giving definitive legal counsel, interpreting laws, or making legal decisions on its own.
  • Safety-critical systems – The model is not suited for autonomous systems where failures could cause injury, loss of life, or significant property damage. This includes use in self-driving vehicles, aircraft control, medical life-support systems, or industrial control without human oversight.
  • High-stakes decision support – The model should not be relied on for decisions affecting finances, security, or personal well-being, such as financial planning or investment advice.
  • Malicious or Unethical Use – The model must not be used to produce harmful, illegal, deceptive, or unethical content, including hate speech, violence, harassment, or violations of privacy or IP rights.

Quality and performance evaluations

Source: Microsoft
The model was evaluated on a variety of benchmarks, covering different tasks and addressing both performance and safety concerns. Key benchmarks include:
  1. Public Benchmarks: These cover a wide range of tasks, such as natural language inference, question answering, mathematical reasoning, commonsense reasoning, code generation, and code completion. It evaluates the model's general knowledge and reasoning capabilities.
  2. Censorship Test Set: This set consists of 3.3k prompts on various censored topics from R1, covering 11 languages. It evaluates the model's ability to uncensor previously censored content across different languages.
  3. Safety Test Set: This set is a split from the HarmBench dataset and includes 320 queries, categorized into three functional categories: standard, contextual, and copyright. The queries cover eight semantic categories, such as misinformation/disinformation, chemical/biological threats, illegal activities, harmful content, copyright violations, cybercrime, and harassment. It evaluates the model's leakage rate of harmful or unsafe content.
We tracked several metrics to quantify MAI-DS-R1's performance:
  1. Public Benchmarks:
    a. Accuracy: the percentage of problems for which the model's output matches the correct answer.
    b. Pass@1: the percentage of problems for which the model generates a correct solution which passes all test cases in the first attempt.
  2. Censorship evaluation:
    a. Answer Satisfaction: an internal metric measuring relevance to the question on a [0, 4] scale. The intent is to measure whether the uncensored answers actually answer the question, rather than generating content that is unrelated but uncensored.
    b. % Uncensored: the proportion of censored samples successfully uncensored.
  3. Safety evaluation:
    a. Attack Success Rate: the percentage of test cases that elicit the behavior from the model. This is evaluated per functional or semantic category.
    b. Micro Attack Success Rate: the total average of attack success rate over all categories.
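The metric definitions above can be sketched directly. Function names and sample data are illustrative:

```python
def accuracy(predictions, answers):
    """Fraction of problems where the model's output matches the answer."""
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def pass_at_1(results):
    """Fraction of problems whose first generated solution passes all tests.

    `results` is a list of booleans: True if the first attempt passed.
    """
    return sum(results) / len(results)

def percent_uncensored(verdicts):
    """Share of previously censored prompts that now receive a real answer.

    `verdicts` is a list of booleans: True if the response was uncensored.
    """
    return 100.0 * sum(verdicts) / len(verdicts)

# Toy data: three answers, one wrong; four coding problems; four prompts.
acc = accuracy(["4", "9", "7"], ["4", "9", "6"])
```

Answer Satisfaction is a judged [0, 4] score rather than a simple count, so it has no closed-form computation here; the counting metrics above cover the rest.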

Evaluation on general knowledge and reasoning

Categories | Benchmarks | Metrics | DS-R1 | R1-1776 | MAI-DS-R1
General Knowledge | anli_r3 | 07-shot Acc | 0.686 | 0.673 | 0.697
 | arc_challenge | 10-shot Acc | 0.963 | 0.963 | 0.963
 | hellaswag | 5-shot Acc | 0.864 | 0.860 | 0.859
 | mmlu (all) | 5-shot Acc | 0.867 | 0.863 | 0.870
 | mmlu/humanities | 5-shot Acc | 0.794 | 0.784 | 0.801
 | mmlu/other | 5-shot Acc | 0.883 | 0.879 | 0.886
 | mmlu/social_sciences | 5-shot Acc | 0.916 | 0.916 | 0.914
 | mmlu/STEM | 5-shot Acc | 0.867 | 0.864 | 0.870
 | openbookqa | 10-shot Acc | 0.936 | 0.938 | 0.954
 | Piqa | 5-shot Acc | 0.933 | 0.926 | 0.939
 | Winogrande | 5-shot Acc | 0.843 | 0.834 | 0.850
Math | gsm8k_chain_of_thought | 0-shot Accuracy | 0.953 | 0.954 | 0.949
 | Math | 4-shot Accuracy | 0.833 | 0.853 | 0.843
 | mgsm_chain_of_thought_en | 0-shot Accuracy | 0.972 | 0.968 | 0.976
 | mgsm_chain_of_thought_zh | 0-shot Accuracy | 0.880 | 0.796 | 0.900
 | AIME 2024 | Pass@1, n=2 | 0.7333 | 0.7333 | 0.7333
Code | humaneval | 0-shot Accuracy | 0.866 | 0.841 | 0.860
 | livecodebench (max-tokens=8k) | 0-shot Pass@1 | 0.531 | 0.484 | 0.632
 | LCB_coding_completion | 0-shot Pass@1 | 0.260 | 0.200 | 0.540
 | LCB_generation | 0-shot Pass@1 | 0.700 | 0.670 | 0.692
 | mbpp | 3-shot Pass@1 | 0.897 | 0.874 | 0.911

Evaluation on blocked topics

Benchmark | Metric | DS-R1 | R1-1776 | MAI-DS-R1
Blocked topics test set | Answer Satisfaction | 1.68 | 2.76 | 3.62
Blocked topics test set | % Uncensored | 30.7 | 99.1 | 99.3

Evaluation on Safety

Categories | DS-R1 (Answer) | R1-1776 (Answer) | MAI-DS-R1 (Answer) | DS-R1 (Thinking) | R1-1776 (Thinking) | MAI-DS-R1 (Thinking)
Micro Attack Success Rate | 0.441 | 0.481 | 0.209 | 0.394 | 0.325 | 0.134
Functional Standard | 0.258 | 0.289 | 0.126 | 0.302 | 0.214 | 0.082
Functional Contextual | 0.494 | 0.556 | 0.321 | 0.506 | 0.395 | 0.309
Functional Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062
Semantic Misinfo/Disinfo | 0.500 | 0.648 | 0.315 | 0.519 | 0.500 | 0.259
Semantic Chemical/Bio | 0.357 | 0.429 | 0.143 | 0.500 | 0.286 | 0.167
Semantic Illegal | 0.189 | 0.170 | 0.019 | 0.321 | 0.245 | 0.019
Semantic Harmful | 0.111 | 0.111 | 0.111 | 0.111 | 0.111 | 0.000
Semantic Copyright | 0.750 | 0.787 | 0.263 | 0.463 | 0.475 | 0.062
Semantic Cybercrime | 0.519 | 0.500 | 0.385 | 0.385 | 0.212 | 0.308
Semantic Harassment | 0.000 | 0.048 | 0.000 | 0.048 | 0.048 | 0.000
Num Parse Errors420026670
  • General Knowledge & Reasoning: MAI-DS-R1 performs on par with DeepSeek-R1 and slightly better than R1-1776, especially excelling in mgsm_chain_of_thought_zh, where R1-1776 had a significant regression.
  • Blocked Topics: MAI-DS-R1 uncensored 99.3% of previously blocked samples, on par with R1-1776 (99.1%), and achieved a higher Answer Satisfaction score, likely due to more relevant responses.
  • Safety: MAI-DS-R1 outperforms both R1-1776 and the original R1 model in minimizing harmful content.

Benchmarking methodology

Source: Microsoft
The following factors can influence MAI-DS-R1's behavior and performance:
  1. Input topic and sensitivity: The model is explicitly tuned to freely discuss topics that were previously censored. On such topics it will now provide information where the base model might have demurred. However, for truly harmful or explicitly disallowed content (e.g., instructions for violence), the model remains restrictive due to the safety fine-tuning.
  2. Language: Although MAI-DS-R1 was post-trained on multilingual data, it may inherit limitations from the original DeepSeek-R1 model, with performance likely strongest in English and Chinese.
  3. Prompt complexity and reasoning required: The model performs well on complex queries requiring reasoning, while very long or complex prompts could still pose a challenge.
  4. User instructions and role prompts: As a chat-oriented LLM, MAI-DS-R1's responses can be shaped by system or developer-provided instructions (e.g., a system prompt defining its role and style) and by the user's phrasing. Developers should provide clear instructions to guide the model's behavior.

Public data summary

Source: Microsoft
The provider has not supplied this information.

Model Specifications

  • Context Length: 128,000
  • Quality Index: 0.87
  • License: MIT
  • Last Updated: November 2025
  • Input Type: Text
  • Output Type: Text
  • Provider: Microsoft
  • Languages: 2 languages