Microsoft
Proprietary AI models developed by Microsoft, tailored for various enterprise applications and integrated within Azure services.

Overview

Microsoft’s Phi family proves that small language models can deliver big‑league reasoning: Phi‑3 mini (3.8 B) runs on a single GPU or even a smartphone, while Phi‑4‑mini‑Flash introduces a hybrid “SambaY” architecture for 10× faster responses with 64 K context. Multimodal Phi‑3 Vision adds image understanding for edge and robotics.

Key Microsoft Models (July 2025)

  • Phi‑3‑mini‑128K‑Instruct – 3.8 B params, 128 K context; ideal for copilots and on‑device AI.
  • Phi‑3‑small‑8K / 128K – 7 B params with higher throughput for chat and RAG.
  • Phi‑3 Vision – Compact multimodal model for text + image tasks.
  • Phi‑4‑mini‑Flash‑Reasoning – Latency‑optimized 3.8 B model announced July 2025.
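
The single-GPU and smartphone claims for a 3.8 B-parameter model follow from simple arithmetic on weight storage. The sketch below counts weights only; activations, KV cache, and runtime overhead are ignored, and the 16/8/4-bit widths are assumed common precisions, not figures stated in this catalog.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Parameter count taken from the model list above.
PHI3_MINI_PARAMS = 3.8e9

# Assumed precisions: fp16, int8, and 4-bit quantized weights.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(PHI3_MINI_PARAMS, bits):.1f} GB")
```

Under these assumptions, a 4-bit copy of the weights needs roughly a quarter of the fp16 footprint, which is what makes on-device deployment plausible.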

Why Microsoft Models on Azure

Because Microsoft builds and hosts them on Azure, Phi models come with first-party managed compute, granular quota controls, and fine-tuning with zero data egress, which suits latency-critical and cost-sensitive workloads.
Total Models: 87
MAI-Image-2e

Built for creatives, delivering enhanced photorealism at scale.

text-to-image
MAI-Image-2

Built for creatives, delivering enhanced photorealism at scale.

text-to-image
MAI-Voice-1

MAI-Voice-1 is a text-to-speech (TTS) model that generates high-quality single-speaker speech and, soon, multi-speaker speech for public preview. It produces audio that strictly follows the input transcript and supports per-turn emotion control as well as

text-to-speech
audio-generation
MAI-Transcribe-1

MAI-Transcribe-1 is an automatic speech recognition (ASR) model built to deliver high-quality batch transcription whenever the user speaks. It is designed to achieve high accuracy across 25 languages and to adapt seamlessly to diverse accents, dialects, and regional speech patterns.

automatic-speech-recognition
speech-to-text
model-router

Model router is a deployable AI model that is trained to select the most suitable large language model (LLM) for a given prompt.

chat-completion
MAI-DS-R1

MAI-DS-R1 is a DeepSeek-R1 reasoning model that has been post-trained by the Microsoft AI team to fill in information gaps in the previous version of the model and improve its harm protections while maintaining R1 reasoning capabilities.

chat-completion
EvoDiff

EvoDiff can unconditionally sample diverse structurally-plausible proteins, generate intrinsically disordered regions, and scaffold structural motifs using only sequence information, challenging a paradigm in structure-based protein design. Key model capa

protein-sequence-generation
Phi-4-reasoning

State-of-the-art open-weight reasoning model.

chat-completion
Phi-4-mini-reasoning

Lightweight math reasoning model optimized for multi-step problem solving

chat-completion
Phi-4-mini-instruct

A 3.8B-parameter small language model that outperforms larger models in reasoning, math, coding, and function calling

chat-completion
Phi-4-multimodal-instruct

First small multimodal model to have 3 modality inputs (text, audio, image), excelling in quality and efficiency

chat-completion
Phi-4

Phi-4 14B, a highly capable model for low-latency scenarios.

chat-completion
financial-reports-analysis-v2

Adapted AI model for financial reports analysis based on Phi-4

chat-completion
supply-chain-trade-regulations-v2

Adapted AI model for supply chain trade regulations based on Phi-4

chat-completion
Muse

Muse is a World and Human Action Model (WHAM), a generative model of gameplay (visuals and/or controller actions).

image-to-image
Phi-3-vision-128k-instruct

Phi-3 Vision is a lightweight, state-of-the-art open multimodal model built upon datasets which include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data both on text and vision. The model belongs to the Phi-3 model

chat-completion
GigaTIME

Open-weight H&E to mIF translator model

image-to-image
Azure-Speech-Text-to-speech

Text-to-speech enables your applications, tools, or devices to convert text into natural synthesized speech. It leverages advanced out-of-the-box [prebuilt neural voices](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?t

text-to-speech
audio-generation
Prov-GigaPath

Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles[^1],[^2],[^3]. Previous models often rely predominantly on tile-level predictions, which can overlook critical slide-level context and spatial dependen

image-feature-extraction
qwen3.5-9b-generic-cpu

This model is an optimized version of Qwen3.5-9B to enable local inference on CPUs. This model uses RTN quantization. Developed by: Microsoft. Model type: ONNX. License: apache-2.0. Model description: This is a conversion of the Qwen3.5-9B for local inferenc

chat-completion
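
RTN (round-to-nearest) quantization, named in the description above, maps each weight to the nearest point on a uniform integer grid without using calibration data. The following is a minimal sketch of symmetric per-group RTN. The 4-bit width, the group size of 4, and the function names are illustrative assumptions; the catalog does not state the settings used for these models.

```python
def rtn_quantize(weights, bits=4, group_size=4):
    """Symmetric round-to-nearest quantization with one scale per group.

    Returns integer codes and the per-group scales needed to decode them.
    """
    qmax = 2 ** (bits - 1) - 1  # largest code magnitude, e.g. 7 for 4 bits
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # An all-zero group gets a dummy scale to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest grid point, then clamp to the code range.
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def rtn_dequantize(codes, scales, group_size=4):
    """Map integer codes back to approximate floating-point weights."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


weights = [0.875, -0.5, 0.125, 0.0]
codes, scales = rtn_quantize(weights)
restored = rtn_dequantize(codes, scales)
```

Each restored weight differs from the original by at most half a grid step (`scale / 2`), which is the trade RTN makes for its simplicity.
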
microsoft-Orca-2-7b

Orca 2 is a fine-tuned version of Llama 2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the [Orca 2 pap

text-generation
MedImageInsight-onnx

Most medical imaging AI today is narrowly built to detect a small set of individual findings on a single modality like chest X-rays. This training approach is data and computationally inefficient, requiring ~6-12 months per finding[1], and often fails to generalize in real-world environments. By fu

embeddings
Aurora

Aurora is a machine learning model that can predict general environmental variables.

environmental-forecasting
qwen3.5-2b-generic-cpu

This model is an optimized version of Qwen3.5-2B to enable local inference on CPUs. This model uses RTN quantization. Developed by: Microsoft. Model type: ONNX. License: apache-2.0. Model description: This is a conversion of the Qwen3.5-2B for local inferenc

chat-completion
MedImageParse3D

Biomedical image analysis is fundamental for biomedical discovery in cell biology, pathology, radiology, and many other biomedical domains. 3D medical images such as CT and MRI play unique roles in clinical practices. MedImageParse 3D is a foundation model for imaging parsing that can jointly co

image-segmentation
Phi-3-medium-4k-instruct

A 14B-parameter model that delivers better quality than Phi-3-mini, with a focus on high-quality, reasoning-dense data.

chat-completion
Phi-4-reasoning-plus-onnx

State-of-the-art open-weight reasoning model.

chat-completion
qwen3.5-4b-generic-cpu

This model is an optimized version of Qwen3.5-4B to enable local inference on CPUs. This model uses RTN quantization. Developed by: Microsoft. Model type: ONNX. License: apache-2.0. Model description: This is a conversion of the Qwen3.5-4B for local inferenc

chat-completion
Azure-Speech-Speech-Translation

Translates streaming or recorded audio into text or audio across 140+ languages and dialects. Accuracy can be further optimized with custom models for your specialized use cases.

translation
speech-translation
Azure-Language-Language-detection

Language detection quickly and accurately identifies the language of any text, supporting over 100 languages and dialects. For a select number of languages it also identifies the script, following the ISO 15924 standard.

detect-language
Phi-3.5-MoE-instruct

A new mixture of experts model

chat-completion
DeepSeek-R1-Distilled-NPU-Optimized

Learn more: [original model announcement]. DeepSeek-R1-Distilled-NPU-Optimized is a downloadable package of DeepSeek-R1-Distilled-Qwen-1.5B that is specifically optimized for the Neural Processing Unit (NPU). NPU-optimized models let develo

chat-completion
Phi-3-small-8k-instruct

A 7B-parameter model that delivers better quality than Phi-3-mini, with a focus on high-quality, reasoning-dense data.

chat-completion
Phi-3.5-vision-instruct

Refresh of Phi-3-vision model.

chat-completion
qwen3.5-4b-generic-gpu

This model is an optimized version of Qwen3.5-4B to enable local inference on GPUs. This model uses RTN quantization. Developed by: Microsoft. Model type: ONNX. License: apache-2.0. Model description: This is a conversion of the Qwen3.5-4B for local inferenc

chat-completion
Boltz-1

Boltz-1 is an open-source biomolecular structure prediction model for proteins, protein-protein assemblies, and protein-ligand complexes, providing high-quality 3D structural hypotheses to accelerate drug discovery, structural biology, and biotechnology wor

embeddings
Phi-4-mini-reasoning-onnx

Lightweight math reasoning model optimized for multi-step problem solving

chat-completion
microsoft-llava-med-v1.5-mistral-7b

LLaVA-Med v1.5, using mistralai/Mistral-7B-Instruct-v0.2 as LLM for a better commercial license. Large Language and Vision Assistant for bioMedicine (i.e., “LLaVA-Med”) is a large language and vision model trained using a curriculum lear

image-text-to-text
Azure-Speech-Speech-to-text

Transcribes streaming or recorded audio into readable text across 140+ languages and dialects. Accuracy can be further optimized with custom models for your specialized use cases.

automatic-speech-recognition
speech-to-text
Azure-Speech-Voice-Live

Voice Live API is a single unified API that enables low-latency, high-quality speech-to-speech interactions for voice agents.

conversational-ai
speech-to-text
text-to-speech
TamGen

TamGen is a 100-million-parameter model that can generate compounds based on the input protein information. TamGen is pretrained on 10 million compounds from PubChem and fine-tuned on CrossDocked and PDB datasets. We evaluate TamGen on existing benchmarks and achieve top performance. Furthermor

protein-design
CxrReportGen

The CXRReportGen model utilizes a multimodal architecture, integrating a BiomedCLIP image encoder with a Phi-3-Mini text encoder to help an application interpret complex medical imaging studies of chest X-rays. CXRReportGen follows the same framework as [MAIRA-2](https://www.microsoft

image-text-to-text
Azure-Content-Understanding-Read

Content Understanding Read provides fast, reliable extraction of text and basic content elements from documents, enabling simple ingestion workflows without layout interpretation. It’s ideal for scenarios where clean text output is needed for downstream automati

intelligent-content-processing
custom-extraction
text-analysis
document-analysis
microsoft-Orca-2-13b

Orca 2 is a fine-tuned version of Llama 2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the [Orca 2 pap

text-generation
MatterGen

A generative model for inorganic materials design

materials-design
BiomedCLIP-PubMedBERT_256-vit_base_patch16_224

BiomedCLIP is a biomedical vision-language foundation model that is pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning. It uses PubMedBERT as the text encoder and Vision Transformer as the imag

zero-shot-image-classification
Phi-3-medium-128k-instruct

Same Phi-3-medium model, but with a larger context size for RAG or few-shot prompting.

chat-completion
microsoft-swinv2-base-patch4-window12-192-22k

The Swin Transformer V2 model is a type of Vision Transformer, pretrained on ImageNet-21k with a resolution of 192x192. It was introduced in the [research paper](https://arxiv.org/abs/2111.09883) titled "Swin Transformer V2: Scaling Up Capacity and Resolution" authored by Liu

image-classification
RetroChimera

RetroChimera is a model that takes as input a product molecule that one wants to synthesize (encoded as a SMILES string), and produces several potential chemical reactions which could be used to produce that input molecule. Each reaction is represented as a group of ingredients (reactant molecules),

retrosynthesis-prediction
financial-reports-analysis

The adapted AI model for financial reports analysis (preview) is a state-of-the-art small language model (SLM) based on the Phi-3-small-128k architecture, designed specifically for analyzing financial reports. It has been fine-tuned on a few hundred million tokens derived fro

chat-completion
Phi-4-mini-flash-reasoning

State-of-the-art open-weight reasoning model.

chat-completion