Overview
Microsoft’s Phi family proves that small language models can deliver big‑league reasoning: Phi‑3‑mini (3.8B) runs on a single GPU or even a smartphone, while Phi‑4‑mini‑Flash introduces a hybrid “SambaY” architecture for up to 10× faster responses with a 64K context. The multimodal Phi‑3 Vision adds image understanding for edge and robotics scenarios.
Key Microsoft Models (July 2025)
- Phi‑3‑mini‑128K‑Instruct – 3.8B params, 128K context; ideal for copilots and on‑device AI.
- Phi‑3‑small‑8K / 128K – 7B params with higher throughput for chat and RAG.
- Phi‑3 Vision – compact multimodal model for text + image tasks.
- Phi‑4‑mini‑Flash‑Reasoning – latency‑optimized 3.8B model announced July 2025 (a minimal client sketch follows this list).
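As a concrete illustration of how these catalog entries are consumed once deployed, below is a minimal sketch that calls a serverless Phi‑3 endpoint through the `azure-ai-inference` Python client. The environment variable names are placeholders for your own deployment's endpoint and key, not values taken from this catalog.

```python
# Minimal sketch: chat completion against a deployed Phi-3 endpoint via
# the azure-ai-inference client. Endpoint and key are placeholders.
import os

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_INFERENCE_ENDPOINT"],  # your Phi-3 serverless endpoint
    credential=AzureKeyCredential(os.environ["AZURE_INFERENCE_KEY"]),
)

response = client.complete(
    messages=[
        SystemMessage(content="You are a concise assistant."),
        UserMessage(content="In two sentences, why do small language models matter?"),
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```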
Why Microsoft Models on Azure
Because they are born on Azure, Phi models offer first‑party managed compute, granular quota, and fine‑tuning with zero data egress, which makes them a strong fit for latency‑critical and cost‑sensitive workloads.
Model router is a deployable AI model that is trained to select the most suitable large language model (LLM) for a given prompt.
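To make the routing behavior concrete, here is a hedged sketch of calling a model router deployment with the OpenAI Python SDK: the router is exposed like any other chat-completions deployment, and the deployment name `model-router` and API version below are assumptions, not fixed values.

```python
# Minimal sketch: prompting a model-router deployment. The deployment
# name "model-router" and the API version are assumptions.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

response = client.chat.completions.create(
    model="model-router",  # the router deployment, not a specific LLM
    messages=[{"role": "user", "content": "Plan a 3-step test strategy for a REST API."}],
)

# The router selects an underlying model per prompt; the response's
# `model` field reports which one actually answered.
print(response.model)
print(response.choices[0].message.content)
```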
MAI-DS-R1 is a DeepSeek-R1 reasoning model that has been post-trained by the Microsoft AI team to fill in information gaps in the previous version of the model and improve its harm protections while maintaining R1 reasoning capabilities.
EvoDiff can unconditionally sample diverse, structurally plausible proteins, generate intrinsically disordered regions, and scaffold structural motifs using only sequence information, challenging a paradigm in structure-based protein design.
State-of-the-art open-weight reasoning model.
Lightweight math reasoning model optimized for multi-step problem solving.
A 3.8B-parameter small language model that outperforms larger models in reasoning, math, coding, and function calling.
The first small multimodal model with three input modalities (text, audio, and image), excelling in quality and efficiency.
Phi-4 14B, a highly capable model for low-latency scenarios.
Adapted AI model for financial reports analysis, based on Phi-4.
Adapted AI model for supply chain trade regulations, based on Phi-4.
Muse is a World and Human Action Model (WHAM), a generative model of gameplay (visuals and/or controller actions).
Phi-3 Vision is a lightweight, state-of-the-art open multimodal model built upon datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data in both text and vision. The model belongs to the Phi-3 model family.
Text-to-speech enables your applications, tools, or devices to convert text into natural synthesized speech. It leverages advanced out-of-the-box [prebuilt neural voices](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support).
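As a brief illustration, here is a minimal text-to-speech sketch with the Azure Speech SDK; the key, region, and voice name are placeholders, and `en-US-JennyNeural` is just one of the prebuilt neural voices.

```python
# Minimal sketch: synthesize speech to the default speaker with the
# Azure Speech SDK. Key and region are placeholders.
import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async("Hello from Azure text to speech.").get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis completed.")
```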
Biomedical image analysis is fundamental for biomedical discovery in cell biology, pathology, radiology, and many other biomedical domains. 3D medical images such as CT and MRI play unique roles in clinical practice. MedImageParse 3D is a foundation model for imaging parsing that can jointly conduct segmentation, detection, and recognition in 3D medical images.
Orca 2 is a fine-tuned version of LLAMA 2. Orca 2’s training data is a synthetic dataset that was created to enhance the small model’s reasoning abilities. All synthetic training data was moderated using the Microsoft Azure content filters. More details about the model can be found in the Orca 2 paper.
Most medical imaging AI today is narrowly built to detect a small set of individual findings on a single modality like chest X-rays. This training approach is data- and compute-inefficient, requiring ~6-12 months per finding[1], and often fails to generalize in real-world environments.
A 14B-parameter model that delivers better quality than Phi-3-mini, with a focus on high-quality, reasoning-dense data.
Aurora is a machine learning model that can predict general environmental variables.
Language detection quickly and accurately identifies the language of any text, supporting over 100 languages and dialects, including the ISO 15924 standard for a select number of languages.
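A minimal sketch of calling language detection through the `azure-ai-textanalytics` client; the endpoint and key environment variables are placeholders.

```python
# Minimal sketch: detect the language of a document with the Azure AI
# Language client. Endpoint and key are placeholders.
import os

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

client = TextAnalyticsClient(
    endpoint=os.environ["LANGUAGE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["LANGUAGE_KEY"]),
)

result = client.detect_language(documents=["Ce document est rédigé en français."])[0]

print(result.primary_language.name)              # e.g. "French"
print(result.primary_language.iso6391_name)      # e.g. "fr"
print(result.primary_language.confidence_score)  # 0.0-1.0
```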
DeepSeek-R1-Distilled-NPU-Optimized is a downloadable package of DeepSeek-R1-Distilled-Qwen-1.5B that is specifically optimized for the Neural Processing Unit (NPU). NPU-optimized models let developers run AI experiences locally on NPU-equipped devices.
A new mixture-of-experts model.
A refresh of the Phi-3-vision model.
A 7B-parameter model that delivers better quality than Phi-3-mini, with a focus on high-quality, reasoning-dense data.
LLaVA-Med v1.5, using mistralai/Mistral-7B-Instruct-v0.2 as the LLM for a better commercial license. Large Language and Vision Assistant for bioMedicine (i.e., “LLaVA-Med”) is a large language and vision model trained using a curriculum learning method.
This model is an optimized version of gpt-oss-20b that enables local inference on CPUs, using RTN quantization. Developed by Microsoft and packaged as an ONNX model, it is released under the Apache-2.0 license.
Biomedical image analysis is fundamental for biomedical discovery in cell biology, pathology, radiology, and many other biomedical domains. MedImageParse is a biomedical foundation model for imaging parsing that can jointly conduct segmentation, detection, and recognition across 9 imaging modalities.
TamGen is a 100-million-parameter model that can generate compounds based on input protein information. TamGen is pre-trained on 10 million compounds from PubChem and fine-tuned on the CrossDocked and PDB datasets. Evaluated on existing benchmarks, TamGen achieves top performance.
Boltz-1 is an open-source biomolecular structure prediction model for proteins, protein-protein assemblies, and protein-ligand complexes, providing high-quality 3D structural hypotheses to accelerate drug discovery, structural biology, and biotechnology workflows.
Transcribes streaming or recorded audio into readable text across 140+ languages and dialects. Accuracy can be further optimized with custom models for your specialized use cases.
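A minimal sketch of one-shot file transcription with the Azure Speech SDK; the key, region, and WAV file path are placeholders.

```python
# Minimal sketch: transcribe a short WAV file once with the Azure
# Speech SDK. Key, region, and filename are placeholders.
import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"],
    region=os.environ["SPEECH_REGION"],
)
audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")

recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config, audio_config=audio_config
)
result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
```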
Voice Live API is a single unified API that enables low-latency, high-quality speech-to-speech interactions for voice agents.
The adapted AI model for financial reports analysis (preview) is a state-of-the-art small language model (SLM) based on the Phi-3-small-128k architecture, designed specifically for analyzing financial reports. It has been fine-tuned on a few hundred million tokens derived from financial documents.
Digital pathology poses unique computational challenges, as a standard gigapixel slide may comprise tens of thousands of image tiles[^1],[^2],[^3]. Previous models often rely predominantly on tile-level predictions, which can overlook critical slide-level context and spatial dependencies.
BiomedCLIP is a biomedical vision-language foundation model that is pre-trained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning. It uses PubMedBERT as the text encoder and a Vision Transformer as the image encoder.
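Because BiomedCLIP ships as open weights, a short zero-shot matching sketch may help; it follows the `open_clip` usage pattern from the model's Hugging Face card, with an illustrative image path and candidate labels.

```python
# Minimal sketch: zero-shot image-text matching with BiomedCLIP via
# open_clip. The image path and labels are illustrative placeholders.
import torch
from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image

hub_id = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(hub_id)
tokenizer = get_tokenizer(hub_id)
model.eval()

image = preprocess(Image.open("chest_xray.png")).unsqueeze(0)
texts = tokenizer(["chest X-ray", "histopathology slide"], context_length=256)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(texts)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1)

print(probs)  # similarity of the image to each candidate caption
```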
Same Phi-3-medium model, but with a larger context size for RAG or few-shot prompting.
The Swin Transformer V2 model is a type of Vision Transformer, pre-trained on ImageNet-21k at a resolution of 192x192. It was introduced in the [research paper](https://arxiv.org/abs/2111.09883) “Swin Transformer V2: Scaling Up Capacity and Resolution” by Liu et al.
A generative model for inorganic materials design.
Content Understanding Read provides fast, reliable extraction of text and basic content elements from documents, enabling simple ingestion workflows without layout interpretation. It’s ideal for scenarios where clean text output is needed for downstream automation.
Azure AI Content Understanding empowers you to transform unstructured multimodal data, such as text, images, audio, and video, into structured, actionable insights by streamlining content processing with advanced AI techniques like schema extraction.
Phi-2 is a language model with 2.7 billion parameters. It was trained using the same data sources as Phi-1, augmented with a new data source that consists of various synthetic NLP texts and websites filtered for safety and educational value.
Same Phi-3-mini model, but with a larger context size for RAG or few-shot prompting.
RAD-DINO is a vision transformer model trained to encode chest X-rays using the self-supervised learning method DINOv2. RAD-DINO is described in detail in the paper “RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision.”
Fara is a multimodal web agent model that observes the browser and acts on behalf of the user by emitting tool calls (e.g., click(x,y), type, scroll, select) to complete web tasks end-to-end. Fara is trained on data generated by a scalable multi-agent pipeline that synthesizes diverse web tasks and executes them to produce training trajectories.
Text-to-speech avatar converts text into a digital video of a human (either a standard avatar or a custom text-to-speech avatar) speaking with a natural-sounding voice. The text-to-speech avatar video can be synthesized asynchronously or in real time.
The Azure AI Vision service gives you access to advanced algorithms that process images and videos and return insights based on the visual features and content you are interested in. Azure AI Vision can power a diverse set of scenarios, including digital asset management.
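To ground the description, here is a minimal sketch using the `azure-ai-vision-imageanalysis` Python client to caption an image and read its text; the endpoint, key, and image URL are placeholders.

```python
# Minimal sketch: caption + OCR on a public image with Azure AI Vision
# Image Analysis. Endpoint, key, and image URL are placeholders.
import os

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint=os.environ["VISION_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["VISION_KEY"]),
)

result = client.analyze_from_url(
    image_url="https://example.com/sample.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.READ],
)

if result.caption is not None:
    print(f"Caption: {result.caption.text} ({result.caption.confidence:.2f})")
if result.read is not None:
    for block in result.read.blocks:
        for line in block.lines:
            print(line.text)
```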