OpenAI o1-mini
Version: 2024-09-12
OpenAI's o1 Series Models: Enhanced Reasoning and Problem Solving on Azure
The OpenAI o1 series models are specifically designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, math, and similar fields. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate the complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.

o1-mini was developed to provide a faster, cheaper reasoning model that is particularly effective at coding. As a smaller model, o1-mini is 80% cheaper than o1-preview, making it a powerful, cost-effective choice for applications that require reasoning but not broad world knowledge. (A minimal API usage sketch appears after the capabilities list below.)

Note: Configurable content filters are currently not available for o1-preview and o1-mini.

IMPORTANT: The o1-mini model is available for limited access. To try the model in the playground, registration is required, and access will be granted based on Microsoft's eligibility criteria.

Key Capabilities of the o1 Series
- Complex Code Generation: Capable of generating algorithms and handling advanced coding tasks to support developers.
- Advanced Problem Solving: Ideal for comprehensive brainstorming sessions and addressing multifaceted challenges.
- Complex Document Comparison: Perfect for analyzing contracts, case files, or legal documents to identify subtle differences.
- Instruction Following and Workflow Management: Particularly effective for managing workflows requiring shorter contexts.
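To ground these capabilities, here is a minimal sketch of calling an o1-mini deployment through the Azure OpenAI chat completions API with the official `openai` Python SDK. The endpoint, API version, and deployment name are illustrative placeholders; note that o1-series models take `max_completion_tokens` rather than `max_tokens`, and at launch o1-mini did not accept system messages or sampling parameters such as `temperature`.

```python
import os
from openai import AzureOpenAI  # pip install openai

# Endpoint, key, and deployment name below are placeholders; substitute
# the values from your own Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-09-01-preview",  # assumed o1-capable API version
)

response = client.chat.completions.create(
    model="o1-mini",  # your deployment name for o1-mini
    messages=[
        # o1-mini initially supported only user/assistant roles,
        # so any instructions go in the user message.
        {"role": "user", "content": "Write a Python function that merges "
                                    "two sorted lists in O(n) time."}
    ],
    # o1 models reason internally before answering, so budget completion
    # tokens generously; the parameter is max_completion_tokens, not max_tokens.
    max_completion_tokens=4000,
)

print(response.choices[0].message.content)
```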
Model Variants
- o4-mini: The most efficient reasoning model in the o model series, well suited for agentic solutions. Now generally available.
- o3: The most capable reasoning model in the o model series, and the first to offer full tool support for agentic solutions. Now generally available.
- o3-mini: A faster and more cost-efficient option in the o3 series, ideal for coding tasks requiring speed and lower resource consumption.
- o1: The most capable model in the o1 series, offering enhanced reasoning abilities. Now generally available.
- o1-mini: A faster and more cost-efficient option in the o1 series, ideal for coding tasks requiring speed and lower resource consumption.
Limitations
The o1-mini model is currently in preview and does not include some features available in other models, such as the image understanding and structured outputs found in the GPT-4o and GPT-4o-mini models. For many tasks, the generally available GPT-4o models may still be more suitable.

Resources
Model provider
This model is provided through the Azure OpenAI Service.

Relevant documents
The following documents are applicable:
- Overview of Responsible AI practices for Azure OpenAI models
- Transparency Note for Azure OpenAI Service
Safety
OpenAI has incorporated additional safety measures into the o1 models, including new techniques to help the models refuse unsafe requests. These advancements make the o1 series some of the most robust models available. One way OpenAI measures safety is by testing how well models continue to follow their safety rules when a user tries to bypass them (known as "jailbreaking"). In OpenAI's internal tests, GPT-4o scored 22 (on a scale of 0-100) while the o1-preview model scored 84. You can read more about this in OpenAI's system card and research post.

The following section is an extract from the OpenAI o1-mini model announcement. Please refer to the original source for the full benchmark report.
Large language models such as o1 are pre-trained on vast text datasets. While these high-capacity models have broad world knowledge, they can be expensive and slow for real-world applications. In contrast, o1-mini is a smaller model optimized for STEM reasoning during pretraining. After training with the same high-compute reinforcement learning (RL) pipeline as o1, o1-mini achieves comparable performance on many useful reasoning tasks, while being significantly more cost efficient.
Evals
| Task | Dataset | Metric | GPT-4o | o1-mini | o1-preview |
|---|---|---|---|---|---|
| Coding | Codeforces | Elo | 900 | 1650 | 1258 |
| Coding | HumanEval | Accuracy | 90.2% | 92.4% | 92.4% |
| Coding | Cybersecurity CTFs | Accuracy (Pass@12) | 20.0% | 28.7% | 43.0% |
| STEM | MMLU (0-shot CoT) | Accuracy | 88.7% | 85.2% | 90.8% |
| STEM | GPQA (Diamond, 0-shot CoT) | Accuracy | 53.6% | 60.0% | 73.3% |
| STEM | MATH-500 (0-shot CoT) | Accuracy | 60.3% | 90.0% | 85.5% |
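The Pass@12 figure in the CTF row presumably follows the standard pass@k convention: the probability that at least one of k sampled attempts succeeds. A common unbiased estimator, introduced with HumanEval in Chen et al. (2021), is sketched below; its use here is an assumption, since the extract does not specify the estimator.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per task
    c: number of correct samples among them
    k: budget of attempts counted
    Returns the expected probability that at least one of k randomly
    chosen samples (out of the n generated) is correct.
    """
    if n - c < k:
        return 1.0  # too few failures to fill all k slots without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 5 correct out of 40 samples, scored at k = 12.
print(f"pass@12 = {pass_at_k(40, 5, 12):.3f}")
```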
Safety
| Metric | GPT-4o | o1-mini |
|---|---|---|
| % Safe completions refusal on harmful prompts (standard) | 0.99 | 0.99 |
| % Safe completions on harmful prompts (challenging: jailbreaks & edge cases) | 0.714 | 0.932 |
| % Compliance on benign edge cases ("not over-refusal") | 0.91 | 0.923 |
| Goodness@0.1 StrongREJECT jailbreak eval (Souly et al. 2024) | 0.22 | 0.83 |
| Human-sourced jailbreak eval | 0.77 | 0.95 |
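For context, Goodness@0.1 from the StrongREJECT evaluation reports a model's safety against the most effective 10% of jailbreak techniques per prompt. A minimal sketch of that aggregation follows; the array shape and the scoring convention (1.0 = fully safe refusal) are assumptions for illustration, not the benchmark's actual harness.

```python
import numpy as np

def goodness_at_k(safety_scores: np.ndarray, k: float = 0.1) -> float:
    """Aggregate per-attempt safety scores into Goodness@k.

    safety_scores: array of shape (num_prompts, num_jailbreak_techniques)
    with values in [0, 1], where 1.0 means a fully safe refusal.
    For each prompt, only the worst-performing k fraction of techniques
    (the most effective jailbreaks) count toward the score.
    """
    n_techniques = safety_scores.shape[1]
    n_worst = max(1, int(n_techniques * k))
    # Sort ascending per prompt and keep the lowest (least safe) scores.
    worst = np.sort(safety_scores, axis=1)[:, :n_worst]
    return float(worst.mean())

# Toy example: 3 prompts x 20 jailbreak techniques, mostly-safe model.
rng = np.random.default_rng(0)
scores = rng.uniform(0.5, 1.0, size=(3, 20))
print(f"Goodness@0.1 = {goodness_at_k(scores):.2f}")
```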
Model Specifications
- Context Length: 128,000 tokens
- Quality Index: 0.82
- License: Custom
- Training Data: September 2023
- Last Updated: March 2025
- Input Type: Text
- Output Type: Text
- Publisher: OpenAI
- Languages: 27 languages
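Given the 128,000-token context length, it can be useful to check prompt size before sending a request. A minimal sketch using the `tiktoken` library follows; assuming the `o200k_base` encoding (used by recent OpenAI models) applies to o1-mini is an approximation, and o1 models also consume hidden reasoning tokens, so leave headroom.

```python
import tiktoken  # pip install tiktoken

CONTEXT_LENGTH = 128_000  # o1-mini context window, per the specs above

def fits_in_context(prompt: str, completion_budget: int = 4000) -> bool:
    """Rough pre-flight check that a prompt plus the completion budget
    fits in the model's context window. Token counts are approximate:
    o200k_base is assumed here, and o1 models spend additional hidden
    reasoning tokens, so keep a safety margin."""
    enc = tiktoken.get_encoding("o200k_base")
    n_prompt_tokens = len(enc.encode(prompt))
    return n_prompt_tokens + completion_budget <= CONTEXT_LENGTH

print(fits_in_context("Summarize the o1-mini model card."))
```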