OpenAI o1-preview
Version: 1
OpenAI
Last updated September 2024
Focused on advanced reasoning and solving complex problems, including math and science tasks. Ideal for applications that require deep contextual understanding and agentic workflows.
Reasoning
Multilingual
Coding

OpenAI's o1 Series Models: Enhanced Reasoning and Problem Solving on Azure

The OpenAI o1 series models are specifically designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, and math. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate the complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.

Note: Configurable content filters are currently not available for o1-preview and o1-mini.

IMPORTANT: The o1-preview model is available for limited access. To try the model in the playground, registration is required, and access will be granted based on Microsoft's eligibility criteria.
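For developers with access, the sketch below shows one way a deployment might be called from Python via the openai SDK's AzureOpenAI client. The endpoint, key placeholder, deployment name, and API version are illustrative assumptions; substitute the values from your own Azure OpenAI resource. Note that o1-series models accept max_completion_tokens rather than max_tokens, and the preview models do not support system messages or temperature.

```python
from openai import AzureOpenAI

# Hypothetical resource values; replace with your own deployment details.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR_API_KEY",
    api_version="2024-09-01-preview",  # assumed API version with o1 support
)

# o1-preview reasons internally before answering; send the task as a plain
# user message (system messages are not supported in preview).
response = client.chat.completions.create(
    model="o1-preview",  # assumed name of your Azure deployment
    messages=[
        {"role": "user", "content": "Derive the closed form of 1 + 2 + ... + n."}
    ],
    max_completion_tokens=2000,  # o1 uses max_completion_tokens, not max_tokens
)

print(response.choices[0].message.content)
```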

Key Capabilities of the o1 Series

  • Complex Code Generation: Capable of generating algorithms and handling advanced coding tasks to support developers.
  • Advanced Problem Solving: Ideal for comprehensive brainstorming sessions and addressing multifaceted challenges.
  • Complex Document Comparison: Perfect for analyzing contracts, case files, or legal documents to identify subtle differences.
  • Instruction Following and Workflow Management: Particularly effective for managing workflows requiring shorter contexts.

Model Variants

  • o1-preview: The most capable model in the o1 series, offering enhanced reasoning abilities.
  • o1-mini: A faster and more cost-efficient option in the o1 series, ideal for coding tasks requiring speed and lower resource consumption.

Limitations

The o1-preview model is currently in preview and does not include some features available in other models, such as the image understanding and structured outputs found in the GPT-4o and GPT-4o-mini models. For many tasks, the generally available GPT-4o models may still be more suitable.

Resources

Model provider

This model is provided through the Azure OpenAI Service.


Safety

OpenAI has incorporated additional safety measures into the o1 models, including new techniques to help the models refuse unsafe requests. These advancements make the o1 series some of the most robust models available. One way OpenAI measures safety is by testing how well models continue to follow their safety rules when a user tries to bypass them (known as "jailbreaking"). In OpenAI's internal tests, GPT-4o scored 22 (on a scale of 0-100) while the o1-preview model scored 84. You can read more about this in OpenAI's system card and research post.
The following is an extract from Learning to Reason with LLMs, OpenAI blog, September 2024. Please refer to the original source for the full benchmark report.

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

Evals

| Dataset | Metric | gpt-4o | o1-preview |
|---|---|---|---|
| Competition Math: AIME (2024) | cons@64 | 13.4 | 56.7 |
| Competition Math: AIME (2024) | pass@1 | 9.3 | 44.6 |
| Competition Code: CodeForces | Elo | 808 | 1,258 |
| Competition Code: CodeForces | Percentile | 11.0 | 62.0 |
| GPQA Diamond | cons@64 | 56.1 | 78.3 |
| GPQA Diamond | pass@1 | 50.6 | 73.3 |
| Biology | cons@64 | 63.2 | 73.7 |
| Biology | pass@1 | 61.6 | 65.9 |
| Chemistry | cons@64 | 43.0 | 60.2 |
| Chemistry | pass@1 | 40.2 | 59.9 |
| Physics | cons@64 | 68.6 | 89.5 |
| Physics | pass@1 | 59.5 | 89.4 |
| MATH | pass@1 | 60.3 | 85.5 |
| MMLU | pass@1 | 88.0 | 92.3 |
| MMMU (val) | pass@1 | 69.1 | n/a |
| MathVista (testmini) | pass@1 | 63.8 | n/a |
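For readers unfamiliar with the metrics above: pass@1 is the accuracy of a single sampled answer, while cons@64 is consensus accuracy, i.e. the majority-vote answer across 64 samples. The sketch below is a minimal, hypothetical illustration of how a cons@k score could be computed; sample_answer is a placeholder for however candidate answers are actually generated.

```python
from collections import Counter

def consensus_at_k(sample_answer, problem, k=64):
    """Sample k candidate answers for one problem and return the
    most common one (the majority-vote, or "consensus", answer)."""
    answers = [sample_answer(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def cons_at_k_accuracy(sample_answer, problems, solutions, k=64):
    """Fraction of problems where the majority-vote answer matches
    the reference solution."""
    correct = sum(
        consensus_at_k(sample_answer, p, k) == s
        for p, s in zip(problems, solutions)
    )
    return correct / len(problems)
```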

Safety

| Metric | GPT-4o | o1-preview |
|---|---|---|
| % Safe completions on harmful prompts (standard) | 0.990 | 0.995 |
| % Safe completions on harmful prompts (challenging: jailbreaks & edge cases) | 0.714 | 0.934 |
| ↳ Harassment (severe) | 0.845 | 0.900 |
| ↳ Exploitative sexual content | 0.483 | 0.949 |
| ↳ Sexual content involving minors | 0.707 | 0.931 |
| ↳ Advice about non-violent wrongdoing | 0.688 | 0.961 |
| ↳ Advice about violent wrongdoing | 0.778 | 0.963 |
| % Safe completions for top 200 with highest Moderation API scores per category in WildChat (Zhao et al., 2024) | 0.945 | 0.971 |
| Goodness@0.1 on the StrongREJECT jailbreak eval (Souly et al., 2024) | 0.220 | 0.840 |
| Human-sourced jailbreak eval | 0.770 | 0.960 |
| % Compliance on internal benign edge cases ("not over-refusal") | 0.910 | 0.930 |
| % Compliance on benign edge cases in XSTest (Röttger et al., 2023) | 0.924 | 0.976 |
Model Specifications

| Attribute | Value |
|---|---|
| Context Length | 128,000 |
| Quality Index | 0.71 |
| License | Custom |
| Training Data | September 2023 |
| Last Updated | September 2024 |
| Input Type | Text |
| Output Type | Text |
| Publisher | OpenAI |
| Languages | 27 languages |
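The 128,000-token context length is shared between the prompt, the model's hidden reasoning tokens, and its visible output. A rough way to estimate how much of the window a prompt will consume is to count its tokens locally. The sketch below uses the tiktoken library and assumes the o200k_base encoding (the one used by GPT-4o-family models; treat its applicability to o1 as an assumption).

```python
import tiktoken

# Assumption: o1-series models use the o200k_base encoding, as GPT-4o does.
enc = tiktoken.get_encoding("o200k_base")

CONTEXT_LENGTH = 128_000

def tokens_used(prompt: str) -> int:
    """Approximate number of context tokens a prompt will consume."""
    return len(enc.encode(prompt))

prompt = "Summarize the key findings of this cell sequencing dataset..."
n = tokens_used(prompt)
print(f"{n} prompt tokens; roughly {CONTEXT_LENGTH - n} left for reasoning and output")
```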