OpenAI o1-preview
Version: 1
OpenAI's o1 Series Models: Enhanced Reasoning and Problem Solving on Azure
The OpenAI o1 series models are specifically designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, and math. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate the complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.

Note: Configurable content filters are currently not available for o1-preview and o1-mini.

IMPORTANT: The o1-preview model is available for limited access. To try the model in the playground, registration is required, and access will be granted based on Microsoft's eligibility criteria.

Key Capabilities of the o1 Series
- Complex Code Generation: Capable of generating algorithms and handling advanced coding tasks to support developers.
- Advanced Problem Solving: Ideal for comprehensive brainstorming sessions and addressing multifaceted challenges.
- Complex Document Comparison: Perfect for analyzing contracts, case files, or legal documents to identify subtle differences.
- Instruction Following and Workflow Management: Particularly effective for managing workflows requiring shorter contexts.
Model Variants
- o1-preview: The most capable model in the o1 series, offering enhanced reasoning abilities.
- o1-mini: A faster and more cost-efficient option in the o1 series, ideal for coding tasks requiring speed and lower resource consumption.
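As an illustration, here is a minimal sketch of calling an o1-preview deployment through the Azure OpenAI Python SDK (openai v1.x). The endpoint, key, API version, and deployment name are placeholders for your own resource; note that the o1 preview models accept only user and assistant messages and use `max_completion_tokens` in place of `max_tokens`.

```python
import os
from openai import AzureOpenAI

# Placeholder endpoint, key, and API version for your Azure resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-09-01-preview",
)

response = client.chat.completions.create(
    model="o1-preview",  # the name of your deployment (assumed here)
    # o1 preview models do not support system messages or temperature.
    messages=[{"role": "user", "content": "Plan a three-step data-cleaning workflow."}],
    max_completion_tokens=2_000,  # bounds visible output plus hidden reasoning tokens
)
print(response.choices[0].message.content)
```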
Limitations
The o1-preview model is currently in preview and does not include some features available in other models, such as the image understanding and structured outputs found in the GPT-4o and GPT-4o-mini models. For many tasks, the generally available GPT-4o models may still be more suitable.

Resources
Model provider
This model is provided through the Azure OpenAI Service.

Relevant documents
The following documents are applicable:
- Overview of Responsible AI practices for Azure OpenAI models
- Transparency Note for Azure OpenAI Service
Safety
OpenAI has incorporated additional safety measures into the o1 models, including new techniques to help the models refuse unsafe requests. These advancements make the o1 series some of the most robust models available. One way OpenAI measures safety is by testing how well models continue to follow their safety rules when a user tries to bypass them (known as "jailbreaking"). In OpenAI's internal tests, GPT-4o scored 22 (on a scale of 0-100) while the o1-preview model scored 84. You can read more about this in OpenAI's system card and research post.

The following page is an extract from Learning to Reason with LLMs (OpenAI blog, September 2024). Please refer to the original source for a full benchmark report.
OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.
Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.
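The hidden chain of thought is not directly configurable, but one rough way to vary test-time compute through the public API is the completion-token budget, which for o1 models covers reasoning tokens as well as visible output. A sketch under that assumption (deployment name and API version are placeholders):

```python
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-09-01-preview",  # placeholder API version
)

question = "How many primes lie between 100 and 200?"

# For o1 models, max_completion_tokens caps visible output *plus* hidden
# reasoning tokens, so raising it gives the model more room to think.
for budget in (1_000, 4_000, 16_000):
    resp = client.chat.completions.create(
        model="o1-preview",  # deployment name (assumed)
        messages=[{"role": "user", "content": question}],
        max_completion_tokens=budget,
    )
    print(budget, resp.usage.completion_tokens,
          resp.choices[0].message.content[:80])
```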
Evals
| Dataset | Metric | gpt-4o | o1-preview |
|---|---|---|---|
| Competition Math AIME (2024) | cons@64 | 13.4 | 56.7 |
| | pass@1 | 9.3 | 44.6 |
| Competition Code CodeForces | Elo | 808 | 1,258 |
| | Percentile | 11.0 | 62.0 |
| GPQA Diamond | cons@64 | 56.1 | 78.3 |
| | pass@1 | 50.6 | 73.3 |
| Biology | cons@64 | 63.2 | 73.7 |
| | pass@1 | 61.6 | 65.9 |
| Chemistry | cons@64 | 43.0 | 60.2 |
| | pass@1 | 40.2 | 59.9 |
| Physics | cons@64 | 68.6 | 89.5 |
| | pass@1 | 59.5 | 89.4 |
| MATH | pass@1 | 60.3 | 85.5 |
| MMLU | pass@1 | 88.0 | 92.3 |
| MMMU (val) | pass@1 | 69.1 | n/a |
| MathVista (testmini) | pass@1 | 63.8 | n/a |
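In the table above, pass@1 scores a single sampled answer per problem, while cons@64 (consensus) takes a majority vote over 64 samples per problem. A minimal sketch of both aggregations, with hypothetical sampled answers standing in for real model output:

```python
from collections import Counter
from statistics import mean

# Hypothetical data: for each problem, 64 sampled final answers and the truth.
problems = [
    {"truth": "204", "samples": ["204"] * 40 + ["102"] * 24},
    {"truth": "17",  "samples": ["17"] * 20 + ["19"] * 44},
]

def pass_at_1(p):
    # pass@1: probability a single sample is correct, estimated here as
    # the fraction of correct samples.
    return mean(s == p["truth"] for s in p["samples"])

def cons_at_k(p):
    # cons@64: take the most common answer across the samples, score it once.
    winner, _ = Counter(p["samples"]).most_common(1)[0]
    return float(winner == p["truth"])

print("pass@1 :", mean(pass_at_1(p) for p in problems))
print("cons@64:", mean(cons_at_k(p) for p in problems))
```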
Safety
| Metric | GPT-4o | o1-preview |
|---|---|---|
| % Safe completions on harmful prompts (standard) | 0.990 | 0.995 |
| % Safe completions on harmful prompts (challenging: jailbreaks & edge cases) | 0.714 | 0.934 |
| ↳ Harassment (severe) | 0.845 | 0.900 |
| ↳ Exploitative sexual content | 0.483 | 0.949 |
| ↳ Sexual content involving minors | 0.707 | 0.931 |
| ↳ Advice about non-violent wrongdoing | 0.688 | 0.961 |
| ↳ Advice about violent wrongdoing | 0.778 | 0.963 |
| % Safe completions for top 200 with highest Moderation API scores per category in WildChat (Zhao et al., 2024) | 0.945 | 0.971 |
| Goodness@0.1 on StrongREJECT jailbreak eval (Souly et al., 2024) | 0.220 | 0.840 |
| Human-sourced jailbreak eval | 0.770 | 0.960 |
| % Compliance on internal benign edge cases ("not over-refusal") | 0.910 | 0.930 |
| % Compliance on benign edge cases in XSTest (Röttger et al., 2023) | 0.924 | 0.976 |
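Goodness@0.1 is the StrongREJECT summary statistic (Souly et al., 2024), which measures safety against roughly the most effective 10% of jailbreak techniques per prompt. The aggregation below is a sketch under that reading, not the paper's exact procedure, with hypothetical harmfulness scores in [0, 1]:

```python
from statistics import mean

def goodness_at(scores_by_prompt: dict[str, list[float]], frac: float = 0.1) -> float:
    """Average safety (1 - harmfulness) over the worst `frac` of jailbreak
    attempts per prompt. Aggregation details are an assumption; see
    Souly et al. (2024) for the exact definition."""
    per_prompt = []
    for scores in scores_by_prompt.values():
        worst = sorted(scores, reverse=True)      # highest harmfulness first
        k = max(1, int(len(worst) * frac))        # top ~10% of attempts
        per_prompt.append(mean(1.0 - s for s in worst[:k]))
    return mean(per_prompt)

# Hypothetical harmfulness scores, one per jailbreak technique and prompt.
scores = {"prompt_a": [0.9, 0.2, 0.1, 0.0], "prompt_b": [0.4, 0.3, 0.0, 0.0]}
print(goodness_at(scores))  # higher is safer
```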
Model Specifications
- Context Length: 128,000 tokens
- Quality Index: 0.71
- License: Custom
- Training Data: September 2023
- Last Updated: September 2024
- Input Type: Text
- Output Type: Text
- Publisher: OpenAI
- Languages: 27 languages
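A quick pre-flight check against the 128,000-token context length, assuming the o200k_base tiktoken encoding and a placeholder completion reserve (both are assumptions, not specified by this card):

```python
import tiktoken

CONTEXT_LENGTH = 128_000  # from the specifications above

enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for o1 models

def fits(prompt: str, completion_reserve: int = 32_768) -> bool:
    """True if the prompt leaves `completion_reserve` tokens of headroom
    for visible output and hidden reasoning within the context window."""
    return len(enc.encode(prompt)) + completion_reserve <= CONTEXT_LENGTH

print(fits("Compare clause 4.2 of contract A with clause 4.2 of contract B."))
```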