roberta-base-openai-detector

Version: 15

Last updated April 2025

The RoBERTa base OpenAI Detector is a classifier that detects text generated by GPT-2. It was created by fine-tuning a RoBERTa base model on outputs of the 1.5B-parameter GPT-2 model, and it is used to determine whether a given text was generated by a GPT-2 model. OpenAI released this detector alongside the weights of the largest GPT-2 model, the 1.5B-parameter version.

Training Details

Training data

The model is a sequence classifier based on RoBERTa base. It was first pretrained on the RoBERTa base training data and then fine-tuned on outputs of the 1.5B-parameter GPT-2 model.
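As a sequence classifier, the model produces one logit per class for each input, and scores like those in the sample output are probabilities derived from those logits. The minimal sketch below shows that final step in plain Python. The label order in `ID2LABEL`, the `classify` helper, and the logit values are hypothetical, chosen for illustration rather than read from the model's configuration.

```python
import math

# Hypothetical label map for illustration; check the deployed model's config for the real order.
ID2LABEL = {0: "Fake", 1: "Real"}

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1 (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, return_all_scores=False):
    """Map one input's logits to label/score dicts shaped like the sample output."""
    probs = softmax(logits)
    scored = [{"label": ID2LABEL[i], "score": p} for i, p in enumerate(probs)]
    if return_all_scores:
        return scored  # every label with its probability
    return max(scored, key=lambda d: d["score"])  # only the top label

# Hypothetical logits for a single input.
print(classify([2.0, -1.0]))  # top label is "Fake"
print(classify([2.0, -1.0], return_all_scores=True))
```

Returning every label when `return_all_scores=True` mirrors the intent of the `return_all_scores` parameter in the sample input, though the exact response format of a deployed endpoint may differ.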

Training Procedure

Preprocessing

According to the model developers, they built a sequence classifier on RoBERTa base (125 million parameters) and fine-tuned it to distinguish outputs of the 1.5B-parameter GPT-2 model from WebText, the dataset used to train GPT-2. To check that the detector classifies generated text reliably across different sampling methods, they analyzed the model's transfer performance in depth. Further details on the training procedure are available in the associated paper.
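One way to picture the training setup described above is as a balanced binary dataset: WebText passages labeled as human-written and GPT-2 samples labeled as generated. The sketch below assumes the generated samples are grouped by sampling method, so that a detector can be fine-tuned on one method and evaluated on another, which is one way to probe transfer performance. The function name, label ids, method names, and toy strings are all hypothetical placeholders, not the developers' actual pipeline.

```python
import random

# Hypothetical label ids for illustration: 0 = GPT-2 output ("Fake"), 1 = WebText ("Real").
FAKE, REAL = 0, 1

def build_split(webtext_docs, gpt2_by_method, methods, seed=0):
    """Label WebText docs REAL and GPT-2 samples from the chosen sampling methods FAKE."""
    examples = [(doc, REAL) for doc in webtext_docs]
    for method in methods:
        examples.extend((doc, FAKE) for doc in gpt2_by_method[method])
    random.Random(seed).shuffle(examples)  # fixed seed keeps the shuffle reproducible
    return examples

# Toy stand-ins for the real corpora; the sampling-method names are hypothetical.
webtext = ["human a", "human b", "human c", "human d"]
gpt2 = {"pure": ["gen 1", "gen 2"], "top_k": ["gen 3", "gen 4"]}

# Transfer check: fine-tune on one sampling method, then evaluate on a different one.
# WebText is split too, so no human-written document appears in both sets.
train = build_split(webtext[:2], gpt2, ["pure"])
test = build_split(webtext[2:], gpt2, ["top_k"], seed=1)
print(len(train), len(test))  # 4 4
```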

Evaluation Results

Testing Data, Factors, and Metrics

The following evaluation details are taken from the associated paper. The model is intended to detect text generated by GPT-2 models. The developers measured its accuracy on 510-token test examples: 5,000 samples from the WebText dataset and 5,000 samples generated by a GPT-2 model. None of these examples were used during training.
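The evaluation protocol above can be sketched as follows: truncate each test example to 510 tokens, label WebText examples as real and GPT-2 examples as generated, and report the fraction classified correctly. Two things here are assumptions: that the 510-token length was chosen so each example plus RoBERTa's two special tokens fits the model's 512-token input, and the `toy_predict` function, a hypothetical stand-in for the detector operating on token ids. On the paper's test set, each input list would hold 5,000 examples.

```python
def truncate(token_ids, max_tokens=510):
    """Keep at most max_tokens ids (assumed: 510 + <s> + </s> = RoBERTa's 512-token limit)."""
    return token_ids[:max_tokens]

def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def balanced_eval(predict, webtext_examples, gpt2_examples):
    """Accuracy on WebText examples (label 1, real) plus GPT-2 examples (label 0, generated)."""
    inputs = [truncate(x) for x in webtext_examples + gpt2_examples]
    labels = [1] * len(webtext_examples) + [0] * len(gpt2_examples)
    return accuracy([predict(x) for x in inputs], labels)

# Hypothetical stand-in detector: predicts "real" (1) when the first token id is even.
def toy_predict(token_ids):
    return 1 if token_ids[0] % 2 == 0 else 0

webtext = [[2, 5], [4, 1]]
gpt2 = [[3, 0], [8, 8]]
print(balanced_eval(toy_predict, webtext, gpt2))  # 0.75
```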

Limitations and Biases

In the associated paper, the model developers acknowledge the concern that malicious actors could use the model to develop methods for evading detection. However, one of the main reasons for releasing the model was to advance detection research.

In a related blog post, the developers discuss the limitations of automated methods for identifying synthetic text and stress the need to combine automated detection tools with non-automated approaches. They report that their in-house detection research produced a detection model with approximately 95% accuracy at detecting 1.5B GPT-2-generated text, and that this accuracy is not sufficient for standalone detection: to be more effective, automated detection should be paired with metadata-based approaches, human judgment, and public education.

The developers also found that content from larger models is harder to classify. As model sizes increase, automated tools like this model may find detection increasingly difficult. The authors suggest that training detector models on outputs from larger models can improve accuracy and robustness.

Extensive research has examined bias and fairness issues in language models; notable contributions include Sheng et al. (2021) and Bender et al. (2021). Predictions from RoBERTa base and GPT-2 1.5B, the models on which this detector is built and fine-tuned, may perpetuate harmful stereotypes across protected classes, identity characteristics, and sensitive social and occupational groups. The RoBERTa base and GPT-2 XL model cards provide more detail, and the developers of this model explore these issues further in their research paper.

Model Evaluation

Task	Use case	Dataset	Python sample	CLI with YAML
Text Classification	Detecting GPT2 Output	GPT2-Outputs	evaluate-model-text-classification.ipynb	evaluate-model-text-classification.yml

Inference samples

Inference type	Python sample
Real time	text-classification-online-endpoint.ipynb
Batch	entailment-contradiction-batch.ipynb

Sample inputs and outputs

Sample input

{ 
  "input_data": ["I like you. I love you", "Today was a horrible day" ], 
  "params": { 
    "return_all_scores": true 
  } 
}

Sample output

[
  {
    "label": "Fake",
    "score": 0.881293773651123
  },
  {
    "label": "Fake",
    "score": 0.9996414184570312
  }
]
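For the real-time inference path, a client might build a request body in the shape of the sample input and read back a response in the shape of the sample output. The minimal sketch below uses only the Python standard library. The endpoint URL, the bearer-token `Authorization` header, and all function names are assumptions for illustration; the linked notebooks describe the actual invocation steps. The parser also assumes the flat list shown in the sample output, with one label and score per input.

```python
import json
import urllib.request

def build_payload(texts, return_all_scores=True):
    """Build a request body in the shape of the sample input."""
    return {"input_data": list(texts), "params": {"return_all_scores": return_all_scores}}

def parse_response(body):
    """Turn a response shaped like the sample output into (label, score) pairs."""
    return [(item["label"], item["score"]) for item in json.loads(body)]

def score_online(endpoint_url, api_key, texts):
    """POST texts to a deployed endpoint (hypothetical URL; bearer-key auth is assumed)."""
    request = urllib.request.Request(
        endpoint_url,
        data=json.dumps(build_payload(texts)).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read().decode("utf-8"))

# Offline usage: parse the sample output and pair each result with its input text.
sample_output = '[{"label": "Fake", "score": 0.881293773651123}, {"label": "Fake", "score": 0.9996414184570312}]'
texts = ["I like you. I love you", "Today was a horrible day"]
for text, (label, score) in zip(texts, parse_response(sample_output)):
    print(f"{text!r}: {label} ({score:.3f})")
```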

Model Specifications

License: MIT

Last Updated: April 2025

Provider

Languages: 1 language
