
Technical Report on High Accuracy AI-generated Text Detection

Bradley Emi and Max Spero, February 21, 2024

[Figure: Training process for the Pangram Labs AI-generated text classifier]

Introduction

At Pangram Labs, we are building the best AI text detection model to protect the internet from being flooded by inauthentic, deceptive, and low-quality content. We believe in a world enabled by LLMs, humans will need to be equipped with the best toolkit to identify the truth and we want to provide the right technology to meet that need.

Pangram Labs has built a serious classifier for detecting AI-generated text, including the kind that can be scaled into spam or fraudulent content. How much better is our model than the alternatives out there? In this blog post, we present a comprehensive analysis of our model's performance, accompanied by our first ever public technical whitepaper.

This blog post will cover several topics:

  • Why is AI-generated text detection an important problem?
  • Which AI-generated content detector is the best?
  • Why does high accuracy matter?
  • What kinds of content can Pangram Labs detect?
  • How did Pangram Labs approach solving this problem?

For a more technical deep dive including methodology, see our Technical Report on the Pangram AI-Generated Text Classifier.

TL;DR

We performed a competitive benchmark using nearly 2,000 documents to determine key accuracy metrics, including overall accuracy, false positive rate, and false negative rate.

Our text classifier outperforms academic methods and shows significantly lower error rates than other available AI text detection methods in a comprehensive benchmark. Our model demonstrates 99.85% accuracy with a 0.19% false positive rate on nearly 2,000 examples spanning ten different categories of writing and eight commonly used large language models. Other methods fail on more capable LLMs such as GPT-4 (<=75% accuracy), while Pangram Labs sustains 99-100% accuracy across all language models tested.

[Figure: Overall accuracy comparison]

Introduction to AI-Generated text

Large language models (LLMs) such as ChatGPT exploded in popularity in 2023 as AI capabilities reached an inflection point. LLMs powering AI assistants could answer questions, brainstorm, and write content, all while sounding convincingly human. This has produced some good outcomes - information is more accessible than ever, and assistants can save us time on menial tasks. However, anyone can now produce convincingly human text with basically no effort, which has its own share of downsides. Spammers can write emails that are harder to filter. Online marketplace sellers can produce thousands of authentic-looking reviews in minutes. Bad actors can take to social media and sway public opinion with thousands of LLM-powered bots.

Unfortunately, these societal risks cannot be mitigated at the LLM level - language models have no understanding of whether a request is legitimate or one of thousands created by a spammer. For this reason, we need content filters at the application layer - to keep human spaces human.

Why Pangram Labs is obsessed with accuracy

We've heard plenty of skepticism about this line of work: that the problem is impossible, that AI detectors have been "shown" not to work, or that you can just prompt around them. Or that even if detection is possible now, it will be harder next year and impossible by the time AGI comes out.

Our thesis is a little different. We believe with conviction that this problem is not only possible, but necessary to solve. It doesn't matter how difficult it is, how many hours we have to put in to build something that users can use and trust. Without our work, it's only a matter of years before the internet is overrun by AI spammers. Human voices will be drowned out by noise.

For us, making sure the problem is solved involves continually increasing the difficulty of our evaluation sets. Early evaluations were easy to max out at 100% accuracy, but it quickly became evident that this did not reflect real-world accuracy. By building harder evals, we are able to measure our improvement objectively. We already believe that our current benchmark is slightly harder than what real-world spammers put out, and this benchmark is close to maxed out. When we return with new numbers, it might look like other methods got even worse, but in reality we will have come back with a harder evaluation set - one where the most capable AIs are pushed to their limits to create text that looks authentic - and our goal is still to catch it with 99% accuracy.

The problem will never be fully solved, but we need to make steady progress forward to avoid falling behind as LLMs become increasingly capable. This is what we signed up for, and what we will continue pursuing until the end.

Comparison of AI detection tools

In our technical report, we compared Pangram Labs against the two leading AI detection tools, as well as a 2023 state-of-the-art academic method for AI detection.

We compare:

  • Pangram Labs
  • GPTZero
  • Originality.ai
  • DetectGPT

Our benchmark includes 1,976 documents - half of them written by humans, the other half generated by eight of the most popular LLMs, including ChatGPT and GPT-4.

[Figure: Overall accuracy comparison]

A quick explainer on what these numbers mean:

  • Accuracy: What percentage of total documents did the tool classify correctly?
  • False positive rate: Of all of the human documents, how many of them were incorrectly classified as AI?
  • False negative rate: Of all of the AI documents, how many of them were incorrectly classified as human?

To make false positive rates concrete: 9% means one in every 11 human documents will get flagged as AI. A 2% false positive rate means one in every 50 human documents will be flagged as AI. And 0.67% means one in every 150 human documents will be flagged as AI.

Similarly, 10% false negative rate means one in ten AI documents pass through undetected, while 1.4% false negative rate means one in every seventy AI documents passes through undetected.
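
For readers who prefer code, here is a minimal sketch of how all three metrics fall out of a confusion matrix, treating AI as the positive class. The counts below are illustrative, not our benchmark numbers:

```python
def detection_metrics(true_pos, false_pos, true_neg, false_neg):
    """AI = positive class, human = negative class."""
    total = true_pos + false_pos + true_neg + false_neg
    accuracy = (true_pos + true_neg) / total
    # Of all human documents, how many were incorrectly flagged as AI?
    false_positive_rate = false_pos / (false_pos + true_neg)
    # Of all AI documents, how many slipped through as human?
    false_negative_rate = false_neg / (false_neg + true_pos)
    return accuracy, false_positive_rate, false_negative_rate

# Illustrative counts: 100 human docs with 9 flagged -> 9% FPR, ~1 in 11.
acc, fpr, fnr = detection_metrics(true_pos=980, false_pos=9, true_neg=91, false_neg=10)
print(f"accuracy={acc:.2%}  FPR={fpr:.2%}  FNR={fnr:.2%}")
```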

Consider the implications of these results. A detection model with a 9% false positive rate cannot be trusted - false accusations would abound. And a detection model with a 10% false negative rate would let so much AI spam through that, in any sustained attack, users would still be inundated.

Diving deeper into results

Our benchmark is split up across two different axes: text domain and origin LLM. "Text domain" or just "domain" is a way of referring to a specific category of writing. For example, a middle school essay reads very differently than a scientific paper, which reads very differently than an email. By splitting results out into different domains, we can get a more comprehensive look into which areas we do well on and where we can focus our efforts to improve.

[Figure: Accuracy by text domain]

The results show that Pangram Labs beats GPTZero and Originality in all ten domains evaluated.

One of the domains, email, is an especially strong result because Pangram Labs does not include any email in its training data. Our performance on email is driven entirely by training a robust model that generalizes to most categories of writing that an LLM can produce.

[Figure: AI documents correctly classified, by origin LLM]

Splitting by origin LLM tells another story: competing AI detection models can do better on less capable open-source models, but they do worse on ChatGPT (gpt-3.5-turbo) and really struggle on GPT-4, OpenAI's most capable LLM. We evaluated multiple versions of the GPT-3.5 Turbo and GPT-4 models, as these are the most commonly used in the wild.

We find that we are the only model that can detect GPT-4 text reliably, and we outperform the competition on every other model we tested as well.

One interesting observation is that our competition performs much better on the open-source models than on the closed-source GPT and Gemini models. We hypothesize that this is due to overreliance on perplexity and burstiness features - while these features are valuable, perplexity and burstiness can only be computed precisely for an open-source model; for closed-source models, one can only make an approximate estimate. This shows the value of our deep learning based approach - it does not rely on brittle features like perplexity, and it can learn more subtle underlying patterns.
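
To make the perplexity point concrete, here is a rough sketch of computing a document's perplexity under an open-source causal LM with Hugging Face transformers. This is the kind of white-box recipe that perplexity-based detectors depend on - and exactly what closed-source models don't fully expose - not how our classifier works:

```python
# Sketch: perplexity of a text under an open-source causal LM. This requires
# access to the model's token-level losses, which closed-source APIs do not
# fully expose. It illustrates the brittle feature, not our detection method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any open-source causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy of each token under one-step-ahead prediction.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()  # lower perplexity = more "predictable" text
```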

Robustness

A question we often get asked is: what happens when a new language model is released? Do you need to train on each new model to detect its outputs? In short, no. OpenAI released two new versions of their LLMs in recent weeks. Without training on these new LLMs at all, we evaluated our model and found that it still did quite well:

  • GPT-3.5-Turbo-0125: 99.66% accuracy
  • GPT-4-0125-Preview: 99.18% accuracy

These new releases are similar to previous versions released by OpenAI. So the next question we ask is - how do we do on completely different model families? To answer this, we evaluated our model on a bunch of open source models that our classifier has never seen before.

[Figure: Performance by open source LLM, unseen by Pangram Labs during training]

Pretty great! A lot of this has to do with the fact that many open source models either start from the Llama family or use similar open source training sets, but it gives us confidence in our ability to generalize without needing to train on every single open source model.

With that being said, our data pipeline is built so that we can generate a new training set within hours of an LLM API being released - bottlenecked only by the API rate limit. We are well aware that LLMs continue to get better, and as we approach AGI it will be increasingly important to stay up-to-date and make sure we can catch even the most advanced AI agents.
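
As a sketch of what that pipeline looks like - with a hypothetical call_new_llm client standing in for the freshly released API - the core is just a throttled generation loop over our existing human examples:

```python
# Minimal sketch of regenerating training data against a newly released LLM.
# call_new_llm is a hypothetical API client; the throttle is the point, since
# the provider's rate limit is the only real bottleneck.
import time

def generate_training_set(human_docs, call_new_llm, requests_per_minute=60):
    ai_docs = []
    for doc in human_docs:
        prompt = f"Write a document on the same topic and of the same length as:\n\n{doc}"
        ai_docs.append(call_new_llm(prompt))
        time.sleep(60 / requests_per_minute)  # stay under the API rate limit
    return ai_docs
```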

English as a Second Language

Previous research found that commercial AI detectors are consistently biased against nonnative English speakers (ESL, or English as a Second Language). To test this, the researchers evaluated several detectors on a benchmark of 91 essays from TOEFL (Test of English as a Foreign Language).

We held out the 91 TOEFL essays from our training set and evaluated Pangram Labs on the benchmark. Due to our work minimizing false positive rates for ESL, we report a false positive rate of 0% on the TOEFL benchmark - meaning none of the human essays in this benchmark were misclassified as AI.

[Figure: Comparison on TOEFL benchmark]

Pangram Labs' approach to AI detection

Detecting AI-generated content is not an easy task. We train a deep learning model with a transformer-based architecture, using two key methods to bring our model's accuracy to the next level.

Synthetic Mirrors

Every document in our training set is labeled either "Human" or "AI." In machine learning, we call these documents "examples."

We have millions of human examples available to train on from public datasets, but no equivalent AI datasets. We solve this by pairing every human example with a "synthetic mirror" - a term we use for an AI-generated document based on a human document. We prompt an LLM to write a document on the same topic and of the same length. For a fraction of examples, we have the LLM start with the first sentence of the human document, to make the AI documents more varied.
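
In code, a simplified version of the mirroring step might look like the sketch below. The prompt wording, the one-in-four continuation fraction, and the complete_with_llm helper are hypothetical stand-ins, not our production pipeline:

```python
# Simplified sketch of synthetic mirror generation. complete_with_llm is a
# hypothetical LLM client, and the 0.25 continuation fraction is illustrative.
import random

def make_synthetic_mirror(human_doc: str, complete_with_llm) -> str:
    n_words = len(human_doc.split())
    prompt = (
        f"Write a document on the same topic as the following text, "
        f"roughly {n_words} words long:\n\n{human_doc}"
    )
    if random.random() < 0.25:
        # Naive first-sentence split, for illustration only.
        first_sentence = human_doc.split(". ")[0] + "."
        prompt += f"\n\nBegin your document with: {first_sentence}"
    return complete_with_llm(prompt)
```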

Hard Negative Mining

Early on, we hit a ceiling training our model. We tried adding more examples but eventually found that the model was "saturated" - more training examples did not improve the model further.

[Figure: Scaling laws experiment]

The performance of this initial model was unsatisfactory - it still had a false positive rate of over 1% on many domains. What we found was that we didn't just need more examples, we needed harder ones.

We identified harder examples by taking our initial model and scanning tens of millions of human examples in open datasets, looking for the hardest documents that our model misclassified. We then generated synthetic mirrors for these documents and added them to our training set. Finally, we re-trained the model and repeated the process.
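
Sketched at a high level, one round of that loop looks roughly like this, reusing make_synthetic_mirror from the sketch above (score_ai, retrain, and the pool size are hypothetical placeholders):

```python
# High-level sketch of one hard negative mining round. score_ai (the current
# model's P(AI) for a document) and retrain are hypothetical placeholders,
# and n_hard is an illustrative number, not our actual batch size.
def mining_round(score_ai, retrain, complete_with_llm, human_pool, train_set, n_hard=10_000):
    # Rank a large pool of human-written documents by how AI-like the
    # current model thinks they are.
    ranked = sorted(human_pool, key=score_ai, reverse=True)
    # Keep the human documents the model most confidently gets wrong.
    hard_humans = ranked[:n_hard]
    # Pair each hard human example with an AI-generated mirror on the same topic.
    train_set += [(doc, "Human") for doc in hard_humans]
    train_set += [(make_synthetic_mirror(doc, complete_with_llm), "AI") for doc in hard_humans]
    # Retrain on the augmented set, then repeat the whole process.
    return retrain(train_set)
```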

[Figure: Training process for the Pangram Labs AI-generated text classifier]

With this training method, we were able to reduce our false positive rates by a factor of 100 and ship a model that we're proud of.

[Table: False positive rates by domain]

We call this method hard negative mining with synthetic mirrors, and go over the process in more detail in our technical report.

What's next for Pangram Labs?

Obviously, this is not the end of our journey. We have a bunch of new ideas for how we can drive performance to the next level. We're going to continue improving our evaluation sets so we can better track false positive rate into the hundredths of a percent. We're planning on expanding our model to work in non-English languages and working to understand and catch our failure cases. Keep an eye out for what we do next!

Any questions or comments? Contact us at info@pangram.com!
