All About False Positives in AI Detectors

Bradley Emi · March 27, 2025

One of the most important aspects of our work at Pangram is minimizing our false positive rate: reducing, as much as possible, the chance that human writing is flagged as AI-generated. Today, we'll explain Pangram's false positive rates across many different kinds of writing, how we measure and evaluate our models to keep the false positive rate as low as possible, and finally, some of the techniques we employ to build AI detection software with the lowest false positive rate in the industry.

What is a false positive?

In the context of AI detection, a false positive is when a detector mistakenly flags a human-written sample as AI-generated. In contrast, a false negative is when an AI-generated sample is mistakenly predicted to be human.

False positives and false negatives in AI detection

The diagram above illustrates the two types of errors. If red represents the negative class and green represents the positive class, a red X predicted as green would be a false positive, and a green O predicted as red would be a false negative.

In statistics, the terms Type I error and Type II error mean exactly the same thing: a Type I error is a false positive, and a Type II error is a false negative. Statisticians, particularly those working in the medical sciences, also use the terms sensitivity and specificity to distinguish these two error rates, and machine learning scientists use precision and recall. While there are some slight technical differences between these terms, for educational purposes we'll stick to simply "false positives" and "false negatives" in this post, as I think these are the most self-explanatory names for the two kinds of errors.
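For readers who prefer to see the definitions as arithmetic, here is a minimal sketch (not Pangram's internal code) showing how these terms relate to the raw counts from a labeled evaluation set, where "positive" means AI-generated:

```python
# A minimal sketch relating the error-rate terms above to confusion-matrix counts.
# "Positive" means AI-generated; "negative" means human-written.

def error_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard confusion-matrix metrics from raw counts."""
    return {
        "false_positive_rate": fp / (fp + tn),   # human docs flagged as AI (Type I)
        "false_negative_rate": fn / (fn + tp),   # AI docs missed (Type II)
        "sensitivity_recall":  tp / (tp + fn),   # fraction of AI docs caught
        "specificity":         tn / (tn + fp),   # fraction of human docs cleared
        "precision":           tp / (tp + fp),   # of flagged docs, fraction truly AI
    }

# Illustrative numbers: 10,000 human documents with 1 false flag,
# 1,000 AI documents with 20 missed.
print(error_rates(tp=980, fp=1, tn=9_999, fn=20))
```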

In AI detection, a false positive is far worse than a false negative. Repeatedly accusing students of AI plagiarism when they wrote their assignments themselves, with no AI assistance, greatly undermines trust between student and teacher, and can cause the student a great deal of anxiety and stress. A false negative, on the other hand, means that a cheater may slip through every once in a while, which is not as bad of an outcome.

It's worth noting that in other detection problems, a false negative can cause much more harm than a false positive. In a cancer screening test, for example, it is much better for the test to mistakenly say that the patient has cancer than for the test to miss a patient's actual cancer entirely. A false positive may mean the inconvenience of follow-up visits and additional testing, but that is far better than missing a cancer diagnosis, which is a threat to the patient's life.

Going back to AI detection: a false positive causes more harm than a false negative, but both matter. Consistently missing AI-generated text and falsely predicting it as human undermines the value of the tool as well. So, at Pangram, our general approach is to minimize both false positives and false negatives as much as possible, while treating false positives as the higher priority.

What is Pangram's false positive rate?

The answer is that it depends!

Overall, we measure our false positive rate to be approximately 1 in 10,000: sometimes a bit higher, or a bit lower, depending on the kind of writing and other variables.

We measure Pangram's false positive rate on a wide variety of writing: we call these domains. While not exhaustive, below are our most up-to-date false positive rates that we measure internally on each domain:

Domain | False Positive Rate
Academic Essays | 0.004%
Product Reviews (English) | 0.004%
Product Reviews (Spanish) | 0.008%
Product Reviews (Japanese) | 0.015%
Scientific Abstracts | 0.001%
Code Documentation | 0.0%
Congressional Transcripts | 0.0%
Recipes | 0.23%
Medical Papers | 0.000%
US Business Reviews | 0.0004%
Hollywood Movie Scripts | 0.0%
Wikipedia (English) | 0.016%
Wikipedia (Spanish) | 0.07%
Wikipedia (Japanese) | 0.02%
Wikipedia (Arabic) | 0.08%
News Articles | 0.001%
Books | 0.003%
Poems | 0.05%
Political Speeches | 0.0%
Social Media Q&A | 0.01%
Creative Writing, Short Stories | 0.009%
How-To Articles | 0.07%

What factors determine Pangram's susceptibility to false positives?

In general, Pangram performs best when the following conditions are met:

  • The text is long enough (over a couple hundred words)
  • The text is written in complete sentences
  • The domain is well-represented in common online training sets
  • The text contains more creative input, and is less formulaic

We believe these factors are why Pangram performs best on essays, creative writing, and reviews. News articles, scientific papers, and Wikipedia entries are more formulaic and technical, but data in these domains is abundant, so Pangram has gotten very good at recognizing even subtle patterns in that writing. Finally, domains such as recipes and poetry are the weakest: the text tends to be short and not written in complete sentences (giving the LLM less of a chance to inject its idiosyncratic style into the text), and these forms are generally rarer online than the other domains.

Practically speaking, what does this mean? While Pangram is still relatively reliable across all domains, you can be more confident in Pangram's accuracy when the text is long, written in complete sentences, and requires more original input from the writer. For this reason, we recommend against screening things like short bullet-point lists and outlines, math, very short responses (e.g. single sentences), and extremely formulaic text such as long lists of data, spreadsheets, template-based writing, and instruction manuals.

How does Pangram's false positive rate compare to competitors?

We cannot run the same thorough benchmark on our competitors, simply because doing so would be prohibitively expensive. However, we can look at what our competitors say their false positive rates are.

TurnItIn

TurnItIn's reported false positive rate on their website

TurnItIn's latest whitepaper reports a false positive rate of 0.51% on academic writing, or approximately 1 in 200, at the document level. That means roughly 1 in every 200 human-written student submissions will be falsely flagged as AI.

Our false positive rate, measured on a similar dataset of academic essays, is 0.004%, which is 1 in 25,000.

This is a significant difference. At a large research university, 100,000 papers may be submitted per year. That is the difference between roughly 500 false flags per year for TurnItIn and only 4 for Pangram.
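For concreteness, the back-of-the-envelope arithmetic behind that comparison looks like this (the 100,000-submission figure is purely illustrative):

```python
# Expected false flags per year at a hypothetical large university.
submissions_per_year = 100_000  # illustrative figure, not real data

for detector, fpr in [("TurnItIn", 0.0051), ("Pangram", 0.00004)]:
    expected_false_flags = submissions_per_year * fpr
    print(f"{detector}: ~{expected_false_flags:.0f} false flags per year")
# Prints roughly 510 for TurnItIn (about 500 after rounding) and 4 for Pangram.
```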

GPTZero

GPTZero's reported false positive rate on their website

GPTZero claims a 1% false positive rate, which is 2x worse than TurnItIn and 250x worse than Pangram.

We internally benchmarked GPTZero on a smaller set of documents from our general VIP set, for a fair comparison to Pangram. We found its false positive rate to be worse than reported, at 2.01%.

Copyleaks

Copyleaks' reported false positive rate on their website

Copyleaks claims a 0.2% false positive rate, or 1 in 500, which would be 50x worse than Pangram if true.

Moreover, a number like this in isolation does not tell the whole story. We do not know where the data comes from, or what potential biases there may have been in the evaluation. That's why we benchmark thoroughly, and why we are releasing this article detailing our process for evaluating our model.

RAID benchmark

Taking a look at the RAID study published last year by Liam Dugan and coauthors (study #2 in the research roundup article we posted), we'd like to draw attention to the following graph.

RAID study false positive rates across detectors

Most detectors use a "threshold": a confidence score above which the model says the text is AI, and below which it says the text is human. By moving the threshold, false positives and false negatives can be traded off against each other.

In this graph, the x-axis is the false positive rate produced by a given threshold, and the y-axis is the recall: the fraction of AI documents correctly classified as AI when evaluated at that threshold.

The long and short of it is that our competitors' detectors fail to operate when forced to have a false positive rate under 1 percent; that is, they are unable to catch any AI once the threshold is set strictly enough to produce an FPR below 1 percent.
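To illustrate the tradeoff the RAID plot is measuring, here is a simplified sketch with made-up scores; it is not how any particular detector is implemented, just the mechanics of moving a threshold:

```python
# A simplified sketch of the threshold / recall / FPR tradeoff described above.
# scores_human and scores_ai are hypothetical detector confidence scores
# (probability that a document is AI) for labeled human and AI documents.

def fpr_and_recall(scores_human, scores_ai, threshold):
    fp = sum(s >= threshold for s in scores_human)   # human docs flagged as AI
    tp = sum(s >= threshold for s in scores_ai)      # AI docs correctly flagged
    return fp / len(scores_human), tp / len(scores_ai)

scores_human = [0.01, 0.02, 0.10, 0.55, 0.03]
scores_ai = [0.97, 0.88, 0.62, 0.99, 0.45]

for threshold in (0.5, 0.7, 0.9):
    fpr, recall = fpr_and_recall(scores_human, scores_ai, threshold)
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  recall={recall:.2f}")
# Raising the threshold lowers the false positive rate, but also lowers recall.
```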

How do we evaluate Pangram's false positive rate?

Pangram undergoes an extremely rigorous process for signoff and testing before any new model is allowed to be deployed to our dashboard and API.

During our QA, we run three kinds of tests for false positives, which together strike a balance between quantitative and qualitative assessment. Our evaluations include:

  1. Large-scale holdout sets. Approximately 10,000 to 10,000,000 examples per set. These are large-scale, open-access Internet datasets from before ChatGPT (2022), from which we have selected a holdout split that is never trained on and is set aside purely for evaluation.

  2. Medium-scale VIP sets. Approximately 1,000 examples per set. These are datasets that engineers or labelers have hand-collected from reputable sources, inspected by eye, and personally validated to be human-written. While trained experts are good at detecting AI-generated content by eye, they do occasionally make mistakes, so we regularly audit the data and clean it for accuracy.

  3. Challenge sets. Approximately 10-100 examples per set. These are previously reported false positives, tough cases that our friends have sent us, and, in general, interesting examples that we want to know how we perform on. We also collect examples of out-of-the-ordinary text, such as recipes, poetry, movie scripts, and other written forms that aren't well-represented in large language model training sets, and treat these as challenge sets too, as well as an overall benchmark for how well our model performs when put "out of distribution."

In addition to these three kinds of QA, we also have unit tests. Colloquially, these unit tests check our model for what we would call "embarrassing failures." Our current unit test suite requires us to predict human for documents like the Declaration of Independence, famous lines from literature, and our own website copy and blog posts. If any single one of these unit tests fails, we block deployment of the new model and go back to the drawing board. One of our guiding philosophies of evaluation is being hypervigilant about tracking and monitoring these "embarrassing failures" so that they never regress when a new model is released.
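To make the idea concrete, here is a rough sketch of what one such "embarrassing failure" unit test could look like. The `classify` function is a hypothetical stand-in for a call to the deployed detector (stubbed here so the example runs); our actual test suite is internal and structured differently.

```python
import unittest
from types import SimpleNamespace

def classify(text: str) -> SimpleNamespace:
    """Hypothetical stand-in for a call to the deployed detector (stubbed)."""
    return SimpleNamespace(label="human")

DECLARATION_EXCERPT = (
    "We hold these truths to be self-evident, that all men are created equal, "
    "that they are endowed by their Creator with certain unalienable Rights, "
    "that among these are Life, Liberty and the pursuit of Happiness."
)

class EmbarrassingFailureTests(unittest.TestCase):
    def test_declaration_of_independence_is_human(self):
        # A known human-written classic must never be flagged as AI;
        # any failure here blocks deployment of the new model.
        self.assertEqual(classify(DECLARATION_EXCERPT).label, "human")

if __name__ == "__main__":
    unittest.main()
```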

Diagram showing the three types of evaluation sets used at Pangram: large-scale holdout sets (10M+ examples), medium-scale VIP sets (1000+ examples), and challenge sets (10-100 examples)

Those who are mathematically and scientifically inclined might ask: why do you need qualitative assessment? Aren't more samples always better?

My response would be: more samples are not always better. As a wise prophet once said, there are lies, damned lies, and statistics. But in all seriousness, we believe that when you create a large dataset at scale, you are always going to inject some kind of bias. And when a dataset is so large that you cannot inspect every example, you do not know whether your model has overfit to a bias in the dataset that makes it do well on the test but poorly in the real world. (As an aside, we believe this is why many online AI detectors report "99% accuracy" but are not even close to that when you actually test them.)

A funny example illustrating the importance of these multiple flavors of test suites happened in the early days of Pangram, when we first introduced Wikipedia to the training set. One of our first failed attempts performed great on the holdout set, but very poorly on the VIP set of hand-collected Wikipedia articles. What we found was that in the Huggingface dataset we were using, on the human side, the name pronunciations expressed in the International Phonetic Alphabet were reformatted in a strange way that the model overfit to: it would just look at the formatting of the name and conclude, based on that formatting alone, whether the document was AI or human. Great on the holdout set, but terrible in the real world, where the model didn't have that particular clue! That is the importance of having a test set that accurately reflects the kind of text Pangram will see in the real world.

Before we ship a model to customers at Pangram, we undergo a rigorous sign-off procedure that involves both quantitative and qualitative evaluation, in which we stress test the model and scrutinize its performance relative to the current model.

  1. Quantitative evaluation: false positive rate metrics on all holdout sets, VIP sets, and challenge cases must not regress (see the sketch after this list).

  2. Qualitative evaluation: in most cases, some examples will improve and some will regress. Whenever possible, we look by eye at the specific examples that regressed and make sure the failures are explainable. This is often nuanced and specific to the particular hypotheses we are testing, but in general, we want to make sure that the failure cases do not exhibit a pattern that would generalize to real-world failure after deployment.

  3. Vibe check / red teaming: once quantitative and qualitative evaluation are complete, we simply "vibe check" the model by sending it out to the team and asking them to play with it for a while. For some updates, we may also have internal testers or beta customers try the model before we release it widely (usually we encourage them to find cases that break the model!)

  4. Retroactive A/B testing: we run offline inference on our old predictions and look at the differences between the old model and the new model. We do not always have ground truth for data we have previously run inference on, but again, we are looking for consistent patterns that may indicate real-world failure cases.
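As a hedged illustration of the quantitative gate in steps 1 and 4, here is a minimal sketch of what a per-set regression check could look like. The set names and rates are illustrative placeholders, not Pangram's real metrics.

```python
# A minimal sketch of a per-set false positive rate regression gate.
# Set names and rates are illustrative placeholders, not real metrics.

current_fpr = {"academic_essays": 0.00004, "wikipedia_en": 0.00016, "recipes": 0.0023}
candidate_fpr = {"academic_essays": 0.00003, "wikipedia_en": 0.00016, "recipes": 0.0031}

TOLERANCE = 1e-6  # small slack for measurement noise

regressions = [
    name for name in current_fpr
    if candidate_fpr[name] > current_fpr[name] + TOLERANCE
]

if regressions:
    print("Blocking deployment; false positive rate regressed on:", regressions)
else:
    print("Quantitative gate passed; proceed to qualitative review.")
```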

In summary, while we are extremely thorough and scientific about measuring the performance of our model with metrics and statistics, we do not only rely on numbers to tell us the whole story. We also trust our eyes, intuition, and pattern-recognition ability to scrutinize the model and find error patterns that our metrics may have missed. We also rely on our team of testers, red-teamers, and beta customers to find holes that the team may have missed.

What are the techniques we use to achieve such a low false positive rate?

Maintaining a low false positive rate is core to our research mission. Here are some of the techniques we've used so far in order to achieve a best-in-class error rate.

Comprehensive training data coverage

While competitor AI detectors may be "built for academia/schools/the classroom/educators," what that really might mean is that their training set contains only academic writing.

On the other hand, we built Pangram to take advantage of the Bitter Lesson: that general learning algorithms, trained on high volumes of data from a wide variety of sources, are more effective than specific models trained on domain-specific data.

That means we train our AI detector on a wide variety of writing: creative, technical, scientific, encyclopedic, reviews, websites, blog posts... the list goes on. The reasoning is that, like a well-rounded liberal arts education, exposure to many disciplines and styles of writing helps the model understand and generalize better when it encounters new cases. This follows the broader trend in AI training: ChatGPT and other large language models aren't trained on specific data for particular use cases; they are trained on general, large-scale text data so that they develop general intelligence. We believe in the same strategy for training AI detectors, so that they are robust to all the different kinds of text an LLM may produce.

Hard negative mining / Active learning

We've written extensively about our active learning algorithm, which takes advantage of a technique called hard negative mining, and we believe this is the main reason we are able to drive our false positive rate down to near-zero.

In essence, this works because most examples in the wild are "easy examples": once the model learns the basic patterns of what is human and what is AI, it is very easy to tell which is which for the vast majority of the dataset. However, that only gets you to around 99% accuracy. To claw back the last couple of 9's of accuracy, we must find the hardest cases to train on: think of these as cases where a human happens, purely by coincidence, to write in a style very similar to an AI language model. To find these hard negatives, we perform large-scale search over Internet-scale datasets like the ones used to train LLMs, and then perform synthetic mirroring to generate similar-sounding AI examples. More detail can be found on our how it works page.
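As a rough illustration (with hypothetical function names, not our actual pipeline), the core selection step of hard negative mining can be sketched like this:

```python
# A simplified sketch of hard negative mining over a pool of known-human text.
# score_ai_probability is a hypothetical stand-in for the current detector;
# the real pipeline searches Internet-scale corpora and adds synthetic mirroring.

def mine_hard_negatives(human_documents, score_ai_probability, top_k=1000):
    """Return the known-human documents the current model finds most AI-like."""
    scored = [(score_ai_probability(doc), doc) for doc in human_documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most AI-like first
    return [doc for _, doc in scored[:top_k]]

# These hard negatives, paired with "mirrored" AI rewrites of similar prompts,
# are added to the next training run, pushing the decision boundary away from
# unusual-but-human writing styles.
```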

Loss Weighting and Oversampling

We formulate our optimization objective so that the model also prioritizes false positives over false negatives during the training procedure itself. When the model gets a human document wrong, it is "penalized" by a much heavier factor than when it gets an AI document wrong. This forces the model to be conservative and to predict that a document is AI only when it is absolutely sure.
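A minimal sketch of what a class-weighted loss like this can look like, with purely illustrative weights (our actual objective and weights differ):

```python
import math

HUMAN_ERROR_WEIGHT = 20.0  # illustrative: mistakes on human docs cost 20x more
AI_ERROR_WEIGHT = 1.0

def weighted_bce(p_ai: float, label_is_ai: bool) -> float:
    """Binary cross-entropy with a heavier penalty for errors on human text."""
    eps = 1e-12
    if label_is_ai:
        return -AI_ERROR_WEIGHT * math.log(p_ai + eps)
    return -HUMAN_ERROR_WEIGHT * math.log(1.0 - p_ai + eps)

# Confidently calling a human document AI (p_ai = 0.9) is punished far more
# than confidently calling an AI document human (p_ai = 0.1):
print(weighted_bce(0.9, label_is_ai=False))  # ~46.1
print(weighted_bce(0.1, label_is_ai=True))   # ~2.3
```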

Calibration

This relates to the threshold selection described in RAID. We select our threshold based on evaluation of millions of documents in our evaluation sets to trade off false positive and false negative rates appropriately. With our threshold selection, we try to keep the false negative rate reasonable while not compromising on false positives.
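As an illustration of the idea (with toy data rather than our real evaluation sets), calibrating a threshold against a target false positive rate can be sketched like this:

```python
# A sketch of threshold calibration against a target false positive rate.
# human_eval_scores would be detector scores on a large set of known-human
# documents; here it's a small illustrative list.

def calibrate_threshold(human_eval_scores, target_fpr=0.0001):
    """Pick the lowest threshold whose measured FPR stays at or under the target."""
    candidates = sorted(set(human_eval_scores)) + [1.0]
    for threshold in candidates:
        fpr = sum(s >= threshold for s in human_eval_scores) / len(human_eval_scores)
        if fpr <= target_fpr:
            return threshold
    return 1.0

human_eval_scores = [0.001, 0.02, 0.05, 0.3, 0.97]  # toy data; real sets are millions
print(calibrate_threshold(human_eval_scores, target_fpr=0.2))  # prints 0.97
```

Choosing the lowest threshold that still meets the false positive target keeps recall as high as possible, which is the balance described above.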

Takeaways

  • Pangram exhibits a significantly lower false positive rate than competitors.
  • Pangram's extremely low false positive rate is due to a mixture of scale, training, and search.
  • Because the false positive rate is so important in AI detection, we have built an extremely comprehensive testing and QA suite and developed a thorough signoff process that combines careful statistical evaluation with more messy, qualitative human judgement and vibe checks.

We love working with researchers to improve the overall accuracy of our software, and we are passionate about open benchmarking and transparency in AI detection. For inquiries about working or collaborating with us, or further questions on Pangram's accuracy, please reach out to info@pangram.com.
