When you search online for how AI detectors work, you'll typically see many sources citing the terms "perplexity" and "burstiness". What do these terms mean, and why do they ultimately not work for detecting AI-generated content? Today I want to unpack what perplexity and burstiness are, explain why they are not suitable for detecting AI-generated writing, and show why perplexity- and burstiness-based detectors falsely flag the Declaration of Independence as AI-generated and are biased against nonnative English speakers. Let's go!
We'll start with an imprecise, nontechnical definition of perplexity, just to get a general sense of what it is and what it's doing. For more background on perplexity, I found this two-minute explainer article to be very useful.
Perplexity is how unexpected, or surprising, each word in a piece of text is, when looked at from the perspective of a particular language model, or LLM.
For example, here are two sentences. Let's focus on the last word of each sentence, for demonstration purposes. In the first example, the last word has low perplexity, while in the second example, the last word has high perplexity.
Low perplexity:
For lunch today, I ate a bowl of *soup*.
High perplexity:
For lunch today, I ate a bowl of *spiders*.
The second sentence has high perplexity because a language model would very rarely see examples of people eating bowls of spiders in its training dataset, so it is very surprising to the model that the sentence ends with "spiders" rather than something like "soup", "a sandwich", or "a salad".
Perplexity comes from the same root as the word "perplexed", which means "confused" or "puzzled". It is helpful to think of perplexity as the confusion of the language model: when it sees something that is unfamiliar or unexpected, in comparison to what it has read and ingested in its training procedure, then we can think of the language model as getting confused or befuddled by the completion.
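To make this concrete, here is a minimal sketch of how one might compute per-token surprisal (the negative log-probability of each word given everything before it) using the open-source GPT-2 model from Huggingface, the same model used for the visualizations later in this post. The scoring model and the exact bookkeeping here are illustrative choices, not any particular detector's method.

```python
# Minimal per-token surprisal sketch. Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(text: str) -> list[tuple[str, float]]:
    """Surprisal (negative log-probability) of each token given its prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position t predict the token at position t+1,
    # so the first token gets no score.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
    tokens = [tokenizer.decode(int(i)) for i in ids[0, 1:]]
    return list(zip(tokens, nll.tolist()))

for tok, s in token_surprisals("For lunch today, I ate a bowl of spiders."):
    print(f"{tok!r}: {s:.1f}")  # "spiders" scores far higher than "soup" would
```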
Okay, great, so what about burstiness? Burstiness is the change in perplexity over the course of a document. If some surprising words and phrases are interspersed throughout the document, we would say that it is high in burstiness.
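There is no single agreed-upon formula for burstiness; one plausible formalization, used here purely for illustration and building on the `token_surprisals` sketch above, is the standard deviation of per-token surprisal across the document.

```python
import statistics

def burstiness(surprisals: list[float]) -> float:
    # Illustrative proxy: how much surprisal fluctuates across a text.
    # Uniformly predictable text scores low; text with occasional
    # surprising words scores high.
    return statistics.stdev(surprisals)

scores = [s for _, s in token_surprisals("For lunch today, I ate a bowl of spiders.")]
print(f"burstiness: {burstiness(scores):.2f}")
```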
Unfortunately, most commercial detectors (aside from Pangram) are not transparent about their methodology, but from what can be understood from their descriptions, human text is considered to be higher in perplexity and higher in burstiness than AI-generated text, while AI-generated text is lower in both.
We can see a visualization of this below! I downloaded the GPT-2 model off of Huggingface, and calculated the perplexity of all the text in two documents: one set of human restaurant reviews, and one set of AI-generated reviews. I then highlighted the low perplexity text in blue, and the high perplexity text in red.
Perplexity visualization comparing AI and human text
As you can see, the AI-generated text is a deep blue all around, suggesting uniform low perplexity values. And the human-generated text is mostly blue, but has spikes of red in it. That's what we would say is high burstiness.
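For the curious, the highlighting itself can be approximated in a few lines that map each token's surprisal to a color. The threshold below is an arbitrary assumption for illustration, not a value any detector publishes.

```python
import html

def highlight(token_scores: list[tuple[str, float]], threshold: float = 6.0) -> str:
    # Tokens above the (arbitrary) surprisal threshold are shaded red,
    # the rest blue.
    spans = []
    for tok, s in token_scores:
        color = "#d93025" if s > threshold else "#1a73e8"
        spans.append(f'<span style="color:{color}">{html.escape(tok)}</span>')
    return "".join(spans)
```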
It's this idea that inspires perplexity and burstiness detectors. Not only are some of the earliest commercial AI detectors based on this idea, but it has also inspired some academic literature such as DetectGPT and Binoculars.
To be completely fair, these perplexity and burstiness detectors do work some of the time! We just do not believe they can work reliably in high-stakes settings where inaccuracies must be avoided, such as the classroom, where a false positive AI detection can undermine trust between teacher and student, or even worse, create inaccurate evidence in a legal case.
For those unfamiliar with how LLMs are created, before LLMs are available to be deployed and used as chatbots, they must first undergo a procedure called training. During training, the language model sees billions of texts and learns the underlying linguistic patterns of what is called its "training set".
The precise mechanical details of the training procedure are out of scope for this blog post, but one critical detail is that in the optimization process, the LLM is directly incentivized to minimize perplexity on its training set documents! In other words, the model learns over time that the pieces of text it sees repeatedly during training should have as little perplexity as possible.
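Here is a toy illustration of this incentive (not how production LLMs are actually trained, but the same objective in miniature): taking repeated gradient steps on a single passage drives that passage's loss, and therefore its perplexity, steadily down.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Even a strange sentence becomes "unsurprising" once it is trained on enough.
ids = tokenizer("For lunch today, I ate a bowl of spiders.",
                return_tensors="pt").input_ids

for step in range(20):
    loss = model(ids, labels=ids).loss   # cross-entropy, i.e. log-perplexity
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(step, torch.exp(loss).item())  # perplexity falls steadily toward 1
```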
Why is that a problem?
Because the model is trained to make its training set documents low perplexity, perplexity and burstiness detectors classify common training set documents as AI-generated, even when those documents are actually human written!
That is why perplexity-based AI detectors classify the Declaration of Independence as AI-generated: because the Declaration of Independence is a famous historical document that has been reproduced in countless textbooks and Internet articles, it shows up in AI training sets... a lot. And because the text is exactly the same every time it is seen during training, the model memorizes it, assigning every token a very low perplexity, which in turn makes the burstiness very low too.
I ran the same visualization above on the Declaration of Independence, and we see the same AI signature: a deep, consistent blue throughout, indicating that every word has low perplexity. From the perspective of a perplexity- and burstiness-based detector, the Declaration of Independence is completely indistinguishable from AI-generated content.
Interestingly, we notice that the first sentence of the Declaration of Independence is an even deeper blue, with even lower perplexity, than the rest. This happens because the first sentence is by far the most widely reproduced part of the passage, and shows up most frequently in the GPT-2 training set.
Perplexity visualization of the Declaration of Independence
Similarly, we find that other common sources of LLM training data also see elevated false positive rates with perplexity and burstiness detectors. Wikipedia is a very common training dataset due to its high quality and permissive license, and it is therefore very commonly mispredicted as AI-generated, because language models are directly optimized to reduce perplexity on Wikipedia articles.
This is a worsening problem as AI continues to develop and become more advanced, because the newest language models are extremely data hungry: OpenAI's, Google's, and Anthropic's crawlers are all furiously scraping the Internet as you read this article, continuing to ingest data for language model training. Should publishers and website owners have to worry that allowing these scrapers to crawl their sites for LLM training means their content might be misclassified as AI-generated in the future? Should companies considering licensing their data to OpenAI have to weigh the risk of that data coming back mispredicted as AI once the LLMs ingest it? We find this a completely unacceptable failure case, and one that will only grow over time.
Another problem with using perplexity and burstiness as detection metrics is that they are relative to a particular language model. What is expected for GPT, for example, may not be expected for Claude. And when new models come out, their perplexity profiles change as well.
So called "black box" perplexity based detectors need to choose a language model to measure the actual perplexity. But when that language model's perplexity differs from the generator's perplexity, you get wildly inaccurate results, and this problem only compounds with new model releases.
Closed-source providers do not always serve the probabilities of each token, so you cannot even calculate the true perplexity for closed-source commercial models such as ChatGPT, Gemini, and Claude. At best, you can use an open-source model as a stand-in, but that runs into the same model-mismatch problem described above.
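A quick sketch of the mismatch: the same sentence receives noticeably different perplexities under two different scoring models. I compare gpt2 and distilgpt2 here simply because both are openly available; the specific numbers are beside the point.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over the text
    return torch.exp(loss).item()

text = "For lunch today, I ate a bowl of soup."
print(perplexity("gpt2", text))        # one number...
print(perplexity("distilgpt2", text))  # ...and a different one, same text
```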
A narrative has emerged that AI detection is biased against nonnative English speakers, supported by a 2023 Stanford study on 91 TOEFL essays. While Pangram extensively benchmarks nonnative English text and incorporates it into our training set so that the model can recognize and detect it, perplexity-based detectors do indeed have an elevated false positive rate on nonnative English text.
The reason is that text written by English language learners is, in general, lower in perplexity and burstiness. We believe this is not an accident: during the language learning process, the student's vocabulary is significantly more limited, and the student is not yet able to form complex sentence structures that would be out of the ordinary, or highly surprising, to a language model. We argue that learning to write in a high-perplexity, bursty way that is still linguistically correct is an advanced language skill that comes from experience with the language.
Nonnative English speakers, and we believe by extension neurodiverse students or students with disabilities, are more vulnerable to being caught by perplexity-based AI detectors.
The biggest shortcoming of perplexity-based detectors, in our view, and the reason we at Pangram chose a deep learning based approach instead, is that they cannot self-improve as data and compute scale.
What does this mean? As Pangram gets more experience with human text through our active learning algorithm, it gradually gets better. That is how we have brought our false positive rate from 2%, to 1%, to 0.1%, and now down to 0.01%. Perplexity-based detectors are not able to improve by seeing more data.
DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature is a paper that looks at the local perplexity landscape to distinguish human and AI writing rather than absolute perplexity values.
Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text uses a novel metric called "cross-perplexity" to improve upon basic perplexity detection.
Pangram's technical whitepaper goes deeper into our alternative solution for detecting AI-generated text based on deep active learning.
There's a big difference between computing a statistic that correlates with AI-generated writing and building a production-grade system that can reliably detect AI-generated writing. While perplexity-based detectors capture an important facet of what makes human writing human and what makes AI writing AI, for the reasons described in this article, you cannot use one to reliably detect AI-generated writing while maintaining a false positive rate low enough for production applications.
In environments like education, where avoiding false positives is critical, we hope to see more research move toward deep learning based methods and away from perplexity, burstiness, and other metric-based methods.
We hope this gives some insight into why Pangram has chosen not to use perplexity and burstiness to detect AI-generated text, and instead focus on reliable methods that scale.