Pangram is the only AI detector that outperforms human experts at identifying AI content

Bradley Emi
January 29, 2025

We are excited to see new research from Jenna Russell, Marzena Karpinska, and Mohit Iyyer, collaborators from the University of Maryland and Microsoft, that shows that Pangram is the most accurate AI detection system, and the only system that can outperform trained human experts at detecting AI-generated content. Read the full paper here.

Tweet from Jenna Russell

In addition to studying the efficacy of automated AI detectors, the researchers also dive into the signals that trained human experts use to spot the telltale signs of AI-generated content. We believe that this research is a huge step forward for explainability and interpretability in AI detection and are excited to explore this research direction further.

In this blog post, we will explain the highlights of the research and what it means for LLM detection moving forward.

Training Humans to Become AI Detectors

We've written in the past about how to detect AI writing and the human baseline test, and how we use it to gain valuable intuition about AI-generated text that helps us develop better models.

When we first try to train ourselves to spot AI-generated reviews, essays, blog posts, or news, we are not very good at it. It takes a while to pick up on the telltale signs that a piece of text was generated by ChatGPT or another language model. For example, by reading a lot of data over time, we learned that ChatGPT loves to start a review with the phrase "I recently had the pleasure of," and that AI-generated sci-fi stories frequently open with "In the year of." Eventually, we internalize these patterns and can recognize them on sight.
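To make this concrete, here is a toy illustration, not Pangram's method, of how such openers surface once you simply count them across a corpus. The `reviews` list is a hypothetical stand-in for a collection of documents:

```python
from collections import Counter

def opening_phrase(doc: str, n_words: int = 6) -> str:
    """Normalize a document's first few words into a comparable key."""
    return " ".join(doc.split()[:n_words]).lower()

def most_common_openers(docs: list[str], top: int = 5):
    """Count how often each opening phrase appears across the corpus."""
    return Counter(opening_phrase(d) for d in docs).most_common(top)

# Hypothetical corpus: on a large set of ChatGPT-written reviews,
# "i recently had the pleasure of" would dominate the output.
reviews = [
    "I recently had the pleasure of dining at Luigi's.",
    "I recently had the pleasure of visiting this new cafe.",
    "The service was slow but the food made up for it.",
]
print(most_common_openers(reviews))
```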

The researchers wondered whether experts could be trained to detect AI-generated articles in the same way. They trained five annotators on Upwork to detect AI-generated content and compared their ability to detect AI by eye with that of non-experts.

While we should expect some difference in the ability of these two groups to spot AI-written text, what the researchers found was a substantial gap. Non-experts perform close to random chance at detecting AI-generated text, while experts are highly accurate (over 90% true positive rate, on average).

The section we found most interesting was "What do expert annotators see that nonexperts don't?". The researchers asked the participants to explain why they thought a piece of writing was AI-generated or not, and then analyzed the participants' comments.

Here is some analysis taken directly from the paper:

"Nonexperts often mistakenly fixate on certain linguistic properties compared to experts. One example is vocabulary choice, where nonexperts take the inclusion of any "fancy" or otherwise low-frequency word types as signs of AI-generated text; in contrast, experts are much more familiar with exact words and phrases overused by AI (e.g. testament, crucial). Nonexperts also believe that human authors are more likely to form grammatically correct sentences and thus attribute run-on sentences to AI, but the opposite is true: humans are more likely than AI to use ungrammatical or run-on sentences. Finally, nonexperts attribute any text written in a neutral tone to AI, which results in many false positives because formal human writing is also often neutral in tone." (Russell, Karpinska, & Iyyer, 2025).

In the Appendix, the authors provide a list of "AI vocabulary" commonly used by ChatGPT. We recently released a feature in the Pangram dashboard that highlights exactly these overused AI phrases!
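As a toy illustration of how such highlighting can work (not our production implementation), the snippet below bolds a few phrases drawn from the examples in this post; the real dashboard feature relies on a much larger curated list:

```python
import re

# Small hypothetical sample of overused AI phrases, taken from the
# examples quoted in this post; the paper's appendix and our dashboard
# feature both use far longer lists.
AI_PHRASES = [
    "testament",
    "crucial",
    "I recently had the pleasure of",
]

def highlight_ai_phrases(text: str) -> str:
    """Wrap suspected AI phrases in **bold** markers for display."""
    for phrase in AI_PHRASES:
        pattern = re.compile(re.escape(phrase), re.IGNORECASE)
        text = pattern.sub(lambda m: f"**{m.group(0)}**", text)
    return text

print(highlight_ai_phrases(
    "The project stands as a testament to teamwork, which is crucial."
))
# The project stands as a **testament** to teamwork, which is **crucial**.
```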

In our experience, despite the popular belief that AI uses sophisticated, "fancy" vocabulary, AI in practice tends to use clichéd, metaphorical vocabulary that often doesn't make much sense. Informally, we would say that LLMs are like people who are trying to sound smart, but are really just reaching for phrases they think will make them sound smart.

Robustness of AI detectors to state-of-the-art models

One question we get a lot at Pangram is: how do you keep up with the state-of-the-art models? When the language models get better, does that mean Pangram will stop working? Is it a cat-and-mouse game that frontier labs like OpenAI will eventually win?

The researchers wondered this as well, and studied the performance of several AI detection methods against OpenAI's o1-pro, the most advanced model released at the time of the study.

The researchers found that Pangram is 100% accurate in detecting o1-pro outputs, and we are still 96.7% accurate in detecting "humanized" o1-pro outputs (which we will get to in a little bit)! By comparison, no other automated detector even clears 76.7% on base o1-pro outputs.

How is Pangram able to generalize like this? After all, at the time of the study, we did not even have any o1-pro data in our training set.

As with all deep learning models, we believe the answer is scale and compute. First, we start with a powerful base model that is pretrained on a huge training corpus, just like the LLMs themselves. Second, we have built a data pipeline designed for scale: Pangram performs subtle pattern recognition over a training corpus of 100 million human documents.
We do not build one dataset for essays, another for news, another for reviews: we cast the widest possible net over all human-written data we can find, so the model learns from the highest-quality and most diverse distribution of human writing. We find that this general approach to AI detection works much better than the specialized approach of building one model per text domain.

Complementary to our extremely large, high-quality human dataset are our synthetic data pipeline and active-learning-based search algorithm. To source the AI data for our algorithm, we use an exhaustive library of prompts and all of the major open- and closed-source AI models to generate synthetic data. We use synthetic mirror prompts, which we've written about in our technical report, together with hard negative mining: we find the examples in our data pool with the highest error, create AI examples that look very similar to those human ones, and retrain the model until we no longer see errors (a simplified sketch of this loop is below). Doing so allows us to drive the false positive and false negative rates of our model down to zero very efficiently.
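Here is that simplified sketch. This is not Pangram's actual code; `generate_mirror` and `retrain` are hypothetical stand-ins for our mirror-prompt generation and training infrastructure:

```python
def mine_hard_negatives(detector, human_pool, generate_mirror, retrain,
                        num_rounds=5, top_k=1000):
    """Hard negative mining loop (simplified sketch, not Pangram's code).

    detector        -- model with predict_proba(doc) -> P(doc is AI)
    human_pool      -- list of human-written documents
    generate_mirror -- hypothetical fn: human doc -> similar-looking AI doc
                       (e.g. via a synthetic mirror prompt)
    retrain         -- hypothetical fn: (detector, labeled data) -> detector
    """
    train_set = []
    for _ in range(num_rounds):
        # Highest-error human examples: the human docs the detector is
        # most inclined to call AI (potential false positives).
        hardest = sorted(human_pool,
                         key=lambda d: detector.predict_proba(d),
                         reverse=True)[:top_k]
        if detector.predict_proba(hardest[0]) < 0.5:
            break  # no human examples are misclassified anymore

        # Pair each hard human example with an AI "mirror" that looks
        # very similar to it, then retrain on the combined set.
        mirrors = [generate_mirror(doc) for doc in hardest]
        train_set += [(doc, 0) for doc in hardest]   # label 0 = human
        train_set += [(doc, 1) for doc in mirrors]   # label 1 = AI
        detector = retrain(detector, train_set)
    return detector
```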

Put succinctly, our generalization comes from the scale of our pretraining data, the diversity of prompts and LLMs used for synthetic data generation, and the data efficiency from our active learning and hard negative mining approach.

Furthermore, we not only strive for great out-of-distribution performance, but we also want to make sure that as many of the common LLMs are in-distribution as possible. Therefore, we have built a robust automated pipeline to pull data from the latest models, so that we can begin training on new LLMs as soon as they are released and stay up to date. There is no tradeoff between performance on different models: every time we introduce a new LLM into the training set, the model's generalization improves.
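As a rough illustration, a refresh step for a newly released model might look like the sketch below. It uses the official OpenAI Python client, but the model name and the two-prompt library are placeholders; our actual pipeline covers all major open- and closed-source models with an exhaustive prompt library:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompt library; the real one is far larger and spans
# many domains (reviews, essays, news, fiction, ...).
PROMPT_LIBRARY = [
    "Write a restaurant review for a new Italian place downtown.",
    "Write a short news article about a local election result.",
]

def collect_samples(model: str) -> list[str]:
    """Generate fresh synthetic AI text from a given model."""
    samples = []
    for prompt in PROMPT_LIBRARY:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        samples.append(response.choices[0].message.content)
    return samples

# When a new model is released, point the pipeline at it and
# regenerate training data ("gpt-4o" here is just an example).
new_samples = collect_samples("gpt-4o")
```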

With our current system, we are not finding that models get harder to detect as they improve. In many cases, the next-generation model is actually easier to detect: for example, we were more accurate at detecting Claude 3 at its release than we were at detecting Claude 2.

Paraphraser and Humanizer Attacks

In our recent blog post series, we described what an AI humanizer is and also shipped a model with greatly improved performance on humanized AI text. We are pleased that a third party has already validated our claims with a dataset of humanized o1-pro articles.

On humanized o1-pro text, we achieve an accuracy of 96.7%, while the next best automated model is only able to detect 46.7% of humanized text.

We are also 100% accurate on GPT-4o text that has been paraphrased sentence-by-sentence.

Conclusion

We are excited to see Pangram's strong performance in an independent study of AI detection capabilities. We are always happy to support academic research, and we provide open access to any academics who wish to study our detector.

In addition to benchmarking the performance of automated detectors, we are excited to see research that also begins to tackle the explainability and interpretability of AI detection: not just whether something is AI-written, but why. We look forward to writing more about how these results can help teachers and educators spot AI-generated text by eye, and how we plan to incorporate this research into more explainable automated detection tools.

For more information, please visit our website pangram.com or contact us at info@pangram.com.
