
Pangram Text Update: GPT-4o, Claude 3, LLaMA 3

Bradley Emi · May 22, 2024

Photo by Google DeepMind.

Today we’re excited to show off our capability to rapidly adapt to new LLMs in the marketplace by releasing an update to our model that achieves near-perfect accuracy in detecting AI-written text from GPT-4o, Claude 3, and LLaMA 3.

TL;DR:

  • We released a new version of Pangram Text that improves performance on GPT-4o, Claude 3, and LLaMA 3.
  • Our infrastructure pipeline is set up to quickly ingest large amounts of AI text from new models as soon as they become publicly available.
  • We find that as the performance of all these new models converges to GPT-4 level performance, they are all starting to sound the same stylistically too.

Results

Our most recently released model was already pretty good at detecting the output of the new models, even without seeing any examples from them in the training set. However, we are not satisfied with simply “pretty good”; we want to ensure that we are continually pushing the frontier of what is possible with AI detection and achieving the best possible accuracy for our customers.

To test how well we perform on the next generation of language models, we revamped our evaluation set of 25,000 examples of difficult-to-classify human text and AI-generated text from a panel of language models. About 40% of this new evaluation set consists of a wide variety of AI-generated text from GPT-4o, Claude 3, and LLaMA 3, spanning several domains including news, reviews, education, and more.

We use all versions of the new models when available: for example, we sample evenly from Claude 3’s Opus, Sonnet, and Haiku versions.
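
As a rough illustration of this kind of even sampling, here is a minimal Python sketch. The variant names are the real Claude 3 tiers, but the `generate_fn` callable and the overall structure are assumptions made for illustration, not our actual pipeline code.

```python
import random

# Claude 3 ships in three tiers; we want each to contribute an equal
# share of AI-generated examples to the evaluation set.
CLAUDE_3_VARIANTS = ["claude-3-opus", "claude-3-sonnet", "claude-3-haiku"]

def build_eval_slice(prompts, generate_fn, variants=CLAUDE_3_VARIANTS):
    """Assign prompts round-robin across model variants so each variant
    contributes an equal share of generated evaluation examples.

    `generate_fn(variant, prompt)` stands in for whatever client calls
    the provider API; it is an assumption made for this sketch.
    """
    examples = []
    for i, prompt in enumerate(prompts):
        variant = variants[i % len(variants)]
        examples.append({
            "model": variant,
            "prompt": prompt,
            "text": generate_fn(variant, prompt),
        })
    random.shuffle(examples)  # avoid ordering artifacts in the eval set
    return examples
```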

After updating our training dataset to incorporate the latest LLMs, we find we are once again achieving near-perfect accuracy on text generated by the newest generation of language models.

LLM         Pangram Text March Accuracy    Pangram Text May Accuracy    Improvement
All         99.54%                         99.84%                       +0.30%
GPT-4o      99.78%                         100%                         +0.22%
Claude 3    99.12%                         99.76%                       +0.64%
LLaMA 3     99.58%                         99.97%                       +0.39%

In addition to improving performance on the new models, we find that including training data from the latest generation of models marginally improves performance on several older models as well.

While introducing no regressions on our old-model evaluation set, the updated model actually improves on several GPT-3.5 and (regular) GPT-4 detection cases. Specifically, 8 GPT-3.5 cases and 13 GPT-4 cases previously failed by the model are now passing. We conclude that our model’s increased ability to detect GPT-4o, Claude 3, and LLaMA 3 does not come at any cost to its ability to detect older models.
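
The regression check described above boils down to diffing per-case results between two evaluation runs. The sketch below is a hypothetical illustration, assuming each run is stored as a mapping from case ID to a pass/fail flag; it is not our internal tooling.

```python
def compare_eval_runs(old_results, new_results):
    """Compare two evaluation runs, each a dict mapping case ID -> bool
    (True = correctly classified). Returns the cases the new model fixed
    and the cases it regressed on."""
    shared = old_results.keys() & new_results.keys()
    fixed = sorted(c for c in shared if not old_results[c] and new_results[c])
    regressed = sorted(c for c in shared if old_results[c] and not new_results[c])
    return fixed, regressed

# A release gate might then require zero regressions before shipping:
# fixed, regressed = compare_eval_runs(march_run, may_run)
# assert not regressed, f"New model regressed on cases: {regressed}"
```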

Staying ahead of the curve

We were aware from the start that the frontier of LLMs would be rapidly changing, so we designed our system architecture with that in mind. Our systems are built to be able to regenerate data and begin training a new model within hours of a new API becoming publicly available.

When a new model is released, generating a new dataset and retraining the model is as simple as a config change. We have a standard library of prompt templates designed to be fed into LLMs to produce human-like text that is close to, but not exactly the same as, the human side of our dataset. We detail this process, called Hard Negative Mining with Synthetic Mirrors, in our technical report.
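
To give a flavor of what a change like that could look like, here is a hypothetical sketch in Python. The configuration fields, model identifier, and helper callables are illustrative assumptions, not our actual schema; the details of the real pipeline are in the technical report.

```python
# Hypothetical config entry: in a setup like this, adding a newly
# released model to the generation roster is the only edit needed
# before the dataset pipeline and a fresh training run kick off.
NEW_MODEL_CONFIG = {
    "name": "gpt-4o",
    "provider": "openai",
    "api_model_id": "gpt-4o-2024-05-13",       # assumed identifier
    "prompt_templates": "standard_library/*",  # reuse existing templates
}

def regenerate_dataset(config_entries, prompt_templates, generate_fn):
    """Run every prompt template against every configured model and
    collect the outputs as synthetic 'mirrors' of the human-written data.

    `generate_fn(model_id, prompt)` stands in for an API client and is
    an assumption made for this sketch.
    """
    dataset = []
    for entry in config_entries:
        for prompt in prompt_templates:
            dataset.append({
                "model": entry["name"],
                "text": generate_fn(entry["api_model_id"], prompt),
                "label": "ai",
            })
    return dataset
```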

The timeline for the release of this new model was as follows:

  • May 13: GPT-4o was released and made available in the OpenAI API.
  • May 14: Dataset pipeline was updated and new training and evaluation sets were created.
  • May 15-16: AI detection model was trained using the new datasets.
  • May 17: QA and sanity checks were performed and the model was released.

The infrastructure we have built enables us to adapt quickly, incorporating text from new models into the production detection system in under a week.

Diminishing Returns?

As new models get better and better, they must become harder to detect, right? We have yet to find evidence for this tempting but ultimately misguided argument.

Observationally, we are finding that the more capable models, due to their more idiosyncratic styles, are actually easier to detect than the less capable models. For example, we found that our old model was better at detecting Claude Opus than Sonnet and Haiku.

On the LMSYS leaderboard, we are seeing many foundation models asymptotically converge to the level of GPT-4, but no model has yet convincingly beaten it by a substantial margin. Taking a bird’s-eye view of the situation: if several foundation model companies take the same attention-based architecture and train it on the entire Internet, it is unsurprising that the language coming out of all of these models ends up sounding incredibly similar. Anyone who interacts with language models on a regular basis will immediately understand what we mean by that.

At an observational level, we still find that LLMs, when asked to write creatively and authentically, whether in an opinion essay, a review, or a short story, produce unimaginative and bland drivel. We believe this is fundamentally a property of the optimization objective: predicting high-probability completions while staying away from off-distribution original thoughts and ideas.

We value original writing from our fellow humans because it may offer us a fresh perspective or a different way of thinking, not because it is the average thing that a person might say. As long as this value holds true, there will always be a need for AI detection, and there will always be a pathway to solving it.
