Today, Llama 4 was released, the latest in a series of open-source models from Meta AI. We wanted to know whether Pangram is still able to detect the latest and greatest open models, so we ran a quick test to see whether our model generalizes to Llama 4, despite currently being trained only on outputs from Llama 2 and 3.
We are commonly asked how well we keep up with the pace of new models, which is why we test them quickly on day 1, before we have a chance to retrain.
For the spot check, we used the same 11 prompts we used to test GPT 4.5. These prompts cover a variety of everyday writing tasks, but are not directly related to the prompts we trained on. They also demand a level of creativity at which, we believe, a model representing a substantial step forward from previous generations of LLMs would exhibit qualitatively different behavior.
Here are the prompts we used, along with Pangram's prediction for each Llama 4 output:
| Prompt | Pangram AI likelihood |
|---|---|
| Koala Conservation | 99.9% |
| Newspaper Email | 99.9% |
| Room Temperature Semiconductor | 99.9% |
| School Uniforms | 99.9% |
| Poetry Diary | 99.9% |
| Escape Room Review | 99.9% |
| Russian Film Email | 99.9% |
| Mars Landing Scene | 99.9% |
| Komodo Dragon Script | 99.9% |
| Halloween Breakup Poem | 99.9% |
| Venice Chase Scene | 99.9% |
In this case, Pangram passes the test with a perfect score! Not only is it able to predict all 11 writing samples as AI-generated, but it is able to do so with 100% confidence. (Although the model predicts 100%, we always round down to 99.9% in the UI to signal that we can never actually be 100% sure.)
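The display-side rounding mentioned above amounts to capping the shown likelihood just below certainty. A minimal sketch (`display_likelihood` is a hypothetical helper, not the actual UI code):

```python
def display_likelihood(p: float) -> str:
    """Cap the displayed AI likelihood at 99.9% so the UI never claims certainty."""
    return f"{min(p, 0.999):.1%}"

print(display_likelihood(1.0))  # 99.9%
print(display_likelihood(0.5))  # 50.0%
```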
You can see the full outputs here.
We created a larger test set of about 7,000 examples using our standard evaluation prompt schemes, leveraging the Together API for inference. The set covers a wide variety of domains, including academic writing, creative writing, Q&A, scientific writing, and more.
Here are our results on the larger test set.
| Model | Accuracy |
|---|---|
| Llama 4 Scout | 100% (3678/3678) |
| Llama 4 Maverick | 99.86% (3656/3661) |
| Llama 4 Overall | 99.93% (7334/7339) |
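The overall figure is just the pooled counts from the two models. A quick sanity check of the arithmetic:

```python
# Per-model results from the table above.
scout_correct, scout_total = 3678, 3678
maverick_correct, maverick_total = 3656, 3661

# Pool the counts, then compute overall accuracy.
overall_correct = scout_correct + maverick_correct  # 7334
overall_total = scout_total + maverick_total        # 7339
accuracy = overall_correct / overall_total

print(f"{accuracy:.2%}")  # 99.93%
```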
Why does Pangram generalize to new models so well? We believe it is the strength of our underlying datasets and active learning approach, along with our broad prompting and sampling strategies, that has exposed Pangram to so many types of AI-generated writing that it adapts to new models quite well.
For more information on our research or free credits to trial our model on Llama 4, please contact us at info@pangram.com.