
How accurate is Pangram AI Detection on ESL?

Bradley Emi
April 23, 2025

A common critique of AI detectors is that they are biased against nonnative English speakers. Text written by nonnative English speakers is commonly referred to as ESL (English as a Second Language) writing, or, more precisely, as writing by ELLs (English Language Learners). In previous writing, we've explained why other AI detectors based on perplexity and burstiness are susceptible to this flaw.

Nonnative English speakers often lack the depth of vocabulary and the command of complex English sentence constructions needed to write in a manner that exhibits high burstiness. As a result, previous attempts at AI detection have fallen short, often mischaracterizing ESL writing as AI-generated and thereby exhibiting a high false positive rate on ESL.
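To make the failure mode concrete, here is a rough sketch of the kind of perplexity-and-burstiness signal such detectors rely on. This is not any vendor's actual method: GPT-2 stands in for the scoring model, splitting on periods stands in for sentence segmentation, and the "burstiness" measure (spread of per-sentence perplexity) is one common interpretation of the term.

```python
# Rough illustration (not any vendor's actual method): score a text by its
# mean per-sentence perplexity ("predictability") and the spread of that
# perplexity across sentences ("burstiness"), using GPT-2 as the scorer.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_perplexity(sentence: str) -> float:
    """Perplexity of one sentence under GPT-2."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return math.exp(loss.item())

def perplexity_and_burstiness(text: str) -> tuple[float, float]:
    # Naive sentence split; real systems segment more carefully.
    sentences = [s.strip() for s in text.split(".") if len(s.split()) > 3]
    ppls = [sentence_perplexity(s) for s in sentences]
    mean_ppl = sum(ppls) / len(ppls)
    # "Burstiness" here = variance of per-sentence perplexity.
    variance = sum((p - mean_ppl) ** 2 for p in ppls) / len(ppls)
    return mean_ppl, variance
```

A detector thresholding on these two features alone tends to score simple, uniform sentences, which are common in ESL writing, the same way it scores AI text: low mean perplexity and low burstiness.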

Previous studies on AI detection and ESL

A notable Stanford study published in July 2023 by Weixin Liang, James Zou, and others claimed that GPT detectors are biased against nonnative English writers. The study used a small sample size (only 91 essays from the TOEFL exam) and had some methodological flaws (the authors labeled GPT-4-modified human text as "human" when testing the detectors). Even so, the seven AI detectors tested (Pangram was not among them) showed strong bias against ESL writing, with over 60% of the human ESL writing samples flagged as AI.

A more recent study, published in August 2024 by ETS (the testing organization that administers the GRE, a standardized test for graduate school admissions), examined roughly 2,000 writing samples from nonnative English speakers on the GRE using simple machine learning detectors that ETS trained on handcrafted features, including perplexity. They did not find any bias against nonnative English in their own detectors, although the experimental setting was highly simplified and contrived, differs in important ways from real-world use, and did not cover the commercial detectors actually used in practice. Nonetheless, the study highlights an interesting point: when data from nonnative English speakers is sufficiently represented in the training set, the resulting bias is largely mitigated.

Pangram's performance on ESL

To measure Pangram's false positive rate on ESL data, we run Pangram's AI detector on four public ESL datasets (we hold these datasets out during training, so there is no train-test leakage).

The four datasets we study are ELLIPSE, ICNALE, PELIC, and the TOEFL essays from the Liang et al. study.

The results are below.

Dataset      | False Positive Rate | Sample Size
ELLIPSE      | 0.00%               | 3,907
ICNALE       | 0.018%              | 5,600
PELIC        | 0.045%              | 15,423
Liang TOEFL  | 0.00%               | 91
Overall      | 0.032%              | 25,021

Pangram's overall false positive rate is 0.032%, which is not significantly higher than our general false positive rate of 0.01%.
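As a rough illustration of how a false positive rate like the ones above is computed, the sketch below runs a detector over a held-out corpus of human-written ESL documents and counts how many are flagged as AI. The `classify_text` function and the 0.5 threshold are hypothetical placeholders standing in for any detector API, not Pangram's actual interface.

```python
# Minimal sketch: estimate a detector's false positive rate (FPR) on a corpus
# of human-written ESL documents. `classify_text` is a hypothetical stand-in
# for any detector that returns an AI likelihood in [0, 1].
from typing import Callable, Iterable

def false_positive_rate(
    human_texts: Iterable[str],
    classify_text: Callable[[str], float],
    threshold: float = 0.5,  # illustrative decision threshold
) -> float:
    """Fraction of human-written texts flagged as AI-generated."""
    flagged = 0
    total = 0
    for text in human_texts:
        total += 1
        if classify_text(text) >= threshold:
            flagged += 1
    return flagged / total if total else 0.0

# Usage (hypothetical): load each held-out ESL dataset and report its FPR.
# for name, texts in {"ELLIPSE": ellipse, "ICNALE": icnale}.items():
#     print(name, f"{false_positive_rate(texts, classify_text):.3%}")
```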

Pangram vs. TurnItIn

We directly compare Pangram to TurnItIn using the same datasets that TurnItIn used in a public evaluation of their AI Writing Indicator.

We evaluate both "L1" (non-ESL) and "L2" (ESL) English on the same datasets as TurnItIn. Because TurnItIn does not evaluate documents shorter than 300 words, we apply the same filter and evaluate only documents of at least 300 words.

Dataset                 | Pangram FPR | TurnItIn FPR
L2 English, 300+ words  | 0.02%       | 1.4%
L1 English, 300+ words  | 0.00%       | 1.3%

We find that Pangram's false positive rate on ESL text is nearly two orders of magnitude lower than TurnItIn's (0.02% vs. 1.4%), and Pangram produces no false positives on the native English text in this study.
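For completeness, the 300-word cutoff is simple to reproduce. The sketch below assumes whitespace word counting, which may differ slightly from TurnItIn's exact counting rule.

```python
def at_least_n_words(text: str, n: int = 300) -> bool:
    """True if the document has at least n whitespace-separated words."""
    return len(text.split()) >= n

# documents: list[str] of human-written essays (loaded elsewhere)
# long_docs = [doc for doc in documents if at_least_n_words(doc)]
```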

Pangram vs. GPTZero

GPTZero self-reports a 1.1% false positive rate on the original Liang TOEFL study, although 6.6% of the Liang TOEFL dataset is also misclassified as "Possible AI content."

By comparison, Pangram does not report a single false positive on the Liang TOEFL dataset, and the model assigns high confidence to every example.

How does Pangram mitigate false positives on ESL writing?

At Pangram, we take our performance on nonnative English extremely seriously, which is why we use several strategies to mitigate false positives in our AI writing detection model.

Data

Machine learning models do not perform well outside of their training distribution, so we take care to ensure that nonnative English text is included in our dataset.

We do not stop there, however. While other AI writing detectors focus narrowly on student writing and academic essays, we train our model on a broad spectrum of writing. Detectors trained only on essays often suffer from an underrepresentation of casual, conversational English in the training set. By contrast, we use text from social media, reviews, and the general Internet, which is often informal and imperfect in ways that resemble the English of nonnative speakers and English language learners.

We also take care to include sources that are likely to contain nonnative English writing, even if those sources are not specifically ESL datasets. For example, English text on websites with foreign domains is a great source of nonnative English writing.
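Conceptually, this amounts to weighted sampling over human-text sources, with informal and nonnative-adjacent text deliberately represented. The sources and weights below are purely illustrative; our actual training mixture is not published.

```python
# Purely illustrative weighted sampling over human-text sources; the actual
# sources and proportions in Pangram's training mixture are not published.
import random

SOURCE_WEIGHTS = {
    "academic_essays": 0.25,
    "social_media_and_reviews": 0.30,     # informal, imperfect English
    "general_web_text": 0.25,
    "foreign_domain_web_text": 0.20,      # likely to contain nonnative English
}

def sample_source(rng: random.Random) -> str:
    """Pick a text source according to its mixture weight."""
    sources, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```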

Multilingual Capabilities

Unlike other AI detectors, we also do not restrict our domain to English. In fact, we do not restrict the language of our model at all: we use any and all languages present on the Internet to train our model so that it performs well on all common languages.

We've previously written about our excellent multilingual performance, and we believe the techniques that make Pangram work well across other languages also generalize well to ESL.

While we can't be exactly sure which mechanisms are responsible for this generalization and transfer, we suspect that ESL writing can almost be considered an adjacent language to English. By optimizing the model to perform well on all languages, we prevent it from overfitting to the styles, grammatical constructions, or word choices specific to any one language. By looking at human text in all languages, we teach the model how all humans write, not just native English speakers, making it less likely to latch onto the idiomatic patterns of native speakers.

Active Learning

Our active learning approach is the reason Pangram is much more accurate, and falsely flags far less human text as AI, than competitors.

By iteratively alternating between training and hard negative mining, we find the human examples that most resemble AI-generated text and add them to training. This surfaces the human writing most easily confused with AI output, which helps the model learn the fine-grained differences between ESL text and AI-generated text, and it also surfaces ESL-like examples whose lessons transfer well, helping the model learn better patterns overall.
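Our full pipeline is more involved, but a minimal sketch of this train-and-mine loop might look like the following. The `train_detector` callable and the round and batch sizes are placeholders, not our production values.

```python
# Minimal sketch of iterative hard negative mining for human text.
# `train_detector` is a placeholder for real training code; the loop
# structure is the point.
from typing import Callable, List, Tuple

def mine_hard_negatives(
    train_detector: Callable[[List[Tuple[str, int]]], Callable[[str], float]],
    labeled_data: List[Tuple[str, int]],   # (text, label) with 1 = AI, 0 = human
    human_pool: List[str],                 # large pool of known-human text
    rounds: int = 3,
    per_round: int = 1000,
) -> Callable[[str], float]:
    data = list(labeled_data)
    detector = train_detector(data)
    for _ in range(rounds):
        # Score the human pool and pick the examples the current model is
        # most likely to mistake for AI (the hard negatives).
        scored = sorted(human_pool, key=detector, reverse=True)
        hard_negatives = scored[:per_round]
        data.extend((text, 0) for text in hard_negatives)
        detector = train_detector(data)  # retrain with the new hard negatives
    return detector
```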

Prompting Strategies

When creating AI examples for the model to learn from, we try to use an exhaustive variety of prompts so that the model can generalize to different writing styles. For example, we often add modifiers to the end of our prompts such as "Write this essay in the style of a high schooler," or "write this article in the style of a nonnative English speaker."

By creating so many different styles of writing, the model does not just learn the default way that AI language models write: it learns the fundamental underlying patterns of AI text.

From a statistical perspective, we design our synthetic mirror pipeline so that our model ends up being invariant to irrelevant features such as the topic, the writing level, or the tone. By prompting the model in ways that match the features of the human text, we build in this invariance: the training set contains an equal number of human and AI examples exhibiting each feature.
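A toy sketch of the idea: for each human example, generate an AI counterpart whose prompt carries the same style attributes, so that style alone can never separate the two classes. The style tags, prompt template, and `generate` callable below are illustrative stand-ins, not our production pipeline.

```python
# Toy sketch of "mirrored" prompting: each AI example is generated with style
# modifiers matched to a human example, so attributes like writing level or
# tone appear equally in both classes. Templates and tags are illustrative.
STYLE_MODIFIERS = {
    "high_school": "Write this in the style of a high schooler.",
    "nonnative": "Write this in the style of a nonnative English speaker.",
    "casual": "Write this in a casual, conversational tone.",
}

def build_mirror_prompt(topic: str, style_tag: str) -> str:
    """Prompt asking an LLM to produce an AI counterpart with matched style."""
    return f"Write a short essay about {topic}. {STYLE_MODIFIERS[style_tag]}"

def mirror_pair(human_text: str, topic: str, style_tag: str, generate) -> tuple:
    """Return a (human, AI) training pair sharing topic and style attributes."""
    ai_text = generate(build_mirror_prompt(topic, style_tag))  # LLM call, placeholder
    return (human_text, 0), (ai_text, 1)
```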

Rigorous Evaluation and QA

Finally, we employ an extremely comprehensive and rigorous evaluation and QA process before signing off each new model update.

In evaluation, we focus on both quality and quantity. For example, the Liang TOEFL dataset has only 91 examples, so on its own it gives only a very coarse estimate of our false positive rate on ESL. If we got a single example wrong, we'd report a 1.1% false positive rate, and we would not be able to distinguish a model whose true FPR is near 1% from one whose true FPR is far lower.

Because we strive for a false positive rate much lower than 1% (our target is anywhere between 1 in 10,000 and 1 in 100,000), we need to measure millions of examples to confirm accuracy at that level.
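To make the statistics concrete: when zero false positives are observed in n clean examples, the standard "rule of three" gives an approximate 95% upper bound of 3/n on the true false positive rate, so bounding an FPR near 1 in 100,000 or better requires on the order of millions of examples. A small sketch:

```python
# With zero false positives observed in n human examples, the "rule of three"
# gives an approximate 95% upper confidence bound of 3 / n on the true FPR.
def fpr_upper_bound_95(n_examples: int, n_false_positives: int = 0) -> float:
    if n_false_positives == 0:
        return 3.0 / n_examples
    # Simple normal approximation when errors were observed (illustrative only).
    p = n_false_positives / n_examples
    return p + 1.96 * (p * (1 - p) / n_examples) ** 0.5

print(fpr_upper_bound_95(91))         # ~3.3%: Liang TOEFL alone is too small
print(fpr_upper_bound_95(25_021))     # ~0.012%
print(fpr_upper_bound_95(3_000_000))  # ~0.0001%, i.e. about 1 in 1,000,000
```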

Large-scale evaluation also helps us build better intuition for our model's failure modes and correct them over time by sourcing better data and developing algorithmic strategies targeted specifically at those failure cases.

Can AI detectors be trusted on ESL?

Through our measurements, detailed evaluation results, and explainable mitigation strategies, we believe that Pangram is sufficiently accurate on nonnative English speakers to be deployed in the educational setting.

However, having a sufficiently unbiased AI detector is not enough to prevent all forms of bias in the academic integrity process. Educators should be aware that bias may show up in unconscious ways. For example, if an educator is more likely to use an AI detector on nonnative English speakers' submissions due to subconscious suspicion that ESL students are less honest, then that is a form of bias.

Additionally, teachers need to be aware that nonnative English speakers face inherent disadvantages in academia compared to their native English-speaking counterparts. ESL students are more likely to use external tools such as ChatGPT to improve their writing, which, when used heavily enough, will be flagged by AI detection software. That is why we recommend the Perkins AI Assessment Scale to facilitate clear communication with students about what kind of AI assistance is allowed and what is not.

Finally, we know that students cheat when they are under stress and pressure, when they feel a lack of self-efficacy compared with their peers, and when they feel that a cheating tool is the only way they can be successful. We encourage educators to address these concerns proactively by supporting these students, clearly communicating what kind of assistance is available and allowed, and potentially rethinking assessment strategies so that they do not expect perfect English from students who come into the classroom already disadvantaged.

Pangram should be used as a tool to support academic integrity and to help educators decide how best to support their students' learning.

To learn more about our research and the ways we mitigate bias in our AI detection software, please contact us at info@pangram.com.
