Pangram Text AI Detector now supports Arabic, Japanese, Korean, Hindi, and more
About two months ago, we released our first multilingual Pangram Text AI detection model, and now we are ready to announce our first multilingual update! Pangram Text now officially supports the top 20 languages on the Internet and unofficially performs well on many more. We observe especially strong and greatly improved performance on Arabic, Japanese, Korean, and Hindi.
We evaluated about 2,000 documents per language in our official support set. The human side is a mix of real reviews, news articles, and Wikipedia articles. The AI side is a set of essays, news articles, and blog posts that we prompted GPT-4o to write across a range of lengths, styles, and topics.
Language | Accuracy | False Positive Rate | False Negative Rate |
---|---|---|---|
Arabic | 99.95% | 0.10% | 0.00% |
Czech | 99.95% | 0.00% | 0.11% |
German | 99.85% | 0.00% | 0.32% |
Greek | 99.90% | 0.00% | 0.21% |
Spanish | 100.00% | 0.00% | 0.00% |
Persian | 100.00% | 0.00% | 0.00% |
French | 100.00% | 0.00% | 0.00% |
Hindi | 99.79% | 0.00% | 0.42% |
Hungarian | 99.49% | 0.10% | 0.95% |
Italian | 100.00% | 0.00% | 0.00% |
Japanese | 100.00% | 0.00% | 0.00% |
Dutch | 99.95% | 0.10% | 0.00% |
Polish | 100.00% | 0.00% | 0.00% |
Portuguese | 100.00% | 0.00% | 0.00% |
Romanian | 99.95% | 0.10% | 0.00% |
Russian | 100.00% | 0.00% | 0.00% |
Swedish | 99.95% | 0.00% | 0.11% |
Turkish | 99.90% | 0.00% | 0.21% |
Ukrainian | 99.95% | 0.00% | 0.11% |
Urdu | 99.44% | 0.00% | 1.16% |
Vietnamese | 99.95% | 0.00% | 0.11% |
Chinese | 99.95% | 0.00% | 0.11% |
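For clarity on the three columns above: the false positive rate is measured only over human-written documents and the false negative rate only over AI-generated documents. A minimal sketch of the computation (illustrative, not our evaluation harness):

```python
# Minimal sketch (not our evaluation harness) of how the per-language metrics
# above are computed. `labels` holds ground truth (1 = AI-written, 0 = human)
# and `preds` holds the detector's binary predictions.

def detection_metrics(labels: list[int], preds: list[int]) -> dict[str, float]:
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(labels),
        # False positive rate: fraction of human documents flagged as AI.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # False negative rate: fraction of AI documents classified as human.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```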
Here are the key changes we made to improve our multilingual support:
- We ran an active learning data campaign against web-scale data focused on the top 20 languages on the Internet.
- We changed the tokenizer to better support non-English languages.
- We increased the parameter count of the base model and LoRA adapters.
- We applied a data augmentation that machine-translates a random fraction of our dataset before training.
- We fixed a bug in word counting that caused East Asian languages to be accidentally underrepresented in the training set (illustrated in the sketch below).
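To illustrate the last item: a naive whitespace-based word count works for English but collapses an entire Japanese or Chinese sentence into a single "word", so any length-based filtering or sampling built on it silently under-samples East Asian documents. The snippet below is a simplified illustration, not our actual pipeline.

```python
# Simplified illustration of the word-counting bug (not our actual pipeline):
# whitespace splitting undercounts languages that do not put spaces between words.

english = "The quick brown fox jumps over the lazy dog"
japanese = "素早い茶色の狐がのろまな犬を飛び越える"

print(len(english.split()))   # 9 "words"
print(len(japanese.split()))  # 1: the entire sentence counts as a single "word"

# A character- or tokenizer-based length estimate removes the skew, so East
# Asian documents are no longer filtered out or down-sampled as "too short".
print(len(japanese))          # 19 characters, a more faithful length signal
```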
The foundation of our process for building models with extremely low false positive rates is active learning: simply put, we mine the pre-2022 Internet for examples that our model performs poorly on (e.g. false positives), add those examples to our training set, retrain, and repeat. We detail this algorithm in our technical report.
We apply this active learning approach to large multilingual datasets on the web to find multilingual text that our current model struggles with, and then use this data to iterate, along with our large library of prompts for creating synthetic mirrors: AI text that looks similar to the mined false positives. While we focus on the top 20 languages on the Internet, we removed the language-filtering step from our data pipeline, meaning text from all languages is fair game for hard negative mining and inclusion in our training set.
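A highly condensed sketch of this loop is below. The functions `train`, `score`, and `generate_synthetic_mirror` are placeholders for our real training, inference, and prompting infrastructure, and the threshold and round count are illustrative.

```python
# Condensed sketch of the active learning loop. `train`, `score`, and
# `generate_synthetic_mirror` stand in for our real training, inference, and
# prompting infrastructure; the threshold and round count are illustrative.

FP_THRESHOLD = 0.5  # score above which a known-human document is a false positive
NUM_ROUNDS = 5

def active_learning(train, score, generate_synthetic_mirror,
                    training_set, human_web_corpus):
    """Mine hard false positives from known-human web text (no language
    filter), pair them with synthetic AI mirrors, retrain, and repeat."""
    model = train(training_set)
    for _ in range(NUM_ROUNDS):
        # Pre-2022 web text is known to be human-written, so anything the
        # model scores as AI is a false positive worth learning from.
        hard_negatives = [doc for doc in human_web_corpus
                          if score(model, doc) > FP_THRESHOLD]

        # Each mined document gets a synthetic "mirror": AI text prompted to
        # look similar in topic, style, and language.
        mirrors = [generate_synthetic_mirror(doc) for doc in hard_negatives]

        training_set += [(doc, "human") for doc in hard_negatives]
        training_set += [(doc, "ai") for doc in mirrors]
        model = train(training_set)
    return model
```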
One of the benefits of our active learning approach is that it automatically rebalances the distribution of languages based on our model's accuracy. Low-resource languages are underrepresented online, and because of that imbalance our first model initially performed poorly on them, causing more text from uncommon languages to bubble up in the hard negative mining runs. Over the course of active learning, data from high-resource languages such as English, Spanish, and Chinese gradually decreases in proportion in our training set, while less common languages increase in proportion. We find this a relatively elegant solution to the naturally imbalanced data distribution of multilingual model training: via the active learning algorithm, the model selects for itself the data in the languages it needs to see more of.
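A toy illustration of that rebalancing effect, with made-up numbers: because the selection criterion is purely model error and there is no language filter, the languages the model handles worst end up making up a larger share of the mined text.

```python
from collections import Counter

# Toy illustration (made-up numbers) of why mining without a language filter
# rebalances the training mix: the selection criterion is model error, so
# languages the model handles poorly contribute a larger share of mined text.
mined_pool = [
    # (language, probability the model wrongly assigns "AI" to human text)
    ("en", 0.05), ("en", 0.10), ("en", 0.20), ("es", 0.15), ("es", 0.30),
    ("ja", 0.80), ("ja", 0.75), ("ar", 0.85), ("ar", 0.70), ("hi", 0.90),
]

hard_negatives = [lang for lang, p_ai in mined_pool if p_ai > 0.5]

print(Counter(lang for lang, _ in mined_pool))  # pool skews toward en/es
print(Counter(hard_negatives))                  # mined set skews toward ja/ar/hi
```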
To better support multilingual text in the input domain, we also wanted to make sure that the base LLM we use to build our classifier is widely fluent in many non-English languages. We performed a sweep of several LLM backbones and tokenizers on our dataset to find the one that performs best across a wide swath of non-English languages. Interestingly, performance on multilingual benchmarks does not correlate strongly with how well a backbone performs on our AI detection task: even if the base model can solve reasoning tasks and answer questions in other languages, how well that skill transfers to multilingual AI detection varies widely.
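One coarse signal worth looking at alongside such a sweep is tokenizer fertility: how many tokens the tokenizer produces per character of text in each language, since a tokenizer that fragments Arabic or Japanese into many tiny pieces leaves the classifier less effective context per document. A minimal sketch with Hugging Face tokenizers (the model names are examples, not necessarily the backbones we evaluated):

```python
# Illustrative comparison of tokenizer "fertility" (tokens per character) on
# non-English text. Model names are examples only; requires `pip install transformers`.
from transformers import AutoTokenizer

samples = {
    "en": "Large language models are increasingly used for translation.",
    "ja": "大規模言語モデルは翻訳にますます使われている。",
    "ar": "تُستخدم نماذج اللغة الكبيرة بشكل متزايد في الترجمة.",
}

for name in ["xlm-roberta-base", "bert-base-multilingual-cased"]:
    tok = AutoTokenizer.from_pretrained(name)
    for lang, text in samples.items():
        n_tokens = len(tok.encode(text, add_special_tokens=False))
        # Higher tokens-per-character means the tokenizer fragments the
        # language more, leaving less effective context per document.
        print(f"{name:32s} {lang}: {n_tokens / len(text):.2f} tokens/char")
```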
We also found that the initial models we trained tended to underfit the new multilingual distribution: we initially observed a higher training loss. To that end, we increased the base model size as well as the parameter count of our LoRA adapters, and trained the model for more steps. (Because we are in an active learning / high data regime, we almost never train for longer than one epoch. In this case, we just had to extend the size of the epoch!)
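As a rough sketch of what scaling up the adapters looks like with the `peft` library (the rank, alpha, target modules, and backbone below are illustrative values, not our production configuration):

```python
# Illustrative LoRA scaling with Hugging Face `peft`; all values are examples,
# not our production configuration.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-large", num_labels=2  # binary human vs. AI classifier head
)

lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=64,                               # larger rank -> more adapter parameters
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections to adapt
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```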
Even with active learning, the diversity of data in non-English languages is noticeably lower than the diversity and volume of English data online, and we cannot fully rectify that simply by rebalancing the language distribution in the training set. Put coarsely, some English data is valuable but simply has no native parallel in other languages. We therefore decided to randomly apply a machine-translation augmentation to a small fraction of our dataset (in our case using Amazon Translate).
Machine-translation augmentation is not standard practice in LLM training, because machine-translated data is often unnatural and suffers from "translation-ese." In our case, however, we are not training a generative model, so the augmentation does not affect output quality, and we observed improvements to our metrics after applying it.
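A sketch of the augmentation using Amazon Translate through `boto3` is below; the augmentation probability and the set of target languages are illustrative, not our actual settings.

```python
# Sketch of the machine-translation augmentation: with small probability, a
# training document is replaced by a machine translation of itself before
# training. Requires `pip install boto3` and AWS credentials; the probability
# and target languages below are illustrative.
import random
import boto3

translate = boto3.client("translate")
AUGMENT_PROB = 0.05
TARGET_LANGS = ["es", "ar", "ja", "ko", "hi"]

def maybe_translate(text: str, source_lang: str = "en") -> str:
    """Return the original text, or occasionally a machine-translated copy."""
    if random.random() > AUGMENT_PROB:
        return text
    target = random.choice(TARGET_LANGS)
    response = translate.translate_text(
        Text=text,
        SourceLanguageCode=source_lang,
        TargetLanguageCode=target,
    )
    return response["TranslatedText"]
```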
We take Spanish as a characteristic example of a high-resource language that Pangram Text previously supported but on which performance is now much improved. We measured the false positive rate across a variety of domains.
Dataset | False Positive Rate (Before) | False Positive Rate (After) | Number of Examples |
---|---|---|---|
Spanish Amazon reviews | 0.09% | 0% | 20,000 |
Wikilingua (WikiHow article text) | 3.17% | 0.14% | 113,000 |
XL-SUM (news articles in native Spanish) | 0.08% | 0% | 3,800 |
Spanish Wikipedia | 0.29% | 0.04% | 67,000 |
Spanish CulturaX | 0.22% | 0.01% | 1,800,000 |
Spanish blog posts we curated manually | 0% | 0% | 60 |
We also measured the false negative rate (the rate at which AI-generated text is incorrectly classified as human) for various large language models. In this experiment, we came up with a list of prompts asking LLMs to generate essays, blog posts, and news articles in a variety of lengths and styles, and then translated the prompts into Spanish. The LLMs themselves are multilingual, so they respond to the instructions in Spanish.
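A sketch of how such an evaluation set can be generated is below; the prompts and the OpenAI client call are illustrative of the setup, not our exact generation harness.

```python
# Illustrative generation of Spanish AI-written evaluation documents from
# prompts that have been translated into Spanish; not our exact harness.
from openai import OpenAI

client = OpenAI()

# Essay, blog post, and news article prompts, already translated into Spanish.
spanish_prompts = [
    "Escribe un ensayo de 500 palabras sobre la importancia de la energía solar.",
    "Escribe una entrada de blog informal sobre un viaje reciente a Oaxaca.",
    "Redacta un artículo de noticias sobre las elecciones municipales de tu ciudad.",
]

ai_documents = []
for prompt in spanish_prompts:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # The models are multilingual, so a Spanish instruction yields Spanish output.
    ai_documents.append(response.choices[0].message.content)

# Each generated document is then scored by the detector; any document
# classified as human counts toward the false negative rate.
```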
Model | False Negative Rate (Before) | False Negative Rate (After) | Number of Examples |
---|---|---|---|
GPT-4o | 2.1% | 0% | 1,400 |
Claude 3.5 Sonnet | 0.7% | 0% | 1,400 |
Claude 3 Opus | 1.05% | 0% | 1,400 |
Gemini 1.5 Pro | 2.85% | 0% | 1,400 |
As we can see, our updated model achieves perfect detection across all tested LLMs, significantly improving upon our previous version.
Two of the languages we focused most on improving are widely spoken in the world but relatively less common on the Internet: Arabic and Japanese.
Dataset | Arabic False Positive Rate | Japanese False Positive Rate | Arabic Examples | Japanese Examples |
---|---|---|---|---|
Amazon Reviews | 0% | 0% | N/A | 20,000 |
AR-AES (Arabic student writing) | 0% | N/A | 2,000 | N/A |
Wikilingua (WikiHow article text) | 0.58% | 0.55% | 29,000 | 12,000 |
XL-SUM (news articles in native language) | 0% | 0% | 4,000 | 733 |
Wikipedia | 0.09% | 0.009% | 31,000 | 96,000 |
CulturaX | 0.08% | 0.21% | 1,785,000 | 1,409,000 |
Blog posts we curated manually | 0% | 0% | 60 | 60 |
We previously did not support these two languages, so the false negative rates were extremely high. We now detect AI-generated Arabic and Japanese reliably.
Model | Arabic FNR | Japanese FNR |
---|---|---|
GPT-4o | 0% | 0% |
Claude 3.5 Sonnet | 0% | 0% |
Claude 3 Opus | 0% | 0% |
Gemini 1.5 Pro | 0% | 0.21% |
As we can see, our updated model achieves near-perfect detection across all tested LLMs for both Arabic and Japanese, with only a slight 0.21% false negative rate for Gemini 1.5 Pro in Japanese.
Full language benchmark results are available upon request.
While our performance is strong on native web text, our model sometimes struggles to detect "translation-ese": text that is badly translated or otherwise does not sound natural. To make matters worse, many people now use LLMs like ChatGPT directly for translation tasks. Should LLM-translated text be classified as human or AI? It depends on the heavy-handedness of the translation, and on the downstream use case. A Spanish teacher may consider using machine translation on an assignment to be academic dishonesty, while a publisher may want to allow translated works through their QA process. Pangram is actively working on understanding translated text as a "third modality" that lies somewhere between human and AI, and on providing more information so downstream consumers of our model can decide what is right for them.
Have more questions? Contact us at info@pangram.com!