Photo by Valentin Antonucci.
We’re excited to announce a major update to Pangram Text, our flagship AI detection model. Pangram Text can now detect AI-generated text in Spanish, French, Italian, Portuguese, German, Russian, and Mandarin Chinese with the same industry-leading accuracy it achieves on English text. The new multilingual model is rolling out immediately to protect online platforms from AI spam.
To test the accuracy of our model on non-English languages, we use three large, diverse multilingual corpora from different domains: Amazon multilingual reviews, Wikipedia, and XLSum (BBC News International).
For the human side of the benchmark, we sample random documents that pass our sanity check filters. For the AI side of the benchmark, we use a mix of GPT-3.5, GPT-4, and GPT-4o. First, we ask the LLM to summarize the real document, e.g., “What is this review about?” Then, we ask it to generate a review, article, or news piece given the summary. Generating the benchmark this way removes the possibility of label noise and keeps the human and AI data distributions as similar to each other as possible.
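The two-step procedure above can be sketched roughly as follows. This is a minimal illustration, not our production pipeline: the `build_pair` function, the `llm` callable (any function that takes a prompt and returns text), the label strings, and the wording of the generation prompt are all assumptions made for the example. Only the summary question is taken from the description above.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates. Only the summary question mirrors the post;
# the generation prompt wording is an assumption for illustration.
SUMMARY_PROMPT = "What is this {kind} about?\n\n{document}"
GENERATE_PROMPT = "Write a {kind} in {language} based on this summary:\n\n{summary}"


def build_pair(
    document: str,
    kind: str,
    language: str,
    llm: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Return one human example and an AI-generated counterpart on the same topic."""
    # Step 1: ask the LLM what the real document is about.
    summary = llm(SUMMARY_PROMPT.format(kind=kind, document=document))
    # Step 2: ask the LLM to write a new document from the summary alone.
    generated = llm(
        GENERATE_PROMPT.format(kind=kind, language=language, summary=summary)
    )
    # Labels are known by construction: one text was sampled, the other generated.
    return [
        {"text": document, "label": "human"},
        {"text": generated, "label": "ai"},
    ]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "stub response to: " + prompt.splitlines()[0]


pair = build_pair("La batidora funciona muy bien.", "review", "Spanish", stub_llm)
```

Because the generated text is derived from a summary of the human document, both examples in a pair share a topic and domain, which is what keeps the two distributions close.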
Language | Amazon Reviews Accuracy | Wikipedia Accuracy | XLSum (BBC News) Accuracy |
---|---|---|---|
Spanish | 99.59% | 99.75% | 99.75% |
French | 98.84% | 99.33% | 98.50% |
Italian | N/A | 99.82% | N/A |
German | 99.44% | 99.95% | N/A |
Portuguese | N/A | 99.83% | 99.70% |
Russian | N/A | 98.34% | 99.35% |
Chinese | 99.70% | 99.54% | 98.10% |
Because our model is based on an architecture similar to modern large language models, we use large-scale pretraining so that the backbone is trained on a large multilingual corpus before we fine-tune an AI detection head. We also use a tokenizer that supports many languages, including Russian and Chinese.
We chose these languages because they represent the majority of the languages used on the internet.
We use Amazon Comprehend to detect the language of the input text. If the language is not supported, we return "Unsupported Language" as the prediction.
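As a rough sketch of this gating step: the function below takes a language detector and a classifier as plain callables, so it runs without AWS access. The names `predict_with_language_gate`, `detect_languages`, and `classify`, the set of language codes, and the choice to match on the primary language subtag (so that a code like `zh-TW` counts as Chinese) are assumptions for illustration, not a description of our actual service. The detector is expected to return a list shaped like the `Languages` field of Amazon Comprehend's `DetectDominantLanguage` response.

```python
from typing import Callable, Dict, List

# Assumed language codes: English plus the seven languages named in the post.
SUPPORTED_LANGUAGES = {"en", "es", "fr", "it", "pt", "de", "ru", "zh"}
UNSUPPORTED = "Unsupported Language"


def predict_with_language_gate(
    text: str,
    detect_languages: Callable[[str], List[Dict]],
    classify: Callable[[str], str],
) -> str:
    """Run the classifier only when the dominant language is supported."""
    candidates = detect_languages(text)
    if not candidates:
        return UNSUPPORTED
    # Pick the candidate with the highest confidence score.
    dominant = max(candidates, key=lambda c: c["Score"])
    # Compare on the primary subtag, e.g. "zh-TW" -> "zh" (an assumption).
    primary = dominant["LanguageCode"].split("-")[0]
    if primary not in SUPPORTED_LANGUAGES:
        return UNSUPPORTED
    return classify(text)


# With boto3 installed and AWS credentials configured, the detector could be:
#   import boto3
#   comprehend = boto3.client("comprehend")
#   detect = lambda t: comprehend.detect_dominant_language(Text=t)["Languages"]
```

Passing the detector in as an argument keeps the gate easy to test with canned responses and leaves the choice of language-identification service open.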
We expect to release future updates with improved performance on non-English languages as we continue to grow our multilingual dataset with active learning.
We plan to support more languages in the future. If you have a language that you would like to see supported, please let us know!
Contact us at info@pangram.com for more information on multilingual AI detection.