Pangram Text AI Detector is now multilingual!

Bradley Emi, July 1, 2024

Photo by Valentin Antonucci.

We’re excited to announce a major update to Pangram Text, our flagship AI detection model. Pangram Text can now detect AI-generated text in Spanish, French, Italian, Portuguese, German, Russian, and Mandarin Chinese with the same industry-leading accuracy it achieves on English text. We are rolling out the new multilingual model immediately to protect online platforms from AI spam.

Benchmarking

To test the accuracy of our model on non-English languages, we use three large, diverse multilingual corpora from different domains: Amazon multilingual reviews, Wikipedia, and XLSum (BBC News International).

For the human side of the benchmark, we sample random documents that pass our sanity check filters. For the AI side of the benchmark, we use a mix of GPT-3.5, GPT-4, and GPT-4o. First, we ask the LLM to summarize the real document, e.g., “What is this review about?” Then, we ask it to generate a review, article, or news piece given the summary. Generating the benchmark in this way removes the possibility of label noise and ensures that the human and AI data distributions are as similar as possible to each other.
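The two-step generation described above can be sketched roughly as follows. This is a minimal illustration, not our production pipeline: the `complete` callable stands in for whichever LLM client is used, and the `Example` class, the `build_benchmark` name, and the exact wording of the second prompt are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Example:
    text: str
    label: str  # "human" or "ai"


def build_benchmark(
    human_docs: Iterable[str],
    complete: Callable[[str], str],  # hypothetical: prompt in, LLM text out
    kind: str = "review",
) -> List[Example]:
    """Pair every human document with an AI document on the same topic."""
    examples: List[Example] = []
    for doc in human_docs:
        # Step 1: ask the LLM what the real document is about.
        summary = complete(f"What is this {kind} about?\n\n{doc}")
        # Step 2: ask the LLM to write a new document from that summary
        # (prompt wording here is an assumption).
        synthetic = complete(f"Write a {kind} based on this summary:\n\n{summary}")
        # Labels are known by construction, so there is no label noise.
        examples.append(Example(text=doc, label="human"))
        examples.append(Example(text=synthetic, label="ai"))
    return examples
```

Because each AI document is derived from the summary of a specific human document, the two halves of the benchmark cover the same topics and stay balanced one-to-one.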

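| Language | Amazon Reviews Accuracy | Wikipedia Accuracy | XLSum (BBC News) Accuracy |
| --- | --- | --- | --- |
| Spanish | 99.59% | 99.75% | 99.75% |
| French | 98.84% | 99.33% | 98.50% |
| Italian | N/A | 99.82% | N/A |
| German | 99.44% | 99.95% | N/A |
| Portuguese | N/A | 99.83% | 99.70% |
| Russian | N/A | 98.34% | 99.35% |
| Chinese | 99.70% | 99.54% | 98.10% |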
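For readers who want to reproduce this kind of per-language figure on their own data, here is a small sketch. It assumes accuracy means the fraction of documents whose predicted label matches the true label; the `accuracy_by_language` function and the row layout are illustrative, not part of any Pangram API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_language(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Compute accuracy per language.

    Each row is (language, true_label, predicted_label).
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for language, truth, predicted in rows:
        total[language] += 1
        if truth == predicted:
            correct[language] += 1
    # Only languages that appear in the input get an entry.
    return {language: correct[language] / total[language] for language in total}
```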

FAQ

  • How did you update the model to support these languages?

Because our model is based on an architecture similar to that of modern large language models, we use large-scale pretraining to ensure that our backbone is trained on a large multilingual corpus before fine-tuning an AI detection head. We also use a tokenizer that supports many languages, including Russian and Chinese.
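To make the backbone-plus-head idea concrete, here is a toy sketch in plain Python. It is not our actual architecture: the `encode` callable is a hypothetical stand-in for a pretrained multilingual backbone that returns one vector per token, and the mean pooling and single linear layer with a sigmoid are assumptions chosen for simplicity.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def mean_pool(token_vectors: Sequence[Vector]) -> Vector:
    """Average the per-token vectors into one document vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def detection_head(pooled: Vector, weights: Vector, bias: float) -> float:
    """Linear layer followed by a sigmoid; returns a score in (0, 1)."""
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def score_text(
    text: str,
    encode: Callable[[str], Sequence[Vector]],  # hypothetical pretrained backbone
    weights: Vector,
    bias: float,
) -> float:
    """Score a text: backbone features, pooled, then passed through the head."""
    return detection_head(mean_pool(encode(text)), weights, bias)
```

In this picture, pretraining determines what `encode` produces for text in any language its tokenizer covers, and fine-tuning adjusts the head's `weights` and `bias` (and possibly the backbone) for the detection task.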

  • Why did you choose these specific languages?

We chose languages that together represent the majority of language use on the internet.

  • What happens if I submit text in a language that is not supported?

We use Amazon Comprehend to detect the language of the input text. If the language is not supported, then we will return "Unsupported Language" as the prediction.
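A language gate of this kind might look like the sketch below. The Comprehend call uses boto3's `detect_dominant_language`, which requires AWS credentials and a configured region. Everything else is an assumption made for illustration: the `predict` and `comprehend_language` names, the `classify` callable, the set of language codes, and the choice to match on the primary subtag so that a code like `zh-TW` counts as Chinese.

```python
from typing import Callable

# Assumed language codes for English plus the seven newly supported languages.
SUPPORTED = frozenset({"en", "es", "fr", "it", "pt", "de", "ru", "zh"})
UNSUPPORTED = "Unsupported Language"


def comprehend_language(text: str) -> str:
    """Return the dominant language code reported by Amazon Comprehend."""
    import boto3  # imported lazily so the gate can be tested without AWS

    client = boto3.client("comprehend")
    response = client.detect_dominant_language(Text=text)
    languages = response["Languages"]
    if not languages:
        return ""  # nothing detected; treated as unsupported below
    best = max(languages, key=lambda item: item["Score"])
    return best["LanguageCode"]


def predict(
    text: str,
    classify: Callable[[str], str],  # hypothetical detector: text in, label out
    detect_language: Callable[[str], str] = comprehend_language,
) -> str:
    """Run the detector only when the input language is supported."""
    code = detect_language(text)
    primary = code.split("-")[0].lower()  # assumption: "zh-TW" maps to "zh"
    if primary not in SUPPORTED:
        return UNSUPPORTED
    return classify(text)
```

Passing the language detector in as a parameter keeps the gate easy to test offline, for example `predict(text, classify=my_model, detect_language=lambda t: "es")`, where `my_model` is whatever classifier you have on hand.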

  • Will the model improve over time?

Yes, we expect to release future updates with improved performance on non-English languages as we continue to grow our multilingual dataset with active learning.

  • What about other languages?

We plan to support more languages in the future. If you have a language that you would like to see supported, please let us know!

Contact us at info@pangram.com for more information on multilingual AI detection.
