News is a $150BN industry employing thousands of reporters and journalists whose articles receive billions of views. With the rise of AI and large language models, many lower-quality news sites, and some bad actors, have leaned on AI to generate content cheaply, quickly, and at scale. Because AI cannot fill a journalist's role, these news sites are limited to repeating information from their training data or stealing and rephrasing other outlets' articles.
Inauthentic content has also been shown to be less desirable and less visited by online readers. In a recent blog post, we cited research conducted by NP Digital which found that online readers prefer and prioritize human-generated articles. Specifically:
These AI publications exist mainly to siphon traffic and potential ad revenue away from authentic news content, and serve as part of a growing content farming operation that captured 21% of ad impressions and more than $10BN in 2023.
Knowing the threat posed by this rise of inauthentic news, and the damage it can do, we wanted to quantify the actual scale of the problem. We collaborated with NewsCatcher to classify a given day’s sample of globally published news.
We began by compiling a collection of all the news in the world published on July 1, 2024.
NewsCatcher’s API is the most exhaustive source of daily published, global news articles, with over 75,000 sources, and serves large enterprise organizations. Their technology allowed us to query the full text of articles published around the world, written in different languages and covering a broad range of topics.
Using NewsCatcher, we collected all the news published on a single day; from this data dump, we analyzed 857,434 articles collected from 26,675 online publishers, which we treat as a representative sample of the news published on a typical day.
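As an illustration, pulling a full day of articles might look something like the sketch below. The endpoint, parameter names, and response fields are assumptions based on NewsCatcher's public search API, not a verbatim reproduction of our pipeline; consult their documentation for the exact contract.

```python
# Minimal sketch of collecting one day's articles. The endpoint,
# parameters, and response shape are assumptions based on
# NewsCatcher's public v2 search API.
import requests

API_KEY = "YOUR_NEWSCATCHER_KEY"  # placeholder
URL = "https://api.newscatcherapi.com/v2/search"

articles = []
page = 1
while True:
    resp = requests.get(
        URL,
        headers={"x-api-key": API_KEY},
        params={
            "q": "*",            # match everything
            "from": "2024/07/01",
            "to": "2024/07/01",
            "page": page,
            "page_size": 100,
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    articles.extend(data.get("articles", []))
    if page >= data.get("total_pages", 1):
        break
    page += 1

print(f"Collected {len(articles)} articles")
```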
After sourcing the articles, we ran our Pangram Text classifier to determine which articles were AI-generated. Pangram Text is the industry leader in classification accuracy (over 30x more accurate than the next leading commercial solution), with a strong commitment to a low false positive rate. In our technical report, we show that our false positive rate on news is only 0.001%, which allows us to be confident that when we predict an article is AI, it truly is. Our classifier takes in a document or piece of text and returns the predicted likelihood that it was generated by an LLM. For a raw web page, we would normally have to post-process and clean the page’s content to isolate the article text, but NewsCatcher provided the cleaned text directly, so we could run inference with our text classifier as-is.
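In code, scoring each article might look like the following hypothetical sketch. The endpoint URL and response field names are illustrative placeholders, not Pangram's documented API; only the overall flow (send text, receive a likelihood) reflects the description above.

```python
# Hypothetical sketch of scoring articles with an AI-text classifier.
# The URL and field names below are illustrative only.
import requests

CLASSIFIER_URL = "https://api.example-classifier.com/v1/predict"  # placeholder
CLASSIFIER_KEY = "YOUR_KEY"  # placeholder

def ai_likelihood(text: str) -> float:
    """Return the predicted probability that `text` was LLM-generated."""
    resp = requests.post(
        CLASSIFIER_URL,
        headers={"Authorization": f"Bearer {CLASSIFIER_KEY}"},
        json={"text": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["ai_likelihood"]

# NewsCatcher returns cleaned article text, so no HTML stripping is
# needed; the "summary" field name is an assumption about the payload.
for article in articles:
    article["ai_score"] = ai_likelihood(article.get("summary", ""))
```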
Distribution of our predictions on a log scale. We use a log scale to show that predictions near 0 or 1 are 100-1000x more common than predictions in the middle of the spectrum.
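A minimal matplotlib sketch of this view, assuming the scored articles from the previous step:

```python
# Histogram of prediction scores with a logarithmic y-axis, so the
# spikes at 0 and 1 don't flatten the middle of the distribution.
import matplotlib.pyplot as plt

scores = [a["ai_score"] for a in articles]

plt.hist(scores, bins=50)
plt.yscale("log")  # scores near 0 or 1 are 100-1000x more common
plt.xlabel("Predicted likelihood of AI generation")
plt.ylabel("Article count (log scale)")
plt.show()
```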
We then aggregated each publisher's articles and bucketed publishers by the share of their content classified as AI. The bucketing framework is as follows (a code sketch of this aggregation appears after the summary statistics below):
Of the total articles sampled, we found that:
59,653 articles were classified as AI, representing 6.96% of the article set.
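To make the aggregation concrete, here is a pandas sketch of the publisher-level bucketing. The 0.5 decision threshold, the column names, and the bucket edges are illustrative placeholders, not our exact framework.

```python
# Sketch of publisher-level aggregation, assuming a DataFrame with
# one row per scored article. Columns "publisher" and "ai_score"
# are assumed names.
import pandas as pd

df = pd.DataFrame(articles)
df["is_ai"] = df["ai_score"] > 0.5  # illustrative decision threshold

by_publisher = df.groupby("publisher")["is_ai"].agg(
    total_articles="size", ai_share="mean"
)

# Hypothetical bucket edges; the actual framework may differ.
by_publisher["bucket"] = pd.cut(
    by_publisher["ai_share"],
    bins=[-0.01, 0.0, 0.1, 0.5, 1.0],
    labels=["No AI", "Some AI", "Mixed", "Mostly AI"],
)
print(by_publisher["bucket"].value_counts())
```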
Publishers organized by how much AI content they publish

We then looked at the AI classifications across key features, including the language the article was written in, the country where it was published, and the topic it covered, as well as its political relevance.
Graph of AI articles produced by country (percentage of total news articles written by country)

We notice that Ghana is a strong outlier in terms of AI-generated content. While its overall frequency is lower, India is also a major publisher of AI-generated content, which should not be surprising given the impact of deepfakes on the recent Indian election.
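The per-country breakdown can be computed from the same DataFrame, assuming each row carries a `country` field from the NewsCatcher metadata:

```python
# Share of AI-classified articles per country, highest first.
# Continues from the `df` and `is_ai` column defined above.
ai_by_country = (
    df.groupby("country")["is_ai"].mean().sort_values(ascending=False) * 100
)
print(ai_by_country.head(10))  # per the figure, Ghana would top this list
```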
Graph of AI articles produced by topic (percentage of total news articles written on each topic)
We notice that beauty (sponsored articles) and tech and business (crypto scams) are topics with especially high shares of AI articles. Somewhat surprisingly, politics tends to be lower than average when it comes to AI articles: we think this is because advertisers tend to avoid political news sites due to brand safety risks, lowering the incentive for publishers to produce made-for-advertising political content.
We identify several categories of AI news articles: made-for-advertising sites (MFAs), sponsored articles, fraud, and disinformation.
A site whose only purpose is to serve ads rather than deliver legitimate content is called an "MFA": a made-for-advertising site. Here’s an example of an MFA:
Made-for-advertising site full of ads
As we can see, above the fold of the website there is no actual content other than the title, while 8 display ads clamor for the user’s attention. The AI content below is not really meant to be read: it is just there to attract visitors so the site can soak up ad revenue before users bounce, typically immediately. Advertisers are often not even aware that they are advertising on these sites: the programmatic nature of digital advertising means that bids for this ad real estate are bought and sold in a matter of milliseconds by automated bidding algorithms. Companies like Jounce Media help advertisers avoid wasting their budgets on sites like this, and are part of a group of companies called “Supply Chain Optimizers”.
Jounce defines three key characteristics of an MFA:
To summarize, MFAs steal ad traffic from sites with legitimate content in order to offer cheap advertising inventory. They deliver vanity metrics to programmatic ad campaigns while providing neither useful content nor actual ROI for advertisers. They litter the internet and make for a hostile user experience for the average internet user.
While there is no concrete metric that defines an MFA, we estimate that MFAs make up about 50% of the AI-generated content online.
Some news on the internet can be bought as a means to advertise a product while masquerading as actual content written by an influencer or a legitimate review publication. We noticed that beauty was one of the topics with the highest frequency of AI-generated content. When we dug into the data, we found that many of the “news” articles under the beauty topic are simply sponsored articles like this one:
AI wrote this low-quality sponsored content
Many copywriters are resorting to AI to write these low-quality sponsored articles, because the goal is to sell the placement rather than to produce an authentic review.
Crypto scammers use AI to pump out content at a high velocity
We notice a lot of run-of-the-mill scam campaigns generated with AI as well. In particular, crypto scams are commonplace, and are even promoted on reputable sites such as Medium.
A disinformation site populated with AI content
While we find that the use of AI is typically less prevalent in political news (in large part because many advertisers avoid political news over brand safety risks), AI is a growing component of disinformation campaigns. NewsGuard has an AI tracking center with detailed, up-to-date tracking of AI-enabled disinformation.
Unlike the other forms of deception that we see bad actors using AI for, the point of these articles is actually to get people to read the content. Typically, the purpose of these campaigns is to change public sentiment or opinion on a particular topic.
As the US election approaches in November, we can only expect this kind of AI abuse to continue.
Want to learn more about our map of AI content across the web, or our AI blocklist for advertisers? Get in touch at info@pangram.com!