NOTE: We've changed our name to Pangram Labs! See our blog post for more details.
At Checkfor.ai, we strive to build the best-in-class AI text detector in service of our mission: protecting the internet from low-quality, AI-generated pollution. One of the most important areas to defend is user review platforms.
Fake online reviews ultimately hurt both businesses and consumers, and ChatGPT has only made review fraud even easier to commit at a large scale.
A ChatGPT-generated review on Yelp
Keeping user trust in online reviews is an important part of our mission at Checkfor.ai to protect the authenticity of human-generated content online.
My name is Bradley Emi, and I’m the CTO of Checkfor.ai. I’ve worked as an AI researcher at Stanford, shipped production models as an ML Scientist on the Tesla Autopilot team, and led a research team at Absci that built a platform for designing drugs with large neural networks. In self-driving cars and drug discovery, 99% accuracy is simply not good enough. 99% accuracy could mean that 1 out of 100 pedestrians is run over by an autonomous vehicle, or 1 out of 100 patients experiences life-threatening side effects from a poorly designed drug.
While detecting AI-generated text isn’t a life-or-death situation, we want to design models and software systems at Checkfor.ai that are held to the same quality bar. Our detector must hold up to adversarial attacks such as paraphrasing, advanced prompt engineering, and detection evasion tools such as undetectable.ai. We are serious about solving this problem (i.e., not just getting to 99%), and so one of the highest priorities of our engineering team is developing an extremely robust evaluation platform.
A Software 1.0 cybersecurity company would never ship a product without unit tests. As a Software 2.0 company, we need the equivalent of unit tests, except ours have to exercise models with millions or even billions of parameters, account for stochastic behavior, and cover a wide distribution of tail cases. We cannot achieve “99% test set accuracy” and call it a day: we need evaluations that specifically test the kinds of examples we will encounter in the real world.
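To make the analogy concrete, here is a rough sketch of what a “unit test” for a detection model can look like: an ordinary pytest case that runs the model over a small, targeted test set and asserts a minimum accuracy. The detector interface, file paths, and thresholds below are hypothetical placeholders, not our actual test suite.

```python
# test_detector_regressions.py: a sketch of "unit tests" for a detection model.
# The detector interface and data paths are hypothetical placeholders.
import json

import pytest


class PlaceholderDetector:
    """Stand-in for a real detection model; replace with the production model."""

    def predict_is_ai(self, text: str) -> bool:
        raise NotImplementedError("wire this up to the real detector")


@pytest.fixture
def detector():
    return PlaceholderDetector()


def load_labeled_reviews(path):
    """Load a targeted test set of {"text": ..., "label": "ai" | "human"} records."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def accuracy(detector, examples):
    correct = sum(
        detector.predict_is_ai(ex["text"]) == (ex["label"] == "ai")
        for ex in examples
    )
    return correct / len(examples)


def test_paraphrased_chatgpt_reviews_are_caught(detector):
    # Targeted behavior: paraphrased ChatGPT text should still be flagged.
    examples = load_labeled_reviews("tests/data/paraphrased_reviews.jsonl")
    assert accuracy(detector, examples) >= 0.99


def test_short_human_reviews_are_not_flagged(detector):
    # Targeted behavior: short, real human reviews should not be false positives.
    examples = load_labeled_reviews("tests/data/short_human_reviews.jsonl")
    assert accuracy(detector, examples) >= 0.99
```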
A good test set answers specific questions and minimizes the number of confounding variables.
Examples of targeted test questions and corresponding test sets include:
There are several reasons that you cannot just combine everything in your test set and report a number.
That’s why benchmark studies such as these completely miss the mark. They are unfocused and do not test the specific behaviors we want the model to exhibit. Biased test sets show off a model putting its best foot forward, not a model facing real-world examples.
An example of a real-world application of AI text detection is detecting AI-generated reviews on Yelp. Yelp is committed to strict moderation of its review platform, and its Trust and Safety Report for 2022 makes it clear that Yelp cares deeply about fighting fraudulent, compensated, incentivized, or otherwise dishonest reviews.
Fortunately, Yelp has also released an excellent open source dataset. We randomly sampled 1,000 reviews from this dataset and generated 1,000 synthetic reviews with ChatGPT, the most commonly used LLM.
It’s important to note that the ChatGPT reviews are written for real Yelp businesses from the Kaggle dataset: that way, the model can’t cheat by overfitting to details such as a difference in business distribution. During evaluation, we test whether the model really learned to use the right features in the text to differentiate real reviews from fake ones.
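For concreteness, here is a minimal sketch of how such a paired dataset could be assembled. The file names, prompt wording, and choice of gpt-3.5-turbo are illustrative assumptions, not our exact pipeline.

```python
# Sketch of the dataset construction: 1,000 real Yelp reviews plus 1,000
# ChatGPT reviews written for the same businesses. File names, the prompt,
# and the model choice are illustrative assumptions.
import json
import random

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()
random.seed(0)

# The Yelp open dataset ships as JSON-lines files; names here are assumptions.
with open("yelp_academic_dataset_business.json") as f:
    business_names = {b["business_id"]: b["name"] for b in map(json.loads, f)}
with open("yelp_academic_dataset_review.json") as f:
    reviews = [json.loads(line) for line in f]

real_sample = random.sample(reviews, 1000)

# Generate one ChatGPT review per sampled business, so the business
# distribution is matched between the real and synthetic classes.
synthetic = []
for r in real_sample:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a realistic {r['stars']}-star Yelp review for "
                       f"{business_names[r['business_id']]}.",
        }],
    )
    synthetic.append(resp.choices[0].message.content)
```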
We use this dataset to figure out which of the AI detection models really can differentiate ChatGPT-generated reviews from real ones!
Our simplest metric is accuracy: how many examples did each model classify correctly?
While 99.85% vs. 96% may not initially seem like a large gap, looking at the error rate puts these numbers into better context.
Checkfor.ai is expected to fail only once out of every 666 queries, while Originality.AI is expected to fail once out of every 26 queries, and GPTZero fails once out of every 11 queries. This means our error rate is over 25x better than Originality.AI’s and 60x better than GPTZero’s.
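The arithmetic behind those “fails once every N queries” figures is just the inverse of the error rate. A quick sketch, using the rounded accuracies quoted above (so the results differ slightly from the exact figures):

```python
# Turn an accuracy into an error rate and a "fails once every N queries" figure.
def error_rate(accuracy: float) -> float:
    return 1.0 - accuracy

checkforai = error_rate(0.9985)   # 0.0015
originality = error_rate(0.96)    # 0.04

print(f"Checkfor.ai fails about once every {1 / checkforai:.0f} queries")
print(f"Originality.AI fails about once every {1 / originality:.0f} queries")
print(f"Relative error rate: about {originality / checkforai:.0f}x")
```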
To look at false positives and false negatives (in machine learning parlance, the closely related statistics are precision and recall), we can examine the confusion matrix: what are the relative rates of true positives, false positives, true negatives, and false negatives?
Over all 2,000 examples, Checkfor.ai produces 0 false positives and 3 false negatives, exhibiting both high precision and high recall. GPTZero, admirably, predicts only 2 false positives, but that comes at the expense of 183 false negatives, an incredibly high false negative rate; we’d call this a model with high precision but low recall. Finally, Originality.AI predicts 60 false positives and 8 false negatives, and it refuses to predict a likelihood on short reviews (under 50 words), which are the hardest cases and the most likely to become false positives. This high false positive rate makes the model low precision, high recall.
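For reference, precision and recall fall straight out of these confusion-matrix counts, treating “AI-generated” as the positive class. The sketch below uses Checkfor.ai’s counts on the 2,000-review set (1,000 AI-generated, 1,000 human):

```python
# Precision and recall from confusion-matrix counts, with "AI-generated" as
# the positive class. Counts are Checkfor.ai's on the 2,000-review set.
tp, fp, fn, tn = 997, 0, 3, 1000

precision = tp / (tp + fp)  # of everything flagged as AI, how much really was AI
recall = tp / (tp + fn)     # of all AI-generated reviews, how many were caught

print(f"precision = {precision:.4f}, recall = {recall:.4f}")
# precision = 1.0000, recall = 0.9970
```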
In AI text detection, a low false positive rate is the more important of the two (we don’t want to falsely accuse real humans of plagiarizing from ChatGPT), but a low false negative rate is also necessary: we cannot allow 10–20% of AI-generated content to slip through the cracks.
Ultimately, we would like our model to express high confidence when it is clear that text was written by a human or by ChatGPT.
Following a visualization strategy similar to the excellent academic paper DetectGPT by Mitchell et al., we plot histograms of model predictions on both AI-generated and real reviews for all three models. Since all three models are over 90% accurate, a log scale on the y-axis is the most helpful way to visualize the characteristics of each model’s confidence.
On this plot, the x-axis represents the probability that the model predicts the input review as AI-generated. The y-axis represents how often the model predicts that particular probability for real (blue bars) or AI (red bars) text. We see that when looking at these “soft” predictions, rather than just a yes or a no, Checkfor.ai is much better at drawing a clear decision boundary and making more confident predictions than either GPTZero or Originality.AI.
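A log-scaled histogram of this kind takes only a few lines of matplotlib. The scores below are randomly generated placeholders standing in for one model’s per-review predictions, purely to make the sketch runnable:

```python
# Overlaid histograms of predicted P(AI) for human vs. AI reviews, with a
# log-scaled y-axis.
import numpy as np
import matplotlib.pyplot as plt

# Placeholder scores for illustration only; in practice these would be one
# model's predicted P(AI) for the 1,000 real and 1,000 ChatGPT reviews.
rng = np.random.default_rng(0)
human_scores = rng.beta(1, 20, size=1000)   # mostly near 0
ai_scores = rng.beta(20, 1, size=1000)      # mostly near 1

bins = np.linspace(0.0, 1.0, 21)
plt.hist(human_scores, bins=bins, alpha=0.6, color="tab:blue", label="Real reviews")
plt.hist(ai_scores, bins=bins, alpha=0.6, color="tab:red", label="ChatGPT reviews")
plt.yscale("log")  # the log scale makes the rare mistakes visible
plt.xlabel("Predicted probability that the review is AI-generated")
plt.ylabel("Count (log scale)")
plt.legend()
plt.show()
```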
GPTZero tends to predict too many examples in the 0.4–0.6 range of probability, with a mode right around 0.5. On the other hand, Originality.AI’s false positive issue becomes even more visible when examining the soft predictions. Many real reviews are very close to being predicted as AI-generated, even if they do not clear the threshold of 0.5. This makes it hard for a user to trust that the model can reliably predict AI-generated text, as small perturbations to the review can allow an adversary to bypass the detector by iteratively editing the review until it is under the detection threshold.
Our model, on the other hand, is decisive: we are generally able to make confident predictions. For readers with a deep learning or information theory background, our predictions have the lowest cross entropy (equivalently, KL divergence) between the true distribution and the predicted distribution.
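Concretely, that claim can be checked with the standard binary cross-entropy (log loss) over each model’s soft predictions: a model that is both correct and confident scores low, while a correct but hedging model scores much higher. The labels and probabilities below are tiny illustrative placeholders:

```python
# Binary cross-entropy (log loss) between true labels and soft predictions.
# y_true is 1 for AI-generated, 0 for human; y_prob is the predicted P(AI).
import numpy as np
from sklearn.metrics import log_loss


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy by hand: -mean(y*log(p) + (1-y)*log(1-p))."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


# Tiny placeholder example: two AI-generated reviews and two human reviews.
y_true = [1, 1, 0, 0]
confident = [0.99, 0.98, 0.01, 0.02]
hedging = [0.60, 0.55, 0.45, 0.40]

print(cross_entropy(y_true, confident))  # ~0.015: decisive and correct
print(cross_entropy(y_true, hedging))    # ~0.55: correct but unconfident
print(log_loss(y_true, confident))       # same value via scikit-learn
```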
There is clear value in predicting real text as real with high confidence (see this humorous figure from Twitter). This educator clearly misinterpreted the AI probability as the fraction of the text that was AI-written, but when detectors are unconfident that real text is really real, they leave room for exactly this kind of misinterpretation.
https://twitter.com/rustykitty_/status/1709316764868153537
Of the 3 errors Checkfor.ai makes, two are unfortunately fairly confident mispredictions. Our detector isn’t perfect, and we are actively working on calibrating the model to avoid such confident errors.
We are open-sourcing the dataset of real and fake Yelp reviews used in this evaluation, so that future models can use this benchmark to test the accuracy of their detectors.
Our main takeaways are:

- Checkfor.ai exhibits both a low false positive rate and a low false negative rate.
- Checkfor.ai tells the difference between real and AI-generated reviews not just with high accuracy, but with high confidence.

We will be releasing more posts in this style in the future and sharing our honest assessments of our model publicly as we learn more. Stay tuned, and let us know what you think!