Our classifier uses a traditional language model architecture. It receives input text and tokenizes it. The model then turns each token into an embedding: a vector of numbers representing that token's meaning.
These embeddings are passed through the neural network, producing an output embedding. A classifier head transforms the output embedding into a binary prediction: 0 for human-written text and 1 for AI-generated text.
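The pipeline above can be sketched in a few lines. This is a minimal, illustrative stand-in, not the production model: the toy tokenizer, the random parameters, and mean pooling in place of the real neural network are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMB_DIM = 100, 16

# Illustrative parameters; a real model learns these during training.
embedding_table = rng.normal(size=(VOCAB_SIZE, EMB_DIM))
head_weights = rng.normal(size=EMB_DIM)
head_bias = 0.0

def tokenize(text: str) -> list[int]:
    # Toy tokenizer: map each whitespace-separated word to a vocab id.
    return [sum(w.encode()) % VOCAB_SIZE for w in text.split()]

def classify(text: str) -> int:
    token_ids = tokenize(text)
    # Look up an embedding vector for each token.
    embeddings = embedding_table[token_ids]
    # Stand-in for the neural network: mean-pool the token embeddings
    # into a single output embedding.
    output_embedding = embeddings.mean(axis=0)
    # Classifier head: linear layer + sigmoid, thresholded at 0.5.
    logit = output_embedding @ head_weights + head_bias
    prob_ai = 1.0 / (1.0 + np.exp(-logit))
    return int(prob_ai >= 0.5)  # 0 = human label, 1 = AI label
```

In a real detector the pooling step would be a full transformer encoder and the head would be trained jointly with it, but the shape of the computation is the same: tokens in, one embedding out, one bit of prediction at the end.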
The initial model was already quite effective, but we wanted to maximize accuracy and minimize false positives (human-authored documents incorrectly flagged as AI-generated). To do this, we developed a training algorithm specifically for AI detection models.
The initial dataset did not give the model enough signal to go from 99% to 99.999% accuracy. The model learns the broad patterns in the data quickly, but it needs to see hard edge cases to precisely distinguish human from AI text.
We solved this by using the model itself to search large datasets for false positives and augmenting the initial training set with these hard examples before retraining. After several such cycles, the resulting model exhibits a near-zero false positive rate as well as improved overall performance on held-out evaluation sets.
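The mining-and-retraining loop can be sketched as follows. Everything here is a toy illustration under stated assumptions: documents are reduced to a single score, the "model" is just a decision threshold, and the retraining rule (placing the threshold between the hardest examples of each class) stands in for real gradient training.

```python
def predict(threshold: float, x: float) -> int:
    # 1 = flagged as AI, 0 = human.
    return int(x >= threshold)

def retrain(train_set: list[tuple[float, int]]) -> float:
    # Toy "training": put the threshold midway between the highest-scoring
    # human example and the lowest-scoring AI example seen so far.
    human = [x for x, y in train_set if y == 0]
    ai = [x for x, y in train_set if y == 1]
    return (max(human) + min(ai)) / 2

def mine_hard_examples(threshold, human_corpus, train_set, rounds=3):
    for _ in range(rounds):
        # Search a large corpus of known-human documents for false
        # positives: documents the current model wrongly flags as AI.
        false_positives = [x for x in human_corpus if predict(threshold, x) == 1]
        if not false_positives:
            break
        # Augment the training set with these hard examples (label 0 = human)
        # and retrain before the next round.
        train_set = train_set + [(x, 0) for x in false_positives]
        threshold = retrain(train_set)
    return threshold

# Easy initial data places the boundary too low...
train = [(0.1, 0), (0.2, 0), (0.9, 1), (1.0, 1)]
t0 = retrain(train)            # 0.55
corpus = [0.3, 0.4, 0.6, 0.7]  # human-written docs, some near the boundary
# ...so mining pushes it past the hard human examples.
t1 = mine_hard_examples(t0, corpus, train)
```

After mining, the documents at 0.6 and 0.7 that the initial threshold misclassified are no longer flagged, which is the toy analogue of the false positive rate dropping across cycles.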