
Scaling up with LoRA

Bradley Emi, March 22, 2024

Photo by Tara Winstead.

Last month, we released our technical report which comprehensively benchmarked our model against our competition as well as a leading academic method.

Today, we’re announcing another model release that improves our performance on this challenging benchmark even further.

Model            Accuracy   False Negative Rate   False Positive Rate
February Model   99.0%      1.30%                 0.67%
March Model      99.84%     0.11%                 0.19%

What is responsible for this improvement?

To produce the new model, we used the same active learning approach described in our technical report, Hard Negative Mining with Synthetic Mirrors. However, for this release, we significantly scaled up our model, increasing the total number of parameters by an order of magnitude. To do this, we also had to scale up the compute resources required to train the new model and implement Low-Rank Adaptation (LoRA), a commonly used technique for efficiently fine-tuning LLMs. This is also our first release of a model trained on NVIDIA’s new H100 GPUs!
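We won’t go into the full details of our training stack here, but to give a sense of what wiring up LoRA looks like in practice, here is a minimal sketch using Hugging Face’s PEFT library. The base checkpoint, rank, alpha, and target modules below are hypothetical placeholders for illustration, not our actual configuration.

```python
# Illustrative only: attach LoRA adapters to a pretrained LLM for sequence
# classification. The checkpoint and hyperparameters here are hypothetical.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # placeholder base model, not our actual one
    num_labels=2,                 # e.g. human-written vs. AI-generated
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```

Because only the adapter weights receive gradients, the memory footprint and the risk of overfitting are far smaller than fully fine-tuning a multi-billion-parameter model.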

Scaling up the model without overfitting

Smaller models have been found to work better for DetectGPT at detecting AI-generated text, and we have previously discussed the saturation of scaling laws in our technical report. As a recap, we found that adding more data does not improve the model beyond a critical threshold (in our case, around 40k documents).

In addition, if you take a look at the leaderboards for other text classification tasks such as MTEB, IMDB sentiment analysis, and AGNews, you will see that they are still dominated by models such as XLNet, DeBERTa, and T5-XXL. While these are tried-and-true architectures that have worked well on simple classification tasks for years, they are nowhere near the size of current state-of-the-art large language models. These BERT-style models have a couple hundred million parameters, while leading open-source LLMs now have tens of billions, a huge difference!

The reason that LLM-style architectures do not do as well on text classification is largely that they overfit easily. How can we get the best of both worlds: a model that has much more “base” knowledge like an LLM, but does not overfit on classification tasks?

LoRA to the rescue

In our latest release, we take advantage of LoRA, a relatively common technique for fine-tuning large language models.

Visualization of the LoRA tensor operations from the original paper.

The main idea of LoRA is that, rather than fine-tuning the entire model, which (1) takes a lot of time and memory, (2) is very prone to overfitting, and (3) can cause catastrophic forgetting of the pretraining data, the base LLM is kept frozen, and adapter modules are trained as small side networks alongside the LLM’s core attention blocks. LoRA stands for “Low-Rank Adaptation,” which means that the adapter modules decompose into parameter-efficient low-rank weight matrices, making them quick to train and memory efficient.

This figure from the LoRA paper nicely illustrates the idea. The original LLM is represented by the blue W matrix alone. The orange modules are the ones allowed to train, while the blue matrix from the original LLM is simply frozen in place as the adapter learns to route around it.
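To make the frozen-weights-plus-trainable-update idea concrete, here is a minimal, from-scratch sketch of a LoRA-style linear layer in PyTorch. This is a toy illustration rather than our production code, and the rank and scaling values are arbitrary.

```python
# A minimal sketch of the LoRA idea: freeze the original weight matrix W and
# learn a low-rank update B @ A on the side. Illustrative values only.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen linear layer W with a trainable low-rank update B @ A."""

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        # Freeze the original (blue) weight matrix W.
        for p in self.base.parameters():
            p.requires_grad = False

        d_out, d_in = base_linear.out_features, base_linear.in_features
        # Trainable (orange) low-rank factors: A is (rank x d_in), B is (d_out x rank).
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)


# Example: adapt a single attention projection without touching its weights.
proj = nn.Linear(768, 768)
lora_proj = LoRALinear(proj, rank=8)
h = lora_proj(torch.randn(2, 768))  # only A and B receive gradients
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen original, and training only has to learn the small low-rank correction.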

We find that LoRA helps our performance significantly, reducing both false positive and false negative rates.

We hypothesize that the improvement is largely due to the greater amount of pretraining knowledge contained in the LLM, which we are able to take advantage of without overfitting thanks to the LoRA adapters. Pretty cool!

Next Steps

We will continue to make architecture improvements over time to stay current with the best deep learning architectures out there. We also have additional architectural and data improvements in the pipeline, but first it’s time to make an even harder evaluation set!

Stay tuned…

Want to get in touch with us? Send us an email at info@pangram.com!
