Risk of AI Language Models Running Out of Training Data by 2030, Study Warns

Artificial intelligence (AI) language models, such as OpenAI's ChatGPT, are at risk of running out of training data by around 2030, according to a recent study by the research group Epoch AI. The study projects that tech companies will exhaust the supply of publicly available text for training AI language models around the turn of the decade, sometime between 2026 and 2032. Once the reserves of human-generated writing are depleted, the AI field may struggle to maintain its current pace of progress.

Comparing the situation to a “literal gold rush,” Tamay Besiroglu, an author of the study, explains that companies such as OpenAI and Google are racing to secure and pay for high-quality data sources to train their AI models. They have signed deals to tap into the steady flow of sentences from Reddit forums and news media outlets. However, in the long term, there won’t be enough new content, such as blogs, news articles, and social media commentary, to sustain the current trajectory of AI development.

This puts pressure on companies to either tap into sensitive data that is currently considered private, such as emails or text messages, or rely on less-reliable "synthetic data" generated by the AI models themselves. Besiroglu notes that a limit on available data creates a serious bottleneck, because scaling up models, and the text used to train them, has been crucial to expanding their capabilities and improving the quality of their output.
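To make the scaling point concrete, here is a rough back-of-envelope sketch. It leans on the roughly 20-tokens-per-parameter rule of thumb from the "Chinchilla" scaling-law work (Hoffmann et al., 2022), which is outside the Epoch study; the model sizes below are purely illustrative.

```python
# Why bigger models demand more text: a common rule of thumb from the
# "Chinchilla" scaling-law work (Hoffmann et al., 2022) is ~20 training
# tokens per parameter for compute-optimal training. The heuristic and
# the model sizes are illustrative, not figures from the Epoch study.
TOKENS_PER_PARAM = 20

for params in (7e9, 70e9, 400e9):
    tokens_trillions = params * TOKENS_PER_PARAM / 1e12
    print(f"{params / 1e9:4.0f}B parameters -> ~{tokens_trillions:.1f}T tokens")

# Each 10x jump in model size wants roughly 10x more text, while the
# supply of new public text grows far more slowly.
```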

The research team first made projections two years ago, shortly before ChatGPT's debut, forecasting a cutoff of high-quality text data as early as 2026. Since then, AI researchers have developed techniques to make better use of existing data, even "overtraining" models on the same sources multiple times. These approaches have limits, however, which is why Epoch now predicts that public text data will be depleted sometime in the next two to eight years.
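The diminishing returns of "overtraining" can be sketched with a toy model. The exponential-decay form and the decay constant below are illustrative assumptions for this sketch, not the study's fitted scaling law.

```python
import math

# Toy model of "overtraining": each extra pass over the same data adds
# less value than the one before. The exponential-decay form and the
# decay constant are illustrative assumptions, not the study's fitted law.
def effective_tokens(unique_tokens: float, epochs: int, decay: float = 4.0) -> float:
    """Fresh-data equivalent of `epochs` passes over `unique_tokens` tokens."""
    return unique_tokens * sum(math.exp(-(k - 1) / decay) for k in range(1, epochs + 1))

for epochs in (1, 2, 4, 8, 16):
    print(f"{epochs:2d} epochs -> {effective_tokens(1.0, epochs):.2f}x the unique data")

# 16 passes are worth roughly 4.4x one pass, not 16x, so repetition can
# stretch the public data stock, but only so far.
```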

The study, which is peer-reviewed and set to be presented at the International Conference on Machine Learning, also highlights how quickly the computing power and text data fed into AI language models have grown: the amount of training text has been increasing about 2.5 times per year, while computing power has grown about four times per year. Meta Platforms, the parent company of Facebook, recently claimed that its then-unreleased Llama 3 model had been trained on up to 15 trillion tokens.
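Those figures make the exhaustion window easy to sanity-check. In the sketch below, the 2.5x annual growth rate and the 15-trillion-token baseline come from the numbers above; the 300-trillion-token stock of public text is an assumed, illustrative figure, not one from the study.

```python
# Back-of-envelope projection of when a single frontier training run
# would outgrow a fixed stock of public text. The 2.5x annual growth and
# 15-trillion-token 2024 baseline come from the article; the
# 300-trillion-token stock is an assumed, illustrative figure.
GROWTH_PER_YEAR = 2.5      # training-set size multiplier per year
tokens = 15e12             # tokens used by a frontier model in 2024
STOCK = 300e12             # assumed stock of public human-written text

year = 2024
while tokens < STOCK:
    year += 1
    tokens *= GROWTH_PER_YEAR

print(f"Demand passes the assumed stock around {year}")
# -> 2028 under these assumptions, inside the study's 2026-2032 window
```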

However, opinions differ on how serious the data bottleneck is. Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto, argues that there is no need to keep training larger and larger models; more specialized models could instead be trained for specific tasks. At the same time, he warns against training AI systems on their own outputs, a feedback loop researchers call "model collapse" that can degrade performance and further encode the mistakes, biases, and unfairness already present in the information ecosystem.
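A deliberately simplified simulation illustrates the feedback loop Papernot describes: a "model" (here, just a normal distribution repeatedly refit on its own samples) gradually loses the diversity of the original data. This is a statistical toy, not a language-model experiment.

```python
import random
import statistics

# Toy stand-in for "model collapse": a normal distribution repeatedly
# refit on its own samples loses the spread of the original data, the
# statistical analogue of outputs growing blander and more repetitive.
random.seed(0)
mu, sigma = 0.0, 1.0                      # generation 0: the human-written "data"

for generation in range(1, 101):
    samples = [random.gauss(mu, sigma) for _ in range(10)]  # small synthetic corpus
    mu = statistics.fmean(samples)        # refit the "model" on its own output
    sigma = statistics.pstdev(samples)    # spread is underestimated each round
    if generation in (1, 25, 50, 100):
        print(f"generation {generation:3d}: sigma = {sigma:.4f}")

# sigma collapses toward 0 over the generations: each round of training
# on synthetic output loses a little variance and forgets the tails of
# the real distribution.
```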

The study also sparks a conversation about human-created data and the stewardship of valuable sources such as Reddit, Wikipedia, news publishers, and book publishers. Some have moved to restrict AI companies' access to their data, often after it had already been scraped without compensation. Wikipedia, by contrast, has placed few restrictions on how AI companies use its volunteer-written entries. Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, jokes about treating human-created data as a natural resource, and stresses the need for incentives that keep people contributing.

Looking ahead, the study suggests that paying millions of humans to generate the text AI models need is unlikely to be an economical solution. OpenAI, for example, has already experimented with generating synthetic data for training its models. CEO Sam Altman acknowledges the importance of high-quality data but expresses reservations about relying too heavily on synthetic data, arguing that something would be amiss if the best way to train a model were simply to generate vast quantities of synthetic text and feed it back in.

In short, the risk that AI language models will run out of training data by around 2030 is a significant concern. Tech companies are racing to lock up high-quality data sources, but the long-term sustainability of their models remains uncertain: the depletion of publicly available data may push them toward sensitive private data or less-reliable synthetic data. How human-created data is used, and how its creators are given incentives to keep contributing, will be central to meeting these challenges.


Written By

Jiri Bílek

In the vast realm of AI and U.N. directives, Jiri crafts tales that bridge tech divides. With every word, he champions a world where machines serve all harmoniously.