AI Faces Data Shortage as Crucial Sources Disappear

A new study shows that AI faces significant data shortages for training. Credit: deepak pal / Flickr / CC BY-SA 2.0

Researchers have long relied on vast amounts of text, images, and videos from the internet to create powerful artificial intelligence systems. However, this resource is becoming scarce.

In the past year, many key websites have limited access to their data, according to a recent study by the Data Provenance Initiative, a research group led by M.I.T.

The study examined 14,000 web domains used in three popular AI training data sets. It found an “emerging crisis in consent” as publishers and online platforms are now blocking access to their data.

Researchers estimate that across the three data sets, named C4, RefinedWeb, and Dolma, 5 percent of all data and 25 percent of data from the highest-quality sources have been restricted.

These restrictions are implemented using the Robots Exclusion Protocol, an old method whereby website owners use a file called robots.txt to stop automated bots from accessing their pages.
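To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The robots.txt content, the bot names ExampleBot and OtherBot, and the example.com URLs are all invented for illustration and do not come from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
# It shuts out one named bot entirely and keeps every other bot
# out of a single directory.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleBot", "OtherBot"):
        for path in ("/news/story.html", "/private/draft.html"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Note that the protocol is advisory: the file only states the site owner's preference, and it is up to each crawler to check it and comply.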

The study revealed that up to 45 percent of the data in one of the sets, known as C4, had been restricted due to websites’ terms of service.

Decline in data availability impacts researchers and academics

The decline in data availability is expected to affect not only AI companies but also researchers, academics, and noncommercial entities. The study's lead author, Shayne Longpre, emphasized that the rapid reduction in consent for data usage across the web poses significant challenges for these groups.

New Preprint How are shifting norms on the web impacting AI?

We find:

A rapid decline in the consenting data commons (the web)

Differing access to data by company, due to crawling restrictions (e.g. 26% OpenAI, 13% Anthropic)

Robots.txt preference protocols… pic.twitter.com/ipXbALDbl7

— Shayne Longpre (@ShayneRedford) July 19, 2024

Data is essential for today’s generative AI systems. These systems use billions of examples of text, images, and videos, often taken from public websites by researchers. This information is collected into large data sets that can be freely downloaded and used or combined with other data.

Generative AI tools like OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude rely on learning from this data. These tools can write, code, and create images and videos. The quality of their outputs improves with more high-quality data.

Blocking data for OpenAI, Anthropic, and Google

For years, AI developers could easily gather data. However, the recent surge in generative AI has created tensions with data owners. Many of them are uncomfortable with their data being used for AI training without compensation.

In response, some publishers have implemented paywalls or changed their terms of service to restrict the use of their data for AI training. Others have blocked automated web crawlers used by companies like OpenAI, Anthropic, and Google.
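Because robots.txt rules are written per crawler, the same set of websites can look more closed to one company than to another. The sketch below counts how many sites in a small sample turn away each crawler. The three sites and their robots.txt files are made up, so the resulting shares say nothing about the real web. The tokens GPTBot, ClaudeBot, and Google-Extended are assumed here to be the names under which OpenAI, Anthropic, and Google accept robots.txt rules, and should be checked against each company's current documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt files for three made-up sites.
SITES = {
    "news.example": "User-agent: GPTBot\nDisallow: /\n",
    "forum.example": "User-agent: GPTBot\nUser-agent: ClaudeBot\nDisallow: /\n",
    "blog.example": "User-agent: *\nDisallow: /admin/\n",
}

# Crawler tokens assumed for illustration; verify them before relying on them.
AGENTS = ("GPTBot", "ClaudeBot", "Google-Extended")


def blocked_share(sites: dict[str, str], agent: str) -> float:
    """Return the fraction of sites whose robots.txt bars agent from the home page."""
    blocked = 0
    for host, text in sites.items():
        parser = RobotFileParser()
        parser.parse(text.splitlines())
        if not parser.can_fetch(agent, f"https://{host}/"):
            blocked += 1
    return blocked / len(sites)


if __name__ == "__main__":
    for agent in AGENTS:
        print(f"{agent}: blocked on {blocked_share(SITES, agent):.0%} of sample sites")
```

This only checks each site's home page. A fuller audit would test many paths per site and weight each site by how much data it contributes.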

Websites like Reddit and Stack Overflow now charge AI companies for access to their data. Some publishers have even gone to court. For instance, The New York Times sued OpenAI and Microsoft last year, claiming that the companies used its news articles to train AI models without permission, as reported by The New York Times.