Apple’s AI Scraping Hits Roadblock as Major Sites Opt Out


Apple AI scraping project faces challenges. Credit: Jorge Láscar / Flickr / CC BY 2.0

Apple recently introduced a tool allowing publishers to prevent their content from being used in AI training. In less than three months, several major news outlets and social media platforms have opted out.

Among those opting out are Facebook, Instagram, Craigslist, Tumblr, The New York Times, the Financial Times, The Atlantic, Vox Media, the USA Today network, and Condé Nast, the parent company of WIRED.

The wave of opt-outs reflects growing concern about how web crawlers, which have scanned the internet for years, are now gathering data to train AI. It also highlights an intensifying debate over who owns the information on the web and how it should be used in the future, as reported by WIRED.

Applebot-Extended to ‘control data usage’

Apple has introduced Applebot-Extended, a tool that allows website owners to prevent their data from being used in AI training. The tool extends Apple’s original web-crawling bot, Applebot. In a blog post, Apple describes the feature as a way for site owners to “control data usage.”

Applebot, first introduced in 2015, was originally designed to help power Apple’s search tools, such as Siri and Spotlight. Its role has since expanded: the data it collects is now also being used to train the models behind Apple’s generative AI efforts.

Applebot-Extended is designed to protect publishers’ rights, according to Apple spokesperson Nadine Haija. While it does not stop the original Applebot from crawling websites—a process important for how content appears in Apple’s search products—it does prevent that data from being used to train Apple’s AI models.

In effect, the tool lets site owners decide whether Apple’s AI systems can learn from their data without affecting their visibility in Apple’s search features. It is a way to fine-tune what Apple’s web-crawling bot may do with the content it gathers.

Use of ‘robots.txt’ to stop Applebot-Extended

Publishers can block Applebot-Extended by editing a plain-text file on their websites called robots.txt, as shown below. The file, which implements the Robots Exclusion Protocol, has guided how bots collect data from the web for decades.
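
Apple documents Applebot-Extended as a user agent token that robots.txt can name directly. A minimal entry along the following lines would ask Apple not to use a site’s content for AI training while leaving the original Applebot unnamed, and therefore free to crawl for search; the comment line and the site-wide Disallow are illustrative choices rather than requirements:

```
# Opt the whole site out of Apple's AI training.
User-agent: Applebot-Extended
Disallow: /
```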

Now, it’s becoming a central part of the discussion on AI training. Some publishers have already updated their robots.txt files to keep AI bots from companies such as OpenAI and Anthropic from accessing their data, as reported by WIRED.

The robots.txt file lets website owners decide which bots can access their sites and which cannot. Although there is no legal requirement for bots to obey the rules set out in the file, honoring them has been a widely accepted convention for years.
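
As a rough illustration of how a compliant crawler consults those rules, the sketch below uses Python’s standard urllib.robotparser module. The robots.txt contents and the example.com URL are placeholders, not anything taken from a real publisher or from Apple’s own crawler:

```python
from urllib import robotparser

# Hypothetical robots.txt for a publisher opting out of AI training
# while leaving ordinary search crawling untouched.
ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/some-article"

# The original Applebot is not named, so no rule applies to it
# and the parser reports the fetch as allowed.
print(parser.can_fetch("Applebot", url))           # True

# Applebot-Extended matches the Disallow rule, so a crawler that
# honors robots.txt should not use this page for AI training.
print(parser.can_fetch("Applebot-Extended", url))  # False
```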

However, not all bots comply. Earlier this year, a WIRED investigation discovered that the AI startup Perplexity was bypassing robots.txt and secretly scraping data from websites without permission.