Cloudflare launches free tool to block AI bots scraping websites

Customers don’t want AI bots visiting their websites, especially those that do so dishonestly
A representational image. — The Cloudflare Blog

Cloudflare, an American cloud service provider, has unveiled a new free tool to stop bots from scraping websites hosted on its platform for data to train AI models.

Several AI vendors, including Google, OpenAI and Apple, let website owners block the bots they use for data scraping and model training by editing their site’s robots.txt, the text file that tells bots which pages of a website they can access. However, as Cloudflare points out in a post unveiling its bot-combating tool, not all AI scrapers respect this.
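For illustration, here is a minimal sketch of how that convention works, using Python’s standard urllib.robotparser and a hypothetical robots.txt that disallows two real AI crawler tokens (OpenAI’s GPTBot and Google’s Google-Extended). The example.com URLs are placeholders:

```python
from urllib import robotparser

# Hypothetical robots.txt: block two AI training crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks permission before fetching a page.
print(parser.can_fetch("GPTBot", "https://example.com/article"))        # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

The catch, as the article notes, is that robots.txt is purely advisory: nothing forces a scraper to run a check like this before fetching the page.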

In its official blog, the company stated: “Customers don’t want AI bots visiting their websites, especially those that do so dishonestly. We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection.”

To address the issue, Cloudflare examined AI bot and crawler traffic to fine-tune its automatic bot detection models. The models consider, among other factors, whether an AI bot might be trying to avoid detection by mimicking the appearance and behaviour of a person using a web browser.


Cloudflare said: “When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we can fingerprint.

“Based on these signals, our models [are] able to appropriately flag traffic from evasive AI bots as bots.”
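Cloudflare has not published its model internals, but the fingerprinting idea can be sketched with a toy heuristic: a request whose User-Agent claims a mainstream browser yet lacks headers real browsers routinely send looks suspicious. The header names and threshold below are illustrative assumptions, not Cloudflare’s actual signals:

```python
# Toy illustration only: real bot detection draws on far richer signals
# (TLS fingerprints, behaviour over time, IP reputation, and more).
EXPECTED_BROWSER_HEADERS = {"accept", "accept-language", "accept-encoding"}

def looks_like_evasive_bot(headers: dict[str, str]) -> bool:
    """Flag requests that claim to be a browser but lack typical browser headers."""
    user_agent = headers.get("User-Agent", "").lower()
    claims_browser = any(t in user_agent for t in ("chrome", "firefox", "safari"))
    if not claims_browser:
        return False  # bots that identify themselves honestly are handled elsewhere
    missing = EXPECTED_BROWSER_HEADERS - {k.lower() for k in headers}
    return len(missing) >= 2  # arbitrary threshold for the sketch

# A scripted client that spoofs only the User-Agent trips the check.
print(looks_like_evasive_bot(
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0) Chrome/126.0"}))  # True

# A request carrying the usual browser headers passes.
print(looks_like_evasive_bot({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0) Chrome/126.0",
    "Accept": "text/html",
    "Accept-Language": "en-GB",
    "Accept-Encoding": "gzip, br",
}))  # False
```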

Cloudflare has introduced a form for hosts to report suspected AI bots and crawlers, and says it will continue to manually blacklist AI bots over time. The issue of AI bots has come into sharp relief as the generative AI boom fuels demand for model training data.

Various sites, wary of AI vendors training models on their content without warning or compensation, have chosen to block AI scrapers and crawlers. However, blocking isn’t a reliable protection, as some vendors appear to ignore standard bot exclusion rules to gain a competitive advantage in the AI race.

Perplexity, an AI search engine, recently faced accusations of impersonating legitimate visitors to scrape content from websites, and OpenAI and Anthropic have reportedly ignored robots.txt rules at times. Content licensing startup TollBit said it sees “many AI agents” ignoring the robots.txt standard.

Tools like Cloudflare’s could help, but only if they prove accurate in detecting clandestine AI bots. And they won’t solve the more stubborn problem of publishers risking the loss of referral traffic from AI tools like Google’s AI Overviews, which exclude sites from inclusion if they block specific AI crawlers.