OpenAI, Anthropic Ignore Rule That Prevents Bots Scraping Web Content – Business Insider
The world’s top two AI startups are ignoring requests by media publishers to stop scraping their web content for free model training data, Business Insider has learned.
OpenAI and Anthropic have been found to be either ignoring or circumventing an established web rule, called robots.txt, that is meant to prevent automated scraping of websites, according to a person with knowledge of the analytics of TollBit, as well as another person familiar with the matter.
TollBit is a startup that’s aiming to broker paid licensing deals between publishers and AI companies. It found several AI companies are acting in this way and informed certain large publishers in a Friday letter, which was reported earlier by Reuters. The letter did not include the names of any of the AI companies accused of skirting the rule.
OpenAI and Anthropic have stated publicly that they respect robots.txt and blocks to their specific web crawlers, GPTBot and ClaudeBot.
However, according to TollBit’s findings, such blocks are not being respected as claimed. AI companies, including OpenAI and Anthropic, are simply choosing to “bypass” robots.txt in order to retrieve or scrape all of the content from a given website or page.
A spokeswoman for OpenAI declined to comment beyond pointing BI to a corporate blogpost from May, in which the company says it takes web crawler permissions “into account each time we train a new model.” A spokesperson for Anthropic did not respond to emails seeking comment.
Robots.txt is a small text file that websites have used since the mid-1990s to tell bot crawlers they don’t want their data scraped and collected. It has been widely accepted as one of the unofficial rules supporting the web.
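To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The file contents, the `example.com` URL, the `SomeOtherBot` name, and the `is_allowed` helper are all illustrative assumptions, not taken from any publisher or AI company; only the crawler names GPTBot and ClaudeBot come from the reporting above. The check is voluntary: nothing in the file technically stops a crawler that skips it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve (illustrative only).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download them
    # from the site's /robots.txt first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story.html"  # placeholder URL
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        print(agent, "allowed" if is_allowed(agent, page) else "blocked")
```

A crawler that honors the file would skip any page for which this check returns False. Bypassing robots.txt, as described in TollBit's findings, amounts to fetching the page without making the check or in spite of its result.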
With the rise of generative AI, startups and tech companies are racing to build the most powerful AI models. A key ingredient is high-quality data. The thirst for such training data has undermined robots.txt and the unofficial agreements supporting the use of this code.
OpenAI is behind the popular chatbot ChatGPT. The company’s largest investor is Microsoft. Anthropic is behind another relatively popular chatbot, Claude. Its largest investor is Amazon.
Both chatbots serve up answers to user questions in the tone of a human. Such answers are only possible because the AI models they are built on include massive amounts of written text and data scraped from the web, much of it under copyright or otherwise owned by creators.
Several tech companies last year argued to the US Copyright Office that nothing on the web should be considered under copyright when it comes to AI training data.
OpenAI has struck a few deals with publishers for access to content, including Axel Springer, which owns BI. The US Copyright Office is set to update its guidance on AI and copyright later this year.
Are you a tech employee or someone else with a tip or insight to share? Contact Kali Hays at khays@businessinsider.com or on secure messaging app Signal at +1-949-280-0267. Reach out using a non-work device.