Why Blocking Web Crawlers Could Backfire in the Era of AI
AI-generated, human-edited.
On a recent episode of Intelligent Machines, Richard Skrenta, executive director of Common Crawl, argued that opting out of open web crawlers isn’t just a defensive move against big tech—it may also erase brands, publishers, and even vital institutions from the new frontiers of AI search and discovery. With major publishers and platforms increasingly blocking AI crawlers, Skrenta warns that this strategy could have unintended, long-term consequences not just for news organizations, but for the entire public information ecosystem.
Today, AI relies on massive public web crawls—like those by the nonprofit Common Crawl—to feed large language models (LLMs) with current, diverse, high-quality information. These crawls aren’t just powering commercial AIs; they’re a foundational resource for academic research, translation tools, small startups, and thousands of data science projects worldwide. Skrenta highlighted that over 10,000 scholarly papers cite Common Crawl as a source.
The debate: As commercial AI companies strike exclusive content deals (like Reddit’s OpenAI arrangement) and web infrastructure providers (such as Cloudflare) block bots, more publishers are asking for their material to be delisted from open web archives, even content that is freely and publicly available.
But as Skrenta explained, when organizations block crawlers, they are not only shutting out OpenAI or Google; they are also cutting themselves off from a future in which AI agents and smart assistants may be the primary way users find information, get recommendations, and take action.
It’s common to believe only news outlets care about being discovered online, but Skrenta shared an example that flips that idea. Major hotel brands, for instance, are beginning to spend heavily (Marriott, he noted, spends over $1 billion a year on SEO) to ensure their information is present in AI-powered answers given by chatbots and smart assistants. Likewise, hospitals and educational sites want to appear in AI citations and recommendations.
His message was clear: Forward-thinking brands are starting to see inclusion in training datasets as an opportunity, not just a risk. Removing your site from crawls doesn’t stop AI—it stops AI from referencing you.
Key risks of blocking crawlers:
- Lost discoverability: LLMs will not reference you if you’re not in their training data
- Hard to reverse: If your site is removed from a crawl, getting back in later may be difficult or impossible
- Collateral impact: Opting out prevents not just big tech, but also independent researchers, translators, and open projects from using your public material
- Biased data sets: Widespread exclusions bias training data toward what remains accessible, potentially distorting AI-generated answers and research results
Skrenta stressed that while legal and paywall concerns are understandable, open crawls like Common Crawl index only public material and strictly honor robots.txt rules. Increasingly, however, publishers are demanding removal after the fact, driven by “AI panic” and a misunderstanding of where the real value in aggregate data lies.
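For readers unfamiliar with the mechanism: robots.txt is the voluntary standard a well-behaved crawler consults before fetching any page. The sketch below is a minimal illustration, not something from the episode; the example rules and the “CCBot” user-agent string (the name Common Crawl’s crawler typically identifies itself by) are assumptions used only to show how the check works with Python’s standard-library urllib.robotparser.

```python
import urllib.robotparser

# Hypothetical robots.txt rules: shut one crawler's user agent out of the
# whole site, while letting every other bot crawl everything but /private/.
ROBOTS_RULES = [
    "User-agent: CCBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
]

# A well-behaved crawler parses the site's robots.txt once...
parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_RULES)

# ...and checks every URL against those rules before deciding to fetch it.
for agent in ("CCBot", "SomeOtherBot"):
    for url in ("https://example.com/article.html",
                "https://example.com/private/report.html"):
        verdict = "fetch" if parser.can_fetch(agent, url) else "skip"
        print(f"{agent:>12}  {verdict}  {url}")
```

In practice a crawler would load the live file with set_url("https://example.com/robots.txt") and read() rather than hard-coding rules. The sketch also shows why the directive is blunt: the same "Disallow: /" that keeps a commercial AI bot out removes the site from the open archives that researchers, translators, and startups draw on, which is the trade-off Skrenta describes.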
Skrenta’s advice to publishers and organizations:
- Think long-term: Don’t just consider today’s licensing deals—consider how AI will power search, agents, and decision-making in the future.
- Understand your options: Opting out removes your visibility from AI and academic research; opting in allows participation and influence.
- Request amendments, not exclusions: If attribution or usage is your concern, work towards transparency and referral tracking, not blanket bans.
- Preserve the commons: Blocking open web crawls does not hurt big players as much as it harms the digital and research commons—cutting access for thousands of beneficial uses.
- Act before it’s too late: Removals are hard to undo; many sites discover too late that they’ve erased their information from next-generation tools and research.
Key Takeaways
- Open web crawlers like Common Crawl are foundational for AI and research.
- Opting out blocks not only AI companies but also academics, startups, and public-interest projects.
- Publishers who block crawlers may regret it, as re-inclusion is slow or impossible.
- Organizations beyond media—hospitals, travel sites, nonprofits—depend on discoverability in the AI era.
- Biases and data “holes” grow as more websites block crawlers, reducing AI’s utility and accuracy.
- Responsible open crawling, with strict adherence to robots.txt, is possible—and essential—for a healthy AI and web future.
Blocking open crawlers may feel like a safeguard, but it could erase your influence from the rapidly evolving world of AI-powered discovery. Instead, publishers should weigh the long-term consequences, focus on transparency and attribution, and support open initiatives that benefit not just large tech companies, but research, education, and innovation for all.
Full episode and in-depth discussion:
Listen to the complete conversation with Richard Skrenta on Intelligent Machines.