Perplexity, Cloudflare, and the AI Web Crawler Controversy: Are AI Bots Breaking the Rules?
AI-created, human-edited.
The recent Security Now episode tackled a heated and highly relevant subject: the growing friction between website operators and AI companies over web scraping, with a spotlight on the Cloudflare vs. Perplexity situation. As generative AI becomes central to how information is served up online, the ethical, technical, and legal boundaries of AI access are being rigorously tested—and sometimes crossed.
The episode centered on Cloudflare's discovery and public accusation that Perplexity, a leading AI-powered answer engine, was allegedly evading website block rules by using stealth techniques. According to Cloudflare, Perplexity not only ignored robots.txt restrictions (the standard file used by webmasters to control which bots can crawl which parts of their sites) but also disguised its crawling identity by changing user agents and IP addresses. This behavior, Cloudflare claimed, directly disrespected sites' explicit wishes not to be scraped—especially by AI bots.
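For readers unfamiliar with the mechanism, robots.txt is just a plain-text file served from a site's root. A minimal sketch of the kind of policy at issue looks like this (the directives are standard; the specific Perplexity agent names are taken from commonly published crawler documentation and should be verified against Perplexity's current docs):

```
# https://example.com/robots.txt
# Block Perplexity's declared crawlers (agent names assumed here).
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Disallow:
```

Crucially, robots.txt is a request, not an enforcement mechanism: it only works if the crawler voluntarily reads and honors it, which is exactly the expectation Cloudflare says was violated.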
Cloudflare detailed experiments in which it created fresh, unpublished test domains, blocked all bots in robots.txt, and added web application firewall rules denying Perplexity's known crawlers; Perplexity still accessed and summarized the content. Especially notable was Perplexity's alleged use of undeclared crawlers, which sometimes impersonated ordinary browsers and rotated through various IP ranges, some not officially linked to Perplexity. According to Cloudflare, such activity was observed at scale, across thousands of domains and millions of requests per day.
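Both sides of this experiment can be sketched in a few lines of Python: the robots.txt compliance check a well-behaved crawler performs, and a user-agent block of the kind a firewall rule applies. This is a minimal illustration using the standard library's urllib.robotparser; the user-agent strings are assumptions for demonstration, not captured traffic.

```python
from urllib import robotparser

# A robots.txt like the one on Cloudflare's test domains: block everything.
ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT)

# A compliant crawler checks robots.txt before fetching -- and stops here.
assert not rp.can_fetch("PerplexityBot", "https://test.example/page")

def blocked_by_user_agent(user_agent: str) -> bool:
    """WAF-style rule: block requests whose User-Agent matches a known crawler."""
    blocked = ("perplexitybot", "perplexity-user")  # assumed agent names
    ua = user_agent.lower()
    return any(token in ua for token in blocked)

print(blocked_by_user_agent("PerplexityBot/1.0"))        # True: declared crawler caught
print(blocked_by_user_agent("Mozilla/5.0 (Macintosh)"))  # False: a crawler posing as a
                                                         # generic browser slips through
```

The last line is the crux of Cloudflare's complaint: user-agent rules only catch crawlers that identify themselves, which is why switching to a browser-like agent string reads as deliberate evasion rather than a technicality.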
Perplexity, for its part, maintains that the controversy is overblown and rooted in misunderstanding. The company argues that when a user asks Perplexity about a site, the service is acting, in effect, as a browser on the user's behalf. It stresses that these requests are not used for AI training, only for summarization and answering direct user questions, making them no different, in its view, from a user browsing the website.
Perplexity also claims that much of Cloudflare's technical analysis was flawed, particularly its attribution of web requests served by Browserbase, a third-party service Perplexity uses for occasional automated browsing. Perplexity suggests that Cloudflare overestimated both the extent and the intent of its access, and that Cloudflare's technical diagram of how Perplexity operates was inaccurate.
The discussion between hosts Steve Gibson and Leo Laporte dove deep into why this issue is so tangled:
Website Owners’ Rights: Sites, through robots.txt and other server-side tools, should have the freedom to determine who (or what) accesses their content—including the right to keep AI scrapers out.
User Versus Bot Access: Is an AI like Perplexity, when fetching and summarizing a page for an end user, fundamentally any different from a real browser? Does it matter whether the data is being directly served to the user, or used to train a broader AI model?
Legal and Ethical Concerns: Restricting AI systems from “reading” open web content raises complicated First Amendment and open web questions. Should automated tools be regulated differently than browsers used by people? Does intent (training vs. on-demand summarization) change the rules?
Economic Impact: AI answer engines reduce site visits and ad impressions, undermining websites’ revenue models—a primary reason many sites now want to block AI bots entirely.
Steve strongly felt that regardless of technical workarounds, if a site operator clearly blocks a bot—especially using mechanisms the AI company itself publishes—then circumventing that block is not just disrespectful, it violates the basic expectations of the web. He argued that Perplexity’s attempt to sidestep these barriers, especially after being directly signaled to stay away, undermines the social and technical fabric of the internet. He also compared this to OpenAI, which, the hosts noted, appears to honor blocks.
Leo acknowledged Steve’s concerns but highlighted the complexity of the situation. He pointed out that the fundamental nature of the open web is that content is available to everyone, and there's a blurry line between allowing real users with browsers and users employing various tools—including AI—to access that content. Leo questioned whether distinguishing between an automated system acting on a user's behalf (as with Perplexity) and a human with a browser is a defensible position, especially when talking about user-driven queries. He also compared this to the ongoing acceptance (or not) of ad blockers—another case of users changing how they consume site content, sometimes in defiance of the site owner’s wishes.
Leo closed by warning that as more sites build walls or require payment, and as AI models are potentially restricted from accessing the full breadth of information, the value and effectiveness of both the open web and AI systems risk being diminished. Both hosts agreed that it’s a multifaceted, unresolved issue.
This debate isn’t just a meta-conversation among tech insiders—it’s about the future openness of the web, business models for content creators, and the evolving rules of engagement in a world where AI intermediates so much of our online life.
Should webmasters have ironclad control over who accesses their content? Does the open web demand everyone—including AIs—have equal rights to read and summarize? Or do we need a new technical and legal framework that reflects the realities and stakes of the AI era?
To hear the full discussion—including technical details, historical context, and more listener perspectives—tune in to Security Now Episode 1038: "Perplexity’s Duplicity." Listen now on TWiT.tv or wherever you get your podcasts.