Perplexity Caught Scraping Blocked Websites in AI Data Trap Set by Cloudflare
A recent incident has exposed Perplexity, a rising AI startup, in a controversial data scraping operation that violated web standards and drew a public rebuke from internet infrastructure provider Cloudflare. The company, which competes with ChatGPT, Google’s Gemini, and other generative AI platforms, was caught bypassing digital barriers meant to protect websites from unauthorized access.

At the heart of the issue is the growing tension over how AI companies gather training data. High-quality data is essential for building powerful AI models, but many companies avoid paying for it by scraping content directly from the web, often without permission. The practice has drawn criticism from content creators and web operators, who argue that it undermines the incentives that helped build the open internet.

Cloudflare, which helps run about 20% of the web, stepped in after some of its customers complained that Perplexity was evading blocks designed to stop AI bots from crawling restricted pages. In response, Cloudflare set a trap: it created new, unpublished websites with no public links, search engine listings, or metadata, making them invisible to normal web traffic. The sites were protected by robots.txt files that explicitly blocked all crawlers, including PerplexityBot and Perplexity-User, the startup’s official bots.

Despite these restrictions, Cloudflare found that Perplexity’s AI service still returned detailed answers about the hidden sites, information that could only have come from direct access. According to Cloudflare, this confirmed that Perplexity had bypassed the blocks.

The startup initially crawled under its official bot identifiers, but once those were blocked, it switched to stealth tactics. Cloudflare discovered that Perplexity began deploying undeclared crawlers disguised as regular web browsers. These bots used unknown or frequently changing IP addresses and unofficial Autonomous System Numbers (ASNs), the identifiers assigned to the networks that make up the internet’s routing system. In some cases, Perplexity’s requests presented themselves as Google Chrome running on an Apple Mac, effectively impersonating an ordinary user’s browser. According to Cloudflare, Perplexity made millions of such requests daily across tens of thousands of domains. The behavior not only broke web standards but also eroded trust in the open web’s foundational rules.

By contrast, Cloudflare held up OpenAI’s approach as a model of responsible scraping. When OpenAI’s bots encounter a block, they stop immediately, with no deception and no workarounds.

As a result of its findings, Cloudflare has removed Perplexity from its list of verified bots and deployed new detection systems to block the stealth crawlers across its network. The move sends a clear message: in the race for AI dominance, violating the web’s norms carries real consequences.

The incident underscores a broader shift in how data access is being controlled. As AI companies grow more aggressive in their data gathering, platforms like Cloudflare are stepping up to enforce the rules and expose those who try to game the system. For startups like Perplexity, the lesson is clear: respect the web’s boundaries, or risk being caught and publicly named.
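
For readers who want to see how the robots.txt rules at the center of this dispute work in practice, the following is a minimal sketch using Python's standard `urllib.robotparser` module. It is an illustration only, not Cloudflare's actual test setup: the robots.txt contents, the `example.test` URL, the abbreviated Chrome-style user-agent string, and the `is_allowed` helper are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a test site: it names Perplexity's two
# declared crawlers and then disallows every other user agent too.
BLOCK_ALL = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Disallow: /
"""

# Hypothetical robots.txt that only names the declared crawler.
BLOCK_NAMED_ONLY = """\
User-agent: PerplexityBot
Disallow: /
"""

# Abbreviated, illustrative browser-style user-agent string (an assumption).
CHROME_ON_MAC = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0 Safari/537.36"


def is_allowed(user_agent: str, url: str, robots_txt: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.test/hidden-page"
    for name, rules in [("block all", BLOCK_ALL), ("block named only", BLOCK_NAMED_ONLY)]:
        for agent in ["PerplexityBot", "Perplexity-User", CHROME_ON_MAC]:
            print(f"{name!r:20} {agent[:20]!r:24} allowed={is_allowed(agent, url, rules)}")
```

Two points follow from the sketch. First, robots.txt is advisory: it only has an effect if the crawler runs a check like this one and honors the answer. Second, a rule that names a specific bot does nothing against a crawler that presents a browser-style user agent, which is why a site that wants to exclude everything relies on the wildcard rule and, beyond that, on network-level enforcement.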