Cloudflare Accuses Perplexity of Evading No-Crawl Directives

By Dan Goodin – Aug 4, 2025
Introduction
Network security and performance provider Cloudflare has publicly accused AI-powered search engine Perplexity of deploying “stealth bots” that ignore robots.txt directives and Web Application Firewall (WAF) blocks. According to a detailed Cloudflare blog post, Perplexity’s crawlers spun up unlisted IP addresses, rotated through multiple Autonomous System Numbers (ASNs), and spoofed User-Agent strings to scrape content from sites that had explicitly disallowed crawling.
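For context, a publisher wanting to opt out of Perplexity's declared crawlers would publish robots.txt rules along these lines (PerplexityBot and Perplexity-User are the agents Perplexity documents publicly; Cloudflare's allegation is that undeclared bots ignored exactly such rules):

```
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```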
Scope of the Alleged Evasion
- Observed across 10,000+ domains with millions of HTTP requests per day.
- Multiple IP ranges not published by Perplexity, rotating every 5–15 minutes.
- ASNs used included DigitalOcean, Hetzner, and smaller cloud providers to obfuscate origin.
- Stealth bots employed dynamic TLS fingerprints, altering cipher suites and ALPN extension values, to evade Cloudflare's bot-management heuristics.
“We observed a fleet of undeclared crawlers switching IPs and ASNs in real time to slip past robots.txt rules and WAF blocks. This is an unprecedented scale of evasion for an AI search service,” wrote Cloudflare senior security researchers.
Robots Exclusion Protocol: A 30-Year Standard
First proposed by Martijn Koster in 1994 and standardized as RFC 9309 under the IETF in September 2022, the Robots Exclusion Protocol (REP) allows site operators to declare crawling rules via robots.txt. Major search engine crawlers—Googlebot, Bingbot, Baiduspider—have all respected REP for decades. Cloudflare argues Perplexity’s stealth tactics contravene both the letter and spirit of this long-standing Internet norm.
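The compliance that REP expects of crawlers can be seen in miniature with Python's standard-library robots.txt parser. The bot names and rules below are illustrative, not Perplexity's actual agents or any real site's policy:

```python
from urllib import robotparser

# A compliant crawler checks robots.txt before fetching each URL.
# Here we parse an in-memory policy instead of fetching one over HTTP.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: ExampleAIBot",   # hypothetical bot name
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("ExampleAIBot", "https://example.com/article"))    # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))    # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/x"))  # False
```

A crawler that honors REP calls `can_fetch` (or equivalent logic) before every request; Cloudflare's complaint is that the undeclared bots skipped this check entirely.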
Expert Perspectives
- Cybersecurity Analyst (Alice Zhang, SANS Institute): “Rotation of IPs and ASNs, combined with User-Agent spoofing and TLS fingerprint variation, indicates a sophisticated crawler farm. This raises questions about data governance and copyright compliance.”
- Web Standards Advocate (Dr. Markus Engel, W3C): “Transparency is core to the REP. Any system that deliberately hides from robots.txt undermines trust in the open web.”
- Privacy & AI Governance Expert (Laura Kim, Future of Privacy Forum): “Under the EU AI Act and GDPR, companies must disclose data sources and respect opt-outs. Stealth crawling could expose Perplexity to regulatory scrutiny.”
Technical Deep Dive: Anatomy of a Stealth Crawler
Cloudflare’s analysis revealed multiple layers of evasion:
- IP and ASN Rotation: A pool of ~200 IPs across 6 ASNs automatically cycled to avoid blacklisting.
- Dynamic User-Agent Strings: The bots cycled through more than 50 distinct UA patterns, mimicking common browsers and lesser-known crawlers.
- TLS Jitter: By tweaking cipher suite order and extension order, requests bypassed fingerprint-based bot detection.
- Low-and-Slow Crawling: Intermittent GET intervals (10 s–2 min) to blend with human browsing patterns and evade rate limits.
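The IP/ASN rotation and User-Agent churn described above lend themselves to simple log heuristics. The sketch below (log fields and thresholds are invented for illustration; Cloudflare's actual detection pipeline is far more sophisticated) flags clients whose User-Agent or ASN diversity exceeds what a single legitimate browser would produce:

```python
from collections import defaultdict

# Toy log records: (client_ip, asn, user_agent, timestamp_seconds).
# Values are illustrative, using documentation IP ranges and private ASNs.
LOGS = [
    ("203.0.113.5", 64501, "Mozilla/5.0 (Windows NT 10.0)", 0),
    ("203.0.113.5", 64501, "Mozilla/5.0 (Macintosh)", 40),
    ("203.0.113.5", 64502, "curl/8.0", 95),
    ("203.0.113.5", 64502, "Mozilla/5.0 (X11; Linux)", 180),
    ("198.51.100.7", 64500, "Mozilla/5.0 (Windows NT 10.0)", 10),
    ("198.51.100.7", 64500, "Mozilla/5.0 (Windows NT 10.0)", 12),
]

def flag_suspects(logs, max_uas=2, max_asns=1):
    """Flag clients that rotate User-Agents or hop ASNs within one window."""
    uas, asns = defaultdict(set), defaultdict(set)
    for ip, asn, ua, _ts in logs:
        uas[ip].add(ua)
        asns[ip].add(asn)
    return {ip for ip in uas
            if len(uas[ip]) > max_uas or len(asns[ip]) > max_asns}

print(flag_suspects(LOGS))  # {'203.0.113.5'}
```

In practice such heuristics run over sliding time windows and are combined with TLS fingerprint and timing signals, since any one indicator alone produces false positives.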
Ethical and Legal Considerations
In addition to violating web norms, stealth scraping poses risks:
- Copyright Infringement: Publishers such as Forbes and Wired have alleged Perplexity copied proprietary content verbatim, potentially breaching copyright law.
- Data Privacy: GDPR and CCPA require clear notice and opt-out mechanisms for personal data processing. Secret crawlers may breach consent frameworks.
- Regulatory Action: The EU AI Act (in force since 2024, with key obligations phasing in through 2026) mandates transparency in AI training data—opaque crawling could trigger enforcement actions.
Mitigation and Compliance Strategies
Cloudflare has already updated its Managed Rulesets to include heuristics targeting Perplexity’s stealth fingerprints. Site operators can also:
- Implement rate limiting tied to behavioral anomaly detection.
- Deploy custom WAF rules filtering requests with mismatched TLS fingerprints.
- Seed honeypot URIs (paths disallowed in robots.txt) to trap unauthorized crawlers.
- Monitor logs for ASN hopping and rapid IP churn.
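The honeypot approach works because a trap path listed under Disallow in robots.txt should never be fetched by a compliant crawler, so any client requesting it can be banned outright. The path and handler below are a hypothetical sketch, not a production WAF rule:

```python
# Honeypot sketch: ban any client that requests a trap path that
# robots.txt disallows. Path name is hypothetical; a real deployment
# would hook this into the web server or WAF, not a plain function.
TRAP_PATH = "/.well-known/crawler-trap/"

blocked: set[str] = set()

def handle_request(client_ip: str, path: str) -> int:
    """Return an HTTP status code; ban clients that hit the honeypot."""
    if path.startswith(TRAP_PATH):
        blocked.add(client_ip)   # compliant bots never request this path
        return 403
    if client_ip in blocked:
        return 403
    return 200

print(handle_request("203.0.113.5", TRAP_PATH + "page"))  # 403, now banned
print(handle_request("203.0.113.5", "/index.html"))       # 403
print(handle_request("198.51.100.7", "/index.html"))      # 200
```

To avoid trapping well-behaved visitors, real honeypots keep the trap URI out of sitemaps and visible navigation, exposing it only in robots.txt and hidden links.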
Latest Developments
In late July 2025, Perplexity announced a new “Crawl Transparency” dashboard and an updated robots.txt listing of official crawler IPs. However, independent researchers at Project Guardian report continued anomalies in traffic patterns, suggesting stealth activity persists despite Perplexity’s public commitments.
Looking Ahead
As AI search engines proliferate, balancing open access to information with respect for site operators’ preferences will be critical. Industry bodies, including the IETF and W3C, are convening working groups this autumn to strengthen bot-management standards and explore digital rights for content creators.