The Cloudflare Blog

Human Native is joining Cloudflare

Will Allen — Thu, 15 Jan 2026 14:00:00 GMT

Today, we’re excited to share that Cloudflare has acquired Human Native, a UK-based AI data marketplace specializing in transforming multimedia content into searchable and useful data.

Human Native x Cloudflare

The Human Native team has spent the past few years focused on helping AI developers create better AI through licensed data. Their technology helps publishers and developers turn messy, unstructured content into something that can be understood, licensed and ultimately valued. They have approached data not as something to be scraped, but as an asset class that deserves structure, transparency and respect.

Access to high-quality data can lead to better technical performance. One of Human Native’s customers, a prominent UK video AI company, threw away their existing training data after achieving superior results with data sourced through Human Native. Going forward they are only training on fully licensed, reputably sourced, high-quality content.

This gives a preview of what the economic model of the Internet can be in the age of generative AI: better AI built on better data, with fair control, compensation and credit for creators.

The Internet needs new economic models

For the last 30 years, the open Internet has been based on a fundamental value exchange: creators create content, aggregators (such as search engines or social media) send traffic. Creators can monetize that traffic through advertisements, subscriptions or direct support. This is the economic loop that has powered the explosive growth of the Internet.

But it’s under real strain.

Crawl-to-referral ratios are skyrocketing, with tens of thousands of AI and bot crawls per real human visitor, and it’s unclear how multipurpose crawlers are using the content they access.

The community of creators who publish on the Internet is a diverse group: news publishers, content creators, financial professionals, technology companies, aggregators and more. But they have one thing in common: They want to decide how their content is used by AI systems.

Cloudflare’s work in building AI Crawl Control and Pay Per Crawl is predicated on a simple philosophy: Content owners should get to decide how and when their content is accessed by others. Many of our customers want to optimize their brand and content to make sure it is in every training data set and shows up in every new search; others want to have more control and only allow access if there is direct compensation.

Our tools like AI Search, AI Crawl Control and Pay Per Crawl can help, wherever you land in that equation. The important thing is that the content owner gets to decide.

New tools for AI developers

With the Human Native team joining Cloudflare, we are accelerating our work in helping customers transform their content to be easily accessed and understood by AI bots and agents in addition to their traditional human audiences.

Crawling is complex, expensive in terms of engineering and compute to process the content, and has no guarantees of quality control. A crawled index can contain duplicates, spam, illegal material and many more headaches. Developers are left with messy, unstructured data.

We recently announced our work in building the AI Index, a powerful new way for both foundation model companies and agents to access content at scale.

Instead of sending crawlers blindly and repeatedly across the open Internet, AI developers will be able to connect via a pub/sub model: participating websites will expose structured updates whenever their content changes, and developers will be able to subscribe to receive those updates in real time.
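As a rough illustration of the pub/sub idea, the following minimal in-memory sketch shows a site publishing a structured update and a subscriber receiving it. The names `ContentIndex`, `ContentUpdate`, `subscribe` and `publish` are hypothetical and are not part of any Cloudflare API; the real AI Index interface may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ContentUpdate:
    """Hypothetical structured update emitted when a page changes."""
    site: str
    url: str
    summary: str


class ContentIndex:
    """Toy in-memory broker: subscribers register per site, publishers push updates."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[ContentUpdate], None]]] = {}

    def subscribe(self, site: str, callback: Callable[[ContentUpdate], None]) -> None:
        """Register a callback that is invoked for every update from `site`."""
        self._subscribers.setdefault(site, []).append(callback)

    def publish(self, update: ContentUpdate) -> int:
        """Deliver the update to every subscriber of its site; return the delivery count."""
        callbacks = self._subscribers.get(update.site, [])
        for callback in callbacks:
            callback(update)
        return len(callbacks)


received: List[ContentUpdate] = []
index = ContentIndex()
index.subscribe("example.com", received.append)
delivered = index.publish(
    ContentUpdate("example.com", "https://example.com/post", "New article published")
)
```

The point of the design is that the subscriber does no polling: it only does work when the publisher signals a change, which is the contrast with repeated blind crawling.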

This opens up new avenues for content creators to experiment with new business models.

Building the foundation for these new business models

Cloudflare is investing heavily in creating the foundations for these new business models, starting with x402.

We recently announced that we are creating the x402 Foundation, in partnership with Coinbase, to enable machine-to-machine transactions for digital resources.

Payments on the web have historically been designed for humans. We browse a merchant’s website, show intent by adding items to a cart, and confirm our intent to purchase by putting in our credit card information and clicking “Pay.” But what if you want to enable direct transactions between automated systems? We need protocols to allow machine-to-machine transactions.
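To make the machine-to-machine idea concrete, here is a small simulation of an automated client that is quoted a price and decides whether to pay. Only the HTTP status code 402 (Payment Required) is real; the `Response` fields, the flat price, and the `serve` and `fetch_with_budget` functions are hypothetical, and this sketch does not reproduce the x402 wire format.

```python
from dataclasses import dataclass
from typing import Optional

PAYMENT_REQUIRED = 402  # HTTP status code "Payment Required"
OK = 200


@dataclass
class Response:
    status: int
    body: str = ""
    price_cents: Optional[int] = None  # hypothetical field: the quoted price


def serve(path: str, payment_cents: int = 0) -> Response:
    """Toy origin: quotes a price with 402 until the request carries enough payment."""
    price = 5  # hypothetical flat price per resource, in cents
    if payment_cents < price:
        return Response(PAYMENT_REQUIRED, price_cents=price)
    return Response(OK, body=f"content of {path}")


def fetch_with_budget(path: str, budget_cents: int) -> Response:
    """Toy agent: request once, and retry with payment only if the quote fits the budget."""
    first = serve(path)
    if first.status != PAYMENT_REQUIRED:
        return first
    if first.price_cents is None or first.price_cents > budget_cents:
        return first  # too expensive: give up without paying
    return serve(path, payment_cents=first.price_cents)


paid = fetch_with_budget("/article", budget_cents=10)
declined = fetch_with_budget("/article", budget_cents=1)
```

No human clicks “Pay” anywhere in this loop: the intent to purchase is expressed by the agent's budget, and the confirmation is the retried request.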

Together, Human Native and Cloudflare will accelerate our work in building the basis of these new economic models for the Internet.

What’s next

The Internet works best when it is open, fair, and independently sustainable. We’re excited to welcome the Human Native team to Cloudflare, and even more excited about what we will build together to improve the foundations of the Internet in the age of AI.

Onwards.

To build a better Internet in the age of AI, we need responsible AI bot principles. Here’s our proposal.

Leah Romm — Wed, 24 Sep 2025 13:00:00 GMT

Cloudflare has a unique vantage point: we see not only how changes in technology shape the Internet, but also how new technologies can unintentionally impact different stakeholders. Take, for instance, the increasing reliance by everyday Internet users on AI–powered chatbots and search summaries. On the one hand, end users are getting information faster than ever before. On the other hand, web publishers, who have historically relied on human eyeballs to their website to support their businesses, are seeing a dramatic decrease in those eyeballs, which can reduce their ability to create original high-quality content. This cycle will ultimately hurt end users and AI companies (whose success relies on fresh, high-quality content to train models and provide services) alike.

We are indisputably at a point in time when the Internet needs clear “rules of the road” for AI bot behavior (a note on terminology: throughout this blog we refer to AI bots and crawlers interchangeably). We have had ongoing cross-functional conversations, both internally and with stakeholders and partners across the world, and it’s clear to us that the Internet at large needs key groups — publishers and content creators, bot operators, and Internet infrastructure and cybersecurity companies — to reach a consensus on certain principles that AI bots should follow.

Of course, agreeing on what exactly those principles are will take time and require continued discussion and collaboration, and a policy framework can’t perfectly capture every technical concern. Nevertheless, we think it’s important to start a conversation that we hope others will join. After all, a rough draft is better than a blank page.

That is why we are proposing the following responsible AI bot principles as starting points:

Public disclosure: Companies should publicly disclose information about their AI bots;
Self-identification: AI bots should truthfully self-identify, eventually replacing less reliable methods, like user agent and IP address verification, with cryptographic verification;
Declared single purpose: AI bots should have one distinct purpose and declare it;
Respect preferences: AI bots should respect and comply with preferences expressed by website operators where proportionate and technically feasible;
Act with good intent: AI bots must not flood sites with excessive traffic or engage in deceptive behavior.

Each principle is discussed in greater detail below. These principles focus on AI bots because of the impact generative AI is having on the Internet, but we have already seen these practices in action with other types of (non-AI) bots as well. We believe these principles will help move the Internet in a better direction. That said, we acknowledge that they are a starting point for this conversation, which requires input from other stakeholders. The Internet has always been a collaborative place for innovation, and these principles should be seen as equally dynamic and evolving.

Why Cloudflare is encouraging this conversation

Since declaring July 1st Content Independence Day, Cloudflare has strived to play a balanced and effective role in safeguarding the future of the Internet in the age of generative AI. We have enabled customers to charge AI crawlers for access or block them with one click, published and enforced our verified bots policy and developed the Web Bot Auth proposal, and unapologetically called out and stopped bad behavior.

While we have recently focused our attention on AI crawlers, Cloudflare has long been a leader in the bot management space, helping our customers protect their websites from unwanted, and even malicious, traffic. We also want to make sure that anyone, whether they’re our customer or not, can see which AI bots are abiding by all, some, or none of these best practices.

But we aren’t ignorant to the fact that companies operating crawlers are also adapting to a new Internet landscape — and we genuinely believe that most players in this space want to do the right thing, while continuing to innovate and propel the Internet in an exciting direction. Our hope is that we can use our expertise and unique vantage point on the Internet to help bring seemingly incompatible parties together and find a path forward — continuing our mission of helping to build a better Internet for everyone.

Responsible AI bot principles

The following principles are a launchpad for a larger conversation, and we recognize that there is work to be done to address many nuanced perspectives. We envision these principles applying to AI bots but understand that technical complexity may require flexibility. Ultimately, our goal is to emphasize transparency, accountability, and respect for content access and use preferences. If these principles fall short of that — or fail to consider other important priorities — we want to know.

Principle #1: Public disclosure

Companies should publicly disclose information about their AI bots. The following information should be publicly available and easy to find:

Identity: information that helps external parties identify a bot, e.g., user agent, relevant IP address(es), and/or individual cryptographic identification (more on this below, in Principle #2: Self-identification).
Operator: the legal entity responsible for the AI bot, including a point of contact (e.g., for reporting abuse);
Purpose: for which purpose the accessed data will be used, i.e., search, AI-input, or training (more on this below, in Principle #3: Declared Single Purpose).

OpenAI is an example of a leading AI company that clearly discloses their bots, complete with detailed explanations of each bot’s purpose. The benefits of this disclosure are apparent in the subsequent principles. It helps website operators validate that a given request is in fact coming from OpenAI and what its purpose is (e.g., search indexing or AI model training). This, in turn, enables website operators to control access to and use of their content through preference expression mechanisms, like robots.txt files.

Principle #2: Self-identification

AI bots should truthfully self-identify. Not only should information about bots be disclosed in a publicly accessible location, this information should also be clearly communicated by bots themselves, e.g., through an HTTP request that conveys the bot’s official user agent and comes from an IP address that the bot claims to send traffic from. Admittedly, this current approach is flawed, as we discuss in more detail below. But until cryptographic verification is more widely adopted, we think relying on user agent and IP verification is better than nothing.

OpenAI’s GPTBot is an example of this principle in action. OpenAI publicly shares the expected full user-agent string for this bot and includes it in its requests. OpenAI also explains this bot’s purpose (“used to make [OpenAI’s] generative AI foundation models more useful and safe” and “to crawl content that may be used in training [their] generative AI foundation models”). And we have observed this bot sending traffic from IP addresses reported by OpenAI. Because site operators see GPTBot’s user agent and IP addresses matching what is publicly disclosed and expected, and they know information about the bot is publicly documented, they can confidently recognize the bot. This enables them to make informed decisions about whether they want to allow traffic from it.
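A site operator's check along these lines could be sketched as follows. The bot name `ExampleBot/1.0` and the IP ranges are placeholders (the ranges are reserved for documentation), not values published by any real operator; in practice both would come from the operator's public documentation.

```python
import ipaddress
from typing import Iterable

# Hypothetical published values for an imaginary bot; real operators publish their own.
PUBLISHED_USER_AGENT_TOKEN = "ExampleBot/1.0"
PUBLISHED_IP_RANGES = ["192.0.2.0/24", "2001:db8::/32"]  # documentation-only ranges


def matches_published_identity(
    user_agent: str,
    source_ip: str,
    ua_token: str = PUBLISHED_USER_AGENT_TOKEN,
    ip_ranges: Iterable[str] = PUBLISHED_IP_RANGES,
) -> bool:
    """Return True only if both the user agent and the source IP match what is published."""
    if ua_token not in user_agent:
        return False
    address = ipaddress.ip_address(source_ip)
    # Membership tests across IPv4/IPv6 simply evaluate to False, so mixed lists are safe.
    return any(address in ipaddress.ip_network(ip_range) for ip_range in ip_ranges)
```

Requiring both signals matters: a user agent string alone can be copied by anyone, so a request that claims the bot's name but arrives from an unlisted address is rejected.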

Unfortunately, not all bots uphold this principle, making it difficult for website owners to know exactly which bot operators respect their crawl preferences, much less enforce them. For example, Anthropic publishes only its user agent; absent other verifiable information, it’s unclear which requests are truly from Anthropic. And xAI’s bot, grok, does not self-identify at all, making it impossible for website operators to block it. Anthropic’s and xAI’s lack of identification undermines trust between them and website owners, yet this could be fixed with minimal effort on their part.

A note on cryptographic verification and the future of Principle #2

Truthful declaration of user agent and dedicated IP lists have historically been a functional way to verify a bot’s identity. But in today’s rapidly evolving bot climate, bots are increasingly vulnerable to being spoofed by bad actors. These bad actors, in turn, ignore robots.txt, which communicates allow/disallow preferences only on a user agent basis (so, a bad bot could spoof a permitted user agent and circumvent that domain’s preferences).

Ultimately, every AI bot should be cryptographically verified using an accepted standard. This would protect them against spoofing and ensure website operators have the accurate and reliable information they need to properly evaluate access by AI bots. At this time, we believe that Web Bot Auth is sufficient proof of compliance with Principle #2. We recognize that this standard is still in development, and, as a result, this principle may evolve accordingly.

Web Bot Auth uses cryptography to verify bot traffic; cryptographic signatures in HTTP messages are used as verification that a given request came from an automated bot. Our implementation relies on proposed IETF directory and protocol drafts. Initial reception of Web Bot Auth has been very positive, and we expect even more adoption. For example, a little over a month ago, Vercel announced that its bot verification now supports Web Bot Auth. And OpenAI’s ChatGPT agent now signs its requests using Web Bot Auth, in addition to using the HTTP Message Signatures standard.
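The general shape of signing and verifying a request can be sketched with Python's standard library. This is a deliberate simplification and not the Web Bot Auth protocol: it substitutes HMAC-SHA256 with a shared secret for the asymmetric signatures a public verification scheme needs (a verifier would normally check against a published public key), and the `signature_base` layout is a hypothetical stand-in rather than the format defined by HTTP Message Signatures (RFC 9421).

```python
import hashlib
import hmac


def signature_base(method: str, authority: str, path: str, created: int) -> bytes:
    """Build the byte string that gets signed (simplified, not the RFC 9421 layout)."""
    lines = [
        f"method: {method.upper()}",
        f"authority: {authority.lower()}",
        f"path: {path}",
        f"created: {created}",
    ]
    return "\n".join(lines).encode("utf-8")


def sign_request(key: bytes, method: str, authority: str, path: str, created: int) -> str:
    """Return a hex signature over the request components (HMAC used as a stand-in)."""
    base = signature_base(method, authority, path, created)
    return hmac.new(key, base, hashlib.sha256).hexdigest()


def verify_request(
    key: bytes,
    method: str,
    authority: str,
    path: str,
    created: int,
    signature: str,
    now: int,
    max_age_seconds: int = 300,
) -> bool:
    """Accept only fresh requests whose signature matches the covered components."""
    if not 0 <= now - created <= max_age_seconds:
        return False  # stale or future-dated signatures are rejected
    expected = sign_request(key, method, authority, path, created)
    return hmac.compare_digest(expected, signature)
```

Because the signature covers the authority and path, a captured signature cannot be replayed against a different site or resource, and the `created` timestamp bounds how long it stays valid. That is what a bare user agent string cannot offer.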

We envision a future where cryptographic authentication becomes the norm, as we believe this will further strengthen the trustworthiness of bots.

Principle #3: Declared single purpose

AI bots should have one distinct purpose and declare it. Today, some bots self-identify their purpose as Training, Search, or User Action (i.e., accessing a web page in response to a user’s query).

However, these purposes are sometimes combined without clear distinction. For example, content accessed for search purposes might also be used to train the AI model powering the search engine. When a bot’s purpose is unclear, website operators face a difficult decision: block it and risk undermining search engine optimization (SEO), or allow it and risk content being used in unwanted ways.

When operators deploy bots with distinct purposes, website owners are able to make clear decisions over who can access their content. What those purposes should be is up for debate, but we think the following breakdown is a starting point based on bot activity we see. We recognize this is an evolving space and changes may be required as innovation continues:

Search: building a search index and providing search results (e.g., returning hyperlinks and short excerpts from your website’s contents). Search does not include providing AI-generated search summaries;
AI-input: inputting content into one or more AI models, e.g., retrieval-augmented generation (RAG), grounding, or other real-time taking of content for generative AI search answers; and
Training: training or fine-tuning AI models.

Relatedly, bots should not combine purposes in a way that prevents web operators from deliberately and effectively deciding whether to allow crawling.

Let’s consider two AI bots, OAI-SearchBot and Googlebot, from the perspective of Vinny, a website operator trying to make a living on the Internet. OAI-SearchBot has a single purpose: linking to and surfacing websites in ChatGPT’s search features. If Vinny takes OpenAI at face value (which we think it makes sense to do), he can trust that OAI-SearchBot does not crawl his content for training OpenAI’s generative AI models; rather, a separate bot (GPTBot, as discussed in Principle #2: Self-identification) does. Vinny can decide how he wants his content used by OpenAI, e.g., permitting its use for search but not for AI training, and feel confident that his choices are respected because OAI-SearchBot only crawls for search purposes, while GPTBot is not granted access to the content in the first place (and therefore cannot use it).
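Vinny's choice can be expressed in a robots.txt file and checked with Python's standard `urllib.robotparser`. The bot names come from the discussion above; the domain `vinny.example` and the exact rules are illustrative assumptions, not a recommended configuration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: allow the search bot, refuse the training bot.
VINNY_ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(VINNY_ROBOTS_TXT.splitlines())

page = "https://vinny.example/recipes"  # hypothetical URL
search_allowed = parser.can_fetch("OAI-SearchBot", page)
training_allowed = parser.can_fetch("GPTBot", page)
```

This only works because each purpose has its own user agent. A bot that combined search and training under one name would leave Vinny a single allow-or-disallow switch for both.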

On the other hand, while Googlebot scrapes content for traditional search-indexing (not model training), it also uses that content for inference purposes, such as for AI Overviews and AI Mode. Why is this a problem for Vinny? While he almost certainly wants his content appearing in search results, which drive the human eyeballs that fund his site, Vinny is forced to also accept that his content will appear in Google’s AI-generated summaries. If eyeballs are satisfied by the summary then they never visit Vinny’s website, which leads to “zero-click” searches and undermines Vinny’s ability to financially benefit from his content.

This is a vicious cycle: creating high-quality content, which typically leads to higher search rankings, now inadvertently also reduces the chances an eyeball will visit the site because that same valuable content is surfaced in an AI Overview (if it is even referenced as a source in the summary). To prevent this, Vinny must either opt out of search completely or use snippet controls (which risks degrading how his content appears in search results). This is because the only available signal to opt-out of AI, disallowing Google-Extended, is limited to training and does not apply to AI Overview, which is attached to search. Whether by accident or by design, this setup forces an impossible choice onto website owners.

Finally, the prominent technical argument in favor of combining multiple purposes — that this reduces the crawler operator’s costs — needs to be debunked. To reason by analogy: it’s like arguing that placing one call to order two pizzas is cheaper than placing two calls to order two pizzas. In reality, the cost of the two pizzas (both of which take time and effort to make) remains the same. The extra phone call may be annoying, but its costs are negligible.

Similarly, whether one bot request is made for two purposes (e.g., search indexing and AI model training) or a separate bot request is made for each of two purposes, the costs basically remain the same. For the crawler, the cost of compute is the same because the content still needs to be processed for each purpose. And the cost of two connections (i.e., for two requests) is virtually the same as one. We know this because Cloudflare runs one of the largest networks in the world, handling on average 84 million requests per second, so we understand the cost of requests at Internet scale. (As an aside, while additional crawls incur costs on website operators, they have the ability to choose whether the crawl is worth the cost, especially when bots have a single purpose.)

Principle #4: Respect preferences

AI bots should respect and comply with preferences expressed by website operators where proportionate and technically feasible. There are multiple options for expressing preferences. Prominent examples include the longstanding and familiar robots.txt, as well as newly emerging HTTP headers.

Given the widespread use of robots.txt files, bots should make a good faith attempt to fetch a robots.txt file first, in accordance with RFC 9309, and abide by both the access and use preferences specified therein. AI bot operators should also stay up to date on how those preferences evolve as a result of a draft vocabulary currently under development by an IETF working group. The goal of the proposed vocabulary is to improve granularity in robots.txt files, so that website operators are empowered to control how their assets are used.
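From the bot's side, a good-faith fetch also has to handle the case where robots.txt cannot be retrieved. The sketch below follows the RFC 9309 guidance that an unavailable file (4xx) leaves access unrestricted, while an unreachable one (5xx) must be treated as a complete disallow. The `build_policy` helper is a hypothetical name, redirects are assumed to have been followed already by the HTTP client, and any other status is conservatively treated as disallow.

```python
from urllib.robotparser import RobotFileParser


def build_policy(status: int, body: str = "") -> RobotFileParser:
    """Turn the result of fetching /robots.txt into an access policy."""
    parser = RobotFileParser()
    if 200 <= status < 300:
        parser.parse(body.splitlines())  # normal case: obey the published rules
    elif 400 <= status < 500:
        parser.allow_all = True  # "unavailable": no restrictions are assumed
    else:
        parser.disallow_all = True  # "unreachable" or unexpected: stay out
    return parser


policy = build_policy(200, "User-agent: *\nDisallow: /private\n")
can_read_public = policy.can_fetch("ExampleBot", "https://example.com/public")
can_read_private = policy.can_fetch("ExampleBot", "https://example.com/private/report")
```

Note that this covers only access preferences; use preferences from the draft vocabulary mentioned above are not modeled here.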

At the same time, new industry standards under discussion may involve the attachment of machine-readable preferences to different formats, such as individual files. AI bot operators should eventually be prepared to comply with these standards, too. One idea currently being explored is a way for site owners to list preferences via HTTP headers, which offer a server-level method of declaring how content should be used.

Principle #5: Act with good intent

AI bots must not flood sites with excessive traffic or engage in deceptive behavior. AI bot behavior should be benign or helpful to website operators and their users. It is also incumbent on companies that operate AI bots to monitor their networks and resources for breaches and patch vulnerabilities. Jeopardizing a website’s security or performance or engaging in harmful tactics is unacceptable.

Nor is it appropriate to appear to comply with the principles, only to secretly circumvent them. Reaffirming a long-standing principle of acceptable bot behavior, AI bots must never engage in stealth crawling or use other stealth tactics to try and dodge detection, such as modifying their user agent, changing their source ASNs to hide their crawling activity, or ignoring robots.txt files. Doing so would undermine the preceding four principles, hurting website operators and worsening the Internet for all.
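One common way for a crawler to avoid flooding a site is to throttle its own requests per host. The token bucket below is a minimal sketch of that idea; the rate and capacity are arbitrary example values rather than recommended limits, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, then at most `rate_per_second` requests."""

    def __init__(self, rate_per_second: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity  # start full so the first requests are not delayed
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if one is available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_second=1.0, capacity=2.0)
burst = [bucket.try_acquire(now=0.0) for _ in range(3)]  # third request exceeds the burst
after_wait = bucket.try_acquire(now=1.0)  # one second later, one token has refilled
```

A crawler would keep one bucket per target host and simply skip or delay a request whenever `try_acquire` returns `False`.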

The road ahead: multi-stakeholder efforts to bring these principles to life

As we continue working on these principles and soliciting feedback, we strive to find a balance: we want the wishes of content creators respected while still encouraging AI innovation. It’s a privilege to sit at the intersection of these important interests and to play a crucial role in developing an agreeable path forward.

We are continuing to engage with right holders, AI companies, policy-makers, and regulators to shape global industry standards and regulatory frameworks accordingly. We believe that the influx of generative AI use need not threaten the Internet’s place as an open source of quality content. Protecting its integrity requires agreement on workable technical standards that reflect the interests of web publishers, content creators, and AI companies alike.

The whole ecosystem must continue to come together and collaborate towards a better Internet that truly works for everyone. Cloudflare advocates for neutral forums where all affected parties can discuss the impact of AI developments on the Internet. One such example is the IETF, which has current work focused on some of the technical aspects being considered. Those efforts attempt to address some, but not all, of the issues in an area that deserves holistic consideration. We believe the principles we have proposed are a step in the right direction — but we hope others will join this complex and important conversation, so that norms and behavior on the Internet can successfully adapt to this exciting new technological age.

AI Week 2025: Recap

Kenny Johnson — Wed, 03 Sep 2025 14:00:00 GMT

How do we embrace the power of AI without losing control?

That was one of our big themes for AI Week 2025, which has now come to a close. We announced products, partnerships, and features to help companies successfully navigate this new era.

Everything we built was based on feedback from customers like you who want to get the most out of AI without sacrificing control and safety. Over the next year, we will double down on our efforts to deliver world-class features that augment and secure AI. Please keep an eye on our Blog, AI Avenue, Product Change Log and CloudflareTV for more announcements.

This week we focused on four core areas to help companies deliver AI experiences safely and securely:

Securing AI environments and workflows
Protecting original content from misuse by AI
Helping developers build world-class, secure, AI experiences
Making Cloudflare better for you with AI

Thank you for following along with our first ever AI week at Cloudflare. This recap blog will summarize each announcement across these four core areas. For more information, check out our “This Week in NET” recap episode also featured at the end of this blog.

Securing AI environments and workflows

These posts and features focused on helping companies control and understand their employees’ usage of AI tools.

Blog	Recap
Beyond the ban: A better way to secure generative AI applications	Generative AI tools present a trade-off of productivity and data risk. Cloudflare One’s new AI prompt protection feature provides the visibility and control needed to govern these tools, allowing organizations to confidently embrace AI.
Unmasking the Unseen: Your Guide to Taming Shadow AI with Cloudflare One	Don't let "Shadow AI" silently leak your data to unsanctioned AI. This new threat requires a new defense. Learn how to gain visibility and control without sacrificing innovation.
Introducing Cloudflare Application Confidence Score For AI Applications	Cloudflare will provide confidence scores within our application library for Gen AI applications, allowing customers to assess their risk for employees using shadow IT.
ChatGPT, Claude, & Gemini security scanning with Cloudflare CASB	Cloudflare CASB now scans ChatGPT, Claude, and Gemini for misconfigurations, sensitive data exposure, and compliance issues, helping organizations adopt AI with confidence.
Securing the AI Revolution: Introducing Cloudflare MCP Server Portals	Cloudflare MCP Server Portals are now available in Open Beta. MCP Server Portals are a new capability that enable you to centralize, secure, and observe every MCP connection in your organization.
Best Practices for Securing Generative AI with SASE	This guide provides best practices for Security and IT leaders to securely adopt generative AI using Cloudflare’s SASE architecture as part of a strategy for AI Security Posture Management (AI-SPM).

Protecting original content from misuse by AI

Cloudflare is committed to helping content creators control access to their original work. These announcements focused on analysis of what we’re currently seeing on the Internet with respect to AI bots and crawlers and significant improvements to our existing control features.

Blog	Recap
A deeper look at AI crawlers: breaking down traffic by purpose and industry	We are extending AI-related insights on Cloudflare Radar with new industry-focused data and a breakdown of bot traffic by purpose, such as training or user action.
The age of agents: cryptographically recognizing agent traffic	Cloudflare now lets websites and bot creators use Web Bot Auth to segment agents from verified bots, making it easier for customers to allow or disallow the many types of user and partner directed.
Make Your Website Conversational for People and Agents with NLWeb and AutoRAG	With NLWeb, an open project by Microsoft, and Cloudflare AutoRAG, conversational search is now a one-click setup for your website.
The next step for content creators in working with AI bots: Introducing AI Crawl Control	Cloudflare launches AI Crawl Control (formerly AI Audit) and introduces easily customizable 402 HTTP responses.
The crawl-to-click gap: Cloudflare data on AI bots, training, and referrals	By mid-2025, training drives nearly 80% of AI crawling, while referrals to publishers (especially from Google) are falling and crawl-to-refer ratios show AI consumes far more than it sends back.

Helping developers build world-class, secure, AI experiences

At Cloudflare we are committing to building the best platform to build AI experiences, all with security by default.

Blog	Recap
AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint	AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint.
How we built the most efficient inference engine for Cloudflare’s network	Infire is an LLM inference engine that employs a range of techniques to maximize resource utilization, allowing us to serve AI models more efficiently with better performance for Cloudflare workloads.
State-of-the-art image generation Leonardo models and text-to-speech Deepgram models now available in Workers AI	We're expanding Workers AI with new partner models from Leonardo.Ai and Deepgram. Start using state-of-the-art image generation models from Leonardo and real-time TTS and STT models from Deepgram.
How Cloudflare runs more AI models on fewer GPUs: A technical deep-dive	Cloudflare built an internal platform called Omni. This platform uses lightweight isolation and memory over-commitment to run multiple AI models on a single GPU.
Cloudflare Launching AI Miniseries for Developers (and Everyone Else They Know)	In AI Avenue, we address people’s fears, show them the art of the possible, and highlight the positive human stories where AI is augmenting — not replacing — what people can do. And yes, we even let people touch AI themselves.
Block unsafe prompts targeting your LLM endpoints with Firewall for AI	Cloudflare's AI security suite now includes unsafe content moderation, integrated into the Application Security Suite via Firewall for AI.
Cloudflare is the best place to build realtime voice agents	Today, we're excited to announce new capabilities that make it easier than ever to build real-time, voice-enabled AI applications on Cloudflare's global network.

Making Cloudflare better for you with AI

Finding an issue in Cloudflare logs and analytics can often be a needle-in-a-haystack challenge. AI helps surface and alert on issues that need attention or review. Instead of spending hours sifting and searching for an issue, a human can focus on action and remediation while AI does the sifting.

Blog	Excerpt
Evaluating image segmentation models for background removal for Images	An inside look at how the Images team compared dichotomous image segmentation models to identify and isolate subjects in an image from the background.
Automating threat analysis and response with Cloudy	Cloudy now supercharges analytics investigations and Cloudforce One threat intelligence! Get instant insights from threat events and APIs on APTs, DDoS, cybercrime & more - powered by Workers AI!
Cloudy Summarizations of Email Detections: Beta Announcement	We're now leveraging our internal LLM, Cloudy, to generate automated summaries within our Email Security product, helping SOC teams better understand what's happening within flagged messages.
Troubleshooting network connectivity and performance with Cloudflare AI	Troubleshoot network connectivity issues by using Cloudflare AI-Power to quickly self diagnose and resolve WARP client and network issues.

We thank you for following along this week — and please stay tuned for exciting announcements coming during Cloudflare’s 15th birthday week in September!

Check out the full video recap, featuring insights from Kenny Johnson and host João Tomé, in our special This Week in NET episode (ThisWeekinNET.com) covering everything announced during AI Week 2025.

Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives

Gabriel Corral — Mon, 04 Aug 2025 13:00:00 GMT

We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.

The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Based on Perplexity’s observed behavior, which is incompatible with those preferences, we have de-listed them as a verified bot and added heuristics to our managed rules that block this stealth crawling.

How we tested

We received complaints from customers who had both disallowed Perplexity crawling activity in their robots.txt files and also created WAF rules to specifically block both of Perplexity’s declared crawlers: PerplexityBot and Perplexity-User. These customers told us that Perplexity was still able to access their content even when they saw its bots successfully blocked. We confirmed that Perplexity’s crawlers were in fact being blocked on the specific pages in question, and then performed several targeted tests to confirm what exact behavior we could observe.

We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:
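As a minimal sketch of what such a blanket disallow means to a compliant crawler, the snippet below parses a two-line robots.txt with Python's standard `urllib.robotparser`. The file contents are the conventional "disallow everything" form and are assumed here, not copied from the test domains; the helper name `is_allowed` is illustrative.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents: a blanket disallow that applies to every user agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("PerplexityBot", "Perplexity-User", "SomeOtherBot"):
        allowed = is_allowed(ROBOTS_TXT, agent, "https://testexample.com/any/page")
        print(f"{agent}: allowed={allowed}")
```

With this file in place, a crawler that honors robots.txt has no permitted path on the domain, whatever name it declares.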

We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

Obfuscating behavior observed

Bypassing Robots.txt and undisclosed IPs/User Agents

Our test domains explicitly prohibited all automated access in robots.txt and had specific WAF rules that blocked crawling from Perplexity’s public crawlers. We observed that Perplexity uses not only their declared user-agent, but also a generic browser user-agent intended to impersonate Google Chrome on macOS when their declared crawler was blocked.

Declared	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)	20-25m daily requests
Stealth	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36	3-6m daily requests
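The two user agents in the table above show why a block keyed on the declared crawler names cannot catch the second one. The following is a minimal sketch of such a string-matching check; the function name and token list are illustrative and do not represent Cloudflare's WAF implementation.

```python
# Tokens for the two crawlers Perplexity declares publicly.
DECLARED_TOKENS = ("PerplexityBot", "Perplexity-User")

DECLARED_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)"
)
STEALTH_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def matches_declared_crawler(user_agent: str) -> bool:
    """True if the user agent names one of the declared crawlers."""
    lowered = user_agent.lower()
    return any(token.lower() in lowered for token in DECLARED_TOKENS)


if __name__ == "__main__":
    print("declared UA matched:", matches_declared_crawler(DECLARED_UA))
    print("stealth UA matched:", matches_declared_crawler(STEALTH_UA))
```

The stealth string is indistinguishable from an ordinary Chrome browser at this level, which is why identifying it requires signals beyond the user agent.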

Both their declared and undeclared crawlers were attempting to access the content for scraping contrary to the web crawling norms as outlined in RFC 9309.

This undeclared crawler utilized multiple IPs not listed in Perplexity’s official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare. In addition to rotating IPs, we observed requests coming from different ASNs in attempts to further evade website blocks. This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.

An example:

Of note: when the stealth crawler was successfully blocked, we observed that Perplexity uses other data sources — including other websites — to try to create an answer. However, these answers were less specific and lacked details from the original content, reflecting the fact that the block had been successful.

How well-meaning bot operators respect website preferences

In contrast to the behavior described above, the Internet has expressed clear preferences on how good crawlers should behave. All well-intentioned crawlers acting in good faith should:

Be transparent. Identify themselves honestly, using a unique user-agent, a declared list of IP ranges or Web Bot Auth integration, and provide contact information if something goes wrong.

Be well-behaved netizens. Don’t flood sites with excessive traffic, scrape sensitive data, or use stealth tactics to try and dodge detection.

Serve a clear purpose. Whether it’s powering a voice assistant, checking product prices, or making a website more accessible, every bot has a reason to be there. The purpose should be clearly and precisely defined and easy for site owners to look up publicly.

Separate bots for separate activities. Perform each activity from a unique bot. This makes it easy for site owners to decide which activities they want to allow. Don’t force site owners to make an all-or-nothing decision.

Follow the rules. That means checking for and respecting website signals like robots.txt, staying within rate limits, and never bypassing security protections.

More details are outlined in our official Verified Bots Policy Developer Docs.

OpenAI is an example of a leading AI company that follows these best practices. They clearly outline their crawlers and give detailed explanations for each crawler’s purpose. They respect robots.txt and do not try to evade either a robots.txt directive or a network-level block. And ChatGPT Agent is signing HTTP requests using the newly proposed open standard Web Bot Auth.

When we ran the same test as outlined above with ChatGPT, we found that ChatGPT-User fetched the robots file and stopped crawling when it was disallowed. We did not observe follow-up crawls from any other user agents or third party bots. When we removed the disallow directive from the robots entry, but presented ChatGPT with a block page, they again stopped crawling, and we saw no additional crawl attempts from other user agents. Both of these demonstrate the appropriate response to website owner preferences.
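The behavior described in this test can be sketched as a small decision function: stop when robots.txt disallows the URL, stop when the site answers with a block, and never retry under another identity. This is an illustration only; the function name, the return values, and the set of status codes treated as a block are assumptions, not a description of how ChatGPT-User is implemented.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Assumption: these HTTP statuses are treated as "the site said no".
BLOCK_STATUSES = {401, 403}


def next_action(robots_txt: str, user_agent: str, url: str,
                last_status: Optional[int] = None) -> str:
    """Return "fetch" or "stop" for a crawler acting in good faith."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return "stop"  # disallowed by robots.txt
    if last_status in BLOCK_STATUSES:
        return "stop"  # network-level block: do not retry in disguise
    return "fetch"


if __name__ == "__main__":
    disallow_all = "User-agent: *\nDisallow: /\n"
    allow_all = "User-agent: *\nAllow: /\n"
    url = "https://testexample.com/page"
    print(next_action(disallow_all, "ChatGPT-User", url))
    print(next_action(allow_all, "ChatGPT-User", url, last_status=403))
    print(next_action(allow_all, "ChatGPT-User", url))
```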

How can you protect yourself?

All the undeclared crawling activity that we observed from Perplexity’s hidden User Agent was scored by our bot management system as a bot and was unable to pass managed challenges. Any bot management customer who has an existing block rule in place is already protected. Customers who don’t want to block traffic can set up rules to challenge requests, giving real humans an opportunity to proceed. Customers with existing challenge rules are already protected. Lastly, we added signature matches for the stealth crawler into our managed rule that blocks AI crawling activity. This rule is available to all customers, including our free customers.

What’s next?

It's been just over a month since we announced Content Independence Day, giving content creators and publishers more control over how their content is accessed. Today, over two and a half million websites have chosen to completely disallow AI training through our managed robots.txt feature or our managed rule blocking AI Crawlers. Every Cloudflare customer is now able to selectively decide which declared AI crawlers are able to access their content in accordance with their business objectives.

We expected a change in bot and crawler behavior based on these new features, and we expect that the techniques bot operators use to evade detection will continue to evolve. Once this post is live the behavior we saw will almost certainly change, and the methods we use to stop them will keep evolving as well.

Cloudflare is actively working with technical and policy experts around the world, like the IETF efforts to standardize extensions to robots.txt, to establish clear and measurable principles that well-meaning bot operators should abide by. We think this is an important next step in this quickly evolving space.

Trapping misbehaving bots in an AI Labyrinth

Reid Tatoris — Wed, 19 Mar 2025 13:00:00 GMT

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.

Using Generative AI as a defensive weapon

AI-generated content has exploded, reportedly accounting for four of the top 20 Facebook posts last fall. Additionally, Medium estimates that 47% of all content on their platform is AI-generated. Like any newer tool it has both wonderful and malicious uses.

At the same time, we’ve also seen an explosion of new crawlers used by AI companies to scrape data for model training. AI Crawlers generate more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests we see. While Cloudflare has several tools for identifying and blocking unauthorized AI crawling, we have found that blocking malicious bots can alert the attacker that you are on to them, leading to a shift in approach, and a never-ending arms race. So, we wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted.

To do this, we decided to use a new offensive tool in the bot creator’s toolset that we haven’t really seen used defensively: AI-generated content. When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them. But while real looking, this content is not actually the content of the site we are protecting, so the crawler wastes time and resources.

As an added benefit, AI Labyrinth also acts as a next-generation honeypot. No real human would go four links deep into a maze of AI-generated nonsense. Any visitor that does is very likely to be a bot, so this gives us a brand-new tool to identify and fingerprint bad bots, which we add to our list of known bad actors. Here’s how we do it…

How we built the labyrinth

When AI crawlers follow these links, they waste valuable computational resources processing irrelevant content rather than extracting your legitimate website data. This significantly reduces their ability to gather enough useful information to train their models effectively.

To generate convincing human-like content, we used Workers AI with an open source model to create unique HTML pages on diverse topics. Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval. We found that generating a diverse set of topics first, then creating content for each topic, produced more varied and convincing results. It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.
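A rough sketch of the topics-first, pre-generation flow described above. Everything here is a stand-in: `generate_topics` and `generate_body` are placeholder stubs for model calls, `html.escape` is a simple substitute for whatever sanitization the real pipeline applies, and a dictionary plays the role of R2.

```python
import html
from typing import Dict, List


def generate_topics(count: int) -> List[str]:
    """Placeholder for a model call that proposes diverse topics."""
    return [f"Topic {i}" for i in range(count)]


def generate_body(topic: str) -> str:
    """Placeholder for a model call; output is treated as untrusted."""
    return f"Facts about {topic} <script>alert(1)</script>"


def sanitize(text: str) -> str:
    """Escape markup so generated text cannot inject HTML or scripts."""
    return html.escape(text)


def pregenerate(count: int) -> Dict[str, str]:
    """Generate topics first, then one sanitized page per topic."""
    store: Dict[str, str] = {}  # stand-in for an object store
    for index, topic in enumerate(generate_topics(count)):
        body = sanitize(generate_body(topic))
        store[f"page-{index}.html"] = (
            f"<html><head><title>{sanitize(topic)}</title></head>"
            f"<body><p>{body}</p></body></html>"
        )
    return store


if __name__ == "__main__":
    pages = pregenerate(3)
    print(sorted(pages))
```

Because pages are built ahead of time, serving one at request time is a lookup rather than a model call.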

This pre-generated content is seamlessly integrated as hidden links on existing pages via our custom HTML transformation process, without disrupting the original structure or content of the page. Each generated page includes appropriate meta directives to protect SEO by preventing search engine indexing. We also ensured that these links remain invisible to human visitors through carefully implemented attributes and styling. To further minimize the impact to regular visitors, we ensured that these links are presented only to suspected AI scrapers, while allowing legitimate users and verified crawlers to browse normally.
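To make the idea concrete, here is a minimal sketch of inserting a hidden link into a page only for suspected scrapers, plus the kind of meta directive a decoy page might carry. The specific attributes, the `suspected_scraper` flag, and the function name are assumptions chosen for illustration; the post does not specify which attributes or styling Cloudflare uses.

```python
import html

# A robots meta directive that asks search engines not to index the decoy page.
DECOY_META = '<meta name="robots" content="noindex, nofollow">'


def inject_decoy_link(page_html: str, decoy_url: str,
                      suspected_scraper: bool) -> str:
    """Insert a hidden link before </body>, only for suspected scrapers."""
    if not suspected_scraper:
        return page_html  # regular visitors get the page untouched
    link = (
        f'<a href="{html.escape(decoy_url, quote=True)}" rel="nofollow" '
        'style="display:none" aria-hidden="true" tabindex="-1"></a>'
    )
    index = page_html.lower().rfind("</body>")
    if index == -1:
        return page_html + link
    return page_html[:index] + link + page_html[index:]


if __name__ == "__main__":
    page = "<html><body><p>Real content</p></body></html>"
    print(inject_decoy_link(page, "/maze/entry", suspected_scraper=True))
```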

Figure: A graph of daily requests over time, comparing different categories of AI Crawlers.

What makes this approach particularly effective is its role in our continuously evolving bot detection system. When these links are followed, we know with high confidence that it's automated crawler activity, as human visitors and legitimate browsers would never see or click them. This provides us with a powerful identification mechanism, generating valuable data that feeds into our machine learning models. By analyzing which crawlers are following these hidden pathways, we can identify new bot patterns and signatures that might otherwise go undetected. This proactive approach helps us stay ahead of AI scrapers, continuously improving our detection capabilities without disrupting the normal browsing experience.

By building this solution on our developer platform, we've created a system that serves convincing decoy content instantly while maintaining consistent quality - all without impacting your site's performance or user experience.

How to use AI Labyrinth to stop AI crawlers

Enabling AI Labyrinth is simple and requires just a single toggle in your Cloudflare dashboard. Navigate to the bot management section within your zone, and toggle the new AI Labyrinth setting to on:

Once enabled, the AI Labyrinth begins working immediately with no additional configuration needed.

AI honeypots, created by AI

The core benefit of AI Labyrinth is to confuse and distract bots. However, a secondary benefit is to serve as a next-generation honeypot. In this context, a honeypot is just an invisible link that a website visitor can’t see, but a bot parsing HTML would see and click on, therefore revealing itself to be a bot. Honeypots have been used to catch hackers since at least the 1986 Cuckoo’s Egg incident. And in 2004, Project Honeypot was created by Cloudflare founders (prior to founding Cloudflare) to let everyone easily deploy free email honeypots, and receive lists of crawler IPs in exchange for contributing to the database. But as bots have evolved, they now proactively look for honeypot techniques like hidden links, making this approach less effective.

AI Labyrinth won’t simply add invisible links, but will eventually create whole networks of linked URLs that are much more realistic, and not trivial for automated programs to spot. The content on the pages is obviously content no human would spend time consuming, but AI bots are programmed to crawl rather deeply to harvest as much data as possible. When bots hit these URLs, we can be confident they aren’t actual humans, and this information is recorded and automatically fed to our machine learning models to help improve our bot identification. This creates a beneficial feedback loop where each scraping attempt helps protect all Cloudflare customers.
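The honeypot signal can be sketched as a per-client counter over decoy page hits. The threshold of four follows the earlier remark that no real human would go four links deep; the class, its method names, and the use of a plain string as a client identifier are hypothetical simplifications, and the real system feeds machine learning models rather than a fixed rule.

```python
from collections import defaultdict
from typing import DefaultDict, Set

DEPTH_THRESHOLD = 4  # "four links deep", taken from the post


class LabyrinthTracker:
    """Counts decoy page hits per client and flags deep traversals."""

    def __init__(self, threshold: int = DEPTH_THRESHOLD) -> None:
        self.threshold = threshold
        self.depth: DefaultDict[str, int] = defaultdict(int)
        self.flagged: Set[str] = set()

    def record_decoy_hit(self, client_id: str) -> bool:
        """Record one decoy page request; return True once flagged."""
        self.depth[client_id] += 1
        if self.depth[client_id] >= self.threshold:
            self.flagged.add(client_id)
        return client_id in self.flagged


if __name__ == "__main__":
    tracker = LabyrinthTracker()
    for hop in range(1, 6):
        flagged = tracker.record_decoy_hit("client-a")
        print(f"hop {hop}: flagged={flagged}")
```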

What’s next

This is only our first iteration of using generative AI to thwart bots. Currently, while the content we generate is convincingly human, it won’t conform to the existing structure of every website. In the future, we’ll continue to work to make these links harder to spot and make them fit seamlessly into the existing structure of the website they’re embedded in. You can help us by opting in now.

To take the next step in the fight against bots, opt in to AI Labyrinth today.

Global expansion in Generative AI: a year of growth, newcomers, and attacks

João Tomé — Mon, 10 Mar 2025 14:00:00 GMT

AI (Artificial Intelligence) is a broad concept encompassing machines that simulate or duplicate human cognitive tasks, with Machine Learning (ML) serving as its data-driven engine. Both have existed for decades but gained fresh momentum when Generative AI, AI models that can create text, images, audio, code, and video, surged in popularity following the release of OpenAI’s ChatGPT in late 2022. In this blog post, we examine the most popular Generative AI services and how they evolved throughout 2024 and early 2025. We also try to answer questions like how much traffic growth these Generative AI websites have experienced from Cloudflare’s perspective, how much of that traffic was malicious, and other insights.

To accomplish this, we use aggregated data from our 1.1.1.1 DNS resolver to measure the popularity of specific Generative AI services. We typically do this for our Year in Review and now also on the DNS domain rankings page of Cloudflare Radar, where we aggregate related domains for each service and identify sites that provide services to users. For overall traffic growth and attack trends, we rely on aggregated data from the cohort of Generative AI customers that use Cloudflare for performance (including AI inference) and security.

Key takeaways:

ChatGPT maintains the top spot: OpenAI’s ChatGPT remains #1 in Generative AI popularity, hovering around the top 50 Internet domains overall, up from around #200 in early 2023.
Rapid traffic growth: Monthly traffic to Generative AI services grew by 251% over the past year, between February 1, 2024, and March 1, 2025.
New entrants on the rise: Chinese chatbot DeepSeek and Grok/xAI quickly climbed the ranks, illustrating how fast newcomers can gain traction in the AI space.
Global reach with regional variations: The U.S. leads with 23% of Generative AI visitors, but Asia dominates certain platforms like poe.com. Brazil also shows up as a strong user of multiple AI services.
Targeted by cyberattacks: Over 197 billion potential attack requests were blocked by Cloudflare in the past year, with 39 billion part of DDoS attack campaigns — particularly affecting general AI chatbots and image-generation sites.

Generative AI services popularity ranking: new kids in town

We begin by looking at Generative AI service popularity using the new AI tab on Cloudflare Radar. The newest entrant to our Top 10 is DeepSeek, a Chinese chatbot launched on January 10, 2025. It debuted at #9 on January 26, 2025, climbed to #3 on January 29 (coinciding with Lunar/Chinese New Year), and maintained that position until February 4, before settling at its current position of #6.

Also highlighted here is another AI chatbot that has recently gained popularity — X’s Grok/xAI. This Generative AI service released its Android app in February and gained attention after February 17, 2025, when it launched the Grok-3 model. In our Generative AI ranking, it first entered the top 10 on February 21, 2025, at #9, briefly reached Claude’s typical spot at #8, and is now fluctuating between #9 and #10.

Here is the current Generative AI Top 10 from the Cloudflare Radar AI page, as of March 9, 2025, with ChatGPT/OpenAI as #1 since the start of the year (a trend also observed in previous years, as the table below shows).

To make ranking changes and trends easier to spot, the table below shows the February 1 - March 1, 2025 (monthly average) standings on the left, with color-coded comparisons to 2024’s list: services that dropped since 2024 appear in red, while new or higher-ranked ones appear in green. For reference, the second column presents the top 10 from our 2024 Year in Review (including comparisons to the previous year), and the third column displays the 2023 Top 10.

Rank	Top 10 Generative AI services in February 2025	Top 10 Generative AI services in 2024 (Radar Year in Review)	Top 10 Generative AI services in 2023 (Radar Year in Review)
1	ChatGPT / OpenAI (=)	ChatGPT / OpenAI (=)	ChatGPT / OpenAI
2	Character.AI (=)	Character.AI (=)	Character.AI
3	QuillBot (#4 in 2024)	Codeium (new)	QuillBot
4	Codeium (#3)	QuillBot (#3 in 2023)	Hugging Face
5	GitHub Copilot (#7)	Claude / Anthropic (new)	Poe
6	DeepSeek (new)	Perplexity (=)	Perplexity
7	Perplexity (#6)	GitHub Copilot (new)	Wordtune
8	Claude / Anthropic (#5)	Wordtune (#7)	Google Bard
9	Hugging Face (new)	Poe (#5)	ProWritingAid
10	Suno AI (new)	Tabnine (new)	Voicemod

Other than the previously mentioned DeepSeek, Grok/xAI and ChatGPT/OpenAI, the top 10 includes other chatbots like Anthropic’s Claude, as well as other types of Generative AI services. Character.AI — a specialized platform for creating and interacting with character-based personalities — is #2, then there’s Perplexity (#7) that functions as an AI search engine, while QuillBot (#3) is an AI-powered writing assistant for paraphrasing, grammar, and summarizing. Codeium (#4), which includes developer productivity services like Windsurf AI, and GitHub Copilot (#5) serve as AI coding assistants.

There’s also Hugging Face (#9), an open-source hub for AI models (we’re including it here as a Generative AI platform, just as we do for other AI model enablers like Replicate and Stability AI), and Suno AI (#10), a music generator that creates songs from text prompts.

We saw that Grok/xAI entered the top 10 during the last days of February, but since we’re using February’s monthly average, it appears at #11 here. Curious about the rest of the February 2025 Top 20? Here it is, with AI coding services having a strong presence — beyond Codeium and GitHub Copilot, Sider AI and Tabnine also make the list.

11 Grok / xAI

12 Poe

13 Sider AI

14 Civitai

15 Tabnine

16 Google Gemini

17 Voicemod

18 GliaCloud

19 Runway ml

20 Midjourney

We have published Generative AI popularity rankings in both the 2023 and 2024 Cloudflare Radar Year in Review, and in both, OpenAI’s ChatGPT has held the #1 spot. In 2024, as explained in our blog post, ChatGPT also moved up in our overall rankings, nearly breaking into the top 50 by the end of the year. (It was just outside the top 100 at the end of 2023.)

ChatGPT's influence in the overall ranking

A recent addition to Cloudflare Radar is the updated domains ranking page in our DNS section, which includes a number of detailed trends. There, we now show the top 100 overall Internet services ranking next to a top 100 domains list. ChatGPT / OpenAI, the leading Generative AI service, is typically ranked in the mid-50s on weekdays and close to #60 on weekends (based on early March 2025 insights), next to non-AI services like Temu, eBay, or Disney Plus.

Looking at previous trends, as noted in our Year in Review blog, ChatGPT / OpenAI ranked around #200 in early 2023 and climbed to near the top 100 by the end of the year. In 2024, it started just outside the top 100, reached the top 60 in May with the release of the 4o model, and has been near the top 50 since September 2024, aligning with the return of employees and students to their routines.

Visitor location distribution: Americas, Europe and Asia

The Domain Information page on Cloudflare Radar enables users to look at the location popularity of a specific domain (from the last seven days), derived from Cloudflare 1.1.1.1 resolver traffic data in a period of 48 hours (Radar’s default) on March 3-4, 2025.

In this case, the chatgpt.com domain has most of its DNS traffic from the United States (17%), followed by Germany (7%), Brazil (4%), Indonesia (4%), and India (4%).

In the case of the new kid in town, deepseek.com, the U.S. is the #1 location, with 14% of that domain’s DNS traffic, followed by China (11%), Germany (10%), Brazil (7%), and Hong Kong (5%).

Grok.com, on the other hand, has 20% of its traffic from the U.S., 8% from Hong Kong, 6% from Germany, 6% from Japan, and 6% from Vietnam, reflecting a strong presence in Asia within its top 5 locations. Asia is even more dominant for another well-known Generative AI chatbot domain, poe.com, with Hong Kong ranking #1 (29% of traffic), followed by the U.S. (13%), Japan (6%), China (6%), and Singapore (5%).

Hugging Face (huggingface.co), the Generative AI models platform, also has the U.S. as its top location (34% of traffic), but its top 5 includes four European countries: France (6%), the United Kingdom (6%), Germany (4%), and Sweden (4%).

Looking more specifically at AI-powered coding tools, DNS traffic for githubcopilot.com is primarily driven by the United States (22%), followed by Germany (6%), Hong Kong (5%), India (5%), and Japan (5%). A similar pattern appears for codeium.com, where the U.S. leads with 15%, followed by Hong Kong (8%), Japan (7%), Brazil (5%), and the Netherlands (5%). Likewise, cursor.com has 20% of its DNS traffic from the U.S., followed by Hong Kong (10%), India (6%), China (6%), and Japan (5%). Tabnine.com, another AI code completion tool, has its highest traffic from the U.S. (15%), followed by India (6%), Brazil (5%), Germany (5%), and Hong Kong (5%).

The DNS traffic data from Cloudflare Radar highlights strong U.S. usage across all major Generative AI and AI coding tools, with regional adoption varying by platform. (It is worth noting that 1.1.1.1 has a larger user base in the U.S., but these specific trends vary depending on the domains.)

Asia dominates poe.com and AI coding tools like Codeium and Cursor.
Europe plays a significant role in Hugging Face and GitHub Copilot.
Brazil emerges as a notable player, particularly in DeepSeek and Tabnine.

Generative AI general traffic growth

Through its Generative AI customers, Cloudflare has a unique perspective on the industry. We power many Generative AI services, both large and small. From a cohort of Generative AI customers (some recently popular, others established chatbots or image AI generators, and some just starting) we’ve aggregated both HTTP request data over the past months and application-layer attack trends.

Let’s start with HTTP requests traffic growth in the past year. From February 1, 2024, through March 1, 2025 (a 13-month period to compare February 2024 with February 2025), monthly traffic grew a total of 251%, and over 2% of the requests processed by Cloudflare were mitigated as potential attacks.

Note that there was an increase over most of the entities in the cohort of Generative AI websites, and this 251% growth also includes recent Generative AI customers, although those mostly don’t influence the growth trend that much — if we exclude Generative AI customers that onboarded to Cloudflare in late 2024 and early 2025, year growth is 234%.
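To put the growth figures in more familiar terms, the short sketch below converts a percentage growth into a multiplier and into an equivalent steady month-over-month rate. The steady-rate figure assumes growth was spread evenly across the 12 monthly steps from February 2024 to February 2025, which is a simplification for illustration; the actual monthly pattern was uneven, as the holiday dips show.

```python
def growth_multiplier(growth_pct: float) -> float:
    """A 251% increase means the end value is 3.51x the start value."""
    return 1.0 + growth_pct / 100.0


def steady_monthly_rate(growth_pct: float, monthly_steps: int = 12) -> float:
    """Constant monthly growth rate that compounds to growth_pct overall."""
    return growth_multiplier(growth_pct) ** (1.0 / monthly_steps) - 1.0


if __name__ == "__main__":
    for label, pct in (("all customers", 251.0), ("excluding recent", 234.0)):
        print(
            f"{label}: {growth_multiplier(pct):.2f}x overall, "
            f"{steady_monthly_rate(pct):.1%} per month if steady"
        )
```

Under that assumption, 251% total growth corresponds to roughly 11% per month.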

In this next perspective, shown at a daily level, the expected drop during Christmas and the end of the year holidays is quite clear. Another trend surfaces: the cohort of Cloudflare’s Generative AI customers definitely see more use during weekdays than weekends, suggesting a workplace focus. The clear drop during the holidays also includes the summer in the Northern Hemisphere — there's a slight drop in peak traffic in July, for example (similar to what we typically see in terms of general traffic in most countries).

We also have a perspective on the top visitor locations to Generative AI websites, where the U.S. ranks #1, with 23% of all requests in this category, followed by India (8%), Brazil (5%), Indonesia (4%), and the Philippines (4%) in the top 5. European countries, such as the U.K. and Germany, come next in the ranking. Below, we show the top 50 for further exploration. Note that Egypt is the first African country appearing in the ranking, at #32, with the same 0.7% as South Africa.

Top locations by share of traffic to Generative AI websites

Rank	Country	Percentage of total	Rank	Country	Percentage of total
1	United States	22.7%	26	Singapore	1.1%
2	India	8.3%	27	Ukraine	1%
3	Brazil	4.9%	28	Taiwan	0.9%
4	Indonesia	4.2%	29	Thailand	0.9%
5	Philippines	4%	30	Chile	0.8%
6	United Kingdom	3.8%	31	United Arab Emirates	0.7%
7	Germany	3.7%	32	Egypt	0.7%
8	Canada	3.2%	33	Saudi Arabia	0.7%
9	France	3%	34	South Africa	0.7%
10	Mexico	2.7%	35	Sweden	0.6%
11	Japan	2.4%	36	Belgium	0.6%
12	Russian Federation	2.2%	37	Bangladesh	0.6%
13	Spain	2%	38	Switzerland	0.6%
14	Australia	2%	39	Morocco	0.6%
15	South Korea	1.8%	40	Ecuador	0.6%
16	Vietnam	1.6%	41	Israel	0.5%
17	Italy	1.5%	42	Nigeria	0.5%
18	Malaysia	1.5%	43	Romania	0.5%
19	Turkey	1.4%	44	Portugal	0.5%
20	Poland	1.4%	45	Kazakhstan	0.5%
21	Netherlands	1.4%	46	Austria	0.4%
22	Argentina	1.2%	47	Czech Republic	0.4%
23	Colombia	1.2%	48	Hong Kong	0.4%
24	Pakistan	1.2%	49	Algeria	0.4%
25	Peru	1.1%	50	Denmark	0.4%

Attacks targeting Generative AI websites

On the security front, Generative AI websites have become key targets for DDoS attacks as they have gained attention and grown in popularity. Recently, our Cloudforce One team published a threat analysis on attacks by Anonymous Sudan targeting AI-related companies: Inside LameDuck: Analyzing Anonymous Sudan’s Threat Operations. In this report, they explained how the U.S. Department of Justice indicted two Sudanese brothers behind LameDuck, linking them to 35,000+ DDoS attacks via the Skynet Botnet. The case exposes both political and financial motives behind their operations and underscores the global effort — including Cloudflare’s — to strengthen cybersecurity.

Over the last 13 months, from February 1, 2024, until March 1, 2025, Cloudflare blocked 197 billion requests as potential attacks. Of that number, 39 billion requests were part of DDoS attacks targeting Generative AI websites.
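As a quick check on how these two totals relate, the sketch below computes the DDoS share of all blocked requests from the figures quoted above. The function name is illustrative.

```python
def share(part: float, whole: float) -> float:
    """Fraction of `whole` accounted for by `part`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


if __name__ == "__main__":
    blocked_total_billion = 197.0  # all requests blocked as potential attacks
    blocked_ddos_billion = 39.0    # the subset attributed to DDoS attacks
    ratio = share(blocked_ddos_billion, blocked_total_billion)
    print(f"DDoS share of blocked requests: {ratio:.1%}")
```

In other words, about one in five blocked requests in this period was part of a DDoS attack.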

In terms of malicious requests that were blocked, June 2024 saw the highest number of potential attacks blocked by Cloudflare, followed by January 2025. For DDoS attacks, January 2025 recorded the highest activity, followed by November 2024 and February 2024.

Looking more closely at DDoS traffic at a daily level, the largest attack occurred on February 23, 2024, when 3.7 billion requests were blocked as part of a DDoS attack. The second largest was a 1.5 billion request DDoS attack on November 13, 2024. Additionally, a series of multiday DDoS attacks took place between January 20 and 31, 2025, with January 29 seeing the highest number of DDoS attack-related requests, at over one billion (7.3 billion in total for the month).

During the February 23, 2024, DDoS attack, which targeted a specific Generative AI customer, more than 20% of all requests across all Generative AI customers were blocked as part of the attack.

Taking a more granular view of DDoS attacks against that particular Generative AI customer, the attack began on February 22, 2024, at 22:45 UTC, lasting for over eight hours of continuous traffic spikes, peaking at 270,000 requests per second. Further attacks followed, with the most significant occurring on February 26, 2024, at 03:45 UTC, lasting three minutes and peaking at 309,000 requests per second.

Another popular Generative AI customer was targeted in a DDoS campaign from January 25 to January 31, 2025, with traffic peaking on January 30, reaching 523,000 requests per second.

Another perspective to consider over the same February 2024 to February 2025 period is the type of Generative AI websites most targeted by DDoS attacks. General AI chatbots accounted for over 80% of all blocked requests, making them the primary targets.

DDoS attack targets by Generative AI category

Category	Percentage
General Chatbots	82.7%
Image AI Generators	8.2%
Code Assistants	3.4%
Other	2.6%
AI Research & Infra	1.3%
AI Music Creation	1.2%
Writing & Content AI	0.4%
Voice & Video AI	0.3%

However, when looking at the percentage of total traffic blocked as DDoS attacks within each category, image AI-related websites had the highest proportion, with over 50% of their total traffic being blocked.

Website categories with the highest percentage of traffic blocked as DDoS attacks

Category	Blocked DDoS (%)
Image AI	50.8%
AI Chatbot	31%
AI Search	9.4%
AI Code Assistant	6.8%
AI Model	5.8%
AI Music	3.6%
AI Company	2.9%
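The two tables answer different questions, which is why chatbots lead one and image sites lead the other. The sketch below computes both metrics from the same inputs. The request counts are hypothetical numbers chosen only to show the effect; they are not Cloudflare data, and the category names are simplified.

```python
from typing import Dict


def share_of_all_blocked(blocked: Dict[str, float]) -> Dict[str, float]:
    """Each category's share of the total blocked DDoS requests."""
    total = sum(blocked.values())
    return {name: count / total for name, count in blocked.items()}


def blocked_rate_within(blocked: Dict[str, float],
                        traffic: Dict[str, float]) -> Dict[str, float]:
    """Within each category, the fraction of its own traffic that was blocked."""
    return {name: blocked[name] / traffic[name] for name in blocked}


if __name__ == "__main__":
    # Hypothetical request counts, for illustration only.
    traffic = {"chatbots": 2000.0, "image": 100.0}
    blocked = {"chatbots": 600.0, "image": 50.0}
    print("share of all blocked:", share_of_all_blocked(blocked))
    print("blocked within category:", blocked_rate_within(blocked, traffic))
```

A large category can dominate the first metric through sheer volume, while a smaller category can have a much higher proportion of its own traffic blocked.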

Conclusion: AI transformation

Generative AI continues to grow and transform Internet usage, driving traffic growth of over 250% for AI services over the course of the last year. ChatGPT is definitely the most popular service, and nears the top 50 of all Internet services as seen through analysis of traffic from our 1.1.1.1 DNS resolver. New entrants like DeepSeek and Grok/xAI have quickly climbed the popularity rankings, while regional adoption patterns show the U.S., India, and Brazil leading in visitor traffic.

This rapid rise has also drawn cyberattacks, with 39 billion requests identified as DDoS attacks targeting specific Generative AI websites over the past year. While most attacks focus on general AI chatbots, image-generation sites show the highest percentage of blocked requests, at over 50%. As Generative AI evolves, tracking these trends provides a historical record of growth surges, global reach, and emerging threats.

If you’re interested in more trends and insights about the Internet, check out Cloudflare Radar. Follow us on social media at @CloudflareRadar (X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via email.

AI Everywhere with the WAF Rule Builder Assistant, Cloudflare Radar AI Insights, and updated AI bot protection

Adam Martinetti — Fri, 27 Sep 2024 13:00:00 GMT

The continued growth of AI has fundamentally changed the Internet over the past 24 months. AI is increasingly ubiquitous, and Cloudflare is leaning into the new opportunities and challenges it presents in a big way. This year for Cloudflare’s birthday, we’ve extended our AI Assistant capabilities to help you build new WAF rules, added AI bot traffic insights on Cloudflare Radar, and given customers new AI bot blocking capabilities.

AI Assistant for WAF Rule Builder

At Cloudflare, we’re always listening to your feedback and striving to make our products as user-friendly and powerful as possible. One area where we've heard your feedback loud and clear is in the complexity of creating custom and rate-limiting rules for our Web Application Firewall (WAF). With this in mind, we’re excited to introduce a new feature that will make rule creation easier and more intuitive: the AI Assistant for WAF Rule Builder.

By simply entering a natural language prompt, you can generate a custom or rate-limiting rule tailored to your needs. For example, instead of manually configuring a complex rule matching criteria, you can now type something like, "Match requests with low bot score," and the assistant will generate the rule for you. It’s not about creating the perfect rule in one step, but giving you a strong foundation that you can build on.

The assistant will be available in the Custom and Rate Limit Rule Builder for all WAF users. We’re launching this feature in Beta for all customers, and we encourage you to give it a try. We’re looking forward to hearing your feedback (via the UI itself) as we continue to refine and enhance this tool to meet your needs.

AI bot traffic insights on Cloudflare Radar

AI platform providers use bots to crawl and scrape websites, vacuuming up data to use for model training. This is frequently done without the permission of, or a business relationship with, the content owners and providers. In July, Cloudflare urged content owners and providers to “declare their AIndependence”, providing them with a way to block AI bots, scrapers, and crawlers with a single click. In addition to this so-called “easy button” approach, sites can provide more specific guidance to these bots about what they are and are not allowed to access through directives in a robots.txt file. Regardless of whether a customer chooses to block or allow requests from AI-related bots, Cloudflare has insight into request activity from these bots, and associated traffic trends over time.

Tracking traffic trends for AI bots can help us better understand their activity over time: which are the most aggressive and have the highest volume of requests, which launch crawls on a regular basis, and so on. The new AI bot & crawler traffic graph on Radar’s Traffic page provides insight into these traffic trends gathered over the selected time period for the top known AI bots. The associated list of bots tracked here is based on the ai.robots.txt list, and will be updated with new bots as they are identified. Time series and summary data are available from the Radar API as well. (Traffic trends for the full set of AI bots & crawlers can be viewed in the new Data Explorer.)

Blocking more AI bots

For Cloudflare’s birthday, we’re following up on our previous blog post, Declaring Your AIndependence, with an update on the new detections we’ve added to stop AI bots. Customers who haven’t already done so can simply click the button to block AI bots to gain more protection for their website.

Enabling dynamic updates for the AI bot rule

The old button allowed customers to block verified AI crawlers, those that respect robots.txt and crawl rate, and don’t try to hide their behavior. We’ve added new crawlers to that list, but we’ve also expanded the previous rule to include 27 signatures (and counting) of AI bots that don’t follow the rules. We want to take time to say “thank you” to everyone who took the time to use our “tip line” to point us towards new AI bots. These tips have been extremely helpful in finding some bots that would not have been on our radar so quickly.

We’re also adding each bot we’ve added to our “Definitely automated” definition. So, if you’re a self-service plan customer using Super Bot Fight Mode, you’re already protected. Enterprise Bot Management customers will see more requests shift from the “Likely Bot” range to the “Definitely automated” range, which we’ll discuss more below.

Under the hood, we’ve converted this rule logic to a Cloudflare managed rule (the same framework that powers our WAF). This enables our security analysts and engineers to safely push updates to the rule in real-time, similar to how new WAF rule changes are rapidly delivered to ensure our customers are protected against the latest CVEs. If you haven’t logged back into the Bots dashboard since the previous version of our AI bot protection was announced, click the button again to update to the latest protection.

The impact of new fingerprints on the model

One hidden beneficiary of fingerprinting new AI bots is our ML model. As we’ve discussed before, our global ML model uses supervised machine learning and greatly benefits from more sources of labeled bot data. Below, you can see how well our ML model recognized these requests as automated, before and after we updated the button, adding new rules. To keep things simple, we have shown only the top 5 bots by the volume of requests on the chart. With the introduction of our new managed rule, we have observed an improvement in our detection capabilities for the majority of these AI bots. Button v1 represents the old option that let customers block only verified AI crawlers, while Button v2 is the newly introduced feature that includes managed rule detections.

So how did we make our detections more robust? As we have mentioned before, sometimes a single attribute can give a bot away. We developed a sophisticated set of heuristics tailored to these AI bots, enabling us to effortlessly and accurately classify them as such. Although our ML model was already detecting the vast majority of these requests, the integration of additional heuristics has resulted in a noticeable increase in detection rates for each bot, ensuring we score every request correctly 100% of the time. Transitioning from a purely machine learning approach to incorporating heuristics offers several advantages, including faster detection times and greater certainty in classification. While deploying a machine learning model is complex and time-consuming, new heuristics can be created in minutes.

The initial launch of the AI bots block button was well-received and is now used by over 133,000 websites, with significant adoption even among our Free tier customers. The newly updated button, launched on August 20, 2024, is rapidly gaining traction. Over 90,000 zones have already adopted the new rule, with approximately 240 new sites integrating it every hour. Overall, we are now helping to protect the intellectual property of more than 146,000 sites from AI bots, and we are currently blocking 66 million requests daily with this new rule. Additionally, we’re excited to announce that support for configuring AI bots protection via Terraform will be available by the end of this year, providing even more flexibility and control for managing your bot protection settings.

Bot behavior

With the enhancements to our detection capabilities, it is essential to assess the impact of these changes on bot activity on the Internet. Since the launch of the updated AI bots block button, we have been closely monitoring for any shifts in bot activity and adaptation strategies. The most basic fingerprinting technique we use to identify AI bots is looking for simple user-agent matches. User-agent matches are important to monitor because they indicate that a bot is transparently announcing who it is when it crawls a website.
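To make the idea of a simple user-agent match concrete, here is a minimal sketch. The token list is only an illustration assembled from crawler names mentioned in these posts, and `match_ai_bot` is a hypothetical helper, not Cloudflare's actual rule logic.

```python
from typing import Optional

# Illustrative tokens taken from crawler names mentioned in these posts.
# This is an assumption for the example, not Cloudflare's real detection list.
AI_BOT_TOKENS = ("Bytespider", "GPTBot", "ClaudeBot", "Amazonbot", "CCBot", "PerplexityBot")


def match_ai_bot(user_agent: str) -> Optional[str]:
    """Return the first known AI bot token found in a User-Agent header, else None."""
    ua = user_agent.lower()
    for token in AI_BOT_TOKENS:
        if token.lower() in ua:
            return token
    return None


# A crawler that announces itself is matched by its token.
print(match_ai_bot("Mozilla/5.0 (compatible; GPTBot/1.0)"))
# A browser-like user agent carries no token, so this check alone finds nothing.
print(match_ai_bot("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0"))
```

A match of this kind only works for bots that declare themselves, which is why it is useful as a measure of transparency rather than as a complete detection method.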

The graph below shows the volume of traffic we label as AI bot over the past two months. The blue line indicates the daily request count, while the red line represents the monthly average number of requests. In the past two months, we have seen an average reduction of nearly 30 million requests, with a decrease of 40 million in the most recent month. This decline coincides with the release of Button v1 and Button v2. Our hypothesis is that with the new AI bots blocking feature, Cloudflare is blocking a majority of these bots, which is discouraging them from crawling.

This hypothesis is supported by the observed decline in requests from several top AI crawlers. Specifically, the Bytespider bot reduced its daily requests from approximately 100 million to just 50 million between the end of June and the end of August (see graph below). This reduction could be attributed to several factors, including our new AI bots block button and changes in the crawler's strategy.

We have also observed an increase in the accountability of some AI crawlers. These crawlers now announce themselves through their user agents more frequently, reflecting a shift towards more transparent and responsible behavior. Notably, there has been a dramatic surge in the number of requests from the Perplexity user agent. This increase might be linked to previous accusations that Perplexity did not properly present its user agent, which could have prompted a shift in their approach to ensure better identification and compliance.

These trends suggest that our updates are likely affecting how AI crawlers interact with content. We will continue to monitor AI bot activity to help users control who accesses their content and how. By keeping a close watch on emerging patterns, we aim to provide users with the tools and insights needed to make informed decisions about managing their traffic.

Wrap up

We’re excited to continue to explore the AI landscape, whether we’re finding more ways to make the Cloudflare dashboard usable or new threats to guard against. Our AI insights on Radar update in near real-time, so please join us in watching as new trends emerge and discussing them in the Cloudflare Community.

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

Alex Bocharov — Wed, 03 Jul 2024 13:00:26 GMT

To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots. It’s available for all customers, including those on our free tier.

The popularity of generative AI has made demand skyrocket for content used to train models or run inference, and although some AI companies clearly identify their web scraping bots, not all AI companies are being transparent. Google reportedly paid $60 million a year to license Reddit’s user generated content, and most recently, Perplexity has been accused of impersonating legitimate visitors in order to scrape content from websites. The value of original content in bulk has never been higher.

Last year, Cloudflare announced the ability for customers to easily block AI bots that behave well. These bots follow robots.txt, and don’t use unlicensed content to train their models or run inference for RAG applications using website data. Even though these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them.

We hear clearly that customers don’t want AI bots visiting their websites, especially those that do so dishonestly. To help, we’ve added a brand new one-click option to block all AI bots. It’s available for all customers, including those on the free tier. To enable it, simply navigate to the Security > Bots section of the Cloudflare dashboard, and click the toggle labeled AI Scrapers and Crawlers.

This feature will automatically be updated over time as we see new fingerprints of offending bots we identify as widely scraping the web for model training. To ensure we have a comprehensive understanding of all AI crawler activity, we surveyed traffic across our network.

AI bot activity today

The graph below illustrates the most popular AI bots seen on Cloudflare’s network in terms of their request volume. We looked at common AI crawler user agents and aggregated the number of requests on our platform from these AI user agents over the last year:

When looking at the number of requests made to Cloudflare sites, we see that Bytespider, Amazonbot, ClaudeBot, and GPTBot are the top four AI crawlers. Operated by ByteDance, the Chinese company that owns TikTok, Bytespider is reportedly used to gather training data for its large language models (LLMs), including those that support its ChatGPT rival, Doubao. Amazonbot and ClaudeBot follow Bytespider in request volume. Amazonbot, reportedly used to index content for Alexa’s question-answering, sent the second-highest number of requests, and ClaudeBot, used to train the Claude chat bot, has recently increased in request volume.

Among the top AI bots that we see, Bytespider not only leads in terms of number of requests but also in both the extent of its Internet property crawling and the frequency with which it is blocked. Following closely is GPTBot, which ranks second in both crawling and being blocked. GPTBot, managed by OpenAI, collects training data for its LLMs, which underpin AI-driven products such as ChatGPT. In the table below, “Share of websites accessed” refers to the proportion of websites protected by Cloudflare that were accessed by the named AI bot.

AI Bot	Share of Websites Accessed
Bytespider	40.40%
GPTBot	35.46%
ClaudeBot	11.17%
ImagesiftBot	8.75%
CCBot	2.14%
ChatGPT-User	1.84%
omgili	0.10%
Diffbot	0.08%
Claude-Web	0.04%
PerplexityBot	0.01%

While our analysis identified the most popular crawlers in terms of request volume and number of Internet properties accessed, many customers are likely not aware of the more popular AI crawlers actively crawling their sites. Our Radar team performed an analysis of the top robots.txt entries across the top 10,000 Internet domains to identify the most commonly actioned AI bots, then looked at how frequently we saw these bots on sites protected by Cloudflare.

In the graph below, which looks at disallowed crawlers for these sites, we see that customers most often reference GPTBot, CCBot, and Google in robots.txt, but do not specifically disallow popular AI crawlers like Bytespider and ClaudeBot.

With the Internet now flooded with these AI bots, we were curious to see how website operators have already responded. In June, AI bots accessed around 39% of the top one million Internet properties using Cloudflare, but only 2.98% of these properties took measures to block or challenge those requests. Moreover, the higher-ranked (more popular) an Internet property is, the more likely it is to be targeted by AI bots, and correspondingly, the more likely it is to block such requests.

Top N Internet properties by number of visitors seen by Cloudflare	% accessed by AI bots	% blocking AI bots
10	80.0%	40.0%
100	63.0%	16.0%
1,000	53.2%	8.8%
10,000	47.99%	8.92%
100,000	44.53%	6.36%
1,000,000	38.73%	2.98%

We see website operators completely block access to these AI crawlers using robots.txt. However, these blocks rely on the bot operator respecting robots.txt and adhering to RFC 9309 (ensuring that variations on the user agent all match the product token) to honestly identify who they are when they visit an Internet property. User agents, though, are trivial for bot operators to change.
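The sketch below shows why a robots.txt block depends on honest self-identification. It uses Python's standard `urllib.robotparser` to answer the question a well-behaved crawler would ask before fetching a page. The robots.txt content, the example URL and the `allowed` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that disallows two AI crawlers and allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(product_token: str, url: str = "https://example.com/article") -> bool:
    """Return what a compliant crawler identifying as product_token would conclude."""
    return parser.can_fetch(product_token, url)


for token in ("GPTBot", "CCBot", "Bytespider", "Mozilla"):
    print(token, allowed(token))
```

Two things stand out. A crawler that is not named in the file, such as Bytespider here, falls through to the wildcard group and is allowed. And the same crawler presenting a browser-like token is also allowed, because robots.txt is advisory and nothing in it verifies who is asking.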

How we find AI bots pretending to be real web browsers

Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent. We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot, even when operators lie about their user agent.

Take one example of a specific bot that others observed to be hiding their activity. We ran an analysis to see how our machine learning models scored traffic from this bot. In the diagram below, you can see that all bot scores are firmly below 30, indicating that our scoring thinks this activity is likely to be coming from a bot.

The diagram reflects scoring of the requests using our newest model, where “hotter” colors indicate more requests falling in that band, and “cooler” colors indicate fewer. We can see the vast majority of requests fell into the bottom two bands, showing that Cloudflare’s model gave the offending bot a score of 9 or less. The user agent changes have no effect on the score, because this is the very first thing we expect bot operators to do.

Any customer with an existing WAF rule set to challenge visitors with a bot score below 30 (our recommendation) automatically blocked all of this AI bot traffic with no new action on their part. The same will be true for future AI bots that use similar techniques to hide their activity.
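As a rough illustration of that rule, the sketch below applies the "challenge below 30" threshold to a few scores. The 1 to 99 score range and the `waf_action` helper are assumptions made for this example; in a real WAF custom rule, the condition would be written as a rule expression along the lines of `(cf.bot_management.score lt 30)` with a challenge action, so treat that expression as indicative rather than copied from a live configuration.

```python
# Threshold from the text: challenge visitors with a bot score below 30.
CHALLENGE_BELOW = 30


def waf_action(bot_score: int) -> str:
    """Return the action a 'challenge below 30' rule would take for one request."""
    # Assumption for this sketch: bot scores range from 1 (automated) to 99 (human).
    if not 1 <= bot_score <= 99:
        raise ValueError("bot score is assumed to be between 1 and 99")
    return "challenge" if bot_score < CHALLENGE_BELOW else "allow"


# The evasive bot described above scored 9 or less, so it is challenged
# regardless of the user agent it presents.
for score in (1, 9, 29, 30, 85):
    print(score, waf_action(score))
```

Because the decision is keyed on the score and not on the user agent, changing the user agent string does not move a request across the threshold.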

We leverage Cloudflare global signals to calculate our Bot Score, which, for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”

When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.
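As a toy illustration of what a global aggregate over one signal could look like, the sketch below groups requests by fingerprint and computes the share that did not look automated. Everything here is hypothetical: the `Request` fields, the `looked_automated` flag and the trust formula are invented for the example and do not describe Cloudflare's actual signals or models.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Request:
    fingerprint: str        # hypothetical client fingerprint (e.g. a tool or framework signature)
    looked_automated: bool  # hypothetical per-request signal


def fingerprint_trust(requests: Iterable[Request]) -> Dict[str, float]:
    """Return, per fingerprint, the share of its requests that did not look automated."""
    totals = defaultdict(int)
    automated = defaultdict(int)
    for r in requests:
        totals[r.fingerprint] += 1
        automated[r.fingerprint] += int(r.looked_automated)
    return {fp: 1 - automated[fp] / totals[fp] for fp in totals}


sample = [
    Request("scraper-framework", True),
    Request("scraper-framework", True),
    Request("scraper-framework", True),
    Request("scraper-framework", False),
    Request("desktop-browser", False),
    Request("desktop-browser", False),
]
print(fingerprint_trust(sample))
```

The point of aggregating across a large network is that a fingerprint seen on one site can be judged by its behavior everywhere, so a new scraping tool can be flagged without a hand-written signature.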

The upshot of this globally aggregated data is that we can immediately detect new scraping tools and their behavior without needing to manually fingerprint the bot, ensuring that customers stay protected from the newest waves of bot activity.

If you have a tip on an AI bot that’s not behaving, we’d love to investigate. There are two options you can use to report misbehaving AI crawlers:

1. Enterprise Bot Management customers can submit a False Negative Feedback Loop report via Bot Analytics by simply selecting the segment of traffic where they noticed misbehavior.

2. We’ve also set up a reporting tool where any Cloudflare customer can submit reports of an AI bot scraping their website without permission.

We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection. We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on.