Posts tagged "AI Week"

Deploy your own AI vibe coding platform — in one click!

Ashish Kumar Singh — Tue, 23 Sep 2025 14:00:00 GMT

It’s an exciting time to build applications. With the recent AI-powered "vibe coding" boom, anyone can build a website or application by simply describing what they want in a few sentences. We’re already seeing organizations expose this functionality to both their users and internal employees, empowering anyone to build out what they need.

Today, we’re excited to open-source an AI vibe coding platform, VibeSDK, to enable anyone to run an entire vibe coding platform themselves, end-to-end, with just one click.

Want to see it for yourself? Check out our demo platform that you can use to create and deploy applications. Or better yet, click the button below to deploy your own AI-powered platform, and dive into the repo to learn about how it’s built.

Deploying VibeSDK sets up everything you need to run your own AI-powered development platform:

Integration with LLM models to generate code, build applications, debug errors, and iterate in real-time, powered by Agents SDK.
Isolated development environments that allow users to safely build and preview their applications in secure sandboxes.
Infinite scale that allows you to deploy thousands or even millions of applications that end users deploy, all served on Cloudflare’s global network
Observability and caching across multiple AI providers, giving you insight into costs and performance with built-in caching for popular responses.
Project templates that the LLM can use as a starting point to build common applications and speed up development.
One-click project export to the user’s Cloudflare account or GitHub repo, so users can take their code and continue development on their own.

Building an AI vibe coding platform from start to finish

Step 0: Get started immediately with VibeSDK

We’re seeing companies build their own AI vibe coding platforms to enable both internal and external users. With a vibe coding platform, internal teams like marketing, product, and support can build their own landing pages, prototypes, or internal tools without having to rely on the engineering team. Similarly, SaaS companies can embed this capability into their product to allow users to build their own customizations.

Every platform has unique requirements and specializations. By building your own, you can write custom logic to prompt LLMs for your specific needs, giving your users more relevant results. This also grants you complete control over the development environment and application hosting, giving you a secure platform that keeps your data private and within your control.

We wanted to make it easy for anyone to build this themselves, which is why we built a complete platform that comes with project templates, previews, and project deployment. Developers can repurpose the whole platform, or simply take the components they need and customize them to fit their needs.

Step 1: Finding a safe, isolated environment for running untrusted, AI generated code

AI can now build entire applications, but there's a catch: you need somewhere safe to run this untrusted, AI-generated code. Imagine if an LLM writes an application that needs to install packages, run build commands, and start a development server — you can't just run this directly on your infrastructure where it might affect other users or systems.

With Cloudflare Sandboxes, you don't have to worry about this. Every user gets their own isolated environment where the AI-generated code can do anything a normal development environment can do: install npm packages, run builds, start servers, but it's fully contained in a secure, container-based environment that can't affect anything outside its sandbox.

The platform assigns each user to their own sandbox based on their session, so that if a user comes back, they can continue to access the same container with their files intact:

Step 2: Generating the code

Once the sandbox is created, you have a development environment that can bring the code to life. VibeSDK orchestrates the whole workflow from writing the code, installing the necessary packages, and starting the development server. If you ask it to build a to-do app, it will generate the React application, write the component files, run bun install to get the dependencies, and start the server, so you can see the end result.

Once the user submits their request, the AI will generate all the necessary files, whether it's a React app, Node.js API, or full-stack application, and write them directly to the sandbox:

To speed this up even more, we’ve provided a set of templates, stored in an R2 bucket, that the platform can use and quickly customize, instead of generating every file from scratch. This is just an initial set, but you can expand it and add more examples.

Step 3: Getting a preview of your deployment

Once everything is ready, the platform starts the development server and uses the Sandbox SDK to expose it to the internet with a public preview URL which allows users to instantly see their AI-generated application running live:

Step 4: Test, log, fix, repeat

But that’s not all! Throughout this process, the platform will capture console output, build logs, and error messages and feed them back to the LLM for automatic fixes. As the platform makes any updates or fixes, the user can see it all happening live — the file editing, installation progress, and error resolution.

Deploying applications: From Sandbox to Region Earth

Once the application is developed, it needs to be deployed. The platform packages everything in the sandbox and then uses a separate specialized "deployment sandbox" to deploy the application to Cloudflare Workers. This deployment sandbox runs wrangler deploy inside the secure environment to publish the application to Cloudflare's global network.

Since the platform may deploy up to thousands or millions of applications, Workers for Platforms is used to deploy the Workers at scale. Although all the Workers are deployed to the same Namespace, they are all isolated from one another by default, ensuring there’s no cross-tenant access. Once deployed, each application receives its own isolated Worker instance with a unique public URL like my-app.vibe-build.example.com.

Exportable Applications

The platform also allows users to export their application to their own Cloudflare account and GitHub repo, so they can continue the development on their own.

Observability, caching, and multi-model support built in!

It's no secret that LLM models have their specialties, which means that when building an AI-powered platform, you may end up using a few different models for different operations. By default, VibeSDK leverages Google’s Gemini models (gemini-2.5-pro, gemini-2.5-flash-lite, gemini-2.5-flash) for project planning, code generation, and debugging.

VibeSDK is automatically set up with AI Gateway, so that by default, the platform is able to:

Use a unified access point to route requests across LLM providers, allowing you to use models from a range of providers (OpenAI, Anthropic, Google, and others)
Cache popular responses, so when someone asks to "build a to-do list app", the gateway can serve a cached response instead of going to the provider (saving inference costs)
Get observability into the requests, tokens used, and response times across all providers in one place
Track costs across models and integrations

Open sourced, so you can build your own Platform!

We're open-sourcing VibeSDK for the same reason Cloudflare open-sourced the Workers runtime — we believe the best development happens in the open. That's why we wanted to make it as easy as possible for anyone to build their own AI coding platform, whether it's for internal company use, for your website builder, or for the next big vibe coding platform. We tied all the pieces together for you, so you can get started with the click of a button instead of spending months figuring out how to connect everything yourself. To learn more, check out our reference architecture for vibe coding platforms.

AI Week 2025: Recap

Kenny Johnson — Wed, 03 Sep 2025 14:00:00 GMT

How do we embrace the power of AI without losing control?

That was one of our big themes for AI Week 2025, which has now come to a close. We announced products, partnerships, and features to help companies successfully navigate this new era.

Everything we built was based on feedback from customers like you that want to get the most out of AI without sacrificing control and safety. Over the next year, we will double down on our efforts to deliver world-class features that augment and secure AI. Please keep an eye on our Blog, AI Avenue, Product Change Log and CloudflareTV for more announcements.

This week we focused on four core areas to help companies secure and deliver AI experiences safely and securely:

Securing AI environments and workflows
Protecting original content from misuse by AI
Helping developers build world-class, secure, AI experiences
Making Cloudflare better for you with AI

Thank you for following along with our first ever AI week at Cloudflare. This recap blog will summarize each announcement across these four core areas. For more information, check out our “This Week in NET” recap episode also featured at the end of this blog.

Securing AI environments and workflows

These posts and features focused on helping companies control and understand their employee’s usage of AI tools.

Protecting original content from misuse by AI

Cloudflare is committed to helping content creators control access to their original work. These announcements focused on analysis of what we’re currently seeing on the Internet with respect to AI bots and crawlers and significant improvements to our existing control features.

Helping developers build world-class, secure, AI experiences

At Cloudflare we are committing to building the best platform to build AI experiences, all with security by default.

Making Cloudflare better for you with AI

Cloudflare logs and analytics can often be a needle in the haystack challenge, AI helps surface and alert to issues that need attention or review. Instead of a human having to spend hours sifting and searching for an issue, they can focus on action and remediation while AI does the sifting.

We thank you for following along this week — and please stay tuned for exciting announcements coming during Cloudflare’s 15th birthday week in September!

Check out the full video recap, featuring insights from Kenny Johnson and host João Tomé, in our special This Week in NET episode (ThisWeekinNET.com) covering everything announced during AI Week 2025.

Automating threat analysis and response with Cloudy

Alexandra Moraru — Fri, 29 Aug 2025 14:05:00 GMT

Security professionals everywhere face a paradox: while more data provides the visibility needed to catch threats, it also makes it harder for humans to process it all and find what's important. When there’s a sudden spike in suspicious traffic, every second counts. But for many security teams — especially lean ones — it’s hard to quickly figure out what’s going on. Finding a root cause means diving into dashboards, filtering logs, and cross-referencing threat feeds. All the data tracking that has happened can be the very thing that slows you down — or worse yet, what buries the threat that you’re looking for.

Today, we’re excited to announce that we’ve solved that problem. We’ve integrated Cloudy — Cloudflare’s first AI agent — with our security analytics functionality, and we’ve also built a new, conversational interface that Cloudflare users can use to ask questions, refine investigations, and get answers. With these changes, Cloudy can now help Cloudflare users find the needle in the digital haystack, making security analysis faster and more accessible than ever before.

Since Cloudy’s launch in March of this year, its adoption has been exciting to watch. Over 54,000 users have tried Cloudy for custom rule creation, and 31% of them have deployed a rule suggested by the agent. For our log explainers in Cloudflare Gateway, Cloudy has been loaded over 30,000 times in just the last month, with 80% of the feedback we received confirming the summaries were insightful. We are excited to empower our users to do even more.

Talk to your traffic: a new conversational interface for faster RCA and mitigation

Security analytics dashboards are powerful, but they often require you to know exactly what you're looking for — and the right queries to get there. The new Cloudy chat interface changes this. It is designed for faster root cause analysis (RCA) of traffic anomalies, helping you get from “something’s wrong” to “here’s the fix” in minutes. You can now start with a broad question and narrow it down, just like you would with a human analyst.

For example, you can start an investigation by asking Cloudy to look into a recommendation from Security Analytics.

From there, you can ask follow-up questions to dig deeper:

"Focus on login endpoints only."
"What are the top 5 IP addresses involved?"
"Are any of these IPs known to be malicious?"

This is just the beginning of how Cloudy is transforming security. You can read more about how we’re using Cloudy to bring clarity to another critical security challenge: automating summaries of email detections. This is the same core mission — translating complex security data into clear, actionable insights — but applied to the constant stream of email threats that security teams face every day.

Use Cloudy to understand, prioritize, and act on threats

Analyzing your own logs is powerful — but it only shows part of the picture. What if Cloudy could look beyond your own data and into Cloudflare’s global network to identify emerging threats? This is where Cloudforce One's Threat Events platform comes in.

Cloudforce One translates the high-volume attack data observed on the Cloudflare network into real-time, attacker-attributed events relevant to your organization. This platform helps you track adversary activity at scale — including APT infrastructure, cybercrime groups, compromised devices, and volumetric DDoS activity. Threat events provide detailed, context-rich events, including interactive timelines and mappings to attacker TTPs, regions, and targeted verticals.

We have spent the last few months making Cloudy more powerful by integrating it with the Cloudforce One Threat Events platform. Cloudy now can offer contextual data about the threats we observe and mitigate across Cloudflare's global network, spanning everything from APT activity and residential proxies to ACH fraud, DDoS attacks, WAF exploits, cybercrime, and compromised devices. This integration empowers our users to quickly understand, prioritize, and act on indicators of compromise (IOCs) based on a vast ocean of real-time threat data.

Cloudy lets you query this global dataset in a natural language and receive clear, concise answers. For example, imagine asking these questions and getting immediate actionable answers:

Who is targeting my industry vertical or country?
What are the most relevant indicators (IPs, JA3/4 hashes, ASNs, domains, URLs, SHA fingerprints) to block right now?
How has a specific adversary progressed across the cyber kill chain over time?
What novel new threats are threat actors using that might be used against your network next, and what insights do Cloudflare analysts know about them?

Simply interact with Cloudy in the Cloudflare Dashboard > Security Center > Threat Intelligence, providing your queries in natural language. It can walk you from a single indicator (like an IP address or domain) to the specific threat event Cloudflare observed, and then pivot to other related data — other attacks, related threats, or even other activity from the same actor.

This cuts through the noise, so you can quickly understand an adversary's actions across the cyber kill chain and MITRE ATT&CK framework, and then block attacks with precise, actionable intelligence. The threat events platform is like an evidence board on the wall that helps you understand threats; Cloudy is like your sidekick that will run down every lead.

How it works: Agents SDK and Workers AI

Developing this advanced capability for Cloudy was a testament to the agility of Cloudflare's AI ecosystem. We leveraged our Agents SDK running on Workers AI. This allowed for rapid iteration and deployment, ensuring Cloudy could quickly grasp the nuances of threat intelligence and provide highly accurate, contextualized insights. The combination of our massive network telemetry, purpose-built LLM prompts, and the flexibility of Workers AI means Cloudy is not just fast, but also remarkably precise.

And a quick word on what we didn’t do when developing Cloudy: We did not train Cloudy on any Cloudflare customer data. Instead, Cloudy relies on models made publicly available through Workers AI. For more information on Cloudflare’s approach to responsible AI, please see these FAQs.

What's next for Cloudy

This is just the next step in Cloudy’s journey. We're working on expanding Cloudy's abilities across the board. This includes intelligent debugging for WAF rules and deeper integrations with Alerts to give you more actionable, contextual notifications. At the same time, we are continuously enriching our threat events datasets and exploring ways for Cloudy to help you visualize complex attacker timelines, campaign overviews, and intricate attack graphs. Our goal remains the same: make Cloudy an indispensable partner in understanding and reacting to the security landscape.

The new chat interface is now available on all plans, and the threat intelligence capabilities are live for Cloudforce One customers. Learn more about Cloudforce One here and reach out for a consultation if you want to go deeper with our experts.

Cloudy Summarizations of Email Detections: Beta Announcement

Ayush Kumar — Fri, 29 Aug 2025 14:00:00 GMT

Background

Organizations face continuous threats from phishing, business email compromise (BEC), and other advanced email attacks. Attackers adapt their tactics daily, forcing defenders to move just as quickly to keep inboxes safe.

Cloudflare’s visibility across a large portion of the Internet gives us an unparalleled view of malicious campaigns. We process billions of email threat signals every day, feeding them into multiple AI and machine learning models. This lets our detection team create and deploy new rules at high speed, blocking malicious and unwanted emails before they reach the inbox.

But rapid protection introduces a new challenge: making sure security teams understand exactly what we blocked — and why.

The Challenge

Cloudflare’s fast-moving detection pipeline is one of our greatest strengths — but it also creates a communication gap for customers. Every day, our detection analysts publish new rules to block phishing, BEC, and other unwanted messages. These rules often blend signals from multiple AI and machine learning models, each looking at different aspects of a message like its content, headers, links, attachments, and sender reputation.

While this layered approach catches threats early, SOC teams don’t always have insight into the specific combination of factors that triggered a detection. Instead, they see a rule name in the investigation tab with little explanation of what it means.

Take the rule BEC.SentimentCM_BEC.SpoofedSender as an example. Internally, we know this indicates:

The email contained no unique links or attachments a common BEC pattern
It was flagged as highly likely to be BEC by our Churchmouse sentiment analysis models
Spoofing indicators were found, such as anomalies in the envelope_from header

Those details are second nature to our detection team, but without that context, SOC analysts are left to reverse-engineer the logic from opaque labels. They don’t see the nuanced ML outputs (like Churchmouse’s sentiment scoring) or the subtle header anomalies, or the sender IP/domain reputation data that factored into the decision.

The result is time lost to unclear investigations or the risk of mistakenly releasing malicious emails. For teams operating under pressure, that’s more than just an inconvenience, it's a security liability.

That’s why we extended Cloudy (our AI-powered agent) to translate complex detection logic into clear explanations, giving SOC teams the context they need without slowing them down.

Enter Cloudy Summaries

Several weeks ago, we launched Cloudy within our Cloudflare One product suite to help customers understand gateway policies and their impacts (you can read more about the launch here: https://blog.cloudflare.com/introducing-ai-agent/).

We began testing Cloudy's ability to explain the detections and updates we continuously deploy. Our first attempt revealed significant challenges.

The Hallucination Problem

We observed frequent LLM hallucinations, the model generating inaccurate information about messages. While this might be acceptable when analyzing logs, it's dangerous for email security detections. A hallucination claiming a malicious message is clean could lead SOC analysts to release it from quarantine, potentially causing a security breach.

These hallucinations occurred because email detections involve numerous and complex inputs. Our scanning process runs messages through multiple ML algorithms examining different components: body content, attachments, links, IP reputation, and more. The same complexity that makes manual detection explanation difficult also caused our initial LLM implementation to produce inconsistent and sometimes inaccurate outputs.

Building Guardrails

To minimize hallucination risk while maintaining inbox security, we implemented several manual safeguards:

Step 1: RAG Implementation

We ensured Cloudy only accessed information from our detection dataset corpus, creating a Retrieval-Augmented Generation (RAG) system. This significantly reduced hallucinations by grounding the LLM's assessments in actual detection data.

Step 2: Model Context Enhancement

We added crucial context about our internal models. For example, the "Churchmouse" designation refers to a group of sentiment detection models, not a single algorithm. Without this context, Cloudy attempted to define "churchmouse" using the common idiom "poor as a church mouse" referencing starving church mice because holy bread never falls to the floor. While historically interesting, this was completely irrelevant to our security context.

Current Results

Our testing shows Cloudy now produces more stable explanations with minimal hallucinations. For example, the detection SPAM.ASNReputation.IPReputation_Scuttle.Anomalous_HC now generates this summary:

"This rule flags email messages as spam if they come from a sender with poor Internet reputation, have been identified as suspicious by a blocklist, and have unusual email server setup, indicating potential malicious activity."

This strikes the right balance. Customers can quickly understand what the detection found and why we classified the message accordingly.

Beta Program

We're opening Cloudy email detection summaries to a select group of beta users. Our primary goal is ensuring our guardrails prevent hallucinations that could lead to security compromises. During this beta phase, we'll rigorously test outputs and verify their quality before expanding access to all customers.

Ready to enhance your email security?

We provide all organizations (whether a Cloudflare customer or not) with free access to our Retro Scan tool, allowing them to use our predictive AI models to scan existing inbox messages. Retro Scan will detect and highlight any threats found, enabling organizations to remediate them directly in their email accounts. With these insights, organizations can implement further controls, either using Cloudflare Email Security or their preferred solution, to prevent similar threats from reaching their inboxes in the future.

If you are interested in how Cloudflare can help secure your inboxes, sign up for a phishing risk assessment here.

Troubleshooting network connectivity and performance with Cloudflare AI

Chris Draper — Fri, 29 Aug 2025 14:00:00 GMT

Monitoring a corporate network and troubleshooting any performance issues across that network is a hard problem, and it has become increasingly complex over time. Imagine that you’re maintaining a corporate network, and you get the dreaded IT ticket. An executive is having a performance issue with an application, and they want you to look into it. The ticket doesn’t have a lot of details. It simply says: “Our internal documentation is taking forever to load. PLS FIX NOW”.

In the early days of IT, a corporate network was built on-premises. It provided network connectivity between employees that worked in person and a variety of corporate applications that were hosted locally.

The shift to cloud environments, the rise of SaaS applications, and a “work from anywhere” model has made IT environments significantly more complex in the past few years. Today, it’s hard to know if a performance issue is the result of:

An employee’s device
Their home or corporate wifi
The corporate network
A cloud network hosting a SaaS app
An intermediary ISP

A performance ticket submitted by an employee might even be a combination of multiple performance issues all wrapped together into one nasty problem.

Cloudflare built Cloudflare One, our Secure Access Service Edge (SASE) platform, to protect enterprise applications, users, devices, and networks. In particular, this platform relies on two capabilities to simplify troubleshooting performance issues:

Cloudflare’s Zero Trust client, also known as WARP, forwards and encrypts traffic from devices to Cloudflare edge.
Digital Experience Monitoring (DEX) works alongside WARP to monitor device, network, and application performance.

We’re excited to announce two new AI-powered tools that will make it easier to troubleshoot WARP client connectivity and performance issues. We’re releasing a new WARP diagnostic analyzer in the Zero Trust dashboard and a MCP (Model Context Protocol) server for DEX. Today, every Cloudflare One customer has free access to both of these new features by default.

WARP diagnostic analyzer

The WARP client provides diagnostic logs that can be used to troubleshoot connectivity issues on a device. For desktop clients, the most common issues can be investigated with the information captured in logs called WARP diagnostic. Each WARP diagnostic log contains an extensive amount of information spanning days of captured events occurring on the client. It takes expertise to manually go through all of this information and understand the full picture of what is occurring on a client that is having issues. In the past, we’ve advised customers having issues to send their WARP diagnostic log straight to us so that our trained support experts can do a root cause analysis for them. While this is effective, we want to give our customers the tools to take control of deciphering common troubleshooting issues for even quicker resolution.

Enter the WARP diagnostic analyzer, a new AI available for free in the Cloudflare One dashboard as of today! This AI demystifies information in the WARP diagnostic log so you can better understand events impacting the performance of your clients and network connectivity. Now, when you run a remote capture for WARP diagnostics in the Cloudflare One dashboard, you can generate an AI analysis of the WARP diagnostic file. Simply go to your organization’s Zero Trust dashboard and select DEX > Remote Captures from the side navigation bar. After you successfully run diagnostics and produce a WARP diagnostic file, you can open the status details and select View WARP Diag to generate your AI analysis.

In the WARP Diag analysis, you will find a Cloudy summary of events that we recommend a deeper dive into.

Below this summary is an events section, where the analyzer highlights occurrences of events commonly occurring when there are client and connectivity issues.

Expanding on any of the events detected will reveal a detailed page explaining the event, recommended resources to help troubleshoot, and a list of time stamped recent occurrences of the event on the device.

To further help with trouble shooting we’ve added a Device and WARP details section at the bottom of this page with a quick view of the device specifications and WARP configurations such as Operating system, WARP version, and the device profile ID.

Finally, we’ve made it easy to take all the information created in your AI summary with you by navigating to the JSON file tab and copying the contents. Your WARP Diag file is also available to download from this screen for any further analysis.

MCP server for DEX

Alongside the new WARP Diagnostic Analyzer, we’re excited to announce that all Cloudflare One customers have access to a MCP (Model Context Protocol) server for our Digital Experience Monitoring (DEX) product. Let’s dive into how this will save our customers time and money.

Cloudflare One customers use Digital Experience Monitoring (DEX) to monitor devices across their employee network and troubleshoot any connectivity or performance issues. Like many products at Cloudflare, every data point generated by DEX is available to customers via Cloudflare’s API or log ingestion. DEX API and log data is valuable because it enables Enterprises to create custom analytics for their devices’ connectivity and performance in a SIEM (Security Information and Event Management). Building new data pipelines and various dashboards can take a lot of time and can be expensive. Some of Cloudflare One customers aren’t able to dedicate the engineering time to build custom analytics: whether it’s due to budget, resource constraints, time constraints, or other factors.

Model Context Protocol (MCP) is an AI standardization that allows AI to connect to other applications. In this instance, using MCP together with DEX gives users a custom analytics experience — without all the work of integrating log data into a SIEM. Network security practitioners can type out a question (like “Show me the performance data for alice@acme.com’s device in a few graphs”), and an MCP server will answer that question with a customized graph on Alice’s device data from the DEX API.

Troubleshooting DNS query performance with the DEX MCP server

Any IT admin, security engineer, or network engineer can ask the MCP server a question like: “I received an IT ticket from bob@acme.com who is an enterprise employee. He is complaining about the performance of his device. Can you investigate this for me?”.

After you submit this question, the DEX MCP server will start by reviewing the device’s overall health.

So far, the device’s health looks great. Next, the DEX MCP server will begin analyzing the results of the performance tests that are configured on Bob’s device.

Now we’re making progress in our troubleshooting effort. The DEX MCP server identified that the HTTP GET test for wiki.internal.acme.com has a high average resource fetch time when compared to other websites (like google.com). In particular, high resource fetch time is the result of slow DNS resolution. Now, the DEX MCP server will summarize its findings.

Now, we’ve successfully identified that Bob’s slow wiki performance is the result of a high average DNS response time. Slow DNS resolution increases the average resource fetch time, and also generates HTTP 400 errors for Enterprise employees that are attempting to access wiki.internal.acme.com. Notably, it’s likely that if the wiki is seeing slow DNS response times, other internal applications are experiencing performance issues as well.

We can ask the DEX MCP server to suggest potential solutions that would fix this wider DNS performance issue.

Try out the DEX MCP server today

Fast and easy option for testing an MCP server

Any Cloudflare One customer with a Free, PayGo, or ENT plan can start using the DEX MCP server in less than one minute. The fastest and easiest way to try out the DEX MCP server is to visit playground.ai.cloudflare.com. There are five steps to get started:

Copy the URL for the DEX MCP server: https://dex.mcp.cloudflare.com/sse
Open playground.ai.cloudflare.com in a browser
Find the section in the left side bar titled MCP Servers
Paste the URL for the DEX MCP server into the URL input box and click Connect
Authenticate your Cloudflare account, and then start asking questions to the DEX MCP server

It’s worth noting that end users will need to ask specific and explicit questions to the DEX MCP server to get a response. For example, you may need to say, “Set my production account as the active account”, and then give the separate command, “Fetch the DEX test results for the user bob@acme.com over the past 24 hours”.

Better experience for MCP servers that requires additional steps

Customers will get a more flexible prompt experience by configuring the DEX MCP server with their preferred AI assistant (Claude, Gemini, ChatGPT, etc.) that has MCP server support. MCP server support may require a subscription for some AI assistants. You can read the Digital Experience Monitoring - MCP server documentation for step by step instructions on how to get set up with each of the major AI assistants that are available today.

As an example, you can configure the DEX MCP server in Claude by downloading the Claude Desktop client, then selecting Claude Code > Developer > Edit Config. You will be prompted to open “claude_desktop_config.json” in a code editor of your choice. Simply add the following JSON configuration, and you’re ready to use Claude to call the DEX MCP server.

Get started with Cloudflare One today

Are you ready to secure your Internet traffic, employee devices, and private resources without compromising speed? You can get started with our new Cloudflare One AI powered tools today.

The WARP diagnostic analyzer and the DEX MCP server are generally available to all customers. Head to the Zero Trust dashboard to run a WARP diagnostic and learn more about your client’s connectivity with the WARP diagnostic analyzer. You can test out the new DEX MCP server (https://dex.mcp.cloudflare.com/sse) in less than one minute at playground.ai.cloudflare.com, and you can also configure an AI assistant like Claude to use the new DEX MCP server.

If you don’t have a Cloudflare account, and you want to try these new features, you can create a free account for up to 50 users. If you’re an Enterprise customer, and you’d like a demo of these new Cloudflare One AI features, you can reach out to your account team to set up a demo anytime.

You can stay up to date on latest feature releases across the Cloudflare One platform by following the Cloudflare One changelogs and joining the conversation in the Cloudflare community hub or on our Discord Server.

The crawl-to-click gap: Cloudflare data on AI bots, training, and referrals

João Tomé — Fri, 29 Aug 2025 14:00:00 GMT

In 2025, Generative AI is reshaping how people and companies use the Internet. Search engines once drove traffic to content creators through links. Now, AI training crawlers — the engines behind commonly-used LLMs — are consuming vast amounts of web data, while sending far fewer users back. We covered this shift, along with related trends and Cloudflare features (like pay per crawl) in early July. Studies from Pew Research Center (1, 2) and Authoritas already point to AI overviews — Google’s new AI-generated summaries shown at the top of search results — contributing to sharp declines in news website traffic. For a news site, this means lots of bot hits, but far fewer real readers clicking through — which in turn means fewer people clicking on ads or chances to convert to subscriptions.

Cloudflare's data shows the same pattern. Crawling by search engines and AI services surged in the first half of 2025 — up 24% year-over-year in June — before slowing to just 4% year-over-year growth in July. How is the space evolving? Which crawling purposes are most common, and how is that changing? Spoiler: training-related crawling is leading the way. In this post, we track AI and search bot crawl activity, what purposes dominate, and which platforms contribute the least referral traffic back to creators.

Key takeaways

Training crawling grows: Training now drives nearly 80% of AI bot activity, up from 72% a year ago.
Publisher referrals drop: Google referrals to news sites fell, with March 2025 down ~9% compared to January.
AI & search crawling increase: Crawling rose 32% year-over-year in April 2025, before slowing to 4% year-over-year growth in July.
AI-only crawler shifts: OpenAI’s GPTBot more than doubled in share of AI crawling traffic (4.7% to 11.7%), Anthropic’s ClaudeBot rose (6% to ~10%), while ByteDance’s Bytespider fell from 14.1% to 2.4%.
Crawl-to-refer imbalance (how many pages a bot crawls per page that a user clicks back to): Anthropic increased referrals but still leads with 38,000 crawls per visitor in July (down from 286,000:1 in January). Perplexity decreased referrals in 2025 — with more crawling but fewer referrals at 194 crawls per visitor in July.

Several of the trends in this blog use Cloudflare Radar’s new AI Insights features, explained in more detail in the post: “A deeper look at AI crawlers: breaking down traffic by purpose and industry.”

Google referrals fall as AI Overviews expand

Referral traffic from search is already shifting, as we noted above and as studies have shown. In our dataset of news-related customers (spanning the Americas, Europe, and Asia), Google’s referrals have been clearly declining since February 2025. This drop is unusual, since overall Internet traffic (and referrals as well) historically has only dipped during July and August — the summer months when the Northern Hemisphere is largely on break from school or work. The sharpest and least seasonal decline came in March. Despite being a 31-day month, March had almost the same referral volume as the shorter, 28-day February.

Looking at longer comparisons: March 2025 referral traffic from Google was 9% lower than January, the same drop seen in June. April was worse, down 15% compared with January.

This drop seems to coincide with some of Google’s changes. AI Overviews launched in the U.S. in May 2024, but in March 2025, Google upgraded AI Overviews with Gemini 2.0, introduced AI Mode in Labs, and expanded Overviews to more European countries. By May 2025, AI Mode rolled out broadly in the U.S. with Gemini 2.5, adding conversational search, Deep Search, and personalized recommendations.

The search-to-news site pipeline seems to be weakening, replaced in part by AI-driven results.

Looking at a daily perspective, we can also spot a clear U.S.-election-related peak in referrals from Google to the cohort of known news sites on November 5–6, 2024.

AI and search crawling: spring surge (+24%), summer slowdown

In June, we talked about search and AI crawler growth, and our picture of the trend is now more complete with more data. To focus only on AI and search crawlers, and to remove the bias of customer growth, we analyzed a fixed set of customers from specific weeks, a method we’ve also used in the Cloudflare Radar Year in Review.

What the data shows: crawling spiked twice: first in November 2024, then again between March and April 2025. April 2025 alone was up 32% compared with May 2024, the first full month where we have comparable data. After that surge, growth stabilized. In June 2025, crawling traffic was still 24% higher year-over-year, but by July the increase was down to just 4%. That shift highlights how quickly crawler activity can accelerate and then cool down.

As the chart below shows, crawling traffic rose sharply in March and April. It remained high but slightly lower in May, before starting to drop in June. The seasonal dip is similar to what we see in overall Internet traffic during the Northern Hemisphere’s summer months (August and September are often the quietest), though in the case of crawlers, this is likely due to reduced overall web activity rather than bots themselves taking a “break.” Historically, activity tends to rise again in November — as it did in 2024 for AI and search bot traffic — when people spend more time online for shopping and seasonal habits (a pattern we’ve seen in past years).

Googlebot is still the anchor, accounting for 39% of all AI and search crawler traffic, but the fastest growth now comes from AI-specific crawlers, though bots related to Amazon and ByteDance (Bytespider) have lost significant ground. GPTBot’s share grew from 4.7% in July 2024 to 11.7% in July 2025. ClaudeBot also increased, from 6% to nearly 10%, while Meta’s crawler jumped from 0.9% to 7.5%. By contrast, Amazonbot dropped from 10.2% to 5.9%, and ByteDance’s Bytespider dropped from 14.1% to just 2.4%.

The table below shows how market shares have shifted between July 2024 and July 2025:

AI-only crawlers: OpenAI rises, ByteDance falls

Looking only at AI bot traffic (as tracked on our Radar AI page), the trend is clear. Since January 2025, GPTBot has steadily increased its crawling volume, driven mainly by training-related activity. ClaudeBot crawling accelerated in June, while Amazonbot and Bytespider activity slowed.

The chart below shows how GPTBot surged over the past 12 months, overtaking Amazonbot and Bytespider, which both fell sharply:

A comparison between July 2024 and July 2025 makes the shift even more obvious. GPTBot gained 16 percentage points, Meta’s crawler rose by more than 15, and ClaudeBot grew by 8. On the shrinking side, Amazonbot dropped 12 percentage points and Bytespider dropped over 31 percentage points.

We covered the functionality of these bots in our June blog post.

Crawling by purpose: training dominates

Training is the clear leader. (We classify purpose based on operator disclosures and industry sources, a method we explained in this AI Week blog.) Over the past 12 months, 80% of AI crawling was for training, compared with 18% for search and just 2% for user actions. In the last six months, the share for training rose further to 82%, while search dropped to 15% and user actions increased slightly to 3%.

The chart below shows how training-related crawling steadily grew over the past year, far outpacing other purposes:

The year-over-year comparison reinforces this trend. In July 2024, training accounted for 72% of AI crawling. By July 2025, it had risen to 79%. Over the same period, search fell from 26% to 17%, while user actions grew modestly from 2% to 3.2%.

Crawl-to-refer ratios shifts: tens of thousands of bot crawls per human click

The crawl-to-refer ratio measures how many pages a platform crawls compared with how often it drives users to a website. In practice, a high ratio means heavy crawling but little referral traffic. For example, for every visitor Anthropic refers back to a website, its crawlers have already visited tens of thousands of pages.

Why does this metric matter? It highlights the imbalance between how much content AI systems consume and how little traffic they return. For publishers, it can feel like giving away the raw material for free. With that in mind, here’s how different platforms compare from January to July 2025.

Anthropic remains the most crawl-heavy platform. Even after an 87% decline this year, it still crawled 38,000 pages for every referred page visit in July 2025 — the highest imbalance among major AI players. Referrals may be improving, though, after Anthropic added web search to Claude in March 2025 (initially for U.S. paid users) and expanded it globally by May to all users, including the free tier. The feature introduced direct citations with clickable URLs, creating new referral pathways.

The full dataset is below, showing January–July 2025 ratios by platform ordered by the highest ratio average:
(Note: a rising ratio means more bot crawling per human click sent back, while a falling ratio means less bot crawling per human click sent back)

Crawl-to-refer ratio (from Cloudflare Radar’s data)

Looking at the changes from January to July 2025:

Anthropic recorded the steepest decrease in bot to human traffic, down 86.7%. From 286,930 bots per human in January, to 38,065 bots per human in July, the change shows a dramatic increase in referrals. Despite the change, it remains by far the most crawl-heavy platform, with tens of thousands of pages still crawled for every referral.
Perplexity moved in the opposite direction, with bot crawling increasing +256.7% relative to human visitors; climbing from 54 bots per human in January to 195 bots per human in July. While the ratio is still far below Anthropic, the increase shows it is crawling more heavily, relative to the traffic it refers, than it did earlier.
OpenAI ratio dropped slightly, from 1,217 bots per human in January to 1,091 in July (-10%). The shift is smaller than Anthropic’s but suggests OpenAI is sending a bit more referral traffic relative to its crawling.
Microsoft stayed steady, with its ratio moving only slightly, from 38.5 bots per human in January to 40.7 in July (+6%). This consistency suggests stable behavior from Bing-linked services.
Yandex increased from 15.5 bots per human in January to 21.4 in July (+38%). The overall ratio is far smaller than Anthropic’s or Perplexity’s, but it shows Yandex is crawling more heavily relative to the traffic it sends back.

Alongside measuring crawling volumes and referral traffic (now also visible on the AI Insights page of Cloudflare Radar), it’s worth looking at whether AI operators follow good practices when deploying their bots. Cloudflare data shows that most leading AI crawlers are on our verified bots list, meaning their IP addresses match published ranges and they respect robots.txt. But adoption of newer standards like WebBotAuth — which uses cryptographic signatures in HTTP messages to confirm a request comes from a specific bot, and is especially relevant today — is still missing.

Meta, OpenAI, and Anthropic run distinct bots for different purposes, while Google and Microsoft rely on unified crawlers. Anthropic, however, still lags in verification, which makes it easier for bad actors to spoof its crawler and ignore robots.txt. Without verification, it’s difficult to distinguish real from fake traffic — leaving its compliance effectively unclear. (A longer list of AI bots is available here).

Conclusion and what’s next

If training-related crawling continues to dominate while referrals stay flat, creators face a paradox: feeding AI systems without gaining traffic in return. Many want their content to appear in chatbot answers, but without monetization or cooperation, the incentive to produce quality work declines.

The Web now stands at a fork in the road. Either a new balance emerges — one where the new AI era helps sustain publishers and creators — or AI turns the open web into a one-way training set, extracting value with little flowing back.

You can learn more about some of these data trends on Cloudflare Radar’s updated AI Insights page.

Cloudflare is the best place to build realtime voice agents

Renan Dincer — Fri, 29 Aug 2025 14:00:00 GMT

The way we interact with AI is fundamentally changing. While text-based interfaces like ChatGPT have shown us what's possible, in terms of interaction, it’s only the beginning. Humans communicate not only by texting, but also talking — we show things, we interrupt and clarify in real-time. Voice AI brings these natural interaction patterns to our applications.

Today, we're excited to announce new capabilities that make it easier than ever to build real-time, voice-enabled AI applications on Cloudflare's global network. These new features create a complete platform for developers building the next generation of conversational AI experiences or can function as building blocks for more advanced AI agents running across platforms.

We're launching:

Cloudflare Realtime Agents - A runtime for orchestrating voice AI pipelines at the edge
Pipe raw WebRTC audio as PCM in Workers - You can now connect WebRTC audio directly to your AI models or existing complex media pipelines already built on
Workers AI WebSocket support - Realtime AI inference with models like PipeCat's smart-turn-v2
Deepgram on Workers AI - Speech-to-text and text-to-speech running in over 330 cities worldwide

Why realtime AI matters now

Today, building voice AI applications is hard. You need to coordinate multiple services such as speech-to-text, language models, text-to-speech while managing complex audio pipelines, handling interruptions, and keeping latency low enough for natural conversation.

Building production voice AI requires orchestrating a complex symphony of technologies. You need low latency speech recognition, intelligent language models that understand context and can handle interruptions, natural-sounding voice synthesis, and all of this needs to happen in under 800 milliseconds — the threshold where conversation feels natural rather than stilted. This latency budget is unforgiving. Every millisecond counts: 40ms for microphone input, 300ms for transcription, 400ms for LLM inference, 150ms for text-to-speech. Any additional latency from poor infrastructure choices or distant servers transforms a delightful experience into a frustrating one.

That's why we're building real-time AI tools: we want to make real-time voice AI as easy to deploy as a static website. We're also witnessing a critical inflection point where conversational AI moves from experimental demos to production-ready systems that can scale globally. If you’re already a developer in the real-time AI ecosystem, we want to build the best building blocks for you to get the lowest latency by leveraging the 330+ datacenters Cloudflare has built.

Introducing Cloudflare Realtime Agents

Cloudflare Realtime Agents is a simple runtime for orchestrating voice AI pipelines that run on our global network, as close to your users as possible. Instead of managing complex infrastructure yourself, you can focus on building great conversational experiences.

How it works

When a user connects to your voice AI application, here's what happens:

WebRTC connection - Audio streams from the user's device is sent to the nearest Cloudflare location via WebRTC, using Cloudflare RealtimeKit mobile or web SDKs
AI pipeline orchestration - Your pre-configured pipeline runs: speech-to-text → LLM → text-to-speech, with support for interruption detection and turn-taking
Your configured runtime options/callbacks/tools run
Response delivery - Generated audio streams back to the user with minimal latency

The magic is in how we've designed this as composable building blocks. You're not locked into a rigid pipeline — you can configure data flows, add tee and join operations, and control exactly how your AI agent behaves.

Take a look at the MyTextHandler function from the above diagram, for example. It’s just a function that takes in text and returns text back, inserted after speech-to-text and before text-to-speech:

Your agent is a JavaScript class that extends RealtimeAgent, where you initialize a pipeline consisting of the various text-to-speech, speech-to-text, text-to-text and even speech-to-speech transformations.

View a full example in the developer docs and get your own Realtime Agent running. View Realtime Agents on your dashboard.

Built for flexibility

What makes Realtime Agents powerful is its flexibility:

Many AI provider options - Use the models on Workers AI, OpenAI, Anthropic, or any provider through AI Gateway
Multiple input/output modes - Accept audio and/or text and respond with audio and/or text
Stateful coordination - Maintain context across the conversation without managing complex state yourself
Speed and flexibility - use RealtimeKit to manage WebRTC sessions and UI for faster development, or for full control over your stack, you can also connect directly using any standard WebRTC client or raw WebSockets
Integrate with the Cloudflare Agents SDK

During the open beta starting today, Cloudflare Realtime Agents runtime is free to use and works with various AI models:

Speech and Audio: Integration with platforms like ElevenLabs and Deepgram.
LLM Inference: Flexible options to use large language models through Cloudflare Workers AI and AI Gateway, connect to third-party models like OpenAi, Gemini, Grok, Claude, or bring your own custom models.

Pipe raw WebRTC audio as PCM in Workers

For developers who need the most flexibility with their applications beyond Realtime Agents, we're exposing the raw WebRTC audio pipeline directly to Workers.

WebRTC audio in Workers works by leveraging Cloudflare’s Realtime SFU, which converts WebRTC audio in Opus codec to PCM and streams it to any WebSocket endpoint you specify. This means you can use Workers to implement:

Live transcription - Stream audio from a video call directly to a transcription service
Custom AI pipelines - Send audio to AI models without setting up complex infrastructure
Recording and processing - Save, audit, or analyze audio streams in real-time

WebSockets vs WebRTC for voice AI

WebSockets and WebRTC can handle audio for AI services, but they work best in different situations. WebSockets are perfect for server-to-server communication and work fine when you don't need super-fast responses, making them great for testing and experimenting. However, if you're building an app where users need real-time conversations with low delay, WebRTC is the better choice.

WebRTC has several advantages that make it superior for live audio streaming. It uses UDP instead of TCP, which prevents audio delays caused by lost packets holding up the entire stream (head of line blocking is a common topic discussed on this blog). The Opus audio codec in WebRTC automatically adjusts to network conditions and can handle packet loss gracefully. WebRTC also includes built-in features like echo cancellation and noise reduction that WebSockets would require you to build separately.

With this feature, you can use WebRTC for client to server communication and leveraging Cloudflare to convert to familiar WebSockets for server-to-server communication and backend processing.

The power of Workers + WebRTC

When WebRTC audio gets converted to WebSockets, you get PCM audio at the original sample rate, and from there, you can run any task in and out of the Cloudflare developer platform:

Resample audio and send to different AI providers
Run WebAssembly-based audio processing
Build complex applications with Durable Objects, Alarms and other Workers primitives
Deploy containerized processing pipelines with Workers Containers

The WebSocket works bidirectionally, so data sent back on the WebSocket becomes available as a WebRTC track on the Realtime SFU, ready to be consumed within WebRTC.

To illustrate this setup, we’ve made a simple WebRTC application demo that uses the ElevenLabs API for text-to-speech.

Visit the Realtime SFU developer docs on how to get started.

Realtime AI inference with WebSockets

WebSockets provide the backbone of real-time AI pipelines because it is a low-latency, bidirectional primitive with ubiquitous support in developer tooling, especially for server to server communication. Although HTTP works great for many use cases like chat or batch inference, real-time voice AI needs persistent, low-latency connections when talking to AI inference servers. To support your real-time AI workloads, Workers AI now supports WebSocket connections in select models.

Launching with PipeCat SmartTurn V2

The first model with WebSocket support is PipeCat's smart-turn-v2 turn detection model — a critical component for natural conversation. Turn detection models determine when a speaker has finished talking and it's appropriate for the AI to respond. Getting this right is the difference between an AI that constantly interrupts and one that feels natural to talk to.

Below is an example on how to call smart-turn-v2 running on Workers AI.

Deepgram in Workers AI

On Wednesday, we announced that Deepgram's speech-to-text and text-to-speech models are available on Workers AI, running in Cloudflare locations worldwide. This means:

Lower latency - Speech recognition happens at the edge, close to users running in the same network as Workers
WebRTC audio processing without leaving the Cloudflare network
State-of-the-art audio ML models powerful, capable, and fast audio models, available directly through Workers AI
Global scale - leverages Cloudflare’s global network in 330+ cities automatically

Deepgram is a popular choice for voice AI applications. By building your voice AI systems on the Cloudflare platform, you get access to powerful models and the lowest latency infrastructure to give your application a natural, responsive experience.

Interested in other realtime AI models running on Cloudflare?

If you're developing AI models for real-time applications, we want to run them on Cloudflare's network. Whether you have proprietary models or need ultra-low latency inference at scale with open source models reach out to us.

Get started today

All of these features are available now:

Cloudflare Realtime Agents - Start testing in beta
WebRTC audio as PCM in Workers - Read the documentation and integrate with your applications
Workers AI WebSocket support - Try out PipeCat’s smart-turn-v2 model
Deepgram on Workers AI - Available now at @cf/deepgram/aura-1 and @cf/deepgram/nova-3

Want to pick the brains of the engineers who built this? Join them for technical deep dives, live demos Q&A at Cloudflare Connect in Las Vegas. Explore the full schedule and register.

A deeper look at AI crawlers: breaking down traffic by purpose and industry

David Belson — Thu, 28 Aug 2025 14:05:00 GMT

Search platforms historically crawled web sites with the implicit promise that, as the sites showed up in the results for relevant searches, they would send traffic on to those sites — in turn leading to ad revenue for the publisher. This model worked fairly well for several decades, with a whole industry emerging around optimizing content for optimal placement in search results. It led to higher click-through rates, more eyeballs for publishers, and, ideally, more ad revenue. However, the emergence of AI platforms over the last several years, and the incorporation of AI "overviews" into classic search platforms, has turned the model on its head. When users turn to these AI platforms with queries that used to go to search engines, they often won't click through to the original source site once an answer is provided — and that assumes that a link to the source is provided at all! No clickthrough, no eyeballs, and no ad revenue.

To provide a perspective on the scope of this problem, Radar launched crawl/refer ratios on July 1, based on traffic seen across our whole customer base. These ratios effectively compare the number of crawling requests for HTML pages from the crawler associated with a given platform, to the number of HTML page requests referred by that platform (measuring human traffic). This data complements insights into AI bot & crawler traffic trends that were launched during Birthday Week 2024.

Today, we're adding two new capabilities to the AI Insights page on Cloudflare Radar to give you more insight into this activity: industry-focused AI bot traffic data, and a new breakdown of AI bot traffic by its purpose.

Traffic by type

Since the launch of LLMs into the public consciousness in November 2022, much of the crawling traffic seen from user agents associated with AI platforms has been to collect content used to train AI models. This crawling activity can be aggressive at times, often ignoring directives found in robots.txt files. In addition to offering chatbots trained on this scraped content, AI platforms have emerged that aim to replace classic search tools, while those tools have themselves integrated AI-powered summaries as part of their results. These platforms may crawl your site to build indexes for their search engines. And some AI platforms may crawl your site in response to a specific user prompt, such as looking for flights to plan a vacation.

The new Crawl purpose selector within the AI bot & crawler traffic card allows users to select between Training, Search, User action, and Undeclared. (The latter is for crawlers where no information is available from the operator or other industry sources regarding its purpose.)

Once a purpose is selected, the HTTP traffic by bot graph updates to show traffic trends over the selected time period for the top five most active AI bots that crawl for the selected purpose.

As an example, selecting User action results in a graph like the one below, which covers the first 28 days of July 2025. OpenAI’s ChatGPT-User bot is responsible for nearly three quarters of the request traffic from this cohort of crawlers. A daily cycle is clearly evident, suggesting regular usage of ChatGPT in that fashion, with such usage gradually increasing throughout the month. If ChatGPT-User is removed from the chart, Perplexity-User also exhibits a similar pattern.

A new Crawl purpose graph has also been added to Radar, breaking out traffic trends by purpose. Training traffic, responsible for nearly 80% of the crawling from AI bots, is somewhat erratic in nature, with no clear cyclical pattern. However, such patterns are visible for the User action and Undeclared purposes, as shown in the graph below, although they account for less than 5% of AI bot traffic across this time period.

Within the Data Explorer view for the AI Bots & Crawlers dataset, you can now break the data down by Crawl purpose to explore how the activity has changed over time. Alternatively, you can break the data down by User agent, and filter by Crawl purpose, to explore traffic trends across a larger set of bots (beyond the top five). Comparisons with previous time periods are available here as well.

Visibility by industry

You can use your own traffic data to see how aggressively crawlers scrape your content. You can also see how frequently they refer traffic back to you. However, you may also want to understand how those measurements compare with your peer group — are you being crawled more or less frequently, and are the platforms referring more or less traffic back to your sites? The new industry set filtering available for the HTTP traffic by bot graph and the Crawl-to-refer ratio table in the AI Insights section of Radar can provide you with this perspective.

Within the AI bot & crawler traffic card on the AI Insights page, select an industry set from the drop-down list at the top right of the card. The graphs in the HTTP traffic by bot and Crawl purpose sections of the card update to reflect the selection, as does the Crawl-to-refer ratio table. (Selecting a Crawl purpose from that drop-down menu will further update the HTTP traffic by bot graph.)

It is interesting to observe how the crawling patterns change between industry sets, along with the mix of most active bots and crawl-to-refer ratios. For example, across the first week of August, with no vertical or crawl purpose selected, ClaudeBot and GPTBot account for nearly half of the observed crawling activity, with Meta-ExternalAgent the only one among the top five exhibiting activity that remotely resembles a pattern. For the default view, Anthropic had the highest crawl-to-refer ratio at nearly 50,000:1, followed by OpenAI at 887:1 and Perplexity at 118:1.

However, when the News and Publications industry set is selected, we see a much tighter distribution of traffic among the top five, ranging from ChatGPT-User’s 14.9% share of traffic to GPTBot’s 17.4% share. ChatGPT-User’s presence among the top five suggests that a significant number of users may have been asking questions about current events during that period of time. For these News and Publications sites, the crawl-to-refer ratios are lower than the default view, with Anthropic at 2,500:1, OpenAI at 152:1, and Perplexity at 32.7:1.

As a third example, we find that the mix again shifts for the Computer and Electronics industry set. While GPTBot was again the most active AI bot, Amazonbot moved up into second place; together these bots now account for over 40% of crawling traffic. ClaudeBot and Meta-ExternalAgent both had a 13.9% share of the crawling traffic, with ByteDance’s ByteSpider rounding out the top five. The crawl-to-refer ratios for this vertical are again lower than for the unfiltered view, with Anthropic down to 8,800:1, OpenAI at 401.7:1, and Perplexity at 88:1.

Within Data Explorer, you can now break down AI Bots & Crawler data by Vertical and Industry. (A vertical is a pre-defined collection of multiple related industries), and you can also filter Crawl purpose and User agent breakdowns by Vertical and Industry. For example, the graphs below illustrate the traffic trends by AI crawler for sites within the Cryptocurrency industry under the Finance vertical, as well as the traffic trends by crawl purpose for that industry/vertical pair. While these sites see crawling traffic from quite a few bots, three-quarters of that traffic during the first week of August was concentrated in just four bots, and 80% of it was for gathering information to train models.

Because the Industry sets shown on the main AI Insights page are manually curated collections of related industries, clicking through to the Data Explorer view from one of those graphs will pre-populate the Industry selector with the relevant entries. For example, clicking through from the HTTP traffic by bot graph for the Gaming & Gambling industry set results in the following Data Explorer view, which lists the component industries.

Conclusion

AI crawler traffic has become a fact of life for content owners, and the complexity of dealing with it has increased as bots are used for purposes beyond LLM training. Work is underway to allow website publishers to declare how automated systems should use their content. However, it will take some time for these proposed solutions to be standardized, and for both publishers and crawlers to adopt them. As the space evolves, we’ll continue to expand Cloudflare Radar’s insights into AI crawler activity.

If you share our AI-related graphs on social media, be sure to tag us: @CloudflareRadar (X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky). If you have questions or comments, you can reach out to us on social media, or contact us via email.

Evaluating image segmentation models for background removal for Images

Deanna Lam — Thu, 28 Aug 2025 14:00:00 GMT

Last week, we wrote about face cropping for Images, which runs an open-source face detection model in Workers AI to automatically crop images of people at scale.

It wasn’t too long ago when deploying AI workloads was prohibitively complex. Real-time inference previously required specialized (and costly) hardware, and we didn’t always have standard abstractions for deployment. We also didn’t always have Workers AI to enable developers — including ourselves — to ship AI features without this additional overhead.

And whether you’re skeptical or celebratory of AI, you’ve likely seen its explosive progression. New benchmark-breaking computational models are released every week. We now expect a fairly high degree of accuracy — the more important differentiators are how well a model fits within a product’s infrastructure and what developers do with its predictions.

This week, we’re introducing background removal for Images. This feature runs a dichotomous image segmentation model on Workers AI to isolate subjects in an image from their backgrounds. We took a controlled, deliberate approach to testing models for efficiency and accuracy.

Here’s how we evaluated various image segmentation models to develop background removal.

A primer on image segmentation

In computer vision, image segmentation is the process of splitting an image into meaningful parts.

Segmentation models produce a mask that assigns each pixel to a specific category. This differs from detection models, which don’t classify every pixel but instead mark regions of interest. A face detection model, such as the one that informs face cropping, draws bounding boxes based on where it thinks there are faces. (If you’re curious, our post on face cropping discusses how we use these bounding boxes to perform crop and zoom operations.)

Salient object detection is a type of segmentation that highlights the parts of an image that most stand out. Most salient detection models create a binary mask that categorizes the most prominent (or salient) pixels as the “foreground” and all other pixels as the “background”. In contrast, a multi-class mask considers the broader context and labels each pixel as one of several possible classes, like “dog” or “chair”. These multi-class masks are the basis of content analysis models, which distinguish which pixels belong to specific objects or types of objects.

In this photograph of my dog, a detection model predicts that a bounding box contains a dog; a segmentation model predicts that some pixels belong to a dog, while all other pixels don’t.

For our use case, we needed a model that could produce a soft saliency mask, which predicts how strongly each pixel belongs to either the foreground (objects of interest) or the background. That is, each pixel is assigned a value on a scale of 0–255, where 0 is completely transparent and 255 is fully opaque. Most background pixels are labeled at (or near) 0; foreground pixels may vary in opacity, depending on its degree of saliency.

In principle, a background removal feature must be able to accurately predict saliency across a broad range of contexts. For example, e-commerce and retail vendors want to display all products on a uniform, white background; in creative and image editing applications, developers want to enable users to create stickers and cutouts from uploaded content, including images of people or avatars.

In our research, we focused primarily on the following four image segmentation models:

U2-Net (U Square Net): Trained on the largest saliency dataset (DUST-TR) of 10,553 images, which were then horizontally flipped to reach a total of 21,106 training images.
IS-Net (Intermediate Supervision Network): A novel, two-step approach from the same authors of U2-Net; this model produces cleaner boundaries for images with noisy, cluttered backgrounds.
BiRefNet (Bilateral Reference Network): Specifically designed to segment complex and high-resolution images with accuracy by checking that the small details match the big picture.
SAM (Segment Anything Model): Developed by Meta to allow segmentation by providing prompts and input points.

Different scales of information allow computational models to build a holistic view of an image. Global context considers the overall shape of objects and how areas of pixels relate to the entire image, while local context traces fine details like edges, corners, and textures. If local context focuses on the trees and their leaves, then global context represents the entire forest.

U2-Net extracts information using a multi-scale approach, where it analyzes an image at different zoom levels, then combines its predictions in a single step. The model analyzes global and local context at the same time, so it works well on images with multiple objects of varying sizes.

IS-Net introduces a new, two-step strategy called intermediate supervision. First, the model separates the foreground from the background, identifying potential areas that likely belong to objects of interest — all other pixels are labeled as the background. Second, it refines the boundaries of the highlighted objects to produce a final pixel-level mask.

The initial suppression of the background results in cleaner, more precise edges, as the segmentation focuses only on the highlighted objects of interest and is less likely to mistakenly include background pixels in the final mask. This model especially excels when dealing with complex images with cluttered backgrounds.

Both models output their predictions in a single direction for scale. U2-Net interprets the global and local context in one pass, while Is-Net begins with the global context, then focuses on the local context.

In contrast, BiRefNet refines its predictions over multiple passes, moving in both contextual directions. Like Is-Net, it initially creates a map that roughly highlights the salient object, then traces the finer details. However, BiRefNet moves from global to local context, then from local context back to global. In other words, after refining the edges of the object, it feeds the output back to the large-scale view. This way, the model can check that the small-scale details align with the broader image structure, providing higher accuracy on high-resolution images.

U2-Net, IS-Net, and BiRefNet are exclusively saliency detection models, producing masks that distinguish foreground pixels from background pixels. However, SAM was designed to be more extensible and general; its primary goal is to segment any object based on specified inputs, not only salient objects. This means that the model can also be used to create multi-class masks that label various objects within an image, even if they aren’t the primary focus of an image.

How we measure segmentation accuracy

In most saliency datasets, the actual location of the object is known as the ground-truth area. These regions are typically defined by human annotators, who manually trace objects of interest in each image. This provides a reliable reference to evaluate model predictions.

Photograph by Allen Fang

Each model outputs a predicted area (where it thinks the foreground pixels are), which can be compared against the ground-truth area (where the foreground pixels actually are).

Models are evaluated for segmentation accuracy based on common metrics like Intersection over Union, Dice coefficient, and pixel accuracy. Each score takes a slightly different approach to quantify the alignment between the predicted and ground-truth areas (“P” and “G”, respectively, in the formulas below).

Intersection over Union

Intersection over Union (IoU), also called the Jaccard index, measures how well the predicted area matches the true object. That is, it counts the number of foreground pixels that are shared in both the predicted and ground-truth masks. Mathematically, IoU is written as:

Jaccard formula

The formula divides the intersection (P∩G), or the pixels where the predicted and ground-truth areas overlap, by the union (P∪G), or the total area of pixels that belong to either area, counting the overlapping pixels only once.

IoU produces a score between 0 and 1. A higher value indicates a closer overlap between the predicted and ground-truth areas. A perfect match, although rare, would score 1, while a smaller overlapping area brings the score closer to 0.

Dice coefficient

The Dice coefficient, also called the Sørensen–Dice index, similarly compares how well the model’s prediction matches reality, but is much more forgiving than the IoU score. It gives more weight to the shared pixels between the predicted and actual foreground, even if the areas differ in size. Mathematically, the Dice coefficient is written as:

Sørensen–Dice formula

The formula divides twice the intersection (P∩G) by the sum of pixels in both predicted and ground-truth areas (P+G), counting any overlapping pixels twice.

Like IoU, the Dice coefficient also produces a value between 0 and 1, indicating a more accurate match as it approaches 1.

Pixel accuracy

Pixel accuracy measures the percentage of pixels that were correctly labeled as either the foreground or the background. Mathematically, pixel accuracy is written as:

Pixel accuracy formula

The formula divides the number of correctly predicted pixels by the total number of pixels in the image.

The total area of correctly predicted pixels is the sum of foreground and background pixels that accurately match the ground-truth areas.

The correctly predicted foreground is the intersection of the predicted and ground-truth areas (P∩G). The inverse of the predicted area (P’, or 1–P) represents the pixels that the model identifies as the background; the inverse of the ground-truth area (G’, or 1–G) represents the actual boundaries of the background. When these two inverted areas overlap (P’∩G’, or (1–P)∩(1–G)), this intersection is the correctly predicted background.

Interpreting the metrics

Of the three metrics, IoU is the most conservative measure of segmentation accuracy. Small mistakes, such as including extra background pixels in the predicted foreground, reduce the score noticeably. This metric is most valuable for applications that require precise boundaries, such as autonomous driving systems.

Meanwhile, the Dice coefficient rewards the overlapping pixels more heavily, and subsequently tends to be higher than the IoU score for the same prediction. In model evaluations, this metric is favored over IoU when it’s more important to capture the object than to penalize mistakes. For example, in medical imaging, the risk of missing a true positive substantially outweighs the inconvenience of flagging a false positive.

In the context of background removal, we biased toward the IoU score and Dice coefficient over pixel accuracy. Pixel accuracy can be misleading, especially when processing an image where background pixels comprise the majority of pixels.

For example, consider an image with 900 background pixels and 100 foreground pixels. A model that correctly predicts only 5 foreground pixels — 5% of all foreground pixels — will score deceptively high in pixel accuracy. Intuitively, we’d likely say that this model performed poorly. However, assuming all 900 background pixels were correctly predicted, the model maintains 90.5% pixel accuracy, despite missing the subject almost entirely.

Pixels, predictions, and patterns

To determine the most suitable model for the Images API, we performed a series of tests using the open-source rembg library, which combines all relevant models in a single interface.

Each model was tasked with outputting a prediction mask to label foreground versus background pixels. We pulled images from two saliency datasets: Humans contains over 7,000 images of people with varying skin tones, clothing, and hairstyles, while DIS5K (version 1.5) spans a vast range of objects and scenes. If a model contained variants that were pre-trained on specific types of segmentation (e.g. clothes, humans), then we repeated the tests for the generalized model and each variant.

Our experiments were executed on a GPU with 23 GB VRAM to mirror realistic hardware constraints, similar to the environment where we already run a face detection model. We also replicated the same tests on a larger GPU instance with 94 GB VRAM; this served as an upper-bound reference point to benchmark potential speed gains if additional compute were available. Cloudflare typically reserves larger GPUs for more compute-intensive AI workloads — we viewed these tests more as an exploration for comparison than as a production scenario.

During our analysis, we started to see key trends emerge:

On the smaller GPU, inference times were generally faster for lightweight models like U2-Net (176 MB) and Is-Net (179 MB). The average speed across both datasets were 307 milliseconds for U2-Net and 351 milliseconds for Is-Net. On the opposite end, BiRefNet (973 MB) had noticeably slower output times, averaging 821 milliseconds across its two generalized variants.

BiRefNet ran 2.4 times faster on the larger GPU, reducing its average inference time to 351 milliseconds — comparable to the other models, despite its larger size. In contrast, the lighter models did not show any notable speed gain with additional compute, suggesting that scaling hardware configurations primarily benefits heavier models. In Appendix 1 (“Inference Time in Milliseconds”), we compare speed across models and GPU instances.

We also observed distinct patterns when comparing model performance across the two saliency datasets. Most notably, all models ran faster on the Humans dataset, where images of people tend to be single-subject and relatively uniform. The DIS5K dataset, in contrast, includes images with higher complexity — that is, images with more objects, cluttered backgrounds, or multiple objects of varying scales.

Slower predictions suggest a relationship between visual complexity and the computation needed to identify the important parts of an image. In other words, datasets with simpler, well-separated objects can be analyzed more quickly, while complex scenes require more computation to generate accurate masks.

Similarly, complexity challenges accuracy as much as it does efficiency. In our tests, all models demonstrated higher segmentation accuracy with the Humans dataset. In Appendix 2 (“Measures of Model Accuracy”), we present our results for segmentation accuracy across both datasets.

Specialized variants scored slightly higher in accuracy compared to their generalized counterparts. But in broad, practical applications, selecting a specialized model for every input isn’t realistic, at least for our initial beta version. We favored general-purpose models that can produce accurate predictions without prior classification. For this reason, we excluded SAM — while powerful in its intended use cases, SAM is designed to work with additional inputs. On unprompted segmentation tasks, it produced lower accuracy scores (and much higher inference times) amongst the models we tested.

All BiRefNet variants showed greater accuracy compared to other models. The generalized variants (-general and -dis) were just as accurate as its more specialized variants like -portrait. The birefnet-general variant, in particular, achieved a high IoU score of 0.87 and Dice coefficient of 0.92, averaged across both datasets.

In contrast, the generalized U2-Net model showed high accuracy on the Humans dataset, reaching an IoU score of 0.89 and a Dice coefficient of 0.94, but received a low IoU score of 0.39 and Dice coefficient of 0.52 on the DIS5K dataset. The isnet-general-use model performed substantially better, obtaining an average IoU score of 0.82 and Dice coefficient of 0.89 across both datasets.

We observed whether models could interpret both the global and local context of an image. In some scenarios, the U2-Net and Is-Net models captured the overall gist of an image, but couldn’t accurately trace fine edges. We designed one test around measuring how well each model could isolate bicycle wheels; for variety, we included images across both interior and exterior backgrounds. Lower scoring models, while correctly labeling the area surrounding the wheel, struggled with the pixels between the thin spokes and produced prediction masks that included these background pixels.

Photograph by Yomex Owo on Unsplash

In other scenarios, the models showed the opposite limitation: they produced masks with clean edges, but failed to identify the focus of the image. We ran another test using a photograph of a gray T-shirt against black gym flooring. Both generalized U2-Net and Is-Net models labeled only the logo as the salient object, creating a mask that omitted the rest of the shirt entirely.

Meanwhile, the BiRefNet model achieved high accuracy across both types of tests. Its architecture passes information bidirectionally, allowing details at the pixel level to be informed by the larger scene (and vice versa). In practice, this means that BiRefNet interprets how fine-grained edges fit into the broader object. For our beta version, we opted to use the BiRefNet model to drive decisions for background removal.

Unlike lower scoring models, the BiRefNet model understood that the entire shirt is the true subject of the image.

Applying background removal with the Images API

The Images API now supports automatic background removal for hosted and remote images. This feature is available in open beta to all Cloudflare users on Free and Paid plans.

Use the segment parameter when optimizing an image through a specially-formatted Images URL or a worker, and Cloudflare will isolate the subject of your image and convert the background into transparent pixels. This can be combined with other optimization operations, as shown in the transformation URL below:

This request will:

Crop the image toward the detected face.
Isolate the subject in the image, replacing the background with transparent pixels.
Fill the transparent pixels with a solid white color (#FFFFFF).

You can also bind the Images API to your worker to build programmatic workflows that give more fine-grained control over how images will be optimized. To demonstrate how this works, I made a simple image editing app for creating cutouts and overlays, built entirely on Images and Workers. This can be used to create images like the one below. Here, we apply background removal to isolate the dog and ice cream cone, then overlay them on a landscape image.

Photographs by Guy Hurst (landscape), Oskar Gackowski (ice cream), and me (dog)

Here is a snippet that you can use to overlay images in a worker:

Background removal is another step in our ongoing effort to enable developers to build interactive and imaginative products. These features are an iterative process, and we’ll continue to refine our approach even further. We’re looking forward to sharing our progress with you.

Read more about applying background removal in our documentation.

Appendix 1: Inference Time in Milliseconds

23 GB VRAM GPU

94 GB VRAM GPU

Appendix 2: Measures of Model Accuracy

The next step for content creators in working with AI bots: Introducing AI Crawl Control

Will Allen — Thu, 28 Aug 2025 14:00:00 GMT

Empowering content creators in the age of AI with smarter crawling controls and direct communication channels

Imagine you run a regional news site. Last month an AI bot scraped 3 years of archives in minutes — with no payment and little to no referral traffic. As a small company, you may struggle to get the AI company's attention for a licensing deal. Do you block all crawler traffic, or do you let them in and settle for the few referrals they send?

It’s picking between two bad options.

Cloudflare wants to help break that stalemate. On July 1st of this year, we declared Content Independence Day based on a simple premise: creators deserve control of how their content is accessed and used. Today, we're taking the next step in that journey by releasing AI Crawl Control to general availability — giving content creators and AI crawlers an important new way to communicate.

AI Crawl Control goes GA

Today, we're rebranding our AI Audit tool as AI Crawl Control and moving it from beta to general availability. This reflects the tool's evolution from simple monitoring to detailed insights and control over how AI systems can access your content.

The market response has been overwhelming: content creators across industries needed real agency, not just visibility. AI Crawl Control delivers that control.

Using HTTP 402 to help publishers license content to AI crawlers

Many content creators have faced a binary choice: either they block all AI crawlers and miss potential licensing opportunities and referral traffic; or allow them through without any compensation. Many content creators had no practical way to say "we're open for business, but let's talk terms first."

Our customers are telling us:

We want to license our content, but crawlers don't know how to reach us.
Blanket blocking feels like we're closing doors on potential revenue and referral traffic.
We need a way to communicate our terms before crawling begins.

To address these needs, we are making it easier than ever to send customizable 402 HTTP status codes.

Our private beta launch of Pay Per Crawl put the HTTP 402 (“Payment Required”) response codes to use, working in tandem with Web Bot Auth to enable direct payments between agents and content creators. Today, we’re making customizable 402 response codes available to every paid Cloudflare customer — not just pay per crawl users.

Here's how it works: in AI Crawl Control, paying Cloudflare customers will be able to select individual bots to block with a configurable message parameter and send 402 payment required responses. Think: "To access this content, email partnerships@yoursite.com or call 1-800-LICENSE" or "Premium content available via API at api.yoursite.com/pricing."

On an average day, Cloudflare customers are already sending over one billion 402 response codes. This shows a deep desire to move beyond blocking to open communication channels and new monetization models. With the 402 HTTP status code, content creators can tell crawlers exactly how to properly license their content, creating a direct path from crawling to a commercial agreement. We are excited to make this easier than ever in the AI Crawl Control dashboard.

How to customize your 402 status code with AI Crawl Control:

For Paid Plan Users:

When you block individual crawlers from the AI Crawl Control dashboard, you can now choose to send 402 Payment Required status codes and customize your message. For example: To access this content, email partnerships@yoursite.com or call 1-800-LICENSE.

The response will look like this:

The message can be configured from Settings in the AI Crawl Control Dashboard:

Beyond just blocking AI bots

This is just the beginning. We're planning to add additional parameters that will let crawlers understand the content's value, freshness, and licensing terms directly in the 402 response. Imagine crawlers receiving structured data about content quality and update frequency, for example, in addition to contact information.

Meanwhile, pay per crawl continues advancing through beta, giving content creators the infrastructure to automatically monetize crawler access with transparent, usage-based pricing.

What excites us most is the market shift we're seeing. We're moving to a world where content creators have clear monetization paths to become active participants in the development of rich AI experiences.

The 402 response is a bridge between two industries that want to work together: content creators whose work fuels AI development, and AI companies who need high-quality data. Cloudflare’s AI Crawl Control creates the infrastructure for these partnerships to flourish.

The age of agents: cryptographically recognizing agent traffic

Jin-Hee Lee — Thu, 28 Aug 2025 14:00:00 GMT

On the surface, the goal of handling bot traffic is clear: keep malicious bots away, while letting through the helpful ones. Some bots are evidently malicious — such as mass price scrapers or those testing stolen credit cards. Others are helpful, like the bots that index your website. Cloudflare has segmented this second category of helpful bot traffic through our verified bots program, vetting and validating bots that are transparent about who they are and what they do.

Today, the rise of agents has transformed how we interact with the Internet, often blurring the distinctions between benign and malicious bot actors. Bots are no longer directed only by the bot owners, but also by individual end users to act on their behalf. These bots directed by end users are often working in ways that website owners want to allow, such as planning a trip, ordering food, or making a purchase.

Our customers have asked us for easier, more granular ways to ensure specific bots, crawlers, and agents can reach their websites, while continuing to block bad actors. That’s why we’re excited to introduce signed agents, an extension of our verified bots program that gives a new bot classification in our security rules and in Radar. Cloudflare has long recognized agents — but we’re now endowing them with their own classification to make it even easier for our customers to set the traffic lanes they want for their website.

The age of agents

Cloudflare has continuously expanded our verified bot categorization to include different functions as the market has evolved. For instance, we first announced our grouping of AI crawler traffic as an official bot category in 2023. And in 2024, when OpenAI announced a new AI search prototype and introduced three different bots with distinct purposes, we added three new categories to account for this innovation: AI Search, AI Assistant, and Archiver.

But the bot landscape is constantly evolving. Let's unpack a common type of verified AI bot — an AI crawler such as GPTBot. Even though the bot performs an array of tasks, the bot’s ultimate purpose is a singular, repetitive task on behalf of the operator of that bot: fetch and index information. Its intelligence is applied to performing that singular job on behalf of that bot owner.

Agents, though, are different. Think about an AI agent tasked by a user to "Book the best deal for a round-trip flight to New York City next month." These agents sometimes use remote browsing products like Cloudflare's Browser Rendering and similar products from companies like Browserbase and Anchor Browser. And here is the key distinction: this particular type of bot isn’t operating on behalf of a single company, like OpenAI in the prior example, but rather the end users themselves.

Introducing signed agents

In May, we announced Web Bot Auth, a new method of using cryptography to verify bot and agent traffic. HTTP message signatures allow bots to authenticate themselves and allow customer origins to identify them. This is one of the authentication methods we use today for our verified bots program.

What, exactly, is a signed agent? First, they are agents that are generally directed by an end user instead of a single company or entity. Second, the infrastructure or remote browsing platform the agents use is signing their HTTP requests via Web Both Auth, with Cloudflare validating these message signatures. And last, they comply with our signed agent policy.

The signed agents classification improves on our existing frameworks in a couple of ways:

Increased precision and visibility: we’ve updated the Cloudflare bots and agents directory to include signed agents in addition to verified bots. This allows us to verify the cryptographic signatures of a much wider set of automated traffic, and our customers to granularly apply their security preferences more easily. Bot operators can now submit signed agent applications from the Cloudflare dashboard, allowing bot owners to specify to us how they think we should segment their automated traffic.
Easier controls from security rules: similar to how they can take action on verified bots as a group, our Enterprise customers will be able to take action on signed agents as a group when configuring their security rules. This new field will be available in the Cloudflare dashboard under security rules soon.

To apply to have an agent added to Cloudflare’s directory of bots and agents, customers should complete the Bot Submission Form in the Cloudflare dashboard. Here, they can specify whether the submission should be considered for the signed agents list or the verified bots list. All signed agents will be recognized by their cryptographic signatures through Web Bot Auth validation.

The Bot Submission Form, available in the Cloudflare dashboard for bot owners to submit both verified bot and signed agent applications.

We want to be clear: our verified bots program isn’t going anywhere. In fact, well-behaved and transparent applications that make use of signed agents can further qualify to be a verified bot, if their specific service adheres to our policy. For instance, Cloudflare Radar's URL Scanner, which relies on Browser Rendering as a service to scan URLs, is a verified bot. While Browser Rendering itself does not qualify to be a verified bot, URL Scanner does, since the bot owner (in this case, Cloudflare Radar) directs the traffic sent by the bot and always identifies itself with a unique Web Bot Auth signature — distinct from Browser Rendering’s signature.

From an agent’s perspective…

Since the launch of Web Bot Auth, our own Browser Rendering product has been sending signed Web Bot Auth HTTP headers, and is always given a bot score of 1 for our Bot Management customers. As of today, Browser Rendering will now show up in this new signed agent category.

We’re also excited to announce the first cohort of agents that we’re partnering with and will be classifying as signed agents: ChatGPT agent, Goose from Block, Browserbase, and Anchor Browser. They are perfect examples of this new classification because their remote browsers are used by their end customers, not necessarily the companies themselves. We’re thrilled to partner with these teams to take this critical step for the AI ecosystem:

“When we built Goose as an open source tool, we designed it to run locally with an extensible architecture that lets developers automate complex workflows. As Goose has evolved to interact with external services and third-party sites on users' behalf, Web Bot Auth enables those sites to trust Goose while preserving what makes it unique. This authentication breakthrough unlocks entirely new possibilities for autonomous agents." – Douwe Osinga, Staff Software Engineer, Block

"At Browserbase, we provide web browsing capabilities for some of the largest AI applications. We're excited to partner with Cloudflare to support the adoption of Web Bot Auth, a critical layer of identity for agents. For AI to thrive, agents need reliable, responsible web access." – Paul Klein, CEO, Browserbase

“Anchor Browser has partnered with Cloudflare to let developers ship verified browser agents. This way trustworthy bots get reliable access while sites stay protected.” – Idan Raman, CEO, Anchor Browser

Updated visibility on Radar

We want everyone to be in the know about our bot classifications. Cloudflare began publishing verified bots on our Radar page back in 2022, meaning anyone on the Internet — Cloudflare customer or not — can see all of our verified bots on Radar. We dynamically update the list of bots, but show more than just a list: we announced on Content Independence Day that every verified bot would get its own page in our public-facing directory on Radar, which includes the traffic patterns that we see for each bot.

Our directory has been updated to include both signed agents and verified bots — we share exactly how Cloudflare classifies the bots that it recognizes, plus we surface all of the traffic that Cloudflare observes from these many recognized agents and bots. Through this updated directory, we’re not only giving better visibility to our customers, but also striving to set a higher standard for transparency of bot traffic on the Internet.

Cloudflare Radar’s Bots Directory, which lists verified bots and signed agents. This view is filtered to view only agent entries.

Cloudflare Radar’s signed agent page for ChatGPT agent, which includes its traffic patterns for the last 7 days, from August 21, 2025 to August 27, 2025.

What’s now, what’s next

As of today, the Cloudflare bot directory supports both bots and agents in a more clear-cut way, and customers or agent creators can submit agents to be signed and recognized through their account dashboard. In addition, anyone can see our signed agents and their traffic patterns on Radar. Soon, customers will be able to take action on signed agents as a group within their firewall rules, the same way you can take action on our verified bots.

Agents are changing the way that humans interact with the Internet. Websites need to know what tools are interacting with them, and for the builders of those tools to be able to easily scale. Message signatures help achieve both of these goals, but this is only step one. Cloudflare will continue to make it easier for agents and websites to interact (or not!) at scale, in a seamless way.

Make Your Website Conversational for People and Agents with NLWeb and AutoRAG

Catarina Pires Mota — Thu, 28 Aug 2025 14:00:00 GMT

Publishers and content creators have historically relied on traditional keyword-based search to help users navigate their website’s content. However, traditional search is built on outdated assumptions: users type in keywords to indicate intent, and the site returns a list of links for the most relevant results. It’s up to the visitor to click around, skim pages, and piece together the answer they’re looking for.

AI has reset expectations and that paradigm is breaking: how we search for information has fundamentally changed.

Your New Type of Visitors

Users no longer want to search websites the old way. They’re used to interacting with AI systems like Copilot, Claude, and ChatGPT, where they can simply ask a question and get an answer. We’ve moved from search engines to answer engines.

At the same time, websites now have a new class of visitors, AI agents. Agents face the same pain with keyword search: they have to issue keyword queries, click through links, and scrape pages to piece together answers. But they also need more: a structured way to ask questions and get reliable answers across websites. This means that websites need a way to give the agents they trust controlled access, so that information is retrieved accurately.

Website owners need a way to participate in this shift.

A New Search Model for the Agentic Web

If AI has reset expectations, what comes next? To meet both people and agents where they are, websites need more than incremental upgrades to keyword search. They need a model that makes conversational access to content a first-class part of the web itself.

That’s what we want to deliver: combining an open standard (NLWeb) with the infrastructure (AutoRAG) to make it simple for any website to become AI-ready.

NLWeb is an open project developed by Microsoft that defines a standard protocol for natural-language queries on websites. Each NLWeb instance also operates as a Model Context Protocol (MCP) server. Cloudflare is building to this spec and actively working with Microsoft to extend the standard with the goal to let every site function like an AI app, so users and agents alike can query its contents naturally.

AutoRAG, Cloudflare’s managed retrieval engine, can automatically crawl your website, store the content in R2, and embed it into a managed vector database. AutoRAG keeps the index fresh with continuous re-crawling and re-indexing. Model inference and embedding can be served through Workers AI. Each AutoRAG is paired with an AI Gateway that can provide observability and insights into your AI model usage. This gives you a complete, managed pipeline for conversational search without the burden of managing custom infrastructure.

“Together, NLWeb and AutoRAG let publishers go beyond search boxes, making conversational interfaces for websites simple to create and deploy. This integration will enable every website to easily become AI-ready for both people and trusted agents.” – R.V. Guha, creator of NLWeb, CVP and Technical Fellow at Microsoft.

We are optimistic this will open up new monetization models for publishers:

"The challenges publishers have faced are well known, as are the risks of AI accelerating the collapse of already challenged business models. However, with NLWeb and AutoRAG, there is an opportunity to reset the nature of relationships with audiences for the better. More direct engagement on Publisher Owned and Operated (O&O) environments, where audiences value the brand and voice of the Publisher, means new potential for monetization. This would be the reset the entire industry needs." – Joe Marchese, General & Build Partner at Human Ventures.

One-Click to Make Your Site Conversational

By combining NLWeb's standard with Cloudflare’s AutoRAG infrastructure, we’re making it possible to easily bring conversational search to any website.

Simply select your domain in AutoRAG, and it will crawl and index your site for semantic querying. It then deploys a Cloudflare Worker, which acts as the access layer. This Worker implements the NLWeb standard and UI defined by the NLWeb project and exposes your indexed content to both people and AI agents.

The Worker includes:

`/ask` endpoint: The defined standard for how conversational web searches should be served. Powers the conversational UI at the root `/` as well as the embeddable preview at `/snippet.html`. It supports chat history so queries can build on one another within the same session, and includes automatic query decontextualization to improve retrieval quality.
`/mcp` endpoint: Implements an MCP server that trusted AI agents can connect to for structured access.

With this setup, your site content is immediately available in two ways for you to experiment: through a conversational UI that you can serve to your visitors, and through a structured MCP interface that lets trusted agents query your site reliably on your terms.

Additionally, if you prefer to deploy and host your own version of the NLWeb project, there’s also the option to use AutoRAG as the retrieval engine powering the NLWeb instance.

How Your Site Becomes Conversational

From your perspective, making your site conversational is just a single click. Behind the scenes, AutoRAG spins up a full retrieval pipeline to make that possible:

Crawling and ingestion: AutoRAG explores your site like a search engine, following `sitemap.xml` and `robots.txt` files to understand what pages are available and allowed for crawling. From there, it follows your sitemap to discover pages within your domain (up to 100k pages). Browser Rendering is used to load each page so that it can capture dynamic, JavaScript content. Crawled pages are downloaded into an R2 bucket in your account before being ingested.
Continuous Indexing: Once ingested, the content is parsed and embedded into Vectorize, making it queryable beyond keyword matching through semantic search. AutoRAG automatically re-crawls and re-indexes to keep your knowledge base aligned with your latest content.
Access & Observability: A Cloudflare Worker is deployed in your account to serve as the access layer that implements the NLWeb protocol (you can also find the deployable Worker in the Workers templates repository). Workers AI is used to seamlessly power the summarization and decontextualized query capabilities to improve responses. Soon, with the AI Gateway and Secret Store BYO keys, you’ll be able to connect models from any provider and select them directly in the AutoRAG dashboard.

Road to Making Websites a First-Class Data Source

Until now, AutoRAG only supported R2 as a data source. That worked well for structured files, but we needed to make a website itself a first-class data source to be indexed and searchable. Making that possible meant building website crawling into AutoRAG and strengthening the system to handle large, dynamic sources like websites.

Before implementing our web crawler, we needed to improve the reliability of data syncs. Prior users of AutoRAG lacked visibility into when indexing syncs ran and whether they were successful. To fix this, we introduced a Job module to track all syncs, store history, and provide logs. This required two new Durable Objects to be added into AutoRAG’s architecture:

JobManager runs a complete sync, and its duties include queuing files, embedding content, and keeping the Vectorize database up to date. To ensure data consistency, only one JobManager can run per RAG at a time, enforced by the RagManager (a Durable Object in our existing architecture), which cancels any running jobs before starting new ones which can be triggered either manually or by a scheduled sync.
FileManager solved scalability issues we hit when Workers ran out of memory during parallel processing. Originally, a single Durable Object was responsible for handling multiple files, but with a 128MB memory limit it quickly became a bottleneck. The solution was to break the work apart: JobManager now distributes files across many FileManagers, each responsible for a single file. By processing 20 files in parallel through 20 different FileManagers, we expanded effective memory capacity from 128MB to roughly 2.5GB per batch.

With these improvements, we were ready to build the website parser. By reusing our existing R2-based queuing logic, we added crawling with minimal disruption:

A JobManager designated for a website crawl begins by reading the sitemaps associated with the RAG configuration.
Instead of listing objects from an R2 bucket, it queues each website link into our existing R2-based queue, using the full URL as the R2 object key.
From here, the process is nearly identical to our file-based sync. A FileManager picks up the job and checks if the RAG is configured for website parsing.
If it is, the FileManager crawls the link and places the page's HTML contents into the user's R2 bucket, again using the URL as the object key.

After these steps, we index the data and serve it at query time. This approach maximized code reuse, and any improvements to our HTML-to-Markdown conversion now benefit both file and website-based RAGs automatically.

Get Started Today

Getting your website ready for conversational search through NLWeb and AutoRAG is simple. Here’s how:

In the Cloudflare Dashboard, navigate to Compute & AI > AutoRAG.
Select Create in AutoRAG, then choose the NLWeb Website quick deploy option.
Select the domain from your Cloudflare account that you want indexed.
Click Start indexing.

That’s it! You can now try out your NLWeb search experience via the provided link, and test out how it will look on your site by using the embeddable snippet.

We’d love to hear your feedback as you experiment with this new capability and share your thoughts with us at nlweb@cloudflare.com.

AI Gateway now gives you access to your favorite AI models, dynamic routing and more — through just one endpoint

Michelle Chen — Wed, 27 Aug 2025 14:05:00 GMT

Getting the observability you need is challenging enough when the code is deterministic, but AI presents a new challenge — a core part of your user’s experience now relies on a non-deterministic engine that provides unpredictable outputs. On top of that, there are many factors that can influence the results: the model, the system prompt. And on top of that, you still have to worry about performance, reliability, and costs.

Solving performance, reliability and observability challenges is exactly what Cloudflare was built for, and two years ago, with the introduction of AI Gateway, we wanted to extend to our users the same levels of control in the age of AI.

Today, we’re excited to announce several features to make building AI applications easier and more manageable: unified billing, secure key storage, dynamic routing, security controls with Data Loss Prevention (DLP). This means that AI Gateway becomes your go-to place to control costs and API keys, route between different models and providers, and manage your AI traffic. Check out our new AI Gateway landing page for more information at a glance.

Connect to all your favorite AI providers

When using an AI provider, you typically have to sign up for an account, get an API key, manage rate limits, top up credits — all within an individual provider’s dashboard. Multiply that for each of the different providers you might use, and you’ll soon be left with an administrative headache of bills and keys to manage.

With AI Gateway, you can now connect to major AI providers directly through Cloudflare and manage everything through one single plane. We’re excited to partner with Anthropic, Google, Groq, OpenAI, and xAI to provide Cloudflare users with access to their models directly through Cloudflare. With this, you’ll have access to over 350+ models across 6 different providers.

You can now get billed for usage across different providers directly through your Cloudflare account. This feature is available for Workers Paid users, where you’ll be able to add credits to your Cloudflare account and use them for AI inference to all the supported providers. You’ll be able to see real-time usage statistics and manage your credits through the AI Gateway dashboard. Your AI Gateway inference usage will also be documented in your monthly Cloudflare invoice. No more signing up and paying for each individual model provider account.

Usage rates are based on then-current list prices from model providers — all you will need to cover is the transaction fee as you load credits into your account. Since this is one of the first times we’re launching a credits based billing system at Cloudflare, we’re releasing this feature in Closed Beta — sign up for access here.

BYO Provider Keys, now with Cloudflare Secrets Store

Although we’ve introduced unified billing, some users might still want to manage their own accounts and keys with providers. We’re happy to say that AI Gateway will continue supporting our BYO Key feature, improving the experience of BYO Provider Keys by integrating with Cloudflare’s secrets management product Secrets Store. Now, you can seamlessly and securely store your keys in one centralized location and distribute them without relying on plain text. Secrets Store uses a two level key hierarchy with AES encryption to ensure that your secret stays safe, while maintaining low latency through our global configuration system, Quicksilver.

You can now save and manage keys directly through your AI Gateway dashboard or through the Secrets Store dashboard, API, or Wrangler by using the new AI Gateway scope. Scoping your secrets to AI Gateway ensures that only this specific service will be able to access your keys, meaning that secret could not be used in a Workers binding or anywhere else on Cloudflare’s platform.

You can pass your AI provider keys without including them directly in the request header. Instead of including the actual value, you can deploy the secret only using the Secrets Store reference:

Or, using Javascript:

By using Secrets Store to deploy your secrets, you no longer need to give every developer access to every key — instead, you can rely on Secrets Store’s role-based access control to further lock down these sensitive values. For example, you might want your security administrators to have Secrets Store admin permissions so that they can create, update, and delete the keys when necessary. With Cloudflare audit logging, all such actions will be logged so you know exactly who did what and when. Your developers, on the other hand, might only need Deploy permissions, so they can reference the values in code, whether that is a Worker or AI Gateway or both. This way, you reduce the risk of the secret getting leaked accidentally or intentionally by a malicious actor. This also allows you to update your provider keys in one place and automatically propagate that value to any AI Gateway using those values, simplifying the management.

Unified Request/Response

We made it super easy for people to try out different AI models – but the developer experience should match that as well. We found that each provider can have slight differences in how they expect people to send their requests, so we’re excited to launch an automatic translation layer between providers. When you send a request through AI Gateway, it just works – no matter what provider or model you use.

Dynamic Routes

When we first launched Cloudflare Workers, it was an easy way for people to intercept HTTP requests and customize actions based on different attributes. We think the same customization is necessary for AI traffic, so we’re launching Dynamic Routes in AI Gateway.

Dynamic Routes allows you to define certain actions based on different request attributes. If you have free users, maybe you want to ratelimit them to a certain request per second (RPS) or a certain dollar spend. Or maybe you want to conduct an A/B test and split 50% of traffic to Model A and 50% of traffic to Model B. You could also want to chain several models in a row, like adding custom guardrails or enhancing a prompt before it goes to another model. All of this is possible with Dynamic Routes!

We’ve built a slick UI in the AI Gateway dashboard where you can define simple if/else interactions based on request attributes or a percentage split. Once you define a route, you’ll use the route as the “model” name in your input JSON and we will manage the traffic as you defined.

Built-in security with Firewall in AI Gateway

Earlier this year we announced Guardrails in AI Gateway and now we’re expanding our security capabilities and include Data Loss Prevention (DLP) scanning in AI Gateway’s Firewall. With this, you can select the DLP profiles you are interested in blocking or flagging, and we will scan requests for the matching content. DLP profiles include general categories like “Financial Information”, “Social Security, Insurance, Tax and Identifier Numbers” that everyone has access to with a free Zero Trust account. If you would like to create a custom DLP profile to safeguard specific text, the upgraded Zero Trust plan allows you to create custom DLP profiles to catch sensitive data that is unique to your business.

False positives and grey area situations happen, we give admins controls on whether to fully block or just alert on DLP matches. This allows administrators to monitor for potential issues without creating roadblocks for their users.. Each log on AI gateway now includes details about the DLP profiles matched on your request, and the action that was taken:

More coming soon…

If you think about the history of Cloudflare, you’ll notice similar patterns that we’re following for the new vision for AI Gateway. We want developers of AI applications to be able to have simple interconnectivity, observability, security, customizable actions, and more — something that Cloudflare has a proven track record of accomplishing for global Internet traffic. We see AI Gateway as a natural extension of Cloudflare’s mission, and we’re excited to make it come to life.

We’ve got more launches up our sleeves, but we couldn’t wait to get these first handful of features into your hands. Read up about it in our developer docs, give it a try, and let us know what you think. If you want to explore larger deployments, reach out for a consultation with Cloudflare experts.

How Cloudflare runs more AI models on fewer GPUs: A technical deep-dive

Sven Sauleau — Wed, 27 Aug 2025 14:00:00 GMT

As the demand for AI products grows, developers are creating and tuning a wider variety of models. While adding new models to our growing catalog on Workers AI, we noticed that not all of them are used equally – leaving infrequently used models occupying valuable GPU space. Efficiency is a core value at Cloudflare, and with GPUs being the scarce commodity they are, we realized that we needed to build something to fully maximize our GPU usage.

Omni is an internal platform we’ve built for running and managing AI models on Cloudflare’s edge nodes. It does so by spawning and managing multiple models on a single machine and GPU using lightweight isolation. Omni makes it easy and efficient to run many small and/or low-volume models, combining multiple capabilities by:

Spawning multiple models from a single control plane,
Implementing lightweight process isolation, allowing models to spin up and down quickly,
Isolating the file system between models to easily manage per-model dependencies, and
Over-committing GPU memory to run more models on a single GPU.

Cloudflare aims to place GPUs as close as we possibly can to people and applications that are using them. With Omni in place, we’re now able to run more models on every node in our network, improving model availability, minimizing latency, and reducing power consumed by idle GPUs.

Here’s how.

Omni’s architecture – at a glance

At a high level, Omni is a platform to run AI models. When an inference request is made on Workers AI, we load the model’s configuration from Workers KV and our routing layer forwards it to the closest Omni instance that has available capacity. For inferences using the Asynchronous Batch API, we route to an Omni instance that is idle, which is typically in a location where it’s night.

Omni runs a few checks on the inference request, runs model specific pre and post processing, then hands the request over to the model.

Elastic scaling by spawning multiple models from a single control plane

If you’re developing an AI application, a typical setup is having a container or a VM dedicated to running a single model with a GPU attached to it. This is simple. But it’s also heavy-handed — because it requires managing the entire stack from provisioning the VM, installing GPU drivers, downloading model weights, and managing the Python environment. At scale, managing infrastructure this way is incredibly time consuming and often requires an entire team.

If you’re using Workers AI, we handle all of this for you. Omni uses a single control plane for running multiple models, called the scheduler, which automatically provisions models and spawns new instances as your traffic scales. When starting a new model instance, it downloads model weights, Python code, and any other dependencies. Omni’s scheduler provides fine-grained control and visibility over the model’s lifecycle: it receives incoming inference requests and routes them to the corresponding model processes, being sure to distribute the load between multiple GPUs. It then makes sure the model processes are running, rolls out new versions as they are released, and restarts itself when detecting errors or failure states. It also collects metrics for billing and emits logs.

The inference itself is done by a per-model process, supervised by the scheduler. It receives the inference request and some metadata, then sends back a response. Depending on the model, the response can be various types; for instance, a JSON object or a SSE stream for text generation, or binary for image generation.

The scheduler and the child processes communicate by passing messages over Inter-Process Communication (IPC). Usually the inference request is buffered in the scheduler for applying features, like prompt templating or tool calling, before the request is passed to the child process. For potentially large binary requests, the scheduler hands over the underlying TCP connection to the child process for consuming the request body directly.

Implementing lightweight process and Python isolation

Typically, deploying a model requires its own dedicated container, but we want to colocate more models on a single container to conserve memory and GPU capacity. In order to do so, we needed finer-grained controls over CPU memory and the ability to isolate a model from its dependencies and environment. We deploy Omni in two configurations; a container running multiple models or bare metal running a single model. In both cases, process isolation and Python virtual environments allow us to isolate models with different dependencies by creating namespaces and are limited by cgroups.

Python doesn’t take into account cgroups memory limits for memory allocations, which can lead to OOM errors. Many AI Python libraries rely on psutil for pre-allocating CPU memory. psutil reads /proc/meminfo to determine how much memory is available. Since in Omni each model has its own configurable memory limits, we need psutil to reflect the current usage and limits for a given model, not for the entire system.

The solution for us was to create a virtual file system, using fuse, to mount our own version of /proc/meminfo which reflects the model’s current usage and limits.

To illustrate this, here’s an Omni instance running a model (running as pid 8). If we enter the mount namespace and look at /proc/meminfo it will reflect the model’s configuration:

In this case the model has 7Gib of memory available and the entire container 15Gib. If the model tries to allocate more than 7Gib of memory, it will be OOM killed and restarted by the scheduler’s process manager, without causing any problems to the other models.

For isolating Python and some system dependencies, each model runs in a Python virtual environment, managed by uv. Dependencies are cached on the machine and, if possible, shared between models (uv uses symbolic links between its cache and virtual environments).

Also separated processes for models allows to have different CUDA contexts and isolation for error recovery.

Over-committing memory to run more models on a single GPU

Some models don’t receive enough traffic to fully utilize a GPU, and with Omni we can pack more models on a single GPU, freeing up capacity for other workloads. When it comes to GPU memory management, Omni has two main jobs: safely over-commit GPU memory, so that more models than normal can share a single GPU, and enforce memory limits, to prevent any single model from running out of memory while running.

Over-committing memory means allocating more memory than is physically available to the device.

For example, if a GPU has 10 Gib of memory, Omni would allow 2 models of 10Gib each on that GPU.

Right now, Omni is configured to run 13 models and is allocating about 400% GPU memory on a single GPU, saving up 4 GPUs. Omni does this by injecting a CUDA stub library that intercepts CUDA memory allocations (cuMalloc* or cudaMalloc*) calls and forces memory allocations to be performed in unified memory mode.

In Unified memory mode CUDA shares the same memory address space for both the GPU and the CPU:

CUDA’s unified memory mode

In practice this is what memory over-commitment looks like: imagine 3 models (A, B and C). Models A+B fit in the GPU’s memory but C takes up the entire memory.

Models A+B are loaded first and are in GPU memory, while model C is in CPU memory

Omni receives a request for model C so models A+B are swapped out and C is swapped in.

Omni receives a request for model B, so model C is partly swapped out and model B is swapped back in.

Omni receives a request for model A, so model A is swapped back in and model C is completely swapped out.

The trade-off is added latency: if performing an inference requires memory that is currently on the host system, it must be transferred to the GPU. For smaller models, this latency is minimal, because with PCIe 4.0, the physical bus between your GPU and system, provides 32 GB/sec of bandwidth. On the other hand, if a model need to be “cold started” i.e. it’s been swapped out because it hasn’t been used in a while, the system may need to swap back the entire model – a larger sized model, for example, might use 5Gib of GPU memory for weights and caches, and would take ~156ms to be swapped back into the GPU. Naturally, over time, inactive models are put into CPU memory, while active models stay hot in the GPU.

Rather than allowing the model to choose how much GPU memory it uses, AI frameworks tend to pre-allocate as much GPU memory as possible for performance reasons, making co-locating models more complicated. Omni allows us to control how much memory is actually exposed to any given model to prevent a greedy model from over-using the GPU allocated to it. We do this by overriding the CUDA runtime and driver APIs (cudaMemGetInfo and cuMemGetInfo). Instead of exposing the entire GPU memory, we only expose a subset of memory to each model.

How Omni runs multiple models for Workers AI

AI models can run in a variety of inference engines or backends: vLLM, Python, and now our very own inference engine, Infire. While models have different capabilities, each model needs to support Workers AI features, like batching and function calling. Omni acts as a unified layer for integrating these systems. It integrates into our internal routing and scheduling systems, and provides a Python API for our engineering team to add new models more easily. Let’s take a closer look at how Omni does this in practice:

Similar to how a JavaScript Worker works, Omni calls a request handler, running the model’s logic and returning a response.

Omni installs Python dependencies at model startup. We run an internal Python registry and mirror the public registry. In either case we declare dependencies in requirements.txt:

The handle_request function can be async and return different Python types, including pydantic objects. Omni will convert the return value into a Workers AI response for the eyeball.

A Python package is injected, named omni, containing all the Python APIs to interact with the request, the Workers AI systems, building Responses, error handling, etc. Internally we publish it as regular Python package to be used in standalone, for unit testing for instance:

What’s next

Omni allows us to run models more efficiently by spawning them from a single control plane and implementing lightweight process isolation. This enables quick starting and stopping of models, isolated file systems for managing Python and system dependencies, and over-committing GPU memory to run more models on a single GPU. This improves the performance for our entire Workers AI stack, reduces the cost of running GPUs, and allows us to ship new models and features quickly and safely.

Right now, Omni is running in production on a handful of models in the Workers AI catalog, and we’re adding more every week. Check out Workers AI today to experience Omni’s performance benefits on your AI application.

How we built the most efficient inference engine for Cloudflare’s network

Vlad Krasnov — Wed, 27 Aug 2025 14:00:00 GMT

Inference powers some of today’s most powerful AI products: chat bot replies, AI agents, autonomous vehicle decisions, and fraud detection. The problem is, if you’re building one of these products on top of a hyperscaler, you’ll likely need to rent expensive GPUs from large centralized data centers to run your inference tasks. That model doesn’t work for Cloudflare — there’s a mismatch between Cloudflare’s globally-distributed network and a typical centralized AI deployment using large multi-GPU nodes. As a company that operates our own compute on a lean, fast, and widely distributed network within 50ms of 95% of the world’s Internet-connected population, we need to be running inference tasks more efficiently than anywhere else.

This is further compounded by the fact that AI models are getting larger and more complex. As we started to support these models, like the Llama 4 herd and gpt-oss, we realized that we couldn’t just throw money at the scaling problems by buying more GPUs. We needed to utilize every bit of idle capacity and be agile with where each model is deployed.

After running most of our models on the widely used open source inference and serving engine vLLM, we figured out it didn’t allow us to fully utilize the GPUs at the edge. Although it can run on a very wide range of hardware, from personal devices to data centers, it is best optimized for large data centers. When run as a dedicated inference server on powerful hardware serving a specific model, vLLM truly shines. However, it is much less optimized for dynamic workloads, distributed networks, and for the unique security constraints of running inference at the edge alongside other services.

That’s why we decided to build something that will be able to meet the needs of Cloudflare inference workloads for years to come. Infire is an LLM inference engine, written in Rust, that employs a range of techniques to maximize memory, network I/O, and GPU utilization. It can serve more requests with fewer GPUs and significantly lower CPU overhead, saving time, resources, and energy across our network.

Our initial benchmarking has shown that Infire completes inference tasks up to 7% faster than vLLM 0.10.0 on unloaded machines equipped with an H100 NVL GPU. On infrastructure under real load, it performs significantly better.

Currently, Infire is powering the Llama 3.1 8B model for Workers AI, and you can test it out today at @cf/meta/llama-3.1-8b-instruct!

The Architectural Challenge of LLM Inference at Cloudflare

Thanks to industry efforts, inference has improved a lot over the past few years. vLLM has led the way here with the recent release of the vLLM V1 engine with features like an optimized KV cache, improved batching, and the implementation of Flash Attention 3. vLLM is great for most inference workloads — we’re currently using it for several of the models in our Workers AI catalog — but as our AI workloads and catalog has grown, so has our need to optimize inference for the exact hardware and performance requirements we have.

Cloudflare is writing much of our new infrastructure in Rust, and vLLM is written in Python. Although Python has proven to be a great language for prototyping ML workloads, to maximize efficiency we need to control the low-level implementation details. Implementing low-level optimizations through multiple abstraction layers and Python libraries adds unnecessary complexity and leaves a lot of CPU performance on the table, simply due to the inefficiencies of Python as an interpreted language.

We love to contribute to open-source projects that we use, but in this case our priorities may not fit the goals of the vLLM project, so we chose to write a server for our needs. For example, vLLM does not support co-hosting multiple models on the same GPU without using Multi-Instance GPU (MIG), and we need to be able to dynamically schedule multiple models on the same GPU to minimize downtime. We also have an in-house AI Research team exploring unique features that are difficult, if not impossible, to upstream to vLLM.

Finally, running code securely is our top priority across our platform and Workers AI is no exception. We simply can’t trust a 3rd party Python process to run on our edge nodes alongside the rest of our services without strong sandboxing. We are therefore forced to run vLLM via gvisor. Having an extra virtualization layer adds another performance overhead to vLLM. More importantly, it also increases the startup and tear downtime for vLLM instances — which are already pretty long. Under full load on our edge nodes, vLLM running via gvisor consumes as much as 2.5 CPU cores, and is forced to compete for CPU time with other crucial services, that in turn slows vLLM down and lowers GPU utilization as a result.

While developing Infire, we’ve been incorporating the latest research in inference efficiency — let’s take a deeper look at what we actually built.

How Infire works under the hood

Infire is composed of three major components: an OpenAI compatible HTTP server, a batcher, and the Infire engine itself.

An overview of Infire’s architecture

Platform startup

When a model is first scheduled to run on a specific node in one of our data centers by our auto-scaling service, the first thing that has to happen is for the model weights to be fetched from our R2 object storage. Once the weights are downloaded, they are cached on the edge node for future reuse.

As the weights become available either from cache or from R2, Infire can begin loading the model onto the GPU.

Model sizes vary greatly, but most of them are large, so transferring them into GPU memory can be a time-consuming part of Infire’s startup process. For example, most non-quantized models store their weights in the BF16 floating point format. This format has the same dynamic range as the 32-bit floating format, but with reduced accuracy. It is perfectly suited for inference providing the sweet spot of size, performance and accuracy. As the name suggests, the BF16 format requires 16 bits, or 2 bytes per weight. The approximate in-memory size of a given model is therefore double the size of its parameters. For example, LLama3.1 8B has approximately 8B parameters, and its memory footprint is about 16 GB. A larger model, like LLama4 Scout, has 109B parameters, and requires around 218 GB of memory. Infire utilizes a combination of Page Locked memory with CUDA asynchronous copy mechanism over multiple streams to speed up model transfer into GPU memory.

While loading the model weights, Infire begins just-in-time compiling the required kernels based on the model's parameters, and loads them onto the device. Parallelizing the compilation with model loading amortizes the latency of both processes. The startup time of Infire when loading the Llama-3-8B-Instruct model from disk is just under 4 seconds.

The HTTP server

The Infire server is built on top of hyper, a high performance HTTP crate, which makes it possible to handle hundreds of connections in parallel – while consuming a modest amount of CPU time. Because of ChatGPT’s ubiquity, vLLM and many other services offer OpenAI compatible endpoints out of the box. Infire is no different in that regard. The server is responsible for handling communication with the client: accepting connections, handling prompts and returning responses. A prompt will usually consist of some text, or a "transcript" of a chat session along with extra parameters that affect how the response is generated. Some parameters that come with a prompt include the temperature, which affects the randomness of the response, as well as other parameters that affect the randomness and length of a possible response.

After a request is deemed valid, Infire will pass it to the tokenizer, which transforms the raw text into a series of tokens, or numbers that the model can consume. Different models use different kinds of tokenizers, but the most popular ones use byte-pair encoding. For tokenization, we use HuggingFace's tokenizers crate. The tokenized prompts and params are then sent to the batcher, and scheduled for processing on the GPU, where they will be processed as vectors of numbers, called embeddings.

The batcher

The most important part of Infire is in how it does batching: by executing multiple requests in parallel. This makes it possible to better utilize memory bandwidth and caches.

In order to understand why batching is so important, we need to understand how the inference algorithm works. The weights of a model are essentially a bunch of two-dimensional matrices (also called tensors). The prompt represented as vectors is passed through a series of transformations that are largely dominated by one operation: vector-by-matrix multiplication. The model weights are so large, that the cost of the multiplication is dominated by the time it takes to fetch it from memory. In addition, modern GPUs have hardware units dedicated to matrix-by-matrix multiplications (called Tensor Cores on Nvidia GPUs). In order to amortize the cost of memory access and take advantage of the Tensor Cores, it is necessary to aggregate multiple operations into a larger matrix multiplication.

Infire utilizes two techniques to increase the size of those matrix operations. The first one is called prefill: this technique is applied to the prompt tokens. Because all the prompt tokens are available in advance and do not require decoding, they can all be processed in parallel. This is one reason why input tokens are often cheaper (and faster) than output tokens.

How Infire enables larger matrix multiplications via batching

The other technique is called batching: this technique aggregates multiple prompts into a single decode operation.

Infire mixes both techniques. It attempts to process as many prompts as possible in parallel, and fills the remaining slots in a batch with prefill tokens from incoming prompts. This is also known as continuous batching with chunked prefill.

As tokens get decoded by the Infire engine, the batcher is also responsible for retiring prompts that reach an End of Stream token, and sending tokens back to the decoder to be converted into text.

Another job the batcher has is handling the KV cache. One demanding operation in the inference process is called attention. Attention requires going over the KV values computed for all the tokens up to the current one. If we had to recompute those previously encountered KV values for every new token we decode, the runtime of the process would explode for longer context sizes. However, using a cache, we can store all the previous values and re-read them for each consecutive token. Potentially the KV cache for a prompt can store KV values for as many tokens as the context window allows. In LLama 3, the maximal context window is 128K tokens. If we pre-allocated the KV cache for each prompt in advance, we would only have enough memory available to execute 4 prompts in parallel on H100 GPUs! The solution for this is paged KV cache. With paged KV caching, the cache is split into smaller chunks called pages. When the batcher detects that a prompt would exceed its KV cache, it simply assigns another page to that prompt. Since most prompts rarely hit the maximum context window, this technique allows for essentially unlimited parallelism under typical load.

Finally, the batcher drives the Infire forward pass by scheduling the needed kernels to run on the GPU.

CUDA kernels

Developing Infire gives us the luxury of focusing on the exact hardware we use, which is currently Nvidia Hopper GPUs. This allowed us to improve performance of specific compute kernels using low-level PTX instructions for this specific architecture.

Infire just-in-time compiles its kernel for the specific model it is running, optimizing for the model’s parameters, such as the hidden state size, dictionary size and the GPU it is running on. For some operations, such as large matrix multiplications, Infire will utilize the high performance cuBLASlt library, if it would deem it faster.

Infire also makes use of very fine-grained CUDA graphs, essentially creating a dedicated CUDA graph for every possible batch size on demand. It then stores it for future launch. Conceptually, a CUDA graph is another form of just-in-time compilation: the CUDA driver replaces a series of kernel launches with a single construct (the graph) that has a significantly lower amortized kernel launch cost, thus kernels executed back to back will execute faster when launched as a single graph as opposed to individual launches.

How Infire performs in the wild

We ran synthetic benchmarks on one of our edge nodes with an H100 NVL GPU.

The benchmark we ran was on the widely used ShareGPT v3 dataset. We ran the benchmark on a set of 4,000 prompts with a concurrency of 200. We then compared Infire and vLLM running on bare metal as well as vLLM running under gvisor, which is the way we currently run in production. In a production traffic scenario, an edge node would be competing for resources with other traffic. To simulate this, we benchmarked vLLM running in gvisor with only one CPU available.

As evident from the benchmarks we achieved our initial goal of matching and even slightly surpassing vLLM performance, but more importantly, we’ve done so at a significantly lower CPU usage, in large part because we can run Infire as a trusted bare-metal process. Inference no longer takes away precious resources from our other services and we see GPU utilization upward of 80%, reducing our operational costs.

This is just the beginning. There are still multiple proven performance optimizations yet to be implemented in Infire – for example, we’re integrating Flash Attention 3, and most of our kernels don’t utilize kernel fusion. Those and other optimizations will allow us to unlock even faster inference in the near future.

What’s next

Running AI inference presents novel challenges and demands to our infrastructure. Infire is how we’re running AI efficiently — close to users around the world. By building upon techniques like continuous batching, a paged KV-cache, and low-level optimizations tailored to our hardware, Infire maximizes GPU utilization while minimizing overhead. Infire completes inference tasks faster and with a fraction of the CPU load of our previous vLLM-based setup, especially under the strict security constraints we require. This allows us to serve more requests with fewer resources, making requests served via Workers AI faster and more efficient.

However, this is just our first iteration — we’re excited to build in multi-GPU support for larger models, quantization, and true multi-tenancy into the next version of Infire. This is part of our goal to make Cloudflare the best possible platform for developers to build AI applications.

Want to see if your AI workloads are faster on Cloudflare? Get started with Workers AI today.

State-of-the-art image generation Leonardo models and text-to-speech Deepgram models now available in Workers AI

Michelle Chen — Wed, 27 Aug 2025 14:00:00 GMT

When we first launched Workers AI, we made a bet that AI models would get faster and smaller. We built our infrastructure around this hypothesis, adding specialized GPUs to our datacenters around the world that can serve inference to users as fast as possible. We created our platform to be as general as possible, but we also identified niche use cases that fit our infrastructure well, such as low-latency image generation or real-time audio voice agents. To lean in on those use cases, we’re bringing on some new models that will help make it easier to develop for these applications.

Today, we’re excited to announce that we are expanding our model catalog to include closed-source partner models that fit this use case. We’ve partnered with Leonardo.Ai and Deepgram to bring their latest and greatest models to Workers AI, hosted on Cloudflare’s infrastructure. Leonardo and Deepgram both have models with a great speed-to-performance ratio that suit the infrastructure of Workers AI. We’re starting off with these great partners — but expect to expand our catalog to other partner models as well.

The benefits of using these models on Workers AI is that we don’t only have a standalone inference service, we also have an entire suite of Developer products that allow you to build whole applications around AI. If you’re building an image generation platform, you could use Workers to host the application logic, Workers AI to generate the images, R2 for storage, and Images for serving and transforming media. If you’re building Realtime voice agents, we offer WebRTC and WebSocket support via Workers, speech-to-text, text-to-speech, and turn detection models via Workers AI, and an orchestration layer via Cloudflare Realtime. All in all, we want to lean into use cases that we think Cloudflare has a unique advantage in, with developer tools to back it up, and make it all available so that you can build the best AI applications on top of our holistic Developer Platform.

Leonardo Models

Leonardo.Ai is a generative AI media lab that trains their own models and hosts a platform for customers to create generative media. The Workers AI team has been working with Leonardo for a while now and have experienced the magic of their image generation models firsthand. We’re excited to bring on two image generation models from Leonardo: @cf/leonardo/phoenix-1.0 and @cf/leonardo/lucid-origin.

“We’re excited to enable Cloudflare customers a new avenue to extend and use our image generation technology in creative ways such as creating character images for gaming, generating personalized images for websites, and a host of other uses... all through the Workers AI and the Cloudflare Developer Platform.” - Peter Runham, CTO, Leonardo.Ai

The Phoenix model is trained from the ground up by Leonardo, excelling at things like text rendering and prompt coherence. The full image generation request took 4.89s end-to-end for a 25 step, 1024x1024 image.

The Lucid Origin model is a recent addition to Leonardo’s family of models and is great at generating photorealistic images. The image took 4.38s to generate end-to-end at 25 steps and a 1024x1024 image size.

Deepgram Models

Deepgram is a voice AI company that develops their own audio models, allowing users to interact with AI through a natural interface for humans: voice. Voice is an exciting interface because it carries higher bandwidth than text, because it has other speech signals like pacing, intonation, and more. The Deepgram models that we’re bringing on our platform are audio models which perform extremely fast speech-to-text and text-to-speech inference. Combined with the Workers AI infrastructure, the models showcase our unique infrastructure so customers can build low-latency voice agents and more.

"By hosting our voice models on Cloudflare's Workers AI, we're enabling developers to create real-time, expressive voice agents with ultra-low latency. Cloudflare's global network brings AI compute closer to users everywhere, so customers can now deliver lightning-fast conversational AI experiences without worrying about complex infrastructure." - Adam Sypniewski, CTO, Deepgram

@cf/deepgram/nova-3 is a speech-to-text model that can quickly transcribe audio with high accuracy. @cf/deepgram/aura-1 is a text-to-speech model that is context aware and can apply natural pacing and expressiveness based on the input text. The newer Aura 2 model will be available on Workers AI soon. We’ve also improved the experience of sending binary mp3 files to Workers AI, so you don’t have to convert it into an Uint8 array like you had to previously. Along with our Realtime announcements (coming soon!), these audio models are the key to enabling customers to build voice agents directly on Cloudflare.

With the AI binding, a call to the Nova 3 speech-to-text model would look like this:

With the REST API, it would look like this:

As well, we’ve added WebSocket support to the Deepgram models, which you can use to keep a connection to the inference server live and use it for bi-directional input and output. To use the Nova model with WebSocket support, check out our Developer Docs.

All the pieces work together so that you can:

Capture audio with Cloudflare Realtime from any WebRTC source
Pipe it via WebSocket to your processing pipeline
Transcribe with audio ML models Deepgram running on Workers AI
Process with your LLM of choice through a model hosted on Workers AI or proxied via AI Gateway
Orchestrate everything with Realtime Agents

Try these models out today

Check out our developer docs for more details, pricing and how to get started with the newest partner models available on Workers AI.

Securing the AI Revolution: Introducing Cloudflare MCP Server Portals

Kenny Johnson — Tue, 26 Aug 2025 14:05:00 GMT

Securing the AI Revolution: Introducing Cloudflare MCP Server Portals

Large Language Models (LLMs) are rapidly evolving from impressive information retrieval tools into active, intelligent agents. The key to unlocking this transformation is the Model Context Protocol (MCP), an open-source standard that allows LLMs to securely connect to and interact with any application — from Slack to Canva, to your own internal databases.

This is a massive leap forward. With MCP, an LLM client like Gemini, Claude, or ChatGPT can answer more than just "tell me about Slack." You can ask it: "What were the most critical engineering P0s in Jira from last week, and what is the current sentiment in the #engineering-support Slack channel regarding them? Then propose updates and bug fixes to merge."

This is the power of MCP: turning models into teammates.

But this great power comes with proportional risk. Connecting LLMs to your most critical applications creates a new, complex, and largely unprotected attack surface. Today, we change that. We’re excited to announce Cloudflare MCP Server Portals are now available in Open Beta. MCP Server Portals are a new capability that enable you to centralize, secure, and observe every MCP connection in your organization. This feature is part of Cloudflare One, our secure access service edge (SASE) platform that helps connect and protect your workspace.

What Exactly is the Model Context Protocol?

Think of MCP as a universal translator or a digital switchboard for AI. It’s a standardized set of rules that lets two very different types of software—LLMs and everyday applications—talk to each other effectively. It consists of two primary components:

MCP Clients: These are the LLMs you interact with, like ChatGPT, Claude, or Gemini. The client is the front end to the AI that you use to ask questions and give commands.
MCP Servers: These can be developed for any application you want to connect to your LLM. SaaS providers like Slack or Atlassian may offer MCP servers for their products, or your own developers can also build custom ones for internal tools.

Credit: Architecture Overview - Model Context Protocol

For a useful connection, MCP relies on a few other key concepts:

Resources: A mechanism for the server to give the LLM context. This could be a specific file, a database schema, or a list of users in an application.
Prompts: Standardized questions the server can ask the client to get the information it needs to fulfill a request (e.g., "Which user do you want to search for?").
Tools: These are the actions the client can ask the server to perform, like querying a database, calling an API, or sending a message.

Without MCP, your LLM is isolated. With MCP, it's integrated, capable of interacting with your entire software ecosystem in a structured and predictable way.

The Peril of an Unsecured AI Ecosystem

Think of an LLM as the most brilliant and enthusiastic junior hire you've ever had. They have boundless energy and can produce incredible work, but they lack the years of judgment to know what they shouldn't do. The current, decentralized approach to MCP is like giving that junior hire a master key to every office and server room on their first day.

It's not a matter of if something will go wrong, but when.

This "shadow AI" infrastructure is the modern equivalent of the early Internet, where every server had a public IP address, fully exposed to the world. It’s the Wild West of unmanaged connections, impossible to secure. And the risks go far beyond accidental data deletion. Attackers are actively exploiting the unique vulnerabilities of LLM-driven ecosystems:

Prompt and tool injection: This is more than just telling a model to "ignore previous instructions." Attackers are now hiding malicious commands inside the descriptions of MCP tools themselves. Consider an LLM seeking to use a seemingly harmless "WebSearch" tool. A poisoned description could trick it into also running a query against a financial database and exfiltrating the results.
Supply chain attacks: How can you trust the third-party MCP servers used by your teams? In mid-2025, a critical vulnerability (CVE-2025-6514) was discovered in a popular npm package used for MCP authentication, exposing countless servers. In another incident dubbed "NeighborJack," security researchers found hundreds of MCP servers inadvertently exposed to the public Internet because they were bound to 0.0.0.0 without a firewall, allowing for potential OS command injection and host takeover.
Privilege escalation and the "confused deputy": An attacker doesn't need to break your LLM; they just need to confuse it. In one documented case, an AI agent running with high-level privileges was tricked into executing SQL commands embedded in a support ticket. The agent, acting as a "confused deputy," couldn't distinguish the malicious SQL from the legitimate ticket data and dutifully executed the commands, compromising an entire database.
Data leakage: Without centralized controls, data can bleed between systems in unexpected ways. In June 2025, a popular team collaboration tool’s MCP integration suffered a privacy breach where a bug caused some customer information to become visible in other customers' MCP instances, forcing them to take the integration offline for two weeks.

The Solution: A Single Front Door for Your MCP Servers

You can't protect what you can't see. Cloudflare MCP Server Portals solve this problem by providing a single, centralized gateway for all your MCP servers, somewhat similar to an application launcher for single sign-on. Instead of developers distributing dozens of individual server endpoints, they register their servers with Cloudflare. You provide your users with a single, unified Portal endpoint to configure in their MCP client.

This changes the security posture and user experience immediately. By routing all MCP traffic through Cloudflare, you get:

Centralized policy enforcement: You can integrate MCP Server Portals directly into Cloudflare One. This means you can enforce the same granular access policies for your AI connections that you do for your human users. Require multi-factor authentication, check for device posture, restrict by geography, and ensure only the right users can access specific servers and tools.
Comprehensive visibility and logging: Who is accessing which MCP server and which toolsets are they engaging with? What prompts are being run? What tools are being invoked? Previously, this data was scattered across every individual server. Server Portals aggregate all MCP request logs into a single place, giving you the visibility needed to audit activity and detect anomalies before they become breaches.
A curated AI user experience based on least privilege: Administrators can now review and approve MCP servers before making them available to users through a Portal. When a user authenticates through their Portal, they are only presented with the curated list of servers and tools they are authorized to use, preventing the use of unvetted or malicious third-party servers. This approach adheres to the Zero Trust security best practice of least privilege.
Simplified user configuration: Instead of having to load individual MCP server configurations into a MCP Client, users can load a single URL that pulls down all accessible MCP Servers. This drastically simplifies how many URLs need to be shared out and known by users. As new MCP Servers are added, they become dynamically available through the portal, instead of sharing each new URL on publishing of a server.

When a user connects to their MCP Server Portal, Access prompts them to authenticate with their corporate identity provider. Once authenticated, Cloudflare enforces which MCP Servers the user has access to, regardless of the underlying server’s authorization policies.

For MCP servers with domains hosted on Cloudflare, Access policies can be used to enforce the server’s direct authorization. This is done by creating an OAuth server that is linked to the domain’s existing Access Application. For MCP servers with domains outside Cloudflare and/or hosted by a third party, they require authorization controls outside of Cloudflare Access, this is usually done using OAuth.

The Road Ahead: What's Next for AI Security

MCP Server Portals are a foundational step in our mission to secure the AI revolution. This is just the beginning. In the coming months, we plan to build on this foundation by:

Mechanisms to lock down MCP Servers: Unless an MCP Server author enforces Authorization controls, users can still technically access MCPs outside of a Portal. We will build additional enforcement mechanisms to prevent this.
Integrating with Firewall for AI: Imagine applying the power of our WAF to your MCP traffic, detecting and blocking prompt injection attacks before they ever reach your servers.
Cloudflare hosted MCP Servers: We will make it easy to deploy MCP Servers using Cloudflare’s AI Gateway. This will allow for deeper prompt filtering and controls.
Applying machine learning to detect abuse: We will layer our own machine learning models on top of your MCP logs to automatically identify anomalous behavior, such as unusual data exfiltration patterns or suspicious tool usage.
Enhancing the protocol: We are committed to working with the open-source community to strengthen the MCP standard itself, contributing to a more secure and robust ecosystem for everyone.

This is our commitment: to provide the tools you need to innovate with confidence.

Get Started Today!

Progress doesn't have to come at the expense of security. With MCP Server Portals, you can empower your teams to build the future with AI, safely. This is a critical piece of helping to build a better Internet, and we are excited to see what you will build with it.

MCP Server Portals are now available in Open Beta for all Cloudflare One customers. To get started, navigate to the Access > AI Controls page in the Zero Trust Dashboard. If you don't have an account, you can sign up today and get started with up to 50 free seats or contact our experts to explore larger deployments.

Cloudflare is also starting a user research program focused on AI security. If you are interested in previews of new functionality or want to help shape our roadmap, please express your interest here.

ChatGPT, Claude, & Gemini security scanning with Cloudflare CASB

Alex Dunbrack — Tue, 26 Aug 2025 14:00:00 GMT

Starting today, all users of Cloudflare One, our secure access service edge (SASE) platform, can use our API-based Cloud Access Security Broker (CASB) to assess the security posture of their generative AI (GenAI) tools: specifically, OpenAI’s ChatGPT, Claude by Anthropic, and Google’s Gemini. Organizations can connect their GenAI accounts and within minutes, start detecting misconfigurations, Data Loss Prevention (DLP) matches, data exposure and sharing, compliance risks, and more — all without having to install cumbersome software onto user devices.

As Generative AI adoption has exploded in the enterprise, IT and Security teams need to hustle to keep themselves abreast of newly emerging security and compliance challenges that come alongside these powerful tools. In this rapidly changing landscape, IT and Security teams need tools that help enable AI adoption while still protecting the security and privacy of their enterprise networks and data.

Cloudflare’s API CASB and inline CASB work together to help organizations safely adopt AI tools. The API CASB integrations provide out-of-band visibility into data at rest and security posture inside popular AI tools like ChatGPT, Claude, and Gemini. At the same time, Cloudflare Gateway provides in-line prompt controls and Shadow AI identification. It applies policies and DLP to traffic as it moves to these AI providers. Together, these features give organizations a unified control plane for securing their use of GenAI.

What’s new

ChatGPT, Claude and Gemini are now all live in the integrations supported by Cloudflare’s API CASB. These integrations are available to all Cloudflare One users, account owners can easily connect their GenAI tenants, and CASB will scan for security issues across multiple domains:

Agentless Connections: Connect ChatGPT, Claude, and Gemini via agentless, API‑based integrations to scan posture and data risks; no endpoint software to install.
Posture Management: Detect insecure settings and misconfigurations that can lead to data exposure or misuse.
DLP Detection: Identify where sensitive data has been uploaded in chat attachments (prompts coming soon).
GenAI-specific Insights: Surface risks associated with the unique capability of a given AI provider's toolsets.

Admins can now answer questions like: What are our employees doing in ChatGPT? What data is being uploaded and used in Claude? Is Gemini configured correctly in Google Workspace?

Now let’s take a closer look at each integration.

OpenAI ChatGPT

Cloudflare’s CASB integration with OpenAI’s ChatGPT scans for several types of insights, including:

Capability Activation: Highlights capabilities that are specific to ChatGPT’s feature set, like actions, code execution, web access.
External Exposure: Finds chats and GPTs that are shared beyond the tenant, like GPTs shared publicly or listed on the GPT Store, and ties them back to their owners for quick triage.
Secrets, Keys and Invites: Identifies API keys that aren’t rotated or are no longer used to maintain credential hygiene. Identifies over‑privileged or stale invites.
Sensitive Content (via DLP): Detects sensitive data (e.g. credential and secrets, financial / health information, source code, etc.) via DLP profile matches in uploaded chat attachments to enable targeted response.

Anthropic Claude

For Claude, Cloudflare is able to provide the following out-of-band detections:

Secrets, Keys and Invites: Surfaces high‑risk invites and entitlement drift early so the least‑privilege access control stays tight. Spots unused API keys and rotation gaps before they turn into forgotten open doors.
Sensitive Content (via DLP): Monitors for sensitive data in uploaded files to help organizations safely enable Claude usage while maintaining compliance. Security teams get this information as quickly as CASB scans, giving them the visibility they need to help employees use Claude productively and securely with sensitive data.

As Anthropic continues to expand Claude's API capabilities and features, Cloudflare will add corresponding security detections to match new functionality as it becomes available.

Google Gemini

Cloudflare’s detections for Google Gemini appear as part of our API CASB integration for Google Workspace:

Identity & MFA: Identifies Gemini users and admins without MFA, leaving them prime targets for compromise. Imagine if an IT admin relied on Gemini daily to process corporate data, but their Google Workspace account lacked multi-factor authentication. One successful phishing email could give an attacker privileged access to Gemini and the wider Google Workspace environment — turning a minor oversight into an organization-wide breach.
License Hygiene: Flags suspended accounts still holding Gemini or AI Ultra licenses to cut cost and reduce exposure. An AI Ultra user has access to more powerful and riskier features, like Project Mariner, a research prototype that acts as an autonomous agent, capable of automating up to 10 tasks simultaneously across web browsers. An attacker can cause more damage by compromising an AI Ultra user, which is why we include this in our set of detections.

The Gemini integration has a narrower scope because Google has structured their product and API differently than OpenAI or Anthropic. For organizations, Gemini is delivered as a Google Workspace add-on. Enterprises enable Gemini features in Gmail, Docs, Sheets, and other Google Workspace apps through add-on licenses such as Gemini Enterprise or AI Ultra. Our CASB detections focus on identity, MFA, and license hygiene, rather than posture issues like public sharing or custom assistant publishing because Gemini does not yet provide those API endpoints.

The Future of GenAI Posture Management

Like countless other organizations, Cloudflare is adopting GenAI, on the same journey to make these environments even safer than they are today. We are excited to extend our management coverage to our customers so they can continue to innovate with GenAI. But looking ahead, we’re encouraged to see GenAI providers take concrete steps towards making security, compliance, and data privacy even more important tenets of their platforms.

Secure GenAI beyond the reach of Inline Controls

Generative AI adoption brings new security requirements. Cloudflare CASB delivers out-of-band visibility across these tools, surfacing insights on top of inline controls. With posture, access, and data under control, organizations can embrace GenAI confidently and securely.

How to get started:

For existing Cloudflare One customers: Contact your account manager or enable the integrations directly in your dashboard today.
New to Cloudflare One? Sign up now for 50 free seats to begin securely using Gen AI immediately. For larger deployments, request a consultation with our experts.

If you want to preview other new functionality and help shape our roadmap, express interest in our user research program for AI security.

Block unsafe prompts targeting your LLM endpoints with Firewall for AI

Radwa Radwan — Tue, 26 Aug 2025 14:00:00 GMT

Security teams are racing to secure a new attack surface: AI-powered applications. From chatbots to search assistants, LLMs are already shaping customer experience, but they also open the door to new risks. A single malicious prompt can exfiltrate sensitive data, poison a model, or inject toxic content into customer-facing interactions, undermining user trust. Without guardrails, even the best-trained model can be turned against the business.

Today, as part of AI Week, we’re expanding our AI security offerings by introducing unsafe content moderation, now integrated directly into Cloudflare Firewall for AI. Built with Llama, this new feature allows customers to leverage their existing Firewall for AI engine for unified detection, analytics, and topic enforcement, providing real-time protection for Large Language Models (LLMs) at the network level. Now with just a few clicks, security and application teams can detect and block harmful prompts or topics at the edge — eliminating the need to modify application code or infrastructure.

This feature is immediately available to current Firewall for AI users. Those not yet onboarded can contact their account team to participate in the beta program.

AI protection in application security

Cloudflare's Firewall for AI protects user-facing LLM applications from abuse and data leaks, addressing several of the OWASP Top 10 LLM risks such as prompt injection, PII disclosure, and unbound consumption. It also extends protection to other risks such as unsafe or harmful content.

Unlike built-in controls that vary between model providers, Firewall for AI is model-agnostic. It sits in front of any model you choose, whether it’s from a third party like OpenAI or Gemini, one you run in-house, or a custom model you have built, and applies the same consistent protections.

Just like our origin-agnostic Application Security suite, Firewall for AI enforces policies at scale across all your models, creating a unified security layer. That means you can define guardrails once and apply them everywhere. For example, a financial services company might require its LLM to only respond to finance-related questions, while blocking prompts about unrelated or sensitive topics, enforced consistently across every model in use.

Unsafe content moderation protects businesses and users

Effective AI moderation is more than blocking “bad words”, it’s about setting boundaries that protect users, meeting legal obligations, and preserving brand integrity, without over-moderating in ways that silence important voices.

Because LLMs cannot be fully scripted, their interactions are inherently unpredictable. This flexibility enables rich user experiences but also opens the door to abuse.

Key risks from unsafe prompts include misinformation, biased or offensive content, and model poisoning, where repeated harmful prompts degrade the quality and safety of future outputs. Blocking these prompts aligns with the OWASP Top 10 for LLMs, preventing both immediate misuse and long-term degradation.

One example of this is Microsoft’s Tay chatbot. Trolls deliberately submitted toxic, racist, and offensive prompts, which Tay quickly began repeating. The failure was not only in Tay’s responses; it was in the lack of moderation on the inputs it accepted.

Detecting unsafe prompts before reaching the model

Cloudflare has integrated Llama Guard directly into Firewall for AI. This brings AI input moderation into the same rules engine our customers already use to protect their applications. It uses the same approach that we created for developers building with AI in our AI Gateway product.

Llama Guard analyzes prompts in real time and flags them across multiple safety categories, including hate, violence, sexual content, criminal planning, self-harm, and more.

With this integration, Firewall for AI not only discovers LLM traffic endpoints automatically, but also enables security and AI teams to take immediate action. Unsafe prompts can be blocked before they reach the model, while flagged content can be logged or reviewed for oversight and tuning. Content safety checks can also be combined with other Application Security protections, such as Bot Management and Rate Limiting, to create layered defenses when protecting your model.

The result is a single, edge-native policy layer that enforces guardrails before unsafe prompts ever reach your infrastructure — without needing complex integrations.

How it works under the hood

Before diving into the architecture of Firewall for AI engine and how it fits within our previously mentioned module to detect PII in the prompts, let’s start with how we detect unsafe topics.

Detection of unsafe topics

A key challenge in building safety guardrails is balancing a good detection with model helpfulness. If detection is too broad, it can prevent a model from answering legitimate user questions, hurting its utility. This is especially difficult for topic detection because of the ambiguity and dynamic nature of human language, where context is fundamental to meaning.

Simple approaches like keyword blocklists are interesting for precise subjects — but insufficient. They are easily bypassed and fail to understand the context in which words are used, leading to poor recall. Older probabilistic models such as Latent Dirichlet Allocation (LDA) were an improvement, but did not properly account for word ordering and other contextual nuances.

Recent advancements in LLMs introduced a new paradigm. Their ability to perform zero-shot or few-shot classification is uniquely suited for the task of topic detection. For this reason, we chose Llama Guard 3, an open-source model based on the Llama architecture that is specifically fine-tuned for content safety classification. When it analyzes a prompt, it answers whether the text is safe or unsafe, and provides a specific category. We are showing the default categories, as listed here. Because Llama 3 has a fixed knowledge cutoff, certain categories — like defamation or elections — are time-sensitive. As a result, the model may not fully capture events or context that emerged after it was trained, and that’s important to keep in mind when relying on it.

For now, we cover the 13 default categories. We plan to expand coverage in the future, leveraging the model’s zero-shot capabilities.

A scalable architecture for future detections

We designed Firewall for AI to scale without adding noticeable latency, including Llama Guard, and this remains true even as we add new detection models.

To achieve this, we built a new asynchronous architecture. When a request is sent to an application protected by Firewall for AI, a Cloudflare Worker makes parallel, non-blocking requests to our different detection modules — one for PII, one for unsafe topics, and others as we add them.

Thanks to the Cloudflare network, this design scales to handle high request volumes out of the box, and latency does not increase as we add new detections. It will only be bounded by the slowest model used.

We optimize to keep the model utility at its maximum while keeping the guardrail detection broad enough.

Llama Guard is a rather large model, so running it at scale with minimal latency is a challenge. We deploy it on Workers AI, leveraging our large fleet of high performance GPUs. This infrastructure ensures we can offer fast, reliable inference throughout our network.

To ensure the system remains fast and reliable as adoption grows, we ran extensive load tests simulating the requests per second (RPS) we anticipate, using a wide range of prompt sizes to prepare for real-world traffic. To handle this, the number of model instances deployed on our network scales automatically with the load. We employ concurrency to minimize latency and optimize for hardware utilization. We also enforce a hard 2-second threshold for each analysis; if this time limit is reached, we fall back to any detections already completed, ensuring your application's requests latency is never further impacted.

From detection to security rules enforcement

Firewall for AI follows the same familiar pattern as other Application Security features like Bot Management and WAF Attack Score, making it easy to adopt.

Once enabled, the new fields appear in Security Analytics and expanded logs. From there, you can filter by unsafe topics, track trends over time, and drill into the results of individual requests to see all detection outcomes, for example: did we detect unsafe topics, and what are the categories. The request body itself (the prompt text) is not stored or exposed; only the results of the analysis are logged.

After reviewing the analytics, you can enforce unsafe topic moderation by creating rules to log or block based on prompt categories in Custom rules.

For example, you might log prompts flagged as sexual content or hate speech for review.

You can use this expression:
If (any(cf.llm.prompt.unsafe_topic_categories[*] in {"S10" "S12"})) then Log

Or deploy the rule with the categories field in the dashboard as in the below screenshot.

You can also take a broader approach by blocking all unsafe prompts outright:
If (cf.llm.prompt.unsafe_topic_detected)then Block

These rules are applied automatically to all discovered HTTP requests containing prompts, ensuring guardrails are enforced consistently across your AI traffic.

What’s Next

In the coming weeks, Firewall for AI will expand to detect prompt injection and jailbreak attempts. We are also exploring how to add more visibility in the analytics and logs, so teams can better validate detection results. A major part of our roadmap is adding model response handling, giving you control over not only what goes into the LLM but also what comes out. Additional abuse controls, such as rate limiting on tokens and support for more safety categories, are also on the way.

Firewall for AI is available in beta today. If you’re new to Cloudflare and want to explore how to implement these AI protections, reach out for a consultation. If you’re already with Cloudflare, contact your account team to get access and start testing with real traffic.

Cloudflare is also opening up a user research program focused on AI security. If you are curious about previews of new functionality or want to help shape our roadmap, express your interest here.

Best Practices for Securing Generative AI with SASE

AJ Gerstenhaber — Tue, 26 Aug 2025 14:00:00 GMT

As Generative AI revolutionizes businesses everywhere, security and IT leaders find themselves in a tough spot. Executives are mandating speedy adoption of Generative AI tools to drive efficiency and stay abreast of competitors. Meanwhile, IT and Security teams must rapidly develop an AI Security Strategy, even before the organization really understands exactly how it plans to adopt and deploy Generative AI.

IT and Security teams are no strangers to “building the airplane while it is in flight”. But this moment comes with new and complex security challenges. There is an explosion in new AI capabilities adopted by employees across all business functions — both sanctioned and unsanctioned. AI Agents are ingesting authentication credentials and autonomously interacting with sensitive corporate resources. Sensitive data is being shared with AI tools, even as security and compliance frameworks struggle to keep up.

While it demands strategic thinking from Security and IT leaders, the problem of governing the use of AI internally is far from insurmountable. SASE (Secure Access Service Edge) is a popular cloud-based network architecture that combines networking and security functions into a single, integrated service that provides employees with secure and efficient access to the Internet and to corporate resources, regardless of their location. The SASE architecture can be effectively extended to meet the risk and security needs of organizations in a world of AI.

Cloudflare’s SASE Platform is uniquely well-positioned to help IT teams govern their AI usage in a secure and responsible way — without extinguishing innovation. What makes Cloudflare different in this space is that we are one of the few SASE vendors that operate not just in cybersecurity, but also in AI infrastructure. This includes: providing AI infrastructure for developers (e.g. Workers AI, AI Gateway, remote MCP servers, Realtime AI Apps) to securing public-facing LLMs (e.g. Firewall for AI or AI Labyrinth), to allowing content creators to charge AI crawlers for access to their content, and the list goes on. Our expertise in this space gives us a unique view into governing AI usage inside an organization. It also gives our customers the opportunity to plug different components of our platform together to build out their AI and AI cybersecurity infrastructure.

This week, we are taking this AI expertise and using it to help ensure you have what you need to implement a successful AI Security Strategy. As part of this, we are announcing several new AI Security Posture Management (AI-SPM) features, including:

shadow AI reporting to gain visibility into employee’s use of AI,
confidence scoring of AI providers to manage risk,
AI prompt protection to defend against malicious inputs and prevent data loss,
out-of-band API CASB integrations with AI providers to detect misconfigurations,
new tools that untangle and secure Model Context Protocol (MCP) deployments in the enterprise.

All of these new AI-SPM features are built directly into Cloudflare’s powerful SASE platform.

And we’re just getting started. In the coming months you can expect to see additional valuable AI-SPM features launch across the Cloudflare platform, as we continue investing in making Cloudflare the best place to protect, connect, and build with AI.

What’s in this AI security guide?

In this guide, we will cover best practices for adopting generative AI in your organization using Cloudflare’s SASE (Secure Access Service Edge) platform. We start by covering how IT and Security leaders can formulate their AI Security Strategy. Then, we show how to implement this strategy using long-standing features of our SASE platform alongside the new AI-SPM features we launched this week.

This guide below is divided into three key pillars for dealing with (human) employee access to AI – Visibility, Risk Management and Data Protection — followed by additional guidelines around deploying agentic AI in the enterprise using MCP. Our objective is to help you align your security strategy with your business goals while driving adoption of AI across all your projects and teams.

And we do this all using our single SASE platform, so you don’t have to deploy and manage a complex hodgepodge of point solutions and security tools. In fact, we provide you with an overview of your AI security posture in a single dashboard, as you can see here:

AI Security Report in Cloudflare’s SASE platform

Develop your AI Security Strategy

The first step to securing AI usage is to establish your organization's level of risk tolerance. This includes pinpointing your biggest security concerns for your users and your data, along with relevant legal and compliance requirements. Relevant issues to consider include:

Do you have specific sensitive data that should not be shared with certain AI tools? (Some examples include personally identifiable information (PII), personal health information (PHI), sensitive financial data, secrets and credentials, source code or other proprietary business information.)
Are there business decisions that your employees should not be making using assistance from AI? (For instance, the EU AI Act AI prohibits the use of AI to evaluate or classify individuals based on their social behavior, personal characteristics, or personality traits.)
Are you subject to compliance frameworks that require you to produce records of the generative AI tools that your employees used, and perhaps even the prompts that your employees input into AI providers? (For example, HIPAA requires organizations to implement audit trails that records who accessed PHI and when, GDPR requires the same for PII, SOC2 requires the same for secrets and credentials.)
Do you have specific data protection requirements that require employees to use the sanctioned, enterprise version of a certain generative AI provider, and avoid certain AI tools or their consumer versions? (Enterprise AI tools often have more favorable terms of service, including shorter data retention periods, more limited data-sharing with third-parties, and/or a promise not to train AI models on user inputs.)
Do you require employees to completely avoid the use of certain AI tools, perhaps because they are unreliable, unreviewed or headquartered in a risky geography?
Are there security protections offered by your organization's sanctioned AI providers and to what extent do you plan to protect against misconfigurations of AI tools that can result in leaks of sensitive data?
What is your policy around the use of autonomous AI agents? What is your strategy for adopting the Model Context Protocol (MCP)? (The Model Context Protocol is a standard way to make information available to large language models (LLMs), similar to the way an application programming interface (API) works. It supports agentic AI that autonomously pursues goals and takes action.)

While almost every organization has relevant compliance requirements that implicate their use of generative AI, there is no “one size fits all” for addressing these issues.

Some organizations have mandates to broadly adopt AI tools of all stripes, while others require employees to interact with sanctioned AI tools only.
Some organizations are rapidly adopting the MCP, while others are not yet ready for agents to autonomously interact with their corporate resources.
Some organizations have robust requirements around data loss prevention (DLP), while others are still early in the process of deploying DLP in their organization.

Even with this diversity of goals and requirements, Cloudflare SASE provides a flexible platform for the implementation of your organization’s AI Security Strategy.

Build a solid foundation for AI Security

To implement your AI Security Strategy, you first need a solid SASE deployment.

SASE provides a unified platform that consolidates security and networking, replacing a fragmented patchwork of point solutions with a single platform that controls application visibility, user authentication, Data Loss Prevention (DLP), and other policies for access to the Internet and access to internal corporate resources. SASE is the essential foundation for an effective AI Security Strategy.

SASE architecture allows you to execute your AI security strategy by discovering and inventorying the AI tools used by your employees. With this visibility, you can proactively manage risk and support compliance requirements by monitoring AI prompts and responses to understand what data is being shared with AI tools. Robust DLP allows you to scan and block sensitive data from being entered into AI tools, preventing data leakage and protecting your organization's most valuable information. Our Secure Web Gateway (SWG) allows you to redirect traffic from unsanctioned AI providers to user education pages or to sanctioned enterprise AI providers. And our new integration of MCP tooling into our SASE platform helps you secure the deployment of agentic AI inside your organization.

If you're just starting your SASE journey, our Secure Internet Traffic Deployment Guide is the best place to begin. For this guide, however, we will skip these introductory details and dive right into using SASE to secure the use of Generative AI.

Gain visibility into your AI landscape

You can't protect what you can't see. The first step is to gain visibility into your AI landscape, which is essential for discovering and inventorying all the AI tools that your employees are using, deploying or experimenting with in your organization.

Discover Shadow AI

Shadow AI refers to the use of AI applications that haven't been officially sanctioned by your IT department. Shadow AI is not an uncommon phenomenon – Salesforce found that over half of the knowledge workers it surveyed admitted to using unsanctioned AI tools at work. Use of unsanctioned AI is not necessarily a sign of malicious intent; employees are often just trying to do their jobs better. As an IT or Security leader, your goal should be to discover Shadow AI and then apply the appropriate AI security policy. There are two powerful ways to do this: inline and out-of-band.

Discover employee usage of AI, inline

The most direct way to get visibility is by using Cloudflare's Secure Web Gateway (SWG).

SWG helps you get a clear picture of both sanctioned and unsanctioned AI and chat applications. By reviewing your detected usage, you'll gain insight into which AI apps are being used in your organization. This knowledge is essential for building policies that support approved tools, and block or control risky ones. This feature requires you to deploy the WARP client in Gateway proxy mode on your end-user devices.

You can review your company’s AI app usage using our new Application Library and Shadow IT dashboards. These tools allow you to:

Review traffic from user devices to understand how many users engage with a specific application over time.
Denote application’s status (e.g., Approved, Unapproved) inside your organization, and use that as input to a variety of SWG policies that control access to applications with that status.
Automate assessment of SaaS and Gen AI applications at scale with our soon-to-be-released Cloudflare Application Confidence Scores.

Shadow IT dashboard showing utilization of applications of different status (Approved, Unapproved, In Review, Unreviewed).

Discover employee usage of AI, out-of-band

Even if your organization doesn't use a device client, you can still get valuable data on Shadow AI usage if you use Cloudflare's integrations for Cloud Access Security Broker (CASB) with services like Google Workspace, Microsoft 365, or GitHub.

Cloudflare CASB provides high-fidelity detail about your SaaS environments, including sensitive data visibility and suspicious user activity. By integrating CASB with your SSO provider, you can see if your users have authenticated to any third-party AI applications, giving you a clear and non-invasive sense of app usage across your organization.

An API CASB integration with Google Workspace, showing findings filtered to third party integrations. Findings discover multiple LLM integrations.

Implement an AI risk management framework

Now that you’ve gained visibility into your AI landscape, the next step is to proactively manage that risk. Cloudflare’s SASE platform allows you to monitor AI prompts and responses, enforce granular security policies, coach users on secure behavior, and prevent misconfigurations in your enterprise AI providers.

Detect and monitor AI prompts and responses

If you have TLS decryption enabled in your SASE platform, you can gain new and powerful insights into how your employees are using AI with our new AI prompt protection feature.

AI Prompt Protection provides you with visibility into the exact prompts and responses from your employees’ interactions with supported AI applications. This allows you to go beyond simply knowing which tools are being used and gives you insight into exactly what kind of information is being shared.

This feature also works with DLP profiles to detect sensitive data in prompts. You can also choose whether to block the action or simply monitor it.

Log entry for a prompt detected using AI prompt protection.

Build granular AI security policies

Once your monitoring tools give you a clear understanding of AI usage, you can begin building security policies to achieve your security goals. Cloudflare's Gateway allows you to create policies based on application categories, application approval status, users, user groups, and device status. For example, you can:

create policies to explicitly allow approved AI applications while blocking unapproved AI applications;
create policies that redirect users from unapproved AI applications to an approved AI application;
limit access to certain applications to specific users or groups that have specific device security posture;
build policies to enable prompt capture (with AI prompt protection) for specific high-risk user groups, such as contractors or new employees, without affecting the rest of the organization; and
put certain applications behind Remote Browser Isolation (RBI), to prevent end users from uploading files or pasting data into the application.

Gateway application status policy selector

All of these policies can be written in Cloudflare Gateway’s unified policy builder, making it easy to deploy your AI Security Strategy across your organization.

Control access to internal LLMs

You can use Cloudflare Access to control your employees’ access to your organization’s internal LLMs, including any proprietary models you train internally and/or models that your organization runs on Cloudflare Worker’s AI.

Cloudflare Access allows you to gate access to these LLMs using fine-grained policies, including ensuring users are granted access based on their identity, user group, device posture, and other contextual signals. For example, you can use Cloudflare Access to write a policy that ensures that only certain data scientists at your organization can access a Workers AI model that is trained on certain types of customer data.

Manage the security posture of third-party AI providers

As you define which AI tools are sanctioned, you can develop functional security controls for consistent usage. Cloudflare newly supports API CASB integrations with popular AI tools like OpenAI (ChatGPT), Anthropic (Claude), and Google Gemini. These "out-of-band" integrations provide immediate visibility into how users are engaging with sanctioned AI tools, allowing you to report on posture management findings include:

Misconfigurations related to sharing settings.
Best practices for API key management.
DLP profile matches in uploaded attachments
Riskier AI features (e.g. autonomous web browsing, code execution) that are toggled on

OpenAI API CASB Integration showing riskier features that are toggled on, security posture risks like unused admin credentials, and an uploaded attachment with a DLP profile match.

Layer on data protection

Robust data protection is the final pillar that protects your employee’s access to AI..

Prevent data loss

Our SASE platform has long supported Data Loss Prevention (DLP) tools that scan and block sensitive data from being entered into AI tools, to prevent data leakage and protect your organization's most valuable information. You can write policies that detect sensitive data while adapting to organization-specific traffic patterns, and use Cloudflare Gateway’s unified policy builder to apply these to your users' interactions with AI tools or other applications. For example, you could write a DLP policy that detects and blocks the upload of a social security number (SSN), phone number or address.

As part of our new AI prompt protection feature, you can now also gain a semantic understanding of your users’ interactions with supported AI providers. Prompts are classified inline into meaningful, high-level topics that include PII, credentials and secrets, source code, financial information, code abuse / malicious code and prompt injection / jailbreak. You can then build inline granular policies based on these high-level topic classifications. For example, you could create a policy that blocks a non-HR employee from submitting a prompt with the intent to receive PII from the response, while allowing the HR team to do so during a compensation planning cycle.

Our new AI prompt protection feature empowers you to apply smart, user-specific DLP rules that empower your teams to get work done, all while strengthening your security posture. To use our most advanced DLP feature, you'll need to enable TLS decryption to inspect traffic.

The above policy blocks all ChatGPT prompts that may receive PII back in the response for employees in engineering, marketing, product, and finance user groups.

Secure MCP — and Agentic AI

MCP (Model Context Protocol) is an emerging AI standard, where MCP servers act as a translation layer for AI agents, allowing them to communicate with public and private APIs, understand datasets, and perform actions. Because these servers are a primary entry point for AI agents to engage with and manipulate your data, they are a new and critical security asset for your security team to manage.

Cloudflare already offers a robust set of developer tools for deploying remote MCP servers—a cloud-based server that acts as a bridge between a user's data and tools and various AI applications. But now our customers are asking for help securing their enterprise MCP deployments.

That is why we’re making MCP security controls a core part of our SASE platform.

Control MCP Authorization

MCP servers typically use OAuth for authorization, where the server inherits the permissions of the authorizing user. While this adheres to least-privilege for the user, it can lead to authorization sprawl — where the agent accumulates an excessive number of permissions over time. This makes the agent a high-value target for attackers.

Cloudflare Access now helps you manage authorization sprawl by applying Zero Trust principles to MCP server access. A Zero Trust model assumes no user, device, or network can be trusted implicitly, so every request is continuously verified. This approach ensures secure authentication and management of these critical assets as your business adopts more agentic workflows.

Centralize management of MCP servers

Cloudflare MCP Server Portal is a new feature in Cloudflare’s SASE platform that centralizes the management, security, and observation of an organization’s MCP servers.

MCP Server Portal allows you to register all your MCP servers with Cloudflare and provide your end users with a single, unified Portal endpoint to configure in their MCP client. This approach simplifies the user experience, because it eliminates the need to configure a one-to-one connection between every MCP client and server. It also means that new MCP servers dynamically become available to users whenever they are added to the Portal.

Beyond these usability enhancements, MCP Server Portal addresses the significant security risks associated with MCP in the enterprise. The current decentralized approach of MCP deployments creates a tangle of unmanaged one-to-one connections that are difficult to secure. The lack of centralized controls creates a variety of risks including prompt injection, tool injection (where malicious code is part of the MCP server itself), supply chain attacks and data leakage.

MCP Server Portals solve this by routing all MCP traffic through Cloudflare, allowing for centralized policy enforcement, comprehensive visibility and logging, and a curated user experience based on the principle of least privilege. Administrators can review and approve MCP servers before making them available, and users are only presented with the servers and tools they are authorized to use, which prevents the use of unvetted or malicious third-party servers.

An MCP Server Portal in the Cloudflare Dashboard

All of these features are only the beginning of our MCP security roadmap, as we continue advancing our support for MCP infrastructure and security controls across the entire Cloudflare platform.

Implement your AI security strategy in a single platform

As organizations rapidly develop and deploy their AI security strategies, Cloudflare’s SASE platform is ideally situated to implement policies that balance productivity with data and security controls.

Our SASE has a full suite of features to protect employee interactions with AI. Some of these features are deeply integrated in our Secure Web Gateway (SWG), including the ability to write fine-grained access policies, gain visibility into Shadow IT and introspect on interactions with AI tools using AI prompt protection. Apart from these inline controls, our CASB provides visibility and control using out-of-band API integrations. Our Cloudflare Access product can apply Zero Trust principles while protecting employee access to corporate LLMs that are hosted on Workers AI or elsewhere. We’re newly integrating controls for securing MCP that can also be used alongside Cloudflare’s Remote MCP Server platform.

And all of these features are integrated directly into Cloudflare’s SASE’s unified dashboard, providing a unified platform for you to implement your AI security strategy. You can even gain a holistic view of all of your AI-SPM controls using our newly-released AI-SPM overview dashboard.

AI security report showing utilization of AI applications.

As one the few SASE vendors that also offer AI infrastructure, Cloudflare’s SASE platform can also be deployed alongside products from our developer and application security platforms to holistically implement your AI security strategy alongside your AI infrastructure strategy (using, for example, Workers AI, AI Gateway, remote MCP servers, Realtime AI Apps, Firewall for AI, AI Labyrinth, or pay per crawl .)

Cloudflare is committed to helping enterprises securely adopt AI

Ensuring AI is scalable, safe, and secure is a natural extension of Cloudflare’s mission, given so much of our success relies on a safe Internet. As AI adoption continues to accelerate, so too does our mission to provide a market-leading set of controls for AI Security Posture Management (AI-SPM). Learn more about how Cloudflare helps secure AI or start exploring our new AI-SPM features in Cloudflare’s SASE dashboard today!