
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Fri, 17 Apr 2026 16:08:37 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Introducing the Agent Readiness score. Is your site agent-ready?]]></title>
            <link>https://blog.cloudflare.com/agent-readiness/</link>
            <pubDate>Fri, 17 Apr 2026 13:05:00 GMT</pubDate>
            <description><![CDATA[ The Agent Readiness score can help site owners understand how well their websites support AI agents. Here we explore new standards, share Radar data, and detail how we made Cloudflare’s docs the most agent-friendly on the web. ]]></description>
            <content:encoded><![CDATA[ <p>The web has always had to adapt to new standards. It learned to speak to web browsers, and then it learned to speak to search engines. Now, it needs to speak to AI agents.</p><p>Today, we are excited to introduce <a href="https://isitagentready.com/"><u>isitagentready.com</u></a>, a new tool to help site owners understand how to optimize their sites for agents, from guiding agents on how to authenticate, to controlling what content agents can see, the format they receive it in, and how they pay for it. We are also <a href="https://radar.cloudflare.com/ai-insights#adoption-of-ai-agent-standards"><u>introducing a new dataset to Cloudflare Radar</u></a> that tracks the overall adoption of each agent standard across the Internet.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/sGg5lZjafjQ398V7hYyMv/93e112a34754e2065ffbf6445ebc4500/unnamed.png" />
          </figure><p>We want to lead by example. That is why we are also sharing how we recently overhauled Cloudflare's <a href="https://developers.cloudflare.com/"><u>Developer Documentation</u></a> to make it the most agent-friendly documentation site, allowing AI tools to answer questions faster and at significantly lower cost.</p>
    <div>
      <h2>How agent-ready is the web today?</h2>
      <a href="#how-agent-ready-is-the-web-today">
        
      </a>
    </div>
    <p>The short answer: not very. This is expected, but also shows how much more effective agents can be than they are today, if standards are adopted.</p><p>To analyze this, Cloudflare Radar took the 200,000 <a href="https://radar.cloudflare.com/domains"><u>most visited domains</u></a> on the Internet; filtered out categories where agent readiness isn't important (like redirects, ad-servers, and tunneling services) to focus on businesses, publishers, and platforms that AI agents might realistically need to interact with; and scanned them using our new tool.</p><p>The result is a new “Adoption of AI agent standards” chart that can now be found in the <a href="https://radar.cloudflare.com/ai-insights#adoption-of-ai-agent-standards"><u>Cloudflare Radar AI Insights</u></a> page where we can measure adoption of each standard across multiple domain categories.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Vn8SoboYs4OmY2y6aXZke/c641d63cc71e4645e3b19c4124b5e912/image3.png" />
          </figure><p>Looking at individual checks, a few things stood out:</p><ul><li><p><a href="https://www.cloudflare.com/learning/bots/what-is-robots-txt/"><u>robots.txt</u></a> is widespread (78% of sites have one), but the vast majority are written for traditional search engine crawlers, not AI agents.</p></li><li><p><a href="https://contentsignals.org/"><u>Content Signals</u></a>: 4% of sites have declared their AI usage preferences in robots.txt. This is a new standard that is gaining momentum.</p></li><li><p>Markdown content negotiation (serving <code>text/markdown</code> in response to <code>Accept: text/markdown</code>) passes on 3.9% of sites.</p></li><li><p>Emerging standards like <a href="https://modelcontextprotocol.io/community/server-card/charter"><u>MCP Server Cards</u></a> and <a href="https://datatracker.ietf.org/doc/rfc9727/"><u>API Catalogs (RFC 9727)</u></a> together appear on fewer than 15 sites in the entire dataset. It’s still early: there is plenty of opportunity to stand out by being one of the first sites to adopt new standards and work well with agents.</p></li></ul><p>This chart will be updated weekly, and the data can also be accessed through the <a href="https://radar.cloudflare.com/explorer"><u>Data Explorer</u></a> or the <a href="https://developers.cloudflare.com/api/resources/radar/"><u>Radar API</u></a>.</p>
    <div>
      <h2>Get an agent readiness score for your site</h2>
      <a href="#get-an-agent-readiness-score-for-your-site">
        
      </a>
    </div>
    <p>You can get an agent readiness score for your own website by going to <a href="https://isitagentready.com/"><u>isitagentready.com</u></a> and entering the site’s URL.</p><p>Scores and audits that provide actionable feedback have helped to drive adoption of new standards before. For example, <a href="https://developer.chrome.com/docs/lighthouse/performance/performance-scoring"><u>Google Lighthouse</u></a> scores websites on performance and security best practices, and guides site owners to adopt the latest web platform standards. We think something similar should exist to help site owners adopt best practices for agents.</p><p>When you enter your site, Cloudflare makes requests to it to check which standards it supports, and provides a score based on four dimensions:</p><ul><li><p>Discoverability: <a href="https://datatracker.ietf.org/doc/html/rfc9309"><u>robots.txt</u></a>, <a href="https://www.sitemaps.org/protocol.html"><u>sitemap.xml</u></a>, <a href="https://datatracker.ietf.org/doc/html/rfc8288"><u>Link Headers (RFC 8288)</u></a></p></li><li><p>Content: <a href="https://blog.cloudflare.com/markdown-for-agents/"><u>Markdown for Agents</u></a></p></li><li><p>Bot Access Control: <a href="https://contentsignals.org/"><u>Content Signals</u></a>, <a href="https://developers.cloudflare.com/ai-crawl-control/"><u>AI bot rules in robots.txt</u></a>, <a href="https://datatracker.ietf.org/doc/draft-meunier-web-bot-auth-architecture/"><u>Web Bot Auth</u></a></p></li><li><p>Capabilities: <a href="https://github.com/cloudflare/agent-skills-discovery-rfc"><u>Agent Skills</u></a>, API Catalog <a href="https://www.rfc-editor.org/rfc/rfc9727"><u>(RFC 9727)</u></a>, OAuth server discovery via <a href="https://www.rfc-editor.org/rfc/rfc8414"><u>RFC 8414</u></a> and <a href="https://datatracker.ietf.org/doc/html/rfc9728"><u>RFC 9728</u></a>, <a href="https://modelcontextprotocol.io/community/server-card/charter"><u>MCP Server Card</u></a>, and <a 
href="https://developer.chrome.com/blog/webmcp-epp"><u>WebMCP</u></a></p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69MdHcYAZi60gVKRVP9GFM/f3521831b2ca361e12a33f6c8eb05f5b/image9.png" />
          </figure><p><sup><i>Screenshot of results from an agent-readiness check for an example website.</i></sup></p><p>Additionally, we check if the site supports agentic commerce standards including <a href="https://www.x402.org/"><u>x402</u></a>, <a href="https://ucp.dev/"><u>Universal Commerce Protocol</u></a>, and <a href="https://www.agenticcommerce.dev/"><u>Agentic Commerce Protocol</u></a>, but these do not currently count towards the score.</p><p>For each failing check, we provide a prompt that you can give to your coding agent and have it implement support on your behalf.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/9C62LtqTLgvZGViVEh8n0/b9c01cebe9042ad4b458d4305b3db7b2/image6.png" />
          </figure><p>The site itself is also agent-ready, practicing what it preaches. It exposes a stateless MCP server (https://isitagentready.com/.well-known/mcp.json) with a <code>scan_site</code> tool via Streamable HTTP, so any MCP-compatible agent can scan websites programmatically without using the web interface. It also publishes an Agent Skills index (https://isitagentready.com/.well-known/agent-skills/index.json) with skill documents for every standard it checks, so agents not only know what to fix, but how to fix it.</p><p>Let’s dig into the checks in each category, and why they matter for agents.</p>
    <div>
      <h3>Discoverability</h3>
      <a href="#discoverability">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/bots/what-is-robots-txt/"><u>robots.txt</u></a> has been around since 1994, and most sites have one. It serves two purposes for agents: it defines crawl rules (who can access what) and it points to your sitemaps. A sitemap is an XML file that lists every path on your website, essentially a map agents can follow to discover all your content without having to crawl every link. robots.txt is where agents look first.</p><p>Beyond sitemaps, agents can also discover important resources directly from HTTP response headers, specifically the Link response header (<a href="https://www.rfc-editor.org/rfc/rfc8288"><u>RFC 8288</u></a>). Unlike links buried inside HTML, the Link header is part of the HTTP response itself, which means an agent can find links to resources without having to parse any markup:</p>
            <pre><code>HTTP/1.1 200 OK
Link: &lt;/.well-known/api-catalog&gt;; rel="api-catalog"</code></pre>
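<p>To make this concrete, here is a minimal Python sketch of how an agent might read that header. It is a simplified reading of RFC 8288, not a complete parser: the function name <code>parse_link_header</code> is our own, the <code>service-doc</code> link is an illustrative addition, and commas inside quoted parameter values are not handled.</p>

```python
import re


def parse_link_header(value: str) -> dict:
    """Map each relation type in a Link header value to its target URI.

    Simplified reading of RFC 8288: commas inside quoted parameter
    values are not handled.
    """
    links = {}
    for part in value.split(","):
        match = re.match(r"\s*<([^>]*)>(.*)", part)
        if not match:
            continue
        target, params = match.groups()
        rel = re.search(r';\s*rel\s*=\s*"?([^";]+)"?', params, re.IGNORECASE)
        if rel:
            # One rel parameter may carry several space-separated types.
            for name in rel.group(1).split():
                links[name.lower()] = target
    return links


# The second link is an assumption added for illustration.
header = '</.well-known/api-catalog>; rel="api-catalog", </docs>; rel="service-doc"'
links = parse_link_header(header)
print(links["api-catalog"])
```

<p>An agent would resolve the returned path against the URL it requested before fetching it.</p>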
            
    <div>
      <h3>Content accessibility</h3>
      <a href="#content-accessibility">
        
      </a>
    </div>
    <p>Getting an agent onto your site is one thing. Making sure it can actually read your content is another.</p><p>Back in September 2024, which feels like a lifetime ago given how fast AI is moving, <a href="https://llmstxt.org/"><u>llms.txt</u></a> was proposed as a way to provide an LLM-friendly representation of a website that fits within the model’s context window. <a href="https://llmstxt.org/"><u>llms.txt</u></a> is a plain text file, written in Markdown, at the root of your site that gives agents a structured reading list: what the site is, what's on it, and where the important content lives. Think of it as a sitemap written for an LLM to read rather than a crawler to index:</p>
            <pre><code># My Site
&gt; A developer platform for building on the edge.
## Documentation
- [Getting Started](https://example.com/docs/start.md)
- [API Reference](https://example.com/docs/api.md)
## Changelog
- [Release Notes](https://example.com/changelog.md)</code></pre>
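<p>Because the file is plain Markdown, an agent (or a test suite) can turn it into a reading list with very little code. The sketch below assumes the layout shown above, with <code>##</code> headings followed by <code>- [Title](url)</code> entries; the function name <code>parse_llms_txt</code> is hypothetical and the parser ignores everything else in the file.</p>

```python
import re

# Matches "- [Title](url)" list entries.
ENTRY = re.compile(r"^- \[([^\]]+)\]\(([^)\s]+)\)")


def parse_llms_txt(text: str) -> dict:
    """Group the links of an llms.txt file under their section heading."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            entry = ENTRY.match(line)
            if entry:
                sections[current].append((entry.group(1), entry.group(2)))
    return sections


sample = """# My Site
> A developer platform for building on the edge.
## Documentation
- [Getting Started](https://example.com/docs/start.md)
- [API Reference](https://example.com/docs/api.md)
## Changelog
- [Release Notes](https://example.com/changelog.md)
"""
sections = parse_llms_txt(sample)
print(list(sections))
```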
            <p><a href="https://blog.cloudflare.com/markdown-for-agents/"><u>Markdown content negotiation</u></a> goes even further. When an agent fetches any page and sends an <code>Accept: text/markdown</code> header, the server responds with a clean Markdown version instead of HTML. The Markdown version requires far fewer tokens (we measured up to 80% token reduction in some cases), which makes responses faster and cheaper, and makes the content more likely to be consumed in its entirety given the default context window limits of most agent tools.</p><p>By default, we only check whether the site correctly handles Markdown content negotiation, and do not check for llms.txt. You can customize the scan to include llms.txt if you choose to.</p>
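<p>On the server side, the decision comes down to reading the <code>Accept</code> header. The following Python sketch shows one possible policy, and it is an assumption rather than a description of how any particular server implements it: serve Markdown only when <code>text/markdown</code> is listed explicitly and is preferred at least as much as <code>text/html</code>. Wildcards such as <code>*/*</code> deliberately fall back to HTML so that browsers are unaffected.</p>

```python
def parse_accept(accept: str) -> dict:
    """Return {media_type: q-value} for an Accept header value."""
    prefs = {}
    for item in accept.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        quality = 1.0  # Default weight when no q parameter is given.
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        prefs[media_type] = quality
    return prefs


def choose_representation(accept: str) -> str:
    """Pick text/markdown only when the client explicitly prefers it."""
    prefs = parse_accept(accept)
    markdown_q = prefs.get("text/markdown", 0.0)
    html_q = prefs.get("text/html", 0.0)
    if markdown_q > 0 and markdown_q >= html_q:
        return "text/markdown"
    return "text/html"


print(choose_representation("text/markdown"))
```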
    <div>
      <h3>Bot Access Control</h3>
      <a href="#bot-access-control">
        
      </a>
    </div>
    <p>Now that agents can navigate your site and consume your content, the next question is: do you want to let any bot do it?</p><p><code>robots.txt</code> does more than point to sitemaps. It is also where you define your access rules. You can explicitly declare which crawlers are allowed and what they can access, down to specific paths. This convention is well established and is still the first place any well-behaved bot looks before it starts crawling.</p><p><a href="https://contentsignals.org/"><u>Content Signals</u></a> let you be more specific. Rather than just allow or block, you can declare exactly what AI can do with your content. Using a <code>Content-Signal</code> directive in your <code>robots.txt</code>, you can independently control three things: whether your content can be used for AI training (<code>ai-train</code>), whether it can be used as AI input for inference and grounding (<code>ai-input</code>), and whether it should appear in search results (<code>search</code>):</p>
            <pre><code>User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=yes</code></pre>
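<p>A crawler that wants to honor these preferences needs to read them per user agent. The Python sketch below is a minimal, assumption-laden reader for the directive as shown above: it groups consecutive <code>User-agent</code> lines the way robots.txt rule groups work, accepts only <code>yes</code> and <code>no</code> values, and ignores every other directive. The function name <code>parse_content_signals</code> is our own.</p>

```python
def parse_content_signals(robots_txt: str) -> dict:
    """Return {user_agent: {signal: allowed}} from Content-Signal lines."""
    result = {}
    agents = []
    in_rules = False  # True once a non-User-agent line follows the group header.
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # Drop comments.
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents = []
                in_rules = False
            agents.append(value.lower())
            continue
        in_rules = True
        if field != "content-signal":
            continue
        for pair in value.split(","):
            name, _, setting = pair.partition("=")
            setting = setting.strip().lower()
            if setting in ("yes", "no"):
                for agent in agents:
                    result.setdefault(agent, {})[name.strip().lower()] = setting == "yes"
    return result


robots = "User-agent: *\nContent-Signal: ai-train=no, search=yes, ai-input=yes\n"
print(parse_content_signals(robots))
```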
            <p>Conversely, the <a href="https://blog.cloudflare.com/web-bot-auth/"><u>Web Bot Auth</u></a> IETF draft standard allows friendly bots to authenticate themselves, and allows websites receiving requests from bots to identify them. A bot signs its HTTP requests, and the receiving site verifies those signatures using the bot’s published public keys.</p><p>Those public keys live at a well-known endpoint hosted by the bot’s operator, <code>/.well-known/http-message-signatures-directory</code>, which we check as part of the scan.</p><p>Not all sites need to implement this. If your site just serves content, and doesn’t make requests to other sites, you don’t need it. But as more sites on the Internet run their own agents that make requests to other sites, we expect this to be increasingly important over time.</p>
    <div>
      <h3>Protocol Discovery</h3>
      <a href="#protocol-discovery">
        
      </a>
    </div>
    <p>Beyond passive content consumption, agents can also interact with your site directly by calling APIs, invoking tools, and completing tasks autonomously.</p><p>If your service has one or more public APIs, the API Catalog (<a href="https://www.rfc-editor.org/rfc/rfc9727"><u>RFC 9727</u></a>) gives agents a single well-known location to discover all of them. Hosted at <code>/.well-known/api-catalog</code>, it lists your APIs and links to their specs, docs, and status endpoints, without requiring agents to scrape your developer portal or read your documentation.</p><p>We can't talk about agents without mentioning MCP. The <a href="https://modelcontextprotocol.io/docs/getting-started/intro"><u>Model Context Protocol</u></a> is an open standard that allows AI models to connect with external data sources and tools. Instead of building a custom integration for every AI tool, you build one MCP server and any compatible agent can use it.</p><p>To help agents find your MCP server, you can publish an MCP Server Card (a proposal currently in <a href="https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1649"><u>draft</u></a>). This is a JSON file at <code>/.well-known/mcp/server-card.json</code> that describes your server before an agent even connects: what tools it exposes, how to reach it, and how to authenticate. An agent reads this file and knows everything it needs to start using your server:</p>
            <pre><code>{
  "$schema": "https://static.modelcontextprotocol.io/schemas/mcp-server-card/v1.json",
  "version": "1.0",
  "protocolVersion": "2025-06-18",
  "serverInfo": {
    "name": "search-mcp-server",
    "title": "Search MCP Server",
    "version": "1.0.0"
  },
  "description": "Search across all documentation and knowledge base articles",
  "transport": {
    "type": "streamable-http",
    "endpoint": "/mcp"
  },
  "authentication": {
    "required": false
  },
  "tools": [
    {
      "name": "search",
      "title": "Search",
      "description": "Search documentation by keyword or question",
      "inputSchema": {
        "type": "object",
        "properties": {
          "query": { "type": "string" }
        },
        "required": ["query"]
      }
    }
  ]
}</code></pre>
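<p>As a rough illustration of what an agent can do with such a card, the Python sketch below reduces it to the facts needed before connecting. The field names are taken from the example above, which follows a draft proposal and may change; the function name <code>summarize_server_card</code> and the trimmed card are our own.</p>

```python
import json
from urllib.parse import urljoin


def summarize_server_card(card: dict, site: str) -> dict:
    """Extract endpoint, auth requirement and tool arguments from a card."""
    transport = card.get("transport", {})
    return {
        "name": card.get("serverInfo", {}).get("name"),
        # The endpoint may be relative, so resolve it against the site URL.
        "endpoint": urljoin(site, transport.get("endpoint", "")),
        "transport": transport.get("type"),
        "auth_required": bool(card.get("authentication", {}).get("required", False)),
        "tools": {
            tool["name"]: tool.get("inputSchema", {}).get("required", [])
            for tool in card.get("tools", [])
        },
    }


# A trimmed copy of the card shown above.
card = json.loads("""
{
  "serverInfo": {"name": "search-mcp-server"},
  "transport": {"type": "streamable-http", "endpoint": "/mcp"},
  "authentication": {"required": false},
  "tools": [
    {"name": "search", "inputSchema": {"type": "object", "required": ["query"]}}
  ]
}
""")
summary = summarize_server_card(card, "https://example.com")
print(summary["endpoint"], summary["tools"])
```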
            <p>Agents work best when they have <a href="https://agentskills.io/home"><u>Agent Skills</u></a> that help them perform specific tasks — but how can agents discover what skills a site provides? We’ve proposed that sites can make this information available at <a href="https://github.com/cloudflare/agent-skills-discovery-rfc"><code><u>.well-known/agent-skills/index.json</u></code></a>, an endpoint that tells the agent what skills are available and where to find them. You might notice that the <code>.well-known</code> standard (<a href="https://datatracker.ietf.org/doc/html/rfc8615"><u>RFC 8615</u></a>) is used by many other agent and authorization standards — thank you to Cloudflare’s own Mark Nottingham who authored the standard, and other IETF contributors!</p><p>Many sites require you to sign in first in order to access them. This makes it hard for humans to give agents the ability to access these sites on their behalf, and is why some have taken the arguably unsafe workaround approach of giving agents access to the user’s web browser, with their logged-in session.</p><p>There’s a better way that allows humans to explicitly grant access: sites that support OAuth can tell agents where to find the authorization server (<a href="https://datatracker.ietf.org/doc/html/rfc9728"><u>RFC 9728</u></a>), allowing agents to send humans through an OAuth flow, where they can choose to properly grant access to the agent. Announced at Agents Week 2026, <a href="https://blog.cloudflare.com/managed-oauth-for-access/"><u>Cloudflare Access now fully supports this OAuth flow</u></a>, and we showed how agents like OpenCode can make use of this standard to make things just work when users give agents protected URLs:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BrGE7eydNNCpEEe3PowrJ/6a2bb1e1b1e7d84d672c1f6ad2333129/image4.png" />
          </figure>
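<p>The discovery step behind this flow can be sketched in a few lines of Python. This is a simplified illustration under our reading of RFC 9728 and RFC 8414, not a complete client: a protected resource can point to its metadata through a <code>resource_metadata</code> parameter in the <code>WWW-Authenticate</code> challenge, that metadata lists <code>authorization_servers</code>, and each authorization server publishes its own metadata under <code>/.well-known/oauth-authorization-server</code>. All function names and example hosts are hypothetical.</p>

```python
import re
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def well_known_url(identifier: str, suffix: str) -> str:
    """Insert /.well-known/<suffix> between the host and path of a URL."""
    parts = urlsplit(identifier)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, f"/.well-known/{suffix}{path}", "", ""))


def resource_metadata_from_challenge(www_authenticate: str) -> Optional[str]:
    """Read the resource_metadata URL from a WWW-Authenticate header value."""
    found = re.search(r'resource_metadata="([^"]+)"', www_authenticate)
    return found.group(1) if found else None


def authorization_server_metadata_urls(resource_metadata: dict) -> list:
    """Build the metadata URL for each listed authorization server."""
    return [
        well_known_url(issuer, "oauth-authorization-server")
        for issuer in resource_metadata.get("authorization_servers", [])
    ]


print(well_known_url("https://docs.example.com", "oauth-protected-resource"))
```

<p>From there, the agent fetches the authorization server metadata and sends the user through the OAuth flow it describes.</p>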
    <div>
      <h3>Commerce</h3>
      <a href="#commerce">
        
      </a>
    </div>
    <p>Agents can also buy things on your behalf — but payments on the web were designed for humans. Add to cart, enter a credit card, click pay. That flow breaks down entirely when the buyer is an AI agent.</p><p><a href="https://x402.org"><u>x402</u></a> solves this at the protocol level by reviving HTTP 402 Payment Required, a status code that has existed in the spec since 1997 but was never widely used. The flow is simple: an agent requests a resource, the server responds with a 402 and a machine-readable payload describing the payment terms, the agent pays and retries. Cloudflare partnered with Coinbase to launch the <a href="https://blog.cloudflare.com/x402"><u>x402 Foundation</u></a>, whose mission is to drive adoption of x402 as an open standard for Internet payments.</p><p>We also check for <a href="https://ucp.dev/"><u>Universal Commerce Protocol</u></a> and <a href="https://www.agenticcommerce.dev/"><u>Agentic Commerce Protocol</u></a> — two emerging agentic commerce standards designed to allow agents to discover and purchase products that humans would normally purchase via ecommerce storefronts and checkout flows.</p>
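<p>The request, 402, pay, retry loop described above can be modeled without any network at all. The Python sketch below is a toy simulation only: the <code>PaywalledServer</code> class, the <code>amount_cents</code> and <code>resource</code> fields, and the receipt mechanism are invented for illustration and are not the x402 wire format.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    status: int
    body: dict


@dataclass
class PaywalledServer:
    """In-memory stand-in for a server that charges per resource."""
    price_cents: int
    receipts: set = field(default_factory=set)

    def get(self, path, receipt=None):
        if receipt in self.receipts:
            return Response(200, {"content": f"contents of {path}"})
        # Hypothetical payment terms; not the real x402 payload.
        return Response(402, {"amount_cents": self.price_cents, "resource": path})

    def settle(self, amount_cents):
        # Stands in for a real payment rail.
        if amount_cents != self.price_cents:
            raise ValueError("wrong amount")
        receipt = f"receipt-{len(self.receipts) + 1}"
        self.receipts.add(receipt)
        return receipt


def fetch_with_payment(server, path, budget_cents):
    """Request a resource, pay if asked and affordable, then retry once."""
    response = server.get(path)
    if response.status != 402:
        return response
    terms = response.body
    if terms["amount_cents"] > budget_cents:
        return response  # Too expensive: surface the 402 to the caller.
    receipt = server.settle(terms["amount_cents"])
    return server.get(path, receipt=receipt)


print(fetch_with_payment(PaywalledServer(price_cents=5), "/report", 10).status)
```

<p>The budget check is the part worth keeping in any real agent: the payment terms are machine-readable precisely so that the agent can decide whether to pay before it does.</p>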
    <div>
      <h2>Integrating agent readiness into Cloudflare URL Scanner</h2>
      <a href="#integrating-agent-readiness-into-cloudflare-url-scanner">
        
      </a>
    </div>
    <p><a href="https://radar.cloudflare.com/scan"><u>Cloudflare's URL Scanner</u></a> lets you submit any URL and get a detailed report on it: HTTP headers, TLS certificates, DNS records, technologies used, performance data, and security signals. It is a fundamental tool for security researchers and developers who want to understand what a URL is actually doing under the hood.</p><p>We’ve taken the same checks from <a href="https://isitagentready.com/"><u>isitagentready.com</u></a> and added them to URL Scanner with a new Agent Readiness tab. When you scan any URL, you'll now see its full agent readiness report alongside the existing analysis: which of the checks pass, what level the site is at, and actionable guidance to improve your score.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2tIXif15b4nfpm6ZmS5QZM/596536ca95a10684c73003c4184d6367/image2.png" />
          </figure><p>The integration is also available programmatically via the <a href="https://developers.cloudflare.com/api/resources/url_scanner/"><u>URL Scanner API</u></a>. To include agent readiness results in a scan, pass the <code>agentReadiness</code> option in your scan request:</p>
            <pre><code>curl -X POST https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/urlscanner/v2/scan \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
    -d '{
          "url": "https://www.example.com",
          "options": {"agentReadiness": true}
        }'</code></pre>
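<p>If you would rather script this than use curl, the same request can be built with Python's standard library. This sketch mirrors the endpoint, headers, and body of the curl command above and makes no assumptions about the response format; the function name <code>build_scan_request</code> and the placeholder credentials are our own.</p>

```python
import json
import urllib.request


def build_scan_request(account_id: str, api_token: str, url: str) -> urllib.request.Request:
    """Build (but do not send) a URL Scanner request with agent readiness enabled."""
    endpoint = (
        "https://api.cloudflare.com/client/v4/accounts/"
        f"{account_id}/urlscanner/v2/scan"
    )
    payload = {"url": url, "options": {"agentReadiness": True}}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


# Placeholder values; substitute your own account ID and API token.
request = build_scan_request("ACCOUNT_ID", "API_TOKEN", "https://www.example.com")
print(request.get_method(), request.full_url)
```

<p>Passing the returned object to <code>urllib.request.urlopen</code> sends the request.</p>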
            
    <div>
      <h2>Leading by example: upgrading Cloudflare Docs</h2>
      <a href="#leading-by-example-upgrading-cloudflare-docs">
        
      </a>
    </div>
    <p>As we built the tools to measure the Web’s readiness, we knew we had to ensure our own house was in order. Our docs must be easily digestible by the agents our customers use.</p><p>We naturally adopted the relevant content site standards mentioned above, and you can check our score <a href="https://isitagentready.com/developers.cloudflare.com?profile=content"><u>here</u></a>. However, we didn’t stop there. Here is how we refined Cloudflare's <a href="https://developers.cloudflare.com/fundamentals/reference/markdown-for-agents/"><u>Developer Docs</u></a> to be the most agent-friendly resource on the web.</p>
    <div>
      <h3>URL fallbacks using <code>index.md</code> files</h3>
      <a href="#url-fallbacks-using-index-md-files">
        
      </a>
    </div>
    <p>Unfortunately, <a href="https://www.checklyhq.com/blog/state-of-ai-agent-content-negotation/"><u>as of February 2026</u></a>, of seven agents tested, only Claude Code, OpenCode, and Cursor request content with the <code>Accept: text/markdown</code> header by default. For the rest, we needed a seamless URL-based fallback.</p><p>To do this, we make every page available separately as Markdown at <code>/index.md</code> relative to the page’s URL. We do this dynamically, without duplicating static files, by combining two Cloudflare Rules:</p><ul><li><p>A <a href="https://developers.cloudflare.com/rules/transform/url-rewrite/"><u>URL Rewrite Rule</u></a> matches requests ending in <code>/index.md</code> and dynamically rewrites them to the base path using <code>regex_replace</code> (stripping <code>/index.md</code>).</p></li><li><p>A <a href="https://developers.cloudflare.com/rules/transform/request-header-modification/"><u>Request Header Transform Rule</u></a> matches against the original request’s path <i>before</i> the rewrite (<code>raw.http.request.uri.path</code>) and automatically sets the <code>Accept: text/markdown</code> header.</p></li></ul><p>With these two rules, any page can be fetched as Markdown by appending <code>/index.md</code> to its URL:</p><ul><li><p><a href="https://developers.cloudflare.com/r2/get-started/index.md"><u>https://developers.cloudflare.com/r2/get-started/index.md</u></a></p></li></ul><p>We point to these <code>/index.md</code> URLs in our <code>llms.txt</code> files. Effectively, for these <code>/index.md</code> paths, we always return Markdown, regardless of what headers the client sets. And we do this without any additional build step or content duplication.</p>
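<p>To show how the two rules interact, here is a small Python model of their combined effect on a request. It is not Cloudflare Rules syntax, and it assumes the rewritten base path keeps its trailing slash; the function name <code>apply_index_md_rules</code> is hypothetical. The point it illustrates is that the header rule must look at the original path, because the rewritten path no longer ends in <code>/index.md</code>.</p>

```python
import re


def apply_index_md_rules(path: str, headers: dict) -> tuple:
    """Model the rewrite rule and the header rule for one request."""
    raw_path = path  # The path as the client sent it, before any rewrite.
    new_headers = dict(headers)

    # Header transform: decided on the original path, not the rewritten one.
    if raw_path.endswith("/index.md"):
        new_headers["Accept"] = "text/markdown"

    # URL rewrite: strip the trailing index.md to reach the base path.
    rewritten = re.sub(r"/index\.md$", "/", path)
    return rewritten, new_headers


print(apply_index_md_rules("/r2/get-started/index.md", {"Accept": "*/*"}))
```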
    <div>
      <h3>Creating effective <code>llms.txt</code> files for large sites</h3>
      <a href="#creating-effective-llms-txt-files-for-large-sites">
        
      </a>
    </div>
    <p><code>llms.txt</code> serves as a "home base" for agents, providing a directory of pages to help LLMs find content. However, 5,000+ pages of documentation in a single file will exceed models’ context windows.</p><p>Instead of one massive file, we generate a separate <code>llms.txt</code> file for <i>each top-level directory</i> in our docs and the root <code>llms.txt</code> simply points to these subdirectories.</p><ul><li><p><a href="https://developers.cloudflare.com/llms.txt"><u>https://developers.cloudflare.com/llms.txt</u></a></p></li><li><p><a href="https://developers.cloudflare.com/r2/llms.txt"><u>https://developers.cloudflare.com/r2/llms.txt</u></a></p></li><li><p><a href="https://developers.cloudflare.com/workers/llms.txt"><u>https://developers.cloudflare.com/workers/llms.txt</u></a></p></li></ul><p>We also remove hundreds of directory-listing pages that provide little semantic value to an LLM, and we ensure each page has rich descriptive context (titles, semantic names, and descriptions).</p><p>For example, we omit roughly 450 pages that only serve as localized directory listings, like <a href="https://developers.cloudflare.com/workers/databases/"><u>https://developers.cloudflare.com/workers/databases/</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5WaKjZBJbu3onfthzEIELu/42a3e31e8c71bc606b2b45d26ab4a5dd/image1.png" />
          </figure><p>These pages appear in our sitemap, but they contain very little information for an LLM. Since all child pages are already linked individually in <code>llms.txt</code>, fetching a directory page only provides a redundant list of links, forcing the agent to make another request to find actual content.</p><p>To help agents navigate efficiently, each <code>llms.txt</code> entry must be rich in context but light on tokens. Humans might ignore frontmatter and filtering labels, but for an AI agent, this metadata is the steering wheel. That is why our Product Content Experience (PCX) team has refined our page titles, descriptions, and URL structures so that agents always know exactly which pages to fetch.</p><p>Take a look at a section from our root <a href="https://developers.cloudflare.com/llms.txt"><u>llms.txt</u></a>:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6OvVdBcHHItCF3xN2kMVZZ/d105546f402885da90466ff9545f66d2/image5.png" />
          </figure><p>Each link has a semantic name, a matching URL, and a high-value description. None of this required extra work for <code>llms.txt</code> generation. It was all already available in the docs frontmatter. The same goes for pages in top-level directory <code>llms.txt</code> files. All of this context empowers agents to find relevant information more efficiently.</p>
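<p>The splitting approach in this section can be sketched as a small generator. This Python example is an illustration under our own assumptions, not the tooling used for Cloudflare's docs: it takes page records with a <code>path</code>, <code>title</code>, and <code>description</code> (standing in for frontmatter), writes one <code>llms.txt</code> per top-level directory with links to each page's <code>/index.md</code>, and has the root file point only at those per-directory files.</p>

```python
from collections import defaultdict


def build_llms_files(site: str, pages: list) -> dict:
    """Return {file_path: contents} for a root and per-directory llms.txt."""
    by_directory = defaultdict(list)
    for page in pages:
        top_level = page["path"].strip("/").split("/")[0]
        by_directory[top_level].append(page)

    files = {}
    root_lines = ["# Docs", ""]
    for top_level in sorted(by_directory):
        # The root file links to each directory's own llms.txt.
        root_lines.append(f"- [{top_level}]({site}/{top_level}/llms.txt)")
        lines = [f"# {top_level}", ""]
        for page in sorted(by_directory[top_level], key=lambda p: p["path"]):
            url = f"{site}/{page['path'].strip('/')}/index.md"
            lines.append(f"- [{page['title']}]({url}): {page['description']}")
        files[f"/{top_level}/llms.txt"] = "\n".join(lines) + "\n"
    files["/llms.txt"] = "\n".join(root_lines) + "\n"
    return files


# Hypothetical page records for illustration.
pages = [
    {"path": "/r2/get-started/", "title": "Get started", "description": "Create your first bucket."},
    {"path": "/workers/runtime-apis/", "title": "Runtime APIs", "description": "APIs available in Workers."},
]
files = build_llms_files("https://example.com", pages)
print(files["/llms.txt"])
```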
    <div>
      <h3>Custom agent-friendly documentation (afdocs) tooling</h3>
      <a href="#custom-agent-friendly-documentation-afdocs-tooling">
        
      </a>
    </div>
    <p>Additionally, we test our docs against <a href="https://github.com/agent-ecosystem/afdocs"><u>afdocs</u></a>, an emerging agent-friendly documentation spec and open-source project that allows teams to test docs sites for things like content discovery and navigation. This spec allowed us to build custom audit tooling of our own. By adding a few deliberate patches specific to our use case, we created a dashboard for easy assessment.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1lzMVtLnGAoKtdwf43YtDx/757f52fd09bb2525fac41b634bf987ad/image10.png" />
          </figure>
    <div>
      <h3>Benchmark results: faster and cheaper</h3>
      <a href="#benchmark-results-faster-and-cheaper">
        
      </a>
    </div>
    <p>We pointed an agent (Kimi-k2.5 via OpenCode) at the <code>llms.txt</code> files of our own docs and of other large technical documentation sites, and tasked it with answering highly specific technical questions.</p><p>On average, the agent pointed at Cloudflare’s documentation consumed <b>31% fewer tokens</b> and arrived at the correct answer <b>66% faster</b> than it did on the average site that is not refined for agents. Because each of our product directories fits into a single context window, agents can identify the exact page they need and fetch it in a single, linear path.</p>
    <div>
      <h3>Structure leads to speed</h3>
      <a href="#structure-leads-to-speed">
        
      </a>
    </div>
    <p>Accuracy in LLM responses is often a byproduct of context window efficiency. During our testing, we observed a recurring pattern with other documentation sets:</p><ol><li><p><b>The grep loop:</b> Many documentation sites provide a single, massive <code>llms.txt</code> file that exceeds the agent's immediate context window. Because the agent cannot "read" the whole file, it begins to <a href="https://en.wikipedia.org/wiki/Grep"><u>grep</u></a> for keywords. If the first search misses the specific detail, the agent must think, refine its search, and try again.</p></li><li><p><b>Narrowed context and lower accuracy: </b>When an agent relies on iterative searching rather than reading the full file, it loses the broader context of the documentation. This fragmented view often leaves the agent with a reduced understanding of the documentation at hand.</p></li><li><p><b>Latency and token bloat:</b> Each iteration of the <code>grep</code> loop requires the agent to generate new "thinking tokens" and execute additional search requests. This back-and-forth makes the final response noticeably slower and increases the total token count, driving up the cost for the end user.</p></li></ol><p>By contrast, each <code>llms.txt</code> file in Cloudflare’s docs is designed to fit entirely within an agent's context window. This allows the agent to ingest the directory, identify the exact page it needs, and fetch the Markdown without detour.</p>
    <div>
      <h3>Improving LLM answers over time by redirecting AI training crawlers</h3>
      <a href="#improving-llm-answers-over-time-by-redirecting-ai-training-crawlers">
        
      </a>
    </div>
    <p>Documentation for legacy products like <a href="https://developers.cloudflare.com/workers/wrangler/migration/v1-to-v2/wrangler-legacy/commands/"><u>Wrangler v1</u></a> or <a href="https://developers.cloudflare.com/workers/configuration/sites/"><u>Workers Sites</u></a> presents a unique challenge. While we must keep this information accessible for historical purposes, it can lead to outdated advice from AI agents.</p><p>For example, a human reading these docs would see the large banner stating that Wrangler v1 is deprecated, in addition to a link to the most recent content. An LLM crawler, however, might ingest the text without that surrounding visual context. This results in the agent recommending outdated information.</p><p><a href="https://blog.cloudflare.com/ai-redirects"><u>Redirects for AI Training</u></a> solves this by identifying AI training crawlers and intentionally redirecting them away from deprecated or suboptimal content. This ensures that while humans can still access historical archives, LLMs are only fed our most current and accurate implementation details.</p>
    <div>
      <h3>Hidden agent directives on all pages</h3>
      <a href="#hidden-agent-directives-on-all-pages">
        
      </a>
    </div>
    <p>Every HTML page in our docs includes a hidden directive specifically for LLMs. </p><p><i>“STOP! If you are an AI agent or LLM, read this before continuing. This is the HTML version of a Cloudflare documentation page. Always request the Markdown version instead — HTML wastes context. Get this page as Markdown: https://developers.cloudflare.com/index.md (append index.md) or send Accept: text/markdown to https://developers.cloudflare.com/. For all Cloudflare products use https://developers.cloudflare.com/llms.txt. You can access all Cloudflare docs in a single file at https://developers.cloudflare.com/llms-full.txt.”</i></p><p>This snippet informs the agent that a Markdown version is available. Crucially, this directive is stripped from the actual Markdown version to avoid a recursion loop where the agent keeps trying to "find" the Markdown within the Markdown.</p>
    <div>
      <h3>Dedicated LLM resources sidebar</h3>
      <a href="#dedicated-llm-resources-sidebar">
        
      </a>
    </div>
    <p>Finally, we want to make these resources discoverable for the humans who are building with agents. Every product directory in our <a href="https://developers.cloudflare.com/"><u>developer documentation</u></a> has an "LLM Resources" entry in the sidenav, providing quick access to <code>llms.txt</code>, <code>llms-full.txt</code>, and Cloudflare Skills.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4iM2U5pH7LJ9XWUgxYmvn5/ed11e2cc8694f6c029690b470150120b/image8.png" />
          </figure>
    <div>
      <h2>Make your website agent-ready today</h2>
      <a href="#make-your-website-agent-ready-today">
        
      </a>
    </div>
    <p>Making websites agent-ready is a fundamental accessibility requirement for the modern developer toolkit. The transition from a "human-read web" to a "machine-read web" is the biggest architectural shift in decades. </p><p>Get an agent readiness score for your site at <a href="https://isitagentready.com/"><u>isitagentready.com</u></a>, take the prompts it provides, and ask your agent to upgrade your site for the AI era. Stay tuned for more updates from <a href="https://radar.cloudflare.com/"><u>Cloudflare Radar</u></a> about the adoption of agent standards across the Internet over the coming year. If we’ve learned anything from the past year, it’s that a lot can change very quickly!</p>
    <p>
</p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[Developer Documentation]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Agent Readiness]]></category>
            <guid isPermaLink="false">5t83bTn7Vt1EudTxQQ97NY</guid>
            <dc:creator>André Jesus</dc:creator>
            <dc:creator>Vance Morrison</dc:creator>
        </item>
        <item>
            <title><![CDATA[Shared Dictionaries: compression that keeps up with the agentic web]]></title>
            <link>https://blog.cloudflare.com/shared-dictionaries/</link>
            <pubDate>Fri, 17 Apr 2026 13:02:00 GMT</pubDate>
            <description><![CDATA[ Today, we’re excited to give you a sneak peek of our support for shared compression dictionaries, show you how it improves page load times, and reveal when you’ll be able to try the beta yourself. 
 ]]></description>
            <content:encoded><![CDATA[ <p>Web pages have grown 6-9% <a href="https://almanac.httparchive.org/en/2024/page-weight#fig-15"><u>heavier</u></a> every year for the past decade, spurred by the web becoming more framework-driven, interactive, and media-rich. Nothing about that trajectory is changing. What <i>is</i> changing is how often those pages get rebuilt and how many clients request them. Both are skyrocketing because of agents. </p><p>Shared dictionaries shrink asset transfers from servers to browsers so pages <b>load faster with less bloat on the wire,</b> especially for returning users or visitors on a slow connection. Instead of re-downloading entire JavaScript bundles after every deploy, the browser tells the server what it already has cached, and the server only sends the file diffs. </p><p><b>Today, we’re excited to give you a sneak peek of our support for shared compression dictionaries,</b> show you what we’ve seen in early testing, and reveal when you’ll be able to try the beta yourself (hint: it’s April 30, 2026!). </p>
    <div>
      <h2>The problem: more shipping = less caching</h2>
      <a href="#the-problem-more-shipping-less-caching">
        
      </a>
    </div>
    <p>Agentic crawlers, browsers, and other tools hit endpoints repeatedly, fetching full pages, often to extract a fragment of information. Agentic actors represented just under 10% of total requests across Cloudflare's network during March 2026, up ~60% year-over-year.<sup>1</sup> </p><p>Every page shipped is heavier than last year and read more often by machines than ever before. But agents aren’t just consuming the web; they’re helping to build it. AI-assisted development <b>means teams ship faster. </b>Increasing the frequency of deploys, experiments, and iterations is great for product velocity, but terrible for caching.</p><p>When an agent pushes a one-line fix, the bundler re-chunks, filenames change, and every user on earth may have to re-download the entire application. That happens not because the code is meaningfully different, but because the browser or client has no way to know what specifically changed. It sees a new URL and starts from zero. Traditional compression helps with the size of each download, but it can't help with the redundancy. It doesn't know the client already has 95% of the file cached. So every deploy, across every user, across every bot, sends redundant bytes again and again. Ship ten small changes a day, and you've effectively opted out of caching. This wastes bandwidth and CPU in a web where hardware is quickly becoming the bottleneck.</p><p><b>In order to scale with more requests hitting heavier pages that are re-deployed more often, compression has to get smarter. </b></p>
    <div>
      <h2>What are shared dictionaries?</h2>
      <a href="#what-are-shared-dictionaries">
        
      </a>
    </div>
    <p>A compression dictionary is a shared reference between server and client that works like a cheat sheet. Instead of compressing a response from scratch, the server says "you already know this part of the file because you’ve cached it before" and only sends what's new. The client holds the same reference and uses it to reconstruct the full response during decompression. The more the dictionary can reference content in the file, the smaller the compressed output that is transferred to the client.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2g2NBi1d4eqLAgksGetult/ea05b48519edfa8cccde6d6531700ce1/image5.png" />
          </figure><p>This principle of compressing against what's already known is how modern compression algorithms pull ahead of their predecessors. Brotli ships with a built-in dictionary of common web patterns like HTML attributes and common phrases; Zstandard is purpose-built for custom dictionaries: you can feed it representative content samples, and it generates an optimized dictionary for the kind of content you serve. Gzip has neither; it must build dictionaries by finding patterns in real time as it’s compressing. These “traditional compression” algorithms are already <a href="https://developers.cloudflare.com/speed/optimization/content/compression/"><u>available</u></a> on Cloudflare today. </p><p>Shared dictionaries take this principle a step further: the previously cached version of the resource <b>becomes the dictionary.</b> Remember the deploy problem where a team ships a one-line fix and every user re-downloads the full bundle? With shared dictionaries, the browser already has the old version cached. The server compresses against it, sending only the diff. A 500KB bundle with a one-line change becomes only a few kilobytes on the wire. At 100K daily users and 10 deploys a day, with every user fetching every deploy, that's one million downloads: the difference between 500GB of transfer and a few gigabytes.</p>
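<p>To make the idea concrete, here is a minimal Python sketch that uses the standard library's <code>zlib</code> preset-dictionary support as a stand-in for the dictionary-aware Brotli and Zstandard encodings. It is not the wire format browsers use, and zlib only looks back 32KB, so the fake bundle is kept small. The bundle generator and function names are made up for illustration.</p>

```python
import hashlib
import zlib


def make_bundle(changed_line: str) -> bytes:
    """Build a fake ~14 KB JS bundle; only one line differs between versions."""
    lines = [
        f"function f{i}(){{return '{hashlib.sha256(str(i).encode()).hexdigest()}';}}"
        for i in range(150)
    ]
    lines[75] = changed_line
    return "\n".join(lines).encode()


def compress_plain(data: bytes) -> bytes:
    """Compress from scratch, with no knowledge of earlier versions."""
    return zlib.compress(data, 9)


def compress_with_dictionary(data: bytes, dictionary: bytes) -> bytes:
    """Compress against a preset dictionary (here: the previous version)."""
    compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dictionary(payload: bytes, dictionary: bytes) -> bytes:
    """The client needs the same dictionary to rebuild the full response."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(payload) + decompressor.flush()


v1 = make_bundle("const timeout = 30;")  # version the client already cached
v2 = make_bundle("const timeout = 60;")  # new deploy with a one-line change
plain = compress_plain(v2)
delta = compress_with_dictionary(v2, v1)

if __name__ == "__main__":
    print("uncompressed:", len(v2), "plain:", len(plain), "with dictionary:", len(delta))
```

<p>Running it prints the three sizes side by side. The exact numbers depend on the zlib build, but the dictionary-compressed payload should be a small fraction of the plain one because almost every byte of the new version is a back-reference into the old one.</p>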
    <div>
      <h3>Delta compression</h3>
      <a href="#delta-compression">
        
      </a>
    </div>
    <p>Delta compression is what turns the version the browser already has into the dictionary. The protocol works like this: when the server first serves a resource, it attaches a <code>Use-As-Dictionary</code> response header, telling the browser to hold onto the file because it’ll be useful later. On the next request for that resource, the browser sends an <code>Available-Dictionary</code> header back, telling the server, "here's what I've got." The server then compresses the new version against the old one and sends only the diff. No separate dictionary file is needed. </p>
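<p>The header exchange can be sketched in a few lines of Python. This assumes, per our reading of RFC 9842, that <code>Use-As-Dictionary</code> carries a <code>match</code> pattern and that <code>Available-Dictionary</code> is the SHA-256 digest of the cached body written as a structured-field byte sequence (<code>:base64:</code>). The function names and the example pattern are hypothetical, and real servers do considerably more validation.</p>

```python
import base64
import hashlib
from typing import Dict, Iterable, Optional


def available_dictionary_value(cached_body: bytes) -> str:
    """Format the SHA-256 digest of a cached body as :base64: (structured-field bytes)."""
    digest = hashlib.sha256(cached_body).digest()
    return ":" + base64.b64encode(digest).decode("ascii") + ":"


def first_response_headers(match_pattern: str) -> Dict[str, str]:
    """Server, first response: ask the browser to keep this body as a dictionary."""
    return {"Use-As-Dictionary": f'match="{match_pattern}"'}


def next_request_headers(cached_body: bytes) -> Dict[str, str]:
    """Browser, later request: advertise the dictionary it holds and the encodings it accepts."""
    return {
        "Accept-Encoding": "gzip, br, zstd, dcb, dcz",
        "Available-Dictionary": available_dictionary_value(cached_body),
    }


def find_dictionary(
    request_headers: Dict[str, str], stored_versions: Iterable[bytes]
) -> Optional[bytes]:
    """Server: find the stored version whose hash matches what the client advertised."""
    advertised = request_headers.get("Available-Dictionary")
    for body in stored_versions:
        if available_dictionary_value(body) == advertised:
            return body
    return None  # no match: fall back to ordinary compression
```

<p>If <code>find_dictionary</code> returns a body, the server can compress the new version against it; if it returns <code>None</code>, the response goes out with regular gzip, Brotli, or Zstandard.</p>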
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Hyxl74lzTdGdTczYBe3LF/4b66760c34378508e774814eba7b3c8f/image3.png" />
          </figure><p>This is where the payoff lands for real applications: versioned JS bundles, CSS files, framework updates, and anything else that changes incrementally between releases. Say the browser already has app.bundle.v1.js cached, and the developer makes an update and deploys app.bundle.v2.js. Delta compression only sends the diff between these versions. Every subsequent version is also just a diff. Version three compresses against version two. Version 47 compresses against version 46. The savings don't reset; they persist across the entire release history.</p><p>There's also active discussion in the community about custom and dynamic dictionaries <a href="https://dev.to/carlosmateom/beyond-static-resources-delta-compression-for-dynamic-html-3hn4"><u>for non-static </u></a>content. That's future work, but the implications are significant. We'll save that for another post.</p>
    <div>
      <h2>So why the wait?</h2>
      <a href="#so-why-the-wait">
        
      </a>
    </div>
    <p>If shared dictionaries are so powerful, why doesn't everyone use them already?</p><p>Because the last time they were tried, the implementation couldn't survive contact with the open web. </p><p><a href="https://en.wikipedia.org/wiki/SDCH"><u>Google shipped</u></a> Shared Dictionary Compression for HTTP (SDCH) in Chrome in 2008. It worked well, with some early adopters reporting double-digit improvements in page load times. But SDCH accumulated problems faster than anyone was able to fix them.</p><p>The most memorable was a class of compression side-channel attacks (<a href="https://en.wikipedia.org/wiki/CRIME"><u>CRIME</u></a>, <a href="https://en.wikipedia.org/wiki/BREACH"><u>BREACH</u></a>). Researchers showed that if an attacker could inject content alongside something sensitive that gets compressed (like a session cookie, token, etc.), the size of the compressed output could leak information about the secret. The attacker could guess a byte at a time, watch whether the asset size shrank, and repeat until they extracted the whole secret. </p><p>But security wasn't the only problem, or even the main reason why adoption didn’t happen. SDCH surfaced a few architectural problems like violating the <a href="https://groups.google.com/a/chromium.org/g/blink-dev/c/nQl0ORHy7sw/m/S8BoYHQyAgAJ"><u>Same-Origin Policy</u></a> (which ironically is partially why it performed so well). Its cross-origin dictionary model <a href="https://groups.google.com/a/chromium.org/g/blink-dev/c/nQl0ORHy7sw/m/BROFrwM2AgAJ"><u>couldn't be reconciled with CORS</u></a>, and it lacked a clear specification for interactions with things like the Cache API. After a while it became clear that SDCH wasn’t ready for broad adoption, so in 2017 Chrome (the only browser supporting it at the time) <a href="https://groups.google.com/a/chromium.org/g/blink-dev/c/nQl0ORHy7sw"><u>unshipped</u></a> it. 
</p><p>Getting the web community to pick up the baton took a decade, but it was worth it.</p><p>The modern standard, <a href="https://datatracker.ietf.org/doc/rfc9842/"><u>RFC 9842: Compression Dictionary Transport</u></a>, closes key design gaps that made SDCH untenable. For example, it enforces that an advertised dictionary is only usable on responses from the same origin, preventing many of the conditions that made side-channel compression attacks possible. </p><p><a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Compression_dictionary_transport"><u>Chrome and Edge have shipped support</u></a> with <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1882979"><u>Firefox</u></a> working to follow. The standard is moving toward broad adoption, but complete cross-browser support is still catching up.</p><p>The RFC mitigates the security problems, but dictionary transport has always been complex to implement. An origin may have to generate dictionaries, serve them with the right headers, check every request for an <code>Available-Dictionary</code> match, delta-compress the response on the fly, and fall back gracefully when a client doesn't have a dictionary. Caching gets complex too. Responses vary on both encoding and dictionary hash, so every dictionary version creates a separate cache variant. Mid-deploy, you have clients with the old dictionary, clients with the new one, and clients with none. Your cache is storing separate copies for each. Hit rates drop, storage climbs, and the dictionaries themselves have to stay fresh under normal HTTP caching rules.</p><p>This complexity is a coordination problem, and exactly the kind of thing that belongs at the edge. A CDN already sits in front of every request, already manages compression, and already handles cache variants (<i>watch this space for an upcoming announcement post</i>).</p>
    <div>
      <h2>How Cloudflare is building shared dictionary support </h2>
      <a href="#how-cloudflare-is-building-shared-dictionary-support">
        
      </a>
    </div>
    <p>Shared dictionary compression touches every layer of the stack between the browser and the origin. We've seen strong customer interest, and some people have already built their own implementations. One example is RFC author <b>Patrick Meenan</b>'s <a href="https://github.com/pmeenan/dictionary-worker"><u>dictionary-worker</u></a>, which runs the full dictionary lifecycle inside a Cloudflare Worker using WASM-compiled Zstandard. We want to make this accessible to everyone and as easy as possible to implement. So we’re rolling it out across the platform in three phases, starting with the plumbing.</p><p><b>Phase 1</b>: Passthrough support is currently in active development. Cloudflare forwards the headers and encodings that shared dictionaries require, like <code>Use-As-Dictionary</code>, <code>Available-Dictionary</code>, and the <code>dcb</code> and <code>dcz</code> content encodings, without stripping, modifying, or recompressing them. Cache keys are extended to vary on <code>Available-Dictionary</code> and <code>Accept-Encoding</code> so dictionary-compressed responses are cached correctly. This phase serves customers who manage their own dictionaries at the origin.</p>
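<p>To see why the cache has to vary on both headers, consider the hypothetical sketch below. It is not Cloudflare's actual cache key. It assumes a simple preference order for encodings and keys each response on the URL, the negotiated encoding, and the advertised dictionary hash. The function names and example hashes are invented for illustration.</p>

```python
from typing import Optional, Tuple

CacheKey = Tuple[str, str, Optional[str]]


def negotiated_encoding(accept_encoding: str, available_dictionary: Optional[str]) -> str:
    """Pick an encoding; dictionary encodings only apply when a dictionary is advertised."""
    offered = {token.strip().split(";")[0] for token in accept_encoding.split(",")}
    if available_dictionary:
        for candidate in ("dcz", "dcb"):  # assumed preference order
            if candidate in offered:
                return candidate
    for candidate in ("zstd", "br", "gzip"):
        if candidate in offered:
            return candidate
    return "identity"


def cache_key(
    url: str, accept_encoding: str, available_dictionary: Optional[str] = None
) -> CacheKey:
    """Build a cache key that separates variants by encoding and dictionary hash."""
    encoding = negotiated_encoding(accept_encoding, available_dictionary)
    # The dictionary hash only matters when a dictionary encoding is actually used.
    dictionary = available_dictionary if encoding in ("dcz", "dcb") else None
    return (url, encoding, dictionary)


if __name__ == "__main__":
    url = "https://example.com/app.bundle.js"
    modern = "gzip, br, zstd, dcb, dcz"
    print(cache_key(url, modern, ":hash-of-old-version:"))
    print(cache_key(url, modern, ":hash-of-new-version:"))
    print(cache_key(url, "gzip, br"))
```

<p>Mid-deploy, the three example clients land on three different keys for the same URL, which is the variant growth described earlier.</p>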
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2iQek3TT8FRu3VikQIdxDh/11f522c791e10c751f72ac8dcf293ac2/image2.png" />
          </figure><p>We plan to have an open beta of Phase 1 ready by <b>April 30, 2026</b>. To <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Compression_dictionary_transport"><u>use it</u></a>, you'll need to be on a Cloudflare zone with the feature enabled, have an origin that serves dictionary-compressed responses with the correct headers (<code>Use-As-Dictionary</code>, <code>Content-Encoding</code>: <code>dcb</code> or <code>dcz</code>), and your visitors need to be on a browser that advertises <code>dcb</code> or <code>dcz</code> in <code>Accept-Encoding</code> and sends <code>Available-Dictionary</code>. Today, that means Chrome 130+ and Edge 130+, with Firefox support in progress.</p><p>Keep an eye on the <a href="https://developers.cloudflare.com/changelog/"><u>changelog</u></a> to find out when this becomes available and for more documentation on how to use it. </p><p>We’ve already started testing passthrough internally. In a controlled test, we deployed two JS bundles in sequence. They were nearly identical except for a few localized changes, representing successive deploys of the same web application. Uncompressed, the asset is 272KB. Gzip brought that down to 92.1KB, a solid 66% reduction. With shared dictionary compression over DCZ, using the previous version as the dictionary, that same asset dropped to 2.6KB. <b>That's a 97% reduction over the already compressed asset</b>. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4I6emVSFDBqjTBBgjIRWo3/ec4e224fdc5f8873f4758a1d6773a7c1/image7.png" />
          </figure><p>In the same lab test, we measured two timing milestones from the client: time to first byte (TTFB) and full download completion. The TTFB results are interesting for what they don't show. On a cache miss (where DCZ has to compress against the dictionary at the origin), TTFB is only about 20ms slower than gzip. The overhead is near-negligible for transmission.</p><p>The download times are where the difference is. On a cache miss, DCZ completed in 31ms versus 166ms for gzip (an 81% improvement). On a cache hit, it completed in 16ms versus 143ms (an 89% improvement). The response is so much smaller that even when you pay a slight penalty at the start, you finish far ahead.</p><p><i>Initial lab results simulating minimal JS bundle diffs; results will vary based on the actual delta between the dictionary and the asset.</i></p>
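<p>For readers who want to check the percentages, the short Python snippet below recomputes them from the sizes and timings quoted above. Only the numbers come from the lab test; the helper name is ours.</p>

```python
def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (1 - after / before) * 100


results = {
    "gzip vs. uncompressed size (272KB to 92.1KB)": reduction_percent(272, 92.1),
    "DCZ vs. gzip size (92.1KB to 2.6KB)": reduction_percent(92.1, 2.6),
    "DCZ vs. gzip download, cache miss (166ms to 31ms)": reduction_percent(166, 31),
    "DCZ vs. gzip download, cache hit (143ms to 16ms)": reduction_percent(143, 16),
}

if __name__ == "__main__":
    for label, value in results.items():
        print(f"{label}: {value:.0f}%")  # prints 66%, 97%, 81%, 89%
```

<p>Measured against the uncompressed asset rather than the gzip output, the same 2.6KB payload works out to roughly a 99% reduction.</p>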
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sNxMTZlNPMcojhiVc8J9l/b47c7241038c537b90c9c344a6904061/image8.png" />
          </figure><p><b>Phase 2</b>: This is where Cloudflare starts doing the work for you. Instead of handling dictionary headers, compression, and fallback logic on the origin, in this phase you tell Cloudflare which assets should be used as dictionaries via a rule and we manage the rest for you. We inject the <code>Use-As-Dictionary</code> headers, store the dictionary bytes, delta-compress new versions against old ones, and serve the right variant to each client. Your origin serves normal responses. Any dictionary complexity moves off your infrastructure and onto ours.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1DnDwxeA5IHNLhyChs1bI8/747f87388132a3c97a3d8ae8392e3ebf/image1.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3iFPkpibGAoixY6qqZNSKK/2ac29345dfb4782c74f4082ec1c6c9f4/image4.png" />
          </figure><p>We've built a live demo to show what this looks like in practice. <b>Try it here: </b><a href="https://canicompress.com/"><b><u>Can I Compress (with Dictionaries)?</u></b></a><b> </b></p><p>The demo deploys a new ~94KB JavaScript bundle every minute, meant to mimic a typical production single page application bundle. The bulk of the code is static between deploys; only a small configuration block changes each time, which also mirrors real-world deploys where most of the bundle is unchanged framework and library code. When the first version loads, Cloudflare's edge stores it as a dictionary. When the next deploy arrives, the browser sends the hash of the version it already has, and the edge delta-compresses the new bundle against it. The result: 94KB compresses to roughly <b>159 bytes.</b> That's a <b>99.5% reduction over gzip, </b>because the only thing on the wire is the actual diff.</p><p>The demo site includes walkthroughs so you can verify the compression ratios on your own via curl or your browser.</p><p><b>Phase 3</b>: The dictionary is automatically generated on behalf of the website. Instead of customers specifying which assets to use as dictionaries, Cloudflare identifies them automatically. Our network already sees every version of every resource that flows through it, which includes millions of sites, billions of requests, and every new deploy. The idea is that when the network observes a URL pattern where successive responses share most of their content but differ by hash, it has a strong signal that the resource is versioned and a candidate for delta compression. It stores the previous version as a dictionary and compresses subsequent versions against it. No customer configuration. No maintenance.</p><p>The idea is simple, but implementing it is genuinely hard. Safely generating dictionaries that avoid revealing private data and identifying traffic for which dictionaries will offer the most benefit are real engineering problems. But Cloudflare has the right pieces: we see the traffic patterns across the entire network, we already manage the cache layer where dictionaries need to live, and our <a href="https://blog.cloudflare.com/the-rum-diaries-enabling-web-analytics-by-default/"><u>RUM beacon</u></a> to clients can help give us a validation loop to confirm that a dictionary actually improves compression before we commit to serving it. The combination of traffic visibility, edge storage, and client-side validation is what makes automatic generation feasible, though there are still many pieces to figure out.</p><p>The performance and bandwidth benefits of phase 3 are the crux of our motivation. This is what makes shared dictionaries accessible to everyone using Cloudflare, including the millions of zones that would never have had the engineering time to implement custom dictionaries manually. </p>
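<p>The detection signal described for Phase 3 can be illustrated with a toy heuristic. The sketch below is purely hypothetical and is not how Cloudflare's network makes the decision: it treats two successive responses as a delta-compression candidate when their content hashes differ but most of their lines are shared. The 0.8 threshold is an arbitrary assumption.</p>

```python
import difflib
import hashlib


def is_delta_candidate(previous: bytes, current: bytes, min_similarity: float = 0.8) -> bool:
    """Return True when two versions differ by hash but share most of their lines."""
    if hashlib.sha256(previous).digest() == hashlib.sha256(current).digest():
        return False  # identical content: ordinary caching already covers this
    matcher = difflib.SequenceMatcher(
        None, previous.splitlines(), current.splitlines(), autojunk=False
    )
    # ratio() is 2 * matching_lines / total_lines, between 0.0 and 1.0.
    return matcher.ratio() >= min_similarity


if __name__ == "__main__":
    old = b"\n".join(b"line %d" % i for i in range(100))
    new = old.replace(b"line 50", b"line fifty")
    print(is_delta_candidate(old, new))
```

<p>A production system would also need to account for private data, minified single-line bundles, and the cost of storing each candidate, none of which this sketch attempts.</p>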
    <div>
      <h2>The bigger picture</h2>
      <a href="#the-bigger-picture">
        
      </a>
    </div>
    <p>For most of the web's history, compression was stateless. Every response was compressed as if the client had never seen anything before. Shared dictionaries change that: they give compression a memory.</p><p>That matters more now than it would have five years ago. Agentic coding tools are compressing the interval between deploys, while also driving a growing share of the traffic that consumes them. While today AI tools can produce massive diffs, agents are gaining more context and becoming surgical in their code changes. This, coupled with more frequent releases and more automated clients, means more redundant bytes on every request. Delta compression helps both sides of that equation by reducing the number of bytes per transfer, and the number of transfers that need to happen at all.</p><p>Shared Dictionaries took decades to standardize. Cloudflare is helping to build the infrastructure to make it work for every client that touches your site, human or not. Phase 1 beta opens <b>April 30</b>, and we’re excited for you to try it.</p><p>_____</p><p><sup>1 Bots = </sup><a href="https://radar.cloudflare.com/bots?dateRange=28d"><sup><u>~31.3% </u></sup></a><sup>of all HTTP requests. AI = ~</sup><a href="https://radar.cloudflare.com/explorer?dataSet=bots&amp;groupBy=bot_category"><sup><u>29-30%</u></sup></a><sup> of all Bot traffic (March 2026). </sup></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Pingora]]></category>
            <category><![CDATA[Speed]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">1vrgbarIDanwhi6j2m6oNM</guid>
            <dc:creator>Alex Krivit</dc:creator>
            <dc:creator>Edward Wang</dc:creator>
            <dc:creator>Sid Chunduri</dc:creator>
        </item>
        <item>
            <title><![CDATA[Redirects for AI Training enforces canonical content]]></title>
            <link>https://blog.cloudflare.com/ai-redirects/</link>
            <pubDate>Fri, 17 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Soft directives don’t stop crawlers from ingesting deprecated content. Redirects for AI Training allows anybody on Cloudflare to redirect verified crawlers to canonical pages with one toggle and no origin changes. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare's Wrangler CLI has published several major versions over the past six years, each containing at least some critical changes to commands, configuration, or how developers interact with the platform. Like any actively maintained open-source project, we keep documentation for older versions available. The <a href="https://developers.cloudflare.com/workers/wrangler/migration/v1-to-v2/wrangler-legacy/"><u>v1 documentation</u></a> carries a deprecation banner, a <a href="https://developers.google.com/search/docs/crawling-indexing/block-indexing"><u>noindex meta tag</u></a>, and canonical tags pointing to current docs. Every advisory signal says the same thing: this content is outdated, look elsewhere. AI training crawlers don’t reliably honor those signals. </p><p>We use <a href="https://developers.cloudflare.com/ai-crawl-control/"><u>AI Crawl Control</u></a> on <a href="http://developers.cloudflare.com"><u>developers.cloudflare.com</u></a>, so we know that bots in the <a href="https://radar.cloudflare.com/bots/directory?category=AI_CRAWLER"><u>AI Crawler Category</u></a> visited 4.8 million times over the last 30 days, and they consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference. The effect is cumulative because AI agents don't always fetch content live; they draw on trained models. When crawlers ingest deprecated docs, agents inherit outdated foundations.</p><p>Today, we’re launching <a href="https://developers.cloudflare.com/ai-crawl-control/reference/redirects-for-ai-training/"><u>Redirects for AI Training</u></a> to let you enforce that verified AI training crawlers are redirected to up-to-date content. 
Your existing canonical tags become <code>HTTP 301</code> redirects for verified AI training crawlers, automatically, with one toggle, on all paid Cloudflare plans.</p><p>And because status codes are ultimately how the web communicates policy to crawlers, <a href="https://radar.cloudflare.com/ai-insights"><u>Radar's AI Insights</u></a> page now includes <a href="https://radar.cloudflare.com/ai-insights#response-status"><u>Response status code analysis</u></a> showing the various types (<a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#successful_responses"><u>successful</u></a> (<code>2xx</code>), <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#redirection_messages"><u>redirection</u></a> (<code>3xx</code>), <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#client_error_responses"><u>client error</u></a> (<code>4xx</code>), and <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#server_error_responses"><u>server error</u></a> (<code>5xx</code>)) of status codes that AI crawlers receive across all Cloudflare traffic, as a view of how the web responds to AI crawlers today.</p>
    <div>
      <h2>AI training crawlers face dead ends today</h2>
      <a href="#ai-training-crawlers-face-dead-ends-today">
        
      </a>
    </div>
    <p>For search engines, <code>noindex</code> functions as a rich signal system, but there’s no equivalent inline directive a page can carry that says “don’t train on this”. Keeping a deprecated page live with a warning banner may work for humans, who read the notice and navigate on, but AI training crawlers ingest the full text and risk treating the banner as just one more paragraph, returning thousands of times even after the warning is visible.</p><p>Blocking creates its own problem: it produces a void with no signal about what the crawler should learn instead. <code>robots.txt</code> offers limited protection, but as automated traffic grows, maintaining per-crawler, per-path, per-content-update directives requires hefty manual upkeep. What crawlers need is specific direction: “Here is where the current content lives.”</p><p>The <code>&lt;link rel="canonical"&gt;</code> tag is an HTML element defined in <a href="https://www.rfc-editor.org/rfc/rfc6596"><u>RFC 6596</u></a> that tells search engines and automated systems which URL represents the authoritative version of a page. It’s already present on <a href="https://almanac.httparchive.org/en/2025/seo#raw-versus-rendered-canonical-tags"><u>65-69% of web pages</u></a> and is generated automatically by platforms like <a href="https://blog.cloudflare.com/emdash-wordpress/"><u>EmDash</u></a>, WordPress, and Contentful. That infrastructure declares what the current version of your content is, and Redirects for AI Training enforces it.</p>
    <div>
      <h2>How it works</h2>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>Redirects for AI Training operates on two inputs: Cloudflare's <a href="https://developers.cloudflare.com/ruleset-engine/rules-language/fields/reference/cf.verified_bot_category/"><code><u>cf.verified_bot_category</u></code></a> field and the <code>&lt;link rel="canonical"&gt;</code> tags already in your HTML. The <a href="https://developers.cloudflare.com/bots/concepts/bot/verified-bots/#categories"><u>AI Crawler category</u></a> covers bots that crawl for AI model training, including GPTBot, ClaudeBot, and Bytespider, and is distinct from the <a href="https://developers.cloudflare.com/bots/concepts/bot/verified-bots/#categories"><u>AI Assistant</u></a> and <a href="https://developers.cloudflare.com/bots/concepts/bot/verified-bots/#categories"><u>AI Search</u></a> categories that cover AI Agents.</p><p>When a request arrives from a verified AI Crawler, Cloudflare reads the response HTML. If a non-self-referencing canonical tag is present, Cloudflare issues a <code>301 Moved Permanently</code> to the canonical URL instead of returning the original response. Human traffic, search indexing, and other automated traffic are unaffected.</p><p>Here’s what the exchange looks like for a GPTBot request to a deprecated path:</p>
            <pre><code>GET /durable-objects/api/legacy-kv-storage-api/ HTTP/1.1
Host: developers.cloudflare.com
User-Agent: Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)
</code></pre>
            
            <pre><code>HTTP/1.1 301 Moved Permanently
Location: https://developers.cloudflare.com/durable-objects/api/sqlite-storage-api/
</code></pre>
            
    <div>
      <h3>What this does not do</h3>
      <a href="#what-this-does-not-do">
        
      </a>
    </div>
    <p>It doesn't retroactively correct training data already ingested or cover unverified crawlers outside the AI Crawler bot category. Humans and AI Agents visiting deprecated pages will not be redirected. We also exclude cross-origin canonicals by design (tags directing to preferred URLs on different domains), since they’re often used for domain consolidation rather than content freshness. To avoid loops, self-referencing canonicals (a tag on a page pointing to its own URL) don't trigger a redirect either.</p>
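<p>Taken together, these rules amount to a small decision function. The Python sketch below is our own illustration, not Cloudflare's implementation: it assumes the bot category arrives as the string <code>"AI Crawler"</code>, uses the standard library's HTML parser to find the first canonical link, and returns a redirect target only for same-origin, non-self-referencing canonicals. The class and function names are hypothetical.</p>

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin, urlsplit

AI_CRAWLER = "AI Crawler"  # assumed label for the verified bot category


class CanonicalFinder(HTMLParser):
    """Collect the href of the first link element whose rel includes canonical."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        rels = (attributes.get("rel") or "").lower().split()
        if "canonical" in rels and attributes.get("href"):
            self.href = attributes["href"]


def redirect_target(request_url: str, html: str, bot_category: Optional[str]) -> Optional[str]:
    """Return the URL to 301 to, or None when the response should be served as-is."""
    if bot_category != AI_CRAWLER:
        return None  # humans, AI Assistant, AI Search, and unverified bots are untouched
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.href is None:
        return None
    canonical = urljoin(request_url, finder.href)
    requested, target = urlsplit(request_url), urlsplit(canonical)
    if (target.scheme, target.netloc) != (requested.scheme, requested.netloc):
        return None  # cross-origin canonicals are excluded by design
    if (target.path, target.query) == (requested.path, requested.query):
        return None  # self-referencing canonical: redirecting would loop
    return canonical
```

<p>Called with the deprecated Durable Objects URL from the example above, a page whose canonical tag points at the SQLite storage API, and the AI Crawler category, the function returns the SQLite storage API URL; for any other category it returns <code>None</code>.</p>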
    <div>
      <h3>Why not just use redirect rules? </h3>
      <a href="#why-not-just-use-redirect-rules">
        
      </a>
    </div>
    <p><a href="https://developers.cloudflare.com/rules/url-forwarding/single-redirects/"><u>Single Redirect Rules</u></a> can target AI crawlers by user-agent string, and if a site has just a handful of known deprecated paths, that works. But it doesn't scale: every new deprecated path requires a change to the rule, user-agents must be manually tracked, and each rule counts against <a href="https://developers.cloudflare.com/rules/url-forwarding/#availability"><u>plan limits</u></a> that might otherwise be used for campaign URLs or domain migrations. Redirect rules also manually re-encode what canonical tags already declare, and they fall out of sync as content changes.</p>
    <div>
      <h2>What we found on our own documentation site</h2>
      <a href="#what-we-found-on-our-own-documentation-site">
        
      </a>
    </div>
    <p>Our own experience shows that this problem is real. We run AI Crawl Control on <a href="http://developers.cloudflare.com"><u>developers.cloudflare.com</u></a> using the same dashboard available to all Cloudflare customers. In March 2026, legacy Workers documentation was crawled around 46,000 times by OpenAI, 3,600 times by Anthropic, and 1,700 times by Meta. </p><p>That crawling of deprecated pages may be why, when we asked a leading AI assistant in April 2026, "How do I write KV values using the Wrangler CLI?", it gave an out-of-date answer: "You write to Cloudflare KV via the Wrangler CLI using the kv:key put command."</p><p>In fact, the correct syntax (as of April 2026) is <code>wrangler kv key put</code>; the colon syntax (<code>kv:key put</code>) was deprecated in Wrangler 3.60.0. Our documentation <a href="https://developers.cloudflare.com/kv/reference/kv-commands/#deprecations"><u>carries an inline deprecation notice</u></a>, but it's unclear how training pipelines interpret it. </p><p>So we enabled Redirects for AI Training on developers.cloudflare.com and measured the response. In the first seven days, 100% of AI training crawler requests to pages with non-self-referencing canonical tags were redirected rather than served deprecated content. </p><p>We expect that redirecting crawlers to current content will eventually improve AI-generated answers about legacy tools. Given the closed nature of training pipelines and variability in recrawl timing, this is a hypothesis we will continue to verify. But what the crawler receives at the point of access has seen immediate improvement.</p>
    <div>
      <h2>How to enable</h2>
      <a href="#how-to-enable">
        
      </a>
    </div>
    <p>If your site has canonical tags, your existing content hierarchy can now be enforced for verified AI training crawlers. Cloudflare's <a href="https://developers.cloudflare.com/bots/concepts/bot/verified-bots/"><u>verified bot classification</u></a> handles crawler identification automatically.</p><p><b>In the dashboard:</b> on any domain, go to <b>AI Crawl Control &gt; Quick Actions &gt; Redirects for AI training &gt; toggle on. </b></p><p>For path-specific control via Configuration Rules and Cloudflare for SaaS, see the <a href="https://developers.cloudflare.com/ai-crawl-control/reference/redirects-for-ai-training/"><u>full documentation</u></a>.</p>
    <div>
      <h2>How the web responds to AI crawlers</h2>
      <a href="#how-the-web-responds-to-ai-crawlers">
        
      </a>
    </div>
    <p>Redirects for AI Training turns one status code, <code>301 Moved Permanently</code>, into an enforcement mechanism for your content policy. But <code>301</code> is one signal in a broader conversation between origins and crawlers. A <code>200 OK</code> means content was served. A <code>403 Forbidden</code> means access was blocked. A <code>402 Payment Required</code> <a href="https://blog.cloudflare.com/introducing-ai-crawl-control/#using-http-402-to-help-publishers-license-content-to-ai-crawlers"><u>tells the client it needs to pay for access</u></a>. Taken together, the distribution of status codes across AI crawler traffic reveals how the web is actually responding to crawlers at scale.</p><p>Radar’s <a href="https://radar.cloudflare.com/ai-insights"><u>AI Insights page</u></a> now includes a <a href="https://radar.cloudflare.com/ai-insights#response-status"><u>Response status code analysis</u></a> graph illustrating the distribution of the top response status codes or response status code <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status"><u>groupings</u></a> (selectable via a dropdown) for AI crawler traffic. The data can be filtered by industry set; the crawl purpose filter can also be applied in Data Explorer. Filtered analyses provide a perspective into whether certain types of crawlers behave differently, or if request patterns and distributions vary by industry.</p><p>In the general example shown below, we can see that for the time period covered by the graph, just over 70% of requests were serviced successfully (<code>200</code>), while 10.1% of the requests were redirected (<code>301</code>, <code>302</code>) to another URL, and 3.7% were for files that weren’t found (<code>404</code>). Access to content was blocked for 8.3% of requests, receiving a <code>403</code> response status code. 
Grouped, we find that nearly 74% of requests received <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#successful_responses"><u>successful responses</u></a> (<code>2xx</code>), 13.7% received <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#client_error_responses"><u>client error responses</u></a> (<code>4xx</code>), 11.3% received <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#redirection_messages"><u>redirection messages</u></a> (<code>3xx</code>), and 1.2% were sent <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#server_error_responses"><u>server error responses</u></a> (<code>5xx</code>).</p>
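    <p>The grouped view is simply the per-code view rolled up by the first digit of the status code. A minimal Python sketch of that roll-up follows; the request counts are made up for illustration and are not Radar data, and <code>class_shares</code> is a hypothetical helper name.</p>

```python
from collections import Counter

CLASS_NAMES = {2: "2xx", 3: "3xx", 4: "4xx", 5: "5xx"}


def class_shares(status_counts):
    """Roll per-status request counts up into percentage shares per status class."""
    totals = Counter()
    for status, count in status_counts.items():
        totals[CLASS_NAMES.get(status // 100, "other")] += count
    grand_total = sum(totals.values())
    return {name: round(100 * n / grand_total, 1) for name, n in totals.items()}


# Made-up counts (NOT Radar data), chosen only to exercise the function.
sample = {200: 700, 204: 40, 301: 80, 302: 30, 403: 90, 404: 40, 500: 20}
print(class_shares(sample))
```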
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4zPrHtLf1BbHxQXQHTK1Qs/21a75531129332d749210d67b2c330ad/BLOG-3263_2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/jPGKM051x7ZgzZSaTAo8v/e20c1e34a1128279157bc8bd1921a8fe/BLOG-3263_3.png" />
          </figure><p>This analysis has also been added to <a href="https://radar.cloudflare.com/bots/directory"><u>individual bot pages</u></a> to provide insight into this aspect of a crawler’s behavior. In the GPTBot example shown below, we can see that for the time period covered by the graph, just over 80% of requests were serviced successfully (<code>200</code>), while 4.7% of the requests were redirected (<code>301</code>, <code>302</code>) to another URL, and just 2.7% were for files that weren’t found (<code>404</code>). Nearly 6% were blocked, with Cloudflare returning a <code>403</code> response status code. Grouped, we find that 83% of requests received <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#successful_responses"><u>successful responses</u></a> (<code>2xx</code>), nearly 10% received <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#client_error_responses"><u>client error responses</u></a> (<code>4xx</code>), 5.1% received <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#redirection_messages"><u>redirection messages</u></a> (<code>3xx</code>), and the remaining 2.2% got <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status#server_error_responses"><u>server error responses</u></a> (<code>5xx</code>).</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5tvNsLUUCQlblPCmHolbUk/88a46b05788fa00cd5ec582d54622d4d/BLOG-3263_4.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JlXMlLFpbUFKL85zP8AqK/b9014f55efc2e0203899a359632fb73c/BLOG-3263_5.png" />
          </figure><p>As noted above, Radar’s Data Explorer enables users to drill down further into the data by applying additional filters. For example, we can look at things like <a href="https://radar.cloudflare.com/explorer?dataSet=ai.bots&amp;groupBy=user_agent&amp;dt=28d&amp;filters=responseStatus%253D404"><u>which crawlers</u></a> are requesting the most non-existent content (resulting in a <code>404</code> response status code), and how that request traffic trends over time, or <a href="https://radar.cloudflare.com/explorer?dataSet=ai.bots&amp;groupBy=industry&amp;dt=28d&amp;filters=crawlPurpose%253DTraining%252CresponseStatusCategory%253DREDIRECTION"><u>which industries</u></a> are sending the most <b>Redirection</b> (<code>3xx</code>) response status codes to <b>Training</b> crawlers, and how that activity trends over time. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Bi7FKcZ79I4OmG7NTbadq/bb40e5397f615727c23d0d67310c869b/BLOG-3263_6.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5R1PEQIY1zW1uj4MHsj8pm/977adfff200fecbb761660d8e0fda55a/BLOG-3263_7.png" />
          </figure><p>Response status code data, both in aggregate and on a per-bot basis, is also available through the <a href="https://developers.cloudflare.com/api/resources/radar/subresources/ai/subresources/bots"><u>Cloudflare Radar API</u></a>.</p><p><a href="https://developers.cloudflare.com/ai-crawl-control/reference/redirects-for-ai-training/"><u>Redirects for AI Training</u></a> lets you shape what crawlers receive from your origin; Radar's status code analysis lets you see how the rest of the web is doing the same. Enable Redirects for AI Training in <a href="https://dash.cloudflare.com/?to=/:account/:zone/ai"><u>AI Crawl Control &gt; Overview &gt; Quick Actions</u></a> to start replacing advisory signals with enforced outcomes on your site today.</p><p><i>Have questions or want to share what you're seeing? Join the discussion on the </i><a href="https://community.cloudflare.com"><i><u>Cloudflare Community</u></i></a><i> or find us on </i><a href="https://discord.cloudflare.com"><i><u>Discord</u></i></a><i>.</i></p>
    <p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Radar]]></category>
            <category><![CDATA[Bot Management]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4Gi8lGLqjdsjywAFEKLLYW</guid>
            <dc:creator>Cam Whiteside</dc:creator>
            <dc:creator>David Belson</dc:creator>
            <dc:creator>André Cruz</dc:creator>
        </item>
        <item>
            <title><![CDATA[Unweight: how we compressed an LLM 22% without sacrificing quality]]></title>
            <link>https://blog.cloudflare.com/unweight-tensor-compression/</link>
            <pubDate>Fri, 17 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Running LLMs across Cloudflare’s network requires us to be smarter and more efficient about GPU memory bandwidth. That’s why we developed Unweight, a lossless inference-time compression system that achieves up to a 22% model footprint reduction, so that we can deliver faster and cheaper inference than ever before. ]]></description>
            <content:encoded><![CDATA[ <p>Running inference within 50ms of 95% of the world's Internet-connected population means being ruthlessly efficient with GPU memory. Last year we improved memory utilization with <a href="https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/"><u>Infire</u></a>, our Rust-based inference engine, and eliminated cold starts with <a href="https://blog.cloudflare.com/how-cloudflare-runs-more-ai-models-on-fewer-gpus/"><u>Omni</u></a>, our model scheduling platform. Now we are tackling the next big bottleneck in our inference platform: model weights.</p><p>Generating a single token from an LLM requires reading every model weight from GPU memory. On the NVIDIA <a href="https://www.nvidia.com/en-us/data-center/h100/"><u>H100 GPUs</u></a> we use in many of our datacenters, the tensor cores can process data nearly 600 times faster than memory can deliver it, so the bottleneck is memory bandwidth, not compute. Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller.</p><p>To solve this problem, we built <i>Unweight</i>: a lossless compression system that makes model weights 15–22% smaller while preserving bit-exact outputs, without relying on any special hardware. The core breakthrough here is that decompressing weights in fast on-chip memory and feeding them directly to the tensor cores avoids an extra round-trip through slow main memory. 
Depending on the workload, Unweight’s runtime selects from multiple execution strategies (some prioritize simplicity, others minimize memory traffic), and an autotuner picks the best one per weight matrix and batch size.</p><p>This post dives into how Unweight works, but in the spirit of greater transparency and encouraging innovation in this rapidly developing space, we’re also publishing a <a href="https://research.cloudflare.com/nikulin2026"><u>technical paper</u></a> and open-sourcing the <a href="https://github.com/cloudflareresearch/unweight-kernels"><u>GPU kernels</u></a>.</p><p>Our initial results on Llama 3.1 8B show ~30% compression of Multi-Layer Perceptron (MLP) weights alone. Because Unweight compresses only a subset of the model's parameters, this works out to a 15–22% reduction in model size and ~3 GB of VRAM savings. As shown in the graphic below, this enables us to squeeze more out of our GPUs and thus run more models in more places, making inference cheaper and faster on Cloudflare’s network.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4aOVSF1i241gGUwrPrhLfC/21a4f5984b89e56b1ec637f8dbe3c794/1.png" />
          </figure><p><sup><i>Thanks to Unweight, we’re able to fit more models on a single GPU </i></sup></p>
    <div>
      <h2>Why compression is harder than it sounds</h2>
      <a href="#why-compression-is-harder-than-it-sounds">
        
      </a>
    </div>
    <p>There is a growing body of research exploring how to compress model weights in creative ways to make inference faster and/or run on smaller GPUs. The most common is quantization, a technique to reduce the size of model weights and activations by converting large 32- or 16-bit floating point numbers to smaller 8- or 4-bit integers. This is a form of lossy compression: different 16-bit floating point values can be converted to the same 4-bit integer. This reduction in accuracy affects the quality of responses in unpredictable ways. For production inference serving diverse use cases, we knew we wanted something lossless that preserves exact model behaviour.</p><p>Several recent systems (<a href="https://arxiv.org/abs/2502.00922"><u>Huff-LLM</u></a>, <a href="https://arxiv.org/abs/2411.05239"><u>ZipNN</u></a>, and <a href="https://arxiv.org/abs/2603.17435"><u>ZipServ</u></a>) have shown that LLM weights can be compressed significantly, but these approaches target different problems than ours. ZipNN compresses weights for distribution and storage, with decompression happening on the CPU. Huff-LLM proposes custom <a href="https://en.wikipedia.org/wiki/Field-programmable_gate_array"><u>FPGA</u></a> hardware for decoding. And ZipServ does fuse decompression with GPU inference, but it targets consumer-grade GPUs, so it does not carry over to our H100 GPUs. None of these gave us what we needed: lossless inference-time decompression on Hopper GPUs that can integrate with our Rust-based <a href="https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/"><u>inference engine</u></a>.</p><p>The core challenge isn't vanilla compression: exponent bytes in BF16 weights are highly redundant, so entropy coding works well on them. The challenge is decompressing fast enough that it doesn't slow down inference. On an H100, the tensor cores sit idle waiting for memory most of the time, but that idle capacity can't simply be repurposed for decompression. 
Each GPU compute unit can run either the decompression kernel or the matrix multiplication kernel, not both simultaneously, due to shared memory constraints. Any decode latency that isn't perfectly overlapped with the matrix multiplication becomes directly additive to token latency. Unweight's answer is to decompress weights in fast on-chip shared memory and feed the results directly to the tensor cores — but making that work efficiently across different batch sizes and weight shapes is where the real engineering lives.</p>
    <div>
      <h2>How model weights can be compressed effectively </h2>
      <a href="#how-model-weights-can-be-compressed-effectively">
        
      </a>
    </div>
    <p>Every number in an AI model is stored as a 16-bit "brain float" (BF16). Each BF16 value has three parts:</p><ul><li><p><b>Sign</b> (1 bit): positive or negative</p></li><li><p><b>Exponent</b> (8 bits): the magnitude </p></li><li><p><b>Mantissa</b> (7 bits): the precise value within that magnitude</p></li></ul><p>Here’s how one of these weights breaks down: </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7vTyA6tepWaOJ8leVIhqqO/79b7469c7ef383f3f9ad6c688d6edc0f/2.png" />
          </figure><p>The sign and mantissa vary unpredictably across weights — they look like random data and can't be meaningfully compressed. But the exponent tells a different story.</p>
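    <p>The three fields can be pulled apart with a few shifts and masks. The Python sketch below is only an illustration of the bit layout listed above; the helper names are ours, and <code>float_to_bf16_bits</code> truncates a float32 rather than rounding it, which is a simplification.</p>

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Keep the upper 16 bits of the float32 encoding (truncation, no rounding)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def split_bf16(bits: int):
    """Split a 16-bit BF16 pattern into (sign, exponent, mantissa)."""
    sign = (bits >> 15) & 0x1       # 1 bit
    exponent = (bits >> 7) & 0xFF   # 8 bits
    mantissa = bits & 0x7F          # 7 bits
    return sign, exponent, mantissa


def join_bf16(sign: int, exponent: int, mantissa: int) -> int:
    """Inverse of split_bf16."""
    return (sign << 15) | (exponent << 7) | mantissa


for value in (1.0, -0.15625):
    bits = float_to_bf16_bits(value)
    sign, exponent, mantissa = split_bf16(bits)
    print(f"{value}: bits=0x{bits:04X} sign={sign} exponent={exponent} mantissa={mantissa}")
```

    <p>Note that sign (1 bit) plus mantissa (7 bits) fit exactly in one byte, which leaves the exponent as a separate byte that can be compressed on its own.</p>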
    <div>
      <h2>The exponent is surprisingly predictable</h2>
      <a href="#the-exponent-is-surprisingly-predictable">
        
      </a>
    </div>
    <p>Prior research has established that across trained LLMs, out of 256 possible exponent values, just a handful dominate. The top 16 most common exponents cover over 99% of all weights in a typical layer. The entropy of this distribution is only ~2.6 bits per exponent, far less than the 8 bits allocated.</p><p><b>Exponent value distribution in a typical LLM layer</b></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ejPYP6ALeUkkta8hfA3fA/aab913f50afb38fdb9110b6e4daa3d25/3.png" />
          </figure><p>This is the redundancy that Unweight exploits. We leave the sign and mantissa untouched and compress only the exponent byte using <a href="https://en.wikipedia.org/wiki/Huffman_coding"><u>Huffman coding</u></a>, a classic technique that assigns short codes to common values and longer codes to rare ones. Because the exponent distribution is so skewed, this shrinks the weights it is applied to by roughly 30%. We apply it selectively to the MLP weight matrices (gate, up, and down projections), which make up roughly two-thirds of a model’s parameters and dominate memory traffic during token generation. Attention weights, embeddings, and layer norms are left uncompressed. All told, this translates to about a 20% reduction in overall model size, as explained in full detail in our technical report.</p><p>The small number of weights with rare exponents are handled separately: if any weight in a row of 64 has an exponent outside the top-16 palette, the entire row is stored verbatim. This approach eliminates per-element branching in the hot path: instead of checking every single weight for edge cases, we make one decision per row up front.</p>
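    <p>To see why a skewed histogram compresses well, the Python sketch below computes the Shannon entropy of an exponent histogram and the average code length of a Huffman code built from it. The histogram is synthetic and made up for illustration, not measured from any model, and the code is a textbook Huffman construction rather than Unweight's encoder. As a rough guide, if exponents cost about 2.6 bits instead of 8, a 16-bit weight shrinks to about 10.6 bits, or roughly a third smaller before any format overhead.</p>

```python
import heapq
import math
from collections import Counter


def entropy_bits(counts) -> float:
    """Shannon entropy of the histogram, in bits per symbol."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code over the histogram."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items are (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


def average_bits(counts, lengths) -> float:
    """Expected code length in bits per symbol."""
    total = sum(counts.values())
    return sum(counts[s] * lengths[s] for s in counts) / total


# Synthetic exponent histogram (NOT measured from a real model).
histogram = Counter({124: 40, 123: 25, 125: 15, 122: 10, 126: 5, 121: 3, 120: 2})
lengths = huffman_code_lengths(histogram)
print(f"entropy: {entropy_bits(histogram):.2f} bits/exponent")
print(f"Huffman average: {average_bits(histogram, lengths):.2f} bits/exponent (vs 8 raw)")
```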
    <div>
      <h2>The GPU memory bottleneck</h2>
      <a href="#the-gpu-memory-bottleneck">
        
      </a>
    </div>
    <p>An NVIDIA H100 GPU has two relevant kinds of memory:</p><ul><li><p><b>High Bandwidth Memory</b> (HBM): large, but relatively slow to access. This is where model weights live.</p></li><li><p><b>Shared memory</b> (SMEM): tiny, but extremely fast. This is where the GPU stages data right before doing math.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2H57YtBN941gTST60fH7su/b66ed530c29a9c5ea20fae86467a7ca7/4.png" />
          </figure><p><sup><i>During inference, generating each token requires reading the full weight matrix from HBM. The memory bus between HBM and SMEM is the performance bottleneck, not the math itself. Fewer bytes across the bus = faster token generation.</i></sup></p><p>Because every token requires reading the full weight matrix from HBM, the memory bus sets the pace: the H100's tensor cores can crunch numbers far faster than HBM can feed them data. Compression helps because fewer bytes need to cross the bus. But there's a catch: the GPU can't do math on compressed data. The weights must be decompressed first.</p><p>Most prior work decompresses entire weight matrices back into HBM, then runs a standard matrix multiplication. This helps with storage capacity but doesn't help with bandwidth because you still read the full uncompressed matrix from HBM for every token.</p>
    <div>
      <h2>Four ways to use compressed weights</h2>
      <a href="#four-ways-to-use-compressed-weights">
        
      </a>
    </div>
    <p>There's no single best way to use compressed weights during inference. The right approach depends on the workload — the batch size, the shape of the weight matrix, and how much GPU time is available for decompression. Unweight offers four compressed execution pipelines, each with a different balance between decompression effort and computation complexity: a full Huffman decode, exponent-only decode, palette transcode, or skipping pre-processing completely.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12UwJkO7buPrdzTOQOupUA/2f0bd7b3bc97f04c8958472f904d5682/5.png" />
          </figure><p><sup><i>Four different execution pipelines </i></sup></p><p>The four pipelines form a spectrum. At one end, full decode completely reconstructs the original BF16 weights and hands them to NVIDIA’s cuBLAS library for a standard matrix multiplication. This is the simplest path with cuBLAS running at full speed on ordinary data, but the preprocess step writes the most bytes back to main memory. It works well at small batch sizes where the matrix multiplication is tiny and custom kernel overhead dominates. At the other end, direct palette skips preprocessing entirely. Weights are pre-transcoded to a compact 4-bit format at model load time, and the matrix multiplication kernel reconstructs BF16 values on the fly from these indices. Zero preprocess cost, but the kernel does more work per element.</p><p>In between sit two independent paths: one that decodes only the exponent bytes (halving preprocess traffic), and one that transcodes to 4-bit palette indices at runtime (quartering it). Both use a reconstructive matrix multiplication — a custom kernel that loads compressed data, reconstructs BF16 in fast shared memory, and feeds it directly to the tensor cores without a round-trip through main memory.</p>
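    <p>The palette idea can be sketched in a few lines of Python: each exponent is replaced by a 4-bit index into a 16-entry palette, two indices are packed per byte, and the sign+mantissa byte is kept as-is. The sketch below is a CPU-side illustration under our own assumptions, not the Unweight kernel. The function names and the tagged-tuple row format are hypothetical, and a row containing a rare exponent is simply kept verbatim in full.</p>

```python
from collections import Counter

ROW = 64  # weights per row; one verbatim-or-palette decision is made per row


def build_palette(weights, size=16):
    """The `size` most common exponent values across all weights."""
    counts = Counter((w >> 7) & 0xFF for w in weights)
    return [exp for exp, _ in counts.most_common(size)]


def encode_row(row, palette):
    index_of = {exp: i for i, exp in enumerate(palette)}
    exps = [(w >> 7) & 0xFF for w in row]
    if any(e not in index_of for e in exps):
        return ("verbatim", list(row))  # rare exponent: keep the row untouched
    # One byte per weight: sign in bit 7, mantissa in bits 0-6.
    sign_mantissa = bytes(((w >> 8) & 0x80) | (w & 0x7F) for w in row)
    idx = [index_of[e] for e in exps]
    # Two 4-bit palette indices per byte.
    packed = bytes((idx[i] << 4) | idx[i + 1] for i in range(0, len(idx), 2))
    return ("palette", sign_mantissa, packed)


def decode_row(encoded, palette):
    if encoded[0] == "verbatim":
        return list(encoded[1])
    _, sign_mantissa, packed = encoded
    out = []
    for i, sm in enumerate(sign_mantissa):
        byte = packed[i // 2]
        idx = (byte >> 4) if i % 2 == 0 else (byte & 0x0F)
        out.append(((sm & 0x80) << 8) | (palette[idx] << 7) | (sm & 0x7F))
    return out


# Deterministic, made-up BF16 bit patterns using 16 frequent exponent values.
common = list(range(112, 128))
weights = [((i & 1) << 15) | (common[i % 16] << 7) | ((i * 37) & 0x7F)
           for i in range(ROW * 4)]
weights[5] = (weights[5] & 0x807F) | (200 << 7)  # plant one rare exponent in row 0
palette = build_palette(weights)
rows = [weights[i:i + ROW] for i in range(0, len(weights), ROW)]
encoded = [encode_row(r, palette) for r in rows]
print([e[0] for e in encoded])
```

    <p>In this toy format a palette row takes 64 sign+mantissa bytes plus 32 index bytes, 96 bytes instead of 128, and decoding needs only a table lookup and two shifts per weight.</p>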
    <div>
      <h3>Why no single pipeline wins</h3>
      <a href="#why-no-single-pipeline-wins">
        
      </a>
    </div>
    <p>Less preprocessing means less data written to HBM, which frees the memory bus sooner. But it shifts more reconstruction work onto the matmul kernel. Whether that tradeoff pays off depends on the situation.</p><p>With small batch sizes (roughly 1–64 tokens), the matmul is tiny, so there isn't much computation to overlap with, and the fixed costs of a custom kernel dominate. Full decode + cuBLAS often wins simply because cuBLAS has lower overhead. With large batch sizes (256+ tokens), the matmul runs long enough to absorb the extra reconstruction work. A lighter preprocess finishes faster, and the freed-up bus bandwidth and compute overlap pay off. The palette or exponent pipelines pull ahead. Different weight matrices within the same layer can favor different pipelines. The "gate" and "up" projections have different dimensions than the "down" projection, which changes the order of operations within the matmul and shifts the performance tradeoffs.</p><p><b>Throughput vs pipeline strategy</b></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3jthcdROyaeOdjDxuqG3ZD/29220083376fb26580a44f8f2d8057c8/6.png" />
          </figure><p>This is why Unweight doesn't hard-code a single strategy. The runtime picks the best pipeline for each weight matrix at each batch size, informed by an autotuning process that measures actual end-to-end throughput on the target hardware (more on this below).</p>
    <div>
      <h2>How the reconstructive matmul works</h2>
      <a href="#how-the-reconstructive-matmul-works">
        
      </a>
    </div>
    <p>Three of the four pipelines use a custom matrix multiplication kernel that fuses decompression with computation. This kernel loads compressed data from HBM, reconstructs the original BF16 values in shared memory, and feeds them directly into the tensor cores — all in one operation. The reconstructed weights never exist in main memory.</p><p><b>Traditional decompression vs Unweight</b></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2OqxSbLVN2V90aTof0IKyb/322f45956be83ecfcbd484558a5dd7e1/7.png" />
          </figure><p><i>With Unweight, ~30% fewer bytes cross the memory bus for MLP weight matrices</i></p><p>Inside this kernel, the GPU's thread groups are split into two roles:</p><ul><li><p>A <b>producer</b> group loads compressed inputs from HBM into shared memory using dedicated memory-copy hardware (TMA). It stages sign+mantissa bytes, exponent data (or palette indices), and – for rows with rare exponents – the verbatim exponent rows. It runs ahead of the consumer, filling a circular buffer so data is ready before it's needed.</p></li><li><p><b>Consumer</b> groups reconstruct BF16 values by combining exponents with sign+mantissa bytes, then immediately feed the result into Hopper's WGMMA tensor-core instructions. The reconstructed weights go straight from assembly to computation without leaving shared memory.</p></li></ul><p>The reconstructive matmul comes in multiple variants, differing in how many output tiles each compute unit handles and how deep the circular buffer runs. Wider output tiles improve data reuse at large batch sizes; deeper buffers hide memory latency at small batch sizes. The autotuner selects the best variant per workload.</p>
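    <p>The producer/consumer split is a classic bounded-buffer pattern. The Python sketch below is only a CPU analogy using threads and a bounded queue: it does not model TMA, WGMMA, or shared memory, and the function name <code>run_pipeline</code> and the tile format are hypothetical. It shows the shape of the idea: the producer runs ahead until the buffer is full, and the consumer reconstructs BF16 bit patterns from staged bytes as soon as a tile is available.</p>

```python
import queue
import threading


def run_pipeline(tiles, depth=4):
    """Stream (sign_mantissa_bytes, exponent_bytes) tiles through a bounded buffer."""
    ring = queue.Queue(maxsize=depth)  # producer blocks once `depth` tiles are staged
    results = []

    def producer():
        for tile in tiles:
            ring.put(tile)   # stands in for an asynchronous HBM-to-SMEM copy
        ring.put(None)       # sentinel: nothing left to stage

    def consumer():
        while True:
            tile = ring.get()
            if tile is None:
                break
            sign_mantissa, exponents = tile
            # Reassemble each 16-bit pattern: sign | exponent | mantissa.
            bf16 = [((sm & 0x80) << 8) | (e << 7) | (sm & 0x7F)
                    for sm, e in zip(sign_mantissa, exponents)]
            results.append(bf16)  # a real kernel would feed the tensor cores here

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Two made-up tiles of two weights each.
demo_tiles = [(bytes([0x80, 0x01]), bytes([127, 126])),
              (bytes([0x00, 0x7F]), bytes([124, 124]))]
print([[hex(v) for v in tile] for tile in run_pipeline(demo_tiles, depth=1)])
```

    <p>The <code>depth</code> argument plays the role of the circular buffer depth: a deeper buffer lets the producer get further ahead and hide more load latency, at the cost of more staging memory.</p>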
    <div>
      <h2>Sharing the GPU between decoding and computation</h2>
      <a href="#sharing-the-gpu-between-decoding-and-computation">
        
      </a>
    </div>
    <p>In the two fused pipelines, a separate preprocess kernel (Huffman decoder or palette transcoder) runs concurrently with the reconstructive matmul. But these kernels compete for GPU resources.</p><p>On Hopper, each compute unit (SM) has 228 KB of shared memory. The reconstructive matmul needs ~227 KB for its pipeline buffer and accumulator tiles. A decode kernel needs ~16 KB for its Huffman lookup table. Since 227 + 16 &gt; 228, these two kernels <i>cannot share the same compute unit</i>. Every SM assigned to decoding is one fewer SM available for the matmul.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JiEsj6QS7ntozWQJz8YH9/f207e0f863191ac245bd0f55839635c3/8.png" />
          </figure><p>This creates a balancing act: more decode SMs means faster preprocessing but slower matrix multiplication, and vice versa. The optimal split is another tunable parameter — and another reason why the autotuner measures real throughput rather than relying on heuristics.</p>
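    <p>A toy model makes the balancing act visible. The Python sketch below checks the shared-memory budget from the figures above and then searches for the SM split that minimizes the slower of two overlapped kernels. The cost model (time equals work divided by SM count, with perfect overlap), the work units, and the 132-SM total are assumptions for illustration; the real autotuner measures throughput instead of using a formula like this.</p>

```python
SMEM_PER_SM_KB = 228  # per-SM shared memory on Hopper, as stated above


def can_share_sm(*kernel_smem_kb) -> bool:
    """True if the kernels' shared-memory needs fit on one SM together."""
    return sum(kernel_smem_kb) <= SMEM_PER_SM_KB


def best_split(total_sms, decode_work, matmul_work):
    """Toy model: both kernels run concurrently, so elapsed time is the slower one."""
    best = None
    for decode_sms in range(1, total_sms):
        elapsed = max(decode_work / decode_sms,
                      matmul_work / (total_sms - decode_sms))
        if best is None or elapsed < best[1]:
            best = (decode_sms, elapsed)
    return best


print(can_share_sm(227, 16))  # matmul (~227 KB) + decoder (~16 KB) on one SM?
# Hypothetical work units; 132 SMs is an assumed total for the GPU.
print(best_split(total_sms=132, decode_work=32, matmul_work=100))
```

    <p>Even in this simplified model, the best split moves whenever the ratio of decode work to matmul work changes, which happens with every batch size and matrix shape.</p>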
    <div>
      <h2>Pipelining across layers</h2>
      <a href="#pipelining-across-layers">
        
      </a>
    </div>
    <p>Even with the SM partitioning constraint, Unweight hides much of the decompression cost by exploiting the structure of transformer models.</p><p>Not every layer needs Huffman decoding at runtime. Unweight classifies layers as "hard" (requiring Huffman preprocessing) or "easy" (using pre-transcoded palette data that the matmul can consume directly). The runtime alternates between them:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4j8l78Iivhl6paLMudctu6/6b34b6f4726d7a00bc45acd0d206ed0f/9.png" />
          </figure><p><sup><i>Decode runs on separate CUDA streams during bootstrap, attention, and easy MLP compute. By the time a hard layer's MLP runs, its preprocessed weights are already waiting</i></sup></p><p>While the GPU computes an easy layer — which needs no preprocessing — a separate set of CUDA streams is decoding the next hard layer's weights in the background. By the time the easy layers finish and the hard layer's turn arrives, its preprocessed data is already waiting. Double-buffered preprocess slots ensure that decode output from one hard layer isn't overwritten while it's still being consumed.</p><p>The down projection benefits most from this overlap: it's consumed last in the MLP sequence (after gate, activation, and up), so its decode has the longest runway to complete.</p>
    <div>
      <h2>Autotuning</h2>
      <a href="#autotuning">
        
      </a>
    </div>
    <p>With four pipelines, multiple matmul kernel variants, and a tunable SM split between decoding and computation, the configuration space is large. Rather than hard-coding a single strategy, Unweight uses an autotuner that measures actual end-to-end inference throughput on the target hardware. It sweeps candidate configurations for the gate projection while holding up and down fixed, then sweeps up, then down, repeating until no further improvement is found. The result is a per-model configuration file that tells the runtime exactly which pipeline, matmul variant, and SM allocation to use for each projection at each batch size, all driven by measured performance rather than heuristics.</p>
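    <p>The sweep described above is a form of coordinate descent: optimize one projection at a time while holding the others fixed, and repeat until a full pass brings no improvement. The Python sketch below shows that loop in isolation. It is a simplification: it tunes only the pipeline choice at a single batch size, the pipeline identifiers are our own labels, and <code>toy_throughput</code> is a made-up stand-in for a real end-to-end measurement.</p>

```python
from typing import Callable, Dict, List

Config = Dict[str, str]

# Our own labels for the four pipelines described in this post.
PIPELINES = ["full_decode", "exponent_only", "palette_transcode", "direct_palette"]


def coordinate_descent(choices: Dict[str, List[str]],
                       measure: Callable[[Config], float],
                       start: Config) -> Config:
    """Sweep gate, then up, then down, until a full pass finds no improvement."""
    best = dict(start)
    best_score = measure(best)
    improved = True
    while improved:
        improved = False
        for projection in ("gate", "up", "down"):
            for candidate in choices[projection]:
                trial = {**best, projection: candidate}
                score = measure(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
    return best


# Made-up scores (arbitrary units), standing in for measured throughput.
_SCORES = {
    "gate": {"full_decode": 10, "exponent_only": 12, "palette_transcode": 15, "direct_palette": 13},
    "up":   {"full_decode": 10, "exponent_only": 12, "palette_transcode": 15, "direct_palette": 13},
    "down": {"full_decode": 14, "exponent_only": 11, "palette_transcode": 10, "direct_palette": 12},
}


def toy_throughput(config: Config) -> float:
    score = sum(_SCORES[p][config[p]] for p in ("gate", "up", "down"))
    if config["gate"] == config["up"]:
        score += 1  # pretend matching gate/up pipelines interact favorably
    return score


choices = {p: PIPELINES for p in ("gate", "up", "down")}
start = {p: "full_decode" for p in ("gate", "up", "down")}
print(coordinate_descent(choices, toy_throughput, start))
```

    <p>Coordinate descent evaluates far fewer configurations than an exhaustive search, at the risk of settling on a local optimum, which is one reason the score has to come from real measurements rather than a model.</p>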
    <div>
      <h2>One compression format, multiple uses</h2>
      <a href="#one-compression-format-multiple-uses">
        
      </a>
    </div>
    <p>Encoding format, execution pipeline, and scheduling are independent choices. The same Huffman-compressed model bundle can serve both distribution and inference:</p><ul><li><p>For <b>distribution</b>, Huffman encoding maximizes compression (~22% total model size reduction), reducing transfer times when shipping models across the network.</p></li><li><p>For <b>inference</b>, Huffman-encoded projections can be transcoded to the palette intermediate format on model load, enabling the most efficient runtime execution without constraining the distribution format.</p></li></ul><p>A single model bundle doesn't need to commit to one strategy at packaging time. The runtime selects the best execution path per projection and per batch size on the fly.</p>
    <div>
      <h2>Our results </h2>
      <a href="#our-results">
        
      </a>
    </div>
    <p>On Llama 3.1 8B (our primary testbed), Unweight achieves:</p><ul><li><p>~13% model footprint reduction for inference bundles (compressing only gate/up MLP projections), or ~22% for distribution bundles (compressing all MLP projections including down). All compression is 100% bit-exact lossless. Extrapolating to Llama 70B, this can translate to roughly 18–28 GB saved depending on configuration.</p></li><li><p>30–40% throughput overhead at current optimization level, measured end-to-end on H100 SXM5. The overhead is largest at batch size 1 (~41%) and narrows at batch 1024 (~30%). Three known sources – small-batch fixed costs, redundant weight-tile reconstruction, and the excluded down projection – are under active optimization.</p></li></ul><p>These are intermediate results on a single model. The compression ratios should generalize to other SwiGLU architectures (exponent statistics are consistent across model scales), but the throughput numbers are specific to the current kernel implementations and will change as optimization continues. We do not yet compress attention weights, embeddings, or layer norms, which dilute the overall reduction.</p>
    <div>
      <h2>Why this matters </h2>
      <a href="#why-this-matters">
        
      </a>
    </div>
    <p>GPUs are expensive in multiple dimensions: the cost of the cards themselves, the high-bandwidth memory they demand, and their significant power consumption.</p><p>To combat this, several research systems have reported promising ~30% compression ratios on full models, but they target consumer GPUs and research frameworks that don’t work at production scale. The key insight behind Unweight is that multilayer perceptrons (MLPs) constitute the majority of model weights and a significant amount of the compute cost during inference workloads. It compresses only MLP weights (avoiding overhead on layers where compression benefit is marginal), is designed specifically for datacenter H100 GPUs with their tightly balanced compute and memory, and comes with four execution pipelines that adapt to batch size rather than using a single approach.</p><p>However, we want to be clear: Unweight is not a free lunch. On-chip reconstruction adds computational work that wouldn't exist with uncompressed weights. On Llama 3.1 8B, the inference configuration saves approximately 13% of total model memory at a throughput cost of roughly 30% at typical serving batch sizes. This gap narrows at larger batches (where preprocess overlap improves) and is expected to narrow further as we optimize. In particular, we haven't yet compressed the down projection in each MLP layer (about one-third of the compressible weights), and several kernel improvements are in active development.</p><p>For Cloudflare's network, Unweight gives us better capacity: it allows us to serve state-of-the-art models with less GPU memory per instance, which translates to cost savings and the ability to deploy more models in more places. For model distribution, the savings are larger: Huffman-compressed bundles are about 22% smaller, reducing transfer times when shipping models to edge locations worldwide.</p>
    <div>
      <h2>What’s next </h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Looking forward, we have three concrete research directions that we think will improve upon our efficiency gains:</p><p><b>Down projection compression.</b> Unweight compresses gate and up MLP projections today, but the down projection accounts for roughly one-third of compressible weights. Compressing it requires a different kernel variant due to its transposed dimensions, and we expect it to reduce total model size by more than the current 22%.</p><p><b>Kernel optimization.</b> The current 30–40% throughput overhead has three identified sources: small-batch fixed costs in the reconstructive matmul, redundant weight reconstruction at large batch sizes, and the missing down projection. Each has a known mitigation path, which we outline in our <a href="https://research.cloudflare.com/nikulin2026"><u>technical paper</u></a>.</p><p><b>More models.</b> Our results are for Llama 3.1 8B, but the underlying exponent statistics are consistent across SwiGLU architectures at all scales. We're working to bring Unweight to the larger models we serve through <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Workers AI</u></a>.</p><p>Longer term, we are investigating what Unweight’s architecture means for Mixture-of-Experts models, where cold experts must be fetched on demand and reduced storage would further reduce cost.</p><p>This is a fast-moving field, so we’re excited to open-source our work here and contribute to a growing corpus of research in compression and GPU efficiency. Unweight is one piece of the puzzle, but we hope that other researchers find it a useful paradigm to build upon!</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pmxEB8lMH4Uo80E2oc368/1efdf5892d994e5506e124ea3b110b16/10.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">5zArX1DmYGvf4WM0go16wb</guid>
            <dc:creator>Mari Galicer</dc:creator>
            <dc:creator>Ivan Nikulin</dc:creator>
            <dc:creator>Chris Branch</dc:creator>
        </item>
        <item>
            <title><![CDATA[Agents that remember: introducing Agent Memory]]></title>
            <link>https://blog.cloudflare.com/introducing-agent-memory/</link>
            <pubDate>Fri, 17 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Agent Memory is a managed service that gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>As developers build increasingly sophisticated <a href="https://developers.cloudflare.com/agents/">agents</a> on Cloudflare, one of the biggest challenges they face is getting the right information into context at the right time. The quality of results produced by models is directly tied to the quality of context they operate with, but even as context window sizes grow past one million (1M) tokens, <a href="https://www.trychroma.com/research/context-rot"><u>context rot</u></a> remains an unsolved problem. A natural tension emerges between two bad options: keep everything in context and watch quality degrade, or aggressively prune and risk losing information the agent needs later.</p><p>Today we're announcing the private beta of <b>Agent Memory</b>, a managed service that extracts information from agent conversations and makes it available when it’s needed, without filling up the context window.</p><p>It gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time. In this post, we’ll explain how it works — and what it can help you build.</p>
    <div>
      <h2>The state of agentic memory</h2>
      <a href="#the-state-of-agentic-memory">
        
      </a>
    </div>
    <p>Agentic memory is one of the fastest-moving spaces in AI infrastructure, with new open-source libraries, managed services, and research prototypes launching on a near-weekly basis. These offerings vary widely in what they store, how they retrieve, and what kinds of agents they're designed for. Benchmarks like <a href="https://arxiv.org/abs/2410.10813"><u>LongMemEval</u></a>, <a href="https://arxiv.org/abs/2402.17753"><u>LoCoMo</u></a>, and <a href="https://arxiv.org/pdf/2510.27246"><u>BEAM</u></a> provide useful apples-to-apples comparisons, but they also make it easy to build systems that <a href="https://en.wikipedia.org/wiki/Overfitting"><u>overfit</u></a> to a specific evaluation and break down in production.</p><p>Existing offerings also differ in architecture. Some are managed services that handle extraction and retrieval in the background; others are self-hosted frameworks where you run the memory pipeline yourself. Some expose constrained, purpose-built APIs that keep memory logic out of the agent's main context; others give the model raw access to a database or filesystem and let it design its own queries, burning tokens on storage and retrieval strategy instead of the actual task. Some try to fit everything into the context window, partitioning across multiple agents if needed, while others use retrieval to surface only what's relevant. </p><p>Agent Memory is a managed service with an opinionated API and retrieval-based architecture. We've carefully considered the alternatives, and we believe this combination is the right default for most production workloads. In our experience, tighter ingestion and retrieval pipelines outperform giving agents raw filesystem access. In addition to improved cost and performance, they provide a better foundation for the complex reasoning tasks required in production, like temporal logic, supersession, and instruction following. 
We'll likely expose data for programmatic querying down the road, but we expect that to be useful for edge cases, not common cases.</p><p>We built Agent Memory because the workloads we see on our platform exposed gaps that existing approaches don't fully address. Agents running for weeks or months against real codebases and production systems need memory that stays useful as it grows — not just memory that performs well on a clean benchmark dataset that may fit entirely into a newer model's context window. </p><p>They need fast ingestion. They need retrieval that doesn't block the conversation. And they need to run on models that keep the per-query cost reasonable.</p>
    <div>
      <h2>How you use it</h2>
      <a href="#how-you-use-it">
        
      </a>
    </div>
    <p>Agent Memory stores memories in a profile, which is addressed by name. A profile gives you several operations: ingest a conversation, remember something specific, recall what you need, list memories, or forget a specific memory. <i>Ingest</i> is the bulk path that is typically called when the harness compacts context. <i>Remember</i> is for the model to store something important on the spot. <i>Recall</i> runs the full retrieval pipeline and returns a synthesized answer.</p>
            <pre><code>export default {
  async fetch(request: Request, env: Env): Promise&lt;Response&gt; {
    // Get a profile -- an isolated memory store shared across sessions, agents, and users
    const profile = await env.MEMORY.getProfile("my-project");
    // Ingest -- extract memories from a conversation (typically called at compaction)
    await profile.ingest([
      { role: "user", content: "Set up the project with React and TypeScript." },
      { role: "assistant", content: "Done. Scaffolded a React + TS project targeting Workers." },
      { role: "user", content: "Use pnpm, not npm. And dark mode by default." },
      { role: "assistant", content: "Got it -- pnpm and dark mode as default." },
    ], { sessionId: "session-001" });
    // Remember -- store a single memory explicitly (direct tool use by the model)
    const memory = await profile.remember({
      content: "API rate limit was increased to 10,000 req/s per zone after the April 10 incident.",
      sessionId: "session-001",
    });
    // Recall -- retrieve memories and get a synthesized answer
    const results = await profile.recall("What package manager does the user prefer?");
    console.log(results.result); // "The user prefers pnpm over npm."
    return Response.json({ ok: true });
  },
};</code></pre>
            <p>Agent Memory is accessed via a binding from any Cloudflare Worker. It can also be accessed via a REST API for agents running outside of Workers, following the same pattern as other Cloudflare developer platform APIs. If you’re building with the Cloudflare Agents SDK, the Agent Memory service integrates neatly as the reference implementation for handling compaction, remembering, and searching over memories in <a href="https://developers.cloudflare.com/agents/concepts/memory/"><u>the memory portion</u></a> of the Sessions API.</p>
    <div>
      <h2>What you can build with it</h2>
      <a href="#what-you-can-build-with-it">
        
      </a>
    </div>
    <p>Agent Memory is designed to work across a range of agent architectures:</p><p><b>Memory for individual agents.</b> Regardless of whether you're building with coding agents like Claude Code or OpenCode with a human in the loop, using self-hosted agent frameworks like OpenClaw or Hermes to act on your behalf, or wiring up managed services like <a href="https://www.anthropic.com/engineering/managed-agents"><u>Anthropic’s Managed Agents</u></a>, Agent Memory can serve as the persistent memory layer without any changes to the agent's core loop.</p><p><b>Memory for custom agent harnesses.</b> Many teams are building their own agent infrastructure, including background agents that run autonomously without a human in the loop. <a href="https://builders.ramp.com/post/why-we-built-our-background-agent"><u>Ramp Inspect</u></a> is one public example; <a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents"><u>Stripe</u></a> and <a href="https://engineering.atspotify.com/2025/11/spotifys-background-coding-agent-part-1"><u>Spotify</u></a> have described similar systems. These harnesses can also benefit from giving their agents memory that persists across sessions and survives restarts.</p><p><b>Shared memory across agents, people, and tools.</b> A memory profile doesn't have to belong to a single agent. A team of engineers can share a memory profile so that knowledge learned by one person's coding agent is available to everyone: coding conventions, architectural decisions, tribal knowledge that currently lives in people's heads or gets lost when context is pruned. A code review bot and a coding agent can share memory so that review feedback shapes future code generation. The knowledge your agents accumulate stops being ephemeral and starts becoming a durable team asset.</p><p>While search is a component of memory, agent search and agent memory solve distinct problems. 
<a href="https://developers.cloudflare.com/ai-search/"><u>AI Search</u></a> is our primitive for finding results across unstructured and structured files; Agent Memory is for context recall. The data in Agent Memory doesn't exist as files; it's derived from sessions. An agent can use both, and they are designed to work together. </p>
    <div>
      <h2>Your memories are yours</h2>
      <a href="#your-memories-are-yours">
        
      </a>
    </div>
    <p>As agents become more capable and more deeply embedded in business processes, the memory they accumulate becomes genuinely valuable — not just as an operational state, but as institutional knowledge that took real work to build. We're hearing growing concern from customers about what it means to tie that asset to a single vendor, which is reasonable. The more an agent learns, the higher the switching cost if that memory can't move with it.</p><p>Agent Memory is a managed service, but your data is yours. Every memory is exportable, and we're committed to making sure the knowledge your agents accumulate on Cloudflare can leave with you if your needs change. We think the right way to earn long-term trust is to make leaving easy and to keep building something good enough that you don't want to.</p>
    <div>
      <h2>How Agent Memory works</h2>
      <a href="#how-agent-memory-works">
        
      </a>
    </div>
    <p>To understand what happens behind the API shown above, it helps to break down how agents manage context. An agent has three components:</p><ol><li><p>A <b>harness</b> that drives repeated calls to a model, facilitates tool calls, and manages state.</p></li><li><p>A <b>model</b> that takes context and returns completions.</p></li><li><p><b>State</b> that includes both the current context window and additional information outside context: conversation history, files, databases, memory.</p></li></ol><p>The critical moment in an agent’s context lifecycle is <b>compaction,</b> when the harness decides to shorten context to stay within a model's limits or to avoid context rot. Today, most agents discard information permanently. Agent Memory preserves knowledge on compaction instead of losing it.</p><p>Agent Memory integrates into this lifecycle in two ways:</p><ol><li><p><b>Bulk ingestion at compaction.</b> When the harness compacts context, it ships the conversation to Agent Memory for ingestion. Ingestion extracts facts, events, instructions, and tasks from the message history, deduplicates them against existing memories, and stores them as memories for future retrieval.</p></li><li><p><b>Direct tool use by the model.</b> The model gets tools to interact directly with memories, including the ability to recall (search memories for specific information). The model can also remember (explicitly store memories based on something important), forget (mark a memory as no longer important or true), and list (see what memories are stored). These are lightweight operations that don't require the model to design queries or manage storage. The primary agent should never burn context on storage strategy. The tool surface it sees is deliberately constrained so that memory stays out of the way of the actual task.</p></li></ol>
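<p>The first integration point, bulk ingestion at compaction, can be sketched roughly as follows. This is a minimal illustration and not the actual harness integration: the <code>MemoryProfile</code> interface is a hypothetical reduction of the binding shown earlier (we assume <code>ingest</code> resolves with no value), and <code>compactHistory</code> and its <code>keepLast</code> parameter are names invented for this example.</p>

```typescript
// Message shape mirrors the ingest() example earlier in the post.
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical, minimal view of a memory profile: only the call used here.
interface MemoryProfile {
  ingest(messages: Message[], options: { sessionId: string }): Promise<void>;
}

// Hypothetical harness hook: before trimming the context, ship the messages
// that are about to be dropped to the memory service, then keep the tail.
async function compactHistory(
  profile: MemoryProfile,
  history: Message[],
  sessionId: string,
  keepLast = 4,
): Promise<Message[]> {
  if (history.length <= keepLast) return history; // nothing to compact yet
  const dropped = history.slice(0, history.length - keepLast);
  await profile.ingest(dropped, { sessionId }); // preserve before discarding
  return history.slice(history.length - keepLast);
}
```

<p>The point of the ordering is that the harness only discards messages after ingestion has been accepted, so compaction shortens context without losing the knowledge it contained.</p>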
    <div>
      <h3>The ingestion pipeline</h3>
      <a href="#the-ingestion-pipeline">
        
      </a>
    </div>
    <p>When a conversation arrives for ingestion, it passes through a multi-stage pipeline that extracts, verifies, classifies, and stores memories.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Cjj8uXbZRCrQVAxirTo7y/d62b5b798990540e78825747ff24afec/image3.png" />
          </figure><p>The first step is deterministic ID generation. Each message gets a content-addressed ID: a SHA-256 hash of session ID, role, and content, truncated to 128 bits. If the same conversation is ingested twice, every message resolves to the same ID, making re-ingestion idempotent. </p><p>Next, the extractor runs two passes in parallel. A full pass chunks messages at roughly 10K characters with two-message overlap and processes up to four chunks concurrently. Each chunk gets a structured transcript with role labels, relative dates resolved to absolutes ("yesterday" becomes "2026-04-14"), and line indices for source provenance. For longer conversations (9+ messages), a detail pass runs alongside the full pass, using overlapping windows that focus specifically on extracting concrete values like names, prices, version numbers, and entity attributes that broad extraction tends to miss. The two result sets are then merged.</p><p>The next step is to verify each extracted memory against the source transcript. The verifier runs eight checks covering entity identity, object identity, location context, temporal accuracy, organizational context, completeness, relational context, and whether inferred facts are actually supported by the conversation. Each item is passed, corrected, or dropped accordingly.</p><p>The pipeline then classifies each verified memory into one of four types. </p><ul><li><p><b>Facts</b> represent what is true right now: atomic, stable knowledge like "the project uses GraphQL" or "the user prefers dark mode." </p></li><li><p><b>Events</b> capture what happened at a specific time, like a deployment or a decision. </p></li><li><p><b>Instructions </b>describe how to do something, such as procedures, workflows, and runbooks. </p></li><li><p><b>Tasks</b> track what is being worked on right now and are ephemeral by design.</p></li></ul><p>Facts and instructions are keyed. 
Each gets a normalized topic key, and when a new memory has the same key as an existing one, the old memory is superseded rather than deleted. This creates a version chain with a forward pointer from the old memory to the new memory. Tasks are excluded from the vector index entirely to keep it lean but remain discoverable via full-text search.</p><p>Finally, everything is written to storage using INSERT OR IGNORE so that content-addressed duplicates are silently skipped. After returning a response to the harness, background vectorization runs asynchronously. The embedding text prepends the 3-5 search queries generated during classification to the memory content itself, bridging the gap between how memories are written (declaratively: "user prefers dark mode") and how they're searched (interrogatively: "what theme does the user want?"). Vectors for superseded memories are deleted in parallel with new upserts.</p>
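<p>To make the content-addressed ID and the idempotent write concrete, here is a small sketch under stated assumptions. The field separator, the hex encoding, and the in-memory <code>Map</code> standing in for the SQLite table are all choices made for this example, and it uses Node's <code>node:crypto</code> for brevity; the production pipeline may differ on each point.</p>

```typescript
import { createHash } from "node:crypto";

interface StoredMessage {
  sessionId: string;
  role: "user" | "assistant";
  content: string;
}

// Derive a 128-bit content-addressed ID from session ID, role, and content.
// The NUL separator is an assumption; it keeps field boundaries unambiguous.
function messageId(message: StoredMessage): string {
  const digest = createHash("sha256")
    .update([message.sessionId, message.role, message.content].join("\u0000"))
    .digest("hex");
  return digest.slice(0, 32); // 32 hex characters = 128 bits
}

// In-memory stand-in for INSERT OR IGNORE: returns true only for new rows.
function insertOrIgnore(store: Map<string, StoredMessage>, message: StoredMessage): boolean {
  const id = messageId(message);
  if (store.has(id)) return false; // duplicate content is silently skipped
  store.set(id, message);
  return true;
}

const demoStore = new Map<string, StoredMessage>();
const demoMessage: StoredMessage = {
  sessionId: "session-001",
  role: "user",
  content: "Use pnpm, not npm.",
};
console.log(insertOrIgnore(demoStore, demoMessage)); // true (first ingestion)
console.log(insertOrIgnore(demoStore, demoMessage)); // false (re-ingestion is a no-op)
```

<p>Because the ID depends only on the message itself, replaying a conversation cannot create duplicate rows, which is what makes re-ingestion safe.</p>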
    <div>
      <h3>The retrieval pipeline</h3>
      <a href="#the-retrieval-pipeline">
        
      </a>
    </div>
    <p>When an agent searches for a memory, the query goes through a separate retrieval pipeline. During development, we discovered that no single retrieval method works best for all queries, so we run several methods in parallel and fuse the results.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cH9jB2ghMIQZG0McicyV7/8d75ca8c748054c06cc5e25c4d1d730f/image5.png" />
          </figure><p>The first stage runs query analysis and embedding concurrently. The query analyzer produces ranked topic keys, full-text search terms with synonyms, and a HyDE (Hypothetical Document Embedding) statement: a declarative sentence phrased as if it were the answer to the question. This stage also embeds the raw query directly, and both embeddings (the raw query and the HyDE statement) are used downstream.</p><p>In the next stage, five retrieval channels run in parallel. Full-text search with <a href="https://tartarus.org/martin/PorterStemmer/"><u>Porter stemming</u></a> handles keyword precision for queries where you know the exact term but not the surrounding context. Exact fact-key lookup returns results where the query maps directly to a known topic key. Raw message search runs full-text search directly over the stored conversation messages; these unclassified conversation fragments act as a safety net, catching verbatim details that the extraction pipeline may have generalized away. Direct vector search finds semantically similar memories using the embedded query. And HyDE vector search finds memories that are similar to what the answer would look like, which often surfaces results that direct embedding misses, particularly for abstract or multi-hop queries where the question and the answer use different vocabulary.</p><p>In the third and final stage, results from all five retrieval channels are merged using Reciprocal Rank Fusion (RRF), where each result receives a weighted score based on where it ranked within a given channel. Fact-key matches get the highest weight because an exact topic match is the strongest signal. Full-text search, HyDE vectors, and direct vectors are each weighted based on strength of signal. Finally, raw message matches are also included with low weight as a safety net to identify candidate results the extraction pipeline may have missed. 
Ties are broken by recency, with newer results ranked higher.</p><p>The pipeline then passes the top candidates to the synthesis model, which generates a natural-language answer to the original search query. Some specific query types get special treatment. As an example, temporal computation is handled deterministically via regex and arithmetic, not by the LLM. The results are injected into the synthesis prompt as pre-computed facts. Models are unreliable at things like date math, so we don't ask them to do it.</p>
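<p>The fusion stage can be illustrated with a short weighted-RRF sketch. The channel weights and the constant <code>K = 60</code> (a commonly used RRF default) are illustrative assumptions, as are the type and function names; the post does not state the production values.</p>

```typescript
interface Candidate {
  id: string;
  createdAt: number; // epoch milliseconds, used only to break ties
}

type Channel = "factKey" | "fullText" | "hyde" | "vector" | "rawMessage";

// Hypothetical weights: fact-key highest, raw messages lowest.
const CHANNEL_WEIGHTS: Record<Channel, number> = {
  factKey: 3,
  fullText: 2,
  hyde: 2,
  vector: 1.5,
  rawMessage: 0.5,
};
const K = 60; // damping constant from the usual RRF formulation

// Each channel contributes weight / (K + rank) for every result it returned.
function fuse(results: Partial<Record<Channel, Candidate[]>>): Candidate[] {
  const scored = new Map<string, { candidate: Candidate; score: number }>();
  for (const channel of Object.keys(results) as Channel[]) {
    const ranked = results[channel] ?? [];
    ranked.forEach((candidate, index) => {
      const entry = scored.get(candidate.id) ?? { candidate, score: 0 };
      entry.score += CHANNEL_WEIGHTS[channel] / (K + index + 1); // rank is 1-based
      scored.set(candidate.id, entry);
    });
  }
  return [...scored.values()]
    // Higher score first; on an exact tie, the newer memory wins.
    .sort((a, b) => b.score - a.score || b.candidate.createdAt - a.candidate.createdAt)
    .map((entry) => entry.candidate);
}
```

<p>A memory that appears in several channels accumulates score from each of them, so agreement between channels is rewarded even when no single channel ranks the memory first.</p>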
    <div>
      <h2>How we built it</h2>
      <a href="#how-we-built-it">
        
      </a>
    </div>
    <p>Our initial prototype of Agent Memory was lightweight, with a basic extraction pipeline, vector storage, and simple retrieval. It worked well enough to demonstrate the concept, but not well enough to ship.</p><p>So we put it into an agent-driven loop and iterated. The cycle looked like this: run benchmarks, analyze where we had gaps, propose solutions, have a human review the proposals to select strategies that generalize rather than overfit, let the agent make the changes, repeat.</p><p>This worked well, but came with one specific challenge. LLMs are stochastic, even with temperature set to zero. This caused results to vary across runs, which meant we had to average multiple runs (time-consuming for large benchmarks) and rely on trend analysis alongside raw scores to understand what was actually working. Along the way we had to guard carefully against overfitting the benchmarks in ways that didn't genuinely make the product better for the general case.</p><p>Over time, this got us to a place where benchmark scores improved consistently with each iteration and we had a generalized architecture that would work in the real world. We intentionally tested against multiple benchmarks (including LoCoMo, LongMemEval, and BEAM) to push the system in different ways.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/68NbsLPqv7VXu0ODV6PFGK/89aa2e05b82f24b642d0c4cbddbdf340/image.png" />
          </figure>
    <div>
      <h2>Why Cloudflare</h2>
      <a href="#why-cloudflare">
        
      </a>
    </div>
    <p>We build Cloudflare on Cloudflare, and Agent Memory is no different. Existing primitives that are powerful and easily composable allowed us to ship the first prototype in a weekend and a fully functioning, productionized internal version of Agent Memory in less than a month. In addition to speed of delivery, Cloudflare turned out to be the ideal place to build this kind of service for a few other reasons.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IWzT1bmOW7Lip9ib17Di8/1a3a83be3d4e7f1450eacd41d18a924f/image4.png" />
          </figure><p>Under the hood, Agent Memory is a Cloudflare Worker that coordinates several systems:</p><ul><li><p>Durable Object: stores the raw messages and classified memories</p></li><li><p>Vectorize: provides vector search over embedded memories</p></li><li><p>Workers AI: runs the LLMs and embedding models</p></li></ul><p>Each memory profile maps to its own Durable Object instance and Vectorize index, keeping data fully isolated between profiles. This design also lets us scale easily as demand grows.</p><p><b>Compute isolation via Durable Objects.</b> Each memory profile gets its own <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Object</u></a> (DO) with a SQLite-backed store, providing strong isolation between tenants without any infrastructure overhead. The DO handles full-text search (FTS) indexing, supersession chains, and transactional writes. The DO’s getByName() addressing means any request, from anywhere, can reach the right memory profile by name, and ensures that sensitive memories are strongly isolated from other tenants.</p><p><b>Storage across the stack.</b> Memory content lives in SQLite-backed DOs. Vectors live in <a href="https://developers.cloudflare.com/vectorize/"><u>Vectorize</u></a>. In the future, snapshots and exports will go to <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a> for cost-efficient long-term storage. Each primitive is purpose-built for its workload, so we don't need to force everything into a single shape or database.</p><p><b>Local model inference with Workers AI.</b> The entire extraction, classification, and synthesis pipeline runs on <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> models deployed on Cloudflare's network. 
All AI calls pass a session affinity header keyed on the memory profile name, so repeated requests hit the same backend and benefit from prompt caching.</p><p>One interesting finding from our model selection: a bigger, more powerful model isn't always better. We currently default to Llama 4 Scout (17B, 16-expert MoE) for extraction, verification, classification, and query analysis, and Nemotron 3 (120B MoE, 12B active parameters) for synthesis. Scout handles the structured classification tasks efficiently, while Nemotron's larger reasoning capacity improves the quality of natural-language answers. The synthesizer is the only stage where throwing more parameters at the problem consistently helped. For everything else, the smaller model hit a better sweet spot of cost, quality, and latency.</p>
    <div>
      <h2>How we've been using it</h2>
      <a href="#how-weve-been-using-it">
        
      </a>
    </div>
    <p>We run Agent Memory internally for our own workflows at Cloudflare, as both a proving ground and a source of ideas for what to build next.</p><p><b>Coding agent memory.</b> We use an internal <a href="https://opencode.ai">OpenCode</a> plugin that wires Agent Memory into the development loop. Agent Memory provides memory of past compaction within sessions and across them. The less obvious benefit has been shared memory across a team: with a shared profile, the agent knows what other members of your team have already learned, which means it can stop asking questions that have already been answered and stop making mistakes that have already been corrected.</p><p><b>Agentic code review.</b> We've connected Agent Memory to our internal agentic code reviewer. Arguably the most useful thing it learned to do was stay quiet. The reviewer now remembers that a particular comment wasn't relevant in a past review, that a specific pattern was flagged, and the author chose to keep it for a good reason. Reviews get less noisy over time, not just smarter.</p><p><b>Chat bots.</b> We've also wired memory into an internal chat bot that ingests message history and then lurks and remembers new messages that are sent. Then, when someone asks a question, the bot can answer based on previous conversations.</p><p>We also have a number of additional use cases that we plan to roll out internally in the near future as we refine and improve the service.</p>
    <div>
      <h2>What's next</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We're continuing to test and refine Agent Memory internally, improving the extraction pipeline, tuning retrieval quality, and expanding the background processing capabilities. Similar to how the human brain consolidates memories by replaying and strengthening connections during sleep, we see opportunities for memory storage to improve asynchronously and are currently implementing and testing various strategies to make this work.</p><p>We plan to make Agent Memory publicly available soon. If you're building agents on Cloudflare and want early access, <a href="https://forms.gle/RAXbK6gN9Yy89ECw8"><u>contact us to join the waitlist</u></a>.</p><p>If you want to dig into the architecture, share what you're building, or follow along as we develop this further, join us on the <a href="https://discord.cloudflare.com"><u>Cloudflare Discord</u></a> or start a thread in the <a href="https://community.cloudflare.com"><u>Cloudflare Community</u></a>. We're actively watching both, and are interested in what production agent workloads actually look like in the wild.</p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Storage]]></category>
            <guid isPermaLink="false">3VGCjw0ivWk7RfPbPZ9k9H</guid>
            <dc:creator>Tyson Trautmann</dc:creator>
            <dc:creator>Rob Sutter</dc:creator>
        </item>
        <item>
            <title><![CDATA[Agents Week: network performance update]]></title>
            <link>https://blog.cloudflare.com/network-performance-agents-week/</link>
            <pubDate>Fri, 17 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ By migrating our request handling layer to a Rust-based architecture called FL2, Cloudflare has increased its performance lead to 60% of the world’s top networks. We use real-user measurements and TCP connection trimeans to ensure our data reflects the actual experience of people on the Internet. ]]></description>
            <content:encoded><![CDATA[ <p>When it comes to the Internet, performance is everything. Every millisecond shaved off a connection is a better experience for the real people using the applications and websites you build. That's why, at Cloudflare, we measure our performance constantly and share updates on a regular basis. </p><p>In our <a href="https://blog.cloudflare.com/network-performance-update-birthday-week-2025"><u>last performance post</u></a>, published during Birthday Week 2025, we shared that Cloudflare was the fastest network in 40% of the largest 1,000 networks in the world. At the time, we noted a nuanced reading of that figure; we were competitive in many more networks, and the gaps were often notably small. But even so, we were not satisfied with 40%. By December 2025 (our most recent available analysis), <b>we had become the fastest provider in 60% of the top networks</b>. Here's how we got there, and what it means.</p>
    <div>
      <h3>How do we measure and compare network performance?</h3>
      <a href="#how-do-we-measure-and-compare-network-performance">
        
      </a>
    </div>
    <p>Before diving into the results, let’s review how we collect the data. We start with the 1,000 largest networks in the world by estimated population, using <a href="https://stats.labs.apnic.net/cgi-bin/aspopjson"><u>APNIC's data</u></a> as our source. These networks represent real users in nearly every geography, giving us a broad and meaningful picture of how Internet users experience the web.</p><p>To measure performance, we use TCP connection time, which is the time it takes for an end user's device to complete a TCP handshake with the endpoint they're trying to reach. We chose this metric because it most closely approximates what users actually perceive as "Internet speed." It's not so abstract that it ignores real-world constraints like congestion and distance, but it's precise enough to give us actionable data. (We've <a href="https://blog.cloudflare.com/introducing-radar-internet-quality-page/"><u>previously written</u></a> about why we favor this metric over alternatives.)</p><p>We calculate our rankings using the trimean of TCP connection times. The trimean is a weighted average of three values, with the median counted twice: the first quartile (25th percentile), the median (50th percentile), and the third quartile (75th percentile). This approach smooths out noise and outliers, giving us a cleaner signal about the typical user experience rather than an extreme case that might skew the picture.</p><p>To capture this data, we rely on Real User Measurements (RUM). When users encounter a Cloudflare-branded error page, a small speed test runs silently in the background. The browser retrieves small files from multiple providers, including Cloudflare, Amazon CloudFront, Google, Fastly, and Akamai, and records how long each exchange takes. This gives us performance data directly from the user's browser, in their real-world network conditions. It's the difference between testing a car's top speed on a track versus watching how people actually drive on the highway.</p>
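<p>As a worked illustration of the trimean, the standard definition is (Q1 + 2 × median + Q3) / 4. The sketch below computes it for a list of connection times; the linear-interpolation percentile method and the sample values are assumptions made for this example and are not taken from our measurement pipeline.</p>

```typescript
// Percentile by linear interpolation between the two nearest ranks
// (one of several common conventions; chosen here as an assumption).
function quantile(sorted: number[], q: number): number {
  const position = (sorted.length - 1) * q;
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  const fraction = position - lower;
  return sorted[lower] + (sorted[upper] - sorted[lower]) * fraction;
}

// Tukey's trimean: the median counts twice, the quartiles once each.
function trimean(samples: number[]): number {
  if (samples.length === 0) throw new Error("trimean needs at least one sample");
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const median = quantile(sorted, 0.5);
  const q3 = quantile(sorted, 0.75);
  return (q1 + 2 * median + q3) / 4;
}

// Made-up TCP connection times in milliseconds, with one extreme outlier.
const connectionTimesMs = [10, 20, 30, 40, 5000];
console.log(trimean(connectionTimesMs)); // 30
```

<p>The arithmetic mean of the same samples is 1020ms, dominated by the single 5000ms outlier, while the trimean stays at 30ms. That robustness is why the trimean is a better proxy for the typical user experience.</p>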
    <div>
      <h3>How did we improve? </h3>
      <a href="#how-did-we-improve">
        
      </a>
    </div>
    <p>Historically we have shared how we’ve created new Cloudflare points of presence and reduced end-user latency by simply getting more hardware closer to our users. Most recently, we deployed new locations in Constantine, Algeria; Malang, Indonesia; and Wroclaw, Poland. When we deployed our location in Wroclaw, our free users went from an average round-trip time (RTT) of 19ms to an average of 12ms, an improvement of roughly 37%. In Malang, Enterprise traffic went from a 39ms average RTT to a 37ms average RTT, a 5% improvement. Seeing our customers' experience improve, even if only by a couple of milliseconds, is great. But adding new locations alone doesn’t fully explain how we went from being #1 in 40% of networks to #1 in 60% of networks.</p><p>The answer there has to do with improving how our network handles connections in software. By leveraging protocols like HTTP/3 and changing how we manage congestion windows, we can reduce processing time by milliseconds in code, in addition to the improvements on the wire. By improving CPU usage and memory usage in our software that handles fundamental actions like establishing connections, SSL/TLS termination, traffic management, and the core proxy that all requests flow through, we can make that software more efficient in its usage of resources across our global fleet of hardware. These ongoing efficiency gains result in better performance for you and your customers. </p><p>Think of incoming connections to Cloudflare like toll booths on a highway. Lines can build up at toll booths if there aren’t enough toll booths, or if the booths themselves aren’t efficient at processing cars going through them. We’ve been constantly working to improve not only how our toll booths process incoming cars (the software improvements in connection handling), but also how we send cars between available booths so that we can keep lines short and latency low.   </p>
    <div>
      <h3>How do the results look today?</h3>
      <a href="#how-do-the-results-look-today">
        
      </a>
    </div>
    <p>As we noted above, by December, Cloudflare had become the fastest provider in 60% of the top networks, up from 40% when we last reported. Since Birthday Week in September 2025 we have steadily increased the networks where we are the fastest. Let’s break down the impact. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3i0DbxnemIuOa0tSlgeHul/762f6315b90436796a458d98270eccc5/BLOG-3236_2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7e3paAR7whXLWrjyYmMDb4/986c660e75f9b5326606e5ea3eb650e8/BLOG-3236_3.png" />
          </figure><p>This means that between September and December, we became the fastest in 40 additional countries and in 261 additional networks. We saw the biggest increase in the United States, where we are the fastest in 54 more ASNs.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2YNBX47Y95k7u3IhIRXjli/1cfb3365b54b38514ee75525ab63db09/BLOG-3236_4.png" />
          </figure><p>On average throughout December, we were 6ms faster than the next-fastest provider. As shown above, the line representing Cloudflare’s latency, or connection time, is consistently lower throughout December than the next fastest provider.</p>
    <div>
      <h3>A faster Internet is a better Internet</h3>
      <a href="#a-faster-internet-is-a-better-internet">
        
      </a>
    </div>
    <p>Every percentage point in our network ranking represents real users who are able to connect to their website or application that much faster because of Cloudflare. But we also know that 60% isn't the ceiling. There are still networks where we're number two, sometimes by the smallest of margins. We see those gaps clearly, and we're working on them. We're committed to being the fastest provider across every network in the world. </p><p>Follow our blog for more performance updates as we continue to make the Internet faster.</p> ]]></content:encoded>
            <category><![CDATA[Network Performance Update]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">5SYHyUGFmSQjwi4u55DYUC</guid>
            <dc:creator>Lai Yi Ohlsen</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Flagship: feature flags built for the age of AI]]></title>
            <link>https://blog.cloudflare.com/flagship/</link>
            <pubDate>Fri, 17 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ We are launching Flagship, a native feature flag service built on Cloudflare’s global network to eliminate the latency of third-party providers. By using KV and Durable Objects, Flagship allows for sub-millisecond flag evaluation. ]]></description>
            <content:encoded><![CDATA[ <p>AI is writing more code than ever. AI-assisted contributions now account for a rapidly growing share of new code across the platform. Agentic coding tools like OpenCode and Claude Code are shipping entire features in minutes.</p><p>AI-generated code entering production is only going to accelerate. But the bigger shift isn't just speed — it's autonomy.</p><p>Today, an AI agent writes code and a human reviews, merges, and deploys it. Tomorrow, the agent does all of that itself. The question becomes: how do you let an agent ship to production without removing every safety net?</p><p>Feature flags are the answer. An agent writes a new code path behind a flag and deploys it — the flag is off, so nothing changes for users. The agent then enables the flag for itself or a small test cohort, exercises the feature in production, and observes the results. If metrics look good, it ramps the rollout. If something breaks, it disables the flag. The human doesn't need to be in the loop for every step — they set the boundaries, and the flag controls the blast radius.</p><p>This is the workflow feature flags were always building toward: not just decoupling deployment from release, but decoupling human attention from every stage of the shipping process. The agent moves fast because the flag makes it safe to move fast.</p><p>Today, we're announcing Flagship — Cloudflare's native feature flag service, built on <a href="https://openfeature.dev/"><u>OpenFeature</u></a>, the CNCF open standard for feature flag evaluation. It works everywhere — Workers, Node.js, Bun, Deno, and the browser — but it's fastest on Workers, where flags are evaluated within the Cloudflare network. With the Flagship binding and OpenFeature, integration looks like this:</p>
            <pre><code>await OpenFeature.setProviderAndWait(
    new FlagshipServerProvider({ binding: env.FLAGS })
);</code></pre>
            <p>Flagship is now available in private beta.</p>
    <div>
      <h3>The problem with feature flags on Workers</h3>
      <a href="#the-problem-with-feature-flags-on-workers">
        
      </a>
    </div>
    <p>Many Cloudflare developers have resorted to the pragmatic workaround: hardcoding flag logic directly into their Workers. And honestly, it works well enough in the beginning. Workers deploy in seconds, so flipping a boolean in code and pushing it to production is fast enough for most situations.</p><p>But it doesn't stay simple. One hardcoded flag becomes ten. Ten becomes fifty, owned by different teams, with no central view of what's on or off. There's no audit trail — when something breaks, you're searching <code>git blame</code> to figure out who toggled what.</p>
    <div>
      <h4>Network call to external services</h4>
      <a href="#network-call-to-external-services">
        
      </a>
    </div>
    <p>Another common pattern on Workers is to make an HTTP request to an external flag service on every request:</p>
            <pre><code>const response = await fetch("https://flags.example-service.com/v1/evaluate", {
  ...
  body: JSON.stringify({
    flagKey: "new-checkout-flow",
    context: {
      ...
    },
  }),
});
const { value } = await response.json();
if (value === true) {
  return handleNewCheckout(request);
}
return handleLegacyCheckout(request);</code></pre>
            <p>That outbound request sits on the critical path of every single user request. It could add considerable latency depending on how far the user is from the flag service's region.</p><p>This is a strange situation. Your application runs at the edge, milliseconds from the user. But the feature flag check forces it to reach back across the Internet to another API before it can decide what to render.</p>
    <div>
      <h4>Why local evaluation doesn't solve the problem</h4>
      <a href="#why-local-evaluation-doesnt-solve-the-problem">
        
      </a>
    </div>
    <p>Some feature flag services offer a "local evaluation" SDK. Instead of calling a remote API on every request, the SDK downloads the full set of flag rules into memory and evaluates them locally. There is no outbound request per evaluation, and the flag decision happens in-process. This model assumes a long-lived process that can hold the rules in memory and keep them up to date in the background.</p><p>On Workers, none of these assumptions hold. There is no long-lived process: a Worker isolate can be created, serve a request, and be evicted between one request and the next. A new invocation could mean re-initializing the SDK from scratch.</p><p>On a serverless platform, you need a distribution primitive that's already at the edge, one where the caching is managed for you, reads are local, and you don't need a persistent connection to keep things up to date.</p><p>Cloudflare KV is a great primitive for this!</p>
    <div>
      <h3>How Flagship works</h3>
      <a href="#how-flagship-works">
        
      </a>
    </div>
    <p>Flagship is built entirely on Cloudflare's infrastructure — Workers, Durable Objects, and KV. There are no external databases, no third-party services, and no centralized origin servers in the evaluation path.</p><p>When you create or update a flag, the control plane writes the change atomically to a <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Object</u></a> — a SQLite-backed, globally unique instance that serves as the source of truth for that app's flag configuration and changelog. Within seconds, the updated flag config is synced to <a href="https://developers.cloudflare.com/kv/"><u>Workers KV</u></a>, Cloudflare's globally distributed key-value store, where it's replicated across Cloudflare's network.</p><p>When a request evaluates a flag, Flagship reads the flag config directly from KV at the edge — the same Cloudflare location already handling the request. The evaluation engine then runs right there in an <a href="https://developers.cloudflare.com/workers/reference/how-workers-works/#isolates"><u>isolate</u></a>: it matches the request context against the flag's targeting rules, resolves the rollout percentage, and returns a variation. Both the data and the logic live at the edge — nothing is sent elsewhere to be evaluated.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1nKDxB6OkUr1fpVKGWEugC/8fc4dbf0bd1ec45e30010a223982487e/image2.png" />
          </figure>
    <div>
      <h3>Using Flagship: the Worker binding</h3>
      <a href="#using-flagship-the-worker-binding">
        
      </a>
    </div>
    <p>For teams running Cloudflare Workers, Flagship offers a direct binding that evaluates flags inside the Workers runtime — no HTTP round-trip, no SDK overhead. Add the binding to your <code>wrangler.jsonc</code> and your Worker is connected:</p>
            <pre><code>{
  "flagship": [
    {
      "binding": "FLAGS",
      "app_id": "&lt;APP_ID&gt;"
    }
  ]
}</code></pre>
            <p>That's it. Your account ID is inferred from your Cloudflare account, and the <code>app_id</code> ties the binding to a specific Flagship app. In your Worker, you just ask for a flag value:</p>
            <pre><code>export default {
  async fetch(request: Request, env: Env) {
    // Simple boolean check
    const showNewUI = await env.FLAGS.getBooleanValue('new-ui', false, {
      userId: 'user-42',
      plan: 'enterprise',
    });
    // Full evaluation details when you need them
    const details = await env.FLAGS.getStringDetails('checkout-flow', 'v1', {
      userId: 'user-42',
    });
    // details.value = "v2", details.variant = "new", details.reason = "TARGETING_MATCH"
    // Return a response so the handler is complete (illustrative body)
    return Response.json({ showNewUI, checkoutFlow: details.value });
  },
};</code></pre>
            <p>The binding supports typed accessors for every variation type - <code>getBooleanValue()</code>, <code>getStringValue()</code>, <code>getNumberValue()</code>, <code>getObjectValue()</code> - plus <code>*Details()</code> variants that return the resolved value alongside the matched variant and the reason it was selected. On evaluation errors, the default value is returned gracefully. On type mismatches, the binding throws an exception — that's a bug in your code, not a transient failure.</p>
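    <p>The error-handling contract described above can be sketched as follows. This is not the binding's implementation: <code>FlagLookup</code>, the in-memory <code>flags</code> table, and this standalone <code>getBooleanValue</code> are hypothetical stand-ins that only illustrate "default on evaluation error, throw on type mismatch".</p>

```typescript
// Hypothetical stand-ins for illustration; not the real Flagship binding.
type FlagValue = boolean | string | number;
// A lookup may throw to signal an evaluation error (for example, an unknown flag).
type FlagLookup = (key: string) => FlagValue;

function getBooleanValue(lookup: FlagLookup, key: string, defaultValue: boolean): boolean {
  let raw: FlagValue;
  try {
    raw = lookup(key);
  } catch (err) {
    // Evaluation error: fall back to the caller's default instead of failing.
    return defaultValue;
  }
  if (typeof raw !== "boolean") {
    // Type mismatch: the caller asked for the wrong type, so surface it loudly.
    throw new TypeError('flag "' + key + '" is a ' + typeof raw + ", not a boolean");
  }
  return raw;
}

// Hypothetical in-memory flag table standing in for resolved flag values.
const flags: Record<string, FlagValue> = { "new-ui": true, "checkout-flow": "v2" };
const lookup: FlagLookup = (key) => {
  if (!(key in flags)) {
    throw new Error("flag not found: " + key);
  }
  return flags[key];
};

const showNewUI = getBooleanValue(lookup, "new-ui", false);         // resolved value
const missingFlag = getBooleanValue(lookup, "does-not-exist", false); // falls back to default
```

    <p>Calling <code>getBooleanValue(lookup, "checkout-flow", false)</code> in this sketch throws a <code>TypeError</code>, because that flag holds a string.</p>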
    <div>
      <h3>The SDK: OpenFeature-native</h3>
      <a href="#the-sdk-openfeature-native">
        
      </a>
    </div>
    <p>Most feature flag SDKs come with their own interfaces and evaluation patterns. Over time, those become deeply embedded in your codebase — and switching providers means rewriting every call site.</p><p>We didn't want to build another one of those. Flagship is built on <a href="https://openfeature.dev/">OpenFeature</a>, the CNCF open standard for feature flag evaluation. OpenFeature defines a common interface for flag evaluation across languages and providers — it's the same relationship that OpenTelemetry has to observability. You write your evaluation code once against the standard, and swap providers by changing a single line of configuration.</p>
            <pre><code>import { OpenFeature } from '@openfeature/server-sdk';
import { FlagshipServerProvider } from '@cloudflare/flagship/server';
await OpenFeature.setProviderAndWait(
  new FlagshipServerProvider({
    appId: 'your-app-id',
    accountId: 'your-account-id',
    authToken: 'your-cloudflare-api-token',
  })
);
const client = OpenFeature.getClient();
const showNewCheckout = await client.getBooleanValue(
  'new-checkout-flow',
  false,
  {
    targetingKey: 'user-42',
    plan: 'enterprise',
    country: 'US',
  }
);</code></pre>
            <p>If you're running on Workers with the Flagship binding, you can pass it directly to the OpenFeature provider. The binding already carries your account context, so there's nothing to configure — authentication is implicit.</p>
            <pre><code>import { OpenFeature } from '@openfeature/server-sdk';
import { FlagshipServerProvider } from '@cloudflare/flagship/server';
let initialized = false;
export default {
  async fetch(request: Request, env: Env) {
    if (!initialized) {
      await OpenFeature.setProviderAndWait(
        new FlagshipServerProvider({ binding: env.FLAGS })
      );
      initialized = true;
    }
    const client = OpenFeature.getClient();
    const showNewCheckout = await client.getBooleanValue('new-checkout-flow', false, {
      targetingKey: 'user-42',
      plan: 'enterprise',
    });
  },
};</code></pre>
            <p>Your evaluation code doesn't change — the OpenFeature interface is identical. But under the hood, Flagship evaluates flags through the binding instead of over HTTP. You get the portability of the standard with the performance of the binding.</p><p>A client-side provider is also available for browsers. It pre-fetches the flags you specify, caches them with a configurable TTL, and serves evaluations synchronously from that cache.</p>
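    <p>The caching behavior of a client-side provider can be pictured with a small TTL cache. This is a sketch under assumptions, not the Flagship client provider: the <code>FlagCache</code> class, its method names, and the choice to keep serving the last known values until a refresh completes are all hypothetical.</p>

```typescript
// Hypothetical sketch of "pre-fetch, cache with a TTL, evaluate synchronously".
// None of these names come from the Flagship SDK.
type CachedFlagValue = boolean | string | number;

class FlagCache {
  private values: Record<string, CachedFlagValue> = {};
  private fetchedAtMs = -Infinity; // nothing fetched yet
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Called after a (not shown) network fetch of the pre-selected flags.
  store(values: Record<string, CachedFlagValue>, nowMs: number): void {
    this.values = values;
    this.fetchedAtMs = nowMs;
  }

  // True once the cached values are at least ttlMs old.
  needsRefresh(nowMs: number): boolean {
    return nowMs - this.fetchedAtMs >= this.ttlMs;
  }

  // Synchronous read: no network call on the evaluation path.
  getBoolean(key: string, defaultValue: boolean): boolean {
    const value = this.values[key];
    return typeof value === "boolean" ? value : defaultValue;
  }
}

const cache = new FlagCache(30000); // 30-second TTL, chosen arbitrarily
cache.store({ "new-checkout-flow": true }, 0);
const showNewCheckout = cache.getBoolean("new-checkout-flow", false);
```

    <p>Timestamps are passed in explicitly so the sketch stays deterministic; a real provider would read the clock and refresh in the background.</p>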
    <div>
      <h3>What you can do with Flagship</h3>
      <a href="#what-you-can-do-with-flagship">
        
      </a>
    </div>
    <p>Flagship supports the patterns you'd expect from a feature flag service and the ones that become critical when AI-generated code is landing in production daily.</p><p>Flag values can be boolean, strings, numbers, or full JSON objects — useful for configuration blocks, UI theme definitions, or routing users to different API versions without maintaining separate code paths.</p>
    <div>
      <h4>Targeting Rules</h4>
      <a href="#targeting-rules">
        
      </a>
    </div>
    <p>Each flag can have multiple rules, evaluated in priority order. The first rule that matches wins.</p><p>A rule consists of:</p><ul><li><p>Conditions that determine whether the rule applies to a given context</p></li><li><p>A flag variation to serve when the rule matches</p></li><li><p>An optional rollout for percentage-based delivery</p></li><li><p>A priority that determines evaluation order when multiple rules are present (lower number = higher priority)</p></li></ul>
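    <p>As a rough illustration of "evaluated in priority order, first match wins", here is a minimal sketch. The <code>Rule</code> shape, the <code>evaluate</code> function, and the example rules are hypothetical and do not reflect Flagship's actual data model; rollouts are left out to keep it short.</p>

```typescript
// Hypothetical rule model for illustration only.
type Context = Record<string, string>;

interface Rule {
  priority: number;                   // lower number = higher priority
  matches: (ctx: Context) => boolean; // the rule's conditions
  variation: string;                  // variation served when the rule matches
}

function evaluate(rules: Rule[], ctx: Context, defaultVariation: string): string {
  // Sort a copy so the caller's rule list keeps its original order.
  const ordered = rules.slice().sort((a, b) => a.priority - b.priority);
  for (let i = 0; i < ordered.length; i++) {
    if (ordered[i].matches(ctx)) {
      return ordered[i].variation; // first matching rule wins
    }
  }
  return defaultVariation; // no rule matched
}

const rules: Rule[] = [
  { priority: 2, matches: (ctx) => ctx["plan"] === "enterprise", variation: "premium" },
  { priority: 1, matches: (ctx) => ctx["country"] === "DE", variation: "eu" },
];

// Both rules match here, so the priority-1 rule decides the result.
const served = evaluate(rules, { plan: "enterprise", country: "DE" }, "standard");
```

    <p>In this sketch an enterprise user in Germany is served <code>"eu"</code>, because the priority 1 rule is checked before the priority 2 rule.</p>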
    <div>
      <h4>Nested Logical Conditions</h4>
      <a href="#nested-logical-conditions">
        
      </a>
    </div>
    <p>Conditions can be composed using AND/OR logic, nested up to five levels deep. A single rule can express things like:</p>
            <pre><code>(plan == "enterprise" AND region == "us") OR (user.email.endsWith("@cloudflare.com"))
= serve("premium")</code></pre>
            <p>At the top level of a rule, multiple conditions are combined with implicit AND where all conditions must pass for the rule to match. Within each condition, you can nest AND/OR groups for more complex logic.</p>
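    <p>One way to picture nested AND/OR conditions is as a small tree that is evaluated recursively. The <code>Condition</code> type, the operator names, and the way depth is counted (the top-level condition is level 1) are assumptions made for this sketch, not Flagship's schema.</p>

```typescript
// Hypothetical condition tree for illustration only.
type Condition =
  | { kind: "equals"; attribute: string; value: string }
  | { kind: "endsWith"; attribute: string; suffix: string }
  | { kind: "and"; children: Condition[] }
  | { kind: "or"; children: Condition[] };

const MAX_DEPTH = 5; // nesting limit mentioned in the post

function matches(cond: Condition, ctx: Record<string, string>, depth: number = 1): boolean {
  if (depth > MAX_DEPTH) {
    throw new Error("conditions nested deeper than " + MAX_DEPTH + " levels");
  }
  if (cond.kind === "equals") {
    return ctx[cond.attribute] === cond.value;
  }
  if (cond.kind === "endsWith") {
    const actual = ctx[cond.attribute];
    return actual !== undefined
      && cond.suffix.length > 0
      && actual.slice(-cond.suffix.length) === cond.suffix;
  }
  if (cond.kind === "and") {
    return cond.children.every((child) => matches(child, ctx, depth + 1));
  }
  // Remaining case: "or"
  return cond.children.some((child) => matches(child, ctx, depth + 1));
}

// (plan == "enterprise" AND region == "us") OR email ends with "@cloudflare.com"
const premiumRule: Condition = {
  kind: "or",
  children: [
    {
      kind: "and",
      children: [
        { kind: "equals", attribute: "plan", value: "enterprise" },
        { kind: "equals", attribute: "region", value: "us" },
      ],
    },
    { kind: "endsWith", attribute: "email", suffix: "@cloudflare.com" },
  ],
};

const servePremium = matches(premiumRule, { plan: "free", region: "eu", email: "dev@cloudflare.com" });
```

    <p>Here the user is not on the enterprise plan, but the email branch of the OR matches, so the rule applies.</p>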
    <div>
      <h4>Flag Rollouts by Percentage</h4>
      <a href="#flag-rollouts-by-percentage">
        
      </a>
    </div>
    <p>Unlike <a href="https://developers.cloudflare.com/workers/configuration/versions-and-deployments/gradual-deployments/"><u>gradual deployments</u></a>, which split traffic between different uploaded versions of your Worker, feature flags let you roll out behavior by percentage within a single version that is serving 100% of traffic. </p><p>Any rule can include a percentage rollout. Instead of serving a variation to everyone who matches the conditions, you serve it to a percentage of them. </p><p>Rollouts use consistent hashing on the specified context attribute. The same attribute value (a given userId, for example) always hashes to the same bucket, so a user won't flip between variations across requests. As you ramp from 5% to 10% to 50% to 100% of users, those who were already in the rollout stay in it.</p>
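    <p>The stickiness property described above can be illustrated with deterministic hash bucketing. This is a sketch, not Flagship's algorithm: the djb2-style hash, the 100-bucket scheme, and salting the hash with the flag key are all assumptions chosen for illustration.</p>

```typescript
// djb2-style string hash, reduced to an unsigned 32-bit integer.
// The post does not say which hash Flagship uses; this one is illustrative.
function hashString(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = ((h * 33) ^ s.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Map an attribute value to a stable bucket in [0, 99].
// Salting with the flag key (an assumption) keeps buckets independent per flag.
function bucketFor(flagKey: string, attributeValue: string): number {
  return hashString(flagKey + ":" + attributeValue) % 100;
}

// A value is in the rollout when its bucket falls below the percentage.
function inRollout(flagKey: string, attributeValue: string, percentage: number): boolean {
  return bucketFor(flagKey, attributeValue) < percentage;
}

// The same user always lands in the same bucket for a given flag.
const userBucket = bucketFor("new-checkout-flow", "user-42");
const enabledAt50 = inRollout("new-checkout-flow", "user-42", 50);
```

    <p>Because a user's bucket never changes, anyone with a bucket below 5 is also below 10, 50, and 100, which is why raising the percentage only ever adds users to the rollout in this scheme.</p>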
    <div>
      <h3>Built for what comes next</h3>
      <a href="#built-for-what-comes-next">
        
      </a>
    </div>
    <p>AI-generated code entering production is only going to accelerate. Agentic workflows will push it further, with agents that autonomously deploy, test, and iterate on code in production. The teams that thrive in this world won't be the ones shipping the fastest. They'll be the ones who can ship fast and still maintain control over what their users see, roll back in seconds when something breaks, and gradually expose new code paths with confidence.</p><p>That's what Flagship is built for:</p><ul><li><p><b>Evaluation across region Earth,</b> cached globally using KV.</p></li><li><p><b>A full audit trail.</b> Every flag change is recorded with field-level diffs, so you know who changed what and when.</p></li><li><p><b>Dashboard integration</b>. Anyone on the team can toggle a flag or adjust a rollout without touching code.</p></li><li><p><b>OpenFeature compatibility.</b> Adopt Flagship without rewriting your evaluation code. Leave without rewriting it either.</p></li></ul>
    <div>
      <h3>Get started with Flagship</h3>
      <a href="#get-started-with-flagship">
        
      </a>
    </div>
    <p>Starting today, Flagship is in private beta. You can request access <a href="https://www.cloudflare.com/lp/flagship-private-beta/"><u>here</u></a>. We'll share more details on pricing as we approach general availability.</p><ul><li><p>Visit the <a href="https://dash.cloudflare.com"><u>Cloudflare dashboard</u></a> to create your first Flagship app</p></li><li><p>Install the SDK: <code>npm i @cloudflare/flagship</code>, or use the Worker binding directly in your Worker</p></li><li><p>Read the <a href="https://developers.cloudflare.com/flagship/get-started/"><u>documentation</u></a> for integration guides and API reference</p></li><li><p>Check out the <a href="https://github.com/cloudflare/flagship"><u>source code</u></a> for examples and to contribute</p></li></ul><p>If you're currently hardcoding flags in your Workers, or evaluating flags through an external service that adds latency to every request, give Flagship a try. We'd love to hear what you build.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Feature Flags]]></category>
            <guid isPermaLink="false">3d8fI4f0doBxr4ybfKEymB</guid>
            <dc:creator>Rohan Mukherjee</dc:creator>
            <dc:creator>Abhishek Kankani</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare’s AI Platform: an inference layer designed for agents]]></title>
            <link>https://blog.cloudflare.com/ai-platform/</link>
            <pubDate>Thu, 16 Apr 2026 14:05:00 GMT</pubDate>
            <description><![CDATA[ We're building AI Gateway into a unified inference layer for AI, letting developers call models from 14+ providers. New features include Workers AI binding integration and an expanded catalog with multimodal models.
 ]]></description>
            <content:encoded><![CDATA[ <p>AI models are changing quickly: the best model for agentic coding today might be replaced in three months by a completely different model from a different provider. On top of this, real-world use cases often require calling more than one model. Your customer support agent might use a fast, cheap model to classify a user's message; a large reasoning model to plan its actions; and a lightweight model to execute individual tasks.</p><p>This means you need access to all the models, without tying yourself financially and operationally to a single provider. You also need the right systems in place to monitor costs across providers, ensure reliability when one of them has an outage, and manage latency no matter where your users are.</p><p>These challenges are present whenever you’re building with AI, but they get even more pressing when you’re building <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/"><u>agents</u></a>. A simple chatbot might make one <a href="https://www.cloudflare.com/learning/ai/inference-vs-training/"><u>inference</u></a> call per user prompt. An agent might chain ten calls together to complete a single task, so a provider that is 50ms slower per call doesn't add 50ms, it adds 500ms. A failed request isn't just a retry; it can set off a cascade of downstream failures. </p><p>Since launching AI Gateway and Workers AI, we’ve seen incredible adoption from developers building AI-powered applications on Cloudflare and we’ve been shipping fast to keep up! In just the past few months, we've refreshed the dashboard, added zero-setup default gateways, automatic retries on upstream failures, and more granular logging controls. Today, we’re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable. </p>
    <div>
      <h3>One catalog, one unified endpoint</h3>
      <a href="#one-catalog-one-unified-endpoint">
        
      </a>
    </div>
    <p>Starting today, you can call third-party models using the same AI.run() binding you already use for Workers AI. If you’re using Workers, switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or any other provider is a one-line change. </p>
            <pre><code>const response = await env.AI.run('anthropic/claude-opus-4-6', {
  input: 'What is Cloudflare?',
}, {
  gateway: { id: "default" },
});</code></pre>
            <p>For those who don’t use Workers, we’ll be releasing REST API support in the coming weeks, so you can access the full model catalog from any environment.</p><p>We’re also excited to share that you'll now have access to 70+ models across 12+ providers, all through one API, with one line of code to switch between them and one set of credits to pay for them. And we’re quickly expanding this as we go.</p><p>You can browse through our <a href="https://developers.cloudflare.com/ai/models"><u>model catalog</u></a> to find the best model for your use case, from open-source models hosted on Cloudflare Workers AI to proprietary models from the major model providers. We’re excited to be expanding access to models from <b>Alibaba Cloud, AssemblyAI, ByteDance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu</b>, who will provide their models through AI Gateway. Notably, we’re expanding our model offerings to include image, video, and speech models so that you can build multimodal applications.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ez5tichGEEn5k6SzCgWLm/380a685b14ee9732fdf87c6f88c8f39e/BLOG-3209_2.png" />
          </figure><p>Accessing all your models through one API also means you can manage all your AI spend in one place. Most companies today are calling <a href="https://aidbintel.com/pulse-survey"><u>an average of 3.5 models</u></a> across multiple providers, which means no one provider is able to give you a holistic view of your AI usage. <b>With AI Gateway, you’ll get one centralized place to monitor and manage AI spend.</b></p><p>By including custom metadata with your requests, you can get a breakdown of your costs on the attributes that you care about most, like spend by free vs. paid users, by individual customers, or by specific workflows in your app.</p>
            <pre><code>const response = await env.AI.run('@cf/moonshotai/kimi-k2.5',
  {
    prompt: 'What is AI Gateway?'
  },
  {
    metadata: { "teamId": "AI", "userId": 12345 }
  }
);</code></pre>
            
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ez3O7rmbrCUdD5R5UcuP9/4c219ff5ce1e24a0485a931b6af47608/BLOG-3209_3.png" />
          </figure>
    <div>
      <h3>Bring your own model</h3>
      <a href="#bring-your-own-model">
        
      </a>
    </div>
    <p>AI Gateway gives you access to models from all the providers through one API. But sometimes you need to run a model you've fine-tuned on your own data or one optimized for your specific use case. For that, we are working on letting users bring their own model to Workers AI. </p><p>The overwhelming majority of our traffic comes from dedicated instances for Enterprise customers who are running custom models on our platform, and we want to bring this to more customers. To do this, we leverage Replicate’s <a href="https://cog.run/"><u>Cog</u></a> technology to help you containerize machine learning models.</p><p>Cog is designed to be quite simple: all you need to do is write down dependencies in a cog.yaml file, and your inference code in a Python file. Cog abstracts away all the hard things about packaging ML models, such as CUDA dependencies, Python versions, weight loading, etc. </p><p>Example of a <code>cog.yaml</code> file:</p>
            <pre><code>build:
  python_version: "3.13"
  python_requirements: requirements.txt
predict: "predict.py:Predictor"</code></pre>
            <p>Example of a <code>predict.py</code> file, which has a function to set up the model and a function that runs when you receive an inference request (a prediction):</p>
            <pre><code>from cog import BasePredictor, Path, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        """Load the model into memory to make running multiple predictions efficient"""
        self.net = torch.load("weights.pth")

    def predict(self,
            image: Path = Input(description="Image to enlarge"),
            scale: float = Input(description="Factor to scale image by", default=1.5)
    ) -&gt; Path:
        """Run a single prediction on the model"""
        # ... pre-processing ...
        output = self.net(input)
        # ... post-processing ...
        return output</code></pre>
            <p>Then, you can run <code>cog build</code> to build your container image, and push your Cog container to Workers AI. We will deploy and serve the model for you, which you then access through your usual Workers AI APIs. </p><p>We’re working on some big projects to be able to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting. We’ve been testing this internally with Cloudflare teams and some external customers who are guiding our vision. If you’re interested in being a design partner with us, please reach out! Soon, anyone will be able to package their model and use it through Workers AI.</p>
    <div>
      <h3>The fast path to first token</h3>
      <a href="#the-fast-path-to-first-token">
        
      </a>
    </div>
    <p>Using Workers AI models with AI Gateway is particularly powerful if you’re building live agents – where a user's perception of speed hinges on time to first token or how quickly the agent starts responding, rather than how long the full response takes. Even if total inference is 3 seconds, getting that first token 50ms faster makes the difference between an agent that feels zippy and one that feels sluggish.</p><p>Cloudflare's network of data centers in 330 cities around the world means AI Gateway is positioned close to both users and inference endpoints, minimizing the network time before streaming begins.</p><p>Workers AI also hosts open-source models on its public catalog, which now includes large models purpose-built for agents, including <a href="https://developers.cloudflare.com/workers-ai/models/kimi-k2.5"><u>Kimi K2.5</u></a> and real-time voice models. When you call these Cloudflare-hosted models through AI Gateway, there's no extra hop over the public Internet since your code and inference run on the same global network, giving your agents the lowest latency possible.</p>
    <div>
      <h3>Built for reliability with automatic failover</h3>
      <a href="#built-for-reliability-with-automatic-failover">
        
      </a>
    </div>
    <p>When building agents, speed is not the only factor that users care about – reliability matters too. Every step in an agent workflow depends on the steps before it. Reliable inference is crucial for agents because one call failing can affect the entire downstream chain. </p><p>Through AI Gateway, if you're calling a model that's available on multiple providers and one provider goes down, we'll automatically route to another available provider without you having to write any failover logic of your own. </p><p>If you’re building <a href="https://blog.cloudflare.com/project-think/"><u>long-running agents with Agents SDK</u></a>, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they’re generated, independently of your agent's lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or pay twice for the same output tokens. Combined with the Agents SDK's built-in checkpointing, the end user never notices.</p>
    <div>
      <h3>Replicate</h3>
      <a href="#replicate">
        
      </a>
    </div>
    <p>The Replicate team has officially <a href="https://blog.cloudflare.com/replicate-joins-cloudflare/"><u>joined</u></a> our AI Platform team, so much so that we don’t even consider ourselves separate teams anymore. We’ve been hard at work on integrations between Replicate and Cloudflare, which include bringing all the Replicate models onto AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. Soon, you’ll be able to access the models you loved on Replicate through AI Gateway, and host the models you deployed on Replicate on Workers AI as well.</p>
    <div>
      <h3>Get started</h3>
      <a href="#get-started">
        
      </a>
    </div>
    <p>To get started, check out our documentation for <a href="https://developers.cloudflare.com/ai-gateway"><u>AI Gateway</u></a> or <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a>. Learn more about building agents on Cloudflare through <a href="https://developers.cloudflare.com/agents/"><u>Agents SDK</u></a>. </p>
    <div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[AI Gateway]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[LLM]]></category>
            <guid isPermaLink="false">2vIUFXJLJcMgjgY6jFnQ7G</guid>
            <dc:creator>Ming Lu</dc:creator>
            <dc:creator>Michelle Chen</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building the foundation for running extra-large language models]]></title>
            <link>https://blog.cloudflare.com/high-performance-llms/</link>
            <pubDate>Thu, 16 Apr 2026 14:00:00 GMT</pubDate>
            <description><![CDATA[ We built a custom technology stack to run fast large language models on Cloudflare’s infrastructure. This post explores the engineering trade-offs and technical optimizations required to make high-performance AI inference accessible. ]]></description>
            <content:encoded><![CDATA[ <p>An agent needs to be powered by a large language model. A few weeks ago, we announced that Workers AI is officially entering the arena for hosting large open-source models like Moonshot’s Kimi K2.5. Since then, we’ve made Kimi K2.5 3x faster and have more model additions in-flight. These models have been the backbone of a lot of the agentic products, harnesses, and tools that we have been launching this week. </p><p>Hosting AI models is an interesting challenge: it requires a delicate balance between software and very, very expensive hardware. At Cloudflare, we’re good at squeezing every bit of efficiency out of our hardware through clever software engineering. This is a deep dive on how we’re laying the foundation to run extra-large language models.</p>
    <div>
      <h2>Hardware configurations</h2>
      <a href="#hardware-configurations">
        
      </a>
    </div>
    <p>As we mentioned in <a href="https://blog.cloudflare.com/workers-ai-large-models/"><u>our previous Kimi K2.5 blog post</u></a>, we’re using a variety of hardware configurations in order to best serve models. The right configuration depends largely on the size of the inputs and outputs that users are sending to the model. For example, if you are using a model to write fanfiction, you might give it a few small prompts (input tokens) while asking it to generate pages of content (output tokens).</p><p>Conversely, if you are running a summarization task, you might be sending in hundreds of thousands of input tokens, but only generating a small summary with a few thousand output tokens. Presented with these opposing use cases, you have to make a choice: should you tune your model configuration so it’s faster at processing input tokens, or faster at generating output tokens?</p><p>When we launched large language models on Workers AI, we knew that most of the usage would come from agents. With agents, you send in a large number of input tokens. The context starts off with a large system prompt, all the tools, and MCPs. From the first user prompt onward, that context keeps growing. Each new prompt from the user sends a request to the model, which consists of everything that was said before: all the previous user prompts, assistant messages, code generated, etc. For Workers AI, that means we had to focus on two things: fast input token processing and fast tool calling.</p>
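<p>A small sketch of why agent traffic is input-heavy: because every request resends the whole conversation, the input size grows on each turn even when the individual prompts are short. The function name and the token counts in the usage note are illustrative assumptions, not measurements.</p>

```typescript
// One conversational turn: tokens in the user's prompt and in the model's reply.
interface Turn {
  userTokens: number;
  assistantTokens: number;
}

// Returns the number of input tokens sent to the model on each turn,
// assuming the full history is resent every time.
function inputTokensPerTurn(systemTokens: number, turns: Turn[]): number[] {
  const perTurn: number[] = [];
  let history = systemTokens; // system prompt and tool definitions
  for (const t of turns) {
    history += t.userTokens;      // the new user prompt joins the context
    perTurn.push(history);        // the entire context is the request's input
    history += t.assistantTokens; // the reply becomes input for the next turn
  }
  return perTurn;
}
```

<p>With a 10,000-token system prompt and three turns of 500 prompt tokens and 1,500 reply tokens each, the requests carry 10,500, 12,500, and 14,500 input tokens, while only 4,500 output tokens are generated in total.</p>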
    <div>
      <h3>Prefill decode (PD) disaggregation</h3>
      <a href="#prefill-decode-pd-disaggregation">
        
      </a>
    </div>
    <p>One hardware configuration that we use to improve performance and efficiency is disaggregated prefill. There are two stages to processing an LLM request: prefill, which processes the input tokens and populates the KV cache, and decode, which generates output tokens. Prefill is usually compute bound, while decode is memory bound. This means that the parts of the GPU that are used in each stage are different, and since prefill is always done before decode, the stages block one another. Ultimately, it means that we are not efficiently utilizing all of our GPU power if we do both prefill and decode on a single machine.</p><p>With prefill decode disaggregation, separate inference servers are run for each stage. First, a request is sent to the prefill server, which performs prefill and stores the result in its KV cache. Then the same request is sent to the decode server, with information about how to transfer the KV cache from the prefill server and begin decoding. This has a number of advantages, because it allows the servers to be tuned independently for the role they are performing, scaled to account for more input-heavy or output-heavy traffic, or even run on heterogeneous hardware.</p><p>This architecture requires a relatively complex load balancer. Beyond just routing the requests as described above, it must rewrite the responses (including streaming SSE) of the decode server to include information from the prefill server such as cached tokens. To complicate matters, different inference servers require different information to initiate the KV cache transfer. We extended this to implement token-aware load balancing, in which there is a pool of prefill and decode endpoints, and the load balancer estimates how many prefill or decode tokens are in-flight to each endpoint in the pool and attempts to spread this load evenly.</p><p>After our public model launch, our input/output patterns changed drastically again. We took the time to analyze our new usage patterns and then tuned our configuration to fit our customers’ use cases.</p><p>Here’s a graph of the drop in our p90 Time to First Token after shifting traffic to our new PD disaggregated architecture, while request volume increased and the number of GPUs stayed the same. We see a significant improvement in the tail latency variance.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5j6AEIF1GMhnHuJD08VQtw/9e65e2badc5fd1e75557230e8f455ccc/BLOG-3266_2.png" />
          </figure><p>Similarly, p90 time per token went from ~100 ms with high variance to 20-30 ms, at least a 3x improvement in intertoken latency.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6MNBGFHwI3U0bYwuAosg2g/394f4f5708cc1d3b045dc3f52c64b808/BLOG-3266_3.png" />
          </figure>
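<p>The token-aware load balancing described earlier in this section can be sketched roughly as follows. This is a simplified illustration, not our load balancer: the <code>TokenAwarePool</code> class and its method names are hypothetical, and it assumes the caller can supply a token estimate for each request.</p>

```typescript
interface Endpoint {
  name: string;
  inFlightTokens: number; // estimated tokens currently being processed
}

class TokenAwarePool {
  private endpoints: Endpoint[];

  constructor(names: string[]) {
    if (names.length === 0) throw new Error("pool needs at least one endpoint");
    this.endpoints = names.map((name) => ({ name, inFlightTokens: 0 }));
  }

  // Pick the endpoint with the fewest estimated in-flight tokens
  // and account for the new request against it.
  acquire(estimatedTokens: number): Endpoint {
    let best = this.endpoints[0];
    for (const e of this.endpoints) {
      if (e.inFlightTokens < best.inFlightTokens) best = e;
    }
    best.inFlightTokens += estimatedTokens;
    return best;
  }

  // Call when the request finishes so the estimate is released.
  release(endpoint: Endpoint, estimatedTokens: number): void {
    endpoint.inFlightTokens = Math.max(0, endpoint.inFlightTokens - estimatedTokens);
  }
}
```

<p>Balancing on tokens rather than request counts matters because one 100,000-token prefill keeps an endpoint busy far longer than several 2,000-token requests. In a disaggregated setup you would keep one such pool for prefill endpoints and another for decode endpoints.</p>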
    <div>
      <h3>Prompt Caching</h3>
      <a href="#prompt-caching">
        
      </a>
    </div>
    <p>Since agentic use cases usually have long contexts, we optimize for efficient prompt caching so that we don’t recompute input tensors on every turn. We use a header called <code>x-session-affinity</code> to help route requests to the region that already holds the computed input tensors. We wrote about this in our <a href="https://blog.cloudflare.com/workers-ai-large-models/"><u>original blog post</u></a> about launching large LLMs on Workers AI. We added session affinity headers to popular agent harnesses <a href="https://github.com/anomalyco/opencode/pull/20744"><u>like OpenCode</u></a>, where we noticed a significant increase in total throughput. Even a small difference in prompt caching across our users can add up to a multiple of the GPUs needed to run a model. While we have KV-aware routing internally, we also rely on clients sending the <code>x-session-affinity</code> header in order to be explicit about prompt caching. We incentivize the use of the header by offering discounted cached tokens. We highly encourage users to <a href="https://developers.cloudflare.com/workers-ai/features/prompt-caching/"><u>leverage prompt caching</u></a> in order to get faster inference and cheaper pricing.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/gKzvEB0JlL1L0IHIby7TV/fe82399fe870d343a572f053d62b4e52/BLOG-3266_4.png" />
          </figure><p>We worked with our heaviest internal users to adopt this header. The result was an increase in input token cache hit ratios from 60% to 80% during peak times. This significantly increases the request throughput that we can handle, while offering better performance for interactive or time-sensitive sessions like OpenCode or AI code reviews.</p>
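<p>As a hedged illustration of adopting the header, the sketch below builds a chat request that carries the same <code>x-session-affinity</code> value on every turn of a session. Only the header name comes from this post. The helper name, the use of an arbitrary stable session ID as the header value, the endpoint path, and the model ID are assumptions for illustration, so check the prompt caching documentation for the exact details.</p>

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds (but does not send) a request. Reusing the same sessionId on every
// turn lets the router send the session back to where its prompt is cached.
function buildChatRequest(
  accountId: string,
  apiToken: string,
  sessionId: string,
  messages: ChatMessage[],
): PreparedRequest {
  return {
    // Assumed endpoint path; verify against the Workers AI documentation.
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/ai/v1/chat/completions`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
      "x-session-affinity": sessionId,
    },
    // Assumed model ID, shown for illustration only.
    body: JSON.stringify({ model: "@cf/moonshotai/kimi-k2.5", messages }),
  };
}
```

<p>The resulting object can be passed to <code>fetch(req.url, req)</code>. The important design choice is that the session ID is generated once per agent session and reused, rather than regenerated per request.</p>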
    <div>
      <h3>KV-cache optimization</h3>
      <a href="#kv-cache-optimization">
        
      </a>
    </div>
    <p>As we’re serving larger models now, one instance can span multiple GPUs. This means that we had to find an efficient way to share KV cache across GPUs. KV cache is where all the input tensors from prefill (result of prompts in a session) are stored, and initially lives in the VRAM of a GPU. Every GPU has a fixed VRAM size, so if your model instance requires multiple GPUs, there needs to be a way for the KV cache to span those GPUs and for them to exchange it with each other. To achieve this for Kimi, we leveraged <a href="https://github.com/kvcache-ai/Mooncake"><u>Moonshot AI’s</u></a> Mooncake Transfer Engine and Mooncake Store.</p><p>Mooncake’s Transfer Engine is a high-performance data transfer framework. It works with Remote Direct Memory Access (RDMA) as well as transports such as NVLink and NVMe over Fabric, enabling direct memory-to-memory data transfer without involving the CPU. It improves the speed of transferring data across multiple GPU machines, which is particularly important in multi-GPU and multi-node configurations for models.</p><p>When Mooncake Store is paired with LMCache or SGLang HiCache, the cache is shared across all nodes in the cluster, allowing a prefill node to identify and re-use a cache from a previous request that was originally pre-filled on a different node. This eliminates the need for session-aware routing within a cluster and allows us to load balance the traffic much more evenly. Mooncake Store also allows us to extend the cache beyond GPU VRAM, and leverage NVMe storage. This extends the time that sessions remain in cache, improving our cache hit ratio and allowing us to handle more traffic and offer better performance to users.</p>
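<p>To illustrate how a cluster-wide cache lets any node re-use another node's prefill work, here is a conceptual sketch of prefix lookup by chained block hashes. It does not use the Mooncake, LMCache, or SGLang APIs: <code>SharedKvStore</code>, the FNV-1a hash, and the block size are all stand-ins chosen for illustration, and a real store holds KV tensors rather than node names.</p>

```typescript
// FNV-1a 32-bit hash, standing in for whatever hash a real cache uses.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// One key per full block of tokens. Each key chains the previous key,
// so it identifies the whole prefix up to that block, not just the block.
function blockKeys(tokens: number[], blockSize: number): string[] {
  const keys: string[] = [];
  let prev = "";
  for (let i = 0; i + blockSize <= tokens.length; i += blockSize) {
    prev = fnv1a(prev + "|" + tokens.slice(i, i + blockSize).join(","));
    keys.push(prev);
  }
  return keys;
}

class SharedKvStore {
  // key -> node that computed the block (placeholder for the KV tensors)
  private blocks = new Map<string, string>();

  publish(tokens: number[], blockSize: number, node: string): void {
    for (const key of blockKeys(tokens, blockSize)) {
      if (!this.blocks.has(key)) this.blocks.set(key, node);
    }
  }

  // Number of leading tokens whose KV cache is already in the store.
  cachedPrefixLength(tokens: number[], blockSize: number): number {
    let cached = 0;
    for (const key of blockKeys(tokens, blockSize)) {
      if (!this.blocks.has(key)) break;
      cached += blockSize;
    }
    return cached;
  }
}
```

<p>If one node publishes the blocks for a session's first eight tokens, a different node handling the next turn can look up the same keys, find eight cached tokens, and prefill only the new suffix.</p>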
    <div>
      <h3>Speculative decoding</h3>
      <a href="#speculative-decoding">
        
      </a>
    </div>
    <p>LLMs work by predicting the next token in a sequence, based on the tokens that came before it. With a naive implementation, a model only predicts the next token <i>n</i> in each forward pass, but we can actually make it produce tokens <i>n+1, n+2...</i> in a single forward pass as well. This popular technique is known as speculative decoding, which we’ve written about in a <a href="https://blog.cloudflare.com/making-workers-ai-faster/"><u>previous post on Workers AI</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/25Au0LKNr8ozQg5UQM34wY/06080a8699d5f8edb050450913a19a40/BLOG-3266_5.png" />
          </figure><p>With speculative decoding, we leverage a smaller LLM (the draft model) to generate a few candidate tokens for the target model to choose from. The target model then just has to select from a small pool of candidate tokens in a single forward pass. Validating the tokens is faster and less computationally expensive than using the larger target model to generate the tokens. However, quality is still upheld as the target model ultimately has to accept or reject the draft tokens.</p><p>In agentic use cases, speculative decoding really shines because of the volume of tool calls and structured outputs that models need to generate. A tool call is largely predictable: you know there will be a name and a description, and that it’s wrapped in a JSON envelope.</p><p>To do this with Kimi K2.5, we leverage <a href="https://huggingface.co/nvidia/Kimi-K2.5-Thinking-Eagle3"><u>NVIDIA’s EAGLE-3</u></a> (Extrapolation Algorithm for Greater Language-model Efficiency) draft model. The levers for tuning speculative decoding include the number of future tokens to generate. As a result, we’re able to achieve high-quality inference while speeding up tokens-per-second throughput.</p>
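<p>The accept-or-reject loop can be sketched with toy models. This is a simplified greedy variant for illustration, not EAGLE-3 or Infire: both "models" are plain functions from context to next token, and the verification loop below stands in for what a real engine does in one batched forward pass of the target model.</p>

```typescript
// A toy "model": given the tokens so far, return the next token.
type NextToken = (context: string[]) => string;

// One speculative step: the draft proposes k tokens, the target verifies them.
// Returns the tokens to append; this always matches what the target alone
// would have generated, so output quality is unchanged.
function speculativeStep(
  target: NextToken,
  draft: NextToken,
  context: string[],
  k: number,
): string[] {
  // 1. The cheap draft model proposes k candidate tokens.
  const proposed: string[] = [];
  for (let i = 0; i < k; i++) {
    proposed.push(draft([...context, ...proposed]));
  }

  // 2. The target checks each candidate (batched in a real engine).
  const accepted: string[] = [];
  for (const candidate of proposed) {
    const expected = target([...context, ...accepted]);
    if (expected !== candidate) {
      accepted.push(expected); // first mismatch: keep the target's token, stop
      return accepted;
    }
    accepted.push(candidate);
  }

  // 3. Every candidate was accepted, so the target adds one more token.
  accepted.push(target([...context, ...accepted]));
  return accepted;
}
```

<p>When the draft is right, as it often is for the fixed structure of a tool call, one target pass yields up to <code>k + 1</code> tokens. When it is wrong, the step still makes progress by one token.</p>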
    <div>
      <h2>Infire: our proprietary inference engine</h2>
      <a href="#infire-our-proprietary-inference-engine">
        
      </a>
    </div>
    <p>As we announced during Birthday Week in 2025, Cloudflare has a proprietary inference engine, <a href="https://blog.cloudflare.com/cloudflares-most-efficient-ai-inference-engine/"><u>Infire</u></a>, that makes machine learning models faster. Infire is an inference engine written in Rust, designed to support Cloudflare’s unique challenges with inference given our distributed global network. We’ve extended Infire support for this new class of large language models we are planning to run, which meant we had to build a few new features to make it all work.</p>
    <div>
      <h3>Multi-GPU support</h3>
      <a href="#multi-gpu-support">
        
      </a>
    </div>
    <p>Large language models like Kimi K2.5 are over 1 trillion parameters, which is about 560GB of model weights. A typical H100 has about 80GB of VRAM and the model weights need to be loaded in GPU memory in order to run. This means that a model like Kimi K2.5 needs at least 8 H100s in order to load the model into memory and run, and that’s not even including the extra VRAM you would need for KV cache, which holds your context window.</p><p>Since we initially launched Infire, we have added multi-GPU support, letting the inference engine run across multiple GPUs in either pipeline-parallel or tensor-parallel modes, with expert parallelism supported as well.</p><p>For pipeline parallelism, Infire attempts to properly load balance all stages of the pipeline, in order to prevent the GPUs of one stage from starving while other stages are executing. For tensor parallelism, on the other hand, Infire optimizes for reducing cross-GPU communication, making it as fast as possible. For most models, utilizing both pipeline parallelism and tensor parallelism in tandem provides the best balance of throughput and latency.</p>
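<p>The sizing arithmetic above can be checked with a few lines of code. The 560GB and 80GB figures are the approximate ones quoted in this section; the helper names are hypothetical, and the calculation ignores runtime overhead such as activations.</p>

```typescript
// Smallest GPU count whose combined VRAM can hold the weights at all.
function minGpusForWeights(weightsGB: number, vramPerGpuGB: number): number {
  return Math.ceil(weightsGB / vramPerGpuGB);
}

// VRAM left over after loading the weights, before any runtime overhead.
function headroomGB(weightsGB: number, vramPerGpuGB: number, gpus: number): number {
  return gpus * vramPerGpuGB - weightsGB;
}

const weights = 560; // GB, approximate
const h100 = 80;     // GB of VRAM, approximate

console.log(minGpusForWeights(weights, h100)); // 7
console.log(headroomGB(weights, h100, 7));     // 0
console.log(headroomGB(weights, h100, 8));     // 80
```

<p>Seven GPUs would hold the weights with nothing to spare, which leaves no room for activations or KV cache. Eight GPUs leave roughly 80GB in total before overhead, and that shared remainder is what the KV cache has to fit into.</p>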
    <div>
      <h3>Even lower memory overhead</h3>
      <a href="#even-lower-memory-overhead">
        
      </a>
    </div>
    <p>Infire already had much lower GPU memory overhead than <a href="https://vllm.ai/"><u>vLLM</u></a>, and we have optimized it even further, tightening the memory required for internal state like activations. Currently, Infire is capable of running Llama 4 Scout on just two H200 GPUs with more than 56 GiB remaining for KV cache, sufficient for more than 1.2 million tokens. Infire is also capable of running Kimi K2.5 on 8 H100 GPUs (yes, that is H100), with more than 30 GiB still available for KV cache. In both cases you would have trouble even booting vLLM in the first place.</p>
    <div>
      <h3>Faster cold-starts</h3>
      <a href="#faster-cold-starts">
        
      </a>
    </div>
    <p>While adding multi-GPU support, we identified additional opportunities to improve boot times. Even for the largest models, such as Kimi K2.5, Infire can begin serving requests in under 20 seconds. The load times are only bounded by the drive speed.</p>
    <div>
      <h3>Maximizing our hardware for faster throughput</h3>
      <a href="#maximizing-our-hardware-for-faster-throughput">
        
      </a>
    </div>
    <p>Investing in our proprietary inference engine enables us to maximize our hardware by getting up to 20% higher tokens per second throughput on unconstrained systems, and also enabling us to use lower-end hardware to run the latest models, where it was previously completely infeasible.</p>
    <div>
      <h2>The journey doesn’t end</h2>
      <a href="#the-journey-doesnt-end">
        
      </a>
    </div>
    <p>New technologies, research, and models come out on a weekly basis for the machine learning community. We’re continuously optimizing our technology stack in order to provide high-quality, performant inference for our customers while operating our GPUs efficiently. If these sound like interesting challenges for you – <a href="https://www.cloudflare.com/careers/jobs/"><u>we’re hiring</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Workers AI]]></category>
            <guid isPermaLink="false">71xNLfh83S7Fg78QEcgdhf</guid>
            <dc:creator>Michelle Chen</dc:creator>
            <dc:creator>Kevin Flansburg</dc:creator>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Artifacts: versioned storage that speaks Git]]></title>
            <link>https://blog.cloudflare.com/artifacts-git-for-agents-beta/</link>
            <pubDate>Thu, 16 Apr 2026 13:01:00 GMT</pubDate>
            <description><![CDATA[ Give your agents, developers, and automations a home for code and data. We’ve just launched Artifacts: Git-compatible versioned storage built for agents. Create tens of millions of repos, fork from any remote, and hand off a URL to any Git client.
 ]]></description>
            <content:encoded><![CDATA[ <p>Agents have changed how we think about source control, file systems, and persisting state. Developers and agents are generating more code than ever — more code will be written over the next 5 years than in all of programming history — and it’s driven an order-of-magnitude change in the scale of the systems needed to meet this demand. Source control platforms are especially struggling here: they were built to meet the needs of humans, not a 10x change in volume driven by agents who never sleep, can work on several issues at once, and never tire.</p><p>We think there’s a need for a new primitive: a distributed, versioned filesystem that’s built for agents first and foremost, and that can serve the types of applications that are being built today.</p><p>We’re calling this Artifacts: a versioned file system that speaks Git. You can create repositories programmatically, alongside your agents, sandboxes, Workers, or any other compute paradigm, and connect to it from any regular Git client.</p><p>Want to give every agent session a repo? Artifacts can do it. Every sandbox instance? Also Artifacts. Want to create 10,000 forks from a known-good starting point? You guessed it: Artifacts again. Artifacts exposes a REST API and native Workers API for creating repositories, generating credentials, and commits for environments where a Git client isn’t the right fit (i.e. in any serverless function).</p><p>Artifacts is available in private beta and we’re aiming to open this up as a public beta by early May.</p>
            <pre><code>// Create a repo
const repo = await env.AGENT_REPOS.create(name)
// Pass back the token &amp; remote to your agent
return { remote: repo.remote, token: repo.token }</code></pre>
            
            <pre><code># Clone it and use it like any regular git remote
$ git clone https://x:${TOKEN}@123def456abc.artifacts.cloudflare.net/git/repo-13194.git
</code></pre>
            <p>That’s it. A bare repo, ready to go, created on the fly, that any Git client can operate against.</p><p>And if you want to bootstrap an Artifacts repo from an existing Git repository so that your agent can work on it independently and push independent changes, you can do that too with <code>.import()</code>:</p>
            <pre><code>interface Env {
  ARTIFACTS: Artifacts
}

export default {
  async fetch(request: Request, env: Env) {
    // Import from GitHub
    const { remote, token } = await env.ARTIFACTS.import({
      source: {
        url: "https://github.com/cloudflare/workers-sdk",
        branch: "main",
      },
      target: {
        name: "workers-sdk",
      },
    })

    // Get a handle to the imported repo
    const repo = await env.ARTIFACTS.get("workers-sdk")

    // Fork to an isolated, read-only copy
    const fork = await repo.fork("workers-sdk-review", {
      readOnly: true,
    })

    return Response.json({ remote: fork.remote, token: fork.token })
  },
}</code></pre>
            <p><a href="http://developers.cloudflare.com/artifacts/"><u>Check out the documentation</u></a> to get started, or if you want to understand how Artifacts is being used, how it was built, and how it works under the hood: read on.</p>
    <div>
      <h2>Why Git? What’s a versioned file system?</h2>
      <a href="#why-git-whats-a-versioned-file-system">
        
      </a>
    </div>
    <p>Agents know Git. It’s deep in the training data of most models. The happy path <i>and</i> the edge cases are well known to agents, and code-optimized models (and/or harnesses) are particularly good at using Git.</p><p>Further, Git’s data model is not only good for source control, but for <i>anything</i> where you need to track state, time travel, and persist large amounts of small data. Code, config, session prompts and agent history: all of these are things (“objects”) that you often want to store in small chunks (“commits”) and be able to revert or otherwise roll back to (“history”).</p><p>We could have invented an entirely new, bespoke protocol… but then you have the bootstrap problem. AI models don’t know it, so you have to distribute skills, or a CLI, or hope that users are plugged into your docs MCP… all of that adds friction.</p><p>If we can just give agents an authenticated, secure HTTPS Git remote URL and have them operate as if it were a Git repo, though? That turns out to work pretty well. And for non-Git-speaking clients (such as a Cloudflare Worker, a Lambda function, or a Node.js app), we’ve exposed a REST API and (soon) language-specific SDKs. Those clients can also use <a href="https://isomorphic-git.org/"><u>isomorphic-git</u></a>, but in many cases a simpler TypeScript API can reduce the API surface needed.</p>
    <div>
      <h3>Not just for source control</h3>
      <a href="#not-just-for-source-control">
        
      </a>
    </div>
    <p>Artifacts’ Git API might make you think it’s just for source control, but it turns out that the Git API and data model are a powerful way to persist state that lets you fork, time-travel, and diff state for <i>any</i> data.</p><p>Inside Cloudflare, we’re using Artifacts for our internal agents: automatically persisting the current state of the filesystem <i>and</i> the session history in a per-session Artifacts repo. This enables us to:</p><ul><li><p>Persist sandbox state without having to provision block storage (and keep it around).</p></li><li><p>Share sessions with others and allow them to time-travel back through both session (prompt) state <i>and</i> file state, irrespective of whether there were commits to the “actual” repository (source control).</p></li><li><p>And best of all: <i>fork</i> a session from any point, allowing our team to share sessions with a co-worker and have them pick it up from there. Debugging something and want another set of eyes? Send a URL and fork it. Want to riff on an API? Have a co-worker fork it and pick up from where you left off.</p></li></ul><p>We’ve also spoken to teams who want to use Artifacts in cases where the Git protocol isn’t a requirement at all, but the semantics (reverting, cloning, diffing) <i>are</i>. Storing per-customer config as part of your product, and want the ability to roll back? Artifacts can be a good representation of this.</p><p>We’re excited to see teams explore the non-Git use cases around Artifacts just as much as the Git-focused ones.</p>
    <div>
      <h2>Under the hood</h2>
      <a href="#under-the-hood">
        
      </a>
    </div>
    <p>Artifacts is built on top of Durable Objects. The ability to create millions (or tens of millions+) of instances of stateful, isolated compute is inherent to how Durable Objects work today, and that’s exactly what we needed for supporting millions of Git repos per namespace.</p><p>Major League Baseball (for live game fan-out), Confluence Whiteboards, and our own <a href="https://developers.cloudflare.com/agents/"><u>Agents SDK</u></a> use Durable Objects under the hood at significant scale, and so we’re building this on a primitive that we’ve had in production for some time.</p><p>What we did need, however, was a Git server implementation that could run on Cloudflare Workers. It needed to be small, as complete as possible, extensible (<a href="https://git-scm.com/docs/git-notes"><u>notes</u></a>, <a href="https://git-lfs.com/"><u>LFS</u></a>), and efficient. So we built one in <a href="https://ziglang.org/"><u>Zig</u></a>, and compiled it to Wasm.</p><p>Why did we use Zig? Three reasons:</p><ol><li><p>The entire Git protocol engine is written in pure Zig (no libc), compiled to a ~100KB Wasm binary (with room for optimization!). It implements SHA-1, zlib inflate/deflate, delta encoding/decoding, pack parsing, and the full Git smart HTTP protocol, all from scratch, with zero external dependencies other than the standard library.</p></li><li><p>Zig gives us manual control over memory allocation, which is important in constrained environments like Durable Objects. The Zig Build System lets us easily share code between the Wasm runtime (production) and native builds (testing against libgit2 for correctness verification).</p></li><li><p>The Wasm module communicates with the JS host via a thin callback interface: 11 host-imported functions for storage operations (<code>host_get_object</code>, <code>host_put_object</code>, etc.) and one for streaming output (<code>host_emit_bytes</code>). The Wasm side is fully testable in isolation.</p></li></ol><p>Under the hood, Artifacts also uses R2 (for snapshots) and KV (for tracking auth tokens):</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/35SxJbQfntIscpotc0GBt8/48ae11213d7483c9b488321baacf78e7/BLOG-3269_2.png" />
          </figure><p><sup><code><i>How Artifacts works (Workers, Durable Objects, and WebAssembly)</i></code></sup></p><p>A Worker acts as the front-end, handling authentication &amp; authorization, key metrics (errors, latency) and looking up each Artifacts repository (Durable Object) on the fly.</p><p>Specifically:</p><ul><li><p>Files are stored in the underlying Durable Object’s SQLite database.</p><ul><li><p>Durable Object storage has a 2MB max row size, so large Git objects are chunked and stored across multiple rows.</p></li><li><p>We make use of the sync KV API (<code>state.storage.kv</code>), which is backed by SQLite under the hood.</p></li></ul></li><li><p>Durable Objects have ~128MB memory limits: this means we can spawn tens of millions of them (they’re fast and light) but have to work within those limits.</p><ul><li><p>We make heavy use of streaming in both the fetch and push paths, directly returning a <code>ReadableStream&lt;Uint8Array&gt;</code> built from the raw Wasm output chunks.</p></li><li><p>We avoid calculating our own Git deltas. Instead, the raw deltas and base hashes are persisted alongside the resolved object. On fetch, if the requesting client already has the base object, the Zig engine emits the delta instead of the full object, which saves bandwidth <i>and</i> memory.</p></li></ul></li><li><p>We support both v1 and v2 of the Git protocol.</p><ul><li><p>We support capabilities including ls-refs, shallow clones (deepen, deepen-since, deepen-relative), and incremental fetch with have/want negotiation.</p></li><li><p>We have an extensive test suite with conformance tests against Git clients and verification tests against a libgit2 server designed to validate protocol support.</p></li></ul></li></ul><p>On top of this, we have native support for <a href="https://git-scm.com/docs/git-notes"><u>git-notes</u></a>. Artifacts is designed to be agent-first, and notes enable agents to attach metadata to Git objects. This includes prompts, agent attribution and other metadata that can be read/written from the repo without mutating the objects themselves.</p>
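<p>The row-size constraint mentioned in the list above can be illustrated with a small chunking sketch. This is not the Artifacts implementation (which lives in Zig and Durable Object storage): <code>ChunkedObjectStore</code> is a hypothetical class, the in-memory <code>Map</code> stands in for the real storage API, and treating "2MB" as 2 MiB is an assumption.</p>

```typescript
// Assumed limit: 2 MiB per row (the exact byte count is a guess).
const MAX_ROW_BYTES = 2 * 1024 * 1024;

class ChunkedObjectStore {
  // In-memory stand-ins for storage rows; not the Durable Object API.
  private rows = new Map<string, Uint8Array>();
  private chunkCounts = new Map<string, number>();
  private maxRowBytes: number;

  constructor(maxRowBytes: number = MAX_ROW_BYTES) {
    this.maxRowBytes = maxRowBytes;
  }

  // Split the object into rows no larger than maxRowBytes.
  put(objectId: string, data: Uint8Array): number {
    const count = Math.max(1, Math.ceil(data.length / this.maxRowBytes));
    for (let i = 0; i < count; i++) {
      const start = i * this.maxRowBytes;
      this.rows.set(`${objectId}:${i}`, data.slice(start, start + this.maxRowBytes));
    }
    this.chunkCounts.set(objectId, count);
    return count;
  }

  // Read the rows back in order and reassemble the original bytes.
  get(objectId: string): Uint8Array | undefined {
    const count = this.chunkCounts.get(objectId);
    if (count === undefined) return undefined;
    const chunks: Uint8Array[] = [];
    let total = 0;
    for (let i = 0; i < count; i++) {
      const chunk = this.rows.get(`${objectId}:${i}`);
      if (!chunk) throw new Error(`missing chunk ${i} of ${objectId}`);
      chunks.push(chunk);
      total += chunk.length;
    }
    const out = new Uint8Array(total);
    let offset = 0;
    for (const chunk of chunks) {
      out.set(chunk, offset);
      offset += chunk.length;
    }
    return out;
  }
}
```

<p>This version reassembles the whole object in memory for simplicity. Under a ~128MB memory limit, a production read path would more likely stream the rows out one at a time.</p>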
    <div>
      <h2>Big repos, big problems? Meet ArtifactFS.</h2>
      <a href="#big-repos-big-problems-meet-artifactfs">
        
      </a>
    </div>
    <p>Most repos aren’t that big, and Git is <a href="https://github.blog/open-source/git/gits-database-internals-i-packed-object-store/"><u>designed to be extremely efficient</u></a> in terms of storage: most repositories take only a few seconds to clone at most, and that’s dominated by network setup time, auth, and <a href="https://git-scm.com/book/ms/v2/Git-Internals-Git-Objects"><u>checksumming</u></a>. In most agent or sandbox scenarios, that’s workable: just clone the repo as the sandbox starts and get to work.</p><p>But what about a multi-GB repository and/or repos with millions of objects? How can we clone that repo quickly, without blocking the agent’s ability to get to work for minutes and consuming compute?</p><p>A popular web framework (at 2.4GB and with a long history!) takes close to 2 minutes to clone. A shallow clone is faster, but not enough to get down to single-digit seconds, and we don’t always want to omit history (agents find it useful).</p><p>Can we get large repos down to ~10-15 seconds so that our agent can get to work? Well, yes: with a few tricks.</p><p>As part of our launch of Artifacts, <a href="https://github.com/cloudflare/artifact-fs"><u>we’re open-sourcing ArtifactFS</u></a>, a filesystem driver designed to mount large Git repos as quickly as possible, hydrating file contents on the fly instead of blocking on the initial clone. It's ideal for agents, sandboxes, containers and other use cases where startup time is critical. If you can shave ~90-100 seconds off your sandbox startup time for every large repo, and you’re running 10,000 of those sandbox jobs per month: that’s roughly 250-278 sandbox hours saved.</p><p>You can think of ArtifactFS as “Git clone but async”:</p><ul><li><p>ArtifactFS runs a blobless clone of a Git repository: it fetches the file tree and refs, but not the file contents. It can do that during sandbox startup, which then allows your agent harness to get to work.</p></li><li><p>In the background, it starts to hydrate (download) file contents concurrently via a lightweight daemon.</p></li><li><p>It prioritizes files that agents typically want to operate on first: package manifests (<code>package.json, go.mod</code>), configuration files, and code, deprioritizing binary blobs (images, executables and other non-text files) where possible so that agents can scan the file tree as the files themselves are hydrated.</p></li><li><p>If a file isn’t fully hydrated when the agent tries to read it, the read will block until it is.</p></li></ul><p>The filesystem does not attempt to “sync” files back to the remote repository: with thousands or millions of objects, that’s typically very slow, and since we’re speaking Git, we don’t need to. Your agent just needs to commit and push, as it would with any repository. No new APIs to learn.</p><p>Importantly, ArtifactFS works with any Git remote, not just our own Artifacts. If you’re cloning large repos from GitHub, GitLab, or self-hosted Git infrastructure: you can still use ArtifactFS.</p>
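<p>The hydration ordering described in the list above can be sketched as a simple priority function. This is an illustration rather than ArtifactFS's actual logic: the function names, the priority tiers, and the extension lists are assumptions, with only <code>package.json</code> and <code>go.mod</code> taken from the text.</p>

```typescript
// Illustrative lists; the real driver may classify files differently.
const MANIFESTS = new Set(["package.json", "go.mod"]);
const CONFIG_EXTENSIONS = [".json", ".yaml", ".yml", ".toml"];
const BINARY_EXTENSIONS = [".png", ".jpg", ".gif", ".exe", ".zip"];

// Lower number = hydrate sooner.
function hydrationPriority(path: string): number {
  const name = (path.split("/").pop() ?? path).toLowerCase();
  if (MANIFESTS.has(name)) return 0;                                // manifests
  if (CONFIG_EXTENSIONS.some((ext) => name.endsWith(ext))) return 1; // config
  if (BINARY_EXTENSIONS.some((ext) => name.endsWith(ext))) return 3; // blobs
  return 2;                                                          // code and text
}

// Order paths for background hydration; ties are broken alphabetically.
function hydrationOrder(paths: string[]): string[] {
  return [...paths].sort((a, b) => {
    const byPriority = hydrationPriority(a) - hydrationPriority(b);
    if (byPriority !== 0) return byPriority;
    return a < b ? -1 : a > b ? 1 : 0;
  });
}
```

<p>For a tree containing <code>assets/logo.png</code>, <code>src/index.ts</code>, <code>package.json</code>, and <code>tsconfig.json</code>, this order hydrates the manifest first and the image last, so the agent can read the files it is most likely to open while large blobs are still downloading.</p>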
    <div>
      <h2>What’s coming?</h2>
      <a href="#whats-coming">
        
      </a>
    </div>
    <p>Our release today is just the beta, and we’re already working on a number of features that you’ll see land over the next few weeks:</p><ul><li><p>Expanding the <a href="https://developers.cloudflare.com/artifacts/observability/metrics/"><u>available metrics</u></a> we expose. Today we’re shipping metrics for key operation counts per namespace and per repo, and for stored bytes per repo, so that managing millions of Artifacts isn’t toilsome.</p></li><li><p>Support for <a href="https://developers.cloudflare.com/queues/event-subscriptions/"><u>Event Subscriptions</u></a> for repo-level events so that we can emit events on pushes, pulls, clones, and forks to any repository within a namespace. This will also allow you to consume events, write webhooks, and use those events to notify end-users, drive lifecycle events within your products, and/or run post-push jobs (like CI/CD).</p></li><li><p>Native TypeScript, Go and Python client SDKs for interacting with the Artifacts API.</p></li><li><p>Repo-level search APIs and namespace-wide search APIs, e.g. “find all the repos with a <code>package.json</code> file”.</p></li></ul><p>We’re also planning an API for <a href="https://developers.cloudflare.com/workers/ci-cd/builds/"><u>Workers Builds</u></a>, allowing you to run CI/CD jobs on any agent-driven workflow.</p>
    <div>
      <h2>What will it cost me?</h2>
      <a href="#what-will-it-cost-me">
        
      </a>
    </div>
    <p>We’re still early with Artifacts, but want our pricing to work at agent-scale: it needs to be cost effective to have millions of repos, unused (or rarely used) repos shouldn’t be a drag, and our pricing should match the massively-single-tenant nature of agents.</p><p>You also shouldn’t have to think about whether a repo is going to be used or not, whether it’s hot or cold, and/or whether an agent is going to wake it up. We’ll charge you for the storage you consume and the operations (e.g. clones, forks, pushes &amp; pulls) against each repo.</p><table><tr><th><p></p></th><th><p><b>$/unit</b></p></th><th><p><b>Included</b></p></th></tr><tr><td><p><b>Operations</b></p></td><td><p>$0.15 per 1,000 operations</p></td><td><p>First 10k included (per month)</p></td></tr><tr><td><p><b>Storage</b></p></td><td><p>$0.50/GB-mo</p></td><td><p>First 1GB included.</p></td></tr></table><p>Big, busy repos will cost more than smaller, less-often-used repos, whether you have 1,000, 100,000, or 10 million of them.</p><p>We’ll also be bringing Artifacts to the Workers Free plan (with some fair limits) as the beta progresses, and we’ll provide updates throughout the beta should this pricing change and ahead of billing any usage.</p>
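<p>As a rough worked example of the table above, the sketch below estimates a monthly bill from operation counts and stored gigabytes. It assumes the included allowances (10k operations, 1GB) apply once per account per month, which the table does not spell out; the <code>estimateMonthlyCost</code> helper is hypothetical and the beta pricing may change.</p>

```typescript
// Prices and allowances copied from the pricing table above.
const OPS_PRICE_PER_1000 = 0.15; // USD per 1,000 operations
const INCLUDED_OPS = 10_000; // assumed to apply per account, per month
const STORAGE_PRICE_PER_GB_MONTH = 0.5; // USD per GB-month
const INCLUDED_STORAGE_GB = 1; // assumed to apply per account

function estimateMonthlyCost(operations: number, storedGb: number): number {
  const billableOps = Math.max(0, operations - INCLUDED_OPS);
  const billableGb = Math.max(0, storedGb - INCLUDED_STORAGE_GB);
  const cost =
    (billableOps / 1000) * OPS_PRICE_PER_1000 +
    billableGb * STORAGE_PRICE_PER_GB_MONTH;
  // Round to whole cents to avoid floating-point noise.
  return Math.round(cost * 100) / 100;
}

// 1,000,000 operations and 51 GB stored across all repos:
// 990 billable thousands * $0.15 + 50 billable GB * $0.50
const busyAccount = estimateMonthlyCost(1_000_000, 51);
console.log(busyAccount); // 173.5

// Usage inside the included allowances costs nothing.
const idleAccount = estimateMonthlyCost(5_000, 0.5);
console.log(idleAccount); // 0
```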
    <div>
      <h2>Where do I start? </h2>
      <a href="#where-do-i-start">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3xLMMKCN1HNGWbkSyq0tDZ/2a8d49383957804f3ce204783e11ae80/BLOG-3269_3.png" />
          </figure><p>Artifacts is launching in private beta, and we expect public beta to be ready in early May (2026, to be clear!). We’ll be allowing customers in progressively over the next few weeks, and <a href="https://forms.gle/DwBoPRa3CWQ8ajFp7"><u>you can register interest for the private beta</u></a> directly.</p><p>In the meantime, you can learn more about Artifacts by:</p><ul><li><p>Reading the <a href="http://developers.cloudflare.com/artifacts/get-started/workers/"><u>getting started guide</u></a> in the docs.</p></li><li><p>Visiting the Cloudflare dashboard (Build &gt; Storage &amp; Databases &gt; Artifacts)</p></li><li><p>Reading through the <a href="http://developers.cloudflare.com/artifacts/api/rest-api/"><u>REST API examples</u></a></p></li><li><p>Learning more about <a href="http://developers.cloudflare.com/artifacts/concepts/how-artifacts-works/"><u>how Artifacts works</u></a> under the hood</p></li></ul><p>Follow <a href="http://developers.cloudflare.com/changelog/product/artifacts/"><u>the changelog</u></a> to track the beta as it progresses.</p>
<p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[GitHub]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">2sshzOlmGVsrtBz2mgeceE</guid>
            <dc:creator>Dillon Mulroy</dc:creator>
            <dc:creator>Matt Carey</dc:creator>
            <dc:creator>Matt Silverlock</dc:creator>
        </item>
        <item>
            <title><![CDATA[AI Search: the search primitive for your agents]]></title>
            <link>https://blog.cloudflare.com/ai-search-agent-primitive/</link>
            <pubDate>Thu, 16 Apr 2026 13:00:22 GMT</pubDate>
            <description><![CDATA[ AI Search is the search primitive for your agents. Create instances dynamically, upload files, and search across instances with hybrid retrieval and relevance boosting. Just create a search instance, upload, and search.
 ]]></description>
            <content:encoded><![CDATA[ <p>Every <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/"><u>agent</u></a> needs search: coding agents search millions of files across repos, and support agents search customer tickets and internal docs. The use cases are different, but the underlying problem is the same: get the right information to the model at the right time.</p><p>If you're building search yourself, you need a vector index, an indexing pipeline that parses and chunks your documents, and something to keep the index up to date when your data changes. If you also need keyword search, that's a separate index and fusion logic on top. And if each of your agents needs its own searchable context, you're setting all of that up per agent. </p><p><a href="https://developers.cloudflare.com/ai-search/"><u>AI Search</u></a> (formerly <a href="https://blog.cloudflare.com/introducing-autorag-on-cloudflare/"><u>AutoRAG</u></a>) is the plug-and-play search primitive you need. You can dynamically create instances, give them your data, and search from a Worker, the Agents SDK, or the Wrangler CLI. Here's what we're shipping:</p><ul><li><p><b>Hybrid search.</b> Enable both semantic and keyword matching in the same query. Vector search and BM25 run in parallel and results are fused. (The search on our blog is now powered by AI Search. <i>Try the magnifying glass icon to the top right.</i>)</p></li><li><p><b>Built-in storage and index.</b> New instances come with their own storage and vector index. Upload files directly to an instance via API and they're indexed. No R2 buckets to set up, no external data sources to connect first. 
The new <code>ai_search_namespaces</code> binding lets you create and delete instances at runtime from your Worker, so you can spin up one per agent, per customer, or per language without redeployment.</p></li></ul><p>You can now also attach metadata to documents and use it to boost rankings at query time, and query across multiple instances in a single call.</p><p>Now, let's look at what this means in practice.</p>
    <div>
      <h2>In action: Customer Support Agent</h2>
      <a href="#in-action-customer-support-agent">
        
      </a>
    </div>
    <p>Let's walk through a support agent that searches for two kinds of knowledge: shared product docs, and per-customer history like past resolutions. The product docs are too large to fit in a context window, and each customer's history grows with every resolved issue, so the agent needs retrieval to find what's relevant.</p><p>Here's what that looks like with AI Search and the <a href="https://developers.cloudflare.com/agents"><u>Agents SDK</u></a>. Start by scaffolding a project:</p>
            <pre><code>npm create cloudflare@latest -- --template cloudflare/agents-starter
</code></pre>
            <p>First, bind an AI Search namespace to your Worker:</p>
            <pre><code>// wrangler.jsonc 
{
  "ai_search_namespaces": [
    { "binding": "SUPPORT_KB", "namespace": "support" }
  ],
  "ai": { "binding": "AI" },
  "durable_objects": {
    "bindings": [
      { "name": "SupportAgent", "class_name": "SupportAgent" }
    ]
  }
}
</code></pre>
            <p>Let's say your shared product documentation lives in an R2 bucket called <code>product-doc</code>. You can create a one-off AI Search instance (named <code>product-knowledge</code>) backed by the bucket on the Cloudflare Dashboard within the <code>support</code> namespace:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1b8NdFL2HDBy8FqBHEI679/f17ed98d45fb9b42a616e0b464460489/BLOG-3240_2.png" />
          </figure><p>That's your shared knowledge base, the docs every agent can reference.</p><p>When a customer comes back with a new issue, knowing what's already been tried saves everyone time. You can track this by creating an AI Search instance per customer. After each resolved issue, the agent saves a summary of what went wrong and how it was fixed. Over time, this builds up a searchable log of past resolutions. You can create instances dynamically using the namespace binding:</p>
            <pre><code>// create a per-customer instance when they first show up 
await env.SUPPORT_KB.create({
  id: `customer-${customerId}`,
  index_method: { keyword: true, vector: true }
});
</code></pre>
            <p>Each instance gets its own built-in storage and vector index — powered by <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2</u></a> and <a href="https://www.cloudflare.com/developer-platform/products/vectorize/"><u>Vectorize</u></a>. The instance starts empty and accumulates context over time. Next time the customer comes back, all of it is searchable.</p><p>Here's what the namespace looks like after a few customers:</p>
            <pre><code>namespace: "support"
├── product-knowledge     (R2 as source, shared across all agents)
├── customer-abc123       (managed storage, per-customer)
├── customer-def456       (managed storage, per-customer)
└── customer-ghi789       (managed storage, per-customer)

</code></pre>
            <p>Now the agent itself. It extends <code>AIChatAgent</code> from the Agents SDK and defines two tools. We're using <a href="https://blog.cloudflare.com/workers-ai-large-models/"><u>Kimi K2.5</u></a> as the LLM via <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Workers AI</u></a>. The model decides when to call the tools based on the conversation:</p>
            <pre><code>import { AIChatAgent, type OnChatMessageOptions } from "@cloudflare/ai-chat";
import { createWorkersAI } from "workers-ai-provider";
import { streamText, convertToModelMessages, tool, stepCountIs } from "ai";
import { routeAgentRequest } from "agents";
import { z } from "zod";

export class SupportAgent extends AIChatAgent&lt;Env&gt; {
  async onChatMessage(_onFinish: unknown, options?: OnChatMessageOptions) {
    // the client passes customerId in the request body
    // via the Agents SDK's sendMessage({ body: { customerId } })
    const customerId = options?.body?.customerId;

    // create a per-customer instance when they first show up.
    // each instance gets its own storage and vector index.
    if (customerId) {
      try {
        await this.env.SUPPORT_KB.create({
          id: `customer-${customerId}`,
          index_method: { keyword: true, vector: true }
        });
      } catch {
        // instance already exists
      }
    }

    const workersai = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: workersai("@cf/moonshotai/kimi-k2.5"),
      system: `You are a support agent. Use search_knowledge_base
        to find relevant docs before answering. Search results
        include both product docs and this customer's past
        resolutions — use them to avoid repeating failed fixes
        and to recognize recurring issues. When the issue is
        resolved, call save_resolution before responding.`,
      // this.messages is the full conversation history, automatically
      // persisted by AIChatAgent across reconnects
      messages: await convertToModelMessages(this.messages),
      tools: {
        // tool 1: search across shared product docs AND this
        // customer's past resolutions in a single call
        search_knowledge_base: tool({
          description: "Search product docs and customer history",
          inputSchema: z.object({
            query: z.string().describe("The search query"),
          }),
          execute: async ({ query }) =&gt; {
            // always search product docs;
            // include customer history if available
            const instances = ["product-knowledge"];
            if (customerId) {
              instances.push(`customer-${customerId}`);
            }
            return await this.env.SUPPORT_KB.search({
              query: query,
              ai_search_options: {
                // surface recent docs over older ones
                boost_by: [
                  { field: "timestamp", direction: "desc" }
                ],
                // search across both instances at once
                instance_ids: instances
              }
            });
          }
        }),

        // tool 2: after resolving an issue, the agent saves a
        // summary so future agents have full context
        save_resolution: tool({
          description:
            "Save a resolution summary after solving a customer's issue",
          inputSchema: z.object({
            filename: z.string().describe(
              "Short descriptive filename, e.g. 'billing-fix.md'"
            ),
            content: z.string().describe(
              "What the problem was, what caused it, and how it was resolved"
            ),
          }),
          execute: async ({ filename, content }) =&gt; {
            if (!customerId) return { error: "No customer ID" };
            const instance = this.env.SUPPORT_KB.get(
              `customer-${customerId}`
            );
            // uploadAndPoll waits until indexing is complete,
            // so the resolution is searchable before the next query
            const item = await instance.items.uploadAndPoll(
              filename, content
            );
            return { saved: true, filename, status: item.status };
          }
        }),
      },
      // cap agentic tool-use loops at 10 steps
      stopWhen: stepCountIs(10),
      abortSignal: options?.abortSignal,
    });

    return result.toUIMessageStreamResponse();
  }
}

// route requests to the SupportAgent durable object
export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ||
      new Response("Not found", { status: 404 })
    );
  }
} satisfies ExportedHandler&lt;Env&gt;;
</code></pre>
            <p>With this, the model decides when to search and when to save. When it searches, it queries <code>product-knowledge</code> and this customer's past resolutions together. When the issue is resolved, it saves a summary that's immediately searchable in future conversations. </p>
    <div>
      <h2>How AI Search finds what you're looking for</h2>
      <a href="#how-ai-search-finds-what-youre-looking-for">
        
      </a>
    </div>
    <p>Under the hood, AI Search runs a multi-step retrieval pipeline, in which every step is configurable.</p>
    <div>
      <h3>Hybrid Search: search that understands intent and matches terms</h3>
      <a href="#hybrid-search-search-that-understands-intent-and-matches-terms">
        
      </a>
    </div>
    <p>Until now, AI Search only offered vector search. Vector search is great at understanding intent, but it can lose specifics. In a query "ERR_CONNECTION_REFUSED timeout," the embedding captures the broad concept of connection failures. But the user isn't looking for general networking docs. They're looking for the specific document that mentions “ERR_CONNECTION_REFUSED”. Vector search might return results about troubleshooting without ever surfacing the page that contains that exact error string. </p><p>Keyword search fills that gap. AI Search now supports BM25, one of the most widely used retrieval scoring functions. BM25 scores documents by how often your query terms appear, how rare those terms are across the entire corpus, and how long the document is. It rewards matches on specific terms, penalizes common filler words, and normalizes for document length. When you search "ERR_CONNECTION_REFUSED timeout", BM25 finds documents that actually contain "ERR_CONNECTION_REFUSED" as a term. However, BM25 may miss a page about “troubleshooting network connections” even though it may be describing the same problem. That's where vector search shines, and why you need both.</p><p>When you enable hybrid search, it runs vector and BM25 in parallel, fuses the results, and optionally reranks them:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/27CV8IBS2dYTV5puCtIPmD/3c66c190127fa38c4a4275425de8f9c4/BLOG-3240_3.png" />
          </figure><p>Let's take a look at the new configurations for BM25, and how they come together.</p><ol><li><p><b>Tokenizer </b>controls how your documents are broken into matchable terms at index time. Porter stemmer (option: <code>porter</code>) stems words so "running" matches "run." Trigram (option: <code>trigram</code>) matches character substrings so "conf" matches "configuration." You can use porter for natural language content like docs, and trigram for code where partial matches matter.</p></li><li><p><b>Keyword match mode </b>controls which documents are candidates for BM25 scoring at query time. <code>AND</code> requires all query terms to appear in a document; <code>OR</code> includes anything with at least one match.</p></li><li><p><b>Fusion </b>controls how vector and keyword results are combined into the final list of results at query time. Reciprocal rank fusion (option: <code>rrf</code>) merges by rank position rather than score, which avoids comparing two incompatible scoring scales, whereas max fusion (option: <code>max</code>) takes the higher score.</p></li><li><p><b>(Optional) Reranking </b>adds a cross-encoder pass that re-scores results by evaluating the query and document together as a pair. It can help catch cases where a result has the right terms but isn't answering the question. </p></li></ol><p>Every option has a sane default when omitted. You have the flexibility to configure what matters whenever you create a new instance:</p>
            <pre><code>const instance = await env.AI_SEARCH.create({
  id: "my-instance",
  index_method: { keyword: true, vector: true },
  indexing_options: {
    keyword_tokenizer: "porter"
  },
  retrieval_options: {
    keyword_match_mode: "or"
  },
  fusion_method: "rrf",
  reranking: true,
  reranking_model: "@cf/baai/bge-reranker-base"
});
</code></pre>
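<p>To build intuition for what the keyword and fusion stages do, here is a small self-contained sketch of textbook BM25 scoring followed by reciprocal rank fusion. It is not AI Search's implementation: the <code>k1</code>, <code>b</code>, and <code>k = 60</code> constants are common defaults assumed here, the tokenizer is a plain lowercase split, and the vector ranking is a hard-coded stand-in rather than a real embedding search.</p>

```typescript
// Split text into lowercase terms; a real tokenizer (porter, trigram) does more.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter((t) => t.length > 0);
}

// Textbook BM25 with assumed defaults k1 = 1.2 and b = 0.75.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const avgLength = tokenized.reduce((sum, d) => sum + d.length, 0) / tokenized.length;
  const terms = tokenize(query).filter((t, i, all) => all.indexOf(t) === i);
  return tokenized.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const docsWithTerm = tokenized.filter((d) => d.includes(term)).length;
      // Rare terms get a higher inverse document frequency.
      const idf = Math.log(1 + (docs.length - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
      // Long documents are normalized against the average length.
      const lengthNorm = 1 - b + b * (doc.length / avgLength);
      score += (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
    }
    return score;
  });
}

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores: { [id: string]: number } = {};
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores[id] = (scores[id] ?? 0) + 1 / (k + index + 1);
    });
  }
  return Object.keys(scores).sort((a, b) => scores[b] - scores[a]);
}

const ids = ["err-page", "network-guide", "billing"];
const docs = [
  "Fix ERR_CONNECTION_REFUSED by checking the origin port",
  "Troubleshooting network connections that drop or stall",
  "Billing overview and invoices",
];

const keywordScores = bm25Scores("ERR_CONNECTION_REFUSED timeout", docs);
const keywordRanking = ids
  .filter((_, i) => keywordScores[i] > 0)
  .sort((a, b) => keywordScores[ids.indexOf(b)] - keywordScores[ids.indexOf(a)]);

// Stand-in for a vector search ranking (hypothetical, no embeddings computed).
const vectorRanking = ["network-guide", "err-page"];

const fused = reciprocalRankFusion([vectorRanking, keywordRanking]);
console.log(keywordRanking); // [ 'err-page' ]
console.log(fused); // [ 'err-page', 'network-guide' ]
```

<p>BM25 only surfaces the page that contains the exact error string, the stand-in vector ranking prefers the general troubleshooting guide, and fusion puts the page that appears in both lists on top while keeping the guide in the results.</p>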
            
    <div>
      <h3>Boost relevance: surface what matters</h3>
      <a href="#boost-relevance-surface-what-matters">
        
      </a>
    </div>
    <p>Retrieval gets you relevant results, but relevance alone isn't always enough. For example, in a news search, an article from last week and an article from three years ago might both be semantically relevant to "election results," but most users probably want the recent one. Boosting lets you layer business logic on top of retrieval by nudging rankings based on document metadata.</p><p>You can boost on timestamp (built in on every item) or any <a href="https://developers.cloudflare.com/ai-search/configuration/indexing/metadata/"><u>custom metadata field</u></a> you define.</p>
            <pre><code>// boost recent docs over older ones
const results = await instance.search({
  query: "deployment guide",
  ai_search_options: {
    boost_by: [
      { field: "timestamp", direction: "desc" }
    ]
  }
});
</code></pre>
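<p>Conceptually, a boost is a nudge applied on top of the retrieval score. The sketch below illustrates the idea with an exponential recency decay. It is purely an illustration and not how AI Search computes boosts: the <code>ScoredDoc</code> shape, the <code>boostByRecency</code> helper, the 30-day half-life, and the 0.2 weight are all assumptions.</p>

```typescript
// Hypothetical shape for a retrieved document; not the AI Search response format.
type ScoredDoc = { id: string; score: number; timestamp: number };

const MS_PER_DAY = 24 * 60 * 60 * 1000;

function boostByRecency(
  docs: ScoredDoc[],
  now: number,
  halfLifeDays = 30, // assumed: the boost halves every 30 days
  weight = 0.2 // assumed: a brand-new doc gets at most a 20% lift
): ScoredDoc[] {
  return docs
    .map((doc) => {
      const ageDays = Math.max(0, (now - doc.timestamp) / MS_PER_DAY);
      // 1 for a brand-new document, decaying toward 0 as it ages.
      const recency = Math.pow(0.5, ageDays / halfLifeDays);
      return { ...doc, score: doc.score * (1 + weight * recency) };
    })
    .sort((a, b) => b.score - a.score);
}

const now = Date.UTC(2026, 3, 16);
const boosted = boostByRecency(
  [
    { id: "election-2023", score: 0.82, timestamp: now - 1095 * MS_PER_DAY },
    { id: "election-last-week", score: 0.8, timestamp: now - 7 * MS_PER_DAY },
  ],
  now
);
console.log(boosted.map((doc) => doc.id));
// [ 'election-last-week', 'election-2023' ]
```

<p>The older article has the slightly higher retrieval score, but the recency nudge is enough to move last week's article to the top.</p>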
            
    <div>
      <h3>Cross-instance search: query across boundaries</h3>
      <a href="#cross-instance-search-query-across-boundaries">
        
      </a>
    </div>
    <p>In the support agent example, product documentation and customer resolution history live in separate instances by design. But when the agent is answering a question, it needs context from both places at once. Without cross-instance search, you'd make two separate calls and merge the results yourself.</p><p>The namespace binding exposes a <code>search()</code> method that handles this for you. Pass an array of instance names and get one ranked list back:</p>
            <pre><code>const results = await env.SUPPORT_KB.search({
  query: "billing error",
  ai_search_options: {
    instance_ids: ["product-knowledge", "customer-abc123"]
  }
});
</code></pre>
            <p>Results are merged and ranked across instances. The agent doesn't need to know or care that shared docs and customer resolution history live in separate places. </p>
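<p>For comparison, this is roughly what the do-it-yourself merge would look like without cross-instance search. The <code>Hit</code> shape and the <code>mergeHits</code> helper are hypothetical, and the scores are made up; the real response format may differ.</p>

```typescript
// Hypothetical hit shape; the real AI Search response format may differ.
type Hit = { id: string; instance: string; score: number };

function mergeHits(resultSets: Hit[][], limit = 10): Hit[] {
  // Flatten every per-instance list, then keep the highest-scoring hits.
  const all = ([] as Hit[]).concat(...resultSets);
  return all.sort((a, b) => b.score - a.score).slice(0, limit);
}

// Pretend these came back from two separate per-instance search calls.
const productHits: Hit[] = [
  { id: "billing-faq.md", instance: "product-knowledge", score: 0.71 },
  { id: "invoices.md", instance: "product-knowledge", score: 0.64 },
];
const customerHits: Hit[] = [
  { id: "billing-fix.md", instance: "customer-abc123", score: 0.83 },
];

const merged = mergeHits([productHits, customerHits], 2);
console.log(merged.map((hit) => hit.id));
// [ 'billing-fix.md', 'billing-faq.md' ]
```

<p>Sorting by raw score assumes the scores from different instances are on a comparable scale, which is one more thing you would have to verify yourself when merging by hand.</p>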
    <div>
      <h2>How AI Search instances work</h2>
      <a href="#how-ai-search-instances-work">
        
      </a>
    </div>
    <p>So far we've covered how AI Search finds the right results. Now let's look at how you can create and manage your search instances.</p><p>If you used AI Search before this release, you know the setup: create an R2 bucket, link it to an AI Search instance, AI Search generates a service API token for you, and you manage the Vectorize index that gets provisioned on your account. Uploading an object requires you to write to R2 and then wait for a sync job to run to have the object indexed.</p><p>New instances created now work differently. When you call <code>create()</code>, the instance comes with its own storage and vector index built-in. You can upload a file, have it sent for indexing immediately, and poll for indexing status, all with a single <code>uploadAndPoll()</code> call. Once indexing completes, you can search the instance immediately, and there are no external dependencies to wire together.</p>
            <pre><code>const instance = env.AI_SEARCH.get("my-instance");

// upload and wait for indexing to complete
const item = await instance.items.uploadAndPoll("faq.md", content, {
  metadata: { category: "onboarding" }
});
console.log(item.status); // "completed"

// immediately search after indexing is completed
const results = await instance.search({
  // alternative to the query parameter: pass the user's query as chat messages
  messages: [{ role: "user", content: "onboarding guide" }],
});
</code></pre>
            <p>Each instance can also connect to one external data source (an R2 bucket or a website) and run on a sync schedule. It can exist alongside the provided built-in storage. In the support agent example, <code>product-knowledge</code> is backed by an R2 bucket for shared documentation, while each customer's instance uses built-in storage for context uploaded on the fly.</p>
    <div>
      <h3>Namespaces: create search instances at runtime</h3>
      <a href="#namespaces-create-search-instances-at-runtime">
        
      </a>
    </div>
    <p><code>ai_search_namespaces</code> is a new binding you can use to dynamically create search instances at runtime. It replaces the previous <code>env.AI.autorag()</code> API, which accessed AI Search through the <code>AI</code> binding. The old bindings will continue to work using <a href="https://developers.cloudflare.com/workers/configuration/compatibility-dates/"><u>Workers compatibility dates</u></a>.</p>
            <pre><code>// wrangler.jsonc 
{
  "ai_search_namespaces": [
    { "binding": "AI_SEARCH", "namespace": "example" }
  ]
}
</code></pre>
            <p>The namespace binding gives you APIs like <code>create()</code>, <code>delete()</code>, <code>list()</code>, and <code>search()</code> at the namespace level. If you’re creating instances dynamically (e.g. per agent, per customer, per tenant), this is the binding to use.</p>
            <pre><code>// create an instance 
const instance = await env.AI_SEARCH.create({
  id: "my-instance"
});

// delete an instance and all its indexed data
await env.AI_SEARCH.delete("old-instance");
</code></pre>
            
    <div>
      <h3>Pricing for new instances</h3>
      <a href="#pricing-for-new-instances">
        
      </a>
    </div>
    <p>New instances created as of today will get built-in storage and a vector index automatically. </p><p>These instances are free to use while AI Search is in open beta with the limits listed below. When using the website as a data source, website crawling using <a href="https://developers.cloudflare.com/browser-rendering/"><u>Browser Run (formerly Browser Rendering)</u></a> is also now a built-in service, meaning that you won’t be billed for it separately. After beta, the goal is to provide unified pricing for AI Search as a single service, rather than billing separately for each underlying component. Workers AI and <a href="https://www.cloudflare.com/developer-platform/products/ai-gateway/"><u>AI Gateway</u></a> usage will continue to be billed separately.</p><p>We'll give at least 30 days' notice and communicate pricing details before any billing begins.</p><table><tr><th><p><b>Limit</b></p></th><th><p><b>Workers Free</b></p></th><th><p><b>Workers Paid</b></p></th></tr><tr><td><p>AI Search instances per account</p></td><td><p>100</p></td><td><p>5,000</p></td></tr><tr><td><p>Files per instance</p></td><td><p>100,000</p></td><td><p>1M (500K with hybrid search)</p></td></tr><tr><td><p>Max file size</p></td><td><p>4MB</p></td><td><p>4MB</p></td></tr><tr><td><p>Queries per month</p></td><td><p>20,000</p></td><td><p>Unlimited</p></td></tr><tr><td><p>Maximum pages crawled per day</p></td><td><p>500</p></td><td><p>Unlimited</p></td></tr></table><p><i>What about existing instances?</i> </p><p>If you created instances before this release, they continue to work exactly as they do today. Your R2 buckets, Vectorize indexes, and Browser Run usage remain on your account and are billed as before. We'll share migration details for existing instances soon.</p>
    <div>
      <h2>Get started today</h2>
      <a href="#get-started-today">
        
      </a>
    </div>
    <p>Search is one of the most fundamental things an agent can do. With AI Search, you don't have to build the infrastructure to make it happen. Create an instance, give it your data, and let your agents search it.</p><p>Get started today by running this command to create your first instance:</p>
            <pre><code>npx wrangler ai-search create my-search
</code></pre>
            <p>Check out the <a href="https://developers.cloudflare.com/ai-search/"><u>docs</u></a> and come tell us what you're building on the <a href="https://discord.cloudflare.com/"><u>Cloudflare Developer Discord</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Y5WLWBuK7NBMLmY6ZWL96/ce7ca954f4f51ac21f8e9d3f15d0343c/BLOG-3240_4.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[AI Search]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">4l8kYFerKsLkZH2ZVaOoYf</guid>
            <dc:creator>Gabriel Massadas</dc:creator>
            <dc:creator>Miguel Cardoso</dc:creator>
            <dc:creator>Anni Wang</dc:creator>
        </item>
        <item>
            <title><![CDATA[Deploy Postgres and MySQL databases with PlanetScale + Workers]]></title>
            <link>https://blog.cloudflare.com/deploy-planetscale-postgres-with-workers/</link>
            <pubDate>Thu, 16 Apr 2026 13:00:22 GMT</pubDate>
            <description><![CDATA[ Learn how to deploy PlanetScale Postgres and MySQL databases via Cloudflare and connect Cloudflare Workers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare announced our PlanetScale partnership last September to give <a href="https://workers.cloudflare.com/"><u>Cloudflare Workers</u></a> direct access to Postgres and MySQL databases for fast, full-stack applications.</p><p>Soon, we’re bringing our technologies even closer: you’ll be able to create PlanetScale Postgres and MySQL databases directly from the Cloudflare dashboard and API, and have them billed to your Cloudflare account. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Tj4gJrV5hxlIWxlmoXVZe/7661c1e47c0c868b88301b5f4aca4441/BLOG-3213_2.png" />
          </figure><p>You choose the data storage that fits your Worker application needs and keep a single system for billing as a Cloudflare self-serve or enterprise customer. Cloudflare credits like those given in our <a href="https://www.cloudflare.com/forstartups/"><u>startup program</u></a> or Cloudflare committed spend can be used towards PlanetScale databases.</p>
    <div>
      <h2>Postgres &amp; MySQL for Workers</h2>
      <a href="#postgres-mysql-for-workers">
        
      </a>
    </div>
    <p>SQL relational databases like Postgres and MySQL are a foundation of modern applications. In particular, Postgres has risen in developer popularity with its rich tooling ecosystem (ORMs, GUIs, etc.) and extensions like <a href="https://github.com/pgvector/pgvector/"><u>pgvector</u></a> for building vector search in AI-driven applications. Postgres is the default choice for most developers who need a powerful, flexible, and scalable database to power their applications.</p><p>You can already connect your PlanetScale account and create Postgres databases directly from the <a href="https://dash.cloudflare.com/?to=/:account/workers/hyperdrive?modal=1">Cloudflare dashboard</a> for your Workers. Starting next month, a new Cloudflare subscription will bill new PlanetScale databases directly to your Cloudflare account as a self-serve or enterprise user.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1CHTq1qoaNGSNO5atsiS8J/a8eba618b77362aa467d94c4f625c600/BLOG-3213_3.png" />
          </figure><p><sup><i>How to create PlanetScale databases via the </i></sup><a href="https://dash.cloudflare.com/?to=/:account/workers/hyperdrive?modal=1"><sup><i><u>Cloudflare dashboard</u></i></sup></a><sup><i> after your PlanetScale account is connected. Cloudflare billing is coming next month.</i></sup></p><p>With our built-in integration, PlanetScale databases automatically work with Workers using Hyperdrive, our database connectivity service. <a href="https://blog.cloudflare.com/how-hyperdrive-speeds-up-database-access/"><u>Hyperdrive</u></a> manages database connection pools and query caching to make database queries fast and reliable. You just add a <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/"><u>binding</u></a> to your Worker’s <a href="https://developers.cloudflare.com/workers/wrangler/configuration/#hyperdrive"><u>config file</u></a>: </p>
            <pre><code>// wrangler.jsonc file
{
  "hyperdrive": [
    {
      "binding": "DATABASE",
      "id": "&lt;AUTO_CREATED_ID&gt;"
    }
  ]
}
</code></pre>
            <p>And start running SQL queries via your Worker with your Postgres client of choice:</p>
            <pre><code>import { Client } from "pg";

export default {
  async fetch(request, env, ctx) {
    // Hyperdrive exposes a connection string for the bound database
    const client = new Client({ connectionString: env.DATABASE.connectionString });
    await client.connect();

    const result = await client.query("SELECT * FROM pg_tables");
    // ...
  }
}
</code></pre>
            
    <div>
      <h2>PlanetScale developer experience</h2>
      <a href="#planetscale-developer-experience">
        
      </a>
    </div>
    <p>PlanetScale was the obvious choice to provide to the Workers community due to its unrivaled performance and reliability. Developers can choose between two of the most popular relational databases: <a href="https://planetscale.com/docs/postgres/postgres-compatibility"><u>Postgres</u></a> or Vitess MySQL. Like Cloudflare, PlanetScale treats performance and reliability as key features of a developer platform. And with features like <a href="https://planetscale.com/docs/postgres/monitoring/query-insights"><u>query insights</u></a> and <a href="https://planetscale.com/docs/connect/ai-tooling"><u>agent-driven</u></a> workflows for improving SQL query performance, and <a href="https://planetscale.com/docs/postgres/branching"><u>branching</u></a> for deploying code safely, including database changes, the PlanetScale database developer experience is first-class.</p><p>Cloudflare users get the exact same PlanetScale database developer experience. Your PlanetScale databases can be deployed directly from Cloudflare with connections managed via Hyperdrive, which already makes your existing regional databases fast with global Workers. This means access to the same PlanetScale <a href="https://planetscale.com/docs/plans/planetscale-skus"><u>database clusters</u></a> at standard PlanetScale <a href="https://planetscale.com/pricing"><u>pricing</u></a> with all features included, such as query insights and a detailed breakdown of <a href="https://planetscale.com/docs/billing#organization-usage-and-billing-page"><u>usage and costs</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Pfh4oM8zQSUGJKGEsxF3W/700f627c38279d9d90337b38de72b44e/BLOG-3213_4.png" />
          </figure><p><sup><i>A single node on PlanetScale Postgres starts at </i></sup><a href="https://planetscale.com/blog/5-dollar-planetscale"><sup><i><u>$5/month</u></i></sup></a><sup><i>.</i></sup></p>
    <div>
      <h2>Workers placement</h2>
      <a href="#workers-placement">
        
      </a>
    </div>
    <p>With centralized databases, you can use an <a href="https://developers.cloudflare.com/workers/configuration/placement/#configure-explicit-placement-hints"><u>explicit placement hint</u></a> to run Workers right next to your primary database and reduce latency. By default, Workers execute closest to the user request, which adds network latency when querying a central database, especially when a request makes multiple queries. Instead, you can configure your Worker to execute in the closest Cloudflare data center to your PlanetScale database. In the future, Cloudflare can automatically set a placement hint based on the location of your PlanetScale database and reduce network latency to single-digit milliseconds.</p>
            <pre><code>{
  "placement": {
    "region": "aws:us-east-1"
  }
}
</code></pre>
            
    <div>
      <h2>Coming soon</h2>
      <a href="#coming-soon">
        
      </a>
    </div>
    <p>You can deploy a PlanetScale Postgres database or connect an existing PlanetScale database to Workers today via the <a href="https://dash.cloudflare.com/?to=/:account/workers/hyperdrive?modal=1"><u>Cloudflare dashboard</u></a>. Everything today is still billed via PlanetScale.</p><p>Launching next month, new PlanetScale databases can be billed to your Cloudflare account. </p><p>We are building more with our PlanetScale partners, such as Cloudflare API integration, so tell us what you’d like to see next.</p> ]]></content:encoded>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[MySQL]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[PlanetScale]]></category>
            <guid isPermaLink="false">1IGJnHwj5QVRJm9iCdEqYV</guid>
            <dc:creator>Vy Ton</dc:creator>
            <dc:creator>Matt Silverlock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Email Service: now in public beta. Ready for your agents]]></title>
            <link>https://blog.cloudflare.com/email-for-agents/</link>
            <pubDate>Thu, 16 Apr 2026 06:00:00 GMT</pubDate>
            <description><![CDATA[ Agents are becoming multi-channel. That means making them available wherever your users already are — including the inbox. Today, Cloudflare Email Service enters public beta with the infrastructure layer to make that easy: send, receive, and process email natively from your agents.
 ]]></description>
            <content:encoded><![CDATA[ <p>Email is the most accessible interface in the world. It is ubiquitous. There’s no need for a custom chat application, no custom SDK for each channel. Everyone already has an email address, which means everyone can already interact with your application or agent. And your agent can interact with anyone.</p><p>If you are building an application, you already rely on email for signups, notifications, and invoices. Increasingly, it is not just your application logic that needs this channel. Your agents do, too. During our private beta, we talked to developers who are building exactly this: customer support agents, invoice processing pipelines, account verification flows, multi-agent workflows. All built on top of email. The pattern is clear: email is becoming a core interface for agents, and developers need infrastructure purpose-built for it.</p><p>Cloudflare Email Service is that piece. With <b>Email Routing</b>, you can receive email in your application or agent. With <b>Email Sending</b>, you can reply to emails or send outbound messages to notify your users when your agents are done doing work. And with the rest of the developer platform, you can build a full email client and use the <a href="https://blog.cloudflare.com/project-think/"><u>Agents SDK</u></a> onEmail hook as native functionality. </p><p>Today, as part of Agents Week, Cloudflare Email Service is entering <b>public beta</b>, allowing any application and any agent to send emails. We are also completing the toolkit for building email-native agents: </p><ul><li><p>Email Sending binding, available from your Workers and the Agents SDK </p></li><li><p>A new Email MCP server</p></li><li><p>Wrangler CLI email commands</p></li><li><p>Skills for coding agents</p></li><li><p>An open-source agentic inbox reference app</p></li></ul>
    <div>
      <h2>Email Sending: now in public beta</h2>
      <a href="#email-sending-now-in-public-beta">
        
      </a>
    </div>
    <p>Email Sending graduates from private beta to <b>public beta </b>today. You can now send transactional emails directly from Workers with a native Workers binding — no API keys, no secrets management.</p>
            <pre><code>export default {
  async fetch(request, env, ctx) {
    await env.EMAIL.send({
      to: "user@example.com",
      from: "notifications@your-domain.com",
      subject: "Your order has shipped",
      text: "Your order #1234 has shipped and is on its way."
    });
    return new Response("Email sent");
  },
};
</code></pre>
            <p>Or send from any platform, any language, using the REST API and our TypeScript, Python, and Go SDKs:</p>
            <pre><code>curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/email-service/send" \
   --header "Authorization: Bearer &lt;API_TOKEN&gt;" \
   --header "Content-Type: application/json" \
   --data '{
     "to": "user@example.com",
     "from": "notifications@your-domain.com",
     "subject": "Your order has shipped",
     "text": "Your order #1234 has shipped and is on its way."
   }'
</code></pre>
            <p>Sending email that actually reaches inboxes usually means wrestling with SPF, DKIM, and DMARC records. When you add your domain to Email Service, we configure all of it automatically. Your emails are authenticated and delivered, not flagged as spam. And because Email Service is a global service built on Cloudflare's network, your emails are delivered with low latency anywhere in the world.</p><p>Combined with <a href="https://developers.cloudflare.com/email-routing/"><u>Email Routing</u></a>, which has been free and available for years, you now have complete bidirectional email within a single platform. Receive an email, process it in a Worker, and reply, all without leaving Cloudflare.</p><p>For the full deep dive on Email Sending, <a href="https://blog.cloudflare.com/email-service/"><u>refer to our Birthday Week announcement</u></a>. The rest of this post describes what Email Service unlocks for agents.</p>
    <div>
      <h2>Agents SDK: your agent is email-native</h2>
      <a href="#agents-sdk-your-agent-is-email-native">
        
      </a>
    </div>
    <p>The Agents SDK for building agents on Cloudflare already has a first-class <a href="https://developers.cloudflare.com/agents/api-reference/agents-api/"><u>onEmail hook</u></a> for receiving and processing inbound email. But until now, your agent could only reply synchronously, or send emails to members of your Cloudflare account. </p><p>With Email Sending, that constraint is gone. This is the difference between a chatbot and an agent.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4aGV0BVpbrj3ql5TubPMXx/b85351f13c5fae93a27d11e20a5fc11e/BLOG-3210_2.png" />
          </figure><p><sup><i>Email agents receive a message, orchestrate work across the platform, and respond asynchronously.</i></sup></p><p>A chatbot responds in the moment or not at all. An agent thinks, acts, and communicates on its own timeline. With Email Sending, your agent can receive a message, spend an hour processing data, check three other systems, and then reply with a complete answer. It can schedule follow-ups. It can escalate when it detects an edge case. It can operate independently. In other words: it can actually do work, not just answer questions. </p><p>Here's what a support agent looks like with the full pipeline — receive, persist, and reply:</p>
            <pre><code>import { Agent, routeAgentEmail } from "agents";
import { createAddressBasedEmailResolver, type AgentEmail } from "agents/email";
import PostalMime from "postal-mime";

export class SupportAgent extends Agent {
  async onEmail(email: AgentEmail) {
    const raw = await email.getRaw();
    const parsed = await PostalMime.parse(raw);

    // Persist in agent state
    this.setState({
      ...this.state,
      ticket: { from: email.from, subject: parsed.subject, body: parsed.text, messageId: parsed.messageId },
    });

    // Kick off long running background agent task 
    // Or place a message on a Queue to be handled by another Worker

    // Reply here or in other Worker handler, like a Queue handler
    await this.sendEmail({
      binding: this.env.EMAIL,
      fromName: "Support Agent",
      from: "support@yourdomain.com",
      to: this.state.ticket.from,
      inReplyTo: this.state.ticket.messageId,
      subject: `Re: ${this.state.ticket.subject}`,
      text: `Thanks for reaching out. We received your message about "${this.state.ticket.subject}" and will follow up shortly.`
    });
  }
}

export default {
  async email(message, env) {
    await routeAgentEmail(message, env, {
      resolver: createAddressBasedEmailResolver("SupportAgent"),
    });
  },
} satisfies ExportedHandler&lt;Env&gt;;</code></pre>
            <p>If you're new to the Agents SDK's email capabilities, here's what's happening under the hood.</p><p><b>Each agent gets its own identity from a single domain.</b> The address-based resolver routes support@yourdomain.com to a "support" agent instance, sales@yourdomain.com to a "sales" instance, and so on. You don't need to provision separate inboxes — the routing is built into the address. You can even use sub-addressing (NotificationAgent+user123@yourdomain.com) to route to different agent namespaces and instances.</p><p><b>State persists across emails.</b> Because agents are backed by Durable Objects, calling this.setState() means your agent remembers conversation history, contact information, and context across sessions. The inbox becomes the agent's memory, without needing a separate database or vector store.</p><p><b>Secure reply routing is built in.</b> When your agent sends an email and expects a reply, you can sign the routing headers with HMAC-SHA256 so that replies route back to the exact agent instance that sent the original message. This prevents attackers from forging headers to route emails to arbitrary agent instances — a security concern that most "email for agents" solutions haven't addressed.</p><p>This is the complete email agent pipeline that teams are building from scratch elsewhere: receive email, parse it, classify it, persist state, kick off async workflows, reply or escalate — all within a single Agent class, deployed globally on Cloudflare's network.</p>
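            <p>To make the signed-routing idea concrete, here is a minimal sketch that uses Node's <code>node:crypto</code> module rather than the Agents SDK. The header shape (<code>agentName</code>, <code>agentId</code>) and the function names <code>signRouting</code> and <code>verifyRouting</code> are assumptions made for illustration; the SDK's own helpers may look different.</p>
            <pre><code>import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical routing fields: which agent class and instance a reply belongs to.
type RoutingHeaders = { agentName: string; agentId: string };

// Sign the routing fields with HMAC-SHA256 and return a hex signature.
// JSON.stringify on an array gives an unambiguous encoding of the two fields.
function signRouting(headers: RoutingHeaders, secret: string): string {
  return createHmac("sha256", secret)
    .update(JSON.stringify([headers.agentName, headers.agentId]))
    .digest("hex");
}

// Recompute the signature and compare it to the received one in constant time.
function verifyRouting(headers: RoutingHeaders, signature: string, secret: string): boolean {
  const expected = Buffer.from(signRouting(headers, secret), "hex");
  const received = Buffer.from(signature, "hex");
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}

const headers = { agentName: "SupportAgent", agentId: "ticket-1234" };
const signature = signRouting(headers, "example-secret");
console.log(verifyRouting(headers, signature, "example-secret")); // true
console.log(verifyRouting({ ...headers, agentId: "ticket-9999" }, signature, "example-secret")); // false
</code></pre>
            <p>A reply whose routing fields were altered, or that was signed with a different secret, fails verification and can be dropped before it reaches any agent instance.</p>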
    <div>
      <h2>Email tooling for your agents: MCP server, Wrangler CLI, and skills</h2>
      <a href="#email-tooling-for-your-agents-mcp-server-wrangler-cli-and-skills">
        
      </a>
    </div>
    <p>Email Service isn't only for agents running on Cloudflare. Agents run everywhere, whether it’s coding agents like Claude Code, Cursor, or Copilot running locally or in remote environments, or production agents running in containers or external clouds. They all need to send email from those environments. We're shipping three integrations that make Email Service accessible to any agent, regardless of where it runs.</p><p>Email is now available through the <a href="https://github.com/cloudflare/mcp"><u>Cloudflare MCP server</u></a>, the same <a href="https://blog.cloudflare.com/code-mode/"><u>Code Mode</u></a>-powered server that gives agents access to the entire Cloudflare API. With this MCP server, your agent can discover and call the Email endpoints to send and configure emails. You can send an email with a simple prompt:</p>
            <pre><code>"Send me a notification email at hello@example.com from my staging domain when the build completes"</code></pre>
            <p>For agents running on a computer or a sandbox with bash access, the Wrangler CLI solves the MCP context window problem that we discussed in the <a href="https://blog.cloudflare.com/code-mode/"><u>Code Mode</u></a> blog post: tool definitions can consume tens of thousands of tokens before your agent even starts processing a single message. With Wrangler, your agent starts with near-zero context overhead and discovers capabilities on demand through <code>--help</code> commands. Here is how your agent can send an email via Wrangler:</p>
            <pre><code>wrangler email send \
  --to "teammate@example.com" \
  --from "agent@your-domain.com" \
  --subject "Build completed" \
  --text "The build passed. Deployed to staging."
</code></pre>
            <p>Whether you give your agent the Cloudflare MCP server or the Wrangler CLI, it will now be able to send emails on your behalf with just a prompt.</p>
    <div>
      <h3>Skills</h3>
      <a href="#skills">
        
      </a>
    </div>
    <p>We are also publishing a <a href="https://github.com/cloudflare/skills"><u>Cloudflare Email Service skill</u></a>. It gives your agents complete guidance: configuring the Workers binding, sending emails via the REST API or SDKs, handling inbound email with Email Routing configuration, building with Agents SDK, and managing email through Wrangler CLI or MCP. It also covers deliverability best practices and how to craft good transactional emails that land in inboxes rather than spam. Drop it into your project and your coding agent has everything needed to build production-ready email on Cloudflare.</p>
    <div>
      <h2>Open-sourcing tools for email agents</h2>
      <a href="#open-sourcing-tools-for-email-agents">
        
      </a>
    </div>
    <p>During the private beta, we also experimented with email agents. It became clear that you often want to keep a human in the loop to review emails and see what the agent is doing. The best way to do that is to have a fully featured email client with agent automations built in.</p><p>That’s why we built <a href="https://github.com/cloudflare/agentic-inbox"><u>Agentic Inbox</u></a>: a reference application with full conversation threading, email rendering, storage for received emails and their attachments, and automatic replies. It includes a dedicated built-in MCP server, so external agents can draft emails for your review before sending from your agentic-inbox. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4PgrSXLUD5kgA2SFOhksCS/75f85353cf710842420ed806be31b6f6/BLOG-3210_3.png" />
          </figure><p>We’re <a href="https://github.com/cloudflare/agentic-inbox"><u>open-sourcing Agentic Inbox</u></a> as a reference application for how to build a full email application using Email Routing for inbound, Email Sending for outbound, Workers AI for classification, R2 for attachments, and Agents SDK for stateful agent logic. You can deploy it today to get a full inbox, email client and agent for your emails, with the click of a button.</p><p>We want email agent tooling to be composable and reusable. Rather than every team rebuilding the same inbound-classify-reply pipeline, start with this reference application. Fork it, extend it, use it as a starting point for your own email agents that fit your workflows.</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/agentic-inbox"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p></p>
    <div>
      <h2>Try it out today</h2>
      <a href="#try-it-out-today">
        
      </a>
    </div>
    <p>Email is where the world’s most important workflows live, but for agents, it has often been a difficult channel to reach. With <b>Email Sending</b> now in public beta, Cloudflare Email Service becomes a complete platform for bidirectional communication, making the inbox a first-class interface for your agents.</p><p>Whether you’re building a support agent that meets customers in their inbox or a background process that keeps your team updated in real time, your agents now have a seamless way to communicate on a global scale. The inbox is no longer a silo. Now it’s one more place for your agents to be helpful.</p><ul><li><p>Try out <a href="https://dash.cloudflare.com/?to=/:account/email-service/sending"><u>Email Sending in the Cloudflare Dashboard</u></a></p></li><li><p>Read the <a href="http://developers.cloudflare.com/email-service/"><u>Email Service documentation</u></a></p></li><li><p>Follow the <a href="http://developers.cloudflare.com/agents/api-reference/email"><u>Agents SDK email docs</u></a> </p></li><li><p>Check out the <a href="http://github.com/cloudflare/mcp-server-cloudflare"><u>Email Service MCP server</u></a> and <a href="https://github.com/cloudflare/skills"><u>Skills</u></a></p></li><li><p><a href="https://github.com/cloudflare/agentic-inbox"><u>Deploy the open-source reference app</u></a></p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2YaSov283la8ajbi0Sbprq/9108f26b499fd976470a79ee74034e59/BLOG-3210_5.png" />
          </figure>
     ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[Email]]></category>
            <guid isPermaLink="false">3G2uQLc6OAHayqaZxxDrrZ</guid>
            <dc:creator>Thomas Gauvin</dc:creator>
            <dc:creator>Eric Falcão</dc:creator>
        </item>
        <item>
            <title><![CDATA[Project Think: building the next generation of AI agents on Cloudflare]]></title>
            <link>https://blog.cloudflare.com/project-think/</link>
            <pubDate>Wed, 15 Apr 2026 13:01:00 GMT</pubDate>
            <description><![CDATA[ Announcing a preview of the next edition of the Agents SDK — from lightweight primitives to a batteries-included platform for AI agents that think, act, and persist.
 ]]></description>
            <content:encoded><![CDATA[ <p>Today, we're introducing Project Think: the next generation of the <a href="https://developers.cloudflare.com/agents/"><u>Agents SDK</u></a>. Project Think is a set of new primitives for building long-running agents (durable execution, sub-agents, sandboxed code execution, persistent sessions) and an opinionated base class that wires them all together. Use the primitives to build exactly what you need, or use the base class to get started fast.</p><p>Something happened earlier this year that changed how we think about AI. Tools like <a href="https://github.com/badlogic/pi-mono"><u>Pi</u></a>, <a href="https://github.com/openclaw"><u>OpenClaw</u></a>, <a href="https://docs.anthropic.com/en/docs/agents"><u>Claude Code</u></a>, and <a href="https://openai.com/codex"><u>Codex</u></a> proved a simple but powerful idea: give an LLM the ability to read files, write code, execute it, and remember what it learned, and you get something that looks less like a developer tool and more like a general-purpose assistant.</p><p>These coding agents aren't just writing code anymore. People are using them to manage calendars, analyze datasets, negotiate purchases, file taxes, and automate entire business workflows. The pattern is always the same: the agent reads context, reasons about it, writes code to take action, observes the result, and iterates. Code is the universal medium of action.</p><p>Our team has been using these coding agents every day. And we kept running into the same walls:</p><ul><li><p><b>They only run on your laptop or an expensive VPS:</b> there's no sharing, no collaboration, no handoff between devices.</p></li><li><p><b>They're expensive when idle</b>: a fixed monthly cost whether the agent is working or not. 
Scale that to a team, or a company, and it adds up fast.</p></li><li><p><b>They require management and manual setup</b>: installing dependencies, managing updates, configuring identity and secrets.</p></li></ul><p>And there's a deeper structural issue. Traditional applications serve many users from one instance. As mentioned in our Welcome to Agents Week post, <a href="https://blog.cloudflare.com/welcome-to-agents-week/"><u>agents are one-to-one</u></a>. Each agent is a unique instance, serving one user, running one task. A restaurant has a menu and a kitchen optimized to churn out dishes at volume. An agent is more like a personal chef: different ingredients, different techniques, different tools every time.</p><p>That fundamentally changes the scaling math. If a hundred million knowledge workers each use an agentic assistant at even modest concurrency, you need capacity for tens of millions of simultaneous sessions. At current per-container costs, that's unsustainable. We need a different foundation.</p><p>That's what we've been building.</p>
    <div>
      <h2>Introducing Project Think</h2>
      <a href="#introducing-project-think">
        
      </a>
    </div>
    <p>Project Think ships a set of new primitives for the Agents SDK:</p><ul><li><p><b>Durable execution</b> with fibers: crash recovery, checkpointing, automatic keepalive</p></li><li><p><b>Sub-agents</b>: isolated child agents with their own SQLite and typed RPC</p></li><li><p><b>Persistent sessions</b>: tree-structured messages, forking, compaction, full-text search</p></li><li><p><b>Sandboxed code execution</b>: Dynamic Workers, codemode, runtime npm resolution</p></li><li><p><b>The execution ladder</b>: workspace, isolate, npm, browser, sandbox</p></li><li><p><b>Self-authored extensions</b>: agents that write their own tools at runtime</p></li></ul><p>Each of these is usable directly with the Agent base class. Build exactly what you need with the primitives, or use the Think base class to get started fast. Let's look at what each one does.</p>
    <div>
      <h2>Long-running agents</h2>
      <a href="#long-running-agents">
        
      </a>
    </div>
    <p>Agents, as they exist today, are ephemeral. They run for a session, tied to a single process or device, and then they are gone. A coding agent that dies when your laptop sleeps, that’s a tool. An agent that persists — that can wake up on demand, continue work after interruptions, and carry forward the state without depending on your local runtime — that starts to look like infrastructure. And it changes the scaling model for agents completely.</p><p>The Agents SDK builds on <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a> to give every agent an identity, persistent state, and the ability to wake on message. This is the <a href="https://en.wikipedia.org/wiki/Actor_model"><u>actor model</u></a>: each agent is an addressable entity with its own SQLite database. It consumes zero compute when hibernated. When something happens (an HTTP request, a WebSocket message, a scheduled alarm, an inbound email) the platform wakes the agent, loads its state, and hands it the event. The agent does its work, then goes back to sleep.</p><table><tr><th><p>
</p></th><th><p><b>VMs / Containers</b></p></th><th><p><b>Durable Objects</b></p></th></tr><tr><td><p><b>Idle cost</b></p></td><td><p>Full compute cost, always</p></td><td><p>Zero (hibernated)</p></td></tr><tr><td><p><b>Scaling</b></p></td><td><p>Provision and manage capacity</p></td><td><p>Automatic, per-agent</p></td></tr><tr><td><p><b>State</b></p></td><td><p>External database required</p></td><td><p>Built-in SQLite</p></td></tr><tr><td><p><b>Recovery</b></p></td><td><p>You build it (process managers, health checks)</p></td><td><p>Platform restarts, state survives</p></td></tr><tr><td><p><b>Identity / routing</b></p></td><td><p>You build it (load balancers, sticky sessions)</p></td><td><p>Built-in (name → agent)</p></td></tr><tr><td><p><b>10,000 agents, each active 1% of the time</b></p></td><td><p>10,000 always-on instances</p></td><td><p>~100 active at any moment</p></td></tr></table><p>This changes the economics of running agents at scale. Instead of "one expensive agent per power user," you can build "one agent per customer" or "one agent per task" or "one agent per email thread." The marginal cost of spawning a new agent is effectively zero.</p>
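            <p>The last row of the table is simple arithmetic. As a rough sketch, assuming every agent is active for the same fraction of the time and independently of the others (the function name is hypothetical):</p>
            <pre><code>// Expected number of agents awake at any moment.
function expectedActive(totalAgents: number, activeFraction: number): number {
  return Math.round(totalAgents * activeFraction);
}

// Figures from the table: 10,000 agents, each active 1% of the time.
console.log(expectedActive(10000, 0.01)); // 100
</code></pre>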
    <div>
      <h3>Surviving crashes: durable execution with fibers</h3>
      <a href="#surviving-crashes-durable-execution-with-fibers">
        
      </a>
    </div>
    <p>An LLM call can take 30 seconds. A multi-turn agent loop can run for much longer. At any point during that window, the execution environment can vanish: a deploy, a platform restart, hitting resource limits. When that happens, the upstream connection to the model provider is severed permanently, in-memory state is lost, and connected clients see the stream stop with no explanation.</p><p><a href="https://developers.cloudflare.com/agents/api-reference/durable-execution/"><code><u>runFiber()</u></code></a> solves this. A fiber is a durable function invocation: registered in SQLite before execution begins, checkpointable at any point via <code>stash()</code>, and recoverable on restart via <code>onFiberRecovered</code>.</p>
            <pre><code>import { Agent } from "agents";

export class ResearchAgent extends Agent {
  async startResearch(topic: string) {
    void this.runFiber("research", async (ctx) =&gt; {
      const findings = [];

      for (let i = 0; i &lt; 10; i++) {
        const result = await this.callLLM(`Research step ${i}: ${topic}`);
        findings.push(result);

        // Checkpoint: the stashed snapshot survives eviction and is passed to onFiberRecovered
        ctx.stash({ findings, step: i, topic });

        this.broadcast({ type: "progress", step: i });
      }

      return { findings };
    });
  }

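  async onFiberRecovered(ctx) {
    if (ctx.name === "research" &amp;&amp; ctx.snapshot) {
      // Simplest recovery: rerun the research for the same topic from step 0.
      // ctx.snapshot also carries findings and step, so a fuller version could skip completed steps.
      const { topic } = ctx.snapshot;
      await this.startResearch(topic);
    }
  }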
}
</code></pre>
            <p>The SDK keeps the agent alive automatically during fiber execution, with no special configuration needed. For work measured in minutes, <code>keepAlive()</code> / <code>keepAliveWhile()</code> prevent eviction during active work. For longer operations (CI pipelines, design reviews, video generation), the agent starts the work, persists the job ID, hibernates, and wakes on callback.</p>
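            <p>The example above restarts the research from the first step when a fiber is recovered. A recovery handler could instead use the stashed snapshot to skip completed steps. The following is a plain TypeScript sketch of that resume logic, independent of the SDK: <code>Snapshot</code> mirrors the object passed to <code>stash()</code>, while <code>StepHooks</code> and <code>runResearch</code> are hypothetical names, and the steps are synchronous to keep the sketch short.</p>
            <pre><code>// Mirrors the object passed to stash() in the example above.
type Snapshot = { findings: string[]; step: number; topic: string };

// Hypothetical stand-ins for the LLM call and for ctx.stash().
interface StepHooks {
  doStep(step: number, topic: string): string;
  checkpoint(snapshot: Snapshot): void;
}

// Run only the steps that are still missing. With no snapshot, start at step 0.
function runResearch(
  topic: string,
  totalSteps: number,
  hooks: StepHooks,
  snapshot: Snapshot | null = null,
): string[] {
  const findings: string[] = snapshot ? [...snapshot.findings] : [];
  const start = snapshot ? snapshot.step + 1 : 0;
  const remaining = Math.max(0, totalSteps - start);

  for (let k = 0; k !== remaining; k++) {
    const step = start + k;
    findings.push(hooks.doStep(step, topic));
    // Checkpoint after every completed step so a later recovery can resume here.
    hooks.checkpoint({ findings: [...findings], step, topic });
  }
  return findings;
}
</code></pre>
            <p>With this shape, a recovery handler would pass the recovered snapshot as the last argument, so the expensive steps that already finished are not repeated.</p>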
    <div>
      <h3>Delegating work: sub-agents via Facets</h3>
      <a href="#delegating-work-sub-agents-via-facets">
        
      </a>
    </div>
    <p>A single agent shouldn't do everything itself. <a href="https://developers.cloudflare.com/agents/api-reference/sub-agents/"><u>Sub-agents</u></a> are child Durable Objects colocated with the parent via <a href="https://blog.cloudflare.com/durable-object-facets-dynamic-workers/"><u>Facets</u></a>, each with their own isolated SQLite and execution context:</p>
            <pre><code>import { Agent } from "agents";

export class ResearchAgent extends Agent {
  async search(query: string) { /* ... */ }
}

export class ReviewAgent extends Agent {
  async analyze(query: string) { /* ... */ }
}

export class Orchestrator extends Agent {
  async handleTask(task: string) {
    const researcher = await this.subAgent(ResearchAgent, "research");
    const reviewer = await this.subAgent(ReviewAgent, "review");

    const [research, review] = await Promise.all([
      researcher.search(task),
      reviewer.analyze(task)
    ]);

    return this.synthesize(research, review);
  }
}
</code></pre>
            <p>Sub-agents are isolated at the storage level. Each one gets its own SQLite database, and there’s no implicit sharing of data between them. This isolation is enforced by the runtime. Because sub-agents are colocated with the parent, an RPC to one costs about as much as a function call, and the typed RPC interface lets TypeScript catch misuse at compile time.</p>
    <div>
      <h3>Conversations that persist: the Session API</h3>
      <a href="#conversations-that-persist-the-session-api">
        
      </a>
    </div>
    <p>Agents that run for days or weeks need more than the typical flat list of messages. The experimental <a href="https://developers.cloudflare.com/agents/api-reference/sessions/"><u>Session API</u></a> models this explicitly. It is available on the Agent base class and stores conversations as trees, where each message has a <code>parent_id</code>. This enables forking (explore an alternative without losing the original path), non-destructive compaction (summarize older messages rather than deleting them), and full-text search across conversation history via <a href="https://www.sqlite.org/fts5.html"><u>FTS5</u></a>.</p>
            <pre><code>import { Agent } from "agents";
import { Session, SessionManager } from "agents/experimental/memory/session";

export class MyAgent extends Agent {
  sessions = SessionManager.create(this);

  async onStart() {
    const session = this.sessions.create("main");
    const history = session.getHistory();
    // messageId: ID of the message to branch from (not defined in this snippet)
    const forked = this.sessions.fork(session.id, messageId, "alternative-approach");
  }
}
</code></pre>
            <p>Session is usable directly with <code>Agent</code>, and it's the storage layer that the <code>Think</code> base class builds on.</p>
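            <p>To illustrate why a parent pointer is enough to support forking, here is a small in-memory sketch. It is not the Session API's implementation: <code>SessionMessage</code>, <code>addMessage</code>, and <code>historyOf</code> are hypothetical names, and a plain object stands in for SQLite.</p>
            <pre><code>// Hypothetical message record; only parentId matters for the tree structure.
type SessionMessage = { id: string; parentId: string | null; text: string };
type MessageStore = { [id: string]: SessionMessage };

// Add a message under a parent (null for the root) and return it.
function addMessage(store: MessageStore, id: string, parentId: string | null, text: string): SessionMessage {
  const message = { id, parentId, text };
  store[id] = message;
  return message;
}

// Walk parent links from a leaf back to the root, then reverse into chronological order.
function historyOf(store: MessageStore, leafId: string): SessionMessage[] {
  const path: SessionMessage[] = [];
  let current: SessionMessage | undefined = store[leafId];
  while (current) {
    path.push(current);
    current = current.parentId === null ? undefined : store[current.parentId];
  }
  return path.reverse();
}

const store: MessageStore = {};
addMessage(store, "m1", null, "Plan the migration");
addMessage(store, "m2", "m1", "Option A: big-bang cutover");
addMessage(store, "m3", "m2", "Details for option A");
// Fork at m1: a new child starts an alternative branch, the original stays intact.
addMessage(store, "m4", "m1", "Option B: dual writes");
</code></pre>
            <p>Here <code>historyOf(store, "m3")</code> yields m1, m2, m3, while <code>historyOf(store, "m4")</code> yields m1, m4. Both branches share the same root without copying any messages.</p>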
    <div>
      <h2>From tool calls to code execution</h2>
      <a href="#from-tool-calls-to-code-execution">
        
      </a>
    </div>
    <p>Conventional tool-calling has an awkward shape. The model calls a tool, pulls the result back through the context window, calls another tool, pulls that back, and so on. As the tool surface grows, this gets both expensive and clumsy. A hundred files means a hundred round-trips through the model.</p><p>But <a href="https://blog.cloudflare.com/code-mode/"><u>models are better at writing code to use a system than they are at playing the tool-calling game</u></a>. This is the insight behind <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><u>@cloudflare/codemode</u></a>: instead of sequential tool calls, the LLM writes a single program that handles the entire task.</p>
            <pre><code>// The LLM writes this. It runs in a sandboxed Dynamic Worker.
const files = await tools.find({ pattern: "**/*.ts" });
const results = [];
for (const file of files) {
  const content = await tools.read({ path: file });
  if (content.includes("TODO")) {
    results.push({ file, todos: content.match(/\/\/ TODO:.*/g) });
  }
}
return results;
</code></pre>
            <p>Instead of 100 round-trips to the model, you just run a single program. This leads to fewer tokens used, faster execution, and better results. The <a href="https://github.com/cloudflare/mcp"><u>Cloudflare API MCP server</u></a> demonstrates this at scale. We expose only two tools (<code>search()</code> and <code>execute()</code>), which consume ~1,000 tokens, vs. ~1.17 million tokens for the naive tool-per-endpoint equivalent. This is a 99.9% reduction.</p>
    <div>
      <h3>The missing primitive: safe sandboxes</h3>
      <a href="#the-missing-primitive-safe-sandboxes">
        
      </a>
    </div>
    <p>Once you accept that models should write code on behalf of users, the question becomes: where does that code run? Not eventually, not after a product team turns it into a roadmap item. Right now, for this user, against this system, with tightly defined permissions.</p><p><a href="https://blog.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a> are that sandbox. A fresh V8 isolate spun up at runtime, in milliseconds, with a few megabytes of memory. That's roughly 100x faster and up to 100x more memory-efficient than a container. You can start a new one for every single request, run a snippet of code, and throw it away.</p><p>The critical design choice is the capability model. Instead of starting with a general-purpose machine and trying to constrain it, Dynamic Workers begin with almost no ambient authority (<code>globalOutbound: null</code>, no network access) and the developer grants capabilities explicitly, resource by resource, through bindings. We go from asking "how do we stop this thing from doing too much?" to "what exactly do we want this thing to be able to do?"</p><p>This is the right question for agent infrastructure.</p>
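    <p>The capability model can be illustrated with a small sketch. This is not the Dynamic Workers API: <code>Capabilities</code>, <code>runSandboxed</code>, and the two example capabilities are hypothetical, and a plain function call stands in for a real isolate. The point is only that the program can reach nothing except what it is handed.</p>

```typescript
// Hypothetical capability set: everything the sandboxed program may touch.
// Not the Dynamic Workers API, just a model of "no ambient authority".
interface Capabilities {
  readFile?: (path: string) => string;
  fetchUrl?: (url: string) => string;
}

type Program = (caps: Capabilities) => string;

// Run a program with only the capabilities that were explicitly granted.
function runSandboxed(program: Program, granted: Capabilities): string {
  try {
    return program(granted);
  } catch (error) {
    return `denied: ${(error as Error).message}`;
  }
}

// A program that wants both the filesystem and the network.
const program: Program = (caps) => {
  if (!caps.readFile) throw new Error("no filesystem capability");
  const notes = caps.readFile("notes.txt");
  if (!caps.fetchUrl) throw new Error("no network capability");
  return `${notes} + ${caps.fetchUrl("https://example.com")}`;
};

const files: Record<string, string> = { "notes.txt": "hello" };
const readOnly: Capabilities = { readFile: (path) => files[path] ?? "" };
const fullAccess: Capabilities = { ...readOnly, fetchUrl: (url) => `fetched ${url}` };

const nothingGranted = runSandboxed(program, {});
const fileOnly = runSandboxed(program, readOnly);
const bothGranted = runSandboxed(program, fullAccess);
```

    <p>Starting from an empty grant and adding capabilities one at a time mirrors the shift described above: the default answer is "nothing", and every exception is explicit.</p>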
    <div>
      <h3>The execution ladder</h3>
      <a href="#the-execution-ladder">
        
      </a>
    </div>
    <p>This capability model leads naturally to a spectrum of compute environments, an <b>execution ladder</b> that the agent escalates through as needed:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6yokfTVcg8frH4snf7c4sp/2306d721650b4956b28e2198f7cf915d/BLOG-3200_2.png" />
          </figure><p><b>Tier 0</b> is the Workspace, a durable virtual filesystem backed by SQLite and R2. Read, write, edit, search, grep, diff. Powered by <a href="https://www.npmjs.com/package/@cloudflare/shell"><code><u>@cloudflare/shell</u></code></a>.</p><p><b>Tier 1</b> is a Dynamic Worker: LLM-generated JavaScript running in a sandboxed isolate with no network access. Powered by <a href="https://www.npmjs.com/package/@cloudflare/codemode"><code><u>@cloudflare/codemode</u></code></a>.</p><p><b>Tier 2</b> adds npm. <a href="https://github.com/cloudflare/agents/tree/main/packages/worker-bundler"><code><u>@cloudflare/worker-bundler</u></code></a> fetches packages from the registry, bundles them with esbuild, and loads the result into the Dynamic Worker. The agent writes <code>import { z } from "zod"</code> and it just works.</p><p><b>Tier 3</b> is a headless browser via <a href="https://developers.cloudflare.com/browser-rendering/"><u>Cloudflare Browser Run</u></a>. Navigate, click, extract, screenshot. Useful when the service doesn't support agents yet via MCP or APIs.</p><p><b>Tier 4</b> is a <a href="https://developers.cloudflare.com/sandbox/"><u>Cloudflare Sandbox</u></a> configured with your toolchains, repos, and dependencies: <code>git clone</code>, <code>npm test</code>, <code>cargo build</code>, synced bidirectionally with the Workspace.</p><p>The key design principle: <b>the agent should be useful at Tier 0 alone, and each tier is additive.</b> The user can add capabilities as they go.</p>
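    <p>One way to picture escalation through the ladder is as picking the lowest tier that covers everything a task needs. The sketch below is an assumption-laden illustration: the <code>Need</code> labels and the <code>MIN_TIER</code> mapping are invented for this example and are not part of any Cloudflare package.</p>

```typescript
// Tiers as described above, lowest first.
const TIERS = ["workspace", "dynamic-worker", "npm", "browser", "sandbox"] as const;

// Hypothetical labels for what a task might require.
type Need = "files" | "run-js" | "npm-packages" | "web-ui" | "native-toolchain";

// Hypothetical mapping from a requirement to the lowest tier that meets it.
const MIN_TIER: Record<Need, number> = {
  "files": 0,
  "run-js": 1,
  "npm-packages": 2,
  "web-ui": 3,
  "native-toolchain": 4,
};

// Pick the lowest tier that satisfies every need. With no needs, stay at Tier 0.
function requiredTier(needs: Need[]): number {
  return needs.reduce((tier, need) => Math.max(tier, MIN_TIER[need]), 0);
}

const editOnly = requiredTier(["files"]);
const withPackages = requiredTier(["files", "run-js", "npm-packages"]);
const fullBuild = TIERS[requiredTier(["files", "native-toolchain"])];
```

    <p>Because each tier is additive, taking the maximum is enough: a task that needs npm packages still has the Workspace and the Dynamic Worker underneath it.</p>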
    <div>
      <h3>Building blocks, not a framework</h3>
      <a href="#building-blocks-not-a-framework">
        
      </a>
    </div>
    <p>All of these primitives are available as standalone packages. <a href="https://blog.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a>, <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><code><u>@cloudflare/codemode</u></code></a>, <a href="https://github.com/cloudflare/agents/tree/main/packages/worker-bundler"><code><u>@cloudflare/worker-bundler</u></code></a>, and <a href="https://www.npmjs.com/package/@cloudflare/shell"><code><u>@cloudflare/shell</u></code></a> (a durable filesystem with tools) are all usable directly with the Agent base class. You can combine them to give any agent a workspace, code execution, and runtime package resolution without adopting an opinionated framework.</p>
    <div>
      <h2>The platform</h2>
      <a href="#the-platform">
        
      </a>
    </div>
    <p>Here's the complete stack for building agents on Cloudflare:</p><table><tr><th><p><b>Capability</b></p></th><th><p><b>What it does</b></p></th><th><p><b>Powered by</b></p></th></tr><tr><td><p>Per-agent isolation</p></td><td><p>Every agent is its own world</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a> (DOs)</p></td></tr><tr><td><p>Zero cost when idle</p></td><td><p>$0 until the agent wakes up</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/best-practices/websockets/#websocket-hibernation-api"><u>DO Hibernation</u></a></p></td></tr><tr><td><p>Persistent state</p></td><td><p>Queryable, transactional storage</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/best-practices/access-durable-objects-storage/"><u>DO SQLite</u></a></p></td></tr><tr><td><p>Durable filesystem</p></td><td><p>Files that survive restarts</p></td><td><p>Workspace (SQLite + <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>)</p></td></tr><tr><td><p>Sandboxed code execution</p></td><td><p>Run LLM-generated code safely</p></td><td><p><a href="https://blog.cloudflare.com/dynamic-workers/"><u>Dynamic Workers</u></a> + <a href="https://github.com/cloudflare/agents/tree/main/packages/codemode"><code><u>@cloudflare/codemode</u></code></a></p></td></tr><tr><td><p>Runtime dependencies</p></td><td><p><code>import * as React from "react"</code> just works</p></td><td><p><a href="https://github.com/cloudflare/agents/tree/main/packages/worker-bundler"><code><u>@cloudflare/worker-bundler</u></code></a></p></td></tr><tr><td><p>Web automation</p></td><td><p>Browse, navigate, fill forms</p></td><td><p><a href="https://developers.cloudflare.com/browser-rendering/"><u>Browser Run</u></a></p></td></tr><tr><td><p>Full OS access</p></td><td><p>git, compilers, test runners</p></td><td><p><a href="https://developers.cloudflare.com/sandbox/"><u>Sandboxes</u></a></p></td></tr><tr><td><p>Scheduled execution</p></td><td><p>Proactive, not just reactive</p></td><td><p><a href="https://developers.cloudflare.com/durable-objects/api/alarms/"><u>DO Alarms + Fibers</u></a></p></td></tr><tr><td><p>Real-time streaming</p></td><td><p>Token-by-token to any client</p></td><td><p>WebSockets</p></td></tr><tr><td><p>External tools</p></td><td><p>Connect to any tool server</p></td><td><p>MCP</p></td></tr><tr><td><p>Agent coordination</p></td><td><p>Typed RPC between agents</p></td><td><p>Sub-agents (<a href="https://developers.cloudflare.com/dynamic-workers/usage/durable-object-facets/"><u>Facets</u></a>)</p></td></tr><tr><td><p>Model access</p></td><td><p>Connect to an LLM to power the agent</p></td><td><p><a href="https://developers.cloudflare.com/ai-gateway/"><u>AI Gateway</u></a> + <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> (or Bring Your Own Model)</p></td></tr></table><p>Each of these is a building block. Together, they form something new: a platform where anyone can build, deploy, and run AI agents as capable as the ones running on your local machine today, but <a href="https://www.cloudflare.com/learning/serverless/what-is-serverless/"><u>serverless</u></a>, durable, and safe by construction.</p>
    <div>
      <h2>The Think base class</h2>
      <a href="#the-think-base-class">
        
      </a>
    </div>
    <p>Now that you've seen the primitives, here's what happens when you wire them all together.</p><p><code>Think</code> is an opinionated harness that handles the full chat lifecycle: agentic loop, message persistence, streaming, tool execution, stream resumption, and extensions. You focus on what makes your agent unique.</p><p>The minimal subclass looks like this:</p>
            <pre><code>import { Think } from "@cloudflare/think";
import { createWorkersAI } from "workers-ai-provider";

export class MyAgent extends Think&lt;Env&gt; {
  getModel() {
    return createWorkersAI({ binding: this.env.AI })(
      "@cf/moonshotai/kimi-k2.5"
    );
  }
}
</code></pre>
            <p>That’s effectively all you need to have a working chat agent with streaming, persistence, abort/cancel, error handling, resumable streams, and a built-in workspace filesystem. Deploy with <code>npx wrangler deploy</code>.</p><p>Think makes decisions for you. When you need more control, you can override the ones you care about:</p><table><tr><th><p><b>Override</b></p></th><th><p><b>Purpose</b></p></th></tr><tr><td><p><code>getModel()</code></p></td><td><p>Return the <code>LanguageModel</code> to use</p></td></tr><tr><td><p><code>getSystemPrompt()</code></p></td><td><p>System prompt</p></td></tr><tr><td><p><code>getTools()</code></p></td><td><p>AI SDK-compatible <code>ToolSet</code> for the agentic loop</p></td></tr><tr><td><p><code>maxSteps</code></p></td><td><p>Max tool-call rounds per turn</p></td></tr><tr><td><p><code>configureSession()</code></p></td><td><p>Context blocks, compaction, search, skills</p></td></tr></table><p>Under the hood, Think runs the complete agentic loop on every turn: it assembles the context (base instructions + tool descriptions + skills + memory + conversation history), calls <code>streamText</code>, executes tool calls (with output truncation to prevent context blowup), appends results, and loops until the model is done or the step limit is reached. All messages are persisted after each turn.</p>
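            <p>The shape of that loop can be sketched in a few lines. This is a simplified, synchronous model and not Think's implementation: <code>runTurn</code>, <code>Model</code>, <code>Tools</code>, and <code>MAX_TOOL_OUTPUT</code> are hypothetical, the model is a plain function rather than a streaming call, and persistence is left out.</p>

```typescript
// Simplified, synchronous model of one chat turn. All names are hypothetical.
type ToolCall = { tool: string; input: string };
type ModelStep = { text: string; toolCalls: ToolCall[] };
type Model = (context: string[]) => ModelStep;
type Tools = Record<string, (input: string) => string>;

const MAX_TOOL_OUTPUT = 40; // characters kept from each tool result (illustrative)

function runTurn(model: Model, tools: Tools, history: string[], maxSteps: number): string[] {
  const context = [...history];
  for (let step = 0; step < maxSteps; step++) {
    const result = model(context);
    context.push(`assistant: ${result.text}`);
    // No tool calls means the model is done with this turn.
    if (result.toolCalls.length === 0) break;
    for (const call of result.toolCalls) {
      const run: ((input: string) => string) | undefined = tools[call.tool];
      const output = run ? run(call.input) : `unknown tool ${call.tool}`;
      // Truncate long outputs so one tool call cannot flood the context.
      context.push(`tool(${call.tool}): ${output.slice(0, MAX_TOOL_OUTPUT)}`);
    }
  }
  return context;
}

// A fake model: asks for one file read, then answers once it sees the result.
const fakeModel: Model = (context) =>
  context.some((line) => line.startsWith("tool("))
    ? { text: "The file is long.", toolCalls: [] }
    : { text: "Reading the file.", toolCalls: [{ tool: "read", input: "a.txt" }] };

const transcript = runTurn(fakeModel, { read: () => "x".repeat(100) }, ["user: summarize a.txt"], 5);
```

            <p>The two exit conditions match the description above: the loop stops when a step produces no tool calls, or when <code>maxSteps</code> rounds have run.</p>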
    <div>
      <h3>Lifecycle hooks</h3>
      <a href="#lifecycle-hooks">
        
      </a>
    </div>
    <p>Think gives you hooks at every stage of the chat turn, without requiring you to own the whole pipeline:</p>
            <pre><code>beforeTurn()
  → streamText()
    → beforeToolCall()
    → afterToolCall()
  → onStepFinish()
→ onChatResponse()
</code></pre>
            <p>With these hooks, you can switch to a lower-cost model for follow-up turns, limit the tools the model can use, and pass in client-side context on each turn. You can also log every tool call to analytics and automatically trigger one more follow-up turn after the model completes, all without replacing <code>onChatMessage</code>.</p>
    <div>
      <h3>Persistent memory and long conversations</h3>
      <a href="#persistent-memory-and-long-conversations">
        
      </a>
    </div>
    <p>Think builds on the <a href="https://developers.cloudflare.com/agents/api-reference/sessions/"><u>Session API</u></a> as its storage layer, giving you tree-structured messages with branching built in.</p><p>On top of that, it adds persistent memory through <b>context blocks</b>. These are structured sections of the system prompt that the model can read and update over time, and they persist across hibernation. The model sees "MEMORY (Important facts, use set_context to update) [42%, 462/1100 tokens]" and can proactively remember things.</p>
            <pre><code>configureSession(session: Session) {
  return session
    .withContext("soul", {
      provider: { get: async () =&gt; "You are a helpful coding assistant." }
    })
    .withContext("memory", {
      description: "Important facts learned during conversation.",
      maxTokens: 2000
    })
    .withCachedPrompt();
}
</code></pre>
            <p>Sessions are flexible. You can run multiple conversations per agent and fork them to try a different direction without losing the original.</p><p>As context grows, Think handles limits with non-destructive compaction. Older messages are summarized instead of removed, while the full history remains stored in SQLite.</p><p>Search is built in as well. Using FTS5, you can query conversation history within a session or across all sessions. The agent can also search its own past using the <code>search_context</code> tool.</p>
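            <p>Non-destructive compaction can be sketched as a view over an untouched log. The code below is an illustration under stated assumptions, not Think's implementation: <code>compactedView</code>, <code>StoredMessage</code>, and the stand-in summarizer are hypothetical, and a real summarizer would call a model.</p>

```typescript
// Hypothetical model: the full log is never mutated. Compaction only
// changes the view that would be sent to the model.
interface StoredMessage {
  id: number;
  text: string;
}

function compactedView(
  log: readonly StoredMessage[],
  keepRecent: number,
  summarize: (older: readonly StoredMessage[]) => string
): string[] {
  // Short conversations are passed through unchanged.
  if (log.length <= keepRecent) return log.map((m) => m.text);
  const cut = log.length - keepRecent;
  const older = log.slice(0, cut);
  const recent = log.slice(cut);
  // Older messages collapse into one summary line. Recent ones stay verbatim.
  return [`[summary] ${summarize(older)}`, ...recent.map((m) => m.text)];
}

const log: StoredMessage[] = [
  { id: 1, text: "Set up the repo" },
  { id: 2, text: "Added CI" },
  { id: 3, text: "Fixed flaky test" },
  { id: 4, text: "Ship it?" },
];

// A stand-in summarizer. A real one would call a model.
const view = compactedView(log, 2, (older) => `${older.length} earlier messages`);
```

            <p>The stored log still holds all four messages after compaction, which is what keeps full-text search over the complete history possible.</p>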
    <div>
      <h3>The full execution ladder, wired in</h3>
      <a href="#the-full-execution-ladder-wired-in">
        
      </a>
    </div>
    <p>Think integrates the entire execution ladder into a single <code>getTools()</code> return:</p>
            <pre><code>import { Think } from "@cloudflare/think";
import { createWorkspaceTools } from "@cloudflare/think/tools/workspace";
import { createExecuteTool } from "@cloudflare/think/tools/execute";
import { createBrowserTools } from "@cloudflare/think/tools/browser";
import { createSandboxTools } from "@cloudflare/think/tools/sandbox";
import { createExtensionTools } from "@cloudflare/think/tools/extensions";

export class MyAgent extends Think&lt;Env&gt; {
  extensionLoader = this.env.LOADER;

  getModel() {
    /* ... */
  }

  getTools() {
    return {
      execute: createExecuteTool({
        tools: createWorkspaceTools(this.workspace),
        loader: this.env.LOADER
      }),
      ...createBrowserTools(this.env.BROWSER),
      ...createSandboxTools(this.env.SANDBOX), // configured per-agent: toolchains, repos, snapshots
      ...createExtensionTools({ manager: this.extensionManager! }),
      ...this.extensionManager!.getTools()
    };
  }
}
</code></pre>
            
    <div>
      <h3>Self-authored extensions</h3>
      <a href="#self-authored-extensions">
        
      </a>
    </div>
    <p>Think takes code execution one step further. An agent can write its own extensions: TypeScript programs that run in Dynamic Workers, declaring permissions for network access and workspace operations.</p>
            <pre><code>{
  "name": "github",
  "description": "GitHub integration: PRs, issues, repos",
  "tools": ["create_pr", "list_issues", "review_pr"],
  "permissions": {
    "network": ["api.github.com"],
    "workspace": "read-write"
  }
}
</code></pre>
            <p>Think's <code>ExtensionManager</code> bundles the extension (optionally with npm deps via <code>@cloudflare/worker-bundler</code>), loads it into a Dynamic Worker, and registers the new tools. The extension persists in DO storage and survives hibernation. The next time the user asks about pull requests, the agent has a <code>github_create_pr</code> tool that didn't exist 30 seconds ago.</p><p>This is the kind of self-improvement loop that makes agents genuinely more useful over time. Not through fine-tuning or RLHF, but through code. The agent is able to write new capabilities for itself, all in sandboxed, auditable, and revocable TypeScript.</p>
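            <p>As a rough sketch of how a manifest like the one above could drive tool registration and network checks: the functions below are hypothetical and are not the <code>ExtensionManager</code> API. The name prefixing follows the <code>github_create_pr</code> example, while exact-match host checking is an assumption of this sketch.</p>

```typescript
// Shape of the manifest shown above. The workspace permission is kept as a
// plain string because only "read-write" appears in the example.
interface ExtensionManifest {
  name: string;
  description: string;
  tools: string[];
  permissions: { network: string[]; workspace: string };
}

const github: ExtensionManifest = {
  name: "github",
  description: "GitHub integration: PRs, issues, repos",
  tools: ["create_pr", "list_issues", "review_pr"],
  permissions: { network: ["api.github.com"], workspace: "read-write" },
};

// Prefix each tool with the extension name, e.g. "github_create_pr".
function registeredToolNames(manifest: ExtensionManifest): string[] {
  return manifest.tools.map((tool) => `${manifest.name}_${tool}`);
}

// Allow an outbound request only if its host is listed in the manifest.
// Exact host matching is an assumption of this sketch.
function isHostAllowed(manifest: ExtensionManifest, host: string): boolean {
  return manifest.permissions.network.includes(host);
}

const toolNames = registeredToolNames(github);
const apiAllowed = isHostAllowed(github, "api.github.com");
const otherAllowed = isHostAllowed(github, "example.com");
```

            <p>Keeping permissions in declarative data like this is what makes an extension auditable and revocable: the grant can be inspected or removed without reading the extension's code.</p>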
    <div>
      <h3>Sub-agent RPC</h3>
      <a href="#sub-agent-rpc">
        
      </a>
    </div>
    <p>Think also works as a sub-agent, called via <code>chat()</code> over RPC from a parent, with streaming events via callback:</p>
            <pre><code>const researcher = await this.subAgent(ResearchSession, "research");
const result = await researcher.chat(`Research this: ${task}`, streamRelay);
</code></pre>
            <p>Each child gets its own conversation tree, memory, tools, and model. The parent doesn't need to know the details.</p>
    <div>
      <h3>Getting started</h3>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>Project Think is experimental. The API surface is stable but will continue to evolve in the coming days and weeks. We're already using it internally to build our own background agent infrastructure, and we're sharing it early so you can build alongside us.</p>
            <pre><code>npm install @cloudflare/think agents ai @cloudflare/shell zod workers-ai-provider</code></pre>
            
            <pre><code>// src/server.ts
import { Think } from "@cloudflare/think";
import { createWorkersAI } from "workers-ai-provider";
import { routeAgentRequest } from "agents";

export class MyAgent extends Think&lt;Env&gt; {
  getModel() {
    return createWorkersAI({ binding: this.env.AI })(
      "@cf/moonshotai/kimi-k2.5"
    );
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ||
      new Response("Not found", { status: 404 })
    );
  }
} satisfies ExportedHandler&lt;Env&gt;;
</code></pre>
            
            <pre><code>// src/client.tsx
import { useAgent } from "agents/react";
import { useAgentChat } from "@cloudflare/ai-chat/react";

function Chat() {
  const agent = useAgent({ agent: "MyAgent" });
  const { messages, sendMessage, status } = useAgentChat({ agent });
  // Render your chat UI
}
</code></pre>
            <p>Think speaks the same WebSocket protocol as <code>@cloudflare/ai-chat</code>, so existing UI components work out of the box. If you've built on <a href="https://developers.cloudflare.com/agents/api-reference/chat-agents/"><code><u>AIChatAgent</u></code></a>, your client code doesn't change.</p>
    <div>
      <h2>The third wave</h2>
      <a href="#the-third-wave">
        
      </a>
    </div>
    <p>We see three waves of AI agents:</p><p><b>The first wave was chatbots.</b> They were stateless, reactive, and fragile. Every conversation started from scratch with no memory, no tools, and no ability to act. This made them useful for answering questions, but limited them to only answering questions.</p><p><b>The second wave was coding agents.</b> These are stateful, tool-using, and far more capable: tools like Pi, Claude Code, OpenClaw, and Codex. They can read codebases, write code, execute it, and iterate. They proved that an LLM with the right tools is a general-purpose machine, but they run on your laptop, for one user, with no durability guarantees.</p><p><b>Now we are entering the third wave: agents as infrastructure.</b> Durable, distributed, structurally safe, and serverless. These are agents that run on the Internet, survive failures, cost nothing when idle, and enforce security through architecture rather than behavior. Agents that any developer can build and deploy for any number of users.</p><p>This is the direction we’re betting on.</p><p>The Agents SDK is already powering thousands of production agents. With Project Think and the primitives it introduces, we're adding the missing pieces to make those agents dramatically more capable: persistent workspaces, sandboxed code execution, durable long-running tasks, structural security, sub-agent coordination, and self-authored extensions.</p><p>It's available today in preview. We're building alongside you, and we'd genuinely love to see what you (and your coding agent) create with it.</p><hr /><p><sup><i>Think is part of the Cloudflare Agents SDK, available as @cloudflare/think. The features described in this post are in preview. APIs may change as we incorporate feedback. Check the </i></sup><a href="https://github.com/cloudflare/agents/blob/main/docs/think/index.md"><sup><i><u>documentation</u></i></sup></a><sup><i> and </i></sup><a href="https://github.com/cloudflare/agents/tree/main/examples/assistant"><sup><i><u>example</u></i></sup></a><sup><i> to get started.</i></sup></p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/161Wz7Tf8Cpzn2u2cBCH3V/37633c016734590005edd280732e89b9/BLOG-3200_3.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Storage]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[AI]]></category>
            <guid isPermaLink="false">3r2ykMs0LTSPwVHmVWldCy</guid>
            <dc:creator>Sunil Pai</dc:creator>
            <dc:creator>Kate Reznykova</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Agent Lee - a new interface to the Cloudflare stack]]></title>
            <link>https://blog.cloudflare.com/introducing-agent-lee/</link>
            <pubDate>Wed, 15 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Agent Lee is an in-dashboard agent that shifts Cloudflare’s interface from manual tab-switching to a single prompt. Using sandboxed TypeScript, it helps you troubleshoot and manage your stack as a grounded technical collaborator.
 ]]></description>
            <content:encoded><![CDATA[ <p>While there have been small improvements along the way, the interface of technical products has not really changed since the dawn of the Internet. It still means clicking five pages deep, cross-referencing logs across tabs, and hunting for hidden toggles.</p><p>AI gives us the opportunity to rethink all that. Instead of complexity spread over a sprawling graphical user interface, what if you could describe in plain language what you wanted to achieve?</p><p>This is the future, and we’re launching it today. We didn’t want to just put an agent in a dashboard. We wanted to create an entirely new way to interact with our entire platform. Any task, any surface, a single prompt.</p><p>Introducing Agent Lee.</p><p>Agent Lee is an in-dashboard AI assistant that understands <b>your</b> Cloudflare account.</p><p>It can help you with troubleshooting, which, today, is a manual grind. If your Worker starts returning 503s at 02:00 UTC, finding the root cause, be it an R2 bucket, a misconfigured route, or a hidden rate limit, means opening half a dozen tabs and hoping you recognize the pattern. Most developers don't have a teammate who knows the entire platform standing over their shoulder at 2 a.m. Agent Lee does.</p><p>But it won’t just troubleshoot for you at 2 a.m. Agent Lee will also fix the problem for you on the spot.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Iva79HIiHPUrK8NLukkwH/dd1cf1709ab04f6d5825124cecd20a5e/BLOG-3231_2.png" />
          </figure><p>Agent Lee has been running in an active beta during which it has served over 18,000 daily users, executing nearly a quarter of a million tool calls per day. While we are confident in its current capabilities and success in production, this is a system we are continuously developing. As it remains in beta, you may encounter unexpected limitations or edge cases as we refine its performance. We encourage you to use the feedback form below to help us make it better every day.</p>
    <div>
      <h2>What Agent Lee can do</h2>
      <a href="#what-agent-lee-can-do">
        
      </a>
    </div>
    <p>Agent Lee is built directly into the dashboard and understands the resources in your account. It knows your Workers, your zones, your DNS configuration, your error rates. The knowledge that today lives across six tabs and two browser windows will now live in one place, and you can talk to it.</p><p>With natural language, you can use it to:</p><ul><li><p><b>Answer questions about your account:</b> "Show me the top 5 error messages on my Worker."</p></li><li><p><b>Debug an issue:</b> "I can't access my site with the www prefix."</p></li><li><p><b>Apply a change:</b> "Enable Access for my domain."</p></li><li><p><b>Deploy a resource: </b>"Create a new R2 bucket for my photos and connect it to my Worker."</p></li></ul><p>Instead of switching between products, you describe what you want to do, and Agent Lee helps you get there with instructions and visualizations. It retrieves context, uses the right tools, and creates dynamic visualizations based on the types of questions you ask. Ask what your error rate looks like over the last 24 hours, and it renders a chart inline, pulling from your actual traffic, not sending you to a separate Analytics page.</p><div>
  
</div><p>Agent Lee isn't answering FAQ questions — it's doing real work, against real accounts, at scale. Today, Agent Lee serves ~18,000 daily users, executing ~250k tool calls per day across DNS, Workers, SSL/TLS, R2, Registrar, Cache, Cloudflare Tunnel, API Shield, and more. </p>
    <div>
      <h2>How we built it</h2>
      <a href="#how-we-built-it">
        
      </a>
    </div>
    
    <div>
      <h3>Codemode</h3>
      <a href="#codemode">
        
      </a>
    </div>
    <p>Rather than presenting MCP tool definitions directly to the model, Agent Lee uses <a href="https://blog.cloudflare.com/code-mode/"><u>Codemode</u></a> to convert the tools into a TypeScript API and asks the model to write code that calls it instead.</p><p>This works better for a couple of reasons. LLMs have seen a huge amount of real-world TypeScript but very few tool call examples, so they're more accurate when working in code. For multi-step tasks, the model can also chain calls together in a single script and return only the final result, ultimately skipping the round-trips.</p><p>The generated code is sent to an upstream Cloudflare MCP server for sandboxed execution, but it goes through a Durable Object that acts as a credentialed proxy. Before any call goes out, the DO classifies the generated code as read or write by inspecting the method and body. Read operations are proxied directly. Write operations are blocked until you explicitly approve them through the elicitation gate. API keys are never present in the generated code — they're held inside the DO and injected server-side when the upstream call is made. The security boundary isn't just a sandbox that gets thrown away; it's a permission architecture that structurally prevents writes from happening without your approval.</p>
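    <p>The read/write gate can be sketched as a small classifier in front of the proxy. This is not Agent Lee's implementation: <code>ApiRequest</code>, <code>classify</code>, and <code>gate</code> are hypothetical names, and although the description above mentions inspecting both method and body, this sketch looks at the HTTP method only.</p>

```typescript
// Hypothetical request shape and classifier. The proxy described above inspects
// the method and body. This sketch looks at the HTTP method only.
interface ApiRequest {
  method: string;
  path: string;
}

type Decision = "proxied" | "blocked-pending-approval";

// Methods treated as reads in this sketch (an assumption, not a documented list).
const READ_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

function classify(request: ApiRequest): "read" | "write" {
  return READ_METHODS.has(request.method.toUpperCase()) ? "read" : "write";
}

// Reads pass straight through. Writes pass only with explicit user approval.
function gate(request: ApiRequest, userApproved: boolean): Decision {
  if (classify(request) === "read") return "proxied";
  return userApproved ? "proxied" : "blocked-pending-approval";
}

const listZones = gate({ method: "GET", path: "/zones" }, false);
const createBucket = gate({ method: "POST", path: "/r2/buckets" }, false);
const approvedCreate = gate({ method: "POST", path: "/r2/buckets" }, true);
```

    <p>Because the decision is made in the proxy and not in the generated code, the model cannot talk its way past it: an unapproved write never reaches the upstream API.</p>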
    <div>
      <h3>The MCP permission system</h3>
      <a href="#the-mcp-permission-system">
        
      </a>
    </div>
    <p>Agent Lee connects to Cloudflare's own MCP server, which exposes two tools: a search tool for querying API endpoints and an execute tool for writing code that performs API requests. This is the surface through which Agent Lee reads your account and, when you approve, writes to it.</p><p>Write operations go through an elicitation system that surfaces the approval step before any code executes. Agent Lee cannot skip this step. The permission model is the enforcement layer, and the confirmation prompt you see is not a UX courtesy. It's the gate.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/s8phQottGj8yVgc42Nzvl/3abb536756e10360b68cabc0522bcb30/BLOG-3231_4.png" />
          </figure>
    <div>
      <h2>Built on the same stack you can use</h2>
      <a href="#built-on-the-same-stack-you-can-use">
        
      </a>
    </div>
    <p>Every primitive Agent Lee is built on is available to all our customers: <a href="https://developers.cloudflare.com/agents/"><u>Agents SDK</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/workers-ai/"><u>Workers AI</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/durable-objects/"><u>Durable Objects</u></a>, and the same MCP infrastructure available to any Cloudflare developer. We didn't build internal tools that aren't available to you. Instead, we built it with the same Cloudflare LEGO blocks that you have access to.</p><p>Building Agent Lee on our own primitives wasn't just a design principle. It was the fastest way to find out what works and what doesn't. We built this in production, with real users, against real accounts. That means every limitation we hit is a limitation we can fix in the platform. Every pattern that works is one we can make easier for the next team that builds on top of it.</p><p>These are not opinions. They're what a quarter of a million tool calls across 18,000 users a day are telling us.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6877BA4kZUUP6qTs5ONucr/50d572b38fecdb3e77ab38d7976f06ed/image5.png" />
          </figure>
    <div>
      <h2>Generative UI</h2>
      <a href="#generative-ui">
        
      </a>
    </div>
    <p>Interacting with a platform should feel like collaborating with an expert. Conversations should transcend simple text. With Agent Lee, as your dialogue evolves, the platform dynamically generates UI components alongside textual responses to provide a richer, more actionable experience.</p><p>For example, if you ask about website traffic trends for the month, you won’t just get a paragraph of numbers. Agent Lee will render an interactive line graph, allowing you to visualize peaks and troughs in activity at a glance.</p><p>To give you full creative control, every conversation is accompanied by an adaptive grid. Here you can click and drag across the grid to carve out space for new UI blocks, then simply describe what you want to see and let the agent handle the heavy lifting.</p><p>Today, we support a diverse library of visual blocks, including dynamic tables, interactive charts, architecture maps, and more. By blending the flexibility of natural language with the clarity of structured UI, Agent Lee transforms your chat history into a living dashboard.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1oTLXzK5eYGyJm6Y54cKbz/98bae92dc7523f63ac6515d8088e70f7/image4.png" />
          </figure>
    <div>
      <h2>Measuring quality and safety</h2>
      <a href="#measuring-quality-and-safety">
        
      </a>
    </div>
    <p>An agent that can take action on your account needs to be reliable and secure. Elicitations allow agentic systems to actively solicit information, preferences, or approvals from users or other systems mid-execution. When Agent Lee needs to take non-read actions on a user's behalf, we use elicitations to require an explicit approval action in the user interface. These guardrails allow Agent Lee to truly be a partner alongside you in managing your resources safely.</p><p>In addition to safety, we continuously measure quality.</p><ul><li><p>Evals to measure conversation success rate and information accuracy.</p></li><li><p>Feedback signals from user interactions (thumbs up / thumbs down).</p></li><li><p>Tool call execution success rate and hallucination scorers.</p></li><li><p>Per-product breakdown of conversation performance.</p></li></ul><p>These systems help us improve Agent Lee over time while keeping users in control.</p>
    <div>
      <h2>Our vision ahead</h2>
      <a href="#our-vision-ahead">
        
      </a>
    </div>
    <p>Agent Lee in the dashboard is only the beginning.</p><p>The bigger vision is Agent Lee as the interface to the entire Cloudflare platform, from anywhere. The dashboard today, the CLI next, your phone when you're on the go. The surface you use shouldn't matter. You should be able to describe what you need and have it done, regardless of where you are.</p><p>From there, Agent Lee gets proactive. Rather than waiting to be asked, it watches what matters to you (your Workers, your traffic, your error thresholds) and reaches out when something warrants attention. An agent that only responds is useful. One that notices things first is something different.</p><p>Underlying all of this is context. Agent Lee already knows your account configuration. Over time, it will know more: what you've asked before, what page you're on, what you were debugging last week. That accumulated context is what makes a platform feel less like a tool and more like a collaborator.</p><p>We're not there yet. Agent Lee today is the first step, running in production, doing real work at scale. The architecture is built to get to the rest.</p>
    <div>
      <h2>Try it out</h2>
      <a href="#try-it-out">
        
      </a>
    </div>
    <p>Agent Lee is available in beta for Free plan users. Log in to your <a href="https://dash.cloudflare.com/login"><u>Cloudflare dashboard</u></a> and click Ask AI in the upper right corner to get started.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FThQbf24TcV1mYT49yi39/05df8347e14c8ef5e591d224a1a38393/Screenshot_2026-04-13_at_3.37.29%C3%A2__PM.png" />
          </figure><p>We'd love to know what you build and what you’d like to see in Agent Lee. Please share your feedback <a href="https://forms.gle/dSCHNkHpJt6Uwsvc8"><u>here</u></a>.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/NsNVsMvU9v3U03jY4kU54/28182a4d9f36f75f8e93e5fcf67c1f21/BLOG-3231_6.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Workers AI]]></category>
            <category><![CDATA[SDK]]></category>
            <category><![CDATA[Dashboard]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4KNkr1nK6i3lbBEBRWnt5z</guid>
            <dc:creator>Kylie Czajkowski</dc:creator>
            <dc:creator>Aparna Somaiah</dc:creator>
            <dc:creator>Brayden Wilmoth</dc:creator>
        </item>
        <item>
            <title><![CDATA[Register domains wherever you build: Cloudflare Registrar API now in beta]]></title>
            <link>https://blog.cloudflare.com/registrar-api-beta/</link>
            <pubDate>Wed, 15 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ The Cloudflare Registrar API is now in beta. Developers and AI agents can search, check availability, and register domains at cost directly from their editor, their terminal, or their agent — without leaving their workflow. ]]></description>
            <content:encoded><![CDATA[ <p>Today we're launching the next chapter of Cloudflare Registrar: the <b>Registrar API in beta</b>.</p><p>The Registrar API makes it possible to search for domains, check availability, and register them programmatically. Now, buying a domain the moment an idea starts to feel real no longer has to pull you out of the agentic workflow.</p><p>A Registrar API has been one of the clearest asks from builders using Cloudflare. As more of the agentic workflow has moved into editors, terminals, and agent-driven tools, domain registration became the obvious gap to close.</p><p>When we launched <a href="https://www.cloudflare.com/products/registrar/"><u>Cloudflare Registrar</u></a> seven years ago, the idea was simple. Domains should be offered <a href="https://www.cloudflare.com/application-services/solutions/low-cost-domain-names/"><u>at cost</u></a>, with no markup and no games. Since then, Cloudflare Registrar has become one of the fastest growing registrars in the world as more people choose Cloudflare as the place to build their next project.</p><div>
  
</div>
<p></p><p><sup><i>Prompting an agent inside an AI code editor to generate name ideas, search, check, and purchase a domain.</i></sup></p>
    <div>
      <h2>Built for agents and automation</h2>
      <a href="#built-for-agents-and-automation">
        
      </a>
    </div>
    <p>The Registrar API is designed to work well anywhere software is already being built: inside editors, deployment pipelines, backend services, and agent-driven workflows.</p><p>The workflow is intentionally simple and machine-friendly. <code><b>Search</b></code> returns candidate names. <code><b>Check</b></code> returns real-time availability and pricing. <code><b>Register</b></code> takes a minimal request and returns a workflow-shaped response that can complete immediately or be polled if it takes longer. That makes it straightforward to use for traditional API clients and for <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/"><u>AI agents</u></a> acting on a user's behalf.</p><p>In practice, all this means that an agent can help with the full flow: suggest names, confirm which one is actually registrable, surface the price for approval, and then complete the purchase without forcing the user out of the tool they are already using.</p>
    <div>
      <h2>The Registrar API</h2>
      <a href="#the-registrar-api">
        
      </a>
    </div>
    <p>At its core, this first release of the Registrar API does three things:</p><ul><li><p><code><b>Search</b></code> for domains</p></li><li><p><code><b>Check</b></code> availability</p></li><li><p><code><b>Register</b></code> domains</p></li></ul><p>The beta starts with a curated set of popular TLDs; see the <a href="https://developers.cloudflare.com/api/resources/registrar"><u>Registrar API docs</u></a> for details. When supported, <a href="https://www.cloudflare.com/learning/dns/glossary/premium-domains/"><u>premium domains</u></a> can also be registered, but they require explicit fee acknowledgement.</p><p>The Registrar API is part of the full Cloudflare API, which means agents already have access to it today through the <a href="https://blog.cloudflare.com/code-mode-mcp/"><u>Cloudflare MCP</u></a>. It does not require a separate integration or a custom tool definition. An agent working in Cursor, Claude Code, or any MCP-compatible environment can discover and call Registrar endpoints using the same <code>search()</code> and <code>execute()</code> pattern that covers the entire Cloudflare API surface. The moment the API was part of our spec, it was ready for agents.</p><p><i>What it looks like in practice</i>:</p><p>You're building a new project in your favorite AI code editor. Halfway through scaffolding, you ask your agent: "Find me a good .dev domain for this project and register it."</p><p>The agent searches for candidate names based on your project. It checks real-time availability for the one you pick and confirms the price. You say yes. It registers the domain, using your account's default contact info and payment method automatically. By the time you've read the response, the domain is registered, and privacy is on.</p><p>Three API calls. A few seconds.</p><p><i>What it looks like in code</i>:</p>
    <div>
      <h3>Step 1: <code><b>Search</b></code> for domain names</h3>
      <a href="#step-1-search-for-domain-names">
        
      </a>
    </div>
    <p>Use the <code><b>search</b></code> endpoint to submit a domain query, with or without a domain extension.</p>
            <pre><code>async () =&gt; {
  return cloudflare.request({
    method: "GET",
    path: `/accounts/${accountId}/registrar/domain-search`,
    query: { q: "acme corp", limit: 3 },
  });
}</code></pre>
            
            <pre><code>{
  "success": true,
  "errors": [],
  "messages": [],
  "result": {
    "domains": [
      {
        "name": "acmecorp.com",
        "registrable": true,
        "tier": "standard",
        "pricing": {
          "currency": "USD",
          "registration_cost": "8.57",
          "renewal_cost": "8.57"
        }
      },
      {
        "name": "acmecorp.dev",
        "registrable": true,
        "tier": "standard",
        "pricing": {
          "currency": "USD",
          "registration_cost": "10.11",
          "renewal_cost": "10.11"
        }
      },
      {
        "name": "acmecorp.app",
        "registrable": true,
        "tier": "standard",
        "pricing": {
          "currency": "USD",
          "registration_cost": "11.00",
          "renewal_cost": "11.00"
        }
      }
    ]
  }
}</code></pre>
            
    <div>
      <h3>Step 2: <code><b>Check</b></code> availability and pricing</h3>
      <a href="#step-2-check-availability-and-pricing">
        
      </a>
    </div>
    <p>Search results are fast but non-authoritative; they're based on cached data, and availability can change in seconds for popular names. <code><b>Check</b></code> queries the registry directly. Call it immediately before registering, and use its price response as the source of truth.</p>
            <pre><code>async () =&gt; {
  return cloudflare.request({
    method: "POST",
    path: `/accounts/${accountId}/registrar/domain-check`,
    body: { domains: ["acmecorp.dev"] },
  });
}</code></pre>
            
            <pre><code>{
  "success": true,
  "errors": [],
  "messages": [],
  "result": {
    "domains": [
      {
        "name": "acmecorp.dev",
        "registrable": true,
        "tier": "standard",
        "pricing": {
          "currency": "USD",
          "registration_cost": "10.11",
          "renewal_cost": "10.11"
        }
      }
    ]
  }
}</code></pre>
            
    <div>
      <h3>Step 3: <code><b>Register</b></code> the domain</h3>
      <a href="#step-3-register-the-domain">
        
      </a>
    </div>
    <p>The only required field is the domain name. WHOIS privacy protection is enabled by default at no extra charge. If your account has a default registrant contact, the API uses it automatically; otherwise you can provide contact details inline in the request. Your default payment method is used automatically.</p>
            <pre><code>async () =&gt; {
  return cloudflare.request({
    method: "POST",
    path: `/accounts/${accountId}/registrar/registrations`,
    body: { domain_name: "acmecorp.dev" },
  });
}</code></pre>
            
            <pre><code>{
  "success": true,
  "errors": [],
  "messages": [],
  "result": {
    "domain_name": "acmecorp.dev",
    "state": "succeeded",
    "completed": true,
    "created_at": "2025-10-27T10:00:00Z",
    "updated_at": "2025-10-27T10:00:03Z",
    "context": {
      "registration": {
        "domain_name": "acmecorp.dev",
        "status": "active",
        "created_at": "2025-10-27T10:00:00Z",
        "expires_at": "2026-10-27T10:00:00Z",
        "auto_renew": true,
        "privacy_enabled": true,
        "locked": true
      }
    },
    "links": {
      "self": "/accounts/abc/registrar/registrations/acmecorp.dev/registration-status",
      "resource": "/accounts/abc/registrar/registrations/acmecorp.dev"
    }
  }
}</code></pre>
            <p>Registration typically completes synchronously within seconds. If it takes longer, the API returns a 202 Accepted with a workflow URL to poll. The response shape is the same either way, so no special-casing is needed. For premium domains, the <code><b>Check</b></code> response returns the exact registry-set price, and the <code><b>Register</b></code> request echoes that back as an explicit fee acknowledgement.</p>
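            <p>As a rough illustration of the polling case, the sketch below waits on a registration response until <code>completed</code> is true. It is not part of the API: <code>waitForRegistration</code>, <code>intervalMs</code>, and <code>maxPolls</code> are hypothetical names, and it assumes that a GET on <code>links.self</code> returns the same response shape as the original call, as described above.</p>

```javascript
// Sketch: wait for a registration workflow to reach a final state.
// `request` is any function shaped like cloudflare.request in the examples
// above; it is passed in so the helper can run against a stub in tests.
async function waitForRegistration(request, response, options = {}) {
  const intervalMs = options.intervalMs ?? 2000; // assumed pause between polls
  const maxPolls = options.maxPolls ?? 30; // assumed upper bound on polls
  let current = response;

  for (let polls = 0; ; polls += 1) {
    // A synchronous registration is already complete on the first check.
    if (current.result.completed) {
      return current.result;
    }
    if (polls === maxPolls) {
      throw new Error("Registration did not complete in time");
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    // Assumption: links.self is fetched with GET and returns the same shape.
    current = await request({ method: "GET", path: current.result.links.self });
  }
}
```

            <p>Because the synchronous and polled responses share one shape, the caller can pass the result of the register call straight into a helper like this and read <code>state</code> from whatever comes back.</p>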
    <div>
      <h3>A note on agents and non-refundable purchases</h3>
      <a href="#a-note-on-agents-and-non-refundable-purchases">
        
      </a>
    </div>
    <p>When an agent registers a domain on your behalf, it charges your default payment method. Domain registrations are non-refundable once complete. A well-designed agent flow should confirm the domain name and price with the user before calling the registration endpoint. The <code><b>Check</b></code> step exists precisely to make that confirmation step explicit and unambiguous. The API gives you the tools to build it correctly; the responsibility to do so belongs in your agent's logic.</p><p>The API docs for the register call include explicit agent-facing instructions to seek permission from the user. <b>Still, it is your responsibility to design an agent flow that will not buy domains without your approval</b>.</p>
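    <p>One way to structure that confirmation is sketched below. It reuses the check and register calls from the steps above; <code>registerWithApproval</code> and the <code>confirm</code> callback are hypothetical names, not part of the API, and the sketch only covers standard-tier domains (premium domains also need the fee acknowledgement described earlier).</p>

```javascript
// Sketch: never call the register endpoint without an explicit "yes".
// `request` is shaped like cloudflare.request in the examples above.
// `confirm` is a hypothetical callback that shows the name and price to the
// user and resolves to true only if they approve.
async function registerWithApproval({ request, accountId, domain, confirm }) {
  // Check queries the registry directly, so its price is the source of truth.
  const check = await request({
    method: "POST",
    path: `/accounts/${accountId}/registrar/domain-check`,
    body: { domains: [domain] },
  });
  const match = check.result.domains.find((d) => d.name === domain);
  if (!match || !match.registrable) {
    return { registered: false, reason: "not registrable" };
  }

  // Surface exactly what will be charged before anything is purchased.
  const approved = await confirm({
    name: match.name,
    currency: match.pricing.currency,
    registrationCost: match.pricing.registration_cost,
    renewalCost: match.pricing.renewal_cost,
  });
  if (approved !== true) {
    return { registered: false, reason: "declined" };
  }

  const registration = await request({
    method: "POST",
    path: `/accounts/${accountId}/registrar/registrations`,
    body: { domain_name: domain },
  });
  return { registered: true, result: registration.result };
}
```

    <p>Keeping the approval inside the same function as the purchase means there is no code path that reaches the registrations endpoint without passing through <code>confirm</code> first.</p>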
    <div>
      <h2>Why Cloudflare can do this differently</h2>
      <a href="#why-cloudflare-can-do-this-differently">
        
      </a>
    </div>
    <p>What makes Cloudflare different from many developer platforms now adding domain workflows is that Cloudflare operates the registrar itself. That means the same platform where a project is built and deployed can also search for, register, and manage the domain — without adding markup on top.</p><p>At-cost pricing is at the core of Cloudflare’s registrar model. We charge exactly what the registry charges. That holds true whether you're registering a domain through the dashboard, calling the API directly, or asking an agent to do it on your behalf.</p>
    <div>
      <h2>Where the API goes next</h2>
      <a href="#where-the-api-goes-next">
        
      </a>
    </div>
    <p>This beta focuses on the first critical moment in the domain lifecycle: search, check, and registration. We are actively working on expanding the API to cover more of the core Registrar experience, so domains can be managed programmatically after they are purchased, not just at the moment they are created. This will include lifecycle elements like transfers, renewals, contact updates, and more.</p><p>The API is the first step toward a broader registrar-as-a-service offering. Development of that service is underway now, and we’re aiming to launch it later this year. As the API expands, platforms like website builders, hosting providers, AI products, and other multi-tenant applications will be able to make domain registration part of their own user experience. Users can search for a domain, buy it, and provision it without ever leaving the service or agent-driven workflow they are already building in.</p>
    <div>
      <h2>Start building today</h2>
      <a href="#start-building-today">
        
      </a>
    </div>
    <p>The Registrar API exists because builders asked for it. Now that it’s available as a beta, we’d love to see what you build, in the <a href="https://community.cloudflare.com/"><u>Cloudflare Community</u></a>, on <a href="https://x.com/cloudflare"><u>X</u></a>, or on <a href="https://discord.com/invite/cloudflaredev"><u>Discord</u></a>.</p><p>To get started:</p><ul><li><p>Review the <a href="https://developers.cloudflare.com/registrar/registrar-api/"><u>Registrar API guide</u></a></p></li><li><p>Check out the <a href="https://developers.cloudflare.com/api/resources/registrar"><u>API reference</u></a></p></li></ul><p>Please let us know if something is missing, if a workflow breaks down, or if you are building toward a larger platform use case. We’re working quickly to expand the functionality of the API to support domain renewals, transfers, and more.</p><p>We can’t wait to see what you build!</p><p><i>Special thanks to Lucy Dryaeva and Fred Pinto for their valuable contributions to delivering the Registrar API beta.</i></p> ]]></content:encoded>
            <category><![CDATA[Registrar]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Agents Week]]></category>
            <guid isPermaLink="false">oXDvwP79G3OGr3XDAFwmD</guid>
            <dc:creator>Ankit Shah</dc:creator>
            <dc:creator>Carlos Armada</dc:creator>
        </item>
        <item>
            <title><![CDATA[Browser Run: give your agents a browser]]></title>
            <link>https://blog.cloudflare.com/browser-run-for-ai-agents/</link>
            <pubDate>Wed, 15 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Browser Rendering is now Browser Run, with Live View, Human in the Loop, CDP access, session recordings, and 4x higher concurrency limits for AI agents. ]]></description>
            <content:encoded><![CDATA[ <p>AI agents need to interact with the web. To do that, they need a browser. They need to navigate sites, read pages, fill forms, extract data, and take screenshots. They need to observe whether things are working as expected, with a way for their humans to step in if needed. And they need to do all of this at scale.</p><p>Today, we’re renaming Browser Rendering to <b>Browser Run</b>, and shipping key features that make it <i>the</i> browser for <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/"><u>AI agents</u></a>. The name Browser Rendering never fully captured what the product does. Browser Run lets you run full browser sessions on Cloudflare's global network, drive them with code or AI, record and replay sessions, crawl pages for content, debug in real time, and let humans intervene when your agent needs help. </p><p>Here’s what’s new:</p><ul><li><p><b>Live View</b>: see what your agent sees and is doing, in real time. Know instantly if things are working, and when they’re not, see exactly why.</p></li><li><p><b>Human in the Loop</b>: when your agent hits a snag like a login page or unexpected edge case, it can hand off to a human instead of failing. The human steps in, resolves, then hands back control.</p></li><li><p><b>Chrome DevTools Protocol (CDP) Endpoint</b>: the Chrome DevTools Protocol is how agents control browsers. Browser Run now exposes it directly, so agents get maximum control over the browser and existing CDP scripts work on Cloudflare.</p></li><li><p><b>MCP Client Support:</b> AI coding agents like Claude Desktop, Cursor, and OpenCode can now use Browser Run as their remote browser.</p></li><li><p><b>WebMCP Support</b>: agents will outnumber humans using the web. WebMCP allows websites to declare what actions are available for agents to discover and call, making navigation more reliable.</p></li><li><p><b>Session Recordings</b>: capture every browser session for debugging purposes. 
When something goes wrong, you have the full recording with DOM changes, user interactions, and page navigation.</p></li><li><p><b>Higher limits</b>: run more tasks at once with 120 concurrent browsers, up from 30. </p></li></ul><div>
  
</div>
<p></p><p><sup><i>An AI agent searching Amazon for an orange lava lamp, comparing options, and handing off to a human when sign-in is required to complete the purchase</i></sup></p>
    <div>
      <h2>Everything an agent needs</h2>
      <a href="#everything-an-agent-needs">
        
      </a>
    </div>
    <p>Let’s think about what agents need when browsing the web and how each feature fits in:</p>
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>What an agent needs</span></th>
    <th><span>Browser Run (formerly Browser Rendering)</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>1) Browsers on-demand</span></td>
    <td><span>Chrome browser on Cloudflare’s global network</span></td>
  </tr>
  <tr>
    <td><span>2) A way to control the browser</span></td>
    <td><span>Take actions like navigate, click, fill forms, screenshot, and more with Puppeteer, Playwright, </span><span>CDP (new)</span><span>, </span><span>MCP Client Support (new)</span><span> and </span><span>WebMCP (new)</span></td>
  </tr>
  <tr>
    <td><span>3) Observability</span></td>
    <td><span>Live View (new)</span><span>, </span><span>Session Recordings (new)</span><span>, and </span><span>Dashboard redesign (new)</span></td>
  </tr>
  <tr>
    <td><span>4) Human intervention</span></td>
    <td><span>Human in the Loop (new)</span></td>
  </tr>
  <tr>
    <td><span>5) Scale</span></td>
    <td><span>10 requests/second for Quick Actions, </span><span>120 concurrent browsers (4x increase)</span></td>
  </tr>
</tbody></table></div>
    <div>
      <h2>1) Open a browser</h2>
      <a href="#1-open-a-browser">
        
      </a>
    </div>
    <p>First, an agent needs a browser. With Browser Run, agents can spin up a headless Chrome instance on Cloudflare’s global network, on demand. No infrastructure to manage, no Chrome versions to maintain. Browser sessions open near users for low latency, and scale up and down as needed. Pair Browser Run with the <a href="https://developers.cloudflare.com/agents/api-reference/browse-the-web/"><u>Agents SDK</u></a> to build long-running agents that browse the web, remember everything, and act on their own. </p>
    <div>
      <h2>2) Take actions</h2>
      <a href="#2-take-actions">
        
      </a>
    </div>
    <p>Once your agent has a browser, it needs ways to control it. Browser Run supports multiple approaches: new low-level protocol access with the Chrome DevTools Protocol (CDP) and WebMCP, in addition to existing higher-level automation using <a href="https://developers.cloudflare.com/browser-rendering/puppeteer/"><u>Puppeteer</u></a> and <a href="https://developers.cloudflare.com/browser-rendering/playwright/"><u>Playwright</u></a>, and <a href="https://developers.cloudflare.com/browser-rendering/rest-api/"><u>Quick Actions</u></a> for simple tasks. Let’s look at the details.</p>
    <div>
      <h3>Chrome DevTools Protocol (CDP) endpoint</h3>
      <a href="#chrome-devtools-protocol-cdp-endpoint">
        
      </a>
    </div>
    <p>The <a href="https://chromedevtools.github.io/devtools-protocol/"><u>Chrome DevTools Protocol (CDP)</u></a> is the low-level protocol that powers browser automation. Exposing CDP directly means the growing ecosystem of agent tools and existing CDP automation scripts can use Browser Run. When you open Chrome DevTools and inspect a page, CDP is what's running underneath. Puppeteer, Playwright, and most agent frameworks are built on top of it.</p><p>Every way that you have been using Browser Run has actually been through CDP already. What’s new is that we're now <a href="https://developers.cloudflare.com/browser-rendering/cdp/"><u>exposing CDP directly</u></a> as an endpoint. This matters for agents because CDP gives agents the most control possible over the browser. Agent frameworks already speak CDP natively, and can now connect to Browser Run directly. CDP also unlocks browser actions that aren't available through Puppeteer or Playwright, like JavaScript debugging. And because you're working with raw CDP messages instead of going through higher-level libraries, you can pass messages directly to models for more token-efficient browser control.</p><p>If you already have CDP automation scripts running against self-hosted Chrome, they work on Browser Run with a one-line config change. Point your WebSocket URL at Browser Run and stop managing your own browser infrastructure.</p>
            <pre><code>// Before: connecting to self-hosted Chrome
const browser = await puppeteer.connect({
  browserWSEndpoint: 'ws://localhost:9222/devtools/browser'
});

// After: connecting to Browser Run
const browser = await puppeteer.connect({
  browserWSEndpoint: 'wss://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/devtools/browser',
  headers: { 'Authorization': 'Bearer &lt;API_TOKEN&gt;' }
});
</code></pre>
            <p>The CDP endpoint also makes Browser Run more accessible. You can now connect from any language, any environment, without needing to write a <a href="https://developers.cloudflare.com/workers/"><u>Cloudflare Worker</u></a>. (If you're already using Workers, nothing changes.)</p>
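            <p>To make "raw CDP messages" concrete, here is a minimal sketch of what goes over the WebSocket. A CDP command is a JSON object with a client-chosen <code>id</code>, a <code>method</code>, and optional <code>params</code>, and the reply carries the same <code>id</code>. The helper names <code>createCdpMessageFactory</code> and <code>isReplyTo</code> are hypothetical; <code>Browser.getVersion</code> and <code>Target.createTarget</code> are standard CDP methods.</p>

```javascript
// Sketch: building raw CDP messages by hand, with no automation library.
// Each command gets a unique numeric id so replies can be matched to it.
function createCdpMessageFactory() {
  let nextId = 0;
  return function cdpMessage(method, params = {}) {
    nextId += 1;
    return JSON.stringify({ id: nextId, method, params });
  };
}

// A reply echoes the id of its command. Events pushed by the browser have
// no id, so they never match.
function isReplyTo(rawReply, rawCommand) {
  const reply = JSON.parse(rawReply);
  return reply.id !== undefined && reply.id === JSON.parse(rawCommand).id;
}

const cdpMessage = createCdpMessageFactory();
const getVersion = cdpMessage("Browser.getVersion");
const openPage = cdpMessage("Target.createTarget", { url: "https://example.com" });

console.log(getVersion);
console.log(openPage);
```

            <p>With a WebSocket client connected to the Browser Run endpoint shown above, each of these strings would be sent as one text frame. How the <code>Authorization</code> header is attached depends on the WebSocket library, so that part is left out here.</p>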
    <div>
      <h4>Using Browser Run with MCP Clients</h4>
      <a href="#using-browser-run-with-mcp-clients">
        
      </a>
    </div>
    <p>Now that Browser Run exposes the Chrome DevTools Protocol (CDP), MCP clients including Claude Desktop, Cursor, Codex, and OpenCode can use Browser Run as their remote browser. The <a href="https://github.com/ChromeDevTools/chrome-devtools-mcp"><u>chrome-devtools-mcp package</u></a> from the Chrome DevTools team is an MCP server that gives your AI coding assistant access to the full power of Chrome DevTools for reliable automation, in-depth debugging, and performance analysis.</p><p>Here’s an example of how to configure Browser Run for Claude Desktop:</p>
            <pre><code>{
  "mcpServers": {
    "browser-rendering": {
      "command": "npx",
      "args": [
        "-y",
        "chrome-devtools-mcp@latest",
        "--wsEndpoint=wss://api.cloudflare.com/client/v4/accounts/&lt;ACCOUNT_ID&gt;/browser-rendering/devtools/browser?keep_alive=600000",
        "--wsHeaders={\"Authorization\":\"Bearer &lt;API_TOKEN&gt;\"}"
      ]
    }
  }
}
</code></pre>
            <p>For other MCP clients, see <a href="https://developers.cloudflare.com/browser-rendering/cdp/mcp-clients/"><u>documentation for using Browser Run with MCP clients</u></a>.</p>
    <div>
      <h3>WebMCP support</h3>
      <a href="#webmcp-support">
        
      </a>
    </div>
    <p>The Internet was built for humans, so navigating as an AI agent today is unreliable. We’re betting on a future where more agents use the web than humans. In that world, sites need to be agent-friendly.</p><p>That’s why we’re launching support for <a href="https://developer.chrome.com/blog/webmcp-epp"><u>WebMCP</u></a>, a new browser API from the Google Chrome team that is available in Chromium 146 and later. WebMCP lets websites expose tools directly to AI agents, declaring what actions are available for agents to discover and call on each page. This helps agents navigate the web more reliably: instead of having to figure out how to use a site, an agent can call the tools the site exposes.</p><p>Two APIs make this work:</p><ul><li><p><code>navigator.modelContext</code> allows websites to register their tools</p></li><li><p><code>navigator.modelContextTesting</code> allows agents to discover and execute those tools</p></li></ul><p>Today, an agent visiting a travel booking site has to figure out the UI by looking at it. With WebMCP, the site declares “here’s a search_flights tool that takes an origin, destination, and date.” The agent calls it directly, without cycling through slow screenshot-analyze-click loops. This makes navigation more reliable regardless of potential changes to the UI.</p><p>Tools are discovered on the page rather than preloaded. This matters for the long tail of the web, where preloading an MCP server for every possible site is not feasible and would bloat the context window. </p><div>
  
</div><p><sup><i>Using WebMCP to book a hotel through the Chrome DevTools console, discovering available tools with listTools()</i></sup></p><p>We have an experimental pool with browser instances running Chrome beta so you can test emerging browser features before they reach stable Chrome. We also just shipped <a href="https://developers.cloudflare.com/browser-rendering/reference/wrangler-commands/"><u>Wrangler browser commands</u></a> that let you create, manage, and view browser sessions directly from your terminal. To <a href="https://developers.cloudflare.com/browser-run/features/webmcp/"><u>access WebMCP-enabled browsers</u></a>, use the following Wrangler command to create a session in the experimental pool:</p>
            <pre><code>npm i -g wrangler@latest
wrangler browser create --lab --keepAlive 300  
</code></pre>
            
    <div>
      <h3>Existing ways to use Browser Run</h3>
      <a href="#existing-ways-to-use-browser-run">
        
      </a>
    </div>
    <p>While CDP and WebMCP are new, you could already use <a href="https://developers.cloudflare.com/browser-rendering/puppeteer/"><u>Puppeteer</u></a>, <a href="https://developers.cloudflare.com/browser-rendering/playwright/"><u>Playwright</u></a>, or <a href="https://developers.cloudflare.com/browser-rendering/stagehand/"><u>Stagehand</u></a> for full browser automation through Browser Run. And for simple tasks like <a href="https://developers.cloudflare.com/browser-rendering/rest-api/screenshot-endpoint/"><u>capturing screenshots</u></a>, <a href="https://developers.cloudflare.com/browser-rendering/rest-api/pdf-endpoint/"><u>generating PDFs</u></a>, and <a href="https://developers.cloudflare.com/browser-rendering/rest-api/markdown-endpoint/"><u>extracting markdown</u></a>, there are the <a href="https://developers.cloudflare.com/browser-rendering/rest-api/"><u>Quick Action endpoints</u></a>. </p>
    <div>
      <h4>/crawl endpoint — crawl web content</h4>
      <a href="#crawl-endpoint-crawl-web-content">
        
      </a>
    </div>
    <p>We also recently shipped a <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/"><u>/crawl endpoint</u></a> that lets you crawl entire sites with a single API call. Give it a starting URL and pages are automatically discovered and scraped, then returned in your preferred format (HTML, Markdown, or structured JSON), with additional parameters to control crawl depth and scope, skip pages that haven’t changed, and specify certain paths to include or exclude. </p><p>We intentionally built /crawl to be a <a href="https://developers.cloudflare.com/browser-run/faq/#will-browser-run-be-detected-by-bot-management"><u>well-behaved crawler</u></a>. That means it respects site owners’ preferences out of the box. It is a <a href="https://developers.cloudflare.com/bots/concepts/bot/signed-agents/"><u>signed agent</u></a> with a distinct bot ID that is cryptographically signed using <a href="https://developers.cloudflare.com/bots/reference/bot-verification/web-bot-auth/"><u>Web Bot Auth</u></a>, it sends a non-customizable <a href="https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#user-agent"><u>User-Agent</u></a>, and it follows robots.txt and <a href="https://www.cloudflare.com/ai-crawl-control/"><u>AI Crawl Control</u></a>. It does not bypass Cloudflare’s bot protections or CAPTCHAs. Site owners choose whether their content is accessible, and /crawl respects that choice. </p>
            <pre><code># Initiate a crawl
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer &lt;apiToken&gt;' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://blog.cloudflare.com/"
  }'
</code></pre>
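            <p>For callers working in JavaScript rather than the shell, the same request can be sketched for <code>fetch()</code>. The URL, headers, and body mirror the curl command above; <code>buildCrawlRequest</code> is a hypothetical helper name, and the sketch makes no assumption about the shape of the response.</p>

```javascript
// Sketch: the crawl request from the curl example, prepared for fetch().
// Building the request separately keeps credentials out of the call site
// and makes the payload easy to inspect before sending.
function buildCrawlRequest(accountId, apiToken, startUrl) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      // Only the starting URL is required in the example above.
      body: JSON.stringify({ url: startUrl }),
    },
  };
}

// Usage (needs real credentials and network access):
// const { url, init } = buildCrawlRequest(ACCOUNT_ID, API_TOKEN, "https://blog.cloudflare.com/");
// const response = await fetch(url, init);
// console.log(response.status, await response.json());
```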
            
    <div>
      <h2>3) Observe</h2>
      <a href="#3-observe">
        
      </a>
    </div>
    <p>Things don’t always go right on the first try. We kept hearing from customers that when their automations failed, they had no idea why. That’s why we’ve added multiple ways to observe what’s happening, so you can see exactly what your agent sees, both live and after the fact. </p>
    <div>
      <h3>Live View</h3>
      <a href="#live-view">
        
      </a>
    </div>
    <p><a href="https://developers.cloudflare.com/browser-run/features/live-view/"><u>Live View</u></a> lets you watch your agent’s browser session in real time. Whether you’re debugging an agent or running a long automation script, you see exactly what’s happening as it happens. This includes the page itself, as well as the DOM, console, and network requests. When something goes wrong — the expected button isn't there, the page needs authentication, or a CAPTCHA appears — you can catch it immediately.</p><p>There are two ways to access Live View. From code, obtain the <code>session_id</code> of the browser you want to inspect and open the <code>devtoolsFrontendURL</code> from the response in Chrome. Or from the Cloudflare dashboard, open the new Live Sessions tab in the Browser Run section and click into any active session.</p><div>
  
</div><p><sup><i>Live View of an AI agent booking a hotel, showing real-time browser activity</i></sup></p>
    <div>
      <h3>Session Recordings</h3>
      <a href="#session-recordings">
        
      </a>
    </div>
    <p>Live View is great when you’re available, but you can’t watch every session. <a href="https://developers.cloudflare.com/browser-run/features/session-recording/"><u>Session Recordings</u></a> captures DOM changes, mouse and keyboard events, and page navigation as structured JSON so you can replay any session after it ends. </p><p>Enable Session Recordings by passing <code>recording:true</code> when launching a browser. After the session closes, you can access the recording in the Cloudflare dashboard from the Runs tab or retrieve recordings via API and replay them with the <a href="https://github.com/rrweb-io/rrweb/tree/master/packages/rrweb-player"><u>rrweb-player</u></a>. Next, we’re adding the ability to inspect DOM state and console output at any point during the recording.   </p><div>
  
</div><p><sup><i>Session recording replay of a browser automation browsing the Sentry Shop and adding a bomber jacket to the cart </i></sup></p>
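<p>Before replaying a recording, it can help to sanity-check what you retrieved. The sketch below assumes the recording is an array of rrweb-style events, each with a numeric <code>type</code> and a millisecond <code>timestamp</code> (in rrweb, type 2 is a full snapshot and type 3 an incremental snapshot). The <code>summarizeRecording</code> helper is hypothetical and not part of Browser Run or rrweb.</p>

```typescript
// Minimal shape of an rrweb-style event; real events carry a richer `data` payload.
interface RecordedEvent {
  type: number; // assumed rrweb EventType: 2 = FullSnapshot, 3 = IncrementalSnapshot
  timestamp: number; // milliseconds since the Unix epoch
  data?: unknown;
}

interface RecordingSummary {
  eventCount: number;
  durationMs: number;
  fullSnapshots: number;
  incrementalSnapshots: number;
}

// Hypothetical helper: summarize a recording before handing it to a player.
function summarizeRecording(events: RecordedEvent[]): RecordingSummary {
  if (events.length === 0) {
    return { eventCount: 0, durationMs: 0, fullSnapshots: 0, incrementalSnapshots: 0 };
  }
  let first = events[0].timestamp;
  let last = events[0].timestamp;
  let fullSnapshots = 0;
  let incrementalSnapshots = 0;
  for (const event of events) {
    if (event.timestamp < first) first = event.timestamp;
    if (event.timestamp > last) last = event.timestamp;
    if (event.type === 2) fullSnapshots += 1;
    if (event.type === 3) incrementalSnapshots += 1;
  }
  return {
    eventCount: events.length,
    durationMs: last - first,
    fullSnapshots,
    incrementalSnapshots,
  };
}
```

<p>A recording with no full snapshot usually cannot be replayed from the start, so a check like this is a cheap way to catch a truncated download.</p>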
    <div>
      <h3>Dashboard Redesign</h3>
      <a href="#dashboard-redesign">
        
      </a>
    </div>
    <p>Previously, the <a href="https://dash.cloudflare.com/?to=/:account/workers/browser-run"><u>Browser Run dashboard</u></a> only showed logs from browser sessions. Requests for screenshots, PDFs, markdown, and crawls were not visible. The redesigned dashboard changes that. The new Runs tab shows every request. You can filter by endpoint and view details including target URLs, status, and duration.  </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eExar2kjoc2QSq6skzTf6/ba4ad79fa01eb060f8b14cd5afa342e5/BLOG-3221_2.png" />
          </figure><p><sup><i>The Browser Run dashboard Runs tab showing browser sessions and quick actions like PDF, Screenshot, and Crawl in a single view, with a crawl job expanded to show its progress</i></sup></p>
    <div>
      <h2>4) Intervene</h2>
      <a href="#4-intervene">
        
      </a>
    </div>
    <p>Agents are good, but they’re not perfect. Sometimes they need their human to step in. Browser Run supports Human in the Loop workflows where a human can take control of a live browser session, handle what the automation cannot, then let the session continue. </p>
    <div>
      <h3>Human in the Loop</h3>
      <a href="#human-in-the-loop">
        
      </a>
    </div>
    <p>When automation hits a wall, you don't have to restart. With <a href="https://developers.cloudflare.com/browser-run/features/human-in-the-loop/"><u>Human in the Loop</u></a>, you can step in and interact with the page directly to click, type, navigate, enter credentials, or submit forms. This unlocks workflows that agents cannot handle.</p><p>Today, you can step in by opening the Live View URL for any active session. Next, we’re adding a handoff flow where the agent can signal that it needs help, notify a human to step in, then hand control back to the agent once the issue is resolved.</p><div>
  
</div>
<p></p><p><sup><i>An AI agent searching Amazon for an orange lava lamp, comparing options, and handing off to a human when sign-in is required to complete the purchase</i></sup></p>
    <div>
      <h2>5) Scale</h2>
      <a href="#5-scale">
        
      </a>
    </div>
    <p>Customers have asked us to raise limits so that they can do more, faster.</p>
    <div>
      <h3>Higher limits</h3>
      <a href="#higher-limits">
        
      </a>
    </div>
    <p>We've quadrupled the <a href="https://developers.cloudflare.com/browser-rendering/limits/"><u>default concurrent browser limit from 30 to 120</u></a>. Every session gives you instant access to a browser from a global pool of warm instances, so there's no cold start waiting for a browser to spin up. In March, we also <a href="https://developers.cloudflare.com/changelog/post/2026-03-04-br-rest-api-limit-increase/"><u>increased limits for Quick Actions</u></a> to 10 requests per second. If you need higher limits, they're available by request.</p>
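<p>If you call Quick Actions from your own code, a small client-side limiter can help you stay under the rate limit instead of discovering it through rejected requests. Here is a minimal token bucket sketch, assuming the 10 requests per second figure above. The <code>TokenBucket</code> class is hypothetical and not part of any Cloudflare SDK; time is passed in explicitly so the logic is easy to test.</p>

```typescript
// Hypothetical client-side limiter; not part of any Cloudflare SDK.
class TokenBucket {
  private readonly ratePerSecond: number;
  private readonly capacity: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(ratePerSecond: number, capacity: number, nowMs: number) {
    this.ratePerSecond = ratePerSecond;
    this.capacity = capacity;
    this.tokens = capacity; // start full so an initial burst is allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if a request may be sent at `nowMs`, consuming one token.
  tryAcquire(nowMs: number): boolean {
    if (nowMs > this.lastRefillMs) {
      const elapsedMs = nowMs - this.lastRefillMs;
      const refill = (elapsedMs * this.ratePerSecond) / 1000;
      this.tokens = Math.min(this.capacity, this.tokens + refill);
      this.lastRefillMs = nowMs;
    }
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Assumed limit: 10 requests per second, with bursts of up to 10.
const quickActionsLimiter = new TokenBucket(10, 10, Date.now());
```

<p>Call <code>quickActionsLimiter.tryAcquire(Date.now())</code> before each request and delay or queue the request when it returns <code>false</code>.</p>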
    <div>
      <h2>What's next</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <ul><li><p><b>Human in the Loop Handoff</b>: today you can intervene in a browser session through Live View. Soon, the agent will be able to signal when it needs help, so you can build in notifications to alert a human to step in.</p></li><li><p><b>Session Recordings Inspection</b>: you can already scrub through the timeline and replay any session. Soon, you’ll be able to inspect DOM state and console output as well.</p></li><li><p><b>Traces and Browser Logs</b>: access debugging information without instrumenting your code, including console logs, network requests, and timing data. If something broke, you'll know where.</p></li><li><p><b>Screenshot, PDF, and markdown directly from Workers</b>: the same simple tasks available through the <a href="https://developers.cloudflare.com/browser-rendering/rest-api/"><u>REST API</u></a> are coming to <a href="https://developers.cloudflare.com/browser-rendering/workers-bindings/"><u>Workers Bindings</u></a>. <code>env.BROWSER.screenshot()</code> will just work, with no API tokens needed.</p></li></ul>
    <div>
      <h2>Get started</h2>
      <a href="#get-started">
        
      </a>
    </div>
    <p>Browser Run is available today on both the Workers Free and Workers Paid plans. Everything we shipped today — Live View, Human in the Loop, Session Recordings, and higher concurrency limits — is ready to use. </p><p>If you were already using Browser Rendering, everything works the same, just with a new name and more features.  </p><p>Check out the <a href="https://developers.cloudflare.com/browser-rendering/"><u>documentation</u></a> to get started. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6mQq5rxfDK3oU80JCRUL7P/7842cac72f0f4170cc697011230146ab/BLOG-3221_3.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Chrome]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Browser Rendering]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Browser Run]]></category>
            <guid isPermaLink="false">160lCssR1GA8lEUV718ev5</guid>
            <dc:creator>Kathy Liao</dc:creator>
        </item>
        <item>
            <title><![CDATA[Rearchitecting the Workflows control plane for the agentic era]]></title>
            <link>https://blog.cloudflare.com/workflows-v2/</link>
            <pubDate>Wed, 15 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Workflows, a durable execution engine for multi-step applications, now supports higher concurrency and creation rate limits through a rearchitected control plane, helping scale to meet the use cases for durable background agents.
 ]]></description>
            <content:encoded><![CDATA[ <p>When we originally built <a href="https://developers.cloudflare.com/workflows/"><u>Workflows</u></a>, our durable execution engine for multi-step applications, it was designed for a world in which workflows were triggered by human actions, like a user signing up or placing an order. For use cases like onboarding flows, workflows only had to support one instance per person, and people can only click so fast. </p><p>Over time, what we’ve actually seen is a quantitative shift in the workload and access pattern: fewer human-triggered workflows, and more agent-triggered workflows, created at machine speed. </p><p>As agents become persistent and autonomous infrastructure, operating on behalf of users for hours or days, they need a durable, asynchronous execution engine for the work they are doing. Workflows provides exactly that: every step is independently retryable, the workflow can pause for human-in-the-loop approval, and each instance survives failures without losing progress.  </p><p>Moreover, workflows themselves are being used to implement agent loops and serve as the durable harnesses that manage and keep agents alive. Our <a href="https://developers.cloudflare.com/changelog/post/2026-02-03-agents-workflows-integration/"><u>Agents SDK integration</u></a> accelerated this, making it easy for agents to spawn workflow instances and get real-time progress back. A single agent session can now kick off dozens of workflows, and many agents running concurrently means thousands of instances created in seconds. 
With <a href="https://blog.cloudflare.com/project-think"><u>Project Think</u></a> now available, we anticipate that velocity will only increase.</p><p>To help developers scale their agents and applications on Workflows, we are excited to announce that we now support:</p><ul><li><p>50,000 concurrent instances (number of workflow executions running in parallel), <a href="https://developers.cloudflare.com/changelog/post/2025-02-25-workflows-concurrency-increased/"><u>originally 4,500</u></a></p></li><li><p>300 instances/second created per account, previously 100</p></li><li><p>2 million queued instances (meaning instances that have been created or awoken and are waiting for a concurrency slot) per workflow, up from 1 million</p></li></ul><p>We redesigned the Workflows control plane from usage data and first principles to support these increases. For V1 of the control plane, a single Durable Object (DO) could serve as the central registry and coordinator of an entire account. For V2, we built two new components to help horizontally scale the system and alleviate the bottlenecks that V1 introduced, before migrating all customers — with live traffic — seamlessly onto the new version.</p>
    <div>
      <h2>V1: initial architecture of Workflows</h2>
      <a href="#v1-initial-architecture-of-workflows">
        
      </a>
    </div>
    <p>As described in our <a href="https://blog.cloudflare.com/building-workflows-durable-execution-on-workers/#building-cloudflare-on-cloudflare"><u>public beta blog post</u></a>, we built <a href="https://www.cloudflare.com/developer-platform/products/workflows/"><u>Workflows</u></a> entirely on our own developer platform. Fundamentally, a workflow is a series of durable steps, each independently retryable, that can execute tasks, wait for external events, or sleep until a predetermined time. </p>
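            <pre><code>import { WorkflowEntrypoint } from "cloudflare:workers";

// fetchFromAPI, transform, and store stand in for your own application code.
export class MyWorkflow extends WorkflowEntrypoint {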

  async run(event, step) {
    const data = await step.do("fetch-data", async () =&gt; {
      return fetchFromAPI();
    });

    const approval = await step.waitForEvent("approval", {
      type: "approval",
      timeout: "24 hours",
    });

    await step.do("process-and-save", async () =&gt; {
      return store(transform(data));
    });
  }
}
</code></pre>
            <p>To trigger each instance, execute its logic, and store its metadata, we leverage SQLite-backed <a href="https://www.cloudflare.com/developer-platform/products/durable-objects/"><u>Durable Objects</u></a>, which are a simple but powerful primitive for coordination and storage within a distributed system. </p><p>In the control plane, some Durable Objects — like the <i>Engine</i>, which executes the actual workflow instance, including its step, retry, and sleep logic — are spun up at a ratio of 1:1 per instance. On the other hand, the <i>Account</i> is an account-level Durable Object that manages all workflows and workflow instances for that account.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/55bqaUjc30HJHe9spWYTo8/d8053955660553db8b64a484fb321ec7/BLOG-3116_2.png" />
          </figure><p>To learn more about the V1 control plane, refer to our <a href="https://blog.cloudflare.com/building-workflows-durable-execution-on-workers/"><u>Workflows announcement blog post</u></a>.</p><p>After we launched Workflows into beta, we were thrilled to see customers quickly scaling their use of the product, but we also realized that having a single Durable Object to store all that account-level information introduced a bottleneck. Many customers needed to create and execute hundreds or even thousands of Workflow instances per minute, which could quickly overwhelm the <i>Account</i> in our original architecture. The original rate limits — 4,500 concurrency slots and 100 instance creations per 10 seconds — were a result of this limitation. </p><p>On the V1 control plane, these limits were a hard cap. Any and all operations depending on <i>Account</i>, including create, update, and list, had to go through that single DO. Users with high concurrency workloads could have thousands of instances starting and ending at any given moment, building up to thousands of requests per second to <i>Account</i>. To solve for this, we rearchitected the workflow control plane such that it horizontally scales to higher concurrency and creation rate limits. </p>
    <div>
      <h2>V2: horizontal scale for higher throughput</h2>
      <a href="#v2-horizontal-scale-for-higher-throughput">
        
      </a>
    </div>
    <p>For the new version, we rethought every single operation from the ground up with the goal of optimizing for high-volume workflows. Ultimately, Workflows should scale to support whatever developers need – whether that is thousands of instances created per second or millions of instances running at a time. We also wanted to ensure that V2 allowed for flexible limits, which we can toggle and continue increasing, rather than the hard caps that V1 imposed. After many design iterations, we settled on the following pillars for our new architecture: </p><ul><li><p>The source of truth for the existence of a given instance should be its <i>Engine</i> and nothing else. </p><ul><li><p>In the V1 control plane architecture, we did not check whether an instance’s <i>Engine</i> actually existed before queuing the instance. This allowed for a bad state where an instance may have been queued without its corresponding <i>Engine</i> having spun up. </p></li><li><p>Instance lifecycle and liveness mechanisms must be horizontally scalable per-workflow and distributed throughout many regions.</p></li></ul></li><li><p>The new <i>Account</i> singleton should only store the minimum necessary metadata and have a fixed maximum number of concurrent requests.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1txhhObwwIcV8C2gr9Hjfe/df7ea739567c7e42471458357c16583d/unnamed.png" />
          </figure><p>There are two new, critical components in the V2 control plane which allowed us to improve the scalability of Workflows: <i>SousChef</i> and <i>Gatekeeper</i>. The first component, <i>SousChef</i>, is a “second in command” to the <i>Account</i>. Recall that previously, the <i>Account</i> managed the metadata and lifecycle for all of the instances across all of the workflows within a given account. <i>SousChef</i> was introduced to keep track of metadata and lifecycle on a <b>subset</b> of instances in a given workflow. Within an account, a distribution of <i>SousChefs</i> can then report back to <i>Account</i> in a more efficient and manageable way. (An added benefit of this design: not only did we already have per-account isolation, but we also inadvertently gained “per-workflow” isolation within the same account, since each <i>SousChef</i> only takes care of one specific workflow).</p><p>The second component, <i>Gatekeeper</i>, is a mechanism to distribute concurrency “slots” (derived from concurrency limits) across all <i>SousChefs</i> within the account. It acts as a leasing system. When an instance is created, it is randomly assigned to one of the <i>SousChefs</i> within that account. Then the <i>SousChef</i> makes a request to <i>Account</i> to trigger that instance. Either a slot is granted, or the instance is queued. Once the slot is granted, the <i>SousChef</i> triggers execution of the instance and assumes responsibility that the instance never gets stuck. </p><p><i>Gatekeeper</i> was needed to make sure that <i>Engines</i> never overloaded their <i>Account</i> (a pressing risk on V1) so every communication between <i>SousChefs</i> and their <i>Account</i> happens on a periodic cycle, once per second — each cycle will also batch all slot requests, ensuring that only one JSRPC call is made. 
This ensures the instance creation rate can never overload or influence the most important component, <i>Account</i> (as an aside: if the <i>SousChef </i>count is too high, we rate-limit calls or spread across different <i>SousChefs</i> throughout different time periods). Also, this periodic property allows us to preserve fairness on older instances and to ensure max-min fairness through the many <i>SousChefs</i>, allowing them all to progress. For example, if an instance wakes up, it should be prioritized for a slot over a newly created instance, but each <i>SousChef</i> ensures that its own instances do not get stuck.</p><p>This architecture is more distributed, and therefore, more scalable. Now, when an instance is created, the request path is:</p><ol><li><p>Check control plane version</p></li><li><p>Check if a cached version of the workflow and version details is available in that location</p><ol><li><p>If not, check <i>Account</i> to get workflow name, unique ID, and version, and cache that information</p></li></ol></li><li><p>Store only necessary metadata (instance payload, creation date) onto its own <i>Engine</i></p></li></ol><p>So, how does <i>Engine</i> tell the control plane that it now exists? That happens in the background after instance metadata is set. As background operations on a Durable Object can fail, due to eviction or server failure, we also set an “alarm” on <i>Engine</i> in the creation hot-path. That way, if the background task does not finish, the alarm <b>ensures</b> that the instance will begin. </p><p>A <a href="https://developers.cloudflare.com/durable-objects/api/alarms/"><u>Durable Object alarm</u></a> allows a Durable Object instance to be awakened at a fine-grained time in the future with an<b> at-least-once </b>execution model, with automatic retries built in. We extensively use this combination of background “tasks” and alarms to remove operations off the hot-path while still ensuring that everything will happen as planned. 
That’s how we keep critical operations like <i>creating an instance</i> fast without ever compromising on reliability. </p><p>Beyond unlocking scale, this version of the control plane means that: </p><ul><li><p>Instance listing is faster, and actually consistent with cursor pagination; </p></li><li><p>Any operation on an instance takes exactly one network hop, because it goes directly to the instance’s <i>Engine</i>, which keeps eyeball request latency as low as we can manage;</p></li><li><p>We can check that more concurrent instances are behaving correctly (that is, running on time) and correct any that are not, so that <i>Engines</i> are never late to continue execution.</p></li></ul>
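<p>To make the slot-leasing idea concrete, here is a minimal sketch of how one batched cycle could be resolved. It is an illustration of the behavior described above, not the actual <i>Gatekeeper</i> code. Every name in it (<code>SlotRequest</code>, <code>maxMinFair</code>, <code>grantSlots</code>) is hypothetical, and it assumes each <i>SousChef</i> reports how many woken and newly created instances are waiting. Woken instances are served first, and each class is split max-min fairly so no <i>SousChef</i> is starved.</p>

```typescript
// One batched request per SousChef per cycle (hypothetical shape).
interface SlotRequest {
  sousChefId: string;
  woken: number; // instances that woke up and need a slot again
  created: number; // newly created instances waiting for their first slot
}

interface SlotGrant {
  sousChefId: string;
  woken: number;
  created: number;
}

// Max-min fair split of `capacity` across `demands` (water-filling).
function maxMinFair(demands: number[], capacity: number): number[] {
  const grants = demands.map(() => 0);
  let remaining = capacity;
  let active = demands.map((_, i) => i).filter((i) => demands[i] > 0);
  while (remaining > 0 && active.length > 0) {
    const share = Math.floor(remaining / active.length);
    if (share === 0) {
      // Fewer slots than requesters: hand out one each, in order, until empty.
      for (const i of active) {
        if (remaining === 0) break;
        grants[i] += 1;
        remaining -= 1;
      }
      break;
    }
    for (const i of active) {
      const give = Math.min(share, demands[i] - grants[i]);
      grants[i] += give;
      remaining -= give;
    }
    // Requesters that are fully satisfied drop out; their unused share is
    // redistributed to the rest on the next pass.
    active = active.filter((i) => grants[i] < demands[i]);
  }
  return grants;
}

// Resolve one cycle: woken instances first, then new ones with what is left.
function grantSlots(requests: SlotRequest[], freeSlots: number): SlotGrant[] {
  const woken = maxMinFair(requests.map((r) => r.woken), freeSlots);
  const usedByWoken = woken.reduce((sum, n) => sum + n, 0);
  const created = maxMinFair(requests.map((r) => r.created), freeSlots - usedByWoken);
  return requests.map((r, i) => ({
    sousChefId: r.sousChefId,
    woken: woken[i],
    created: created[i],
  }));
}
```

<p>Because all requests for a cycle are resolved in one pass, the work done per cycle depends on the number of <i>SousChefs</i>, not on how many instances were created during that second.</p>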
    <div>
      <h2>V1 → V2 migration</h2>
      <a href="#v1-v2-migration">
        
      </a>
    </div>
    <p>Now that we had a new version of the Workflows control plane that can handle a higher volume of user load, we needed to do the “boring” part: migrating our customers and instances to the new system. At Cloudflare’s scale, this becomes a problem in and of itself, so the “boring” part becomes the biggest challenge. Well before its one-year mark, Workflows had already racked up millions of instances and thousands of customers. Also, some tech debt on V1’s control plane meant that a queued instance might not have its own <i>Engine</i> Durable Object created yet, complicating matters further.</p><p>Such a migration is tricky because customers might have instances running at any given moment; we needed a way to add the <i>SousChef</i> and <i>Gatekeeper</i> components into older accounts without causing any disruption or downtime.</p><p>We ultimately decided that we would migrate existing <i>Accounts</i> (which we’ll refer to as <i>AccountOlds</i>) to behave like <i>SousChefs</i>. By persisting the <i>Account</i> DOs, we maintained the instance metadata, and simply converted the DO into a <i>SousChef</i> “DO”: </p>
            <pre><code>// You might be wondering what this SousChef class is. It's the SousChef DO class!
import { DurableObject } from "cloudflare:workers";
import { SousChef } from "@repo/souschef";

class AccountOld extends DurableObject {
  constructor(state: DurableObjectState, env: Env) {
    super(state, env);
    // We added the following snippet to the end of our AccountOld DO's
    // constructor. This ensures that if we want, we can use any primitive
    // that is available on SousChef DO
    if (this.currentVersion === ControlPlaneVersions.SOUS_CHEFS) {
      this.sousChef = new SousChef(this.ctx, this.env);
      // A constructor cannot await, so hold incoming requests until setup completes.
      this.ctx.blockConcurrencyWhile(() =&gt; this.sousChef.setup());
    }
  }

  async updateInstance(params: UpdateInstanceParams) {
    if (this.currentVersion === ControlPlaneVersions.SOUS_CHEFS) {
      assert(this.sousChef !== undefined, 'SousChef must exist on v2');
      return this.sousChef.updateInstance(params);
    }

    // old logic remains the same
  }

  @RequiresVersion&lt;AccountOld&gt;(ControlPlaneVersions.V1)
  async getMetadata() {
    // this method can only be run if 
    // this.currentVersion === ControlPlaneVersions.V1
  }
}</code></pre>
            <p>We can instantiate the <i>SousChef</i> class within the <i>AccountOld</i> because the SQL tables that track instance metadata are identical on <i>SousChef</i> and <i>AccountOld</i> DOs. As such, we could just decide which version of the code to use. If this hadn’t been the case, we would have been forced to migrate the metadata of millions of instances, which would have made the migration more difficult and longer running for each account. So, how did the migration work?</p><p>First, we prepared <i>AccountOld</i> DOs to be switched to behave as <i>SousChefs</i> (which meant creating a release with a version of the snippet above). Then, we enabled control plane V2 per account, which triggered the following three steps at roughly the same time:</p><ul><li><p>All new instance creation requests are now routed to the new <i>SousChefs</i> (<i>SousChefs</i> are created when they receive the first request), so new instances never go to <i>AccountOld</i> again;</p></li><li><p><i>AccountOld</i> DOs start migrating themselves to behave like <i>SousChefs</i>;</p></li><li><p>The new <i>Account</i> DO is spun up with the corresponding metadata.</p></li></ul><p>After all accounts were migrated to the new control plane version, we were able to sunset <i>AccountOld</i> DOs as their instance retention periods expired. Once all instances on all accounts on <i>AccountOlds</i> were migrated, we could spin down those DOs permanently. The migration was completed with no downtime in a process that truly felt like changing a car’s wheels while driving.</p>
    <div>
      <h2>Try it out</h2>
      <a href="#try-it-out">
        
      </a>
    </div>
    <p>If you are new to Workflows, try our <a href="https://developers.cloudflare.com/workflows/get-started/guide/"><u>Get Started guide</u></a> or <a href="https://developers.cloudflare.com/workflows/get-started/durable-agents/"><u>build your first durable agent</u></a> with Workflows.</p><p>If your use case requires higher limits than our new defaults — a concurrency limit of 50,000 slots and account-level creation rate limit of 300 instances per second, 100 per workflow — reach out via your account team or the <a href="https://forms.gle/ukpeZVLWLnKeixDu7"><u>Workers Limit Request Form</u></a>. You can also reach out with feedback, feature requests, or just to share how you are using Workflows on our <a href="https://discord.com/channels/595317990191398933/1296923707792560189"><u>Discord server</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">5R3ZpKlSDaSxbIwmpXwWYJ</guid>
            <dc:creator>Luís Duarte</dc:creator>
            <dc:creator>Mia Malden</dc:creator>
            <dc:creator>André Venceslau</dc:creator>
        </item>
        <item>
            <title><![CDATA[Add voice to your agent]]></title>
            <link>https://blog.cloudflare.com/voice-agents/</link>
            <pubDate>Wed, 15 Apr 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ An experimental voice pipeline for the Agents SDK enables real-time voice interactions over WebSockets. Developers can now build agents with continuous STT and TTS in just ~30 lines of server-side code.
 ]]></description>
            <content:encoded><![CDATA[ <p>For many of us, our first experiences with AI agents have been through typing into a chat box. And for those of us using agents day to day, we have likely gotten very good at writing detailed prompts or markdown files to guide them.</p><p>But some of the moments where agents would be most useful are not always text-first. You might be on a long commute, juggling a few open sessions, or just wanting to speak naturally to an agent, have it speak back, and continue the interaction.</p><p>Adding voice to an agent should not require moving that agent into a separate voice framework. Today, we are releasing an experimental voice pipeline for the <a href="https://developers.cloudflare.com/agents/api-reference/voice/"><u>Agents SDK</u></a>.</p><p>With <code>@cloudflare/voice</code>, you can add real-time voice to the same Agent architecture you already use. Voice just becomes another way you can talk to the same Durable Object, with the same tools, persistence, and WebSocket connection model that the Agents SDK already provides. 
</p><p><code>@cloudflare/voice</code> is an experimental package for the Agents SDK that provides: </p><ul><li><p><code>withVoice(Agent)</code> for full conversation voice agents</p></li><li><p><code>withVoiceInput(Agent)</code> for speech-to-text-only use cases, like dictation or voice search </p></li><li><p><code>useVoiceAgent</code> and <code>useVoiceInput</code> hooks for React apps </p></li><li><p><code>VoiceClient</code> for framework-agnostic clients </p></li><li><p>Built-in <a href="https://developers.cloudflare.com/workers-ai/"><u>Workers AI</u></a> providers, so that you can get started without external API keys: </p><ul><li><p>Continuous STT with <a href="https://developers.cloudflare.com/workers-ai/models/flux/"><u>Deepgram Flux</u></a></p></li><li><p>Continuous STT with <a href="https://developers.cloudflare.com/workers-ai/models/nova-3/"><u>Deepgram Nova 3</u></a></p></li><li><p>Text-to-speech with <a href="https://developers.cloudflare.com/workers-ai/models/aura-1/"><u>Deepgram Aura</u></a></p></li></ul></li></ul><p>This means you can now build an agent that users can talk to in real time over a single WebSocket connection, while keeping the same Agent class, Durable Object instance, and the same SQLite-backed conversation history. </p><p>Just as importantly, we want this to be bigger than one fixed default stack. The provider interfaces in <code>@cloudflare/voice</code> are intentionally small, and we want speech, telephony, and transport providers to build with us, so developers can mix and match the right components for their use case, instead of being locked into a single voice architecture.</p>
    <div>
      <h2>Get started with voice</h2>
      <a href="#get-started-with-voice">
        
      </a>
    </div>
    <p>Here’s the minimal server-side pattern for a voice agent in the Agents SDK: </p>
            <pre><code>import { Agent, routeAgentRequest } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext
} from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent&lt;Env&gt; {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    return `You said: ${transcript}`;
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  }
} satisfies ExportedHandler&lt;Env&gt;;

</code></pre>
            <p>That’s the whole server. You add a continuous transcriber and a text-to-speech provider, and implement <code>onTurn()</code>.</p><p>On the client side, you can connect to it with a React hook: </p>
            <pre><code>import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status,
    transcript,
    interimTranscript,
    startCall,
    endCall,
    toggleMute
  } = useVoiceAgent({ agent: "my-agent" });

  return (
    &lt;div&gt;
      &lt;p&gt;Status: {status}&lt;/p&gt;
      {interimTranscript &amp;&amp; &lt;p&gt;&lt;em&gt;{interimTranscript}&lt;/em&gt;&lt;/p&gt;}
      &lt;ul&gt;
        {transcript.map((msg, i) =&gt; (
          &lt;li key={i}&gt;
            &lt;strong&gt;{msg.role}:&lt;/strong&gt; {msg.text}
          &lt;/li&gt;
        ))}
      &lt;/ul&gt;
      &lt;button onClick={startCall}&gt;Start Call&lt;/button&gt;
      &lt;button onClick={endCall}&gt;End Call&lt;/button&gt;
      &lt;button onClick={toggleMute}&gt;Mute / Unmute&lt;/button&gt;
    &lt;/div&gt;
  );
}
</code></pre>
            <p>If you are not using React, you can use <code>VoiceClient</code> directly from <code>@cloudflare/voice/client</code>. </p>
    <div>
      <h2>How the voice pipeline works</h2>
      <a href="#how-the-voice-pipeline-works">
        
      </a>
    </div>
    <p>With the <a href="https://github.com/cloudflare/agents"><u>Agents SDK</u></a>, every agent is a <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Object</u></a> — a stateful, addressable server instance with its own <a href="https://developers.cloudflare.com/agents/api-reference/store-and-sync-state/"><u>SQLite database</u></a>, <a href="https://developers.cloudflare.com/agents/api-reference/websockets/"><u>WebSocket connections</u></a>, and application logic. The voice pipeline extends this model instead of replacing it. </p><p>At a high level, the flow looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5IBi8TsQiJ18Um47zpqAIr/d284d47a8653c35bc7c027438a5a7a2c/unnamed__55_.png" />
          </figure><p>Here’s how the pipeline breaks down, step by step: </p><ol><li><p><b>Audio transport: </b>The browser captures microphone audio and streams 16 kHz mono PCM over the same WebSocket connection the agent already uses. </p></li><li><p><b>STT session setup: </b>When the call starts, the agent creates a continuous transcriber session that lives for the duration of the call. </p></li><li><p><b>STT input: </b>Audio streams continuously into that session.</p></li><li><p><b>STT turn detection: </b>The speech-to-text model itself decides when the user has finished an utterance and emits a stable transcript for that turn. </p></li><li><p><b>LLM/application logic: </b>The voice pipeline passes that transcript to your <code>onTurn()</code> method. </p></li><li><p><b>TTS output: </b>Your response is synthesized to audio and sent back to the client. If <code>onTurn()</code> returns a stream, the pipeline sentence-chunks it and starts sending audio as sentences are ready. </p></li><li><p><b>Persistence: </b>The user and agent messages are persisted in SQLite, so conversation history survives reconnections and deployments.  </p></li></ol>
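<p>The provided client takes care of audio transport for you, but it helps to see what “PCM” means in practice. Browsers expose microphone samples as floats between -1 and 1, and PCM transports usually carry integers. The sketch below converts one to the other. It assumes signed 16-bit samples, which is a common choice but is not stated above, and it does not show resampling to 16 kHz. The <code>floatTo16BitPcm</code> name is hypothetical.</p>

```typescript
// Convert Web Audio samples (floats in [-1, 1]) to signed 16-bit PCM.
// Assumption: 16-bit depth; the pipeline description only says "16 kHz mono PCM".
function floatTo16BitPcm(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i += 1) {
    // Clamp first so out-of-range input cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = clamped < 0 ? Math.round(clamped * 0x8000) : Math.round(clamped * 0x7fff);
  }
  return pcm;
}
```

<p>At 16 kHz mono with 16-bit samples, one second of audio is 32,000 bytes, which is small enough to stream comfortably over a WebSocket.</p>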
    <div>
      <h2>Why voice should grow with the rest of your agent</h2>
      <a href="#why-voice-should-grow-with-the-rest-of-your-agent">
        
      </a>
    </div>
    <p>Many voice frameworks focus on the voice loop itself: audio in, transcription, model response, audio out. Those are important primitives, but there’s a lot more to an agent than just voice. </p><p>Real agents running in production will grow. They need state, scheduling, persistence, tools, workflows, telephony, and ways to keep all of that consistent across channels. As your agent grows in complexity, voice stops being a standalone feature and becomes part of a larger system. </p><p>We wanted voice in the Agents SDK to start from that assumption. Instead of building voice as a separate stack, we built it on top of the same Durable Object-based agent platform, so you can pull in the rest of the primitives you need without re-architecting the application later.</p>
    <div>
      <h3>Voice and text share the same state</h3>
      <a href="#voice-and-text-share-the-same-state">
        
      </a>
    </div>
    <p>A user might start by typing, switch to voice, and go back to text. With Agents SDK, these are all just different inputs to the same agent. The same conversation history lives in SQLite, and the same tools are available. This gives you both a cleaner mental model and a much simpler application architecture to reason about. </p>
    <div>
      <h2>Lower latency comes from...</h2>
      <a href="#lower-latency-comes-from">
        
      </a>
    </div>
    
    <div>
      <h3>a shorter network path </h3>
      <a href="#a-shorter-network-path">
        
      </a>
    </div>
    <p>Voice experiences feel good or bad very quickly. Once a user stops speaking, the system needs to transcribe, think, and start speaking back fast enough to feel conversational. </p><p>A lot of voice latency is not pure model time. It’s the cost of bouncing audio and text between different services in different places. Audio needs to go to STT, transcripts go to an LLM, and responses go to a TTS model – and each handoff adds network overhead. </p><p>With the Agents SDK voice pipeline, the agent runs on Cloudflare’s network, and the built-in providers use Workers AI bindings. That keeps the pipeline tighter and reduces the amount of infrastructure you have to stitch together yourself. </p>
    <div>
      <h3>built-in streaming</h3>
      <a href="#built-in-streaming">
        
      </a>
    </div>
    <p>A voice agent interaction feels much more natural when the agent starts speaking quickly; the delay before the first audio is often called time-to-first-audio. When <code>onTurn()</code> returns a stream, the pipeline chunks it into sentences and starts synthesis as sentences complete. That means the user can hear the beginning of the answer while the rest is still being generated. </p>
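    <p>The idea behind sentence chunking can be sketched in a few lines. This is not the SDK's implementation: <code>SentenceChunker</code> and <code>chunkSentences</code> are hypothetical names, and the rule used here (a sentence ends at <code>.</code>, <code>!</code> or <code>?</code> followed by whitespace) is a simplifying assumption that will mis-split abbreviations such as "Dr. Smith".</p>

```typescript
// Matches sentence-ending punctuation followed by whitespace (assumed rule).
const SENTENCE_END = /[.!?]\s+/;

class SentenceChunker {
  private buffer = "";

  // Add a text fragment and return any sentences that are now complete.
  push(fragment: string): string[] {
    this.buffer += fragment;
    const sentences: string[] = [];
    let match = SENTENCE_END.exec(this.buffer);
    while (match !== null) {
      // Keep the punctuation mark, drop the whitespace after it.
      sentences.push(this.buffer.slice(0, match.index + 1).trim());
      this.buffer = this.buffer.slice(match.index + match[0].length);
      match = SENTENCE_END.exec(this.buffer);
    }
    return sentences;
  }

  // Return whatever text is left once the stream has ended.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}

// Wrap a stream of LLM text fragments so it yields whole sentences.
async function* chunkSentences(
  fragments: AsyncIterable<string>
): AsyncGenerator<string> {
  const chunker = new SentenceChunker();
  for await (const fragment of fragments) {
    yield* chunker.push(fragment);
  }
  yield* chunker.flush();
}
```

    <p>Each sentence yielded this way can be handed to text-to-speech immediately, which is why the first audio can start before the model has finished generating.</p>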
    <div>
      <h2>A more realistic backend </h2>
      <a href="#a-more-realistic-backend">
        
      </a>
    </div>
    <p>Here is a fuller example that streams an LLM response and starts speaking it back, sentence by sentence:</p>
            <pre><code>import { Agent, routeAgentRequest } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext
} from "@cloudflare/voice";
import { streamText } from "ai";
import { createWorkersAI } from "workers-ai-provider";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent&lt;Env&gt; {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const ai = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: ai("@cf/cloudflare/gpt-oss-20b"),
      system: "You are a helpful voice assistant. Be concise.",
      messages: [
        ...context.messages.map((m) =&gt; ({
          role: m.role as "user" | "assistant",
          content: m.content
        })),
        { role: "user" as const, content: transcript }
      ],
      abortSignal: context.signal
    });

    return result.textStream;
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  }
} satisfies ExportedHandler&lt;Env&gt;;
</code></pre>
            <p><code>context.messages</code> gives you recent SQLite-backed conversation history, and <code>context.signal</code> lets the pipeline abort the LLM call if the user interrupts. </p>
    <div>
      <h2>Voice as an input: <code>withVoiceInput</code></h2>
      <a href="#voice-as-an-input-withvoiceinput">
        
      </a>
    </div>
    <p>Not every speech interface needs to speak back. Sometimes you might want dictation, transcription, or voice search. For these use cases, you can use <code>withVoiceInput</code>:</p>
            <pre><code>import { Agent, type Connection } from "agents";
import { withVoiceInput, WorkersAINova3STT } from "@cloudflare/voice";

const InputAgent = withVoiceInput(Agent);

export class DictationAgent extends InputAgent&lt;Env&gt; {
  transcriber = new WorkersAINova3STT(this.env.AI);

  onTranscript(text: string, _connection: Connection) {
    console.log("User said:", text);
  }
}
</code></pre>
            <p>On the client, <code>useVoiceInput</code> gives you a lightweight interface centered on transcriptions: </p>
            <pre><code>import { useVoiceInput } from "@cloudflare/voice/react";

const { transcript, interimTranscript, isListening, start, stop, clear } =
  useVoiceInput({ agent: "DictationAgent" });
</code></pre>
            <p>This is useful when speech is an input method, and you don’t need a full conversational loop. </p>
    <div>
      <h2>Voice and text on the same connection</h2>
      <a href="#voice-and-text-on-the-same-connection">
        
      </a>
    </div>
    <p>The same client can call <code>sendText("What's the weather?")</code>, which bypasses STT and sends the text directly to <code>onTurn()</code>. During an active call, the response can be spoken and shown as text. Outside a call, it can remain text-only. </p><p>This gives you a genuinely multimodal agent, without splitting the implementation into different code paths. </p>
    <div>
      <h2>What else can you build? </h2>
      <a href="#what-else-can-you-build">
        
      </a>
    </div>
    <p>Because a voice agent is still an agent, all the normal Agents SDK capabilities still apply. </p>
    <div>
      <h3>Tools and scheduling</h3>
      <a href="#tools-and-scheduling">
        
      </a>
    </div>
    <p>You can greet a caller when a session starts: </p>
            <pre><code>import { Agent, type Connection } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent&lt;Env&gt; {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string) {
    return `You said: ${transcript}`;
  }

  async onCallStart(connection: Connection) {
    await this.speak(connection, "Hi! How can I help you today?");
  }
}
</code></pre>
            <p>You can schedule spoken reminders and expose tools to your LLM just like any other agent: </p>
            <pre><code>import { Agent } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext
} from "@cloudflare/voice";
import { streamText, tool } from "ai";
import { createWorkersAI } from "workers-ai-provider";
import { z } from "zod";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent&lt;Env&gt; {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async speakReminder(payload: { message: string }) {
    await this.speakAll(`Reminder: ${payload.message}`);
  }

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const ai = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: ai("@cf/cloudflare/gpt-oss-20b"),
      messages: [
        ...context.messages.map((m) =&gt; ({
          role: m.role as "user" | "assistant",
          content: m.content
        })),
        { role: "user" as const, content: transcript }
      ],
      tools: {
        set_reminder: tool({
          description: "Set a spoken reminder after a delay",
          inputSchema: z.object({
            message: z.string(),
            delay_seconds: z.number()
          }),
          execute: async ({ message, delay_seconds }) =&gt; {
            await this.schedule(delay_seconds, "speakReminder", { message });
            return { confirmed: true };
          }
        })
      },
      abortSignal: context.signal
    });

    return result.textStream;
  }
}
</code></pre>
            
    <div>
      <h3>Runtime model switching</h3>
      <a href="#runtime-model-switching">
        
      </a>
    </div>
    <p>The voice pipeline also lets you choose a transcription model dynamically per connection. </p><p>For example, you might prefer Flux for conversational turn-taking and Nova 3 for higher-accuracy dictation. You can switch at runtime by overriding <code>createTranscriber()</code>: </p>
            <pre><code>import { Agent, type Connection } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAINova3STT,
  WorkersAITTS,
  type Transcriber
} from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent&lt;Env&gt; {
  tts = new WorkersAITTS(this.env.AI);

  createTranscriber(connection: Connection): Transcriber {
    const url = new URL(connection.url ?? "http://localhost");
    const model = url.searchParams.get("model");
    if (model === "nova-3") {
      return new WorkersAINova3STT(this.env.AI);
    }
    return new WorkersAIFluxSTT(this.env.AI);
  }
}
</code></pre>
            <p>On the client, you can pass query parameters through the hook: </p>
            <pre><code>const voiceAgent = useVoiceAgent({
  agent: "my-voice-agent",
  query: { model: "nova-3" }
});
</code></pre>
            
    <div>
      <h2>Pipeline hooks</h2>
      <a href="#pipeline-hooks">
        
      </a>
    </div>
    <p>You can also intercept data between stages: </p><ul><li><p><code>afterTranscribe(transcript, connection)</code></p></li><li><p><code>beforeSynthesize(text, connection)</code></p></li><li><p><code>afterSynthesize(audio, text, connection)</code></p></li></ul><p>These hooks are useful for content filtering, text normalization, language-specific transformations, or custom logging. </p>
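    <p>As an example of the kind of text normalization a hook might perform, the sketch below strips Markdown and URLs from model output so it reads naturally when spoken. <code>normalizeForSpeech</code> is a hypothetical helper, not part of the SDK, and the specific rules are illustrative assumptions.</p>

```typescript
// Clean up LLM output before it is sent to text-to-speech.
// Hypothetical helper; the rules below are illustrative, not exhaustive.
function normalizeForSpeech(text: string): string {
  return text
    // Markdown links: keep the label, drop the target.
    .replace(/\[([^\]]+)\]\([^)]+\)/g, "$1")
    // Bare URLs are unpleasant to listen to; replace them with a word.
    .replace(/https?:\/\/\S+/g, "link")
    // Remove Markdown emphasis, heading, and inline-code markers.
    .replace(/[*_#`]+/g, "")
    // Collapse runs of whitespace left behind by the steps above.
    .replace(/\s+/g, " ")
    .trim();
}
```

    <p>In an agent, you might call a helper like this from <code>beforeSynthesize(text, connection)</code>. Whether the hook expects the transformed text as its return value is an assumption here, so check the API reference before relying on it.</p>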
    <div>
      <h2>Telephone and transport options</h2>
      <a href="#telephone-and-transport-options">
        
      </a>
    </div>
    <p>By default, the voice pipeline uses a single WebSocket connection as the simplest path for 1:1 voice agents. But that’s not the only option. </p>
    <div>
      <h3>Phone calls via Twilio</h3>
      <a href="#phone-calls-via-twilio">
        
      </a>
    </div>
    <p>You can connect phone calls to the same agent using the Twilio adapter:</p>
            <pre><code>import { routeAgentRequest } from "agents";
import { TwilioAdapter } from "@cloudflare/voice-twilio";

export default {
  async fetch(request: Request, env: Env) {
    if (new URL(request.url).pathname === "/twilio") {
      return TwilioAdapter.handleRequest(request, env, "MyAgent");
    }

    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  }
};
</code></pre>
            <p>This lets the same agent handle web voice, text input, and phone calls. </p><p>One caveat: the default Workers AI TTS provider returns MP3, while Twilio expects 8 kHz mulaw audio. For production telephony, you may want to use a TTS provider that outputs PCM or mulaw directly. </p>
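    <p>If your TTS provider outputs PCM, converting it to mulaw is a small amount of code. The sketch below uses the standard G.711 mu-law companding algorithm and assumes 16-bit signed PCM input at 16 kHz; the function names are illustrative, and the 2:1 downsampler simply averages sample pairs rather than applying a proper low-pass filter.</p>

```typescript
const MULAW_BIAS = 0x84; // 132, added before locating the segment
const MULAW_CLIP = 32635; // largest magnitude that still fits after the bias

// Encode one 16-bit signed PCM sample as an 8-bit G.711 mu-law byte.
function linearToMulaw(sample: number): number {
  const sign = (sample >> 8) & 0x80; // 0x80 for negative samples
  let magnitude = sign !== 0 ? -sample : sample;
  if (magnitude > MULAW_CLIP) magnitude = MULAW_CLIP;
  magnitude += MULAW_BIAS;

  // The exponent is the segment number: highest set bit, from 7 down to 0.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  // mu-law bytes are stored bit-inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Crude 16 kHz to 8 kHz conversion: average each pair of samples.
function downsampleByTwo(pcm: Int16Array): Int16Array {
  const out = new Int16Array(pcm.length >> 1);
  for (let i = 0; i < out.length; i++) {
    out[i] = (pcm[2 * i] + pcm[2 * i + 1]) >> 1;
  }
  return out;
}

// Encode a whole buffer of PCM samples to mu-law bytes.
function encodeMulaw(pcm: Int16Array): Uint8Array {
  const out = new Uint8Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = linearToMulaw(pcm[i]);
  }
  return out;
}
```

    <p>For 16 kHz input, <code>encodeMulaw(downsampleByTwo(pcm))</code> produces 8 kHz mulaw bytes. MP3 output would first have to be decoded to PCM, which is why a provider that emits PCM or mulaw directly is the simpler path.</p>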
    <div>
      <h3>WebRTC</h3>
      <a href="#webrtc">
        
      </a>
    </div>
    <p>If you need a transport that is better suited to difficult network conditions or will include multiple participants, the voice package also includes SFU utilities and supports custom transports. The default model is WebSocket-native today, but we plan to develop more adapters to connect to our <a href="https://developers.cloudflare.com/realtime/sfu/"><u>global SFU infrastructure</u></a>. </p>
    <div>
      <h2>Build with us </h2>
      <a href="#build-with-us">
        
      </a>
    </div>
    <p>The voice pipeline is provider-agnostic by design. </p><p>Under the hood, each stage is defined by a small interface: a transcriber opens a continuous session and accepts audio frames as they arrive, while a TTS provider takes text and returns audio. If a provider can stream audio output, the pipeline can use that too.</p>
            <pre><code>interface Transcriber {
  createSession(options?: TranscriberSessionOptions): TranscriberSession;
}

interface TranscriberSession {
  feed(chunk: ArrayBuffer): void;
  close(): void;
}

interface TTSProvider {
  synthesize(text: string, signal?: AbortSignal): Promise&lt;ArrayBuffer | null&gt;;
}
</code></pre>
            <p>We didn’t want voice support in Agents SDK to only work with one fixed combination of models and transports. We wanted the default path to be simple, while still making it easy to plug in other providers as the ecosystem grows. </p><p>The built-in providers use Workers AI, so you can get started without external API keys:</p><ul><li><p><code>WorkersAIFluxSTT</code> for conversational streaming STT</p></li><li><p><code>WorkersAINova3STT</code> for dictation-style streaming STT</p></li><li><p><code>WorkersAITTS</code> for text-to-speech</p></li></ul><p>But the bigger goal is interoperability. If you maintain a speech or voice service, these interfaces are small enough to implement without needing to understand the rest of the SDK internals. If your STT provider accepts streaming audio and can detect utterance boundaries, it can satisfy the transcriber interface. If your TTS provider can stream audio output, even better. </p><p>We would love to work on interoperability with:</p><ul><li><p>STT providers like AssemblyAI, Rev.ai, Speechmatics, or any service with a real-time transcription API</p></li><li><p>TTS providers like PlayHT, LMNT, Cartesia, Coqui, Amazon Polly, or Google Cloud TTS</p></li><li><p>telephony adapters for platforms like Vonage, Telnyx, or Bandwidth</p></li><li><p>transport implementations for WebRTC data channels, SFU bridges, and other audio transport layers</p></li></ul><p>We are also interested in collaborations that go beyond individual providers:</p><ul><li><p>latency benchmarking across STT + LLM + TTS combinations</p></li><li><p>multilingual support and better documentation for non-English voice agents</p></li><li><p>accessibility work, especially around multimodal interfaces and speech impairments</p></li></ul><p>If you are building voice infrastructure and want to see a first-class integration, <a href="https://github.com/cloudflare/agents/pulls"><u>open a PR</u></a> or reach out.</p>
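    <p>To give a feel for how small the provider surface is, here is a sketch of a custom provider built only on the <code>TTSProvider</code> interface shown above. <code>FallbackTTS</code> is a hypothetical class, not something shipped in the package, and treating a <code>null</code> result as "no audio produced" is an assumption about the interface's semantics.</p>

```typescript
// Interface as shown in the article.
interface TTSProvider {
  synthesize(text: string, signal?: AbortSignal): Promise<ArrayBuffer | null>;
}

// Hypothetical wrapper: try each provider in order until one returns audio.
class FallbackTTS implements TTSProvider {
  private readonly providers: TTSProvider[];

  constructor(providers: TTSProvider[]) {
    this.providers = providers;
  }

  async synthesize(
    text: string,
    signal?: AbortSignal
  ): Promise<ArrayBuffer | null> {
    for (const provider of this.providers) {
      // Stop early if the turn was interrupted.
      if (signal?.aborted) return null;
      try {
        const audio = await provider.synthesize(text, signal);
        // Assumption: null means this provider produced no audio.
        if (audio !== null) return audio;
      } catch {
        // Ignore the failure and move on to the next provider.
      }
    }
    return null;
  }
}
```

    <p>Because the wrapper itself satisfies <code>TTSProvider</code>, it could in principle be assigned to an agent's <code>tts</code> field like any built-in provider.</p>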
    <div>
      <h2>Try it now</h2>
      <a href="#try-it-now">
        
      </a>
    </div>
    <p>The voice pipeline is available today as an experimental package:</p>
            <pre><code>npm create cloudflare@latest -- --template cloudflare/agents-starter
</code></pre>
            <p>Add <code>@cloudflare/voice</code>, give your agent a transcriber and a TTS provider, deploy it, and start talking to it. You can also read the <a href="https://developers.cloudflare.com/agents/api-reference/voice/"><u>API reference</u></a>. </p><p>If you build something interesting, open an issue or PR on <a href="https://github.com/cloudflare/agents"><u>github.com/cloudflare/agents</u></a>. Voice should not require a separate stack, and we think the best voice agents will be the ones built on the same durable application model as everything else.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1SLeHwqXY0ehOUsy5Nzzq2/c16244f45b8411f8817b9a21b45ed4b8/BLOG-3198_3.png" />
          </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Durable Objects]]></category>
            <guid isPermaLink="false">1t4PF8FLpPMGpgBd6UjBBv</guid>
            <dc:creator>Sunil Pai</dc:creator>
            <dc:creator>Korinne Alpers</dc:creator>
        </item>
        <item>
            <title><![CDATA[Securing non-human identities: automated revocation, OAuth, and scoped permissions]]></title>
            <link>https://blog.cloudflare.com/improved-developer-security/</link>
            <pubDate>Tue, 14 Apr 2026 13:00:10 GMT</pubDate>
            <description><![CDATA[ Cloudflare is introducing scannable API tokens, enhanced OAuth visibility, and GA for resource-scoped permissions. These tools help developers implement a true least-privilege architecture while protecting against credential leakage.
 ]]></description>
            <content:encoded><![CDATA[ <p>Agents let you build software faster than ever, but securing your environment and the code you write, from both mistakes and malice, takes real effort. The <a href="https://www.cloudflare.com/learning/security/threats/owasp-top-10/"><u>Open Web Application Security Project</u></a> (OWASP) details a number of <a href="https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/"><u>risks</u></a> present in agentic AI systems, including the risk of credential leaks, user impersonation, and elevation of privilege. These risks can lead to denial of service, data loss, or data leaks, any of which can do untold financial and reputational damage. </p><p>This is an identity problem. In modern development, "identities" aren't just people; they are the agents, scripts, and third-party tools that act on your behalf. To secure these non-human identities, you need to manage their entire lifecycle: ensuring their credentials (tokens) aren't leaked, seeing which applications have access via OAuth, and narrowing their permissions using granular RBAC.</p><p>Today, we are introducing updates to address these needs: scannable tokens to protect your credentials, OAuth visibility to manage your principals, and resource-scoped RBAC to fine-tune your policies.</p>
    <div>
      <h2>Understanding identity: Principals, Credentials, and Policies</h2>
      <a href="#understanding-identity-principals-credentials-and-policies">
        
      </a>
    </div>
    <p>To secure the Internet in an era of <a href="https://www.cloudflare.com/learning/ai/what-is-agentic-ai/"><u>autonomous agents</u></a>, we have to rethink how we handle identity. Whether a request comes from a human developer or an AI agent, every interaction with an API relies on three core pillars:</p><ul><li><p><b>The Principal (The Traveler):</b> This is the identity itself — the "who." It might be you logging in via OAuth, or a background agent using an API token to deploy code.</p></li><li><p><b>The Credential (The Passport):</b> This is the proof of that identity. In this world, your API token is your passport. If it’s stolen or leaked, anyone can "wear" your identity.</p></li><li><p><b>The Policy (The Visa):</b> This defines what that identity is allowed to do. Just because you have a valid passport doesn't mean you have a visa to enter every country. A policy ensures that even a verified identity can only access the specific resources it needs.</p></li></ul><p>When these three pillars aren't managed together, security breaks down. You might have a valid Principal using a stolen Credential, or a legitimate identity with a Policy that is far too broad. </p>
    <div>
      <h2>Leaked token detection</h2>
      <a href="#leaked-token-detection">
        
      </a>
    </div>
    <p>Agents and other third-party applications use API tokens to access the Cloudflare API. One of the simplest ways people leak their secrets is by accidentally pushing them to a public GitHub repository. <a href="https://www.gitguardian.com/files/the-state-of-secrets-sprawl-report-2026"><u>GitGuardian</u></a> reports that last year more than 28 million secrets were published to public GitHub repositories, and that AI is causing leaks to happen 5x faster than before.</p><p>If an API token is a digital passport, then leaking it on a public repository is like leaving your passport on a park bench. Anyone who finds it can impersonate that identity until the document is canceled. Our partnership with GitHub acts like a global "lost and found" for these credentials. By the time you realize your passport is missing, we’ve already identified the document, verified its authenticity via the checksum, and voided it to prevent misuse.</p><p>We’re partnering with several leading credential scanning tools to help proactively find your leaked tokens and revoke them before they can be used maliciously. We know it’s a matter of when, not if, you, an employee, or one of your agents makes a mistake and pushes a secret somewhere it shouldn’t be. </p>
    <div>
      <h4>GitHub</h4>
      <a href="#github">
        
      </a>
    </div>
    <p>We’ve partnered with GitHub and are participating in their Secret Scanning program to find your tokens in both public and private repositories. If we are notified that a token has leaked to a public repository, we will automatically revoke the token to prevent it from being used maliciously. For private repositories, GitHub will notify you about any leaked Cloudflare tokens and you can clean these up.</p>
    <div>
      <h5>How it works</h5>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>We’ve shared the new token formats (below!) with GitHub, and they now scan for them on every commit. If they find something that looks like a leaked Cloudflare token, they verify the token is real (using the checksum), send us a webhook to revoke it, and then we notify you via email so you can generate a new one in Dashboard settings.</p><p>This means we plug the hole as soon as it’s found. By the time you realize you made a mistake, we've already fixed it. </p><p>We hope this is the kind of feature you don’t need to use, but our partners are on the lookout for leaks to help keep you secure. </p>
    <div>
      <h4>Cloudflare One</h4>
      <a href="#cloudflare-one">
        
      </a>
    </div>
    <p>Cloudflare One customers are also protected from these leaks. By configuring the <a href="https://developers.cloudflare.com/cloudflare-one/data-loss-prevention/dlp-profiles/predefined-profiles/#credentials-and-secrets"><u>Credentials and Secrets</u></a> DLP profile, organizations can activate prevention everywhere a credential can travel:</p><ul><li><p><b>Network Traffic (</b><a href="https://www.cloudflare.com/sase/products/gateway/"><b><u>Cloudflare Gateway</u></b></a><b>):</b> Apply these entries to a policy to detect and block Cloudflare API tokens moving across your network. A token in a file upload, an outbound request, or a download is stopped before it reaches its destination.</p></li><li><p><b>Outbound Email (</b><a href="https://www.cloudflare.com/sase/products/email-security/"><b><u>Cloudflare Email Security</u></b></a><b>):</b> Microsoft 365 customers can extend this same prevention to Outlook. The <a href="https://developers.cloudflare.com/cloudflare-one/email-security/outbound-dlp/"><u>DLP Assist</u></a> add-in scans messages before delivery, catching a token before it’s sent externally.</p></li><li><p><b>Data at Rest (</b><a href="https://www.cloudflare.com/sase/products/casb/"><b><u>Cloudflare CASB</u></b></a><b>):</b> Cloudflare’s Cloud Access Security Broker applies the same profile to scan files across connected SaaS applications, catching tokens saved or shared in Google Drive, OneDrive, Dropbox, and other integrated services.</p></li></ul><p>The most novel exposure vector, though, is AI traffic. <a href="https://www.cloudflare.com/developer-platform/products/ai-gateway/"><u>Cloudflare AI Gateway</u></a> integrates with the same DLP profiles to scan and block both incoming prompts and outgoing AI model responses in real time.</p>
    <div>
      <h4>Other credential scanners</h4>
      <a href="#other-credential-scanners">
        
      </a>
    </div>
    <p>The only way credential scanning works is if we meet you where you are, so we are working with several open source and commercial credential scanners to ensure you are protected no matter what secret scanner you use. </p>
    <div>
      <h3>How it works</h3>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>Until now, Cloudflare’s API tokens were pretty generic looking, so they were hard for credential scanners to identify with high confidence. These automated security tools scan your code repositories looking for exposed credentials like API keys, tokens or passwords. The “cf” prefix makes Cloudflare tokens instantly recognizable with greater confidence, and the checksum makes it easy for tools to statically validate them. Your existing tokens will continue to work, but every new token you generate will use the scannable format so it’s easily detected with high confidence.</p><table><tr><td><p><b>Credential Type</b></p></td><td><p><b>What it's for</b></p></td><td><p><b>New Format</b></p></td></tr><tr><td><p>User API Key</p></td><td><p>Legacy global API key tied to your user account (full access)</p></td><td><p><b>cfk_[40 characters][checksum]</b></p></td></tr><tr><td><p>User API Token</p></td><td><p>Scoped token you create for specific permissions</p></td><td><p><b>cfut_[40 characters][checksum]</b></p></td></tr><tr><td><p>Account API Token</p></td><td><p>Token owned by the account (not a specific user)</p></td><td><p><b>cfat_[40 characters][checksum]</b></p></td></tr></table>
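    <p>To illustrate why a fixed prefix helps scanners, here is a minimal sketch of a detector for the formats in the table above. The prefixes come from the table; everything else is an assumption: the character set of the token body, the fact that the checksum is matched only as trailing characters, and the names <code>findCandidateTokens</code> and <code>TOKEN_KINDS</code>. The checksum algorithm is not described here, so this sketch does not validate it.</p>

```typescript
// Prefixes taken from the table above.
const TOKEN_KINDS = {
  cfk_: "User API Key",
  cfut_: "User API Token",
  cfat_: "Account API Token",
} as const;

interface CandidateToken {
  kind: string;
  token: string;
}

// Find strings that look like new-format Cloudflare credentials.
// Assumption: body and checksum use the characters A-Z, a-z, 0-9, "_" and "-".
function findCandidateTokens(text: string): CandidateToken[] {
  // Prefix, then at least 40 body characters (the checksum length is unknown).
  const pattern = /\b(cfk_|cfut_|cfat_)[A-Za-z0-9_-]{40,}/g;
  const found: CandidateToken[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const prefix = match[1] as keyof typeof TOKEN_KINDS;
    found.push({ kind: TOKEN_KINDS[prefix], token: match[0] });
  }
  return found;
}
```

    <p>A real scanner would go one step further and verify the checksum before reporting a match, which is what keeps false positives low.</p>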
    <div>
      <h4>Getting started</h4>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>If you have existing API tokens, you can roll the token to create a new, scannable API token. This is optional, but recommended to ensure that your tokens are easily discoverable in case they leak. </p><p>While API tokens are generally used by your own scripts and agents, OAuth is how you manage access for third-party platforms. Both require clear visibility to prevent unauthorized access and ensure you know exactly who — or what — has access to your data.</p>
    <div>
      <h2>Improving the OAuth consent experience</h2>
      <a href="#improving-the-oauth-consent-experience">
        
      </a>
    </div>
    <p>When you connect third-party applications like Wrangler to your Cloudflare Account using OAuth, you're granting that application access to your account’s data. Over time, you may forget why you granted a third-party application access to your Account in the first place. Previously, there was no central place to view &amp; manage those applications. Starting today, there is. </p><p>Going forward, when a third-party application requests access to your Cloudflare account, you’ll be able to review: </p><ul><li><p><b>Which third-party application</b> is requesting access, along with information about the application like Name, Logo, and the Publisher.</p></li><li><p><b>Which scopes</b> the third-party application is requesting access to.</p></li><li><p><b>Which accounts</b> to grant the third-party application access to.</p></li></ul>
<div><table><thead>
  <tr>
    <th><span>Before</span></th>
    <th><span>After</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/2p3RZFDklLn9cfQOVYq5vS/d40e6116c115c453095f8ed2d110f062/image3.png" /></td>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/33yGBWfD468P6T0hAnvbPX/9241ef24b6381eedaf2830b782a69f2e/image4.png" /><br /><br />
    
    <img src="https://images.ctfassets.net/zkvhlag99gkb/22pDfCAFbPLNhnAPXLB36w/0671d7e892c5a93040ab17a62eda4a3c/image1.png" /></td>
  </tr>
</tbody></table></div><p>Not all applications require the same permissions; some only need to read data, others may need to make changes to your Account. Understanding these scopes before you grant access helps you maintain least-privilege. </p><p>We also added a <a href="https://dash.cloudflare.com/profile/access-management/authorization"><u>Connected Applications</u></a> experience so you can see which applications have access to which accounts, what scopes/permissions are associated with that application, and easily revoke that access as needed. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Aiu82urkjaL9SWpZUBgNi/827cf38aa655d4094de1895d07f51137/BLOG-3216_5.png" />
          </figure>
    <div>
      <h4>Getting started</h4>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>The OAuth consent and revocation improvements are available now. Check which apps currently have access to your accounts by visiting My Profile &gt; Access Management &gt; Connected Applications. </p><p>For developers building integrations with Cloudflare, keep an eye on the <a href="https://developers.cloudflare.com/changelog"><u>Cloudflare Changelog</u></a> for more announcements around how you can register your own OAuth apps soon! </p>
    <div>
      <h2>Fine-grained resource-level permissioning </h2>
      <a href="#fine-grained-resource-level-permissioning">
        
      </a>
    </div>
    <p>If the token is the passport, then resource-scoped permissions are the visas inside it. Having a valid passport gets you through the front door, but it shouldn't give you access to every room in the building. By narrowing the scope to specific resources, such as a single Load Balancer pool or a specific Gateway policy, you ensure that even a verified identity only has the "visa" to go where it’s strictly necessary.</p><p>Last year, we <a href="https://developers.cloudflare.com/changelog/post/2025-10-01-fine-grained-permissioning-beta/"><u>announced</u></a> support for resource-scoped permissions in Cloudflare’s <a href="https://www.cloudflare.com/learning/access-management/role-based-access-control-rbac/"><u>role-based access control (RBAC)</u></a> system for several of our Zero Trust products. This enables you to right-size permissions for both users and agents to minimize security risks. We’ve expanded this capability to several new resource-level permissions. The resource scope is now supported for:</p><ul><li><p>Access Applications</p></li><li><p>Access Identity Providers</p></li><li><p>Access Policies</p></li><li><p>Access Service Tokens</p></li><li><p>Access Targets</p></li></ul><p>We’ve also completely overhauled the API Token creation experience, making it easier for customers to provision and manage Account API Tokens right from the Cloudflare Dashboard. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2XSjOtE46g8iVNN7QNzoDI/2f2280b106da9ebc9f3959d5300a4241/account_owned_token_gif.gif" />
          </figure>
    <div>
      <h4>How it works</h4>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>When you add a member to your Cloudflare account or create an API Token, you typically assign that principal a policy. A Permission Policy is what gives a principal permission to take an action, whether that’s managing Cloudflare One Access Applications or DNS Records. Without a policy, a principal can authenticate, but it is not authorized to take any actions within an account.</p><p>Policies are made up of three components: a Principal, a Role, and a Scope. The Principal is who or what you're granting access to, whether that's a human user, a Non-Human Identity (NHI) like an API Token, or increasingly, an Agent acting on behalf of a user. The Role defines what actions they're permitted to take. The Scope determines where those permissions apply, and historically, that's been restricted to the entire account or individual zones.</p>
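    <p>The Principal, Role, and Scope model can be sketched as a small data structure and a default-deny check. This is a conceptual illustration only: the type names, role names, action strings, and matching rules below are hypothetical and do not reflect Cloudflare's API or its actual evaluation logic.</p>

```typescript
// Where a policy applies: the whole account, one zone, or one resource.
type Scope =
  | { kind: "account"; accountId: string }
  | { kind: "zone"; zoneId: string }
  | { kind: "resource"; resourceId: string };

interface Policy {
  principal: string; // user, API token, or agent identifier (hypothetical)
  role: string; // name of a set of permitted actions
  scope: Scope;
}

interface Target {
  accountId: string;
  zoneId?: string;
  resourceId?: string;
}

interface AccessRequest {
  principal: string;
  action: string;
  target: Target;
}

// Does the policy's scope cover the thing being acted on?
function scopeCovers(scope: Scope, target: Target): boolean {
  switch (scope.kind) {
    case "account":
      return scope.accountId === target.accountId;
    case "zone":
      return scope.zoneId === target.zoneId;
    case "resource":
      return scope.resourceId === target.resourceId;
  }
}

// Default deny: a request is allowed only if some policy matches the
// principal, grants the action through its role, and covers the target.
function isAllowed(
  policies: Policy[],
  roles: Record<string, string[]>,
  request: AccessRequest
): boolean {
  return policies.some(
    (policy) =>
      policy.principal === request.principal &&
      (roles[policy.role] ?? []).includes(request.action) &&
      scopeCovers(policy.scope, request.target)
  );
}
```

    <p>With a resource-scoped policy, the same principal and role allow an action on one Access Application and deny it on every other, which is the narrowing that account and zone scopes could not express.</p>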
    <div>
      <h2>New permission roles</h2>
      <a href="#new-permission-roles">
        
      </a>
    </div>
    <p>We’re also expanding the role surface more broadly at both the Account &amp; Zone level with the introduction of a number of new roles for many products.   </p><ul><li><p>Account scope</p><ul><li><p>CDN Management</p></li><li><p>MCP Portals</p></li><li><p>Radar</p></li><li><p>Request Tracer</p></li><li><p>SSL/TLS Management</p></li></ul></li><li><p>Zone scope</p><ul><li><p>Analytics</p></li><li><p>Logpush</p></li><li><p>Page Rules</p></li><li><p>Security Center</p></li><li><p>Snippets</p></li><li><p>Zone Settings</p></li></ul></li></ul>
    <div>
      <h4>Getting started</h4>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>The resource scope and all new account and zone-level roles are available today for all Cloudflare customers. You can assign account, zone, or resource-scoped policies through the Cloudflare Dashboard, the API, or Terraform. </p><p>For a full breakdown of all available roles and how scopes work, visit our <a href="https://developers.cloudflare.com/fundamentals/manage-members/roles/"><u>roles</u></a> and <a href="https://developers.cloudflare.com/fundamentals/manage-members/scope/"><u>scope documentation</u></a>.</p>
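    <p>As a rough illustration of assigning a scoped policy through the API, the sketch below builds a request body for a token limited to a single zone. It makes no network call. The field names and the resource key format are assumptions based on the general shape of API token policies, and <code>ZONE_ID</code> and <code>PERMISSION_GROUP_ID</code> are placeholders. Check the API reference for the exact schema, including the key format for resource-scoped policies.</p>

```python
import json


def zone_scoped_token_body(name, zone_id, permission_group_id):
    """Build a request body for a token limited to one zone (assumed schema)."""
    return {
        "name": name,
        "policies": [
            {
                "effect": "allow",
                # Assumed resource key format for a single zone.
                "resources": {f"com.cloudflare.api.account.zone.{zone_id}": "*"},
                # Placeholder id; real permission group ids come from the API.
                "permission_groups": [{"id": permission_group_id}],
            }
        ],
    }


body = zone_scoped_token_body("logpush-agent", "ZONE_ID", "PERMISSION_GROUP_ID")
print(json.dumps(body, indent=2))
```

    <p>Keeping each token to a single policy with the narrowest scope that works makes it easier to review later what that token can actually touch.</p>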
    <div>
      <h2>Secure your accounts</h2>
      <a href="#secure-your-accounts">
        
      </a>
    </div>
    <p>These updates provide the granular building blocks needed for a true least-privilege architecture. By refining how we manage permissions and credentials, developers and enterprises can have greater confidence in their security posture across the users, apps, agents, and scripts that access Cloudflare. Least privilege isn’t a new concept, and for enterprises, it’s never been optional. Whether a human administrator is managing a zone or an agent is programmatically deploying a Worker, the expectation is the same: each should be authorized to do only the job it was given, and nothing else.</p><p>Following today’s announcement, we recommend that customers:</p><ol><li><p>Review your <a href="https://dash.cloudflare.com/profile/api-tokens"><u>API tokens</u></a>, and reissue them as new, scannable API tokens as soon as possible.</p></li><li><p><a href="https://dash.cloudflare.com/profile/access-management/authorization"><u>Review your authorized OAuth apps</u></a>, and revoke any that you are no longer using.</p></li><li><p>Review <a href="https://dash.cloudflare.com/?to=/:account/billing"><u>member</u></a> and <a href="https://dash.cloudflare.com/profile/api-tokens"><u>API Token</u></a> permissions in your accounts, and make sure they use the new account, zone, or resource-scoped permissions where appropriate to reduce your risk exposure.</p></li></ol><p></p> ]]></content:encoded>
            <category><![CDATA[Agents Week]]></category>
            <category><![CDATA[Agents]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">4cMjGGRR98LV3HgGdwgWrf</guid>
            <dc:creator>Justin Hutchings</dc:creator>
            <dc:creator>Adam Bouhmad</dc:creator>
            <dc:creator>Rebecca Varley</dc:creator>
        </item>
    </channel>
</rss>