The Cloudflare Blog

Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse

James Morrison — Thu, 14 May 2026 13:00:00 GMT

At Cloudflare, we are heavy users of ClickHouse, an open source online analytical processing (OLAP) database. Every day, we make millions of calls to ClickHouse to determine how much users should be billed for their usage of Cloudflare products. If we don't finish those jobs in a timely fashion, the invoices become very difficult to reconcile.

This pipeline powers hundreds of millions of dollars in usage revenue, fraud systems, and more, so being delayed has major downstream implications.

Which is why it was a big problem when the daily aggregation jobs in ClickHouse – responsible for ensuring Cloudflare’s bills go out – had slowed way down, following a migration. All the usual suspects looked clean: I/O, memory, rows scanned, parts read. Everything we would normally check when a ClickHouse query is slow appeared to be normal.

This is the story of how we discovered a hidden bottleneck buried deep within ClickHouse’s internals, and the three patches we wrote to fix it.

The setup: a petabyte-scale analytics platform

We use ClickHouse to store over a hundred petabytes of data across a few dozen clusters. To simplify onboarding for our many internal teams, we built a system called "Ready-Analytics" in early 2022.

The premise is simple: instead of designing new tables, teams can stream data into a single, massive table. Datasets are disambiguated by a namespace, and each record uses a standard schema (e.g., 20 float fields, 20 string fields, a timestamp, and an indexID).

In ClickHouse, the way data is sorted is crucial to query performance. This is where the indexID comes into play. It’s a string field, which forms part of the primary key, meaning that every individual namespace can have its data sorted in a way that is optimal for the queries the owners of that namespace expect to be running. Altogether, we end up with a primary key that looks like this: (namespace, indexID, timestamp).

This system is popular, with hundreds of applications using it. It had already grown to more than 2PiB of data by December 2024, and an ingestion rate of millions of rows per second. But it had one critical flaw: its retention policy.

The problem: one retention policy to rule them all

Cloudflare has been using ClickHouse for many years, since before it had native Time-to-Live (TTL) features. Consequently, we built our own retention system based on partitioning. The Ready-Analytics table was partitioned by day, and our retention job simply dropped partitions older than 31 days.

This "one-size-fits-all" 31-day retention was a major limitation. Some teams needed to store data for years due to legal or contractual obligations, while others needed only a few days. This restriction meant these use cases couldn't use Ready-Analytics and had to opt for a conventional setup, which has a far more complex onboarding process.

We needed a new system that allowed per-namespace retention.

The solution: a new partitioning scheme

We considered two main approaches:

A Table-per-Namespace: This would naturally solve the retention problem but would require significant new automation to manage thousands of tables on demand.
A New Partitioning Key: We could change the partitioning key from just (day) to (namespace, day).

We chose the second option. This would allow our existing retention system to continue managing partitions, but now with per-namespace granularity.

We knew this would increase the total number of data parts in the table, but we made a key assumption: since every query is filtered by a specific namespace, the number of parts read by any single query shouldn't change. We believed this meant performance would be unaffected.

^{This shows how we changed the partitioning, allowing us to cheaply drop data for a single namespace}

This new system also allowed us to build a sophisticated storage management layer. Using the max-min fairness algorithm, we could set a target disk utilization (e.g., 90%) and automatically "share" available space. Namespaces using less than their fair share would cede their unused capacity to those that needed more. This allowed us to confidently run our clusters at 90% utilization.

We began the migration in January 2025. Using ClickHouse's Merge table feature, we combined the old and new tables, writing all new data to the new partitioned table while the old data aged out.

The mystery: when billing starts to break

Two months later, in late March 2025, our billing team reported that their daily aggregation jobs were slowing down. These jobs are time-critical; if they don't finish, bills don't go out. The jobs were getting progressively slower, and we were approaching a deadline.

We investigated, but none of the usual suspects were to blame. I/O was fine. Memory was fine. The metrics for individual queries showed they were not reading more data or more parts than before. Our initial assumption seemed correct, yet the system was grinding to a halt.

It took several days before we even had a theory. Finally, we made a plot of query duration against the total part count in the cluster. The correlation was undeniable.

^{Average SELECT Query Durations on the Ready Analytics ClickHouse Cluster, showing progressive performance degradation.}

^{Linear Growth in Total Data Part Count per Table Replica, following the new (namespace, day) partitioning scheme.}

But why? If we weren't reading the extra parts, why did their mere existence slow us down?

The investigation: hunting bottlenecks with flame graphs

We turned to ClickHouse's built-in trace_log to generate flame graphs. This is a built-in table that records traces from the running ClickHouse server. It not only includes traces of what code is being executed, but it associates these with specific users, query IDs and other metadata, meaning you can filter down to quite precise sets of events if necessary. In our case, we wanted to look specifically at leaf SELECT queries. This was easy thanks to the available metadata in this table.

The first CPU-based flame graph quickly confirmed our suspicion: a huge amount of time was being spent in query planning. This is the phase before execution when ClickHouse decides which parts to read.

^{Flame graph showing that 45% of leaf query CPU time is spent filtering a vector of parts based on the partition ID}

The flame graph was clear: 45% of the sampled CPU time was being spent in a single function called filterPartsByPartition.

Our first attempt at a fix was a small patch to this exact code path. The planner evaluates heuristics to prune parts, and we believed they weren't being evaluated in the optimal order for our table. Our patch changed the order, yielding a small 5% improvement. We were on the right path, but we'd missed the real problem.

We had been generating "CPU" traces, which only sample active threads. We switched to "Real" traces, which sample all threads, including those that are inactive or waiting. The new flame graph was a revelation.

^{Flame graph showing that more than half of leaf query duration is spent waiting for a mutex that protects the list of active parts}

The problem wasn't CPU-bound work; it was massive lock contention. More than half of our query duration was spent waiting to acquire a single mutex (MergeTreeData) that protects the table's list of parts. To plan a query, every single thread had to:

Acquire an exclusive lock on this mutex.
Make a complete copy of the list of all parts in the table.
Release the lock.
Filter that list down to the relevant parts.

With tens of thousands of parts and hundreds of concurrent queries, they were all just standing in a single-file line.

The fixes: a trio of patches

This insight helped us plan a series of optimizations to alleviate these hotspots. As with all the patches we make to ClickHouse, we try to make them generic, and eventually get them contributed to the upstream codebase. This makes it easier for us to maintain our fork, and means the community benefits from the changes we make too!

Optimization 1: use a shared lock

The query planner doesn't modify the parts list; it just reads it. It had no business using an exclusive lock.

The Fix: We modified the code to acquire a shared lock (std::shared_lock) instead. This allowed all query planners to enter the critical section concurrently.

The Result: A massive, immediate drop in query duration. The lock contention vanished.

^{Immediate Impact of the Shared Lock Optimization (Optimization 1) on Average SELECT Query Durations, demonstrating the resolution of lock contention.}

Optimization 2: stop copying the vector

Performance was significantly better, but still not back to baseline. We went back to the trace log and made another ‘Real’ flame graph.

^{Flame graph showing that we spend a quarter of leaf query duration copying the vector of all parts, and another quarter filtering through it (copying again).}

The new flame graph showed the bottleneck had simply moved. Now, time was being spent copying the giant vector of parts, even with the shared lock. Intuitively, copying a vector sounds cheap, but when it contains tens of thousands of elements, and you do it hundreds of times a second, it adds up.

The Fix: We deferred the copy entirely. We created a "shared copy" of the parts list. Read-only operations (like query planning) just read from this copy. Any operation that modifies the set of parts (like a new insert) regenerates the cache. Planners now only copy the filtered list of parts they actually need.

The Result: Another significant performance improvement.

^{Further Performance Improvement After Rolling Out the Vector Copy Optimization (Optimization 2).}

After seeing these massive savings internally, we decided to bring these changes to the community. After some small design iterations with the maintainers at ClickHouse Inc., we got the changes merged under PR #85535. They have been available since ClickHouse version 25.11.

Optimization 3: binary search for parts

We're still not done. As part counts grow, performance still degrades, just much more slowly. The correlation with part count was still there. Coming back to this after a few months, a new flame graph (looking the same as Figure 3) shows the time is spent in the filtering code path (the one we tried to fix first). This code performs a linear scan over all parts, evaluating predicates against each one. Over a few months, we were back to select durations from before the optimizations.

But we know this list of parts is sorted by the partitioning key. Remember that the first column of the partition key is namespace, which the vast majority of queries filter on, because it identifies the “tenant.” How can we make use of this?

The Fix: We implemented a binary search based on the namespace part of the partition ID. This works because the vector is sorted, so you can filter out a lot of the entries without actually looking at them. This is particularly effective since the namespace is the first part of that sorting key. After this first-pass of binary search, we have a much smaller range of parts we need to examine, and for those we still step through each one, applying the same logic as before to exclude parts based on other conditions.

The Result: After deploying this patch in March 2026, query durations dropped by 50% (see Figure 8). More importantly, this finally breaks correlation of query durations with the number of parts. Unfortunately, this solution doesn’t generalize that well for arbitrary query conditions (e.g. conditions such as namespace in (5,10)). We are looking into more generic approaches like extending the query condition cache to cover part filtering.

^{Sustained Latency Reduction Following the Implementation of Binary Search for Part Pruning (Optimization 3).}

An uneasy truce

These optimizations resolved the immediate crisis with the billing system. But this journey exposed the deep, non-obvious costs of our partitioning choice.

Other problems remain. In this blog post we’ve only described the problems increasing part counts had on our select durations, but it has also caused problems for ZooKeeper, which tracks metadata for all the parts in ClickHouse. Perhaps one day we’ll tell the story of the 100 gigabyte ZooKeeper cluster.

We've bought ourselves significant breathing room, but the fundamental question remains: Was this partitioning scheme the right long-term choice? Or will we eventually need to bite the bullet and move to a different architecture? For now, our patches are holding, but the experience was a clear example of how even a well-planned change can fall victim to incorrect assumptions.

When the billing team first reported this problem we had 30,000 parts per replica. The part rate never stopped growing, and a year later we hit 160k parts per replica, but query durations have been stable thanks to the optimizations we made here.

At Cloudflare, we solve complex engineering problems at a massive scale. If the debugging and optimizations we described here sound like the type of challenge you’re looking for, check out some of the open roles we are hiring for.

Browser Run: now running on Cloudflare Containers, it’s faster and more scalable

Ruskin Constant — Wed, 13 May 2026 13:00:00 GMT

We’ve enabled higher usage limits, faster performance, and better reliability for Browser Run by rebuilding on top of Cloudflare’s Containers.

You can now spin up 60 browsers per minute via the Workers binding and run up to 120 concurrently — 4x the previous limit. Also, Quick Action response times dropped more than 50%. You don't need to change anything: these improvements are live today. On top of that, we’re shipping fixes and new features faster than before. Read on to learn how we did it and see the data.

Remind me: what is Browser Run?

Browser Run enables developers to programmatically control and interact with headless browser instances running on Cloudflare’s global network. That’s useful for end-to-end testing of web applications, securely investigating suspicious URLs, and leveraging how browsers can easily render PDF documents, amongst other quick actions like capturing screenshots and extracting content. More recently, it’s become a critical enabler of AI agents to interact with the web. We’re building Browser Run to be the go-to platform to responsibly utilize automated browsers securely at massive scale.

Outgrowing our bunk bed

Before adopting Cloudflare Containers, we shared infrastructure with Browser Isolation (BISO). While technically similar, BISO’s larger container images slowed startup and development. Crucially, BISO browsers lacked optimal global distribution, compromising resiliency and latency. Additionally, typical BISO users’ long, steady sessions clashed with Browser Run’s short, spiky usage, creating scaling bottlenecks and availability delays.

Thankfully, after much internal development, Cloudflare released Durable Object (DO)-enabled Containers open beta last year, meaning we were ready for a tentative adoption that ultimately benefited both product platforms. Like most successful product platforms, we’re committed to building on our own platform wherever feasible so that we can feel and fix any pain points ahead of any external customers.

The migration: Containers

We started a gradual migration by inserting a Worker in our incoming request paths to provide some Container-powered browsers to a handful of users alongside those from BISO. This dual support during development was key: it allowed us to compare performance, isolate implementation bugs and ultimately gain confidence in the benefits of the Container-driven approach.

Ramping up adoption, we first used the Container browsers for all of our Quick Actions endpoints, then for connections via the Workers browser binding on free accounts, followed by pay-as-you-go accounts in order to validate stability before we rolled it out to all remaining contract customers, ensuring a transition that required no action or existing worker redeployments from our customers.

Challenges: performance and scale bottlenecks

On our end, though, we faced a fresh set of challenges getting familiar with a novel, unstable early-stage Containers platform interface that was light on documentation, light on observability, and light on colleagues in an overlapping timezone. However, our feedback to our own teams as Customer Zero meant that we could provide a tight feedback loop leading to substantial upgrades that benefit our external customers too. Nevertheless, there was a lot of friction to overcome initially, most of which were to be expected for a closed beta in active development. Other hurdles to overcome were intrinsic to the new technical environment.

For example, once our browsers could run globally, our architecture had to adapt. DO-enabled Containers create a Durable Object as close to the incoming request as possible, but the connected Container may spin up on the other side of the world. This works fine for one-shot messages like "start my app," but when you're establishing a WebSocket between them and exchanging dozens of messages for a screenshot request, those extra milliseconds crossing the globe start adding up.

Our solution? Create regional pools of pre-warmed DO-backed browser containers to constrain the max distance (and hence max latency) between DOs and containers. When a request comes in, we pick a DO-container pair closest to the user within that region. This keeps latency low on both hops: user to DO, and DO to container. It adds a few more moving parts to our overall architecture, but we figured that was worthwhile so long as we had observability into the global state of each browser so that we could allocate and re-allocate capacity according to changing demand. A perfect use case for Workers KV…to a point.

Demand for our headless browsers has been ramping up since the beginning of last year. In short, AI agent builders discovered Browser Run and quickly brought request volumes outpacing our existing capacity. We quickly hit the limits of how quickly we could adjust our pool capacity to serve this new demand with a scalable approach. KV’s eventual consistency of around 30 seconds was becoming a bottleneck on our critical request path. You might check KV, see a container as "available," but by the time you route to it (30 seconds later), it's already claimed. That lag creates race conditions and overallocation of browsers, severely limiting how fast we could scale to meet demand spikes.

Migrating from KV to D1 + Queues

We previously stored each container state in KV. This meant that we could keep getting a minute old state due to cache TTL (recently KV changed the minimum cache TTL to 30 seconds, but even so that value is still too high).

We decided to migrate the container state into D1 instances instead. D1's transactional nature is a good fit here. Once we assign a browser to a user, it's exclusively theirs. Browsers are not shared resources. SQLite transactions ensure atomic assignment and prevent race conditions where two requests might claim the same browser simultaneously.

Here’s a simplified version of our browser acquisition query:

WITH candidate_pool AS (
    -- candidate pool logic to pick based on latency and other rules
)
UPDATE containers
SET status = 'picked'
WHERE sessionId IN (
    SELECT sessionId
    FROM candidate_pool
    ORDER BY RANDOM()
    LIMIT ?5
)
RETURNING data

We keep D1 shards per location and given that we may have several thousand containers running, and that each container needs to update its state every 5 seconds, we kept running into a problem: we would overload the database. For instance, if each write takes 1ms we can only write at most 1,000 times, which at one row per write would mean that we could only have 5,000 containers before overloading the database.

However, if we batch those writes, we can get much higher values, because batch writes are not significantly longer than individual ones, so we can increase the throughput in orders of magnitude. In our case, we use 100 row batches, which means we can now update a maximum of 500,000 containers per location. This headroom means capacity planning is no longer a bottleneck.

Currently, our P95 for batch write is 0.1ms!

To batch writes, we use Queues: every 5 seconds, each container computes its own state and adds it to its location queue. We then configure a worker consumer with 100 batch size and 1 second batch timeout:

{
    ...
    "queues": {
        "consumers": [
            {
                "queue": "production-core-containers-queue-weur",
                "max_batch_size": 100,
                "max_batch_timeout": 1,
                "max_retries": 1,
            },
            ...
        ]
        ...
    }
}

With this configuration, we achieve acceptable lag times well below 2 seconds. That said, queue backlogs can still cause stale state. When this happens, each region falls back to a designated backup region until the primary queue catches up.

Additional perks for quick actions

With dedicated infrastructure, we could now make upgrades to the browser container image without unwanted side effects or bloat for other products like BISO. This opened the door to optimize quick actions like screenshots and content extraction. Previously, our workers established a WebSocket to the remote browser and sent instructions one at a time: open a page, navigate to the URL, wait for it to load and take the screenshot. Each step had to be completed before the next could begin.

However, now we send all parameters in a single HTTP request directly to the container, and the entire flow executes internally without any back-and-forth between the worker and browser.

Results: massive performance boost and increased limits

We’ve seen a sharp decrease in average quick-action response time, as users are able to get what they need from a browser session in less time: less time waiting for browsers to be ready and faster processing of their DevTools Protocol messages.

Overcoming our real-time state management at this new scale meant we could spend more time in the playground, discovering and cooking up new features such as our recently launched /crawl endpoint.

Better browser flexibility

We also benefitted from another important perk by leaving behind shared Browser Isolation containers: faster upgrades.

When our browsers ran on shared product infrastructure, upgrading Chrome meant coordinating across multiple teams and products, each with their own roadmap and priorities. However, now that we run our own container image, we can upgrade at a faster tempo. For example, WebGL, a much-requested feature, is now available for browser-based rendering along with WebMCP (Model Context Protocol for the web) which enables new agentic interaction patterns. Both are made possible because we can control the browser version and flags without unwanted side effects in other Cloudflare products.

In a nutshell, we’re just getting started with unleashing the power of browsers at scale, especially for agentic development. We hope you’re diving in too — check out our docs.

Get started

Browser Run is available on all Workers plans. Start with the quick start guide, explore the Quick Actions, or try the /crawl endpoint to deeply extract data from any webpage, following links across the site.

Building AI agents? Check out our Agents SDK with built-in Browser Run support.

When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug

Esteban Carisimo — Tue, 12 May 2026 13:00:00 GMT

CUBIC, standardized in RFC 9438, is the default congestion controller in Linux, and as a result governs how most TCP and QUIC connections on the public Internet probe for available bandwidth, back off when they detect loss, and recover afterward. At Cloudflare, our open-source implementation of QUIC, quiche, uses CUBIC as its default congestion controller, meaning this code is in the critical path for a significant share of the traffic we serve.

In this post, we’ll tell the story of a bug in which CUBIC's congestion window (cwnd) gets permanently pinned at its minimum and never recovers from a congestion collapse event.

The story starts with a Linux kernel change aimed at bringing CUBIC into line with the app-limited exclusion described in RFC 9438 §4.2-12 — a fix to a real problem in TCP that, when ported to our QUIC implementation, surfaced unexpected behaviors in quiche. It has a happy ending: an elegant (near-)one-line fix that broke the cycle.

CUBIC's logic in a nutshell

Before we dive into the core problem, a quick refresher on Congestion Control Algorithms (CCAs) may help to set the stage.

The central knob a CCA turns is the congestion window (cwnd): the sender-side cap on how many bytes can be in flight (sent but not yet acknowledged) at any moment. A larger cwnd lets the sender push more data per round trip; a smaller cwnd throttles it. Every loss-based CCA, CUBIC included, is ultimately a policy for how to grow cwnd when the network looks healthy and how to shrink it when it doesn't.

In essence, CCAs aim to maximize data transfer by inferring the "available bandwidth" of the network; because no one wants to pay for a 1 Gbps subscription and only use a fraction of it. The family of loss-based algorithms, to which CUBIC belongs, operate on a fundamental premise: (1) if there is no packet loss, increase the sending rate (i.e. increase the bandwidth utilization); (2) if there is loss, loss-based algorithms assume that the network's capacity has been exceeded, and the sender must back off (i.e. decrease the bandwidth utilization).

This logic is built on several assumptions that have been revisited over the years. However, we'll save that discussion for another time.

The symptom: a test that fails 61% of the time

Our investigation started with the report of unexpected failures in our ingress proxy integration test pipeline. This erratic behavior appeared in tests where CUBIC was evaluated in a scenario of heavy loss in the early part of the connection.

Recovery after congestion collapse is an uncommon regime, but it is exactly the regime a congestion controller exists to handle. Most congestion control tests exercise the steady-state and growth phases of an algorithm; far fewer probe what happens at minimum cwnd, after the connection has been beaten down. Bugs in this corner of the state space are invisible in throughput dashboards, undetectable by static review, and only surface when you deliberately drive a CCA into it and watch whether it can climb back out — which is exactly what this test did.

The simulated test setup includes the following details:

Quiche HTTP/3 client and server running at locally (localhost)
RTT = 10ms (set up in the configuration)
A 10 MB file download over HTTP/3
Using CUBIC congestion control
With 30% random packet loss injected during the first two seconds
After two seconds, loss stops entirely
The test has a generous 10-second timeout to complete the download, which is expected to be completed in four or five seconds

The expected behavior is straightforward: CUBIC should take some hits during the loss phase, reduce its congestion window, and once loss stops, steadily ramp up and finish the download well within the timeout. Instead, we observed in multiple 100-time runs that around 60% of our tests were not able to complete the download within the generous 10-second timeout.

The anomaly: 999 state transitions with zero loss

We instrumented quiche's qlog output with packet loss events and built visualizations to understand what was happening inside the congestion controller:

^{Connection overview of a failing test. After T=2s, packet loss stops entirely — yet cwnd remains pinned at the minimum floor and the congestion state oscillates between recovery and congestion avoidance every ~14ms.}

After the two-second (2000 ms) mark, packet loss stops entirely. However, the number of bytes in flight remains flat, which contradicts the core logic of the CUBIC algorithm: in the absence of loss, apply more gas to increase throttle (more bytes in our world). This raises the question: if the network is no longer dropping packets, why is the congestion window failing to grow?

When we zoom into that region, our analysis shows that CUBIC enters a rapid oscillation, shown in our plot as an extended recovery phase, between congestion avoidance state (the operational regime phase) and recovery state (the packet loss recovery state) — 999 transitions in approximately 6.7 seconds. That’s one transition every ~14ms — suspiciously close to the connection's RTT (10ms). Throughout this entire period, cwnd is locked at the minimum floor: 2700 bytes, or two full-size packets.

Clearly something in CUBIC's logic is misinterpreting the state of the connection. The key clue is the oscillation period: ~14ms matches the RTT. Whatever is triggering the recovery/avoidance flip is happening once per round trip, in lockstep with connection's ACK clock; the self-clocking rhythm in which each round-trip's ACKs from the client trigger the server's next send. Because this is a download (server to client), the ACKs in question travel client to server, and CUBIC's state machine runs on the server side: every time those ACKs land, bytes_in_flight drops to zero and the server sends the next two-packet burst, which is what triggers the bug.

To confirm this behavior was CUBIC-specific, we ran the same test with Reno, another member of the loss-based family but with a different growth rate. The results were conclusive: 100% pass rate, showing Reno recovered cleanly after the loss phase, and revealing that this is a CUBIC-related bug.

^{Reno recovers cleanly after the loss phase ends at T=2s and completes the download by ~5s}

Tracing the root cause

Loss-based algorithms have two pedals, gas and brake, with a difference in how they accelerate. Well, CUBIC comes with some extra features. Here we are going to focus on bytes_in_flight == 0.

TCP CUBIC after idle (Linux, 2017)

To understand the bug, we first need to understand the optimization it came from. In 2017,an issue was found with Linux kernel's CUBIC implementation. The commit message explains:

The epoch is only updated/reset initially and when experiencing losses. The delta "t" of now - epoch_start can be arbitrary large after app idle as well as the bic_target. Consequentially the slope (inverse of ca->cnt) would be really large, and eventually ca->cnt would be lower-bounded in the end to 2 to have delayed-ACK slow-start behavior.
This particularly shows up when slow_start_after_idle is disabled as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle time.

The epoch is the reference timestamp CUBIC uses to anchor its growth curve: W_cubic(delta_t) is parameterized by delta_t = now - epoch_start, and the epoch is reset whenever CUBIC restarts its growth function — most notably after a loss event reduces cwnd. Between resets, delta_t grows monotonically with wall-clock time.

When an application goes idle (stops sending) for a while and then resumes, the CUBIC growth function W_cubic(delta_t) computes delta_t as now - epoch_start, as illustrated in the figure below. Since the epoch wasn't updated during idle, delta_t is huge, producing an enormous target window — and CUBIC would immediately try to inflate cwnd to an unreasonable value.

Jana Iyengar's initial fix was to reset `epoch_start` when the application resumes sending. But Neal Cardwell pointed out the flaw in that approach:

…it would ask the CUBIC algorithm to recalculate the curve so that we again start growing steeply upward from where cwnd is now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth curve to be the same shape, just shifted later in time by the amount of the idle period.

The elegant solution, authored by Eric Dumazet, Yuchung Cheng, and Neal Cardwell, was to shift the epoch forward by the idle duration rather than resetting it. This preserves the shape of the CUBIC growth curve — just sliding it in time so that the algorithm picks up where it left off.

The port to quiche (2020)

When CUBIC was first implemented in quiche, this idle-period adjustment was ported. However, QUIC, which runs in the user space, doesn't have TCP's kernel-level CA_EVENT_TX_START callback. Instead, the quiche implementation checks for the idle condition inside on_packet_sent():

// cubic.rs — on_packet_sent() (simplified)
/// Updates the state when a packet is sent.
fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant, ...) {
    // If the sending burst is restarting (i.e., bytes_in_flight was zero before this send),
    // adjust the congestion recovery start time to account for the gap in sending.
    if bytes_in_flight == 0 {
        let delta = now - self.last_sent_time;
        self.congestion_recovery_start_time += delta;
    }
    // Record the time of this send event.
    self.last_sent_time = now;
}

Where it breaks: the QUIC difference

The fix ported to quiche included a bug in the original kernel change which was fixed by a followup change to the kernel cubic module about a week later. The commit message for the second fix explains:

tcp_cubic: do not set epoch_start in the future Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start is normally set at ACK processing time, not at send time.
Doing a proper fix would need to add an additional state variable, and does not seem worth the trouble, given CUBIC bug has been there forever before Jana noticed it.
Let's simply not set epoch_start in the future, otherwise bictcp_update() could overflow and CUBIC would again grow cwnd too fast.

As mentioned in the commit message, recovery start time is set during ACK processing, and the computation of the adjustment based on sent times can push the recovery start time into the future. This explains the oscillation between recovery and congestion avoidance seen on our test. The trap only consistently triggers when every incoming ACK drives bytes_in_flight all the way to zero — which in practice means cwnd has collapsed to its minimum (two packets) and the application has data ready to send another full window the moment an ACK arrives. Outside this regime, bytes_in_flight == 0 is less likely to hold on every send, so it is less likely to trigger the bug.

Why doesn't this also happen at connection start? The bug only triggers when the connection exits slow-start and switches over to congestion avoidance. Before exiting slow-start, congestion_recovery_start_time is not set, so the buggy branch in on_packet_sent has no recovery boundary to advance. During slow start CUBIC's cwnd grows by the same Reno-style ack-based rule shared by all loss-based CCAs — the cubic curve and its sensitivity to congestion_recovery_start_time only enter the picture once the connection is in congestion avoidance, meaning the trap needs three things at once: a real loss event to set the recovery boundary, congestion avoidance to be running, and cwnd collapsed to the two-packet floor.

^{The self-perpetuating recovery trap. At minimum cwnd, every ACK cycle triggers the idle period adjustment with an inflated delta.}

At a minimum cwnd (two packets), the dynamics of the connection shift into a "death spiral" where the idle period optimization becomes a self-fulfilling prophecy. This trap operates in a continuous loop:

Send and ACK packets: The sender transmits the entire two-packet window. After one RTT (~14ms), both packets are ACKed, causing bytes_in_flight to drop to zero.
False idle detection: When the next burst is sent, on_packet_sent() sees bytes_in_flight == 0 and assumes the connection was idle, but it was congestion limited.
Inflated delta: The calculation uses now - last_sent_time to determine the idle duration. When the congestion window (cwnd) is at its minimum, last_sent_time is the timestamp of the start of the previous RTT cycle. Therefore, the resulting delta is approximately 14ms (the connection's RTT + additional rounding errors). This RTT-sized delta is incorrectly applied as the "idle" time. The actual time the connection was idle (the processing gap between the last ACK arriving and the next packet being sent) is effectively 0. By measuring the full RTT instead of the true gap, the delta is inflated significantly, aggressively shifting the recovery start time forward, possibly into the future.
Perceived recovery: Because the recovery start time is now in the future, the in_congestion_recovery() check returns true for every incoming ACK. Processing of the next ACK exits recovery and sets the recovery start to the ACK time which is larger than last_sent_time, making it likely for the congestion controller to push the recovery time into the future when doing the next send.
Stagnation: Since CUBIC skips cwnd growth for any packet perceived to be in a recovery period, the window remains pinned at two packets — ensuring the pipe drains completely on the next ACK and restarting the cycle.

And this loop repeats for thousands of cycles until the accumulation of small deviations — from scheduler jitter and ACK processing variance — lets the <= boundary in in_congestion_recovery() slip behind the next packet's send time, breaking the cycle.

The fix: measuring idle from the right moment

Fixing the death spiral involves measuring the idle duration from when bytes_in_flight actually transitioned to zero (the last ACK processed) rather than the last packet sent.

The code change

Add last_ack_time timestamp to the CUBIC state.
Update that timestamp when ACKs arrive.
Use it for the idle delta computation:

// cubic.rs — on_packet_sent()
fn on_packet_sent(&mut self, bytes_in_flight: usize, now: Instant, ...) {
    // Check if the connection was idle before this packet was sent.
    if bytes_in_flight == 0 {
        if let Some(recovery_start_time) = r.congestion_recovery_start_time {
            // Measure idle from the most recent activity: either the
            // last ACK (approximating when bif hit 0) or the last data
            // send, whichever is later. Using last_sent_time alone
            // would inflate the delta by a full RTT when cwnd is small
            // and bif transiently hits 0 between ACK and send.
            let idle_start = cmp::max(cubic.last_ack_time, cubic.last_sent_time);

            if let Some(idle_start) = idle_start {
                if idle_start < now {
                    let delta = now - idle_start;
                    r.congestion_recovery_start_time =
                        Some(recovery_start_time + delta);
                }
            }
        }
}

With the delta now reflecting the actual gap since the last ACK, the recovery boundary stops chasing the send time:

^{Old code: boundary advances one RTT per cycle, always landing on or ahead of the next send.}

^{Fix: boundary barely moves; the next send lands ahead of it and cwnd grows.}

For genuinely idle connections, last_ack_time is far in the past and the same expression captures the full idle duration, the original epoch-shift behavior is preserved.

Validation

With the fix applied, the 100% pass rate of our quiche testing suite was restored.

^{After the fix, cwnd grows along the expected CUBIC curve and the download completes in ~4-5 seconds.}

We don't worry about the losses at the end of the connection — that's expected because we fully utilized the router's allocated buffer. In other words, we are fully utilizing the available bandwidth in this test case.

Takeaways

"Idle" is harder to define than it sounds. Normal pipeline delays at small windows can look like idleness to simple checks.
Minimum-cwnd dynamics are a unique corner case. The bug was invisible at high speeds and only triggered after severe loss.
The fix was surprisingly small compared to the complexity of the behavior. After weeks of instrumenting qlogs and analyzing visualizations to find the root cause, the solution required changing just three lines of code. As we noted during the investigation: the effort to find the bug was massive, but the fix itself was basically one line of logic.

The fix described in this post has been contributed to cloudflare/quiche, Cloudflare's open-source implementation of QUIC and HTTP/3. Our CCA efforts go beyond loss-based algorithms: we also use quiche’s modular congestion control design to experiment with and tune our model-based BBRv3 implementation, now enabled for a growing percentage of our QUIC deployments. Stay tuned for further updates on QUIC congestion control implementation and performance.

If you're interested in congestion control, transport protocols, or contributing to open-source networking code, check out the quiche repository. We're always looking for talented engineers who love digging into problems like these, please explore our open positions.

Building for the future

Matthew Prince — Thu, 07 May 2026 20:15:12 GMT

This afternoon, we sent the following email to our global team. One of our core values at Cloudflare is transparency, and we believe it's important that you hear this directly from us because it’s a major moment at Cloudflare.

Team:
We are writing to let you know directly that we’ve made the decision to reduce Cloudflare’s workforce by more than 1,100 employees globally.
The way we work at Cloudflare has fundamentally changed. We don’t just build and sell AI tools and platforms. We are our own most demanding customer. Cloudflare’s usage of AI has increased by more than 600% in the last three months alone. Employees across the company from engineering to HR to finance to marketing run thousands of AI agent sessions each day to get their work done. That means we have to be intentional in how we architect our company for the agentic AI era in order to supercharge the value we deliver to our customers and to honor our mission to help build a better Internet for everyone, everywhere.
Today is a hard day. This decision unfortunately means saying goodbye to teammates who have contributed meaningfully to our mission and to building Cloudflare into one of the world’s most successful companies. We want to be clear that this decision is not a reflection of the individual work or talent of those leaving us. Instead, we are reimagining every internal process, team, and role across the company. Today’s actions are not a cost-cutting exercise or an assessment of individuals’ performance; they are about Cloudflare defining how a world-class, high-growth company operates and creates value in the agentic AI era.
This is a moment we need to own as founders and leaders of the company. Matthew has personally sent out every offer letter we've extended. It is a practice he has always looked forward to because it represented our growth and the incredible talent joining our mission. It didn’t feel right for this message to come from anyone other than the two of us. Rather than trickling out notices through managers, we will be sending emails to every employee.
Within the next hour, every member of our global team will receive an email from both of us clarifying how this change affects them. For those departing today, we will send this update to both their personal and Cloudflare addresses to ensure they receive the information immediately.
It’s important to us that we treat departing team members right and in a way that exceeds what we’ve seen from other companies. We believe acting with empathy isn’t about avoiding hard decisions but rather about how you treat people when those decisions are made. If we are asking our team to be world-class, we have a reciprocal obligation to be world-class in how we treat them. We are pairing the directness of these measures with severance packages that lead the industry. The packages for departing employees will include the equivalent of their full base pay through the end of 2026. Healthcare coverage is different across the globe, and if you’re in the United States, we’ll continue to provide support through the end of the year. We are also vesting equity for departing team members through August 15th, so they receive stock beyond their departure date. And, if departing team members haven’t hit their one-year cliffs, we are going to waive those and vest their pro-rated equity through August as well.
We’ve asked the team to do this only once, as hard as that may be today. We don’t want to do it again for the foreseeable future. By taking decisive action now, we provide immediate clarity to those departing and protect the stability of the team that remains. We are making these changes now because making smaller, repeated cuts or dragging a reorganization out over multiple quarters creates prolonged emotional uncertainty for employees and stalls our ability to build. It’s the right thing to do; it’s the honest thing to do; and it reflects the values of the company we are continuing to build.
Cloudflare started as a digitally native company built in the cloud. That allowed us to catch up to and pass companies that had a head start of years or decades but were slowed down by outdated systems and processes. As we’ve now become the leader, we cannot rest on the workflows and organizational structures that worked yesterday. We’re confident that our reshaped organization will be even faster and more innovative as we continue building the future.
To those departing us: you’ve helped build the strong foundation Cloudflare stands on today. We have the utmost respect for your work and gratitude for the impact you have made. We’re confident you will land at other great places and build many future great companies, bringing with you a unique set of skills learned while building Cloudflare.
Transparency is a core principle at Cloudflare, and it was important that you hear this from us first. We will be heading to our earnings conference call at 2 PM PT, when we’ll share more. We also plan to address today’s announcements live with the team at our all-hands meeting.
It’s not an easy day, but it’s the right decision. Our mission to help build a better Internet is more important now than ever, and there’s a lot of work left to be done.

How Cloudflare responded to the “Copy Fail” Linux vulnerability

Chris J Arges — Thu, 07 May 2026 13:00:00 GMT

On April 29, 2026, a Linux kernel local privilege escalation vulnerability was publicly disclosed under the name "Copy Fail" (CVE-2026-31431). Cloudflare’s Security and Engineering teams began assessing the vulnerability as soon as it was disclosed. We reviewed the exploit technique, evaluated exposure across our infrastructure, and validated that our existing behavioral detections could identify the exploit pattern within minutes.

There was no impact to the Cloudflare environment, no customer data was at risk, and no services were disrupted at any point. Read on to learn how our preparedness paid off.

Background

Our Linux kernel release process

Cloudflare operates a global Linux server infrastructure at an immense scale, with datacenters located across 330 cities. We maintain a custom Linux kernel build based on the community's Long-Term Support (LTS) versions to manage updates effectively at this volume. At any given time, we may utilize multiple LTS versions from various series, such as 6.12 or 6.18, which benefit from extended update periods.

The community regularly merges and releases security and stability updates which trigger an automated job to generate a new internal kernel build approximately every week. These builds undergo testing in our staging data centers to ensure stability before a global rollout. Following a successful release, the Edge Reboot Release (ERR) pipeline manages a systematic update and reboot of the edge infrastructure on a four-week cycle. Our control plane infrastructure typically adopts the most recent kernel, with reboots scheduled according to specific workload requirements.

By the time a CVE becomes public knowledge, the necessary fix has typically been integrated into stable Linux LTS releases for several weeks. Our established procedures ensure that we have already deployed these patches.

At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version, while a subset of machines had begun transitioning to the newer 6.18 LTS release.

About the Copy Fail vulnerability

It helps to understand the vulnerability before getting to the response story. A comprehensive write-up can be found in the original Xint Code disclosure post.

AF_ALG and the kernel crypto API

The Linux kernel's internal crypto API manages functions like kTLS and IPsec. Userspace programs access this via the AF_ALG socket family, allowing unprivileged processes to request encryption or decryption. The algif_aead module facilitates this for Authenticated Encryption with Associated Data (AEAD) ciphers.

An unprivileged program follows these steps:

Opens an AF_ALG socket and binds to an AEAD template.
Sets a key and accepts a request socket.
Submits input via sendmsg() or splice().
Executes the operation using recvmsg().

The splice() system call is critical here, as it moves data by passing page cache references.

Memory mechanics: page cache and in-place crypto

The page cache is a shared system cache for file contents. Modifying a page belonging to a setuid binary effectively edits that program for all users until the page is evicted.

The crypto API utilizes scatterlists, which are structures linking various memory pages. In 2017, algif_aead was optimized for in-place operations, chaining destination and reference pages together. This design lacked enforcement to prevent algorithms from writing past intended boundaries.

The vulnerability: out-of-bounds write

When the user executes recvmsg(), the authencesn wrapper in the kernel performs a 4-byte write past the legitimate output region:

scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);

By using splice(), an attacker can chain a target file's page cache pages to the scatterlist. The out-of-bounds write then taints the cached file, allowing an attacker to control which file is modified, the offset, and the specific 4 bytes written. This means the attacker can manipulate the following with this exploit:

File: Any readable file.
Offset: Tunable via assoclen and splice parameters.
Value: Controlled via AAD bytes 4-7 in sendmsg()

The exploit, step by step

The default exploit targets /usr/bin/su, a setuid-root binary present on essentially every distribution.

Cache Reference: Open /usr/bin/su as O_RDONLY and read() to populate the page cache. Use splice() on the file descriptor to pass these page cache references into the crypto scatterlist.
Setup: Create an AF_ALG socket, bind() to authencesn(hmac(sha256),cbc(aes)), set a key, and accept a request socket without needing privileges.
Write Construction: For each 4-byte shellcode chunk:
- sendmsg() with AAD bytes 4–7 containing the shellcode.
- splice() the binary into a pipe then the AF_ALG socket so assoclen + cryptlen targets the desired .text offset.
Trigger: recvmsg() initiates decryption. authencesn writes its scratch data to the target offset of /usr/bin/su in the page cache. Although the function returns -EBADMSG, the 4-byte write is now in the global page cache.
Execution: Running execve("/usr/bin/su") loads the tainted page cache. Since the binary is setuid-root, the injected shellcode executes with root privileges.

The upstream fix (commit a664bf3d603d) reverts the 2017 in-place optimization, removing the exploit.

How we responded

When the vulnerability was disclosed, many workstreams started in parallel:

Mapping the blast radius: Our security team worked with kernel engineers to determine which kernel versions were vulnerable and assess the potential exposure.
Validating coverage: Security reviewed the exploit technique and confirmed that our existing behavioral detections could identify the exploit pattern during authorized internal validation.
Proactive threat hunting: Security began searching for signs that the vulnerability had been exploited before it was publicly known, going back 48 hours in our fleet-wide logs.
Engineering a mitigation: Kernel engineers began building a runtime mitigation that would protect the fleet without breaking production services.
Continuing software updates: Our engineering teams worked on delivering an updated Linux kernel, which required carefully rebooting and rolling it out across our servers.

There was no customer impact at any point during this response.

Validating detection coverage

One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.

When our engineers validated the vulnerability internally as part of the response, the detection platform flagged it within minutes. The system linked the entire execution chain—starting at the script interpreter, moving through the kernel’s cryptographic subsystem, and ending at the privilege escalation binary—flagging it as malicious based on fleet-wide behavioral patterns.

This happened without a signature update, without a rule change, and without human intervention. Our behavioral detection coverage existed before we wrote any custom logic for this particular Copy File exploit.

The confirmation was important because it meant we had coverage before writing a vulnerability-specific rule.

Hunting for exploitation

While our engineering team moved to a more targeted mitigation, our security investigation had been running since disclosure. This is our standard procedure for any critical vulnerability.

Our security team operates on a simple principle for critical vulnerabilities: assume compromise until you can prove otherwise. The investigation started from the assumption that exploitation could have occurred before the vulnerability was public, and we worked systematically to either confirm or rule it out.

The exploit leaves a distinctive trace in kernel logs when it runs. We searched for that trace across our centralized logging infrastructure, covering 48 hours before the vulnerability was publicly disclosed. If someone had exploited this before the world knew about it, we would have seen it.

We pulled access logs for affected systems and reconstructed who connected, when, and what commands they ran. This gave us a complete forensic picture of interactive activity on potentially affected infrastructure.

We checked that system binaries had not been tampered with, validated cryptographic hashes against known-good package manifests, looked for persistence mechanisms, and audited network connections for anything unusual. Everything was clean.

Incident timeline and impact

Time (UTC)	Event
2026-04-29 16:00	Copy Fail publicly disclosed.
2026-04-29 ~21:00	Security and Engineering teams began assessing fleet exposure and mitigation options before full declaration of the Incident Response process
2026-04-29 22:52	Security confirmed existing behavioral detection covered the Copy Fail exploit pattern. During authorized internal validation, detection flagged the activity within minutes.
2026-04-29 23:01	Existing behavioral detection generated a high-severity alert for exploit-like activity, confirming detection coverage for the technique.
2026-04-29 (evening)	First mitigation attempt pushed to our staging datacenter. The deployment process surfaced a dependency conflict; the mitigation was rolled back. No production systems were affected.
2026-04-29 (overnight)	Engineering drafted bpf-lsm mitigation program.
2026-04-30 03:14	Security incident declared to drive cross-functional collaboration and urgency. Security performed fleetwide threat hunting of historical data to confirm that no malicious activity was present on Cloudflare systems.
2026-04-30 (morning)	Engineering tested the bpf-lsm mitigation program and made it production-ready.
2026-04-30 14:25	Engineering incident declared to coordinate mitigation program and Linux patch rollout.
2026-04-30 ~17:00	Decision made: ship a patched build of the previous LTS line through reboot automation; do not accelerate the new LTS; lean on bpf-lsm in the meantime.
2026-04-30 (afternoon)	Visibility pipeline (eBPF tracing of AF_ALG socket usage) deployed fleet-wide. Gives a complete picture of all legitimate AF_ALG users.
2026-04-30 (evening)	bpf-lsm mitigation program rolled out behind a separate gate to fully mitigate the fleet. End-to-end verification on a previously-vulnerable test node confirms the exploit no longer works.
2026-05-04 (morning)	Reboot automation resumed at normal pace with the patched kernel.
2026-05-04 onward	Servers that had already passed through reboot automation earlier in the week manually rebooted to pick up the patched kernel. Unpatched servers update per our normal reboot automation.

This graph shows the progress of our mitigation program as it progressed through our infrastructure.

How did we mitigate it?

Because of the long timeframe involved in deploying a patched Linux kernel, we also pursued mitigating this exploit without a reboot.

Removing the module

The bug was in the algif_aead kernel module. Therefore, the simple fix was to just remove this module and disallow it from being reloaded.

This mitigation was therefore exactly what the Copy Fail write-up from the security researchers who identified it recommends.

echo "install algif_aead /bin/false" > /etc/modprobe.d/disable-algif.conf
rmmod algif_aead 2>/dev/null || true

Unfortunately removing the module would have impacted software that leverages the kernel crypto API. This meant that we had to figure out a more surgical mitigation.

Bpf-lsm

We’ve already developed and deployed such a tool for this exact scenario: bpf-lsm. Instead of removing the module, this tool leaves it loaded for legitimate users and uses a BPF Linux Security Module program to deny the socket_bind LSM hook for everyone else. This completely blocks the front door for any exploits.

A draft of the eBPF program was put together overnight. Team members picked it up the following morning, ran validations, and made it production-ready. The program is fairly straightforward. On every socket_bind call:

If the socket family is not AF_ALG, allow the call through unchanged.
If the family is AF_ALG, check the calling binary's path against an allow-list of the binaries we know to be legitimate users.
If the binary is on the allow-list, allow the bind. Otherwise, deny it.

To verify the mitigation on a given machine without exploiting it, the Copy Fail write-up gives a one-liner:

python3 -c 'import socket; s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0); s.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));'

On a mitigated machine you get PermissionError: [Errno 1] Operation not permitted (or FileNotFoundError, depending on which mitigation is active) instead of a successful bind.

Rolling it out

Before enabling enforcement, we verified that our known internal service was the sole legitimate AF_ALG user to avoid accidental outages. We used prometheus-ebpf-exporter to hook the socket() syscall and track AF_ALG usage per binary across the fleet. This required no kernel changes and provided aggregate data from hundreds of thousands of servers within hours. Results confirmed the identified service was indeed the only legitimate user.

So the bpf-lsm rollout was deliberately staged in two steps:

Get visibility first. Push the ebpf-exporter config gated by salt. Confirm at the metric layer that the known service is effectively the only thing creating AF_ALG sockets.
Then enforce. Push the bpf-lsm program behind a separate enforcement gate.

In parallel, the upstream backport for our majority LTS line finally became available, and our internal automation built a patched kernel against it.

We started to test the patched kernel in our staging datacenters as soon as possible, then we resumed the longer reboot process in order to fully patch our fleet.

Remediation and follow-up steps

While we were prepared for this scenario, at Cloudflare we’re always learning and improving. Key areas we identified for improvement:

Better visibility into kernel-API dependencies. We will review kernel-subsystem usage across production services, so we can continue to quickly mitigate exploits without service disruption.
Better runtime mitigation. bpf-lsm is a valuable tool for mitigations, but we want to make this tool even better. This will include looking into faster deployments, better playbooks, and better logging and visibility of the tool.
Reduce attack surface of Linux Kernel. Review and audit our kernel configuration. Proactively identify unused modules or features so that we can remove them from our build entirely.

Conclusion

The "Copy Fail" vulnerability presented a unique challenge for us. Despite our practice of deploying Linux patch updates every two weeks, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line. Despite that, we were still able to roll out patched kernels within hours of the backport's release. In the interim, bpf-lsm provided a surgical, no-reboot mitigation that secured our fleet. While our initial attempt to disable the problematic module failed, it did so safely within our internal staging environment rather than production, allowing us to identify this dependency.

By the end of the rollout, every machine in our fleet was protected by either a patched kernel or a bpf-lsm program denying the vulnerable code path to non-allow-listed binaries. There was no customer impact at any point during this incident, and we have committed to the follow-up work above to make our response faster and our visibility better the next time something like this lands. Responsible disclosure works, in-kernel visibility tooling pays off in moments exactly like this one, and bpf-lsm continues to be one of the most useful primitives we have for runtime kernel mitigation.

At Cloudflare, critical vulnerability response is a coordinated effort across Security, Engineering, Product, and many other teams. Special thanks to Ali Adnan, Ivan Babrou, Frederik Baetens, Curtis Bray, Piers Cornwell, Everton Didone Foscarini, Rob Dinh, Elle Dougherty, Kevin Flansburg, Matt Fleming, Kimberley Hall, Brandon Harris, Jerry Ho, Oxana Kharitonova, Marek Kroemeke, Fred Lawler, James Munson, Nafeez Nazer, Walead Parviz, Miguel Pato, Evan Pratten, Josh Seba, June Slater, Ryan Timken, Michael Wolf, Jianxin Zeng and everyone else who contributed to the investigation, mitigation, and remediation of Copy Fail. We'd also like to thank the Linux upstream maintainers and Copy Fail researchers whose work helped make a rapid response possible.

When DNSSEC goes wrong: how we responded to the .de TLD outage

Sebastiaan Neuteboom — Wed, 06 May 2026 17:00:00 GMT

On May 5, 2026, at roughly 19:30 UTC, DENIC, the registry operator for the .de country-code top-level domain (TLD), started publishing incorrect DNSSEC signatures for the .de zone. Any validating DNS resolver receiving these signatures was required by the DNSSEC specification to reject them and return SERVFAIL to clients, including 1.1.1.1, the public DNS resolver operated by Cloudflare.

The country-code top-level domain for Germany, .de, is one of the largest on the Internet. On Cloudflare Radar, it consistently ranks among the most broadly queried TLDs globally. An outage at this level of the DNS hierarchy has the potential to make millions of domains unreachable.

In this post, we’ll walk through what we saw, the impact of these events, and how we applied temporary mitigations while DENIC resolved the issue.

How DNSSEC works

DNSSEC (Domain Name System Security Extensions) adds cryptographic authentication to DNS. When a zone is signed with DNSSEC, each set of records is accompanied by a digital signature known as an RRSIG record that lets a resolver verify the records haven’t been tampered with. Unlike encrypted DNS protocols, such as DNS over TLS (DoT) and DNS over HTTPs (DoH), DNSSEC is about integrity, not privacy. The records are visible, but their authenticity can be proven.

What makes DNSSEC unique is that the signatures travel together with the records they protect. This means integrity can be verified regardless of how many caches or hops a response has passed through. A cached record is just as verifiable as a fresh one.

DNSSEC is built on a chain of trust. Starting at the root zone, whose trust anchor is hard-coded into the resolvers, each zone delegates trust to child zones via Delegation Signer (DS) records. A DS record in the parent zone contains a cryptographic hash of a public key in the child zone. When a resolver validates example.de it verifies the chain: root trusts .de, .de trusts example.de. A break anywhere in that chain causes validation to fail for everything below it, which is why a misconfiguration at a TLD like .de affects every domain under it.

Zones typically use two types of keys: a Zone Signing Key (ZSK), used to sign the zone’s records, and a Key Signing Key (KSK), used to sign the ZSK itself. The KSK’s public key is what the parent zone’s DS record points to, anchoring the chain of trust. Rotating a ZSK is relatively straightforward: generate a new key, re-sign the zone’s records, and wait for caches to expire. Rotating a KSK is more involved, because the parent’s DS record must also be updated, often requiring coordination with a registrar or registry.

During a key rotation, there is a critical window where the old key is being phased out and the new one phased in. If the signatures published in the zone are made with a key that resolvers cannot verify against the zone’s published DNSKEY records, whether because the signing step failed, the timing was wrong, or the new key wasn’t fully distributed yet, resolvers have no choice but to reject the responses and return SERVFAIL.

What we saw

On May 5, 2026, at roughly 19:30 UTC, DENIC, the operator for the .de TLD, started publishing incorrect DNSSEC signatures for the .de zone. Any validating resolver receiving these records was required by the DNSSEC specification to reject them and return SERVFAIL. 1.1.1.1 was no exception.

The graph below shows the response codes 1.1.1.1 returned for .de queries during the incident.

After the immediate spike in SERVFAILs at 19:30 UTC, it climbed steadily over the following three hours as cached records slowly started expiring. As each domain's cached records expired and resolvers went back to DENIC for fresh copies, they got back broken signatures and started failing.

Also visible is a large increase in query volume. This is typical during DNS incidents, as clients retry failed queries, often three or more times, inflating the raw numbers. The SERVFAIL rate looks more alarming than the actual user impact, as many of those queries represent the same user retrying the same domain.

What might be surprising is that the NOERROR rate stayed relatively stable throughout the incident. That's “serve stale” at work, which we'll cover in the next section.

Serve stale

Recursive resolvers cache the records they receive from authoritative nameservers for the duration of each record's TTL (Time-to-Live). While a record is cached, the resolver serves it directly without going back to the authoritative nameserver. When the TTL expires, the resolver fetches a fresh copy and re-caches it.

During the outage, freshly requested records ended up resolving to SERVFAIL. The DNSSEC signatures were broken and the resolver correctly rejected them. But many .de records were still sitting in cache from before the incident began. Rather than immediately discarding those and returning SERVFAIL to users, 1.1.1.1 continued serving them past their TTL. This is called “serving stale.”

1.1.1.1 implements RFC 8767, which formalizes this behavior. When upstream resolution fails, a resolver may continue serving expired cached records rather than returning an error. This significantly cushions the impact of an upstream outage, buying time for operators to respond.

The result is visible in the graph below, which shows response codes for .de queries during the incident excluding the stale-served responses. Without stale-served responses, the NOERROR rate drops steadily from 19:30 onward. These represent queries that users received good answers for only because their record was still in cache.

Our mitigation

While the issue was largely out of our own control, and serve stale was doing its job, there was still a legitimate impact for a lot of users. Luckily, there were some actions we were able to take to improve the situation.

Negative Trust Anchors

RFC 7646 defines the concept of a Negative Trust Anchor (NTA). In normal DNSSEC operation, a validating resolver maintains a set of trust anchors: public keys at the root of the chain of trust. Each DNS zone signed with DNSSEC has a trust anchor, and every child zone builds its own trust anchor upon it. When the cryptographic signatures linking the chain together are broken, responses will be rejected and result in SERVFAIL. An NTA is an explicit exception. It tells the resolver to treat a specific zone as if it were unsigned, bypassing validation for names under that zone.

NTAs exist precisely for these types of incidents. When a TLD operator publishes broken signatures, every DNSSEC-validating resolver is forced to return SERVFAIL for every domain under that TLD. Not because of anything wrong with those domains themselves, but because their parent zone is misconfigured. Continuing to return SERVFAIL in that situation provides no security value: the failure is already known, public, and being fixed. RFC 7646 explicitly names TLD misconfiguration as the primary use case for NTAs.

What we actually deployed

For 1.1.1.1 we have our own resolver referred to as Big Pineapple, which also powers 1.1.1.1 for Families, Gateway DNS, DNS Firewall, and more. At this time, we have not implemented a native NTA mechanism. Instead, we used an existing override rule mechanism to mark .de as an insecure zone, which causes all .de queries to be resolved as if they don’t have DNSSEC enabled. This is functionality equivalent to an NTA, though it is not formally defined in any RFC.

The decision to bypass DNSSEC is a deliberate tradeoff. Without DNSSEC validation, .de domains become vulnerable to genuine attacks for the duration of the incident. During incidents like this, we weighed this as acceptable because the signing failure was widespread, publicly confirmed, and affected every validating resolver on the Internet equally. As it was put in our internal incident room: “There is no user of 1.1.1.1 resolving a .de name right now who would prefer a SERVFAIL over an unvalidated response.”

We rolled out our mitigation at 22:17 UTC, which marked the end of impact for 1.1.1.1. We communicated this with fellow DNS operators in the DNS-OARC Mattermost.

Origin resolution mitigations

While all Internet users can access our 1.1.1.1 resolver, we have a particular responsibility to customers using our CDN platform services. Those with .de origin names were also affected by this outage.

Cloudflare operates a separate internal resolver for origin resolution, distinct from our publicly available 1.1.1.1 service. To mitigate impact we applied a similar NTA for .de on the internal resolver service, restoring origin connectivity for affected customers.

Extended DNS Errors

Before our mitigation, queries that couldn't be served from cache received a SERVFAIL response from 1.1.1.1. Each SERVFAIL included an Extended DNS Error (EDE) code, defined in RFC 8914, which gives clients more detail about what went wrong.

Some resolvers returned EDE 6 (DNSSEC Bogus) with a descriptive message pointing directly at the broken signature. This is the correct behavior:

EDE: 6 (DNSSEC Bogus): RRSIG with malformed signature found for example.de/nsec3 (keytag=33834)

1.1.1.1, on the other hand, returned EDE 22 (No Reachable Authority), which on the surface suggests a connectivity problem with the upstream nameservers rather than a DNSSEC validation failure.

The cause is a bug in how we propagate DNSSEC EDE codes up from our trust chain verifier. When the verifier detects a bogus signature it creates the DNSSEC Bogus EDE code, but this is never inserted into the response. Instead, the outer layer of the resolver sees a problem with recursive resolution with no error code and falls back to reporting “No Reachable Authority.” This obscures the underlying DNSSEC cause.

We're aware that this isn't helpful for 1.1.1.1 users and will be fixing our responses to surface the DNSSEC errors.

Is this a failure of DNSSEC as a technology?

DNS is a critical part of the request chain for most Internet communication. It would be easy to come to the conclusion that this outage and the mitigations applied means DNSSEC has failed as a technology. However, any technology that is misconfigured will risk breaking for users that rely on it. Leaving critical fiber cables exposed on the seabed for sharks to chew on does not invalidate the important role underwater cables pose in today's Internet communications. It only highlights that we’ve sometimes failed to accurately protect it. The same applies here. DNSSEC serves a critical role in ensuring that we can rely on the DNS answers without tampering by malicious actors.

#HugOps

No one likes to have serious incidents. These things, unfortunately, happen to everyone who operates critical infrastructure at scale. When they do, the DNS community tends to show up for each other.

Incidents like this also highlight why relationships between operators matter. DNS is a decentralized system, no single organization controls all of it, and keeping it running reliably depends on mutual trust and open lines of communication between registries, resolver operators, and the broader community. Forums like DNS-OARC provide exactly this: shared mailing lists and chat rooms where operators can coordinate quickly across organizational boundaries when something goes wrong.

DENIC has published a short blog post about the incident where they state: “The outage is linked to a routine, scheduled key rollover. During this process, non-validatable signatures were generated and distributed. As a precautionary measure, future rollovers have been suspended until the exact technical causes have been identified.”

We're sure we’ll hear more when their own analysis is ready.

Takeaways from this incident

This incident highlights a structural reality of the DNS hierarchy: when a registry at the TLD level fails, every domain under that TLD is affected simultaneously, regardless of where it's hosted or which resolver is used. This isn't unique to DNSSEC; the same is true if a TLD’s nameservers become unreachable. The hierarchy that makes the global DNS work is also what makes failures at the top propagate downward.

There is no simple fix for this. What the industry can do is respond quickly and consistently when it happens. In this incident, resolver operators across the Internet independently applied Negative Trust Anchors within an hour, restoring resolution while DENIC worked to fix the zone. Operational practices, industry communication channels like DNS-OARC, and features like serve stale all reduce the impact, even if they can’t eliminate the underlying dependency.

We also came away with some points to improve for ourselves. We will be working on our EDE errors to better surface DNSSEC errors.

We look forward to DENIC’s post-incident report and appreciate the transparency they showed throughout.

If you want to learn more about how DNSSEC works, visit our page How does DNSSEC work? And you can always follow real-time DNS trends and TLD data on Cloudflare Radar.

Code Orange: Fail Small is complete. The result is a stronger Cloudflare network

Jeremy Hartman — Fri, 01 May 2026 21:07:30 GMT

Over the past two and a bit quarters, we've undertaken an intensive engineering effort, internally code-named "Code Orange: Fail Small", focused on making Cloudflare's infrastructure more resilient, secure, and reliable for every customer.

Earlier this month, the Cloudflare team finished this work.

While improving resiliency will never be a “job done” and will always be a top priority across our development lifecycle, we have now completed the work that would have avoided the November 18, 2025 and December 5, 2025 global outages.

This work focused on several key areas: safer configuration changes, reducing the impact of failure, and revising our “break glass” procedures and incident management. We also introduced measures to prevent drift and regressions over time, and strengthened the way we communicate to our customers during an outage.

Here we explain in depth what we shipped, and what it means for you.

Safer configuration changes

What it means for you: In most cases, Cloudflare internal configuration changes no longer reach our network instantly and are instead rolled out progressively with real-time health monitoring. This allows our observability tools to catch problems and revert issues before they affect your traffic.

In order to catch potentially dangerous deployments before they reach production, we've identified high-risk configuration pipelines, and built new tools to manage configuration changes better.

For products that run on our network processing customer traffic and receive configuration changes, we no longer deploy these changes instantly across the network. Instead, relevant teams have adopted a “health-mediated deployment” methodology, the same we use when releasing software, for all configuration deployments. This includes but is not limited to the product teams that were directly affected by the incidents.

Central to this is a new internal component we call Snapstone, which we built to bring health-mediated deployment to configuration changes. Snapstone is a system that bundles configuration change into a package, and then allows gradual release of the configuration change with health mediation principles. Before Snapstone, applying this methodology to config was possible but difficult. It required significant per-team effort and wasn't consistently applied across the network. Snapstone closes this gap by providing a unified way to bring progressive rollout, real-time health monitoring, and automated rollback to configuration deployments by default.

What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, Snapstone allows teams to dynamically define any unit of configuration that needs health mediation, whether that's a data file like the one that caused the November 18 outage, or a control flag in our global configuration system like the one involved in the December 5 outage. Teams create these configuration units on demand, and Snapstone ensures they are deployed safely everywhere they're used.

This gives us something we didn't have before: when a risk review or operational experience identifies a dangerous configuration pattern, the fix is straightforward -- bring it into Snapstone, and the configuration pattern immediately inherits safe deployment.

Reducing the impact of failure

What it means for you: In the event an issue is observed on our network, our systems now fail more gracefully. This vastly reduces the potential impact radius, to ensure your traffic is delivered even in worst-case scenarios.

Product teams have carefully reviewed, both in a manual and programmatic fashion, their potential failure modes for products that are critical for serving customer traffic. Teams have removed non-essential runtime dependencies and implemented better failure modes. We will now use the last known good configuration where possible (“fail stale”), and if that isn’t possible we have reviewed each failure case and implemented “fail open” or “fail close” depending on whether serving traffic with reduced functionality is preferable to failing to serve traffic.

Let’s look at an example of how this works. Our November 2025 outage was triggered by a failed rollout of our Bot Management detection machine learning classifier. Under our new procedures, if data were generated again that our system could not read, the system would refuse to use the updated configuration and instead use the old configuration. If the old configuration was not available for some reason, it would fail open to ensure customer production traffic continues to be served, which is a much better outcome than downtime.

As a result, if the same Bot Management change that caused the failure in November were to roll out now, the system would detect the failure in an early stage of the deployment, before it had affected anything more than a small percentage of traffic.

We have also begun further segmenting our system so that independent copies of services run for different cohorts of traffic. Cloudflare already takes advantage of these customer cohorts for blast radius mitigation with traffic management techniques today, and this additional process segmentation work provides a powerful reliability capability for us going forward.

For example, the Workers runtime system is segmented into multiple independent services handling different cohorts of traffic, with one handling only traffic for our free customers. Changes are deployed to these segments based on customer cohorts, starting with free customers first. We’re also sending updates more quickly and frequently to the least critical segments, and at a slower pace to the most critical segments.

As a result, if a change were deployed to the Workers runtime system and it broke traffic, it would now only affect a small percentage of our free customers before being automatically detected and rolled back.

Sticking to the Workers runtime system as an example, in a seven-day period earlier this month, the deployment process was triggered more than 50 times. You can see how each happens in “waves” as the change propagates to the edge, often in parallel to the following and prior releases:

We’re working on extending this pattern of deployment to many more of our systems in the future.

Revised “break glass” and incident management procedures

What it means for you: If an incident does occur, we have the tools and teams to communicate more clearly and resolve it faster, minimizing downtime.

Cloudflare runs on Cloudflare. We use our own Zero Trust products to secure our infrastructure, but this creates a dependency: if a network-wide outage impacts these tools, we lose the very pathways we need to fix them. Before this Code Orange initiative, our "break glass" pathways were restricted to a handful of people and offered limited tool access. We needed these tools and pathways to be more broadly available during an outage.

To solve this, we conducted a comprehensive audit of the tools essential for system visibility, debugging, and production changes. We ultimately developed backup authorization pathways for 18 key services, supported by new emergency scripts and proxies.

Throughout the Code Orange program, we moved from theory to practice. After small-team exercises, we conducted an engineering-wide drill on April 7, 2026, involving more than 200 team members. While automation keeps these pathways functional, drills like these ensure our engineers have the muscle memory to use them under pressure.

This effort also focused on the flow of information. When internal visibility is disrupted, our incident response slows down, and our ability to communicate with the outside world suffers. Historically, technical observations from the heat of the moment didn't always translate into clear updates for our customers.

To bridge this gap, we established a dedicated communications team to work in lockstep with incident responders during major events. Just as our engineers practiced their "break glass" procedures, this team used the Code Orange program to drill on streamlining the cadence and clarity of customer updates. By ensuring we have both the tools to see and the structure to speak, we can resolve incidents faster and keep our customers better informed.

We have codified our improvements

What it means for you: We will remember the learnings from our incidents and have codified the resolutions. Our network will only become more resilient.

To avoid drift and reintroducing regressions to the work done as part of Code Orange over time, the team has built an internal Codex that solidifies all our guidelines in clear and concise rules.

The Codex is now mandatory for all engineering and product teams, and has become a central part of Cloudflare internal procedures. Its rules are enforced via AI code reviews that automatically highlight any instance that might diverge from the guidelines, requiring additional manual reviews be performed. This is applied without exception to our entire codebase. The goal is simple: Build institutional memory that enforces itself.

The November and December outages shared a common failure mode: code that assumed inputs would always be valid, with no graceful degradation when that assumption broke. A Rust service called .unwrap() instead of handling an error; Lua code indexed an object that didn't exist. Both patterns are preventable if the lessons are captured and enforced.

The Codex is part of our answer. It's a living repository of engineering standards written by domain experts through our Request For Comments (RFC) process, then distilled into actionable rules. Best practices that previously lived in the heads of senior engineers, or were discovered only after an incident, now become shared knowledge accessible to everyone. Each rule follows a simple format: "If you need X, use Y" with a link to the RFC that explains why.

For example, one RFC now states: "Do not use .unwrap() outside of tests and build.rs." Another captures a broader principle: "Services MUST validate that upstream dependencies are in an expected state before processing."

Had these rules been enforced earlier, the November and December outages would have been rejected merge requests instead of global incidents.

Rules without enforcement are suggestions. The Codex integrates with AI-powered agents at every stage of the software development lifecycle, from design review through deployment to incident analysis. This shifts enforcement left, from "global outage" to "rejected merge request." The blast radius of a violation shrinks from millions of affected requests to a single developer getting actionable feedback before their code ever reaches production.

The Codex is a living document and will be continuously improved over time. Domain experts write RFCs to codify best practices. Incidents surface gaps that become new RFCs. Every approved RFC generates Codex rules. Those rules feed the agents that review the next merge request. It's a flywheel: expertise becomes standards, standards become enforcement, enforcement raises the floor for everyone.

It’s not just about code: communication is key

What it means for you: Transparency is important to us. If something goes wrong, we’re committed to keeping you updated every step of the way, so you can stay focused on what matters to you.

The global outages have made us review core processes and cultural approaches even beyond engineering and product development. As part of the broader Code Orange initiatives, we have introduced additional service level objectives (SLOs) to all our services, enforced a global changelog, onboarded all teams to our maintenance coordination system, and improved transparency across the company on our incident “prevents” ticket backlog.

We have also strengthened the way we communicate to our customers during an outage. Our goal is to alert you to an issue the moment we confirm it, before you even notice a problem. By the time you notice a lag or an error, our aim is to have an update already waiting in your notifications.

During an active incident, we now provide updates at predictable intervals (e.g., every 30 or 60 minutes), even if the update is simply, "We are still testing the fix; no new changes yet." This allows you to plan your day rather than constantly refreshing a status page.

Our job isn't done when the status returns to normal. We provide detailed post-mortems explaining what happened, why it happened, and the specific structural changes we are making to ensure it doesn't happen again.

This initiative is complete. But our work on resiliency is never done.

We take the incidents very seriously and adopted a shared ownership across the entire Cloudflare organization by asking every team: What could have been done better? This guided the work that we carried out over the last two quarters.

While this work is never truly done, we are confident that we are in a much better position and Cloudflare is now much stronger because of it.

Introducing Dynamic Workflows: durable execution that follows the tenant

Dan Lapid — Fri, 01 May 2026 13:00:00 GMT

When we first launched Workers eight years ago, it was a direct-to-developers platform. Over the years, we have expanded and scaled the ecosystem so that platforms could not only build on Workers directly, but they could also enable their customers to ship code to us through many multi-tenant applications. We now see on Workers: Applications where users describe what they want, and the AI writes the implementation. Multi-tenant SaaS where every customer's business logic is, at runtime, some TypeScript the platform has never seen before. Agents that write and run their own tools. CI/CD products where every repo defines its own pipeline.

Last month, when we shipped the Dynamic Workers open beta, we gave those platforms a clean primitive for the compute side: hand the Workers runtime some code at runtime, get back an isolated, sandboxed Worker, on the same machine, in single-digit milliseconds. Durable Object Facets extended the same idea to storage — each dynamically-loaded app can have its own SQLite database, spun up on demand, with the platform sitting in front, as a supervisor. Artifacts did the same for source control: a Git-native, versioned filesystem you can create by the tens of millions, one per agent, one per session, one per tenant. So, we have dynamic deployment for storage and source control. What’s next?

Today, we are bridging durable execution and dynamic deployment with Dynamic Workflows.

The gap between durable and dynamic execution

Cloudflare Workflows is our durable execution engine. It turns a run(event, step) function into a program where every step survives failures, can sleep for hours or days, can wait for external events, and resumes exactly where it left off when the isolate is recycled. It's the right primitive for anything that has to "keep going" past a single request: onboarding flows, video transcoding pipelines, multi-stage billing, long-running agent loops, and — as of Workflows V2 — up to 50,000 concurrent instances and 300 new instances per second per account, redesigned for the agentic era.

But Workflows has always had one assumption baked in: the workflow code is part of your deployment. Your wrangler.jsonc has a block that says "when the engine calls into WORKFLOWS, run the class called MyWorkflow." One binding, one class. Per deploy.

That works fine if you own all the code. It's fine if you're running a traditional application.

It stops working the moment you want to let your customer ship their workflow.

Say you're building an app platform where the AI writes TypeScript for every tenant. Say you're running a CI/CD product where each repository has its own pipeline. Say you're using an agents SDK where each agent writes its own durable plan. In every one of these cases, the workflow is different for every tenant, every agent, every request. There is no single class to bind.

This is the same shape of problem that Dynamic Workers solved for compute and that Durable Object Facets solved for storage. We just hadn't solved it for durable execution yet.

Dynamic Workflows

@cloudflare/dynamic-workflows is a small library. Roughly 300 lines of TypeScript. It lets a single Worker — the Worker Loader — route every create() call to a different tenant's code, and, critically, have the Workflows engine dispatch run(event, step) back to that same code when the workflow actually executes, seconds or hours or days later.

Here's the whole pattern. A Worker Loader:

import {
  createDynamicWorkflowEntrypoint,
  DynamicWorkflowBinding,
  wrapWorkflowBinding,
} from '@cloudflare/dynamic-workflows';

// The library looks this class up on cloudflare:workers exports.
export { DynamicWorkflowBinding };

function loadTenant(env, tenantId) {
  return env.LOADER.get(tenantId, async () => ({
    compatibilityDate: '2026-01-01',
    mainModule: 'index.js',
    modules: { 'index.js': await fetchTenantCode(tenantId) },
    // The tenant sees this as a normal Workflow binding.
    env: { WORKFLOWS: wrapWorkflowBinding({ tenantId }) },
  }));
}

// Register this as class_name in wrangler.jsonc.
export const DynamicWorkflow = createDynamicWorkflowEntrypoint(
  async ({ env, metadata }) => {
    const stub = loadTenant(env, metadata.tenantId);
    return stub.getEntrypoint('TenantWorkflow');
  }
);

export default {
  fetch(request, env) {
    const tenantId = request.headers.get('x-tenant-id');
    return loadTenant(env, tenantId).getEntrypoint().fetch(request);
  },
};

Add to your wrangler.jsonc:

"workflows": [
		{
			"name": "dynamic-workflow",
			"binding": "WORKFLOW",
			"class_name": "DynamicWorkflow"
		}
	]

The tenant writes plain, idiomatic Workflows code. They have no idea they're being dispatched:

import { WorkflowEntrypoint } from 'cloudflare:workers';

export class TenantWorkflow extends WorkflowEntrypoint {
  async run(event, step) {
    return step.do('greet', async () => `Hello, ${event.payload.name}!`);
  }
}

export default {
  async fetch(request, env) {
    const instance = await env.WORKFLOWS.create({ params: await request.json() });
    return Response.json({ id: await instance.id });
  },
};

That's it. The tenant calls env.WORKFLOWS.create(...) against what looks like a perfectly normal Workflow binding. Workflow IDs, .status(), .pause(), retries, hibernation, durable steps, step.sleep('24 hours'), step.waitForEvent() — everything works the way it always has.

The library handles one thing: making sure that when the Workflows engine eventually wakes up and calls run(event, step), it ends up inside the right tenant's code.

How it works

Three layers: the Workflows engine (platform) on top, your Worker Loader in the middle, your tenant's code (a Dynamic Worker) on the bottom.

When a request reaches the Worker Loader, it routes the execution to the correct dynamic code on the fly. The rest of the execution is a handoff between these three layers, left-to-right in time: the request enters, bounces up to the engine, is persisted, and later bounces back down again.

Walking the flow:

① → ② Entering the tenant's code. The Worker Loader receives an HTTP request, figures out which tenant it's for, loads that tenant's code via the Worker Loader, and forwards the request to its default.fetch. The env it hands the tenant contains WORKFLOWS: wrapWorkflowBinding({ tenantId }). As far as the tenant is concerned, that looks and acts like a real Workflow binding.

③ Up to the Worker Loader. When the tenant calls env.WORKFLOWS.create({ params }), it's actually making a Remote Procedure Call (RPC) into the Worker Loader — the wrapped binding is a WorkerEntrypoint subclass (DynamicWorkflowBinding) that the runtime specialized with the tenant's metadata at load time. That's why you have to export { DynamicWorkflowBinding } from your Worker Loader: the runtime builds per-tenant stubs by looking the class up in cloudflare:workers exports. Bindings that cross the Dynamic Worker boundary have to be RPC stubs — a plain { create, get } object can't be structured-cloned, and the raw Workflow binding isn't serializable either.

Inside the Worker Loader, the wrapped binding transparently rewrites the payload:

tenant calls:  create({ params: { name: 'Alice' } })
                            │
                            ▼
engine sees:   create({ params: {
                  __workerLoaderMetadata: { tenantId: 't-42' },
                  params: { name: 'Alice' }
               }})

④ Up to the engine. The Worker Loader then calls .create() on the real WORKFLOWS binding with the envelope as the params. From here the Workflows engine takes over. It persists event.payload — which now includes the envelope — and schedules the run. Every time the engine later wakes up the workflow (whether that’s after a 24-hour sleep, a crash, or a deploy), the metadata rides along with the payload, waiting to route the run.

One implication: treat the metadata as a routing hint, not as authorization. The tenant can read it back via instance.status(). Don't put secrets in there.

⑤ → ⑥ The engine comes back down. When the engine is ready to run a step, it calls .run(event, step) on the class you registered in wrangler.jsonc — the one createDynamicWorkflowEntrypoint gave you. That class unwraps the envelope, hands the metadata to the loadRunner callback you wrote, and forwards the unwrapped event through to whatever runner the callback returns.

The callback is where everything interesting happens, and it's entirely yours. Fetch the tenant's latest source from R2. Check their plan tier and pick a region. Attach a tail Worker for per-tenant logging. Bundle TypeScript on the fly with @cloudflare/worker-bundler. In the common case, you just hand off to the Worker Loader:

const stub = env.LOADER.get(tenantId, () => loadTenantCode(tenantId));
return stub.getEntrypoint('TenantWorkflow');

The Worker Loader caches by ID, so a workflow that runs many steps over many hours reuses the same dynamic Worker across them. When the isolate eventually gets evicted, the next step.do() pulls the code again and keeps going — the tenant's workflow has no idea anything happened. A Dynamic Worker boots in single-digit milliseconds using a few megabytes of memory, so the dispatch overhead is essentially free. You can have a million tenants, each with their own distinct workflow code, each spun up lazily on the step boundary where it's needed, and none of them cost anything while idle.

The escape hatch

If you want to subclass WorkflowEntrypoint yourself — to add logging around run(), wire up per-tenant observability, or thread custom state through — the library exposes the lower-level dispatchWorkflow primitive that createDynamicWorkflowEntrypoint is built on:

import { dispatchWorkflow } from '@cloudflare/dynamic-workflows';

export class MyDynamicWorkflow extends WorkflowEntrypoint {
  async run(event, step) {
    return dispatchWorkflow(
      { env: this.env, ctx: this.ctx },
      event,
      step,
      ({ metadata, env }) => loadRunnerForTenant(env, metadata),
    );
  }
}

Everything else — IDs, pause/resume, sendEvent, retries — falls through to the real Workflows engine untouched.

Dynamic Workers are the primitive

Step back from the specifics for a second. Every interesting line of this library is either a wrapper around .create() on the outbound side or a wrapper around WorkflowEntrypoint on the inbound side. The actual work — spinning up the tenant's code, sandboxing it, routing RPC across the boundary, caching the isolate, hibernating between steps — is all done by Dynamic Workers underneath.

That's the real story, and it's a lot bigger than Workflows

Dynamic Workers is the primitive that swallows everything. Durable Object Facets is the same pattern applied to Durable Objects. Dynamic Workflows is that same pattern applied to WorkflowEntrypoint. Each one is the same small amount of envelope-and-unwrap glue between the static binding you've always had and the dynamic version you can now hand to your customers.

And we're not stopping at Workflows. Every binding that Workers currently exposes is heading for a dynamic counterpart — queues where each producer ships its own handler, caches, databases, object stores, AI bindings, and MCP servers where every tenant brings their own tools. Whatever you bind to a Worker today, you will soon be able to bind dynamically: dispatched per tenant, per agent, per request, at zero idle cost.

The unit economics of running a platform like this are, frankly, absurd. Shipping a multi-tenant product used to mean giving every customer their own container, their own database, their own disk, their own scheduler, and stitching it together with orchestration glue, service meshes, and hair-pulling billing math. Many of these applications have to support thousands of customers at the very least; millions, at the most. On Dynamic Workers and everything composing on top of them, idle tenants cost approximately nothing and active tenants share the same hardware through isolate-level multi-tenancy. The floor drops several orders of magnitude. A platform that used to cap out at thousands of paying customers can now reasonably serve tens of millions.

What this unlocks

Agent platforms that plan like engineers

Coding agents — OpenCode, Claude Code, Codex, Pi — have been proving for the past year that LLMs are far better at writing code than at making sequential tool calls. The Cloudflare Agents SDK and Project Think extend that insight into durable execution: with primitives like fibers and sub-agents, an agent's long-running plan can survive crashes, hibernation, and redeploys without the user noticing.

Dynamic Workflows is the piece that lets that plan be a first-class Cloudflare Workflow — something the agent literally writes and the platform literally runs, with the full durability machinery behind it. A run(event, step) function the model wrote a minute ago, where every step.do(...) is independently retryable, every step.sleep('24 hours') hibernates for free, and every step.waitForEvent(...) waits indefinitely for the human to approve the next action. The agent writes the workflow; the platform runs it; neither has to know ahead of time what the plan looks like.

SDKs and frameworks where the user brings the logic

If you're shipping a framework where your customer writes the run(event, step) function — a workflow builder UI, a visual automation tool, a per-tenant extension system, a low-code tool for non-developers — Dynamic Workflows is now the primitive that makes it work without compromise. You call wrapWorkflowBinding({ tenantId }) once, hand the result to their code as WORKFLOWS, and every workflow instance they create is automatically tagged, routed back, and executed in their sandbox. The framework owns the Worker Loader; the user owns the workflow; neither has to care about the other.

CI/CD at primitive speed

Here's the use case that's been getting us most excited.

Every CI/CD platform in existence is, underneath, a dispatcher of per-repo configuration files: "run these steps, in this order, with these secrets, cache these directories, upload these artifacts." Each repo has its own pipeline. Each branch might have its own variant. Each pull request spawns an instance of that pipeline that has to run to completion, survive a machine crash, retry a flaky step, stream logs, pause for approvals, and persist results.

That's exactly the shape of a durable workflow. The reason CI hasn't been built that way until now is that nobody had a cloud primitive where the workflow itself is different for every repo, dispatched at runtime, at zero provisioning cost. Now you do.

Here's what a CI pipeline looks like when it's just code your customer ships with their repo — say, in .cloudflare/ci.ts. The workflow itself is real; the runInSandbox() / summarise() / GitHub binding helpers below are platform-provided glue, the kind of thing you'd ship once in your dispatcher:

import { WorkflowEntrypoint } from 'cloudflare:workers';

export class CIPipeline extends WorkflowEntrypoint {
  async run(event, step) {
    const { repo, sha, branch, pr } = event.payload;

    // Fork an isolated copy of the repo at this commit. Seconds, not minutes.
    const workspace = await step.do('checkout', () =>
      this.env.ARTIFACTS.fork(repo, { sha })
    );

    await step.do('install', () => runInSandbox(workspace, ['pnpm', 'install']));

    // Each parallel step is independently retryable.
    const [lint, test, build] = await Promise.all([
      step.do('lint',  () => runInSandbox(workspace, ['pnpm', 'lint'])),
      step.do('test',  () => runInSandbox(workspace, ['pnpm', 'test'])),
      step.do('build', () => runInSandbox(workspace, ['pnpm', 'build'])),
    ]);

    if (pr) {
      await step.do('comment', () =>
        this.env.GITHUB.commentOnPR(repo, pr, summarise({ lint, test, build }))
      );
    }

    // Workflow hibernates until approval arrives. No VM held open.
    if (branch === 'main') {
      await step.waitForEvent('approval', { type: 'deploy-approval', timeout: '24 hours' });
      await step.do('deploy', () => runInSandbox(workspace, ['pnpm', 'deploy']));
    }
  }
}

The platform owns the dispatcher. It ingests a webhook, figures out which repo it came from, loads that repo's CIPipeline class as a Dynamic Worker, and hands the run-off to Dynamic Workflows. The platform doesn't know what's in the pipeline. It doesn't need to. It's running a durable function that happens to live in the customer's repo.

Now line up what each step actually does:

Artifacts gives every repo a Git-native, versioned filesystem that lives on Cloudflare's globally distributed network. ArtifactFS hydrates the tree lazily, so even a multi-GB repo is ready to work within single-digit seconds — and fork() gives each CI run its own isolated copy, with no git clone tax.
Dynamic Workers run each lightweight step (lint, format, typecheck, bundle) in a sandboxed isolate that boots in milliseconds, on the same machine as the repo's data. No VM provisioning, no image pull, no cold start.
Dynamic Workflows holds the whole run together. Steps are retryable and durable. The run hibernates for free while waiting on approvals. State and progress survive deploys, evictions, and crashes.
Sandboxes handle the heavy corners — the step that needs docker build, the integration suite that needs Postgres running, the Rust compile that needs 8 cores. Snapshots to R2 mean even those warm-start in a couple of seconds.

A traditional CI run for a mid-sized JS repo looks something like: allocate VM (15-30s) → pull base image (10s) → git clone (10s) → npm ci (30-60s) → run tests (actual work) → tear down. Several minutes of ceremony before the first test runs, and you pay for the whole VM the whole time.

The same pipeline on this stack looks like: edge fork of the repo (seconds) → each step boots a fresh isolate or snapshot-restored sandbox in milliseconds → runs the actual work → hibernates. Nothing has to cold-start. Nothing has to be provisioned ahead of time. Nothing has to be kept warm. The repo doesn't move — the compute comes to it.

CI has never been this fast, and the reason it hasn't is that none of these primitives have existed together in one place. Now they do.

Try it

@cloudflare/dynamic-workflows is MIT-licensed and on npm today:

npm install @cloudflare/dynamic-workflows

It runs on top of Dynamic Workers, which is in open beta on the Workers Paid plan. The repo includes a working example — an interactive browser playground where you write a TenantWorkflow class, hit Run, and watch the steps execute with live-streaming logs and a per-step checklist that lights up as each step.do() commits. Clone it, deploy it, show it to a coworker.

If you're a platform, an SDK, a framework, or a CI/CD product, and you want to give your customers their own workflows without running their code in your own process: this is the primitive we built for you. If you're building agents that write durable plans, this is the primitive that makes those plans real Workflows. If you're just watching all of this, and it looks fun to build on top of: we'd love to see what you make.

Find us in the Cloudflare Developers Discord.

Post-quantum encryption for Cloudflare IPsec is generally available

Sharon Goldberg — Thu, 30 Apr 2026 14:00:00 GMT

While more than two-thirds of human-generated TLS traffic to Cloudflare is already protected by post-quantum cryptography, the world of site-to-site networking has been a different story. For years, the IPsec community remained caught between the high bar of Internet-scale interoperability and the niche requirements of specialized hardware. That gap is now closing.

Earlier this month, we announced that Cloudflare has moved its target for full post-quantum security forward to 2029, spurred by several recent advances in quantum computing. To advance that goal, we’ve made post-quantum encryption in Cloudflare IPsec generally available.

Using the new IETF draft for hybrid ML-KEM (FIPS 203), we’ve successfully tested interoperability with branch connectors from Fortinet and Cisco — meaning you can start protecting your wide-area network (WAN) against harvest-now-decrypt-later attacks today using hardware you already have.

This post explains how we implemented the new hybrid IPsec handshake, why it took four years longer to land than its TLS counterpart, and how the industry is finally consolidating around a standard that works at Internet scale.

Cloudflare IPsec

Cloudflare IPsec is a WAN Network-as-a-Service that replaces legacy network architectures by connecting data centers, branch offices, and cloud VPCs to Cloudflare's global IP Anycast network. Customers get simplified configuration, high availability (if a data center becomes unavailable, traffic is automatically rerouted to the nearest healthy one), and the scale of Cloudflare's global network. This is done through encrypted IPsec tunnels that support both site-to-site WAN, outbound Internet connections, and connectivity to the Cloudflare One SASE platform.

Post-quantum encryption in IPsec

Cloudflare IPsec now uses post-quantum encryption with hybrid ML-KEM (FIPS 203) to stop harvest-now-decrypt-later attacks. These are attacks where an adversary harvests data today and then decrypts later, after Q-Day, when there are powerful quantum computers that can break the classical public key cryptography used across the Internet. Harvest-now-decrypt-later attacks are becoming a concern for more organizations as Q-Day approaches faster than expected.

ML-KEM (Module-Lattice-Based Key-Encapsulation Mechanism) is a post-quantum cryptography algorithm that is based on mathematical assumptions that are not known to be vulnerable to attacks by quantum computers. It does not require special hardware or a dedicated physical link between sender and receiver. ML-KEM is intentionally designed to be implemented in software across standard processors to provide post-quantum encryption of network traffic.

Draft-ietf-ipsecme-ikev2-mlkem specifies post-quantum encryption for IPsec using hybrid ML-KEM, which combines the well-understood security of classical Diffie-Hellman and the post-quantum security of ML-KEM in a single, standards-compliant handshake. Specifically, a classical Diffie-Hellman exchange runs first, its derived key encrypts a second exchange that runs ML-KEM, and the outputs of both are mixed into the session keys that secure IPsec data plane traffic sent using the Encapsulating Security Payload (ESP) protocol.

Our interoperable implementation

Earlier we announced the closed beta of our implementation of draft-ietf-ipsecme-ikev2-mlkem in production in our Cloudflare IPsec product and tested it against a reference implementation (strongswan). Now that we have made this implementation generally available, we have also confirmed interoperability with several other vendors, including Cisco and Fortinet, which is a big win for this new standard.

Cisco: Customers using Cisco 8000 Series Secure Routers after version 26.1.1 as their branch connector can also now establish post-quantum Cloudflare IPsec tunnels per draft-ietf-ipsecme-ikev2-mlkem.

Fortinet: Customers using Fortinet FortiOS 7.6.6 and later as their branch connector can now establish post-quantum Cloudflare IPsec tunnels to Cloudflare's global network per draft-ietf-ipsecme-ikev2-mlkem.

The importance of being interoperable

Given that upgrading cryptography is hard and can take years, our 2029 target date for a full update to post-quantum cryptography is going to require concentrated effort. That’s why we hope the IPsec community continues to focus on the development of interoperable standards like draft-ietf-ipsecme-ikev2-mlkem.

Let us explain why these standards are vitally important. A full specification for hybrid ML-KEM in IPsec, draft-ietf-ipsecme-ikev2-mlkem, became available only in late 2025. That's roughly four years after support for hybrid ML-KEM landed in TLS. (In fact, Cloudflare turned on hybrid post-quantum key agreement with TLS in 2022, even before NIST finalized the standardization of ML-KEM, because the TLS community quickly converged on a single, interoperable approach and pushed it into production. Today more than two-thirds of the human-generated TLS traffic to Cloudflare's network is protected with hybrid ML-KEM.)

The four-year delay is likely due in part to the IPsec community's continued interest in Quantum Key Distribution (QKD), as codified in RFC 8784, published in 2020. We've written before about why QKD is not part of our post-quantum strategy: QKD requires specialized hardware and a dedicated physical link between the two parties, which fundamentally means it will not operate at Internet scale. Also, QKD does not provide authentication, so you still need post-quantum cryptography anyway to stop active attackers. It’s difficult to find implementations of QKD that interoperate across vendors.

The U.S. NSA, Germany's BSI, and the UK's NCSC have all warned against solely relying on QKD. Post-quantum cryptography, by contrast, runs on the hardware you already have, authenticates the parties at both ends, and works end-to-end across the Internet.

RFC 9370, published in 2023, opened the door to post-quantum cryptography in IPsec, allowing up to seven key exchanges to be run in parallel with classical Diffie-Hellman. However, RFC 9370 did not specify which ciphersuites should be used in these parallel key exchanges. In the absence of that specification, some vendors shipped early implementations under RFC 9370 before the hybrid ML-KEM draft was available, defining their own ciphersuites including some which are not NIST-standardized. This is exactly the kind of “ciphersuite bloat” NIST SP 800 52r2 warned against. And the risks to interoperability have played out in practice: Cloudflare IPsec does not yet interoperate with Palo Alto Networks' RFC 9370–based implementation, because it was launched before draft-ietf-ipsecme-ikev2-mlkem was available.

Fortunately, we now have draft-ietf-ipsecme-ikev2-mlkem that fills in the gaps in RFC 9370, specifying hybrid ML-KEM as one of the key exchange mechanisms that can be operated in parallel with classical Diffie-Hellman. We hope to add Palo Alto Networks to the list of interoperable post-quantum branch connectors as the industry continues to consolidate around draft-ietf-ipsecme-ikev2-mlkem.

But the journey towards interoperable post-quantum IPsec standards is not over yet. While draft-ietf-ipsecme-ikev2-mlkem supports post-quantum encryption, we still need IPsec standards for post-quantum authentication, so that we can stop attacks by quantum adversaries on live systems after Q-Day. Given the shortened timeline for full post-quantum readiness, we hope the IPsec community will continue to focus on interoperable PQC implementations, rather than diverting focus to niche use cases with QKD.

Towards an interoperable post-quantum Internet

At Cloudflare, we’re helping make a secure and post-quantum Internet accessible to everyone, without specialized hardware and at no extra cost to our customers. Post-quantum Cloudflare IPsec is one more step on our path to full post-quantum security by 2029, and we’re doing it in a way that ensures that the Internet remains open and interoperable for years to come.

Agents can now create Cloudflare accounts, buy domains, and deploy

Sid Chatterjee — Thu, 30 Apr 2026 13:00:00 GMT

Coding agents are great at building software. But to deploy to production they need three things from the cloud they want to host their app — an account, a way to pay, and an API token. Until now these have been tasks that humans handle directly. Increasingly, agents handle them on the user’s behalf. The agent needs to perform all the tasks a human customer can. They’re given higher-order problems to solve and choose to use Cloudflare and call Cloudflare APIs.

Starting today, agents can provision Cloudflare on behalf of their users. They can create a Cloudflare account, start a paid subscription, register a domain, and get back an API token to deploy code right away. Humans can be in the loop to grant permission and must accept Cloudflare's terms of service, but no human steps are otherwise required from start to finish. There’s no need to go to the dashboard, copy and paste API tokens, or enter credit card details. Without any extra setup, agents have everything they need to deploy a new production application in one shot. And with Cloudflare’s Code Mode MCP server and Agent Skills, they’re even better at it.

This all works via a new protocol that we’ve co-designed with Stripe as part of the launch of Stripe Projects.

We’re excited to launch this new partnership with Stripe, and also to offer $100,000 in Cloudflare credits to all new startups who incorporate using Stripe Atlas. But this new protocol also makes it possible for any platform with signed-in users to integrate with Cloudflare in the same way Stripe does, with zero friction for the end user.

How it works: zero to production without any setup or manual steps

Install the Stripe CLI with the Stripe Projects plugin, login to Stripe, and then start a new project:

stripe projects init

Then prompt your agent to build something new and deploy it to a new domain. You can watch a condensed two-minute video of this entire flow below:

If the email you’re logged into Stripe with already has a Cloudflare account, you’ll be prompted with a typical OAuth flow to grant the agent access. If there is no existing Cloudflare account for the email you’re logged in with, Cloudflare will provision an account automatically for you and your agent:

You will see the agent build and deploy a site to a new Cloudflare account, and then use the Stripe Projects CLI to register the domain:

The agent will prompt for input and approval when necessary. For example, if your Stripe account doesn’t yet have a linked payment method, the agent will prompt you to add one:

At the end, the agent has deployed to production, and the app runs on the newly registered domain:

The agent has gone from literal zero, no Cloudflare account at all, without any preconfigured Agent Skills or MCP server, to having:

Provisioned a new Cloudflare account
Obtained an API token
Purchased a domain
Deployed an app to production

But wait — how did the agent discover that it could do all of this? How did it know what services it could provision, and how to purchase a domain? How did it gain the context it needed to understand how to deploy to Cloudflare? Let’s dig in.

How the protocol and integration works

There are three components to the interaction between the agent, Stripe, and Cloudflare shown above:

Discovery — the agent can call a command to query the catalog of available services.
Authorization — the platform attests to the identity of the user, allowing providers to provision accounts or link existing ones, and securely issue credentials back to the agent.
Payment — the platform provides a payment token that providers can use to bill the customer, allowing the agent to start subscriptions, make purchases and be billed on a usage basis.

These build on prior art and existing standards like OAuth, OIDC and payment tokenization — but are used together to remove many steps that might otherwise require a human in the loop.

Discovery: how agents find services they can provision themselves

In the agent session above, before the agent ran the CLI command stripe projects add cloudflare/registrar:domain, it first had to discover the Cloudflare Registrar service. It did this by calling the stripe projects catalog command, which returns available services:

The full set of Cloudflare products and services from other providers is long and growing — arguably overwhelming to humans. But for agents, this catalog of services is exactly the context they need. The agent chooses services to use from this catalog based on what the user has asked them to do and the user’s preferences — but the user needs no prior knowledge of what services are offered by which providers, and does not need to provide any input. Providers like Cloudflare make this catalog available via a simple REST API that returns JSON, and that gives agents everything they need.

Authorization: instant account creation for new users

When the agent chooses a service and provisions it (ex: stripe projects add cloudflare/registrar:domain), it provisions the resource within a Cloudflare account. But how is it able to create one on demand, without sending a human to a signup page?

Remember how at the start, the user signed in to their Stripe account? Stripe acts as the identity provider, attesting to the user’s identity. Cloudflare automatically provisions a new account for the user if no account already exists, and returns credentials back to the Stripe Projects CLI, which are securely stored, but available to the agent to use to make authenticated requests to Cloudflare. This means if someone is brand new to Cloudflare or other services, they can start building right away with their agent, without extra steps.

If the user already has a Cloudflare account, they’re sent through a standard OAuth flow to grant access to the Stripe Projects CLI, allowing them to provision resources on their existing Cloudflare account.

Payment: give your agent a budget it can spend, without giving it your credit card info

You might rightly worry, “What if my agent goes a bit overboard and starts buying dozens of domains? Will I end up on the hook for a massive bill? Can I really trust my agent with my credit card?”

The protocol accounts for this in two ways. When an agent provisions a paid service, Stripe includes a payment token in the request to the Provider (Cloudflare). Raw payment details like credit card numbers aren’t ever shared with the agent. Stripe then sets a default limit of $100.00 USD/month as the maximum the agent can spend on any one provider. When you’re ready to raise this limit, you can then set Budget Alerts on your Cloudflare account.

Any platform with signed-in users can integrate with Cloudflare in the same way Stripe does

Any platform with signed-in users can act as the “Orchestrator”, playing the same role Stripe does with Stripe Projects, and integrate with Cloudflare.

Let’s say your product is a coding agent. You’d love for people to be able to take what they’ve built and get it deployed to production, using Cloudflare and other services. But the last thing you want is to send people down a maze of authorization flows and decision trees of where and how to deploy it. You just want to let people ship.

Your platform acts as the Orchestrator, with the already signed-in user. When your user needs a domain, a storage bucket, a sandbox to give their agent, or anything else, you make one API call to Cloudflare to provision a new Cloudflare account to them, and get back a token to make authenticated requests on their behalf.

Or let’s say you want Cloudflare customers to be able to easily provision your service, similar to how Cloudflare is partnering with Planetscale to make it possible to create Planetscale Postgres databases directly from Cloudflare. We started working with Planetscale on this well before this new protocol got off the ground, but the flow here is quite similar. Cloudflare acts as the Orchestrator, letting you connect to your PlanetScale account, create databases, and use the user’s existing payment method for billing.

This new protocol starts to standardize the types of cross-product integrations that many platforms have been doing for years, often in ways that were one off or bespoke to a particular platform. Without a standard, each integration required engineering work that often couldn’t be leveraged for future integrations. Similar to how the OAuth standard made it possible to delegate access to your account to other platforms, the protocol uses OAuth and extends further into payments and account creation, doing so in a way that treats agents as a first-class concern.

We’re excited to continue evolving the standard, and to work with Stripe on sharing a more official specification soon. We’re also excited to integrate with more platforms — email us at agenticpartnerships@cloudflare.com, and tell us how you want your platform to integrate with Cloudflare.

Give your agent the power to provision and pay

Stripe Projects is in open beta, and you can get started even if you don’t yet have a Cloudflare account. Just install the Stripe CLI, log in to Stripe, and then start a new project:

stripe projects init

Prompt your agent to build something new on Cloudflare, and show us what you’ve built!

Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions

David Belson — Tue, 28 Apr 2026 13:00:00 GMT

In the first quarter of 2026, government-directed shutdowns figured prominently, with prolonged Internet blackouts in both Uganda and Iran, a stark contrast to the lack of observed government-directed shutdowns in the same quarter a year prior. This quarter, we also observed a number of Internet disruptions caused by power outages, including three separate collapses of Cuba's national electrical grid. Military action continued to disrupt connectivity in Ukraine and also impacted hyperscaler cloud infrastructure in the Middle East. Severe weather knocked out Internet connectivity in Portugal, while cable damage disrupted connectivity in the Republic of Congo. A technical problem hit Verizon Wireless in the United States, and unknown issues briefly disrupted connectivity for customers of providers in Guinea and the United Kingdom.

This post is intended as a summary overview of observed and confirmed disruptions and is not an exhaustive or complete list of issues that have occurred during the quarter. A larger list of detected traffic anomalies is available in the Cloudflare Radar Outage Center. Note that both bytes-based and request-based traffic graphs are used within this post to illustrate the impact of the observed disruptions, with the choice of metric generally made based on which better illustrates the impact of the disruption.

Government-directed shutdowns

Uganda

In advance of the January 15 presidential election, Ugandan authorities ordered a nationwide Internet shutdown. The Uganda Communications Commission (UCC) instructed mobile network operators to suspend public Internet access, effective 18:00 local time (15:00 UTC) on January 13. The UCC reportedly defended the shutdown as necessary to "curb misinformation, disinformation, electoral fraud and related risks." Domestic traffic at the Uganda Internet Exchange Point (UIXP) dropped from approximately 72 Gbps to 1 Gbps as a result of the action taken.

Similarly, Cloudflare data shows a near-complete loss of traffic from Uganda coincident with the start of the shutdown, with traffic remaining effectively at zero through 23:00 local time (20:00 UTC) on January 17, when Internet connectivity was partially restored after incumbent President Yoweri Museveni was declared winner of his seventh term.

Full Internet restoration was announced by the UCC on January 26, with mobile network operators MTN Uganda and Airtel Uganda both confirming on social media that restrictions had been lifted. The shutdown prompted lawsuits against UCC and the telecoms companies and drew criticism from digital rights organizations including CIPESA.

Uganda also blocked Internet access during its 2021 election. Authorities had repeatedly promised this time would be different, stating as recently as January 5 that "claims suggesting otherwise are false, misleading."

Iran

Iranian citizens spent a large part of Q1 2026 offline, or with severely limited connectivity, due to two nationwide Internet shutdowns. The first began around 20:00 local time (16:30 UTC) on January 8, and we explored the impact seen over the first few days in our What we know about Iran’s Internet shutdown blog post. Traffic from Iran remained near zero until January 21, when a small amount of traffic returned, only to disappear a little over 24 hours later. A similar brief restoration also occurred on January 25, before traffic recovered more aggressively starting on January 27.

A near-complete loss of announced IPv6 address space started several hours before the drop in traffic took place on January 8. Asiatech (AS43754) was by far the single largest contributor, losing 4.46 million /48-equivalents, accounting for ~9.4% of Iran's entire IPv6 space loss on its own. RASANA (AS31549) was the second-largest, losing 4.19 million /48-equivalents (~8.8% of the country total). As would be expected, this resulted in the share of IPv6 traffic in Iran going to zero. Given the gap in timing between this change and the loss of traffic across the country, this may have been a leading indicator of what was about to happen, but likely not a direct cause of it. Some nominal shifts in announced IPv4 address space are visible during the shutdown, but levels remained fairly consistent during the shutdown period. These observations suggest that the shutdown was implemented by other means, such as filtering.

Cloudflare Radar social media posts (X, Bluesky, Mastodon) throughout January and into early February documented our observations about the state of connectivity in Iran over the course of that month.

On February 28, as military strikes on Iran escalated, a second nationwide Internet shutdown began. Cloudflare Radar observed a sharp drop in traffic from Iran beginning around 10:30 local time (07:00 UTC). Traffic levels fell to well under 1% of previous levels, with only small amounts of Web and DNS traffic egressing the country.

No significant shifts in announced IP address space were observed around the onset of this shutdown. IPv4 space remained fairly consistent, and IPv6 space remained consistently volatile, suggesting that route withdrawals were not the cause of this second shutdown.

The continued announcement of IP address space, and the presence of traffic from the country, even if just a small amount, supports reports that the shutdown was effectively achieved through aggressive filtering, with so-called “whitelists” and “white SIM cards” restricting access to only approved Internet sites by selected users.

Iran remained effectively offline through the end of the quarter. As of late April, this shutdown remains largely in place, making it one of the longest sustained Internet disruptions observed in recent years.

Republic of Congo

On March 15, as the Republic of Congo held a presidential election expected to extend President Denis Sassou Nguesso's 42-year rule, a near-complete shutdown of Internet connectivity was observed in the country. Traffic from the country dropped precipitously around 06:30 local time (05:30 UTC), falling to near zero for approximately 60 hours through the election period and its immediate aftermath. Traffic began recovering around March 17 at 18:20 local time (17:20 UTC), rapidly returning to pre-shutdown levels. While Congolese authorities provided no official explanation for the drop in traffic, similar shutdowns were put into place during the 2021 and 2016 elections.

Military action

Ukraine (Dnipropetrovsk)

On January 7-8, Russian attacks on energy infrastructure in Ukraine caused power outages that disrupted Internet connectivity in Dnipropetrovsk and surrounding regions. Cloudflare Radar observed a significant drop in traffic from the region, reaching nearly 50% below the prior week’s levels, starting around 22:45 local time (20:45 UTC) on January 7. Recovery began approximately 06:00 local time (04:00 UTC) on January 8.

Ukraine (Kharkiv)

On January 26, Russia launched a drone and missile attack targeting energy infrastructure in Kharkiv. Cloudflare Radar observed an approximately 50% drop in traffic from the region beginning around 19:15 local time (17:15 UTC). Recovery progressed through January 27 as power was gradually restored.

Amazon Web Services Middle East (United Arab Emirates and Bahrain)

One of the most unusual disruptions of the quarter was the physical damage inflicted on Amazon Web Services data centers in the Middle East by drone strikes tied to the ongoing regional conflict. On the morning of March 1 (UTC), Amazon reported a fire started after objects hit a UAE data center. The following day, the company confirmed that two of its facilities in the United Arab Emirates (me-central-1 region) were "directly struck" by drones and that a facility in Bahrain (me-south-1 region) was also taken offline after being damaged by a nearby strike.

Cloudflare's Cloud Observatory data showed elevated connection failure rates for the me-central-1 and me-south-1 regions beginning March 1-2 and remaining higher for multiple days. Connection failures occur when Cloudflare fails to successfully connect to an origin server when attempting to retrieve uncacheable content, or content not in/expired from cache. These graphs illustrate the increased rate of failures experienced when attempting to connect to servers in these impacted regions.

In a status post on the AWS Health Dashboard, Amazon acknowledged: "These strikes have caused structural damage, disrupted power delivery to our infrastructure, and in some cases required fire suppression activities that resulted in additional water damage." The company warned that instability was likely to continue in the Middle East, making operations "unpredictable," and urged customers with workloads in the affected regions to back up their data or migrate to other AWS regions.

The AWS me-south-1 region in Bahrain suffered an additional disruption on March 23, following further drone activity.

Power outages

Argentina (Buenos Aires)

On January 15, a power outage struck Buenos Aires during a summer heat wave. The outage caused nominal disruptions in Internet connectivity for customers of multiple providers in the Buenos Aires area, including Telecom Argentina (AS7303), Telecentro (AS27747), and IPLAN (AS16814), with traffic from these networks dropping between 17:30 and 19:30 local time (20:30 - 22:30 UTC). Traffic returned to expected levels approximately two hours after the outage began.

Moldova and Ukraine

An emergency power cut on Ukraine's electricity grid on January 31 caused widespread power outages affecting Moldova and several Ukrainian regions including Kyiv and Kharkiv. Moldova was reportedly hit by widespread power cuts amid the Ukrainian grid problems, and the Ukrainian Energy Minister explained the cross-border impact, noting “Today at 10:42 a.m. (08:42 GMT), a technical malfunction occurred, causing a simultaneous shutdown of the 400 kilovolt line between the power grids of Romania and Moldova and the 750 kilovolt line between western and central Ukraine.” Traffic from Moldova, Kyiv, and Kharkiv began falling around 10:42 local time (08:42 UTC), reaching as much as 46% below the prior week, with recovery occurring around 14:00 local time (12:00 UTC).

Paraguay

On February 18, widespread power outages struck Paraguay after key transmission lines went out of service. The National Electricity Administration (ANDE) posted a series of updates on X documenting the incident and efforts to restore power. Internet traffic from Paraguay dropped as much as 72% compared to the prior week beginning around 15:15 local time (18:15 UTC), and the disruption lasted nearly three hours, with recovery occurring by approximately 18:30 local time (21:30 UTC).

Dominican Republic

A major failure in the Interconnected National Electric System (SENI) of the Dominican Republic caused a widespread power outage on February 23. The state-owned electric company Empresa de Transmisión Eléctrica Dominicana (ETED) posted updates on X documenting the failure and the recovery effort. Internet traffic from the country dropped sharply beginning around 10:50 local time (14:50 UTC), and recovered around midnight local time (04:00 UTC) on February 24, in line with a confirmation posted by ETED that “The authorities of the electric sector reported that the Interconnected National Electric System (SENI) was fully restored to 100% at 11:53 p.m. on this Monday…”.

Cuba

Cuba experienced three separate collapses of its National Electric System (SEN) during March, each causing widespread Internet disruption, reflecting the severe deterioration of the country's electrical infrastructure. (Power outages also disrupted Internet connectivity in Cuba during September and March 2025, and October 2024.)

The first collapse occurred on March 4, when a disconnection of Cuba's National Electroenergy System cascaded from Camagüey to Pinar del Río, cutting power to the western half of the island, including Havana. OSDE/UNE (Cuba's Electric Union) confirmed the failure on social media. Cloudflare Radar data showed traffic from the island dropping by nearly half beginning around 12:15 local time (17:15 UTC), with traffic recovering by approximately 05:01 local time (10:01 UTC) on March 5.

The second collapse occurred on March 16, when Cuba's entire National Electric Power System was disconnected. EnergíaMinas Cuba posted updates on the situation on X. Cloudflare Radar data again shows a significant loss of traffic from Cuba beginning around 13:35 local time (17:35 UTC) on March 16, dropping approximately 65%. Traffic returned to expected levels by approximately 20:00 local time on March 17 (00:00 UTC on March 18), with the disruption lasting over 30 hours.

The third collapse (the second in just a week) happened just days later, on March 21-22. EnergíaMinas Cuba and OSDE/UNE again provided situation updates via X. Cloudflare Radar data shows another significant loss of traffic from Cuba beginning around 18:30 local time (22:30 UTC) on March 21, falling as much as 77% compared to the previous week. Traffic recovered around 21:39 local time on March 22 (01:39 UTC on March 23).

U.S. Virgin Islands

According to a Facebook post from the Virgin Islands Water and Power Authority (WAPA) on March 24, a loss of generation at the Richmond Power Plant combined with damage to an underground cable caused a power outage affecting St. Croix and St. Thomas in the U.S. Virgin Islands. Cloudflare Radar data shows traffic from local provider VI Powernet (AS14434), the primary ISP for the U.S. Virgin Islands, dropping to near zero beginning around 12:15 local time (16:15 UTC), with recovery occurring by approximately 14:45 local time (18:45 UTC). Although VI Powernet experienced a near-complete outage, traffic from St. Thomas only fell by around 60%, and approximately 40% from St. Croix due to the presence of other providers.

Severe weather

Portugal

Storm Kristin made landfall in Portugal on January 28, causing widespread damage and power outages across the country. Approximately 1,500 incidents were registered by Civil Protection between midnight and 08:00 local time (00:00 - 08:00 UTC), with the hardest-hit areas being the districts of Leiria and Coimbra. Significant infrastructure damage was reported, and by 07:00 local time (07:00 UTC), over 850,000 E-Redes customers were without electricity.

The associated power outages disrupted Internet connectivity across Portugal, which Cloudflare Radar observed primarily in the regions of Leiria, Santarém, and Coimbra beginning around 04:10 local time (04:10 UTC) on January 28. Internet traffic dropped as much as 70% in Leiria, and 52% in Coimbra.

Recovery was slow: over 290,000 customers remained without power as late as January 30, and Cloudflare continued tracking gradual recovery of regional traffic over the following weeks. (Coimbra returned to expected levels within the first several days after the storm.) More than three weeks after the storm, over 6,000 customers in Leiria reportedly remained without electricity.

Cable damage

Republic of Congo

Just after the New Year, Internet connectivity in the Republic of Congo was disrupted by an incident on the WACS (West Africa Cable System) submarine cable. Congo Telecom (AS37451) posted on X announcing "an international incident on the WACS cable" was causing Internet disruptions, and stating that backup solutions had been activated. Cloudflare Radar observed a significant drop in traffic from Congo beginning around 00:00 local time on January 2 (23:00 UTC on January 1), falling to 82% below expected levels. A follow-up post from Congo Telecom confirmed that repairs were ongoing, with users potentially experiencing slowdowns during peak hours. Traffic returned to expected levels by approximately 15:00 local time (14:00 UTC) on January 4.

Technical problems

Verizon Wireless (United States)

On January 14, a software issue impacted voice and data services for customers of Verizon Wireless (AS6167) across the United States. Verizon published an official statement acknowledging that the outage began January 14 and that by 22:15 ET (03:15 UTC on January 15) the issue had been resolved. Multiple updates on X from @VerizonNews kept subscribers informed throughout the evening. Cloudflare Radar data shows a minor drop in traffic beginning around 12:30 ET (17:30 UTC) on January 14, consistent with the reported onset of the outage.

Grenada

On February 9-10, customers of Flow Grenada (AS46650) – the primary Internet provider serving Grenada – experienced an island-wide service disruption lasting approximately 12 hours. The provider posted on Facebook acknowledging a service disruption, though no details about the root cause were provided. Cloudflare Radar data shows traffic from the network initially dropping around 11:30 local time (15:30) UTC on February 9, disappearing completely around 20:00 local time (midnight UTC on February 10), and recovering by approximately 23:30 local time (03:30 UTC on February 10). Routing data shows a complete loss of announced IPv4 space at the same time traffic dropped to zero. Major spikes in BGP announcements around the time the disruption initially started, and bookending the complete outage, suggest that the whole event may have been routing-related.

Unknown cause

Orange Guinée (Guinea)

Customers of Orange Guinée (AS37461) in Guinea were unable to make phone calls or access the Internet starting around 10:45 local time (10:45 UTC) on January 6. Orange Guinée subsequently confirmed an "exceptional breakdown" affecting mobile phone and Internet services due to a technical incident, with teams mobilized to restore service. Service was restored by approximately 14:00 local time (14:00 UTC) that same day. No further details on the root cause of the incident were publicly disclosed.

TalkTalk (United Kingdom)

On March 25, customers of UK broadband provider TalkTalk (AS13285) reported widespread service disruptions. TalkTalk acknowledged the issues on X but did not publicly disclose a root cause. Cloudflare Radar observed traffic from the provider drop nearly 50% as compared to the previous week beginning around 07:00 local time (07:00 UTC), with service restored by approximately 08:15 local time (08:15 UTC).

A quarter marked by major disruptions

The first quarter of 2026 was marked by an unusually high number of severe and prolonged Internet disruptions. The major government-directed shutdowns, particularly the extended blackouts in Uganda and Iran, underscore how Internet access continues to be weaponized as a tool of political control. Cuba's three separate national grid collapses in a single month paint a troubling picture of infrastructure fragility with direct consequences for connectivity. And the drone strikes on AWS data centers in the Middle East represent an unprecedented escalation as active military conflict directly and physically damaged major cloud infrastructure, with disastrous consequences for the websites and applications hosted there.

The Cloudflare Radar team is constantly monitoring for Internet disruptions, sharing our observations on the Cloudflare Radar Outage Center, via social media, and in posts on blog.cloudflare.com. Follow us on social media at @CloudflareRadar(X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via email.

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

Guy Bedford — Wed, 22 Apr 2026 13:00:00 GMT

Rust Workers run on the Cloudflare Workers platform by compiling Rust to WebAssembly, but as we’ve found, WebAssembly has some sharp edges. When things go wrong with a panic or an unexpected abort, the runtime can be left in an undefined state. For users of Rust Workers, panics were historically fatal, poisoning the instance and possibly even bricking the Worker for a period of time.

While we were able to detect and mitigate these issues, there remained a small chance that a Rust Worker would unexpectedly fail and cause other requests to fail along with it. An unhandled Rust abort in a Worker affecting one request might escalate into a broader failure affecting sibling requests or even continue to affect new incoming requests. The root cause of this was in wasm-bindgen, the core project that generates the Rust-to-JavaScript bindings Rust Workers depend on, and its lack of built-in recovery semantics.

In this post, we’ll share how the latest version of Rust Workers handles comprehensive Wasm error recovery that solves this abort-induced sandbox poisoning. This work has been contributed back into wasm-bindgen as part of our collaboration within the wasm-bindgen organization formed last year. First with panic=unwind support, which ensures that a single failed request never poisons other requests, and then with abort recovery mechanisms that guarantee Rust code on Wasm can never re-execute after an abort.

Initial recovery mitigations

Our initial attempts to address reliability in this area focused on understanding and containing failures caused by Rust panics and aborts in production Rust Workers. We introduced a custom Rust panic handler that tracked failure state within a Worker and triggered full application reinitialization before handling subsequent requests. On the JavaScript side, this required wrapping the Rust-JavaScript call boundary using Proxy‑based indirection to ensure that all entrypoints were consistently encapsulated. We also made targeted modifications to the generated bindings to correctly reinitialize the WebAssembly module after a failure.

While this approach relied on custom JavaScript logic, it demonstrated that reliable recovery was achievable and eliminated the persistent failure modes we were seeing in practice. This solution was shipped by default to all workers‑rs users starting in version 0.6, and it laid the groundwork for the more general, upstreamed abort recovery mechanisms described in the sections that follow.

Implementing `panic=unwind` with WebAssembly Exception Handling

The abort recovery mechanisms described above ensure that a Worker can survive a failure, but they do so by reinitializing the entire application. For stateless request handlers, this is fine. But for workloads that hold meaningful state in memory, such as Durable Objects, reinitialization means losing that state entirely. A single panic in one request could wipe the in-memory state being used by other concurrent requests.

In most native Rust environments, panics can be unwound, allowing destructors to run and the program to recover without losing state. In WebAssembly, things historically looked very different. Rust compiled to Wasm via wasm32-unknown-unknown defaults to panic=abort, so a panic inside a Rust Worker would abruptly trap with an unreachable instruction and exit Wasm back to JS with a WebAssembly.RuntimeError.

To recover from panics without discarding instance state, we needed panic=unwind support for wasm32-unknown-unknown in wasm-bindgen, made possible by the WebAssembly Exception Handling proposal, which gained wide engine support in 2023.

We start by compiling with RUSTFLAGS='-Cpanic=unwind' cargo build -Zbuild-std, which rebuilds the standard library with unwind support and generates code with proper panic unwinding. For example:

struct HasDropA;
struct HasDropB;
extern "C" {
    fn imported_func();
}

fn some_func() {
    let a = HasDropA;
    let b = HasDropB;
    imported_func();
}

compiles to WebAssembly as:

try
  call 
catch_all
  call 
  call 
  rethrow
end
call 
call

This ensures that even if imported_func() panics, destructors still run. Similarly, std::panic::catch_unwind(|| some_func()) compiles into:

try
  call 
  ;; set result to Ok(return value)
catch
  try
    call 
    ;; set result to Err(panic payload)
  catch_all
    call 
    unreachable
  end
end

Getting this to work end-to-end required several changes to the wasm-bindgen toolchain. The WebAssembly parser Walrus did not know how to handle try/catch instructions, so we added support for them. The descriptor interpreter also needed to be taught how to evaluate code containing exception handling blocks. At that point, the full application could be built with panic=unwind.

The final step was modifying the exports generated by wasm-bindgen to catch panics at the Rust-JavaScript boundary and surface them as JavaScript PanicError exceptions. One subtlety: Rust will catch foreign exceptions and abort when unwinding through extern "C" functions, so exports needed to be marked extern "C-unwind" to explicitly allow unwinding across the boundary. For futures, a panic rejects the JavaScript Promise with a PanicError.

Closures required special attention to ensure unwind safety was properly checked, via a new MaybeUnwindSafe trait that checks UnwindSafe only when built with panic=unwind. This quickly exposed a problem, though: many closures capture references that remain after an unwind, making them inherently unwind-unsafe. To avoid a situation where users are encouraged to incorrectly wrap closures in AssertUnwindSafe just to satisfy the compiler, we added Closure::new_aborting variants, which terminate on panic instead of unwinding in cases where unwind safety can't be guaranteed.

With panic unwinding enabled:

Panics in exported Rust functions are caught by wasm-bindgen
Panics surface to JavaScript as PanicError exceptions
Async exports reject their returned promises with a PanicError
Rust destructors run correctly
The WebAssembly instance remains valid and reusable

The full details of the approach and how to use it in wasm-bindgen are covered in the latest guide page for Wasm Bindgen: Catching Panics.

Abort recovery

Even with panic=unwind support, aborts still happen - out-of-memory errors being one common cause. Because aborts can’t unwind, there is no possibility of state recovery at all, but we can at least detect and recover from aborts for future operations to avoid invalid state erroring subsequent requests.

Panic unwind support introduced a new problem for abort recovery. When we receive an error from Wasm we don’t know if it came from an extern “C-unwind” foreign error, or if it was a genuine abort. Aborts can take many shapes in WebAssembly.

We had two options to solve this technically: either mark all errors which are definitely aborts, or mark all errors which are definitely unwinds. Either could have worked but we chose the latter. Since our foreign exception handling was directly using raw WAT-level (WebAssembly text format) Exception Handling instructions already, we found it easier to implement exception tags for foreign exceptions to distinguish them from aborting non-unwind-safe exceptions.

With the ability to clearly distinguish between recoverable and non-recoverable errors thanks to this Exception.Tag feature in WebAssembly Exception Handling, we were able to then integrate both a new abort handler as well as abort reentrancy guards. A new abort hook, set_on_abort, can be used at initialization time to attach a handler that recovers accordingly for the platform embedding’s needs.

Hardening panic and abort handling is critical to avoiding invalid execution state. WebAssembly allows deeply interleaved call stacks, where Wasm can call into JavaScript and JavaScript can re-enter Wasm at arbitrary depths, while alongside this, multiple tasks can be functioning in the same instance. Previously, an abort occurring in one task or nested stack was not guaranteed to invalidate higher stacks through JS, leading to undefined behavior. Care was required to ensure we can guarantee the execution model, and contribution in this space remains ongoing.

While aborts are never ideal, and reinitialization on failure is an absolute worst-case scenario, implementing critical error recovery as the last line of defense ensures execution correctness and that future operations will be able to succeed. The invalid state does not persist, ensuring a single failure does not cascade into multiple failures.

Extension: abort reinitialization for wasm-bindgen libraries

While we were working on this, we realized that this is a common problem for libraries used by JS that are built with wasm-bindgen, and that they would also benefit from attaching an abort handler to be able to perform recovery.

But when building Wasm as an ES module and importing it directly (e.g. via import { func } from ‘wasm-dep’), it’s not clear what the recovery mechanism would be for a Wasm abort while calling func() for an already-linked and initialized library that is in a user JS application.

While not strictly a Rust Workers use case, our team also supports JS-based Workers users who run Rust-backed Wasm library dependencies. If we could fix this problem at the same time, that could indirectly also benefit Wasm usage on the Cloudflare Workers platform.

To support automatic abort recovery for Wasm library use cases, we added support for an experimental reinitialization mechanism into wasm‑bindgen, --reset-state-function. This exposes a function that allows the Rust application to effectively request that it reset its internal Wasm instance back to its initial state for the next call, without requiring consumers of the generated bindings to reimport or recreate them. Class instances from the old instance will throw as their handles become orphaned, but new classes can then be constructed. The JS application using a Wasm library is errored but not bricked.

The full technical details of this feature and how to use it in wasm-bindgen are covered in the new wasm-bindgen guide section Wasm Bindgen: Handling Aborts.

Maturing the Rust Wasm Exception Handling ecosystem

Upstream contributions for this work did not stop at the wasm-bindgen project. Building for Wasm with panic=unwind still requires an experimental nightly Rust target, so we’ve also been working to advance Rust’s Wasm support for WebAssembly Exception Handling to help bring this to stable Rust.

During the development of WebAssembly Exception Handling, a late‑stage specification change resulted in two variants: legacy exception handling and the final modern exception handling "with exnref". Today, Rust’s WebAssembly targets still default to emitting code for the legacy variant. While legacy exception handling is widely supported, it is now deprecated.

Modern WebAssembly Exception Handling is supported as of the following JS platform releases:

Runtime	Version	Release Date
v8	13.8.1	April 28, 2025
workerd	v1.20250620.0	June 19, 2025
Chrome	138	June 28, 2025
Firefox	131	October 1, 2024
Safari	18.4	March 31, 2025
Node.js	25.0.0	October 15, 2025

As we were investigating the support matrix, the largest concern ended up being the Node.js 24 LTS release schedule, which would have left the entire ecosystem stuck on legacy WebAssembly Exception Handling until April 2028.

Having discovered this discrepancy, we were able to backport modern exception handling to the Node.js 24 release, and even backport the fixes needed to make it work on the Node.js 22 release line to ensure support for this target. This should allow the modern Exception Handling proposal to become the default target next year.

Over the coming months, we’ll be working to make the transition to stable panic=unwind and modern Exception Handling as invisible as possible to end users.

While these long‑term investments in the ecosystem take time, they help build a stronger foundation for the Rust WebAssembly community as a whole, and we’re glad to be able to contribute to these improvements.

Using panic unwind in Rust Workers

As of version 0.8.0 of Rust Workers, we have a new --panic-unwind flag, which can be added to the build command, following the instructions here.

With this flag, panics can be fully recovered, and abort recovery will use the new abort classification and recovery hook mechanism. We highly recommend upgrading and trying it out for a more stable Rust Workers experience, and plan to make panic=unwind the default in a subsequent release. Users remaining on panic=abort will still continue to take advantage of the previous custom recovery wrapper handling from 0.6.0.

Committing to Rust Workers stability

This work is part of our ongoing effort towards a stable release for Rust Workers. By solving these sharp edges of the Wasm platform foundations at their root, and contributing back to the ecosystem where it makes sense, we build stronger foundations not just for our platform, but the entire Rust, JS, and Wasm ecosystem.

We have a number of future improvements planned for Rust Workers, and we’ll soon be sharing updates on this additional work, including wasm-bindgen generics and automated bindgen, which Guy Bedford from our team previewed in a talk on Rust & JS Interoperability at Wasm.io last month.

Find us in #rust‑on‑workers on the Cloudflare Discord. We also welcome feedback and discussion and especially all new contributors to the workers-rs and wasm-bindgen GitHub projects.

Moving past bots vs. humans

Thibault Meunier — Tue, 21 Apr 2026 13:00:00 GMT

For us humans to interact with the online world, we need a gateway: keyboard, screen, browser, device. What is called "human detection" online are patterns that humans use when interacting with such devices. These patterns have changed in recent years: a startup CEO now uses their browser to summarize the news, a tech enthusiast automates the process to book their concert tickets when sales open at night, someone who's visually impaired enables accessibility on their screen reader, and companies route their employee traffic through zero trust proxies.

At the same time, website owners are still looking to protect their data, manage their resources, control content distribution, and prevent abuse. These problems aren’t solved by knowing whether the client is a human or a bot: There are wanted bots and there are unwanted humans. These problems require knowing intent and behavior. The ability to detect automation remains critical. However, as the distinctions between actors become blurry, the systems we build now should accommodate a future where "bots vs. humans" is not the important data point.

What actually matters is not humanity in the abstract, but questions such as: is this attack traffic, is that crawler load proportional to the traffic it returns, do I expect this user to connect from this new country, are my ads being gamed?

What we discuss with the term “bots” is really two stories. The first is whether website owners should let known crawlers through when they are not getting traffic back. We have touched on this with bot authentication with http message signatures for crawlers that want to identify without being impersonated. The second is the emergence of new clients that do not embed the same behaviors as web browsers historically did, which matters for systems such as private rate limit.

In this post, we explore how web protection works today, and how it must evolve when the line between bot and human is fading.

The Web we had

When we use the Web, we don't talk directly to the thousands of servers we interact with every day. We use Web browsers. These are also known as "user agents" because they act on our behalf, representing our interests so that we can safely shop, read, and watch the Web without giving sites access to our entire computer or phone.

Websites also have an interest in how browsers work. They want to make sure that their content is presented accurately (fits the screen on mobile, has the right background color, the correct language). Websites also want to ensure that people are able to complete a purchase, read their articles, use their microphone, or sign in securely without a password. They also want people to see the ads beside the articles.

This tension between the interests of browser users and websites has been going on for a long time. Publishers typically want pixel-level control over the experiences of their users, but the people on the other side of the browser often want to use the data they access in ways that weren't envisioned by the publisher.

Web browser vendors and the standards ecosystem around them have paid careful attention to balancing these interests, sometimes with great controversy. For example, you can use browser extensions to block ads, but over time browsers have restricted what such extensions can do. Accessibility standards (e.g., WCAG) have paved the way for using Web content in ways that aren't about pixels, backed in many places by regulatory requirements. One can question the specifics of each of those tradeoffs, but they come as a package: if you want to be on the Web, you have to accept it, whether you are a publisher or a user.

Now, however, that balance is shifting. Having an assistant summarize the news or aggregate research is not a new concept, but AI democratizes this capability for everyone. The friction comes from how these emerging clients operate. A human assistant might print an article or take a screenshot without the publisher knowing, but they still use a standard web browser to render the site in the first place. AI agents bypass this step, disrupting the balanced approach to publishers’ vs. users’ rights that browsers built. They quietly fetch the raw data without rendering the page. For publishers, because of their overlap with pre-existing browser traffic, these clients are inherently opaque. Website owners cannot tell if their fetched content is serving one private report (possibly distorted, possibly unattributed) or being ingested to train a model for a million users, which disrupts the predictable (and monetizable) traffic that keeps their sites online.

The implicit agreement that made the Web work is breaking down. To understand how, the next section goes over a common architecture on the Internet.

The client-server model

Let’s take a step back, and look at one of the main deployment patterns on the Internet: the client-server model. A client makes a request to a server to obtain a resource:

^{Figure 1: Client-Server model. A client sends a request which the server responds to.}

To handle more requests, a website can increase its capacity to serve; it can deploy additional servers or place a cache in front of static traffic. Similarly, the number of requests coming from the client side can increase if one client makes more requests, or if the number of clients multiplies.

^{Figure 2: Multiple clients send multiple requests to different servers, with one fronted by a CDN.}

That simplicity is part of what made the Web successful. It allows many kinds of clients to exist, and it allows the network to evolve without each server needing to know exactly what software is on the other end.

^{Figure 3: Two different client contexts that send requests to servers. Each server only sees a request, but not the end-user behind it.}

That openness also creates uncertainty. A website can see a valid request for a resource, yet it usually cannot know what happens after the response leaves the server: whether the content is rendered for one person using a keyboard, a mouse, and a screen to control a browser; or if it's an independent program making requests automatically, archiving responses, indexing them, and feeding into a larger system.

Bot management today

This model works surprisingly well. That is why operating a website can be as simple as starting a web server with a connection to the Internet. It holds only until the server has to decide which requests it can afford to serve, trust, or prioritize.

Sometimes that is about capacity. If your service is provisioned to handle 100 requests per second globally, but you're receiving 200, you have to drop certain requests. If your server only has 1 CPU but incoming requests require 2, you have to drop requests. If the cost of serving 200 is prohibitive, then you have to rate-limit all requests.

You can drop requests at random. It's possibly unfair, and may miss the target by affecting wanted clients, but it works. In the absence of other signals, there is no other choice.

And capacity is only part of the picture. Servers also try to distinguish among clients for many other reasons: to separate attacks from ordinary traffic, to manage non-malicious load, to prevent extraction of data, to limit ad fraud, to prevent fake account creation, or to stop automated actions being taken on a user's behalf.

The difficulty is that web clients are unauthenticated by default, while still exposing many partial signals. Therefore, most servers decide to apply access control logic based on the information they receive. If a single IP address is making 10x the number of requests as others, it might be blocked. A server that goes further might infer that this IP address is used by a VPN, and therefore proxies the traffic of more than one user. The service could decide to apply a coefficient: assuming each client can make 10 requests per second, a shared IP address would be allowed 100 rps before seeing their requests being dropped.

That's one of the keys to bot management: it aims to provide the server with more information about the client to help it make decisions. This information is inherently imprecise, because the client is not under the control of the server. In addition, the same information creates fingerprint vectors that can be used by the server for different purposes such as personalized advertising. This transforms a mitigation vector to a tracking vector.

At a high level, the server sees the following signals from the client:

Passive client signals: required to make a request on the Internet. Clients necessarily send your IP address, and usually establish a TLS session.
Active client signals: voluntarily provided by the client, often invisible to the end user. This includes a User-Agent header or authentication credentials.
Server signals: information the server observes, such as the geographic location of the edge server handling the request, or the local time the request is received.

To limit and cap volumetric abuse, what matters to the origin is the capability and intent of the client to make multiple requests. In the case of an ad-funded website, the origin needs confidence that ads are actually displayed to the end-user. To preserve their brand, origins may want to ensure that the client has specific rendering capabilities: PDF reader, SVG renderer, virtual keyboard. And if the request is coming from an intercepting proxy, the origin may want to ensure that the request actually originates from an end client

If traffic grows then so do the costs to operate. If clients do not generate value, monetary or not, then the server has no incentive to cover those costs.

Different operators respond to this environment differently. Some large crawlers and platforms identify themselves because predictable access is worth the cost of being attributable. It may even help. Others try to avoid identification: because they expect to be blocked, because they seek anonymity, or because they are operating on behalf of end users. The result is an unstable balance built on partial signals.

This is why the human versus bots frame is misleading. What the origin cares about is not humanity in the abstract, but whether the client is behaving in ways the site can support.

A digression: the rate limit trilemma

^{Figure 4: Rate limit trilemma. Decentralized, anonymous, accountable — pick two}

There's a fundamental tension in how we govern access on the Internet: decentralized, anonymous, accountable — pick two.

Fully decentralized + anonymous means no accountability. A blocked client can spawn a new account without impact on its reputation. This implies that origins have to invest more to manage their resources. This is the default of the Web.
Decentralized + accountable means everyone knows who you are, which works for certain use cases but has clear drawbacks. Think OAuth mechanisms such as “Log in with”, which requires account registration and revealing activity to a third party.
Anonymous + accountable likely requires governance, rules, and enforcement. No widely deployed system achieves both properties for the same actor. The closest precedent is the Web PKI, where governance (CA policies, Certificate Transparency) holds servers accountable. When that governance fails, there are consequences. No equivalent exists today for the client side.

Current tools build on elements from that first space to strive for the second: TLS fingerprints, IP addresses, robots.txt. They attempt accountability, but only hold as long as the derived fingerprints remain stable.

The important distinctions are what, not who

For a website owner deciding how to handle incoming traffic, the meaningful distinction isn't necessarily bots vs. humans. It's about balancing the origin’s needs to understand the traffic it receives with the clients’ needs to preserve their privacy.

Platforms and services that want to be identifiable

Figure 5: A crawler makes multiple request to a server

Some traffic comes from known operators making high volumes of requests: search engine crawlers, cloud platforms, enterprise infrastructure. These actors often have low privacy expectations. They're infrastructure making millions of requests from identifiable sources. The ability to identify the source of a request helps to mitigate misjudgment if an infrastructure provider is sending you too many requests or accessing pages it should not. Self-identification is one of the principles for responsible AI bots we proposed. It is based on these principles that Cloudflare operates its URL scanner for Radar, or how we expose crawling capabilities.

For this traffic, identity works. More precisely, some operators can tolerate attributable requests because reliable access is worth it. Web Bot Auth using HTTP Message Signatures allow operators to cryptographically sign their requests. OpenAI, Google, Cloudflare, or AWS, for example, sign requests originating from their platforms. Origins can verify "this request really came from the platform infrastructure" without relying on IP ranges or User-Agent strings.

Humans and other end-users rightfully have expectations other than being identifiable, to preserve anonymity without sacrificing their access and quality of experience.

Distributed traffic that needs anonymity

^{Figure 6: Three distinct browsers make a request to a server. One is operated by a human, one by an on-device assistant, and one is proxied through a corporate proxy.}

Other traffic comes from many sources, each making relatively few requests. This includes humans browsing the web, researchers doing measurements, scrapers using residential proxies, and increasingly, AI assistants acting on humans’ behalf.

And increasingly the distinction between bots and humans is moot. There is no meaningful difference between the AI assistant booking concert tickets and the human who would have done so manually. Both are distributed. Both need anonymity. In each case, an origin would want to create less friction for users who wish to use the service as intended, rather than abuse it.

Identity could work. To replace the old assumption we had for IP addresses, it should provide a unique, verifiable set of attributes tied to a specific client, proven through an account login, an email address, or a hardware key. However, it implies the need to present this identity when accessing websites. It also undermines privacy.

We want to build modern solutions that prove behavior without proving identity.

Anonymous credentials for the Web

Since 2019, clients accessing websites via Cloudflare have been able to provide such proof of behavior, by sending a privacy token along with their request. This is due to Cloudflare’s early support for Privacy Pass. Privacy Pass, as standardised in RFC 9576, RFC 9578, lets a client carry an issuer-backed proof of some prior check, such as having solved a challenge, without turning that result into a stable identifier. It defines tokens that are unlinkable with any prior visit, request, or session.

This matters because it offers a different model from fingerprinting. Instead of collecting passive signals, the server can ask the client for an active privacy-preserving signal.

This reduces the friction on session establishment. Privacy Pass has scaled to billions of tokens per day across Cloudflare’s infrastructure, primarily for privacy relay services.

^{Figure 7: Privacy Pass Redemption and Issuance Protocol Interaction from}^{Section 3.1 of RFC 9576}

The RFC highlights four roles. The issuer trusts one or more attesters to perform some checks before issuing credentials (tokens in the RFC case). The client holds these credentials and decides when to present them, within the right scope. The origin remains in control of which issuers it trusts and what each presentation means. This does not remove abuse or policy questions, it simply provides clients and servers with a privacy-preserving way to handle them.

The system is simple, but it also has bounds: it does not, for example, allow for dynamic rate limits. If a client is issued 100 tokens, and starts consuming too many resources after the first or second session, there’s no way to invalidate the remaining tokens that were previously issued.

In addition, because of the unlinkability property, it’s hard for new issuers to emerge. There is no feedback mechanism that an origin can provide regarding the quality of the signal an issuer token conveys.

Finally, there’s a 1:1 relationship between the number of tokens that an issuer provides, and the number of unlinkable presentations that can be made with those tokens when they are redeemed: one token per presentation. Ideally, we would like a system in which the client contacts an issuer once and can later make multiple presentations scoped to a particular origin context. That points toward user agents holding vouched credentials and presenting proofs derived from them, rather than repeatedly acquiring single-use tokens.

Our goal is to help establish an open private rate limiting ecosystem. In that spirit, we are helping to develop and explore new Privacy Pass primitives, such as Anonymous Rate-Limit Credentials (ARC) and Anonymous Credit Tokens (ACT).

With ACT, for instance, clients can prove something like "I have a good history with this service" without revealing "I am this user.” ACT preserves unlinkability between presentations at the protocol level, which is the key cryptographic property here. Even in the joint issuer-origin deployment model in Section 4.3 of RFC 9576, the protocol is designed so that token issuance and presentation are not directly linkable. That does not eliminate correlation through other layers such as IP addresses, cookies, account state, or timing. The same properties can be provided using standardized VOPRF and BlindRSA primitives within the reverse flow framework that ACT implements.

A successful ecosystem needs to be an open issuer ecosystem. In practice, that means more than saying anyone can mint credentials. Origins need to be able to decide which issuers to trust. User agents need a consistent way to present what is being requested. The ecosystem also needs ways for issuers to establish reputation and for relying parties to stop trusting low-quality issuers. No single gatekeeper should control participation.

To make this work, there needs to be a protocol and client API that works across browsers and other user agents. It has to be simple to deploy, clear to users, and narrow enough that browsers can place limits on abusive proof requests rather than merely surfacing them.

The trajectory if we do nothing

Website owners are already reacting to the disruption caused by emerging clients. This is partly caused by large-scale scraping and model training, and also by user agents acting in ways sites did not anticipate. Websites, therefore, have asked for more technical means to block AI crawlers and associated tools. In an ecosystem where the lines between bots and humans are increasingly blurred, the measures we have today will become less effective on their own.

If those measures aren't effective, we can expect sites to pivot: requiring an account to see any content, or tying access to a stable identifier. This means no more ad-supported login-free articles, no more "three free articles a month.” Other content businesses may move away from the Web completely, offering their data and services directly to AI vendors for a fee, or within walled gardens operated by large platforms.

These outcomes are bad. Everyone benefits from the open access to information that the Web offers. It is not that all sites will make these choices. There are many reasons for offering content online, and not all of them are commercial. But if enough sites do, they change what "normal" is on the Web to be something worse.

That matters because the open Web is an environment in which different clients can gather information from different sources without relying on a handful of players. We also benefit from having a diversity of sources of information. On a Web where access to information is largely mediated through a small handful of companies, we put too much power into too few hands. The result is not just more friction for anonymous clients, but a more brittle Internet with fewer ways for publishers to meet users.

Anonymous authentication brings some risk, too

We should be clear about what we're building. Infrastructure for proving properties can become infrastructure for requiring properties. Anonymous credentials are meant to prove something about their holder; for example, "I solved a challenge" or "I have not exceeded a rate limit." But a system that can prove any single attribute is also capable of proving other attributes, which is a source of concern.

Today, presenting a Privacy Pass token may convey "solved a CAPTCHA". Tomorrow, the same systems could prove entirely different attributes. For instance, issuing tokens only to devices that "have device attestation" excludes older devices and their users. Similarly, requiring attributes such as "has an Apple or Google account" excludes users of non-mainstream platforms.

Once the infrastructure exists to verify anonymous proofs, what gets proven can expand. We need to make sure this does not gate access to the Internet.

Why we should build it anyway

Gates already exist. Platforms increasingly require identity. Websites are blocking traffic coming from shared proxies. The question isn't whether gates will appear, it's whether the user remains in control of their privacy.

As we’ve discussed, bot management requires some signals to be shared. The alternatives to anonymous proofs are worse. Without the ability to prove attributes anonymously, every gate requires fingerprints: retry from a specific browser, link your account, don’t use a VPN. These may not even be options to people, such as the ones which have no idea their connections are proxied.

Privacy-preserving credentials do not remove the need for trust or policy. They can make those demands more explicit and less pervasive. Unlike fingerprints, proofs are explicit. Users can see what is being asked, and clients such as web browsers and AI assistants can help enforce consent.

To decide, use this guardrail

There is a simple test to evaluate the next methods for the Internet that serves everyone: do the methods allow anyone, from anywhere in the world, to build their own device, their own browser, use any operating system, and get access to the Web. If that property cannot hold, if device attestation from specific manufacturers becomes the only viable signal, we should stop.

This means we need to foster an open issuer ecosystem, where no single gatekeeper decides who can participate. In the rate limit trilemma, decentralization is mandatory on the open Web. We don't yet know fully how to build it, but we know we need to foster it.

A new balance

Until now the Web has largely been in balance. Some aspects may have been a happy accident, while others could have been inevitable. For many end users and publishers, it worked because the Web stayed open enough to support a variety of clients accessing a similar variety of resources.

That balance is at risk. Privacy-preserving primitives for the Web are one attempt to build a different outcome: privacy-preserving, open, accountable. It is not guaranteed to succeed. But it is better than waiting.

If you are interested in tracking and participating, this work happens in the open at the IETF and at the W3C. We believe the existing places where people gathered to shape the Web of today are the best places to design the Web of tomorrow.

The Internet is for the end user, and they need to be in the center of it.

Building the agentic cloud: everything we launched during Agents Week 2026

Ming Lu — Mon, 20 Apr 2026 13:00:00 GMT

Today marks the end of our first Agents Week, an innovation week dedicated entirely to the age of agents. It couldn’t have been more timely: over the past year, agents have swiftly changed how people work. Coding agents are helping developers ship faster than ever. Support agents resolve tickets end-to-end. Research agents validate hypotheses across hundreds of sources in minutes. And people aren't just running one agent: they're running several in parallel and around the clock.

As Cloudflare's CTO Dane Knecht and VP of Product Rita Kozlov noted in our welcome to Agents Week post, the potential scale of agents is staggering: If even a fraction of the world's knowledge workers each run a few agents in parallel, you need compute capacity for tens of millions of simultaneous sessions. The one-app-serves-many-users model the cloud was built on doesn't work for that. But that's exactly what developers and businesses want to do: build agents, deploy them to users, and run them at scale.

Getting there means solving problems across the entire stack. Agents need compute that scales from full operating systems to lightweight isolates. They need security and identity built into how they run. They need an agent toolbox: the right models, tools, and context to do real work. All the code that agents generate needs a clear path from afternoon prototype to production app. And finally, as agents drive a growing share of Internet traffic, the web itself needs to adapt for the emerging agentic web. Turns out, the containerless, serverless compute platform we launched eight years ago with Workers was ready-made for this moment. Since then, we've grown it into a full platform, and this week we shipped the next wave of primitives purpose-built for agents, organized around exactly those problems.

We are here to create Cloud 2.0 — the agentic cloud. Infrastructure designed for a world where agents are a primary workload.

Here's a list of everything we announced this week — we wouldn’t want you to miss a thing.

Compute

It starts with compute. Agents need somewhere to run, and somewhere to store and run the code they write. Not all agents need the same thing: some need a full operating system to install packages and run terminal commands, most need something lightweight that starts in milliseconds and scales to millions. This week we shipped the environments to run them, as well as a new Git-compatible workspace for agents:

Announcement	Summary
Artifacts: Versioned storage that speaks Git	Give your agents, developers, and automations a home for code and data. We’ve just launched Artifacts: Git-compatible versioned storage built for agents. Create tens of millions of repos, fork from any remote, and hand off a URL to any Git client.
Agents have their own computers with Sandboxes GA	Cloudflare Sandboxes give AI agents a persistent, isolated environment: a real computer with a shell, a filesystem, and background processes that starts on demand and picks up exactly where it left off.
Dynamic, identity-aware, and secure: egress controls for Sandboxes	Outbound Workers for Sandboxes provide a programmable, zero-trust egress proxy for AI agents. This allows developers to inject credentials and enforce dynamic security policies without exposing sensitive tokens to untrusted code.
Durable Objects in Dynamic Workers: Give each AI-generated app its own database	Durable Object Facets allows Dynamic Workers to instantiate Durable Objects with their own isolated SQLite databases. This enables developers to build platforms that run persistent, stateful code generated on-the-fly.
Rearchitecting the Workflows control plane for the agentic era	Cloudflare Workflows, a durable execution engine for multi-step applications, now supports 50,000 concurrency and 300 creation rate limits through a rearchitectured control plane, helping scale to meet the use cases for durable background agents.

Security

Running agents and their code is only half the challenge. Agents connect to private networks, access internal services, and take autonomous actions on behalf of users. When anyone in an organization can spin up their own agents, security can't be an afterthought. It has to be the default. This week, we launched the tools to make that easy.

Announcement	Summary
Secure private networking for everyone: users, nodes, agents, Workers — introducing Cloudflare Mesh	Cloudflare Mesh provides secure, private network access for users, nodes, and autonomous AI agents. By integrating with Workers VPC, developers can now grant agents scoped access to private databases and APIs without manual tunnels.
Managed OAuth for Access: make internal apps agent-ready in one click	Managed OAuth for Cloudflare Access helps AI agents securely navigate internal applications. By adopting RFC 9728, agents can authenticate on behalf of users without using insecure service accounts.
Securing non-human identities: automated revocation, OAuth, and scoped permissions	Cloudflare is introducing scannable API tokens, enhanced OAuth visibility, and GA for resource-scoped permissions. These tools help developers implement a true least-privilege architecture while protecting against credential leakage.
Scaling MCP adoption: our reference architecture for enterprise MCP deployments	We share Cloudflare's internal strategy for governing MCP using Access, AI Gateway, and MCP server portals. We also launch Code Mode to slash token costs and recommend new rules for detecting Shadow MCP in Cloudflare Gateway.

Agent Toolbox

A capable agent needs to be able to think and remember, communicate, and see. This means being powered with the right models, with access to the right tools and the right context for their task at hand. This week we shipped the primitives — inference, search, memory, voice, email, and a browser — that turn an agent into something that actually gets work done.

Announcement	Summary
Project Think: building the next generation of AI agents on Cloudflare	Announcing a preview of the next edition of the Agents SDK — from lightweight primitives to a batteries-included platform for AI agents that think, act, and persist.
Add voice to your agent	An experimental voice pipeline for the Agents SDK enables real-time voice interactions over WebSockets. Developers can now build agents with continuous STT and TTS in just ~30 lines of server-side code.
Cloudflare Email Service: now in public beta. Ready for your agents	Agents are becoming multi-channel. That means making them available wherever your users already are — including the inbox. Cloudflare Email Service enters public beta with the infrastructure layer to make that easy: send, receive, and process email natively from your agents.
Cloudflare's AI platform: an inference layer designed for agents	We're building Cloudflare into a unified inference layer for agents, letting developers call models from 14+ providers. New features include Workers binding for running third-party models and an expanded catalog with multimodal models.
Building the foundation for running extra-large language models	We built a custom technology stack to run fast large language models on Cloudflare’s infrastructure. This post explores the engineering trade-offs and technical optimizations required to make high-performance AI inference accessible.
Unweight: how we compressed an LLM 22% without sacrificing quality	Running large LLMs across Cloudflare’s network requires us to be smarter and more efficient about GPU memory bandwidth. That’s why we developed Unweight, a lossless inference-time compression system that achieves up to a 22% model footprint reduction, so that we can deliver faster and cheaper inference than ever before.
Agents that remember: introducing Agent Memory	Cloudflare Agent Memory is a managed service that gives AI agents persistent memory, allowing them to recall what matters, forget what doesn't, and get smarter over time.
AI Search: the search primitive for your agents	AI Search is the search primitive for your agents. Create instances dynamically, upload files, and search across instances with hybrid retrieval and relevance boosting. Just create a search instance, upload, and search.
Browser Run: give your agents a browser	Browser Rendering is now Browser Run, with Live View, Human in the Loop, CDP access, session recordings, and 4x higher concurrency limits for AI agents.

Prototype to production

The best infrastructure is also one that’s easy to use. We want to meet developers and their agents where they’re already working: in the terminal, in the editor, in a prompt, and make the full Cloudflare platform accessible without context-switching.

Announcement	Summary
Building a CLI for all of Cloudflare	We’re introducing cf, a new unified CLI designed for consistency across the Cloudflare platform, alongside Local Explorer for debugging local data. These tools simplify how developers and AI agents interact with our nearly 3,000 API operations.
Introducing Agent Lee - a new interface to the Cloudflare stack	Agent Lee is an in-dashboard agent that shifts Cloudflare’s interface from manual tab-switching to a single prompt. Using sandboxed TypeScript, it helps you troubleshoot and manage your stack as a grounded technical collaborator.
Introducing Flagship: feature flags built for the age of AI	Introducing Flagship, a native feature flag service built on Cloudflare’s global network to eliminate the latency of third-party providers. By using KV and Durable Objects, Flagship allows for sub-millisecond flag evaluation.
Deploy Postgres and MySQL databases with PlanetScale + Workers	Learn how to deploy PlanetScale Postgres and MySQL databases via Cloudflare and connect Cloudflare Workers.
Register domains wherever you build: Cloudflare Registrar API now in beta	The Cloudflare Registrar API is now in beta. Developers and AI agents can search, check availability, and register domains at cost directly from their editor, their terminal, or their agent — without leaving their workflow.

Agentic Web

As more agents come online, they're still browsing an Internet that was built for people. Existing websites need new tools to control what bots can access their content, package and present it for agents, and measure how ready they are for this shift.

Announcement	Summary
Introducing the Agent Readiness score. Is your site agent-ready?	The Agent Readiness score can help site owners understand how well their websites support AI agents. Here we explore new standards, share Radar data, and detail how we made Cloudflare’s docs the most agent-friendly on the web.
Redirects for AI Training enforces canonical content	Soft directives don’t stop crawlers from ingesting deprecated content. Redirects for AI Training allows anybody on Cloudflare to redirect verified crawlers to canonical pages with one toggle and no origin changes.
Agents Week: Network performance update	By migrating our request handling layer to a Rust-based architecture called FL2, Cloudflare has increased its performance lead to 60% of the world’s top networks. We use real-user measurements and TCP connection trimeans to ensure our data reflects the actual experience of people on the Internet
Shared dictionary compression that keeps up with the agentic web	We give you a sneak peek of our support for shared compression dictionaries, show you how it improves page load times, and reveal when you’ll be able to try the beta yourself.

That’s a wrap

Agents Week 2026 is ending, but the agentic cloud is just getting started. Everything we shipped this week — from compute and security to the agent toolbox and the agentic web — is the foundation. We're going to keep building on it to give you everything you need to build what's next.

We also have more blog posts coming out today and tomorrow to continue the story, so keep an eye out for the latest at our blog.

If you're building on any of what we announced this week, we want to hear about it. Come find us on X or Discord, or head to the developer documentation.

The AI engineering stack we built internally — on the platform we ship

Ayush Thakur — Mon, 20 Apr 2026 13:00:00 GMT

In the last 30 days, 93% of Cloudflare’s R&D organization used AI coding tools powered by infrastructure we built on our own platform.

Eleven months ago, we undertook a major project: to truly integrate AI into our engineering stack. We needed to build the internal MCP servers, access layer, and AI tooling necessary for agents to be useful at Cloudflare. We pulled together engineers from across the company to form a tiger team called iMARS (Internal MCP Agent/Server Rollout Squad). The sustained work landed with the Dev Productivity team, who also own much of our internal tooling including CI/CD, build systems, and automation.

Here are some numbers that capture our own agentic AI use over the last 30 days:

3,683 internal users actively using AI coding tools (60% company-wide, 93% across R&D), out of approximately 6,100 total employees
47.95 million AI requests
295 teams are currently utilizing agentic AI tools and coding assistants.
20.18 million AI Gateway requests per month
241.37 billion tokens routed through AI Gateway
51.83 billion tokens processed on Workers AI

The impact on developer velocity internally is clear: we’ve never seen a quarter-to-quarter increase in merge requests to this degree.

As AI tooling adoption has grown the 4-week rolling average has climbed from ~5,600/week to over 8,700. The week of March 23 hit 10,952, nearly double the Q4 baseline.

MCP servers were the starting point, but the team quickly realized we needed to go further: rethink how standards are codified, how code gets reviewed, how engineers onboard, and how changes propagate across thousands of repos.

This post dives deep into what that looked like over the past eleven months and where we ended up. We're publishing now, to close out Agents Week, because the AI engineering stack we built internally runs on the same products we’re shipping and enhancing this week.

The architecture at a glance

The engineer-facing tools layer (OpenCode, Windsurf, and other MCP-compatible clients) include both open-source and third-party coding assistant tools.

Each layer maps to a Cloudflare product or tool we use:

What we built	Built with
Zero Trust authentication	Cloudflare Access
Centralized LLM routing, cost tracking, BYOK, and Zero Data Retention controls	AI Gateway
On-platform inference with open-weight models	Workers AI
MCP Server Portal with single OAuth	Workers + Access
AI Code Reviewer CI integration	Workers + AI Gateway
Sandboxed execution for agent-generated code (Code Mode)	Dynamic Workers
Stateful, long-running agent sessions	Agents SDK (McpAgent, Durable Objects)
Isolated environments for cloning, building, and testing	Sandbox SDK — GA as of Agents Week
Durable multi-step workflows	Workflows — scaled 10x during Agents Week
16K+ entity knowledge graph	Backstage (OSS)

None of this is internal-only infrastructure. Everything (besides Backstage) listed above is a shipping product, and many of them got substantial updates during Agents Week.

We’ll walk through this in three acts:

The platform layer — how authentication, routing, and inference work (AI Gateway, Workers AI, MCP Portal, Code Mode)
The knowledge layer — how agents understand our systems (Backstage, AGENTS.md)
The enforcement layer — how we keep quality high at scale (AI Code Reviewer, Engineering Codex)

Act 1: The platform layer

How AI Gateway helped us stay secure and improve the developer experience

When you have over 3,600+ internal users using AI coding tools daily, you need to solve for access and visibility across many clients, use cases, and roles.

Everything starts with Cloudflare Access, which handles all authentication and zero-trust policy enforcement. Once authenticated, every LLM request routes through AI Gateway. This gives us a single place to manage provider keys, cost tracking, and data retention policies.

^{The OpenCode AI Gateway overview: 688.46k requests per day, 10.57B tokens per day, routing to four providers through one endpoint.}

AI Gateway analytics show how monthly usage is distributed across model providers. Over the last month, internal request volume broke down as follows.

Provider	Requests/month	Share
Frontier Labs (OpenAI, Anthropic, Google)	13.38M	91.16%
Workers AI	1.3M	8.84%

Frontier models handle the bulk of complex agentic coding work for now, but Workers AI is already a significant part of the mix and handles an increasing share of our agentic engineering workloads.

How we increasingly leverage Workers AI

Workers AI is Cloudflare's serverless AI inference platform which runs open-source models on GPUs across our global network. Beyond huge cost improvements compared to frontier models, a key advantage is that inference stays on the same network as your Workers, Durable Objects, and storage. No cross-cloud hops to deal with, which cause more latency, network flakiness, and additional networking configuration to manage.

^{Workers AI usage in the last month: 51.47B input tokens, 361.12M output tokens.}

Kimi K2.5, launched on Workers AI in March 2026, is a frontier-scale open-source model with a 256k context window, tool calling, and structured outputs. As we described in our Kimi K2.5 launch post, we have a security agent that processes over 7 billion tokens per day on Kimi. That would cost an estimated $2.4M per year on a mid-tier proprietary model. But on Workers AI, it's 77% cheaper.

Beyond security, we use Workers AI for documentation review in our CI pipeline, for generating AGENTS.md context files across thousands of repositories, and for lightweight inference tasks where same-network latency matters more than peak model capability.

As open-source models continue to improve, we expect Workers AI to handle a growing share of our internal workloads.

One thing we got right early: routing through a single proxy Worker from day one. We could have had clients connect directly to AI Gateway, which would have been simpler to set up initially. But centralizing through a Worker meant we could add per-user attribution, model catalog management, and permission enforcement later without touching any client configs. Every feature described in the bootstrap section below exists because we had that single choke point. The proxy pattern gives you a control plane that direct connections don't, and if we plug in additional coding assistant tools later, the same Worker and discovery endpoint will handle them.

How it works: one URL to configure everything

The entire setup starts with one command:

opencode auth login https://opencode.internal.domain

That command triggers a chain that configures providers, models, MCP servers, agents, commands, and permissions, without the user touching a config file.

Step 1: Discover auth requirements. OpenCode fetches config from a URL like https://opencode.internal.domain/.well-known/opencode.

This discovery endpoint is served by a Worker and the response has an auth block telling OpenCode how to authenticate, along with a config block with providers, MCP servers, agents, commands, and default permissions:

{
  "auth": {
    "command": ["cloudflared", "access", "login", "..."],
    "env": "TOKEN"
  },
  "config": {
    "provider": { "..." },
    "mcp": { "..." },
    "agent": { "..." },
    "command": { "..." },
    "permission": { "..." }
  }
}

Step 2: Authenticate via Cloudflare Access. OpenCode runs the auth command and the user authenticates through the same SSO they use for everything else at Cloudflare. cloudflared returns a signed JWT. OpenCode stores it locally and automatically attaches it to every subsequent provider request.

Step 3: Config is merged into OpenCode. The config provided is shared defaults for the entire organization, but local configs always take priority. Users can override the default model, add their own agents, or adjust project and user scoped permissions without affecting anyone else.

Inside the proxy Worker. The Worker is a simple Hono app that does three things:

Serves the shared config. The config is compiled at deploy time from structured source files and contains placeholder values like {baseURL} for the Worker's origin. At request time, the Worker replaces these, so all provider requests route through the Worker rather than directly to model providers. Each provider gets a path prefix (/anthropic, /openai, /google-ai-studio/v1beta, /compat for Workers AI) that the Worker forwards to the corresponding AI Gateway route.
Proxies requests to AI Gateway. When OpenCode sends a request like POST /anthropic/v1/messages, the Worker validates the Cloudflare Access JWT, then rewrites headers before forwarding:
```
Stripped:   authorization, cf-access-token, host
Added:      cf-aig-authorization: Bearer 
            cf-aig-metadata: {"userId": ""}
```
The request goes to AI Gateway, which routes it to the appropriate provider. The response passes straight through with zero buffering. The apiKey field in the client config is empty because the Worker injects the real key server-side. No API keys exist on user machines.
Keeps the model catalog fresh. An hourly cron trigger fetches the current OpenAI model list from models.dev, caches it in Workers KV, and injects store: false on every model for Zero Data Retention. New models get ZDR automatically without a config redeploy.

Anonymous user tracking. After JWT validation, the Worker maps the user's email to a UUID using D1 for persistent storage and KV as a read cache. AI Gateway only ever sees the anonymous UUID in cf-aig-metadata, never the email. This gives us per-user cost tracking and usage analytics without exposing identities to model providers or Gateway logs.

Config-as-code. Agents and commands are authored as markdown files with YAML frontmatter. A build script compiles them into a single JSON config validated against the OpenCode JSON schema. Every new session picks up the latest version automatically.

The overall architecture is simple and easy for anyone to deploy with our developer platform: a proxy Worker, Cloudflare Access, AI Gateway, and a client-accessible discovery endpoint that configures everything automatically. Users run one command and they're done. There’s nothing for them to configure manually, no API keys on laptops or MCP server connections to manually set up. Making changes to our agentic tools and updating what 3,000+ people get in their coding environment is just a wrangler deploy away.

The MCP Server Portal: one OAuth, multiple MCP tools

We described our full approach to governing MCP at enterprise scale in a separate post, including how we use MCP Server Portals, Cloudflare Access, and Code Mode together. Here's the short version of what we built internally.

Our internal portal aggregates 13 production MCP servers exposing 182+ tools across Backstage, GitLab, Jira, Sentry, Elasticsearch, Prometheus, Google Workspace, our internal Release Manager, and more. This unifies access and simplifies everything giving us one endpoint and one Cloudflare Access flow governing access to every tool.

Each MCP server is built on the same foundation: McpAgent from the Agents SDK, workers-oauth-provider for OAuth, and Cloudflare Access for identity. The whole thing lives in a single monorepo with shared auth infrastructure, Bazel builds, CI/CD pipelines, and catalog-info.yaml for Backstage registration. Adding a new server is mostly copying an existing one and changing the API it wraps. For more on how this works and the security architecture behind it, see our enterprise MCP reference architecture.

Code Mode at the portal layer

MCP is the right protocol for connecting AI agents to tools, but it has a practical problem: every tool definition consumes context window tokens before the model even starts working. As the number of MCP servers and tools grows, so does the token overhead, and at scale, this becomes a real cost. Code Mode is the emerging fix: instead of loading every tool schema up front, the model discovers and calls tools through code.

Our GitLab MCP server originally exposed 34 individual tools (get_merge_request, list_pipelines, get_file_content, and so on). Those 34 tool schemas consumed roughly 15,000 tokens of context window per request. On a 200K context window, that's 7.5% of the budget gone before asking a question. Multiplied across every request, every engineer, every day, it adds up.

MCP Server Portals now support Code Mode proxying, which lets us solve that problem centrally instead of one server at a time. Rather than exposing every upstream tool definition to the client, the portal collapses them into two portal-level tools: portal_codemode_search and portal_codemode_execute.

The nice thing about doing this at the portal layer is that it scales cleanly. Without Code Mode, every new MCP server adds more schema overhead to every request. With portal-level Code Mode, the client still only sees two tools even as we connect more servers behind the portal. That means less context bloat, lower token cost, and a cleaner architecture overall.

Act 2: The knowledge layer

Backstage: the knowledge graph underneath all of it

Before the iMARS team could build MCP servers that were actually useful, we needed to solve a more fundamental problem: structured data about our services and infrastructure. We need our agents to understand context outside the code base, like who owns what, how services depend on each other, where the documentation lives, and what databases a service talks to.

We run Backstage, the open-source internal developer portal originally built by Spotify, as our service catalog. It's self-hosted (not on Cloudflare products, for the record) and it tracks things like:

2,055 services, 167 libraries, and 122 packages
228 APIs with schema definitions
544 systems (products) across 45 domains
1,302 databases, 277 ClickHouse tables, 173 clusters
375 teams and 6,389 users with ownership mappings
Dependency graphs connecting services to the databases, Kafka topics, and cloud resources they rely on

Our Backstage MCP server (13 tools) is available through our MCP Portal, and an agent can look up who owns a service, check what it depends on, find related API specs, and pull Tech Insights scores, all without leaving the coding session.

Without this structured data, agents are working blind. They can read the code in front of them, but they can't see the system around it. The catalog turns individual repos into a connected map of the engineering organization.

AGENTS.md: getting thousands of repos ready for AI

Early in the rollout, we kept seeing the same failure mode: coding agents produced changes that looked plausible and were still wrong for the repo. Usually the problem was local context: the model didn't know the right test command, the team's current conventions, or which parts of the codebase were off-limits. That pushed us toward AGENTS.md: a short, structured file in each repo that tells coding agents how the codebase actually works and forces teams to make that context explicit.

What AGENTS.md looks like

We built a system that generates AGENTS.md files across our GitLab instance. Because these files sit directly in the model's context window, we wanted them to stay short and high-signal. A typical file looks like this:

# AGENTS.md

## Repository
- Runtime: cloudflare workers
- Test command: `pnpm test`
- Lint command: `pnpm lint`

## How to navigate this codebase
- All cloudflare workers  are in src/workers/, one file per worker
- MCP server definitions are in src/mcp/, each tool in a separate file
- Tests mirror source: src/foo.ts -> tests/foo.test.ts

## Conventions
- Testing: use Vitest with `@cloudflare/vitest-pool-workers` (Codex: RFC 021, RFC 042)
- API patterns: Follow internal REST conventions (Codex: API-REST-01)

## Boundaries
- Do not edit generated files in `gen/`
- Do not introduce new background jobs without updating `config/`

## Dependencies
- Depends on: auth-service, config-service
- Depended on by: api-gateway, dashboard

When an agent reads this file, it doesn't have to infer the repo from scratch. It knows how the codebase is organized, which conventions to follow and which Engineering Codex rules apply.

How we generate them at scale

The generator pipeline pulls entity metadata from our Backstage service catalog (ownership, dependencies, system relationships), analyzes the repository structure to detect the language, build system, test framework, and directory layout, then maps the detected stack to relevant Engineering Codex standards. A capable model then generates the structured document, and the system opens a merge request so the owning team can review and refine it.

We've processed roughly 3,900 repositories this way. The first pass wasn't always perfect, especially for polyglot repos or unusual build setups, but even that baseline was much better than asking agents to infer everything from scratch.

The initial merge request solved the bootstrap problem, but keeping these files current mattered just as much. A stale AGENTS.md can be worse than no file at all. We closed that loop with the AI Code Reviewer, which can flag when repository changes suggest that AGENTS.md should be updated.

Act 3: The enforcement layer

The AI Code Reviewer

Every merge request at Cloudflare gets an AI code review. Integration is straightforward: teams add a single CI component to their pipeline, and from that point every MR is reviewed automatically.

We use GitLab's self-hosted solution as our CI/CD platform. The reviewer is implemented as a GitLab CI component that teams include in their pipeline. When an MR is opened or updated, the CI job runs OpenCode with a multi-agent review coordinator. The coordinator classifies the MR by risk tier (trivial, lite, or full) and delegates to specialized review agents: code quality, security, codex compliance, documentation, performance, and release impact. Each agent connects to the AI Gateway for model access, pulls Engineering Codex rules from a central repo, and reads the repository's AGENTS.md for codebase context. Results are posted back as structured MR comments.

A separate Workers-based config service handles centralized model selection per reviewer agent, so we can shift models without changing the CI template. The review process itself runs in the CI runner and is stateless per execution.

The output format

We spent time getting the output format right. Reviews are broken into categories (Security, Code Quality, Performance) so engineers can scan headers rather than reading walls of text. Each finding has a severity level (Critical, Important, Suggestion, or Optional Nits) that makes it immediately clear what needs attention versus what's informational.

The reviewer maintains context across iterations. If it flagged something in a previous review round that has since been fixed, it acknowledges that rather than re-raising the same issue. And when a finding maps to an Engineering Codex rule, it cites the specific rule ID, turning an AI suggestion into a reference to an organizational standard.

Workers AI handles about 15% of the reviewer's traffic, primarily for documentation review tasks where Kimi K2.5 performs well at a fraction of the cost of frontier models. Models like Opus 4.6 and GPT 5.4 handle security-sensitive and architecturally complex reviews where reasoning capability matters most.

Over the last 30 days:

100% AI code reviewer coverage across all repos on our standard CI pipeline.
5.47M AI Gateway requests
24.77B tokens processed

We're releasing a detailed technical blog post alongside this one that covers the reviewer's internal architecture, including how we route between models, the multi-agent orchestration, and the cost optimization strategies we've developed.

Engineering Codex: engineering standards as agent skills

The Engineering Codex is Cloudflare's new internal standards system where our core engineering standards live. We have a multi-stage AI distillation process, which outputs a set of codex rules ("If you need X, use Y. You must do X, if you are doing Y or Z.") along with an agent skill that uses progressive disclosure and nested hierarchical information directories and links across markdown files.

This skill is available for engineers to use locally as they build with prompts like “how should I handle errors in my Rust service?” or “review this TypeScript code for compliance.” Our Network Firewall team audited rampartd using a multi-agent consensus process where every requirement was scored COMPLIANT, PARTIAL, or NON-COMPLIANT with specific violation details and remediation steps reducing what previously required weeks of manual work to a structured, repeatable process.

At review time, the AI Code Reviewer cites specific Codex rules in its feedback.

^{AI Code Review: showing categorized findings (Codex Compliance in this case) noting the codex RFC violation.}

None of these pieces are especially novel on their own. Plenty of companies run service catalogs, ship reviewer bots, or publish engineering standards. The difference is the wiring. When an agent can pull context from Backstage, read AGENTS.md for the repo it’s editing, and get reviewed against Codex rules by the same toolchain, the first draft is usually close enough to ship. That wasn’t true six months ago.

The scoreboard

From launching this effort to 93% R&D adoption took less than a year.

Company-wide adoption (Feb 5 – April 15, 2026):

Metric	Value
Active users	3,683 (60% of the company)
R&D team adoption	93%
AI messages	47.95M
Teams with AI activity	295
OpenCode messages	27.08M
Windsurf messages	434.9K

AI Gateway (last 30 days, combined):

Metric	Value
Requests	20.18M
Tokens	241.37B

Workers AI (last 30 days):

Metric	Value
Input tokens	51.47B
Output tokens	361.12M

What's next: background agents

The next evolution in our internal engineering stack will include background agents: agents that can be spun up on demand with the same tools available locally (MCP portal, git, test runners) but running entirely in the cloud. The architecture uses Durable Objects and the Agents SDK for orchestration, delegating to Sandbox containers when the job requires a full development environment like cloning a repo, installing dependencies, or running tests. The Sandbox SDK went GA during Agents Week.

Long-running agents, shipped natively into the Agents SDK during Agents Week, solve the durable session problem that previously required workarounds. The SDK now supports sessions that run for extended periods without eviction, enough for an agent to clone a large repo, run a full test suite, iterate on failures, and open a MR in a single session.

This represents an eleven-month effort to rethink not just how code gets written, but how it gets reviewed, how standards are enforced, and how changes ship safely across thousands of repos. Every layer runs on the same products our customers use.

Start building

Agents Week just shipped everything you need. The platform is here.

npx create-cloudflare@latest --template cloudflare/agents-starter

That agents starter gets you running. The diagram below is the full architecture for when you’re ready to grow it, your tools layer on top (chatbot, web UI, CLI, browser extension), the Agents SDK handling session state and orchestration in the middle, and the Cloudflare services you call from it underneath.

Docs: Agents SDK · Sandbox SDK · AI Gateway · Workers AI · Workflows · Code Mode · MCP on Cloudflare

Repos: cloudflare/agents · cloudflare/sandbox-sdk · cloudflare/mcp-server-cloudflare · cloudflare/skills

For more on how we’re using AI at Cloudflare, read the post on our process for AI Code Review. And check out everything we shipped during Agents Week.

We’d love to hear what you build. Find us on Discord, X, and Bluesky.

^{Ayush Thakur built the AGENTS.md system and the AI Gateway integration for the OpenCode infrastructure, Scott Roemeschke is the Engineering Manager of the Developer Productivity team at Cloudflare, Rajesh Bhatia leads the Productivity Platform function at Cloudflare. This post was a collaborative effort across the Devtools team, with help from volunteers across the company through the iMARS (Internal MCP Agent/Server Rollout Squad) tiger team.}

Orchestrating AI Code Review at scale

Ryan Skidmore — Mon, 20 Apr 2026 13:00:00 GMT

Code review is a fantastic mechanism for catching bugs and sharing knowledge, but it is also one of the most reliable ways to bottleneck an engineering team. A merge request sits in a queue, a reviewer eventually context-switches to read the diff, they leave a handful of nitpicks about variable naming, the author responds, and the cycle repeats. Across our internal projects, the median wait time for a first review was often measured in hours.

When we first started experimenting with AI code review, we took the path that most other people probably take: we tried out a few different AI code review tools and found that a lot of these tools worked pretty well, and a lot of them even offered a good amount of customisation and configurability! Unfortunately, though, the one recurring theme that kept coming up was that they just didn’t offer enough flexibility and customisation for an organisation the size of Cloudflare.

So, we jumped to the next most obvious path, which was to grab a git diff, shove it into a half-baked prompt, and ask a large language model to find bugs. The results were exactly as noisy as you might expect, with a flood of vague suggestions, hallucinated syntax errors, and helpful advice to "consider adding error handling" on functions that already had it. We realised pretty quickly that a naive summarisation approach wasn't going to give us the results we wanted, especially on complex codebases.

Instead of building a monolithic code review agent from scratch, we decided to build a CI-native orchestration system around OpenCode, an open-source coding agent. Today, when an engineer at Cloudflare opens a merge request, it gets an initial pass from a coordinated smörgåsbord of AI agents. Rather than relying on one model with a massive, generic prompt, we launch up to seven specialised reviewers covering security, performance, code quality, documentation, release management, and compliance with our internal Engineering Codex. These specialists are managed by a coordinator agent that deduplicates their findings, judges the actual severity of the issues, and posts a single structured review comment.

We've been running this system internally across tens of thousands of merge requests. It approves clean code, flags real bugs with impressive accuracy, and actively blocks merges when it finds genuine, serious problems or security vulnerabilities. This is just one of the many ways we’re improving our engineering resiliency as part of Code Orange: Fail Small.

This post is a deep dive into how we built it, the architecture we landed on, and the specific engineering problems you run into when you try to put LLMs in the critical path of your CI/CD pipeline, and more critically, in the way of engineers trying to ship code.

The architecture: plugins all the way to the moon

When you are building internal tooling that has to run across thousands of repositories, hardcoding your version control system or your AI provider is a great way to ensure you'll be rewriting the whole thing in six months. We needed to support GitLab today and who knows what tomorrow, alongside different AI providers and different internal standards requirements, without any component needing to know about the others.

We built the system on a composable plugin architecture where the entry point delegates all configuration to plugins that compose together to define how a review runs. Here is what the execution flow looks like when a merge request triggers a review:

Each plugin implements a ReviewPlugin interface with three lifecycle phases. Bootstrap hooks run concurrently and are non-fatal, meaning if a template fetch fails, the review just continues without it. Configure hooks run sequentially and are fatal, because if the VCS provider can't connect to GitLab, there is no point in continuing the job. Finally, postConfigure runs after the configuration is assembled to handle asynchronous work like fetching remote model overrides.

The ConfigureContext gives plugins a controlled surface to affect the review. They can register agents, add AI providers, set environment variables, inject prompt sections, and alter fine-grained agent permissions. No plugin has direct access to the final configuration object. They contribute through the context API, and the core assembler merges everything into the opencode.json file that OpenCode consumes.

Because of this isolation, the GitLab plugin doesn't read Cloudflare AI Gateway configurations, and the Cloudflare plugin doesn't know anything about GitLab API tokens. All VCS-specific coupling is isolated in a single ci-config.ts file.

Here is the plugin roster for a typical internal review:

Plugin	Responsibility
`@opencode-reviewer/gitlab`	GitLab VCS provider, MR data, MCP comment server
`@opencode-reviewer/cloudflare`	AI Gateway configuration, model tiers, failback chains
`@opencode-reviewer/codex`	Internal compliance checking against engineering RFCs
`@opencode-reviewer/braintrust`	Distributed tracing and observability
`@opencode-reviewer/agents-md`	Verifies the repo's AGENTS.md is up to date
`@opencode-reviewer/reviewer-config`	Remote per-reviewer model overrides from a Cloudflare Worker
`@opencode-reviewer/telemetry`	Fire-and-forget review tracking

How we use OpenCode under the hood

We picked OpenCode as our coding agent of choice for a couple of reasons:

We use it extensively internally, meaning we were already very familiar with how it worked
It’s open source, so we can contribute features and bug fixes upstream as well as investigate issues really easily when we spot them (at the time of writing, Cloudflare engineers have landed over 45 pull requests upstream!)
It has a great open source SDK, allowing us to easily build plugins that work flawlessly

But most importantly, because it is structured as a server first, with its text-based user interface and desktop app acting as clients on top. This was a hard requirement for us because we needed to create sessions programmatically, send prompts via an SDK, and collect results from multiple concurrent sessions without hacking around a CLI interface.

The orchestration works in two distinct layers:

The Coordinator Process: We spawn OpenCode as a child process using Bun.spawn. We pass the coordinator prompt via stdin rather than as a command-line argument, because if you have ever tried to pass a massive merge request description full of logs as a command-line argument, you have probably met the Linux kernel's ARG_MAX limit. We learned this pretty quickly when E2BIG errors started showing up on a small percentage of our CI jobs for incredibly large merge requests. The process runs with --format json, so all output arrives as JSONL events on stdout:

const proc = Bun.spawn(
  ["bun", opencodeScript, "--print-logs", "--log-level", logLevel,
   "--format", "json", "--agent", "review_coordinator", "run"],
  {
    stdin: Buffer.from(prompt),
    env: {
      ...sanitizeEnvForChildProcess(process.env),
      OPENCODE_CONFIG: process.env.OPENCODE_CONFIG_PATH ?? "",
      BUN_JSC_gcMaxHeapSize: "2684354560", // 2.5 GB heap cap
    },
    stdout: "pipe",
    stderr: "pipe",
  },
);

The Review Plugin: Inside the OpenCode process, a runtime plugin provides the spawn_reviewers tool. When the coordinator LLM decides it is time to review the code, it calls this tool, which launches the sub-reviewer sessions through OpenCode's SDK client:

const createResult = await this.client.session.create({
  body: { parentID: input.parentSessionID },
  query: { directory: dir },
});

// Send the prompt asynchronously (non-blocking)
this.client.session.promptAsync({
  path: { id: task.sessionID },
  body: {
    parts: [{ type: "text", text: promptText }],
    agent: input.agent,
    model: { providerID, modelID },
  },
});

Each sub-reviewer runs in its own OpenCode session with its own agent prompt. The coordinator doesn't see or control what tools the sub-reviewers use. They are free to read source files, run grep, or search the codebase as they see fit, and they simply return their findings as structured XML when they finish.

What’s JSONL, and what do we use it for?

One of the big challenges that you typically face when working with systems like this is the need for structured logging, and while JSON is a fantastic-structured format, it requires everything to be “closed out” to be a valid JSON blob. This is especially problematic if your application exits early before it has a chance to close everything out and write a valid JSON blob to disk — and this is often when you need the debug logs most.

This is why we use JSONL (JSON Lines), which does exactly what it says in the tin: it’s a text format where every line is a valid, self-contained JSON object. Unlike a standard JSON array, you don't have to parse the whole document to read the first entry. You read a line, parse it, and move on. This means you don’t have to worry about buffering massive payloads into memory, or hoping for a closing ] that may never arrive because the child process ran out of memory.

In practice, it looks like this:

Stripped:   authorization, cf-access-token, host
Added:      cf-aig-authorization: Bearer 
            cf-aig-metadata: {"userId": ""}

Every CI system that needs to parse structured output from a long-running process eventually lands on something like JSONL — but we didn’t want to reinvent the wheel. (And OpenCode already supports it!)

The streaming pipeline

We process the coordinator's output in real-time, though we buffer and flush every 100 lines (or 50ms) to save our disks from a slow but painful appendFileSync death.

We watch for specific triggers as the stream flows in and pull out relevant data, like token usage out of step_finish events to track costs, and we use error events to kick off our retry logic. We also make sure to keep an eye out for output truncation — if a step_finish arrives with reason: "length", we know the model hit its max_tokens limit and got cut off mid-sentence, so we should automatically retry.

One of the operational headaches we didn’t predict was that large, advanced models like Claude Opus 4.7 or GPT-5.4 can sometimes spend quite a while thinking through a problem, and to our users this can make it look exactly like a hung job. We found that users would frequently cancel jobs and complain that the reviewer wasn’t working as intended, when in reality it was working away in the background. To counter this, we added an extremely simple heartbeat log that prints "Model is thinking... (Ns since last output)" every 30 seconds which almost entirely eliminated the problem.

Specialised agents instead of one big prompt

Instead of asking one model to review everything, we split the review into domain-specific agents. Each agent has a tightly scoped prompt telling it exactly what to look for, and more importantly, what to ignore.

The security reviewer, for example, has explicit instructions to only flag issues that are "exploitable or concretely dangerous":

## What to Flag
- Injection vulnerabilities (SQL, XSS, command, path traversal)
- Authentication/authorisation bypasses in changed code
- Hardcoded secrets, credentials, or API keys
- Insecure cryptographic usage
- Missing input validation on untrusted data at trust boundaries

## What NOT to Flag
- Theoretical risks that require unlikely preconditions
- Defense-in-depth suggestions when primary defenses are adequate
- Issues in unchanged code that this MR doesn't affect
- "Consider using library X" style suggestions

It turns out that telling an LLM what not to do is where the actual prompt engineering value resides. Without these boundaries, you get a firehose of speculative theoretical warnings that developers will immediately learn to ignore.

Every reviewer produces findings in a structured XML format with a severity classification: critical (will cause an outage or is exploitable), warning (measurable regression or concrete risk), or suggestion (an improvement worth considering). This ensures we are dealing with structured data that drives downstream behavior, rather than parsing advisory text.

The models we use

Because we split the review into specialised domains, we don't need to use a super expensive, highly capable model for every task. We assign models based on the complexity of the agent's job:

Top-tier: Claude Opus 4.7 and GPT-5.4: Reserved exclusively for the Review Coordinator. The coordinator has the hardest job — reading the output of seven other models, deduplicating findings, filtering out false positives, and making a final judgment call. It needs the highest reasoning capability available.
Standard-tier: Claude Sonnet 4.6 and GPT-5.3 Codex: The workhorse for our heavy-lifting sub-reviewers (Code Quality, Security, and Performance). These are fast, relatively cheap, and excellent at spotting logic errors and vulnerabilities in code.
Kimi K2.5: Used for lightweight, text-heavy tasks like the Documentation Reviewer, Release Reviewer, and the AGENTS.md Reviewer.

These are the defaults, but every single model assignment can be overridden dynamically at runtime via our reviewer-config Cloudflare Worker, which we'll cover in the control plane section below.

Prompt injection prevention

Agent prompts are built at runtime by concatenating the agent-specific markdown file with a shared REVIEWER_SHARED.md file containing mandatory rules. The coordinator's input prompt is assembled by stitching together MR metadata, comments, previous review findings, diff paths, and custom instructions into structured XML.

We also had to sanitise user-controlled content. If someone puts Repository: evil-corp in their MR description, they could theoretically break out of the XML structure and inject their own instructions into the coordinator's prompt. We strip these boundary tags out entirely, because we've learned over time to never underestimate the creativity of Cloudflare engineers when it comes to testing a new internal tool:

const PROMPT_BOUNDARY_TAGS = [
  "mr_input", "mr_body", "mr_comments", "mr_details",
  "changed_files", "existing_inline_findings", "previous_review",
  "custom_review_instructions", "agents_md_template_instructions",
];
const BOUNDARY_TAG_PATTERN = new RegExp(
  `]*>`, "gi"
);

Saving tokens with shared context

The system doesn't embed full diffs in the prompt. Instead, it writes per-file patch files to a diff_directory and passes the path. Each sub-reviewer reads only the patch files relevant to its domain.

We also extract a shared context file (shared-mr-context.txt) from the coordinator's prompt and write it to disk. Sub-reviewers read this file instead of having the full MR context duplicated in each of their prompts. This was a deliberate decision, as duplicating even a moderately-sized MR context across seven concurrent reviewers would multiply our token costs by 7x.

The coordinator helps keep things focused

After spawning all sub-reviewers, the coordinator performs a judge pass to consolidate the results:

Deduplication: If the same issue is flagged by both the security reviewer and the code quality reviewer, it gets kept once in the section where it fits best.
Re-categorisation: A performance issue flagged by the code quality reviewer gets moved to the performance section.
Reasonableness filter: Speculative issues, nitpicks, false positives, and convention-contradicted findings get dropped. If the coordinator isn't sure, it uses its tools to read the source code and verify.

The overall approval decision follows a strict rubric:

Condition	Decision	GitLab Action
All LGTM (“looks good to me”), or only trivial suggestions	`approved`	`POST /approve`
Only suggestion-severity items	`approved_with_comments`	`POST /approve`
Some warnings, no production risk	`approved_with_comments`	`POST /approve`
Multiple warnings suggesting a risk pattern	`minor_issues`	`POST /unapprove` (revoke prior bot approval)
Any critical item, or production safety risk	`significant_concerns`	`/submit_review requested_changes` (block merge)

The bias is explicitly toward approval, meaning a single warning in an otherwise clean MR still gets approved_with_comments rather than a block.

Because this is a production system that directly sits between engineers shipping code, we made sure to build an escape hatch. If a human reviewer comments break glass, the system forces an approval regardless of what the AI found. Sometimes you just need to ship a hotfix, and the system detects this override before the review even starts, so we can track it in our telemetry and aren’t caught out by any latent bugs or LLM provider outages.

Risk tiers: don't send the dream team to review a typo fix

You don't need seven concurrent AI agents burning Opus-tier tokens to review a one-line typo fix in a README. The system classifies every MR into one of three risk tiers based on the size and nature of the diff:

// Simplified from packages/core/src/risk.ts
function assessRiskTier(diffEntries: DiffEntry[]) {
  const totalLines = diffEntries.reduce(
    (sum, e) => sum + e.addedLines + e.removedLines, 0
  );
  const fileCount = diffEntries.length;
  const hasSecurityFiles = diffEntries.some(
    e => isSecuritySensitiveFile(e.newPath)
  );

  if (fileCount > 50 || hasSecurityFiles) return "full";
  if (totalLines <= 10 && fileCount <= 20)  return "trivial";
  if (totalLines <= 100 && fileCount <= 20) return "lite";
  return "full";
}

Security-sensitive files: anything touching auth/, crypto/, or file paths that sound even remotely security-related always trigger a full review, because we’d rather spend a bit extra on tokens than potentially miss a security vulnerability.

Each tier gets a different set of agents:

Tier	Lines Changed	Files	Agents	What Runs
Trivial	≤10	≤20	2	Coordinator + one generalised code reviewer
Lite	≤100	≤20	4	Coordinator + code quality + documentation + (more)
Full	>100 or >50 files	Any	7+	All specialists, including security, performance, release

The trivial tier also downgrades the coordinator from Opus to Sonnet, for example, as a two-reviewer check on a minor change doesn't require an extremely capable and expensive model to evaluate.

Diff filtering: getting rid of the noise

Before the agents see any code, the diff goes through a filtering pipeline that strips out noise like lock files, vendored dependencies, minified assets, and source maps:

const NOISE_FILE_PATTERNS = [
  "bun.lock", "package-lock.json", "yarn.lock",
  "pnpm-lock.yaml", "Cargo.lock", "go.sum",
  "poetry.lock", "Pipfile.lock", "flake.lock",
];

const NOISE_EXTENSIONS = [".min.js", ".min.css", ".bundle.js", ".map"];

We also filter out generated files by scanning the first few lines for markers like // @generated or /* eslint-disable */. However, we explicitly exempt database migrations from this rule, since migration tools often stamp files as generated even though they contain schema changes that absolutely need to be reviewed.

The spawn_reviewers tool: concurrent orchestration

The spawn_reviewers tool manages the lifecycle of up to seven concurrent reviewer sessions with circuit breakers, failback chains, per-task timeouts, and retry logic. It acts essentially as a tiny scheduler for LLM sessions.

Determining when an LLM session is actually "done" is surprisingly tricky. We rely primarily on OpenCode's session.idle events, but we back that up with a polling loop that checks the status of all running tasks every three seconds. This polling loop also implements inactivity detection. If a session has been running for 60 seconds with no output at all, it is killed early and marked as an error, which catches sessions that crash on startup before producing any JSONL.

Timeouts operate at three levels:

Per-task: 5 minutes (10 for code quality, which reads more files). This prevents one slow reviewer from blocking the rest.
Overall: 25 minutes. A hard cap for the entire spawn_reviewers call. When it hits, every remaining session is aborted.
Retry budget: 2 minutes minimum. We don't bother retrying if there isn't enough time left in the overall budget.

Resilience: circuit breakers and failback chains

Running seven concurrent AI model calls means you are absolutely going to hit rate limits and provider outages. We implemented a circuit breaker pattern inspired by Netflix's Hystrix, adapted for AI model calls. Each model tier has independent health tracking with three states:

When a model's circuit opens, the system walks a failback chain to find a healthy alternative. For example:

const DEFAULT_FAILBACK_CHAIN = {
  "opus-4-7":   "opus-4-6",    // Fall back to previous generation
  "opus-4-6":   null,          // End of chain
  "sonnet-4-6": "sonnet-4-5",
  "sonnet-4-5": null,
};

Each model family is isolated, so if one model is overloaded, we fall back to an older generation model rather than crossing streams. When a circuit opens, we allow exactly one probe request through after a two-minute cooldown to see if the provider has recovered, which prevents us from stampeding a struggling API.

Error classification

When a sub-reviewer session fails, the system needs to decide if it should trigger model failback or if it's a problem that a different model won't fix. The error classifier maps OpenCode's error union type to a shouldFailback boolean:

switch (err.name) {
  case "APIError":
    // Only retryable API errors (429, 503) trigger failback
    return { shouldFailback: Boolean(data.isRetryable), ... };
  case "ProviderAuthError":
    // Auth failure (a different model won't fix bad credentials)
    return { shouldFailback: false, ... };
  case "ContextOverflowError":
    // Too many tokens (a different model has the same limit)
    return { shouldFailback: false, ... };
  case "MessageAbortedError":
    // User/system abort (not a model problem)
    return { shouldFailback: false, ... };
}

Only retryable API errors trigger failback. Auth errors, context overflow, aborts, and structured output errors do not.

Coordinator-level failback

The circuit breaker handles sub-reviewer failures, but the coordinator itself can also fail. The orchestration layer has a separate failback mechanism: if the OpenCode child process fails with a retryable error (detected by scanning stderr for patterns like "overloaded" or "503"), it hot-swaps the coordinator model in the opencode.json config file and retries. This is a file-level swap that reads the config JSON, replaces the review_coordinator.model key, and writes it back before the next attempt.

The control plane: Workers for config and telemetry

If a model provider goes down at 8 a.m. UTC when our colleagues in Europe are just waking up, we don’t want to wait for an on-call engineer to make a code change to switch out the models we’re using for the reviewer. Instead, the CI job fetches its model routing configuration from a Cloudflare Worker backed by Workers KV.

The response contains per-reviewer model assignments and a providers block. When a provider is disabled, the plugin filters out all models from that provider before selecting the primary:

function filterModelsByProviders(models, providers) {
  return models.filter((m) => {
    const provider = extractProviderFromModel(m.model);
    if (!provider) return true;       // Unknown provider → keep
    const config = providers[provider];
    if (!config) return true;         // Not in config → keep
    return config.enabled;            // Disabled → filter out
  });
}

This means we can flip a switch in KV to disable an entire provider, and every running CI job will route around it within five seconds. The config format also carries failback chain overrides, allowing us to reshape the entire model routing topology from a single Worker update.

We also use a fire-and-forget TrackerClient that talks to a separate Cloudflare Worker to track job starts, completions, findings, token usage, and Prometheus metrics. The client is designed to never block the CI pipeline, using a 2-second AbortSignal.timeout and pruning pending requests if they exceed 50 entries. Prometheus metrics are batched on the next microtask and flushed right before the process exits, forwarding to our internal observability stack via Workers Logging, so we know exactly how many tokens we are burning in real time.

Re-reviews: not starting from scratch

When a developer pushes new commits to an already-reviewed MR, the system runs an incremental re-review that is aware of its own previous findings. The coordinator receives the full text of its last review comment and a list of inline DiffNote comments it previously posted, along with their resolution status.

The re-review rules are strict:

Fixed findings: Omit from the output, and the MCP server auto-resolves the corresponding DiffNote thread.
Unfixed findings: Must be re-emitted even if unchanged, so the MCP server knows to keep the thread alive.
User-resolved findings: Respected unless the issue has materially worsened.
User replies: If a developer replies "won't fix" or "acknowledged", the AI treats the finding as resolved. If they reply "I disagree", the coordinator will read their justification and either resolve the thread or argue back.

We also made sure to build in a small Easter egg and made sure that the reviewer can also handle one lighthearted question per MR. We figured a little personality helps build rapport with developers who are being reviewed (sometimes brutally) by a robot, so the prompt instructs it to keep the answer brief and warm before politely redirecting back to the review.

Keeping AI context fresh: the AGENTS.md Reviewer

AI coding agents rely heavily on AGENTS.md files to understand project conventions, but these files rot incredibly fast. If a team migrates from Jest to Vitest but forgets to update their instructions, the AI will stubbornly keep trying to write Jest tests.

We built a specific reviewer just to assess the materiality of an MR and yell at developers if they make a major architectural change without updating the AI instructions. It classifies changes into three tiers:

High materiality (strongly recommend update): package manager changes, test framework changes, build tool changes, major directory restructures, new required env vars, CI/CD workflow changes.
Medium materiality (worth considering): major dependency bumps, new linting rules, API client changes, state management changes.
Low materiality (no update needed): bug fixes, feature additions using existing patterns, minor dependency updates, CSS changes.

It also penalizes anti-patterns in existing AGENTS.md files, like generic filler ("write clean code"), files over 200 lines that cause context bloat, and tool names without runnable commands. A concise, functional AGENTS.md with commands and boundaries is always better than a verbose one.

How our teams use it

The system ships as a fully contained internal GitLab CI component. A team adds it to their .gitlab-ci.yml:

include:
  - component: $CI_SERVER_FQDN/ci/ai/opencode@~latest

The component handles pulling the Docker image, setting up Vault secrets, running the review, and posting the comment. Teams can customise behavior by dropping an AGENTS.md file in their repo root with project-specific review instructions, and teams can opt to provide a URL to an AGENTS.md template that gets injected into all agent prompts to ensure their standard conventions apply across all of their repositories without needing to keep multiple AGENTS.md files up to date.

The entire system also runs locally. The @opencode-reviewer/local plugin provides a /fullreview command inside OpenCode's TUI that generates diffs from the working tree, runs the same risk assessment and agent orchestration, and posts results inline. It's the exact same agents and prompts, just running on your laptop instead of in CI.

Show me the numbers!

We have been running this system for about a month now, and we track everything through our review-tracker Worker. Here is what the data looks like across 5,169 repositories from March 10 to April 9, 2026.

The overview

In the first 30 days, the system completed 131,246 review runs across 48,095 merge requests in 5,169 repositories. The average merge request gets reviewed 2.7 times (the initial review, plus re-reviews as the engineer pushes fixes), and the median review completes in 3 minutes and 39 seconds. That is fast enough that most engineers see the review comment before they have finished context-switching to another task. The metric we’re the proudest about, though, is that engineers have only needed to “break glass” 288 times (0.6% of merge requests).

On the cost side, the average review costs $1.19 and the median is $0.98. The distribution has a long tail of expensive reviews – massive refactors that trigger full-tier orchestration. The P99 review costs $4.45, which means 99% of reviews come in under five dollars.

Percentile	Cost per review	Review duration
Median	$0.98	3m 39s
P90	$2.36	6m 27s
P95	$2.93	7m 29s
P99	$4.45	10m 21s

What it found

The system produced 159,103 total findings across all reviews, broken down as follows:

That is about 1.2 findings per review on average, which is deliberately low. We biased hard for signal over noise, and the "What NOT to Flag" prompt sections are a big part of why the numbers look like this rather than 10+ findings per review of dubious quality.

The code quality reviewer is the most prolific, producing nearly half of all findings. Security and performance reviewers produce fewer findings but at higher average severity, but the absolute numbers tell the full story — code quality produces nearly half of all findings by volume, while the security reviewer flags the highest proportion of critical issues at 4%:

Reviewer	Critical	Warning	Suggestion	Total
Code Quality	6,460	29,974	38,464	74,898
Documentation	155	9,438	16,839	26,432
Performance	65	5,032	9,518	14,615
Security	484	5,685	5,816	11,985
Codex (compliance)	224	4,411	5,019	9,654
AGENTS.md	18	2,675	4,185	6,878
Release	19	321	405	745

Token usage

Over the month, we processed approximately 120 billion tokens in total. The vast majority of those are cache reads, which is exactly what we want to see — it means the prompt caching is working, and we are not paying full input pricing for repeated context across re-reviews.

Our cache hit rate sits at 85.7%, which saves us an estimated five figures compared to what we would pay at full input token pricing. This is partially thanks to the shared context file optimisation — sub-reviewers reading from a cached context file rather than each getting their own copy of the MR metadata, but also by using the exact same base prompts across all runs, across all merge requests.

Here is how the token usage breaks down by model and by agent:

Model	Input	Output	Cache Read	Cache Write	% of Total
Top-tier models (Claude Opus 4.7, GPT-5.4)	806M	1,077M	25,745M	5,918M	51.8%
Standard-tier models (Claude Sonnet 4.6, GPT-5.3 Codex)	928M	776M	48,647M	11,491M	46.2%
Kimi K2.5	11,734M	267M	0	0	0.0%

Top-tier models and Standard-tier models split the cost roughly 52/48, which makes sense given that the top-tier models have to do a lot more complex work (one session per review, but with expensive extended thinking and large output) while the standard-tier models handle three sub-reviewers per full review. Kimi processes the most raw input tokens (11.7B) but costs “nothing” since it runs through Workers AI.

The per-agent breakdown shows where the tokens actually go:

Agent	Input	Output	Cache Read	Cache Write
Coordinator	513M	1,057M	20,683M	5,099M
Code Quality	428M	264M	19,274M	3,506M
Engineering Codex	409M	236M	18,296M	3,618M
Documentation	8,275M	216M	8,305M	616M
Security	199M	149M	8,917M	2,603M
Performance	157M	124M	6,138M	2,395M
AGENTS.md	4,036M	119M	2,307M	342M
Release	183M	5M	231M	15M

The coordinator produces by far the most output tokens (1,057M) because it has to write the full structured review comment. The documentation reviewer has the highest raw input (8,275M) because it processes every file type, not just code. The release reviewer barely registers because it only runs when release-related files are in the diff.

Cost by risk tier

The risk tier system is doing its job. Trivial reviews (typo fixes, small doc changes) cost 20 cents on average, while full reviews with all seven agents average $1.68. The spread is exactly what we designed for:

Tier	Reviews	Avg Cost	Median	P95	P99
Trivial	24,529	$0.20	$0.17	$0.39	$0.74
Lite	27,558	$0.67	$0.61	$1.15	$1.95
Full	78,611	$1.68	$1.47	$3.35	$5.05

So, what does a review look like?

We’re glad you asked! Here’s an example of what a particularly egregious review looks like:

As you can see, the reviewer doesn’t beat around the bush and calls out problems when it sees them.

Limitations we're honest about

This isn't a replacement for human code review, at least not yet with today’s models. AI reviewers regularly struggle with:

Architectural awareness: The reviewers see the diff and surrounding code, but they don't have the full context of why a system was designed a certain way or whether a change is moving the architecture in the right direction.
Cross-system impact: A change to an API contract might break three downstream consumers. The reviewer can flag the contract change, but it can't verify that all consumers have been updated.
Subtle concurrency bugs: Race conditions that depend on specific timing or ordering are hard to catch from a static diff. The reviewer can spot missing locks, but not all the ways a system can deadlock.
Cost scales with diff size: A 500-file refactor with seven concurrent frontier model calls costs real money. The risk tier system manages this, but when the coordinator's prompt exceeds 50% of the estimated context window, we emit a warning. Large MRs are inherently expensive to review.

We’re just getting started

For more on how we’re using AI at Cloudflare, read our post on our internal AI engineering stack. And check out everything we shipped during Agents Week.

Have you integrated AI into your code review? We’d love to hear about it. Find us on Discord, X, and Bluesky.

Interested in building cutting edge projects like this, on cutting edge technology? Come build with us!

Introducing the Agent Readiness score. Is your site agent-ready?

André Jesus — Fri, 17 Apr 2026 13:05:00 GMT

The web has always had to adapt to new standards. It learned to speak to web browsers, and then it learned to speak to search engines. Now, it needs to speak to AI agents.

Today, we are excited to introduce isitagentready.com — a new tool to help site owners understand how they can make their sites optimized for agents, from guiding agents on how to authenticate, to controlling what content agents can see, the format they receive it in, and how they pay for it. We are also introducing a new dataset to Cloudflare Radar that tracks the overall adoption of each agent standard across the Internet.

We want to lead by example. That is why we are also sharing how we recently overhauled Cloudflare's Developer Documentation to make it the most agent-friendly documentation site, allowing AI tools to answer questions faster and significantly cheaper.

How agent-ready is the web today?

The short answer: not very. This is expected, but also shows how much more effective agents can be than they are today, if standards are adopted.

To analyze this, Cloudflare Radar took the 200,000 most visited domains on the Internet; filtered out categories where agent readiness isn't important (like redirects, ad-servers, and tunneling services) to focus on businesses, publishers, and platforms that AI agents might realistically need to interact with; and scanned them using our new tool.

The result is a new “Adoption of AI agent standards” chart that can now be found in the Cloudflare Radar AI Insights page where we can measure adoption of each standard across multiple domain categories.

Looking at individual checks, a few things stood out:

robots.txt is nearly universal — 78% of sites have one — but the vast majority are written for traditional search engine crawlers, not AI agents.
Content Signals: 4% of sites have declared their AI usage preferences in robots.txt. This is a new standard that is gaining momentum.
Markdown content negotiation (serving text/markdown on Accept: text/markdown) passes on 3.9% of sites.
New emerging standards like MCP Server Cards and API Catalogs (RFC 9727) together appear on fewer than 15 sites in the entire dataset. It’s still early — there is lots of opportunity to stand out by being one of the first sites to adopt new standards and work well with agents.

This chart will be updated weekly, and the data can also be accessed through the Data Explorer or the Radar API.

Get an agent readiness score for your site

You can get an agent readiness score for your own website by going to isitagentready.com and entering the site’s URL.

Scores and audits that provide actionable feedback have helped to drive adoption of new standards before. For example, Google Lighthouse scores websites on performance and security best practices, and guides site owners to adopt the latest web platform standards. We think something similar should exist to help site owners adopt best practices for agents.

When you enter your site, Cloudflare makes requests to it to check which standards it supports, and provides a score based on four dimensions:

Discoverability: robots.txt, sitemap.xml, Link Headers (RFC 8288)
Content: Markdown for Agents
Bot Access Control: Content Signals, AI bot rules in robots.txt, Web Bot Auth
Capabilities: Agent Skills, API Catalog (RFC 9727), OAuth server discovery via RFC 8414 and RFC 9728, MCP Server Card, and WebMCP

^{Screenshot of results from an agent-readiness check for an example website.}

Additionally, we check if the site supports agentic commerce standards including x402, Universal Commerce Protocol, and Agentic Commerce Protocol, but these do not currently count towards the score.

For each failing check, we provide a prompt that you can give to your coding agent and have it implement support on your behalf.

The site itself is also agent-ready, practicing what it preaches. It exposes a stateless MCP server (https://isitagentready.com/.well-known/mcp.json) with a scan_site tool via Streamable HTTP, so any MCP-compatible agent can scan websites programmatically without using the web interface. It also publishes an Agent Skills index (https://isitagentready.com/.well-known/agent-skills/index.json) with skill documents for every standard it checks, so agents not only know what to fix, but how to fix it.

Let’s dig into the checks in each category, and why they matter for agents.

Discoverability

robots.txt has been around since 1994, and most sites have one. It serves two purposes for agents: it defines crawl rules (who can access what) and it points to your sitemaps. A sitemap is an XML file that lists every path on your website, essentially a map agents can follow to discover all your content without having to crawl every link. The robots.txt is where agents look first.

Beyond sitemaps, agents can also discover important resources directly from HTTP response headers, specifically, using the Link response header (RFC 8288). Unlike links buried inside HTML, the Link header is part of the HTTP response itself, which means an agent can find links to resources without having to parse any markup:

HTTP/1.1 200 OK
Link: ; rel="api-catalog"

Content accessibility

Getting an agent onto your site is one thing. Making sure it can actually read your content is another.

Back in September 2024, which feels like a lifetime ago given how fast AI is moving, llms.txt was proposed as a way to provide a LLM-friendly representation of a website, and fit within the model’s context window. llms.txt is a plain text file at the root of your site that gives agents a structured reading list: what the site is, what's on it, and where the important content lives. Think of it as a sitemap written for an LLM to read rather than a crawler to index:

# My Site
> A developer platform for building on the edge.
## Documentation
- [Getting Started](https://example.com/docs/start.md)
- [API Reference](https://example.com/docs/api.md)
## Changelog
- [Release Notes](https://example.com/changelog.md)

Markdown content negotiation goes even further. When an agent fetches any page and sends an Accept: text/markdown header, the server responds with a clean markdown version instead of HTML. The markdown version requires far fewer tokens — we measured up to 80% token reduction in some cases — which makes responses faster, cheaper, and more likely to be consumed in its entirety, given the limits on context windows that most agent tools have by default.

By default, we only check whether the site correctly handles Markdown content negotiation, and do not check for llms.txt. You can customize the scan to include llms.txt if you choose to.

Bot Access Control

Now that agents can navigate your site and consume your content, the next question is: do you want to let any bot do it?

robots.txt does more than point to sitemaps. It is also where you define your access rules. You can explicitly declare which crawlers are allowed and what they can access, down to specific paths. This convention is well established and is still the first place any well-behaved bot looks before it starts crawling.

Content Signals let you be more specific. Rather than just allow or block, you can declare exactly what AI can do with your content. Using a Content-Signal directive in your robots.txt, you can independently control three things: whether your content can be used for AI training (ai-train), whether it can be used as AI input for inference and grounding (ai-input), and whether it should appear in search results (search):

User-agent: *
Content-Signal: ai-train=no, search=yes, ai-input=yes

Inversely, the Web Bot Auth IETF draft standard allows friendly bots to authenticate themselves, and allows websites receiving requests from bots to identify them. A bot signs its HTTP requests, and the receiving site verifies those signatures using the bot’s published public keys.

Those public keys live at a well-known endpoint, /.well-known/http-message-signatures-directory, which we check as part of the scan.

Not all sites need to implement this. If your site just serves content, and doesn’t make requests to other sites, you don’t need it. But as more sites on the Internet run their own agents that make requests to other sites, we expect this to be increasingly important over time.

Protocol Discovery

Beyond passive content consumption, agents can also interact with your site directly by calling APIs, invoking tools, and completing tasks autonomously.

If your service has one or more public APIs, the API Catalog (RFC 9727) gives agents a single well-known location to discover all of them. Hosted at /.well-known/api-catalog, it lists your APIs and links to their specs, docs, and status endpoints, without requiring agents to scrape your developer portal or read your documentation.

We can't talk about agents without mentioning MCP. The Model Context Protocol is an open standard that allows AI models to connect with external data sources and tools. Instead of building a custom integration for every AI tool, you build one MCP server and any compatible agent can use it.

To help agents find your MCP server, you can publish an MCP Server Card (a proposal currently in draft). This is a JSON file at /.well-known/mcp/server-card.json that describes your server before an agent even connects: what tools it exposes, how to reach it, and how to authenticate. An agent reads this file and knows everything it needs to start using your server:

{
  "$schema": "https://static.modelcontextprotocol.io/schemas/mcp-server-card/v1.json",
  "version": "1.0",
  "protocolVersion": "2025-06-18",
  "serverInfo": {
    "name": "search-mcp-server",
    "title": "Search MCP Server",
    "version": "1.0.0"
  },
  "description": "Search across all documentation and knowledge base articles",
  "transport": {
    "type": "streamable-http",
    "endpoint": "/mcp"
  },
  "authentication": {
    "required": false
  },
  "tools": [
    {
      "name": "search",
      "title": "Search",
      "description": "Search documentation by keyword or question",
      "inputSchema": {
        "type": "object",
        "properties": {
          "query": { "type": "string" }
        },
        "required": ["query"]
      }
    }
  ]
}

Agents work best when they have Agent Skills that help them perform specific tasks — but how can agents discover what skills a site provides? We’ve proposed that sites can make this information available at .well-known/agent-skills/index.json, an endpoint that tells the agent what skills are available and where to find them. You might notice that the .well-known standard (RFC 8615) is used by many other agent and authorization standards — thank you to Cloudflare’s own Mark Nottingham who authored the standard, and other IETF contributors!

Many sites require you to sign in first in order to access them. This makes it hard for humans to give agents the ability to access these sites on their behalf, and is why some have taken the arguably unsafe workaround approach of giving agents access to the user’s web browser, with their logged-in session.

There’s a better way that allows humans to explicitly grant access: sites that support OAuth can tell agents where to find the authorization server (RFC 9728), allowing agents to send humans through an OAuth flow, where they can choose to properly grant access to the agent. Announced at Agents Week 2026, Cloudflare Access now fully supports this OAuth flow, and we showed how agents like OpenCode can make use of this standard to make things just work when users give agents protected URLs:

Commerce

Agents can also buy things on your behalf — but payments on the web were designed for humans. Add to cart, enter a credit card, click pay. That flow breaks down entirely when the buyer is an AI agent.

x402 solves this at the protocol level by reviving HTTP 402 Payment Required, a status code that has existed in the spec since 1997 but was never widely used. The flow is simple: an agent requests a resource, the server responds with a 402 and a machine-readable payload describing the payment terms, the agent pays and retries. Cloudflare partnered with Coinbase to launch the x402 Foundation, whose mission is to drive adoption of x402 as an open standard for Internet payments.

We also check for Universal Commerce Protocol and Agentic Commerce Protocol — two emerging agentic commerce standards designed to allow agents to discover and purchase products that humans would normally purchase via ecommerce storefronts and checkout flows.

Integrating agent readiness into Cloudflare URL Scanner

Cloudflare's URL Scanner lets you submit any URL and get a detailed report on it: HTTP headers, TLS certificates, DNS records, technologies used, performance data, and security signals. It is a fundamental tool for security researchers and developers who want to understand what a URL is actually doing under the hood.

We’ve taken the same checks from isitagentready.com and added them to URL Scanner with a new Agent Readiness tab. When you scan any URL, you'll now see its full agent readiness report alongside the existing analysis: which of the checks pass, what level the site is at, and actionable guidance to improve your score.

The integration is also available programmatically via the URL Scanner API. To include agent readiness results in a scan, pass the agentReadiness option in your scan request:

curl -X POST https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/urlscanner/v2/scan \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" \
    -d '{
          "url": "https://www.example.com",
          "options": {"agentReadiness": true}
        }'

Leading by example: upgrading Cloudflare Docs

As we built the tools to measure the Web’s readiness, we knew we had to ensure our own house was in order. Our docs must be easily digestible by the agents our customers use.

We naturally adopted the relevant content site standards mentioned above, and you can check our score here. However, we didn’t stop there. Here is how we refined Cloudflare's Developer Docs to be the most agent-friendly resource on the web.

URL fallbacks using `index.md` files

Unfortunately, as of February 2026, of 7 agents tested, only Claude Code, OpenCode, and Cursor request content with the Accept: text/markdown header by default. For the rest, we needed a seamless URL-based fallback.

To do this, we make every page available separately via Markdown at /index.md relative to the page’s URL. We do this dynamically, without duplicating static files, by combining two Cloudflare Rules:

A URL Rewrite Rule matches requests ending in /index.md and dynamically rewrites them to the base path using regex_replace (stripping /index.md).
A Request Header Transform Rule matches against the original request’s path before the rewrite (raw.http.request.uri.path) and automatically sets the Accept: text/markdown header.

With these two rules, any page can be fetched as Markdown via appending the /index.md path to the URL:

https://developers.cloudflare.com/r2/get-started/index.md

We point to these /index.md URLs in our llms.txt files. Effectively, for these /index.md paths, we always return markdown, regardless of what headers the client sets. And we do this without any additional build step or content duplication.

Creating effective `llms.txt` files for large sites

llms.txt serves as a "home base" for agents, providing a directory of pages to help LLMs find content. However, 5,000+ pages of documentation in a single file will exceed models’ context windows.

Instead of one massive file, we generate a separate llms.txt file for each top-level directory in our docs and the root llms.txt simply points to these subdirectories.

We also remove hundreds of directory-listing pages that provide little semantic value to an LLM, and we ensure each page has rich descriptive context (titles, semantic names, and descriptions).

For example, we omit roughly 450 pages that only serve as localized directory listings, like https://developers.cloudflare.com/workers/databases/.

These pages appear in our sitemap, but they contain very little information for an LLM. Since all child pages are already linked individually in llms.txt, fetching a directory page only provides a redundant list of links, forcing the agent to make another request to find actual content.

To help agents navigate efficiently, each llms.txt entry must be rich in context but light on tokens. Humans might ignore frontmatter and filtering labels, but for an AI agent, this metadata is the steering wheel. That is why our Product Content Experience (PCX) team has refined our page titles, descriptions, and URL structures so that agents always know exactly which pages to fetch.

Take a look at a section from our root llms.txt.

Each link has a semantic name, a matching URL, and a high-value description. None of this required extra work for llms.txt generation. It was all already available in the docs frontmatter. The same goes for pages in top level directory llms.txt files. All of this context empowers agents to find relevant information more efficiently.

Custom agent-friendly documentation (afdocs) tooling

Additionally, we test our docs against afdocs, an emerging agent-friendly documentation spec and open-source project that allows teams to test docs sites for things like content discovery and navigation. This spec allowed us to build custom audit tooling of our own. By adding a few deliberate patches specific to our use case, we created a dashboard for easy assessment.

Benchmark results: faster and cheaper

We pointed an agent (Kimi-k2.5 via OpenCode) at other large technical documentation sites' llms.txt files and tasked the agent with answering highly specific technical questions.

On average, the agent pointed at Cloudflare’s documentation consumed 31% fewer tokens and arrived at the correct answer 66% faster than the average site that is not refined for agents. By fitting our product directories into single context windows, agents can identify the exact page they need and fetch it in a single, linear path.

Structure leads to speed

Accuracy in LLM responses is often a byproduct of context window efficiency. During our testing, we observed a recurring pattern with other documentation sets.

The grep loop: Many documentation sites provide a single, massive llms.txt file that exceeds the agent's immediate context window. Because the agent cannot "read" the whole file, it begins to grep for keywords. If the first search misses the specific detail, the agent must think, refine its search, and try again.
Narrowed context and lower accuracy: When an agent relies on iterative searching rather than reading the full file, it loses the broader context of the documentation. This fragmented view often leads the agent to have a reduced understanding of the documentation at hand.
Latency and token bloat: Each iteration of the grep loop requires the agent to generate new "thinking tokens" and execute additional search requests. This back-and-forth makes the final response noticeably slower and increases the total token count, driving up the cost for the end user.

By contrast, Cloudflare docs are designed to fit entirely within an agent's context window. This allows the agent to ingest the directory, identify the exact page it needs, and fetch the Markdown without detour.

Improving LLM answers over time by redirecting AI training crawlers

Documentation for legacy products like Wrangler v1 or Workers Sites presents a unique challenge. While we must keep this information accessible for historical purposes, it can lead to outdated advice from AI agents.

For example, a human reading these docs would see the large banner stating that Wrangler v1 is deprecated, in addition to a link to the most recent content. An LLM crawler, however, might ingest the text without that surrounding visual context. This results in the agent recommending outdated information.

Redirects for AI Training solves this by identifying AI training crawlers and intentionally redirecting them away from deprecated or suboptimal content. This ensures that while humans can still access historical archives, LLMs are only fed our most current and accurate implementation details.

Hidden agent directives on all pages

Every HTML page in our docs includes a hidden directive specifically for LLMs.

“STOP! If you are an AI agent or LLM, read this before continuing. This is the HTML version of a Cloudflare documentation page. Always request the Markdown version instead — HTML wastes context. Get this page as Markdown: https://developers.cloudflare.com/index.md (append index.md) or send Accept: text/markdown to https://developers.cloudflare.com/. For all Cloudflare products use https://developers.cloudflare.com/llms.txt. You can access all Cloudflare docs in a single file at https://developers.cloudflare.com/llms-full.txt.”

This snippet informs the agent that a Markdown version is available. Crucially, this directive is stripped from the actual Markdown version to avoid a recursion loop where the agent keeps trying to "find" the Markdown within the Markdown.

Dedicated LLM resources sidebar

Finally, we want to make these resources discoverable for the humans who are building with agents. Every product directory in our developer documentation has an "LLM Resources" entry in the sidenav, providing quick access to llms.txt, llms-full.txt, and Cloudflare Skills.

Make your website agent-ready today

Making websites agent-ready is a fundamental accessibility requirement for the modern developer toolkit. The transition from a "human-read web" to a "machine-read web" is the biggest architectural shift in decades.

Get an agent readiness score for your site at isitagentready.com, take the prompts it provides, and ask your agent to upgrade your site for the AI era. Stay tuned for more updates from Cloudflare Radar about the adoption of agent standards across the Internet over the coming year. If we’ve learned anything from the past year, it’s that a lot can change very quickly!

Watch on Cloudflare TV

Shared Dictionaries: compression that keeps up with the agentic web

Alex Krivit — Fri, 17 Apr 2026 13:02:00 GMT

Web pages have grown 6-9% heavier every year for the past decade, spurred by the web becoming more framework-driven, interactive, and media-rich. Nothing about that trajectory is changing. What is changing is how often those pages get rebuilt and how many clients request them. Both are skyrocketing because of agents.

Shared dictionaries shrink asset transfers from servers to browsers so pages load faster with less bloat on the wire, especially for returning users or visitors on a slow connection. Instead of re-downloading entire JavaScript bundles after every deploy, the browser tells the server what it already has cached, and the server only sends the file diffs.

Today, we’re excited to give you a sneak peek of our support for shared compression dictionaries, show you what we’ve seen in early testing, and reveal when you’ll be able to try the beta yourself (hint: it’s April 30, 2026!).

The problem: more shipping = less caching

Agentic crawlers, browsers, and other tools hit endpoints repeatedly, fetching full pages, often to extract a fragment of information. Agentic actors represented just under 10% of total requests across Cloudflare's network during March 2026¹, up ~60% year-over-year.

Every page shipped is heavier than last year and read more often by machines than ever before. But agents aren’t just consuming the web, they’re helping to build it. AI-assisted development means teams ship faster. Increasing the frequency of deploys, experiments, and iterations is great for product velocity, but terrible for caching.

As agents push a one-line fix, the bundler re-chunks, filenames change, and every user on earth could re-download the entire application. Not because the code is meaningfully any different, but because the browser/client has no way to know specifically what changed. It sees a new URL and starts from zero. Traditional compression helps with the size of each download, but it can't help with the redundancy. It doesn't know the client already has 95% of the file cached. So every deploy, across every user, across every bot, sends redundant bytes again and again. Ship ten small changes a day and you've effectively opted out of caching. This wastes bandwidth and CPU in a web where hardware is quickly becoming the bottleneck.

In order to scale with more requests hitting heavier pages that are re-deployed more often, compression has to get smarter.

What are shared dictionaries?

A compression dictionary is a shared reference between server and client that works like a cheat sheet. Instead of compressing a response from scratch, the server says "you already know this part of the file because you’ve cached it before" and only sends what's new. The client holds the same reference and uses it to reconstruct the full response during decompression. The more the dictionary can reference content in the file, the smaller the compressed output that is transferred to the client.

This principle of compressing against what's already known is how modern compression algorithms pull ahead of their predecessors. Brotli ships with a built-in dictionary of common web patterns like HTML attributes and common phrases; Zstandard is purpose-built for custom dictionaries: you can feed it representative content samples and it generates an optimized dictionary for the kind of content you serve. Gzip has neither; it must build dictionaries by finding patterns in real-time as it’s compressing. These “traditional compression” algorithms are already available on Cloudflare today.

Shared dictionaries take this principle a step further: the previously cached version of the resource becomes the dictionary. Remember the deploy problem where a team ships a one-line fix and every user re-downloads the full bundle? With shared dictionaries, the browser already has the old version cached. The server compresses against it, sending only the diff. That 500KB bundle with a one-line change becomes only a few kilobytes on the wire. At 100K daily users and 10 deploys a day, that's the difference between 500GB of transfer and a few hundred megabytes.

Delta compression

Delta compression is what turns the version the browser already has into the dictionary. The protocol looks to when the server first serves a resource, it attaches a Use-As-Dictionary response header, telling the browser to essentially hold onto the file because it’ll be useful later. On the next request for that resource, the browser sends an Available-Dictionary header back, telling the server, "here's what I've got." The server then proceeds to compress the new version against the old one and sends only the diff. No separate dictionary file needed.

This is where the payoff lands for real applications. Versioned JS bundles, CSS files, framework updates, and anything that changes incrementally between releases. The browser has app.bundle.v1.js cached already and the developer makes an update and deploys app.bundle.v2.js. Delta compression only sends the diff between these versions. Every subsequent version after is also just a diff. Version three compresses against version two. Version 47 compresses against version 46. The savings don't reset, they persist across the entire release history.

There's also active discussion in the community about custom and dynamic dictionaries for non-static content. That's future work, but the implications are significant. We'll save that for another post.

So why the wait?

If shared dictionaries are so powerful, why doesn't everyone use them already?

Because the last time they were tried, the implementation couldn't survive contact with the open web.

Google shipped Shared Dictionary Compression for HTTP (SDCH) in Chrome in 2008. It worked well with some early adopters reporting double-digit improvements in page load times. But SDCH accumulated problems faster than anyone was able to fix them.

The most memorable was a class of compression side-channel attacks (CRIME, BREACH). Researchers showed that if an attacker could inject content alongside something sensitive that gets compressed (like a session cookie, token, etc.) the size of the compressed output could leak information about the secret. The attacker could guess a byte at a time, watch whether the asset size shrank, and repeat until they extracted the whole secret.

But security wasn't the only problem, or even the main reason why adoption didn’t happen. SDCH surfaced a few architectural problems like violating the Same-Origin Policy (which ironically is partially why it performed so well). Its cross-origin dictionary model couldn't be reconciled with CORS, and it lacked some specification regarding interactions with things like the Cache API. After a while it became clear that adoption wasn’t ready, so in 2017 Chrome (the only browser supporting at the time) unshipped it.

Getting the web community to pick up the baton took a decade, but it was worth it.

The modern standard, RFC 9842: Compression Dictionary Transport, closes key design gaps that made SDCH untenable. For example, it enforces that an advertised dictionary is only usable on responses from the same-origin, preventing many conditions that made side-channel compression attacks possible.

Chrome and Edge have shipped support with Firefox working to follow. The standard is moving toward broad adoption, but complete cross-browser support is still catching up.

The RFC mitigates the security problems but dictionary transport has always been complex to implement. An origin may have to generate dictionaries, serve them with the right headers, check every request for an Available-Dictionary match, delta-compress the response on the fly, and fall back gracefully when a client doesn't have a dictionary. Caching gets complex too. Responses vary on both encoding and dictionary hash, so every dictionary version creates a separate cache variant. Mid-deploy, you have clients with the old dictionary, clients with the new one, and clients with none. Your cache is storing separate copies for each. Hit rates drop, storage climbs, and the dictionaries themselves have to stay fresh under normal HTTP caching rules.

This complexity is a coordination problem. And exactly the kind of thing that belongs at the edge. A CDN already sits in front of every request, already manages compression, and already handles cache variants (watch this space for a soon-to-come announcement blog).

How Cloudflare is building shared dictionary support

Shared dictionary compression touches every layer of the stack between the browser and the origin. We've seen strong customer interest: some people have already built their own implementations like RFC author Patrick Meenan's dictionary-worker, which runs the full dictionary lifecycle inside a Cloudflare Worker using WASM-compiled Zstandard (as an example). We want to make this accessible to everyone and as easy as possible to implement. So we’re rolling it out across the platform in three phases, starting with the plumbing.

Phase 1: Passthrough support is currently in active development. Cloudflare forwards the headers and encodings that shared dictionaries require like Use-As-Dictionary, Available-Dictionary, and the dcb and dcz content encodings, without stripping, modifying, or recompressing them. The Cache keys are extended to vary on Available-Dictionary and Accept-Encoding so dictionary-compressed responses are cached correctly. This phase serves customers who manage their own dictionaries at the origin.

We plan to have an open beta of Phase 1 ready by April 30, 2026. To use it, you'll need to be on a Cloudflare zone with the feature enabled, have an origin that serves dictionary-compressed responses with the correct headers (Use-As-Dictionary, Content-Encoding: dcb or dcz, Vary on Accept-Encoding, Available-Dictionary), and your visitors need to be on a browser that can use dictionary transport. Today, that means Chrome 130+ and Edge 130+, with Firefox support in progress.

Keep your eyes fixed on the changelog for when this becomes available and more documentation for how to use it.

We’ve already started testing passthrough internally. In a controlled test, we deployed two js bundles in sequence. They were nearly identical except for a few localized changes between the versions representing successive deploys of the same web application. 200 tests were run hitting our San Jose, California PoP with an origin located in Council Bluffs, Iowa serving the dictionary, JS bundle, and dictionary-compressed bundle (note that the dcz compressed bundle was precomputed). For each request we captured TTFB and total download time via curl.

Uncompressed, the asset is 272KB. Gzip brought that down to 92.2KB, a solid 66% reduction. With shared dictionary compression over DCZ, using the previous version as the dictionary, that same asset dropped to 2.6KB. That's a 99% reduction from the uncompressed asset, and still 97% smaller than gzip.

In the same lab test, we measured two timing milestones from the client: time to first byte (TTFB) and full download completion. The timing difference is stark. At P99 on a cache miss, DCZ TTFB is roughly 60 ms faster than gzip. On a cache hit the TTFB gap narrows to a negligible 10ms.

Download completion is where it matters. "Transfer" in the graphs indicates total time curl spent receiving the response minus TTFB. On a cache miss, this time for the DCZ response is 1ms versus 161ms for gzip (99.4% faster - the body essentially arrived alongside the headers). On a cache hit, 1ms versus 54ms (98% faster). The localized changes between versions were already captured by the dictionary, which is exactly the point: for successive deploys of roughly the same application, shared dictionaries eliminate nearly all redundant transfer.

Initial lab results simulating minimal JS bundle diffs, results will vary based on the actual delta between the dictionary and the asset.

Phase 2: This is where Cloudflare starts doing the work for you. Instead of handling dictionary headers, compression, and fallback logic on the origin, in this phase you tell Cloudflare which assets should be used as dictionaries via a rule and we manage the rest for you. We inject the Use-As-Dictionary headers, store the dictionary bytes, delta-compress new versions against old ones, and serve the right variant to each client. Your origin serves normal responses. Any dictionary complexity moves off your infrastructure and onto ours.

To demonstrate this, we've built a live demo to show what this looks like in practice. Try it here: Can I Compress (with Dictionaries)?

The demo deploys a new ~94KB JavaScript bundle every minute, meant to mimic a typical production single page application bundle. The bulk of the code is static between deploys; only a small configuration block changes each time, which also mirrors real-world deploys where most of the bundle is unchanged framework and library code. When the first version loads, Cloudflare's edge stores it as a dictionary. When the next deploy arrives, the browser sends the hash of the version it already has, and the edge delta-compresses the new bundle against it. The result: 94KB compresses to roughly 450 bytes. That's a 99.5% reduction over gzip, because the only thing on the wire is the actual diff.

The demo site includes walkthroughs so you can verify the compression ratios on your own via curl, your browser, or your agent of choice.

Phase 3: The dictionary is automatically generated on behalf of the website. Instead of customers specifying which assets to use as dictionaries, Cloudflare identifies them automatically. Our network already sees every version of every resource that flows through it, which includes millions of sites, billions of requests, and every new deployment. The idea is that when the network observes a URL pattern where successive responses share most of their content, it has a strong signal that the resource is a good candidate for delta compression. If safe to do so, it stores the previous version as a dictionary and compresses subsequent versions against it. No customer configuration. No maintenance.

This is a simple idea, but is genuinely hard. Safely generating dictionaries that avoid revealing private data and identifying traffic for which dictionaries will offer the most benefit are real engineering problems. But Cloudflare has the right pieces: we see the traffic patterns across the entire network, we already manage the cache layer where dictionaries need to live, and our RUM beacon to clients can help give us a validation loop to confirm that a dictionary actually improves compression before we commit to serving it. The combination of traffic visibility, edge storage, and synthetic testing is what makes automatic generation feasible, though there are still many pieces to figure out.

The performance and bandwidth benefits of phase 3 are the crux of our motivation. This is what makes shared dictionaries accessible to everyone using Cloudflare, including the millions of zones that would never have had the engineering time to implement custom dictionaries manually.

The bigger picture

For most of the web's history, compression was stateless. Every response was compressed as if the client had never seen anything before. Shared dictionaries change that: they give compression a memory.

That matters more now than it would have five years ago. Agentic coding tools are compressing the interval between deploys, while also driving a growing share of the traffic that consumes them. While today AI tools can produce massive diffs, agents are gaining more context and becoming surgical in their code changes. This, coupled with more frequent releases and more automated clients means more redundant bytes on every request. Delta compression helps both sides of that equation by reducing the number of bytes per transfer, and the number of transfers that need to happen at all.

Shared Dictionaries took decades to standardize. Cloudflare is helping to build the infrastructure to make it work for every client that touches your site, human or not. Phase 1 beta opens April 30 and we’re excited for you to try it.

_____

^{1Bots =}^~31.3%^{of all HTTP requests. AI = ~}^29-30%^{of all Bot traffic (March 2026).}

Redirects for AI Training enforces canonical content

Cam Whiteside — Fri, 17 Apr 2026 13:00:00 GMT

Cloudflare's Wrangler CLI has published several major versions over the past six years, each containing at least some critical changes to commands, configuration, or how developers interact with the platform. Like any actively maintained open-source project, we keep documentation for older versions available. The v1 documentation carries a deprecation banner, a noindex meta tag, and canonical tags pointing to current docs. Every advisory signal says the same thing: this content is outdated, look elsewhere. AI training crawlers don’t reliably honor those signals.

We use AI Crawl Control on developers.cloudflare.com, so we know that bots in the AI Crawler Category visited 4.8 million times over the last 30 days, and they consumed deprecated content at the same rate as current content. The advisory signals made no measurable difference. The effect is cumulative because AI agents don't always fetch content live; they draw on trained models. When crawlers ingest deprecated docs, agents inherit outdated foundations.

Today, we’re launching Redirects for AI Training to let you enforce that verified AI training crawlers are redirected to up-to-date content. Your existing canonical tags become HTTP 301 redirects for verified AI training crawlers, automatically, with one toggle, on all paid Cloudflare plans.

And because status codes are ultimately how the web communicates policy to crawlers, Radar's AI Insights page now includes Response status code analysis showing the various types (successful (2xx), redirection (3xx), client error (4xx), and server error (5xx) of status codes AI crawlers receive across all Cloudflare traffic as a view of how the web responds to AI crawlers today.

AI training crawlers face dead ends today

For search engines, noindex functions as a rich signal system, but there’s no equivalent inline directive a page can carry that says “don’t train on this”. Keeping a deprecated page live with a warning banner may work for humans, who read the notice and navigate on, but AI training crawlers ingest the full text and risk treating the banner as just one more paragraph, returning thousands of times even after the warning is visible.

Blocking creates its own problem: it produces a void with no signal about what the crawler should learn instead. robots.txt offers limited protection, but as automated traffic grows, maintaining per-crawler, per-path, per-content-update directives requires hefty manual upkeep. What crawlers need is specific direction: “Here is where the current content lives.”

The tag is an HTML element defined in RFC 6596 that tells search engines and automated systems which URL represents the authoritative version of a page. It’s already present on 65-69% of web pages and is generated automatically by platforms like EmDash, WordPress, and Contentful. That infrastructure declares what the current version of your content is, and Redirects for AI Training enforces it.

How it works

Redirects for AI Training operates on two inputs: Cloudflare's cf.verified_bot_category field and the tags already in your HTML. The AI Crawler category covers bots that crawl for AI model training, including GPTBot, ClaudeBot, and Bytespider, and is distinct from the AI Assistant and AI Search categories that cover AI Agents.

When a request arrives from a verified AI Crawler, Cloudflare reads the response HTML. If a non-self-referencing canonical tag is present, Cloudflare issues a 301 Moved Permanently to the canonical URL before returning the response. Human traffic, search indexing, and other automated traffic is unaffected.

Here’s what the exchange looks like for a GPTBot request to a deprecated path:

GET /durable-objects/api/legacy-kv-storage-api/

Host: developers.cloudflare.com

User-Agent: Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)

HTTP/1.1 301 Moved Permanently

Location: https://developers.cloudflare.com/durable-objects/api/sqlite-storage-api/

What this does not do

It doesn't retroactively correct training data already ingested or cover unverified crawlers outside the AI Crawler bot category. Humans and AI Agents visiting deprecated pages will not be redirected. We also exclude cross-origin canonicals by design (tags directing to preferred URLs on different domains), since they’re often used for domain consolidation rather than content freshness. To avoid loops, self-referencing canonicals (a tag on a page pointing to its own URL) don't trigger a redirect either.

Why not just use redirect rules?

Single Redirect Rules can target AI crawlers by user-agent string, and if a site has just a handful of known deprecated paths, that works. But it doesn't scale: every new deprecated path requires a change to the rule, user-agents must be manually tracked, and it would contribute to plan limitations that may otherwise be used for campaign URLs or domain migrations. Redirect rules also manually re-encode what canonical tags already declare and fall out of sync as content changes.

What we found on our own documentation site

Our own experience shows that this problem is real. We run AI Crawl Control on developers.cloudflare.com using the same dashboard available to all Cloudflare customers. In March 2026, legacy Workers documentation was crawled around 46,000 times by OpenAI, 3,600 times by Anthropic, and 1,700 times by Meta.

That crawling of deprecated pages may be why when we asked a leading AI assistant in April 2026, "How do I write KV values using the Wrangler CLI?", it gave an out-of-date answer: "You write to Cloudflare KV via the Wrangler CLI using the kv:key put command."

In fact, the correct syntax (as at April 2026) is wrangler kv key put; the colon syntax (kv:key put) was deprecated in Wrangler 3.60.0. Our documentation carries an inline deprecation notice, but it's unclear how training pipelines interpret them.

So we enabled Redirects for AI Training on developers.cloudflare.com and measured the response. In the first seven days, 100% of AI training crawler requests to pages with non-self-referencing canonical tags were redirected and were not served with deprecated content.

We expect that redirecting crawlers to current content eventually improves AI-generated answers about legacy tools. Given the closed nature of training pipelines and variability in recrawl timing, this is a hypothesis we will continue to verify. But what the crawler receives at the point of access has seen immediate improvement.

How to enable

If your site has canonical tags, your existing content hierarchy can now be enforced for verified AI training crawlers. Cloudflare's verified bot classification handles crawler identification automatically.

In the dashboard: on any domain, go to AI Crawl Control > Quick Actions > Redirects for AI training > toggle on.

For path-specific control via Configuration Rules and Cloudflare for SaaS, see the full documentation.

How the web responds to AI crawlers

Redirects for AI Training turns one status code, 301 Moved Permanently, into an enforcement mechanism for your content policy. But 301 is one signal in a broader conversation between origins and crawlers. A 200 OK means content was served. A 403 Forbidden means access was blocked. A 402 Payment Required tells the client it needs to pay for access. Taken together, the distribution of status codes across AI crawler traffic reveals how the web is actually responding to crawlers at scale.

Radar’s AI Insights page now includes a Response status code analysis graph illustrating the distribution of the top response status codes or response status code groupings (selectable via a dropdown) for AI crawler traffic. The data can be filtered by industry set; the crawl purpose filter can also be applied in Data Explorer. Filtered analyses provide a perspective into whether certain types of crawlers behave differently, or if request patterns and distributions vary by industry.

In the general example shown below, we can see that for the time period covered by the graph, just over 70% of requests were serviced successfully (200), while 10.1% of the requests were redirected (301, 302) to another URL, and 3.7% were for files that weren’t found (404). Access to content was blocked for 8.3% of requests, receiving a 403 response status code. Grouped, we find that nearly 74% of requests received successful responses (2xx), 13.7% received client error responses (4xx), 11.3% received redirection messages (3xx), and 1.2% were sent server error responses (5xx).

This analysis has also been added to individual bot pages to provide insight into this aspect of a crawler’s behavior as well. In the GPTBot example shown below, we can see that for the time period covered by the graph, just over 80% of requests were serviced successfully (200), while 4.7% of the requests were redirected (301, 302) to another URL, and just 2.7% were for files that weren’t found (404). Nearly 6% were blocked, with Cloudflare returning a 403 response status code. Grouped, we find that 83% of requests received successful responses (2xx), nearly 10% received client error responses (4xx), 5.1% received redirection messages (3xx), and the remaining 2.2% got server error responses (5xx).

As noted above, Radar’s Data Explorer enables users to drill down further into the data by applying additional filters. For example, we can look at things like which crawlers are requesting the most non-existent content (resulting in a 404 response status code), and how that request traffic trends over time, or which industries are sending the most Redirection (3xx) response status codes to Training crawlers, and how that activity trends over time.

Response status code data, both in aggregate and on a per-bot basis, is also available through the Cloudflare Radar API.

Redirects for AI Training lets you shape what crawlers receive from your origin; Radar's status code analysis lets you see how the rest of the web is doing the same. Enable Redirects for AI Training in AI Crawl Control > Overview > Quick Actions to start replacing advisory signals with enforced outcomes on your site today.

Have questions or want to share what you're seeing? Join the discussion on the Cloudflare Community or find us on Discord.

Watch on Cloudflare TV

Unweight: how we compressed an LLM 22% without sacrificing quality

Mari Galicer — Fri, 17 Apr 2026 13:00:00 GMT

Running inference within 50ms of 95% of the world's Internet-connected population means being ruthlessly efficient with GPU memory. Last year we improved memory utilization with Infire, our Rust-based inference engine, and eliminated cold-starts with Omni, our model scheduling platform. Now we are tackling the next big bottleneck in our inference platform: model weights.

Generating a single token from an LLM requires reading every model weight from GPU memory. On the NVIDIA H100 GPUs we use in many of our datacenters, the tensor cores can process data nearly 600 times faster than memory can deliver it, leading to a bottleneck not in compute, but memory bandwidth. Every byte that crosses the memory bus is a byte that could have been avoided if the weights were smaller.

To solve this problem, we built Unweight: a lossless compression system that can make model weights up to 15–22% smaller while preserving bit-exact outputs, without relying on any special hardware. The core breakthrough here is that decompressing weights in fast on-chip memory and feeding them directly to the tensor cores avoids an extra round-trip through slow main memory. Depending on the workload, Unweight’s runtime selects from multiple execution strategies – some prioritize simplicity, others minimize memory traffic – and an autotuner picks the best one per weight matrix and batch size.

This post dives into how Unweight works, but in the spirit of greater transparency and encouraging innovation in this rapidly developing space, we’re also publishing a technical paper and open sourcing the GPU kernels.

Our initial results on Llama-3.1-8B show ~30% compression of Multi-Layer Perceptron (MLP) weights alone. Because Unweight works selectively on the parameters for decoding, this leads to a 15-22% in model size reduction and ~3 GB VRAM savings. As shown in the graphic below, this enables us to squeeze more out of our GPUs and thus run more models in more places — making inference cheaper and faster on Cloudflare’s network.

^{Thanks to Unweight, we’re able to fit more models on a single GPU}

Why compression is harder than it sounds

There is a growing body of research exploring how to compress model weights in creative ways to make inference faster and/or run on smaller GPUs. The most common is quantization, a technique to reduce the size of model weights and activations by converting large 32- or 16-bit floating point numbers to smaller 8 or 4-bit integers. This is a form of lossy compression: different 16-bit floating point values can be converted to the same 4-bit integer. This reduction in accuracy affects the quality of responses in unpredictable ways. For production inference serving diverse use cases, we knew we wanted something lossless that preserves exact model behaviour.

Several recent systems (Huff-LLM, ZipNN, and ZipServ) have shown that LLM weights can be compressed significantly, but these approaches target different problems than ours. ZipNN compresses weights for distribution and storage with decompression happening on the CPU. HUff-LLM proposes custom FGPA hardware for decoding. And ZipServ does fuse decompression with GPU inference, but targets consumer grade GPUs, which don’t work with our H100 GPUs. None of these gave us what we needed: lossless inference-time decompression on Hopper GPUs that can integrate with our Rust based inference engine.

The core challenge isn't vanilla compression — exponent bytes in BF16 weights are highly redundant, so entropy coding works well on them. The challenge is decompressing fast enough that it doesn't slow down inference. On an H100, the tensor cores sit idle waiting for memory most of the time — but that idle capacity can't simply be repurposed for decompression. Each GPU compute unit can run either the decompression kernel or the matrix multiplication kernel, not both simultaneously, due to shared memory constraints. Any decode latency that isn't perfectly overlapped with the matrix multiplication becomes directly additive to token latency. Unweight's answer is to decompress weights in fast on-chip shared memory and feed the results directly to the tensor cores — but making that work efficiently across different batch sizes and weight shapes is where the real engineering lives.

How model weights can be compressed effectively

Every number in an AI model is stored as a 16-bit "brain float" (BF16). Each BF16 value has three parts:

Sign (1 bit): positive or negative
Exponent (8 bits): the magnitude
Mantissa (7 bits): the precise value within that magnitude

Here’s how one of these weights breaks down:

The sign and mantissa vary unpredictably across weights — they look like random data and can't be meaningfully compressed. But the exponent tells a different story.

The exponent is surprisingly predictable

Prior research has established that across trained LLMs, out of 256 possible exponent values, just a handful dominate. The top 16 most common exponents cover over 99% of all weights in a typical layer. Information theory says you only need ~2.6 bits to represent this distribution — far less than the 8 bits allocated. If you look at the exponent value distribution in a typical LLM layer, you can see that the top 16 exponents account for 99% of all model weights.

Exponent value distribution in a typical LLM layer

This is the redundancy that Unweight exploits. We leave the sign and mantissa untouched and compress only the exponent byte using Huffman coding — a classic technique that assigns short codes to common values and longer codes to rare ones. Because the exponent distribution is so skewed, this achieves roughly 30% compression on the exponent stream. We apply this selectively to the MLP weight matrices (gate, up, and down projections), which make up roughly two-thirds of a model’s parameters and dominate memory traffic during token generation. Attention weights, embeddings and layer norms are uncompressed. All told the optimizations translate to about 20% reduction in overall multilayer perceptron (MLP) weight size, as explained in full detail in our technical report.

The small number of weights with rare exponents are handled separately: if any weight in a row of 64 has an exponent outside the top-16 palette, the entire row is stored verbatim. This approach eliminates per-element branching in the hot path — instead of checking every single weight for edge cases, we make one decision per row up front.

The GPU memory bottleneck

An NVIDIA H100 GPU has two relevant kinds of memory:

High Bandwidth Memory (HBM): large, but relatively slow to access. This is where model weights live.
Shared memory (SMEM): tiny, but extremely fast. This is where the GPU stages data right before doing math.

^{During inference, generating each token requires reading the full weight matrix from HBM. The memory bus between HBM and SMEM is the performance bottleneck –}^{not the math itself. Fewer bytes across the bus = faster token generation.}

During inference, generating each token requires reading the full weight matrix from HBM through the memory bus — this is the bottleneck. The H100's tensor cores can crunch numbers far faster than HBM can feed them data. Compression helps because fewer bytes need to cross the bus. But there's a catch: the GPU can't do math on compressed data. The weights must be decompressed first.

Most prior work decompresses entire weight matrices back into HBM, then runs a standard matrix multiplication. This helps with storage capacity but doesn't help with bandwidth because you still read the full uncompressed matrix from HBM for every token.

Four ways to use compressed weights

There's no single best way to use compressed weights during inference. The right approach depends on the workload — the batch size, the shape of the weight matrix, and how much GPU time is available for decompression. Unweight offers four compressed execution pipelines, each with a different balance between decompression effort and computation complexity: a full Huffman decode, exponent-only decode, palette transcode, or skipping pre-processing completely.

^{Four different execution pipelines}

The four pipelines form a spectrum. At one end, full decode completely reconstructs the original BF16 weights and hands them to NVIDIA’s cuBLAS library for a standard matrix multiplication. This is the simplest path with cuBLAS running at full speed on ordinary data, but the preprocess step writes the most bytes back to main memory. It works well at small batch sizes where the matrix multiplication is tiny and custom kernel overhead dominates. At the other end, direct palette skips preprocessing entirely. Weights are pre-transcoded to a compact 4-bit format at model load time, and the matrix multiplication kernel reconstructs BF16 values on the fly from these indices. Zero preprocess cost, but the kernel does more work per element.

In between sit two independent paths: one that decodes only the exponent bytes (halving preprocess traffic), and one that transcodes to 4-bit palette indices at runtime (quartering it). Both use a reconstructive matrix multiplication — a custom kernel that loads compressed data, reconstructs BF16 in fast shared memory, and feeds it directly to the tensor cores without a round-trip through main memory.

Why no single pipeline wins

Less preprocessing means less data written to HBM, which frees the memory bus sooner. But it shifts more reconstruction work onto the matmul kernel. Whether that tradeoff pays off depends on the situation.

With small batch sizes (i.e. 1-64 tokens), the matmul is tiny, so there isn't much computation to overlap with, and the fixed costs of a custom kernel dominate. Full decode + cuBLAS often wins simply because cuBLAS has lower overhead. With large batch sizes (i.e. 256+ tokens), the matmul runs long enough to absorb the extra reconstruction work. A lighter preprocess finishes faster, and the freed-up bus bandwidth and compute overlap pay off. The palette or exponent pipelines pull ahead. Different weight matrices within the same layer can favor different pipelines. The "gate" and "up" projections have different dimensions than the "down" projection, changing the order of operations performed within the matmul which requires different performance tradeoffs.

Throughput vs pipeline strategy

This is why Unweight doesn't hard-code a single strategy. The runtime picks the best pipeline for each weight matrix at each batch size, informed by an autotuning process that measures actual end-to-end throughput on the target hardware (more on this below).

How the reconstructive matmul works

Three of the four pipelines use a custom matrix multiplication kernel that fuses decompression with computation. This kernel loads compressed data from HBM, reconstructs the original BF16 values in shared memory, and feeds them directly into the tensor cores — all in one operation. The reconstructed weights never exist in main memory.

Traditional decompression vs Unweight

With Unweight, ~30% fewer bytes cross the memory bus for MLP weight matrices

Inside this kernel, the GPU's thread groups are split into two roles:

A producer group loads compressed inputs from HBM into shared memory using dedicated memory-copy hardware (TMA). It stages sign+mantissa bytes, exponent data (or palette indices), and – for rows with rare exponents – the verbatim exponent rows. It runs ahead of the consumer, filling a circular buffer so data is ready before it's needed.
Consumer groups reconstruct BF16 values by combining exponents with sign+mantissa bytes, then immediately feed the result into Hopper's WGMMA tensor-core instructions. The reconstructed weights go straight from assembly to computation without leaving shared memory.

The reconstructive matmul comes in multiple variants, differing in how many output tiles each compute unit handles and how deep the circular buffer runs. Wider output tiles improve data reuse at large batch sizes; deeper buffers hide memory latency at small batch sizes. The autotuner selects the best variant per workload.

Sharing the GPU between decoding and computation

In the two fused pipelines, a separate preprocess kernel (Huffman decoder or palette transcoder) runs concurrently with the reconstructive matmul. But these kernels compete for GPU resources.

On Hopper, each compute unit (SM) has 228 KB of shared memory. The reconstructive matmul needs ~227 KB for its pipeline buffer and accumulator tiles. A decode kernel needs ~16 KB for its Huffman lookup table. Since 227 + 16 > 228, these two kernels cannot share the same compute unit. Every SM assigned to decoding is one fewer SM available for the matmul.

This creates a balancing act: more decode SMs means faster preprocessing but slower matrix multiplication, and vice versa. The optimal split is another tunable parameter — and another reason why the autotuner measures real throughput rather than relying on heuristics.

Pipelining across layers

Even with the SM partitioning constraint, Unweight hides much of the decompression cost by exploiting the structure of transformer models.

Not every layer needs Huffman decoding at runtime. Unweight classifies layers as "hard" (requiring Huffman preprocessing) or "easy" (using pre-transcoded palette data that the matmul can consume directly). The runtime alternates between them:

^{Decode runs on separate CUDA streams during bootstrap, attention, and easy MLP compute. By the time a hard layer's MLP runs, its preprocessed weights are already waiting}

While the GPU computes an easy layer — which needs no preprocessing — a separate set of CUDA streams is decoding the next hard layer's weights in the background. By the time the easy layers finish and the hard layer's turn arrives, its preprocessed data is already waiting. Double-buffered preprocess slots ensure that decode output from one hard layer isn't overwritten while it's still being consumed.

The down projection benefits most from this overlap: it's consumed last in the MLP sequence (after gate, activation, and up), so its decode has the longest runway to complete.

Autotuning

With four pipelines, multiple matmul kernel variants , and a tunable SM split between decoding and computation, the configuration space is large. Rather than hard-coding a single strategy, Unweight uses an autotuner that measures actual end-to-end inference throughput on the target hardware. It sweeps candidate configurations for the gate projection while holding up and down fixed, then sweeps up, then down, repeating until no further improvement is found. The result is a per-model configuration file that tells the runtime exactly which pipeline, matmul variant, and SM allocation to use for each projection at each batch size — all driven by measured performance rather than heuristics.

One compression format, multiple uses

Encoding format, execution pipeline, and scheduling are independent choices. The same Huffman-compressed model bundle can serve both distribution and inference:

For distribution, Huffman encoding maximizes compression (~22% total model size reduction), reducing transfer times when shipping models across the network.
For inference, Huffman-encoded projections can be transcoded to the palette intermediate format on model load, enabling the most efficient runtime execution without constraining the distribution format.

A single model bundle doesn't need to commit to one strategy at packaging time. The runtime selects the best execution path per projection and per batch size on the fly.

Our results

On Llama 3.1 8B (our primary testbed), Unweight achieves:

~13% model footprint reduction for inference bundles (compressing only gate/up MLP projections), or ~22% for distribution bundles (compressing all MLP projections including down). All compression is 100% bit-exact lossless. Extrapolating to Llama 70B, this can translate to roughly 18–28 GB saved depending on configuration.
30–40% throughput overhead at current optimization level, measured end-to-end on H100 SXM5. The overhead is largest at batch size 1 (~41%) and narrows at batch 1024 (~30%). Three known sources – small-batch fixed costs, redundant weight-tile reconstruction, and the excluded down projection – are under active optimization.

These are intermediate results on a single model. The compression ratios should generalize to other SwiGLU architectures (exponent statistics are consistent across model scales), but the throughput numbers are specific to the current kernel implementations and will change as optimization continues. We do not yet compress attention weights, embeddings, or layer norms, which dilute the overall reduction.

Why this matters

GPUs are expensive in multiple dimensions: the cost of the cards themselves, the high-bandwidth memory they demand, and their significant power consumption.

To combat this, several researchers have shown systems with promising results of ~30% compression ratios on full models — but these target consumer GPUs and research frameworks that don’t work at production scale. The key insight into Unweight’s development is that multilayer perceptrons (MLPs) constitute the majority of model weights and a significant amount of the compute cost during inference workloads. It compresses only MLP weights (avoiding overhead on layers where compression benefit is marginal), is designed specifically for datacenter H100 GPUs with their tightly-balanced compute and memory, and comes with four execution pipelines that adapt to batch size rather than using a single approach.

However, we want to be clear: Unweight is not a free lunch. On-chip reconstruction adds computational work that wouldn't exist with uncompressed weights. On Llama 3.1 8B, the inference configuration saves approximately 13% of total model memory at a throughput cost of roughly 30% at typical serving batch sizes. This gap narrows at larger batches (where preprocess overlap improves) and is expected to narrow further as we optimize — in particular, we haven't yet compressed the down projection in each MLP layer (about one-third of the compressible weights), and several kernel improvements are in active development.

For Cloudflare's network, Unweight gives us better capacity: it allows us to serve state-of-the-art models with less GPU memory per instance, which translates to cost savings and the ability to deploy more models in more places. For model distribution, the savings are larger: Huffman-compressed bundles are about 22% smaller, reducing transfer times when shipping models to edge locations worldwide.

What’s next

Looking forward, we have three concrete research directions we think will improve upon our efficiency gains:

Down projection compression. Unweight compresses gate and up MLP projections today, but down projection accounts for roughly one-third of compressible weights. This requires a different kernel variant due to its transposed dimensions, which we will expect to reduce the total model size beyond 22%.

Kernel optimization. The current 30–40% throughput overhead has three identified sources: small-batch fixed costs in the reconstructive matmul, redundant weight reconstruction at large batch sizes, and the missing down projection. Each has a known mitigation path, which we outline in our technical paper.

More models. Our results are for Llama 3.1 8B, but the underlying exponent statistics are consistent across SwiGLU architectures at all scales. We're working to bring Unweight to the larger models we serve through Workers AI.

Longer term, we are investigating what Unweight’s architecture means for Mixture-of-Experts models, where cold experts must be fetched on demand and reduced storage would further reduce cost.

This is a fast-moving field, so we’re excited to open-source our work here and contribute to a growing corpus of research in compression and GPU efficiency. Unweight is one piece of the puzzle, but we hope that other researchers find it a useful paradigm to build upon!