
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built and the technologies they use, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 07 Apr 2026 09:31:50 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Cloudflare just got faster and more secure, powered by Rust]]></title>
            <link>https://blog.cloudflare.com/20-percent-internet-upgrade/</link>
            <pubDate>Fri, 26 Sep 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We’ve replaced the original core system in Cloudflare, built on NGINX, with a new modular Rust-based proxy. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare is relentless about building and running the world’s fastest network. We have been <a href="https://blog.cloudflare.com/tag/network-performance-update/"><u>tracking and reporting on our network performance since 2021</u></a>: you can see the latest update <a href="https://blog.cloudflare.com/tag/network-performance-update/"><u>here</u></a>.</p><p>Building the fastest network requires work in many areas. We invest a lot of time in our hardware, to have efficient and fast machines. We invest in peering arrangements, to make sure we can talk to every part of the Internet with minimal delay. On top of this, we also have to invest in the software we run our network on, especially as each new product can otherwise add more processing delay.</p><p>No matter how fast messages arrive, we introduce a bottleneck if that software takes too long to think about how to process and respond to requests. Today we are excited to share a significant upgrade to our software that cuts the median time we take to respond by 10ms and delivers a 25% performance boost, as measured by third-party CDN performance tests.</p><p>We've spent the last year rebuilding major components of our system, and we've just slashed the latency of traffic passing through our network for millions of our customers. At the same time, we've made our system more secure, and we've reduced the time it takes for us to build and release new products. </p>
    <div>
      <h2>Where did we start?</h2>
      <a href="#where-did-we-start">
        
      </a>
    </div>
    <p>Every request that hits Cloudflare starts a journey through our network. It might come from a browser loading a webpage, a mobile app calling an API, or automated traffic from another service. These requests first terminate at our HTTP and TLS layer, then pass into a system we call FL, and finally through <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a>, which performs cache lookups or fetches data from the origin if needed.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CKW3jrB9vw4UhXx6OYJpt/928b9e9c1c794f7622fc4ad469d66b60/image3.png" />
          </figure><p>FL is the brain of Cloudflare. Once a request reaches FL, we then run the various security and performance features in our network. It applies each customer’s unique configuration and settings, from enforcing <a href="https://developers.cloudflare.com/waf/managed-rules/"><u>WAF rules </u></a>and <a href="https://www.cloudflare.com/ddos/"><u>DDoS protection</u></a> to routing traffic to the <a href="https://developers.cloudflare.com/learning-paths/workers/devplat/intro-to-devplat/"><u>Developer Platform </u></a>and <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>. </p><p>Built more than 15 years ago, FL has been at the core of Cloudflare’s network. It enables us to deliver a broad range of features, but over time that flexibility became a challenge. As we added more products, FL grew harder to maintain, slower to process requests, and more difficult to extend. Each new feature required careful checks across existing logic, and every addition introduced a little more latency, making it increasingly difficult to sustain the performance we wanted.</p><p>You can see how FL is key to our system — we’ve often called it the “brain” of Cloudflare. It’s also one of the oldest parts of our system: the first commit to the codebase was made by one of our founders, Lee Holloway, well before our initial launch. We’re celebrating our 15th Birthday this week - this system started 9 months before that!</p>
            <pre><code>commit 39c72e5edc1f05ae4c04929eda4e4d125f86c5ce
Author: Lee Holloway &lt;q@t60.(none)&gt;
Date:   Wed Jan 6 09:57:55 2010 -0800

    nginx-fl initial configuration</code></pre>
            <p>As the commit implies, the first version of FL was built on the NGINX webserver, with product logic implemented in PHP. After three years, the system had become too complex to manage effectively and too slow to respond, so the running system was almost completely rewritten. This led to another significant commit, this time made by Dane Knecht, who is now our CTO.</p>
            <pre><code>commit bedf6e7080391683e46ab698aacdfa9b3126a75f
Author: Dane Knecht
Date:   Thu Sep 19 19:31:15 2013 -0700

    remove PHP.</code></pre>
            <p>From this point on, FL was implemented using NGINX, the <a href="https://openresty.org/en/"><u>OpenResty</u></a> framework, and <a href="https://luajit.org/"><u>LuaJIT</u></a>.  While this was great for a long time, over the last few years it started to show its age. We had to spend increasing amounts of time fixing or working around obscure bugs in LuaJIT. The highly dynamic and unstructured nature of our Lua code, which was a blessing when first trying to implement logic quickly, became a source of errors and delay when trying to integrate large amounts of complex product logic. Each time a new product was introduced, we had to go through all the other existing products to check if they might be affected by the new logic.</p><p>It was clear that we needed a rethink. So, in July 2024, we cut an initial commit for a brand new, and radically different, implementation. To save time agreeing on a new name for this, we just called it “FL2”, and started, of course, referring to the original FL as “FL1”.</p>
            <pre><code>commit a72698fc7404a353a09a3b20ab92797ab4744ea8
Author: Maciej Lechowski
Date:   Wed Jul 10 15:19:28 2024 +0100

    Create fl2 project</code></pre>
            
    <div>
      <h2>Rust and rigid modularization</h2>
      <a href="#rust-and-rigid-modularization">
        
      </a>
    </div>
    <p>We weren’t starting from scratch. We’ve <a href="https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/"><u>previously blogged</u></a> about how we replaced another one of our legacy systems with Pingora, which is built in the <a href="https://www.rust-lang.org/"><u>Rust</u></a> programming language, using the <a href="https://tokio.rs/"><u>Tokio</u></a> runtime. We’ve also <a href="https://blog.cloudflare.com/introducing-oxy/"><u>blogged about Oxy</u></a>, our internal framework for building proxies in Rust. We write a lot of Rust, and we’ve gotten pretty good at it.</p><p>We built FL2 in Rust, on Oxy, and built a strict module framework to structure all the logic in FL2.</p>
    <div>
      <h2>Why Oxy?</h2>
      <a href="#why-oxy">
        
      </a>
    </div>
    <p>When we set out to build FL2, we knew we weren’t just replacing an old system; we were rebuilding the foundations of Cloudflare. That meant we needed more than just a proxy; we needed a framework that could evolve with us, handle the immense scale of our network, and let teams move quickly without sacrificing safety or performance. </p><p>Oxy gives us a powerful combination of performance, safety, and flexibility. Built in Rust, it eliminates entire classes of bugs that plagued our NGINX/LuaJIT-based FL1, like memory safety issues and data races, while delivering C-level performance. At Cloudflare’s scale, those guarantees aren’t nice-to-haves; they’re essential. Every microsecond saved per request translates into tangible improvements in user experience, and every crash or edge case avoided keeps the Internet running smoothly. Rust’s strict compile-time guarantees also pair perfectly with FL2’s modular architecture, where we enforce clear contracts between product modules and their inputs and outputs.</p><p>But the choice wasn’t just about language. Oxy is the culmination of years of experience building high-performance proxies. It already powers several major Cloudflare services, from our <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Zero Trust</u></a> Gateway to <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>Apple’s iCloud Private Relay</u></a>, so we knew it could handle the diverse traffic patterns and protocol combinations that FL2 would see. Its extensibility model lets us intercept, analyze, and manipulate traffic from <a href="https://www.cloudflare.com/en-gb/learning/ddos/glossary/open-systems-interconnection-model-osi/"><u>layer 3 up to layer 7</u></a>, and even decapsulate and reprocess traffic at different layers. 
That flexibility is key to FL2’s design because it means we can treat everything from HTTP to raw IP traffic consistently and evolve the platform to support new protocols and features without rewriting fundamental pieces.</p><p>Oxy also comes with a rich set of built-in capabilities that previously required large amounts of bespoke code. Things like monitoring, soft reloads, dynamic configuration loading and swapping are all part of the framework. That lets product teams focus on the unique business logic of their module rather than reinventing the plumbing every time. This solid foundation means we can make changes with confidence, ship them quickly, and trust they’ll behave as expected once deployed.</p>
    <div>
      <h2>Smooth restarts - keeping the Internet flowing</h2>
      <a href="#smooth-restarts-keeping-the-internet-flowing">
        
      </a>
    </div>
    <p>One of the most impactful improvements Oxy brings is handling of restarts. Any software under continuous development and improvement will eventually need to be updated. In desktop software, this is easy: you close the program, install the update, and reopen it. On the web, things are much harder. Our software is in constant use and cannot simply stop. A dropped HTTP request can cause a page to fail to load, and a broken connection can kick you out of a video call. Reliability is not optional.</p><p>In FL1, upgrades meant restarts of the proxy process. Restarting a proxy meant terminating the process entirely, which immediately broke any active connections. That was particularly painful for long-lived connections such as WebSockets, streaming sessions, and real-time APIs. Even planned upgrades could cause user-visible interruptions, and unplanned restarts during incidents could be even worse.</p><p>Oxy changes that. It includes a built-in mechanism for <a href="https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/"><u>graceful restarts</u></a> that lets us roll out new versions without dropping connections whenever possible. When a new instance of an Oxy-based service starts up, the old one stops accepting new connections but continues to serve existing ones, allowing those sessions to continue uninterrupted until they end naturally.</p><p>This means that if you have an ongoing WebSocket session when we deploy a new version, that session can continue uninterrupted until it ends naturally, rather than being torn down by the restart. Across Cloudflare’s fleet, deployments are orchestrated over several hours, so the aggregate rollout is smooth and nearly invisible to end users.</p><p>We take this a step further by using systemd socket activation. Instead of letting each proxy manage its own sockets, we let systemd create and own them. This decouples the lifetime of sockets from the lifetime of the Oxy application itself. 
If an Oxy process restarts or crashes, the sockets remain open and ready to accept new connections, which will be served as soon as the new process is running. That eliminates the “connection refused” errors that could happen during restarts in FL1 and improves overall availability during upgrades.</p><p>We also built our own coordination mechanisms in Rust to replace Go libraries like <a href="https://github.com/cloudflare/tableflip"><u>tableflip</u></a> with <a href="https://github.com/cloudflare/shellflip"><u>shellflip</u></a>. This uses a restart coordination socket that validates configuration, spawns new instances, and ensures the new version is healthy before the old one shuts down. This improves feedback loops and lets our automation tools detect and react to failures immediately, rather than relying on blind signal-based restarts.</p>
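<p>As a rough illustration of the socket-activation handshake (a sketch, not Oxy’s actual code; the function name here is hypothetical), a service can decide at startup whether systemd has handed it listening sockets:</p>

```rust
/// Sketch of the systemd socket-activation check. Under the protocol,
/// systemd sets LISTEN_PID to the pid the sockets are intended for and
/// LISTEN_FDS to how many descriptors were passed (starting at fd 3).
/// Written as a pure function over the environment values so the logic
/// is easy to see; a real service would read std::env and wrap the fds.
fn inherited_listener_count(
    listen_pid: Option<&str>,
    listen_fds: Option<&str>,
    my_pid: u32,
) -> usize {
    let intended_pid = listen_pid.and_then(|v| v.parse::<u32>().ok());
    if intended_pid != Some(my_pid) {
        // The sockets were addressed to another process (or not set at
        // all): fall back to binding our own listeners.
        return 0;
    }
    listen_fds.and_then(|v| v.parse::<usize>().ok()).unwrap_or(0)
}

fn main() {
    // A socket-activated start: systemd addressed two sockets to pid 4242.
    println!("{}", inherited_listener_count(Some("4242"), Some("2"), 4242));
    // A plain start: nothing inherited, so we bind our own sockets.
    println!("{}", inherited_listener_count(None, None, 4242));
}
```

<p>Because the sockets outlive any single process, a crash or restart leaves the kernel’s accept queue intact, and pending connections are served as soon as the replacement process picks the descriptors up.</p>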
    <div>
      <h2>Composing FL2 from Modules</h2>
      <a href="#composing-fl2-from-modules">
        
      </a>
    </div>
    <p>To avoid the problems we had in FL1, we wanted a design where all interactions between product logic were explicit and easy to understand. </p><p>So, on top of the foundations provided by Oxy, we built a platform which separates all the logic built for our products into well-defined modules. After some experimentation and research, we designed a module system which enforces some strict rules:</p><ul><li><p>No IO (input or output) can be performed by the module.</p></li><li><p>The module provides a list of <b>phases</b>.</p></li><li><p>Phases are evaluated in a strictly defined order, which is the same for every request.</p></li><li><p>Each phase defines a set of inputs which the platform provides to it, and a set of outputs which it may emit.</p></li></ul><p>Here’s an example of what a module phase definition looks like:</p>
            <pre><code>Phase {
    name: phases::SERVE_ERROR_PAGE,
    request_types_enabled: PHASE_ENABLED_FOR_REQUEST_TYPE,
    inputs: vec![
        InputKind::IPInfo,
        InputKind::ModuleValue(
            MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE.as_str(),
        ),
        InputKind::ModuleValue(MODULE_VALUE_ORIGINAL_SERVE_RESPONSE.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_UPSTREAM_ERROR_DETAILS.as_str()),
        InputKind::RayId,
        InputKind::StatusCode,
        InputKind::Visitor,
    ],
    outputs: vec![OutputValue::ServeResponse],
    filters: vec![],
    func: phase_serve_error_page::callback,
}</code></pre>
            <p>This phase is for our custom error page product.  It takes a few things as input — information about the IP of the visitor, some header and other HTTP information, and some “module values.” Module values allow one module to pass information to another, and they’re key to making the strict properties of the module system workable. For example, this module needs some information that is produced by the output of our rulesets-based custom errors product (the “<code>MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT</code>” input). These input and output definitions are enforced at compile time.</p><p>While these rules are strict, we’ve found that we can implement all our product logic within this framework. The benefit of doing so is that we can immediately tell which other products might affect each other.</p>
    <div>
      <h2>How to replace a running system</h2>
      <a href="#how-to-replace-a-running-system">
        
      </a>
    </div>
    <p>Building a framework is one thing. Building all the product logic and getting it right, so that customers don’t notice anything other than a performance improvement, is another.</p><p>The FL code base supports 15 years of Cloudflare products, and it’s changing all the time. We couldn’t stop development. So, one of our first tasks was to find ways to make the migration easier and safer.</p>
    <div>
      <h3>Step 1 - Rust modules in OpenResty</h3>
      <a href="#step-1-rust-modules-in-openresty">
        
      </a>
    </div>
    <p>Rebuilding product logic in Rust is already a big distraction from shipping products to customers. Asking all our teams to maintain two versions of their product logic, and to reimplement every change a second time until the migration finished, was too much.</p><p>So, we implemented a layer in our old NGINX- and OpenResty-based FL which allowed the new modules to be run. Instead of maintaining a parallel implementation, teams could implement their logic in Rust and replace their old Lua logic with it, without waiting for the full replacement of the old system.</p><p>For example, here’s part of the implementation for the custom error page module phase defined earlier (we’ve cut out some of the more boring details, so this doesn’t quite compile as-written):</p>
            <pre><code>pub(crate) fn callback(_services: &amp;mut Services, input: &amp;Input&lt;'_&gt;) -&gt; Output {
    // Rulesets produced a response to serve - this can either come from a special
    // Cloudflare worker for serving custom errors, or be directly embedded in the rule.
    if let Some(rulesets_params) = input
        .get_module_value(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT)
        .cloned()
    {
        // Select either the result from the special worker, or the parameters embedded
        // in the rule.
        let body = input
            .get_module_value(MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE)
            .and_then(|response| {
                handle_custom_errors_fetch_response("rulesets", response.to_owned())
            })
            .or(rulesets_params.body);

        // If we were able to load a body, serve it, otherwise let the next bit of logic
        // handle the response
        if let Some(body) = body {
            let final_body = replace_custom_error_tokens(input, &amp;body);

            // Increment a metric recording number of custom error pages served
            custom_pages::pages_served("rulesets").inc();

            // Return a phase output with one final action, causing an HTTP response to be served.
            return Output::from(TerminalAction::ServeResponse(ResponseAction::OriginError {
                status: rulesets_params.status,
                source: "rulesets http_custom_errors",
                headers: rulesets_params.headers,
                body: Some(Bytes::from(final_body)),
            }));
        }
    }
}</code></pre>
            <p>The internal logic in each module is quite cleanly separated from the handling of data, with very clear and explicit error handling encouraged by the design of the Rust language.</p><p>Many of our most actively developed modules were handled this way, allowing the teams to maintain their change velocity during our migration.</p>
    <div>
      <h3>Step 2 - Testing and automated rollouts</h3>
      <a href="#step-2-testing-and-automated-rollouts">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6OElHIMjnkIQmsbx92ZOYc/545a26ba623a253ea1a5c6b13279301d/image4.png" />
          </figure><p>It’s essential to have a seriously powerful test framework to cover such a migration. We built a system, internally named Flamingo, which allows us to run thousands of full end-to-end test requests concurrently against our production and pre-production systems. The same tests run against FL1 and FL2, giving us confidence that we’re not changing behaviours.</p><p>Whenever we deploy a change, that change is rolled out gradually across many stages, with increasing amounts of traffic. Each stage is automatically evaluated, and only passes when the full set of tests has run successfully against it and overall performance and resource-usage metrics are within acceptable bounds. This system is fully automated, and pauses or rolls back changes if the tests fail.
</p><p>The benefit is that we’re able to build and ship new product features in FL2 within 48 hours - where it would have taken weeks in FL1. In fact, at least one of the announcements this week involved such a change!</p>
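<p>The shape of such a rollout gate can be sketched as follows. This is an illustrative toy, not our real deployment system; the thresholds and names are invented:</p>

```rust
/// Toy model of an automated rollout gate: a deployment stage advances
/// only when every end-to-end test passed and the key metrics stayed
/// within bounds; otherwise it pauses or rolls back automatically.
#[derive(Debug, PartialEq)]
enum StageDecision {
    Advance,
    Pause,
    RollBack,
}

fn evaluate_stage(
    tests_passed: u32,
    tests_total: u32,
    error_rate: f64,
    cpu_vs_baseline: f64,
) -> StageDecision {
    // Any failing test or a clear error-rate regression reverts the change.
    if tests_passed < tests_total || error_rate > 0.001 {
        return StageDecision::RollBack;
    }
    // Tests pass but resource usage regressed: hold for investigation.
    if cpu_vs_baseline > 1.10 {
        return StageDecision::Pause;
    }
    StageDecision::Advance
}

fn main() {
    println!("{:?}", evaluate_stage(1000, 1000, 0.0002, 1.01));
    println!("{:?}", evaluate_stage(999, 1000, 0.0002, 1.01));
}
```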
    <div>
      <h3>Step 3 - Fallbacks</h3>
      <a href="#step-3-fallbacks">
        
      </a>
    </div>
    <p>Over 100 engineers have worked on FL2, and we have over 130 modules. And we’re not quite done yet. We’re still putting the final touches on the system, to make sure it replicates all the behaviours of FL1.</p><p>So how do we send traffic to FL2 without it being able to handle everything? If FL2 receives a request, or a piece of configuration for a request, that it doesn’t know how to handle, it gives up and does what we’ve called a <b>fallback</b>: it hands the whole thing over to FL1 at the network level, simply passing the bytes on.</p><p>As well as making it possible for us to send traffic to FL2 before it is fully complete, this has another massive benefit. When we have implemented a piece of new functionality in FL2, but want to double-check that it works the same as in FL1, we can evaluate the functionality in FL2 and then trigger a fallback. Comparing the behaviour of the two systems gives us high confidence that our implementation is correct.</p>
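<p>The fallback decision itself is conceptually simple. Here is a minimal sketch (the types and feature names are invented for illustration, not FL2’s real API): if a request needs anything FL2 cannot yet handle, the connection’s bytes go to FL1 instead:</p>

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Disposition {
    ServeInFl2,
    FallBackToFl1,
}

/// If every feature the request needs is implemented in FL2, serve it
/// there; otherwise pass the raw bytes of the connection over to FL1.
fn dispose(required: &[&str], implemented: &HashSet<&str>) -> Disposition {
    if required.iter().all(|f| implemented.contains(f)) {
        Disposition::ServeInFl2
    } else {
        Disposition::FallBackToFl1
    }
}

fn main() {
    let implemented: HashSet<&str> = ["waf", "ddos", "cache"].into_iter().collect();
    println!("{:?}", dispose(&["waf", "ddos"], &implemented));
    println!("{:?}", dispose(&["waf", "legacy_feature"], &implemented));
}
```

<p>Driving the fallback rate down is then a matter of growing the implemented set until every request disposes to FL2.</p>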
    <div>
      <h3>Step 4 - Rollout</h3>
      <a href="#step-4-rollout">
        
      </a>
    </div>
    <p>We started running customer traffic through FL2 early in 2025, and have been progressively increasing the amount of traffic served throughout the year. Essentially, we’ve been watching two graphs: one with the proportion of traffic routed to FL2 going up, and another with the proportion of traffic failing to be served by FL2 and falling back to FL1 going down.</p><p>We started this process by passing traffic for our free customers through the system. We were able to prove that the system worked correctly, and drive the fallback rates down for our major modules. Our <a href="https://community.cloudflare.com/"><u>Cloudflare Community</u></a> MVPs acted as an early warning system, smoke testing and flagging when they suspected the new platform might be the cause of a new reported problem. Crucially their support allowed our team to investigate quickly, apply targeted fixes, or confirm the move to FL2 was not to blame.</p><p>
We then advanced to our paying customers, gradually increasing the number of customers using the system. We also worked closely with some of our largest customers, who wanted the performance benefits of FL2, and onboarded them early in exchange for lots of feedback on the system.</p><p>Right now, most of our customers are using FL2. We still have a few features to complete, and are not quite ready to onboard everyone, but our target is to turn off FL1 within a few more months.</p>
    <div>
      <h2>Impact of FL2</h2>
      <a href="#impact-of-fl2">
        
      </a>
    </div>
    <p>As we described at the start of this post, FL2 is substantially faster than FL1. The biggest reason for this is simply that FL2 performs less work. You might have noticed this line in the earlier module definition example:</p>
            <pre><code>    filters: vec![],</code></pre>
            <p>Every module is able to provide a set of filters, which control whether it runs or not. This means that we don’t run logic for every product for every request — we can very easily select just the required set of modules. The incremental cost for each new product we develop has gone away.</p><p>Another huge reason for better performance is that FL2 is a single codebase, implemented in a performance-focussed language. In comparison, FL1 was based on NGINX (which is written in C), combined with LuaJIT (Lua, and C interface layers), and also contained plenty of Rust modules. In FL1, we spent a lot of time and memory converting data from the representation needed by one language to the representation needed by another.</p><p>As a result, our internal measures show that FL2 uses less than half the CPU of FL1, and much less than half the memory. That’s a huge bonus — we can spend the CPU on delivering more and more features for our customers!</p>
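<p>To make the filter idea concrete, here is a minimal sketch (invented types, not FL2’s real ones) of how per-module filters let the platform skip modules a request doesn’t need:</p>

```rust
/// A request carries the set of features its zone has configured.
struct Request {
    zone_features: Vec<String>,
}

/// A filter is a predicate over the request; a module's phase runs only
/// when every one of its filters matches. An empty filter list (like the
/// `filters: vec![]` in the phase definition above) matches every request.
type Filter = fn(&Request) -> bool;

fn custom_errors_enabled(req: &Request) -> bool {
    req.zone_features.iter().any(|f| f == "custom_errors")
}

fn should_run(filters: &[Filter], req: &Request) -> bool {
    filters.iter().all(|f| f(req))
}

fn main() {
    let req = Request { zone_features: vec!["waf".to_string()] };
    // This zone never configured custom error pages, so that module's
    // phases are skipped entirely for this request.
    println!("{}", should_run(&[custom_errors_enabled], &req));
}
```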
    <div>
      <h2>How do we measure if we are getting better?</h2>
      <a href="#how-do-we-measure-if-we-are-getting-better">
        
      </a>
    </div>
    <p>Using our own tools and independent benchmarks like <a href="https://www.cdnperf.com/"><u>CDNPerf</u></a>, we measured the impact of FL2 as we rolled it out across the network. The results are clear: websites are responding 10 ms faster at the median, a 25% performance boost.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eDqQvbAfrQoSXPZed0fd7/d7fe0d28e33a3f2e8dfcd27cf79c900f/image1.png" />
          </figure>
    <div>
      <h2>Security</h2>
      <a href="#security">
        
      </a>
    </div>
    <p>FL2 is also more secure by design than FL1. No software system is perfect, but the Rust language brings us huge benefits over LuaJIT. Rust has strong compile-time memory checks and a type system that avoids large classes of errors. Combine that with our rigid module system, and we can make most changes with high confidence.</p><p>Of course, no system is secure if used badly. It is still possible to write Rust code that causes memory corruption. To reduce risk, we maintain strong compile-time linting and checking, together with strict coding standards, testing and review processes.</p><p>We have long followed a policy that any unexplained crash of our systems needs to be <a href="https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/"><u>investigated as a high priority</u></a>. We won’t be relaxing that policy, though the main cause of novel crashes in FL2 so far has been hardware failure. The massively reduced rates of such crashes will give us time to do a good job of such investigations.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re spending the rest of 2025 completing the migration from FL1 to FL2, and will turn off FL1 in early 2026. We’re already seeing the benefits in terms of customer performance and speed of development, and we’re looking forward to giving these to all our customers.</p><p>We have one last service to completely migrate. The “HTTP &amp; TLS Termination” box from the diagram way back at the top is also an NGINX service, and we’re midway through a rewrite in Rust. We’re making good progress on this migration, and expect to complete it early next year.</p><p>After that, when everything is modular, in Rust, tested, and scaled, we can really start to optimize! We’ll reorganize and simplify how the modules connect to each other, expand support for non-HTTP traffic like RPC and streams, and much more. </p><p>If you’re interested in being part of this journey, check out our <a href="https://www.cloudflare.com/careers/"><u>careers page</u></a> for open roles - we’re always looking for new talent to help us build a better Internet.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Engineering]]></category>
            <guid isPermaLink="false">7d41Nh5ZiG8DbEjjwhJF3H</guid>
            <dc:creator>Richard Boulton</dc:creator>
            <dc:creator>Steve Goldsmith</dc:creator>
            <dc:creator>Maurizio Abba</dc:creator>
            <dc:creator>Matthew Bullock</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare uses the world’s greatest collection of performance data to make the world’s fastest global network even faster]]></title>
            <link>https://blog.cloudflare.com/how-cloudflare-uses-the-worlds-greatest-collection-of-performance-data/</link>
            <pubDate>Fri, 26 Sep 2025 06:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is using its vast traffic to send responses faster than ever before, by learning the characteristics of each individual network and tuning our congestion control system.    ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare operates the fastest network on the planet. <a href="https://blog.cloudflare.com/20-percent-internet-upgrade"><u>We’ve shared an update today</u></a> about how we are overhauling the software technology that accelerates every server in our fleet, improving speed globally.</p><p>That is not where the work stops, though. To improve speed even further, we also have to make sure that our network swiftly handles the Internet-scale congestion that hits it every day, routing traffic to our now-faster servers.</p><p>We have invested in congestion control for years. Today, we are excited to share how we are applying a superpower of our network, our massive Free Plan user base, to optimize performance and find the best way to route traffic across our network for all our customers globally.</p><p>Early results show performance improvements averaging 10% over the prior baseline. We achieved this by applying different algorithmic methods, tuned using the data we observe about the Internet each day. We are excited to begin rolling out these improvements to all customers.</p>
    <div>
      <h3>How does traffic arrive in our network?</h3>
      <a href="#how-does-traffic-arrive-in-our-network">
        
      </a>
    </div>
    <p>The Internet is a massive collection of interconnected networks, each composed of many machines (“nodes”). Data is transmitted by breaking it up into small packets, and passing them from one machine to another (over a “link”). Each one of these machines is linked to many others, and each link has limited capacity.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fTvUVuusvuivy20i1ikSm/4a46cfc5bf3263a8712a52ed3b207af6/image5.png" />
          </figure><p>When we send a packet over the Internet, it will travel in a series of “hops” over the links from A to B.  At any given time, there will be one link (one “hop”) with the least available capacity for that path. It doesn’t matter where in the connection this hop is — it will be the bottleneck.</p><p>But there’s a challenge — when you’re sending data over the Internet, you don’t know what route it’s going to take. In fact, each node decides for itself which route to send the traffic through, and different packets going from A to B can take entirely different routes. The dynamic and decentralized nature of the system is what makes the Internet so effective, but it also makes it very hard to work out how much data can be sent.  So — how can a sender know where the bottleneck is, and how fast to send data?</p><p>Between Cloudflare nodes, our <a href="https://developers.cloudflare.com/argo-smart-routing/"><u>Argo Smart Routing</u></a> product takes advantage of our visibility into the global network to speed up communication. Similarly, when we initiate connections to customer origins, we can leverage Argo and other insights to optimize them. However, the speed of a connection from your phone or laptop (the Client below) to the nearest Cloudflare datacenter will depend on the capacity of the bottleneck hop in the chain from you to Cloudflare, which happens outside our network.</p>
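<p>The bottleneck idea is easy to state precisely: whatever route the packets take, the deliverable rate of a path is the minimum capacity over its links. A tiny sketch, with invented capacities for illustration:</p>

```rust
/// The achievable throughput of a path is capped by its slowest link,
/// the "bottleneck hop", regardless of where in the path it sits.
fn bottleneck_mbps(link_capacities_mbps: &[u64]) -> Option<u64> {
    link_capacities_mbps.iter().copied().min()
}

fn main() {
    // Client -> home router -> ISP -> transit -> Cloudflare: the 40 Mbps
    // hop limits the whole path, even though every other link is faster.
    let path = [300, 40, 1_000, 10_000];
    println!("{:?}", bottleneck_mbps(&path));
}
```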
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ca9pXZUfgCIlCyi5LMikk/220154f58f8dbfbfb45b49db4f068dc9/image2.png" />
          </figure>
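<p>The bottleneck idea can be sketched directly: the throughput of a path is the minimum capacity over its hops, wherever that hop sits. The hop names and numbers below are purely illustrative.</p>

```typescript
// A path is a series of hops, each with limited capacity (in Mbps here).
interface Hop {
  name: string;
  capacityMbps: number;
}

// The achievable rate over the whole path is set by its slowest link.
function pathThroughput(path: Hop[]): number {
  return Math.min(...path.map((hop) => hop.capacityMbps));
}

// Hypothetical route from a client to a nearby datacenter.
const route: Hop[] = [
  { name: "home Wi-Fi", capacityMbps: 300 },
  { name: "ISP access link", capacityMbps: 50 },
  { name: "ISP backbone", capacityMbps: 10_000 },
  { name: "peering link", capacityMbps: 40_000 },
];

console.log(pathThroughput(route)); // 50: the ISP access link is the bottleneck
```

<p>Note that the sender never sees this table; it has to discover the 50 Mbps limit indirectly, which is exactly the problem congestion control solves.</p>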
    <div>
      <h3>What happens when too much data arrives at once?</h3>
      <a href="#what-happens-when-too-much-data-arrives-at-once">
        
      </a>
    </div>
    <p>If too much data arrives at any one node along the path of a request, the requestor will experience delays due to congestion. The data will either be queued for a while (risking <a href="https://en.wikipedia.org/wiki/Bufferbloat"><u>bufferbloat</u></a>), or some of it will simply get dropped. Protocols like TCP and QUIC respond to packets being dropped by retransmitting the data, but this introduces a delay, and can even make the problem worse by further overloading the limited capacity.</p><p>If cloud infrastructure providers like Cloudflare don’t manage congestion carefully, we risk overloading the system, slowing down the rate of data getting through. This actually <a href="https://en.wikipedia.org/wiki/Network_congestion#Congestive_collapse"><u>happened in the early days of the Internet</u></a>. To avoid this, the Internet infrastructure community has developed systems for controlling congestion, which give everyone a turn to send their data without overloading the network. This is an evolving challenge: the network grows ever more complicated, and finding the best method of congestion control is a constant pursuit. Many different algorithms have been developed, which draw on different sources of information and signals, optimize for different goals, and respond to congestion in different ways.</p><p>Congestion control algorithms use a number of signals to estimate the right rate at which to send traffic, without knowing how the network is set up. One important signal has been <b>loss</b>. When a packet is received, the receiver sends an “ACK,” telling the sender the packet got through. If it’s dropped somewhere along the way, the sender never gets the receipt, and after a timeout will treat the packet as having been lost.</p><p>More recent algorithms have used additional data. 
For example, a popular algorithm called <a href="https://blog.cloudflare.com/http-2-prioritization-with-nginx/#bbr-congestion-control"><u>BBR (Bottleneck Bandwidth and Round-trip propagation time)</u></a>, which we have been using for much of our traffic, builds a model during each connection of the maximum amount of data that can be transmitted in a given time period, using estimates of the round trip time as well as loss information.</p><p>The best algorithm to use often depends on the workload. For example, for interactive traffic like a video call, an algorithm that biases towards sending too much traffic can cause queues to build up, leading to high latency and a poor video experience. If one were to optimize solely for that use case, though, and avoid queuing by sending less traffic, the network would not make the best use of the connection for clients doing bulk downloads. The right optimization depends on a lot of different factors. But we have visibility into many of them!</p><p>BBR was an exciting development in congestion control, moving from reactive loss-based approaches to proactive model-based optimization, resulting in significantly better performance for modern networks. Our data gives us an opportunity to go further, applying different algorithmic methods to improve performance.</p>
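<p>To make the loss signal concrete, here is a minimal sketch of classic additive-increase/multiplicative-decrease (AIMD), the reactive loss-based behavior that model-based algorithms like BBR improve upon. This is an illustration, not Cloudflare’s implementation.</p>

```typescript
// Classic loss-based congestion window (AIMD), in units of packets.
class AimdWindow {
  cwnd = 10; // congestion window: how much unacknowledged data may be in flight

  // Each ACK grows the window by 1/cwnd, i.e. roughly one packet per round trip.
  onAck(): void {
    this.cwnd += 1 / this.cwnd;
  }

  // A packet with no ACK before the timeout is treated as lost; loss is read
  // as a congestion signal, so the window is halved.
  onLoss(): void {
    this.cwnd = Math.max(1, this.cwnd / 2);
  }
}

const w = new AimdWindow();
for (let i = 0; i < 100; i++) w.onAck(); // window ramps up while ACKs arrive
const beforeLoss = w.cwnd;
w.onLoss(); // a single loss halves the sending rate
console.log(w.cwnd === beforeLoss / 2); // true
```

<p>The sawtooth this produces (slow growth, sharp cuts) is exactly the reactive pattern that model-based approaches try to avoid.</p>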
    <div>
      <h3>How can we do better?</h3>
      <a href="#how-can-we-do-better">
        
      </a>
    </div>
    <p>All the existing algorithms are constrained to use only information gathered during the lifetime of the current connection. Thankfully, we know far more about the Internet at any given moment than this!  With Cloudflare’s perspective on traffic, we see much more than any one customer or ISP might see at any given time.</p><p>Every day, we see traffic from essentially every major network on the planet. When a request comes into our system, we know what client device we’re talking to, what type of network is enabling the connection, and whether we’re talking to consumer ISPs or cloud infrastructure providers.</p><p>We know about the patterns of load across the global Internet, and the locations where we believe systems are overloaded, within our network, or externally. We know about the networks that have stable properties, which have high packet loss due to cellular data connections, and the ones that traverse low earth orbit satellite links and radically change their routes every 15 seconds. </p>
    <div>
      <h3>How does this work?</h3>
      <a href="#how-does-this-work">
        
      </a>
    </div>
    <p>We have been migrating our network technology stack to a new platform, powered by Rust, that provides more flexibility to experiment with the parameters of the algorithms used for congestion control. Then we needed data.</p><p>The data powering these experiments needs to reflect the measure we’re trying to optimize, which is the user experience. It’s not enough that we’re sending data to nearly all the networks on the planet; we have to be able to see the experience customers actually have. So how do we do that, at our scale?</p><p>First, we have detailed “passive” logs of the rate at which data can be sent from our network, and how long it takes for the destination to acknowledge receipt. This covers all our traffic, and gives us an idea of how quickly the data was received by the client, but doesn’t necessarily reflect the user experience.</p><p>Next, we have a system for gathering <a href="https://developers.cloudflare.com/speed/observatory/rum-beacon/"><u>Real User Measurement</u></a> (RUM) data, which records information in supported web browsers about metrics such as Page Load Time (PLT). Any Cloudflare customer can enable this and will receive detailed insights in their dashboard. In addition, we use this metadata in aggregate across all our customers and networks to understand what customers are really experiencing. </p><p>However, RUM data is only present for a small proportion of connections across our network. So, we’ve been working to find a way to predict the RUM measures by extrapolating from the data we see only in passive logs. For example, here are the results of an experiment we performed comparing two different algorithms against the CUBIC baseline.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/u3C9Otkgf46yMcc2gGQTe/adde0c39d62f2fb6c88b439d826e8f4a/image4.png" />
          </figure><p>Now, here’s the same timescale, observed through the prediction based on our passive logs. The curves are very similar - but even more importantly, the ratio between the curves is very similar. <b>This is huge!</b> We can use a relatively small amount of RUM data to validate our findings, but optimize our network in a much more fine-grained way by using the full firehose of our passive logs.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ExUljHIAUBxFiUuFccRtY/6db8fa8d3d388f484b6ac7f71eeab83f/image3.png" />
</figure><p>Extrapolating too far becomes unreliable, so we’re also working with some of our largest customers to improve our visibility into the behavior of the network from their clients’ point of view, which allows us to extend this predictive model even further. In return, we’ll be able to give our customers insights into the true experience of their clients, in a way that no other platform can offer.</p>
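<p>One way to picture the extrapolation step (purely illustrative, not our production model): on the connections that have both passive and RUM data, fit a simple mapping from the passive metric to the RUM metric, then apply it to the vast majority of connections where only passive data exists.</p>

```typescript
// Illustrative only: predict a RUM metric (page load time, in ms) from a
// passive metric using least-squares on connections where both are observed.
function fitLine(xs: number[], ys: number[]): { slope: number; intercept: number } {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (xs[i] - meanX) * (ys[i] - meanY);
    den += (xs[i] - meanX) ** 2;
  }
  const slope = num / den;
  return { slope, intercept: meanY - slope * meanX };
}

// Hypothetical paired samples: higher passively-observed RTT, longer page loads.
const passiveRttMs = [20, 40, 60, 80, 100];
const rumPltMs = [900, 1300, 1700, 2100, 2500];

const model = fitLine(passiveRttMs, rumPltMs);
// Predict PLT for a connection that has only passive data (RTT of 50 ms).
const predicted = model.intercept + model.slope * 50;
console.log(predicted); // 1500 on this perfectly linear toy data
```

<p>On this toy data the fit is exact; real traffic needs far richer features, and the predictions must be validated against held-out RUM measurements, which is what the charts above are doing.</p>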
    <div>
      <h3>What is next?</h3>
      <a href="#what-is-next">
        
      </a>
    </div>
    <p>We’re currently running our experiments and improved algorithms for congestion control on all of our free tier QUIC traffic. The results so far show as much as a 10% improvement compared to the baseline! As we learn more, verify with more complex customer workloads, and expand to TCP traffic, we’ll gradually roll this out to all our customers, for all traffic, over 2026 and beyond.</p><p>We’re working with a select group of enterprises to test this in an early access program. If you’re interested in learning more, <a href="https://forms.gle/MCBjEEhYQvwWFccx9"><u>contact us</u></a>! </p> ]]></content:encoded>
            <category><![CDATA[Speed]]></category>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">18yRAwafvLrtG4qnJbMd3P</guid>
            <dc:creator>Steve Goldsmith</dc:creator>
            <dc:creator>Richard Boulton</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building Cloudflare on Cloudflare]]></title>
            <link>https://blog.cloudflare.com/building-cloudflare-on-cloudflare/</link>
            <pubDate>Thu, 18 May 2023 13:00:10 GMT</pubDate>
            <description><![CDATA[ Cloudflare was originally built as native services, but we’re building more and more of it on Cloudflare itself. This post describes how and why we’re doing this. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare’s website, application security and performance products handle upwards of 46 million HTTP requests per second, every second. These products were originally built as a set of native Linux services, but we’re increasingly building parts of the system using our Cloudflare Workers developer platform to make these products faster, more robust, and easier to develop. This blog post digs into how and why we’re doing this.</p>
    <div>
      <h2>System architecture</h2>
      <a href="#system-architecture">
        
      </a>
    </div>
    <p>Our architecture can best be thought of as a chain of proxies, each communicating over HTTP. At first, these proxies were all implemented based on <a href="/pushing-nginx-to-its-limit-with-lua/">NGINX and Lua</a>, but in recent years many of them have been replaced - often by new services built in Rust, such as <a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/">Pingora</a>.</p><p>The proxies each have distinct purposes - some obvious, some less so. One which we’ll be discussing in more detail is the FL service, which performs “Front Line” processing of requests, applying customer configuration to decide how to handle and route the request.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4WnKFZZ1KiuCqS4Kue2UMp/7bf67b34bdf5dd963929a371392c11e0/image3-19.png" />
            
            </figure><p>This architecture has worked well for more than a decade. It allows parts of the system to be developed and deployed independently, parts of the system to be scaled independently, and traffic to be routed to different nodes in our systems according to load, or to ensure efficient cache utilization.</p>
    <div>
      <h3>So, why change it?</h3>
      <a href="#so-why-change-it">
        
      </a>
    </div>
    <p>At the level of latency we care about, service boundaries aren’t cheap, particularly when communicating over HTTP. Each step in the chain adds <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a> due to communication overheads, so we can’t simply keep adding services as we develop new products. And we have a lot of products, with many more on the way.</p><p>To avoid this overhead, we put most of the logic for many different products into FL. We’ve developed a simple modular architecture in this service, allowing teams to make and deploy changes with some level of isolation. Even so, FL has become a very complex service, which takes constant effort from a team of skilled engineers to maintain and operate.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/igmGOsAvyznNM9i1Dvvnh/ac61fe1dc0eee2af7e80a0165e5e9a82/image2-31.png" />
            
</figure><p>Even with this effort, the developer experience for Cloudflare engineers has often been much harder than we would like. We need to be able to start working on any change quickly, but even getting a version of the system running in a local development environment is hard, requiring installation of custom tooling and Linux kernels.</p><p>The structure of the code limits the ease of making changes. While some changes are easy to make, others run into surprising limits due to the underlying platform. For example, it is not possible to perform I/O in many parts of the code which handle HTTP response processing, leading to complex workarounds to preload resources in case they are needed.</p><p>Deploying updates to the software is high risk, so it is done slowly and with care. Massive improvements have been made to our processes here in recent years, but it’s not uncommon to have to wait a week to see changes reach production, and changes tend to be deployed in large batches, making it hard to isolate the effect of each change in a release.</p><p>Finally, the code has a modular structure, but once in production there is limited isolation and sandboxing, so tracing potential side effects is hard, and debugging often requires knowledge of the whole system, which takes years of experience to obtain.</p>
    <div>
      <h2>Developer platform to the rescue</h2>
      <a href="#developer-platform-to-the-rescue">
        
      </a>
    </div>
    <p>As soon as Cloudflare Workers became part of our stack in 2017, we started looking at ways to use them to improve our ability to build new products. Now, in 2023, many of our products are built in part using workers and the wider developer platform; for example, read this <a href="/building-waiting-room-on-workers-and-durable-objects/">post from the Waiting Room team</a> about how they use Workers and Durable Objects, or this <a href="/part1-coreless-purge/">post about our cache purge system</a> doing the same. Products like <a href="https://developers.cloudflare.com/cloudflare-one/">Cloudflare Zero Trust</a>, <a href="https://developers.cloudflare.com/r2/">R2</a>, <a href="https://developers.cloudflare.com/workers/runtime-apis/kv/">KV</a>, <a href="https://developers.cloudflare.com/turnstile/">Turnstile</a>, <a href="https://developers.cloudflare.com/queues/">Queues</a>, and <a href="https://developers.cloudflare.com/waf/exposed-credentials-check/">Exposed credentials check</a> are built using Workers at large scale, handling every request processed by the products. We also use Workers for much of our internal tooling, from dashboards to chatbots.</p><p>While we can and do spend time improving the tooling and architecture of all our systems, the developer platform is focused all the time on making developers productive, and being as easy to use as possible. Many of the other posts this week on this blog talk about our work here. On the developer platform, any customer can get something running in minutes, and build and deploy full, complex systems within days.</p><p>We have been working to give developers working on internal Cloudflare products the same benefits.</p>
    <div>
      <h3>Customer workers vs internal workers</h3>
      <a href="#customer-workers-vs-internal-workers">
        
      </a>
    </div>
    <p>At this point, we need to talk about two different types of worker.</p><p>The first type is created when a customer writes a Cloudflare Worker. The code is deployed to our network, and will run whenever a request to the customer’s site matches the worker’s route. Many Cloudflare engineering teams use workers just like this to build parts of our product - for example, we wrote about our <a href="/part1-coreless-purge/">Coreless Purge system for Cache</a> recently. In these cases, our engineering teams are using exactly the same process and tooling as any Cloudflare customer would use.</p><p>However, we also have another type of worker, which can only be deployed by Cloudflare. These are not associated with a single customer. Instead, they are run for all customers for which a particular product or other piece of logic needs to be performed.</p><p>For the rest of this post, we’re only going to be talking about these internal workers. The underlying tech is the same - the difference to remember is that these workers run in response to requests from many Cloudflare customers rather than one.</p>
    <div>
      <h3>Initial integration of internal workers</h3>
      <a href="#initial-integration-of-internal-workers">
        
      </a>
    </div>
    <p>We first integrated internal workers into our architecture in 2019, in a very simple way. An ordered chain of internal workers was created, which run before any customer scripts.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26J7yyiNwK8faCEpUlusU1/7045afd0529fc0634619ebbd6300a7cd/download-12.png" />
            
            </figure><p>I previously said that adding more steps in our chain would cause excessive latency. So why isn’t this a problem for internal workers?</p><p>The answer is that these internal workers run within the same service as each other, and as customer workers which are operating on the request. So, there’s no need to marshal the request into HTTP to pass it on to the next step in the chain; the runtime just needs to pass a memory reference around, and perform a lightweight shift of control. There is still a cost of adding more steps - but the cost per step is much lower.</p><p>The integration gave us several benefits immediately. We were able to take advantage of the strong sandbox model for workers, removing any risk of unexpected side effects between customers or requests. It also allowed isolated deployments - teams could deploy their updates on their own schedule, without waiting for or disrupting other teams.</p><p>However, it also had a number of limitations. Internal workers could only run in one place in the lifetime of a request. This meant they couldn’t affect services running before them, such as the Cloudflare WAF.</p><p>Also, for security reasons, internal workers were published with an internal API using special credentials, rather than the public workers API. In 2019, this was no big deal, but since then there has been a ton of work to improve tooling such as <a href="https://developers.cloudflare.com/workers/wrangler/">wrangler</a>, and build the developer platform. All of this tooling was unavailable for internal workers.</p><p>We had very limited observability of internal workers, lacking metrics and detailed logs, making them hard to debug.</p><p>Despite these limitations, the benefits of being able to use the workers ecosystem were big enough that ten products used these internal workers to implement parts of their logic. 
These included <a href="https://www.cloudflare.com/products/zaraz/">Zaraz</a>, our <a href="https://developers.cloudflare.com/fundamentals/get-started/concepts/cloudflare-challenges/">Cloudflare challenges system</a>, <a href="https://www.cloudflare.com/waiting-room/">Waiting Room</a> and several of our performance optimization products: <a href="https://developers.cloudflare.com/images/image-resizing/">Image Resizing</a>, <a href="https://www.cloudflare.com/lp/pg-images/">Images</a>, <a href="/tag/mirage/">Mirage</a> and <a href="/tag/rocketloader/">Rocket Loader</a>. Such workers are also a core part of <a href="https://developers.cloudflare.com/automatic-platform-optimization/">Automatic Platform Optimization for WordPress</a>.</p>
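<p>Conceptually, the in-process handoff looks like this (a sketch of the idea, not the runtime’s actual interface): each internal stage receives the live request object and hands control to the next stage directly, with no HTTP serialization in between.</p>

```typescript
// Sketch: an ordered chain of in-process stages. Each stage may modify the
// request, short-circuit with its own response, or call next() to continue.
interface Req { url: string; headers: Record<string, string>; }
interface Res { status: number; body: string; }
type Stage = (req: Req, next: () => Promise<Res>) => Promise<Res>;

function buildChain(stages: Stage[], origin: (req: Req) => Promise<Res>) {
  return (req: Req): Promise<Res> => {
    const dispatch = (i: number): Promise<Res> =>
      i < stages.length ? stages[i](req, () => dispatch(i + 1)) : origin(req);
    // Only a memory reference to `req` is passed between stages.
    return dispatch(0);
  };
}

// Two hypothetical internal stages ahead of a terminal origin fetch.
const addHeader: Stage = (req, next) => {
  req.headers["x-stage"] = "internal-worker-1";
  return next();
};
const block: Stage = (req, next) =>
  req.url.includes("/blocked")
    ? Promise.resolve({ status: 403, body: "denied" })
    : next();

const handle = buildChain([addHeader, block], async (req) => ({
  status: 200,
  body: `origin saw ${req.headers["x-stage"]}`,
}));

handle({ url: "https://example.com/", headers: {} }).then((res) =>
  console.log(res.status, res.body)
);
```

<p>The per-stage cost here is a function call and a promise, rather than a full HTTP round trip between proxies.</p>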
    <div>
      <h3>Can we replace internal services with workers?</h3>
      <a href="#can-we-replace-internal-services-with-workers">
        
      </a>
    </div>
    <p>We realized that we could do a lot more with the platform to improve our development processes. We also wondered how far it would be possible to go with the platform. Would it be possible to migrate all the logic implemented in the NGINX-based FL service to the developer platform? And if not, why not?</p><p>So we started, in late 2021, with a prototype. This routed traffic directly from our TLS ingress service to our workers runtime, skipping the FL service. We named this prototype Flame.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5OjtYTJibEXwCRCSw8hes4/3bd67c0e22592689fc621f30645311cc/download--1--8.png" />
            
            </figure><p>It worked. Just about. Most importantly for a prototype, we could see that we were missing some fundamental capabilities. We couldn’t access other Cloudflare internal services, such as our DNS infrastructure or our customer configuration database, and we couldn’t emit request logs to our data pipeline, for analytics and billing purposes.</p><p>We rely heavily on caching for performance, and there was no way to cache state between requests. We also couldn’t emit HTTP requests directly to customer origins, or to our cache, without using our full existing chain-of-proxies pipeline.</p><p>Also, the developer experience for this prototype was very poor. We couldn’t take advantage of all the developer experience work being put into wrangler, due to the need to use special APIs to deploy internal workers. We couldn’t record metrics and traces to our standard <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability tooling systems</a>, so we were blind to the behavior of the system in production. And we had no way to perform a controlled and gradual deployment of updated code.</p>
    <div>
      <h2>Improving the developer platform for internal services</h2>
      <a href="#improving-the-developer-platform-for-internal-services">
        
      </a>
    </div>
    <p>We set out to address these problems one by one. Wherever possible, we wanted to use the same tooling for internal purposes as we provide to customers. This not only reduces the amount of tooling we need to support, but also means that we understand the problems our customers face better, and can improve their experience as well as ours.</p>
    <div>
      <h3>Tooling and routing</h3>
      <a href="#tooling-and-routing">
        
      </a>
    </div>
    <p>We started with the basics - how can we deploy code for internal services to the developer platform?</p><p>I mentioned earlier that we used special internal APIs for deploying our internal workers, for “security reasons”. We reviewed this with our security team, and found that we had good protections on our API to identify who was publishing a worker. The main thing we needed to add was a secure registry of accounts which were allowed to use privileged resources. Initially we did this by hard-coding a set of permissions into our API service - later this was replaced by a more flexible permissions control plane.</p><p>Even more importantly, there is a strong distinction between <b>publishing</b> a worker and <b>deploying</b> a worker.</p><p><b>Publishing</b> is the process of pushing the worker to our configuration store, so that the code to be run can be loaded when it is needed. Internally, each worker version which is published creates a new artifact in our store.</p><p>The Workers runtime uses a <a href="https://en.wikipedia.org/wiki/Capability-based_security">capability-based security model</a>. When it is published, each script is bundled together with a list of bindings, representing the capabilities that the script has to access other resources. This mechanism is a key part of providing safety - in order to be able to access resources, the script must have been published by an account with the permissions to provide the capabilities. The secure management of bindings to internal resources is a key part of our ability to use the developer platform for internal systems.</p><p><b>Deploying</b> is the process of hooking up the worker to be triggered when a request comes in. For a customer worker, deployment means attaching the worker to a route. 
For our internal workers, deployment means updating a global configuration store with the details of the specific artifact to run.</p><p>After some work, we were finally able to use wrangler to build and publish internal services. But there was a problem! In order to deploy an internal worker, we needed to know the identifier for the artifact which was published. Fortunately, this was a <a href="https://github.com/cloudflare/workers-sdk/pull/1449">simple change</a>: we updated wrangler to output debug information which included the artifact identifier.</p><p>A big benefit of using wrangler is that we could make tools like “wrangler test” and “wrangler dev” work. An engineer can check out the code and get straight to work on their feature, with well-supported tooling and a realistic environment.</p>
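<p>The capability model can be pictured like this (a simplified sketch with an invented binding name): a worker’s code receives only the resources that were bound to it at publication time via <code>env</code>, and has no ambient way to reach anything else.</p>

```typescript
// Sketch of capability-style bindings: the only resources this worker can
// touch are the ones on `env`. CONFIG_STORE is a hypothetical binding name.
interface Env {
  CONFIG_STORE: { get(key: string): Promise<string | null> };
  // No DNS or billing binding here, so this worker simply cannot call them.
}

const worker = {
  async fetch(request: { url: string }, env: Env): Promise<string> {
    const hostname = new URL(request.url).hostname;
    // All access flows through the granted capability.
    return (await env.CONFIG_STORE.get(hostname)) ?? "default-config";
  },
};

// In tests, capabilities are simply stubbed out.
const stubEnv: Env = {
  CONFIG_STORE: {
    get: async (key) => (key === "example.com" ? "custom-config" : null),
  },
};

worker.fetch({ url: "https://example.com/path" }, stubEnv).then(console.log);
// prints "custom-config"
```

<p>Because capabilities are granted at publication time, an account without permission to provide a binding cannot publish a script that uses it.</p>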
    <div>
      <h3>Event logging</h3>
      <a href="#event-logging">
        
      </a>
    </div>
    <p>We run a comprehensive data pipeline, providing streams of data for our customers to allow them to see what is happening on their sites, for our operations teams to understand how our system is behaving in production, and for us to provide services like DoS protection and accurate billing.</p><p>Data enters this pipeline from our network as messages in Cap’n Proto format, so we needed a new way to push log data into it from inside a worker. The pipeline’s first stage is a service called “logfwdr”, so we added a <a href="https://github.com/cloudflare/workers-sdk/pull/1476">new binding</a> which allowed us to push an arbitrary log message to the logfwdr service. This work was later a foundation of the <a href="/workers-analytics-engine/">Workers Analytics Engine</a> bindings, which allow customers to use the same structured logging capabilities.</p>
    <div>
      <h3>Observability</h3>
      <a href="#observability">
        
      </a>
    </div>
    <p>Observability is the ability to see how code is behaving. If you don’t have good observability tooling, you spend most of your time guessing. It’s inefficient and frankly unsafe to operate such a system.</p><p>At Cloudflare, we have many systems for observability, but three of the most important are:</p><ul><li><p>Unstructured logs (“syslogs”). These are ingested into systems such as Kibana, which allow searching and visualizing the logs.</p></li><li><p>Metrics. Also emitted from all our systems, these are a set of numbers representing things like “CPU usage” or “requests handled”, and are ingested into a <a href="/how-cloudflare-runs-prometheus-at-scale/">massive Prometheus system</a>. These are used for understanding the overall behavior of our systems, and for alerting us when unexpected or undesirable changes happen.</p></li><li><p>Traces. We use systems based around <a href="https://opentelemetry.io/">Open Telemetry</a> to record detailed traces of the interactions of the components of our system. This lets us understand which information is being passed between each service, and the time being spent in each service.</p></li></ul><p>Initial support for syslogs, metrics and traces for internal workers was built by our observability team, who provided a set of endpoints to which workers could push information. We wrapped this in a simple library, called “flame-common”, so that emitting observability events could be done without needing to think about the mechanics behind it.</p><p>Our initial wrapper looked something like this:</p>
            <pre><code>import { ObservabilityContext } from "flame-common";
 
export default {
    async fetch(
        request: Request,
        env: Env,
        ctx: ExecutionContext
    ): Promise&lt;Response&gt; {
 
        const obs = new ObservabilityContext(request, env, ctx);

        // Logging to syslog and kibana
        obs.logInfo("some information")
        obs.logError("an error occurred")
 
        // Metrics to Prometheus
        obs.counter("rps", "how many requests per second my service is doing")?.inc();
 
        // Tracing
        obs.startSpan("my code");
        obs.addAttribute("key", 42);

        // Return a response so the handler satisfies its declared return type
        return new Response("Hello!");
    },
};</code></pre>
            <p>An awkward part of this API was the need to pass the “ObservabilityContext” around to be able to emit events. Resolving this was one of the reasons that we <a href="/workers-node-js-asynclocalstorage/">recently added support for AsyncLocalStorage</a> to the Workers runtime.</p><p>While our current observability system works, the internal implementation isn’t as efficient as we would like. So, we’re also working on adding native support for emitting events, metrics and traces from the Workers runtime. As we did with the Workers Analytics Engine, we want to find a way to do this which can be hooked up to our internal systems, but which can also be used by customers to add better observability to their workers.</p>
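<p>With AsyncLocalStorage, a request-scoped context becomes available anywhere down the call stack without being passed explicitly. A sketch of the pattern (the context shape here is hypothetical):</p>

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical request-scoped observability context.
interface ObsContext { requestId: string; logs: string[]; }
const obsStorage = new AsyncLocalStorage<ObsContext>();

// Deep inside business logic: no context parameter threaded through.
function logInfo(message: string): void {
  obsStorage.getStore()?.logs.push(message);
}

async function handleRequest(requestId: string): Promise<string[]> {
  const ctx: ObsContext = { requestId, logs: [] };
  // Everything run inside this callback, including awaited continuations,
  // sees `ctx` via obsStorage.getStore().
  return obsStorage.run(ctx, async () => {
    logInfo(`handling ${requestId}`);
    await Promise.resolve(); // the context survives across awaits
    logInfo("done");
    return ctx.logs;
  });
}

handleRequest("req-1").then((logs) => console.log(logs)); // ["handling req-1", "done"]
```

<p>Each request gets its own store, so concurrent requests never see each other’s context.</p>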
    <div>
      <h3>Accessing internal resources</h3>
      <a href="#accessing-internal-resources">
        
      </a>
    </div>
    <p>One of our most important internal services is our configuration store, <a href="/tag/quicksilver/">Quicksilver</a>. To be able to move more logic into the developer platform, we need to be able to access this configuration store from inside internal workers. We also need to be able to access a number of other internal services - such as our DNS system, and our DoS protection systems.</p><p>Our systems use <a href="https://capnproto.org/">Cap’n Proto</a> in many places as a serialization and communication mechanism, so it was natural to add support for Cap’n Proto RPC to our Workers runtime. The systems which we need to talk to are mostly implemented in Go or Rust, which have good client support for this protocol.</p><p>We therefore added support for making connections to internal services over Cap’n Proto RPC to our Workers runtime. Each service will listen for connections from the runtime, and publish a schema to be used to communicate with it. The Workers runtime manages the conversion of data from JavaScript to Cap’n Proto, according to a schema which is bundled together with the worker at publication time.  This makes the code for talking to an internal service, in this case our DNS service being used to identify the account owning a particular hostname, as simple as:</p>
            <pre><code>let ownershipInterface = env.RRDNS.getCapability();

let query = {
  request: {
    queryName: url.hostname,
    connectViaAddr: control_header.connect_via_addr,
  },
};

let response = await ownershipInterface.lookupOwnership(query);</code></pre>
            
    <div>
      <h3>Caching</h3>
      <a href="#caching">
        
      </a>
    </div>
    <p>Computers run on cache, and our services are no exception. Looking at the previous example, if we have 10,000 requests coming in quick succession for the same hostname, we don’t want to look up the hostname in our DNS system for each one. We want to cache the lookups.</p><p>At first sight, this is incompatible with the design of workers, where we give no guarantees of state being preserved between requests. However, we have added a new internal binding to provide a “volatile in-memory cache”. Wherever it is possible to efficiently share this cache between workers, we will do so.</p><p>The following flowchart describes the semantics of this cache.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/63dKKhUBjb9BhnuN6Ea4sA/14bb67483c59d3f447b26297ba57c480/image8-9.png" />
            
            </figure><p>To use the cache, we simply need to wrap a block of code in a call to the cache:</p>
            <pre><code>const owner = await env.OWNERSHIPCACHE.read&lt;OwnershipData&gt;(
  key,
  async (key) =&gt; {
    let ownershipInterface = env.RRDNS.getCapability();

    let query = {
      request: {
        queryName: url.hostname,
        connectViaAddr: control_header.connect_via_addr,
      },
    };

    let response = await ownershipInterface.lookupOwnership(query);
    const value = response.response;
    const expiration = new Date(Date.now() + 30_000);
    return { value, expiration };
  }
);</code></pre>
            <p>This cache drastically reduces the number of calls needed to fetch external resources. We are likely to improve it further by adding support for refreshing in the background, to reduce P99 latency, and by improving observability of its usage and hit rates.</p>
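<p>The flowchart’s semantics amount to a read-through cache with per-entry expiration. A minimal sketch of that behavior follows; the class and method names here are hypothetical, since the real binding is internal to our runtime:</p>

```javascript
// Minimal sketch of the volatile in-memory cache's read-through
// semantics. All names here (VolatileCache, read, fill) are
// illustrative; the real binding is internal to the Workers runtime.
class VolatileCache {
  constructor() {
    this.entries = new Map(); // key -> { value, expiration }
  }

  async read(key, fill) {
    const entry = this.entries.get(key);
    if (entry && entry.expiration > new Date()) {
      // Cache hit: no call to the backing service.
      return entry.value;
    }
    // Miss or expired: run the filler and remember its result.
    const { value, expiration } = await fill(key);
    this.entries.set(key, { value, expiration });
    return value;
  }
}
```

<p>With semantics like these, 10,000 requests for the same hostname within the expiration window trigger only a single lookup against the backing service.</p>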
    <div>
      <h3>Direct egress from workers</h3>
      <a href="#direct-egress-from-workers">
        
      </a>
    </div>
    <p>If you looked closely at the architecture diagrams above, you might have noticed that the next step after the Workers runtime is always FL. Historically, the runtime only communicated with the FL service, allowing some product logic implemented in FL to run after workers had processed the request.</p><p>In many cases, however, this added unnecessary overhead: no logic actually needs to run in this step. So we’ve added the ability for our internal workers to control how requests egress. In some cases, egress goes directly to our cache systems; in others, directly to the Internet.</p>
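<p>Conceptually, the routing decision an internal worker now makes looks something like the sketch below. The binding names and the cacheability heuristic are assumptions for illustration, not our real API:</p>

```javascript
// Illustrative only: how an internal worker might choose its egress
// target now that requests are not forced back through FL. The env
// bindings (CACHE, INTERNET) and isCacheable are hypothetical names.
function isCacheable(request) {
  // A toy heuristic; real cacheability rules are far richer.
  return request.method === "GET";
}

function chooseEgress(request, env) {
  // Cacheable requests egress straight to the cache system; everything
  // else goes directly to the Internet, with no FL hop in between.
  return isCacheable(request) ? env.CACHE : env.INTERNET;
}
```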
    <div>
      <h3>Gradual deployment</h3>
      <a href="#gradual-deployment">
        
      </a>
    </div>
    <p>As mentioned before, one of the critical requirements is that we can deploy changes to our code in a gradual and controlled manner. In the rare event that something goes wrong, we need to make sure that it is detected as soon as possible, rather than triggering an issue across our entire network.</p><p>Teams using internal workers have built a number of different systems to address this issue, but they are all somewhat hard to use, with manual steps involving copying identifiers around and triggering advancement at the right times. Manual effort like this is inefficient - we want developers to be thinking at a higher level of abstraction, not worrying about copying and pasting version numbers between systems.</p><p>We’ve therefore built a new deployment system for internal workers, based around a few principles:</p><ul><li><p>Control deployments through git. A deployment to an internal-only environment is triggered by a merge to a staging branch (with appropriate reviews). A deployment to production is triggered by a merge to a production branch.</p></li><li><p>Progressive deployment. A deployment starts with the lowest-impact system (ideally, a pre-production system which mirrors production, but has no customer impact if it breaks). It then progresses through multiple stages, each one with a greater level of impact, until the release is completed.</p></li><li><p>Health-mediated advancement. Between each stage, a set of end-to-end tests is performed, metrics are reviewed, and a minimum time must elapse. If any of these fail, the deployment is paused or reverted; this happens automatically, without waiting for a human to respond.</p></li></ul><p>This system allows developers to focus on the behavior of their system, rather than the mechanics of a deployment.</p>
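<p>The advancement loop described by these principles can be sketched roughly as follows. Every name here (rollOut, revert, the check functions) is hypothetical; the real system is driven by git merges and internal tooling rather than a single function:</p>

```javascript
// Illustrative sketch of health-mediated progressive deployment:
// advance stage by stage, checking health between stages, and revert
// automatically on failure. All names are hypothetical.
async function deploy(stages, checks) {
  for (const stage of stages) {
    await stage.rollOut();
    // Between stages: run end-to-end tests and review metrics.
    const healthy =
      (await checks.endToEndTestsPass(stage)) &&
      (await checks.metricsLookGood(stage));
    if (!healthy) {
      // Pause or revert automatically, without waiting for a human.
      await stage.revert();
      return { completed: false, failedAt: stage.name };
    }
    // A minimum soak time must also elapse before advancing.
    await checks.waitMinimumSoakTime(stage);
  }
  return { completed: true };
}
```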
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70wIkaPF0PcYWlMT6rzh9m/740b15023db50177d6fe82a8358debf8/image6-10.png" />
            
            </figure><p>There are still plenty of plans for further improvement to many of these systems - but they’re running now in production for many of our internal workers.</p>
    <div>
      <h2>Moving from prototype to production</h2>
      <a href="#moving-from-prototype-to-production">
        
      </a>
    </div>
    <p>Our initial prototype has done its job: it’s shown us what capabilities we needed to add to our developer platform to be able to build more of our internal systems on it. We’ve added a large set of capabilities for internal service development to the developer platform, and are using them in production today for relatively small components of the system. We also know that if we were building our application security and performance products from scratch today, we could build them on the platform.</p><p>But there’s a world of difference between having a platform that is capable of running our internal systems, and migrating existing systems over to it. We’re at a very early stage of migration; we have real traffic running on the new platform, and expect to migrate more pieces of logic, and some full production sites, to run without depending on the FL service within the next few months.</p><p>We’re also still working out what the right module structure for our system is. As discussed, the platform allows us to split our logic into many separate workers, which communicate efficiently internally. We need to work out what the right level of subdivision is to match our development processes, to keep our code understandable and maintainable, while maintaining efficiency and throughput.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We have a lot more exploration and work to do. Anyone who has worked on a large legacy system knows that it is easy to believe that rewriting the system from scratch would allow you to fix all its problems. And anyone who has actually done this knows that such a project is doomed to be many times harder than you expect - and risks recreating all the problems that the old architecture fixed long ago.</p><p>Any rewrite or migration we perform will need to give us a strong benefit, in terms of improved developer experience, reliability and performance.</p><p>And it has to be possible to migrate without slowing down the pace at which we develop new products, even for a moment.</p>
    <div>
      <h3>We’ve done this before</h3>
      <a href="#weve-done-this-before">
        
      </a>
    </div>
    <p>Rewriting systems to take advantage of new technologies is something we do a lot at Cloudflare, and we’re good at it. The Quicksilver system has been fundamentally rebuilt several times - <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">migrating from Kyoto Tycoon</a>, and then <a href="/moving-quicksilver-into-production/">migrating the datastore from LMDB to RocksDB</a>. And we’ve <a href="/rust-nginx-module/">rebuilt the code that handles HTML rewriting</a>, to take advantage of the safety and performance of new technologies.</p><p>In fact, this isn’t even the first time we’ve rewritten our entire technical architecture for this very system. The first version of our performance and security proxy was implemented in PHP. This was retired in 2013 after an effort to rebuild the system from scratch. One interesting aspect of that rewrite is that it was done without stopping. The new system was so much easier to build that the developers working on it were able to catch up with the changes being made in the old system. Once the new system was mostly ready, it started handling requests; and if it found it wasn’t able to handle a request, it fell back to the old system. Eventually, enough logic was implemented that the old system could be turned off, leading to:</p>
            <pre><code>Author: Dane Knecht
Date:   Thu Sep 19 19:31:15 2013 -0700

    remove PHP.</code></pre>
            
    <div>
      <h3>It’s harder this time</h3>
      <a href="#its-harder-this-time">
        
      </a>
    </div>
    <p>Our systems are a lot more complicated than they were in 2013. The approach we’re taking is one of gradual change. We will not rebuild our systems as a new, standalone reimplementation. Instead, we will identify separable parts of our systems, where we can have a concrete benefit in the immediate future, and migrate these to new architectures. We’ll then learn from these experiences, feed them back into improving our platform and tooling, and identify further areas to work on.</p><p>Modularity of our code is of key importance; we are designing a system that we expect to be modified by many teams. To control this complexity, we need to introduce strong boundaries between code modules, allowing reasoning about the system to be done at a local level, rather than needing global knowledge.</p><p>Part of the answer may lie in producing multiple different systems for different use cases. Part of the strength of the developer platform is that we don’t have to publish a single version of our software - we can have as many as we need, running concurrently on the platform.</p><p>The Internet is a wild place, and we see every odd technical behavior you can imagine. There are standards and <a href="https://en.wikipedia.org/wiki/Request_for_Comments">RFCs</a> which we do our best to follow - but what happens in practice is often undocumented. Whenever we change any edge-case behavior of our system, which is sometimes unavoidable with a migration to a new architecture, we risk breaking an assumption that someone has made. This doesn’t mean we can never make such changes - but we do need to be deliberate about it and understand the impact, so that we can minimize disruption.</p><p>To help with this, another essential piece of the puzzle is our testing infrastructure. We have many tests that run on our software and network, but we’re building new capabilities to test every edge-case behavior of our system, in production, before and after each change.
This will let us experiment with a great deal more confidence, and decide, when we migrate pieces of our system to new architectures, whether to be “bug-for-bug” compatible, and if not, whether we need to warn anyone about the change. Again - this isn’t the first time we’ve done such a migration: for example, when we <a href="/how-we-made-our-dns-stack-3x-faster/">rebuilt our DNS pipeline to make it three times faster</a>, we built similar tooling to allow us to see if the new system behaved in any way differently from the earlier system.</p><p>The one thing I’m sure of is that some of the things we learn will surprise us and make us change direction. We’ll use this to improve the capabilities and ease of use of the developer platform. In addition, the scale at which we’re running these systems will help to find any previously hidden bottlenecks and scaling issues in the platform. I look forward to talking about our progress, all the improvements we’ve made, and all the surprise lessons we’ve learnt, in future blog posts.</p>
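<p>The comparison tooling described above boils down to replaying a request against both implementations and recording any divergence. A simplified sketch, with illustrative names and response fields that are assumptions rather than our real interfaces:</p>

```javascript
// Hypothetical sketch of behavior-comparison testing: send the same
// request to the old and new systems and record any difference in the
// response fields we care about. All names are illustrative.
async function compareBehavior(request, oldSystem, newSystem) {
  const [oldRes, newRes] = await Promise.all([
    oldSystem(request),
    newSystem(request),
  ]);
  const differences = [];
  for (const field of ["status", "body"]) {
    if (oldRes[field] !== newRes[field]) {
      differences.push({ field, old: oldRes[field], new: newRes[field] });
    }
  }
  // An empty list means the new system is bug-for-bug compatible for
  // this request; anything else calls for a deliberate decision.
  return differences;
}
```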
    <div>
      <h3>I want to know more</h3>
      <a href="#i-want-to-know-more">
        
      </a>
    </div>
    <p>We’ve covered a lot here. But maybe you want to know more, or you want to know how to get access to some of the features we’ve talked about here for your own projects.</p><p>If you’re interested in hearing more about this project, or in letting us know about capabilities you want to add to the developer platform, <a href="https://discord.com/invite/cloudflaredev">get in touch on Discord</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lK0RZY4NdiGg1J6RyKXGx/f4653ce34e3cb3c454758848673777bb/image5-11.png" />
            
            </figure>
     ]]></content:encoded>
            <category><![CDATA[Developer Week]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">2tJE5mfELI7qjA2qEEwDt2</guid>
            <dc:creator>Richard Boulton</dc:creator>
        </item>
        <item>
            <title><![CDATA[Starting as an Engineering Manager]]></title>
            <link>https://blog.cloudflare.com/starting-as-an-engineering-manager/</link>
            <pubDate>Mon, 26 Feb 2018 16:58:50 GMT</pubDate>
            <description><![CDATA[ I joined Cloudflare as an Engineering Manager, and wanted to make the most of the experience. Here are my objectives and some principles I followed. ]]></description>
            <content:encoded><![CDATA[ <p>I joined Cloudflare last week as an Engineering Manager, having previously spent 4 years working as the head of the software engineering community in the UK’s Government Digital Service (GDS). You only get one chance to be a new starter at each new place, so it’s important to make the most of the experience. Also, the job of Engineering Manager is different in every organisation, so it’s important to understand what the expectations and needs for the role are at Cloudflare.</p><p>To help with this, I started by sketching out some objectives for my first week.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/hJUuaqvb6FU6fjblDjE4G/a0590eded8dc126e7db94846aa270ed2/photo-1484480974693-6ca0a78fb36b" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@glenncarstenspeters?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Glenn Carstens-Peters</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><ul><li><p>Meet all my team members in a 1 to 1 meeting</p></li><li><p>Know the skills and motivations of each team member</p></li><li><p>Write down the expectations of my manager and team</p></li><li><p>Write down the scope of the job</p></li><li><p>Write a list of technologies to get an understanding of</p></li><li><p>Be mentored by one of the team in one of these technologies</p></li><li><p>Review what I can do to help with diversity and inclusion</p></li><li><p>Understand the management structure - draw a diagram of it</p></li><li><p>Spend 30 minutes with the most senior person I can to understand their aims</p></li><li><p>Play with the product myself</p></li></ul><p>Some of these are a bit of a stretch to do fully in the first week, but I find objectives often work best for me if they’re a bit challenging.</p><p>I also made a short list of more organisational todos, such as understanding which tools have been accredited for use, getting all my accounts and kit working and setting up a todo list.</p>
    <div>
      <h3>Other opinions</h3>
      <a href="#other-opinions">
        
      </a>
    </div>
    <p>I worried that this was lacking in review and would benefit from other people’s perspectives. I’m lucky to have access to a network of people with technical and management experience, so I put the question to them on Twitter:</p><blockquote><p>On way to new job, so am writing myself some first week objectives. I'm interested to hear, as an engineering manager joining a new company and team, what would yours be?</p><p>— Richard Boulton (@rboulton) <a href="https://twitter.com/rboulton/status/965495901690228737?ref_src=twsrc%5Etfw">19 February 2018</a></p></blockquote><p>You can read the responses yourself, but I got a lot of very good advice there.</p>
    <div>
      <h3>Principles</h3>
      <a href="#principles">
        
      </a>
    </div>
    <p>Quite a few of the suggestions are good principles to follow, rather than measurable objectives, and I’d summarise them as:</p><ul><li><p><b>Listen</b>, don't tell. (You don't know what you're talking about yet.)</p></li><li><p><b>Capture</b>, write down first impressions. (You won't remember them the same way.)</p></li><li><p><b>Reserve judgement</b>, don't bring baggage from your last role. (Things that look similar at first may be very different.)</p></li></ul><p>I wrote down the list of suggestions from Twitter to see what I’d missed - there were 25 suggestions, though several were close duplicates. I ended up adding the following to my list:</p><ul><li><p>Be able to articulate the vision for the company as a whole and the place of the engineering team in delivering that vision</p></li><li><p>Know who the non-technical critical stakeholders are</p></li><li><p>Get an idea of the 5 best and worst things about the place</p></li><li><p>Projects: what's ongoing, what state are they in, why?</p></li><li><p>Emergencies and fires: how are they tackled, what's your role, what makes them harder to solve?</p></li></ul><p>And finally, I added “Discuss what the comms policy is. What is it okay to blog about publicly?”, which has resulted in me writing this post!</p>
    <div>
      <h3>How did it go?</h3>
      <a href="#how-did-it-go">
        
      </a>
    </div>
    <p>Well, as a result of this, I made a long list of tasks that I wanted to get done - and even managed to do most of them! I had some great one-to-one sessions with the team, got to watch how they approached their work, and was struck by the very high level of technical skill and commitment to making the best system they can. I also got to spend time with various senior leaders to get their view on the company, and began to get an understanding of the product requirements that depend on the team.</p><p>I found keeping the principles listed above in mind was incredibly helpful, particularly in keeping me from leaping to present solutions to issues before I understood the situation properly.</p><p>I’ve ended the week with more questions than I’ve answered, but thanks to some excellent guidance and support from my manager, and many other people’s generosity with their time, I’m confident they’re the right questions.</p>
    <div>
      <h4>Further reading</h4>
      <a href="#further-reading">
        
      </a>
    </div>
    <p>Unsurprisingly, there’s a lot of literature about how to approach being an Engineering Manager. Unfortunately, while you’re doing it, there never seems to be enough time to read much of it. I’ve found that Camille Fournier’s book “<a href="http://shop.oreilly.com/product/0636920056843.do">The Manager’s Path</a>” is a very good starting point which doesn’t take too long to read, with lots of practical advice based on her experiences. “<a href="https://www.amazon.com/First-90-Days-Strategies-Expanded/dp/1422188612">The First 90 Days: Proven Strategies for Getting Up to Speed Faster and Smarter, Updated and Expanded</a>” is also recommended.</p><p>I’ll be reviewing how useful these objectives and principles have been in helping me get up to speed over the coming weeks. If you have thoughts or experiences you’d like to share, please leave a comment.</p>
            <category><![CDATA[Life at Cloudflare]]></category>
            <category><![CDATA[Careers]]></category>
            <guid isPermaLink="false">5z54WX8cy2E80qHsrIgNFv</guid>
            <dc:creator>Richard Boulton</dc:creator>
        </item>
    </channel>
</rss>