
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sun, 05 Apr 2026 18:18:28 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Cloudflare just got faster and more secure, powered by Rust]]></title>
            <link>https://blog.cloudflare.com/20-percent-internet-upgrade/</link>
            <pubDate>Fri, 26 Sep 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We’ve replaced Cloudflare’s original core system, built on NGINX, with a new modular Rust-based proxy. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare is relentless about building and running the world’s fastest network. We have been <a href="https://blog.cloudflare.com/tag/network-performance-update/"><u>tracking and reporting on our network performance since 2021</u></a>: you can see the latest update <a href="https://blog.cloudflare.com/tag/network-performance-update/"><u>here</u></a>.</p><p>Building the fastest network requires work in many areas. We invest a lot of time in our hardware, to have efficient and fast machines. We invest in peering arrangements, to make sure we can talk to every part of the Internet with minimal delay. On top of this, we also have to invest in the software we run our network on, especially as each new product can otherwise add more processing delay.</p><p>No matter how fast messages arrive, we introduce a bottleneck if that software takes too long to think about how to process and respond to requests. Today we are excited to share a significant upgrade to our software that cuts the median time we take to respond by 10ms and delivers a 25% performance boost, as measured by third-party CDN performance tests.</p><p>We've spent the last year rebuilding major components of our system, and we've just slashed the latency of traffic passing through our network for millions of our customers. At the same time, we've made our system more secure, and we've reduced the time it takes for us to build and release new products. </p>
    <div>
      <h2>Where did we start?</h2>
      <a href="#where-did-we-start">
        
      </a>
    </div>
    <p>Every request that hits Cloudflare starts a journey through our network. It might come from a browser loading a webpage, a mobile app calling an API, or automated traffic from another service. These requests first terminate at our HTTP and TLS layer, then pass into a system we call FL, and finally through <a href="https://blog.cloudflare.com/pingora-open-source/"><u>Pingora</u></a>, which performs cache lookups or fetches data from the origin if needed.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CKW3jrB9vw4UhXx6OYJpt/928b9e9c1c794f7622fc4ad469d66b60/image3.png" />
          </figure><p>FL is the brain of Cloudflare. Once a request reaches FL, we then run the various security and performance features in our network. It applies each customer’s unique configuration and settings, from enforcing <a href="https://developers.cloudflare.com/waf/managed-rules/"><u>WAF rules </u></a>and <a href="https://www.cloudflare.com/ddos/"><u>DDoS protection</u></a> to routing traffic to the <a href="https://developers.cloudflare.com/learning-paths/workers/devplat/intro-to-devplat/"><u>Developer Platform </u></a>and <a href="https://developers.cloudflare.com/r2/"><u>R2</u></a>. </p><p>Built more than 15 years ago, FL has been at the core of Cloudflare’s network. It enables us to deliver a broad range of features, but over time that flexibility became a challenge. As we added more products, FL grew harder to maintain, slower to process requests, and more difficult to extend. Each new feature required careful checks across existing logic, and every addition introduced a little more latency, making it increasingly difficult to sustain the performance we wanted.</p><p>You can see how FL is key to our system — we’ve often called it the “brain” of Cloudflare. It’s also one of the oldest parts of our system: the first commit to the codebase was made by one of our founders, Lee Holloway, well before our initial launch. We’re celebrating our 15th Birthday this week - this system started 9 months before that!</p>
            <pre><code>commit 39c72e5edc1f05ae4c04929eda4e4d125f86c5ce
Author: Lee Holloway &lt;q@t60.(none)&gt;
Date:   Wed Jan 6 09:57:55 2010 -0800

    nginx-fl initial configuration</code></pre>
            <p>As the commit implies, the first version of FL was built on the NGINX webserver, with product logic implemented in PHP. After three years, the system had become too complex to manage effectively, and too slow to respond, so an almost complete rewrite of the running system was performed. This led to another significant commit, this time made by Dane Knecht, who is now our CTO.</p>
            <pre><code>commit bedf6e7080391683e46ab698aacdfa9b3126a75f
Author: Dane Knecht
Date:   Thu Sep 19 19:31:15 2013 -0700

    remove PHP.</code></pre>
            <p>From this point on, FL was implemented using NGINX, the <a href="https://openresty.org/en/"><u>OpenResty</u></a> framework, and <a href="https://luajit.org/"><u>LuaJIT</u></a>.  While this was great for a long time, over the last few years it started to show its age. We had to spend increasing amounts of time fixing or working around obscure bugs in LuaJIT. The highly dynamic and unstructured nature of our Lua code, which was a blessing when first trying to implement logic quickly, became a source of errors and delay when trying to integrate large amounts of complex product logic. Each time a new product was introduced, we had to go through all the other existing products to check if they might be affected by the new logic.</p><p>It was clear that we needed a rethink. So, in July 2024, we cut an initial commit for a brand new, and radically different, implementation. To save time agreeing on a new name for this, we just called it “FL2”, and started, of course, referring to the original FL as “FL1”.</p>
            <pre><code>commit a72698fc7404a353a09a3b20ab92797ab4744ea8
Author: Maciej Lechowski
Date:   Wed Jul 10 15:19:28 2024 +0100

    Create fl2 project</code></pre>
            
    <div>
      <h2>Rust and rigid modularization</h2>
      <a href="#rust-and-rigid-modularization">
        
      </a>
    </div>
    <p>We weren’t starting from scratch. We’ve <a href="https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/"><u>previously blogged</u></a> about how we replaced another one of our legacy systems with Pingora, which is built in the <a href="https://www.rust-lang.org/"><u>Rust</u></a> programming language, using the <a href="https://tokio.rs/"><u>Tokio</u></a> runtime. We’ve also <a href="https://blog.cloudflare.com/introducing-oxy/"><u>blogged about Oxy</u></a>, our internal framework for building proxies in Rust. We write a lot of Rust, and we’ve gotten pretty good at it.</p><p>We built FL2 in Rust, on Oxy, and built a strict module framework to structure all the logic in FL2.</p>
    <div>
      <h2>Why Oxy?</h2>
      <a href="#why-oxy">
        
      </a>
    </div>
    <p>When we set out to build FL2, we knew we weren’t just replacing an old system; we were rebuilding the foundations of Cloudflare. That meant we needed more than just a proxy; we needed a framework that could evolve with us, handle the immense scale of our network, and let teams move quickly without sacrificing safety or performance. </p><p>Oxy gives us a powerful combination of performance, safety, and flexibility. Built in Rust, it eliminates entire classes of bugs that plagued our NGINX/LuaJIT-based FL1, like memory safety issues and data races, while delivering C-level performance. At Cloudflare’s scale, those guarantees aren’t nice-to-haves, they’re essential. Every microsecond saved per request translates into tangible improvements in user experience, and every crash or edge case avoided keeps the Internet running smoothly. Rust’s strict compile-time guarantees also pair perfectly with FL2’s modular architecture, where we enforce clear contracts between product modules and their inputs and outputs.</p><p>But the choice wasn’t just about language. Oxy is the culmination of years of experience building high-performance proxies. It already powers several major Cloudflare services, from our <a href="https://developers.cloudflare.com/cloudflare-one/"><u>Zero Trust</u></a> Gateway to <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>Apple’s iCloud Private Relay</u></a>, so we knew it could handle the diverse traffic patterns and protocol combinations that FL2 would see. Its extensibility model lets us intercept, analyze, and manipulate traffic from <a href="https://www.cloudflare.com/en-gb/learning/ddos/glossary/open-systems-interconnection-model-osi/"><u>layer 3 up to layer 7</u></a>, and even decapsulate and reprocess traffic at different layers. 
That flexibility is key to FL2’s design because it means we can treat everything from HTTP to raw IP traffic consistently and evolve the platform to support new protocols and features without rewriting fundamental pieces.</p><p>Oxy also comes with a rich set of built-in capabilities that previously required large amounts of bespoke code. Things like monitoring, soft reloads, dynamic configuration loading and swapping are all part of the framework. That lets product teams focus on the unique business logic of their module rather than reinventing the plumbing every time. This solid foundation means we can make changes with confidence, ship them quickly, and trust they’ll behave as expected once deployed.</p>
    <div>
      <h2>Smooth restarts - keeping the Internet flowing</h2>
      <a href="#smooth-restarts-keeping-the-internet-flowing">
        
      </a>
    </div>
    <p>One of the most impactful improvements Oxy brings is its handling of restarts. Any software under continuous development and improvement will eventually need to be updated. In desktop software, this is easy: you close the program, install the update, and reopen it. On the web, things are much harder. Our software is in constant use and cannot simply stop. A dropped HTTP request can cause a page to fail to load, and a broken connection can kick you out of a video call. Reliability is not optional.</p><p>In FL1, upgrades meant restarting the proxy process. Restarting a proxy meant terminating the process entirely, which immediately broke any active connections. That was particularly painful for long-lived connections such as WebSockets, streaming sessions, and real-time APIs. Even planned upgrades could cause user-visible interruptions, and unplanned restarts during incidents could be even worse.</p><p>Oxy changes that. It includes a built-in mechanism for <a href="https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/"><u>graceful restarts</u></a> that lets us roll out new versions without dropping connections whenever possible. When a new instance of an Oxy-based service starts up, the old one stops accepting new connections but continues to serve existing ones, allowing those sessions to continue uninterrupted until they end naturally.</p><p>This means that if you have an ongoing WebSocket session when we deploy a new version, that session is not torn down by the restart. Across Cloudflare’s fleet, deployments are orchestrated over several hours, so the aggregate rollout is smooth and nearly invisible to end users.</p><p>We take this a step further by using systemd socket activation. Instead of letting each proxy manage its own sockets, we let systemd create and own them. This decouples the lifetime of sockets from the lifetime of the Oxy application itself. 
If an Oxy process restarts or crashes, the sockets remain open and ready to accept new connections, which will be served as soon as the new process is running. That eliminates the “connection refused” errors that could happen during restarts in FL1 and improves overall availability during upgrades.</p><p>We also built our own coordination mechanisms in Rust to replace Go libraries like <a href="https://github.com/cloudflare/tableflip"><u>tableflip</u></a> with <a href="https://github.com/cloudflare/shellflip"><u>shellflip</u></a>. This uses a restart coordination socket that validates configuration, spawns new instances, and ensures the new version is healthy before the old one shuts down. This improves feedback loops and lets our automation tools detect and react to failures immediately, rather than relying on blind signal-based restarts.</p>
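<p>The handshake behind socket activation is small enough to sketch. Below is an illustrative Rust sketch (our own simplification, not Oxy’s code) of the convention systemd uses to hand sockets to a service: systemd sets <code>LISTEN_PID</code> and <code>LISTEN_FDS</code> in the environment, and the inherited sockets occupy file descriptors starting at 3.</p>

```rust
use std::env;

/// Number of sockets systemd passed to this process, or 0 if none.
/// By the sd_listen_fds(3) convention the sockets occupy file
/// descriptors 3, 4, ..., so a service can rebuild its listeners
/// from them instead of binding fresh sockets on every restart.
fn inherited_socket_count(my_pid: u32) -> usize {
    // LISTEN_PID names the process the fds are intended for; ignore
    // them if they were meant for a different process.
    let for_pid = env::var("LISTEN_PID").ok().and_then(|v| v.parse::<u32>().ok());
    if for_pid != Some(my_pid) {
        return 0;
    }
    env::var("LISTEN_FDS").ok().and_then(|v| v.parse().ok()).unwrap_or(0)
}
```

<p>A proxy built this way rebuilds its listeners from those inherited descriptors, which is what keeps the sockets open and accepting connections across process restarts.</p>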
    <div>
      <h2>Composing FL2 from Modules</h2>
      <a href="#composing-fl2-from-modules">
        
      </a>
    </div>
    <p>To avoid the problems we had in FL1, we wanted a design where all interactions between product logic were explicit and easy to understand. </p><p>So, on top of the foundations provided by Oxy, we built a platform which separates all the logic built for our products into well-defined modules. After some experimentation and research, we designed a module system which enforces some strict rules:</p><ul><li><p>No IO (input or output) can be performed by the module.</p></li><li><p>The module provides a list of <b>phases</b>.</p></li><li><p>Phases are evaluated in a strictly defined order, which is the same for every request.</p></li><li><p>Each phase defines a set of inputs which the platform provides to it, and a set of outputs which it may emit.</p></li></ul><p>Here’s an example of what a module phase definition looks like:</p>
            <pre><code>Phase {
    name: phases::SERVE_ERROR_PAGE,
    request_types_enabled: PHASE_ENABLED_FOR_REQUEST_TYPE,
    inputs: vec![
        InputKind::IPInfo,
        InputKind::ModuleValue(
            MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE.as_str(),
        ),
        InputKind::ModuleValue(MODULE_VALUE_ORIGINAL_SERVE_RESPONSE.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT.as_str()),
        InputKind::ModuleValue(MODULE_VALUE_RULESETS_UPSTREAM_ERROR_DETAILS.as_str()),
        InputKind::RayId,
        InputKind::StatusCode,
        InputKind::Visitor,
    ],
    outputs: vec![OutputValue::ServeResponse],
    filters: vec![],
    func: phase_serve_error_page::callback,
}</code></pre>
            <p>This phase is for our custom error page product.  It takes a few things as input — information about the IP of the visitor, some header and other HTTP information, and some “module values.” Module values allow one module to pass information to another, and they’re key to making the strict properties of the module system workable. For example, this module needs some information that is produced by the output of our rulesets-based custom errors product (the “<code>MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT</code>” input). These input and output definitions are enforced at compile time.</p><p>While these rules are strict, we’ve found that we can implement all our product logic within this framework. The benefit of doing so is that we can immediately tell which other products might affect each other.</p>
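<p>To make that contract concrete, here is a deliberately simplified sketch of how a phase runner could wire module values between phases (invented types, not the real FL2 platform): phases run in one fixed order, each phase sees only the inputs it declared, and its outputs become available to later phases.</p>

```rust
use std::collections::HashMap;

// Much-simplified sketch of a module-value store; all names here are
// illustrative, not FL2's real API.
type ModuleValues = HashMap<&'static str, String>;

struct Phase {
    name: &'static str,
    // Module values this phase is allowed to read.
    inputs: Vec<&'static str>,
    // A phase is pure logic: declared values in, optional value out, no IO.
    func: fn(&ModuleValues) -> Option<(&'static str, String)>,
}

fn run_phases(phases: &[Phase]) -> ModuleValues {
    let mut values = ModuleValues::new();
    // Phases run in one strictly defined order for every request.
    for phase in phases {
        // Expose only the inputs the phase declared.
        let visible: ModuleValues = phase
            .inputs
            .iter()
            .filter_map(|k| values.get(k).map(|v| (*k, v.clone())))
            .collect();
        if let Some((key, value)) = (phase.func)(&visible) {
            values.insert(key, value);
        }
    }
    values
}
```

<p>Because each phase declares its reads and writes up front, the runner (and, in the real system, the compiler) can tell exactly which modules can influence which others.</p>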
    <div>
      <h2>How to replace a running system</h2>
      <a href="#how-to-replace-a-running-system">
        
      </a>
    </div>
    <p>Building a framework is one thing. Building all the product logic and getting it right, so that customers don’t notice anything other than a performance improvement, is another.</p><p>The FL code base supports 15 years of Cloudflare products, and it’s changing all the time. We couldn’t stop development. So, one of our first tasks was to find ways to make the migration easier and safer.</p>
    <div>
      <h3>Step 1 - Rust modules in OpenResty</h3>
      <a href="#step-1-rust-modules-in-openresty">
        
      </a>
    </div>
    <p>Rebuilding product logic in Rust is a big enough distraction from shipping products to customers. Asking all our teams to also maintain two versions of their product logic, reimplementing every change a second time until we finished our migration, was too much.</p><p>So, we implemented a layer in our old NGINX and OpenResty based FL which allowed the new modules to be run. Instead of maintaining a parallel implementation, teams could implement their logic in Rust, and replace their old Lua logic with it, without waiting for the full replacement of the old system.</p><p>For example, here’s part of the implementation for the custom error page module phase defined earlier (we’ve cut out some of the more boring details, so this doesn’t quite compile as-written):</p>
            <pre><code>pub(crate) fn callback(_services: &amp;mut Services, input: &amp;Input&lt;'_&gt;) -&gt; Output {
    // Rulesets produced a response to serve - this can either come from a special
    // Cloudflare worker for serving custom errors, or be directly embedded in the rule.
    if let Some(rulesets_params) = input
        .get_module_value(MODULE_VALUE_RULESETS_CUSTOM_ERRORS_OUTPUT)
        .cloned()
    {
        // Select either the result from the special worker, or the parameters embedded
        // in the rule.
        let body = input
            .get_module_value(MODULE_VALUE_CUSTOM_ERRORS_FETCH_WORKER_RESPONSE)
            .and_then(|response| {
                handle_custom_errors_fetch_response("rulesets", response.to_owned())
            })
            .or(rulesets_params.body);

        // If we were able to load a body, serve it, otherwise let the next bit of logic
        // handle the response
        if let Some(body) = body {
            let final_body = replace_custom_error_tokens(input, &amp;body);

            // Increment a metric recording number of custom error pages served
            custom_pages::pages_served("rulesets").inc();

            // Return a phase output with one final action, causing an HTTP response to be served.
            return Output::from(TerminalAction::ServeResponse(ResponseAction::OriginError {
                status: rulesets_params.status,
                source: "rulesets http_custom_errors",
                headers: rulesets_params.headers,
                body: Some(Bytes::from(final_body)),
            }));
        }
    }
}</code></pre>
            <p>The internal logic in each module is quite cleanly separated from the handling of data, with very clear and explicit error handling encouraged by the design of the Rust language.</p><p>Many of our most actively developed modules were handled this way, allowing the teams to maintain their change velocity during our migration.</p>
    <div>
      <h3>Step 2 - Testing and automated rollouts</h3>
      <a href="#step-2-testing-and-automated-rollouts">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6OElHIMjnkIQmsbx92ZOYc/545a26ba623a253ea1a5c6b13279301d/image4.png" />
          </figure><p>It’s essential to have a seriously powerful test framework to cover such a migration. We built a system, internally named Flamingo, which allows us to run thousands of full end-to-end test requests concurrently against our production and pre-production systems. The same tests run against FL1 and FL2, giving us confidence that we’re not changing behaviours.</p><p>Whenever we deploy a change, that change is rolled out gradually across many stages, with increasing amounts of traffic. Each stage is automatically evaluated, and only passes when the full set of tests has been run successfully against it - and when overall performance and resource usage metrics are within acceptable bounds. This system is fully automated, and pauses or rolls back changes if the tests fail.
</p><p>The benefit is that we’re able to build and ship new product features in FL2 within 48 hours - where it would have taken weeks in FL1. In fact, at least one of the announcements this week involved such a change!</p>
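<p>The gate at each stage reduces to a simple decision rule. As a toy illustration (the field names and thresholds below are invented; our real automation is considerably more involved), a stage evaluation might look like:</p>

```rust
// Toy model of an automated rollout gate; fields and thresholds are
// invented for illustration only.
struct StageReport {
    tests_passed: bool,
    error_rate: f64,      // fraction of failed requests at this stage
    cpu_vs_baseline: f64, // CPU usage relative to the previous release
}

#[derive(Debug, PartialEq)]
enum Decision {
    Advance,  // move on to the next, larger stage
    RollBack, // tests failed: revert the change
    Pause,    // metrics out of bounds: hold and alert a human
}

fn evaluate_stage(report: &StageReport) -> Decision {
    if !report.tests_passed {
        return Decision::RollBack;
    }
    if report.error_rate > 0.001 || report.cpu_vs_baseline > 1.05 {
        return Decision::Pause;
    }
    Decision::Advance
}
```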
    <div>
      <h3>Step 3 - Fallbacks</h3>
      <a href="#step-3-fallbacks">
        
      </a>
    </div>
    <p>Over 100 engineers have worked on FL2, and we have over 130 modules. And we’re not quite done yet: we’re still putting the final touches on the system, to make sure it replicates all the behaviours of FL1.</p><p>So how do we send traffic to FL2 before it can handle everything? If FL2 receives a request, or a piece of configuration for a request, that it doesn’t know how to handle, it gives up and does what we’ve called a <b>fallback</b> - it passes the whole thing over to FL1. It does this at the network level - it just passes the bytes on to FL1.</p><p>As well as making it possible for us to send traffic to FL2 before it is fully complete, this has another massive benefit. When we have implemented a piece of new functionality in FL2, but want to double-check that it is working the same as in FL1, we can evaluate the functionality in FL2, and then trigger a fallback. Comparing the behaviour of the two systems gives us high confidence that our implementation is correct.</p>
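<p>To make the idea concrete, here is an illustrative standard-library sketch (ours, not Cloudflare’s code) of what a byte-level fallback amounts to: replay whatever was already read from the client to the legacy process, then splice the two connections together until both sides are done.</p>

```rust
use std::io::{self, Write};
use std::net::TcpStream;
use std::thread;

/// Illustrative byte-level fallback. `buffered` holds the bytes the new
/// proxy already consumed before deciding it could not handle the
/// request; they are replayed first so the legacy process sees the
/// complete byte stream from the start of the connection.
fn fall_back(mut client: TcpStream, legacy_addr: &str, buffered: &[u8]) -> io::Result<()> {
    let mut legacy = TcpStream::connect(legacy_addr)?;
    legacy.write_all(buffered)?;
    // Shuttle bytes in both directions until each side closes.
    let mut client_rd = client.try_clone()?;
    let mut legacy_wr = legacy.try_clone()?;
    let upstream = thread::spawn(move || io::copy(&mut client_rd, &mut legacy_wr));
    io::copy(&mut legacy, &mut client)?;
    let _ = upstream.join();
    Ok(())
}
```

<p>Because the handoff happens at the byte level, the legacy process needs no special cooperation: it simply sees an ordinary connection.</p>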
    <div>
      <h3>Step 4 - Rollout</h3>
      <a href="#step-4-rollout">
        
      </a>
    </div>
    <p>We started running customer traffic through FL2 early in 2025, and have been progressively increasing the amount of traffic served throughout the year. Essentially, we’ve been watching two graphs: one with the proportion of traffic routed to FL2 going up, and another with the proportion of traffic failing to be served by FL2 and falling back to FL1 going down.</p><p>We started this process by passing traffic for our free customers through the system. We were able to prove that the system worked correctly, and drive the fallback rates down for our major modules. Our <a href="https://community.cloudflare.com/"><u>Cloudflare Community</u></a> MVPs acted as an early warning system, smoke testing and flagging when they suspected the new platform might be the cause of a new reported problem. Crucially their support allowed our team to investigate quickly, apply targeted fixes, or confirm the move to FL2 was not to blame.</p><p>
We then advanced to our paying customers, gradually increasing the number of customers using the system. We also worked closely with some of our largest customers, who wanted the performance benefits of FL2, and onboarded them early in exchange for lots of feedback on the system.</p><p>Right now, most of our customers are using FL2. We still have a few features to complete, and are not quite ready to onboard everyone, but our target is to turn off FL1 within a few more months.</p>
    <div>
      <h2>Impact of FL2</h2>
      <a href="#impact-of-fl2">
        
      </a>
    </div>
    <p>As we described at the start of this post, FL2 is substantially faster than FL1. The biggest reason for this is simply that FL2 performs less work. You might have noticed this line in the module definition example:</p>
            <pre><code>    filters: vec![],</code></pre>
            <p>Every module is able to provide a set of filters, which control whether they run or not. This means that we don’t run logic for every product for every request — we can very easily select just the required set of modules. The incremental cost for each new product we develop has gone away.</p><p>Another huge reason for better performance is that FL2 is a single codebase, implemented in a performance-focussed language. In comparison, FL1 was based on NGINX (which is written in C), combined with LuaJIT (Lua, and C interface layers), and also contained plenty of Rust modules. In FL1, we spent a lot of time and memory converting data from the representation needed by one language to the representation needed by another.</p><p>As a result, our internal measures show that FL2 uses less than half the CPU of FL1, and much less than half the memory. That’s a huge bonus — we can spend the CPU on delivering more and more features for our customers!</p>
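<p>As a rough sketch of how such filters can work (the types below are invented for illustration; FL2’s real filter API differs), a phase runs only when every filter it declares accepts the request, and an empty list, as in the example above, means the phase always runs:</p>

```rust
// Invented, simplified filter types for illustration; not FL2's real API.
struct Request {
    zone_has_custom_errors: bool,
    is_websocket: bool,
}

enum Filter {
    RequiresCustomErrors,
    ExcludesWebsockets,
}

impl Filter {
    fn allows(&self, req: &Request) -> bool {
        match self {
            Filter::RequiresCustomErrors => req.zone_has_custom_errors,
            Filter::ExcludesWebsockets => !req.is_websocket,
        }
    }
}

/// A phase runs only if every one of its filters allows the request.
/// An empty filter list means "run for every request".
fn should_run(filters: &[Filter], req: &Request) -> bool {
    filters.iter().all(|f| f.allows(req))
}
```

<p>This is how the incremental cost of an unused product disappears: its phases are simply never selected for the request.</p>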
    <div>
      <h2>How do we measure if we are getting better?</h2>
      <a href="#how-do-we-measure-if-we-are-getting-better">
        
      </a>
    </div>
    <p>Using our own tools and independent benchmarks like <a href="https://www.cdnperf.com/"><u>CDNPerf</u></a>, we measured the impact of FL2 as we rolled it out across the network. The results are clear: websites are responding 10 ms faster at the median, a 25% performance boost.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eDqQvbAfrQoSXPZed0fd7/d7fe0d28e33a3f2e8dfcd27cf79c900f/image1.png" />
          </figure>
    <div>
      <h2>Security</h2>
      <a href="#security">
        
      </a>
    </div>
    <p>FL2 is also more secure by design than FL1. No software system is perfect, but the Rust language brings us huge benefits over LuaJIT. Rust has strong compile-time memory checks and a type system that avoids large classes of errors. Combine that with our rigid module system, and we can make most changes with high confidence.</p><p>Of course, no system is secure if used badly: it is still possible to write Rust code that causes memory corruption, for example through misuse of unsafe code. To reduce risk, we maintain strong compile-time linting and checking, together with strict coding standards, testing, and review processes.</p><p>We have long followed a policy that any unexplained crash of our systems needs to be <a href="https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/"><u>investigated as a high priority</u></a>. We won’t be relaxing that policy, though the main cause of novel crashes in FL2 so far has been hardware failure. The massively reduced rate of such crashes will give us time to do a good job of those investigations.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We’re spending the rest of 2025 completing the migration from FL1 to FL2, and will turn off FL1 in early 2026. We’re already seeing the benefits in terms of customer performance and speed of development, and we’re looking forward to giving these to all our customers.</p><p>We have one last service to completely migrate. The “HTTP &amp; TLS Termination” box from the diagram way back at the top is also an NGINX service, and we’re midway through a rewrite in Rust. We’re making good progress on this migration, and expect to complete it early next year.</p><p>After that, when everything is modular, in Rust, tested, and scaled, we can really start to optimize! We’ll reorganize and simplify how the modules connect to each other, expand support for non-HTTP traffic like RPC and streams, and much more.</p><p>If you’re interested in being part of this journey, check out our <a href="https://www.cloudflare.com/careers/"><u>careers page</u></a> for open roles - we’re always looking for new talent to help us build a better Internet.</p><div>
  
</div><p></p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Engineering]]></category>
            <guid isPermaLink="false">7d41Nh5ZiG8DbEjjwhJF3H</guid>
            <dc:creator>Richard Boulton</dc:creator>
            <dc:creator>Steve Goldsmith</dc:creator>
            <dc:creator>Maurizio Abba</dc:creator>
            <dc:creator>Matthew Bullock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Upgrading one of the oldest components in Cloudflare’s software stack]]></title>
            <link>https://blog.cloudflare.com/upgrading-one-of-the-oldest-components-in-cloudflare-software-stack/</link>
            <pubDate>Fri, 31 Mar 2023 13:00:00 GMT</pubDate>
            <description><![CDATA[ Engineers at Cloudflare have improved the release procedure of our largest edge proxy server. The improved process allows us to significantly decrease the amount of memory used during the version upgrade. As a result, we can deploy code faster and more reliably ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare serves a huge amount of traffic: 45 million HTTP requests per second on average (as of 2023; 61 million at peak) from more than 285 cities in over 100 countries. What inevitably happens with that kind of scale is that software will be pushed to its limits. As we grew, one of the problems we faced was related to deploying our code. Sometimes, a release would be delayed because of inadequate hardware resources on our servers. Buying more and more hardware is expensive, and there are practical limits, for example to how much memory we can realistically have on a server. In this article, we explain how we optimised our software and its release process so that no additional resources are needed.</p><p>In order to handle traffic, each of our servers runs a set of specialised proxies. Historically, they were based on NGINX, but increasingly they <a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/">include services created in Rust</a>. Out of our proxy applications, FL (Front Line) is the oldest and still has a broad set of responsibilities.</p><p>At its core, it’s one of the last uses of NGINX at Cloudflare. It contains a large amount of business logic that runs many Cloudflare products, using a variety of Lua and Rust libraries. As a result, it consumes a large amount of system resources: up to 50-60 GiB of RAM. As FL grew, it became more and more difficult to release it. The upgrade mechanism requires double the runtime memory, which is sometimes not available. This was causing delays in releases. We have now improved the release procedure of FL, and in effect, removed the need for additional memory during the release process, improving its speed and performance.</p>
    <div>
      <h2>Architecture</h2>
      <a href="#architecture">
        
      </a>
    </div>
    <p>To accomplish all of its work, each FL instance <a href="https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/">runs many workers (individual processes)</a>. By design, individual processes handle requests while the master process controls them and makes sure they stay up. This allows NGINX to handle more traffic by adding more workers. We take advantage of this architecture.</p><p>The number of workers depends on the total number of CPU cores present. It’s typically equal to half of the CPU cores available, e.g. on a 128-core CPU we use 64 FL workers.</p>
    <div>
      <h3>So far so good, what's the problem then?</h3>
      <a href="#so-far-so-good-whats-the-problem-then">
        
      </a>
    </div>
    <p>We aim to deploy code in a way that’s transparent to our customers. We want to continue serving requests without interruptions. In practice, this means briefly running both versions of FL at the same time during the upgrade, so that we can flawlessly transition from one version to another. As soon as the new instance is up and running, we begin to shut down the old one, giving it some time to finish its work. In the end, only the new version is left running. NGINX implements this <a href="https://nginx.org/en/docs/control.html">procedure</a> and FL makes use of it.</p><p>After a new version of FL is installed on a server, the upgrade procedure starts. NGINX’s implementation revolves around communicating with the master process using signals. The upgrade process starts by sending the USR2 signal which will start the new master process and its workers.</p><p>At that point, as can be seen below, both versions will be running and processing requests. The unfortunate side effect of this is the memory footprint has been doubled.</p>
            <pre><code>  PID  PPID COMMAND
33126     1 nginx: master process /usr/local/nginx/sbin/nginx
33134 33126 nginx: worker process (nginx)
33135 33126 nginx: worker process (nginx)
33136 33126 nginx: worker process (nginx)
36264 33126 nginx: master process /usr/local/nginx/sbin/nginx
36265 36264 nginx: worker process (nginx)
36266 36264 nginx: worker process (nginx)
36267 36264 nginx: worker process (nginx)</code></pre>
            <p>Then, the WINCH signal will be sent to the master process which will then ask its workers to gracefully shut down. Eventually, they will all quit, leaving just the original master process running (which can be shut down with the QUIT signal). The successful outcome of this will leave just the new instance running, which will look similar to this:</p>
            <pre><code>  PID  PPID COMMAND
36264     1 nginx: master process /usr/local/nginx/sbin/nginx
36265 36264 nginx: worker process (nginx)
36266 36264 nginx: worker process (nginx)
36267 36264 nginx: worker process (nginx)</code></pre>
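            <p>Driven by hand, the sequence looks roughly like this (a sketch only: the <code>run</code> stub prints each command instead of executing it, so no live NGINX master is involved; the PID matches the listing above):</p>

```shell
# Sketch of the standard NGINX binary-upgrade signal sequence (illustrative
# only: "run" prints each command instead of executing it, so this can be
# tried without a live NGINX master process).
run() { echo "$@"; }

send_upgrade_signals() {
    old_master=$1
    run kill -USR2 "$old_master"   # fork the new master; it starts the new workers
    run kill -WINCH "$old_master"  # ask the old workers to gracefully shut down
    run kill -QUIT "$old_master"   # finally, stop the old master itself
}

send_upgrade_signals 33126
```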
            <p>The standard NGINX upgrade mechanism is visualised in this diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2At10yd2Eea3HVDdL0U3N1/23a06f60c88a26071a0dd6e0ccccbb6e/image5-6.png" />
            
            </figure><p>It’s also clearly visible in the memory usage graph below (notice the large bump during the upgrade).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TZd2iwu6ObkICVMNoRVKV/ef47bb08937c28ac59b66d434091b5d3/image1-60.png" />
            
            </figure><p>The mechanism outlined above has both versions running at the same time for a while. When both sets of workers are running, they are still sharing the same sockets, so all of them accept requests. As the release progresses, ‘old’ workers stop listening and accepting new requests (at that point only the new workers accept new requests). As we release new code multiple times per week, this process effectively doubles our memory requirements. At our scale, it’s easy to see how multiplying this event by the number of servers we operate results in an immense waste of memory resources.</p><p>In addition, sometimes servers would take hours to upgrade (especially concerning when we need to release something quickly), as they waited for enough memory to become available to kick off the reload action.</p>
    <div>
      <h2>New upgrade mechanism</h2>
      <a href="#new-upgrade-mechanism">
        
      </a>
    </div>
    <p>We solved this problem by modifying NGINX's method for upgrading its executable. Here's how it works.</p><p>The crux of the problem is that NGINX treats the entire instance (master + workers) as one. When upgrading, all the new workers have to start whilst all the previous ones are still running. Considering the number of workers we use and how heavy they are, this is not sustainable.</p><p>So, instead, we modified NGINX to be able to control individual workers. Rather than starting and stopping them all at once, we can select them individually. To accomplish this, the master process and workers understand additional signals beyond the ones stock NGINX uses. The actual mechanism to accomplish this in NGINX is nearly the same as when handling workers in bulk. However, there’s a crucial difference.</p><p>Typically, NGINX's master process ensures that the right number of workers is actually running (per configuration). If any of them crashes, it will be restarted. This is good, but it doesn't work for our new upgrade mechanism: when we need to shut down a single worker, we don't want the NGINX master process to think that a worker has crashed and needs to be restarted. So we introduced a signal that disables this behaviour while we're shutting down a single process.</p>
    <div>
      <h3>Step by step</h3>
      <a href="#step-by-step">
        
      </a>
    </div>
    <p>Our improved mechanism handles each worker individually. Here are the steps:</p><ol><li><p>At the beginning, we have an instance of FL running 64 workers.</p></li><li><p>Disable the feature to automatically restart workers which exit.</p></li><li><p>Shut down a worker from the first (old) instance of FL. We're down to 63 workers.</p></li><li><p>Create a new instance of FL but only with a single worker. We're back to 64 workers but including one from the new version.</p></li><li><p>Re-enable the feature to automatically restart worker processes which exit.</p></li><li><p>We continue this pattern of shutting down a worker from an older instance and creating a new one to replace it. This can be visualised in the diagram below.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ja1ydbwqob5R6Gha1siHf/01e7d09ea1e388a689e267b2d8455dbb/image3-33.png" />
            
            </figure><p>We can observe our new mechanism in action in the graph below. Thanks to our new procedure, our use of memory remains stable during the release.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/P5YwoLGgcP5ckjOrBgq4Q/1cad0069b9e0bd811ee61905a2fe2ffa/image2-28.png" />
            
            </figure><p>But why do we begin by shutting down a worker belonging to an older instance (v1)? This turns out to be important.</p>
    <div>
      <h3>Worker-CPU pinning</h3>
      <a href="#worker-cpu-pinning">
        
      </a>
    </div>
    <p>During this workflow, we also had to take care of CPU pinning. FL workers on our servers are pinned to CPU cores, one process per core, to help us distribute resources more efficiently. If we started a new worker first, it would briefly share a CPU core with an existing worker. This would overload that core compared to the others running FL, impacting the latency of the requests it serves. That's why we begin by freeing up a CPU core, which the new worker can then take over, rather than by creating the new worker first.</p><p>For the same reason, pinning of worker processes to cores must be maintained throughout the whole operation. At no point can two different workers share a CPU core. We make sure this is the case by executing the whole procedure in the same order every time.</p><p>We start from CPU core 1 (or whichever is the first one used by FL) and do the following:</p><ol><li><p>Shut down the worker that's running there.</p></li><li><p>Create a new worker. NGINX will pin it to the CPU core we freed up in the previous step.</p></li></ol><p>After doing that for all workers, we end up with a new set of workers, correctly pinned to their CPU cores.</p>
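    <p>The difference between the two strategies can be sketched as a small simulation (illustrative only, not our actual release code) comparing the peak worker count, a proxy for peak memory use, of the standard bulk upgrade against the rolling, per-core replacement:</p>

```rust
// Sketch comparing the peak worker count (a proxy for peak memory use) of
// the standard bulk upgrade against the rolling, per-core replacement
// described above. Illustrative only: "workers" are just version tags.
#[derive(Clone, Copy, PartialEq)]
enum V {
    Old,
    New,
}

// Standard NGINX upgrade: the whole new generation starts before the old
// one shuts down, so both sets of workers briefly coexist.
fn bulk_upgrade_peak(n: usize) -> usize {
    let mut workers = vec![V::Old; n];
    workers.extend(vec![V::New; n]); // USR2: new master starts all new workers
    let peak = workers.len();
    workers.retain(|&v| v == V::New); // WINCH/QUIT: the old generation exits
    peak
}

// Rolling upgrade: shut down one old worker first (freeing its CPU core),
// then start one new worker pinned to that freed core, and repeat.
fn rolling_upgrade_peak(n: usize) -> usize {
    let mut workers = vec![V::Old; n];
    let mut peak = workers.len();
    for i in 0..n {
        workers[i] = V::New; // stop the old worker, start a new one on its core
        peak = peak.max(workers.len());
    }
    assert!(workers.iter().all(|&v| v == V::New));
    peak
}

fn main() {
    assert_eq!(bulk_upgrade_peak(64), 128); // memory briefly doubles
    assert_eq!(rolling_upgrade_peak(64), 64); // footprint stays flat
}
```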
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>At Cloudflare, we need to release new software multiple times per day across our fleet. The standard upgrade mechanism used by NGINX is not suitable at our scale. For this reason, we customised the process so that releasing FL requires no additional memory. This enables us to safely ship code whenever it's needed, everywhere, and to release a large application such as FL reliably regardless of how much memory is available on an edge server. We showed that it’s possible to extend NGINX and its built-in upgrade mechanism to accommodate the unique requirements of Cloudflare.</p><p>If you enjoy solving challenging application infrastructure problems and want to help maintain the busiest web server in the world, we’re <a href="https://www.cloudflare.com/careers/jobs/">hiring</a>!</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">5Pa64Xa839HvHoQNbzzBM3</guid>
            <dc:creator>Maciej Lechowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[ROFL with a LOL: rewriting an NGINX module in Rust]]></title>
            <link>https://blog.cloudflare.com/rust-nginx-module/</link>
            <pubDate>Fri, 24 Feb 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare engineers rewrote cf-html, an old NGINX module, in Rust. This project revealed much about NGINX, potentially leading to its full replacement in Cloudflare's infrastructure. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>At Cloudflare, engineers spend a great deal of time <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactoring</a> or rewriting existing functionality. When your company doubles the amount of traffic it handles every year, what was once an elegant solution to a problem can quickly become outdated as the engineering constraints change. Not only that, but when you're averaging 40 million requests a second, issues that might affect 0.001% of requests flowing through our network are big incidents which may impact millions of users, and one-in-a-trillion events happen several times a day.</p><p>Recently, we've been working on a replacement to one of our oldest and least-well-known components called cf-html, which lives inside the core reverse web proxy of Cloudflare known as FL (Front Line). Cf-html is the framework in charge of parsing and rewriting HTML as it streams back through from the website origin to the website visitor. Since the early days of Cloudflare, we’ve offered features which will rewrite the response body of web requests for you on the fly. The first ever feature we wrote in this way was to replace email addresses with chunks of JavaScript, which would then load the email address when viewed in a web browser. Since bots are often unable to evaluate JavaScript, this helps to <a href="https://www.cloudflare.com/learning/ai/how-to-prevent-web-scraping/">prevent scraping of email addresses from websites</a>. You can see this in action if you view the source of this page and look for this email address: <a>foo@example.com</a>.</p><p>FL is where most of the application infrastructure logic for Cloudflare runs, and largely consists of code written in the Lua scripting language, which runs on top of NGINX as part of <a href="https://openresty.org/en/">OpenResty</a>. 
In order to interface with NGINX directly, some parts (like cf-html) are written in lower-level languages like C and C++. In the past, there were many such OpenResty services at Cloudflare, but these days FL is one of the few left, as we move other components to <a href="/introducing-cloudflare-workers/">Workers</a> or <a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/">Rust-based proxies</a>. The platform that once was the best possible blend of developer ease and speed has more than started to show its age for us.</p><p>When discussing what happens to an HTTP request passing through our network and in particular FL, nearly all the attention is given to what happens up until the request reaches the customer's origin. That’s understandable as this is where most of the business logic happens: firewall rules, Workers, and routing decisions all happen on the request. But it's not the end of the story. From an engineering perspective, much of the more interesting work happens on the response, as we stream the HTML response back from the origin to the site visitor.</p><p>The logic to handle this is contained in a static NGINX module, and runs in the <a href="http://nginx.org/en/docs/dev/development_guide.html#http_response_body_filters">Response Body Filters</a>  phase in NGINX, as chunks of the HTTP response body are streamed through. Over time, more features were added, and the system became known as cf-html. cf-html uses a streaming HTML parser to match on specific HTML tags and content, called <a href="https://github.com/cloudflare/lazyhtml">Lazy HTML</a> or lhtml, with much of the logic for both it and the cf-html features written using the <a href="http://www.colm.net/open-source/ragel/">Ragel</a> state machine engine.</p>
    <div>
      <h3>Memory safety</h3>
      <a href="#memory-safety">
        
      </a>
    </div>
    <p>All the cf-html logic was written in C, and therefore was susceptible to memory corruption issues that plague many large C codebases. In 2017 this led to a security bug as the team was trying to replace part of cf-html. FL was reading arbitrary data from memory and appending it to response bodies. This could potentially include data from other requests passing through FL at the same time. This security event became known widely as <a href="https://en.wikipedia.org/wiki/Cloudbleed">Cloudbleed</a>.</p><p>Since this episode, Cloudflare implemented a number of policies and safeguards to ensure something like that never happened again. While work has been carried out on cf-html over the years, there have been few new features implemented on the framework, and we’re now hyper-sensitive to crashes happening in FL (and, indeed, any other process running on our network), especially in parts that can reflect data back with a response.</p><p>Fast-forward to 2022 into 2023, and the FL Platform team have been getting more and more requests for a system they can easily use to look at and rewrite response body data. At the same time, another team has been working on a new response body parsing and rewriting framework for Workers called <a href="https://github.com/cloudflare/lol-html">lol-html</a> or Low Output Latency HTML. Not only is lol-html faster and more efficient than Lazy HTML, but it’s also currently in full production use as part of the Worker interface, and written in Rust, which is much safer than C in terms of its handling of memory. It’s ideal, therefore, as a replacement for the ancient and creaking HTML parser we’ve been using in FL up until now.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3EU3AILSqI8QLOqouiXA9z/2f768777177f833da1fa212ccb045689/image1-8.png" />
            
            </figure><p>So we started working on a new framework, written in Rust, that would incorporate lol-html and allow other teams to write response body parsing features without the threat of causing massive security issues. The new system is called ROFL or Response Overseer for FL, and it’s a brand-new NGINX module written completely in Rust. As of now, ROFL is running in production on millions of responses a second, with comparable performance to cf-html. In building ROFL, we’ve been able to deprecate one of the scariest bits of code in Cloudflare’s entire codebase, while providing teams at Cloudflare with a robust system they can use to write features which need to parse and rewrite response body data.</p>
    <div>
      <h3>Writing an NGINX module in Rust</h3>
      <a href="#writing-an-nginx-module-in-rust">
        
      </a>
    </div>
    <p>While writing the new module, we learned a lot about how NGINX works, and how we can get it to talk to Rust. NGINX doesn’t provide much documentation on writing modules in languages other than C, so some work was needed to figure out how to write an NGINX module in our language of choice. When starting out, we made heavy use of parts of the code from the <a href="https://github.com/dcoles/nginx-rs">nginx-rs</a> project, particularly around the handling of buffers and memory pools. While writing a full NGINX module in Rust is a long process and beyond the scope of this blog post, there are a few key bits that make the whole thing possible, and that are worth talking about.</p><p>The first of these is generating the Rust bindings so that NGINX can communicate with our module. To do that, we used the Rust library <a href="https://rust-lang.github.io/rust-bindgen/">Bindgen</a> to build the FFI bindings for us, based on the symbol definitions in NGINX’s header files. To add this to an existing Rust project, the first thing is to pull down a copy of NGINX and configure it. Ideally this would be done in a simple script or Makefile, but done by hand it would look something like this:</p>
            <pre><code>$ git clone --depth=1 https://github.com/nginx/nginx.git
$ cd nginx
$ ./auto/configure --without-http_rewrite_module --without-http_gzip_module</code></pre>
            <p>With NGINX in the right state, we need to create a <code>build.rs</code> file in our Rust project to auto-generate the bindings when the module is built. We’ll now add the necessary arguments to the build, and use Bindgen to generate the <code>bindings.rs</code> file. For the arguments, we just need to include all the directories that may contain header files, so clang can resolve the includes. We can then feed them into Bindgen, along with some allowlist arguments, so it knows which symbols it should generate bindings for, and which it can ignore. Adding a little boilerplate code to the top, the whole file should look something like this:</p>
            <pre><code>use std::env;
use std::path::PathBuf;

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    let clang_args = [
        "-Inginx/objs/",
        "-Inginx/src/core/",
        "-Inginx/src/event/",
        "-Inginx/src/event/modules/",
        "-Inginx/src/os/unix/",
        "-Inginx/src/http/",
        "-Inginx/src/http/modules/"
    ];

    let bindings = bindgen::Builder::default()
        .header("wrapper.h")
        .layout_tests(false)
        .allowlist_type("ngx_.*")
        .allowlist_function("ngx_.*")
        .allowlist_var("NGX_.*|ngx_.*|nginx_.*")
        .parse_callbacks(Box::new(bindgen::CargoCallbacks))
        .clang_args(clang_args)
        .generate()
        .expect("Unable to generate bindings");

    let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());

    bindings.write_to_file(out_path.join("bindings.rs"))
        .expect("Unable to write bindings.");
}</code></pre>
            <p>Hopefully this is all fairly self-explanatory. Bindgen traverses the NGINX source and generates equivalent constructs in Rust in a file called <code>bindings.rs</code>, which we can import into our project. There’s just one more thing to add: Bindgen has trouble with a couple of symbols in NGINX, which we’ll need to fix in a file called <code>wrapper.h</code>. It should have the following contents:</p>
            <pre><code>#include &lt;ngx_http.h&gt;

const char* NGX_RS_MODULE_SIGNATURE = NGX_MODULE_SIGNATURE;
const size_t NGX_RS_HTTP_LOC_CONF_OFFSET = NGX_HTTP_LOC_CONF_OFFSET;</code></pre>
            <p>With this in place and <a href="https://crates.io/crates/bindgen">Bindgen</a> set in the <code>[build-dependencies]</code> section of the Cargo.toml file, we should be ready to build.</p>
            <pre><code>$ cargo build
   Compiling rust-nginx-module v0.1.0 (/Users/sam/cf-repos/rust-nginx-module)
    Finished dev [unoptimized + debuginfo] target(s) in 4.70s</code></pre>
            <p>With any luck, we should see a file called bindings.rs in the target/debug/build directory, which contains Rust definitions of all the NGINX symbols.</p>
            <pre><code>$ find target -name 'bindings.rs' 
target/debug/build/rust-nginx-module-c5504dc14560ecc1/out/bindings.rs

$ head target/debug/build/rust-nginx-module-c5504dc14560ecc1/out/bindings.rs
/* automatically generated by rust-bindgen 0.61.0 */
[...]</code></pre>
            <p>To be able to use them in the project, we can include them in a new file under the <code>src</code> directory which we’ll call <code>bindings.rs</code>.</p>
            <pre><code>$ cat &gt; src/bindings.rs
include!(concat!(env!("OUT_DIR"), "/bindings.rs"));</code></pre>
            <p>With that set, we just need to add the usual imports to the top of the <code>lib.rs</code> file, and we can access NGINX constructs from Rust. Not only does this make bugs in the interface between NGINX and our Rust module much less likely than if these values were hand-coded, but it’s also a fantastic reference we can use to check the structure of things in NGINX when building modules in Rust, and it takes a lot of the leg-work out of setting everything up. It’s really a testament to the quality of a lot of Rust libraries such as Bindgen that something like this can be done with so little effort, in a robust way.</p><p>Once the Rust library has been built, the next step is to hook it into NGINX. Most NGINX modules are compiled statically. That is, the module is compiled as part of the compilation of NGINX as a whole. However, since NGINX 1.9.11, it has supported dynamic modules, which are compiled separately and then loaded using the <code>load_module</code> directive in the <code>nginx.conf</code> file. This is what we needed for ROFL, so that the library could be compiled separately and loaded in when NGINX starts up. Working out from the documentation which symbols need to be exported, and in what format, was tricky, though, and although it is possible to use a separate config file to set some of this metadata, it’s better if we can carry it as part of the module, to keep things neat. Luckily, it doesn’t take much spelunking through the NGINX codebase to find where <code>dlopen</code> is called.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5RGvw7xRYp02IHS5CZyjPh/b0275bf87343972eb353abd85ab949e5/image2-3.png" />
            
            </figure><p>So after that it’s just a case of making sure the relevant symbols exist.</p>
            <pre><code>use std::os::raw::c_char;
use std::ptr;

#[no_mangle]
pub static mut ngx_modules: [*const ngx_module_t; 2] = [
    unsafe { rust_nginx_module as *const ngx_module_t },
    ptr::null()
];

#[no_mangle]
pub static mut ngx_module_type: [*const c_char; 2] = [
    "HTTP_FILTER\0".as_ptr() as *const c_char,
    ptr::null()
];

#[no_mangle]
pub static mut ngx_module_names: [*const c_char; 2] = [
    "rust_nginx_module\0".as_ptr() as *const c_char,
    ptr::null()
];</code></pre>
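            <p>With those symbols exported and the library compiled as a shared object, it can then be loaded from <code>nginx.conf</code> like any other dynamic module (a sketch; the filename and path here are hypothetical):</p>

```nginx
# Load the Rust-based dynamic module at startup (illustrative path only):
load_module modules/librust_nginx_module.so;
```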
            <p>When writing an NGINX module, it’s crucial to get its order relative to the other modules correct. Dynamic modules get loaded as NGINX starts, which means they are (perhaps counterintuitively) the first to run on a response. Ensuring your module runs after gzip decompression by specifying its order relative to the gunzip module is essential, otherwise you can spend lots of time staring at streams of unprintable characters, wondering why you aren’t seeing the response you expected. Not fun. Fortunately this is also something that can be solved by looking at the NGINX source, and making sure the relevant entities exist in your module. Here’s an example of what you might set:</p>
            <pre><code>#[no_mangle]
pub static mut ngx_module_order: [*const c_char; 3] = [
    "rust_nginx_module\0".as_ptr() as *const c_char,
    "ngx_http_headers_more_filter_module\0".as_ptr() as *const c_char,
    ptr::null()
];</code></pre>
            <p>We’re essentially saying we want our module <code>rust_nginx_module</code> to run just before the <code>ngx_http_headers_more_filter_module</code> module, which should allow it to run in the place we expect.</p><p>One of the quirks of NGINX and OpenResty is how hostile they are to making calls to external services while you’re dealing with the HTTP response. It’s something that isn’t provided as part of the OpenResty Lua framework, even though it would make working with the response phase of a request much easier. While we could do this anyway, that would mean having to fork NGINX and OpenResty, which would bring its own challenges. As a result, we’ve spent a lot of time over the years thinking about ways to pass state from when NGINX is dealing with an HTTP request over to when it’s streaming through the response, and much of our logic is built around this style of work.</p><p>For ROFL, that means that in order to determine whether a certain feature applies to a response, we need to figure that out on the request, then pass that information over to the response so that we know which features to activate. To do that, we need to use one of the utilities that NGINX provides you with. With the help of the <code>bindings.rs</code> file generated earlier, we can take a look at the definition of the <code>ngx_http_request_s</code> struct, which contains all the state associated with a given request:</p>
            <pre><code>#[repr(C)]
#[derive(Debug, Copy, Clone)]
pub struct ngx_http_request_s {
    pub signature: u32,
    pub connection: *mut ngx_connection_t,
    pub ctx: *mut *mut ::std::os::raw::c_void,
    pub main_conf: *mut *mut ::std::os::raw::c_void,
    pub srv_conf: *mut *mut ::std::os::raw::c_void,
    pub loc_conf: *mut *mut ::std::os::raw::c_void,
    pub read_event_handler: ngx_http_event_handler_pt,
    pub write_event_handler: ngx_http_event_handler_pt,
    pub cache: *mut ngx_http_cache_t,
    pub upstream: *mut ngx_http_upstream_t,
    pub upstream_states: *mut ngx_array_t,
    pub pool: *mut ngx_pool_t,
    pub header_in: *mut ngx_buf_t,
    pub headers_in: ngx_http_headers_in_t,
    pub headers_out: ngx_http_headers_out_t,
    pub request_body: *mut ngx_http_request_body_t,
[...]
}</code></pre>
            <p>As we can see, there’s a member called <code>ctx</code>. As the <a href="http://nginx.org/en/docs/dev/development_guide.html#http_request">NGINX Development Guide mentions</a>, it’s a place where you’re able to store any value associated with a request, which should live for as long as the request does. In OpenResty this is used heavily for the storing of state to do with a request over its lifetime in a Lua context. We can do the same thing for our module, so that settings initialised during the request phase are there when our HTML parsing and rewriting is run in the response phase. Here’s an example function which can be used to get the request <code>ctx</code>:</p>
            <pre><code>pub fn get_ctx(request: &amp;ngx_http_request_t) -&gt; Option&lt;&amp;mut Ctx&gt; {
    unsafe {
        match *request.ctx.add(ngx_http_rofl_module.ctx_index) {
            p if p.is_null() =&gt; None,
            p =&gt; Some(&amp;mut *(p as *mut Ctx)),
        }
    }
}</code></pre>
            <p>Notice that <code>ctx</code> is at the offset of the <code>ctx_index</code> member of <code>ngx_http_rofl_module</code> - this is the structure of type <code>ngx_module_t</code> that’s part of the module definition needed to make an NGINX module. Once we have this, we can point it to a structure containing any setting we want. For example, here’s the actual function we use to enable the Email Obfuscation feature from Lua, via FFI to the Rust module using LuaJIT’s FFI tools:</p>
            <pre><code>#[no_mangle]
pub extern "C" fn rofl_module_email_obfuscation_new(
    request: &amp;mut ngx_http_request_t,
    dry_run: bool,
    decode_script_url: *const u8,
    decode_script_url_len: usize,
) {
    let ctx = context::get_or_init_ctx(request);
    let decode_script_url = unsafe {
        std::str::from_utf8(std::slice::from_raw_parts(decode_script_url, decode_script_url_len))
            .expect("invalid utf-8 string for decode script")
    };

    ctx.register_module(EmailObfuscation::new(decode_script_url.to_owned()), dry_run);
}</code></pre>
            <p>The function is called <code>get_or_init_ctx</code> here: it performs the same job as <code>get_ctx</code>, but also initialises the structure if it doesn’t exist yet. Once we’ve set whatever data we need in <code>ctx</code> during the request, we can then check what features need to be run in the response, without having to make any calls to external databases, which might slow us down.</p><p>One of the nice things about storing state on <code>ctx</code> in this way, and working with NGINX in general, is that it relies heavily on memory pools to store request content. This largely removes the need for the programmer to think about freeing memory after use: the pool is automatically allocated at the start of a request, and is automatically freed when the request is done. All that’s needed is to allocate the memory using NGINX’s built-in functions for <a href="https://www.nginx.com/resources/wiki/extending/api/alloc/#c.ngx_palloc">allocating memory to the pool</a> and then <a href="http://www.nginx.com/resources/wiki/extending/api/alloc/#c.ngx_pool_cleanup_add">registering a callback</a> that will be called to free everything. In Rust, that would look something like the following:</p>
            <pre><code>pub struct Pool&lt;'a&gt;(&amp;'a mut ngx_pool_t);

impl&lt;'a&gt; Pool&lt;'a&gt; {    
    /// Register a cleanup handler that will get called at the end of the request.
    fn add_cleanup&lt;T&gt;(&amp;mut self, value: *mut T) -&gt; Result&lt;(), ()&gt; {
        unsafe {
            let cln = ngx_pool_cleanup_add(self.0, 0);
            if cln.is_null() {
                return Err(());
            }
            (*cln).handler = Some(cleanup_handler::&lt;T&gt;);
            (*cln).data = value as *mut c_void;
            Ok(())
        }
    }

    /// Allocate memory for a given value.
    pub fn alloc&lt;T&gt;(&amp;mut self, value: T) -&gt; Option&lt;&amp;'a mut T&gt; {
        unsafe {
            let p = ngx_palloc(self.0, mem::size_of::&lt;T&gt;()) as *mut T;
            // ngx_palloc can fail: never write through a null pointer.
            if p.is_null() {
                return None;
            }
            ptr::write(p, value);
            if let Err(_) = self.add_cleanup(p) {
                ptr::drop_in_place(p);
                return None;
            };
            Some(&amp;mut *p)
        }
    }
}

unsafe extern "C" fn cleanup_handler&lt;T&gt;(data: *mut c_void) {
    ptr::drop_in_place(data as *mut T);
}</code></pre>
            <p>This should allow us to allocate memory for whatever we want, safe in the knowledge that NGINX will handle it for us.</p><p>It is regrettable that we have to write a lot of <code>unsafe</code> blocks when dealing with NGINX’s interface in Rust. Although we’ve done a lot of work to minimise them where possible, unfortunately this is often the case with writing Rust code which has to manipulate C constructs through FFI. We have plans to do more work on this in the future, and remove as many lines as possible from <code>unsafe</code>.</p>
    <div>
      <h3>Challenges encountered</h3>
      <a href="#challenges-encountered">
        
      </a>
    </div>
    <p>The NGINX module system allows for a massive amount of flexibility in terms of the way the module itself works, which makes it very accommodating to specific use-cases, but that flexibility can also lead to problems. One that we ran into had to do with the way the response data is handled between Rust and FL. In NGINX, response bodies are chunked, and these chunks are then linked together into a list. Additionally, there may be more than one of these linked lists per response, if the response is large.</p><p>Efficiently handling these chunks means processing them and passing them on as quickly as possible. When writing a Rust module for manipulating responses, it’s tempting to implement a Rust-based view into these linked lists. However, if you do that, you must be sure to update both the Rust-based view and the underlying NGINX data structures when mutating them, otherwise this can lead to serious bugs where Rust becomes out of sync with NGINX. Here’s a small function from an early version of ROFL that caused headaches:</p>
            <pre><code>fn handle_chunk(&amp;mut self, chunk: &amp;[u8]) {
    let mut free_chain = self.chains.free.borrow_mut();
    let mut out_chain = self.chains.out.borrow_mut();
    let mut data = chunk;

    self.metrics.borrow_mut().bytes_out += data.len() as u64;

    while !data.is_empty() {
        let free_link = self
            .pool
            .get_free_chain_link(free_chain.head, self.tag, &amp;mut self.metrics.borrow_mut())
            .expect("Could not get a free chain link.");

        let mut link_buf = unsafe { TemporaryBuffer::from_ngx_buf(&amp;mut *(*free_link).buf) };
        data = link_buf.write_data(data).unwrap_or(b"");
        out_chain.append(free_link);
    }
}</code></pre>
            <p>What this code was supposed to do is take the output of lol-html’s HTMLRewriter and write it to the output chain of buffers. Importantly, the output can be larger than a single buffer, so you need to take new buffers off the free chain in a loop until you’ve written all the output. Within this logic, NGINX is supposed to take care of popping the buffer off the free chain and appending the new chunk to the output chain, which it does. However, if you’re only thinking in terms of how NGINX handles its view of the linked list, you may not notice that Rust never changes which buffer its <code>free_chain.head</code> points to, causing the logic to loop forever and the NGINX worker process to lock up completely. This sort of issue can take a long time to track down, especially since we couldn’t reproduce it on our personal machines until we understood it was related to the response body size.</p><p>Getting a coredump to analyse with gdb was also hard: by the time we noticed the problem, it was already too late and the process memory had grown to the point that the server was in danger of falling over, and was too large to be written to disk. Fortunately, this code never made it to production. As ever, while Rust’s compiler can help you catch a lot of common mistakes, it can’t help as much when data is shared via FFI from another environment, even without much direct use of <code>unsafe</code>, so extra care must be taken in these cases, especially when NGINX allows the kind of flexibility that might lead to a whole machine being taken out of service.</p><p>Another major challenge we faced had to do with backpressure from incoming response body chunks. In essence, if ROFL increased the size of the response by injecting a large amount of code into the stream (such as replacing an email address with a large chunk of JavaScript), NGINX can feed the output from ROFL to the downstream modules faster than they can push it along, potentially leading to data being dropped and HTTP response bodies being truncated if the <code>EAGAIN</code> error from the next module is left unhandled. This was another case where the issue was really hard to test, because most of the time the response would be flushed fast enough for backpressure never to be a problem. To handle this, we had to create a special chain called <code>saved_in</code> to store these chunks, which required a special method for appending to it.</p>
            <pre><code>#[derive(Debug)]
pub struct Chains {
    /// This saves buffers from the `in` chain that were not processed for any reason (most likely
    /// backpressure for the next nginx module).
    saved_in: RefCell&lt;Chain&gt;,
    pub free: RefCell&lt;Chain&gt;,
    pub busy: RefCell&lt;Chain&gt;,
    pub out: RefCell&lt;Chain&gt;,
    [...]
}</code></pre>
            <p>Effectively we’re ‘queueing’ the data for a short period of time so that we don’t overwhelm the other modules by feeding them data faster than they can handle it. The <a href="https://nginx.org/en/docs/dev/development_guide.html">NGINX Developer Guide</a> has a lot of great information, but many of its examples are trivial enough that issues like this never come up. Problems like these are a product of working in a complex NGINX-based environment, and have to be discovered independently.</p>
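The queueing idea can be sketched in plain Rust. Everything below is invented for illustration: chunks are `Vec<u8>` instead of NGINX chain links, the downstream module is simulated by a closure, and `EAGAIN` becomes an error variant:

```rust
#[derive(Debug, PartialEq)]
enum PushError {
    Again, // downstream not ready, i.e. a simulated EAGAIN
}

#[derive(Default)]
struct Chains {
    // Chunks the next module could not take yet.
    saved_in: Vec<Vec<u8>>,
}

impl Chains {
    /// Queue the new chunk, then drain as much as the downstream will
    /// accept. Anything it rejects stays in `saved_in` for a later call,
    /// so no data is ever dropped under backpressure.
    fn push_chunk<F>(&mut self, chunk: Vec<u8>, downstream: &mut F)
    where
        F: FnMut(&[u8]) -> Result<(), PushError>,
    {
        self.saved_in.push(chunk);
        while let Some(front) = self.saved_in.first() {
            match downstream(front.as_slice()) {
                Ok(()) => {
                    self.saved_in.remove(0);
                }
                Err(PushError::Again) => break, // retry on the next call
            }
        }
    }
}

fn main() {
    let mut chains = Chains::default();
    let mut delivered: Vec<Vec<u8>> = Vec::new();

    // A busy downstream refuses everything: the chunk must be queued.
    let mut busy = |_c: &[u8]| Err(PushError::Again);
    chains.push_chunk(b"abc".to_vec(), &mut busy);
    assert_eq!(chains.saved_in.len(), 1);

    // Once the downstream accepts, queued data drains before new data,
    // preserving the order of the response body.
    let mut ready = |c: &[u8]| {
        delivered.push(c.to_vec());
        Ok(())
    };
    chains.push_chunk(b"def".to_vec(), &mut ready);
    assert!(chains.saved_in.is_empty());
    assert_eq!(delivered, vec![b"abc".to_vec(), b"def".to_vec()]);
}
```

The key property is that the rejected chunk survives the `EAGAIN` and is retried first on the next call, rather than being silently discarded.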
    <div>
      <h3>A future without NGINX</h3>
      <a href="#a-future-without-nginx">
        
      </a>
    </div>
    <p>The obvious question a lot of people might ask is: why are we still using NGINX? As already mentioned, Cloudflare is well on its way to replacing components that either used to run NGINX/OpenResty proxies, or would have done without heavy investment in home-grown platforms. That said, some components are easier to replace than others, and FL, being where most of the logic for Cloudflare’s <a href="https://www.cloudflare.com/application-services/">application services</a> runs, is definitely on the more challenging end of the spectrum.</p><p>Another motivation for this work is that whichever platform we eventually migrate to, we’ll need to run the features that make up cf-html, and to do that we’ll want a system that is less heavily integrated with and dependent on NGINX. ROFL has been designed from the start to run in multiple places, so it can be moved to another Rust-based web proxy (or indeed our Workers platform) without too much trouble. That said, it’s hard to imagine we’d be in the same place without a language like Rust, which offers speed alongside a high degree of safety, not to mention high-quality libraries like Bindgen and Serde. More broadly, the FL team are working to migrate other aspects of the platform over to Rust, and while cf-html and the features that make it up are a key part of our infrastructure that needed work, there are many others.</p><p>Safety in programming languages is often seen as beneficial in terms of preventing bugs, but as a company we’ve found that it also allows you to do things which would otherwise be considered very hard, or impossible to do safely. 
Whether it be providing a <a href="https://developers.cloudflare.com/firewall/cf-firewall-rules/">Wireshark-like filter language for writing firewall rules</a>, <a href="https://workers.cloudflare.com/">allowing millions of users to write arbitrary JavaScript code and run it directly on our platform</a> or rewriting HTML responses on the fly, having strict boundaries in place allows us to provide services we wouldn’t be able to otherwise, all while safe in the knowledge that the kind of memory-safety issues that used to plague the industry are increasingly a thing of the past.</p><p>If you enjoy rewriting code in Rust, solving challenging application infrastructure problems and want to help maintain the busiest web server in the world, we’re <a href="https://www.cloudflare.com/careers/jobs/">hiring</a>!</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">1YXnWdBwvbFuEaSaQi6qxC</guid>
            <dc:creator>Sam Howson</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we built Pingora, the proxy that connects Cloudflare to the Internet]]></title>
            <link>https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/</link>
            <pubDate>Wed, 14 Sep 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using Rust that serves over 1 trillion requests a day ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using <a href="https://www.rust-lang.org/">Rust</a> that serves over 1 trillion requests a day, boosts our performance, and enables many new features for Cloudflare customers, all while requiring only a third of the CPU and memory resources of our previous proxy infrastructure.</p><p>As Cloudflare has scaled we’ve outgrown NGINX. It was great for many years, but over time its limitations at our scale meant building something new made sense. We could no longer get the performance we needed nor did NGINX have the features we needed for our very complex environment.</p><p>Many Cloudflare customers and users use the Cloudflare global network as a proxy between HTTP clients (such as web browsers, apps, IoT devices and more) and servers. In the past, we’ve talked a lot about how browsers and other user agents connect to our network, and we’ve developed a lot of technology and implemented new protocols (see <a href="/the-road-to-quic/">QUIC</a> and <a href="/delivering-http-2-upload-speed-improvements/">optimizations for http2</a>) to make this leg of the connection more efficient.</p><p>Today, we’re focusing on a different part of the equation: the service that proxies traffic between our network and servers on the Internet. This proxy service powers our CDN, Workers fetch, Tunnel, Stream, R2 and many, many other features and products.</p><p>Let’s dig in on why we chose to replace our legacy service and how we developed Pingora, our new system designed specifically for Cloudflare’s customer use cases and scale.</p>
    <div>
      <h2>Why build yet another proxy</h2>
      <a href="#why-build-yet-another-proxy">
        
      </a>
    </div>
    <p>Over the years, our usage of NGINX has run up against limitations. For some limitations, we optimized or worked around them. But others were much harder to overcome.</p>
    <div>
      <h3>Architecture limitations hurt performance</h3>
      <a href="#architecture-limitations-hurt-performance">
        
      </a>
    </div>
    <p>The NGINX <a href="https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/">worker (process) architecture</a> has operational drawbacks for our use cases that hurt our performance and efficiency.</p><p>First, in NGINX each request can only be served by a single worker. This results in <a href="/the-sad-state-of-linux-socket-balancing/">unbalanced load across all CPU cores</a>, which <a href="/keepalives-considered-harmful/">leads to slowness</a>.</p><p>Because of this request-process pinning effect, requests that do <a href="/the-problem-with-event-loops/">CPU heavy</a> or <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">blocking IO tasks</a> can slow down other requests. As those blog posts attest we’ve spent a lot of time working around these problems.</p><p>The most critical problem for our use cases is poor connection reuse. Our machines establish TCP connections to origin servers to proxy HTTP requests. Connection reuse speeds up TTFB (time-to-first-byte) of requests by reusing previously established connections from a connection pool, skipping TCP and TLS handshakes required on a new connection.</p><p>However, the <a href="https://www.nginx.com/blog/load-balancing-with-nginx-plus-part-2/">NGINX connection pool</a> is per worker. When a request lands on a certain worker, it can only reuse the connections within that worker. When we add more NGINX workers to scale up, our connection reuse ratio gets worse because the connections are scattered across more isolated pools of all the processes. This results in slower TTFB and more connections to maintain, which consumes resources (and money) for both us and our customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3a7AeWb20XQpmYKyQpv5fb/351efd47c3e67ab6e2d814007c10fad5/image2-4.png" />
            
            </figure><p>As mentioned in past blog posts, we have workarounds for some of these issues. But if we can address the fundamental issue: the worker/process model, we will resolve all these problems naturally.</p>
    <div>
      <h3>Some types of functionality are difficult to add</h3>
      <a href="#some-types-of-functionality-are-difficult-to-add">
        
      </a>
    </div>
    <p>NGINX is a very good web server, load balancer, or simple gateway. But Cloudflare does far more than that. We used to build all the functionality we needed around NGINX, which is hard to do while trying not to diverge too much from the upstream NGINX codebase.</p><p>For example, when <a href="/new-tools-to-monitor-your-server-and-avoid-downtime/">retrying/failing over</a> a request, sometimes we want to send the request to a different origin server with a different set of request headers. But that is not something NGINX allows us to do. In cases like this, we spend time and effort working around the NGINX constraints.</p><p>Meanwhile, the programming languages we had to work with didn’t help alleviate the difficulties. NGINX is written purely in C, which is not memory safe by design. Working with such a third-party code base is very error-prone. It is quite easy to run into <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">memory safety issues</a>, even for experienced engineers, and we wanted to avoid these as much as possible.</p><p>The other language we used to complement C is Lua. It is less risky but also less performant. In addition, we often found ourselves missing <a href="https://en.wikipedia.org/wiki/Type_system#Static_type_checking">static typing</a> when working with complicated Lua code and business logic.</p><p>Finally, the NGINX community is not very active, and development tends to happen <a href="https://dropbox.tech/infrastructure/how-we-migrated-dropbox-from-nginx-to-envoy">“behind closed doors”</a>.</p>
    <div>
      <h3>Choosing to build our own</h3>
      <a href="#choosing-to-build-our-own">
        
      </a>
    </div>
    <p>Over the past few years, as we’ve continued to grow our customer base and feature set, we continually evaluated three choices:</p><ol><li><p>Continue to invest in NGINX and possibly fork it to tailor it 100% to our needs. We had the expertise needed, but given the architecture limitations mentioned above, significant effort would be required to rebuild it in a way that fully supported our needs.</p></li><li><p>Migrate to another 3rd party proxy codebase. There are definitely good projects, like <a href="https://dropbox.tech/infrastructure/how-we-migrated-dropbox-from-nginx-to-envoy">envoy</a> and <a href="https://linkerd.io/2020/12/03/why-linkerd-doesnt-use-envoy/">others</a>. But this path means the same cycle may repeat in a few years.</p></li><li><p>Start with a clean slate, building an in-house platform and framework. This choice requires the most upfront investment in terms of engineering effort.</p></li></ol><p>We evaluated each of these options every quarter for the past few years. There is no obvious formula to tell which choice is best. For several years, we took the path of least resistance and continued to augment NGINX. However, at some point, the return on investment of building our own proxy seemed worth it. We made the call to build a proxy from scratch, and began designing the proxy application of our dreams.</p>
    <div>
      <h2>The Pingora Project</h2>
      <a href="#the-pingora-project">
        
      </a>
    </div>
    
    <div>
      <h3>Design decisions</h3>
      <a href="#design-decisions">
        
      </a>
    </div>
    <p>To make a proxy that serves millions of requests per second fast, efficient and secure, we have to make a few important design decisions first.</p><p>We chose <a href="https://www.rust-lang.org/">Rust</a> as the language of the project because it can do what C can do in a memory safe way without compromising performance.</p><p>Although there are some great off-the-shelf 3rd party HTTP libraries, such as <a href="https://github.com/hyperium/hyper">hyper</a>, we chose to build our own because we want to maximize the flexibility in how we handle HTTP traffic and to make sure we can innovate at our own pace.</p><p>At Cloudflare, we handle traffic across the entire Internet. We have many cases of bizarre and non-RFC compliant HTTP traffic that we have to support. This is a common dilemma across the HTTP community and web, where there is tension between strictly following HTTP specifications and accommodating the nuances of a wide ecosystem of potentially legacy clients or servers. Picking one side can be a tough job.</p><p>HTTP status codes are defined in <a href="https://www.rfc-editor.org/rfc/rfc9110.html#name-status-codes">RFC 9110 as a three digit integer</a>, and are generally expected to be in the range 100 through 599; hyper was one implementation that enforced this range. However, many servers support the use of status codes between 599 and 999. <a href="https://github.com/hyperium/http/issues/144">An issue</a> was created for this feature, which explored various sides of the debate. While the hyper team did ultimately accept that change, there would have been valid reasons for them to reject such an ask, and this was only one of many cases of noncompliant behavior we needed to support.</p><p>In order to satisfy the requirements of Cloudflare's position in the HTTP ecosystem, we needed a robust, permissive, customizable HTTP library that can survive the wilds of the Internet and support a variety of noncompliant use cases. 
The best way to guarantee that is to implement our own.</p><p>The next design decision was around our workload scheduling system. We chose multithreading over <a href="https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/#Inside-the-NGINX-Worker-Process">multiprocessing</a> in order to share resources, especially connection pools, easily. We also decided that <a href="https://en.wikipedia.org/wiki/Work_stealing">work stealing</a> was required to avoid some classes of performance problems mentioned above. The Tokio async runtime turned out to be <a href="https://tokio.rs/blog/2019-10-scheduler">a great fit</a> for our needs.</p><p>Finally, we wanted our project to be intuitive and developer friendly. What we build is not the final product, and should be extensible as a platform as more features are built on top of it. We decided to implement a “life of a request” event based programmable interface <a href="https://openresty-reference.readthedocs.io/en/latest/Directives/">similar to NGINX/OpenResty</a>. For example, the “request filter” phase allows developers to run code to modify or reject the request when a request header is received. With this design, we can separate our business logic and generic proxy logic cleanly. Developers who previously worked on NGINX can easily switch to Pingora and quickly become productive.</p>
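To make the “life of a request” idea concrete, here is a sketch of what a phase-based filter interface can look like. The trait, types, and names below are invented for illustration and are not Pingora’s actual API:

```rust
use std::collections::HashMap;

/// Hypothetical request type carrying only headers, for brevity.
struct Request {
    headers: HashMap<String, String>,
}

/// What a phase handler tells the proxy to do next.
enum Action {
    Continue,     // proceed to the next phase
    Respond(u16), // short-circuit with this status code
}

/// One phase of the "life of a request". Default implementations mean a
/// module only overrides the phases it cares about; the generic proxy
/// logic stays separate from the business logic.
trait ProxyFilter {
    /// Runs when the request header is received; may modify or reject.
    fn request_filter(&self, req: &mut Request) -> Action {
        let _ = req;
        Action::Continue
    }
}

/// Example business-logic module: reject requests missing a Host header.
struct RequireHost;

impl ProxyFilter for RequireHost {
    fn request_filter(&self, req: &mut Request) -> Action {
        if req.headers.contains_key("host") {
            Action::Continue
        } else {
            Action::Respond(400)
        }
    }
}

fn main() {
    let filter = RequireHost;

    let mut req = Request { headers: HashMap::new() };
    assert!(matches!(filter.request_filter(&mut req), Action::Respond(400)));

    req.headers.insert("host".into(), "example.com".into());
    assert!(matches!(filter.request_filter(&mut req), Action::Continue));
}
```

The appeal of this shape is exactly what the text describes: each feature hooks into a well-defined phase, so engineers used to NGINX/OpenResty phase handlers find the model familiar.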
    <div>
      <h2>Pingora is faster in production</h2>
      <a href="#pingora-is-faster-in-production">
        
      </a>
    </div>
    <p>Let’s fast-forward to the present. Pingora handles almost every HTTP request that needs to interact with an origin server (for a cache miss, for example), and we’ve collected a lot of performance data in the process.</p><p>First, let’s see how Pingora speeds up our customers’ traffic. Overall traffic on Pingora shows a 5ms reduction in median TTFB and an 80ms reduction at the 95th percentile. This is not because we run code faster. Even our old service could handle requests in the sub-millisecond range.</p><p>The savings come from our new architecture, which can share connections across all threads. This means a better connection reuse ratio and less time spent on TCP and TLS handshakes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1pLeU6tqq4NDdGqPRWBm7r/3e05c76314b83f174beebbc6b3263ae8/image3-3.png" />
            
            </figure><p>Across all customers, Pingora makes only a third as many new connections per second compared to the old service. For one major customer, it increased the connection reuse ratio from 87.1% to 99.92%, which reduced new connections to their origins by 160x. To present the number more intuitively, by switching to Pingora, we save our customers and users 434 years of handshake time every day.</p>
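The architectural difference behind these numbers can be modelled in a few lines: with one pool shared by every thread, a connection returned by any thread can be reused by any other, so the reuse ratio no longer degrades as workers are added. A minimal sketch (not Pingora’s implementation; connections are stand-in IDs):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

type Origin = String;

/// Simplified model of a connection pool shared by all worker threads,
/// in contrast to NGINX's per-process pools.
#[derive(Default)]
struct ConnectionPool {
    idle: Mutex<HashMap<Origin, Vec<u64>>>,
}

impl ConnectionPool {
    /// Return a finished connection to the shared pool.
    fn put(&self, origin: &str, conn_id: u64) {
        self.idle
            .lock()
            .unwrap()
            .entry(origin.to_string())
            .or_default()
            .push(conn_id);
    }

    /// Take an idle connection for this origin, if any thread left one.
    fn get(&self, origin: &str) -> Option<u64> {
        self.idle.lock().unwrap().get_mut(origin).and_then(|v| v.pop())
    }
}

fn main() {
    let pool = Arc::new(ConnectionPool::default());

    // One thread finishes with a connection and returns it to the pool...
    let p = Arc::clone(&pool);
    thread::spawn(move || p.put("origin.example", 42)).join().unwrap();

    // ...and a different thread reuses it, skipping a TCP/TLS handshake.
    let p = Arc::clone(&pool);
    let reused = thread::spawn(move || p.get("origin.example")).join().unwrap();
    assert_eq!(reused, Some(42));
}
```

In a per-process design, the second worker would have found its own pool empty and opened a brand-new connection; with the shared pool it does not.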
    <div>
      <h3>More features</h3>
      <a href="#more-features">
        
      </a>
    </div>
    <p>Having a developer-friendly interface that engineers are familiar with, while eliminating the previous constraints, allows us to develop more features more quickly. Core functionality like new protocols acts as a building block for more products we can offer to customers.</p><p>As an example, we were able to add HTTP/2 upstream support to Pingora without major hurdles. This allowed us to offer <a href="/road-to-grpc/">gRPC</a> to our customers shortly afterwards. Adding this same functionality to NGINX would have required <a href="https://mailman.nginx.org/pipermail/nginx-devel/2017-July/010357.html">significantly more engineering effort and might not have materialized</a>.</p><p>More recently we've announced <a href="/introducing-cache-reserve/">Cache Reserve</a>, where Pingora uses R2 storage as a caching layer. As we add more functionality to Pingora, we’re able to offer new products that weren’t feasible before.</p>
    <div>
      <h3>More efficient</h3>
      <a href="#more-efficient">
        
      </a>
    </div>
    <p>In production, Pingora consumes about 70% less CPU and 67% less memory compared to our old service with the same traffic load. The savings come from a few factors.</p><p>Our Rust code runs <a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/rust.html">more efficiently</a> compared to our old <a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/lua.html">Lua code</a>. On top of that, there are also efficiency differences from their architectures. For example, in NGINX/OpenResty, when the Lua code wants to access an HTTP header, it has to read it from the NGINX C struct, allocate a Lua string and then copy it to the Lua string. Afterwards, Lua has to garbage-collect its new string as well. In Pingora, it would just be a direct string access.</p><p>The multithreading model also makes sharing data across requests more efficient. NGINX also has shared memory but due to implementation limitations, every shared memory access has to use a mutex lock and only strings and numbers can be put into shared memory. In Pingora, most shared items can be accessed directly via shared references behind <a href="https://doc.rust-lang.org/std/sync/struct.Arc.html">atomic reference counters</a>.</p><p>Another significant portion of CPU saving, as mentioned above, is from making fewer new connections. TLS handshakes are expensive compared to just sending and receiving data via established connections.</p>
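The pattern described here, read-mostly state shared via atomic reference counting, can be shown with a few lines of standard Rust (a generic illustration, not Pingora code). Cloning the `Arc` bumps a counter; the data itself is never copied, and reads need no lock, in contrast to copying strings in and out of Lua:

```rust
use std::sync::Arc;
use std::thread;

/// Read-mostly state consulted by every request handler.
struct Config {
    blocked_path: String,
}

/// A request handler reads the shared config directly: no copy, no lock.
fn handle_request(cfg: &Config, path: &str) -> u16 {
    if path == cfg.blocked_path { 403 } else { 200 }
}

fn main() {
    let cfg = Arc::new(Config { blocked_path: "/admin".to_string() });

    let handles: Vec<_> = ["/", "/admin"]
        .iter()
        .map(|&path| {
            // Cheap: one atomic increment per clone, shared data untouched.
            let cfg = Arc::clone(&cfg);
            thread::spawn(move || handle_request(&cfg, path))
        })
        .collect();

    let codes: Vec<u16> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(codes, vec![200, 403]);
}
```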
    <div>
      <h3>Safer</h3>
      <a href="#safer">
        
      </a>
    </div>
    <p>Shipping features quickly and safely is difficult, especially at our scale. It's hard to predict every edge case that can occur in a distributed environment processing millions of requests a second. Fuzzing and static analysis can only mitigate so much. Rust's memory-safe semantics guard us from undefined behavior and give us confidence our service will run correctly.</p><p>With those assurances we can focus more on how a change to our service will interact with other services or a customer's origin. We can develop features at a higher cadence and not be burdened by memory safety and hard-to-diagnose crashes.</p><p>When crashes do occur, an engineer needs to spend time diagnosing how they happened and what caused them. Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.</p><p>In fact, Pingora crashes are so rare that when we do encounter one, we usually find an unrelated issue. Recently we discovered <a href="https://lkml.org/lkml/2022/3/15/6">a kernel bug</a> soon after our service started crashing. We've also discovered hardware issues on a few machines; in the past, ruling out rare memory bugs caused by our software was nearly impossible, even after significant debugging.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>To summarize, we have built an in-house proxy that is faster, more efficient, and more versatile as the platform for our current and future products.</p><p>We will be back with more technical details regarding the problems we faced, the optimizations we applied and the lessons we learned from building Pingora and rolling it out to power a significant portion of the Internet. We will also be back with our plan to open source it.</p><p>Pingora is our latest attempt at rewriting our system, but it won’t be our last. It is also only one of the building blocks in the re-architecting of our systems.</p><p>Interested in joining us to help build a better Internet? <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering">Our engineering teams are hiring</a>.</p>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Pingora]]></category>
            <guid isPermaLink="false">5nZ4Ad1lcQreuoYkJfV36b</guid>
            <dc:creator>Yuchen Wu</dc:creator>
            <dc:creator>Andrew Hauck</dc:creator>
        </item>
        <item>
            <title><![CDATA[Keepalives considered harmful]]></title>
            <link>https://blog.cloudflare.com/keepalives-considered-harmful/</link>
            <pubDate>Thu, 19 Mar 2020 12:57:35 GMT</pubDate>
            <description><![CDATA[ You’d think keepalives would always be helpful, but turns out reality isn’t always what you expect it to be. It really helps if you read Why does one NGINX worker take all the load? first. ]]></description>
            <content:encoded><![CDATA[ <p>This may sound like a weird title, but hear me out. You’d think keepalives would always be helpful, but turns out reality isn’t always what you expect it to be. It really helps if you read <a href="/the-sad-state-of-linux-socket-balancing/">Why does one NGINX worker take all the load?</a> first. This post is an adaptation of a rather old post on Cloudflare’s internal blog, so not all details are exactly as they are in production today but the lessons are still valid.</p><p>This is a story about how we were seeing some complaints about sporadic latency spikes, made some unconventional changes, and were able to slash the 99.9th latency percentile by 4x!</p>
    <div>
      <h3>Request flow on Cloudflare edge</h3>
      <a href="#request-flow-on-cloudflare-edge">
        
      </a>
    </div>
    <p>I'm going to focus only on two parts of our edge stack: FL and SSL.</p><ul><li><p>FL accepts plain HTTP connections and does the main request logic, including our WAF</p></li><li><p>SSL terminates SSL and passes connections to FL over a local Unix socket:</p></li></ul><p>Here’s a diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3UdCg5j3wBN8mKvkAEjDJ9/7b8133b0be95571376050695fcddcd16/Screen-Shot-2020-03-19-at-12.22.41.png" />
            
            </figure><p>These days we route all traffic through SSL for simplicity, but in the grand scheme of things it’s not going to matter much.</p><p>Each of these processes is not itself a single process, but rather a main process and a collection of workers that do actual processing. Another diagram for you to make this more visual:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1OI9zf8X8wBnNTaE0j94M2/1dc9fe6fcb05de08a90e81bc98747aeb/Screen-Shot-2020-03-19-at-12.22.55.png" />
            
            </figure>
    <div>
      <h3>Keepalives</h3>
      <a href="#keepalives">
        
      </a>
    </div>
    <p>Requests come over connections that are reused for performance reasons, which is sometimes referred to as “keepalive”. It's generally expensive for a client to open a new TCP connection, and our servers keep memory pools associated with connections that then need to be recycled. In fact, one of the mitigations we use against attack traffic is disabling keepalives for abusive clients, forcing them to reopen connections, which slows them down considerably.</p><p>To illustrate the usefulness of keepalives, here's me requesting <a href="http://example.com/">http://example.com/</a> from curl twice over the same connection:</p>
            <pre><code>$ curl -s -w "Time to connect: %{time_connect} Time to first byte: %{time_starttransfer}\n" -o /dev/null http://example.com/ -o /dev/null http://example.com/

Time to connect: 0.012108 Time to first byte: 0.018724
Time to connect: 0.000019 Time to first byte: 0.007391</code></pre>
            <p>The first request took 18.7ms, and of that, 12.1ms was spent establishing a new connection, which wasn't even a TLS one. When I sent another request over the same connection, I didn't have to pay that cost again, and it took just 7.3ms to service my request. That's a big win, which gets even bigger if you need to establish a brand new connection (especially if it’s not <a href="https://www.cloudflare.com/learning-resources/tls-1-3/">TLSv1.3</a> or <a href="/http3-the-past-present-and-future/">QUIC</a>). DNS was also cached in the example above, but it can be another cost for domains with a low TTL.</p><p>Keepalives tend to be used extensively, because they seem beneficial. Generally keepalives are enabled by default for this exact reason.</p>
    <div>
      <h3>Taking all the load</h3>
      <a href="#taking-all-the-load">
        
      </a>
    </div>
    <p>Due to how Linux works (see the first link in this blog post), most of the load goes to a few workers out of the pool. When that worker is not able to accept() a pending connection because it's busy processing requests, the connection spills to another worker. The process cascades until some worker is ready to pick it up.</p><p>This leaves us with a ladder of load spread between workers:</p>
            <pre><code>               CPU%                                       Runtime
nobody    4254 51.2  0.8 5655600 1093848 ?     R    Aug23 3938:34 nginx: worker process
nobody    4257 47.9  0.8 5615848 1071612 ?     S    Aug23 3682:05 nginx: worker process
nobody    4253 43.8  0.8 5594124 1069424 ?     R    Aug23 3368:27 nginx: worker process
nobody    4255 39.4  0.8 5573888 1070272 ?     S    Aug23 3030:01 nginx: worker process
nobody    4256 36.2  0.7 5556700 1052560 ?     R    Aug23 2784:23 nginx: worker process
nobody    4251 33.1  0.8 5563276 1063700 ?     S    Aug23 2545:07 nginx: worker process
nobody    4252 29.2  0.8 5561232 1058748 ?     S    Aug23 2245:59 nginx: worker process
nobody    4248 26.7  0.8 5554652 1057288 ?     S    Aug23 2056:19 nginx: worker process
nobody    4249 24.5  0.7 5537276 1043568 ?     S    Aug23 1883:18 nginx: worker process
nobody    4245 22.5  0.7 5552340 1048592 ?     S    Aug23 1736:37 nginx: worker process
nobody    4250 20.7  0.7 5533728 1038676 ?     R    Aug23 1598:16 nginx: worker process
nobody    4247 19.6  0.7 5547548 1044480 ?     S    Aug23 1507:27 nginx: worker process
nobody    4246 18.4  0.7 5538104 1043452 ?     S    Aug23 1421:23 nginx: worker process
nobody    4244 17.5  0.7 5530480 1035264 ?     S    Aug23 1345:39 nginx: worker process
nobody    4243 16.6  0.7 5529232 1024268 ?     S    Aug23 1281:55 nginx: worker process
nobody    4242 16.6  0.7 5537956 1038408 ?     R    Aug23 1278:40 nginx: worker process</code></pre>
            <p>The third column is instantaneous CPU%; the time after the date is total on-CPU time. If you look at the same processes and count their open sockets, you'll see this (using the same order of processes as above):</p>
            <pre><code>4254 2357
4257 1833
4253 2180
4255 1609
4256 1565
4251 1519
4252 1175
4248 1065
4249 1056
4245 1201
4250 886
4247 908
4246 968
4244 1180
4243 867
4242 884</code></pre>
            <p>More load corresponds to more open sockets. More open sockets generate more load to serve these connections. It’s a vicious circle.</p>
    <div>
      <h3>The twist</h3>
      <a href="#the-twist">
        
      </a>
    </div>
    <p>Now that we have all these workers holding onto these connections, requests over those connections are also, in a way, pinned to workers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3kvLGjxkG4Rb5l5n7cyawp/12bcf6fc7e9a06967bdbb0323019a5f9/Screen-Shot-2020-03-19-at-12.23.16.png" />
            
            </figure><p>In FL we're doing some things that are <i>somewhat</i> compute-intensive, which means that some workers can be busy for a <i>somewhat</i> prolonged period of time while other workers sit idle, increasing latency for requests. Ideally we want to always hand a request over to a worker that's idle, because that maximizes our chances of not being blocked, even if it takes a few event loop iterations to serve a request (meaning that we may still block down the road).</p><p>One part of dealing with the issue is <a href="/the-problem-with-event-loops/">offloading some of the compute-intensive parts</a> of the request processing into a thread pool, but that hadn’t happened at that point. We had to do something else in the meantime.</p><p>Clients have no way of knowing that their worker is busy and that they should probably ask another worker to serve the connection. For clients over a real network this doesn't even make sense, since by the time their request reaches NGINX, up to tens of milliseconds later, the situation will probably be different.</p><p>But our clients are not on a long-haul network! They are local, and they are SSL, connecting over a Unix socket. It's not that expensive to reopen a new connection for each request, and what it buys us is the ability to pass a fully formed request, already buffered in SSL, into FL:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3LnukCCSMEeSewRhJdI6aA/f11fac45b1b251f21511a53796c96321/Screen-Shot-2020-03-19-at-12.23.28.png" />
            
            </figure><p>Two key points here:</p><ul><li><p>The request is always picked up by an idle worker</p></li><li><p>The request is readily available for potentially compute intensive processing</p></li></ul><p>The former is the most important part.</p>
    <div>
      <h3>Validating the hypothesis</h3>
      <a href="#validating-the-hypothesis">
        
      </a>
    </div>
    <p>To validate this assumption, I wrote the following program:</p>
            <pre><code>package main
 
import (
    "flag"
    "io/ioutil"
    "log"
    "net/http"
    "os"
    "time"
)
 
func main() {
    u := flag.String("url", "", "url to request")
    p := flag.Duration("pause", 0, "pause between requests")
    t := flag.Duration("threshold", 0, "threshold for reporting")
    c := flag.Int("count", 10000, "number of requests to send")
    n := flag.Int("close", 1, "close connection after every that many requests")
 
    flag.Parse()
 
    if *u == "" {
        flag.PrintDefaults()
        os.Exit(1)
    }
 
    client := http.Client{}
 
    for i := 0; i &lt; *c; i++ {
        started := time.Now()
 
        request, err := http.NewRequest("GET", *u, nil)
        if err != nil {
            log.Fatalf("Error constructing request: %s", err)
        }
 
        if i%*n == 0 {
            request.Header.Set("Connection", "Close")
        }
 
        response, err := client.Do(request)
        if err != nil {
            log.Fatalf("Error performing request: %s", err)
        }
 
        _, err = ioutil.ReadAll(response.Body)
        if err != nil {
            log.Fatalf("Error reading request body: %s", err)
        }
 
        response.Body.Close()
 
        elapsed := time.Since(started)
        if elapsed &gt; *t {
            log.Printf("Request %d took %dms", i, int(elapsed.Seconds()*1000))
        }
 
        time.Sleep(*p)
    }
}</code></pre>
            <p>The program requests a given URL in a loop and recycles the connection after every N requests (the <code>-close</code> flag), pausing for a short time between requests and reporting any request slower than the threshold.</p><p>If we close the connection after every request:</p>
            <pre><code>$ go run /tmp/main.go -url http://test.domain/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 1
2018/08/24 23:42:34 Request 453 took 32ms
2018/08/24 23:42:38 Request 1044 took 24ms
2018/08/24 23:43:00 Request 4106 took 83ms
2018/08/24 23:43:12 Request 5778 took 27ms
2018/08/24 23:43:16 Request 6292 took 27ms
2018/08/24 23:43:20 Request 6856 took 21ms
2018/08/24 23:43:32 Request 8578 took 45ms
2018/08/24 23:43:42 Request 9938 took 22ms</code></pre>
            <p>We request an endpoint that's served in FL by Lua, so seeing any slow requests is unfortunate. There's an element of luck in this game, and our program gets no cooperation from eyeballs or SSL, so some slowness is expected.</p><p>Now, if we start closing the connection only after every other request, the situation immediately gets a lot worse:</p>
            <pre><code>$ go run /tmp/main.go -url http://teste1.cfperf.net/cdn-cgi/trace -count 10000 -pause 5ms -threshold 20ms -close 2
2018/08/24 23:43:51 Request 162 took 22ms
2018/08/24 23:43:51 Request 220 took 21ms
2018/08/24 23:43:53 Request 452 took 23ms
2018/08/24 23:43:54 Request 540 took 41ms
2018/08/24 23:43:54 Request 614 took 23ms
2018/08/24 23:43:56 Request 900 took 40ms
2018/08/24 23:44:02 Request 1705 took 21ms
2018/08/24 23:44:03 Request 1850 took 27ms
2018/08/24 23:44:03 Request 1878 took 36ms
2018/08/24 23:44:08 Request 2470 took 21ms
2018/08/24 23:44:11 Request 2926 took 22ms
2018/08/24 23:44:14 Request 3350 took 37ms
2018/08/24 23:44:14 Request 3404 took 21ms
2018/08/24 23:44:16 Request 3598 took 32ms
2018/08/24 23:44:16 Request 3606 took 22ms
2018/08/24 23:44:19 Request 4026 took 33ms
2018/08/24 23:44:20 Request 4250 took 74ms
2018/08/24 23:44:22 Request 4483 took 20ms
2018/08/24 23:44:23 Request 4572 took 21ms
2018/08/24 23:44:23 Request 4644 took 23ms
2018/08/24 23:44:24 Request 4758 took 63ms
2018/08/24 23:44:25 Request 4808 took 39ms
2018/08/24 23:44:30 Request 5496 took 28ms
2018/08/24 23:44:31 Request 5736 took 88ms
2018/08/24 23:44:32 Request 5845 took 43ms
2018/08/24 23:44:33 Request 5988 took 52ms
2018/08/24 23:44:34 Request 6042 took 26ms
2018/08/24 23:44:34 Request 6049 took 23ms
2018/08/24 23:44:40 Request 6872 took 86ms
2018/08/24 23:44:40 Request 6940 took 23ms
2018/08/24 23:44:40 Request 6964 took 23ms
2018/08/24 23:44:44 Request 7532 took 32ms
2018/08/24 23:44:49 Request 8224 took 22ms
2018/08/24 23:44:49 Request 8234 took 29ms
2018/08/24 23:44:51 Request 8536 took 24ms
2018/08/24 23:44:55 Request 9028 took 22ms
2018/08/24 23:44:55 Request 9050 took 23ms
2018/08/24 23:44:55 Request 9092 took 26ms
2018/08/24 23:44:57 Request 9330 took 25ms
2018/08/24 23:45:01 Request 9962 took 48ms</code></pre>
            <p>If we close the connection only after every 5 requests, the number of slow responses almost doubles. This is counter-intuitive: keepalives are supposed to help with latency, not make it worse!</p>
    <div>
      <h3>Trying this in the wild</h3>
      <a href="#trying-this-in-the-wild">
        
      </a>
    </div>
    <p>To see how this works out in the real world, we disabled keepalives between SSL and FL in one location, forcing SSL to send every request over a separate connection (remember: cheap local Unix socket). Here’s how our probes toward that location reacted to this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6sRZtAWZEJFfbhGrKwPHdO/4b6a3029b777679a7c99f3ab7c1a274c/image5-6.png" />
            
            </figure><p>Here’s cumulative wait time between SSL and FL:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GJzrOfymF20pcJJ3Ey41a/1f39c8467dea590feebcb218b4814d7c/image3-3.png" />
            
            </figure><p>And finally 99.9th percentile of the same measurement:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zP0HjPLzGKswV6w4CP2S/cb415b27016522a94a12195a5fcf8c54/image1-8.png" />
            
            </figure><p>This is a huge win.</p><p>Another telling graph is comparing our average “edge processing time” (which includes WAF) in a test location to a global value:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4n6MYPoNl0vo1y5BVKIKcp/0a1a906052ca51ff19363b3ade6bb8fc/image2-5.png" />
            
            </figure><p>We reduced unnecessary wait time due to an imbalance without increasing the CPU load, which directly translates into improved user experience and lower resource consumption for us.</p>
    <div>
      <h3>The downsides</h3>
      <a href="#the-downsides">
        
      </a>
    </div>
    <p>There has to be a downside from this, right? The problem we introduced is that CPU imbalance between individual CPU cores went up:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3MSawWBrDEpUIbWHgWpGut/72484ee745fd1620f617d5cf8c124a68/image4-3.png" />
            
            </figure><p>Overall CPU usage did not change, just the distribution of it. We already know ways of dealing with this: SO_REUSEPORT, or <a href="https://lkml.org/lkml/2015/2/17/518">EPOLLROUNDROBIN</a>, which avoids some of the drawbacks of SO_REUSEPORT (which does not work for a Unix socket, for example) but requires a patched kernel. If we combine disabled keepalives with the EPOLLROUNDROBIN change, we can see the CPUs allocated to FL converge nicely in their utilization:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zN1XbJxUeCeVx2fyDYljN/fdef9e0f2810338b90ea094b1050a879/image6-3.png" />
            
            </figure><p>We’ve tried different combinations, and having both EPOLLROUNDROBIN and disabled keepalives worked best. Having just one of them is not as beneficial for lowering latency.</p>
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>We disabled keepalives between SSL and FL running on the same box, and this greatly improved the tail latency caused by requests landing on non-optimal FL workers. It was an unexpected fix, but it worked, and we are able to explain it.</p><p>This doesn’t mean that you should go and disable keepalives everywhere. Generally keepalives are great and should stay enabled, but in our case the latency of establishing a local connection is much lower than the delay we can incur by landing on a busy worker.</p><p>In practice this means that we can run our machines hotter and not see latency rise as much as it did before. Imagine moving the CPU cap from 50% to 80% with no effect on latency. The numbers are arbitrary, but the idea holds. Running hotter means fewer machines can serve the same amount of traffic, reducing our overall footprint.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">5TXV22jMUsRHWon2QS3kJV</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[The problem with thread^W event loops]]></title>
            <link>https://blog.cloudflare.com/the-problem-with-event-loops/</link>
            <pubDate>Wed, 18 Mar 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Back when Cloudflare was created, the dominant HTTP server used to power websites was Apache httpd. However, we decided to build our infrastructure using the then relatively new NGINX server. ]]></description>
<content:encoded><![CDATA[ <p>Back when Cloudflare was created, over 10 years ago now, the <a href="https://news.netcraft.com/archives/category/web-server-survey/">dominant HTTP server</a> used to power websites was <a href="https://httpd.apache.org/">Apache httpd</a>. However, we decided to build our infrastructure using the then relatively new NGINX server.</p><p>There are many differences between the two, but crucially for us, the event loop architecture of NGINX was the key differentiator. In a nutshell, event loops work around the need to have one thread or process per connection by coalescing many of them into a single process. This reduces the need for expensive context switching by the operating system and also keeps memory usage predictable. It is done by processing each connection until it needs to do some I/O; at that point the connection is queued until the I/O task completes. During that time the event loop is available to process other in-flight connections, accept new clients, and the like. The loop uses a multiplexing system call like <a href="http://man7.org/linux/man-pages/man7/epoll.7.html">epoll</a> (or kqueue) to be notified whenever an I/O task completes on any of the running connections.</p><p>In this article we will see that despite their advantages, event loop models have their limits, and falling back to a good old threaded architecture is sometimes a good move.</p><p><a href="http://staging.blog.mrk.cfdata.org/content/images/2020/03/3649485942_09f777897d_o.jpg"><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/03/3649485942_09f777897d_o.jpg" /></a>
<br /><a href="https://www.flickr.com/photos/mpd01605/3649485942">Photo</a> credit: Flickr CC-BY-SA</p>
    <div>
      <h3>The key assumption of an event loop architecture</h3>
      <a href="#the-key-assumption-of-an-event-loop-architecture">
        
      </a>
    </div>
    <p>For an event loop to work correctly, there is one key requirement that has to hold: <b>every piece of work has to finish quickly</b>. This is because, as with other cooperative multitasking approaches, once a piece of work is started, there is no preemption mechanism.</p><p>For a proxy service like Cloudflare, this assumption works quite well, as we spend most of the time waiting for the client to send data or for the origin to answer. This model is also widely successful for web applications that spend most of their time waiting for the database or some other kind of RPC.</p><p>Let's look at a situation where two requests hit a Cloudflare server at the same time:</p><a href="http://staging.blog.mrk.cfdata.org/content/images/2020/03/evpool-sequence-diagram@2x-inblog.png"><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/03/evpool-sequence-diagram@2x-inblog.png" /></a><p>In this case, requests don't block each other too much… However, if one of the work units takes an unreasonable amount of time, it will start blocking other requests and the whole model will fall apart very quickly.</p><p>Such long operations might be CPU-intensive tasks or blocking system calls. A common mistake is to call a library that uses blocking system calls internally: in this case an event-based program will perform a lot worse than a threaded one.</p>
    <div>
      <h3>Issues with the Cloudflare workload</h3>
      <a href="#issues-with-the-cloudflare-workload">
        
      </a>
    </div>
    <p>As I said previously, most of the work done by Cloudflare servers to process an HTTP request is quick and fits the event loop model well. However, a few places might require more CPU. Most notably, our <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">Web Application Firewall (WAF)</a> is in charge of inspecting incoming requests to look for potential threats.</p><p>Although this process only takes a few milliseconds in the large majority of cases, a tiny portion of requests might need more time. It might be tempting to say that this is rare enough to be ignored. However, a typical worker process can have hundreds of requests in flight at the same time, which means that halting the event loop could slow down any of these unrelated requests as well. Keep in mind that the median web page requires around <a href="https://httparchive.org/reports/state-of-the-web#reqTotal">70 requests</a> to fully load, and pages well over 100 assets are common. So looking at average metrics <a href="https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/">is not very useful</a> in this case; the 99th percentile is more relevant, as a regular page load will likely hit that slow case at some point.</p><p>We want to make the web as fast as possible, even in edge cases. So we started to think about solutions to remove or mitigate that delay. There are quite a few options:</p><ul><li><p>Increase the number of worker processes: this merely mitigates the problem and creates more pressure on the kernel. Worse, spawning more workers does not cause a linear increase in capacity (even with spare CPU cores), because it also means that critical sections of code will see more lock contention.</p></li><li><p>Create a separate service dedicated to the WAF: this is a more realistic solution, as a thread-based model makes more sense for CPU-intensive tasks and a separate process allows that. However, it would make migration from the existing codebase more difficult, and adding more IPC also has costs: serialization/deserialization, latency, more error cases, etc.</p></li><li><p>Offload CPU-intensive tasks to a thread pool: this is a hybrid approach where we could just hand over the WAF processing to a dedicated thread. There are still some costs to doing that, but overall using a thread pool is a lot faster and simpler than calling an external service. Moreover, we keep roughly the same code; we just call it differently.</p></li></ul><p>All of these solutions are valid, and this list is far from exhaustive. As in many situations, there is no one right solution: we have to weigh the different tradeoffs. In this case, given that we already had working code in NGINX, and that the process must be as quick as possible, we chose the thread pool approach.</p>
    <div>
      <h3>NGINX already has thread pools!</h3>
      <a href="#nginx-already-has-thread-pools">
        
      </a>
    </div>
    <p>Okay, I omitted one detail: NGINX already takes this hybrid approach for other reasons, which made our task easier.</p><p>It is not always easy to avoid blocking system calls. Filesystem operations on Linux are <a href="/io_submit-the-epoll-alternative-youve-never-heard-about/">famously known</a> to be tricky in asynchronous mode: among other limitations, files have to be read using "direct" mode, bypassing the filesystem cache. Quite annoying for a static file server<a href="#foot-iouring"><sup>[1]</sup></a>.</p><p>This is why <a href="http://nginx.org/en/docs/ngx_core_module.html#thread_pool">thread pools</a> were introduced back in 2015. The idea is quite simple: each worker process spawns a group of threads dedicated to processing these synchronous operations. We use (and have improved) that feature ourselves with <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">great success</a> for our caching layer.</p><p>Whenever the event loop wants such an operation performed, it pushes it into a queue that gets processed by the threads, and it is notified with a result when the operation is done.</p><p><a href="http://staging.blog.mrk.cfdata.org/content/images/2020/03/imageLikeEmbed--1--3.png"><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/03/imageLikeEmbed--1--3.png" /></a></p><p>This approach is not unique to NGINX: <a href="http://docs.libuv.org/en/v1.x/threadpool.html#threadpool">libuv</a> (which powers node.js) also uses thread pools for filesystem operations.</p><p>Can we repurpose this system to offload our CPU-intensive sections too? It turns out we can. The threading model here is quite simple: nearly nothing is shared with the main loop; only a <code>struct</code> describing the operation to perform is sent to a thread, and a result is sent back to the event loop<a href="#foot-threadbuf"><sup>[2]</sup></a>. This share-nothing approach also has some drawbacks, notably memory usage: in our case, every thread has its own Lua VM to run the WAF code and its own compiled regular expression cache. Some of this can be improved, but as our code was written assuming there were no data races, changing that assumption would require significant <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactoring</a>.</p>
    <div>
      <h3>The best of both worlds? Not quite yet.</h3>
      <a href="#the-best-of-both-worlds-not-quite-yet">
        
      </a>
    </div>
    <p>It's no secret that epoll is very difficult to use correctly; in the past my colleague Marek <a href="https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/">wrote</a> <a href="https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/">extensively</a> about its challenges. But more importantly in our case, its <a href="/the-sad-state-of-linux-socket-balancing/">load balancing issues</a> were a real problem.</p><p>In a nutshell, when multiple processes listen on the same socket, some processes end up busier than others. This is due to the fact that whenever an event loop sits idle, it is free to accept new connections.</p><p>Offloading the WAF to a thread pool frees up time on the event loop, which lets the unlucky processes accept even more connections, which in turn will need the WAF. In this case, making the change on its own would only make a bad situation worse: WAF tasks would start piling up in the job queue, waiting for the threads to process them.</p><p>Fortunately, this problem is quite well known, and even if there is no solution yet in the upstream kernel, people have already <a href="https://lkml.org/lkml/2015/2/17/518">attempted to fix it</a>. We applied that patch, adding the <code>EPOLLROUNDROBIN</code> flag to our kernels quite some time ago, and it played a crucial role in this case.</p>
    <div>
      <h3>That's a lot of text… Show us the numbers!</h3>
      <a href="#thats-a-lot-of-text-show-us-the-numbers">
        
      </a>
    </div>
    <p>Alright, that was a lot of talking; let's look at actual numbers. I will examine how our servers behaved before (baseline) and after offloading our WAF into thread pools<a href="#foot-metrics"><sup>[3]</sup></a>.</p><p>First let's have a look at the NGINX event loop itself. Did our change really improve the situation?</p><p>We have a metric telling us the maximum amount of time a request blocked the event loop during its processing. Let's look at its 99th percentile:</p><p><a href="http://staging.blog.mrk.cfdata.org/content/images/2020/03/pasted-image-0--1--1.png"><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/03/pasted-image-0--1--1.png" /></a></p><p>Using 100% as our baseline, we can see that this metric is 30% to 40% lower for requests using the WAF when it is offloaded (yellow line). This is quite expected, as we just offloaded a fair chunk of processing to a thread. For other requests (green line), the situation seems a tiny bit worse, which can be explained by the fact that the kernel has more threads to care about now, so the main event loop is more likely to be interrupted by the scheduler while it is running.</p><p>This is encouraging (at least for the requests with WAF enabled), but it doesn't really tell us the value for our customers, so let's look at more concrete metrics. First, the Time To First Byte (TTFB); the following graph only takes cache hits into account to reduce the noise from other moving parts.</p><p><a href="http://staging.blog.mrk.cfdata.org/content/images/2020/03/pasted-image-0--2--1.png"><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/03/pasted-image-0--2--1.png" /></a></p><p>The 99th percentile is significantly reduced for both WAF and non-WAF requests; overall, the gains of freeing up time on the event loop dwarf the slight penalty we saw on the previous graph.</p><p>Let's finish by looking at another metric: the TTFB above starts counting only once a request has been accepted by the server. But we can now expect requests to be accepted faster as well, since the event loop spends more time idle. Is that the case?</p><p><a href="http://staging.blog.mrk.cfdata.org/content/images/2020/03/pasted-image-0--3--1.png"><img src="http://staging.blog.mrk.cfdata.org/content/images/2020/03/pasted-image-0--3--1.png" /></a></p><p>Success! The accept latency is also a lot lower. Not only are the requests faster, but the server is able to start processing them more quickly as well.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Overall, event-loop-based processing makes a lot of sense in many situations, especially in a microservices world. But this model is particularly vulnerable in cases where a lot of CPU time is necessary. There are a few different ways to mitigate this issue and as is often the case, the right answer is "it depends". In our case the thread pool approach was the best tradeoff: not only did it give visible improvements for our customers but it also allowed us to spread the load across more CPUs so we can more efficiently use our hardware.</p><p>But this tradeoff has many variables. Despite being too long to comfortably run on an event loop, the WAF is still very quick. For more complex tasks, having a separate service is usually a better option in my opinion. And there are a ton of other factors to take into account: security, development and deployment processes, etc.</p><p>We also saw that despite looking slightly worse in some micro metrics, the overall performance improved. This is something we all have to keep in mind when working on complex systems.</p><p>Are you interested in debugging and solving problems involving userspace, kernel and hardware? <a href="https://www.cloudflare.com/careers/">We're hiring</a>!</p><hr /><p><sup>[1]</sup> hopefully this pain will go away over time as we now have <a href="https://lwn.net/Articles/776703">io_uring</a></p><p><sup>[2]</sup> we still have to make sure the buffers are safely shared to avoid use-after-free and other data races, copying is the safest approach, but not necessarily the fastest.</p><p><sup>[3]</sup> in every case, the metrics are taken over a period of 24h, exactly one week apart.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">13h1dXAZxY1KXW19XxG0BR</guid>
            <dc:creator>Julien Desgats</dc:creator>
        </item>
        <item>
            <title><![CDATA[Experiment with HTTP/3 using NGINX and quiche]]></title>
            <link>https://blog.cloudflare.com/experiment-with-http-3-using-nginx-and-quiche/</link>
            <pubDate>Thu, 17 Oct 2019 14:00:00 GMT</pubDate>
            <description><![CDATA[ Just a few weeks ago we announced the availability on our edge network of HTTP/3, the new revision of HTTP intended to improve security and performance on the Internet. Everyone can now enable HTTP/3 on their Cloudflare zone ]]></description>
<content:encoded><![CDATA[ <p></p><p>Just a few weeks ago <a href="/http3-the-past-present-and-future/">we announced</a> the availability on our edge network of <a href="https://www.cloudflare.com/learning/performance/what-is-http3/">HTTP/3</a>, the new revision of HTTP intended to improve security and performance on the Internet. Everyone can now enable HTTP/3 on their Cloudflare zone and experiment with it using <a href="/http3-the-past-present-and-future/#using-google-chrome-as-an-http-3-client">Chrome Canary</a> as well as <a href="/http3-the-past-present-and-future/#using-curl">curl</a>, among other clients.</p><p>We have previously made available <a href="https://github.com/cloudflare/quiche/blob/master/examples/http3-server.rs">an example HTTP/3 server as part of the quiche project</a> to allow people to experiment with the protocol, but it’s quite limited in the functionality that it offers, and was never intended to replace other general-purpose web servers.</p><p>We are now happy to announce that <a href="/enjoy-a-slice-of-quic-and-rust/">our implementation of HTTP/3 and QUIC</a> can be integrated into your own installation of NGINX as well. This is made available <a href="https://github.com/cloudflare/quiche/tree/master/extras/nginx">as a patch</a> to NGINX that can be applied to, and built directly against, the upstream NGINX codebase.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RGn7FpVUT1wQ1v5yu74c3/d6db309a2e2d99da184b3bbb123f3fb5/quiche-banner-copy_2x.png" />
            
            </figure><p>It’s important to note that <b>this is not officially supported or endorsed by the NGINX project</b>; it is just something that we, Cloudflare, want to make available to the wider community to help push adoption of QUIC and HTTP/3.</p>
    <div>
      <h3>Building</h3>
      <a href="#building">
        
      </a>
    </div>
    <p>The first step is to <a href="https://nginx.org/en/download.html">download and unpack the NGINX source code</a>. Note that the HTTP/3 and QUIC patch only works with the 1.16.x release branch (the latest stable release being 1.16.1).</p>
            <pre><code> % curl -O https://nginx.org/download/nginx-1.16.1.tar.gz
 % tar xvzf nginx-1.16.1.tar.gz</code></pre>
            <p>As well as quiche, the underlying implementation of HTTP/3 and QUIC:</p>
            <pre><code> % git clone --recursive https://github.com/cloudflare/quiche</code></pre>
            <p>Next you’ll need to apply the patch to NGINX:</p>
            <pre><code> % cd nginx-1.16.1
 % patch -p01 &lt; ../quiche/extras/nginx/nginx-1.16.patch</code></pre>
            <p>And finally build NGINX with HTTP/3 support enabled:</p>
            <pre><code> % ./configure                          	\
   	--prefix=$PWD                       	\
   	--with-http_ssl_module              	\
   	--with-http_v2_module               	\
   	--with-http_v3_module               	\
   	--with-openssl=../quiche/deps/boringssl \
   	--with-quiche=../quiche
 % make</code></pre>
            <p>The above command instructs the NGINX build system to enable the HTTP/3 support (<code>--with-http_v3_module</code>) by using the quiche library found in the path it was previously downloaded into (<code>--with-quiche=../quiche</code>), as well as TLS and HTTP/2. Additional build options can be added as needed.</p><p>You can check out the full instructions <a href="https://github.com/cloudflare/quiche/tree/master/extras/nginx#readme">here</a>.</p>
    <div>
      <h3>Running</h3>
      <a href="#running">
        
      </a>
    </div>
    <p>Once built, NGINX can be configured to accept incoming HTTP/3 connections by adding the <code>quic</code> and <code>reuseport</code> options to the <a href="https://nginx.org/en/docs/http/ngx_http_core_module.html#listen">listen</a> configuration directive.</p><p>Here is a minimal configuration example that you can start from:</p>
            <pre><code>events {
    worker_connections  1024;
}

http {
    server {
        # Enable QUIC and HTTP/3.
        listen 443 quic reuseport;

        # Enable HTTP/2 (optional).
        listen 443 ssl http2;

        ssl_certificate      cert.crt;
        ssl_certificate_key  cert.key;

        # Enable all TLS versions (TLSv1.3 is required for QUIC).
        ssl_protocols TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
        
        # Add Alt-Svc header to negotiate HTTP/3.
        add_header alt-svc 'h3-23=":443"; ma=86400';
    }
}</code></pre>
            <p>This will enable both HTTP/2 and HTTP/3 on the TCP/443 and UDP/443 ports respectively.</p><p>You can then use one of the available HTTP/3 clients (such as <a href="/http3-the-past-present-and-future/#using-google-chrome-as-an-http-3-client">Chrome Canary</a>, <a href="/http3-the-past-present-and-future/#using-curl">curl</a> or even the <a href="/http3-the-past-present-and-future/#using-quiche-s-http3-client">example HTTP/3 client provided as part of quiche</a>) to connect to your NGINX instance using HTTP/3.</p><p>We are excited to make this available for everyone to experiment and play with HTTP/3, but it’s important to note that <b>the implementation is still experimental</b> and it’s likely to have bugs as well as limitations in functionality. Feel free to submit a ticket to the <a href="https://github.com/cloudflare/quiche">quiche project</a> if you run into problems or find any bug.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[Chrome]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[HTTP3]]></category>
            <guid isPermaLink="false">2M0hyPXVNiYWjSUGQRypv2</guid>
            <dc:creator>Alessandro Ghedini</dc:creator>
        </item>
        <item>
            <title><![CDATA[NGINX structural enhancements for HTTP/2 performance]]></title>
            <link>https://blog.cloudflare.com/nginx-structural-enhancements-for-http-2-performance/</link>
            <pubDate>Wed, 22 May 2019 17:14:56 GMT</pubDate>
            <description><![CDATA[ My team, the Cloudflare Protocols team, is responsible for terminating HTTP traffic at the edge of the Cloudflare network. We deal with features related to TCP, QUIC, TLS, secure certificate management, HTTP/1 and HTTP/2. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h3>Introduction</h3>
      <a href="#introduction">
        
      </a>
    </div>
    <p>My team, the Cloudflare Protocols team, is responsible for terminating HTTP traffic at the edge of the Cloudflare network. We deal with features related to TCP, QUIC, TLS, secure certificate management, HTTP/1 and HTTP/2. Over Q1, we were responsible for implementing the <a href="/better-http-2-prioritization-for-a-faster-web/">Enhanced HTTP/2 Prioritization</a> product that Cloudflare announced during Speed Week.</p><p>This is a very exciting project to be part of, and doubly exciting to see the results of, but during the course of the project we had a number of interesting realisations about NGINX, the HTTP-oriented server onto which Cloudflare currently deploys its software infrastructure. We quickly became certain that our Enhanced HTTP/2 Prioritization project could not achieve even moderate success if the internal workings of NGINX were not changed.</p><p>Due to these realisations, we embarked upon a number of significant changes to the internal structure of NGINX in parallel with the work on the core prioritization product. This blog post describes the motivation behind the structural changes, how we approached them, and what impact they had. We also identify additional changes that we plan to add to our roadmap, which we hope will improve performance further.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
    <p>Enhanced HTTP/2 Prioritization aims to do one thing to web traffic flowing between a client and a server: it provides a means to shape the many HTTP/2 streams as they flow from upstream (server or origin side) into a single HTTP/2 connection that flows downstream (client side).</p><p>Enhanced HTTP/2 Prioritization allows site owners and the Cloudflare edge systems to dictate the rules about how various objects should combine into the single HTTP/2 connection: whether a particular object should have priority and dominate that connection and reach the client as soon as possible, or whether a group of objects should evenly share the capacity of the connection and put more emphasis on parallelism.</p><p>As a result, Enhanced HTTP/2 Prioritization allows site owners to tackle two problems that exist between a client and a server: how to control precedence and ordering of objects, and: how to make the best use of a limited connection resource, which may be constrained by a number of factors such as bandwidth, volume of traffic and CPU workload at the various stages on the path of the connection.</p>
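<p>To make the two extremes concrete, here is a toy frame-scheduler sketch (my own illustration, not Cloudflare's or NGINX's actual code) contrasting one stream dominating the connection against streams sharing its capacity evenly:</p>

```javascript
// Toy sketch of the two prioritization extremes described above:
// strict priority (one stream dominates the connection) versus
// round-robin (streams share capacity evenly). Illustrative only.

// Each stream is a queue of frames, tagged with a priority (lower = more urgent).
function strictPriority(streams) {
  const out = [];
  const sorted = [...streams].sort((a, b) => a.priority - b.priority);
  for (const s of sorted) out.push(...s.frames); // drain urgent streams first
  return out;
}

function roundRobin(streams) {
  const out = [];
  const queues = streams.map(s => [...s.frames]);
  while (queues.some(q => q.length)) {
    for (const q of queues) if (q.length) out.push(q.shift()); // interleave
  }
  return out;
}

const streams = [
  { priority: 1, frames: ["html-1", "html-2"] },
  { priority: 2, frames: ["img-1", "img-2"] },
];
console.log(strictPriority(streams).join(" ")); // html-1 html-2 img-1 img-2
console.log(roundRobin(streams).join(" "));     // html-1 img-1 html-2 img-2
```

The first policy gets a render-blocking object to the client as soon as possible; the second emphasises parallelism.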
    <div>
      <h3>What did we see?</h3>
      <a href="#what-did-we-see">
        
      </a>
    </div>
    <p>The key to prioritisation is being able to compare two or more HTTP/2 streams in order to determine which one’s frame is to go down the pipe next. The Enhanced HTTP/2 Prioritization project necessarily drew us into the core NGINX codebase, as our intention was to fundamentally alter the way that NGINX compared and queued HTTP/2 data frames as they were written back to the client.</p><p>Very early in the analysis phase, as we rummaged through the NGINX internals to survey the site of our proposed features, we noticed a number of shortcomings in the structure of NGINX itself, in particular how it moved data from upstream (server side) to downstream (client side) and how it temporarily stored (buffered) that data in its various internal stages. The main conclusion of our early analysis was that NGINX largely failed to give the stream data frames any 'proximity': either streams were processed in the NGINX HTTP/2 layer in isolated succession, or frames of different streams spent very little time in the same place (a shared queue, for example). The net effect was a reduction in the opportunities for useful comparison.</p><p>We coined a new, barely scientific but useful measurement, <b>Potential</b>, to describe how effectively the Enhanced HTTP/2 Prioritization strategies (or even the default NGINX prioritization) can be applied to queued data streams. Potential is not so much a measurement of the effectiveness of prioritization per se (that metric would come later in the project); rather, it measures the level of participation during the application of the algorithm. In simple terms, it considers the number of streams, and frames thereof, that are included in an iteration of prioritization, with more streams and more frames leading to higher Potential.</p><p>What we could see from early on was that, by default, NGINX displayed low Potential, rendering prioritization instructions from either the browser (as in the traditional HTTP/2 prioritization model) or from our Enhanced HTTP/2 Prioritization product fairly useless.</p>
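<p>In code terms, the Potential of a single prioritization iteration might be counted like this (a sketch of the informal metric described above, not code from our implementation):</p>

```javascript
// Sketch of the informal "Potential" metric: how many streams, and how
// many frames, participate in one iteration of prioritization. More
// participants means more useful comparisons, hence higher Potential.
function potential(queuedStreams) {
  const participating = queuedStreams.filter(s => s.frames.length > 0);
  const frameCount = participating.reduce((n, s) => n + s.frames.length, 0);
  return { streams: participating.length, frames: frameCount };
}

// Low Potential: streams processed in isolated succession, so only one
// stream's frames are ever queued at a time.
console.log(potential([{ frames: ["a"] }, { frames: [] }, { frames: [] }]));

// High Potential: frames of several streams queued together in proximity.
console.log(potential([{ frames: ["a", "b"] }, { frames: ["c"] }, { frames: ["d"] }]));
```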
    <div>
      <h3>What did we do?</h3>
      <a href="#what-did-we-do">
        
      </a>
    </div>
    <p>With the goal of improving the specific problems related to Potential, and also improving general throughput of the system, we identified some key pain points in NGINX. These points, which will be described below, have either been worked on and improved as part of our initial release of Enhanced HTTP/2 Prioritization, or have now branched out into meaningful projects of their own that we will put engineering effort into over the course of the next few months.</p>
    <div>
      <h3>HTTP/2 frame write queue reclamation</h3>
      <a href="#http-2-frame-write-queue-reclamation">
        
      </a>
    </div>
    <p>Write queue reclamation was successfully shipped with our release of Enhanced HTTP/2 Prioritization. Ironically, it wasn’t a change made to the original NGINX; it was in fact a change made against our Enhanced HTTP/2 Prioritization implementation when we were partway through the project, and it serves as a good example of something one might call conservation of data, which is a good way to increase Potential.</p><p>Similar to the original NGINX, our Enhanced HTTP/2 Prioritization algorithm will place a cohort of HTTP/2 data frames into a write queue as a result of an iteration of the prioritization strategies being applied to them. The contents of the write queue are destined to be written to the downstream TLS layer. Also similar to the original NGINX, the write queue may only be partially written to the TLS layer due to back-pressure from a network connection that has temporarily reached write capacity.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bBa68J0uK2BJvMi5ouTD5/92ce87c4a134c046206f2ac46ba55d2c/Write-Queue-Construction-Without-Reclamation.png" />
            
            </figure><p>Early on in our project, if the write queue was only partially written to the TLS layer, we would simply leave the frames in the write queue until the backlog was cleared, then we would re-attempt to write that data to the network in a future write iteration, just like the original NGINX.</p><p>The original NGINX takes this approach because the write queue is the <b>only</b> place that waiting data frames are stored. However, in our NGINX modified for Enhanced HTTP/2 Prioritization, we have a unique structure that the original NGINX lacks: per-stream data frame queues where we temporarily store data frames before our prioritization algorithms are applied to them.</p><p>We came to the realisation that in the event of a partial write, we were able to restore the unwritten frames back into their per-stream queues. If it was the case that a subsequent data cohort arrived behind the partially unwritten one, then the previously unwritten frames could participate in an additional round of prioritization comparisons, thus raising the Potential of our algorithms.</p><p>The following diagram illustrates this process:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7uEXuMETgrrFmtN84ZlQEL/52b63e9fd2401ec0c7c56cd442f1c14e/Write-Queue-Construction-With-Reclamation.png" />
            
            </figure><p>We were very pleased to ship Enhanced HTTP/2 Prioritization with the reclamation feature included, as this single enhancement greatly increased Potential and made up for the fact that we had to withhold the next enhancement for Speed Week due to its delicacy.</p>
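<p>The reclamation idea can be sketched in a few lines of JavaScript (an illustration of the mechanism described above, not the actual NGINX patch): on a partial write, the unwritten frames are pushed back into their per-stream queues so they can participate in the next round of prioritization comparisons:</p>

```javascript
// Sketch of write-queue reclamation. Only networkCapacity frames make it
// to the TLS layer before backpressure hits; the rest are restored to the
// front of their per-stream queues, preserving their original order.
function writeWithReclamation(writeQueue, perStreamQueues, networkCapacity) {
  const written = writeQueue.splice(0, networkCapacity); // partial write
  // Reclaim: return unwritten frames to their stream queues, back to front.
  while (writeQueue.length) {
    const frame = writeQueue.pop();
    perStreamQueues[frame.stream].unshift(frame);
  }
  return written;
}

const perStream = { a: [], b: [] };
const queue = [
  { stream: "a", id: 1 }, { stream: "b", id: 2 },
  { stream: "a", id: 3 }, { stream: "b", id: 4 },
];
const sent = writeWithReclamation(queue, perStream, 2);
console.log(sent.length);                            // 2 frames were written
console.log(perStream.a.length, perStream.b.length); // 1 1 (frames reclaimed)
```

When a subsequent data cohort arrives, the reclaimed frames are compared alongside it, raising Potential exactly as described above.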
    <div>
      <h3>HTTP/2 frame write event re-ordering</h3>
      <a href="#http-2-frame-write-event-re-ordering">
        
      </a>
    </div>
    <p>In Cloudflare infrastructure, we map the many streams of a single HTTP/2 connection from the eyeball to multiple HTTP/1.1 connections to the upstream Cloudflare control plane.</p><p>As a note: it may seem counterintuitive that we downgrade protocols like this, and doubly counterintuitive when I reveal that we also disable HTTP keepalive on these upstream connections, resulting in only one transaction per connection. However, this arrangement offers a number of advantages, particularly in the form of improved CPU workload distribution.</p><p>When NGINX monitors its upstream HTTP/1.1 connections for read activity, it may detect readability on many of those connections and process them all in a batch. However, within that batch, each of the upstream connections is processed sequentially, one at a time, from start to finish: from HTTP/1.1 connection read, to framing in the HTTP/2 stream, to HTTP/2 connection write to the TLS layer.</p><p>The existing NGINX workflow is illustrated in this diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69wml2jrdBLGZW1nMImkL4/9583420328f40b1e521d0f0a8412614e/Upstream-Read-Event.png" />
            
            </figure><p>By committing each stream’s frames to the TLS layer one stream at a time, many frames may pass entirely through the NGINX system before backpressure on the downstream connection allows the queue of frames to build up and provide an opportunity for these frames to be in proximity, where prioritization logic can be applied. This negatively impacts Potential and reduces the effectiveness of prioritization.</p><p>The Cloudflare Enhanced HTTP/2 Prioritization modified NGINX aims to re-arrange the internal workflow described above into the following model:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4d0xrrRBO4J3C1HCe4Hy8M/43b971a2a8592e5708f930b1c6bfddfb/Upstream-Read-Reordered.png" />
            
            </figure><p>Although we continue to frame upstream data into HTTP/2 data frames in separate iterations for each upstream connection, we no longer commit these frames to a single write queue within each iteration; instead, we arrange the frames into the per-stream queues described earlier. We then post a single event to the end of the per-connection iterations, and in that single event perform the prioritization, queuing and writing of the HTTP/2 data frames of all streams.</p><p>This single event finds the cohort of data conveniently stored in their respective per-stream queues, all in close proximity, which greatly increases the Potential of the Edge Prioritization algorithms.</p><p>In a form closer to actual code, the core of this modification looks a bit like this:</p>
            <pre><code>ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer)) {
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream-&gt;frames,
                               buffer);
    }

    ngx_http_v2_prioritise(h2_conn-&gt;queue,
                           h2_stream-&gt;frames);

    ngx_http_v2_write_queue(h2_conn-&gt;queue);
}</code></pre>
            <p>To this:</p>
            <pre><code>ngx_http_v2_process_data(ngx_http_v2_connection *h2_conn,
                         ngx_http_v2_stream *h2_stream,
                         ngx_buffer *buffer)
{
    while ( ! ngx_buffer_empty(buffer)) {
        ngx_http_v2_frame_data(h2_conn,
                               h2_stream-&gt;frames,
                               buffer);
    }

    ngx_list_add(h2_conn-&gt;active_streams, h2_stream);

    ngx_call_once_async(ngx_http_v2_write_streams, h2_conn);
}</code></pre>
            
            <pre><code>ngx_http_v2_write_streams(ngx_http_v2_connection *h2_conn)
{
    ngx_http_v2_stream *h2_stream;

    while ( ! ngx_list_empty(h2_conn-&gt;active_streams)) {
        h2_stream = ngx_list_pop(h2_conn-&gt;active_streams);

        ngx_http_v2_prioritise(h2_conn-&gt;queue,
                               h2_stream-&gt;frames);
    }

    ngx_http_v2_write_queue(h2_conn-&gt;queue);
}</code></pre>
            <p>There is a high level of risk in this modification: even though it is remarkably small, we are taking the well-established and debugged event flow in NGINX and switching it around to a significant degree. Like taking a number of Jenga pieces out of the tower and placing them in another location, we risk race conditions, event misfires, and event black holes leading to lockups during transaction processing.</p><p>Because of this level of risk, we did <b>not</b> release this change in its entirety during Speed Week, but we will continue to test and refine it for future release.</p>
    <div>
      <h3>Upstream buffer partial re-use</h3>
      <a href="#upstream-buffer-partial-re-use">
        
      </a>
    </div>
    <p>NGINX has an internal buffer region to store connection data it reads from upstream. To begin with, the entirety of this buffer is <b>Ready</b> for use. When data is read from upstream into the Ready buffer, the part of the buffer that holds the data is passed to the downstream HTTP/2 layer. Since HTTP/2 takes responsibility for that data, that portion of the buffer is marked as <b>Busy</b>, and it will remain Busy for as long as it takes the HTTP/2 layer to write the data into the TLS layer, a process that may take some time (in computer terms!).</p><p>During this gulf of time, the upstream layer may continue to read more data into the remaining Ready sections of the buffer and continue to pass that incremental data to the HTTP/2 layer until there are no Ready sections available.</p><p>When Busy data is finally finished in the HTTP/2 layer, the buffer space that contained that data is marked as <b>Free</b>.</p><p>The process is illustrated in this diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rpLV7aUoGwAE6KcQmEts5/9bfa13e2ef08845172e915f01354dc63/Upstream-Buffer-Current-1.png" />
            
            </figure><p>You may ask: When the leading part of the upstream buffer is marked as Free (in blue in the diagram), even though the trailing part of the upstream buffer is still Busy, can the Free part be re-used for reading more data from upstream?</p><p>The answer to that question is: <b>NO</b></p><p>Because just a small part of the buffer is still Busy, NGINX will refuse to allow any of the entire buffer space to be re-used for reads. Only when the entirety of the buffer is Free, can the buffer be returned to the Ready state and used for another iteration of upstream reads. So in summary, data can be read from upstream into Ready space at the tail of the buffer, but not into Free space at the head of the buffer.</p><p>This is a shortcoming in NGINX and is clearly undesirable as it interrupts the flow of data into the system. We asked: what if we could cycle through this buffer region and re-use parts at the head as they became Free? We seek to answer that question in the near future by testing the following buffering model in NGINX:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5WkYGqHYhndsb6dYFqM7TT/6cf0dfe91b09ce205f512347d3427255/Upstream-Buffer-Improved.png" />
            
            </figure>
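<p>The partial re-use model we intend to test can be sketched as a ring buffer (my own illustration with hypothetical names, not NGINX code): space freed at the head immediately becomes available for new upstream reads, even while a later region is still Busy:</p>

```javascript
// Ring-buffer sketch of the partial re-use model: freed head space is
// re-usable for upstream reads at once, instead of waiting for the whole
// buffer to drain back to the Ready state.
class RingBuffer {
  constructor(size) {
    this.size = size;
    this.head = 0; // next Busy byte awaiting release by the HTTP/2 layer
    this.tail = 0; // next free byte available for upstream reads
    this.used = 0; // bytes currently Busy
  }
  // Upstream read: claim up to n bytes of available space (may wrap).
  read(n) {
    const claimed = Math.min(n, this.size - this.used);
    this.tail = (this.tail + claimed) % this.size;
    this.used += claimed;
    return claimed;
  }
  // The HTTP/2 layer finished with n bytes at the head: mark them Free,
  // immediately re-usable by the next read().
  release(n) {
    const freed = Math.min(n, this.used);
    this.head = (this.head + freed) % this.size;
    this.used -= freed;
    return freed;
  }
}

const buf = new RingBuffer(10);
buf.read(10);             // buffer entirely Busy
buf.release(4);           // head 4 bytes become Free...
console.log(buf.read(4)); // ...and can be re-used at once: prints 4
```

In the current NGINX model, that final read would return 0 until the trailing Busy region was also released.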
    <div>
      <h3>TLS layer Buffering</h3>
      <a href="#tls-layer-buffering">
        
      </a>
    </div>
    <p>On a number of occasions in the above text, I have mentioned the TLS layer, and how the HTTP/2 layer writes data into it. In the OSI network model, TLS sits just below the protocol (HTTP/2) layer, and in many consciously designed networking software systems such as NGINX, the software interfaces are separated in a way that mimics this layering.</p><p>The NGINX HTTP/2 layer will collect the current cohort of data frames and place them in priority order into an output queue, then submit this queue to the TLS layer. The TLS layer makes use of a per-connection buffer to collect HTTP/2 layer data before performing the actual cryptographic transformations on that data.</p><p>The purpose of the buffer is to give the TLS layer a more meaningful quantity of data to encrypt, for if the buffer was too small, or the TLS layer simply relied on the units of data from the HTTP/2 layer, then the overhead of encrypting and transmitting the multitude of small blocks may negatively impact system throughput.</p><p>The following diagram illustrates this undersize buffer situation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/21JR00rWD6FPowvk2nMNRs/79712a865fa021954e8c37f187dac002/TLS-Layer-Buffering-Undersize.png" />
            
            </figure><p>If the TLS buffer is too big, then an excessive amount of HTTP/2 data will be committed to encryption, and if it fails to write to the network due to backpressure, it will be locked into the TLS layer and unavailable to return to the HTTP/2 layer for the reclamation process, thus reducing the effectiveness of reclamation. The following diagram illustrates this oversize buffer situation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7wwWnqkWaJ0CpbGLCQWp1l/7e18f9e66c0a7ee9ed83357f52db9398/TLS-Layer-Buffering-Oversize.png" />
            
            </figure><p>In the coming months, we will embark on a process to attempt to find the ‘goldilocks’ spot for TLS buffering: To size the TLS buffer so it is big enough to maintain efficiency of encryption and network writes, but not so big as to reduce the responsiveness to incomplete network writes and the efficiency of reclamation.</p>
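<p>The trade-off can be sketched with a toy flush-threshold model (illustrative only; the function and field names are made up for this sketch):</p>

```javascript
// Toy model of TLS-layer buffering: frames accumulate until a threshold
// is reached, then the buffer is flushed to encryption and the network.
// A tiny threshold means many small encrypt/write operations (overhead);
// a huge threshold means more data committed to the TLS layer at once,
// which is lost to reclamation if the network write is only partial.
function simulateTlsBuffer(frameSizes, flushThreshold) {
  let buffered = 0;
  let flushes = 0;
  for (const size of frameSizes) {
    buffered += size;
    if (buffered >= flushThreshold) { // enough data to encrypt efficiently
      flushes++;
      buffered = 0;
    }
  }
  return { flushes, pending: buffered }; // pending awaits the next flush
}

const frames = [1200, 1200, 1200, 1200, 1200, 1200];
// Tiny threshold: one encrypt+write per frame (high per-record overhead).
console.log(simulateTlsBuffer(frames, 1).flushes);    // 6
// Huge threshold: a single big flush, more data at risk under backpressure.
console.log(simulateTlsBuffer(frames, 7200).flushes); // 1
```

The "goldilocks" search described above amounts to picking a threshold between these two extremes.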
    <div>
      <h3>Thank you - Next!</h3>
      <a href="#thank-you-next">
        
      </a>
    </div>
    <p>The Enhanced HTTP/2 Prioritization project has the lofty goal of fundamentally re-shaping how we send traffic from the Cloudflare edge to clients, and as the results of our testing and feedback from some of our customers show, we have certainly achieved that! However, one of the most important things we took away from the project was the critical role that the internal data flow within our NGINX software infrastructure plays in the traffic our end users observe. We found that changing a few lines of (albeit critical) code could have significant impacts on the effectiveness and performance of our prioritization algorithms. Another positive outcome is that, in addition to improving HTTP/2, we are looking forward to carrying our newfound skills and lessons learned over to HTTP/3 and QUIC.</p><p>We are eager to share our modifications to NGINX with the community, so we have opened <a href="https://trac.nginx.org/nginx/ticket/1763">this ticket</a>, through which we will discuss upstreaming the event re-ordering change and the buffer partial re-use change with the NGINX team.</p><p>As Cloudflare continues to grow, our requirements on our software infrastructure also shift. Cloudflare has already moved beyond proxying HTTP/1 over TCP to supporting termination and Layer 3 and 4 protection for any UDP and TCP traffic. Now we are moving on to other technologies and protocols such as QUIC and HTTP/3, and full proxying of a wide range of other protocols such as messaging and streaming media.</p><p>For these endeavours we are looking at new ways to answer questions on topics such as scalability, localised performance, wide-scale performance, introspection and debuggability, release agility, and maintainability.</p><p>If you would like to help us answer these questions, and know a bit about hardware and software scalability, network programming, asynchronous event- and futures-based software design, TCP, TLS, QUIC, HTTP, RPC protocols, Rust, or maybe something else, then have a look <a href="https://www.cloudflare.com/careers/">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Speed Week]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[HTTP2]]></category>
            <guid isPermaLink="false">tzpAmDnnJeZ1Iu6R6B9KN</guid>
            <dc:creator>Nick Jones</dc:creator>
        </item>
        <item>
            <title><![CDATA[SEO Best Practices with Cloudflare Workers, Part 2: Implementing Subdomains]]></title>
            <link>https://blog.cloudflare.com/subdomains-vs-subdirectories-improved-seo-part-2/</link>
            <pubDate>Fri, 15 Feb 2019 17:09:26 GMT</pubDate>
            <description><![CDATA[ In Part 1, the pros and cons of subdirectories vs subdomains were discussed.  The subdirectory strategy is typically superior to subdomains since subdomains suffer from keyword and backlink dilution.  ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h4>Recap</h4>
      <a href="#recap">
        
      </a>
    </div>
    <p>In Part 1, the merits and tradeoffs of <a href="https://blog.cloudflare.com/subdomains-vs-subdirectories-best-practices-workers-part-1/"><i>subdirectories and subdomains</i></a> were discussed.  The subdirectory strategy is typically superior to subdomains because subdomains suffer from <i>keyword</i> and <i>backlink dilution</i>.  The subdirectory strategy more effectively boosts a site's search rankings by ensuring that every keyword is attributed to the root domain instead of diluting across subdomains.</p>
    <div>
      <h4>Subdirectory Strategy without the NGINX</h4>
      <a href="#subdirectory-strategy-without-the-nginx">
        
      </a>
    </div>
    <p>In the first part, our friend Bob set up a hosted Ghost blog at <i>bobtopia.coolghosthost.com</i> that he connected to <i>blog.bobtopia.com</i> using a <code>CNAME</code> DNS record.  But what if he wanted his blog to live at <i>bobtopia.com/blog</i> to gain the SEO advantages of subdirectories?</p><p>A reverse proxy like NGINX is normally needed to route traffic from subdirectories to remotely hosted services.  We'll demonstrate how to implement the subdirectory strategy with Cloudflare Workers and eliminate our dependency on NGINX. (Cloudflare Workers are <a href="https://www.cloudflare.com/learning/serverless/what-is-serverless/">serverless</a> functions that run on the Cloudflare global network.)</p>
    <div>
      <h4>Back to Bobtopia</h4>
      <a href="#back-to-bobtopia">
        
      </a>
    </div>
    <p>Let's write a Worker that proxies traffic from a subdirectory – <i>bobtopia.com/blog –</i> to a remotely hosted platform – <i>bobtopia.coolghosthost.com</i>.  This means that if I go to <i>bobtopia.com/blog</i>, I should see the content of <i>bobtopia.coolghosthost.com,</i> but my browser should still think it's on <i>bobtopia.com</i>.</p>
    <div>
      <h4>Configuration Options</h4>
      <a href="#configuration-options">
        
      </a>
    </div>
    <p>In the <a href="https://dash.cloudflare.com/?zone=workers">Workers</a> editor, we'll start a new script with some basic configuration options.</p>
            <pre><code>// keep track of all our blog endpoints here
const myBlog = {
  hostname: "bobtopia.coolghosthost.com",
  targetSubdirectory: "/articles",
  assetsPathnames: ["/public/", "/assets/"]
}</code></pre>
            <p>The script will proxy traffic from <code>myBlog.targetSubdirectory</code> to Bob's hosted Ghost endpoint, <code>myBlog.hostname</code>.  We'll talk about <code>myBlog.assetsPathnames</code> a little later.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4euHWaCm0YKJDnMpWIsFy9/90cb71bf33f1637863413a04af86f5e4/Screen-Shot-2019-01-27-at-2.47.39-PM.png" />
            
            </figure><p>Requests are proxied from bobtopia.com/articles to bobtopia.coolghosthost.com (Uh oh... that's because the hosted Ghost blog doesn't actually exist)</p>
    <div>
      <h4>Request Handlers</h4>
      <a href="#request-handlers">
        
      </a>
    </div>
    <p>Next, we'll add a request handler:</p>
            <pre><code>async function handleRequest(request) {
  return fetch(request)
}

addEventListener("fetch", event =&gt; {
  event.respondWith(handleRequest(event.request))
})</code></pre>
            <p>So far we're just passing requests through <code>handleRequest</code> unmodified.  Let's make it do something:</p>
            <pre><code>
async function handleRequest(request) { 
  ...

  // if the request is for blog html, get it
  if (requestMatches(myBlog.targetSubdirectory)) {
    console.log("this is a request for a blog document", parsedUrl.pathname)
    const targetPath = formatPath(parsedUrl)
    
    return fetch(`https://${myBlog.hostname}/${targetPath}`)
  }

  ...
  
  console.log("this is a request to my root domain", parsedUrl.pathname)
  // if it's not a request for blog-related stuff, do nothing
  return fetch(request)
}

addEventListener("fetch", event =&gt; {
  event.respondWith(handleRequest(event.request))
})</code></pre>
            <p>In the above code, we added a conditional statement to handle traffic to <code>myBlog.targetSubdirectory</code>.  Note that we've omitted our helper functions here.  The relevant code lives inside the <code>if</code> block near the top of the function. The <code>requestMatches</code> helper checks if the incoming request contains <code>targetSubdirectory</code>.  If it does, a request is made to <code>myBlog.hostname</code> to fetch the HTML document which is returned to the browser.</p><p>When the browser parses the HTML, it makes additional asset requests required by the document (think images, stylesheets, and scripts).  We'll need another conditional statement to handle these kinds of requests.</p>
            <pre><code>// if its blog assets, get them
if (myBlog.assetsPathnames.some(requestMatches)) {
    console.log("this is a request for blog assets", parsedUrl.pathname)
    const assetUrl = request.url.replace(parsedUrl.hostname, myBlog.hostname);

    return fetch(assetUrl)
  }</code></pre>
            <p>This similarly shaped block checks if the request matches any pathnames enumerated in <code>myBlog.assetsPathnames</code> and fetches the assets required to fully render the page.  Assets happen to live in <i>/public</i> and <i>/assets</i> on a Ghost blog.  You'll be able to identify your assets directories when you <code>fetch</code> the HTML and see logs for scripts, images, and stylesheets.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/333FXL8DTk3xbnNH3CDM2G/9818beae76d36f64663d71a99e41fcb1/Screen-Shot-2019-01-27-at-5.51.44-PM.png" />
            
            </figure><p>Logs show the various scripts and stylesheets required by Ghost live in <i>/assets</i> and <i>/public</i></p><p>The full script with helper functions included is:</p>
            <pre><code>
// keep track of all our blog endpoints here
const myBlog = {
  hostname: "bobtopia.coolghosthost.com",
  targetSubdirectory: "/articles",
  assetsPathnames: ["/public/", "/assets/"]
}

async function handleRequest(request) {
  // returns an empty string or a path if one exists
  const formatPath = (url) =&gt; {
    const pruned = url.pathname.split("/").filter(part =&gt; part)
    return pruned &amp;&amp; pruned.length &gt; 1 ? `${pruned.join("/")}` : ""
  }
  
  const parsedUrl = new URL(request.url)
  const requestMatches = match =&gt; new RegExp(match).test(parsedUrl.pathname)
  
  // if its blog html, get it
  if (requestMatches(myBlog.targetSubdirectory)) {
    console.log("this is a request for a blog document", parsedUrl.pathname)
    const targetPath = formatPath(parsedUrl)
    
    return fetch(`https://${myBlog.hostname}/${targetPath}`)
  }
  
  // if its blog assets, get them
  if (myBlog.assetsPathnames.some(requestMatches)) {
    console.log("this is a request for blog assets", parsedUrl.pathname)
    const assetUrl = request.url.replace(parsedUrl.hostname, myBlog.hostname);

    return fetch(assetUrl)
  }

  console.log("this is a request to my root domain", parsedUrl.host, parsedUrl.pathname);
  // if it's not a request for blog-related stuff, do nothing
  return fetch(request)
}

addEventListener("fetch", event =&gt; {
  event.respondWith(handleRequest(event.request))
})</code></pre>
            
    <div>
      <h4>Caveat</h4>
      <a href="#caveat">
        
      </a>
    </div>
    <p>There is one important caveat about the current implementation that bears mentioning. This script will not work if your hosted service assets are stored in a folder that shares a name with a route on your root domain.  For example, if you're serving assets from the root directory of your hosted service, any request made to the <i>bobtopia.com</i> home page will be masked by these asset requests, and the home page won't load.</p><p>The solution here involves modifying the blog assets block to handle asset requests without using paths.  I'll leave it to the reader to solve this, but a more general solution might involve changing <code>myBlog.assetsPathnames</code> to <code>myBlog.assetFileExtensions</code>, which is a list of all asset file extensions (like .png and .css).  Then, the assets block would handle requests that contain <code>assetFileExtensions</code> instead of <code>assetsPathnames</code>.</p>
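<p>One way the extension-based matching could look is sketched below (the <code>assetFileExtensions</code> list and <code>matchesExtension</code> helper are hypothetical names, and this sketch is untested against a real Ghost deployment):</p>

```javascript
// Sketch of extension-based asset matching: instead of matching on asset
// directory paths, recognise assets by their file extensions, so a root-
// domain route that shares a folder name is no longer shadowed.
const myBlog = {
  hostname: "bobtopia.coolghosthost.com",
  targetSubdirectory: "/blog",
  assetFileExtensions: [".png", ".jpg", ".css", ".js", ".woff2"],
}

const matchesExtension = (pathname) =>
  myBlog.assetFileExtensions.some(ext => pathname.endsWith(ext))

console.log(matchesExtension("/content/images/logo.png")) // true  -> proxy to blog host
console.log(matchesExtension("/public"))                  // false -> root domain route
```

In the Worker, the assets block would then test <code>matchesExtension(parsedUrl.pathname)</code> instead of matching on <code>assetsPathnames</code>.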
    <div>
      <h4>Conclusion</h4>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Bob is now enjoying the same SEO advantages as Alice after converting his subdomains to subdirectories using Cloudflare Workers.  Bobs of the world, rejoice!</p><hr /><p>Interested in deploying a Cloudflare Worker without setting up a domain on Cloudflare? We’re making it easier to get started building serverless applications with custom subdomains on <a href="https://workers.dev">workers.dev</a>. <i>If you’re already a Cloudflare customer, you can add Workers to your existing website</i> <a href="https://dash.cloudflare.com/workers"><i>here</i></a>.</p><p><a href="https://workers.dev">Reserve a workers.dev subdomain</a></p><hr /><p></p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[SEO]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">btWeOWsm3GJicWtFI691s</guid>
            <dc:creator>Michael Pinter</dc:creator>
        </item>
        <item>
            <title><![CDATA[SEO Best Practices with Cloudflare Workers, Part 1: Subdomain vs. Subdirectory]]></title>
            <link>https://blog.cloudflare.com/subdomains-vs-subdirectories-best-practices-workers-part-1/</link>
            <pubDate>Fri, 15 Feb 2019 17:09:05 GMT</pubDate>
            <description><![CDATA[ Alice and Bob are budding blogger buddies who met up at a meetup and purchased some root domains to start writing.  Alice bought aliceblogs.com and Bob scooped up bobtopia.com. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h4>Subdomain vs. Subdirectory: 2 Different SEO Strategies</h4>
      <a href="#subdomain-vs-subdirectory-2-different-seo-strategies">
        
      </a>
    </div>
    <p>Alice and Bob are budding blogger buddies who met up at a meetup and <a href="https://www.cloudflare.com/products/registrar/"><i>purchased some root domains</i></a> to start writing.  Alice bought <i>aliceblogs.com</i> and Bob scooped up <i>bobtopia.com</i>.</p><p>Alice and Bob decided against WordPress because it's what their parents use and purchased subscriptions to a popular cloud-based Ghost blogging platform instead.</p><p>Bob decides his blog should live at <i>blog.bobtopia.com</i> – a <i>subdomain</i> of bobtopia.com. Alice keeps it old school and builds hers at <i>aliceblogs.com/blog</i> – a <i>subdirectory</i> of aliceblogs.com.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4bfADYO8jfROjJXl42KfXV/d7722e13418b6568bedfa1501c17e626/Untitled-1-1.png" />
            
            </figure><p><i>Subdomains</i> and <i>subdirectories</i> are different strategies for instrumenting root domains with new features (think a blog or a storefront).  Alice and Bob chose their strategies on a whim, but <i>which strategy is technically better</i>?  The short answer is, <i>it depends</i>. But the long answer can actually improve your SEO.  In this article, we'll review the merits and tradeoffs of each. In <a href="/subdomains-vs-subdirectories-improved-seo-part-2/">Part 2</a>, we'll show you how to convert subdomains to subdirectories using <a href="https://www.cloudflare.com/workers">Cloudflare Workers</a>.</p>
    <div>
      <h4>Setting Up Subdomains and Subdirectories</h4>
      <a href="#setting-up-subdomains-and-subdirectories">
        
      </a>
    </div>
    <p>Setting up subdirectories is trivial on basic websites.  A web server treats its subdirectories (aka subfolders) the same as regular old folders in a file system.  In other words, basic sites are already organized using subdirectories out of the box.  No set up or configuration is required.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Rsiz813gZMirbUtjKPo1H/8d097bb447c0874bd9dcc9325ab3930f/Screen-Shot-2019-01-22-at-4.07.02-PM.png" />
            
            </figure><p>In the old school site above, we'll assume the <i>blog</i> folder contains an <i>index.html</i> file. The web server renders <i>blog/index.html</i> when a user navigates to the <i>oldschoolsite.com/blog</i> subdirectory.  But Alice and Bob's sites don't have a <i>blog</i> folder because their blogs are hosted remotely – so this approach won't work.</p><p>On the modern Internet, subdirectory setup is more complicated because the services that comprise a root domain are often hosted on machines scattered across the world.</p><p>Because DNS records only operate on the domain level, records like <code>CNAME</code> have no effect on a URL like <i>aliceblogs.com/blog</i> – and because her blog is hosted remotely, Alice needs to install <a href="https://www.nginx.com/">NGINX</a> or another reverse proxy and write some configuration code that proxies traffic from <i>aliceblogs.com/blog</i> to her hosted blog. It takes time, patience, and experience to connect her domain to her hosted blog.</p>
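    <p>A minimal sketch of the kind of location block Alice would need – the upstream hostname <code>ghost.example-host.com</code> is a hypothetical placeholder for her blogging provider:</p>

```nginx
# Proxy requests for aliceblogs.com/blog to the remotely hosted blog.
# ghost.example-host.com is a made-up upstream host for illustration.
location /blog/ {
    proxy_pass https://ghost.example-host.com/;
    # Many hosted platforms route on the Host header, so forward the
    # hostname the provider expects rather than aliceblogs.com.
    proxy_set_header Host ghost.example-host.com;
}
```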
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5280nHa1SMYG1xAwGUYWDa/844e384aa3c5851c36a6218982aada0c/Screen-Shot-2019-02-05-at-8.25.01-PM.png" />
            
            </figure><p>A location block in NGINX is necessary to proxy traffic from a subdirectory to a remote host</p><p>Bob's subdomain strategy is the easier approach with his remotely hosted blog.  A DNS <code>CNAME</code> record is often all that's required to connect Bob's blog to his subdomain.  No additional configuration is needed if he can remember to pay his monthly subscription.</p>
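    <p>In zone-file notation, Bob's record might look something like this – the target is again a hypothetical placeholder for his blogging provider:</p>

```
; blog.bobtopia.com is an alias for the provider's hostname
blog.bobtopia.com.    3600    IN    CNAME    ghost.example-host.com.
```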
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6kWlWnq3CzsVQIk5POm8DJ/b945ad38484de1f5f46abb950da40b4a/Screen-Shot-2019-02-11-at-10.57.06-AM.png" />
            
            </figure><p>Configuring a DNS record to point a hosted service at your blog subdomain</p><p>To recap, subdirectories are already built into simple sites that serve structured content from the same machine, but modern sites often rely on various remote services.  Subdomain setup is comparatively easy for sites that take advantage of various hosted cloud-based platforms.</p>
    <div>
      <h4>Are Subdomains or Subdirectories Better for SEO?</h4>
      <a href="#are-subdomains-or-subdirectories-better-for-seo">
        
      </a>
    </div>
    <p>Subdomains are neat. If you ask me, <i>blog.bobtopia.com</i> is more appealing than <i>bobtopia.com/blog</i>. But if we want to make an informed decision about the best strategy, where do we look?  If we're interested in SEO, we ought to consult the Google Bot.</p><p>Subdomains and subdirectories are equal in the eyes of the Google Bot, according to Google itself.  This means that Alice and Bob have the same chance at ranking in search results.  This is because Alice's root domain and Bob's subdomain build their own sets of <i>keywords</i>.  Relevant keywords help your audience find your site in a search. There is one important caveat to point out for Bob:</p><blockquote><p>A subdomain is equal to and distinct from a root domain.  This means that a subdomain's keywords are treated separately from the root domain.</p></blockquote><p>What does this mean for Bob?  Let's imagine <i>bobtopia.com</i> is already a popular online platform for folks named Bob to seek kinship with other Bobs.  In this peculiar world, searches that rank for <i>bobtopia.com</i> wouldn't automatically rank for <i>blog.bobtopia.com</i> because each domain has its own separate keywords.  The lesson here is that keywords are diluted across subdomains.  Each additional subdomain decreases the likelihood that any particular domain ranks in a given search.  A high-ranking subdomain does not imply your root domain ranks well.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KQm8LGoY1UrmNc6Nlwmz0/82706de305fd0243ff593202731d31eb/Untitled-Diagram.png" />
            
            </figure><p>In a search for "Cool Blog", <i>bobtopia.com</i> suffers from <i>keyword dilution.</i> It doesn't rank because its blog keyword is owned by <i>blog.bobtopia.com</i>.</p><p>Subdomains also suffer from <i>backlink dilution.</i>  A <i>backlink</i> is simply a hyperlink that points back to your site. When Alice links to a post on the etymology of Bob at <i>blog.bobtopia.com</i>, the backlink does not help <i>bobtopia.com</i>, because the subdomain is treated as separate from, but equal to, the root domain.  If Bob used subdirectories instead, Bob's blog posts would feed the authority of <i>bobtopia.com</i> and Bobs everywhere would rejoice.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69iuli3pGl8aIxAYNY7FVg/f35165a5417dc18278435bcf13e6aa91/Untitled-Diagram--2-.png" />
            
            </figure><p>The authority of <i>blog.bobtopia.com</i> is increased when Alice links to Bob's interesting blog post, but the authority of <i>bobtopia.com</i> is not affected.</p><p>Although search engines have improved at identifying subdomains and attributing keywords back to the root domain, they still have a long way to go.  A prudent marketer would avoid risk by assuming search engines will always be bad at cataloguing subdomains.</p><p>So when would you want to use subdomains?  A good use case is for companies that are interested in expanding into foreign markets.  Pretend <i>bobtopia.com</i> is an American company whose website is in English.  Their English keywords won't rank well in German searches – so they translate their site into German to begin building new keywords on <i>deutsch.bobtopia.com</i>. Erfolg!</p><p>Other use cases for subdomains include product stratification (think global brands with presence across many markets) and corporate internal tools (think productivity and organization tools that aren't user-facing).  But unless you're a huge corporation or just finished your Series C round of funding, subdomaining your site into many silos is not helping your SEO.</p>
    <div>
      <h4>Conclusion</h4>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>If you're a startup or small business looking to optimize your SEO, consider subdirectories over subdomains.  Boosting the authority of your root domain should be a universal goal of any organization. The subdirectory strategy concentrates your keywords onto a single domain, while the subdomain strategy spreads your keywords across multiple distinct domains. In short, the subdirectory strategy results in better root domain authority. Higher domain authority leads to better search rankings, which translate to more engagement.</p><p>Consider the multitude of disruptive PaaS startups with <i>docs.disruptivepaas.com</i> and <i>blog.disruptivepaas.com</i>.  Why not switch to <i>disruptivepaas.com/docs</i> and <i>disruptivepaas.com/blog</i> to boost the authority of your root domain with all those docs searches and StackOverflow backlinks?</p>
    <div>
      <h4>Want to Switch Your Subdomains to Subdirectories?</h4>
      <a href="#want-to-switch-your-subdomains-to-subdirectories">
        
      </a>
    </div>
    <p>Interested in switching your subdomains to subdirectories without a reverse proxy? In <a href="/subdomains-vs-subdirectories-improved-seo-part-2/">Part 2</a>, we'll show you how using Cloudflare Workers.</p><hr /><p>Interested in deploying a Cloudflare Worker without setting up a domain on Cloudflare? We’re making it easier to get started building serverless applications with custom subdomains on <a href="https://workers.dev">workers.dev</a>. <i>If you’re already a Cloudflare customer, you can add Workers to your existing website</i> <a href="https://dash.cloudflare.com/workers"><i>here</i></a>.</p><p><a href="https://workers.dev">Reserve a workers.dev subdomain</a></p><hr /><p></p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[SEO]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">2bz363mDKlEfN0RypJoRjr</guid>
            <dc:creator>Michael Pinter</dc:creator>
        </item>
        <item>
            <title><![CDATA[Know your SCM_RIGHTS]]></title>
            <link>https://blog.cloudflare.com/know-your-scm_rights/</link>
            <pubDate>Thu, 29 Nov 2018 09:54:22 GMT</pubDate>
            <description><![CDATA[ As TLS 1.3 was ratified earlier this year, I was recollecting how we got started with it here at Cloudflare. We made the decision to be early adopters of TLS 1.3 a little over two years ago. It was a very important decision, and we took it very seriously. ]]></description>
            <content:encoded><![CDATA[ <p>As TLS 1.3 was ratified earlier <a href="/rfc-8446-aka-tls-1-3/">this year</a>, I was recollecting how we got started with it here at Cloudflare. We made the decision to be early adopters of <a href="/introducing-tls-1-3/">TLS 1.3</a> a little over two years ago. It was a very important decision, and we took it very seriously.</p><p>It is no secret that Cloudflare uses <a href="/end-of-the-road-for-cloudflare-nginx/">nginx</a> to handle user traffic. A lesser-known fact is that we have several instances of nginx running. I won’t go into detail, but there is one instance whose job is to accept connections on port 443, and proxy them to another instance of nginx that actually handles the requests. It has pretty limited functionality otherwise. We fondly call it nginx-ssl.</p><p>Back then we were using OpenSSL for TLS and crypto in nginx, but OpenSSL (and BoringSSL) had yet to announce a timeline for TLS 1.3 support, so we had to implement our own TLS 1.3 stack. Obviously we wanted an implementation that would not affect any customer or client that would not enable TLS 1.3. We also needed something that we could iterate on quickly, because the spec was very fluid back then, and also something that we could release frequently without worrying about the rest of the Cloudflare stack.</p><p>The obvious solution was to implement it on top of OpenSSL. The OpenSSL version we were using was 1.0.2, but not only were we looking ahead to replace it with version 1.1.0 or with <a href="/make-ssl-boring-again/">BoringSSL</a> (which we eventually did), it was also so ingrained in our stack and so fragile that we wouldn’t be able to achieve our stated goals without risking serious bugs.</p><p>Instead, Filippo Valsorda and Brendan McMillion suggested that the easier path would be to implement TLS 1.3 on top of the Go TLS library and make a Go replica of nginx-ssl (go-ssl). 
Go is very easy to iterate and prototype in, has a powerful standard library, and we had a great pool of Go talent to draw on, so it made a lot of sense. Thus <a href="https://github.com/cloudflare/tls-tris">tls-tris</a> was born.</p><p>The question remained: how would we have Go handle only TLS 1.3 while letting nginx handle all prior versions of TLS?</p><p>And herein lies the problem. Both TLS 1.3 and older versions of TLS communicate on port 443, and it is common knowledge that only one application can listen on a given TCP port, and that application is nginx, which would still handle the bulk of the TLS traffic. We could pipe all the TCP data into another connection in Go, effectively creating an additional proxy layer, but where is the fun in that? It also seemed a little inefficient.</p>
    <div>
      <h2>Meet SCM_RIGHTS</h2>
      <a href="#meet-scm_rights">
        
      </a>
    </div>
    <p>So how do you make two different processes, written in two different programming languages, share the same TCP socket?</p><p>Fortunately, Linux (or rather UNIX) provides us with just the tool that we need. You can use UNIX-domain sockets to pass file descriptors between applications, and, like everything else in UNIX, connections are files. Looking at <code>man 7 unix</code>, we see the following:</p>
            <pre><code>   Ancillary messages
       Ancillary  data  is  sent and received using sendmsg(2) and recvmsg(2).
       For historical reasons the ancillary message  types  listed  below  are
       specified with a SOL_SOCKET type even though they are AF_UNIX specific.
       To send them  set  the  cmsg_level  field  of  the  struct  cmsghdr  to
       SOL_SOCKET  and  the cmsg_type field to the type.  For more information
       see cmsg(3).

       SCM_RIGHTS
              Send or receive a set of  open  file  descriptors  from  another
              process.  The data portion contains an integer array of the file
              descriptors.  The passed file descriptors behave as though  they
              have been created with dup(2).</code></pre>
            <blockquote><p>Technically you do not send “file descriptors”. The “file descriptors” you handle in the code are simply indices into the process's local file descriptor table, which in turn points into the OS's open file table, which finally points to the vnode representing the file. Thus the “file descriptor” observed by the other process will most likely have a different numeric value, despite pointing to the same file.</p></blockquote><p>We can also check <code>man 3 cmsg</code> as suggested, to find a handy example on how to use SCM_RIGHTS:</p>
            <pre><code>   struct msghdr msg = { 0 };
   struct cmsghdr *cmsg;
   int myfds[NUM_FD];  /* Contains the file descriptors to pass */
   int *fdptr;
   union {         /* Ancillary data buffer, wrapped in a union
                      in order to ensure it is suitably aligned */
       char buf[CMSG_SPACE(sizeof(myfds))];
       struct cmsghdr align;
   } u;

   msg.msg_control = u.buf;
   msg.msg_controllen = sizeof(u.buf);
   cmsg = CMSG_FIRSTHDR(&amp;msg);
   cmsg-&gt;cmsg_level = SOL_SOCKET;
   cmsg-&gt;cmsg_type = SCM_RIGHTS;
   cmsg-&gt;cmsg_len = CMSG_LEN(sizeof(int) * NUM_FD);
   fdptr = (int *) CMSG_DATA(cmsg);    /* Initialize the payload */
   memcpy(fdptr, myfds, NUM_FD * sizeof(int));</code></pre>
            <p>And that is what we decided to use. We let OpenSSL read the “Client Hello” message from an established TCP connection. If the “Client Hello” indicated TLS version 1.3, we would use SCM_RIGHTS to send the connection's file descriptor to the Go process. The Go process would in turn try to parse the rest of the “Client Hello”; if successful, it would proceed with the TLS 1.3 connection, and upon failure it would give the file descriptor back to OpenSSL, to handle as usual.</p><p>So how exactly do you implement something like that?</p><p>Since in our case the C process listens for TCP connections, our other process will have to listen on a UNIX socket for the connections C wants to forward.</p><p>For example in Go:</p>
            <pre><code>type scmListener struct {
	*net.UnixListener
}

type scmConn struct {
	*net.UnixConn
}

var path = "/tmp/scm_example.sock"

func listenSCM() (*scmListener, error) {
	syscall.Unlink(path)

	addr, err := net.ResolveUnixAddr("unix", path)
	if err != nil {
		return nil, err
	}

	ul, err := net.ListenUnix("unix", addr)
	if err != nil {
		return nil, err
	}

	err = os.Chmod(path, 0777)
	if err != nil {
		return nil, err
	}

	return &amp;scmListener{ul}, nil
}

func (l *scmListener) Accept() (*scmConn, error) {
	uc, err := l.AcceptUnix()
	if err != nil {
		return nil, err
	}
	return &amp;scmConn{uc}, nil
}</code></pre>
            <p>Then in the C process, for each connection we want to pass, we will connect to that socket first:</p>
            <pre><code>int connect_unix()
{
    struct sockaddr_un addr = {.sun_family = AF_UNIX,
                               .sun_path = "/tmp/scm_example.sock"};

    int unix_sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (unix_sock == -1)
        return -1;

    if (connect(unix_sock, (struct sockaddr *)&amp;addr, sizeof(addr)) == -1)
    {
        close(unix_sock);
        return -1;
    }

    return unix_sock;
}</code></pre>
            <p>To actually pass a file descriptor we utilize the example from <code>man 3 cmsg</code>:</p>
            <pre><code>int send_fd(int unix_sock, int fd)
{
    struct iovec iov = {.iov_base = ":)", // Must send at least one byte
                        .iov_len = 2};

    union {
        char buf[CMSG_SPACE(sizeof(fd))];
        struct cmsghdr align;
    } u;

    struct msghdr msg = {.msg_iov = &amp;iov,
                         .msg_iovlen = 1,
                         .msg_control = u.buf,
                         .msg_controllen = sizeof(u.buf)};

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&amp;msg);
    *cmsg = (struct cmsghdr){.cmsg_level = SOL_SOCKET,
                             .cmsg_type = SCM_RIGHTS,
                             .cmsg_len = CMSG_LEN(sizeof(fd))};

    memcpy(CMSG_DATA(cmsg), &amp;fd, sizeof(fd));

    return sendmsg(unix_sock, &amp;msg, 0);
}</code></pre>
            <p>Then to receive the file descriptor in Go:</p>
            <pre><code>func (c *scmConn) ReadFD() (*os.File, error) {
	msg, oob := make([]byte, 2), make([]byte, 128)

	_, oobn, _, _, err := c.ReadMsgUnix(msg, oob)
	if err != nil {
		return nil, err
	}

	cmsgs, err := syscall.ParseSocketControlMessage(oob[0:oobn])
	if err != nil {
		return nil, err
	} else if len(cmsgs) != 1 {
		return nil, errors.New("invalid number of cmsgs received")
	}

	fds, err := syscall.ParseUnixRights(&amp;cmsgs[0])
	if err != nil {
		return nil, err
	} else if len(fds) != 1 {
		return nil, errors.New("invalid number of fds received")
	}

	fd := os.NewFile(uintptr(fds[0]), "")
	if fd == nil {
		return nil, errors.New("could not open fd")
	}

	return fd, nil
}</code></pre>
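            <p>To see how the pieces fit together end to end, here is a self-contained sketch of the same SCM_RIGHTS mechanics, kept in a single Go process for brevity: a <code>socketpair()</code> stands in for the UNIX-domain connection between the two processes, and a pipe stands in for the TCP connection being handed over. The helper names here (<code>sendFD</code>, <code>recvFD</code>, <code>demo</code>) are ours, not from the production code:</p>

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// sendFD passes an open file descriptor across a UNIX socket using an
// SCM_RIGHTS control message. At least one byte of regular data must
// accompany the ancillary data.
func sendFD(via int, fd int) error {
	return syscall.Sendmsg(via, []byte{0}, syscall.UnixRights(fd), nil, 0)
}

// recvFD receives a file descriptor sent with sendFD.
func recvFD(via int) (int, error) {
	buf, oob := make([]byte, 1), make([]byte, 128)
	_, oobn, _, _, err := syscall.Recvmsg(via, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	cmsgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	fds, err := syscall.ParseUnixRights(&cmsgs[0])
	if err != nil {
		return -1, err
	}
	return fds[0], nil
}

// demo writes through one end of a pipe and reads the data back
// through a copy of the read end that crossed the socketpair.
func demo() string {
	sp, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	if err := sendFD(sp[0], int(r.Fd())); err != nil {
		panic(err)
	}
	fd, err := recvFD(sp[1])
	if err != nil {
		panic(err)
	}
	w.Write([]byte("hello"))
	passed := os.NewFile(uintptr(fd), "passed")
	got := make([]byte, 5)
	passed.Read(got)
	return string(got)
}

func main() {
	fmt.Println(demo()) // prints "hello"
}
```

            <p>Note that, per the man page quoted earlier, the received descriptor behaves as though created with <code>dup(2)</code>: it has a different numeric value in the receiving process but refers to the same open pipe.</p>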
            
    <div>
      <h2>Rust</h2>
      <a href="#rust">
        
      </a>
    </div>
    <p>We can also do this in Rust. The standard library in Rust does not yet support UNIX sockets, but it does let you call the C library via the <a href="https://rust-lang.github.io/libc/x86_64-unknown-linux-gnu/libc/">libc</a> crate. Warning, unsafe code ahead!</p><p>First we want to implement some UNIX socket functionality in Rust:</p>
            <pre><code>use libc::*;
use std::io::prelude::*;
use std::net::TcpStream;
use std::os::unix::io::FromRawFd;
use std::os::unix::io::RawFd;

fn errno_str() -&gt; String {
    let strerr = unsafe { strerror(*__error()) };
    let c_str = unsafe { std::ffi::CStr::from_ptr(strerr) };
    c_str.to_string_lossy().into_owned()
}

pub struct UNIXSocket {
    fd: RawFd,
}

pub struct UNIXConn {
    fd: RawFd,
}

impl Drop for UNIXSocket {
    fn drop(&amp;mut self) {
        unsafe { close(self.fd) };
    }
}

impl Drop for UNIXConn {
    fn drop(&amp;mut self) {
        unsafe { close(self.fd) };
    }
}

impl UNIXSocket {
    pub fn new() -&gt; Result&lt;UNIXSocket, String&gt; {
        match unsafe { socket(AF_UNIX, SOCK_STREAM, 0) } {
            -1 =&gt; Err(errno_str()),
            fd @ _ =&gt; Ok(UNIXSocket { fd }),
        }
    }

    pub fn bind(self, address: &amp;str) -&gt; Result&lt;UNIXSocket, String&gt; {
        assert!(address.len() &lt; 104);

        let mut addr = sockaddr_un {
            sun_len: std::mem::size_of::&lt;sockaddr_un&gt;() as u8,
            sun_family: AF_UNIX as u8,
            sun_path: [0; 104],
        };

        for (i, c) in address.chars().enumerate() {
            addr.sun_path[i] = c as i8;
        }

        match unsafe {
            unlink(&amp;addr.sun_path as *const i8);
            bind(
                self.fd,
                &amp;addr as *const sockaddr_un as *const sockaddr,
                std::mem::size_of::&lt;sockaddr_un&gt;() as u32,
            )
        } {
            -1 =&gt; Err(errno_str()),
            _ =&gt; Ok(self),
        }
    }

    pub fn listen(self) -&gt; Result&lt;UNIXSocket, String&gt; {
        match unsafe { listen(self.fd, 50) } {
            -1 =&gt; Err(errno_str()),
            _ =&gt; Ok(self),
        }
    }

    pub fn accept(&amp;self) -&gt; Result&lt;UNIXConn, String&gt; {
        match unsafe { accept(self.fd, std::ptr::null_mut(), std::ptr::null_mut()) } {
            -1 =&gt; Err(errno_str()),
            fd @ _ =&gt; Ok(UNIXConn { fd }),
        }
    }
}</code></pre>
            <p>And the code to extract the file descriptor:</p>
            <pre><code>#[repr(C)]
pub struct ScmCmsgHeader {
    cmsg_len: c_uint,
    cmsg_level: c_int,
    cmsg_type: c_int,
    fd: c_int,
}

impl UNIXConn {
    pub fn recv_fd(&amp;self) -&gt; Result&lt;RawFd, String&gt; {
        let mut iov = iovec {
            iov_base: std::ptr::null_mut(),
            iov_len: 0,
        };

        let mut scm = ScmCmsgHeader {
            cmsg_len: 0,
            cmsg_level: 0,
            cmsg_type: 0,
            fd: 0,
        };

        let mut mhdr = msghdr {
            msg_name: std::ptr::null_mut(),
            msg_namelen: 0,
            msg_iov: &amp;mut iov as *mut iovec,
            msg_iovlen: 1,
            msg_control: &amp;mut scm as *mut ScmCmsgHeader as *mut c_void,
            msg_controllen: std::mem::size_of::&lt;ScmCmsgHeader&gt;() as u32,
            msg_flags: 0,
        };

        let n = unsafe { recvmsg(self.fd, &amp;mut mhdr, 0) };

        if n == -1
            || scm.cmsg_len as usize != std::mem::size_of::&lt;ScmCmsgHeader&gt;()
            || scm.cmsg_level != SOL_SOCKET
            || scm.cmsg_type != SCM_RIGHTS
        {
            Err("Invalid SCM message".to_string())
        } else {
            Ok(scm.fd)
        }
    }
}</code></pre>
            
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>SCM_RIGHTS is a very powerful tool that can be used for many purposes. In our case we used it to introduce a new service in a non-obtrusive fashion. Other uses include:</p><ul><li><p>A/B testing</p></li><li><p>Phasing out an old C-based service in favor of a new Go or Rust one</p></li><li><p>Passing connections from a privileged process to an unprivileged one</p></li></ul><p>And more.</p><p>You can find the full example <a href="https://github.com/vkrasnov/scm_sample">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[TLS 1.3]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[OpenSSL]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">6FPYG7UeMVoKyzz6mA3tpB</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Optimizing HTTP/2 prioritization with BBR and tcp_notsent_lowat]]></title>
            <link>https://blog.cloudflare.com/http-2-prioritization-with-nginx/</link>
            <pubDate>Fri, 12 Oct 2018 12:00:00 GMT</pubDate>
            <description><![CDATA[ Getting the best end-user performance from HTTP/2 requires good support for resource prioritization.  While most web servers support HTTP/2 prioritization, getting it to work well all the way to the browser requires a fair bit of coordination across the networking stack. ]]></description>
            <content:encoded><![CDATA[ <p>Getting the best end-user performance from HTTP/2 requires good support for resource prioritization. While most web servers support HTTP/2 prioritization, getting it to work well all the way to the browser requires a fair bit of coordination across the networking stack. This article will expose some of the interactions between the web server, Operating System and network and how to tune a server to optimize performance for end users.</p>
    <div>
      <h3>tl;dr</h3>
      <a href="#tl-dr">
        
      </a>
    </div>
    <p>On Linux 4.9 kernels and later, enable BBR congestion control and set tcp_notsent_lowat to 16KB for HTTP/2 prioritization to work reliably. This can be done in /etc/sysctl.conf:</p>
            <pre><code>    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr
    net.ipv4.tcp_notsent_lowat = 16384</code></pre>
            
    <div>
      <h3>Browsers and Request Prioritization</h3>
      <a href="#browsers-and-request-prioritization">
        
      </a>
    </div>
    <p>A single web page is made up of <a href="https://httparchive.org/reports/state-of-the-web#reqTotal">dozens</a> to hundreds of separate pieces of content that a web browser pulls together to create the page and present it to the user. The main content (HTML) for the page you are visiting is a list of instructions on how to construct the page, and the browser goes through the instructions from beginning to end to figure out everything it needs to load and how to put it all together. Each piece of content requires a separate HTTP request from the browser to the server responsible for that content (or if it has been loaded before, it can be loaded from a local cache in the browser).</p><p>In a simple implementation, the web browser could wait until everything is loaded and constructed and then show the result, but that would be pretty slow. Not all of the content is critical to the user – it can include things such as images way down in the page, analytics for tracking usage, ads, like buttons, etc. All modern browsers work more incrementally, displaying the content as it becomes available. This results in a much faster user experience. The visible part of the page can be displayed while the rest of the content is being loaded in the background. Deciding on the best order to request the content in is where browser request prioritization comes into play. Done correctly, the visible content can display significantly faster than with a naive implementation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2XMGHLI8Lg4iw7DBN7mNAS/b1c65f7a021f93e5c81d47f1304b50de/Parser-web.png" />
            
            </figure><p>HTML Parser blocking page render for styles and scripts in the head of the document.</p><p>Most modern browsers use similar prioritization schemes, which generally look like:</p><ol><li><p>Load similar resources (scripts, images, styles) in the order they were listed in the HTML.</p></li><li><p>Load styles/CSS before anything else because content cannot be displayed until styles are complete.</p></li><li><p>Load blocking scripts/JavaScript next because blocking scripts stop the browser from moving on to the next instruction in the HTML until they have been loaded and executed.</p></li><li><p>Load images and non-blocking scripts (async/defer).</p></li></ol><p>Fonts are a bit of a special case in that they are needed to draw the text on the screen, but the browser won’t know that it needs to load a font until it is actually ready to draw that text. As a result, fonts are discovered fairly late in the loading process and are generally given a very high priority once they are discovered.</p><p>Chrome also applies some special treatment to images that are visible in the current browser viewport (part of the page visible on the screen). Once the styles have been applied and the page has been laid out, it will give visible images a much higher priority and load them in order from largest to smallest.</p>
    <div>
      <h4>HTTP/1.x prioritization</h4>
      <a href="#http-1-x-prioritization">
        
      </a>
    </div>
    <p>With HTTP/1.x, each connection to a server can support one request at a time (practically anyway as no browser supports <a href="https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.2.2">pipelining</a>) and most browsers will open up to 6 connections at a time to each server. The browser maintains a prioritized list of the content it needs and makes the requests to each server as a connection becomes available. When a high-priority piece of content is discovered it is moved to the front of the list and when the next connection becomes available it is requested.</p>
    <div>
      <h4>HTTP/2 prioritization</h4>
      <a href="#http-2-prioritization">
        
      </a>
    </div>
    <p>With HTTP/2, the browser uses a single connection and the requests are multiplexed over the connection as separate “streams”. The requests are all sent to the server as soon as they are discovered along with some prioritization information to let the server know the preferred ordering of the responses. It is then up to the server to do its best to deliver the most important responses first, followed by lower priority responses. When a high-priority request comes in to the server, it should immediately jump ahead of the lower priority responses, even mid-response. The actual priority scheme implemented by HTTP/2 allows for parallel downloads with weighting between them and more complicated schemes. For now it is easiest to just think about it as a priority ordering of the resources.</p><p>Most servers that support prioritization will send data for the highest priority responses for which they have data available. But if the most important response takes longer to generate than lower priority responses, the server may end up starting to send data for a lower priority response and then interrupt its stream when the higher priority response becomes available. That way it can avoid wasting available bandwidth and head-of-line blocking where a slow response holds everything else up.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59eyWItYTFdeYfulcnJGgn/80c68b2bc82e445e1c5685f602ed235d/H2Prioritization.gif" />
            
            </figure><p>Browser requesting a high-priority resource after several low-priority resources.</p><p><b>In an optimal configuration, the time to retrieve a top-priority resource on a busy connection with lots of other streams will be identical to the time to retrieve it on an empty connection.</b> Effectively that means that the server needs to be able to interrupt the response streams of all of the other responses immediately with no additional buffering to delay the high-priority response (beyond the minimal amount of data in-flight on the network to keep the connection fully utilized).</p>
    <div>
      <h3>Buffers on the Internet</h3>
      <a href="#buffers-on-the-internet">
        
      </a>
    </div>
    <p>Excessive buffering is pretty much the nemesis for HTTP/2 because it directly impacts the ability for a server to be nimble in responding to priority shifts. It is not unusual for there to be megabytes-worth of buffering between the server and the browser which is larger than most websites. Practically that means that the responses will get delivered in whatever order they become available on the server. It is not unusual to have a critical resource (like a font or a render-blocking script in the <code>&lt;head&gt;</code> of a document) delayed by megabytes of lower priority images. For the end-user this translates to seconds or even minutes of delay rendering the page.</p>
    <div>
      <h4>TCP send buffers</h4>
      <a href="#tcp-send-buffers">
        
      </a>
    </div>
    <p>The first layer of buffering between the server and the browser is in the server itself. The operating system maintains a TCP send buffer that the server writes data into. Once the data is in the buffer then the operating system takes care of delivering the data as-needed (pulling from the buffer as data is sent and signaling to the server when the buffer needs more data). A large buffer also reduces CPU load because it reduces the amount of writing that the server has to do to the connection.</p><p>The actual size of the send buffer needs to be big enough to keep a copy of all of the data that has been sent to the browser but has yet to be acknowledged in case a packet gets dropped and some data needs to be retransmitted. Too small of a buffer will prevent the server from being able to max-out the connection bandwidth to the client (and is a common cause of slow downloads over long distances). In the case of HTTP/1.x (and a lot of other protocols), the data is delivered in bulk in a known-order and tuning the buffers to be as big as possible has no downside other than the increase in memory use (trading off memory for CPU). Increasing the TCP send buffer sizes is an effective way to increase the throughput of a web server.</p><p>For HTTP/2, the problem with large send buffers is that it limits the nimbleness of the server to adjust the data it is sending on a connection as high priority responses become available. Once the response data has been written into the TCP send buffer it is beyond the server’s control and has been committed to be delivered in the order it is written.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1AqTAsnXSJ1q1TQpRZh0pm/86f7438dcd64c932fe96921e811b5a95/Lowat-buff-web.png" />
            
            </figure><p>High-priority resource queued behind low-priority resources in the TCP send buffer.</p><p><b>The optimal send buffer size for HTTP/2 is the minimal amount of data required to fully utilize the available bandwidth to the browser</b> (which is different for every connection and changes over time even for a single connection). Practically you’d want the buffer to be slightly bigger to allow for some time between when the server is signaled that more data is needed and when the server writes the additional data.</p>
    <div>
      <h4>TCP_NOTSENT_LOWAT</h4>
      <a href="#tcp_notsent_lowat">
        
      </a>
    </div>
    <p><a href="https://lwn.net/Articles/560082/">TCP_NOTSENT_LOWAT</a> is a socket option that allows configuration of the send buffer so that it is always the optimal size plus a fixed additional buffer. You provide a buffer size (X), the additional headroom you’d like beyond the minimum needed to fully utilize the connection, and the kernel dynamically adjusts the TCP send buffer to always be X bytes larger than the connection’s current congestion window. The congestion window is the TCP stack’s estimate of the amount of data that needs to be in-flight on the network to fully utilize the connection.</p><p>TCP_NOTSENT_LOWAT can be configured in code on a socket-by-socket basis if the web server software supports it, or system-wide using the <code>net.ipv4.tcp_notsent_lowat</code> sysctl:</p>
            <pre><code>    net.ipv4.tcp_notsent_lowat = 16384</code></pre>
            <p>We have a patch we are preparing to upstream for NGINX to make it configurable, but it isn’t quite ready yet, so for now it must be configured system-wide. Experimentally, the value 16,384 (16K) has proven to be a good balance where connections are kept fully utilized with negligible additional CPU overhead. That means at most 16KB of lower-priority data will be buffered before a higher-priority response can interrupt it and be delivered. As always, your mileage may vary and it is worth experimenting with.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/r0NmKQUjLYjXThiREdchj/37a90992b93c43d8a31563f27c46c349/Lowat-web.png" />
            
            </figure><p>High-priority resource ready to send with minimal TCP buffering.</p>
    <div>
      <h4>Bufferbloat</h4>
      <a href="#bufferbloat">
        
      </a>
    </div>
    <p>Beyond buffering on the server, the network connection between the server and the browser can act as a buffer. It is increasingly common for networking gear to have large buffers that absorb data that is sent faster than the receiving side can consume it. This is generally referred to as <a href="https://www.bufferbloat.net/projects/bloat/wiki/Introduction/">Bufferbloat</a>. I hedged my explanation of the effectiveness of tcp_notsent_lowat a little bit in that it is based on the current congestion window which is an estimate of the optimal amount of in-flight data needed but not necessarily the actual optimal amount of in-flight data.</p><p>The buffers in the network can be quite large at times (megabytes) and they interact very poorly with the congestion control algorithms usually used by TCP. Most classic congestion-control algorithms determine the congestion window by watching for packet loss. Once a packet is dropped then it knows there was too much data on the network and it scales back from there. With Bufferbloat that limit is raised artificially high because the buffers are absorbing the extra packets beyond what is needed to saturate the connection. As a result, the TCP stack ends up calculating a congestion window that spikes to much larger than the actual size needed, then drops to significantly smaller once the buffers are saturated and a packet is dropped and the cycle repeats.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yp3yoEG0uBWlPowSVbKLm/e6fdd79c824709f41e2993698d2f8155/h2_tcp_sawtooth_web.png" />
            
            </figure><p>Loss-based congestion control congestion window graph.</p><p>TCP_NOTSENT_LOWAT uses the calculated congestion window as a baseline for the size of the send buffer it needs to use so when the underlying calculation is wrong, the server ends up with send buffers much larger (or smaller) than it actually needs.</p><p>I like to think about Bufferbloat as being like a line for a ride at an amusement park. Specifically, one of those lines where it’s a straight shot to the ride when there are very few people in line but once the lines start to build they can divert you through a maze of zig-zags. Approaching the ride it looks like a short distance from the entrance to the ride but things can go horribly wrong.</p><p>Bufferbloat is very similar. When the data is coming into the network slower than the links can support, everything is nice and fast:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1IpqZY4r070iJKdzCndFAU/1e039adf8d8ea8f00238794b6eea0191/Bb-fast-web.png" />
            
            </figure><p>Response traveling through the network with no buffering.</p><p>Once the data comes in faster than it can go out the gates are flipped and the data gets routed through the maze of buffers to hold it until it can be sent. From the entrance to the line it still looks like everything is going fine since the network is absorbing the extra data but it also means there is a long queue of the low-priority data already absorbed when you want to send the high-priority data and it has no choice but to follow at the back of the line:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1iqGmNvNY0pOG3CJZRHT0w/4f65ec518eff246284d8f2bca373b7da/Bb-slow-web.png" />
            
            </figure><p>Responses queued in network buffers.</p>
    <div>
      <h4>BBR congestion control</h4>
      <a href="#bbr-congestion-control">
        
      </a>
    </div>
    <p><a href="https://cloud.google.com/blog/products/gcp/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster">BBR</a> is a new congestion control algorithm from Google that uses changes in packet delays to model the congestion instead of waiting for packets to drop. Once it sees that packets are taking longer to be acknowledged it assumes it has saturated the connection and packets have started to buffer. As a result the congestion window is often very close to the optimal needed to keep the connection fully utilized while also avoiding Bufferbloat. BBR was merged into the Linux kernel in version 4.9 and can be configured through sysctl:</p>
            <pre><code>    net.core.default_qdisc = fq
    net.ipv4.tcp_congestion_control = bbr</code></pre>
            <p>BBR also tends to perform better overall, since it doesn’t rely on packet loss to probe for the correct congestion window, and it reacts better to random packet loss.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4t9tjuLzOCF7CbuCLicM5G/a350fcbdafc39f865ac34e90f5778fce/h2_bbr_sawtooth.png" />
            
            </figure><p>BBR congestion window graph.</p><p>Back to the amusement park line, BBR is like having each person carry one of the RFID cards they use to measure the wait time. Once the wait time looks like it is getting slower the people at the entrance slow down the rate that they let people enter the line.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lAoZDM2jP8XYw9SyS5OMx/f040d3c07ff4b97eefc65cad5f89ab39/Bb-detect-web.png" />
            
            </figure><p>BBR detecting network congestion early.</p><p>This way BBR essentially keeps the line moving as fast as possible and prevents the maze of lines from being used. When a guest with a fast pass arrives (the high-priority request) they can jump into the fast-moving line and hop right onto the ride.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1PIseBoVZR55F9yEe71pya/75c406b6b42272f3d02a761f7b6a7047/Bb-bbr-web.png" />
            
            </figure><p>BBR delivering responses without network buffering.</p><p>Technically, any congestion control that keeps Bufferbloat in check and maintains an accurate congestion window will work for keeping the TCP send buffers in check, BBR just happens to be one of them (with lots of good properties).</p>
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p><b><i>The combination of TCP_NOTSENT_LOWAT and BBR reduces the amount of buffering on the network to the absolute minimum and is CRITICAL for good end-user performance with HTTP/2.</i></b> This is particularly true for NGINX and other HTTP/2 servers that don’t implement their own buffer throttling.</p><p>The end-user impact of correct prioritization is huge and may not show up in most of the metrics you are used to watching (particularly any server-side metrics like requests-per-second, request response time, etc).</p><p>Even on a 5Mbps cable connection proper resource ordering can result in rendering a page significantly faster (and the difference can explode to dozens of seconds or even minutes on a slower connection). <a href="https://www.webpagetest.org/video/view.php?id=180927_1b80249b8e1a7619300d2a51533c07438e4fa5ea">Here</a> is a relatively common case of a WordPress blog served over HTTP/2:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/42hSIcWY4c4t5AlEDYa1Sl/eaacdf916bba0264aa90549411b8fc22/h2_render_1.png" />
            
            </figure><p>The page from the tuned server (After) starts to render at 1.8 seconds.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7DbdYJbZ12so4x2jDmnx3W/a9bb2f54dedd07bb752dddb8fcc11667/h2_render_2.png" />
            
            </figure><p>The page from the tuned server (After) is completely done rendering at 4.5 seconds, well before the default configuration (Before) even started to render.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2DFgPFFYHWh7yoPDVPvzD8/a9788b5de09169f0d9113b8255fee060/h2_render_3.png" />
            
            </figure><p>Finally, at 10.2 seconds the default configuration started to render (8.4 seconds later or 5.6 times slower than the tuned server).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Mi9As6BWT0lq9HCITHwJ5/fbd8ebf3a553691da6b53d266286652e/h2_render_4.png" />
            
            </figure><p>Visually complete on the default configuration arrives at 10.7 seconds (6.2 seconds or 2.3 times slower than the tuned server).</p><p>Both configurations served the exact same content using the exact same servers with “After” being tuned for TCP_NOTSENT_LOWAT of 16KB (both configurations used BBR).</p>
    <div>
      <h3>Identifying Prioritization Issues In The Wild</h3>
      <a href="#identifying-prioritization-issues-in-the-wild">
        
      </a>
    </div>
    <p>If you look at a network waterfall diagram of a page loading, prioritization issues will show up as high-priority requests completing much later than lower-priority requests from the same origin. Usually that will also push metrics like <a href="https://w3c.github.io/paint-timing/#sec-terminology">First Paint</a> and <a href="https://developer.mozilla.org/en-US/docs/Web/Events/DOMContentLoaded">DOM Content Loaded</a> (the vertical purple bar below) much later.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Ed4mofSgwlJ77sJ46qcWf/4fa3f82440f6df9a1cab240d02c1ce4f/h2_waterfall_delayed.png" />
            
            </figure><p>Network waterfall showing critical CSS and JavaScript delayed by images.</p><p>When prioritization is working correctly you will see critical resources all completing much earlier and not be blocked by the lower-priority requests. You may still see SOME low-priority data download before the higher-priority data starts downloading because there is still some buffering even under ideal conditions but it should be minimal.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1xpPBVSGpsy1byCuc6vGy7/2c2fe0e20129d3ce23304886674c8ef0/h2_waterfall_fast.png" />
            
            </figure><p>Network waterfall showing critical CSS and JavaScript loading quickly.</p><p>Chrome 69 and later may hide the problem a bit. Chrome holds back lower-priority requests even on HTTP/2 connections until after it has finished processing the head of the document. In a waterfall it will look like a delayed block of requests that all start at the same time after the critical requests have completed. That doesn’t mean that it isn’t a problem for Chrome, just that it isn’t as obvious. Even with the staggering of requests there are still high-priority requests outside of the head of the document that can be delayed by lower-priority requests. Most notable are any blocking scripts in the body of the page and any external fonts that were not preloaded.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/35PouKEd6DjgpSylNzuGp6/1081ad9271d8cb9acd89d1d5cce75c17/h2_waterfall_chrome.png" />
            
            </figure><p>Network waterfall showing Chrome delaying the requesting of low-priority resources.</p><p>Hopefully this post gives you the tools to be able to identify HTTP/2 prioritization issues when they happen, a deeper understanding of how HTTP/2 prioritization works and some tools to fix the issues when they appear.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">7zRny2MvKKFGt3vBHIyM5c</guid>
            <dc:creator>Patrick Meenan (Guest Author)</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we scaled nginx and saved the world 54 years every day]]></title>
            <link>https://blog.cloudflare.com/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/</link>
            <pubDate>Tue, 31 Jul 2018 15:00:00 GMT</pubDate>
            <description><![CDATA[ 10 million websites, apps and APIs use Cloudflare to give their users a speed boost. At peak we serve more than 10 million requests a second across our 151 data centers. Over the years we’ve made many modifications to our version of NGINX to handle our growth. This blog post is about one of them. ]]></description>
            <content:encoded><![CDATA[ <blockquote><p>The <a href="https://twitter.com/Cloudflare?ref_src=twsrc%5Etfw">@Cloudflare</a> team just pushed a change that improves our network's performance significantly, especially for particularly slow outlier requests. How much faster? We estimate we're saving the Internet ~54 years *per day* of time we'd all otherwise be waiting for sites to load.</p><p>— Matthew Prince (@eastdakota) <a href="https://twitter.com/eastdakota/status/1012420672352542720?ref_src=twsrc%5Etfw">June 28, 2018</a></p></blockquote><p>10 million websites, apps and APIs use Cloudflare to give their users a speed boost. At peak we serve more than 10 million requests a second across our 151 data centers. Over the years we’ve made many modifications to our version of NGINX to handle our growth. This blog post is about one of them.</p>
    <div>
      <h3>How NGINX works</h3>
      <a href="#how-nginx-works">
        
      </a>
    </div>
    <p>NGINX is one of the programs that popularized using event loops to solve <a href="http://www.kegel.com/c10k.html">the C10K problem</a>. Every time a network event comes in (a new connection, a request, or a notification that we can send more data, etc.) NGINX wakes up, handles the event, and then goes back to do whatever it needs to do (which may be handling other events). When an event arrives, data associated with the event is already ready, which allows NGINX to efficiently handle many requests simultaneously without waiting.</p>
            <pre><code>num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
// handle event[1]: send out response to GET http://cloudflare.com/</code></pre>
            <p>For example, here's what a piece of code could look like to read data from a file descriptor:</p>
            <pre><code>// we got a read event on fd
while (buf_len &gt; 0) {
    ssize_t n = read(fd, buf, buf_len);
    if (n &lt; 0) {
        if (errno == EWOULDBLOCK || errno == EAGAIN) {
            // try later when we get a read event again
        }
        if (errno == EINTR) {
            continue;
        }
        return total;
    }
    buf_len -= n;
    buf += n;
    total += n;
}</code></pre>
            <p>When fd is a network socket, this will return the bytes that have already arrived. The final call will return <code>EWOULDBLOCK</code> which means we have drained the local read buffer, so we should not read from the socket again until more data becomes available.</p>
    <div>
      <h3>Disk I/O is not like network I/O</h3>
      <a href="#disk-i-o-is-not-like-network-i-o">
        
      </a>
    </div>
    <p>When <code>fd</code> is a regular file on Linux, <code>EWOULDBLOCK</code> and <code>EAGAIN</code> never happen, and <code>read()</code> always waits until it has read the entire buffer. This is true even if the file was opened with <code>O_NONBLOCK</code>. Quoting <a href="http://man7.org/linux/man-pages/man2/open.2.html"><code>open(2)</code></a>:</p><blockquote><p>Note that this flag has no effect for regular files and block devices</p></blockquote><p>In other words, the code above basically reduces to:</p>
            <pre><code>if (read(fd, buf, buf_len) &gt; 0) {
   return buf_len;
}</code></pre>
            <p>Which means that if an event handler needs to read from disk, it will block the event loop until the entire read is finished, and subsequent event handlers are delayed.</p><p>This ends up being fine for most workloads, because reading from disk is usually fast enough, and much more predictable compared to waiting for a packet to arrive from network. That's especially true now that everyone has an SSD, and our cache disks are all SSDs. Modern SSDs have very low latency, typically in 10s of µs. On top of that, we can run NGINX with multiple worker processes so that a slow event handler does not block requests in other processes. Most of the time, we can rely on NGINX's event handling to service requests quickly and efficiently.</p>
    <div>
      <h3>SSD performance: not always what’s on the label</h3>
      <a href="#ssd-performance-not-always-whats-on-the-label">
        
      </a>
    </div>
    <p>As you might have guessed, these rosy assumptions aren’t always true. If each read always takes 50µs then it should only take 2ms to read 0.19MB in 4KB blocks (and we read in larger blocks). But our own measurements showed that our time to first byte is sometimes much worse, particularly at 99th and 999th percentile. In other words, the slowest read per 100 (or per 1000) reads often takes much longer.</p><p>SSDs are very fast but they are also notoriously complicated. Inside them are computers that queue up and re-order I/O, and also perform various background tasks like garbage collection and defragmentation. Once in a while, a request gets slowed down enough to matter. My colleague <a href="https://twitter.com/ibobrik?lang=en">Ivan Babrou</a> ran some I/O benchmarks and saw read spikes of up to 1 second. Moreover, some of our SSDs have more performance outliers than others. Going forward we will consider performance consistency in our SSD purchases, but in the meantime we need to have a solution for our existing hardware.</p>
    <div>
      <h3>Spreading the load evenly with <code>SO_REUSEPORT</code></h3>
      <a href="#spreading-the-load-evenly-with-so_reuseport">
        
      </a>
    </div>
    <p>An individual slow response once in a blue moon is difficult to avoid, but what we really don't want is a 1 second I/O blocking 1000 other requests that we receive within the same second. Conceptually NGINX can handle many requests in parallel but it only runs 1 event handler at a time. So I added a metric that measures this:</p>
            <pre><code>gettimeofday(&amp;start, NULL);
num_events = epoll_wait(epfd, /*returned=*/events, events_len, /*timeout=*/-1);
// events is list of active events
// handle event[0]: incoming request GET http://example.com/
gettimeofday(&amp;event_start_handle, NULL);
// handle event[1]: send out response to GET http://cloudflare.com/
timersub(&amp;event_start_handle, &amp;start, &amp;event_loop_blocked);</code></pre>
            <p>p99 of <code>event_loop_blocked</code> turned out to be more than 50% of our TTFB. Which is to say, half of the time it takes to serve a request is a result of the event loop being blocked by other requests. <code>event_loop_blocked</code> only measures about half of the blocking (because delayed calls to <code>epoll_wait()</code> are not measured), so the actual ratio of blocked time is much higher.</p><p>Each of our machines runs NGINX with 15 worker processes, which means one slow I/O should only block up to 6% of the requests. However, the events are not evenly distributed, with the top worker taking 11% of the requests (or twice as many as expected).</p><p><code>SO_REUSEPORT</code> can solve the uneven distribution problem. Marek Majkowski has previously written about <a href="/the-sad-state-of-linux-socket-balancing/">the downside</a> in the context of other NGINX instances, but that downside mostly doesn't apply in our case since upstream connections in our cache process are long-lived, so a slightly higher latency in opening the connection is negligible. This single configuration change to enable <code>SO_REUSEPORT</code> improved peak p99 by 33%.</p>
    <div>
      <h3>Moving read() to thread pool: not a silver bullet</h3>
      <a href="#moving-read-to-thread-pool-not-a-silver-bullet">
        
      </a>
    </div>
    <p>A solution to this is to make read() not block. In fact, this is a feature that's <a href="https://www.nginx.com/blog/thread-pools-boost-performance-9x/">implemented in upstream NGINX</a>! When the following configuration is used, <code>read()</code> and <code>write()</code> are done in a thread pool and won't block the event loop:</p>
            <pre><code>aio         threads;
aio_write   on;</code></pre>
            <p>However, when we tested this, instead of a 33x response time improvement, we actually saw a slight increase in p99. The difference was within the margin of error, but we were quite discouraged by the result and stopped pursuing this option for a while.</p><p>There are a few reasons why we didn’t see the level of improvement that NGINX saw. In their test, they were using 200 concurrent connections to request files that were 4MB in size, which were residing on spinning disks. Spinning disks increase I/O latency, so it makes sense that an optimization that helps latency would have a more dramatic effect.</p><p>We are also mostly concerned with p99 (and p999) performance. Solutions that help the average performance don't necessarily help with outliers.</p><p>Finally, in our environment, typical file sizes are much smaller. 90% of our cache hits are for files smaller than 60KB. Smaller files mean fewer occasions to block (we typically read the entire file in 2 reads).</p><p>If we look at the disk I/O that a cache hit has to do:</p>
            <pre><code>// we got a request for https://example.com which has cache key 0xCAFEBEEF
fd = open("/cache/prefix/dir/EF/BE/CAFEBEEF", O_RDONLY);
// read up to 32KB for the metadata as well as the headers
// done in thread pool if "aio threads" is on
read(fd, buf, 32*1024);</code></pre>
            <p>32KB isn't a fixed number; if the headers are small we need to read just 4KB (we don't use direct I/O, so the kernel will round up to 4KB). The <code>open()</code> seems innocuous but it's actually not free. At a minimum the kernel needs to check if the file exists and if the calling process has permission to open it. For that it would have to find the inode of <code>/cache/prefix/dir/EF/BE/CAFEBEEF</code>, and to do that it would have to look up <code>CAFEBEEF</code> in <code>/cache/prefix/dir/EF/BE/</code>. Long story short, in the worst case the kernel has to do the following lookups:</p>
            <pre><code>/cache
/cache/prefix
/cache/prefix/dir
/cache/prefix/dir/EF
/cache/prefix/dir/EF/BE
/cache/prefix/dir/EF/BE/CAFEBEEF</code></pre>
            <p>That's 6 separate reads done by <code>open()</code> compared to just 1 read done by <code>read()</code>! Fortunately, most of the time lookups are serviced by <a href="https://www.kernel.org/doc/Documentation/filesystems/vfs.txt">the dentry cache</a> and don't require trips to the SSDs. But clearly having <code>read()</code> done in thread pool is only half of the picture.</p>
    <div>
      <h3>The coup de grâce: non-blocking open() in thread pools</h3>
      <a href="#the-coup-de-grace-non-blocking-open-in-thread-pools">
        
      </a>
    </div>
    <p>So I modified NGINX to do most of <code>open()</code> inside the thread pool as well so it won't block the event loop. And the result (both non-blocking open and non-blocking read):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7zUNNTAUAfVrlcTLWqp0Xc/aaff6458aa54f35f8b7835cb19679fd3/Screenshot-from-2018-07-17-12-16-27.png" />
            
            </figure><p>On June 26 we deployed our changes to 5 of our busiest data centers, followed by world wide roll-out the next day. Overall peak p99 TTFB improved by a factor of 6. In fact, adding up all the time from processing 8 million requests per second, we saved the Internet 54 years of wait time every day.</p><p>We've submitted our work to <a href="http://mailman.nginx.org/pipermail/nginx-devel/2018-August/011354.html">upstream</a>. Interested parties can follow along.</p><p>Our event loop handling is still not completely non-blocking. In particular, we still block when we are caching a file for the first time (both the <code>open(O_CREAT)</code> and <code>rename()</code>), or doing revalidation updates. However, those are rare compared to cache hits. In the future we will consider moving those off of the event loop to further improve our p99 latency.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>NGINX is a powerful platform, but scaling extremely high I/O loads on Linux can be challenging. Upstream NGINX can offload reads to separate threads, but at our scale we often need to go one step further. If working on challenging performance problems sounds exciting to you, <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering">apply to join our team</a> in San Francisco, London, Austin or Champaign.</p>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">3uAC7i358jXCZ2M7Cjj9Pn</guid>
            <dc:creator>Ka-Hing Cheung</dc:creator>
        </item>
        <item>
            <title><![CDATA[HTTP Analytics for 6M requests per second using ClickHouse]]></title>
            <link>https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/</link>
            <pubDate>Tue, 06 Mar 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options: ]]></description>
            <content:encoded><![CDATA[ <p>One of our large-scale data infrastructure challenges here at Cloudflare is providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:</p><ul><li><p>Analytics tab in Cloudflare dashboard</p></li><li><p>Zone Analytics API with 2 endpoints</p><ul><li><p><a href="https://api.cloudflare.com/#zone-analytics-dashboard">Dashboard endpoint</a></p></li><li><p><a href="https://api.cloudflare.com/#zone-analytics-analytics-by-co-locations">Co-locations endpoint</a> (Enterprise plan only)</p></li></ul></li></ul><p>In this blog post I'm going to talk about the exciting evolution of the Cloudflare analytics pipeline over the last year. I'll start with a description of the old pipeline and the challenges that we experienced with it. Then, I'll describe how we leveraged ClickHouse to form the basis of a new and improved pipeline. In the process, I'll share details about how we went about schema design and performance tuning for ClickHouse. Finally, I'll look ahead to what the Data team is planning to provide in the future.</p><p>Let's start with the old data pipeline.</p>
    <div>
      <h3>Old data pipeline</h3>
      <a href="#old-data-pipeline">
        
      </a>
    </div>
    <p>The previous pipeline was built in 2014. It has been mentioned previously in the <a href="https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/">Scaling out PostgreSQL for CloudFlare Analytics using CitusDB</a> and <a href="https://blog.cloudflare.com/more-data-more-data/">More data, more data</a> blog posts from the Data team.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5MY5UJgdkM35pwDy4mMOxt/2c08b73e37001788547db620f00a5a92/Old-system-architecture.jpg" />
          </figure><p>It had the following components:</p><ul><li><p><b>Log forwarder</b> - collected Cap'n Proto formatted logs from the edge, notably DNS and NGINX logs, and shipped them to Kafka in Cloudflare's central datacenter.</p></li><li><p><b>Kafka cluster</b> - consisted of 106 brokers with a x3 replication factor and 106 partitions, and ingested Cap'n Proto formatted logs at an average rate of 6M logs per second.</p></li><li><p><b>Kafka consumers</b> - each of the 106 partitions had a dedicated Go consumer (a.k.a. Zoneagg consumer), which read logs, produced aggregates per partition per zone per minute, and then wrote them into Postgres.</p></li><li><p><b>Postgres database</b> - a single-instance PostgreSQL database (a.k.a. RollupDB), which accepted aggregates from Zoneagg consumers and wrote them into temporary tables per partition per minute. An aggregation cron then rolled them up into further aggregates. More specifically:</p><ul><li><p>Aggregates per partition, minute, zone → aggregates per minute, zone</p></li><li><p>Aggregates per minute, zone → aggregates per hour, zone</p></li><li><p>Aggregates per hour, zone → aggregates per day, zone</p></li><li><p>Aggregates per day, zone → aggregates per month, zone</p></li></ul></li><li><p><b>Citus cluster</b> - consisted of a Citus main node and 11 Citus workers with a x2 replication factor (a.k.a. Zoneagg Citus cluster); this was the storage behind the Zone Analytics API and our internal BI tools. A replication cron remotely copied tables from the Postgres instance into Citus worker shards.</p></li><li><p><b>Zone Analytics API</b> - served queries from the internal PHP API. It consisted of 5 API instances written in Go that queried the Citus cluster, and was not visible to external users.</p></li><li><p><b>PHP API</b> - 3 instances of a proxying API, which forwarded public API queries to the internal Zone Analytics API and contained some business logic around zone plans, error messages, etc.</p></li><li><p><b>Load balancer</b> - an NGINX proxy that forwarded queries to the PHP API/Zone Analytics API.</p></li></ul><p>Cloudflare has grown tremendously since this pipeline was originally designed in 2014. It started off processing under 1M requests per second and grew to the current level of 6M requests per second. The pipeline had served us and our customers well over the years, but began to split at the seams.
Any system should be re-engineered after some time, when requirements change.</p><p>Some specific disadvantages of the original pipeline were:</p><ul><li><p><b>Postgres SPOF</b> - the single PostgreSQL instance was a SPOF (Single Point of Failure), as it had no replicas or backups; if we lost this node, the whole analytics pipeline could be paralyzed and produce no new aggregates for the Zone Analytics API.</p></li><li><p><b>Citus main SPOF</b> - Citus main was the entry point for all Zone Analytics API queries, and if it went down, all of our customers' Analytics API queries would return errors.</p></li><li><p><b>Complex codebase</b> - thousands of lines of bash and SQL for aggregations, plus thousands of lines of Go for the API and Kafka consumers, made the pipeline difficult to maintain and debug.</p></li><li><p><b>Many dependencies</b> - the pipeline consisted of many components, and a failure in any individual component could halt the entire pipeline.</p></li><li><p><b>High maintenance cost</b> - due to its complex architecture and codebase, there were frequent incidents, which sometimes took engineers from the Data team and other teams many hours to mitigate.</p></li></ul><p>Over time, as our request volume grew, the challenges of operating this pipeline became more apparent, and we realized that the system was being pushed to its limits. This realization inspired us to think about which components would be ideal candidates for replacement, and led us to build a new data pipeline.</p><p>Our first design for an improved analytics pipeline centred around the use of the <a href="https://flink.apache.org/">Apache Flink</a> stream processing system. We had previously used Flink for other data pipelines, so it was a natural choice for us. However, those pipelines ran at a much lower rate than the 6M requests per second we needed to process for HTTP Analytics, and we struggled to get Flink to scale to this volume: it just couldn't keep up with the per-partition ingestion rate for all 6M HTTP requests per second.</p><p>Our colleagues on the DNS team had already built and productionized a DNS analytics pipeline atop ClickHouse. They wrote about it in the <a href="https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/">"How Cloudflare analyzes 1M DNS queries per second"</a> blog post. So, we decided to take a deeper look at ClickHouse.</p>
    <div>
      <h3>ClickHouse</h3>
      <a href="#clickhouse">
        
      </a>
    </div>
    <blockquote><p>"ClickHouse не тормозит."
Translation from Russian: ClickHouse doesn't have brakes (or isn't slow)
© ClickHouse core developers</p></blockquote><p>When exploring additional candidates for replacing some of the key infrastructure of our old pipeline, we realized that a column-oriented database might be well suited to our analytics workloads. We wanted a column-oriented database that was horizontally scalable and fault-tolerant, to help us deliver good uptime guarantees, and extremely performant and space-efficient, so that it could handle our scale. We quickly realized that ClickHouse could satisfy these criteria, and then some.</p><p><a href="https://clickhouse.yandex/">ClickHouse</a> is an open source column-oriented database management system capable of generating analytical data reports in real time using SQL queries. It is blazing fast, linearly scalable, hardware efficient, fault tolerant, feature rich, highly reliable, simple and handy. The ClickHouse core developers provide great help in resolving issues and in merging and maintaining our PRs into ClickHouse. For example, engineers from Cloudflare have contributed a whole bunch of code back upstream:</p><ul><li><p>Aggregate function <a href="https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk">topK</a> by <a href="https://github.com/vavrusa">Marek Vavruša</a></p></li><li><p>IP prefix dictionary by Marek Vavruša</p></li><li><p>SummingMergeTree engine optimizations by Marek Vavruša</p></li><li><p><a href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka">Kafka table engine</a> by Marek Vavruša. We're considering replacing the Go Kafka consumers with this engine once it is stable enough, so we can ingest from Kafka into ClickHouse directly.</p></li><li><p>Aggregate function <a href="https://clickhouse.yandex/docs/en/single/index.html#summapkey-value">sumMap</a> by <a href="https://github.com/bocharov">Alex Bocharov</a>. Without this function it would have been impossible to build our new Zone Analytics API.</p></li><li><p><a href="https://github.com/yandex/ClickHouse/pull/1636">Mark cache fix</a> by Alex Bocharov</p></li><li><p><a href="https://github.com/yandex/ClickHouse/pull/1844">uniqHLL12 function fix</a> for big cardinalities by Alex Bocharov. The description of the issue and its fix makes for interesting reading.</p></li></ul>
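<p>As a small illustration of the first contribution above: topK returns the most frequent values of a column. A minimal sketch (the table and column names here are hypothetical, not our production schema):</p><pre><code>-- Hypothetical example of the contributed topK aggregate function:
-- the 10 most common browser families across one day of requests.
SELECT topK(10)(browserFamily)
FROM requests
WHERE date = today()</code></pre>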
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6mN4cq6YrpiHjBfbP6vpCh/7181a8e68b4a63cd48e42e0eaf807191/ClickHouse-uniq-functions.png" />
          </figure><p>Along with filing many bug reports, we also report every issue we face in our cluster, which we hope will help improve ClickHouse in the future.</p><p>Even though DNS analytics on ClickHouse had been a great success, we were still skeptical that we would be able to scale ClickHouse to the needs of the HTTP pipeline:</p><ul><li><p>The Kafka DNS topic averages 1.5M messages per second, vs. 6M messages per second for the HTTP requests topic.</p></li><li><p>The Kafka DNS topic's average uncompressed message size is 130B, vs. 1630B for the HTTP requests topic.</p></li><li><p>A DNS query ClickHouse record consists of 40 columns, vs. 104 columns for an HTTP request ClickHouse record.</p></li></ul><p>After our unsuccessful attempts with Flink, we were skeptical that ClickHouse would be able to keep up with the high ingestion rate. Luckily, an early prototype showed promising performance, and we decided to proceed with replacing the old pipeline. The first step was to design a schema for the new ClickHouse tables.</p>
    <div>
      <h3>ClickHouse schema design</h3>
      <a href="#clickhouse-schema-design">
        
      </a>
    </div>
    <p>Once we identified ClickHouse as a potential candidate, we began exploring how we could port our existing Postgres/Citus schemas to make them compatible with ClickHouse.</p><p>For our <a href="https://api.cloudflare.com/#zone-analytics-dashboard">Zone Analytics API</a> we need to produce many different aggregations for each zone (domain) and time period (minutely / hourly / daily / monthly). For a deeper dive into the specifics of the aggregates, please see the Zone Analytics API documentation or this handy <a href="https://docs.google.com/spreadsheets/d/1zQ3yI3HB2p8hiM-Jwvq1-MaeEyIouix2I-iUAPZtJYw/edit#gid=1788221216">spreadsheet</a>.</p><p>These aggregations should be available for any time range within the last 365 days. While ClickHouse is a really great tool for working with non-aggregated data, at our volume of 6M requests per second we simply cannot yet afford to store non-aggregated data for that long.</p><p>To give you an idea of how much data that is, here is some "napkin-math" capacity planning. I'm going to use an average insertion rate of 6M requests per second and $100 as a cost estimate for 1 TiB, to calculate the storage cost for 1 year in different message formats:</p><table><tr><th><p><b>Metric</b></p></th><th><p><b>Cap'n Proto</b></p></th><th><p><b>Cap'n Proto (zstd)</b></p></th><th><p><b>ClickHouse</b></p></th></tr><tr><td><p>Avg message/record size</p></td><td><p>1630 B</p></td><td><p>360 B</p></td><td><p>36.74 B</p></td></tr><tr><td><p>Storage requirements for 1 year</p></td><td><p>273.93 PiB</p></td><td><p>60.5 PiB</p></td><td><p>18.52 PiB (RF x3)</p></td></tr><tr><td><p>Storage cost for 1 year</p></td><td><p>$28M</p></td><td><p>$6.2M</p></td><td><p>$1.9M</p></td></tr></table><p>And that assumes the request rate stays the same, when in fact it is growing fast all the time.</p><p>Even though the storage requirements are quite scary, we're still considering storing raw (non-aggregated) request logs in ClickHouse for 1 month+. See the "Future of Data APIs" section below.</p>
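<p>Incidentally, the napkin math above can be checked in ClickHouse itself. For example, the Cap'n Proto column works out as follows:</p><pre><code>-- 6M records/s for a year at 1630 B per record, expressed in PiB
SELECT 6000000 * 86400 * 365 * 1630 / pow(2, 50) AS capnproto_pib
-- ≈ 273.9 PiB, matching the capacity table above</code></pre>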
    <div>
      <h4>Non-aggregated requests table</h4>
      <a href="#non-aggregated-requests-table">
        
      </a>
    </div>
    <p>We store <a href="https://docs.google.com/spreadsheets/d/1zQ3yI3HB2p8hiM-Jwvq1-MaeEyIouix2I-iUAPZtJYw/edit?usp=sharing">100+ columns</a>, collecting lots of different kinds of metrics about each request passing through Cloudflare. Some of these columns are also available in our <a href="https://support.cloudflare.com/hc/en-us/articles/216672448-Enterprise-Log-Share-Logpull-REST-API">Enterprise Log Share</a> product; however, the ClickHouse non-aggregated requests table has more fields.</p><p>With so many columns to store and huge storage requirements, we decided to proceed with the aggregated-data approach, which had worked well for us in the old pipeline and provides backward compatibility.</p>
    <div>
      <h4>Aggregations schema design #1</h4>
      <a href="#aggregations-schema-design-1">
        
      </a>
    </div>
    <p>According to the <a href="https://api.cloudflare.com/#zone-analytics-dashboard">API documentation</a>, we need to provide lots of different request breakdowns, and to satisfy these requirements we decided to test the following approach:</p><ol><li><p>Create ClickHouse <a href="https://clickhouse.com/docs/en/sql-reference/statements/create/view">materialized views</a> with the <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree">ReplicatedAggregatingMergeTree</a> engine, pointing to the non-aggregated requests table and containing minutely aggregate data for each of the breakdowns:</p><ul><li><p><b>Requests totals</b> - containing numbers like total requests, bytes, threats, uniques, etc.</p></li><li><p><b>Requests by colo</b> - containing requests, bytes, etc. broken down by edgeColoId - each of our 120+ Cloudflare datacenters</p></li><li><p><b>Requests by http status</b> - containing a breakdown by HTTP status code, e.g. 200, 404, 500, etc.</p></li><li><p><b>Requests by content type</b> - containing a breakdown by response content type, e.g. HTML, JS, CSS, etc.</p></li><li><p><b>Requests by country</b> - containing a breakdown by client country (based on IP)</p></li><li><p><b>Requests by threat type</b> - containing a breakdown by threat type</p></li><li><p><b>Requests by browser</b> - containing a breakdown by browser family, extracted from the user agent</p></li><li><p><b>Requests by ip class</b> - containing a breakdown by client IP class</p></li></ul></li><li><p>Write code that gathers data from all 8 materialized views, using two approaches:</p><ul><li><p>Querying all 8 materialized views at once using JOIN</p></li><li><p>Querying each of the 8 materialized views separately in parallel</p></li></ul></li><li><p>Run a performance-testing benchmark against common Zone Analytics API queries</p></li></ol>
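<p>To sketch step 1, one of the per-breakdown materialized views might have looked roughly like the following. All table and column names here are hypothetical (not our production schema), and the replication parameters are omitted:</p><pre><code>-- Hypothetical sketch of the "requests by http status" minutely breakdown
CREATE MATERIALIZED VIEW requests_by_status_min
ENGINE = ReplicatedAggregatingMergeTree(/* zookeeper path, replica */)
ORDER BY (zoneId, minute, status)
AS SELECT
    zoneId,
    toStartOfMinute(timestamp) AS minute,
    status,
    countState() AS requests,      -- partial aggregation state
    sumState(bytes) AS bytes
FROM requests
GROUP BY zoneId, minute, status</code></pre><p>Reads would then apply the matching -Merge combinators (countMerge, sumMerge) to finalize the partial aggregation states.</p>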
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3iQvOQSik5xLDR7lkG0X9H/ee505f826947f6844c3e001bb83c7e7b/Schema-design--1-1.jpg" />
          </figure><p>Schema design #1 didn't work out well. ClickHouse's JOIN syntax forced us to write a monstrous query of over 300 lines of SQL, repeating the selected columns many times, because ClickHouse only supports <a href="https://github.com/yandex/ClickHouse/issues/873">pairwise joins</a>.</p><p>As for querying each of the 8 materialized views separately in parallel, the benchmark showed promising but moderate results - query throughput would be only a little better than with our Citus-based old pipeline.</p>
    <div>
      <h4>Aggregations schema design #2</h4>
      <a href="#aggregations-schema-design-2">
        
      </a>
    </div>
    <p>In our second iteration of the schema design, we strove to keep a structure similar to our existing Citus tables. To do this, we experimented with the SummingMergeTree engine, which is described in detail in the excellent ClickHouse <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/summingmergetree">documentation</a>:</p><blockquote><p>In addition, a table can have nested data structures that are processed in a special way. If the name of a nested table ends in 'Map' and it contains at least two columns that meet the following criteria... then this nested table is interpreted as a mapping of key =&gt; (values...), and when merging its rows, the elements of two data sets are merged by 'key' with a summation of the corresponding (values...).</p></blockquote><p>We were pleased to find this feature, because the SummingMergeTree engine allowed us to significantly reduce the number of tables required compared to our initial approach, while matching the structure of our existing Citus tables. The reason is that the ClickHouse Nested structures ending in 'Map' are similar to the <a href="https://www.postgresql.org/docs/9.6/static/hstore.html">Postgres hstore</a> data type, which we used extensively in the old pipeline.</p><p>However, there were two issues with ClickHouse maps:</p><ul><li><p>SummingMergeTree aggregates all records with the same primary key, but the final aggregation across all shards needs to be done with an aggregate function, which didn't exist in ClickHouse.</p></li><li><p>For storing uniques (unique visitors based on IP), we need to use the AggregateFunction data type, and although SummingMergeTree allows you to create a column of that type, it will not aggregate it for records with the same primary key.</p></li></ul><p>To resolve problem #1, we created a new aggregate function <a href="https://clickhouse.yandex/docs/en/single/index.html#summapkey-value">sumMap</a>. Luckily, the ClickHouse source code is of excellent quality, and its core developers are very helpful in reviewing and merging requested changes.</p><p>As for problem #2, we had to put uniques into a separate materialized view, which uses the ReplicatedAggregatingMergeTree engine and supports merging AggregateFunction states for records with the same primary key. We're considering adding the same functionality to SummingMergeTree, which would simplify our schema even more.</p><p>We also created a separate materialized view for the Colo endpoint because it has much lower usage (5% of queries, vs. 95% for Zone dashboard queries), so its more dispersed primary key does not affect the performance of Zone dashboard queries.</p>
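<p>The 'Map' convention and the contributed sumMap function can be sketched as follows (a simplified illustration with hypothetical names, not our production schema):</p><pre><code>-- Nested columns whose name ends in 'Map' are merged key-wise by SummingMergeTree
CREATE TABLE requests_by_status
(
    zoneId UInt32,
    minute DateTime,
    statusMap Nested(
        status UInt16,    -- key
        requests UInt64   -- value, summed per key when parts merge
    )
)
ENGINE = SummingMergeTree
ORDER BY (zoneId, minute)

-- Final aggregation across shards then uses the sumMap aggregate function
SELECT zoneId, sumMap(statusMap.status, statusMap.requests)
FROM requests_by_status
GROUP BY zoneId</code></pre>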
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZmeMrcqIgAI99UJgfx1HK/80923c6f467d95a1bd8d74e838f91d99/Schema-design--2.jpg" />
          </figure><p>Once the schema design was acceptable, we proceeded to performance testing.</p>
    <div>
      <h3>ClickHouse performance tuning</h3>
      <a href="#clickhouse-performance-tuning">
        
      </a>
    </div>
    <p>We explored a number of avenues for performance improvement in ClickHouse. These included tuning index granularity and improving the merge performance of the SummingMergeTree engine.</p><p>By default, ClickHouse recommends an index granularity of 8192. There is a <a href="https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324">nice article</a> explaining ClickHouse primary keys and index granularity in depth.</p><p>While the default index granularity might be an excellent choice for most use cases, in our case we decided on the following index granularities:</p><ul><li><p>For the main non-aggregated requests table we chose an index granularity of 16384. For this table, the number of rows read in a query is typically on the order of millions to billions, so a large index granularity makes little difference to query performance.</p></li><li><p>For the aggregated requests_* tables, we chose an index granularity of 32. A low index granularity makes sense when we only need to scan and return a few rows. It made a huge difference to API performance - query latency decreased by 50% and throughput increased by ~3x when we changed the index granularity from 8192 to 32.</p></li></ul><p>Not relevant to performance, but we also disabled the min_execution_speed setting, so queries scanning just a few rows won't return an exception because of the "slow speed" of scanning rows per second.</p><p>On the aggregation/merge side, we've made some ClickHouse optimizations as well, like <a href="https://github.com/yandex/ClickHouse/pull/1330">increasing SummingMergeTree maps merge speed</a> by 7x, which we contributed back to ClickHouse for everyone's benefit.</p><p>Once we had completed the performance tuning for ClickHouse, we could bring it all together into a new data pipeline. Next, we describe the architecture of our new, ClickHouse-based data pipeline.</p>
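<p>In DDL terms, the granularity choices described above are just per-table settings; a minimal sketch (hypothetical table names, columns elided):</p><pre><code>-- Big non-aggregated table: queries scan millions of rows, so a coarse index is fine
CREATE TABLE requests ( /* 100+ columns */ )
ENGINE = MergeTree
ORDER BY (date, zoneId)
SETTINGS index_granularity = 16384

-- Aggregated table: queries return only a few rows, so a fine index pays off
CREATE TABLE requests_minutely ( /* aggregate columns */ )
ENGINE = SummingMergeTree
ORDER BY (zoneId, minute)
SETTINGS index_granularity = 32  -- default is 8192</code></pre>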
    <div>
      <h3>New data pipeline</h3>
      <a href="#new-data-pipeline">
        
      </a>
    </div>
    <p>The new pipeline architecture re-uses some components of the old pipeline, while replacing its weakest ones.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2UaoX6fPvFuhN1ecJpbe6v/8aee3bd1ab395cb4a4ac2b8db9575a23/New-system-architecture.jpg" />
          </figure><p>New components include:</p><ul><li><p><b>Kafka consumers</b> - 106 Go consumers, one per partition, consume Cap'n Proto raw logs and extract/prepare the needed 100+ ClickHouse fields. The consumers no longer do any aggregation logic.</p></li><li><p><b>ClickHouse cluster</b> - 36 nodes with a x3 replication factor. It handles ingestion of non-aggregated request logs and then produces aggregates using materialized views.</p></li><li><p><b>Zone Analytics API</b> - a rewritten and optimized version of the API in Go, with many meaningful metrics, healthchecks and failover scenarios.</p></li></ul><p>As you can see, the architecture of the new pipeline is much simpler and more fault-tolerant. It provides analytics for all of our 7M+ customers' domains, totalling more than 2.5 billion monthly unique visitors and over 1.5 trillion monthly page views.</p><p>On average we process 6M HTTP requests per second, with peaks of up to 8M requests per second.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zjubA3CgetykRE8Angxln/7509dd5dea82ad7408339a5f78561b41/HTTP-Logfwdr-throughput.png" />
          </figure><p>The average log message size in <a href="https://capnproto.org/">Cap’n Proto</a> format used to be ~1630B, but thanks to the amazing job on Kafka compression by our Platform Operations Team, it has decreased significantly. Please see the <a href="https://blog.cloudflare.com/squeezing-the-firehose/">"Squeezing the firehose: getting the most from Kafka compression"</a> blog post for a deeper dive into those optimisations.</p>
    <div>
      <h4>Benefits of new pipeline</h4>
      <a href="#benefits-of-new-pipeline">
        
      </a>
    </div>
    <ul><li><p><b>No SPOF</b> - removed all SPOFs and bottlenecks; everything has at least a x3 replication factor.</p></li><li><p><b>Fault-tolerant</b> - even if a Kafka consumer, ClickHouse node or Zone Analytics API instance fails, it doesn't impact the service.</p></li><li><p><b>Scalable</b> - we can add more Kafka brokers or ClickHouse nodes and scale ingestion as we grow. We are less confident about query performance once the cluster grows to hundreds of nodes; however, the Yandex team managed to scale their cluster to 500+ nodes, distributed geographically between several data centers, using two-level sharding.</p></li><li><p><b>Reduced complexity</b> - by removing the messy crons and consumers that did aggregation, and <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactoring</a> the API code, we were able to:</p><ul><li><p>Shut down the Postgres RollupDB instance and free it up for reuse.</p></li><li><p>Shut down the 12-node Citus cluster and free it up for reuse. As we won't use Citus for serious workloads anymore, this reduces our operational and support costs.</p></li><li><p>Delete tens of thousands of lines of old Go, SQL, Bash and PHP code.</p></li><li><p>Remove the WWW PHP API dependency and its extra latency.</p></li></ul></li><li><p><b>Improved API throughput and latency</b> - with the previous pipeline, the Zone Analytics API struggled to serve more than 15 queries per second, so we had to introduce temporary hard rate limits for the largest users. With the new pipeline we were able to remove the hard rate limits, and we now serve ~40 queries per second. We went further and did intensive load testing of the new API: with the current setup and hardware we are able to serve up to ~150 queries per second, and this scales with additional nodes.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1TCHFdGAndAank2N3agHQj/5db6811ef0918e8fc66b8543531f9733/Zone-Analytics-API-requests-latency-quantiles.png" />
          </figure><p></p></li><li><p><b>Easier to operate</b> - with the shutdown of many unreliable components, we are finally at a point where this pipeline is relatively easy to operate. ClickHouse's quality helps us a lot in this matter.</p></li><li><p><b>Decreased number of incidents</b> - with the new, more reliable pipeline we now have fewer incidents than before, which has ultimately reduced the on-call burden. Finally, we can sleep peacefully at night :-).</p></li></ul><p>Recently, we've improved the throughput and latency of the new pipeline even further with better hardware. I'll provide details about this cluster below.</p>
    <div>
      <h4>Our ClickHouse cluster</h4>
      <a href="#our-clickhouse-cluster">
        
      </a>
    </div>
    <p>In total we have 36 ClickHouse nodes. The new hardware is a big upgrade for us:</p><ul><li><p><b>Chassis</b> - Quanta D51PH-1ULH chassis instead of Quanta D51B-2U chassis (2x less physical space)</p></li><li><p><b>CPU</b> - 40 logical cores E5-2630 v4 @ 2.20 GHz instead of 32 logical cores E5-2630 v3 @ 2.40 GHz</p></li><li><p><b>RAM</b> - 256 GB RAM instead of 128 GB RAM</p></li><li><p><b>Disks</b> - 12 x 10 TB Seagate ST10000NM0016-1TT101 disks instead of 12 x 6 TB Toshiba MG04ACA600E disks</p></li><li><p><b>Network</b> - 2 x 25G Mellanox ConnectX-4 in MC-LAG instead of 2 x 10G Intel 82599ES</p></li></ul><p>Our Platform Operations team noticed that ClickHouse is not yet great at running heterogeneous clusters, so we need to gradually replace all 36 nodes in the existing cluster with new hardware. The process is fairly straightforward; it's no different from replacing a failed node. The problem is that <a href="https://github.com/yandex/ClickHouse/issues/1821">ClickHouse doesn't throttle recovery</a>.</p><p>Here is more information about our cluster:</p><ul><li><p><b>Avg insertion rate</b> - all our pipelines combined insert 11M rows per second.</p></li><li><p><b>Avg insertion bandwidth</b> - 47 Gbps.</p></li><li><p><b>Avg queries per second</b> - on average the cluster serves ~40 queries per second, with frequent peaks up to ~80 queries per second.</p></li><li><p><b>CPU time</b> - after the recent hardware upgrade and all the optimizations, our cluster's CPU time is quite low.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61sIThxxM4s9mgA8nQSibn/4e09df1a744f0c1e2cd92b8bf5bfdd5f/ClickHouse-CPU-usage.png" />
          </figure><p></p></li><li><p><b>Max disk IO</b> (device time) - it's low as well.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1rNrJvGvXd6VNvPRH2qe0l/f7813d61e834b420515047062e25e791/Max-disk-IO.png" />
          </figure><p></p><p></p></li></ul><p>In order to make the switch to the new pipeline as seamless as possible, we performed a transfer of historical data from the old pipeline. Next, I discuss the process of this data transfer.</p>
    <div>
      <h4>Historical data transfer</h4>
      <a href="#historical-data-transfer">
        
      </a>
    </div>
    <p>As we have 1-year storage requirements, we had to do a one-time ETL (Extract, Transform, Load) from the old Citus cluster into ClickHouse.</p><p>At Cloudflare we love Go and its goroutines, so it was quite straightforward to write a simple ETL job, which:</p><ul><li><p>For each minute/hour/day/month, extracts data from the Citus cluster</p></li><li><p>Transforms the Citus data into ClickHouse format and applies the needed business logic</p></li><li><p>Loads the data into ClickHouse</p></li></ul><p>The whole process took a couple of days, and 60+ billion rows of data were transferred successfully with consistency checks. The completion of this process finally led to the shutdown of the old pipeline. However, our work does not end there, and we are constantly looking to the future. In the next section, I'll share some details about what we are planning.</p>
    <div>
      <h3>Future of Data APIs</h3>
      <a href="#future-of-data-apis">
        
      </a>
    </div>
    
    <div>
      <h4>Log Push</h4>
      <a href="#log-push">
        
      </a>
    </div>
    <p>We're currently working on something called "Log Push". Log Push allows you to specify a desired data endpoint and have your HTTP request logs sent there automatically at regular intervals. At the moment it's in private beta, and it's going to support sending logs to:</p><ul><li><p>Amazon S3 bucket</p></li><li><p>Google Cloud Storage bucket</p></li><li><p>Other storage services and platforms</p></li></ul><p>It's expected to be generally available soon, but if you are interested in this new product and want to try it out, please contact our Customer Support team.</p>
    <div>
      <h4>Logs SQL API</h4>
      <a href="#logs-sql-api">
        
      </a>
    </div>
    <p>We're also evaluating the possibility of building a new product called Logs SQL API. The idea is to give customers access to their logs via a flexible API that supports standard SQL syntax and JSON/CSV/TSV/XML response formats.</p><p>Queries can extract:</p><ul><li><p><b>Raw requests logs fields</b> (e.g. SELECT field1, field2, ... FROM requests WHERE ...)</p></li><li><p><b>Aggregated data from requests logs</b> (e.g. SELECT clientIPv4, count() FROM requests GROUP BY clientIPv4 ORDER BY count() DESC LIMIT 10)</p></li></ul><p>Google BigQuery provides a similar <a href="https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query">SQL API</a>, and Amazon has a product called <a href="https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/analytics-sql-reference.html">Kinesis Data Analytics</a> with SQL API support as well.</p><p>Another option we're exploring is to provide syntax similar to the <a href="https://api.cloudflare.com/#dns-analytics-properties">DNS Analytics API</a>, with filters and dimensions.</p><p>We're excited to hear your feedback and learn more about your analytics use cases - it will help us a lot in building new products!</p>
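<p>As a concrete illustration of the aggregated-query style described above, a Logs SQL API request might look like the following. This is a sketch only - the product and its exact syntax were still being designed:</p><pre><code>-- Hypothetical Logs SQL API query: top 10 client IPs over the last day, as JSON
SELECT clientIPv4, count()
FROM requests
WHERE datetime &gt;= now() - INTERVAL 1 DAY
GROUP BY clientIPv4
ORDER BY count() DESC
LIMIT 10
FORMAT JSON</code></pre>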
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>None of this would have been possible without hard work across multiple teams! First of all, thanks to the other Data team engineers for their tremendous efforts to make this all happen. The Platform Operations Team made significant contributions to this project, especially Ivan Babrou and Daniel Dao. Contributions from Marek Vavruša on the DNS team were also very helpful.</p><p>Finally, the Data team at Cloudflare is a small team, so if you're interested in building and operating distributed services, you stand to have some great problems to work on. Check out the <a href="https://boards.greenhouse.io/cloudflare/jobs/613800">Distributed Systems Engineer - Data</a> and <a href="https://boards.greenhouse.io/cloudflare/jobs/688056">Data Infrastructure Engineer</a> roles in London, UK and San Francisco, US, and let us know what you think.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Cap'n Proto]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">6VEE3i8wXN2CDKWJJ16uXS</guid>
            <dc:creator>Alex Bocharov</dc:creator>
        </item>
        <item>
            <title><![CDATA[However improbable: The story of a processor bug]]></title>
            <link>https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/</link>
            <pubDate>Thu, 18 Jan 2018 12:06:48 GMT</pubDate>
            <description><![CDATA[ Processor problems have been in the news lately, due to the Meltdown and Spectre vulnerabilities. But generally, engineers writing software assume that computer hardware operates in a reliable, well-understood fashion, and that any problems lie on the software side of the software-hardware divide. ]]></description>
            <content:encoded><![CDATA[ <p>Processor problems have been in the news lately, due to the <a href="/meltdown-spectre-non-technical/">Meltdown and Spectre vulnerabilities</a>. But generally, engineers writing software assume that computer hardware operates in a reliable, well-understood fashion, and that any problems lie on the software side of the software-hardware divide. Modern processor chips routinely execute many billions of instructions in a second, so any erratic behaviour must be very hard to trigger, or it would quickly become obvious.</p><p>But sometimes that assumption of reliable processor hardware doesn’t hold. Last year at Cloudflare, we were affected by a bug in one of Intel’s processor models. Here’s the story of how we found we had a mysterious problem, and how we tracked down the cause.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/259H08ZlP34cvwO953wEbe/9dfefb01771e3aff180bde9186c3c319/Sherlock_holmes_pipe_hat-1.jpg" />
            
            </figure><p><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a> <a href="https://commons.wikimedia.org/wiki/File:Sherlock_holmes_pipe_hat.jpg">image</a> by Alterego</p>
    <div>
      <h3>Prologue</h3>
      <a href="#prologue">
        
      </a>
    </div>
    <p>Back in February 2017, Cloudflare <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">disclosed a security problem</a> which became known as Cloudbleed. The bug behind that incident lay in some code that ran on our servers to parse HTML. In certain cases involving invalid HTML, the parser would read data from a region of memory beyond the end of the buffer being parsed. The adjacent memory might contain other customers’ data, which would then be returned in the HTTP response, and the result was Cloudbleed.</p><p>But that wasn’t the only consequence of the bug. Sometimes it could lead to an invalid memory read, causing the NGINX process to crash, and we had metrics showing these crashes in the weeks leading up to the discovery of Cloudbleed. So one of the measures we took to prevent such a problem happening again was to require that every crash be investigated in detail.</p><p>We acted very swiftly to address Cloudbleed, and so ended the crashes due to that bug, but that did not stop all crashes. We set to work investigating these other crashes.</p>
    <div>
      <h3>Crash is not a technical term</h3>
      <a href="#crash-is-not-a-technical-term">
        
      </a>
    </div>
    <p>But what exactly does “crash” mean in this context? When a processor detects an attempt to access invalid memory (more precisely, an address without a valid page in the page tables), it signals a page fault to the operating system’s kernel. In the case of Linux, these page faults result in the delivery of a SIGSEGV signal to the relevant process (the name SIGSEGV derives from the historical Unix term “segmentation violation”, also known as a segmentation fault or segfault). The default behaviour for SIGSEGV is to terminate the process. It’s this abrupt termination that was one symptom of the Cloudbleed bug.</p><p>This possibility of invalid memory access and the resulting termination is mostly relevant to processes written in C or C++. Higher-level compiled languages, such as Go and JVM-based languages, use type systems which prevent the kind of low-level programming errors that can lead to accesses of invalid memory. Furthermore, such languages have sophisticated runtimes that take advantage of <a href="https://pdos.csail.mit.edu/6.828/2017/readings/appel-li.pdf">page faults for implementation tricks that make them more efficient</a> (a process can install a signal handler for SIGSEGV so that it does not get terminated, and instead can recover from the situation). And for interpreted languages such as Python, the interpreter checks that conditions leading to invalid memory accesses cannot occur. So unhandled SIGSEGV signals tend to be restricted to programming in C and C++.</p><p>SIGSEGV is not the only signal that indicates an error in a process and causes termination. We also saw process terminations due to SIGABRT and SIGILL, suggesting other kinds of bugs in our code.</p><p>If the only information we had about these terminated NGINX processes was the signal involved, investigating the causes would have been difficult. But there is another feature of Linux (and other Unix-derived operating systems) that provided a path forward: core dumps. 
A core dump is a file written by the operating system when a process is terminated abruptly. It records the full state of the process at the time it was terminated, allowing post-mortem debugging. The state recorded includes:</p><ul><li><p>The processor register values for all threads in the process (the values of some program variables will be held in registers)</p></li><li><p>The contents of the process’ conventional memory regions (giving the values of other program variables and heap data)</p></li><li><p>Descriptions of regions of memory that are read-only mappings of files, such as executables and shared libraries</p></li><li><p>Information associated with the signal that caused termination, such as the address of an attempted memory access that led to a SIGSEGV</p></li></ul><p>Because core dumps record all this state, their size depends upon the program involved, but they can be fairly large. Our NGINX core dumps are often several gigabytes.</p><p>Once a core dump has been recorded, it can be inspected using a debugging tool such as gdb. This allows the state from the core dump to be explored in terms of the original program source code, so that you can inquire about the program stack and contents of variables and the heap in a reasonably convenient manner.</p><p>A brief aside: Why are core dumps called core dumps? It’s a historical term that originated in the 1960s when the principal form of random access memory was <a href="https://en.wikipedia.org/wiki/Magnetic-core_memory">magnetic core memory</a>. At the time, the word core was used as a shorthand for memory, so “core dump” means a dump of the contents of memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4DW66RVAAUllbxyW26n4oP/9fe72dacf0f67f9a8a79dd4c115ba62a/coremem-2.jpg" />
            
            </figure><p><a href="http://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA 3.0</a> <a href="https://en.wikipedia.org/wiki/Magnetic-core_memory#/media/File:KL_CoreMemory.jpg">image</a> by Konstantin Lanzet</p>
    <div>
      <h3>The game is afoot</h3>
      <a href="#the-game-is-afoot">
        
      </a>
    </div>
    <p>As we examined the core dumps, we were able to track some of them back to more bugs in our code. None of them leaked data as Cloudbleed had, or had other security implications for our customers. Some might have allowed an attacker to try to impact our service, but the core dumps suggested that the bugs were being triggered under innocuous conditions rather than attacks. We didn’t have to fix many such bugs before the number of core dumps being produced had dropped significantly.</p><p>But there were still some core dumps being produced on our servers — about one a day across our whole fleet of servers. And finding the root cause of these remaining ones proved more difficult.</p><p>We gradually began to suspect that these residual core dumps were not due to bugs in our code. These suspicions arose because we found cases where the state recorded in the core dump did not seem to be possible based on the program code (and in examining these cases, we didn’t rely on the C code, but looked at the machine code produced by the compiler, in case we were dealing with compiler bugs). At first, as we discussed these core dumps among the engineers at Cloudflare, there was some healthy scepticism about the idea that the cause might lie outside of our code, and there was at least one joke about <a href="https://www.computerworld.com/article/3171677/it-industry/computer-crash-may-be-due-to-forces-beyond-our-solar-system.html">cosmic rays</a>. But as we amassed more and more examples, it became clear that something unusual was going on. Finding yet another “mystery core dump”, as we had taken to calling them, became routine, although the details of these core dumps were diverse, and the code triggering them was spread throughout our code base. The common feature was their apparent impossibility.</p><p>There was no obvious pattern to the servers which produced these mystery core dumps. We were getting about one a day on average across our fleet of servers. 
So the sample size was not very big, but they seemed to be evenly spread across all our servers and datacenters, and no one server was struck twice. The probability that an individual server would get a mystery core dump seemed to be very low (about one per ten years of server uptime, assuming they were indeed equally likely for all our servers). But because of our large number of servers, we got a steady trickle.</p>
    <div>
      <h3>In quest of a solution</h3>
      <a href="#in-quest-of-a-solution">
        
      </a>
    </div>
    <p>The rate of mystery core dumps was low enough that it didn’t appreciably impact the service to our customers. But we were still committed to examining every core dump that occurred. Although we got better at recognizing these mystery core dumps, investigating and classifying them was a drain on engineering resources. We wanted to find the root cause and fix it. So we started to consider causes that seemed somewhat plausible:</p><p>We looked at hardware problems. Memory errors in particular are a real possibility. But our servers use ECC (Error-Correcting Code) memory which can detect, and in most cases correct, any memory errors that do occur. Furthermore, any memory errors should be recorded in the IPMI logs of the servers. We do see some memory errors on our server fleet, but they were not correlated with the core dumps.</p><p>If not memory errors, then could there be a problem with the processor hardware? We mostly use Intel Xeon processors, of various models. These have a good reputation for reliability, and while the rate of core dumps was low, it seemed like it might be too high to be attributed to processor errors. We searched for reports of similar issues, and asked on the grapevine, but didn’t hear about anything that seemed to match our issue.</p><p>While we were investigating, an <a href="https://lists.debian.org/debian-devel/2017/06/msg00308.html">issue with Intel Skylake processors</a> came to light. But at that time we did not have Skylake-based servers in production, and furthermore that issue related to particular code patterns that were not a common feature of our mystery core dumps.</p><p>Maybe the core dumps were being incorrectly recorded by the Linux kernel, so that a mundane crash due to a bug in our code ended up looking mysterious? But we didn’t see any patterns in the core dumps that pointed to something like this. 
Also, upon an unhandled SIGSEGV, the kernel generates a log line with a small amount of information about the cause, like this:</p>
            <pre><code>segfault at ffffffff810c644a ip 00005600af22884a sp 00007ffd771b9550 error 15 in nginx-fl[5600aeed2000+e09000]</code></pre>
            <p>We checked these log lines against the core dumps, and they were always consistent.</p><p>The kernel has a role in controlling the processor’s <a href="https://en.wikipedia.org/wiki/Memory_management_unit">Memory Management Unit</a> to provide virtual memory to application programs. So kernel bugs in that area can lead to surprising results (and we have encountered such a bug at Cloudflare in a different context). But we examined the kernel code, and searched for reports of relevant bugs against Linux, without finding anything.</p><p>For several weeks, our efforts to find the cause were not fruitful. Due to the very low frequency of the mystery core dumps when considered on a per-server basis, we couldn’t follow the usual last-resort approach to problem solving - changing various possible causative factors in the hope that they make the problem more or less likely to occur. We needed another lead.</p>
    <div>
      <h3>The solution</h3>
      <a href="#the-solution">
        
      </a>
    </div>
    <p>But eventually, we noticed something crucial that we had missed until that point: all of the mystery core dumps came from servers containing <a href="https://ark.intel.com/products/91767/Intel-Xeon-Processor-E5-2650-v4-30M-Cache-2_20-GHz">the Intel Xeon E5-2650 v4</a>. This model belongs to the generation of Intel processors that had the codename “Broadwell”, and it’s the only model of that generation that we use in our edge servers, so we simply call these servers Broadwells. The Broadwells made up about a third of our fleet at that time, and they were in many of our datacenters. This explains why the pattern was not immediately obvious.</p><p>This insight immediately threw the focus of our investigation back onto the possibility of processor hardware issues. We downloaded <a href="https://www.intel.co.uk/content/www/uk/en/processors/xeon/xeon-e5-v4-spec-update.html">Intel’s Specification Update</a> for this model. In these Specification Update documents Intel discloses all the ways that its processors deviate from their published specifications, whether due to benign discrepancies or bugs in the hardware (Intel entertainingly calls these “errata”).</p><p>The Specification Update described 85 issues, most of which are obscure issues of interest mainly to the developers of the BIOS and operating systems. But one caught our eye: “BDF76 An Intel® Hyper-Threading Technology Enabled Processor May Exhibit Internal Parity Errors or Unpredictable System Behavior”. The symptoms described for this issue are very broad (“unpredictable system behavior may occur”), but what we were observing seemed to match the description of this issue better than any other.</p><p>Furthermore, the Specification Update stated that BDF76 was fixed in a microcode update. Microcode is firmware that controls the lowest-level operation of the processor, and can be updated by the BIOS (from the system vendor) or the OS. 
Microcode updates can change the behaviour of the processor to some extent (exactly how much is a closely-guarded secret of Intel, although <a href="https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf">the recent microcode updates to address the Spectre vulnerability</a> give some idea of the impressive degree to which Intel can reconfigure the processor’s behaviour).</p><p>The most convenient way for us to apply the microcode update to our Broadwell servers at that time was via a BIOS update from the server vendor. But rolling out a BIOS update to so many servers in so many data centers takes some planning and time to conduct. Due to the low rate of mystery core dumps, we would not know if BDF76 was really the root cause of our problems until a significant fraction of our Broadwell servers had been updated. A couple of weeks of keen anticipation followed while we awaited the outcome.</p><p>To our great relief, once the update was completed, the mystery core dumps stopped. This chart shows the number of core dumps we were getting each day for the relevant months of 2017:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3swQhL1euB2VpWjtlom4Yh/0be3eebcb41da11502e23af45ce08eb7/coredumps.png" />
            
            </figure><p>As you can see, after the microcode update there is a marked reduction in the rate of core dumps. But we still get some core dumps. These are not mysteries, but represent conventional issues in our software. We continue to investigate and fix them to ensure they don’t represent security issues in our service.</p>
    <div>
      <h3>The conclusion</h3>
      <a href="#the-conclusion">
        
      </a>
    </div>
    <p>Eliminating the mystery core dumps has made it easier to focus on any remaining crashes that are due to our code. It removes the temptation to dismiss a core dump because its cause is obscure.</p><p>And for some of the core dumps that we see now, understanding the cause can be very challenging. They correspond to very unlikely conditions, and often involve a root cause that is distant from the immediate issue that triggered the core dump. For example, we see segfaults in LuaJIT (which we embed in NGINX via OpenResty) that are not due to problems in LuaJIT, but rather because LuaJIT is particularly susceptible to damage to its data structures by bugs in unrelated C code.</p><p><i>Excited by core dump detective work? Or building systems at a scale where once-in-a-decade problems can get triggered every day? Then </i><a href="https://www.cloudflare.com/careers/"><i>join our team</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">39XoGxaU2MKp2cinJiQEXb</guid>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[The end of the road for Server: cloudflare-nginx]]></title>
            <link>https://blog.cloudflare.com/end-of-the-road-for-cloudflare-nginx/</link>
            <pubDate>Mon, 11 Dec 2017 14:00:00 GMT</pubDate>
            <description><![CDATA[ Six years ago when I joined Cloudflare the company had a capital F, about 20 employees, and a software stack that was mostly NGINX, PHP and PowerDNS (there was even a little Apache).  ]]></description>
            <content:encoded><![CDATA[ <p>Six years ago when I joined Cloudflare the company had a capital F, about 20 employees, and a software stack that was mostly NGINX, PHP and PowerDNS (there was even a little Apache). Today, things are quite different.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UEWCugiE60sJmoWgMYpBv/b17ace62d5ae01b8ddc78b8c4f0d5aa9/3733477729_fa27bde743_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/zoramite/3733477729/in/photolist-6FV3RF-CqdQpX-4roTxQ-qzv62p-9uUbhA-rE2Mda-2fBsw1-k1XGt-9NHJGZ-jXtiMZ-bLV5QH-oMQho4-2xxq76-dYNEZq-kzq28Y-jqrziP-qmeRsG-boVvyx-eubDfn-dML8WR-7GhV1v-qNf6aL-avp1Tz-eeRo4Z-qNJH1n-b2BN1r-T8fJ2a-dLe1uE-dPmYEQ-gDk2pQ-89woTE-nakE9N-6E6snR-hw8v4e-hdq5No-Yzai76-5URDVP-9rTNQD-agYaM2-d7TVdd-ayKcf5-6noHQy-eg545m-9Wmvpo-iKLTMQ-kymTqW-WMcZJd-dFuhFJ-6E3wLd-gWHXxH">image</a> by <a href="https://www.flickr.com/photos/zoramite/">Randy Merrill</a></p><p>The F got lowercased, there are now more than 500 people and the software stack has changed radically. PowerDNS is gone and has been replaced with our own DNS server, <a href="/tag/rrdns/">RRDNS</a>, written in Go. The PHP code that used to handle the business logic of dealing with our customers’ HTTP requests is now Lua code, Apache is long gone and new technologies like <a href="https://www.cloudflare.com/website-optimization/railgun/">Railgun</a>, <a href="https://www.cloudflare.com/products/cloudflare-warp/">Warp</a>, <a href="https://www.cloudflare.com/argo/">Argo</a> and Tiered Cache have been added to our ‘edge’ stack.</p><p>And yet our servers still identify themselves in HTTP responses with</p>
            <pre><code>Server: cloudflare-nginx</code></pre>
            <p>Of course, NGINX is still a part of our stack, but the code that handles HTTP requests goes well beyond the capabilities of NGINX alone. It’s also not hard to imagine a time when the role of NGINX diminishes further. We currently run four instances of NGINX on each edge machine (one for SSL, one for non-SSL, one for caching and one for connections between data centers). We used to have a fifth, but it’s been deprecated, and we are planning to merge the SSL and non-SSL instances.</p><p>As we have done with other bits of software (such as the <a href="http://fallabs.com/kyototycoon/">KyotoTycoon</a> distributed key-value store or PowerDNS) we’re quite likely to write our own caching or web serving code at some point. The time may come when we no longer use NGINX for caching, for example. And so, now is a good time to switch away from <code>Server: cloudflare-nginx</code>.</p><p>We like to write our own when the cost of customizing or configuring existing open source software becomes too high. For example, we switched away from PowerDNS because it was becoming too complicated to implement all the logic we need for the services we provide.</p><p>Over the next month we will be transitioning to simply:</p>
            <pre><code>Server: cloudflare</code></pre>
            <p>If you have software that looks for <code>cloudflare-nginx</code> in the <code>Server</code> header, it’s time to update it.</p><p>We’ve worked closely with companies that rely on the <code>Server</code> header to determine whether a website, application or API uses Cloudflare, so that their software or service can be updated. We’ll be rolling out this change in stages between December 18, 2017 and January 15, 2018. Between those dates Cloudflare-powered HTTP responses may contain either <code>Server: cloudflare-nginx</code> or <code>Server: cloudflare</code>.</p> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">5vx7GVqTU8t5yGDpLVSbrD</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Perfect locality and three epic SystemTap scripts]]></title>
            <link>https://blog.cloudflare.com/perfect-locality-and-three-epic-systemtap-scripts/</link>
            <pubDate>Tue, 07 Nov 2017 10:15:00 GMT</pubDate>
            <description><![CDATA[ In a recent blog post we discussed epoll behavior causing uneven load among NGINX worker processes. We suggested a workaround: the REUSEPORT socket option. ]]></description>
            <content:encoded><![CDATA[ <p>In a recent blog post we discussed <a href="/the-sad-state-of-linux-socket-balancing/">epoll behavior causing uneven load</a> among NGINX worker processes. We suggested a workaround: the REUSEPORT socket option. It changes the queuing from the "combined queue model" aka Waitrose (formally: <a href="http://people.revoledu.com/kardi/tutorial/Queuing/MMs-Queuing-System.html">M/M/s</a>), to a dedicated accept queue per worker aka "the Tesco superstore model" (formally: <a href="http://people.revoledu.com/kardi/tutorial/Queuing/MMs-Queuing-System.html">M/M/1</a>). With this setup the load is spread more evenly, but in certain conditions the latency distribution might suffer.</p><p>After reading that piece, a colleague of mine, John, said: <i>"Hey Marek, don't forget that REUSEPORT has an additional advantage: it can improve packet locality! Packets can avoid being passed around CPUs!"</i></p><p>John had a point. Let's dig into this step by step.</p><p>In this blog post we'll explain the REUSEPORT socket option, how it can help with packet locality, and its performance implications. We'll show three advanced SystemTap scripts that we used to help us understand and measure packet locality.</p>
    <div>
      <h3>A shared queue</h3>
      <a href="#a-shared-queue">
        
      </a>
    </div>
    <p>The standard BSD socket API model is rather simple. In order to receive new TCP connections a program calls bind() and then listen() on a fresh socket. This will create a single accept queue. Programs can share the file descriptor - pointing to one kernel data structure - among multiple processes to spread the load. As we've <a href="/the-sad-state-of-linux-socket-balancing/">seen in a previous blog post</a> connections might not be distributed perfectly. Still, this allows programs to scale up processing power from a limited single-process, single-CPU design.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71jaXye3QUzA1zqSVuZRmQ/63baa3f7fd5e645911631646312d8b59/mms-1.jpeg.jpeg" />
            
            </figure><p>Modern network cards split the inbound packets across multiple RX queues, allowing multiple CPUs to share interrupt and packet processing load. Unfortunately, in the standard BSD API, the new connections will all be funneled back to a single accept queue, causing a potential bottleneck.</p>
    <div>
      <h3>Introducing REUSEPORT</h3>
      <a href="#introducing-reuseport">
        
      </a>
    </div>
    <p>This bottleneck was identified at Google, where an application was reported to be dealing with 40,000 connections per second. Google kernel hackers fixed it by <a href="https://lwn.net/Articles/542629/">adding TCP support for the SO_REUSEPORT socket option</a> in Linux kernel 3.9.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5B2G4N5rR8U4TrLkekzX9g/fbb42fc0ba73ec2b244e8be84e8741f8/mm1-1.jpeg.jpeg" />
            
            </figure><p>REUSEPORT allows the application to set up multiple accept queues on a single TCP listen port. This removes the central bottleneck and enables the CPUs to do more work in parallel.</p>
    <div>
      <h3>REUSEPORT locality</h3>
      <a href="#reuseport-locality">
        
      </a>
    </div>
    <p>Initially there was no way to influence the load balancing algorithm. While REUSEPORT allowed setting up a dedicated accept queue for each worker process, it wasn't possible to influence which packets would go into them. New connections flowing into the network stack would be distributed using only the usual 5-tuple hash. Packets from any of the RX queues, hitting any CPU, might flow into any of the accept queues.</p><p>This changed in Linux kernel 4.4 with the introduction of the <a href="https://patchwork.ozlabs.org/patch/528071/">SO_INCOMING_CPU settable socket option</a>. Now a userspace program could add a hint to make the packets received on a specific CPU go to a specific accept queue. With this improvement the accept queue won't need to be shared across multiple cores, improving CPU cache locality and fixing lock contention issues.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qVKKTNlhoSvrjKN2peAYH/a193a7185a4b1cb09f82fd9707df1206/mm1-local-1.jpeg.jpeg" />
            
            </figure><p>There are other benefits - with proper tuning it is possible to keep the processing of packets belonging to entire connections local. Think about it like this: if a SYN packet was received on some CPU, it is likely that further packets for this connection will also be delivered to the same CPU<a href="#fn1">[1]</a>. Therefore, making sure that the worker calling accept() runs on that same CPU has strong advantages. With the right tuning all processing of the connection might be performed on a single CPU. This can help keep the CPU cache warm, reduce cross-CPU interrupts and boost the performance of memory allocation algorithms.</p><p>The SO_INCOMING_CPU interface is pretty rudimentary and was deemed unsuitable for more complex usage. It was superseded by the more powerful <a href="https://lwn.net/Articles/675043/">SO_ATTACH_REUSEPORT_CBPF option</a> (and its extended variant: SO_ATTACH_REUSEPORT_EBPF) in kernel 4.6. These flags allow a program to specify a fully functional BPF program as a load balancing algorithm.</p><p>Beware that the introduction of SO_ATTACH_REUSEPORT_[CE]BPF broke SO_INCOMING_CPU. Nowadays there isn't a choice - you have to use the BPF variants to get the intended behavior.</p>
    <div>
      <h3>Setting CBPF on NGINX</h3>
      <a href="#setting-cbpf-on-nginx">
        
      </a>
    </div>
    <p>NGINX in "reuseport" mode doesn't set the advanced socket options that increase packet locality. John suggested that improving packet locality is beneficial for performance. We must verify such a bold claim!</p><p>We wanted to play with setting a couple of SO_ATTACH_REUSEPORT_CBPF BPF programs. We didn't want to hack the NGINX sources though. After some tinkering we decided it would be easier to write a SystemTap script to set the option from <i>outside</i> the server process. This turned out to be a big mistake!</p><p>After plenty of work, and numerous kernel panics caused by our buggy scripts (running in "guru" mode), we finally managed to get a SystemTap script into working order: it calls "setsockopt" with the right parameters. It's one of the most complex scripts we've written so far. Here it is:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2017-11-perfect-locality/setcbpf.stp">setcbpf.stp</a></p></li></ul><p>We tested it on kernel 4.9. It sets the following CBPF (classical BPF) load balancing program on the REUSEPORT socket group. Sockets received on the Nth CPU will be passed to the Nth member of the REUSEPORT group:</p>
            <pre><code>A = #cpu
A = A % &lt;reuseport group size&gt;
return A</code></pre>
            <p>The SystemTap script takes three parameters: pid, file descriptor and REUSEPORT group size. To figure out the pid of a process and a file descriptor number, use the "ss" tool:</p>
            <pre><code>$ ss -4nlp -t 'sport = :8181' | sort
LISTEN  0   511    *:8181  *:*   users:(("nginx",pid=29333,fd=3),...
LISTEN  0   511    *:8181  *:*   ...
...</code></pre>
            <p>In this listing we see that pid=29333 fd=3 points to a REUSEPORT descriptor bound to port tcp/8181. On our test machine we have 24 logical CPUs (including HT) and we run 12 NGINX workers, so the group size is 12. An example invocation of the script:</p>
            <pre><code>$ sudo stap -g setcbpf.stp 29333 3 12</code></pre>
            
    <div>
      <h3>Measuring performance</h3>
      <a href="#measuring-performance">
        
      </a>
    </div>
    <p>Unfortunately on Linux it's pretty hard to verify if setting CBPF actually does anything. To understand what's going on we wrote another SystemTap script. It hooks into a process and prints all successful invocations of the accept() function, including the CPU on which the connection was delivered to the kernel, and the current CPU, on which the application is running. The idea is simple - if they match, we'll have good locality!</p><p>The script:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2017-11-perfect-locality/accept.stp">accept.stp</a></p></li></ul><p>Before setting the CBPF socket option on the server, we saw this output:</p>
            <pre><code>$ sudo stap -g accept.stp nginx|grep "cpu=#12"
cpu=#12 pid=29333 accept(3) -&gt; fd=30 rxcpu=#19
cpu=#12 pid=29333 accept(3) -&gt; fd=31 rxcpu=#21
cpu=#12 pid=29333 accept(3) -&gt; fd=32 rxcpu=#16
cpu=#12 pid=29333 accept(3) -&gt; fd=33 rxcpu=#22
cpu=#12 pid=29333 accept(3) -&gt; fd=34 rxcpu=#19
cpu=#12 pid=29333 accept(3) -&gt; fd=35 rxcpu=#21
cpu=#12 pid=29333 accept(3) -&gt; fd=37 rxcpu=#16</code></pre>
            <p>We can see accept()s done from a worker on CPU #12 returning client sockets received on other CPUs: #19, #21, #16 and so on.</p><p>Now, let's set CBPF and see the results:</p>
            <pre><code>$ sudo stap -g setcbpf.stp `pidof nginx -s` 3 12
[+] Pid=29333 fd=3 group_size=12 setsockopt(SO_ATTACH_REUSEPORT_CBPF)=0

$ sudo stap -g accept.stp nginx|grep "cpu=#12"
cpu=#12 pid=29333 accept(3) -&gt; fd=30 rxcpu=#12
cpu=#12 pid=29333 accept(3) -&gt; fd=31 rxcpu=#12
cpu=#12 pid=29333 accept(3) -&gt; fd=32 rxcpu=#12
cpu=#12 pid=29333 accept(3) -&gt; fd=33 rxcpu=#12
cpu=#12 pid=29333 accept(3) -&gt; fd=34 rxcpu=#12
cpu=#12 pid=29333 accept(3) -&gt; fd=35 rxcpu=#12
cpu=#12 pid=29333 accept(3) -&gt; fd=36 rxcpu=#12</code></pre>
            <p>Now the situation is perfect. All accept()s called from the NGINX worker pinned to CPU #12 got client sockets received on the same CPU.</p><p>But does it actually help with the performance?</p><p>Sadly: no. We ran a number of tests (using the setup introduced in the previous blog post), but we weren't able to record any significant performance difference. Compared to the other costs incurred by running a high level HTTP server, the couple of microseconds shaved off by keeping connections local to a CPU doesn't seem to make a measurable difference.</p>
    <div>
      <h3>Measuring packet locality</h3>
      <a href="#measuring-packet-locality">
        
      </a>
    </div>
    <p>But no, we didn't give up!</p><p>Not being able to measure an end-to-end performance gain, we decided to try another approach: why not measure packet locality itself?</p><p>Measuring locality is tricky. In certain circumstances a packet can cross multiple CPUs on its way down the networking stack. Fortunately we can simplify the problem. Let's define "packet locality" as the probability of a packet (to be specific: the Linux sk_buff data structure, the skb) being allocated and freed on the same CPU.</p><p>For this, we wrote yet another SystemTap script:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2017-11-perfect-locality/locality.stp">locality.stp</a></p></li></ul><p>When run without the CBPF option, the script gave us these results:</p>
            <pre><code>$ sudo stap -g locality.stp 12
rx= 21%   29kpps tx=  9%  24kpps
rx=  8%  130kpps tx=  8% 131kpps
rx= 11%  132kpps tx=  9% 126kpps
rx= 10%  128kpps tx=  8% 127kpps
rx= 10%  129kpps tx=  8% 126kpps
rx= 11%  132kpps tx=  9% 127kpps
rx= 11%  129kpps tx= 10% 128kpps
rx= 10%  130kpps tx=  9% 127kpps
rx= 12%   94kpps tx=  8%  90kpps</code></pre>
            <p>During our test the HTTP server received about 130,000 packets per second and transmitted about as many. 10-11% of the received and 8-10% of the transmitted packets had good locality, i.e. were allocated and freed on the same CPU.</p><p>Achieving good locality is not that easy. On the RX side, the packet must be received on the same CPU as the application that will read() it. On the transmission side it's even trickier. In the case of TCP, a piece of data must be sent() by the application, get transmitted, and have its ACK received from the other party, all on the same CPU.</p><p>We performed a bit of tuning, which included inspecting:</p><ul><li><p>the number of RSS queues and their interrupts being pinned to the right CPUs</p></li><li><p>the indirection table</p></li><li><p>correct XPS settings on the TX path</p></li><li><p>NGINX workers being pinned to the right CPUs</p></li><li><p>NGINX using the REUSEPORT bind option</p></li><li><p>and finally setting CBPF on the REUSEPORT sockets</p></li></ul><p>We were able to achieve almost perfect locality! With all the tweaks done, the script output looked much better:</p>
            <pre><code>$ sudo stap -g locality.stp 12
rx= 99%   18kpps tx=100%  12kpps
rx= 99%  118kpps tx= 99% 115kpps
rx= 99%  132kpps tx= 99% 129kpps
rx= 99%  138kpps tx= 99% 136kpps
rx= 99%  140kpps tx=100% 134kpps
rx= 99%  138kpps tx= 99% 135kpps
rx= 99%  139kpps tx=100% 137kpps
rx= 99%  139kpps tx=100% 135kpps
rx= 99%   77kpps tx= 99%  74kpps</code></pre>
            <p>Now the test runs at about 138,000 packets per second received and transmitted, with a whopping 99% packet locality.</p><p>As for the performance difference in practice: it's too small to measure. Even though we received about 7% more packets, the end-to-end tests didn't show a meaningful speed boost.</p>
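    <p>As an aside, the locality metric itself is trivial to compute once you have the (allocation CPU, free CPU) pairs; the hard part, which the stap script does, is collecting them from inside the kernel. A toy illustration with made-up samples:</p>

```python
def locality_pct(samples):
    """Percentage of skbs allocated and freed on the same CPU."""
    samples = list(samples)
    if not samples:
        return 0.0
    local = sum(1 for alloc_cpu, free_cpu in samples if alloc_cpu == free_cpu)
    return 100.0 * local / len(samples)

# Made-up (alloc CPU, free CPU) samples: before tuning, packets wander
# across CPUs...
before = [(19, 12), (21, 12), (16, 12), (12, 12), (22, 12)]
# ...after pinning RSS queues, XPS and workers, they mostly stay put.
after = [(12, 12)] * 99 + [(12, 3)]

print('before: %.0f%%' % locality_pct(before))
print('after:  %.0f%%' % locality_pct(after))
```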
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We weren't able to prove definitively whether improving packet locality actually improves performance for a high-level TCP application like an HTTP server. In hindsight it makes sense: the added benefit is minuscule compared to the overhead of running an HTTP server, especially one with logic in a high level language like Lua.</p><p>This hasn't stopped us from having fun! We (myself, Gilberto Bertin and David Wragg) wrote three pretty cool SystemTap scripts, which are super useful when debugging Linux packet locality. They may come in handy for demanding users, for example those running high performance UDP servers or doing high frequency trading.</p><p>Most importantly, in the process we learned a lot about the Linux networking stack. We got to practice writing CBPF scripts, and learned how to measure locality with hackish SystemTap scripts. We were reminded of the obvious: out of the box, Linux is remarkably well tuned.</p><hr /><p><i>Does dealing with the internals of Linux and NGINX sound interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco and our elite office in Warsaw, Poland</i>.</p><hr /><hr /><ol><li><p>We are not taking into account aRFS - accelerated RFS. <a href="#fnref1">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">1FApU5GLrcapOqChAGtO0E</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why does one NGINX worker take all the load?]]></title>
            <link>https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/</link>
            <pubDate>Mon, 23 Oct 2017 12:57:36 GMT</pubDate>
            <description><![CDATA[ Scaling up TCP servers is usually straightforward. Most deployments start by using a single process setup. When the need arises more worker processes are added.  ]]></description>
            <content:encoded><![CDATA[ <p>Scaling up TCP servers is usually straightforward. Most deployments start by using a single process setup. When the need arises more worker processes are added. This is a scalability model for many applications, including HTTP servers like Apache, NGINX or Lighttpd.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1nJQZt9rw38jtA65oSJL2M/35021ac4d6ba6b0e7dac0881e9e70eb1/37470469351_49281d8b66_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/brizzlebornandbred/37470469351/in/photolist-WWESK-arHswn-YCzFLW-Z68X62-Ys7F95-5PkLkJ-7WevtH-reYKNA-bHdogM-PQtEbn-PrTv2Q-PQtEG2-P39C9u-P39CCL-X1waW-ynbVS2-27aa3n-7qy2iy-47YUTQ-u6Su9-Py1w8y-Py4WNE-PzgZY7-PnSq1A-PquMda-P39D4f-NnLGfG-Pssk7K-9qPn21">image</a> by <a href="https://www.flickr.com/photos/brizzlebornandbred/">Paul Townsend</a></p><p>Increasing the number of worker processes is a great way to overcome a single CPU core bottleneck, but opens a whole new set of problems.</p><p>There are generally three ways of designing a TCP server with regard to performance:</p><p>(a) Single listen socket, single worker process.</p><p>(b) Single listen socket, multiple worker processes.</p><p>(c) Multiple worker processes, each with separate listen socket.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3V8kyqCxWxtkfmGRx8crj1/772a19786cfddd5c111ba868b5470522/worker1.png" />
            
            </figure><p>(a) <b>Single listen socket, single worker process</b> This is the simplest model, where processing is limited to a single CPU. A single worker process is doing both accept() calls to receive the new connections and processing of the requests themselves. This model is the preferred Lighttpd setup.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3AyXvT4YAP3TExZCEHVIs2/5c25d8c269437f9936b5591a59a53644/worker2.png" />
            
            </figure><p>(b) <b>Single listen socket, multiple worker processes</b> The new connections sit in a single kernel data structure (the listen socket). Multiple worker processes are doing both the accept() calls and processing of the requests. This model enables some spreading of the inbound connections across multiple CPUs. This is the standard model for NGINX.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7g6h7dlOAQpesnC2hmTotP/fe587e2b60bb8c77b4bd0f638f7e290a/worker3.png" />
            
            </figure><p>(c) <b>Separate listen socket for each worker process</b> By using <a href="https://lwn.net/Articles/542629/">the SO_REUSEPORT socket option</a> it's possible to create a dedicated kernel data structure (the listen socket) for each worker process. This can avoid listen socket contention, which isn't really an issue unless you run at Google scale. It can also help in better balancing the load. More on that later.</p><p>At Cloudflare we run NGINX, and we are most familiar with the (b) model. In this blog post we'll describe a specific problem with this model, but let's start from the beginning.</p>
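    <p>Model (c) is straightforward to sketch: every worker opens its own listening socket with SO_REUSEPORT set before bind(), and the kernel gives each one a private accept queue on the same port. A minimal sketch (Linux-only; two listeners stand in for two workers, which in a real server would each be created after fork()):</p>

```python
import socket

def reuseport_listener(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT must be set before bind(), on every socket in the group.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(('127.0.0.1', port))
    s.listen(128)
    return s

# Two listeners standing in for two worker processes:
a = reuseport_listener(0)        # let the kernel pick a free port
port = a.getsockname()[1]
b = reuseport_listener(port)     # second accept queue on the very same port
```

    <p>Each incoming connection is hashed by the kernel to exactly one of the sockets in the group, so it lands in exactly one worker's accept queue.</p>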
    <div>
      <h3>Spreading the accept() load</h3>
      <a href="#spreading-the-accept-load">
        
      </a>
    </div>
    <p>Not many people realize that there are two different ways of spreading the accept() new connection load across multiple processes. Consider these two code snippets. Let's call the first one <i>blocking-accept</i>. It's best described with this pseudo code:</p>
<pre><code>import os
import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.bind(('127.0.0.1', 1024))
sd.listen(32)
for i in range(3):
    if os.fork() == 0:
        while True:
            cd, _ = sd.accept()
            cd.close()
            print('worker %d' % i)</code></pre>
            <p>The idea is to share an accept queue across processes, by calling blocking accept() from multiple workers concurrently.</p><p>The second model should be called <i>epoll-and-accept</i>:</p>
<pre><code>import os
import select
import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.bind(('127.0.0.1', 1024))
sd.listen(32)
sd.setblocking(False)
for i in range(3):
    if os.fork() == 0:
        ed = select.epoll()
        ed.register(sd, select.EPOLLIN | select.EPOLLEXCLUSIVE)
        while True:
            ed.poll()
            try:
                cd, _ = sd.accept()
            except BlockingIOError:
                continue  # spurious wakeup
            cd.close()
            print('worker %d' % i)</code></pre>
            <p>The intention is to have a dedicated epoll in each worker process. The worker will call non-blocking accept() only when epoll reports new connections. We can avoid <a href="https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/">the usual thundering-herd issue</a> by using the EPOLLEXCLUSIVE flag.</p><p>(<a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing">Full code is available here</a>)</p><p>While these programs look similar, their behavior differs subtly<a href="#fn1">[1]</a> <a href="#fn2">[2]</a>. Let's see what happens when we establish a couple of connections to each:</p>
            <pre><code>$ ./blocking-accept.py &amp;
$ for i in `seq 6`; do nc localhost 1024; done
worker 2
worker 1
worker 0
worker 2
worker 1
worker 0

$ ./epoll-and-accept.py &amp;
$ for i in `seq 6`; do nc localhost 1024; done
worker 0
worker 0
worker 0
worker 0
worker 0
worker 0</code></pre>
            <p>The <i>blocking-accept</i> model distributed connections across all workers - each got exactly 2 connections. The <i>epoll-and-accept</i> model on the other hand forwarded all the connections to the first worker. The remaining workers got no traffic whatsoever.</p><p>It might catch you by surprise, but Linux does different load balancing in the two cases.</p><p>In the first one Linux will do proper FIFO-like <a href="https://www.cloudflare.com/learning/dns/glossary/round-robin-dns/">round robin load balancing</a>. Each process waiting on accept() is added to a queue, and they will be served connections in order.</p><p>In the <i>epoll-and-accept</i> case the load balancing algorithm differs: Linux seems to choose the most recently added process, a LIFO-like behavior. The process added to the waiting queue most recently will get the new connection. This causes the busiest process, the one that only just went back to the event loop, to receive the majority of the new connections. Therefore, the busiest worker is likely to get most of the load.</p><p>In fact, this is what we see in NGINX. Here's a dump of one of our synthetic tests where one worker is taking most of the load, while the others are relatively underutilized:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48ZLQMh607BBPPkosXK1fL/517efe3c29ce4e152a6150b0fbd25c51/sharedqueue.png" />
            
            </figure><p>Notice the last worker got almost no load, while the busiest is using 30% of CPU.</p>
    <div>
      <h3>SO_REUSEPORT to the rescue</h3>
      <a href="#so_reuseport-to-the-rescue">
        
      </a>
    </div>
    <p>Linux supports a feature to work around this balancing problem - the SO_REUSEPORT socket option. We described this earlier as the (c) model, where the incoming connections are split into multiple separate accept queues, usually one dedicated queue for each worker process.</p><p>Since the accept queues are not shared, and Linux spreads the load with simple hashing logic, each worker will get statistically the same number of incoming connections. This results in much better <a href="https://www.cloudflare.com/application-services/products/load-balancing/">balancing of the load</a>. Each worker gets roughly a similar amount of traffic to handle:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5qhELjLr86d9VIXg7LoFge/03244209b07489c2d9b6134da0651b69/reuseport.png" />
            
            </figure><p>Here all the workers are handling the work, with the busiest at 13.2% CPU and the least busy at 9.3%. Much better load distribution than before.</p><p>This is better, but balancing the load is not the end of the story. Splitting the accept queue worsens the latency distribution in some circumstances! It's best explained with a supermarket checkout analogy:</p><p>I call this problem the Waitrose vs Tesco Superstore cashiers. The Waitrose "combined queue" model is better at reducing the maximum <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a>. A single clogged cashier will not significantly affect the latency of the whole system. The remainder of the load will be spread across the other, less busy cashiers, and the shared queue will be drained relatively promptly. On the other hand the Tesco Superstore model - a separate queue for each cashier - will suffer from a large latency issue. If a single cashier gets blocked, all the traffic waiting in its queue will stall. The maximum latency will grow if any single queue gets stuck.</p><p>Under increased load the single accept queue model (b), while not balancing the load evenly, is better for latency. We can show this by running another synthetic benchmark. Here is the latency distribution for 100k relatively CPU-intensive <a href="https://www.cloudflare.com/learning/ddos/glossary/hypertext-transfer-protocol-http/">HTTP requests</a>, with HTTP keepalives disabled, concurrency set to 200, running against a single-queue (b) NGINX:</p>
            <pre><code>$ ./benchhttp -n 100000 -c 200 -r target:8181 http://a.a/
        | cut -d " " -f 1
        | ./mmhistogram -t "Duration in ms (single queue)"
min:3.61 avg:30.39 med=30.28 max:72.65 dev:1.58 count:100000
Duration in ms (single queue):
 value |-------------------------------------------------- count
     0 |                                                   0
     1 |                                                   0
     2 |                                                   1
     4 |                                                   16
     8 |                                                   67
    16 |************************************************** 91760
    32 |                                              **** 8155
    64 |                                                   1</code></pre>
            <p>As you can see the latency is very predictable. The median is almost equal to the average, and the standard deviation is small.</p><p>Here is the same test run against the SO_REUSEPORT multi-queue NGINX setup (c):</p>
            <pre><code>$ ./benchhttp -n 100000 -c 200 -r target:8181 http://a.a/
        | cut -d " " -f 1
        | ./mmhistogram -t "Duration in ms (multiple queues)"
min:1.49 avg:31.37 med=24.67 max:144.55 dev:25.27 count:100000
Duration in ms (multiple queues):
 value |-------------------------------------------------- count
     0 |                                                   0
     1 |                                                 * 1023
     2 |                                         ********* 5321
     4 |                                 ***************** 9986
     8 |                  ******************************** 18443
    16 |    ********************************************** 25852
    32 |************************************************** 27949
    64 |                              ******************** 11368
   128 |                                                   58</code></pre>
            <p>The average is comparable, the median dropped, the max value significantly increased, and most importantly the deviation is now gigantic!</p><p>The latency distribution is all over the place - it's not something you want to have on a production server.</p><p>(<a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing">Instructions to reproduce the test</a>)</p><p>Take this benchmark with a grain of salt though. We try to generate substantial load in order to prove the point. Depending on your setup it might be possible to shield your server from excessive traffic and prevent it from entering this degraded-latency state<a href="#fn3">[3]</a>.</p>
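    <p>The Waitrose vs Tesco effect can be reproduced with a toy queueing simulation (illustrative numbers only, not our benchmark): four cashiers, one of them ten times slower than the rest, and jobs arriving at a fixed rate. The combined queue routes work around the slow cashier, while separate round-robin queues let an unbounded backlog build behind it:</p>

```python
import heapq

def simulate(shared, n_jobs=2000, n_servers=4, interarrival=1.0,
             service=3.0, slow=30.0):
    """Return the worst job latency; cashier 0 is clogged (10x slower)."""
    svc = lambda s: slow if s == 0 else service
    worst = 0.0
    if shared:
        # Waitrose: one combined queue; each job goes to the first free cashier.
        free = [(0.0, s) for s in range(n_servers)]
        heapq.heapify(free)
        for i in range(n_jobs):
            arrival = i * interarrival
            t, s = heapq.heappop(free)
            done = max(arrival, t) + svc(s)
            heapq.heappush(free, (done, s))
            worst = max(worst, done - arrival)
    else:
        # Tesco: a private queue per cashier; jobs assigned round-robin
        # (standing in for the kernel's hashing).
        free = [0.0] * n_servers
        for i in range(n_jobs):
            arrival = i * interarrival
            s = i % n_servers
            done = max(arrival, free[s]) + svc(s)
            free[s] = done
            worst = max(worst, done - arrival)
    return worst

print('combined queue, worst latency:  %8.1f' % simulate(shared=True))
print('separate queues, worst latency: %8.1f' % simulate(shared=False))
```

    <p>In the combined-queue run the worst latency stays around a single slow service time, while in the separate-queue run the slow cashier's queue grows without bound - the same shape our benchmark histograms show.</p>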
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Balancing the incoming connections across multiple application workers is far from a solved problem. The single queue approach (b) scales well and keeps the max latency in check, but due to the epoll LIFO behavior the worker processes won't be evenly load balanced.</p><p>For workloads that require even balancing it might be beneficial to use the SO_REUSEPORT pattern (c). Unfortunately in high load situations the latency distribution might degrade.</p><p>The best generic solution seems to be to change the standard epoll behavior from LIFO to FIFO. There have been attempts to address this in the past by Jason Baron from Akamai (<a href="https://patchwork.kernel.org/patch/5803291/">1</a>, <a href="https://patchwork.kernel.org/patch/5841231/">2</a>, <a href="https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg831609.html">3</a>), but none has landed in mainline so far.</p><hr /><p><i>Does dealing with the internals of Linux and NGINX sound interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco and our elite office in Warsaw, Poland</i>.</p><hr /><hr /><ol><li><p>Of course comparing blocking accept() with a full featured epoll() event loop is not fair. Epoll is more powerful and allows us to create rich event driven programs. Using blocking accept is rather cumbersome, and to make any sense, blocking accept programs would require careful multi-threaded programming, with a dedicated thread per request. <a href="#fnref1">↩︎</a></p></li><li><p>Another surprise lurking in the corner: using blocking accept() on Linux is technically incorrect! Alan Burlison pointed out that calling <a href="https://www.mail-archive.com/netdev@vger.kernel.org/msg83683.html">close() on a listen socket that has blocking accept()s will not interrupt them</a>. This can result in buggy behavior: you may get <a href="https://bugzilla.kernel.org/show_bug.cgi?id=106241">a successful accept() on a listen socket that no longer exists</a>. When in doubt, avoid using blocking accept() in multithreaded programs. The workaround is to call shutdown() first, but this is not POSIX compliant. It's a mess. <a href="#fnref2">↩︎</a></p></li><li><p>There are a couple of things you need to take into account when using the NGINX reuseport implementation. First, make sure you run 1.13.6+ or have <a href="https://github.com/nginx/nginx/commit/da165aae88601628cef8db1646cd0ce3f0ee661f">this patch applied</a>. Then, remember that due to a <a href="https://lwn.net/Articles/542629/">defect in the Linux TCP REUSEPORT implementation</a>, reducing the number of REUSEPORT queues will cause some waiting TCP connections to be dropped. <a href="#fnref3">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">72wV8FUYMYOwm15C8vFlWT</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[HPACK: the silent killer (feature) of HTTP/2]]></title>
            <link>https://blog.cloudflare.com/hpack-the-silent-killer-feature-of-http-2/</link>
            <pubDate>Mon, 28 Nov 2016 14:10:41 GMT</pubDate>
            <description><![CDATA[ If you have experienced HTTP/2 for yourself, you are probably aware of the visible performance gains possible with HTTP/2 due to features like stream multiplexing, explicit stream dependencies, and Server Push.

 ]]></description>
            <content:encoded><![CDATA[ <p>If you have <a href="https://www.cloudflare.com/website-optimization/http2/">experienced HTTP/2</a> for yourself, you are probably aware of the visible performance gains possible with <a href="https://www.cloudflare.com/website-optimization/http2/">HTTP/2</a> due to features like stream multiplexing, explicit stream dependencies, and <a href="/announcing-support-for-http-2-server-push-2/">Server Push</a>.</p><p>There is, however, one important feature that is not obvious to the eye: HPACK header compression. The current implementation of NGINX, as well as the edge networks and <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDNs</a> using it, does not support the full HPACK implementation. We have since implemented full HPACK in NGINX, and <a href="http://mailman.nginx.org/pipermail/nginx-devel/2015-December/007682.html">upstreamed</a> the part that performs Huffman encoding.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4wGvQogXUWstwB5ETJPpyr/5423d3e1f23daed6c17cc8d79a25ac8d/6129093355_9434beed25_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/conchur/6129093355/in/photolist-akBcVn-6eEtUq-d9Xi2T-c8MwzA-mGVr3z-aVezji-mGVxov-acMJmz-6nvbCJ-6a6DGQ-5uwrr9-rRtCCp-acMKsk-cKeuXY-9F6N13-9GSuHh-acMDBk-cJYes5-cijtcj-akFua9-cK1iqS-fiAHkE-acMCDt-cJZ9Fq-5qtfNv-cJWFiw-e4pfTd-6fzyLV-8RNMTo-p7jJLQ-524dkf-dw52qG-mVDttF-duUbYU-dvXt3Z-c8NkYG-6e6K2S-duKJjZ-5DbaqV-duBiR3-dvGTsM-dvLst4-dvPfJ1-ezYL1J-7f1cMC-bqi5WN-5MMGkV-9F3C7c-eWASXx-cJXsW9">image</a> by <a href="https://www.flickr.com/photos/conchur/">Conor Lawless</a></p><p>This blog post gives an overview of the reasons for the development of HPACK, and the hidden bandwidth and latency benefits it brings.</p>
    <div>
      <h3>Some Background</h3>
      <a href="#some-background">
        
      </a>
    </div>
    <p>As you probably know, a regular HTTPS connection is in fact an overlay of several connections in the multi-layer model. The most basic connection you usually care about is the TCP connection (the transport layer); on top of that you have the TLS connection (a mix of the transport/application layers), and finally the HTTP connection (the application layer).</p><p>In the days of yore, HTTP compression was performed in the TLS layer, using gzip. Both headers and body were compressed indiscriminately, because the lower TLS layer was unaware of the transferred data type. In practice it meant both were compressed with the <a href="/results-experimenting-brotli/">DEFLATE</a> algorithm.</p><p>Then came SPDY with a new, dedicated header compression algorithm. Although specifically designed for headers, including the use of a preset <a href="/improving-compression-with-preset-deflate-dictionary/">dictionary</a>, it was still using DEFLATE, including dynamic Huffman codes and string matching.</p><p>Unfortunately both were found to be vulnerable to the <a href="https://www.ekoparty.org/archive/2012/CRIME_ekoparty2012.pdf">CRIME</a> attack, which can extract secret authentication cookies from compressed headers: because DEFLATE uses backward string matches and dynamic Huffman codes, an attacker that controls part of the request headers can gradually recover a full cookie by modifying parts of the request and observing how the total size of the compressed request changes.</p><p>Most edge networks, including Cloudflare, disabled header compression because of CRIME. That is, until HTTP/2 came along.</p>
    <div>
      <h3>HPACK</h3>
      <a href="#hpack">
        
      </a>
    </div>
    <p>HTTP/2 supports a new dedicated header compression algorithm, called HPACK. HPACK was developed with attacks like CRIME in mind, and is therefore considered safe to use.</p><p>HPACK is resilient to CRIME, because it does not use partial backward string matches and dynamic Huffman codes like DEFLATE. Instead, it uses these three methods of compression:</p><ul><li><p>Static Dictionary: A <a href="https://http2.github.io/http2-spec/compression.html#static.table.definition">predefined dictionary</a> of 61 commonly used header fields, some with predefined values.</p></li><li><p>Dynamic Dictionary: A list of actual headers that were encountered during the connection. This dictionary has a limited size; when new entries are added, old entries might be evicted.</p></li><li><p>Huffman Encoding: A <a href="https://http2.github.io/http2-spec/compression.html#huffman.code">static Huffman code</a> can be used to encode any string: name or value. This code was computed specifically for HTTP Request/Response headers - ASCII digits and lowercase letters are given shorter encodings. The shortest encoding possible is 5 bits long, therefore the highest compression ratio achievable is 8:5 (or 37.5% smaller).</p></li></ul>
    <div>
      <h4>HPACK flow</h4>
      <a href="#hpack-flow">
        
      </a>
    </div>
    <p>When HPACK needs to encode a header in the format <b><i>name:value</i></b>, it will first look in the static and dynamic dictionaries. If the <b><i>full</i></b> <b>name:value</b> is present, it will simply reference the entry in the dictionary. This will usually take one byte, and in almost all cases two bytes suffice! A whole header encoded in a single byte! How crazy is that?</p><p>Since many headers are repetitive, this strategy has a very high success rate. For example, headers like <b>:authority:</b> <a href="http://www.cloudflare.com"><b>www.cloudflare.com</b></a> or the sometimes huge <b>cookie</b> headers are the usual suspects in this case.</p><p>When HPACK can't match a whole header in a dictionary, it will attempt to find a header with the same <b><i>name</i></b>. Most of the popular header names are present in the static table, for example: <b>content-encoding</b>, <b>cookie</b>, <b>etag</b>. The rest are likely to be repetitive and therefore present in the dynamic table. For example, Cloudflare assigns a unique <b>cf-ray</b> header to each response, and while the value of this field is always different, the name can be reused!</p><p>If the name was found, it can again be expressed in one or two bytes in most cases; otherwise the name will be encoded using either raw encoding or Huffman encoding, whichever is shorter. The same goes for the value of the header.</p><p>We found that Huffman encoding alone saves almost 30% of header size.</p><p>Although HPACK does string matching, for an attacker to find the value of a header, they must guess the entire value, instead of the gradual approach that was possible with DEFLATE matching and was vulnerable to CRIME.</p>
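    <p>The lookup order described above can be sketched in a few lines of Python. This is a deliberately tiny model, not a real HPACK encoder: the static entries are a handful of the 61 defined in RFC 7541 Appendix A, the dynamic table is just a dict, and the cookie value is made up:</p>

```python
# A small subset of RFC 7541's static table: full (name, value) entries
# and name-only entries, with their real indices.
STATIC_PAIRS = {(':method', 'GET'): 2, (':path', '/'): 4, (':scheme', 'https'): 7}
STATIC_NAMES = {':authority': 1, 'content-encoding': 26, 'cookie': 32,
                'user-agent': 58}

def encode_decision(name, value, dynamic):
    """Decide how a toy HPACK encoder would represent 'name: value'."""
    if (name, value) in STATIC_PAIRS:      # whole header in the static table
        return ('indexed', STATIC_PAIRS[(name, value)])
    if (name, value) in dynamic:           # whole header seen on this connection
        return ('indexed', dynamic[(name, value)])
    if name in STATIC_NAMES:               # only the name is known
        return ('literal-with-name-index', STATIC_NAMES[name])
    return ('literal', None)               # raw or Huffman-encoded strings

dynamic = {}
print(encode_decision(':method', 'GET', dynamic))         # one-byte index
print(encode_decision('cookie', 'session=xyz', dynamic))  # name indexed, value literal
dynamic[('cookie', 'session=xyz')] = 62                   # inserted for reuse
print(encode_decision('cookie', 'session=xyz', dynamic))  # now fully indexed
```

    <p>The last call is the interesting one: the second time the identical cookie header appears on a connection, it collapses to a one- or two-byte dynamic table reference.</p>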
    <div>
      <h4>Request Headers</h4>
      <a href="#request-headers">
        
      </a>
    </div>
    <p>The gains HPACK provides for HTTP request headers are more significant than for response headers. Request headers get better compression, due to much higher duplication in the headers. For example, here are two requests for our own blog, using Chrome:</p><h6>Request #1:</h6><p><b>:authority:</b> blog.cloudflare.com<br /><b>:method:</b> GET<br /><b>:path:</b> /<br /><b>:scheme:</b> https<br /><b>accept:</b> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8<br /><b>accept-encoding:</b> gzip, deflate, sdch, br<br /><b>accept-language:</b> en-US,en;q=0.8<br /><b>cookie:</b> <i>297 byte cookie</i><br /><b>upgrade-insecure-requests:</b> 1<br /><b>user-agent:</b> Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2853.0 Safari/537.36</p><p>Three fields - <b>:method: GET</b>, <b>:path: /</b> and <b>:scheme: https</b> - are always present in the static dictionary, and will each be encoded in a single byte. Some other fields will have only their names compressed to a byte: <b>:authority</b>, <b>accept</b>, <b>accept-encoding</b>, <b>accept-language</b>, <b>cookie</b> and <b>user-agent</b> are present in the static dictionary. Everything else will be Huffman encoded.</p><p>Headers that were not fully matched will be inserted into the dynamic dictionary, for the following requests to use.</p><p>Let's take a look at a later request:</p><h6>Request #2:</h6><p><b>:authority:</b> blog.cloudflare.com<br /><b>:method:</b> GET<br /><b>:path:</b> /assets/images/cloudflare-sprite-small.png<br /><b>:scheme:</b> https<br /><b>accept:</b> image/webp,image/*,*/*;q=0.8<br /><b>accept-encoding:</b> gzip, deflate, sdch, br<br /><b>accept-language:</b> en-US,en;q=0.8<br /><b>cookie:</b> <i>same 297 byte cookie</i><br /><b>referer:</b> <a href="/assets/css/screen.css?v=2237be22c2">http://blog.cloudflare.com/assets/css/screen.css?v=2237be22c2</a><br /><b>user-agent:</b> Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2853.0 Safari/537.36</p><p>It is clear that most fields repeat between requests. In this case two fields are again matched whole in the static dictionary, and five more repeat from the first request and are therefore matched whole in the dynamic dictionary - each can be encoded in one or two bytes. Among those are the ~300 byte <b>cookie</b> header and the ~130 byte <b>user-agent</b>. That is 430 bytes encoded into a mere 4 bytes: 99% compression!</p><p>All in all, for the repeat request only three short strings will be Huffman encoded.</p><p>This is how ingress header traffic appears on the Cloudflare edge network during a six hour period:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5hRafOHVmAAbUcHkXmB41S/1f601ca581a9f94fb0d8f08f2966bbc7/headers-ingress-1.png" />
            
            </figure><p>On average we see 76% compression for ingress headers. As headers represent the majority of ingress traffic, this also provides substantial savings in total ingress traffic:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8J76zioOehggfZXNukjGQ/3e99553a7156092babe86efa68713333/headers-ingress-2-1.png" />
            
            </figure><p>We can see that total ingress traffic is reduced by 53% as a result of HPACK compression!</p><p>In fact, today we process about the same number of requests over HTTPS for HTTP/1 and HTTP/2, yet the ingress traffic for HTTP/2 is only half that of HTTP/1.</p>
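<p>The dictionary mechanics described above can be sketched in a few lines of Python. This is a toy model, not a real HPACK codec: it uses only a handful of entries from the RFC 7541 static table and ignores Huffman coding entirely, but it shows why a repeated request costs roughly one index byte per matched header.</p>

```python
# Toy model of HPACK indexing (illustrative only, not a real codec).
# A few full entries and name-only entries from the RFC 7541 static table,
# plus a dynamic table that fills up as headers are seen.
STATIC = {(":method", "GET"), (":path", "/"), (":scheme", "https")}
STATIC_NAMES = {":authority", "accept", "accept-encoding",
                "accept-language", "cookie", "user-agent"}

def encode_cost(headers, dynamic):
    """Rough byte cost of one request; updates the dynamic table in place."""
    cost = 0
    for name, value in headers:
        if (name, value) in STATIC or (name, value) in dynamic:
            cost += 1                       # full match: a single index byte
        elif name in STATIC_NAMES or any(n == name for n, _ in dynamic):
            cost += 1 + len(value)          # indexed name + literal value
            dynamic.add((name, value))
        else:
            cost += len(name) + len(value)  # literal name and value
            dynamic.add((name, value))
    return cost

request = [(":method", "GET"), (":path", "/"), (":scheme", "https"),
           (":authority", "blog.cloudflare.com"),
           ("cookie", "x" * 297), ("user-agent", "y" * 130)]

dynamic = set()
first = encode_cost(request, dynamic)   # cookie and user-agent sent as literals
repeat = encode_cost(request, dynamic)  # everything matched: one byte each
print(first, repeat)  # 452 6
```

The first request pays the full literal cost of the long cookie and user-agent values; the repeat request collapses to one index byte per header, which is the effect measured in the charts above.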
    <div>
      <h3>Response Headers</h3>
      <a href="#response-headers">
        
      </a>
    </div>
    <p>For the response headers (egress traffic) the gains are more modest, but still spectacular:</p><h6>Response #1:</h6><p><b>cache-control:</b> public, max-age=30<br/><b>cf-cache-status:</b> HIT<br/><b>cf-h2-pushed:</b> &lt;/assets/css/screen.css?v=2237be22c2&gt;,&lt;/assets/js/jquery.fitvids.js?v=2237be22c2&gt;<br/><b>cf-ray:</b> 2ded53145e0c1ffa-DFW<br/><b>content-encoding:</b> gzip<br/><b>content-type:</b> text/html; charset=utf-8<br/><b>date:</b> Wed, 07 Sep 2016 21:41:23 GMT<br/><b>expires:</b> Wed, 07 Sep 2016 21:41:53 GMT<br/><b>link:</b> &lt;//cdn.bizible.com/scripts/bizible.js&gt;; rel=preload; as=script,&lt;https://code.jquery.com/jquery-1.11.3.min.js&gt;; rel=preload; as=script<br/><b>server:</b> cloudflare-nginx<br/><b>status:</b> 200<br/><b>vary:</b> Accept-Encoding<br/><b>x-ghost-cache-status:</b> From Cache<br/><b>x-powered-by:</b> Express</p><p>The majority of the first response will be Huffman encoded, with some of the field names matched from the static dictionary.</p><h6>Response #2:</h6><p><b>cache-control:</b> public, max-age=31536000<br/><b>cf-bgj:</b> imgq:100<br/><b>cf-cache-status:</b> HIT<br/><b>cf-ray:</b> 2ded53163e241ffa-DFW<br/><b>content-type:</b> image/png<br/><b>date:</b> Wed, 07 Sep 2016 21:41:23 GMT<br/><b>expires:</b> Thu, 07 Sep 2017 21:41:23 GMT<br/><b>server:</b> cloudflare-nginx<br/><b>status:</b> 200<br/><b>vary:</b> Accept-Encoding<br/><b>x-ghost-cache-status:</b> From Cache<br/><b>x-powered-by:</b> Express</p><p>On the second response it is possible to fully match seven of the twelve headers, most of them from the dynamic table. For four of the remaining five, the name can be fully matched, and six strings will be efficiently encoded using the static Huffman encoding.</p><p>Although the two <b>expires</b> headers are almost identical, they can only be Huffman compressed, because they cannot be matched in full.</p><p>The more requests are processed, the bigger the dynamic table becomes and the more headers can be matched, leading to a higher compression ratio.</p><p>This is how egress header traffic appears on the Cloudflare edge:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7GxMxHoPdJGWpyZmqzw6lL/75874c72c47944a121e98517fba5b78b/headers-egress-1.png" />
            
            </figure><p>On average, egress headers are compressed by 69%. The savings in total egress traffic are less significant, however:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MibXK42BowSlnSdhPGZZx/9e82b02d6c1eac9474cbb4d4e55ac919/headers-egress-2-1.png" />
            
            </figure><p>It is difficult to see, but we get 1.4% savings in the total egress HTTP/2 traffic. While it does not look like much, it is still more than <a href="/results-experimenting-brotli/">increasing the compression level for data would give</a> in many cases. This number is also significantly skewed by websites that serve very large files: we measured savings of well over 15% for some websites.</p>
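<p>A quick back-of-the-envelope check ties these two figures together (the header share of egress traffic below is inferred from this post's numbers, not separately measured): if headers make up a fraction of total traffic and compress by a given ratio, the total savings are roughly the product of the two.</p>

```python
# Rough sanity check: total_savings ≈ header_share * header_compression.
header_compression = 0.69  # egress headers compress by 69% on average
total_savings = 0.014      # observed 1.4% savings in total egress traffic

# Implied share of egress bytes taken up by headers:
header_share = total_savings / header_compression
print(f"{header_share:.1%}")  # 2.0%
```

This is consistent with responses being dominated by body bytes, unlike requests, where headers are the majority of the traffic.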
    <div>
      <h3>Test your HPACK</h3>
      <a href="#test-your-hpack">
        
      </a>
    </div>
    <p>If you have nghttp2 installed, you can test the efficiency of HPACK compression on your website with a bundled tool called h2load.</p><p>For example:</p>
            <pre><code>h2load https://blog.cloudflare.com | tail -6 |head -1
traffic: 18.27KB (18708) total, 538B (538) headers (space savings 27.98%), 17.65KB (18076) data</code></pre>
            <p>We see 27.98% space savings in the headers. That is for a single request, and the gains are mostly due to Huffman encoding. To test whether a website uses the full power of HPACK, we need to issue two requests, for example:</p>
            <pre><code>h2load https://blog.cloudflare.com -n 2 | tail -6 |head -1
traffic: 36.01KB (36873) total, 582B (582) headers (space savings 61.15%), 35.30KB (36152) data</code></pre>
            <p>If the savings for two similar requests are 50% or more, it is very likely that full HPACK compression is in use. Note that the compression ratio improves with additional requests:</p>
            <pre><code>h2load https://blog.cloudflare.com -n 4 | tail -6 |head -1
traffic: 71.46KB (73170) total, 637B (637) headers (space savings 78.68%), 70.61KB (72304) data</code></pre>
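<p>h2load reports savings relative to the headers' uncompressed size, which it does not print directly, but you can recover it from the numbers it does print. A small helper (the function name here is my own):</p>

```python
def uncompressed_header_bytes(compressed, savings):
    """Invert h2load's formula: savings = 1 - compressed / uncompressed."""
    return compressed / (1 - savings)

# Compressed header bytes and reported savings from the three runs above:
for requests, compressed, savings in [(1, 538, 0.2798),
                                      (2, 582, 0.6115),
                                      (4, 637, 0.7868)]:
    raw = uncompressed_header_bytes(compressed, savings)
    print(f"{requests} request(s): {compressed}B compressed, ~{raw:.0f}B raw")
```

The implied raw header size stays near 747 bytes per request across all three runs, while the compressed total barely grows: almost all the headers in the extra requests are absorbed by the dynamic table.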
            
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>By implementing HPACK compression for HTTP response headers we've seen a significant drop in egress bandwidth. HPACK has been enabled for all Cloudflare customers using HTTP/2, all of whom benefit from faster, smaller HTTP responses.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">5PbF7N0blz31qVAU014p8U</guid>
            <dc:creator>Vlad Krasnov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Open sourcing our NGINX HTTP/2 + SPDY code]]></title>
            <link>https://blog.cloudflare.com/open-sourcing-our-nginx-http-2-spdy-code/</link>
            <pubDate>Fri, 13 May 2016 09:55:41 GMT</pubDate>
            <description><![CDATA[ In December, we released HTTP/2 support for all customers and on April 28 we released HTTP/2 Server Push support as well. ]]></description>
            <content:encoded><![CDATA[ <p>In December, we <a href="/introducing-http2/">released HTTP/2 support</a> for all customers and on April 28 we released <a href="/announcing-support-for-http-2-server-push-2/">HTTP/2 Server Push</a> support as well.</p><p>The release of HTTP/2 by CloudFlare had a <a href="/cloudflares-impact-on-the-http-2-universe/">huge impact</a> on the number of sites supporting and using the protocol. Today, <a href="http://isthewebhttp2yet.com/measurements/adoption.html#organization">50%</a> of sites that use HTTP/2 are served via CloudFlare.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ivzR7ozInwwFcSoZD7fLj/5e431ce2e34bd056b09dd4f9a52661b9/4950445842_daabb0ae20_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/jdhancock/4950445842/in/photolist-8xsk5w-cp4c1Q-g1SjoM-o6zhX1-eZbScr-pWSwPv-rafHbb-qEjnf2-dJ9WCw-eZbRVa-gJeKqf-ef7sWH-augo8B-5BNSQM-8eyuUM-5iBMWY-6tQ5es-74BxiW-bnh8zK-Attqbq-eZbTGp-eZbTy4-7TRcLo-8FuDVS-8xWVkW-rcVTV9-8j5QDB-eZoiTE-eZbQUz-oQHS69-eZohaG-o2rMkj-eZbSiz-eZo5ww-eZofqb-eZbQMD-eZbUc6-eZo5Tf-eZbTaV-eZoeAW-ifNAyH-eZohG3-eZbJoT-qhNbsA-eZoaKJ-eZbQm6-eZbTXn-eZbYdF-6JAeef-eZbXpT">image</a> by <a href="https://www.flickr.com/photos/jdhancock/">JD Hancock</a></p><p>When we released HTTP/2 support we decided not to deprecate SPDY immediately because it was still in widespread use and we <a href="/introducing-http2/#comment-2391853103">promised</a> to open source our modifications to NGINX as it was not possible to support both SPDY and HTTP/2 together with the standard release of NGINX.</p><p>We've extracted our changes and they are available as a <a href="https://github.com/cloudflare/sslconfig/tree/master/patches">patch here</a>. This patch should build cleanly against NGINX 1.9.7.</p><p>The patch means that NGINX can be built with both <code>--with-http_v2_module</code> and <code>--with-http_spdy_module</code>. And it will accept both the <code>spdy</code> and <code>http2</code> keywords to the <code>listen</code> directive.</p><p>To configure both HTTP/2 and SPDY in NGINX you'll need to run:</p>
            <pre><code>./configure --with-http_spdy_module --with-http_v2_module --with-http_ssl_module</code></pre>
            <p>Note that you need SSL support for both SPDY and HTTP/2.</p><p>Then it will be possible to configure an NGINX server to support both HTTP/2 and SPDY on the same port as follows:</p>
            <pre><code>server {
        listen       443 ssl spdy http2;
        server_name  www.example.com;

        ssl_certificate      cert.pem;
        ssl_certificate_key  cert.key;

        location / {
            root   html;
            index  index.html index.htm;
        }
}</code></pre>
            <p>Our patch uses <a href="https://en.wikipedia.org/wiki/Application-Layer_Protocol_Negotiation">ALPN</a> and NPN to advertise the availability of the two protocols. To test that the two protocols are being advertised you can use the OpenSSL client as follows (sending an empty ALPN/NPN extension in the ClientHello causes the server to return a list of available protocols).</p>
            <pre><code>openssl s_client -connect www.example.com:443 -nextprotoneg ''
CONNECTED(00000003)
Protocols advertised by server: h2, spdy/3.1, http/1.1</code></pre>
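<p>The server-side selection performed during this negotiation can be sketched simply (an illustrative model, not the actual NGINX patch code): walk the server's preference list and pick the first protocol the client also offered.</p>

```python
# Server-preference protocol selection, as used in ALPN negotiation.
SERVER_PREFERENCE = ["h2", "spdy/3.1", "http/1.1"]

def select_protocol(client_offers):
    """Return the first server-preferred protocol the client supports."""
    for proto in SERVER_PREFERENCE:
        if proto in client_offers:
            return proto
    return None  # no overlap: fall back or fail the negotiation

print(select_protocol(["spdy/3.1", "h2"]))  # h2
print(select_protocol(["http/1.1"]))        # http/1.1
```
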
            <p>Many other tools for testing and debugging HTTP/2 connections can be found <a href="/tools-for-debugging-testing-and-using-http-2/">here</a>.</p><p>The patch advertises HTTP/2 ahead of SPDY/3.1: if a web browser offers both, HTTP/2 will be preferred and used for the connection.</p><p>We continue to support SPDY and HTTP/2 across all CloudFlare sites and will keep an eye on the percentage of connections that use SPDY before making a decision on its eventual deprecation.</p> ]]></content:encoded>
            <category><![CDATA[spdy]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">7gEBf46aPQJcumKOasnoXP</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
    </channel>
</rss>