
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 14:39:48 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Open sourcing Pingora: our Rust framework for building programmable network services]]></title>
            <link>https://blog.cloudflare.com/pingora-open-source/</link>
            <pubDate>Wed, 28 Feb 2024 15:00:11 GMT</pubDate>
            <description><![CDATA[ Pingora, our framework for building programmable and memory-safe network services, is now open source. Get started using Pingora today ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aQBSJRQlM3b1ZRdvJycqY/72f9bd908abc139faba41716074d69d5/Rock-crab-pingora-open-source-mascot.png" />
            
            </figure><p>Today, we are proud to open source Pingora, the Rust framework we have been using to build services that power a significant portion of the traffic on Cloudflare. Pingora is <a href="https://github.com/cloudflare/pingora">released</a> under the <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache License version 2.0</a>.</p><p>As mentioned in our previous blog post, <a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet">Pingora</a> is a Rust async multithreaded framework that assists us in constructing HTTP proxy services. Since our last blog post, Pingora has handled nearly a quadrillion Internet requests across our global network.</p><p>We are open sourcing Pingora to help build a better and more secure Internet beyond our own infrastructure. We want to provide tools, ideas, and inspiration to our customers, users, and others to build their own Internet infrastructure using a memory safe framework. Having such a framework is especially crucial given the increasing awareness of the importance of memory safety across the <a href="https://www.theregister.com/2022/09/20/rust_microsoft_c/">industry</a> and the <a href="https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/press-release-technical-report/">US government</a>. Under this common goal, we are collaborating with the <a href="https://www.abetterinternet.org/">Internet Security Research Group</a> (ISRG) <a href="https://www.memorysafety.org/blog/introducing-river">Prossimo project</a> to help advance the adoption of Pingora in the Internet’s most critical infrastructure.</p><p>In our <a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet">previous blog post</a>, we discussed why and how we built Pingora. In this one, we will talk about why and how you might use Pingora.</p><p>Pingora provides building blocks for not only proxies but also clients and servers. 
Along with these components, we also provide a few utility libraries that implement common logic such as <a href="/how-pingora-keeps-count/">event counting</a>, error handling, and caching.</p>
    <div>
      <h3>What’s in the box</h3>
      <a href="#whats-in-the-box">
        
      </a>
    </div>
    <p>Pingora provides libraries and APIs to build services on top of HTTP/1 and HTTP/2, TLS, or just TCP/UDS. As a proxy, it supports HTTP/1 and HTTP/2 end-to-end, gRPC, and WebSocket proxying. (HTTP/3 support is on the roadmap.) It also comes with customizable load balancing and failover strategies. For compliance and security, it supports both the commonly used OpenSSL and BoringSSL libraries, which come with FIPS compliance and <a href="https://pq.cloudflareresearch.com/">post-quantum crypto</a>.</p><p>Beyond these features, Pingora provides filters and callbacks that allow its users to fully customize how the service processes, transforms, and forwards requests. These APIs will be especially familiar to OpenResty and NGINX users, as many map intuitively onto OpenResty's "*_by_lua" callbacks.</p><p>Operationally, Pingora provides zero-downtime graceful restarts to upgrade itself without dropping a single incoming request. Syslog, Prometheus, Sentry, OpenTelemetry, and other must-have observability tools are also easy to integrate with Pingora.</p>
    <div>
      <h3>Who can benefit from Pingora</h3>
      <a href="#who-can-benefit-from-pingora">
        
      </a>
    </div>
    <p>You should consider Pingora if:</p><p><b>Security is your top priority:</b> Pingora is a memory-safe alternative for services written in C/C++. While some might argue about memory safety among programming languages, from our practical experience we find ourselves far less likely to make coding mistakes that lead to memory safety issues. And because we spend less time struggling with these issues, we are more productive when implementing new features.</p><p><b>Your service is performance-sensitive:</b> Pingora is fast and efficient. As explained in our previous blog post, we saved a lot of CPU and memory resources thanks to Pingora’s multi-threaded architecture. These savings in time and resources can be compelling for workloads that are sensitive to the cost and/or the speed of the system.</p><p><b>Your service requires extensive customization:</b> The APIs that the Pingora proxy framework provides are highly programmable. For users who wish to build a customized and advanced gateway or load balancer, Pingora provides powerful yet simple ways to implement it. We provide examples in the next section.</p>
    <div>
      <h2>Let’s build a load balancer</h2>
      <a href="#lets-build-a-load-balancer">
        
      </a>
    </div>
    <p>Let's explore Pingora's programmable API by building a simple load balancer. The load balancer will select between <a href="https://1.1.1.1/">https://1.1.1.1/</a> and <a href="https://1.0.0.1/">https://1.0.0.1/</a> to be the upstream in a round-robin fashion.</p><p>First let’s create a blank HTTP proxy.</p>
            <pre><code>pub struct LB();

#[async_trait]
impl ProxyHttp for LB {
    async fn upstream_peer(...) -&gt; Result&lt;Box&lt;HttpPeer&gt;&gt; {
        todo!()
    }
}</code></pre>
            <p>Any object that implements the <code>ProxyHttp</code> trait (similar to the concept of an interface in C++ or Java) is an HTTP proxy. The only required method there is <code>upstream_peer()</code>, which is called for every request. This function should return an <code>HttpPeer</code> which contains the origin IP to connect to and how to connect to it.</p><p>Next let’s implement the round-robin selection. The Pingora framework already provides the <code>LoadBalancer</code> with common selection algorithms such as round robin and hashing, so let’s just use it. If the use case requires more sophisticated or customized server selection logic, users can simply implement it themselves in this function.</p>
            <pre><code>pub struct LB(Arc&lt;LoadBalancer&lt;RoundRobin&gt;&gt;);

#[async_trait]
impl ProxyHttp for LB {
    async fn upstream_peer(...) -&gt; Result&lt;Box&lt;HttpPeer&gt;&gt; {
        let upstream = self.0
            .select(b"", 256) // hash doesn't matter for round robin
            .unwrap();

        // Set SNI to one.one.one.one
        let peer = Box::new(HttpPeer::new(upstream, true, "one.one.one.one".to_string()));
        Ok(peer)
    }
}</code></pre>
            <p>Since we are connecting to an HTTPS server, the SNI also needs to be set. Certificates, timeouts, and other connection options can also be set here in the <code>HttpPeer</code> object if needed.</p><p>Finally, let's put the service into action. In this example we hardcode the origin server IPs. In real-life workloads, the origin server IPs can also be discovered dynamically, either when <code>upstream_peer()</code> is called or in the background. After the service is created, we tell the LB service to listen on 127.0.0.1:6188. Finally, we create a Pingora server, which is the process that runs the load-balancing service.</p>
            <pre><code>fn main() {
    // The server must be created first: its configuration is used when building the service.
    let mut my_server = Server::new(None).unwrap();
    my_server.bootstrap();

    let upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"]).unwrap();

    let mut lb = pingora_proxy::http_proxy_service(&amp;my_server.configuration, LB(Arc::new(upstreams)));
    lb.add_tcp("127.0.0.1:6188");

    my_server.add_service(lb);
    my_server.run_forever();
}</code></pre>
            <p>Let’s try it out:</p>
            <pre><code>curl 127.0.0.1:6188 -svo /dev/null
&gt; GET / HTTP/1.1
&gt; Host: 127.0.0.1:6188
&gt; User-Agent: curl/7.88.1
&gt; Accept: */*
&gt; 
&lt; HTTP/1.1 403 Forbidden</code></pre>
            <p>We can see that the proxy is working, but the origin server rejects us with a 403. This is because our service simply proxies the Host header, 127.0.0.1:6188, set by curl, which upsets the origin server. How do we make the proxy correct that? This can be done by adding another filter called <code>upstream_request_filter</code>. This filter runs on every request after the origin server is connected and before any HTTP request is sent. We can add, remove, or change HTTP request headers in this filter.</p>
            <pre><code>async fn upstream_request_filter(…, upstream_request: &amp;mut RequestHeader, …) -&gt; Result&lt;()&gt; {
    upstream_request.insert_header("Host", "one.one.one.one")
}</code></pre>
            <p>Let’s try again:</p>
            <pre><code>curl 127.0.0.1:6188 -svo /dev/null
&lt; HTTP/1.1 200 OK</code></pre>
            <p>This time it works! The complete example can be found <a href="https://github.com/cloudflare/pingora/blob/main/pingora-proxy/examples/load_balancer.rs">here</a>.</p><p>Below is a very simple diagram of how this request flows through the callback and filter we used in this example. The Pingora proxy framework currently provides <a href="https://github.com/cloudflare/pingora/blob/main/docs/user_guide/phase.md">more filters</a> and callbacks at different stages of a request to allow users to modify, reject, route and/or log the request (and response).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WvHfhONcYEEL4kPRwGzKp/f2456d177e727063a49265eea831b8af/Flow_Diagram.png" />
            
            </figure><p>Behind the scenes, the Pingora proxy framework takes care of connection pooling, TLS handshakes, reading, writing, parsing requests and any other common proxy tasks so that users can focus on logic that matters to them.</p>
    <div>
      <h2>Open source, present and future</h2>
      <a href="#open-source-present-and-future">
        
      </a>
    </div>
    <p>Pingora is a library and toolset, not an executable binary. In other words, Pingora is the engine that powers a car, not the car itself. Although Pingora is production-ready for industry use, we understand a lot of folks want a batteries-included, ready-to-go web service with low-code or no-code configuration options. Building that application on top of Pingora will be the focus of our collaboration with the ISRG to expand Pingora's reach. Stay tuned for future announcements on that project.</p><p>Other caveats to keep in mind:</p><ul><li><p><b>Today, API stability is not guaranteed.</b> Although we will try to minimize how often we make breaking changes, we still reserve the right to add, remove, or change components such as request and response filters as the library evolves, especially during this pre-1.0 period.</p></li><li><p><b>Support for non-Unix based operating systems is not currently on the roadmap.</b> We have no immediate plans to support these systems, though this could change in the future.</p></li></ul>
    <div>
      <h2>How to contribute</h2>
      <a href="#how-to-contribute">
        
      </a>
    </div>
    <p>Feel free to raise bug reports, documentation issues, or feature requests in our GitHub <a href="https://github.com/cloudflare/pingora/issues">issue tracker</a>. Before opening a pull request, we strongly suggest you take a look at our <a href="https://github.com/cloudflare/pingora/blob/main/.github/CONTRIBUTING.md">contribution guide</a>.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>In this blog post, we announced the open sourcing of our Pingora framework. We showed that Internet entities and infrastructure can benefit from Pingora’s security, performance, and customizability. We also demonstrated how easy Pingora is to use and how customizable it is.</p><p>Whether you're building production web services or experimenting with network technologies, we hope you find value in Pingora. It's been a long journey, but sharing this project with the open source community has been a goal from the start. We'd like to thank the Rust community, as Pingora is built with many great open-source Rust crates. Moving to a memory safe Internet may feel like an impossible journey, but it's one we hope you join us on.</p>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Pingora]]></category>
            <guid isPermaLink="false">38GzyqaZUYTF8fJFAiwpfx</guid>
            <dc:creator>Yuchen Wu</dc:creator>
            <dc:creator>Edward Wang</dc:creator>
            <dc:creator>Andrew Hauck</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Pingora keeps count]]></title>
            <link>https://blog.cloudflare.com/how-pingora-keeps-count/</link>
            <pubDate>Fri, 12 May 2023 13:00:56 GMT</pubDate>
            <description><![CDATA[ In this blog post, we explain and open source the counting algorithm that powers Pingora. This will be the first of a series of blog posts that share both the Pingora libraries and the ideas behind  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>A while ago we shared how we replaced NGINX with our in-house proxy, <a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/">Pingora</a>. We promised to share more technical details as well as our open sourcing plan. This blog post will be the first of a series that shares both the code libraries that power Pingora and the ideas behind them.</p><p>Today, we take a look at one of Pingora’s libraries: pingora-limits.</p><p>pingora-limits provides the functionality to count inflight events and estimate the rate of events over time. These functions are commonly used to protect infrastructure and services from being overwhelmed by certain types of malicious or misbehaving requests.</p><p>For example, when an origin server becomes slow or unresponsive, requests will accumulate on our servers, which adds pressure on both our servers and our customers’ servers. With this library, we are able to identify which origins have issues, so that action can be taken without affecting other traffic.</p><p>The problem can be abstracted in a very simple way. The input is a (never ending) stream of different types of events. At any point, the system should be able to tell the number of appearances (or the rate) of a certain type of event.</p><p>In a simple example, colors are used as the type of event. The following is one possible example of a sequence of events:</p><p><code>red, blue, red, orange, green, brown, red, blue,...</code></p><p>In this example, the system should report that “red” appears three times.</p><p>The corresponding algorithms are straightforward to design. One obvious answer is to use a hash table, where the keys are the colors and the values are their corresponding appearances. Whenever a new event appears, the algorithm looks up the hash table and increases the appearance counter. 
It is not hard to tell that this algorithm’s time complexity is O(1) (per event) and its space complexity is O(n), where n is the number of event types.</p>
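<p>The hash table algorithm described above can be sketched in a few lines of Rust (an illustrative toy, not part of pingora-limits):</p>

```rust
use std::collections::HashMap;

// Naive O(1)-per-event counter: one hash-table entry per event type,
// so its memory footprint grows with the number of distinct types, O(n).
fn count_events<'a>(events: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for &event in events {
        *counts.entry(event).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let events = ["red", "blue", "red", "orange", "green", "brown", "red", "blue"];
    let counts = count_events(&events);
    assert_eq!(counts["red"], 3); // "red" appears three times
    println!("{:?}", counts);
}
```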
    <div>
      <h2>How Pingora does it</h2>
      <a href="#how-pingora-does-it">
        
      </a>
    </div>
    <p>The hash table solution is fine in common scenarios, but we believe there are a few things that can be improved.</p><ul><li><p>We observe traffic to millions of different servers, while only a few are misbehaving at any given time. It seems a bit wasteful to require space (memory) that holds the counters for all the keys.</p></li><li><p>Concurrently updating the hash table (especially when adding new keys) requires a lock. This potentially forces all concurrent event processing through our system serially. In other words, when lock contention is severe, the lock slows down the system.</p></li></ul><p>The motivation to improve the above algorithm is even stronger considering that such algorithms need to be deployed at scale. This algorithm operates on tens of thousands of machines and handles more than twenty million requests per second, so the benefits of an efficiency improvement can be significant.</p><p>pingora-limits adopts a different approach: <a href="https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch">count–min sketch</a> (CM sketch) estimation. A CM sketch estimates the counts of events in O(1) (per event) using only O(log(n)) space (polylogarithmic, to be precise; more details <a href="https://dsf.berkeley.edu/cs286/papers/countmin-latin2004.pdf">here</a>). Because of the simplicity of this algorithm, which we will discuss in a bit, it can be implemented without locks. As a result, pingora-limits runs much faster and more efficiently than the hash table approach discussed earlier.</p>
    <div>
      <h3>CM sketch</h3>
      <a href="#cm-sketch">
        
      </a>
    </div>
    <p>The idea of a CM sketch is similar to that of a Bloom filter. The mathematical details of the CM sketch can be found in <a href="http://dimacs.rutgers.edu/~graham/pubs/papers/cmencyc.pdf">this paper</a>. In this section, we will just illustrate how it works.</p><p>A CM sketch data structure takes two parameters: H, the number of hashes (rows), and N, the number of counters (columns) per hash (row). The rows and columns form a matrix, which takes H*N space. Each row has its own <a href="https://en.wikipedia.org/wiki/K-independent_hashing">independent</a> hash function (hash_i()).</p><p>For this example, we use H=3 and N=4:</p>
<table>
<thead>
  <tr>
    <th><span>0</span></th>
    <th><span>0</span></th>
    <th><span>0</span></th>
    <th><span>0</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
  </tr>
  <tr>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
  </tr>
</tbody>
</table><p>When an event, "red", arrives, it is counted by every row independently. Each row will use its own hashing function ( hash_i(“red”) ) to choose a column. The counter of the column is increased without <b>worrying about collisions</b> (see the end of this section).</p><p>The table below illustrates a possible state of the matrix after a single “red” event:</p>
<table>
<thead>
  <tr>
    <th><span>0</span></th>
    <th><span>1</span></th>
    <th><span>0</span></th>
    <th><span>0</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>1</span></td>
    <td><span>0</span></td>
  </tr>
  <tr>
    <td><span>1</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
  </tr>
</tbody>
</table><p>Then, let’s assume the event "blue" arrives, and we assume it collides with "red" at row 2: both hash to the third slot:</p>
<table>
<thead>
  <tr>
    <th><span>1</span></th>
    <th><span>1</span></th>
    <th><span>0</span></th>
    <th><span>0</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>2</span></td>
    <td><span>0</span></td>
  </tr>
  <tr>
    <td><span>1</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>1</span></td>
  </tr>
</tbody>
</table><p>Now let’s say another series of events arrives: “blue, red, red, red, blue, red”. So far, the algorithm has observed 5 “red”s and 3 “blue”s in total. Following the algorithm, the estimator eventually becomes:</p>
<table>
<thead>
  <tr>
    <th><span>3</span></th>
    <th><span>5</span></th>
    <th><span>0</span></th>
    <th><span>0</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>8</span></td>
    <td><span>0</span></td>
  </tr>
  <tr>
    <td><span>5</span></td>
    <td><span>0</span></td>
    <td><span>0</span></td>
    <td><span>3</span></td>
  </tr>
</tbody>
</table><p>Now, let’s see how the matrix reports the occurrence of each event. To retrieve the count of a key, the estimator returns the minimum value across all the cells to which that key hashes. So the count of red is min(5, 8, 5) = 5 and that of blue is min(3, 8, 3) = 3.</p><p>This algorithm chooses the cells with the fewest collisions (via the min() operation). Collisions between events in single cells are therefore acceptable: as long as there is at least one collision-free cell for a given type of event, the count for that event is accurate.</p><p>The estimator can overestimate when two (or more) keys collide on all of their slots. Assuming there are only two keys, the probability of such a total collision is 1/N^H (1/64 in this example). On the other hand, it never underestimates, because it never loses count of any event.</p>
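<p>To make the mechanics concrete, here is a simplified, single-threaded CM sketch in Rust that reproduces the color example above. This is an illustrative toy, not the pingora-limits API: the real implementation is lock-free, and here each row's hash function is approximated by mixing the row index into one standard hasher.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal count-min sketch: H rows of N counters each.
struct CmSketch {
    rows: Vec<Vec<usize>>, // H x N counter matrix
}

impl CmSketch {
    fn new(h: usize, n: usize) -> Self {
        CmSketch { rows: vec![vec![0; n]; h] }
    }

    // Per-row hash: mixing the row index into the hash makes each row
    // behave like its own (approximately independent) hash function.
    fn slot<T: Hash>(&self, row: usize, key: &T) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.rows[row].len()
    }

    // Count an event: every row increments its own chosen cell.
    fn incr<T: Hash>(&mut self, key: &T) {
        for row in 0..self.rows.len() {
            let slot = self.slot(row, key);
            self.rows[row][slot] += 1;
        }
    }

    // The estimate is the minimum over all rows, i.e. the cell with
    // the fewest collisions.
    fn get<T: Hash>(&self, key: &T) -> usize {
        (0..self.rows.len())
            .map(|row| self.rows[row][self.slot(row, key)])
            .min()
            .unwrap_or(0)
    }
}

fn main() {
    let mut sketch = CmSketch::new(3, 4);
    // red, blue, then "blue, red, red, red, blue, red": 5 reds, 3 blues.
    for event in ["red", "blue", "blue", "red", "red", "red", "blue", "red"] {
        sketch.incr(&event);
    }
    // The sketch may overestimate but never underestimates.
    assert!(sketch.get(&"red") >= 5);
    assert!(sketch.get(&"blue") >= 3);
    println!("red ~ {}, blue ~ {}", sketch.get(&"red"), sketch.get(&"blue"));
}
```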
    <div>
      <h3>Practical implementation</h3>
      <a href="#practical-implementation">
        
      </a>
    </div>
    <p>Because the algorithm only requires hashing, array indexing, and counter increments, it can be implemented lock-free in a few lines of code.</p><p>The following is a code snippet of how it is implemented in Rust.</p>
            <pre><code>pub struct Estimator {
    estimator: Box&lt;[(Box&lt;[AtomicIsize]&gt;, RandomState)]&gt;,
}
 
impl Estimator {
    /// Increment `key` by the value given. Return the new estimated value as a result.
    pub fn incr&lt;T: Hash&gt;(&amp;self, key: T, value: isize) -&gt; isize {
        let mut min = isize::MAX;
        for (slot, hasher) in self.estimator.iter() {
            let hash = hash(&amp;key, hasher) as usize;
            let counter = &amp;slot[hash % slot.len()];
            let current = counter.fetch_add(value, Ordering::Relaxed);
            min = std::cmp::min(min, current + value);
        }
        min
    }
}</code></pre>
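<p>The lock-free property comes from the fact that each cell is just an atomic counter, so concurrent writers only ever perform a fetch_add and never take a table-wide lock. The following self-contained toy (not the pingora-limits API) illustrates that concurrent atomic increments lose no counts:</p>

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::thread;

fn main() {
    // One shared row of four counters, like a single row of the sketch.
    static ROW: [AtomicIsize; 4] = [
        AtomicIsize::new(0),
        AtomicIsize::new(0),
        AtomicIsize::new(0),
        AtomicIsize::new(0),
    ];

    // Eight threads each record 1000 events with no lock anywhere.
    let handles: Vec<_> = (0..8)
        .map(|_| {
            thread::spawn(|| {
                for i in 0..1000 {
                    ROW[i % 4].fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // No increments are lost: 8 threads x 1000 events.
    let total: isize = ROW.iter().map(|c| c.load(Ordering::Relaxed)).sum();
    assert_eq!(total, 8000);
    println!("total = {}", total);
}
```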
            
    <div>
      <h3>Performance</h3>
      <a href="#performance">
        
      </a>
    </div>
    <p>We compare the design above with two hash table based approaches.</p><ol><li><p>naive: Mutex&lt;HashMap&lt;u32, usize&gt;&gt;. This is the simple hash table approach mentioned above. This design requires a lock on every operation.</p></li><li><p>optimized: DashMap&lt;u32, AtomicUsize&gt;. <a href="https://docs.rs/dashmap/latest/dashmap/">DashMap</a> shards the keys across multiple hash tables to reduce contention between different keys. We also use atomic counters here so that counting existing keys doesn't need a write lock.</p></li></ol><p>We have two test cases: one single-threaded and one multi-threaded. In both cases, we have one million keys and generate 100 million events from them. The keys are uniformly distributed among the events.</p><p>The results below were collected on a Debian VM running on an M1 MacBook Pro.</p><p><b>Speed.</b> Per-event (the incr() function above) timing; lower is better:</p>
<table>
<thead>
  <tr>
    <th></th>
    <th><span>pingora-limits</span></th>
    <th><span>naive</span></th>
    <th><span>optimized</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>Single thread</span></td>
    <td><span>10ns</span></td>
    <td><span>51ns</span></td>
    <td><span>43ns</span></td>
  </tr>
  <tr>
    <td><span>Eight threads</span></td>
    <td><span>212ns</span></td>
    <td><span>1505ns</span></td>
    <td><span>212ns</span></td>
  </tr>
</tbody>
</table><p>In the single-threaded case, where there is no lock contention, our approach is 5x faster than the naive one and 4x faster than the optimized one. With multiple threads there is a high amount of contention; our approach performs similarly to the optimized version, and both are 7x faster than the naive one. The performance of pingora-limits and the optimized hash table is similar because in both approaches the hot path is just updating an atomic counter.</p><p><b>Memory consumption.</b> Lower is better. The numbers are collected only from the single-threaded test cases for simplicity.</p>
<table>
<thead>
  <tr>
    <th></th>
    <th><span>peak memory bytes</span></th>
    <th><span> total allocations</span></th>
    <th><span>total allocated bytes</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>pingora-limits</span></td>
    <td><span>26,184</span></td>
    <td><span>9</span></td>
    <td><span>26,184</span></td>
  </tr>
  <tr>
    <td><span>naive</span></td>
    <td><span>53,477,392</span></td>
    <td><span>20</span></td>
    <td><span>71,303,260 </span></td>
  </tr>
  <tr>
    <td><span>optimized</span></td>
    <td><span>36,211,208 </span></td>
    <td><span>491</span></td>
    <td><span>71,307,722</span></td>
  </tr>
</tbody>
</table><p>At peak, pingora-limits requires 1/2000 of the memory of the naive approach and 1/1300 of the memory of the optimized one.</p><p>From the data above, pingora-limits is both CPU- and memory-efficient.</p><p>The estimator provided by pingora-limits is a biased estimator: it is possible for it to overestimate the appearance of events.</p><p>In cases that require accurate counting, where false positives are absolutely unacceptable, pingora-limits can still be very useful. It can work as a first-stage filter, where only the events beyond a certain threshold are fed to a hash table that performs accurate counting. In this case, the majority of low-frequency event types are filtered out by the filter, so the hash table consumes little memory without losing any accuracy.</p>
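<p>The first-stage filter idea can be sketched as follows. This is a hypothetical simplification: a single-threaded sketch stands in for the estimator, and the structure name and threshold are made up for illustration. Only keys whose estimated count reaches the threshold ever get a hash-table entry, so rare keys cost no table space.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// First-stage filter sketch: a small count-min matrix screens events,
/// and only keys whose estimate reaches `threshold` are promoted to an
/// exact hash-table counter (which counts increments after promotion).
struct TwoStage {
    rows: Vec<Vec<usize>>,
    threshold: usize,
    exact: HashMap<String, usize>,
}

impl TwoStage {
    fn new(h: usize, n: usize, threshold: usize) -> Self {
        TwoStage { rows: vec![vec![0; n]; h], threshold, exact: HashMap::new() }
    }

    // Per-row hash, approximated by mixing the row index into one hasher.
    fn slot(&self, row: usize, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.rows[row].len()
    }

    fn incr(&mut self, key: &str) {
        let mut estimate = usize::MAX;
        for row in 0..self.rows.len() {
            let s = self.slot(row, key);
            self.rows[row][s] += 1;
            estimate = estimate.min(self.rows[row][s]);
        }
        // Promote hot keys to exact counting; rare keys never enter the map.
        if estimate >= self.threshold {
            *self.exact.entry(key.to_string()).or_insert(0) += 1;
        }
    }
}

fn main() {
    let mut counter = TwoStage::new(4, 1024, 3);
    for _ in 0..10 {
        counter.incr("hot-key");
    }
    counter.incr("rare-key");
    // Only the frequent key consumed hash-table space.
    assert!(counter.exact.contains_key("hot-key"));
    println!("exact table size: {}", counter.exact.len());
}
```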
    <div>
      <h2>How it is used in production</h2>
      <a href="#how-it-is-used-in-production">
        
      </a>
    </div>
    <p>In production, Pingora uses this library in a few places. The most common one is the connection limit feature. When our servers try to establish too many connections to a single origin server, this feature starts rejecting new requests with <a href="https://developers.cloudflare.com/support/troubleshooting/cloudflare-errors/troubleshooting-cloudflare-5xx-errors/#error-503-service-temporarily-unavailable">503 errors</a> in order to protect the origin server and our infrastructure from becoming overloaded.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WEm9dkrBXLof8baG7ph9e/8ac9205ac8c47ec1a89fabbab848f2ae/image2-4.png" />
            
            </figure><p>In this feature, every incoming request increases a counter shared by all other requests with the same customer ID, server IP, and server hostname. When the request finishes, the counter decreases accordingly. If the value of the counter is beyond a certain threshold, the request is rejected with a 503 error response. In our production environment, we choose the parameters of the library so that the theoretical chance of a collision between two unrelated customers is about 1/2^52. Additionally, the rejection threshold is significantly higher than what a healthy customer’s traffic would reach. So even if multiple customers’ counters collide, the overestimated value is unlikely to reach the threshold, and a false positive on the connection limit is unlikely to happen.</p>
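<p>The increment-on-start, decrement-on-finish logic described above can be sketched as follows. This is a hypothetical simplification: a single atomic counter stands in for the per-key pingora-limits estimator, and the helper names and threshold are made up for illustration.</p>

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

// Admit a request against a shared inflight counter for one
// (customer ID, server IP, hostname) key.
fn admit(counter: &AtomicIsize, threshold: isize) -> bool {
    // Increment when a request starts...
    let inflight = counter.fetch_add(1, Ordering::Relaxed) + 1;
    if inflight > threshold {
        // ...and roll the increment back when rejecting with a 503.
        counter.fetch_sub(1, Ordering::Relaxed);
        return false;
    }
    true
}

fn finish(counter: &AtomicIsize) {
    // Decrement when the request completes.
    counter.fetch_sub(1, Ordering::Relaxed);
}

fn main() {
    let inflight = AtomicIsize::new(0);
    assert!(admit(&inflight, 2));
    assert!(admit(&inflight, 2));
    assert!(!admit(&inflight, 2)); // third concurrent request gets a 503
    finish(&inflight);
    assert_eq!(inflight.load(Ordering::Relaxed), 1);
}
```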
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The pingora-limits crate is now available on <a href="https://github.com/cloudflare/pingora/tree/main/pingora-limits">GitHub</a>. Both the core functionality and the performance benchmarks above can be found there.</p><p>In this blog post, we introduced pingora-limits, a library that counts events efficiently. We explained the core idea, which is based on a probabilistic data structure, and showed through a performance benchmark that the pingora-limits implementation is fast and very memory-efficient.</p><p>We will continue introducing and open sourcing Pingora components and libraries, because we believe that sharing the ideas behind the code is just as important as sharing the code itself.</p><p>Interested in joining us to help build a better Internet? Our engineering teams are <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering">hiring</a>.</p>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[Pingora]]></category>
            <guid isPermaLink="false">3HZMEE4RsNyQcQ4Zw5dWzl</guid>
            <dc:creator>Yuchen Wu</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare's handling of a bug in interpreting IPv4-mapped IPv6 addresses]]></title>
            <link>https://blog.cloudflare.com/cloudflare-handling-bug-interpreting-ipv4-mapped-ipv6-addresses/</link>
            <pubDate>Thu, 02 Feb 2023 13:32:00 GMT</pubDate>
            <description><![CDATA[ Recently, a vulnerability was reported to our bug bounty about a bug in the way some of our code interprets IPv4 addresses mapped into IPv6 addresses.  ]]></description>
            <content:encoded><![CDATA[ <p>In November 2022, our <a href="https://www.cloudflare.com/disclosure/">bug bounty program</a> received a critical and very interesting report. The report stated that certain types of DNS records could be used to bypass some of our network policies and connect to ports on the loopback address (e.g. 127.0.0.1) of our servers. This post will explain how we dealt with the report, how we fixed the bug, and the outcome of our internal investigation to see if the vulnerability had been previously exploited.</p><p><a href="https://datatracker.ietf.org/doc/html/rfc4291#section-2-5-5">RFC 4291</a> defines ways to embed an IPv4 address into IPv6 addresses. One of the methods defined in the RFC is to use IPv4-mapped IPv6 addresses, that have the following format:</p>
            <pre><code>   |                80 bits               | 16 |      32 bits        |
   +--------------------------------------+----+---------------------+
   |0000..............................0000|FFFF|    IPv4 address     |
   +--------------------------------------+----+---------------------+</code></pre>
            <p>In IPv6 notation, the corresponding mapping for <code>127.0.0.1</code> is <code>::ffff:127.0.0.1</code> (<a href="https://datatracker.ietf.org/doc/html/rfc4038">RFC 4038</a>)</p><p>The researcher was able to use DNS entries based on mapped addresses to bypass some of our controls and access ports on the loopback address or non-routable IPs.</p><p>This vulnerability was reported on November 27 to our bug bounty program. Our Security Incident Response Team (SIRT) was contacted, and incident response activities began shortly after the report was filed. A hotpatch was deployed three hours later to prevent exploitation of the bug.</p>
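<p>Using Rust's standard library, a check for such addresses might look like the following sketch. Here, hides_internal_ipv4 is a hypothetical helper for illustration, not Cloudflare's actual patch.</p>

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

// Detect an IPv4-mapped IPv6 address that hides a loopback or
// RFC 1918 (private) IPv4 address.
fn hides_internal_ipv4(addr: Ipv6Addr) -> bool {
    match addr.to_ipv4_mapped() {
        Some(v4) => v4.is_loopback() || v4.is_private(),
        None => false,
    }
}

fn main() {
    // ::ffff:127.0.0.1 embeds the IPv4 loopback address.
    let mapped: Ipv6Addr = "::ffff:127.0.0.1".parse().unwrap();
    assert_eq!(mapped.to_ipv4_mapped(), Some(Ipv4Addr::new(127, 0, 0, 1)));
    assert!(hides_internal_ipv4(mapped));
    // A regular public IPv6 address is not flagged.
    assert!(!hides_internal_ipv4("2606:4700::1111".parse().unwrap()));
}
```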
<table>
<thead>
  <tr>
    <th><span>Date</span></th>
    <th><span>Time (UTC)</span></th>
    <th><span>Activity</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>27 November 2022</span></td>
    <td><span>20:42</span></td>
    <td><span>Initial report to Cloudflare's bug bounty program</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>21:04</span></td>
    <td><span>SIRT oncall is paged</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>21:15</span></td>
    <td><span>SIRT manager on call starts working on the report</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>21:22</span></td>
    <td><span>Incident declared; team assembled and debugging begins</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>23:20</span></td>
    <td><span>A hotfix is ready and deployment starts</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>23:47</span></td>
    <td><span>Team confirms that the hotfix is deployed and working</span></td>
  </tr>
  <tr>
    <td></td>
    <td><span>23:58</span></td>
    <td><span>Team investigates if other products are affected. Load Balancers and Spectrum are potential targets. Both products are found to be unaffected by the vulnerability.</span></td>
  </tr>
  <tr>
    <td><span>28 November 2022</span></td>
    <td><span>21:14</span></td>
    <td><span>A permanent fix is ready</span></td>
  </tr>
  <tr>
    <td><span>29 November 2022</span></td>
    <td><span>21:34</span></td>
    <td><span>Permanent fix is merged</span></td>
  </tr>
</tbody>
</table>
    <div>
      <h3>Blocking exploitation</h3>
      <a href="#blocking-exploitation">
        
      </a>
    </div>
    <p>Immediately after the vulnerability was reported to our Bug Bounty program, the team began working to understand the issue and find ways to quickly block potential exploitation. It was determined that the fastest way to prevent exploitation would be to block the creation of the DNS records required to execute the attack.</p><p>The team then began to implement a patch to prevent the creation of DNS records that include IPv6 addresses that map loopback or RFC 1918 (internal) IPv4 addresses. The fix was fully deployed and confirmed three hours after the report was filed. We later realized that this change was insufficient because records hosted on external DNS servers could also be used in this attack.</p>
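<p>A check like the one described, rejecting records whose IPv6 value embeds a loopback or RFC 1918 IPv4 address, can be sketched with Rust's standard library (an illustrative sketch, not Cloudflare's actual patch):</p>

```rust
use std::net::Ipv6Addr;

// Illustrative only: flag AAAA values that smuggle in a loopback or
// RFC 1918 (private) IPv4 address via the ::ffff:0:0/96 mapped range.
fn is_forbidden_aaaa(value: &str) -> bool {
    let Ok(v6) = value.parse::<Ipv6Addr>() else { return false };
    match v6.to_ipv4_mapped() {
        // to_ipv4_mapped extracts the embedded IPv4 address only for
        // IPv4-mapped addresses, returning None for ordinary IPv6.
        Some(v4) => v4.is_loopback() || v4.is_private(),
        None => false,
    }
}

fn main() {
    assert!(is_forbidden_aaaa("::ffff:127.0.0.1")); // mapped loopback
    assert!(is_forbidden_aaaa("::ffff:10.0.0.1"));  // mapped RFC 1918
    assert!(!is_forbidden_aaaa("2001:db8::1"));     // ordinary IPv6
}
```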
    <div>
      <h3>The exploit</h3>
      <a href="#the-exploit">
        
      </a>
    </div>
    <p>The exploit provided consisted of two parts: a DNS entry and a Cloudflare Worker. The DNS entry was an <code>AAAA</code> record pointing to <code>::ffff:127.0.0.1</code>:</p><p><code>exploit.example.com</code> <code>AAAA</code> <code>::ffff:127.0.0.1</code></p><p>The worker included the following code:</p>
            <pre><code>export default {
    async fetch(request) {
        // Read the JSON body, which supplies the target URL and request options
        const requestJson = await request.json()
        // Re-issue the request to that URL from inside Cloudflare's network
        return fetch(requestJson.url, requestJson)
    }
}</code></pre>
            <p>The Worker was given a custom URL such as <code>proxy.example.com</code>.</p><p>With that setup, it was possible to make the worker attempt connections on the loopback interface of the server where it was running. The call would look like this:</p>
            <pre><code>curl https://proxy.example.com/json -d '{"url":"http://exploit.example.com:80/url_path"}'</code></pre>
            <p>The attack could then be scripted to attempt to connect to multiple ports on the server.</p><p>It was also found that a similar setup could be used with other IPv4 addresses to attempt connections into internal services. In this case, the DNS entry would look like:</p>
            <pre><code>exploit.example.com AAAA ::ffff:10.0.0.1</code></pre>
            <p>This exploit would allow an attacker to connect to services running on the loopback interface of the server. If the attacker was able to bypass the security and authentication mechanisms of a service, it could impact the confidentiality and integrity of data. For services running on other servers, the attacker could also use the worker to attempt connections and map services available over the network. As in most networks, Cloudflare's network policies and ACLs must allow a few ports to be accessible. These ports would be accessible by an attacker using this exploit.</p>
    <div>
      <h3>Investigation</h3>
      <a href="#investigation">
        
      </a>
    </div>
    <p>We started an investigation to understand the root cause of the problem and created a proof-of-concept that allowed the team to debug the issue. At the same time, we started a parallel investigation to determine if the issue had been previously exploited.</p><p>It all happened when two bugs collided.</p><p>The first bug happened in our internal DNS system, which is responsible for mapping hostnames to IP addresses of our customers’ origin servers (the DNS system). When the DNS system tried to answer a query for the DNS record from exploit.example.com, it serialized the IP as a string. The <a href="https://pkg.go.dev/net#IP.String">Golang net library</a> used for DNS automatically converted the IP <code>::ffff:10.0.0.1</code> to the string “10.0.0.1”. However, the DNS system still treated it as an IPv6 address. So a query response <code>{ipv6: “10.0.0.1”}</code> was returned.</p><p>The second bug was in our internal HTTP system (the proxy), which is responsible for forwarding HTTP traffic to customers’ origin servers. The bug happened in how the proxy validates this DNS response, <code>{ipv6: “10.0.0.1”}</code>. The proxy has two deny lists of IPs that are not allowed to be used, one for IPv4 and one for IPv6. These lists contain localhost IPs and private IPs. The bug was that the proxy system compared the address 10.0.0.1 against the IPv6 deny list because the address was in the “ipv6” section. Naturally, the address didn’t match any entry in the deny list. So the address was allowed to be used as an origin IP address.</p><p>The second investigation team searched through the logs and found no evidence of previous exploitation of this vulnerability. The team also checked Cloudflare DNS for entries using IPv4-mapped IPv6 addresses and determined that all the existing entries had been used for testing purposes. As of now, there are no signs that this vulnerability could have been previously used against Cloudflare systems.</p>
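<p>The corrected behavior can be sketched as follows: classify the address by what it actually parses to, never by the family label the DNS response claimed (hypothetical names; this is not Cloudflare's proxy code):</p>

```rust
use std::net::IpAddr;

// Hypothetical deny-list check illustrating the bug and the fix: the
// buggy code picked the deny list from the family the DNS response
// claimed; this version classifies the parsed address instead.
fn is_denied(addr: &str) -> bool {
    match addr.parse::<IpAddr>() {
        Ok(IpAddr::V4(v4)) => v4.is_loopback() || v4.is_private(),
        Ok(IpAddr::V6(v6)) => match v6.to_ipv4_mapped() {
            // Unwrap mapped addresses so ::ffff:10.0.0.1 hits the v4 checks.
            Some(v4) => v4.is_loopback() || v4.is_private(),
            None => v6.is_loopback(),
        },
        Err(_) => true, // fail closed on anything unparseable
    }
}

fn main() {
    // The response {ipv6: "10.0.0.1"} was checked against the IPv6 list
    // and sailed through; parsing first catches it.
    assert!(is_denied("10.0.0.1"));
    assert!(is_denied("::ffff:127.0.0.1"));
    assert!(!is_denied("93.184.216.34")); // a public origin IP is fine
}
```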
    <div>
      <h3>Remediating the vulnerability</h3>
      <a href="#remediating-the-vulnerability">
        
      </a>
    </div>
    <p>To address this issue, we implemented a fix in the proxy service to validate each IP address against the deny list matching the parsed address, not the deny list of the IP family the DNS API response claimed. We confirmed in both our test and production environments that the fix prevented the issue from happening again.</p><p>Beyond maintaining a <a href="https://www.cloudflare.com/disclosure/">bug bounty program</a>, we regularly perform internal security reviews and hire third-party firms to audit the software we develop. But it is through our bug bounty program that we receive some of the most interesting and creative reports. Each report has helped us improve the security of our services. We invite anyone who finds a security issue in any of Cloudflare’s services to report it to us through <a href="https://hackerone.com/cloudflare">HackerOne</a>.</p>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Bug Bounty]]></category>
            <category><![CDATA[IPv6]]></category>
            <guid isPermaLink="false">2moNY48YbcqIe8gCAZ6P6K</guid>
            <dc:creator>Lucas Ferreira</dc:creator>
            <dc:creator>Aki Shugaeva</dc:creator>
            <dc:creator>Yuchen Wu</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we built Pingora, the proxy that connects Cloudflare to the Internet]]></title>
            <link>https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/</link>
            <pubDate>Wed, 14 Sep 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using Rust that serves over 1 trillion requests a day ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>Today we are excited to talk about Pingora, a new HTTP proxy we’ve built in-house using <a href="https://www.rust-lang.org/">Rust</a> that serves over 1 trillion requests a day, boosts our performance, and enables many new features for Cloudflare customers, all while requiring only a third of the CPU and memory resources of our previous proxy infrastructure.</p><p>As Cloudflare has scaled, we’ve outgrown NGINX. It was great for many years, but over time its limitations at our scale meant building something new made sense. We could no longer get the performance we needed, nor did NGINX have the features we needed for our very complex environment.</p><p>Many Cloudflare customers and users use the Cloudflare global network as a proxy between HTTP clients (such as web browsers, apps, IoT devices and more) and servers. In the past, we’ve talked a lot about how browsers and other user agents connect to our network, and we’ve developed a lot of technology and implemented new protocols (see <a href="/the-road-to-quic/">QUIC</a> and <a href="/delivering-http-2-upload-speed-improvements/">optimizations for http2</a>) to make this leg of the connection more efficient.</p><p>Today, we’re focusing on a different part of the equation: the service that proxies traffic between our network and servers on the Internet. This proxy service powers our CDN, Workers fetch, Tunnel, Stream, R2 and many, many other features and products.</p><p>Let’s dig into why we chose to replace our legacy service and how we developed Pingora, our new system designed specifically for Cloudflare’s customer use cases and scale.</p>
    <div>
      <h2>Why build yet another proxy</h2>
      <a href="#why-build-yet-another-proxy">
        
      </a>
    </div>
    <p>Over the years, our usage of NGINX has run up against limitations. For some limitations, we optimized or worked around them. But others were much harder to overcome.</p>
    <div>
      <h3>Architecture limitations hurt performance</h3>
      <a href="#architecture-limitations-hurt-performance">
        
      </a>
    </div>
    <p>The NGINX <a href="https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/">worker (process) architecture</a> has operational drawbacks for our use cases that hurt our performance and efficiency.</p><p>First, in NGINX each request can only be served by a single worker. This results in <a href="/the-sad-state-of-linux-socket-balancing/">unbalanced load across all CPU cores</a>, which <a href="/keepalives-considered-harmful/">leads to slowness</a>.</p><p>Because of this request-process pinning effect, requests that do <a href="/the-problem-with-event-loops/">CPU heavy</a> or <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">blocking IO tasks</a> can slow down other requests. As those blog posts attest we’ve spent a lot of time working around these problems.</p><p>The most critical problem for our use cases is poor connection reuse. Our machines establish TCP connections to origin servers to proxy HTTP requests. Connection reuse speeds up TTFB (time-to-first-byte) of requests by reusing previously established connections from a connection pool, skipping TCP and TLS handshakes required on a new connection.</p><p>However, the <a href="https://www.nginx.com/blog/load-balancing-with-nginx-plus-part-2/">NGINX connection pool</a> is per worker. When a request lands on a certain worker, it can only reuse the connections within that worker. When we add more NGINX workers to scale up, our connection reuse ratio gets worse because the connections are scattered across more isolated pools of all the processes. This results in slower TTFB and more connections to maintain, which consumes resources (and money) for both us and our customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3a7AeWb20XQpmYKyQpv5fb/351efd47c3e67ab6e2d814007c10fad5/image2-4.png" />
            
            </figure><p>As mentioned in past blog posts, we have workarounds for some of these issues. But if we could address the fundamental issue itself, the worker/process model, we would resolve all these problems naturally.</p>
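<p>The connection-reuse problem can be made concrete with a toy pool model (illustrative only, not NGINX or Pingora internals): with one shared pool, an idle connection left behind by any request is visible to every later request, but split the same connections across isolated per-worker pools and a request can only reuse what its own worker happened to keep.</p>

```rust
use std::collections::VecDeque;

// Toy connection pool: `checkout` reuses an idle connection when one
// exists, otherwise simulates paying for a new TCP/TLS handshake.
struct Pool {
    idle: VecDeque<u64>,
}

impl Pool {
    fn new() -> Self {
        Pool { idle: VecDeque::new() }
    }

    // Returns (connection id, was_reused).
    fn checkout(&mut self, next_id: u64) -> (u64, bool) {
        match self.idle.pop_front() {
            Some(id) => (id, true),   // reused: no handshake needed
            None => (next_id, false), // miss: open a new connection
        }
    }

    fn checkin(&mut self, id: u64) {
        self.idle.push_back(id);
    }
}

fn main() {
    // One shared pool: the second request reuses the first's connection.
    let mut shared = Pool::new();
    let (conn, reused) = shared.checkout(1);
    assert!(!reused);
    shared.checkin(conn);
    assert!(shared.checkout(2).1);

    // Two isolated per-worker pools: the same second request, landing on
    // the other worker, cannot see the idle connection and opens a new one.
    let mut worker_a = Pool::new();
    let mut worker_b = Pool::new();
    let (conn, _) = worker_a.checkout(1);
    worker_a.checkin(conn);
    assert!(!worker_b.checkout(2).1);
}
```

Adding more workers spreads connections across more isolated pools, which is exactly why the reuse ratio degrades as NGINX scales up.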
    <div>
      <h3>Some types of functionality are difficult to add</h3>
      <a href="#some-types-of-functionality-are-difficult-to-add">
        
      </a>
    </div>
    <p>NGINX is a very good web server, load balancer, or simple gateway. But Cloudflare does way more than that. We used to build all the functionality we needed around NGINX, which is not easy to do while trying not to diverge too much from the NGINX upstream codebase.</p><p>For example, when <a href="/new-tools-to-monitor-your-server-and-avoid-downtime/">retrying/failing over</a> a request, sometimes we want to send a request to a different origin server with a different set of request headers. But that is not something NGINX allows us to do. In cases like this, we spend time and effort on working around the NGINX constraints.</p><p>Meanwhile, the programming languages we had to work with didn’t help alleviate the difficulties. NGINX is written purely in C, which is not memory safe by design. It is very error-prone to work with such a third-party code base. It is quite easy to get into <a href="/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/">memory safety issues</a>, even for experienced engineers, and we wanted to avoid these as much as possible.</p><p>The other language we used to complement C is Lua. It is less risky but also less performant. In addition, we often found ourselves missing <a href="https://en.wikipedia.org/wiki/Type_system#Static_type_checking">static typing</a> when working with complicated Lua code and business logic.</p><p>Finally, the NGINX community is not very active, and development tends to happen <a href="https://dropbox.tech/infrastructure/how-we-migrated-dropbox-from-nginx-to-envoy">“behind closed doors”</a>.</p>
    <div>
      <h3>Choosing to build our own</h3>
      <a href="#choosing-to-build-our-own">
        
      </a>
    </div>
    <p>Over the past few years, as we’ve continued to grow our customer base and feature set, we continually evaluated three choices:</p><ol><li><p>Continue to invest in NGINX and possibly fork it to tailor it 100% to our needs. We had the expertise needed, but given the architecture limitations mentioned above, significant effort would be required to rebuild it in a way that fully supported our needs.</p></li><li><p>Migrate to another 3rd party proxy codebase. There are definitely good projects, like <a href="https://dropbox.tech/infrastructure/how-we-migrated-dropbox-from-nginx-to-envoy">envoy</a> and <a href="https://linkerd.io/2020/12/03/why-linkerd-doesnt-use-envoy/">others</a>. But this path means the same cycle may repeat in a few years.</p></li><li><p>Start with a clean slate, building an in-house platform and framework. This choice requires the most upfront investment in terms of engineering effort.</p></li></ol><p>We evaluated each of these options every quarter for the past few years. There is no obvious formula to tell which choice is the best. For several years, we continued with the path of the least resistance, continuing to augment NGINX. However, at some point, building our own proxy’s return on investment seemed worth it. We made a call to build a proxy from scratch, and began designing the proxy application of our dreams.</p>
    <div>
      <h2>The Pingora Project</h2>
      <a href="#the-pingora-project">
        
      </a>
    </div>
    
    <div>
      <h3>Design decisions</h3>
      <a href="#design-decisions">
        
      </a>
    </div>
    <p>To make a proxy that serves millions of requests per second fast, efficient, and secure, we have to make a few important design decisions first.</p><p>We chose <a href="https://www.rust-lang.org/">Rust</a> as the language of the project because it can do what C can do in a memory-safe way without compromising performance.</p><p>Although there are some great off-the-shelf 3rd party HTTP libraries, such as <a href="https://github.com/hyperium/hyper">hyper</a>, we chose to build our own because we want to maximize the flexibility in how we handle HTTP traffic and to make sure we can innovate at our own pace.</p><p>At Cloudflare, we handle traffic across the entire Internet. We have many cases of bizarre and non-RFC compliant HTTP traffic that we have to support. This is a common dilemma across the HTTP community and web, where there is tension between strictly following HTTP specifications and accommodating the nuances of a wide ecosystem of potentially legacy clients or servers. Picking one side can be a tough job.</p><p>HTTP status codes are defined in <a href="https://www.rfc-editor.org/rfc/rfc9110.html#name-status-codes">RFC 9110 as a three-digit integer</a>, and generally expected to be in the range 100 through 599; many implementations, hyper among them, enforce that range. However, many servers support the use of status codes between 599 and 999. <a href="https://github.com/hyperium/http/issues/144">An issue</a> had been created for this feature, which explored various sides of the debate. While the hyper team did ultimately accept that change, there would have been valid reasons for them to reject such an ask, and this was only one of many cases of noncompliant behavior we needed to support.</p><p>In order to satisfy the requirements of Cloudflare's position in the HTTP ecosystem, we needed a robust, permissive, customizable HTTP library that can survive the wilds of the Internet and support a variety of noncompliant use cases. 
The best way to guarantee that is to implement our own.</p><p>The next design decision was around our workload scheduling system. We chose multithreading over <a href="https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/#Inside-the-NGINX-Worker-Process">multiprocessing</a> in order to share resources, especially connection pools, easily. We also decided that <a href="https://en.wikipedia.org/wiki/Work_stealing">work stealing</a> was required to avoid some classes of performance problems mentioned above. The Tokio async runtime turned out to be <a href="https://tokio.rs/blog/2019-10-scheduler">a great fit</a> for our needs.</p><p>Finally, we wanted our project to be intuitive and developer friendly. What we build is not the final product, and should be extensible as a platform as more features are built on top of it. We decided to implement a “life of a request” event based programmable interface <a href="https://openresty-reference.readthedocs.io/en/latest/Directives/">similar to NGINX/OpenResty</a>. For example, the “request filter” phase allows developers to run code to modify or reject the request when a request header is received. With this design, we can separate our business logic and generic proxy logic cleanly. Developers who previously worked on NGINX can easily switch to Pingora and quickly become productive.</p>
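<p>A phase-based interface of the kind described might look roughly like the following (trait and method names here are illustrative, not Pingora's published API):</p>

```rust
// Illustrative sketch of a "life of a request" phase interface: business
// logic hooks into named phases while the framework owns the proxying.
trait ProxyPhases {
    // Runs when request headers arrive; returning false rejects the
    // request before any upstream connection is made.
    fn request_filter(&self, _headers: &[(String, String)]) -> bool {
        true // default: let the request through
    }
}

struct MyGateway;

impl ProxyPhases for MyGateway {
    fn request_filter(&self, headers: &[(String, String)]) -> bool {
        // Example policy: reject requests carrying a denied header.
        !headers.iter().any(|(name, _)| name == "x-internal-only")
    }
}

fn main() {
    let ok: Vec<(String, String)> = vec![("host".into(), "example.com".into())];
    let bad: Vec<(String, String)> = vec![("x-internal-only".into(), "1".into())];
    assert!(MyGateway.request_filter(&ok));
    assert!(!MyGateway.request_filter(&bad));
}
```

The framework calls each hook at the right moment in the request's life, so the business logic stays cleanly separated from the generic proxy machinery.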
    <div>
      <h2>Pingora is faster in production</h2>
      <a href="#pingora-is-faster-in-production">
        
      </a>
    </div>
    <p>Let’s fast-forward to the present. Pingora handles almost every HTTP request that needs to interact with an origin server (for a cache miss, for example), and we’ve collected a lot of performance data in the process.</p><p>First, let’s see how Pingora speeds up our customers’ traffic. Overall traffic on Pingora shows a 5ms reduction in median TTFB and an 80ms reduction at the 95th percentile. This is not because we run code faster. Even our old service could handle requests in the sub-millisecond range.</p><p>The savings come from our new architecture, which can share connections across all threads. This means a better connection reuse ratio, which spends less time on TCP and TLS handshakes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1pLeU6tqq4NDdGqPRWBm7r/3e05c76314b83f174beebbc6b3263ae8/image3-3.png" />
            
            </figure><p>Across all customers, Pingora makes only a third as many new connections per second compared to the old service. For one major customer, it increased the connection reuse ratio from 87.1% to 99.92%, which reduced new connections to their origins by 160x. To present the number more intuitively, by switching to Pingora, we save our customers and users 434 years of handshake time every day.</p>
    <div>
      <h3>More features</h3>
      <a href="#more-features">
        
      </a>
    </div>
    <p>Having a developer-friendly interface that engineers are familiar with, while eliminating the previous constraints, allows us to develop more features more quickly. Core functionality like new protocols acts as a building block for more products we can offer to customers.</p><p>As an example, we were able to add HTTP/2 upstream support to Pingora without major hurdles. This allowed us to offer <a href="/road-to-grpc/">gRPC</a> to our customers shortly afterwards. Adding this same functionality to NGINX would have required <a href="https://mailman.nginx.org/pipermail/nginx-devel/2017-July/010357.html">significantly more engineering effort and might not have materialized</a>.</p><p>More recently we've announced <a href="/introducing-cache-reserve/">Cache Reserve</a>, where Pingora uses R2 storage as a caching layer. As we add more functionality to Pingora, we’re able to offer new products that weren’t feasible before.</p>
    <div>
      <h3>More efficient</h3>
      <a href="#more-efficient">
        
      </a>
    </div>
    <p>In production, Pingora consumes about 70% less CPU and 67% less memory compared to our old service with the same traffic load. The savings come from a few factors.</p><p>Our Rust code runs <a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/rust.html">more efficiently</a> compared to our old <a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/lua.html">Lua code</a>. On top of that, there are also efficiency differences from their architectures. For example, in NGINX/OpenResty, when the Lua code wants to access an HTTP header, it has to read it from the NGINX C struct, allocate a Lua string and then copy it to the Lua string. Afterwards, Lua has to garbage-collect its new string as well. In Pingora, it would just be a direct string access.</p><p>The multithreading model also makes sharing data across requests more efficient. NGINX also has shared memory but due to implementation limitations, every shared memory access has to use a mutex lock and only strings and numbers can be put into shared memory. In Pingora, most shared items can be accessed directly via shared references behind <a href="https://doc.rust-lang.org/std/sync/struct.Arc.html">atomic reference counters</a>.</p><p>Another significant portion of CPU saving, as mentioned above, is from making fewer new connections. TLS handshakes are expensive compared to just sending and receiving data via established connections.</p>
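<p>The Arc-based sharing described above looks like this in practice (a minimal example, not Pingora code):</p>

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // Shared, read-only state: one allocation, visible to every thread.
    let config = Arc::new(String::from("origin=example.com"));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            // Cloning an Arc bumps an atomic refcount; the string itself
            // is never copied and no mutex is taken to read it.
            let cfg = Arc::clone(&config);
            thread::spawn(move || cfg.len())
        })
        .collect();

    for handle in handles {
        assert_eq!(handle.join().unwrap(), 18); // "origin=example.com".len()
    }
}
```

Contrast this with NGINX shared memory, where every access takes a mutex and values must be copied in and out as strings or numbers.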
    <div>
      <h3>Safer</h3>
      <a href="#safer">
        
      </a>
    </div>
    <p>Shipping features quickly and safely is difficult, especially at our scale. It's hard to predict every edge case that can occur in a distributed environment processing millions of requests a second. Fuzzing and static analysis can only mitigate so much. Rust's memory-safe semantics guard us from undefined behavior and give us confidence our service will run correctly.</p><p>With those assurances, we can focus more on how a change to our service will interact with other services or a customer's origin. We can develop features at a higher cadence and not be burdened by memory safety and hard-to-diagnose crashes.</p><p>When crashes do occur, an engineer needs to spend time to diagnose how it happened and what caused it. Since Pingora's inception we’ve served a few hundred trillion requests and have yet to crash due to our service code.</p><p>In fact, Pingora crashes are so rare we usually find unrelated issues when we do encounter one. Recently we discovered <a href="https://lkml.org/lkml/2022/3/15/6">a kernel bug</a> soon after our service started crashing. We've also discovered hardware issues on a few machines; in the past, ruling out rare memory bugs caused by our software was nearly impossible, even after significant debugging.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>To summarize, we have built an in-house proxy that is faster, more efficient, and more versatile, serving as the platform for our current and future products.</p><p>We will be back with more technical details regarding the problems we faced, the optimizations we applied and the lessons we learned from building Pingora and rolling it out to power a significant portion of the Internet. We will also be back with our plan to open source it.</p><p>Pingora is our latest attempt at rewriting our system, but it won’t be our last. It is also only one of the building blocks in the re-architecting of our systems.</p><p>Interested in joining us to help build a better Internet? <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering">Our engineering teams are hiring</a>.</p>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[NGINX]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Pingora]]></category>
            <guid isPermaLink="false">5nZ4Ad1lcQreuoYkJfV36b</guid>
            <dc:creator>Yuchen Wu</dc:creator>
            <dc:creator>Andrew Hauck</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Smart Edge Revalidation]]></title>
            <link>https://blog.cloudflare.com/introducing-smart-edge-revalidation/</link>
            <pubDate>Wed, 28 Jul 2021 12:59:17 GMT</pubDate>
            <description><![CDATA[ When both Last-Modified and Etag headers are absent from the origin response, Smart Edge Revalidation will use the time the object was cached on Cloudflare as the Last-Modified header value. When  ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Today we’re excited to announce Smart Edge Revalidation. It was designed to ensure that compute resources are synchronized efficiently between our edge and a browser. Right now, as many as 30% of objects cached on Cloudflare’s edge do not have the HTTP response headers required for revalidation. This can result in unnecessary origin calls. Smart Edge Revalidation fixes this: it does the work to ensure that these headers are present, even when an origin doesn’t send them to us. The advantage of this? There’s less wasted bandwidth and compute for objects that do not need to be redownloaded. And there are faster browser page loads for users.</p>
    <div>
      <h2>So What Is Revalidation?</h2>
      <a href="#so-what-is-revalidation">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/gTlB9tlXkMhFpf88xVu9O/6fd36a3a9baa63e2c20b9ea49d54bd03/Cache-Hit--Miss.png" />
            
            </figure><p>Revalidation is one part of a longer story about efficiently serving objects that live on an origin server from an intermediary cache. Visitors to a website want it to be fast. One foundational way to make sure that a website is fast for visitors is to serve objects from cache. In this way, requests and responses do not need to transit unnecessary parts of the Internet back to an origin and, instead, can be served from a data center that is closer to the visitor. As such, website operators generally only want to serve content from an origin when content has changed. So how do objects stay in cache for as long as necessary?</p><p>One way to do that is with HTTP response headers.</p><p>When Cloudflare gets a response from an origin, included in that response are a number of headers. You can see these headers by opening any webpage, inspecting the page, going to the network tab, and clicking any file. In the response headers section there will generally be a header known as “<code>Cache-Control</code>.” This header is a way for origins to answer caching intermediaries’ questions like: is this object eligible for cache? How long should this object be in cache? And what should the caching intermediary do after that time expires?</p><div></div><p>How long something should be in cache can be specified through the <code>max-age</code> or <code>s-maxage</code> directives. These directives specify a TTL or time-to-live for the object in seconds. Once the object has been in cache for the requisite TTL, the clock hits 0 (zero) and it is marked as expired. Cache can no longer safely serve expired content to requests without figuring out if the object has changed on the origin or if it is the same.</p><p>If it has changed, it must be redownloaded from the origin. If it hasn’t changed, then it can be marked as fresh and continue to be served. 
This check, again, is known as revalidation.</p><p>We’re excited that Smart Edge Revalidation extends the efficiency of revalidation to everyone, regardless of whether an origin sends the necessary response headers.</p>
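<p>The TTL mechanics above can be sketched in a few lines (illustrative only, using the object's age in seconds rather than real HTTP header parsing):</p>

```rust
// Illustrative freshness check: an object is fresh while its age is
// below the max-age TTL; once expired it must be revalidated (or
// redownloaded) before being served again.
fn is_fresh(age_secs: u64, max_age_secs: u64) -> bool {
    age_secs < max_age_secs
}

fn main() {
    let max_age = 3600; // from Cache-Control: max-age=3600
    assert!(is_fresh(120, max_age));   // 2 minutes old: serve from cache
    assert!(!is_fresh(7200, max_age)); // 2 hours old: revalidate first
}
```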
    <div>
      <h2>How is Revalidation Accomplished?</h2>
      <a href="#how-is-revalidation-accomplished">
        
      </a>
    </div>
    <p>Two additional headers, <a href="https://datatracker.ietf.org/doc/html/rfc7232#section-2.2"><code>Last-Modified</code></a> and <a href="https://datatracker.ietf.org/doc/html/rfc7232#section-2.3"><code>ETag</code></a>, are set by an origin in order to distinguish different versions of the same URL/object across modifications. After the object expires and the revalidation check occurs, if the <code>ETag</code> value hasn’t changed or a more recent <code>Last-Modified</code> timestamp isn’t present, the object is marked “revalidated” and the expired object can continue to be served from cache. If there has been a change as indicated by the <code>ETag</code> value or <code>Last-Modified</code> timestamp, then the new object is downloaded and the old object is removed from cache.</p><p>Revalidation checks occur when a browser sends a request to a cache server using <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-Modified-Since"><code>If-Modified-Since</code></a> or <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/If-None-Match"><code>If-None-Match</code></a> headers. These <i>request headers</i> are questions sent from the browser cache about when an object last changed, which can be answered via the <code>ETag</code> or <code>Last-Modified</code> <i>response headers</i> on the cache server. For example, if the browser sends a request to a cache server with <code>If-Modified-Since: Mon, 08 Nov 2021 07:28:00 GMT</code>, the cache server must look at the object being asked about, and if it has not changed since November 8 at 7:28 AM, it will respond with a <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/304">304 status code</a> indicating it’s unchanged. If the object has changed, the cache server will respond with the new object.</p>
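<p>The revalidation decision can be sketched as follows (simplified to Unix timestamps; real servers parse HTTP-date headers and also compare <code>ETag</code> validators):</p>

```rust
// Simplified conditional-request handling: answer 304 when the cached
// object has not changed since the validator the client sent.
fn status_for(last_modified: u64, if_modified_since: Option<u64>) -> u16 {
    match if_modified_since {
        // Unchanged since the client's copy: lightweight 304, no body.
        Some(since) if last_modified <= since => 304,
        // Changed, or no validator sent: full 200 response with the body.
        _ => 200,
    }
}

fn main() {
    assert_eq!(status_for(1_000, Some(2_000)), 304);
    assert_eq!(status_for(3_000, Some(2_000)), 200);
    assert_eq!(status_for(1_000, None), 200);
}
```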
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5prstYJc5fXhuiLi8ulIxg/78a8cbd21546b33b2afea42bf3ea9826/Cache-Response-Header.png" />
            
            </figure><p>Sending a 304 status code that indicates an object can be reused is much more efficient than sending the entire object. It’s like if you ran a news website that updated every 24 hours. Once the content is updated for the day, you wouldn’t want to keep redownloading the same unchanged content from the origin and instead, you would prefer to make sure that the day’s content was just reused by sending a lightweight signal to that effect, until the site changes the next day.</p><p>The problem with this system of browser questions and revalidation responses is that sometimes origins don’t set <code>ETag</code> or <code>Last-Modified</code> headers, or they aren’t configured by the website’s admin, making revalidation impossible. This means that every time an object expires, it must be redownloaded regardless of if there has been a change or not, because we have to assume that the asset has been updated, or else risk serving stale content.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1RvLJ1Px9qitcuzYQmdlUO/86c73964fd7f8d18ed2697ce3b4fc350/Origin-Response-Header.png" />
            
            </figure><p>This is an incredible waste of resources, costing hundreds of GB/sec of needless bandwidth between the edge and visitors: browsers are downloading hundreds of GB/sec of content they <i>may already have</i>. If our baseline of revalidations is around 10% of all traffic, and initial tests showed Smart Edge Revalidation increasing revalidations by just under 50%, then without a user needing to configure <b>anything</b> we can increase total revalidations by around 5% of all traffic!</p><p>Such a large reduction in bandwidth use also comes with potential environmental benefits. Based on Cloudflare's carbon emissions per byte, the needless bandwidth being used could amount to more than 2,000 metric tons of CO2e per year, equivalent to the <a href="https://www.epa.gov/energy/greenhouse-gas-equivalencies-calculator">CO2 emissions</a> of more than 400 cars in a year.</p><p>Revalidation also comes with a performance improvement, because it usually means a browser downloads less than 1 KB of data to check whether an asset has changed, whereas pulling the full asset can mean hundreds of KB. This improves performance and reduces the bandwidth between the visitor and our edge.</p>
    <div>
      <h2>How Smart Edge Revalidation Works</h2>
      <a href="#how-smart-edge-revalidation-works">
        
      </a>
    </div>
    <p>When both <code>Last-Modified</code> and <code>ETag</code> headers are <i>absent</i> from the origin server response, Smart Edge Revalidation will use the time the object was cached on Cloudflare’s edge as the <code>Last-Modified</code> header value. When a browser sends a revalidation request to Cloudflare using <code>If-Modified-Since</code> or <code>If-None-Match</code>, our edge can answer it using the <code>Last-Modified</code> header generated by Smart Edge Revalidation. In this way, our edge can revalidate efficiently even if the headers are not sent from the origin.</p><p>Smart Edge Revalidation will be enabled automatically for all Cloudflare customers over the coming weeks. If this behavior is undesired, you can ensure that Smart Edge Revalidation is not activated by <a href="https://web.dev/http-cache/#response-headers">confirming your origin</a> sends <code>ETag</code> or <code>Last-Modified</code> headers when you want to indicate changed content. 
Additionally, you can direct your desired revalidation behavior by making sure your origin sets appropriate <a href="https://support.cloudflare.com/hc/en-us/articles/115003206852-Understanding-Origin-Cache-Control">cache-control headers</a>.</p><p>Smart Edge Revalidation is a win for everyone: visitors get more content from cache, faster; website owners can serve and revalidate additional content from Cloudflare efficiently; and the Internet gets a bit greener and more efficient.</p><p>Smart Edge Revalidation is the latest announcement to join the list of ways we're making our network more sustainable to help build a greener Internet — check out posts from earlier this week to learn about our <a href="/cloudflare-committed-to-building-a-greener-internet/">climate commitments</a>, <a href="/announcing-green-compute/">Green Compute with Workers</a>, <a href="https://assets.ctfassets.net/slt3lc6tev37/2YzIeTtzSbyKkM4GsryP5S/62ce0dff98e92a142281a0b462ce4408/Cloudflare_Emissions_Inventory_-_2020.pdf">Carbon Impact Report</a>, <a href="/green-hosting-with-cloudflare-pages/">Pages x Green Web Foundation partnership</a>, and <a href="/crawler-hints-how-cloudflare-is-reducing-the-environmental-impact-of-web-searches/">crawler hints</a>.</p>
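<p>The fallback described in “How Smart Edge Revalidation Works” can be sketched as follows. This is an illustrative sketch with hypothetical names, not Cloudflare’s actual code: when neither header is present, the time the object entered the cache stands in for <code>Last-Modified</code>.</p>

```python
from email.utils import formatdate

def edge_response_headers(origin_headers, cached_at):
    """Synthesize a Last-Modified header when the origin set neither
    Last-Modified nor ETag (hypothetical helper; `cached_at` is the Unix
    timestamp at which the edge cached the object)."""
    headers = dict(origin_headers)
    if "Last-Modified" not in headers and "ETag" not in headers:
        # usegmt=True renders an HTTP-date, e.g. "Mon, 08 Nov 2021 07:28:00 GMT"
        headers["Last-Modified"] = formatdate(cached_at, usegmt=True)
    return headers
```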
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/guKMx74ygDjoGcziS5Usf/41e7cf39b57f8c9ba7c7052e2f4fbd62/Green-Cache.png" />
            
            </figure><p></p> ]]></content:encoded>
            <category><![CDATA[Impact Week]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">6DXhhWwE8C0xQ3eKp7tthE</guid>
            <dc:creator>Alex Krivit</dc:creator>
            <dc:creator>Yuchen Wu</dc:creator>
        </item>
        <item>
            <title><![CDATA[Road to gRPC]]></title>
            <link>https://blog.cloudflare.com/road-to-grpc/</link>
            <pubDate>Mon, 26 Oct 2020 16:40:02 GMT</pubDate>
            <description><![CDATA[ Cloudflare launched support for gRPC during our 2020 Birthday Week. In this post, we’ll do a deep-dive into the technical details of how we implemented support. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8yumfi3N8b2OaUtMbTeLj/fa69604b79cf96671d6f1d798fba8621/image1-38.png" />
            
            </figure><p>Cloudflare launched support for <a href="/announcing-grpc/">gRPC</a>® during our 2020 Birthday Week. We’ve been humbled by the immense interest in the beta, and we’d like to thank everyone who has applied and tried out gRPC! In this post we’ll do a deep dive into the technical details of how we implemented support.</p>
    <div>
      <h3>What is gRPC?</h3>
      <a href="#what-is-grpc">
        
      </a>
    </div>
    <p><a href="https://grpc.io/">gRPC</a> is an open source RPC framework running over HTTP/2. RPC (remote procedure call) is a way for one machine to tell another machine to do something, rather than calling a local function in a library. RPC has a long history in distributed computing, with different implementations focusing on different areas. What makes gRPC unique are the following characteristics:</p><ul><li><p>It requires the modern HTTP/2 protocol for transport, which is now widely available.</p></li><li><p>A full client/server reference implementation, demo, and test suites are available as <a href="https://github.com/grpc">open source</a>.</p></li><li><p>It does not specify a message format, although <a href="https://developers.google.com/protocol-buffers">Protocol Buffers</a> are the preferred serialization mechanism.</p></li><li><p>Both clients and servers can stream data, which avoids having to poll for new data or create new connections.</p></li></ul><p>In terms of the protocol, <a href="https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md">gRPC uses HTTP/2</a> frames extensively: requests and responses look very similar to normal HTTP/2 requests.</p><p>What’s unusual, however, is gRPC’s use of HTTP trailers. While not widely used in the wild, <a href="https://tools.ietf.org/html/rfc2616#section-3.6.1">HTTP trailers have been around since 1999, as defined in the original HTTP/1.1 RFC 2616</a>. HTTP message headers must come before the HTTP message body, but an HTTP trailer is a set of headers that can be appended <i>after</i> the message body. However, because there are not many use cases for trailers, many server and client implementations don't fully support them. 
While HTTP/1.1 needs chunked transfer encoding to send a trailer after the body, in HTTP/2 the trailer is simply a HEADERS frame sent after the DATA frames of the body.</p><p>There are cases where an HTTP trailer is useful. For example, an HTTP response code indicates the status of a request, but it is the very first line of the response, so it must be decided very early. A trailer makes it possible to send metadata after the body. For example, say your web server streams a large response of no fixed size, and at the end you want to send a SHA-256 checksum of the data so the client can verify the contents. Normally this is not possible with the status code or response headers, which must be sent at the beginning of the response. Using an HTTP trailer, you can send another header (e.g. <a href="https://tools.ietf.org/html/draft-ietf-httpbis-digest-headers-04#section-10.11">Digest</a>) after all the data has been sent.</p><p>gRPC uses HTTP trailers for two purposes. First, it sends its final status (<code>grpc-status</code>) as a trailer after the content has been sent. Second, trailers support streaming use cases, which last much longer than normal HTTP requests: the trailer carries the post-processing result of the request or response. For example, if an error occurs while streaming data, you can send an error code in the trailer, which is not possible in the headers that precede the message body.</p><p>Here is a simple example of a gRPC request and response in HTTP/2 frames:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MTBNtDhBqIvqND3l0Fv1w/95fdfb416852a592ce0b05d66dec5865/image3-24.png" />
            
            </figure>
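<p>To make the trailer mechanics concrete, here is a small sketch (not Cloudflare code) of how an HTTP/1.1 chunked body can carry a checksum trailer computed only after all the data has been streamed:</p>

```python
import hashlib

def chunked_body_with_digest(chunks):
    """Build an HTTP/1.1 chunked-transfer body whose trailer section
    carries a SHA-256 digest of the streamed data (illustrative only;
    real Digest header values are normally base64-encoded)."""
    out = bytearray()
    sha = hashlib.sha256()
    for chunk in chunks:
        sha.update(chunk)
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n"  # zero-length chunk marks the end of the body
    # the trailer section follows the last chunk, before the final CRLF
    out += b"Digest: sha-256=" + sha.hexdigest().encode() + b"\r\n\r\n"
    return bytes(out)
```

<p>The digest can only be known after the last chunk is hashed, which is exactly why it must travel in a trailer rather than a header.</p>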
    <div>
      <h3>Adding gRPC support to the Cloudflare Edge</h3>
      <a href="#adding-grpc-support-to-the-cloudflare-edge">
        
      </a>
    </div>
    <p>Since gRPC uses HTTP/2, it may sound easy to support gRPC natively, because Cloudflare already supports <a href="/introducing-http2/">HTTP/2</a>. However, we had a couple of issues:</p><ul><li><p>HTTP request/response trailers were not fully supported by our edge proxy: Cloudflare uses NGINX to accept traffic from eyeballs, and it has limited support for trailers. Further complicating things, requests and responses flowing through Cloudflare go through a set of other proxies.</p></li><li><p>HTTP/2 to origin: our edge proxy uses HTTP/1.1 to fetch objects (whether dynamic or static) from origin. To proxy gRPC traffic, we needed to support connecting to customer gRPC origins over HTTP/2.</p></li><li><p>gRPC streaming requires bidirectional request/response flow: gRPC has two types of protocol flow; one is unary, a simple request and response, and the other is streaming, which allows continuous data flow in each direction. To fully support streaming, the HTTP message body needs to be sent after receiving the response header on the other end. For example, <a href="https://grpc.io/docs/what-is-grpc/core-concepts/#client-streaming-rpc">client streaming</a> keeps sending a request body after receiving a response header.</p></li></ul><p>For these reasons, gRPC requests would break when proxied through our network. To overcome these limitations, we looked at various solutions. For example, NGINX has <a href="https://www.nginx.com/blog/nginx-1-13-10-grpc/">a gRPC upstream module</a> to support HTTP/2 gRPC origins, but it’s a separate module, and it also requires HTTP/2 downstream, which cannot be used for our service, as requests cascade through multiple HTTP proxies in some cases. 
Using HTTP/2 everywhere in the pipeline is not realistic, because of the characteristics of <a href="/keepalives-considered-harmful/">our internal load balancing architecture</a>, and because it would have taken too much effort to make sure all internal traffic uses HTTP/2.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3p9teukTmZfs4wcyWggUZr/07ab8bf40f0bfe57338d3ae770a8bb93/image2-25.png" />
            
            </figure>
    <div>
      <h3>Converting to HTTP/1.1?</h3>
      <a href="#converting-to-http-1-1">
        
      </a>
    </div>
    <p>Ultimately, we discovered a better way: convert gRPC messages to HTTP/1.1 messages without a trailer <i>inside our network</i>, then convert them back to HTTP/2 before sending the request off to origin. This would work with most HTTP proxies inside Cloudflare that don't support HTTP trailers, and would require minimal changes.</p><p>Rather than invent our own format, we used one the gRPC community had already come up with: <a href="https://github.com/grpc/grpc-web">gRPC-web</a>, an HTTP/1.1-compatible modification of the original HTTP/2-based gRPC specification. Its original purpose was use in web browsers, which lack direct access to HTTP/2 frames. With gRPC-web, the HTTP trailer is moved into the body, so we don’t need to worry about HTTP trailer support inside the proxy. It also comes with streaming support. The resulting HTTP/1.1 message can still be inspected by our security products, such as WAF and Bot Management, providing the same level of security that Cloudflare brings to other HTTP traffic.</p><p>When an HTTP/2 gRPC message is received at Cloudflare’s edge proxy, the message is “converted” to the HTTP/1.1 gRPC-web format. Once converted, the message goes through our pipeline, with services such as WAF, Cache, and Argo applied the same way as for any normal HTTP request.</p><p>Right before a gRPC-web message leaves the Cloudflare network, it is “converted back” to HTTP/2 gRPC. Requests converted by our system are marked so that we won’t accidentally convert gRPC-web traffic originating from clients.</p>
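<p>The gRPC-web wire format that makes this conversion possible is simple: each frame is a one-byte flag plus a four-byte big-endian length, and a frame with the high flag bit set carries the trailers inside the body. A minimal encoding sketch (illustrative, not Cloudflare’s proxy code):</p>

```python
import struct

TRAILER_FLAG = 0x80  # high bit of the flags byte marks a trailers frame

def encode_grpc_web_frame(payload, is_trailer=False):
    """Encode one gRPC-web frame: 1 flag byte, a 4-byte big-endian length
    prefix, then the payload (a sketch of the framing, not a full client)."""
    flags = TRAILER_FLAG if is_trailer else 0x00
    return struct.pack(">BI", flags, len(payload)) + payload

# A unary response body: one data frame, then the trailers moved into
# the body as a text-encoded header block.
body = encode_grpc_web_frame(b"<serialized protobuf message>")
body += encode_grpc_web_frame(b"grpc-status: 0\r\n", is_trailer=True)
```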
    <div>
      <h3>HTTP/2 Origin Support</h3>
      <a href="#http-2-origin-support">
        
      </a>
    </div>
    <p>One of the engineering challenges was connecting to origins over HTTP/2. Before this project, Cloudflare didn't have the ability to connect to origins via HTTP/2.</p><p>Therefore, we decided to build HTTP/2 origin support in-house. We built a standalone origin proxy that is able to connect to origins via HTTP/2, and on top of this new platform we implemented the conversion logic for gRPC. gRPC support is the first feature that takes advantage of the new platform; broader support for HTTP/2 connections to origin servers is on the roadmap.</p>
    <div>
      <h3>gRPC Streaming Support</h3>
      <a href="#grpc-streaming-support">
        
      </a>
    </div>
    <p>As explained above, gRPC has a streaming mode in which the request or response body can be sent as a stream; over the lifetime of a gRPC request, gRPC message blocks can be sent at any time. At the end of the stream, a HEADERS frame indicates the end of the stream. When converting to gRPC-web, we send the body using chunked encoding and keep the connection open, accepting body data in both directions until we get the gRPC message block that indicates the end of the stream. This requires our proxy to support bidirectional transfer.</p><p>For example, client streaming is an interesting mode in which the server has already responded with a response code and headers, but the client can still send the request body.</p>
    <div>
      <h3>Interoperability Testing</h3>
      <a href="#interoperability-testing">
        
      </a>
    </div>
    <p>Every new feature at Cloudflare needs proper testing before release. During initial development, we used the <a href="https://www.envoyproxy.io/">Envoy</a> proxy with its <a href="https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/grpc_web_filter">gRPC-web filter</a> and the official gRPC examples. We prepared a test environment with Envoy and a gRPC test origin to make sure the edge proxy worked properly with gRPC requests. Requests from the gRPC test client were sent to the edge proxy, converted to gRPC-web, and forwarded to the Envoy proxy; Envoy then converted them back to gRPC requests and sent them to the gRPC test origin. We were able to verify the basic behavior this way.</p><p>Once we had basic functionality ready, we also needed to make sure the conversion functionality at both ends worked properly. To do that, we built deeper interoperability testing.</p><p>We referenced the existing <a href="https://github.com/grpc/grpc/blob/master/doc/interop-test-descriptions.md">gRPC interoperability test cases</a> for our test suite and ran the first iteration of tests between the edge proxy and the new origin proxy locally.</p><p>For the second iteration of tests we used different gRPC implementations. For example, some servers send their final status (<code>grpc-status</code>) in a <a href="https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md#responses">trailers-only</a> response when there is an immediate error. That response contains the HTTP/2 response headers and trailer in a single HEADERS frame block with both the END_STREAM and END_HEADERS flags set. Other implementations send the final status as a trailer in a separate HEADERS frame.</p><p>After verifying interoperability locally, we ran the test harness against a development environment that supports all the services we have in production, to ensure no unintended side effects impacted gRPC requests.</p><p>We love dogfooding! 
One of the first services we successfully deployed edge gRPC support to is the <a href="/inside-the-entropy/">Cloudflare drand randomness beacon</a>. Onboarding was easy and we’ve been running the beacon in production for the last few weeks without a hitch.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Supporting a new protocol is exciting work! Implementing support for new technologies in existing systems is exciting <i>and</i> intricate, often involving tradeoffs between speed of implementation and overall system complexity. In the case of gRPC, we were able to build support quickly and in a way that did not require significant changes to the Cloudflare edge. This was accomplished by carefully considering implementation options before settling on the idea of converting between HTTP/2 gRPC and the HTTP/1.1 gRPC-web format. This design choice made service integration quicker and easier while still satisfying our users’ expectations and constraints.</p><p>If you are interested in using Cloudflare to secure and accelerate your gRPC service, you can read more <a href="/announcing-grpc/">here</a>. And if you want to work on interesting engineering challenges like the one described in this post, <a href="https://www.cloudflare.com/careers/">apply</a>!</p><p><i>gRPC® is a registered trademark of The Linux Foundation.</i></p> ]]></content:encoded>
            <category><![CDATA[gRPC]]></category>
            <guid isPermaLink="false">2aHUwBYwekbmiMXHlzwUDf</guid>
            <dc:creator>Junho Choi</dc:creator>
            <dc:creator>Yuchen Wu</dc:creator>
            <dc:creator>Sangjo Lee</dc:creator>
            <dc:creator>Andrew Hauck</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why We Started Putting Unpopular Assets in Memory]]></title>
            <link>https://blog.cloudflare.com/why-we-started-putting-unpopular-assets-in-memory/</link>
            <pubDate>Tue, 24 Mar 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Recently, we developed a technology that reduces our hit tail latency and reduces the wear out of SSDs. This technology is a memory-SSD hybrid storage system that puts unpopular assets in memory. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Part of Cloudflare's service is a CDN that makes millions of Internet properties faster and more reliable by caching web assets closer to browsers and end users.</p><p>We make improvements to our infrastructure to make end-user experiences faster, more secure, and more reliable all the time. Here’s a case study of one such engineering effort where something counterintuitive turned out to be the right approach.</p><p>Our storage layer, which serves millions of cache hits per second globally, is powered by high <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">IOPS NVMe SSDs</a>.</p><p>Although SSDs are fast and reliable, cache hit tail latency within our system is dominated by the IO capacity of our SSDs. Moreover, because flash memory chips wear out, a non-negligible portion of our operational cost, including the cost of new devices, shipment, labor and downtime, is spent on replacing dead SSDs.</p><p>Recently, we developed a technology that reduces our hit tail latency and reduces the wear out of SSDs. This technology is a memory-SSD hybrid storage system that puts <b><i>unpopular</i></b> assets in memory.</p><p><b>The end result: cache hits from our infrastructure are now faster for all customers.</b></p><p>You may have thought that was a typo in my explanation of how the technique works. In fact, a few colleagues thought the same when we proposed this project internally, “I think you meant to say <b><i>‘popular</i></b> assets’ in your document”. Intuitively, we should only put popular assets in memory since memory is even faster than SSDs.</p><p>You are not wrong. But I’m also correct. This blog post explains why.</p>
    <div>
      <h2>Putting popular assets in memory</h2>
      <a href="#putting-popular-assets-in-memory">
        
      </a>
    </div>
    <p>First let me explain how we already use memory to speed up the IO of popular assets.</p>
    <div>
      <h3>Page cache</h3>
      <a href="#page-cache">
        
      </a>
    </div>
    <p>Since it is so obvious that memory speeds up access to popular assets, Linux already does much of the hard work for us. This functionality is called the “page cache”. Files in Linux systems are organized into pages internally, hence the name.</p><p>The page cache uses the system’s available memory to cache reads from and buffer writes to durable storage. During typical operation of a Cloudflare server, the services and processes themselves do not consume all physical memory; the remaining memory is used by the page cache.</p><p>The figure below shows the memory layout of a typical Cloudflare edge server with 256GB of physical memory.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5KkHFLmaUIQV1xFNyJfNc6/726ee9b4a7959daf1edf055ac4178bf4/Screen-Shot-2020-03-23-at-10.04.24-AM.png" />
            
            </figure><p>We can see the used memory (RSS) on top. RSS memory consumption for all services is about 87.71 GB, with our cache service consuming 4.1 GB. At the bottom we see overall page cache usage. Total size is 128.6 GB, with the cache service using 41.6 GB of it to serve cached web assets.</p>
    <div>
      <h3>Caching reads</h3>
      <a href="#caching-reads">
        
      </a>
    </div>
    <p>Pages that are frequently used are cached in memory to speed up reads. Linux uses the least recently used (LRU) algorithm <a href="https://github.com/torvalds/linux/commit/a528910e12ec7ee203095eb1711468a66b9b60b0">among other techniques</a> to decide what to keep in page cache. In other words, popular assets are already in memory.</p>
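<p>As a rough illustration of the policy (not the kernel’s actual implementation, which uses a more elaborate multi-list approximation of LRU), an LRU cache can be sketched as:</p>

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache illustrating the eviction policy the Linux page
    cache approximates: reads refresh an entry's recency, and the least
    recently used entry is evicted when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def get(self, key):
        if key not in self.pages:
            return None
        self.pages.move_to_end(key)   # mark as most recently used
        return self.pages[key]

    def put(self, key, value):
        if key in self.pages:
            self.pages.move_to_end(key)
        self.pages[key] = value
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
```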
    <div>
      <h3>Buffering writes</h3>
      <a href="#buffering-writes">
        
      </a>
    </div>
    <p>The page cache also buffers writes. Writes are not synchronized to disk right away; they are buffered until the page cache decides to flush them at a later time, either periodically or due to memory pressure. This also reduces repetitive writes to the same file.</p><p>However, because of the nature of the caching workload, a web asset arriving at our cache system is usually static. Unlike database workloads, where a single value can get repeatedly updated, in our caching workload there are very few repetitive updates to an asset before it is completely cached. Therefore, the page cache does not help reduce writes to disk, since all pages will eventually be written to disk.</p>
    <div>
      <h3>Smarter than the page cache?</h3>
      <a href="#smarter-than-the-page-cache">
        
      </a>
    </div>
    <p>Although the Linux page cache works great out of the box, our caching system can still outsmart it since our system knows more about the context of the data being accessed, such as the content type, access frequency and the time-to-live value.</p><p>Instead of improving caching popular assets, which the page cache already does, in the next section, we will focus on something the page cache cannot do well: putting unpopular assets in memory.</p>
    <div>
      <h2>Putting unpopular assets in memory</h2>
      <a href="#putting-unpopular-assets-in-memory">
        
      </a>
    </div>
    <p><b>Motivation: Writes are bad</b></p><p>SSD wear-out is caused solely by writes, not reads. As mentioned above, it is costly to replace SSDs. Since Cloudflare operates data centers in 200 cities across the world, the cost of shipment and replacement can be a significant proportion of the cost of the SSD itself.</p><p>Moreover, write operations on SSDs <a href="https://www.usenix.org/conference/fast12/reducing-ssd-read-latency-nand-flash-program-and-erase-suspension">slow down read operations</a>, because SSD writes require program/erase (P/E) operations to be issued to the storage chips, and these operations block reads to the same chips. In short, the more writes, the slower the reads. Since the page cache already caches popular assets, this effect has the highest impact on our cache hit tail latency. Previously we talked about how we <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">dramatically reduced the impact of such latency</a>; directly reducing the latency itself is also very helpful.</p><p><b>Motivation: Many assets are “one-hit-wonders”</b></p><p>One-hit-wonders are cached assets that are accessed once, cached, and never read again. They also include assets that are evicted, due to lack of popularity, before they are ever read. Website performance is not harmed if we don’t cache such assets in the first place.</p><p>To quantify how many of our assets are one-hit-wonders, and to test the hypothesis that <b>reducing disk writes can speed up disk reads</b>, we conducted a simple experiment: using a representative sample of traffic, we modified our caching logic to only cache an asset the second time our server encountered it.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7K3qlrpIfh40SNSFCLdeMl/34361c0b74a052d2010d7517cf5c25e5/Screen-Shot-2020-03-17-at-12.35.30-PM.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/60bqrrRAmrxDzLIvDaOUqv/92277ae85b7026fec8b786193f8518e2/Screen-Shot-2020-03-17-at-12.44.56-PM-1.png" />
            
            </figure><p>The red line indicates when the experiment started. The green line represents the experimental group and the yellow line the control group.</p><p>The result: disk writes per second were reduced by roughly half, and the corresponding disk <b>hit tail latency was reduced by approximately five percent</b>.</p><p>This experiment demonstrated that if we don’t cache one-hit-wonders, our disks last longer and our cache hits are faster.</p><p>One more benefit of not caching one-hit-wonders, not immediately obvious from the experiment, is that it increases the effective capacity of the cache, because unpopular assets no longer compete for space. This in turn increases the cache hit ratio and cache retention.</p><p>The next question: how can we replicate these results, at scale, in production, without negatively impacting customer experience?</p>
    <div>
      <h2>Potential implementation approaches</h2>
      <a href="#potential-implementation-approaches">
        
      </a>
    </div>
    
    <div>
      <h3>Remember but don’t cache</h3>
      <a href="#remember-but-dont-cache">
        
      </a>
    </div>
    <p>One smart way to eliminate one-hit-wonders from cache is to not cache an asset during its first few misses, but to remember its appearances. When the cache system encounters the same asset more than a certain number of times, it starts to cache the asset. This is essentially what we did in the experiment above.</p><p>To remember the appearances, besides hash tables, many memory-efficient data structures, such as the <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>, <a href="https://en.wikipedia.org/wiki/Counting_Bloom_filter">Counting Bloom filter</a> and <a href="https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch">Count-min sketch</a>, can be used, each with <a href="/when-bloom-filters-dont-bloom/">its own trade-offs</a>.</p><p>Such an approach solves one problem but introduces a new one: every asset is missed at least twice before it is cached. Because the number of cache misses is amplified compared to what we do today, this approach multiplies both the bandwidth cost and the server utilization for our customers. That is not an acceptable tradeoff in our eyes.</p>
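<p>As an illustration of the “remember but don’t cache” idea, here is a sketch of an admission filter in the spirit of a Counting Bloom filter. The parameters and names are hypothetical, and hash collisions can over-count, so it may occasionally admit an asset one miss early, but it never forgets a genuinely popular one:</p>

```python
import hashlib

class AdmissionSketch:
    """Remember roughly how often each asset was seen, without storing
    full keys, and admit an asset to cache only after `threshold`
    sightings (illustrative sketch; sizes are hypothetical)."""
    def __init__(self, slots=1024, hashes=3, threshold=2):
        self.counters = [0] * slots
        self.hashes = hashes
        self.threshold = threshold

    def _indexes(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % len(self.counters)

    def should_cache(self, key):
        idx = list(self._indexes(key))
        # the minimum counter bounds the true count from above
        seen = min(self.counters[i] for i in idx)
        for i in idx:
            self.counters[i] += 1
        return seen + 1 >= self.threshold
```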
    <div>
      <h3>Transient cache in memory</h3>
      <a href="#transient-cache-in-memory">
        
      </a>
    </div>
    <p>A better idea we came up with: put every asset we want to cache to disk in memory <i>first</i>. This memory is called the “transient cache”. Assets in the transient cache are promoted to the permanent, SSD-backed cache after they are accessed a certain number of times, indicating they are popular enough to be stored persistently. If they are not promoted, they are eventually evicted from the transient cache for lack of popularity.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1czq59FF4mV0OR472rLAqi/16cc7284cd6bd9f3a8cb965d591f3abf/pasted-image-0--4--1.png" />
            
            </figure><p>This design makes sure that disk writes are only spent on assets that are popular, while ensuring every asset is only “missed” once.</p>
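<p>A toy version of the two-tier design (hypothetical sizes and promotion threshold, with plain dicts standing in for both tiers and arbitrary eviction where a real system would use a recency-based policy) looks like this:</p>

```python
class HybridCache:
    """Sketch of the transient-cache idea: every asset lands in a small
    in-memory tier first and is promoted to the SSD-backed tier (here
    just a dict) only after `promote_after` hits."""
    def __init__(self, transient_capacity=2, promote_after=1):
        self.transient = {}   # key -> (value, hits since insertion)
        self.disk = {}        # stand-in for the persistent SSD cache
        self.capacity = transient_capacity
        self.promote_after = promote_after

    def put(self, key, value):
        if len(self.transient) >= self.capacity:
            # evict the oldest unpopular entry without ever touching disk
            self.transient.pop(next(iter(self.transient)))
        self.transient[key] = (value, 0)

    def get(self, key):
        if key in self.disk:
            return self.disk[key]
        if key in self.transient:
            value, hits = self.transient[key]
            if hits + 1 >= self.promote_after:
                self.disk[key] = value        # popular: write to SSD once
                del self.transient[key]
            else:
                self.transient[key] = (value, hits + 1)
            return value
        return None
```

<p>Disk writes happen only on promotion, so one-hit-wonders come and go without ever costing a P/E cycle.</p>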
    <div>
      <h2>The transient cache system on the real world Internet</h2>
      <a href="#the-transient-cache-system-on-the-real-world-internet">
        
      </a>
    </div>
    <p>We implemented this idea and deployed it to our production caching systems. Here are some learnings and data points we would like to share.</p>
    <div>
      <h3>Trade-offs</h3>
      <a href="#trade-offs">
        
      </a>
    </div>
    <p>Our transient cache technology does not pull a performance-improvement rabbit out of a hat. It achieves improved performance for our customers by consuming other resources in our system, and these resource trade-offs must be carefully considered when deploying at the scale at which Cloudflare operates.</p><p><b>Transient cache memory footprint</b></p><p>The amount of memory allocated for our transient cache matters. Cache size directly dictates how long a given asset lives in cache before it is evicted. If the transient cache is too small, new assets will evict old assets before the old assets receive the hits that would promote them to disk. From a customer's perspective, cache retention that is too short is unacceptable: it leads to more misses, commensurately higher cost, and worse eyeball performance.</p><p><b>Competition with page cache</b></p><p>As mentioned before, the page cache uses the system’s available memory. The more memory we allocate to the transient cache for <i>unpopular</i> assets, the less memory we have for <i>popular</i> assets in the page cache. Finding the sweet spot for this trade-off depends on traffic volume, usage patterns, and the specific hardware configuration our software runs on.</p><p><b>Competition with process memory usage</b></p><p>Another competitor to the transient cache is the regular memory used by our services and processes. Unlike the page cache, which is used opportunistically, process memory usage is a hard requirement for the operation of our software. An additional wrinkle is that some of our edge services increase their total RSS memory consumption when performing a “zero downtime upgrade”, which runs both the old and new versions of the service in parallel for a short period of time. 
From our experience, if there is not enough physical memory to do so, the system’s overall performance will degrade to an unacceptable level due to increased IO pressure from reduced page cache space.</p>
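<p>The eviction-before-promotion hazard described above can be made concrete with a minimal sketch. The class, field names, and thresholds below are our own illustration, not Cloudflare's implementation: an asset must collect enough hits while still resident in the transient cache to earn its one write to disk; if the cache is too small, it is evicted first and the hit is lost.</p>

```python
from collections import OrderedDict

class TransientCache:
    """Minimal sketch of a transient in-memory cache that promotes an
    asset to durable (SSD) storage only after it proves popular.
    Names and thresholds are illustrative, not Cloudflare's code."""

    def __init__(self, capacity, promote_after):
        self.capacity = capacity            # max assets held in memory
        self.promote_after = promote_after  # hits needed before a disk write
        self.hits = OrderedDict()           # asset key -> hit count, FIFO order
        self.on_disk = set()                # stand-in for the persistent cache

    def access(self, key):
        """Record a request; return True if this hit promoted the asset."""
        if key in self.on_disk:
            return False                    # served from persistent cache
        if key not in self.hits:
            if len(self.hits) >= self.capacity:
                # A too-small transient cache evicts assets before they
                # earn promotion -- the retention trade-off above.
                self.hits.popitem(last=False)   # evict oldest; no disk write
            self.hits[key] = 0
        self.hits[key] += 1
        if self.hits[key] >= self.promote_after:
            self.on_disk.add(key)           # one write to SSD...
            del self.hits[key]              # ...then free the memory
            return True
        return False
```

<p>With <code>capacity=2</code> and <code>promote_after=2</code>, an asset that is requested twice before eviction gets exactly one disk write, while one-hit wonders never touch the SSD at all.</p>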
    <div>
      <h3>In production</h3>
      <a href="#in-production">
        
      </a>
    </div>
    <p>Given the considerations above, we enabled this technology conservatively to start. The new hybrid storage system is used only by our newer generations of servers that have more physical memory than older generations. Additionally, the system is only used on a subset of assets. By tuning the size of the transient cache and what percentage of requests use it at all, we are able to explore the sweet spots for the trade-offs between performance, cache retention, and memory resource consumption.</p><p>The chart below shows the IO usage of an enabled cohort before and after (red line) enabling transient cache.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6BSsNPgx8rrMXPvmSUhxoE/babef48059ce8b6ed26db7a8a5bf9077/Screen-Shot-2020-03-17-at-11.09.38-AM-1.png" />
            
            </figure><p>We can see that disk writes (in bytes per second) were reduced by 25% at peak and 20% off peak. Although it is too early to tell, the lifespan of those SSDs should be extended proportionally.</p><p>More importantly for our customers: our CDN cache hit tail latency has measurably decreased!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ocvoQpil6QTcaxw5Zgcmj/a7ef94a727251dfd4e9f3313271aa027/Screen-Shot-2020-03-17-at-11.30.09-AM.png" />
            
            </figure>
    <div>
      <h2>Future work</h2>
      <a href="#future-work">
        
      </a>
    </div>
    <p>We have taken just the first step towards a smart memory-SSD hybrid storage system. There is still a lot to be done.</p>
    <div>
      <h3>Broader deployment</h3>
      <a href="#broader-deployment">
        
      </a>
    </div>
    <p>Currently, we only apply this technology to specific hardware generations and a portion of traffic. When we decommission our older generations of servers and replace them with newer generations that have more physical memory, we will be able to apply the transient cache to a larger portion of traffic. Per our preliminary experiments, in certain data centers, applying the transient cache to all traffic can reduce disk writes by up to 70% and decrease end-user-visible tail latency by up to 20% at peak hours.</p>
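<p>A gradual rollout like the one described above could be sketched as a deterministic, hash-based sampler, so that a tunable share of assets consistently uses the transient cache and the rest form a control cohort. The hashing scheme and function name here are illustrative assumptions, not Cloudflare's rollout mechanism:</p>

```python
import hashlib

def uses_transient_cache(asset_key, enabled_percent):
    """Sketch: deterministically route a fixed, tunable percentage of
    assets through the transient cache. Hashing the asset key (rather
    than flipping a coin per request) keeps each asset's treatment
    stable, so cohorts can be compared over time."""
    digest = hashlib.sha256(asset_key.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # stable 0-99 bucket
    return bucket < enabled_percent
```

<p>Because the decision is a pure function of the key, raising <code>enabled_percent</code> only ever adds assets to the enabled cohort; it never flips previously enabled assets back out.</p>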
    <div>
      <h3>Smarter promotion algorithms</h3>
      <a href="#smarter-promotion-algorithms">
        
      </a>
    </div>
    <p>The promotion strategy we implemented for moving assets from the transient cache to durable storage is based on a simple heuristic: the number of hits. Other information can be used to make a better promotion decision. For example, the TTL (time to live) of an asset can be taken into account as well. One such strategy is to refuse to promote an asset whose TTL has only a few seconds left, which further reduces unnecessary disk writes.</p><p>Another aspect of these algorithms is demotion. We can actively or passively (during eviction) demote assets from persistent cache to transient cache based on the criteria mentioned above. The demotion itself does not directly reduce writes to the persistent cache, but done smartly it could improve cache hit ratio and cache retention.</p>
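<p>The TTL-aware heuristic above might be sketched as follows. The thresholds are hypothetical defaults for illustration, not Cloudflare's tuned values:</p>

```python
def should_promote(hit_count, ttl_remaining_secs,
                   min_hits=2, min_ttl_secs=10):
    """Sketch of a TTL-aware promotion decision: an asset must be
    popular enough AND have enough lifetime left for the disk write
    to pay off. Thresholds are illustrative assumptions."""
    if ttl_remaining_secs <= min_ttl_secs:
        return False   # about to expire anyway; skip the SSD write
    return hit_count >= min_hits
```

<p>Note that the TTL check comes first: even a very hot asset is not worth a write if it will be invalid moments later.</p>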
    <div>
      <h3>Transient cache on other types of storage</h3>
      <a href="#transient-cache-on-other-types-of-storage">
        
      </a>
    </div>
    <p>We chose memory to host the transient cache because its wear-out cost from writes is effectively zero. The same is largely true of HDDs. It is possible to build a hybrid storage system across memory, SSDs, and HDDs to find the right balance between their costs and performance characteristics.</p>
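<p>Extending this idea, a placement policy across memory, HDD, and SSD tiers could look like the minimal sketch below, trading each medium's wear cost against its read latency. The tiers, thresholds, and popularity estimate are illustrative assumptions, not a production policy:</p>

```python
def choose_tier(expected_hits):
    """Sketch: pick a storage tier for a newly admitted asset from a
    (hypothetical) popularity estimate. Unpopular assets go where
    writes are free; only hot assets justify SSD wear."""
    if expected_hits < 2:
        return "memory"  # likely a one-hit wonder: zero wear cost
    if expected_hits < 10:
        return "hdd"     # some reuse: negligible wear, slower reads
    return "ssd"         # hot asset: the SSD write pays for itself
```
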
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>By putting unpopular assets in memory, we are able to trade off memory usage, tail hit latency, and SSD lifetime against one another. With our current configuration, we are able to extend SSD life while reducing tail hit latency, at the expense of available system memory. The transient cache also opens possibilities for future heuristic and storage hierarchy improvements.</p><p>Whether the techniques described in this post benefit your system depends on its workload and hardware resource constraints. We hope that sharing this counterintuitive idea inspires other novel ways of building faster and more efficient systems.</p><p>And finally, if doing systems work like this sounds interesting to you, <a href="https://www.cloudflare.com/jobs">come work at Cloudflare!</a></p> ]]></content:encoded>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">79sHDInjNOa8iLocIRctJg</guid>
            <dc:creator>Yuchen Wu</dc:creator>
        </item>
    </channel>
</rss>