blog post about the services at Cloudflare that are migrating to post-quantum.

Theoretically, there is no impediment to adding post-quantum cryptography to any system. But the reality is harder. In the middle of last year, we posed ourselves a big challenge: to change all internal connections at Cloudflare to use post-quantum cryptography. We call this, in a cheeky way, “post-quantum-ifying” our services. Theoretically, this should be simple: swap algorithms for post-quantum ones and move along. But with dozens of different services in various programming languages (as we have at Cloudflare), it is not so simple. The challenge is big but we are here and up for the task! In this blog post, we will look at what our plan was, where we are now, and what we have learned so far. Welcome to the first announcement of a post-quantum future at Cloudflare: our connections are going to be quantum-secure!

What are we doing?

The life of most requests at Cloudflare begins and ends at the edge of our global network. Not all requests are equal and on their path they are transmitted by several protocols. Some of those protocols provide security properties whilst others do not. For the protocols that do, for context, Cloudflare uses: TLS, QUIC, WireGuard, DNSSEC, IPsec, Privacy Pass, and more. Migrating all of these protocols and connections to use post-quantum cryptography is a formidable task. It is also a task that we do not treat lightly because:

  • We have to be assured that the security properties provided by the protocols are not diminished.
  • We have to be assured that performance is not negatively affected.
  • We have to be wary of other requirements of our ever-changing ecosystem (like, for example, keeping in mind our FedRAMP certification efforts).

Given these requirements, we had to decide on the following:

  • How are we going to introduce post-quantum cryptography into the protocols?
  • Which protocols will we be migrating to post-quantum cryptography?
  • Which Cloudflare services will be targeted for this migration?

Let’s explore now what we chose: welcome to our path!

TLS and post-quantum in the real world

One of the most used security protocols is Transport Layer Security (TLS). It is the vital protocol that protects most of the data that flows over the Internet today. Many of Cloudflare’s internal services also rely on TLS for security. It seemed natural that, for our migration to post-quantum cryptography, we would start with this protocol.

The protocol provides three security properties: integrity, authentication, and confidentiality. The algorithms used to provide the first property, integrity, seem to not be quantum-threatened (there is some research on the matter). The second property, authentication, is under quantum threat, but we will not focus on it for reasons detailed later. The third property, confidentiality, is the one that we are interested in protecting as it is urgent to do this now.

Confidentiality assures that no one other than the intended receiver and sender of a message can read the transmitted message. Confidentiality is especially threatened by quantum computers as an attacker can record traffic now and decrypt it in the future (when they get access to a quantum computer): this means that all past and current traffic, not just future traffic, is vulnerable to be read by anyone who obtains a quantum computer (and has stored the encrypted traffic captured today).

At Cloudflare, to protect many of our connections, we use TLS. We mainly use the latest version of the protocol, TLS 1.3, but we do sometimes still use TLS 1.2 (as seen in the image, though, it only shows the connections between websites to our network).  As we are a company that pushes for innovation, this means that we are intent on using this time of migration to post-quantum cryptography as an opportunity to also update TLS handshakes to 1.3 and be assured that we are using TLS in the right way (by, for example, ensuring that we are not using deprecated features of TLS).

Image of Cloudflare’s TLS and QUIC usage taken on the 17/02/2022 and showing the last 7 days.
Cloudflare’s TLS and QUIC usage taken on the 17/02/2022 and showing the last 7 days.

Changing TLS 1.3 to provide quantum-resistant security for its confidentiality means changing the ‘key exchange’ phase of the TLS handshake. Let’s briefly look at how the TLS 1.3 handshake works.

Image showing a TLS 1.3 handshake
The TLS 1.3 handshake

In TLS 1.3, there will always be two parties: a client and a server. A client is a party that wants the server to “serve” them something, which could be a website, emails, chat messages, voice messages, and more. The handshake is the process by which the server and client attempt to agree on a shared secret, which will be used to encrypt the subsequent exchange of data (this shared secret is called the “master secret”). The client selects their favorite key exchange algorithms and submits one or more “key shares” to the server (they send both the name of the key share and its public key parameter). The server picks one of the key exchange algorithms (assuming that it supports one of them), and replies back with their own key share. Both the server and the client then combine the key shares to compute a shared secret (the “master secret”), which is used to protect the remainder of the connection. If the client only chooses algorithms that the server does not support, the server instead replies with the algorithms that it does support and asks the client to try again. During this initial conversation, the client and server also agree on authentication methods and the parameters for encryption, but we can leave that aside for today in this blog post. This description is also simplified to focus only in the “key exchange” phase.

There is a mechanism to add post-quantum cryptography to this procedure: you advertise post-quantum algorithms in the list of key shares, so the final derived shared key (the “master secret”) is quantum secure. But there are requirements we had to take into account when doing this with our connections: the security of much post-quantum cryptography is still under debate and we need to respect our compliance efforts. The solution to these requirements is to use a “hybrid” mechanism.

A “hybrid” mechanism means to use both a “pre-quantum” or “classical” algorithm and a “post-quantum” algorithm, and mixing both generated shared secrets into the derivation of the “master secret”. The combination of both shared secrets is of the form \Z′ = Z || T\ (for TLS and with the fixed size shared secrets, simple concatenation is secure. In other cases, you have to be a bit more careful). This procedure is a concatenation consisting of:

The usage of a “hybrid” approach allows us to safeguard our connections in case the security of the post-quantum algorithm fails. It also results in a suitable-for-FIPS secret, as it is approved in the “Recommendation for Key-Derivation Methods in Key-Establishment Schemes” (SP 800-56C Rev. 2), which is listed in the Annex D, as an approved key establishing technique for FIPS 140-2.

At Cloudflare, we are using different TLS libraries. We decided to add post-quantum cryptography to those, specifically, to the BoringCrypto library or the compiled version of Golang with BoringCrypto. We added our implementation of the Kyber-512 algorithm (this algorithm can be eventually swapped by another one; we’re not picking any here. We are using it for our testing phase) to those libraries and implemented the “hybrid” mechanism as part of the TLS handshake. For the “classical” algorithm we used curve P-256. We then compiled certain services with these new TLS libraries.

Name of algorithm Number of times loop executed Average runtime per operation Number of bytes required per operation Number of allocations
Curve P-256 23,056 52,204 ns/op 256 B/op 5 allocs/op
Kyber-512 100,977 11,793 ns/op 832 B/op 3 allocs/op

Table 1: Benchmarks of the “key share” operation of Curve P-256 and Kyber-512: Scalar Multiplication and Encapsulation respectively. Benchmarks ran on Darwin, amd64, Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz.

Note that as TLS supports the described “negotiation” mechanism for the key exchange, the client and server have a way of mutually deciding what algorithms they want to use. This means that it is not required that both a client or a server support or even prefer the exact same algorithms: they just need to share support for a single algorithm for a handshake to succeed. In turn, herewith, even if we advertise post-quantum cryptography and a server/client does not support it, they will not fail but rather agree on some other algorithm they share.

A note on a matter we left on hold above: why are we not migrating the authentication phase of TLS to post-quantum? Certificate-based authentication in TLS, which is the one we commonly use at Cloudflare, also depends on systems on the wider Internet. Thus, changes to authentication require a coordinated and much wider effort to change. Certificates are attested as proofs of identity by outside parties: migrating authentication means coordinating a ceremony of migration with these outside parties. Note though that at Cloudflare we use a PKI with internally-hosted Certificate Authorities (CAs), which means that we can more easily change our algorithms. This will still need careful planning. We will not do this today, but we will in the near future.

Cloudflare services

The first step of our post-quantum migration is done. We have TLS libraries with post-quantum cryptography using a hybrid mechanism. The second step is to test this new mechanism in specific Cloudflare connections and services. We will look at three systems from Cloudflare that we have started migrating to post-quantum cryptography. The services in question are: Logfwrdr, Cloudflare Tunnel, and GoKeyless.

A post-quantum Logfwdr

Logfwdr is an internal service, written in Golang, that handles structured logs, and sends them from our servers for processing (to a subservice called ‘Logreceiver’) where they write them to Kafka. The connection between Logfwdr and Logreceiver is protected by TLS. The same goes for the connection between Logreceiver and Kafka in core. Logfwdr pushes its logs through “streams” for processing.

This service seemed an ideal candidate for migrating to post-quantum cryptography as its architecture is simple, it has long-lived connections, and it handles a lot of traffic. In order to first test the viability of using post-quantum cryptography, we created our own instance of Logreceiver and deployed it. We also created our own stream (the “pq-stream”), which is basically a copy of a HTTP stream (which was remarkably easy to add). We then compiled these services with the modified TLS library and we got a post-quantum protected Logfwdr.

Graph showing the TLS latency of Logfwdr for selected servers.
Figure 1: TLS latency of Logfwdr for selected metals. Notice how post-quantum cryptography is faster than non-post-quantum one (it is labeled with a “PQ” acronym).

What we found was that using post-quantum cryptography is faster than using “classical” cryptography! This was expected, though, as we are using a lattice-based post-quantum algorithm (Kyber512). The TLS latency of both post-quantum handshakes and “classical” ones can be noted in Figure 1. The figure shows more handshakes are executed than usual behavior as these servers are frequently restarted.

Note though that we are not using “only” post-quantum cryptography but rather the “hybrid” mechanism described above. This could increase performance times: in this case, the increase was minimal and still kept the post-quantum handshakes faster than the classical ones. Perhaps what makes the TLS handshakes faster in the post-quantum case is the usage of TLS 1.3, as the “classical” Logfwdr is using TLS 1.2. Logfwdr, though, is a service that executes long-lived handshakes, so in aggregate TLS 1.2 is not “slower” but it does have a slower start time.

As shown in Figure 2, the average batch duration of the post-quantum stream is lower than when not using post-quantum cryptography. This may be in part due to the fact that we are not sending the quantum-protected data all the way to Kafka (as the non-post-quantum stream is doing). We didn’t yet change the connection to Kafka post-quantum.

Graph comparing the average batch send duration of post-quantum (orange) and non-post-quantum streams (green).
Figure 2: Average batch send duration: post-quantum (orange) and non-post-quantum streams (green).

We didn’t encounter any failures during this testing that ran for about some weeks. This gave us good insight that putting post-quantum cryptography into our internal network with actual data is possible. It also gave us confidence to begin migrating codebases to modified TLS libraries, which we will maintain.

What are the next steps for Logfwdr? Now that we confirmed it is possible, we will first start migrating stream by stream to this hybrid mechanism until we reach full post-quantum migration.

A post-quantum gokeyless

gokeyless is our own way to separate servers and TLS long-term private keys. With it, private keys are kept on a specialized key server operated by customers on their own architecture or, if using Geo Key Manager, in selected Cloudflare locations. We also use it for Cloudflare-held private keys with a service creatively known as gokeyless-internal. The final piece of this architecture is another service called Keynotto. Keynotto is a service written in Rust that only mints RSA and ECDSA key signatures (that are executed with the stored private key).

How does the overall architecture of gokeyless work? Let’s start with a request. The request arrives at the Cloudflare network and we perform TLS termination. Any signing request is forwarded to Keynotto. A small portion of requests (specifically from GeoKDL or external gokeyless) cannot be handled by Keynotto directly, and are instead forwarded to gokeyless-internal. gokeyless-internal also acts as a key server proxy, as it redirects connections to the customer’s keyservers (external gokeyless). External gokeyless is both the server that a customer runs and the client that will be used to contact it. The architecture can be seen in Figure 3.

Diagram of the life of a gokeyless request.
Figure 3: The life of a gokeyless request.

Migrating the transport that this architecture uses to post-quantum cryptography is a bigger challenge, as it involves migrating a service that lives on the customer side. So, for our testing phase, we decided to go for the simpler path that we are able to change ourselves: the TLS handshake between Keynotto and gokeyless-internal. This small test-bed means two things: first, that we needed to change another TLS library (as Keynotto is written in Rust) and, second, that we needed to change gokeyless-internal in such a way that it used post-quantum cryptography only for the handshakes with Keynotto and for nothing else. Note that we did not migrate the signing operations that gokeyless or Keynotto executes with the stored private key; we just migrated the transport connections.

Adding post-quantum cryptography to the rustls codebase was a straightforward exercise and we exposed an easy-to-use API call to signal the usage of post-quantum cryptography (as seen in Figure 4 and Figure 5). One thing that we noted when reviewing the TLS usage in several Cloudflare services is that giving the option to choose the algorithms for a ciphersuite, key share, and authentication in the TLS handshake confuses users. It seemed more straightforward to define the algorithm at the library level, and have a boolean or API call signal the need for this post-quantum algorithm.

post-Quantum API for rustls.
Figure 4: post-Quantum API for rustls.
usage post-quantum API for rustls.
Figure 5: usage of the post-quantum API for rustls.

We ran a small test between Keynotto and gokeyless-internal with much success. Our next steps are to integrate this test into the real connection between Keynotto and gokeyless-internal, and to devise a plan for a customer post-quantum protected gokeyless external. This is the first instance in which our migration to post-quantum will not be ending at our edge but rather at the customer’s connection point.

A post-quantum Cloudflare Tunnel

Cloudflare Tunnel is a reverse proxy that allows customers to quickly connect their private services and networks to the Cloudflare network without having to expose their public IPs or ports through their firewall. It is mainly managed at the customer level through the usage of cloudflared, a lightweight server-side daemon, in their infrastructure. cloudflared opens several long-lived TCP connections (although, cloudflared is increasingly using the QUIC protocol) to servers on Cloudflare’s global network. When a request to a hostname comes, it is proxied through these connections to the origin service behind cloudflared.

The easiest part of the service to make post-quantum secure appears to be the connection between our network (with a service part of Tunnel called origintunneld located there) and cloudflared, which we have started migrating. While exploring this path and looking at the whole life of a Tunnel connection, we found something more interesting, though. When the Tunnel connections eventually reach core, they end up going to a service called Tunnelstore. Tunnelstore runs as a stateless application in a Kubernetes deployment, and to provide TLS termination (alongside load balancing and more) it uses a Kubernetes ingress.

The Kubernetes ingress we use at Cloudflare is made of Envoy and Contour. The latter configures the former depending on Kubernetes resources. Envoy uses the BoringSSL library for TLS. Switching TLS libraries in Envoy seemed difficult: there are thoughts on how to integrate OpenSSL to it (and even some thoughts on adding post-quantum cryptography) and ways to switch TLS libraries. Adding post-quantum cryptography to a modified version of BoringSSL, and then specifying that dependency in the Bazel file of Envoy seems to be the path to go for, as our internal test has confirmed (as seen in Figure 6). As for Contour, for many years, Cloudflare has been running their own patched version of it: we will have to again patch this version with our Golang library to provide post-quantum cryptography. We will make these libraries (and the TLS ones) available for usage.

Image displaying an option to allow post-quantum cryptography in Envoy.
Figure 6: Option to allow post-quantum cryptography in Envoy.

Changing the Kubernetes ingress at Cloudflare not only makes Tunnel completely quantum-safe (beyond the connection between our global network and cloudflared), but it also makes any other services using ingress safe. Our first tests on migrating Envoy and Contour to TLS libraries that contain post-quantum protections have been successful, and now we have to test how it behaves in the whole ingress ecosystem.

What is next?

The main tests are now done. We now have TLS libraries (in Go, Rust, and C) that give us post-quantum cryptography. We have two systems ready to deploy post-quantum cryptography, and a shared service (Kubernetes ingress) that we can change. At the beginning of the blog post, we said that “the life of most requests at Cloudflare begins and ends at the edge of our global network”: our aim is that post-quantum cryptography does not end there, but rather reaches all the way to where customers connect as well. Let’s explore the future challenges and this customer post-quantum path in this other blog post!