
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 10:15:30 GMT</lastBuildDate>
        <item>
            <title><![CDATA[QUIC restarts, slow problems: udpgrm to the rescue]]></title>
            <link>https://blog.cloudflare.com/quic-restarts-slow-problems-udpgrm-to-the-rescue/</link>
            <pubDate>Wed, 07 May 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ udpgrm is a lightweight daemon for graceful restarts of UDP servers. It leverages SO_REUSEPORT and eBPF to route new and existing flows to the correct server instance. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as "zero downtime") for UDP servers has proven to be surprisingly difficult.</p><p>We've <a href="https://blog.cloudflare.com/graceful-upgrades-in-go/"><u>previously</u></a> <a href="https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/"><u>written</u></a> about graceful restarts in the context of TCP, which is much easier to handle. We didn't have a strong reason to deal with UDP until recently — when protocols like HTTP3/QUIC became critical. This blog post introduces <b><i>udpgrm</i></b>, a lightweight daemon that helps us to upgrade UDP servers without dropping a single packet.</p><p><a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>Here's the </u><i><u>udpgrm</u></i><u> GitHub repo</u></a>.</p>
    <div>
      <h2>Historical context</h2>
      <a href="#historical-context">
        
      </a>
    </div>
    <p>In the early days of the Internet, UDP was used for stateless request/response communication with protocols like DNS or NTP. Restarts of a server process are not a problem in that context, because it does not have to retain state across multiple requests. However, modern protocols like QUIC, WireGuard, and SIP, as well as online games, use stateful flows. So what happens to the state associated with a flow when a server process is restarted? Typically, old connections are just dropped during a server restart. Migrating the flow state from the old instance to the new instance is possible, but it is complicated and notoriously hard to get right.</p><p>The same problem occurs for TCP connections, but there a common approach is to keep the old instance of the server process running alongside the new instance for a while, routing new connections to the new instance while letting existing ones drain on the old. Once all connections finish or a timeout is reached, the old instance can be safely shut down. The same approach works for UDP, but it requires more involvement from the server process than for TCP.</p><p>In the past, we <a href="https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/"><u>described</u></a> the <i>established-over-unconnected</i> method. It offers one way to implement flow handoff, but it comes with significant drawbacks: it’s prone to race conditions in protocols with multi-packet handshakes, and it suffers from a scalability issue. Specifically, the kernel hash table used for dispatching packets is keyed only by the local IP:port tuple, which can lead to bucket overfill when dealing with many inbound UDP sockets.</p><p>Now we have found a better method, leveraging Linux’s <code>SO_REUSEPORT</code> API. By placing both old and new sockets into the same REUSEPORT group and using an eBPF program for flow tracking, we can route packets to the correct instance and preserve flow stickiness. 
This is how <i>udpgrm</i> works.</p>
    <div>
      <h2>REUSEPORT group</h2>
      <a href="#reuseport-group">
        
      </a>
    </div>
    <p>Before diving deeper, let's quickly review the basics. Linux provides the <code>SO_REUSEPORT</code> socket option, typically set after <code>socket()</code> but before <code>bind()</code>. Please note that this has a separate purpose from the better known <code>SO_REUSEADDR</code> socket option.</p><p><code>SO_REUSEPORT</code> allows multiple sockets to bind to the same IP:port tuple. This feature is primarily used for load balancing, letting servers spread traffic efficiently across multiple CPU cores. You can think of it as a way for an IP:port to be associated with multiple packet queues. In the kernel, sockets sharing an IP:port this way are organized into a <i>reuseport group </i>— a term we'll refer to frequently throughout this post.</p>
            <pre><code>┌───────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443             │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ socket #1 │ │ socket #2 │ │ socket #3 │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────────────────────────┘
</code></pre>
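            <p>To make this concrete, here's a minimal Python sketch (the loopback address and port are arbitrary) that puts two UDP sockets into one reuseport group. Without <code>SO_REUSEPORT</code> set on both sockets, the second <code>bind()</code> would fail with <code>EADDRINUSE</code>:</p>
            <pre><code>import socket

def join_group(addr, port):
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set before bind() for the socket to join the group.
    sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sd.bind((addr, port))
    return sd

# Two sockets sharing one IP:port tuple: a reuseport group of size 2.
a = join_group("127.0.0.1", 5201)
b = join_group("127.0.0.1", 5201)</code></pre>
            <p>With the kernel's default distribution, inbound datagrams to this IP:port are spread across the two sockets based on a hash of the 4-tuple.</p>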
            <p>Linux supports several methods for distributing inbound packets across a reuseport group. By default, the kernel uses a hash of the packet's 4-tuple to select a target socket. Another method is <code>SO_INCOMING_CPU</code>, which, when enabled, tries to steer packets to sockets running on the same CPU that received the packet. This approach works but has limited flexibility.</p><p>To provide more control, Linux introduced the <code>SO_ATTACH_REUSEPORT_CBPF</code> option, allowing server processes to attach a classic BPF (cBPF) program to make socket selection decisions. This was later extended with <code>SO_ATTACH_REUSEPORT_EBPF</code>, enabling the use of modern eBPF programs. With <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>eBPF</u></a>, developers can implement arbitrary custom logic. A boilerplate program would look like this:</p>
            <pre><code>SEC("sk_reuseport")
int udpgrm_reuseport_prog(struct sk_reuseport_md *md)
{
    uint64_t socket_identifier = xxxx;
    bpf_sk_select_reuseport(md, &amp;sockhash, &amp;socket_identifier, 0);
    return SK_PASS;
}</code></pre>
            <p>To select a specific socket, the eBPF program calls <code>bpf_sk_select_reuseport</code>, using a reference to a map with sockets (<code>SOCKHASH</code>, <code>SOCKMAP</code>, or the older, mostly obsolete <code>SOCKARRAY</code>), along with a key or index. For example, a declaration of a <code>SOCKHASH</code> might look like this:</p>
            <pre><code>struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, MAX_SOCKETS);
	__uint(key_size, sizeof(uint64_t));
	__uint(value_size, sizeof(uint64_t));
} sockhash SEC(".maps");</code></pre>
            <p>This <code>SOCKHASH</code> is a hash map that holds references to sockets, even though the value size looks like a scalar 8-byte value. In our case it's indexed by a <code>uint64_t</code> key. This is pretty neat, as it allows for a simple number-to-socket mapping!</p><p>However, there's a catch: <b>the </b><code><b>SOCKHASH</b></code><b> must be populated and maintained from user space (or a separate control plane), outside the eBPF program itself</b>. Keeping this socket map accurate and in sync with the server process state is surprisingly difficult to get right — especially under dynamic conditions like restarts, crashes, or scaling events. The point of <i>udpgrm</i> is to take care of this stuff, so that server processes don’t have to.</p>
    <div>
      <h2>Socket generation and working generation</h2>
      <a href="#socket-generation-and-working-generation">
        
      </a>
    </div>
    <p>Let’s look at how graceful restarts for UDP flows are achieved in <i>udpgrm</i>. To reason about this setup, we’ll need a bit of terminology: A <b>socket generation</b> is a set of sockets within a reuseport group that belong to the same logical application instance:</p>
            <pre><code>┌───────────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 0                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #1 │ │ socket #2 │ │ socket #3 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 1                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #4 │ │ socket #5 │ │ socket #6 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────┘</code></pre>
            <p>When a server process needs to be restarted, the new version creates a new socket generation for its sockets. The old version keeps running alongside the new one, using sockets from the previous socket generation.</p><p>Reuseport eBPF routing boils down to two problems:</p><ul><li><p>For new flows, we should choose a socket from the socket generation that belongs to the active server instance.</p></li><li><p>For already established flows, we should choose the appropriate socket — possibly from an older socket generation — to keep the flows sticky. The flows will eventually drain away, allowing the old server instance to shut down.</p></li></ul><p>Easy, right?</p><p>Of course not! The devil is in the details. Let's take it one step at a time.</p><p>Routing new flows is relatively easy. <i>udpgrm</i> simply maintains a reference to the socket generation that should handle new connections. We call this reference the <b>working generation</b>. Whenever a new flow arrives, the eBPF program consults the working generation pointer and selects a socket from that generation.</p>
            <pre><code>┌──────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                │
│   ...                                        │
│   Working generation ────┐                   │
│                          V                   │
│           ┌───────────────────────────────┐  │
│           │ socket generation 1           │  │
│           │  ┌───────────┐ ┌──────────┐   │  │
│           │  │ socket #4 │ │ ...      │   │  │
│           │  └───────────┘ └──────────┘   │  │
│           └───────────────────────────────┘  │
│   ...                                        │
└──────────────────────────────────────────────┘</code></pre>
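            <p>The routing rule can be modeled as a toy in a few lines of Python (an illustrative model of the logic, not udpgrm's actual eBPF code; the class and field names are ours):</p>
            <pre><code>class ReuseportGroup:
    """Toy model: route() mimics udpgrm's reuseport routing decision."""

    def __init__(self):
        self.generations = {}  # socket generation number to list of sockets
        self.working_gen = 0   # generation that receives new flows
        self.flows = {}        # flow key to socket, for stickiness

    def route(self, flow_key):
        # Established flow: keep it on the socket it started on,
        # even if that socket belongs to an old generation.
        sock = self.flows.get(flow_key)
        if sock is None:
            # New flow: pick a socket from the working generation.
            socks = self.generations[self.working_gen]
            sock = socks[hash(flow_key) % len(socks)]
            self.flows[flow_key] = sock
        return sock</code></pre>
            <p>Bumping <code>working_gen</code> then behaves like a graceful restart: old flows stay sticky to the old generation while new flows land on the new one.</p>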
    <p>For this to work, we first need to be able to differentiate packets belonging to new connections from packets belonging to old connections. This is very tricky and highly dependent on the specific UDP protocol. For example, QUIC has an <a href="https://datatracker.ietf.org/doc/html/rfc9000#name-initial-packet"><i><u>initial packet</u></i></a> concept, similar to a TCP SYN, but other protocols might not.</p><p>There needs to be some flexibility here, so <i>udpgrm</i> makes this configurable. Each reuseport group sets a specific <b>flow dissector</b>.</p><p>The flow dissector has two tasks:</p><ul><li><p>It distinguishes new packets from packets belonging to old, already established flows.</p></li><li><p>For recognized flows, it tells <i>udpgrm</i> which specific socket the flow belongs to.</p></li></ul><p>These concepts are closely related and depend on the specific server. Different UDP protocols define flows differently. For example, a naive UDP server might use a typical 5-tuple to define flows, while QUIC uses a "connection ID" field in the QUIC packet header to survive <a href="https://www.rfc-editor.org/rfc/rfc9308.html#section-3.2"><u>NAT rebinding</u></a>.</p><p><i>udpgrm</i> supports three flow dissectors out of the box and is highly configurable to support any UDP protocol. More on this later.</p>
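            <p>As an illustration of what a QUIC-aware dissector has to do, here is a simplified Python sketch (ours, not udpgrm's code) that pulls the Destination Connection ID out of a QUIC long-header packet, following the version-invariant layout from RFC 8999:</p>
            <pre><code>def quic_long_header_dcid(payload):
    # QUIC invariants (RFC 8999) long-header layout:
    # 1 byte flags, 4 bytes version, 1 byte DCID length, then the DCID.
    header = payload[:6]
    if len(header) != 6:
        return None  # too short to carry a long header
    if header[0] // 0x80 != 1:  # high bit clear: a short-header packet
        return None
    dcid_len = header[5]
    dcid = payload[6:6 + dcid_len]
    if len(dcid) != dcid_len:
        return None  # truncated packet
    return dcid

# A fabricated long-header packet: flags, version 1, an 8-byte DCID.
pkt = bytes([0xC0, 0x00, 0x00, 0x00, 0x01, 0x08]) + bytes(range(8))</code></pre>
            <p>Short-header packets also start with a DCID, but they carry no length byte, so a dissector must know the expected DCID length out of band.</p>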
    <div>
      <h2>Welcome udpgrm!</h2>
      <a href="#welcome-udpgrm">
        
      </a>
    </div>
    <p>Now that we've covered the theory, we're ready for business: please welcome <b>udpgrm</b> — UDP Graceful Restart Marshal! <i>udpgrm</i> is a stateful daemon that handles all the complexities of the graceful restart process for UDP. It installs the appropriate eBPF REUSEPORT program, maintains flow state, communicates with the server process during restarts, and reports useful metrics for easier debugging.</p><p>We can describe <i>udpgrm</i> from two perspectives: for administrators and for programmers.</p>
    <div>
      <h2>udpgrm daemon for the system administrator</h2>
      <a href="#udpgrm-daemon-for-the-system-administrator">
        
      </a>
    </div>
    <p><i>udpgrm</i> is a stateful daemon. To run it:</p>
            <pre><code>$ sudo udpgrm --daemon
[ ] Loading BPF code
[ ] Pinning bpf programs to /sys/fs/bpf/udpgrm
[*] Tailing message ring buffer  map_id 936146</code></pre>
            <p>This sets up the basic functionality, prints rudimentary logs, and should be deployed as a dedicated systemd service — loaded after networking. However, this is not enough to fully use <i>udpgrm</i>. <i>udpgrm</i> needs to hook into <code>getsockopt</code>, <code>setsockopt</code>, <code>bind</code>, and <code>sendmsg</code> syscalls, which are scoped to a cgroup. You can install the <i>udpgrm</i> hooks into a cgroup like this:</p>
            <pre><code>$ sudo udpgrm --install=/sys/fs/cgroup/system.slice</code></pre>
            <p>But a more common pattern is to install it within the <i>current</i> cgroup:</p>
            <pre><code>$ sudo udpgrm --install --self</code></pre>
            <p>Better yet, use it as part of the systemd "service" config:</p>
            <pre><code>[Service]
...
ExecStartPre=/usr/local/bin/udpgrm --install --self</code></pre>
            <p>Once <i>udpgrm</i> is running, the administrator can use the CLI to list reuseport groups, sockets, and metrics, like this:</p>
            <pre><code>$ sudo udpgrm list
[ ] Retrieving BPF progs from /sys/fs/bpf/udpgrm
192.0.2.0:4433
	netns 0x1  dissector bespoke  digest 0xdead
	socket generations:
		gen  3  0x17a0da  &lt;=  app 0  gen 3
	metrics:
		rx_processed_total 13777528077
...</code></pre>
            <p>Now, with the <i>udpgrm</i> daemon running and the cgroup hooks set up, we can focus on the server part.</p>
    <div>
      <h2>udpgrm for the programmer</h2>
      <a href="#udpgrm-for-the-programmer">
        
      </a>
    </div>
    <p>We expect the server to create the appropriate UDP sockets by itself. We depend on <code>SO_REUSEPORT</code>, so that each server instance can have a dedicated socket or a set of sockets:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_DGRAM, 0)
sd.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
sd.bind(("192.0.2.1", 5201))</code></pre>
            <p>With a socket descriptor handy, we can pursue the <i>udpgrm</i> magic dance. The server communicates with the <i>udpgrm</i> daemon using <code>setsockopt</code> calls. Behind the scenes, udpgrm provides eBPF <code>setsockopt</code> and <code>getsockopt</code> hooks and hijacks specific calls. It's not easy to set up on the kernel side, but when it works, it’s truly awesome. A typical socket setup looks like this:</p>
            <pre><code>try:
    work_gen = sd.getsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN)
except OSError:
    raise OSError('Is udpgrm daemon loaded? Try "udpgrm --self --install"')
    
sd.setsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, work_gen + 1)
for i in range(10):
    v = sd.getsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, 8);
    sk_gen, sk_idx = struct.unpack('II', v)
    if sk_idx != 0xffffffff:
        break
    time.sleep(0.01 * (2 ** i))
else:
    raise OSError("Communicating with udpgrm daemon failed.")

sd.setsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN, work_gen + 1)</code></pre>
            <p>You can see three blocks here:</p><ul><li><p>First, we retrieve the working generation number and, by doing so, check for <i>udpgrm</i> presence. Typically, <i>udpgrm</i> absence is fine for non-production workloads.</p></li><li><p>Then we register the socket to an arbitrary socket generation. We choose <code>work_gen + 1</code> as the value and verify that the registration went through correctly.</p></li><li><p>Finally, we bump the working generation pointer.</p></li></ul><p>That's it! Hopefully, the API presented here is clear and reasonable. Under the hood, the <i>udpgrm</i> daemon installs the REUSEPORT eBPF program, sets up internal data structures, collects metrics, and manages the sockets in a <code>SOCKHASH</code>.</p>
    <div>
      <h2>Advanced socket creation with udpgrm_activate.py</h2>
      <a href="#advanced-socket-creation-with-udpgrm_activate-py">
        
      </a>
    </div>
    <p>In practice, we often need sockets bound to low ports like <code>:443</code>, which requires elevated privileges like <code>CAP_NET_BIND_SERVICE</code>. It's usually better to configure listening sockets outside the server itself. A typical pattern is to pass the listening sockets using <a href="https://0pointer.de/blog/projects/socket-activation.html"><u>socket activation</u></a>.</p><p>Sadly, systemd cannot create a new set of UDP <code>SO_REUSEPORT</code> sockets for each server instance. To overcome this limitation, <i>udpgrm</i> provides a script called <code>udpgrm_activate.py</code>, which can be used like this:</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm_activate.py test-port 0.0.0.0:5201</code></pre>
            <p>Here, <code>udpgrm_activate.py</code> binds to <code>0.0.0.0:5201</code> and stores the created socket in the systemd FD store under the name <code>test-port</code>. The server <code>echoserver.py</code> will inherit this socket and receive the appropriate <code>LISTEN_FDS</code> environment variables, following the typical systemd socket activation pattern.</p>
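            <p>On the server side, picking these sockets up follows the standard systemd socket-activation protocol: systemd sets <code>LISTEN_PID</code>, <code>LISTEN_FDS</code>, and <code>LISTEN_FDNAMES</code>, and the inherited descriptors start at file descriptor 3. A sketch in Python (the helper name is ours):</p>
            <pre><code>import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited fd, per sd_listen_fds(3)

def inherited_sockets():
    """Return a name-to-socket mapping for fds passed by systemd."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return {}  # the fd store contents were not meant for us
    count = int(os.environ.get("LISTEN_FDS", "0"))
    names = os.environ.get("LISTEN_FDNAMES", "").split(":")
    fds = range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count)
    return {name: socket.socket(fileno=fd) for name, fd in zip(names, fds)}</code></pre>
            <p>A server would look up its listener by name, e.g. <code>inherited_sockets()["test-port"]</code>, instead of calling <code>bind()</code> itself.</p>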
    <div>
      <h2>Systemd service lifetime</h2>
      <a href="#systemd-service-lifetime">
        
      </a>
    </div>
    <p>Systemd typically can't handle more than one server instance running at the same time. It prefers to kill the old instance quickly. It supports the "at most one" server instance model, not the "at least one" model that we want. To work around this, <i>udpgrm</i> provides a <b>decoy</b> script that will exit when systemd asks it to, while the actual old instance of the server can stay active in the background.</p>
            <pre><code>[Service]
...
ExecStart=/usr/local/bin/mmdecoy examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop.
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            <p>At this point, we can show a full template for a <i>udpgrm</i>-enabled server that contains all three elements: <code>udpgrm --install --self</code> for cgroup hooks, <code>udpgrm_activate.py</code> for socket creation, and <code>mmdecoy</code> for fooling systemd service lifetime checks.</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm --install --self
ExecStartPre=/usr/local/bin/udpgrm_activate.py --no-register test-port 0.0.0.0:5201
ExecStart=/usr/local/bin/mmdecoy PWD/examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop. 
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            
    <div>
      <h2>Dissector modes</h2>
      <a href="#dissector-modes">
        
      </a>
    </div>
    <p>We've discussed the <i>udpgrm</i> daemon, the <i>udpgrm</i> setsockopt API, and systemd integration, but we haven't yet covered the details of routing logic for old flows. To handle arbitrary protocols, <i>udpgrm</i> supports three <b>dissector modes</b> out of the box:</p><p><b>DISSECTOR_FLOW</b>: <i>udpgrm</i> maintains a flow table indexed by a flow hash computed from a typical 4-tuple. It stores a target socket identifier for each flow. The flow table size is fixed, so there is a limit to the number of concurrent flows supported by this mode. To mark a flow as "assured," <i>udpgrm</i> hooks into the <code>sendmsg</code> syscall and saves the flow in the table only when a message is sent.</p><p><b>DISSECTOR_CBPF</b>: A cookie-based model where the target socket identifier — called a udpgrm cookie — is encoded in each incoming UDP packet. For example, in QUIC, this identifier can be stored as part of the connection ID. The dissection logic is expressed as cBPF code. This model does not require a flow table in <i>udpgrm</i> but is harder to integrate because it needs protocol and server support.</p><p><b>DISSECTOR_NOOP</b>: A no-op mode with no state tracking at all. It is useful for traditional UDP services like DNS, where we want to avoid losing even a single packet during an upgrade.</p><p>Finally, <i>udpgrm</i> provides a template for a more advanced dissector called <b>DISSECTOR_BESPOKE</b>. Currently, it includes a QUIC dissector that can decode the QUIC TLS SNI and direct specific TLS hostnames to specific socket generations.</p><p>For more details, <a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>please consult the </u><i><u>udpgrm</u></i><u> README</u></a>. In short: the FLOW dissector is the simplest one, useful for old protocols. 
The CBPF dissector is good for experimentation when the protocol allows storing a custom connection ID (cookie) — we used it to develop our own QUIC connection ID (DCID) schema — but it's slow, because it interprets cBPF inside eBPF (yes, really!). The NOOP dissector is useful, but only for very specific niche servers. The real magic is in the BESPOKE type, where users can create arbitrary, fast, and powerful dissector logic.</p>
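            <p>To illustrate the cookie idea behind the CBPF dissector, here is one possible connection ID schema sketched in Python (the layout is hypothetical; udpgrm's real DCID format differs). The server embeds a socket identifier when minting a connection ID, and the dissector later reads it back from incoming packets:</p>
            <pre><code>import os
import struct

COOKIE_OFFSET = 1  # hypothetical layout: 1 version byte, 2 cookie bytes, 5 random

def mint_dcid(sock_cookie):
    # Server side: embed the target socket identifier into a fresh DCID.
    return bytes([0x01]) + struct.pack("!H", sock_cookie) + os.urandom(5)

def extract_cookie(dcid):
    # Dissector side: recover the identifier from an incoming packet's DCID.
    (cookie,) = struct.unpack_from("!H", dcid, COOKIE_OFFSET)
    return cookie</code></pre>
            <p>Because the QUIC client echoes the server-chosen connection ID on subsequent packets, the identifier survives NAT rebinding with no flow table at all.</p>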
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>The adoption of QUIC and other UDP-based protocols means that gracefully restarting UDP servers is becoming an increasingly important problem. To our knowledge, a reusable, configurable and easy to use solution didn't exist yet. The <i>udpgrm</i> project brings together several novel ideas: a clean API using <code>setsockopt()</code>, careful socket-stealing logic hidden under the hood, powerful and expressive configurable dissectors, and well-thought-out integration with systemd.</p><p>While <i>udpgrm</i> is intended to be easy to use, it hides a lot of complexity and solves a genuinely hard problem. The core issue is that the Linux Sockets API has not kept up with the modern needs of UDP.</p><p>Ideally, most of this should really be a feature of systemd. That includes supporting the "at least one" server instance mode, UDP <code>SO_REUSEPORT</code> socket creation, installing a <code>REUSEPORT_EBPF</code> program, and managing the "working generation" pointer. We hope that <i>udpgrm</i> helps create the space and vocabulary for these long-term improvements.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2baeaA3qbgFISPMjlZ74a4</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Multi-Path TCP: revolutionizing connectivity, one path at a time]]></title>
            <link>https://blog.cloudflare.com/multi-path-tcp-revolutionizing-connectivity-one-path-at-a-time/</link>
            <pubDate>Fri, 03 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Multi-Path TCP (MPTCP) leverages multiple network interfaces, like Wi-Fi and cellular, to provide seamless mobility for more reliable connectivity. While promising, MPTCP is still in its early stages, ]]></description>
            <content:encoded><![CDATA[ <p>The Internet is designed to provide multiple paths between two endpoints. Attempts to exploit multi-path opportunities are almost as old as the Internet, culminating in <a href="https://datatracker.ietf.org/doc/html/rfc2991"><u>RFCs</u></a> documenting some of the challenges. Still, today, virtually all end-to-end communication uses only one available path at a time. Why? It turns out that in multi-path setups, even the smallest differences between paths can harm the connection quality due to <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing#History"><u>packet reordering</u></a> and other issues. As a result, Internet devices usually use a single path and let the routers handle the path selection.</p><p>There is another way. Enter Multi-Path TCP (MPTCP), which exploits the presence of multiple interfaces on a device, such as a mobile phone that has both Wi-Fi and cellular antennas, to achieve multi-path connectivity.</p><p>MPTCP has had a long history — see the <a href="https://en.wikipedia.org/wiki/Multipath_TCP"><u>Wikipedia article</u></a> and the <a href="https://datatracker.ietf.org/doc/html/rfc8684"><u>spec (RFC 8684)</u></a> for details. It's a major extension to the TCP protocol, and historically most extensions to TCP have failed to gain traction. However, MPTCP is supposed to be mostly an operating system feature, making it easy to enable. Applications should only need minor code changes to support it.</p><p>There is a caveat, however: MPTCP is still fairly immature, and while it can use multiple paths, giving it superpowers over regular TCP, it's not always strictly better. Whether MPTCP should be used instead of TCP is really a case-by-case decision.</p><p>In this blog post we show how to set up MPTCP to find out.</p>
    <div>
      <h2>Subflows</h2>
      <a href="#subflows">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r8AP5BHbvtYEtXmYSXFwO/36e95cbac93cdecf2f5ee65945abf0b3/Screenshot_2024-12-23_at_3.07.37_PM.png" />
          </figure><p>Internally, MPTCP extends TCP by introducing "subflows". When everything is working, a single TCP connection can be backed by multiple MPTCP subflows, each using a different path. This is a big deal: a single TCP byte stream is no longer identified by a single 5-tuple. On Linux you can see the subflows with <code>ss -M</code>, like:</p>
            <pre><code>marek$ ss -tMn dport = :443 | cat
tcp   ESTAB 0  	0 192.168.2.143%enx2800af081bee:57756 104.28.152.1:443
tcp   ESTAB 0  	0       192.168.1.149%wlp0s20f3:44719 104.28.152.1:443
mptcp ESTAB 0  	0                 192.168.2.143:57756 104.28.152.1:443</code></pre>
            <p>Here you can see a single MPTCP connection, composed of two underlying TCP flows.</p>
    <div>
      <h2>MPTCP aspirations</h2>
      <a href="#mptcp-aspirations">
        
      </a>
    </div>
    <p>Being able to separate the lifetime of a connection from the lifetime of a flow allows MPTCP to address two problems present in classical TCP: aggregation and mobility.</p><ul><li><p><b>Aggregation</b>: MPTCP can aggregate the bandwidth of many network interfaces. For example, in a data center scenario, it's common to use interface bonding. A single flow can make use of just one physical interface. MPTCP, by being able to launch many subflows, can expose greater overall bandwidth. I'm personally not convinced that this is a real problem. As we'll learn below, modern Linux has a <a href="https://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf"><u>BLEST-like MPTCP scheduler</u></a> and the macOS stack has the "aggregation" mode, so aggregation should work, but I'm not sure how practical it is. However, there are <a href="https://www.openmptcprouter.com/"><u>certainly projects that are trying to do link aggregation</u></a> using MPTCP.</p></li><li><p><b>Mobility</b>: On a customer device, a TCP stream is typically broken if the underlying network interface goes away. This is not an uncommon occurrence — consider a smartphone dropping from Wi-Fi to cellular. MPTCP can fix this — it can create and destroy many subflows over the lifetime of a single connection and survive multiple network changes.</p></li></ul><p>Improving reliability for mobile clients is a big deal. While some software can use QUIC, which also has <a href="https://www.ietf.org/archive/id/draft-ietf-quic-multipath-11.html"><u>Multipath Extensions</u></a>, a large number of classical services still use TCP. A great example is SSH: it would be very nice if you could walk around with a laptop and keep an SSH session open and switch Wi-Fi networks seamlessly, without breaking the connection.</p><p>MPTCP work was initially driven by <a href="https://uclouvain.be/fr/index.html"><u>UCLouvain in Belgium</u></a>. The first serious adoption was on the iPhone. Apparently, users tend to use Siri while walking out of their homes, and it's very common to lose Wi-Fi connectivity while doing so. (<a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>source</u></a>)</p>
    <div>
      <h2>Implementations</h2>
      <a href="#implementations">
        
      </a>
    </div>
    <p>Currently, there are only two major MPTCP implementations: Linux, with kernel support from v5.6 — but realistically you need at least kernel v6.1 (<a href="https://oracle.github.io/kconfigs/?config=UTS_RELEASE&amp;config=MPTCP"><u>MPTCP is not supported on Android</u></a> yet) — and Apple's, in iOS from version 7 and Mac OS X from 10.10.</p><p>Typically, Linux is used on the server side, and iOS/macOS as the client. It's possible to get Linux to work as a client, but it's not straightforward, as we'll learn soon. Beware — there is plenty of outdated Linux MPTCP documentation. The code has had a bumpy history, and at least two different APIs were proposed. See the Linux kernel documentation for <a href="https://docs.kernel.org/networking/mptcp.html"><u>the mainline API</u></a> and the <a href="https://www.mptcp.dev/"><u>mptcp.dev</u></a> website.</p>
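            <p>As a taste of the client side, the recipe documented on mptcp.dev boils down to telling the in-kernel path manager which local addresses may carry extra subflows and raising the subflow limits. The address, interface name, and counts below are examples:</p>
            <pre><code># Allow up to 2 additional subflows per connection, and accept up to
# 2 ADD-ADDR advertisements from the peer.
$ ip mptcp limits set subflow 2 add_addr_accepted 2

# Mark a local address as usable for creating additional subflows.
$ ip mptcp endpoint add 192.168.2.143 dev enx2800af081bee subflow</code></pre>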
    <div>
      <h2>Linux as a server</h2>
      <a href="#linux-as-a-server">
        
      </a>
    </div>
    <p>Conceptually, the MPTCP design is pretty sensible. After the initial TCP handshake, each peer may announce additional addresses (and ports) on which it can be reached. There are two ways of doing this. First, in the handshake TCP packet each peer specifies the "<i>Do not attempt to establish new subflows to this address and port</i>" bit, also known as bit [C], in the MPTCP TCP extensions header.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bT8oz3wxpw7alftvdYg5n/b7614a4d10b6c81e18027f6785391ede/BLOG-2637_3.png" />
          </figure><p><sup><i>Wireshark dissecting MPTCP flags from a SYN packet. </i></sup><a href="https://github.com/multipath-tcp/mptcp_net-next/issues/535"><sup><i><u>Tcpdump does not report</u></i></sup></a><sup><i> this flag yet.</i></sup></p><p>With this bit cleared, the other peer is free to assume the two-tuple is fine to be reconnected to. Typically, the <b>server allows</b> the client to reuse the server IP/port address. Usually, the <b>client is not listening</b> and disallows the server to connect back to it. There are caveats though. For example, in the context of Cloudflare, where our servers are using Anycast addressing, reconnecting to the server IP/port won't work. Going twice to the IP/port pair is unlikely to reach the same server. For us it makes sense to set this flag, disallowing clients from reconnecting to our server addresses. This can be done on Linux with:</p>
            <pre><code># Linux server sysctl - useful for ECMP or Anycast servers
$ sysctl -w net.mptcp.allow_join_initial_addr_port=0
</code></pre>
            <p>There is also a second way to advertise a listening IP/port. During the lifetime of a connection, a peer can send an ADD-ADDR MPTCP signal which advertises a listening IP/port. This can be managed on Linux by <code>ip mptcp endpoint ... signal</code>, like:</p>
            <pre><code># Linux server - extra listening address
$ ip mptcp endpoint add 192.51.100.1 dev eth0 port 4321 signal
</code></pre>
            <p>With such a config, a Linux peer (typically server) will report the additional IP/port with ADD-ADDR MPTCP signal in an ACK packet, like this:</p>
            <pre><code>host &gt; host: Flags [.], ack 1, win 8, options [mptcp 30 add-addr v1 id 1 192.51.100.1:4321 hmac 0x...,nop,nop], length 0
</code></pre>
            <p>It's important to realize that either peer can send ADD-ADDR messages. Unusual as it might sound, it's totally fine for the client to advertise extra listening addresses. In the most common scenarios, though, either nobody sends ADD-ADDR, or only the server does.</p><p>Technically, to launch an MPTCP socket on Linux, you just need to replace IPPROTO_TCP with IPPROTO_MPTCP in the application code:</p>
            <pre><code>IPPROTO_MPTCP = 262
sd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP)
</code></pre>
            <p>In practice, though, this introduces some changes to the sockets API. Currently, not all setsockopt options work; <code>TCP_USER_TIMEOUT</code>, for example, is not yet supported. Additionally, at this stage, MPTCP is incompatible with kTLS.</p>
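<p>A kernel without MPTCP support refuses the protocol at <code>socket()</code> time, so portable code usually pairs the one-line change above with a fallback. Here's a minimal Python sketch (the helper name is mine, not a standard API). Note that, independently of this, even a successfully created MPTCP socket can still fall back to plain TCP on the wire if the peer doesn't negotiate the extension.</p>

```python
import socket

# Not exported as socket.IPPROTO_MPTCP by older Python versions,
# so spell out the protocol number from the kernel headers.
IPPROTO_MPTCP = 262

def mptcp_socket():
    """Create an MPTCP socket; fall back to plain TCP when the kernel
    has no MPTCP support (pre-5.6, CONFIG_MPTCP=n, or non-Linux)."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                             IPPROTO_MPTCP), True
    except OSError:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM), False

sd, is_mptcp = mptcp_socket()
```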
    <div>
      <h2>Path manager / scheduler</h2>
      <a href="#path-manager-scheduler">
        
      </a>
    </div>
    <p>Once the peers have exchanged the address information, MPTCP is ready to kick in and perform the magic. There are two independent pieces of logic that MPTCP handles. First, given the address information, MPTCP must figure out if it should establish additional subflows. The component that decides this is called the "path manager". Then, another component, called the "scheduler", is responsible for choosing a specific subflow to transmit the data over.</p><p>Both peers have a path manager, but typically only the client uses it. The path manager has a hard task: launch enough subflows to get the benefits, but not so many that they waste resources. This is where MPTCP stacks get complicated.</p>
    <div>
      <h2>Linux as client</h2>
      <a href="#linux-as-client">
        
      </a>
    </div>
    <p>On Linux, the path manager is an operating system feature, not an application feature. The in-kernel path manager requires some configuration: it must know which IP addresses and interfaces are okay for starting new subflows. This is configured with <code>ip mptcp endpoint ... subflow</code>, like:</p>
            <pre><code>$ ip mptcp endpoint add dev wlp1s0 192.0.2.3 subflow  # Linux client
</code></pre>
            <p>This informs the path manager that we (typically a client) own a 192.0.2.3 IP address on interface wlp1s0, and that it's fine to use it as source of a new subflow. There are two additional flags that can be passed here: "backup" and "fullmesh". Maintaining these <code>ip mptcp endpoints</code> on a client is annoying. They need to be added and removed every time networks change. Fortunately, <a href="https://ubuntu.com/core/docs/networkmanager"><u>NetworkManager</u></a> from 1.40 supports managing these by default. If you want to customize the "backup" or "fullmesh" flags, you can do this here (see <a href="https://networkmanager.dev/docs/api/1.44.4/settings-connection.html#:~:text=mptcp-flags"><u>the documentation</u></a>):</p>
            <pre><code>ubuntu$ cat /etc/NetworkManager/conf.d/95-mptcp.conf
# set "subflow" on all managed "ip mptcp endpoints". 0x22 is the default.
[connection]
connection.mptcp-flags=0x22
</code></pre>
            <p>Path manager also takes a "limit" setting, to set a cap of additional subflows per MPTCP connection, and limit the received ADD-ADDR messages, like: </p>
            <pre><code>$ ip mptcp limits set subflow 4 add_addr_accepted 2  # Linux client
</code></pre>
    <p>I experimented with the "mobility" use case on my Ubuntu 22 Linux laptop. I repeatedly enabled and disabled Wi-Fi and Ethernet. On new kernels (v6.12), it works, and I was able to hold a reliable MPTCP connection over many interface changes. I was <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/534"><u>less lucky with the Ubuntu v6.8</u></a> kernel. Unfortunately, the <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/536"><u>default path manager on Linux</u></a> only works on the client when the flag "<i>Do not attempt to establish new subflows to this address and port</i>" is cleared on the server. Server-announced ADD-ADDR messages don't result in new subflows being created, unless the <code>ip mptcp endpoint</code> has the <code>fullmesh</code> flag.</p><p>It feels like the underlying MPTCP transport code works, but the path manager requires a bit more intelligence. With a new kernel, it's possible to get the "interactive" case working out of the box, but not the ADD-ADDR case.</p>
    <div>
      <h2>Custom path manager</h2>
      <a href="#custom-path-manager">
        
      </a>
    </div>
    <p>Linux allows for two implementations of the path manager component. It can either use the built-in kernel implementation (the default), or a userspace netlink daemon.</p>
            <pre><code>$ sysctl -w net.mptcp.pm_type=1 # use userspace path manager
</code></pre>
            <p>However, from what I found, there is no serious implementation of a configurable userspace path manager. The existing <a href="https://github.com/multipath-tcp/mptcpd/blob/main/plugins/path_managers/sspi.c"><u>implementations don't do much</u></a>, and the API still <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/533"><u>seems</u></a> <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/532"><u>immature</u></a>.</p>
    <div>
      <h2>Scheduler and BPF extensions</h2>
      <a href="#scheduler-and-bpf-extensions">
        
      </a>
    </div>
    <p>Thus far we've covered the path manager, but what about the scheduler that chooses which link to actually use? It seems that on Linux there is only one built-in "default" scheduler, and it can do basic failover on packet loss. The developers want to write <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/75"><u>MPTCP schedulers in BPF</u></a>, and this work is in progress.</p>
    <div>
      <h2>macOS</h2>
      <a href="#macos">
        
      </a>
    </div>
    <p>As opposed to Linux, macOS and iOS expose a raw MPTCP API. On those operating systems, path manager is not handled by the kernel, but instead can be an application responsibility. The exposed low-level API is based on <code>connectx()</code>. For example, <a href="https://github.com/apple-oss-distributions/network_cmds/blob/97bfa5b71464f1286b51104ba3e60db78cd832c9/mptcp_client/mptcp_client.c#L461"><u>here's an example of obscure code</u></a> that establishes one connection with two subflows:</p>
            <pre><code>int sock = socket(AF_MULTIPATH, SOCK_STREAM, 0);
connectx(sock, ..., &amp;cid1);
connectx(sock, ..., &amp;cid2);
</code></pre>
            <p>This powerful API is hard to use though, as it would require every application to listen for network changes. Fortunately, macOS and iOS also expose higher-level APIs. One <a href="https://github.com/mptcp-apps/mptcp-hello/blob/main/c/macOS/main.c"><u>example is nw_connection</u></a> in C, which uses nw_parameters_set_multipath_service.</p><p>Another, more common example is using <code>Network.framework</code>, and would <a href="https://gist.github.com/majek/cb54b537c74506164d2a7fa2d6601491"><u>look like this</u></a>:</p>
            <pre><code>let parameters = NWParameters.tcp
parameters.multipathServiceType = .interactive
let connection = NWConnection(host: host, port: port, using: parameters) 
</code></pre>
            <p>The API supports three MPTCP service type modes:</p><ul><li><p><i>Handover Mode</i>: Minimizes cellular usage. It prefers Wi-Fi, and moves traffic to cellular only when <a href="https://support.apple.com/en-us/102228"><u>Wi-Fi Assist</u></a> is enabled and decides to switch.</p></li><li><p><i>Interactive Mode</i>: Used for Siri. Reduces latency. Only for low-bandwidth flows.</p></li><li><p><i>Aggregation Mode</i>: Enables resource pooling, but it's only available for developer accounts and is not deployable.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47MukOs6bhCMOkO1JL15sP/7dd75417b855b681bde504122d5af01e/Screenshot_2024-12-23_at_2.59.51_PM.png" />
          </figure><p>The MPTCP API is nicely integrated with the <a href="https://support.apple.com/en-us/102228"><u>iPhone "Wi-Fi Assist" feature</u></a>. While the official documentation is lacking, it's possible to find <a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>sources explaining</u></a> how it actually works. I was able to successfully test both the cleared "<i>Do not attempt to establish new subflows"</i> bit and ADD-ADDR scenarios. Hurray!</p>
    <div>
      <h2>IPv6 caveat</h2>
      <a href="#ipv6-caveat">
        
      </a>
    </div>
    <p>Sadly, MPTCP IPv6 has a caveat. Since IPv6 addresses are long, and MPTCP uses the space-constrained TCP Extensions field, there is <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/448"><u>not enough room for ADD-ADDR messages</u></a> if TCP timestamps are enabled. If you want to use MPTCP and IPv6, it's something to consider.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>I find MPTCP very exciting, as one of the few serious TCP extensions that can actually be deployed. However, current implementations are limited. My experimentation showed that the only practical scenario where MPTCP is currently useful is:</p><ul><li><p>Linux as a server</p></li><li><p>macOS/iOS as a client</p></li><li><p>"interactive" use case</p></li></ul><p>With a bit of effort, Linux can be made to work as a client.</p><p>Don't get me wrong, <a href="https://netdevconf.info/0x14/pub/slides/59/mptcp-netdev0x14-final.pdf"><u>Linux developers did tremendous work</u></a> to get where we are but, in my opinion, we're not there yet for any serious out-of-the-box use case. I'm optimistic that Linux can develop a good MPTCP client story relatively soon, and the possibility of implementing the path manager and scheduler in BPF is really enticing.</p><p>Time will tell if MPTCP succeeds — it's been 15 years in the making. In the meantime, <a href="https://datatracker.ietf.org/meeting/121/materials/slides-121-quic-multipath-quic-00"><u>Multi-Path QUIC</u></a> is under active development, but it's even further from being usable at this stage.</p><p>We're not quite sure if it makes sense for Cloudflare to support MPTCP. <a href="https://community.cloudflare.com/c/feedback/feature-request/30"><u>Reach out</u></a> if you have a use case in mind!</p><p><i>Shoutout to </i><a href="https://fosstodon.org/@matttbe"><i><u>Matthieu Baerts</u></i></a><i> for tremendous help with this blog post.</i></p>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6ZxrGIedGqREgTs02vpt0t</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[The forecast is clear: clouds on e-paper, powered by the cloud]]></title>
            <link>https://blog.cloudflare.com/the-forecast-is-clear-clouds-on-e-paper-powered-by-the-cloud/</link>
            <pubDate>Tue, 31 Dec 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ Follow along as I build a custom weather display using Cloudflare Workers and a popular e-paper display. ]]></description>
            <content:encoded><![CDATA[ <p>I’ve noticed that many shops are increasingly using e-paper displays. They’re impressive: high contrast, no backlight, and no visible cables. Unlike most electronics, these displays are seamlessly integrated and feel very natural. This got me wondering: is it possible to use such a display for a pet project? I want to experiment with this technology myself.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Be8SEo0lqoZ0t10tPbtcT/ffa6f9263df05392d7c4408e9c98bc28/Screenshot_2024-12-23_at_15.26.23.png" />
          </figure><p><span><span><sup>(</sup></span></span><a href="https://www.wiadomoscihandlowe.pl/najwieksze-sieci-handlowe/lidl/lidl-wprowadza-do-sklepow-elektroniczne-etykiety-mniejsze-obciazenie-pracownikow-mniej-zuzytego-papieru-2447917"><span><span><sup><u>source</u></sup></span></span></a><span><span><sup>)</sup></span></span></p><p>My main goal in this project is to understand the hardware and its capabilities. Here, I'll be using an e-paper display to show the current weather, but at its core, I’m simply feeding data from a website to the display. While it sounds straightforward, it actually requires three layers of software to pull off. Still, it’s a fun challenge and a great opportunity to work with both embedded hardware and Cloudflare Workers.</p>
    <div>
      <h2>Sourcing the hardware</h2>
      <a href="#sourcing-the-hardware">
        
      </a>
    </div>
    <p>For this project, I'm using components from Waveshare. They offer <a href="https://www.waveshare.com/product/displays/e-paper/epaper-1.htm?___SID=U&amp;limit=80"><u>a variety of e-paper displays</u></a>, ranging from credit card-sized to A4-sized models. I chose the 7.5-inch, two-color "e-Paper (G)" display. For the controller, I'm using a Waveshare <a href="https://www.waveshare.com/e-Paper-ESP32-Driver-Board.htm"><u>ESP32-based universal board</u></a>. With just these two components — a display and a controller — I was ready to get started.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7A6SqnMUHriAgRAAJM5Mbj/0d1282a915a1192d276e7dff1b901013/Screenshot_2024-12-23_at_15.25.45.png" />
          </figure><p>When the components arrived, I carefully connected the display’s ribbon cable to the ESP32 board. Even though this step isn’t documented anywhere, it was simple and almost impossible to get wrong. Best of all, no soldering was needed!</p><p>That’s pretty much it for the hardware setup! I’m keeping the device powered with a 5V supply through a micro-USB connection.</p>
    <div>
      <h2>One layer of hardware </h2>
      <a href="#one-layer-of-hardware">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WNVmUFDsLdewKrZJr4Otl/ae5eb976d1c7dc176cc288619b18c036/image1.png" />
          </figure><p><span><span><sup>(</sup></span></span><a href="https://www.waveshare.com/e-paper-esp32-driver-board.htm"><span><span><sup><u>source</u></sup></span></span></a><span><span><sup>)</sup></span></span></p><p>This was my first time working with the <a href="https://en.wikipedia.org/wiki/ESP32"><u>ESP32 CPU family</u></a>, and I’m really impressed. It’s a system-on-chip controller with built-in Bluetooth and Wi-Fi. It’s relatively fast, very power-efficient, and <a href="https://youtu.be/qLh1FOcGysY?t=157"><u>quite popular in DSP</u></a> (digital signal processing) applications. For example, your audio device might be powered by a CPU like this. Interestingly, the newer models have switched to the <a href="https://en.wikipedia.org/wiki/RISC-V"><u>RISC-V</u></a> instruction set.</p><p>For our purposes, we’ll only scratch the surface of what the ESP32 is capable of. The chip is straightforward to work with, thanks to the familiar Arduino environment. A great starting point is <a href="https://www.waveshare.com/wiki/E-Paper_ESP32_Driver_Board#How_to_Use"><u>the demo provided by Waveshare</u></a>. 
It sets up a web page where you can easily upload a custom image to the display.</p><p>To run the demo you need to:</p><ul><li><p>Install the <a href="https://support.arduino.cc/hc/en-us/articles/360019833020-Download-and-install-Arduino-IDE"><u>Arduino IDE</u></a>.</p></li><li><p>Fix the permissions of the <code>/dev/ttyACM0</code> serial device.</p></li><li><p>Add the "Additional Boards Manager URL" as per the <a href="https://www.waveshare.com/wiki/Arduino_ESP32/8266_Online_Installation"><u>instructions</u></a>, and install the "esp32 by Espressif Systems" bundle.</p></li><li><p>Open the "Loader_esp32wf" example downloaded from <a href="https://www.waveshare.com/wiki/E-Paper_ESP32_Driver_Board#Download_Demo"><u>Waveshare</u></a>.</p></li><li><p>Change the Wi-Fi name, password, and IP address in the Arduino IDE <code>srvr.h</code> tab.</p></li></ul><p>Once everything is set up, you should be able to connect to the ESP32’s IP address and use the simple web interface to upload an image to the display.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5KUXIJqF1J9hQqHISumuaV/85c211b315f338ba2339b4a7120162d4/Screenshot_2024-12-23_at_15.24.46.png" />
          </figure><p>With a simple click of the "Upload Image" button, the magic happens: the e-paper display comes to life, showcasing the uploaded image.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qxQCybmGmpxufDfU69OcY/d995257657f5991dd1d152b59ae2b84e/Screenshot_2024-12-23_at_15.24.09.png" />
          </figure><p>With the demo up and running, we can move on to the next step: figuring out how to render a web page on the e-paper display.</p>
    <div>
      <h2>Three layers of software</h2>
      <a href="#three-layers-of-software">
        
      </a>
    </div>
    <p>The ESP32 comes with some limitations. It has 520 KiB of RAM, 4 MiB of flash, and a 240 MHz clock speed. While this is fine for tasks like connecting to Wi-Fi or fetching a simple URL, it’s not powerful enough for more demanding tasks, such as parsing JSON or rendering an entire web page.</p><p>There are basic Arduino libraries for handling bitmaps, which can draw rectangles and render simple fonts, but manually managing layout doesn't sound appealing to me. A better approach is to play to the ESP32’s strengths — fetching and displaying bitmaps — and delegate the more complex task of HTML rendering to a more powerful server. </p><p>Let’s break the problem into three layers:</p><ol><li><p><b>ESP32 (Display Layer): </b>The ESP32 will periodically, say every minute, fetch a pre-rendered bitmap from the server and display it on the e-paper screen. This keeps the ESP32's tasks lightweight and manageable.</p></li><li><p><b>Server A (Rendering Layer): </b>This server will fetch the desired website, render it, and rasterize it into a bitmap format. Its job is to prepare a bitmap that the ESP32 can handle without additional processing.</p></li><li><p><b>Server B (Content Layer): </b>This server hosts the actual website with the HTML and CSS content. In this case, it will provide the local weather data in a styled format, ready to be fetched and rendered by Server A.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3V3LbBwfxhaWMukCVw0Tx/ab272f72ee2e45879acf3fea9a0fe75e/image8.png" />
          </figure>
    <div>
      <h3>ESP32 (Display Layer)</h3>
      <a href="#esp32-display-layer">
        
      </a>
    </div>
    <p>The ESP32 provides some great higher-level libraries to simplify development. For this project, we’ll need three key components:</p><ol><li><p><a href="https://github.com/espressif/arduino-esp32/blob/master/libraries/WiFi/examples/WiFiClientBasic/WiFiClientBasic.ino"><b><u>Wi-Fi Arduino Library</u></b></a><b>:</b> To connect the ESP32 to a Wi-Fi network.</p></li><li><p><a href="https://github.com/espressif/arduino-esp32/blob/master/libraries/HTTPClient/examples/BasicHttpClient/BasicHttpClient.ino"><b><u>HTTP Arduino Library</u></b></a><b>:</b> To handle HTTP requests and fetch the rendered bitmap from the server.</p></li><li><p><b>EPD (e-Paper Display) Driver:</b> To control the e-paper display and render the fetched bitmap.</p></li></ol><p>These libraries make it much easier to implement the required functionality without dealing with low-level details.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2024-12-e-paper/ESP32-fetch-from-worker/ESP32-fetch-from-worker.ino#L31"><u>Here's my ESP32 Arduino project code</u></a>. It's actually pretty straightforward:</p><ul><li><p>First, it connects to Wi-Fi</p></li><li><p>Then, it fetches a rendered bitmap from an HTTP endpoint</p></li><li><p>Then it pushes it to the e-paper display if needed</p></li><li><p>Waits a minute</p></li><li><p>And repeats the whole process forever</p></li></ul><p>E-paper displays typically start to degrade after about one million refresh cycles. To preserve the display's lifespan, I’m being extra careful to avoid unnecessary refreshes.</p>
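<p>The refresh-avoidance part of that loop is worth spelling out, since it's what protects the panel. Here's the same control flow as a small Python sketch with the I/O injected as callables; all the names here are illustrative, this is not the linked Arduino code:</p>

```python
import time

def run_display_loop(fetch_bitmap, push_to_display, sleep=time.sleep,
                     interval=60, iterations=None):
    """Poll the rendering server, but only refresh the e-paper panel
    when the bitmap actually changed, to conserve its limited
    (roughly one million) refresh cycles."""
    last = None
    n = 0
    while iterations is None or n < iterations:
        bitmap = fetch_bitmap()       # an HTTP GET in the real firmware
        if bitmap is not None and bitmap != last:
            push_to_display(bitmap)   # the EPD driver call on the ESP32
            last = bitmap
        n += 1
        sleep(interval)
```

The same structure works whether the loop runs forever on the device or a bounded number of times in a test.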
    <div>
      <h3>Server A (Rendering Layer)</h3>
      <a href="#server-a-rendering-layer">
        
      </a>
    </div>
    <p>Now for the exciting part! We need an online service that can fetch a website, render it, rasterize it to fit our small monochromatic display, and return it as a display-sized binary blob. Initially, I considered using headless Chrome paired with an ImageMagick script, but then I discovered <a href="https://developers.cloudflare.com/browser-rendering/"><u>Cloudflare’s </u><b><u>Browser Rendering API</u></b></a>, which fits our needs perfectly.</p><p>The API is pleasantly simple to use. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2024-12-e-paper/worker-render-raster/index.ts#L79"><u>Here's the TypeScript worker code</u></a>; two parts of it are particularly interesting: handling the remote browser, and dithering.</p>
    <div>
      <h3>Remote Browser API</h3>
      <a href="#remote-browser-api">
        
      </a>
    </div>
    <p>First, see how easy it is to render a website as a PNG using Browser Rendering:</p>
            <pre><code>if (!browser) {
    browser = await puppeteer.launch(env.MYBROWSER, { keep_alive: 600000 });
    launched = true;
}
sessionId = browser.sessionId();

const page = await browser.newPage();
await page.setViewport({
    width: 480,
    height: 800,
    deviceScaleFactor: 1,
})

await page.goto(url);
img = (await page.screenshot()) as Buffer;</code></pre>
            <p>I’m genuinely surprised at how practical and effective this approach is. While the remote browser startup isn’t exactly fast — it can take a few seconds to generate the screenshot — it’s not an issue for my use case. The delay is perfectly acceptable, especially considering how much work is offloaded to the cloud.</p>
    <div>
      <h4>Dithering</h4>
      <a href="#dithering">
        
      </a>
    </div>
    <p>To prepare the bitmap for the ESP32, we need to decode the PNG, reduce the color palette to monochromatic, and apply <a href="https://en.wikipedia.org/wiki/Dither"><u>dithering</u></a>. Here's the dithering code:</p>
            <pre><code>function ditherTwoBits(px: Buffer,
                       width: number,
                       height: number
                      ): Buffer {
    // Work in floats, so the diffused error can go below 0 or above 255.
    const buf = new Float32Array(px);

    for (let y = 0; y &lt; height; y++) {
        for (let x = 0; x &lt; width; x++) {
            const old_pixel = buf[y * width + x];
            const new_pixel = old_pixel &gt; 128 ? 0xff : 0x00;
            buf[y * width + x] = new_pixel;

            // Floyd–Steinberg error diffusion: push the quantization
            // error onto the unprocessed neighbours with weights
            // 7/16, 3/16, 5/16 and 1/16, skipping out-of-bounds pixels.
            const quant_error = (old_pixel - new_pixel) / 16.0;
            if (x + 1 &lt; width)
                buf[y * width + (x + 1)] += quant_error * 7.;
            if (y + 1 &lt; height) {
                if (x - 1 &gt;= 0)
                    buf[(y + 1) * width + (x - 1)] += quant_error * 3.;
                buf[(y + 1) * width + (x + 0)] += quant_error * 5.;
                if (x + 1 &lt; width)
                    buf[(y + 1) * width + (x + 1)] += quant_error * 1.;
            }
        }
    }

    return Buffer.from(Uint8ClampedArray.from(buf));
}</code></pre>
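<p>As a sanity check, the error diffusion is easy to reproduce outside the Worker. The short standalone Python sketch below (not the Worker code) mirrors the algorithm and demonstrates its key property: after a full pass, every pixel ends up at exactly 0 or 255, while the diffused error preserves the overall brightness of the image.</p>

```python
def dither(px, width, height):
    """1-bit Floyd-Steinberg dithering of a row-major grayscale image."""
    buf = [float(v) for v in px]
    for y in range(height):
        for x in range(width):
            old = buf[y * width + x]
            new = 255.0 if old > 128 else 0.0
            buf[y * width + x] = new
            # Spread the quantization error over unprocessed neighbours,
            # with the classic 7/16, 3/16, 5/16, 1/16 weights.
            err = (old - new) / 16.0
            if x + 1 < width:
                buf[y * width + x + 1] += err * 7
            if y + 1 < height:
                if x > 0:
                    buf[(y + 1) * width + x - 1] += err * 3
                buf[(y + 1) * width + x] += err * 5
                if x + 1 < width:
                    buf[(y + 1) * width + x + 1] += err * 1
    return bytes(min(255, max(0, int(v))) for v in buf)
```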
            <p>This was my first time experimenting with dithering, and it’s been a lot of fun! I was surprised by how straightforward the process is and that it’s fully deterministic. Now that I understand the details of the algorithm, I can’t help but notice its subtle side effects everywhere — in printed materials, on screens, and even in design choices around me. It’s fascinating how something so simple has such a broad impact!</p><p>To deploy this code as a Cloudflare Worker, you only need to install the required dependencies, configure the <code>wrangler.toml</code> file, and publish the code. Here’s a step-by-step guide:</p>
            <pre><code>sudo apt install npm
cd worker-render-raster
npm install wrangler
npm install @cloudflare/puppeteer --save-dev
npm install fast-png --save-dev
npx wrangler kv:namespace create KV
npx wrangler kv:namespace create KV --preview</code></pre>
            <p>With this out of the way, you can run the code:</p>
            <pre><code>2025-01-e-paper/worker-render-raster$ npx wrangler dev --remote

 ⛅️ wrangler 3.99.0
-------------------

Your worker has access to the following bindings:
- KV Namespaces:
  - KV: XXX
- Browser:
  - Name: BROWSER
[wrangler:inf] Ready on http://localhost:46131
⎔ Starting remote preview...
Total Upload: 755.39 KiB / gzip: 149.05 KiB
╭─────────────────────────────────────────────────────────────────────────────────────────────────╮
│  [b] open a browser, [d] open devtools, [l] turn on local mode, [c] clear console, [x] to exit  │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯</code></pre>
            <p>With everything set up, you can now open a browser and see a rendered and rasterized version of a website, processed through your Cloudflare Worker! For example, here’s how the <b>1.1.1.1</b> page looks at an 800x480 monochromatic resolution, complete with dithering:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6awlaJqO4LduSLwvUXQbYJ/b2b43d14c0a8d44fbed1ed4bc2a4de12/Screenshot_2024-12-23_at_15.23.14.png" />
          </figure><p>This demonstrates how effectively the Worker can handle rendering, rasterizing, and adapting web content for an e-paper display. It’s quite satisfying to see the pipeline in action.</p>
    <div>
      <h3>Server B (Content Layer)</h3>
      <a href="#server-b-content-layer">
        
      </a>
    </div>
    <p>To create the weather panel, I designed a simple HTML and CSS page and <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2024-12-e-paper/worker-weather-panel/entry.py#L165"><u>published it as another Cloudflare Worker</u></a>. This time, I used <a href="https://developers.cloudflare.com/workers/languages/python/"><u>Python in Cloudflare Workers</u></a> because it felt more straightforward, especially since the site needs to query an external weather API. The simplicity of the code was surprising and made the process smooth.</p>
            <pre><code>async def on_fetch(request, env):
    cached = await env.KV.get("weather")
    if cached:
        cached = json.loads(cached)
    else:
        u = "https://api.open-meteo.com/..."
        a = await fetch(u)
        result = await a.text()
        cached = json.loads(result)
        await env.KV.put("weather", json.dumps(cached))
    return Response.new(render(...), headers=[('content-type', 'text/html')])</code></pre>
            <p>Here’s how it appears in a normal browser compared to the rendered and rasterized version by our worker:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5irsWxy9nIgTRirXP6l8Nh/a4bb0d31df18c8385ddd96c4542941d0/Screenshot_2024-12-23_at_15.19.21.png" />
          </figure>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Finally, the display deserves a proper frame. Here’s the finished version:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Dymme1r2SJ8rAKRJjyl0K/4fa344362ca037607c7989dbf47b7fff/image11.png" />
          </figure><p>I started this project wanting to experiment with e-paper display hardware, but I ended up spending most of my time writing software, and it turned out to be surprisingly enjoyable across all layers:</p><ul><li><p><b>ESP32:</b> The CPU is fantastic. Programming it is straightforward, thanks to powerful built-in libraries that simplify development.</p></li><li><p><b>Cloudflare Worker Browser Rendering:</b> This is an underrated but incredibly powerful technology. It made implementing features like the Floyd–Steinberg dithering algorithm surprisingly easy.</p></li><li><p><b>Cloudflare Worker Python:</b> Although still in beta, it worked flawlessly for my needs and was a great fit for handling API requests and serving dynamic content.</p></li></ul><p>It’s remarkable how much you can achieve with relatively inexpensive hardware and free Cloudflare services.</p>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Python]]></category>
            <guid isPermaLink="false">2b03lxutI7PqAfcxswe5aE</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Virtual networking 101: bridging the gap to understanding TAP]]></title>
            <link>https://blog.cloudflare.com/virtual-networking-101-understanding-tap/</link>
            <pubDate>Fri, 06 Oct 2023 13:05:33 GMT</pubDate>
            <description><![CDATA[ Tap devices were historically used for VPN clients. Using them for virtual machines is essentially reversing their original purpose: from traffic sinks to traffic sources. In this article, I explore the intricacies of tap devices, covering topics like offloads, segmentation, and multi-queue. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5K0iKrieDwp0YkLViNDJKB/1c655399c5c70fe75c8f56f3294266a9/image1-5.png" />
            
            </figure><p>It's a never-ending effort to improve the performance of our infrastructure. As part of that quest, we wanted to squeeze as much network oomph as possible from our virtual machines. Internally, for some projects, we use <a href="https://firecracker-microvm.github.io/">Firecracker</a>, which is a KVM-based virtual machine manager (VMM) that runs light-weight “Micro-VM”s. Each Firecracker instance uses a tap device to communicate with the host system. Not knowing much about tap, I had to up my game. That wasn't easy, though: the documentation is messy and spread across the Internet.</p><p>Here are the notes that I wish someone had passed me when I started out on this journey!</p><p>A tap device is a virtual <b>network interface</b> that looks like an Ethernet network card. Instead of having real wires plugged into it, it exposes a nice handy file descriptor to an application willing to send and receive packets. Historically, tap devices were mostly used to implement VPN clients. The machine would route traffic towards the <b>tap interface</b>, and a VPN client application would pick the packets up and process them accordingly. For example, this is what our Cloudflare WARP Linux client does. Here's how it looks on my laptop:</p>
            <pre><code>$ ip link list
...
18: CloudflareWARP: &lt;POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 1280 qdisc mq state UNKNOWN mode DEFAULT group default qlen 500
	link/none

$ ip tuntap list
CloudflareWARP: tun multi_queue</code></pre>
            <p>More recently, tap devices started to be used by <a href="https://www.cloudflare.com/learning/cloud/what-is-a-virtual-machine/">virtual machines</a> to enable networking. The VMM (like Qemu, Firecracker, or gVisor) would open the application side of a tap and pass all the packets to the guest VM. The tap network interface would be left for the host kernel to deal with. Typically, the host would behave like a router and firewall, forwarding or NATing all the packets. This design is somewhat surprising: it almost reverses the original use case for tap. In the VPN days, tap was a traffic destination. With a VM behind it, tap looks like a traffic source.</p><p>A Linux tap device is a <b>mean creature</b>. It looks trivial: a virtual network interface with a file descriptor behind it. However, it's <b>surprisingly hard</b> to get it to perform well. The Linux networking stack is optimized for packets handled by a physical network card, not a userspace application. Over the years, though, the Linux tap interface grew in features, and nowadays it's possible to get good performance out of it. Later, I'll explain how to use the Linux tap API in a modern way.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79hvP3TXwPib9DE9leEJbE/7f81ee55560cfdff5b7df7419adc91c5/Screenshot-2023-10-05-at-2.42.49-PM.png" />
            
            </figure><p>Source: DALL-E</p>
    <div>
      <h2>To tun or to tap?</h2>
      <a href="#to-tun-or-to-tap">
        
      </a>
    </div>
    <p>The interface is <a href="https://docs.kernel.org/networking/tuntap.html">called "the universal tun/tap"</a> in the kernel. The "tun" variant, accessible via the IFF_TUN flag, looks like a point-to-point link. There are no L2 Ethernet headers. Since most modern networks are Ethernet, this is a bit less intuitive to set up for a novice user. Most importantly, projects like Firecracker and gVisor do expect L2 headers.</p><p>"Tap", with the IFF_TAP flag, is the one which has Ethernet headers, and has been getting all the attention lately. If you are like me and always forget which one is which, you can use this AI-generated rhyme (check out <a href="/writing-poems-using-llama-2-on-workers-ai/">WorkersAI/LLama</a>) to help you remember:</p><p><i>Tap is like a switch,</i><br /><i>Ethernet headers it'll hitch.</i><br /><i>Tun is like a tunnel,</i><br /><i>VPN connections it'll funnel.</i><br /><i>Ethernet headers it won't hold,</i><br /><i>Tap uses, tun does not, we're told.</i></p>
    <div>
      <h2>Listing devices</h2>
      <a href="#listing-devices">
        
      </a>
    </div>
    <p>Tun/tap devices are natively supported by iproute2 tooling. Typically, one creates a device with <b>ip tuntap add</b> and lists it with <b>ip tuntap list</b>:</p>
            <pre><code>$ sudo ip tuntap add mode tap user marek group marek name tap0
$ ip tuntap list
tap0: tap persist user 1000 group 1000</code></pre>
            <p>Alternatively, it's possible to look for the <code>/sys/devices/virtual/net/&lt;ifr_name&gt;/tun_flags</code> files.</p>
    <div>
      <h2>Tap device setup</h2>
      <a href="#tap-device-setup">
        
      </a>
    </div>
    <p>To open or create a new device, you first need to open <code>/dev/net/tun</code> which is called a "clone device":</p>
            <pre><code>/* First, whatever you do, the device /dev/net/tun must be
 * opened read/write. That device is also called the clone
 * device, because it's used as a starting point for the
 * creation of any tun/tap virtual interface. */
const char *clone_dev_name = "/dev/net/tun";
int tap_fd = open(clone_dev_name, O_RDWR | O_CLOEXEC);
if (tap_fd &lt; 0) {
	error(-1, errno, "open(%s)", clone_dev_name);
}</code></pre>
            <p>With the clone device file descriptor we can now instantiate a specific tap device by name:</p>
            <pre><code>struct ifreq ifr = {};
strncpy(ifr.ifr_name, tap_name, IFNAMSIZ);
ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
int r = ioctl(tap_fd, TUNSETIFF, &amp;ifr);
if (r != 0) {
	error(-1, errno, "ioctl(TUNSETIFF)");
}</code></pre>
            <p>If <b>ifr_name</b> is empty or names a device that doesn't exist, a new tap device is created. Otherwise, the existing device is opened. When opening an existing device, flags like IFF_MULTI_QUEUE must match the way the device was created, or EINVAL is returned. On an EINVAL error it's a good idea to retry with the multi queue setting flipped.</p><p>The <b>ifr_flags</b> can have the following bits set:</p><table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>IFF_TAP / IFF_TUN</span></p></td><td><p><span>Already discussed.</span></p></td></tr><tr><td><p><span>IFF_NO_CARRIER</span></p></td><td><p><span>Holding an open tap device file descriptor sets the Ethernet interface CARRIER flag up. In some cases it might be desirable to delay that until a TUNSETCARRIER call.</span></p></td></tr><tr><td><p><span>IFF_NO_PI</span></p></td><td><p><span>Historically each packet on tap had a "struct tun_pi" 4 byte prefix. There are now better alternatives, and this option disables the prefix.</span></p></td></tr><tr><td><p><span>IFF_TUN_EXCL</span></p></td><td><p><span>Ensures a new device is created. Returns EBUSY if the device exists.</span></p></td></tr><tr><td><p><span>IFF_VNET_HDR</span></p></td><td><p><span>Prepend "</span><a href="https://elixir.bootlin.com/linux/v6.4.6/source/include/uapi/linux/virtio_net.h#L187"><span>struct virtio_net_hdr</span></a><span>" before the RX and TX packets; should be followed by ioctl(TUNSETVNETHDRSZ).</span></p></td></tr><tr><td><p><span>IFF_MULTI_QUEUE</span></p></td><td><p><span>Use multi queue tap, see below.</span></p></td></tr><tr><td><p><span>IFF_NAPI / IFF_NAPI_FRAGS</span></p></td><td><p><span>See below.</span></p></td></tr></tbody></table><p>You almost always want the IFF_TAP, IFF_NO_PI, and IFF_VNET_HDR flags, and perhaps sometimes IFF_MULTI_QUEUE.</p>
    <div>
      <h2>The curious IFF_NAPI</h2>
      <a href="#the-curious-iff_napi">
        
      </a>
    </div>
    <p>Judging by the <a href="https://www.mail-archive.com/netdev@vger.kernel.org/msg189704.html">original patchset introducing IFF_NAPI and IFF_NAPI_FRAGS</a>, these flags were introduced to increase code coverage of syzkaller. However, later work indicates there were <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fb3f903769e805221eb19209b3d9128d398038a1">performance benefits when doing XDP on tap</a>. IFF_NAPI enables a dedicated NAPI instance for packets written from an application into a tap. Besides allowing XDP, it also allows packets to be batched and GRO-ed. Otherwise, a backlog NAPI is used.</p>
    <div>
      <h2>A note on buffer sizes</h2>
      <a href="#a-note-on-buffer-sizes">
        
      </a>
    </div>
    <p>Internally, a tap device is just a pair of packet queues. It's exposed as a network interface towards the host, and as a file descriptor, a character device, towards the application. The queue in the direction of the application (the tap TX queue) holds <b>txqueuelen</b> packets, controlled by an interface parameter:</p>
            <pre><code>$ ip link set dev tap0 txqueuelen 1000
$ ip -s link show dev tap0
26: tap0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 ... qlen 1000
    RX:  bytes packets errors dropped  missed   mcast
             0       0      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
           266       3      0      66       0       0</code></pre>
            <p>In "ip link" statistics the "<b>TX dropped</b>" column indicates that the tap application was too slow and the queue space was exhausted.</p><p>In the other direction (the interface RX queue, from the application towards the host), the queue size limit is measured in bytes and controlled by the TUNSETSNDBUF ioctl. The <a href="https://elixir.bootlin.com/qemu/v8.1.1/source/net/tap-linux.c#L120">qemu comment discusses</a> this setting; however, it's not easy to cause this queue to overflow. See this <a href="https://bugzilla.redhat.com/show_bug.cgi?id=508861">discussion for details</a>.</p>
    <div>
      <h2>vnethdr size</h2>
      <a href="#vnethdr-size">
        
      </a>
    </div>
    <p>After the device is opened, the next step is usually to set up the VNET_HDR size and the offloads. Typically VNETHDRSZ should be set to 12:</p>
            <pre><code>len = 12;
r = ioctl(tap_fd, TUNSETVNETHDRSZ, &amp;(int){len});
if (r != 0) {
	error(-1, errno, "ioctl(TUNSETVNETHDRSZ)");
}</code></pre>
            <p>Sensible values are {10, 12, 20}, which are derived from the <a href="https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html">virtio spec</a>. 12 bytes makes room for the following header (little endian):</p>
            <pre><code>struct virtio_net_hdr_v1 {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM  1    /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID  2    /* Csum is valid */
    u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE      0    /* Not a GSO frame */
#define VIRTIO_NET_HDR_GSO_TCPV4     1    /* GSO frame, IPv4 TCP (TSO) */
#define VIRTIO_NET_HDR_GSO_UDP       3    /* GSO frame, IPv4 UDP (UFO) */
#define VIRTIO_NET_HDR_GSO_TCPV6     4    /* GSO frame, IPv6 TCP */
#define VIRTIO_NET_HDR_GSO_UDP_L4    5    /* GSO frame, IPv4 &amp; IPv6 UDP (USO) */
#define VIRTIO_NET_HDR_GSO_ECN       0x80 /* TCP has ECN set */
    u8 gso_type;
    u16 hdr_len;     /* Ethernet + IP + tcp/udp hdrs */
    u16 gso_size;    /* Bytes to append to hdr_len per frame */
    u16 csum_start;
    u16 csum_offset;
    u16 num_buffers;
};</code></pre>
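<p>The {10, 12, 20} values can be sanity-checked against the structs the kernel ships in <code>&lt;linux/virtio_net.h&gt;</code>: 10 is the legacy <code>struct virtio_net_hdr</code>, 12 is <code>struct virtio_net_hdr_v1</code> above (which appends the u16 num_buffers), and 20 adds hash metadata on top of v1. A quick sketch (the helper names are mine, for illustration):</p>

```c
#include <stddef.h>
#include <linux/virtio_net.h>

/* The sensible VNETHDRSZ values map to kernel uapi structs:
 * 10 = sizeof(struct virtio_net_hdr)    - legacy, no num_buffers
 * 12 = sizeof(struct virtio_net_hdr_v1) - adds u16 num_buffers */
size_t legacy_vnet_hdr_len(void) { return sizeof(struct virtio_net_hdr); }
size_t v1_vnet_hdr_len(void)     { return sizeof(struct virtio_net_hdr_v1); }

/* csum_start sits after flags(1) + gso_type(1) + hdr_len(2) + gso_size(2). */
size_t v1_csum_start_off(void)
{
	return offsetof(struct virtio_net_hdr_v1, csum_start);
}
```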
            
    <div>
      <h2>offloads</h2>
      <a href="#offloads">
        
      </a>
    </div>
    <p>To enable offloads, use the TUNSETOFFLOAD ioctl:</p>
            <pre><code>unsigned off_flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;
int r = ioctl(tap_fd, TUNSETOFFLOAD, off_flags);
if (r != 0) {
	error(-1, errno, "ioctl(TUNSETOFFLOAD)");
}</code></pre>
            <p>Here are the allowed bit values. They declare which packet features the userspace application can receive:</p><table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUN_F_CSUM</span></p></td><td><p><span>L4 packet checksum offload.</span></p></td></tr><tr><td><p><span>TUN_F_TSO4</span></p></td><td><p><span>TCP Segmentation Offload - TSO for IPv4 packets.</span></p></td></tr><tr><td><p><span>TUN_F_TSO6</span></p></td><td><p><span>TSO for IPv6 packets.</span></p></td></tr><tr><td><p><span>TUN_F_TSO_ECN</span></p></td><td><p><span>TSO with ECN bits.</span></p></td></tr><tr><td><p><span>TUN_F_UFO</span></p></td><td><p><span>UDP Fragmentation Offload - UFO packets. Deprecated.</span></p></td></tr><tr><td><p><span>TUN_F_USO4</span></p></td><td><p><span>UDP Segmentation Offload - USO for IPv4 packets.</span></p></td></tr><tr><td><p><span>TUN_F_USO6</span></p></td><td><p><span>USO for IPv6 packets.</span></p></td></tr></tbody></table><p>Generally, offloads are extra packet features the tap application can deal with. Details of the offloads used by the sender are set on each packet in the vnethdr prefix.</p>
    <div>
      <h2>Checksum offload TUN_F_CSUM</h2>
      <a href="#checksum-offload-tun_f_csum">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QEojUsAvj7VueXmGRSexc/37671b9c92dcc7f6c40bf572cd6e73dc/image4.jpg" />
            
            </figure><p>Structure of a typical UDP packet received over tap.</p><p>Let's start with the checksumming offload. The TUN_F_CSUM offload saves the kernel some work by pushing checksum processing down the path. Applications which set that flag indicate they can handle checksum validation. For example, with this offload a UDP IPv4 packet will have:</p><ul><li><p>vnethdr flags with VIRTIO_NET_HDR_F_NEEDS_CSUM set</p></li><li><p>hdr_len of 42 (14+20+8)</p></li><li><p>csum_start of 34 (14+20)</p></li><li><p>and csum_offset of 6 (the UDP checksum is 6 bytes into the L4 header)</p></li></ul><p>This is illustrated above.</p><p>Supporting checksum offload is a prerequisite for the further offloads.</p>
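<p>The 42/34/6 arithmetic above can be written down directly. This is a sketch (the helper <code>vnet_hdr_udp4</code> is a name of my own invention) that fills a vnet header for an outgoing UDP/IPv4 packet the way the text describes: the peer finishes the checksum at csum_start + csum_offset:</p>

```c
#include <stddef.h>
#include <string.h>
#include <linux/if_ether.h>   /* ETH_HLEN == 14 */
#include <linux/virtio_net.h>

/* Fill a vnet header for a plain (non-GSO) UDP/IPv4 packet under
 * TUN_F_CSUM. ip_hlen is the IP header length: 20 without options. */
void vnet_hdr_udp4(struct virtio_net_hdr_v1 *h, size_t ip_hlen)
{
	memset(h, 0, sizeof(*h));
	h->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
	h->gso_type = VIRTIO_NET_HDR_GSO_NONE;
	h->hdr_len = ETH_HLEN + ip_hlen + 8;  /* Ethernet + IP + UDP = 42 */
	h->csum_start = ETH_HLEN + ip_hlen;   /* L4 starts at offset 34 */
	h->csum_offset = 6;                   /* UDP checksum field offset */
}
```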
    <div>
      <h2>TUN_F_CSUM is a must</h2>
      <a href="#tun_f_csum-is-a-must">
        
      </a>
    </div>
    <p>Consider this code:</p>
            <pre><code>s = socket(AF_INET, SOCK_DGRAM)
s.setsockopt(SOL_UDP, UDP_SEGMENT, 1400)
s.sendto(b"x", ("10.0.0.2", 5201))     # Would you expect EIO ?</code></pre>
            <p>This simple code produces a single packet, but when directed at a tap device it will surprisingly yield an EIO "Input/output error". This weird behavior happens when the tap is opened without TUN_F_CSUM and the application sends GSO / UDP_SEGMENT frames. Tough luck. It might be considered a kernel bug, and we're thinking about fixing it. In the meantime, however, everyone using tap should just set the TUN_F_CSUM bit.</p>
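<p>For completeness, the same setup in C, since the rest of this post uses C. This is a sketch (the helper name <code>udp_gso_socket</code> is mine) that enables UDP GSO on a socket, assuming a kernel with UDP_SEGMENT support (4.18 or later); a send() on such a socket towards a tap opened without TUN_F_CSUM is what triggers the EIO:</p>

```c
#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_UDP
#define SOL_UDP 17        /* setsockopt level for UDP */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103   /* from linux/udp.h, kernel >= 4.18 */
#endif

/* Create a UDP socket with GSO enabled: each send() buffer is cut into
 * seg-byte datagrams by the kernel. Returns the fd, or -1 on error. */
int udp_gso_socket(int seg)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return -1;
	if (setsockopt(s, SOL_UDP, UDP_SEGMENT, &seg, sizeof(seg)) != 0) {
		close(s);
		return -1;
	}
	return s;
}
```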
    <div>
      <h2>Segmentation offloads</h2>
      <a href="#segmentation-offloads">
        
      </a>
    </div>
    <p>We wrote about <a href="/accelerating-udp-packet-transmission-for-quic/">UDP_SEGMENT</a> in the past. In short: on Linux an application can handle many packets with a single send/recv, as long as they have identical length.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/31mztfk8fZtP46WzeFrcID/4dd98352343ffafd1cf7d62847c06904/image2-4.png" />
            
            </figure><p>With UDP_SEGMENT a single send() can transfer multiple packets.</p><p>Tap devices support offloads which expose that very functionality. With the TUN_F_TSO4 and TUN_F_TSO6 flags the tap application signals it can deal with long packet trains. Note that with these features the application must be ready to receive much larger buffers: up to 65507 bytes for IPv4 and 65527 for IPv6.</p><p>The TSO4/TSO6 flags enable long packet trains for <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP</a> and have been supported for a long time. More recently the TUN_F_USO4 and TUN_F_USO6 bits were introduced for <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a>. When any of these offloads is used, <b>gso_type</b> contains the relevant offload type and <b>gso_size</b> holds the segment size within the GRO packet train.</p><p>TUN_F_UFO is a UDP fragmentation offload, which is deprecated.</p><p>By setting TUNSETOFFLOAD, the application tells the kernel which offloads it's able to handle on the read() side of a tap device. If the ioctl(TUNSETOFFLOAD) succeeds, the application can assume the kernel supports the same offloads for packets in the other direction.</p>
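<p>The gso_size semantics imply simple arithmetic: everything in the train past hdr_len is payload, sliced into gso_size-byte segments, with the last segment possibly shorter. A tiny helper of my own (hypothetical, for illustration) computing how many wire packets one train expands to:</p>

```c
/* Number of wire packets a GSO packet train expands to: the payload
 * past hdr_len is cut into gso_size-byte segments, rounding up so the
 * final, possibly shorter, segment is counted too. */
unsigned gso_segments(unsigned total_len, unsigned hdr_len,
                      unsigned gso_size)
{
	unsigned payload = total_len - hdr_len;
	return (payload + gso_size - 1) / gso_size;
}
```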
    <div>
      <h2>Bug in rx-udp-gro-forwarding - TUN_F_USO4</h2>
      <a href="#bug-in-rx-udp-gro-forwarding-tun_f_uso4">
        
      </a>
    </div>
    <p>When working with tap and offloads it's useful to inspect the offload state with <b>ethtool</b>:</p>
            <pre><code>$ ethtool -k tap0 | egrep -v fixed
tx-checksumming: on
    tx-checksum-ip-generic: on
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-udp-segmentation: on
rx-gro-list: off
rx-udp-gro-forwarding: off</code></pre>
            <p>With ethtool we can see the enabled offloads and disable them as needed.</p><p>While toying with UDP Segmentation Offload (USO), I noticed that when packet trains from tap are forwarded to a real network interface, they sometimes seem badly packetized. See the <a href="https://lore.kernel.org/all/CAJPywTKDdjtwkLVUW6LRA2FU912qcDmQOQGt2WaDo28KzYDg+A@mail.gmail.com/">netdev discussion</a> and the <a href="https://lore.kernel.org/netdev/ZK9ZiNMsJX8+1F3N@debian.debian/T/#m9f532bb5463b89d997c5d16c78490ceeccb4497f">proposed fix</a>. In any case, beware of this bug, and maybe consider doing "ethtool -K tap0 rx-udp-gro-forwarding off".</p>
    <div>
      <h2>Miscellaneous setsockopts</h2>
      <a href="#miscellaneous-setsockopts">
        
      </a>
    </div>
    
    <div>
      <h4>Setup related socket options</h4>
      <a href="#setup-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNGETFEATURES</span></p></td><td><p><span>Return vector of IFF_* constants that the kernel supports. Typically used to detect the host support of: IFF_VNET_HDR, IFF_NAPI and IFF_MULTI_QUEUE.</span></p></td></tr><tr><td><p><span>TUNSETIFF</span></p></td><td><p><span>Takes "struct ifreq", sets up a tap device, fills in the name if empty.</span></p></td></tr><tr><td><p><span>TUNGETIFF</span></p></td><td><p><span>Returns a "struct ifreq" containing the device's current name and flags.</span></p></td></tr><tr><td><p><span>TUNSETPERSIST</span></p></td><td><p><span>Sets TUN_PERSIST flag, if you want the device to remain in the system after the tap_fd is closed.</span></p></td></tr><tr><td><p><span>TUNSETOWNER, TUNSETGROUP</span></p></td><td><p><span>Set uid and gid that can own the device.</span></p></td></tr><tr><td><p><span>TUNSETLINK</span></p></td><td><p><span>Set the Ethernet link type for the device. The device must be down. See ARPHRD_* constants. For tap it defaults to ARPHRD_ETHER.</span></p></td></tr><tr><td><p><span>TUNSETOFFLOAD</span></p></td><td><p><span>As documented above.</span></p></td></tr><tr><td><p><span>TUNGETSNDBUF, TUNSETSNDBUF</span></p></td><td><p><span>Get/set send buffer. 
The default is INT_MAX.</span></p></td></tr><tr><td><p><span>TUNGETVNETHDRSZ, TUNSETVNETHDRSZ</span></p></td><td><p><span>Already discussed.</span></p></td></tr><tr><td><p><span>TUNSETIFINDEX</span></p></td><td><p><span>Set interface index (ifindex), </span><a href="https://patchwork.ozlabs.org/project/netdev/patch/51B99946.3000703@parallels.com/"><span>useful in checkpoint-restore</span></a><span>.</span></p></td></tr><tr><td><p><span>TUNSETCARRIER</span></p></td><td><p><span>Set the carrier state of an interface, as discussed earlier, useful with IFF_NO_CARRIER.</span></p></td></tr><tr><td><p><span>TUNGETDEVNETNS</span></p></td><td><p><span>Return an fd of a net namespace that the interface belongs to.</span></p></td></tr></tbody></table>
    <div>
      <h5>Filtering related socket options</h5>
      <a href="#filtering-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNSETTXFILTER</span></p></td><td><p><span>Takes "struct tun_filter" which limits the dst mac addresses that can be delivered to the application.</span></p></td></tr><tr><td><p><span>TUNATTACHFILTER, TUNDETACHFILTER, TUNGETFILTER</span></p></td><td><p><span>Attach/detach/get classic BPF filter for packets going to application. Takes "struct sock_fprog".</span></p><br /></td></tr><tr><td><p><span>TUNSETFILTEREBPF</span></p></td><td><p><span>Set an eBPF filter on a tap device. This is independent of the classic BPF above.</span></p></td></tr></tbody></table>
    <div>
      <h4>Multi queue related socket options</h4>
      <a href="#multi-queue-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNSETQUEUE</span></p></td><td><p><span>Used to set IFF_DETACH_QUEUE and IFF_ATTACH_QUEUE for multiqueue.</span></p></td></tr><tr><td><p><span>TUNSETSTEERINGEBPF</span></p></td><td><p><span>Set an eBPF program for selecting a specific tap queue, in the direction towards the application. This is useful if you want to ensure some traffic is sticky to a specific application thread. The eBPF program takes "struct __sk_buff" and returns an int. The selected queue is the return value, truncated to u16, modulo the number of queues.</span></p></td></tr></tbody></table>
    <div>
      <h2>Single queue speed</h2>
      <a href="#single-queue-speed">
        
      </a>
    </div>
    <p>Tap devices are quite weird — they aren't network sockets, nor true files. Their semantics are closest to pipes, and unfortunately the API reflects that. To receive or send a packet on a tap device, the application must do a read() or write() syscall, one packet at a time.</p><p>One might think that some sort of syscall batching would help. Sockets have sendmmsg()/recvmmsg(), but those don't work on tap file descriptors. The typical alternatives enabling batching are the old <a href="/io_submit-the-epoll-alternative-youve-never-heard-about/">io_submit AIO interface</a> and the modern io_uring, which added tap support quite recently. However, it turns out syscall batching doesn't offer that much of an improvement, maybe in the range of 10%.</p><p>The Linux kernel is simply not capable of forwarding millions of packets per second for a single flow or on a single CPU. The best possible solution is to scale vertically for elephant flows with the TSO/USO (packet train) offloads, and to scale horizontally across multiple concurrent flows with multi queue.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Mfbelr6xUN4bGLTPAFLHr/945a18d93d8bce312e7205bdb5911b16/image5-1.png" />
            
            </figure><p>In this chart you can see how dramatic the performance gain from offloads is. Without them, a sample "echo" tap application can process between 320 and 500 thousand packets per second on a single core, with an MTU of 1500. With the offloads enabled, that jumps to 2.7 Mpps, while the number of received "packet trains" stays at just 56 thousand per second. Of course, not every traffic pattern can fully utilize GRO/GSO. However, to get decent performance from tap, and from Linux in general, offloads are absolutely critical.</p>
    <div>
      <h2>Multi queue considerations</h2>
      <a href="#multi-queue-considerations">
        
      </a>
    </div>
    <p>Multi queue is useful when the tap application handles multiple concurrent flows and needs to utilize more than one CPU.</p><p>To get a file descriptor of a tap queue, just add the IFF_MULTI_QUEUE flag when opening the tap. It's possible to detach/reattach a queue with TUNSETQUEUE and IFF_DETACH_QUEUE/IFF_ATTACH_QUEUE, but I'm unsure when this is useful.</p><p>When a multi queue tap is created, it spreads the load across multiple tap queues, each one having a unique file descriptor. Beware of the algorithm selecting the queue though: it might bite you back.</p><p>By default, the Linux tap driver records a symmetric flow hash of every handled flow in a flow table, saving which queue the application used to transmit that flow. On the receiving side it follows that selection and sends subsequent packets of the flow to that specific queue. For example, if your userspace application is sending some TCP flow over queue #2, then the packets going into the application which are part of that flow will go to queue #2. This is generally a sensible design, as long as the sender always selects one specific queue. If the sender changes the TX queue, new packets immediately shift to the new queue, and packets within one flow might be seen as reordered. Additionally, this queue selection design does not take CPU locality into account, which might have minor negative effects on performance for very high throughput applications.</p><p>It's possible to override the flow hash based queue selection by using a <a href="https://www.infradead.org/~mchehab/kernel_docs/networking/multiqueue.html">tc multiq qdisc and the skbedit queue_mapping filter</a>:</p>
            <pre><code>tc qdisc add dev tap0 root handle 1: multiq
tc filter add dev tap0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.168.0.3 \
        action skbedit queue_mapping 0</code></pre>
            <p><b>tc</b> is fragile and thus not a solution I would recommend. A better way is to customize the queue selection algorithm with a TUNSETSTEERINGEBPF eBPF program. In that case, the flow tracking code is no longer employed. By using such a steering eBPF program smartly, it's possible to keep flow processing local to one CPU — useful for best performance.</p>
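<p>The rule the kernel applies to the steering program's verdict is simple: the return value is truncated to u16 and taken modulo the number of queues. So a program that returns, say, a flow hash or the current CPU id pins a flow (or a CPU) to one queue. A userspace simulation of that selection rule (the function name is mine, for illustration):</p>

```c
#include <stdint.h>

/* Simulate how the tap driver maps a TUNSETSTEERINGEBPF program's
 * return value to a queue: truncate to u16, then modulo numqueues. */
unsigned steer_queue(long long bpf_ret, unsigned numqueues)
{
	return (uint16_t)bpf_ret % numqueues;
}
```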
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Now you know everything I wish I had known when I was setting out on this journey!</p><p>To get the best performance, I recommend:</p><ul><li><p>enable the vnethdr</p></li><li><p>enable offloads (TSO and USO)</p></li><li><p>consider spreading the load across multiple queues and CPUs with multi queue</p></li><li><p>consider syscall batching for an additional gain of maybe 10%, perhaps with io_uring</p></li><li><p>consider customizing the steering algorithm</p></li></ul><p>References:</p><ul><li><p><a href="https://www.linux-kvm.org/images/6/63/2012-forum-multiqueue-networking-for-kvm.pdf">https://www.linux-kvm.org/images/6/63/2012-forum-multiqueue-networking-for-kvm.pdf</a></p></li><li><p><a href="https://backreference.org/2010/03/26/tuntap-interface-tutorial/">https://backreference.org/2010/03/26/tuntap-interface-tutorial/</a></p></li><li><p><a href="https://ldpreload.com/p/tuntap-notes.txt">https://ldpreload.com/p/tuntap-notes.txt</a></p></li><li><p><a href="https://www.kernel.org/doc/Documentation/networking/tuntap.txt">https://www.kernel.org/doc/Documentation/networking/tuntap.txt</a></p></li><li><p><a href="https://tailscale.com/blog/more-throughput/">https://tailscale.com/blog/more-throughput/</a></p></li></ul> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">7Fur1bA5gwvHjrVPMLYx9E</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[The day my ping took countermeasures]]></title>
            <link>https://blog.cloudflare.com/the-day-my-ping-took-countermeasures/</link>
            <pubDate>Tue, 11 Jul 2023 13:05:00 GMT</pubDate>
            <description><![CDATA[ Ping developers clearly put some thought into that. I wondered how far they went. Did they handle clock changes in both directions? Are the bad measurements excluded from the final statistics? How do they test the software? ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3n0xGuyOEFmzVNvoE2XTh0/d14bafdc7ac93fbc0bb1f9d3d790240e/Screenshot-2023-07-11-at-13.30.23.png" />
            
            </figure><p>Once my holidays had passed, I found myself reluctantly reemerging into the world of the living. I powered on a corporate laptop, scared to check on my email inbox. However, before turning on the browser, obviously, I had to run a ping. Debugging the network is a mandatory first step after a boot, right? As expected, the network was perfectly healthy but what caught me off guard was this message:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bipMZ0BFtWHz9fK7eNo3Q/13a2d4930b304d293727af404c798c8e/image6.png" />
            
            </figure><p>I was not expecting <b>ping</b> to <b>take countermeasures</b> that early on in a day. Gosh, I wasn't expecting any countermeasures that Monday!</p><p>Once I got over the initial confusion, I took a deep breath and collected my thoughts. You don't have to be Sherlock Holmes to figure out what has happened. I'm really fast - I started <b>ping</b> <i>before</i> the system <b>NTP</b> daemon synchronized the time. In my case, the computer clock was rolled backward, confusing ping.</p><p>While this doesn't happen too often, a computer clock can be freely adjusted either forward or backward. However, it's pretty rare for a regular network utility, like ping, to try to manage a situation like this. It's even less common to call it "taking countermeasures". I would totally expect ping to just print a nonsensical time value and move on without hesitation.</p><p>Ping developers clearly put some thought into that. I wondered how far they went. Did they handle clock changes in both directions? Are the bad measurements excluded from the final statistics? How do they test the software?</p><p>I can't just walk past ping "taking countermeasures" on me. Now I have to understand what ping did and why.</p>
    <div>
      <h3>Understanding ping</h3>
      <a href="#understanding-ping">
        
      </a>
    </div>
    <p>An investigation like this starts with a quick glance at the source code:</p>
            <pre><code> *			P I N G . C
 *
 * Using the InterNet Control Message Protocol (ICMP) "ECHO" facility,
 * measure round-trip-delays and packet loss across network paths.
 *
 * Author -
 *	Mike Muuss
 *	U. S. Army Ballistic Research Laboratory
 *	December, 1983</code></pre>
            <p><b>Ping</b> goes back a long way. It was originally written by <a href="https://en.wikipedia.org/wiki/Mike_Muuss">Mike Muuss</a> while at the U. S. Army Ballistic Research Laboratory, in 1983, before I was born. The code we're looking for is under <a href="https://github.com/iputils/iputils/blob/ee0a515e74b8d39fbe9b68f3309f0cb2586ccdd4/ping/ping_common.c#L746">iputils/ping/ping_common.c</a> gather_statistics() function:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FJGGtP0ecs0zhiBKsyfxx/224080cf97fc53fa3fb83c59a86e43c9/image5.png" />
            
            </figure><p>The code is straightforward: the message in question is printed when the measured <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> is negative. In this case ping resets the latency measurement to zero. Here you are: "taking countermeasures" is nothing more than just marking an erroneous measurement as if it was 0ms.</p><p>But what precisely does ping measure? Is it the wall clock? The <a href="https://man7.org/linux/man-pages/man8/ping.8.html">man page</a> comes to the rescue. Ping has two modes.</p><p>The "old", -U mode, in which it uses the wall clock. This mode is less accurate (has more jitter). It calls <b>gettimeofday</b> before sending and after receiving the packet.</p><p>The "new", default, mode in which it uses "network time". It calls <b>gettimeofday</b> before sending, and gets the receive timestamp from a more accurate SO_TIMESTAMP CMSG. More on this later.</p>
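<p>The clamping logic boils down to a few lines. Here is a sketch of my own (not iputils' actual code) of what gather_statistics() effectively does with a clock that was stepped backwards mid-flight:</p>

```c
#include <stdbool.h>

/* RTT is receive time minus send time. If the wall clock was rolled
 * backwards between the two readings, the difference goes negative:
 * ping prints the "taking countermeasures" warning and clamps the
 * sample to 0 instead of reporting a nonsensical value. */
long rtt_usec(long send_usec, long recv_usec, bool *warned)
{
	long triptime = recv_usec - send_usec;
	*warned = false;
	if (triptime < 0) {
		/* "Warning: time of day goes back ... taking countermeasures" */
		*warned = true;
		triptime = 0;
	}
	return triptime;
}
```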
    <div>
      <h3>Tracing gettimeofday is hard</h3>
      <a href="#tracing-gettimeofday-is-hard">
        
      </a>
    </div>
    <p>Let's start with a good old strace:</p>
            <pre><code>$ strace -e trace=gettimeofday,time,clock_gettime -f ping -n -c1 1.1 &gt;/dev/null
... nil ...</code></pre>
            <p>It doesn't show any calls to <b>gettimeofday</b>. What is going on?</p><p>On modern Linux some syscalls are not true syscalls. Instead of jumping to the kernel space, which is slow, they remain in userspace and go to a special code page provided by the host kernel. This code page is called <b>vdso</b>. It's visible as a <b>.so</b> library to the program:</p>
            <pre><code>$ ldd `which ping` | grep vds
    linux-vdso.so.1 (0x00007ffff47f9000)</code></pre>
            <p>Calls to the <b>vdso</b> region are not syscalls; they remain in userspace and are super fast, but classic strace can't see them. For debugging it would be nice to turn off the <b>vdso</b> and fall back to classic slow syscalls. That's easier said than done.</p><p>There is no way to prevent loading of the <b>vdso</b>. However, there are two ways to convince a loaded program not to use it.</p><p>The first technique is about fooling glibc into thinking the <b>vdso</b> is not loaded (glibc must handle that case for compatibility with ancient Linux). When bootstrapping in a freshly run process, glibc inspects the <a href="https://www.gnu.org/software/libc/manual/html_node/Auxiliary-Vector.html">Auxiliary Vector</a> provided by the ELF loader. One of the parameters holds the location of the <b>vdso</b> pointer; <a href="https://man7.org/linux/man-pages/man7/vdso.7.html">the man page</a> gives this example:</p>
            <pre><code>void *vdso = (uintptr_t) getauxval(AT_SYSINFO_EHDR);</code></pre>
            <p>A technique proposed on <a href="https://stackoverflow.com/a/63811017">Stack Overflow</a> works like this: let's hook on a program before <b>execve</b>() exits and overwrite the Auxiliary Vector AT_SYSINFO_EHDR parameter. Here's the <a href="https://github.com/danteu/novdso/blob/master/novdso.c">novdso.c</a> code. However, the linked code doesn't quite work for me (one too many <b>kill(SIGSTOP)</b>), and it has a bigger, fundamental flaw: to hook on <b>execve()</b> it uses <b>ptrace()</b>, and therefore doesn't work under our strace!</p>
            <pre><code>$ strace -f ./novdso ping 1.1 -c1 -n
...
[pid 69316] ptrace(PTRACE_TRACEME)  	= -1 EPERM (Operation not permitted)</code></pre>
            <p>While this technique of rewriting AT_SYSINFO_EHDR is pretty cool, it won't work for us. (I wonder if there is another way of doing it without ptrace. Maybe with some BPF? But that is another story.)</p><p>A second technique is to use <b>LD_PRELOAD</b> to preload a trivial library overriding the functions in question and forcing them to go to the slow, real syscalls. This works fine:</p>
            <pre><code>$ cat vdso_override.c
#include &lt;sys/syscall.h&gt;
#include &lt;sys/time.h&gt;
#include &lt;time.h&gt;
#include &lt;unistd.h&gt;

int gettimeofday(struct timeval *restrict tv, void *restrict tz) {
	return syscall(__NR_gettimeofday, (long)tv, (long)tz, 0, 0, 0, 0);
}

time_t time(time_t *tloc) {
	return syscall(__NR_time, (long)tloc, 0, 0, 0, 0, 0);
}

int clock_gettime(clockid_t clockid, struct timespec *tp) {
	return syscall(__NR_clock_gettime, (long)clockid, (long)tp, 0, 0, 0, 0);
}</code></pre>
            <p>To load it:</p>
            <pre><code>$ gcc -Wall -Wextra -fpic -shared -o vdso_override.so vdso_override.c

$ LD_PRELOAD=./vdso_override.so \
       strace -e trace=gettimeofday,clock_gettime,time \
       date

clock_gettime(CLOCK_REALTIME, {tv_sec=1688656245 ...}) = 0
Thu Jul  6 05:10:45 PM CEST 2023
+++ exited with 0 +++</code></pre>
            <p>Hurray! We can see the <b>clock_gettime</b> call in <b>strace</b> output. Surely we'll also see <b>gettimeofday</b> from our <b>ping</b>, right?</p><p>Not so fast, it still doesn't quite work:</p>
            <pre><code>$ LD_PRELOAD=./vdso_override.so \
     strace -c -e trace=gettimeofday,time,clock_gettime -f \
     ping -n -c1 1.1 &gt;/dev/null
... nil ...</code></pre>
            
    <div>
      <h3>To suid or not to suid</h3>
      <a href="#to-suid-or-not-to-suid">
        
      </a>
    </div>
    <p>I forgot that <b>ping</b> might need special permissions to read and write raw packets. Historically it had the <b>suid</b> bit set, which granted the program an elevated user identity. However, LD_PRELOAD doesn't work with suid. When a program is loaded, the <a href="https://github.com/bminor/musl/blob/718f363bc2067b6487900eddc9180c84e7739f80/ldso/dynlink.c#L1820">dynamic linker checks if it has the <b>suid</b> bit</a>, and if so, it ignores the LD_PRELOAD and LD_LIBRARY_PATH settings.</p><p>However, does <b>ping</b> need suid? Nowadays it's totally possible to send and receive ICMP Echo messages without any extra privileges, like this:</p>
            <pre><code>from socket import *
import struct

sd = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP)
sd.connect(('1.1', 0))

sd.send(struct.pack("!BBHHH10s", 8, 0, 0, 0, 1234, b'payload'))
data = sd.recv(1024)
print('type=%d code=%d csum=0x%x id=%d seq=%d payload=%s' % struct.unpack_from("!BBHHH10s", data))</code></pre>
            <p>Now you know how to write "ping" in eight lines of Python. This Linux API is known as the <b>ping socket</b>. It generally works on modern Linux; however, it requires the right sysctl, which is typically enabled:</p>
            <pre><code>$ sysctl net.ipv4.ping_group_range
net.ipv4.ping_group_range = 0    2147483647</code></pre>
            <p>The <b>ping socket</b> is not as mature as UDP or TCP sockets. The "ICMP ID" field is used to dispatch an ICMP Echo Reply to the appropriate socket, but when using <b>bind()</b> this property is settable by the user without any checks. A malicious user can deliberately cause an "ICMP ID" conflict.</p><p>But we're not here to discuss Linux networking APIs. We're here to discuss the <b>ping</b> utility and indeed, it uses <b>ping sockets</b>:</p>
            <pre><code>$ strace -e trace=socket -f ping 1.1 -nUc 1
socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) = 3
socket(AF_INET6, SOCK_DGRAM, IPPROTO_ICMPV6) = 4</code></pre>
            <p>Ping sockets are rootless, and <b>ping</b>, at least on my laptop, is not a suid program:</p>
            <pre><code>$ ls -lah `which ping`
-rwxr-xr-x 1 root root 75K Feb  5  2022 /usr/bin/ping</code></pre>
            <p>So why doesn't LD_PRELOAD work? It turns out the <b>ping</b> binary holds the CAP_NET_RAW capability. Similarly to suid, this prevents the library preloading machinery from working:</p>
            <pre><code>$ getcap `which ping`
/usr/bin/ping cap_net_raw=ep</code></pre>
            <p>I think this capability is enabled only to handle the case of a misconfigured <b>net.ipv4.ping_group_range</b> sysctl. For me ping works perfectly fine without this capability.</p>
    <div>
      <h3>Rootless is perfectly fine</h3>
      <a href="#rootless-is-perfectly-fine">
        
      </a>
    </div>
    <p>Let's remove CAP_NET_RAW and try the LD_PRELOAD hack again:</p>
            <pre><code>$ cp `which ping` .

$ LD_PRELOAD=./vdso_override.so strace -f ./ping -n -c1 1.1
...
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
gettimeofday({tv_sec= ... ) = 0
sendto(3, ...)
setitimer(ITIMER_REAL, {it_value={tv_sec=10}}, NULL) = 0
recvmsg(3, { ... cmsg_level=SOL_SOCKET, 
                 cmsg_type=SO_TIMESTAMP_OLD, 
                 cmsg_data={tv_sec=...}}, )</code></pre>
            <p>We finally made it! Without -U, in the "network timestamp" mode, <b>ping</b>:</p><ul><li><p>Sets SO_TIMESTAMP flag on a socket.</p></li><li><p>Calls <b>gettimeofday</b> before sending the packet.</p></li><li><p>When fetching a packet, gets the timestamp from the <b>CMSG</b>.</p></li></ul>
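            <p>For illustration, here's what the SO_TIMESTAMP mechanism looks like from userspace. This is a minimal sketch using an unprivileged UDP socket on loopback instead of a ping socket, and it assumes Linux on a 64-bit machine, where SO_TIMESTAMP (a.k.a. SO_TIMESTAMP_OLD) is socket option 29 and the timestamp arrives as a struct timeval:</p>

```python
import socket
import struct

# Linux SO_TIMESTAMP / SO_TIMESTAMP_OLD; Python may not export the constant.
SO_TIMESTAMP = getattr(socket, "SO_TIMESTAMP", 29)

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMP, 1)  # ask for receive timestamps
rx.settimeout(2)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"x", rx.getsockname())

data, ancdata, flags, addr = rx.recvmsg(64, 1024)
sec = usec = 0
for level, ctype, cdata in ancdata:
    if level == socket.SOL_SOCKET and ctype == SO_TIMESTAMP:
        # cdata is a struct timeval: two native longs, 16 bytes on 64-bit
        sec, usec = struct.unpack("@qq", cdata[:16])
print("kernel rx timestamp: %d.%06d" % (sec, usec))
```

            <p>The printed timestamp comes from the same non-monotonic system clock that <b>gettimeofday</b> reads, which is exactly why ping can be confused by clock adjustments.</p>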
    <div>
      <h3>Fault injection - fooling ping</h3>
      <a href="#fault-injection-fooling-ping">
        
      </a>
    </div>
    <p>With <b>strace</b> up and running we can finally do something interesting. You see, <b>strace</b> has a little known fault injection feature, named <a href="https://man7.org/linux/man-pages/man1/strace.1.html">"tampering" in the manual</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EpzqSkXe9BCttqJajgXpi/79d3f4051066ab7e8fa88795c2c33e58/image3.png" />
            
            </figure><p>With a couple of command line parameters we can overwrite the result of the <b>gettimeofday</b> call. I want to set it forward to confuse ping into thinking the SO_TIMESTAMP time is in the past:</p>
            <pre><code>LD_PRELOAD=./vdso_override.so \
    strace -o /dev/null -e trace=gettimeofday \
            -e inject=gettimeofday:poke_exit=@arg1=ff:when=1 -f \
    ./ping -c 1 -n 1.1.1.1

PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
./ping: Warning: time of day goes back (-59995290us), taking countermeasures
./ping: Warning: time of day goes back (-59995104us), taking countermeasures
64 bytes from 1.1.1.1: icmp_seq=1 ttl=60 time=0.000 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms</code></pre>
            <p>It worked! We can now generate the "taking countermeasures" message reliably!</p><p>While we can cheat on the <b>gettimeofday</b> result, with <b>strace</b> it's impossible to overwrite the CMSG timestamp. Perhaps it would be possible to adjust the CMSG timestamp with Linux <a href="https://man7.org/linux/man-pages/man7/time_namespaces.7.html">time namespaces</a>, but I don't think it'll work. As far as I understand, time namespaces are not taken into account by the network stack. A program using SO_TIMESTAMP is doomed to compare it against the system clock, which might be rolled backwards.</p>
    <div>
      <h3>Fool me once, fool me twice</h3>
      <a href="#fool-me-once-fool-me-twice">
        
      </a>
    </div>
    <p>At this point we could conclude our investigation. We're now able to reliably trigger the "taking countermeasures" message using strace fault injection.</p><p>There is one more thing though. When sending ICMP Echo Request messages, does <b>ping</b> <b>remember</b> the send timestamp in some kind of hash table? That might be wasteful considering a long-running ping sending thousands of packets.</p><p>Ping is smart, and instead puts the timestamp in the ICMP Echo Request <b>packet payload</b>!</p><p>Here's how the full algorithm works:</p><ol><li><p>Ping sets the SO_TIMESTAMP_OLD socket option to receive timestamps.</p></li><li><p>It looks at the wall clock with <b>gettimeofday</b>.</p></li><li><p>It puts the current timestamp in the first bytes of the ICMP payload.</p></li><li><p>After receiving the ICMP Echo Reply packet, it inspects the two timestamps: the send timestamp from the payload and the receive timestamp from CMSG.</p></li><li><p>It calculates the RTT delta.</p></li></ol><p>This is pretty neat! With this algorithm, ping doesn't need to remember much, and can have an unlimited number of packets in flight! (For completeness, ping maintains a small fixed-size bitmap to account for DUP! packets.)</p><p>What if we set the packet length to less than 16 bytes? Let's see:</p>
            <pre><code>$ ping 1.1 -c2 -s0
PING 1.1 (1.0.0.1) 0(28) bytes of data.
8 bytes from 1.0.0.1: icmp_seq=1 ttl=60
8 bytes from 1.0.0.1: icmp_seq=2 ttl=60
--- 1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms</code></pre>
            <p>In such a case ping just skips the RTT in the output. Smart!</p><p>Right... this opens two completely new subjects. While ping was written back when everyone was friendly, today’s Internet can have rogue actors. What if we spoofed responses to confuse ping? Can we cut the payload to prevent ping from producing an RTT, or spoof the timestamp to fool the RTT measurements?</p><p>Both things work! The truncated case will look like this to the sender:</p>
            <pre><code>$ ping 139.162.188.91
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
8 bytes from 139.162.188.91: icmp_seq=1 ttl=53 (truncated)</code></pre>
            <p>The second case, of an overwritten timestamp, is even cooler. We can move the timestamp forward, causing ping to show our favorite "taking countermeasures" message:</p>
            <pre><code>$ ping 139.162.188.91  -c 2 -n
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
./ping: Warning: time of day goes back (-1677721599919015us), taking countermeasures
./ping: Warning: time of day goes back (-1677721599918907us), taking countermeasures
64 bytes from 139.162.188.91: icmp_seq=1 ttl=53 time=0.000 ms
./ping: Warning: time of day goes back (-1677721599905149us), taking countermeasures
64 bytes from 139.162.188.91: icmp_seq=2 ttl=53 time=0.000 ms

--- 139.162.188.91 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms</code></pre>
            <p>Alternatively, we can move the time in the packet backwards, causing <a href="https://github.com/iputils/iputils/issues/480">ping to show nonsensical RTT values</a>:</p>
            <pre><code>$ ./ping 139.162.188.91  -c 2 -n
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
64 bytes from 139.162.188.91: icmp_seq=1 ttl=53 time=1677721600430 ms
64 bytes from 139.162.188.91: icmp_seq=2 ttl=53 time=1677721600084 ms

--- 139.162.188.91 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 1677721600084.349/1677721600257.351/1677721600430.354/-9223372036854775.-808 ms</code></pre>
            <p>We proved that the "countermeasures" work only when time moves in one direction. In the other direction, ping is simply fooled.</p><p>Here's a rough scapy snippet that generates spoofed ICMP Echo Reply packets to fool ping:</p>
            <pre><code># iptables -I INPUT -i eth0 -p icmp --icmp-type=8 -j DROP
import scapy.all as scapy
import struct

iface = "eth0"  # adjust to the interface facing the pinging host

def custom_action(echo_req):
    try:
        payload = bytes(echo_req[scapy.ICMP].payload)
        if len(payload) &gt;= 8:
            # Tweak the timestamp ping embedded in the payload,
            # skewing its RTT computation.
            ts, tu = struct.unpack_from("&lt;II", payload)
            payload = struct.pack("&lt;II", (ts - 0x64000000) &amp; 0xffffffff, tu) \
                      + payload[8:]

        echo_reply = scapy.IP(
            dst=echo_req[scapy.IP].src,
            src=echo_req[scapy.IP].dst,
        ) / scapy.ICMP(type=0, code=0,
                       id=echo_req[scapy.ICMP].id,
                       seq=echo_req.payload.seq,
        ) / payload
        scapy.send(echo_reply, iface=iface)
    except Exception:
        pass

scapy.sniff(filter="icmp and icmp[0] = 8", iface=iface, prn=custom_action)</code></pre>
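            <p>To recap the logic we've been exploiting, ping's RTT computation and the countermeasures check can be modeled roughly as below. This is a simplified sketch with a made-up function name; real ping operates on struct timeval values:</p>

```python
def rtt_us(send_tv, recv_tv):
    """RTT in microseconds from two (sec, usec) wall-clock readings:
    the send time carried in the ICMP payload and the receive time
    reported by the kernel via the SO_TIMESTAMP CMSG."""
    (ss, su), (rs, ru) = send_tv, recv_tv
    delta = (rs - ss) * 1_000_000 + (ru - su)
    if delta < 0:
        # The wall clock moved backwards between send and receive.
        # This is when ping warns "time of day goes back (...us),
        # taking countermeasures" and discards the measurement.
        return None
    return delta

print(rtt_us((1000, 0), (1000, 250)))     # a normal 250 us RTT
print(rtt_us((1000, 0), (999, 999_900)))  # clock rolled back: None
```

            <p>Note that the check only fires for a negative delta; a timestamp moved backwards inside the packet merely inflates the RTT, as we saw above.</p>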
            
    <div>
      <h3>Leap second</h3>
      <a href="#leap-second">
        
      </a>
    </div>
    <p>In practice, how often does time change on a computer? The <b>NTP</b> daemon adjusts the clock all the time to account for any drift. However, these are very small changes. Apart from initial clock synchronization after boot or sleep wakeup, big clock shifts shouldn't really happen.</p><p>There are exceptions as usual. Systems that operate in virtual environments or have unreliable Internet connections often experience their clocks getting out of sync.</p><p>One notable case that affects all computers is a coordinated clock adjustment called a <a href="https://en.wikipedia.org/wiki/Leap_second">leap second</a>. It causes the clock to move backwards, which is particularly troublesome. An issue with handling leap second <a href="/how-and-why-the-leap-second-affected-cloudflare-dns/">caused our engineers a headache in late 2016</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4iGPlBeYYXm0RYyxhEmOMQ/20b49ba0ad47eeb0404610e710bd4226/Screenshot-2023-07-11-at-13.33.02.png" />
            
            </figure><p>Leap seconds often cause issues, so the current consensus is to <a href="https://www.nytimes.com/2022/11/19/science/time-leap-second-bipm.html">deprecate them by 2035</a>. However, <a href="https://en.wikipedia.org/wiki/Leap_second#International_proposals_for_elimination_of_leap_seconds">according to Wikipedia</a> the solution seems to be to just kick the can down the road:</p><blockquote><p><i>A suggested possible future measure would be to let the discrepancy increase to a full minute, which would take 50 to 100 years, and then have the last minute of the day taking two minutes in a "kind of smear" with no discontinuity.</i></p></blockquote><p>In any case, there hasn't been a leap second since 2016; there might be more in the future, but there likely won't be any after 2035. Many environments already use a <a href="https://cloudplatform.googleblog.com/2015/05/Got-a-second-A-leap-second-that-is-Be-ready-for-June-30th.html">leap second smear</a> to avoid the problem of the clock jumping back.</p><p>In most cases, it might be completely fine to ignore clock changes. When possible, use CLOCK_MONOTONIC to count time durations; it is bulletproof.</p><p>We haven't mentioned <a href="https://en.wikipedia.org/wiki/Daylight_saving_time">daylight savings</a> clock adjustments here because, from a computer's perspective, they are not real clock changes! Most often programmers deal with the operating system clock, which is typically set to the <a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC timezone</a>. The DST offset is taken into account only when pretty-printing the date on screen; the underlying software operates on integer values. Let's consider an example of two timestamps which, in my <a href="https://devblogs.microsoft.com/oldnewthing/20061027-00/?p=29213">Warsaw timezone</a>, appear to fall under two different DST offsets. While it may look like the clock rolled back, this is just a user interface illusion. The integer timestamps are sequential:</p>
            <pre><code>$ date --date=@$[1698541199+0]
Sun Oct 29 02:59:59 AM CEST 2023

$ date --date=@$[1698541199+1]
Sun Oct 29 02:00:00 AM CET 2023</code></pre>
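            <p>The same illusion can be reproduced in Python, assuming the system's tzdata/zoneinfo database includes the Europe/Warsaw zone:</p>

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("Europe/Warsaw")
for ts in (1698541199, 1698541200):
    # Two consecutive integer timestamps, printed as local wall-clock time.
    print(ts, datetime.fromtimestamp(ts, tz).isoformat())
```

            <p>The printed wall clock jumps from 02:59:59+02:00 back to 02:00:00+01:00, while the underlying integers simply increment.</p>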
            
    <div>
      <h3>Lessons</h3>
      <a href="#lessons">
        
      </a>
    </div>
    <p>Arguably, the clock jumping backwards is a rare occurrence. It's very hard to test for such cases, and I was surprised to find that <b>ping</b> made such an attempt. To avoid the problem, ping could measure latency using CLOCK_MONOTONIC; its developers already <a href="https://github.com/iputils/iputils/commit/4fd276cd8211c502cb87c5db0ce15cd685177216">use this time source in another place</a>.</p><p>Unfortunately, this won't quite work here. Ping needs to compare the send timestamp to the receive timestamp from the SO_TIMESTAMP CMSG, which uses the non-monotonic system clock. Linux APIs are sometimes limited, and dealing with time is hard. For the time being, clock adjustments will continue to confuse ping.</p><p>In any case, now we know what to do when <b>ping</b> is "<b>taking countermeasures</b>"! Pull down your periscope and check the <b>NTP</b> daemon status!</p> ]]></content:encoded>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">6M6gXK7v29wVHnBMZMYRpS</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare servers don't own IPs anymore – so how do they connect to the Internet?]]></title>
            <link>https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/</link>
            <pubDate>Fri, 25 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog we'll discuss how we manage Cloudflare IP addresses
used to retrieve the data from the Internet, how our egress
network design has evolved, how we optimized it for best use
of available IP space and introduce our soft-anycast technology ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/27jjBnvTr1dvuTNt8baVMj/5244e0d8f89ffd15d4bf51a15e646625/image11-3.png" />
            
            </figure><p>A lot of Cloudflare's technology is well documented. For example, how we handle traffic between the eyeballs (clients) and our servers has been discussed many times on this blog: “<a href="/a-brief-anycast-primer/">A brief primer on anycast (2011)</a>”, "<a href="/cloudflares-architecture-eliminating-single-p/">Load Balancing without Load Balancers (2013)</a>", "<a href="/path-mtu-discovery-in-practice/">Path MTU discovery in practice (2015)</a>",  "<a href="/unimog-cloudflares-edge-load-balancer/">Cloudflare's edge load balancer (2020)</a>", "<a href="/tubular-fixing-the-socket-api-with-ebpf/">How we fixed the BSD socket API (2022)</a>".</p><p>However, we have rarely talked about the second part of our networking setup — how our servers fetch the content from the Internet. In this blog we’re going to cover this gap. We'll discuss how we manage Cloudflare IP addresses used to retrieve the data from the Internet, how our egress network design has evolved and how we optimized it for best use of available IP space.</p><p>Brace yourself. We have a lot to cover.</p>
    <div>
      <h3>Terminology first!</h3>
      <a href="#terminology-first">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/01iopATg1CQGchxawLCk1U/9d8582351fed8929930c18c78321f194/image8-4.png" />
            
            </figure><p>Each Cloudflare server deals with many kinds of networking traffic, but two rough categories stand out:</p><ul><li><p><i>Internet sourced traffic</i> - Inbound connections initiated by eyeballs to our servers. In the context of this blog post we'll call these "<b>ingress</b> connections".</p></li><li><p><i>Cloudflare sourced traffic</i> - Outgoing connections initiated by our servers to other hosts on the Internet. For brevity, we'll call these "<b>egress</b> connections".</p></li></ul><p>The egress part, while rarely discussed on this blog, is critical for our operation. Our servers must initiate outgoing connections to get their jobs done! Like:</p><ul><li><p>In our CDN product, before the content is cached, it's fetched from the origin servers. See "<a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/">Pingora, the proxy that connects Cloudflare to the Internet (2022)</a>", <a href="/argo-v2/">Argo</a> and <a href="/tiered-cache-smart-topology/">Tiered Cache</a>.</p></li><li><p>For the <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a> product, each ingress TCP connection results in one egress connection.</p></li><li><p><a href="https://workers.cloudflare.com/">Workers</a> often run multiple subrequests to construct an HTTP response. Some of them might be querying servers on the Internet.</p></li><li><p>We also operate client-facing forward proxy products - like WARP and Cloudflare Gateway. These proxies deal with eyeball connections destined to the Internet. Our servers need to establish connections to the Internet on behalf of our users.</p></li></ul><p>And so on.</p>
    <div>
      <h3>Anycast on ingress, unicast on egress</h3>
      <a href="#anycast-on-ingress-unicast-on-egress">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WOoxT0QUqaGhBwa0lSgRu/15aba7b2dbf84e647d84dfa3a5adeeb2/image9-3.png" />
            
            </figure><p>Our ingress network architecture is very different from the egress one. On ingress, the connections sourced from the Internet are handled exclusively by our anycast IP ranges. Anycast is a technology where each of our data centers "announces" and can handle the same IP ranges. With many destinations possible, how does the Internet know where to route the packets? Well, the eyeball packets are routed towards the closest data center based on Internet BGP metrics; often it's also geographically the closest one. Usually, the BGP routes don't change much, and each eyeball IP can be expected to be routed to a single data center.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7jiVD7EKAKmWRoIN6FlrMz/f7b12c24fcf94a56e44638f3f74f955a/image10-2.png" />
            
            </figure><p>However, while anycast works well in the ingress direction, it can't operate on egress. Establishing an outgoing connection from an anycast IP won't work. Consider the response packet. It's likely to be routed back to a wrong place - a data center geographically closest to the sender, not necessarily the source data center!</p><p>For this reason, until recently, we established outgoing connections in a straightforward and conventional way: each server was given its own unicast IP address. "Unicast IP" means there is only one server using that address in the world. Return packets will work just fine and get back exactly to the right server identified by the unicast IP.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GWAHkJLOKz7tR9rIgwfgG/5b7e68b8e02942697d7de1fbf6d74b0e/image5-16.png" />
            
            </figure>
    <div>
      <h3>Segmenting traffic based on egress IP</h3>
      <a href="#segmenting-traffic-based-on-egress-ip">
        
      </a>
    </div>
    <p>Originally connections sourced by Cloudflare were mostly HTTP fetches going to origin servers on the Internet. As our product line grew, so did the variety of traffic. The most notable example is <a href="/1111-warp-better-vpn/">our WARP app</a>. For WARP, our servers operate a forward proxy, and handle the traffic sourced by end-user devices. It's done without the same degree of intermediation as in our <a href="https://www.cloudflare.com/application-services/products/cdn/">CDN product</a>. This creates a problem. Third party servers on the Internet — like the origin servers — must be able to distinguish between connections coming from Cloudflare services and our WARP users. Such traffic segmentation is traditionally done by using different IP ranges for different traffic types (although recently we introduced more robust techniques like <a href="https://developers.cloudflare.com/ssl/origin-configuration/authenticated-origin-pull/">Authenticated Origin Pulls</a>).</p><p>To work around the trusted vs untrusted traffic pool differentiation problem, we added an untrusted WARP IP address to each of our servers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hudtFCXBWUe1aRFIfcH7s/47bca3701568a8562029e41334d79692/image4-30.png" />
            
            </figure>
    <div>
      <h3>Country tagged egress IP addresses</h3>
      <a href="#country-tagged-egress-ip-addresses">
        
      </a>
    </div>
    <p>It quickly became apparent that trusted vs untrusted weren't the only tags needed. For the WARP service we also need country tags. For example, United Kingdom based WARP users expect the bbc.com website to just work. However, the BBC restricts many of its services to people in the UK.</p><p>It does this by <i>geofencing</i> — using a database mapping public IP addresses to countries, and allowing only the UK ones. Geofencing is widespread on today's Internet. To avoid geofencing issues, we need to choose specific egress addresses tagged with an appropriate country, depending on the WARP user's location. Like many other parties on the Internet, we tag our egress IP space with country codes and publish it as a geofeed (like <a href="https://mask-api.icloud.com/egress-ip-ranges.csv">this one</a>). Note that the published geofeed is just data. The fact that an IP is tagged as, say, UK does not mean it is served from the UK; it just means the operator wants it to be geolocated to the UK. Like many things on the Internet, it is based on trust.</p><p>Notice that at this point we have three independent geographical tags:</p><ul><li><p>the country tag of the WARP user - the eyeball connecting IP</p></li><li><p>the location of the data center the eyeball connected to</p></li><li><p>the country tag of the egressing IP</p></li></ul><p>For best service, we want to choose the egressing IP so that its country tag matches the country of the eyeball IP. But egressing from a specific country tagged IP is challenging: our data centers serve users from all over the world, potentially from many countries! Remember: due to anycast we don't directly control the ingress routing. Internet geography doesn't always match physical geography. For example, our London data center receives traffic not only from users in the United Kingdom, but also from Ireland and Saudi Arabia. As a result, our servers in London need many WARP egress addresses associated with many countries:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6rcN3OCsVAvpWBiER4JHgI/6edbc6ec71af73c39d13c52e8e50b4e0/image2-52.png" />
            
            </figure><p>Can you see where this is going? The problem space just explodes! Instead of having one or two egress IP addresses for each server, now we require dozens, and <a href="/amazon-2bn-ipv4-tax-how-avoid-paying">IPv4 addresses aren't cheap</a>. With this design, we need many addresses per server, and we operate thousands of servers. This architecture becomes very expensive.</p>
    <div>
      <h3>Is anycast a problem?</h3>
      <a href="#is-anycast-a-problem">
        
      </a>
    </div>
    <p>Let me recap: with anycast ingress we don't control which data center the user is routed to. Therefore, each of our data centers must be able to egress from an address with any conceivable tag. Inside the data center we also don't control which server the connection is routed to. There are potentially many tags, many data centers, and many servers inside a data center.</p><p>Maybe the problem is the ingress architecture? Perhaps it's better to use a traditional networking design where a specific eyeball is routed with <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> to a specific data center, or even a server?</p><p>That's one way of thinking, but we decided against it. We like our anycast on ingress. It brings us many advantages:</p><ul><li><p><b>Performance</b>: with anycast, by definition, the eyeball is routed to the closest (by BGP metrics) data center. This is usually the fastest data center for a given user.</p></li><li><p><b>Automatic failover</b>: if one of our data centers becomes unavailable, the traffic will be instantly, automatically re-routed to the next best place.</p></li><li><p><b>DDoS resilience</b>: during a <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">denial of service attack</a> or a traffic spike, the load is automatically balanced across many data centers, significantly reducing the impact.</p></li><li><p><b>Uniform software</b>: The functionality of every data center and of every server inside a data center is identical. We use the same software stack on all the servers around the world. Each machine can perform any action, for any product. This enables easy debugging and good scalability.</p></li></ul><p>For these reasons we'd like to keep the anycast on ingress. We decided to solve the issue of egress address cardinality in some other way.</p>
    <div>
      <h3>Solving a million dollar problem</h3>
      <a href="#solving-a-million-dollar-problem">
        
      </a>
    </div>
    <p>Out of the thousands of servers we operate, every single one should be able to use an egress IP with any of the possible tags. It's easiest to explain our solution by first showing two extreme designs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UoCyD5rIsCDc6nAMY1jvp/646cd025de3707b60b205682f3b140c8/image6-10.png" />
            
            </figure><p><b>Each server owns all the needed IPs:</b> each server has all the specialized egress IPs with the needed tags.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3s0xwHrfjPQpusPeq9HTtX/3338238e28e89abcaa5b4fc3eff2cc09/image12-1.png" />
            
            </figure><p><b>One server owns the needed IP:</b> a specialized egress IP with a specific tag lives in one place, other servers forward traffic to it.</p><p>Both options have pros and cons:</p><table>
<thead>
  <tr>
    <th>Specialized IP on every server</th>
    <th>Specialized IP on one server</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Super expensive $$$, every server needs many IP addresses.</td>
    <td>Cheap $, only one specialized IP needed for a tag.</td>
  </tr>
  <tr>
    <td>Egress always local - fast</td>
    <td>Egress almost always forwarded - slow</td>
  </tr>
  <tr>
    <td>Excellent reliability - every server is independent</td>
    <td>Poor reliability - introduced chokepoints</td>
  </tr>
</tbody>
</table>
    <div>
      <h3>There's a third way</h3>
      <a href="#theres-a-third-way">
        
      </a>
    </div>
    <p>We've been thinking hard about this problem. Frankly, the first extreme option of having every needed IP available locally on every Cloudflare server is not totally unworkable. This is, roughly, what we were able to pull off for IPv6. With IPv6, access to the large needed IP space is not a problem.</p><p>However, in IPv4 neither option is acceptable. The first offers fast and reliable egress, but comes at great cost — the IPv4 addresses needed are expensive. The second option uses the smallest possible IP space, so it's cheap, but compromises on performance and reliability.</p><p>The solution we devised is a compromise between the extremes. The rough idea is to change the assignment unit. Instead of assigning one /32 IPv4 address to each server, we devised a method of assigning a /32 IP per data center, and then sharing it among physical servers.</p><table>
<thead>
  <tr>
    <th>Specialized IP on every server</th>
    <th>Specialized IP per data center</th>
    <th>Specialized IP on one server</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Super expensive $$$</td>
    <td>Reasonably priced $$</td>
    <td>Cheap $</td>
  </tr>
  <tr>
    <td>Egress always local - fast</td>
    <td>Egress always local - fast</td>
    <td>Egress almost always forwarded - slow</td>
  </tr>
  <tr>
    <td>Excellent reliability - every server is independent</td>
    <td>Excellent reliability - every server is independent</td>
    <td>Poor reliability - many choke points</td>
  </tr>
</tbody>
</table>
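    <p>Back-of-the-envelope arithmetic shows why the assignment unit matters. The fleet sizes below are made up purely for illustration, not Cloudflare's real numbers:</p>

```python
# Hypothetical numbers, for illustration only.
servers = 10_000
data_centers = 300
tags = 50  # country/trust tags needed at each location

per_server = servers * tags    # specialized IP on every server
per_dc = data_centers * tags   # specialized IP per data center
single = tags                  # specialized IP on one server only

print(per_server, per_dc, single)
```

    <p>Even with these toy numbers, the per-data-center unit cuts the required IPv4 space by more than an order of magnitude while keeping egress local.</p>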
    <div>
      <h3>Sharing an IP inside data center</h3>
      <a href="#sharing-an-ip-inside-data-center">
        
      </a>
    </div>
    <p>The idea of sharing an IP among servers is not new. Traditionally this can be achieved with Source-NAT on a router. Sadly, the sheer number of egress IPs we need and the size of our operation prevent us from relying on a stateful firewall / NAT at the router level. We also dislike shared state, so we're not fans of distributed NAT installations.</p><p>What we chose instead is splitting an egress IP across servers by <b>a port range</b>. For a given egress IP, each server owns a small portion of the available source ports - a port slice.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ruhT8POdI4Vd3yDd4WIrh/5204f34fbe65306902cb0b12216cedd4/image1-68.png" />
            
            </figure><p>When return packets arrive from the Internet, we have to route them back to the correct machine. For this task we've customized "Unimog" - our L4 XDP-based load balancer - ("<a href="/unimog-cloudflares-edge-load-balancer/">Unimog, Cloudflare's load balancer (2020)</a>") and it's working flawlessly.</p><p>With a port slice of, say, 2,048 ports, we can share one IP among 31 servers. However, there is always a possibility of running out of ports. To address this, we've worked hard to be able to reuse the egress ports efficiently. See "<a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">How to stop running out of ports (2022)</a>", "<a href="https://lpc.events/event/16/contributions/1349/">How to share IPv4 addresses (2022)</a>" and our <a href="https://cloudflare.tv/event/oZKxMJg4">Cloudflare.TV segment</a>.</p><p>This is pretty much it. Each server is aware of which IP addresses and port slices it owns. For inbound routing, Unimog inspects the ports and dispatches the packets to the appropriate machines.</p>
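To make the port-slice arithmetic concrete, here is a toy sketch in Python. This is not Cloudflare's actual code; the slice size and the reserved low-port range are illustrative assumptions:

```python
SLICE = 2048     # assumed number of source ports per server slice
RESERVED = 2048  # assume the lowest 2,048 ports are reserved and never sliced

def owner_server(src_port: int) -> int:
    """Map a source port on a shared egress IP to the index of the server
    owning that port slice."""
    if src_port < RESERVED:
        raise ValueError("port is in the reserved range, not part of any slice")
    return (src_port - RESERVED) // SLICE

# (65536 - 2048) / 2048 = 31 servers can share a single egress IP.
print(owner_server(10000))  # → 3
```

Inbound dispatch then reduces to this integer division: the load balancer looks at the destination port of a return packet and forwards it to the slice owner.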
    <div>
      <h3>Sharing a subnet between data centers</h3>
      <a href="#sharing-a-subnet-between-data-centers">
        
      </a>
    </div>
    <p>This is not the end of the story though: we haven't discussed how we can route a single /32 address into a data center. Traditionally, in the public Internet, it's only possible to route subnets with a granularity of /24, or 256 IP addresses. In our case this would lead to a great waste of IP space.</p><p>To solve this problem and improve the utilization of our IP space, we deployed our egress ranges as... <b>anycast</b>! With that in place, we customized Unimog and taught it to forward the packets over our <a href="/cloudflare-backbone-internet-fast-lane/">backbone network</a> to the right data center. Unimog maintains a database like this:</p>
            <pre><code>198.51.100.1 - forward to LHR
198.51.100.2 - forward to CDG
198.51.100.3 - forward to MAN
...</code></pre>
            <p>With this design, it doesn't matter to which data center return packets are delivered. Unimog can always fix it and forward the data to the right place. Basically, while at the <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/">BGP layer</a> we are using anycast, by design an IP semantically identifies a data center, and an IP plus a port range identifies a specific machine. It behaves almost like unicast.</p>
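As a toy illustration of that semantics, here is a hypothetical two-step lookup. The data mirrors the example database above; the slice size is assumed, and the reserved low-port range is ignored for brevity:

```python
# Hypothetical forwarding database, one entry per egress IP.
DC_FOR_IP = {
    "198.51.100.1": "LHR",
    "198.51.100.2": "CDG",
    "198.51.100.3": "MAN",
}

def route_return_packet(dst_ip: str, dst_port: int, slice_size: int = 2048):
    """Soft-unicast dispatch: the IP selects the data center,
    the port slice selects the machine inside it."""
    dc = DC_FOR_IP[dst_ip]           # forward over the backbone to this DC
    server = dst_port // slice_size  # the port slice identifies the server
    return dc, server

print(route_return_packet("198.51.100.2", 5000))  # → ('CDG', 2)
```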
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1aQ9FLJSpyr3OjcEynnyse/0274312eac4adaa6f391e7e48d3d6e3c/image7-6.png" />
            
            </figure><p>We call this technology stack "<b>soft-unicast</b>" and it feels magical. It's like we did unicast in software over anycast in the BGP layer.</p>
    <div>
      <h3>Soft-unicast is indistinguishable from magic</h3>
      <a href="#soft-unicast-is-indistinguishable-from-magic">
        
      </a>
    </div>
    <p>With this setup we can achieve significant benefits:</p><ul><li><p>We are able to share a /32 egress IP amongst <b>many servers</b>.</p></li><li><p>We can spread a single subnet across <b>many data centers</b>, and change it easily on the fly. This allows us to fully use our egress IPv4 ranges.</p></li><li><p>We can <b>group similar IP addresses</b> together. For example, all the IP addresses tagged with the "UK" tag might form a single continuous range. This reduces the size of the published geofeed.</p></li><li><p>It's easy for us to <b>onboard new egress IP ranges</b>, like customer IPs. This is useful for some of our products, like <a href="https://www.cloudflare.com/products/zero-trust/">Cloudflare Zero Trust</a>.</p></li></ul><p>All this is done at a sensible cost, with no loss of performance or reliability:</p><ul><li><p>Typically, the user is able to egress directly from the closest data center, providing the <b>best possible performance</b>.</p></li><li><p>Depending on the actual needs, we can allocate or release IP addresses. This gives us <b>flexibility with the IP</b> cost management: we don't need to overspend upfront.</p></li><li><p>Since we operate multiple egress IP addresses in different locations, the <b>reliability is not compromised</b>.</p></li></ul>
    <div>
      <h3>The true location of our IP addresses is: “the cloud”</h3>
      <a href="#the-true-location-of-our-ip-addresses-is-the-cloud">
        
      </a>
    </div>
    <p>While soft-unicast allows us to gain great efficiency, we've hit some issues. Sometimes we get the question: "Where does this IP physically exist?" But it doesn't have an answer! Our egress IPs don't exist physically anywhere. From a BGP standpoint our egress ranges are anycast, so they live everywhere. Logically, each address is used in one data center at a time, but we can move it around on demand.</p>
    <div>
      <h3>Content Delivery Networks misdirect users</h3>
      <a href="#content-delivery-networks-misdirect-users">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sCk5BkrFFweDjmy5zQnHs/058bef00f049ab0445b59d70a422f5d0/image3-43.png" />
            
            </figure><p>As another example of problems, here's one issue we've hit with third-party CDNs. As we mentioned before, there are three country tags in our pipeline:</p><ul><li><p>The country tag of the IP the eyeball is connecting from.</p></li><li><p>The location of our data center.</p></li><li><p>The country tag of the IP addresses we chose for the egress connections.</p></li></ul><p>The fact that our egress address is tagged as "UK" doesn't always mean it actually is being used in the UK. We've had cases where a UK-tagged WARP user, due to maintenance of our LHR data center, was routed to Paris. A popular <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a> performed a reverse lookup of our egress IP, found it tagged as "UK", and directed the user to a London CDN server. This is generally OK... but we actually egressed from Paris at the time. This user ended up routing packets from their home in the UK, to Paris, and back to the UK. This is bad for performance.</p><p>We address this issue by performing DNS requests in the egressing data center. For DNS we use IP addresses tagged with the location of the <b>data center</b>, not the intended geolocation for the user. This generally fixes the problem, but sadly, there are still some exceptions.</p>
    <div>
      <h3>The future is here</h3>
      <a href="#the-future-is-here">
        
      </a>
    </div>
    <p>Our 2021 experiments with <a href="/addressing-agility/">Addressing Agility</a> proved we have plenty of opportunity to innovate with ingress addressing. Soft-unicast shows us we can achieve great flexibility and density on the egress side.</p><p>With each new product, the number of tags we need on the egress grows - from traffic trustworthiness and product category to geolocation. As the pool of usable IPv4 addresses shrinks, we can be sure there will be more innovation in the space. Soft-unicast is our solution, but for sure it's not our last development.</p><p>For now though, it seems like we're moving away from traditional unicast. Our egress IPs really don't exist in a fixed place anymore, and some of our servers don't even own a true unicast IP nowadays.</p> ]]></content:encoded>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">60l7qhQ3DKYiHtI3TduJSc</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[When the window is not fully open, your TCP stack is doing more than you think]]></title>
            <link>https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/</link>
            <pubDate>Tue, 26 Jul 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection ]]></description>
            <content:encoded><![CDATA[ <p>Over the years I've been lurking around the Linux kernel and have investigated the TCP code many times. But when recently we were working on <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Optimizing TCP for high WAN throughput while preserving low latency</a>, I realized I have gaps in my knowledge about how Linux manages TCP receive buffers and windows. As I dug deeper I found the subject complex and certainly non-obvious.</p><p>In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection. Specifically, looking for answers to seemingly trivial questions:</p><ul><li><p>How much data can be stored in the TCP receive buffer? (it's not what you think)</p></li><li><p>How fast can it be filled? (it's not what you think either!)</p></li></ul><p>Our exploration focuses on the receiving side of the TCP connection. We'll try to understand how to tune it for the best speed, without wasting precious memory.</p>
    <div>
      <h3>A case of a rapid upload</h3>
      <a href="#a-case-of-a-rapid-upload">
        
      </a>
    </div>
    <p>To best illustrate the receive side buffer management we need pretty charts! But to grasp all the numbers, we need a bit of theory.</p><p>We'll draw charts from a receive side of a TCP flow, running a pretty straightforward scenario:</p><ul><li><p>The client opens a TCP connection.</p></li><li><p>The client does <code>send()</code>, and pushes as much data as possible.</p></li><li><p>The server doesn't <code>recv()</code> any data. We expect all the data to stay and wait in the receive queue.</p></li><li><p>We fix the SO_RCVBUF for better illustration.</p></li></ul><p>Simplified pseudocode might look like (<a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py">full code if you dare</a>):</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_STREAM, 0)
sd.bind(('127.0.0.3', 1234))
sd.listen(32)

cd = socket.socket(AF_INET, SOCK_STREAM, 0)
cd.setsockopt(SOL_SOCKET, SO_RCVBUF, 32*1024)
cd.connect(('127.0.0.3', 1234))

ssd, _ = sd.accept()

while True:
    cd.send(b'a'*128*1024)</code></pre>
            <p>We're interested in basic questions:</p><ul><li><p>How much data can fit in the server’s receive buffer? It turns out it's not exactly the same as the default read buffer size on Linux; we'll get there.</p></li><li><p>Assuming infinite bandwidth, what is the minimal time - measured in <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> - for the client to fill the receive buffer?</p></li></ul>
    <div>
      <h3>A bit of theory</h3>
      <a href="#a-bit-of-theory">
        
      </a>
    </div>
    <p>Let's start by establishing some common nomenclature. I'll follow the wording used by the <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> Linux tool from the <code>iproute2</code> package</a>.</p><p>First, there is the buffer budget limit. <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> manpage</a> calls it <b>skmem_rb</b>, in the kernel it's named <b>sk_rcvbuf</b>. This value is most often controlled by the Linux autotune mechanism using the <code>net.ipv4.tcp_rmem</code> setting:</p>
            <pre><code>$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456</code></pre>
            <p>Alternatively, it can be set manually with <code>setsockopt(SO_RCVBUF)</code> on a socket. Note that the kernel doubles the value given to this setsockopt. For example, SO_RCVBUF=16384 will result in skmem_rb=32768. The max value allowed for this setsockopt is limited to a meager 208KiB by default:</p>
            <pre><code>$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992</code></pre>
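The doubling mentioned above is easy to observe from userspace. A minimal check, assuming a Linux host:

```python
import socket

# Ask for a 16 KiB receive buffer, then read back what the kernel stored.
sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16384)
print(sd.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))  # on Linux: 32768
sd.close()
```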
            <p><a href="/optimizing-tcp-for-high-throughput-and-low-latency/">The aforementioned blog post</a> discusses why manual buffer size management is problematic - relying on autotuning is generally preferable.</p><p>Here’s a diagram showing how <b>skmem_rb</b> budget is being divided:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3EEuOnbl8CKYCv4oWj5Ejw/4a0bf778f484bbddebfac4099d8e21f4/image2-17.png" />
            
            </figure><p>At any given moment, we can think of the budget as being divided into four parts:</p><ul><li><p><b>Recv-q</b>: part of the buffer budget occupied by actual application bytes awaiting <code>read()</code>.</p></li><li><p>Another part is consumed by metadata handling - the cost of <b>struct sk_buff</b> and such.</p></li><li><p>Those two parts together are reported by <code>ss</code> as <b>skmem_r</b> - the kernel name is <b>sk_rmem_alloc</b>.</p></li><li><p>What remains is "free", that is: it's not actively used yet.</p></li><li><p>However, a portion of this "free" region is the advertised window - it may become occupied with application data soon.</p></li><li><p>The remainder will be used for future metadata handling, or might be divided into the advertised window further in the future.</p></li></ul><p>The upper limit for the window is configured by the <code>tcp_adv_win_scale</code> setting. By default, the window is set to at most 50% of the "free" space. The value can be clamped further by the TCP_WINDOW_CLAMP option or an internal <code>rcv_ssthresh</code> variable.</p>
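The 50% rule can be expressed in a few lines of Python, mirroring the classic tcp_win_from_space() helper from the kernel (a simplified sketch; recent kernels use a finer-grained scaling):

```python
def win_from_space(space: int, tcp_adv_win_scale: int = 1) -> int:
    """Upper bound of the advertised window given the "free" buffer space.
    Zero or negative scale means a plain right shift; a positive scale
    reserves a 1/2^scale fraction of the space for metadata."""
    if tcp_adv_win_scale <= 0:
        return space >> (-tcp_adv_win_scale)
    return space - (space >> tcp_adv_win_scale)

print(win_from_space(65536, 1))  # → 32768, the default 50%
print(win_from_space(65536, 2))  # → 49152, i.e. 75%
print(win_from_space(65536, 0))  # → 65536, i.e. 100%
```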
    <div>
      <h3>How much data can a server receive?</h3>
      <a href="#how-much-data-can-a-server-receive">
        
      </a>
    </div>
    <p>Our first question was "How much data can a server receive?". A naive reader might think it's simple: if the server has a receive buffer set to say 64KiB, then the client will surely be able to deliver 64KiB of data!</p><p>But this is totally not how it works. To illustrate this, allow me to temporarily set sysctl <code>tcp_adv_win_scale=0</code>. This is not a default and, as we'll learn, it's the wrong thing to do. With this setting the server will indeed set 100% of the receive buffer as an advertised window.</p><p>Here's our setup:</p><ul><li><p>The client tries to send as fast as possible.</p></li><li><p>Since we are interested in the receiving side, we can cheat a bit and speed up the sender arbitrarily. The client has transmission congestion control disabled: we set initcwnd=10000 as the route option.</p></li><li><p>The server has a fixed <b>skmem_rb</b> set at 64KiB.</p></li><li><p>The server has <code><b>tcp_adv_win_scale=0</b></code>.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/44j6HUJ496dIXVMkltUe4O/1765b3f25ef767dfcb23d3c079f7e8cb/image6-10.png" />
            
            </figure><p>There are so many things here! Let's try to digest it. First, the X axis is the ingress packet number (we saw about 65). The Y axis shows the buffer sizes as seen on the receive path for every packet.</p><ul><li><p>First, the purple line is the buffer size limit in bytes - <b>skmem_rb</b>. In our experiment we called <code>setsockopt(SO_RCVBUF)=32K</code> and skmem_rb is double that value. Notice that by setting SO_RCVBUF we disabled the Linux autotune mechanism.</p></li><li><p>The green <b>recv-q</b> line shows how many application bytes are available in the receive socket. It grows linearly with each received packet.</p></li><li><p>Then there is the blue <b>skmem_r</b>, the used data + metadata cost in the receive socket. It grows just like <b>recv-q</b> but a bit faster, since it accounts for the cost of the metadata the kernel needs to deal with.</p></li><li><p>The orange <b>rcv_win</b> is the advertised window. We start with 64KiB (100% of skmem_rb) and it goes down as the data arrives.</p></li><li><p>Finally, the dotted line shows <b>rcv_ssthresh</b>, which is not important yet; we'll get there.</p></li></ul>
    <div>
      <h3>Running over the budget is bad</h3>
      <a href="#running-over-the-budget-is-bad">
        
      </a>
    </div>
    <p>It's super important to notice that we finished with <b>skmem_r</b> higher than <b>skmem_rb</b>! This is rather unexpected, and undesired. The whole point of the <b>skmem_rb</b> memory budget is, well, not to exceed it. Here's how <code>ss</code> shows it:</p>
            <pre><code>$ ss -m
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   
tcp    ESTAB  62464   0       127.0.0.3:1234      127.0.0.2:1235
     skmem:(r73984,rb65536,...)</code></pre>
            <p>As you can see, skmem_rb is 65536 and skmem_r is 73984, which is 8448 bytes over! When this happens we have an even bigger issue on our hands. At around the 62nd packet we have an advertised window of 3072 bytes, but while packets are being sent, the receiver is unable to process them! This is easily verifiable by inspecting an nstat TcpExtTCPRcvQDrop counter:</p>
            <pre><code>$ nstat -az TcpExtTCPRcvQDrop
TcpExtTCPRcvQDrop    13    0.0</code></pre>
            <p>In our run 13 packets were dropped. This counter tracks the number of packets dropped due to either system-wide or per-socket memory pressure - we know we hit the latter. In our case, soon after the socket memory limit was crossed, new packets were prevented from being enqueued to the socket. This happened even though the TCP advertised window was still open.</p><p>This results in an interesting situation. The receiver's window is open, which might indicate it has resources to handle the data. But that's not always the case, as in our example when it ran out of the memory budget.</p><p>The sender will interpret this as packet loss caused by network congestion and will run the usual retry mechanisms, including exponential backoff. Whether this behavior is desirable depends on how you look at it. On one hand, no data will be lost: the sender can eventually deliver all the bytes reliably. On the other hand, the exponential backoff logic might stall the sender for a long time, causing a noticeable delay.</p><p>The root of the problem is straightforward - the Linux kernel's <b>skmem_rb</b> sets a memory budget for both the <b>data</b> and the <b>metadata</b> which reside on the socket. In a pessimistic case each packet might incur the cost of a <b>struct sk_buff</b> + <b>struct skb_shared_info</b>, which on my system is 576 bytes on top of the actual payload size, plus memory waste due to network card buffer alignment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7nJyE7p1rtHK9SvSTDnZoj/c02019aeed1e3b17f24b506b4eeaef36/image7-10.png" />
            
            </figure><p>We now understand that Linux can't just advertise 100% of the memory budget as an advertised window. Some budget must be reserved for metadata and such. The upper limit of window size is expressed as a fraction of the "free" socket budget. It is controlled by <code>tcp_adv_win_scale</code>, with the following values:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZfsgDbgLiLQ0HXUV5mVeK/31e596946e101fef2443896f9db8fcdb/image9-5.png" />
            
            </figure><p>By default, Linux sets the advertised window at most at 50% of the remaining buffer space.</p><p>Even with 50% of space "reserved" for metadata, the kernel is very smart and tries hard to reduce the metadata memory footprint. It has two mechanisms for this:</p><ul><li><p><b>TCP Coalesce</b> - on the happy path, Linux is able to throw away <b>struct sk_buff</b>. It can do so, by just linking the data to the previously enqueued packet. You can think about it as if it was <a href="https://www.spinics.net/lists/netdev/msg755359.html">extending the last packet on the socket</a>.</p></li><li><p><b>TCP Collapse</b> - when the memory budget is hit, Linux runs "collapse" code. Collapse rewrites and defragments the receive buffer from many small skb's into a few very long segments - therefore reducing the metadata cost.</p></li></ul><p>Here's an extension to our previous chart showing these mechanisms in action:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KhHEeAUvJ6rinNLoRBwd1/36b733fddcbb885d8db5b076602ca168/image3-10.png" />
            
            </figure><p><b>TCP Coalesce</b> is a very effective measure and works behind the scenes at all times. In the bottom chart, the packets where the coalesce was engaged are shown with a pink line. You can see - the <b>skmem_r</b> bumps (blue line) are clearly correlated with a <b>lack</b> of coalesce (pink line)! The nstat TcpExtTCPRcvCoalesce counter might be helpful in debugging coalesce issues.</p><p>The <b>TCP Collapse</b> is a bigger gun. <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Mike wrote about it extensively</a>, and <a href="/the-story-of-one-latency-spike/">I wrote a blog post years ago, when the latency of TCP collapse hit us hard</a>. In the chart above, the collapse is shown as a red circle. We clearly see it being engaged after the socket memory budget is reached - from packet number 63. The nstat TcpExtTCPRcvCollapsed counter is relevant here. This value growing is a bad sign and might indicate bad latency spikes - especially when dealing with larger buffers. Normally collapse is supposed to be run very sporadically. A <a href="https://lore.kernel.org/lkml/20120510173135.615265392@linuxfoundation.org/">prominent kernel developer describes</a> this pessimistic situation:</p><blockquote><p>This also means tcp advertises a too optimistic window for a given allocated rcvspace: When receiving frames, <code>sk_rmem_alloc</code> can hit <code>sk_rcvbuf</code> limit and we call <code>tcp_collapse()</code> too often, especially when application is slow to drain its receive queue [...] This is a major latency source.</p></blockquote><p>If the memory budget remains exhausted after the collapse, Linux will drop ingress packets. In our chart it's marked as a red "X". The nstat TcpExtTCPRcvQDrop counter shows the count of dropped packets.</p>
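To get a feel for how quickly metadata eats the budget, here is a back-of-the-envelope calculation using the 576-byte per-packet overhead measured above. It deliberately ignores coalescing and NIC buffer alignment:

```python
SKB_OVERHEAD = 576  # struct sk_buff + struct skb_shared_info, per the post

def packets_in_budget(skmem_rb: int, payload: int) -> int:
    """How many uncoalesced packets of a given payload fit in the budget."""
    return skmem_rb // (payload + SKB_OVERHEAD)

print(65536 // 1448)                   # → 45 full-MSS packets, counting payload only
print(packets_in_budget(65536, 1448))  # → 32 once the sk_buff overhead is included
print(packets_in_budget(65536, 100))   # → 96: tiny packets are dominated by metadata
```

With 100-byte payloads, only about 9.4KiB of the 64KiB budget would hold application data - exactly the mispredicted-cost scenario that collapse and rcv_ssthresh guard against.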
    <div>
      <h3>rcv_ssthresh predicts the metadata cost</h3>
      <a href="#rcv_ssthresh-predicts-the-metadata-cost">
        
      </a>
    </div>
    <p>Perhaps counter-intuitively, the memory cost of a packet can be much larger than the amount of actual application data contained in it. It depends on a number of things:</p><ul><li><p><b>Network card</b>: some network cards always allocate a full page (4096, or even 16KiB) per packet, no matter how small or large the payload.</p></li><li><p><b>Payload size</b>: shorter packets will have a worse metadata-to-content ratio, since <b>struct skb</b> will be comparatively larger.</p></li><li><p>Whether XDP is being used.</p></li><li><p>L2 header size: things like Ethernet, VLAN tags, and tunneling can add up.</p></li><li><p>Cache line size: many kernel structs are cache line aligned. On systems with larger cache lines, they will use more memory (see P4 or S390X architectures).</p></li></ul><p>The first two factors are the most important. Here's a run where the sender was specially configured to make the metadata cost bad and the coalesce ineffective (the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py#L90">details of the setup are messy</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Oo38G0pRDoxfqkcIE7D9Y/c372a9cba2402cee11c14fc815875ea3/image1-10.png" />
            
            </figure><p>You can see the kernel hitting TCP collapse multiple times, which is totally undesired. Each time a collapse runs, the kernel is likely to rewrite the full receive buffer. This whole kernel machinery, from reserving some space for metadata with tcp_adv_win_scale, through using coalesce to reduce the memory cost of each packet, up to the rcv_ssthresh limit, exists to avoid hitting collapse too often.</p><p>The kernel machinery most often works fine, and TCP collapse is rare in practice. However, we noticed that's not the case for certain types of traffic. One example is <a href="https://lore.kernel.org/lkml/CA+wXwBSGsBjovTqvoPQEe012yEF2eYbnC5_0W==EAvWH1zbOAg@mail.gmail.com/">websocket traffic with loads of tiny packets</a> and a slow reader. One <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L452">kernel comment talks about</a> such a case:</p>
            <pre><code>* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spaghetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.</code></pre>
            <p>Notice that the <b>rcv_ssthresh</b> line dropped down on the TCP collapse. This variable is an internal limit to the advertised window. By dropping it the kernel effectively says: hold on, I mispredicted the packet cost; next time I'm given an opportunity I'm going to open a smaller window. The kernel will advertise a smaller window and be more careful - all of this dance is done to avoid collapse.</p>
    <div>
      <h3>Normal run - continuously updated window</h3>
      <a href="#normal-run-continuously-updated-window">
        
      </a>
    </div>
    <p>Finally, here's a chart from a normal run of a connection. Here, we use the default <code>tcp_adv_win_scale=1</code> (50%):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZlSO1vQxnHim1D8dav4Aa/5ce538b22b546d194df83130d9f39bc9/image5-13.png" />
            
            </figure><p>Early in the connection you can see <b>rcv_win</b> being continuously updated with each received packet. This makes sense: while the <b>rcv_ssthresh</b> and <b>tcp_adv_win_scale</b> restrict the advertised window to never exceed 32KiB, the window is sliding nicely as long as there is enough space. At packet 18 the receiver stops updating the window and waits a bit. At packet 32 the receiver decides there still is some space and updates the window again, and so on. At the end of the flow the socket has 56KiB of data. This 56KiB of data was received over a sliding window reaching at most 32KiB.</p><p>The saw blade pattern of rcv_win is caused by delayed ACKs (which can be disabled with QUICKACK). You can see the "<b>acked</b>" bytes in the red dashed line. Since the ACKs might be delayed, the receiver waits a bit before updating the window. If you want a smooth line, you can use the <code>quickack 1</code> per-route parameter, but this is not recommended since it will result in many small ACK packets flying over the wire.</p><p>In a normal connection we expect the majority of packets to be coalesced and the collapse/drop code paths never to be hit.</p>
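For completeness, the per-route parameter mentioned above can be set like this (shown for the loopback route used in these experiments; requires root, and again not recommended in production):

```shell
# Force immediate ACKs (disable delayed ACK) for connections on this route.
ip route change local 127.0.0.0/8 dev lo quickack 1
```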
    <div>
      <h3>Large receive windows - rcv_ssthresh</h3>
      <a href="#large-receive-windows-rcv_ssthresh">
        
      </a>
    </div>
    <p>For large bandwidth transfers over big latency links - big BDP case - it's beneficial to have a very wide advertised window. However, Linux takes a while to fully open large receive windows:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6UL4j5NH62FE1350yWnTcP/e257d2160e41a3f71aa3b727debc44fc/image8-4.png" />
            
            </figure><p>In this run, the <b>skmem_rb</b> is set to 2MiB. As opposed to the previous runs, the buffer budget is large and the receive window doesn't start with 50% of the skmem_rb! Instead it starts from 64KiB and grows linearly. It takes a while for Linux to ramp up the receive window to full size - ~800KiB in this case. The window is clamped by <b>rcv_ssthresh</b>. This variable starts at 64KiB and then grows by two full-MSS packets for each packet which has a "good" ratio of total size (truesize) to payload size.</p><p><a href="https://lore.kernel.org/lkml/CANn89i+mhqGaM2tuhgEmEPbbNu_59GGMhBMha4jnnzFE=UBNYg@mail.gmail.com/">Eric Dumazet writes</a> about this behavior:</p><blockquote><p>Stack is conservative about RWIN increase, it wants to receive packets to have an idea of the skb-&gt;len/skb-&gt;truesize ratio to convert a memory budget to RWIN. Some drivers have to allocate 16K buffers (or even 32K buffers) just to hold one segment (of less than 1500 bytes of payload), while others are able to pack memory more efficiently.</p></blockquote><p>This behavior of slow window opening is fixed, and not configurable in the vanilla kernel. <a href="https://lore.kernel.org/netdev/20220721151041.1215017-1-marek@cloudflare.com/#r">We prepared a kernel patch that allows starting up with a higher rcv_ssthresh</a> based on the per-route option <code>initrwnd</code>:</p>
            <pre><code>$ ip route change local 127.0.0.0/8 dev lo initrwnd 1000</code></pre>
            <p>With the patch and the route change deployed, this is how the buffers look:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YE7Oolhn4ZQ9ihNi11HEL/af8c492bc03243e12b54541e954a3061/image4-12.png" />
            
            </figure><p>The advertised window is limited to 64KiB during the TCP handshake, but with our kernel patch enabled it's quickly bumped up to 1MiB in the first ACK packet afterwards. In both runs it took ~1800 packets to fill the receive buffer, but the time needed differed. In the first run the sender could push only 64KiB onto the wire in the second RTT. In the second run it could immediately push a full 1MiB of data.</p><p>This trick of aggressive window opening is not really necessary for most users. It's only helpful when:</p><ul><li><p>You have high-bandwidth TCP transfers over big-latency links.</p></li><li><p>The metadata + buffer alignment cost of your NIC is sensible and predictable.</p></li><li><p>Immediately after the flow starts your application is ready to send a lot of data.</p></li><li><p>The sender has configured a large <code>initcwnd</code>.</p></li><li><p>You care about shaving off every possible RTT.</p></li></ul>
    <p>On our systems we do have such flows, but arguably it might not be a common scenario. In the real world most of your TCP connections go to the nearest CDN point of presence, which is very close.</p>
    <div>
      <h3>Getting it all together</h3>
      <a href="#getting-it-all-together">
        
      </a>
    </div>
    <p>In this blog post, we discussed a seemingly simple case of a TCP sender filling up the receive socket. We tried to address two questions: with our isolated setup, how much data can be sent, and how quickly?</p><p>With the default settings of net.ipv4.tcp_rmem, Linux initially sets a memory budget of 128KiB for the receive data and metadata. On my system, given full-sized packets, it's able to eventually accept around 113KiB of application data.</p><p>Then, we showed that the receive window is not fully opened immediately. Linux keeps the receive window small, as it tries to predict the metadata cost and avoid overshooting the memory budget, and therefore hitting TCP collapse. By default, with net.ipv4.tcp_adv_win_scale=1, the upper limit for the advertised window is 50% of "free" memory. rcv_ssthresh starts at 64KiB and grows linearly up to that limit.</p><p>On my system it took five window updates - six RTTs in total - to fill the 128KiB receive buffer. In the first batch the sender sent ~64KiB of data (remember we hacked the <code>initcwnd</code> limit), and then the sender topped it up with smaller and smaller batches until the receive window fully closed.</p><p>I hope this blog post is helpful and explains well the relationship between the buffer size and advertised window on Linux. Also, it describes the often misunderstood rcv_ssthresh, which limits the advertised window in order to manage the memory budget and predict the unpredictable cost of metadata.</p><p>In case you wonder, similar mechanisms are in play in QUIC. The QUIC/H3 libraries though are still pretty young and don't have so many complex and mysterious toggles... yet.</p><p>As always, <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-07-rmem-a">the code and instructions on how to reproduce the charts are available at our GitHub</a>.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">ROvfvY7ClXiGsjf1moUld</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to stop running out of ephemeral ports and start to love long-lived connections]]></title>
            <link>https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/</link>
            <pubDate>Wed, 02 Feb 2022 09:53:28 GMT</pubDate>
            <description><![CDATA[ Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way ]]></description>
            <content:encoded><![CDATA[ <p>Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way.</p><p>It's particularly interesting when basic things used everywhere fail. Recently we've reached such a breaking point in a ubiquitous part of Linux networking: establishing a network connection using the <code>connect()</code> system call.</p><p>Since we are not doing anything special, just establishing TCP and UDP connections, how could anything go wrong? Here's one example: we noticed alerts from a misbehaving server, logged in to check it out and saw:</p>
            <pre><code>marek@:~# ssh 127.0.0.1
ssh: connect to host 127.0.0.1 port 22: Cannot assign requested address</code></pre>
            <p>You can imagine the face of my colleague who saw that. SSH to localhost refuses to work, while she was already using SSH to connect to that server! On another occasion:</p>
            <pre><code>marek@:~# dig cloudflare.com @1.1.1.1
dig: isc_socket_bind: address in use</code></pre>
            <p>This time a basic DNS query failed with a weird networking error. Failing DNS is a bad sign!</p><p>In both cases the problem was Linux running out of ephemeral ports. When this happens it's unable to establish any outgoing connections. This is a pretty serious failure. It's usually transient, and if you don't know what to look for it can be hard to debug.</p><p>The root cause lies deeper though. We can often ignore limits on the number of outgoing connections. But we encountered cases where we hit limits on the number of concurrent outgoing connections during normal operation.</p><p>In this blog post I'll explain why we had these issues, how we worked around them, and present userspace code implementing an improved variant of the <code>connect()</code> syscall.</p>
    <div>
      <h3>Outgoing connections on Linux part 1 - TCP</h3>
      <a href="#outgoing-connections-on-linux-part-1-tcp">
        
      </a>
    </div>
    <p>Let's start with a bit of historical background.</p>
    <div>
      <h3>Long-lived connections</h3>
      <a href="#long-lived-connections">
        
      </a>
    </div>
    <p>Back in 2014 Cloudflare announced support for WebSockets. We wrote two articles about it:</p><ul><li><p><a href="/cloudflare-now-supports-websockets/">Cloudflare Now Supports WebSockets</a></p></li><li><p><a href="https://idea.popcount.org/2014-04-03-bind-before-connect/">Bind before connect</a></p></li></ul><p>If you skim these blogs, you'll notice we were totally fine with the WebSocket protocol, framing and operation. What worried us was our capacity to handle large numbers of concurrent outgoing connections towards the origin servers. Since WebSockets are long-lived, allowing them through our servers might greatly increase the concurrent connection count. And this did turn out to be a problem. It was possible to hit a ceiling on the total number of outgoing connections imposed by the Linux networking stack.</p><p>In the pessimistic case, each Linux connection consumes a local port (ephemeral port), and therefore the total connection count is limited by the size of the ephemeral port range.</p>
    <div>
      <h3>Basics - how port allocation works</h3>
      <a href="#basics-how-port-allocation-works">
        
      </a>
    </div>
    <p>When establishing an outbound connection a typical user needs the destination address and port. For example, DNS might resolve <code>cloudflare.com</code> to the '104.1.1.229' IPv4 address. A simple Python program can establish a connection to it with the following code:</p>
            <pre><code>cd = socket.socket(AF_INET, SOCK_STREAM)
cd.connect(('104.1.1.229', 80))</code></pre>
            <p>The operating system’s job is to figure out how to reach that destination, selecting an appropriate source address and source port to form the full 4-tuple for the connection:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zDTSzTRPl4JRrdWfjbzkP/63e0de7a453377f267b41ee0fa394a33/image4-1.png" />
            
            </figure><p>The operating system chooses the source IP based on the routing configuration. On Linux we can see which source IP will be chosen with <code>ip route get</code>:</p>
            <pre><code>$ ip route get 104.1.1.229
104.1.1.229 via 192.168.1.1 dev eth0 src 192.168.1.8 uid 1000
	cache</code></pre>
            <p>The <code>src</code> parameter in the result shows the discovered source IP address that should be used when going towards that specific target.</p><p>The source port, on the other hand, is chosen from the local port range configured for outgoing connections, also known as the ephemeral port range. On Linux this is controlled by the following sysctls:</p>
            <pre><code>$ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
net.ipv4.ip_local_port_range = 32768    60999
net.ipv4.ip_local_reserved_ports =</code></pre>
            <p>The <code>ip_local_port_range</code> sets the low and high (inclusive) port range to be used for outgoing connections. The <code>ip_local_reserved_ports</code> is used to skip specific ports if the operator needs to reserve them for services.</p>
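            <p>These sysctls are plain files under /proc, so a program can read the effective range directly. Here's a small sketch (the paths are the standard procfs locations; the parsing handles only the simple comma-separated form of the reserved list, not ranges like "9000-9100"):</p>
            <pre><code>def ephemeral_ports(range_path='/proc/sys/net/ipv4/ip_local_port_range',
                    reserved_path='/proc/sys/net/ipv4/ip_local_reserved_ports'):
    """Return the list of local ports usable for outgoing connections."""
    with open(range_path) as f:
        lo, hi = map(int, f.read().split())
    with open(reserved_path) as f:
        raw = f.read().strip()
    # reserved ports are excluded from the usable pool
    reserved = {int(p) for p in raw.split(',') if p}
    return [p for p in range(lo, hi + 1) if p not in reserved]</code></pre>
            <p>With the default range and an empty reserved list, this returns 28,232 ports.</p>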
    <div>
      <h3>Vanilla TCP is a happy case</h3>
      <a href="#vanilla-tcp-is-a-happy-case">
        
      </a>
    </div>
    <p>The default ephemeral port range contains more than 28,000 ports (60999+1-32768=28232). Does that mean we can have at most 28,000 outgoing connections? That’s the core question of this blog post!</p><p>In TCP the connection is identified by a full 4-tuple, for example:</p>
<table>
<thead>
  <tr>
    <td><span>full 4-tuple</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>104.1.1.229</span></td>
    <td><span>80</span></td>
  </tr>
</thead>
</table><p>In principle, it is possible to reuse the source IP and port, and share them against another destination. For example, there could be two simultaneous outgoing connections with these 4-tuples:</p>
<table>
<thead>
  <tr>
    <th><span>full 4-tuple #A</span></th>
    <th><span>192.168.1.8</span></th>
    <th><span>32768</span></th>
    <th><span>104.1.1.229</span></th>
    <th><span>80</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>full 4-tuple #B</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>151.101.1.57</span></td>
    <td><span>80</span></td>
  </tr>
</tbody>
</table><p>This "source two-tuple" sharing can happen in practice when establishing connections using the vanilla TCP code:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.connect( (remote_ip, remote_port) )</code></pre>
            <p>But slightly different code can prevent this sharing, as we’ll discuss.</p><p>In the rest of this blog post, we’ll summarise the behaviour of code fragments that make outgoing connections, showing:</p><ul><li><p>The technique’s description</p></li><li><p>The typical <code>errno</code> value in the case of port exhaustion</p></li><li><p>And whether the kernel is able to reuse the {source IP, source port}-tuple against another destination</p></li></ul><p>The last column is the most important, since it shows whether there is a low limit on the total number of concurrent connections. As we're going to see later, the limit is present more often than we'd expect.</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>In the case of generic TCP, things work as intended. Towards a single destination it's possible to have as many connections as the ephemeral range allows. When the range is exhausted (against a single destination), we'll see the EADDRNOTAVAIL error. The system is also able to correctly reuse the local two-tuple {source IP, source port} for ESTABLISHED sockets against other destinations. This is expected and desired.</p>
    <div>
      <h3>Manually selecting source IP address</h3>
      <a href="#manually-selecting-source-ip-address">
        
      </a>
    </div>
    <p>Let's go back to the Cloudflare server setup. Cloudflare operates many services, to name just two: CDN (caching HTTP reverse proxy) and <a href="/1111-warp-better-vpn">WARP</a>.</p><p>For Cloudflare, it’s important that we don’t mix traffic types among our outgoing IPs. Origin servers on the Internet might want to differentiate traffic based on our product. The simplest example is <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a>: it's appropriate for an origin server to firewall off non-CDN inbound connections. Allowing Cloudflare cache pulls is totally fine, but allowing WARP connections which contain untrusted user traffic might lead to problems.</p><p>To achieve such outgoing IP separation, each of our applications must be explicit about which source IPs to use. They can’t leave it up to the operating system; the automatically-chosen source could be wrong. While it's technically possible to configure routing policy rules in Linux to express such requirements, we decided not to do that and keep Linux routing configuration as simple as possible.</p><p>Instead, before calling <code>connect()</code>, our applications select the source IP with the <code>bind()</code> syscall. A trick we call "bind-before-connect":</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRINUSE</span></td>
    <td><span>no </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>This code looks rather innocent, but it hides a considerable drawback. When calling <code>bind()</code>, the kernel attempts to find an unused local two-tuple. Due to BSD API shortcomings, the operating system can't know what we plan to do with the socket. It's totally possible we want to <code>listen()</code> on it, in which case sharing the source IP/port with a connected socket will be a disaster! That's why the source two-tuple selected when calling <code>bind()</code> must be unique.</p><p>Due to this API limitation, in this technique the source two-tuple can't be reused. Each connection effectively "locks" a source port, so the number of connections is constrained by the size of the ephemeral port range. Notice: one source port is used up for each connection, no matter how many destinations we have. This is bad, and is exactly the problem we were dealing with back in 2014 in the WebSockets articles mentioned above.</p><p>Fortunately, it's fixable.</p>
    <div>
      <h3>IP_BIND_ADDRESS_NO_PORT</h3>
      <a href="#ip_bind_address_no_port">
        
      </a>
    </div>
    <p>Back in 2014 we fixed the problem by setting the SO_REUSEADDR socket option and manually retrying <code>bind()</code>+ <code>connect()</code> a couple of times on error. This worked ok, but later in 2015 <a href="https://kernelnewbies.org/Linux_4.2#Networking">Linux introduced a proper fix: the IP_BIND_ADDRESS_NO_PORT socket option</a>. This option tells the kernel to delay reserving the source port:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>IP_BIND_ADDRESS_NO_PORT<br />bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>This gets us back to the desired behavior. On modern Linux, when doing bind-before-connect for TCP, you should set IP_BIND_ADDRESS_NO_PORT.</p>
    <div>
      <h3>Explicitly selecting a source port</h3>
      <a href="#explicitly-selecting-a-source-port">
        
      </a>
    </div>
    <p>Sometimes an application needs to select a specific source port. For example: the operator wants to control the full 4-tuple in order to debug ECMP routing issues.</p><p>Recently a colleague wanted to run a cURL command for debugging, and he needed the source port to be fixed. cURL provides the <code>--local-port</code> option to do this¹:</p>
            <pre><code>$ curl --local-port 9999 -4svo /dev/null https://cloudflare.com/cdn-cgi/trace
*   Trying 104.1.1.229:443...</code></pre>
            <p>In other situations source port numbers should be controlled, as they can be used as an input to a routing mechanism.</p><p>But setting the source port manually is not easy. We're back to square one in our hackery since IP_BIND_ADDRESS_NO_PORT is not an appropriate tool when calling <code>bind()</code> with a specific source port value. To get the scheme working again and be able to share source 2-tuple, we need to turn to SO_REUSEADDR:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( (src_IP, src_port) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            <p>Our summary table:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR<br />bind(src_IP, src_port)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>Here, the user takes responsibility for handling conflicts when an ESTABLISHED socket sharing the 4-tuple already exists. In such a case <code>connect()</code> will fail with EADDRNOTAVAIL and the application should retry with another acceptable source port number.</p>
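            <p>That retry loop can be sketched as follows. This is a hypothetical helper, not code from our repository; <code>src_ports</code> is whatever iterable of acceptable port numbers the operator allows:</p>
            <pre><code>import errno
import socket

def connect_from(src_ip, src_ports, dst):
    """Try explicit source ports until a free 4-tuple is found."""
    for port in src_ports:
        sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sd.bind((src_ip, port))
            sd.connect(dst)
            return sd
        except OSError as e:
            sd.close()
            # port taken, or this exact 4-tuple already exists: try the next one
            if e.errno not in (errno.EADDRNOTAVAIL, errno.EADDRINUSE):
                raise
    raise OSError(errno.EADDRNOTAVAIL, 'no acceptable source port available')</code></pre>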
    <div>
      <h3>Userspace connectx implementation</h3>
      <a href="#userspace-connectx-implementation">
        
      </a>
    </div>
    <p>With these tricks, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L93-L110">we can implement a common function and call it <code>connectx</code></a>. It will do what <code>bind()</code>+<code>connect()</code> should, but won't have the unfortunate ephemeral port range limitation. In other words, created sockets are able to share local two-tuples as long as they are going to distinct destinations:</p>
            <pre><code>def connectx((source_IP, source_port), (destination_IP, destination_port)):</code></pre>
            <p>We have three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>technique</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>vanilla connect()</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>IP_BIND_ADDRESS_NO_PORT</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>SO_REUSEADDR</span></td>
  </tr>
</tbody>
</table><p>The name we chose isn't an accident. macOS (specifically the underlying Darwin OS) has exactly that function implemented <a href="https://www.manpagez.com/man/2/connectx">as a <code>connectx()</code> system call</a> (<a href="https://github.com/apple/darwin-xnu/blob/a1babec6b135d1f35b2590a1990af3c5c5393479/bsd/netinet/tcp_usrreq.c#L517">implementation</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ixMd6STGDjs1IO4DhAaFQ/3cbca6a9ec28010fb15f4004e2450587/image2.png" />
            
            </figure><p>It's more powerful than our <code>connectx</code> code, since it supports TCP Fast Open.</p><p>Should we, Linux users, be envious? For TCP, it's possible to get the right kernel behaviour with the appropriate setsockopt/bind/connect dance, so a kernel syscall is not quite needed.</p><p>But for UDP things turn out to be much more complicated and a dedicated syscall might be a good idea.</p>
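            <p>Putting the TCP pieces together, the whole "dance" collapses into one dispatch. This is only a sketch of the approach described above, not the linked implementation; on Linux the IP_BIND_ADDRESS_NO_PORT constant has value 24 if the Python build doesn't expose it:</p>
            <pre><code>import socket

IP_BIND_ADDRESS_NO_PORT = getattr(socket, 'IP_BIND_ADDRESS_NO_PORT', 24)

def connectx_tcp(src_ip, src_port, dst):
    sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if src_ip is None:
        pass                          # vanilla connect(): kernel picks IP and port
    elif src_port is None:
        # source IP fixed, source port chosen late, at connect() time
        sd.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        sd.bind((src_ip, 0))
    else:
        # full 4-tuple under caller control; conflicts surface at connect()
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sd.bind((src_ip, src_port))
    sd.connect(dst)
    return sd</code></pre>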
    <div>
      <h3>Outgoing connections on Linux - part 2 - UDP</h3>
      <a href="#outgoing-connections-on-linux-part-2-udp">
        
      </a>
    </div>
    <p>In the previous section we listed three use cases for outgoing connections that should be supported by the operating system:</p><ul><li><p>Vanilla egress: operating system chooses the outgoing IP and port</p></li><li><p>Source IP selection: user selects outgoing IP but the OS chooses port</p></li><li><p>Full 4-tuple: user selects full 4-tuple for the connection</p></li></ul><p>We demonstrated how to implement all three cases on Linux for TCP, without hitting connection count limits due to source port exhaustion.</p><p>It's time to extend our implementation to UDP. This is going to be harder.</p><p>For UDP, Linux maintains one hash table that is keyed on local IP and port, which can hold duplicate entries. Multiple UDP connected sockets can not only share a 2-tuple but also a 4-tuple! It's totally possible to have two distinct, connected sockets having exactly the same 4-tuple. This feature was created for multicast sockets. The implementation was then carried over to unicast connections, but it is confusing. With conflicting sockets on unicast addresses, only one of them will receive any traffic. A newer connected socket will "overshadow" the older one. It's surprisingly hard to detect such a situation. To get UDP <code>connectx()</code> right, we will need to work around this "overshadowing" problem.</p>
    <div>
      <h3>Vanilla UDP is limited</h3>
      <a href="#vanilla-udp-is-limited">
        
      </a>
    </div>
    <p>It might come as a surprise to many, but by default, the total count for outbound UDP connections is limited by the ephemeral port range size. Usually, with Linux you can't have more than ~28,000 connected UDP sockets, even if they point to multiple destinations.</p><p>Ok, let's start with the simplest and most common way of establishing outgoing UDP connections:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>no </span><span>(bad!)</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>The simplest case is not a happy one. The total number of concurrent outgoing UDP connections on Linux is limited by the ephemeral port range size. On our multi-tenant servers, with potentially long-lived gaming and H3/QUIC flows containing WebSockets, this is too limiting.</p><p>On TCP we were able to slap on a <code>setsockopt</code> and move on. No such easy workaround is available for UDP.</p><p>For UDP, without REUSEADDR, Linux avoids sharing local 2-tuples among UDP sockets. During <code>connect()</code> it tries to find a 2-tuple that is not used yet. As a side note: there is no fundamental reason that it looks for a unique 2-tuple as opposed to a unique 4-tuple during <code>connect()</code>. This suboptimal behavior might be fixable.</p>
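            <p>This is easy to observe on loopback: two UDP sockets connected to two different destinations still end up on different source ports (a local demonstration; the destination addresses are arbitrary and no packets are sent):</p>
            <pre><code>import socket

def connected_udp(dst):
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sd.connect(dst)   # connect() autobinds, picking a unique local 2-tuple
    return sd

a = connected_udp(('127.0.0.1', 5301))
b = connected_udp(('127.0.0.1', 5302))   # different destination...
# ...yet the source ports differ: no 2-tuple sharing without REUSEADDR
assert a.getsockname()[1] != b.getsockname()[1]</code></pre>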
    <div>
      <h3>SO_REUSEADDR is hard</h3>
      <a href="#so_reuseaddr-is-hard">
        
      </a>
    </div>
    <p>To allow local two-tuple reuse we need the SO_REUSEADDR socket option. Sadly, this would also allow established sockets to share a 4-tuple, with the newer socket overshadowing the older one.</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>yes </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>In other words, we can't just set SO_REUSEADDR and move on, since we might hit a local 2-tuple that is already used in a connection against the same destination. We might already have an identical 4-tuple connected socket underneath. Most importantly, during such a conflict we won't be notified by any error. This is unacceptably bad.</p>
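            <p>The overshadowing is easy to reproduce on loopback. In this sketch two connected sockets end up with an identical 4-tuple without any error being reported, and a datagram addressed to that 4-tuple reaches only one of them:</p>
            <pre><code>import socket

def udp_socket(reuse=False):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    if reuse:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.settimeout(0.2)
    return s

peer = udp_socket()
peer.bind(('127.0.0.1', 0))      # stands in for the remote destination
dst = peer.getsockname()

older = udp_socket(reuse=True)
older.bind(('127.0.0.1', 0))
src = older.getsockname()
older.connect(dst)

newer = udp_socket(reuse=True)
newer.bind(src)                  # same source 2-tuple: no error
newer.connect(dst)               # identical 4-tuple: still no error!

peer.sendto(b'ping', src)
received = []
for name, s in (('older', older), ('newer', newer)):
    try:
        s.recv(16)
        received.append(name)
    except socket.timeout:
        pass
print(received)                  # exactly one of the two sockets gets the datagram</code></pre>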
    <div>
      <h3>Detecting socket conflicts with eBPF</h3>
      <a href="#detecting-socket-conflicts-with-ebpf">
        
      </a>
    </div>
    <p>We thought a good solution might be to write an eBPF program to detect such conflicts. The idea was to hook code into the <code>connect()</code> syscall. Linux cgroups allow the BPF_CGROUP_INET4_CONNECT hook. The eBPF program is called every time a process under a given cgroup runs the <code>connect()</code> syscall. This is pretty cool, and we thought it would allow us to verify if there is a 4-tuple conflict before moving the socket from the UNCONNECTED to the CONNECTED state.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-02-connectx/ebpf_connect4">Here is how to load and attach our eBPF program</a>:</p>
            <pre><code>bpftool prog load ebpf.o /sys/fs/bpf/prog_connect4  type cgroup/connect4
bpftool cgroup attach /sys/fs/cgroup/unified/user.slice connect4 pinned /sys/fs/bpf/prog_connect4</code></pre>
            <p>With this check in place, we greatly reduce the probability of overshadowing:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>INET4_CONNECT hook</span><br /><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>manual port discovery, EPERM on conflict</span></td>
    <td><span>yes</span></td>
    <td><span>yes, but small</span></td>
  </tr>
</tbody>
</table><p>However, this solution is limited. First, it doesn't work for sockets with an automatically assigned source IP or source port; it only works when a user manually creates a 4-tuple connection from userspace. Then there is a second issue: a typical race condition. We don't grab any lock, so it's technically possible a conflicting socket will be created on another CPU between our eBPF conflict check and the completion of the real <code>connect()</code> syscall machinery. In short, this lockless eBPF approach is better than nothing, but fundamentally racy.</p>
    <div>
      <h3>Socket traversal - SOCK_DIAG ss way</h3>
      <a href="#socket-traversal-sock_diag-ss-way">
        
      </a>
    </div>
    <p>There is another way to verify if a conflicting socket already exists: we can check for connected sockets in userspace. It's possible to do this quite effectively, without any privileges, using the SOCK_DIAG_BY_FAMILY feature of the <code>netlink</code> interface. This is the same technique the <code>ss</code> tool uses to print out the sockets available on the system.</p><p>The netlink code is not even all that complicated. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L23">Take a look at the code</a>. Inside the kernel, it goes <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/udp_diag.c#L28">quickly into a fast <code>__udp_lookup()</code> routine</a>. This is great - we can avoid iterating over all sockets on the system.</p><p>With that function handy, we can draft our UDP code:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.bind( src_addr )
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError(...)
sd.connect( dst_addr )</code></pre>
            <p>This code has the same race condition issue as the eBPF connect hook described earlier. But it's a good starting point. We need some locking to avoid the race condition. Perhaps it's possible to do it in userspace.</p>
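            <p>For the curious, the netlink request itself is just two packed structs: an nlmsghdr followed by an inet_diag_req_v2. Here's a sketch of building such a message for an IPv4 UDP 4-tuple (constant values copied from the kernel uapi headers; a full client would also send this over an AF_NETLINK socket and parse the reply):</p>
            <pre><code>import socket
import struct

# values from linux/netlink.h and linux/inet_diag.h
NLM_F_REQUEST = 0x1
SOCK_DIAG_BY_FAMILY = 20

def build_udp_diag_req(src, sport, dst, dport):
    """Pack nlmsghdr + inet_diag_req_v2 asking about one IPv4 UDP 4-tuple."""
    def addr16(ip):
        # IPv4 addresses occupy the first 4 of the 16 address bytes
        return socket.inet_pton(socket.AF_INET, ip).ljust(16, b'\0')
    # struct inet_diag_sockid: ports are big-endian, all-ones cookie means "any"
    sockid = struct.pack('!HH', sport, dport) + addr16(src) + addr16(dst)
    sockid += struct.pack('=IQ', 0, 0xFFFFFFFFFFFFFFFF)   # if_index, cookie
    # struct inet_diag_req_v2: family, protocol, ext, pad, states bitmap
    body = struct.pack('=BBBBI', socket.AF_INET, socket.IPPROTO_UDP,
                       0, 0, 0xFFFFFFFF) + sockid
    # struct nlmsghdr: total length, message type, flags, sequence, pid
    hdr = struct.pack('=LHHLL', 16 + len(body), SOCK_DIAG_BY_FAMILY,
                      NLM_F_REQUEST, 1, 0)
    return hdr + body</code></pre>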
    <div>
      <h3>SO_REUSEADDR as a lock</h3>
      <a href="#so_reuseaddr-as-a-lock">
        
      </a>
    </div>
    <p>Here comes a breakthrough: we can use SO_REUSEADDR as a locking mechanism. Consider this:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( src_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 0)
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError()
sd.connect( dst_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)</code></pre>
            <p>The idea here is:</p><ul><li><p>We need REUSEADDR around bind, otherwise it wouldn't be possible to reuse a local port. It's technically possible to clear REUSEADDR after bind. Doing so leaves the kernel socket state somewhat inconsistent, but it doesn't hurt anything in practice.</p></li><li><p>By clearing REUSEADDR, we lock new sockets out of that source port. At this stage we can check if we have ownership of the 4-tuple we want. Even if multiple sockets enter this critical section, only one, the newest, can win this verification. This is a cooperative algorithm, so we assume all tenants try to behave.</p></li><li><p>At this point, if the verification succeeds, we can perform <code>connect()</code> and have a guarantee that the 4-tuple won't be reused by another socket at any point in the process.</p></li></ul><p>This is rather convoluted and hacky, but it satisfies our requirements:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>REUSEADDR as a lock</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>Sadly, this scheme only works when we know the full 4-tuple, so we can't rely on the kernel's automatic source IP or port assignment.</p>
    <div>
      <h3>Faking source IP and port discovery</h3>
      <a href="#faking-source-ip-and-port-discovery">
        
      </a>
    </div>
    <p>When the user calls <code>connect()</code> and specifies only the target 2-tuple - destination IP and port - the kernel needs to fill in the missing bits: the source IP and source port. Unfortunately, the algorithm described above expects the full 4-tuple to be known in advance.</p><p>One solution is to implement source IP and port discovery in userspace. This turns out to be not that hard. For example, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L204">here's a snippet of our code</a>:</p>
            <pre><code>def _get_udp_port(family, src_addr, dst_addr):
    if ephemeral_lo == None:
        _read_ephemeral()
    lo, hi = ephemeral_lo, ephemeral_hi
    start = random.randint(lo, hi)
    ...</code></pre>
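            <p>The rest of the discovery is a classic wrap-around scan: start at a random offset so that concurrent callers don't pile onto the same port, then try each port in the range exactly once. A generator sketching that shape (the names here are ours, not the original code):</p>
            <pre><code>import random

def candidate_ports(lo, hi):
    """Yield every port in [lo, hi] once, starting at a random offset."""
    n = hi - lo + 1
    start = random.randint(0, n - 1)
    for i in range(n):
        yield lo + (start + i) % n</code></pre>
            <p>The caller then runs the REUSEADDR-as-a-lock dance on each candidate until one 4-tuple verifies cleanly.</p>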
            
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Combining manual source IP and port discovery with the REUSEADDR locking dance, we get a decent userspace implementation of <code>connectx()</code> for UDP.</p><p>We have covered all three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>comments</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>manual source IP and source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>manual source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>just our "REUSEADDR as lock" technique</span></td>
  </tr>
</tbody>
</table><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L116-L166">Take a look at the full code</a>.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>This post described a problem we hit in production: running out of ephemeral ports. This was partially caused by our servers running numerous concurrent connections, but also because we used the Linux sockets API in a way that prevented source port reuse. It meant that we were limited to ~28,000 concurrent connections per protocol, which is not enough for us.</p><p>We explained how to allow source port reuse and avoid the ephemeral-port-range limit. We showed a userspace <code>connectx()</code> function, which is a better way of creating outgoing TCP and UDP connections on Linux.</p><p>Our UDP code is more complex: it is based on little-known low-level features, assumes cooperation between tenants, and relies on undocumented behaviour of the Linux operating system. Using REUSEADDR as a locking mechanism is rather unheard of.</p><p>The <code>connectx()</code> functionality is valuable, and should be added to Linux one way or another. It's not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.</p><p>___</p><p>¹ On a side note, on the second cURL run it fails due to TIME_WAIT sockets: "bind failed with errno 98: Address already in use".</p><p>One option is to wait for the TIME_WAIT socket to die, or work around this with the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/killtw.py">time-wait sockets kill script</a>. Killing time-wait sockets is generally a bad idea: it violates the protocol, is usually unnecessary, and sometimes doesn't work. But hey, in some extreme cases it's good to know what's possible. Just saying.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">319tj39kXPyzuiPbC755uC</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Everything you ever wanted to know about UDP sockets but were afraid to ask, part 1]]></title>
            <link>https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/</link>
            <pubDate>Thu, 25 Nov 2021 17:27:37 GMT</pubDate>
            <description><![CDATA[ Historically Cloudflare's core competency was operating an HTTP reverse proxy. We've spent significant effort optimizing traditional HTTP/1.1 and HTTP/2 servers running on top of TCP. ]]></description>
            <content:encoded><![CDATA[ <p>Snippet from an internal presentation about UDP inner workings in Spectrum. Who said UDP is simple!</p><p>Historically, Cloudflare's core competency was operating an <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">HTTP reverse proxy</a>. We've spent significant effort optimizing traditional HTTP/1.1 and HTTP/2 servers running on top of TCP. Recently, though, we started operating large-scale stateful <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a> services.</p><p>Stateful UDP is gaining popularity for a number of reasons:</p><p>— <a href="/quic-version-1-is-live-on-cloudflare/">QUIC</a> is a new transport protocol based on UDP; it powers HTTP/3, and we see its adoption accelerating.</p><p>— <a href="/1111-warp-better-vpn/">We operate WARP</a> — our WireGuard-based tunneling service — which uses UDP under the hood.</p><p>— We have a lot of generic UDP traffic going through <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">our Spectrum service</a>.</p><p>Although UDP is simple in principle, a lot of domain knowledge is needed to run things at scale. In this blog post we'll cover the basics: all you need to know about UDP servers to get started.</p>
    <div>
      <h3>Connected vs unconnected</h3>
      <a href="#connected-vs-unconnected">
        
      </a>
    </div>
    <p>How do you "accept" connections on a UDP server? If you are using unconnected sockets, you generally don't.</p><p>But let's start with the basics. UDP sockets can be "connected" (or "established") or "unconnected". Connected sockets have a full 4-tuple associated with them: {source IP, source port, destination IP, destination port}; unconnected sockets have only a 2-tuple: {bind IP, bind port}.</p><p>Traditionally, connected sockets were mostly used for outgoing flows, while unconnected ones handled inbound "server-side" traffic.</p>
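<p>A minimal Python sketch (loopback and a kernel-chosen port, both just for illustration) shows the difference between the two socket types:</p>

```python
import socket

# Unconnected socket: bind() gives it only a local 2-tuple.
unconn = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
unconn.bind(("127.0.0.1", 0))      # port 0: let the kernel pick one
print(unconn.getsockname())        # the {bind IP, bind port} 2-tuple
# unconn.getpeername() would raise OSError - there is no peer yet

# Connected socket: connect() fills in the full 4-tuple. For UDP this
# sends no packets; it only records the peer address in the kernel.
conn = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
conn.connect(unconn.getsockname())
print(conn.getsockname())          # source {IP, port}, chosen by the kernel
print(conn.getpeername())          # destination {IP, port}
```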
    <div>
      <h3>UDP client</h3>
      <a href="#udp-client">
        
      </a>
    </div>
    <p>As we'll learn today, these can be mixed. It is possible to use connected sockets for ingress handling, and unconnected ones for egress. To illustrate the latter, consider these two snippets. They do the same thing — send a packet to a DNS resolver. The first snippet uses a connected socket:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nLimg1UkSX4TIIYTaxASJ/630c7f32c10868d6cccee7736d5c713c/image4-20.png" />
            
            </figure><p>The second uses an unconnected one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Iq69Z2Nrk00zNeTvJyrVU/6cb272253a0906191cb6aabf6b2e7b3d/image7-14.png" />
            
            </figure><p>Which one is better? In the second case, when receiving, the programmer should verify the source IP of the packet. Otherwise, the program can get confused by random inbound internet junk — like port scanning. It is tempting to reuse the socket descriptor and query another DNS server afterwards, but this would be a bad idea, particularly when dealing with DNS. For security, DNS assumes the client source port is unpredictable and short-lived.</p><p>Generally speaking, for outbound traffic it's preferable to use connected UDP sockets.</p><p>Connected sockets can also skip the route lookup on each packet thanks to a clever optimization — Linux can cache the route lookup result on <a href="https://elixir.bootlin.com/linux/v5.15.4/source/include/net/sock.h#L434">a connection struct</a>. Depending on the specifics of the setup, this might save some CPU cycles.</p><p>For completeness: it is possible to roll a new source port and reuse a socket descriptor with an obscure trick called "dissolving the socket association". It can be done with <code>connect(AF_UNSPEC)</code>, but this is rather advanced Linux magic.</p>
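<p>The two snippets above are shown as images; a rough Python equivalent of both approaches might look like this (the resolver address and payload below are placeholders, not a real DNS query):</p>

```python
import socket

RESOLVER = ("127.0.0.1", 53)   # stand-in for a real DNS resolver address
PACKET = b"\x00"               # placeholder payload, not a valid DNS query

# Connected socket: connect() pins the 4-tuple, then plain send()/recv().
cd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cd.connect(RESOLVER)
cd.send(PACKET)                # the kernel fills in the destination

# Unconnected socket: each sendto()/recvfrom() names the peer explicitly.
ud = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ud.sendto(PACKET, RESOLVER)
# When receiving here, remember to verify the source of the datagram:
# data, addr = ud.recvfrom(512); assert addr == RESOLVER
```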
    <div>
      <h3>UDP server</h3>
      <a href="#udp-server">
        
      </a>
    </div>
    <p>Traditionally, on the server side UDP requires unconnected sockets. Using them takes a bit of finesse. To illustrate this, let's write a UDP echo server. In practice, you probably shouldn't write such a server, due to the risk of becoming a DoS reflection vector. <a href="/how-to-receive-a-million-packets/">Among other protections</a>, like rate limiting, UDP services should always respond with strictly less data than was sent in the initial packet. But let's not digress; a naive UDP echo server might look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/se9nCWduF9hcK8K1XwiWW/9c9625acfd065d3a66a2b98b99e15708/image6-14.png" />
            
            </figure><p>This code raises questions:</p><p>— Received packets can be longer than 2048 bytes. This can happen over loopback, when using jumbo frames, or with the help of IP fragmentation.</p><p>— It's entirely possible for a received packet to have an empty payload.</p><p>— What about inbound ICMP errors?</p><p>These problems are specific to UDP; they don't happen in the TCP world. TCP can transparently deal with MTU / fragmentation and ICMP errors. Depending on the specific protocol, a UDP service might need to be more complex and take extra care with such corner cases.</p>
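<p>The server in the figure can be sketched in Python like this (trimmed to one iteration and exercised over loopback; the 2048-byte buffer is the same arbitrary limit the questions above poke at):</p>

```python
import socket

def echo_once(sd):
    # Naive echo: reflect the payload back at the sender.
    data, peer = sd.recvfrom(2048)   # payloads over 2048 bytes get truncated!
    # data may legitimately be b"" - empty UDP payloads are valid.
    # A hardened service would also rate limit, handle ICMP errors, and
    # reply with strictly less data, to avoid being a reflection vector.
    sd.sendto(data, peer)

sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sd.bind(("127.0.0.1", 0))            # kernel-chosen port, loopback only

# Exercise the loop body once:
cl = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cl.connect(sd.getsockname())
cl.send(b"hi")
echo_once(sd)
reply = cl.recv(2048)
```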
    <div>
      <h3>Sourcing packets from a wildcard socket</h3>
      <a href="#sourcing-packets-from-a-wildcard-socket">
        
      </a>
    </div>
    <p>There is a bigger problem with this code. It only works correctly when binding to a specific IP address, like <code>::1</code> or <code>127.0.0.1</code>. It won't always work when we bind to a wildcard. The issue lies in the <code>sendto()</code> line — we didn't explicitly set the outbound IP address! Linux doesn't know where we'd like to source the packet from, so it will choose a default egress IP address, which might not be the IP the client sent its packet to. For example, let's say we added the <code>::2</code> address to the loopback interface and sent a packet to it, with the source IP set to <code>::1</code>:</p>
            <pre><code>marek@mrprec:~$ sudo tcpdump -ni lo port 1234 -t
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
IP6 ::1.41879 &gt; ::2.1234: UDP, length 2
IP6 ::1.1234 &gt; ::1.41879: UDP, length 2</code></pre>
            <p>Here we can see the packet correctly flying from <code>::1</code> to <code>::2</code>, to our server. But when the server responds, it sources the response from <code>::1</code>, which in this case is wrong.</p><p>On the server side, when binding to a wildcard:</p><p>— we might receive packets destined to a number of IP addresses</p><p>— we must be very careful when responding, and use the appropriate source IP address</p><p>The BSD sockets API doesn't make it easy to learn which address a received packet was destined to. On Linux and <a href="https://www.freebsd.org/cgi/man.cgi?query=ip6&amp;sektion=4">BSD</a> it is possible to request useful CMSG metadata with IP_PKTINFO and IPV6_RECVPKTINFO.</p><p>An improved server loop might look like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/52toeTH5ZY1aD3bOSry6Do/88b7b0e90c7802ae513ef2f5f2ade5f5/image2-30.png" />
            
            </figure><p>The <code>recvmsg</code> and <code>sendmsg</code> syscalls, as opposed to <code>recvfrom</code> / <code>sendto</code>, allow the programmer to request and set extra CMSG metadata, which is very handy when dealing with UDP.</p><p>The IPV6_PKTINFO CMSG contains this data structure:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/JvR29xOL5YVQj4GvhudPL/c92277cd5b15270a2053663dc7412057/image1-71.png" />
            
            </figure><p>Here we can find the IP address and interface number the packet was targeted at. Notice that there's no place for a port number.</p>
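<p>The improved loop above is shown as an image; a Python sketch of the same idea for IPv6 on Linux might look like the following (the numeric fallbacks for the constants are the Linux values, in case a build doesn't expose them):</p>

```python
import socket
import struct

# Fall back to the Linux numeric values if the constants are missing.
IPV6_RECVPKTINFO = getattr(socket, "IPV6_RECVPKTINFO", 49)
IPV6_PKTINFO = getattr(socket, "IPV6_PKTINFO", 50)

sd = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
sd.setsockopt(socket.IPPROTO_IPV6, IPV6_RECVPKTINFO, 1)
sd.bind(("::", 0))                   # wildcard bind, kernel-chosen port
port = sd.getsockname()[1]

# A client talks to one concrete address of the wildcard-bound server.
cl = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
cl.sendto(b"ping", ("::1", port))

data, ancdata, flags, peer = sd.recvmsg(2048, socket.CMSG_SPACE(20))
for level, ctype, cdata in ancdata:
    if level == socket.IPPROTO_IPV6 and ctype == IPV6_PKTINFO:
        # struct in6_pktinfo: 16 bytes of in6_addr + 4 bytes of ifindex
        dst_ip = socket.inet_ntop(socket.AF_INET6, cdata[:16])
        ifindex = struct.unpack("i", cdata[16:20])[0]
        # Echo back, sourcing the reply from the address the client used:
        sd.sendmsg([data], [(level, ctype, cdata)], 0, peer)

echo, src = cl.recvfrom(2048)        # reply arrives sourced from ::1
```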
    <div>
      <h3>Graceful server restart</h3>
      <a href="#graceful-server-restart">
        
      </a>
    </div>
    <p>Many traditional UDP protocols, like DNS, are request-response based. Since there is no state associated with a higher-level "connection", the server can restart, to upgrade or change configuration, without any problems. Ideally, sockets should be managed with the usual <a href="http://0pointer.de/blog/projects/socket-activation.html">systemd socket activation</a> to avoid the short time window where the socket is down.</p><p>Modern protocols are often connection-based. For such servers, on restart it's beneficial to keep the old connections directed to the old server process, while the new server instance handles the new connections. The old connections will eventually die off, and the old server process will be able to terminate. This is a common and easy practice in the TCP world, where each connection has its own file descriptor. The old server process stops accept()-ing new connections and just waits for the old connections to gradually go away. <a href="http://nginx.org/en/docs/control.html#upgrade">NGINX has good documentation</a> on the subject.</p><p>Sadly, in UDP you can't <code>accept()</code> new connections. Doing graceful server restarts for UDP is surprisingly hard.</p>
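<p>For the request-response case, adopting systemd-activated sockets is just a few lines. This sketch follows the documented systemd convention (fds passed starting at 3, announced via the <code>LISTEN_FDS</code> and <code>LISTEN_PID</code> environment variables); it's an illustration, not a full sd_listen_fds() replacement:</p>

```python
import os
import socket

SD_LISTEN_FDS_START = 3   # first fd systemd passes to the activated service

def listen_fds():
    """Adopt sockets passed in via systemd socket activation, if any."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []          # the fds were not meant for this process
    count = int(os.environ.get("LISTEN_FDS", "0"))
    # socket.socket(fileno=...) auto-detects family and type (Python 3.7+).
    return [socket.socket(fileno=fd)
            for fd in range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count)]

sockets = listen_fds()     # [] when started outside socket activation
```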
    <div>
      <h3>Established-over-unconnected technique</h3>
      <a href="#established-over-unconnected-technique">
        
      </a>
    </div>
    <p>For some services we use a technique we call "established-over-unconnected". It comes from the realization that on Linux it's possible to create a connected socket <i>over</i> an unconnected one. Consider this code:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78wXvRtNNnQFwlqfS7EYd5/bf3224997d0169e26c402d909b1c2012/image3-37.png" />
            
            </figure><p>Does this look hacky? Well, it should. What we do here is:</p><p>— We start an unconnected UDP socket.</p><p>— We wait for a client to come in.</p><p>— As soon as we receive the first packet from the client, we immediately create a new fully connected socket, <i>over</i> the unconnected socket! It shares the same local port and local IP.</p><p>This is how it might look in ss:</p>
            <pre><code>marek@mrprec:~$ ss -panu sport = :1234 or dport = :1234 | cat
State     Recv-Q    Send-Q       Local Address:Port        Peer Address:Port    Process                                                                         
ESTAB     0         0                    [::1]:1234               [::1]:44592    python3
UNCONN    0         0                        *:1234                   *:*        python3
ESTAB     0         0                    [::1]:44592              [::1]:1234     nc</code></pre>
            <p>Here you can see the two sockets managed by our Python test server. Notice that the established socket shares the unconnected socket's port.</p><p>This trick basically reproduces the <code>accept()</code> behaviour in UDP, where each ingress connection gets its own dedicated socket descriptor.</p><p>While this trick is nice, it's not without drawbacks — it's racy in two places. First, it's possible that the client will send more than one packet to the unconnected socket before the connected socket is created. The application code should work around this — if a packet received on the server socket belongs to an already-existing connected flow, it should be handed over to the right place. Second, during the creation of the connected socket, in the short window after <code>bind()</code> and before <code>connect()</code>, we might receive unexpected packets belonging to the unconnected socket! We don't want these packets here. It is necessary to filter by source IP/port when receiving early packets on the connected socket.</p><p>Is this approach worth the extra complexity? It depends on the use case. For a relatively small number of long-lived flows, it might be fine. For a high number of short-lived flows (especially DNS or NTP) it's overkill.</p><p>Keeping old flows stable during service restarts is particularly hard in UDP. The established-over-unconnected technique is just one of the simpler ways of handling it. We'll leave another technique, based on SO_REUSEPORT eBPF, for a future blog post.</p>
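<p>A minimal Python sketch of the technique (IPv4 on loopback with a kernel-chosen port; a real server would also handle the two races discussed above):</p>

```python
import socket

# The unconnected "server" socket that new clients first talk to.
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
addr = srv.getsockname()

# A client opens a new flow.
cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.connect(addr)
cli.send(b"first packet")

data, peer = srv.recvfrom(2048)       # first packet arrives unconnected

# Create a fully connected socket over the unconnected one: same local
# IP and port (hence SO_REUSEADDR on both), plus the peer's address.
conn = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
conn.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
conn.bind(addr)      # racy window opens here...
conn.connect(peer)   # ...and closes here - verify early packets' sources!

# From now on this flow's packets land on the connected socket.
cli.send(b"second packet")
conn.settimeout(5)                    # don't hang forever in this demo
got = conn.recv(2048)
```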
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this blog post we started by highlighting connected and unconnected UDP sockets. Then we discussed why binding UDP servers to a wildcard is hard, and how the IP_PKTINFO CMSG can help solve it. We discussed the UDP graceful restart problem, and hinted at the established-over-unconnected technique.</p><table><tr><td><p><b>Socket type</b></p></td><td><p><b>Created with</b></p></td><td><p><b>Appropriate syscalls</b></p></td></tr><tr><td><p>established</p></td><td><p>connect()</p></td><td><p>recv()/send()</p></td></tr><tr><td><p>established</p></td><td><p>bind() + connect()</p></td><td><p>recvfrom()/send(); watch out for the race after bind(), verify the source of the packet</p></td></tr><tr><td><p>unconnected</p></td><td><p>bind(specific IP)</p></td><td><p>recvfrom()/sendto()</p></td></tr><tr><td><p>unconnected</p></td><td><p>bind(wildcard)</p></td><td><p>recvmsg()/sendmsg() with IP_PKTINFO CMSG</p></td></tr></table><p>Stay tuned: in future blog posts we might go even deeper into the curious world of production UDP servers.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <guid isPermaLink="false">4Is8w5do12KTSyIAjAtpud</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Branch predictor: How many "if"s are too many? Including x86 and M1 benchmarks!]]></title>
            <link>https://blog.cloudflare.com/branch-predictor/</link>
            <pubDate>Thu, 06 May 2021 13:00:00 GMT</pubDate>
            <description><![CDATA[ Is it ok to have if clauses that will basically never be run? Surely, there must be some performance cost to that... ]]></description>
            <content:encoded><![CDATA[ <p>Some time ago I was looking at a hot section in our code and I saw this:</p>
            <pre><code>if (debug) {
    log("...");
}</code></pre>
            <p>This got me thinking. This code is in a performance-critical loop and it looks like a waste - we never run with the "debug" flag enabled<sup>[</sup><a href="#footnotes"><sup>1</sup></a><sup>].</sup> Is it ok to have <code>if</code> clauses that will basically never be run? Surely, there must be some performance cost to that...</p>
    <div>
      <h3>Just how bad is peppering the code with avoidable <code>if</code> statements?</h3>
      <a href="#just-how-bad-is-peppering-the-code-with-avoidable-if-statements">
        
      </a>
    </div>
    <p>Back in the days the general rule was: a fully predictable branch has close to zero CPU cost.</p><p>To what extent is this true? If one branch is fine, then how about ten? A hundred? A thousand? When does adding one more <code>if</code> statement become a bad idea?</p><p>At some point the negligible cost of simple branch instructions surely adds up to a significant amount. As another example, a colleague of mine found this snippet in our production code:</p>
            <pre><code>const char *getCountry(int cc) {
    if (cc == 1) return "A1";
    if (cc == 2) return "A2";
    if (cc == 3) return "O1";
    if (cc == 4) return "AD";
    if (cc == 5) return "AE";
    if (cc == 6) return "AF";
    if (cc == 7) return "AG";
    if (cc == 8) return "AI";
    ...
    if (cc == 252) return "YT";
    if (cc == 253) return "ZA";
    if (cc == 254) return "ZM";
    if (cc == 255) return "ZW";
    if (cc == 256) return "XK";
    if (cc == 257) return "T1";
    return "UNKNOWN";
}</code></pre>
            <p>Obviously, this code could be improved<sup>[</sup><a href="#footnotes"><sup>2</sup></a><sup>]</sup>. But when I thought about it more: <i>should</i> it be improved? Is there an actual performance hit from code that consists of a series of simple branches?</p>
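<p>For reference, the obvious improvement is a table lookup in place of the branch chain. A quick sketch of the idea in Python (only the codes visible in the snippet above are filled in; the elided middle stays elided):</p>

```python
# One dictionary lookup replaces the ~257-branch chain; in C the same
# idea would be a bounds check plus an index into a static string array.
COUNTRIES = {
    1: "A1", 2: "A2", 3: "O1", 4: "AD", 5: "AE", 6: "AF", 7: "AG", 8: "AI",
    # ... entries 9..251 elided, as in the original snippet ...
    252: "YT", 253: "ZA", 254: "ZM", 255: "ZW", 256: "XK", 257: "T1",
}

def get_country(cc):
    return COUNTRIES.get(cc, "UNKNOWN")
```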
    <div>
      <h3>Understanding the cost of jump</h3>
      <a href="#understanding-the-cost-of-jump">
        
      </a>
    </div>
    <p>We must start our journey with a bit of theory. We want to figure out if the CPU cost of a branch increases as we add more of them. As it turns out, assessing the cost of a branch is not trivial. On modern processors it takes between one and twenty CPU cycles. There are at least four categories of control flow instructions<sup>[</sup><a href="#footnotes"><sup>3</sup></a><sup>]</sup>: unconditional branch (jmp on x86), call/return, taken conditional branch (e.g. je on x86), and not-taken conditional branch. The taken branches are especially problematic: without special care they are inherently costly - we'll explain this in the following section. To bring down the cost, modern CPUs try to predict the future and figure out the branch <b>target</b> before the branch is actually fully executed! This is done in a special part of the processor called the branch predictor unit (BPU).</p><p>The branch predictor attempts to figure out the destination of a branching instruction very early and with very little context. This magic happens <b>before</b> the "decoder" pipeline stage, and the predictor has very limited data available. It only has some past history and the address of the current instruction. If you think about it, this is super powerful: given only the current instruction pointer it can assess, with very high confidence, where the target of the jump will be.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1K3IOo5qkOOP0hhRtwu10U/e9997acf855e96044f2684e1adae87b0/pasted-image-0-1.png" />
            
            </figure><p>Source: <a href="https://en.wikipedia.org/wiki/Branch_predictor">https://en.wikipedia.org/wiki/Branch_predictor</a></p><p>The BPU maintains a couple of data structures, but today we'll focus on the Branch Target Buffer (BTB). It's where the BPU remembers the target instruction pointer of previously taken branches. The whole mechanism is much more complex; take a look at <a href="http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf">Vladimir Uzelac's Master's thesis</a> for details about branch prediction on CPUs from the 2008 era:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FO1GCWdkQYR5I90qGopuH/9f51e10ca53eafd001f2238896103327/pasted-image-0--1-.png" />
            
            </figure><p>For the scope of this article we'll simplify and focus on the BTB only. We'll try to show how large it is and how it behaves under different conditions.</p>
    <div>
      <h3>Why is branch prediction needed?</h3>
      <a href="#why-is-branch-prediction-needed">
        
      </a>
    </div>
    <p>But first, why is branch prediction used at all? In order to get the best performance, the CPU pipeline must be fed a constant flow of instructions. Consider what happens to the multi-stage CPU pipeline on a branch instruction. To illustrate, let's consider the following ARM program:</p>
            <pre><code>    BR label_a;
    X1
    ...
label_a:
    Y1</code></pre>
            <p>Assuming a simplistic CPU model, the operations would flow through the pipeline like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LYtd8rNXEQA2X2pOPwmvd/e9bbdd4d54a72dc78730684252d01a84/3P6PIWN6gPAdYzP8oDgrsaOMKgUmG51zIiFhbm071cZKM276S7vRb5atpTwlKrM1lFHRYsobw8P4e-Z9t1Vb9TGeutpBe2CkMrNGruWO8yb5Qz0vZ6Qn6RbOi5Tp.png" />
            
            </figure><p>In the first cycle the BR instruction is fetched. This is an unconditional branch instruction changing the execution flow of the CPU. At this point it's not yet decoded, but the CPU would like to fetch another instruction already! Without a branch predictor in cycle 2 the fetch unit either has to wait or simply continues to the next instruction in memory, hoping it will be the right one.</p><p>In our example, instruction X1 is fetched even though this isn't the correct instruction to run. In cycle 4, when the branch instruction finishes the execute stage, the CPU will be able to understand the mistake, and roll back the speculated instructions before they have any effect. At this point the fetch unit is updated to correctly get the right instruction - Y1 in our case.</p><p>This situation of losing a number of cycles due to fetching code from an incorrect place is called a "frontend bubble". Our theoretical CPU has a two-cycle frontend bubble when a branch target wasn’t predicted right.</p><p>In this example we see that, although the CPU does the right thing in the end, without good branch prediction it wasted effort on bad instructions. In the past, various techniques have been used to reduce this problem, such as static branch prediction and branch delay slots. But the dominant CPU designs today rely <a href="https://danluu.com/branch-prediction/#one-bit">on <i>dynamic branch prediction</i></a>. This technique is able to mostly avoid the frontend bubble problem, by predicting the correct address of the next instruction even for branches that aren’t fully decoded and executed yet.</p>
    <div>
      <h3>Playing with the BTB</h3>
      <a href="#playing-with-the-btb">
        
      </a>
    </div>
    <p>Today we're focusing on the BTB - a data structure managed by the branch predictor responsible for figuring out a target of a branch. It's important to note that the BTB is distinct from and independent of the system assessing if the branch was taken or not taken. Remember, we want to figure out if a cost of a branch increases as we run more of them.</p><p>Preparing an experiment to stress only the BTB is relatively simple (<a href="https://xania.org/201602/bpu-part-three">based on Matt Godbolt's work</a>). It turns out a sequence of unconditional <code>jmps</code> is totally sufficient. Consider this x86 code:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/259y50RQfLqaFcsT4tljGb/5010158442d4e2afb6ebb3c105ee4e5b/pasted-image-0--2-.png" />
            
            </figure><p>This code stresses the BTB to an extreme - it just consists of a chain of <code>jmp +2</code> statements (i.e. literally jumping to the next instruction). In order to avoid wasting cycles on frontend pipeline bubbles, each taken jump needs a BTB hit. This branch prediction must happen very early in the CPU pipeline, before instruction decode is finished. The same mechanism is needed for any taken branch, whether it's unconditional, conditional or a function call.</p><p>The code above was run inside a test harness that measures how many CPU cycles elapse for each instruction. For example, in this run we're measuring the timings of 1024 densely packed jmp instructions - one every two bytes - one after another:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3QofFzm1kmqFLuRrFPoyjQ/857bfbfe366720bd480653388f97eac8/pasted-image-0--3-.png" />
            
            </figure><p>We’ll look at the results of experiments like this for a few different CPUs. But in this instance, it was run on a machine with an AMD EPYC 7642. Here, the cold run took 10.5 cycles per jmp, and all subsequent runs took ~3.5 cycles per jmp. The code is prepared in such a way as to make sure it's the BTB that is slowing down the first run. Take a look at the full code; there is quite a bit of magic to warm up the L1 cache and iTLB without priming the BTB.</p><p><b>Top tip 1. On this CPU, a branch instruction that is taken but not predicted costs ~7 cycles more than one that is taken and predicted.</b> Even if the branch was unconditional.</p>
    <div>
      <h3>Density matters</h3>
      <a href="#density-matters">
        
      </a>
    </div>
    <p>To get a full picture we also need to think about the density of jmp instructions in the code. The code above did eight jmps per 16-byte code block. This is a lot. For example, the code below contains one jmp instruction in each block of 16 bytes. Notice that the <code>nop</code> opcodes are jumped over. The block size doesn't change the number of executed instructions, only the code density:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2uenA2zFAkPWQZldVX7gba/8823f9ec4052618dc06c21d0621a6581/pasted-image-0--4-.png" />
            
            </figure><p>Varying the jmp block size might be important. It allows us to control the placement of the jmp opcodes. Remember, the BTB is indexed by the instruction pointer address. Its value and alignment might influence the placement in the BTB and help us reveal the BTB layout. Increasing the alignment causes more nop padding to be added. I will call the sequence of a single measured instruction - jmp in this case - and zero or more nops a "block", and its size the "block size". Notice that the larger the block size, the larger the working code size for the CPU. At larger values we might see some performance drop due to exhausting the L1 cache space.</p>
    <div>
      <h3>The experiment</h3>
      <a href="#the-experiment">
        
      </a>
    </div>
    <p>Our experiment is crafted to show the performance drop depending on the number of branches, across different working code sizes. Hopefully, we will be able to prove the performance is mostly dependent on the number of blocks - and therefore the BTB size, and not the working code size.</p><p>See the <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-05-branch-prediction">code on GitHub</a>. If you want to see the generated machine code, though, you need to run a special command. It's created procedurally by the code, customized by passed parameters. Here's an example <code>gdb</code> incantation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aAKMXz8aruwblN67UvZd7/1eb2aea7cc6e2e3a9c0f6823f60b8f08/pasted-image-0--5-.png" />
            
            </figure><p>Let's take this experiment further: what if we took the best times of each run - with a fully primed BTB - for varying jmp block sizes and numbers of blocks (working set sizes)? Here you go:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZNeolS38Q90vdHPvehvVa/64e0c16d163d74c05cfbaee6f8d9e15c/pasted-image-0--6-.png" />
            
            </figure><p>This is an astonishing chart. First, it's obvious something happens at the 4096 jmp mark[<a href="#footnotes">4</a>], regardless of how large the jmp block size is - how many nops we skip over. Reading it aloud:</p><ul><li><p>On the far left, we see that if the amount of code is small enough - less than 2048 bytes (256 times a block of 8 bytes) - it's possible to hit some kind of uop/L1 cache and get ~1.5 cycles per fully predicted branch. This is amazing.</p></li><li><p>Otherwise, if you keep your hot loop to 4096 branches then, no matter how dense your code is, you are likely to see ~3.4 cycles per fully predicted branch.</p></li><li><p>Above 4096 branches the branch predictor gives up, and the cost of each branch shoots up to ~10.5 cycles per jmp. This is consistent with what we saw above - an unpredicted branch on a flushed BTB took ~10.5 cycles.</p></li></ul><p>Great, so what does it mean? You should avoid branch instructions if you want to avoid branch misses, because you have at most 4096 fast BTB slots. This is not very pragmatic advice, though - it's not like we deliberately put many unconditional <code>jmp</code>s in real code!</p><p>There are a couple of takeaways for the discussed CPU. I repeated the experiment with an always-taken conditional branch sequence, and the resulting chart looks almost identical. The only difference is that the predicted taken conditional je instruction is 2 cycles slower than the unconditional jmp.</p><p>An entry is added to the BTB whenever a branch is "taken" - that is, the jump actually happens. An unconditional <code>jmp</code> or an always-taken conditional branch will cost a BTB slot. To get the best performance, make sure not to have more than 4096 taken branches in the hot loop. The good news is that never-taken branches don't take space in the BTB. We can illustrate this with another experiment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gd2JKAwYMpgk5obd4E0Hz/29876ce9f45dc01edbb8e3966bcb9903/pasted-image-0--7-.png" />
            
            </figure><p>This boring code goes over a not-taken <code>jne</code> followed by two nops (block size = 4). Armed with this test (jne never-taken), the previous one (jmp always-taken) and a conditional branch <code>je</code> always-taken, we can draw this chart:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1F2STBl6iw6OpkwqmIdlR9/1de627e490ddd75816bb3899f99f6ed7/pasted-image-0--8-.png" />
            
            </figure><p>First, without any surprise, we can see the conditional 'je always-taken' getting slightly more costly than the simple unconditional <code>jmp</code>, but only after the 4096-branches mark. This makes sense: the conditional branch is resolved later in the pipeline, so the frontend bubble is longer. Then take a look at the blue line hovering near zero. This is the "jne never-taken" line, flat at 0.3 clocks / block, no matter how many blocks we run in sequence. The takeaway is clear - you can have as many never-taken branches as you want, without incurring any cost. There isn't any spike at the 4096 mark, meaning the BTB is not used in this case. It seems a conditional jump not seen before is predicted to be not-taken.</p><p><b>Top tip 2: conditional branches never-taken are basically free</b> - at least on this CPU.</p><p>So far we've established that always-taken branches occupy the BTB, while never-taken branches do not. How about other control flow instructions, like <code>call</code>?</p><p>I haven't been able to find this in the literature, but it seems call/ret also need a BTB entry for best performance. I was able to illustrate this on our AMD EPYC. Let's take a look at this test:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fHArGj5TWvRM9AEM84NLJ/1951dfe42651b923551d90dddc63964f/pasted-image-0--9-.png" />
            
            </figure><p>This time we'll issue a number of <code>callq</code> instructions followed by <code>ret</code> - both of which should be fully predicted. The experiment is crafted so that each callq calls a unique function, to allow for retq prediction - each one returns to exactly one caller.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/677H5BdxwzZ9B6hbebRFnB/a1ac1602d5efab51152a6219fde87579/pasted-image-0--10-.png" />
            
            </figure><p>This chart confirms the theory: no matter the code density - with the exception of the 64-byte block size, which is notably slower - the cost of a predicted call/ret starts to deteriorate after the 2048 mark. At this point the BTB is filled with call and ret predictions and can't hold any more data. This leads to an important conclusion:</p><p><b>Top tip 3. In hot code you want fewer than 2K function calls</b> - on this CPU.</p><p>On our test CPU a sequence of fully predicted call/ret pairs takes about 7 cycles, which is about the same as two unconditional predicted <code>jmp</code> opcodes. That's consistent with our results above.</p><p>So far we've thoroughly examined the AMD EPYC 7642. We started with this CPU because its branch predictor is relatively simple and the charts were easy to read. It turns out more recent CPUs are less clear.</p>
    <div>
      <h3>AMD EPYC 7713</h3>
      <a href="#amd-epyc-7713">
        
      </a>
    </div>
    <p>The newer AMD generation is more complex than the previous ones. Let's run the two most important experiments. First, the <code>jmp</code> one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12g3Zg18eWyHIwyUuf63gj/0cb89d1fc7a8fedbed929aa455e11dbf/pasted-image-0--11-.png" />
            
            </figure><p>For the always-taken branch case we can see very good, sub-1-cycle timings when the number of branches doesn't exceed 1024 and the code isn't too dense.</p><p><b>Top tip 4. On this CPU it's possible to get &lt;1 cycle per predicted jmp when the hot loop fits in ~32KiB.</b></p><p>Then there is some noise starting after the 4096-jmp mark, followed by a complete collapse in speed at about 6000 branches. This is in line with the theory that the BTB is 4096 entries long. We can speculate that some other prediction mechanism successfully kicks in beyond that, and keeps the performance up until the ~6k mark.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4X5KfDDT36u9u2MGb7iu7t/90c4fbf3d04cd386ed6eecae645f25f8/pasted-image-0--12-.png" />
            
            </figure><p>The call/ret chart tells a similar tale: the timings start breaking after the 2048 mark, and predictions fail completely beyond ~3000.</p>
    <div>
      <h3>Xeon Gold 6262</h3>
      <a href="#xeon-gold-6262">
        
      </a>
    </div>
    <p>The Intel Xeon looks different from the AMD:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hIriKLopeE26yIor9qGG/9ab0d56fd18cfd4a782b7270845bc66f/pasted-image-0--13-.png" />
            
            </figure><p>Our test shows that a predicted taken branch costs 2 cycles. Intel has documented a clock penalty for very dense branching code - this explains the 4-byte block size line hovering at ~3 cycles. The branch cost breaks at the 4096-jmp mark, confirming the theory that the Intel BTB can hold 4096 entries. The 64-byte block size chart looks confusing, but really isn't. The branch cost stays flat at 2 cycles up to the 512-jmp count, then increases. This is caused by the internal layout of the BTB, which is said to be 8-way associative. It seems that with the 64-byte block size we can utilize at most half of the 4096 BTB slots.</p><p><b>Top tip 5. On Intel avoid placing your jmp/call/ret instructions at regular 64-byte intervals.</b></p><p>Then the call/ret chart:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1XU4ofCRG3wfjLZlwsEhBv/2df1b19f6623a26be745caf35c28ee33/pasted-image-0--14-.png" />
            
            </figure><p>Similarly, we can see branch predictions failing after the 2048 mark - in this experiment each block uses two control flow instructions: call and ret. This again confirms the BTB size of 4K entries. The 64-byte block size is generally slower due to the nop padding, but it also breaks down sooner due to the instruction alignment issue. Notice that we didn't see this effect on AMD.</p>
    <div>
      <h3>Apple Silicon M1</h3>
      <a href="#apple-silicon-m1">
        
      </a>
    </div>
    <p>So far we've seen examples of AMD and Intel server-grade CPUs. How does an Apple Silicon M1 fit into this picture?</p><p>We expect it to be very different - it's designed for mobile and uses the ARM64 architecture. Let's rerun our two experiments:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bJ9lzYsIARltRCkOEJVZZ/6b1b5edcdf00e3c0a3c353fff72c8232/pasted-image-0--15-.png" />
            
            </figure><p>The predicted <code>jmp</code> test tells an interesting story. First, when the code fits in 4096 bytes (1024*4, 512*8, etc.) you can expect a predicted <code>jmp</code> to cost 1 clock cycle. This is an excellent score.</p><p>Beyond that, you can generally expect a cost of 3 clock cycles per predicted jmp. This is also very good, but it starts to deteriorate when the working code grows beyond ~192KiB. This is visible with block size 64 breaking at the 3072 mark (3072*64 = 192KiB), and with block size 32 at 6144 (6144*32 = 192KiB). At this point the prediction seems to stop working. The documentation indicates that the M1 CPU has 192KB of L1 instruction cache - our experiment matches that.</p><p>Let's compare the "predicted jmp" chart with the "unpredicted jmp" one. Take this chart with a grain of salt, because flushing the branch predictor is notoriously difficult.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2VCtlgVdJkSB3MFuYbR5MQ/31b80611cab5b7a5ec14514dd375e7ea/pasted-image-0--16-.png" />
            
            </figure><p>However, even if we don't trust the flush-bpu code (<a href="https://xania.org/201602/bpu-part-three">adapted from Matt Godbolt</a>), this chart reveals two things. First, the "unpredicted" branch cost seems to be correlated with the branch distance: the longer the branch, the costlier it is. We haven't seen such behaviour on x86 CPUs.</p><p>Then there is the cost itself. We saw what a predicted sequence of branches costs, and now what a supposedly-unpredicted jmp costs. In the first chart we saw that beyond ~192KiB of working code, the branch predictor seems to become ineffective. The supposedly-flushed BPU shows the same cost. For example, the cost of a 64-byte block size jmp with a small working set is 3 cycles, while a miss costs ~8 cycles. For a large working set both times are ~8 cycles. It seems that the BTB state is linked to the L1 cache state. <a href="https://www.realworldtech.com/forum/?threadid=159985&amp;curpostid=160001">Paul A. Clayton suggested</a> the possibility of such a design back in 2016.</p><p><b>Top tip 6. On M1 a predicted-taken branch generally takes 3 cycles, and an unpredicted but taken branch has a varying cost depending on the jmp length. The BTB is likely linked with the L1 cache.</b></p><p>The call/ret chart is funny:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/54s7q6CJamAHCjpzDHFm6V/d1db62f4d8ec351ce539709cb6a512cd/pasted-image-0--17-.png" />
            
            </figure><p>Like in the chart before, we can see a big benefit if the hot code fits within 4096 bytes (512*4 or 256*8). Otherwise, you can count on 4-6 cycles per call/ret sequence (or bl/ret, as it's known on ARM). The chart shows curious alignment effects; it's unclear what causes them. Beware: comparing the numbers in this chart with x86 is unfair, since the ARM <code>call</code> operation differs substantially from the x86 variant.</p><p>The M1 seems pretty fast, with predicted branches usually at 3 clock cycles. Even unpredicted branches never cost more than 8 ticks in our benchmark. A call+ret sequence for dense code should fit under 5 cycles.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>We started our journey from a piece of trivial code, and asked a basic question: how costly is adding a never-taken <code>if</code> branch in the hot portion of code?</p><p>Then we quickly dived into very low-level CPU features. By the end of this article, hopefully, an astute reader has gained a better intuition for how a modern branch predictor works.</p><p>On x86 the hot code needs to split the BTB budget between function calls and taken branches. The BTB holds only 4096 entries. There are strong benefits to keeping the hot code under 16KiB.</p><p>On the M1, on the other hand, the BTB seems to be limited by the L1 instruction cache. If you're writing super hot code, ideally it should fit in 4KiB.</p><p>Finally, can you add that one more <code>if</code> statement? If it's never taken, it's probably okay. I found no evidence that such branches incur any extra cost. But do avoid always-taken branches and function calls.</p><p><b>Sources</b></p><p>I'm not the first person to investigate how the BTB works. I based my experiments on:</p><ul><li><p><a href="http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf">Vladimir Uzelac's thesis</a></p></li><li><p><a href="https://xania.org/201602/bpu-part-three">Matt Godbolt's work</a>. The series has 5 articles.</p></li><li><p><a href="https://www.realworldtech.com/forum/?threadid=159985&amp;curpostid=159985">Travis Downs' BTB questions</a> on Real World Tech</p></li><li><p><a href="https://stackoverflow.com/questions/38811901/slow-jmp-instruction">various</a> <a href="https://stackoverflow.com/questions/51822731/why-did-intel-change-the-static-branch-prediction-mechanism-over-these-years">stackoverflow</a> <a href="https://stackoverflow.com/questions/38512886/btb-size-for-haswell-sandy-bridge-ivy-bridge-and-skylake">discussions</a>, especially <a href="https://stackoverflow.com/questions/31280817/what-branch-misprediction-does-the-branch-target-buffer-detect">this one</a> and <a href="https://stackoverflow.com/questions/31642902/intel-cpus-instruction-queue-provides-static-branch-prediction">this one</a></p></li><li><p><a href="https://www.agner.org/optimize/microarchitecture.pdf">Agner Fog's</a> microarchitecture guide, which has a good section on branch prediction.</p></li></ul>
    <div>
      <h3>Acknowledgements</h3>
      <a href="#acknowledgements">
        
      </a>
    </div>
    <p>Thanks to <a href="/author/david-wragg/">David Wragg</a> and <a href="https://twitter.com/danluu">Dan Luu</a> for technical expertise and proofreading help.</p>
    <div>
      <h3>PS</h3>
      <a href="#ps">
        
      </a>
    </div>
    <p>Oh, oh. But this is not the whole story! Similar research was the basis of the <a href="https://spectreattack.com/spectre.pdf">Spectre v2</a> attack. The attack exploited the little-known fact that the BPU state was not cleared between context switches. With the correct technique it was possible to train the BPU - in the case of Spectre it was the iBTB - and force a privileged piece of code to be speculatively executed. This, combined with a cache side-channel data leak, allowed an attacker to steal secrets from the privileged kernel. Powerful stuff.</p><p>A proposed solution was to avoid using the shared BTB. This can be done in two ways: make indirect jumps always fail to predict, or fix the CPU to avoid sharing BTB state across isolation domains. This is a long story, maybe for another time...</p><hr /><p><a>Footnotes</a></p><p>1. One historical solution to this specific 'if debug' problem is called "runtime nop'ing". The idea is to modify the code at runtime and patch the never-taken branch instruction with a <code>nop</code>. For example, see the "ISENABLED" discussion on <a href="https://bugzilla.mozilla.org/showbug.cgi?id=370906.">https://bugzilla.mozilla.org/showbug.cgi?id=370906.</a></p><p>2. Fun fact: modern compilers are pretty smart. Recent gcc (&gt;=11) and even older clang (&gt;=3.7) are able to optimize it quite a lot. <a href="https://godbolt.org/z/KWYEW3d9s">See for yourself</a>. But let's not get distracted by that. This article is about low-level machine code branch instructions!</p><p>3. This is a simplification. There are of course more control flow instructions, like software interrupts, syscalls, and VMENTER/VMEXIT.</p><p>4. Okay, I'm slightly overinterpreting the chart. Maybe the 4096-jmp mark is due to the 4096-entry uop cache or some instruction decoder artifact? To prove this spike is indeed BTB related, I looked at the Intel BPUCLEARS.EARLY and BACLEAR.CLEAR performance counters. Their value is small for block counts under 4096 and large for block counts greater than 5378. This is strong evidence that the performance drop is indeed caused by the BPU, and likely the BTB.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2pvX64jHrEfNLMSmJO1Iv4</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Computing Euclidean distance on 144 dimensions]]></title>
            <link>https://blog.cloudflare.com/computing-euclidean-distance-on-144-dimensions/</link>
            <pubDate>Fri, 18 Dec 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Last year we deployed a CSAM image scanning tool. This is so cool! Image processing is always hard, and deploying a real image identification system at a Cloudflare scale is no small achievement! But we hit a problem - the matching algorithm was too slow for our needs. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4rNxjRfxGzhJI3MAVokLgA/8b8be5359634c3c75ecd5cfc696b6072/image1-62.png" />
            
            </figure><p>Late last year I read a blog post about <a href="/the-csam-scanning-tool/">our CSAM image scanning tool</a>. I remember thinking: this is so cool! Image processing is always hard, and deploying a real image identification system at Cloudflare is no small achievement!</p><p>Some time later, I was chatting with Kornel: "We have all the pieces in the image processing pipeline, but we are struggling with the performance of one component." Scaling to Cloudflare needs ain't easy!</p><p>The problem was in the speed of the matching algorithm itself. Let me elaborate. As John explained in his blog post <a href="/the-csam-scanning-tool">on the CSAM Scanning Tool</a>, the image matching algorithm creates a fuzzy hash from a processed image. The hash is exactly 144 bytes long. For example, it might look like this:</p>
            <pre><code>00e308346a494a188e1043333147267a 653a16b94c33417c12b433095c318012
5612442030d14a4ce82c623f4e224733 1dd84436734e4a5d6e25332e507a8218
6e3b89174e30372d</code></pre>
            <p>The hash is designed to be used in a fuzzy matching algorithm that can find "nearby", related images. The specific algorithm is well defined, but making it fast is left to the programmer — and at Cloudflare we need the matching to be done super fast. We want to match thousands of hashes per second, of images passing through our network, against a database of millions of known images. To make this work, we need to seriously optimize the matching algorithm.</p>
    <div>
      <h3>Naive quadratic algorithm</h3>
      <a href="#naive-quadratic-algorithm">
        
      </a>
    </div>
    <p>The first algorithm that comes to mind has <code>O(K*N)</code> complexity: for each of the K queries, go through every one of the N hashes in the database. A naive implementation creates a lot of work. But how much work exactly?</p><p>First, we need to explain how fuzzy matching works.</p><p>Given a query hash, the fuzzy match is the "closest" hash in the database. This requires us to define a distance. We treat each hash as a vector containing 144 numbers, identifying a point in a 144-dimensional space. Given two such points, we can calculate the distance using the standard Euclidean formula.</p><p>For our particular problem, though, we are interested in the "closest" match in the database only if the distance is lower than some predefined threshold. Otherwise, when the distance is large, we can assume the images aren't similar. This is the expected result - most of our queries will not have a related image in the database.</p><p>The <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> equation used by the algorithm is standard:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5p3VcvTiFKgdwx5xLkRRWv/33fa46b931b1764cf6e51df3198e0a23/image3-41.png" />
            
            </figure><p>To calculate the distance between two 144-byte hashes, we take each byte, calculate the delta, square it, sum it to an accumulator, do a square root, and ta-dah! We have the distance!</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive.c#L11-L20">Here's how to count the squared distance in C</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4lwPTgH8PmEYB8mMyOhbsH/f660b81e92c9ff46a422bdbf1af18321/image4-24.png" />
            
            </figure><p>This function returns the squared distance. We avoid computing the actual distance to save ourselves from running the square root function - it's slow. Inside the code, for performance and simplicity, we'll mostly operate on the squared value. We don't need the actual distance value; we just need to find the vector with the smallest one. In our case it doesn't matter whether we compare distances or squared distances!</p><p>As you can see, fuzzy matching is basically the standard problem of finding the closest point in a multi-dimensional space. Surely this has been solved in the past - but let's not jump ahead.</p><p>While this code might be simple, we expect it to be rather slow. Finding the smallest hash distance in a database of, say, 1M entries would require going over all records, and would need at least:</p><ol><li><p>144 * 1M subtractions</p></li><li><p>144 * 1M multiplications</p></li><li><p>144 * 1M additions</p></li></ol><p>And more. This alone adds up to 432 million operations! How does it look in practice? To illustrate this blog post <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2020-12-mmdist">we prepared a full test suite</a>. The large database of known hashes can be <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/generate.c">well emulated by random data</a>. The query hashes can't be random and must be slightly more sophisticated, otherwise the exercise wouldn't be that interesting. We generated the test queries by byte-swapping actual data from the database - this allows us to precisely control the distance between test hashes and database hashes. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/gentest.py">Take a look at the scripts for details</a>. Here's our first run of <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive.c">the first, naive, algorithm</a>:</p>
            <pre><code>$ make naive
&lt; test-vector.txt ./mmdist-naive &gt; test-vector.tmp
Total: 85261.833ms, 1536 items, avg 55.509ms per query, 18.015 qps</code></pre>
            <p>We matched 1,536 test hashes against a database of 1 million random vectors in 85 seconds. It took 55ms of CPU time on average to find the closest neighbour. This is rather slow for our needs.</p>
    <div>
      <h3>SIMD for help</h3>
      <a href="#simd-for-help">
        
      </a>
    </div>
    <p>An obvious improvement is to use more complex SIMD instructions. SIMD is a way to instruct the CPU to process multiple data points with one instruction. This is a perfect strategy when dealing with vector problems - as is the case for our task.</p><p>We settled on using AVX2, with 256-bit vectors. We did this for a simple reason - newer AVX versions are not supported by our AMD CPUs. Additionally, in the past, we were <a href="/on-the-dangers-of-intels-frequency-scaling/">not thrilled by the AVX-512 frequency scaling</a>.</p><p>Using AVX2 is easier said than done. There is no single instruction to compute the Euclidean distance between two uint8 vectors! The fastest way we could find to compute the full distance of two 144-byte vectors <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive-avx2.c#L13-L36">with AVX2 is authored</a> by <a href="https://twitter.com/thecomp1ler">Vlad</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7yFtHcZSRvyub5vHtjDXKE/54596189ea43193a5fd51653ec73daea/image2-39.png" />
            
            </figure><p>It’s actually simpler than it looks: load 16 bytes, convert the vector from uint8 to int16, subtract, store the intermediate sums as int32, repeat. At the end, we need four somewhat complex instructions to extract the partial sums into the final sum. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive-avx2.c">This AVX2 code</a> improves the performance around 3x:</p>
            <pre><code>$ make naive-avx2 
Total: 25911.126ms, 1536 items, avg 16.869ms per query, 59.280 qps</code></pre>
            <p>We measured 17ms per item, which is still below our expectations. Unfortunately, we can't push it much further without major changes. The problem is that this code is limited by memory bandwidth. The measurements come from my Intel i7-5557U CPU, which has the max theoretical memory bandwidth of just 25GB/s. The database of 1 million entries takes 137MiB, so it takes at least 5ms to feed the database to my CPU. With this naive algorithm we won't be able to go below that.</p>
    <div>
      <h3>Vantage Point Tree algorithm</h3>
      <a href="#vantage-point-tree-algorithm">
        
      </a>
    </div>
    <p>Since the naive brute force approach failed, we tried using more sophisticated algorithms. My colleague <a href="https://github.com/kornelski/vpsearch">Kornel Lesiński implemented</a> a super cool <a href="https://en.wikipedia.org/wiki/Vantage-point_tree">Vantage Point algorithm</a>. After a few ups and downs, optimizations and rewrites, we gave up. Our problem turned out to be unusually hard for this kind of algorithm.</p><p>We observed <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">"the curse of dimensionality"</a>. Space partitioning algorithms don't work well in problems with high dimensionality - and in our case, we have an enormous number of dimensions: 144. K-D trees are doomed. Locality-sensitive hashing is also doomed. It's a bizarre situation in which the space is unimaginably vast, yet everything is close together. The volume of the space is a 347-digit-long number, but the maximum distance between points is just 3060 (that is, sqrt(255*255*144)).</p><p>Space partitioning algorithms are fast because they gradually narrow the search space as they get closer to finding the closest point. But in our case, the common query is never close to any point in the set, so the search space can’t be narrowed to a meaningful degree.</p><p>A VP-tree was a promising candidate, because it operates only on distances, subdividing space into near and far partitions, like a binary tree. When there is a close match, it can be very fast, and doesn't need to visit more than <code>O(log(N))</code> nodes. For non-matches, though, its speed drops dramatically: the algorithm ends up visiting nearly half of the nodes in the tree, because everything is close together in 144 dimensions. Even though the algorithm avoided visiting more than half of the nodes, the cost of visiting the remaining nodes was higher, so the search ended up being slower overall.</p>
    <div>
      <h3>Smarter brute force?</h3>
      <a href="#smarter-brute-force">
        
      </a>
    </div>
    <p>This experience got us thinking. Since space partitioning algorithms can't narrow down the search, and still need to go over a very large number of items, maybe we should focus on going over all the hashes, extremely quickly. We must be smarter about memory bandwidth though — it was the limiting factor in the naive brute force approach before.</p><p>Perhaps we don't need to fetch all the data from memory.</p>
    <div>
      <h3>Short distance</h3>
      <a href="#short-distance">
        
      </a>
    </div>
    <p>The breakthrough came from the realization that we don't need to compute the full distance between hashes. Instead, we can compute only a subset of dimensions, say 32 out of the total 144. If this partial distance is already large, then there is no need to compute the full one! Computing more dimensions is never going to reduce the Euclidean distance.</p><p>The proposed algorithm works as follows:</p><p>1. Take the query hash and extract a 32-byte short hash from it</p><p>2. Go over all the 1 million 32-byte short hashes from the database. They must be densely packed in memory to allow the CPU to perform good prefetching and avoid reading data we won't need.</p><p>3. If the distance of the 32-byte short hash is greater than or equal to the best score so far, move on</p><p>4. Otherwise, investigate the hash thoroughly and compute the full distance.</p><p>Even though this algorithm needs to do less arithmetic and memory work, it's not faster than the previous naive one. See <code>make short-avx2</code>. The problem is: we still need to compute the full distance for hashes that are promising, and there are quite a lot of them. Computing the full distance for promising hashes adds enough work, both in ALU and memory latency, to offset the gains of this algorithm.</p><p>There is one detail of our particular application of the image matching problem that will help us a lot moving forward. As we described earlier, the problem is less about finding the closest neighbour and more about proving that a neighbour within a reasonable distance doesn't exist. Remember - in practice, we don't expect to find many matches! We expect almost every image we feed into the algorithm to be unrelated to the image hashes stored in the database.</p><p>It's sufficient for our algorithm to prove that no neighbour exists within a predefined distance threshold. Let's assume we are not interested in hashes more distant than, say, 220, which squared is 48,400. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-avx2.c">This makes our short-distance algorithm variation</a> work much better:</p>
            <pre><code>$ make short-avx2-threshold
Total: 4994.435ms, 1536 items, avg 3.252ms per query, 307.542 qps</code></pre>
            
    <div>
      <h3>Origin distance variation</h3>
      <a href="#origin-distance-variation">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rp4oarmXrL2JtieXZ4VWB/56dd69e834fa27e7c0c5cb0eb7544a58/image6-11.png" />
            
            </figure><p>At some point, John noted that the threshold allows an additional optimization. We can order the hashes by their distance from some origin point. Given a query hash with an origin distance of A, by the triangle inequality we only need to inspect hashes whose origin distance lies between A-threshold and A+threshold. This is pretty much how each level of a Vantage Point Tree works, just simplified. This optimization - ordering items in the database by their distance from an origin point - is relatively simple and can save us a bit of work.</p><p>While great on paper, this method doesn't bring much gain in practice, as the vectors are not grouped in clusters - they are pretty much random! For the threshold values we are interested in, the origin distance algorithm variation gives us a ~20% speed boost, which is okay but not breathtaking. This change might bring more benefits if we ever decide to reduce the threshold value, so it might be worth doing for the production implementation. However, it doesn't work well with query batching.</p>
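<p>A minimal sketch of this pruning, assuming the database's origin distances are precomputed and sorted ascending (all names are illustrative):</p>

```c
/* Hedged sketch of origin-distance pruning. By the triangle inequality,
   any hash within `threshold` of the query must have an origin distance
   between A - threshold and A + threshold, where A is the query's own
   distance to the origin. Two binary searches find that slice. */
#include <stddef.h>

/* First index i with a[i] >= key (a[] sorted ascending). */
static size_t lower_bound(const double *a, size_t n, double key) {
    size_t left = 0, right = n;
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        if (a[mid] < key) left = mid + 1; else right = mid;
    }
    return left;
}

/* First index i with a[i] > key. */
static size_t upper_bound(const double *a, size_t n, double key) {
    size_t left = 0, right = n;
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        if (a[mid] <= key) left = mid + 1; else right = mid;
    }
    return left;
}

/* Computes the half-open slice [*begin, *end) of candidates whose origin
   distance can possibly be within `threshold` of the query's. */
static void candidate_range(const double *origin_dist, size_t n,
                            double query_origin_dist, double threshold,
                            size_t *begin, size_t *end) {
    *begin = lower_bound(origin_dist, n, query_origin_dist - threshold);
    *end   = upper_bound(origin_dist, n, query_origin_dist + threshold);
}
```

<p>The full-distance scan then runs only over that slice instead of the whole database.</p>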
    <div>
      <h3>Transposing data for better AVX</h3>
      <a href="#transposing-data-for-better-avx">
        
      </a>
    </div>
    <p>But we're not done with AVX optimizations! The usual problem with AVX is that the instructions don't normally fit a specific problem. Some serious mind twisting is required to adapt the right instruction to the problem, or to reverse the problem so that a specific instruction can be used. AVX2 doesn't have useful "horizontal" uint16 subtract, multiply and add operations. For example, <code>_mm_hadd_epi16</code> exists, but it's slow and cumbersome.</p><p>Instead, we can twist the problem to make use of the fast "vertical" uint16 operations that are available. For example we can use:</p><ol><li><p>_mm256_sub_epi16</p></li><li><p>_mm256_mullo_epi16</p></li><li><p>and _mm256_add_epi16.</p></li></ol><p>The <code>add</code> would overflow in our case, but fortunately there is the add-saturate _mm256_adds_epu16.</p><p>The saturated <code>add</code> is great and saves us a conversion to uint32. It just adds a small limitation: the threshold passed to the program (i.e., the max squared distance) must fit into a uint16. However, this is fine for us.</p><p>To use these instructions effectively we need to transpose the data in the database. Instead of storing hashes in rows, we can store them in columns:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29kHAIq6T1Mfif5MVP3Jol/7369ea0b4a38a416a4367bb9865996ca/image1-63.png" />
            
            </figure><p>So instead of:</p><ol><li><p>[a1, a2, a3],</p></li><li><p>[b1, b2, b3],</p></li><li><p>[c1, c2, c3],</p></li></ol><p>...</p><p>We can lay it out in memory transposed:</p><ol><li><p>[a1, b1, c1],</p></li><li><p>[a2, b2, c2],</p></li><li><p>[a3, b3, c3],</p></li></ol><p>...</p><p>Now we can load the first byte of 16 hashes using one memory operation. In the next step, we can subtract the first byte of the query hash from all of them with a single instruction, and so on. The algorithm stays exactly the same as defined above; we just make the data easier to load and easier to process for AVX.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-inv-avx2.c#L138-L147">The hot loop code</a> even looks relatively pretty:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Xb2h3rI8mXT1yeiQcVYfn/4a7eb556d7aea0014c3944b301b52b0d/image5-26.png" />
            
            </figure><p>With the well-tuned batch size and short distance size parameters we can <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-inv-avx2.c">see the performance of this algorithm</a>:</p>
            <pre><code>$ make short-inv-avx2
Total: 1118.669ms, 1536 items, avg 0.728ms per query, 1373.062 qps</code></pre>
            <p>Whoa! This is pretty awesome. We started from 55ms per query, and we finished with just 0.73ms. There are further micro-optimizations possible, like memory prefetching or using huge pages to reduce page faults, but they have diminishing returns at this point.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4dBsIg3yHloEmyLqUyfETK/3119d166040ef5d6b96d03acc9873120/image7-8.png" />
            
            </figure><p>Roofline model from Denis Bakhvalov's book</p><p>If you are interested in architectural tuning such as this, take a look at <a href="https://book.easyperf.net/perf_book">the new performance book by Denis Bakhvalov</a>. It discusses roofline model analysis, which is pretty much what we did here.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2020-12-mmdist">Do take a look at our code</a> and tell us if we missed some optimization!</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>What an optimization journey! We jumped between memory-bottlenecked and ALU-bottlenecked code. We discussed more sophisticated algorithms, but in the end, a brute-force algorithm — although tuned — gave us the best results.</p><p>To get even better numbers, I experimented with an Nvidia GPU using CUDA. The CUDA intrinsics like <code>vabsdiff4</code> and <code>dp4a</code> fit the problem perfectly. The V100 gave us some amazing numbers, but I wasn't fully satisfied with it. Considering how many AMD Ryzen cores with AVX2 we can get for the cost of a single server-grade GPU, we leaned towards general purpose computing for this particular problem.</p><p>This is a great example of the type of complexities we deal with every day. Making even the best technologies work “at Cloudflare scale” requires thinking outside the box. Sometimes we rewrite the solution dozens of times before we find the optimal one. And sometimes we settle on a brute-force algorithm, just very, very optimized.</p><p>The computation of hashes and image matching are challenging problems that require running very CPU-intensive operations. The CPU we have available on the edge is scarce and workloads like this are incredibly expensive. Even with the optimization work described in this blog post, running the CSAM scanner at scale is a challenge and has required a huge engineering effort. And we’re not done! We need to solve more hard problems before we're satisfied. If you want to help, consider <a href="https://www.cloudflare.com/careers/">applying</a>!</p> ]]></content:encoded>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed]]></category>
            <guid isPermaLink="false">3xsrCKMSJ12rBqyQDNejcu</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why is there a "V" in SIGSEGV Segmentation Fault?]]></title>
            <link>https://blog.cloudflare.com/why-is-there-a-v-in-sigsegv-segmentation-fault/</link>
            <pubDate>Thu, 18 Jun 2020 11:56:33 GMT</pubDate>
            <description><![CDATA[ My program received a SIGSEGV signal and crashed with "Segmentation Fault" message. Where does the "V" come from? 

Did I read it wrong? Was there a "Segmentation Vault"? Or did the Linux authors make a mistake? Shouldn't the signal be named SIGSEGF? 
 ]]></description>
            <content:encoded><![CDATA[ <p>Another long night. I was working on my perfect, bug-free program in C, when the predictable thing happened:</p>
            <pre><code>$ clang skynet.c -o skynet
$ ./skynet
Segmentation fault (core dumped)</code></pre>
            <p>Oh, well... Maybe I'll have better luck taking over the world another night. But then it struck me. My program received a SIGSEGV signal and crashed with a "Segmentation Fault" message. Where does the "V" come from?</p><p>Did I read it wrong? Was there a "Segmentation <i>V</i>ault"? Or did the Linux authors make a mistake? Shouldn't the signal be named SIGSEGF?</p><p>I asked my colleagues and <a href="https://twitter.com/dwragg">David Wragg</a> quickly told me that the signal name stands for "Segmentation Violation". I guess that makes sense. A long, long time ago, computers used to have <a href="https://en.wikipedia.org/wiki/X86_memory_segmentation">memory segmentation</a>. Each memory segment had a defined length - called the Segment Limit. Accessing data over this limit caused a processor fault. This error code got re-used <a href="https://en.wikipedia.org/wiki/Memory_segmentation#Segmentation_with_paging">by newer systems that used paging</a>. I think the Intel manuals call this error <a href="https://en.wikipedia.org/wiki/Page_fault#Invalid">"Invalid Page Fault"</a>. When it's triggered, it gets reported to userspace as a SIGSEGV signal. End of story.</p><p>Or is it?</p><p><a href="https://twitter.com/mahtin">Martin Levy</a> pointed me to an ancient <a href="http://man.cat-v.org/unix-6th/2/signal">Version 6 UNIX documentation on "signal"</a>. This is from around 1978:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4guHgMnVrZOyzBS0j2vOF3/1794a378bb035a8f938e3c11fb65de19/IMG_5177.jpg" />
            
            </figure><p>Look carefully. There is no SIGSEGV signal! Signal number 11 is called SIGSEG!</p><p>It seems that userspace parts of the UNIX tree (i.e. /usr/include/signal.h) switched to SIGSEGV fairly early on. But the kernel internals continued to use the name SIGSEG for much longer.</p><p>Looking deeper, David found that the PDP11 trap vector used the wording <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V4-Snapshot-Development/sys/ken/low.s#L59">"segmentation violation"</a>. This shows up in Research V4 Edition in the UNIX history repo, but it doesn't mean it was introduced in V4 - it's just because V4 is the first version with code still available.</p><p>This trap was converted into the <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V4-Snapshot-Development/sys/ken/trap.c#L73">SIGSEG signal in the trap.c</a> file.</p><p>The file /usr/include/signal.h appears in the tree for Research V7, <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V7-Snapshot-Development/usr/include/signal.h">with the name SIGSEGV</a>. But the kernel <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V7-Snapshot-Development/usr/sys/sys/trap.c#L177">still called it SIGSEG at the time</a>.</p><p>It seems the kernel side was <a href="https://github.com/dspinellis/unix-history-repo/blob/BSD-4/usr/src/sys/sys/trap.c#L67">renamed to SIGSEGV in BSD-4</a>.</p><p>Here you go. Originally the signal was called SIGSEG. It was subsequently renamed to SIGSEGV in userspace, and a bit later - around 1980 - on the kernel side. Apparently there are still no Segmentation Vaults found on UNIX systems.</p><p>As for my original crash, I fixed it - of course - by catching the signal and jumping over the offending instruction. On Linux it is totally possible to catch and handle SIGSEGV. With that fix, my code will never again crash. For sure.</p>
            <pre><code>#define _GNU_SOURCE
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;ucontext.h&gt;

static void sighandler(int signo, siginfo_t *si, void* v_context)
{
    /* Skip the faulting instruction by advancing the saved
       instruction pointer into the nop sled below. */
    ucontext_t *context = v_context;
    context-&gt;uc_mcontext.gregs[REG_RIP] += 10;
}

int *totally_null_pointer = NULL;

int main() {
    struct sigaction psa;
    memset(&amp;psa, 0, sizeof(psa));
    psa.sa_flags = SA_SIGINFO; /* use the three-argument sa_sigaction handler */
    psa.sa_sigaction = sighandler;
    sigaction(SIGSEGV, &amp;psa, NULL);

    printf("Before NULL pointer dereference\n");
    *totally_null_pointer = 1;
    __asm__ __volatile__("nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;");
    printf("After NULL pointer. Still here!\n");

    return 0;
}</code></pre>
             ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4vXHuB3JMI6WziJEJtc90H</guid>
            <dc:creator>Marek Majkowski</dc:creator>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Conntrack tales - one thousand and one flows]]></title>
            <link>https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/</link>
            <pubDate>Mon, 06 Apr 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ We were wondering - can we just enable Linux "conntrack"? How does it actually work? I volunteered to help the team understand the dark corners of the Linux's "conntrack" stateful firewall subsystem. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we develop new products at a great pace. Their needs often challenge the architectural assumptions we made in the past. For example, years ago we decided to avoid using Linux's "conntrack" - stateful firewall facility. This brought great benefits - it simplified our iptables firewall setup, sped up the system a bit and made the inbound packet path easier to understand.</p><p>But eventually our needs changed. One of our new products had a reasonable need for it. But we weren't confident - can we just enable conntrack and move on? How does it actually work? I volunteered to help the team understand the dark corners of the "conntrack" subsystem.</p>
    <div>
      <h2>What is conntrack?</h2>
      <a href="#what-is-conntrack">
        
      </a>
    </div>
    <p>"Conntrack" is part of the Linux network stack, specifically part of the firewall subsystem. To put that into perspective: early firewalls were entirely stateless. They could express only basic logic, like: allow SYN packets to ports 80 and 443, and block everything else.</p><p>The stateless design gave some basic <a href="https://www.cloudflare.com/learning/network-layer/network-security/">network security</a>, but was quickly deemed insufficient. You see, there are certain things that can't be expressed in a stateless way. The canonical example is the assessment of ACK packets - it's impossible to say if an ACK packet is legitimate or part of a port scanning attempt, without tracking the connection state.</p><p>To fill such gaps, all the operating systems implemented connection tracking inside their firewalls. This tracking is usually implemented as a big table, with at least 6 columns: protocol (usually TCP or UDP), source IP, source port, destination IP, destination port and connection state. On Linux this subsystem is called "conntrack" and is often enabled by default. Here's how the table looks on my laptop, inspected with the "conntrack -L" command:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57PxFTDUBFZqODrE5OrOul/a9d34a54bf247021ba000b15119461bc/image5-2.png" />
            
            </figure><p>The obvious question is how large this state tracking table can be. This setting is under "/proc/sys/net/nf_conntrack_max":</p>
            <pre><code>$ cat /proc/sys/net/nf_conntrack_max
262144</code></pre>
            <p>This is a global setting, but the limit is per container. On my system each container, or "network namespace", can have up to 256K conntrack entries.</p><p>What exactly happens when the number of concurrent connections exceeds the conntrack limit?</p>
    <div>
      <h2>Testing conntrack is hard</h2>
      <a href="#testing-conntrack-is-hard">
        
      </a>
    </div>
    <p>In the past, testing conntrack was hard - it required a complex hardware or VM setup. Fortunately, these days we can use modern "user namespace" facilities which do permission magic, allowing an unprivileged user to feel like root. Using the tool "unshare" it's possible to create an isolated environment where we can precisely control the packets going through, and experiment with iptables and conntrack without threatening the health of our host system. With appropriate parameters, it's possible to create and manage a networking namespace, including access to namespaced iptables and conntrack, from an unprivileged user.</p><p>This script is the heart of our test:</p>
            <pre><code># Enable tun interface
ip tuntap add name tun0 mode tun
ip link set tun0 up
ip addr add 192.0.2.1 peer 192.0.2.2 dev tun0
ip route add 0.0.0.0/0 via 192.0.2.2 dev tun0

# Refer to conntrack at least once to ensure it's enabled
iptables -t raw -A PREROUTING -j CT
# Create a counter in mangle table
iptables -t mangle -A PREROUTING
# Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP

tcpdump -ni any -B 16384 -ttt &amp;
...
./venv/bin/python3 send_syn.py

conntrack -L
# Show iptables counters
iptables -nvx -t raw -L PREROUTING
iptables -nvx -t mangle -L PREROUTING</code></pre>
            <p>This bash script is shortened for readability. See the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-04-conntrack-syn/test-1.bash">full version here</a>. The accompanying "send_syn.py" is just sending 10 SYN packets over the "tun0" interface. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-04-conntrack-syn/send_syn.py">Here is the source</a>, but allow me to paste it here - showing off "scapy" is always fun:</p>
            <pre><code>tun = TunTapInterface("tun0", mode_tun=True)
tun.open()

for i in range(10000,10000+10):
    ip=IP(src="198.18.0.2", dst="192.0.2.1")
    tcp=TCP(sport=i, dport=80, flags="S")
    send(ip/tcp, verbose=False, inter=0.01, socket=tun)</code></pre>
            <p>The bash script above contains a couple of gems. Let's walk through them.</p><p>First, please note that we can't just inject packets into the loopback interface using <a href="http://man7.org/linux/man-pages/man7/raw.7.html">SOCK_RAW sockets</a>. The Linux networking stack is a complex beast. The semantics of sending packets over a SOCK_RAW socket are different from delivering a packet over a real interface. We'll discuss this later, but for now, to avoid triggering unexpected behaviour, we will deliver packets over a tun/tap device, which better emulates a real interface.</p><p>Then we need to make sure conntrack is active in the network namespace we wish to use for testing. Traditionally, just loading the kernel module would have done that, but in the brave new world of containers and network namespaces, a method had to be found to allow conntrack to be active in some containers and inactive in others. Hence this is tied to usage - rules referencing conntrack must exist in the namespace's iptables for conntrack to be active inside the container.</p><p>As a side note, <a href="https://lwn.net/Articles/740455/">containers triggering the host to load kernel modules</a> is an <a href="https://github.com/weaveworks/go-odp/blob/6b0aa22550d9325eb8f43418185859e13dc0de1d/odp/dpif.go#L67-L90">interesting subject</a>.</p><p>After the "-t raw -A PREROUTING" rule, we added a "-t mangle -A PREROUTING" rule - but notice, it doesn't have any action! This syntax is allowed by iptables, and it is pretty useful for getting iptables to report rule counters. We'll need these counters soon. A careful reader might suggest looking at "policy" counters in iptables to achieve our goal. Sadly, "policy" counters (increased for each packet entering a chain) work only if there is at least one rule inside the chain.</p><p>The rest of the steps are self-explanatory. We set up "tcpdump" in the background and send 10 SYN packets to 192.0.2.1:80 using the "scapy" Python library. Then we print the conntrack table and iptables counters.</p><p>Let's see this script in action. Remember to run it in a networking namespace as fake root, with "unshare -Ur -n":</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUiZXoGQrhYU3Lugqbx97/9db644c8c56f944445fc7ebb559d6d95/image6.png" />
            
            </figure><p>This is all nice. First we see a "tcpdump" listing showing 10 SYN packets. Then we see the conntrack table state, showing 10 created flows. Finally, we see iptables counters in two rules we created, each showing 10 packets processed.</p>
    <div>
      <h2>Can conntrack table fill up?</h2>
      <a href="#can-conntrack-table-fill-up">
        
      </a>
    </div>
    <p>Given that the conntrack table is size constrained, what exactly happens when it fills up? Let's check it out. First, we need to drop the conntrack size. As mentioned it's controlled by a global toggle - it's necessary to tune it on the host side. Let's reduce the table size to 7 entries, and repeat our test:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2iH78254UDI6cAFqqiAWcy/92aafe709b997fcc234279d2966077c1/image4-3.png" />
            
            </figure><p>This is getting interesting. We still see the 10 inbound SYN packets. We still see that the "-t raw PREROUTING" table received 10 packets, but this is where similarities end. The "-t mangle PREROUTING" table saw only 7 packets. Where did the three missing SYN packets go?</p><p>It turns out they went where all the dead packets go. They were hard dropped. Conntrack on overfill does exactly that. It even complains in the "dmesg":</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Gtrs2Na2a4NKQbNllKw8w/ddf3755fff4e9b89dc455260b8a14107/image1-1.png" />
            
            </figure><p>This is confirmed by our iptables counters. Let's review the <a href="https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg">famous iptables</a> diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wBlH7JteM8K0KVBp6P6CG/c1d17141b74544ca9809714354bf05ca/image7.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Netfilter-packet-flow.svg">image</a> by <a href="https://commons.wikimedia.org/wiki/User_talk:Jengelh">Jan Engelhardt</a> CC BY-SA 3.0</p><p>As we can see, the "-t raw PREROUTING" happens before conntrack, while "-t mangle PREROUTING" is just after it. This is why we see 10 and 7 packets reported by our iptables counters.</p><p>Let me emphasize the gravity of our discovery. We showed three completely valid SYN packets being implicitly dropped by "conntrack". There is no explicit "-j DROP" iptables rule. There is no configuration to be toggled. Just the fact of using "conntrack" means that, when it's full, packets creating new flows will be dropped. No questions asked.</p><p>This is the dark side of using conntrack. If you use it, you absolutely must make sure it doesn't get filled.</p><p>We could end our investigation here, but there are a couple of interesting caveats.</p>
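<p>The behaviour we just observed can be modeled with a toy flow table in Python - a sketch of the concept only, nothing like the kernel's actual implementation:</p>

```python
class ToyConntrack:
    """Toy model of a fixed-size flow table: packets belonging to a known
    flow always pass, while packets that would create a new flow are
    silently dropped once the table is full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.flows = set()

    def accept(self, proto, saddr, sport, daddr, dport):
        key = (proto, saddr, sport, daddr, dport)
        if key in self.flows:
            return True   # existing flow - always passes
        if len(self.flows) >= self.max_entries:
            return False  # table full - new flow implicitly dropped
        self.flows.add(key)
        return True

# Replay our experiment: a table of 7 entries, 10 SYNs from distinct ports.
ct = ToyConntrack(max_entries=7)
verdicts = [ct.accept("tcp", "198.18.0.2", 10000 + i, "192.0.2.1", 80)
            for i in range(10)]
# The first 7 SYNs create flows; the last 3 are dropped, matching the counters.
```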
    <div>
      <h2>Strict vs loose</h2>
      <a href="#strict-vs-loose">
        
      </a>
    </div>
    <p>Conntrack supports "strict" and "loose" modes, as configured by the "nf_conntrack_tcp_loose" toggle.</p>
            <pre><code>$ cat /proc/sys/net/netfilter/nf_conntrack_tcp_loose
1</code></pre>
            <p>By default, it's set to "loose", which means that stray ACK packets for unseen TCP flows will create new flow entries in the table. We can generalize: when the table is full, "conntrack" will implicitly drop any packet that creates a new flow, whether that's a SYN or just a stray ACK.</p><p>What happens when we clear the setting with "nf_conntrack_tcp_loose=0"? This is a subject for another blog post, but suffice to say - it's a mess. First, this setting is not settable in the network namespace scope - although it should be. To test it you need to be in the root network namespace. Then, due to twisted logic, the ACK will be dropped on a full conntrack table, even though in this case it doesn't create a flow. If the table is not full, the ACK packet will pass through, marked "-ctstate INVALID" from the "mangle" table onward.</p>
    <div>
      <h2>When doesn't a conntrack entry get created?</h2>
      <a href="#when-doesnt-a-conntrack-entry-get-created">
        
      </a>
    </div>
    <p>There are important situations when a conntrack entry is not created. For example, we could replace these lines in our script:</p>
            <pre><code># Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP</code></pre>
            <p>With those:</p>
            <pre><code># Make sure inbound SYN packets don't go to networking stack
iptables -A INPUT -j DROP</code></pre>
            <p>Naively we could think dropping SYN packets past the conntrack layer would not interfere with the created flows. This is not correct. In spite of these SYN packets having been seen by conntrack, no flow state is created for them. Packets hitting "-j DROP" will not create new conntrack flows. Pretty magical, isn't it?</p>
    <div>
      <h2>Full conntrack table causes EPERM</h2>
      <a href="#full-conntrack-causes-with-eperm">
        
      </a>
    </div>
    <p>Recently we hit a case when a "sendto()" syscall on UDP socket from one of our applications was erroring with EPERM. This is pretty weird, and not documented in the man page. My colleague had no doubts:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4qoszkkHAOaErPNjv6cwPZ/c22d83083c706014dff1c47dd3beb517/image9-1.png" />
            
            </figure><p>I'll save you the gruesome details, but indeed, a full conntrack table will do that to your new UDP flows - you will get EPERM. Beware. Funnily enough, it's possible to get EPERM in other ways too, when an outbound packet is dropped in the OUTPUT firewall chain. For example:</p>
            <pre><code>marek:~$ sudo iptables -I OUTPUT -p udp --dport 53 --dst 192.0.2.8 -j DROP
marek:~$ strace -e trace=write nc -vu 192.0.2.8 53
write(3, "X", 1)                        = -1 EPERM (Operation not permitted)
+++ exited with 1 +++</code></pre>
            <p>If you ever receive EPERM from "sendto()", you might want to treat it as a transient error if you suspect a full conntrack table, or as a permanent error if you blame the iptables configuration.</p><p>This is also why we can't send our SYN packets directly using SOCK_RAW sockets in our test. Let's see what happens on conntrack overfill with the standard "hping3" tool:</p>
            <pre><code>$ hping3 -S -i u10000 -c 10 --spoof 198.18.0.2 192.0.2.1 -p 80 -I lo
HPING 192.0.2.1 (lo 192.0.2.1): S set, 40 headers + 0 data bytes
[send_ip] sendto: Operation not permitted</code></pre>
            <p>"send()" even on a SOCK_RAW socket fails with EPERM when the conntrack table is full.</p>
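<p>If your application has to live with this, one pragmatic approach is to treat EPERM from "sendto()" as potentially transient and retry. A hypothetical sketch - the helper name and retry policy are mine, not from any library:</p>

```python
import time

def send_with_retry(sock, payload, addr, attempts=3, backoff=0.05):
    """Call sock.sendto(), retrying on EPERM, which Python surfaces as
    PermissionError. A full conntrack table can drain on its own, so a
    few retries with backoff may succeed; the last failure propagates."""
    for attempt in range(attempts):
        try:
            return sock.sendto(payload, addr)
        except PermissionError:
            if attempt == attempts - 1:
                raise  # still failing - give up and report EPERM
            time.sleep(backoff * (attempt + 1))
```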
    <div>
      <h2>Full conntrack can happen on a SYN flood</h2>
      <a href="#full-conntrack-can-happen-on-a-syn-flood">
        
      </a>
    </div>
    <p>There is one more caveat. During a SYN flood, the conntrack entries will totally be created for the spoofed flows. Take a look at the second test case we prepared, this time correctly listening on port 80 and sending SYN+ACKs:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2vNW5JQuRFZyumzKOrIQVW/ea983dcbf2e8a876a8f3264b3104d4d4/image8.png" />
            
            </figure><p>We can see 7 SYN+ACKs flying out of the port 80 listening socket. The final three SYNs go nowhere, as they are dropped by conntrack.</p><p>This has important implications. If you use conntrack on publicly accessible ports, then during a SYN flood <a href="/syn-packet-handling-in-the-wild/">mitigation technologies like SYN Cookies</a> won't help. You are still at risk of running out of conntrack space and therefore affecting legitimate connections.</p><p>For this reason, as a general rule, consider avoiding conntrack on inbound connections ("-j NOTRACK"). Alternatively, apply some reasonable rate limits at the iptables layer with "-j DROP". This will work well and won't create new flows, as we discussed above. The best method, though, would be to trigger SYN Cookies from a layer before conntrack, like XDP. But that is a subject for another time.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Over the years Linux conntrack has gone through many changes and has improved a lot. While performance used to be a major concern, these days it's considered to be very fast. Dark corners remain. Correctly applying conntrack is tricky.</p><p>In this blog post we showed how it's possible to test parts of conntrack with "unshare" and a series of scripts. We showed the behaviour when the conntrack table gets filled - packets might implicitly be dropped. Finally, we mentioned the curious case of SYN floods where incorrectly applied conntrack may cause harm.</p><p>Stay tuned for more horror stories as we dig deeper and deeper into the Linux networking stack guts.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">13hrLdB4ySqi2j6KpGDkBy</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[When Bloom filters don't bloom]]></title>
            <link>https://blog.cloudflare.com/when-bloom-filters-dont-bloom/</link>
            <pubDate>Mon, 02 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Last month finally I had an opportunity to use Bloom filters. I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4bQ9cbvVLCJTUntwCHSGSp/570583980831b19e4da88411fdff5eda/bloom-filter_2x.png" />
            
            </figure><p>I've known about <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a> (named after Burton Bloom) since university, but I hadn't had an opportunity to use them in anger. Last month that changed - I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.</p><p>While doing research about <a href="/the-root-cause-of-large-ddos-ip-spoofing/">IP spoofing</a>, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian data center. This problem might sound simple, but in the ever-evolving landscape of the Internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EyhhPLD0IymvVMr3R8pMz/2c30f18b77fdee9184c438f80e94781b/Screenshot-from-2020-03-01-23-57-10.png" />
            
            </figure><p>This reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like the BGP table), etc. The same line would usually be repeated across multiple files.</p><p>I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn't working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines took... about an eternity:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lzqpR8gHJRr6XC4fbcepJ/91185ef0b6036ecd484bd791f5a72651/Screenshot-from-2020-03-01-23-25-19a.png" />
            
            </figure><p>Needless to say, deduplicating lines using the usual bash commands like 'sort' in various configurations (see '--parallel', '--buffer-size' and '--unique') was not optimal for such a large data set.</p>
    <div>
      <h2>Bloom filters to the rescue</h2>
      <a href="#bloom-filters-to-the-rescue">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vR5EB9TdKmVJDjStuvpwL/fe51959f24694cc5c072f3263aa42ab0/Bloom_filter.png" />
            
            </figure><p><a href="https://en.wikipedia.org/wiki/Bloom_filter#/media/File:Bloom_filter.svg">Image</a> by <a href="https://commons.wikimedia.org/wiki/User:David_Eppstein">David Eppstein</a> Public Domain</p><p>Then I had a brainwave - it's not necessary to sort the lines! I just need to remove duplicated lines - using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (number of unique lines), and I can live with some data points being lost - using a probabilistic data structure is fine!</p><p>Bloom filters are a perfect fit!</p><p>While you should go and read <a href="https://en.wikipedia.org/wiki/Bloom_filter#Algorithm_description">Wikipedia on Bloom Filters</a>, here is how I look at this data structure.</p><p>How would you implement a "<a href="https://en.wikipedia.org/wiki/Set_(abstract_data_type)">set</a>"? Given a perfect hash function and infinite memory, we could just create an infinite bit array and set bit number 'hash(item)' for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn't exist, so in reality we have to compromise. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function and 128GiB of memory. 128GiB is 1,099,511,627,776 bits, so the probability that the second item added to the bit array collides with the first is 1 in 1,099,511,627,776. The probability of collision worsens as we add more items and fill up the bit array.</p><p>Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. A Bloom filter is a bunch of math on top of four variables:</p><ul><li><p>'n' - The number of input elements (cardinality)</p></li><li><p>'m' - Memory used by the bit-array</p></li><li><p>'k' - Number of hash functions computed for each input</p></li><li><p>'p' - Probability of a false positive match</p></li></ul><p>Given the 'n' input cardinality and the 'p' desired probability of false positives, the Bloom filter math returns the 'm' memory required and the 'k' number of hash functions needed.</p><p>Check out this excellent visualization by Thomas Hurst showing how the parameters influence each other:</p><ul><li><p><a href="https://hur.st/bloomfilter/">https://hur.st/bloomfilter/</a></p></li></ul>
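<p>The math tying the four variables together can be sketched in a few lines of Python - these are the standard optimal-Bloom-filter formulas, not code from mmuniq-bloom:</p>

```python
import math

def bloom_params(n, p):
    """Given expected cardinality n and target false positive rate p,
    return (m, k): the bits of memory and the number of hash functions,
    using the standard optimal Bloom filter formulas."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# Example: 40M unique lines with a 1-in-10,000 false positive rate needs
# roughly 767M bits (~91 MiB) of memory and 13 hash functions.
m, k = bloom_params(40_000_000, 0.0001)
```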
    <div>
      <h2>mmuniq-bloom</h2>
      <a href="#mmuniq-bloom">
        
      </a>
    </div>
    <p>Guided by this intuition, I set out on a journey to add a new tool to my toolbox - 'mmuniq-bloom', a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than the 'sort' + 'uniq' combo!</p><p>Here it is:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-bloom.c">'mmuniq-bloom.c'</a></p></li></ul><p>For simplicity and speed I designed 'mmuniq-bloom' with a couple of assumptions. First, unless otherwise instructed, it uses k=8 hash functions. This seems to be a close-to-optimal number for the data sizes I'm working with, and the hash function can quickly output 8 decent hashes. Then we align 'm', the number of bits in the bit array, to be a power of two. This is to avoid the pricey % modulo operation, which compiles down to the slow assembly 'div'. With power-of-two sizes we can just do a bitwise AND. (For a fun read, see <a href="https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi">how compilers can optimize some divisions by using multiplication by a magic constant</a>.)</p><p>We can now run it against the same data file we used before:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWA2sCxdTwbHW0JVE4dLm/ffffd6f6823305f41a15d0091d1acaef/image11.png" />
            
            </figure><p>Oh, this is so much better! 12 seconds is much more manageable than the 2 minutes before. But hold on... The program is using an optimized data structure, a relatively limited memory footprint, optimized line-parsing and good output buffering... 12 seconds is still an eternity compared to the 'wc -l' tool:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/10yS38GPKuBTqMMd4ZJVSe/199b9ceea61eb91700012362ef92e0ab/image5.png" />
            
            </figure><p>What is going on? I understand that counting lines with 'wc' is <i>easier</i> than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in 'mmuniq-bloom' go?</p><p>It must be my hash function. 'wc' doesn't need to spend all this CPU performing strange math for each of the 40M input lines. I'm using a pretty non-trivial 'siphash24' hash function, so it surely burns the CPU, right? Let's check by running code that computes the hash function but does <i>not</i> perform any Bloom filter operations:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cmgeZbT0nJNh9bwRC4gEH/b5a71c8c573d506f20f6b6bc01173028/image2.png" />
            
            </figure><p>This is strange. Computing the hash function indeed costs about 2s, but the program took 12s in the previous run. The Bloom filter alone takes 10 seconds? How is that possible? It's such a simple data structure...</p>
    <div>
      <h2>A secret weapon - a profiler</h2>
      <a href="#a-secret-weapon-a-profiler">
        
      </a>
    </div>
    <p>It was time to use a proper tool for the task - let's fire up a profiler and see where the CPU goes. First, let's run 'strace' to confirm we are not making any unexpected syscalls:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29N25pGnOTso9WBxn4SckO/4a8bbb0dccf4dfb5fd2a248625d29bd9/image14.png" />
            
            </figure><p>Everything looks good. The 10 calls to 'mmap', each taking 4ms (3971 us), are intriguing, but that's fine. We pre-populate memory up front with 'MAP_POPULATE' to save on page faults later.</p><p>What is the next step? Of course Linux's 'perf'!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/632ygHAtmgUaNH18WsUPAG/b9f3990786d96bf84b2bc28d5a55f0e3/image10.png" />
            
            </figure><p>Then we can see the results:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61GXvk0ujwFsP0vEDyrD6c/ab4ec96b0e748f5dd8e6e496d1b499b4/image6.png" />
            
            </figure><p>Right, so we indeed burn 87.2% of cycles in our hot code. Let's see where exactly. Doing 'perf annotate process_line --source' quickly shows something I didn't expect.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3i10E5RjlO3vvJTuA751wv/da7db1a7614e3936218206689767e752/image3.png" />
            
            </figure><p>You can see 26.90% of the CPU burned in the 'mov', but that's not all of it! The compiler correctly inlined the function and unrolled the loop 8-fold. Summed up, that 'mov' - or rather the 'uint64_t v = *p' line - accounts for the great majority of cycles!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sTjVPwEz6BILccJz8pdDa/2e3435805a1e092460dcb916f4d6ecb2/image4.png" />
            
            </figure><p>Clearly 'perf' must be mistaken: how can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using 'google-perftools' with kcachegrind since they emit eye-candy charts:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fWWh8EZelfOMfwA5XMT16/4ded280402c936f2a6f67e2930115f5e/Screenshot-from-2020-03-02-00-08-23.png" />
            
            </figure><p>The rendered result looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2p5hkIX3zVAO7IFFWEVZuF/14f1f9505b91e15c904dab7309a11907/image13.png" />
            
            </figure><p>Allow me to summarise what we found so far.</p><p>The generic 'wc' tool takes 0.45s of CPU time to process the 600MiB file. Our optimized 'mmuniq-bloom' tool takes 12 seconds. The CPU is burned on one 'mov' instruction, dereferencing memory...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1eRwMPBq6BEAyWB0g10EDo/3802664ab1b47d94b96845f487338a08/6784957048_4661ea7dfc_c.jpg" />
            
            </figure><p><a href="https://flickr.com/photos/jonicdao/6784957048">Image</a> by <a href="https://flickr.com/photos/jonicdao/">Jose Nicdao</a> CC BY/2.0</p><p>Oh! How could I have forgotten? Random memory access <i>is</i> slow! It's very, very, very slow!</p><p>According to the general rule <a href="http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html">"latency numbers every programmer should know about"</a>, one RAM fetch is about 100ns. Let's do the math: 40 million lines, 8 hashes counted for each line. Since our Bloom filter is 128MiB, on <a href="/gen-x-performance-tuning/">our older hardware</a> it doesn't fit into the L3 cache! The hashes are uniformly distributed across the large memory range - each hash generates a memory miss. Adding it together that's...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ORPcRAqG2H2xqeEEdGbmh/ef6968c82bfe0aa44706a4f36e59bb1c/Screenshot-from-2020-03-02-00-34-29.png" />
            
            </figure><p>That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It's easy to see with 'perf stat -d':</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pgVS7zKpVhueO7rAhQDRI/5145db41e7b06f5875d1da4adfa92142/image9.png" />
            
            </figure><p>Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn't explain why the program was running only 12 seconds. But it doesn't really matter. What matters is that the number of cache misses is a real problem and we can only fix it by reducing the number of memory accesses. Let's try tuning the Bloom filter to use only one hash function:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30BFuINOqYvVgCCdYXCzUB/0893742e7bc41b923af6e67f53629049/image12.png" />
            
            </figure><p>Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive ratio of one error per 10k lines. This is terrible!</p><p>Also, it doesn't seem like we improved much. It took the OS 22 seconds to prepare memory for us, but we still burned 11 seconds in userspace. I guess any benefits from hitting memory less often were offset by a lower cache-hit probability due to the drastically increased memory size. In the previous runs we required only 128MiB for the Bloom filter!</p>
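<p>For reference, the back-of-the-envelope estimate for the original 8-hash run is easy to reproduce in a couple of lines, assuming the ~100ns RAM fetch figure quoted earlier:</p>

```python
lines = 40_000_000       # input lines
hashes_per_line = 8      # k = 8 Bloom filter hashes
fetch_ns = 100           # rough cost of one uncached RAM access

# Worst case, if every hash lookup missed all caches:
total_seconds = lines * hashes_per_line * fetch_ns / 1e9
print(total_seconds)  # 32.0 seconds of pure memory latency
```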
    <div>
      <h2>Dumping Bloom filters altogether</h2>
      <a href="#dumping-bloom-filters-altogether">
        
      </a>
    </div>
    <p>This is getting ridiculous. To get the same false positive guarantees, we either must use many hash functions in the Bloom filter (like 8) and therefore many memory operations, or we can have 1 hash function, but enormous memory requirements.</p><p>We aren't really constrained by available memory; instead we want to optimize for fewer memory accesses. All we need is a data structure that requires at most 1 memory miss per item and uses less than 64 GiB of RAM...</p><p>While we could think of more sophisticated data structures like a <a href="https://en.wikipedia.org/wiki/Cuckoo_filter">Cuckoo filter</a>, maybe we can go simpler. How about a good old hash table with linear probing?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PVz5hd2DqyiraxgMJ7KIp/37706f6ffcc1cd52f544d626381deaeb/linear-probing.png" />
            
            </figure><p><a href="https://www.sysadmins.lv/blog-en/array-search-hash-tables-behind-the-scenes.aspx">Image</a> by <a href="https://www.sysadmins.lv/about.aspx">Vadims Podāns</a></p>
    <div>
      <h2>Welcome mmuniq-hash</h2>
      <a href="#welcome-mmuniq-hash">
        
      </a>
    </div>
    <p>Here you can find a tweaked version of mmuniq-bloom, this time using a hash table:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-hash.c">'mmuniq-hash.c'</a></p></li></ul><p>Instead of storing bits as for the Bloom filter, we are now storing 64-bit hashes from the <a href="https://idea.popcount.org/2013-01-24-siphash/">'siphash24' function</a>. This gives us much stronger probability guarantees, with a probability of false positives much better than one error in 10k lines.</p><p>Let's do the math. Adding a new item to a hash table containing, say, 40M entries has a '40M/2^64' chance of hitting a hash collision. This is about one in 461 billion - a reasonably low probability. But we are not adding one item to a pre-filled set! Instead, we are adding 40M lines to an initially empty set. As per the <a href="https://en.wikipedia.org/wiki/Birthday_problem">birthday paradox</a>, this has much higher chances of hitting a collision at some point. A decent approximation is 'n^2/2m', which in our case is '(40M)<sup>2</sup>/(2*2<sup>64</sup>)'. This is a chance of one in 23000. In other words, assuming we are using a good hash function, one in every 23 thousand random sets of 40M items will have a hash collision. This chance of hitting a collision is non-negligible, but it's still better than with a Bloom filter and totally acceptable for my use case.</p><p>The hash table code runs faster, has better memory access patterns and a better false positive probability than the Bloom filter approach.</p>
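<p>The two probability estimates above are quick to verify numerically:</p>

```python
n = 40_000_000          # lines in the input
space = 2 ** 64         # distinct 64-bit hash values

p_single = n / space             # chance one extra item collides with the set
p_any = n * n / (2 * space)      # birthday-bound approximation n^2 / 2m

print(round(1 / p_single / 1e9))  # ~461, i.e. one in 461 billion
print(round(1 / p_any))           # ~23000 runs per expected collision
```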
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7i3tnYhlVqJ7NEfutgywVD/dafe27be80727cd61533f548457bc4a3/image7.png" />
            
            </figure><p>Don't be scared by the "hash conflicts" line; it just indicates how full the hash table was. We are using linear probing, so when a bucket is already used, we just pick the next bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.</p><p>From the previous exercise we know our hash function takes about 2 seconds of this time. Therefore, it's fair to say the 40M memory hits take around 4 seconds.</p>
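<p>The core idea can be sketched in a few lines of Python - an illustration of linear probing, not the actual C from 'mmuniq-hash.c'. Note the sketch treats hash value 0 as "empty slot", which a real implementation must handle specially:</p>

```python
class ProbingSet:
    """Open-addressing set of 64-bit hashes with linear probing."""
    def __init__(self, capacity=1 << 16):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.slots = [0] * capacity   # 0 means empty; size well above item count
        self.mask = capacity - 1

    def check_and_add(self, h: int) -> bool:
        """Return True if hash h was already present, else insert it."""
        i = h & self.mask
        while self.slots[i] != 0:      # walk consecutive buckets...
            if self.slots[i] == h:
                return True            # ...until we find our hash
            i = (i + 1) & self.mask    # ...or an empty slot (wrapping around)
        self.slots[i] = h
        return False
```

One probe sequence touches consecutive buckets, so after the first (likely missing) fetch, the following buckets usually sit in the same or an already-prefetched cache line.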
    <div>
      <h2>Lessons learned</h2>
      <a href="#lessons-learned">
        
      </a>
    </div>
    <p>Modern CPUs are really good at sequential memory access when it's possible to predict memory fetch patterns (see <a href="https://en.wikipedia.org/wiki/Cache_prefetching#Methods_of_hardware_prefetching">Cache prefetching</a>). Random memory access, on the other hand, is very costly.</p><p>Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets that don't fit in L3, prefer optimizing for a reduced number of loads over optimizing the amount of memory used.</p><p>I guess it's fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news: Bloom filters optimize for memory usage, not for memory access. For example, see <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">the Cuckoo Filters paper</a>.</p><p>Another thing is the everlasting discussion about hash functions. Frankly, in most cases it doesn't matter. The cost of computing even complex hash functions like 'siphash24' is small compared to the cost of random memory access. In our case, simplifying the hash function would bring only small benefits. The CPU time is simply spent somewhere else - waiting for memory!</p><p>One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they <a href="http://www.di-srv.unisa.it/~vitsca/SC-2011/DesignPrinciplesMulticoreProcessors/Wulf1995.pdf">hit the memory wall</a>".</p><p>Finally, don't repeat my mistakes - everyone should start profiling with 'perf stat -d' and look at the "Instructions per cycle" (IPC) counter. If it's below 1, it generally means the program is stuck waiting for memory. Values above 2 would be great; it would mean the workload is mostly CPU-bound. Sadly, I have yet to see high values in the workloads I'm dealing with...</p>
    <div>
      <h2>Improved mmuniq</h2>
      <a href="#improved-mmuniq">
        
      </a>
    </div>
    <p>With the help of my colleagues I've prepared a further improved version of the 'mmuniq' hash table based tool. See the code:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq.c">'mmuniq.c'</a></p></li></ul><p>It is able to dynamically resize the hash table, to support inputs of unknown cardinality. Then, by using batching, it can effectively use the 'prefetch' CPU hint, speeding up the program by 35-40%. Beware, sprinkling the code with 'prefetch' rarely works. Instead, I specifically changed the flow of algorithms to take advantage of this instruction. With all the improvements I got the run time down to 2.1 seconds:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6cnwoob2q1SVG27I7zqals/87621808c31bd4707e08e816a4a58480/Screenshot-from-2020-03-01-23-52-18.png" />
            
            </figure>
    <div>
      <h2>The end</h2>
      <a href="#the-end">
        
      </a>
    </div>
    <p>Writing this basic tool, which tries to be faster than the 'sort | uniq' combo, revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency and the power of cache-friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Tools]]></category>
            <guid isPermaLink="false">3CPWTXjZJXbtWVNIawBWsd</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[When TCP sockets refuse to die]]></title>
            <link>https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/</link>
            <pubDate>Fri, 20 Sep 2019 15:53:33 GMT</pubDate>
            <description><![CDATA[ We noticed something weird - the TCP sockets which we thought should have been closed - were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!

We naively thought enabling TCP keepalives would be enough... but it isn't! ]]></description>
            <content:encoded><![CDATA[ <p>While working on our <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum server</a>, we noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4J7NyByY5rLMwGCjildxuX/a80e344de39529860fe89230fff4259c/Tcp_state_diagram_fixed_new.svga.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg">Image</a> by Sergiodc2 CC BY SA 3.0</p><p>In our code, we wanted to make sure we don't hold connections to dead hosts. In our early code we naively thought enabling TCP keepalives would be enough... but it isn't. It turns out a fairly modern <a href="https://tools.ietf.org/html/rfc5482">TCP_USER_TIMEOUT</a> socket option is equally important. Furthermore, it interacts with TCP keepalives in subtle ways. <a href="http://codearcana.com/posts/2015/08/28/tcp-keepalive-is-a-lie.html">Many people</a> are confused by this.</p><p>In this blog post, we'll try to show how these options work. We'll show how a TCP socket can time out during various stages of its lifetime, and how TCP keepalives and user timeout influence that. To better illustrate the internals of TCP connections, we'll mix the outputs of the <code>tcpdump</code> and the <code>ss -o</code> commands. This nicely shows the transmitted packets and the changing parameters of the TCP connections.</p>
    <div>
      <h2>SYN-SENT</h2>
      <a href="#syn-sent">
        
      </a>
    </div>
    <p>Let's start from the simplest case - what happens when one attempts to establish a connection to a server which discards inbound SYN packets?</p><p>The scripts used here <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">are available on our GitHub</a>.</p><p><code>$ sudo ./test-syn-sent.py
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,940ms,0)

00:01.028 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.044 IP host.2 &gt; host.1: Flags [S] # second retry
00:07.236 IP host.2 &gt; host.1: Flags [S] # third retry
00:15.427 IP host.2 &gt; host.1: Flags [S] # fourth retry
00:31.560 IP host.2 &gt; host.1: Flags [S] # fifth retry
01:04.324 IP host.2 &gt; host.1: Flags [S] # sixth retry
02:10.000 connect ETIMEDOUT</code></p><p>Ok, this was easy. After the <code>connect()</code> syscall, the operating system sends a SYN packet. Since it didn't get any response the OS will by default retry sending it 6 times. This can be tweaked by the sysctl:</p><p><code>$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6</code></p><p>It's possible to override this setting per-socket with the TCP_SYNCNT setsockopt:</p><p><code>int cnt = 6;
setsockopt(sd, IPPROTO_TCP, TCP_SYNCNT, &amp;cnt, sizeof(cnt));</code></p><p>The retries are staggered at the 1s, 3s, 7s, 15s, 31s, 63s marks (the retransmission timeout starts at 1s and doubles after each attempt). By default, the whole process takes about 130 seconds, until the kernel gives up with the ETIMEDOUT errno. At this moment in the lifetime of a connection, SO_KEEPALIVE settings are ignored, but TCP_USER_TIMEOUT is not. For example, setting it to 5000ms will cause the following interaction:</p><p><code>$ sudo ./test-syn-sent.py 5000
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,996ms,0)

00:01.016 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.032 IP host.2 &gt; host.1: Flags [S] # second retry
00:05.016 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.024 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.036 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.044 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.050 connect ETIMEDOUT</code></p><p>Even though we set the user timeout to 5s, we still saw six SYN retries on the wire. This behaviour is probably a bug (as tested on a 5.2 kernel): we would expect only two retries to be sent - at the 1s and 3s marks - and the socket to expire at the 5s mark. Instead, we saw the two expected retries, but also four further retransmitted SYN packets aligned to the 5s mark - which makes no sense. Anyhow, we learned a thing - TCP_USER_TIMEOUT does affect the behaviour of <code>connect()</code>.</p>
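<p>For completeness, here is how both knobs can be set from Python before calling <code>connect()</code>. This is a sketch: both constants are Linux-only members of the <code>socket</code> module:</p>

```python
import socket

# Configure SYN retry behaviour before connect(); Linux-only socket options.
sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT, 2)           # at most 2 SYN retries
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 5000)  # give up after 5000 ms

syncnt = sd.getsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT)
user_timeout_ms = sd.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)
sd.close()
```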
    <div>
      <h2>SYN-RECV</h2>
      <a href="#syn-recv">
        
      </a>
    </div>
    <p>SYN-RECV sockets are usually hidden from the application. They live as mini-sockets on the SYN queue. We wrote about <a href="/syn-packet-handling-in-the-wild/">the SYN and Accept queues in the past</a>. Sometimes, when SYN cookies are enabled, the sockets may skip the SYN-RECV state altogether.</p><p>In SYN-RECV state, the socket will retry sending SYN+ACK 5 times as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5</code></p><p>Here is how it looks on the wire:</p><p><code>$ sudo ./test-syn-recv.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
# all subsequent packets dropped
00:00.000 IP host.1 &gt; host.2: Flags [S.] # initial SYN+ACK

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,996ms,0)

00:01.033 IP host.1 &gt; host.2: Flags [S.] # first retry
00:03.045 IP host.1 &gt; host.2: Flags [S.] # second retry
00:07.301 IP host.1 &gt; host.2: Flags [S.] # third retry
00:15.493 IP host.1 &gt; host.2: Flags [S.] # fourth retry
00:31.621 IP host.1 &gt; host.2: Flags [S.] # fifth retry
01:04.610 SYN-RECV disappears</code></p><p>With default settings, the SYN+ACK is re-transmitted at the 1s, 3s, 7s, 15s, 31s marks, and the SYN-RECV socket disappears at the 64s mark.</p><p>Neither SO_KEEPALIVE nor TCP_USER_TIMEOUT affect the lifetime of SYN-RECV sockets.</p>
    <div>
      <h2>Final handshake ACK</h2>
      <a href="#final-handshake-ack">
        
      </a>
    </div>
    <p>After receiving the second packet in the TCP handshake - the SYN+ACK - the client socket moves to an ESTABLISHED state. The server socket remains in SYN-RECV until it receives the final ACK packet.</p><p>Losing this ACK doesn't change anything - the server socket will just take a bit longer to move from SYN-RECV to ESTAB. Here is how it looks:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # initial ACK, dropped

State    Recv-Q Send-Q Local:Port  Peer:Port
SYN-RECV 0      0      host:1      host:2 timer:(on,1sec,0)
ESTAB    0      0      host:2      host:1

00:01.014 IP host.1 &gt; host.2: Flags [S.]
00:01.014 IP host.2 &gt; host.1: Flags [.]  # retried ACK, dropped

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,1.012ms,1)
ESTAB    0      0      host:2     host:1</code></p><p>As you can see, SYN-RECV has the "on" timer, the same as in the example before. We might argue this final ACK doesn't really carry much weight. This thinking led to the development of the TCP_DEFER_ACCEPT feature - it basically causes the third ACK to be silently dropped. With this flag set, the socket remains in the SYN-RECV state until it receives the first packet with actual data:</p><p><code>$ sudo ./test-syn-ack.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # delivered, but the socket stays as SYN-RECV

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,7.192ms,0)
ESTAB    0      0      host:2     host:1

00:08.020 IP host.2 &gt; host.1: Flags [P.], length 11  # payload moves the socket to ESTAB

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 11     0      host:1     host:2
ESTAB 0      0      host:2     host:1</code></p><p>The server socket remained in the SYN-RECV state even after receiving the final TCP-handshake ACK. It has a funny "on" timer, with the counter stuck at 0 retries. It is converted to ESTAB - and moved from the SYN to the accept queue - after the client sends a data packet or after the TCP_DEFER_ACCEPT timer expires. Basically, with DEFER ACCEPT the SYN-RECV mini-socket <a href="https://marc.info/?l=linux-netdev&amp;m=118793048828251&amp;w=2">discards the data-less inbound ACK</a>.</p>
    <div>
      <h2>Idle ESTAB is forever</h2>
      <a href="#idle-estab-is-forever">
        
      </a>
    </div>
    <p>Let's move on and discuss a fully-established socket connected to an unhealthy (dead) peer. After completion of the handshake, the sockets on both sides move to the ESTABLISHED state, like this:</p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:2     host:1
ESTAB 0      0      host:1     host:2</code></p><p>These sockets have no running timer by default - they will remain in that state forever, even if the communication is broken. The TCP stack will notice problems only when one side attempts to send something. This raises a question - what to do if you don't plan on sending any data over a connection? How do you make sure an idle connection is healthy, without sending any data over it?</p><p>This is where TCP keepalives come in. Let's see it in action - in this example we used the following toggles:</p><ul><li><p>SO_KEEPALIVE = 1 - Let's enable keepalives.</p></li><li><p>TCP_KEEPIDLE = 5 - Send first keepalive probe after 5 seconds of idleness.</p></li><li><p>TCP_KEEPINTVL = 3 - Send subsequent keepalive probes after 3 seconds.</p></li><li><p>TCP_KEEPCNT = 3 - Time out after three failed probes.</p></li></ul><p><code>$ sudo ./test-idle.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      0      host:2     host:1  timer:(keepalive,2.992ms,0)

# all subsequent packets dropped
00:05.083 IP host.2 &gt; host.1: Flags [.], ack 1 # first keepalive probe
00:08.155 IP host.2 &gt; host.1: Flags [.], ack 1 # second keepalive probe
00:11.231 IP host.2 &gt; host.1: Flags [.], ack 1 # third keepalive probe
00:14.299 IP host.2 &gt; host.1: Flags [R.], seq 1, ack 1</code></p><p>Indeed! We can clearly see the first probe sent at the 5s mark and the two remaining probes 3s apart - exactly as we specified. After a total of three sent probes, and a further three seconds of delay, the connection dies with ETIMEDOUT, and finally the RST is transmitted.</p><p>For keepalives to work, the send buffer must be empty. You can notice the keepalive timer active in the "timer:(keepalive)" line.</p>
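<p>The exact toggles used by the experiment translate directly to setsockopt calls. A Linux-only Python sketch (these constants are not portable to all platforms):</p>

```python
import socket

sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sk.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable keepalives
sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # first probe after 5s idle
sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)  # subsequent probes every 3s
sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # give up after 3 failed probes
# A dead idle peer is now detected after roughly 5 + 3*3 = 14 seconds.

keepidle = sk.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
keepintvl = sk.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL)
keepcnt = sk.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT)
sk.close()
```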
    <div>
      <h2>Keepalives with TCP_USER_TIMEOUT are confusing</h2>
      <a href="#keepalives-with-tcp_user_timeout-are-confusing">
        
      </a>
    </div>
    <p>We mentioned the TCP_USER_TIMEOUT option before. It sets the maximum amount of time that transmitted data may remain unacknowledged before the kernel forcefully closes the connection. On its own, it doesn't do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped. However, this socket option does change the semantics of TCP keepalives. <a href="https://linux.die.net/man/7/tcp">The tcp(7) manpage</a> is somewhat confusing:</p><p><i>Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive to determine when to close a connection due to keepalive failure.</i></p><p>The original commit message has slightly more detail:</p><ul><li><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c75e7e545694a9dd6288553f55c53e2a3a3">tcp: Add TCP_USER_TIMEOUT socket option</a></p></li></ul><p>To understand the semantics, we need to look at the <a href="https://github.com/torvalds/linux/blob/b41dae061bbd722b9d7fa828f35d22035b218e18/net/ipv4/tcp_timer.c#L693-L697">kernel code in linux/net/ipv4/tcp_timer.c:693</a>:</p><p><code>if ((icsk-&gt;icsk_user_timeout != 0 &amp;&amp;
elapsed &gt;= msecs_to_jiffies(icsk-&gt;icsk_user_timeout) &amp;&amp;
icsk-&gt;icsk_probes_out &gt; 0) ||</code></p><p>For the user timeout to have any effect, the <code>icsk_probes_out</code> must not be zero. The check for user timeout is done only <i>after</i> the first probe went out. Let's check it out. Our connection settings:</p><ul><li><p>TCP_USER_TIMEOUT = 5*1000 - 5 seconds</p></li><li><p>SO_KEEPALIVE = 1 - enable keepalives</p></li><li><p>TCP_KEEPIDLE = 1 - send first probe quickly - 1 second idle</p></li><li><p>TCP_KEEPINTVL = 11 - subsequent probes every 11 seconds</p></li><li><p>TCP_KEEPCNT = 3 - send three probes before timing out</p></li></ul><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# all subsequent packets dropped
00:01.001 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.233 IP host.2 &gt; host.1: Flags [R.] # timer for second probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>So what happened? The connection sent the first keepalive probe at the 1s mark. Seeing no response, the TCP stack then woke up 11 seconds later to send a second probe. This time though, it executed the USER_TIMEOUT code path, which decided to terminate the connection immediately.</p><p>What if we bump TCP_USER_TIMEOUT to a larger value, say between the second and third probe? Then the connection will be closed on the third probe timer. With TCP_USER_TIMEOUT set to 12.5s:</p><p><code>00:01.022 IP host.2 &gt; host.1: Flags [.] # first probe
00:12.094 IP host.2 &gt; host.1: Flags [.] # second probe
00:23.102 IP host.2 &gt; host.1: Flags [R.] # timer for third probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>We’ve shown how TCP_USER_TIMEOUT interacts with keepalives for small and medium values. The last case is when TCP_USER_TIMEOUT is extraordinarily large. Say we set it to 30s:</p><p><code>00:01.027 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.195 IP host.2 &gt; host.1: Flags [.], ack 1 # second probe
00:23.207 IP host.2 &gt; host.1: Flags [.], ack 1 # third probe
00:34.211 IP host.2 &gt; host.1: Flags [.], ack 1 # fourth probe! But TCP_KEEPCNT was only 3!
00:45.219 IP host.2 &gt; host.1: Flags [.], ack 1 # fifth probe!
00:56.227 IP host.2 &gt; host.1: Flags [.], ack 1 # sixth probe!
01:07.235 IP host.2 &gt; host.1: Flags [R.], seq 1 # TCP_USER_TIMEOUT aborts conn on 7th probe timer</code></p><p>We saw six keepalive probes on the wire! With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:</p>
            <pre><code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code></pre>
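<p>A tiny helper makes the relationship concrete (a hypothetical function name, illustrating the formula; the values are from the first keepalive experiment above):</p>

```python
def max_sensible_user_timeout(keepidle, keepintvl, keepcnt):
    """Upper bound in seconds: at or above this, TCP_KEEPCNT never gets a say."""
    return keepidle + keepintvl * keepcnt

# First keepalive experiment: idle 5s, interval 3s, 3 probes -> died at ~14s.
# So a TCP_USER_TIMEOUT for that setup should be set just below:
print(max_sensible_user_timeout(5, 3, 3))  # 14
```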
            
    <div>
      <h2>Busy ESTAB socket is not forever</h2>
      <a href="#busy-estab-socket-is-not-forever">
        
      </a>
    </div>
    <p>Thus far we have discussed the case where the connection is idle. Different rules apply when the connection has unacknowledged data in a send buffer.</p><p>Let's prepare another experiment - after the three-way handshake, let's set up a firewall to drop all packets. Then, let's do a <code>send</code> on one end to have some dropped packets in-flight. An experiment shows the sending socket dies after ~16 minutes:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# All subsequent packets dropped
00:00.206 IP host.2 &gt; host.1: Flags [P.], length 11 # first data packet
00:00.412 IP host.2 &gt; host.1: Flags [P.], length 11 # early retransmit, doesn't count
00:00.620 IP host.2 &gt; host.1: Flags [P.], length 11 # 1st retry
00:01.048 IP host.2 &gt; host.1: Flags [P.], length 11 # 2nd retry
00:01.880 IP host.2 &gt; host.1: Flags [P.], length 11 # 3rd retry

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      11     host:2     host:1    timer:(on,1.304ms,3)

00:03.543 IP host.2 &gt; host.1: Flags [P.], length 11 # 4th
00:07.000 IP host.2 &gt; host.1: Flags [P.], length 11 # 5th
00:13.656 IP host.2 &gt; host.1: Flags [P.], length 11 # 6th
00:26.968 IP host.2 &gt; host.1: Flags [P.], length 11 # 7th
00:54.616 IP host.2 &gt; host.1: Flags [P.], length 11 # 8th
01:47.868 IP host.2 &gt; host.1: Flags [P.], length 11 # 9th
03:34.360 IP host.2 &gt; host.1: Flags [P.], length 11 # 10th
05:35.192 IP host.2 &gt; host.1: Flags [P.], length 11 # 11th
07:36.024 IP host.2 &gt; host.1: Flags [P.], length 11 # 12th
09:36.855 IP host.2 &gt; host.1: Flags [P.], length 11 # 13th
11:37.692 IP host.2 &gt; host.1: Flags [P.], length 11 # 14th
13:38.524 IP host.2 &gt; host.1: Flags [P.], length 11 # 15th
15:39.500 connection ETIMEDOUT</code></p><p>The data packet is retransmitted 15 times, as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15</code></p><p>From the <a href="https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt"><code>ip-sysctl.txt</code></a> documentation:</p><p><i>The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout.</i></p><p>The connection indeed died at ~940 seconds. Notice the socket has the "on" timer running. It doesn't matter at all if we set SO_KEEPALIVE - when the "on" timer is running, keepalives are not engaged.</p><p>TCP_USER_TIMEOUT keeps on working though. The connection will be aborted <i>exactly</i> user-timeout after the last received packet. With the user timeout set, the <code>tcp_retries2</code> value is ignored.</p>
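<p>The quoted 924.6-second figure follows from summing the successive retransmission timeouts: the RTO starts at TCP_RTO_MIN (200 ms), doubles after every retry, and is capped at TCP_RTO_MAX (120 s). A quick Python check of that arithmetic:</p>

```python
RTO_MIN = 0.2    # TCP_RTO_MIN, in seconds
RTO_MAX = 120.0  # TCP_RTO_MAX, in seconds

def hypothetical_timeout(retries):
    # One RTO after the initial transmission plus one after each retry;
    # the RTO doubles every time and is capped at RTO_MAX.
    return sum(min(RTO_MIN * 2 ** i, RTO_MAX) for i in range(retries + 1))

print(hypothetical_timeout(15))  # ~924.6 seconds for tcp_retries2 = 15
```

The first ten periods sum to 204.6 seconds, and the remaining six are clamped to 120 seconds each, giving 924.6 seconds.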
    <div>
      <h2>Zero window ESTAB is... forever?</h2>
      <a href="#zero-window-estab-is-forever">
        
      </a>
    </div>
    <p>There is one final case worth mentioning. If the sender has plenty of data, and the receiver is slow, then TCP flow control kicks in. At some point the receiver will ask the sender to stop transmitting new data. This is a slightly different condition than the one described above.</p><p>In this case, with flow control engaged, there is no in-flight or unacknowledged data. Instead the receiver throttles the sender with a "zero window" notification. Then the sender periodically checks if the condition is still valid with "window probes". In this experiment we reduced the receive buffer size for simplicity. Here's how it looks on the wire:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.], win 1152
00:00.000 IP host.2 &gt; host.1: Flags [.]</code></p><p><code>00:00.202 IP host.2 &gt; host.1: Flags [.], length 576 # first data packet
00:00.202 IP host.1 &gt; host.2: Flags [.], ack 577, win 576
00:00.202 IP host.2 &gt; host.1: Flags [P.], length 576 # second data packet
00:00.244 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # throttle it! zero-window</code></p><p><code>00:00.456 IP host.2 &gt; host.1: Flags [.], ack 1 # zero-window probe
00:00.456 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # nope, still zero-window</code></p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 1152   0      host:1     host:2
ESTAB 0      129920 host:2     host:1  timer:(persist,048ms,0)</code></p><p>The packet capture shows a couple of things. First, we can see two packets with data, each 576 bytes long. Both were immediately acknowledged. The second ACK carried a "win 0" notification: the sender was told to stop sending data.</p><p>But the sender is eager to send more! The last two packets show a first "window probe": the sender will periodically send payload-less "ack" packets to check if the window size has changed. As long as the receiver keeps on answering, the sender will keep on sending such probes forever.</p><p>The socket information shows three important things:</p><ul><li><p>The read buffer of the reader is filled - thus the "zero window" throttling is expected.</p></li><li><p>The write buffer of the sender is filled - we have more data to send.</p></li><li><p>The sender has a "persist" timer running, counting the time until the next "window probe".</p></li></ul><p>In this blog post we are interested in timeouts - what will happen if the window probes are lost? Will the sender notice?</p><p>By default, the window probe is retried 15 times - adhering to the usual <code>tcp_retries2</code> setting.</p><p>The TCP timer is in the <code>persist</code> state, so TCP keepalives will <i>not</i> be running. The SO_KEEPALIVE settings don't make any difference when window probing is engaged.</p><p>As expected, the TCP_USER_TIMEOUT toggle keeps on working. A slight difference is that, as with keepalives, it's evaluated only when the retransmission timer fires: if more than user-timeout seconds have passed since the last good packet, the connection will be aborted.</p>
    <div>
      <h2>Note about using application timeouts</h2>
      <a href="#note-about-using-application-timeouts">
        
      </a>
    </div>
    <p>In the past we have shared an interesting war story:</p><ul><li><p><a href="/the-curious-case-of-slow-downloads/">The curious case of slow downloads</a></p></li></ul><p>Our HTTP server gave up on the connection after an application-managed timeout fired. This was a bug - a slow connection might have correctly slowly drained the send buffer, but the application server didn't notice that.</p><p>We abruptly dropped slow downloads, even though this wasn't our intention. We just wanted to make sure the client connection was still healthy. It would be better to use TCP_USER_TIMEOUT than rely on application-managed timeouts.</p><p>But this is not sufficient. We also wanted to guard against a situation where a client stream is valid, but is stuck and doesn't drain the connection. The only way to achieve this is to periodically check the amount of unsent data in the send buffer, and see if it shrinks at a desired pace.</p><p>For typical applications sending data to the Internet, I would recommend:</p><ol><li><p>Enable TCP keepalives. This is needed to keep some data flowing in the idle-connection case.</p></li><li><p>Set TCP_USER_TIMEOUT to <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>.</p></li><li><p>Be careful when using application-managed timeouts. To detect TCP failures use TCP keepalives and user-timeout. If you want to spare resources and make sure sockets don't stay alive for too long, consider periodically checking if the socket is draining at the desired pace. You can use <code>ioctl(TIOCOUTQ)</code> for that, but it counts both data buffered (notsent) on the socket and in-flight (unacknowledged) bytes. A better way is to use TCP_INFO tcpi_notsent_bytes parameter, which reports only the former counter.</p></li></ol><p>An example of checking the draining pace:</p><p><code>while True:
    notsent1 = get_tcp_info(c).tcpi_notsent_bytes
    notsent1_ts = time.time()
    ...
    poll.poll(POLL_PERIOD)
    ...
    notsent2 = get_tcp_info(c).tcpi_notsent_bytes
    notsent2_ts = time.time()
    pace_in_bytes_per_second = (notsent1 - notsent2) / (notsent2_ts - notsent1_ts)
    if pace_in_bytes_per_second &gt; 12000:
        pass  # pace is above effective rate of 96Kbps, ok!
    else:
        pass  # socket is too slow...</code></p><p>There are ways to further improve this logic. We could use <a href="https://lwn.net/Articles/560082/"><code>TCP_NOTSENT_LOWAT</code></a>, although it's generally only useful for situations where the send buffer is relatively empty. Then we could use the <a href="https://www.kernel.org/doc/Documentation/networking/timestamping.txt"><code>SO_TIMESTAMPING</code></a> interface for notifications about when data gets delivered. Finally, if we are done sending the data to the socket, it's possible to just call <code>close()</code> and defer handling of the socket to the operating system. Such a socket will be stuck in FIN-WAIT-1 or LAST-ACK state until it correctly drains.</p>
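<p>The <code>get_tcp_info()</code> helper above is pseudocode. On Linux it can be approximated by parsing the raw TCP_INFO socket option. The 144-byte offset of <code>tcpi_notsent_bytes</code> below matches the <code>struct tcp_info</code> layout of recent kernels, but it is an assumption for illustration, not a stable ABI - check your kernel headers:</p>

```python
import socket
import struct

# Offset of tcpi_notsent_bytes inside struct tcp_info on recent Linux
# kernels: 8 bytes of u8 fields, 24 u32s, 4 u64s, then segs_out/segs_in.
# Assumption for illustration; verify against linux/tcp.h on your system.
TCPI_NOTSENT_BYTES_OFFSET = 144

def tcpi_notsent_bytes(sock):
    # Ask the kernel for the raw struct tcp_info and pull one u32 out of it.
    buf = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 512)
    return struct.unpack_from("I", buf, TCPI_NOTSENT_BYTES_OFFSET)[0]
```

For a freshly connected socket with nothing queued, this returns 0; it grows as the send buffer fills with data the kernel hasn't yet put on the wire.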
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this post we discussed five cases where the TCP connection may notice the other party going away:</p><ul><li><p>SYN-SENT: The duration of this state can be controlled by <code>TCP_SYNCNT</code> or <code>tcp_syn_retries</code>.</p></li><li><p>SYN-RECV: It's usually hidden from the application. It is tuned by <code>tcp_synack_retries</code>.</p></li><li><p>An idle ESTABLISHED connection will never notice any issues. The solution is to use TCP keepalives.</p></li><li><p>A busy ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li><li><p>A zero-window ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li></ul><p>The last two ESTABLISHED cases in particular can be customized with TCP_USER_TIMEOUT, but this setting also affects other situations. Generally speaking, it can be thought of as a hint to the kernel to abort the connection a given number of seconds after the last good packet. This is a dangerous setting though, and if used in conjunction with TCP keepalives it should be set to a value slightly lower than <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.</p><p>In this post we presented scripts showing the effects of timeout-related socket options under various network conditions. Interleaving the <code>tcpdump</code> packet capture with the output of <code>ss -o</code> is a great way of understanding the networking stack. We were able to create reproducible test cases showing the "on", "keepalive" and "persist" timers in action. This is a very useful framework for further experimentation.</p><p>Finally, it's surprisingly hard to tune a TCP connection to be confident that the remote host is actually up. During our debugging we found that looking at the send buffer size and currently active TCP timer can be very helpful in understanding whether the socket is actually healthy. 
The bug in our Spectrum application turned out to be a wrong TCP_USER_TIMEOUT setting - without it sockets with large send buffers were lingering around for way longer than we intended.</p><p>The scripts used in this article <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">can be found on our GitHub</a>.</p><p>Figuring this out has been a collaboration across three Cloudflare offices. Thanks to <a href="https://twitter.com/Hirenpanchasara">Hiren Panchasara</a> from San Jose, <a href="https://twitter.com/warrncn">Warren Nelson</a> from Austin and <a href="https://twitter.com/jkbs0">Jakub Sitnicki</a> from Warsaw. Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p> ]]></content:encoded>
            <category><![CDATA[SYN]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <guid isPermaLink="false">PTYUwpDIf4wDZ50CejAvL</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[A gentle introduction to Linux Kernel fuzzing]]></title>
            <link>https://blog.cloudflare.com/a-gentle-introduction-to-linux-kernel-fuzzing/</link>
            <pubDate>Wed, 10 Jul 2019 13:07:21 GMT</pubDate>
            <description><![CDATA[ For some time I’ve wanted to play with coverage-guided fuzzing. I decided to have a go at the Linux Kernel netlink machinery.  It's a good target: it's an obscure part of kernel, and it's relatively easy to automatically craft valid messages. ]]></description>
            <content:encoded><![CDATA[ <p>For some time I’ve wanted to play with coverage-guided <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing</a>. Fuzzing is a powerful testing technique where an automated program feeds semi-random inputs to a tested program. The intention is to find such inputs that trigger bugs. Fuzzing is especially useful in finding memory corruption bugs in C or C++ programs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/281ZRgfWvQ84ZxrUD29Bq2/f7f78fbced1487215ae7ce8e3801b55e/4152779709_d1ea8dd3b4_z.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/patrick_s/4152779709">Image</a> by <a href="https://www.flickr.com/photos/patrick_s/">Patrick Shannon</a> CC BY 2.0</p><p>Normally it's recommended to pick a well known, but little explored, library that is heavy on parsing. Historically things like libjpeg, libpng and libyaml were perfect targets. Nowadays it's harder to find a good target - everything seems to have been fuzzed to death already. That's a good thing! I guess the software is getting better! Instead of choosing a userspace target I decided to have a go at the Linux Kernel netlink machinery.</p><p><a href="https://en.wikipedia.org/wiki/Netlink">Netlink is an internal Linux facility</a> used by tools like "ss", "ip", "netstat". It's used for low level networking tasks - configuring network interfaces, IP addresses, routing tables and such. It's a good target: it's an obscure part of kernel, and it's relatively easy to automatically craft valid messages. Most importantly, we can learn a lot about Linux internals in the process. Bugs in netlink aren't going to have security impact though - netlink sockets usually require privileged access anyway.</p><p>In this post we'll run <a href="http://lcamtuf.coredump.cx/afl/">AFL fuzzer</a>, driving our netlink shim program against a custom Linux kernel. All of this running inside KVM virtualization.</p><p>This blog post is a tutorial. With the easy to follow instructions, you should be able to quickly replicate the results. All you need is a machine running Linux and 20 minutes.</p>
    <div>
      <h2>Prior work</h2>
      <a href="#prior-work">
        
      </a>
    </div>
    <p>The technique we are going to use is formally called "coverage-guided fuzzing". There's a lot of prior literature:</p><ul><li><p><a href="https://blog.trailofbits.com/2017/02/16/the-smart-fuzzer-revolution/">The Smart Fuzzer Revolution</a> by Dan Guido, and <a href="https://lwn.net/Articles/677764/">LWN article</a> about it</p></li><li><p><a href="https://j00ru.vexillium.org/talks/blackhat-eu-effective-file-format-fuzzing-thoughts-techniques-and-results/">Effective file format fuzzing</a> by Mateusz “j00ru” Jurczyk</p></li><li><p><a href="http://honggfuzz.com/">honggfuzz</a> by Robert Swiecki, is a modern, feature-rich coverage-guided fuzzer</p></li><li><p><a href="https://google.github.io/clusterfuzz/">ClusterFuzz</a></p></li><li><p><a href="https://github.com/google/fuzzer-test-suite">Fuzzer Test Suite</a></p></li></ul><p>Many people have fuzzed the Linux Kernel in the past. Most importantly:</p><ul><li><p><a href="https://github.com/google/syzkaller/blob/master/docs/syzbot.md">syzkaller (aka syzbot)</a> by Dmitry Vyukov, is a very powerful CI-style continuously running kernel fuzzer, which found hundreds of issues already. It's an awesome machine - it will even report the bugs automatically!</p></li><li><p><a href="https://github.com/kernelslacker/trinity">Trinity fuzzer</a></p></li></ul><p>We'll use <a href="http://lcamtuf.coredump.cx/afl/">the AFL</a>, everyone's favorite fuzzer. AFL was written by <a href="http://lcamtuf.coredump.cx">Michał Zalewski</a>. It's well known for its ease of use, speed and very good mutation logic. 
It's a perfect choice for people starting their journey into fuzzing!</p><p>If you want to read more about AFL, the documentation is in a couple of files:</p><ul><li><p><a href="http://lcamtuf.coredump.cx/afl/historical_notes.txt">Historical notes</a></p></li><li><p><a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">Technical whitepaper</a></p></li><li><p><a href="http://lcamtuf.coredump.cx/afl/README.txt">README</a></p></li></ul>
    <div>
      <h2>Coverage-guided fuzzing</h2>
      <a href="#coverage-guided-fuzzing">
        
      </a>
    </div>
    <p>Coverage-guided fuzzing works on the principle of a feedback loop:</p><ul><li><p>the fuzzer picks the most promising test case</p></li><li><p>the fuzzer mutates the test into a large number of new test cases</p></li><li><p>the target code runs the mutated test cases, and reports back code coverage</p></li><li><p>the fuzzer computes a score from the reported coverage, and uses it to prioritize the interesting mutated tests and remove the redundant ones</p></li></ul><p>For example, let's say the input test is "hello". The fuzzer may mutate it into a number of tests, for example: "hEllo" (bit flip), "hXello" (byte insertion), "hllo" (byte deletion). If any of these tests yields interesting code coverage, it will be prioritized and used as a base for the next generation of tests.</p><p>The specifics of how mutations are done, and how to efficiently compare code coverage reports from thousands of program runs, are the fuzzer's secret sauce. Read the <a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">AFL technical whitepaper</a> for the nitty-gritty details.</p><p>The code coverage reported back from the binary is very important. It allows the fuzzer to order the test cases, and identify the most promising ones. Without code coverage the fuzzer is blind.</p><p>Normally, when using AFL, we are required to instrument the target code so that coverage is reported in an AFL-compatible way. But we want to fuzz the kernel! We can't just recompile it with "afl-gcc"! Instead we'll use a trick. We'll prepare a binary that will trick AFL into thinking it was compiled with its tooling. This binary will report back the code coverage extracted from the kernel.</p>
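<p>The three mutations mentioned above - bit flip, byte insertion, byte deletion - can be sketched in a few lines of Python. These are toy versions for illustration; AFL's real mutation stack is far richer and picks offsets adaptively:</p>

```python
import random

def bit_flip(data):
    # Flip a single random bit, e.g. "hello" -> "hEllo".
    i = random.randrange(len(data) * 8)
    out = bytearray(data)
    out[i // 8] ^= 1 << (i % 8)
    return bytes(out)

def byte_insert(data):
    # Insert one random byte, e.g. "hello" -> "hXello".
    i = random.randrange(len(data) + 1)
    return data[:i] + bytes([random.randrange(256)]) + data[i:]

def byte_delete(data):
    # Drop one byte, e.g. "hello" -> "hllo".
    i = random.randrange(len(data))
    return data[:i] + data[i + 1:]
```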
    <div>
      <h2>Kernel code coverage</h2>
      <a href="#kernel-code-coverage">
        
      </a>
    </div>
    <p>The kernel has at least two built-in coverage mechanisms - GCOV and KCOV:</p><ul><li><p><a href="https://www.kernel.org/doc/html/v4.15/dev-tools/gcov.html">Using gcov with the Linux kernel</a></p></li><li><p><a href="https://www.kernel.org/doc/html/latest/dev-tools/kcov.html">KCOV: code coverage for fuzzing</a></p></li></ul><p>KCOV was designed with fuzzing in mind, so we'll use this.</p><p>Using KCOV is pretty easy. We must compile the Linux kernel with the right setting. First, enable the KCOV kernel config option:</p>
            <pre><code>cd linux
./scripts/config \
    -e KCOV \
    -d KCOV_INSTRUMENT_ALL</code></pre>
            <p>KCOV is capable of recording code coverage from the whole kernel. It can be set with the KCOV_INSTRUMENT_ALL option. This has disadvantages though - it would slow down the parts of the kernel we don't want to profile, and would introduce noise in our measurements (reduce "stability"). For starters, let's disable KCOV_INSTRUMENT_ALL and enable KCOV selectively on the code we actually want to profile. Today, we focus on the netlink machinery, so let's enable KCOV on the whole "net" directory tree:</p>
            <pre><code>find net -name Makefile | xargs -L1 -I {} bash -c 'echo "KCOV_INSTRUMENT := y" &gt;&gt; {}'</code></pre>
            <p>In a perfect world we would enable KCOV only for the couple of files we are really interested in. But netlink handling is peppered all over the network stack code, and we don't have time for fine tuning it today.</p><p>With KCOV in place, it's worth adding "kernel hacking" toggles that will increase the likelihood of reporting memory corruption bugs. See the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/README.md">README</a> for the list of <a href="https://github.com/google/syzkaller/blob/master/docs/linux/kernel_configs.md">Syzkaller suggested options</a> - most importantly <a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">KASAN</a>.</p><p>With that set we can compile our KCOV- and KASAN-enabled kernel. Oh, one more thing. We are going to run the kernel under KVM. We're going to use <a href="https://github.com/amluto/virtme">"virtme"</a>, so we need a couple of toggles:</p>
            <pre><code>./scripts/config \
    -e VIRTIO -e VIRTIO_PCI -e NET_9P -e NET_9P_VIRTIO -e 9P_FS \
    -e VIRTIO_NET -e VIRTIO_CONSOLE  -e DEVTMPFS ...</code></pre>
            <p>(see the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/README.md">README</a> for full list)</p>
    <div>
      <h2>How to use KCOV</h2>
      <a href="#how-to-use-kcov">
        
      </a>
    </div>
    <p>KCOV is super easy to use. First, note the code coverage is recorded in a per-process data structure. This means you have to enable and disable KCOV within a userspace process, and it's impossible to record coverage for non-task things, like interrupt handling. This is totally fine for our needs.</p><p>KCOV reports data into a ring buffer. Setting it up is pretty simple, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/src/kcov.c">see our code</a>. Then you can enable and disable it with a trivial ioctl:</p>
            <pre><code>ioctl(kcov_fd, KCOV_ENABLE, KCOV_TRACE_PC);
/* profiled code */
ioctl(kcov_fd, KCOV_DISABLE, 0);</code></pre>
            <p>After this sequence the ring buffer contains the list of %rip values of all the basic blocks of the KCOV-enabled kernel code. To read the buffer just run:</p>
            <pre><code>n = __atomic_load_n(&amp;kcov_ring[0], __ATOMIC_RELAXED);
for (i = 0; i &lt; n; i++) {
    printf("0x%lx\n", kcov_ring[i + 1]);
}</code></pre>
            <p>With tools like <code>addr2line</code> it's possible to resolve the %rip to a specific line of code. We won't need it though - the raw %rip values are sufficient for us.</p>
    <div>
      <h2>Feeding KCOV into AFL</h2>
      <a href="#feeding-kcov-into-afl">
        
      </a>
    </div>
    <p>The next step in our journey is to learn how to trick AFL. Remember, AFL needs a specially-crafted executable, but we want to feed in the kernel code coverage. First we need to understand how AFL works.</p><p>AFL sets up an array of 64K 8-bit numbers. This memory region is called "shared_mem" or "trace_bits" and is shared with the traced program. Every byte in the array can be thought of as a hit counter for a particular (branch_src, branch_dst) pair in the instrumented code.</p><p>It's important to notice that AFL prefers random branch labels, rather than reusing the %rip value to identify the basic blocks. This is to increase entropy - we want our hit counters in the array to be uniformly distributed. The algorithm AFL uses is:</p>
            <pre><code>cur_location = &lt;COMPILE_TIME_RANDOM&gt;;
shared_mem[cur_location ^ prev_location]++; 
prev_location = cur_location &gt;&gt; 1;</code></pre>
            <p>In our case with KCOV we don't have compile-time-random values for each branch. Instead we'll use a hash function to generate a uniform 16 bit number from %rip recorded by KCOV. This is how to feed a KCOV report into the AFL "shared_mem" array:</p>
            <pre><code>n = __atomic_load_n(&amp;kcov_ring[0], __ATOMIC_RELAXED);
uint16_t prev_location = 0;
for (i = 0; i &lt; n; i++) {
        uint16_t cur_location = hash_function(kcov_ring[i + 1]);
        shared_mem[cur_location ^ prev_location]++;
        prev_location = cur_location &gt;&gt; 1;
}</code></pre>
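<p>The <code>hash_function()</code> here is left abstract; any mixer that spreads 64-bit %rip values uniformly over 16 bits will do. A plausible Python model of both the hash and the shared-memory update - our illustration, not the exact function from the shim:</p>

```python
def hash_function(rip):
    # Fibonacci hashing: multiply by a 64-bit odd constant, keep the top
    # 16 bits. Illustrative only; the real shim may use something else.
    h = (rip * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return h >> 48

def fill_shared_mem(rips, shared_mem):
    # Mirror of the C loop above: one hit counter per (prev, cur) edge pair.
    prev_location = 0
    for rip in rips:
        cur_location = hash_function(rip)
        idx = cur_location ^ prev_location
        shared_mem[idx] = (shared_mem[idx] + 1) % 256  # 8-bit counters wrap
        prev_location = cur_location >> 1
```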
            
    <div>
      <h2>Reading test data from AFL</h2>
      <a href="#reading-test-data-from-afl">
        
      </a>
    </div>
    <p>Finally, we need to actually write the test code hammering the kernel netlink interface! First we need to read input data from AFL. By default AFL sends a test case to stdin:</p>
            <pre><code>/* read AFL test data */
char buf[512*1024];
int buf_len = read(0, buf, sizeof(buf));</code></pre>
            
    <div>
      <h2>Fuzzing netlink</h2>
      <a href="#fuzzing-netlink">
        
      </a>
    </div>
    <p>Then we need to send this buffer into a netlink socket. But we know nothing about how netlink works! Okay, let's use the first 5 bytes of input as the netlink protocol and group id fields. This will allow the AFL to figure out and guess the correct values of these fields. The code testing netlink (simplified):</p>
            <pre><code>netlink_fd = socket(AF_NETLINK, SOCK_RAW | SOCK_NONBLOCK, buf[0]);

struct sockaddr_nl sa = {
        .nl_family = AF_NETLINK,
        .nl_groups = (buf[1] &lt;&lt;24) | (buf[2]&lt;&lt;16) | (buf[3]&lt;&lt;8) | buf[4],
};

bind(netlink_fd, (struct sockaddr *) &amp;sa, sizeof(sa));

struct iovec iov = { &amp;buf[5], buf_len - 5 };
struct sockaddr_nl sax = {
      .nl_family = AF_NETLINK,
};

struct msghdr msg = { &amp;sax, sizeof(sax), &amp;iov, 1, NULL, 0, 0 };
r = sendmsg(netlink_fd, &amp;msg, 0);
if (r != -1) {
      /* sendmsg succeeded! great I guess... */
}</code></pre>
            <p>That's basically it! For speed, we will wrap this in a short loop that mimics <a href="https://lcamtuf.blogspot.com/2014/10/fuzzing-binaries-without-execve.html">the AFL "fork server" logic</a>. I'll skip the explanation here, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/src/forksrv.c">see our code for details</a>. The resulting code of our AFL-to-KCOV shim looks like:</p>
            <pre><code>forksrv_welcome();
while(1) {
    forksrv_cycle();
    test_data = afl_read_input();
    kcov_enable();
    /* netlink magic */
    kcov_disable();
    /* fill in shared_map with tuples recorded by kcov */
    if (new_crash_in_dmesg) {
         forksrv_status(1);
    } else {
         forksrv_status(0);
    }
}</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/src/fuzznetlink.c">See full source code</a>.</p>
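<p>If you'd like to poke at the netlink surface from Python before committing to the C shim, the same "first 5 bytes pick the protocol and groups" trick looks roughly like this. It's a sketch under the assumption that an unprivileged netlink socket (e.g. NETLINK_ROUTE, protocol 0) is available:</p>

```python
import socket

def send_fuzz_input(buf):
    # Byte 0 selects the netlink protocol, bytes 1-4 the multicast groups,
    # mirroring the C shim; the rest is sent as the raw netlink payload.
    s = socket.socket(socket.AF_NETLINK,
                      socket.SOCK_RAW | socket.SOCK_NONBLOCK, buf[0])
    groups = int.from_bytes(buf[1:5], "big")
    s.bind((0, groups))  # pid 0 lets the kernel assign a port id
    try:
        s.send(buf[5:])
    except OSError:
        pass  # the kernel rejected the message; expected for most fuzz inputs
    finally:
        s.close()
```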
    <div>
      <h2>How to run the custom kernel</h2>
      <a href="#how-to-run-the-custom-kernel">
        
      </a>
    </div>
    <p>We're missing one important piece - how to actually run the custom kernel we've built. There are three options:</p><p><b>"native"</b>: You can totally boot the built kernel on your server and fuzz it natively. This is the fastest technique, but pretty problematic. If the fuzzing succeeds in finding a bug you will crash the machine, potentially losing the test data. Cutting the branches we sit on should be avoided.</p><p><b>"uml"</b>: We could configure the kernel to run as <a href="http://user-mode-linux.sourceforge.net/">User Mode Linux</a>. Running a UML kernel requires no privileges. The kernel just runs a user space process. UML is pretty cool, but sadly, it doesn't support KASAN, therefore the chances of finding a memory corruption bug are reduced. Finally, UML is a pretty magic special environment - bugs found in UML may not be relevant on real environments. Interestingly, UML is used by <a href="https://source.android.com/devices/architecture/kernel/network_tests">Android network_tests framework</a>.</p><p><b>"kvm"</b>: we can use kvm to run our custom kernel in a virtualized environment. This is what we'll do.</p><p>One of the simplest ways to run a custom kernel in a KVM environment is to use <a href="https://github.com/amluto/virtme">"virtme" scripts</a>. With them we can avoid having to create a dedicated disk image or partition, and just share the host file system. This is how we can run our code:</p>
            <pre><code>virtme-run \
    --kimg bzImage \
    --rw --pwd --memory 512M \
    --script-sh "&lt;what to run inside kvm&gt;" </code></pre>
            <p>But hold on. We forgot about preparing input corpus data for our fuzzer!</p>
    <div>
      <h2>Building the input corpus</h2>
      <a href="#building-the-input-corpus">
        
      </a>
    </div>
    <p>Every fuzzer takes carefully crafted test cases as input, to bootstrap the first mutations. The test cases should be short, and cover as large a part of the code as possible. Sadly - I know nothing about netlink. How about we don't prepare the input corpus...</p><p>Instead we can ask AFL to "figure out" what inputs make sense. This is what <a href="https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html">Michał did back in 2014 with JPEGs</a> and it worked for him. With this in mind, here is our input corpus:</p>
            <pre><code>mkdir inp
echo "hello world" &gt; inp/01.txt</code></pre>
            <p>Instructions, how to compile and run the whole thing are in <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing">README.md</a> on our github. It boils down to:</p>
            <pre><code>virtme-run \
    --kimg bzImage \
    --rw --pwd --memory 512M \
    --script-sh "./afl-fuzz -i inp -o out -- fuzznetlink" </code></pre>
            <p>With this running you will see the familiar AFL status screen:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Ms0btT4aEecNGhSwflYYX/c519bc18d15a5a33ae486456cdbde6e3/Screenshot-from-2019-07-10-11-44-50.png" />
            
            </figure>
    <div>
      <h2>Further notes</h2>
      <a href="#further-notes">
        
      </a>
    </div>
    <p>That's it. Now you have a custom hardened kernel, running a basic coverage-guided fuzzer. All inside KVM.</p><p>Was it worth the effort? Even with this basic fuzzer, and no input corpus, after a day or two the fuzzer found an interesting code path: <a href="https://lore.kernel.org/netdev/CAJPywTJWQ9ACrp0naDn0gikU4P5-xGcGrZ6ZOKUeeC3S-k9+MA@mail.gmail.com/T/#u">NEIGH: BUG, double timer add, state is 8</a>. With a more specialized fuzzer, some work on improving the "stability" metric and a decent input corpus, we could expect even better results.</p><p>If you want to learn more about what netlink sockets actually do, see a blog post by my colleague Jakub Sitnicki <a href="http://codecave.cc/multipath-routing-in-linux-part-1.html">Multipath Routing in Linux - part 1</a>. Then there is a good chapter about it in <a href="https://books.google.pl/books?redir_esc=y&amp;hl=pl&amp;id=96V4AgAAQBAJ&amp;q=netlink#v=snippet&amp;q=netlink&amp;f=false">Linux Kernel Networking book by Rami Rosen</a>.</p><p>In this blog post we haven't mentioned:</p><ul><li><p>details of AFL shared_memory setup</p></li><li><p>implementation of AFL persistent mode</p></li><li><p>how to create a network namespace to isolate the effects of weird netlink commands, and improve the "stability" AFL score</p></li><li><p>technique on how to read dmesg (/dev/kmsg) to find kernel crashes</p></li><li><p>idea to run AFL outside of KVM, for speed and stability - currently the tests aren't stable after a crash is found</p></li></ul><p>But we achieved our goal - we set up a basic, yet still useful fuzzer against a kernel. Most importantly: the same machinery can be reused to fuzz other parts of Linux subsystems - from file systems to bpf verifier.</p><p>I also learned a hard lesson: tuning fuzzers is a full time job. Proper fuzzing is definitely not as simple as starting it up and idly waiting for crashes. There is always something to improve, tune, and re-implement. 
A quote at the beginning of the mentioned presentation by Mateusz Jurczyk resonated with me:</p><blockquote><p>"Fuzzing is easy to learn but hard to master."</p></blockquote><p>Happy bug hunting!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">5t8I5tspUrkYUD9SY5kNP3</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare architecture and how BPF eats the world]]></title>
            <link>https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/</link>
            <pubDate>Sat, 18 May 2019 15:00:00 GMT</pubDate>
            <description><![CDATA[ Recently at Netdev 0x13 I gave a short talk titled "Linux at Cloudflare". The talk ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.

Here is a transcript of a slightly adjusted version of that talk. ]]></description>
            <content:encoded><![CDATA[ <p>Recently at <a href="https://www.netdevconf.org/0x13/schedule.html">Netdev 0x13</a>, the Conference on Linux Networking in Prague, I gave <a href="https://netdevconf.org/0x13/session.html?panel-industry-perspectives">a short talk titled "Linux at Cloudflare"</a>. The <a href="https://speakerdeck.com/majek04/linux-at-cloudflare">talk</a> ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.</p><p>Here is a transcript of a slightly adjusted version of that talk.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jvISZgDDA9AnXaBAJzYys/13b124e1f305c9ab4594db66f9123789/01_edge-network-locations-100.jpg" />
            
            </figure><p>At Cloudflare we run Linux on our servers. We operate two categories of data centers: large "Core" data centers, processing logs, analyzing attacks, computing analytics, and the "Edge" server fleet, delivering customer content from 180 locations across the world.</p><p>In this talk, we will focus on the "Edge" servers. It's here where we use the newest Linux features, optimize for performance and care deeply about DoS resilience.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2L3Wbz94VfdflLpsTuIp9D/8b11fa614a4cc1134e4d4df466464425/image-9.png" />
            
            </figure><p>Our edge service is special due to our network configuration - we are extensively using anycast routing. Anycast means that the same set of IP addresses are announced by all our data centers.</p><p>This design has great advantages. First, it guarantees the optimal speed for end users. No matter where you are located, you will always reach the closest data center. Then, anycast helps us to spread out DoS traffic. During attacks each of the locations receives a small fraction of the total traffic, making it easier to ingest and filter out unwanted traffic.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xiYwbjJRCpt76iDITKowY/9800bb2637bb3f601ccbc7c6eedb6b36/03_edge-network-uniform-software-100-1.jpg" />
            
            </figure><p>Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers - our software stack is uniform across the edge servers. All software pieces are running on all the servers.</p><p>In principle, every machine can handle every task - and we run many diverse and demanding tasks. We have a full HTTP stack, the magical <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a>, two sets of DNS servers - authoritative and resolver, and many other publicly facing applications like <a href="https://www.cloudflare.com/application-services/products/cloudflare-spectrum/">Spectrum</a> and <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/">Warp</a>.</p><p>Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of the processing.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/268deIgT4XLrgXfhRXxfx1/a5ced85738359d15ee16f9238ac759d8/image-23.png" />
            
            </figure><p>Let me walk you through the early stages of inbound packet processing:</p><p>(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many, at least 16, machines. This is used as a rudimentary load balancing technique.</p><p>(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric <a href="https://www.cloudflare.com/learning/ddos/ddos-mitigation/">DoS mitigations</a>, dropping packets belonging to very large layer 3 attacks.</p><p>(3) Then, still in XDP, we perform layer 4 <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">load balancing</a>. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.</p><p>(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.</p><p>(5) Finally packets are received by an application. For example HTTP connections are handled by a "protocol" server, responsible for performing TLS encryption and processing HTTP, HTTP/2 and QUIC protocols.</p><p>It's in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:</p><ul><li><p>DoS handling</p></li><li><p>Load balancing</p></li><li><p>Socket dispatch</p></li></ul><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IlhL5q8CplwN5H1BY0ZZo/7f73d7d4c662113fac18404bcfac96c4/image-25.png" />
            
            </figure><p>Let's discuss DoS handling in more detail. As mentioned earlier, the first step after ECMP routing is Linux's XDP stack where, among other things, we run DoS mitigations.</p><p>Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read about our adventures:</p><ul><li><p><a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">L4Drop: XDP DDoS Mitigations</a></p></li><li><p><a href="/xdpcap/">xdpcap: XDP Packet Capture</a></p></li><li><p><a href="https://netdevconf.org/0x13/session.html?talk-XDP-based-DDoS-mitigation">XDP based DoS mitigation</a> talk by Arthur Fabre</p></li><li><p><a href="https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf">XDP in practice: integrating XDP into our DDoS mitigation pipeline</a> (PDF)</p></li></ul><p>During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that <a href="http://vger.kernel.org/lpc-bpf2018.html#session-9">Facebook engineer Julia Kartseva</a> had the same issues. In February this problem was addressed with the introduction of the <code>bpf_spin_lock</code> helper.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LAXf4Ayxh2sRR3hIsbBja/f8be1df5b19b10954b6a4ad092a177f7/image-26.png" />
            
            </figure><p>While our modern volumetric DoS defenses are done in the XDP layer, we still rely on <code>iptables</code> for application layer 7 mitigations. Here, higher-level firewall features are useful: connlimit, hashlimits and ipsets. We also use the <code>xt_bpf</code> iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:</p><ul><li><p><a href="https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible">Lessons from defending the indefensible</a> (PPT)</p></li><li><p><a href="/introducing-the-bpf-tools/">Introducing the BPF tools</a></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MhIUEVsoocz1OSd8FKeYZ/a9e4321a070cd2d49800e978eb50b2d7/image-34.png" />
            
            </figure><p>After XDP and iptables, we have one final kernel-side DoS defense layer.</p><p>Consider a situation where our UDP mitigations fail. In such a case we might be left with a flood of packets hitting our application UDP socket. This might overflow the socket buffer, causing packet loss. This is problematic - both good and bad packets will be dropped indiscriminately. For applications like DNS it's catastrophic. In the past, to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn't affect the traffic to other server IP addresses.</p><p>Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IPs, and running that many UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it - using the <code>SO_ATTACH_BPF</code> socket option. We talked about running eBPF on network sockets in past blog posts:</p><ul><li><p><a href="/epbf_sockets_hop_distance/">eBPF, Sockets, Hop Distance and manually writing eBPF assembly</a></p></li><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>The mentioned eBPF program rate limits the packets. It keeps its state - packet counts - in an eBPF map. We can be sure that a single flooded IP won't affect other traffic. This works well, though during work on this project we found a rather worrying bug in the eBPF verifier:</p><ul><li><p><a href="/ebpf-cant-count/">eBPF can't count?!</a></p></li></ul><p>I guess running eBPF on a UDP socket is not a common thing to do.</p>
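<p>The per-destination-IP accounting idea behind that socket filter can be sketched in userspace Python. This is a hypothetical illustration of the concept only - the real mitigation is an eBPF program keeping packet counts in an eBPF map, not userspace code:</p>

```python
from collections import defaultdict

# Hypothetical userspace sketch of per-destination-IP rate limiting.
# The eBPF map of packet counts is modeled as a plain dict.
class PerIPRateLimiter:
    def __init__(self, limit_pps):
        self.limit = limit_pps          # allowed packets per second, per IP
        self.counts = defaultdict(int)  # dest IP -> packets this second
        self.epoch = None

    def allow(self, dest_ip, now):
        # `now` is the current time in whole seconds, passed in explicitly
        # to keep the sketch deterministic.
        if now != self.epoch:           # new one-second window: reset counts
            self.counts.clear()
            self.epoch = now
        self.counts[dest_ip] += 1
        # A flood towards one IP exhausts only that IP's budget,
        # leaving traffic to other server IPs unaffected.
        return self.counts[dest_ip] <= self.limit

limiter = PerIPRateLimiter(limit_pps=2)
verdicts = [limiter.allow("192.0.2.1", now=0) for _ in range(3)]
other_ok = limiter.allow("198.51.100.7", now=0)
```

<p>The in-kernel version does the same bookkeeping before packets ever reach the socket buffer, which is what keeps a flood to one IP from starving everything else.</p>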
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79bzCglYGH8Vb2hH6R9j39/39aa41a6e11b5cb0dc7f640cd7d7aa72/image-27.png" />
            
            </figure><p>Apart from DoS mitigation, in XDP we also run a layer 4 load balancer. This is a new project, and we haven't talked much about it yet. Without getting into many details: in certain situations we need to perform a socket lookup from XDP.</p><p>The problem is relatively simple - our code needs to look up the "socket" kernel structure for a 5-tuple extracted from a packet. This is generally easy - there is a <code>bpf_sk_lookup</code> helper available for this. Unsurprisingly, there were some complications. One problem was the inability to verify if a received ACK packet was a valid part of a three-way handshake when SYN-cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6j7ypFSgorN0lWBNY6Nc2m/b0ca29a08e3a1623a5f8c06c213f8d97/image-28.png" />
            
            </figure><p>After the DoS and load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack. Here we do socket dispatch - for example, packets going to port 53 are passed to a socket belonging to our DNS server.</p><p>We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on the servers.</p><p>Convincing Linux to route packets correctly is relatively easy with <a href="/how-we-built-spectrum">the "AnyIP" trick</a>. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn't support this out of the box. You can call <code>bind()</code> either on a specific IP address or on all IPs (with 0.0.0.0).</p>
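<p>The two stock <code>bind()</code> modes are easy to see from Python - and anything in between, such as binding to an IP prefix, is exactly what vanilla Linux lacks. A minimal sketch, using the loopback address:</p>

```python
import socket

# Vanilla Linux offers exactly two bind() choices:
# a single specific IP address, or the 0.0.0.0 wildcard.
# Binding to an IP *prefix* is not supported out of the box.

# Choice 1: bind to one specific address.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))   # port 0 lets the kernel pick a free port
addr1 = s1.getsockname()

# Choice 2: bind to all addresses.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.bind(("0.0.0.0", 0))
addr2 = s2.getsockname()

s1.close()
s2.close()
```

<p>Nothing in this API lets you say "bind to 192.0.2.0/24", which is the gap the <code>SO_BINDTOPREFIX</code> patch described below fills.</p>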
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ERTm7jhXbKKfdo3CVmV81/2c8b5986539832c73971269a6bcf0964/image-29.png" />
            
            </figure><p>In order to fix this, we developed a custom kernel patch which adds <a href="http://patchwork.ozlabs.org/patch/602916/">a <code>SO_BINDTOPREFIX</code> socket option</a>. As the name suggests, it allows us to call <code>bind()</code> on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.</p><p>Then we ran into another problem. For our Spectrum product we need to listen on all 65535 ports. Running so many listen sockets is not a good idea (see <a href="/revenge-listening-sockets/">our old war story blog</a>), so we had to find another way. After some experiments we learned to utilize an obscure iptables module - TPROXY - for this purpose. Read about it here:</p><ul><li><p><a href="/how-we-built-spectrum/">Abusing Linux's firewall: the hack that allowed us to build Spectrum</a></p></li></ul><p>This setup works, but we don't like the extra firewall rules. We are working on solving this problem correctly - by actually extending the socket dispatch logic. You guessed it - we want to extend socket dispatch logic by utilizing eBPF. Expect some patches from us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eEog6Kt512U7uVmD5DVt8/d8216cc24e674f5f86b1d385bf93dfc5/image-32.png" />
            
            </figure><p>Then there is a way to use eBPF to improve applications. Recently we got excited about doing TCP splicing with SOCKMAP:</p><ul><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>This technique has great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.</p><p>Similarly, the new <a href="https://netdevconf.org/2.2/papers/brakmo-tcpbpf-talk.pdf">TCP-BPF aka BPF_SOCK_OPS</a> hooks provide a great way of inspecting performance parameters of TCP flows. This functionality is super useful for our performance team.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MC8LLNDHLZ4wDYLcbbCHQ/3491b09f9b5eaf796457eb06781f8b98/12_prometheus-ebpf_exporter-100.jpg" />
            
            </figure><p>Some Linux features didn't age well and we need to work around them. For example, we are hitting the limitations of networking metrics. Don't get me wrong - the networking metrics are awesome, but sadly they are not granular enough. Things like <code>TcpExtListenDrops</code> and <code>TcpExtListenOverflows</code> are reported as global counters, while we need them on a per-application basis.</p><p>Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called "ebpf_exporter" to facilitate this. Read on:</p><ul><li><p><a href="/introducing-ebpf_exporter/">Introducing ebpf_exporter</a></p></li><li><p><a href="https://github.com/cloudflare/ebpf_exporter">https://github.com/cloudflare/ebpf_exporter</a></p></li></ul><p>With "ebpf_exporter" we can generate all manner of detailed metrics. It is very powerful and has saved us on many occasions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1obSUWhzzUQxg8cmDqQ0Lr/15e6419a78802c4d19fc4d21bf161738/image-33.png" />
            
            </figure><p>In this talk we discussed 6 layers of BPF running on our edge servers:</p><ul><li><p>Volumetric DoS mitigations are running on XDP eBPF</p></li><li><p>Iptables <code>xt_bpf</code> cBPF for application-layer attacks</p></li><li><p><code>SO_ATTACH_BPF</code> for rate limits on UDP sockets</p></li><li><p>Load balancer, running on XDP</p></li><li><p>eBPFs running application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements</p></li><li><p>"ebpf_exporter" for granular metrics</p></li></ul><p>And we're just getting started! Soon we will be doing more with eBPF-based socket dispatch, eBPF running on the <a href="https://linux.die.net/man/8/tc">Linux TC (Traffic Control)</a> layer and more integration with cgroup eBPF hooks. Then, our SRE team maintains an ever-growing list of <a href="https://github.com/iovisor/bcc">BCC scripts</a> useful for debugging.</p><p>It feels like Linux stopped developing new APIs and all the new features are implemented as eBPF hooks and helpers. This is fine and it has strong advantages. It's easier and safer to upgrade an eBPF program than to recompile a kernel module. Some things like TCP-BPF, exposing high-volume performance tracing data, would probably be impossible without eBPF.</p><p>Some say "software is eating the world"; I would say: "BPF is eating the software".</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">5EXsVZKcFTNLXVbrgHx73a</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[RFC8482 - Saying goodbye to ANY]]></title>
            <link>https://blog.cloudflare.com/rfc8482-saying-goodbye-to-any/</link>
            <pubDate>Fri, 15 Mar 2019 17:01:17 GMT</pubDate>
            <description><![CDATA[ Ladies and gentlemen, I would like you to welcome the new shiny RFC8482, which effectively deprecates DNS ANY query type. DNS ANY was a "meta-query" - think about it as a similar thing to the common A, AAAA, MX or SRV query types, but unlike these it wasn't a real query type - it was special. ]]></description>
            <content:encoded><![CDATA[ <p>Ladies and gentlemen, I would like you to welcome the new shiny <a href="https://tools.ietf.org/html/rfc8482">RFC8482</a>, which effectively deprecates the DNS ANY query type. DNS ANY was a "meta-query" - think of it as a similar thing to the common A, AAAA, MX or SRV query types, but unlike these it wasn't a real query type - it was special. Unlike the standard query types, ANY didn't age well. It was hard to implement on modern DNS servers, the semantics were poorly understood by the community and it unnecessarily exposed the DNS protocol to abuse. RFC8482 allows us to clean it up - it's a good thing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PCmdGHjKE7wTlBgmp6igp/e77a20411c26932eba98e8fcede95bfd/Screenshot-from-2019-03-15-14-22-51.png" />
            
            </figure><p>But let's rewind a bit.</p>
    <div>
      <h2>Historical context</h2>
      <a href="#historical-context">
        
      </a>
    </div>
    <p>It all started in 2015, when we were looking at the code of our authoritative DNS server. The code flow was generally fine, but it was all peppered with naughty statements like this:</p>
            <pre><code>if qtype == "ANY" {
    // special case
}</code></pre>
            <p>This special code was ugly and error prone. This got us thinking: do we really need it? "ANY" is not a popular query type - no legitimate software uses it (with the notable exception of qmail).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oOaG7G1WJfwM7TX5qGESz/c6e5c5359f4eb3c499d6b3e891d4f19b/11235945713_5bf22a701d_z.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/cmichel67/11235945713/">Image</a> by <a href="https://www.flickr.com/photos/cmichel67/">Christopher Michel</a>, CC BY 2.0</p>
    <div>
      <h2>ANY is hard for modern DNS servers</h2>
      <a href="#any-is-hard-for-modern-dns-servers">
        
      </a>
    </div>
    <p>"ANY" queries, also called "* queries" in old RFCs, are supposed to return "all records" (citing <a href="https://tools.ietf.org/html/rfc1035">RFC1035</a>). There are two problems with this notion.</p><p>First, it assumes the server is able to retrieve "all records". In our implementation - we can't. Our DNS server, like many modern implementations, doesn't have a single "zone" file listing all properties of a DNS zone. This design allows us to respond fast and with information always up to date, but it makes it incredibly hard to retrieve "all records". Correct handling of "ANY" adds unreasonable code complexity for an obscure, rarely used query type.</p><p>Second, many of the DNS responses are generated on-demand. To mention just two use cases:</p><ul><li><p>Some of our DNS responses <a href="/dnssec-done-right/">are based on location</a></p></li><li><p><a href="/black-lies/">We are using black lies and DNS shotgun for DNSSEC</a></p></li></ul><p>Storing data in modern databases and dynamically generating responses poses a fundamental problem to ANY.</p>
    <div>
      <h2>ANY is hard for clients</h2>
      <a href="#any-is-hard-for-clients">
        
      </a>
    </div>
    <p>Around the same time a catastrophe happened - <a href="https://lists.dns-oarc.net/pipermail/dns-operations/2015-March/012899.html">Firefox started shipping with DNS code issuing "ANY" queries</a>. The intention was, as usual, benign. Firefox developers wanted to get the TTL value for A and AAAA queries.</p><p>To cite DNS guru <a href="https://icannwiki.org/Andrew_Sullivan">Andrew Sullivan</a>:</p><blockquote><p>In general, ANY is useful for troubleshooting but should never be used for regular operation. Its output is unpredictable given the effects of caches. It can return enormous result sets.</p></blockquote><p>In user code you can't rely on anything sane to come out of an "ANY" query. While an "ANY" query has somewhat defined semantics on the DNS authoritative side, it's undefined on the DNS resolver side. Such a query can confuse the resolver:</p><ul><li><p>Should it forward the "ANY" query to the authoritative server?</p></li><li><p>Should it respond with any record that is already in cache?</p></li><li><p>Should it do some mixture of the above behaviors?</p></li><li><p>Should it cache the result of an "ANY" query and re-use the data for other queries?</p></li></ul><p>Different implementations do different things. "ANY" does not mean "ALL", which is the main source of confusion. To our joy, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1093983#c14">Firefox quickly backpedaled</a> on the change and stopped issuing ANY queries.</p>
    <div>
      <h2>ANY is hard for network operators</h2>
      <a href="#any-is-hard-for-network-operators">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FBoDQ01mFhkylbffAGioR/b08d10f697d24c84f232a1a2ca9bff02/Screenshot-from-2019-03-15-14-44-58.png" />
            
            </figure><p>A typical 50Gbps DNS amplification targeting one of our customers. The attack lasted about 4 hours.</p><p>Furthermore, since "ANY" queries can generate large responses, they were often used for DNS reflection attacks. Authoritative providers receive a spoofed ANY query and send the large answer to a target, potentially causing DoS damage. We have blogged about that many times:</p><ul><li><p><a href="https://blog.cloudflare.com/the-ddos-that-knocked-spamhaus-offline-and-ho/">The DDoS that knocked Spamhaus offline</a></p></li><li><p><a href="https://blog.cloudflare.com/deep-inside-a-dns-amplification-ddos-attack/">Deep inside a DNS amplification attack</a></p></li><li><p><a href="https://blog.cloudflare.com/reflections-on-reflections/">Reflections on reflections</a></p></li><li><p><a href="https://blog.cloudflare.com/how-the-consumer-product-safety-commission-is-inadvertently-behind-the-internets-largest-ddos-attacks/">How the CPSC is inadvertently behind the largest attacks</a></p></li></ul><p>The DoS problem with ANY is ancient. Here is a discussion about a <a href="https://lists.dns-oarc.net/pipermail/dns-operations/2013-May/010178.html">patch to bind tweaking ANY from 2013</a>.</p><p>There is also a second angle to the ANY DoS problem. Some reports suggested that performant DNS servers (authoritative or resolvers) <a href="https://fanf.livejournal.com/140566.html">can fill their outbound network capacity</a> with numerous ANY responses.</p><p>The recommendation is simple - network operators must use <a href="https://kb.isc.org/docs/aa-00994">"response rate limiting"</a> when answering large DNS queries, otherwise they pose a DoS threat. The "ANY" query type just happens to often give such large responses, while providing little value to legitimate users.</p>
    <div>
      <h2>Terminating ANY</h2>
      <a href="#terminating-any">
        
      </a>
    </div>
    <p>In 2015, frustrated with this experience, we announced that we would stop responding to "ANY" queries, and wrote a (controversial at the time) blog post:</p><ul><li><p><a href="https://blog.cloudflare.com/deprecating-dns-any-meta-query-type/">Deprecating DNS ANY meta-query type</a></p></li></ul><p>A year later we followed up explaining possible solutions:</p><ul><li><p><a href="https://blog.cloudflare.com/what-happened-next-the-deprecation-of-any/">What happened next - the deprecation of ANY</a></p></li></ul><p>And here we are today! With <a href="https://tools.ietf.org/html/rfc8482">RFC8482</a> we have a Proposed Standard clarifying that controversial query type.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VXjlDZD1THaZzfyH53uo0/59098d02034b3b987d143ba9d5936a90/Screenshot-from-2019-03-15-13-19-21.png" />
            
            </figure><p>ANY queries are background noise. Under normal circumstances, we see a very small volume of ANY queries.</p>
    <div>
      <h2>The future for our users</h2>
      <a href="#the-future-for-our-users">
        
      </a>
    </div>
    <p>What precisely can be done about "ANY" queries? RFC8482 specifies that:</p><blockquote><p>A DNS responder that receives an ANY query MAY decline to provide a conventional ANY response or MAY instead send a response with a single RRset (or a larger subset of available RRsets) in the answer section.</p></blockquote><p>This clearly defines the corner case - from now on the authoritative server may respond with, well, any record type to an "ANY" query. Sometimes simple stuff like this matters most.</p><p>This opens a gate for implementers - we can prepare a simple answer to these queries. As an implementer you may stick "A", "AAAA" or anything else in the response if you wish. Furthermore, the spec recommends returning a special (and rarely used thus far) HINFO type. This is in fact what we do:</p>
            <pre><code>$ dig ANY cloudflare.com @ns3.cloudflare.com. 
;; ANSWER SECTION:
cloudflare.com.		3789	IN	HINFO	"ANY obsoleted" "See draft-ietf-dnsop-refuse-any"</code></pre>
            <p>Oh, we need to update the message to mention the fresh RFC number! NS1 agrees with our implementation:</p>
            <pre><code>$ dig ANY nsone.net @dns1.p01.nsone.net.
;; ANSWER SECTION:
nsone.net.		3600	IN	HINFO	"ANY not supported." "See draft-ietf-dnsop-refuse-any"</code></pre>
            <p>Our ultimate hero is <code>wikipedia.org</code>, which does exactly what the RFC recommends:</p>
            <pre><code>$ dig ANY wikipedia.org @ns0.wikimedia.org.
;; ANSWER SECTION:
wikipedia.org.		3600	IN	HINFO	"RFC8482" ""</code></pre>
            <p>On our resolver service we stop ANY queries with the NOTIMP code. This makes us more confident that the resolver isn't used to perform DNS reflections:</p>
            <pre><code>$ dig ANY cloudflare.com @1.1.1.1
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOTIMP, id: 14151</code></pre>
            
    <div>
      <h2>The future for developers</h2>
      <a href="#the-future-for-developers">
        
      </a>
    </div>
    <p>On the client side, just don't use ANY DNS queries. On the DNS server side, you are allowed to rip out all the gory QTYPE::ANY handling code and replace it with a top-level HINFO response or the first RRset found. Enjoy cleaning your codebase!</p>
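<p>In pseudocode, the cleaned-up handler can be as small as the following sketch. The helper names and the toy zone here are hypothetical stand-ins, not taken from any real DNS server:</p>

```python
# Hypothetical sketch of an RFC8482-style ANY handler.
# record() and the zone dict are illustrative, not real APIs.

def record(qtype, ttl, data):
    return {"qtype": qtype, "ttl": ttl, "data": data}

def handle_query(qname, qtype, lookup):
    if qtype == "ANY":
        # RFC8482: a responder MAY answer ANY with a single HINFO RRset
        # instead of trying to assemble "all records".
        return [record("HINFO", 3600, ("RFC8482", ""))]
    return lookup(qname, qtype)

# A toy zone for the normal, non-ANY path.
zone = {("example.com.", "A"): [record("A", 300, "192.0.2.1")]}
lookup = lambda name, qtype: zone.get((name, qtype), [])

any_answer = handle_query("example.com.", "ANY", lookup)
a_answer = handle_query("example.com.", "A", lookup)
```

<p>The point of the design is that the ANY branch no longer needs to touch the zone data at all - exactly the special case the RFC lets you delete.</p>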
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>It took the DNS community some time to agree on the specifics, but here we are at the end. RFC8482 cleans up the last remaining DNS meta-qtype, and allows for simpler DNS authoritative and DNS resolver implementations. It finally clearly defines the semantics of ANY queries going through resolvers and reduces the DoS risk for the whole Internet.</p><p>Not all effort must go into shiny new protocols and developments; sometimes cleaning up bitrot is just as important. Similar cleanups are being done <a href="https://tools.ietf.org/html/draft-davidben-tls-grease-00">in other areas</a>. Keep up the good work!</p><p>We would like to thank the co-authors of RFC8482, and the community for its scrutiny and feedback. For us, RFC8482 is definitely a good thing: it allowed us to simplify our codebase and make the Internet safer.</p><p>Mission accomplished! One step at a time we can help make the Internet a better place.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">58CcadkrVsMxWuJbB7efLi</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[SOCKMAP - TCP splicing of the future]]></title>
            <link>https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/</link>
            <pubDate>Mon, 18 Feb 2019 13:13:02 GMT</pubDate>
            <description><![CDATA[ Recently we stumbled upon the holy grail for reverse proxies - a TCP socket splicing API. SOCKMAP is a very promising API and is likely to cause a tectonic shift in the architecture of data-heavy applications like software proxies. ]]></description>
            <content:encoded><![CDATA[ <p>Recently we stumbled upon the holy grail for reverse proxies - a TCP socket splicing API. This caught our attention because, as you may know, we run a global network of reverse proxy services. Proper TCP socket splicing reduces the load on userspace processes and enables more efficient data forwarding. We realized that the Linux kernel's SOCKMAP infrastructure can be reused for this purpose. SOCKMAP is a very promising API and is likely to cause a tectonic shift in the architecture of data-heavy applications like software proxies.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69XVqZWzoAEEazmpmVeARy/6977b3f52755ec4075cb3f38e5e8bbff/31958194737_e06ecd6fcc_o.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/mustadmarine/31958194737/">Image</a> by <a href="https://www.flickr.com/photos/mustadmarine/">Mustad Marine</a>, public domain</p><p>But let’s rewind a bit.</p>
    <div>
      <h3>Birthing pains of L7 proxies</h3>
      <a href="#birthing-pains-of-l7-proxies">
        
      </a>
    </div>
    <p>Transmitting large amounts of data from userspace is inefficient. Linux provides a couple of specialized syscalls that aim to address this problem. For example, the <code>sendfile(2)</code> syscall (<a href="https://yarchive.net/comp/linux/splice.html">which Linus doesn't like</a>) can be used to speed up transferring large files from disk to a socket. Then there is <code>splice(2)</code> which traditional proxies use to forward data between two TCP sockets. Finally, <code>vmsplice</code> can be used to stick a memory buffer into a pipe without copying, but is very hard to use correctly.</p><p>Sadly, <code>sendfile</code>, <code>splice</code> and <code>vmsplice</code> are very specialized and synchronous, and solve only one part of the problem - they avoid copying the data to userspace. They leave other efficiency issues unaddressed.</p><table><tr><td><p><b></b></p></td><td><p><b>between</b></p></td><td><p><b>avoid user-space memory</b></p></td><td><p><b>zerocopy</b></p></td></tr><tr><td><p>sendfile</p></td><td><p>disk file --&gt; socket</p></td><td><p>yes</p></td><td><p>no</p></td></tr><tr><td><p>splice</p></td><td><p>pipe &lt;--&gt; socket</p></td><td><p>yes</p></td><td><p>yes?</p></td></tr><tr><td><p>vmsplice</p></td><td><p>memory region --&gt; pipe</p></td><td><p>no</p></td><td><p>yes</p></td></tr></table><p>Processes that forward large amounts of data face three problems:</p><ol><li><p>Syscall cost: making multiple syscalls for every forwarded packet is costly.</p></li><li><p>Wakeup latency: the user-space process must be woken up often to forward the data. Depending on the scheduler, this may result in poor tail latency.</p></li><li><p>Copying cost: copying data from kernel to userspace and then immediately back to the kernel is not free and adds up to a measurable cost.</p></li></ol>
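<p>The sendfile row of the table can be tried directly from Python, which exposes the syscall as <code>os.sendfile</code>. A minimal sketch assuming Linux, with a loopback TCP pair standing in for a real client connection:</p>

```python
import os
import socket
import tempfile

# sendfile(2): the kernel moves bytes from a file straight into a socket
# buffer - the payload never surfaces in userspace.
payload = b"hello from sendfile"

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()

    # A connected TCP pair over loopback stands in for a real client.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    client = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()

    # Offset 0, length of the whole file; returns the number of bytes sent.
    sent = os.sendfile(conn.fileno(), f.fileno(), 0, len(payload))
    received = client.recv(4096)

    for s in (conn, client, srv):
        s.close()
```

<p>Compared with a read()/write() loop this saves the round trip through userspace memory, but - as the table notes - it remains synchronous and works only for the file-to-socket direction.</p>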
    <div>
      <h3>Many tried</h3>
      <a href="#many-tried">
        
      </a>
    </div>
    <p>Forwarding data between TCP sockets is a common practice. It's needed for:</p><ul><li><p>Transparent forward HTTP proxies, like Squid.</p></li><li><p>Reverse caching HTTP proxies, like Varnish or NGINX.</p></li><li><p>Load balancers, like HAProxy, Pen or Relayd.</p></li></ul><p>Over the years there <a href="https://www.haproxy.org/download/1.3/doc/tcp-splicing.txt">have</a> been <a href="http://wwwconference.org/proceedings/www2002/refereed/627/index.html">many</a> <a href="https://lwn.net/Articles/200902/">attempts</a> to reduce the cost of dumb data forwarding between TCP sockets on Linux. This issue is generally called “TCP splicing”, “L7 splicing”, or “Socket splicing”.</p><p>Let’s compare the usual ways of doing TCP splicing. To simplify the problem, instead of writing a rich Layer 7 TCP proxy, we'll write a trivial TCP echo server.</p><p>It's not a joke. An echo server can illustrate TCP socket splicing well. You know - "echo" basically splices the socket… with itself!</p>
    <div>
      <h3>Naive: read write loop</h3>
      <a href="#naive-read-write-loop">
        
      </a>
    </div>
    <p>The naive TCP echo server would look like:</p>
            <pre><code>while True:
    data = read(sd, 4096)
    if not data:
        break
    writeall(sd, data)</code></pre>
            <p>Nothing simpler. On a blocking socket this is a totally valid program, and will work just fine. For completeness, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-naive.c#L57-L78">I prepared full code here</a>.</p>
    <div>
      <h3>Splice: specialized syscall</h3>
      <a href="#splice-specialized-syscall">
        
      </a>
    </div>
    <p>Linux has an amazing <a href="http://man7.org/linux/man-pages/man2/splice.2.html">splice(2) syscall</a>. It can tell the kernel to move data between a TCP buffer on a socket and a buffer on a pipe. The data remains in the buffers, on the kernel side. This solves the problem of needlessly having to copy the data between userspace and kernel-space. With the <code>SPLICE_F_MOVE</code> flag the kernel may be able to avoid copying the data at all!</p><p>Our program using <code>splice()</code> looks like:</p>
            <pre><code>pipe_rd, pipe_wr = pipe()
fcntl(pipe_rd, F_SETPIPE_SZ, 4096)

while True:
    n = splice(sd, pipe_wr, 4096)
    if n &lt;= 0:
        break
    splice(pipe_rd, sd, n)</code></pre>
            <p>We still need to wake up the userspace program and make two syscalls to forward each piece of data, but at least we avoid all the copying. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-splice.c#L76-L98">Full source</a>.</p>
    <div>
      <h3>io_submit: Using Linux AIO API</h3>
      <a href="#io_submit-using-linux-aio-api">
        
      </a>
    </div>
    <p><a href="/io_submit-the-epoll-alternative-youve-never-heard-about/">In a previous blog post about io_submit()</a> we proposed using the AIO interface with network sockets. Read the blog post for details, but <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-iosubmit.c#L81-L107">here is the prepared program</a> that has the echo server loop implemented with only a single syscall.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4qk4EejV30qMZoVZ2RcMFW/d1020375b1a5083ccdd442cfdfba186b/452423494_31aa5caca5_z-1.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/jrsnchzhrs/452423494">Image</a> by <a href="https://www.flickr.com/photos/jrsnchzhrs/">jrsnchzhrs</a> CC BY-ND 2.0</p>
    <div>
      <h3>SOCKMAP: The ultimate weapon</h3>
      <a href="#sockmap-the-ultimate-weapon">
        
      </a>
    </div>
    <p>In recent years the Linux kernel introduced an <a href="https://lwn.net/Articles/740157/">eBPF virtual machine</a>. With it, user-space programs can run specialized, non-Turing-complete bytecode in the kernel context. Nowadays it's possible to <a href="/tag/ebpf/">select eBPF programs</a> for dozens of use cases, ranging from packet filtering to policy enforcement.</p><p>Since kernel 4.14, Linux has had new eBPF machinery that can be used for socket splicing - SOCKMAP. It was created by John Fastabend at <a href="https://cilium.io/blog/2018/04/24/cilium-security-for-age-of-microservices/">Cilium.io</a>, exposing the <a href="https://www.kernel.org/doc/Documentation/networking/strparser.txt">Strparser</a> interface to eBPF programs. Cilium uses SOCKMAP for Layer 7 policy enforcement, and all the logic it uses is embedded in an eBPF program. The API is not well documented, requires root and, in our experience, is <a href="https://lore.kernel.org/netdev/20190211090949.18560-1-jakub@cloudflare.com/">slightly</a> <a href="https://lore.kernel.org/netdev/20190128091335.20908-1-jakub@cloudflare.com/">buggy</a>. But it's very promising. Read more:</p><ul><li><p>LPC2018 - Combining kTLS and BPF for Introspection and Policy Enforcement <a href="http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf">Paper</a> <a href="https://www.youtube.com/watch?v=NnibidVRtWY">Video</a> <a href="http://vger.kernel.org/lpc_net2018_talks/ktls_bpf.pdf">Slides</a></p></li><li><p><a href="https://lwn.net/Articles/731133/">Original SOCKMAP commit</a></p></li></ul><p>This is how to use SOCKMAP: SOCKMAP, or specifically "BPF_MAP_TYPE_SOCKMAP", is a type of eBPF map. This map is an "array" - indices are integers. All this is pretty standard. The magic is in the map values - they must be TCP socket descriptors.</p><p>This map is very special - it has two eBPF programs attached to it. You read that right: the eBPF programs live <i>attached to a map</i>, not attached to a socket, cgroup or network interface as usual. This is how you would set up <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap.c#L36-L80">SOCKMAP in a user program</a>:</p>
            <pre><code>sock_map = bpf_create_map(BPF_MAP_TYPE_SOCKMAP, sizeof(int), sizeof(int), 2, 0)

prog_parser = bpf_load_program(BPF_PROG_TYPE_SK_SKB, ...)
prog_verdict = bpf_load_program(BPF_PROG_TYPE_SK_SKB, ...)
bpf_prog_attach(prog_parser, sock_map, BPF_SK_SKB_STREAM_PARSER)
bpf_prog_attach(prog_verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT)</code></pre>
            <p>Ta-da! At this point we have an established <code>sock_map</code> eBPF map, with two eBPF programs attached: parser and verdict. The next step is to add a TCP socket descriptor to this map. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap.c#L130-L142">Nothing simpler</a>:</p>
            <pre><code>int idx = 0;
int val = sd;
bpf_map_update_elem(sock_map, &amp;idx, &amp;val, BPF_ANY);</code></pre>
            <p>At this point <i>the magic happens</i>. From now on, each time our socket <code>sd</code> receives a packet, prog_parser and prog_verdict are called. Their semantics are described in the <a href="https://www.kernel.org/doc/Documentation/networking/strparser.txt">strparser.txt</a> and the <a href="https://lwn.net/Articles/731133/">introductory SOCKMAP commit</a>. For simplicity, our trivial echo server only needs the minimal stubs. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap-kern.c#L32-L43">This is the eBPF code</a>:</p>
            <pre><code>SEC("prog_parser")
int _prog_parser(struct __sk_buff *skb)
{
	return skb-&gt;len;
}

SEC("prog_verdict")
int _prog_verdict(struct __sk_buff *skb)
{
	uint32_t idx = 0;
	return bpf_sk_redirect_map(skb, &amp;sock_map, idx, 0);
}</code></pre>
            <p>Side note: for the purposes of this test program, I wrote a minimal eBPF loader. It has no dependencies (neither bcc, libelf, nor libbpf) and can do basic relocations (like resolving the <code>sock_map</code> symbol mentioned above). <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/tbpf.c">See the code</a>.</p><p>The call to <code>bpf_sk_redirect_map</code> is doing all the work. It tells the kernel: for the received packet, please oh please <i>redirect</i> it from a receive queue of some socket, to a transmit queue of the socket living in sock_map under index 0. In our case, these are the same sockets! Here we achieved exactly what the echo server is supposed to do, but purely in eBPF.</p><p>This technology has multiple benefits. First, the data is never copied to userspace. Secondly, we never need to wake up the userspace program. All the action is done in the kernel. Quite cool, isn't it?</p><p>We need one more piece of code, to hang the userspace program until the socket is closed. This is best done with good old <code>poll(2)</code>:</p>
            <pre><code>/* Wait for the socket to close. Let SOCKMAP do the magic. */
struct pollfd fds[1] = {
    {.fd = sd, .events = POLLRDHUP},
};
poll(fds, 1, -1);</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap.c#L144-L148">Full code.</a></p>
    <div>
      <h3>The benchmarks</h3>
      <a href="#the-benchmarks">
        
      </a>
    </div>
    <p>At this stage we have presented four simple TCP echo servers:</p><ul><li><p>naive read-write loop</p></li><li><p>splice</p></li><li><p>io_submit</p></li><li><p>SOCKMAP</p></li></ul><p>To recap, we are measuring the cost of three things:</p><ol><li><p>Syscall cost</p></li><li><p>Wakeup latency, mostly visible as tail latency</p></li><li><p>The cost of copying data</p></li></ol><p>Theoretically, SOCKMAP should beat all the others:</p><table><tr><td><p></p></td><td><p><b>syscall cost</b></p></td><td><p><b>waking up userspace</b></p></td><td><p><b>copying cost</b></p></td></tr><tr><td><p>read write loop</p></td><td><p>2 syscalls</p></td><td><p>yes</p></td><td><p>2 copies</p></td></tr><tr><td><p>splice</p></td><td><p>2 syscalls</p></td><td><p>yes</p></td><td><p>0 copy (?)</p></td></tr><tr><td><p>io_submit</p></td><td><p>1 syscall</p></td><td><p>yes</p></td><td><p>2 copies</p></td></tr><tr><td><p>SOCKMAP</p></td><td><p>none</p></td><td><p>no</p></td><td><p>0 copies</p></td></tr></table>
    <div>
      <h3>Show me the numbers</h3>
      <a href="#show-me-the-numbers">
        
      </a>
    </div>
    <p>This is the part of the post where I'm showing you the breathtaking numbers, clearly distinguishing the different approaches. Sadly, benchmarking is hard, and well... SOCKMAP turned out to be the slowest. It's <a href="https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/">important to publish negative results</a> so here they are.</p><p>Our test rig was as follows:</p><ul><li><p>Two bare-metal Xeon servers connected with a 25Gbps network.</p></li><li><p>Both have turbo-boost disabled, and the testing programs are CPU-pinned.</p></li><li><p>For better locality we localized RX and TX queues to one IRQ/CPU each.</p></li><li><p>The testing server runs a script that sends 10k batches of fixed-sized blocks of data. The script measures how long it takes for the echo server to return the traffic.</p></li><li><p>We do 10 separate runs for each measured echo-server program.</p></li><li><p>TCP: "cubic" and NONAGLE=1.</p></li><li><p>Both servers run the 4.14 kernel.</p></li></ul><p>Our analysis of the experimental data identified some outliers. We think some of the worst times, manifested as long echo replies, were caused by unrelated factors such as network packet loss. In the charts presented we, perhaps controversially, skip the worst 1% of outliers in order to focus on what we think is the important data.</p><p>Furthermore, we spotted a bug in SOCKMAP. Some of the runs were delayed by up to a whopping 64ms. Here is one of the tests:</p>
            <pre><code>Values min:236.00 avg:669.28 med=390.00 max:78039.00 dev:3267.75 count:2000000
Values:
 value |-------------------------------------------------- count
     1 |                                                   0
     2 |                                                   0
     4 |                                                   0
     8 |                                                   0
    16 |                                                   0
    32 |                                                   0
    64 |                                                   0
   128 |                                                   0
   256 |                                                   3531
   512 |************************************************** 1756052
  1024 |                                             ***** 208226
  2048 |                                                   18589
  4096 |                                                   2006
  8192 |                                                   9
 16384 |                                                   1
 32768 |                                                   0
 65536 |                                                   11585
131072 |                                                   1</code></pre>
            <p>The great majority of the echo runs (of 128KiB in this case) finished in the 512us band, while a small fraction stalled for 65ms. This is pretty bad and makes comparing SOCKMAP to the other implementations pretty meaningless. This is the second reason why we skip the worst 1% of results from all the runs - it makes the SOCKMAP numbers way more usable. Sorry.</p>
    <div>
      <h3>2MiB blocks - throughput</h3>
      <a href="#2mib-blocks-throughput">
        
      </a>
    </div>
    <p>The fastest of our programs was doing ~15Gbps over one flow, which seems to be a hardware limit. This is very visible in the first test, which shows the throughput of our echo programs.</p><p>This test measures the time to transmit and receive 2MiB blocks of data via our tested echo server. We repeat this 10k times, and run the test 10 times. After stripping the worst 1% of numbers we get the following latency distribution:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4yxe0fOUjkJpcsD1uoCIm6/03169327d9ee3060e31aa1d59d12283a/numbers-2mib-2.png" />
            
            </figure><p>This chart shows that both the naive read+write and io_submit programs were able to achieve a 1500us mean round trip time when echoing 2MiB blocks.</p><p>Here we clearly see that splice and SOCKMAP are slower than the others. They were CPU-bound and unable to reach the line rate. We have raised the <a href="https://www.spinics.net/lists/netdev/msg539609.html">unusual splice performance problems</a> in the past, but perhaps we should debug them one more time.</p><p>For each server we ran the tests twice: without and with the SO_BUSY_POLL setting. This setting should remove the "wakeup latency" and greatly reduce the jitter. The results show that the naive and io_submit tests are almost identical. This is perfect! BUSYPOLL does indeed reduce the deviation and latency, at a cost of more CPU usage. Notice that neither splice nor SOCKMAP is affected by this setting.</p>
    <div>
      <h3>16KiB blocks - wakeup time</h3>
      <a href="#16kib-blocks-wakeup-time">
        
      </a>
    </div>
    <p>Our second run of tests was with much smaller data sizes, sending tiny 16KiB blocks at a time. This test should illustrate the "wakeup time" of the tested programs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WxbvoQBvU00WzJldYpSED/b794cd76ed5acb14d807fd0fd624857a/numbers-16kib-1.png" />
            
            </figure><p>In this test the non-BUSYPOLL runs of all the programs look quite similar (min and max values), with SOCKMAP being the exception. This is great - we can speculate that the wakeup times are comparable. Surprisingly, splice has a slightly better median time than the others. Perhaps this can be explained by CPU artifacts, like better CPU cache locality due to less data copying. SOCKMAP is, again, the slowest, with the worst max and median times. Boo.</p><p>Remember that we truncated the worst 1% of the data - we artificially shortened the "max" values.</p>
    <div>
      <h3>TL;DR</h3>
      <a href="#tl-dr">
        
      </a>
    </div>
    <p>In this blog post we discussed the theoretical benefits of SOCKMAP. Sadly, we found it's not ready for prime time yet. We compared it against splice, which didn't benefit from BUSYPOLL and had disappointing performance. The naive read/write loop and io_submit approaches turned out to have exactly the same performance characteristics, and do benefit from BUSYPOLL to reduce jitter (wakeup time).</p><p>If you are piping data between TCP sockets, you should definitely take a look at SOCKMAP. While our benchmarks show it's not ready for prime time yet, with poor performance, high jitter and a couple of bugs, it's very promising. We are very excited about it. It's the first technology on Linux that truly allows the user-space process to offload TCP splicing to the kernel. It also has potential for much better performance than other approaches, ticking all the boxes of being async, kernel-only and totally avoiding needless copying of data.</p><p>This is not everything. SOCKMAP is able to pipe data across multiple sockets - you can imagine a full mesh of connections being able to send data to each other. Furthermore, it exposes the <code>strparser</code> API, which can be used to offload basic application framing. Combined with <a href="https://github.com/torvalds/linux/blob/master/Documentation/networking/tls.txt">kTLS</a>, it can provide transparent encryption. There are also rumors of adding UDP support. The possibilities are endless.</p><p>Recently the kernel has been exploding with eBPF innovations. It seems like we've only just scratched the surface of the possibilities exposed by the modern eBPF interfaces.</p><p>Many thanks to <a href="https://twitter.com/jkbs0">Jakub Sitnicki</a> for suggesting SOCKMAP in the first place, writing the proof of concept and now actually fixing the bugs we found. Go strong Warsaw office!</p> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">53j7wbfn6eJxOvUzmnn7H1</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[io_submit: The epoll alternative you've never heard about]]></title>
            <link>https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/</link>
            <pubDate>Fri, 04 Jan 2019 11:02:26 GMT</pubDate>
            <description><![CDATA[ The Linux AIO is designed for, well, Asynchronous disk IO! Disk files are not network sockets, but is it possible to use the Linux AIO API with network sockets? The answer is a strong YES!
 ]]></description>
            <content:encoded><![CDATA[ <p>My curiosity was piqued by an LWN article about <a href="https://lwn.net/Articles/743714/">IOCB_CMD_POLL - A new kernel polling interface</a>. It discusses an addition of a new polling mechanism to Linux AIO API, which was merged in 4.18 kernel. The whole idea is rather intriguing. The author of the patch is proposing to use the Linux AIO API with things like network sockets.</p><p>Hold on. The Linux AIO is designed for, well, <b>A</b>synchronous disk <b>IO</b>! Disk files are not the same thing as network sockets! Is it even possible to use the Linux AIO API with network sockets in the first place?</p><p>The answer turns out to be a strong YES! In this article I'll explain how to use the strengths of Linux AIO API to write better and faster network servers.</p><p>But before we start, what is Linux AIO anyway?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eFqY9nu5REcVlHdjmlag3/bc9f6b06f408bd6703a8ceb51b27f588/6891085910_3390ebe29f_k.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/schill/6891085910/">Photo</a> by <a href="https://www.flickr.com/photos/schill">Scott Schiller</a> CC/BY/2.0</p>
    <div>
      <h2>Introduction to Linux AIO</h2>
      <a href="#introduction-to-linux-aio">
        
      </a>
    </div>
    <p>Linux AIO exposes asynchronous disk IO to userspace software.</p><p>Historically on Linux, all disk operations were blocking. Whether you did <code>open()</code>, <code>read()</code>, <code>write()</code> or <code>fsync()</code>, you could be sure your thread would stall if the needed data and meta-data was not ready in the disk cache. This usually isn't a problem. If you do a small amount of IO or have plenty of memory, the disk syscalls would gradually fill the cache and on average be rather fast.</p><p>IO operation performance drops for IO-heavy workloads, like databases or caching web proxies. In such applications it would be tragic if a whole server stalled just because some odd <code>read()</code> syscall had to wait for disk.</p><p>To work around this problem, applications use one of three approaches:</p><p>(1) Use thread pools and offload blocking syscalls to worker threads. This is what the glibc POSIX AIO (not to be confused with Linux AIO) wrapper does. (See: <a href="https://www.ibm.com/developerworks/linux/library/l-async/">IBM's documentation</a>). This is also what we ended up doing in our application at Cloudflare - <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">we offloaded read() and open() calls to a thread pool</a>.</p><p>(2) Pre-warm the disk cache with <code>posix_fadvise(2)</code> and hope for the best.</p><p>(3) Use Linux AIO with the XFS file system, <a href="https://lwn.net/Articles/671649/">files opened with O_DIRECT</a>, and <a href="https://www.scylladb.com/2016/02/09/qualifying-filesystems/">avoid the undocumented pitfalls</a>.</p><p>None of these methods is perfect. Even Linux AIO, if used carelessly, can still block in the <code>io_submit()</code> call. This was recently mentioned <a href="https://lwn.net/Articles/724198/">in another LWN article</a>:</p><blockquote><p>The Linux asynchronous I/O (AIO) layer tends to have many critics and few defenders, but most people at least expect it to actually be asynchronous. In truth, an AIO operation can block in the kernel for a number of reasons, making AIO difficult to use in situations where the calling thread truly cannot afford to block.</p></blockquote><p>Now that we know what the Linux AIO API doesn't do well, let's see where it shines.</p>
    <div>
      <h2>Simplest Linux AIO program</h2>
      <a href="#simplest-linux-aio-program">
        
      </a>
    </div>
    <p>To use Linux AIO you first need <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_passwd.c">to define the 5 needed syscalls</a> - glibc doesn't provide wrapper functions. The flow is:</p><p>(1) First, call <code>io_setup()</code> to set up the <code>aio_context</code> data structure. The kernel will hand us an opaque pointer.</p><p>(2) Then we can call <code>io_submit()</code> to submit a vector of "I/O control blocks" <code>struct iocb</code> for processing.</p><p>(3) Finally, we can call <code>io_getevents()</code> to block and wait for a vector of <code>struct io_event</code> - the completion notifications for the iocbs.</p><p>There are 8 commands that can be submitted in an iocb. Two read, two write, two fsync variants and a POLL command introduced in the 4.18 kernel:</p>
            <pre><code>IOCB_CMD_PREAD = 0,
IOCB_CMD_PWRITE = 1,
IOCB_CMD_FSYNC = 2,
IOCB_CMD_FDSYNC = 3,
IOCB_CMD_POLL = 5,   /* from 4.18 */
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,</code></pre>
            <p>The <a href="https://github.com/torvalds/linux/blob/f346b0becb1bc62e45495f9cdbae3eef35d0b635/include/uapi/linux/aio_abi.h#L73-L107"><code>struct iocb</code></a> passed to <code>io_submit</code> is large, and tuned for disk IO. Here's a simplified version:</p>
            <pre><code>struct iocb {
  __u64 data;           /* user data */
  ...
  __u16 aio_lio_opcode; /* see IOCB_CMD_ above */
  ...
  __u32 aio_fildes;     /* file descriptor */
  __u64 aio_buf;        /* pointer to buffer */
  __u64 aio_nbytes;     /* buffer size */
...
}</code></pre>
            <p>The completion notification retrieved from <code>io_getevents</code>:</p>
            <pre><code>struct io_event {
  __u64  data;  /* user data */
  __u64  obj;   /* pointer to request iocb */
  __s64  res;   /* result code for this event */
  __s64  res2;  /* secondary result */
};
</code></pre>
            <p>Let's try an example. Here's the simplest program reading /etc/passwd file with Linux AIO API:</p>
            <pre><code>fd = open("/etc/passwd", O_RDONLY);

aio_context_t ctx = 0;
r = io_setup(128, &amp;ctx);

char buf[4096];
struct iocb cb = {.aio_fildes = fd,
                  .aio_lio_opcode = IOCB_CMD_PREAD,
                  .aio_buf = (uint64_t)buf,
                  .aio_nbytes = sizeof(buf)};
struct iocb *list_of_iocb[1] = {&amp;cb};

r = io_submit(ctx, 1, list_of_iocb);

struct io_event events[1] = {{0}};
r = io_getevents(ctx, 1, 1, events, NULL);

bytes_read = events[0].res;
printf("read %lld bytes from /etc/passwd\n", bytes_read);</code></pre>
            <p>Full source is, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_passwd.c">as usual, on GitHub</a>. Here's a strace of this program:</p>
            <pre><code>openat(AT_FDCWD, "/etc/passwd", O_RDONLY)
io_setup(128, [0x7f4fd60ea000])
io_submit(0x7f4fd60ea000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7ffc5ff703d0, aio_nbytes=4096, aio_offset=0}])
io_getevents(0x7f4fd60ea000, 1, 1, [{data=0, obj=0x7ffc5ff70390, res=2494, res2=0}], NULL)</code></pre>
            <p>This all worked fine! But the disk read was not asynchronous: the <code>io_submit</code> syscall blocked and did all the work! The <code>io_getevents</code> call finished instantly. We could try to make the disk read async, but this requires the O_DIRECT flag, which skips the caches.</p><p>Let's better illustrate the blocking nature of <code>io_submit</code> on normal files. Here's a similar example, showing the strace when reading a large 1GiB block from <code>/dev/zero</code>:</p>
            <pre><code>io_submit(0x7fe1e800a000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7fe1a79f4000, aio_nbytes=1073741824, aio_offset=0}]) \
    = 1 &lt;0.738380&gt;
io_getevents(0x7fe1e800a000, 1, 1, [{data=0, obj=0x7fffb9588910, res=1073741824, res2=0}], NULL) \
    = 1 &lt;0.000015&gt;</code></pre>
            <p>The kernel spent 738ms in <code>io_submit</code> and only 15us in <code>io_getevents</code>. The kernel behaves the same way with network sockets - all the work is done in <code>io_submit</code>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Aem3ZSReNmWJZwFNQmySO/326c6c7ca476e955d42a4e14ba8e6150/Network_card-2.jpg" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Network_card.jpg">Photo</a> by <a href="https://commons.wikimedia.org/wiki/User:Helix84">Helix84</a> CC/BY-SA/3.0</p>
    <div>
      <h2>Linux AIO with sockets - batching</h2>
      <a href="#linux-aio-with-sockets-batching">
        
      </a>
    </div>
    <p>The implementation of <code>io_submit</code> is rather conservative. Unless the passed descriptor is an O_DIRECT file, it will just block and perform the requested action. In the case of network sockets it means:</p><ul><li><p>For blocking sockets IOCB_CMD_PREAD will hang until a packet arrives.</p></li><li><p>For non-blocking sockets IOCB_CMD_PREAD will return the -11 (EAGAIN) error code.</p></li></ul><p>These are exactly the same semantics as for the vanilla <code>read()</code> syscall. It's fair to say that for network sockets <code>io_submit</code> is no smarter than good old read/write calls.</p><p>It's important to note that the <code>iocb</code> requests passed to the kernel are evaluated in order, sequentially.</p><p>While Linux AIO won't make network operations asynchronous, it can definitely be used for syscall batching!</p><p>If you have a web server needing to send and receive data from hundreds of network sockets, using <code>io_submit</code> can be a great idea. This avoids having to call <code>send</code> and <code>recv</code> hundreds of times, which improves performance - jumping back and forth from userspace to kernel is not free, especially <a href="https://gist.github.com/antirez/9e716670f76133ec81cb24036f86ee95">since the Meltdown and Spectre mitigations</a>.</p><table><tr><td><p></p></td><td><p><b>One buffer</b></p></td><td><p><b>Multiple buffers</b></p></td></tr><tr><td><p>One file descriptor</p></td><td><p>read()</p></td><td><p>readv()</p></td></tr><tr><td><p>Many file descriptors</p></td><td><p>io_submit + IOCB_CMD_PREAD</p></td><td><p>io_submit + IOCB_CMD_PREADV</p></td></tr></table><p>To illustrate the batching aspect of <code>io_submit</code>, let's create a small program forwarding data from one TCP socket to another. In its simplest form, without Linux AIO, the program would have a trivial flow like this:</p>
            <pre><code>while True:
  d = sd1.read(4096)
  sd2.write(d)</code></pre>
            <p>We can express the same logic with Linux AIO. The code will look like this:</p>
            <pre><code>struct iocb cb[2] = {{.aio_fildes = sd2,
                      .aio_lio_opcode = IOCB_CMD_PWRITE,
                      .aio_buf = (uint64_t)&amp;buf[0],
                      .aio_nbytes = 0},
                     {.aio_fildes = sd1,
                     .aio_lio_opcode = IOCB_CMD_PREAD,
                     .aio_buf = (uint64_t)&amp;buf[0],
                     .aio_nbytes = BUF_SZ}};
struct iocb *list_of_iocb[2] = {&amp;cb[0], &amp;cb[1]};
while(1) {
  r = io_submit(ctx, 2, list_of_iocb);

  struct io_event events[2] = {};
  r = io_getevents(ctx, 2, 2, events, NULL);
  cb[0].aio_nbytes = events[1].res;
}</code></pre>
            <p>This code submits two jobs to <code>io_submit</code>: first a request to write data to <code>sd2</code>, then a request to read data from <code>sd1</code>. After the read is done, the code fixes up the write buffer size and loops again. The code uses a cool trick - the first write is of size 0. We do this because we can fuse write+read in one io_submit (but not read+write); after each read completes, we fix up the write buffer size.</p><p>Is this code faster than the simple read/write version? Not yet. Both versions have two syscalls: read+write and io_submit+io_getevents. Fortunately, we can improve it.</p>
    <div>
      <h2>Getting rid of io_getevents</h2>
      <a href="#getting-rid-of-io_getevents">
        
      </a>
    </div>
    <p>When running <code>io_setup()</code>, the kernel allocates a couple of pages of memory for the process. This is how this memory block looks in /proc/&lt;pid&gt;/maps:</p>
            <pre><code>marek:~$ cat /proc/`pidof -s aio_passwd`/maps
...
7f7db8f60000-7f7db8f63000 rw-s 00000000 00:12 2314562     /[aio] (deleted)
...</code></pre>
            <p>The [aio] memory region (12KiB in my case) was allocated by <code>io_setup</code>. This memory range is used as a ring buffer, storing the completion events. In most cases, there isn't any reason to call the real <code>io_getevents</code> syscall. The completion data can be easily retrieved from the ring buffer without consulting the kernel. Here is a fixed version of the code:</p>
            <pre><code>int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                 struct io_event *events, struct timespec *timeout)
{
    int i = 0;

    struct aio_ring *ring = (struct aio_ring*)ctx;
    if (ring == NULL || ring-&gt;magic != AIO_RING_MAGIC) {
        goto do_syscall;
    }

    while (i &lt; max_nr) {
        unsigned head = ring-&gt;head;
        if (head == ring-&gt;tail) {
            /* There are no more completions */
            break;
        } else {
            /* There is another completion to reap */
            events[i] = ring-&gt;events[head];
            read_barrier();
            ring-&gt;head = (head + 1) % ring-&gt;nr;
            i++;
        }
    }

    if (i == 0 &amp;&amp; timeout != NULL &amp;&amp; timeout-&gt;tv_sec == 0 &amp;&amp; timeout-&gt;tv_nsec == 0) {
        /* Requested non blocking operation. */
        return 0;
    }

    if (i &amp;&amp; i &gt;= min_nr) {
        return i;
    }

do_syscall:
    return syscall(__NR_io_getevents, ctx, min_nr-i, max_nr-i, &amp;events[i], timeout);
}
</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_poll.c#L63">Here's the full code</a>. This ring buffer interface is poorly documented; I adapted this code from <a href="https://github.com/axboe/fio/blob/702906e9e3e03e9836421d5e5b5eaae3cd99d398/engines/libaio.c#L149-L172">the axboe/fio project</a>.</p><p>With <code>io_getevents</code> fixed up like this, our Linux AIO version of the TCP proxy needs only one syscall per loop, and indeed is a tiny bit faster than the read+write code.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BUl6rp23b2NJ1XX8Z7RmV/1260b98a4e8b77dc6610626a38ce6ecb/16026681353_90b26e0731_z.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/99279135@N05/16026681353/">Photo</a> by <a href="https://www.flickr.com/photos/99279135@N05/">Train Photos</a> CC/BY-SA/2.0</p>
    <div>
      <h2>Epoll alternative</h2>
      <a href="#epoll-alternative">
        
      </a>
    </div>
    <p>With the addition of IOCB_CMD_POLL in kernel 4.18, <code>io_submit</code> can also be used as a select/poll/epoll equivalent. For example, here's some code waiting for data on a socket:</p>
            <pre><code>struct iocb cb = {.aio_fildes = sd,
                  .aio_lio_opcode = IOCB_CMD_POLL,
                  .aio_buf = POLLIN};
struct iocb *list_of_iocb[1] = {&amp;cb};

r = io_submit(ctx, 1, list_of_iocb);
r = io_getevents(ctx, 1, 1, events, NULL);</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_poll.c">Full code</a>. Here's the strace view:</p>
            <pre><code>io_submit(0x7fe44bddd000, 1, [{aio_lio_opcode=IOCB_CMD_POLL, aio_fildes=3}]) \
    = 1 &lt;0.000015&gt;
io_getevents(0x7fe44bddd000, 1, 1, [{data=0, obj=0x7ffef65c11a8, res=1, res2=0}], NULL) \
    = 1 &lt;1.000377&gt;</code></pre>
            <p>As you can see, this time the "async" part worked fine: <code>io_submit</code> finished instantly and <code>io_getevents</code> successfully blocked for one second while awaiting data. This is pretty powerful and can be used instead of the <code>epoll_wait()</code> syscall.</p><p>Furthermore, dealing with <code>epoll</code> normally requires juggling <code>epoll_ctl</code> syscalls, and application developers go to great lengths to avoid calling it too often. <a href="http://man7.org/linux/man-pages/man7/epoll.7.html">Just read the man page</a> on the EPOLLONESHOT and EPOLLET flags. Using <code>io_submit</code> for polling sidesteps this whole complexity and doesn't require any spurious syscalls: just push your sockets into the iocb request vector, call <code>io_submit</code> exactly once, and wait for completions. The API can't get simpler than this.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this blog post we reviewed the Linux AIO API. While initially conceived as a disk-only API, it seems to work in the same way as normal read/write syscalls on network sockets. But unlike read/write, <code>io_submit</code> allows syscall batching, potentially improving performance.</p><p>Since kernel 4.18, <code>io_submit</code> and <code>io_getevents</code> can be used to wait for events like POLLIN and POLLOUT on network sockets. This is great, and could serve as a replacement for <code>epoll()</code> in an event loop.</p><p>I can imagine a network server that issues only <code>io_submit</code> and <code>io_getevents</code> syscalls, as opposed to the usual mix of <code>read</code>, <code>write</code>, <code>epoll_ctl</code> and <code>epoll_wait</code>. With such a design, the syscall batching aspect of <code>io_submit</code> could really shine, and such a server could be meaningfully faster.</p><p>Sadly, even with the recent Linux AIO API improvements, the larger debate remains. Famously, <a href="https://lwn.net/Articles/671657/">Linus hates it</a>:</p><blockquote><p>AIO is a horrible ad-hoc design, with the main excuse being "other, less gifted people, made that design, and we are implementing it for compatibility because database people - who seldom have any shred of taste - actually use it". But AIO was always really really ugly.</p></blockquote><p>Over the years there have been multiple attempts at creating better batching and async interfaces, unfortunately lacking a coherent vision. For example, the <a href="https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/Documentation/networking/msg_zerocopy.rst">recent addition of <code>sendto(MSG_ZEROCOPY)</code></a> allows for truly asynchronous transmission operations, but no batching. <code>io_submit</code> allows batching, but not asynchrony. It's even worse than that: Linux currently has three ways of delivering async notifications - signals, <code>io_getevents</code> and <code>MSG_ERRQUEUE</code>.</p><p>Having said that, I'm really excited to see the new developments that make it possible to build faster network servers. I'm jumping on the code to replace my rusty epoll event loops with <code>io_submit</code>!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">6zyOz9SzvPQNiQFsPWodxb</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
    </channel>
</rss>