
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 10:15:30 GMT</lastBuildDate>
        <item>
            <title><![CDATA[QUIC restarts, slow problems: udpgrm to the rescue]]></title>
            <link>https://blog.cloudflare.com/quic-restarts-slow-problems-udpgrm-to-the-rescue/</link>
            <pubDate>Wed, 07 May 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ udpgrm is a lightweight daemon for graceful restarts of UDP servers. It leverages SO_REUSEPORT and eBPF to route new and existing flows to the correct server instance. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare, we do everything we can to avoid interruption to our services. We frequently deploy new versions of the code that delivers the services, so we need to be able to restart the server processes to upgrade them without missing a beat. In particular, performing graceful restarts (also known as "zero downtime") for UDP servers has proven to be surprisingly difficult.</p><p>We've <a href="https://blog.cloudflare.com/graceful-upgrades-in-go/"><u>previously</u></a> <a href="https://blog.cloudflare.com/oxy-the-journey-of-graceful-restarts/"><u>written</u></a> about graceful restarts in the context of TCP, which is much easier to handle. We didn't have a strong reason to deal with UDP until recently — when protocols like HTTP3/QUIC became critical. This blog post introduces <b><i>udpgrm</i></b>, a lightweight daemon that helps us to upgrade UDP servers without dropping a single packet.</p><p><a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>Here's the </u><i><u>udpgrm</u></i><u> GitHub repo</u></a>.</p>
    <div>
      <h2>Historical context</h2>
      <a href="#historical-context">
        
      </a>
    </div>
    <p>In the early days of the Internet, UDP was used for stateless request/response communication with protocols like DNS or NTP. Restarts of a server process are not a problem in that context, because it does not have to retain state across multiple requests. However, modern protocols like QUIC, WireGuard, and SIP, as well as online games, use stateful flows. So what happens to the state associated with a flow when a server process is restarted? Typically, old connections are just dropped during a server restart. Migrating the flow state from the old instance to the new instance is possible, but it is complicated and notoriously hard to get right.</p><p>The same problem occurs for TCP connections, but there a common approach is to keep the old instance of the server process running alongside the new instance for a while, routing new connections to the new instance while letting existing ones drain on the old. Once all connections finish or a timeout is reached, the old instance can be safely shut down. The same approach works for UDP, but it requires more involvement from the server process than for TCP.</p><p>In the past, we <a href="https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/"><u>described</u></a> the <i>established-over-unconnected</i> method. It offers one way to implement flow handoff, but it comes with significant drawbacks: it’s prone to race conditions in protocols with multi-packet handshakes, and it suffers from a scalability issue. Specifically, the kernel hash table used for dispatching packets is keyed only by the local IP:port tuple, which can lead to bucket overfill when dealing with many inbound UDP sockets.</p><p>Now we have found a better method, leveraging Linux’s <code>SO_REUSEPORT</code> API. By placing both old and new sockets into the same REUSEPORT group and using an eBPF program for flow tracking, we can route packets to the correct instance and preserve flow stickiness. 
This is how <i>udpgrm</i> works.</p>
    <div>
      <h2>REUSEPORT group</h2>
      <a href="#reuseport-group">
        
      </a>
    </div>
    <p>Before diving deeper, let's quickly review the basics. Linux provides the <code>SO_REUSEPORT</code> socket option, typically set after <code>socket()</code> but before <code>bind()</code>. Please note that this has a separate purpose from the better known <code>SO_REUSEADDR</code> socket option.</p><p><code>SO_REUSEPORT</code> allows multiple sockets to bind to the same IP:port tuple. This feature is primarily used for load balancing, letting servers spread traffic efficiently across multiple CPU cores. You can think of it as a way for an IP:port to be associated with multiple packet queues. In the kernel, sockets sharing an IP:port this way are organized into a <i>reuseport group </i>— a term we'll refer to frequently throughout this post.</p>
            <pre><code>┌───────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443             │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ socket #1 │ │ socket #2 │ │ socket #3 │ │
│ └───────────┘ └───────────┘ └───────────┘ │
└───────────────────────────────────────────┘
</code></pre>
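            <p>To make this concrete, here's a minimal Python sketch (the loopback address and port are arbitrary) that puts two UDP sockets into one reuseport group. Without <code>SO_REUSEPORT</code> set on both sockets, the second <code>bind()</code> would fail with <code>EADDRINUSE</code>:</p>
            <pre><code>import socket

def join_group(addr, port):
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set before bind() for the socket to join the group.
    sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sd.bind((addr, port))
    return sd

# Two sockets sharing one IP:port tuple: a reuseport group of size 2.
a = join_group("127.0.0.1", 5201)
b = join_group("127.0.0.1", 5201)</code></pre>
            <p>With the kernel's default distribution, inbound datagrams to this IP:port are spread across the two sockets based on a hash of the 4-tuple.</p>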
            <p>Linux supports several methods for distributing inbound packets across a reuseport group. By default, the kernel uses a hash of the packet's 4-tuple to select a target socket. Another method is <code>SO_INCOMING_CPU</code>, which, when enabled, tries to steer packets to sockets running on the same CPU that received the packet. This approach works but has limited flexibility.</p><p>To provide more control, Linux introduced the <code>SO_ATTACH_REUSEPORT_CBPF</code> option, allowing server processes to attach a classic BPF (cBPF) program to make socket selection decisions. This was later extended with <code>SO_ATTACH_REUSEPORT_EBPF</code>, enabling the use of modern eBPF programs. With <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>eBPF</u></a>, developers can implement arbitrary custom logic. A boilerplate program would look like this:</p>
            <pre><code>SEC("sk_reuseport")
int udpgrm_reuseport_prog(struct sk_reuseport_md *md)
{
    uint64_t socket_identifier = xxxx;
    bpf_sk_select_reuseport(md, &amp;sockhash, &amp;socket_identifier, 0);
    return SK_PASS;
}</code></pre>
            <p>To select a specific socket, the eBPF program calls <code>bpf_sk_select_reuseport</code>, using a reference to a map with sockets (<code>SOCKHASH</code>, <code>SOCKMAP</code>, or the older, mostly obsolete <code>SOCKARRAY</code>), along with a key or index. For example, a declaration of a <code>SOCKHASH</code> might look like this:</p>
            <pre><code>struct {
	__uint(type, BPF_MAP_TYPE_SOCKHASH);
	__uint(max_entries, MAX_SOCKETS);
	__uint(key_size, sizeof(uint64_t));
	__uint(value_size, sizeof(uint64_t));
} sockhash SEC(".maps");</code></pre>
            <p>This <code>SOCKHASH</code> is a hash map that holds references to sockets, even though the value size looks like a scalar 8-byte value. In our case it's indexed by a <code>uint64_t</code> key. This is pretty neat, as it allows for a simple number-to-socket mapping!</p><p>However, there's a catch: <b>the </b><code><b>SOCKHASH</b></code><b> must be populated and maintained from user space (or a separate control plane), outside the eBPF program itself</b>. Keeping this socket map accurate and in sync with the server process state is surprisingly difficult to get right — especially under dynamic conditions like restarts, crashes, or scaling events. The point of <i>udpgrm</i> is to take care of this stuff, so that server processes don’t have to.</p>
    <div>
      <h2>Socket generation and working generation</h2>
      <a href="#socket-generation-and-working-generation">
        
      </a>
    </div>
    <p>Let’s look at how graceful restarts for UDP flows are achieved in <i>udpgrm</i>. To reason about this setup, we’ll need a bit of terminology: A <b>socket generation</b> is a set of sockets within a reuseport group that belong to the same logical application instance:</p>
            <pre><code>┌───────────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 0                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #1 │ │ socket #2 │ │ socket #3 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
│  ┌─────────────────────────────────────────────┐  │
│  │ socket generation 1                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐  │  │
│  │  │ socket #4 │ │ socket #5 │ │ socket #6 │  │  │
│  │  └───────────┘ └───────────┘ └───────────┘  │  │
│  └─────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────┘</code></pre>
            <p>When a server process needs to be restarted, the new version creates a new socket generation for its sockets. The old version keeps running alongside the new one, using sockets from the previous socket generation.</p><p>Reuseport eBPF routing boils down to two problems:</p><ul><li><p>For new flows, we should choose a socket from the socket generation that belongs to the active server instance.</p></li><li><p>For already established flows, we should choose the appropriate socket — possibly from an older socket generation — to keep the flows sticky. The flows will eventually drain away, allowing the old server instance to shut down.</p></li></ul><p>Easy, right?</p><p>Of course not! The devil is in the details. Let's take it one step at a time.</p><p>Routing new flows is relatively easy. <i>udpgrm</i> simply maintains a reference to the socket generation that should handle new connections. We call this reference the <b>working generation</b>. Whenever a new flow arrives, the eBPF program consults the working generation pointer and selects a socket from that generation.</p>
            <pre><code>┌──────────────────────────────────────────────┐
│ reuseport group 192.0.2.0:443                │
│   ...                                        │
│   Working generation ────┐                   │
│                          V                   │
│           ┌───────────────────────────────┐  │
│           │ socket generation 1           │  │
│           │  ┌───────────┐ ┌──────────┐   │  │
│           │  │ socket #4 │ │ ...      │   │  │
│           │  └───────────┘ └──────────┘   │  │
│           └───────────────────────────────┘  │
│   ...                                        │
└──────────────────────────────────────────────┘</code></pre>
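            <p>The routing rule can be modeled as a toy in a few lines of Python (an illustrative model of the logic, not udpgrm's actual eBPF code; the class and field names are ours):</p>
            <pre><code>class ReuseportGroup:
    """Toy model: route() mimics udpgrm's reuseport routing decision."""

    def __init__(self):
        self.generations = {}  # socket generation number to list of sockets
        self.working_gen = 0   # generation that receives new flows
        self.flows = {}        # flow key to socket, for stickiness

    def route(self, flow_key):
        # Established flow: keep it on the socket it started on,
        # even if that socket belongs to an old generation.
        sock = self.flows.get(flow_key)
        if sock is None:
            # New flow: pick a socket from the working generation.
            socks = self.generations[self.working_gen]
            sock = socks[hash(flow_key) % len(socks)]
            self.flows[flow_key] = sock
        return sock</code></pre>
            <p>Bumping <code>working_gen</code> then behaves like a graceful restart: old flows stay sticky to the old generation while new flows land on the new one.</p>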
    <p>For this to work, we first need to be able to differentiate packets belonging to new connections from packets belonging to old connections. This is very tricky and highly dependent on the specific UDP protocol. For example, QUIC has an <a href="https://datatracker.ietf.org/doc/html/rfc9000#name-initial-packet"><i><u>initial packet</u></i></a> concept, similar to a TCP SYN, but other protocols might not.</p><p>There needs to be some flexibility here, so <i>udpgrm</i> makes this configurable. Each reuseport group sets a specific <b>flow dissector</b>.</p><p>The flow dissector has two tasks:</p><ul><li><p>It distinguishes new packets from packets belonging to old, already established flows.</p></li><li><p>For recognized flows, it tells <i>udpgrm</i> which specific socket the flow belongs to.</p></li></ul><p>These concepts are closely related and depend on the specific server. Different UDP protocols define flows differently. For example, a naive UDP server might use a typical 5-tuple to define flows, while QUIC uses a "connection ID" field in the QUIC packet header to survive <a href="https://www.rfc-editor.org/rfc/rfc9308.html#section-3.2"><u>NAT rebinding</u></a>.</p><p><i>udpgrm</i> supports three flow dissectors out of the box and is highly configurable to support any UDP protocol. More on this later.</p>
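            <p>As an illustration of what a QUIC-aware dissector has to do, here is a simplified Python sketch (ours, not udpgrm's code) that pulls the Destination Connection ID out of a QUIC long-header packet, following the version-invariant layout from RFC 8999:</p>
            <pre><code>def quic_long_header_dcid(payload):
    # QUIC invariants (RFC 8999) long-header layout:
    # 1 byte flags, 4 bytes version, 1 byte DCID length, then the DCID.
    header = payload[:6]
    if len(header) != 6:
        return None  # too short to carry a long header
    if header[0] // 0x80 != 1:  # high bit clear: a short-header packet
        return None
    dcid_len = header[5]
    dcid = payload[6:6 + dcid_len]
    if len(dcid) != dcid_len:
        return None  # truncated packet
    return dcid

# A fabricated long-header packet: flags, version 1, an 8-byte DCID.
pkt = bytes([0xC0, 0x00, 0x00, 0x00, 0x01, 0x08]) + bytes(range(8))</code></pre>
            <p>Short-header packets also start with a DCID, but they carry no length byte, so a dissector must know the expected DCID length out of band.</p>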
    <div>
      <h2>Welcome udpgrm!</h2>
      <a href="#welcome-udpgrm">
        
      </a>
    </div>
    <p>Now that we've covered the theory, we're ready for business: please welcome <b>udpgrm</b> — UDP Graceful Restart Marshal! <i>udpgrm</i> is a stateful daemon that handles all the complexities of the graceful restart process for UDP. It installs the appropriate eBPF REUSEPORT program, maintains flow state, communicates with the server process during restarts, and reports useful metrics for easier debugging.</p><p>We can describe <i>udpgrm</i> from two perspectives: for administrators and for programmers.</p>
    <div>
      <h2>udpgrm daemon for the system administrator</h2>
      <a href="#udpgrm-daemon-for-the-system-administrator">
        
      </a>
    </div>
    <p><i>udpgrm</i> is a stateful daemon. To run it:</p>
            <pre><code>$ sudo udpgrm --daemon
[ ] Loading BPF code
[ ] Pinning bpf programs to /sys/fs/bpf/udpgrm
[*] Tailing message ring buffer  map_id 936146</code></pre>
            <p>This sets up the basic functionality, prints rudimentary logs, and should be deployed as a dedicated systemd service — loaded after networking. However, this is not enough to fully use <i>udpgrm</i>. <i>udpgrm</i> needs to hook into <code>getsockopt</code>, <code>setsockopt</code>, <code>bind</code>, and <code>sendmsg</code> syscalls, which are scoped to a cgroup. You can install the <i>udpgrm</i> hooks into a cgroup like this:</p>
            <pre><code>$ sudo udpgrm --install=/sys/fs/cgroup/system.slice</code></pre>
            <p>But a more common pattern is to install it within the <i>current</i> cgroup:</p>
            <pre><code>$ sudo udpgrm --install --self</code></pre>
            <p>Better yet, use it as part of the systemd "service" config:</p>
            <pre><code>[Service]
...
ExecStartPre=/usr/local/bin/udpgrm --install --self</code></pre>
            <p>Once <i>udpgrm</i> is running, the administrator can use the CLI to list reuseport groups, sockets, and metrics, like this:</p>
            <pre><code>$ sudo udpgrm list
[ ] Retrieving BPF progs from /sys/fs/bpf/udpgrm
192.0.2.0:4433
	netns 0x1  dissector bespoke  digest 0xdead
	socket generations:
		gen  3  0x17a0da  &lt;=  app 0  gen 3
	metrics:
		rx_processed_total 13777528077
...</code></pre>
            <p>Now, with the <i>udpgrm</i> daemon running and the cgroup hooks set up, we can focus on the server part.</p>
    <div>
      <h2>udpgrm for the programmer</h2>
      <a href="#udpgrm-for-the-programmer">
        
      </a>
    </div>
    <p>We expect the server to create the appropriate UDP sockets by itself. We depend on <code>SO_REUSEPORT</code>, so that each server instance can have a dedicated socket or a set of sockets:</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_DGRAM, 0)
sd.setsockopt(SOL_SOCKET, SO_REUSEPORT, 1)
sd.bind(("192.0.2.1", 5201))</code></pre>
            <p>With a socket descriptor handy, we can pursue the <i>udpgrm</i> magic dance. The server communicates with the <i>udpgrm</i> daemon using <code>setsockopt</code> calls. Behind the scenes, udpgrm provides eBPF <code>setsockopt</code> and <code>getsockopt</code> hooks and hijacks specific calls. It's not easy to set up on the kernel side, but when it works, it’s truly awesome. A typical socket setup looks like this:</p>
            <pre><code>try:
    work_gen = sd.getsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN)
except OSError:
    raise OSError('Is udpgrm daemon loaded? Try "udpgrm --self --install"')
    
sd.setsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, work_gen + 1)
for i in range(10):
    v = sd.getsockopt(IPPROTO_UDP, UDP_GRM_SOCKET_GEN, 8);
    sk_gen, sk_idx = struct.unpack('II', v)
    if sk_idx != 0xffffffff:
        break
    time.sleep(0.01 * (2 ** i))
else:
    raise OSError("Communicating with udpgrm daemon failed.")

sd.setsockopt(IPPROTO_UDP, UDP_GRM_WORKING_GEN, work_gen + 1)</code></pre>
            <p>You can see three blocks here:</p><ul><li><p>First, we retrieve the working generation number and, by doing so, check for <i>udpgrm</i> presence. Typically, <i>udpgrm</i> absence is fine for non-production workloads.</p></li><li><p>Then we register the socket to an arbitrary socket generation. We choose <code>work_gen + 1</code> as the value and verify that the registration went through correctly.</p></li><li><p>Finally, we bump the working generation pointer.</p></li></ul><p>That's it! Hopefully, the API presented here is clear and reasonable. Under the hood, the <i>udpgrm</i> daemon installs the REUSEPORT eBPF program, sets up internal data structures, collects metrics, and manages the sockets in a <code>SOCKHASH</code>.</p>
    <div>
      <h2>Advanced socket creation with udpgrm_activate.py</h2>
      <a href="#advanced-socket-creation-with-udpgrm_activate-py">
        
      </a>
    </div>
    <p>In practice, we often need sockets bound to low ports like <code>:443</code>, which requires elevated privileges like <code>CAP_NET_BIND_SERVICE</code>. It's usually better to configure listening sockets outside the server itself. A typical pattern is to pass the listening sockets using <a href="https://0pointer.de/blog/projects/socket-activation.html"><u>socket activation</u></a>.</p><p>Sadly, systemd cannot create a new set of UDP <code>SO_REUSEPORT</code> sockets for each server instance. To overcome this limitation, <i>udpgrm</i> provides a script called <code>udpgrm_activate.py</code>, which can be used like this:</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm_activate.py test-port 0.0.0.0:5201</code></pre>
            <p>Here, <code>udpgrm_activate.py</code> binds to <code>0.0.0.0:5201</code> and stores the created socket in the systemd FD store under the name <code>test-port</code>. The server <code>echoserver.py</code> will inherit this socket and receive the appropriate <code>LISTEN_FDS</code> environment variables, following the typical systemd socket activation pattern.</p>
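            <p>On the server side, picking these sockets up follows the standard systemd socket-activation protocol: systemd sets <code>LISTEN_PID</code>, <code>LISTEN_FDS</code>, and <code>LISTEN_FDNAMES</code>, and the inherited descriptors start at file descriptor 3. A sketch in Python (the helper name is ours):</p>
            <pre><code>import os
import socket

SD_LISTEN_FDS_START = 3  # first inherited fd, per sd_listen_fds(3)

def inherited_sockets():
    """Return a name-to-socket mapping for fds passed by systemd."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return {}  # the fd store contents were not meant for us
    count = int(os.environ.get("LISTEN_FDS", "0"))
    names = os.environ.get("LISTEN_FDNAMES", "").split(":")
    fds = range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count)
    return {name: socket.socket(fileno=fd) for name, fd in zip(names, fds)}</code></pre>
            <p>A server would look up its listener by name, e.g. <code>inherited_sockets()["test-port"]</code>, instead of calling <code>bind()</code> itself.</p>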
    <div>
      <h2>Systemd service lifetime</h2>
      <a href="#systemd-service-lifetime">
        
      </a>
    </div>
    <p>Systemd typically can't handle more than one server instance running at the same time. It prefers to kill the old instance quickly. It supports the "at most one" server instance model, not the "at least one" model that we want. To work around this, <i>udpgrm</i> provides a <b>decoy</b> script that will exit when systemd asks it to, while the actual old instance of the server can stay active in the background.</p>
            <pre><code>[Service]
...
ExecStart=/usr/local/bin/mmdecoy examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop.
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            <p>At this point, we can show a full template for a <i>udpgrm</i>-enabled server that contains all three elements: <code>udpgrm --install --self</code> for cgroup hooks, <code>udpgrm_activate.py</code> for socket creation, and <code>mmdecoy</code> for fooling systemd service lifetime checks.</p>
            <pre><code>[Service]
Type=notify                 # Enable access to fd store
NotifyAccess=all            # Allow access to fd store from ExecStartPre
FileDescriptorStoreMax=128  # Limit of stored sockets must be set

ExecStartPre=/usr/local/bin/udpgrm --install --self
ExecStartPre=/usr/local/bin/udpgrm_activate.py --no-register test-port 0.0.0.0:5201
ExecStart=/usr/local/bin/mmdecoy PWD/examples/echoserver.py

Restart=always             # if pid dies, restart it.
KillMode=process           # Kill only decoy, keep children after stop. 
KillSignal=SIGTERM         # Make signals explicit</code></pre>
            
    <div>
      <h2>Dissector modes</h2>
      <a href="#dissector-modes">
        
      </a>
    </div>
    <p>We've discussed the <i>udpgrm</i> daemon, the <i>udpgrm</i> setsockopt API, and systemd integration, but we haven't yet covered the details of routing logic for old flows. To handle arbitrary protocols, <i>udpgrm</i> supports three <b>dissector modes</b> out of the box:</p><p><b>DISSECTOR_FLOW</b>: <i>udpgrm</i> maintains a flow table indexed by a flow hash computed from a typical 4-tuple. It stores a target socket identifier for each flow. The flow table size is fixed, so there is a limit to the number of concurrent flows supported by this mode. To mark a flow as "assured," <i>udpgrm</i> hooks into the <code>sendmsg</code> syscall and saves the flow in the table only when a message is sent.</p><p><b>DISSECTOR_CBPF</b>: A cookie-based model where the target socket identifier — called a udpgrm cookie — is encoded in each incoming UDP packet. For example, in QUIC, this identifier can be stored as part of the connection ID. The dissection logic is expressed as cBPF code. This model does not require a flow table in <i>udpgrm</i> but is harder to integrate because it needs protocol and server support.</p><p><b>DISSECTOR_NOOP</b>: A no-op mode with no state tracking at all. It is useful for traditional UDP services like DNS, where we want to avoid losing even a single packet during an upgrade.</p><p>Finally, <i>udpgrm</i> provides a template for a more advanced dissector called <b>DISSECTOR_BESPOKE</b>. Currently, it includes a QUIC dissector that can decode the QUIC TLS SNI and direct specific TLS hostnames to specific socket generations.</p><p>For more details, <a href="https://github.com/cloudflare/udpgrm/blob/main/README.md"><u>please consult the </u><i><u>udpgrm</u></i><u> README</u></a>. In short: the FLOW dissector is the simplest one, useful for old protocols. 
The CBPF dissector is good for experimentation when the protocol allows storing a custom connection ID (cookie) — we used it to develop our own QUIC connection ID (DCID) schema — but it's slow, because it interprets cBPF inside eBPF (yes, really!). The NOOP dissector is useful, but only for very specific niche servers. The real magic is in the BESPOKE type, where users can create arbitrary, fast, and powerful dissector logic.</p>
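            <p>To illustrate the cookie idea behind the CBPF dissector, here is one possible connection ID schema sketched in Python (the layout is hypothetical; udpgrm's real DCID format differs). The server embeds a socket identifier when minting a connection ID, and the dissector later reads it back from incoming packets:</p>
            <pre><code>import os
import struct

COOKIE_OFFSET = 1  # hypothetical layout: 1 version byte, 2 cookie bytes, 5 random

def mint_dcid(sock_cookie):
    # Server side: embed the target socket identifier into a fresh DCID.
    return bytes([0x01]) + struct.pack("!H", sock_cookie) + os.urandom(5)

def extract_cookie(dcid):
    # Dissector side: recover the identifier from an incoming packet's DCID.
    (cookie,) = struct.unpack_from("!H", dcid, COOKIE_OFFSET)
    return cookie</code></pre>
            <p>Because the QUIC client echoes the server-chosen connection ID on subsequent packets, the identifier survives NAT rebinding with no flow table at all.</p>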
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>The adoption of QUIC and other UDP-based protocols means that gracefully restarting UDP servers is becoming an increasingly important problem. To our knowledge, a reusable, configurable and easy to use solution didn't exist yet. The <i>udpgrm</i> project brings together several novel ideas: a clean API using <code>setsockopt()</code>, careful socket-stealing logic hidden under the hood, powerful and expressive configurable dissectors, and well-thought-out integration with systemd.</p><p>While <i>udpgrm</i> is intended to be easy to use, it hides a lot of complexity and solves a genuinely hard problem. The core issue is that the Linux Sockets API has not kept up with the modern needs of UDP.</p><p>Ideally, most of this should really be a feature of systemd. That includes supporting the "at least one" server instance mode, UDP <code>SO_REUSEPORT</code> socket creation, installing a <code>REUSEPORT_EBPF</code> program, and managing the "working generation" pointer. We hope that <i>udpgrm</i> helps create the space and vocabulary for these long-term improvements.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">2baeaA3qbgFISPMjlZ74a4</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Multi-Path TCP: revolutionizing connectivity, one path at a time]]></title>
            <link>https://blog.cloudflare.com/multi-path-tcp-revolutionizing-connectivity-one-path-at-a-time/</link>
            <pubDate>Fri, 03 Jan 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Multi-Path TCP (MPTCP) leverages multiple network interfaces, like Wi-Fi and cellular, to provide seamless mobility for more reliable connectivity. While promising, MPTCP is still in its early stages, ]]></description>
            <content:encoded><![CDATA[ <p>The Internet is designed to provide multiple paths between two endpoints. Attempts to exploit multi-path opportunities are almost as old as the Internet, culminating in <a href="https://datatracker.ietf.org/doc/html/rfc2991"><u>RFCs</u></a> documenting some of the challenges. Still, today, virtually all end-to-end communication uses only one available path at a time. Why? It turns out that in multi-path setups, even the smallest differences between paths can harm the connection quality due to <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing#History"><u>packet reordering</u></a> and other issues. As a result, Internet devices usually use a single path and let the routers handle the path selection.</p><p>There is another way. Enter Multi-Path TCP (MPTCP), which exploits the presence of multiple interfaces on a device, such as a mobile phone that has both Wi-Fi and cellular antennas, to achieve multi-path connectivity.</p><p>MPTCP has had a long history — see the <a href="https://en.wikipedia.org/wiki/Multipath_TCP"><u>Wikipedia article</u></a> and the <a href="https://datatracker.ietf.org/doc/html/rfc8684"><u>spec (RFC 8684)</u></a> for details. It's a major extension to the TCP protocol, and historically most extensions to TCP have failed to gain traction. However, MPTCP is supposed to be mostly an operating system feature, making it easy to enable. Applications should only need minor code changes to support it.</p><p>There is a caveat, however: MPTCP is still fairly immature, and while it can use multiple paths, giving it superpowers over regular TCP, it's not always strictly better. Whether MPTCP should be used instead of TCP is really a case-by-case decision.</p><p>In this blog post we show how to set up MPTCP to find out.</p>
    <div>
      <h2>Subflows</h2>
      <a href="#subflows">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r8AP5BHbvtYEtXmYSXFwO/36e95cbac93cdecf2f5ee65945abf0b3/Screenshot_2024-12-23_at_3.07.37_PM.png" />
          </figure><p>Internally, MPTCP extends TCP by introducing "subflows". When everything is working, a single TCP connection can be backed by multiple MPTCP subflows, each using a different path. This is a big deal: a single TCP byte stream is no longer identified by a single 5-tuple. On Linux you can see the subflows with <code>ss -M</code>, like:</p>
            <pre><code>marek$ ss -tMn dport = :443 | cat
tcp   ESTAB 0  	0 192.168.2.143%enx2800af081bee:57756 104.28.152.1:443
tcp   ESTAB 0  	0       192.168.1.149%wlp0s20f3:44719 104.28.152.1:443
mptcp ESTAB 0  	0                 192.168.2.143:57756 104.28.152.1:443</code></pre>
            <p>Here you can see a single MPTCP connection, composed of two underlying TCP flows.</p>
    <div>
      <h2>MPTCP aspirations</h2>
      <a href="#mptcp-aspirations">
        
      </a>
    </div>
    <p>Being able to separate the lifetime of a connection from the lifetime of a flow allows MPTCP to address two problems present in classical TCP: aggregation and mobility.</p><ul><li><p><b>Aggregation</b>: MPTCP can aggregate the bandwidth of many network interfaces. For example, in a data center scenario, it's common to use interface bonding. A single flow can make use of just one physical interface. MPTCP, by being able to launch many subflows, can expose greater overall bandwidth. I'm personally not convinced that this is a real problem. As we'll learn below, modern Linux has a <a href="https://dl.ifip.org/db/conf/networking/networking2016/1570234725.pdf"><u>BLEST-like MPTCP scheduler</u></a> and the macOS stack has the "aggregation" mode, so aggregation should work, but I'm not sure how practical it is. However, there are <a href="https://www.openmptcprouter.com/"><u>certainly projects that are trying to do link aggregation</u></a> using MPTCP.</p></li><li><p><b>Mobility</b>: On a customer device, a TCP stream is typically broken if the underlying network interface goes away. This is not an uncommon occurrence — consider a smartphone dropping from Wi-Fi to cellular. MPTCP can fix this — it can create and destroy many subflows over the lifetime of a single connection and survive multiple network changes.</p></li></ul><p>Improving reliability for mobile clients is a big deal. While some software can use QUIC, which also has <a href="https://www.ietf.org/archive/id/draft-ietf-quic-multipath-11.html"><u>Multipath Extensions</u></a>, a large number of classical services still use TCP. A great example is SSH: it would be very nice if you could walk around with a laptop and keep an SSH session open and switch Wi-Fi networks seamlessly, without breaking the connection.</p><p>MPTCP work was initially driven by <a href="https://uclouvain.be/fr/index.html"><u>UCLouvain in Belgium</u></a>. The first serious adoption was on the iPhone. Apparently, users tend to use Siri while walking out of their homes, and it's very common to lose Wi-Fi connectivity while doing so. (<a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>source</u></a>)</p>
    <div>
      <h2>Implementations</h2>
      <a href="#implementations">
        
      </a>
    </div>
    <p>Currently, there are only two major MPTCP implementations: Linux, with kernel support from v5.6 — but realistically you need at least kernel v6.1 (<a href="https://oracle.github.io/kconfigs/?config=UTS_RELEASE&amp;config=MPTCP"><u>MPTCP is not supported on Android</u></a> yet) — and Apple's, in iOS from version 7 and Mac OS X from 10.10.</p><p>Typically, Linux is used on the server side, and iOS/macOS as the client. It's possible to get Linux to work as a client, but it's not straightforward, as we'll learn soon. Beware — there is plenty of outdated Linux MPTCP documentation. The code has had a bumpy history, and at least two different APIs were proposed. See the Linux kernel documentation for <a href="https://docs.kernel.org/networking/mptcp.html"><u>the mainline API</u></a> and the <a href="https://www.mptcp.dev/"><u>mptcp.dev</u></a> website.</p>
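            <p>As a taste of the client side, the recipe documented on mptcp.dev boils down to telling the in-kernel path manager which local addresses may carry extra subflows and raising the subflow limits. The address, interface name, and counts below are examples:</p>
            <pre><code># Allow up to 2 additional subflows per connection, and accept up to
# 2 ADD-ADDR advertisements from the peer.
$ ip mptcp limits set subflow 2 add_addr_accepted 2

# Mark a local address as usable for creating additional subflows.
$ ip mptcp endpoint add 192.168.2.143 dev enx2800af081bee subflow</code></pre>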
    <div>
      <h2>Linux as a server</h2>
      <a href="#linux-as-a-server">
        
      </a>
    </div>
    <p>Conceptually, the MPTCP design is pretty sensible. After the initial TCP handshake, each peer may announce additional addresses (and ports) on which it can be reached. There are two ways of doing this. First, in the handshake TCP packet each peer specifies the "<i>Do not attempt to establish new subflows to this address and port</i>" bit, also known as bit [C], in the MPTCP TCP extensions header.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bT8oz3wxpw7alftvdYg5n/b7614a4d10b6c81e18027f6785391ede/BLOG-2637_3.png" />
          </figure><p><sup><i>Wireshark dissecting MPTCP flags from a SYN packet. </i></sup><a href="https://github.com/multipath-tcp/mptcp_net-next/issues/535"><sup><i><u>Tcpdump does not report</u></i></sup></a><sup><i> this flag yet.</i></sup></p><p>With this bit cleared, the other peer is free to assume the two-tuple is fine to be reconnected to. Typically, the <b>server allows</b> the client to reuse the server IP/port address. Usually, the <b>client is not listening</b> and disallows the server to connect back to it. There are caveats though. For example, in the context of Cloudflare, where our servers are using Anycast addressing, reconnecting to the server IP/port won't work. Going twice to the IP/port pair is unlikely to reach the same server. For us it makes sense to set this flag, disallowing clients from reconnecting to our server addresses. This can be done on Linux with:</p>
            <pre><code># Linux server sysctl - useful for ECMP or Anycast servers
$ sysctl -w net.mptcp.allow_join_initial_addr_port=0
</code></pre>
            <p>There is also a second way to advertise a listening IP/port. During the lifetime of a connection, a peer can send an ADD-ADDR MPTCP signal which advertises a listening IP/port. This can be managed on Linux by <code>ip mptcp endpoint ... signal</code>, like:</p>
            <pre><code># Linux server - extra listening address
$ ip mptcp endpoint add 192.51.100.1 dev eth0 port 4321 signal
</code></pre>
            <p>With such a config, a Linux peer (typically server) will report the additional IP/port with ADD-ADDR MPTCP signal in an ACK packet, like this:</p>
            <pre><code>host &gt; host: Flags [.], ack 1, win 8, options [mptcp 30 add-addr v1 id 1 192.51.100.1:4321 hmac 0x...,nop,nop], length 0
</code></pre>
            <p>It's important to realize that either peer can send ADD-ADDR messages. Unusual as it might sound, it's totally fine for the client to advertise extra listening addresses. In the most common scenarios, though, either nobody sends ADD-ADDR, or only the server does.</p><p>Technically, to launch an MPTCP socket on Linux, you just need to replace IPPROTO_TCP with IPPROTO_MPTCP in the application code:</p>
            <pre><code>IPPROTO_MPTCP = 262
sd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP)
</code></pre>
            <p>In practice, though, this introduces some changes to the sockets API. Currently, not all setsockopt options work; <code>TCP_USER_TIMEOUT</code>, for example, is not yet supported. Additionally, at this stage, MPTCP is incompatible with kTLS.</p>
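<p>A kernel without MPTCP support refuses the protocol at <code>socket()</code> time, so portable code usually pairs the one-line change above with a fallback. Here's a minimal Python sketch (the helper name is mine, not a standard API). Note that, independently of this, even a successfully created MPTCP socket can still fall back to plain TCP on the wire if the peer doesn't negotiate the extension.</p>

```python
import socket

# Not exported as socket.IPPROTO_MPTCP by older Python versions,
# so spell out the protocol number from the kernel headers.
IPPROTO_MPTCP = 262

def mptcp_socket():
    """Create an MPTCP socket; fall back to plain TCP when the kernel
    has no MPTCP support (pre-5.6, CONFIG_MPTCP=n, or non-Linux)."""
    try:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                             IPPROTO_MPTCP), True
    except OSError:
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM), False

sd, is_mptcp = mptcp_socket()
```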
    <div>
      <h2>Path manager / scheduler</h2>
      <a href="#path-manager-scheduler">
        
      </a>
    </div>
    <p>Once the peers have exchanged the address information, MPTCP is ready to kick in and perform the magic. There are two independent pieces of logic that MPTCP handles. First, given the address information, MPTCP must figure out if it should establish additional subflows. The component that decides this is called the "path manager". Then, another component, called the "scheduler", is responsible for choosing a specific subflow to transmit the data over.</p><p>Both peers have a path manager, but typically only the client uses it. The path manager has a hard task: launch enough subflows to get the benefits, but not so many that they waste resources. This is where MPTCP stacks get complicated.</p>
    <div>
      <h2>Linux as client</h2>
      <a href="#linux-as-client">
        
      </a>
    </div>
    <p>On Linux, the path manager is an operating system feature, not an application feature. The in-kernel path manager requires some configuration: it must know which IP addresses and interfaces are okay for starting new subflows. This is configured with <code>ip mptcp endpoint ... subflow</code>, like:</p>
            <pre><code>$ ip mptcp endpoint add dev wlp1s0 192.0.2.3 subflow  # Linux client
</code></pre>
            <p>This informs the path manager that we (typically a client) own a 192.0.2.3 IP address on interface wlp1s0, and that it's fine to use it as source of a new subflow. There are two additional flags that can be passed here: "backup" and "fullmesh". Maintaining these <code>ip mptcp endpoints</code> on a client is annoying. They need to be added and removed every time networks change. Fortunately, <a href="https://ubuntu.com/core/docs/networkmanager"><u>NetworkManager</u></a> from 1.40 supports managing these by default. If you want to customize the "backup" or "fullmesh" flags, you can do this here (see <a href="https://networkmanager.dev/docs/api/1.44.4/settings-connection.html#:~:text=mptcp-flags"><u>the documentation</u></a>):</p>
            <pre><code>ubuntu$ cat /etc/NetworkManager/conf.d/95-mptcp.conf
# set "subflow" on all managed "ip mptcp endpoints". 0x22 is the default.
[connection]
connection.mptcp-flags=0x22
</code></pre>
            <p>Path manager also takes a "limit" setting, to set a cap of additional subflows per MPTCP connection, and limit the received ADD-ADDR messages, like: </p>
            <pre><code>$ ip mptcp limits set subflow 4 add_addr_accepted 2  # Linux client
</code></pre>
    <p>I experimented with the "mobility" use case on my Ubuntu 22 Linux laptop. I repeatedly enabled and disabled Wi-Fi and Ethernet. On new kernels (v6.12), it works, and I was able to hold a reliable MPTCP connection over many interface changes. I was <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/534"><u>less lucky with the Ubuntu v6.8</u></a> kernel. Unfortunately, the <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/536"><u>default path manager on Linux</u></a> only works on the client when the flag "<i>Do not attempt to establish new subflows to this address and port</i>" is cleared on the server. Server-announced ADD-ADDR messages don't result in new subflows being created, unless the <code>ip mptcp endpoint</code> has the <code>fullmesh</code> flag.</p><p>It feels like the underlying MPTCP transport code works, but the path manager requires a bit more intelligence. With a new kernel, it's possible to get the "interactive" case working out of the box, but not the ADD-ADDR case.</p>
    <div>
      <h2>Custom path manager</h2>
      <a href="#custom-path-manager">
        
      </a>
    </div>
    <p>Linux allows for two implementations of the path manager component. It can either use the built-in kernel implementation (the default), or a userspace netlink daemon.</p>
            <pre><code>$ sysctl -w net.mptcp.pm_type=1 # use userspace path manager
</code></pre>
            <p>However, from what I found, there is no serious implementation of a configurable userspace path manager. The existing <a href="https://github.com/multipath-tcp/mptcpd/blob/main/plugins/path_managers/sspi.c"><u>implementations don't do much</u></a>, and the API still <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/533"><u>seems</u></a> <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/532"><u>immature</u></a>.</p>
    <div>
      <h2>Scheduler and BPF extensions</h2>
      <a href="#scheduler-and-bpf-extensions">
        
      </a>
    </div>
    <p>Thus far we've covered the path manager, but what about the scheduler that chooses which link to actually use? It seems that on Linux there is only one built-in "default" scheduler, and it can do basic failover on packet loss. The developers want to write <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/75"><u>MPTCP schedulers in BPF</u></a>, and this work is in progress.</p>
    <div>
      <h2>macOS</h2>
      <a href="#macos">
        
      </a>
    </div>
    <p>As opposed to Linux, macOS and iOS expose a raw MPTCP API. On those operating systems, path manager is not handled by the kernel, but instead can be an application responsibility. The exposed low-level API is based on <code>connectx()</code>. For example, <a href="https://github.com/apple-oss-distributions/network_cmds/blob/97bfa5b71464f1286b51104ba3e60db78cd832c9/mptcp_client/mptcp_client.c#L461"><u>here's an example of obscure code</u></a> that establishes one connection with two subflows:</p>
            <pre><code>int sock = socket(AF_MULTIPATH, SOCK_STREAM, 0);
connectx(sock, ..., &amp;cid1);
connectx(sock, ..., &amp;cid2);
</code></pre>
            <p>This powerful API is hard to use though, as it would require every application to listen for network changes. Fortunately, macOS and iOS also expose higher-level APIs. One <a href="https://github.com/mptcp-apps/mptcp-hello/blob/main/c/macOS/main.c"><u>example is nw_connection</u></a> in C, which uses nw_parameters_set_multipath_service.</p><p>Another, more common example is using <code>Network.framework</code>, and would <a href="https://gist.github.com/majek/cb54b537c74506164d2a7fa2d6601491"><u>look like this</u></a>:</p>
            <pre><code>let parameters = NWParameters.tcp
parameters.multipathServiceType = .interactive
let connection = NWConnection(host: host, port: port, using: parameters) 
</code></pre>
            <p>The API supports three MPTCP service type modes:</p><ul><li><p><i>Handover Mode</i>: Minimizes cellular usage. It prefers Wi-Fi, and moves traffic to cellular only when <a href="https://support.apple.com/en-us/102228"><u>Wi-Fi Assist</u></a> is enabled and decides to switch.</p></li><li><p><i>Interactive Mode</i>: Used for Siri. Reduces latency. Only for low-bandwidth flows.</p></li><li><p><i>Aggregation Mode</i>: Enables resource pooling, but it's only available for developer accounts and is not deployable.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47MukOs6bhCMOkO1JL15sP/7dd75417b855b681bde504122d5af01e/Screenshot_2024-12-23_at_2.59.51_PM.png" />
          </figure><p>The MPTCP API is nicely integrated with the <a href="https://support.apple.com/en-us/102228"><u>iPhone "Wi-Fi Assist" feature</u></a>. While the official documentation is lacking, it's possible to find <a href="https://youtu.be/BucQ1lfbtd4?t=533"><u>sources explaining</u></a> how it actually works. I was able to successfully test both the cleared "<i>Do not attempt to establish new subflows"</i> bit and ADD-ADDR scenarios. Hurray!</p>
    <div>
      <h2>IPv6 caveat</h2>
      <a href="#ipv6-caveat">
        
      </a>
    </div>
    <p>Sadly, MPTCP IPv6 has a caveat. Since IPv6 addresses are long, and MPTCP uses the space-constrained TCP Extensions field, there is <a href="https://github.com/multipath-tcp/mptcp_net-next/issues/448"><u>not enough room for ADD-ADDR messages</u></a> if TCP timestamps are enabled. If you want to use MPTCP and IPv6, it's something to consider.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>I find MPTCP very exciting, as one of the few serious TCP extensions that can actually be deployed. However, current implementations are limited. My experimentation showed that the only practical scenario where MPTCP is currently useful is:</p><ul><li><p>Linux as a server</p></li><li><p>macOS/iOS as a client</p></li><li><p>"interactive" use case</p></li></ul><p>With a bit of effort, Linux can be made to work as a client.</p><p>Don't get me wrong, <a href="https://netdevconf.info/0x14/pub/slides/59/mptcp-netdev0x14-final.pdf"><u>Linux developers did tremendous work</u></a> to get where we are but, in my opinion, we're not there yet for any serious out-of-the-box use case. I'm optimistic that Linux can develop a good MPTCP client story relatively soon, and the possibility of implementing the path manager and scheduler in BPF is really enticing.</p><p>Time will tell if MPTCP succeeds — it's been 15 years in the making. In the meantime, <a href="https://datatracker.ietf.org/meeting/121/materials/slides-121-quic-multipath-quic-00"><u>Multi-Path QUIC</u></a> is under active development, but it's even further from being usable at this stage.</p><p>We're not quite sure if it makes sense for Cloudflare to support MPTCP. <a href="https://community.cloudflare.com/c/feedback/feature-request/30"><u>Reach out</u></a> if you have a use case in mind!</p><p><i>Shoutout to </i><a href="https://fosstodon.org/@matttbe"><i><u>Matthieu Baerts</u></i></a><i> for tremendous help with this blog post.</i></p>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">6ZxrGIedGqREgTs02vpt0t</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[The forecast is clear: clouds on e-paper, powered by the cloud]]></title>
            <link>https://blog.cloudflare.com/the-forecast-is-clear-clouds-on-e-paper-powered-by-the-cloud/</link>
            <pubDate>Tue, 31 Dec 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ Follow along as I build a custom weather display using Cloudflare Workers and a popular e-paper display. ]]></description>
            <content:encoded><![CDATA[ <p>I’ve noticed that many shops are increasingly using e-paper displays. They’re impressive: high contrast, no backlight, and no visible cables. Unlike most electronics, these displays are seamlessly integrated and feel very natural. This got me wondering: is it possible to use such a display for a pet project? I want to experiment with this technology myself.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Be8SEo0lqoZ0t10tPbtcT/ffa6f9263df05392d7c4408e9c98bc28/Screenshot_2024-12-23_at_15.26.23.png" />
          </figure><p><span><span><sup>(</sup></span></span><a href="https://www.wiadomoscihandlowe.pl/najwieksze-sieci-handlowe/lidl/lidl-wprowadza-do-sklepow-elektroniczne-etykiety-mniejsze-obciazenie-pracownikow-mniej-zuzytego-papieru-2447917"><span><span><sup><u>source</u></sup></span></span></a><span><span><sup>)</sup></span></span></p><p>My main goal in this project is to understand the hardware and its capabilities. Here, I'll be using an e-paper display to show the current weather, but at its core, I’m simply feeding data from a website to the display. While it sounds straightforward, it actually requires three layers of software to pull off. Still, it’s a fun challenge and a great opportunity to work with both embedded hardware and Cloudflare Workers.</p>
    <div>
      <h2>Sourcing the hardware</h2>
      <a href="#sourcing-the-hardware">
        
      </a>
    </div>
    <p>For this project, I'm using components from Waveshare. They offer <a href="https://www.waveshare.com/product/displays/e-paper/epaper-1.htm?___SID=U&amp;limit=80"><u>a variety of e-paper displays</u></a>, ranging from credit card-sized to A4-sized models. I chose the 7.5-inch, two-color "e-Paper (G)" display. For the controller, I'm using a Waveshare <a href="https://www.waveshare.com/e-Paper-ESP32-Driver-Board.htm"><u>ESP32-based universal board</u></a>. With just these two components — a display and a controller — I was ready to get started.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7A6SqnMUHriAgRAAJM5Mbj/0d1282a915a1192d276e7dff1b901013/Screenshot_2024-12-23_at_15.25.45.png" />
          </figure><p>When the components arrived, I carefully connected the display’s ribbon cable to the ESP32 board. Even though this step isn’t documented anywhere, it was simple and almost impossible to get wrong. Best of all, no soldering was needed!</p><p>That’s pretty much it for the hardware setup! I’m keeping the device powered with a 5V supply through a micro-USB connection.</p>
    <div>
      <h2>One layer of hardware </h2>
      <a href="#one-layer-of-hardware">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WNVmUFDsLdewKrZJr4Otl/ae5eb976d1c7dc176cc288619b18c036/image1.png" />
          </figure><p><span><span><sup>(</sup></span></span><a href="https://www.waveshare.com/e-paper-esp32-driver-board.htm"><span><span><sup><u>source</u></sup></span></span></a><span><span><sup>)</sup></span></span></p><p>This was my first time working with the <a href="https://en.wikipedia.org/wiki/ESP32"><u>ESP32 CPU family</u></a>, and I’m really impressed. It’s a system-on-chip controller with built-in Bluetooth and Wi-Fi. It’s relatively fast, very power-efficient, and <a href="https://youtu.be/qLh1FOcGysY?t=157"><u>quite popular in DSP</u></a> (digital signal processing) applications. For example, your audio device might be powered by a CPU like this. Interestingly, the newer models have switched to the <a href="https://en.wikipedia.org/wiki/RISC-V"><u>RISC-V</u></a> instruction set.</p><p>For our purposes, we’ll only scratch the surface of what the ESP32 is capable of. The chip is straightforward to work with, thanks to the familiar Arduino environment. A great starting point is <a href="https://www.waveshare.com/wiki/E-Paper_ESP32_Driver_Board#How_to_Use"><u>the demo provided by Waveshare</u></a>. 
It sets up a web page where you can easily upload a custom image to the display.</p><p>To run the demo you need to:</p><ul><li><p>Install the <a href="https://support.arduino.cc/hc/en-us/articles/360019833020-Download-and-install-Arduino-IDE"><u>Arduino IDE</u></a>.</p></li><li><p>Fix the permissions of the <code>/dev/ttyACM0</code> serial device.</p></li><li><p>Add the "Additional Boards Manager URL" as per the <a href="https://www.waveshare.com/wiki/Arduino_ESP32/8266_Online_Installation"><u>instructions</u></a>, and install the "esp32 by Espressif Systems" bundle.</p></li><li><p>Open the "Loader_esp32wf" example downloaded from <a href="https://www.waveshare.com/wiki/E-Paper_ESP32_Driver_Board#Download_Demo"><u>Waveshare</u></a>.</p></li><li><p>Change the Wi-Fi name, password, and IP address in the Arduino IDE <code>srvr.h</code> tab.</p></li></ul><p>Once everything is set up, you should be able to connect to the ESP32’s IP address and use the simple web interface to upload an image to the display.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5KUXIJqF1J9hQqHISumuaV/85c211b315f338ba2339b4a7120162d4/Screenshot_2024-12-23_at_15.24.46.png" />
          </figure><p>With a simple click of the "Upload Image" button, the magic happens: the e-paper display comes to life, showcasing the uploaded image.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qxQCybmGmpxufDfU69OcY/d995257657f5991dd1d152b59ae2b84e/Screenshot_2024-12-23_at_15.24.09.png" />
          </figure><p>With the demo up and running, we can move on to the next step: figuring out how to render a web page on the e-paper display.</p>
    <div>
      <h2>Three layers of software</h2>
      <a href="#three-layers-of-software">
        
      </a>
    </div>
    <p>The ESP32 comes with some limitations. It has 520 KiB of RAM, 4 MiB of flash, and a 240 MHz clock speed. While this is fine for tasks like connecting to Wi-Fi or fetching a simple URL, it’s not powerful enough for more demanding tasks, such as parsing JSON or rendering an entire web page.</p><p>There are basic Arduino libraries for handling bitmaps, which can draw rectangles and render simple fonts, but manually managing layout doesn't sound appealing to me. A better approach is to play to the ESP32’s strengths — fetching and displaying bitmaps — and delegate the more complex task of HTML rendering to a more powerful server. </p><p>Let’s break the problem into three layers:</p><ol><li><p><b>ESP32 (Display Layer): </b>The ESP32 will periodically, say every minute, fetch a pre-rendered bitmap from the server and display it on the e-paper screen. This keeps the ESP32's tasks lightweight and manageable.</p></li><li><p><b>Server A (Rendering Layer): </b>This server will fetch the desired website, render it, and rasterize it into a bitmap format. Its job is to prepare a bitmap that the ESP32 can handle without additional processing.</p></li><li><p><b>Server B (Content Layer): </b>This server hosts the actual website with the HTML and CSS content. In this case, it will provide the local weather data in a styled format, ready to be fetched and rendered by Server A.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3V3LbBwfxhaWMukCVw0Tx/ab272f72ee2e45879acf3fea9a0fe75e/image8.png" />
          </figure>
    <div>
      <h3>ESP32 (Display Layer)</h3>
      <a href="#esp32-display-layer">
        
      </a>
    </div>
    <p>The ESP32 provides some great higher-level libraries to simplify development. For this project, we’ll need three key components:</p><ol><li><p><a href="https://github.com/espressif/arduino-esp32/blob/master/libraries/WiFi/examples/WiFiClientBasic/WiFiClientBasic.ino"><b><u>Wi-Fi Arduino Library</u></b></a><b>:</b> To connect the ESP32 to a Wi-Fi network.</p></li><li><p><a href="https://github.com/espressif/arduino-esp32/blob/master/libraries/HTTPClient/examples/BasicHttpClient/BasicHttpClient.ino"><b><u>HTTP Arduino Library</u></b></a><b>:</b> To handle HTTP requests and fetch the rendered bitmap from the server.</p></li><li><p><b>EPD (e-Paper Display) Driver:</b> To control the e-paper display and render the fetched bitmap.</p></li></ol><p>These libraries make it much easier to implement the required functionality without dealing with low-level details.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2024-12-e-paper/ESP32-fetch-from-worker/ESP32-fetch-from-worker.ino#L31"><u>Here's my ESP32 Arduino project code</u></a>. It's actually pretty straightforward:</p><ul><li><p>First, it connects to Wi-Fi</p></li><li><p>Then, it fetches a rendered bitmap from an HTTP endpoint</p></li><li><p>Then it pushes it to the e-paper display if needed</p></li><li><p>Waits a minute</p></li><li><p>And repeats the whole process forever</p></li></ul><p>E-paper displays typically start to degrade after about one million refresh cycles. To preserve the display's lifespan, I’m being extra careful to avoid unnecessary refreshes.</p>
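<p>The refresh-avoidance part of that loop is worth spelling out, since it's what protects the panel. Here's the same control flow as a small Python sketch with the I/O injected as callables; all the names here are illustrative, this is not the linked Arduino code:</p>

```python
import time

def run_display_loop(fetch_bitmap, push_to_display, sleep=time.sleep,
                     interval=60, iterations=None):
    """Poll the rendering server, but only refresh the e-paper panel
    when the bitmap actually changed, to conserve its limited
    (roughly one million) refresh cycles."""
    last = None
    n = 0
    while iterations is None or n < iterations:
        bitmap = fetch_bitmap()       # an HTTP GET in the real firmware
        if bitmap is not None and bitmap != last:
            push_to_display(bitmap)   # the EPD driver call on the ESP32
            last = bitmap
        n += 1
        sleep(interval)
```

The same structure works whether the loop runs forever on the device or a bounded number of times in a test.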
    <div>
      <h3>Server A (Rendering Layer)</h3>
      <a href="#server-a-rendering-layer">
        
      </a>
    </div>
    <p>Now for the exciting part! We need an online service that can fetch a website, render it, rasterize it to fit our small monochromatic display, and return it as a display-sized binary blob. Initially, I considered using headless Chrome paired with an ImageMagick script, but then I discovered <a href="https://developers.cloudflare.com/browser-rendering/"><u>Cloudflare’s </u><b><u>Browser Rendering API</u></b></a>, which fits our needs perfectly.</p><p>The API is pleasantly simple to use. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2024-12-e-paper/worker-render-raster/index.ts#L79"><u>Here's the TypeScript worker code</u></a>; two parts of it are particularly interesting: handling the remote browser, and dithering.</p>
    <div>
      <h3>Remote Browser API</h3>
      <a href="#remote-browser-api">
        
      </a>
    </div>
    <p>First, see how easy it is to render a website as a PNG using Browser Rendering:</p>
            <pre><code>if (!browser) {
    browser = await puppeteer.launch(env.MYBROWSER, { keep_alive: 600000 });
    launched = true;
}
sessionId = browser.sessionId();

const page = await browser.newPage();
await page.setViewport({
    width: 480,
    height: 800,
    deviceScaleFactor: 1,
})

await page.goto(url);
img = (await page.screenshot()) as Buffer;</code></pre>
            <p>I’m genuinely surprised at how practical and effective this approach is. While the remote browser startup isn’t exactly fast — it can take a few seconds to generate the screenshot — it’s not an issue for my use case. The delay is perfectly acceptable, especially considering how much work is offloaded to the cloud.</p>
    <div>
      <h4>Dithering</h4>
      <a href="#dithering">
        
      </a>
    </div>
    <p>To prepare the bitmap for the ESP32, we need to decode the PNG, reduce the color palette to monochromatic, and apply <a href="https://en.wikipedia.org/wiki/Dither"><u>dithering</u></a>. Here's the dithering code:</p>
            <pre><code>function ditherTwoBits(px: Buffer,
                       width: number,
                       height: number
                      ): Buffer {
    // Work in floats, so the diffused error can go below 0 or above 255.
    const buf = new Float32Array(px);

    for (let y = 0; y &lt; height; y++) {
        for (let x = 0; x &lt; width; x++) {
            const old_pixel = buf[y * width + x];
            const new_pixel = old_pixel &gt; 128 ? 0xff : 0x00;
            buf[y * width + x] = new_pixel;

            // Floyd–Steinberg error diffusion: push the quantization
            // error onto the unprocessed neighbours with weights
            // 7/16, 3/16, 5/16 and 1/16, skipping out-of-bounds pixels.
            const quant_error = (old_pixel - new_pixel) / 16.0;
            if (x + 1 &lt; width)
                buf[y * width + (x + 1)] += quant_error * 7.;
            if (y + 1 &lt; height) {
                if (x - 1 &gt;= 0)
                    buf[(y + 1) * width + (x - 1)] += quant_error * 3.;
                buf[(y + 1) * width + (x + 0)] += quant_error * 5.;
                if (x + 1 &lt; width)
                    buf[(y + 1) * width + (x + 1)] += quant_error * 1.;
            }
        }
    }

    return Buffer.from(Uint8ClampedArray.from(buf));
}</code></pre>
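<p>As a sanity check, the error diffusion is easy to reproduce outside the Worker. The short standalone Python sketch below (not the Worker code) mirrors the algorithm and demonstrates its key property: after a full pass, every pixel ends up at exactly 0 or 255, while the diffused error preserves the overall brightness of the image.</p>

```python
def dither(px, width, height):
    """1-bit Floyd-Steinberg dithering of a row-major grayscale image."""
    buf = [float(v) for v in px]
    for y in range(height):
        for x in range(width):
            old = buf[y * width + x]
            new = 255.0 if old > 128 else 0.0
            buf[y * width + x] = new
            # Spread the quantization error over unprocessed neighbours,
            # with the classic 7/16, 3/16, 5/16, 1/16 weights.
            err = (old - new) / 16.0
            if x + 1 < width:
                buf[y * width + x + 1] += err * 7
            if y + 1 < height:
                if x > 0:
                    buf[(y + 1) * width + x - 1] += err * 3
                buf[(y + 1) * width + x] += err * 5
                if x + 1 < width:
                    buf[(y + 1) * width + x + 1] += err * 1
    return bytes(min(255, max(0, int(v))) for v in buf)
```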
            <p>This was my first time experimenting with dithering, and it’s been a lot of fun! I was surprised by how straightforward the process is and that it’s fully deterministic. Now that I understand the details of the algorithm, I can’t help but notice its subtle side effects everywhere — in printed materials, on screens, and even in design choices around me. It’s fascinating how something so simple has such a broad impact!</p><p>To deploy this code as a Cloudflare Worker, you only need to install the required dependencies, configure the <code>wrangler.toml</code> file, and publish the code. Here’s a step-by-step guide:</p>
            <pre><code>sudo apt install npm
cd worker-render-raster
npm install wrangler
npm install @cloudflare/puppeteer --save-dev
npm install fast-png --save-dev
npx wrangler kv:namespace create KV
npx wrangler kv:namespace create KV --preview</code></pre>
            <p>With this out of the way, you can run the code:</p>
            <pre><code>2025-01-e-paper/worker-render-raster$ npx wrangler dev --remote

 ⛅️ wrangler 3.99.0
-------------------

Your worker has access to the following bindings:
- KV Namespaces:
  - KV: XXX
- Browser:
  - Name: BROWSER
[wrangler:inf] Ready on http://localhost:46131
⎔ Starting remote preview...
Total Upload: 755.39 KiB / gzip: 149.05 KiB
╭─────────────────────────────────────────────────────────────────────────────────────────────────╮
│  [b] open a browser, [d] open devtools, [l] turn on local mode, [c] clear console, [x] to exit  │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯</code></pre>
            <p>With everything set up, you can now open a browser and see a rendered and rasterized version of a website, processed through your Cloudflare Worker! For example, here’s how the <b>1.1.1.1</b> page looks at an 800x480 monochromatic resolution, complete with dithering:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6awlaJqO4LduSLwvUXQbYJ/b2b43d14c0a8d44fbed1ed4bc2a4de12/Screenshot_2024-12-23_at_15.23.14.png" />
          </figure><p>This demonstrates how effectively the Worker can handle rendering, rasterizing, and adapting web content for an e-paper display. It’s quite satisfying to see the pipeline in action.</p>
    <div>
      <h3>Server B (Content Layer)</h3>
      <a href="#server-b-content-layer">
        
      </a>
    </div>
    <p>To create the weather panel, I designed a simple HTML and CSS page and <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2024-12-e-paper/worker-weather-panel/entry.py#L165"><u>published it as another Cloudflare Worker</u></a>. This time, I used <a href="https://developers.cloudflare.com/workers/languages/python/"><u>Python in Cloudflare Workers</u></a> because it felt more straightforward, especially since the site needs to query an external weather API. The simplicity of the code was surprising and made the process smooth.</p>
            <pre><code>async def on_fetch(request, env):
    cached = await env.KV.get("weather")
    if cached:
        cached = json.loads(cached)
    else:
        u = "https://api.open-meteo.com/..."
        a = await fetch(u)
        result = await a.text()
        cached = json.loads(result)
        await env.KV.put("weather", json.dumps(cached))
    return Response.new(render(...), headers=[('content-type', 'text/html')])</code></pre>
            <p>Here’s how it appears in a normal browser compared to the rendered and rasterized version by our worker:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5irsWxy9nIgTRirXP6l8Nh/a4bb0d31df18c8385ddd96c4542941d0/Screenshot_2024-12-23_at_15.19.21.png" />
          </figure>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Finally, the display deserves a proper frame. Here’s the finished version:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Dymme1r2SJ8rAKRJjyl0K/4fa344362ca037607c7989dbf47b7fff/image11.png" />
          </figure><p>I started this project wanting to experiment with e-paper display hardware, but I ended up spending most of my time writing software, and it turned out to be surprisingly enjoyable across all layers:</p><ul><li><p><b>ESP32:</b> The CPU is fantastic. Programming it is straightforward, thanks to powerful built-in libraries that simplify development.</p></li><li><p><b>Cloudflare Worker Browser Rendering:</b> This is an underrated but incredibly powerful technology. It made implementing features like the Floyd–Steinberg dithering algorithm surprisingly easy.</p></li><li><p><b>Cloudflare Worker Python:</b> Although still in beta, it worked flawlessly for my needs and was a great fit for handling API requests and serving dynamic content.</p></li></ul><p>It’s remarkable how much you can achieve with relatively inexpensive hardware and free Cloudflare services.</p>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Python]]></category>
            <guid isPermaLink="false">2b03lxutI7PqAfcxswe5aE</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Virtual networking 101: bridging the gap to understanding TAP]]></title>
            <link>https://blog.cloudflare.com/virtual-networking-101-understanding-tap/</link>
            <pubDate>Fri, 06 Oct 2023 13:05:33 GMT</pubDate>
            <description><![CDATA[ Tap devices were historically used for VPN clients. Using them for virtual machines is essentially reversing their original purpose: from traffic sinks to traffic sources. In this article, I explore the intricacies of tap devices, covering topics like offloads, segmentation, and multi-queue. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5K0iKrieDwp0YkLViNDJKB/1c655399c5c70fe75c8f56f3294266a9/image1-5.png" />
            
            </figure><p>It's a never-ending effort to improve the performance of our infrastructure. As part of that quest, we wanted to squeeze as much network oomph as possible from our virtual machines. Internally, for some projects, we use <a href="https://firecracker-microvm.github.io/">Firecracker</a>, which is a KVM-based virtual machine manager (VMM) that runs light-weight “Micro-VM”s. Each Firecracker instance uses a tap device to communicate with the host system. Not knowing much about tap, I had to up my game. That wasn't easy, though: the documentation is messy and spread across the Internet.</p><p>Here are the notes that I wish someone had passed me when I started out on this journey!</p><p>A tap device is a virtual <b>network interface</b> that looks like an Ethernet network card. Instead of having real wires plugged into it, it exposes a nice handy file descriptor to an application willing to send and receive packets. Historically, tap devices were mostly used to implement VPN clients. The machine would route traffic towards the <b>tap interface</b>, and a VPN client application would pick the packets up and process them accordingly. For example, this is what our Cloudflare WARP Linux client does. Here's how it looks on my laptop:</p>
            <pre><code>$ ip link list
...
18: CloudflareWARP: &lt;POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 1280 qdisc mq state UNKNOWN mode DEFAULT group default qlen 500
	link/none

$ ip tuntap list
CloudflareWARP: tun multi_queue</code></pre>
            <p>More recently, tap devices started to be used by <a href="https://www.cloudflare.com/learning/cloud/what-is-a-virtual-machine/">virtual machines</a> to enable networking. The VMM (like Qemu, Firecracker, or gVisor) would open the application side of a tap and pass all the packets to the guest VM. The tap network interface would be left for the host kernel to deal with. Typically, the host would behave like a router and firewall, forwarding or NATing all the packets. This design is somewhat surprising: it almost reverses the original use case for tap. In the VPN days, tap was a traffic destination. With a VM behind it, tap looks like a traffic source.</p><p>A Linux tap device is a <b>mean creature</b>. It looks trivial: a virtual network interface with a file descriptor behind it. However, it's <b>surprisingly hard</b> to get it to perform well. The Linux networking stack is optimized for packets handled by a physical network card, not a userspace application. Over the years, though, the Linux tap interface grew in features, and nowadays it's possible to get good performance out of it. Later, I'll explain how to use the Linux tap API in a modern way.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79hvP3TXwPib9DE9leEJbE/7f81ee55560cfdff5b7df7419adc91c5/Screenshot-2023-10-05-at-2.42.49-PM.png" />
            
            </figure><p>Source: DALL-E</p>
    <div>
      <h2>To tun or to tap?</h2>
      <a href="#to-tun-or-to-tap">
        
      </a>
    </div>
    <p>The interface is <a href="https://docs.kernel.org/networking/tuntap.html">called "the universal tun/tap"</a> in the kernel. The "tun" variant, accessible via the IFF_TUN flag, looks like a point-to-point link. There are no L2 Ethernet headers. Since most modern networks are Ethernet, this is a bit less intuitive to set up for a novice user. Most importantly, projects like Firecracker and gVisor do expect L2 headers.</p><p>"Tap", with the IFF_TAP flag, is the one which has Ethernet headers, and has been getting all the attention lately. If you are like me and always forget which one is which, you can use this AI-generated rhyme (check out <a href="/writing-poems-using-llama-2-on-workers-ai/">WorkersAI/LLama</a>) to help you remember:</p><p><i>Tap is like a switch,</i><br /><i>Ethernet headers it'll hitch.</i><br /><i>Tun is like a tunnel,</i><br /><i>VPN connections it'll funnel.</i><br /><i>Ethernet headers it won't hold,</i><br /><i>Tap uses, tun does not, we're told.</i></p>
    <div>
      <h2>Listing devices</h2>
      <a href="#listing-devices">
        
      </a>
    </div>
    <p>Tun/tap devices are natively supported by iproute2 tooling. Typically, one creates a device with <b>ip tuntap add</b> and lists it with <b>ip tuntap list</b>:</p>
            <pre><code>$ sudo ip tuntap add mode tap user marek group marek name tap0
$ ip tuntap list
tap0: tap persist user 1000 group 1000</code></pre>
            <p>Alternatively, it's possible to look for the <code>/sys/devices/virtual/net/&lt;ifr_name&gt;/tun_flags</code> files.</p>
    <div>
      <h2>Tap device setup</h2>
      <a href="#tap-device-setup">
        
      </a>
    </div>
    <p>To open or create a new device, you first need to open <code>/dev/net/tun</code> which is called a "clone device":</p>
            <pre><code>/* First, whatever you do, the device /dev/net/tun must be
 * opened read/write. That device is also called the clone
 * device, because it's used as a starting point for the
 * creation of any tun/tap virtual interface. */
const char *clone_dev_name = "/dev/net/tun";
int tap_fd = open(clone_dev_name, O_RDWR | O_CLOEXEC);
if (tap_fd &lt; 0) {
	error(-1, errno, "open(%s)", clone_dev_name);
}</code></pre>
            <p>With the clone device file descriptor we can now instantiate a specific tap device by name:</p>
            <pre><code>struct ifreq ifr = {};
strncpy(ifr.ifr_name, tap_name, IFNAMSIZ);
ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
int r = ioctl(tap_fd, TUNSETIFF, &amp;ifr);
if (r != 0) {
	error(-1, errno, "ioctl(TUNSETIFF)");
}</code></pre>
            <p>If <b>ifr_name</b> is empty or names a device that doesn't exist, a new tap device is created. Otherwise, the existing device is opened. When opening an existing device, flags like IFF_MULTI_QUEUE must match the way the device was created, or EINVAL is returned. On an EINVAL error it's a good idea to retry with the multi queue setting flipped.</p><p>The <b>ifr_flags</b> can have the following bits set:</p><table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>IFF_TAP / IFF_TUN</span></p></td><td><p><span>Already discussed.</span></p></td></tr><tr><td><p><span>IFF_NO_CARRIER</span></p></td><td><p><span>Holding an open tap device file descriptor sets the Ethernet interface CARRIER flag up. In some cases it might be desirable to delay that until a TUNSETCARRIER call.</span></p></td></tr><tr><td><p><span>IFF_NO_PI</span></p></td><td><p><span>Historically each packet on tap had a "struct tun_pi" 4 byte prefix. There are now better alternatives, and this option disables the prefix.</span></p></td></tr><tr><td><p><span>IFF_TUN_EXCL</span></p></td><td><p><span>Ensures a new device is created. Returns EBUSY if the device exists.</span></p></td></tr><tr><td><p><span>IFF_VNET_HDR</span></p></td><td><p><span>Prepend "</span><a href="https://elixir.bootlin.com/linux/v6.4.6/source/include/uapi/linux/virtio_net.h#L187"><span>struct virtio_net_hdr</span></a><span>" before the RX and TX packets; should be followed by ioctl(TUNSETVNETHDRSZ).</span></p></td></tr><tr><td><p><span>IFF_MULTI_QUEUE</span></p></td><td><p><span>Use multi queue tap, see below.</span></p></td></tr><tr><td><p><span>IFF_NAPI / IFF_NAPI_FRAGS</span></p></td><td><p><span>See below.</span></p></td></tr></tbody></table><p>You almost always want the IFF_TAP, IFF_NO_PI, and IFF_VNET_HDR flags, and perhaps sometimes IFF_MULTI_QUEUE.</p>
    <div>
      <h2>The curious IFF_NAPI</h2>
      <a href="#the-curious-iff_napi">
        
      </a>
    </div>
    <p>Judging by the <a href="https://www.mail-archive.com/netdev@vger.kernel.org/msg189704.html">original patchset introducing IFF_NAPI and IFF_NAPI_FRAGS</a>, these flags were introduced to increase code coverage of syzkaller. However, later work indicates there were <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fb3f903769e805221eb19209b3d9128d398038a1">performance benefits when doing XDP on tap</a>. IFF_NAPI enables a dedicated NAPI instance for packets written from an application into a tap. Besides allowing XDP, it also allows packets to be batched and GRO-ed. Otherwise, a backlog NAPI is used.</p>
    <div>
      <h2>A note on buffer sizes</h2>
      <a href="#a-note-on-buffer-sizes">
        
      </a>
    </div>
    <p>Internally, a tap device is just a pair of packet queues. It's exposed as a network interface towards the host, and as a file descriptor, a character device, towards the application. The queue in the direction of the application (the tap TX queue) holds <b>txqueuelen</b> packets, controlled by an interface parameter:</p>
            <pre><code>$ ip link set dev tap0 txqueuelen 1000
$ ip -s link show dev tap0
26: tap0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 ... qlen 1000
    RX:  bytes packets errors dropped  missed   mcast
             0       0      0       0       0       0
    TX:  bytes packets errors dropped carrier collsns
           266       3      0      66       0       0</code></pre>
            <p>In "ip link" statistics the "<b>TX dropped</b>" column indicates that the tap application was too slow and the queue space was exhausted.</p><p>In the other direction (the interface RX queue, from the application towards the host), the queue size limit is measured in bytes and controlled by the TUNSETSNDBUF ioctl. The <a href="https://elixir.bootlin.com/qemu/v8.1.1/source/net/tap-linux.c#L120">qemu comment discusses</a> this setting; however, it's not easy to cause this queue to overflow. See this <a href="https://bugzilla.redhat.com/show_bug.cgi?id=508861">discussion for details</a>.</p>
    <div>
      <h2>vnethdr size</h2>
      <a href="#vnethdr-size">
        
      </a>
    </div>
    <p>After the device is opened, the next step is usually to set up the VNET_HDR size and the offloads. Typically VNETHDRSZ should be set to 12:</p>
            <pre><code>len = 12;
r = ioctl(tap_fd, TUNSETVNETHDRSZ, &amp;(int){len});
if (r != 0) {
	error(-1, errno, "ioctl(TUNSETVNETHDRSZ)");
}</code></pre>
            <p>Sensible values are {10, 12, 20}, which are derived from the <a href="https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html">virtio spec</a>. 12 bytes makes room for the following header (little endian):</p>
            <pre><code>struct virtio_net_hdr_v1 {
#define VIRTIO_NET_HDR_F_NEEDS_CSUM  1    /* Use csum_start, csum_offset */
#define VIRTIO_NET_HDR_F_DATA_VALID  2    /* Csum is valid */
    u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE      0    /* Not a GSO frame */
#define VIRTIO_NET_HDR_GSO_TCPV4     1    /* GSO frame, IPv4 TCP (TSO) */
#define VIRTIO_NET_HDR_GSO_UDP       3    /* GSO frame, IPv4 UDP (UFO) */
#define VIRTIO_NET_HDR_GSO_TCPV6     4    /* GSO frame, IPv6 TCP */
#define VIRTIO_NET_HDR_GSO_UDP_L4    5    /* GSO frame, IPv4 &amp; IPv6 UDP (USO) */
#define VIRTIO_NET_HDR_GSO_ECN       0x80 /* TCP has ECN set */
    u8 gso_type;
    u16 hdr_len;     /* Ethernet + IP + tcp/udp hdrs */
    u16 gso_size;    /* Bytes to append to hdr_len per frame */
    u16 csum_start;
    u16 csum_offset;
    u16 num_buffers;
};</code></pre>
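<p>The {10, 12, 20} values can be sanity-checked against the structs the kernel ships in <code>&lt;linux/virtio_net.h&gt;</code>: 10 is the legacy <code>struct virtio_net_hdr</code>, 12 is <code>struct virtio_net_hdr_v1</code> above (which appends the u16 num_buffers), and 20 adds hash metadata on top of v1. A quick sketch (the helper names are mine, for illustration):</p>

```c
#include <stddef.h>
#include <linux/virtio_net.h>

/* The sensible VNETHDRSZ values map to kernel uapi structs:
 * 10 = sizeof(struct virtio_net_hdr)    - legacy, no num_buffers
 * 12 = sizeof(struct virtio_net_hdr_v1) - adds u16 num_buffers */
size_t legacy_vnet_hdr_len(void) { return sizeof(struct virtio_net_hdr); }
size_t v1_vnet_hdr_len(void)     { return sizeof(struct virtio_net_hdr_v1); }

/* csum_start sits after flags(1) + gso_type(1) + hdr_len(2) + gso_size(2). */
size_t v1_csum_start_off(void)
{
	return offsetof(struct virtio_net_hdr_v1, csum_start);
}
```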
            
    <div>
      <h2>offloads</h2>
      <a href="#offloads">
        
      </a>
    </div>
    <p>To enable offloads, use the TUNSETOFFLOAD ioctl:</p>
            <pre><code>unsigned off_flags = TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6;
int r = ioctl(tap_fd, TUNSETOFFLOAD, off_flags);
if (r != 0) {
	error(-1, errno, "ioctl(TUNSETOFFLOAD)");
}</code></pre>
            <p>Here are the allowed bit values. They declare which packet features the userspace application can receive:</p><table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUN_F_CSUM</span></p></td><td><p><span>L4 packet checksum offload.</span></p></td></tr><tr><td><p><span>TUN_F_TSO4</span></p></td><td><p><span>TCP Segmentation Offload - TSO for IPv4 packets.</span></p></td></tr><tr><td><p><span>TUN_F_TSO6</span></p></td><td><p><span>TSO for IPv6 packets.</span></p></td></tr><tr><td><p><span>TUN_F_TSO_ECN</span></p></td><td><p><span>TSO with ECN bits.</span></p></td></tr><tr><td><p><span>TUN_F_UFO</span></p></td><td><p><span>UDP Fragmentation Offload - UFO packets. Deprecated.</span></p></td></tr><tr><td><p><span>TUN_F_USO4</span></p></td><td><p><span>UDP Segmentation Offload - USO for IPv4 packets.</span></p></td></tr><tr><td><p><span>TUN_F_USO6</span></p></td><td><p><span>USO for IPv6 packets.</span></p></td></tr></tbody></table><p>Generally, offloads are extra packet features the tap application can deal with. Details of the offloads used by the sender are set on each packet in the vnethdr prefix.</p>
    <div>
      <h2>Checksum offload TUN_F_CSUM</h2>
      <a href="#checksum-offload-tun_f_csum">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QEojUsAvj7VueXmGRSexc/37671b9c92dcc7f6c40bf572cd6e73dc/image4.jpg" />
            
            </figure><p>Structure of a typical UDP packet received over tap.</p><p>Let's start with the checksumming offload. The TUN_F_CSUM offload saves the kernel some work by pushing checksum processing down the path. Applications which set that flag indicate they can handle checksum validation. For example, with this offload a UDP IPv4 packet will have:</p><ul><li><p>vnethdr flags with VIRTIO_NET_HDR_F_NEEDS_CSUM set</p></li><li><p>hdr_len of 42 (14+20+8)</p></li><li><p>csum_start of 34 (14+20)</p></li><li><p>and csum_offset of 6 (the UDP checksum is 6 bytes into the L4 header)</p></li></ul><p>This is illustrated above.</p><p>Supporting checksum offload is a prerequisite for the further offloads.</p>
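<p>The 42/34/6 arithmetic above can be written down directly. This is a sketch (the helper <code>vnet_hdr_udp4</code> is a name of my own invention) that fills a vnet header for an outgoing UDP/IPv4 packet the way the text describes: the peer finishes the checksum at csum_start + csum_offset:</p>

```c
#include <stddef.h>
#include <string.h>
#include <linux/if_ether.h>   /* ETH_HLEN == 14 */
#include <linux/virtio_net.h>

/* Fill a vnet header for a plain (non-GSO) UDP/IPv4 packet under
 * TUN_F_CSUM. ip_hlen is the IP header length: 20 without options. */
void vnet_hdr_udp4(struct virtio_net_hdr_v1 *h, size_t ip_hlen)
{
	memset(h, 0, sizeof(*h));
	h->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
	h->gso_type = VIRTIO_NET_HDR_GSO_NONE;
	h->hdr_len = ETH_HLEN + ip_hlen + 8;  /* Ethernet + IP + UDP = 42 */
	h->csum_start = ETH_HLEN + ip_hlen;   /* L4 starts at offset 34 */
	h->csum_offset = 6;                   /* UDP checksum field offset */
}
```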
    <div>
      <h2>TUN_F_CSUM is a must</h2>
      <a href="#tun_f_csum-is-a-must">
        
      </a>
    </div>
    <p>Consider this code:</p>
            <pre><code>s = socket(AF_INET, SOCK_DGRAM)
s.setsockopt(SOL_UDP, UDP_SEGMENT, 1400)
s.sendto(b"x", ("10.0.0.2", 5201))     # Would you expect EIO ?</code></pre>
            <p>This simple code produces a single packet, but when directed at a tap device it will surprisingly yield an EIO "Input/output error". This weird behavior happens when the tap is opened without TUN_F_CSUM and the application sends GSO / UDP_SEGMENT frames. Tough luck. It might be considered a kernel bug, and we're thinking about fixing it. In the meantime, however, everyone using tap should just set the TUN_F_CSUM bit.</p>
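<p>For completeness, the same setup in C, since the rest of this post uses C. This is a sketch (the helper name <code>udp_gso_socket</code> is mine) that enables UDP GSO on a socket, assuming a kernel with UDP_SEGMENT support (4.18 or later); a send() on such a socket towards a tap opened without TUN_F_CSUM is what triggers the EIO:</p>

```c
#include <netinet/in.h>
#include <netinet/udp.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SOL_UDP
#define SOL_UDP 17        /* setsockopt level for UDP */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103   /* from linux/udp.h, kernel >= 4.18 */
#endif

/* Create a UDP socket with GSO enabled: each send() buffer is cut into
 * seg-byte datagrams by the kernel. Returns the fd, or -1 on error. */
int udp_gso_socket(int seg)
{
	int s = socket(AF_INET, SOCK_DGRAM, 0);
	if (s < 0)
		return -1;
	if (setsockopt(s, SOL_UDP, UDP_SEGMENT, &seg, sizeof(seg)) != 0) {
		close(s);
		return -1;
	}
	return s;
}
```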
    <div>
      <h2>Segmentation offloads</h2>
      <a href="#segmentation-offloads">
        
      </a>
    </div>
    <p>We wrote about <a href="/accelerating-udp-packet-transmission-for-quic/">UDP_SEGMENT</a> in the past. In short: on Linux an application can handle many packets with a single send/recv, as long as they have identical length.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/31mztfk8fZtP46WzeFrcID/4dd98352343ffafd1cf7d62847c06904/image2-4.png" />
            
            </figure><p>With UDP_SEGMENT a single send() can transfer multiple packets.</p><p>Tap devices support offloads which expose that very functionality. With the TUN_F_TSO4 and TUN_F_TSO6 flags the tap application signals it can deal with long packet trains. Note that with these features the application must be ready to receive much larger buffers: up to 65507 bytes for IPv4 and 65527 for IPv6.</p><p>The TSO4/TSO6 flags enable long packet trains for <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP</a> and have been supported for a long time. More recently the TUN_F_USO4 and TUN_F_USO6 bits were introduced for <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a>. When any of these offloads is used, <b>gso_type</b> contains the relevant offload type and <b>gso_size</b> holds the segment size within the GRO packet train.</p><p>TUN_F_UFO is a UDP fragmentation offload, which is deprecated.</p><p>By setting TUNSETOFFLOAD, the application tells the kernel which offloads it's able to handle on the read() side of a tap device. If the ioctl(TUNSETOFFLOAD) succeeds, the application can assume the kernel supports the same offloads for packets in the other direction.</p>
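<p>The gso_size semantics imply simple arithmetic: everything in the train past hdr_len is payload, sliced into gso_size-byte segments, with the last segment possibly shorter. A tiny helper of my own (hypothetical, for illustration) computing how many wire packets one train expands to:</p>

```c
/* Number of wire packets a GSO packet train expands to: the payload
 * past hdr_len is cut into gso_size-byte segments, rounding up so the
 * final, possibly shorter, segment is counted too. */
unsigned gso_segments(unsigned total_len, unsigned hdr_len,
                      unsigned gso_size)
{
	unsigned payload = total_len - hdr_len;
	return (payload + gso_size - 1) / gso_size;
}
```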
    <div>
      <h2>Bug in rx-udp-gro-forwarding - TUN_F_USO4</h2>
      <a href="#bug-in-rx-udp-gro-forwarding-tun_f_uso4">
        
      </a>
    </div>
    <p>When working with tap and offloads it's useful to inspect the offload state with <b>ethtool</b>:</p>
            <pre><code>$ ethtool -k tap0 | egrep -v fixed
tx-checksumming: on
    tx-checksum-ip-generic: on
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
generic-segmentation-offload: on
generic-receive-offload: on
tx-udp-segmentation: on
rx-gro-list: off
rx-udp-gro-forwarding: off</code></pre>
            <p>With ethtool we can see the enabled offloads and disable them as needed.</p><p>While toying with UDP Segmentation Offload (USO), I noticed that when packet trains from tap are forwarded to a real network interface, they sometimes seem badly packetized. See the <a href="https://lore.kernel.org/all/CAJPywTKDdjtwkLVUW6LRA2FU912qcDmQOQGt2WaDo28KzYDg+A@mail.gmail.com/">netdev discussion</a> and the <a href="https://lore.kernel.org/netdev/ZK9ZiNMsJX8+1F3N@debian.debian/T/#m9f532bb5463b89d997c5d16c78490ceeccb4497f">proposed fix</a>. In any case, beware of this bug, and maybe consider doing "ethtool -K tap0 rx-udp-gro-forwarding off".</p>
    <div>
      <h2>Miscellaneous setsockopts</h2>
      <a href="#miscellaneous-setsockopts">
        
      </a>
    </div>
    
    <div>
      <h4>Setup related socket options</h4>
      <a href="#setup-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNGETFEATURES</span></p></td><td><p><span>Return vector of IFF_* constants that the kernel supports. Typically used to detect the host support of: IFF_VNET_HDR, IFF_NAPI and IFF_MULTI_QUEUE.</span></p></td></tr><tr><td><p><span>TUNSETIFF</span></p></td><td><p><span>Takes "struct ifreq", sets up a tap device, fills in the name if empty.</span></p></td></tr><tr><td><p><span>TUNGETIFF</span></p></td><td><p><span>Returns a "struct ifreq" containing the device's current name and flags.</span></p></td></tr><tr><td><p><span>TUNSETPERSIST</span></p></td><td><p><span>Sets TUN_PERSIST flag, if you want the device to remain in the system after the tap_fd is closed.</span></p></td></tr><tr><td><p><span>TUNSETOWNER, TUNSETGROUP</span></p></td><td><p><span>Set uid and gid that can own the device.</span></p></td></tr><tr><td><p><span>TUNSETLINK</span></p></td><td><p><span>Set the Ethernet link type for the device. The device must be down. See ARPHRD_* constants. For tap it defaults to ARPHRD_ETHER.</span></p></td></tr><tr><td><p><span>TUNSETOFFLOAD</span></p></td><td><p><span>As documented above.</span></p></td></tr><tr><td><p><span>TUNGETSNDBUF, TUNSETSNDBUF</span></p></td><td><p><span>Get/set send buffer. 
The default is INT_MAX.</span></p></td></tr><tr><td><p><span>TUNGETVNETHDRSZ, TUNSETVNETHDRSZ</span></p></td><td><p><span>Already discussed.</span></p></td></tr><tr><td><p><span>TUNSETIFINDEX</span></p></td><td><p><span>Set interface index (ifindex), </span><a href="https://patchwork.ozlabs.org/project/netdev/patch/51B99946.3000703@parallels.com/"><span>useful in checkpoint-restore</span></a><span>.</span></p></td></tr><tr><td><p><span>TUNSETCARRIER</span></p></td><td><p><span>Set the carrier state of an interface, as discussed earlier, useful with IFF_NO_CARRIER.</span></p></td></tr><tr><td><p><span>TUNGETDEVNETNS</span></p></td><td><p><span>Return an fd of a net namespace that the interface belongs to.</span></p></td></tr></tbody></table>
    <div>
      <h5>Filtering related socket options</h5>
      <a href="#filtering-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNSETTXFILTER</span></p></td><td><p><span>Takes "struct tun_filter" which limits the dst mac addresses that can be delivered to the application.</span></p></td></tr><tr><td><p><span>TUNATTACHFILTER, TUNDETACHFILTER, TUNGETFILTER</span></p></td><td><p><span>Attach/detach/get classic BPF filter for packets going to application. Takes "struct sock_fprog".</span></p><br /></td></tr><tr><td><p><span>TUNSETFILTEREBPF</span></p></td><td><p><span>Set an eBPF filter on a tap device. This is independent of the classic BPF above.</span></p></td></tr></tbody></table>
    <div>
      <h4>Multi queue related socket options</h4>
      <a href="#multi-queue-related-socket-options">
        
      </a>
    </div>
    <table><colgroup><col></col><col></col></colgroup><tbody><tr><td><p><span>TUNSETQUEUE</span></p></td><td><p><span>Used to set IFF_DETACH_QUEUE and IFF_ATTACH_QUEUE for multiqueue.</span></p></td></tr><tr><td><p><span>TUNSETSTEERINGEBPF</span></p></td><td><p><span>Set an eBPF program for selecting a specific tap queue, in the direction towards the application. This is useful if you want to ensure some traffic is sticky to a specific application thread. The eBPF program takes "struct __sk_buff" and returns an int. The selected queue is the return value, truncated to u16, modulo the number of queues.</span></p></td></tr></tbody></table>
    <div>
      <h2>Single queue speed</h2>
      <a href="#single-queue-speed">
        
      </a>
    </div>
    <p>Tap devices are quite weird — they aren't network sockets, nor true files. Their semantics are closest to pipes, and unfortunately the API reflects that. To receive or send a packet on a tap device, the application must do a read() or write() syscall, one packet at a time.</p><p>One might think that some sort of syscall batching would help. Sockets have sendmmsg()/recvmmsg(), but those don't work on tap file descriptors. The typical alternatives enabling batching are the old <a href="/io_submit-the-epoll-alternative-youve-never-heard-about/">io_submit AIO interface</a> and the modern io_uring, which added tap support quite recently. However, it turns out syscall batching doesn't offer that much of an improvement, maybe in the range of 10%.</p><p>The Linux kernel is simply not capable of forwarding millions of packets per second for a single flow or on a single CPU. The best possible solution is to scale vertically for elephant flows with the TSO/USO (packet train) offloads, and to scale horizontally across multiple concurrent flows with multi queue.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7Mfbelr6xUN4bGLTPAFLHr/945a18d93d8bce312e7205bdb5911b16/image5-1.png" />
            
            </figure><p>In this chart you can see how dramatic the performance gain from offloads is. Without them, a sample "echo" tap application can process between 320 and 500 thousand packets per second on a single core, with an MTU of 1500. With the offloads enabled, that jumps to 2.7 Mpps, while the number of received "packet trains" stays at just 56 thousand per second. Of course, not every traffic pattern can fully utilize GRO/GSO. However, to get decent performance from tap, and from Linux in general, offloads are absolutely critical.</p>
    <div>
      <h2>Multi queue considerations</h2>
      <a href="#multi-queue-considerations">
        
      </a>
    </div>
    <p>Multi queue is useful when the tap application handles multiple concurrent flows and needs to utilize more than one CPU.</p><p>To get a file descriptor of a tap queue, just add the IFF_MULTI_QUEUE flag when opening the tap. It's possible to detach/reattach a queue with TUNSETQUEUE and IFF_DETACH_QUEUE/IFF_ATTACH_QUEUE, but I'm unsure when this is useful.</p><p>When a multi queue tap is created, it spreads the load across multiple tap queues, each one having a unique file descriptor. Beware of the algorithm selecting the queue though: it might bite you back.</p><p>By default, the Linux tap driver records a symmetric flow hash of every handled flow in a flow table, saving which queue the application used to transmit that flow. On the receiving side it follows that selection and sends subsequent packets of the flow to that specific queue. For example, if your userspace application is sending some TCP flow over queue #2, then the packets going into the application which are part of that flow will go to queue #2. This is generally a sensible design, as long as the sender always selects one specific queue. If the sender changes the TX queue, new packets immediately shift to the new queue, and packets within one flow might be seen as reordered. Additionally, this queue selection design does not take CPU locality into account, which might have minor negative effects on performance for very high throughput applications.</p><p>It's possible to override the flow hash based queue selection by using a <a href="https://www.infradead.org/~mchehab/kernel_docs/networking/multiqueue.html">tc multiq qdisc and the skbedit queue_mapping filter</a>:</p>
            <pre><code>tc qdisc add dev tap0 root handle 1: multiq
tc filter add dev tap0 parent 1: protocol ip prio 1 u32 \
        match ip dst 192.168.0.3 \
        action skbedit queue_mapping 0</code></pre>
            <p><b>tc</b> is fragile and thus not a solution I would recommend. A better way is to customize the queue selection algorithm with a TUNSETSTEERINGEBPF eBPF program. In that case, the flow tracking code is no longer employed. By using such a steering eBPF program smartly, it's possible to keep flow processing local to one CPU — useful for best performance.</p>
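<p>The rule the kernel applies to the steering program's verdict is simple: the return value is truncated to u16 and taken modulo the number of queues. So a program that returns, say, a flow hash or the current CPU id pins a flow (or a CPU) to one queue. A userspace simulation of that selection rule (the function name is mine, for illustration):</p>

```c
#include <stdint.h>

/* Simulate how the tap driver maps a TUNSETSTEERINGEBPF program's
 * return value to a queue: truncate to u16, then modulo numqueues. */
unsigned steer_queue(long long bpf_ret, unsigned numqueues)
{
	return (uint16_t)bpf_ret % numqueues;
}
```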
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Now you know everything I wish I had known when I was setting out on this journey!</p><p>To get the best performance, I recommend:</p><ul><li><p>enable the vnethdr</p></li><li><p>enable offloads (TSO and USO)</p></li><li><p>consider spreading the load across multiple queues and CPUs with multi queue</p></li><li><p>consider syscall batching for an additional gain of maybe 10%, perhaps with io_uring</p></li><li><p>consider customizing the steering algorithm</p></li></ul><p>References:</p><ul><li><p><a href="https://www.linux-kvm.org/images/6/63/2012-forum-multiqueue-networking-for-kvm.pdf">https://www.linux-kvm.org/images/6/63/2012-forum-multiqueue-networking-for-kvm.pdf</a></p></li><li><p><a href="https://backreference.org/2010/03/26/tuntap-interface-tutorial/">https://backreference.org/2010/03/26/tuntap-interface-tutorial/</a></p></li><li><p><a href="https://ldpreload.com/p/tuntap-notes.txt">https://ldpreload.com/p/tuntap-notes.txt</a></p></li><li><p><a href="https://www.kernel.org/doc/Documentation/networking/tuntap.txt">https://www.kernel.org/doc/Documentation/networking/tuntap.txt</a></p></li><li><p><a href="https://tailscale.com/blog/more-throughput/">https://tailscale.com/blog/more-throughput/</a></p></li></ul> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">7Fur1bA5gwvHjrVPMLYx9E</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[The day my ping took countermeasures]]></title>
            <link>https://blog.cloudflare.com/the-day-my-ping-took-countermeasures/</link>
            <pubDate>Tue, 11 Jul 2023 13:05:00 GMT</pubDate>
            <description><![CDATA[ Ping developers clearly put some thought into that. I wondered how far they went. Did they handle clock changes in both directions? Are the bad measurements excluded from the final statistics? How do they test the software? ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3n0xGuyOEFmzVNvoE2XTh0/d14bafdc7ac93fbc0bb1f9d3d790240e/Screenshot-2023-07-11-at-13.30.23.png" />
            
            </figure><p>Once my holidays had passed, I found myself reluctantly reemerging into the world of the living. I powered on a corporate laptop, scared to check on my email inbox. However, before turning on the browser, obviously, I had to run a ping. Debugging the network is a mandatory first step after a boot, right? As expected, the network was perfectly healthy but what caught me off guard was this message:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5bipMZ0BFtWHz9fK7eNo3Q/13a2d4930b304d293727af404c798c8e/image6.png" />
            
            </figure><p>I was not expecting <b>ping</b> to <b>take countermeasures</b> that early on in a day. Gosh, I wasn't expecting any countermeasures that Monday!</p><p>Once I got over the initial confusion, I took a deep breath and collected my thoughts. You don't have to be Sherlock Holmes to figure out what has happened. I'm really fast - I started <b>ping</b> <i>before</i> the system <b>NTP</b> daemon synchronized the time. In my case, the computer clock was rolled backward, confusing ping.</p><p>While this doesn't happen too often, a computer clock can be freely adjusted either forward or backward. However, it's pretty rare for a regular network utility, like ping, to try to manage a situation like this. It's even less common to call it "taking countermeasures". I would totally expect ping to just print a nonsensical time value and move on without hesitation.</p><p>Ping developers clearly put some thought into that. I wondered how far they went. Did they handle clock changes in both directions? Are the bad measurements excluded from the final statistics? How do they test the software?</p><p>I can't just walk past ping "taking countermeasures" on me. Now I have to understand what ping did and why.</p>
    <div>
      <h3>Understanding ping</h3>
      <a href="#understanding-ping">
        
      </a>
    </div>
    <p>An investigation like this starts with a quick glance at the source code:</p>
            <pre><code> *			P I N G . C
 *
 * Using the InterNet Control Message Protocol (ICMP) "ECHO" facility,
 * measure round-trip-delays and packet loss across network paths.
 *
 * Author -
 *	Mike Muuss
 *	U. S. Army Ballistic Research Laboratory
 *	December, 1983</code></pre>
            <p><b>Ping</b> goes back a long way. It was originally written by <a href="https://en.wikipedia.org/wiki/Mike_Muuss">Mike Muuss</a> while at the U. S. Army Ballistic Research Laboratory, in 1983, before I was born. The code we're looking for is under <a href="https://github.com/iputils/iputils/blob/ee0a515e74b8d39fbe9b68f3309f0cb2586ccdd4/ping/ping_common.c#L746">iputils/ping/ping_common.c</a> gather_statistics() function:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FJGGtP0ecs0zhiBKsyfxx/224080cf97fc53fa3fb83c59a86e43c9/image5.png" />
            
            </figure><p>The code is straightforward: the message in question is printed when the measured <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> is negative. In this case ping resets the latency measurement to zero. Here you are: "taking countermeasures" is nothing more than just marking an erroneous measurement as if it was 0ms.</p><p>But what precisely does ping measure? Is it the wall clock? The <a href="https://man7.org/linux/man-pages/man8/ping.8.html">man page</a> comes to the rescue. Ping has two modes.</p><p>The "old", -U mode, in which it uses the wall clock. This mode is less accurate (has more jitter). It calls <b>gettimeofday</b> before sending and after receiving the packet.</p><p>The "new", default, mode in which it uses "network time". It calls <b>gettimeofday</b> before sending, and gets the receive timestamp from a more accurate SO_TIMESTAMP CMSG. More on this later.</p>
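<p>The clamping logic boils down to a few lines. Here is a sketch of my own (not iputils' actual code) of what gather_statistics() effectively does with a clock that was stepped backwards mid-flight:</p>

```c
#include <stdbool.h>

/* RTT is receive time minus send time. If the wall clock was rolled
 * backwards between the two readings, the difference goes negative:
 * ping prints the "taking countermeasures" warning and clamps the
 * sample to 0 instead of reporting a nonsensical value. */
long rtt_usec(long send_usec, long recv_usec, bool *warned)
{
	long triptime = recv_usec - send_usec;
	*warned = false;
	if (triptime < 0) {
		/* "Warning: time of day goes back ... taking countermeasures" */
		*warned = true;
		triptime = 0;
	}
	return triptime;
}
```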
    <div>
      <h3>Tracing gettimeofday is hard</h3>
      <a href="#tracing-gettimeofday-is-hard">
        
      </a>
    </div>
    <p>Let's start with a good old strace:</p>
            <pre><code>$ strace -e trace=gettimeofday,time,clock_gettime -f ping -n -c1 1.1 &gt;/dev/null
... nil ...</code></pre>
            <p>It doesn't show any calls to <b>gettimeofday</b>. What is going on?</p><p>On modern Linux some syscalls are not true syscalls. Instead of jumping to the kernel space, which is slow, they remain in userspace and go to a special code page provided by the host kernel. This code page is called <b>vdso</b>. It's visible as a <b>.so</b> library to the program:</p>
            <pre><code>$ ldd `which ping` | grep vds
    linux-vdso.so.1 (0x00007ffff47f9000)</code></pre>
            <p>Calls to the <b>vdso</b> region are not syscalls; they remain in userspace and are super fast, but classic strace can't see them. For debugging it would be nice to turn off the <b>vdso</b> and fall back to classic slow syscalls. That's easier said than done.</p><p>There is no way to prevent loading of the <b>vdso</b>. However, there are two ways to convince a loaded program not to use it.</p><p>The first technique is about fooling glibc into thinking the <b>vdso</b> is not loaded (glibc must handle that case for compatibility with ancient Linux). When bootstrapping in a freshly run process, glibc inspects the <a href="https://www.gnu.org/software/libc/manual/html_node/Auxiliary-Vector.html">Auxiliary Vector</a> provided by the ELF loader. One of the parameters holds the location of the <b>vdso</b> pointer; <a href="https://man7.org/linux/man-pages/man7/vdso.7.html">the man page</a> gives this example:</p>
            <pre><code>void *vdso = (uintptr_t) getauxval(AT_SYSINFO_EHDR);</code></pre>
            <p>A technique proposed on <a href="https://stackoverflow.com/a/63811017">Stack Overflow</a> works like this: let's hook on a program before <b>execve</b>() exits and overwrite the Auxiliary Vector AT_SYSINFO_EHDR parameter. Here's the <a href="https://github.com/danteu/novdso/blob/master/novdso.c">novdso.c</a> code. However, the linked code doesn't quite work for me (one too many <b>kill(SIGSTOP)</b>), and it has a bigger, fundamental flaw: to hook on <b>execve()</b> it uses <b>ptrace()</b>, and therefore doesn't work under our strace!</p>
            <pre><code>$ strace -f ./novdso ping 1.1 -c1 -n
...
[pid 69316] ptrace(PTRACE_TRACEME)  	= -1 EPERM (Operation not permitted)</code></pre>
            <p>While this technique of rewriting AT_SYSINFO_EHDR is pretty cool, it won't work for us. (I wonder if there is another way of doing it without ptrace. Maybe with some BPF? But that is another story.)</p><p>A second technique is to use <b>LD_PRELOAD</b> to preload a trivial library overriding the functions in question and forcing them to go to the slow, real syscalls. This works fine:</p>
            <pre><code>$ cat vdso_override.c
#include &lt;sys/syscall.h&gt;
#include &lt;sys/time.h&gt;
#include &lt;time.h&gt;
#include &lt;unistd.h&gt;

int gettimeofday(struct timeval *restrict tv, void *restrict tz) {
	return syscall(__NR_gettimeofday, (long)tv, (long)tz, 0, 0, 0, 0);
}

time_t time(time_t *tloc) {
	return syscall(__NR_time, (long)tloc, 0, 0, 0, 0, 0);
}

int clock_gettime(clockid_t clockid, struct timespec *tp) {
	return syscall(__NR_clock_gettime, (long)clockid, (long)tp, 0, 0, 0, 0);
}</code></pre>
            <p>To load it:</p>
            <pre><code>$ gcc -Wall -Wextra -fpic -shared -o vdso_override.so vdso_override.c

$ LD_PRELOAD=./vdso_override.so \
       strace -e trace=gettimeofday,clock_gettime,time \
       date

clock_gettime(CLOCK_REALTIME, {tv_sec=1688656245 ...}) = 0
Thu Jul  6 05:10:45 PM CEST 2023
+++ exited with 0 +++</code></pre>
            <p>Hurray! We can see the <b>clock_gettime</b> call in <b>strace</b> output. Surely we'll also see <b>gettimeofday</b> from our <b>ping</b>, right?</p><p>Not so fast, it still doesn't quite work:</p>
            <pre><code>$ LD_PRELOAD=./vdso_override.so \
     strace -c -e trace=gettimeofday,time,clock_gettime -f \
     ping -n -c1 1.1 &gt;/dev/null
... nil ...</code></pre>
            
    <div>
      <h3>To suid or not to suid</h3>
      <a href="#to-suid-or-not-to-suid">
        
      </a>
    </div>
    <p>I forgot that <b>ping</b> might need special permissions to read and write raw packets. Historically it had the <b>suid</b> bit set, which granted the program an elevated user identity. However, LD_PRELOAD doesn't work with suid. When a program is loaded, the <a href="https://github.com/bminor/musl/blob/718f363bc2067b6487900eddc9180c84e7739f80/ldso/dynlink.c#L1820">dynamic linker checks if it has the <b>suid</b> bit</a>, and if so, it ignores the LD_PRELOAD and LD_LIBRARY_PATH settings.</p><p>However, does <b>ping</b> need suid? Nowadays it's totally possible to send and receive ICMP Echo messages without any extra privileges, like this:</p>
            <pre><code>from socket import *
import struct

sd = socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP)
sd.connect(('1.1', 0))

sd.send(struct.pack("!BBHHH10s", 8, 0, 0, 0, 1234, b'payload'))
data = sd.recv(1024)
print('type=%d code=%d csum=0x%x id=%d seq=%d payload=%s' % struct.unpack_from("!BBHHH10s", data))</code></pre>
            <p>Now you know how to write "ping" in eight lines of Python. This Linux API is known as the <b>ping socket</b>. It generally works on modern Linux; however, it requires the right sysctl, which is typically enabled:</p>
            <pre><code>$ sysctl net.ipv4.ping_group_range
net.ipv4.ping_group_range = 0    2147483647</code></pre>
            <p>The <b>ping socket</b> is not as mature as UDP or TCP sockets. The "ICMP ID" field is used to dispatch an ICMP Echo Reply to the appropriate socket, but when using <b>bind()</b> this property is settable by the user without any checks. A malicious user can deliberately cause an "ICMP ID" conflict.</p><p>But we're not here to discuss Linux networking APIs. We're here to discuss the <b>ping</b> utility and indeed, it uses <b>ping sockets</b>:</p>
            <pre><code>$ strace -e trace=socket -f ping 1.1 -nUc 1
socket(AF_INET, SOCK_DGRAM, IPPROTO_ICMP) = 3
socket(AF_INET6, SOCK_DGRAM, IPPROTO_ICMPV6) = 4</code></pre>
            <p>Ping sockets are rootless, and <b>ping</b>, at least on my laptop, is not a suid program:</p>
            <pre><code>$ ls -lah `which ping`
-rwxr-xr-x 1 root root 75K Feb  5  2022 /usr/bin/ping</code></pre>
            <p>So why doesn't LD_PRELOAD work? It turns out the <b>ping</b> binary holds the CAP_NET_RAW capability. Similarly to suid, this prevents the library preloading machinery from working:</p>
            <pre><code>$ getcap `which ping`
/usr/bin/ping cap_net_raw=ep</code></pre>
            <p>I think this capability is enabled only to handle the case of a misconfigured <b>net.ipv4.ping_group_range</b> sysctl. For me ping works perfectly fine without this capability.</p>
    <div>
      <h3>Rootless is perfectly fine</h3>
      <a href="#rootless-is-perfectly-fine">
        
      </a>
    </div>
    <p>Let's remove CAP_NET_RAW and try the LD_PRELOAD hack again:</p>
            <pre><code>$ cp `which ping` .

$ LD_PRELOAD=./vdso_override.so strace -f ./ping -n -c1 1.1
...
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
gettimeofday({tv_sec= ... ) = 0
sendto(3, ...)
setitimer(ITIMER_REAL, {it_value={tv_sec=10}}, NULL) = 0
recvmsg(3, { ... cmsg_level=SOL_SOCKET, 
                 cmsg_type=SO_TIMESTAMP_OLD, 
                 cmsg_data={tv_sec=...}}, )</code></pre>
            <p>We finally made it! Without -U, in the "network timestamp" mode, <b>ping</b>:</p><ul><li><p>Sets SO_TIMESTAMP flag on a socket.</p></li><li><p>Calls <b>gettimeofday</b> before sending the packet.</p></li><li><p>When fetching a packet, gets the timestamp from the <b>CMSG</b>.</p></li></ul>
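            <p>For illustration, here's what the SO_TIMESTAMP mechanism looks like from userspace. This is a minimal sketch using an unprivileged UDP socket on loopback instead of a ping socket, and it assumes Linux on a 64-bit machine, where SO_TIMESTAMP (a.k.a. SO_TIMESTAMP_OLD) is socket option 29 and the timestamp arrives as a struct timeval:</p>

```python
import socket
import struct

# Linux SO_TIMESTAMP / SO_TIMESTAMP_OLD; Python may not export the constant.
SO_TIMESTAMP = getattr(socket, "SO_TIMESTAMP", 29)

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMP, 1)  # ask for receive timestamps
rx.settimeout(2)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"x", rx.getsockname())

data, ancdata, flags, addr = rx.recvmsg(64, 1024)
sec = usec = 0
for level, ctype, cdata in ancdata:
    if level == socket.SOL_SOCKET and ctype == SO_TIMESTAMP:
        # cdata is a struct timeval: two native longs, 16 bytes on 64-bit
        sec, usec = struct.unpack("@qq", cdata[:16])
print("kernel rx timestamp: %d.%06d" % (sec, usec))
```

            <p>The printed timestamp comes from the same non-monotonic system clock that <b>gettimeofday</b> reads, which is exactly why ping can be confused by clock adjustments.</p>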
    <div>
      <h3>Fault injection - fooling ping</h3>
      <a href="#fault-injection-fooling-ping">
        
      </a>
    </div>
    <p>With <b>strace</b> up and running we can finally do something interesting. You see, <b>strace</b> has a little known fault injection feature, named <a href="https://man7.org/linux/man-pages/man1/strace.1.html">"tampering" in the manual</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EpzqSkXe9BCttqJajgXpi/79d3f4051066ab7e8fa88795c2c33e58/image3.png" />
            
            </figure><p>With a couple of command line parameters we can overwrite the result of the <b>gettimeofday</b> call. I want to set it forward to confuse ping into thinking the SO_TIMESTAMP time is in the past:</p>
            <pre><code>LD_PRELOAD=./vdso_override.so \
    strace -o /dev/null -e trace=gettimeofday \
            -e inject=gettimeofday:poke_exit=@arg1=ff:when=1 -f \
    ./ping -c 1 -n 1.1.1.1

PING 1.1.1.1 (1.1.1.1) 56(84) bytes of data.
./ping: Warning: time of day goes back (-59995290us), taking countermeasures
./ping: Warning: time of day goes back (-59995104us), taking countermeasures
64 bytes from 1.1.1.1: icmp_seq=1 ttl=60 time=0.000 ms

--- 1.1.1.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms</code></pre>
            <p>It worked! We can now generate the "taking countermeasures" message reliably!</p><p>While we can cheat on the <b>gettimeofday</b> result, with <b>strace</b> it's impossible to overwrite the CMSG timestamp. Perhaps it would be possible to adjust the CMSG timestamp with Linux <a href="https://man7.org/linux/man-pages/man7/time_namespaces.7.html">time namespaces</a>, but I don't think it'll work. As far as I understand, time namespaces are not taken into account by the network stack. A program using SO_TIMESTAMP is doomed to compare it against the system clock, which might be rolled backwards.</p>
    <div>
      <h3>Fool me once, fool me twice</h3>
      <a href="#fool-me-once-fool-me-twice">
        
      </a>
    </div>
    <p>At this point we could conclude our investigation. We're now able to reliably trigger the "taking countermeasures" message using strace fault injection.</p><p>There is one more thing though. When sending ICMP Echo Request messages, does <b>ping</b> <b>remember</b> the send timestamp in some kind of hash table? That might be wasteful considering a long-running ping sending thousands of packets.</p><p>Ping is smart, and instead puts the timestamp in the ICMP Echo Request <b>packet payload</b>!</p><p>Here's how the full algorithm works:</p><ol><li><p>Ping sets the SO_TIMESTAMP_OLD socket option to receive timestamps.</p></li><li><p>It looks at the wall clock with <b>gettimeofday</b>.</p></li><li><p>It puts the current timestamp in the first bytes of the ICMP payload.</p></li><li><p>After receiving the ICMP Echo Reply packet, it inspects the two timestamps: the send timestamp from the payload and the receive timestamp from CMSG.</p></li><li><p>It calculates the RTT delta.</p></li></ol><p>This is pretty neat! With this algorithm, ping doesn't need to remember much, and can have an unlimited number of packets in flight! (For completeness, ping maintains a small fixed-size bitmap to account for DUP! packets.)</p><p>What if we set the packet length to less than 16 bytes? Let's see:</p>
            <pre><code>$ ping 1.1 -c2 -s0
PING 1.1 (1.0.0.1) 0(28) bytes of data.
8 bytes from 1.0.0.1: icmp_seq=1 ttl=60
8 bytes from 1.0.0.1: icmp_seq=2 ttl=60
--- 1.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms</code></pre>
            <p>In such a case ping just skips the RTT in the output. Smart!</p><p>Right... this opens two completely new subjects. While ping was written back when everyone was friendly, today’s Internet can have rogue actors. What if we spoofed responses to confuse ping? Can we cut the payload to prevent ping from producing an RTT, or spoof the timestamp to fool the RTT measurements?</p><p>Both things work! The truncated case will look like this to the sender:</p>
            <pre><code>$ ping 139.162.188.91
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
8 bytes from 139.162.188.91: icmp_seq=1 ttl=53 (truncated)</code></pre>
            <p>The second case, of an overwritten timestamp, is even cooler. We can move the timestamp forward, causing ping to show our favorite "taking countermeasures" message:</p>
            <pre><code>$ ping 139.162.188.91  -c 2 -n
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
./ping: Warning: time of day goes back (-1677721599919015us), taking countermeasures
./ping: Warning: time of day goes back (-1677721599918907us), taking countermeasures
64 bytes from 139.162.188.91: icmp_seq=1 ttl=53 time=0.000 ms
./ping: Warning: time of day goes back (-1677721599905149us), taking countermeasures
64 bytes from 139.162.188.91: icmp_seq=2 ttl=53 time=0.000 ms

--- 139.162.188.91 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms</code></pre>
            <p>Alternatively, we can move the time in the packet backwards, causing <a href="https://github.com/iputils/iputils/issues/480">ping to show nonsensical RTT values</a>:</p>
            <pre><code>$ ./ping 139.162.188.91  -c 2 -n
PING 139.162.188.91 (139.162.188.91) 56(84) bytes of data.
64 bytes from 139.162.188.91: icmp_seq=1 ttl=53 time=1677721600430 ms
64 bytes from 139.162.188.91: icmp_seq=2 ttl=53 time=1677721600084 ms

--- 139.162.188.91 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 1677721600084.349/1677721600257.351/1677721600430.354/-9223372036854775.-808 ms</code></pre>
            <p>We proved that the "countermeasures" work only when time moves in one direction. In the other direction, ping is simply fooled.</p><p>Here's a rough scapy snippet that generates spoofed ICMP Echo Reply packets to fool ping:</p>
            <pre><code># iptables -I INPUT -i eth0 -p icmp --icmp-type=8 -j DROP
import scapy.all as scapy
import struct

iface = "eth0"  # adjust to the interface facing the pinging host

def custom_action(echo_req):
    try:
        payload = bytes(echo_req[scapy.ICMP].payload)
        if len(payload) &gt;= 8:
            # Tweak the timestamp ping embedded in the payload,
            # skewing its RTT computation.
            ts, tu = struct.unpack_from("&lt;II", payload)
            payload = struct.pack("&lt;II", (ts - 0x64000000) &amp; 0xffffffff, tu) \
                      + payload[8:]

        echo_reply = scapy.IP(
            dst=echo_req[scapy.IP].src,
            src=echo_req[scapy.IP].dst,
        ) / scapy.ICMP(type=0, code=0,
                       id=echo_req[scapy.ICMP].id,
                       seq=echo_req.payload.seq,
        ) / payload
        scapy.send(echo_reply, iface=iface)
    except Exception:
        pass

scapy.sniff(filter="icmp and icmp[0] = 8", iface=iface, prn=custom_action)</code></pre>
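            <p>To recap the logic we've been exploiting, ping's RTT computation and the countermeasures check can be modeled roughly as below. This is a simplified sketch with a made-up function name; real ping operates on struct timeval values:</p>

```python
def rtt_us(send_tv, recv_tv):
    """RTT in microseconds from two (sec, usec) wall-clock readings:
    the send time carried in the ICMP payload and the receive time
    reported by the kernel via the SO_TIMESTAMP CMSG."""
    (ss, su), (rs, ru) = send_tv, recv_tv
    delta = (rs - ss) * 1_000_000 + (ru - su)
    if delta < 0:
        # The wall clock moved backwards between send and receive.
        # This is when ping warns "time of day goes back (...us),
        # taking countermeasures" and discards the measurement.
        return None
    return delta

print(rtt_us((1000, 0), (1000, 250)))     # a normal 250 us RTT
print(rtt_us((1000, 0), (999, 999_900)))  # clock rolled back: None
```

            <p>Note that the check only fires for a negative delta; a timestamp moved backwards inside the packet merely inflates the RTT, as we saw above.</p>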
            
    <div>
      <h3>Leap second</h3>
      <a href="#leap-second">
        
      </a>
    </div>
    <p>In practice, how often does time change on a computer? The <b>NTP</b> daemon adjusts the clock all the time to account for any drift. However, these are very small changes. Apart from initial clock synchronization after boot or sleep wakeup, big clock shifts shouldn't really happen.</p><p>There are exceptions as usual. Systems that operate in virtual environments or have unreliable Internet connections often experience their clocks getting out of sync.</p><p>One notable case that affects all computers is a coordinated clock adjustment called a <a href="https://en.wikipedia.org/wiki/Leap_second">leap second</a>. It causes the clock to move backwards, which is particularly troublesome. An issue with handling leap second <a href="/how-and-why-the-leap-second-affected-cloudflare-dns/">caused our engineers a headache in late 2016</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4iGPlBeYYXm0RYyxhEmOMQ/20b49ba0ad47eeb0404610e710bd4226/Screenshot-2023-07-11-at-13.33.02.png" />
            
            </figure><p>Leap seconds often cause issues, so the current consensus is to <a href="https://www.nytimes.com/2022/11/19/science/time-leap-second-bipm.html">deprecate them by 2035</a>. However, <a href="https://en.wikipedia.org/wiki/Leap_second#International_proposals_for_elimination_of_leap_seconds">according to Wikipedia</a> the solution seems to be to just kick the can down the road:</p><blockquote><p><i>A suggested possible future measure would be to let the discrepancy increase to a full minute, which would take 50 to 100 years, and then have the last minute of the day taking two minutes in a "kind of smear" with no discontinuity.</i></p></blockquote><p>In any case, there hasn't been a leap second since 2016; there might be more in the future, but there likely won't be any after 2035. Many environments already use a <a href="https://cloudplatform.googleblog.com/2015/05/Got-a-second-A-leap-second-that-is-Be-ready-for-June-30th.html">leap second smear</a> to avoid the problem of the clock jumping back.</p><p>In most cases, it might be completely fine to ignore clock changes. When possible, use CLOCK_MONOTONIC to count time durations; it is bulletproof.</p><p>We haven't mentioned <a href="https://en.wikipedia.org/wiki/Daylight_saving_time">daylight savings</a> clock adjustments here because, from a computer's perspective, they are not real clock changes! Most often programmers deal with the operating system clock, which is typically set to the <a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time">UTC timezone</a>. The DST offset is taken into account only when pretty-printing the date on screen; the underlying software operates on integer values. Let's consider an example of two timestamps which, in my <a href="https://devblogs.microsoft.com/oldnewthing/20061027-00/?p=29213">Warsaw timezone</a>, appear to fall under two different DST offsets. While it may look like the clock rolled back, this is just a user interface illusion. The integer timestamps are sequential:</p>
            <pre><code>$ date --date=@$[1698541199+0]
Sun Oct 29 02:59:59 AM CEST 2023

$ date --date=@$[1698541199+1]
Sun Oct 29 02:00:00 AM CET 2023</code></pre>
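            <p>The same illusion can be reproduced in Python, assuming the system's tzdata/zoneinfo database includes the Europe/Warsaw zone:</p>

```python
from datetime import datetime
from zoneinfo import ZoneInfo

tz = ZoneInfo("Europe/Warsaw")
for ts in (1698541199, 1698541200):
    # Two consecutive integer timestamps, printed as local wall-clock time.
    print(ts, datetime.fromtimestamp(ts, tz).isoformat())
```

            <p>The printed wall clock jumps from 02:59:59+02:00 back to 02:00:00+01:00, while the underlying integers simply increment.</p>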
            
    <div>
      <h3>Lessons</h3>
      <a href="#lessons">
        
      </a>
    </div>
    <p>Arguably, the clock jumping backwards is a rare occurrence. It's very hard to test for such cases, and I was surprised to find that <b>ping</b> made such an attempt. To avoid the problem, ping could measure latency using CLOCK_MONOTONIC; its developers already <a href="https://github.com/iputils/iputils/commit/4fd276cd8211c502cb87c5db0ce15cd685177216">use this time source in another place</a>.</p><p>Unfortunately, this won't quite work here. Ping needs to compare the send timestamp to the receive timestamp from the SO_TIMESTAMP CMSG, which uses the non-monotonic system clock. Linux APIs are sometimes limited, and dealing with time is hard. For the time being, clock adjustments will continue to confuse ping.</p><p>In any case, now we know what to do when <b>ping</b> is "<b>taking countermeasures</b>"! Pull down your periscope and check the <b>NTP</b> daemon status!</p> ]]></content:encoded>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">6M6gXK7v29wVHnBMZMYRpS</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare servers don't own IPs anymore – so how do they connect to the Internet?]]></title>
            <link>https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/</link>
            <pubDate>Fri, 25 Nov 2022 14:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog we'll discuss how we manage Cloudflare IP addresses
used to retrieve the data from the Internet, how our egress
network design has evolved, how we optimized it for best use
of available IP space and introduce our soft-anycast technology ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/27jjBnvTr1dvuTNt8baVMj/5244e0d8f89ffd15d4bf51a15e646625/image11-3.png" />
            
            </figure><p>A lot of Cloudflare's technology is well documented. For example, how we handle traffic between the eyeballs (clients) and our servers has been discussed many times on this blog: “<a href="/a-brief-anycast-primer/">A brief primer on anycast (2011)</a>”, "<a href="/cloudflares-architecture-eliminating-single-p/">Load Balancing without Load Balancers (2013)</a>", "<a href="/path-mtu-discovery-in-practice/">Path MTU discovery in practice (2015)</a>",  "<a href="/unimog-cloudflares-edge-load-balancer/">Cloudflare's edge load balancer (2020)</a>", "<a href="/tubular-fixing-the-socket-api-with-ebpf/">How we fixed the BSD socket API (2022)</a>".</p><p>However, we have rarely talked about the second part of our networking setup — how our servers fetch the content from the Internet. In this blog we’re going to cover this gap. We'll discuss how we manage Cloudflare IP addresses used to retrieve the data from the Internet, how our egress network design has evolved and how we optimized it for best use of available IP space.</p><p>Brace yourself. We have a lot to cover.</p>
    <div>
      <h3>Terminology first!</h3>
      <a href="#terminology-first">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/01iopATg1CQGchxawLCk1U/9d8582351fed8929930c18c78321f194/image8-4.png" />
            
            </figure><p>Each Cloudflare server deals with many kinds of networking traffic, but two rough categories stand out:</p><ul><li><p><i>Internet sourced traffic</i> - Inbound connections initiated by eyeballs to our servers. In the context of this blog post we'll call these "<b>ingress</b> connections".</p></li><li><p><i>Cloudflare sourced traffic</i> - Outgoing connections initiated by our servers to other hosts on the Internet. For brevity, we'll call these "<b>egress</b> connections".</p></li></ul><p>The egress part, while rarely discussed on this blog, is critical for our operation. Our servers must initiate outgoing connections to get their jobs done! Like:</p><ul><li><p>In our CDN product, before the content is cached, it's fetched from the origin servers. See "<a href="/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/">Pingora, the proxy that connects Cloudflare to the Internet (2022)</a>", <a href="/argo-v2/">Argo</a> and <a href="/tiered-cache-smart-topology/">Tiered Cache</a>.</p></li><li><p>For the <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a> product, each ingress TCP connection results in one egress connection.</p></li><li><p><a href="https://workers.cloudflare.com/">Workers</a> often run multiple subrequests to construct an HTTP response. Some of them might be querying servers on the Internet.</p></li><li><p>We also operate client-facing forward proxy products - like WARP and Cloudflare Gateway. These proxies deal with eyeball connections destined to the Internet. Our servers need to establish connections to the Internet on behalf of our users.</p></li></ul><p>And so on.</p>
    <div>
      <h3>Anycast on ingress, unicast on egress</h3>
      <a href="#anycast-on-ingress-unicast-on-egress">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WOoxT0QUqaGhBwa0lSgRu/15aba7b2dbf84e647d84dfa3a5adeeb2/image9-3.png" />
            
            </figure><p>Our ingress network architecture is very different from the egress one. On ingress, the connections sourced from the Internet are handled exclusively by our anycast IP ranges. Anycast is a technology where each of our data centers "announces" and can handle the same IP ranges. With many destinations possible, how does the Internet know where to route the packets? Well, the eyeball packets are routed towards the closest data center based on Internet BGP metrics; often it's also geographically the closest one. Usually, the BGP routes don't change much, and each eyeball IP can be expected to be routed to a single data center.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7jiVD7EKAKmWRoIN6FlrMz/f7b12c24fcf94a56e44638f3f74f955a/image10-2.png" />
            
            </figure><p>However, while anycast works well in the ingress direction, it can't operate on egress. Establishing an outgoing connection from an anycast IP won't work. Consider the response packet. It's likely to be routed back to a wrong place - a data center geographically closest to the sender, not necessarily the source data center!</p><p>For this reason, until recently, we established outgoing connections in a straightforward and conventional way: each server was given its own unicast IP address. "Unicast IP" means there is only one server using that address in the world. Return packets will work just fine and get back exactly to the right server identified by the unicast IP.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2GWAHkJLOKz7tR9rIgwfgG/5b7e68b8e02942697d7de1fbf6d74b0e/image5-16.png" />
            
            </figure>
    <div>
      <h3>Segmenting traffic based on egress IP</h3>
      <a href="#segmenting-traffic-based-on-egress-ip">
        
      </a>
    </div>
    <p>Originally connections sourced by Cloudflare were mostly HTTP fetches going to origin servers on the Internet. As our product line grew, so did the variety of traffic. The most notable example is <a href="/1111-warp-better-vpn/">our WARP app</a>. For WARP, our servers operate a forward proxy, and handle the traffic sourced by end-user devices. It's done without the same degree of intermediation as in our <a href="https://www.cloudflare.com/application-services/products/cdn/">CDN product</a>. This creates a problem. Third party servers on the Internet — like the origin servers — must be able to distinguish between connections coming from Cloudflare services and our WARP users. Such traffic segmentation is traditionally done by using different IP ranges for different traffic types (although recently we introduced more robust techniques like <a href="https://developers.cloudflare.com/ssl/origin-configuration/authenticated-origin-pull/">Authenticated Origin Pulls</a>).</p><p>To work around the trusted vs untrusted traffic pool differentiation problem, we added an untrusted WARP IP address to each of our servers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hudtFCXBWUe1aRFIfcH7s/47bca3701568a8562029e41334d79692/image4-30.png" />
            
            </figure>
    <div>
      <h3>Country tagged egress IP addresses</h3>
      <a href="#country-tagged-egress-ip-addresses">
        
      </a>
    </div>
    <p>It quickly became apparent that trusted vs untrusted weren't the only tags needed. For the WARP service we also need country tags. For example, United Kingdom based WARP users expect the bbc.com website to just work. However, the BBC restricts many of its services to people in the UK.</p><p>It does this by <i>geofencing</i> — using a database mapping public IP addresses to countries, and allowing only the UK ones. Geofencing is widespread on today's Internet. To avoid geofencing issues, we need to choose specific egress addresses tagged with an appropriate country, depending on the WARP user's location. Like many other parties on the Internet, we tag our egress IP space with country codes and publish it as a geofeed (like <a href="https://mask-api.icloud.com/egress-ip-ranges.csv">this one</a>). Note that the published geofeed is just data. The fact that an IP is tagged as, say, UK does not mean it is served from the UK; it just means the operator wants it to be geolocated to the UK. Like many things on the Internet, it is based on trust.</p><p>Notice that at this point we have three independent geographical tags:</p><ul><li><p>the country tag of the WARP user - the eyeball connecting IP</p></li><li><p>the location of the data center the eyeball connected to</p></li><li><p>the country tag of the egressing IP</p></li></ul><p>For best service, we want to choose the egressing IP so that its country tag matches the country of the eyeball IP. But egressing from a specific country tagged IP is challenging: our data centers serve users from all over the world, potentially from many countries! Remember: due to anycast we don't directly control the ingress routing. Internet geography doesn't always match physical geography. For example, our London data center receives traffic not only from users in the United Kingdom, but also from Ireland and Saudi Arabia. As a result, our servers in London need many WARP egress addresses associated with many countries:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6rcN3OCsVAvpWBiER4JHgI/6edbc6ec71af73c39d13c52e8e50b4e0/image2-52.png" />
            
            </figure><p>Can you see where this is going? The problem space just explodes! Instead of having one or two egress IP addresses for each server, now we require dozens, and <a href="/amazon-2bn-ipv4-tax-how-avoid-paying">IPv4 addresses aren't cheap</a>. With this design, we need many addresses per server, and we operate thousands of servers. This architecture becomes very expensive.</p>
    <div>
      <h3>Is anycast a problem?</h3>
      <a href="#is-anycast-a-problem">
        
      </a>
    </div>
    <p>Let me recap: with anycast ingress we don't control which data center the user is routed to. Therefore, each of our data centers must be able to egress from an address with any conceivable tag. Inside the data center we also don't control which server the connection is routed to. There are potentially many tags, many data centers, and many servers inside a data center.</p><p>Maybe the problem is the ingress architecture? Perhaps it's better to use a traditional networking design where a specific eyeball is routed with <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> to a specific data center, or even a server?</p><p>That's one way of thinking, but we decided against it. We like our anycast on ingress. It brings us many advantages:</p><ul><li><p><b>Performance</b>: with anycast, by definition, the eyeball is routed to the closest (by BGP metrics) data center. This is usually the fastest data center for a given user.</p></li><li><p><b>Automatic failover</b>: if one of our data centers becomes unavailable, the traffic will be instantly, automatically re-routed to the next best place.</p></li><li><p><b>DDoS resilience</b>: during a <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">denial of service attack</a> or a traffic spike, the load is automatically balanced across many data centers, significantly reducing the impact.</p></li><li><p><b>Uniform software</b>: The functionality of every data center and of every server inside a data center is identical. We use the same software stack on all the servers around the world. Each machine can perform any action, for any product. This enables easy debugging and good scalability.</p></li></ul><p>For these reasons we'd like to keep the anycast on ingress. We decided to solve the issue of egress address cardinality in some other way.</p>
    <div>
      <h3>Solving a million dollar problem</h3>
      <a href="#solving-a-million-dollar-problem">
        
      </a>
    </div>
    <p>Out of the thousands of servers we operate, every single one should be able to use an egress IP with any of the possible tags. It's easiest to explain our solution by first showing two extreme designs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UoCyD5rIsCDc6nAMY1jvp/646cd025de3707b60b205682f3b140c8/image6-10.png" />
            
            </figure><p><b>Each server owns all the needed IPs:</b> each server has all the specialized egress IPs with the needed tags.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3s0xwHrfjPQpusPeq9HTtX/3338238e28e89abcaa5b4fc3eff2cc09/image12-1.png" />
            
            </figure><p><b>One server owns the needed IP:</b> a specialized egress IP with a specific tag lives in one place, other servers forward traffic to it.</p><p>Both options have pros and cons:</p><table>
<thead>
  <tr>
    <th>Specialized IP on every server</th>
    <th>Specialized IP on one server</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Super expensive $$$, every server needs many IP addresses.</td>
    <td>Cheap $, only one specialized IP needed for a tag.</td>
  </tr>
  <tr>
    <td>Egress always local - fast</td>
    <td>Egress almost always forwarded - slow</td>
  </tr>
  <tr>
    <td>Excellent reliability - every server is independent</td>
    <td>Poor reliability - introduced chokepoints</td>
  </tr>
</tbody>
</table>
    <div>
      <h3>There's a third way</h3>
      <a href="#theres-a-third-way">
        
      </a>
    </div>
    <p>We've been thinking hard about this problem. Frankly, the first extreme option of having every needed IP available locally on every Cloudflare server is not totally unworkable. This is, roughly, what we were able to pull off for IPv6. With IPv6, access to the large needed IP space is not a problem.</p><p>However, in IPv4 neither option is acceptable. The first offers fast and reliable egress, but comes at great cost — the IPv4 addresses needed are expensive. The second option uses the smallest possible IP space, so it's cheap, but compromises on performance and reliability.</p><p>The solution we devised is a compromise between the extremes. The rough idea is to change the assignment unit. Instead of assigning one /32 IPv4 address to each server, we devised a method of assigning a /32 IP per data center, and then sharing it among physical servers.</p><table>
<thead>
  <tr>
    <th>Specialized IP on every server</th>
    <th>Specialized IP per data center</th>
    <th>Specialized IP on one server</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Super expensive $$$</td>
    <td>Reasonably priced $$</td>
    <td>Cheap $</td>
  </tr>
  <tr>
    <td>Egress always local - fast</td>
    <td>Egress always local - fast</td>
    <td>Egress almost always forwarded - slow</td>
  </tr>
  <tr>
    <td>Excellent reliability - every server is independent</td>
    <td>Excellent reliability - every server is independent</td>
    <td>Poor reliability - many choke points</td>
  </tr>
</tbody>
</table>
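    <p>Back-of-the-envelope arithmetic shows why the assignment unit matters. The fleet sizes below are made up purely for illustration, not Cloudflare's real numbers:</p>

```python
# Hypothetical numbers, for illustration only.
servers = 10_000
data_centers = 300
tags = 50  # country/trust tags needed at each location

per_server = servers * tags    # specialized IP on every server
per_dc = data_centers * tags   # specialized IP per data center
single = tags                  # specialized IP on one server only

print(per_server, per_dc, single)
```

    <p>Even with these toy numbers, the per-data-center unit cuts the required IPv4 space by more than an order of magnitude while keeping egress local.</p>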
    <div>
      <h3>Sharing an IP inside data center</h3>
      <a href="#sharing-an-ip-inside-data-center">
        
      </a>
    </div>
    <p>The idea of sharing an IP among servers is not new. Traditionally this can be achieved with Source-NAT on a router. Sadly, the sheer number of egress IPs we need and the size of our operation prevent us from relying on a stateful firewall / NAT at the router level. We also dislike shared state, so we're not fans of distributed NAT installations.</p><p>What we chose instead is splitting an egress IP across servers by <b>a port range</b>. For a given egress IP, each server owns a small portion of the available source ports - a port slice.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ruhT8POdI4Vd3yDd4WIrh/5204f34fbe65306902cb0b12216cedd4/image1-68.png" />
            
            </figure><p>When return packets arrive from the Internet, we have to route them back to the correct machine. For this task we've customized "Unimog" - our L4 XDP-based load balancer - ("<a href="/unimog-cloudflares-edge-load-balancer/">Unimog, Cloudflare's load balancer (2020)</a>") and it's working flawlessly.</p><p>With a port slice of, say, 2,048 ports, we can share one IP among 31 servers. However, there is always a possibility of running out of ports. To address this, we've worked hard to be able to reuse the egress ports efficiently. See "<a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">How to stop running out of ports (2022)</a>", "<a href="https://lpc.events/event/16/contributions/1349/">How to share IPv4 addresses (2022)</a>" and our <a href="https://cloudflare.tv/event/oZKxMJg4">Cloudflare.TV segment</a>.</p><p>This is pretty much it. Each server is aware of which IP addresses and port slices it owns. For inbound routing, Unimog inspects the ports and dispatches the packets to the appropriate machines.</p>
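To make the port-slice arithmetic concrete, here is a toy sketch in Python. This is not Cloudflare's actual code; the slice size and the reserved low-port range are illustrative assumptions:

```python
SLICE = 2048     # assumed number of source ports per server slice
RESERVED = 2048  # assume the lowest 2,048 ports are reserved and never sliced

def owner_server(src_port: int) -> int:
    """Map a source port on a shared egress IP to the index of the server
    owning that port slice."""
    if src_port < RESERVED:
        raise ValueError("port is in the reserved range, not part of any slice")
    return (src_port - RESERVED) // SLICE

# (65536 - 2048) / 2048 = 31 servers can share a single egress IP.
print(owner_server(10000))  # → 3
```

Inbound dispatch then reduces to this integer division: the load balancer looks at the destination port of a return packet and forwards it to the slice owner.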
    <div>
      <h3>Sharing a subnet between data centers</h3>
      <a href="#sharing-a-subnet-between-data-centers">
        
      </a>
    </div>
    <p>This is not the end of the story though: we haven't discussed how we can route a single /32 address into a data center. Traditionally, in the public Internet, it's only possible to route subnets with a granularity of /24, or 256 IP addresses. In our case this would lead to a great waste of IP space.</p><p>To solve this problem and improve the utilization of our IP space, we deployed our egress ranges as... <b>anycast</b>! With that in place, we customized Unimog and taught it to forward the packets over our <a href="/cloudflare-backbone-internet-fast-lane/">backbone network</a> to the right data center. Unimog maintains a database like this:</p>
            <pre><code>198.51.100.1 - forward to LHR
198.51.100.2 - forward to CDG
198.51.100.3 - forward to MAN
...</code></pre>
            <p>With this design, it doesn't matter to which data center return packets are delivered. Unimog can always fix it and forward the data to the right place. Basically, while at the <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/">BGP layer</a> we are using anycast, by design an IP semantically identifies a data center, and an IP plus a port range identifies a specific machine. It behaves almost like unicast.</p>
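As a toy illustration of that semantics, here is a hypothetical two-step lookup. The data mirrors the example database above; the slice size is assumed, and the reserved low-port range is ignored for brevity:

```python
# Hypothetical forwarding database, one entry per egress IP.
DC_FOR_IP = {
    "198.51.100.1": "LHR",
    "198.51.100.2": "CDG",
    "198.51.100.3": "MAN",
}

def route_return_packet(dst_ip: str, dst_port: int, slice_size: int = 2048):
    """Soft-unicast dispatch: the IP selects the data center,
    the port slice selects the machine inside it."""
    dc = DC_FOR_IP[dst_ip]           # forward over the backbone to this DC
    server = dst_port // slice_size  # the port slice identifies the server
    return dc, server

print(route_return_packet("198.51.100.2", 5000))  # → ('CDG', 2)
```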
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1aQ9FLJSpyr3OjcEynnyse/0274312eac4adaa6f391e7e48d3d6e3c/image7-6.png" />
            
            </figure><p>We call this technology stack "<b>soft-unicast</b>" and it feels magical. It's like we did unicast in software over anycast in the BGP layer.</p>
    <div>
      <h3>Soft-unicast is indistinguishable from magic</h3>
      <a href="#soft-unicast-is-indistinguishable-from-magic">
        
      </a>
    </div>
    <p>With this setup we can achieve significant benefits:</p><ul><li><p>We are able to share a /32 egress IP amongst <b>many servers</b>.</p></li><li><p>We can spread a single subnet across <b>many data centers</b>, and change it easily on the fly. This allows us to fully use our egress IPv4 ranges.</p></li><li><p>We can <b>group similar IP addresses</b> together. For example, all the IP addresses tagged with the "UK" tag might form a single continuous range. This reduces the size of the published geofeed.</p></li><li><p>It's easy for us to <b>onboard new egress IP ranges</b>, like customer IPs. This is useful for some of our products, like <a href="https://www.cloudflare.com/products/zero-trust/">Cloudflare Zero Trust</a>.</p></li></ul><p>All this is done at a sensible cost, with no loss of performance or reliability:</p><ul><li><p>Typically, the user is able to egress directly from the closest data center, providing the <b>best possible performance</b>.</p></li><li><p>Depending on the actual needs, we can allocate or release IP addresses. This gives us <b>flexibility with the IP</b> cost management: we don't need to overspend upfront.</p></li><li><p>Since we operate multiple egress IP addresses in different locations, the <b>reliability is not compromised</b>.</p></li></ul>
    <div>
      <h3>The true location of our IP addresses is: “the cloud”</h3>
      <a href="#the-true-location-of-our-ip-addresses-is-the-cloud">
        
      </a>
    </div>
    <p>While soft-unicast allows us to gain great efficiency, we've hit some issues. Sometimes we get the question: "Where does this IP physically exist?" But it doesn't have an answer! Our egress IPs don't exist physically anywhere. From a BGP standpoint our egress ranges are anycast, so they live everywhere. Logically, each address is used in one data center at a time, but we can move it around on demand.</p>
    <div>
      <h3>Content Delivery Networks misdirect users</h3>
      <a href="#content-delivery-networks-misdirect-users">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5sCk5BkrFFweDjmy5zQnHs/058bef00f049ab0445b59d70a422f5d0/image3-43.png" />
            
            </figure><p>As another example of problems, here's one issue we've hit with third-party CDNs. As we mentioned before, there are three country tags in our pipeline:</p><ul><li><p>The country tag of the IP the eyeball is connecting from.</p></li><li><p>The location of our data center.</p></li><li><p>The country tag of the IP addresses we chose for the egress connections.</p></li></ul><p>The fact that our egress address is tagged as "UK" doesn't always mean it actually is being used in the UK. We've had cases where a UK-tagged WARP user, due to maintenance of our LHR data center, was routed to Paris. A popular <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a> performed a reverse lookup of our egress IP, found it tagged as "UK", and directed the user to a London CDN server. This is generally OK... but we actually egressed from Paris at the time. This user ended up routing packets from their home in the UK, to Paris, and back to the UK. This is bad for performance.</p><p>We address this issue by performing DNS requests in the egressing data center. For DNS we use IP addresses tagged with the location of the <b>data center</b>, not the intended geolocation for the user. This generally fixes the problem, but sadly, there are still some exceptions.</p>
    <div>
      <h3>The future is here</h3>
      <a href="#the-future-is-here">
        
      </a>
    </div>
    <p>Our 2021 experiments with <a href="/addressing-agility/">Addressing Agility</a> proved we have plenty of opportunity to innovate with ingress addressing. Soft-unicast shows us we can achieve great flexibility and density on the egress side.</p><p>With each new product, the number of tags we need on the egress grows - from traffic trustworthiness and product category to geolocation. As the pool of usable IPv4 addresses shrinks, we can be sure there will be more innovation in the space. Soft-unicast is our solution, but for sure it's not our last development.</p><p>For now though, it seems like we're moving away from traditional unicast. Our egress IPs really don't exist in a fixed place anymore, and some of our servers don't even own a true unicast IP nowadays.</p> ]]></content:encoded>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">60l7qhQ3DKYiHtI3TduJSc</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[When the window is not fully open, your TCP stack is doing more than you think]]></title>
            <link>https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/</link>
            <pubDate>Tue, 26 Jul 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection ]]></description>
            <content:encoded><![CDATA[ <p>Over the years I've been lurking around the Linux kernel and have investigated the TCP code many times. But when recently we were working on <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Optimizing TCP for high WAN throughput while preserving low latency</a>, I realized I have gaps in my knowledge about how Linux manages TCP receive buffers and windows. As I dug deeper I found the subject complex and certainly non-obvious.</p><p>In this blog post I'll share my journey deep into the Linux networking stack, trying to understand the memory and window management of the receiving side of a TCP connection. Specifically, looking for answers to seemingly trivial questions:</p><ul><li><p>How much data can be stored in the TCP receive buffer? (it's not what you think)</p></li><li><p>How fast can it be filled? (it's not what you think either!)</p></li></ul><p>Our exploration focuses on the receiving side of the TCP connection. We'll try to understand how to tune it for the best speed, without wasting precious memory.</p>
    <div>
      <h3>A case of a rapid upload</h3>
      <a href="#a-case-of-a-rapid-upload">
        
      </a>
    </div>
    <p>To best illustrate the receive side buffer management we need pretty charts! But to grasp all the numbers, we need a bit of theory.</p><p>We'll draw charts from a receive side of a TCP flow, running a pretty straightforward scenario:</p><ul><li><p>The client opens a TCP connection.</p></li><li><p>The client does <code>send()</code>, and pushes as much data as possible.</p></li><li><p>The server doesn't <code>recv()</code> any data. We expect all the data to stay and wait in the receive queue.</p></li><li><p>We fix the SO_RCVBUF for better illustration.</p></li></ul><p>Simplified pseudocode might look like (<a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py">full code if you dare</a>):</p>
            <pre><code>sd = socket.socket(AF_INET, SOCK_STREAM, 0)
sd.bind(('127.0.0.3', 1234))
sd.listen(32)

cd = socket.socket(AF_INET, SOCK_STREAM, 0)
cd.setsockopt(SOL_SOCKET, SO_RCVBUF, 32*1024)
cd.connect(('127.0.0.3', 1234))

ssd, _ = sd.accept()

while True:
    cd.send(b'a'*128*1024)</code></pre>
            <p>We're interested in basic questions:</p><ul><li><p>How much data can fit in the server’s receive buffer? It turns out it's not exactly the same as the default read buffer size on Linux; we'll get there.</p></li><li><p>Assuming infinite bandwidth, what is the minimal time - measured in <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> - for the client to fill the receive buffer?</p></li></ul>
    <div>
      <h3>A bit of theory</h3>
      <a href="#a-bit-of-theory">
        
      </a>
    </div>
    <p>Let's start by establishing some common nomenclature. I'll follow the wording used by the <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> Linux tool from the <code>iproute2</code> package</a>.</p><p>First, there is the buffer budget limit. <a href="https://man7.org/linux/man-pages/man8/ss.8.html"><code>ss</code> manpage</a> calls it <b>skmem_rb</b>, in the kernel it's named <b>sk_rcvbuf</b>. This value is most often controlled by the Linux autotune mechanism using the <code>net.ipv4.tcp_rmem</code> setting:</p>
            <pre><code>$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 131072 6291456</code></pre>
            <p>Alternatively, it can be set manually with <code>setsockopt(SO_RCVBUF)</code> on a socket. Note that the kernel doubles the value given to this setsockopt. For example, SO_RCVBUF=16384 will result in skmem_rb=32768. The max value allowed for this setsockopt is limited to a meager 208KiB by default:</p>
            <pre><code>$ sysctl net.core.rmem_max net.core.wmem_max
net.core.rmem_max = 212992
net.core.wmem_max = 212992</code></pre>
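The doubling mentioned above is easy to observe from userspace. A minimal check, assuming a Linux host:

```python
import socket

# Ask for a 16 KiB receive buffer, then read back what the kernel stored.
sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16384)
print(sd.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))  # on Linux: 32768
sd.close()
```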
            <p><a href="/optimizing-tcp-for-high-throughput-and-low-latency/">The aforementioned blog post</a> discusses why manual buffer size management is problematic - relying on autotuning is generally preferable.</p><p>Here’s a diagram showing how <b>skmem_rb</b> budget is being divided:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3EEuOnbl8CKYCv4oWj5Ejw/4a0bf778f484bbddebfac4099d8e21f4/image2-17.png" />
            
            </figure><p>At any given moment, we can think of the budget as being divided into four parts:</p><ul><li><p><b>Recv-q</b>: part of the buffer budget occupied by actual application bytes awaiting <code>read()</code>.</p></li><li><p>Another part is consumed by metadata handling - the cost of <b>struct sk_buff</b> and such.</p></li><li><p>Those two parts together are reported by <code>ss</code> as <b>skmem_r</b> - the kernel name is <b>sk_rmem_alloc</b>.</p></li><li><p>What remains is "free", that is: it's not actively used yet.</p></li><li><p>However, a portion of this "free" region is the advertised window - it may become occupied with application data soon.</p></li><li><p>The remainder will be used for future metadata handling, or might be divided into the advertised window further in the future.</p></li></ul><p>The upper limit for the window is configured by the <code>tcp_adv_win_scale</code> setting. By default, the window is set to at most 50% of the "free" space. The value can be clamped further by the TCP_WINDOW_CLAMP option or an internal <code>rcv_ssthresh</code> variable.</p>
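The 50% rule can be expressed in a few lines of Python, mirroring the classic tcp_win_from_space() helper from the kernel (a simplified sketch; recent kernels use a finer-grained scaling):

```python
def win_from_space(space: int, tcp_adv_win_scale: int = 1) -> int:
    """Upper bound of the advertised window given the "free" buffer space.
    Zero or negative scale means a plain right shift; a positive scale
    reserves a 1/2^scale fraction of the space for metadata."""
    if tcp_adv_win_scale <= 0:
        return space >> (-tcp_adv_win_scale)
    return space - (space >> tcp_adv_win_scale)

print(win_from_space(65536, 1))  # → 32768, the default 50%
print(win_from_space(65536, 2))  # → 49152, i.e. 75%
print(win_from_space(65536, 0))  # → 65536, i.e. 100%
```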
    <div>
      <h3>How much data can a server receive?</h3>
      <a href="#how-much-data-can-a-server-receive">
        
      </a>
    </div>
    <p>Our first question was "How much data can a server receive?". A naive reader might think it's simple: if the server has a receive buffer set to say 64KiB, then the client will surely be able to deliver 64KiB of data!</p><p>But this is totally not how it works. To illustrate this, allow me to temporarily set sysctl <code>tcp_adv_win_scale=0</code>. This is not a default and, as we'll learn, it's the wrong thing to do. With this setting the server will indeed set 100% of the receive buffer as an advertised window.</p><p>Here's our setup:</p><ul><li><p>The client tries to send as fast as possible.</p></li><li><p>Since we are interested in the receiving side, we can cheat a bit and speed up the sender arbitrarily. The client has transmission congestion control disabled: we set initcwnd=10000 as the route option.</p></li><li><p>The server has a fixed <b>skmem_rb</b> set at 64KiB.</p></li><li><p>The server has <code><b>tcp_adv_win_scale=0</b></code>.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/44j6HUJ496dIXVMkltUe4O/1765b3f25ef767dfcb23d3c079f7e8cb/image6-10.png" />
            
            </figure><p>There are so many things here! Let's try to digest it. First, the X axis is the ingress packet number (we saw about 65). The Y axis shows the buffer sizes as seen on the receive path for every packet.</p><ul><li><p>First, the purple line is the buffer size limit in bytes - <b>skmem_rb</b>. In our experiment we called <code>setsockopt(SO_RCVBUF)=32K</code> and skmem_rb is double that value. Notice that by setting SO_RCVBUF we disabled the Linux autotune mechanism.</p></li><li><p>The green <b>recv-q</b> line shows how many application bytes are available in the receive socket. It grows linearly with each received packet.</p></li><li><p>Then there is the blue <b>skmem_r</b>, the used data + metadata cost in the receive socket. It grows just like <b>recv-q</b> but a bit faster, since it accounts for the cost of the metadata the kernel needs to deal with.</p></li><li><p>The orange <b>rcv_win</b> is the advertised window. We start with 64KiB (100% of skmem_rb) and it goes down as the data arrives.</p></li><li><p>Finally, the dotted line shows <b>rcv_ssthresh</b>, which is not important yet; we'll get there.</p></li></ul>
    <div>
      <h3>Running over the budget is bad</h3>
      <a href="#running-over-the-budget-is-bad">
        
      </a>
    </div>
    <p>It's super important to notice that we finished with <b>skmem_r</b> higher than <b>skmem_rb</b>! This is rather unexpected, and undesired. The whole point of the <b>skmem_rb</b> memory budget is, well, not to exceed it. Here's how <code>ss</code> shows it:</p>
            <pre><code>$ ss -m
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port   
tcp    ESTAB  62464   0       127.0.0.3:1234      127.0.0.2:1235
     skmem:(r73984,rb65536,...)</code></pre>
            <p>As you can see, skmem_rb is 65536 and skmem_r is 73984, which is 8448 bytes over! When this happens we have an even bigger issue on our hands. At around the 62nd packet we have an advertised window of 3072 bytes, but while packets are being sent, the receiver is unable to process them! This is easily verifiable by inspecting an nstat TcpExtTCPRcvQDrop counter:</p>
            <pre><code>$ nstat -az TcpExtTCPRcvQDrop
TcpExtTCPRcvQDrop    13    0.0</code></pre>
            <p>In our run 13 packets were dropped. This counter tracks the number of packets dropped due to either system-wide or per-socket memory pressure - we know we hit the latter. In our case, soon after the socket memory limit was crossed, new packets were prevented from being enqueued to the socket. This happened even though the TCP advertised window was still open.</p><p>This results in an interesting situation. The receiver's window is open, which might indicate it has resources to handle the data. But that's not always the case, as in our example when it ran out of the memory budget.</p><p>The sender will interpret this as packet loss caused by network congestion and will run the usual retry mechanisms, including exponential backoff. Whether this behavior is desirable depends on how you look at it. On one hand, no data will be lost: the sender can eventually deliver all the bytes reliably. On the other hand, the exponential backoff logic might stall the sender for a long time, causing a noticeable delay.</p><p>The root of the problem is straightforward - the Linux kernel's <b>skmem_rb</b> sets a memory budget for both the <b>data</b> and the <b>metadata</b> which reside on the socket. In a pessimistic case each packet might incur the cost of a <b>struct sk_buff</b> + <b>struct skb_shared_info</b>, which on my system is 576 bytes on top of the actual payload size, plus memory waste due to network card buffer alignment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7nJyE7p1rtHK9SvSTDnZoj/c02019aeed1e3b17f24b506b4eeaef36/image7-10.png" />
            
            </figure><p>We now understand that Linux can't just advertise 100% of the memory budget as an advertised window. Some budget must be reserved for metadata and such. The upper limit of window size is expressed as a fraction of the "free" socket budget. It is controlled by <code>tcp_adv_win_scale</code>, with the following values:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ZfsgDbgLiLQ0HXUV5mVeK/31e596946e101fef2443896f9db8fcdb/image9-5.png" />
            
            </figure><p>By default, Linux sets the advertised window at most at 50% of the remaining buffer space.</p><p>Even with 50% of space "reserved" for metadata, the kernel is very smart and tries hard to reduce the metadata memory footprint. It has two mechanisms for this:</p><ul><li><p><b>TCP Coalesce</b> - on the happy path, Linux is able to throw away <b>struct sk_buff</b>. It can do so, by just linking the data to the previously enqueued packet. You can think about it as if it was <a href="https://www.spinics.net/lists/netdev/msg755359.html">extending the last packet on the socket</a>.</p></li><li><p><b>TCP Collapse</b> - when the memory budget is hit, Linux runs "collapse" code. Collapse rewrites and defragments the receive buffer from many small skb's into a few very long segments - therefore reducing the metadata cost.</p></li></ul><p>Here's an extension to our previous chart showing these mechanisms in action:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KhHEeAUvJ6rinNLoRBwd1/36b733fddcbb885d8db5b076602ca168/image3-10.png" />
            
            </figure><p><b>TCP Coalesce</b> is a very effective measure and works behind the scenes at all times. In the bottom chart, the packets where the coalesce was engaged are shown with a pink line. You can see - the <b>skmem_r</b> bumps (blue line) are clearly correlated with a <b>lack</b> of coalesce (pink line)! The nstat TcpExtTCPRcvCoalesce counter might be helpful in debugging coalesce issues.</p><p>The <b>TCP Collapse</b> is a bigger gun. <a href="/optimizing-tcp-for-high-throughput-and-low-latency/">Mike wrote about it extensively</a>, and <a href="/the-story-of-one-latency-spike/">I wrote a blog post years ago, when the latency of TCP collapse hit us hard</a>. In the chart above, the collapse is shown as a red circle. We clearly see it being engaged after the socket memory budget is reached - from packet number 63. The nstat TcpExtTCPRcvCollapsed counter is relevant here. This value growing is a bad sign and might indicate bad latency spikes - especially when dealing with larger buffers. Normally collapse is supposed to be run very sporadically. A <a href="https://lore.kernel.org/lkml/20120510173135.615265392@linuxfoundation.org/">prominent kernel developer describes</a> this pessimistic situation:</p><blockquote><p>This also means tcp advertises a too optimistic window for a given allocated rcvspace: When receiving frames, <code>sk_rmem_alloc</code> can hit <code>sk_rcvbuf</code> limit and we call <code>tcp_collapse()</code> too often, especially when application is slow to drain its receive queue [...] This is a major latency source.</p></blockquote><p>If the memory budget remains exhausted after the collapse, Linux will drop ingress packets. In our chart it's marked as a red "X". The nstat TcpExtTCPRcvQDrop counter shows the count of dropped packets.</p>
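To get a feel for how quickly metadata eats the budget, here is a back-of-the-envelope calculation using the 576-byte per-packet overhead measured above. It deliberately ignores coalescing and NIC buffer alignment:

```python
SKB_OVERHEAD = 576  # struct sk_buff + struct skb_shared_info, per the post

def packets_in_budget(skmem_rb: int, payload: int) -> int:
    """How many uncoalesced packets of a given payload fit in the budget."""
    return skmem_rb // (payload + SKB_OVERHEAD)

print(65536 // 1448)                   # → 45 full-MSS packets, counting payload only
print(packets_in_budget(65536, 1448))  # → 32 once the sk_buff overhead is included
print(packets_in_budget(65536, 100))   # → 96: tiny packets are dominated by metadata
```

With 100-byte payloads, only about 9.4KiB of the 64KiB budget would hold application data - exactly the mispredicted-cost scenario that collapse and rcv_ssthresh guard against.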
    <div>
      <h3>rcv_ssthresh predicts the metadata cost</h3>
      <a href="#rcv_ssthresh-predicts-the-metadata-cost">
        
      </a>
    </div>
    <p>Perhaps counter-intuitively, the memory cost of a packet can be much larger than the amount of actual application data contained in it. It depends on a number of things:</p><ul><li><p><b>Network card</b>: some network cards always allocate a full page (4096, or even 16KiB) per packet, no matter how small or large the payload.</p></li><li><p><b>Payload size</b>: shorter packets will have a worse metadata-to-content ratio, since <b>struct skb</b> will be comparatively larger.</p></li><li><p>Whether XDP is being used.</p></li><li><p>L2 header size: things like Ethernet, VLAN tags, and tunneling can add up.</p></li><li><p>Cache line size: many kernel structs are cache line aligned. On systems with larger cache lines, they will use more memory (see P4 or S390X architectures).</p></li></ul><p>The first two factors are the most important. Here's a run where the sender was specially configured to make the metadata cost bad and the coalesce ineffective (the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-07-rmem-a/window.py#L90">details of the setup are messy</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Oo38G0pRDoxfqkcIE7D9Y/c372a9cba2402cee11c14fc815875ea3/image1-10.png" />
            
            </figure><p>You can see the kernel hitting TCP collapse multiple times, which is totally undesired. Each time a collapse runs, the kernel is likely to rewrite the full receive buffer. This whole kernel machinery, from reserving some space for metadata with tcp_adv_win_scale, through using coalesce to reduce the memory cost of each packet, up to the rcv_ssthresh limit, exists to avoid hitting collapse too often.</p><p>The kernel machinery most often works fine, and TCP collapse is rare in practice. However, we noticed that's not the case for certain types of traffic. One example is <a href="https://lore.kernel.org/lkml/CA+wXwBSGsBjovTqvoPQEe012yEF2eYbnC5_0W==EAvWH1zbOAg@mail.gmail.com/">websocket traffic with loads of tiny packets</a> and a slow reader. One <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_input.c#L452">kernel comment talks about</a> such a case:</p>
            <pre><code>* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spaghetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.</code></pre>
            <p>Notice that the <b>rcv_ssthresh</b> line dropped down on the TCP collapse. This variable is an internal limit to the advertised window. By dropping it the kernel effectively says: hold on, I mispredicted the packet cost; next time I'm given an opportunity I'm going to open a smaller window. The kernel will advertise a smaller window and be more careful - all of this dance is done to avoid collapse.</p>
    <div>
      <h3>Normal run - continuously updated window</h3>
      <a href="#normal-run-continuously-updated-window">
        
      </a>
    </div>
    <p>Finally, here's a chart from a normal run of a connection. Here, we use the default <code>tcp_adv_win_scale=1</code> (50%):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZlSO1vQxnHim1D8dav4Aa/5ce538b22b546d194df83130d9f39bc9/image5-13.png" />
            
            </figure><p>Early in the connection you can see <b>rcv_win</b> being continuously updated with each received packet. This makes sense: while the <b>rcv_ssthresh</b> and <b>tcp_adv_win_scale</b> restrict the advertised window to never exceed 32KiB, the window is sliding nicely as long as there is enough space. At packet 18 the receiver stops updating the window and waits a bit. At packet 32 the receiver decides there still is some space and updates the window again, and so on. At the end of the flow the socket has 56KiB of data. This 56KiB of data was received over a sliding window reaching at most 32KiB.</p><p>The saw blade pattern of rcv_win is caused by delayed ACKs (which can be disabled with QUICKACK). You can see the "<b>acked</b>" bytes in the red dashed line. Since the ACKs might be delayed, the receiver waits a bit before updating the window. If you want a smooth line, you can use the <code>quickack 1</code> per-route parameter, but this is not recommended since it will result in many small ACK packets flying over the wire.</p><p>In a normal connection we expect the majority of packets to be coalesced and the collapse/drop code paths never to be hit.</p>
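For completeness, the per-route parameter mentioned above can be set like this (shown for the loopback route used in these experiments; requires root, and again not recommended in production):

```shell
# Force immediate ACKs (disable delayed ACK) for connections on this route.
ip route change local 127.0.0.0/8 dev lo quickack 1
```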
    <div>
      <h3>Large receive windows - rcv_ssthresh</h3>
      <a href="#large-receive-windows-rcv_ssthresh">
        
      </a>
    </div>
    <p>For large bandwidth transfers over big latency links - big BDP case - it's beneficial to have a very wide advertised window. However, Linux takes a while to fully open large receive windows:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6UL4j5NH62FE1350yWnTcP/e257d2160e41a3f71aa3b727debc44fc/image8-4.png" />
            
            </figure><p>In this run, the <b>skmem_rb</b> is set to 2MiB. As opposed to the previous runs, the buffer budget is large and the receive window doesn't start with 50% of the skmem_rb! Instead it starts from 64KiB and grows linearly. It takes a while for Linux to ramp up the receive window to full size - ~800KiB in this case. The window is clamped by <b>rcv_ssthresh</b>. This variable starts at 64KiB and then grows by two full-MSS packets for each packet which has a "good" ratio of total size (truesize) to payload size.</p><p><a href="https://lore.kernel.org/lkml/CANn89i+mhqGaM2tuhgEmEPbbNu_59GGMhBMha4jnnzFE=UBNYg@mail.gmail.com/">Eric Dumazet writes</a> about this behavior:</p><blockquote><p>Stack is conservative about RWIN increase, it wants to receive packets to have an idea of the skb-&gt;len/skb-&gt;truesize ratio to convert a memory budget to RWIN. Some drivers have to allocate 16K buffers (or even 32K buffers) just to hold one segment (of less than 1500 bytes of payload), while others are able to pack memory more efficiently.</p></blockquote><p>This behavior of slow window opening is fixed, and not configurable in the vanilla kernel. <a href="https://lore.kernel.org/netdev/20220721151041.1215017-1-marek@cloudflare.com/#r">We prepared a kernel patch that allows starting up with a higher rcv_ssthresh</a> based on the per-route option <code>initrwnd</code>:</p>
            <pre><code>$ ip route change local 127.0.0.0/8 dev lo initrwnd 1000</code></pre>
            <p>With the patch and the route change deployed, this is how the buffers look:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4YE7Oolhn4ZQ9ihNi11HEL/af8c492bc03243e12b54541e954a3061/image4-12.png" />
            
            </figure><p>The advertised window is limited to 64KiB during the TCP handshake, but with our kernel patch enabled it's quickly bumped up to 1MiB in the first ACK packet afterwards. In both runs it took ~1800 packets to fill the receive buffer, but the time needed differed. In the first run the sender could push only 64KiB onto the wire in the second RTT. In the second run it could immediately push a full 1MiB of data.</p><p>This trick of aggressive window opening is not really necessary for most users. It's only helpful when:</p><ul><li><p>You have high-bandwidth TCP transfers over big-latency links.</p></li><li><p>The metadata + buffer alignment cost of your NIC is sensible and predictable.</p></li><li><p>Immediately after the flow starts your application is ready to send a lot of data.</p></li><li><p>The sender has configured a large <code>initcwnd</code>.</p></li><li><p>You care about shaving off every possible RTT.</p></li></ul>
    <p>On our systems we do have such flows, but arguably it might not be a common scenario. In the real world most of your TCP connections go to the nearest CDN point of presence, which is very close.</p>
    <div>
      <h3>Getting it all together</h3>
      <a href="#getting-it-all-together">
        
      </a>
    </div>
    <p>In this blog post, we discussed a seemingly simple case of a TCP sender filling up the receive socket. We tried to address two questions: with our isolated setup, how much data can be sent, and how quickly?</p><p>With the default settings of net.ipv4.tcp_rmem, Linux initially sets a memory budget of 128KiB for the receive data and metadata. On my system, given full-sized packets, it's able to eventually accept around 113KiB of application data.</p><p>Then, we showed that the receive window is not fully opened immediately. Linux keeps the receive window small, as it tries to predict the metadata cost and avoid overshooting the memory budget, and therefore hitting TCP collapse. By default, with net.ipv4.tcp_adv_win_scale=1, the upper limit for the advertised window is 50% of "free" memory. rcv_ssthresh starts at 64KiB and grows linearly up to that limit.</p><p>On my system it took five window updates - six RTTs in total - to fill the 128KiB receive buffer. In the first batch the sender sent ~64KiB of data (remember we hacked the <code>initcwnd</code> limit), and then the sender topped it up with smaller and smaller batches until the receive window fully closed.</p><p>I hope this blog post is helpful and explains well the relationship between the buffer size and advertised window on Linux. Also, it describes the often misunderstood rcv_ssthresh, which limits the advertised window in order to manage the memory budget and predict the unpredictable cost of metadata.</p><p>In case you wonder, similar mechanisms are in play in QUIC. The QUIC/H3 libraries though are still pretty young and don't have so many complex and mysterious toggles... yet.</p><p>As always, <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-07-rmem-a">the code and instructions on how to reproduce the charts are available at our GitHub</a>.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">ROvfvY7ClXiGsjf1moUld</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to stop running out of ephemeral ports and start to love long-lived connections]]></title>
            <link>https://blog.cloudflare.com/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/</link>
            <pubDate>Wed, 02 Feb 2022 09:53:28 GMT</pubDate>
            <description><![CDATA[ Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way ]]></description>
            <content:encoded><![CDATA[ <p>Often programmers have assumptions that turn out, to their surprise, to be invalid. From my experience this happens a lot. Every API, technology or system can be abused beyond its limits and break in a miserable way.</p><p>It's particularly interesting when basic things used everywhere fail. Recently we've reached such a breaking point in a ubiquitous part of Linux networking: establishing a network connection using the <code>connect()</code> system call.</p><p>Since we are not doing anything special, just establishing TCP and UDP connections, how could anything go wrong? Here's one example: we noticed alerts from a misbehaving server, logged in to check it out and saw:</p>
            <pre><code>marek@:~# ssh 127.0.0.1
ssh: connect to host 127.0.0.1 port 22: Cannot assign requested address</code></pre>
            <p>You can imagine the face of my colleague who saw that. SSH to localhost refuses to work, while she was already using SSH to connect to that server! On another occasion:</p>
            <pre><code>marek@:~# dig cloudflare.com @1.1.1.1
dig: isc_socket_bind: address in use</code></pre>
            <p>This time a basic DNS query failed with a weird networking error. Failing DNS is a bad sign!</p><p>In both cases the problem was Linux running out of ephemeral ports. When this happens it's unable to establish any outgoing connections. This is a pretty serious failure. It's usually transient, and if you don't know what to look for it can be hard to debug.</p><p>The root cause lies deeper though. We can often ignore limits on the number of outgoing connections. But we encountered cases where we hit limits on the number of concurrent outgoing connections during normal operation.</p><p>In this blog post I'll explain why we had these issues, how we worked around them, and present userspace code implementing an improved variant of the <code>connect()</code> syscall.</p>
    <div>
      <h3>Outgoing connections on Linux part 1 - TCP</h3>
      <a href="#outgoing-connections-on-linux-part-1-tcp">
        
      </a>
    </div>
    <p>Let's start with a bit of historical background.</p>
    <div>
      <h3>Long-lived connections</h3>
      <a href="#long-lived-connections">
        
      </a>
    </div>
    <p>Back in 2014 Cloudflare announced support for WebSockets. We wrote two articles about it:</p><ul><li><p><a href="/cloudflare-now-supports-websockets/">Cloudflare Now Supports WebSockets</a></p></li><li><p><a href="https://idea.popcount.org/2014-04-03-bind-before-connect/">Bind before connect</a></p></li></ul><p>If you skim these blogs, you'll notice we were totally fine with the WebSocket protocol, framing and operation. What worried us was our capacity to handle large numbers of concurrent outgoing connections towards the origin servers. Since WebSockets are long-lived, allowing them through our servers might greatly increase the concurrent connection count. And this did turn out to be a problem. It was possible to hit a ceiling on the total number of outgoing connections imposed by the Linux networking stack.</p><p>In the pessimistic case, each Linux connection consumes a local port (ephemeral port), and therefore the total connection count is limited by the size of the ephemeral port range.</p>
    <div>
      <h3>Basics - how port allocation works</h3>
      <a href="#basics-how-port-allocation-works">
        
      </a>
    </div>
    <p>When establishing an outbound connection a typical user needs the destination address and port. For example, DNS might resolve <code>cloudflare.com</code> to the '104.1.1.229' IPv4 address. A simple Python program can establish a connection to it with the following code:</p>
            <pre><code>cd = socket.socket(AF_INET, SOCK_STREAM)
cd.connect(('104.1.1.229', 80))</code></pre>
            <p>The operating system’s job is to figure out how to reach that destination, selecting an appropriate source address and source port to form the full 4-tuple for the connection:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zDTSzTRPl4JRrdWfjbzkP/63e0de7a453377f267b41ee0fa394a33/image4-1.png" />
            
            </figure><p>The operating system chooses the source IP based on the routing configuration. On Linux we can see which source IP will be chosen with <code>ip route get</code>:</p>
            <pre><code>$ ip route get 104.1.1.229
104.1.1.229 via 192.168.1.1 dev eth0 src 192.168.1.8 uid 1000
	cache</code></pre>
            <p>The <code>src</code> parameter in the result shows the discovered source IP address that should be used when going towards that specific target.</p><p>The source port, on the other hand, is chosen from the local port range configured for outgoing connections, also known as the ephemeral port range. On Linux this is controlled by the following sysctls:</p>
            <pre><code>$ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
net.ipv4.ip_local_port_range = 32768    60999
net.ipv4.ip_local_reserved_ports =</code></pre>
            <p>The <code>ip_local_port_range</code> sets the low and high (inclusive) port range to be used for outgoing connections. The <code>ip_local_reserved_ports</code> is used to skip specific ports if the operator needs to reserve them for services.</p>
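            <p>These sysctls are plain files under /proc, so a program can read the effective range directly. Here's a small sketch (the paths are the standard procfs locations; the parsing handles only the simple comma-separated form of the reserved list, not ranges like "9000-9100"):</p>
            <pre><code>def ephemeral_ports(range_path='/proc/sys/net/ipv4/ip_local_port_range',
                    reserved_path='/proc/sys/net/ipv4/ip_local_reserved_ports'):
    """Return the list of local ports usable for outgoing connections."""
    with open(range_path) as f:
        lo, hi = map(int, f.read().split())
    with open(reserved_path) as f:
        raw = f.read().strip()
    # reserved ports are excluded from the usable pool
    reserved = {int(p) for p in raw.split(',') if p}
    return [p for p in range(lo, hi + 1) if p not in reserved]</code></pre>
            <p>With the default range and an empty reserved list, this returns 28,232 ports.</p>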
    <div>
      <h3>Vanilla TCP is a happy case</h3>
      <a href="#vanilla-tcp-is-a-happy-case">
        
      </a>
    </div>
    <p>The default ephemeral port range contains more than 28,000 ports (60999+1-32768=28232). Does that mean we can have at most 28,000 outgoing connections? That’s the core question of this blog post!</p><p>In TCP the connection is identified by a full 4-tuple, for example:</p>
<table>
<thead>
  <tr>
    <td><span>full 4-tuple</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>104.1.1.229</span></td>
    <td><span>80</span></td>
  </tr>
</thead>
</table><p>In principle, it is possible to reuse the source IP and port, and share them against another destination. For example, there could be two simultaneous outgoing connections with these 4-tuples:</p>
<table>
<thead>
  <tr>
    <th><span>full 4-tuple #A</span></th>
    <th><span>192.168.1.8</span></th>
    <th><span>32768</span></th>
    <th><span>104.1.1.229</span></th>
    <th><span>80</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>full 4-tuple #B</span></td>
    <td><span>192.168.1.8</span></td>
    <td><span>32768</span></td>
    <td><span>151.101.1.57</span></td>
    <td><span>80</span></td>
  </tr>
</tbody>
</table><p>This "source two-tuple" sharing can happen in practice when establishing connections using the vanilla TCP code:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.connect( (remote_ip, remote_port) )</code></pre>
            <p>But slightly different code can prevent this sharing, as we’ll discuss.</p><p>In the rest of this blog post, we’ll summarise the behaviour of code fragments that make outgoing connections, showing:</p><ul><li><p>The technique’s description</p></li><li><p>The typical <code>errno</code> value in the case of port exhaustion</p></li><li><p>And whether the kernel is able to reuse the {source IP, source port}-tuple against another destination</p></li></ul><p>The last column is the most important, since it shows whether there is a low limit on the total number of concurrent connections. As we're going to see later, the limit is present more often than we'd expect.</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>In the case of generic TCP, things work as intended. Towards a single destination it's possible to have as many connections as the ephemeral range allows. When the range is exhausted (against a single destination), we'll see the EADDRNOTAVAIL error. The system is also able to correctly reuse the local two-tuple {source IP, source port} for ESTABLISHED sockets against other destinations. This is expected and desired.</p>
    <div>
      <h3>Manually selecting source IP address</h3>
      <a href="#manually-selecting-source-ip-address">
        
      </a>
    </div>
    <p>Let's go back to the Cloudflare server setup. Cloudflare operates many services, to name just two: CDN (caching HTTP reverse proxy) and <a href="/1111-warp-better-vpn">WARP</a>.</p><p>For Cloudflare, it’s important that we don’t mix traffic types among our outgoing IPs. Origin servers on the Internet might want to differentiate traffic based on our product. The simplest example is <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a>: it's appropriate for an origin server to firewall off non-CDN inbound connections. Allowing Cloudflare cache pulls is totally fine, but allowing WARP connections which contain untrusted user traffic might lead to problems.</p><p>To achieve such outgoing IP separation, each of our applications must be explicit about which source IPs to use. They can’t leave it up to the operating system; the automatically-chosen source could be wrong. While it's technically possible to configure routing policy rules in Linux to express such requirements, we decided not to do that and keep Linux routing configuration as simple as possible.</p><p>Instead, before calling <code>connect()</code>, our applications select the source IP with the <code>bind()</code> syscall. A trick we call "bind-before-connect":</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRINUSE</span></td>
    <td><span>no </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>This code looks rather innocent, but it hides a considerable drawback. When calling <code>bind()</code>, the kernel attempts to find an unused local two-tuple. Due to BSD API shortcomings, the operating system can't know what we plan to do with the socket. It's totally possible we want to <code>listen()</code> on it, in which case sharing the source IP/port with a connected socket will be a disaster! That's why the source two-tuple selected when calling <code>bind()</code> must be unique.</p><p>Due to this API limitation, in this technique the source two-tuple can't be reused. Each connection effectively "locks" a source port, so the number of connections is constrained by the size of the ephemeral port range. Notice: one source port is used up for each connection, no matter how many destinations we have. This is bad, and is exactly the problem we were dealing with back in 2014 in the WebSockets articles mentioned above.</p><p>Fortunately, it's fixable.</p>
    <div>
      <h3>IP_BIND_ADDRESS_NO_PORT</h3>
      <a href="#ip_bind_address_no_port">
        
      </a>
    </div>
    <p>Back in 2014 we fixed the problem by setting the SO_REUSEADDR socket option and manually retrying <code>bind()</code>+ <code>connect()</code> a couple of times on error. This worked ok, but later in 2015 <a href="https://kernelnewbies.org/Linux_4.2#Networking">Linux introduced a proper fix: the IP_BIND_ADDRESS_NO_PORT socket option</a>. This option tells the kernel to delay reserving the source port:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
sd.bind( (src_IP, 0) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>IP_BIND_ADDRESS_NO_PORT<br />bind(src_IP, 0)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>This gets us back to the desired behavior. On modern Linux, when doing bind-before-connect for TCP, you should set IP_BIND_ADDRESS_NO_PORT.</p>
    <div>
      <h3>Explicitly selecting a source port</h3>
      <a href="#explicitly-selecting-a-source-port">
        
      </a>
    </div>
    <p>Sometimes an application needs to select a specific source port. For example: the operator wants to control the full 4-tuple in order to debug ECMP routing issues.</p><p>Recently a colleague wanted to run a cURL command for debugging, and he needed the source port to be fixed. cURL provides the <code>--local-port</code> option to do this¹:</p>
            <pre><code>$ curl --local-port 9999 -4svo /dev/null https://cloudflare.com/cdn-cgi/trace
*   Trying 104.1.1.229:443...</code></pre>
            <p>In other situations source port numbers should be controlled, as they can be used as an input to a routing mechanism.</p><p>But setting the source port manually is not easy. We're back to square one in our hackery since IP_BIND_ADDRESS_NO_PORT is not an appropriate tool when calling <code>bind()</code> with a specific source port value. To get the scheme working again and be able to share source 2-tuple, we need to turn to SO_REUSEADDR:</p>
            <pre><code>sd = socket.socket(SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( (src_IP, src_port) )
sd.connect( (dst_IP, dst_port) )</code></pre>
            <p>Our summary table:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR<br />bind(src_IP, src_port)</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EADDRNOTAVAIL</span></td>
    <td><span>yes (good!)</span></td>
  </tr>
</tbody>
</table><p>Here, the user takes responsibility for handling conflicts when an ESTABLISHED socket sharing the 4-tuple already exists. In such a case <code>connect()</code> will fail with EADDRNOTAVAIL and the application should retry with another acceptable source port number.</p>
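            <p>That retry loop can be sketched as follows. This is a hypothetical helper, not code from our repository; <code>src_ports</code> is whatever iterable of acceptable port numbers the operator allows:</p>
            <pre><code>import errno
import socket

def connect_from(src_ip, src_ports, dst):
    """Try explicit source ports until a free 4-tuple is found."""
    for port in src_ports:
        sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sd.bind((src_ip, port))
            sd.connect(dst)
            return sd
        except OSError as e:
            sd.close()
            # port taken, or this exact 4-tuple already exists: try the next one
            if e.errno not in (errno.EADDRNOTAVAIL, errno.EADDRINUSE):
                raise
    raise OSError(errno.EADDRNOTAVAIL, 'no acceptable source port available')</code></pre>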
    <div>
      <h3>Userspace connectx implementation</h3>
      <a href="#userspace-connectx-implementation">
        
      </a>
    </div>
    <p>With these tricks, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L93-L110">we can implement a common function and call it <code>connectx</code></a>. It will do what <code>bind()</code>+<code>connect()</code> should, but won't have the unfortunate ephemeral port range limitation. In other words, created sockets are able to share local two-tuples as long as they are going to distinct destinations:</p>
            <pre><code>def connectx((source_IP, source_port), (destination_IP, destination_port)):</code></pre>
            <p>We have three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>technique</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>vanilla connect()</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>IP_BIND_ADDRESS_NO_PORT</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>SO_REUSEADDR</span></td>
  </tr>
</tbody>
</table><p>The name we chose isn't an accident. macOS (specifically the underlying Darwin OS) has exactly that function implemented <a href="https://www.manpagez.com/man/2/connectx">as a <code>connectx()</code> system call</a> (<a href="https://github.com/apple/darwin-xnu/blob/a1babec6b135d1f35b2590a1990af3c5c5393479/bsd/netinet/tcp_usrreq.c#L517">implementation</a>):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ixMd6STGDjs1IO4DhAaFQ/3cbca6a9ec28010fb15f4004e2450587/image2.png" />
            
            </figure><p>It's more powerful than our <code>connectx</code> code, since it supports TCP Fast Open.</p><p>Should we, Linux users, be envious? For TCP, it's possible to get the right kernel behaviour with the appropriate setsockopt/bind/connect dance, so a kernel syscall is not quite needed.</p><p>But for UDP things turn out to be much more complicated and a dedicated syscall might be a good idea.</p>
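            <p>Putting the TCP pieces together, the whole "dance" collapses into one dispatch. This is only a sketch of the approach described above, not the linked implementation; on Linux the IP_BIND_ADDRESS_NO_PORT constant has value 24 if the Python build doesn't expose it:</p>
            <pre><code>import socket

IP_BIND_ADDRESS_NO_PORT = getattr(socket, 'IP_BIND_ADDRESS_NO_PORT', 24)

def connectx_tcp(src_ip, src_port, dst):
    sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if src_ip is None:
        pass                          # vanilla connect(): kernel picks IP and port
    elif src_port is None:
        # source IP fixed, source port chosen late, at connect() time
        sd.setsockopt(socket.IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
        sd.bind((src_ip, 0))
    else:
        # full 4-tuple under caller control; conflicts surface at connect()
        sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sd.bind((src_ip, src_port))
    sd.connect(dst)
    return sd</code></pre>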
    <div>
      <h3>Outgoing connections on Linux - part 2 - UDP</h3>
      <a href="#outgoing-connections-on-linux-part-2-udp">
        
      </a>
    </div>
    <p>In the previous section we listed three use cases for outgoing connections that should be supported by the operating system:</p><ul><li><p>Vanilla egress: operating system chooses the outgoing IP and port</p></li><li><p>Source IP selection: user selects outgoing IP but the OS chooses port</p></li><li><p>Full 4-tuple: user selects full 4-tuple for the connection</p></li></ul><p>We demonstrated how to implement all three cases on Linux for TCP, without hitting connection count limits due to source port exhaustion.</p><p>It's time to extend our implementation to UDP. This is going to be harder.</p><p>For UDP, Linux maintains one hash table that is keyed on local IP and port, which can hold duplicate entries. Multiple UDP connected sockets can not only share a 2-tuple but also a 4-tuple! It's totally possible to have two distinct, connected sockets having exactly the same 4-tuple. This feature was created for multicast sockets. The implementation was then carried over to unicast connections, but it is confusing. With conflicting sockets on unicast addresses, only one of them will receive any traffic. A newer connected socket will "overshadow" the older one. It's surprisingly hard to detect such a situation. To get UDP <code>connectx()</code> right, we will need to work around this "overshadowing" problem.</p>
    <div>
      <h3>Vanilla UDP is limited</h3>
      <a href="#vanilla-udp-is-limited">
        
      </a>
    </div>
    <p>It might come as a surprise to many, but by default, the total count for outbound UDP connections is limited by the ephemeral port range size. Usually, with Linux you can't have more than ~28,000 connected UDP sockets, even if they point to multiple destinations.</p><p>Ok, let's start with the simplest and most common way of establishing outgoing UDP connections:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>no </span><span>(bad!)</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>The simplest case is not a happy one. The total number of concurrent outgoing UDP connections on Linux is limited by the ephemeral port range size. On our multi-tenant servers, with potentially long-lived gaming and H3/QUIC flows containing WebSockets, this is too limiting.</p><p>On TCP we were able to slap on a <code>setsockopt</code> and move on. No such easy workaround is available for UDP.</p><p>For UDP, without REUSEADDR, Linux avoids sharing local 2-tuples among UDP sockets. During <code>connect()</code> it tries to find a 2-tuple that is not used yet. As a side note: there is no fundamental reason that it looks for a unique 2-tuple as opposed to a unique 4-tuple during <code>connect()</code>. This suboptimal behavior might be fixable.</p>
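            <p>This is easy to observe on loopback: two UDP sockets connected to two different destinations still end up on different source ports (a local demonstration; the destination addresses are arbitrary and no packets are sent):</p>
            <pre><code>import socket

def connected_udp(dst):
    sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sd.connect(dst)   # connect() autobinds, picking a unique local 2-tuple
    return sd

a = connected_udp(('127.0.0.1', 5301))
b = connected_udp(('127.0.0.1', 5302))   # different destination...
# ...yet the source ports differ: no 2-tuple sharing without REUSEADDR
assert a.getsockname()[1] != b.getsockname()[1]</code></pre>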
    <div>
      <h3>SO_REUSEADDR is hard</h3>
      <a href="#so_reuseaddr-is-hard">
        
      </a>
    </div>
    <p>To allow local two-tuple reuse we need the SO_REUSEADDR socket option. Sadly, this would also allow established sockets to share a 4-tuple, with the newer socket overshadowing the older one.</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.connect( (dst_IP, dst_port) )</code></pre>
            
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>yes </span><span>(bad!)</span></td>
  </tr>
</tbody>
</table><p>In other words, we can't just set SO_REUSEADDR and move on, since we might hit a local 2-tuple that is already used in a connection against the same destination. We might already have an identical 4-tuple connected socket underneath. Most importantly, during such a conflict we won't be notified by any error. This is unacceptably bad.</p>
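            <p>The overshadowing is easy to reproduce on loopback. In this sketch two connected sockets end up with an identical 4-tuple without any error being reported, and a datagram addressed to that 4-tuple reaches only one of them:</p>
            <pre><code>import socket

def udp_socket(reuse=False):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    if reuse:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.settimeout(0.2)
    return s

peer = udp_socket()
peer.bind(('127.0.0.1', 0))      # stands in for the remote destination
dst = peer.getsockname()

older = udp_socket(reuse=True)
older.bind(('127.0.0.1', 0))
src = older.getsockname()
older.connect(dst)

newer = udp_socket(reuse=True)
newer.bind(src)                  # same source 2-tuple: no error
newer.connect(dst)               # identical 4-tuple: still no error!

peer.sendto(b'ping', src)
received = []
for name, s in (('older', older), ('newer', newer)):
    try:
        s.recv(16)
        received.append(name)
    except socket.timeout:
        pass
print(received)                  # exactly one of the two sockets gets the datagram</code></pre>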
    <div>
      <h3>Detecting socket conflicts with eBPF</h3>
      <a href="#detecting-socket-conflicts-with-ebpf">
        
      </a>
    </div>
    <p>We thought a good solution might be to write an eBPF program to detect such conflicts. The idea was to hook code into the <code>connect()</code> syscall. Linux cgroups allow the BPF_CGROUP_INET4_CONNECT hook. The eBPF program is called every time a process under a given cgroup runs the <code>connect()</code> syscall. This is pretty cool, and we thought it would allow us to verify if there is a 4-tuple conflict before moving the socket from the UNCONNECTED to the CONNECTED state.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2022-02-connectx/ebpf_connect4">Here is how to load and attach our eBPF program</a>:</p>
            <pre><code>bpftool prog load ebpf.o /sys/fs/bpf/prog_connect4  type cgroup/connect4
bpftool cgroup attach /sys/fs/cgroup/unified/user.slice connect4 pinned /sys/fs/bpf/prog_connect4</code></pre>
            <p>With this check in place, we greatly reduce the probability of overshadowing:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>INET4_CONNECT hook</span><br /><span>SO_REUSEADDR</span><br /><span>connect(dst_IP, dst_port)</span></td>
    <td><span>manual port discovery, EPERM on conflict</span></td>
    <td><span>yes</span></td>
    <td><span>yes, but small</span></td>
  </tr>
</tbody>
</table><p>However, this solution is limited. First, it doesn't work for sockets with an automatically assigned source IP or source port; it only works when a user manually creates a 4-tuple connection from userspace. Then there is a second issue: a typical race condition. We don't grab any lock, so it's technically possible a conflicting socket will be created on another CPU between our eBPF conflict check and the completion of the real <code>connect()</code> syscall machinery. In short, this lockless eBPF approach is better than nothing, but fundamentally racy.</p>
    <div>
      <h3>Socket traversal - SOCK_DIAG ss way</h3>
      <a href="#socket-traversal-sock_diag-ss-way">
        
      </a>
    </div>
    <p>There is another way to verify if a conflicting socket already exists: we can check for connected sockets in userspace. It's possible to do this quite effectively, without any privileges, using the SOCK_DIAG_BY_FAMILY feature of the <code>netlink</code> interface. This is the same technique the <code>ss</code> tool uses to print out the sockets available on the system.</p><p>The netlink code is not even all that complicated. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L23">Take a look at the code</a>. Inside the kernel, it goes <a href="https://elixir.bootlin.com/linux/latest/source/net/ipv4/udp_diag.c#L28">quickly into a fast <code>__udp_lookup()</code> routine</a>. This is great - we can avoid iterating over all sockets on the system.</p><p>With that function handy, we can draft our UDP code:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.bind( src_addr )
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError(...)
sd.connect( dst_addr )</code></pre>
            <p>This code has the same race condition issue as the eBPF connect hook described earlier. But it's a good starting point. We need some locking to avoid the race condition. Perhaps it's possible to do it in userspace.</p>
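            <p>For the curious, the netlink request itself is just two packed structs: an nlmsghdr followed by an inet_diag_req_v2. Here's a sketch of building such a message for an IPv4 UDP 4-tuple (constant values copied from the kernel uapi headers; a full client would also send this over an AF_NETLINK socket and parse the reply):</p>
            <pre><code>import socket
import struct

# values from linux/netlink.h and linux/inet_diag.h
NLM_F_REQUEST = 0x1
SOCK_DIAG_BY_FAMILY = 20

def build_udp_diag_req(src, sport, dst, dport):
    """Pack nlmsghdr + inet_diag_req_v2 asking about one IPv4 UDP 4-tuple."""
    def addr16(ip):
        # IPv4 addresses occupy the first 4 of the 16 address bytes
        return socket.inet_pton(socket.AF_INET, ip).ljust(16, b'\0')
    # struct inet_diag_sockid: ports are big-endian, all-ones cookie means "any"
    sockid = struct.pack('!HH', sport, dport) + addr16(src) + addr16(dst)
    sockid += struct.pack('=IQ', 0, 0xFFFFFFFFFFFFFFFF)   # if_index, cookie
    # struct inet_diag_req_v2: family, protocol, ext, pad, states bitmap
    body = struct.pack('=BBBBI', socket.AF_INET, socket.IPPROTO_UDP,
                       0, 0, 0xFFFFFFFF) + sockid
    # struct nlmsghdr: total length, message type, flags, sequence, pid
    hdr = struct.pack('=LHHLL', 16 + len(body), SOCK_DIAG_BY_FAMILY,
                      NLM_F_REQUEST, 1, 0)
    return hdr + body</code></pre>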
    <div>
      <h3>SO_REUSEADDR as a lock</h3>
      <a href="#so_reuseaddr-as-a-lock">
        
      </a>
    </div>
    <p>Here comes a breakthrough: we can use SO_REUSEADDR as a locking mechanism. Consider this:</p>
            <pre><code>sd = socket.socket(SOCK_DGRAM)
cookie = sd.getsockopt(socket.SOL_SOCKET, SO_COOKIE, 8)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sd.bind( src_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 0)
c, _ = _netlink_udp_lookup(family, src_addr, dst_addr)
if c != cookie:
    raise OSError()
sd.connect( dst_addr )
sd.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)</code></pre>
            <p>The idea here is:</p><ul><li><p>We need REUSEADDR around bind, otherwise it wouldn't be possible to reuse a local port. It's technically possible to clear REUSEADDR after bind. Doing so leaves the kernel socket state somewhat inconsistent, but it doesn't hurt anything in practice.</p></li><li><p>By clearing REUSEADDR, we lock new sockets out of that source port. At this stage we can check if we have ownership of the 4-tuple we want. Even if multiple sockets enter this critical section, only one, the newest, can win this verification. This is a cooperative algorithm, so we assume all tenants try to behave.</p></li><li><p>At this point, if the verification succeeds, we can perform <code>connect()</code> and have a guarantee that the 4-tuple won't be reused by another socket at any point in the process.</p></li></ul><p>This is rather convoluted and hacky, but it satisfies our requirements:</p>
<table>
<thead>
  <tr>
    <th><span>technique description</span></th>
    <th><span>errno on port exhaustion</span></th>
    <th><span>possible src 2-tuple reuse</span></th>
    <th><span>risk of overshadowing</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>REUSEADDR as a lock</span></td>
    <td><span>EAGAIN</span></td>
    <td><span>yes</span></td>
    <td><span>no</span></td>
  </tr>
</tbody>
</table><p>Sadly, this scheme only works when we know the full 4-tuple, so we can't rely on the kernel's automatic source IP or port assignment.</p>
    <div>
      <h3>Faking source IP and port discovery</h3>
      <a href="#faking-source-ip-and-port-discovery">
        
      </a>
    </div>
    <p>When the user calls <code>connect()</code> and specifies only the target 2-tuple - destination IP and port - the kernel needs to fill in the missing bits: the source IP and source port. Unfortunately, the algorithm described above expects the full 4-tuple to be known in advance.</p><p>One solution is to implement source IP and port discovery in userspace. This turns out to be not that hard. For example, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L204">here's a snippet of our code</a>:</p>
            <pre><code>def _get_udp_port(family, src_addr, dst_addr):
    if ephemeral_lo == None:
        _read_ephemeral()
    lo, hi = ephemeral_lo, ephemeral_hi
    start = random.randint(lo, hi)
    ...</code></pre>
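            <p>The rest of the discovery is a classic wrap-around scan: start at a random offset so that concurrent callers don't pile onto the same port, then try each port in the range exactly once. A generator sketching that shape (the names here are ours, not the original code):</p>
            <pre><code>import random

def candidate_ports(lo, hi):
    """Yield every port in [lo, hi] once, starting at a random offset."""
    n = hi - lo + 1
    start = random.randint(0, n - 1)
    for i in range(n):
        yield lo + (start + i) % n</code></pre>
            <p>The caller then runs the REUSEADDR-as-a-lock dance on each candidate until one 4-tuple verifies cleanly.</p>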
            
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Combining manual source IP and port discovery with the REUSEADDR locking dance, we get a decent userspace implementation of <code>connectx()</code> for UDP.</p><p>We have covered all three use cases this API should support:</p>
<table>
<thead>
  <tr>
    <th><span>user specified</span></th>
    <th><span>comments</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>{_, _, dst_IP, dst_port}</span></td>
    <td><span>manual source IP and source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, _, dst_IP, dst_port}</span></td>
    <td><span>manual source port discovery</span></td>
  </tr>
  <tr>
    <td><span>{src_IP, src_port, dst_IP, dst_port}</span></td>
    <td><span>just our "REUSEADDR as lock" technique</span></td>
  </tr>
</tbody>
</table><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/connectx.py#L116-L166">Take a look at the full code</a>.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>This post described a problem we hit in production: running out of ephemeral ports. This was partially caused by our servers running numerous concurrent connections, but also because we used the Linux sockets API in a way that prevented source port reuse. It meant that we were limited to ~28,000 concurrent connections per protocol, which is not enough for us.</p><p>We explained how to allow source port reuse and avoid the ephemeral-port-range limit. We showed a userspace <code>connectx()</code> function, which is a better way of creating outgoing TCP and UDP connections on Linux.</p><p>Our UDP code is more complex: it is based on little-known low-level features, assumes cooperation between tenants, and relies on undocumented behaviour of the Linux operating system. Using REUSEADDR as a locking mechanism is rather unheard of.</p><p>The <code>connectx()</code> functionality is valuable, and should be added to Linux one way or another. It's not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.</p><p>___</p><p>¹ On a side note, on the second cURL run it fails due to TIME_WAIT sockets: "bind failed with errno 98: Address already in use".</p><p>One option is to wait for the TIME_WAIT socket to die, or work around this with the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2022-02-connectx/killtw.py">time-wait sockets kill script</a>. Killing time-wait sockets is generally a bad idea: it violates the protocol, is usually unnecessary, and sometimes doesn't work. But hey, in some extreme cases it's good to know what's possible. Just saying.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[UDP]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">319tj39kXPyzuiPbC755uC</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Everything you ever wanted to know about UDP sockets but were afraid to ask, part 1]]></title>
            <link>https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/</link>
            <pubDate>Thu, 25 Nov 2021 17:27:37 GMT</pubDate>
            <description><![CDATA[ Historically Cloudflare's core competency was operating an HTTP reverse proxy. We've spent significant effort optimizing traditional HTTP/1.1 and HTTP/2 servers running on top of TCP. ]]></description>
            <content:encoded><![CDATA[ <p>Snippet from an internal presentation about UDP inner workings in Spectrum. Who said UDP is simple!</p><p>Historically, Cloudflare's core competency was operating an <a href="https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/">HTTP reverse proxy</a>. We've spent significant effort optimizing traditional HTTP/1.1 and HTTP/2 servers running on top of TCP. Recently, though, we started operating large-scale stateful <a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a> services.</p><p>Stateful UDP is gaining popularity for a number of reasons:</p><p>— <a href="/quic-version-1-is-live-on-cloudflare/">QUIC</a> is a new transport protocol based on UDP; it powers HTTP/3, and we see its adoption accelerating.</p><p>— <a href="/1111-warp-better-vpn/">We operate WARP</a> — our WireGuard-based tunneling service — which uses UDP under the hood.</p><p>— We have a lot of generic UDP traffic going through <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">our Spectrum service</a>.</p><p>Although UDP is simple in principle, a lot of domain knowledge is needed to run things at scale. In this blog post we'll cover the basics: all you need to know about UDP servers to get started.</p>
    <div>
      <h3>Connected vs unconnected</h3>
      <a href="#connected-vs-unconnected">
        
      </a>
    </div>
    <p>How do you "accept" connections on a UDP server? If you are using unconnected sockets, you generally don't.</p><p>But let's start with the basics. UDP sockets can be "connected" (or "established") or "unconnected". Connected sockets have a full 4-tuple associated with them: {source IP, source port, destination IP, destination port}; unconnected sockets have only a 2-tuple: {bind IP, bind port}.</p><p>Traditionally, connected sockets were mostly used for outgoing flows, while unconnected ones handled inbound "server-side" traffic.</p>
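<p>A minimal Python sketch (loopback and a kernel-chosen port, both just for illustration) shows the difference between the two socket types:</p>

```python
import socket

# Unconnected socket: bind() gives it only a local 2-tuple.
unconn = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
unconn.bind(("127.0.0.1", 0))      # port 0: let the kernel pick one
print(unconn.getsockname())        # the {bind IP, bind port} 2-tuple
# unconn.getpeername() would raise OSError - there is no peer yet

# Connected socket: connect() fills in the full 4-tuple. For UDP this
# sends no packets; it only records the peer address in the kernel.
conn = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
conn.connect(unconn.getsockname())
print(conn.getsockname())          # source {IP, port}, chosen by the kernel
print(conn.getpeername())          # destination {IP, port}
```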
    <div>
      <h3>UDP client</h3>
      <a href="#udp-client">
        
      </a>
    </div>
    <p>As we'll learn today, these can be mixed. It is possible to use connected sockets for ingress handling, and unconnected ones for egress. To illustrate the latter, consider these two snippets. They do the same thing — send a packet to a DNS resolver. The first snippet uses a connected socket:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/nLimg1UkSX4TIIYTaxASJ/630c7f32c10868d6cccee7736d5c713c/image4-20.png" />
            
            </figure><p>The second uses an unconnected one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Iq69Z2Nrk00zNeTvJyrVU/6cb272253a0906191cb6aabf6b2e7b3d/image7-14.png" />
            
            </figure><p>Which one is better? In the second case, when receiving, the programmer should verify the source IP of the packet. Otherwise, the program can get confused by random inbound internet junk — like port scanning. It is tempting to reuse the socket descriptor and query another DNS server afterwards, but this would be a bad idea, particularly when dealing with DNS. For security, DNS assumes the client source port is unpredictable and short-lived.</p><p>Generally speaking, for outbound traffic it's preferable to use connected UDP sockets.</p><p>Connected sockets can also skip the route lookup on each packet thanks to a clever optimization — Linux can cache the route lookup result on <a href="https://elixir.bootlin.com/linux/v5.15.4/source/include/net/sock.h#L434">a connection struct</a>. Depending on the specifics of the setup, this might save some CPU cycles.</p><p>For completeness: it is possible to roll a new source port and reuse a socket descriptor with an obscure trick called "dissolving the socket association". It can be done with <code>connect(AF_UNSPEC)</code>, but this is rather advanced Linux magic.</p>
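<p>The two snippets above are shown as images; a rough Python equivalent of both approaches might look like this (the resolver address and payload below are placeholders, not a real DNS query):</p>

```python
import socket

RESOLVER = ("127.0.0.1", 53)   # stand-in for a real DNS resolver address
PACKET = b"\x00"               # placeholder payload, not a valid DNS query

# Connected socket: connect() pins the 4-tuple, then plain send()/recv().
cd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cd.connect(RESOLVER)
cd.send(PACKET)                # the kernel fills in the destination

# Unconnected socket: each sendto()/recvfrom() names the peer explicitly.
ud = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
ud.sendto(PACKET, RESOLVER)
# When receiving here, remember to verify the source of the datagram:
# data, addr = ud.recvfrom(512); assert addr == RESOLVER
```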
    <div>
      <h3>UDP server</h3>
      <a href="#udp-server">
        
      </a>
    </div>
    <p>Traditionally, on the server side UDP requires unconnected sockets. Using them takes a bit of finesse. To illustrate this, let's write a UDP echo server. In practice, you probably shouldn't write such a server, due to the risk of becoming a DoS reflection vector. <a href="/how-to-receive-a-million-packets/">Among other protections</a>, like rate limiting, UDP services should always respond with strictly less data than was sent in the initial packet. But let's not digress; a naive UDP echo server might look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/se9nCWduF9hcK8K1XwiWW/9c9625acfd065d3a66a2b98b99e15708/image6-14.png" />
            
            </figure><p>This code raises questions:</p><p>— Received packets can be longer than 2048 bytes. This can happen over loopback, when using jumbo frames, or with the help of IP fragmentation.</p><p>— It's entirely possible for a received packet to have an empty payload.</p><p>— What about inbound ICMP errors?</p><p>These problems are specific to UDP; they don't happen in the TCP world. TCP can transparently deal with MTU / fragmentation and ICMP errors. Depending on the specific protocol, a UDP service might need to be more complex and take extra care with such corner cases.</p>
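<p>The server in the figure can be sketched in Python like this (trimmed to one iteration and exercised over loopback; the 2048-byte buffer is the same arbitrary limit the questions above poke at):</p>

```python
import socket

def echo_once(sd):
    # Naive echo: reflect the payload back at the sender.
    data, peer = sd.recvfrom(2048)   # payloads over 2048 bytes get truncated!
    # data may legitimately be b"" - empty UDP payloads are valid.
    # A hardened service would also rate limit, handle ICMP errors, and
    # reply with strictly less data, to avoid being a reflection vector.
    sd.sendto(data, peer)

sd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sd.bind(("127.0.0.1", 0))            # kernel-chosen port, loopback only

# Exercise the loop body once:
cl = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cl.connect(sd.getsockname())
cl.send(b"hi")
echo_once(sd)
reply = cl.recv(2048)
```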
    <div>
      <h3>Sourcing packets from a wildcard socket</h3>
      <a href="#sourcing-packets-from-a-wildcard-socket">
        
      </a>
    </div>
    <p>There is a bigger problem with this code. It only works correctly when binding to a specific IP address, like <code>::1</code> or <code>127.0.0.1</code>. It won't always work when we bind to a wildcard. The issue lies in the <code>sendto()</code> line — we didn't explicitly set the outbound IP address! Linux doesn't know where we'd like to source the packet from, so it will choose a default egress IP address, which might not be the IP the client sent its packet to. For example, let's say we added the <code>::2</code> address to the loopback interface and sent a packet to it, with the source IP set to <code>::1</code>:</p>
            <pre><code>marek@mrprec:~$ sudo tcpdump -ni lo port 1234 -t
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
IP6 ::1.41879 &gt; ::2.1234: UDP, length 2
IP6 ::1.1234 &gt; ::1.41879: UDP, length 2</code></pre>
            <p>Here we can see the packet correctly flying from <code>::1</code> to <code>::2</code>, to our server. But when the server responds, it sources the response from <code>::1</code>, which in this case is wrong.</p><p>On the server side, when binding to a wildcard:</p><p>— we might receive packets destined to a number of IP addresses</p><p>— we must be very careful when responding, and use the appropriate source IP address</p><p>The BSD sockets API doesn't make it easy to learn which address a received packet was destined to. On Linux and <a href="https://www.freebsd.org/cgi/man.cgi?query=ip6&amp;sektion=4">BSD</a> it is possible to request useful CMSG metadata with IP_PKTINFO and IPV6_RECVPKTINFO.</p><p>An improved server loop might look like:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/52toeTH5ZY1aD3bOSry6Do/88b7b0e90c7802ae513ef2f5f2ade5f5/image2-30.png" />
            
            </figure><p>The <code>recvmsg</code> and <code>sendmsg</code> syscalls, as opposed to <code>recvfrom</code> / <code>sendto</code>, allow the programmer to request and set extra CMSG metadata, which is very handy when dealing with UDP.</p><p>The IPV6_PKTINFO CMSG contains this data structure:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/JvR29xOL5YVQj4GvhudPL/c92277cd5b15270a2053663dc7412057/image1-71.png" />
            
            </figure><p>Here we can find the IP address and interface number the packet was targeted at. Notice that there's no place for a port number.</p>
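<p>The improved loop above is shown as an image; a Python sketch of the same idea for IPv6 on Linux might look like the following (the numeric fallbacks for the constants are the Linux values, in case a build doesn't expose them):</p>

```python
import socket
import struct

# Fall back to the Linux numeric values if the constants are missing.
IPV6_RECVPKTINFO = getattr(socket, "IPV6_RECVPKTINFO", 49)
IPV6_PKTINFO = getattr(socket, "IPV6_PKTINFO", 50)

sd = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
sd.setsockopt(socket.IPPROTO_IPV6, IPV6_RECVPKTINFO, 1)
sd.bind(("::", 0))                   # wildcard bind, kernel-chosen port
port = sd.getsockname()[1]

# A client talks to one concrete address of the wildcard-bound server.
cl = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM)
cl.sendto(b"ping", ("::1", port))

data, ancdata, flags, peer = sd.recvmsg(2048, socket.CMSG_SPACE(20))
for level, ctype, cdata in ancdata:
    if level == socket.IPPROTO_IPV6 and ctype == IPV6_PKTINFO:
        # struct in6_pktinfo: 16 bytes of in6_addr + 4 bytes of ifindex
        dst_ip = socket.inet_ntop(socket.AF_INET6, cdata[:16])
        ifindex = struct.unpack("i", cdata[16:20])[0]
        # Echo back, sourcing the reply from the address the client used:
        sd.sendmsg([data], [(level, ctype, cdata)], 0, peer)

echo, src = cl.recvfrom(2048)        # reply arrives sourced from ::1
```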
    <div>
      <h3>Graceful server restart</h3>
      <a href="#graceful-server-restart">
        
      </a>
    </div>
    <p>Many traditional UDP protocols, like DNS, are request-response based. Since there is no state associated with a higher-level "connection", the server can restart, to upgrade or change configuration, without any problems. Ideally, sockets should be managed with the usual <a href="http://0pointer.de/blog/projects/socket-activation.html">systemd socket activation</a> to avoid the short time window where the socket is down.</p><p>Modern protocols are often connection-based. For such servers, on restart it's beneficial to keep the old connections directed to the old server process, while the new server instance handles the new connections. The old connections will eventually die off, and the old server process will be able to terminate. This is a common and easy practice in the TCP world, where each connection has its own file descriptor. The old server process stops accept()-ing new connections and just waits for the old connections to gradually go away. <a href="http://nginx.org/en/docs/control.html#upgrade">NGINX has good documentation</a> on the subject.</p><p>Sadly, in UDP you can't <code>accept()</code> new connections. Doing graceful server restarts for UDP is surprisingly hard.</p>
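<p>For the request-response case, adopting systemd-activated sockets is just a few lines. This sketch follows the documented systemd convention (fds passed starting at 3, announced via the <code>LISTEN_FDS</code> and <code>LISTEN_PID</code> environment variables); it's an illustration, not a full sd_listen_fds() replacement:</p>

```python
import os
import socket

SD_LISTEN_FDS_START = 3   # first fd systemd passes to the activated service

def listen_fds():
    """Adopt sockets passed in via systemd socket activation, if any."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []          # the fds were not meant for this process
    count = int(os.environ.get("LISTEN_FDS", "0"))
    # socket.socket(fileno=...) auto-detects family and type (Python 3.7+).
    return [socket.socket(fileno=fd)
            for fd in range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count)]

sockets = listen_fds()     # [] when started outside socket activation
```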
    <div>
      <h3>Established-over-unconnected technique</h3>
      <a href="#established-over-unconnected-technique">
        
      </a>
    </div>
    <p>For some services we use a technique we call "established-over-unconnected". It comes from the realization that on Linux it's possible to create a connected socket <i>over</i> an unconnected one. Consider this code:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78wXvRtNNnQFwlqfS7EYd5/bf3224997d0169e26c402d909b1c2012/image3-37.png" />
            
            </figure><p>Does this look hacky? Well, it should. What we do here is:</p><p>— We start an unconnected UDP socket.</p><p>— We wait for a client to come in.</p><p>— As soon as we receive the first packet from the client, we immediately create a new fully connected socket, <i>over</i> the unconnected socket! It shares the same local port and local IP.</p><p>This is how it might look in ss:</p>
            <pre><code>marek@mrprec:~$ ss -panu sport = :1234 or dport = :1234 | cat
State     Recv-Q    Send-Q       Local Address:Port        Peer Address:Port    Process                                                                         
ESTAB     0         0                    [::1]:1234               [::1]:44592    python3
UNCONN    0         0                        *:1234                   *:*        python3
ESTAB     0         0                    [::1]:44592              [::1]:1234     nc</code></pre>
            <p>Here you can see the two sockets managed by our Python test server. Notice that the established socket shares the unconnected socket's port.</p><p>This trick basically reproduces the <code>accept()</code> behaviour in UDP, where each ingress connection gets its own dedicated socket descriptor.</p><p>While this trick is nice, it's not without drawbacks — it's racy in two places. First, it's possible that the client will send more than one packet to the unconnected socket before the connected socket is created. The application code should work around this — if a packet received on the server socket belongs to an already-existing connected flow, it should be handed over to the right place. Second, during the creation of the connected socket, in the short window after <code>bind()</code> and before <code>connect()</code>, we might receive unexpected packets belonging to the unconnected socket! We don't want these packets here. It is necessary to filter by source IP/port when receiving early packets on the connected socket.</p><p>Is this approach worth the extra complexity? It depends on the use case. For a relatively small number of long-lived flows, it might be fine. For a high number of short-lived flows (especially DNS or NTP) it's overkill.</p><p>Keeping old flows stable during service restarts is particularly hard in UDP. The established-over-unconnected technique is just one of the simpler ways of handling it. We'll leave another technique, based on SO_REUSEPORT eBPF, for a future blog post.</p>
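<p>A minimal Python sketch of the technique (IPv4 on loopback with a kernel-chosen port; a real server would also handle the two races discussed above):</p>

```python
import socket

# The unconnected "server" socket that new clients first talk to.
srv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
addr = srv.getsockname()

# A client opens a new flow.
cli = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cli.connect(addr)
cli.send(b"first packet")

data, peer = srv.recvfrom(2048)       # first packet arrives unconnected

# Create a fully connected socket over the unconnected one: same local
# IP and port (hence SO_REUSEADDR on both), plus the peer's address.
conn = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
conn.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
conn.bind(addr)      # racy window opens here...
conn.connect(peer)   # ...and closes here - verify early packets' sources!

# From now on this flow's packets land on the connected socket.
cli.send(b"second packet")
conn.settimeout(5)                    # don't hang forever in this demo
got = conn.recv(2048)
```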
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this blog post we started by highlighting connected and unconnected UDP sockets. Then we discussed why binding UDP servers to a wildcard is hard, and how the IP_PKTINFO CMSG can help solve it. We discussed the UDP graceful restart problem, and hinted at the established-over-unconnected technique.</p><table><tr><td><p><b>Socket type</b></p></td><td><p><b>Created with</b></p></td><td><p><b>Appropriate syscalls</b></p></td></tr><tr><td><p>established</p></td><td><p>connect()</p></td><td><p>recv()/send()</p></td></tr><tr><td><p>established</p></td><td><p>bind() + connect()</p></td><td><p>recvfrom()/send(); watch out for the race after bind(), verify the source of the packet</p></td></tr><tr><td><p>unconnected</p></td><td><p>bind(specific IP)</p></td><td><p>recvfrom()/sendto()</p></td></tr><tr><td><p>unconnected</p></td><td><p>bind(wildcard)</p></td><td><p>recvmsg()/sendmsg() with IP_PKTINFO CMSG</p></td></tr></table><p>Stay tuned: in future blog posts we might go even deeper into the curious world of production UDP servers.</p> ]]></content:encoded>
            <category><![CDATA[UDP]]></category>
            <guid isPermaLink="false">4Is8w5do12KTSyIAjAtpud</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Branch predictor: How many "if"s are too many? Including x86 and M1 benchmarks!]]></title>
            <link>https://blog.cloudflare.com/branch-predictor/</link>
            <pubDate>Thu, 06 May 2021 13:00:00 GMT</pubDate>
            <description><![CDATA[ Is it ok to have if clauses that will basically never be run? Surely, there must be some performance cost to that... ]]></description>
            <content:encoded><![CDATA[ <p>Some time ago I was looking at a hot section in our code and I saw this:</p>
            <pre><code>if (debug) {
    log("...");
}</code></pre>
            <p>This got me thinking. This code is in a performance-critical loop and it looks like a waste - we never run with the "debug" flag enabled<sup>[</sup><a href="#footnotes"><sup>1</sup></a><sup>].</sup> Is it ok to have <code>if</code> clauses that will basically never be run? Surely, there must be some performance cost to that...</p>
    <div>
      <h3>Just how bad is peppering the code with avoidable <code>if</code> statements?</h3>
      <a href="#just-how-bad-is-peppering-the-code-with-avoidable-if-statements">
        
      </a>
    </div>
    <p>Back in the days the general rule was: a fully predictable branch has close to zero CPU cost.</p><p>To what extent is this true? If one branch is fine, then how about ten? A hundred? A thousand? When does adding one more <code>if</code> statement become a bad idea?</p><p>At some point the negligible cost of simple branch instructions surely adds up to a significant amount. As another example, a colleague of mine found this snippet in our production code:</p>
            <pre><code>const char *getCountry(int cc) {
    if (cc == 1) return "A1";
    if (cc == 2) return "A2";
    if (cc == 3) return "O1";
    if (cc == 4) return "AD";
    if (cc == 5) return "AE";
    if (cc == 6) return "AF";
    if (cc == 7) return "AG";
    if (cc == 8) return "AI";
    ...
    if (cc == 252) return "YT";
    if (cc == 253) return "ZA";
    if (cc == 254) return "ZM";
    if (cc == 255) return "ZW";
    if (cc == 256) return "XK";
    if (cc == 257) return "T1";
    return "UNKNOWN";
}</code></pre>
            <p>Obviously, this code could be improved<sup>[</sup><a href="#footnotes"><sup>2</sup></a><sup>]</sup>. But when I thought about it more: <i>should</i> it be improved? Is there an actual performance hit from code that consists of a series of simple branches?</p>
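<p>For reference, the obvious improvement is a table lookup in place of the branch chain. A quick sketch of the idea in Python (only the codes visible in the snippet above are filled in; the elided middle stays elided):</p>

```python
# One dictionary lookup replaces the ~257-branch chain; in C the same
# idea would be a bounds check plus an index into a static string array.
COUNTRIES = {
    1: "A1", 2: "A2", 3: "O1", 4: "AD", 5: "AE", 6: "AF", 7: "AG", 8: "AI",
    # ... entries 9..251 elided, as in the original snippet ...
    252: "YT", 253: "ZA", 254: "ZM", 255: "ZW", 256: "XK", 257: "T1",
}

def get_country(cc):
    return COUNTRIES.get(cc, "UNKNOWN")
```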
    <div>
      <h3>Understanding the cost of jump</h3>
      <a href="#understanding-the-cost-of-jump">
        
      </a>
    </div>
    <p>We must start our journey with a bit of theory. We want to figure out if the CPU cost of a branch increases as we add more of them. As it turns out, assessing the cost of a branch is not trivial. On modern processors it takes between one and twenty CPU cycles. There are at least four categories of control flow instructions<sup>[</sup><a href="#footnotes"><sup>3</sup></a><sup>]</sup>: unconditional branch (jmp on x86), call/return, taken conditional branch (e.g. je on x86), and not-taken conditional branch. The taken branches are especially problematic: without special care they are inherently costly - we'll explain this in the following section. To bring down the cost, modern CPUs try to predict the future and figure out the branch <b>target</b> before the branch is actually fully executed! This is done in a special part of the processor called the branch predictor unit (BPU).</p><p>The branch predictor attempts to figure out the destination of a branching instruction very early and with very little context. This magic happens <b>before</b> the "decoder" pipeline stage, and the predictor has very limited data available. It only has some past history and the address of the current instruction. If you think about it, this is super powerful: given only the current instruction pointer it can assess, with very high confidence, where the target of the jump will be.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1K3IOo5qkOOP0hhRtwu10U/e9997acf855e96044f2684e1adae87b0/pasted-image-0-1.png" />
            
            </figure><p>Source: <a href="https://en.wikipedia.org/wiki/Branch_predictor">https://en.wikipedia.org/wiki/Branch_predictor</a></p><p>The BPU maintains a couple of data structures, but today we'll focus on the Branch Target Buffer (BTB). It's where the BPU remembers the target instruction pointer of previously taken branches. The whole mechanism is much more complex; take a look at <a href="http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf">Vladimir Uzelac's Master's thesis</a> for details about branch prediction on CPUs from the 2008 era:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1FO1GCWdkQYR5I90qGopuH/9f51e10ca53eafd001f2238896103327/pasted-image-0--1-.png" />
            
            </figure><p>For the scope of this article we'll simplify and focus on the BTB only. We'll try to show how large it is and how it behaves under different conditions.</p>
    <div>
      <h3>Why is branch prediction needed?</h3>
      <a href="#why-is-branch-prediction-needed">
        
      </a>
    </div>
    <p>But first, why is branch prediction used at all? In order to get the best performance, the CPU pipeline must be fed a constant flow of instructions. Consider what happens to the multi-stage CPU pipeline on a branch instruction. To illustrate, let's consider the following ARM program:</p>
            <pre><code>    BR label_a;
    X1
    ...
label_a:
    Y1</code></pre>
            <p>Assuming a simplistic CPU model, the operations would flow through the pipeline like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6LYtd8rNXEQA2X2pOPwmvd/e9bbdd4d54a72dc78730684252d01a84/3P6PIWN6gPAdYzP8oDgrsaOMKgUmG51zIiFhbm071cZKM276S7vRb5atpTwlKrM1lFHRYsobw8P4e-Z9t1Vb9TGeutpBe2CkMrNGruWO8yb5Qz0vZ6Qn6RbOi5Tp.png" />
            
            </figure><p>In the first cycle the BR instruction is fetched. This is an unconditional branch instruction changing the execution flow of the CPU. At this point it's not yet decoded, but the CPU would like to fetch another instruction already! Without a branch predictor in cycle 2 the fetch unit either has to wait or simply continues to the next instruction in memory, hoping it will be the right one.</p><p>In our example, instruction X1 is fetched even though this isn't the correct instruction to run. In cycle 4, when the branch instruction finishes the execute stage, the CPU will be able to understand the mistake, and roll back the speculated instructions before they have any effect. At this point the fetch unit is updated to correctly get the right instruction - Y1 in our case.</p><p>This situation of losing a number of cycles due to fetching code from an incorrect place is called a "frontend bubble". Our theoretical CPU has a two-cycle frontend bubble when a branch target wasn’t predicted right.</p><p>In this example we see that, although the CPU does the right thing in the end, without good branch prediction it wasted effort on bad instructions. In the past, various techniques have been used to reduce this problem, such as static branch prediction and branch delay slots. But the dominant CPU designs today rely <a href="https://danluu.com/branch-prediction/#one-bit">on <i>dynamic branch prediction</i></a>. This technique is able to mostly avoid the frontend bubble problem, by predicting the correct address of the next instruction even for branches that aren’t fully decoded and executed yet.</p>
    <div>
      <h3>Playing with the BTB</h3>
      <a href="#playing-with-the-btb">
        
      </a>
    </div>
    <p>Today we're focusing on the BTB - a data structure managed by the branch predictor responsible for figuring out a target of a branch. It's important to note that the BTB is distinct from and independent of the system assessing if the branch was taken or not taken. Remember, we want to figure out if a cost of a branch increases as we run more of them.</p><p>Preparing an experiment to stress only the BTB is relatively simple (<a href="https://xania.org/201602/bpu-part-three">based on Matt Godbolt's work</a>). It turns out a sequence of unconditional <code>jmps</code> is totally sufficient. Consider this x86 code:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/259y50RQfLqaFcsT4tljGb/5010158442d4e2afb6ebb3c105ee4e5b/pasted-image-0--2-.png" />
            
            </figure><p>This code stresses the BTB to an extreme - it just consists of a chain of <code>jmp +2</code> statements (i.e. literally jumping to the next instruction). In order to avoid wasting cycles on frontend pipeline bubbles, each taken jump needs a BTB hit. This branch prediction must happen very early in the CPU pipeline, before instruction decode is finished. The same mechanism is needed for any taken branch, whether it's unconditional, conditional or a function call.</p><p>The code above was run inside a test harness that measures how many CPU cycles elapse for each instruction. For example, in this run we're measuring the timings of 1024 densely packed jmp instructions - one every two bytes - one after another:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3QofFzm1kmqFLuRrFPoyjQ/857bfbfe366720bd480653388f97eac8/pasted-image-0--3-.png" />
            
            </figure><p>We’ll look at the results of experiments like this for a few different CPUs. But in this instance, it was run on a machine with an AMD EPYC 7642. Here, the cold run took 10.5 cycles per jmp, and all subsequent runs took ~3.5 cycles per jmp. The code is prepared in such a way as to make sure it's the BTB that is slowing down the first run. Take a look at the full code; there is quite a bit of magic to warm up the L1 cache and iTLB without priming the BTB.</p><p><b>Top tip 1. On this CPU, a branch instruction that is taken but not predicted costs ~7 cycles more than one that is taken and predicted.</b> Even if the branch was unconditional.</p>
    <div>
      <h3>Density matters</h3>
      <a href="#density-matters">
        
      </a>
    </div>
    <p>To get a full picture we also need to think about the density of jmp instructions in the code. The code above did eight jmps per 16-byte code block. This is a lot. For example, the code below contains one jmp instruction in each block of 16 bytes. Notice that the <code>nop</code> opcodes are jumped over. The block size doesn't change the number of executed instructions, only the code density:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2uenA2zFAkPWQZldVX7gba/8823f9ec4052618dc06c21d0621a6581/pasted-image-0--4-.png" />
            
            </figure><p>Varying the jmp block size might be important. It allows us to control the placement of the jmp opcodes. Remember, the BTB is indexed by the instruction pointer address. Its value and alignment might influence the placement in the BTB and help us reveal the BTB layout. Increasing the alignment causes more nop padding to be added. I will call the sequence of a single measured instruction - jmp in this case - and zero or more nops a "block", and its size the "block size". Notice that the larger the block size, the larger the working code size for the CPU. At larger values we might see some performance drop due to exhausting the L1 cache space.</p>
    <div>
      <h3>The experiment</h3>
      <a href="#the-experiment">
        
      </a>
    </div>
    <p>Our experiment is crafted to show the performance drop depending on the number of branches, across different working code sizes. Hopefully, we will be able to prove the performance is mostly dependent on the number of blocks - and therefore the BTB size, and not the working code size.</p><p>See the <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2021-05-branch-prediction">code on GitHub</a>. If you want to see the generated machine code, though, you need to run a special command. It's created procedurally by the code, customized by passed parameters. Here's an example <code>gdb</code> incantation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7aAKMXz8aruwblN67UvZd7/1eb2aea7cc6e2e3a9c0f6823f60b8f08/pasted-image-0--5-.png" />
            
            </figure><p>Let's take this experiment further: what if we took the best times of each run - with a fully primed BTB - for varying jmp block sizes and numbers of blocks (working set sizes)? Here you go:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZNeolS38Q90vdHPvehvVa/64e0c16d163d74c05cfbaee6f8d9e15c/pasted-image-0--6-.png" />
            
            </figure><p>This is an astonishing chart. First, it's obvious something happens at the 4096 jmp mark[<a href="#footnotes">4</a>], regardless of how large the jmp block size is - how many nops we skip over. Reading it aloud:</p><ul><li><p>On the far left, we see that if the amount of code is small enough - less than 2048 bytes (256 times a block of 8 bytes) - it's possible to hit some kind of uop/L1 cache and get ~1.5 cycles per fully predicted branch. This is amazing.</p></li><li><p>Otherwise, if you keep your hot loop to 4096 branches then, no matter how dense your code is, you are likely to see ~3.4 cycles per fully predicted branch.</p></li><li><p>Above 4096 branches the branch predictor gives up, and the cost of each branch shoots up to ~10.5 cycles per jmp. This is consistent with what we saw above - an unpredicted branch on a flushed BTB took ~10.5 cycles.</p></li></ul><p>Great, so what does it mean? You should avoid branch instructions if you want to avoid branch misses, because you have at most 4096 fast BTB slots. This is not very pragmatic advice, though - it's not like we deliberately put many unconditional <code>jmp</code>s in real code!</p><p>There are a couple of takeaways for the discussed CPU. I repeated the experiment with an always-taken conditional branch sequence, and the resulting chart looks almost identical. The only difference is that the predicted taken conditional je instruction is 2 cycles slower than the unconditional jmp.</p><p>An entry is added to the BTB whenever a branch is "taken" - that is, the jump actually happens. An unconditional <code>jmp</code> or an always-taken conditional branch will cost a BTB slot. To get the best performance, make sure not to have more than 4096 taken branches in the hot loop. The good news is that never-taken branches don't take space in the BTB. We can illustrate this with another experiment:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gd2JKAwYMpgk5obd4E0Hz/29876ce9f45dc01edbb8e3966bcb9903/pasted-image-0--7-.png" />
            
            </figure><p>This boring code goes over a not-taken <code>jne</code> followed by two nops (block size = 4). Armed with this test (jne never-taken), the previous one (jmp always-taken) and a conditional branch <code>je</code> always-taken, we can draw this chart:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1F2STBl6iw6OpkwqmIdlR9/1de627e490ddd75816bb3899f99f6ed7/pasted-image-0--8-.png" />
            
            </figure><p>First, without any surprise, we can see the conditional 'je always-taken' getting slightly more costly than the simple unconditional <code>jmp</code>, but only after the 4096-branches mark. This makes sense: the conditional branch is resolved later in the pipeline, so the frontend bubble is longer. Then take a look at the blue line hovering near zero. This is the "jne never-taken" line, flat at 0.3 clocks / block, no matter how many blocks we run in sequence. The takeaway is clear - you can have as many never-taken branches as you want, without incurring any cost. There isn't any spike at the 4096 mark, meaning the BTB is not used in this case. It seems a conditional jump not seen before is predicted to be not-taken.</p><p><b>Top tip 2: conditional branches never-taken are basically free</b> - at least on this CPU.</p><p>So far we've established that always-taken branches occupy the BTB, while never-taken branches do not. How about other control flow instructions, like <code>call</code>?</p><p>I haven't been able to find this in the literature, but it seems call/ret also need a BTB entry for best performance. I was able to illustrate this on our AMD EPYC. Let's take a look at this test:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fHArGj5TWvRM9AEM84NLJ/1951dfe42651b923551d90dddc63964f/pasted-image-0--9-.png" />
            
            </figure><p>This time we'll issue a number of <code>callq</code> instructions followed by <code>ret</code> - both of which should be fully predicted. The experiment is crafted so that each callq calls a unique function, to allow for retq prediction - each one returns to exactly one caller.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/677H5BdxwzZ9B6hbebRFnB/a1ac1602d5efab51152a6219fde87579/pasted-image-0--10-.png" />
            
            </figure><p>This chart confirms the theory: no matter the code density - with the exception of the 64-byte block size, which is notably slower - the cost of a predicted call/ret starts to deteriorate after the 2048 mark. At this point the BTB is filled with call and ret predictions and can't hold any more data. This leads to an important conclusion:</p><p><b>Top tip 3. In hot code you want fewer than 2K function calls</b> - on this CPU.</p><p>On our test CPU a sequence of fully predicted call/ret pairs takes about 7 cycles, which is about the same as two unconditional predicted <code>jmp</code> opcodes. That's consistent with our results above.</p><p>So far we've thoroughly examined the AMD EPYC 7642. We started with this CPU because its branch predictor is relatively simple and the charts were easy to read. It turns out more recent CPUs are less clear.</p>
    <div>
      <h3>AMD EPYC 7713</h3>
      <a href="#amd-epyc-7713">
        
      </a>
    </div>
    <p>The newer AMD generation is more complex than the previous ones. Let's run the two most important experiments. First, the <code>jmp</code> one:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/12g3Zg18eWyHIwyUuf63gj/0cb89d1fc7a8fedbed929aa455e11dbf/pasted-image-0--11-.png" />
            
            </figure><p>For the always-taken branch case we can see very good, sub-1-cycle timings when the number of branches doesn't exceed 1024 and the code isn't too dense.</p><p><b>Top tip 4. On this CPU it's possible to get &lt;1 cycle per predicted jmp when the hot loop fits in ~32KiB.</b></p><p>Then there is some noise starting after the 4096-jmp mark, followed by a complete collapse in speed at about 6000 branches. This is in line with the theory that the BTB is 4096 entries long. We can speculate that some other prediction mechanism successfully kicks in beyond that, and keeps the performance up until the ~6k mark.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4X5KfDDT36u9u2MGb7iu7t/90c4fbf3d04cd386ed6eecae645f25f8/pasted-image-0--12-.png" />
            
            </figure><p>The call/ret chart tells a similar tale: the timings start breaking after the 2048 mark, and predictions fail completely beyond ~3000.</p>
    <div>
      <h3>Xeon Gold 6262</h3>
      <a href="#xeon-gold-6262">
        
      </a>
    </div>
    <p>The Intel Xeon looks different from the AMD:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hIriKLopeE26yIor9qGG/9ab0d56fd18cfd4a782b7270845bc66f/pasted-image-0--13-.png" />
            
            </figure><p>Our test shows that a predicted taken branch costs 2 cycles. Intel has documented a clock penalty for very dense branching code - this explains the 4-byte block size line hovering at ~3 cycles. The branch cost breaks at the 4096-jmp mark, confirming the theory that the Intel BTB can hold 4096 entries. The 64-byte block size chart looks confusing, but really isn't. The branch cost stays flat at 2 cycles up to the 512-jmp count, then increases. This is caused by the internal layout of the BTB, which is said to be 8-way associative. It seems that with the 64-byte block size we can utilize at most half of the 4096 BTB slots.</p><p><b>Top tip 5. On Intel avoid placing your jmp/call/ret instructions at regular 64-byte intervals.</b></p><p>Then the call/ret chart:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1XU4ofCRG3wfjLZlwsEhBv/2df1b19f6623a26be745caf35c28ee33/pasted-image-0--14-.png" />
            
            </figure><p>Similarly, we can see branch predictions failing after the 2048 mark - in this experiment each block uses two control flow instructions: call and ret. This again confirms the BTB size of 4K entries. The 64-byte block size is generally slower due to the nop padding, but it also breaks down sooner due to the instruction alignment issue. Notice that we didn't see this effect on AMD.</p>
    <div>
      <h3>Apple Silicon M1</h3>
      <a href="#apple-silicon-m1">
        
      </a>
    </div>
    <p>So far we've seen examples of AMD and Intel server-grade CPUs. How does an Apple Silicon M1 fit into this picture?</p><p>We expect it to be very different - it's designed for mobile and uses the ARM64 architecture. Let's rerun our two experiments:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/bJ9lzYsIARltRCkOEJVZZ/6b1b5edcdf00e3c0a3c353fff72c8232/pasted-image-0--15-.png" />
            
            </figure><p>The predicted <code>jmp</code> test tells an interesting story. First, when the code fits in 4096 bytes (1024*4, 512*8, etc.) you can expect a predicted <code>jmp</code> to cost 1 clock cycle. This is an excellent score.</p><p>Beyond that, you can generally expect a cost of 3 clock cycles per predicted jmp. This is also very good, but it starts to deteriorate when the working code grows beyond ~192KiB. This is visible with block size 64 breaking at the 3072 mark (3072*64 = 192KiB), and with block size 32 at 6144 (6144*32 = 192KiB). At this point the prediction seems to stop working. The documentation indicates that the M1 CPU has 192KB of L1 instruction cache - our experiment matches that.</p><p>Let's compare the "predicted jmp" chart with the "unpredicted jmp" one. Take this chart with a grain of salt, because flushing the branch predictor is notoriously difficult.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2VCtlgVdJkSB3MFuYbR5MQ/31b80611cab5b7a5ec14514dd375e7ea/pasted-image-0--16-.png" />
            
            </figure><p>However, even if we don't trust the flush-bpu code (<a href="https://xania.org/201602/bpu-part-three">adapted from Matt Godbolt</a>), this chart reveals two things. First, the "unpredicted" branch cost seems to be correlated with the branch distance: the longer the branch, the costlier it is. We haven't seen such behaviour on x86 CPUs.</p><p>Then there is the cost itself. We saw what a predicted sequence of branches costs, and now what a supposedly-unpredicted jmp costs. In the first chart we saw that beyond ~192KiB of working code, the branch predictor seems to become ineffective. The supposedly-flushed BPU shows the same cost. For example, the cost of a 64-byte block size jmp with a small working set is 3 cycles, while a miss costs ~8 cycles. For a large working set both times are ~8 cycles. It seems that the BTB state is linked to the L1 cache state. <a href="https://www.realworldtech.com/forum/?threadid=159985&amp;curpostid=160001">Paul A. Clayton suggested</a> the possibility of such a design back in 2016.</p><p><b>Top tip 6. On M1 a predicted-taken branch generally takes 3 cycles, and an unpredicted but taken branch has a varying cost depending on the jmp length. The BTB is likely linked with the L1 cache.</b></p><p>The call/ret chart is funny:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/54s7q6CJamAHCjpzDHFm6V/d1db62f4d8ec351ce539709cb6a512cd/pasted-image-0--17-.png" />
            
            </figure><p>Like in the chart before, we can see a big benefit if the hot code fits within 4096 bytes (512*4 or 256*8). Otherwise, you can count on 4-6 cycles per call/ret sequence (or bl/ret, as it's known on ARM). The chart shows curious alignment effects; it's unclear what causes them. Beware: comparing the numbers in this chart with x86 is unfair, since the ARM <code>call</code> operation differs substantially from the x86 variant.</p><p>The M1 seems pretty fast, with predicted branches usually at 3 clock cycles. Even unpredicted branches never cost more than 8 ticks in our benchmark. A call+ret sequence for dense code should fit under 5 cycles.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>We started our journey from a piece of trivial code, and asked a basic question: how costly is adding a never-taken <code>if</code> branch in the hot portion of code?</p><p>Then we quickly dived into very low-level CPU features. By the end of this article, hopefully, an astute reader has gained a better intuition for how a modern branch predictor works.</p><p>On x86 the hot code needs to split the BTB budget between function calls and taken branches. The BTB holds only 4096 entries. There are strong benefits to keeping the hot code under 16KiB.</p><p>On the M1, on the other hand, the BTB seems to be limited by the L1 instruction cache. If you're writing super hot code, ideally it should fit in 4KiB.</p><p>Finally, can you add that one more <code>if</code> statement? If it's never taken, it's probably okay. I found no evidence that such branches incur any extra cost. But do avoid always-taken branches and function calls.</p><p><b>Sources</b></p><p>I'm not the first person to investigate how the BTB works. I based my experiments on:</p><ul><li><p><a href="http://www.ece.uah.edu/~milenka/docs/VladimirUzelac.thesis.pdf">Vladimir Uzelac's thesis</a></p></li><li><p><a href="https://xania.org/201602/bpu-part-three">Matt Godbolt's work</a>. The series has 5 articles.</p></li><li><p><a href="https://www.realworldtech.com/forum/?threadid=159985&amp;curpostid=159985">Travis Downs' BTB questions</a> on Real World Tech</p></li><li><p><a href="https://stackoverflow.com/questions/38811901/slow-jmp-instruction">various</a> <a href="https://stackoverflow.com/questions/51822731/why-did-intel-change-the-static-branch-prediction-mechanism-over-these-years">stackoverflow</a> <a href="https://stackoverflow.com/questions/38512886/btb-size-for-haswell-sandy-bridge-ivy-bridge-and-skylake">discussions</a>, especially <a href="https://stackoverflow.com/questions/31280817/what-branch-misprediction-does-the-branch-target-buffer-detect">this one</a> and <a href="https://stackoverflow.com/questions/31642902/intel-cpus-instruction-queue-provides-static-branch-prediction">this one</a></p></li><li><p><a href="https://www.agner.org/optimize/microarchitecture.pdf">Agner Fog's</a> microarchitecture guide, which has a good section on branch prediction.</p></li></ul>
    <div>
      <h3>Acknowledgements</h3>
      <a href="#acknowledgements">
        
      </a>
    </div>
    <p>Thanks to <a href="/author/david-wragg/">David Wragg</a> and <a href="https://twitter.com/danluu">Dan Luu</a> for technical expertise and proofreading help.</p>
    <div>
      <h3>PS</h3>
      <a href="#ps">
        
      </a>
    </div>
    <p>Oh, oh. But this is not the whole story! Similar research was the basis of the <a href="https://spectreattack.com/spectre.pdf">Spectre v2</a> attack. The attack exploited the little-known fact that the BPU state was not cleared between context switches. With the correct technique it was possible to train the BPU - in the case of Spectre it was the iBTB - and force a privileged piece of code to be speculatively executed. This, combined with a cache side-channel data leak, allowed an attacker to steal secrets from the privileged kernel. Powerful stuff.</p><p>A proposed solution was to avoid using the shared BTB. This can be done in two ways: make indirect jumps always fail to predict, or fix the CPU to avoid sharing BTB state across isolation domains. This is a long story, maybe for another time...</p><hr /><p><a>Footnotes</a></p><p>1. One historical solution to this specific 'if debug' problem is called "runtime nop'ing". The idea is to modify the code at runtime and patch the never-taken branch instruction with a <code>nop</code>. For example, see the "ISENABLED" discussion on <a href="https://bugzilla.mozilla.org/showbug.cgi?id=370906.">https://bugzilla.mozilla.org/showbug.cgi?id=370906.</a></p><p>2. Fun fact: modern compilers are pretty smart. Recent gcc (&gt;=11) and even older clang (&gt;=3.7) are able to optimize it quite a lot. <a href="https://godbolt.org/z/KWYEW3d9s">See for yourself</a>. But let's not get distracted by that. This article is about low-level machine code branch instructions!</p><p>3. This is a simplification. There are of course more control flow instructions, like software interrupts, syscalls, and VMENTER/VMEXIT.</p><p>4. Okay, I'm slightly overinterpreting the chart. Maybe the 4096-jmp mark is due to the 4096-entry uop cache or some instruction decoder artifact? To prove this spike is indeed BTB related, I looked at the Intel BPUCLEARS.EARLY and BACLEAR.CLEAR performance counters. Their value is small for block counts under 4096 and large for block counts greater than 5378. This is strong evidence that the performance drop is indeed caused by the BPU, and likely the BTB.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[AMD]]></category>
            <category><![CDATA[EPYC]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2pvX64jHrEfNLMSmJO1Iv4</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Computing Euclidean distance on 144 dimensions]]></title>
            <link>https://blog.cloudflare.com/computing-euclidean-distance-on-144-dimensions/</link>
            <pubDate>Fri, 18 Dec 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Last year we deployed a CSAM image scanning tool. This is so cool! Image processing is always hard, and deploying a real image identification system at a Cloudflare scale is no small achievement! But we hit a problem - the matching algorithm was too slow for our needs. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4rNxjRfxGzhJI3MAVokLgA/8b8be5359634c3c75ecd5cfc696b6072/image1-62.png" />
            
            </figure><p>Late last year I read a blog post about <a href="/the-csam-scanning-tool/">our CSAM image scanning tool</a>. I remember thinking: this is so cool! Image processing is always hard, and deploying a real image identification system at Cloudflare is no small achievement!</p><p>Some time later, I was chatting with Kornel: "We have all the pieces in the image processing pipeline, but we are struggling with the performance of one component." Scaling to Cloudflare needs ain't easy!</p><p>The problem was in the speed of the matching algorithm itself. Let me elaborate. As John explained in his blog post <a href="/the-csam-scanning-tool">on the CSAM Scanning Tool</a>, the image matching algorithm creates a fuzzy hash from a processed image. The hash is exactly 144 bytes long. For example, it might look like this:</p>
            <pre><code>00e308346a494a188e1043333147267a 653a16b94c33417c12b433095c318012
5612442030d14a4ce82c623f4e224733 1dd84436734e4a5d6e25332e507a8218
6e3b89174e30372d</code></pre>
            <p>The hash is designed to be used in a fuzzy matching algorithm that can find "nearby", related images. The specific algorithm is well defined, but making it fast is left to the programmer — and at Cloudflare we need the matching to be done super fast. We want to match thousands of hashes per second, of images passing through our network, against a database of millions of known images. To make this work, we need to seriously optimize the matching algorithm.</p>
    <div>
      <h3>Naive quadratic algorithm</h3>
      <a href="#naive-quadratic-algorithm">
        
      </a>
    </div>
    <p>The first algorithm that comes to mind has <code>O(K*N)</code> complexity: for each of the K queries, go through every one of the N hashes in the database. A naive implementation creates a lot of work. But how much work exactly?</p><p>First, we need to explain how fuzzy matching works.</p><p>Given a query hash, the fuzzy match is the "closest" hash in the database. This requires us to define a distance. We treat each hash as a vector containing 144 numbers, identifying a point in a 144-dimensional space. Given two such points, we can calculate the distance using the standard Euclidean formula.</p><p>For our particular problem, though, we are interested in the "closest" match in the database only if the distance is lower than some predefined threshold. Otherwise, when the distance is large, we can assume the images aren't similar. This is the expected result - most of our queries will not have a related image in the database.</p><p>The <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> equation used by the algorithm is standard:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5p3VcvTiFKgdwx5xLkRRWv/33fa46b931b1764cf6e51df3198e0a23/image3-41.png" />
            
            </figure><p>To calculate the distance between two 144-byte hashes, we take each byte, calculate the delta, square it, sum it to an accumulator, do a square root, and ta-dah! We have the distance!</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive.c#L11-L20">Here's how to count the squared distance in C</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4lwPTgH8PmEYB8mMyOhbsH/f660b81e92c9ff46a422bdbf1af18321/image4-24.png" />
            
            </figure><p>This function returns the squared distance. We avoid computing the actual distance to save ourselves from running the square root function - it's slow. Inside the code, for performance and simplicity, we'll mostly operate on the squared value. We don't need the actual distance value; we just need to find the vector with the smallest one. In our case it doesn't matter whether we compare distances or squared distances!</p><p>As you can see, fuzzy matching is basically the standard problem of finding the closest point in a multi-dimensional space. Surely this has been solved in the past - but let's not jump ahead.</p><p>While this code might be simple, we expect it to be rather slow. Finding the smallest hash distance in a database of, say, 1M entries would require going over all records, and would need at least:</p><ol><li><p>144 * 1M subtractions</p></li><li><p>144 * 1M multiplications</p></li><li><p>144 * 1M additions</p></li></ol><p>And more. This alone adds up to 432 million operations! How does it look in practice? To illustrate this blog post <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2020-12-mmdist">we prepared a full test suite</a>. The large database of known hashes can be <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/generate.c">well emulated by random data</a>. The query hashes can't be random and must be slightly more sophisticated, otherwise the exercise wouldn't be that interesting. We generated the test queries by byte-swapping actual data from the database - this allows us to precisely control the distance between test hashes and database hashes. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/gentest.py">Take a look at the scripts for details</a>. Here's our first run of <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive.c">the first, naive, algorithm</a>:</p>
            <pre><code>$ make naive
&lt; test-vector.txt ./mmdist-naive &gt; test-vector.tmp
Total: 85261.833ms, 1536 items, avg 55.509ms per query, 18.015 qps</code></pre>
            <p>We matched 1,536 test hashes against a database of 1 million random vectors in 85 seconds. It took 55ms of CPU time on average to find the closest neighbour. This is rather slow for our needs.</p>
    <div>
      <h3>SIMD for help</h3>
      <a href="#simd-for-help">
        
      </a>
    </div>
    <p>An obvious improvement is to use more complex SIMD instructions. SIMD is a way to instruct the CPU to process multiple data points with one instruction. This is a perfect strategy when dealing with vector problems - as is the case for our task.</p><p>We settled on using AVX2, with 256-bit vectors. We did this for a simple reason - newer AVX versions are not supported by our AMD CPUs. Additionally, in the past, we were <a href="/on-the-dangers-of-intels-frequency-scaling/">not thrilled by the AVX-512 frequency scaling</a>.</p><p>Using AVX2 is easier said than done. There is no single instruction to compute the Euclidean distance between two uint8 vectors! The fastest way we could find to compute the full distance of two 144-byte vectors <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive-avx2.c#L13-L36">with AVX2 is authored</a> by <a href="https://twitter.com/thecomp1ler">Vlad</a>:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7yFtHcZSRvyub5vHtjDXKE/54596189ea43193a5fd51653ec73daea/image2-39.png" />
            
            </figure><p>It’s actually simpler than it looks: load 16 bytes, convert the vector from uint8 to int16, subtract, store the intermediate sums as int32, repeat. At the end, we need four somewhat complex instructions to extract the partial sums into the final sum. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-naive-avx2.c">This AVX2 code</a> improves the performance around 3x:</p>
            <pre><code>$ make naive-avx2 
Total: 25911.126ms, 1536 items, avg 16.869ms per query, 59.280 qps</code></pre>
            <p>We measured 17ms per item, which is still below our expectations. Unfortunately, we can't push it much further without major changes. The problem is that this code is limited by memory bandwidth. The measurements come from my Intel i7-5557U CPU, which has the max theoretical memory bandwidth of just 25GB/s. The database of 1 million entries takes 137MiB, so it takes at least 5ms to feed the database to my CPU. With this naive algorithm we won't be able to go below that.</p>
    <div>
      <h3>Vantage Point Tree algorithm</h3>
      <a href="#vantage-point-tree-algorithm">
        
      </a>
    </div>
    <p>Since the naive brute force approach failed, we tried using more sophisticated algorithms. My colleague <a href="https://github.com/kornelski/vpsearch">Kornel Lesiński implemented</a> a super cool <a href="https://en.wikipedia.org/wiki/Vantage-point_tree">Vantage Point algorithm</a>. After a few ups and downs, optimizations and rewrites, we gave up. Our problem turned out to be unusually hard for this kind of algorithm.</p><p>We observed <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">"the curse of dimensionality"</a>. Space partitioning algorithms don't work well in problems with high dimensionality - and in our case, we have an enormous number of dimensions: 144. K-D trees are doomed. Locality-sensitive hashing is also doomed. It's a bizarre situation in which the space is unimaginably vast, yet everything is close together. The volume of the space is a 347-digit-long number, but the maximum distance between points is just 3060 (that is, sqrt(255*255*144)).</p><p>Space partitioning algorithms are fast because they gradually narrow the search space as they get closer to finding the closest point. But in our case, the common query is never close to any point in the set, so the search space can’t be narrowed to a meaningful degree.</p><p>A VP-tree was a promising candidate, because it operates only on distances, subdividing space into near and far partitions, like a binary tree. When there is a close match, it can be very fast, and doesn't need to visit more than <code>O(log(N))</code> nodes. For non-matches, though, its speed drops dramatically: the algorithm ends up visiting nearly half of the nodes in the tree, because everything is close together in 144 dimensions. Even though the algorithm avoided visiting more than half of the nodes, the cost of visiting the remaining nodes was higher, so the search ended up being slower overall.</p>
    <div>
      <h3>Smarter brute force?</h3>
      <a href="#smarter-brute-force">
        
      </a>
    </div>
    <p>This experience got us thinking. Since space partitioning algorithms can't narrow down the search, and still need to go over a very large number of items, maybe we should focus on going over all the hashes, extremely quickly. We must be smarter about memory bandwidth though — it was the limiting factor in the naive brute force approach before.</p><p>Perhaps we don't need to fetch all the data from memory.</p>
    <div>
      <h3>Short distance</h3>
      <a href="#short-distance">
        
      </a>
    </div>
    <p>The breakthrough came from the realization that we don't need to compute the full distance between hashes. Instead, we can compute only a subset of dimensions, say 32 out of the total 144. If this partial distance is already large, then there is no need to compute the full one! Computing more dimensions is never going to reduce the Euclidean distance.</p><p>The proposed algorithm works as follows:</p><p>1. Take the query hash and extract a 32-byte short hash from it</p><p>2. Go over all the 1 million 32-byte short hashes from the database. They must be densely packed in memory to allow the CPU to perform good prefetching and avoid reading data we won't need.</p><p>3. If the distance of the 32-byte short hash is greater than or equal to the best score so far, move on</p><p>4. Otherwise, investigate the hash thoroughly and compute the full distance.</p><p>Even though this algorithm needs to do less arithmetic and memory work, it's not faster than the previous naive one. See <code>make short-avx2</code>. The problem is: we still need to compute the full distance for hashes that are promising, and there are quite a lot of them. Computing the full distance for promising hashes adds enough work, both in ALU and memory latency, to offset the gains of this algorithm.</p><p>There is one detail of our particular application of the image matching problem that will help us a lot moving forward. As we described earlier, the problem is less about finding the closest neighbour and more about proving that a neighbour within a reasonable distance doesn't exist. Remember - in practice, we don't expect to find many matches! We expect almost every image we feed into the algorithm to be unrelated to the image hashes stored in the database.</p><p>It's sufficient for our algorithm to prove that no neighbour exists within a predefined distance threshold. Let's assume we are not interested in hashes more distant than, say, 220, which squared is 48,400. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-avx2.c">This makes our short-distance algorithm variation</a> work much better:</p>
            <pre><code>$ make short-avx2-threshold
Total: 4994.435ms, 1536 items, avg 3.252ms per query, 307.542 qps</code></pre>
            
    <div>
      <h3>Origin distance variation</h3>
      <a href="#origin-distance-variation">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rp4oarmXrL2JtieXZ4VWB/56dd69e834fa27e7c0c5cb0eb7544a58/image6-11.png" />
            
            </figure><p>At some point, John noted that the threshold allows an additional optimization. We can order the hashes by their distance from some origin point. Given a query hash with an origin distance of A, by the triangle inequality we only need to inspect hashes whose origin distance lies between A-threshold and A+threshold. This is pretty much how each level of a Vantage Point Tree works, just simplified. This optimization - ordering items in the database by their distance from an origin point - is relatively simple and can save us a bit of work.</p><p>While great on paper, this method doesn't bring much gain in practice, as the vectors are not grouped in clusters - they are pretty much random! For the threshold values we are interested in, the origin distance algorithm variation gives us a ~20% speed boost, which is okay but not breathtaking. This change might bring more benefits if we ever decide to reduce the threshold value, so it might be worth doing for the production implementation. However, it doesn't work well with query batching.</p>
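<p>A minimal sketch of this pruning, assuming the database's origin distances are precomputed and sorted ascending (all names are illustrative):</p>

```c
/* Hedged sketch of origin-distance pruning. By the triangle inequality,
   any hash within `threshold` of the query must have an origin distance
   between A - threshold and A + threshold, where A is the query's own
   distance to the origin. Two binary searches find that slice. */
#include <stddef.h>

/* First index i with a[i] >= key (a[] sorted ascending). */
static size_t lower_bound(const double *a, size_t n, double key) {
    size_t left = 0, right = n;
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        if (a[mid] < key) left = mid + 1; else right = mid;
    }
    return left;
}

/* First index i with a[i] > key. */
static size_t upper_bound(const double *a, size_t n, double key) {
    size_t left = 0, right = n;
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        if (a[mid] <= key) left = mid + 1; else right = mid;
    }
    return left;
}

/* Computes the half-open slice [*begin, *end) of candidates whose origin
   distance can possibly be within `threshold` of the query's. */
static void candidate_range(const double *origin_dist, size_t n,
                            double query_origin_dist, double threshold,
                            size_t *begin, size_t *end) {
    *begin = lower_bound(origin_dist, n, query_origin_dist - threshold);
    *end   = upper_bound(origin_dist, n, query_origin_dist + threshold);
}
```

<p>The full-distance scan then runs only over that slice instead of the whole database.</p>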
    <div>
      <h3>Transposing data for better AVX</h3>
      <a href="#transposing-data-for-better-avx">
        
      </a>
    </div>
    <p>But we're not done with AVX optimizations! The usual problem with AVX is that the instructions don't normally fit a specific problem. Some serious mind twisting is required to adapt the right instruction to the problem, or to reverse the problem so that a specific instruction can be used. AVX2 doesn't have useful "horizontal" uint16 subtract, multiply and add operations. For example, <code>_mm_hadd_epi16</code> exists, but it's slow and cumbersome.</p><p>Instead, we can twist the problem to make use of the fast "vertical" uint16 operations that are available. For example we can use:</p><ol><li><p>_mm256_sub_epi16</p></li><li><p>_mm256_mullo_epi16</p></li><li><p>and _mm256_add_epi16.</p></li></ol><p>The <code>add</code> would overflow in our case, but fortunately there is the add-saturate _mm256_adds_epu16.</p><p>The saturated <code>add</code> is great and saves us a conversion to uint32. It just adds a small limitation: the threshold passed to the program (i.e., the max squared distance) must fit into a uint16. However, this is fine for us.</p><p>To use these instructions effectively we need to transpose the data in the database. Instead of storing hashes in rows, we can store them in columns:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29kHAIq6T1Mfif5MVP3Jol/7369ea0b4a38a416a4367bb9865996ca/image1-63.png" />
            
            </figure><p>So instead of:</p><ol><li><p>[a1, a2, a3],</p></li><li><p>[b1, b2, b3],</p></li><li><p>[c1, c2, c3],</p></li></ol><p>...</p><p>We can lay it out in memory transposed:</p><ol><li><p>[a1, b1, c1],</p></li><li><p>[a2, b2, c2],</p></li><li><p>[a3, b3, c3],</p></li></ol><p>...</p><p>Now we can load the first byte of 16 hashes using one memory operation. In the next step, we can subtract the first byte of the query hash from all of them with a single instruction, and so on. The algorithm stays exactly the same as defined above; we just make the data easier to load and easier to process for AVX.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-inv-avx2.c#L138-L147">The hot loop code</a> even looks relatively pretty:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Xb2h3rI8mXT1yeiQcVYfn/4a7eb556d7aea0014c3944b301b52b0d/image5-26.png" />
            
            </figure><p>With the well-tuned batch size and short distance size parameters we can <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-12-mmdist/mmdist-short-inv-avx2.c">see the performance of this algorithm</a>:</p>
            <pre><code>$ make short-inv-avx2
Total: 1118.669ms, 1536 items, avg 0.728ms per query, 1373.062 qps</code></pre>
            <p>Whoa! This is pretty awesome. We started from 55ms per query, and we finished with just 0.73ms. There are further micro-optimizations possible, like memory prefetching or using huge pages to reduce page faults, but they have diminishing returns at this point.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4dBsIg3yHloEmyLqUyfETK/3119d166040ef5d6b96d03acc9873120/image7-8.png" />
            
            </figure><p>Roofline model from Denis Bakhvalov's book</p><p>If you are interested in architectural tuning such as this, take a look at <a href="https://book.easyperf.net/perf_book">the new performance book by Denis Bakhvalov</a>. It discusses roofline model analysis, which is pretty much what we did here.</p><p><a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2020-12-mmdist">Do take a look at our code</a> and tell us if we missed some optimization!</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>What an optimization journey! We jumped between memory-bottlenecked and ALU-bottlenecked code. We discussed more sophisticated algorithms, but in the end, a brute-force algorithm — although tuned — gave us the best results.</p><p>To get even better numbers, I experimented with an Nvidia GPU using CUDA. The CUDA intrinsics like <code>vabsdiff4</code> and <code>dp4a</code> fit the problem perfectly. The V100 gave us some amazing numbers, but I wasn't fully satisfied with it. Considering how many AMD Ryzen cores with AVX2 we can get for the cost of a single server-grade GPU, we leaned towards general purpose computing for this particular problem.</p><p>This is a great example of the type of complexities we deal with every day. Making even the best technologies work “at Cloudflare scale” requires thinking outside the box. Sometimes we rewrite the solution dozens of times before we find the optimal one. And sometimes we settle on a brute-force algorithm, just very, very optimized.</p><p>The computation of hashes and image matching are challenging problems that require running very CPU-intensive operations. The CPU we have available on the edge is scarce and workloads like this are incredibly expensive. Even with the optimization work described in this blog post, running the CSAM scanner at scale is a challenge and has required a huge engineering effort. And we’re not done! We need to solve more hard problems before we're satisfied. If you want to help, consider <a href="https://www.cloudflare.com/careers/">applying</a>!</p> ]]></content:encoded>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed]]></category>
            <guid isPermaLink="false">3xsrCKMSJ12rBqyQDNejcu</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Why is there a "V" in SIGSEGV Segmentation Fault?]]></title>
            <link>https://blog.cloudflare.com/why-is-there-a-v-in-sigsegv-segmentation-fault/</link>
            <pubDate>Thu, 18 Jun 2020 11:56:33 GMT</pubDate>
            <description><![CDATA[ My program received a SIGSEGV signal and crashed with "Segmentation Fault" message. Where does the "V" come from? 

Did I read it wrong? Was there a "Segmentation Vault"? Or did the Linux authors make a mistake? Shouldn't the signal be named SIGSEGF? 
 ]]></description>
            <content:encoded><![CDATA[ <p>Another long night. I was working on my perfect, bug-free program in C, when the predictable thing happened:</p>
            <pre><code>$ clang skynet.c -o skynet
$ ./skynet
Segmentation fault (core dumped)</code></pre>
            <p>Oh, well... Maybe I'll have better luck taking over the world another night. But then it struck me. My program received a SIGSEGV signal and crashed with a "Segmentation Fault" message. Where does the "V" come from?</p><p>Did I read it wrong? Was there a "Segmentation <i>V</i>ault"? Or did the Linux authors make a mistake? Shouldn't the signal be named SIGSEGF?</p><p>I asked my colleagues and <a href="https://twitter.com/dwragg">David Wragg</a> quickly told me that the signal name stands for "Segmentation Violation". I guess that makes sense. A long, long time ago, computers used to have <a href="https://en.wikipedia.org/wiki/X86_memory_segmentation">memory segmentation</a>. Each memory segment had a defined length - called the Segment Limit. Accessing data over this limit caused a processor fault. This error code got re-used <a href="https://en.wikipedia.org/wiki/Memory_segmentation#Segmentation_with_paging">by newer systems that used paging</a>. I think the Intel manuals call this error <a href="https://en.wikipedia.org/wiki/Page_fault#Invalid">"Invalid Page Fault"</a>. When it's triggered, it gets reported to userspace as a SIGSEGV signal. End of story.</p><p>Or is it?</p><p><a href="https://twitter.com/mahtin">Martin Levy</a> pointed me to an ancient <a href="http://man.cat-v.org/unix-6th/2/signal">Version 6 UNIX documentation on "signal"</a>. This is from around 1978:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4guHgMnVrZOyzBS0j2vOF3/1794a378bb035a8f938e3c11fb65de19/IMG_5177.jpg" />
            
            </figure><p>Look carefully. There is no SIGSEGV signal! Signal number 11 is called SIGSEG!</p><p>It seems that userspace parts of the UNIX tree (i.e. /usr/include/signal.h) switched to SIGSEGV fairly early on. But the kernel internals continued to use the name SIGSEG for much longer.</p><p>Looking deeper, David found that the PDP11 trap vector used the wording <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V4-Snapshot-Development/sys/ken/low.s#L59">"segmentation violation"</a>. This shows up in Research V4 Edition in the UNIX history repo, but it doesn't mean it was introduced in V4 - it's just because V4 is the first version with code still available.</p><p>This trap was converted into the <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V4-Snapshot-Development/sys/ken/trap.c#L73">SIGSEG signal in the trap.c</a> file.</p><p>The file /usr/include/signal.h appears in the tree for Research V7, <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V7-Snapshot-Development/usr/include/signal.h">with the name SIGSEGV</a>. But the kernel <a href="https://github.com/dspinellis/unix-history-repo/blob/Research-V7-Snapshot-Development/usr/sys/sys/trap.c#L177">still called it SIGSEG at the time</a>.</p><p>It seems the kernel side was <a href="https://github.com/dspinellis/unix-history-repo/blob/BSD-4/usr/src/sys/sys/trap.c#L67">renamed to SIGSEGV in BSD-4</a>.</p><p>Here you go. Originally the signal was called SIGSEG. It was subsequently renamed to SIGSEGV in userspace, and a bit later - around 1980 - on the kernel side. Apparently there are still no Segmentation Vaults found on UNIX systems.</p><p>As for my original crash, I fixed it - of course - by catching the signal and jumping over the offending instruction. On Linux it is totally possible to catch and handle SIGSEGV. With that fix, my code will never again crash. For sure.</p>
            <pre><code>#define _GNU_SOURCE
#include &lt;signal.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;ucontext.h&gt;

static void sighandler(int signo, siginfo_t *si, void* v_context)
{
    /* Skip the faulting instruction by advancing the saved
       instruction pointer into the nop sled below. */
    ucontext_t *context = v_context;
    context-&gt;uc_mcontext.gregs[REG_RIP] += 10;
}

int *totally_null_pointer = NULL;

int main() {
    struct sigaction psa;
    memset(&amp;psa, 0, sizeof(psa));
    psa.sa_flags = SA_SIGINFO; /* use the three-argument sa_sigaction handler */
    psa.sa_sigaction = sighandler;
    sigaction(SIGSEGV, &amp;psa, NULL);

    printf("Before NULL pointer dereference\n");
    *totally_null_pointer = 1;
    __asm__ __volatile__("nop;nop;nop;nop;nop;nop;nop;nop;nop;nop;");
    printf("After NULL pointer. Still here!\n");

    return 0;
}</code></pre>
             ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4vXHuB3JMI6WziJEJtc90H</guid>
            <dc:creator>Marek Majkowski</dc:creator>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[Conntrack tales - one thousand and one flows]]></title>
            <link>https://blog.cloudflare.com/conntrack-tales-one-thousand-and-one-flows/</link>
            <pubDate>Mon, 06 Apr 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ We were wondering - can we just enable Linux "conntrack"? How does it actually work? I volunteered to help the team understand the dark corners of the Linux's "conntrack" stateful firewall subsystem. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we develop new products at a great pace. Their needs often challenge the architectural assumptions we made in the past. For example, years ago we decided to avoid using Linux's "conntrack" - stateful firewall facility. This brought great benefits - it simplified our iptables firewall setup, sped up the system a bit and made the inbound packet path easier to understand.</p><p>But eventually our needs changed. One of our new products had a reasonable need for it. But we weren't confident - can we just enable conntrack and move on? How does it actually work? I volunteered to help the team understand the dark corners of the "conntrack" subsystem.</p>
    <div>
      <h2>What is conntrack?</h2>
      <a href="#what-is-conntrack">
        
      </a>
    </div>
    <p>"Conntrack" is part of the Linux network stack, specifically part of the firewall subsystem. To put that into perspective: early firewalls were entirely stateless. They could express only basic logic, like: allow SYN packets to ports 80 and 443, and block everything else.</p><p>The stateless design gave some basic <a href="https://www.cloudflare.com/learning/network-layer/network-security/">network security</a>, but was quickly deemed insufficient. You see, there are certain things that can't be expressed in a stateless way. The canonical example is the assessment of ACK packets - it's impossible to say if an ACK packet is legitimate or part of a port scanning attempt, without tracking the connection state.</p><p>To fill such gaps, all the operating systems implemented connection tracking inside their firewalls. This tracking is usually implemented as a big table, with at least 6 columns: protocol (usually TCP or UDP), source IP, source port, destination IP, destination port and connection state. On Linux this subsystem is called "conntrack" and is often enabled by default. Here's how the table looks on my laptop, inspected with the "conntrack -L" command:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57PxFTDUBFZqODrE5OrOul/a9d34a54bf247021ba000b15119461bc/image5-2.png" />
            
            </figure><p>The obvious question is how large this state tracking table can be. This setting is under "/proc/sys/net/nf_conntrack_max":</p>
            <pre><code>$ cat /proc/sys/net/nf_conntrack_max
262144</code></pre>
            <p>This is a global setting, but the limit is per container. On my system each container, or "network namespace", can have up to 256K conntrack entries.</p><p>What exactly happens when the number of concurrent connections exceeds the conntrack limit?</p>
    <div>
      <h2>Testing conntrack is hard</h2>
      <a href="#testing-conntrack-is-hard">
        
      </a>
    </div>
    <p>In the past, testing conntrack was hard - it required a complex hardware or VM setup. Fortunately, these days we can use modern "user namespace" facilities which do permission magic, allowing an unprivileged user to feel like root. Using the tool "unshare" it's possible to create an isolated environment where we can precisely control the packets going through, and experiment with iptables and conntrack without threatening the health of our host system. With appropriate parameters, it's possible to create and manage a networking namespace, including access to namespaced iptables and conntrack, from an unprivileged user.</p><p>This script is the heart of our test:</p>
            <pre><code># Enable tun interface
ip tuntap add name tun0 mode tun
ip link set tun0 up
ip addr add 192.0.2.1 peer 192.0.2.2 dev tun0
ip route add 0.0.0.0/0 via 192.0.2.2 dev tun0

# Refer to conntrack at least once to ensure it's enabled
iptables -t raw -A PREROUTING -j CT
# Create a counter in mangle table
iptables -t mangle -A PREROUTING
# Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP

tcpdump -ni any -B 16384 -ttt &amp;
...
./venv/bin/python3 send_syn.py

conntrack -L
# Show iptables counters
iptables -nvx -t raw -L PREROUTING
iptables -nvx -t mangle -L PREROUTING</code></pre>
            <p>This bash script is shortened for readability. See the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-04-conntrack-syn/test-1.bash">full version here</a>. The accompanying "send_syn.py" is just sending 10 SYN packets over the "tun0" interface. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-04-conntrack-syn/send_syn.py">Here is the source</a>, but allow me to paste it here - showing off "scapy" is always fun:</p>
            <pre><code>tun = TunTapInterface("tun0", mode_tun=True)
tun.open()

for i in range(10000,10000+10):
    ip=IP(src="198.18.0.2", dst="192.0.2.1")
    tcp=TCP(sport=i, dport=80, flags="S")
    send(ip/tcp, verbose=False, inter=0.01, socket=tun)</code></pre>
            <p>The bash script above contains a couple of gems. Let's walk through them.</p><p>First, please note that we can't just inject packets into the loopback interface using <a href="http://man7.org/linux/man-pages/man7/raw.7.html">SOCK_RAW sockets</a>. The Linux networking stack is a complex beast. The semantics of sending packets over a SOCK_RAW socket are different from delivering a packet over a real interface. We'll discuss this later, but for now, to avoid triggering unexpected behaviour, we will deliver packets over a tun/tap device, which better emulates a real interface.</p><p>Then we need to make sure conntrack is active in the network namespace we wish to use for testing. Traditionally, just loading the kernel module would have done that, but in the brave new world of containers and network namespaces, a method had to be found to allow conntrack to be active in some containers and inactive in others. Hence this is tied to usage - rules referencing conntrack must exist in the namespace's iptables for conntrack to be active inside the container.</p><p>As a side note, <a href="https://lwn.net/Articles/740455/">containers triggering the host to load kernel modules</a> is an <a href="https://github.com/weaveworks/go-odp/blob/6b0aa22550d9325eb8f43418185859e13dc0de1d/odp/dpif.go#L67-L90">interesting subject</a>.</p><p>After the "-t raw -A PREROUTING" rule, we added a "-t mangle -A PREROUTING" rule - but notice, it doesn't have any action! This syntax is allowed by iptables, and it is pretty useful for getting iptables to report rule counters. We'll need these counters soon. A careful reader might suggest looking at "policy" counters in iptables to achieve our goal. Sadly, "policy" counters (increased for each packet entering a chain) work only if there is at least one rule inside the chain.</p><p>The rest of the steps are self-explanatory. We set up "tcpdump" in the background and send 10 SYN packets to 192.0.2.1:80 using the "scapy" Python library. Then we print the conntrack table and iptables counters.</p><p>Let's see this script in action. Remember to run it in a networking namespace as fake root, with "unshare -Ur -n":</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xUiZXoGQrhYU3Lugqbx97/9db644c8c56f944445fc7ebb559d6d95/image6.png" />
            
            </figure><p>This is all nice. First we see a "tcpdump" listing showing 10 SYN packets. Then we see the conntrack table state, showing 10 created flows. Finally, we see iptables counters in two rules we created, each showing 10 packets processed.</p>
    <div>
      <h2>Can conntrack table fill up?</h2>
      <a href="#can-conntrack-table-fill-up">
        
      </a>
    </div>
    <p>Given that the conntrack table is size constrained, what exactly happens when it fills up? Let's check it out. First, we need to drop the conntrack size. As mentioned it's controlled by a global toggle - it's necessary to tune it on the host side. Let's reduce the table size to 7 entries, and repeat our test:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2iH78254UDI6cAFqqiAWcy/92aafe709b997fcc234279d2966077c1/image4-3.png" />
            
            </figure><p>This is getting interesting. We still see the 10 inbound SYN packets. We still see that the "-t raw PREROUTING" table received 10 packets, but this is where similarities end. The "-t mangle PREROUTING" table saw only 7 packets. Where did the three missing SYN packets go?</p><p>It turns out they went where all the dead packets go. They were hard dropped. Conntrack on overfill does exactly that. It even complains in the "dmesg":</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Gtrs2Na2a4NKQbNllKw8w/ddf3755fff4e9b89dc455260b8a14107/image1-1.png" />
            
            </figure><p>This is confirmed by our iptables counters. Let's review the <a href="https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg">famous iptables</a> diagram:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wBlH7JteM8K0KVBp6P6CG/c1d17141b74544ca9809714354bf05ca/image7.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Netfilter-packet-flow.svg">image</a> by <a href="https://commons.wikimedia.org/wiki/User_talk:Jengelh">Jan Engelhardt</a> CC BY-SA 3.0</p><p>As we can see, the "-t raw PREROUTING" happens before conntrack, while "-t mangle PREROUTING" is just after it. This is why we see 10 and 7 packets reported by our iptables counters.</p><p>Let me emphasize the gravity of our discovery. We showed three completely valid SYN packets being implicitly dropped by "conntrack". There is no explicit "-j DROP" iptables rule. There is no configuration to be toggled. Just the fact of using "conntrack" means that, when it's full, packets creating new flows will be dropped. No questions asked.</p><p>This is the dark side of using conntrack. If you use it, you absolutely must make sure it doesn't get filled.</p><p>We could end our investigation here, but there are a couple of interesting caveats.</p>
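<p>The behaviour we just observed can be modeled with a toy flow table in Python - a sketch of the concept only, nothing like the kernel's actual implementation:</p>

```python
class ToyConntrack:
    """Toy model of a fixed-size flow table: packets belonging to a known
    flow always pass, while packets that would create a new flow are
    silently dropped once the table is full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.flows = set()

    def accept(self, proto, saddr, sport, daddr, dport):
        key = (proto, saddr, sport, daddr, dport)
        if key in self.flows:
            return True   # existing flow - always passes
        if len(self.flows) >= self.max_entries:
            return False  # table full - new flow implicitly dropped
        self.flows.add(key)
        return True

# Replay our experiment: a table of 7 entries, 10 SYNs from distinct ports.
ct = ToyConntrack(max_entries=7)
verdicts = [ct.accept("tcp", "198.18.0.2", 10000 + i, "192.0.2.1", 80)
            for i in range(10)]
# The first 7 SYNs create flows; the last 3 are dropped, matching the counters.
```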
    <div>
      <h2>Strict vs loose</h2>
      <a href="#strict-vs-loose">
        
      </a>
    </div>
    <p>Conntrack supports "strict" and "loose" modes, as configured by the "nf_conntrack_tcp_loose" toggle.</p>
            <pre><code>$ cat /proc/sys/net/netfilter/nf_conntrack_tcp_loose
1</code></pre>
            <p>By default, it's set to "loose", which means that stray ACK packets for unseen TCP flows will create new flow entries in the table. We can generalize: when the table is full, "conntrack" will implicitly drop any packet that creates a new flow, whether that's a SYN or just a stray ACK.</p><p>What happens when we clear the setting with "nf_conntrack_tcp_loose=0"? This is a subject for another blog post, but suffice to say - it's a mess. First, this setting is not settable in the network namespace scope - although it should be. To test it you need to be in the root network namespace. Then, due to twisted logic, the ACK will be dropped on a full conntrack table, even though in this case it doesn't create a flow. If the table is not full, the ACK packet will pass through, marked "-ctstate INVALID" from the "mangle" table onward.</p>
    <div>
      <h2>When doesn't a conntrack entry get created?</h2>
      <a href="#when-doesnt-a-conntrack-entry-get-created">
        
      </a>
    </div>
    <p>There are important situations when a conntrack entry is not created. For example, we could replace these lines in our script:</p>
            <pre><code># Make sure reverse traffic doesn't affect conntrack state
iptables -t raw -A OUTPUT -p tcp --sport 80 -j DROP</code></pre>
            <p>With those:</p>
            <pre><code># Make sure inbound SYN packets don't go to networking stack
iptables -A INPUT -j DROP</code></pre>
            <p>Naively we could think dropping SYN packets past the conntrack layer would not interfere with the created flows. This is not correct. In spite of these SYN packets having been seen by conntrack, no flow state is created for them. Packets hitting "-j DROP" will not create new conntrack flows. Pretty magical, isn't it?</p>
    <div>
      <h2>Full conntrack table causes EPERM</h2>
      <a href="#full-conntrack-causes-with-eperm">
        
      </a>
    </div>
    <p>Recently we hit a case when a "sendto()" syscall on UDP socket from one of our applications was erroring with EPERM. This is pretty weird, and not documented in the man page. My colleague had no doubts:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4qoszkkHAOaErPNjv6cwPZ/c22d83083c706014dff1c47dd3beb517/image9-1.png" />
            
            </figure><p>I'll save you the gruesome details, but indeed, a full conntrack table will do that to your new UDP flows - you will get EPERM. Beware. Funnily enough, it's possible to get EPERM in other ways too, when an outbound packet is dropped in the OUTPUT firewall chain. For example:</p>
            <pre><code>marek:~$ sudo iptables -I OUTPUT -p udp --dport 53 --dst 192.0.2.8 -j DROP
marek:~$ strace -e trace=write nc -vu 192.0.2.8 53
write(3, "X", 1)                        = -1 EPERM (Operation not permitted)
+++ exited with 1 +++</code></pre>
            <p>If you ever receive EPERM from "sendto()", you might want to treat it as a transient error if you suspect a full conntrack table, or as a permanent error if you blame the iptables configuration.</p><p>This is also why we can't send our SYN packets directly using SOCK_RAW sockets in our test. Let's see what happens on conntrack overfill with the standard "hping3" tool:</p>
            <pre><code>$ hping3 -S -i u10000 -c 10 --spoof 198.18.0.2 192.0.2.1 -p 80 -I lo
HPING 192.0.2.1 (lo 192.0.2.1): S set, 40 headers + 0 data bytes
[send_ip] sendto: Operation not permitted</code></pre>
            <p>"send()" even on a SOCK_RAW socket fails with EPERM when the conntrack table is full.</p>
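<p>If your application has to live with this, one pragmatic approach is to treat EPERM from "sendto()" as potentially transient and retry. A hypothetical sketch - the helper name and retry policy are mine, not from any library:</p>

```python
import time

def send_with_retry(sock, payload, addr, attempts=3, backoff=0.05):
    """Call sock.sendto(), retrying on EPERM, which Python surfaces as
    PermissionError. A full conntrack table can drain on its own, so a
    few retries with backoff may succeed; the last failure propagates."""
    for attempt in range(attempts):
        try:
            return sock.sendto(payload, addr)
        except PermissionError:
            if attempt == attempts - 1:
                raise  # still failing - give up and report EPERM
            time.sleep(backoff * (attempt + 1))
```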
    <div>
      <h2>Full conntrack can happen on a SYN flood</h2>
      <a href="#full-conntrack-can-happen-on-a-syn-flood">
        
      </a>
    </div>
    <p>There is one more caveat. During a SYN flood, the conntrack entries will totally be created for the spoofed flows. Take a look at the second test case we prepared, this time correctly listening on port 80 and sending SYN+ACKs:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2vNW5JQuRFZyumzKOrIQVW/ea983dcbf2e8a876a8f3264b3104d4d4/image8.png" />
            
            </figure><p>We can see 7 SYN+ACKs flying out of the port 80 listening socket. The final three SYNs go nowhere, as they are dropped by conntrack.</p><p>This has important implications. If you use conntrack on publicly accessible ports, then during a SYN flood <a href="/syn-packet-handling-in-the-wild/">mitigation technologies like SYN Cookies</a> won't help. You are still at risk of running out of conntrack space and therefore affecting legitimate connections.</p><p>For this reason, as a general rule, consider avoiding conntrack on inbound connections ("-j NOTRACK"). Alternatively, apply some reasonable rate limits at the iptables layer with "-j DROP". This will work well and won't create new flows, as we discussed above. The best method, though, would be to trigger SYN Cookies from a layer before conntrack, like XDP. But that is a subject for another time.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>Over the years Linux conntrack has gone through many changes and has improved a lot. While performance used to be a major concern, these days it's considered to be very fast. Dark corners remain. Correctly applying conntrack is tricky.</p><p>In this blog post we showed how it's possible to test parts of conntrack with "unshare" and a series of scripts. We showed the behaviour when the conntrack table gets filled - packets might implicitly be dropped. Finally, we mentioned the curious case of SYN floods where incorrectly applied conntrack may cause harm.</p><p>Stay tuned for more horror stories as we dig deeper and deeper into the Linux networking stack guts.</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">13hrLdB4ySqi2j6KpGDkBy</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[When Bloom filters don't bloom]]></title>
            <link>https://blog.cloudflare.com/when-bloom-filters-dont-bloom/</link>
            <pubDate>Mon, 02 Mar 2020 13:00:00 GMT</pubDate>
            <description><![CDATA[ Last month finally I had an opportunity to use Bloom filters. I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4bQ9cbvVLCJTUntwCHSGSp/570583980831b19e4da88411fdff5eda/bloom-filter_2x.png" />
            
            </figure><p>I've known about <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a> (named after Burton Bloom) since university, but I hadn't had an opportunity to use them in anger. Last month that changed - I became fascinated with the promise of this data structure, but I quickly realized it had some drawbacks. This blog post is the tale of my brief love affair with Bloom filters.</p><p>While doing research about <a href="/the-root-cause-of-large-ddos-ip-spoofing/">IP spoofing</a>, I needed to examine whether the source IP addresses extracted from packets reaching our servers were legitimate, depending on the geographical location of our data centers. For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian data center. This problem might sound simple, but in the ever-evolving landscape of the Internet this is far from easy. Suffice it to say I ended up with many large text files with data like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4EyhhPLD0IymvVMr3R8pMz/2c30f18b77fdee9184c438f80e94781b/Screenshot-from-2020-03-01-23-57-10.png" />
            
            </figure><p>This reads as: the IP 192.0.2.1 was recorded reaching Cloudflare data center number 107 with a legitimate request. This data came from many sources, including our active and passive probes, logs of certain domains we own (like cloudflare.com), public sources (like the BGP table), etc. The same line would usually be repeated across multiple files.</p><p>I ended up with a gigantic collection of data of this kind. At some point I counted 1 billion lines across all the harvested sources. I usually write bash scripts to pre-process the inputs, but at this scale this approach wasn't working. For example, removing duplicates from this tiny file of a meager 600MiB and 40M lines took... about an eternity:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lzqpR8gHJRr6XC4fbcepJ/91185ef0b6036ecd484bd791f5a72651/Screenshot-from-2020-03-01-23-25-19a.png" />
            
            </figure><p>Needless to say, deduplicating lines using the usual bash commands like 'sort' in various configurations (see '--parallel', '--buffer-size' and '--unique') was not optimal for such a large data set.</p>
    <div>
      <h2>Bloom filters to the rescue</h2>
      <a href="#bloom-filters-to-the-rescue">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vR5EB9TdKmVJDjStuvpwL/fe51959f24694cc5c072f3263aa42ab0/Bloom_filter.png" />
            
            </figure><p><a href="https://en.wikipedia.org/wiki/Bloom_filter#/media/File:Bloom_filter.svg">Image</a> by <a href="https://commons.wikimedia.org/wiki/User:David_Eppstein">David Eppstein</a> Public Domain</p><p>Then I had a brainwave - it's not necessary to sort the lines! I just need to remove duplicated lines - using some kind of "set" data structure should be much faster. Furthermore, I roughly know the cardinality of the input file (number of unique lines), and I can live with some data points being lost - using a probabilistic data structure is fine!</p><p>Bloom filters are a perfect fit!</p><p>While you should go and read <a href="https://en.wikipedia.org/wiki/Bloom_filter#Algorithm_description">Wikipedia on Bloom Filters</a>, here is how I look at this data structure.</p><p>How would you implement a "<a href="https://en.wikipedia.org/wiki/Set_(abstract_data_type)">set</a>"? Given a perfect hash function and infinite memory, we could just create an infinite bit array and set bit number 'hash(item)' for each item we encounter. This would give us a perfect "set" data structure. Right? Trivial. Sadly, hash functions have collisions and infinite memory doesn't exist, so in reality we have to compromise. But we can calculate and manage the probability of collisions. For example, imagine we have a good hash function and 128GiB of memory. 128GiB is 1,099,511,627,776 bits, so the probability that the second item added to the bit array collides with the first is 1 in 1,099,511,627,776. The probability of collision worsens as we add more items and fill up the bit array.</p><p>Furthermore, we could use more than one hash function, and end up with a denser bit array. This is exactly what Bloom filters optimize for. A Bloom filter is a bunch of math on top of four variables:</p><ul><li><p>'n' - The number of input elements (cardinality)</p></li><li><p>'m' - Memory used by the bit-array</p></li><li><p>'k' - Number of hash functions computed for each input</p></li><li><p>'p' - Probability of a false positive match</p></li></ul><p>Given the 'n' input cardinality and the 'p' desired probability of false positives, the Bloom filter math returns the 'm' memory required and the 'k' number of hash functions needed.</p><p>Check out this excellent visualization by Thomas Hurst showing how the parameters influence each other:</p><ul><li><p><a href="https://hur.st/bloomfilter/">https://hur.st/bloomfilter/</a></p></li></ul>
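<p>The math tying the four variables together can be sketched in a few lines of Python - these are the standard optimal-Bloom-filter formulas, not code from mmuniq-bloom:</p>

```python
import math

def bloom_params(n, p):
    """Given expected cardinality n and target false positive rate p,
    return (m, k): the bits of memory and the number of hash functions,
    using the standard optimal Bloom filter formulas."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k

# Example: 40M unique lines with a 1-in-10,000 false positive rate needs
# roughly 767M bits (~91 MiB) of memory and 13 hash functions.
m, k = bloom_params(40_000_000, 0.0001)
```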
    <div>
      <h2>mmuniq-bloom</h2>
      <a href="#mmuniq-bloom">
        
      </a>
    </div>
    <p>Guided by this intuition, I set out on a journey to add a new tool to my toolbox - 'mmuniq-bloom', a probabilistic tool that, given input on STDIN, returns only unique lines on STDOUT, hopefully much faster than the 'sort' + 'uniq' combo!</p><p>Here it is:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-bloom.c">'mmuniq-bloom.c'</a></p></li></ul><p>For simplicity and speed I designed 'mmuniq-bloom' with a couple of assumptions. First, unless otherwise instructed, it uses k=8 hash functions. This seems to be a close-to-optimal number for the data sizes I'm working with, and the hash function can quickly output 8 decent hashes. Then we align 'm', the number of bits in the bit array, to be a power of two. This is to avoid the pricey % modulo operation, which compiles down to the slow assembly 'div'. With power-of-two sizes we can just do a bitwise AND. (For a fun read, see <a href="https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi">how compilers can optimize some divisions by using multiplication by a magic constant</a>.)</p><p>We can now run it against the same data file we used before:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWA2sCxdTwbHW0JVE4dLm/ffffd6f6823305f41a15d0091d1acaef/image11.png" />
            
            </figure><p>Oh, this is so much better! 12 seconds is much more manageable than the 2 minutes before. But hold on... The program is using an optimized data structure, a relatively limited memory footprint, optimized line-parsing and good output buffering... 12 seconds is still an eternity compared to the 'wc -l' tool:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/10yS38GPKuBTqMMd4ZJVSe/199b9ceea61eb91700012362ef92e0ab/image5.png" />
            
            </figure><p>What is going on? I understand that counting lines with 'wc' is <i>easier</i> than figuring out unique lines, but is it really worth the 26x difference? Where does all the CPU in 'mmuniq-bloom' go?</p><p>It must be my hash function. 'wc' doesn't need to spend all this CPU performing strange math for each of the 40M input lines. I'm using a pretty non-trivial 'siphash24' hash function, so it surely burns the CPU, right? Let's check by running code that computes the hash function but does <i>not</i> perform any Bloom filter operations:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4cmgeZbT0nJNh9bwRC4gEH/b5a71c8c573d506f20f6b6bc01173028/image2.png" />
            
            </figure><p>This is strange. Computing the hash function indeed costs about 2s, but the program took 12s in the previous run. The Bloom filter alone takes 10 seconds? How is that possible? It's such a simple data structure...</p>
    <div>
      <h2>A secret weapon - a profiler</h2>
      <a href="#a-secret-weapon-a-profiler">
        
      </a>
    </div>
    <p>It was time to use a proper tool for the task - let's fire up a profiler and see where the CPU goes. First, let's run 'strace' to confirm we are not making any unexpected syscalls:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29N25pGnOTso9WBxn4SckO/4a8bbb0dccf4dfb5fd2a248625d29bd9/image14.png" />
            
            </figure><p>Everything looks good. The 10 calls to 'mmap', each taking 4ms (3971 us), are intriguing, but that's fine. We pre-populate memory up front with 'MAP_POPULATE' to save on page faults later.</p><p>What is the next step? Of course Linux's 'perf'!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/632ygHAtmgUaNH18WsUPAG/b9f3990786d96bf84b2bc28d5a55f0e3/image10.png" />
            
            </figure><p>Then we can see the results:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61GXvk0ujwFsP0vEDyrD6c/ab4ec96b0e748f5dd8e6e496d1b499b4/image6.png" />
            
            </figure><p>Right, so we indeed burn 87.2% of cycles in our hot code. Let's see where exactly. Doing 'perf annotate process_line --source' quickly shows something I didn't expect.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3i10E5RjlO3vvJTuA751wv/da7db1a7614e3936218206689767e752/image3.png" />
            
            </figure><p>You can see 26.90% of the CPU burned in the 'mov', but that's not all of it! The compiler correctly inlined the function and unrolled the loop 8-fold. Summed up, that 'mov' - or rather the 'uint64_t v = *p' line - accounts for the great majority of cycles!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sTjVPwEz6BILccJz8pdDa/2e3435805a1e092460dcb916f4d6ecb2/image4.png" />
            
            </figure><p>Clearly 'perf' must be mistaken: how can such a simple line cost so much? We can repeat the benchmark with any other profiler and it will show us the same problem. For example, I like using 'google-perftools' with kcachegrind since they emit eye-candy charts:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7fWWh8EZelfOMfwA5XMT16/4ded280402c936f2a6f67e2930115f5e/Screenshot-from-2020-03-02-00-08-23.png" />
            
            </figure><p>The rendered result looks like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2p5hkIX3zVAO7IFFWEVZuF/14f1f9505b91e15c904dab7309a11907/image13.png" />
            
            </figure><p>Allow me to summarise what we found so far.</p><p>The generic 'wc' tool takes 0.45s of CPU time to process the 600MiB file. Our optimized 'mmuniq-bloom' tool takes 12 seconds. The CPU is burned on one 'mov' instruction, dereferencing memory...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1eRwMPBq6BEAyWB0g10EDo/3802664ab1b47d94b96845f487338a08/6784957048_4661ea7dfc_c.jpg" />
            
            </figure><p><a href="https://flickr.com/photos/jonicdao/6784957048">Image</a> by <a href="https://flickr.com/photos/jonicdao/">Jose Nicdao</a> CC BY/2.0</p><p>Oh! How could I have forgotten? Random memory access <i>is</i> slow! It's very, very, very slow!</p><p>According to the general rule <a href="http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html">"latency numbers every programmer should know about"</a>, one RAM fetch is about 100ns. Let's do the math: 40 million lines, 8 hashes counted for each line. Since our Bloom filter is 128MiB, on <a href="/gen-x-performance-tuning/">our older hardware</a> it doesn't fit into the L3 cache! The hashes are uniformly distributed across the large memory range - each hash generates a memory miss. Adding it together that's...</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ORPcRAqG2H2xqeEEdGbmh/ef6968c82bfe0aa44706a4f36e59bb1c/Screenshot-from-2020-03-02-00-34-29.png" />
            
            </figure><p>That suggests 32 seconds burned just on memory fetches. The real program is faster, taking only 12s. This is because, although the Bloom filter data does not completely fit into L3 cache, it still gets some benefit from caching. It's easy to see with 'perf stat -d':</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7pgVS7zKpVhueO7rAhQDRI/5145db41e7b06f5875d1da4adfa92142/image9.png" />
            
            </figure><p>Right, so we should have had at least 320M LLC-load-misses, but we had only 280M. This still doesn't explain why the program was running only 12 seconds. But it doesn't really matter. What matters is that the number of cache misses is a real problem and we can only fix it by reducing the number of memory accesses. Let's try tuning the Bloom filter to use only one hash function:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/30BFuINOqYvVgCCdYXCzUB/0893742e7bc41b923af6e67f53629049/image12.png" />
            
            </figure><p>Ouch! That really hurt! The Bloom filter required 64 GiB of memory to get our desired false positive ratio of one error per 10k lines. This is terrible!</p><p>Also, it doesn't seem like we improved much. It took the OS 22 seconds to prepare memory for us, but we still burned 11 seconds in userspace. I guess any benefits from hitting memory less often were offset by a lower cache-hit probability due to the drastically increased memory size. In the previous runs we required only 128MiB for the Bloom filter!</p>
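<p>For reference, the back-of-the-envelope estimate for the original 8-hash run is easy to reproduce in a couple of lines, assuming the ~100ns RAM fetch figure quoted earlier:</p>

```python
lines = 40_000_000       # input lines
hashes_per_line = 8      # k = 8 Bloom filter hashes
fetch_ns = 100           # rough cost of one uncached RAM access

# Worst case, if every hash lookup missed all caches:
total_seconds = lines * hashes_per_line * fetch_ns / 1e9
print(total_seconds)  # 32.0 seconds of pure memory latency
```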
    <div>
      <h2>Dumping Bloom filters altogether</h2>
      <a href="#dumping-bloom-filters-altogether">
        
      </a>
    </div>
    <p>This is getting ridiculous. To get the same false positive guarantees, we either must use many hash functions in the Bloom filter (like 8) and therefore many memory operations, or we can have 1 hash function, but enormous memory requirements.</p><p>We aren't really constrained by available memory; instead we want to optimize for fewer memory accesses. All we need is a data structure that requires at most 1 memory miss per item and uses less than 64 GiB of RAM...</p><p>While we could think of more sophisticated data structures like a <a href="https://en.wikipedia.org/wiki/Cuckoo_filter">Cuckoo filter</a>, maybe we can go simpler. How about a good old hash table with linear probing?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PVz5hd2DqyiraxgMJ7KIp/37706f6ffcc1cd52f544d626381deaeb/linear-probing.png" />
            
            </figure><p><a href="https://www.sysadmins.lv/blog-en/array-search-hash-tables-behind-the-scenes.aspx">Image</a> by <a href="https://www.sysadmins.lv/about.aspx">Vadims Podāns</a></p>
    <div>
      <h2>Welcome mmuniq-hash</h2>
      <a href="#welcome-mmuniq-hash">
        
      </a>
    </div>
    <p>Here you can find a tweaked version of mmuniq-bloom, this time using a hash table:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq-hash.c">'mmuniq-hash.c'</a></p></li></ul><p>Instead of storing bits as for the Bloom filter, we are now storing 64-bit hashes from the <a href="https://idea.popcount.org/2013-01-24-siphash/">'siphash24' function</a>. This gives us much stronger probability guarantees, with a probability of false positives much better than one error in 10k lines.</p><p>Let's do the math. Adding a new item to a hash table containing, say, 40M entries has a '40M/2^64' chance of hitting a hash collision. This is about one in 461 billion - a reasonably low probability. But we are not adding one item to a pre-filled set! Instead, we are adding 40M lines to an initially empty set. As per the <a href="https://en.wikipedia.org/wiki/Birthday_problem">birthday paradox</a>, this has much higher chances of hitting a collision at some point. A decent approximation is 'n^2/2m', which in our case is '(40M)<sup>2</sup>/(2*2<sup>64</sup>)'. This is a chance of one in 23000. In other words, assuming we are using a good hash function, one in every 23 thousand random sets of 40M items will have a hash collision. This chance of hitting a collision is non-negligible, but it's still better than with a Bloom filter and totally acceptable for my use case.</p><p>The hash table code runs faster, has better memory access patterns and a better false positive probability than the Bloom filter approach.</p>
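<p>The two probability estimates above are quick to verify numerically:</p>

```python
n = 40_000_000          # lines in the input
space = 2 ** 64         # distinct 64-bit hash values

p_single = n / space             # chance one extra item collides with the set
p_any = n * n / (2 * space)      # birthday-bound approximation n^2 / 2m

print(round(1 / p_single / 1e9))  # ~461, i.e. one in 461 billion
print(round(1 / p_any))           # ~23000 runs per expected collision
```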
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7i3tnYhlVqJ7NEfutgywVD/dafe27be80727cd61533f548457bc4a3/image7.png" />
            
            </figure><p>Don't be scared by the "hash conflicts" line; it just indicates how full the hash table was. We are using linear probing, so when a bucket is already used, we just pick the next bucket. In our case we had to skip over 0.7 buckets on average to find an empty slot in the table. This is fine and, since we iterate over the buckets in linear order, we can expect the memory to be nicely prefetched.</p><p>From the previous exercise we know our hash function takes about 2 seconds of this time. Therefore, it's fair to say the 40M memory hits take around 4 seconds.</p>
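<p>The core idea can be sketched in a few lines of Python - an illustration of linear probing, not the actual C from 'mmuniq-hash.c'. Note the sketch treats hash value 0 as "empty slot", which a real implementation must handle specially:</p>

```python
class ProbingSet:
    """Open-addressing set of 64-bit hashes with linear probing."""
    def __init__(self, capacity=1 << 16):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self.slots = [0] * capacity   # 0 means empty; size well above item count
        self.mask = capacity - 1

    def check_and_add(self, h: int) -> bool:
        """Return True if hash h was already present, else insert it."""
        i = h & self.mask
        while self.slots[i] != 0:      # walk consecutive buckets...
            if self.slots[i] == h:
                return True            # ...until we find our hash
            i = (i + 1) & self.mask    # ...or an empty slot (wrapping around)
        self.slots[i] = h
        return False
```

One probe sequence touches consecutive buckets, so after the first (likely missing) fetch, the following buckets usually sit in the same or an already-prefetched cache line.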
    <div>
      <h2>Lessons learned</h2>
      <a href="#lessons-learned">
        
      </a>
    </div>
    <p>Modern CPUs are really good at sequential memory access when it's possible to predict memory fetch patterns (see <a href="https://en.wikipedia.org/wiki/Cache_prefetching#Methods_of_hardware_prefetching">Cache prefetching</a>). Random memory access, on the other hand, is very costly.</p><p>Advanced data structures are very interesting, but beware. Modern computers require cache-optimized algorithms. When working with large datasets that don't fit in L3, prefer optimizing for a reduced number of loads over optimizing the amount of memory used.</p><p>I guess it's fair to say that Bloom filters are great, as long as they fit into the L3 cache. The moment this assumption is broken, they are terrible. This is not news: Bloom filters optimize for memory usage, not for memory access. For example, see <a href="https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf">the Cuckoo Filters paper</a>.</p><p>Another thing is the everlasting discussion about hash functions. Frankly, in most cases it doesn't matter. The cost of computing even complex hash functions like 'siphash24' is small compared to the cost of random memory access. In our case, simplifying the hash function would bring only small benefits. The CPU time is simply spent somewhere else - waiting for memory!</p><p>One colleague often says: "You can assume modern CPUs are infinitely fast. They run at infinite speed until they <a href="http://www.di-srv.unisa.it/~vitsca/SC-2011/DesignPrinciplesMulticoreProcessors/Wulf1995.pdf">hit the memory wall</a>".</p><p>Finally, don't repeat my mistakes - everyone should start profiling with 'perf stat -d' and look at the "Instructions per cycle" (IPC) counter. If it's below 1, it generally means the program is stuck waiting for memory. Values above 2 would be great; it would mean the workload is mostly CPU-bound. Sadly, I have yet to see high values in the workloads I'm dealing with...</p>
    <div>
      <h2>Improved mmuniq</h2>
      <a href="#improved-mmuniq">
        
      </a>
    </div>
    <p>With the help of my colleagues I've prepared a further improved version of the 'mmuniq' hash table based tool. See the code:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2020-02-mmuniq/mmuniq.c">'mmuniq.c'</a></p></li></ul><p>It is able to dynamically resize the hash table, to support inputs of unknown cardinality. Then, by using batching, it can effectively use the 'prefetch' CPU hint, speeding up the program by 35-40%. Beware, sprinkling the code with 'prefetch' rarely works. Instead, I specifically changed the flow of algorithms to take advantage of this instruction. With all the improvements I got the run time down to 2.1 seconds:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6cnwoob2q1SVG27I7zqals/87621808c31bd4707e08e816a4a58480/Screenshot-from-2020-03-01-23-52-18.png" />
            
            </figure>
    <div>
      <h2>The end</h2>
      <a href="#the-end">
        
      </a>
    </div>
    <p>Writing this basic tool, which tries to be faster than the 'sort | uniq' combo, revealed some hidden gems of modern computing. With a bit of work we were able to speed it up from more than two minutes to 2 seconds. During this journey we learned about random memory access latency and the power of cache-friendly data structures. Fancy data structures are exciting, but in practice reducing random memory loads often brings better results.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Tools]]></category>
            <guid isPermaLink="false">3CPWTXjZJXbtWVNIawBWsd</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[When TCP sockets refuse to die]]></title>
            <link>https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/</link>
            <pubDate>Fri, 20 Sep 2019 15:53:33 GMT</pubDate>
            <description><![CDATA[ We noticed something weird - the TCP sockets which we thought should have been closed - were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!

We naively thought enabling TCP keepalives would be enough... but it isn't! ]]></description>
            <content:encoded><![CDATA[ <p>While working on our <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum server</a>, we noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4J7NyByY5rLMwGCjildxuX/a80e344de39529860fe89230fff4259c/Tcp_state_diagram_fixed_new.svga.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg">Image</a> by Sergiodc2 CC BY SA 3.0</p><p>In our code, we wanted to make sure we don't hold connections to dead hosts. In our early code we naively thought enabling TCP keepalives would be enough... but it isn't. It turns out a fairly modern <a href="https://tools.ietf.org/html/rfc5482">TCP_USER_TIMEOUT</a> socket option is equally important. Furthermore, it interacts with TCP keepalives in subtle ways. <a href="http://codearcana.com/posts/2015/08/28/tcp-keepalive-is-a-lie.html">Many people</a> are confused by this.</p><p>In this blog post, we'll try to show how these options work. We'll show how a TCP socket can time out during various stages of its lifetime, and how TCP keepalives and user timeout influence that. To better illustrate the internals of TCP connections, we'll mix the outputs of the <code>tcpdump</code> and the <code>ss -o</code> commands. This nicely shows the transmitted packets and the changing parameters of the TCP connections.</p>
    <div>
      <h2>SYN-SENT</h2>
      <a href="#syn-sent">
        
      </a>
    </div>
    <p>Let's start from the simplest case - what happens when one attempts to establish a connection to a server which discards inbound SYN packets?</p><p>The scripts used here <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">are available on our GitHub</a>.</p><p><code>$ sudo ./test-syn-sent.py
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,940ms,0)

00:01.028 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.044 IP host.2 &gt; host.1: Flags [S] # second retry
00:07.236 IP host.2 &gt; host.1: Flags [S] # third retry
00:15.427 IP host.2 &gt; host.1: Flags [S] # fourth retry
00:31.560 IP host.2 &gt; host.1: Flags [S] # fifth retry
01:04.324 IP host.2 &gt; host.1: Flags [S] # sixth retry
02:10.000 connect ETIMEDOUT</code></p><p>Ok, this was easy. After the <code>connect()</code> syscall, the operating system sends a SYN packet. Since it didn't get any response the OS will by default retry sending it 6 times. This can be tweaked by the sysctl:</p><p><code>$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6</code></p><p>It's possible to override this setting per-socket with the TCP_SYNCNT setsockopt:</p><p><code>int cnt = 6;
setsockopt(sd, IPPROTO_TCP, TCP_SYNCNT, &amp;cnt, sizeof(cnt));</code></p><p>The retries are staggered at the 1s, 3s, 7s, 15s, 31s, 63s marks (the retransmission timeout starts at 1s and doubles after each attempt). By default, the whole process takes about 130 seconds, until the kernel gives up with the ETIMEDOUT errno. At this moment in the lifetime of a connection, SO_KEEPALIVE settings are ignored, but TCP_USER_TIMEOUT is not. For example, setting it to 5000ms will cause the following interaction:</p><p><code>$ sudo ./test-syn-sent.py 5000
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,996ms,0)

00:01.016 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.032 IP host.2 &gt; host.1: Flags [S] # second retry
00:05.016 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.024 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.036 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.044 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.050 connect ETIMEDOUT</code></p><p>Even though we set the user timeout to 5s, we still saw six SYN retries on the wire. This behaviour is probably a bug (as tested on a 5.2 kernel): we would expect only two retries to be sent - at the 1s and 3s marks - and the socket to expire at the 5s mark. Instead, we saw the two expected retries, but also four further retransmitted SYN packets aligned to the 5s mark - which makes no sense. Anyhow, we learned a thing - TCP_USER_TIMEOUT does affect the behaviour of <code>connect()</code>.</p>
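<p>For completeness, here is how both knobs can be set from Python before calling <code>connect()</code>. This is a sketch: both constants are Linux-only members of the <code>socket</code> module:</p>

```python
import socket

# Configure SYN retry behaviour before connect(); Linux-only socket options.
sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT, 2)           # at most 2 SYN retries
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 5000)  # give up after 5000 ms

syncnt = sd.getsockopt(socket.IPPROTO_TCP, socket.TCP_SYNCNT)
user_timeout_ms = sd.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)
sd.close()
```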
    <div>
      <h2>SYN-RECV</h2>
      <a href="#syn-recv">
        
      </a>
    </div>
    <p>SYN-RECV sockets are usually hidden from the application. They live as mini-sockets on the SYN queue. We wrote about <a href="/syn-packet-handling-in-the-wild/">the SYN and Accept queues in the past</a>. Sometimes, when SYN cookies are enabled, the sockets may skip the SYN-RECV state altogether.</p><p>In SYN-RECV state, the socket will retry sending SYN+ACK 5 times as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5</code></p><p>Here is how it looks on the wire:</p><p><code>$ sudo ./test-syn-recv.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
# all subsequent packets dropped
00:00.000 IP host.1 &gt; host.2: Flags [S.] # initial SYN+ACK

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,996ms,0)

00:01.033 IP host.1 &gt; host.2: Flags [S.] # first retry
00:03.045 IP host.1 &gt; host.2: Flags [S.] # second retry
00:07.301 IP host.1 &gt; host.2: Flags [S.] # third retry
00:15.493 IP host.1 &gt; host.2: Flags [S.] # fourth retry
00:31.621 IP host.1 &gt; host.2: Flags [S.] # fifth retry
01:04.610 SYN-RECV disappears</code></p><p>With default settings, the SYN+ACK is re-transmitted at the 1s, 3s, 7s, 15s, 31s marks, and the SYN-RECV socket disappears at the 64s mark.</p><p>Neither SO_KEEPALIVE nor TCP_USER_TIMEOUT affect the lifetime of SYN-RECV sockets.</p>
    <div>
      <h2>Final handshake ACK</h2>
      <a href="#final-handshake-ack">
        
      </a>
    </div>
    <p>After receiving the second packet in the TCP handshake - the SYN+ACK - the client socket moves to an ESTABLISHED state. The server socket remains in SYN-RECV until it receives the final ACK packet.</p><p>Losing this ACK doesn't change anything - the server socket will just take a bit longer to move from SYN-RECV to ESTAB. Here is how it looks:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # initial ACK, dropped

State    Recv-Q Send-Q Local:Port  Peer:Port
SYN-RECV 0      0      host:1      host:2 timer:(on,1sec,0)
ESTAB    0      0      host:2      host:1

00:01.014 IP host.1 &gt; host.2: Flags [S.]
00:01.014 IP host.2 &gt; host.1: Flags [.]  # retried ACK, dropped

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,1.012ms,1)
ESTAB    0      0      host:2     host:1</code></p><p>As you can see, SYN-RECV has the "on" timer, the same as in the example before. We might argue this final ACK doesn't really carry much weight. This thinking led to the development of the TCP_DEFER_ACCEPT feature - it basically causes the third ACK to be silently dropped. With this flag set, the socket remains in the SYN-RECV state until it receives the first packet with actual data:</p><p><code>$ sudo ./test-syn-ack.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # delivered, but the socket stays as SYN-RECV

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,7.192ms,0)
ESTAB    0      0      host:2     host:1

00:08.020 IP host.2 &gt; host.1: Flags [P.], length 11  # payload moves the socket to ESTAB

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 11     0      host:1     host:2
ESTAB 0      0      host:2     host:1</code></p><p>The server socket remained in the SYN-RECV state even after receiving the final TCP-handshake ACK. It has a funny "on" timer, with the counter stuck at 0 retries. It is converted to ESTAB - and moved from the SYN to the accept queue - after the client sends a data packet or after the TCP_DEFER_ACCEPT timer expires. Basically, with DEFER ACCEPT the SYN-RECV mini-socket <a href="https://marc.info/?l=linux-netdev&amp;m=118793048828251&amp;w=2">discards the data-less inbound ACK</a>.</p>
    <div>
      <h2>Idle ESTAB is forever</h2>
      <a href="#idle-estab-is-forever">
        
      </a>
    </div>
    <p>Let's move on and discuss a fully-established socket connected to an unhealthy (dead) peer. After completion of the handshake, the sockets on both sides move to the ESTABLISHED state, like this:</p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:2     host:1
ESTAB 0      0      host:1     host:2</code></p><p>These sockets have no running timer by default - they will remain in that state forever, even if the communication is broken. The TCP stack will notice problems only when one side attempts to send something. This raises a question - what to do if you don't plan on sending any data over a connection? How do you make sure an idle connection is healthy, without sending any data over it?</p><p>This is where TCP keepalives come in. Let's see it in action - in this example we used the following toggles:</p><ul><li><p>SO_KEEPALIVE = 1 - Let's enable keepalives.</p></li><li><p>TCP_KEEPIDLE = 5 - Send first keepalive probe after 5 seconds of idleness.</p></li><li><p>TCP_KEEPINTVL = 3 - Send subsequent keepalive probes after 3 seconds.</p></li><li><p>TCP_KEEPCNT = 3 - Time out after three failed probes.</p></li></ul><p><code>$ sudo ./test-idle.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      0      host:2     host:1  timer:(keepalive,2.992ms,0)

# all subsequent packets dropped
00:05.083 IP host.2 &gt; host.1: Flags [.], ack 1 # first keepalive probe
00:08.155 IP host.2 &gt; host.1: Flags [.], ack 1 # second keepalive probe
00:11.231 IP host.2 &gt; host.1: Flags [.], ack 1 # third keepalive probe
00:14.299 IP host.2 &gt; host.1: Flags [R.], seq 1, ack 1</code></p><p>Indeed! We can clearly see the first probe sent at the 5s mark and the two remaining probes 3s apart - exactly as we specified. After a total of three sent probes, and a further three seconds of delay, the connection dies with ETIMEDOUT, and finally the RST is transmitted.</p><p>For keepalives to work, the send buffer must be empty. You can notice the keepalive timer active in the "timer:(keepalive)" line.</p>
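<p>The exact toggles used by the experiment translate directly to setsockopt calls. A Linux-only Python sketch (these constants are not portable to all platforms):</p>

```python
import socket

sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sk.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable keepalives
sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # first probe after 5s idle
sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)  # subsequent probes every 3s
sk.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # give up after 3 failed probes
# A dead idle peer is now detected after roughly 5 + 3*3 = 14 seconds.

keepidle = sk.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
keepintvl = sk.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL)
keepcnt = sk.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT)
sk.close()
```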
    <div>
      <h2>Keepalives with TCP_USER_TIMEOUT are confusing</h2>
      <a href="#keepalives-with-tcp_user_timeout-are-confusing">
        
      </a>
    </div>
    <p>We mentioned the TCP_USER_TIMEOUT option before. It sets the maximum amount of time that transmitted data may remain unacknowledged before the kernel forcefully closes the connection. On its own, it doesn't do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped. However, this socket option does change the semantics of TCP keepalives. <a href="https://linux.die.net/man/7/tcp">The tcp(7) manpage</a> is somewhat confusing:</p><p><i>Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive to determine when to close a connection due to keepalive failure.</i></p><p>The original commit message has slightly more detail:</p><ul><li><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c75e7e545694a9dd6288553f55c53e2a3a3">tcp: Add TCP_USER_TIMEOUT socket option</a></p></li></ul><p>To understand the semantics, we need to look at the <a href="https://github.com/torvalds/linux/blob/b41dae061bbd722b9d7fa828f35d22035b218e18/net/ipv4/tcp_timer.c#L693-L697">kernel code in linux/net/ipv4/tcp_timer.c:693</a>:</p><p><code>if ((icsk-&gt;icsk_user_timeout != 0 &amp;&amp;
elapsed &gt;= msecs_to_jiffies(icsk-&gt;icsk_user_timeout) &amp;&amp;
icsk-&gt;icsk_probes_out &gt; 0) ||</code></p><p>For the user timeout to have any effect, the <code>icsk_probes_out</code> must not be zero. The check for user timeout is done only <i>after</i> the first probe went out. Let's check it out. Our connection settings:</p><ul><li><p>TCP_USER_TIMEOUT = 5*1000 - 5 seconds</p></li><li><p>SO_KEEPALIVE = 1 - enable keepalives</p></li><li><p>TCP_KEEPIDLE = 1 - send first probe quickly - 1 second idle</p></li><li><p>TCP_KEEPINTVL = 11 - subsequent probes every 11 seconds</p></li><li><p>TCP_KEEPCNT = 3 - send three probes before timing out</p></li></ul><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# all subsequent packets dropped
00:01.001 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.233 IP host.2 &gt; host.1: Flags [R.] # timer for second probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>So what happened? The connection sent the first keepalive probe at the 1s mark. Seeing no response, the TCP stack then woke up 11 seconds later to send a second probe. This time though, it executed the USER_TIMEOUT code path, which decided to terminate the connection immediately.</p><p>What if we bump TCP_USER_TIMEOUT to a larger value, say between the second and third probe? Then the connection will be closed on the third probe timer. With TCP_USER_TIMEOUT set to 12.5s:</p><p><code>00:01.022 IP host.2 &gt; host.1: Flags [.] # first probe
00:12.094 IP host.2 &gt; host.1: Flags [.] # second probe
00:23.102 IP host.2 &gt; host.1: Flags [R.] # timer for third probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>We’ve shown how TCP_USER_TIMEOUT interacts with keepalives for small and medium values. The last case is when TCP_USER_TIMEOUT is extraordinarily large. Say we set it to 30s:</p><p><code>00:01.027 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.195 IP host.2 &gt; host.1: Flags [.], ack 1 # second probe
00:23.207 IP host.2 &gt; host.1: Flags [.], ack 1 # third probe
00:34.211 IP host.2 &gt; host.1: Flags [.], ack 1 # fourth probe! But TCP_KEEPCNT was only 3!
00:45.219 IP host.2 &gt; host.1: Flags [.], ack 1 # fifth probe!
00:56.227 IP host.2 &gt; host.1: Flags [.], ack 1 # sixth probe!
01:07.235 IP host.2 &gt; host.1: Flags [R.], seq 1 # TCP_USER_TIMEOUT aborts conn on 7th probe timer</code></p><p>We saw six keepalive probes on the wire! With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:</p>
            <pre><code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code></pre>
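<p>A tiny helper makes the relationship concrete (a hypothetical function name, illustrating the formula; the values are from the first keepalive experiment above):</p>

```python
def max_sensible_user_timeout(keepidle, keepintvl, keepcnt):
    """Upper bound in seconds: at or above this, TCP_KEEPCNT never gets a say."""
    return keepidle + keepintvl * keepcnt

# First keepalive experiment: idle 5s, interval 3s, 3 probes -> died at ~14s.
# So a TCP_USER_TIMEOUT for that setup should be set just below:
print(max_sensible_user_timeout(5, 3, 3))  # 14
```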
            
    <div>
      <h2>Busy ESTAB socket is not forever</h2>
      <a href="#busy-estab-socket-is-not-forever">
        
      </a>
    </div>
    <p>Thus far we have discussed the case where the connection is idle. Different rules apply when the connection has unacknowledged data in a send buffer.</p><p>Let's prepare another experiment - after the three-way handshake, let's set up a firewall to drop all packets. Then, let's do a <code>send</code> on one end to have some dropped packets in-flight. An experiment shows the sending socket dies after ~16 minutes:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# All subsequent packets dropped
00:00.206 IP host.2 &gt; host.1: Flags [P.], length 11 # first data packet
00:00.412 IP host.2 &gt; host.1: Flags [P.], length 11 # early retransmit, doesn't count
00:00.620 IP host.2 &gt; host.1: Flags [P.], length 11 # 1st retry
00:01.048 IP host.2 &gt; host.1: Flags [P.], length 11 # 2nd retry
00:01.880 IP host.2 &gt; host.1: Flags [P.], length 11 # 3rd retry

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      11     host:2     host:1    timer:(on,1.304ms,3)

00:03.543 IP host.2 &gt; host.1: Flags [P.], length 11 # 4th
00:07.000 IP host.2 &gt; host.1: Flags [P.], length 11 # 5th
00:13.656 IP host.2 &gt; host.1: Flags [P.], length 11 # 6th
00:26.968 IP host.2 &gt; host.1: Flags [P.], length 11 # 7th
00:54.616 IP host.2 &gt; host.1: Flags [P.], length 11 # 8th
01:47.868 IP host.2 &gt; host.1: Flags [P.], length 11 # 9th
03:34.360 IP host.2 &gt; host.1: Flags [P.], length 11 # 10th
05:35.192 IP host.2 &gt; host.1: Flags [P.], length 11 # 11th
07:36.024 IP host.2 &gt; host.1: Flags [P.], length 11 # 12th
09:36.855 IP host.2 &gt; host.1: Flags [P.], length 11 # 13th
11:37.692 IP host.2 &gt; host.1: Flags [P.], length 11 # 14th
13:38.524 IP host.2 &gt; host.1: Flags [P.], length 11 # 15th
15:39.500 connection ETIMEDOUT</code></p><p>The data packet is retransmitted 15 times, as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15</code></p><p>From the <a href="https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt"><code>ip-sysctl.txt</code></a> documentation:</p><p><i>The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout.</i></p><p>The connection indeed died at ~940 seconds. Notice the socket has the "on" timer running. It doesn't matter at all if we set SO_KEEPALIVE - when the "on" timer is running, keepalives are not engaged.</p><p>TCP_USER_TIMEOUT keeps on working though. The connection will be aborted <i>exactly</i> user-timeout after the last received packet. With the user timeout set, the <code>tcp_retries2</code> value is ignored.</p>
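<p>The quoted 924.6-second figure follows from summing the successive retransmission timeouts: the RTO starts at TCP_RTO_MIN (200 ms), doubles after every retry, and is capped at TCP_RTO_MAX (120 s). A quick Python check of that arithmetic:</p>

```python
RTO_MIN = 0.2    # TCP_RTO_MIN, in seconds
RTO_MAX = 120.0  # TCP_RTO_MAX, in seconds

def hypothetical_timeout(retries):
    # One RTO after the initial transmission plus one after each retry;
    # the RTO doubles every time and is capped at RTO_MAX.
    return sum(min(RTO_MIN * 2 ** i, RTO_MAX) for i in range(retries + 1))

print(hypothetical_timeout(15))  # ~924.6 seconds for tcp_retries2 = 15
```

The first ten periods sum to 204.6 seconds, and the remaining six are clamped to 120 seconds each, giving 924.6 seconds.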
    <div>
      <h2>Zero window ESTAB is... forever?</h2>
      <a href="#zero-window-estab-is-forever">
        
      </a>
    </div>
    <p>There is one final case worth mentioning. If the sender has plenty of data, and the receiver is slow, then TCP flow control kicks in. At some point the receiver will ask the sender to stop transmitting new data. This is a slightly different condition than the one described above.</p><p>In this case, with flow control engaged, there is no in-flight or unacknowledged data. Instead the receiver throttles the sender with a "zero window" notification. Then the sender periodically checks if the condition is still valid with "window probes". In this experiment we reduced the receive buffer size for simplicity. Here's how it looks on the wire:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.], win 1152
00:00.000 IP host.2 &gt; host.1: Flags [.]</code></p><p><code>00:00.202 IP host.2 &gt; host.1: Flags [.], length 576 # first data packet
00:00.202 IP host.1 &gt; host.2: Flags [.], ack 577, win 576
00:00.202 IP host.2 &gt; host.1: Flags [P.], length 576 # second data packet
00:00.244 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # throttle it! zero-window</code></p><p><code>00:00.456 IP host.2 &gt; host.1: Flags [.], ack 1 # zero-window probe
00:00.456 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # nope, still zero-window</code></p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 1152   0      host:1     host:2
ESTAB 0      129920 host:2     host:1  timer:(persist,048ms,0)</code></p><p>The packet capture shows a couple of things. First, we can see two packets with data, each 576 bytes long. Both were immediately acknowledged. The second ACK carried a "win 0" notification: the sender was told to stop sending data.</p><p>But the sender is eager to send more! The last two packets show a first "window probe": the sender will periodically send payload-less "ack" packets to check if the window size has changed. As long as the receiver keeps on answering, the sender will keep on sending such probes forever.</p><p>The socket information shows three important things:</p><ul><li><p>The read buffer of the reader is filled - thus the "zero window" throttling is expected.</p></li><li><p>The write buffer of the sender is filled - we have more data to send.</p></li><li><p>The sender has a "persist" timer running, counting the time until the next "window probe".</p></li></ul><p>In this blog post we are interested in timeouts - what will happen if the window probes are lost? Will the sender notice?</p><p>By default, the window probe is retried 15 times - adhering to the usual <code>tcp_retries2</code> setting.</p><p>The TCP timer is in the <code>persist</code> state, so TCP keepalives will <i>not</i> be running. The SO_KEEPALIVE settings don't make any difference when window probing is engaged.</p><p>As expected, the TCP_USER_TIMEOUT toggle keeps on working. A slight difference is that, as with keepalives, it's evaluated only when the retransmission timer fires: if more than user-timeout seconds have passed since the last good packet, the connection will be aborted.</p>
    <div>
      <h2>Note about using application timeouts</h2>
      <a href="#note-about-using-application-timeouts">
        
      </a>
    </div>
    <p>In the past we have shared an interesting war story:</p><ul><li><p><a href="/the-curious-case-of-slow-downloads/">The curious case of slow downloads</a></p></li></ul><p>Our HTTP server gave up on the connection after an application-managed timeout fired. This was a bug - a slow connection might have correctly slowly drained the send buffer, but the application server didn't notice that.</p><p>We abruptly dropped slow downloads, even though this wasn't our intention. We just wanted to make sure the client connection was still healthy. It would be better to use TCP_USER_TIMEOUT than rely on application-managed timeouts.</p><p>But this is not sufficient. We also wanted to guard against a situation where a client stream is valid, but is stuck and doesn't drain the connection. The only way to achieve this is to periodically check the amount of unsent data in the send buffer, and see if it shrinks at a desired pace.</p><p>For typical applications sending data to the Internet, I would recommend:</p><ol><li><p>Enable TCP keepalives. This is needed to keep some data flowing in the idle-connection case.</p></li><li><p>Set TCP_USER_TIMEOUT to <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>.</p></li><li><p>Be careful when using application-managed timeouts. To detect TCP failures use TCP keepalives and user-timeout. If you want to spare resources and make sure sockets don't stay alive for too long, consider periodically checking if the socket is draining at the desired pace. You can use <code>ioctl(TIOCOUTQ)</code> for that, but it counts both data buffered (notsent) on the socket and in-flight (unacknowledged) bytes. A better way is to use TCP_INFO tcpi_notsent_bytes parameter, which reports only the former counter.</p></li></ol><p>An example of checking the draining pace:</p><p><code>while True:
    notsent1 = get_tcp_info(c).tcpi_notsent_bytes
    notsent1_ts = time.time()
    ...
    poll.poll(POLL_PERIOD)
    ...
    notsent2 = get_tcp_info(c).tcpi_notsent_bytes
    notsent2_ts = time.time()
    pace_in_bytes_per_second = (notsent1 - notsent2) / (notsent2_ts - notsent1_ts)
    if pace_in_bytes_per_second &gt; 12000:
        pass  # pace is above effective rate of 96Kbps, ok!
    else:
        pass  # socket is too slow...</code></p><p>There are ways to further improve this logic. We could use <a href="https://lwn.net/Articles/560082/"><code>TCP_NOTSENT_LOWAT</code></a>, although it's generally only useful for situations where the send buffer is relatively empty. Then we could use the <a href="https://www.kernel.org/doc/Documentation/networking/timestamping.txt"><code>SO_TIMESTAMPING</code></a> interface for notifications about when data gets delivered. Finally, if we are done sending the data to the socket, it's possible to just call <code>close()</code> and defer handling of the socket to the operating system. Such a socket will be stuck in FIN-WAIT-1 or LAST-ACK state until it correctly drains.</p>
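<p>The <code>get_tcp_info()</code> helper above is pseudocode. On Linux it can be approximated by parsing the raw TCP_INFO socket option. The 144-byte offset of <code>tcpi_notsent_bytes</code> below matches the <code>struct tcp_info</code> layout of recent kernels, but it is an assumption for illustration, not a stable ABI - check your kernel headers:</p>

```python
import socket
import struct

# Offset of tcpi_notsent_bytes inside struct tcp_info on recent Linux
# kernels: 8 bytes of u8 fields, 24 u32s, 4 u64s, then segs_out/segs_in.
# Assumption for illustration; verify against linux/tcp.h on your system.
TCPI_NOTSENT_BYTES_OFFSET = 144

def tcpi_notsent_bytes(sock):
    # Ask the kernel for the raw struct tcp_info and pull one u32 out of it.
    buf = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 512)
    return struct.unpack_from("I", buf, TCPI_NOTSENT_BYTES_OFFSET)[0]
```

For a freshly connected socket with nothing queued, this returns 0; it grows as the send buffer fills with data the kernel hasn't yet put on the wire.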
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this post we discussed five cases where the TCP connection may notice the other party going away:</p><ul><li><p>SYN-SENT: The duration of this state can be controlled by <code>TCP_SYNCNT</code> or <code>tcp_syn_retries</code>.</p></li><li><p>SYN-RECV: It's usually hidden from the application. It is tuned by <code>tcp_synack_retries</code>.</p></li><li><p>An idle ESTABLISHED connection will never notice any issues. The solution is to use TCP keepalives.</p></li><li><p>A busy ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li><li><p>A zero-window ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li></ul><p>The last two ESTABLISHED cases in particular can be customized with TCP_USER_TIMEOUT, but this setting also affects other situations. Generally speaking, it can be thought of as a hint to the kernel to abort the connection a given number of seconds after the last good packet. This is a dangerous setting though, and if used in conjunction with TCP keepalives it should be set to a value slightly lower than <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.</p><p>In this post we presented scripts showing the effects of timeout-related socket options under various network conditions. Interleaving the <code>tcpdump</code> packet capture with the output of <code>ss -o</code> is a great way of understanding the networking stack. We were able to create reproducible test cases showing the "on", "keepalive" and "persist" timers in action. This is a very useful framework for further experimentation.</p><p>Finally, it's surprisingly hard to tune a TCP connection to be confident that the remote host is actually up. During our debugging we found that looking at the send buffer size and currently active TCP timer can be very helpful in understanding whether the socket is actually healthy. 
The bug in our Spectrum application turned out to be a wrong TCP_USER_TIMEOUT setting - without it sockets with large send buffers were lingering around for way longer than we intended.</p><p>The scripts used in this article <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">can be found on our GitHub</a>.</p><p>Figuring this out has been a collaboration across three Cloudflare offices. Thanks to <a href="https://twitter.com/Hirenpanchasara">Hiren Panchasara</a> from San Jose, <a href="https://twitter.com/warrncn">Warren Nelson</a> from Austin and <a href="https://twitter.com/jkbs0">Jakub Sitnicki</a> from Warsaw. Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p> ]]></content:encoded>
            <category><![CDATA[SYN]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <guid isPermaLink="false">PTYUwpDIf4wDZ50CejAvL</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[A gentle introduction to Linux Kernel fuzzing]]></title>
            <link>https://blog.cloudflare.com/a-gentle-introduction-to-linux-kernel-fuzzing/</link>
            <pubDate>Wed, 10 Jul 2019 13:07:21 GMT</pubDate>
            <description><![CDATA[ For some time I’ve wanted to play with coverage-guided fuzzing. I decided to have a go at the Linux Kernel netlink machinery.  It's a good target: it's an obscure part of kernel, and it's relatively easy to automatically craft valid messages. ]]></description>
            <content:encoded><![CDATA[ <p>For some time I’ve wanted to play with coverage-guided <a href="https://en.wikipedia.org/wiki/Fuzzing">fuzzing</a>. Fuzzing is a powerful testing technique where an automated program feeds semi-random inputs to a tested program. The intention is to find such inputs that trigger bugs. Fuzzing is especially useful in finding memory corruption bugs in C or C++ programs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/281ZRgfWvQ84ZxrUD29Bq2/f7f78fbced1487215ae7ce8e3801b55e/4152779709_d1ea8dd3b4_z.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/patrick_s/4152779709">Image</a> by <a href="https://www.flickr.com/photos/patrick_s/">Patrick Shannon</a> CC BY 2.0</p><p>Normally it's recommended to pick a well known, but little explored, library that is heavy on parsing. Historically things like libjpeg, libpng and libyaml were perfect targets. Nowadays it's harder to find a good target - everything seems to have been fuzzed to death already. That's a good thing! I guess the software is getting better! Instead of choosing a userspace target I decided to have a go at the Linux Kernel netlink machinery.</p><p><a href="https://en.wikipedia.org/wiki/Netlink">Netlink is an internal Linux facility</a> used by tools like "ss", "ip", "netstat". It's used for low level networking tasks - configuring network interfaces, IP addresses, routing tables and such. It's a good target: it's an obscure part of kernel, and it's relatively easy to automatically craft valid messages. Most importantly, we can learn a lot about Linux internals in the process. Bugs in netlink aren't going to have security impact though - netlink sockets usually require privileged access anyway.</p><p>In this post we'll run <a href="http://lcamtuf.coredump.cx/afl/">AFL fuzzer</a>, driving our netlink shim program against a custom Linux kernel. All of this running inside KVM virtualization.</p><p>This blog post is a tutorial. With the easy to follow instructions, you should be able to quickly replicate the results. All you need is a machine running Linux and 20 minutes.</p>
    <div>
      <h2>Prior work</h2>
      <a href="#prior-work">
        
      </a>
    </div>
    <p>The technique we are going to use is formally called "coverage-guided fuzzing". There's a lot of prior literature:</p><ul><li><p><a href="https://blog.trailofbits.com/2017/02/16/the-smart-fuzzer-revolution/">The Smart Fuzzer Revolution</a> by Dan Guido, and <a href="https://lwn.net/Articles/677764/">LWN article</a> about it</p></li><li><p><a href="https://j00ru.vexillium.org/talks/blackhat-eu-effective-file-format-fuzzing-thoughts-techniques-and-results/">Effective file format fuzzing</a> by Mateusz “j00ru” Jurczyk</p></li><li><p><a href="http://honggfuzz.com/">honggfuzz</a> by Robert Swiecki, is a modern, feature-rich coverage-guided fuzzer</p></li><li><p><a href="https://google.github.io/clusterfuzz/">ClusterFuzz</a></p></li><li><p><a href="https://github.com/google/fuzzer-test-suite">Fuzzer Test Suite</a></p></li></ul><p>Many people have fuzzed the Linux Kernel in the past. Most importantly:</p><ul><li><p><a href="https://github.com/google/syzkaller/blob/master/docs/syzbot.md">syzkaller (aka syzbot)</a> by Dmitry Vyukov, is a very powerful CI-style continuously running kernel fuzzer, which found hundreds of issues already. It's an awesome machine - it will even report the bugs automatically!</p></li><li><p><a href="https://github.com/kernelslacker/trinity">Trinity fuzzer</a></p></li></ul><p>We'll use <a href="http://lcamtuf.coredump.cx/afl/">the AFL</a>, everyone's favorite fuzzer. AFL was written by <a href="http://lcamtuf.coredump.cx">Michał Zalewski</a>. It's well known for its ease of use, speed and very good mutation logic. 
It's a perfect choice for people starting their journey into fuzzing!</p><p>If you want to read more about AFL, the documentation is in a couple of files:</p><ul><li><p><a href="http://lcamtuf.coredump.cx/afl/historical_notes.txt">Historical notes</a></p></li><li><p><a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">Technical whitepaper</a></p></li><li><p><a href="http://lcamtuf.coredump.cx/afl/README.txt">README</a></p></li></ul>
    <div>
      <h2>Coverage-guided fuzzing</h2>
      <a href="#coverage-guided-fuzzing">
        
      </a>
    </div>
    <p>Coverage-guided fuzzing works on the principle of a feedback loop:</p><ul><li><p>the fuzzer picks the most promising test case</p></li><li><p>the fuzzer mutates the test into a large number of new test cases</p></li><li><p>the target code runs the mutated test cases, and reports back code coverage</p></li><li><p>the fuzzer computes a score from the reported coverage, and uses it to prioritize the interesting mutated tests and remove the redundant ones</p></li></ul><p>For example, let's say the input test is "hello". The fuzzer may mutate it into a number of tests, for example: "hEllo" (bit flip), "hXello" (byte insertion), "hllo" (byte deletion). If any of these tests yields interesting code coverage, it will be prioritized and used as a base for the next generation of tests.</p><p>The specifics of how mutations are done, and how to efficiently compare code coverage reports from thousands of program runs, are the fuzzer's secret sauce. Read the <a href="http://lcamtuf.coredump.cx/afl/technical_details.txt">AFL technical whitepaper</a> for the nitty-gritty details.</p><p>The code coverage reported back from the binary is very important. It allows the fuzzer to order the test cases, and identify the most promising ones. Without code coverage the fuzzer is blind.</p><p>Normally, when using AFL, we are required to instrument the target code so that coverage is reported in an AFL-compatible way. But we want to fuzz the kernel! We can't just recompile it with "afl-gcc"! Instead we'll use a trick. We'll prepare a binary that will trick AFL into thinking it was compiled with its tooling. This binary will report back the code coverage extracted from the kernel.</p>
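<p>The three mutations mentioned above - bit flip, byte insertion, byte deletion - can be sketched in a few lines of Python. These are toy versions for illustration; AFL's real mutation stack is far richer and picks offsets adaptively:</p>

```python
import random

def bit_flip(data):
    # Flip a single random bit, e.g. "hello" -> "hEllo".
    i = random.randrange(len(data) * 8)
    out = bytearray(data)
    out[i // 8] ^= 1 << (i % 8)
    return bytes(out)

def byte_insert(data):
    # Insert one random byte, e.g. "hello" -> "hXello".
    i = random.randrange(len(data) + 1)
    return data[:i] + bytes([random.randrange(256)]) + data[i:]

def byte_delete(data):
    # Drop one byte, e.g. "hello" -> "hllo".
    i = random.randrange(len(data))
    return data[:i] + data[i + 1:]
```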
    <div>
      <h2>Kernel code coverage</h2>
      <a href="#kernel-code-coverage">
        
      </a>
    </div>
    <p>The kernel has at least two built-in coverage mechanisms - GCOV and KCOV:</p><ul><li><p><a href="https://www.kernel.org/doc/html/v4.15/dev-tools/gcov.html">Using gcov with the Linux kernel</a></p></li><li><p><a href="https://www.kernel.org/doc/html/latest/dev-tools/kcov.html">KCOV: code coverage for fuzzing</a></p></li></ul><p>KCOV was designed with fuzzing in mind, so we'll use this.</p><p>Using KCOV is pretty easy. We must compile the Linux kernel with the right setting. First, enable the KCOV kernel config option:</p>
            <pre><code>cd linux
./scripts/config \
    -e KCOV \
    -d KCOV_INSTRUMENT_ALL</code></pre>
            <p>KCOV is capable of recording code coverage from the whole kernel. It can be set with the KCOV_INSTRUMENT_ALL option. This has disadvantages though - it would slow down the parts of the kernel we don't want to profile, and would introduce noise in our measurements (reduce "stability"). For starters, let's disable KCOV_INSTRUMENT_ALL and enable KCOV selectively on the code we actually want to profile. Today, we focus on the netlink machinery, so let's enable KCOV on the whole "net" directory tree:</p>
            <pre><code>find net -name Makefile | xargs -L1 -I {} bash -c 'echo "KCOV_INSTRUMENT := y" &gt;&gt; {}'</code></pre>
            <p>In a perfect world we would enable KCOV only for the couple of files we are really interested in. But netlink handling is peppered all over the network stack code, and we don't have time for fine tuning it today.</p><p>With KCOV in place, it's worth adding "kernel hacking" toggles that will increase the likelihood of reporting memory corruption bugs. See the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/README.md">README</a> for the list of <a href="https://github.com/google/syzkaller/blob/master/docs/linux/kernel_configs.md">Syzkaller suggested options</a> - most importantly <a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">KASAN</a>.</p><p>With that set we can compile our KCOV- and KASAN-enabled kernel. Oh, one more thing. We are going to run the kernel under KVM. We're going to use <a href="https://github.com/amluto/virtme">"virtme"</a>, so we need a couple of toggles:</p>
            <pre><code>./scripts/config \
    -e VIRTIO -e VIRTIO_PCI -e NET_9P -e NET_9P_VIRTIO -e 9P_FS \
    -e VIRTIO_NET -e VIRTIO_CONSOLE  -e DEVTMPFS ...</code></pre>
            <p>(see the <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/README.md">README</a> for full list)</p>
    <div>
      <h2>How to use KCOV</h2>
      <a href="#how-to-use-kcov">
        
      </a>
    </div>
    <p>KCOV is super easy to use. First, note the code coverage is recorded in a per-process data structure. This means you have to enable and disable KCOV within a userspace process, and it's impossible to record coverage for non-task things, like interrupt handling. This is totally fine for our needs.</p><p>KCOV reports data into a ring buffer. Setting it up is pretty simple, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/src/kcov.c">see our code</a>. Then you can enable and disable it with a trivial ioctl:</p>
            <pre><code>ioctl(kcov_fd, KCOV_ENABLE, KCOV_TRACE_PC);
/* profiled code */
ioctl(kcov_fd, KCOV_DISABLE, 0);</code></pre>
            <p>After this sequence the ring buffer contains the list of %rip values of all the basic blocks of the KCOV-enabled kernel code. To read the buffer just run:</p>
            <pre><code>n = __atomic_load_n(&amp;kcov_ring[0], __ATOMIC_RELAXED);
for (i = 0; i &lt; n; i++) {
    printf("0x%lx\n", kcov_ring[i + 1]);
}</code></pre>
            <p>With tools like <code>addr2line</code> it's possible to resolve the %rip to a specific line of code. We won't need it though - the raw %rip values are sufficient for us.</p>
    <div>
      <h2>Feeding KCOV into AFL</h2>
      <a href="#feeding-kcov-into-afl">
        
      </a>
    </div>
    <p>The next step in our journey is to learn how to trick AFL. Remember, AFL needs a specially-crafted executable, but we want to feed in the kernel code coverage. First we need to understand how AFL works.</p><p>AFL sets up an array of 64K 8-bit numbers. This memory region is called "shared_mem" or "trace_bits" and is shared with the traced program. Every byte in the array can be thought of as a hit counter for a particular (branch_src, branch_dst) pair in the instrumented code.</p><p>It's important to notice that AFL prefers random branch labels, rather than reusing the %rip value to identify the basic blocks. This is to increase entropy - we want our hit counters in the array to be uniformly distributed. The algorithm AFL uses is:</p>
            <pre><code>cur_location = &lt;COMPILE_TIME_RANDOM&gt;;
shared_mem[cur_location ^ prev_location]++; 
prev_location = cur_location &gt;&gt; 1;</code></pre>
            <p>In our case with KCOV we don't have compile-time-random values for each branch. Instead we'll use a hash function to generate a uniform 16 bit number from %rip recorded by KCOV. This is how to feed a KCOV report into the AFL "shared_mem" array:</p>
            <pre><code>n = __atomic_load_n(&amp;kcov_ring[0], __ATOMIC_RELAXED);
uint16_t prev_location = 0;
for (i = 0; i &lt; n; i++) {
        uint16_t cur_location = hash_function(kcov_ring[i + 1]);
        shared_mem[cur_location ^ prev_location]++;
        prev_location = cur_location &gt;&gt; 1;
}</code></pre>
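<p>The <code>hash_function()</code> here is left abstract; any mixer that spreads 64-bit %rip values uniformly over 16 bits will do. A plausible Python model of both the hash and the shared-memory update - our illustration, not the exact function from the shim:</p>

```python
def hash_function(rip):
    # Fibonacci hashing: multiply by a 64-bit odd constant, keep the top
    # 16 bits. Illustrative only; the real shim may use something else.
    h = (rip * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return h >> 48

def fill_shared_mem(rips, shared_mem):
    # Mirror of the C loop above: one hit counter per (prev, cur) edge pair.
    prev_location = 0
    for rip in rips:
        cur_location = hash_function(rip)
        idx = cur_location ^ prev_location
        shared_mem[idx] = (shared_mem[idx] + 1) % 256  # 8-bit counters wrap
        prev_location = cur_location >> 1
```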
            
    <div>
      <h2>Reading test data from AFL</h2>
      <a href="#reading-test-data-from-afl">
        
      </a>
    </div>
    <p>Finally, we need to actually write the test code hammering the kernel netlink interface! First we need to read input data from AFL. By default AFL sends a test case to stdin:</p>
            <pre><code>/* read AFL test data */
char buf[512*1024];
int buf_len = read(0, buf, sizeof(buf));</code></pre>
            
    <div>
      <h2>Fuzzing netlink</h2>
      <a href="#fuzzing-netlink">
        
      </a>
    </div>
    <p>Then we need to send this buffer into a netlink socket. But we know nothing about how netlink works! Okay, let's use the first 5 bytes of input as the netlink protocol and group id fields. This will allow the AFL to figure out and guess the correct values of these fields. The code testing netlink (simplified):</p>
            <pre><code>netlink_fd = socket(AF_NETLINK, SOCK_RAW | SOCK_NONBLOCK, buf[0]);

struct sockaddr_nl sa = {
        .nl_family = AF_NETLINK,
        .nl_groups = (buf[1] &lt;&lt;24) | (buf[2]&lt;&lt;16) | (buf[3]&lt;&lt;8) | buf[4],
};

bind(netlink_fd, (struct sockaddr *) &amp;sa, sizeof(sa));

struct iovec iov = { &amp;buf[5], buf_len - 5 };
struct sockaddr_nl sax = {
      .nl_family = AF_NETLINK,
};

struct msghdr msg = { &amp;sax, sizeof(sax), &amp;iov, 1, NULL, 0, 0 };
r = sendmsg(netlink_fd, &amp;msg, 0);
if (r != -1) {
      /* sendmsg succeeded! great I guess... */
}</code></pre>
            <p>That's basically it! For speed, we will wrap this in a short loop that mimics <a href="https://lcamtuf.blogspot.com/2014/10/fuzzing-binaries-without-execve.html">the AFL "fork server" logic</a>. I'll skip the explanation here, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/src/forksrv.c">see our code for details</a>. The resulting code of our AFL-to-KCOV shim looks like:</p>
            <pre><code>forksrv_welcome();
while(1) {
    forksrv_cycle();
    test_data = afl_read_input();
    kcov_enable();
    /* netlink magic */
    kcov_disable();
    /* fill in shared_map with tuples recorded by kcov */
    if (new_crash_in_dmesg) {
         forksrv_status(1);
    } else {
         forksrv_status(0);
    }
}</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing/src/fuzznetlink.c">See full source code</a>.</p>
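<p>If you'd like to poke at the netlink surface from Python before committing to the C shim, the same "first 5 bytes pick the protocol and groups" trick looks roughly like this. It's a sketch under the assumption that an unprivileged netlink socket (e.g. NETLINK_ROUTE, protocol 0) is available:</p>

```python
import socket

def send_fuzz_input(buf):
    # Byte 0 selects the netlink protocol, bytes 1-4 the multicast groups,
    # mirroring the C shim; the rest is sent as the raw netlink payload.
    s = socket.socket(socket.AF_NETLINK,
                      socket.SOCK_RAW | socket.SOCK_NONBLOCK, buf[0])
    groups = int.from_bytes(buf[1:5], "big")
    s.bind((0, groups))  # pid 0 lets the kernel assign a port id
    try:
        s.send(buf[5:])
    except OSError:
        pass  # the kernel rejected the message; expected for most fuzz inputs
    finally:
        s.close()
```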
    <div>
      <h2>How to run the custom kernel</h2>
      <a href="#how-to-run-the-custom-kernel">
        
      </a>
    </div>
    <p>We're missing one important piece - how to actually run the custom kernel we've built. There are three options:</p><p><b>"native"</b>: You can totally boot the built kernel on your server and fuzz it natively. This is the fastest technique, but pretty problematic. If the fuzzing succeeds in finding a bug you will crash the machine, potentially losing the test data. Cutting the branches we sit on should be avoided.</p><p><b>"uml"</b>: We could configure the kernel to run as <a href="http://user-mode-linux.sourceforge.net/">User Mode Linux</a>. Running a UML kernel requires no privileges. The kernel just runs a user space process. UML is pretty cool, but sadly, it doesn't support KASAN, therefore the chances of finding a memory corruption bug are reduced. Finally, UML is a pretty magic special environment - bugs found in UML may not be relevant on real environments. Interestingly, UML is used by <a href="https://source.android.com/devices/architecture/kernel/network_tests">Android network_tests framework</a>.</p><p><b>"kvm"</b>: we can use kvm to run our custom kernel in a virtualized environment. This is what we'll do.</p><p>One of the simplest ways to run a custom kernel in a KVM environment is to use <a href="https://github.com/amluto/virtme">"virtme" scripts</a>. With them we can avoid having to create a dedicated disk image or partition, and just share the host file system. This is how we can run our code:</p>
            <pre><code>virtme-run \
    --kimg bzImage \
    --rw --pwd --memory 512M \
    --script-sh "&lt;what to run inside kvm&gt;" </code></pre>
            <p>But hold on. We forgot about preparing input corpus data for our fuzzer!</p>
    <div>
      <h2>Building the input corpus</h2>
      <a href="#building-the-input-corpus">
        
      </a>
    </div>
    <p>Every fuzzer takes carefully crafted test cases as input, to bootstrap the first mutations. The test cases should be short, and cover as large a part of the code as possible. Sadly - I know nothing about netlink. How about we don't prepare the input corpus...</p><p>Instead we can ask AFL to "figure out" what inputs make sense. This is what <a href="https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-thin-air.html">Michał did back in 2014 with JPEGs</a> and it worked for him. With this in mind, here is our input corpus:</p>
            <pre><code>mkdir inp
echo "hello world" &gt; inp/01.txt</code></pre>
            <p>Instructions, how to compile and run the whole thing are in <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-07-kernel-fuzzing">README.md</a> on our github. It boils down to:</p>
            <pre><code>virtme-run \
    --kimg bzImage \
    --rw --pwd --memory 512M \
    --script-sh "./afl-fuzz -i inp -o out -- fuzznetlink" </code></pre>
            <p>With this running you will see the familiar AFL status screen:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Ms0btT4aEecNGhSwflYYX/c519bc18d15a5a33ae486456cdbde6e3/Screenshot-from-2019-07-10-11-44-50.png" />
            
            </figure>
    <div>
      <h2>Further notes</h2>
      <a href="#further-notes">
        
      </a>
    </div>
    <p>That's it. Now you have a custom hardened kernel, running a basic coverage-guided fuzzer. All inside KVM.</p><p>Was it worth the effort? Even with this basic fuzzer, and no input corpus, after a day or two the fuzzer found an interesting code path: <a href="https://lore.kernel.org/netdev/CAJPywTJWQ9ACrp0naDn0gikU4P5-xGcGrZ6ZOKUeeC3S-k9+MA@mail.gmail.com/T/#u">NEIGH: BUG, double timer add, state is 8</a>. With a more specialized fuzzer, some work on improving the "stability" metric and a decent input corpus, we could expect even better results.</p><p>If you want to learn more about what netlink sockets actually do, see a blog post by my colleague Jakub Sitnicki <a href="http://codecave.cc/multipath-routing-in-linux-part-1.html">Multipath Routing in Linux - part 1</a>. Then there is a good chapter about it in <a href="https://books.google.pl/books?redir_esc=y&amp;hl=pl&amp;id=96V4AgAAQBAJ&amp;q=netlink#v=snippet&amp;q=netlink&amp;f=false">Linux Kernel Networking book by Rami Rosen</a>.</p><p>In this blog post we haven't mentioned:</p><ul><li><p>details of AFL shared_memory setup</p></li><li><p>implementation of AFL persistent mode</p></li><li><p>how to create a network namespace to isolate the effects of weird netlink commands, and improve the "stability" AFL score</p></li><li><p>technique on how to read dmesg (/dev/kmsg) to find kernel crashes</p></li><li><p>idea to run AFL outside of KVM, for speed and stability - currently the tests aren't stable after a crash is found</p></li></ul><p>But we achieved our goal - we set up a basic, yet still useful fuzzer against a kernel. Most importantly: the same machinery can be reused to fuzz other parts of Linux subsystems - from file systems to bpf verifier.</p><p>I also learned a hard lesson: tuning fuzzers is a full time job. Proper fuzzing is definitely not as simple as starting it up and idly waiting for crashes. There is always something to improve, tune, and re-implement. 
A quote at the beginning of the mentioned presentation by Mateusz Jurczyk resonated with me:</p><blockquote><p>"Fuzzing is easy to learn but hard to master."</p></blockquote><p>Happy bug hunting!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Kernel]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">5t8I5tspUrkYUD9SY5kNP3</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare architecture and how BPF eats the world]]></title>
            <link>https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/</link>
            <pubDate>Sat, 18 May 2019 15:00:00 GMT</pubDate>
            <description><![CDATA[ Recently at Netdev 0x13 I gave a short talk titled "Linux at Cloudflare". The talk ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.

Here is a transcript of a slightly adjusted version of that talk. ]]></description>
            <content:encoded><![CDATA[ <p>Recently at <a href="https://www.netdevconf.org/0x13/schedule.html">Netdev 0x13</a>, the Conference on Linux Networking in Prague, I gave <a href="https://netdevconf.org/0x13/session.html?panel-industry-perspectives">a short talk titled "Linux at Cloudflare"</a>. The <a href="https://speakerdeck.com/majek04/linux-at-cloudflare">talk</a> ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.</p><p>Here is a transcript of a slightly adjusted version of that talk.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jvISZgDDA9AnXaBAJzYys/13b124e1f305c9ab4594db66f9123789/01_edge-network-locations-100.jpg" />
            
            </figure><p>At Cloudflare we run Linux on our servers. We operate two categories of data centers: large "Core" data centers, processing logs, analyzing attacks, computing analytics, and the "Edge" server fleet, delivering customer content from 180 locations across the world.</p><p>In this talk, we will focus on the "Edge" servers. It's here where we use the newest Linux features, optimize for performance and care deeply about DoS resilience.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2L3Wbz94VfdflLpsTuIp9D/8b11fa614a4cc1134e4d4df466464425/image-9.png" />
            
            </figure><p>Our edge service is special due to our network configuration - we are extensively using anycast routing. Anycast means that the same set of IP addresses are announced by all our data centers.</p><p>This design has great advantages. First, it guarantees the optimal speed for end users. No matter where you are located, you will always reach the closest data center. Then, anycast helps us to spread out DoS traffic. During attacks each of the locations receives a small fraction of the total traffic, making it easier to ingest and filter out unwanted traffic.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xiYwbjJRCpt76iDITKowY/9800bb2637bb3f601ccbc7c6eedb6b36/03_edge-network-uniform-software-100-1.jpg" />
            
            </figure><p>Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers - our software stack is uniform across the edge servers. All software pieces are running on all the servers.</p><p>In principle, every machine can handle every task - and we run many diverse and demanding tasks. We have a full HTTP stack, the magical <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a>, two sets of DNS servers - authoritative and resolver, and many other publicly facing applications like <a href="https://www.cloudflare.com/application-services/products/cloudflare-spectrum/">Spectrum</a> and <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/">Warp</a>.</p><p>Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of the processing.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/268deIgT4XLrgXfhRXxfx1/a5ced85738359d15ee16f9238ac759d8/image-23.png" />
            
            </figure><p>Let me walk you through the early stages of inbound packet processing:</p><p>(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many, at least 16, machines. This is used as a rudimentary load balancing technique.</p><p>(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric <a href="https://www.cloudflare.com/learning/ddos/ddos-mitigation/">DoS mitigations</a>, dropping packets belonging to very large layer 3 attacks.</p><p>(3) Then, still in XDP, we perform layer 4 <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">load balancing</a>. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.</p><p>(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.</p><p>(5) Finally packets are received by an application. For example HTTP connections are handled by a "protocol" server, responsible for performing TLS encryption and processing HTTP, HTTP/2 and QUIC protocols.</p><p>It's in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:</p><ul><li><p>DoS handling</p></li><li><p>Load balancing</p></li><li><p>Socket dispatch</p></li></ul><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IlhL5q8CplwN5H1BY0ZZo/7f73d7d4c662113fac18404bcfac96c4/image-25.png" />
            
            </figure><p>Let's discuss DoS handling in more detail. As mentioned earlier, the first step after ECMP routing is Linux's XDP stack where, among other things, we run DoS mitigations.</p><p>Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read about our adventures:</p><ul><li><p><a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">L4Drop: XDP DDoS Mitigations</a></p></li><li><p><a href="/xdpcap/">xdpcap: XDP Packet Capture</a></p></li><li><p><a href="https://netdevconf.org/0x13/session.html?talk-XDP-based-DDoS-mitigation">XDP based DoS mitigation</a> talk by Arthur Fabre</p></li><li><p><a href="https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf">XDP in practice: integrating XDP into our DDoS mitigation pipeline</a> (PDF)</p></li></ul><p>During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that <a href="http://vger.kernel.org/lpc-bpf2018.html#session-9">Facebook engineer Julia Kartseva</a> had the same issues. In February this problem was addressed with the introduction of the <code>bpf_spin_lock</code> helper.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LAXf4Ayxh2sRR3hIsbBja/f8be1df5b19b10954b6a4ad092a177f7/image-26.png" />
            
            </figure><p>While our modern volumetric DoS defenses are done in the XDP layer, we still rely on <code>iptables</code> for application layer 7 mitigations. Here, higher-level firewall features are useful: connlimit, hashlimits and ipsets. We also use the <code>xt_bpf</code> iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:</p><ul><li><p><a href="https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible">Lessons from defending the indefensible</a> (PPT)</p></li><li><p><a href="/introducing-the-bpf-tools/">Introducing the BPF tools</a></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MhIUEVsoocz1OSd8FKeYZ/a9e4321a070cd2d49800e978eb50b2d7/image-34.png" />
            
            </figure><p>After XDP and iptables, we have one final kernel-side DoS defense layer.</p><p>Consider a situation where our UDP mitigations fail. In such a case we might be left with a flood of packets hitting our application UDP socket. This might overflow the socket buffer, causing packet loss. This is problematic - both good and bad packets will be dropped indiscriminately. For applications like DNS it's catastrophic. In the past, to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn't affect the traffic to other server IP addresses.</p><p>Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IPs, and running that many UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it - using the <code>SO_ATTACH_BPF</code> socket option. We talked about running eBPF on network sockets in past blog posts:</p><ul><li><p><a href="/epbf_sockets_hop_distance/">eBPF, Sockets, Hop Distance and manually writing eBPF assembly</a></p></li><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>The mentioned eBPF program rate limits the packets. It keeps its state - packet counts - in an eBPF map. We can be sure that a single flooded IP won't affect other traffic. This works well, though during work on this project we found a rather worrying bug in the eBPF verifier:</p><ul><li><p><a href="/ebpf-cant-count/">eBPF can't count?!</a></p></li></ul><p>I guess running eBPF on a UDP socket is not a common thing to do.</p>
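<p>The per-destination-IP accounting idea behind that socket filter can be sketched in userspace Python. This is a hypothetical illustration of the concept only - the real mitigation is an eBPF program keeping packet counts in an eBPF map, not userspace code:</p>

```python
from collections import defaultdict

# Hypothetical userspace sketch of per-destination-IP rate limiting.
# The eBPF map of packet counts is modeled as a plain dict.
class PerIPRateLimiter:
    def __init__(self, limit_pps):
        self.limit = limit_pps          # allowed packets per second, per IP
        self.counts = defaultdict(int)  # dest IP -> packets this second
        self.epoch = None

    def allow(self, dest_ip, now):
        # `now` is the current time in whole seconds, passed in explicitly
        # to keep the sketch deterministic.
        if now != self.epoch:           # new one-second window: reset counts
            self.counts.clear()
            self.epoch = now
        self.counts[dest_ip] += 1
        # A flood towards one IP exhausts only that IP's budget,
        # leaving traffic to other server IPs unaffected.
        return self.counts[dest_ip] <= self.limit

limiter = PerIPRateLimiter(limit_pps=2)
verdicts = [limiter.allow("192.0.2.1", now=0) for _ in range(3)]
other_ok = limiter.allow("198.51.100.7", now=0)
```

<p>The in-kernel version does the same bookkeeping before packets ever reach the socket buffer, which is what keeps a flood to one IP from starving everything else.</p>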
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79bzCglYGH8Vb2hH6R9j39/39aa41a6e11b5cb0dc7f640cd7d7aa72/image-27.png" />
            
            </figure><p>Apart from DoS mitigation, in XDP we also run a layer 4 load balancer. This is a new project, and we haven't talked much about it yet. Without getting into many details: in certain situations we need to perform a socket lookup from XDP.</p><p>The problem is relatively simple - our code needs to look up the "socket" kernel structure for a 5-tuple extracted from a packet. This is generally easy - there is a <code>bpf_sk_lookup</code> helper available for this. Unsurprisingly, there were some complications. One problem was the inability to verify if a received ACK packet was a valid part of a three-way handshake when SYN-cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6j7ypFSgorN0lWBNY6Nc2m/b0ca29a08e3a1623a5f8c06c213f8d97/image-28.png" />
            
            </figure><p>After the DoS and load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack. Here we do socket dispatch - for example, packets going to port 53 are passed to a socket belonging to our DNS server.</p><p>We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on the servers.</p><p>Convincing Linux to route packets correctly is relatively easy with <a href="/how-we-built-spectrum">the "AnyIP" trick</a>. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn't support this out of the box. You can call <code>bind()</code> either on a specific IP address or on all IPs (with 0.0.0.0).</p>
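<p>The two stock <code>bind()</code> modes are easy to see from Python - and anything in between, such as binding to an IP prefix, is exactly what vanilla Linux lacks. A minimal sketch, using the loopback address:</p>

```python
import socket

# Vanilla Linux offers exactly two bind() choices:
# a single specific IP address, or the 0.0.0.0 wildcard.
# Binding to an IP *prefix* is not supported out of the box.

# Choice 1: bind to one specific address.
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))   # port 0 lets the kernel pick a free port
addr1 = s1.getsockname()

# Choice 2: bind to all addresses.
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.bind(("0.0.0.0", 0))
addr2 = s2.getsockname()

s1.close()
s2.close()
```

<p>Nothing in this API lets you say "bind to 192.0.2.0/24", which is the gap the <code>SO_BINDTOPREFIX</code> patch described below fills.</p>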
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ERTm7jhXbKKfdo3CVmV81/2c8b5986539832c73971269a6bcf0964/image-29.png" />
            
            </figure><p>In order to fix this, we developed a custom kernel patch which adds <a href="http://patchwork.ozlabs.org/patch/602916/">a <code>SO_BINDTOPREFIX</code> socket option</a>. As the name suggests, it allows us to call <code>bind()</code> on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.</p><p>Then we ran into another problem. For our Spectrum product we need to listen on all 65535 ports. Running so many listen sockets is not a good idea (see <a href="/revenge-listening-sockets/">our old war story blog</a>), so we had to find another way. After some experiments we learned to utilize an obscure iptables module - TPROXY - for this purpose. Read about it here:</p><ul><li><p><a href="/how-we-built-spectrum/">Abusing Linux's firewall: the hack that allowed us to build Spectrum</a></p></li></ul><p>This setup works, but we don't like the extra firewall rules. We are working on solving this problem correctly - by actually extending the socket dispatch logic. You guessed it - we want to extend socket dispatch logic by utilizing eBPF. Expect some patches from us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eEog6Kt512U7uVmD5DVt8/d8216cc24e674f5f86b1d385bf93dfc5/image-32.png" />
            
            </figure><p>Then there is a way to use eBPF to improve applications. Recently we got excited about doing TCP splicing with SOCKMAP:</p><ul><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>This technique has great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.</p><p>Similarly, the new <a href="https://netdevconf.org/2.2/papers/brakmo-tcpbpf-talk.pdf">TCP-BPF aka BPF_SOCK_OPS</a> hooks provide a great way of inspecting performance parameters of TCP flows. This functionality is super useful for our performance team.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MC8LLNDHLZ4wDYLcbbCHQ/3491b09f9b5eaf796457eb06781f8b98/12_prometheus-ebpf_exporter-100.jpg" />
            
            </figure><p>Some Linux features didn't age well and we need to work around them. For example, we are hitting the limitations of networking metrics. Don't get me wrong - the networking metrics are awesome, but sadly they are not granular enough. Things like <code>TcpExtListenDrops</code> and <code>TcpExtListenOverflows</code> are reported as global counters, while we need them on a per-application basis.</p><p>Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called "ebpf_exporter" to facilitate this. Read on:</p><ul><li><p><a href="/introducing-ebpf_exporter/">Introducing ebpf_exporter</a></p></li><li><p><a href="https://github.com/cloudflare/ebpf_exporter">https://github.com/cloudflare/ebpf_exporter</a></p></li></ul><p>With "ebpf_exporter" we can generate all manner of detailed metrics. It is very powerful and has saved us on many occasions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1obSUWhzzUQxg8cmDqQ0Lr/15e6419a78802c4d19fc4d21bf161738/image-33.png" />
            
            </figure><p>In this talk we discussed 6 layers of BPF running on our edge servers:</p><ul><li><p>Volumetric DoS mitigations are running on XDP eBPF</p></li><li><p>Iptables <code>xt_bpf</code> cBPF for application-layer attacks</p></li><li><p><code>SO_ATTACH_BPF</code> for rate limits on UDP sockets</p></li><li><p>Load balancer, running on XDP</p></li><li><p>eBPFs running application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements</p></li><li><p>"ebpf_exporter" for granular metrics</p></li></ul><p>And we're just getting started! Soon we will be doing more with eBPF-based socket dispatch, eBPF running on the <a href="https://linux.die.net/man/8/tc">Linux TC (Traffic Control)</a> layer and more integration with cgroup eBPF hooks. Then, our SRE team maintains an ever-growing list of <a href="https://github.com/iovisor/bcc">BCC scripts</a> useful for debugging.</p><p>It feels like Linux stopped developing new APIs and all the new features are implemented as eBPF hooks and helpers. This is fine and it has strong advantages. It's easier and safer to upgrade an eBPF program than to recompile a kernel module. Some things like TCP-BPF, exposing high-volume performance tracing data, would probably be impossible without eBPF.</p><p>Some say "software is eating the world"; I would say: "BPF is eating the software".</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">5EXsVZKcFTNLXVbrgHx73a</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[RFC8482 - Saying goodbye to ANY]]></title>
            <link>https://blog.cloudflare.com/rfc8482-saying-goodbye-to-any/</link>
            <pubDate>Fri, 15 Mar 2019 17:01:17 GMT</pubDate>
            <description><![CDATA[ Ladies and gentlemen, I would like you to welcome the new shiny RFC8482, which effectively deprecates DNS ANY query type. DNS ANY was a "meta-query" - think about it as a similar thing to the common A, AAAA, MX or SRV query types, but unlike these it wasn't a real query type - it was special. ]]></description>
            <content:encoded><![CDATA[ <p>Ladies and gentlemen, I would like you to welcome the new shiny <a href="https://tools.ietf.org/html/rfc8482">RFC8482</a>, which effectively deprecates the DNS ANY query type. DNS ANY was a "meta-query" - think of it as a similar thing to the common A, AAAA, MX or SRV query types, but unlike these it wasn't a real query type - it was special. Unlike the standard query types, ANY didn't age well. It was hard to implement on modern DNS servers, the semantics were poorly understood by the community and it unnecessarily exposed the DNS protocol to abuse. RFC8482 allows us to clean it up - it's a good thing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PCmdGHjKE7wTlBgmp6igp/e77a20411c26932eba98e8fcede95bfd/Screenshot-from-2019-03-15-14-22-51.png" />
            
            </figure><p>But let's rewind a bit.</p>
    <div>
      <h2>Historical context</h2>
      <a href="#historical-context">
        
      </a>
    </div>
    <p>It all started in 2015, when we were looking at the code of our authoritative DNS server. The code flow was generally fine, but it was all peppered with naughty statements like this:</p>
            <pre><code>if qtype == "ANY" {
    // special case
}</code></pre>
            <p>This special code was ugly and error prone. This got us thinking: do we really need it? "ANY" is not a popular query type - no legitimate software uses it (with the notable exception of qmail).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oOaG7G1WJfwM7TX5qGESz/c6e5c5359f4eb3c499d6b3e891d4f19b/11235945713_5bf22a701d_z.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/cmichel67/11235945713/">Image</a> by <a href="https://www.flickr.com/photos/cmichel67/">Christopher Michel</a>, CC BY 2.0</p>
    <div>
      <h2>ANY is hard for modern DNS servers</h2>
      <a href="#any-is-hard-for-modern-dns-servers">
        
      </a>
    </div>
    <p>"ANY" queries, also called "* queries" in old RFCs, are supposed to return "all records" (citing <a href="https://tools.ietf.org/html/rfc1035">RFC1035</a>). There are two problems with this notion.</p><p>First, it assumes the server is able to retrieve "all records". In our implementation - we can't. Our DNS server, like many modern implementations, doesn't have a single "zone" file listing all properties of a DNS zone. This design allows us to respond fast and with information always up to date, but it makes it incredibly hard to retrieve "all records". Correct handling of "ANY" adds unreasonable code complexity for an obscure, rarely used query type.</p><p>Second, many of the DNS responses are generated on-demand. To mention just two use cases:</p><ul><li><p>Some of our DNS responses <a href="/dnssec-done-right/">are based on location</a></p></li><li><p><a href="/black-lies/">We are using black lies and DNS shotgun for DNSSEC</a></p></li></ul><p>Storing data in modern databases and dynamically generating responses poses a fundamental problem to ANY.</p>
    <div>
      <h2>ANY is hard for clients</h2>
      <a href="#any-is-hard-for-clients">
        
      </a>
    </div>
    <p>Around the same time a catastrophe happened - <a href="https://lists.dns-oarc.net/pipermail/dns-operations/2015-March/012899.html">Firefox started shipping with DNS code issuing "ANY" queries</a>. The intention was, as usual, benign. Firefox developers wanted to get the TTL value for A and AAAA queries.</p><p>To cite DNS guru <a href="https://icannwiki.org/Andrew_Sullivan">Andrew Sullivan</a>:</p><blockquote><p>In general, ANY is useful for troubleshooting but should never be used for regular operation. Its output is unpredictable given the effects of caches. It can return enormous result sets.</p></blockquote><p>In user code you can't rely on anything sane to come out of an "ANY" query. While an "ANY" query has somewhat defined semantics on the DNS authoritative side, it's undefined on the DNS resolver side. Such a query can confuse the resolver:</p><ul><li><p>Should it forward the "ANY" query to the authoritative server?</p></li><li><p>Should it respond with any record that is already in cache?</p></li><li><p>Should it do some mixture of the above behaviors?</p></li><li><p>Should it cache the result of an "ANY" query and re-use the data for other queries?</p></li></ul><p>Different implementations do different things. "ANY" does not mean "ALL", which is the main source of confusion. To our joy, <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1093983#c14">Firefox quickly backpedaled</a> on the change and stopped issuing ANY queries.</p>
    <div>
      <h2>ANY is hard for network operators</h2>
      <a href="#any-is-hard-for-network-operators">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7FBoDQ01mFhkylbffAGioR/b08d10f697d24c84f232a1a2ca9bff02/Screenshot-from-2019-03-15-14-44-58.png" />
            
            </figure><p>A typical 50Gbps DNS amplification targeting one of our customers. The attack lasted about 4 hours.</p><p>Furthermore, since "ANY" queries can generate large responses, they were often used for DNS reflection attacks. Authoritative providers receive a spoofed ANY query and send the large answer to a target, potentially causing DoS damage. We have blogged about that many times:</p><ul><li><p><a href="https://blog.cloudflare.com/the-ddos-that-knocked-spamhaus-offline-and-ho/">The DDoS that knocked Spamhaus offline</a></p></li><li><p><a href="https://blog.cloudflare.com/deep-inside-a-dns-amplification-ddos-attack/">Deep inside a DNS amplification attack</a></p></li><li><p><a href="https://blog.cloudflare.com/reflections-on-reflections/">Reflections on reflections</a></p></li><li><p><a href="https://blog.cloudflare.com/how-the-consumer-product-safety-commission-is-inadvertently-behind-the-internets-largest-ddos-attacks/">How the CPSC is inadvertently behind the largest attacks</a></p></li></ul><p>The DoS problem with ANY is ancient. Here is a discussion about a <a href="https://lists.dns-oarc.net/pipermail/dns-operations/2013-May/010178.html">patch to bind tweaking ANY from 2013</a>.</p><p>There is also a second angle to the ANY DoS problem. Some reports suggested that performant DNS servers (authoritative or resolvers) <a href="https://fanf.livejournal.com/140566.html">can fill their outbound network capacity</a> with numerous ANY responses.</p><p>The recommendation is simple - network operators must use <a href="https://kb.isc.org/docs/aa-00994">"response rate limiting"</a> when answering large DNS queries, otherwise they pose a DoS threat. The "ANY" query type just happens to often give such large responses, while providing little value to legitimate users.</p>
    <div>
      <h2>Terminating ANY</h2>
      <a href="#terminating-any">
        
      </a>
    </div>
    <p>In 2015, frustrated with this experience, we announced that we would stop responding to "ANY" queries, and wrote a (controversial at the time) blog post:</p><ul><li><p><a href="https://blog.cloudflare.com/deprecating-dns-any-meta-query-type/">Deprecating DNS ANY meta-query type</a></p></li></ul><p>A year later we followed up explaining possible solutions:</p><ul><li><p><a href="https://blog.cloudflare.com/what-happened-next-the-deprecation-of-any/">What happened next - the deprecation of ANY</a></p></li></ul><p>And here we are today! With <a href="https://tools.ietf.org/html/rfc8482">RFC8482</a> we have a Proposed Standard clarifying that controversial query type.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VXjlDZD1THaZzfyH53uo0/59098d02034b3b987d143ba9d5936a90/Screenshot-from-2019-03-15-13-19-21.png" />
            
            </figure><p>ANY queries are background noise. Under normal circumstances, we see a very small volume of ANY queries.</p>
    <div>
      <h2>The future for our users</h2>
      <a href="#the-future-for-our-users">
        
      </a>
    </div>
    <p>What precisely can be done about "ANY" queries? RFC8482 specifies that:</p><blockquote><p>A DNS responder that receives an ANY query MAY decline to provide a conventional ANY response or MAY instead send a response with a single RRset (or a larger subset of available RRsets) in the answer section.</p></blockquote><p>This clearly defines the corner case - from now on the authoritative server may respond with, well, any record type to an "ANY" query. Sometimes simple stuff like this matters most.</p><p>This opens a gate for implementers - we can prepare a simple answer to these queries. As an implementer you may stick "A", "AAAA" or anything else in the response if you wish. Furthermore, the spec recommends returning a special (and rarely used thus far) HINFO type. This is in fact what we do:</p>
            <pre><code>$ dig ANY cloudflare.com @ns3.cloudflare.com. 
;; ANSWER SECTION:
cloudflare.com.		3789	IN	HINFO	"ANY obsoleted" "See draft-ietf-dnsop-refuse-any"</code></pre>
            <p>Oh, we need to update the message to mention the fresh RFC number! NS1 agrees with our implementation:</p>
            <pre><code>$ dig ANY nsone.net @dns1.p01.nsone.net.
;; ANSWER SECTION:
nsone.net.		3600	IN	HINFO	"ANY not supported." "See draft-ietf-dnsop-refuse-any"</code></pre>
            <p>Our ultimate hero is <code>wikipedia.org</code>, which does exactly what the RFC recommends:</p>
            <pre><code>$ dig ANY wikipedia.org @ns0.wikimedia.org.
;; ANSWER SECTION:
wikipedia.org.		3600	IN	HINFO	"RFC8482" ""</code></pre>
            <p>On our resolver service we stop ANY queries with the NOTIMP code. This makes us more confident that the resolver isn't used to perform DNS reflections:</p>
            <pre><code>$ dig ANY cloudflare.com @1.1.1.1
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOTIMP, id: 14151</code></pre>
            
    <div>
      <h2>The future for developers</h2>
      <a href="#the-future-for-developers">
        
      </a>
    </div>
    <p>On the client side, just don't use ANY DNS queries. On the DNS server side, you are allowed to rip out all the gory QTYPE::ANY handling code and replace it with a top-level HINFO response or the first RRset found. Enjoy cleaning your codebase!</p>
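<p>In pseudocode, the cleaned-up handler can be as small as the following sketch. The helper names and the toy zone here are hypothetical stand-ins, not taken from any real DNS server:</p>

```python
# Hypothetical sketch of an RFC8482-style ANY handler.
# record() and the zone dict are illustrative, not real APIs.

def record(qtype, ttl, data):
    return {"qtype": qtype, "ttl": ttl, "data": data}

def handle_query(qname, qtype, lookup):
    if qtype == "ANY":
        # RFC8482: a responder MAY answer ANY with a single HINFO RRset
        # instead of trying to assemble "all records".
        return [record("HINFO", 3600, ("RFC8482", ""))]
    return lookup(qname, qtype)

# A toy zone for the normal, non-ANY path.
zone = {("example.com.", "A"): [record("A", 300, "192.0.2.1")]}
lookup = lambda name, qtype: zone.get((name, qtype), [])

any_answer = handle_query("example.com.", "ANY", lookup)
a_answer = handle_query("example.com.", "A", lookup)
```

<p>The point of the design is that the ANY branch no longer needs to touch the zone data at all - exactly the special case the RFC lets you delete.</p>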
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>It took the DNS community some time to agree on the specifics, but here we are at the end. RFC8482 cleans up the last remaining DNS meta-qtype, and allows for simpler DNS authoritative and DNS resolver implementations. It finally clearly defines the semantics of ANY queries going through resolvers and reduces the DoS risk for the whole Internet.</p><p>Not all effort must go into shiny new protocols and developments; sometimes cleaning up bitrot is just as important. Similar cleanups are being done <a href="https://tools.ietf.org/html/draft-davidben-tls-grease-00">in other areas</a>. Keep up the good work!</p><p>We would like to thank the co-authors of RFC8482, and the community for its scrutiny and feedback. For us, RFC8482 is definitely a good thing: it allowed us to simplify our codebase and make the Internet safer.</p><p>Mission accomplished! One step at a time we can help make the Internet a better place.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">58CcadkrVsMxWuJbB7efLi</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[SOCKMAP - TCP splicing of the future]]></title>
            <link>https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/</link>
            <pubDate>Mon, 18 Feb 2019 13:13:02 GMT</pubDate>
            <description><![CDATA[ Recently we stumbled upon the holy grail for reverse proxies - a TCP socket splicing API. SOCKMAP is a very promising API and is likely to cause a tectonic shift in the architecture of data-heavy applications like software proxies. ]]></description>
            <content:encoded><![CDATA[ <p>Recently we stumbled upon the holy grail for reverse proxies - a TCP socket splicing API. This caught our attention because, as you may know, we run a global network of reverse proxy services. Proper TCP socket splicing reduces the load on userspace processes and enables more efficient data forwarding. We realized that the Linux kernel's SOCKMAP infrastructure can be reused for this purpose. SOCKMAP is a very promising API and is likely to cause a tectonic shift in the architecture of data-heavy applications like software proxies.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/69XVqZWzoAEEazmpmVeARy/6977b3f52755ec4075cb3f38e5e8bbff/31958194737_e06ecd6fcc_o.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/mustadmarine/31958194737/">Image</a> by <a href="https://www.flickr.com/photos/mustadmarine/">Mustad Marine</a>, public domain</p><p>But let’s rewind a bit.</p>
    <div>
      <h3>Birthing pains of L7 proxies</h3>
      <a href="#birthing-pains-of-l7-proxies">
        
      </a>
    </div>
    <p>Transmitting large amounts of data from userspace is inefficient. Linux provides a couple of specialized syscalls that aim to address this problem. For example, the <code>sendfile(2)</code> syscall (<a href="https://yarchive.net/comp/linux/splice.html">which Linus doesn't like</a>) can be used to speed up transferring large files from disk to a socket. Then there is <code>splice(2)</code> which traditional proxies use to forward data between two TCP sockets. Finally, <code>vmsplice</code> can be used to stick a memory buffer into a pipe without copying, but is very hard to use correctly.</p><p>Sadly, <code>sendfile</code>, <code>splice</code> and <code>vmsplice</code> are very specialized and synchronous, and solve only one part of the problem - they avoid copying the data to userspace. They leave other efficiency issues unaddressed.</p><table><tr><td><p><b></b></p></td><td><p><b>between</b></p></td><td><p><b>avoid user-space memory</b></p></td><td><p><b>zerocopy</b></p></td></tr><tr><td><p>sendfile</p></td><td><p>disk file --&gt; socket</p></td><td><p>yes</p></td><td><p>no</p></td></tr><tr><td><p>splice</p></td><td><p>pipe &lt;--&gt; socket</p></td><td><p>yes</p></td><td><p>yes?</p></td></tr><tr><td><p>vmsplice</p></td><td><p>memory region --&gt; pipe</p></td><td><p>no</p></td><td><p>yes</p></td></tr></table><p>Processes that forward large amounts of data face three problems:</p><ol><li><p>Syscall cost: making multiple syscalls for every forwarded packet is costly.</p></li><li><p>Wakeup latency: the user-space process must be woken up often to forward the data. Depending on the scheduler, this may result in poor tail latency.</p></li><li><p>Copying cost: copying data from kernel to userspace and then immediately back to the kernel is not free and adds up to a measurable cost.</p></li></ol>
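<p>The sendfile row of the table can be tried directly from Python, which exposes the syscall as <code>os.sendfile</code>. A minimal sketch assuming Linux, with a loopback TCP pair standing in for a real client connection:</p>

```python
import os
import socket
import tempfile

# sendfile(2): the kernel moves bytes from a file straight into a socket
# buffer - the payload never surfaces in userspace.
payload = b"hello from sendfile"

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()

    # A connected TCP pair over loopback stands in for a real client.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    client = socket.create_connection(srv.getsockname())
    conn, _ = srv.accept()

    # Offset 0, length of the whole file; returns the number of bytes sent.
    sent = os.sendfile(conn.fileno(), f.fileno(), 0, len(payload))
    received = client.recv(4096)

    for s in (conn, client, srv):
        s.close()
```

<p>Compared with a read()/write() loop this saves the round trip through userspace memory, but - as the table notes - it remains synchronous and works only for the file-to-socket direction.</p>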
    <div>
      <h3>Many tried</h3>
      <a href="#many-tried">
        
      </a>
    </div>
    <p>Forwarding data between TCP sockets is a common practice. It's needed for:</p><ul><li><p>Transparent forward HTTP proxies, like Squid.</p></li><li><p>Reverse caching HTTP proxies, like Varnish or NGINX.</p></li><li><p>Load balancers, like HAProxy, Pen or Relayd.</p></li></ul><p>Over the years there <a href="https://www.haproxy.org/download/1.3/doc/tcp-splicing.txt">have</a> been <a href="http://wwwconference.org/proceedings/www2002/refereed/627/index.html">many</a> <a href="https://lwn.net/Articles/200902/">attempts</a> to reduce the cost of dumb data forwarding between TCP sockets on Linux. This issue is generally called “TCP splicing”, “L7 splicing”, or “Socket splicing”.</p><p>Let’s compare the usual ways of doing TCP splicing. To simplify the problem, instead of writing a rich Layer 7 TCP proxy, we'll write a trivial TCP echo server.</p><p>It's not a joke. An echo server can illustrate TCP socket splicing well. You know - "echo" basically splices the socket… with itself!</p>
    <div>
      <h3>Naive: read write loop</h3>
      <a href="#naive-read-write-loop">
        
      </a>
    </div>
    <p>The naive TCP echo server would look like:</p>
            <pre><code>while True:
    data = read(sd, 4096)
    if not data:
        break
    writeall(sd, data)</code></pre>
            <p>Nothing simpler. On a blocking socket this is a totally valid program, and will work just fine. For completeness, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-naive.c#L57-L78">I prepared full code here</a>.</p>
    <div>
      <h3>Splice: specialized syscall</h3>
      <a href="#splice-specialized-syscall">
        
      </a>
    </div>
    <p>Linux has an amazing <a href="http://man7.org/linux/man-pages/man2/splice.2.html">splice(2) syscall</a>. It can tell the kernel to move data between a TCP buffer on a socket and a buffer on a pipe. The data remains in the buffers, on the kernel side. This solves the problem of needlessly having to copy the data between userspace and kernel-space. With the <code>SPLICE_F_MOVE</code> flag the kernel may be able to avoid copying the data at all!</p><p>Our program using <code>splice()</code> looks like:</p>
            <pre><code>pipe_rd, pipe_wr = pipe()
fcntl(pipe_rd, F_SETPIPE_SZ, 4096)

while True:
    n = splice(sd, pipe_wr, 4096)
    if n &lt;= 0:
        break
    splice(pipe_rd, sd, n)</code></pre>
            <p>We still need to wake up the userspace program and make two syscalls to forward each piece of data, but at least we avoid all the copying. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-splice.c#L76-L98">Full source</a>.</p>
    <div>
      <h3>io_submit: Using Linux AIO API</h3>
      <a href="#io_submit-using-linux-aio-api">
        
      </a>
    </div>
    <p><a href="/io_submit-the-epoll-alternative-youve-never-heard-about/">In a previous blog post about io_submit()</a> we proposed using the AIO interface with network sockets. Read the blog post for details, but <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-iosubmit.c#L81-L107">here is the prepared program</a> that has the echo server loop implemented with only a single syscall.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4qk4EejV30qMZoVZ2RcMFW/d1020375b1a5083ccdd442cfdfba186b/452423494_31aa5caca5_z-1.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/jrsnchzhrs/452423494">Image</a> by <a href="https://www.flickr.com/photos/jrsnchzhrs/">jrsnchzhrs</a> CC BY-ND 2.0</p>
    <div>
      <h3>SOCKMAP: The ultimate weapon</h3>
      <a href="#sockmap-the-ultimate-weapon">
        
      </a>
    </div>
    <p>In recent years the Linux kernel introduced an <a href="https://lwn.net/Articles/740157/">eBPF virtual machine</a>. With it, user-space programs can run specialized, non-Turing-complete bytecode in the kernel context. Nowadays it's possible to <a href="/tag/ebpf/">select eBPF programs</a> for dozens of use cases, ranging from packet filtering to policy enforcement.</p><p>Since kernel 4.14, Linux has had new eBPF machinery that can be used for socket splicing - SOCKMAP. It was created by John Fastabend at <a href="https://cilium.io/blog/2018/04/24/cilium-security-for-age-of-microservices/">Cilium.io</a>, exposing the <a href="https://www.kernel.org/doc/Documentation/networking/strparser.txt">Strparser</a> interface to eBPF programs. Cilium uses SOCKMAP for Layer 7 policy enforcement, and all the logic it uses is embedded in an eBPF program. The API is not well documented, requires root and, in our experience, is <a href="https://lore.kernel.org/netdev/20190211090949.18560-1-jakub@cloudflare.com/">slightly</a> <a href="https://lore.kernel.org/netdev/20190128091335.20908-1-jakub@cloudflare.com/">buggy</a>. But it's very promising. Read more:</p><ul><li><p>LPC2018 - Combining kTLS and BPF for Introspection and Policy Enforcement <a href="http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf">Paper</a> <a href="https://www.youtube.com/watch?v=NnibidVRtWY">Video</a> <a href="http://vger.kernel.org/lpc_net2018_talks/ktls_bpf.pdf">Slides</a></p></li><li><p><a href="https://lwn.net/Articles/731133/">Original SOCKMAP commit</a></p></li></ul><p>This is how to use SOCKMAP: SOCKMAP, or specifically "BPF_MAP_TYPE_SOCKMAP", is a type of eBPF map. This map is an "array" - indices are integers. All this is pretty standard. The magic is in the map values - they must be TCP socket descriptors.</p><p>This map is very special - it has two eBPF programs attached to it. You read that right: the eBPF programs live <i>attached to a map</i>, not attached to a socket, cgroup or network interface as usual. This is how you would set up <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap.c#L36-L80">SOCKMAP in a user program</a>:</p>
            <pre><code>sock_map = bpf_create_map(BPF_MAP_TYPE_SOCKMAP, sizeof(int), sizeof(int), 2, 0)

prog_parser = bpf_load_program(BPF_PROG_TYPE_SK_SKB, ...)
prog_verdict = bpf_load_program(BPF_PROG_TYPE_SK_SKB, ...)
bpf_prog_attach(prog_parser, sock_map, BPF_SK_SKB_STREAM_PARSER)
bpf_prog_attach(prog_verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT)</code></pre>
            <p>Ta-da! At this point we have an established <code>sock_map</code> eBPF map, with two eBPF programs attached: parser and verdict. The next step is to add a TCP socket descriptor to this map. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap.c#L130-L142">Nothing simpler</a>:</p>
            <pre><code>int idx = 0;
int val = sd;
bpf_map_update_elem(sock_map, &amp;idx, &amp;val, BPF_ANY);</code></pre>
            <p>At this point <i>the magic happens</i>. From now on, each time our socket <code>sd</code> receives a packet, prog_parser and prog_verdict are called. Their semantics are described in the <a href="https://www.kernel.org/doc/Documentation/networking/strparser.txt">strparser.txt</a> and the <a href="https://lwn.net/Articles/731133/">introductory SOCKMAP commit</a>. For simplicity, our trivial echo server only needs the minimal stubs. <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap-kern.c#L32-L43">This is the eBPF code</a>:</p>
            <pre><code>SEC("prog_parser")
int _prog_parser(struct __sk_buff *skb)
{
	return skb-&gt;len;
}

SEC("prog_verdict")
int _prog_verdict(struct __sk_buff *skb)
{
	uint32_t idx = 0;
	return bpf_sk_redirect_map(skb, &amp;sock_map, idx, 0);
}</code></pre>
            <p>Side note: for the purposes of this test program, I wrote a minimal eBPF loader. It has no dependencies (neither bcc, libelf, nor libbpf) and can do basic relocations (like resolving the <code>sock_map</code> symbol mentioned above). <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/tbpf.c">See the code</a>.</p><p>The call to <code>bpf_sk_redirect_map</code> is doing all the work. It tells the kernel: for the received packet, please oh please <i>redirect</i> it from a receive queue of some socket, to a transmit queue of the socket living in sock_map under index 0. In our case, these are the same sockets! Here we achieved exactly what the echo server is supposed to do, but purely in eBPF.</p><p>This technology has multiple benefits. First, the data is never copied to userspace. Secondly, we never need to wake up the userspace program. All the action is done in the kernel. Quite cool, isn't it?</p><p>We need one more piece of code, to hang the userspace program until the socket is closed. This is best done with good old <code>poll(2)</code>:</p>
            <pre><code>/* Wait for the socket to close. Let SOCKMAP do the magic. */
struct pollfd fds[1] = {
    {.fd = sd, .events = POLLRDHUP},
};
poll(fds, 1, -1);</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-02-tcp-splice/echo-sockmap.c#L144-L148">Full code.</a></p>
    <div>
      <h3>The benchmarks</h3>
      <a href="#the-benchmarks">
        
      </a>
    </div>
    <p>At this stage we have presented four simple TCP echo servers:</p><ul><li><p>naive read-write loop</p></li><li><p>splice</p></li><li><p>io_submit</p></li><li><p>SOCKMAP</p></li></ul><p>To recap, we are measuring the cost of three things:</p><ol><li><p>Syscall cost</p></li><li><p>Wakeup latency, mostly visible as tail latency</p></li><li><p>The cost of copying data</p></li></ol><p>Theoretically, SOCKMAP should beat all the others:</p><table><tr><td><p></p></td><td><p><b>syscall cost</b></p></td><td><p><b>waking up userspace</b></p></td><td><p><b>copying cost</b></p></td></tr><tr><td><p>read write loop</p></td><td><p>2 syscalls</p></td><td><p>yes</p></td><td><p>2 copies</p></td></tr><tr><td><p>splice</p></td><td><p>2 syscalls</p></td><td><p>yes</p></td><td><p>0 copy (?)</p></td></tr><tr><td><p>io_submit</p></td><td><p>1 syscall</p></td><td><p>yes</p></td><td><p>2 copies</p></td></tr><tr><td><p>SOCKMAP</p></td><td><p>none</p></td><td><p>no</p></td><td><p>0 copies</p></td></tr></table>
    <div>
      <h3>Show me the numbers</h3>
      <a href="#show-me-the-numbers">
        
      </a>
    </div>
    <p>This is the part of the post where I'm showing you the breathtaking numbers, clearly distinguishing the different approaches. Sadly, benchmarking is hard, and well... SOCKMAP turned out to be the slowest. It's <a href="https://cyber.wtf/2017/07/28/negative-result-reading-kernel-memory-from-user-mode/">important to publish negative results</a> so here they are.</p><p>Our test rig was as follows:</p><ul><li><p>Two bare-metal Xeon servers connected with a 25Gbps network.</p></li><li><p>Both have turbo-boost disabled, and the testing programs are CPU-pinned.</p></li><li><p>For better locality we localized RX and TX queues to one IRQ/CPU each.</p></li><li><p>The testing server runs a script that sends 10k batches of fixed-sized blocks of data. The script measures how long it takes for the echo server to return the traffic.</p></li><li><p>We do 10 separate runs for each measured echo-server program.</p></li><li><p>TCP: "cubic" and NONAGLE=1.</p></li><li><p>Both servers run the 4.14 kernel.</p></li></ul><p>Our analysis of the experimental data identified some outliers. We think some of the worst times, manifested as long echo replies, were caused by unrelated factors such as network packet loss. In the charts presented we, perhaps controversially, skip the worst 1% of outliers in order to focus on what we think is the important data.</p><p>Furthermore, we spotted a bug in SOCKMAP. Some of the runs were delayed by up to a whopping 64ms. Here is one of the tests:</p>
            <pre><code>Values min:236.00 avg:669.28 med=390.00 max:78039.00 dev:3267.75 count:2000000
Values:
 value |-------------------------------------------------- count
     1 |                                                   0
     2 |                                                   0
     4 |                                                   0
     8 |                                                   0
    16 |                                                   0
    32 |                                                   0
    64 |                                                   0
   128 |                                                   0
   256 |                                                   3531
   512 |************************************************** 1756052
  1024 |                                             ***** 208226
  2048 |                                                   18589
  4096 |                                                   2006
  8192 |                                                   9
 16384 |                                                   1
 32768 |                                                   0
 65536 |                                                   11585
131072 |                                                   1</code></pre>
            <p>The great majority of the echo runs (of 128KiB in this case) finished in the 512us band, while a small fraction stalled for 65ms. This is pretty bad and makes comparing SOCKMAP to the other implementations pretty meaningless. This is the second reason why we skip the worst 1% of results from all the runs - it makes the SOCKMAP numbers way more usable. Sorry.</p>
    <div>
      <h3>2MiB blocks - throughput</h3>
      <a href="#2mib-blocks-throughput">
        
      </a>
    </div>
    <p>The fastest of our programs was doing ~15Gbps over one flow, which seems to be a hardware limit. This is very visible in the first test, which shows the throughput of our echo programs.</p><p>This test measures the time to transmit and receive 2MiB blocks of data via our tested echo server. We repeat this 10k times, and run the test 10 times. After stripping the worst 1% of numbers we get the following latency distribution:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4yxe0fOUjkJpcsD1uoCIm6/03169327d9ee3060e31aa1d59d12283a/numbers-2mib-2.png" />
            
            </figure><p>This chart shows that both the naive read+write and io_submit programs were able to achieve a 1500us mean round trip time when echoing 2MiB blocks.</p><p>Here we clearly see that splice and SOCKMAP are slower than the others. They were CPU-bound and unable to reach the line rate. We have raised the <a href="https://www.spinics.net/lists/netdev/msg539609.html">unusual splice performance problems</a> in the past, but perhaps we should debug them one more time.</p><p>For each server we ran the tests twice: without and with the SO_BUSY_POLL setting. This setting should remove the "wakeup latency" and greatly reduce the jitter. The results show that the naive and io_submit tests are almost identical. This is perfect! BUSYPOLL does indeed reduce the deviation and latency, at a cost of more CPU usage. Notice that neither splice nor SOCKMAP is affected by this setting.</p>
    <div>
      <h3>16KiB blocks - wakeup time</h3>
      <a href="#16kib-blocks-wakeup-time">
        
      </a>
    </div>
    <p>Our second run of tests was with much smaller data sizes, sending tiny 16KiB blocks at a time. This test should illustrate the "wakeup time" of the tested programs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WxbvoQBvU00WzJldYpSED/b794cd76ed5acb14d807fd0fd624857a/numbers-16kib-1.png" />
            
            </figure><p>In this test the non-BUSYPOLL runs of all the programs look quite similar (min and max values), with SOCKMAP being the exception. This is great - we can speculate that the wakeup times are comparable. Surprisingly, splice has a slightly better median time than the others. Perhaps this can be explained by CPU artifacts, like better CPU cache locality due to less data copying. SOCKMAP is, again, the slowest, with the worst max and median times. Boo.</p><p>Remember that we truncated the worst 1% of the data - we artificially shortened the "max" values.</p>
    <div>
      <h3>TL;DR</h3>
      <a href="#tl-dr">
        
      </a>
    </div>
    <p>In this blog post we discussed the theoretical benefits of SOCKMAP. Sadly, we found it's not ready for prime time yet. We compared it against splice, which didn't benefit from BUSYPOLL and had disappointing performance. The naive read/write loop and io_submit approaches turned out to have exactly the same performance characteristics, and do benefit from BUSYPOLL to reduce jitter (wakeup time).</p><p>If you are piping data between TCP sockets, you should definitely take a look at SOCKMAP. While our benchmarks show it's not ready for prime time yet, with poor performance, high jitter and a couple of bugs, it's very promising. We are very excited about it. It's the first technology on Linux that truly allows the user-space process to offload TCP splicing to the kernel. It also has potential for much better performance than other approaches, ticking all the boxes of being async, kernel-only and totally avoiding needless copying of data.</p><p>This is not everything. SOCKMAP is able to pipe data across multiple sockets - you can imagine a full mesh of connections being able to send data to each other. Furthermore, it exposes the <code>strparser</code> API, which can be used to offload basic application framing. Combined with <a href="https://github.com/torvalds/linux/blob/master/Documentation/networking/tls.txt">kTLS</a>, it can provide transparent encryption. There are also rumors of adding UDP support. The possibilities are endless.</p><p>Recently the kernel has been exploding with eBPF innovations. It seems like we've only just scratched the surface of the possibilities exposed by the modern eBPF interfaces.</p><p>Many thanks to <a href="https://twitter.com/jkbs0">Jakub Sitnicki</a> for suggesting SOCKMAP in the first place, writing the proof of concept and now actually fixing the bugs we found. Go strong Warsaw office!</p> ]]></content:encoded>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">53j7wbfn6eJxOvUzmnn7H1</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[io_submit: The epoll alternative you've never heard about]]></title>
            <link>https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/</link>
            <pubDate>Fri, 04 Jan 2019 11:02:26 GMT</pubDate>
            <description><![CDATA[ The Linux AIO is designed for, well, Asynchronous disk IO! Disk files are not network sockets, but is it possible to use the Linux AIO API with network sockets? The answer is a strong YES!
 ]]></description>
            <content:encoded><![CDATA[ <p>My curiosity was piqued by an LWN article about <a href="https://lwn.net/Articles/743714/">IOCB_CMD_POLL - A new kernel polling interface</a>. It discusses an addition of a new polling mechanism to Linux AIO API, which was merged in 4.18 kernel. The whole idea is rather intriguing. The author of the patch is proposing to use the Linux AIO API with things like network sockets.</p><p>Hold on. The Linux AIO is designed for, well, <b>A</b>synchronous disk <b>IO</b>! Disk files are not the same thing as network sockets! Is it even possible to use the Linux AIO API with network sockets in the first place?</p><p>The answer turns out to be a strong YES! In this article I'll explain how to use the strengths of Linux AIO API to write better and faster network servers.</p><p>But before we start, what is Linux AIO anyway?</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/eFqY9nu5REcVlHdjmlag3/bc9f6b06f408bd6703a8ceb51b27f588/6891085910_3390ebe29f_k.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/schill/6891085910/">Photo</a> by <a href="https://www.flickr.com/photos/schill">Scott Schiller</a> CC/BY/2.0</p>
    <div>
      <h2>Introduction to Linux AIO</h2>
      <a href="#introduction-to-linux-aio">
        
      </a>
    </div>
    <p>Linux AIO exposes asynchronous disk IO to userspace software.</p><p>Historically on Linux, all disk operations were blocking. Whether you did <code>open()</code>, <code>read()</code>, <code>write()</code> or <code>fsync()</code>, you could be sure your thread would stall if the needed data and meta-data was not ready in the disk cache. This usually isn't a problem. If you do a small amount of IO or have plenty of memory, the disk syscalls would gradually fill the cache and on average be rather fast.</p><p>IO operation performance drops for IO-heavy workloads, like databases or caching web proxies. In such applications it would be tragic if a whole server stalled just because some odd <code>read()</code> syscall had to wait for disk.</p><p>To work around this problem, applications use one of three approaches:</p><p>(1) Use thread pools and offload blocking syscalls to worker threads. This is what the glibc POSIX AIO (not to be confused with Linux AIO) wrapper does. (See: <a href="https://www.ibm.com/developerworks/linux/library/l-async/">IBM's documentation</a>). This is also what we ended up doing in our application at Cloudflare - <a href="/how-we-scaled-nginx-and-saved-the-world-54-years-every-day/">we offloaded read() and open() calls to a thread pool</a>.</p><p>(2) Pre-warm the disk cache with <code>posix_fadvise(2)</code> and hope for the best.</p><p>(3) Use Linux AIO with the XFS file system, <a href="https://lwn.net/Articles/671649/">files opened with O_DIRECT</a>, and <a href="https://www.scylladb.com/2016/02/09/qualifying-filesystems/">avoid the undocumented pitfalls</a>.</p><p>None of these methods is perfect. Even Linux AIO, if used carelessly, can still block in the <code>io_submit()</code> call. This was recently mentioned <a href="https://lwn.net/Articles/724198/">in another LWN article</a>:</p><blockquote><p>The Linux asynchronous I/O (AIO) layer tends to have many critics and few defenders, but most people at least expect it to actually be asynchronous. In truth, an AIO operation can block in the kernel for a number of reasons, making AIO difficult to use in situations where the calling thread truly cannot afford to block.</p></blockquote><p>Now that we know what the Linux AIO API doesn't do well, let's see where it shines.</p>
    <div>
      <h2>Simplest Linux AIO program</h2>
      <a href="#simplest-linux-aio-program">
        
      </a>
    </div>
    <p>To use Linux AIO you first need <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_passwd.c">to define the 5 needed syscalls</a> - glibc doesn't provide wrapper functions. The flow is:</p><p>(1) First, call <code>io_setup()</code> to set up the <code>aio_context</code> data structure. The kernel will hand us an opaque pointer.</p><p>(2) Then we can call <code>io_submit()</code> to submit a vector of "I/O control blocks" <code>struct iocb</code> for processing.</p><p>(3) Finally, we can call <code>io_getevents()</code> to block and wait for a vector of <code>struct io_event</code> - the completion notifications for the iocbs.</p><p>There are 8 commands that can be submitted in an iocb. Two read, two write, two fsync variants and a POLL command introduced in the 4.18 kernel:</p>
            <pre><code>IOCB_CMD_PREAD = 0,
IOCB_CMD_PWRITE = 1,
IOCB_CMD_FSYNC = 2,
IOCB_CMD_FDSYNC = 3,
IOCB_CMD_POLL = 5,   /* from 4.18 */
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,</code></pre>
            <p>The <a href="https://github.com/torvalds/linux/blob/f346b0becb1bc62e45495f9cdbae3eef35d0b635/include/uapi/linux/aio_abi.h#L73-L107"><code>struct iocb</code></a> passed to <code>io_submit</code> is large, and tuned for disk IO. Here's a simplified version:</p>
            <pre><code>struct iocb {
  __u64 data;           /* user data */
  ...
  __u16 aio_lio_opcode; /* see IOCB_CMD_ above */
  ...
  __u32 aio_fildes;     /* file descriptor */
  __u64 aio_buf;        /* pointer to buffer */
  __u64 aio_nbytes;     /* buffer size */
...
}</code></pre>
            <p>The completion notification retrieved from <code>io_getevents</code>:</p>
            <pre><code>struct io_event {
  __u64  data;  /* user data */
  __u64  obj;   /* pointer to request iocb */
  __s64  res;   /* result code for this event */
  __s64  res2;  /* secondary result */
};
</code></pre>
            <p>Let's try an example. Here's the simplest program reading /etc/passwd file with Linux AIO API:</p>
            <pre><code>fd = open("/etc/passwd", O_RDONLY);

aio_context_t ctx = 0;
r = io_setup(128, &amp;ctx);

char buf[4096];
struct iocb cb = {.aio_fildes = fd,
                  .aio_lio_opcode = IOCB_CMD_PREAD,
                  .aio_buf = (uint64_t)buf,
                  .aio_nbytes = sizeof(buf)};
struct iocb *list_of_iocb[1] = {&amp;cb};

r = io_submit(ctx, 1, list_of_iocb);

struct io_event events[1] = {{0}};
r = io_getevents(ctx, 1, 1, events, NULL);

bytes_read = events[0].res;
printf("read %lld bytes from /etc/passwd\n", bytes_read);</code></pre>
            <p>Full source is, <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_passwd.c">as usual, on GitHub</a>. Here's a strace of this program:</p>
            <pre><code>openat(AT_FDCWD, "/etc/passwd", O_RDONLY)
io_setup(128, [0x7f4fd60ea000])
io_submit(0x7f4fd60ea000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7ffc5ff703d0, aio_nbytes=4096, aio_offset=0}])
io_getevents(0x7f4fd60ea000, 1, 1, [{data=0, obj=0x7ffc5ff70390, res=2494, res2=0}], NULL)</code></pre>
            <p>This all worked fine! But the disk read was not asynchronous: the <code>io_submit</code> syscall blocked and did all the work! The <code>io_getevents</code> call finished instantly. We could try to make the disk read async, but this requires the O_DIRECT flag, which skips the caches.</p><p>Let's better illustrate the blocking nature of <code>io_submit</code> on normal files. Here's a similar example, showing the strace when reading a large 1GiB block from <code>/dev/zero</code>:</p>
            <pre><code>io_submit(0x7fe1e800a000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7fe1a79f4000, aio_nbytes=1073741824, aio_offset=0}]) \
    = 1 &lt;0.738380&gt;
io_getevents(0x7fe1e800a000, 1, 1, [{data=0, obj=0x7fffb9588910, res=1073741824, res2=0}], NULL) \
    = 1 &lt;0.000015&gt;</code></pre>
            <p>The kernel spent 738ms in <code>io_submit</code> and only 15us in <code>io_getevents</code>. The kernel behaves the same way with network sockets - all the work is done in <code>io_submit</code>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Aem3ZSReNmWJZwFNQmySO/326c6c7ca476e955d42a4e14ba8e6150/Network_card-2.jpg" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Network_card.jpg">Photo</a> by <a href="https://commons.wikimedia.org/wiki/User:Helix84">Helix84</a> CC/BY-SA/3.0</p>
    <div>
      <h2>Linux AIO with sockets - batching</h2>
      <a href="#linux-aio-with-sockets-batching">
        
      </a>
    </div>
    <p>The implementation of <code>io_submit</code> is rather conservative. Unless the passed descriptor is an O_DIRECT file, it will just block and perform the requested action. In the case of network sockets it means:</p><ul><li><p>For blocking sockets IOCB_CMD_PREAD will hang until a packet arrives.</p></li><li><p>For non-blocking sockets IOCB_CMD_PREAD will return the -11 (EAGAIN) error code.</p></li></ul><p>These are exactly the same semantics as for the vanilla <code>read()</code> syscall. It's fair to say that for network sockets <code>io_submit</code> is no smarter than good old read/write calls.</p><p>It's important to note that the <code>iocb</code> requests passed to the kernel are evaluated in order, sequentially.</p><p>While Linux AIO won't make network operations asynchronous, it can definitely be used for syscall batching!</p><p>If you have a web server needing to send and receive data from hundreds of network sockets, using <code>io_submit</code> can be a great idea. This avoids having to call <code>send</code> and <code>recv</code> hundreds of times, which improves performance - jumping back and forth from userspace to kernel is not free, especially <a href="https://gist.github.com/antirez/9e716670f76133ec81cb24036f86ee95">since the Meltdown and Spectre mitigations</a>.</p><table><tr><td><p></p></td><td><p><b>One buffer</b></p></td><td><p><b>Multiple buffers</b></p></td></tr><tr><td><p>One file descriptor</p></td><td><p>read()</p></td><td><p>readv()</p></td></tr><tr><td><p>Many file descriptors</p></td><td><p>io_submit + IOCB_CMD_PREAD</p></td><td><p>io_submit + IOCB_CMD_PREADV</p></td></tr></table><p>To illustrate the batching aspect of <code>io_submit</code>, let's create a small program forwarding data from one TCP socket to another. In its simplest form, without Linux AIO, the program would have a trivial flow like this:</p>
            <pre><code>while True:
  d = sd1.read(4096)
  sd2.write(d)</code></pre>
            <p>We can express the same logic with Linux AIO. The code will look like this:</p>
            <pre><code>struct iocb cb[2] = {{.aio_fildes = sd2,
                      .aio_lio_opcode = IOCB_CMD_PWRITE,
                      .aio_buf = (uint64_t)&amp;buf[0],
                      .aio_nbytes = 0},
                     {.aio_fildes = sd1,
                     .aio_lio_opcode = IOCB_CMD_PREAD,
                     .aio_buf = (uint64_t)&amp;buf[0],
                     .aio_nbytes = BUF_SZ}};
struct iocb *list_of_iocb[2] = {&amp;cb[0], &amp;cb[1]};
while(1) {
  r = io_submit(ctx, 2, list_of_iocb);

  struct io_event events[2] = {};
  r = io_getevents(ctx, 2, 2, events, NULL);
  cb[0].aio_nbytes = events[1].res;
}</code></pre>
            <p>This code submits two jobs to <code>io_submit</code>: first a request to write data to <code>sd2</code>, then a request to read data from <code>sd1</code>. After the read is done, the code fixes up the write buffer size and loops again. The code uses a cool trick - the first write is of size 0. We do this because we can fuse write+read in one io_submit (but not read+write); after each read completes, we fix up the write buffer size.</p><p>Is this code faster than the simple read/write version? Not yet. Both versions have two syscalls: read+write and io_submit+io_getevents. Fortunately, we can improve it.</p>
    <div>
      <h2>Getting rid of io_getevents</h2>
      <a href="#getting-rid-of-io_getevents">
        
      </a>
    </div>
    <p>When running <code>io_setup()</code>, the kernel allocates a couple of pages of memory for the process. This is how this memory block looks in /proc/&lt;pid&gt;/maps:</p>
            <pre><code>marek:~$ cat /proc/`pidof -s aio_passwd`/maps
...
7f7db8f60000-7f7db8f63000 rw-s 00000000 00:12 2314562     /[aio] (deleted)
...</code></pre>
            <p>The [aio] memory region (12KiB in my case) was allocated by <code>io_setup</code>. This memory range is used as a ring buffer, storing the completion events. In most cases, there isn't any reason to call the real <code>io_getevents</code> syscall. The completion data can be easily retrieved from the ring buffer without consulting the kernel. Here is a fixed version of the code:</p>
            <pre><code>int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                 struct io_event *events, struct timespec *timeout)
{
    int i = 0;

    struct aio_ring *ring = (struct aio_ring*)ctx;
    if (ring == NULL || ring-&gt;magic != AIO_RING_MAGIC) {
        goto do_syscall;
    }

    while (i &lt; max_nr) {
        unsigned head = ring-&gt;head;
        if (head == ring-&gt;tail) {
            /* There are no more completions */
            break;
        } else {
            /* There is another completion to reap */
            events[i] = ring-&gt;events[head];
            read_barrier();
            ring-&gt;head = (head + 1) % ring-&gt;nr;
            i++;
        }
    }

    if (i == 0 &amp;&amp; timeout != NULL &amp;&amp; timeout-&gt;tv_sec == 0 &amp;&amp; timeout-&gt;tv_nsec == 0) {
        /* Requested non blocking operation. */
        return 0;
    }

    if (i &amp;&amp; i &gt;= min_nr) {
        return i;
    }

do_syscall:
    return syscall(__NR_io_getevents, ctx, min_nr-i, max_nr-i, &amp;events[i], timeout);
}
</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_poll.c#L63">Here's the full code</a>. This ring buffer interface is poorly documented; I adapted this code from <a href="https://github.com/axboe/fio/blob/702906e9e3e03e9836421d5e5b5eaae3cd99d398/engines/libaio.c#L149-L172">the axboe/fio project</a>.</p><p>With <code>io_getevents</code> fixed up like this, our Linux AIO version of the TCP proxy needs only one syscall per loop, and indeed is a tiny bit faster than the read+write code.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BUl6rp23b2NJ1XX8Z7RmV/1260b98a4e8b77dc6610626a38ce6ecb/16026681353_90b26e0731_z.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/99279135@N05/16026681353/">Photo</a> by <a href="https://www.flickr.com/photos/99279135@N05/">Train Photos</a> CC/BY-SA/2.0</p>
    <div>
      <h2>Epoll alternative</h2>
      <a href="#epoll-alternative">
        
      </a>
    </div>
    <p>With the addition of IOCB_CMD_POLL in kernel 4.18, <code>io_submit</code> can also be used as a select/poll/epoll equivalent. For example, here's some code waiting for data on a socket:</p>
            <pre><code>struct iocb cb = {.aio_fildes = sd,
                  .aio_lio_opcode = IOCB_CMD_POLL,
                  .aio_buf = POLLIN};
struct iocb *list_of_iocb[1] = {&amp;cb};

r = io_submit(ctx, 1, list_of_iocb);
r = io_getevents(ctx, 1, 1, events, NULL);</code></pre>
            <p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2019-01-io-submit/aio_poll.c">Full code</a>. Here's the strace view:</p>
            <pre><code>io_submit(0x7fe44bddd000, 1, [{aio_lio_opcode=IOCB_CMD_POLL, aio_fildes=3}]) \
    = 1 &lt;0.000015&gt;
io_getevents(0x7fe44bddd000, 1, 1, [{data=0, obj=0x7ffef65c11a8, res=1, res2=0}], NULL) \
    = 1 &lt;1.000377&gt;</code></pre>
            <p>As you can see, this time the "async" part worked fine: <code>io_submit</code> finished instantly and <code>io_getevents</code> successfully blocked for one second while awaiting data. This is pretty powerful and can be used instead of the <code>epoll_wait()</code> syscall.</p><p>Furthermore, dealing with <code>epoll</code> normally requires juggling <code>epoll_ctl</code> syscalls, and application developers go to great lengths to avoid calling it too often. <a href="http://man7.org/linux/man-pages/man7/epoll.7.html">Just read the man page</a> on the EPOLLONESHOT and EPOLLET flags. Using <code>io_submit</code> for polling sidesteps this whole complexity and doesn't require any spurious syscalls: just push your sockets into the iocb request vector, call <code>io_submit</code> exactly once, and wait for completions. The API can't get simpler than this.</p>
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this blog post we reviewed the Linux AIO API. While initially conceived as a disk-only API, it seems to work in the same way as normal read/write syscalls on network sockets. But unlike read/write, <code>io_submit</code> allows syscall batching, potentially improving performance.</p><p>Since kernel 4.18, <code>io_submit</code> and <code>io_getevents</code> can be used to wait for events like POLLIN and POLLOUT on network sockets. This is great, and could serve as a replacement for <code>epoll()</code> in an event loop.</p><p>I can imagine a network server that issues only <code>io_submit</code> and <code>io_getevents</code> syscalls, as opposed to the usual mix of <code>read</code>, <code>write</code>, <code>epoll_ctl</code> and <code>epoll_wait</code>. With such a design, the syscall batching aspect of <code>io_submit</code> could really shine, and such a server could be meaningfully faster.</p><p>Sadly, even with the recent Linux AIO API improvements, the larger debate remains. Famously, <a href="https://lwn.net/Articles/671657/">Linus hates it</a>:</p><blockquote><p>AIO is a horrible ad-hoc design, with the main excuse being "other, less gifted people, made that design, and we are implementing it for compatibility because database people - who seldom have any shred of taste - actually use it". But AIO was always really really ugly.</p></blockquote><p>Over the years there have been multiple attempts at creating better batching and async interfaces, unfortunately lacking a coherent vision. For example, the <a href="https://github.com/torvalds/linux/blob/6f0d349d922ba44e4348a17a78ea51b7135965b1/Documentation/networking/msg_zerocopy.rst">recent addition of <code>sendto(MSG_ZEROCOPY)</code></a> allows for truly asynchronous transmission operations, but no batching. <code>io_submit</code> allows batching, but not asynchrony. It's even worse than that: Linux currently has three ways of delivering async notifications - signals, <code>io_getevents</code> and <code>MSG_ERRQUEUE</code>.</p><p>Having said that, I'm really excited to see the new developments that make it possible to build faster network servers. I'm jumping on the code to replace my rusty epoll event loops with <code>io_submit</code>!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">6zyOz9SzvPQNiQFsPWodxb</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
    </channel>
</rss>