Why we use the Linux kernel's TCP stack

A recent blog post posed the question "Why do we use the Linux kernel's TCP stack?" It triggered a very interesting discussion on Hacker News.

I've also thought about this question while working at CloudFlare. My experience mostly comes from working with thousands of production machines here and I can try to answer the question from that perspective.

CC BY 2.0 image by John Vetterli

Let's start with a broader question - what is the point of running an operating system at all? If you planned on running a single application, having to use a kernel consisting of several million lines of code may sound like a burden.

But in fact most of us decide to run some kind of OS, and we do that for two reasons. Firstly, the OS layer adds hardware independence and easy-to-use APIs. With these we can focus on writing code for any machine - not only the specialized hardware we have at the moment. Secondly, the OS adds a time-sharing layer. This allows us to run more than one application at a time. Whether it's a second HTTP server or just a bash session, this ability to share resources between multiple processes is critical. All of the resources exposed by the kernel can be shared between multiple processes!

Userspace Networking

This is no different for the networking stack. By using the general purpose operating system network stack we gain the ability to run multiple network applications. This is lost if we dedicate the network card hardware to a single application in order to run a userspace network stack. By claiming the network card for one process you lose the ability to run, say, an SSH session concurrently with your servers.

As crazy as it sounds, this is exactly what most off-the-shelf userspace network stack technologies propose. The common term for this is "full kernel bypass". The idea is to bypass the kernel and use the network hardware directly from a userspace process.

CC BY 2.0 image by Audiotecna Música

In the Linux ecosystem there are a few available technologies. Not all are open source:

I've written about these in a previous post. All of these technologies require handing over the full network card to a single process. In other words: it's totally possible to write your own network stack, make it brilliant, focus on super features, and optimize for performance. But there is a big cost incurred - you will be restricted to running at most one process for each network card.

There is a small twist with virtualized network cards (VFs), but let's cut it short - it doesn't work. I talked about this in the "Virtualization approach" paragraph of that post.

But even with all these roadblocks, I can't just dismiss the benefits of kernel bypass. Many people do run custom network stacks and they do it for one of two reasons:

latency
performance (lower CPU cost, better throughput)

Latency is very important for the HFT (high frequency trading) folks. Traders can afford custom hardware and fancy proprietary network stacks. I, on the other hand, would feel very uncomfortable running a closed source TCP stack.

Kernel bypass at CloudFlare

Having said that, at CloudFlare we do use kernel bypass. We are in the second group - we care about performance. More specifically we suffer from IRQ storms. The Linux networking stack has a limit on how many packets per second it can handle. When the limit is reached all CPUs become busy just receiving packets. In that case either the packets are dropped or the applications are starved of CPU. While we don't have to deal with IRQ storms during our normal operation, this does happen when we are the target of an L3 (layer 3 OSI) DDoS attack. This is a type of attack where the target is flooded with arbitrary packets not belonging to valid connections - typically spoofed packets.

CC BY-SA 2.0 image by Howard Lake

During some attacks we are flooded with up to 3M packets per second (pps) per server. A general rule is that Linux iptables can handle about 1Mpps on a decent server, while still having enough CPU for applications. This number can be increased by proper tuning.
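To put these rates in perspective, here is a back-of-the-envelope calculation of how much CPU time is available per packet. It is only a sketch: the function name is made up for illustration, and it assumes packets are spread perfectly evenly over the given number of cores, which real traffic rarely is.

```python
def per_packet_budget_ns(pps: float, cores: int = 1) -> float:
    """Nanoseconds of CPU time available per packet.

    Assumes (for illustration only) that packets are spread evenly
    over `cores` and that those cores do nothing but handle packets.
    """
    return cores * 1e9 / pps


if __name__ == "__main__":
    for rate in (1_000_000, 3_000_000):
        budget = per_packet_budget_ns(rate)
        print(f"{rate / 1e6:.0f}Mpps: {budget:.0f} ns per packet on one core")
```

At 1Mpps a single core has 1000 ns to spend on each packet; at 3Mpps that shrinks to roughly 333 ns, before the application gets any CPU time at all.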

With this scale of attack the Linux kernel is not enough for us. We must work around it. We don't use the previously mentioned "full kernel bypass"; instead we run what we call a "partial kernel bypass". With this the kernel retains ownership of the network card, and allows us to perform a bypass only on a single "RX queue". We use Solarflare's EFVI API on Solarflare NICs. To support Intel NICs we added a partial kernel bypass feature to Netmap, as described in this blog post. With this technique we can offload our anti-DDoS iptables to a very fast userspace process. This saves Linux from processing attack packets, and therefore avoids IRQ storms.

What about a full userspace TCP stack?

My colleagues regularly ask me: why don't we just run our NGINX with the Solarflare OpenOnload framework, using a super fast userspace TCP stack?

Yes, it would be faster, but there is no evidence it will actually make much of a practical difference. Most of the CPU used on our servers goes to the userspace NGINX processes and not to the operating system. The CPU is mostly spent on usual NGINX bookkeeping and our Lua application logic, not on network handling. I estimate with bypass we could save about 5-10% CPU, which is (currently) not worth the effort.

CC BY 2.0 image by Charlie

Next, using kernel bypass for NGINX would interfere with our usual debugging tools. Our systemtap scripts would become useless. Linux netstat statistics would stop recording critical events and tcpdump wouldn't work anymore.

Then there is the issue of our DDoS mitigation stack. We are heavy users of iptables, as I documented in this BlackHat presentation. Custom TCP stacks just don't have things like "hashlimits" and "ipsets".
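To give a feel for what a custom stack would have to reimplement, here is a minimal sketch of per-key rate limiting, loosely in the spirit of what a "hashlimit" rule provides. It is not how iptables implements it: the class name, parameters and the token bucket details below are assumptions chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Bucket:
    tokens: float  # tokens currently available
    last: float    # timestamp of the last update


class PerKeyRateLimiter:
    """Hypothetical token bucket per key (for example, per source IP)."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate    # tokens added per second
        self.burst = burst  # maximum tokens a bucket can hold
        self.buckets: Dict[str, Bucket] = {}

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if a packet for `key` is within its rate limit."""
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(key)
        if bucket is None:
            # First packet for this key: start with a full bucket.
            bucket = self.buckets[key] = Bucket(self.burst, now)
        else:
            # Refill in proportion to elapsed time, capped at `burst`.
            elapsed = now - bucket.last
            bucket.tokens = min(self.burst, bucket.tokens + elapsed * self.rate)
            bucket.last = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    limiter = PerKeyRateLimiter(rate=1.0, burst=2.0)
    for i in range(4):
        print(i, limiter.allow("192.0.2.1", now=0.0))
```

Even this toy version leaves out everything that makes the real feature usable in production, such as expiring idle entries and bounding memory use under a spoofed-source flood.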

And it is not only firewall features. The Linux TCP stack has some extremely useful, non-trivial support for things like RFC 4821, controlled by the net.ipv4.tcp_mtu_probing sysctl. Support for this is critical when a user is behind an ICMP black hole. Read more in this blog post about PMTU.
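As a small illustration, the current setting can be read from procfs. The sketch below assumes a Linux machine where the sysctl is exposed at /proc/sys/net/ipv4/tcp_mtu_probing; the helper name is made up, and the mode descriptions paraphrase the kernel's ip-sysctl documentation.

```python
from pathlib import Path
from typing import Optional

# Meaning of each value, paraphrased from the kernel's ip-sysctl documentation.
MTU_PROBING_MODES = {
    0: "disabled",
    1: "disabled by default, enabled when an ICMP black hole is detected",
    2: "always enabled",
}


def read_tcp_mtu_probing(
    path: str = "/proc/sys/net/ipv4/tcp_mtu_probing",
) -> Optional[int]:
    """Return the sysctl value, or None if it cannot be read or parsed."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    value = read_tcp_mtu_probing()
    if value is None:
        print("tcp_mtu_probing is not available on this system")
    else:
        print(f"tcp_mtu_probing = {value} ({MTU_PROBING_MODES.get(value, 'unknown')})")
```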

Finally, each TCP stack comes with its own set of bugs and quirks. We've documented three non-obvious quirks in the Linux TCP stack:

Garbage collector kicking in on a read buffer;
Problems with too many listening sockets;
What it means for a socket to be writeable.

Imagine debugging issues like that in a closed source or a young TCP stack (or both!).

Conclusion

There are two general themes: first, there is no stable open-source partial kernel bypass technology yet. We hope Netmap will occupy this niche, and we are actively supporting it with our patches. Second, the Linux TCP stack has many critical features and very good debugging capabilities. It will take years to compete with this rich ecosystem.

For these reasons it's unlikely userspace networking will become mainstream. In practice I can think of only a few reasonable applications of kernel bypass techniques:

Software switches or routers. Here you want to hand over network cards to the application, deal with raw packets and skip the kernel altogether.
Dedicated load balancers. Similarly, if the machine is only doing packet shuffling, skipping the kernel makes sense.
Partial bypass for selected high throughput / low latency applications. This is the setup we use for our DDoS mitigations. Unfortunately I'm not aware of a stable open source TCP stack that fits this category.

For the general user the Linux network stack is the right choice. Although it's less exciting than rewriting TCP stacks, we should focus on understanding the Linux stack performance and fixing its problems. There are some serious initiatives underway to improve the performance of the good old Linux TCP stack.

The Cloudflare Blog
