The revenge of the listening sockets

Back in November we wrote a blog post about one latency spike. Today I'd like to share a continuation of that story. As it turns out, the misconfigured rmem setting wasn't the only source of added latency.

It looked like Mr. Wolf hadn't finished his job.

After adjusting the previously discussed rmem sysctl we continued monitoring our systems' latency. Among other things we measured ping times to our edge servers. While the worst case improved and we didn't see 1000ms+ pings anymore, the line still wasn't flat. Here's a graph of ping latency between an idling internal machine and a production server. The test was done within the datacenter, the packets never went to the public internet. The Y axis of the chart shows ping times in milliseconds, the X axis is the time of the measurement. Measurements were taken every second for over 6 hours:

As you can see most pings finished below 1ms. But out of 21,600 measurements about 20 had high latency of up to 100ms. Not ideal, is it?

System tap

The latency occurred within our datacenter and the packets weren't lost. This suggested a kernel issue again. Linux responds to ICMP pings from its soft interrupt handling code. A delay in handling ping indicates a delay in Soft IRQ handling which is really bad and can affect all packets delivered to a machine. Using the system tap script we were able to measure the time distribution of the main soft IRQ function net_rx_action:

This distribution was pretty awful. While most of the calls to net_rx_action were handled in under 81us (average), the slow outliers were really bad. Three calls took a whopping 32ms! No wonder the ping times were off.

The `inet_lookup` is slow

With some back and forth with flame graphs and the histogram-kernel.stp script we went deeper to look for the culprit. We found that tcp_v4_rcv had a similarly poor latency distribution. More specifically the problem lies between lines 1637 and 1642 in the tcp_v4_rcv function in the tcp_ipv4.c file. We wrote another script to show just that:

The latency is created at this specific line in tcp_v4_rcv function:

sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);

The numbers shown above indicate that the function usually terminated quickly, in under 2us, but sometimes it hit a slow path and took 1-2ms to finish.

The __inet_lookup_skb is inlined which makes it tricky to accurately measure. Fortunately the function is simple - all it does is to call __inet_lookup_established and __inet_lookup_listener. It's the latter function that was causing the trouble:

Let's discuss how __inet_lookup works. This function tries to find an appropriate connection sock struct structure for a packet. This is done in the __inet_lookup_established call. If that fails, the __inet_lookup will attempt to find a bound socket in listening state that could potentially handle the packet. For example, if the packet is SYN and the listening socket exists we should respond with SYN+ACK. If there is no bound listening socket we should send an RST instead. The __inet_lookup_listener function finds the bound socket in the LHTABLE hash table. It does so by using the destination port as a hash and picks an appropriate bucket in the hash table. Then it iterates over it linearly to find the matching listening socket.

To understand the problem we traced the slow packets, with another crafted system tap script. It hooks onto __inet_lookup_listener and prints out the details of only the slow packets:

With this data we went deeper and matched these log lines to specific packets captured with tcpdump. I'll spare you the details, but these are inbound SYN and RST packets which destination port modulo 32 is equal to 21. Check it out:

16725 % 32 = 21
53 % 32 = 21
63925 % 32 = 21

Now, where does this magic number come from?

The listening hash table

As mentioned above, Linux maintains a listening hash table containing the listening TCP sockets - the LHTABLE. It has a fixed size of 32 buckets:

/* Yes, really, this is all you need. */
#define INET_LHTABLE_SIZE       32

To recap:

All the SYN and RST packets trigger a lookup in LHTABLE. Since the connection entry doesn't exist the __inet_lookup_established call fails and __inet_lookup_listener will be called.
LHTABLE is small - it has only 32 buckets.
LHTABLE is hashed by destination port only.

It's time for a quick diversion. Let's speak about CloudFlare's DNS server.

Let's speak about DNS

At CloudFlare we are using a custom DNS server called rrdns. Among many other requirements, the server is designed to withstand DDoS attacks.

Even though our server is pretty fast, when a large attack kicks in it might be unable to cope with the attack load. If that happens we must contain the damage - even if one IP address is under a heavy attack the server must still handle legitimate traffic on other IP addresses. In fact, our DNS architecture is designed to spread the load among 16k IP addresses.

When an IP address is under attack, and the server is not keeping up with incoming packets, the kernel receive queue on a UDP socket will overflow. We monitor that by looking at the netstat counters:

$ netstat -s --udp
Udp:
    43957575 packet receive errors

With that number increasing we can see the affected IP addresses by listing the UDP sockets with non-empty receive queues:

$ netstat -ep4ln --udp|grep 53 |egrep -v "^udp *0"
udp   524293  0 173.245.1.1:53  0.0.0.0:*   0
udp   524288  0 173.245.2.3:53  0.0.0.0:*   0

In this case two IP addresses received heavy UDP traffic. It was more than the DNS server could handle, the receive queues built up and eventually overflowed. Fortunately, because we are binding to specific IP addresses, overflowing some UDP receive queues won't affect any other IP addresses.

Binding to specific IP addresses is critical to keep our DNS infrastructure online. With this setup even if other mitigation techniques fail and the DNS server is left exposed to the packet flood, we are certain the attack will not affect handling DNS on other IP addresses.

But what does that have to do with the LHTABLE? Well, in our setup we bound to specific IP addresses for both UDP and TCP. While having 16k listening sockets in UDP is okay, it turns out it is not fine for TCP.

What happened

Due to our DNS setup we had 16k TCP sockets bound to different IP addresses on port 53. Since the port number is fixed, all these sockets ended in exactly one LHTABLE bucket. This particular bucket was number 21 (53 % 32 = 21). When an RST or SYN packet hit it, the __inet_lookup_listener call had to traverse all 16k socket entries. This wasn't fast, in fact it took 2ms to finish.

To solve the problem we deployed two changes:

For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, *:53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.
We increased the LHTABLE size in our kernels. We are not the first to do that: Bill Sommerfeld from Google suggested that back in 2011.

With these changes deployed the ping times within our datacenter are finally flat, as they should always have been:

Final words

You can't have a very large number of bound TCP sockets and we learned that the hard way. We learned a bit about the Linux networking stack: the fact that LHTABLE is fixed size and is hashed by destination port only. Once again we showed a couple of powerful of System Tap scripts.

With the fixes deployed maximum latency numbers have dropped significantly. We are confident that soft interrupt handling in net_rx_action is behaving well.

Mr Wolf has finally finished his assignment.

If this sounds interesting, consider joining us. We have teams in Singapore, San Francisco and London.

The Cloudflare Blog

The revenge of the listening sockets

System tap

The `inet_lookup` is slow

The listening hash table

Let's speak about DNS

What happened

Final words

Iran's Internet is partially restored, Cloudflare Radar data shows

How Cloudflare responded to the “Copy Fail” Linux vulnerability

When DNSSEC goes wrong: how we responded to the .de TLD outage

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

The revenge of the listening sockets

System tap

The inet_lookup is slow

The listening hash table

Let's speak about DNS

What happened

Final words

Iran's Internet is partially restored, Cloudflare Radar data shows

How Cloudflare responded to the “Copy Fail” Linux vulnerability

When DNSSEC goes wrong: how we responded to the .de TLD outage

Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen

The `inet_lookup` is slow