Back in November we wrote a blog post about one latency spike. Today I'd like to share a continuation of that story. As it turns out, the misconfigured
rmem setting wasn't the only source of added latency.
It looked like Mr. Wolf hadn't finished his job.
After adjusting the previously discussed
rmem sysctl we continued monitoring our systems' latency. Among other things we measured
ping times to our edge servers. While the worst case improved and we didn't see 1000ms+ pings anymore, the line still wasn't flat. Here's a graph of ping latency between an idling internal machine and a production server. The test was done within the datacenter, the packets never went to the public internet. The Y axis of the chart shows
ping times in milliseconds, the X axis is the time of the measurement. Measurements were taken every second for over 6 hours:
As you can see most pings finished below 1ms. But out of 21,600 measurements about 20 had high latency of up to 100ms. Not ideal, is it?
The latency occurred within our datacenter and the packets weren't lost. This suggested a kernel issue again. Linux responds to ICMP pings from its soft interrupt handling code. A delay in handling
ping indicates a delay in Soft IRQ handling which is really bad and can affect all packets delivered to a machine. Using the system tap script we were able to measure the time distribution of the main soft IRQ function
This distribution was pretty awful. While most of the calls to
net_rx_action were handled in under 81us (average), the slow outliers were really bad. Three calls took a whopping 32ms! No wonder the
ping times were off.
inet_lookup is slow
With some back and forth with flame graphs and the
histogram-kernel.stp script we went deeper to look for the culprit. We found that
tcp_v4_rcv had a similarly poor latency distribution. More specifically the problem lies between lines 1637 and 1642 in the
tcp_v4_rcv function in the
tcp_ipv4.c file. We wrote another script to show just that:
The latency is created at this specific line in
sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
The numbers shown above indicate that the function usually terminated quickly, in under 2us, but sometimes it hit a slow path and took 1-2ms to finish.
__inet_lookup_skb is inlined which makes it tricky to accurately measure. Fortunately the function is simple - all it does is to call
__inet_lookup_listener. It's the latter function that was causing the trouble:
Let's discuss how
__inet_lookup works. This function tries to find an appropriate connection
sock struct structure for a packet. This is done in the
__inet_lookup_established call. If that fails, the
__inet_lookup will attempt to find a bound socket in listening state that could potentially handle the packet. For example, if the packet is SYN and the listening socket exists we should respond with SYN+ACK. If there is no bound listening socket we should send an RST instead. The
__inet_lookup_listener function finds the bound socket in the
LHTABLE hash table. It does so by using the destination port as a hash and picks an appropriate bucket in the hash table. Then it iterates over it linearly to find the matching listening socket.
To understand the problem we traced the slow packets, with another crafted system tap script. It hooks onto
__inet_lookup_listener and prints out the details of only the slow packets:
With this data we went deeper and matched these log lines to specific packets captured with
tcpdump. I'll spare you the details, but these are inbound SYN and RST packets which destination port modulo 32 is equal to 21. Check it out:
- 16725 % 32 = 21
- 53 % 32 = 21
- 63925 % 32 = 21
Now, where does this magic number come from?
The listening hash table
As mentioned above, Linux maintains a listening hash table containing the listening TCP sockets - the
LHTABLE. It has a fixed size of 32 buckets:
/* Yes, really, this is all you need. */ #define INET_LHTABLE_SIZE 32
- All the SYN and RST packets trigger a lookup in LHTABLE. Since the connection entry doesn't exist the
__inet_lookup_establishedcall fails and
__inet_lookup_listenerwill be called.
- LHTABLE is small - it has only 32 buckets.
- LHTABLE is hashed by destination port only.
It's time for a quick diversion. Let's speak about CloudFlare's DNS server.
Let's speak about DNS
At CloudFlare we are using a custom DNS server called
rrdns. Among many other requirements, the server is designed to withstand DDoS attacks.
Even though our server is pretty fast, when a large attack kicks in it might be unable to cope with the attack load. If that happens we must contain the damage - even if one IP address is under a heavy attack the server must still handle legitimate traffic on other IP addresses. In fact, our DNS architecture is designed to spread the load among 16k IP addresses.
When an IP address is under attack, and the server is not keeping up with incoming packets, the kernel receive queue on a UDP socket will overflow. We monitor that by looking at the
$ netstat -s --udp Udp: 43957575 packet receive errors
With that number increasing we can see the affected IP addresses by listing the UDP sockets with non-empty receive queues:
$ netstat -ep4ln --udp|grep 53 |egrep -v "^udp *0" udp 524293 0 22.214.171.124:53 0.0.0.0:* 0 udp 524288 0 126.96.36.199:53 0.0.0.0:* 0
In this case two IP addresses received heavy UDP traffic. It was more than the DNS server could handle, the receive queues built up and eventually overflowed. Fortunately, because we are binding to specific IP addresses, overflowing some UDP receive queues won't affect any other IP addresses.
Binding to specific IP addresses is critical to keep our DNS infrastructure online. With this setup even if other mitigation techniques fail and the DNS server is left exposed to the packet flood, we are certain the attack will not affect handling DNS on other IP addresses.
But what does that have to do with the
LHTABLE? Well, in our setup we bound to specific IP addresses for both UDP and TCP. While having 16k listening sockets in UDP is okay, it turns out it is not fine for TCP.
Due to our DNS setup we had 16k TCP sockets bound to different IP addresses on port 53. Since the port number is fixed, all these sockets ended in exactly one
LHTABLE bucket. This particular bucket was number 21 (53 % 32 = 21). When an RST or SYN packet hit it, the
__inet_lookup_listener call had to traverse all 16k socket entries. This wasn't fast, in fact it took 2ms to finish.
To solve the problem we deployed two changes:
For TCP connections our DNS server now binds to ANY_IP address (aka: 0.0.0.0:53, *:53). We call this "bind to star". While binding to specific IP addresses is still necessary for UDP, there is little benefit in doing that for the TCP traffic. For TCP we can bind to star safely, without compromising our DDoS defenses.
We increased the
LHTABLEsize in our kernels. We are not the first to do that: Bill Sommerfeld from Google suggested that back in 2011.
With these changes deployed the
ping times within our datacenter are finally flat, as they should always have been:
You can't have a very large number of bound TCP sockets and we learned that the hard way. We learned a bit about the Linux networking stack: the fact that
LHTABLE is fixed size and is hashed by destination port only. Once again we showed a couple of powerful of System Tap scripts.
With the fixes deployed maximum latency numbers have dropped significantly. We are confident that soft interrupt handling in
net_rx_action is behaving well.
Mr Wolf has finally finished his assignment.
If this sounds interesting, consider joining us. We have teams in Singapore, San Francisco and London.