Here at Cloudflare, we have a lot of experience of operating servers on the wild Internet. But we are always improving our mastery of this black art. On this very blog we have touched on multiple dark corners of the Internet protocols: like understanding FIN-WAIT-2 or receive buffer tuning.
One subject hasn't had enough attention though - SYN floods. We use Linux and it turns out that SYN packet handling in Linux is truly complex. In this post we'll shine some light on this subject.
The tale of two queues
First we must understand that each bound socket in the "LISTENING" TCP state has two separate queues:
- The SYN Queue
- The Accept Queue
In the literature these queues are often given other names such as "reqsk_queue", "ACK backlog", "listen backlog" or even "TCP backlog", but I'll stick to the names above to avoid confusion.
The SYN Queue stores inbound SYN packets (specifically: struct inet_request_sock). It's responsible for sending out SYN+ACK packets and retrying them on timeout. On Linux the number of retries is configured with:
$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5
The docs describe this toggle:
tcp_synack_retries - INTEGER Number of times SYNACKs for a passive TCP connection attempt will be retransmitted. Should not be higher than 255. Default value is 5, which corresponds to 31 seconds till the last retransmission with the current initial RTO of 1 second. With this the final timeout for a passive TCP connection will happen after 63 seconds.
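The 31 and 63 second figures follow from exponential backoff. Here is a small sketch of the arithmetic, assuming the timeout starts at the initial RTO of 1 second and doubles after every retransmission (the function name is illustrative, not a kernel symbol):

```python
def synack_schedule(retries=5, initial_rto=1.0):
    """Return (times of each SYN+ACK retransmission, final timeout), in seconds.

    Assumes the timeout starts at initial_rto and doubles after each
    retransmission; times are measured from the first SYN+ACK.
    """
    retransmit_times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto              # wait for the current timeout to expire
        retransmit_times.append(elapsed)
        rto *= 2                    # back off before the next attempt
    final_timeout = elapsed + rto   # one more wait after the last retransmission
    return retransmit_times, final_timeout


times, final = synack_schedule()
print(times)   # [1.0, 3.0, 7.0, 15.0, 31.0]
print(final)   # 63.0
```

With the default of 5 retries, the last retransmission goes out at 31 seconds and the entry is given up on 32 seconds later.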
After transmitting the SYN+ACK, the SYN Queue waits for an ACK packet from the client - the last packet in the three-way-handshake. All received ACK packets must first be matched against the fully established connection table, and only then against data in the relevant SYN Queue. On SYN Queue match, the kernel removes the item from the SYN Queue, happily creates a fully fledged connection (specifically: struct inet_sock), and adds it to the Accept Queue.
The Accept Queue contains fully established connections: ready to be picked up by the application. When a process calls accept(), the sockets are de-queued and passed to the application.
This is a rather simplified view of SYN packet handling on Linux. With socket toggles like TCP_FASTOPEN things work slightly differently.
Queue size limits
The maximum allowed length of both the Accept and SYN Queues is taken from the backlog parameter passed to the listen(2) syscall by the application. For example, this sets the Accept and SYN Queue sizes to 1,024:
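A minimal sketch in Python, whose socket.listen() hands the value to listen(2). The loopback address and ephemeral port are placeholders chosen so the snippet does not clash with a real service:

```python
import socket

BACKLOG = 1024  # requested queue length, passed straight to listen(2)

# Bind to an ephemeral loopback port (placeholder address for this sketch).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(BACKLOG)

host, port = srv.getsockname()
print(f"listening on {host}:{port} with backlog {BACKLOG}")
srv.close()
```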
Note: In kernels before 4.3 the SYN Queue length was counted differently.
This SYN Queue cap used to be configured by the net.ipv4.tcp_max_syn_backlog toggle, but this isn't the case anymore. Nowadays net.core.somaxconn caps both queue sizes. On our servers we set it to 16k:
$ sysctl net.core.somaxconn
net.core.somaxconn = 16384
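To make the capping concrete, here is a small sketch of how the requested backlog and net.core.somaxconn combine. The helper names are made up for this example, and reading the value from /proc assumes a Linux host:

```python
def effective_backlog(requested, somaxconn):
    """Queue limit actually applied: the listen() backlog capped by somaxconn."""
    return min(requested, somaxconn)


def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the current cap on Linux; return None where the file is unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


print(effective_backlog(1024, 16384))    # 1024: the request fits under the cap
print(effective_backlog(65535, 16384))   # 16384: the request is clamped
print(read_somaxconn())                  # value depends on the host
```

The clamping is silent: listen() does not report that a smaller value was used, so it's worth checking the sysctl before relying on a large backlog.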
Perfect backlog value
Knowing all that, we might ask the question - what is the ideal backlog parameter value?
The answer is: it depends. For the majority of trivial TCP Servers it doesn't really matter. For example Golang famously doesn't support customizing backlog and hardcodes it to 128. There are valid reasons to increase this value though:
- When the rate of incoming connections is really large, even with a performant application, the inbound SYN Queue may need a larger number of slots.
- The backlog value controls the SYN Queue size. This effectively can be read as "ACK packets in flight". The larger the average round trip time to the client, the more slots are going to be used. In the case of many clients far away from the server, hundreds of milliseconds away, it makes sense to increase the backlog value.
- The TCP_DEFER_ACCEPT option causes sockets to remain in the SYN-RECV state longer and contribute to the queue limits.
Overshooting the backlog is bad as well:
- Each slot in SYN Queue uses some memory. During a SYN Flood it makes no sense to waste resources on storing attack packets. Each struct inet_request_sock entry in SYN Queue takes 256 bytes of memory on kernel 4.14.
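The trade-off above can be approximated with back-of-the-envelope arithmetic. This sketch assumes SYN Queue occupancy is roughly the connection rate multiplied by the round trip time, and ignores retransmissions and lost ACKs. The workload numbers are hypothetical; the 256 byte entry size is the figure quoted for kernel 4.14:

```python
import math

REQUEST_SOCK_BYTES = 256  # per-entry size quoted for kernel 4.14


def syn_queue_slots(conn_per_sec, rtt_ms):
    """Average number of handshakes in flight: arrival rate times round trip time."""
    return math.ceil(conn_per_sec * rtt_ms / 1000)


def syn_queue_memory(slots):
    """Memory held by that many SYN Queue entries, in bytes."""
    return slots * REQUEST_SOCK_BYTES


# Hypothetical workload: 10,000 new connections/s from clients 200 ms away.
print(syn_queue_slots(10_000, 200))    # 2000 slots in use on average
print(syn_queue_memory(16384))         # 4194304 bytes for a full 16k queue
```

In this example a backlog of 1,024 would be too small even for a perfectly healthy application, while a full 16k queue costs about 4 MiB per listening socket.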
To peek into the SYN Queue on Linux we can use the ss command and look for SYN-RECV sockets. For example, on one of Cloudflare's servers we can see 119 slots used in tcp/80 SYN Queue and 78 on tcp/443.
$ ss -n state syn-recv sport = :80 | wc -l
119
$ ss -n state syn-recv sport = :443 | wc -l
78
Similar data can be shown with our overengineered SystemTap script.
What happens if the application can't keep up with calling accept() fast enough?
This is when the magic happens! When the Accept Queue gets full (is of a size of backlog+1) then:
- Inbound SYN packets to the SYN Queue are dropped.
- Inbound ACK packets to the SYN Queue are dropped.
- The TcpExtListenOverflows / LINUX_MIB_LISTENOVERFLOWS counter is incremented.
- The TcpExtListenDrops / LINUX_MIB_LISTENDROPS counter is incremented.
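The interplay of the two queues can be illustrated with a toy model. This is a sketch, not how the kernel structures its code. It assumes the Accept Queue counts as full once it holds more than backlog entries, and leaves out the SYN Queue size limit, retransmission timers and SYN Cookies. The class and attribute names are invented for this example:

```python
from collections import deque


class ToyListener:
    """Toy model of a listening socket with a SYN Queue and an Accept Queue."""

    def __init__(self, backlog):
        self.backlog = backlog
        self.syn_queue = set()        # half-open connections awaiting the final ACK
        self.accept_queue = deque()   # established connections awaiting accept()
        self.listen_overflows = 0     # models TcpExtListenOverflows
        self.listen_drops = 0         # models TcpExtListenDrops

    def _accept_queue_full(self):
        # Assumption of this sketch: full means more than `backlog` entries.
        return len(self.accept_queue) > self.backlog

    def _drop(self):
        self.listen_overflows += 1
        self.listen_drops += 1

    def on_syn(self, client):
        if self._accept_queue_full():
            self._drop()              # push back: the client will retry the SYN
            return False
        self.syn_queue.add(client)    # a SYN+ACK would be sent here
        return True

    def on_ack(self, client):
        if client not in self.syn_queue:
            return False
        if self._accept_queue_full():
            self._drop()              # entry stays queued; the ACK can be retried
            return False
        self.syn_queue.remove(client)
        self.accept_queue.append(client)
        return True

    def accept(self):
        return self.accept_queue.popleft() if self.accept_queue else None


listener = ToyListener(backlog=2)
for client in ("a", "b", "c", "d"):
    listener.on_syn(client)
    listener.on_ack(client)
print(len(listener.accept_queue), listener.listen_drops)   # 3 1
```

Because the application never calls accept() here, the fourth client's SYN is dropped and counted. One accept() call would free a slot and let a retried SYN through.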
There is a strong rationale for dropping inbound packets: it's a push-back mechanism. The other party will sooner or later resend the SYN or ACK packets by which point, the hope is, the slow application will have recovered.
This is a desirable behavior for almost all servers. For completeness: it can be adjusted with the global net.ipv4.tcp_abort_on_overflow toggle, but better not touch it.
If your server needs to handle a large number of inbound connections and is struggling with accept() throughput, consider reading our Nginx tuning / Epoll work distribution post and a follow up showing useful SystemTap scripts.
You can trace the Accept Queue overflow stats by looking at nstat counters:
$ nstat -az TcpExtListenDrops
TcpExtListenDrops 49199 0.0
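The same counter can also be read programmatically. This sketch assumes the TcpExt counters are exposed in /proc/net/netstat as pairs of header and value lines. The parser name is made up, and the sample text is an abbreviated, made-up excerpt rather than real output:

```python
def parse_netstat(text, prefix="TcpExt"):
    """Parse /proc/net/netstat-style text into {counter_name: value} for one prefix.

    The file comes in pairs of lines: a header line of names followed by a
    line of values, both starting with the same "Prefix:" label.
    """
    lines = text.splitlines()
    counters = {}
    for header, values in zip(lines[::2], lines[1::2]):
        label, *names = header.split()
        _, *numbers = values.split()
        if label.rstrip(":") != prefix:
            continue
        counters.update((prefix + name, int(num)) for name, num in zip(names, numbers))
    return counters


# Abbreviated, made-up sample; a real file has many more columns.
SAMPLE = (
    "TcpExt: SyncookiesSent SyncookiesRecv ListenOverflows ListenDrops\n"
    "TcpExt: 0 0 49199 49199\n"
)
stats = parse_netstat(SAMPLE)
print(stats["TcpExtListenDrops"])   # 49199
```

On a Linux host the input would come from reading /proc/net/netstat. Sampling it periodically and diffing the values gives a drop rate rather than a lifetime total.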
This is a global counter. It's not ideal - sometimes we saw it increasing, while all applications looked healthy! The first step should always be to print the Accept Queue sizes with ss:
$ ss -plnt sport = :6443|cat
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       1024                *:6443             *:*
Recv-Q shows the number of sockets in the Accept Queue, and Send-Q shows the backlog parameter. In this case we see there are no outstanding sockets to be accept()ed, but we still saw the ListenDrops counter increasing.
It turns out our application was stuck for a fraction of a second. This was sufficient to let the Accept Queue overflow for a very brief period of time. Moments later it would recover. Cases like this are hard to debug with ss, so we wrote an acceptq.stp SystemTap script to help us. It hooks into the kernel and prints the SYN packets which are being dropped:
$ sudo stap -v acceptq.stp
time (us)         acceptq  qmax  local addr    remote_addr
1495634198449075  1025     1024  0.0.0.0:6443  10.0.1.92:28585
1495634198449253  1025     1024  0.0.0.0:6443  10.0.1.92:50500
1495634198450062  1025     1024  0.0.0.0:6443  10.0.1.92:65434
...
Here you can see precisely which SYN packets were affected by the ListenDrops. With this script it's trivial to understand which application is dropping connections.
If it's possible to overflow the Accept Queue, it must be possible to overflow the SYN Queue as well. What happens in that case?
This is what SYN Flood attacks are all about. In the past flooding the SYN Queue with bogus spoofed SYN packets was a real problem. Before 1996 it was possible to successfully deny the service of almost any TCP server with very little bandwidth, just by filling the SYN Queues.
The solution is SYN Cookies. SYN Cookies are a construct that allows the SYN+ACK to be generated statelessly, without actually saving the inbound SYN and wasting system memory. SYN Cookies don't break legitimate traffic. When the other party is real, it will respond with a valid ACK packet including the reflected sequence number, which can be cryptographically verified.
By default SYN Cookies are enabled when needed - for sockets with a filled up SYN Queue. Linux updates a couple of counters on SYN Cookies. When a SYN cookie is being sent out:
- TcpExtTCPReqQFullDoCookies / LINUX_MIB_TCPREQQFULLDOCOOKIES is incremented.
- TcpExtSyncookiesSent / LINUX_MIB_SYNCOOKIESSENT is incremented.
- Linux used to increment TcpExtListenDrops but no longer does as of kernel 4.7.
When an inbound ACK is heading into the SYN Queue with SYN cookies engaged:
- TcpExtSyncookiesRecv / LINUX_MIB_SYNCOOKIESRECV is incremented when crypto validation succeeds.
- TcpExtSyncookiesFailed / LINUX_MIB_SYNCOOKIESFAILED is incremented when crypto fails.
net.ipv4.tcp_syncookies can disable SYN Cookies or force-enable them. Default is good, don't change it.
SYN Cookies and TCP Timestamps
The SYN Cookies magic works, but isn't without disadvantages. The main problem is that there is very little data that can be saved in a SYN Cookie. Specifically, only 32 bits of the sequence number are returned in the ACK. These bits are used as follows:
+----------+--------+-------------------+
|  6 bits  | 2 bits |      24 bits      |
| t mod 32 |  MSS   | hash(ip, port, t) |
+----------+--------+-------------------+
With the MSS setting truncated to only 4 distinct values, Linux doesn't know any optional TCP parameters of the other party. Information about Timestamps, ECN, Selective ACK, or Window Scaling is lost, and can lead to degraded TCP session performance.
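To illustrate how little fits into those 32 bits, here is a sketch that packs and verifies a cookie following the simplified layout above. It is not the kernel's algorithm: the HMAC-SHA256 hash, the secret, the 4-entry MSS table and the one-tick validity window are all assumptions made for this example, and details such as the client's initial sequence number and the +1 on the acknowledgment number are left out:

```python
import hashlib
import hmac

SECRET = b"example-secret"            # hypothetical; a real server uses a random secret
MSS_TABLE = [536, 1300, 1440, 1460]   # illustrative 4-entry table, indexed by 2 bits


def _hash24(src_ip, src_port, dst_ip, dst_port, t):
    """24-bit keyed hash over the connection 4-tuple and the time counter."""
    msg = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}@{t}".encode()
    digest = hmac.new(SECRET, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:3], "big")


def make_cookie(src_ip, src_port, dst_ip, dst_port, t, mss):
    """Pack the time counter, an MSS index and a 24-bit hash into 32 bits."""
    # Largest table entry not exceeding the client's MSS (index 0 as a floor).
    mss_idx = max((i for i, m in enumerate(MSS_TABLE) if m <= mss), default=0)
    return ((t % 32) << 26) | (mss_idx << 24) | _hash24(src_ip, src_port, dst_ip, dst_port, t)


def check_cookie(cookie, src_ip, src_port, dst_ip, dst_port, now):
    """Return the recovered MSS if the cookie is valid and recent, else None."""
    t_field = cookie >> 26
    mss_idx = (cookie >> 24) & 0x3
    for age in range(2):              # accept the current or the previous tick
        t = now - age
        if (t >= 0 and t % 32 == t_field
                and _hash24(src_ip, src_port, dst_ip, dst_port, t) == cookie & 0xFFFFFF):
            return MSS_TABLE[mss_idx]
    return None


c = make_cookie("198.51.100.7", 40000, "203.0.113.1", 443, t=1000, mss=1400)
print(check_cookie(c, "198.51.100.7", 40000, "203.0.113.1", 443, now=1001))      # 1300
print(check_cookie(c ^ 1, "198.51.100.7", 40000, "203.0.113.1", 443, now=1001))  # None
```

Note how the client's MSS of 1400 comes back as 1300: the 2-bit field can only name one of four table entries, which is exactly the truncation described above.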
Fortunately Linux has a workaround. If TCP Timestamps are enabled, the kernel can reuse another slot of 32 bits in the Timestamp field. It contains:
+-----------+-------+-------+--------+
|  26 bits  | 1 bit | 1 bit | 4 bits |
| Timestamp |  ECN  | SACK  | WScale |
+-----------+-------+-------+--------+
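The layout above amounts to simple bit packing. Here is a sketch, assuming the fields sit in the order drawn (timestamp in the top 26 bits, window scale in the low 4) and that the client echoes the value back unchanged. The function names are invented for this example:

```python
def pack_tsval(timestamp, ecn, sack, wscale):
    """Pack connection options into a 32-bit timestamp value per the layout above."""
    if not 0 <= wscale <= 15:
        raise ValueError("window scale must fit in 4 bits")
    return ((timestamp & 0x3FFFFFF) << 6) | (int(ecn) << 5) | (int(sack) << 4) | wscale


def unpack_tsval(tsval):
    """Recover (timestamp, ecn, sack, wscale) from the echoed timestamp value."""
    return (tsval >> 6, bool(tsval >> 5 & 1), bool(tsval >> 4 & 1), tsval & 0xF)


tsval = pack_tsval(timestamp=123456, ecn=False, sack=True, wscale=7)
print(unpack_tsval(tsval))   # (123456, False, True, 7)
```

Only the low 26 bits of the clock survive the round trip, which is the price paid for carrying the option flags.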
TCP Timestamps should be enabled by default. To verify, check the sysctl:
$ sysctl net.ipv4.tcp_timestamps
net.ipv4.tcp_timestamps = 1
Historically there was plenty of discussion about the usefulness of TCP Timestamps.
- In the past timestamps leaked server uptime (whether that matters is another discussion). This was fixed 8 months ago.
- TCP Timestamps use a non-trivial amount of bandwidth - 12 bytes on each packet.
- They can add additional randomness to packet checksums which can help with certain broken hardware.
- As mentioned above TCP Timestamps can boost the performance of TCP connections if SYN Cookies are engaged.
Currently at Cloudflare, we have TCP Timestamps disabled.
Finally, with SYN Cookies engaged some cool features won't work - things like
SYN Floods at Cloudflare scale
SYN Cookies are a great invention and solve the problem of smaller SYN Floods. At Cloudflare though, we try to avoid them if possible. While sending out a couple of thousand cryptographically verifiable SYN+ACK packets per second is okay, we see attacks of more than 200 Million packets per second. At this scale, our SYN+ACK responses would just litter the internet, bringing absolutely no benefit.
Instead, we attempt to drop the malicious SYN packets on the firewall layer. We use the p0f SYN fingerprints compiled to BPF. Read more in this blog post: Introducing the p0f BPF compiler. To detect and deploy the mitigations we developed an automation system we call "Gatebot". We described that here: Meet Gatebot - the bot that allows us to sleep.
For more - slightly outdated - data on the subject, read an excellent explanation by Andreas Veithen from 2015 and a comprehensive paper by Gerald W. Gordon from 2013.
The Linux SYN packet handling landscape is constantly evolving. Until recently SYN Cookies were slow, due to an old fashioned lock in the kernel. This was fixed in 4.4 and now you can rely on the kernel to be able to send millions of SYN Cookies per second, practically solving the SYN Flood problem for most users. With proper tuning it's possible to mitigate even the most annoying SYN Floods without affecting the performance of legitimate connections.
Application performance is also getting significant attention. Recent ideas like SO_ATTACH_REUSEPORT_EBPF introduce a whole new layer of programmability into the network stack.
It's great to see innovations and fresh thinking funneled into the networking stack, in the otherwise stagnant world of operating systems.
Thanks to Binh Le for helping with this post.
Dealing with the internals of Linux and NGINX sound interesting? Join our world famous team in London, Austin, San Francisco and our elite office in Warsaw, Poland.