Optimizing Your Linux Stack for Maximum Mobile Web Performance

The following is a technical post written by Ian Applegate
(@AppealingTea), a member of our Systems Engineering team, on how to optimize the Linux TCP stack for mobile connections. The article was originally published
as part of the 2012 Web Performance Calendar. At CloudFlare, we spend a significant amount of time ensuring our network stack is tuned to whatever kind of network or device is connecting to us. We wanted to share some of the technical details to help other organizations that are looking to optimize for mobile network performance, even if they're not using CloudFlare. And, if you are using CloudFlare, you get all these benefits and the fastest possible TCP performance when a mobile network accesses your site.

We spend a lot of time at CloudFlare thinking about how to make the Internet fast on mobile devices. Currently there are over 1.2 billion active mobile users and that number is growing rapidly. Earlier this year mobile Internet access passed fixed Internet access in India and that's likely to be repeated the world over. So, mobile network performance will only become more and more important.

Most of the focus today on improving mobile performance is on Layer 7 with front end optimizations (FEO). At CloudFlare, we've done significant work in this area with front end optimization technologies like Rocket Loader, Mirage, and Polish that dynamicall modify web content to make it load quickly whatever device is being used. However, while FEO is important to make mobile fast, the unique characteristics of mobile networks also means we have to pay attention to the underlying performance of the technologies down at Layer 4 of the network stack.

This article is about the challenges mobile devices present, how the default TCP configuration is ill-suited for optimal mobile performance, and what you can do to improve performance for visitors connecting via mobile networks. Before diving into the details, a quick technical note. At CloudFlare, we've build most of our systems on top of a custom version of Linux so, while the underlying technologies can apply to other operating systems, the examples I'll use are from Linux.

TCP Congestion Control

To understand the challenges of mobile network performance at Layer 4 of the networking stack you need to understand TCP Congestion Control. TCP Congestion Control is the gatekeeper that determines how to control the flow of packets from your server to your clients. Its goal is to prevent Internet congestion by detecting when congestion occurs and slowing down the rate data is transmitted. This helps ensure that the Internet is available to everyone, but can cause problems on mobile network when TCP mistakes mobile network problems for congestion.

TCP Congestion Control holds back the floodgates if it detects congestion (i.e. packet loss) on the remote end. A network is, inherently, a shared resource. The purpose of TCP Congestion Control was to ensure that every device on the network cooperates to not overwhelm its resource. On a wired network, if packet loss is detected it is a fairly reliable indicator that a port along the connection is overburdened. What is typically going on in these cases is that a memory buffer in a switch somewhere has filled beyond its capacity because packets are coming in faster than they can be sent out and data is being discarded. TCP Congestion Control on clients and servers is setup to "back off" in these cases in order to ensure that the network remains available for all its users.

But figuring out what packet loss means on a mobile network is a different matter. Radio networks are inherently susceptible to interference which results in packet loss. If packets are being dropped does that mean a switch is overburdened, like we can infer on a wired network? Or did someone travel from an undersubscribed wireless cell to an oversubscribed one? Or did someone just turn on a microwave? Or maybe it was just a random solar flare? Since it's not as clear what packet loss means on a mobile network, it's not clear what action a TCP Congestion Control algorithm should take.

A Series of Leaky Tubes

To optimize networks for lossy networks like those on mobile networks, it's important to understand exactly how TCP Congestion Control algorithms are designed. While the high level concept makes sense, the details of TCP Congestion Control are not widely understood by most people working in the web performance industry. That said, it is an important core part of what makes the Internet reliable and the subject of very active research and development.

To understand how TCP Congestion Control algorithms work, imagine the following analogy. Think of your web server as your local water utility plant. You've built on a large network of pipes in your hometown and you need to guarantee that each pipe is as pressurized as possible for delivery, but you don't want to burst the pipes. (Note: I recognize the late Senator Ted Stevens got a lot of flack for describing the Internet as a "series of tubes," but the metaphor is surprisingly accurate.)

Your client, Crazy Arty, runs a local water bottling plant that connects to your pipe network. Crazy Arty's infrastructure is built on old pipes that are leaky and brittle. For you to get water to them without bursting his pipes, you need to infer the capability of Crazy Arty's system. If you don't know in advance then you do a test — you send a known amount of water to the line and then measure the pressure. If the pressure is suddenly lost then you can infer that you broke a pipe. If not, then that level is likely safe and you can add more water pressure and repeat the test. You can iterate this test until you burst a pipe, see the drop in pressure, write down the maximum water volume, and going forward ensure you never exceed it.

Imagine, however, that there's some exogenous factor that could decrease the pressure in the pipe without actually indicating a pipe had burst. What if, for example, Crazy Arty ran a pump that he only turned on randomly from time to time and you didn't know about. If the only signal you have is observing a loss in pressure, you'd have no way of knowing whether you'd burst a pipe or if Crazy Arty had just plugged in the pump. The effect would be that you'd likely record a pressure level much less than the amount the pipes could actually withstand — leading to all your customers on the network potentially having lower water pressure than they should.

Optimizing for Congestion or Loss

If you've been following up to this point then you already know more about TCP Congestion Control than you would guess. The initial amount of water we talked about in TCP is known as the Initial Congestion Window (initcwnd) it is the initial number of packets in flight across the network. The congestion window (cwnd) either shrinks, grows, or stays the same depending on how many packets make it back and how fast (in ACK trains) they return after the initial burst. In essence, TCP Congestion Control is just like the water utility — measuring the pressure a
network can withstand and then adjusting the volume in an attempt to maximize flow without bursting any pipes.

When a TCP connection is first established it attempts to ramp up the cwnd quickly. This phase of the connection, where TCP grows the cwnd rapidly, is called Slow Start. That's a bit of a misnomer since it is generally an exponential growth function which is quite fast and aggressive. Just like when the water utility in the example above detects a drop in pressure it turns down the volume of water, when TCP detects packets are lost it reduces the size of the cwnd and delays the time before another burst of packets is delivered. The time between packet bursts is known as the Retransmission Timeout (RTO). The
algorithm within TCP that controls these processes is called the Congestion Control Algorithm. There are many congestion control algorithms and clients and servers can use different strategies based in the characteristics of their networks. Most of Congestion Control Algorithms focus on optimizing for one type of network loss or another: congestive loss (like you see on wired networks) or random loss (like you see on mobile networks).

In the example above, a pipe bursting would be an indication of congestive loss. There was a physical limit to the pipes, it is exceeded, and the appropriate response is to back off. On the other hand, Crazy Arty's pump is analogous to random loss. The capacity is still available on the network and only a temporary disturbance causes the water utility to see the pipes as overfull. The Internet started as a network of wired devices, and, as its name suggests, congestion control was largely designed to optimize for congestive loss. As a result, the default Congestion Control Algorithm in many operating systems is good for communicating wired networks but not as good for communicating with mobile networks.

A few Congestion Control algorithms try to bridge the gap by using the time of the delay in the "pressure increase" to "expected capacity" to figure out the cause of the loss. These are known as bandwidth estimation algorithms, and examples include Vegas,
Veno and Westwood+. Unfortunately, all of these methods are reactive and reuse no information across two similar streams.

At companies that see a significant amount of network traffic, like CloudFlare or Google, it is possible to map the characteristics of the Internet's networks and choose a specific congestion control algorithm in order to maximize performance for that network. Unfortunately, unless you are seeing the large amounts of traffic as we do and can record data on network performance, the ability to instrument your congestion control or build a "weather forecast" is usually impossible. Fortunately, there are still several things you can do to make your server more responsive to visitors even when they're coming from lossy, mobile devices.

Compelling Reasons to Upgrade You Kernel

The Linux network stack has been under extensive development to bring about some sensible defaults and mechanisms for dealing with the network topology of 2012. A mixed network of high bandwidth low latency and high bandwidth, high latency, lossy connections was never fully anticipated by the kernel developers of 2009 and if you check your server's kernel version chances are it's running a 2.6.32.x kernel from that era.

uname -a

There are a number of reasons that if you're running an old kernel on your web server and want to increase web performance, especially for mobile devices, you should investigate upgrading. To begin, Linux 2.6.38 bumps the default initcwnd and initrwnd (inital receive window) from 3
to 10. This is an easy, big win. It allows for 14.2KB (vs 5.7KB) of data to be sent or received in the initial round trip before slow start grows the cwnd further. This is important for HTTP and SSL because it gives you more room to fit the header in the initial set of packets. If you are running an older kernel you may be able to run the following command on a bash shell (use caution) to set all of your routes' initcwnd and initrwnd to 10. On average, this small change can be one of the biggest boosts when you're trying to maximize web performance.

ip route | while read p; do ip route change $p initcwnd 10 initrwnd 10; done

Linux kernel 3.2 implements Proportional Rate Reduction (PRR). PRR decreases the time it takes for a lossy connection to recover its full speed, potentially improving HTTP response times by 3-10%. The benefits of PRR are significant for mobile networks. To understand why, it's worth diving back into the details of how previous congestion
control strategies interacted with loss.

Many congestion control algorithms halve the cwnd when a loss is detected. When multiple losses occur this can result in a case where the cwnd is lower than the slow start threshold. Unfortunately, the connection never goes through slow start again. The result is that a few network interruptions can result in TCP slowing to a crawl for all the connections in the session.

This is even more deadly when combined with tcp_no_metrics_save=0 sysctl setting on unpatched kernels before 3.2. This setting will save data on connections and attempt to use it to optimize the network. Unfortunately, this can actually make performance worse because TCP will apply the exception case to every new connection from a client within a window of a few minutes. In other words, in some cases, one person surfing your site from a mobile phone who has some random packet loss can reduce your server's performance to this visitor even when their temporary loss has cleared.

If you expect your visitors to be coming from mobile, lossy connections and you cannot upgrade or patch your kernel I recommend setting tcp_no_metrics_save=1. If you're comfortable doing some hacking, you can patch older kernels.

The good news is that Linux 3.2 implements PRR, which decreases the amount of time that a lossy connection will impact TCP performance. If you can upgrade, it may be one of the most significant things you can do in order to increase your web performance.

More Improvements Ahead

Linux 3.2 also has another important improvement with RFC2099bis. The initial Retransmission Timeout (initRTO) has been changed to 1s from 3s. If loss happens after sending the initcwnd two seconds waiting time are saved when trying to resend the data. With TCP streams being so short this can have a very noticeable improvement if a connection experiences loss at the beginning of the stream. Like the PRR patch this can also be applied (with modification) to older kernels if for some reason you cannot upgrade (here's the patch).

Looking forward, Linux 3.3 has Byte Queue Limits when teamed with CoDel (controlled delay) in the 3.5 kernel helps fight the long standing issue of
Bufferbloat by intelligently managing packet queues. Bufferbloat is when the caching overhead on TCP becomes inefficient because it's littered with stale data. Linux 3.3 has features to auto QoS important packets (SYN/DNS/ARP/etc.,) keep down buffer queues thereby reducing bufferbloat and improving latency on loaded servers.

Linux 3.5 implements TCP Early Retransmit with some safeguards for connections that have a small amount of packet reordering. This allows connections, under certain conditions, to trigger fast retransmit and bypass the costly Retransmission Timeout (RTO) mentioned earlier. By default it is enabled in the failsafe mode tcp_early_retrans=2. If for some reason you are sure your clients have loss but no reordering then you could set tcp_early_retrans=1 to save one quarter a RTT on recovery.

One of the most extensive changes to 3.6 that hasn't got much press is the removal of the IPv4 routing cache. In a nutshell it was an extraneous caching layer in the kernel that mapped interfaces to routes to IPs and saved a lookup to the Forward Information Base (FIB). The FIB is a routing table within the network stack. The IPv4 routing cache was intended to eliminate a FIB lookup and increase performance. While a good idea in principle, unfortunately it provided a very small performance boost in less than 10% of connections. In the 3.2.x-3.5.x kernels it was extremely vulnerable to certain DDoS techniques so it has been removed.

Finally, one important setting you should check, regardless of the Linux kernel you are running: tcp_slow_start_after_idle. If you're concerned about web performance, it has been proclaimed sysctl setting of the year. It can be enabled in almost any kernel. By default this is set to 1 which will aggressively reduce cwnd on idle connections and negatively impact any long lived connections such as SSL. The following command will set it to 0 and can significantly improve performance:

sysctl -w tcp_slow_start_after_idle=0

The Missing Congestion Control Algorithm

You may be curious as to why I haven't made a recommendation as far as a quick and easy change of congestion control algorithms. Since Linux 2.6.19, the default congestion control algorithm in the Linux kernel is CUBIC, which is time based and optimized for high speed and high latency networks. It's killer feature, known as called Hybrid Slow Start (HyStart), allows it to safely exit slow start by measuring the ACK trains and not overshoot the cwnd. It can improve startup throughput by up to 200-300%.

While other Congestion Control Algorithms may seem like performance wins on connections experiencing high amounts of loss (>.1%) (e.g., TCP Westwood+ or Hybla), unfortunately these algorithms don't include HyStart. The net effect is that, in our tests, they underperform CUBIC for general network performance. Unless a majority of your clients are on lossy connections, I recommend staying with CUBIC.

Of course the real answer here is to dynamically swap out congestion control algorithms based on historical data to better serve these edge cases. Unfortunately, that is difficult for the average web server unless you're seeing a very high volume of traffic and are able to record and analyze network characteristics across multiple connections. The good news is that loss predictors and hybrid congestion control algorithms are continuing to mature, so maybe we will have an answer in an upcoming kernel.

The Cloudflare Blog

Optimizing Your Linux Stack for Maximum Mobile Web Performance

TCP Congestion Control

A Series of Leaky Tubes

Optimizing for Congestion or Loss

Compelling Reasons to Upgrade You Kernel

More Improvements Ahead

The Missing Congestion Control Algorithm

A Socket API that works across JavaScript runtimes — announcing a WinterCG spec and Node.js implementation of connect()

Unbounded memory usage by TCP for receive buffers, and how we fixed it

Announcing connect() — a new API for creating TCP sockets from Cloudflare Workers

When the window is not fully open, your TCP stack is doing more than you think