A Tour Inside CloudFlare's Latest Generation Servers

CloudFlare operates at a significant scale, handling more than a trillion requests through our network every month. To process that volume of traffic as efficiently as possible, we own and operate all the equipment in our 23 locations around the world. We spend a significant amount of time speccing and, in some cases, designing the hardware that makes up our network. At the edge of CloudFlare's network we run three main components: routers, switches, and servers. At regular intervals, we take everything we've learned about the last generation of hardware and refresh each component with the next generation.

We are now on our fourth generation (G4) of servers. Our first generation (G1) servers were stock Dell PowerEdge servers. We deployed these in early 2010 while we were building the beta of CloudFlare. We learned quickly from that experience and then worked with a company called ZT Systems to build our G2 servers, which we deployed just before our public launch in September 2010.

In 2011, we worked with HP to build our G3 servers. We deployed that generation through mid-2012, forming the core of our current network of 23 data centers worldwide. In the fall of 2012 we started working with vendors to take everything we'd learned about running a network to date and roll out our newest generation of servers. We wanted to share a peek, literally, inside the equipment we're now using across CloudFlare's network.

Generation 3

Before looking at G4, it's important to understand what we learned from our previous generation of servers. Our G3 servers were built by HP. They were each 2U (meaning they were about as tall as two pizza boxes stacked on top of each other). Each had two Intel Xeon E5645 (Westmere variant) CPUs running at 2.4GHz, up to 25 Intel SSDs across the front of the box, 48GB of RAM, six 1Gbps Intel-based network interfaces (two on the motherboard and four on a PCI card), and a single high efficiency (Platinum) power supply.

We liked the build quality and reliability of the HP gear and it continues to power a significant amount of traffic across our network. When we went to design our G4 servers we looked at the bottlenecks we faced with G3 and sought new hardware designed specifically to solve them.

Storage

CloudFlare uses SSDs exclusively at the edge of our network. SSDs give us three advantages. First, they tend to fail gradually over time rather than catastrophically, which allows us to schedule their replacement predictably and not keep staff on hand at all our locations around the world. Second, SSDs consume significantly less power than spinning HDDs. Less power means we can fit more equipment, at lower cost, into the data centers where we are located. Finally, SSDs have faster random access and write performance, which is important given the nature of our traffic.

We went through a number of different models of SSDs over the life of our G3 hardware. The best price-performance we found was the 240GB Intel 520 SSD, which is what we are using in the first shipments of the new G4 servers. Intel reports that the 520-series drives have a mean time between failures (MTBF) of 1,200,000 hours (about 137 years). While we haven't been running enough of the drives for long enough to confirm that MTBF, we have been pleased with their low failure rate in our production environment. We vary the number of SSDs per server depending on the needs of the data center where the machine is located. At a minimum we install 6 drives, but we will ramp up to 24 if the traffic in the location calls for it. We do a lot of small transactions so, when optimizing our drive and file system choices, we focus on small file performance.
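The "about 137 years" figure is a unit conversion, and an MTBF is often easier to reason about as an expected failure rate across a fleet. The sketch below does both. The fleet size of 1,000 drives is a made-up number for illustration, and the constant-failure-rate (exponential) model is a common simplifying assumption, not something measured on our hardware.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; ignores leap years


def mtbf_years(mtbf_hours: float) -> float:
    """Convert an MTBF quoted in hours into years."""
    return mtbf_hours / HOURS_PER_YEAR


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year.

    Assumes a constant failure rate (exponential lifetime model).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


MTBF_HOURS = 1_200_000  # vendor-reported figure quoted above
FLEET_SIZE = 1_000      # hypothetical fleet size, for illustration only

afr = annualized_failure_rate(MTBF_HOURS)
print(f"MTBF: {mtbf_years(MTBF_HOURS):.0f} years")
print(f"Annualized failure rate: {afr:.2%}")
print(f"Expected failures per year across {FLEET_SIZE} drives: {afr * FLEET_SIZE:.1f}")
```

Read this way, a 137-year MTBF does not mean any single drive lasts 137 years. Under the model's assumptions it means well under 1% of drives are expected to fail in a given year.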

One thing we pulled out of our G4 setup was the RAID card. We'd experimented with hardware RAID but found we got more performance by addressing each disk individually and using consistent hashing to spread files across disks. Specifically, we saw a 50% performance benefit addressing disks directly rather than going through the G3 hardware RAID. The additional reliability of RAID isn't as important for our application, as we can always go fetch a copy of the original object from our customer's origin server if necessary.
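For readers unfamiliar with consistent hashing, here is a minimal sketch of how keys can be spread across individually addressed disks. It is an illustration of the general technique, not the production implementation: the `DiskRing` class, the `/cache/diskN` mount points, the use of MD5, and the choice of 100 virtual nodes per disk are all assumptions made for the example.

```python
import bisect
import hashlib


class DiskRing:
    """Consistent-hash ring that maps cache keys to disks."""

    def __init__(self, disks, vnodes=100):
        self._ring = []  # sorted list of (point_on_ring, disk)
        for disk in disks:
            self.add(disk, vnodes)

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of MD5 as a 64-bit ring position (not used for security).
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def add(self, disk: str, vnodes: int = 100) -> None:
        # Several virtual nodes per disk smooth out the key distribution.
        for i in range(vnodes):
            bisect.insort(self._ring, (self._hash(f"{disk}#{i}"), disk))

    def remove(self, disk: str) -> None:
        self._ring = [(point, d) for point, d in self._ring if d != disk]

    def disk_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("no disks in ring")
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]


ring = DiskRing([f"/cache/disk{i}" for i in range(6)])  # hypothetical mount points
print(ring.disk_for("example.com/logo.png"))
```

The property that matters here is that when a disk is removed (for example, because it failed), only the keys that lived on that disk are remapped. Everything else stays where it was, and the lost objects can be refetched from the origin.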

While disk performance is important, in the case of frequently requested files it's even better if we can pull them straight from RAM. With RAM prices falling dramatically, we increased the amount of RAM in our G4 servers to 128GB. This allows us to hold more cache objects in RAM and hit the disk less frequently. Specifically, moving from 48GB to 128GB lets us keep 100GB of in-memory file cache versus 20GB. That's 5x the memory file cache, which means we don't have to go to disk for the most popular resources. This gives us a 50% memory cache hit rate, versus 25% on G3 servers.
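The lookup order described above (RAM first, then disk, then the customer's origin) can be sketched as a small tiered cache. This is an illustrative model only: the `TieredCache` class, the LRU eviction policy, and counting capacity in objects rather than bytes are assumptions made for the example, not a description of our caching software.

```python
from collections import OrderedDict


class TieredCache:
    """RAM (LRU) in front of a disk lookup, with the origin as last resort."""

    def __init__(self, ram_capacity, fetch_from_disk, fetch_from_origin):
        self._ram = OrderedDict()          # key -> value, oldest first
        self._ram_capacity = ram_capacity  # counted in objects, for simplicity
        self._disk = fetch_from_disk       # returns None on a miss
        self._origin = fetch_from_origin
        self.hits = {"ram": 0, "disk": 0, "origin": 0}

    def get(self, key):
        if key in self._ram:
            self._ram.move_to_end(key)     # mark as most recently used
            self.hits["ram"] += 1
            return self._ram[key]
        value = self._disk(key)
        if value is not None:
            self.hits["disk"] += 1
        else:
            value = self._origin(key)
            self.hits["origin"] += 1
        self._ram[key] = value             # promote into RAM
        if len(self._ram) > self._ram_capacity:
            self._ram.popitem(last=False)  # evict least recently used
        return value


disk = {"/logo.png": b"cached-on-disk"}    # stand-in for the SSD tier
cache = TieredCache(2, disk.get, lambda key: b"from-origin:" + key.encode())
for path in ["/logo.png", "/logo.png", "/index.html"]:
    cache.get(path)
print(cache.hits)
```

A larger RAM tier means more of the popular objects stay in the first branch of `get`, which is the effect the hit-rate numbers above describe.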

Our software, for certain applications like caching where it makes sense, shares resources across all the servers within a location. This means that as we add additional servers to a data center the overall storage capacity of that data center increases. This architecture has allowed us to scale quickly to meet our continued growth.

CPU

With our G3 equipment we were not CPU bound under normal circumstances. However, when we mitigated large Layer 4 DDoS attacks (e.g., SYN floods), our CPUs would, from time to time, become overwhelmed with excessive processor interrupts. In our tests, increasing or decreasing the clock speed of the CPU did little to change this problem. Adding more cores did help, and we tested some of the high core count AMD CPUs, but we ultimately decided against going in that direction.

While top clock speed was not our priority, our product roadmap includes more CPU-heavy features. These include image optimization (e.g., Mirage and Polish), high volumes of SSL/TLS connections, and extremely fast regular expression matching (e.g., PCRE tests for our WAF). These CPU-heavy operations can, in most cases, take advantage of special vector processing instruction sets on post-Westmere Intel chips. This made Intel's newest-generation Sandy Bridge chips attractive.

We were willing to sacrifice a bit of clock speed and spend a bit more on chips to save power. We tend to put our equipment in data centers that have high network density. These facilities, however, are usually older and don't always have the highest power capacity. We settled on our G4 servers having two Intel Xeon E5-2630L CPUs (a low power chip in the Sandy Bridge family) running at 2.0GHz. This gives us 12 physical cores (and 24 virtual cores with hyperthreading) per server. The power savings per chip (60 watts vs. 95 watts) are sufficient to allow us at least one more server per rack than we'd be able to fit if we went with the non-low power version.
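To see how a per-chip saving turns into extra servers per rack, here is a back-of-the-envelope calculation. Only the 60 watt and 95 watt figures and the two-CPU count come from the text above. The 5,000 watt rack budget and the 200 watts assumed for everything in a server other than its CPUs are hypothetical, and rated CPU power is treated as actual draw, which is a simplification.

```python
CPUS_PER_SERVER = 2
CPU_STANDARD_W = 95   # non-low power part
CPU_LOW_POWER_W = 60  # low power part


def cpu_savings_per_server_w(cpus: int = CPUS_PER_SERVER) -> int:
    """Watts saved per server by choosing the low power chip."""
    return cpus * (CPU_STANDARD_W - CPU_LOW_POWER_W)


def servers_per_rack(rack_budget_w: int, non_cpu_draw_w: int, cpu_w: int,
                     cpus: int = CPUS_PER_SERVER) -> int:
    """Whole servers that fit within a rack's power budget."""
    per_server_w = non_cpu_draw_w + cpus * cpu_w
    return rack_budget_w // per_server_w


RACK_BUDGET_W = 5000   # hypothetical rack power budget
NON_CPU_DRAW_W = 200   # hypothetical draw of disks, RAM, NICs, fans, etc.

print("Saved per server (W):", cpu_savings_per_server_w())
print("Servers per rack, standard chips: ",
      servers_per_rack(RACK_BUDGET_W, NON_CPU_DRAW_W, CPU_STANDARD_W))
print("Servers per rack, low power chips:",
      servers_per_rack(RACK_BUDGET_W, NON_CPU_DRAW_W, CPU_LOW_POWER_W))
```

The exact gain depends entirely on the assumed budget and baseline draw, but the shape of the result is the same: in a power-constrained rack, 70 watts saved per server adds up to whole additional machines.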

Network

The biggest change from our G3 to G4 servers was the jump from 1Gbps to 10Gbps network interfaces. With our G3 servers, we would sometimes max out the 6Gbps of network capacity per server under certain high-volume Layer 7 attacks. We knew we wanted to jump to 10Gbps on each server, but we also wanted to pick the right network controller card.

We ended up testing a very wide range of network cards, spending more time optimizing this component in the servers than any other. In the end, we settled on the network cards from a company called Solarflare. (It didn't hurt that their name was similar to ours.)

Solarflare has traditionally focused on supplying extremely high-performance network cards for the high frequency trading industry. What we found was that their cards ran circles around everything else on the market: handling up to 16 million packets per second (PPS) in our tests (at 60 bytes per packet, the typical size of a SYN packet in a SYN-flood attack), compared with the next best alternative topping out around 9M PPS. We ended up using the Solarflare SFC9020 in our G4 servers. Part of the explanation for the performance benefit is that Solarflare includes a comparatively large network buffer on their cards (16MB versus 512KB or less in most of the other cards we tested), minimizing the chance of network congestion leading to packet loss. This is good under normal operations but is particularly helpful when there is a DDoS attack.
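To put those numbers in perspective, the short calculation below converts packet rates into bandwidth and buffer sizes into the amount of line-rate traffic they can absorb. It is rough arithmetic under stated assumptions: Ethernet framing overhead (preamble and inter-frame gap) is ignored, a single 10Gbps link is assumed, and MB and KB are taken as binary units.

```python
LINK_RATE_BPS = 10e9  # one 10Gbps link (assumption)
KB = 1024
MB = 1024 * KB


def throughput_gbps(packets_per_second: float, packet_bytes: int) -> float:
    """Payload bandwidth of a packet stream, ignoring framing overhead."""
    return packets_per_second * packet_bytes * 8 / 1e9


def buffer_time_ms(buffer_bytes: int, link_rate_bps: float = LINK_RATE_BPS) -> float:
    """How long a burst arriving at line rate takes to fill the buffer."""
    return buffer_bytes * 8 / link_rate_bps * 1e3


print(f"16M PPS at 60 bytes: {throughput_gbps(16e6, 60):.2f} Gbps")
print(f" 9M PPS at 60 bytes: {throughput_gbps(9e6, 60):.2f} Gbps")
print(f"16MB buffer absorbs:  {buffer_time_ms(16 * MB):.2f} ms of line-rate traffic")
print(f"512KB buffer absorbs: {buffer_time_ms(512 * KB):.2f} ms of line-rate traffic")
```

Under these assumptions the larger buffer gives the host roughly 32 times as long to drain a burst before packets start to drop, which is where the benefit during floods of small packets comes from.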

Beyond the hardware, we're working with Solarflare to take advantage of some of the software which allows us to streamline network performance. In particular, we've begun to test their OpenOnload kernel bypass technology. This allows network requests to be handled directly by userspace without creating a CPU interrupt.

Beyond reducing interrupts during attacks, if you remove the latency of going through the kernel stack and go directly into the application stack, you can process a higher number of packets in the same amount of time, which increases overall performance. On average you save 100μs (100 microseconds, or 1/10th of a millisecond) each time you bypass the kernel stack. While that may not seem like a lot, it translates into a 20% transaction latency savings for us. If you control both sender and receiver (which we do, for example, when fetching cached objects intra-network), that benefit is doubled.
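The relationship between those figures can be worked through in a few lines. The 100μs saving and the 20% figure are taken from the paragraph above. The 500μs baseline is not a measured number; it is simply what those two figures imply when taken at face value.

```python
def implied_baseline_us(saving_us: float, saving_fraction: float) -> float:
    """Transaction latency implied by an absolute and a relative saving."""
    return saving_us / saving_fraction


def latency_after_bypass_us(baseline_us: float, saving_us: float,
                            bypassed_ends: int) -> float:
    """Latency once the kernel stack is bypassed on one or both ends."""
    return baseline_us - saving_us * bypassed_ends


SAVING_US = 100.0       # per kernel bypass, from the text
SAVING_FRACTION = 0.20  # relative saving, from the text

baseline = implied_baseline_us(SAVING_US, SAVING_FRACTION)
print(f"Implied baseline: {baseline:.0f} us")
for ends in (1, 2):
    after = latency_after_bypass_us(baseline, SAVING_US, ends)
    print(f"Bypass on {ends} end(s): {after:.0f} us "
          f"({(baseline - after) / baseline:.0%} lower)")
```

With both ends under our control, the same arithmetic gives a saving of 200μs per transaction, or about 40% of the implied baseline.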

Because of our unique requirements, we need to rewrite portions of the Solarflare network drivers before OpenOnload can be fully implemented. We're working with Solarflare on this project, and we're excited to fully unleash the potential of our G4 network hardware once that work is done.

Finally, while 10Gbps Ethernet can run across standard twisted-pair copper cable (Cat6/6a), we elected to use SFP+ connectors. We chose this for the flexibility to switch between optical (fiber) and copper connections. Some network card and switch vendors lock down their equipment to only support proprietary SFP+s, for which they charge a significant premium. We spent significant time testing a combination of SFP+ vendors before finding FiberStore, an SFP+ manufacturer from which we could directly source SFP+s at a reasonable price that worked in the network gear we wanted to use.

Designed to Our Unique Needs

We've continued to use a single, high efficiency (Platinum) power supply on our servers. While power supplies do fail, we've designed CloudFlare's network to be highly resilient to component failure. If a power supply fails, traffic is automatically rebalanced across the remaining servers and a ticket is created to have the power supply replaced.

Unlike companies like Facebook and Google that build data centers from the foundation up, at CloudFlare we deploy smaller footprints in more locations. This means we don't control the entire environment of the data center and haven't been able to do more exotic things like chassisless deployments, direct DC power, or exotic cooling strategies. From the outside, our servers look little different from what you'd get if you bought directly from an original equipment manufacturer (OEM) like Dell or HP.

While we talked to these OEMs in the bakeoff we ran to select the vendor for our G4 servers, in the end we chose to work with what is known as an original design manufacturer (ODM) that built the servers exactly to our spec. We chose a Taiwanese company called Quanta, which has built custom-designed servers for companies like Facebook and Rackspace.

Overall, our G4 servers cost us slightly less per server than our G3s but deliver a bit better CPU performance, faster storage, nearly 3 times the RAM, and 3 times the network capacity, all while using about 20% less power. While we continue to tune our G4 servers to optimize performance, and we will update particular components as better versions become available (e.g., we're already testing various next generation SSDs from Intel and other vendors), we expect that we'll continue to use the current core G4 platform through at least mid-2014. We've already begun to discover bottlenecks in this new architecture (hint: switches), and we're starting to investigate how to solve them. As we do, we'll keep sharing what we learn.
