
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Wed, 08 Apr 2026 23:09:47 GMT</lastBuildDate>
        <item>
            <title><![CDATA[TURN and anycast: making peer connections work globally]]></title>
            <link>https://blog.cloudflare.com/webrtc-turn-using-anycast/</link>
            <pubDate>Wed, 25 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ TURN servers relay media and data between devices when direct P2P connections are blocked or fail. Cloudflare Calls' TURN server uses anycast to eliminate the need to think about regions or scaling. ]]></description>
            <content:encoded><![CDATA[ <p>A <a href="https://www.cloudflare.com/learning/video/turn-server/"><u>TURN server</u></a> helps maintain connections during video calls when local networking conditions prevent participants from connecting directly to other participants. It acts as an intermediary, passing data between users when their networks block direct communication. TURN servers ensure that peer-to-peer calls go smoothly, even in less-than-ideal network conditions.</p><p>When building their own TURN infrastructure, developers often have to answer a few critical questions:</p><ol><li><p>“How do we build and maintain a mesh network that achieves near-zero latency to all our users?”</p></li><li><p>“Where should we spin up our servers?”</p></li><li><p>“Can we auto-scale reliably to be cost-efficient without hurting performance?”
</p></li></ol><p>In April, we launched Cloudflare Calls TURN in <a href="https://blog.cloudflare.com/cloudflare-calls-anycast-webrtc/"><u>open beta</u></a> to help answer these questions. Starting today, <a href="https://developers.cloudflare.com/calls/turn/"><u>Cloudflare Calls’ TURN service</u></a> is generally available to all Cloudflare accounts. Our TURN server runs on our anycast network, which helps deliver the global coverage and near-zero latency required by real-time applications.</p>
    <div>
      <h2>TURN solves connectivity and privacy problems for real time apps</h2>
      <a href="#turn-solves-connectivity-and-privacy-problems-for-real-time-apps">
        
      </a>
    </div>
    <p>When Internet Protocol version 4 (IPv4, <a href="https://datatracker.ietf.org/doc/html/rfc791"><u>RFC 791</u></a>) was designed back in 1981, it was assumed that the 32-bit address space was big enough for all computers to be able to connect to each other. When IPv4 was created, billions of people didn’t have smartphones in their pockets and the idea of the Internet of Things didn’t exist yet. It didn’t take long for companies, ISPs, and even entire countries to realize they didn’t have enough IPv4 address space to meet their needs.</p>
    <div>
      <h3>NATs are unpredictable</h3>
      <a href="#nats-are-unpredictable">
        
      </a>
    </div>
    <p>Fortunately, you can have multiple devices share the same IP address because the most common protocols that run on top of IP are TCP and UDP, both of which support up to 65,535 port numbers. (Think of port numbers on an IP address as extensions behind a single phone number.) To solve this problem of IP scarcity, network engineers developed a way to share a single IP address across multiple devices by exploiting the port numbers. This is called Network Address Translation (NAT), and it is the process through which your router knows which packets to send to your smartphone versus your laptop or other devices, all of which are connecting to the public Internet through the IP address assigned to the router.</p><p>In a typical NAT setup, when a device sends a packet to the Internet, the NAT assigns a random, unused port to track it, keeping a forwarding table to map the device to the port. This allows NAT to direct responses back to the correct device, even if the source IP address and port vary across different destinations. The system works as long as the internal device initiates the connection and waits for the response.</p><p>However, real-time apps like video or audio calls are more challenging with NAT. Since NATs don't reveal how they assign ports, devices can't pre-communicate where to send responses, making it difficult to establish reliable connections. Earlier solutions like STUN (<a href="https://datatracker.ietf.org/doc/html/rfc3489"><u>RFC 3489</u></a>) couldn't fully solve this, which gave rise to the TURN protocol.</p><p>TURN predictably relays traffic between devices while ensuring minimal delay, which is crucial for real-time communication where even a second of lag can disrupt the experience.</p>
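<p>As a rough illustration, the forwarding table described above can be sketched in a few lines of JavaScript. This is a deliberately simplified model; real NATs also track the protocol, connection timeouts, and (depending on NAT type) per-destination state:</p>

```javascript
// Simplified NAT translation table: maps (internal IP:port) to an
// external port on the router's public IP, so replies sent to that
// port can be forwarded back to the right internal device.
class Nat {
  constructor(publicIp) {
    this.publicIp = publicIp;
    this.nextPort = 40000;     // hypothetical ephemeral range start
    this.outbound = new Map(); // "intIp:intPort" -> extPort
    this.inbound = new Map();  // extPort -> "intIp:intPort"
  }
  // Called when an internal device sends a packet out.
  translateOut(intIp, intPort) {
    const key = `${intIp}:${intPort}`;
    if (!this.outbound.has(key)) {
      const extPort = this.nextPort++;
      this.outbound.set(key, extPort);
      this.inbound.set(extPort, key);
    }
    return { ip: this.publicIp, port: this.outbound.get(key) };
  }
  // Called when a reply arrives at the public address.
  translateIn(extPort) {
    return this.inbound.get(extPort); // undefined -> drop the packet
  }
}
```

<p>Note that an inbound packet on an unknown port has no table entry and is dropped — precisely the behavior that makes unsolicited peer-to-peer connections fail behind NAT.</p>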
    <div>
      <h3>ICE to determine if a relay server is needed</h3>
      <a href="#ice-to-determine-if-a-relay-server-is-needed">
        
      </a>
    </div>
    <p>The <a href="https://datatracker.ietf.org/doc/html/rfc8445"><u>ICE (Interactive Connectivity Establishment) protocol</u></a> was designed to find the fastest communication path between devices. It works by testing multiple routes and choosing the one with the least delay. ICE determines whether a TURN server is needed to relay the connection when a direct peer-to-peer path cannot be established or is not performant enough.</p>
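<p>RFC 8445 assigns each candidate a numeric priority so that agents prefer direct paths and fall back to the relay only when direct checks fail. A minimal sketch of the formula, using the RFC's recommended type preferences:</p>

```javascript
// Candidate priority per RFC 8445, section 5.1.2.1:
//   priority = 2^24 * typePref + 2^8 * localPref + (256 - componentId)
// Recommended type preferences: host 126, server-reflexive 100, relayed 0.
function candidatePriority(typePref, localPref, componentId) {
  return (2 ** 24) * typePref + (2 ** 8) * localPref + (256 - componentId);
}

const host = candidatePriority(126, 65535, 1);  // direct path: 2130706431
const relay = candidatePriority(0, 65535, 1);   // via TURN:    16777215
```

<p>Because a relayed candidate's type preference is 0, the TURN path ranks below any direct path, so the relay is used only when it is actually needed.</p>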
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71IX3n5RLM24rwRhwrpM2E/1ef5ecbf98cc85a46e385f333d6cb90c/image3.png" />
            
            </figure><p><sup><i>How two peers (A and B) try to connect directly by sharing their public and local IP addresses using the ICE protocol. If the direct connection fails, both peers use the TURN server to relay their connection and communicate with each other.</i></sup></p><p>While ICE is designed to find the most efficient connection path between peers, it can inadvertently expose sensitive information, creating privacy concerns. During the ICE process, endpoints exchange a list of all possible network addresses, including local IP addresses, NAT IP addresses, and TURN server addresses. This comprehensive sharing of network details can reveal information about a user's network topology, potentially exposing their approximate geographic location or details about their local network setup.</p><p>The "brute force" nature of ICE, where it attempts connections on all possible paths, can create distinctive network traffic patterns that sophisticated observers might use to infer the use of specific applications or communication protocols. </p>
    <div>
      <h2>TURN solves privacy problems</h2>
      <a href="#turn-solves-privacy-problems">
        
      </a>
    </div>
    <p>The risk of exposing sensitive information while using real-time applications is especially acute for people who rely on end-to-end encrypted messaging apps — for example, journalists who need to communicate with unknown sources without revealing their location.</p><p>With Cloudflare TURN in place, traffic is proxied through Cloudflare, preventing either party in the call from seeing client IP addresses or associated metadata. Cloudflare simply forwards the calls to their intended recipients, but never inspects the contents — the underlying call data is always end-to-end encrypted. This masking of network traffic is an added layer of privacy.</p><p>Cloudflare is a trusted third party when it comes to operating these types of services: we have experience operating privacy-preserving proxies at scale for our <a href="https://blog.cloudflare.com/1111-warp-better-vpn/"><u>Consumer WARP</u></a> product, <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>Apple’s Private Relay</u></a>, and <a href="https://blog.cloudflare.com/cloudflare-now-powering-microsoft-edge-secure-network/"><u>Microsoft Edge’s Secure Network</u></a>, preserving end-user privacy without sacrificing performance.</p>
    <div>
      <h2>Cloudflare’s TURN is the fastest because of Anycast</h2>
      <a href="#cloudflares-turn-is-the-fastest-because-of-anycast">
        
      </a>
    </div>
    <p>Many real-time communication services run their own TURN servers on a commercial cloud provider because they don’t want to leave a percentage of their customers unable to connect. This results in additional costs for DevOps, egress bandwidth, and more. And honestly, just deploying and running a TURN server, like <a href="https://github.com/coturn/coturn"><u>CoTURN</u></a>, in a VPS isn’t an interesting project for most engineers.</p><p>Because using a TURN relay adds extra delay for the packets to travel between the peers, the relays should be located as close as possible to the peers. Cloudflare’s TURN service avoids all these headaches by simply running in all of the <a href="https://www.cloudflare.com/network"><u>330 cities where Cloudflare has data centers</u></a>. And any time Cloudflare adds another city, the TURN service automatically becomes available there as well.</p>
    <div>
      <h3>Anycast is the perfect network topology for TURN</h3>
      <a href="#anycast-is-the-perfect-network-topology-for-turn">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/"><u>Anycast</u></a> is a network addressing and routing methodology in which a single IP address is shared by multiple servers in different locations. When a client sends a request to an anycast address, the network automatically routes the request via <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/"><u>BGP</u></a> to the topologically nearest server. This is in contrast to unicast, where each destination has a unique IP address. Anycast allows multiple servers to have the same IP address, and enables clients to automatically connect to a server close to them. This is similar to emergency phone networks (911, 112, etc.), which connect you to the closest emergency communications center in your area.</p><p>Anycast allows for lower latency because of the sheer number of locations available around the world. Approximately 95% of the global Internet-connected population is within about 50 ms of a Cloudflare location. For real-time communication applications that use TURN, this leads to improved call quality and user experience.</p>
    <div>
      <h3>Auto-scaling and inherently global</h3>
      <a href="#auto-scaling-and-inherently-global">
        
      </a>
    </div>
    <p>Running TURN over anycast allows for better scalability and global distribution. By naturally distributing load across multiple servers based on network topology, this setup helps balance traffic and improve performance. When you use Cloudflare’s TURN service, you don’t need to manage a list of servers for different parts of the world. And you don’t need to write custom scaling logic to scale VMs up or down based on your traffic.  </p><p>Anycast allows TURN to use fewer IP addresses, making it easier to allowlist in restrictive networks. Stateless protocols like DNS over UDP work well with anycast. This includes stateless STUN binding requests used to determine a system's external IP address behind a NAT.</p><p>However, stateful protocols over UDP, like QUIC or TURN, are more challenging with anycast. QUIC handles this better due to its stable connection ID, which load balancers can use to consistently route traffic. However, TURN/STUN lacks a similar connection ID. So when a TURN client sends requests to the Cloudflare TURN service, the <a href="https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/"><u>Unimog load balancer</u></a> ensures that all its requests get routed to the same server within a data center. The challenges for the communication between a client on the Internet and Cloudflare services listening on an anycast IP address have been described <a href="https://blog.cloudflare.com/tag/loadbalancing/"><u>multiple times before</u></a>.</p>
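<p>The intra-data-center routing described above can be illustrated with a toy version of flow pinning: hash the UDP 4-tuple so that every packet of a flow reaches the same server. Unimog's actual mechanism is considerably more sophisticated (see the linked post); this sketch only demonstrates the consistency property:</p>

```javascript
// Toy flow pinning: hash the UDP 4-tuple so all packets of a flow
// land on the same server within a data center. (Illustrative only;
// Unimog's production logic handles server churn, draining, etc.)
function pickServer(srcIp, srcPort, dstIp, dstPort, numServers) {
  const key = `${srcIp}:${srcPort}->${dstIp}:${dstPort}`;
  let h = 2166136261; // FNV-1a offset basis
  for (const ch of key) {
    h = (h ^ ch.charCodeAt(0)) >>> 0;
    h = Math.imul(h, 16777619) >>> 0;
  }
  return h % numServers;
}
```

<p>Because the hash is deterministic, a TURN client that keeps the same source address and port always reaches the same server, even though the protocol itself carries no connection ID.</p>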
    <div>
      <h3>How does Cloudflare's TURN server receive packets?</h3>
      <a href="#how-does-cloudflares-turn-server-receive-packets">
        
      </a>
    </div>
    <p>TURN servers act as relay points to help connect clients. This process involves two types of connections: the client-server connection and the third-party connection (relayed address).</p><p>The client-server connection uses <a href="https://developers.cloudflare.com/calls/turn/#_top"><u>published</u></a> IP and port information to communicate with TURN clients using anycast.</p><p>For the relayed address, using anycast poses a challenge. The TURN protocol requires that packets reach the specific Cloudflare server handling the client connection. If we used anycast for relay addresses, packets might not arrive at the correct data center or server.
</p><p>One alternative is to use unicast addresses for relay candidates. However, this approach has drawbacks, including making servers vulnerable to attacks and requiring many IP addresses.</p><p>To solve these issues, we've developed a middle-ground solution, previously discussed in “<a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>Cloudflare servers don't own IPs anymore – so how do they connect to the Internet?</u></a>”. We use anycast addresses but add extra handling for packets that reach incorrect servers. If a packet arrives at the wrong Cloudflare location, we forward it over our backbone to the correct data center, rather than sending it back over the public Internet.</p><p>This approach not only resolves routing issues but also improves TURN connection speed. Packets meant for the relay address enter the Cloudflare network as close to the sender as possible, optimizing the routing process.</p>
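<p>A hedged sketch of the decision described above, with hypothetical names (the production logic is internal to Cloudflare): each relay allocation is owned by one data center, and a packet that lands elsewhere via anycast is forwarded over the backbone rather than bounced back onto the public Internet.</p>

```javascript
// Illustrative relay-address handling: look up which data center owns
// the TURN allocation, then deliver locally or forward over the backbone.
const allocationOwner = new Map([
  // relayAddr:port -> data center that holds the TURN allocation
  ['203.0.113.10:50000', 'FRA'],
]);

function handleRelayPacket(relayAddr, arrivedAt) {
  const owner = allocationOwner.get(relayAddr);
  if (owner === undefined) return { action: 'drop' };
  if (owner === arrivedAt) return { action: 'deliver' };
  // Wrong location: carry the packet over the private backbone.
  return { action: 'forward-over-backbone', to: owner };
}
```

<p>The drop case also shows why anycast relay addresses are harder to attack than unicast ones: a packet for an unknown allocation is discarded at whichever edge location it first reaches.</p>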
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JxesKGit6hdK0NzaEduEk/5b248cbc27293b9dc1ccb1a5b5f7b615/image1.png" />
            
            </figure><p><sup><i>In this non-ideal setup, a TURN client connects to Cloudflare using Anycast, while a direct client uses Unicast, which would expose the TURN server to potential DDoS attacks.</i></sup></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CI9Q5qMDC7xXcifH5l898/975e7f52416e0b737c250433aa68ee82/image2.png" />
            
            </figure><p><sup><i>The optimized setup uses Anycast for all TURN clients, allowing for dynamic load distribution across Cloudflare's globally distributed TURN servers.</i></sup></p>
    <div>
      <h2>Try Cloudflare Calls TURN today</h2>
      <a href="#try-cloudflare-calls-turn-today">
        
      </a>
    </div>
    <p>The new TURN feature of Cloudflare Calls addresses critical challenges in real-time communication:</p><ul><li><p><b>Connectivity</b>: By solving NAT traversal issues, TURN ensures reliable connections even in complex network environments.</p></li><li><p><b>Privacy</b>: Acting as an intermediary, TURN enhances user privacy by masking IP addresses and network details.</p></li><li><p><b>Performance</b>: Leveraging Cloudflare's global anycast network, our TURN service offers unparalleled speed and near-zero latency.</p></li><li><p><b>Scalability</b>: With presence in over 330 cities, Cloudflare Calls TURN grows with your needs.</p></li></ul><p>Cloudflare Calls TURN service is billed on a usage basis. It is available to self-serve and Enterprise customers alike. There is no cost for the first 1,000 GB (one terabyte) of Cloudflare Calls usage each month. It costs five cents per GB after your first terabyte of usage on self-serve. Volume pricing is available for Enterprise customers through your account team.</p><p>Switching TURN providers is likely as simple as changing a single configuration in your real-time app. To get started with Cloudflare’s TURN service, create a TURN app from your <a href="https://dash.cloudflare.com/?to=/:account/calls"><u>Cloudflare Calls Dashboard</u></a> or read the <a href="https://developers.cloudflare.com/calls/turn/"><u>Developer Docs</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Cloudflare Calls]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[WebRTC]]></category>
            <category><![CDATA[TURN]]></category>
            <guid isPermaLink="false">EkJICbovEPPuOSElg8poy</guid>
            <dc:creator>Nils Ohlmeier</dc:creator>
            <dc:creator>Renan Dincer</dc:creator>
        </item>
        <item>
            <title><![CDATA[The backbone behind Cloudflare’s Connectivity Cloud]]></title>
            <link>https://blog.cloudflare.com/backbone2024/</link>
            <pubDate>Tue, 06 Aug 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ Read through the latest milestones and expansions of Cloudflare's global backbone and how it supports our Connectivity Cloud and our services ]]></description>
            <content:encoded><![CDATA[ <p>The modern use of "cloud" arguably traces its origins to the cloud icon, omnipresent in network diagrams for decades. A cloud was used to represent the vast and intricate infrastructure components required to deliver network or Internet services without going into depth about the underlying complexities. At Cloudflare, we embody this principle by providing critical infrastructure solutions in a user-friendly and easy-to-use way. Our logo, featuring the cloud symbol, reflects our commitment to simplifying the complexities of Internet infrastructure for all our users.</p><p>This blog post provides an update about our infrastructure, focusing on our global backbone in 2024, and highlights its benefits for our customers, our competitive edge in the market, and the impact on our mission of helping build a better Internet. Since the time of our last backbone-related <a href="http://blog.cloudflare.com/cloudflare-backbone-internet-fast-lane">blog post</a> in 2021, we have increased our backbone capacity (Tbps) by more than 500%, unlocking new use cases, as well as reliability and performance benefits for all our customers.</p>
    <div>
      <h3>A snapshot of Cloudflare’s infrastructure</h3>
      <a href="#a-snapshot-of-cloudflares-infrastructure">
        
      </a>
    </div>
    <p>As of July 2024, Cloudflare has data centers in 330 cities across more than 120 countries, each running Cloudflare equipment and services. The goal of delivering Cloudflare products and services everywhere remains consistent, although these data centers vary in the number of servers and amount of computational power.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/38RRu7BaumWFemL23JcFLW/fd1e4aced5095b1e04384984c88e48be/BLOG-2432-2.png" />
</figure><p>These data centers are strategically positioned around the world to ensure our presence in all major regions and to help our customers comply with local regulations. Together, they form a programmable, smart network that routes your traffic to the best possible data center for processing. This programmability allows us to keep sensitive data regional, with our <a href="https://www.cloudflare.com/data-localization/">Data Localization Suite solutions</a>, and within the constraints that our customers impose. Connecting these sites, exchanging data with customers, public clouds, partners, and the broader Internet, is the role of our network, which is managed by our infrastructure engineering and network strategy teams. This network forms the foundation that makes our products lightning fast, ensuring our global reliability, security for every customer request, and helping customers comply with <a href="https://www.cloudflare.com/the-net/building-cyber-resilience/challenges-data-sovereignty/">data sovereignty requirements</a>.</p>
    <div>
      <h3>Traffic exchange methods</h3>
      <a href="#traffic-exchange-methods">
        
      </a>
    </div>
    <p>The Internet is an interconnection of different networks and separate <a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/">autonomous systems</a> that operate by exchanging data with each other. There are multiple ways to exchange data, but for simplicity, we'll focus on two key methods by which these networks communicate: Peering and IP Transit. To better understand the benefits of our global backbone, it helps to understand these basic connectivity solutions we use in our network.</p><ol><li><p><b>Peering</b>: The voluntary interconnection of administratively separate Internet networks that allows for traffic exchange between users of each network is known as “<a href="https://www.netnod.se/ix/what-is-peering">peering</a>”. Cloudflare is one of the <a href="https://bgp.he.net/report/exchanges#_participants">most peered networks</a> globally. We have peering agreements with ISPs and other networks in 330 cities and across all major <a href="https://www.cloudflare.com/learning/cdn/glossary/internet-exchange-point-ixp/">Internet Exchanges (IXs)</a>. Interested parties can register to <a href="https://www.cloudflare.com/partners/peering-portal/">peer with us</a> anytime, or directly connect to our network with a link through a <a href="https://developers.cloudflare.com/network-interconnect/pni-and-peering/">private network interconnect (PNI)</a>.</p></li><li><p><b>IP transit</b>: A paid service that allows traffic to cross or "transit" somebody else's network, typically connecting a smaller Internet service provider (ISP) to the larger Internet. Think of it as paying a toll to access a private highway with your car.</p></li></ol><p>The backbone is a dedicated high-capacity optical fiber network that moves traffic between Cloudflare’s global data centers, where we interconnect with other networks using these above-mentioned traffic exchange methods. 
It enables data transfers that are more reliable than over the public Internet. For connectivity within a city and for long-distance connections, we manage our own dark fiber or lease wavelengths using Dense Wavelength Division Multiplexing (DWDM). DWDM is a fiber optic technology that enhances network capacity by transmitting multiple data streams simultaneously on different wavelengths of light within the same fiber. It’s like adding lanes to a highway so that more cars can travel on it at once. We buy and lease these services from our global carrier partners all around the world.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1RgjDtW5LehGZEYXey4AQH/cfef08965313f67c84a052e0541fc42b/BLOG-2432-3.png" />
          </figure><p></p>
    <div>
      <h3>Backbone operations and benefits</h3>
      <a href="#backbone-operations-and-benefits">
        
      </a>
    </div>
    <p>Operating a global backbone is challenging, which is why many competitors don’t do it. We take on this challenge for two key reasons: traffic routing control and cost-effectiveness.</p><p>With IP transit, we rely on our transit partners to carry traffic from Cloudflare to the ultimate destination network, introducing unnecessary third-party reliance. In contrast, our backbone gives us full control over routing of both internal and external traffic, allowing us to manage it more effectively. This control is crucial because it lets us optimize traffic routes, usually resulting in the lowest-latency paths. Furthermore, serving large traffic volumes through the backbone is, on average, more cost-effective than using IP transit. This is why we are doubling down on backbone capacity in locations such as Frankfurt, London, Amsterdam, Paris, and Marseille, where we see continuous traffic growth and where connectivity solutions are widely available and competitively priced.</p><p>Our backbone serves both internal and external traffic. Internal traffic includes customer traffic using our security or performance products and traffic from Cloudflare's internal systems that shift data between our data centers. <a href="http://blog.cloudflare.com/introducing-regional-tiered-cache">Tiered caching</a>, for example, optimizes our caching delivery by dividing our data centers into a hierarchy of lower tiers and upper tiers. If lower-tier data centers don’t have the content, they request it from the upper tiers. If the upper tiers don’t have it either, they then request it from the origin server. This process reduces origin server requests and improves cache efficiency. Using our backbone to transport the cached content between lower and upper-tier data centers and the origin is often the most cost-effective method, considering the scale of our network. 
<a href="https://www.cloudflare.com/network-services/products/magic-transit/">Magic Transit</a> is another example where we attract traffic, by means of BGP anycast, to the Cloudflare data center closest to the end user and implement our DDoS solution. Our backbone transports the clean traffic to our customer’s data center, to which they connect through a <a href="http://blog.cloudflare.com/cloudflare-network-interconnect">Cloudflare Network Interconnect (CNI)</a>.</p><p>External traffic that we carry on our backbone can be traffic from other origin providers like AWS, Oracle, Alibaba, Google Cloud Platform, or Azure, to name a few. The origin responses from these cloud providers are transported through peering points and our backbone to the Cloudflare data center closest to our customer. By leveraging our backbone, we have more control over how we backhaul this traffic throughout our network, which results in better reliability, better performance, and less dependency on the public Internet.</p><p>This interconnection between public clouds, offices, and the Internet with a controlled layer of performance, security, programmability, and visibility running on our global backbone is our <a href="http://blog.cloudflare.com/welcome-to-connectivity-cloud">Connectivity Cloud</a>.</p>
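<p>The tiered-cache fallback described above — lower tier, then upper tier, then origin — can be sketched as follows (hypothetical function names; the production implementation is internal):</p>

```javascript
// Simplified tiered-cache lookup: a lower-tier data center asks an
// upper tier before going to the origin, so the origin sees fewer requests.
function fetchTiered(key, lowerTier, upperTier, fetchFromOrigin) {
  if (lowerTier.has(key)) return { value: lowerTier.get(key), from: 'lower' };
  if (upperTier.has(key)) {
    lowerTier.set(key, upperTier.get(key)); // populate on the way down
    return { value: upperTier.get(key), from: 'upper' };
  }
  const value = fetchFromOrigin(key);       // last resort: the origin
  upperTier.set(key, value);
  lowerTier.set(key, value);
  return { value, from: 'origin' };
}
```

<p>Only the first miss reaches the origin; subsequent requests for the same content are served from a tier, and the lower-to-upper-tier hop is exactly the traffic the backbone carries.</p>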
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Fk6k5NOgfOM3qpK0z3wb0/2fe9631dbe6b2dfc6b3c3cd0156f293e/Screenshot_2024-08-28_at_3.21.50_PM.png" />
          </figure><p><sub><i>This map is a simplification of our current backbone network and does not show all paths</i></sub></p><p></p>
    <div>
      <h3>Expanding our network</h3>
      <a href="#expanding-our-network">
        
      </a>
    </div>
    <p>As mentioned in the introduction, we have increased our backbone capacity (Tbps) by more than 500% since 2021. With the addition of sub-sea cable capacity to Africa, we achieved a big milestone in 2023 by completing our global backbone ring. It now reaches six continents through terrestrial fiber and subsea cables.</p><p>Building out our backbone within regions where Internet infrastructure is less developed compared to markets like Central Europe or the US has been a key strategy for our latest network expansions. We have a shared goal with regional ISP partners to keep our data flow localized and as close as possible to the end user. Traffic often takes inefficient routes outside the region due to the lack of sufficient local peering and regional infrastructure. This phenomenon, known as traffic tromboning, occurs when data is routed through more cost-effective international routes and existing peering agreements.</p><p>Our regional backbone investments in countries like India or Turkey aim to reduce the need for such inefficient routing. With our own in-region backbone, traffic can be directly routed between in-country Cloudflare data centers, such as from Mumbai to New Delhi to Chennai, reducing latency, increasing reliability, and helping us to provide the same level of service quality as in more developed markets. We can ensure that data stays local, supporting our Data Localization Suite (<a href="https://www.cloudflare.com/data-localization/">DLS</a>), which helps businesses comply with regional data privacy laws by controlling where their data is stored and processed.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4WCNB78y1jHHsid46pBZOo/e950ced1e510cb8caeea0961c43ea8a0/BLOG-2432-5.png" />
          </figure><p></p>
    <div>
      <h3>Improved latency and performance</h3>
      <a href="#improved-latency-and-performance">
        
      </a>
    </div>
    <p>This strategic expansion has not only extended our global reach but has also significantly improved our overall latency. One illustration: since deploying our backbone between Lisbon and Johannesburg, we have seen a major performance improvement for users in Johannesburg. Customers benefiting from this improved latency include, for example, a financial institution running its APIs through us for real-time trading, where milliseconds can impact trades, or our <a href="https://www.cloudflare.com/network-services/products/magic-wan/">Magic WAN</a> users, for whom we facilitate site-to-site connectivity between branch offices.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1o0H8BNLf5ca8BBx38Q5Ee/5b22f7c0ad1c5c49a67bc5149763e81d/BLOG-2432-6.png" />
</figure><p>The table above shows an example where we measured the round-trip time (RTT) for an uncached origin fetch, from an end-user in Johannesburg to various origin locations, comparing our backbone and the public Internet. By carrying the origin request over our backbone, as opposed to IP transit or peering, local users in Johannesburg get their content up to 22% faster. By using our own backbone to long-haul the traffic to its final destination, we are in complete control of the path and performance. This improvement in latency varies by location, but consistently demonstrates the superiority of our backbone infrastructure in delivering high-performance connectivity.</p>
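<p>For a concrete sense of how a figure like "up to 22% faster" is computed, the improvement is simply the relative RTT reduction. The RTT values below are hypothetical and for illustration only; the actual measurements are in the table above:</p>

```javascript
// Relative RTT improvement of the backbone path vs. the public Internet.
// Ordered as (diff * 100) / public to keep the arithmetic exact here.
function rttImprovementPct(publicRttMs, backboneRttMs) {
  return (publicRttMs - backboneRttMs) * 100 / publicRttMs;
}

// Hypothetical example: 250 ms over the public Internet vs. 195 ms
// over the backbone is a 22% improvement.
rttImprovementPct(250, 195); // 22
```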
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZEEZJERWQ2UB1sdTjWUtM/f90b11507ab24edbf84e9b4cfb9b1155/BLOG-2432-7.png" />
          </figure><p></p>
    <div>
      <h3>Traffic control</h3>
      <a href="#traffic-control">
        
      </a>
    </div>
    <p>Consider a navigation system using 1) GPS to identify the route and 2) a highway toll pass that is valid until your final destination and allows you to drive straight through toll stations without stopping. Our backbone works quite similarly.</p><p>Our global backbone is built upon two key pillars. The first is BGP (<a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/">Border Gateway Protocol</a>), the routing protocol for the Internet, and the second is Segment Routing MPLS (<a href="https://www.cloudflare.com/learning/network-layer/what-is-mpls/">Multiprotocol label switching</a>), a technique for steering traffic across predefined forwarding paths in an IP network. By default, Segment Routing provides end-to-end encapsulation from ingress to egress routers where the intermediate nodes execute no route lookup. Instead, they forward traffic across an end-to-end virtual circuit, or tunnel, called a label-switched path. Once traffic is put on a label-switched path, it cannot detour onto the public Internet and must continue on the predetermined route across Cloudflare’s backbone. This is nothing new, as many networks will even run a “BGP Free Core” where all the route intelligence is carried at the edge of the network, and intermediate nodes only participate in forwarding from ingress to egress.</p><p>While leveraging Segment Routing Traffic Engineering (SR-TE) in our backbone, we can automatically select paths between our data centers that are optimized for latency and performance. Sometimes the “shortest path” in terms of routing protocol cost is not the lowest latency or highest performance path.</p>
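<p>A toy model of the label-switched path described above: only the ingress consults routing state, and every intermediate node forwards purely on its label table, so traffic cannot detour off the predetermined path. This is a generic MPLS-style sketch with hypothetical node names, not Cloudflare's actual configuration:</p>

```javascript
// Forward a packet along a label-switched path. Intermediate nodes do
// no IP route lookup: they swap the incoming label for the outgoing one
// and pass the packet on. An unknown label means the egress is reached.
function forwardLsp(packet, nodes) {
  const path = [];
  for (const node of nodes) {
    path.push(node.name);
    const next = node.labelTable[packet.label]; // label lookup only
    if (next === undefined) break;              // egress: label popped
    packet.label = next;                        // swap label, keep going
  }
  return path;
}

const nodes = [
  { name: 'ingress', labelTable: { 100: 200 } },
  { name: 'core',    labelTable: { 200: 300 } },
  { name: 'egress',  labelTable: {} },
];
const path = forwardLsp({ label: 100 }, nodes); // ['ingress','core','egress']
```

<p>Swapping a label table entry at the ingress is all it takes to steer a flow onto a different engineered path, which is the mechanism SR-TE exploits to pick latency-optimized routes.</p>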
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6QettBytPdJxacwVLVHYFN/de95a8e5a67514e64931fbe4d26967b6/BLOG-2432-8.png" />
          </figure>
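To make the contrast concrete, here is a minimal sketch of why the fewest-hop path and the lowest-latency path can differ, which is exactly the gap SR-TE lets us exploit. The three-node topology and link latencies are hypothetical, not real backbone measurements:

```python
import heapq

def shortest_path(graph, src, dst, weight):
    """Dijkstra over an adjacency map {node: {neighbor: {"hops": ..., "ms": ...}}}."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, attrs in graph[node].items():
            nd = d + attrs[weight]
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(heap, (nd, nbr))
    # Walk predecessors back from dst to src to recover the path.
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[dst]

# Hypothetical topology: the direct AMS-SIN link is fewer hops but congested;
# the two-hop route via MRS has lower end-to-end latency.
graph = {
    "AMS": {"MRS": {"hops": 1, "ms": 18},  "SIN": {"hops": 1, "ms": 260}},
    "MRS": {"AMS": {"hops": 1, "ms": 18},  "SIN": {"hops": 1, "ms": 140}},
    "SIN": {"AMS": {"hops": 1, "ms": 260}, "MRS": {"hops": 1, "ms": 140}},
}

print(shortest_path(graph, "AMS", "SIN", "hops"))  # -> (['AMS', 'SIN'], 1)
print(shortest_path(graph, "AMS", "SIN", "ms"))    # -> (['AMS', 'MRS', 'SIN'], 158)
```

Minimizing hop count picks the direct link; minimizing measured latency picks the detour, which is the kind of path SR-TE can pin traffic onto.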
    <div>
      <h3>Supercharged: Argo and the global backbone</h3>
      <a href="#supercharged-argo-and-the-global-backbone">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/lp/pg-argo-smart-routing/">Argo Smart Routing</a> is a service that uses Cloudflare’s portfolio of backbone, transit, and peering connectivity to find the most optimal path between the data center where a user’s request lands and your back-end origin server. Argo may forward a request from one Cloudflare data center to another on the way to an origin if the performance would improve by doing so. <a href="http://blog.cloudflare.com/orpheus-saves-internet-requests-while-maintaining-speed">Orpheus</a> is the counterpart to Argo, and routes around degraded paths for all customer origin requests free of charge. Orpheus is able to analyze network conditions in real-time and actively avoid reachability failures. Customers with Argo enabled get optimal performance for requests from Cloudflare data centers to their origins, while Orpheus provides error self-healing for all customers universally. By mixing our global backbone using Segment Routing as an underlay with <a href="https://www.cloudflare.com/application-services/products/argo-smart-routing/">Argo Smart Routing</a> and Orpheus as our connectivity overlay, we are able to transport critical customer traffic along the most optimized paths that we have available.</p><p>So how exactly does our global backbone fit together with Argo Smart Routing? 
<a href="http://blog.cloudflare.com/argo-and-the-cloudflare-global-private-backbone">Argo Transit Selection</a> is an extension of Argo Smart Routing where the lowest latency path between Cloudflare data center hops is explicitly selected and used to forward customer origin requests. The lowest latency path will often be our global backbone, as it is a more dedicated and private means of connectivity, as opposed to third-party transit networks.</p><p>Consider a multinational Dutch pharmaceutical company that relies on Cloudflare's network and services with our <a href="https://www.cloudflare.com/learning/access-management/what-is-sase/">SASE solution</a> to connect their global offices, research centers, and remote employees. Their Asian branch offices depend on Cloudflare's security solutions and network to provide secure access to important data from their central data centers back to their offices in Asia. In case of a cable cut between regions, our network would automatically look for the best alternative route between them so that business impact is limited.</p><p>Argo measures every potential combination of the different provider paths, including our own backbone, as an option for reaching origins with smart routing. Because of our vast interconnection with so many networks, and our global private backbone, Argo is able to identify the most performant network path for requests. The backbone is consistently one of the lowest latency paths for Argo to choose from.</p><p>In addition to high performance, we care greatly about network reliability for our customers. This means we need to be as resilient as possible from fiber cuts and third-party transit provider issues. 
During a disruption of the <a href="https://en.wikipedia.org/wiki/AAE-1">AAE-1</a> (<a href="https://www.submarinecablemap.com/submarine-cable/asia-africa-europe-1-aae-1">Asia Africa Europe-1</a>) submarine cable, this is what Argo saw between Singapore and Amsterdam across some of our transit provider paths vs. the backbone.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/66CGBePnLzuLRuTErvf8Cr/813b4b60a95935491e967214851e5a04/BLOG-2432-9.png" />
          </figure><p>The large spike (purple line) shows a latency increase on one of our third-party IP transit provider paths due to congestion, which eventually resolved, likely after traffic engineering within the provider’s network. We saw a smaller but still noticeable latency increase (yellow line) over other transit networks. The bottom (green) line on the graph is our backbone, where round-trip time remains more or less flat throughout the event, thanks to our diverse backbone connectivity between Asia and Europe. Throughout the fiber cut, we remained stable at around 200ms between Amsterdam and Singapore. There was no noticeable network hiccup as was seen on the transit provider paths, so Argo actively leveraged the backbone for optimal performance.</p>
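A heavily simplified sketch of this kind of path selection, with hypothetical per-path RTT samples during the cable cut. Argo's real scoring is far richer than a median comparison; the path names and numbers here are illustrative only:

```python
# Hypothetical round-trip measurements (ms) from Singapore to Amsterdam
# during a submarine cable cut, one list per candidate path.
samples = {
    "transit-a": [310, 520, 480, 505],  # congested after the cut
    "transit-b": [250, 290, 300, 285],  # degraded, but less so
    "backbone":  [201, 199, 203, 200],  # stays flat throughout
}

def median(xs):
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def best_path(samples):
    """Pick the candidate path with the lowest median RTT; a stand-in for
    Argo's much richer per-path performance scoring."""
    return min(samples, key=lambda name: median(samples[name]))

print(best_path(samples))  # -> backbone
```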
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1A8CdaGq8P2hF3DtIs9dQI/a10fdf3af9de917fb0036d38eace9905/BLOG-2432-10.png" />
          </figure>
    <div>
      <h3>Call to action</h3>
      <a href="#call-to-action">
        
      </a>
    </div>
    <p>As Argo improves performance in our network, Cloudflare Network Interconnects (<a href="https://developers.cloudflare.com/network-interconnect/">CNIs</a>) optimize getting onto it. We encourage our Enterprise customers to use our free CNIs as on-ramps onto our network whenever practical. In this way, you can fully leverage our network, including our robust backbone, and increase overall performance for every product within your Cloudflare Connectivity Cloud. In the end, our global network is our main product, and our backbone plays a critical role in it, helping us build a better Internet by improving our services for everybody, everywhere.</p><p>If you want to be part of our mission, join us as a Cloudflare network on-ramp partner to offer secure and reliable connectivity to your customers by integrating directly with us. Learn more about our on-ramp partnerships and how they can benefit your business <a href="https://www.cloudflare.com/network-onramp-partners/">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Connectivity Cloud]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Argo Smart Routing]]></category>
            <category><![CDATA[Athenian Project]]></category>
            <category><![CDATA[BGP]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[Cloudflare Network]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">WiHZr8Fb6WzdVjo0egsWW</guid>
            <dc:creator>Shozo Moritz Takaya</dc:creator>
            <dc:creator>Bryton Herdes</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare’s systems dynamically route traffic across the globe]]></title>
            <link>https://blog.cloudflare.com/meet-traffic-manager/</link>
            <pubDate>Mon, 25 Sep 2023 13:00:16 GMT</pubDate>
            <description><![CDATA[ Meet the smartest member of the Cloudflare network team that helps keep you online no matter what happens to the Internet underneath us ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NgKY8lcObRRPp4XUwHWRH/fd6462ace767589fd26f2f52aa98252f/image14-1.png" />
            
            </figure><p>Picture this: you’re at an airport, and you’re going through an airport security checkpoint. There are a bunch of agents who are scanning your boarding pass and your passport and sending you through to your gate. All of a sudden, some of the agents go on break. Maybe there’s a leak in the ceiling above the checkpoint. Or perhaps a bunch of flights are leaving at 6pm, and a number of passengers turn up at once. Either way, this imbalance between localized supply and demand can cause huge lines and unhappy travelers — who just want to get through the line to get on their flight. How do airports handle this?</p><p>Some airports may not do anything and just let you suffer in a longer line. Some airports may offer fast-lanes through the checkpoints for a fee. But most airports will tell you to go to another security checkpoint a little farther away to ensure that you can get through to your gate as fast as possible. They may even have signs up telling you how long each line is, so you can make an easier decision when trying to get through.</p><p>At Cloudflare, we have the same problem. We are located in 300 cities around the world that are built to receive end-user traffic for all of our product suites. And in an ideal world, we always have enough computers and bandwidth to handle everyone at their closest possible location. But the world is not always ideal; sometimes we take a data center offline for maintenance, or a connection to a data center goes down, or some equipment fails, and so on. When that happens, we may not have enough attendants to serve every person going through security in every location. It’s not because we haven’t built enough kiosks, but something has happened in our data center that prevents us from serving everyone.</p><p>So, we built Traffic Manager: a tool that balances supply and demand across our entire global network. This blog is about Traffic Manager: how it came to be, how we built it, and what it does now.</p>
    <div>
      <h3>The world before Traffic Manager</h3>
      <a href="#the-world-before-traffic-manager">
        
      </a>
    </div>
    <p>The job now done by Traffic Manager used to be a manual process carried out by network engineers: our network would operate as normal until something happened that caused user traffic to be impacted at a particular data center.</p><p>When such events happened, user requests would start to fail with 499 or 500 errors because there weren’t enough machines to handle the request load of our users. This would trigger a page to our network engineers, who would then remove some Anycast routes for that data center. The end result: by no longer advertising those prefixes in the impacted data center, user traffic would divert to a different data center. This is how Anycast fundamentally works: user traffic is drawn to the closest data center advertising the prefix the user is trying to connect to, as determined by Border Gateway Protocol. For a primer on what Anycast is, check out <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">this reference article</a>.</p><p>Depending on how bad the problem was, engineers would remove some or even all the routes in a data center. When the data center was again able to absorb all the traffic, the engineers would put the routes back and the traffic would return naturally to the data center.</p><p>As you might guess, this was a challenging task for our network engineers to do every single time any piece of hardware on our network had an issue. It didn’t scale.</p>
    <div>
      <h3>Never send a human to do a machine’s job</h3>
      <a href="#never-send-a-human-to-do-a-machines-job">
        
      </a>
    </div>
    <p>But doing it manually wasn’t just a burden on our Network Operations team. It also resulted in a sub-par experience for our customers; our engineers would need to take time to diagnose and re-route traffic. To solve both these problems, we wanted to build a service that would immediately and automatically detect if users were unable to reach a Cloudflare data center, and withdraw routes from the data center until users were no longer seeing issues. Once the service received notifications that the impacted data center could absorb the traffic, it could put the routes back and reconnect that data center. This service is called Traffic Manager, because its job (as you might guess) is to manage traffic coming into the Cloudflare network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/38NInVKI4VizcCukXlF7o4/77758610378a6721ecf30127b6246d0f/image19.png" />
            
            </figure>
    <div>
      <h3>Accounting for second order consequences</h3>
      <a href="#accounting-for-second-order-consequences">
        
      </a>
    </div>
    <p>When a network engineer removes a route from a router, they can make the best guess at where the user requests will move to, and try to ensure that the failover data center has enough resources to handle the requests — if it doesn’t, they can adjust the routes there accordingly prior to removing the route in the initial data center. To be able to automate this process, we needed to move from a world of intuition to a world of data — accurately predicting where traffic would go when a route was removed, and feeding this information to Traffic Manager, so it could ensure it doesn’t make the situation worse.</p>
    <div>
      <h3>Meet Traffic Predictor</h3>
      <a href="#meet-traffic-predictor">
        
      </a>
    </div>
    <p>Although we can adjust which data centers advertise a route, we are unable to influence what proportion of traffic each data center receives. Each time we add a new data center or a new peering session, the distribution of traffic changes, and with over 300 cities and 12,500 peering sessions, it has become quite difficult for a human to keep track of, or predict, the way traffic will move around our network. Traffic Manager needed a buddy: Traffic Predictor.</p><p>In order to do its job, Traffic Predictor carries out an ongoing series of real-world tests to see where traffic actually moves. Traffic Predictor relies on a testing system that simulates removing a data center from service and measuring where traffic would go if that data center wasn’t serving traffic. To help understand how this system works, let’s simulate the removal of a subset of a data center in Christchurch, New Zealand:</p><ul><li><p>First, Traffic Predictor gets a list of all the IP addresses that normally connect to Christchurch. Traffic Predictor will send a ping request to hundreds of thousands of IPs that have recently made a request there.</p></li><li><p>Traffic Predictor records if the IP responds, and whether the response returns to Christchurch using a special Anycast IP range specifically configured for Traffic Predictor.</p></li><li><p>Once Traffic Predictor has a list of IPs that respond to Christchurch, it withdraws the route containing that special range from Christchurch, waits a few minutes for the Internet routing table to be updated, and runs the test again.</p></li><li><p>Instead of being routed to Christchurch, the responses now go to data centers around Christchurch. 
Traffic Predictor then uses the knowledge of which data center received each response, and records the results as the failover for Christchurch.</p></li></ul><p>This allows us to simulate Christchurch going offline without actually taking Christchurch offline!</p><p>But Traffic Predictor doesn’t just do this for any one data center. To add additional layers of resiliency, Traffic Predictor even calculates a second layer of indirection: for each data center failure scenario, Traffic Predictor also calculates failure scenarios and creates policies for when surrounding data centers fail.</p><p>Using our example from before, when Traffic Predictor tests Christchurch, it will run a series of tests that remove several surrounding data centers from service, including Christchurch, to calculate different failure scenarios. This ensures that even if something catastrophic happens which impacts multiple data centers in a region, we still have the ability to serve user traffic. If you think this data model is complicated, you’re right: it takes several days to calculate all of these failure paths and policies.</p><p>Here’s what those failure paths and failover scenarios look like for all of our data centers around the world when they’re visualized:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4NxD3OynAnVJFP9yfv9pa1/1b518728e831062aa5bcae1d8329ffe8/image17.png" />
            
            </figure><p>This can be a bit complicated for humans to parse, so let’s dig into that above scenario for Christchurch, New Zealand to make this a bit more clear. When we take a look at failover paths specifically for Christchurch, we see they look like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6JGMDbR9jjZHFTpQMEkfZI/29dbc0bfeb4fa4147156d331831d8ebc/image20.png" />
            
            </figure><p>In this scenario we predict that 99.8% of Christchurch’s traffic would shift to Auckland, which is able to absorb all Christchurch traffic in the event of a catastrophic outage.</p><p>Traffic Predictor allows us to not only see where traffic will move to if something should happen, but it allows us to preconfigure Traffic Manager policies to move requests out of failover data centers to prevent a thundering herd scenario: where sudden influx of requests can cause failures in a second data center if the first one has issues. With Traffic Predictor, Traffic Manager doesn’t just move traffic out of one data center when that one fails, but it also proactively moves traffic out of other data centers to ensure a seamless continuation of service.</p>
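One way to picture the precomputed policies is as a lookup keyed by failure scenario, where the most specific matching scenario wins. A minimal sketch, using the 99.8%-to-Auckland figure from the text; everything else (the remaining split, the second-layer scenario, the data-center codes) is hypothetical:

```python
# Failover predictions keyed by data center. Each entry maps a failure
# scenario (the set of data centers that are down) to the predicted traffic
# split across the remaining data centers.
policies = {
    "CHC": {  # Christchurch
        frozenset({"CHC"}): {"AKL": 0.998, "SYD": 0.002},
        # Second layer of indirection: what if the primary failover is
        # down too? (hypothetical split)
        frozenset({"CHC", "AKL"}): {"SYD": 0.7, "MEL": 0.3},
    },
}

def failover_split(policies, dc, down):
    """Return the predicted traffic split for `dc` given the set of data
    centers currently down, preferring the most specific scenario."""
    scenarios = policies[dc]
    matching = [s for s in scenarios if s <= down]  # scenario fully covered
    if not matching:
        raise KeyError("no precomputed scenario covers this failure")
    best = max(matching, key=len)  # most specific scenario wins
    return scenarios[best]

print(failover_split(policies, "CHC", {"CHC"}))
print(failover_split(policies, "CHC", {"CHC", "AKL"}))
```

The second lookup shows why the extra layer matters: when both Christchurch and its primary failover are down, the policy already knows where the traffic should land next.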
    <div>
      <h3>From a sledgehammer to a scalpel</h3>
      <a href="#from-a-sledgehammer-to-a-scalpel">
        
      </a>
    </div>
    <p>With Traffic Predictor, Traffic Manager can dynamically advertise and withdraw prefixes while ensuring that every data center can handle all the traffic. But withdrawing prefixes as a means of traffic management can be a bit heavy-handed at times. One of the reasons for this is that the only way we had to add or remove traffic to a data center was through advertising routes from our Internet-facing routers. Each one of our routes has thousands of IP addresses, so removing only one still represents a large portion of traffic.</p><p>Specifically, Internet applications will advertise prefixes to the Internet from a /24 subnet at an absolute minimum, but many will advertise prefixes larger than that. This is generally done to prevent things like route leaks or route hijacks: many providers will actually filter out routes that are more specific than a /24 (for more information on that, check out this blog on <a href="/route-leak-detection-with-cloudflare-radar/">how we detect route leaks</a>). If we assume that Cloudflare maps protected properties to IP addresses at a 1:1 ratio, then each /24 subnet would be able to service 256 customers, which is the number of IP addresses in a /24 subnet. If every IP address sent one request per second, we’d have to move 4 /24 subnets out of a data center if we needed to move 1,000 requests per second (RPS).</p><p>But in reality, Cloudflare maps a single IP address to hundreds of thousands of protected properties. So for Cloudflare, a /24 might take 3,000 requests per second, but if we needed to move 1,000 RPS out, we would have no choice but to move a whole /24 out, shifting three times the traffic we needed to. And that’s just assuming we advertise at a /24 level. 
If we used /20s to advertise, the amount we can withdraw gets less granular: at a 1:1 website to IP address mapping, that’s 4,096 requests per second for each prefix, and even more if the website to IP address mapping is many to one.</p><p>While withdrawing prefix advertisements improved the customer experience for those users who would have seen a 499 or 500 error, a significant portion of users who wouldn’t have been impacted by the issue were still moved away from the data center they should have gone to, probably slowing them down, even if only a little bit. This concept of moving more traffic out than is necessary is called “stranding capacity”: the data center is theoretically able to service more users in a region but cannot because of how Traffic Manager was built.</p><p>We wanted to improve Traffic Manager so that it only moved the absolute minimum of users out of a data center that was seeing a problem, without stranding any more capacity. To do so, we needed to shift percentages of prefixes, so we could be extra fine-grained and only move the things that absolutely need to be moved. To solve this, we built an extension of our <a href="/unimog-cloudflares-edge-load-balancer/">Layer 4 load balancer Unimog,</a> which we call Plurimog.</p><p>A quick refresher on Unimog and layer 4 load balancing: every single one of our machines contains a service that determines whether that machine can take a user request. If the machine can take a user request then it sends the request to our HTTP stack which processes the request before returning it to the user. If the machine can’t take the request, the machine sends the request to another machine in the data center that can. The machines can do this because they are constantly talking to each other to understand whether they can serve requests for users.</p><p>Plurimog does the same thing, but instead of talking between machines, Plurimog talks between data centers and points of presence. 
If a request goes into Philadelphia and Philadelphia is unable to take the request, Plurimog will forward to another data center that can take the request, like Ashburn, where the request is decrypted and processed. Because Plurimog operates at layer 4, it can send individual TCP or UDP requests to other places which allows it to be very fine-grained: it can send percentages of traffic to other data centers very easily, meaning that we only need to send away enough traffic to ensure that everyone can be served as fast as possible. Check out how that works in our Frankfurt data center, as we are able to shift progressively more and more traffic away to handle issues in our data centers. This chart shows the number of actions taken on free traffic that cause it to be sent out of Frankfurt over time:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2NqLJgn8hYCuHKXRaJfX2P/c666ab9c6c8aba3071a3528b274f997b/pasted-image-0-1.png" />
            
            </figure><p>But even within a data center, we can route traffic around to prevent traffic from leaving the data center at all. Our large data centers, called Multi-Colo Points of Presence (MCPs), contain logical sections of compute within a data center that are distinct from one another. These MCP data centers are enabled with another version of Unimog called Duomog, which allows for traffic to be shifted between logical sections of compute automatically. This makes MCP data centers fault-tolerant without sacrificing performance for our customers, and allows Traffic Manager to work within a data center as well as between data centers.</p><p>When evaluating portions of requests to move, Traffic Manager does the following:</p><ul><li><p>Traffic Manager identifies the proportion of requests that need to be removed from a data center or subsection of a data center so that all requests can be served.</p></li><li><p>Traffic Manager then calculates the aggregated space metrics for each target to see how many requests each failover data center can take.</p></li><li><p>Traffic Manager then identifies how much traffic in each plan we need to move, and moves either a proportion of the plan, or all of the plan through Plurimog/Duomog, until we've moved enough traffic. We move Free customers first, and if there are no more Free customers in a data center, we'll move Pro, and then Business customers if needed.</p></li></ul><p>For example, let’s look at Ashburn, Virginia: one of our MCPs. Ashburn has nine different subsections of capacity that can each take traffic. On 8/28, one of those subsections, IAD02, had an issue that reduced the amount of traffic it could handle.</p><p>During this time period, Duomog sent more traffic from IAD02 to other subsections within Ashburn, ensuring that Ashburn was always online, and that performance was not impacted during this issue. Then, once IAD02 was able to take traffic again, Duomog shifted traffic back automatically. 
You can see these actions visualized in the time series graph below, which tracks the percentage of traffic moved over time between subsections of capacity within IAD02 (shown in green):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2tHxLx4GG2vfqVOfscJlCr/c4864e24dc18b7ad1866c716862d79ee/pasted-image-0--1--1.png" />
            
            </figure>
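The plan-by-plan move selection described above (Free first, then Pro, then Business) can be sketched as follows. The plan loads and the amount to move are illustrative numbers, not real measurements:

```python
# Hypothetical per-plan load in a data center, ordered cheapest-to-move
# first, and the amount Traffic Manager has decided must move elsewhere.
plans = [("free", 600), ("pro", 500), ("business", 400)]

def plan_moves(plans, amount_to_move):
    """Walk plans in order, moving whole or partial plans until the target
    amount has been shifted. Returns a list of (plan, fraction moved)."""
    moves, remaining = [], amount_to_move
    for plan, load in plans:
        if remaining <= 0:
            break
        moved = min(load, remaining)
        moves.append((plan, moved / load))
        remaining -= moved
    return moves

print(plan_moves(plans, 1000))  # -> [('free', 1.0), ('pro', 0.8)]
```

All Free traffic moves, plus just enough Pro traffic to cover the remainder; Business traffic stays put.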
    <div>
      <h3>How does Traffic Manager know how much to move?</h3>
      <a href="#how-does-traffic-manager-know-how-much-to-move">
        
      </a>
    </div>
    <p>Although we used requests per second in the example above, using requests per second as a metric isn’t accurate enough when actually moving traffic. The reason for this is that different customers have different resource costs to our service; a website served mainly from cache with the WAF deactivated is much cheaper CPU-wise than a site with all WAF rules enabled and caching disabled. So we record the CPU time that each request consumes. We can then aggregate the CPU time across each plan to find the CPU time usage per plan. We record the CPU time in ms, and take a per-second value, resulting in a unit of milliseconds per second.</p><p>CPU time is an important metric because of the impact it can have on latency and customer performance. As an example, consider the time it takes for an eyeball request to make it entirely through the Cloudflare front line servers: we call this the cfcheck latency. If this number goes too high, then our customers will start to notice, and they will have a bad experience. When cfcheck latency gets high, it’s usually because CPU utilization is high. The graph below shows 95th percentile cfcheck latency plotted against CPU utilization across all the machines in the same data center, and you can see the strong correlation:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZNFbWlf35KUMAh9SchRHb/4e415f7d847d10bbe68585c9a7695266/Untitled.png" />
            
            </figure><p>So having Traffic Manager look at CPU time in a data center is a very good way to ensure that we’re giving customers the best experience and not causing problems.</p><p>After getting the CPU time per plan, we need to figure out how much of that CPU time to move to other data centers. To do this, we aggregate the CPU utilization across all servers to give a single CPU utilization across the data center. If a proportion of servers in the data center fail, due to network device failure, power failure, etc., then the requests that were hitting those servers are automatically routed elsewhere within the data center by Duomog. As the number of servers decreases, the overall CPU utilization of the data center increases. Traffic Manager has three thresholds for each data center: the maximum threshold, the target threshold, and the acceptable threshold:</p><ul><li><p>Maximum: the CPU level at which performance starts to degrade, where Traffic Manager will take action</p></li><li><p>Target: the level to which Traffic Manager will try to reduce the CPU utilization to restore optimal service to users</p></li><li><p>Acceptable: the level below which a data center can receive requests forwarded from another data center, or revert active moves</p></li></ul><p>When a data center goes above the maximum threshold, Traffic Manager takes the ratio of total CPU time across all plans to current CPU utilization, then applies that to the target CPU utilization to find the target CPU time. Doing it this way means we can compare a data center with 100 servers to a data center with 10 servers, without having to worry about the number of servers in each data center. This assumes that load increases linearly, which is close enough to true for the assumption to be valid for our purposes.</p><p>Target ratio equals current ratio:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2EGE8UPINB9MUPtpETprf5/0f82d6166774a63f8e252dea7f169490/Screenshot-2023-09-18-at-22.11.27.png" />
            
            </figure><p>Therefore:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/uyhAhuKbaFhMpdxtJTBFq/89f9032c5e4380b895f5bf9542ced1d9/Screenshot-2023-09-18-at-22.12.17.png" />
            
            </figure><p>Subtracting the target CPU time from the current CPU time gives us the CPU time to move:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4SQDzRDMKby1OrhYxrLhBY/74142207095c5b9a4e869e2fe735ce6f/Screenshot-2023-09-18-at-22.13.14.png" />
            
            </figure><p>For example, if the current CPU utilization was at 90% across the data center, the target was 85%, and the CPU time across all plans was 18,000, we would have:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2A074OHvmPertIwkAIb0ev/e49abb26ac4b17cba2a96a4e5ccb7a8f/Screenshot-2023-09-18-at-22.14.02.png" />
            
            </figure><p>This would mean Traffic Manager would need to move 1,000 CPU time:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/qKnOiyzymcphDes5ee2cS/d7cec01541398c1b3ae1b075622626c9/Screenshot-2023-09-25-at-10.44.41.png" />
            
            </figure><p>Now that we know the total CPU time that needs to move, we can go through the plans until the required amount of CPU time has been moved.</p>
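Putting the worked example into code, the arithmetic is a direct transcription of the ratio above:

```python
def cpu_time_to_move(total_cpu_time, current_util, target_util):
    """Scale total CPU time by target/current utilization to get the target
    CPU time; the difference is what Traffic Manager must move away."""
    target_cpu_time = total_cpu_time * (target_util / current_util)
    return total_cpu_time - target_cpu_time

# Worked example from the text: 90% current utilization, 85% target,
# 18,000 ms/s of CPU time across all plans.
print(round(cpu_time_to_move(18_000, 0.90, 0.85)))  # -> 1000
```

Because both sides are expressed as ratios, the same formula works for a 10-server and a 100-server data center alike.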
    <div>
      <h3>What is the maximum threshold?</h3>
      <a href="#what-is-the-maximum-threshold">
        
      </a>
    </div>
    <p>A frequent problem that we faced was determining at which point Traffic Manager should start taking action in a data center - what metric should it watch, and what is an acceptable level?</p><p>As mentioned before, different services have different requirements in terms of CPU utilization, and there are many cases of data centers that have very different utilization patterns.</p><p>To solve this problem, we turned to <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a>. We created a service that will automatically adjust the maximum thresholds for each data center according to customer-facing indicators. For our main service-level indicator (SLI), we decided to use the cfcheck latency metric we described earlier.</p><p>But we also need to define a service-level objective (SLO) in order for our machine learning application to be able to adjust the threshold. We set the SLO at 20ms. Comparing our SLO to our SLI, our 95th percentile cfcheck latency should never go above 20ms, and if it does, we need to act. The below graph shows 95th percentile cfcheck latency over time, and customers start to get unhappy when cfcheck latency goes into the red zone:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3A9IRcG1ztxPxIxF0ljSLj/9db6c767756a2e0e17b7c8b1d00089fa/Untitled--1-.png" />
            
            </figure><p>If customers have a bad experience when CPU gets too high, then the goal of Traffic Manager’s maximum thresholds is to ensure that customer performance isn’t impacted and to start redirecting traffic away before performance starts to degrade. At a scheduled interval, the Traffic Manager service will fetch a number of metrics for each data center and apply a series of machine learning algorithms. After cleaning the data for outliers, we apply a simple quadratic curve fit, and we are currently testing a linear regression algorithm.</p><p>After fitting the models we can use them to predict the CPU usage when the SLI is equal to our SLO, and then use it as our maximum threshold. If we plot the CPU values against the SLI we can see clearly why these methods work so well for our data centers, as you can see for Barcelona in the graphs below, which show the quadratic curve fit and the linear regression fit, respectively.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/695BSI9vd6OJ0jfhHwSj43/df037d425ddf50b3579696c770e6ff80/Untitled--2-.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LDLfdY2PNOTJjA1FU6lQM/e4341455c659795ce9ee2e004d1eb5b0/Untitled--3-.png" />
            
            </figure><p>In these charts the vertical line is the SLO, and the intersection of this line with the fitted model represents the value that will be used as the maximum threshold. This model has proved to be very accurate, and we are able to significantly reduce the SLO breaches. Let’s take a look at when we started deploying this service in Lisbon:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/vK8dNvcvRFb0L9WiKuaOI/9ed8d056f903d5ce09b62515c6ecb9c2/Untitled--4-.png" />
            
            </figure><p>Before the change, cfcheck latency was constantly spiking, but Traffic Manager wasn’t taking action because the maximum threshold was static. Since the change on July 29, cfcheck latency has never hit the SLO, because we are constantly measuring to make sure that customers are never impacted by CPU increases.</p>
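<p>The threshold-fitting step described above can be sketched in a few lines. This is a simplified illustration rather than Cloudflare’s actual implementation (function and variable names are hypothetical): it fits a quadratic to observed (CPU utilization, p95 cfcheck latency) samples and solves for the CPU level at which predicted latency hits the 20ms SLO.</p>

```python
import numpy as np

def fit_max_threshold(cpu_pct, latency_p95_ms, slo_ms=20.0):
    # Quadratic curve fit: p95 cfcheck latency as a function of CPU utilization.
    a, b, c = np.polyfit(cpu_pct, latency_p95_ms, deg=2)
    # The maximum threshold is the CPU level where the fitted curve crosses
    # the SLO, i.e. the real root of a*x^2 + b*x + (c - slo) = 0 that lies
    # in the valid 0-100% utilization range.
    roots = np.roots([a, b, c - slo_ms])
    valid = [r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real <= 100]
    return max(valid) if valid else None
```

<p>Outlier removal and model selection (curve fit vs. linear regression) would happen around this core step.</p>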
    <div>
      <h3>Where to send the traffic?</h3>
      <a href="#where-to-send-the-traffic">
        
      </a>
    </div>
    <p>Now that we have a maximum threshold, we need to find the third CPU utilization threshold, which isn’t used when calculating how much traffic to move: the acceptable threshold. When a data center is below this threshold, it has unused capacity which, as long as the data center isn’t forwarding traffic itself, is made available for other data centers to use when required. To work out how much each data center is able to receive, we use the same methodology as above, substituting the acceptable threshold for the target:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1yHCxMMzytg6BnYdigXx1w/9f441adfb38cc474ab8580cdd1b5c72d/Screenshot-2023-09-18-at-22.16.48.png" />
            
            </figure><p>Therefore:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6A0qMzXEmTPhVl8KRIiTRO/7830d77baee83d779ff01ff3605f0f09/Screenshot-2023-09-18-at-22.17.36.png" />
            
            </figure><p>Subtracting the current CPU time from the acceptable CPU time gives us the amount of CPU time that a data center could accept:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7yQPGjUGnLWy9C9GGOzIRN/ad105de71360d03299039453b554059e/Screenshot-2023-09-18-at-22.18.16.png" />
            
            </figure><p>To find where to send traffic, Traffic Manager finds the available CPU time in all data centers, then orders them by latency from the data center needing to move traffic. It moves through each of the data centers, using all available capacity based on the maximum thresholds before moving on to the next. When deciding which plans to move, we move from the lowest priority plan to the highest, but when deciding where to send them, we move in the opposite direction.</p><p>To make this clearer, let's use an example:</p><p>We need to move 1,000ms/s of CPU time from data center A, and we have the following usage per plan: Free: 500ms/s, Pro: 400ms/s, Business: 200ms/s, Enterprise: 1,000ms/s.</p><p>We would move 100% of Free (500ms/s), 100% of Pro (400ms/s), 50% of Business (100ms/s), and 0% of Enterprise.</p><p>Nearby data centers have the following available CPU time: B: 300ms/s, C: 300ms/s, D: 1,000ms/s.</p><p>With latencies: A-B: 100ms, A-C: 110ms, A-D: 120ms.</p><p>Starting with the lowest latency and the highest priority plan that requires action, we would be able to move all the Business CPU time to data center B, plus half of Pro. Next we would move on to data center C, and be able to move the rest of Pro and 20% of Free. The rest of Free could then be forwarded to data center D. This results in the following actions: Business: 50% → B; Pro: 50% → B, 50% → C; Free: 20% → C, 80% → D.</p>
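<p>The two passes described above (drop plans lowest-priority-first, then place the dropped traffic highest-priority-first into the nearest data centers) can be sketched as follows. This is an illustrative simplification, not Cloudflare's real code; the function and data shapes are hypothetical.</p>

```python
def plan_traffic_moves(cpu_to_move, plans, targets):
    """Sketch of the allocation above. `plans` lists (name, usage in ms/s)
    from lowest to highest priority; `targets` lists (data center,
    available CPU time in ms/s) ordered by latency, nearest first."""
    # Pass 1: decide how much of each plan to move, lowest priority first.
    dropped, remaining = [], cpu_to_move
    for name, usage in plans:
        if remaining <= 0:
            break
        take = min(usage, remaining)
        dropped.append((name, take))
        remaining -= take
    # Pass 2: place the moved traffic, highest priority plan first,
    # filling the nearest data center before moving on to the next.
    capacity = dict(targets)
    moves = []
    for name, ms in reversed(dropped):
        for dc, _ in targets:
            if ms <= 0:
                break
            put = min(ms, capacity[dc])
            if put > 0:
                moves.append((name, dc, put))
                capacity[dc] -= put
                ms -= put
    return moves
```

<p>Running this with the example numbers above reproduces the result in the text: Business sends 100ms/s to B; Pro sends 200ms/s to B and 200ms/s to C; Free sends 100ms/s to C and 400ms/s to D.</p>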
    <div>
      <h3>Reverting actions</h3>
      <a href="#reverting-actions">
        
      </a>
    </div>
    <p>In the same way that Traffic Manager is constantly looking to keep data centers from going above the threshold, it is also looking to bring any forwarded traffic back into a data center that is actively forwarding traffic.</p><p>Above we saw how Traffic Manager works out how much traffic a data center is able to receive from another data center — it calls this the available CPU time. When there is an active move we use this available CPU time to bring back traffic to the data center — we always prioritize reverting an active move over accepting traffic from another data center.</p><p>When you put this all together, you get a system that is constantly measuring system and customer health metrics for every data center and spreading traffic around to make sure that each request can be served given the current state of our network. When we put all of these moves between data centers on a map, it looks something like this, a map of all Traffic Manager moves for a period of one hour. This map doesn’t show our full data center deployment, but it does show the data centers that are sending or receiving moved traffic during this period:</p><div>
  
</div>
<p>Data centers in red or yellow are under load and shifting traffic to other data centers until they become green, which means that all metrics are showing as healthy. The size of each circle represents how many requests are shifted from or to that data center, and the lines show where the traffic is going. This is difficult to see at a world scale, so let’s zoom into the United States to see this in action for the same time period:</p><div>
  
</div>
<p>Here you can see that Toronto, Detroit, New York, and Kansas City are unable to serve some requests due to hardware issues, so they send those requests to Dallas, Chicago, and Ashburn until equilibrium is restored for users and data centers. Once data centers like Detroit are able to service all the requests they receive without needing to send traffic away, Detroit will gradually stop forwarding requests to Chicago until any issues in the data center are completely resolved, at which point it will no longer forward anything. Throughout all of this, end users stay online and are not impacted by any physical issues that may be happening in Detroit or any of the other locations sending traffic.</p>
    <div>
      <h3>Happy network, happy products</h3>
      <a href="#happy-network-happy-products">
        
      </a>
    </div>
    <p>Because Traffic Manager is plugged into the user experience, it is a fundamental component of the Cloudflare network: it keeps our products online and ensures that they’re as fast and reliable as they can be. It’s our real-time load balancer, helping to keep our products fast by shifting only the necessary traffic away from data centers that are having issues. Because less traffic gets moved, our products and services stay fast.</p><p>But Traffic Manager can also help keep our products online and reliable because it allows our products to predict where reliability issues may occur and preemptively move workloads elsewhere. For example, Browser Isolation works directly with Traffic Manager to help ensure the uptime of the product. When you connect to a Cloudflare data center to create a hosted browser instance, Browser Isolation first asks Traffic Manager whether the data center has enough capacity to run the instance locally, and if so, the instance is created right then and there. If there isn’t sufficient capacity available, Traffic Manager tells Browser Isolation which nearby data center has sufficient available capacity, thereby helping Browser Isolation to provide the best possible experience for the user.</p>
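<p>The Browser Isolation placement decision described above boils down to a simple capacity query. The sketch below is hypothetical (the interface names <code>has_capacity</code> and <code>closest_with_capacity</code> are illustrative, not Cloudflare's actual API): run locally if Traffic Manager says there is room, otherwise fall back to the closest data center that has capacity.</p>

```python
def place_browser_instance(local_dc, traffic_manager):
    # Hypothetical sketch of the Browser Isolation / Traffic Manager
    # interaction: create the hosted browser instance in the local data
    # center when capacity allows, otherwise in the closest one that has room.
    if traffic_manager.has_capacity(local_dc):
        return local_dc
    return traffic_manager.closest_with_capacity(local_dc)
```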
    <div>
      <h3>Happy network, happy users</h3>
      <a href="#happy-network-happy-users">
        
      </a>
    </div>
    <p>At Cloudflare, we operate this huge network to service all of our different products and customer scenarios. We’ve built this network for resiliency: in addition to our MCP locations designed to reduce the impact of a single failure, we are constantly shifting traffic around on our network in response to internal and external issues.</p><p>But that is our problem — not yours.</p><p>In the past, when human beings had to fix those issues, it was customers and end users who were impacted. To ensure that you’re always online, we’ve built a smart system that detects hardware failures and preemptively balances traffic across our network to keep it online and as fast as possible. This system works faster than any person, not only allowing our network engineers to sleep at night, but also providing a better, faster experience for all of our customers.</p><p>And finally: if these kinds of engineering challenges sound exciting to you, then please consider checking out the Traffic Engineering team's open position on Cloudflare’s <a href="https://cloudflare.com/careers/jobs">Careers</a> page!</p>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Traffic]]></category>
            <category><![CDATA[Routing]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Interconnection]]></category>
            <guid isPermaLink="false">lz1BR25h48nezzhD7C1Gr</guid>
            <dc:creator>David Tuber</dc:creator>
            <dc:creator>Luke Orden</dc:creator>
            <dc:creator>Gonçalo Grilo</dc:creator>
        </item>
        <item>
            <title><![CDATA[Announcing Anycast IPsec: a new on-ramp to Cloudflare One]]></title>
            <link>https://blog.cloudflare.com/anycast-ipsec/</link>
            <pubDate>Mon, 06 Dec 2021 13:59:45 GMT</pubDate>
            <description><![CDATA[ Today, we're excited to announce support for IPsec as an on-ramp to Cloudflare One. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Today, we're excited to announce support for IPsec as an on-ramp to Cloudflare One. As a customer, you should be able to use whatever method you want to get your traffic to Cloudflare's network. We've heard from you that IPsec is your method of choice for connecting to us at the network layer, because of its near-universal vendor support and blanket layer of encryption across all traffic. So we built support for it! Read on to learn how our IPsec implementation is faster and easier to use than traditional IPsec connectivity, and how it integrates deeply with our Cloudflare One suite to provide unified security, performance, and reliability across all your traffic.</p>
    <div>
      <h3>Using the Internet as your corporate network</h3>
      <a href="#using-the-internet-as-your-corporate-network">
        
      </a>
    </div>
    <p>With <a href="https://www.cloudflare.com/cloudflare-one/">Cloudflare One</a>, customers can connect any traffic source or destination — branch offices, data centers, cloud properties, user devices — to our network. Traffic is routed to the closest Cloudflare location, where security policies are applied before we send it along optimized routes to its destination — whether that’s within your private network or on the Internet. It is good practice to encrypt any traffic that’s sensitive at the application level, but for customers who are transitioning from forms of private connectivity like <a href="https://www.cloudflare.com/learning/network-layer/what-is-mpls/">Multiprotocol Label Switching (MPLS)</a>, this often isn’t a reality. We’ve talked to many customers who have legacy file transfer and other applications running across their MPLS circuits unencrypted, and are relying on the fact that these circuits are “private” to provide security. In order to start sending this traffic over the Internet, customers need a blanket layer of encryption across all of it; IPsec tunnels are traditionally an easy way to accomplish this.</p>
    <div>
      <h3>Traditional IPsec implementations</h3>
      <a href="#traditional-ipsec-implementations">
        
      </a>
    </div>
    <p>IPsec as a technology has been around since 1995, and is broadly implemented across many hardware and software platforms. Many companies have adopted IPsec VPNs for securely transferring corporate traffic over the Internet. These VPNs tend to have one of two main architectures: hub and spoke, or mesh.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/VgcVHiIXfIo3mp0ZGnoIV/b61543b156b8a4e200e42a89e7d4fbe1/image1-22.png" />
            
            </figure><p>In the hub and spoke model, each “spoke” node establishes an IPsec tunnel back to a core “hub,” usually a headquarters or data center location. Traffic between spokes flows through the hub for routing and so that security policies can be applied (e.g. by an on-premises firewall). This architecture is simple because each node only needs to maintain one tunnel to get connectivity to other locations, but it can introduce significant performance penalties. Imagine a global network with two “spokes,” one in India and one in Singapore, but a “hub” located in the United States — traffic between them needs to travel thousands of miles to the hub and back in order to reach its destination.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/51rrrTeThskEktKstglrC4/8947f899ee685f13b8f5d7fc1a1d8cd7/image3-9.png" />
            
            </figure><p>In the mesh model, every node is connected to every other node with a dedicated IPsec tunnel. This improves performance because traffic can take more direct paths, but in practice means an unmanageable number of tunnels after even a handful of locations are added.</p><p>Customers we’ve talked to about IPsec know they want it for the blanket layer of encryption and broad vendor support, but they haven’t been particularly <i>excited</i> about it because of the problems with existing architecture models. We knew we wanted to develop something that was easier to use and left those problems in the past, so that customers could get excited about building their next-generation network on Cloudflare. So how are we bringing IPsec out of the 90s? By delivering it on our global Anycast network: customers establish one IPsec tunnel to us and get automatic connectivity to 250+ locations. <b>It’s conceptually similar to the hub and spoke model, but the “hub” is everywhere, blazing fast, and easy to manage.</b></p>
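<p>The scaling problem with a full mesh is easy to quantify: every pair of sites needs its own tunnel, so the tunnel count grows quadratically with the number of sites.</p>

```python
def mesh_tunnel_count(sites: int) -> int:
    # A full mesh needs one IPsec tunnel per pair of sites: n * (n - 1) / 2.
    return sites * (sites - 1) // 2
```

<p>Ten sites already require 45 tunnels to configure, monitor, and re-key; 50 sites require 1,225.</p>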
    <div>
      <h3>So how does IPsec actually work?</h3>
      <a href="#so-how-does-ipsec-actually-work">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/network-layer/what-is-ipsec/">IPsec</a> was designed back in 1995 to provide authentication, integrity, and confidentiality for IP packets. One of the ways it does this is by creating tunnels between two hosts, encrypting the IP packets, and adding a new IP header onto encrypted packets. To make this happen, IPsec has two components working together: a userspace Internet Key Exchange (IKE) daemon and an IPsec stack in kernel-space. IKE is the protocol which creates Security Associations (SAs) for IPsec. An SA is a collection of all the security parameters, like those for authentication and encryption, that are needed to establish an IPsec tunnel.</p><p>When a new IPsec tunnel needs to be set up, one IKE daemon will initiate a session with another and create an SA. All the complexity of configuration, key negotiation, and key generation happens in a handful of packets between the two IKE daemons safely in userspace. Once the IKE Daemons have started their session, they hand off their nice and neat SA to the IPsec stack in kernel-space, which now has all the information it needs to intercept the right packets for encryption and decryption.</p><p>There are plenty of open source IKE daemons, including strongSwan, Libreswan, and Openswan, that we considered using for our IPsec implementation. These “swans” all tie speaking the IKE protocol tightly with configuring the IPsec stack. This is great for establishing point-to-point tunnels — installing one “swan” is all you need to speak IKE and configure an encrypted tunnel. But we’re building the next-generation network that takes advantage of Cloudflare’s entire global Anycast edge. So how do we make it so that a customer sets up one tunnel with Cloudflare with every single edge server capable of exchanging data on it?</p>
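<p>To make the IKE/kernel split concrete, here is a highly simplified sketch of the kind of record an IKE daemon hands off after negotiation. Field names are illustrative; a real SA carries considerably more state (lifetimes, replay windows, traffic selectors, and separate keys per direction).</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SecurityAssociation:
    """Simplified: the parameters the kernel IPsec stack needs in order to
    encrypt or decrypt packets for one direction of one tunnel."""
    spi: int               # Security Parameter Index identifying this SA
    local_addr: str        # our tunnel endpoint
    peer_addr: str         # the remote tunnel endpoint
    encryption_alg: str    # e.g. "aes-256-gcm"
    encryption_key: bytes  # negotiated by IKE, never sent on the wire

def install_sa(sa: SecurityAssociation, kernel_sadb: dict) -> None:
    # The hand-off step: the userspace IKE daemon writes the finished SA
    # into the kernel's security association database, keyed by SPI.
    kernel_sadb[sa.spi] = sa
```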
    <div>
      <h3>Anycast IPsec: an implementation for next-generation networks</h3>
      <a href="#anycast-ipsec-an-implementation-for-next-generation-networks">
        
      </a>
    </div>
    <p>The fundamental problem in the way of Anycast IPsec is that the SA needs to be handed off to the kernel-space IPsec stack on every Cloudflare edge server, but the SA is created on only one server — the one running the IKE daemon that the customer’s IKE daemon connects to. How do we solve this problem? The first thing that needs to be true is that every server needs to be able to create that SA.</p><p>Every Cloudflare server now runs an IKE daemon, so customers can have a fast, reliable connection to start a tunnel anywhere in the world. We looked at using one of the existing “swans” but that tight coupling of IKE with the IPsec stack meant that the SA was hard to untangle from configuring the dataplane. We needed the SA totally separate and neatly sharable from the server that created it to every other server on our edge. Naturally, we built our own “swan” to do just that.</p><p>To send our SA worldwide, we put a new spin on an old trick. With <a href="/argo-tunnels-that-live-forever/">Cloudflare Tunnels</a>, a customer’s cloudflared tunnel process creates connections to a few nearby Cloudflare edge servers. But traffic destined for that tunnel could arrive at <i>any</i> edge server, which needs to know how to proxy traffic to the tunnel-connected edge servers. So, we built technology that enables an edge server to rapidly distribute information about its Cloudflare Tunnel connections to all other edge servers.</p><p>Fundamentally, the problem of SA distribution is similar -- a customer’s IKE daemon connects to a single Cloudflare edge server’s IKE daemon, and information about that connection needs to be distributed to every other edge server. So, we upgraded the Cloudflare Tunnel technology to make it more general and are now using it to distribute SAs as part of Anycast IPsec support. 
Within seconds of an SA being created, it is distributed to every Cloudflare edge server where a streaming protocol applies the configuration to the kernel-space IPsec stack. Cloudflare’s Anycast IPsec benefits from the same reliability and resilience we’ve built in Cloudflare Tunnels and turns our network into one massively scalable, resilient IPsec tunnel to your network.</p>
    <div>
      <h3>On-ramp with IPsec, access all of Cloudflare One</h3>
      <a href="#on-ramp-with-ipsec-access-all-of-cloudflare-one">
        
      </a>
    </div>
    <p>We built IPsec as an on-ramp to Cloudflare One on top of our existing global system architecture, putting the principles customers care about first. You care about ease of deployment, so we made it possible for you to connect to your entire virtual network on Cloudflare One with a single IPsec tunnel. You care about performance, so we built technology that connects your IPsec tunnel to every Cloudflare location, eliminating <a href="/magic-makes-your-network-faster/">hub-and-spoke performance penalties</a>. You care about enforcing security policies across all your traffic regardless of source, so we integrated IPsec with the entire Cloudflare One suite including Magic Transit, Magic Firewall, Zero Trust, and more.</p><p>IPsec is in early access for Cloudflare One customers. If you’re interested in trying it out, contact your account team today!</p> ]]></content:encoded>
            <category><![CDATA[CIO Week]]></category>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[Anycast]]></category>
            <guid isPermaLink="false">3cUveDETHTeRr47vO01tsY</guid>
            <dc:creator>Annika Garbers</dc:creator>
            <dc:creator>Michael Vanderwater</dc:creator>
            <dc:creator>Arég Harutyunyan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Magic Transit: Network functions at Cloudflare scale]]></title>
            <link>https://blog.cloudflare.com/magic-transit-network-functions/</link>
            <pubDate>Tue, 13 Aug 2019 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today we announced Cloudflare Magic Transit, which makes Cloudflare’s network available to any IP traffic on the Internet. Up until now, Cloudflare has primarily operated proxy services: our servers terminate HTTP, TCP, and UDP sessions ]]></description>
            <content:encoded><![CDATA[ <p>Today we announced <a href="https://www.cloudflare.com/magic-transit">Cloudflare Magic Transit</a>, which makes Cloudflare’s network available to any IP traffic on the Internet. Up until now, Cloudflare has primarily operated proxy services: our servers terminate HTTP, TCP, and UDP sessions with Internet users and pass that data through new sessions they create with origin servers. With Magic Transit, we are now also operating at the IP layer: in addition to terminating sessions, our servers are applying a suite of network functions (DoS mitigation, firewalling, routing, and so on) on a packet-by-packet basis.</p><p>Over the past nine years, we’ve built a robust, scalable global network that currently spans 193 cities in over 90 countries and is ever growing. All Cloudflare customers benefit from this scale thanks to two important techniques. The first is anycast networking. Cloudflare was an <a href="/a-brief-anycast-primer/">early adopter</a> of anycast, using this routing technique to distribute Internet traffic across our data centers. It means that any data center can handle any customer’s traffic, and we can spin up new data centers without needing to acquire and provision new IP addresses. The second technique is homogeneous server architecture. Every server in each of our edge data centers is <a href="/cloudflare-architecture-and-how-bpf-eats-the-world/">capable of running every task</a>. We build our servers on commodity hardware, making it easy to quickly increase our processing capacity by adding new servers to existing data centers. Having no specialty hardware to depend on has also led us to develop an expertise in pushing the limits of what’s possible in networking using modern Linux kernel techniques.</p><p>Magic Transit is built on the same network using the same techniques, meaning our customers can now run their network functions at Cloudflare scale. 
Our fast, secure, reliable global edge becomes our customers’ edge. To explore how this works, let’s follow the journey of a packet from a user on the Internet to a Magic Transit customer’s network.</p>
    <div>
      <h2>Putting our DoS mitigation to work… for you!</h2>
      <a href="#putting-our-dos-mitigation-to-work-for-you">
        
      </a>
    </div>
    <p>In the <a href="/magic-transit/">announcement blog post</a> we describe an example deployment for Acme Corp. Let’s continue with this example here. When Acme brings their IP prefix 203.0.113.0/24 to Cloudflare, we start announcing that prefix to our transit providers, peers, and to Internet exchanges in each of our data centers around the globe. Additionally, Acme stops announcing the prefix to their own ISPs. This means that any IP packet on the Internet with a destination address within Acme’s prefix is delivered to a nearby Cloudflare data center, not to Acme’s router.</p><p>Let’s say I want to access Acme’s FTP server on 203.0.113.100 from my computer in Cloudflare’s office in Champaign, IL. My computer generates a TCP SYN packet with destination address 203.0.113.100 and sends it out to the Internet. Thanks to <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">anycast</a>, that packet ends up at Cloudflare’s data center in Chicago, which is the closest data center (in terms of Internet routing distance) to Champaign. The packet arrives on the data center’s router, which uses <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing">ECMP (Equal Cost Multi-Path) routing</a> to select which server should handle the packet and dispatches the packet to the selected server.</p><p>Once at the server, the packet flows through our <a href="/cloudflare-architecture-and-how-bpf-eats-the-world/">XDP- and iptables-based</a> DoS detection and mitigation functions. If this TCP SYN packet were determined to be part of an attack, it would be dropped and that would be the end of it. Fortunately for me, the packet is permitted to pass.</p><p>So far, this looks exactly like any other traffic on Cloudflare’s network. 
Because of our expertise in running a global anycast network we’re able to attract Magic Transit customer traffic to every data center and apply the <a href="/tag/ddos/">same DoS mitigation solution</a> that has been protecting Cloudflare for years. Our DoS solution has handled some of the largest attacks ever recorded, including a 942Gbps SYN flood in 2018. Below is a screenshot of a recent SYN flood of 300M packets per second. <a href="/how-cloudflares-architecture-allows-us-to-scale-to-stop-the-largest-attacks/">Our architecture lets us scale</a> to stop the largest attacks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/BeJqqpkQ9uUUFLWFPjYe9/1c241da7ad7018a32c277d3bb3dfaee7/large-syn-flood.png" />
            
            </figure>
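<p>The ECMP dispatch step described a few paragraphs above can be sketched as a hash over the packet's flow identifiers, so that every packet of one connection lands on the same server. The sketch below uses CRC32 as a stand-in for the router's hash function, which is vendor-specific.</p>

```python
import zlib

def ecmp_select(servers, src_ip, dst_ip, proto, src_port, dst_port):
    # Hash the flow 5-tuple and pick a server; identical tuples always
    # map to the same server, keeping a TCP connection on one machine.
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return servers[zlib.crc32(key) % len(servers)]
```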
    <div>
      <h2>Network namespaces for isolation and control</h2>
      <a href="#network-namespaces-for-isolation-and-control">
        
      </a>
    </div>
    <p>The above looked identical to how all other Cloudflare traffic is processed, but this is where the similarities end. For our other services, the TCP SYN packet would now be dispatched to a local proxy process (e.g. our nginx-based HTTP/S stack). For Magic Transit, we instead want to dynamically provision and apply customer-defined network functions like firewalls and routing. We needed a way to quickly spin up and configure these network functions while also providing inter-network isolation. For that, we turned to network namespaces.</p><p>Namespaces are a collection of Linux kernel features for creating lightweight virtual instances of system resources that can be shared among a group of processes. Namespaces are a fundamental building block for containerization in Linux. Notably, Docker is built on Linux namespaces. A <i>network namespace</i> is an isolated instance of the Linux network stack, including its own network interfaces (with their own eBPF hooks), routing tables, netfilter configuration, and so on. Network namespaces give us a low-cost mechanism to rapidly apply customer-defined network configurations in isolation, all with built-in Linux kernel features so there’s no performance hit from userspace packet forwarding or proxying.</p><p>When a new customer starts using Magic Transit, we create a brand new network namespace for that customer on every server across our edge network (did I mention that every server can run every task?). We built a daemon that runs on our servers and is responsible for managing these network namespaces and their configurations. This daemon is constantly reading configuration updates from <a href="/helping-to-build-cloudflare-part-4/">Quicksilver</a>, our globally distributed key-value store, and applying customer-defined configurations for firewalls, routing, etc, <i>inside the customer’s namespace</i>. 
For example, if Acme wants to provision a firewall rule to allow FTP traffic (TCP ports 20 and 21) to 203.0.113.100, that configuration is propagated globally through Quicksilver and the Magic Transit daemon applies the firewall rule by adding an nftables rule to the Acme customer namespace:</p>
            <pre><code># Apply nftables rule inside Acme’s namespace
$ sudo ip netns exec acme_namespace nft add rule inet filter prerouting ip daddr 203.0.113.100 tcp dport 20-21 accept</code></pre>
            <p>Getting the customer’s traffic to their network namespace requires a little routing configuration in the default network namespace. When a network namespace is created, a pair of virtual ethernet (veth) interfaces is also created: one in the default namespace and one in the newly created namespace. This interface pair creates a “virtual wire” for delivering network traffic into and out of the new network namespace. In the default network namespace, we maintain a routing table that forwards Magic Transit customer IP prefixes to the veths corresponding to those customers’ namespaces. We use iptables to mark the packets that are destined for Magic Transit customer prefixes, and we have a routing rule that specifies that these specially marked packets should use the Magic Transit routing table.</p><p>(Why go to the trouble of marking packets in iptables and maintaining a separate routing table? Isolation. By keeping Magic Transit routing configurations separate we reduce the risk of accidentally modifying the default routing table in a way that affects how non-Magic Transit traffic flows through our edge.)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7qBuqejDrTQ3QFlEZ5IRzk/e7cdbc6f6d3c14a6a8dfe8d9d76abbb3/067C8C9E-3C73-4DA0-AD77-35DAC9A5F182.png" />
            
            </figure><p>Network namespaces provide a lightweight environment where a Magic Transit customer can run and manage network functions in isolation, letting us put full control in the customer’s hands.</p>
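<p>The forwarding decision described above (match a packet's destination against customer prefixes, then hand it to that customer's veth) amounts to a longest-prefix match. A hypothetical sketch, with the prefix table and interface names invented for illustration:</p>

```python
import ipaddress

# Hypothetical table mapping Magic Transit customer prefixes to the veth
# interface leading into that customer's network namespace.
PREFIX_TO_VETH = {
    ipaddress.ip_network("203.0.113.0/24"): "veth-acme",
}

def veth_for(dst_ip: str):
    # Longest-prefix match over customer prefixes (a linear scan here;
    # the kernel's routing table does this far more efficiently).
    addr = ipaddress.ip_address(dst_ip)
    matches = [(net.prefixlen, veth)
               for net, veth in PREFIX_TO_VETH.items() if addr in net]
    return max(matches)[1] if matches else None
```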
    <div>
      <h2>GRE + anycast = magic</h2>
      <a href="#gre-anycast-magic">
        
      </a>
    </div>
    <p>After passing through the edge network functions, the TCP SYN packet is finally ready to be delivered back to the customer’s network infrastructure. Because Acme Corp. does not have a network footprint in a colocation facility with Cloudflare, we need to deliver their network traffic over the public Internet.</p><p>This poses a problem. The destination address of the TCP SYN packet is 203.0.113.100, but the only network announcing the IP prefix 203.0.113.0/24 on the Internet is Cloudflare. This means that we can’t simply forward this packet out to the Internet—it will boomerang right back to us! In order to deliver this packet to Acme we need to use a technique called tunneling.</p><p>Tunneling is a method of carrying traffic from one network over another network. In our case, it involves encapsulating Acme’s IP packets inside of IP packets that can be delivered to Acme’s router over the Internet. There are a number of <a href="https://en.wikipedia.org/wiki/Tunneling_protocol#Common_tunneling_protocols">common tunneling protocols</a>, but <a href="https://en.wikipedia.org/wiki/Generic_Routing_Encapsulation">Generic Routing Encapsulation</a> (GRE) is often used for its simplicity and widespread vendor support.</p><p>GRE tunnel endpoints are configured both on Cloudflare’s servers (inside of Acme’s network namespace) and on Acme’s router. Cloudflare servers then encapsulate IP packets destined for 203.0.113.0/24 inside of IP packets destined for a publicly-routable IP address for Acme’s router, which decapsulates the packets and emits them into Acme’s internal network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6I9uQQIPjqBE6hGYsIyPsF/61e810c12480ad0ee7267f5354e1f510/97ED9A08-71AA-490B-BD35-1E6D288EE1A3.png" />
            
            </figure><p>Now, I’ve omitted an important detail in the diagram above: the IP address of Cloudflare’s side of the GRE tunnel. Configuring a GRE tunnel requires specifying an IP address for each side, and the outer IP header for packets sent over the tunnel must use these specific addresses. But Cloudflare has thousands of servers, each of which may need to deliver packets to the customer through a tunnel. So how many Cloudflare IP addresses (and GRE tunnels) does the customer need to talk to? The answer: just one, thanks to the magic of anycast.</p><p>Cloudflare uses anycast IP addresses for our GRE tunnel endpoints, meaning that any server in any data center is capable of encapsulating and decapsulating packets for the same GRE tunnel. <i>How is this possible? Isn’t a tunnel a point-to-point link?</i> The GRE protocol itself is stateless—each packet is processed independently and without requiring any negotiation or coordination between tunnel endpoints. While the tunnel is technically bound to an <i>IP address</i> it need not be bound to a <i>specific device</i>. Any device that can strip off the outer headers and then route the inner packet can handle any GRE packet sent over the tunnel. Actually, in the context of anycast the term “tunnel” is misleading since it implies a link between two fixed points. With Cloudflare’s Anycast GRE, a single “tunnel” gives you a conduit to every server in every data center on Cloudflare’s global edge.</p>
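<p>The statelessness is easy to see in code. Below is an illustrative receive-side sketch (optional GRE fields are deliberately rejected rather than handled): decapsulation is a pure function of the packet, consulting no per-tunnel or per-flow state, so any server holding the anycast endpoint address can perform it, on any packet, in any order.</p>

```python
import struct

IPPROTO_GRE = 47  # IP protocol number for GRE

def gre_decapsulate(outer_packet: bytes) -> bytes:
    """Strip the outer IPv4 header and the 4-byte GRE header, returning the
    inner IP packet. No state is kept between packets."""
    ihl = (outer_packet[0] & 0x0F) * 4  # outer IPv4 header length in bytes
    if outer_packet[9] != IPPROTO_GRE:
        raise ValueError("not a GRE packet")
    flags_ver, _proto = struct.unpack("!HH", outer_packet[ihl:ihl + 4])
    if flags_ver != 0:
        raise ValueError("optional GRE fields not handled in this sketch")
    return outer_packet[ihl + 4:]
```
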
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1422XZlV6HM7wgqKUYrBVx/4c7ddd86b938510b08dcdc8d43f9953f/228C64E9-DBE0-45AE-ACA9-8F1931998D4A.png" />
            
            </figure><p>One very powerful consequence of Anycast GRE is that it eliminates single points of failure. Traditionally, GRE-over-Internet can be problematic because an Internet outage between the two GRE endpoints fully breaks the “tunnel”. This means reliable data delivery requires going through the headache of setting up and maintaining redundant GRE tunnels terminating at different physical sites and rerouting traffic when one of the tunnels breaks. But because Cloudflare is encapsulating and delivering customer traffic from every server in every data center, there is no single “tunnel” to break. This means Magic Transit customers can enjoy the redundancy and reliability of terminating tunnels at multiple physical sites while only setting up and maintaining a single GRE endpoint, making their jobs simpler.</p>
    <div>
      <h2>Our scale is now your scale</h2>
      <a href="#our-scale-is-now-your-scale">
        
      </a>
    </div>
    <p>Magic Transit is a powerful new way to deploy network functions at scale. We’re not just giving you a virtual instance, we’re giving you a <i>global virtual edge</i>. Magic Transit takes the hardware appliances you would typically rack in your on-prem network and distributes them across every server in every data center in Cloudflare’s network. This gives you access to our global anycast network, our fleet of servers capable of running <i>your</i> tasks, and our engineering expertise building fast, reliable, secure networks. Our scale is now your scale.</p> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Network]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">7vElLpZoHQk7fk7D087Jvq</guid>
            <dc:creator>Nick Wondra</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Partners: A New Program with New Partners]]></title>
            <link>https://blog.cloudflare.com/cloudflare-partners-a-new-program-with-new-partners/</link>
            <pubDate>Thu, 06 Jun 2019 13:01:00 GMT</pubDate>
            <description><![CDATA[ Many overlook a critical portion of the language in Cloudflare’s mission: “to help build a better Internet.” From the beginning, we knew a mission this bold couldn’t be done alone. ]]></description>
            <content:encoded><![CDATA[ <p>Many overlook a critical portion of the language in Cloudflare’s mission: “<b>to help</b> build a better Internet.” From the beginning, we knew a mission this bold, an undertaking of this magnitude, couldn’t be done alone. We could only help. To ultimately build a better Internet, it would take a diverse and engaged ecosystem of technologies, customers, partners, and end-users. Fortunately, we’ve been able to work with amazing partners as we’ve grown, and we are eager to announce new, specific programs to grow our ecosystem with an increasingly diverse set of partners.</p><p>Today, we’re excited to announce the latest iteration of our partnership program for solutions partners. These categories encompass resellers and referral partners, OEM partners, and the new partner services program. Over the past few years, we’ve grown and learned from some <a href="https://www.cloudflare.com/integrations/ibm-cloud-internet-services/">amazing</a> partnerships, and want to bring those best practices to our latest partners at scale—to help them grow their business with Cloudflare’s global network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3mnK5fS6EtWHJqBeHT86iP/da404070d710d022cbd48db2fb247dd3/image-17.png" />
            
            </figure><p>Cloudflare Partner Tiers</p>
    <div>
      <h3>Partner Program for Solution Partners</h3>
      <a href="#partner-program-for-solution-partners">
        
      </a>
    </div>
    <p>Every partner program out there has tiers, and Cloudflare’s program is no exception. However, our tiering was built to help our partners ramp up, accelerate, and move fast. As Matt Harrell <a href="/helping-cloudflare-build-its-partnerships-worldwide/">highlighted</a>, we want the process to be as seamless as possible, to help partners find the level of engagement that works best for them, with world-class enablement paths and best-in-class revenue share models—built-in from the beginning.</p>
    <div>
      <h4>World-Class Enablement</h4>
      <a href="#world-class-enablement">
        
      </a>
    </div>
    <p>Cloudflare offers complimentary training and enablement to all partners. From <a href="https://www.cloudflare.com/cloudflare-partners-self-serve-program-open-beta/">self-serve paths</a> to partner-focused webinars, instructor-led courses, and certifications—we want to ensure our partners can develop Cloudflare and product expertise, making them as effective as possible when utilizing our massive global network.</p>
    <div>
      <h4>Driving Business Value</h4>
      <a href="#driving-business-value">
        
      </a>
    </div>
    <p>We want our partners to grow and succeed. From self-serve resellers to our most custom integrations, we want to make it frictionless to build a profitable business on Cloudflare. From our tenant API system to dedicated account teams, we’re ready to help you go to market with solutions that help end customers. This includes opportunities to co-market, develop target accounts, and partner directly to succeed with strategic accounts.</p><p>Cloudflare recognizes that, in order to help build a better Internet, we need successful partners—and our latest program is built to help partners build and grow profitable business models around Cloudflare’s solutions.</p>
    <div>
      <h3>Partner Services Program - SIs, MSSPs, PSOs</h3>
      <a href="#partner-services-program-sis-mssps-mssps-psos">
        
      </a>
    </div>
    <p>For the first time, we are expanding our program to also include service providers that want to develop and grow profitable services practices around Cloudflare’s solutions.</p><p>Our customers face some of the most complex challenges on the Internet. From those challenges, we’ve already seen some amazing opportunities for service providers to create value, grow their business, and make an impact. From customers migrating to the public, hybrid or multi-cloud for the first time, to entirely re-writing applications using Cloudflare Workers®, the need for expertise has increased across our entire customer base. In early pilots, we’ve seen four major categories in building a successful service practice around Cloudflare:</p><ul><li><p><b>Network Digital Transformations -</b> Help customers migrate and <a href="https://www.cloudflare.com/learning/network-layer/network-modernization/">modernize</a> their network solution. Cloudflare is the only cloud-native network to give Enterprise-grade control and visibility into on-prem, hybrid, and multi-cloud architectures.</p></li><li><p><b>Serverless Architecture Development -</b> Provide serverless application development services, thought leadership, and technical consulting around leveraging Cloudflare Workers and Apps.</p></li><li><p><b>Managed Security &amp; Insights -</b> Enable CISO and IT leaders to obtain a single pane of glass for reporting and policy management across all Internet-facing applications with <a href="https://www.cloudflare.com/application-services/solutions/">Cloudflare’s Security solutions</a>.</p></li><li><p><b>Managed Performance &amp; Reliability -</b> Keep customers’ high-availability applications running quickly and efficiently with Cloudflare’s Global Load Balancing, Smart Routing, and Anycast DNS, alongside performance consulting, traffic analysis, and application monitoring.</p></li></ul><p>As we expand this program, we are looking for audacious service providers and system integrators that want to help us build a better Internet for our mutual customers. Cloudflare can be an essential lynchpin to simplify and accelerate digital transformations. We imagine a future where massive applications run within 10ms of 90% of the global population, and where a single-pane solution provides security, performance, and reliability for mission-critical applications—running across multiple clouds. Cloudflare needs help from amazing global integrators and service partners to help realize this future.</p><p>If you are interested in learning more about becoming a service partner and growing your business with Cloudflare, please reach out to <a>partner-services@cloudflare.com</a> or explore <a href="https://www.cloudflare.com/partners/services/">cloudflare.com/partners/services</a>.</p>
    <div>
      <h3>Just Getting Started</h3>
      <a href="#just-getting-started">
        
      </a>
    </div>
    <p><a href="https://en.wikipedia.org/wiki/Metcalfe%27s_law">Metcalfe’s law</a> states that the value of a network is proportional to the square of the number of nodes within it. And within the global Cloudflare network, we want as many partner nodes as possible—from agencies to systems integrators, managed security providers, Enterprise resellers, and OEMs. A diverse ecosystem of partners is essential to our mission of helping to build a better Internet, together. We are dedicated to the success of our partners, and we will continue to iterate and develop our programs to make sure our partners can grow and develop on Cloudflare’s global network. Our commitment moving forward is that Cloudflare will be the easiest and most valuable solution for channel partners to sell and support globally.</p><hr /><p>More Information:</p><ul><li><p><a href="http://cloudflare.com/partners">Partner Program Website</a></p></li><li><p><a href="http://cloudflare.com/partners/services">Partner Services Website</a></p></li><li><p><a href="/announcing-the-new-cloudflare-partner-platform/">Announcing the new Cloudflare Partner Platform</a></p></li><li><p>A New Program with New Partners (you are here)</p></li><li><p><a href="/helping-cloudflare-build-its-partnerships-worldwide/">Building Partnerships Worldwide</a></p></li></ul>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Anycast]]></category>
            <guid isPermaLink="false">6KYYY2yQKW31ddjNhJcFJz</guid>
            <dc:creator>Dan Hollinger</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare architecture and how BPF eats the world]]></title>
            <link>https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/</link>
            <pubDate>Sat, 18 May 2019 15:00:00 GMT</pubDate>
            <description><![CDATA[ Recently at Netdev 0x13 I gave a short talk titled "Linux at Cloudflare". The talk ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.

Here is a transcript of a slightly adjusted version of that talk. ]]></description>
            <content:encoded><![CDATA[ <p>Recently at <a href="https://www.netdevconf.org/0x13/schedule.html">Netdev 0x13</a>, the Conference on Linux Networking in Prague, I gave <a href="https://netdevconf.org/0x13/session.html?panel-industry-perspectives">a short talk titled "Linux at Cloudflare"</a>. The <a href="https://speakerdeck.com/majek04/linux-at-cloudflare">talk</a> ended up being mostly about BPF. It seems, no matter the question - BPF is the answer.</p><p>Here is a transcript of a slightly adjusted version of that talk.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1jvISZgDDA9AnXaBAJzYys/13b124e1f305c9ab4594db66f9123789/01_edge-network-locations-100.jpg" />
            
            </figure><p>At Cloudflare we run Linux on our servers. We operate two categories of data centers: large "Core" data centers, processing logs, analyzing attacks, computing analytics, and the "Edge" server fleet, delivering customer content from 180 locations across the world.</p><p>In this talk, we will focus on the "Edge" servers. It's here where we use the newest Linux features, optimize for performance and care deeply about DoS resilience.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2L3Wbz94VfdflLpsTuIp9D/8b11fa614a4cc1134e4d4df466464425/image-9.png" />
            
            </figure><p>Our edge service is special due to our network configuration - we are extensively using anycast routing. Anycast means that the same set of IP addresses is announced by all our data centers.</p><p>This design has great advantages. First, it guarantees the optimal speed for end users. No matter where you are located, you will always reach the closest data center. Then, anycast helps us to spread out DoS traffic. During attacks each of the locations receives a small fraction of the total traffic, making it easier to ingest and filter out unwanted traffic.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4xiYwbjJRCpt76iDITKowY/9800bb2637bb3f601ccbc7c6eedb6b36/03_edge-network-uniform-software-100-1.jpg" />
            
            </figure><p>Anycast allows us to keep the networking setup uniform across all edge data centers. We applied the same design inside our data centers - our software stack is uniform across the edge servers. All software pieces are running on all the servers.</p><p>In principle, every machine can handle every task - and we run many diverse and demanding tasks. We have a full HTTP stack, the magical <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a>, two sets of DNS servers - authoritative and resolver, and many other publicly facing applications like <a href="https://www.cloudflare.com/application-services/products/cloudflare-spectrum/">Spectrum</a> and <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/">Warp</a>.</p><p>Even though every server has all the software running, requests typically cross many machines on their journey through the stack. For example, an HTTP request might be handled by a different machine during each of the 5 stages of the processing.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/268deIgT4XLrgXfhRXxfx1/a5ced85738359d15ee16f9238ac759d8/image-23.png" />
            
            </figure><p>Let me walk you through the early stages of inbound packet processing:</p><p>(1) First, the packets hit our router. The router does ECMP, and forwards packets onto our Linux servers. We use ECMP to spread each target IP across many, at least 16, machines. This is used as a rudimentary load balancing technique.</p><p>(2) On the servers we ingest packets with XDP eBPF. In XDP we perform two stages. First, we run volumetric <a href="https://www.cloudflare.com/learning/ddos/ddos-mitigation/">DoS mitigations</a>, dropping packets belonging to very large layer 3 attacks.</p><p>(3) Then, still in XDP, we perform layer 4 <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">load balancing</a>. All the non-attack packets are redirected across the machines. This is used to work around the ECMP problems, gives us fine-granularity load balancing and allows us to gracefully take servers out of service.</p><p>(4) Following the redirection the packets reach a designated machine. At this point they are ingested by the normal Linux networking stack, go through the usual iptables firewall, and are dispatched to an appropriate network socket.</p><p>(5) Finally packets are received by an application. For example HTTP connections are handled by a "protocol" server, responsible for performing TLS encryption and processing HTTP, HTTP/2 and QUIC protocols.</p><p>It's in these early phases of request processing where we use the coolest new Linux features. We can group useful modern functionalities into three categories:</p><ul><li><p>DoS handling</p></li><li><p>Load balancing</p></li><li><p>Socket dispatch</p></li></ul><hr />
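<p>Both the router’s ECMP step and the XDP load balancer rely on the same basic idea: hash a packet’s flow 5-tuple so that every packet of a connection deterministically lands on the same machine. A toy Python illustration of that property (the server addresses and hash choice here are invented, not Cloudflare’s actual scheme):</p>

```python
import hashlib

SERVERS = [f"10.0.0.{i}" for i in range(16)]  # hypothetical 16-machine ECMP group

def pick_server(src_ip: str, src_port: int,
                dst_ip: str, dst_port: int, proto: int) -> str:
    """Hash the flow 5-tuple to one server. Every packet of a given flow
    yields the same hash, so the whole connection stays on one machine."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % len(SERVERS)
    return SERVERS[bucket]
```
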
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3IlhL5q8CplwN5H1BY0ZZo/7f73d7d4c662113fac18404bcfac96c4/image-25.png" />
            
            </figure><p>Let's discuss DoS handling in more detail. As mentioned earlier, the first step after ECMP routing is Linux's XDP stack where, among other things, we run DoS mitigations.</p><p>Historically our mitigations for volumetric attacks were expressed in classic BPF and iptables-style grammar. Recently we adapted them to execute in the XDP eBPF context, which turned out to be surprisingly hard. Read on about our adventures:</p><ul><li><p><a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">L4Drop: XDP DDoS Mitigations</a></p></li><li><p><a href="/xdpcap/">xdpcap: XDP Packet Capture</a></p></li><li><p><a href="https://netdevconf.org/0x13/session.html?talk-XDP-based-DDoS-mitigation">XDP based DoS mitigation</a> talk by Arthur Fabre</p></li><li><p><a href="https://netdevconf.org/2.1/papers/Gilberto_Bertin_XDP_in_practice.pdf">XDP in practice: integrating XDP into our DDoS mitigation pipeline</a> (PDF)</p></li></ul><p>During this project we encountered a number of eBPF/XDP limitations. One of them was the lack of concurrency primitives. It was very hard to implement things like race-free token buckets. Later we found that <a href="http://vger.kernel.org/lpc-bpf2018.html#session-9">Facebook engineer Julia Kartseva</a> had the same issues. In February this problem was addressed with the introduction of the <code>bpf_spin_lock</code> helper.</p><hr />
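<p>To see why the missing concurrency primitive mattered: a token bucket must refill and spend tokens as one atomic step, otherwise two CPUs can both observe the last token and both consume it. A userspace Python sketch of the idea, with a plain lock standing in for the role <code>bpf_spin_lock</code> plays inside an eBPF program sharing a map entry across CPUs:</p>

```python
import threading
import time

class TokenBucket:
    """Race-free token bucket: refill and spend happen under one lock."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens added per second
        self.burst = burst        # maximum bucket size
        self.tokens = burst
        self.stamp = time.monotonic()
        self.lock = threading.Lock()

    def allow(self, cost: float = 1.0) -> bool:
        with self.lock:  # refill + spend must be a single atomic step
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.stamp) * self.rate)
            self.stamp = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False
```
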
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LAXf4Ayxh2sRR3hIsbBja/f8be1df5b19b10954b6a4ad092a177f7/image-26.png" />
            
            </figure><p>While our modern volumetric DoS defenses are done in the XDP layer, we still rely on <code>iptables</code> for application-layer (layer 7) mitigations. Here, higher-level firewall features are useful: connlimit, hashlimits and ipsets. We also use the <code>xt_bpf</code> iptables module to run cBPF in iptables to match on packet payloads. We talked about this in the past:</p><ul><li><p><a href="https://speakerdeck.com/majek04/lessons-from-defending-the-indefensible">Lessons from defending the indefensible</a> (PPT)</p></li><li><p><a href="/introducing-the-bpf-tools/">Introducing the BPF tools</a></p></li></ul><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MhIUEVsoocz1OSd8FKeYZ/a9e4321a070cd2d49800e978eb50b2d7/image-34.png" />
            
            </figure><p>After XDP and iptables, we have one final kernel-side DoS defense layer.</p><p>Consider a situation when our UDP mitigations fail. In such a case we might be left with a flood of packets hitting our application UDP socket. This might overflow the socket, causing packet loss. This is problematic - both good and bad packets will be dropped indiscriminately. For applications like DNS it's catastrophic. In the past, to reduce the harm, we ran one UDP socket per IP address. An unmitigated flood was bad, but at least it didn't affect the traffic to other server IP addresses.</p><p>Nowadays that architecture is no longer suitable. We are running more than 30,000 DNS IP's and running that number of UDP sockets is not optimal. Our modern solution is to run a single UDP socket with a complex eBPF socket filter on it - using the <code>SO_ATTACH_BPF</code> socket option. We talked about running eBPF on network sockets in past blog posts:</p><ul><li><p><a href="/epbf_sockets_hop_distance/">eBPF, Sockets, Hop Distance and manually writing eBPF assembly</a></p></li><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>That eBPF filter rate-limits the packets. It keeps its state - packet counts - in an eBPF map. We can be sure that a single flooded IP won't affect other traffic. This works well, though during work on this project we found a rather worrying bug in the eBPF verifier:</p><ul><li><p><a href="/ebpf-cant-count/">eBPF can't count?!</a></p></li></ul><p>I guess running eBPF on a UDP socket is not a common thing to do.</p><hr />
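<p>The shape of that socket filter can be sketched in userspace. This is only an analogy, not eBPF code: a dictionary stands in for the eBPF map, keyed by destination IP, so a flood against one address exhausts only that address's budget.</p>

```python
from collections import defaultdict

class PerIPRateLimiter:
    """Userspace analogy of the eBPF socket-filter idea: per-destination-IP
    packet counts live in a map, so one flooded IP cannot starve the rest."""

    def __init__(self, max_packets_per_window: int):
        self.limit = max_packets_per_window
        self.counts = defaultdict(int)  # stands in for the eBPF map

    def accept(self, dst_ip: str) -> bool:
        """Count the packet; accept it only while this IP is under budget."""
        self.counts[dst_ip] += 1
        return self.counts[dst_ip] <= self.limit

    def reset_window(self) -> None:
        self.counts.clear()  # a real filter decays counters over time
```
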
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79bzCglYGH8Vb2hH6R9j39/39aa41a6e11b5cb0dc7f640cd7d7aa72/image-27.png" />
            
            </figure><p>Apart from the DoS, in XDP we also run a layer 4 load balancer layer. This is a new project, and we haven't talked much about it yet. Without getting into many details: in certain situations we need to perform a socket lookup from XDP.</p><p>The problem is relatively simple - our code needs to look up the "socket" kernel structure for a 5-tuple extracted from a packet. This is generally easy - there is a <code>bpf_sk_lookup</code> helper available for this. Unsurprisingly, there were some complications. One problem was the inability to verify if a received ACK packet was a valid part of a three-way handshake when SYN-cookies are enabled. My colleague Lorenz Bauer is working on adding support for this corner case.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6j7ypFSgorN0lWBNY6Nc2m/b0ca29a08e3a1623a5f8c06c213f8d97/image-28.png" />
            
            </figure><p>After DoS and the load balancing layers, the packets are passed onto the usual Linux TCP / UDP stack. Here we do a socket dispatch - for example packets going to port 53 are passed onto a socket belonging to our DNS server.</p><p>We do our best to use vanilla Linux features, but things get complex when you use thousands of IP addresses on the servers.</p><p>Convincing Linux to route packets correctly is relatively easy with <a href="/how-we-built-spectrum">the "AnyIP" trick</a>. Ensuring packets are dispatched to the right application is another matter. Unfortunately, standard Linux socket dispatch logic is not flexible enough for our needs. For popular ports like TCP/80 we want to share the port between multiple applications, each handling it on a different IP range. Linux doesn't support this out of the box. You can call <code>bind()</code> either on a specific IP address or all IP's (with 0.0.0.0).</p><hr />
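<p>The limitation is visible from any program that creates listening sockets. This sketch shows the two options vanilla Linux gives you; there is no call in between that claims an IP prefix, which is exactly the gap described next.</p>

```python
import socket

# Option 1: bind to one specific local IP address.
specific = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
specific.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
specific_addr = specific.getsockname()

# Option 2: bind to all local IP addresses with the wildcard 0.0.0.0.
wildcard = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
wildcard.bind(("0.0.0.0", 0))
wildcard_addr = wildcard.getsockname()

# There is no standard middle ground: nothing like
# sock.bind(("192.0.2.0/24", 80)) to claim a whole prefix on one port.
specific.close()
wildcard.close()
```
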
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ERTm7jhXbKKfdo3CVmV81/2c8b5986539832c73971269a6bcf0964/image-29.png" />
            
            </figure><p>In order to fix this, we developed a custom kernel patch which adds <a href="http://patchwork.ozlabs.org/patch/602916/">a <code>SO_BINDTOPREFIX</code> socket option</a>. As the name suggests - it allows us to call <code>bind()</code> on a selected IP prefix. This solves the problem of multiple applications sharing popular ports like 53 or 80.</p><p>Then we run into another problem. For our Spectrum product we need to listen on all 65535 ports. Running so many listen sockets is not a good idea (see <a href="/revenge-listening-sockets/">our old war story blog</a>), so we had to find another way. After some experiments we learned to utilize an obscure iptables module - TPROXY - for this purpose. Read about it here:</p><ul><li><p><a href="/how-we-built-spectrum/">Abusing Linux's firewall: the hack that allowed us to build Spectrum</a></p></li></ul><p>This setup is working, but we don't like the extra firewall rules. We are working on solving this problem correctly - actually extending the socket dispatch logic. You guessed it - we want to extend socket dispatch logic by utilizing eBPF. Expect some patches from us.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5eEog6Kt512U7uVmD5DVt8/d8216cc24e674f5f86b1d385bf93dfc5/image-32.png" />
            
            </figure><p>Then there is a way to use eBPF to improve applications. Recently we got excited about doing TCP splicing with SOCKMAP:</p><ul><li><p><a href="/sockmap-tcp-splicing-of-the-future/">SOCKMAP - TCP splicing of the future</a></p></li></ul><p>This technique has great potential for improving tail latency across many pieces of our software stack. The current SOCKMAP implementation is not quite ready for prime time yet, but the potential is vast.</p><p>Similarly, the new <a href="https://netdevconf.org/2.2/papers/brakmo-tcpbpf-talk.pdf">TCP-BPF aka BPF_SOCK_OPS</a> hooks provide a great way of inspecting performance parameters of TCP flows. This functionality is super useful for our performance team.</p><hr />
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1MC8LLNDHLZ4wDYLcbbCHQ/3491b09f9b5eaf796457eb06781f8b98/12_prometheus-ebpf_exporter-100.jpg" />
            
            </figure><p>Some Linux features didn't age well and we need to work around them. For example, we are hitting limitations of networking metrics. Don't get me wrong - the networking metrics are awesome, but sadly they are not granular enough. Things like <code>TcpExtListenDrops</code> and <code>TcpExtListenOverflows</code> are reported as global counters, while we need them on a per-application basis.</p><p>Our solution is to use eBPF probes to extract the numbers directly from the kernel. My colleague Ivan Babrou wrote a Prometheus metrics exporter called "ebpf_exporter" to facilitate this. Read on:</p><ul><li><p><a href="/introducing-ebpf_exporter/">Introducing ebpf_exporter</a></p></li><li><p><a href="https://github.com/cloudflare/ebpf_exporter">https://github.com/cloudflare/ebpf_exporter</a></p></li></ul><p>With "ebpf_exporter" we can generate all manner of detailed metrics. It is very powerful and has saved us on many occasions.</p><hr />
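<p>The payoff of that approach is metrics with labels instead of one global counter. A small sketch of the Prometheus text exposition format such an exporter emits (the metric and label names here are invented for illustration, not ebpf_exporter's actual output):</p>

```python
def render_prometheus(metric: str, help_text: str, samples: dict) -> str:
    """Render per-application counter samples in the Prometheus text format."""
    lines = [f"# HELP {metric} {help_text}",
             f"# TYPE {metric} counter"]
    for app, value in sorted(samples.items()):
        lines.append(f'{metric}{{application="{app}"}} {value}')
    return "\n".join(lines) + "\n"

# One global TcpExtListenDrops becomes one labeled sample per application:
text = render_prometheus("listen_drops_total",
                         "Listen queue drops, broken out per application.",
                         {"dns": 3, "http": 0})
```
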
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1obSUWhzzUQxg8cmDqQ0Lr/15e6419a78802c4d19fc4d21bf161738/image-33.png" />
            
            </figure><p>In this talk we discussed 6 layers of BPFs running on our edge servers:</p><ul><li><p>Volumetric DoS mitigations are running on XDP eBPF</p></li><li><p>Iptables <code>xt_bpf</code> cBPF for application-layer attacks</p></li><li><p><code>SO_ATTACH_BPF</code> for rate limits on UDP sockets</p></li><li><p>Load balancer, running on XDP</p></li><li><p>eBPFs running application helpers like SOCKMAP for TCP socket splicing, and TCP-BPF for TCP measurements</p></li><li><p>"ebpf_exporter" for granular metrics</p></li></ul><p>And we're just getting started! Soon we will be doing more with eBPF-based socket dispatch, eBPF running on the <a href="https://linux.die.net/man/8/tc">Linux TC (Traffic Control)</a> layer and more integration with cgroup eBPF hooks. Then, our SRE team is maintaining an ever-growing list of <a href="https://github.com/iovisor/bcc">BCC scripts</a> useful for debugging.</p><p>It feels like Linux stopped developing new API's and all the new features are implemented as eBPF hooks and helpers. This is fine and it has strong advantages. It's easier and safer to upgrade an eBPF program than having to recompile a kernel module. Some things like TCP-BPF, exposing high-volume performance tracing data, would probably be impossible without eBPF.</p><p>Some say "software is eating the world"; I would say: "BPF is eating the software".</p> ]]></content:encoded>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[TCP]]></category>
            <guid isPermaLink="false">5EXsVZKcFTNLXVbrgHx73a</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Fixing an old hack - why we are bumping the IPv6 MTU]]></title>
            <link>https://blog.cloudflare.com/increasing-ipv6-mtu/</link>
            <pubDate>Mon, 10 Sep 2018 09:21:25 GMT</pubDate>
            <description><![CDATA[ Back in 2015 we deployed ECMP routing - Equal Cost Multi Path - within our datacenters. This technology allowed us to spread traffic heading to a single IP address across multiple physical servers. ]]></description>
            <content:encoded><![CDATA[ <p>Back in 2015 we deployed <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing">ECMP routing</a> - Equal Cost Multi Path - within our datacenters. This technology allowed us to spread traffic heading to a single IP address across multiple physical servers.</p><p>You can think about it as a third layer of load balancing.</p><ul><li><p>First we split the traffic across multiple IP addresses with DNS.</p></li><li><p>Then we split the traffic across <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">multiple datacenters with Anycast</a>.</p></li><li><p>Finally, we split the traffic across multiple servers with ECMP.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6brB6jYvBwnIHL5UyGZZpq/dbc2f52a7ba7056bebc36eb90ec98c55/5799659266_28038df72f_b.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/superlatif/5799659266/">photo</a> by <a href="https://www.flickr.com/photos/superlatif/">Sahra</a> by-sa/2.0</p><p>When deploying ECMP we hit a problem with Path MTU discovery. The ICMP packets destined to our Anycast IP's were being dropped. You can read more about that (and the solution) in the 2015 blog post <a href="/path-mtu-discovery-in-practice/">Path MTU Discovery in practice</a>.</p><p>To solve the problem we created a small piece of software, called <code>pmtud</code> (<a href="https://github.com/cloudflare/pmtud">https://github.com/cloudflare/pmtud</a>). Since deploying <code>pmtud</code>, our ECMP setup has been working smoothly.</p>
    <div>
      <h3>Hardcoding IPv6 MTU</h3>
      <a href="#hardcoding-ipv6-mtu">
        
      </a>
    </div>
    <p>During that initial ECMP rollout things were broken. To keep services running until <code>pmtud</code> was done, we deployed a quick hack. We reduced the MTU of IPv6 traffic to the minimal possible value: 1280 bytes.</p><p>This was done as a tag on a default route. This is how our routing table used to look:</p>
            <pre><code>$ ip -6 route show
...
default via 2400:xxxx::1 dev eth0 src 2400:xxxx:2  metric 1024  mtu 1280</code></pre>
            <p>Notice the <code>mtu 1280</code> in the default route.</p><p>With this setting our servers never transmitted IPv6 packets larger than 1280 bytes, therefore "fixing" the issue. Since every IPv6 link must support an MTU of at least 1280 bytes, we could expect that no ICMP Packet-Too-Big message would ever be sent to us.</p>
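<p>For illustration, a route tag like the one in the table above can be applied with iproute2 along these lines (a hedged sketch, not our exact deployment; the addresses are elided as in the routing table, and the command requires root):</p>

```shell
# Clamp the IPv6 MTU by tagging the default route with a smaller value.
# The kernel will then never send IPv6 packets larger than 1280 bytes.
ip -6 route change default via 2400:xxxx::1 dev eth0 mtu 1280
```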
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WoQ4atXDbAlJ1mhDnXDcf/0e2df496f1184bf35217d2be59f255b8/ECMP-hashing-ICMP.png" />
            
            </figure><p>Remember - the original problem introduced by ECMP was that ICMP messages routed back to our Anycast addresses could go to the wrong machine within the ECMP group. Therefore we became ICMP black holes: Cloudflare would send large packets, they would be dropped, and the ICMP PTB packet flying back to us would in turn fail to be delivered to the right machine due to ECMP.</p><p>But why did this problem not appear for IPv4 traffic? We believe the same issue exists on IPv4, but it's less damaging due to the different nature of the network. IPv4 is more mature and the great majority of end-hosts support either MTU 1500 or have their MSS option well configured - or clamped by some middle box. This is different in IPv6, where a large proportion of users use tunnels, have a Path MTU strictly smaller than 1500 and use incorrect MSS settings in the TCP header. Finally, Linux implements <a href="https://tools.ietf.org/html/rfc4821">RFC4821</a> for IPv4 but not IPv6. RFC4821 (PLPMTUD) has its disadvantages, but does help slightly to alleviate the ICMP blackhole issue.</p><p>Our "fix" - reducing the MTU to 1280 - was serving us well and we had no pressing reason to revert it.</p><p>Researchers did notice though. We were caught red-handed twice:</p><ul><li><p><a href="https://blog.apnic.net/2016/11/15/scoring-dns-root-server-system/">In 2017 Geoff Huston noticed (pdf)</a> that we sent DNS fragments of 1280 only (<a href="https://blog.apnic.net/2016/11/15/scoring-dns-root-server-system/">older blog post</a>).</p></li><li><p>In June 2018 the paper <a href="http://tma.ifip.org/2018/wp-content/uploads/sites/3/2018/06/tma2018_paper57.pdf">"Exploring usable Path MTU in the Internet" (pdf)</a> mentioned our weird setting - where we can accept 1500 bytes just fine, but transmit is limited to 1280.</p></li></ul>
    <div>
      <h3>When small MTU is too small</h3>
      <a href="#when-small-mtu-is-too-small">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yo93AUS4hONJUarTUlGP/1c9fd02139519ea6f947b61a39da9d8b/6545737741_077583ca1f_b-1.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/nh53/6545737741/">photo</a> by <a href="https://www.flickr.com/photos/nh53">NH53</a> by/2.0</p><p>This changed recently, when we started working on <a href="/spectrum/">Cloudflare Spectrum</a> support for UDP. Spectrum is a terminating proxy, able to handle protocols other than HTTP. Getting Spectrum to <a href="/how-we-built-spectrum/">forward TCP was relatively straightforward</a> (barring couple of <a href="/mmproxy-creative-way-of-preserving-client-ips-in-spectrum/">awesome hacks</a>). UDP is different.</p><p>One of the major issues we hit was related to the MTU on our servers.</p><p>During tests we wanted to forward UDP VPN packets through Spectrum. As you can imagine, any VPN would encapsulate a packet in another packet. Spectrum received packets like this:</p>
            <pre><code>+---------------------+------------------------------------------------+
|  IPv6 + UDP header  |  UDP payload encapsulating a 1280 byte packet  |
+---------------------+------------------------------------------------+</code></pre>
            <p>It's pretty obvious that our edge servers supporting IPv6 packets of max 1280 bytes won't be able to handle this type of traffic. We are going to need at least 1280+40+8 bytes of MTU! Hardcoding MTU=1280 in IPv6 may be an acceptable solution if you are an end-node on the internet, but it is definitely too small when forwarding tunneled traffic.</p>
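<p>The encapsulation arithmetic above can be sketched in a few lines (a minimal illustration using fixed header sizes only, ignoring IPv6 extension headers):</p>

```python
IPV6_HEADER = 40  # fixed IPv6 header size in bytes
UDP_HEADER = 8    # UDP header size in bytes

def min_edge_mtu(inner_packet_size):
    """Smallest edge-server MTU able to forward an inner packet of the
    given size once it is encapsulated in an outer IPv6 + UDP packet."""
    return inner_packet_size + IPV6_HEADER + UDP_HEADER

# A VPN tunneling a minimum-MTU (1280 byte) IPv6 packet needs 1328 bytes,
# which is strictly more than a hardcoded 1280 byte edge MTU allows:
assert min_edge_mtu(1280) == 1328
```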
    <div>
      <h3>Picking a new MTU</h3>
      <a href="#picking-a-new-mtu">
        
      </a>
    </div>
    <p>But what MTU value should we use instead? Let's see what other major internet companies do. Here are a couple of examples of advertised MSS values in TCP SYN+ACK packets over IPv6:</p>
            <pre><code>+---- site -----+- MSS --+- estimated MTU -+
| google.com    |  1360  |    1420         |
+---------------+--------+-----------------+
| facebook.com  |  1410  |    1470         |
+---------------+--------+-----------------+
| wikipedia.org |  1440  |    1500         |
+---------------+--------+-----------------+</code></pre>
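<p>The estimated MTU column follows from simple header arithmetic. A minimal sketch, assuming a 20 byte base TCP header with no options (as the table above does):</p>

```python
IPV4_TCP_OVERHEAD = 20 + 20  # IPv4 header + base TCP header, bytes
IPV6_TCP_OVERHEAD = 40 + 20  # IPv6 header + base TCP header, bytes

def estimate_mtu(mss, ipv6=True):
    """Estimate the path MTU implied by an advertised TCP MSS."""
    return mss + (IPV6_TCP_OVERHEAD if ipv6 else IPV4_TCP_OVERHEAD)

assert estimate_mtu(1360) == 1420  # google.com row
assert estimate_mtu(1410) == 1470  # facebook.com row
assert estimate_mtu(1440) == 1500  # wikipedia.org row
```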
            <p>I believe Google and Facebook adjust their MTU due to their use of L4 load balancers. Their implementations do IP-in-IP encapsulation, so they need a bit of space for the header. Read more:</p><ul><li><p>Google - <a href="https://ai.google/research/pubs/pub44824">Maglev</a></p></li><li><p>Facebook - <a href="https://code.fb.com/open-source/open-sourcing-katran-a-scalable-network-load-balancer/">Katran</a></p></li></ul><p>There may be other reasons for having a smaller MTU. A reduced value may decrease the probability of the Path MTU detection algorithm kicking in (i.e. relying on ICMP PTB). We can theorize that for the misconfigured eyeballs:</p><ul><li><p>MTU=1280 will never run Path MTU detection.</p></li><li><p>MTU=1500 will always run it.</p></li><li><p>In-between values would have varying chances of hitting the problem.</p></li></ul><p>But just what is the chance of that?</p><p>A quick unscientific study of the MSS values we encountered from eyeballs shows the following distributions. For connections going over IPv4:</p>
            <pre><code>IPv4 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cumulative
  1300 |                                                 *  1.28%   98.53%
  1360 |                                              ****  4.40%   95.68%
  1370 |                                                 *  1.15%   91.05%
  1380 |                                               ***  3.35%   89.81%
  1400 |                                          ********  7.95%   84.79%
  1410 |                                                 *  1.17%   76.66%
  1412 |                                              ****  4.58%   75.49%
  1440 |                                            ******  6.14%   65.71%
  1452 |                                      ************ 11.50%   58.94%
  1460 |************************************************** 47.09%   47.34%</code></pre>
            <p>Assuming the majority of clients have MSS configured right, we can say that 89.8% of connections advertised MTU=1380+40=1420 or higher. 75% had MTU &gt;= 1452.</p><p>For IPv6 connections we saw:</p>
            <pre><code>IPv6 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cumulative
  1220 |                                               ***  4.21%   99.96%
  1340 |                                                **  3.11%   93.23%
  1362 |                                                 *  1.31%   87.70%
  1368 |                                               ***  3.38%   86.36%
  1370 |                                               ***  4.24%   82.98%
  1380 |                                               ***  3.52%   78.65%
  1390 |                                                 *  2.11%   75.10%
  1400 |                                               ***  3.89%   72.25%
  1412 |                                               ***  3.64%   68.21%
  1420 |                                                 *  2.02%   64.54%
  1440 |************************************************** 54.31%   54.34%</code></pre>
            <p>On IPv6 87.7% connections had MTU &gt;= 1422 (1362+60). 75% had MTU &gt;= 1450. (See also: <a href="https://blog.apnic.net/2016/05/19/fragmenting-ipv6/">MTU distribution of DNS  servers</a>).</p>
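<p>The cumulative column can be approximated from the per-bucket percentages. A sketch using only the IPv6 buckets shown in the chart above; the result necessarily falls short of the blog's cumulative figure, because the chart omits the smaller buckets:</p>

```python
def share_with_mss_at_least(mss_counts, min_mss):
    """Sum the per-bucket percentages for every advertised MSS value
    at or above min_mss. mss_counts maps MSS -> percent of connections."""
    return sum(pct for mss, pct in mss_counts.items() if mss >= min_mss)

# Buckets copied from the IPv6 histogram above (percent of connections):
ipv6_mss = {1220: 4.21, 1340: 3.11, 1362: 1.31, 1368: 3.38, 1370: 4.24,
            1380: 3.52, 1390: 2.11, 1400: 3.89, 1412: 3.64, 1420: 2.02,
            1440: 54.31}

# The largest bucket alone: 54.31% of connections advertise MSS >= 1440.
assert share_with_mss_at_least(ipv6_mss, 1440) == 54.31
# Listed buckets with MSS >= 1362 sum to ~78%; the full data set
# (including buckets omitted from the chart) reaches the 87.7% quoted.
assert share_with_mss_at_least(ipv6_mss, 1362) > 78
```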
    <div>
      <h3>Interpretation</h3>
      <a href="#interpretation">
        
      </a>
    </div>
    <p>Before we move on it's worth reiterating the original problem. Each connection from an eyeball to our Anycast network has three numbers related to it:</p><ul><li><p>Client advertised MTU - seen in the MSS option in the TCP header</p></li><li><p>True Path MTU value - generally unknown until measured</p></li><li><p>Our Edge server MTU - the value we are trying to optimize in this exercise</p></li></ul><p>(This is a slight simplification; paths on the internet aren't symmetric, so the path from eyeball to Cloudflare could have a different Path MTU than the reverse path.)</p><p>In order for the connection to misbehave, three conditions must be met:</p><ul><li><p>Client advertised MTU must be "wrong", that is: larger than True Path MTU</p></li><li><p>Our edge server must be willing to send such large packets: Edge server MTU &gt;= True Path MTU</p></li><li><p>The ICMP PTB messages must fail to be delivered to our edge server - preventing Path MTU detection from working.</p></li></ul><p>The last condition could occur for one of these reasons:</p><ul><li><p>the routers on the path are misbehaving and perhaps firewalling ICMP</p></li><li><p>due to the asymmetric nature of the internet, the returning ICMP is routed to the wrong Anycast datacenter</p></li><li><p>something is wrong on our side, for example the <code>pmtud</code> process fails</p></li></ul><p>In the past we limited our Edge Server MTU value to the smallest possible, to make sure we never encountered the problem. Due to the development of Spectrum UDP support we must increase the Edge Server MTU, while still minimizing the probability of the issue happening.</p><p>Finally, relying on ICMP PTB messages for a large fraction of traffic is a bad idea. It's easy to imagine the cost this induces: even with Path MTU detection working fine, the affected connection will suffer a hiccup. A couple of large packets will be dropped before the reverse ICMP gets through and reconfigures the saved Path MTU value. This is not optimal for latency.</p>
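<p>The three conditions above combine into a simple predicate. A hypothetical helper (the function name and example values are invented) for reasoning about the edge-MTU trade-off:</p>

```python
def connection_at_risk(client_mtu, true_path_mtu, edge_mtu, ptb_delivered):
    """True only when all three failure conditions hold: the client
    advertises more than the path can carry, our edge is willing to
    send packets that exceed the path MTU, and the ICMP PTB reply
    never reaches the right edge server."""
    return (client_mtu > true_path_mtu      # client MSS/MTU is "wrong"
            and edge_mtu > true_path_mtu    # edge sends oversized packets
            and not ptb_delivered)          # PMTUD cannot kick in

# Clamping the edge MTU to 1280 removes the risk for this path entirely:
assert not connection_at_risk(1500, 1400, 1280, ptb_delivered=False)
# With edge MTU 1500, the same misconfigured path becomes a black hole:
assert connection_at_risk(1500, 1400, 1500, ptb_delivered=False)
# If the ICMP PTB gets through, PMTUD recovers (at a latency cost):
assert not connection_at_risk(1500, 1400, 1500, ptb_delivered=True)
```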
    <div>
      <h3>Progress</h3>
      <a href="#progress">
        
      </a>
    </div>
    <p>In recent days we increased the IPv6 MTU. As part of the process we could have chosen 1300, 1350, or 1400. We chose 1400 because we think it's the next best value to use after 1280. With 1400 we believe 93.2% of IPv6 connections will not need to rely on Path MTU Detection/ICMP. In the near future we plan to increase this value further. We won't settle on 1500 though - we want to leave a couple of bytes for IPv4 encapsulation, to allow the most popular tunnels to keep working without suffering poor latency when Path MTU Detection kicks in.</p><p>Since the rollout we've been monitoring <code>Icmp6InPktTooBigs</code> counters:</p>
            <pre><code>$ nstat -az | grep Icmp6InPktTooBigs
Icmp6InPktTooBigs               738748             0.0</code></pre>
            <p>Here is a chart of the ICMP PTB packets we received over the last 7 days. You can clearly see that when the rollout started, we saw a large increase in PTB ICMP messages (Y label - packet count - deliberately obfuscated):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64aFjrP5OwJLKlbMfwIofg/abc284d8d6f8b2b0c7441277cfea12cc/Screen-Shot-2018-09-06-at-2.12.28-PM.png" />
            
            </figure><p>Interestingly the majority of the ICMP packets are concentrated in our Frankfurt datacenter:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7rwp7vqya4hZUVtqgALCo0/70c29ca328859bd7330a9399305d7d79/Screen-Shot-2018-09-06-at-2.11.53-PM.png" />
            
            </figure><p>We estimate that in our Frankfurt datacenter, we receive an ICMP PTB message on 2 out of every 100 IPv6 TCP connections. These seem to come from only a handful of ASNs:</p><ul><li><p>AS6830 - Liberty Global Operations B.V.</p></li><li><p>AS20825 - Unitymedia NRW GmbH</p></li><li><p>AS31334 - Vodafone Kabel Deutschland GmbH</p></li><li><p>AS29562 - Kabel BW GmbH</p></li></ul><p>These networks send us ICMP PTB messages, usually informing us that their MTU is 1280. For example:</p>
            <pre><code>$ sudo tcpdump -tvvvni eth0 icmp6 and ip6[40+0]==2 
IP6 2a02:908:xxx &gt; 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2a02:810d:xx &gt; 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2001:ac8:xxx &gt; 2400:xxx ICMP6, packet too big, mtu 1390, length 1240</code></pre>
            
    <div>
      <h3>Final thoughts</h3>
      <a href="#final-thoughts">
        
      </a>
    </div>
    <p>Finally, if you are an IPv6 user with a weird MTU and a misconfigured MSS - basically if you are doing tunneling - please let us know of any issues. We know that debugging MTU issues is notoriously hard. To aid that we created <a href="/ip-fragmentation-is-broken/">an online fragmentation and ICMP delivery test</a>. You can run it:</p><ul><li><p>IPv6 version: <a href="http://icmpcheckv6.popcount.org">http://icmpcheckv6.popcount.org</a></p></li><li><p>(for completeness, we also have an <a href="http://icmpcheck.popcount.org">IPv4 version</a>)</p></li></ul><p>If you are a server operator running IPv6 applications, you should not worry. In most cases leaving the MTU at the default 1500 is a good choice and should work for the majority of connections. Just remember to allow ICMP PTB packets on the firewall and you should be good. If you serve a variety of IPv6 users and need to optimize latency, you may consider choosing a slightly smaller MTU for outbound packets, to reduce the risk of relying on Path MTU Detection / ICMP.</p><p><i>Does low level network tuning sound interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland.</i></p> ]]></content:encoded>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <guid isPermaLink="false">4jilwfKi5MVWxnIA3ErwZ8</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Delivering Dot]]></title>
            <link>https://blog.cloudflare.com/f-root/</link>
            <pubDate>Sun, 10 Sep 2017 17:04:26 GMT</pubDate>
            <description><![CDATA[ Since March 30, 2017, Cloudflare has been providing DNS Anycast service as additional F-Root instances under contract with ISC (the F-Root operator).


F-Root is a single IPv4 address plus a single IPv6 address which both ISC and Cloudflare announce to the global Internet as a shared Anycast. This document reviews how F-Root has performed since that date in March 2017.


The DNS root servers are an important utility provided to all clients on the Internet for free - all F root instances includin ]]></description>
            <content:encoded><![CDATA[ <p>Since March 30, 2017, Cloudflare has been providing DNS Anycast service as additional F-Root instances under contract with ISC (the F-Root operator).</p><p>F-Root is a single IPv4 address plus a single IPv6 address which both ISC and Cloudflare announce to the global Internet as a shared Anycast. This document reviews how F-Root has performed since that date in March 2017.</p><p>The DNS root servers are an important utility provided to all clients on the Internet for free - all F-Root instances, including those hosted on the Cloudflare network, are a free service provided by both ISC and Cloudflare for public benefit. Because every online request begins with a DNS lookup, and every DNS lookup requires the retrieval of information stored on the DNS root servers, the DNS root servers play an invaluable role in the functioning of the internet.</p><p>At Cloudflare, we were excited to work with ISC to bring greater security, speed and new software diversity to the root server system. First, the root servers, because of their crucial role, are often the subject of large scale volumetric DDoS attacks, which Cloudflare specializes in mitigating (Cloudflare is currently mitigating two concurrent DDoS attacks as we write this). Second, with a distributed network of data centers in well over 100 global cities, Cloudflare DNS is close to the end client, which reduces <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">round trip times</a>. And lastly, the F-Root nodes hosted by Cloudflare also run Cloudflare’s in-house DNS software, written in Go, which brings new code diversity to the root server system.</p><p>Throughout the deployment, ISC and Cloudflare paid close attention to telemetry measurements to ensure a positive impact on the global DNS and root server system. Here is what both organizations observed when transit was enabled to Cloudflare DNS servers for F-Root.</p><p>Using <a href="https://atlas.ripe.net/">RIPE atlas probe measurements</a>, we can see an immediate performance benefit to the F-Root server, from 8.24 median RTT to 4.24 median RTT.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/13L4zY61yMlB1DhhOhsUZk/f0ad9663fd5784f54d843cf7f0f5dfd1/Screen-Shot-2017-03-30-at-5.03.20-PM.png" />
            
            </figure><p>F-Root actually became one of the fastest performing root servers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/RcvHB8xBIzPqC6ImemPj2/170f12f49f51779083c0fa174d26efb6/Screen-Shot-2017-03-30-at-4.54.03-PM.png" />
            
            </figure><p>The biggest performance improvement was in the 90th percentile, that is, the 10% of queries that received the slowest replies. The graph below shows the 90th percentile response time for any given RIPE atlas probe. Each probe is represented by two markers, a red X for before Cloudflare enabled transit and a blue X for after Cloudflare began announcing. You can see a drop in 90th percentile response times: the blue X’s are much lower than the red X’s.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3PtFnGp2WgTaocUCbcGDsT/b4bf46ac1949b419db1c6902045e4cde/image2017-4-4-14-11-9.png" />
            
            </figure><p>One of the optimizations that DNS resolvers do is preferring the faster root servers. As F-Root picked up speed, DNS resolvers also started sending it more traffic. Here you can see the aggregate number of queries received by each root letter per day, with an increase to F starting on March 30th.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3g8Idpfyj3eTGjHg6oICIo/037be9c789d68d5c5d9c199618b239bf/root-servers---stacked.png" />
            
            </figure><p>One large public DNS resolver shared with us their internal metrics, where you can also see a large shift of traffic to F-Root as F-Root increased in speed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KpIbZqEsaMnJReKarZRs3/bd29090c946e5f274fe9121f104816a8/resolver-traffictocfroot.png" />
            
            </figure><p>When one external DNS monitor published their <a href="https://blog.thousandeyes.com/2017-update-comparing-root-server-performance-globally/">Root Server measurement report</a> in June 2017, they mentioned F-Root’s increased performance. “9 more countries than in 2015 observed F-Root as the fastest — this is the biggest change across all of the root servers, so in this sense F-Root is “most improved.” F-Root has increasingly become the fastest root server in significant portions of Asia Pacific, Latin America and Eastern Europe.” They noted that “F-Root is now the fastest for roughly one quarter of the countries we tested from.”</p><p>We are happy to be working with ISC on delivering answers for F-Root and aim in the process to improve the speed and security of the F-Root server.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Optimization]]></category>
            <guid isPermaLink="false">6HZGHulKO43flc4dFeD4pv</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Brief Primer on Anycast]]></title>
            <link>https://blog.cloudflare.com/a-brief-anycast-primer/</link>
            <pubDate>Fri, 21 Oct 2011 05:33:00 GMT</pubDate>
            <description><![CDATA[ I wrote a blog post the other day about CloudFlare's globally distributed DNS infrastructure and how each ninja name server we give you when you signup doesn't represent just one machine, but instead a whole cluster of machines in each of the data centers we operate worldwide.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ci0wdOhwrgo1Qzx9xPzqu/62694c7e3b853761ee960d83c5bdb020/unicast_anycast.png.scaled500.png" />
            
            </figure><p>I wrote a <a href="/robust-free-dns">blog post</a> the other day about CloudFlare's <a href="https://www.cloudflare.com/dns/">globally distributed DNS</a> infrastructure and how each ninja name server we give you when you sign up doesn't represent just one machine, but instead a whole cluster of machines in each of the <a href="https://www.cloudflare.com/network-map">data centers we operate</a> worldwide. Several people in the comments wondered how that was possible since each name server URL maps to an IP address. The answer is Anycast, but I wanted to take some more time to explain exactly what that means.</p>
    <div>
      <h3>Unicast: One Machine, One IP</h3>
      <a href="#unicast-one-machine-one-ip">
        
      </a>
    </div>
    <p>Most of the Internet works via a routing scheme called Unicast. Under Unicast, every node on the network gets an IP address which is unique to it. You may have connected to a wireless network and gotten a notice that your IP address is already in use; this happens when two computers on the same Unicast network try to use the same IP. In most cases, that isn't allowed.</p><p>Routers keep a map of the world's IP addresses and maintain a sense of the shortest path to get from one node to another. One router will hand a packet off to another router that is closer to the final destination until the packet finally arrives at the address it was sent to. You can see these handoffs if you run a traceroute (instructions for <a href="http://support.cloudflare.com/kb/troubleshooting/how-do-i-run-a-traceroute-with-a-mac">Mac</a> and <a href="http://support.cloudflare.com/kb/troubleshooting/how-do-i-run-a-traceroute-in-windows">Windows</a>) to any site. Take, for example, the popular arts, culture, and technology blog <a href="http://laughingsquid.com">laughingsquid.com</a>. If I do a traceroute to their origin server then I get something like this.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1V2JRJKYg26zvt2RBiEPbu/85b5a65602c3e6f0b42f08a89f0d1f53/traceroute_to_laughingsquid_origin.png.scaled500.png" />
            
            </figure><p>Each of those lines, known as a "hop," represents a router that the packets carrying my request are passing through. Often there are clues in the names of the routers as to where they are located. In this case, it looks like the origin is running on a Rackspace server somewhere in or around Dallas (line 13 and knowing that DFW is the airport code for the Dallas-Ft. Worth airport gives it away). No matter where in the world you do a traceroute from, you will eventually get to the same server running in Dallas because it is using the Unicast routing scheme.</p><p>Notice the three numbers on each line: those are samples of the time, in milliseconds, it takes for a packet to do a round-trip to the particular router. The last line is typically the most important because it is an estimate of the real-world network latency between you and the server. In this case for me, about 50 milliseconds (which is pretty good).</p>
    <div>
      <h3>Anycast: Many Machines, One IP</h3>
      <a href="#anycast-many-machines-one-ip">
        
      </a>
    </div>
    <p>While Unicast is the easiest way to run a network, it isn't the only way. At CloudFlare, we make extensive use of another routing scheme called Anycast. With Anycast, multiple machines can share the same IP address. When a request is sent to an Anycasted IP address, routers will direct it to the machine on the network that is closest.</p><p>Let's return to our friends at laughingsquid.com. They are CloudFlare users, so if you request their website you get back one or more CloudFlare IP addresses. Under normal circumstances, it doesn't matter where you are in the world, you will get the same IP addresses. However, there are machines in 12 data centers all listening on those IPs. As a result, the network itself will direct traffic to the CloudFlare data center nearest to where you are making the request from. Check out the following traceroute.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5jUsxHucbcsR6iIZ1hINZc/beb1118377918e341e74e77f21aa6717/traceroute_to_laughingsquid.png.scaled500.png" />
            
            </figure><p>I'm running this traceroute from Northern California. The nearest CloudFlare data center is in San Jose. And, if you look closely, hops #11 - #13 all show routers in "SJC" which is the airport code for San Jose. If you run this same traceroute to laughingsquid.com from somewhere else in the world, you'll see the requests hitting whatever data center is closest to you.</p>
    <div>
      <h3>Speed, Resiliency &amp; Attack Mitigation</h3>
      <a href="#speed-resiliency-attack-mitigation">
        
      </a>
    </div>
    <p>Anycast is not trivial to implement, but if you do, it has a number of key benefits. The first is speed. Note the latency on the last hop of about 9.5 milliseconds. Since CloudFlare answers typically 75% of requests from its edge without having to hit the origin, we're able to save significant network latency. We also pick core IBX data centers through which your packet would have likely traveled anyway, which means even when we have to go back to the origin there's little if any additional latency introduced.</p><p>In addition to speed, using Anycast means the network can be extremely resilient. Because traffic will find the best path, we can take an entire data center offline and traffic will automatically flow to the next closest data center.</p><p>The last benefit of Anycast is that it can help with attack mitigation. In most DDoS attacks, many compromised "zombie" computers are used to form what is known as a botnet. These machines can be scattered around the web and generate so much traffic that they can overwhelm a typical Unicast connected machine. The nature of CloudFlare's Anycasted network is that we inherently increase the surface area to absorb such an attack. A distributed botnet will have a portion of its denial of service traffic absorbed by each of our data centers. As a result, as we continue to grow our network and expand into more data centers, it will get harder and harder to launch an effective DDoS against any of our users.</p><p>It is not easy to set up a true Anycasted network. It requires that you own your own hardware, build direct relationships with your upstream carriers, and tune your networking routes to ensure traffic doesn't "flap" between multiple locations. We've taken the time to do that at CloudFlare because it helps ensure all our users have access to a faster, safer, better Internet.</p> ]]></content:encoded>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">5HzxULNJEhJhKXeZj9HZQa</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
    </channel>
</rss>