
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 06 Apr 2026 09:19:35 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Load Balancing Monitor Groups: Multi-Service Health Checks for Resilient Applications]]></title>
            <link>https://blog.cloudflare.com/load-balancing-monitor-groups-multi-service-health-checks-for-resilient/</link>
            <pubDate>Fri, 17 Oct 2025 06:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Load Balancing now supports Monitor Groups, allowing you to combine multiple health monitors into a single, logical assessment.  ]]></description>
            <content:encoded><![CDATA[ <p>Modern applications are not monoliths. They are complex, distributed systems where availability depends on multiple independent components working in harmony. A web server might be running, but if its connection to the database is down or the authentication service is unresponsive, the application as a whole is unhealthy. Relying on a single health check is like knowing the “check engine” light is not on, but not knowing that one of your tires has a puncture. It’s great that your engine is running, but you won’t be driving far.</p><p>As applications grow in complexity, so does the definition of "healthy." We've heard from customers, big and small, that they need to validate multiple services to consider an endpoint ready to receive traffic. For example, they may need to confirm that an underlying <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api-gateway/"><u>API gateway</u></a> is healthy and that a specific ‘/login’ service is responsive before routing users there. Until now, this required building custom, synthetic services to aggregate these checks, adding operational overhead and another potential point of failure.</p><p>Today, we are introducing Monitor Groups for <a href="https://www.cloudflare.com/application-services/products/load-balancing/"><u>Cloudflare Load Balancing</u></a>. This feature provides a new way to create sophisticated, multi-service health assessments directly on our platform. With Monitor Groups, you can bundle multiple health monitors into a single logical entity, define which components are critical, and use an aggregated health score to make more intelligent and resilient failover decisions.</p><p>This new capability, available via the API for our Enterprise customers, removes the need for custom health aggregation services and provides a far more accurate picture of your application’s true availability.
In the near future, this feature will be available in the Dashboard for all Load Balancing users, <a href="https://blog.cloudflare.com/enterprise-grade-features-for-all/"><u>not just Enterprise</u></a>!</p>
    <div>
      <h2><b>How Monitor Groups Work</b></h2>
      <a href="#how-monitor-groups-work">
        
      </a>
    </div>
    <p>Monitor Groups function as a superset of monitors. Once you have created your monitors, they can be bundled into a single unit – the Monitor Group! When you attach a Monitor Group to an endpoint pool, the health of each endpoint in that pool is determined by aggregating the results of all enabled monitors within the group. The per-member settings, defined within the ‘members’ array of a monitor group, give you granular control over how the collective health is determined.</p>
            <pre><code>// Structure of a monitor group with two member monitors
{
  "description": "Test Monitor Group",
  "members": [
    {
      "monitor_id": "string",
      "enabled": true,
      "monitoring_only": false,
      "must_be_healthy": true
    },
    {
      "monitor_id": "string",
      "enabled": true,
      "monitoring_only": false,
      "must_be_healthy": true
    }
  ]
}</code></pre>
            <p>Here’s what each property does:</p><ul><li><p><b>Critical Monitors (must_be_healthy):</b> You can designate a monitor as critical. If a monitor with this setting fails its health check against an endpoint, that endpoint is immediately marked as unhealthy. This provides a definitive override for essential services, regardless of the status of other monitors in the group.</p></li><li><p><b>Observational Probes (monitoring_only):</b> Mark a monitor as "monitoring only" to receive alerts and data without it affecting a pool's health status or traffic steering. This is perfect for testing new checks or observing non-critical dependencies without impacting production traffic.</p></li><li><p><b>Quorum-Based Health:</b> In the absence of a failure from a critical monitor, an endpoint's health is determined by a quorum of all other active monitors. An endpoint is considered globally unhealthy only if more than 50% of its assigned monitors report it as unhealthy. This system prevents an endpoint from being prematurely marked as unhealthy due to a transient failure from a single, non-critical monitor.</p></li></ul><p>You can add up to five monitors to a group.</p>
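<p>To make these rules concrete, here is a minimal Python sketch of how a single data center could evaluate an endpoint against a monitor group. The member fields mirror the API payload above; the aggregation logic is an illustrative reconstruction of the documented behavior, not Cloudflare’s implementation.</p>

```python
def endpoint_healthy(members, results):
    """members: dicts with monitor_id / enabled / monitoring_only / must_be_healthy.
    results: dict mapping monitor_id -> bool (True means the probe passed)."""
    # Only enabled, non-observational monitors influence health.
    active = [m for m in members if m["enabled"] and not m["monitoring_only"]]
    # A failing critical monitor marks the endpoint unhealthy outright.
    if any(not results[m["monitor_id"]] for m in active if m["must_be_healthy"]):
        return False
    # Otherwise, unhealthy only if more than 50% of active monitors fail.
    failures = sum(not results[m["monitor_id"]] for m in active)
    return failures <= len(active) / 2

members = [
    {"monitor_id": "http", "enabled": True, "monitoring_only": False, "must_be_healthy": True},
    {"monitor_id": "tcp",  "enabled": True, "monitoring_only": False, "must_be_healthy": False},
    {"monitor_id": "db",   "enabled": True, "monitoring_only": False, "must_be_healthy": False},
]
print(endpoint_healthy(members, {"http": True, "tcp": False, "db": True}))   # True: one non-critical failure
print(endpoint_healthy(members, {"http": False, "tcp": True, "db": True}))   # False: critical monitor failed
```

<p>Note how the single non-critical failure in the first case does not flip the endpoint to unhealthy, while the critical failure in the second case overrides everything else.</p>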
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hvlMcDKMyuWsoULCvmqVE/294be4478ce1b0e4c64b1ac0d37b1039/image2.png" />
            
            </figure><p><sup>A diagram showing three health monitors (HTTP, TCP, and Database) combined into a single Monitor Group. The group is attached to a Cloudflare Load Balancing pool, which assesses the health of three origin servers.</sup></p>
    <div>
      <h2><b>A Globally Distributed Perspective</b></h2>
      <a href="#a-globally-distributed-perspective">
        
      </a>
    </div>
    <p>The power of Monitor Groups is amplified by the scale of Cloudflare’s global network. Health checks aren't performed from a handful of static locations; they can be configured to execute from data centers in over <a href="https://www.cloudflare.com/network/"><u>300 cities across the globe</u></a>. While you can configure monitoring from every data center simultaneously ('All Datacenters' mode), we recommend a more targeted approach for most applications. Choosing a few diverse <a href="https://developers.cloudflare.com/load-balancing/reference/region-mapping-api/"><u>regions</u></a>, like Western North America and Eastern Europe, or using the 'All Regions' setting provides a robust, global perspective on your application's health while reducing the volume of health monitoring traffic sent to your origins. This creates a distributed consensus on application health, preventing a localized network issue from triggering a false positive and causing an unnecessary global failover. Your application’s health is determined not by a single perspective, but by a global one. </p><p>This same principle elevates Dynamic Steering when used in conjunction with Monitor Groups. The latency for a Monitor Group isn't just a single RTT measurement. It's a holistic performance score, averaged from, potentially, hundreds of points of presence, across all the critical services you’ve defined. This means your load balancer steers traffic based on a true, globally-aware understanding of your application’s performance.</p><p>For load balancers using Dynamic Steering and a Monitor Group, the latency used to make steering decisions is now calculated as the average <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/"><u>Round Trip Time (RTT) </u></a>of all active, non-monitoring-only members in the group. This provides a more stable and representative performance metric. 
Rather than relying on the latency of a single service, Dynamic Steering can now make decisions based on the collective performance of all critical components, ensuring traffic is sent to the endpoint that is truly the most performant overall.</p>
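<p>In pseudocode terms, the steering latency for a group reduces to a mean over the active, non-monitoring-only members. A hypothetical sketch (the member structure and RTT field names are assumptions for illustration):</p>

```python
def group_rtt_ms(members):
    """Average RTT of active, non-observational members of a monitor group."""
    rtts = [m["rtt_ms"] for m in members
            if m["enabled"] and not m["monitoring_only"]]
    return sum(rtts) / len(rtts)

members = [
    {"monitor_id": "http", "enabled": True, "monitoring_only": False, "rtt_ms": 40.0},
    {"monitor_id": "db",   "enabled": True, "monitoring_only": False, "rtt_ms": 60.0},
    {"monitor_id": "beta", "enabled": True, "monitoring_only": True,  "rtt_ms": 500.0},
]
print(group_rtt_ms(members))  # 50.0 — the slow observational probe is excluded
```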
    <div>
      <h2><b>Health Aggregation in Action</b></h2>
      <a href="#health-aggregation-in-action">
        
      </a>
    </div>
    <p>Let's walk through an example to see how Cloudflare aggregates health signals from a Monitor Group to determine the overall health of a single endpoint. In this scenario, our application has three key components we need to check: a public-facing /health endpoint, another service running on a specific TCP port, and a database dependency. Privacy and security are paramount, so, to monitor the database without exposing it to the public Internet, you would connect it to Cloudflare using a <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"><u>Cloudflare Tunnel</u></a>, allowing our health checks to reach it securely.</p>
    <div>
      <h4><b>Setup</b></h4>
      <a href="#setup">
        
      </a>
    </div>
    <ul><li><p><b>Health Monitors in the Group:</b></p><ul><li><p>HTTP check for /health (must_be_healthy: true)</p></li><li><p>TCP check for Port 3000 connectivity (must_be_healthy: false)</p></li><li><p>DB check for database health (must_be_healthy: false)</p></li></ul></li><li><p><b>Health Check Regions:</b></p><ul><li><p>Western North America (3 data centers)</p></li><li><p>Eastern North America (3 data centers)</p></li></ul></li><li><p><b>Quorum Threshold:</b> An endpoint is considered healthy if more than 50% of checking data centers report it as UP.</p></li></ul><p>First, Cloudflare determines the health from the perspective of each individual data center. If the critical monitor fails, that data center’s result is definitively <b>DOWN</b>. Otherwise, the result is based on the majority status of the remaining monitors.</p><p>Here are the results from our six data centers:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1v7II5mrHi6D4RAVccukRY/a5bf378bbf92183fab4cc9ecda7b6be0/image1.png" />
          </figure><p><sup>A table showing health check results from six data centers across two regions. One of the six data centers reports a "DOWN" status because the critical HTTP monitor failed. The other five report "UP" because the critical monitor passed and a majority of the remaining monitors were healthy.</sup></p><p>Finally, the results from all six checking data centers are combined to determine the final, global health status for the endpoint.</p><ul><li><p><b>Global Result:</b> 5 out of the 6 total data centers (83%) report the endpoint as <b>UP</b>.</p></li><li><p><b>Conclusion:</b> Because 83% is greater than the 50% quorum threshold, the endpoint is considered globally healthy and will continue to receive traffic.</p></li></ul><p>This multi-layered quorum system provides incredible resilience, ensuring that failover decisions are based on a comprehensive and geographically distributed consensus.</p>
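<p>The final step reduces to a one-line quorum check. In this hypothetical Python sketch, the data center names are invented for illustration and the numbers mirror the worked example above:</p>

```python
def globally_healthy(datacenter_results):
    """Endpoint stays healthy while more than 50% of checking data centers report UP."""
    up = sum(datacenter_results.values())
    return up / len(datacenter_results) > 0.5

results = {"SJC": True, "LAX": True, "SEA": False,   # Western North America
           "IAD": True, "EWR": True, "ATL": True}    # Eastern North America
print(globally_healthy(results))  # True — 5/6 ≈ 83% > 50%
```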
    <div>
      <h2><b>Getting Started with Monitor Groups</b></h2>
      <a href="#getting-started-with-monitor-groups">
        
      </a>
    </div>
    <p>Monitor Groups are now available via the API for all customers with an Enterprise Cloudflare Load Balancing subscription and will be made available to self-serve customers in the near future. To get started with building more sophisticated health checks for your applications today, check out our <a href="https://developers.cloudflare.com/load-balancing/monitors/monitor-groups/"><u>developer documentation</u></a>. </p><p>To create a monitor group, you can use a POST request to the new <a href="https://developers.cloudflare.com/api/resources/load_balancers/subresources/monitor_groups/methods/create/"><u>/load_balancers/monitor_groups</u></a> endpoint.</p>
            <pre><code>POST accounts/{account_id}/load_balancers/monitor_groups
{
  "description": "Monitor group for checkout service",
  "members": [
    {
      "monitor_id": "string",
      "must_be_healthy": true,
      "enabled": true
    },
    {
      "monitor_id": "string",
      "monitoring_only": false,
      "enabled": true
    }
  ]
}</code></pre>
            <p>Once created, you can attach the group to a pool by referencing its ID in the monitor_group field of the pool object. </p>
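<p>For example, a pool update might look like the following sketch, which builds a PATCH request against the pools endpoint using only the Python standard library. Treat the exact request shape as an assumption to verify against the developer documentation:</p>

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"

def attach_monitor_group(account_id, pool_id, group_id, api_token):
    """Build (but do not send) a PATCH setting the pool's monitor_group field."""
    url = f"{API_BASE}/accounts/{account_id}/load_balancers/pools/{pool_id}"
    body = json.dumps({"monitor_group": group_id}).encode()
    return urllib.request.Request(url, data=body, method="PATCH", headers={
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    })

# Sending it would be: urllib.request.urlopen(attach_monitor_group(...))
req = attach_monitor_group("0123abcd", "pool-id", "group-id", "API_TOKEN")
print(req.get_method(), req.get_full_url())
```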
    <div>
      <h2><b>What’s Next</b></h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We are continuing to build a seamless platform experience that simplifies traffic management for both internal and external applications. Looking ahead, Monitor Groups will be making its way into the Dashboard for all users soon! We are also working on more flexible <a href="https://www.cloudflare.com/learning/access-management/role-based-access-control-rbac/"><u>role-based access controls</u></a> and even more advanced load-based load balancing capabilities to give you the granular control you need to manage your most complex applications.</p> ]]></content:encoded>
            <category><![CDATA[Load Balancing]]></category>
            <guid isPermaLink="false">1m5rd5Y9Pxi94nB6xz6g37</guid>
            <dc:creator>Noah Crouch</dc:creator>
            <dc:creator>Cole Bennett</dc:creator>
        </item>
        <item>
            <title><![CDATA[TURN and anycast: making peer connections work globally]]></title>
            <link>https://blog.cloudflare.com/webrtc-turn-using-anycast/</link>
            <pubDate>Wed, 25 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ TURN servers relay media and data between devices when direct P2P connections are blocked or fail. Cloudflare Calls' TURN server uses anycast to eliminate the need to think about regions or scaling. ]]></description>
            <content:encoded><![CDATA[ <p>A <a href="https://www.cloudflare.com/learning/video/turn-server/"><u>TURN server</u></a> helps maintain connections during video calls when local networking conditions prevent participants from connecting directly to other participants. It acts as an intermediary, passing data between users when their networks block direct communication. TURN servers ensure that peer-to-peer calls go smoothly, even in less-than-ideal network conditions.</p><p>When building their own TURN infrastructure, developers often have to answer a few critical questions:</p><ol><li><p>“How do we build and maintain a mesh network that achieves near-zero latency to all our users?”</p></li><li><p>“Where should we spin up our servers?”</p></li><li><p>“Can we auto-scale reliably to be cost-efficient without hurting performance?”
</p></li></ol><p>In April, we launched Cloudflare Calls TURN in <a href="https://blog.cloudflare.com/cloudflare-calls-anycast-webrtc/"><u>open beta</u></a> to help answer these questions. Starting today, <a href="https://developers.cloudflare.com/calls/turn/"><u>Cloudflare Calls’ TURN service</u></a> is now generally available to all Cloudflare accounts. Our TURN server works on our anycast network, which helps deliver global coverage and near-zero latency required by real time applications.</p>
    <div>
      <h2>TURN solves connectivity and privacy problems for real time apps</h2>
      <a href="#turn-solves-connectivity-and-privacy-problems-for-real-time-apps">
        
      </a>
    </div>
    <p>When Internet Protocol version 4 (IPv4, <a href="https://datatracker.ietf.org/doc/html/rfc791"><u>RFC 791</u></a>) was designed back in 1981, it was assumed that the 32-bit address space was big enough for all computers to be able to connect to each other. When IPv4 was created, billions of people didn’t have smartphones in their pockets and the idea of the Internet of Things didn’t exist yet. It didn’t take long for companies, ISPs, and even entire countries to realize they didn’t have enough IPv4 address space to meet their needs.</p>
    <div>
      <h3>NATs are unpredictable</h3>
      <a href="#nats-are-unpredictable">
        
      </a>
    </div>
    <p>Fortunately, you can have multiple devices share the same IP address because the most common protocols running on top of IP, TCP and UDP, both support up to 65,535 port numbers. (Think of port numbers on an IP address as extensions behind a single phone number.) To solve this problem of IP scarcity, network engineers developed a way to share a single IP address across multiple devices by exploiting the port numbers. This is called Network Address Translation (NAT) and it is a process through which your router knows which packets to send to your smartphone versus your laptop or other devices, all of which are connecting to the public Internet through the IP address assigned to the router.</p><p>In a typical NAT setup, when a device sends a packet to the Internet, the NAT assigns a random, unused port to track it, keeping a forwarding table to map the device to the port. This allows NAT to direct responses back to the correct device, even if the source IP address and port vary across different destinations. The system works as long as the internal device initiates the connection and waits for the response.</p><p>However, real-time apps like video or audio calls are more challenging with NAT. Since NATs don't reveal how they assign ports, devices can't pre-communicate where to send responses, making it difficult to establish reliable connections. Earlier solutions like STUN (<a href="https://datatracker.ietf.org/doc/html/rfc3489"><u>RFC 3489</u></a>) couldn't fully solve this, which gave rise to the TURN protocol.</p><p>TURN predictably relays traffic between devices while ensuring minimal delay, which is crucial for real-time communication where even a second of lag can disrupt the experience.</p>
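<p>The forwarding-table behavior described above can be sketched as a toy NAT in Python. Real NATs differ widely in how they allocate and reuse ports; the addresses and starting port here are invented for the demo:</p>

```python
class Nat:
    """A toy NAT: maps (internal IP, port) pairs to public ports and back."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 50000      # arbitrary starting port for the demo
        self.table = {}             # (internal_ip, internal_port) -> public_port
        self.reverse = {}           # public_port -> (internal_ip, internal_port)

    def outbound(self, internal_ip, internal_port):
        key = (internal_ip, internal_port)
        if key not in self.table:   # allocate a fresh public port on first use
            self.table[key] = self.next_port
            self.reverse[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.table[key])

    def inbound(self, public_port):
        # A reply only gets through if an internal device created the mapping.
        return self.reverse.get(public_port)

nat = Nat("198.51.100.7")
print(nat.outbound("192.168.0.10", 5004))   # ('198.51.100.7', 50000)
print(nat.inbound(50000))                   # ('192.168.0.10', 5004)
print(nat.inbound(50001))                   # None — unsolicited packet dropped
```

<p>The last line is exactly the property that breaks real-time apps: a packet arriving on a port no internal device has opened a mapping for has nowhere to go.</p>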
    <div>
      <h3>ICE to determine if a relay server is needed</h3>
      <a href="#ice-to-determine-if-a-relay-server-is-needed">
        
      </a>
    </div>
    <p>The <a href="https://datatracker.ietf.org/doc/html/rfc8445"><u>ICE (Interactive Connectivity Establishment) protocol</u></a> was designed to find the fastest communication path between devices. It works by testing multiple routes and choosing the one with the least delay. ICE determines whether a TURN server is needed to relay the connection when a direct peer-to-peer path cannot be established or is not performant enough.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/71IX3n5RLM24rwRhwrpM2E/1ef5ecbf98cc85a46e385f333d6cb90c/image3.png" />
            
            </figure><p><sup><i>How two peers (A and B) try to connect directly by sharing their public and local IP addresses using the ICE protocol. If the direct connection fails, both peers use the TURN server to relay their connection and communicate with each other.</i></sup></p><p>While ICE is designed to find the most efficient connection path between peers, it can inadvertently expose sensitive information, creating privacy concerns. During the ICE process, endpoints exchange a list of all possible network addresses, including local IP addresses, NAT IP addresses, and TURN server addresses. This comprehensive sharing of network details can reveal information about a user's network topology, potentially exposing their approximate geographic location or details about their local network setup.</p><p>The "brute force" nature of ICE, where it attempts connections on all possible paths, can create distinctive network traffic patterns that sophisticated observers might use to infer the use of specific applications or communication protocols. </p>
    <div>
      <h2>TURN solves privacy problems</h2>
      <a href="#turn-solves-privacy-problems">
        
      </a>
    </div>
    <p>The threat of exposing sensitive network information while using real-time applications is especially acute for people who rely on end-to-end encrypted messaging apps — for example, journalists who need to communicate with unknown sources without revealing their location.</p><p>With Cloudflare TURN in place, traffic is proxied through Cloudflare, preventing either party in the call from seeing client IP addresses or associated metadata. Cloudflare simply forwards the calls to their intended recipients, but never inspects the contents — the underlying call data is always end-to-end encrypted. This masking of network traffic is an added layer of privacy.</p><p>Cloudflare is a trusted third party when it comes to operating these types of services: we have experience operating privacy-preserving proxies at scale for our <a href="https://blog.cloudflare.com/1111-warp-better-vpn/"><u>Consumer WARP</u></a> product, <a href="https://blog.cloudflare.com/icloud-private-relay/"><u>Apple’s Private Relay</u></a>, and <a href="https://blog.cloudflare.com/cloudflare-now-powering-microsoft-edge-secure-network/"><u>Microsoft Edge’s Secure Network</u></a>, preserving end-user privacy without sacrificing performance.</p>
    <div>
      <h2>Cloudflare’s TURN is the fastest because of Anycast</h2>
      <a href="#cloudflares-turn-is-the-fastest-because-of-anycast">
        
      </a>
    </div>
    <p>Many real-time communication services run their own TURN servers on a commercial cloud provider because they don’t want to leave a certain percentage of their customers with non-working communication. This results in additional costs for DevOps, egress bandwidth, etc. And honestly, just deploying and running a TURN server, like <a href="https://github.com/coturn/coturn"><u>CoTURN</u></a>, in a VPS isn’t an interesting project for most engineers.</p><p>Because using a TURN relay adds extra delay for the packets to travel between the peers, the relays should be located as close as possible to the peers. Cloudflare’s TURN service avoids all these headaches by simply running in all of the <a href="https://www.cloudflare.com/network"><u>330 cities where Cloudflare has data centers</u></a>. And any time Cloudflare adds another city, the TURN service automatically becomes available there as well. </p>
    <div>
      <h3>Anycast is the perfect network topology for TURN</h3>
      <a href="#anycast-is-the-perfect-network-topology-for-turn">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/"><u>Anycast</u></a> is a network addressing and routing methodology in which a single IP address is shared by multiple servers in different locations. When a client sends a request to an anycast address, the network automatically routes the request via <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/"><u>BGP</u></a> to the topologically nearest server. This is in contrast to unicast, where each destination has a unique IP address. Anycast allows multiple servers to have the same IP address, and enables clients to automatically connect to a server close to them. This is similar to emergency phone networks (911, 112, etc.) which connect you to the closest emergency communications center in your area.</p><p>Anycast allows for lower latency because of the sheer number of locations available around the world. Approximately 95% of the Internet-connected population is within about 50 ms of a Cloudflare location. For real-time communication applications that use TURN, this proximity leads to improved call quality and user experience.</p>
    <div>
      <h3>Auto-scaling and inherently global</h3>
      <a href="#auto-scaling-and-inherently-global">
        
      </a>
    </div>
    <p>Running TURN over anycast allows for better scalability and global distribution. By naturally distributing load across multiple servers based on network topology, this setup helps balance traffic and improve performance. When you use Cloudflare’s TURN service, you don’t need to manage a list of servers for different parts of the world. And you don’t need to write custom scaling logic to scale VMs up or down based on your traffic.  </p><p>Anycast allows TURN to use fewer IP addresses, making it easier to allowlist in restrictive networks. Stateless protocols like DNS over UDP work well with anycast. This includes stateless STUN binding requests used to determine a system's external IP address behind a NAT.</p><p>However, stateful protocols over UDP, like QUIC or TURN, are more challenging with anycast. QUIC handles this better due to its stable connection ID, which load balancers can use to consistently route traffic. However, TURN/STUN lacks a similar connection ID. So when a TURN client sends requests to the Cloudflare TURN service, the <a href="https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/"><u>Unimog load balancer</u></a> ensures that all its requests get routed to the same server within a data center. The challenges for the communication between a client on the Internet and Cloudflare services listening on an anycast IP address have been described <a href="https://blog.cloudflare.com/tag/loadbalancing/"><u>multiple times before</u></a>.</p>
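<p>The idea of pinning a stateful flow to one server can be illustrated with a simple hash over the connection's 4-tuple. This is only a conceptual sketch; Unimog's actual mechanism is more sophisticated and the server names are invented:</p>

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c"]  # servers within one data center

def pick_server(src_ip, src_port, dst_ip, dst_port):
    """Deterministically map a flow's 4-tuple to one server."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return SERVERS[digest % len(SERVERS)]

# Every packet of the same TURN client flow lands on the same server:
first = pick_server("203.0.113.5", 40000, "198.51.100.1", 3478)
again = pick_server("203.0.113.5", 40000, "198.51.100.1", 3478)
print(first == again)  # True
```

<p>Because the mapping depends only on the packet headers, no per-connection state needs to be shared between load balancer instances to keep a flow on its server.</p>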
    <div>
      <h3>How does Cloudflare's TURN server receive packets?</h3>
      <a href="#how-does-cloudflares-turn-server-receive-packets">
        
      </a>
    </div>
    <p>TURN servers act as relay points to help connect clients. This process involves two types of connections: the client-server connection and the third-party connection (relayed address).</p><p>The client-server connection uses <a href="https://developers.cloudflare.com/calls/turn/#_top"><u>published</u></a> IP and port information to communicate with TURN clients using anycast.</p><p>For the relayed address, using anycast poses a challenge. The TURN protocol requires that packets reach the specific Cloudflare server handling the client connection. If we used anycast for relay addresses, packets might not arrive at the correct data center or server.
</p><p>One alternative is to use unicast addresses for relay candidates. However, this approach has drawbacks, including making servers vulnerable to attacks and requiring many IP addresses.</p><p>To solve these issues, we've developed a middle-ground solution, previously discussed in “<a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/"><u>Cloudflare servers don't own IPs anymore – so how do they connect to the Internet?</u></a>”. We use anycast addresses but add extra handling for packets that reach incorrect servers. If a packet arrives at the wrong Cloudflare location, we forward it over our backbone to the correct data center, rather than sending it back over the public Internet.</p><p>This approach not only resolves routing issues but also improves TURN connection speed. Packets meant for the relay address enter the Cloudflare network as close to the sender as possible, optimizing the routing process.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JxesKGit6hdK0NzaEduEk/5b248cbc27293b9dc1ccb1a5b5f7b615/image1.png" />
            
            </figure><p><sup><i>In this non-ideal setup, a TURN client connects to Cloudflare using Anycast, while a direct client uses Unicast, which would expose the TURN server to potential DDoS attacks.</i></sup></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/CI9Q5qMDC7xXcifH5l898/975e7f52416e0b737c250433aa68ee82/image2.png" />
            
            </figure><p><sup><i>The optimized setup uses Anycast for all TURN clients, allowing for dynamic load distribution across Cloudflare's globally distributed TURN servers.</i></sup></p>
    <div>
      <h2>Try Cloudflare Calls TURN today</h2>
      <a href="#try-cloudflare-calls-turn-today">
        
      </a>
    </div>
    <p>The new TURN feature of Cloudflare Calls addresses critical challenges in real-time communication:</p><ul><li><p><b>Connectivity</b>: By solving NAT traversal issues, TURN ensures reliable connections even in complex network environments.</p></li><li><p><b>Privacy</b>: Acting as an intermediary, TURN enhances user privacy by masking IP addresses and network details.</p></li><li><p><b>Performance</b>: Leveraging Cloudflare's global anycast network, our TURN service offers unparalleled speed and near-zero latency.</p></li><li><p><b>Scalability</b>: With presence in over 330 cities, Cloudflare Calls TURN grows with your needs.</p></li></ul><p>Cloudflare Calls TURN service is billed on a usage basis. It is available to self-serve and Enterprise customers alike. There is no cost for the first 1,000 GB (one terabyte) of Cloudflare Calls usage each month. It costs five cents per GB after your first terabyte of usage on self-serve. Volume pricing is available for Enterprise customers through your account team.</p><p>Switching TURN providers is likely as simple as changing a single configuration in your real-time app. To get started with Cloudflare’s TURN service, create a TURN app from your <a href="https://dash.cloudflare.com/?to=/:account/calls"><u>Cloudflare Calls Dashboard</u></a> or read the <a href="https://developers.cloudflare.com/calls/turn/"><u>Developer Docs</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Cloudflare Calls]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[WebRTC]]></category>
            <category><![CDATA[TURN]]></category>
            <guid isPermaLink="false">EkJICbovEPPuOSElg8poy</guid>
            <dc:creator>Nils Ohlmeier</dc:creator>
            <dc:creator>Renan Dincer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Eliminating hardware with Load Balancing and Cloudflare One]]></title>
            <link>https://blog.cloudflare.com/eliminating-hardware-with-load-balancing-and-cloudflare-one/</link>
            <pubDate>Tue, 16 Jul 2024 13:02:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare is adding support for end-to-end private traffic flows to our local traffic management (LTM) load balancing solution, and allowing for the replacement of hardware load balancers ]]></description>
            <content:encoded><![CDATA[ <p>In 2023, Cloudflare <a href="https://blog.cloudflare.com/elevate-load-balancing-with-private-ips-and-cloudflare-tunnels-a-secure-path-to-efficient-traffic-distribution/"><u>introduced a new load balancing solution</u></a> supporting Private Network Load Balancing. This year, we took it a step further by introducing support for <a href="https://blog.cloudflare.com/extending-local-traffic-management-load-balancing-to-layer-4-with-spectrum/"><u>layer 4 load balancing to private networks via Spectrum</u></a>. Now, organizations can seamlessly balance public HTTP(S), TCP, and UDP traffic to their <a href="https://www.cloudflare.com/developer-platform/solutions/hosting/">privately hosted applications</a>. Today, we’re thrilled to unveil our latest enhancement: support for end-to-end private traffic flows as well as WARP authenticated device traffic, eliminating the need for dedicated hardware load balancers! These groundbreaking features are powered by the enhanced integration of <a href="https://www.cloudflare.com/application-services/products/load-balancing/"><u>Cloudflare load balancing</u></a> with our Cloudflare One platform, and are available to our enterprise customers. With this upgrade, our customers can now utilize Cloudflare load balancers for both public and private traffic directed at private networks.</p>
    <div>
      <h3>Cloudflare Load Balancing today</h3>
      <a href="#cloudflare-load-balancing-today">
        
      </a>
    </div>
    <p>Before discussing the new features, let's review Cloudflare's existing load balancing support and the challenges customers face.</p><p>Cloudflare currently supports four main load balancing traffic flows:</p><ol><li><p>Internet-facing load balancers connecting to <b>publicly</b> accessible endpoints at layer 7, supporting HTTP(S).</p></li><li><p>Internet-facing load balancers connecting to <b>publicly</b> accessible endpoints at layer 4 (Spectrum), supporting TCP and UDP services</p></li><li><p>Internet-facing load balancers connecting to <b>private</b> endpoints at layer 7 HTTP(S) via Cloudflare Tunnels.</p></li><li><p>Internet-facing load balancers connecting to <b>private</b> endpoints at layer 4 (Spectrum), supporting TCP and UDP services via Cloudflare Tunnels.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/37XvcgIiO2eVu1DtYJMDae/8409b6ae682fe57f2f0c67bed2e35d7a/image3-10.png" />
            
            </figure><p>One of the biggest advantages of Cloudflare’s load balancing solutions is the elimination of hardware costs and maintenance. Unlike hardware-based load balancers, which are costly to purchase, license, operate, and upgrade, Cloudflare’s solution requires no hardware. There's no need to buy additional modules or new licenses, and you won't face end-of-life issues with equipment that necessitate costly replacements.</p><p>With Cloudflare, you can focus on innovation and growth. <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">Load balancers</a> are deployed in every Cloudflare data center across the globe, in over 300 cities, providing virtually unlimited scale and capacity. You never need to worry about bandwidth constraints, deployment locations, extra hardware modules, downtime, upgrades, or supply chain constraints. Cloudflare’s global <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">Anycast</a> network ensures that every customer connects to a nearby data center and load balancer, where policies, rules, and steering are applied efficiently. And now, the resilience, scale, and simplicity of Cloudflare load balancers can be integrated into your private networks! We have worked hard to ensure that Cloudflare load balancers are highly available and disaster ready, from the core to the edge – <a href="/major-data-center-power-failure-again-cloudflare-code-orange-tested/">even when datacenters lose power</a>.</p>
    <div>
      <h3>Keeping private resources private with Magic WAN</h3>
      <a href="#keeping-private-resources-private-with-magic-wan">
        
      </a>
    </div>
    <p>Before today's announcement, all of Cloudflare's load balancers operating at layer 4 have been connected to the public Internet. Customers have been able to secure the traffic flowing to their load balancers with WAF rules and Zero Trust policies, but some customers would prefer to keep certain resources private and under no circumstances exposed to the Internet. It’s been possible to isolate origin servers and endpoints this way, which can exist on private networks that are only accessible via <a href="https://www.cloudflare.com/products/tunnel/">Cloudflare Tunnels</a>. And as of today, we can offer a similar level of isolation to customers’ layer 4 load balancers.</p><p><a href="/elevate-load-balancing-with-private-ips-and-cloudflare-tunnels-a-secure-path-to-efficient-traffic-distribution/">In our previous blog post</a>, we discussed connecting these internal or private resources to the Cloudflare global network and how Cloudflare would soon introduce load balancers that are accessible via private IP addresses. Unlike other Cloudflare load balancers, these do not have an associated hostname. Rather, they are accessible via an <a href="https://datatracker.ietf.org/doc/html/rfc1918">RFC 1918</a> private IP address. In the land of load balancers, this is often referred to as a virtual IP (VIP). 
As of today, load balancers that are accessible at private IPs can now be used within a virtual network to isolate traffic to a certain set of Cloudflare tunnels, enabling customers to load balance traffic within their private network without exposing applications to the public Internet.</p><p>The question you might be asking is, “If I have a private IP load balancer and privately hosted applications, how do I or my users actually reach these now-private services?”</p><p><a href="https://www.cloudflare.com/network-services/products/magic-wan/">Cloudflare Magic WAN</a> can now be used as an on-ramp in tandem with Cloudflare load balancers that are accessible via an assigned private IP address. Magic WAN provides a secure and high-performance connection to internal resources, ensuring that traffic remains private and optimized across our global network. With Magic WAN, customers can connect their corporate networks directly to Cloudflare's global network with <a href="https://www.cloudflare.com/learning/network-layer/what-is-gre-tunneling/">GRE</a> or <a href="https://www.cloudflare.com/learning/network-layer/what-is-ipsec/">IPSec</a> tunnels, maintaining privacy and security while enjoying seamless connectivity. The Magic WAN Connector easily establishes connectivity to Cloudflare without the need to configure network gear, and it can be deployed at any physical or cloud location! With the enhancements to Cloudflare’s load balancing solution, customers can confidently keep their corporate applications resilient while maintaining the end-to-end privacy and security of their resources.</p><p>This enhancement opens up numerous use cases for internal load balancing, such as managing traffic between different data centers, efficiently routing traffic for internally hosted applications, optimizing resource allocation for critical applications, and ensuring high availability for internal services. 
Organizations can now replace traditional hardware-based load balancers, reducing complexity and lowering costs associated with maintaining physical infrastructure. By leveraging Cloudflare load balancing and Magic WAN, companies can achieve greater flexibility and scalability, adapting quickly to changing network demands without the need for additional hardware investments.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70wo9SnF4FzjaQJpqcddUQ/344b162093a4686c6bb86e4369ffff01/image2-6.png" />
            
            </figure><p>But what about latency? Load balancing is all about keeping your applications resilient and performant and Cloudflare was built with <a href="/recapping-speed-week-2023/">speed at its core</a>. There is a Cloudflare datacenter within 50ms of 95% of the Internet-connected population globally! Now, we support all Cloudflare One on-ramps to not only provide seamless and secure connectivity, but also to dramatically reduce latency compared to legacy solutions. Load balancing also works seamlessly with <a href="https://www.cloudflare.com/application-services/products/argo-smart-routing/">Argo Smart Routing</a> to intelligently route around network congestion to improve your application performance by up to 30%! Check out the blogs <a href="/magic-makes-your-network-faster/">here</a> and <a href="/the-zero-trust-platform-built-for-speed">here</a> to read more about how Cloudflare One can reduce application latency.</p>
    <div>
      <h3>Supporting distributed users with Cloudflare WARP</h3>
      <a href="#supporting-distributed-users-with-cloudflare-warp">
        
      </a>
    </div>
    <p>But what about when users are distributed and not connected to the local corporate network? <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-devices/warp/">Cloudflare WARP</a> can now be used as an on-ramp to reach Cloudflare load balancers that are configured with private IP addresses. The Cloudflare WARP client allows you to protect corporate devices by securely and privately sending traffic from those devices to Cloudflare’s global network, where Cloudflare Gateway can apply advanced web filtering. The WARP client also makes it possible to apply advanced <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> policies that check a device’s health before it connects to corporate applications.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5q6TyuYcWbbbFdPere5Ib/b14bb1820ee05ea4d89fb392879f8d90/image1-10.png" />
            
            </figure><p>In this load balancing use case, WARP pairs up perfectly with <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/">Cloudflare Tunnels</a> so that customers can place their private origins within virtual networks to help either isolate traffic or handle overlapping private IP addresses. Once these virtual networks are defined, administrators can configure WARP profiles to allow their users to connect to the proper virtual networks. Once connected, WARP takes the configuration of the virtual networks and installs routes on the end users’ devices. These routes will tell the end user’s device how to reach the Cloudflare load balancer that was created with a private, non-publicly routable IP address. The administrator could then create a <a href="https://www.cloudflare.com/learning/dns/dns-records/">DNS record</a> locally that would point to that private IP address. Once DNS resolves locally, the device would route all subsequent traffic over the WARP connection. This is all seamless to the user and occurs with minimal latency.</p>
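<p>As a concrete (and purely illustrative) sketch of the client-side flow just described, using a hypothetical hostname and addresses: once a locally configured DNS record resolves to an RFC 1918 address, the routes that WARP installed send the traffic over the WARP tunnel instead of the default gateway.</p>

```python
import ipaddress

# Hypothetical locally served DNS record pointing at a private load balancer VIP:
LOCAL_DNS = {"app.internal.example.com": "10.1.2.3"}

def next_hop(hostname: str) -> str:
    """After local DNS resolution, routes installed by WARP carry traffic
    for private (RFC 1918) destinations over the WARP tunnel."""
    ip = ipaddress.ip_address(LOCAL_DNS[hostname])
    return "warp-tunnel" if ip.is_private else "default-gateway"

assert next_hop("app.internal.example.com") == "warp-tunnel"
```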
    <div>
      <h3>How we connected load balancing to Cloudflare One</h3>
      <a href="#how-we-connected-load-balancing-to-cloudflare-one">
        
      </a>
    </div>
    <p>In contrast to public L4 or L7 load balancers, private L4 load balancers are not going to have publicly addressable hostnames or IP addresses, but we still need to be able to handle their traffic. To make this possible, we had to integrate existing load balancing services with private networking services created by our Cloudflare One team. To do this, upon creation of a private load balancer, we now assign a private IP address within the customer's virtual network. When traffic destined for a private load balancer enters Cloudflare, our private networking services make a request to load balancing to determine which endpoint to connect to. The information in the response from load balancing is used to connect directly to a privately hosted endpoint via a variety of secure traffic off-ramps. This differs significantly from our public load balancers where traffic is off-ramped to the public internet. In fact, we can now direct traffic from any on-ramp to any off-ramp! This allows for significant flexibility in architecture. For example, not only can we direct WARP traffic to an endpoint connected via GRE or IPSec, but we can also off-ramp this traffic to Cloudflare Tunnel, a CNI connection, or out to the public internet! Now, instead of purchasing a bespoke load balancing solution for each traffic type, like an application or network load balancer, you can configure a single load balancing solution to handle virtually any permutation of traffic that your business needs to run!</p>
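<p>The "any on-ramp to any off-ramp" idea can be pictured with a toy routing table. This is a hypothetical sketch, not Cloudflare's internal services or APIs; the VIPs, endpoint addresses, and off-ramp names are made up for illustration. The key point is that the load balancing decision returns both an endpoint and the off-ramp needed to reach it, independent of how the traffic arrived.</p>

```python
from dataclasses import dataclass

@dataclass
class LbDecision:
    endpoint: str  # where load balancing says the traffic should go
    offramp: str   # how to reach it: "tunnel", "gre", "ipsec", "cni", "public"

# Hypothetical stand-in for the internal load balancing lookup:
DECISIONS = {
    "10.1.2.3": LbDecision("10.8.0.5:443", "tunnel"),
    "10.1.2.4": LbDecision("172.16.9.9:22", "ipsec"),
}

def route(onramp: str, private_vip: str) -> str:
    """Any on-ramp (WARP, Magic WAN, ...) can pair with any off-ramp."""
    d = DECISIONS[private_vip]
    return f"{onramp} -> {d.offramp} -> {d.endpoint}"

assert route("warp", "10.1.2.3") == "warp -> tunnel -> 10.8.0.5:443"
assert route("magic-wan", "10.1.2.4") == "magic-wan -> ipsec -> 172.16.9.9:22"
```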
    <div>
      <h3>Getting started with internal load balancing</h3>
      <a href="#getting-started-with-internal-load-balancing">
        
      </a>
    </div>
    <p>We are excited to be releasing these new load balancing features that solve critical connectivity issues for our customers and effectively eliminate the need for a hardware load balancer. Cloudflare load balancers now support end-to-end private traffic flows with Cloudflare One. To get started with configuring this feature, take a look at our <a href="https://developers.cloudflare.com/load-balancing/">load balancing documentation</a>.</p><p>We are just getting started with our local traffic management load balancing support. There is so much more to come including user experience changes, enhanced layer 4 session affinity, new steering methods, refined control of egress ports, and more.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare One]]></category>
            <category><![CDATA[Magic WAN]]></category>
            <category><![CDATA[WARP]]></category>
            <category><![CDATA[SASE]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Hardware]]></category>
            <guid isPermaLink="false">1yN3NeaPbXuFjUrmpQeDhV</guid>
            <dc:creator>Noah Crouch</dc:creator>
        </item>
        <item>
            <title><![CDATA[Extending Private Network Load Balancing to Layer 4 with Spectrum]]></title>

            <link>https://blog.cloudflare.com/extending-local-traffic-management-load-balancing-to-layer-4-with-spectrum/</link>
            <pubDate>Fri, 31 May 2024 13:00:07 GMT</pubDate>
            <description><![CDATA[ Cloudflare is adding support for all TCP and UDP traffic to our Private Network Load Balancing solution, extending the benefits of Private Network Load Balancing to more than just HTTP(S) traffic ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In 2023, Cloudflare <a href="https://blog.cloudflare.com/elevate-load-balancing-with-private-ips-and-cloudflare-tunnels-a-secure-path-to-efficient-traffic-distribution/"><u>introduced a new load balancing solution</u></a>, supporting Private Network Load Balancing. This gives organizations a way to balance HTTP(S) traffic between private or internal servers within a region-specific data center. Today, we are thrilled to be able to extend those same capabilities to non-HTTP(S) traffic. This new feature is enabled by the integration of Cloudflare Spectrum, Cloudflare Tunnels, and Cloudflare load balancers and is available to enterprise customers. Our customers can now use Cloudflare load balancers for all TCP and UDP traffic destined for private IP addresses, eliminating the need for expensive on-premise load balancers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wjcoenAQ9NFW4PyZiqjCQ/9921257aea4486200be51f070c1cb090/image1-15.png" />
            
            </figure>
    <div>
      <h3>A quick primer</h3>
      <a href="#a-quick-primer">
        
      </a>
    </div>
    <p>In this blog post, we will be referring to <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">load balancers</a> at either layer 4 or layer 7. This is, of course, referring to layers of the <a href="https://www.cloudflare.com/learning/ddos/glossary/open-systems-interconnection-model-osi/">OSI model</a> but more specifically, the ingress path that is being used to reach the load balancer. <a href="https://www.cloudflare.com/learning/ddos/what-is-layer-7/">Layer 7</a>, also known as the Application Layer, is where the HTTP(S) protocol exists. Cloudflare is well known for our layer 7 capabilities, which are built around speeding up and protecting websites which run over HTTP(S). When we refer to layer 7 load balancers, we are referring to HTTP(S)-based services. Our layer 7 stack allows Cloudflare to apply services like CDN, WAF, Bot Management, DDoS protection, and more to a customer's website or application to improve performance, availability, and security.</p><p>Layer 4 load balancers operate at a lower level of the OSI model, called the <a href="https://www.cloudflare.com/learning/ddos/glossary/open-systems-interconnection-model-osi/#:~:text=4.%20The%20transport%20layer">Transport Layer</a>, which means they can be used to support a much broader set of services and protocols. At Cloudflare, our public layer 4 load balancers are enabled by a Cloudflare product called <a href="https://developers.cloudflare.com/spectrum/">Spectrum</a>. Spectrum works as a layer 4 reverse proxy. 
This places Cloudflare in front of any <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">DDoS attacks</a> that may be launched against Spectrum-proxied services, and by using Spectrum in front of your application, your private origin IP address is concealed, which also prevents bad actors from discovering and attacking your origin’s IP address directly.</p><p>Services that use TCP or UDP for transport can leverage Spectrum with a Cloudflare load balancer. Layer 4 load balancing allows us to support other application layer protocols such as SSH, FTP, NTP, and SMTP since they operate over TCP and UDP. Given the breadth of services and protocols this represents, the treatment provided is more generalized. Cloudflare Spectrum supports features such as TLS/SSL offloading, DDoS protection, <a href="https://www.cloudflare.com/application-services/products/argo-smart-routing/">Argo Smart Routing</a>, and session persistence with our layer 4 load balancers.</p>
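<p>The layer 4 versus layer 7 distinction can be made concrete with a small sketch (illustrative only, not Spectrum's implementation; the pool names are invented): a layer 7 proxy must parse the application protocol, for example an HTTP Host header, to make a routing decision, while a layer 4 proxy sees only addresses and ports and treats the payload as opaque bytes.</p>

```python
def l7_route(request: bytes) -> str:
    """Layer 7: parse HTTP headers to pick a backend per hostname."""
    for line in request.split(b"\r\n"):
        if line.lower().startswith(b"host:"):
            return line.split(b":", 1)[1].strip().decode()
    raise ValueError("not HTTP; a layer 7 proxy cannot route this payload")

def l4_route(dst_port: int) -> str:
    """Layer 4: route on the TCP/UDP port alone; the payload stays opaque,
    so SSH, FTP, NTP, SMTP, and any other TCP/UDP service all work."""
    pools = {22: "ssh-pool", 25: "smtp-pool", 443: "tls-pool"}
    return pools.get(dst_port, "default-pool")

assert l7_route(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n") == "example.com"
assert l4_route(22) == "ssh-pool"
```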
    <div>
      <h3>Cloudflare’s current load balancing capabilities</h3>
      <a href="#cloudflares-current-load-balancing-capabilities">
        
      </a>
    </div>
    <p>Before we dig into the new features we are announcing, it's important to understand what Cloudflare load balancing supports today and the challenges our customers face with regard to their load balancing needs.</p><p>There are three main load balancing traffic flows that Cloudflare supports today:</p><ol><li><p>Internet-facing load balancers connecting to publicly accessible origins operating at layer 7, which supports HTTP(S)</p></li><li><p>Internet-facing load balancers connecting to publicly accessible origins operating at layer 4 (Spectrum), which supports all TCP-based and UDP-based services such as SSH, FTP, NTP, SMTP, etc.</p></li><li><p>Publicly accessible load balancers connecting to <b>private</b> origins operating at layer 7 HTTP(S) over Cloudflare Tunnels</p></li></ol><p>One of the biggest advantages Cloudflare’s load balancing solutions offer our customers is that there is no hardware to purchase or maintain. Hardware-based load balancers are expensive to purchase, license, operate, and upgrade. “Need more bandwidth? Just buy and install this additional module.” “Need more features? Just buy and install this new license.” “Oh, your hardware load balancer is End-of-Life? Just purchase an entire new kit which we will EOL in a few years!” The upgrade or refresh cycle on a fully integrated hardware load balancer setup can take years and, by the time you finish the planning, implementation, and cutover, it might actually be time to start planning the next refresh.</p><p>Cloudflare eliminates all these concerns and lets you focus on innovation and growth. Your load balancers exist in every Cloudflare data center across the globe, in <a href="https://www.cloudflare.com/network/">over 300 cities</a>, with virtually unlimited scale and capacity. You never need to worry about bandwidth constraints, deployment locations, extra hardware modules, downtime, upgrades, or maintenance windows ever again. 
With Cloudflare’s global Anycast network, every customer connects to a nearby Cloudflare data center and load balancer, where relevant policies, rules, and steering are applied.</p>
    <div>
      <h3>Load balancing more than websites with Cloudflare Spectrum</h3>
      <a href="#load-balancing-more-than-websites-with-cloudflare-spectrum">
        
      </a>
    </div>
    <p>Today, we are excited to announce that Cloudflare Spectrum can now support load balancing traffic to private networks. The addition of private IP origin support for Cloudflare load balancers is very powerful and that's why we are extending that support to load balancing with Cloudflare <a href="https://developers.cloudflare.com/spectrum/">Spectrum</a> as well. This means that any set of private or internal applications that use TCP or UDP can now be locally load balanced via Cloudflare. These services will also benefit from Spectrum’s layer 3/4 DDoS protection and can leverage other features like session persistence without compromising security. So while the ingress to these load balancers is public, the origins to which they distribute traffic can all be private, inaccessible from the public Internet.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/63C3GATpDsujBJLBaboweL/b6a7adeda6c0b3800f45c3f7eb83bf6e/image3-7.png" />
            
            </figure><p>Ordinarily, load balancing to private networks would require expensive on-premise hardware or costly direct physical connections to cloud providers. But, by using Spectrum as the ingress path for TCP and UDP load balancing, customers can keep their origins completely protected and unreachable from the Internet and allow access exclusively through their Cloudflare load balancer – no expensive hardware required. Customers no longer need to manage complex ACLs or security settings to make sure only certain source IP addresses are connecting to the origins. These private origins can be hosted in private data centers, a public cloud, a private cloud, or on-premise.</p>
    <div>
      <h3>How we enabled Spectrum to support private networks</h3>
      <a href="#how-we-enabled-spectrum-to-support-private-networks">
        
      </a>
    </div>
    <p>All of our changes to create this feature center around integrations with Apollo, the unifying service created by the Cloudflare Zero Trust team. You can read their <a href="/from-ip-packets-to-http-the-many-faces-of-our-oxy-framework/">previous blog post on the Oxy framework</a> for more details on how Zero Trust handles and routes traffic. Apollo accepts incoming traffic from supported on-ramps, applies Zero Trust logic as configured by the customer, and then routes the traffic to egress via supported off-ramps. For example, Apollo enables clients connected securely using Cloudflare’s WARP client to communicate over Cloudflare Tunnels with private origins in a customer’s data center. Now, Apollo is being extended to do more.</p><p>When a user creates a load balanced Spectrum app, they choose a hostname and port, and select a Cloudflare load balancer as their origin. This allocates a hostname which will resolve to an IP address where Spectrum will listen for incoming traffic on the customer-configured port. Spectrum makes a call to Cloudflare's internal load balancing service, Director, which responds with the appropriate endpoint, to which Spectrum will proxy the connection. Previously, load balanced Spectrum apps only supported publicly addressable origins. Now, if the response from Director indicates that the traffic is destined for a private origin, Spectrum passes the private origin's IP address and <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/private-net/tunnel-virtual-networks/">virtual network</a> ID to Apollo, which then proxies the traffic to the customer's private origin.</p><p>In short, new integrations between our Spectrum service and Apollo and between Apollo and Director have allowed us to expand our load balancing offerings not only to layer 4, but also enable us to leverage virtual networks to keep load balanced traffic private and off the public Internet. 
This also sets the stage for integrating load balancing with other traffic on-ramps and off-ramps, such as WARP, in the future. It also opens the door to a number of exciting possibilities like load balancing authenticated device traffic to private networks or even load balancing internal traffic that is never exposed to the public Internet.</p>
    <div>
      <h3>Looking to the future</h3>
      <a href="#looking-to-the-future">
        
      </a>
    </div>
    <p>We are excited to be releasing this new load balancing feature which enables Cloudflare Spectrum to reach private IP endpoints. Cloudflare load balancers now support steering any TCP or UDP-based protocols over Cloudflare Tunnels to private IP endpoints, which are otherwise not accessible via the public Internet. You can learn more about how to configure this feature on our <a href="https://developers.cloudflare.com/load-balancing/local-traffic-management/">load balancing documentation</a> pages.</p><p>We are just getting started with our private network load balancing support. There is so much more to come, including support for load balancing internal traffic, enhanced layer 4 session affinity, new steering methods, additional traffic ingress methods, and more!</p> ]]></content:encoded>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Cloudflare Zero Trust]]></category>
            <category><![CDATA[Private Network]]></category>
            <category><![CDATA[Private IP]]></category>
            <guid isPermaLink="false">6xgIcezZBRXIokMo0e7gMH</guid>
            <dc:creator>Chris Ward</dc:creator>
            <dc:creator>Brian Batraski</dc:creator>
            <dc:creator>Mathew Jacob</dc:creator>
        </item>
        <item>
            <title><![CDATA[Elevate load balancing with Private IPs and Cloudflare Tunnels: a secure path to efficient traffic distribution]]></title>
            <link>https://blog.cloudflare.com/elevate-load-balancing-with-private-ips-and-cloudflare-tunnels-a-secure-path-to-efficient-traffic-distribution/</link>
            <pubDate>Fri, 08 Sep 2023 13:00:01 GMT</pubDate>
            <description><![CDATA[ We are extremely excited to announce a new addition to our Load Balancing solution, Private Network Load Balancing with deep integrations with Zero Trust!
 ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ILMvjlajN04XDxywfXL1a/fcaf05a762de5ac61baa3b001fd4edc7/image7-1.png" />
            
            </figure><p>In the dynamic world of modern applications, efficient load balancing plays a pivotal role in delivering exceptional user experiences. Customers commonly use load balancing to get the most out of their existing infrastructure resources. Load balancing, however, is not a ‘one-size-fits-all, out of the box’ solution for everyone. As you go deeper into the details of your traffic shaping requirements and as your architecture becomes more complex, different flavors of load balancing are usually required to achieve these varying goals, such as steering public traffic between data centers, creating high availability for critical internal services with private IPs, applying steering between servers in a single data center, and more. We are extremely excited to announce a new addition to our Load Balancing solution: Private Network Load Balancing, with deep integrations with Zero Trust!</p><p>A common problem businesses run into is that almost no provider can satisfy all of these requirements. The result is a growing list of vendors to manage, disparate data sources that must be stitched together to get a clear view of the traffic pipeline, and investment in incredibly expensive hardware that is complicated to set up and maintain. Lacking a single source of truth to shorten time to resolution, and a single partner to work with when things are not operating within the ideal path, can be the difference between a proactive, healthy, growing business and one that is reactive and constantly putting out fires. The latter can result in extreme slowdown in developing amazing features and services, reduction in revenue, tarnishing of brand trust, decreases in adoption - the list goes on!</p><p>For eight years, we have provided top-tier global traffic management (GTM) load balancing capabilities to thousands of customers across the globe. 
But why should the steering intelligence, failover, and reliability we guarantee stop at the front door of the selected datacenter and only operate with public traffic? We came to the conclusion that we should go even further. Today is the start of a long series of new features that allow traffic steering, failover, session persistence, SSL/TLS offloading and much more to take place between servers after datacenter selection has occurred! Instead of relying <i>only</i> on the relative weight to determine which server traffic should be sent to, you can now bring the same intelligent steering policies, such as <a href="https://developers.cloudflare.com/load-balancing/understand-basics/traffic-steering/origin-level-steering/least-outstanding-requests-pools/"><u>least outstanding requests steering</u></a> or <a href="https://developers.cloudflare.com/load-balancing/understand-basics/traffic-steering/origin-level-steering/hash-origin-steering/"><u>hash steering</u></a>, to any of your many data centers. This also means you have a single partner for <b>all</b> of your load balancing initiatives and a single pane of glass to inform business decisions! Cloudflare is thrilled to introduce the powerful combination of private IP support for Load Balancing with Cloudflare Tunnels and Private Network Load Balancing, offering customers a solution that blends unparalleled efficiency, security, flexibility, and privacy.</p>
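<p>To illustrate the idea behind hash steering (a minimal sketch, not Cloudflare's implementation; the addresses are invented): hashing a stable client key means the same client consistently lands on the same origin, without any shared state between load balancers.</p>

```python
import hashlib

def hash_steer(client_ip: str, origins: list[str]) -> str:
    """Deterministically map a stable client key to one origin."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return origins[int.from_bytes(digest[:8], "big") % len(origins)]

origins = ["10.0.0.10", "10.0.0.11", "10.0.0.12"]
# The same client always maps to the same origin:
assert hash_steer("198.51.100.7", origins) == hash_steer("198.51.100.7", origins)
```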
    <div>
      <h3>What is a load balancer?</h3>
      <a href="#what-is-a-load-balancer">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2batHEyi8vzCMEjgy2E2rD/601bc0a81e751b02dc332cc531623546/pasted-image-0.png" />
            
            </figure><p>A Cloudflare load balancer directs a request from a user to the appropriate origin pool within a data center</p><p>Load balancing is functionality that’s been around for the last 30 years to help businesses leverage their existing infrastructure resources. <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">Load balancing</a> works by proactively steering traffic away from unhealthy origin servers and — for more advanced solutions — intelligently distributing traffic load based on different steering <a href="https://developers.cloudflare.com/load-balancing/understand-basics/traffic-steering/">algorithms</a>. This process ensures that errors aren’t served to end users and empowers businesses to tightly couple overall business objectives to their traffic behavior. Cloudflare Load Balancing has made it simpler and easier to securely and reliably manage your traffic across multiple data centers around the world. With Cloudflare Load Balancing, your traffic will be directed reliably, with customizable steering, affinity, and failover, regardless of its scale or where it originates. This clearly has an advantage over a physical load balancer since it can be configured easily and traffic doesn’t have to reach one of your data centers to be routed to another location, which would introduce single points of failure and significant <a href="https://www.cloudflare.com/learning/performance/glossary/what-is-latency/">latency</a>. When compared with other global traffic management load balancers, Cloudflare’s Load Balancing offering is easier to set up, simpler to understand, and is fully integrated with the Cloudflare platform as one single product for all load balancing needs.</p>
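<p>The behavior described above can be sketched in a few lines (hypothetical code, not Cloudflare's implementation; the pool contents are invented): origins that failed their health checks are filtered out first, and requests are then distributed across the healthy remainder in proportion to their weights.</p>

```python
import random
from dataclasses import dataclass

@dataclass
class Origin:
    address: str
    weight: float
    healthy: bool  # result of the most recent health check

def pick_origin(origins: list[Origin], rng: random.Random) -> Origin:
    """Steer traffic only to healthy origins, proportionally to weight."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        raise RuntimeError("no healthy origins; fail over to another pool")
    return rng.choices(healthy, weights=[o.weight for o in healthy], k=1)[0]

pool = [
    Origin("10.0.0.10", weight=2.0, healthy=True),
    Origin("10.0.0.11", weight=1.0, healthy=False),  # failed its health check
    Origin("10.0.0.12", weight=1.0, healthy=True),
]
rng = random.Random(0)
# The unhealthy origin never receives traffic:
assert all(pick_origin(pool, rng).address != "10.0.0.11" for _ in range(100))
```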
    <div>
      <h3>What are Cloudflare Tunnels?</h3>
      <a href="#what-are-cloudflare-tunnels">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Pn6iWjweXKPpg0Kl00s0W/6a0ef1886f7ce5ea114e010e9b5a32eb/Group-3345--1-.png" />
            
            </figure><p>Origins and servers of various types can be connected to Cloudflare using Cloudflare Tunnel. Users can also secure their traffic using WARP, allowing traffic to be secured and managed end to end through Cloudflare.</p><p>In 2018, Cloudflare introduced <a href="https://www.cloudflare.com/products/tunnel">Cloudflare Tunnels</a>, a private, secure connection between your data center and Cloudflare. Traditionally, from the moment an Internet property is deployed, developers spend an exhaustive amount of time and energy locking it down through access control lists, rotating IP addresses, or more complex solutions like <a href="https://www.cloudflare.com/learning/network-layer/what-is-gre-tunneling/">GRE tunnels</a>. We built Tunnel to help alleviate that burden. With Tunnels, users can create a private link from their origin server directly to Cloudflare without exposing their services directly to the public Internet or allowing incoming connections through the data center’s firewall. Instead, this private connection is established by running a lightweight daemon, <code>cloudflared</code>, in your data center, which creates a secure, outbound-only connection. This means that only traffic that you’ve configured to pass through Cloudflare can reach your private origin.</p>
    <div>
      <h3>Unleashing the potential of Cloudflare Load Balancing with Cloudflare Tunnels</h3>
      <a href="#unleashing-the-potential-of-cloudflare-load-balancing-with-cloudflare-tunnels">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4fxiGzB4slJt02ZgQppT1d/1eadd4e57c66458024c2247080c7a07d/After-Elevate-Load-Balancing-with-Private-IP-Support-and-Cloudflare-Tunnels.png" />
            
            </figure><p>Cloudflare Load Balancing can easily and securely direct a user’s request to a specific origin within your private data center or public cloud using Cloudflare Tunnels.</p><p>Combining Cloudflare Tunnels with Cloudflare Load Balancing lets you remove the physical load balancers from your data center and have your Cloudflare load balancer reach your servers directly via their private IP addresses, with health checks, steering, and every other Load Balancing feature currently available. Instead of configuring your on-premises load balancer to expose each service and then updating your Cloudflare load balancer, you can configure everything in one place: from the end user to the server handling the request, all configuration lives in the Cloudflare dashboard. On top of this, you can say goodbye to hardware appliances with price tags in the hundreds of thousands of dollars, the management overhead they bring, and the sunk cost of a solution whose delivered value has an expiration date.</p><p>Load balancing serves as the backbone for online services, ensuring seamless traffic distribution across servers and data centers. Traditional load balancing techniques often require exposing services on a data center’s public IP addresses, forcing organizations into complex configurations that are vulnerable to security risks and potential data exposure. By combining private IP support for Load Balancing with Cloudflare Tunnels, Cloudflare is changing the way businesses <a href="https://www.cloudflare.com/application-services/solutions/">protect and optimize their applications</a>. With clear steps to <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/install-and-setup/tunnel-guide/">install</a> the cloudflared agent and connect your private network to Cloudflare’s network via Cloudflare Tunnels, routing traffic directly and securely into your data centers becomes easier than ever!</p>
    <div>
      <h3>Publicly exposing services in private data centers is complicated</h3>
      <a href="#publicly-exposing-services-in-private-data-centers-is-complicated">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fGhHqn2IZj4gMUL24c7lb/16ec2f39fbb2fe37fd26eff7f6d2479f/Before-Elevate-Load-Balancing-with-Private-IP-Support-and-Cloudflare-Tunnels_-A-Secure-Path-to-Efficient-Traffic-Distributio.png" />
            
            </figure><p><sup>A visitor’s request hits a global traffic management (GTM) load balancer directing the request to a data center, then a firewall, then a local load balancer and then an origin</sup></p><p>Load balancing within a private data center can be expensive and difficult to manage. Keeping security first while ensuring ease of use and flexibility for your internal workforce is a tricky balance to strike. It’s not only a question of how to securely expose internal services, but also of how best to balance traffic between servers at a single location within your private network.</p><p>In a private data center, even a very simple website can be fairly complex in terms of networking and configuration. Let’s walk through a simple example of a customer device connecting to a website. The device performs a DNS lookup for the business’s website and receives an IP address corresponding to a customer data center. It then makes an HTTPS request to that IP address, passing the original hostname via <a href="https://www.cloudflare.com/learning/ssl/what-is-sni/">Server Name Indication</a> (SNI). The data center’s load balancer forwards the request to the corresponding origin server and returns the response to the customer device.</p><p>This example doesn’t have any advanced functionality, and the stack is already difficult to configure:</p><ul><li><p>Expose the service or server on a private IP.</p></li><li><p>Configure your data center’s networking to expose the load balancer on a public IP or IP range.</p></li><li><p>Configure your load balancer to forward requests for that hostname and/or public IP to your server’s private IP.</p></li><li><p>Configure a DNS record for your domain to point to your load balancer’s public IP.</p></li></ul><p>In large enterprises, each of these configuration changes likely requires approval from several stakeholders and is made through a different repository, website, or private web interface. Load balancer and networking configurations are often maintained as complex configuration files for Terraform, Chef, Puppet, Ansible, or a similar infrastructure-as-code tool. These files can be syntax-checked, but thorough pre-deployment testing is rarely feasible given the time and hardware that each unique deployment environment would require, so changes to these files can negatively affect other services within the data center. In addition, opening up an ingress to your data center <a href="https://www.cloudflare.com/learning/security/what-is-an-attack-surface/">widens the attack surface</a> for security risks such as <a href="https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/">DDoS attacks</a> or catastrophic data breaches. To make things worse, each vendor has a different interface or API for configuring their devices or services: some <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name-registrar/">registrars</a> only have XML APIs while others have JSON REST APIs, and each device may require its own Terraform provider or Ansible playbook. Complex configurations accumulate over time, are difficult to consolidate or standardize, and inevitably result in technical debt.</p><p>Now let’s add additional origins. Each new origin for our service must be set up, exposed, and added to the physical load balancer’s configuration. Now let’s add another data center: we need yet another solution to distribute traffic across data centers. This results in one system for global traffic management and another, separate system for local traffic. These solutions have historically come from different vendors and must be configured in different ways, even though they serve the same purpose: load balancing. This makes managing your web traffic unnecessarily difficult. Why should you have to configure your origins in two different load balancers? Why can’t you manage all the traffic for all the origins of a service in one place?</p>
    <div>
      <h3>Simpler and better: Load Balancing with Tunnels</h3>
      <a href="#simpler-and-better-load-balancing-with-tunnels">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uBuu8as4xTsf4QLZtpjt7/b361efe686342832c1c1de254e22cccf/pasted-image-0--1-.png" />
            
            </figure><p>Cloudflare Load Balancing can manage traffic for all your offices, data centers, remote users, public clouds, private clouds and hybrid clouds in one place.</p><p>With Cloudflare Load Balancing and Cloudflare Tunnel, you can manage all your public and private origins in one place: the Cloudflare dashboard. Cloudflare load balancers can be configured through the dashboard or the Cloudflare API, with full parity between the two. There’s no need to <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a> or open a remote desktop to modify load balancer configurations for your public or private servers.</p><p>With Cloudflare Tunnel <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/install-and-setup/tunnel-guide/">set up and running</a> in your data center, everything is ready to connect your origin servers to the Cloudflare network and your load balancers. You do not need to configure any ingress to your data center, since Cloudflare Tunnel operates only over outbound connections and can securely reach privately addressed services inside your data center. To expose your service to Cloudflare, you just <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/private-net/tunnel-virtual-networks/#route-ips-over-virtual-networks">set up your private IP range to be routed over that tunnel</a>. Then you create a Cloudflare load balancer and enter the corresponding private IP address and <a href="https://developers.cloudflare.com/load-balancing/understand-basics/traffic-management/">virtual network ID into your origin pool</a>. After that, Cloudflare manages DNS and load balancing across your private servers. Your origin now receives traffic exclusively via Cloudflare Tunnel, and your physical load balancer is no longer needed!</p><p>This integration enables organizations to deploy load balancers while keeping their applications securely shielded from the public Internet. Traffic passes through Cloudflare’s data centers, so customers continue to take full advantage of Cloudflare’s security and performance services. And because traffic between Cloudflare and customer origins is carried over Cloudflare Tunnel, it remains isolated within trusted networks, bolstering privacy, security, and peace of mind.</p>
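<p>As a rough sketch of the steps above, here is what an origin pool body combining private addresses with a virtual network ID might look like. This is only an illustration: the virtual network ID and private addresses are hypothetical placeholders, the helper name is our own, and the exact request schema for the Load Balancing pools API is documented in the dev docs linked above.</p>

```python
import json

# Illustrative placeholder; in practice this is the ID of the virtual network
# your private IP range is routed over via Cloudflare Tunnel.
VIRTUAL_NETWORK_ID = "illustrative-virtual-network-id"

def private_origin_pool(name, origins):
    """Build a pool body whose origins are private IPs reached via Cloudflare Tunnel.

    `origins` is a list of (origin_name, private_ip) tuples.
    """
    return {
        "name": name,
        "origins": [
            {
                "name": origin_name,
                "address": private_ip,  # private address inside your data center
                "enabled": True,
                # Ties the origin to the tunnel-routed virtual network:
                "virtual_network_id": VIRTUAL_NETWORK_ID,
            }
            for origin_name, private_ip in origins
        ],
    }

pool = private_origin_pool("las-vegas-bldg1", [("web-1", "10.0.0.1"), ("web-2", "10.0.0.2")])
print(json.dumps(pool, indent=2))
```

<p>The key detail is that each origin carries both a private address and the virtual network ID, so Cloudflare knows which tunnel-routed network to use to reach it.</p>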
    <div>
      <h3>The advantages of Private IP support with Cloudflare Tunnels</h3>
      <a href="#the-advantages-of-private-ip-support-with-cloudflare-tunnels">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4zxhWEuT3ZCugzEXSDSHiL/ce99377d9cdbd29aebe95130ac8dd6d1/pasted-image-0--2-.png" />
            
            </figure><p><sup>Cloudflare Load Balancing works in conjunction with all the security and privacy products that Cloudflare has to offer including DDoS protection, Web Application Firewall and Bot Management</sup></p><p><b>Unified Traffic Management: </b>All the features and ease of use of Cloudflare Load Balancing for global traffic management are also available with Private Network Load Balancing. You can configure your public and private origins in one dashboard instead of across several services and vendors. All your private origins now benefit from the features Cloudflare Load Balancing is known for: instant failover, customizable steering between data centers, ease of use, custom rules, and configuration updates in a matter of seconds. They also benefit from our newer features, including least connection steering, least outstanding request steering, and session affinity by header. This is just a small subset of the expansive Load Balancing feature set; see our <a href="https://developers.cloudflare.com/load-balancing/"><u>dev docs</u></a> for more features and details on the offering.</p><p><b>Enhanced Security</b>: By combining private IP support with Cloudflare Tunnels, organizations can fortify their security posture and protect sensitive data. With private IP addresses and encrypted connections via Cloudflare Tunnel, traffic remains within trusted networks, significantly reducing the risk of unauthorized access and potential attacks. You can also configure <a href="https://developers.cloudflare.com/cloudflare-one/policies/access/">Cloudflare Access</a> to add single sign-on support for your application and restrict it to a subset of authorized users. In addition, you still benefit from Firewall rules, Rate Limiting rules, Bot Management, DDoS protection, and all the other Cloudflare products available today, allowing comprehensive security configurations.</p><p><b>Uncompromising Privacy</b>: As data privacy continues to take center stage, businesses must ensure the confidentiality of user information. Cloudflare’s private IP support with Cloudflare Tunnels enables organizations to segregate applications and keep sensitive data within their private network boundaries. Custom rules also allow you to direct traffic for specific devices to specific data centers. For example, you can use custom rules to direct traffic from Eastern and Western Europe to your European data centers, easily keeping those users’ data within Europe. This minimizes the exposure of data to external entities, preserving user privacy and complying with strict privacy regulations across different geographies.</p><p><b>Flexibility &amp; Reliability</b>: Scale and adaptability are foundations of a well-operating business. <a href="https://www.cloudflare.com/learning/access-management/how-to-implement-zero-trust/">Implementing solutions</a> that fit your business’s needs today is not enough; customers must find solutions that will meet their needs for the next three or more years. The blend of Load Balancing with Cloudflare Tunnels within our <a href="https://www.cloudflare.com/zero-trust/solutions/">Zero Trust solution</a> delivers exactly that flexibility and reliability. Changes to load balancer configurations propagate around the world in a matter of seconds, making load balancers an effective way to respond to incidents. Instant failover, health monitoring, and steering policies all help maintain high availability for your applications, so you can deliver the reliability that your users expect.
This is all in addition to best-in-class <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a> capabilities that are deeply integrated, such as, but not limited to, Secure Web Gateway (SWG), remote browser isolation, network logs, and data loss prevention.</p><p><b>Streamlined Infrastructure</b>: Organizations can consolidate their network architecture and establish secure connections across distributed environments. This unification reduces complexity, lowers operational overhead, and facilitates efficient resource allocation. Whether you need a global traffic manager to intelligently direct traffic between data centers within your private network, or to steer between specific servers after data center selection has taken place, there is now a single, clear lens for managing your global and local traffic, regardless of whether its source or destination is public or private. Complexity is a large hurdle to achieving and maintaining fast, agile business units. Consolidating into a single provider like Cloudflare that delivers security, reliability, and observability not only saves significant cost but lets your teams move faster and focus on growing the business, enhancing critical services, and developing incredible features, rather than taping together infrastructure that may not work in a few years. Leave the heavy lifting to us, and let us empower your team to focus on creating amazing experiences for your employees and end users.</p><p>Hardware appliances for local traffic lack the agility, flexibility, and lean operations to justify the hundreds of thousands of dollars spent on them, along with the huge overhead of managing CPU, memory, power, cooling, and more. Instead, we want to help businesses move this logic to the cloud, abstracting away the needless overhead so teams can focus on what they do best, building amazing experiences, while Cloudflare does what we do best: protecting, accelerating, and building heightened reliability. Stay tuned for more updates on Cloudflare’s Private Network Load Balancing and how it can reduce architecture complexity while bringing more insight, security, and control to your teams. In the meantime, check out our new <a href="https://cf-assets.www.cloudflare.com/slt3lc6tev37/7siMQh0goJJnH4PYbAzOxC/f4a66ebdf20cca2ec85c2b9261fb8a38/Optimize-Web-Performance.pdf"><u>whitepaper</u></a>!</p>
    <div>
      <h3>Looking to the future</h3>
      <a href="#looking-to-the-future">
        
      </a>
    </div>
    <p>Cloudflare's private IP support for Load Balancing with Cloudflare Tunnels, part of the Zero Trust solution, reaffirms our commitment to providing cutting-edge tools that prioritize security, privacy, and performance. By leveraging private IP addresses and secure tunnels, Cloudflare empowers businesses to fortify their network infrastructure while ensuring compliance with regulatory requirements. With enhanced security, uncompromising privacy, and streamlined infrastructure, load balancing becomes a powerful driver of efficient and secure public or private services.</p><p>As a business grows and its systems scale up, it will need the features that Cloudflare Load Balancing is known for: health monitoring, steering, and failover. As availability requirements increase due to growing demands from end users, customers can add health checks, enabling automatic failover to healthy servers when a server begins to fail. When the business begins to receive more traffic from around the world, it can create new pools for different regions and use dynamic steering to reduce latency between user and server. For intensive or long-running requests, such as complex datastore queries, customers can leverage <a href="https://developers.cloudflare.com/load-balancing/understand-basics/traffic-steering/origin-level-steering/least-outstanding-requests-pools/"><u>least outstanding requests steering</u></a> to reduce the number of concurrent requests per server. Previously, all of this required publicly addressable IPs; it is now available for pools with public IPs, private servers, or combinations of the two. Private Network Load Balancing is live and ready to use today! Check out our <a href="https://developers.cloudflare.com/load-balancing/understand-basics/traffic-management/"><u>dev docs for instructions on how to get started</u></a>.</p><p>Stay tuned for our next addition: Load Balancing onramp support for Spectrum and WARP via Cloudflare Tunnels with private IPs for your <a href="https://developers.cloudflare.com/fundamentals/get-started/concepts/network-layers/">Layer 4</a> traffic, allowing us to support TCP and UDP applications in your private data centers!</p> ]]></content:encoded>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <category><![CDATA[Traffic]]></category>
            <guid isPermaLink="false">6WBtRI0c6K4SqCsCAd3hgn</guid>
            <dc:creator>Brian Batraski</dc:creator>
            <dc:creator>Mathew Jacob</dc:creator>
        </item>
        <item>
            <title><![CDATA[Load Balancing with Weighted Pools]]></title>
            <link>https://blog.cloudflare.com/load-balancing-with-weighted-pools/</link>
            <pubDate>Tue, 02 Aug 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ Today, we’re excited to announce a frequently requested feature for our Load Balancer – Weighted Pools ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Anyone can take advantage of Cloudflare’s far-reaching network to protect and accelerate their online presence. Our vast <a href="https://www.cloudflare.com/network/">number of data centers</a>, and their proximity to Internet users around the world, enables us to secure and accelerate our customers’ Internet applications, APIs and websites. Even a simple service with a <a href="https://www.cloudflare.com/learning/cdn/glossary/origin-server/">single origin server</a> can leverage the massive scale of the Cloudflare network in 270+ cities. Using the Cloudflare cache, you can support more requests and users without purchasing new servers.</p><p>Whether it is to guarantee high availability through redundancy, or to support more dynamic content, an increasing number of services require multiple origin servers. The Cloudflare Load Balancer keeps our customer’s services highly available and makes it simple to spread out requests across multiple origin servers. Today, we’re excited to announce a frequently requested feature for our Load Balancer – Weighted Pools!</p>
    <div>
      <h2>What’s a Weighted Pool?</h2>
      <a href="#whats-a-weighted-pool">
        
      </a>
    </div>
    <p>Before we can answer that, let’s take a quick look at how our load balancer works and define a few terms:</p><p><b>Origin Servers</b> - Servers which sit behind Cloudflare and are often located in a customer-owned datacenter or at a public cloud provider.</p><p><b>Origin Pool</b> - A logical collection of origin servers. Most pools are named to represent data centers, or cloud providers like “us-east,” “las-vegas-bldg1,” or “phoenix-bldg2”. It is recommended to use pools to represent a collection of servers in the same physical location.</p><p><b>Traffic Steering Policy</b> - A policy specifies how a load balancer should steer requests across origin pools. Depending on the steering policy, requests may be sent to the nearest pool as defined by latitude and longitude, the origin pool with the lowest latency, or based upon the location of the Cloudflare data center.</p><p><b>Pool Weight</b> - A numerical value to describe what percentage of requests should be sent to a pool, relative to other pools.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6rTxNJcZUpuAep6AGouFyW/51f258c84377e710316181a3e0ce7991/image6.png" />
            
            </figure><p>When a request from a visitor arrives at the Cloudflare network for a hostname with a load balancer attached to it, the load balancer must decide where the request should be forwarded. Customers can configure this behavior with traffic steering policies.</p><p>The Cloudflare Load Balancer already supports <a href="https://developers.cloudflare.com/load-balancing/understand-basics/traffic-steering/steering-policies/">Standard Steering, Geo Steering, Dynamic Steering, and Proximity Steering</a>. Each of these respective traffic steering policies control how requests are distributed across origin pools. Weighted Pools are an extension of our standard, random steering policy which allows the specification of what relative percentage of requests should be sent to each respective pool.</p><p>In the example above, our load balancer has two origin pools, “las-vegas-bldg1” (which is a customer operated data center), and “us-east-cloud” (which is a public cloud provider with multiple virtual servers). Each pool has a weight of 0.5, so 50% of requests should be sent to each respective pool.</p>
    <div>
      <h2>Why would someone assign weights to origin pools?</h2>
      <a href="#why-would-someone-assign-weights-to-origin-pools">
        
      </a>
    </div>
    <p>Weighted Pools was one of the most frequently requested features from our customers. Part of the reason we’re so excited about it is that it can be used to solve many types of problems.</p>
    <div>
      <h3>Unequally Sized Origin Pools</h3>
      <a href="#unequally-sized-origin-pools">
        
      </a>
    </div>
    <p>In the example below, the amount of dynamic and uncacheable traffic has significantly increased due to a large sales promotion. Administrators notice that the load on their Las Vegas data center is too high, so they elect to dynamically increase the number of origins within their public cloud provider. Our two pools, “las-vegas-bldg1” and “us-east-cloud” are no longer equally sized. Our pool representing the public cloud provider is now much larger, so administrators change the pool weights so that the cloud pool receives 0.8 (80%) of the traffic, relative to the 0.2 (20%) of the traffic which the Las Vegas pool receives. The administrators were able to use pool weights to very quickly fine-tune the distribution of requests across unequally sized pools.</p>
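<p>One simple way to reason about the weights in this scenario is to derive them from relative capacity. A minimal sketch, assuming weights proportional to origin counts (an illustrative heuristic of our own, not a Cloudflare recommendation):</p>

```python
def weights_from_capacity(capacity):
    """Derive relative pool weights from server counts per pool.

    `capacity` maps pool name -> number of origins; the result maps
    pool name -> fraction of traffic that pool should receive.
    """
    total = sum(capacity.values())
    return {pool: n / total for pool, n in capacity.items()}

# After scaling the cloud pool up to four times the size of Las Vegas:
weights = weights_from_capacity({"las-vegas-bldg1": 2, "us-east-cloud": 8})
print(weights)  # {'las-vegas-bldg1': 0.2, 'us-east-cloud': 0.8}
```

<p>Recomputing and applying new weights like this takes effect far faster than physically rebalancing hardware between locations.</p>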
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5V6rFceRK8So5AK0S7FzV9/4441a732a2581a6fe42938710e624742/image3.png" />
            
            </figure>
    <div>
      <h3>Data center kill switch</h3>
      <a href="#data-center-kill-switch">
        
      </a>
    </div>
    <p>In addition to balancing out unequally sized pools, Weighted Pools can also be used to take a data center (an origin pool) completely out of rotation by setting the pool’s weight to 0. This is particularly useful when a data center needs to be removed quickly during troubleshooting, or during proactive maintenance where power may be unavailable. Even if a pool is disabled with a weight of 0, Cloudflare will still monitor the pool’s health so that administrators can assess when it is safe to return traffic.</p>
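<p>Conceptually, a weight of 0 simply removes a pool from the random draw. A toy sketch of weighted selection (an illustration of the behavior, not Cloudflare’s actual implementation):</p>

```python
import random

def pick_pool(weights):
    """Pick one pool at random, with probability proportional to its weight.

    A pool whose weight is 0 is never chosen, acting as a kill switch.
    """
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Las Vegas is "killed" with weight 0; all traffic goes to the cloud pool.
weights = {"las-vegas-bldg1": 0.0, "us-east-cloud": 1.0}
picks = {pick_pool(weights) for _ in range(1000)}
print(picks)  # {'us-east-cloud'}
```

<p>Health monitoring continues independently of the weights, which is why the disabled pool can be safely returned to rotation later.</p>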
    <div>
      <h3>Network A/B testing</h3>
      <a href="#network-a-b-testing">
        
      </a>
    </div>
    <p>One final use case we’re excited about is the ability to use weights to send a very small share of requests to a pool. Did the team just stand up a brand-new data center, or perhaps upgrade all the servers to a new software version? Using weighted pools, administrators can use a load balancer to effectively A/B test their network: send only 0.05 (5%) of requests to the new pool to verify the origins are functioning properly before gradually increasing the load.</p>
    <div>
      <h2>How do I get started?</h2>
      <a href="#how-do-i-get-started">
        
      </a>
    </div>
    <p>When setting up a load balancer, you configure one or more origin pools and then place origins into each pool. Once you have more than one pool, the relative weights of the pools are used to distribute requests.</p><p>To set up a weighted pool using the dashboard, create a load balancer in the <b>Traffic &gt; Load Balancing</b> area.</p><p>Once you have set up the load balancer, you’re taken to the <b>Origin Pools</b> setup page. Under the Traffic Steering Policy, select <b>Random</b>, and then assign relative weights to every pool.</p><p>If your weights do not add up to 1.00 (100%), that’s fine! We do the math behind the scenes to determine how much traffic each pool should receive relative to the others.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tgWXK6Slx7C3KZdZd5SrO/de00d5b4889346a67ba6fc7b34b0e05d/image4.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MoDiAL340j2GfFJ3KihWr/157eac63640bbd6cd11c75d8b610458b/image2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FjRnImOHxs5NHVZqRVG7w/ee4e4f38303dc06ee0ec1494fa65faab/image5-1.png" />
            
            </figure><p>Weighted Pools may also be configured via the API. We’ve edited an example illustrating the relevant parts of the REST API.</p><ul><li><p>The load balancer should employ a “steering_policy” of random.</p></li><li><p>Each pool has a UUID, which can then be assigned a “pool_weight.”</p></li></ul>
            <pre><code> {
    "description": "Load Balancer for www.example.com",
    "name": "www.example.com",
    "enabled": true,
    "proxied": true,
    "fallback_pool": "9290f38c5d07c2e2f4df57b1f61d4196",
    "default_pools": [
        "9290f38c5d07c2e2f4df57b1f61d4196",
        "17b5962d775c646f3f9725cbc7a53df4"
    ],
    "steering_policy": "random",
    "random_steering": {
        "pool_weights": {
            "9290f38c5d07c2e2f4df57b1f61d4196": 0.8
        },
        "default_weight": 0.2
    }
}</code></pre>
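<p>In this configuration, pools listed under <code>pool_weights</code> use their explicit weight, while any other default pool falls back to <code>default_weight</code>. A small sketch of how the effective traffic shares work out (our own illustrative model of that behavior, normalizing over the default pools):</p>

```python
def effective_shares(default_pools, pool_weights, default_weight):
    """Compute each pool's share of traffic from explicit and fallback weights."""
    raw = {p: pool_weights.get(p, default_weight) for p in default_pools}
    total = sum(raw.values())  # normalize so shares sum to 1.0
    return {p: w / total for p, w in raw.items()}

shares = effective_shares(
    default_pools=["9290f38c5d07c2e2f4df57b1f61d4196", "17b5962d775c646f3f9725cbc7a53df4"],
    pool_weights={"9290f38c5d07c2e2f4df57b1f61d4196": 0.8},
    default_weight=0.2,
)
print(shares)
```

<p>Here the first pool receives 80% of requests and the second 20%, matching the weights in the payload above.</p>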
            <p>We’re excited to launch this simple yet powerful feature. Weighted Pools can be used in tons of creative new ways to solve load balancing challenges, and it’s available to all customers with load balancers today!</p><p>Developer Docs: <a href="https://developers.cloudflare.com/load-balancing/how-to/create-load-balancer/#create-a-load-balancer">https://developers.cloudflare.com/load-balancing/how-to/create-load-balancer/#create-a-load-balancer</a></p><p>API Docs: <a href="https://api.cloudflare.com/#load-balancers-create-load-balancer">https://api.cloudflare.com/#load-balancers-create-load-balancer</a></p> ]]></content:encoded>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Network Services]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">12ylWmYVhyw4uFIG3NBqN2</guid>
            <dc:creator>Brian Batraski</dc:creator>
            <dc:creator>Ben Ritter</dc:creator>
        </item>
        <item>
            <title><![CDATA[Waiting Room: Random Queueing and Custom Web/Mobile Apps]]></title>
            <link>https://blog.cloudflare.com/waiting-room-random-queueing-and-custom-web-mobile-apps/</link>
            <pubDate>Thu, 07 Oct 2021 15:07:17 GMT</pubDate>
            <description><![CDATA[ Today, we are announcing Waiting Room’s general availability and several new features that have user experience in mind — an alternative queueing method and support for custom web/mobile applications! ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Today, we are announcing the general availability of <a href="/cloudflare-waiting-room/">Cloudflare Waiting Room</a> to customers on our <a href="https://www.cloudflare.com/plans/enterprise/">Enterprise plans</a>, making it easier than ever to protect your website against traffic spikes. We are also excited to present several new features that have user experience in mind — an alternative queueing method and support for custom web/mobile applications.</p>
    <div>
      <h3>First-In-First-Out (FIFO) Queueing</h3>
      <a href="#first-in-first-out-fifo-queueing">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HimJUBjwlefUf3zFKz1kV/14b81b8142f7c074a02a0407fd94fdbb/image3-11.png" />
            
            </figure><p>Whether you’ve waited to check out at a supermarket or stood in line at a bank, you’ve undoubtedly experienced FIFO queueing. FIFO stands for First-In-First-Out, which simply means that people are seen in the order they arrive — i.e., those who arrive first are processed before those who arrive later.</p><p>When Waiting Room was introduced earlier this year, it was first deployed to protect COVID-19 vaccine distributors from overwhelming demand — a service we offer free of charge under <a href="/project-fair-shot/">Project Fair Shot</a>. At the time, FIFO queueing was the natural option due to its wide acceptance in day-to-day life and accurate estimated wait times. One problem with FIFO is that users who arrive later could see long estimated wait times and decide to abandon the website.</p><p>We take customer feedback seriously and improve products based on it. A frequent request was to handle users irrespective of the time they arrive in the Waiting Room. In response, we developed an additional approach: random queueing.</p>
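<p>As a simplified model of the FIFO behavior described above (not Waiting Room’s actual implementation), a queue admits users strictly in arrival order, and a user’s estimated wait is their position in line divided by the admission rate:</p>

```python
from collections import deque

class FifoWaitingRoom:
    """Toy FIFO queue: users are admitted strictly in arrival order."""

    def __init__(self, admissions_per_minute):
        self.queue = deque()
        self.rate = admissions_per_minute

    def join(self, user):
        """Add a user to the back of the line; return estimated wait in minutes."""
        self.queue.append(user)
        return len(self.queue) / self.rate

    def admit(self, n):
        """Admit up to n users from the front of the line."""
        return [self.queue.popleft() for _ in range(min(n, len(self.queue)))]

room = FifoWaitingRoom(admissions_per_minute=10)
for u in ["alice", "bob", "carol"]:
    room.join(u)
print(room.admit(2))  # ['alice', 'bob'] (earliest arrivals admitted first)
```

<p>The estimate grows linearly with queue position, which is exactly why late arrivals in a long FIFO queue see daunting wait times and may abandon the site.</p>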
    <div>
      <h3>A New Approach to Fairness: Random Queueing</h3>
      <a href="#a-new-approach-to-fairness-random-queueing">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Y5uqRgnOlkffp4VrgWNyA/6ea17980be8c3453471dfaf7bf38bc4d/image8-9.png" />
            
            </figure><p>You can think of random queueing as participating in a raffle for a prize. In a raffle, people obtain tickets and put them into a big container. Later, tickets are drawn at random to determine the winners. The more time you spend in the raffle, the better your chances of winning at least once, since there will be fewer tickets in the container. No matter what, everyone participating in the raffle has an opportunity to win.</p><p>Similarly, in a random queue, users are selected from the Waiting Room at random, regardless of their initial arrival time. This means that you could be let into the application before someone who arrived earlier than you, or vice versa. Just like how you can buy more tickets in a raffle, joining a random queue earlier than someone else will give you more attempts to be accepted, but does not guarantee you will be let in. However, at any particular time, you will have the same chance to be let into the website as anyone else. This is different from a raffle, where you could have more tickets than someone else at a given time, providing you with an advantage.</p><p>Random queueing is designed to give everyone a fair chance. Imagine waking up excited to purchase new limited-edition sneakers only to find that the FIFO queue is five hours long and full of users that either woke up in the middle of the night to get in line or joined from earlier time zones. Even if you waited five hours, those sneakers would likely be sold out by the time you reach the website. In this case, you’d probably abandon the Waiting Room completely and do something else. On the other hand, if you were aware that the queue was random, you’d likely stick around. 
After all, you have a chance to be accepted and make a purchase!</p><p>As a result, random queueing is perfect for short-lived scenarios with lots of hype, such as product launches, holiday traffic, special events, and limited-time sales.</p><p>By contrast, when the event ends and traffic returns to normal, a FIFO queue is likely more suitable, since its widely accepted structure and accurate estimated wait times provide a consistent user experience.</p>
    <div>
      <h3>How Does Random Queueing Work?</h3>
      <a href="#how-does-random-queueing-work">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4DMz1FTEC6lJp9AV5Mdb1e/e198bba78f0fa5e5d77daece125b683d/image9-5.png" />
            
            </figure><p>Perhaps the best part about random queueing is that it maintains the same internal structure that powers FIFO. As a result, if you <a href="https://developers.cloudflare.com/waiting-room/reference/queueing-methods#changing-queueing-methods">change the queueing method</a> in the dashboard — even when you may be actively queueing users — the transition to the new method is seamless. Imagine you have users 1, 2, 3, 4, and 5 waiting in a FIFO queue in the order 5 →  4 → 3 → 2 → 1, where user 1 will be the next user to access the application. Let’s assume you switch to random queueing. Now, any user can be accepted next. Let’s assume user 4 is accepted. If you decide to immediately switch back to FIFO queueing, the queue will reflect the order 5 → 3 → 2 → 1. In other words, transitioning from FIFO to random and back to FIFO will respect the initial queue positions of the users! But how does this work? To understand, we first need to remember <a href="/building-waiting-room-on-workers-and-durable-objects/">how we built Waiting Room</a> for FIFO.</p><p>Recall the Waiting Room configurations:</p><ul><li><p><i>Total Active Users</i>. The total number of active users that can be using the application at any given time.</p></li><li><p><b><i>New Users Per Minute.</i></b> The maximum number of new users per minute that can be accepted to the application.</p></li></ul><p>Next, remember that Waiting Room is powered by cookies. When you join the Waiting Room for the first time, you are assigned an encrypted cookie. You bring this cookie back to the Waiting Room and update it with every request, using it to prove your initial arrival time and status.</p><p>Properties in the Waiting Room cookie include:</p><ul><li><p><b>bucketId</b>. The timestamp rounded down to the nearest minute of the user’s first request to the Waiting Room. 
If you arrive at 10:23:45, you will be grouped into a bucket for 10:23:00.</p></li><li><p><b>acceptedAt.</b> The timestamp when the user got accepted to the origin website for the first time.</p></li><li><p><b>refreshIntervalSeconds</b>. When queueing, this is the number of seconds the user must wait before sending another request to the Waiting Room.</p></li><li><p><b>lastCheckInTime</b>. The last time each user checked into the Waiting Room or origin website. When queueing, this is only updated for requests every refreshIntervalSeconds.</p></li></ul><p>For any given minute, we can calculate the number of users we can let into the origin website. Let’s say we deploy a Waiting Room on "<a href="https://example.com/waitingroom">https://example.com/waitingroom</a>" that can support 10,000 <i>Total Active Users</i>, and we allow up to 2,000 <i>New Users Per Minute.</i> If there are currently 7,000 active users on the website, we have 10,000 - 7,000 = 3,000 open slots. However, we need to take the minimum (3,000, 2,000) = 2,000 since we need to respect the <i>New Users Per Minute</i> limit. Thus, we have 2,000 available slots we can give out.</p><p>Let’s assume there are 2,500 queued users that joined over the last three minutes in groups of 500, 1,000, and 1,000, respectively for the timestamps 15:54, 15:55, and 15:56. To respect FIFO queueing, we will take our 2,000 available slots and try to reserve them for users who joined first. Thus, we will reserve 500 available slots for the users who joined at 15:54 and then reserve 1000 available slots for the users who joined at 15:55. When we get to the users for 15:56, we see that we only have 500 slots left, which is not enough for the 1,000 queued users for this minute:</p>
            <pre><code>{
	"activeUsers": 7000,
	"buckets": [{
			"key": "Thu, 27 May 2021 15:54:00 GMT",
			"data": {
				"waiting": 500,
				"reservedSlots": 500
			}
		},
		{
			"key": "Thu, 27 May 2021 15:55:00 GMT",
			"data": {
				"waiting": 1000,
				"reservedSlots": 1000
			}
		},
		{
			"key": "Thu, 27 May 2021 15:56:00 GMT",
			"data": {
				"waiting": 1000,
				"reservedSlots": 500
			}
		}
	]
}</code></pre>
            <p>Since we have reserved slots for all users with bucketIds of 15:54 and 15:55, they can be let into the origin website from any data center. However, we can only let in a subset of the users who initially arrived at 15:56.</p><table><tr><td><p><b>Timestamp (bucketId)</b></p></td><td><p><b>Queued Users</b></p></td><td><p><b>Reserved Slots</b></p></td><td><p><b>Strategy</b></p></td></tr><tr><td><p>15:54</p></td><td><p>500</p></td><td><p>500</p></td><td><p>Accept all users</p></td></tr><tr><td><p>15:55</p></td><td><p>1,000</p></td><td><p>1,000</p></td><td><p>Accept all users</p></td></tr><tr><td><p>15:56</p></td><td><p>1,000</p></td><td><p>500</p></td><td><p>Accept subset of users</p></td></tr></table><p>These 500 slots for 15:56 are allocated to each Cloudflare edge data center based on its respective historical traffic data, and further divided for each <a href="https://workers.cloudflare.com/">Cloudflare Worker</a> within the data center. For example, let’s assume there are two data centers — Nairobi and Dublin — which share 60% and 40% of the traffic, respectively, for this minute. In this case, we will allocate 500 * .6 = 300 slots for Nairobi and 500 * .4 = 200 slots for Dublin. In Nairobi, let’s say there are 3 active workers, so we will grant each of them 300 / 3 = 100 slots. If you make a request to a worker in Nairobi and your bucketId is 15:56, you will be allowed in and consume a slot if the worker still has at least one of its 100 slots available. Since we have reserved all 2,000 available slots, users with bucketIds after 15:56 will have to continue queueing.</p><p>Let’s modify this case and assume we only have 200 queued users, all of which are in the 15:54 bucket. First, we reserve 200 slots for these queued users, leaving us 2,000 - 200 = 1,800 remaining slots. 
Since we have reserved slots for all queued users, we can use the remaining 1,800 slots on <i>new users</i> — people who have just made their first request to the Waiting Room and don’t have a cookie or bucketId yet. Similar to how we handle buckets with fewer slots than queued users, we will distribute these 1,800 slots to each data center, allocating 1,800 * .6 = 1,080 to Nairobi and 1,800 * .4 = 720 to Dublin. In Nairobi, we will split these equally across the 3 workers, giving them 1,080 / 3 = 360 slots each. If you are a new user making a request to a worker in Nairobi, you will be accepted and take a slot if the worker has at least one of its 360 slots available, otherwise you will be marked as a queued user and enter the Waiting Room.</p><p>Now that we have outlined the concepts for FIFO, we can understand how random queueing operates. Simply put, random queueing functions the same way as FIFO, except we <i>pretend</i> that every user is new. In other words, we will not look at reserved slots when making the decision if the user should be let in. Let’s revisit the last case with 200 queued users in the 15:54 bucket and 2,000 available slots. When random queueing, we allocate the full 2,000 slots to new users, meaning Nairobi gets 2,000 * .6 = 1,200 slots and each of its 3 workers gets 1,200 / 3 = 400 slots. No matter how many users are queued or freshly joining the Waiting Room, all of them will have a chance at taking these slots.</p><p>Finally, let’s reiterate that we are only <i>pretending</i> that all users are new — we still assign them to bucketIds and reserve slots as if we were FIFO queueing, but simply don’t make any use of this logic while random queueing is active. That way, we can maintain the same FIFO structure while we are random queueing so that if necessary, we can smoothly transition back to FIFO queueing and respect initial user arrival times.</p>
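<p>To make the slot bookkeeping concrete, here is a minimal sketch (our illustration, not Cloudflare’s implementation) of reserving slots per bucket in FIFO order, using the numbers from the example above:</p>

```python
def reserve_slots(available, buckets):
    """Reserve slots for queued users, oldest bucket first (FIFO order)."""
    reservations = {}
    for bucket_id, waiting in sorted(buckets.items()):
        reserved = min(waiting, available)
        reservations[bucket_id] = reserved
        available -= reserved
    # Whatever remains can be handed to brand-new users without cookies.
    return reservations, available

# min(Total Active Users - active users, New Users Per Minute) = min(3000, 2000)
available = min(10_000 - 7_000, 2_000)
buckets = {"15:54": 500, "15:55": 1_000, "15:56": 1_000}
reserved, leftover = reserve_slots(available, buckets)
# reserved: 500 for 15:54, 1000 for 15:55, only 500 for 15:56; leftover: 0
```

<p>Splitting a bucket’s reserved slots across data centers and workers is then simple proportional division, e.g. 500 * .6 = 300 slots for Nairobi and 300 / 3 = 100 slots per worker.</p>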
    <div>
      <h3>How “Random” is Random Queueing?</h3>
      <a href="#how-random-is-random-queueing">
        
      </a>
    </div>
    <p>Since random queueing is basically a race for available slots, we were concerned that it could be exploited if the available user slots and the queued user check-ins did not occur randomly.</p><p>To ensure all queued users can attempt to get into the website at the same rate, we store (in the encrypted cookie) the last time each user checked into the Waiting Room (lastCheckInTime) to prevent them from attempting to gain access to the website until a number of seconds have passed (refreshIntervalSeconds). This means that spamming the page refresh button will not give you an advantage over other queued users! Be patient — the browser will refresh automatically the moment you are eligible for another chance.</p><p>Next, let’s imagine five queued users checking into the Waiting Room every refreshIntervalSeconds=30 at approximately the :00 and :30 minute marks. A new queued user joins the Waiting Room and checks in at approximately :15 and :45. If new slots are randomly released, this new user will have about a 50% chance of being selected next, since it alone covers the :00-:15 and :30-:45 ranges. On the other hand, the other five queued users share the :15-:30 and :45-:00 ranges, giving them about a 50% / 5 = 10% chance each. Now suppose that new slots are not released randomly, but are instead always released at :59. In this case, the new queued user will have virtually no chance to be selected before the other five queued users because these users will always check in one second later at :00, immediately consuming any newly released slots.</p><p>To address this vulnerability, we changed our implementation to ensure that slots are released randomly and encouraged users to check in at random offsets from each other. 
To help split up users that are checking in at similar times, we vary each user’s refreshIntervalSeconds by a small, pseudo-randomly generated offset for each check-in and store this new refresh interval in the encrypted Waiting Room cookie for validation on the next request. Thus, a user who previously checked in every 30 seconds might now check in after 29 seconds, then 31 seconds, then 27 seconds, and so on — but still averaging a 30-second refresh interval. Over time, these slight check-in variations become significant, spreading out user check-in times and strengthening the randomness of the queue. If you are curious to learn more about the apparent “randomness” behind mixing user check-in intervals, you can think of it as a <a href="https://en.wikipedia.org/wiki/Chaos_theory#Spontaneous_order">chaotic system</a> subjected to the <a href="https://en.wikipedia.org/wiki/Butterfly_effect">butterfly effect</a>.</p><p>Nevertheless, we weren’t convinced our efforts were enough and wanted to test random queueing empirically to validate its integrity. We conducted a simulation of 10,000 users joining a Waiting Room uniformly across 30 minutes. When let into the application, users spent approximately 1 minute “browsing” before they stopped checking in. We ran this experiment for both FIFO and random queueing and graphed each user’s observed wait time in seconds in the Waiting Room against the minute they initially arrived (starting from 0). Recall that users are grouped by minute using bucketIds, so each user’s arrival minute is truncated down to the current minute.</p>
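<p>The check-in jitter described above can be sketched in a few lines of Python (a toy illustration: the 30-second base interval and the offset range are our assumptions, not Cloudflare’s actual values):</p>

```python
import random

def next_refresh_interval(base_seconds=30, max_offset_seconds=3):
    # Vary the interval by a small pseudo-random offset on each check-in so
    # that users who joined at similar times drift apart over successive
    # refreshes, while still averaging roughly base_seconds between checks.
    return base_seconds + random.randint(-max_offset_seconds, max_offset_seconds)

# A user might check in after 29 seconds, then 31, then 27, and so on.
```

<p>Because each user’s offset is independent, two users who start perfectly synchronized quickly end up checking in at unrelated moments within the minute.</p>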
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5r72Jc8hoJptY4q59jOw7u/7cc2c4e1213d3ea04e88b648fceb266d/image10-2.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7F0FFNzDtgGkQqbwWUc6dm/63829b423bc953ef4f3b2bd3eae0b472/image7-7.png" />
            
            </figure><p>Based on our data, we can see immediately for FIFO queueing that, as the arrival minute increases, the observed wait time increases linearly. This makes sense for a FIFO queue, since the “line” will just get longer if there are more users entering the queue than leaving it. For each arrival minute, there is very little variation among user wait times, meaning that if you and your friend join a Waiting Room at approximately the same time, you will both be accepted around the same time. If you join a couple of minutes before your friend, you will almost always be accepted first.</p><p>When looking at the results for random queueing, we observe users experiencing varied wait times regardless of the arrival minute. This is expected, and helps prove the “randomness” of the random queue! We can see that, if you join five minutes after your friend, although your friend will have more chances to get in, you may still be accepted first! However, there are so many data points overlapping with each other in the plot that it is hard to tell how they are distributed. For instance, it could be possible that most of these data points experience extreme wait times, but as humans we aren’t able to tell.</p><p>As a result, we created heatmaps of these plots in Python using <a href="https://numpy.org/doc/stable/reference/generated/numpy.histogram2d.html">numpy.histogram2d</a> and displayed them with <a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html">matplotlib.pyplot</a>:</p>
            <pre><code>import json
import sys

import numpy as np
import matplotlib.pyplot as plt

# The simulation results file is passed as the first command-line argument.
filename = sys.argv[1]

with open(filename) as file:
    data = json.load(file)

# Parallel lists: each user's arrival minute and observed wait time in seconds.
x = data["ArrivalMinutes"]
y = data["WaitTimeSeconds"]

# Bin the (arrival minute, wait time) pairs into a 30x30 grid of counts.
heatmap, _, _ = np.histogram2d(x, y, bins=(30, 30))

plt.clf()
plt.title(filename)
plt.xlabel('Arrival Minute Buckets')
plt.ylabel('WaitTime Buckets')
# Transpose so arrival minutes run along the x-axis, with bin 0 at the origin.
plt.imshow(heatmap.T, origin='lower')
plt.show()</code></pre>
            <p>The heatmaps display where the data points are concentrated in the original plot, using brighter (hotter) colors to represent areas containing more points:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/F7MOVRWvAt4cnRWjtFv9q/5c8781a91232bf6ee1a6f755ae45745b/image2-13.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6FvI5MK2XVjrLW3RcDQym0/66be2046b23542bb707a1fdddc6be82c/image1-14.png" />
            
            </figure><p>By inspecting the generated heatmaps, we can conclude that FIFO and random queueing are working properly. For FIFO queueing, users are being accepted in the order they arrive. For random queueing, we can see that users are accepted to the origin regardless of arrival time. Overall, we can see the heatmap for random queueing is well distributed, indicating it is sufficiently random!</p><p>If you are curious why random queueing has very hot colors along the lowest wait times followed by very dark colors afterward, it is actually because of how we are simulating the queue. For the simulation, we spoofed the bucketIds of the users and let them all join the Waiting Room at once to see who would be let in first. In the random queueing heatmap, the bright colors along the lowest wait time buckets indicate that many users were accepted quickly after joining the queue across all bucketIds. This is expected, demonstrating that random queueing does not give an edge to users who join earlier, giving each user a fair chance regardless of its bucketId. These users were accepted almost immediately in WaitTime Bucket 0 because this simulation started with no users on the origin, meaning new users would be accepted until the Waiting Room limits were reached. Since this first wave of accepted users “browsed” on the origin for a minute before leaving, no additional users during this time were let in. Thus, the colors are very dark for WaitTime Buckets 1 and 2. Similarly, the second wave of users is randomly selected afterward, followed by another period of time when no users were accepted in WaitTime Bucket 5. As wait time increases, a user accumulates more and more attempts to be let in, making extreme wait times unlikely. We can see this by observing the colors grow darker as the WaitTime Bucket approaches 29.</p>
    <div>
      <h3>How Is Estimated Time Calculated for Random Queueing?</h3>
      <a href="#how-is-estimated-time-calculated-for-random-queueing">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/24BHsyypybgrHaVUFhdypo/9d4832d080b091ac74145ab48aeab534/image4-12.png" />
            
            </figure><p>In a random queue, you can be accepted at any moment… so how can you display an estimated wait time? For a particular user, this is an impossible task, but when you observe all the users together, you can accurately account for most user experiences using a probabilistic estimated wait time range.</p><p>At any given moment, we know:</p><ul><li><p><b>letInPerMinute</b>. The current average users per minute being let into the origin.</p></li><li><p><b>currentlyWaiting</b>. The current number of users waiting in the queue.</p></li></ul><p>Therefore, we can calculate the probability of a user being let into the origin in the next minute:</p><p>P(LetInOverMinute) = letInPerMinute / currentlyWaiting</p><p>If there are 100 users waiting in the queue, and we are currently letting in 10 users per minute, the probability a user will be let in over the next minute is 10 / 100 = .1 (10%).</p><p>Using P(LetInOverMinute), we can determine the n minutes needed for a p chance of being let into the origin:</p><p>p = 1 - (1 - P(LetInOverMinute))<sup>n</sup></p><p>Recall that the probability of getting in at least once is the complement of not getting in at all. The probability of not being let into the origin over n minutes is (1 - P(LetInOverMinute))<sup>n</sup>. Therefore, the probability of getting in at least once is 1 - (1 - P(LetInOverMinute))<sup>n</sup>. 
Solving this equation for n gives:</p><p>n = log(1 - p) / log(1 - P(LetInOverMinute))</p><p>Thus, if we want to calculate the estimated wait time to have a p = .5 (50%) chance of getting into the origin with the probability of getting let in during a particular minute P(LetInOverMinute) = .1 (10%), we calculate:</p><p>n = log(1 - .5) / log(1 - .1) ≈ 6.58 minutes or 6 minutes and 35 seconds</p><p>In this case, we estimate that 50% of users will wait less than 6 minutes and 35 seconds and the remaining 50% of users will wait longer than this.</p><p>So, which estimated wait times are displayed to the user? It is up to you! If you create a <a href="https://mustache.github.io/">Mustache</a> HTML template for a Waiting Room, you will now be able to use the variables waitTime25Percentile, waitTime50Percentile, and waitTime75Percentile to display the estimated wait times in minutes when p = .25, p = .5, and p = .75, respectively. There are also new variables that are used to display and determine the queueing method, such as queueingMethod, isFIFOQueue, and isRandomQueue. If you want to display something more dynamic like a custom view in a mobile app, keep reading to learn about our <a href="https://developers.cloudflare.com/waiting-room/how-to/mobile-traffic">new JSON response</a>, which provides a REST API for the same set of variables.</p>
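<p>The percentile calculation above fits in a few lines of Python (a sketch; the function name is ours, and the numbers are the ones from the example):</p>

```python
import math

def estimated_wait_minutes(p, let_in_per_minute, currently_waiting):
    # Chance of being accepted during any single minute.
    p_minute = let_in_per_minute / currently_waiting
    # Solve p = 1 - (1 - p_minute)**n for n.
    return math.log(1 - p) / math.log(1 - p_minute)

estimated_wait_minutes(0.5, 10, 100)  # ≈ 6.58 minutes (6 minutes 35 seconds)
```

<p>Evaluating the same function at p = .25, .5, and .75 yields the three percentile variables exposed to templates.</p>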
    <div>
      <h3>Supporting Dynamic Applications with a JSON Response</h3>
      <a href="#supporting-dynamic-applications-with-a-json-response">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5pkXTi5JsH7w7tvLlUePDA/3d57ea5dc0fc8be9c69f7ee05ffdec5e/image5-11.png" />
            
            </figure><p>Before, customers could only deploy static <a href="https://mustache.github.io/">Mustache</a> HTML templates to customize the style of their Waiting Rooms. These templates work well for most use cases, but fall short if you want to display anything that requires state. Let’s imagine you’re queueing to buy concert tickets on your mobile device, and you see an embedded video of your favorite song. Naturally, you click on it and start singing along! A couple seconds later, the browser refreshes the page automatically to update your status in the Waiting Room, resetting your video to the start.</p><p>The purpose of the <a href="https://developers.cloudflare.com/waiting-room/how-to/mobile-traffic">new JSON response</a> is to give full control to a custom application, allowing it to determine what to display to the user and when to refresh. As a result, the application can maintain state and make sure your videos are never interrupted again!</p><p>Once the JSON response is enabled for a Waiting Room, any request to the Waiting Room with the header <code>Accept: application/json</code> will receive a JSON object with all the fields from the Mustache template.</p><p>An example request when the queueing method is FIFO:</p>
            <pre><code>curl -X GET "https://example.com/waitingroom" \
    -H "Accept: application/json"
{
    "cfWaitingRoom": {
        "inWaitingRoom": true,
        "waitTimeKnown": true,
        "waitTime": 10,
        "waitTime25Percentile": 0,
        "waitTime50Percentile": 0,
        "waitTime75Percentile": 0,
        "waitTimeFormatted": "10 minutes",
        "queueIsFull": false,
        "queueAll": false,
        "lastUpdated": "2020-08-03T23:46:00.000Z",
        "refreshIntervalSeconds": 20,
        "queueingMethod": "fifo",
        "isFIFOQueue": true,
        "isRandomQueue": false
    }
}</code></pre>
            <p>An example request when the queueing method is random:</p>
            <pre><code>curl -X GET "https://example.com/waitingroom" \
    -H "Accept: application/json"
{
    "cfWaitingRoom": {
        "inWaitingRoom": true,
        "waitTimeKnown": true,
        "waitTime": 10,
        "waitTime25Percentile": 5,
        "waitTime50Percentile": 10,
        "waitTime75Percentile": 15,
        "waitTimeFormatted": "5 minutes to 15 minutes",
        "queueIsFull": false,
        "queueAll": false,
        "lastUpdated": "2020-08-03T23:46:00.000Z",
        "refreshIntervalSeconds": 20,
        "queueingMethod": "random",
        "isFIFOQueue": false,
        "isRandomQueue": true
    }
}</code></pre>
            <p>A few important reminders before you get started:</p><ol><li><p>Don’t forget that Waiting Room uses a cookie to maintain a user’s status! Without a cookie in the request, the Waiting Room will think the user has just joined the queue.</p></li><li><p>Don’t forget to refresh! Inspect the ‘Refresh’ HTTP response header or the refreshIntervalSeconds property and send another request to the Waiting Room after that number of seconds.</p></li><li><p>Keep in mind that if the user’s request is let into the origin, JSON may not necessarily be returned. To gracefully parse all responses, send JSON from the origin website if the header <code>Accept: application/json</code> is present. For example, the origin could return:</p></li></ol>
            <pre><code>{
	"cfWaitingRoom": {
		"inWaitingRoom": false
	},
	"authToken": "abcd"
}</code></pre>
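<p>Putting these reminders together, a client’s polling loop might look like the following sketch (the <code>wait_through</code> and <code>fetch</code> names are ours; <code>fetch</code> stands in for whatever HTTP client your application uses, and it must persist the Waiting Room cookie and send the <code>Accept: application/json</code> header):</p>

```python
import time

def wait_through(url, fetch, sleep=time.sleep):
    """Poll the Waiting Room until accepted.

    `fetch` performs a GET with the `Accept: application/json` header
    (reusing the Waiting Room cookie) and returns the parsed JSON body.
    """
    while True:
        body = fetch(url)
        room = body.get("cfWaitingRoom", {})
        if not room.get("inWaitingRoom"):
            # Accepted: this is the origin's own JSON response.
            return body
        # Honor the required interval before checking in again.
        sleep(room.get("refreshIntervalSeconds", 20))
```

<p>Because the loop only inspects <code>cfWaitingRoom.inWaitingRoom</code>, it handles both the queueing responses and the origin’s final JSON uniformly.</p>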
            
    <div>
      <h3>Embedding a Waiting Room in a Webpage: SameSite Cookies and IFrames</h3>
      <a href="#embedding-a-waiting-room-in-a-webpage-samesite-cookies-and-iframes">
        
      </a>
    </div>
    <p>What are <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie/SameSite">SameSite</a> cookies and <a href="https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe">IFrames</a>?</p><p>SameSite and Secure are attributes in the HTTP response Set-Cookie header. SameSite determines when cookies are sent to a website, while Secure indicates whether a secure context (HTTPS) is required.</p><p>There are three different values of SameSite:</p><ul><li><p><b>SameSite=Lax. </b> This is the default value when the SameSite attribute is not present. Cookies are not sent on cross-site sub-requests unless the user is following a link to the third-party site. If you are on <a href="http://example1.com/">example1.com</a>, cookies will not be sent to <a href="http://example2.com/">example2.com</a> unless you click a link that navigates to <a href="http://example2.com/">example2.com</a>.</p></li><li><p><b>SameSite=Strict.</b> Cookies are sent only in first-party contexts. If you are on <a href="http://example1.com/">example1.com</a>, cookies will never be sent to <a href="http://example2.com/">example2.com</a> even if you click a link that navigates to <a href="http://example2.com/">example2.com</a>.</p></li><li><p><b>SameSite=None.</b> Cookies are sent for all contexts, but the Secure attribute must be set. If you are on <a href="http://example1.com/">example1.com</a>, cookies will be sent to <a href="http://example2.com/">example2.com</a> for all sub-requests. If Secure is not set, the browser will block the cookie.</p></li></ul><p>IFrames (Inline Frames) allow HTML documents to embed other HTML documents, such as an advertisement, video, or webpage. When an application from a third-party website is rendered inside an IFrame, <b>cookies will only be sent to it if SameSite=None is set</b>.</p><p>Why is this all important? In the past, we did not set SameSite, meaning it defaulted to <b>SameSite=Lax</b> for all responses. 
As a result, a user queueing through an IFrame would never have its cookie updated and would therefore appear to the Waiting Room as joining for the first time on every request. Today, we are introducing customization for both the SameSite and Secure attributes, which will allow Waiting Rooms to be displayed in IFrames!</p><p>At the moment, this is only configurable through the <a href="https://developers.cloudflare.com/waiting-room/reference/waiting-room-api">Cloudflare API</a>. By default, the configuration for SameSite and Secure will be set to "auto", automatically selecting the most flexible option. In this case, SameSite will be set to None if <a href="/how-to-make-your-site-https-only/">Always Use HTTPS</a> is enabled, otherwise it will be set to Lax. Similarly, Secure will only be set if Always Use HTTPS is enabled. In other words, Waiting Room IFrames will work properly by default as long as Always Use HTTPS is toggled. If you are wondering why Always Use HTTPS is used here, remember that SameSite=None requires that Secure is also set, or else the browser will block the Waiting Room cookie.</p><p>If you decide to manually configure the behavior of SameSite and Secure through the API, be careful! We do guard against setting SameSite=None without Secure, but if you set Secure on every request (secure="always") and don’t have Always Use HTTPS enabled, a user who sends an insecure (HTTP) request to the Waiting Room will have its cookie blocked by the browser!</p><p>If you want to explore using IFrames with Waiting Room yourself, here is a simple example of a Cloudflare Worker that renders the Waiting Room on "<a href="https://example.com/waitingroom">https://example.com/waitingroom</a>" in an IFrame:</p>
            <pre><code>const html = `&lt;!DOCTYPE html&gt;
&lt;html&gt;
 &lt;head&gt;
   &lt;meta charset="UTF-8" /&gt;
   &lt;meta name="viewport" content="width=device-width,initial-scale=1" /&gt;
   &lt;title&gt;Waiting Room IFrame Example&lt;/title&gt;
 &lt;/head&gt;
 &lt;body&gt;
   &lt;h1&gt;Waiting Room IFrame!&lt;/h1&gt;
   &lt;iframe src="https://example.com/waitingroom" width="1200" height="700"&gt;&lt;/iframe&gt;
 &lt;/body&gt;
&lt;/html&gt;
`
 
addEventListener('fetch', event =&gt; {
 event.respondWith(handleRequest(event.request))
})
 
async function handleRequest(request) {
 return new Response(html, {
   headers: { "Content-Type": "text/html" },
 })
}</code></pre>
            
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6y2H55tauwVyQMONGeLoxW/d1d077d67a637f90ebfb91255c26bfb3/image6-10.png" />
            
            </figure>
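<p>As a rough illustration of the cookie configuration described above, the update payload might contain an object like the following sketch (the <code>cookie_attributes</code> field name and its values are our assumption based on this description; confirm the exact shape against the Waiting Room API reference):</p>

```python
# Illustrative fragment of a Waiting Room update payload. "auto" picks the
# most flexible option based on whether Always Use HTTPS is enabled.
payload = {
    "cookie_attributes": {
        "samesite": "auto",  # assumed alternatives: "lax", "none", "strict"
        "secure": "auto",    # assumed alternatives: "always", "never"
    },
}
```

<p>Remember the caveat above: forcing Secure on every request without Always Use HTTPS means insecure (HTTP) visitors will have their cookie blocked.</p>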
    <div>
      <h3>Looking Forward</h3>
      <a href="#looking-forward">
        
      </a>
    </div>
    <p>Waiting Room still has plenty of room to grow! Every day, we are seeing more Waiting Rooms deployed to protect websites from traffic spikes. As Waiting Room continues to be used for new purposes, we will keep adding features to make it as customizable and user-friendly as possible.</p><p>Stay tuned — what we have announced today is just the tip of the iceberg of what we have planned for Waiting Room!</p> ]]></content:encoded>
            <category><![CDATA[Waiting Room]]></category>
            <category><![CDATA[Project Fair Shot]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <guid isPermaLink="false">4NWxYwdKUHfyxG5jR0vLPM</guid>
            <dc:creator>Tyler Caslin</dc:creator>
        </item>
        <item>
            <title><![CDATA[Rich, complex rules for advanced load balancing]]></title>
            <link>https://blog.cloudflare.com/rich-complex-rules-for-advanced-load-balancing/</link>
            <pubDate>Fri, 16 Jul 2021 13:00:22 GMT</pubDate>
            <description><![CDATA[ Take control of your traffic by adding custom logic to your origin selection and traffic steering decisions! ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4kRo39XsUp3WJjldKkZhhZ/fb1970ec6417d47dbed3affadfc1c9b2/Cloudflarerd-updates-2.png" />
            
            </figure><p>Load balancing is functionality that has been around for the last 30 years, helping businesses leverage their existing infrastructure resources. Load balancing works by proactively steering traffic away from unhealthy origin servers and — for more advanced solutions — intelligently distributing traffic load based on different steering <a href="https://www.cloudflare.com/learning/performance/types-of-load-balancing-algorithms/">algorithms</a>. This process ensures that errors aren’t served to end users and empowers businesses to tightly couple overall business objectives to their traffic behavior.</p>
    <div>
      <h2>What’s important for load balancing today?</h2>
      <a href="#whats-important-for-load-balancing-today">
        
      </a>
    </div>
    <p>We are no longer in the age where setting up a fixed number of servers in a data center is enough to meet the massive growth of users browsing the Internet. This means that we are well past the time when a one-size-fits-all solution can satisfy the needs of different businesses. Today, customers look for load balancers that are easy to use, propagate changes quickly, and — especially now — provide the most feature flexibility. Feature flexibility has become so important because different businesses have different paths to success and, consequently, different challenges! Let’s go through a few common use cases:</p><ul><li><p>You might have an application split into microservices, where specific origins support segments of your application. You need to route your traffic based on specific paths to ensure no single origin can be overwhelmed and users get sent to the correct server to answer the originating request.</p></li><li><p>You may want to route traffic based on a specific value within a request header, such as “PS5”, and send requests to the data center matching that header.</p></li><li><p>If you heavily prioritize security and privacy, you may adopt a split-horizon DNS setup within your network architecture. You might choose this architecture to separate internal network requests from requests arriving from the public Internet. Then, you could route each type of request to pools specifically suited to handle the amount and type of traffic.</p></li></ul><p>As we continue to build new features and products, we also wanted to build a framework that would allow us to increase our velocity to add new items to our Load Balancing solution while we also take the time to create first-class features as well. 
The result was the creation of our custom rule builder!</p><p>Now you can build complex, custom rules to direct traffic using Cloudflare Load Balancing, empowering customers to create their own custom logic around their traffic steering and origin selection decisions. As we mentioned, there is no one-size-fits-all solution in today's world. We provide the tools to easily and quickly create rules that meet the exact requirements needed for any customer's unique situation and architecture. On top of that, we also support ‘and’ and ‘or’ statements within a rule, allowing very powerful and complex rules to be created for any situation!</p><p>Load balancing by path becomes easy, requiring just a few minutes to enter the paths and some boolean statements to create complex rules. Steering by a specific header, query string, or cookie is no longer a pain point. Leverage a split-horizon DNS design by creating a rule looking at the IP source address and then routing to the appropriate pool based on the value. This is just a small subset of the very robust capabilities that load balancing custom rules make available to our users, and this is just the start! Not only do we have a large amount of functionality right out of the box, but we’re also providing a consistent, intuitive experience by building on our Firewall Rules Engine.</p><p>Let’s go through some use cases to explore how custom rules can open new possibilities by giving you more granular control of your traffic.</p>
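As an illustration, a path-based condition in this rule language might look like the following. The exact field names and operators shown here are indicative only; consult the developer documentation for the schema available on your account:

```
(http.request.uri.path starts_with "/api/cart") or (http.request.uri.path starts_with "/api/checkout")
```

A rule with this condition could then override origin selection so cart and checkout traffic lands on the pool dedicated to that microservice.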
    <div>
      <h2>High-volume transactions for ecommerce</h2>
      <a href="#high-volume-transactions-for-ecommerce">
        
      </a>
    </div>
    <p>For any high-volume transaction business such as an ecommerce or retail store, ensuring the transactions go through as fast and reliably as possible is a table stakes requirement. As transaction volume increases, no single origin can handle the incoming traffic, and it doesn’t always make sense for it to do so. Why have a transaction request travel around the world to a specifically nominated origin for payment processing? This setup would only add latency, leading to degraded performance, increased errors, and a poor customer experience. But what if you could create custom logic to segment transactions to different origin servers based on a specific value in a query string, such as a PS5 (associated with Sony’s popular PlayStation 5)? What if you could then couple that value with dynamic latency steering to ensure your load balancer always chooses the most performant path to the origin? This would be game changing to not only ensure that table-stakes transactions are reliable and fast but also drastically improve the customer experience. You could do this in minutes with load balancing custom rules:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1svxayy53RBp3u5XDKRh0b/3608c2ec914e40e1ab1cb6a9f42ab641/image5-5.png" />
            
            </figure><p>For any requests where the query string shows ‘PS5’, then route based on which pool is the most performant.</p>
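Expressed in the rule language, the condition for this use case might look something like the fragment below (illustrative syntax only), with the rule's action set to a dynamic latency steering override:

```
http.request.uri.query contains "PS5"
```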
    <div>
      <h2>Load balance across multiple DNS vendors to support privacy and security</h2>
      <a href="#load-balance-across-multiple-dns-vendors-to-support-privacy-and-security">
        
      </a>
    </div>
    <p>Some customers may want to use multiple DNS providers to bolster their resiliency along with their security and privacy for the different types of traffic going through their network. By utilizing  two DNS providers, customers can not only be sure that they remain highly available in times of outages, but also direct different types of traffic, whether that be internal network traffic across offices or unknown traffic from the public Internet.</p><p>Without flexibility, however, it can be difficult to easily and intelligently route traffic to the proper data centers to maintain that security and privacy posture. Not anymore! With load balancing custom rules, supporting a split horizon DNS architecture takes as little as five minutes to set up a rule based on the IP source condition and then overwriting which pools or data centers that traffic should route to.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ElmM0Q5zkKM0KjXWqLr4K/704a2a3569bb7690c39a997ab60873dc/image2-4.png" />
            
            </figure><p>This can also be extremely helpful if your data centers are spread across multiple areas of the globe that don’t align with the 13 current regions within Cloudflare. By segmenting where traffic goes based on the IP source address, you can create a type of geo-steering setup that is also finely tuned to the requirements of the business!</p>
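As a sketch, a split-horizon condition matching internal (RFC 1918) source addresses might look like the fragment below (illustrative syntax; the actual field set is documented per account). The rule's override would then route matching traffic to the internal pool, while everything else falls through to the public-facing pools:

```
ip.src in {10.0.0.0/8 172.16.0.0/12 192.168.0.0/16}
```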
    <div>
      <h2>How did we build it?</h2>
      <a href="#how-did-we-build-it">
        
      </a>
    </div>
    <p>We built Load Balancing rules on top of our <a href="https://github.com/cloudflare/wirefilter">open-source wirefilter execution engine</a>. People familiar with Firewall Rules and other products will notice similar syntax since both products are built on top of this execution engine.</p><p>By reusing the same underlying engine, we can take advantage of a battle-tested production library used by other products that have performance and stability requirements of their own. For those experienced with our rule-based products, you can reuse your knowledge due to the shared syntax used to define conditional statements. For new users, the Wireshark-like syntax is often familiar and relatively simple.</p>
    <div>
      <h2>DNS vs Proxied?</h2>
      <a href="#dns-vs-proxied">
        
      </a>
    </div>
    <p>Our Load Balancer supports both DNS and Proxied load balancing. These two protocols operate very differently and as such are handled differently.</p><p>For <a href="https://www.cloudflare.com/learning/performance/what-is-dns-load-balancing/">DNS-based load balancing</a>, our load balancer responds to DNS queries sent from recursive resolvers. These resolvers are normally not the end user directly requesting the traffic, nor is there a 1-to-1 ratio between DNS queries and end-user requests. The DNS makes extensive use of caching at all levels, so the result of each query could potentially be used by thousands of users. Combined, this greatly limits the possible feature set for DNS. Since you don’t see the end user directly nor know if your response is going to be used by one or more users, all responses can only be customized to a limited degree.</p><p>Our Proxied load balancing, on the other hand, processes rules logic for every request going through the system. Since we act as a proxy for all these requests, we can invoke this logic for all requests and access user-specific data.</p><p>These different modes mean the fields available to each end up being quite different. The DNS load balancer gets access to DNS-specific fields such as “dns.qry.name” (the query name) while our Proxied load balancer has access to “http.request.method” (the HTTP method used to access the proxied resource). Some more general fields — like the name of the load balancer being used — are available in both modes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/NUhRkCfvb73WY8MTkGvfi/ff01e7523568b2d62acc15f916e45b82/Screenshot-2021-07-14-at-12.20.53.png" />
            
            </figure>
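Using the two mode-specific fields named above, a DNS-mode condition and a proxied-mode condition might look like the following (illustrative fragments only; each would be paired with an override action in its rule):

```
dns.qry.name == "shop.example.com"
http.request.method == "POST"
```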
    <div>
      <h2>How does it work under the hood?</h2>
      <a href="#how-does-it-work-under-the-hood">
        
      </a>
    </div>
    <p>When a load balancer rule is configured, that API call will validate that the conditions and actions of the rules are valid. It makes sure the condition only references known fields, isn’t excessively long, and is syntactically valid. The overrides are processed and applied to the load balancer’s configuration to make sure they won’t cause an invalid configuration. After validation, the new rule is saved to our database.</p><p>With the new rule saved, we take the load balancer’s data and all rules used by it and package that data together into one configuration to be shipped out to our edge. This process happens very quickly, so any changes are visible to you in just a few seconds.</p><p>While DNS and proxied load balancers have access to different fields and the protocols themselves are quite different, the two code paths overlap quite a bit. When either request type makes it to our load balancer, we first load up the load balancer specific configuration data from our edge datastore. This object contains all the “static” data for a load balancer, such as rules, origins, pools, steering policy, and so forth. We load dynamic data such as origin health and RTT data when evaluating each pool.</p><p>At the start of the load balancer processing, we run our rules. This ends up looking very much like a loop where we check each condition and — if the condition is true — we apply the effects specified by the rules. After each condition is processed and the effects are applied, we then run our normal load balancing logic as if you had configured the load balancer with the overridden settings. This style of applying each override in turn allows more than one rule to change a given setting multiple times during execution. This lets users avoid extremely long and specific conditionals and instead use shorter conditionals and rule ordering to override specific settings, creating a more modular ruleset.</p>
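The loop described above can be sketched in a few lines of JavaScript. This is purely illustrative: the names `applyRules`, `config.defaults`, and `rule.overrides` are hypothetical and do not reflect Cloudflare's actual internals.

```javascript
// Hypothetical sketch of the rule-processing loop described above.
// `evaluate` stands in for the wirefilter condition matcher.
function applyRules(config, request, evaluate) {
  // Start from the load balancer's static settings.
  let settings = { ...config.defaults };
  for (const rule of config.rules) {
    if (evaluate(rule.condition, request)) {
      // Apply this rule's overrides; a later rule may override the
      // same setting again, which keeps rulesets short and modular.
      settings = { ...settings, ...rule.overrides };
    }
  }
  // Normal load balancing logic then runs with the final settings.
  return settings;
}
```

Because overrides are applied in order, two short rules can each adjust one setting rather than one long conditional having to handle every case.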
    <div>
      <h2>What’s coming next?</h2>
      <a href="#whats-coming-next">
        
      </a>
    </div>
    <p>For you, the next steps are simple. Start building custom load balancing rules! For more guidance, check out our <a href="https://developers.cloudflare.com/load-balancing/understand-basics/load-balancing-rules">developer documentation</a>.</p><p>For us, we’re looking to expand this functionality. As this new feature develops, we are going to be identifying new fields for conditionals and new options for overrides to allow more specific behavior. As an example, we’ve been looking into exposing a means to creating more time-based conditionals, so users can create rules that only apply during certain times of the day or month. Stay tuned to the blog for more!</p> ]]></content:encoded>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">5Huo0GwA3T7Fm1oAfILLqs</guid>
            <dc:creator>Brian Batraski</dc:creator>
            <dc:creator>Tim Polich</dc:creator>
        </item>
        <item>
            <title><![CDATA[Per Origin Host Header Override]]></title>
            <link>https://blog.cloudflare.com/per-origin-host-header-override/</link>
            <pubDate>Fri, 09 Apr 2021 13:00:00 GMT</pubDate>
            <description><![CDATA[ Load Balancing as a concept is pretty straightforward. Take an existing infrastructure and route requests to the available origin servers so no single server is overwhelmed. Add in some health monitoring to ensure each server has a heartbeat/pulse so proactive decisions can be made.  ]]></description>
            <content:encoded><![CDATA[ <p>Load Balancing as a concept is pretty straightforward. Take an existing infrastructure and route requests to the available origin servers so no single server is overwhelmed. Add in some health monitoring to ensure each server has a heartbeat/pulse so proactive decisions can be made. With two steps, you get more effective utilization of your existing resources… simple enough!</p><p>As your application grows, however, load balancing becomes more complicated. An example of this — and the subject of this blog post — is how load balancing interacts with the Host header in an HTTP request.</p>
    <div>
      <h3>Host headers and load balancing</h3>
      <a href="#host-headers-and-load-balancing">
        
      </a>
    </div>
    <p>Every request to a website contains a unique piece of identifying information called the Host header. The Host header helps route each request to the correct origin server so the end user is sent the information they requested from the start.</p><p>For example, say that you enter <code>example.com</code> into the URL bar in your browser. You are sending a request to ‘example.com’ to send you back the homepage located <i>within</i> that application. To make sure you actually get resources from <code>example.com</code>, your browser includes a Host header of <code>example.com</code>. When that request reaches the back-end infrastructure, it finds the origin server that also has the host <code>example.com</code> and then sends back the required information.</p><p>Host headers are not only important for locating resources, but also for security. Imagine how jarring it would be if someone sent your request to <code>scary-example.com</code> and it returned malicious resources instead! If the Host header in the request does not match the Host header at the destination, then the request will fail.</p><p>This behavior can cause issues when you start using third-party applications like Google Compute Cloud, Google App Engine, Heroku, and Amazon S3. These applications are great, helping you deploy and scale applications. However, they have an important effect on the requirements for a Host header. Your server is no longer located at <code>example.com</code>. Instead, it’s probably at something like <code>example.herokuapp.com</code>. If you can’t change the Host header on your origin, users might be sending their requests to the wrong place, leading to error messages and poor user experience.</p><p>The combination of Host headers and third-party applications caused issues with our Load Balancing solution. The Host header of the request would not match the updated Host header for the new application now added into the request pipeline, meaning the request would fail. 
Larger customers were forced to choose between using third-party applications and load balancing… not a decision we wanted to force our customers to make!</p><p>We saw this to be a big problem and wanted to help our customers leverage different solutions in the market to ensure they are successful and can align their business objectives with their infrastructure.</p>
    <div>
      <h3>Introducing Origin Server Host Header Overrides</h3>
      <a href="#introducing-origin-server-host-header-overrides">
        
      </a>
    </div>
    <p>To address this problem, we’ve added support to override the Host header on a per-origin basis within our Load Balancing solution. This means that you don’t have to worry about sending errors back to your end-users and can seamlessly integrate apps — such as Heroku — into your reliability solutions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lHzhndIL0cghD1eIW5bp0/341af6fc0de25da3311ca71b99c7c806/image3-3.png" />
            
            </figure><p>We wanted to create a scalable solution, one that allowed you to perform these overrides in an easy and intuitive way. As we reviewed the problem, we saw that there was no one-size-fits-all solution. Different customers are going to have their apps architected differently and have different goals based on their business, industry, geography, etc. Thus, it was important to provide flexibility to override the Host header on a per-origin basis, since different origins can support different segments of an application or entirely different applications altogether! With a simple click, users can now update the Host header on their origins. When requests hit the Load Balancer, it reads the overridden value for the origin host instead of the default origin Host header value, and requests are routed properly without errors.</p><p>“<i>But what about my health monitors!?</i>” you may be thinking. When you add a Host header override on your origin, we will automatically apply it to the origin’s health monitor. No extra configuration is required. On top of that, we felt it was important to provide as much meaningful information as possible so you could make informed decisions around any configuration updates. When editing a health monitor, you will see a new table if any origins have a Host header override.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7xmbMjXqdFhPrxqTNQXnsJ/7c8231a41bdc43895aec5fbed1aa34b9/image2-2.png" />
            
            </figure>
    <div>
      <h3>How else can overriding the Host header help me?</h3>
      <a href="#how-else-can-overriding-the-host-header-help-me">
        
      </a>
    </div>
    <p>You might find the Host header override useful if you have a shared web server. In these situations, IP addresses are not uniquely assigned to each server, meaning you might need a Host header override to direct your request to the right domain. For example, you might have <code>example-grocery.com</code>, <code>example-furniture.com</code>, and <code>example-perfume.com</code> all on a shared web server hosted on the same IP address. When a request intended for <code>example-furniture.com</code> is forwarded to the server, the Host header override — <code>Host: example-furniture.com</code> — specifies which host to connect to.</p><p>Setting a Host header override also means that a strict HTTPS/TLS connection to the origin is enforced. We validate the Server Name Indication (SNI) to verify that the request will be forwarded to the appropriate website.</p>
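The shared-server scenario above can be sketched with a tiny routing table (a minimal illustration; the `sites` map and `route` function are hypothetical). All three sites share one IP address, so only the Host header distinguishes them:

```javascript
// Hypothetical sketch: one shared server, three sites, picked by Host header.
const sites = {
  'example-grocery.com': 'grocery homepage',
  'example-furniture.com': 'furniture homepage',
  'example-perfume.com': 'perfume homepage',
};

function route(request) {
  // The TCP connection reaches the shared IP either way; the Host
  // header alone determines which site's content is served.
  const host = request.headers['host'];
  return sites[host] || '404: unknown host';
}
```

A Host header override works the same way from the origin's point of view: it makes the request arrive with the name the origin expects, regardless of which address the load balancer connected to.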
    <div>
      <h3>How does it work under the hood?</h3>
      <a href="#how-does-it-work-under-the-hood">
        
      </a>
    </div>
    <p>To understand how it works, first let’s see what solutions we have currently. To configure a Load Balancer with a third-party cloud platform that requires a Host header, you would need to set a Host header override with Page Rules (Enterprise customers only). This solution works great if you have back-end origins origin1 and origin2 that expect the same Host header. But when the origins expect different Host headers, it wasn’t possible to set them per origin. There was a need for more granular flexibility, hence the per-origin Host header override.</p><p>Now, when you navigate to the Traffic tab and set up a Load Balancer, you can add a Host header override on your back-end origin. When you add the Host header, we do safety checks — which we’ll get to in a bit — and if it passes all the checks, then your configuration gets saved. If you set a Host header override on an origin, it will take precedence over a Host header override set on the monitor.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/pGJS2FVLjiyF6otz5yHiB/7b23bcaa6e45a1f68674dbe65d7f8bd8/image1-4.png" />
            
            </figure><p>Cloudflare has over 200 edge locations around the world, making it one of the largest Anycast networks. We make sure that your Load Balancer config information is available at all our Cloudflare edge locations. For example, you may have your website hosted on a third-party cloud provider like Heroku. You would set up a Load Balancer with pools and associated origin <code>example.com</code>. To reach <code>example.com</code> hosted on Heroku, you would set the Heroku URL <code>example.herokuapp.com</code> as the origin Host header: “Host: example.herokuapp.com”. When a request hits a Cloudflare edge, we would first check the load balancing config and the health monitor to determine the origin health status. Then, we would set the Host header override and Server Name Indication (SNI). For more about SNI, visit our <a href="https://www.cloudflare.com/learning/ssl/what-is-sni/">learning center</a>.</p><p>There are some restrictions that limit the domain set as the Host header override on an origin:</p><ol><li><p>The only HTTP header that can be overridden is “Host”, and no other.</p></li><li><p>You can specify only one “Host” header per origin; duplicate headers and line-wrapped or indented values are not allowed.</p></li><li><p>No ports can be added to the Host header.</p></li><li><p>We allow fully qualified domain names (FQDN) and IP addresses that can be resolved by public DNS.</p></li><li><p>The FQDN in the Host header must be a subdomain of a zone associated with the account, which is applicable for partial zones and secondary zones.</p></li></ol><p>These requirements make sure you are directing requests to domains that belong to you. 
With future development, we may relax some restrictions.</p><p>If you want to understand more about Load Balancing on Cloudflare, visit our <a href="https://www.cloudflare.com/load-balancing/">product</a> page or look at our help articles, such as <a href="https://support.cloudflare.com/hc/en-us/articles/205893698-Configure-Cloudflare-and-Heroku-over-HTTPS">Configure Cloudflare &amp; Heroku</a>.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <guid isPermaLink="false">5PrsuOAafzdbuI12gTvuSH</guid>
            <dc:creator>Brian Batraski</dc:creator>
            <dc:creator>Roopa Chandrashekar</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Project Fair Shot: Ensuring COVID-19 Vaccine Registration Sites Can Keep Up With Demand]]></title>
            <link>https://blog.cloudflare.com/project-fair-shot/</link>
            <pubDate>Fri, 22 Jan 2021 14:01:00 GMT</pubDate>
            <description><![CDATA[ Project Fair Shot provides Cloudflare's new Waiting Room service for free for any government, municipality, hospital, pharmacy, or other organization responsible for distributing COVID-19 vaccines. ]]></description>
            <content:encoded><![CDATA[ <p>Around the world, government and medical organizations are struggling with one of the most difficult logistics challenges in history: equitably and efficiently distributing the COVID-19 vaccine. There are challenges around communicating who is eligible to be vaccinated, registering those who are eligible for appointments, ensuring they show up for their appointments, transporting the vaccine under the required handling conditions, ensuring that there are trained personnel to administer the vaccine, and then doing it all over again as most of the vaccines require two doses.</p><p>Cloudflare can't help with most of that problem, but there is one key part that we realized we could help facilitate: ensuring that registration websites don't crash under load when they first begin scheduling vaccine appointments. Project Fair Shot provides Cloudflare's new Waiting Room service for free for any government, municipality, hospital, pharmacy, or other organization responsible for distributing COVID-19 vaccines. It is open to eligible organizations around the world and will remain free until at least July 1, 2021, or longer if there is still more demand for appointments for the vaccine than there is supply.</p>
    <div>
      <h3>Crashing Registration Websites</h3>
      <a href="#crashing-registration-websites">
        
      </a>
    </div>
    <p>The problem of vaccine scheduling registration websites crashing under load isn't theoretical: it is happening over and over as organizations attempt to schedule the administration of the vaccine. This hit home at Cloudflare last weekend. The wife of one of our senior team members was trying to register her parents to receive the vaccine. They met all the criteria and the municipality where they lived was scheduled to open appointments at noon.</p><p>When the time came for the site to open, it immediately crashed. The cause wasn't hackers or malicious activity. It was merely that so many people were trying to access the site at once. "Why doesn't Cloudflare build a service that organizes a queue into an orderly fashion so these sites don't get overwhelmed?" she asked her husband.</p>
    <div>
      <h3>A Virtual Waiting Room</h3>
      <a href="#a-virtual-waiting-room">
        
      </a>
    </div>
    <p>Turns out, we were already working on such a feature, but not for this use case. The problem of fairly distributing something where there is more demand than supply comes up with several of our clients. Whether selling tickets to a hot concert, the latest new sneaker, or access to popular national park hikes, it is a difficult challenge to ensure that everyone eligible has a fair chance.</p><p>The solution is to open registration to acquire the scarce item ahead of the actual sale. Anyone who visits the site ahead of time can be put into a queue. The moment before the sale opens, the order of the queue can be randomly (and fairly) shuffled. People can then be let in in order of their new, random position in the queue — letting in only as many at any time as the backend of the site can handle.</p><p>At Cloudflare, we were building this functionality for our customers as a feature called Waiting Room. (You can <a href="/cloudflare-waiting-room">learn more about the technical details of Waiting Room in this post by Brian Batraski</a> who helped build it.) The technology is powerful because it can be used in front of any existing web registration site without needing any code changes or hardware installation. Simply deploy Cloudflare through a simple DNS change and then configure Waiting Room to ensure any transactional site, no matter how meagerly resourced, can keep up with demand.</p>
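The shuffle-then-admit idea can be sketched as follows. This is a minimal illustration, not Waiting Room's actual implementation; `fairAdmit` and `batchSize` are hypothetical names:

```javascript
// Illustrative sketch: shuffle early registrants fairly, then admit
// only as many at a time as the backend can handle.
function fairAdmit(registrants, batchSize) {
  const queue = [...registrants];
  // Fisher–Yates shuffle: every registrant gets an equal chance
  // at every queue position.
  for (let i = queue.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [queue[i], queue[j]] = [queue[j], queue[i]];
  }
  // Split the shuffled queue into admission batches.
  const batches = [];
  for (let i = 0; i < queue.length; i += batchSize) {
    batches.push(queue.slice(i, i + batchSize));
  }
  return batches;
}
```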
    <div>
      <h3>Recognizing a Critical Need; Moving Up the Launch</h3>
      <a href="#recognizing-a-critical-need-moving-up-the-launch">
        
      </a>
    </div>
    <p>We planned to release it in February. Then, when we saw vaccine sites crashing under load and frustration building among people eligible for the vaccine, we realized we needed to move the launch up and offer the service for free to organizations struggling to fairly distribute the vaccine. With that, Project Fair Shot was born.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7uLLKSoVvU8DUWVz01jPSS/86836f828b64ad2b022e2e189d045887/Project-fair-shot-icon.png" />
            
            </figure><p>Government, municipal, hospital, pharmacy, clinic, and any other organizations charged with scheduling appointments to distribute the vaccine can apply to participate in Project Fair Shot by visiting: <a href="https://projectfairshot.org">projectfairshot.org</a></p>
    <div>
      <h3>Giving Front Line Organizations the Technical Resources They Need</h3>
      <a href="#giving-front-line-organizations-the-technical-resources-they-need">
        
      </a>
    </div>
    <p>The service will be free for qualified organizations at least until July 1, 2021 or longer if there is still more demand for appointments for the vaccine than there is supply. We are not experts in medical cold storage and I get squeamish at the sight of needles, so we can't help with many of the logistical challenges of distributing the vaccine. But, seeing how we could support this aspect, our team knew we needed to do all we could to help.</p><p>The superheroes of this crisis are the medical professionals who are taking care of the sick and the scientists who so quickly invented these miraculous vaccines. We're proud of the supporting role Cloudflare has played helping ensure the Internet has continued to function well when the world needed it most. Project Fair Shot is one more way we are living up to our mission of helping build a better Internet.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[COVID-19]]></category>
            <category><![CDATA[Free]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[Project Fair Shot]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <guid isPermaLink="false">3glQ2MLiiKs61RP6DiGdGt</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Waiting Room]]></title>
            <link>https://blog.cloudflare.com/cloudflare-waiting-room/</link>
            <pubDate>Fri, 22 Jan 2021 14:00:00 GMT</pubDate>
            <description><![CDATA[ Today, we are excited to announce Cloudflare Waiting Room! It will be first available to select customers through a new program called Project Fair Shot, with general availability in our Business and Enterprise plans in the near future.  ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LySWqjpMJS9ugWRRi5gPv/e2cd9d09980b05f5585ee65574e88b7f/image3-14.png" />
            
            </figure><p>Today, we are excited to announce Cloudflare Waiting Room! It will first be available to select customers through a new program called <a href="/project-fair-shot/">Project Fair Shot</a> which aims to help with the problem of overwhelming demand for COVID-19 vaccinations causing appointment registration websites to fail. General availability in our Business and Enterprise plans will be added in the near future.</p>
    <div>
      <h3>Wait, you’re excited about a… Waiting Room?</h3>
      <a href="#wait-youre-excited-about-a-waiting-room">
        
      </a>
    </div>
    <p>Most of us are familiar with the concept of a waiting room, and rarely are we excited about the idea of being in one. Usually our first experience of one is at a doctor’s office — yes, you have an appointment, but sometimes the doctor is running late (or one of the patients was). Given the doctor can only see one person at a time… the waiting room was born, as a mechanism to queue up patients.</p><p>While servers can handle more concurrent requests than a doctor can, they too can be overwhelmed. If, in a pre-COVID world, you’ve ever tried buying tickets to a popular concert or event, you’ve probably encountered a waiting room online. It limits requests inbound to an application, and places these requests into a virtual queue. Once the number of users in the application has decreased, new users are let in within the thresholds the application can handle. This protects the origin servers supporting the application from being inundated with too many requests, while also ensuring equity from a user perspective — users who try to access a resource when the system is overloaded are not unfairly dropped and forced to reconnect and hope for another chance in the queue.</p>
    <div>
      <h3>Why Now?</h3>
      <a href="#why-now">
        
      </a>
    </div>
    <p>Given not many of us are going to live concerts any time soon, why is Cloudflare doing this now?</p><p>Well, perhaps we aren’t going to concerts, but the second-order effects of COVID-19 have created a huge need for waiting rooms. First of all, given social distancing and the closing of many places of business and government, customers and citizens have shifted to online channels, putting substantially more strain on business and government infrastructure.</p><p>Second, the pandemic and its flow-on consequences have meant that many folks around the world have come to rely on resources that they didn’t need twelve months earlier. To be specific, these are often health or government-related resources — for example, unemployment insurance websites. The online infrastructure was set up to handle a peak load that didn’t foresee the impact of COVID-19. We’re seeing a similar pattern emerge with websites that are related to vaccines.</p><p>Historically, the number of organizations that needed waiting rooms was quite small. The nature of most businesses online usually involves a more consistent user load, rather than huge crushes of people all at once. Those organizations built custom waiting rooms that were integrated deeply into their applications (for example, buying tickets). With Cloudflare’s Waiting Room, no changes to the application are necessary: a Waiting Room can be set up for any website in a matter of minutes, without writing a single line of code.</p><p>Whether you are an engineering architect or a business operations analyst, setting up a Waiting Room is simple. We make it quick and easy to ensure your applications are reliable and protected from unexpected spikes in traffic. Other features we felt were important are automatic enablement and dynamic outflow. 
In other words, a waiting room should turn on automatically when thresholds are exceeded and, as users finish their tasks in the application, release appropriately sized batches of queued users into it. It should just work. Lastly, we’ve seen the major impact COVID-19 has made on users and businesses alike, especially, but not limited to, the <a href="/project-fair-shot">health and government sectors</a>. We wanted to provide another way to ensure these applications remain available and functional, so all users can receive the care they need rather than errors in their browser.</p>
    <div>
      <h3>How does Cloudflare’s Waiting Room work?</h3>
      <a href="#how-does-cloudflares-waiting-room-work">
        
      </a>
    </div>
    <div></div><p>We built Waiting Room on top of our edge network and our Workers product. By leveraging Workers and our new <a href="/introducing-workers-durable-objects/">Durable Objects</a> offerings, we were able to remove the need for any customer coding and provide a seamless, out-of-the-box product that ‘just works’. On top of this, we get the benefits of the scale and performance of our Workers product to ensure we maintain extremely low latency overhead, keep the estimated wait times presented to end users as accurate as possible, and avoid keeping any user in the queue longer than needed. But building a centralized system in a decentralized network is no easy task. When requests come into an application from around the world, we need to be able to get a broad, accurate view of what that load looks like inbound and outbound to a given application.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/58SzyjWqUnxbAHlfw2NQlN/8b297e6b82cb4b5a373fa6ef502766af/image7-4.png" />
            
            </figure><p>Request going through Cloudflare without a Waiting Room</p><p>These requests, as fast as they are, still take time to travel across the planet. And so a unique edge case presented itself. What if a website is getting reasonable traffic from North America and Europe, but then sees a sudden major spike of traffic from South America? How do we know when to keep letting users into the application, and when the Waiting Room should kick in to protect the origin servers from being overloaded?</p><p>Thanks to some clever engineering and our Workers product, we were able to create a system that almost immediately keeps itself synced with global demand for an application, giving us the necessary insight into when we should and should not be queueing users into the Waiting Room. By leveraging our global Anycast network and more than 200 data centers, we remove any single point of failure to protect our customers' infrastructure, while also providing a great experience to end users, who only have to wait a short time to enter the application under high load.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5A4dbEqx7QljSUT9sCrHDw/0927fd2fecf6d10cd9daf1615d2288f0/image.png" />
            
            </figure><p>Request going through Cloudflare with a Waiting Room</p>
    <div>
      <h3>How to setup a Waiting Room</h3>
      <a href="#how-to-setup-a-waiting-room">
        
      </a>
    </div>
    <p>Setting up a Waiting Room is incredibly easy and very fast! At the simplest end of the scale, a user needs to fill out only five fields: 1) the name of the Waiting Room, 2) a hostname (which will already be pre-populated with the zone it’s being configured on), 3) the total active users that can be in the application at any given time, 4) the new users per minute allowed into the application, and 5) the session duration for any given user. No coding or any application changes are necessary.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7jvd256hBoik5Q1vOrHaY7/eb3d44f17ed464c10a4a87a7ff596468/image2-10.png" />
            
            </figure><p>We provide the option of using our default Waiting Room template for customers who don’t want to add additional branding. This simplifies the process of getting a Waiting Room up and running.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/47mGOhBbYguIY0mR2PNqA6/f6dbb1eb9e291c4dab9f3c7674f74598/image4-13.png" />
            
            </figure><p>That’s it! Press save and the Waiting Room is ready to go!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3SaV701UPb0yPHAsZveNvh/a4502cdcd5a1d5795245cfd37a148968/image1-13.png" />
            
            </figure><p>For customers with more time and technical ability, the same process is followed, except we give full customization capabilities to our users so they can brand the Waiting Room, ensuring it matches the look and feel of their overall product.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57kJYj33S42WCATXuhDWPm/cbdd07a98ea75dcb7c56b7643a03f971/image8-6.png" />
            
            </figure><p>Lastly, managing different Waiting Rooms is incredibly easy. With our Manage Waiting Room table, you can see at a glance which rooms are actively queueing, not queueing, or disabled.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59ZVBRBMdPw3S1ZNxCy8N/b60b0daa36e701ea50a2f3246e00270b/image5-6.png" />
            
            </figure><p>We are very excited to put the power of our <a href="https://www.cloudflare.com/waiting-room/">Waiting Room</a> into the hands of our customers to ensure they continue to focus on their businesses and customers. Keep an eye out for another blog post coming soon with major updates to our Waiting Room product for Enterprise!</p> ]]></content:encoded>
            <category><![CDATA[Project Fair Shot]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[COVID-19]]></category>
            <category><![CDATA[Waiting Room]]></category>
            <guid isPermaLink="false">12yEIFZBDJjbYa9zqvEhbl</guid>
            <dc:creator>Brian Batraski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Unimog - Cloudflare’s edge load balancer]]></title>
            <link>https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/</link>
            <pubDate>Wed, 09 Sep 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Unimog is the Layer 4 Load Balancer for Cloudflare’s edge data centers.  This post explains the problems it solves and how it works. ]]></description>
            <content:encoded><![CDATA[ <p>As the scale of Cloudflare’s edge network has grown, we sometimes reach the limits of parts of our architecture. About two years ago we realized that our existing solution for spreading load within our data centers could no longer meet our needs. We embarked on a project to deploy a <i>Layer 4 Load Balancer</i>, internally called <i>Unimog</i>, to improve the reliability and operational efficiency of our edge network. Unimog has now been deployed in production for over a year.</p><p>This post explains the problems Unimog solves and how it works. Unimog builds on techniques used in other Layer 4 Load Balancers, but there are many details of its implementation that are tailored to the needs of our edge network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/44EsYFcb0zDeaXqYUX6Jxg/3bb5afc4bd624ccb66e53cd091b8bed9/image3-1.png" />
            
            </figure>
    <div>
      <h3>The role of Unimog in our edge network</h3>
      <a href="#the-role-of-unimog-in-our-edge-network">
        
      </a>
    </div>
    <p>Cloudflare operates an anycast network, meaning that our data centers in <a href="https://www.cloudflare.com/network/">200+ cities</a> around the world serve the same IP addresses. For example, our own cloudflare.com website uses Cloudflare services, and one of its IP addresses is 104.17.175.85. All of our data centers will accept connections to that address and respond to HTTP requests. By the magic of Internet routing, when you visit cloudflare.com and your browser connects to 104.17.175.85, your connection will usually go to the closest (and therefore fastest) data center.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ADAIlOzgQE4V4C7yHERdq/5ae28f8121ca6d39d79acc1a73733d42/internet-2.png" />
            
            </figure><p>Inside those data centers are many servers. The number of servers in each varies greatly (the biggest data centers have a hundred times more servers than the smallest ones). The servers run the <a href="https://www.cloudflare.com/application-services/">application services</a> that implement our products (our caching, DNS, WAF, DDoS mitigation, Spectrum, WARP, etc). Within a single data center, any of the servers can handle a connection for any of our services on any of our anycast IP addresses. This uniformity keeps things simple and avoids bottlenecks.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1653XjnZj666YAdzKgPcCN/d0146fc7d280e556ca4d36cfddbdd5ca/colo-4.png" />
            
            </figure><p>But if any server within a data center can handle any connection, when a connection arrives from a browser or some other client, what controls which server it goes to? That’s the job of Unimog.</p><p>There are two main reasons why we need this control. The first is that we regularly move servers in and out of operation, and servers should only receive connections when they are in operation. For example, we sometimes remove a server from operation in order to perform maintenance on it. And sometimes servers are automatically removed from operation because health checks indicate that they are not functioning correctly.</p><p>The second reason concerns the management of the load on the servers (by load we mean the amount of computing work each one needs to do). If the load on a server exceeds the capacity of its hardware resources, then the quality of service to users will suffer. The performance experienced by users degrades as a server approaches saturation, and if a server becomes sufficiently overloaded, users may see errors. We also want to prevent servers being underloaded, which would reduce the value we get from our investment in hardware. So Unimog ensures that the load is spread across the servers in a data center. This general idea is called load balancing (<i>balancing</i> because the work has to be done somewhere, and so for the load on one server to go down, the load on some other server must go up).</p><p>Note that in this post, we’ll discuss how Cloudflare balances the load on its own servers in edge data centers. But load balancing is a requirement that occurs in many places in distributed computing systems. Cloudflare also has a Layer 7 <a href="https://www.cloudflare.com/load-balancing/">Load Balancing</a> product to allow our customers to balance load across their servers. 
And Cloudflare uses load balancing in other places <a href="/high-availability-load-balancers-with-maglev/">internally</a>.</p><p>Deploying Unimog led to a big improvement in our ability to balance the load on our servers in our edge data centers. Here’s a chart for one data center, showing the difference due to Unimog. Each line shows the processor utilization of an individual server (the colour of the lines indicates <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">server model</a>). The load on the servers varies during the day with the activity of users close to this data center. The white line marks the point when we enabled Unimog. You can see that after that point, the load on the servers became much more uniform. We saw similar results when we deployed Unimog to our other data centers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ysIFd2DX2oOFyrujMX8OV/a66dc6b3fdabe9c3015887fae118efcb/image1.png" />
            
            </figure>
    <div>
      <h2>How Unimog compares to other load balancers</h2>
      <a href="#how-unimog-compares-to-other-load-balancers">
        
      </a>
    </div>
    <p>There are a variety of techniques for load balancing. Unimog belongs to a category called <i>Layer 4 Load Balancers (L4LBs)</i>. L4LBs direct packets on the network by inspecting information up to layer 4 of the OSI network model, which distinguishes them from the more common <i>Layer 7 Load Balancers</i>.</p><p>The advantage of L4LBs is their efficiency. They direct packets without processing the payload of those packets, so they avoid the overheads associated with higher-level protocols. For any load balancer, it’s important that the resources consumed by the load balancer are low compared to the resources devoted to useful work. At Cloudflare, we already pay close attention to the efficient implementation of our services, and that sets a high bar for the load balancer that we put in front of those services.</p><p>The downside of L4LBs is that they can only control which connections go to which servers. They cannot modify the data going over the connection, which prevents them from participating in higher-level protocols like TLS, HTTP, etc. (in contrast, Layer 7 Load Balancers act as proxies, so they can modify data on the connection and participate in those higher-level protocols).</p><p>L4LBs are not new. They are mostly used at companies whose scaling needs would be hard to meet with L7LBs alone. Google has published about <a href="https://research.google.com/pubs/archive/44824.pdf">Maglev</a>, Facebook open-sourced <a href="https://engineering.fb.com/open-source/open-sourcing-katran-a-scalable-network-load-balancer/">Katran</a>, and GitHub has open-sourced its <a href="https://github.blog/2018-08-08-glb-director-open-source-load-balancer/">GLB</a>.</p><p>Unimog is the L4LB that Cloudflare has built to meet the needs of our edge network. It shares features with other L4LBs, and it is particularly strongly influenced by GLB. 
But there are some requirements that were not well-served by existing L4LBs, leading us to build our own:</p><ul><li><p>Unimog is designed to run on the same general-purpose servers that provide application services, rather than requiring a separate tier of servers dedicated to load balancing.</p></li><li><p>It performs dynamic load balancing: measurements of server load are used to adjust the number of connections going to each server, in order to accurately balance load.</p></li><li><p>It supports long-lived connections that remain established for days.</p></li><li><p>Virtual IP addresses are managed as ranges (Cloudflare serves hundreds of thousands of IPv4 addresses on behalf of our customers, so it is impractical to configure these individually).</p></li><li><p>Unimog is tightly integrated with our existing DDoS mitigation system, and the implementation relies on the same XDP technology in the Linux kernel.</p></li></ul><p>The rest of this post describes these features and the design and implementation choices that follow from them in more detail.</p><p>For Unimog to balance load, it’s not enough to send the same (or approximately the same) number of connections to each server, because the performance of our servers varies. We regularly update our server hardware, and we’re now on our <a href="/cloudflares-gen-x-servers-for-an-accelerated-future/">10th generation</a>. Once we deploy a server, we keep it in service for as long as it is cost effective, and the lifetime of a server can be several years. It’s not unusual for a single data center to contain a mix of server models, due to expansion and upgrades over time. Processor performance has increased significantly across our server generations. 
So within a single data center, we need to send different numbers of connections to different servers to utilize the same percentage of their capacity.</p><p>It’s also not enough to give each server a fixed share of connections based on static estimates of their capacity. Not all connections consume the same amount of CPU. And there are other activities running on our servers and consuming CPU that are not directly driven by connections from clients. So in order to accurately balance load across servers, Unimog does <i>dynamic load balancing</i>: it takes regular measurements of the load on each of our servers, and uses a control loop that increases or decreases the number of connections going to each server so that their loads converge to an appropriate value.</p>
    <div>
      <h2>Refresher: TCP connections</h2>
      <a href="#refresher-tcp-connections">
        
      </a>
    </div>
    <p>The relationship between TCP packets and connections is central to the operation of Unimog, so we’ll briefly describe that relationship.</p><p>(Unimog supports UDP as well as TCP, but for clarity most of this post will focus on the TCP support. We explain how UDP support differs towards the end.)</p><p>Here is the outline of a TCP packet:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1uN86MCb8Xmcxhd8Eupv0R/5c82ac004e4fe1f80c6ba68ea49186d0/image12.png" />
            
            </figure><p>The TCP connection that this packet belongs to is identified by the four labelled header fields, which span the IPv4/IPv6 (i.e. layer 3) and TCP (i.e. layer 4) headers: the source and destination addresses, and the source and destination ports. Collectively, these four fields are known as the 4-tuple. When we say that Unimog sends a connection to a server, we mean that all the packets with the 4-tuple identifying that connection are sent to that server.</p><p>A TCP connection is established via a three-way handshake between the client and the server handling that connection. Once a connection has been established, it is crucial that all the incoming packets for that connection go to that same server. If a TCP packet belonging to the connection is sent to a different server, that server will signal to the client, with a TCP RST (reset) packet, that it doesn’t know about the connection. Upon receiving this notification, the client terminates the connection, probably resulting in the user seeing an error. So a misdirected packet is much worse than a dropped packet. As usual, we consider the network to be unreliable, and it’s fine for occasional packets to be dropped. But even a single misdirected packet can lead to a broken connection.</p><p>Cloudflare handles a wide variety of connections on behalf of our customers. Many of these connections carry HTTP, and are typically short-lived. But some HTTP connections are used for websockets, and can remain established for hours or days. Our Spectrum product supports arbitrary TCP connections. TCP connections can be terminated or stall for <a href="/when-tcp-sockets-refuse-to-die/">many reasons</a>, and ideally all applications that use long-lived connections would be able to reconnect transparently, and applications would be designed to support such reconnections. But not all applications and protocols meet this ideal, so we strive to maintain long-lived connections. 
Unimog can maintain connections that last for many days.</p>
    <div>
      <h3>Forwarding packets</h3>
      <a href="#forwarding-packets">
        
      </a>
    </div>
    <p>The previous section described that the function of Unimog is to steer connections to servers. We’ll now explain how this is implemented.</p><p>To start with, let’s consider how one of our data centers might look without Unimog or any other load balancer. Here’s a conceptual view:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1sGaljcZUwny7RxYvQzF0z/4d15981b1a6d5442cadf99c8c789a857/colo-simple.png" />
            
            </figure><p>Packets arrive from the Internet, and pass through the router, which forwards them on to servers (in reality there is usually additional network infrastructure between the router and the servers, but it doesn’t play a significant role here so we’ll ignore it).</p><p>But is such a simple arrangement possible? Can the router spread traffic over servers without some kind of load balancer in between? Routers have a feature called ECMP (equal cost multipath) routing. Its original purpose is to allow traffic to be spread across multiple paths between two locations, but it is commonly repurposed to spread traffic across multiple servers within a data center. In fact, Cloudflare relied on ECMP alone to spread load across servers before we deployed Unimog. ECMP uses a hashing scheme to ensure that packets on a given connection use the same path (Unimog also employs a hashing scheme, so we’ll discuss how this can work in further detail below). But ECMP is vulnerable to changes in the set of active servers, such as when servers go in and out of service. These changes cause rehashing events, which break connections to all the servers in an ECMP group. Also, routers impose limits on the sizes of ECMP groups, which means that a single ECMP group cannot cover all the servers in our larger edge data centers. Finally, ECMP does not allow us to do dynamic load balancing by adjusting the share of connections going to each server. These drawbacks mean that ECMP alone is not an effective approach.</p><p>Ideally, to overcome the drawbacks of ECMP, we could program the router with the appropriate logic to direct connections to servers in the way we want. 
But although programmable network data planes have been a hot research topic in recent years, commodity routers are still essentially fixed-function devices.</p><p>We can work around the limitations of routers by having the router send the packets to some load balancing servers, and then programming those load balancers to forward packets as we want. If the load balancers all act on packets in a consistent way, then it doesn’t matter which load balancer gets which packets from the router (so we can use ECMP to spread packets across the load balancers). That suggests an arrangement like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1QsaXd0zJOinv3P3kXgFRm/9baef06d0e1fe4e8106bc1029288329e/colo-lb.png" />
            
            </figure><p>And indeed L4LBs are often deployed like this.</p><p>Instead, Unimog makes <i>every</i> server into a load balancer. The router can send any packet to any server, and that initial server will forward the packet to the right server for that connection:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/77gUuMMXWbIGKclOV82dLp/f89ee9080e094307f1a8b0e8a0a01e6d/colo-unimog.png" />
            
            </figure><p>We have two reasons to favour this arrangement:</p><p>First, in our edge network, we avoid specialised roles for servers. We run the same software stack on the servers in our edge network, providing all of our product features, whether <a href="https://www.cloudflare.com/learning/ddos/how-to-prevent-ddos-attacks/">DDoS attack prevention</a>, website performance features, Cloudflare Workers, WARP, etc. This uniformity is key to the efficient operation of our edge network: we don’t have to manage how many load balancers we have within each of our data centers, because all of our servers act as load balancers.</p><p>The second reason relates to stopping attacks. Cloudflare’s edge network is the target of incessant attacks. Some of these attacks are volumetric - large packet floods which attempt to overwhelm the ability of our data centers to process network traffic from the Internet, and so impact our ability to service legitimate traffic. To successfully mitigate such attacks, it’s important to filter out attack packets as early as possible, minimising the resources they consume. This means that our attack mitigation system needs to occur <i>before</i> the forwarding done by Unimog. That mitigation system is called l4drop, and <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">we’ve written about it before</a>. l4drop and Unimog are closely integrated. Because l4drop runs on all of our servers, and because l4drop comes before Unimog, it’s natural for Unimog to run on all of our servers too.</p>
    <div>
      <h3>XDP and xdpd</h3>
      <a href="#xdp-and-xdpd">
        
      </a>
    </div>
    <p>Unimog implements packet forwarding using a Linux kernel facility called <a href="https://en.wikipedia.org/wiki/Express_Data_Path"><i>XDP</i></a>. XDP allows a program to be attached to a network interface, and the program gets run for every packet that arrives, before it is processed by the kernel’s main network stack. The XDP program returns an action code to tell the kernel what to do with the packet:</p><ul><li><p>PASS: Pass the packet on to the kernel’s network stack for normal processing.</p></li><li><p>DROP: Drop the packet. This is the basis for l4drop.</p></li><li><p>TX: Transmit the packet back out of the network interface. The XDP program can modify the packet data before transmission. This action is the basis for Unimog forwarding.</p></li></ul><p>XDP programs run within the kernel, making this an efficient approach even at <a href="/how-to-drop-10-million-packets/">high packet rates</a>. XDP programs are expressed as eBPF bytecode, and run within an in-kernel virtual machine. Upon loading an XDP program, the kernel compiles its eBPF code into machine code. The kernel also verifies the program to check that it does not compromise security or stability. eBPF is not only used in the context of XDP: many recent Linux kernel innovations employ eBPF, as it provides a convenient and efficient way to extend the behaviour of the kernel.</p><p>XDP is much more convenient than alternative approaches to packet-level processing, particularly in our context where the servers involved also have many other tasks. We have continued to enhance Unimog since its initial deployment. Our deployment model for new versions of our Unimog XDP code is essentially the same as for userspace services, and we are able to deploy new versions on a weekly basis if needed. 
Also, established techniques for optimizing the Linux network stack provide good performance for XDP.</p><p>There are two main alternatives for efficient packet-level processing:</p><ul><li><p>Kernel-bypass networking (such as <a href="https://www.dpdk.org/">DPDK</a>), where a program in userspace manages a network interface (or some part of one) directly without the involvement of the kernel. This approach works best when servers can be dedicated to a network function (due to the need to dedicate processor or network interface hardware resources, and awkward integration with the normal kernel network stack; <a href="/kernel-bypass/">see our old post about this</a>). But we avoid putting servers in specialised roles. (GitHub’s open-source GLB uses DPDK, and this is one of the main factors that made GLB unsuitable for us.)</p></li><li><p>Kernel modules, where code is added to the kernel to perform the necessary network functions. The Linux IPVS (IP Virtual Server) subsystem falls into this category. But developing, testing, and deploying kernel modules is cumbersome compared to XDP.</p></li></ul><p>The following diagram shows an overview of our use of XDP. Both l4drop and Unimog are implemented by an XDP program. l4drop matches attack packets, and uses the DROP action to discard them. Unimog forwards packets, using the TX action to resend them. Packets that are not dropped or forwarded pass through to the normal Linux network stack. To support our elaborate use of XDP, we have developed the <i>xdpd daemon</i>, which performs the necessary supervisory and support functions for our XDP programs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/mwuvfgu1KFhyMtYkI7c4M/f1450237c9417b357b3c11144f90f048/xdpd.png" />
            
            </figure><p>Rather than a single XDP program, we have a chain of XDP programs that must be run for each packet (l4drop, Unimog, and others we have not covered here). One of the responsibilities of xdpd is to prepare these programs, and to make the appropriate system calls to load them and assemble the full chain.</p><p>Our XDP programs come from two sources. Some are developed in a conventional way: engineers write C code, our build system compiles it (with clang) to eBPF ELF files, and our release system deploys those files to our servers. Our Unimog XDP code works like this. In contrast, the l4drop XDP code is dynamically generated by xdpd based on information it receives from attack detection systems.</p><p>xdpd has many other duties to support our use of XDP:</p><ul><li><p>XDP programs can be supplied with data using data structures called <i>maps</i>. xdpd populates the maps needed by our programs, based on information received from control planes.</p></li><li><p>Programs (for instance, our Unimog XDP program) may depend upon configuration values which are fixed while the program runs, but do not have universal values known at the time their C code was compiled. It would be possible to supply these values to the program via maps, but that would be inefficient (retrieving a value from a map requires a call to a helper function). So instead, xdpd will fix up the eBPF program to insert these constants before it is loaded.</p></li><li><p>Cloudflare carefully monitors the behaviour of all our software systems, and this includes our XDP programs: they emit metrics (via another use of maps), which xdpd exposes to our metrics and alerting system (Prometheus).</p></li><li><p>When we deploy a new version of xdpd, <a href="/graceful-upgrades-in-go">it gracefully upgrades</a> in such a way that there is no interruption to the operation of Unimog or l4drop.</p></li></ul><p>Although the XDP programs are written in C, xdpd itself is written in Go. 
Much of its code is specific to Cloudflare. But in the course of developing xdpd, we have collaborated with Cilium to develop <a href="https://github.com/cilium/ebpf">https://github.com/cilium/ebpf</a>, an open source Go library that provides the operations needed by xdpd for manipulating and loading eBPF programs and related objects. We’re also collaborating with the Linux eBPF community to share our experience, and extend the core eBPF technology in ways that make features of xdpd obsolete.</p><p>In evaluating the performance of Unimog, our main concern is efficiency: that is, the resources consumed for load balancing relative to the resources used for customer-visible services. Our measurements show that Unimog costs less than 1% of the processor utilization, compared to a scenario where no load balancing is in use. Other L4LBs, intended to be used with servers dedicated to load balancing, may place more emphasis on maximum throughput of packets. Nonetheless, our experience with Unimog and XDP in general indicates that the throughput is more than adequate for our needs, even during large volumetric attacks.</p><p>Unimog is not the first L4LB to use XDP. In 2018, <a href="https://engineering.fb.com/open-source/open-sourcing-katran-a-scalable-network-load-balancer/">Facebook open sourced Katran</a>, their XDP-based L4LB data plane. We considered the possibility of reusing code from Katran. But it would not have been worthwhile: the core C code needed to implement an XDP-based L4LB is relatively modest (about 1000 lines of C, both for Unimog and Katran). Furthermore, we had requirements that were not met by Katran, and we also needed to integrate with existing components and systems at Cloudflare (particularly l4drop). So very little of the code could have been reused as-is.</p>
    <div>
      <h2>Encapsulation</h2>
      <a href="#encapsulation">
        
      </a>
    </div>
    <p>As discussed at the start of this post, clients make connections to one of our edge data centers with a destination IP address that can be served by any one of our servers. These addresses that do not correspond to a specific server are known as <i>virtual IPs</i> (VIPs). When our Unimog XDP program forwards a packet destined for a VIP, it must replace that VIP address with the <i>direct IP (DIP)</i> of the appropriate server for the connection, so that when the packet is retransmitted it will reach that server. But it is not sufficient to overwrite the VIP in the packet headers with the DIP, as that would hide the original destination address from the server handling the connection (the original destination address is often needed to correctly handle the connection).</p><p>Instead, the packet must be <i>encapsulated</i>: Another set of packet headers is prepended to the packet, so that the original packet becomes the payload in this new packet. The DIP is then used as the destination address in the outer headers, but the addressing information in the headers of the original packet is preserved. The encapsulated packet is then retransmitted. Once it reaches the target server, it must be <i>decapsulated</i>: the outer headers are stripped off to yield the original packet as if it had arrived directly.</p><p>Encapsulation is a general concept in computer networking, and is used in a variety of contexts. The headers to be added to the packet by encapsulation are defined by an <i>encapsulation format</i>. Many different encapsulation formats have been defined within the industry, tailored to the requirements in specific contexts. Unimog uses a format called <a href="https://www.ietf.org/id/draft-ietf-intarea-gue-09.txt"><i>GUE</i> (Generic UDP Encapsulation)</a>, in order to allow us to re-use the glb-redirect component from github’s GLB (glb-redirect is discussed below).</p><p>GUE is a relatively simple encapsulation format. 
It encapsulates within a UDP packet, placing a GUE-specific header between the outer IP/UDP headers and the payload packet to allow extension data to be carried (and we’ll see how Unimog takes advantage of this):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ImbRfgHVop1QjM0K3R6Lr/8f7ce8fdab72ee2dfacc454d59bc1171/image8.png" />
            
            </figure><p>When an encapsulated packet arrives at a server, the encapsulation process must be reversed. This step is called <i>decapsulation</i>. The headers that were added during the encapsulation process are removed, leaving the original packet to be processed by the network stack as if it had arrived directly from the client.</p><p>An issue that can arise with encapsulation is hitting limits on the maximum packet size, because the encapsulation process makes packets larger. The de-facto maximum packet size on the Internet is 1500 bytes, and not coincidentally this is also the maximum packet size on ethernet networks. For Unimog, encapsulating a 1500-byte packet results in a 1536-byte packet. To allow for these enlarged encapsulated packets, we have enabled jumbo frames on the networks inside our data centers, so that the 1500-byte limit only applies to packets headed out to the Internet.</p>
    <div>
      <h2>Forwarding logic</h2>
      <a href="#forwarding-logic">
        
      </a>
    </div>
    <p>So far, we have described the technology used to implement the Unimog load balancer, but not how our Unimog XDP program selects the DIP address when forwarding a packet. This section describes the basic scheme. But as we’ll see, there is a problem, so then we’ll describe how this scheme is elaborated to solve that problem.</p><p>In outline, our Unimog XDP program processes each packet in the following way:</p><ol><li><p>Determine whether the packet is destined for a VIP address. Not all of the packets arriving at a server are for VIP addresses. Other packets are passed through for normal handling by the kernel’s network stack. (xdpd obtains the VIP address ranges from the Unimog control plane.)</p></li><li><p>Determine the DIP for the server handling the packet’s connection.</p></li><li><p>Encapsulate the packet, and retransmit it to the DIP.</p></li></ol><p>In step 2, note that all the load balancers must act consistently - when forwarding packets, they must all agree about which connections go to which servers. The rate of new connections arriving at a data center is large, so it’s not practical for load balancers to agree by communicating information about connections amongst themselves. Instead L4LBs adopt designs which allow the load balancers to reach consistent forwarding decisions independently. To do this, they rely on hashing schemes: Take the 4-tuple identifying the packet’s connection, put it through a hash function to obtain a key (the hash function ensures that these key values are uniformly distributed), then perform some kind of lookup into a data structure to turn the key into the DIP for the target server.</p><p>Unimog uses such a scheme, with a data structure that is simple compared to some other L4LBs. We call this data structure the <i>forwarding table</i>, and it consists of an array where each entry contains a DIP specifying the target server for the relevant packets (we call these entries <i>buckets</i>). 
The forwarding table is generated by the Unimog control plane and broadcast to the load balancers (more on this below), so that it has the same contents on all load balancers.</p><p>To look up a packet’s key in the forwarding table, the low N bits from the key are used as the index for a bucket (the forwarding table is always a power-of-2 in size):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/PTRbsbXdTdzDPg0zTNtkJ/c22c4634097466ff8e57844465713398/hashing.png" />
            
            </figure><p>Note that this approach does not provide per-connection control - each bucket typically applies to many connections. All load balancers in a data center use the same forwarding table, so they all forward packets in a consistent manner. This means it doesn’t matter which packets are sent by the router to which servers, and so ECMP re-hashes are a non-issue. And because the forwarding table is immutable and simple in structure, lookups are fast.</p><p>Although the above description only discusses a single forwarding table, Unimog supports multiple forwarding tables, each one associated with a <i>trafficset</i> - the traffic destined for a particular service. Ranges of VIP addresses are associated with a trafficset. Each trafficset has its own configuration settings and forwarding tables. This gives us the flexibility to differentiate how Unimog behaves for different services.</p><p>Precise load balancing requires the ability to make fine adjustments to the number of connections arriving at each server. So we make the number of buckets in the forwarding table more than 100 times the number of servers. Our data centers can contain hundreds of servers, and so it is normal for a Unimog forwarding table to have tens of thousands of buckets. The DIP for a given server is repeated across many buckets in the forwarding table, and by increasing or decreasing the number of buckets that refer to a server, we can control the share of connections going to that server. Not all buckets will correspond to exactly the same number of connections at a given point in time (the properties of the hash function make this a statistical matter). But experience with Unimog has demonstrated that the relationship between the number of buckets and resulting server load is sufficiently strong to allow for good load balancing.</p><p>But as mentioned, there is a problem with this scheme as presented so far. 
Updating a forwarding table, and changing the DIPs in some buckets, would break connections that hash to those buckets (because packets on those connections would get forwarded to a different server after the update). But one of the requirements for Unimog is to allow us to change which servers get new connections without impacting the existing connections. For example, sometimes we want to drain the connections to a server, maintaining the existing connections to that server but not forwarding new connections to it, in the expectation that many of the existing connections will terminate of their own accord. The next section explains how we fix this scheme to allow such changes.</p>
    <div>
      <h2>Maintaining established connections</h2>
      <a href="#maintaining-established-connections">
        
      </a>
    </div>
    <p>To make changes to the forwarding table without breaking established connections, Unimog adopts the “daisy chaining” technique described in the paper <a href="https://www.usenix.org/system/files/conference/nsdi18/nsdi18-olteanu.pdf"><i>Stateless Datacenter Load-balancing with Beamer</i></a>.</p><p>To understand how the Beamer technique works, let’s look at what can go wrong when a forwarding table changes: imagine the forwarding table is updated so that a bucket which contained the DIP of server A now refers to server B. A packet that would formerly have been sent to A by the load balancers is now sent to B. If that packet initiates a new connection (it’s a TCP SYN packet), there’s no problem - server B will continue the three-way handshake to complete the new connection. On the other hand, if the packet belongs to a connection established before the change, then the TCP implementation of server B has no matching TCP socket, and so sends a RST back to the client, breaking the connection.</p><p>This explanation hints at a solution: the problem occurs when server B receives a forwarded packet that does not match a TCP socket. If we could change its behaviour in this case to forward the packet a second time to the DIP of server A, that would allow the connection to server A to be preserved. For this to work, server B needs to know the DIP for the bucket <i>before</i> the change.</p><p>To accomplish this, we extend the forwarding table so that each bucket has two slots, each containing the DIP for a server. The first slot contains the current DIP, which is used by the load balancer to forward packets as discussed (and here we refer to this forwarding as the <i>first hop</i>). 
The second slot preserves the previous DIP (if any), in order to allow the packet to be forwarded again on a <i>second hop</i> when necessary.</p><p>For example, imagine we have a forwarding table that refers to servers A, B, and C, and then it is updated to stop new connections going to server A, but maintaining established connections to server A. This is achieved by replacing server A’s DIP in the first slot of any buckets where it appears, but preserving it in the second slot:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hTdYu2s3dFhNKhsASADKa/44fe175c6f406753d82390f7a046669f/image9-1.png" />
            
            </figure><p>In addition to extending the forwarding table, this approach requires a component on each server to forward packets on the second hop when necessary. This diagram shows where this <i>redirector</i> fits into the path a packet can take:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6f5CQLIjy5hO2eWKW8i36H/ab18994ccdf15961e8a87d1e3890d2da/sausage.png" />
            
            </figure><p>The redirector follows some simple logic to decide whether to process a packet locally on the first-hop server or to forward it on to the second-hop server:</p><ul><li><p>If the packet is a SYN packet, initiating a new connection, then it is always processed by the first-hop server. This ensures that new connections go to the first-hop server.</p></li><li><p>For other packets, the redirector checks whether the packet belongs to a connection with a corresponding TCP socket on the first-hop server. If so, it is processed by that server.</p></li><li><p>Otherwise, the packet has no corresponding TCP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some connection established on the second-hop server that we wish to maintain).</p></li></ul><p>In that last step, the redirector needs to know the DIP for the second hop. To avoid the need for the redirector to do forwarding table lookups, the second-hop DIP is placed into the encapsulated packet by the Unimog XDP program (which already does a forwarding table lookup, so it has easy access to this value). This second-hop DIP is carried in a GUE extension header, so that it is readily available to the redirector if it needs to forward the packet again.</p><p>This second hop, when necessary, does have a cost. But in our data centers, the fraction of forwarded packets that take the second hop is usually less than 1% (despite the significance of long-lived connections in our context). The result is that the practical overhead of the second hops is modest.</p><p>When we initially deployed Unimog, we adopted the <a href="https://github.com/github/glb-director/tree/master/src/glb-redirect">glb-redirect iptables module</a> from github’s GLB to serve as the redirector component. In fact, some implementation choices in Unimog, such as the use of GUE, were made in order to facilitate this re-use. 
glb-redirect worked well for us initially, but subsequently we wanted to enhance the redirector logic. glb-redirect is a custom Linux kernel module, and developing and deploying changes to kernel modules is more difficult for us than for eBPF-based components such as our XDP programs. This is not merely due to Cloudflare having invested more engineering effort in software infrastructure for eBPF; it also results from the more explicit boundary between the kernel and eBPF programs (for example, we are able to run the same eBPF programs on a range of kernel versions without recompilation). We wanted to achieve the same ease of development for the redirector as for our XDP programs.</p><p>To that end, we decided to write an eBPF replacement for glb-redirect. While the redirector could be implemented within XDP, like our load balancer, practical concerns led us to implement it as a <a href="https://man7.org/linux/man-pages/man8/tc-bpf.8.html">TC classifier program</a> instead (TC is the traffic control subsystem within the Linux network stack). A downside to XDP is that the packet contents prior to processing by the XDP program <a href="/xdpcap/">are not visible using conventional tools such as tcpdump</a>, complicating debugging. TC classifiers do not have this downside, and in the context of the redirector, which passes most packets through, the performance advantages of XDP would not be significant.</p><p>The result is cls-redirect, a redirector implemented as a TC classifier program. We have contributed our <a href="https://github.com/torvalds/linux/blob/c4ba153b6501fa7ccfdc7e57946fb1d6011e36e8/tools/testing/selftests/bpf/progs/test_cls_redirect.c">cls-redirect code as part of the Linux kernel test suite</a>. 
In addition to implementing the redirector logic, cls-redirect also implements decapsulation, removing the need to separately configure GUE tunnel endpoints for this purpose.</p><p>There are some features suggested in the Beamer paper that Unimog does not implement:</p><ul><li><p>Beamer embeds <i>generation numbers</i> in the encapsulated packets to address a potential corner case where a ECMP rehash event occurs at the same time as a forwarding table update is propagating from the control plane to the load balancers. Given the combination of circumstances required for a connection to be impacted by this issue, we believe that in our context the number of affected connections is negligible, and so the added complexity of the generation numbers is not worthwhile.</p></li><li><p>In the Beamer paper, the concept of daisy-chaining encompasses third hops etc. to preserve connections across a series of changes to a bucket. Unimog only uses two hops (the first and second hops above), so in general it can only preserve connections across a single update to a bucket. But our experience is that even with only two hops, a careful strategy for updating the forwarding tables permits connection lifetimes of days.</p></li></ul><p>To elaborate on this second point: when the control plane is updating the forwarding table, it often has some choice in which buckets to change, depending on the event that led to the update. For example, if a server is being brought into service, then some buckets must be assigned to it (by placing the DIP for the new server in the first slot of the bucket). But there is a choice about <i>which</i> buckets. 
A strategy of choosing the least-recently modified buckets will tend to minimise the impact to connections.</p><p>Furthermore, when updating the forwarding table to adjust the balance of load between servers, Unimog often uses a novel trick: due to the redirector logic, exchanging the first-hop and second-hop DIPs for a bucket only affects which server receives new connections for that bucket, and never impacts any established connections. Unimog is able to achieve load balancing in our edge data centers largely through forwarding table changes of this type.</p>
    <div>
      <h2>Control plane</h2>
      <a href="#control-plane">
        
      </a>
    </div>
    <p>So far, we have discussed the Unimog data plane - the part that processes network packets. But much of the development effort on Unimog has been devoted to the control plane - the part that generates the forwarding tables used by the data plane. In order to correctly maintain the forwarding tables, the control plane consumes information from multiple sources:</p><ul><li><p>Server information: Unimog needs to know the set of servers present in a data center, some key information about each one (such as their DIP addresses), and their operational status. It also needs signals about transitional states, such as when a server is being withdrawn from service, in order to gracefully drain connections (preventing the server from receiving new connections, while maintaining its established connections).</p></li><li><p>Health: Unimog should only send connections to servers that are able to correctly handle those connections, otherwise those servers should be removed from the forwarding tables. To ensure this, it needs health information at the node level (indicating that a server is available) and at the service level (indicating that a service is functioning normally on a server).</p></li><li><p>Load: in order to balance load, Unimog needs information about the resource utilization on each server.</p></li><li><p>IP address information: Cloudflare serves hundreds of thousands of IPv4 addresses, and these are something that we have to treat as a dynamic resource rather than something statically configured.</p></li></ul><p>The control plane is implemented by a process called the <i>conductor</i>. 
In each of our edge data centers, there is one active conductor, but there are also standby instances that will take over if the active instance goes away.</p><p>We use <a href="https://www.consul.io/">Hashicorp’s Consul</a> in a number of ways in the Unimog control plane (we have an independent Consul server cluster in each data center):</p><ul><li><p>Consul provides a key-value store, with support for blocking queries so that changes to values can be received promptly. We use this to propagate the forwarding tables and VIP address information from the conductor to xdpd on the servers.</p></li><li><p>Consul provides server- and service-level health checks. We use this as the source of health information for Unimog.</p></li><li><p>The conductor stores its state in the Consul KV store, and uses Consul’s distributed locks to ensure that only one conductor instance is active.</p></li></ul><p>The conductor obtains server load information from Prometheus, which we already use for metrics throughout our systems. It balances the load across the servers using a control loop, periodically adjusting the forwarding tables to send more connections to underloaded servers and fewer connections to overloaded servers. The load for a server is defined by a Prometheus metric expression which measures processor utilization (with some intricacies to better handle characteristics of our workloads). The determination of whether a server is underloaded or overloaded is based on comparison with the average value of the load metric, and the adjustments made to the forwarding table are proportional to the deviation from the average. So the result of the feedback loop is that the load metric for all servers converges on the average.</p><p>Finally, the conductor queries internal Cloudflare APIs to obtain the necessary information on servers and addresses.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/25doef0hn0sJTI8oGzgoOk/fb025967eab0ee87dcb9ead2724dc755/control.png" />
            
            </figure><p>Unimog is a critical system: incorrect, poorly adjusted or stale forwarding tables could cause incoming network traffic to a data center to be dropped, or servers to be overloaded, to the point that a data center would have to be removed from service. To maintain a high quality of service and minimise the overhead of managing our many edge data centers, we have to be able to upgrade all components. So to the greatest extent possible, all components are able to tolerate brief absences of the other components without any impact to service. In some cases this is possible through careful design. In other cases, it requires explicit handling. For example, we have found that Consul can temporarily report inaccurate health information for a server and its services when the Consul agent on that server is restarted (for example, in order to upgrade Consul). So we implemented the necessary logic in the conductor to detect and disregard these transient health changes.</p><p>Unimog also forms a complex system with feedback loops: The conductor reacts to its observations of behaviour of the servers, and the servers react to the control information they receive from the conductor. This can lead to behaviours of the overall system that are hard to anticipate or test for. For instance, not long after we deployed Unimog we encountered surprising behaviour when data centers became overloaded. This is of course a scenario that we strive to avoid, and we have automated systems to remove traffic from overloaded data centers if it does. But if a data center became sufficiently overloaded, then health information from its servers would indicate that many servers were degraded to the point that Unimog would stop sending new connections to those servers. Under normal circumstances, this is the correct reaction to a degraded server. 
But if enough servers become degraded, diverting new connections to other servers would mean those servers became degraded, while the original servers were able to recover. So it was possible for a data center that became temporarily overloaded to get stuck in a state where servers oscillated between healthy and degraded, even after the level of demand on the data center had returned to normal. To correct this issue, the conductor now has logic to distinguish between isolated degraded servers and such data center-wide problems. We have continued to improve Unimog in response to operational experience, ensuring that it behaves in a predictable manner over a wide range of conditions.</p>
    <div>
      <h2>UDP Support</h2>
      <a href="#udp-support">
        
      </a>
    </div>
    <p>So far, we have described Unimog’s support for directing TCP connections. But Unimog also supports UDP traffic. UDP does not have explicit connections between clients and servers, so how it works depends upon how the UDP application exchanges packets between the client and server. There are a few cases of interest:</p>
    <div>
      <h3>Request-response UDP applications</h3>
      <a href="#request-response-udp-applications">
        
      </a>
    </div>
    <p>Some applications, such as DNS, use a simple request-response pattern: the client sends a request packet to the server, and expects a response packet in return. Here, there is nothing corresponding to a connection (the client only sends a single packet, so there is no requirement to make sure that multiple packets arrive at the same server). But Unimog can still provide value by spreading the requests across our servers.</p><p>To cater to this case, Unimog operates as described in previous sections, hashing the 4-tuple from the packet headers (the source and destination IP addresses and ports). But the Beamer daisy-chaining technique that allows connections to be maintained does not apply here, and so the buckets in the forwarding table only have a single slot.</p>
    <div>
      <h3>UDP applications with flows</h3>
      <a href="#udp-applications-with-flows">
        
      </a>
    </div>
    <p>Some UDP applications have long-lived <i>flows</i> of packets between the client and server. Like TCP connections, these flows are identified by the 4-tuple. It is necessary that such flows go to the same server (even when Cloudflare is just passing a flow through to the origin server, it is convenient for detecting and mitigating certain kinds of attack to have that flow pass through a single server within one of Cloudflare’s data centers).</p><p>It's possible to treat these flows by hashing the 4-tuple, skipping the Beamer daisy-chaining technique as for request-response applications. But then adding servers will cause some flows to change servers (this would effectively be a form of consistent hashing). For UDP applications, we can’t say in general what impact this has, as we can for TCP connections. But it’s possible that it causes some disruption, so it would be nice to avoid this.</p><p>So Unimog adapts the daisy-chaining technique to apply it to UDP flows. The outline remains similar to that for TCP: the same redirector component on each server decides whether to send a packet on a second hop. But UDP does not have anything corresponding to TCP’s SYN packet that indicates a new connection. So for UDP, the part that depends on SYNs is removed, and the logic applied for each packet becomes:</p><ul><li><p>The redirector checks whether the packet belongs to a connection with a corresponding UDP socket on the first-hop server. If so, it is processed by that server.</p></li><li><p>Otherwise, the packet has no corresponding UDP socket on the first-hop server. So it is forwarded on to the second-hop server to be processed there (in the expectation that it belongs to some flow established on the second-hop server that we wish to maintain).</p></li></ul><p>Although the change compared to the TCP logic is not large, it has the effect of switching the roles of the first- and second-hop servers: For UDP, new flows go to the second-hop server. 
The Unimog control plane has to take account of this when it updates a forwarding table. When it introduces a server into a bucket, that server should receive new connections or flows. For a TCP trafficset, this means it becomes the first-hop server. For a UDP trafficset, it must become the second-hop server.</p><p>This difference between handling of TCP and UDP also leads to higher overheads for UDP. In the case of TCP, as new connections are formed and old connections terminate over time, fewer packets will require the second hop, and so the overhead tends to diminish. But with UDP, new connections always involve the second hop. This is why we differentiate the two cases, taking advantage of SYN packets in the TCP case.</p><p>The UDP logic also places a requirement on services. The redirector must be able to match packets to the corresponding sockets on a server according to their 4-tuple. This is not a problem in the TCP case, because all TCP connections are represented by connected sockets in the BSD sockets API (these sockets are obtained from an accept system call, so that they have a local address and a peer address, determining the 4-tuple). But for UDP, unconnected sockets (lacking a declared peer address) can be used to send and receive packets. So some UDP services only use unconnected sockets. For the redirector logic above to work, services must create connected UDP sockets in order to expose their flows to the redirector.</p>
    <div>
      <h3>UDP applications with sessions</h3>
      <a href="#udp-applications-with-sessions">
        
      </a>
    </div>
    <p>Some UDP-based protocols have explicit <i>sessions</i>, with a session identifier in each packet. Session identifiers allow sessions to persist even if the 4-tuple changes. This happens in mobility scenarios - for example, if a mobile device passes from a WiFi to a cellular network, causing its IP address to change. An example of a UDP-based protocol with session identifiers is <a href="/the-road-to-quic/">QUIC</a> (which calls them connection IDs).</p><p>Our Unimog XDP program allows a <i>flow dissector</i> to be configured for different trafficsets. The flow dissector is the part of the code that is responsible for taking a packet and extracting the value that identifies the flow or connection (this value is then hashed and used for the lookup into the forwarding table). For TCP and UDP, there are default flow dissectors that extract the 4-tuple. But specialised flow dissectors can be added to handle UDP-based protocols.</p><p>We have used this functionality in our WARP product. We extended the Wireguard protocol used by WARP in a backwards-compatible way to include a session identifier, and added a flow dissector to Unimog to exploit it. There are more details in <a href="/warp-technical-challenges/">our post on the technical challenges of WARP</a>.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Unimog has been deployed to all of Cloudflare’s edge data centers for over a year, and it has become essential to our operations. Throughout that time, we have continued to enhance Unimog (many of the features described here were not present when it was first deployed). So the ease of developing and deploying changes, due to XDP and xdpd, has been a significant benefit. Today we continue to extend it, to support more services, and to help us manage our traffic and the load on our servers in more contexts.</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Edge]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">7wYPcBFe2ItK009d2hZ5FS</guid>
            <dc:creator>David Wragg</dc:creator>
        </item>
        <item>
            <title><![CDATA[High Availability Load Balancers with Maglev]]></title>
            <link>https://blog.cloudflare.com/high-availability-load-balancers-with-maglev/</link>
            <pubDate>Wed, 10 Jun 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ We own and operate physical infrastructure for our backend services. We need an effective way to route arbitrary TCP and UDP traffic between services and also from outside these data centers. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>We run many backend services that power our customer dashboard, APIs, and features available at our edge. We own and operate physical infrastructure for our backend services. We need an effective way to route arbitrary TCP and UDP traffic between services and also from outside these data centers.</p><p>Previously, all traffic for these backend services would pass through several layers of stateful TCP proxies and NATs before reaching an available instance. This solution worked for several years, but as we grew it caused our service and operations teams many issues. Our service teams had to deal with drops in availability, and our operations teams faced significant toil whenever they performed maintenance on load balancer servers.</p>
    <div>
      <h2>Goals</h2>
      <a href="#goals">
        
      </a>
    </div>
    <p>With our experience of the stateful TCP proxy and NAT solutions in mind, we had several goals for a replacement load balancing service, while remaining on our own infrastructure:</p><ol><li><p>Preserve source IPs through routing decisions to destination servers. This allows us to support servers that require client IP addresses as part of their operation, without workarounds such as X-Forwarded-For headers or the PROXY TCP extension.</p></li><li><p>Support an architecture where backends are located across many racks and subnets. This prevents solutions that cannot be routed by existing network equipment.</p></li><li><p>Allow operations teams to perform maintenance with zero downtime. We should be able to remove load balancers at any time without causing any connection resets or downtime for services.</p></li><li><p>Use Linux tools and features that are commonplace and well-tested. There are a lot of very cool networking features in Linux we could experiment with, but we wanted to optimize for the least surprising behavior for operators who do not primarily work with these load balancers.</p></li><li><p>No explicit connection synchronization between load balancers. We found that communication between load balancers significantly increased the system complexity, creating more opportunities for things to go wrong.</p></li><li><p>Allow for a staged rollout from the previous load balancer implementation. We should be able to migrate the traffic of specific services between the two implementations to find issues and gain confidence in the system.</p></li></ol>
    <div>
      <h2>Reaching Zero Downtime</h2>
      <a href="#reaching-zero-downtime">
        
      </a>
    </div>
    
    <div>
      <h3>Problems</h3>
      <a href="#problems">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CMbFBlBDHc4rzZ458SE03/d1a7f034913bb4d188048d4f14620867/image6-1.png" />
            
            </figure><p>Previously, when traffic arrived at our backend data centers, our routers would pick and forward packets to one of the L4 load balancer servers they knew about. These L4 load balancers would determine what service the traffic was for, then forward the traffic to one of the service's L7 servers.</p><p>This architecture worked fine during normal operations. However, issues would quickly surface whenever the set of load balancers changed. Our routers would forward traffic to the new set, and it was very likely traffic would arrive at a different load balancer than before. As each load balancer maintained its own connection state, it would be unable to forward traffic for these existing in-progress connections. These connections would then be reset, potentially causing errors for our customers.</p>
    <div>
      <h3>Consistent Hashing</h3>
      <a href="#consistent-hashing">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2sQ0ZsemGl1ZjnxB8SaBZ2/dee6b3b33c0d088b8f2e5378a714ae5a/image5-1.png" />
            
            </figure><p>During normal operations, our new architecture behaves much like the previous design: our routers select an L4 load balancer, which then forwards traffic to one of the service's L7 servers.</p><p>There's a significant change when the set of load balancers changes. As our load balancers are now stateless, it doesn't matter which load balancer our router selects to forward traffic to: packets will end up reaching the same backend server regardless.</p>
    <div>
      <h2>Implementation</h2>
      <a href="#implementation">
        
      </a>
    </div>
    
    <div>
      <h3>BGP</h3>
      <a href="#bgp">
        
      </a>
    </div>
    <p>Our load balancer servers announce service IP addresses to our data centers’ routers using BGP, unchanged from the previous solution. Our routers choose which load balancers will receive packets based on a routing strategy called equal-cost multi-path routing (ECMP).</p><p>ECMP hashes information from each packet to pick a path for that packet. The hash function used by routers is often fixed in firmware. Routers that choose a poor hash function, or poor inputs to it, can create unbalanced network and server load, or break assumptions made by the protocol layer.</p><p>We worked with our networking team to ensure ECMP is configured on our routers to hash only based on the packet's 5-tuple—the protocol, source address and port, and destination address and port.</p><p>For maintenance, our operators can withdraw the BGP session and traffic will transparently shift to the other load balancers. However, if a load balancer suddenly becomes unavailable, such as with a kernel panic or power failure, there is a short delay before the BGP keepalive mechanism fails and routers terminate the session.</p><p>It's possible for routers to terminate BGP sessions after a much shorter delay by using the Bidirectional Forwarding Detection (BFD) protocol between the router and load balancers. However, different routers have different limitations and restrictions on BFD that make it difficult to use in an environment that relies heavily on L2 link aggregation and VXLANs.</p><p>We're continuing to work with our networking team to find solutions that reduce the time to terminate BGP sessions, using tools and configurations they're most comfortable with.</p>
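<p>The routing behavior is easy to picture in software. The sketch below (Python, purely illustrative; real routers use fixed hash functions in firmware, not SHA-256) shows the key property we rely on: packets of the same flow always hash to the same path, so a flow keeps reaching the same load balancer as long as the set of paths is stable:</p>
            <pre><code>import hashlib

def ecmp_path(five_tuple, num_paths):
    # Hash the (protocol, src addr, src port, dst addr, dst port)
    # fields and reduce modulo the number of equal-cost paths.
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_paths

flow = ("tcp", "203.0.113.7", 49152, "198.51.100.1", 80)
# The same flow is always mapped to the same path index.
path = ecmp_path(flow, 4)</code></pre>
<p>Note that when the number of paths changes (a load balancer is added or removed), most flows are remapped; that is exactly the problem the consistent-hashing design below addresses.</p>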
    <div>
      <h3>Selecting Backends with Maglev</h3>
      <a href="#selecting-backends-with-maglev">
        
      </a>
    </div>
    <p>To ensure all load balancers are sending traffic to the same backends, we decided to use the <a href="https://research.google/pubs/pub44824/">Maglev connection scheduler</a>. Maglev is a consistent hash scheduler, hashing a 5-tuple of information from each packet—the protocol, source address and port, and destination address and port—to determine a backend server.</p><p>Because the hash is consistent, every load balancer chooses the same backend server for a given packet without needing to persist any connection state. This allows us to transparently move traffic between load balancers without requiring explicit connection synchronization between them.</p>
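<p>For readers curious how Maglev's lookup table is built, here is a simplified sketch of the population algorithm from the paper (Python, purely illustrative; the hash functions and table size are stand-ins, and the kernel's ip_vs_mh module is what we actually run). Each backend gets a pseudo-random permutation of table slots, and backends take turns claiming their next preferred empty slot, which yields a nearly even assignment that changes little when a backend is added or removed:</p>
            <pre><code>import hashlib

def _h(name, salt):
    d = hashlib.sha256(f"{salt}:{name}".encode()).digest()
    return int.from_bytes(d[:8], "big")

def maglev_table(backends, table_size=65537):
    # table_size must be prime so each (offset, skip) pair generates
    # a full permutation of the slots.
    perms = [(_h(b, "offset") % table_size,
              _h(b, "skip") % (table_size - 1) + 1) for b in backends]
    next_j = [0] * len(backends)   # position in each preference list
    table = [None] * table_size
    filled = 0
    while filled != table_size:
        for i, (offset, skip) in enumerate(perms):
            if filled == table_size:
                break
            # Walk backend i's permutation to its next empty slot.
            while True:
                slot = (offset + next_j[i] * skip) % table_size
                next_j[i] += 1
                if table[slot] is None:
                    table[slot] = backends[i]
                    filled += 1
                    break
    return table

def pick_backend(table, five_tuple):
    # Every load balancer computes the same hash over the 5-tuple,
    # so they all pick the same backend without sharing any state.
    key = "|".join(str(field) for field in five_tuple).encode()
    idx = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return table[idx % len(table)]</code></pre>
<p>Because backends claim slots round-robin, the table is balanced to within one slot per backend, and the hash lookup is a single array index at packet time.</p>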
    <div>
      <h3>IPVS and Foo-Over-UDP</h3>
      <a href="#ipvs-and-foo-over-udp">
        
      </a>
    </div>
    <p>Where possible, we wanted to use commonplace and reliable Linux features. Linux has implemented a powerful layer 4 load balancer, the IP Virtual Server (IPVS), since the early 2000s. IPVS has supported the Maglev scheduler <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=039f32e8cdea29b4d0680df7a83817b5ec4166e1">since Linux 4.18</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FK6ZVvqjjOVdFwdvTZM3L/5617f15dd5cf447bff3a47e7dff5702f/image2-3.png" />
            
            </figure><p>Our load balancer and application servers are spread across multiple racks and subnets. To route traffic from the load balancer we opted to use Foo-Over-UDP encapsulation.</p><p>In Foo-Over-UDP encapsulation, a new IP and UDP header are added around the original packet. When these packets arrive on the destination server, the Linux kernel removes the outer IP and UDP headers and inserts the inner payload back into the networking stack for processing, as if the packet had originally been received on that server.</p><p>Compared to other encapsulation methods—such as IPIP, GUE, and GENEVE—we felt Foo-Over-UDP struck a nice balance between features and flexibility. Direct Server Return, where application servers reply directly to clients and bypass the load balancers, came as a byproduct of the encapsulation. There was no state associated with the encapsulation; each server required only one encapsulation interface to receive traffic from all load balancers.</p><p><b>Example Load Balancer Configuration</b></p>
            <pre><code># Load in the kernel modules required for IPVS and FOU.
$ modprobe ip_vs &amp;&amp; modprobe ip_vs_mh &amp;&amp; modprobe fou

# Create one tunnel between the load balancer and
# an application server. The IPs are the machines'
# real IPs on the network.
$ ip link add name lbtun1 type ipip \
remote 192.0.2.1 local 192.0.2.2 ttl 2 \
encap fou encap-sport auto encap-dport 5555

# Inform the kernel about the VIPs that might be announced here.
$ ip route add table local local 198.51.100.0/24 \
dev lo proto kernel

# Route the application server's real IP via the tunnel.
# Traffic on this machine destined for this IP address will
# be sent down the tunnel.
$ ip route add 203.0.113.1 dev lbtun1 scope link

# Tell IPVS about the service, and that it should use the
# Maglev scheduler.
$ ipvsadm -A -t 198.51.100.1:80 -s mh

# Tell IPVS about a backend for this service.
$ ipvsadm -a -t 198.51.100.1:80 -r 203.0.113.1:80</code></pre>
            <p><b>Example Application Server Configuration</b></p>
            <pre><code># The kernel module may need to be loaded.
$ modprobe fou

# Setup an IPIP receiver.
# ipproto 4 = IPIP (not IPv4)
$ ip fou add port 5555 ipproto 4

# Bring up the tunnel.
$ ip link set dev tunl0 up

# Disable reverse path filtering on tunnel interface.
$ sysctl -w net.ipv4.conf.tunl0.rp_filter=0
$ sysctl -w net.ipv4.conf.all.rp_filter=0</code></pre>
            <p>IPVS does not support Foo-Over-UDP as a packet forwarding method. To work around this limitation, we've created virtual interfaces that implement Foo-Over-UDP encapsulation. We can then use IPVS's direct packet forwarding method along with the kernel routing table to choose a specific interface.</p><p>Linux is often configured to ignore packets that arrive on an interface that is different from the interface used for replies. As packets will now be arriving on the virtual "tunl0" interface, we need to disable reverse path filtering on this interface. The kernel uses the higher value of the named and "all" interfaces, so you may need to decrease "all" and adjust other interfaces.</p>
    <div>
      <h3>MTUs and Encapsulation</h3>
      <a href="#mtus-and-encapsulation">
        
      </a>
    </div>
    <p>The maximum IPv4 packet size, or maximum transmission unit (MTU), we accept from the internet is 1500 bytes. To ensure we did not fragment these packets during encapsulation, we increased our internal MTUs from the default to accommodate the extra IP and UDP headers.</p><p>The team had underestimated the complexity of changing the MTU across all our racks of equipment. We had to adjust the MTU on all our routers and switches, on our bonded and VXLAN interfaces, and finally for our Foo-Over-UDP encapsulation. Even with a carefully orchestrated rollout, we still uncovered MTU-related bugs in our switches and server stack, many of which first manifested as issues on other parts of the network.</p>
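<p>The headroom required is simply the size of the outer headers. For the IPv4 case the arithmetic is straightforward (a sketch; the exact internal MTU values we settled on are not listed here):</p>
            <pre><code>OUTER_IPV4_HEADER = 20  # bytes added by the outer IPv4 header
OUTER_UDP_HEADER = 8    # bytes added by the outer UDP header (FOU)
INTERNET_MTU = 1500     # largest packet accepted from the internet

# Every internal link must carry the encapsulated packet without
# fragmenting it, so internal MTUs must be at least:
required_internal_mtu = INTERNET_MTU + OUTER_IPV4_HEADER + OUTER_UDP_HEADER
assert required_internal_mtu == 1528</code></pre>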
    <div>
      <h3>Node Agent</h3>
      <a href="#node-agent">
        
      </a>
    </div>
    <p>We've written a Go agent that runs on each load balancer and synchronizes with a control plane layer that tracks the location of services. The agent then configures the system based on the active services and available backend servers.</p><p>To configure IPVS and the routing table we're using packages built upon the <a href="https://github.com/mdlayher/netlink">netlink</a> Go package. Today we're <a href="https://github.com/cloudflare/ipvs">open sourcing</a> the IPVS netlink package we built, which supports querying, creating, and updating IPVS virtual servers, destinations, and statistics.</p><p>Unfortunately, there is no official programming interface for iptables, so we must instead execute the iptables binary. The agent computes an ideal set of iptables chains and rules, which is then reconciled with the live rules.</p><p><b>Subset of iptables for a service</b></p>
            <pre><code>*filter
-A INPUT -d 198.51.100.0/24 -m comment --comment \
"leif:nhAi5v93jwQYcJuK" -j LEIFUR-LB
-A LEIFUR-LB -d 198.51.100.1/32 -p tcp -m comment --comment \
"leif:G4qtNUVFCkLCu4yt" -m multiport --dports 80 -j LEIFUR-GQ4OKHRLCJYOWIN9
-A LEIFUR-GQ4OKHRLCJYOWIN9 -s 10.0.0.0/8 -m comment --comment \
"leif:G4qtNUVFCkLCu4yt" -j ACCEPT
-A LEIFUR-GQ4OKHRLCJYOWIN9 -s 172.16.0.0/12 -m comment --comment \
"leif:0XxZ2OwlQWzIYFTD" -j ACCEPT</code></pre>
            <p>The iptables output of a rule may differ significantly from the input we gave for our ideal rule. To avoid needing to parse the entire iptables rule in our comparisons, we store a hash of the rule, including its position in the chain, as an iptables comment. We can then compare the comment to our ideal rule to determine whether we need to take any action. On chains that are shared (such as INPUT) the agent ignores unmanaged rules.</p>
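<p>The reconciliation idea can be sketched as follows (Python for brevity, though the agent itself is Go; the comment format and helper names here are hypothetical, not the agent's actual encoding):</p>
            <pre><code>import hashlib

def rule_comment(chain, position, rule_spec):
    # Hash the rule's canonical form together with its position in
    # the chain, and embed the digest as an iptables comment.
    digest = hashlib.sha256(
        f"{chain}|{position}|{rule_spec}".encode()).hexdigest()
    return "leif:" + digest[:16]

def stale_rules(ideal_rules, live_comments):
    # ideal_rules: list of (chain, position, rule_spec) tuples.
    # live_comments: comments read back from the live ruleset.
    # Rules whose comment is missing must be (re)installed; we never
    # have to parse iptables' normalized rule output.
    return [r for r in ideal_rules
            if rule_comment(*r) not in live_comments]</code></pre>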
    <div>
      <h2>Kubernetes Integration</h2>
      <a href="#kubernetes-integration">
        
      </a>
    </div>
    <p>We use the network load balancer described here as a <a href="https://www.cloudflare.com/learning/performance/cloud-load-balancing-lbaas/">cloud load balancer</a> for Kubernetes. A controller assigns virtual IP addresses to Kubernetes services requesting a load balancer IP. These IPs are then configured in IPVS by the agent. Traffic is directed to a subset of cluster nodes for handling by kube-proxy, unless the External Traffic Policy is set to "Local", in which case the traffic is sent to the specific backends the workloads are running on.</p><p>This allows us to have internal Kubernetes clusters that better replicate the load balancer behavior of managed clusters on cloud providers. Services running in Kubernetes, such as ingress controllers, <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api-gateway/">API gateways</a>, and databases, have access to the correct client IP addresses of load balanced traffic.</p>
    <div>
      <h2>Future Work</h2>
      <a href="#future-work">
        
      </a>
    </div>
    <ul><li><p>Keep a close eye on future developments of IPVS and alternatives, including nftlb, Express Data Path (XDP), and eBPF.</p></li><li><p>Migrate to nftables. The "flat priorities" and lack of a programmable interface for iptables make it ill-suited for including automated rules alongside rules added by operators. We hope that as more projects and operators move to nftables we'll be able to switch without creating a "blind spot" for operations.</p></li><li><p>Failures of a load balancer can result in temporary outages due to BGP hold timers. We'd like to improve how we handle failures of BGP sessions.</p></li><li><p>Investigate using Lightweight Tunnels to reduce the number of Foo-Over-UDP interfaces needed on the load balancer nodes.</p></li></ul>
    <div>
      <h2>Additional Reading</h2>
      <a href="#additional-reading">
        
      </a>
    </div>
    <ul><li><p><a href="https://vincent.bernat.im/en/blog/2018-multi-tier-loadbalancer"><i>Multi-tier load-balancing with Linux</i></a>. Vincent Bernat (2018). Describes a network load balancer using IPVS and IPIP.</p></li><li><p><a href="https://githubengineering.com/introducing-glb/"><i>Introducing the GitHub Load Balancer</i></a>. Joe Williams, Theo Julienne (2017). Describes a similar split layer 4 and layer 7 architecture, using consistent hashing and Foo-Over-UDP. They encountered limitations with IPVS that appear to have since been resolved.</p></li><li><p><a href="https://lwn.net/Articles/614348/"><i>Foo over UDP</i></a>. Jonathan Corbet (2014). Describes the basics of IPIP and Foo-Over-UDP, which was newly introduced at the time.</p></li><li><p><a href="https://www.netdevconf.org/0.1/sessions/11.html"><i>UDP encapsulation, FOU, GUE, &amp; RCO</i></a>. Tom Herbert (2015). Describes the different UDP encapsulation options.</p></li><li><p><a href="https://archive.nanog.org/sites/default/files/1_Saino_Hashing_On_Broken_Assumptions.pdf"><i>Hashing on broken assumptions</i></a>. Lorenzo Saino (2017). Further discussion of hashing difficulties with ECMP.</p></li><li><p><a href="https://archive.nanog.org/meetings/nanog45/presentations/Monday/Scholl_BFD_N45.pdf"><i>BFD (Bidirectional Forwarding Detection): Does it work and is it worth it?</i></a>. Tom Scholl (2009). Discussion of BFD with common protocols and where it can become a problem.</p></li></ul>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">6l38CAxcJrVhPKDMQvLXWx</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Adding the Fallback Pool to the Load Balancing UI and other significant UI enhancements]]></title>
            <link>https://blog.cloudflare.com/adding-the-fallback-pool-to-the-load-balancing-ui/</link>
            <pubDate>Sat, 21 Mar 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ The Cloudflare Load Balancer was introduced over three years ago to provide our customers with a powerful, easy to use tool to intelligently route traffic to their origins across the world. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The Cloudflare Load Balancer was <a href="/cloudflare-traffic/">introduced</a> over three years ago to provide our customers with a powerful, easy-to-use tool to intelligently route traffic to their origins across the world. During the initial design process, one of the questions we had to answer was ‘where do we send traffic if all pools are down?’ We did not think it made sense just to drop the traffic, so we used the concept of a ‘fallback pool’ to send traffic to a ‘pool of last resort’ in the case that no pools were detected as available. While this may still result in an error, it gave an eyeball request a chance of being served successfully in case the pool was still up.</p><p>As a brief reminder, a <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">load balancer</a> helps route traffic across your origin servers to ensure your overall infrastructure stays healthy and available. Load Balancers are made up of pools, which can be thought of as collections of servers in a particular location.</p><p>Over the past three years, we’ve made many updates to the dashboard, and the newest design adds the fallback pool to the dashboard UI. The use of a fallback pool is incredibly helpful in a tight spot, but not having it viewable in the dashboard led to confusion around which pool was set as the fallback. Was there a fallback pool set at all? We want to be sure you have the tools to support your day-to-day work, while also ensuring our dashboard is usable and intuitive.</p><p>You can now check which pool is set as the fallback in any given Load Balancer, and easily designate any pool in the Load Balancer as the fallback. If no fallback pool is set, then the last pool in the list will automatically be chosen. We made the decision to auto-set a pool to be sure that customers are always covered in case the worst scenario happens. 
You can access the fallback pool within the Traffic App of the Cloudflare dashboard when creating or editing a Load Balancer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ggltAcFa0BLRxsGYz12Lz/fb485229a99a7ee22949b9506be9c90b/image3-2.png" />
            
            </figure>
    <div>
      <h2>Load Balancing UI Improvements</h2>
      <a href="#load-balancing-ui-improvements">
        
      </a>
    </div>
    <p>Not only did we add the fallback pool to the UI, but we saw this as an opportunity to update other areas of the Load Balancing app that have caused some confusion in the past.</p>
    <div>
      <h3>Facelift and De-modaling</h3>
      <a href="#facelift-and-de-modaling">
        
      </a>
    </div>
    <p>As a start, we gave the main Load Balancing page a facelift, as well as de-modaling (moving content out of a smaller modal screen into a larger area) the majority of the Load Balancing UI. We felt moving this content out of a small web element would allow users to more easily understand the content on the page, and would let us make better use of the larger available space rather than being limited to the small area of a modal. This change applies when you create or edit a Load Balancer and when you manage monitors and/or pools.</p>
    <div>
      <h4>Before:</h4>
      <a href="#before">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6ODp92PgLiDpHkvYbouxHz/c1dfa0f0ddf83384599ccff09246f0b6/image5-4.png" />
            
            </figure>
    <div>
      <h4>After:</h4>
      <a href="#after">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/sfJyMiONZBd9isiHDqhYy/05d65590ce4c7e0104c840c37be205fd/image6-2.png" />
            
            </figure><p>The updated UI combines the health status and icon to declutter the available space and make it clear at a glance what the status is for a particular Load Balancer or Pool. We have also switched to a smaller toggle button across the Load Balancing UI, which freed up margin space for the action buttons. Now that we are utilizing the page surface area more efficiently, we went on to add more information to our tables so users are more aware of the shared aspects of their Load Balancers.</p>
    <div>
      <h3>Shared Objects and Editing</h3>
      <a href="#shared-objects-and-editing">
        
      </a>
    </div>
    <p>Shared objects have caused some level of concern for companies with teams across the world - all leveraging the Cloudflare dashboard.</p><p>Some of the shared objects, Monitors and Pools, now have a new column outlining which Pools or Load Balancers are currently using a particular Monitor or Pool. This brings more clarity around what will be affected by any changes made by someone in your organization, and helps users be more autonomous and confident when they make an update in the dashboard. If someone from team A wants to update a monitor for a production server, they can do so without worrying that monitoring for another pool might break, and without having to speak to team B first. The time saved, and the empowerment to make updates as your business changes, are incredibly valuable. This supports the velocity you want to achieve while maintaining a safe environment to operate in. The days of worrying about unforeseen consequences that could crop up later down the road are swiftly coming to a close.</p><p>This helps teams understand the impact of a given change and what else would be affected. But we did not feel this was enough: we want to be sure that everyone is confident in the changes they are making. On top of the additional columns, we added a number of confirmation modals to drive confidence about a particular change, each listing the other Load Balancers or Pools that would be impacted. To really drive home the message about which objects are shared, we made a final change: edits to monitors can now take place only within the Manage Monitors page. We felt that having users navigate to the manage page in itself gives more understanding that these items are shared. For example, allowing edits to a Monitor in the same view as editing a Load Balancer can make it seem like those changes apply only to that Load Balancer, which is not always the case.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7cPA9DDTZ7EmDkZCtr9PCf/3c334437776e2462badfbfd86b8fcae7/image9-3.png" />
            
            </figure>
    <div>
      <h4>Manage Monitors before:</h4>
      <a href="#manage-monitors-before">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4awzMfOIMP6lshqnLF409V/92ca4869adedddfb4da832e199991089/image8-1.png" />
            
            </figure>
    <div>
      <h4>Manage monitors after:</h4>
      <a href="#manage-monitors-after">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3xLO3oVzvRaRdpShzPtrY8/d8ca418d62bd4ed213fbb180aeac0f76/image2-4.png" />
            
            </figure>
    <div>
      <h3>Updated CTAs/Buttons</h3>
      <a href="#updated-ctas-buttons">
        
      </a>
    </div>
    <p>Lastly, when users would expand the Manage Load Balancer table to view more details about their Pools or Origins within that specific Load Balancer, they would click the large X icon in the top right of that expanded card to close it - seems reasonable in the expanded context.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3gTj2r2LBTViiOpVxQptqw/4d569e7beb2b7a280317f4858ad91599/image7-2.png" />
            
            </figure><p>But the X icon did not close the expanded card; it deleted the Load Balancer altogether. This is dangerous, and we want to prevent users from making mistakes. With the added space we gained from de-modaling large areas of the UI, we have updated these buttons to be clickable text buttons that read ‘Edit’ or ‘Delete’ instead of icon buttons. The difference is providing clearly defined text describing the action that will take place, rather than leaving it up to a user's interpretation of what the icon on the button means and the action it would result in. We felt this was much clearer, and that users would no longer be met with unwanted changes.</p><p>We are very excited about the updates to the Load Balancing dashboard and look forward to improving it day in and day out.</p>
    <div>
      <h4>After:</h4>
      <a href="#after">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ZblI7RVkKGLKcgJwvX9J0/f1928fd1df0fdc36d0a1575362ed3155/image1-6.png" />
            
            </figure> ]]></content:encoded>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Dashboard]]></category>
            <guid isPermaLink="false">6LhSBUmHtAtDR5zLfiumFX</guid>
            <dc:creator>Brian Batraski</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Load Balancing Analytics]]></title>
            <link>https://blog.cloudflare.com/introducing-load-balancing-analytics/</link>
            <pubDate>Tue, 10 Dec 2019 14:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare aspires to make Internet properties everywhere faster, more secure, and more reliable. Load Balancing helps with speed and reliability and has been evolving over the past three years. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Cloudflare aspires to make Internet properties everywhere faster, more secure, and more reliable. <a href="https://www.cloudflare.com/load-balancing/">Load Balancing</a> helps with speed and reliability and has been evolving over the past <a href="/cloudflare-traffic/">three years</a>.</p><p>Let’s go through a scenario that highlights a bit more of <a href="https://www.cloudflare.com/learning/performance/what-is-load-balancing/">what a Load Balancer is</a> and the value it can provide. A standard load balancer comprises a set of pools, each of which has origin servers identified by hostnames and/or IP addresses. A routing policy is assigned to each load balancer, which determines the origin pool selection process.</p><p>Let’s say you build an API using the cloud provider ACME Web Services. Unfortunately, ACME had a rough week, and their service had a regional outage in their Eastern US region. Consequently, your website was unable to serve traffic during this period, which resulted in reduced brand trust from users and missed revenue. To prevent this from happening again, you decide to take two steps: use a secondary cloud provider (in order to avoid having ACME as a single point of failure) and use Cloudflare’s Load Balancing to take advantage of the multi-cloud architecture. Cloudflare’s Load Balancing can help you maximize your API’s availability for your new architecture. For example, you can assign health checks to each of your origin pools. These health checks can monitor your origin servers’ health by checking HTTP status codes, response bodies, and more. If an origin pool’s response doesn’t match what is expected, then traffic will stop being steered there. This will reduce downtime for your API when ACME has a regional outage because traffic in that region will seamlessly be rerouted to your fallback origin pool(s). 
In this scenario, you can set the fallback pool to be origin servers in your secondary cloud provider. In addition to health checks, you can use the ‘random’ routing policy in order to distribute your customers’ API requests evenly across your backend. If you want to optimize your response time instead, you can use ‘dynamic steering’, which will send traffic to the origin determined to be closest to your customer.</p><p>Our customers love Cloudflare Load Balancing, and we’re always looking to improve and make our customers’ lives easier. Since Cloudflare’s Load Balancing was first released, the most popular customer request was for an analytics service that would provide insights on traffic steering decisions.</p><p>Today, we are rolling out <a href="https://dash.cloudflare.com/traffic/load-balancing-analytics">Load Balancing Analytics</a> in the Traffic tab of the Cloudflare dashboard. The three major components in the analytics service are:</p><ul><li><p>An overview of traffic flow that can be filtered by load balancer, pool, origin, and region.</p></li><li><p>A latency map that indicates origin health status and latency metrics from <a href="https://www.cloudflare.com/network/">Cloudflare’s global network spanning 194 cities</a> and growing!</p></li><li><p>Event logs denoting changes in origin health. This feature was released in 2018 and tracks pool and origin transitions between healthy and unhealthy states. We’ve moved these logs under the new Load Balancing Analytics subtab. See the <a href="https://support.cloudflare.com/hc/en-us/articles/360000062871-Load-Balancing-Event-Logs">documentation</a> to learn more.</p></li></ul><p>In this blog post, we’ll discuss the traffic flow distribution and the latency map.</p>
    <div>
      <h2>Traffic Flow Overview</h2>
      <a href="#traffic-flow-overview">
        
      </a>
    </div>
    <p>Our users want a detailed view into where their traffic is going, why it is going there, and insights into what changes may optimize their infrastructure. With Load Balancing Analytics, users can graphically view traffic demands on load balancers, pools, and origins over variable time ranges.</p><p>Understanding how traffic flow is distributed informs the process of creating new origin pools, adapting to peak traffic demands, and observing failover response during origin pool failures.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6TWw8kMjxWkVZ2m8MeYz8D/863596a6142772d6391a1e4dd3b1f3f4/image5-1.png" />
            
            </figure><p>Figure 1</p><p>In Figure 1, we can see an overview of traffic for a given domain. On Tuesday, the 24th, the red pool was created and added to the load balancer. In the following 36 hours, as the red pool handled more traffic, the blue and green pools both saw a reduced workload. In this scenario, the traffic distribution graph provided the customer with new insights. First, it demonstrated that traffic was being steered to the new red pool. It also allowed the customer to understand the new level of traffic distribution across their network. Finally, it allowed the customer to confirm whether traffic decreased in the expected pools. Over time, these graphs can be used to better manage capacity and plan for upcoming infrastructure needs.</p>
    <div>
      <h2>Latency Map</h2>
      <a href="#latency-map">
        
      </a>
    </div>
    <p>The traffic distribution overview is only one part of the puzzle. Another essential component is understanding request performance around the world. This is useful because customers can ensure user requests are handled as fast as possible, regardless of where in the world the request originates.</p><p>The standard Load Balancing configuration contains monitors that probe the health of customer origins. These monitors can be configured to run from particular regions or, for Enterprise customers, from <a href="https://www.cloudflare.com/network/">all Cloudflare locations</a>. They collect useful information, such as round-trip time, that can be aggregated to create the latency map.</p><p>The map provides a summary of how responsive origins are from around the world, so customers can see regions where requests are underperforming and may need further investigation. A common metric used to identify performance is request latency. We found that the p90 latency for all Load Balancing origins being monitored is 300 milliseconds, which means that 90% of all monitors’ health checks had a round trip time faster than 300 milliseconds. We used this value to identify locations where latency was slower than the p90 latency seen by other Load Balancing customers.</p>
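<p>The p90 figure quoted above is a standard nearest-rank percentile over the monitors' round-trip-time samples. As a rough illustration (not the production aggregation pipeline):</p>

```python
# Illustrative nearest-rank p90: the value at or below which 90% of
# health-check RTT samples fall. Not the production aggregation pipeline.
def p90(rtt_samples_ms):
    ordered = sorted(rtt_samples_ms)
    rank = -(-len(ordered) * 9 // 10)  # ceil(0.9 * n), nearest-rank method
    return ordered[rank - 1]
```

<p>Applied per location, a value above the fleet-wide 300 ms threshold would flag that location as slow on the map.</p>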
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/h0Yh8XJ5eicGBGDwqzKgO/5261df5479aaccfe537ebb6cab47a61f/image6-2.png" />
            
            </figure><p>Figure 2</p><p>In Figure 2, we can see the responsiveness of the Northeast Asia pool. The Northeast Asia pool is slow specifically for monitors in South America, the Middle East, and Southern Africa, but fast for monitors that are probing closer to the origin pool. Unfortunately, this means users of the pool in countries like Paraguay are seeing high request latency. High page load times have many unfortunate consequences: higher visitor bounce rate, decreased visitor satisfaction rate, and a lower search engine ranking. In order to avoid these repercussions, a site administrator could consider adding a new origin pool in a region closer to underserved regions. In Figure 3, we can see the result of adding a new origin pool in Eastern North America. We see the number of locations where the domain was found to be unhealthy drop to zero, and the number of slow locations is cut by more than 50%.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ZgBHTryIIddChrdpq3nPl/5ae7f131f2894e2caac5d28800681a48/image3-2.png" />
            
            </figure><p>Figure 3</p><p>Tied with the traffic flow metrics from the Overview page, the latency map arms users with insights to optimize their internal systems, reduce their costs, and increase their <a href="https://www.cloudflare.com/learning/performance/glossary/application-availability/">application availability</a>.</p>
    <div>
      <h2>GraphQL Analytics API</h2>
      <a href="#graphql-analytics-api">
        
      </a>
    </div>
    <p>Behind the scenes, Load Balancing Analytics is powered by the GraphQL Analytics API. As you’ll learn later this week, GraphQL provides many benefits to us at Cloudflare. Customers now only need to learn a single API format that will allow them to extract only the data they require. For internal development, GraphQL eliminates the need for customized analytics APIs for each service, reduces query cost by increasing cache hits, and reduces developer fatigue by using a straightforward query language with standardized input and output formats. Very soon, all Load Balancing customers on paid plans will be given the opportunity to extract insights from the GraphQL API.  Let’s walk through some examples of how you can utilize the GraphQL API to understand your Load Balancing logs.</p><p>Suppose you want to understand the number of requests the pools for a load balancer are seeing from the different locations in Cloudflare’s global network. The query in Figure 4 counts the number of unique (location, pool ID) combinations every fifteen minutes over the course of a week.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/iG1rrtCkk9R95r1l8b2Yf/2b4b4fc042c9141e93bc627d29ee0182/image7.png" />
            
            </figure><p>Figure 4</p><p>For context, our example load balancer, lb.example.com, utilizes dynamic steering. <a href="/i-wanna-go-fast-load-balancing-dynamic-steering/">Dynamic steering</a> directs requests to the most responsive, available, origin pool, which is often the closest. It does so using a weighted round-trip time measurement. Let’s try to understand why all traffic from Singapore (SIN) is being steered to our pool in Northeast Asia (asia-ne). We can run the query in Figure 5. This query shows us that the asia-ne pool has an avgRttMs value of 67ms, whereas the other two pools have avgRttMs values that exceed 150ms. The lower avgRttMs value explains why traffic in Singapore is being routed to the asia-ne pool.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5y7YRu5ZthmJV6JMZ6Sxlv/8199396319354447783e07a62cf685ad/image4-1.png" />
            
            </figure><p>Figure 5</p><p>Notice how the query in Figure 4 uses the loadBalancingRequestsGroups schema, whereas the query in Figure 5 uses the loadBalancingRequests schema. loadBalancingRequestsGroups queries aggregate data over the requested query interval, whereas loadBalancingRequests provides granular information on individual requests. For those ready to get started, Cloudflare has written a helpful <a href="https://developers.cloudflare.com/analytics/graphql-api/getting-started/">guide</a>. The <a href="https://graphql.org/learn/">GraphQL website</a> is also a great resource. We recommend you use an IDE like <a href="https://electronjs.org/apps/graphiql">GraphiQL</a> to make your queries. GraphiQL embeds the schema documentation into the IDE, autocompletes, saves your queries, and manages your custom headers, all of which help make the developer experience smoother.</p>
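<p>As a starting point, here is a hedged sketch of a request body for the GraphQL endpoint (https://api.cloudflare.com/client/v4/graphql). The zone tag is a placeholder, and the selection-set field names are illustrative assumptions; use GraphiQL's embedded schema documentation to confirm the exact fields:</p>

```python
import json

# Hedged sketch: assemble a request body for the GraphQL Analytics API.
# The zone tag is a placeholder, and the fields inside "dimensions" are
# illustrative assumptions; check the schema docs in GraphiQL.
query = """{
  viewer {
    zones(filter: { zoneTag: "YOUR_ZONE_TAG" }) {
      loadBalancingRequestsGroups(
        limit: 1000
        filter: { datetime_gt: "2019-12-02T00:00:00Z" }
      ) {
        count
        dimensions {
          coloCode
          selectedPoolId
        }
      }
    }
  }
}"""
payload = json.dumps({"query": query})
```

<p>The payload would be POSTed with your API token in an Authorization header; swapping loadBalancingRequestsGroups for loadBalancingRequests yields per-request records instead of aggregates.</p>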
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Now that the Load Balancing Analytics solution is live and available to all Pro, Business, and Enterprise customers, we’re excited for you to start using it! We’ve attached a survey to the Traffic overview page, and we’d love to hear your feedback.</p> ]]></content:encoded>
            <category><![CDATA[Insights]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">50oghLDuz1nzE8oznPQP3Y</guid>
            <dc:creator>Brian Batraski</dc:creator>
            <dc:creator>Rohin Lohe</dc:creator>
        </item>
        <item>
            <title><![CDATA[Announcing deeper insights and new monitoring capabilities from Cloudflare Analytics]]></title>
            <link>https://blog.cloudflare.com/announcing-deeper-insights-and-new-monitoring-capabilities/</link>
            <pubDate>Mon, 09 Dec 2019 15:15:00 GMT</pubDate>
            <description><![CDATA[ This week we’re excited to announce a number of new products and features that provide deeper security and reliability insights, “proactive” analytics when there’s a problem, and more powerful ways to explore your data. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>This week we’re excited to announce a number of new products and features that provide deeper security and reliability insights, “proactive” analytics when there’s a problem, and more powerful ways to explore your data.</p><p>If you’ve been a user or follower of Cloudflare for a little while, you might have noticed that we take pride in turning technical challenges into easy solutions. Flip a switch or run a few API commands, and the attack you’re facing is now under control or your site is now 20% faster. However, this ease of use is even more helpful if it’s complemented by analytics. Before you make a change, you want to be sure that you understand your current situation. After the change, you want to confirm that it worked as intended, ideally as fast as possible.</p><p>Because of the front-line position of Cloudflare’s network, we can provide comprehensive metrics regarding both your traffic and the security and performance of your Internet property. And best of all, there’s nothing to set up or enable. <a href="https://www.cloudflare.com/analytics/">Cloudflare Analytics</a> is automatically available to all Cloudflare users and doesn’t rely on Javascript trackers, meaning that our metrics include traffic from APIs and bots and are not skewed by ad blockers.</p><p>Here’s a sneak peek of the product launches. Look out for individual blog posts this week for more details on each of these announcements.</p><ul><li><p><b>Product Analytics</b>:</p><ul><li><p>Today, we’re making Firewall Analytics available to Business and Pro plans, so that more customers understand how well Cloudflare mitigates attacks and handles malicious traffic. 
And we’re highlighting some new metrics, such as the rate of solved captchas (useful for Bot Management), and features, such as customizable reports to facilitate sharing and archiving attack information.</p></li><li><p>We’re introducing Load Balancing Analytics, which shows traffic flows by load balancer, pool, origin, and region, and helps explain why a particular origin was selected to receive traffic.</p></li></ul></li><li><p><b>Monitoring</b>:</p><ul><li><p>We’re announcing tools to help you monitor your origin either actively or passively and automatically reroute your traffic to a different server when needed. Because Cloudflare sits between your end users and your origin, we can spot problems with your servers without the use of external monitoring services.</p></li></ul></li><li><p><b>Data tools</b>:</p><ul><li><p>The product analytics we’ll be featuring this week use a new API behind the scenes. We’re making this API generally available, allowing you to easily build custom dashboards and explore all of your Cloudflare data the same way we do, so you can easily gain insights and identify and debug issues.</p></li></ul></li><li><p><b>Account Analytics</b>:</p><ul><li><p>We’re releasing (in beta) a new dashboard that shows aggregated information for all of the domains under your account, allowing you to know what’s happening at a glance.</p></li></ul></li></ul><p>We’re excited to tell you about all of these new products in this week’s posts and would love to hear your thoughts. If you’re not already subscribing to the blog, <a href="/subscribe/">sign up now</a> to receive daily updates in your inbox.</p> ]]></content:encoded>
            <category><![CDATA[Insights]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <guid isPermaLink="false">36PqKr3OPz3YyEdu7xwpTm</guid>
            <dc:creator>Filipp Nisenzoun</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Partners: A New Program with New Partners]]></title>
            <link>https://blog.cloudflare.com/cloudflare-partners-a-new-program-with-new-partners/</link>
            <pubDate>Thu, 06 Jun 2019 13:01:00 GMT</pubDate>
            <description><![CDATA[ Many overlook a critical portion of the language in Cloudflare’s mission: “to help build a better Internet.” From the beginning, we knew a mission this bold couldn’t be done alone. ]]></description>
            <content:encoded><![CDATA[ <p>Many overlook a critical portion of the language in Cloudflare’s mission: “<b>to help</b> build a better Internet.” From the beginning, we knew a mission this bold, an undertaking of this magnitude, couldn’t be done alone. We could only help. To ultimately build a better Internet, it would take a diverse and engaged ecosystem of technologies, customers, partners, and end-users. Fortunately, we’ve been able to work with amazing partners as we’ve grown, and we are eager to announce new, specific programs to grow our ecosystem with an increasingly diverse set of partners.</p><p>Today, we’re excited to announce the latest iteration of our partnership program for solutions partners. These categories encompass resellers and referral partners, OEM partners, and the new partner services program. Over the past few years, we’ve grown and learned from some <a href="https://www.cloudflare.com/integrations/ibm-cloud-internet-services/">amazing</a> partnerships, and want to bring those best practices to our latest partners at scale—to help them grow their business with Cloudflare’s global network.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3mnK5fS6EtWHJqBeHT86iP/da404070d710d022cbd48db2fb247dd3/image-17.png" />
            
            </figure><p>Cloudflare Partner Tiers</p>
    <div>
      <h3>Partner Program for Solution Partners</h3>
      <a href="#partner-program-for-solution-partners">
        
      </a>
    </div>
    <p>Every partner program out there has tiers, and Cloudflare’s program is no exception. However, our tiering was built to help our partners ramp up, accelerate and move fast. As Matt Harrell <a href="/helping-cloudflare-build-its-partnerships-worldwide/">highlighted</a>, we want the process to be as seamless as possible, to help partners find the level of engagement that works best for them, with world-class enablement paths and best-in-class revenue share models—built-in from the beginning.</p>
    <div>
      <h4>World-Class Enablement</h4>
      <a href="#world-class-enablement">
        
      </a>
    </div>
    <p>Cloudflare offers complimentary training and enablement to all partners. From <a href="https://www.cloudflare.com/cloudflare-partners-self-serve-program-open-beta/">self-serve paths</a> to partner-focused webinars, instructor-based courses, and certification, we want to ensure our partners can develop Cloudflare and product expertise, making them as effective as possible when utilizing our massive global network.</p>
    <div>
      <h4>Driving Business Value</h4>
      <a href="#driving-business-value">
        
      </a>
    </div>
    <p>We want our partners to grow and succeed. From self-serve resellers to our most custom integrations, we want to make it frictionless to build a profitable business on Cloudflare. From our tenant API system to dedicated account teams, we’re ready to help you go to market with solutions that help end-customers. This includes opportunities to co-market, develop target accounts, and directly partner to succeed with strategic accounts.</p><p>Cloudflare recognizes that, in order to help build a better Internet, we need successful partners—and our latest program is built to help partners build and grow profitable business models around Cloudflare’s solutions.</p>
    <div>
      <h3>Partner Services Program - SIs, MSPs, MSSPs, PSOs</h3>
      <a href="#partner-services-program-sis-mssps-mssps-psos">
        
      </a>
    </div>
    <p>For the first time, we are expanding our program to also include service providers that want to develop and grow profitable services practices around Cloudflare’s solutions.</p><p>Our customers face some of the most complex challenges on the Internet. From those challenges, we’ve already seen some amazing opportunities for service providers to create value, grow their business, and make an impact. From customers migrating to the public, hybrid, or multi-cloud for the first time, to entirely re-writing applications using Cloudflare Workers®, the need for expertise has increased across our entire customer base. In early pilots, we’ve seen four major categories in building a successful service practice around Cloudflare:</p><ul><li><p><b>Network Digital Transformations -</b> Help customers migrate and <a href="https://www.cloudflare.com/learning/network-layer/network-modernization/">modernize</a> their network solution. Cloudflare is the only cloud-native network to give Enterprise-grade control and visibility into on-prem, hybrid, and multi-cloud architectures.</p></li><li><p><b>Serverless Architecture Development -</b> Provide serverless application development services, thought leadership, and technical consulting around leveraging Cloudflare Workers and Apps.</p></li><li><p><b>Managed Security &amp; Insights -</b> Enable CISO and IT leaders to obtain a single pane of glass for reporting and policy management across all Internet-facing applications with <a href="https://www.cloudflare.com/application-services/solutions/">Cloudflare’s Security solutions</a>.</p></li><li><p><b>Managed Performance &amp; Reliability -</b> Keep customers’ high-availability applications running quickly and efficiently with Cloudflare’s Global Load Balancing, Smart Routing, and Anycast DNS, which allows for performance consulting, traffic analysis, and application monitoring.</p></li></ul><p>As we expand this program, we are looking for audacious service providers and system integrators 
that want to help us build a better Internet for our mutual customers. Cloudflare can be an essential lynchpin to simplify and accelerate digital transformations. We imagine a future where massive applications run within 10ms of 90% of the global population, and where a single-pane solution provides security, performance, and reliability for mission-critical applications—running across multiple clouds. Cloudflare needs help from amazing global integrators and service partners to help realize this future.</p><p>If you are interested in learning more about becoming a service partner and growing your business with Cloudflare, please reach out to <a>partner-services@cloudflare.com</a> or explore <a href="https://www.cloudflare.com/partners/services/">cloudflare.com/partners/services</a></p>
    <div>
      <h3>Just Getting Started</h3>
      <a href="#just-getting-started">
        
      </a>
    </div>
    <p><a href="https://en.wikipedia.org/wiki/Metcalfe%27s_law">Metcalfe’s law</a> states that a network’s value grows with the number of nodes within it. And within the global Cloudflare network, we want as many partner nodes as possible—from agencies to systems integrators, managed security providers, Enterprise resellers, and OEMs. A diverse ecosystem of partners is essential to our mission of helping to build a better Internet, together. We are dedicated to the success of our partners, and we will continue to iterate and develop our programs to make sure our partners can grow and develop on Cloudflare’s global network. Our commitment moving forward is that Cloudflare will be the easiest and most valuable solution for channel partners to sell and support globally.</p><hr /><p>More Information:</p><ul><li><p><a href="http://cloudflare.com/partners">Partner Program Website</a></p></li><li><p><a href="http://cloudflare.com/partners/services">Partner Services Website</a></p></li><li><p><a href="/announcing-the-new-cloudflare-partner-platform/">Announcing the new Cloudflare Partner Platform</a></p></li><li><p>A New Program with New Partners (you are here)</p></li><li><p><a href="/helping-cloudflare-build-its-partnerships-worldwide/">Building Partnerships Worldwide</a></p></li></ul> ]]></content:encoded>
            <category><![CDATA[Partners]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Anycast]]></category>
            <guid isPermaLink="false">6KYYY2yQKW31ddjNhJcFJz</guid>
            <dc:creator>Dan Hollinger</dc:creator>
        </item>
        <item>
            <title><![CDATA[MultiCloud... flare]]></title>
            <link>https://blog.cloudflare.com/multicloudflare/</link>
            <pubDate>Sun, 02 Jun 2019 21:00:00 GMT</pubDate>
            <description><![CDATA[ If you want to start an intense conversation in the halls of Cloudflare, try describing us as a "CDN". CDNs don't generally provide you with Load Balancing, they don't allow you to deploy Serverless Applications, and they certainly don't get installed onto your phone.  ]]></description>
            <content:encoded><![CDATA[ <p>If you want to start an intense conversation in the halls of Cloudflare, try describing us as a "CDN". <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDNs</a> don't generally provide you with <a href="https://www.cloudflare.com/load-balancing/">Load Balancing</a>, they don't allow you to deploy <a href="https://workers.cloudflare.com/">Serverless Applications</a>, and they certainly don't get installed onto <a href="https://1.1.1.1">your phone</a>. One of the costs of that confusion is many people don't realize everything Cloudflare can do for people who want to operate in multiple public clouds, or want to operate in both the cloud and on their own hardware.</p>
    <div>
      <h3>Load Balancing</h3>
      <a href="#load-balancing">
        
      </a>
    </div>
    <p>Cloudflare has countless servers located in 180 data centers around the world. Each one is capable of acting as a Layer 7 load balancer, directing incoming traffic between origins wherever they may be. You could, for example, add load balancing between a set of machines you have in AWS' EC2, and another set you keep in Google Cloud.</p><p>This load balancing isn't just round-robining traffic. It supports weighting to allow you to control how much traffic goes to each cluster. It supports latency-based routing to automatically route traffic to the cluster which is closer (so adding geographic distribution can be as simple as spinning up machines). It even supports health checks, allowing it to automatically direct traffic to the cloud which is currently healthy.</p><p>Most importantly, it doesn't run in any of the providers' clouds and isn't dependent on them to function properly. Even better, since the load balancing runs near virtually every Internet user around the world, it doesn't come at any performance cost. (Using our Argo technology, performance often gets better!)</p>
    <div>
      <h3>Argo Tunnel</h3>
      <a href="#argo-tunnel">
        
      </a>
    </div>
    <p>One of the hardest components to managing a multi-cloud deployment is networking. Each provider has their own method of defining networks and firewalls, and even tools which can deploy clusters across multiple clouds often can't quite manage to get the networking configuration to work in the same way. The task of setting it up can often be a trial-and-error game where the final config is best never touched again, leaving 'going multi-cloud' as a failed experiment within organizations.</p><p>At Cloudflare we have a technology called Argo Tunnel which flips networking on its head. Rather than opening ports and directing incoming traffic, each of your virtual machines (or k8s pods) makes outbound tunnels to the nearest Cloudflare PoPs. All of your Internet traffic then flows over those tunnels. You keep all your ports closed to inbound traffic, and never have to think about Internet networking again.</p><p>What's so powerful about this configuration is it makes it trivial to spin up machines in new locations. Want a dozen machines in Australia? As long as they start the Argo Tunnel daemon they will start receiving traffic. Don't need them any more? Shut them down and the traffic will be routed elsewhere. And, of course, none of this relies on any one public cloud provider, making it reliable even if they should have issues.</p><p>Argo Tunnel makes it trivial to add machines in new clouds, or to keep machines on-prem even as you start shifting workloads into the Cloud.</p>
    <div>
      <h3>Access Control</h3>
      <a href="#access-control">
        
      </a>
    </div>
    <p>One thing you'll realize about using Argo Tunnel is you now have secure tunnels which connect your infrastructure with Cloudflare's network. Once traffic reaches that network, it doesn't necessarily have to flow directly to your machines. It could, for example, have access control applied where we use your Identity Provider (like Okta or Active Directory) to decide who should be able to access what. Rather than wrestling with VPCs and VPN configurations, you can move to a zero-trust model where you use policies to decide exactly who can access what on a per-request basis.</p><p>In fact, you <a href="https://twitter.com/LakeAustinBlvd/status/1134987563385741312">can now do this with SSH</a> as well! You can manage all your user accounts in a single place and control with precision who can access which piece of infrastructure, irrespective of which cloud it's in.</p>
    <div>
      <h3>Our Reliability</h3>
      <a href="#our-reliability">
        
      </a>
    </div>
    <p>No computer system is perfect, and ours is no exception. We make mistakes, have bugs in our code, and deal with the pain of operating at the scale of the Internet every day. One great innovation in the recent history of computers, however, is the idea that it is possible to build a reliable system on top of many individually unreliable components.</p><p>Each of Cloudflare's PoPs is designed to function without communication or support from others, or from a central data center. That alone greatly increases our tolerance for network partitions and moves us from maintaining a single system to maintaining something closer to 180 independent clouds, any of which can serve all traffic.</p><p>We are also a system built on anycast which allows us to tap into the fundamental reliability of the Internet. The Internet uses a protocol called BGP which asks each system who would like to receive traffic for a particular IP address to 'advertise' it. Each router then decides where to forward traffic based on which advertiser of an address is the closest. We advertise all of our IP addresses in every one of our data centers. If a data center goes down, it stops advertising BGP routes, and the very same packets which would have been destined for it arrive in another PoP seamlessly.</p><p>Ultimately we are trying to help build a better Internet. We don't believe the Internet is built on the back of a single provider. Many of the services provided by these cloud providers are simply too complex to be as reliable as the Internet demands.</p><p>True reliability and cost control both require existing on multiple clouds. It is clear that the tools which the Internet of the 80s and 90s gave us may be insufficient to move into that future. With a smarter network we can do more, better.</p>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Cloudflare Access]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">lxHf2eQhaO4jDgb7o8ADk</guid>
            <dc:creator>Zack Bloom</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing Spectrum with Load Balancing]]></title>
            <link>https://blog.cloudflare.com/introducing-spectrum-with-load-balancing/</link>
            <pubDate>Thu, 25 Oct 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ We’re excited to announce the full integration of Cloudflare Spectrum with Load Balancing. Combining Spectrum with Load Balancing enables traffic management of TCP connections utilising the same battle tested Load Balancer our customers already use for billions of HTTP requests every day. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>We’re excited to announce the full integration of Cloudflare Spectrum with Load Balancing. Combining Spectrum with Load Balancing enables traffic management of TCP connections utilising the same battle tested Load Balancer our customers already use for billions of HTTP requests every day.</p><p>Customers can configure load balancers with TCP health checks, failover, and steering policies to dictate where traffic should flow. This is live in the Cloudflare dashboard and API — give it a shot!</p>
    <div>
      <h3>TCP Health Checks</h3>
      <a href="#tcp-health-checks">
        
      </a>
    </div>
    <p>You can now configure <a href="https://www.cloudflare.com/load-balancing/">Cloudflare’s Load Balancer</a> health checks to probe any TCP port for an accepted connection. This is in addition to the existing HTTP and HTTPS options.</p><p>Health checks are an optional feature within Cloudflare’s Load Balancing product. Without health checks, the Cloudflare Load Balancer will distribute traffic to all origins in the first pool. While this is in itself useful, adding a health check to a Load Balancer provides additional functionality.</p><p>With a health check configured for a pool in a Load Balancer, Cloudflare will automatically distribute traffic within a pool to any origins that are marked as healthy by the health check. Unhealthy origins will be dropped automatically. This allows for intelligent failover both within a pool and amongst pools. Health checks can be configured from multiple regions (and even all of Cloudflare’s PoPs as an Enterprise customer) to detect local and global connectivity issues from your origins.</p><p>In this example, we will configure a TCP health check for an application running on port 2408 with a refresh rate of every 30 seconds via either the dashboard or our API.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/wwdhg0k3y6QUdo9oSRic1/23425df3294af54b150d95f5ec05e1fc/Load-Balancing-Manage-Monitors.png" />
            
            </figure><p>Configuring a TCP health check</p>
            <pre><code># POST accounts/:account_identifier/load_balancers/monitors

{
  "description": "Spectrum Health Check",
  "type": "tcp",
  "port": 2408,
  "interval": 30,
  "retries": 2,
  "timeout": 5,
  "method": "connection_established"
}</code></pre>
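<p>The call above can be scripted. A minimal sketch: the endpoint path comes from the comment in the example, while the account ID, token, and the Bearer-style auth header are placeholders you would substitute with your own credentials:</p>

```python
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"

def build_tcp_monitor(port, interval=30, retries=2, timeout=5, description=""):
    """Assemble the TCP monitor payload shown above."""
    return {
        "description": description,
        "type": "tcp",
        "port": port,
        "interval": interval,
        "retries": retries,
        "timeout": timeout,
        "method": "connection_established",
    }

def create_monitor(account_id, token, payload):
    # POST accounts/:account_identifier/load_balancers/monitors
    req = urllib.request.Request(
        f"{API}/accounts/{account_id}/load_balancers/monitors",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",  # placeholder credentials
            "Content-Type": "application/json",
        },
        method="POST",
    )
    return json.load(urllib.request.urlopen(req))
```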
            
    <div>
      <h3>Weights</h3>
      <a href="#weights">
        
      </a>
    </div>
    <p>Origin weights are beneficial should you have origins that are not of equal capacity or if you want to unequally split traffic for any other reason.</p><p>Weights configured within a load balancer pool will be honored with transport load balancing through Spectrum. If configured, Cloudflare will distribute traffic amongst the available origins within a pool according to the relative weights assigned to each origin.</p><p>For further information on weighted steering, see the <a href="https://support.cloudflare.com/hc/en-us/articles/360001372131-Load-Balancing-Configurable-Origin-Weights">knowledge base article</a>.</p>
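<p>As a toy sketch (not Cloudflare's actual implementation), weighted distribution within a pool amounts to proportional random selection; the origin names and weights below are made up:</p>

```python
import random

def pick_origin(origins, rng=random):
    """Choose one origin with probability proportional to its weight.

    `origins` maps origin name -> relative weight; this mirrors the
    weighted steering behaviour described above as a toy model only.
    """
    names = list(origins)
    weights = [origins[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

<p>Over many requests, an origin with weight 0.75 receives roughly three quarters of the traffic within its pool.</p>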
    <div>
      <h3>Steering Modes</h3>
      <a href="#steering-modes">
        
      </a>
    </div>
    <p>All steering modes are available for transport load balancing through Spectrum. You can choose standard failover, dynamic steering, or geo steering:</p><ul><li><p><b>Failover</b>: In this mode, the Cloudflare Load Balancer will <a href="https://www.cloudflare.com/learning/performance/what-is-server-failover/">fail over</a> amongst the pools listed in a given load balancer configuration as they are marked down by health checks. If all pools are marked down, Cloudflare will send traffic to the fallback pool, which is the last pool in the list in the dashboard or the pool specifically nominated via a parameter in the API. If no health checks are configured, Cloudflare will send traffic to the primary pool exclusively.</p></li><li><p><b>Dynamic Steering</b>: <a href="/i-wanna-go-fast-load-balancing-dynamic-steering/">Dynamic steering</a> was recently introduced by Cloudflare as a way of directing traffic to the fastest pool for a given user. In this mode, the Cloudflare load balancer will select the fastest pool for the given Cloudflare region or PoP (Enterprise only) based on health check data. If there is no health check data for a given colo or region, the load balancer will select a pool in failover order. It is important to note that with TCP health checks, the calculated latency may not be representative of the true latency to origin if you are terminating TCP at a cloud provider edge location.</p></li><li><p><b>Geo Steering</b>: <a href="https://support.cloudflare.com/hc/en-us/articles/115000540888-Load-Balancing-Geographic-Regions">Geo Steering</a> allows you to specify pools for a given region or PoP (Enterprise only). In this configuration, Cloudflare will direct traffic from specified Cloudflare locations to the configured pools. You may configure multiple pools, and the load balancer will use them in failover order. If this steering mode is selected and there is no configuration for a region or pool, the load balancer will use the default failover order.</p></li></ul>
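<p>The failover mode described above can be sketched in a few lines. This is an illustration only, not the Load Balancer's actual code; the function and pool names are hypothetical:</p>

```python
def select_pool(pools, is_healthy, fallback):
    """Return the first pool in failover order whose health check marks
    it up; if every pool is down, return the nominated fallback pool."""
    for pool in pools:
        if is_healthy.get(pool, False):
            return pool
    return fallback
```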
    <div>
      <h3>Build Scalable TCP Applications</h3>
      <a href="#build-scalable-tcp-applications">
        
      </a>
    </div>
    <p>Once your load balancer is configured, it’s available for use as an origin with your Spectrum application:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7dMaJShOEXdDpK5pu4sbLm/7fe4496c05d50895bd3c48e4896e26c4/Load-balancing-Edit-Application.png" />
            
            </figure><p>Configuring a Spectrum application with Load Balancing</p><p>Combining Spectrum’s ability to proxy TCP applications, our Load Balancer’s full feature set, and Cloudflare’s global network allows our customers to build performant, reliable, and secure network applications with minimal effort.</p><p>We’ve seen customers combine Spectrum and Load Balancing to build scalable gaming platforms, make their live streaming infrastructure more robust, push the envelope with interesting cryptocurrency use cases, and lots more. What will you build?</p><p>Spectrum with Load Balancing is available to all current Spectrum and Load Balancing users. Want access to Spectrum? <a href="https://cloudflare.com/products/cloudflare-spectrum/">Get in touch with our team</a>. Spectrum is available for applications on the Enterprise plan.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">4tdZv1JvUSLlr67ji4kXC1</guid>
            <dc:creator>Rustam Lalkaka</dc:creator>
            <dc:creator>Sergi Isasi</dc:creator>
        </item>
        <item>
            <title><![CDATA[Fixing an old hack - why we are bumping the IPv6 MTU]]></title>
            <link>https://blog.cloudflare.com/increasing-ipv6-mtu/</link>
            <pubDate>Mon, 10 Sep 2018 09:21:25 GMT</pubDate>
            <description><![CDATA[ Back in 2015 we deployed ECMP routing - Equal Cost Multi Path - within our datacenters. This technology allowed us to spread traffic heading to a single IP address across multiple physical servers. ]]></description>
            <content:encoded><![CDATA[ <p>Back in 2015 we deployed <a href="https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing">ECMP routing</a> - Equal Cost Multi Path - within our datacenters. This technology allowed us to spread traffic heading to a single IP address across multiple physical servers.</p><p>You can think about it as a third layer of load balancing.</p><ul><li><p>First we split the traffic across multiple IP addresses with DNS.</p></li><li><p>Then we split the traffic across <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/">multiple datacenters with Anycast</a>.</p></li><li><p>Finally, we split the traffic across multiple servers with ECMP.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6brB6jYvBwnIHL5UyGZZpq/dbc2f52a7ba7056bebc36eb90ec98c55/5799659266_28038df72f_b.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/superlatif/5799659266/">photo</a> by <a href="https://www.flickr.com/photos/superlatif/">Sahra</a> by-sa/2.0</p><p>When deploying ECMP we hit a problem with Path MTU discovery. The ICMP packets destined to our Anycast IP's were being dropped. You can read more about that (and the solution) in the 2015 blog post <a href="/path-mtu-discovery-in-practice/">Path MTU Discovery in practice</a>.</p><p>To solve the problem we created a small piece of software, called <code>pmtud</code> (<a href="https://github.com/cloudflare/pmtud">https://github.com/cloudflare/pmtud</a>). Since deploying <code>pmtud</code>, our ECMP setup has been working smoothly.</p>
    <div>
      <h3>Hardcoding IPv6 MTU</h3>
      <a href="#hardcoding-ipv6-mtu">
        
      </a>
    </div>
    <p>During that initial ECMP rollout things were broken. To keep services running until <code>pmtud</code> was done, we deployed a quick hack: we reduced the MTU of IPv6 traffic to the minimum possible value, 1280 bytes.</p><p>This was done as a tag on a default route. This is how our routing table used to look:</p>
            <pre><code>$ ip -6 route show
...
default via 2400:xxxx::1 dev eth0 src 2400:xxxx:2  metric 1024  mtu 1280</code></pre>
            <p>Notice the <code>mtu 1280</code> in the default route.</p><p>With this setting our servers never transmitted IPv6 packets larger than 1280 bytes, therefore "fixing" the issue. Since all IPv6 routers must have an MTU of at least 1280, we could expect that no ICMP Packet-Too-Big message would ever be sent to us.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/WoQ4atXDbAlJ1mhDnXDcf/0e2df496f1184bf35217d2be59f255b8/ECMP-hashing-ICMP.png" />
            
            </figure><p>Remember - the original problem introduced by ECMP was that ICMP routed back to our Anycast addresses could go to a wrong machine within the ECMP group. Therefore we became ICMP black holes. Cloudflare would send large packets; they would be dropped, with an ICMP PTB packet flying back to us, which, in turn, would fail to be delivered to the right machine due to ECMP.</p><p>But why did this problem not appear for IPv4 traffic? We believe the same issue exists on IPv4, but it's less damaging due to the different nature of the network. IPv4 is more mature and the great majority of end-hosts support either MTU 1500 or have their MSS option well configured - or clamped by some middle box. This is different in IPv6, where a large proportion of users use tunnels, have a Path MTU strictly smaller than 1500, and use incorrect MSS settings in the TCP header. Finally, Linux implements <a href="https://tools.ietf.org/html/rfc4821">RFC4821</a> for IPv4 but not IPv6. RFC4821 (PLPMTUD) has its disadvantages, but does slightly help to alleviate the ICMP blackhole issue.</p><p>Our "fix" - reducing the MTU to 1280 - was serving us well and we had no pressing reason to revert it.</p><p>Researchers did notice though. We were caught red-handed twice:</p><ul><li><p><a href="https://blog.apnic.net/2016/11/15/scoring-dns-root-server-system/">In 2017 Geoff Huston noticed (pdf)</a> that we sent DNS fragments of 1280 only (<a href="https://blog.apnic.net/2016/11/15/scoring-dns-root-server-system/">older blog post</a>).</p></li><li><p>In June 2018 the paper <a href="http://tma.ifip.org/2018/wp-content/uploads/sites/3/2018/06/tma2018_paper57.pdf">"Exploring usable Path MTU in the Internet" (pdf)</a> mentioned our weird setting - where we can accept 1500 bytes just fine, but transmit is limited to 1280.</p></li></ul>
    <div>
      <h3>When small MTU is too small</h3>
      <a href="#when-small-mtu-is-too-small">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3yo93AUS4hONJUarTUlGP/1c9fd02139519ea6f947b61a39da9d8b/6545737741_077583ca1f_b-1.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/nh53/6545737741/">photo</a> by <a href="https://www.flickr.com/photos/nh53">NH53</a> by/2.0</p><p>This changed recently, when we started working on <a href="/spectrum/">Cloudflare Spectrum</a> support for UDP. Spectrum is a terminating proxy, able to handle protocols other than HTTP. Getting Spectrum to <a href="/how-we-built-spectrum/">forward TCP was relatively straightforward</a> (barring a couple of <a href="/mmproxy-creative-way-of-preserving-client-ips-in-spectrum/">awesome hacks</a>). UDP is different.</p><p>One of the major issues we hit was related to the MTU on our servers.</p><p>During tests we wanted to forward UDP VPN packets through Spectrum. As you can imagine, any VPN would encapsulate a packet in another packet. Spectrum received packets like this:</p>
            <pre><code>+---------------------+------------------------------------------------+
|  IPv6 + UDP header  |  UDP payload encapsulating a 1280 byte packet  |
+---------------------+------------------------------------------------+</code></pre>
            <p>It's pretty obvious that our edge servers, supporting IPv6 packets of at most 1280 bytes, won't be able to handle this type of traffic. We are going to need an MTU of at least 1280+40+8 bytes! Hardcoding MTU=1280 in IPv6 may be an acceptable solution if you are an end-node on the internet, but it is definitely too small when forwarding tunneled traffic.</p>
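<p>The arithmetic can be written out explicitly; the constants are the fixed IPv6 and UDP header sizes:</p>

```python
IPV6_HEADER = 40  # fixed IPv6 header, bytes
UDP_HEADER = 8    # UDP header, bytes

def min_edge_mtu(inner_packet_size):
    """Smallest edge MTU that can carry a tunneled packet of
    `inner_packet_size` bytes wrapped in IPv6 + UDP."""
    return inner_packet_size + IPV6_HEADER + UDP_HEADER

# A tunneled 1280-byte packet needs at least 1328 bytes of edge MTU.
```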
    <div>
      <h3>Picking a new MTU</h3>
      <a href="#picking-a-new-mtu">
        
      </a>
    </div>
    <p>But what MTU value should we use instead? Let's see what other major internet companies do. Here are a couple of examples of advertised MSS values in TCP SYN+ACK packets over IPv6:</p>
            <pre><code>+---- site -----+- MSS --+- estimated MTU -+
| google.com    |  1360  |    1420         |
+---------------+--------+-----------------+
| facebook.com  |  1410  |    1470         |
+---------------+--------+-----------------+
| wikipedia.org |  1440  |    1500         |
+---------------+--------+-----------------+</code></pre>
            <p>I believe Google and Facebook adjust their MTU due to their use of L4 load balancers. Their implementations do IP-in-IP encapsulation, so they need a bit of space for the encapsulation header. Read more:</p><ul><li><p>Google - <a href="https://ai.google/research/pubs/pub44824">Maglev</a></p></li><li><p>Facebook - <a href="https://code.fb.com/open-source/open-sourcing-katran-a-scalable-network-load-balancer/">Katran</a></p></li></ul><p>There may be other reasons for having a smaller MTU. A reduced value may decrease the probability of the Path MTU detection algorithm kicking in (i.e., relying on ICMP PTB). We can theorize that for the misconfigured eyeballs:</p><ul><li><p>MTU=1280 will never run Path MTU detection.</p></li><li><p>MTU=1500 will always run it.</p></li><li><p>In-between values would have varying chances of hitting the problem.</p></li></ul><p>But just what is the chance of that?</p><p>A quick unscientific study of the MSS values we encountered from eyeballs shows the following distributions. For connections going over IPv4:</p>
            <pre><code>IPv4 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cumulative
  1300 |                                                 *  1.28%   98.53%
  1360 |                                              ****  4.40%   95.68%
  1370 |                                                 *  1.15%   91.05%
  1380 |                                               ***  3.35%   89.81%
  1400 |                                          ********  7.95%   84.79%
  1410 |                                                 *  1.17%   76.66%
  1412 |                                              ****  4.58%   75.49%
  1440 |                                            ******  6.14%   65.71%
  1452 |                                      ************ 11.50%   58.94%
  1460 |************************************************** 47.09%   47.34%</code></pre>
            <p>Assuming the majority of clients have MSS configured right, we can say that 89.8% of connections advertised MTU=1380+40=1420 or higher. 75% had MTU &gt;= 1452.</p><p>For IPv6 connections we saw:</p>
            <pre><code>IPv6 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cumulative
  1220 |                                               ***  4.21%   99.96%
  1340 |                                                **  3.11%   93.23%
  1362 |                                                 *  1.31%   87.70%
  1368 |                                               ***  3.38%   86.36%
  1370 |                                               ***  4.24%   82.98%
  1380 |                                               ***  3.52%   78.65%
  1390 |                                                 *  2.11%   75.10%
  1400 |                                               ***  3.89%   72.25%
  1412 |                                               ***  3.64%   68.21%
  1420 |                                                 *  2.02%   64.54%
  1440 |************************************************** 54.31%   54.34%</code></pre>
            <p>On IPv6, 87.7% of connections had MTU &gt;= 1422 (1362+60). 75% had MTU &gt;= 1450. (See also: <a href="https://blog.apnic.net/2016/05/19/fragmenting-ipv6/">MTU distribution of DNS servers</a>).</p>
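<p>The MSS-to-MTU conversions used above boil down to adding the TCP and IP header sizes (assuming no IP or TCP options); a small helper makes the thresholds explicit:</p>

```python
TCP_HEADER = 20   # TCP header without options, bytes
IPV4_HEADER = 20  # IPv4 header without options, bytes
IPV6_HEADER = 40  # fixed IPv6 header, bytes

def mtu_from_mss(mss, ipv6=False):
    """Estimate the path MTU implied by an advertised TCP MSS:
    MSS + 40 bytes on IPv4, MSS + 60 bytes on IPv6."""
    return mss + TCP_HEADER + (IPV6_HEADER if ipv6 else IPV4_HEADER)
```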
    <div>
      <h3>Interpretation</h3>
      <a href="#interpretation">
        
      </a>
    </div>
    <p>Before we move on it's worth reiterating the original problem. Each connection from an eyeball to our Anycast network has three numbers related to it:</p><ul><li><p>Client advertised MTU - seen in the MSS option in the TCP header</p></li><li><p>True Path MTU value - generally unknown until measured</p></li><li><p>Our Edge server MTU - the value we are trying to optimize in this exercise</p></li></ul><p>(This is a slight simplification; paths on the internet aren't symmetric, so the path from eyeball to Cloudflare could have a different Path MTU than the reverse path.)</p><p>In order for the connection to misbehave, three conditions must be met:</p><ul><li><p>Client advertised MTU must be "wrong", that is: larger than True Path MTU</p></li><li><p>Our edge server must be willing to send such large packets: Edge server MTU &gt;= True Path MTU</p></li><li><p>The ICMP PTB messages must fail to be delivered to our edge server - preventing Path MTU detection from working.</p></li></ul><p>The last condition could occur for one of these reasons:</p><ul><li><p>the routers on the path are misbehaving and perhaps firewalling ICMP</p></li><li><p>due to the asymmetric nature of the internet, the ICMP reply is routed to the wrong Anycast datacenter</p></li><li><p>something is wrong on our side, for example the <code>pmtud</code> process fails</p></li></ul><p>In the past we limited our Edge Server MTU to the smallest possible value, to make sure we never encountered the problem. Due to the development of Spectrum UDP support we must increase the Edge Server MTU, while still minimizing the probability of the issue happening.</p><p>Finally, relying on ICMP PTB messages for a large fraction of traffic is a bad idea. It's easy to imagine the cost this induces: even with Path MTU detection working fine, the affected connection will suffer a hiccup. A couple of large packets will be dropped before the reverse ICMP gets through and reconfigures the saved Path MTU value. This is not optimal for latency.</p>
    <div>
      <h3>Progress</h3>
      <a href="#progress">
        
      </a>
    </div>
    <p>In recent days we increased the IPv6 MTU. As part of the process we could have chosen 1300, 1350, or 1400. We chose 1400 because we think it's the next best value to use after 1280. With 1400 we believe 93.2% of IPv6 connections will not need to rely on Path MTU Detection/ICMP. In the near future we plan to increase this value further. We won't settle on 1500 though - we want to leave a couple of bytes for IPv4 encapsulation, to allow the most popular tunnels to keep working without suffering poor latency when Path MTU Detection kicks in.</p><p>Since the rollout we've been monitoring <code>Icmp6InPktTooBigs</code> counters:</p>
            <pre><code>$ nstat -az | grep Icmp6InPktTooBigs
Icmp6InPktTooBigs               738748             0.0</code></pre>
            <p>Here is a chart of the ICMP PTB packets we received over the last 7 days. You can clearly see that when the rollout started, we saw a large increase in PTB ICMP messages (Y label - packet count - deliberately obfuscated):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/64aFjrP5OwJLKlbMfwIofg/abc284d8d6f8b2b0c7441277cfea12cc/Screen-Shot-2018-09-06-at-2.12.28-PM.png" />
            
            </figure><p>Interestingly the majority of the ICMP packets are concentrated in our Frankfurt datacenter:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7rwp7vqya4hZUVtqgALCo0/70c29ca328859bd7330a9399305d7d79/Screen-Shot-2018-09-06-at-2.11.53-PM.png" />
            
            </figure><p>We estimate that in our Frankfurt datacenter, we receive an ICMP PTB message on 2 out of every 100 IPv6 TCP connections. These seem to come from only a handful of ASNs:</p><ul><li><p>AS6830 - Liberty Global Operations B.V.</p></li><li><p>AS20825 - Unitymedia NRW GmbH</p></li><li><p>AS31334 - Vodafone Kabel Deutschland GmbH</p></li><li><p>AS29562 - Kabel BW GmbH</p></li></ul><p>These networks send us ICMP PTB messages, usually informing us that their MTU is 1280. For example:</p>
            <pre><code>$ sudo tcpdump -tvvvni eth0 icmp6 and ip6[40+0]==2 
IP6 2a02:908:xxx &gt; 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2a02:810d:xx &gt; 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2001:ac8:xxx &gt; 2400:xxx ICMP6, packet too big, mtu 1390, length 1240</code></pre>
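<p>The tcpdump filter <code>ip6[40+0]==2</code> matches byte 0 past the 40-byte fixed IPv6 header - the ICMPv6 type - against 2, which is Packet Too Big. Per RFC 4443 such a message carries the next-hop MTU as a 32-bit field right after the checksum; a minimal parser (a sketch, not code from this post) looks like this:</p>

```python
import struct

ICMP6_PACKET_TOO_BIG = 2  # ICMPv6 type matched by ip6[40+0]==2 above

def ptb_mtu(icmp6_message):
    """Return the next-hop MTU advertised by an ICMPv6 Packet Too Big
    message, or None for any other ICMPv6 type. Per RFC 4443 the fixed
    part is: type (1 byte), code (1), checksum (2), MTU (4, big-endian)."""
    msg_type, _code, _checksum, mtu = struct.unpack("!BBHI", icmp6_message[:8])
    return mtu if msg_type == ICMP6_PACKET_TOO_BIG else None
```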
            
    <div>
      <h3>Final thoughts</h3>
      <a href="#final-thoughts">
        
      </a>
    </div>
    <p>Finally, if you are an IPv6 user with a weird MTU and have misconfigured MSS - basically if you are doing tunneling - please let us know of any issues. We know that debugging MTU issues is notoriously hard. To aid that we created <a href="/ip-fragmentation-is-broken/">an online fragmentation and ICMP delivery test</a>. You can run it:</p><ul><li><p>IPv6 version: <a href="http://icmpcheckv6.popcount.org">http://icmpcheckv6.popcount.org</a></p></li><li><p>(for completeness, we also have an <a href="http://icmpcheck.popcount.org">IPv4 version</a>)</p></li></ul><p>If you are a server operator running IPv6 applications, you should not worry. In most cases leaving the MTU at the default 1500 is a good choice and should work for the majority of connections. Just remember to allow ICMP PTB packets through the firewall and you should be good. If you serve a variety of IPv6 users and need to optimize latency, you may consider choosing a slightly smaller MTU for outbound packets, to reduce the risk of relying on Path MTU Detection / ICMP.</p><p><i>Does low-level network tuning sound interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland.</i></p> ]]></content:encoded>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Anycast]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <guid isPermaLink="false">4jilwfKi5MVWxnIA3ErwZ8</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
    </channel>
</rss>