It was a scorching Monday on July 22 as temperatures soared above 37°C (99°F) in Austin, TX, the live music capital of the world. Only hours earlier, the last crowds had dispersed from the historic East 6th Street entertainment district. A few blocks away, Cloudflarians were starting to make their way to the office. Little did those early arrivals know that they would soon be taking part in a time-honored Cloudflare tradition: dogfooding new services before releasing them into the wild.
East 6th Street, Austin Texas
(A photo I took on a night out with the team while visiting the Cloudflare Austin office)
Dogfooding is when an organization uses its own products. In this case, we dogfed our newest cloud service, Magic Transit, which both protects and accelerates our customers’ entire network infrastructure—not just their web properties or TCP/UDP applications. With Magic Transit, Cloudflare announces your IP prefixes via BGP, attracts (routes) your traffic to our global network edge, blocks bad packets, and delivers good packets to your data centers via Anycast GRE.
We decided to use Austin’s network because we wanted to test the new service on a live network with real traffic from real people and apps. With the target identified, we began onboarding the Austin office in an always-on routing topology.
In an always-on routing mode, Cloudflare data centers constantly advertise Austin’s prefix, resulting in faster, almost immediate mitigation. As opposed to traditional on-demand scrubbing center solutions with limited networks, Cloudflare operates within 100 milliseconds of 99% of the Internet-connected population in the developed world. For our customers, this means that always-on DDoS mitigation doesn’t sacrifice performance due to suboptimal routing. On the contrary, Magic Transit can actually improve your performance due to our network’s reach.
Cloudflare’s Global Network
DDoS’ing Austin
With Austin onboarded to Magic Transit, all we needed was a motivated attacker to launch a DDoS attack. Luckily, we found more than a few willing volunteers on our Site Reliability Engineering (SRE) team to execute the attack. While the teams were still assembling in multiple locations around the world, our SRE volunteer started firing packets at our target from an undisclosed location.
Without Magic Transit, the Austin office would’ve been hit directly with the packet flood. Two things could have happened in this case (not mutually exclusive):
Austin’s on-premises equipment (routers, firewalls, servers, etc.) would have been overwhelmed and failed
Austin’s service providers would have dropped packets that exceeded the office’s bandwidth allowance
Both cases would result in a very bad day for everyone.
Cloudflare DDoS Mitigation
Instead, when our SRE attacker launched the flood, the packets were automatically routed via BGP to Cloudflare’s network. The packets reached the closest data center via Anycast and encountered multiple defenses in the form of XDP, eBPF and iptables. Those defenses are populated with pre-configured static firewall rules as well as dynamic rules generated by our DDoS mitigation systems.
Static rules can vary from straightforward IP blocking and rate-limiting to more sophisticated expressions that match against specific packet attributes. Dynamic rules, on the other hand, are generated automatically in real time. To play fair with our attacker, we didn’t pre-configure any special rules against the attack. We wanted to give our attacker a fair opportunity to take Austin down, although with our multi-layered protection approach, the odds were never actually in their favor.
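To make the two simplest kinds of static rule concrete, here is a minimal sketch of IP blocking combined with per-source rate limiting. It is an illustration only, not Cloudflare's implementation: the `StaticFilter` and `TokenBucket` names, the token-bucket approach, and the example prefixes (reserved documentation ranges) are all assumptions made for this example. Production filtering happens in XDP, eBPF and iptables rather than in Python.

```python
import ipaddress


class TokenBucket:
    """Hypothetical rate limiter: `rate` packets per second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # timestamp of the previous packet

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class StaticFilter:
    """Hypothetical static rule set: a block list plus a per-source rate limit."""

    def __init__(self, blocked_networks, rate: float, burst: float):
        self.blocked = [ipaddress.ip_network(n) for n in blocked_networks]
        self.rate = rate
        self.burst = burst
        self.buckets = {}  # source IP -> TokenBucket

    def verdict(self, src_ip: str, now: float) -> str:
        addr = ipaddress.ip_address(src_ip)
        # Rule 1: straightforward IP blocking.
        if any(addr in net for net in self.blocked):
            return "drop"
        # Rule 2: rate-limit every remaining source independently.
        if src_ip not in self.buckets:
            self.buckets[src_ip] = TokenBucket(self.rate, self.burst)
        return "pass" if self.buckets[src_ip].allow(now) else "drop"


if __name__ == "__main__":
    f = StaticFilter(["203.0.113.0/24"], rate=2, burst=2)
    print(f.verdict("203.0.113.7", now=0.0))   # blocked prefix
    print(f.verdict("198.51.100.1", now=0.0))  # within the rate limit
```

A token bucket is used here because it tolerates short bursts while still capping the sustained rate; other rate-limiting schemes would fit the same slot.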
Source: https://imgflip.com
Generating Dynamic Rules
As part of our multi-layered protection approach, dynamic rules are generated on the fly by analyzing the packets that route through our network. While the packets are being routed, flow data is asynchronously sampled, collected, and analyzed by two main detection systems. The first is called Gatebot and runs across the entire Cloudflare network; the second is our newly deployed DoSD (denial of service daemon), which operates locally within each data center. DoSD is an exciting improvement that we’ve just recently rolled out, and we look forward to writing more about its technical details here soon. DoSD samples at a much higher rate (1 in 100 packets) than Gatebot (roughly 1 in 8,000 packets), which allows DoSD to detect even more attacks and block them faster.
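The effect of the sampling rate can be shown with a little arithmetic. The sketch below uses a deterministic 1-in-N sampler, which is an assumption made for illustration (the post does not say how Gatebot or DoSD choose which packets to sample), and the `OneInNSampler` and `samples_per_second` names are hypothetical. The 250,000 packets-per-second figure is the peak of the SYN flood described later in this post.

```python
class OneInNSampler:
    """Hypothetical sampler: keeps every n-th packet, scales counts back up."""

    def __init__(self, n: int):
        self.n = n
        self.seen = 0      # packets observed
        self.sampled = 0   # packets kept for analysis

    def observe(self) -> bool:
        """Record one packet; return True if it was sampled."""
        self.seen += 1
        if self.seen % self.n == 0:
            self.sampled += 1
            return True
        return False

    def estimated_packets(self) -> int:
        # Each sampled packet stands in for n real packets.
        return self.sampled * self.n


def samples_per_second(packets_per_second: float, n: int) -> float:
    """Expected number of samples per second at a 1-in-n sampling rate."""
    return packets_per_second / n


if __name__ == "__main__":
    flood_pps = 250_000
    print(samples_per_second(flood_pps, 100))    # 1-in-100 sampling
    print(samples_per_second(flood_pps, 8000))   # 1-in-8,000 sampling
```

At 250,000 packets per second, 1-in-100 sampling yields 2,500 samples per second, while 1-in-8,000 sampling yields about 31. More samples per second give a detector more evidence in a shorter window, which is consistent with the faster detection described above.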
The asynchronous attack detection lifecycle is represented by the dotted lines in the diagram below. Attacks are detected out of path to ensure that we don’t add any latency, and mitigation rules are pushed in line and removed as needed.
Multiple packet attributes and correlations are taken into consideration during analysis and detection. Gatebot and DoSD search for both new network anomalies and already known attacks. Once an attack is detected, rules are automatically generated, propagated, and applied in the optimal location within 10 seconds or less. Just to give you an idea of the scale, we’re talking about hundreds of thousands of dynamic rules that are applied and removed every second across the entire Cloudflare network.
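The detect, generate, apply and remove cycle can be sketched roughly as follows. Everything here is hypothetical: the `DynamicRules` class, the `(destination, protocol, flags)` signature, the fixed packets-per-second threshold and the time-to-live expiry are assumptions chosen to keep the example small. The real systems correlate many more packet attributes and are not described in this level of detail in the post.

```python
from collections import Counter


class DynamicRules:
    """Hypothetical rule table: rules are generated from samples, then expire."""

    def __init__(self, threshold_pps: float, sample_rate: int, ttl: float):
        self.threshold_pps = threshold_pps  # estimated rate that triggers a rule
        self.sample_rate = sample_rate      # 1-in-N sampling factor
        self.ttl = ttl                      # seconds a rule stays installed
        self.rules = {}                     # signature -> expiry timestamp

    def analyze(self, sampled_signatures, window_seconds: float, now: float):
        """Generate rules for signatures whose estimated rate is too high."""
        new_rules = []
        for sig, count in Counter(sampled_signatures).items():
            # Scale the sampled count back up to an estimated real rate.
            estimated_pps = count * self.sample_rate / window_seconds
            if estimated_pps > self.threshold_pps:
                self.rules[sig] = now + self.ttl
                new_rules.append(sig)
        return new_rules

    def expire(self, now: float) -> None:
        """Remove rules whose time-to-live has passed."""
        self.rules = {s: t for s, t in self.rules.items() if t > now}

    def should_drop(self, sig, now: float) -> bool:
        expiry = self.rules.get(sig)
        return expiry is not None and expiry > now


if __name__ == "__main__":
    table = DynamicRules(threshold_pps=10_000, sample_rate=100, ttl=60.0)
    syn = ("192.0.2.10", "tcp", "SYN")
    ack = ("192.0.2.10", "tcp", "ACK")
    # One second of samples: a SYN flood plus a trickle of normal ACKs.
    print(table.analyze([syn] * 500 + [ack] * 5, window_seconds=1.0, now=0.0))
    print(table.should_drop(syn, now=30.0), table.should_drop(ack, now=30.0))
```

Expiring rules automatically is one simple way to model rules being "applied and removed"; a real system would more likely remove a rule once the attack traffic itself subsides.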
One of the beauties of Gatebot and DoSD is that they don’t require a traffic learning period: neither system needs to sample traffic for weeks before kicking in. Once a customer is onboarded, they’re protected immediately. While we can always apply specific firewall rules if requested by the customer, no manual configuration is required by the customer or our teams. It just works.
What this mitigation process looks like in practice
Let’s look at what happened in Austin when one of our SREs tried to DDoS Austin and failed. During one of the first attempts, before DoSD had rolled out globally, Austin employees on video calls noticed degraded audio and video quality for a few seconds before Gatebot kicked in. As soon as Gatebot kicked in, the quality was immediately restored. If we hadn’t had Magic Transit in-line, the degradation would have worsened to the point of full denial of service. Austin would have been offline and our Austin colleagues wouldn’t have had a very productive day.
On a subsequent attack attempt which took place after DoSD was deployed, our SRE launched a SYN flood on Austin. The attack targeted multiple IP addresses in Austin’s prefix and peaked just above 250,000 packets per second. DoSD detected the attack and blocked it in approximately 3 seconds. DoSD’s quick response resulted in no degradation of service for the Austin team.
Attack Snapshot
Green line = Attack traffic to Cloudflare edge, Yellow line = clean traffic from Cloudflare to origin over GRE
What We Learned
Dogfooding Magic Transit was a valuable experiment that taught us lessons on both the engineering and the procedural side. On the engineering side, we fine-tuned our mitigations and optimized routing. On the procedural side, we drilled members of multiple teams, including the Security Operations Center and Solution Engineering teams, to help refine our runbooks. As a result, we reduced onboarding time from days to hours, ensuring a quick and smooth onboarding experience for our customers.
Want To Learn More?
Request a demo and learn how you can protect and accelerate your network with Cloudflare.