When building a DDoS mitigation service, it’s incredibly tempting to think that the solution is scrubbing centers or scrubbing servers. I, too, thought that was a good idea in the beginning, but experience has shown that there are serious pitfalls to this approach.
A scrubbing server is a dedicated machine that receives all network traffic destined for an IP address and attempts to filter good traffic from bad. Ideally, the scrubbing server will only forward non-DDoS packets to the Internet application being attacked. A scrubbing center is a dedicated location filled with scrubbing servers.
Three Problems With Scrubbers
The three most pressing problems with scrubbing are bandwidth, cost, and knowledge.
The bandwidth problem is easy to see. As DDoS attacks have scaled beyond 1 Tbps, having that much network capacity available is problematic. Provisioning and maintaining multiple Tbps of bandwidth for DDoS mitigation is expensive and complicated, and that capacity needs to be located in the right place on the Internet to receive and absorb an attack. If it’s not, attack traffic has to be received at one location, scrubbed, and the clean traffic then forwarded to the real server. With a limited number of locations, that detour can introduce enormous delays.
The cost problem follows from the bandwidth problem. Imagine for a moment you’ve built a small number of scrubbing centers, and each center is connected to the Internet with many Gbps of connectivity. When a DDoS attack occurs, that center needs to be able to handle potentially 100s of Gbps of attack traffic at line rate. That means exotic network and server hardware. Everything from the line cards in routers, to the network adapter cards in the servers, to the servers themselves is going to be very expensive.
This hardware cost (together with the bandwidth problem above) is one of the reasons DDoS mitigation has traditionally cost so much and been billed by attack size.
The final problem, knowledge, is the most easily overlooked. When you set out to build a scrubbing server you are building something that has to separate good packets from bad.
At first this seems easy (let’s filter out all TCP ACK packets for non-established connections, for example), and low-level engineers are easy to excite about writing high-performance code to do that. But attackers are not stupid: they’ll throw legitimate-looking traffic at a scrubbing server, and it gets harder and harder to distinguish good from bad.
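To make the "easy" case concrete, here is a minimal Python sketch of that ACK rule. It is an illustration only, not code from any real scrubber: the `Packet` and `AckFilter` names are hypothetical, and "established" is simplified to "we have seen a SYN for this 4-tuple", which is much weaker than real connection tracking.

```python
from dataclasses import dataclass

# TCP flag bits as defined in the TCP header.
FIN, SYN, RST, ACK = 0x01, 0x02, 0x04, 0x10


@dataclass(frozen=True)
class Packet:
    """Hypothetical, already-parsed TCP packet (illustration only)."""
    src: str
    sport: int
    dst: str
    dport: int
    flags: int


class AckFilter:
    """Drop ACKs that do not belong to a flow we have seen a SYN for.

    Simplifying assumption: a flow counts as known as soon as its SYN
    is seen. Real connection tracking follows the whole handshake.
    """

    def __init__(self):
        self.flows = set()

    def allow(self, p: Packet) -> bool:
        key = (p.src, p.sport, p.dst, p.dport)
        if p.flags & SYN and not p.flags & ACK:
            self.flows.add(key)       # new connection attempt
            return True
        if p.flags & (FIN | RST):
            known = key in self.flows
            self.flows.discard(key)   # flow is ending, forget it
            return known
        if p.flags & ACK:
            return key in self.flows  # stray ACKs are dropped
        return False                  # anything else is unexpected
```

The sketch also shows why this stops being easy: an attacker who sends a SYN before each ACK passes the filter, so the rule has to grow into real protocol knowledge.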
At that point, scrubbing engineers need to become protocol experts at all levels of the stack. That means you have to build a competency in all levels of TCP/IP, DNS, HTTP, TLS, etc. And that’s hard.
CC BY-SA 2.0 image by Lisa Stevens
The bottom line is scrubbing centers and exotic hardware are great marketing. But, like citadels of medieval times, they are monumentally expensive and outdated, overwhelmed by better weapons and warfighting techniques.
And many DDoS mitigation services that use scrubbing centers operate in an offline mode. They are only enabled when a DDoS occurs. This typically means that an Internet application will succumb to the DDoS attack before its traffic is diverted to the scrubbing center.
Just imagine citizens fleeing to hide behind the walls of the citadel under fire from an approaching army.
Better, Cheaper, Smarter
There’s a subtler point about not having dedicated scrubbers: it forces us to build better software. If a scrubbing server becomes overwhelmed or fails then only the customer being scrubbed is affected, but when the mitigation happens on the very servers running the core service it has to work and be effective.
I spoke above about the ‘knowledge gap’ that comes about with dedicated DDoS scrubbing. The Cloudflare approach means that if bad traffic gets through, say a flood of bad DNS packets, then it reaches a service owned and operated by people who are experts in that domain. If a DNS flood gets through our DDoS protection, it hits our custom DNS server, RRDNS, and the engineers who work on it can bring their expertise to bear.
This makes an enormous difference because the result is either improved DDoS scrubbing or a change to the software (e.g. the DNS stack) that improves its performance under load. We’ve lived that story many, many times and the entire software stack has improved because of it.
The approach Cloudflare took to DDoS mitigation is rather simple: make every single server in Cloudflare participate in mitigation, load balance DDoS attacks across the data centers and servers within them and then apply smarts to the handling of packets. These are the same servers, processors and cores handling our entire service.
Eliminating scrubbing centers and hardware completely changes the cost of building a DDoS mitigation service.
We currently have around 15 Tbps of network capacity worldwide, but this capacity doesn’t require exotic network hardware. We are able to use low-cost or commodity networking equipment, bound together using network automation, to handle normal and DDoS traffic. Just as Google originally built its service by writing software that tied together commodity servers into a super (search) computer, our architecture binds commodity servers together into one giant network device.
By building the world’s most peered network, we’ve built this capacity at reasonable cost and, more importantly, are able to handle attack traffic globally, wherever it originates, over low-latency links. No scrubbing solution is able to say the same.
And because Cloudflare manages DNS for our customers and uses an Anycasted network, attack traffic originating from botnets is automatically distributed across our global network. Each data center deals with a portion of DDoS traffic.
Within each data center DDoS traffic is load balanced across multiple servers running our service. Each server handles a portion of the DDoS traffic. This spreading of DDoS traffic means that a single DDoS attack will be handled by a large number of individual servers across the world.
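The two levels of spreading can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cloudflare's implementation: anycast routing is reduced to a hypothetical `nearest_dc` lookup table (in reality BGP decides where a packet lands), the in-data-center balancing is assumed to be a stable hash of the flow tuple, and the data center labels and addresses are made up.

```python
import hashlib
from collections import Counter


def pick_server(flow, servers):
    """Map a flow tuple to one server with a stable hash (ECMP-like).

    Assumption: the source only says traffic is load balanced; hashing
    the flow tuple is one common way to do that, used here as a stand-in.
    """
    digest = hashlib.sha256(repr(flow).encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def spread_attack(flows, nearest_dc, dc_servers):
    """Count how many attack flows land on each (data center, server)."""
    load = Counter()
    for region, flow in flows:
        dc = nearest_dc[region]  # stand-in for anycast routing
        load[(dc, pick_server(flow, dc_servers[dc]))] += 1
    return load


def demo():
    """Spread 3000 made-up UDP flows over 3 data centers of 4 servers."""
    nearest_dc = {"eu": "AMS", "us": "SJC", "asia": "SIN"}  # hypothetical
    dc_servers = {
        dc: [f"{dc.lower()}-{i}" for i in range(4)]
        for dc in nearest_dc.values()
    }
    flows = []
    for i in range(3000):
        region = ("eu", "us", "asia")[i % 3]
        src_ip = f"10.{i % 250}.{i // 250}.1"
        flows.append((region, (src_ip, 1024 + i, "203.0.113.10", 53, "udp")))
    return spread_attack(flows, nearest_dc, dc_servers)
```

Calling `demo()` returns a per-server count. Each data center sees only the flows from its own region, and within a data center no single server carries the whole share.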
And as Cloudflare grows, our DDoS mitigation capacity grows automatically, and because our DDoS mitigation is built into our stack, it is always on. We mitigate a new DDoS attack every three minutes with no downtime for Internet applications and have no need to ‘switch over’ to a scrubbing center.
Inside a Server
Once all this global and local load balancing has occurred, packets do finally hit a network adapter card in a server. It’s here that Cloudflare’s custom DDoS mitigation stack comes into play.
Over the years we’ve learned how to automatically detect and mitigate anything the internet can throw at us. For most attacks, we rely on dynamically managing iptables, the standard Linux firewall. We’ve spoken about the most effective techniques in the past. iptables has a number of very powerful features, which we select depending on the specific attack vector. In our experience xt_bpf, ipset, hashlimit and connlimit are the most useful iptables modules.
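As a rough illustration of what per-source rate limiting in the spirit of hashlimit does, here is a small token-bucket sketch in Python. It is not how iptables implements the module and not Cloudflare's rule set. The `PerSourceLimiter` name and the rate and burst values are assumptions chosen for the example, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class PerSourceLimiter:
    """Token bucket per source address (illustrative, not iptables code).

    rate  -- tokens added per second for each source
    burst -- maximum tokens a source can accumulate
    """

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.buckets = {}  # source -> (tokens, time of last update)

    def allow(self, src: str, now: float) -> bool:
        """Return True if a packet from src at time now may pass."""
        tokens, last = self.buckets.get(src, (self.burst, now))
        # Refill for the time elapsed, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[src] = (tokens - 1.0, now)
            return True
        self.buckets[src] = (tokens, now)
        return False
```

With `rate=10.0` and `burst=5.0`, a source that sends eight packets in the same instant gets five through and three dropped, while a different source is unaffected. Keying the bucket on the source is what stops one noisy sender from consuming the allowance of everyone else.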
For very large attacks, though, the Linux kernel is not fast enough. To relieve the kernel from processing an excessive number of packets, we experimented with various kernel bypass techniques. We’ve settled on a partial kernel bypass interface: the Solarflare-specific EFVI.
With EFVI we can offload the processing of our firewall rules to a user space program, and we can easily process millions of packets per second on each server, while keeping the CPU usage low. This allows us to withstand the largest attacks, without affecting our multi-tenant service.
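The rule-matching half of that idea can be sketched in Python. Everything here is a simplified assumption: receiving frames from the network card through EFVI is not shown, the input is treated as a raw IPv4 packet without the Ethernet header, the `ntp_reflection` rule and its threshold are invented for illustration, and a production filter would be written in a much faster language.

```python
import struct


def parse_ipv4_udp(frame: bytes):
    """Return (src, dst, sport, dport, payload), or None if not IPv4/UDP."""
    if len(frame) < 20 or frame[0] >> 4 != 4:
        return None
    ihl = (frame[0] & 0x0F) * 4          # IPv4 header length in bytes
    if ihl < 20 or len(frame) < ihl + 8 or frame[9] != 17:
        return None                       # protocol 17 is UDP
    sport, dport, _length, _csum = struct.unpack("!HHHH", frame[ihl:ihl + 8])
    src = ".".join(map(str, frame[12:16]))
    dst = ".".join(map(str, frame[16:20]))
    return src, dst, sport, dport, frame[ihl + 8:]


def ntp_reflection(pkt) -> bool:
    """Hypothetical rule: large UDP replies from source port 123."""
    _src, _dst, sport, _dport, payload = pkt
    return sport == 123 and len(payload) > 200


RULES = [ntp_reflection]


def verdict(frame: bytes, rules=RULES) -> str:
    """Return "drop" if any rule matches, otherwise "pass"."""
    pkt = parse_ipv4_udp(frame)
    if pkt is None:
        return "pass"                     # not something we filter here
    return "drop" if any(rule(pkt) for rule in rules) else "pass"


def build_ipv4_udp(src, dst, sport, dport, payload: bytes) -> bytes:
    """Build a test packet (checksums left as zero for simplicity)."""
    udp = struct.pack("!HHHH", sport, dport, 8 + len(payload), 0) + payload
    header = struct.pack(
        "!BBHHHBBH4s4s", 0x45, 0, 20 + len(udp), 0, 0, 64, 17, 0,
        bytes(map(int, src.split("."))), bytes(map(int, dst.split("."))),
    )
    return header + udp
```

In this sketch "pass" stands for leaving the packet to the normal kernel path, which is an assumption about how a partial bypass hands traffic off. Attack packets are discarded cheaply in user space while everything else is handled as usual.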
Open Source
Cloudflare’s vision is to help to build a better internet. Fixing DDoS is a part of it. We’ve been relentlessly documenting the most important and dangerous attacks we’ve encountered, fighting botnets and open sourcing critical pieces of our DDoS infrastructure.
We’ve open sourced various tools, from very low-level projects like our BPF Tools, which we use to fight DNS and SYN floods, to contributions to OpenResty, a performant application framework on top of NGINX that is great for building L7 defenses.
Further Reading
Cloudflare has written a great deal about DDoS mitigation in the past. Some example blog posts: How Cloudflare's Architecture Allows Us to Scale to Stop the Largest Attacks, Reflections on reflection (attacks), The Daily DDoS: Ten Days of Massive Attacks, and The Internet is Hostile: Building a More Resilient Network.
And if you want to go deeper, my colleague Marek Majkowski goes into detail on the code we use for DDoS mitigation.
Conclusion
Cloudflare’s DDoS mitigation architecture and custom software make Unmetered Mitigation possible. With them we can withstand the largest DDoS attacks, and as our network grows, our DDoS mitigation capability grows with it.