This is part 3 of a six-part series based on a talk I gave in Trento, Italy. To start from the beginning, go here.
After Cloudbleed, lots of things changed. We started to move away from memory-unsafe languages like C and C++ (there’s a lot more Go and Rust now). And every SIGABRT or crash on any machine results in an email to me and a message to the team responsible. And I don’t let the team leave those problems to fester.
Making 1.1.1.1
So Cloudbleed was a terrible time. Let’s talk about a great time: the launch of our public DNS resolver, 1.1.1.1. That launch is a story of an important Cloudflare quality: audacity. Google had launched 8.8.8.8 years earlier and had taken the market for public DNS resolvers by storm. Their address is easy to remember and their service is very fast.
But we thought we could do better. We thought we could be faster, and we thought we could be more memorable. Matthew asked us to get the address 1.1.1.1 and launch a secure, privacy-preserving, public DNS resolver in a couple of months. Oh, and make it faster than everybody else.
We did that. In part we did it because of good relationships we’ve established with different groups around the world. We’ve done that by being consistent about how we operate and by employing people with established relationships. This is partly a story about how diversity matters. If we’d been the sort of people who discriminated against older engineers, a lot of Cloudflare would not have been built. I’ll return to the topic of diversity and inclusion later.
Through relationships and sharing we were able to get the 1.1.1.1 address. Through our architecture we were able to be the fastest. Over years and years, we’ve been saying that Cloudflare was for everyone on the Internet. Everyone, everywhere. And we put our money where our mouths are and built 165 data centers across the world. Our goal is to be within 10ms of everyone who uses the Internet.
And when you’re everywhere it’s easy to be the fastest. Or at least it’s easy if you have an architecture that makes it possible to update software quickly and run it everywhere. Cloudflare runs a single stack of software on every machine worldwide. That architecture has made a huge difference versus our competitors and has allowed us to scale quickly and cheaply.
Cloudflare's Architecture
That architecture was largely put in place before I joined the company. Lee Holloway (the original architect of the company), working with a small team, built a service based on open source components (such as Postgres and NGINX) that had a single stack of software doing caching, WAF, DDoS mitigation and more.
It was all bound together by a distributed key-value store to send configuration to every machine we have around the world in seconds. And centrally there was a large customer database in Postgres and a lot of PHP to create the public cloudflare.com web site.
Although we have constantly changed our software, this architecture still exists. Early at Cloudflare I argued that there should be some special machines in the network doing special tasks (like DDoS mitigation). The truth is I wanted to build those machines because technically it would have been really exciting to work on that sort of large, complex, low-latency software. But Lee and Matthew told me I was wrong: a simple architecture could scale more easily.
And they were right. We’ve scaled to 25 Tbps of network capacity with every machine doing every single thing. So, get the architecture right and make sure you’re building things for the right reasons. Once we could scale like that, adding 1.1.1.1 was easy. We rolled out the software to every machine, tested it and made it public. Overnight it became the fastest public DNS resolver there is, and it remains so.
Naturally, our software stack has evolved a lot since Lee started working on it, and most parts of it have been rewritten. We’ve thrown away all the code that Matthew Prince wrote in PHP in the earliest days, and we’ve started to throw away code that I wrote in Lua and Go. This is natural. If you’re looking back at code you wrote five years ago and feeling that it’s still fit for purpose, then you are either fooling yourself or not growing.
The Price of Growth is Rewrites
It seems that roughly every order-of-magnitude change in the use of software requires a rewrite. It’s sad that you can’t start with the ultimate code base and the ultimate architecture, but the reality is that it’s hard enough to build the software you need for today’s challenges, so you can’t worry about tomorrow’s. It’s also very hard to anticipate what you’ll actually need when your service grows by 10x.
When I joined, most of our customers had a small number of DNS records. The software had been built to scale to thousands or millions of customers, each with a small number of records. That’s because our typical customer was a small business or an individual with a blog. We were built for millions of them.
Then along came a company that had a single domain name with millions of subdomains. Our software immediately fell over. It just wasn’t built to cope with that particular shape of customer.
So, we had to build an immediate band-aid and start re-architecting the piece of software that handled DNS records. I could tell you 10 other stories like that. But the lesson is clear: you don’t know what to expect up front, so keep going until you get there. But be ready to change quickly.
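To make the "shape of customer" problem concrete, here is a minimal Python sketch. It is purely illustrative and is not Cloudflare's DNS code: the talk does not say how records were stored, so the class names ListZone and IndexedZone, the build helper, and the record layout are all assumptions. It shows one plausible way a design that is fine for zones with a handful of records can degrade when a single zone holds a very large number of names.

```python
from collections import defaultdict


class ListZone:
    """Hypothetical zone stored as a flat list of (name, value) records.

    A lookup scans every record, which is fine for a blog with a handful
    of records but costs O(n) per query for a zone with n subdomains.
    """

    def __init__(self):
        self.records = []

    def add(self, name, value):
        self.records.append((name, value))

    def lookup(self, name):
        return [value for rec_name, value in self.records if rec_name == name]


class IndexedZone:
    """Hypothetical zone indexed by name.

    A lookup is a single dictionary access on average, regardless of how
    many names the zone contains.
    """

    def __init__(self):
        self.by_name = defaultdict(list)

    def add(self, name, value):
        self.by_name[name].append(value)

    def lookup(self, name):
        # .get() avoids creating empty entries for names that do not exist.
        return list(self.by_name.get(name, []))


def build(zone_cls, domain, n):
    """Create a zone of the given class with n made-up subdomains."""
    zone = zone_cls()
    for i in range(n):
        # 192.0.2.0/24 is a documentation-only address range.
        zone.add(f"host{i}.{domain}", f"192.0.2.{i % 256}")
    return zone


if __name__ == "__main__":
    small = build(ListZone, "example.com", 5)
    large = build(IndexedZone, "example.com", 1_000_000)
    print(small.lookup("host3.example.com"))
    print(large.lookup("host999999.example.com"))
```

Both classes return the same answers; the difference is only in how the cost of a lookup grows with the number of records in one zone, which is the kind of assumption that tends to surface only when an unexpectedly shaped customer arrives.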
Helping to Build Cloudflare
Part 3: Audacity, Diversity and Change (you are here)