Today, from 21:00 to 23:00 UTC, CloudFlare had a scheduled maintenance window. During that time, CloudFlare's interface was offline. While the window was only two hours (and we finished a bit early, at 22:16 UTC), what went on during it had been in the works for several months. I wanted to take a second and let you know what we just did and why.
When Dinosaurs Roamed the Web
Michelle, Lee and I started working on CloudFlare in early 2009. However, it wasn't until the beginning of 2010 that we invited the first users to sign up. Here's an early invite that went out to Project Honey Pot users. While CloudFlare's network today spans the globe, back then we only had one data center (in Chicago) and 512 IP addresses (two /24 CIDR blocks).
Over the course of 2010, we built the product and continued to sign up customers. It was a struggle to get our first 100 customers and, when we did, we took the whole team (at the time there were 6 of us) to Las Vegas. One of our very first blog posts was documenting that adventure. While today we regularly sign up 100 new customers an hour, we're really proud of the fact that a lot of those original customers are still CloudFlare customers today.
Over the course of the summer of 2010, about 1,000 customers signed up. On September 27, 2010, Michelle and I stepped on stage at TechCrunch Disrupt and launched the service live to the public. We were flooded with new signups, more than tripling in size in 48 hours. Our growth has only accelerated since then.
Provisioning and Accounting
One of the hardest non-obvious challenges to running CloudFlare is the accounting and provisioning of our resources across our customers' sites. When someone signs up, we run hundreds of thousands of tests on the characteristics of the site in order to find the best pool to assign the site to. If a site signs up for a plan tier that supports HTTPS, then we automatically issue and deploy an SSL certificate. And we spread sites across resource pools to ensure that we don't have hot spots on our network.
Originally, we were fairly arbitrary about which customers were assigned to which pool of resources. Over time, we've developed much more sophisticated systems to put new customers into the best pool of resources for them at the moment. However, the system has been relatively static: the pool a site was placed in when it first signed up has generally remained its pool over time.
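To make that concrete, here's a minimal sketch (not our actual provisioning code) of what pool selection can look like. The pool names, capacities, and site profile fields are all hypothetical, and the real system weighs far more signals from the tests described above.

```python
# A minimal, illustrative sketch of assigning a newly signed-up site
# to a resource pool. All names and fields here are hypothetical.

def choose_pool(site_profile, pools):
    """Pick the least-loaded pool that satisfies the site's requirements."""
    candidates = [
        pool for pool in pools
        if (not site_profile["needs_ssl"] or pool["supports_ssl"])
    ]
    # Spread sites out to avoid hot spots: prefer the pool with the
    # lowest current load relative to its capacity.
    return min(candidates, key=lambda p: p["load"] / p["capacity"])

pools = [
    {"name": "pool-a", "supports_ssl": True,  "load": 800, "capacity": 1000},
    {"name": "pool-b", "supports_ssl": False, "load": 200, "capacity": 1000},
]
site = {"needs_ssl": True}
print(choose_pool(site, pools)["name"])  # -> "pool-a"
```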
Moving Sucks
Since provisioning has been relatively static, we had sites that were frozen in time. Those first 100 customers that were on CloudFlare's first IP addresses were mixed between free and paying customers. This led to less efficient allocation of our server resources and, over time, kept us from better automating a number of systems that would better distribute load and isolate sites that were under attack from the rest of the network.
The solution was to migrate everyone from the various old systems to a new system. Lee began planning for this two months ago and stuck around the office over the holidays in order to ensure all the prep work was in place. To give you a sense of the scope, the migration script was 2,487 lines of code. And it would only be run once.
We picked a day after the holidays when all the team would be in the office. We run a global network with customers around the world, so there is no quiet time during which to do system-wide maintenance; instead, we picked a time when all of our team would be on hand and fully alert. We ordered a pizza lunch for everyone and then, about an hour after lunch, began migrating everyone from various old deployments to a new, modern system.
Replacing the Engine (and Wings) of the Plane in Flight
It is non-trivial to move more than half a million websites around IP space. Sites were rebalanced to new pools of IP addresses. Where necessary, new SSL certificates were seamlessly issued. Custom SSL certificates were redeployed to new machines. From start to finish, the process took about an hour and sixteen minutes.
The process was designed to ensure that there would be no interruption in services. Unless you knew the IP addresses CloudFlare announced for your site, you likely wouldn't notice anything. And for most of our customers, it went very smoothly.
We had two issues that affected a handful of customers. First, there was a conflict with some of our web server configurations that prevented a "staging" SSL environment from coming up properly. This staging environment was used as a temporary home for some sites that used SSL as they migrated from their old IP space to their new IP space. As a result, some customers saw SSL errors for about 10 minutes.
Second, a small number of customers were assigned to a pair of IP addresses that had recently been under a DDoS attack and had been null routed at the routers. A null route like this would usually be recorded in our database, keeping sites from being assigned to that space until the null route was removed. In this case, the information wasn't correctly recorded and, for a short time, a small number of sites were on a pair of IP addresses that was unreachable. We removed the null route within a few minutes of realizing the mistake and the sites were again online.
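For the curious, here's a minimal sketch of the kind of check involved, assuming a simple table of null-routed prefixes. The prefixes and function names are illustrative only; the incident happened because the real equivalent of this table was missing an entry.

```python
# Illustrative sketch: don't assign a site to IP space that is
# currently null routed. Prefixes below are documentation ranges.
import ipaddress

null_routed = {ipaddress.ip_network("203.0.113.0/24")}  # hypothetical null-routed prefix

def is_assignable(ip_str):
    """Return True if the address is not inside any recorded null route."""
    ip = ipaddress.ip_address(ip_str)
    return not any(ip in net for net in null_routed)

# If a null route exists at the routers but was never recorded here,
# this check passes for space that is actually unreachable.
print(is_assignable("203.0.113.10"))  # False: skip this address
print(is_assignable("198.51.100.7"))  # True: safe to assign
```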
Flexibility
We have known we needed to do this migration for quite some time. Now that it's done, CloudFlare's network is significantly more flexible and robust to ensure fast performance and keep attacks against one site from ever affecting any other customers.
To give you some sense of the flexibility the new system offers, here's a challenge we've faced. As CloudFlare looks to expand its network, some regions where we want to add data centers have restrictions on certain kinds of content being served. For example, in many Middle Eastern countries it is illegal to serve adult content from within their borders. CloudFlare is a reflection of the Internet, so there are adult-oriented sites that use our network. Making matters more difficult, what counts as an "adult" site can change over time.
The new system allows both our automated systems and our ops team to tag sites with certain characteristics. Now we can label a site as "adult" and the system automatically migrates it to a pool of resources that doesn't need to be announced from a particular region where serving the content would be illegal.
A similar use case is a site that is under attack. The new provisioning system allows us to isolate the site from the rest of the network to mitigate any collateral damage to other customers. We can also automatically dedicate additional resources (e.g., data centers in parts of the world that are in a lull of traffic based on the time of day) in order to better mitigate the attacks. In the end, the benefit here is extreme flexibility.
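Here's a minimal sketch of how tags might drive pool placement, in the spirit of the system described above. The tags, regions, and pool attributes are illustrative, not our actual schema.

```python
# Illustrative sketch of tag-driven pool placement. All names are hypothetical.

POOLS = [
    {"name": "global",    "regions": {"us", "eu", "me"}, "isolated": False},
    {"name": "no-me",     "regions": {"us", "eu"},       "isolated": False},
    {"name": "scrubbing", "regions": {"us", "eu"},       "isolated": True},
]

RESTRICTED = {"adult": {"me"}}  # tags that exclude announcement from certain regions

def pool_for(tags):
    """Return a pool compatible with a site's tags."""
    if "under_attack" in tags:
        # Isolate attacked sites so collateral damage stays off shared pools.
        return next(p for p in POOLS if p["isolated"])
    banned = set().union(*(RESTRICTED.get(t, set()) for t in tags))
    return next(p for p in POOLS if not (p["regions"] & banned))

print(pool_for(set())["name"])             # -> "global"
print(pool_for({"adult"})["name"])         # -> "no-me" (not announced from "me")
print(pool_for({"under_attack"})["name"])  # -> "scrubbing"
```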
We never like to take our site and API offline for any period of time, and I am disappointed we didn't complete the migration entirely without incident, but overall this was a very important, surprisingly complex transition that went very smoothly. CloudFlare's network is now substantially more robust and flexible in order to continue to grow and expand as we continue on our mission to build a better web.