Blue Light Special: Ensuring fast global configuration changes

by Ben Cartwright-Cox.

CloudFlare operates a huge global network of servers that proxy our customers' web sites, operate as caches, inspect requests to ensure they are not malicious, deflect DDoS attacks and handle one of the largest authoritative DNS systems in the world. And where there's software there's configuration information.

CloudFlare is highly customisable. Each customer has a unique configuration consisting of DNS records, all manner of settings (such as minification, image recompression, IP-based blocking, which individual WAF rules to execute) and per-URL rules. And the configuration changes constantly.

Warp speed configuration

We offer almost instant configuration changes. If a user adds a DNS record it should be globally resolvable in seconds. If a user enables a CloudFlare WAF rule it should happen very, very fast to protect a site. This presents a challenge because those configuration changes need to be pushed across the globe very quickly.

We've written in the past about the underlying technology we use: Kyoto Tycoon and how we secured it from eavesdroppers. We also monitor its performance.

DNS records are currently changing at a rate of around 40 per second, 24 hours a day. All those changes need to be propagated in seconds.

So we take propagation times very seriously.

Keep a close eye on this light of mine

For this we need to keep a close eye on how long it takes a change to reach every one of our data centers. Whilst we have in-depth metrics for our operations team to look at, it's sometimes useful (and fun) to have something more visceral.

We also want developers and operations people to be equally aware of some critical metrics, and developers don't usually spend their time observing the metrics and alerts aimed at operations.

On some rare occasions, perhaps due to routing problems on the wider Internet, we may find that our ability to push changes at the required velocity is impaired. To ensure that we know about this as soon as possible and know when to take action, we've built a custom alert system that everyone in the office can see.

From an external global collection of machines we monitor propagation time for DNS records and trigger an alert if propagation time exceeds a pre-set threshold. The alert comes in the form of a blue rotating 'police light'.
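The check described above can be sketched roughly as follows. This is a minimal illustration, not our monitoring code: the names `measure_propagation` and `exceeds_threshold`, the `record_visible` callback, and the timeout and polling values are all assumptions made for the example. The callback stands in for whatever lookup an external monitor would perform to see whether a freshly changed record is resolvable yet.

```python
import time
from typing import Callable, Optional


def measure_propagation(
    record_visible: Callable[[], bool],
    timeout_seconds: float = 60.0,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until a changed record is visible; return the seconds it took.

    record_visible is a hypothetical callback that returns True once the
    monitor can resolve the new record. Returns None if the change has not
    been seen by the time timeout_seconds has elapsed.
    """
    start = clock()
    while True:
        if record_visible():
            return clock() - start
        if clock() - start >= timeout_seconds:
            return None
        sleep(poll_interval)


def exceeds_threshold(elapsed: Optional[float], threshold_seconds: float) -> bool:
    """Treat a missing measurement (timeout) the same as a slow one."""
    return elapsed is None or elapsed > threshold_seconds
```

Passing `clock` and `sleep` as parameters is only there so the sketch can be exercised without real waiting; a monitor would normally leave them at their defaults.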

We had joked about having a "red alert" alarm when we fall behind on propagation and so I turned that joke into reality.

Hawaii Pi-O

A Raspberry Pi hidden in an old hard drive case connects to our global monitors and obtains the current propagation time (as measured from outside our network). The Pi is connected (via a transistor acting as a switch) to a cheap mini police light that's visible throughout the office.
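The control loop on the Pi might look something like the sketch below. It is an illustration under stated assumptions rather than the code actually running in the office: `fetch_seconds` is a hypothetical function that asks the monitors for the current propagation time, `set_light` is a hypothetical function that switches the GPIO pin driving the transistor, and the threshold and polling interval are made-up values.

```python
import time
from typing import Callable, Optional


def poll_once(
    fetch_seconds: Callable[[], float],
    set_light: Callable[[bool], None],
    threshold_seconds: float,
) -> bool:
    """Read the propagation time once and switch the light accordingly.

    If the monitors cannot be reached (OSError), the light is turned on,
    on the assumption that no data is itself worth noticing.
    """
    try:
        seconds: Optional[float] = fetch_seconds()
    except OSError:
        seconds = None
    light_on = seconds is None or seconds > threshold_seconds
    set_light(light_on)
    return light_on


def run(
    fetch_seconds: Callable[[], float],
    set_light: Callable[[bool], None],
    threshold_seconds: float = 5.0,    # assumed value, not from the post
    interval_seconds: float = 10.0,    # assumed value, not from the post
    max_polls: Optional[int] = None,   # None means loop forever
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll repeatedly, updating the light after each reading."""
    polls = 0
    while max_polls is None or polls < max_polls:
        poll_once(fetch_seconds, set_light, threshold_seconds)
        polls += 1
        sleep(interval_seconds)
```

On real hardware, `set_light` would wrap whichever GPIO library is in use and drive the pin connected to the transistor; keeping it as a plain callable means the logic can be tried out on any machine.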

PS: All the puns in this post were added by John Graham-Cumming. I disclaim all responsibility.
