How we made our DNS stack 3x faster

by Tom Arnfeld.

Cloudflare is now well into its 6th year and providing authoritative DNS has been a core part of infrastructure from the start. We’ve since grown to be the largest and one of the fastest managed DNS services on the Internet, hosting DNS for nearly 100,000 of the Alexa top 1M sites and over 6 million other web properties – or DNS zones.

Space Shuttle Main Engine SSME CC-BY 2.0 image by Steve Jurvetson

Today Cloudflare’s DNS service answers around 1 million queries per second – not including attack traffic – via a global anycast network. Naturally as a growing startup, the technology we used to handle tens or hundreds of thousands of zones a few years ago became outdated over time, and couldn't keep up with the millions we have today. Last year we decided to replace two core elements of our DNS infrastructure: the part of our DNS server that answers authoritative queries and the data pipeline which takes changes made by our customers to DNS records and distributes them to our edge machines across the globe.

DNS Data Flow

The rough architecture of the system can be seen above. We store customer DNS records and other origin server information in a central database, convert the raw data into a format usable by our edge in the middle, and then distribute it to our >100 data centers (we call them PoPs - Points of Presence) using a KV (key/value) store.

The queries are served by a custom DNS server, rrDNS, that we’ve been using and developing for several years. In the early days of Cloudflare, our DNS service was built on top of PowerDNS, but that was phased out and replaced by rrDNS in 2013.

The Cloudflare DNS team owns two elements of the data flow: the data pipeline itself and rrDNS. The first goal was to replace the data pipeline with something entirely new as the current software was starting to show its age; as any >5 year old infrastructure would. The existing data pipeline was originally built for use with PowerDNS, and slowly evolved over time. It contained many warts and obscure features because it was built to translate our DNS records into the PowerDNS format.

A New Data Model

In the old system, the data model was fairly simple. We’d store the DNS records roughly in the same structure that they are represented in our UI or API: one entry per resource record (RR). This meant that the data pipeline only had to perform fairly rudimentary encoding tasks when generating the zone data to be distributed to the edge.

Zone metadata and RRs were encoded using a mix of JSON and Protocol Buffers, though we weren’t making particularly good use of the schematized nature of the protocols so the schemas were very bloated and the resulting data ended up being larger than necessary. Not to mention that as the number of total RRs in our database headed north of 100 million, these small differences in encoding made a significant difference in aggregate.

It’s worth remembering here that DNS doesn’t really operate on a per-RR basis when responding to queries. You query for a name and a type (e.g example.com and AAAA) and you’ll be given an RRSet which is a collection of RRs. The old data format had RRSets broken out into multiple RR entries (one key per record) which typically meant multiple roundtrips to our KV store to answer a single query. We wanted to change this and group data by RRSet so that a single request could be made to the KV store to retrieve all the data needed to answer a query. Because Cloudflare optimizes heavily for DNS performance, multiple KV lookups were limiting our ability to make rrDNS go as fast as possible.

In a similar vein, for lookups like A/AAAA/CNAME we decided to group the values into a single “address” key instead of one key per RRset. This further avoids having to perform extra lookups in the most common cases. Squishing keys together also helps reduce memory usage of the cache we use in front of the KV store, since we’re storing more information against a single cache key.

After settling on this new data model, we needed to figure out how to serialize the data and pass it to the edge. As mentioned, we were previously using a mix of JSON and Protocol Buffers, and we decided to replace this with a purely MessagePack-based implementation.

Why MessagePack?

MessagePack is a binary serialization format that is typed, but does not have a strict schema built into the format. In this regard, it can be considered a little like JSON. For both the reader and the writer, extra fields can be present or absent and it’s up to your application code to compensate.

In contrast, Protocol Buffers (or other formats like Cap’n Proto) require a schema for data structures defined in a language agnostic format, and then generate code for the specific implementation. Since DNS already has a large structured schema, we didn’t want to have to duplicate all of this schema in another language and then maintain it. In the old implementation with Protocol Buffers, we’d not properly defined schemas for all DNS types – to avoid this maintenance overhead – which resulted in a very confusing data model for rrDNS.

When looking for new formats we wanted something that would be fast, easy to use and that could integrate easily into the code base and libraries we were already using. rrDNS makes heavy use of the miekg/dns Go library which uses a large collection of structs to represent each RR type, for example:

type SRV struct {  
    Hdr      RR_Header
    Priority uint16
    Weight   uint16
    Port     uint16
    Target   string `dns:"domain-name"`
}

When decoding the data written by our pipeline in rrDNS we need to convert the RRs into these structs. As it turns out, the tinylib/msgp library we had been investigating has a rather nice set of code generation tools. This would allow us to auto-generate efficient Go code from the struct definitions without having to maintain another schema definition in another format.

This meant we could work with the miekg RR structs (of which we are already familiar with from rrDNS) in the data pipeline, serialize them straight into binary data, and then deserialize them again at the edge straight into a struct we could use. We didn't need to worry about mapping from one set of structures to another using this technique, which simplified things greatly.

MessagePack also performs incredibly well compared to other formats on the market. Here’s an excerpt from a Go serialization benchmarking test; we can see that on top of the other reasons MessagePack benefits our stack, it outperforms pretty much every other viable cross-platform option.

Benchmark Results Table

One unexpected surprise after switching to this new model was that we actually reduced the space required to store the data at the edge by around 9x, which was a significantly higher saving compared to our initial estimates. It just goes to show how much impact a bloated data model can have on a system.

A New Data Pipeline

Another very important feature of Cloudflare’s DNS is our ability to propagate zone changes around the globe in a matter of seconds, not minutes or hours. Our existing pipeline was struggling to keep up with the growing number of zones, and with changes to at least 5 zones each second, even at the quietest of times we needed something new.

Global distribution is hard

For a while now we’ve had this monitoring, and we are able to visualize propagation times across the globe. The graph below is taken from our end-to-end monitoring: it makes changes to DNS via our API and watches for the change from various probes around the world. Each dot on the graph represents a particular probe talking to one of our PoPs, and the delay is tracked as the time it took for a change made via our API to be visible externally.

Due to various layers of caches – both inside and outside of our control – we see some banding on 10s intervals under 1 minute, and it fluctuates all the time. For monitoring and alerting of this nature, the granularity we have here is sufficient but it’s something we’d definitely like to improve. In normal operation, new DNS data is actually available to 99% of our global PoPs in under 5s.

In this time frame we can see there were a couple of incidents where delays of a few minutes were visible for a small number of PoPs due to network connectivity, but generally all probes reported stable propagation times.

Zone Propagation Drift

In contrast, here’s a graph of the old data pipeline for the same period. We can see how the graph represents the growing delay in visible changes for all PoPs at any given time.

Late Zone Propagation Drift

With a new data model designed and ready to go, one that better matched our query patterns, we set out implementing a new service to pick up changes to our zones in the central data store, do any needed processing and send the resulting output to our KV store.

The new service (written in our favourite language Go) has been running in production since July 2016, and we’ve now migrated over 99% of Cloudflare customer zones over to it. If we exclude incidents where issues with congestion across the internet affect connectivity to or from a particular location, the new pipeline itself has experienced zero delays thus far.

Authoritative rrDNS v2

rrDNS is a modular application, which allows us to write different “filters” that can hand off processing of different types of queries to different code. The Authoritative filter is responsible for taking an incoming DNS query, looking up the zone the query name belongs to, and performing all relevant logic to find the RRSet to send back to the client.

Since we’ve completely revised the underlying DNS data model at our edge, we needed to make significant changes to the “Authoritative Filter” in rrDNS. This too is an old area of the code base that hasn’t significantly changed in a number of years. As with any ageing code base, this brings a number of challenges, so we opted to re-write the filter completely. This allowed us to redesign it from the ground up on our new data model, keeping a keen eye on performance, and to better suit the scale and shape of our DNS traffic today. Starting fresh also made it much easier to build in good development practices, such as high test coverage and better documentation.

We’ve been running the v2 version of the authoritative filter in production alongside the existing code since the later months of 2016, and it has already played a key role in the DNS aspects of our new load balancing product.

The results with the new filter have been great: we’re able to respond to DNS queries on average 3x faster than before, which is excellent news for our customers and improves our ability to mitigate large DNS attacks. We can see here that as the percentage of zones migrated increased, we saw a significant improvement in our average response time.

rrdns Response Time 60d

Replacing the wings while flying

The most time consuming part of the project was migrating customers from the old system to something entirely new, without impacting customers or anybody noticing what we were doing. Achieving this involved a significant effort from variety of people in our customer facing, support and operations teams. Cloudflare has many offices in different time zones – London, San Francisco, Singapore and Austin – so keeping everyone in sync was key to our success.

Already, as a part of the release process for rrDNS we automatically sample and replay production queries against existing and upcoming code to detect unexpected differences, so naturally we decided to extend this idea for our migration. For any zone to pass the migration test, we compared the possible answers for the entire zone from the old system and the new system. Just one failure would result in the tool skipping the zone.

This allowed us to iteratively test the migration of zones and fix issues as they arose, keeping releases simple and regular. We chose not to do a single – and very scary – switch away from the old system, but run them both in parallel and slowly move zones over keeping them both in sync. Meaning we quickly could migrate zones back in case something unexpected happened.

Once we got going we were safely migrating zones at several hundred thousand per day, and we kept a close eye on how far we were from our initial goal of 99%. The last mile is still in progress, as there is often an element of customer engagement for some complex configurations that need attention.

Zone Migration Chart

What did we gain?

Replacing a piece of infrastructure this core to Cloudflare took significant effort from a large variety of teams. So what did we gain?

  • Average of 3x performance boost in code handling DNS queries
  • Faster and more consistent updates to DNS data around the globe
  • A much more robust system for SREs to operate and engineers to maintain
  • Consolidated feature-set based on today’s requirements, and better documentation of edge case behaviours
  • More test coverage, better metrics and higher confidence in our code, making it safer to make changes and develop our DNS products

Now that we’re now able to process our customers DNS more quickly, we’ll soon be rolling out support for a few new RR types and some other exciting new things in the coming months.

Does solving these kinds of technical and operational challenges excite you? Cloudflare is always hiring for talented specialists and generalists within our Engineering, Technical Operations and other teams.

comments powered by Disqus