We use Consul for service discovery, and we’ve deployed a cluster that spans several of our data centers. This cluster exposes HTTP and DNS interfaces so that clients can query the Consul catalog and search for a particular service and the majority of the clients use DNS. We were aware from the start that the DNS query latencies were not great from certain parts of the world that were furthest away from these data centers. This, together with the fact that we use DNS over TLS, results in some long latencies. The TTL of these names being low makes it even more impractical when resolving these names in the hot path.
The usual way to solve these issues is by caching values so that at least subsequent requests are resolved quickly, and this is exactly what our resolver of choice, Unbound, is configured to do. The problem remains when the cache expires. When it expires, the next client will have to wait while Unbound resolves the name using the network. To have a low recovery time in case some service needs to failover and clients need to use another address we use a small TTL (30 seconds) in our records and as such cache expirations happen often.
There are at least two strategies to deal with this: prefetching and stale cache.
When prefetching is turned on, the server tries to refresh DNS records in the background before they expire. In practice, the way this works is: if an entry is served from cache and the TTL is less than 10% of the lifetime of the records, the server responds to the client, but in the background, it dispatches a job to refresh the record across the network. If it all goes as planned, the entry is refreshed in the cache before the TTL expires, and no client is ever left waiting for the network resolution.
This works very well for records that are accessed frequently, but poses issues in two situations: records that are accessed infrequently and records that have a small TTL. In both cases it is unlikely that a prefetch would be triggered, and so the cache expires and the next client will have to wait for the network resolution.
In our case, the records have a 30 second TTL, so frequently no requests come in during the yellow "prefetch" stage, which bumps the records out of the cache and causes the next client to have to wait for the slow network resolution. Prefetching was already turned on in our Unbound configuration but this still was not enough for our use case.
Another option is to serve stale records from the cache while we dispatch a job to refresh the record in the background without keeping the client waiting for the response. This would provide benefits in two different scenarios:
- General high latency - even after the TTL expires subsequent attempts will be returned from cache if the response takes too long to arrive.
- Consul/network blips - if for some reason Consul is unavailable/unreachable, there is still a period of time that expired responses can be used instead of failing resolution.
The first case is exactly our current scenario.
Being able to serve stale entries from the cache for a period after the TTL expires affords us some time in which we can return a result to a client either immediately or after some small amount of time has passed. The latter is the behaviour that we are looking for which also makes us compliant with RFC 8767, “Serving Stale Data to Improve DNS Resiliency”, published March 2020.
Previous dealings with Unbound's stale cache
Stale cache had been tried before at Cloudflare in 2017, and at that time, we encountered two problems:
1. Unbound could serve stale records from cache indefinitely
We could solve this through configuration that did not exist in 2017, but we had to test it.
2. Unbound serves negative/incomplete information from the expired cache
It was reported that Unbound would serve incomplete chains of responses to clients (for example, a CNAME record with no corresponding A/AAAA record).
This turned out to be unconfirmed but it is a failure mode that nevertheless we wanted to make sure Unbound could handle since it would be possible for parts of the response chain to be purged from the stale cache before others.
As mentioned before we would need to test a current version of Unbound, come up with an up-to-date stale cache configuration, and test the scenarios that posed problems when we last tried to use it plus any others that we came up with. The Unbound version used for tests was 1.13.0 and the stale cache was configured thus:
- serve-expired: yes - enable the stale cache.
- serve-expired-ttl: 3600 - this limits the serving of expired resources to one hour after they expire and no response from upstream was obtained.
- serve-expired-client-timeout: 500 - serve from stale cache only when upstream takes longer than 500ms to respond.
The option serve-expired-ttl-reset was considered, which would cause the records to never expire, even with the upstream down indefinitely as long as there are regular requests for them, but it was deemed too dangerous since it could potentially last forever and applies to all entries.
Example consul zone: service1.sd.
While resolving this name we end up with:
;; ANSWER SECTION: service1.sd. 120 IN CNAME service1.query.consul. service1.query.consul. 30 IN A 192.0.2.1 ;; Query time: 230 msec
If we dump Unbound's cache (unbound-control dump_cache), we can find the entries there:
service1.query.consul. 23 IN A 192.0.2.1 service1.sd. 113 IN CNAME service1.query.consul.
If we wait for the TTL of these entries to expire and dump the cache again, we notice that they don't show up anymore:
➜ unbound-control dump_cache | grep -e query.consul -e service1 ➜
But if we block access to the Consul resolver, we can see that still the entries are returned:
;; ANSWER SECTION: service1.sd. 119 IN CNAME service1.query.consul. service1.query.consul. 30 IN A 192.0.2.1 ;; Query time: 503 msec
Notice that Unbound waited 500ms before returning a response, which was the value we configured. This also shows that the dump cache method does not return expired entries even though they are present. So far so good.
Test - Unbound could serve stale records from cache indefinitely
After 3600 seconds pass we again try to resolve the same name:
➜ dig service1.sd. @127.0.0.1 ; <<>> DiG 9.10.6 <<>> service1.sd. @127.0.0.1 ;; global options: +cmd ;; connection timed out; no servers could be reached
Dig gave up on waiting for a response from Unbound. So we see that after the configured time, Unbound will indeed stop serving the stale record, even if it has been requested recently. This is due to serve-expired-ttl: 3600 and serve-expired-ttl-reset being set to "no" by default. These options were not available when we first tried Unbound’s stale cache.
Test - Unbound serves negative/incomplete information from the expired cache
Since the record we are resolving includes a CNAME and an A record with different TTLs, we will try to see if Unbound can be made to return an incomplete chain (CNAME with no A record).
We start by having both entries in cache, and block connections to the Consul resolver. We then remove from the cache the A entry:
$ /usr/sbin/unbound-control flush_type service1.query.consul. A ok
Now we attempt to query:
$ dig service1.sd. @127.0.0.1 ... ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 9532 ;; Query time: 22 msec
Nothing was returned which is the result we were expecting.
Another scenario we wanted to test was to see if when serve-expired-ttl + A's TTL had passed, the CNAME would be returned on its own.
➜ dig service1.sd. @127.0.0.1 ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48918 ;; ANSWER SECTION: service1.sd. 48 IN CNAME service1.query.consul. service1.query.consul. 30 IN A 192.0.2.1 ;; Query time: 430 msec ;; WHEN: Thu Dec 31 17:07:55 WET 2020 ➜ dig service1.sd. @127.0.0.1 ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 42432 ;; Query time: 509 msec ;; WHEN: Thu Dec 31 17:07:56 WET 2020
We lowered serve-expired-ttl to 40 to help testing the issue and we were able to see that as soon as the A entry was deemed not serve-able, the CNAME was also not returned, which is the behaviour that we want.
One last thing that we wanted to test was if the cache was really refreshed in the background while the stale entry was returned. Network Link Conditioner was used to simulate high latency:
➜ dig service1.service.consul @CONSUL_ADDRESS ;; ANSWER SECTION: service1.service.consul. 60 IN A 192.0.2.10 service1.service.consul. 60 IN A 192.0.2.12 ;; Query time: 3053 msec
We can see that the artificial latency is working. We can now test if changes to the address are actually reflected on the consul record even though a stale record is served at first:
➜ dig service2.service.consul @127.0.0.1 ... ;; ANSWER SECTION: service2.service.consul. 30 IN A 192.0.2.100 ;; Query time: 501 msec ;; SERVER: 127.0.0.1#53(127.0.0.1) ;; WHEN: Sat Jan 02 13:00:36 WET 2021 ;; MSG SIZE rcvd: 89 ➜ dig service2.service.consul @127.0.0.1 ... ;; ANSWER SECTION: service2.service.consul. 58 IN A 192.0.2.200 ;; Query time: 0 msec ;; SERVER: 127.0.0.1#53(127.0.0.1) ;; WHEN: Sat Jan 02 13:00:39 WET 2021 ;; MSG SIZE rcvd: 89
We can see that the first attempt was served from the stale cache, since the address was already changed on Consul to 192.0.2.200, and the query time was 500ms. The next attempt was served from the fresh cache, hence the 0ms response time, and so we know that the cache was refreshed in the background.
All the test scenarios completed successfully, even though we did find an issue that has since been fixed upstream. We realised that the prefetching logic blocked the cache while it attempted to refresh the entry, which prevented stale entries from being served if the upstream was unresponsive. This fix is included in version 1.13.1.
This change has already been deployed to all of our data centers, and while the change addressed our original use case, there were some interesting side-effects as can be seen from the graphs below.
Decreased in-flight requests
The graph above is from the Toronto data center which received the change at approximately 14:00 UTC. We can observe an attenuated curve, which indicates that spikes in the number of in-flight requests were probably caused by slow/unresponsive upstreams which are now served a stale response after a short while.
The same graph from the Perth data center, which received the change on a later day.
Decrease in P99 latency
The graph above is from the Lisbon data center, which received the change at approximately 12:37 UTC. We can see that the long tail (P99) of request latencies decreases dramatically as we are now able to serve a response to clients in more situations without having them wait for a slow upstream network resolution that might not even succeed.
The effect is even more dramatic on the Paris data center.
The decrease in P99 latencies directly correlates to the start of serving stale entries, even though there are not many of them. In the graph above we can observe the number of stale records served per second for the same data center, using the last 3 minutes of datapoints.
Decreased general error rate
This graph is from our Perth data center, which received the change on February 8. This is explained by now being able to serve some requests to unresponsive/slow upstreams, which previously would have resulted in a failed request.
Graph from the Vientiane data center, which received the change on February 1.
Less jitter on the rate of load balancer queries by RRDNS
Another graph from the Perth data center, which received the change on February 8. An explanation for this change is that resolvers usually retry after a small amount of time when they don't get any response, and so by returning a stale response instead of keeping the client waiting (and retrying), we prevent a surge of new queries for expired and slow to resolve records.
A graph from the Doha data center, which received the change on February 1.
Miscellaneous Unbound metrics
This change also had a positive effect on several Unbound-specific metrics, below are several graphs from the Perth data center that received the change on February 8.
In the future, we would like to tweak the serve-expired-client-timeout value, which is currently set to 500ms. It was chosen conservatively, at a time when we were cautiously experimenting with a feature that had been problematic in the past, and our gut feeling is that it can be at least halved without causing issues to further improve the client’s experience. There is also the possibility of using different values per data center, for example picking up the P90 resolve latency value which differs per datacenter, and this is something that we want to explore further.
We are enabling Unbound to return expired entries from the cache whenever an upstream takes longer than 500ms to respond, while refreshing the cache in the background, unless 3600 seconds have elapsed from the record's original TTL.