
Creative foot-shooting with Go RWMutex

10/29/2015

2 min read

Hi, I'm Filippo and today I managed to surprise myself! (And not in a good way.)

I'm developing a new module ("filter" as we call them) for RRDNS, CloudFlare's Go DNS server. It's a rewrite of the authoritative module, the one that adds the IP addresses to DNS answers.

It has a table of CloudFlare IPs that looks like this:

type IPMap struct {
	sync.RWMutex
	M map[string][]net.IP
}

It's a global filter attribute:

type V2Filter struct {
	name       string
	IPTable    *IPMap
	// [...]
}

Mexican Standoff (CC-BY-NC-ND image by Martin SoulStealer)

The table changes often, so a background goroutine periodically reloads it from our distributed key-value store, acquires the lock (f.IPTable.Lock()), updates it and releases the lock (f.IPTable.Unlock()). This happens every 5 minutes.
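For concreteness, the reload side is shaped roughly like this. It's a sketch, not the actual RRDNS code: loadIPTable is a hypothetical stand-in for the fetch from the key-value store.

func (f *V2Filter) reloadLoop() {
	for range time.Tick(5 * time.Minute) {
		newTable := loadIPTable() // hypothetical: fetch the fresh table from the KV store

		f.IPTable.Lock()
		f.IPTable.M = newTable
		f.IPTable.Unlock()
	}
}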

Everything worked in tests, including multiple and concurrent requests.

Today we deployed to an off-production test machine and everything worked. For a few minutes. Then RRDNS stopped answering queries for the beta domains served by the new code.

What. That worked on my laptop™.

Here's the IPTable consumer function. You can probably spot the bug.

func (f *V2Filter) getCFAddr(...) (result []dns.RR) {
	f.IPTable.RLock()
	// [... append IPs from f.IPTable.M to result ...]
	return
}

f.IPTable.RUnlock() is never called. Whoops. But it's an RLock, so multiple getCFAddr calls should work, and only table reloading should break, no? Instead getCFAddr started blocking after a few minutes. To the docs!

"To ensure that the lock eventually becomes available, a blocked Lock call excludes new readers from acquiring the lock." (https://golang.org/pkg/sync/#RWMutex.Lock)

So everything worked, and leaked RLocks piled up, until the table reload function ran: the pending Lock call then caused every subsequent RLock call to block, breaking RRDNS answer generation.
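The mechanism is easy to reproduce in isolation. Here's a standalone sketch (not RRDNS code): one leaked RLock, a Lock that starts waiting on it, and then a fresh RLock that never returns.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // a reader takes the lock and never releases it

	go func() {
		mu.Lock() // the "table reload": blocks waiting for the leaked reader
	}()
	time.Sleep(100 * time.Millisecond) // give the Lock call time to start waiting

	done := make(chan struct{})
	go func() {
		mu.RLock() // a new reader: blocked by the pending writer
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("second RLock succeeded")
	case <-time.After(time.Second):
		fmt.Println("second RLock blocked: deferred deadlock")
	}
}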

In tests, the table reload function never ran while queries were being answered, so getCFAddr kept piling up RLock calls but never blocked.
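The fix is as boring as the bug: release the read lock on every return path, most simply with a defer right after taking it. (A sketch of the obvious fix, not necessarily the exact patch that went in.)

func (f *V2Filter) getCFAddr(...) (result []dns.RR) {
	f.IPTable.RLock()
	defer f.IPTable.RUnlock() // runs on every return path
	// [... append IPs from f.IPTable.M to result ...]
	return
}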

No customers were affected because A) the release was still being tested on off-production machines and B) no real customers run on the new code yet. Anyway, it was an interesting way to cause a deferred deadlock.

In closing, there's probably room for better tooling here. A static analysis tool might output a listing of all Lock/Unlock calls, and a dynamic analysis tool might report Mutexes still [r]locked at the end of tests. (Or maybe these tools already exist, in which case let me know!)
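For example, a crude end-of-test check along these lines (just a sketch of the dynamic-analysis idea, assuming the test can reach the mutex) would have flagged the leaked RLock:

// assertUnlocked fails the test if mu is still [r]locked, by trying to
// take the write lock: a pending Lock can't complete while any reader
// or writer is still holding the mutex.
func assertUnlocked(t *testing.T, mu *sync.RWMutex) {
	acquired := make(chan struct{})
	go func() {
		mu.Lock()
		mu.Unlock()
		close(acquired)
	}()
	select {
	case <-acquired:
		// nothing was holding the lock
	case <-time.After(time.Second):
		t.Error("RWMutex still locked at the end of the test")
	}
}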

Do you want to help (introduce :) and) fix bugs in the DNS server answering more than 50 billion queries every day? We are hiring in London, San Francisco and Singapore!

Tags: RRDNS, Bugs, DNS, Reliability, Programming, Go
