Hi, I'm Filippo and today I managed to surprise myself! (And not in a good way.)
I'm developing a new module ("filter" as we call them) for RRDNS, CloudFlare's Go DNS server. It's a rewrite of the authoritative module, the one that adds the IP addresses to DNS answers.
It has a table of CloudFlare IPs that looks like this:
type IPMap struct {
    sync.RWMutex
    M map[string][]net.IP
}
It's a global filter attribute:
type V2Filter struct {
    name    string
    IPTable *IPMap
    // [...]
}
The table changes often, so a background goroutine periodically reloads it from our distributed key-value store, acquires the lock (f.IPTable.Lock()), updates it and releases the lock (f.IPTable.Unlock()). This happens every 5 minutes.
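For context, a minimal sketch of what that reload loop might look like (the method name reloadLoop and the loadTableFromKV helper are hypothetical stand-ins, not the actual RRDNS code):

func (f *V2Filter) reloadLoop() {
    for range time.Tick(5 * time.Minute) {
        newTable := loadTableFromKV() // hypothetical: fetch a fresh table from the KV store

        // Swap the new table in under the write lock.
        f.IPTable.Lock()
        f.IPTable.M = newTable
        f.IPTable.Unlock()
    }
}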
Everything worked in tests, including multiple and concurrent requests.
Today we deployed to an off-production test machine and everything worked. For a few minutes. Then RRDNS stopped answering queries for the beta domains served by the new code.
What. _That worked on my laptop_™.
Here's the IPTable consumer function. You can probably spot the bug.
func (f *V2Filter) getCFAddr(...) (result []dns.RR) {
    f.IPTable.RLock()
    // [... append IPs from f.IPTable.M to result ...]
    return
}
f.IPTable.RUnlock() is never called. Whoops. But it's an RLock, so multiple getCFAddr calls should work, and only table reloading should break, no? Instead getCFAddr started blocking after a few minutes. To the docs!
To ensure that the lock eventually becomes available, a blocked Lock call excludes new readers from acquiring the lock. https://golang.org/pkg/sync/#RWMutex.Lock
So everything worked, and RLocks piled up, until the table reload function ran; then the pending Lock call caused all following RLock calls to block, breaking RRDNS answer generation.
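Here's a standalone sketch (not the RRDNS code) that reproduces the failure mode with a single leaked RLock and one pending writer:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var mu sync.RWMutex

    // A reader takes the lock and, like getCFAddr, never releases it.
    mu.RLock()

    // A writer (the table reload) blocks waiting for that reader.
    go func() {
        mu.Lock()
        defer mu.Unlock()
    }()
    time.Sleep(100 * time.Millisecond) // give the writer time to block

    // A new reader now blocks too: the pending Lock excludes new readers.
    acquired := make(chan struct{})
    go func() {
        mu.RLock()
        defer mu.RUnlock()
        close(acquired)
    }()

    select {
    case <-acquired:
        fmt.Println("second RLock acquired")
    case <-time.After(time.Second):
        fmt.Println("second RLock blocked: deferred deadlock")
    }
}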
In tests the table reload function never ran while answering queries, so getCFAddr kept piling up RLock calls but never blocked.
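The fix itself is the standard idiom: pair the RLock with a deferred RUnlock, so the read lock is released however the function returns. A sketch, mirroring the elided snippet above:

func (f *V2Filter) getCFAddr(...) (result []dns.RR) {
    f.IPTable.RLock()
    defer f.IPTable.RUnlock() // released when the function returns, even on early returns
    // [... append IPs from f.IPTable.M to result ...]
    return
}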
No customers were affected because A) the release was still being tested on off-production machines and B) no real customers run on the new code yet. Anyway, it was an interesting way to cause a deferred deadlock.
In closing, there's probably room for better tooling here. A static analysis tool might output a listing of all Lock/Unlock calls, and a dynamic analysis tool might report still [r]locked Mutexes at the end of tests. (Or maybe these tools already exist, in which case let me know!)
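To illustrate the dynamic-analysis idea, here's a rough, hypothetical sketch of a test-only wrapper around sync.RWMutex that counts outstanding read locks so a test can assert none were leaked (nothing like this exists in RRDNS, it's just the shape such a check could take):

package lockcheck

import (
    "sync"
    "sync/atomic"
    "testing"
)

// CheckedRWMutex wraps sync.RWMutex and tracks outstanding read locks.
type CheckedRWMutex struct {
    sync.RWMutex
    readers int64
}

func (m *CheckedRWMutex) RLock() {
    m.RWMutex.RLock()
    atomic.AddInt64(&m.readers, 1)
}

func (m *CheckedRWMutex) RUnlock() {
    atomic.AddInt64(&m.readers, -1)
    m.RWMutex.RUnlock()
}

// AssertUnlocked reports leaked read locks at the end of a test.
func (m *CheckedRWMutex) AssertUnlocked(t *testing.T) {
    if n := atomic.LoadInt64(&m.readers); n != 0 {
        t.Errorf("RWMutex still held by %d reader(s)", n)
    }
}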
Do you want to help (introduce :) and) fix bugs in the DNS server answering more than 50 billion queries every day? We are hiring in London, San Francisco and Singapore!