The Cloudflare Blog

How and why the leap second affected Cloudflare DNS

John Graham-Cumming — Sun, 01 Jan 2017 22:40:26 GMT

At midnight UTC on New Year’s Day, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should always have been, at worst, zero. A little later this negative value caused RRDNS to panic. This panic was caught using the recover feature of the Go language. The net effect was that some DNS resolutions to some Cloudflare managed web properties failed.

The problem only affected customers who use CNAME DNS records with Cloudflare, and only affected a small number of machines across Cloudflare's 102 data centers. At peak approximately 0.2% of DNS queries to Cloudflare were affected and less than 1% of all HTTP requests to Cloudflare encountered an error.

This problem was quickly identified. The most affected machines were patched in 90 minutes and the fix was rolled out worldwide by 0645 UTC. We are sorry that our customers were affected, but we thought it was worth writing up the root cause for others to understand.

A little bit about Cloudflare DNS

Cloudflare customers use our DNS service to serve the authoritative answers for DNS queries for their domains. They need to tell us the IP address of their origin web servers so we can contact the servers to handle non-cached requests. They do this in two ways: either they enter the IP addresses associated with the names (e.g. the IP address of example.com is 192.0.2.123 and is entered as an A record) or they enter a CNAME (e.g. example.com is origin-server.example-hosting.biz).

This image shows a test site with an A record for theburritobot.com and a CNAME for www.theburritobot.com pointing directly to Heroku.

When a customer uses the CNAME option, Cloudflare has occasionally to do a lookup, using DNS, for the actual IP address of the origin server. It does this automatically using standard recursive DNS. It was this CNAME lookup code that contained the bug that caused the outage.

Internally, Cloudflare operates DNS resolvers to lookup DNS records from the Internet and RRDNS talks to these resolvers to get IP addresses when doing CNAME lookups. RRDNS keeps track of how well the internal resolvers are performing and does a weighted selection of possible resolvers (we operate multiple per data center for redundancy) and chooses the most performant. Some of these resolutions ended up recording in a data structure a negative value during the leap second.

The weighted selection code, at a later point, was being fed the negative number which caused it to panic. The negative number got there through a combination of the leap second and smoothing.

A falsehood programmers believe about time

The root cause of the bug that affected our DNS service was the belief that time cannot go backwards. In our case, some code assumed that the difference between two times would always be, at worst, zero.

RRDNS is written in Go and uses Go’s time.Now() function to get the time. Unfortunately, this function does not guarantee monotonicity. Go currently doesn’t offer a monotonic time source (see issue 12914 for discussion).

In measuring the performance of the upstream DNS resolvers used for CNAME lookups RRDNS contains the following code:

// Update upstream sRTT on UDP queries, penalize it if it fails
if !start.IsZero() {
	rtt := time.Now().Sub(start)
	if success && rcode != dns.RcodeServerFailure {
		s.updateRTT(rtt)
	} else {
		// The penalty should be a multiple of actual timeout
		// as we don't know when the good message was supposed to arrive,
		// but it should not put server to backoff instantly
		s.updateRTT(TimeoutPenalty * s.timeout)
	}
}

In the code above rtt could be negative if time.Now() was earlier than start (which was set by a call to time.Now() earlier).

That code works well if time moves forward. Unfortunately, we’ve tuned our resolvers to be very fast which means that it’s normal for them to answer in a few milliseconds. If, right when a resolution is happening, time goes back a second the perceived resolution time will be negative.

RRDNS doesn’t just keep a single measurement for each resolver, it takes many measurements and smoothes them. So, the single measurement wouldn’t cause RRDNS to think the resolver was working in negative time, but after a few measurements the smoothed value would eventually become negative.

When RRDNS selects an upstream to resolve a CNAME it uses a weighted selection algorithm. The code takes the upstream time values and feeds them to Go’s rand.Int63n() function. rand.Int63n promptly panics if its argument is negative. That's where the RRDNS panics were coming from.

(Aside: there are many other falsehoods programmers believe about time)

The one character fix

One precaution when using a non-monotonic clock source is to always check whether the difference between two timestamps is negative. Should this happen, it’s not possible to accurately determine the time difference until the clock stops rewinding.

In this patch we allowed RRDNS to forget about current upstream performance, and let it normalize again if time skipped backwards. This prevents leaking of negative numbers to the server selection code, which would result in throwing errors before attempting to contact the upstream server.

The fix we applied prevents the recording of negative values in server selection. Restarting all the RRDNS servers then fixed any recurrence of the problem.

Timeline

The following is the complete timeline of the events around the leap second bug.

2017-01-01 00:00 UTC Impact starts2017-01-01 00:10 UTC Escalated to engineers2017-01-01 00:34 UTC Issue confirmed2017-01-01 00:55 UTC Mitigation deployed to one canary node and confirmed2017-01-01 01:03 UTC Mitigation deployed to canary data center and confirmed2017-01-01 01:23 UTC Fix deployed in most impacted data center2017-01-01 01:45 UTC Fix being deployed to major data centers2017-01-01 01:48 UTC Fix being deployed everywhere2017-01-01 02:50 UTC Fix rolled out to most of the affected data centers2017-01-01 06:45 UTC Impact ends

This chart shows error rates for each Cloudflare data center (some data centers were more affected than others) and the rapid drop in errors as the fix was deployed. We deployed the fix prioritizing those locations with the most errors first.

Conclusion

We are sorry that our customers were affected by this bug and are inspecting all our code to ensure that there are no other leap second sensitive uses of time intervals.

Economical With The Truth: Making DNSSEC Answers Cheap

Dani Grant — Fri, 24 Jun 2016 16:31:10 GMT

We launched DNSSEC late last year and are already signing 56.9 billion DNS record sets per day. At this scale, we care a great deal about compute cost. One of the ways we save CPU cycles is our unique implementation of negative answers in DNSSEC.

CC BY-SA 2.0 image by Chris Short

I will briefly explain a few concepts you need to know about DNSSEC and negative answers, and then we will dive into how CloudFlare saves on compute when asked for names that don’t exist.

What You Need To Know: DNSSEC Edition

Here’s a quick summary of DNSSEC:

This is an unsigned DNS answer (unsigned == no DNSSEC):

cloudflare.com.		299	IN	A	198.41.214.162
cloudflare.com.		299	IN	A	198.41.215.162

This is an answer with DNSSEC:

cloudflare.com.		299	IN	A	198.41.214.162
cloudflare.com.		299	IN	A	198.41.215.162
cloudflare.com.		299	IN	RRSIG	A 13 2 300 20160311145051 20160309125051 35273     cloudflare.com. RqRna0qkih8cuki++YbFOkJi0DGeNpCMYDzlBuG88LWqx+Aaq8x3kQZX TzMTpFRs6K0na9NCUg412bOD4LH3EQ==

Answers with DNSSEC contain a signature for every record type that is returned. (In this example, only A records are returned so there is only one signature.) The signatures allow DNS resolvers to validate the records returned and prevent on-path attackers from intercepting and changing the answers.

What You Need To Know: Negative Answer Edition

There are two types of negative answers. The first is NXDOMAIN, which means that the name asked for does not exist. An example of this is a query asking for missing.cloudflare.com. missing.cloudflare.com doesn’t exist at all.

The second type is NODATA, which means that the name does exist, just not in the requested type. An example of this would be asking for the MX record of blog.cloudflare.com. There are A records for blog.cloudflare.com but no MX records so the appropriate response is NODATA.

What Goes Into An `NXDOMAIN` With DNSSEC

To see what gets returned in a negative NXDOMAIN answer, let’s look at the response for a query for bogus.ietf.org.

The first record that has to be returned in a negative answer with DNSSEC is an SOA, just like in an unsigned negative answer. The SOA contains some metadata about the zone and lets the recursor know how long to cache the negative answer for.

ietf.org.	1179	IN	SOA	ns0.amsl.com. glen.amsl.com. 1200000325 1800 1800 604800 1800

Because the domain is signed with DNSSEC, the signature for the SOA is also returned:

ietf.org.	1179	IN	RRSIG	SOA 5 2 1800 20170308083354 20160308073501 40452 ietf.org. S0gIjTnQGA6TyIBjCeBXL4ip8aEQEgg2y+kCQ3sLtFa3oNy9vj9kj4aP 8EVu4oIexr8X/i9L8Oj5ec4HOrQoYsMGObRUG0FGT0MEbxepi+wWrfed vD/3mq8KZg/pj6TQAKebeSQGkmb8y9eP0PdWdUi6EatH9ZY/tsoiKyqg U4vtq9sWZ/4mH3xfhK9RBI4M7XIXsPX+biZoik6aOt4zSWR5WDq27pXI 0l+BLzZb72C7McT4PlBiF+U86OngBlGxVBnILyW2aUisi2LY6KeO5AmK WNT0xHWe5+JtPD5PgmSm46YZ8jMP5mH4hSYr76jqwvlCtXvq8XgYQU/P QyuCpQ==

The next part of the negative answer in DNSSEC is a record type called NSEC. The NSEC record returns the previous and next name in the zone, which proves to the recursor that the queried name cannot possibly exist, because nothing exists between the two names listed in the NSEC record.

www.apps.ietf.org.	1062	IN	NSEC	cloudflare-verify.ietf.org. A RRSIG NSEC

This NSEC record above tells you that bogus.ietf.org does not exist because no names exist canonically between www.apps.ietf.org and cloudflare-verify.ietf.org. Of course, this record also has a signature contained in the answer:

www.apps.ietf.org.	1062	IN	RRSIG	NSEC 5 4 1800 20170308083322 20160308073501 40452 ietf.org. NxmjhCkTtoiolJUow/OreeBRxTtf2AnIPM/r2p7oS/hNeOdFI9tpgGQY g0lTOYjcNNoIoDB/r56Kd+5wtuaKT+xsYiZ4K413I+cmrNQ+6oLT+Mz6 Kfzvo/TcrJD99PVAYIN1MwzO42od/vi/juGkuKJVcCzrBKNHCZqu7clu mU3DEqbQQT2O8dYIUjLlfom1iYtZZrfuhB6FCYFTRd3h8OLfMhXtt8f5 8Q/XvjakiLqov1blZAK229I2qgUYEhd77n2pXV6SJuOKcSjZiQsGJeaM wIotSKa8EttJELkpNAUkN9uXfhU+WjouS1qzgyWwbf2hdgsBntKP9his 9MfJNA==

A second NSEC record is also returned to prove that there is no wildcard that would have covered bogus.ietf.org:

ietf.org.	1062	IN	NSEC	ietf1._domainkey.ietf.org. A NS SOA MX TXT AAAA RRSIG NSEC DNSKEY SPF

This record above tells you that a wildcard (*. ietf.org) would have existed between those two names. Because there is no wildcard record at *.ietf.org, as proven by this NSEC record, the DNS resolver knows that really nothing should have been returned for bogus.ietf.org. This NSEC record also has a signature:

ietf.org.	1062	IN	RRSIG	NSEC 5 2 1800 20170308083303 20160308073501 40452 ietf.org. homg5NrZIKo0tR+aEp0MVYYjT7J/KGTKP46bJ8eeetbq4KqNvLKJ5Yig ve4RSWFYrSARAmbi3GIFW00P/dFCzDNVlMWYRbcFUt5NfYRJxg25jy95 yHNmInwDUnttmzKuBezdVVvRLJY3qSM7S3VfI/b7n6++ODUFcsL88uNB V6bRO6FOksgE1/jUrtz6/lEKmodWWI2goFPGgmgihqLR8ldv0Dv7k9vy Ao1uunP6kDQEj+omkICFHaT/DBSSYq59DVeMAAcfDq2ssbr4p8hUoXiB tNlJWEubMnHi7YmLSgby+m8b97+8b6qPe8W478gAiggsNjc2gQSKOOXH EejOSA==

All in all, the negative answer for bogus.ietf.org contains an SOA + SOA RRSIG + (2) NSEC + (2) NSEC RRSIG. It is 6 records in total, returning an answer that is 1095 bytes (this is a large DNS answer).

Zone Walking

What you may have noticed is that because the negative answer returns the previous and next name, you can keep asking for next names and essentially “walk” the zone until you learn every single name contained in it.

For example, if you ask for the NSEC on ietf.org, you will get back the first name in the zone, ietf1._domainkey.ietf.org:

ietf.org.		1799	IN	NSEC 	ietf1._domainkey.ietf.org.  A NS SOA MX TXT AAAA RRSIG NSEC DNSKEY SPF

Then if you ask for the NSEC on ietf1._domainkey.ietf.org you will get the next name in the zone:

ietf1._domainkey.ietf.org. 1799	IN	NSEC 	apps.ietf.org. TXT RRSIG NSEC

And you can keep going until you get every name in the zone:

apps.ietf.org.		1799	IN	NSEC 	mail.apps.ietf.org. MX RRSIG NSEC

The root zone uses NSEC as well, so you can walk the root to see every TLD:

The root NSEC:

.			21599	IN	NSEC 	aaa. NS SOA RRSIG NSEC DNSKEY

.aaa NSEC:

aaa.			21599	IN	NSEC 	aarp. NS DS RRSIG NSEC

.aarp NSEC:

aarp.			21599	IN	NSEC	 abb. NS DS RRSIG NSEC

Zone walking was actually considered a feature of the original design:

The complete NXT chains specified in this document enable a resolver to obtain, by successive queries chaining through NXTs, all of the names in a zone. - RFC2535

(NXT is the original DNS record type that NSEC was based off of)

However, as you can imagine, this is a terrible idea for some zones. If you could walk the .gov zone, you could learn every US government agency and government agency portal. If you owned a real estate company where every realtor got their own subdomain, a competitor could walk through your zone and find out who all of your realtors are.

CC BY 2.0 image by KIUI

So the DNS community rallied together and found a solution. They would continue to return previous and next names, but they would hash the outputs. This was defined in an upgrade to NSEC called NSEC3.

6rmo7l6664ki2heho7jtih1lea9k6los.icann.org. 3599 IN NSEC3 1 0 5 2C21FAE313005174 6S2J9F2OI56GPVEIH3KBKJGGCL21SKKL A RRSIG

NSEC3 was a “close but no cigar” solution to the problem. While it’s true that it made zone walking harder, it did not make it impossible. Zone walking with NSEC3 is still possible with a dictionary attack. An attacker can use a list of the most common hostnames, hash them with the hashing algorithm used in the NSEC3 record (which is listed in the record itself) and see if there are any matches. Even if the domain owner uses a salt on the hash, the length of the salt is included in the NSEC3 record, so there are a finite number of salts to guess.

The Salt Length field defines the length of the salt in octets, ranging in value from 0 to 255.”

RFC5155

`NODATA` Responses

If you recall from above, NODATA is the response from a server when it is asked for a name that exists, but not in the requested type (like an MX record for blog.cloudflare.com). NODATA is similar in output to NXDOMAIN. It still requires SOA, but it only takes one NSEC record to prove the next name, and to specify which types do exist on the queried name.

For example, if you look for a TXT record on apps.ietf.org, the NSEC record will tell you that while there is no TXT record on apps.ietf.org, there are MX, RRSIG and NSEC records.

apps.ietf.org.		1799	IN	NSEC 	mail.apps.ietf.org. MX RRSIG NSEC

Problems With Negative Answers

There are two problems with negative answers:

The first is that the authoritative server needs to return the previous and next name. As you’ll see, this is computationally expensive for CloudFlare, and as you’ve already seen, it can leak information about a zone.

The second is that negative answers require two NSEC records and their two subsequent signatures (or three NSEC3 records and three NSEC3 signatures) to authenticate the nonexistence of one name. This means that answers are bigger than they need to be.

The Trouble with Previous and Next Names

CloudFlare has a custom in house DNS server built in Go called RRDNS. What's unique about RRDNS is that unlike standard DNS servers, it does not have the concept of a zone file. Instead, it has a key value store that holds all of the DNS records of all of the domains. When it gets a query for a record, it can just pick out the record that it needs.

Another unique aspect of CloudFlare's DNS is that a lot of our business logic is handled in the DNS. We often dynamically generate DNS answers on the fly, so we don't always know what we will respond with before we are asked.

Traditional negative answers require the authoritative server to return the previous and next name of a missing name. Because CloudFlare does not have the full view of the zone file, we'd have to ask the database to do a sorted search just to figure out the previous and next names. Beyond that, because we generate answers on the fly, we don’t have a reliable way to know what might be the previous and next name, unless we were to precompute every possible option ahead of time.

One proposed solution to the previous and next name, and secrecy problems is RFC4470, dubbed 'White Lies'. This RFC proposes that DNS operators make up a previous and next name by randomly generating names that are canonically slightly before and after the requested name.

White lies is a great solution to block zone walking (and it helps us prevent unnecessary database lookups), but it still requires 2 NSEC records (one for previous and next name and another for the wildcard) to say one thing, so the answer is still bigger than it needs to be.

When CloudFlare Lies

We decided to take lying in negative answers to its fullest extent. Instead of white lies, we do black lies.

For an NXDOMAIN, we always return \000.(the missing name) as the next name, and because we return an NSEC directly on the missing name, we do not have to return an additional NSEC for the wildcard. This way we only have to return SOA, SOA RRSIG, NSEC and NSEC RRSIG, and we do not need to search the database or precompute dynamic answers.

Our negative answers are usually around 300 bytes. For comparison, negative answers for ietf.org which uses NSEC and icann.org, which uses NSEC3 are both slightly over 1000 bytes, three times the size. The reason this matters so much is that the maximum size of an unsigned UDP packet is typically 512 octets. DNSSEC requires support for at least 1220 octets long messages over UDP, but above that limit, the client may need to upgrade to DNS over TCP. A good practice is to keep enough headroom in order to keep response sizes below fragmentation threshold during zone signing key rollover periods.

NSEC: 1096 bytes

ietf.org.		1799	IN	SOA	ns0.amsl.com. glen.amsl.com. 1200000317 1800 1800 604800 1800
ietf.org.		1799	IN	RRSIG	SOA 5 2 1800 20170213210533 20160214200831 40452 ietf.org. P8XoJx+SK5nUZAV/IqiJrsoKtP1c+GXmp3FvEOUZPFn1VwW33242LVrJ GMI5HHjMEX07EzOXZyLnQeEvlf2QLxRIQm1wAnE6W4SUp7TgKUZ7NJHP dgLr2gqKYim4CI7ikYj3vK7NgcaSE5jqIZUm7oFxxYO9/YPz4Mx7COw6 XBOMYS2v8VY3DICeJdZsHJnVKlgl8L7/yqrL8qhkSW1yDo3YtB9cZEjB OVk8uRDxK7aHkEnMRz0LODOJ10AngJpg9LrkZ1CO444RhZGgTbwzN9Vq rDyH47Cn3h8ofEOJtYCJvuX5CCzaZDInBsjq9wNAiNBgIQatPkNriR77 hCEHhQ==
ietf.org.		1799	IN	NSEC	ietf1._domainkey.ietf.org. A NS SOA MX TXT AAAA RRSIG NSEC DNSKEY SPF
ietf.org.		1799	IN	RRSIG	NSEC 5 2 1800 20170213210816 20160214200831 40452 ietf.org. B9z/JJs30tkn0DyxVz0zaRlm4HkeNY1TqYmr9rx8rH7kC32PWZ1Fooy6 16qmB33/cvD2wtOCKMnNQPdTG2qUs/RuVxqRPZaQojIVZsy/GYONmlap BptzgOJLP7/HOxgYFgMt5q/91JHfp6Mn0sd218/H86Aa98RCXwUOzZnW bdttjsmbAqONuPQURaGz8ZgGztFmQt5dNeNRaq5Uqdzw738vQjYwppfU 9GSLkT7RCh3kgbNcSaXeuWfFnxG1R2SdlRoDICos+RqdDM+23BHGYkYc /NEBLtjYGxPqYCMe/7lOtWQjtQOkqylAr1r7pSI2NOA9mexa7yTuXH+x o/rzRA==
www.apps.ietf.org.	1799	IN	NSEC	cloudflare-verify.ietf.org. A RRSIG NSEC
www.apps.ietf.org.	1799	IN	RRSIG	NSEC 5 4 1800 20170213210614 20160214200831 40452 ietf.org. U+hEHcTps2IC8VKS61rU3MDZq+U0KG4/oJjIHVYbrWufQ7NdMdnY6hCL OmQtsvuZVRQjWHmowRhMj83JMUagxoZuWTg6GuLPin3c7PkRimfBx7jI wjqORwcuvpBh92A/s/2HXBma3PtDZl2UDLy4z7wdO62rbxGU/LX1jTqY FoJJLJfJ/C+ngVMIE/QVneXSJkAjHV96FSEnreF81V62x9azv3AHo4tl qnoYvRDtK+cR072A5smtWMKDfcIr2fI11TAGIyhR55yAiollPDEz5koj BfMstC/JXVURJMM+1vCPjxvwYzTZN8iICf1AupyyR8BNWxgic5yh1ljH 1AuAVQ==

Black Lies: 357 bytes

cloudflare.com.		1799	IN	SOA	ns3.cloudflare.com. dns.cloudflare.com. 2020742566 10000 2400 604800 3600
blog.cloudflare.com.	3599	IN	NSEC	\000.blog.cloudflare.com. RRSIG NSEC
cloudflare.com.		1799	IN	RRSIG	SOA 13 2 86400 20160220230013 20160218210013 35273 cloudflare.com. kgjtJDuuNC/yX8yWQpol4ZUUr8s8yAXZi26KWBI6S3HDtry2t6LnP1ou QK10Ut7DXO/XhyZddRBVj3pIpWYdBQ==
blog.cloudflare.com.	3599	IN	RRSIG	NSEC 13 3 3600 20160220230013 20160218210013 35273 cloudflare.com. 8BKAAS8EXNJbm8DxEI1OOBba8KaiimIuB47mPlteiZf3sVLGN1edsrXE +q+pHaSHEfYG5mHfCBJrbi6b3EoXOw==

DNS Shotgun

Our take on NODATA responses is also unique. Traditionally, NODATA responses contain one NSEC record to tell the resolver which types exist on the requested name. This is highly inefficient. To do this, we’d have to search the database for all the types that do exist, just to answer that the requested type does not exist. Remember that’s not even always possible because we have dynamic answers that are generated on the fly.

What we realized was that NSEC is a denial of existence. What matters in NSEC are the missing types, not the present ones. So what we do is we set all the types. We say, this name does exist, just not on the one type you asked for.

For example, if you asked for an TXT record of blog.cloudflare.com we would say, all the types exist, just not TXT.

blog.cloudflare.com.	3599	IN	NSEC	\000.blog.cloudflare.com. A WKS HINFO MX AAAA LOC SRV CERT SSHFP IPSECKEY RRSIG NSEC TLSA HIP OPENPGPKEY SPF

And then if you queried for a MX on blog.cloudflare.com, we would return saying we have every record type, even TXT, but just not MX.

blog.cloudflare.com.	3599	IN	NSEC	\000.blog.cloudflare.com. A WKS HINFO TXT AAAA LOC SRV CERT SSHFP IPSECKEY RRSIG NSEC TLSA HIP OPENPGPKEY SPF

This saves us a database lookup and from leaking any zone information in negative answers. We call this the DNS Shotgun.

How Are Black Lies and DNS Shotgun Standards Compliant

We put a lot of care to ensure CloudFlare’s negative answers are standards compliant. We’re even pushing for them to become an Internet Standard by publishing an Internet Draft earlier this year.

RFC4470, White Lies, allows us to randomly generate next names in NSEC. Not setting the second NSEC for the wildcard subdomain is also allowed, so long as there exists an NSEC record on the actual queried name. And lastly, our lie of setting every record type in NSEC records for NODATA, is okay too –– after all, domains are constantly changing, it’s feasible that the zone file changed in between the time the NSEC record indicated to you there was no MX record on blog.cloudflare.com and when you queried successfully for MX on blog.cloudflare.com.

Conclusion

We’re proud of our negative answers. They help us keep packet size small, and CPU consumption low enough for us to provide DNSSEC for free for any domain. Let us know what you think, we’re looking forward to hearing from you.

Go coverage with external tests

Filippo Valsorda — Tue, 19 Jan 2016 18:19:59 GMT

The Go test coverage implementation is quite ingenious: when asked to, the Go compiler will preprocess the source so that when each code portion is executed a bit is set in a coverage bitmap. This is integrated in the go test tool: go test -cover enables it and -coverprofile= allows you to write a profile to then inspect with go tool cover.

This makes it very easy to get unit test coverage, but there's no simple way to get coverage data for tests that you run against the main version of your program, like end-to-end tests.

The proper fix would involve adding -cover preprocessing support to go build, and exposing the coverage profile maybe as a runtime/pprof.Profile, but as of Go 1.6 there’s no such support. Here instead is a hack we've been using for a while in the test suite of RRDNS, our custom Go DNS server.

We create a dummy test that executes main(), we put it behind a build tag, compile a binary with go test -c -cover and then run only that test instead of running the regular binary.

Here's what the rrdns_test.go file looks like:

// +build testrunmain

package main

import "testing"

func TestRunMain(t *testing.T) {
	main()
}

We compile the binary like this

$ go test -coverpkg="rrdns/..." -c -tags testrunmain rrdns

And then when we want to collect coverage information, we execute this instead of ./rrdns (and run our test battery as usual):

$ ./rrdns.test -test.run "^TestRunMain$" -test.coverprofile=system.out

You must return from main() cleanly for the profile to be written to disk; in RRDNS we do that by catching SIGINT. You can still use command line arguments and standard input normally, just note that you will get two lines of extra output from the test framework.

Finally, since you probably also run unit tests, you might want to merge the coverage profiles with gocovmerge (from issue #6909):

$ go get github.com/wadey/gocovmerge
$ gocovmerge unit.out system.out > all.out
$ go tool cover -html all.out

If finding creative ways to test big-scale network services sounds fun, know that we are hiring in London, San Francisco and Singapore.

Creative foot-shooting with Go RWMutex

Filippo Valsorda — Thu, 29 Oct 2015 21:26:36 GMT

Hi, I'm Filippo and today I managed to surprise myself! (And not in a good way.)

I'm developing a new module ("filter" as we call them) for RRDNS, CloudFlare's Go DNS server. It's a rewrite of the authoritative module, the one that adds the IP addresses to DNS answers.

It has a table of CloudFlare IPs that looks like this:

type IPMap struct {
	sync.RWMutex
	M map[string][]net.IP
}

It's a global filter attribute:

type V2Filter struct {
	name       string
	IPTable    *IPMap
	// [...]
}

CC-BY-NC-ND image by Martin SoulStealer

The table changes often, so a background goroutine periodically reloads it from our distributed key-value store, acquires the lock (f.IPTable.Lock()), updates it and releases the lock (f.IPTable.Unlock()). This happens every 5 minutes.

Everything worked in tests, including multiple and concurrent requests.

Today we deployed to an off-production test machine and everything worked. For a few minutes. Then RRDNS stopped answering queries for the beta domains served by the new code.

What. _That worked on my laptop_™.

Here's the IPTable consumer function. You can probably spot the bug.

func (f *V2Filter) getCFAddr(...) (result []dns.RR) {
	f.IPTable.RLock()
	// [... append IPs from f.IPTable.M to result ...]
	return
}

f.IPTable.RUnlock() is never called. Whoops. But it's an RLock, so multiple getCFAddr calls should work, and only table reloading should break, no? Instead getCFAddr started blocking after a few minutes. To the docs!

To ensure that the lock eventually becomes available, a blocked Lock call excludes new readers from acquiring the lock. https://golang.org/pkg/sync/#RWMutex.Lock

So everything worked and RLocks piled up until the table reload function ran, then the pending Lock call caused all following RLock calls to block, breaking RRDNS answer generation.

In tests the table reload function never ran while answering queries, so getCFAddr kept piling up RLock calls but never blocked.

No customers were affected because A) the release was still being tested on off-production machines and B) no real customers run on the new code yet. Anyway it was a interesting way to cause a deferred deadlock.

In closing, there's probably space for a better tooling here. A static analysis tool might output a listing of all Lock/Unlock calls, and a dynamic analysis tool might report still [r]locked Mutex at the end of tests. (Or maybe these tools already exist, in which case let me know!)

Do you want to help (introduce :) and) fix bugs in the DNS server answering more than 50 billion queries every day? We are hiring in London, San Francisco and Singapore!

DNS parser, meet Go fuzzer

Filippo Valsorda — Thu, 06 Aug 2015 13:40:40 GMT

Here at CloudFlare we are heavy users of the github.com/miekg/dns Go DNS library and we make sure to contribute to its development as much as possible. Therefore when Dmitry Vyukov published go-fuzz and started to uncover tens of bugs in the Go standard library, our task was clear.

Hot Fuzz

Fuzzing is the technique of testing software by continuously feeding it inputs that are automatically mutated. For C/C++, the wildly successful afl-fuzz tool by Michał Zalewski uses instrumented source coverage to judge which mutations pushed the program into new paths, eventually hitting many rarely-tested branches.

go-fuzz applies the same technique to Go programs, instrumenting the source by rewriting it (like godebug does). An interesting difference between afl-fuzz and go-fuzz is that the former normally operates on file inputs to unmodified programs, while the latter asks you to write a Go function and passes inputs to that. The former usually forks a new process for each input, the latter keeps calling the function without restarting often.

There is no strong technical reason for this difference (and indeed afl recently gained the ability to behave like go-fuzz), but it's likely due to the different ecosystems in which they operate: Go programs often expose well-documented, well-behaved APIs which enable the tester to write a good wrapper that doesn't contaminate state across calls. Also, Go programs are often easier to dive into and more predictable, thanks obviously to GC and memory management, but also to the general community repulsion towards unexpected global states and side effects. On the other hand many legacy C code bases are so intractable that the easy and stable file input interface is worth the performance tradeoff.

Back to our DNS library. RRDNS, our in-house DNS server, uses github.com/miekgs/dns for all its parsing needs, and it has proved to be up to the task. However, it's a bit fragile on the edge cases and has a track record of panicking on malformed packets. Thankfully, this is Go, not BIND C, and we can afford to recover() panics without worrying about ending up with insane memory states. Here's what we are doing

func ParseDNSPacketSafely(buf []byte, msg *old.Msg) (err error) {
	defer func() {
		panicked := recover()

		if panicked != nil {
			err = errors.New("ParseError")
		}
	}()

	err = msg.Unpack(buf)

	return
}

We saw an opportunity to make the library more robust so we wrote this initial simple fuzzing function:

func Fuzz(rawMsg []byte) int {
    msg := &dns.Msg{}

    if unpackErr := msg.Unpack(rawMsg); unpackErr != nil {
        return 0
    }

    if _, packErr = msg.Pack(); packErr != nil {
        println("failed to pack back a message")
        spew.Dump(msg)
        panic(packErr)
    }

    return 1
}

To create a corpus of initial inputs we took our stress and regression test suites and used github.com/miekg/pcap to write a file per packet.

package main

import (
	"crypto/rand"
	"encoding/hex"
	"log"
	"os"
	"strconv"

	"github.com/miekg/pcap"
)

func fatalIfErr(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

func main() {
	handle, err := pcap.OpenOffline(os.Args[1])
	fatalIfErr(err)

	b := make([]byte, 4)
	_, err = rand.Read(b)
	fatalIfErr(err)
	prefix := hex.EncodeToString(b)

	i := 0
	for pkt := handle.Next(); pkt != nil; pkt = handle.Next() {
		pkt.Decode()

		f, err := os.Create("p_" + prefix + "_" + strconv.Itoa(i))
		fatalIfErr(err)
		_, err = f.Write(pkt.Payload)
		fatalIfErr(err)
		fatalIfErr(f.Close())

		i++
	}
}

CC BY 2.0 image by JD Hancock

We then compiled our Fuzz function with go-fuzz, and launched the fuzzer on a lab server. The first thing go-fuzz does is minimize the corpus by throwing away packets that trigger the same code paths, then it starts mutating the inputs and passing them to Fuzz() in a loop. The mutations that don't fail (return 1) and expand code coverage are kept and iterated over. When the program panics, a small report (input and output) is saved and the program restarted. If you want to learn more about go-fuzz watch the author's GopherCon talk or read the README.

Crashes, mostly "index out of bounds", started to surface. go-fuzz becomes pretty slow and ineffective when the program crashes often, so while the CPUs burned I started fixing the bugs.

In some cases I just decided to change some parser patterns, for example reslicing and using len() instead of keeping offsets. However these can be potentially disrupting changes—I'm far from perfect—so I adapted the Fuzz function to keep an eye on the differences between the old and new, fixed parser, and crash if the new parser started refusing good packets or changed its behavior:

func Fuzz(rawMsg []byte) int {
    var (
        msg, msgOld = &dns.Msg{}, &old.Msg{}
        buf, bufOld = make([]byte, 100000), make([]byte, 100000)
        res, resOld []byte

        unpackErr, unpackErrOld error
        packErr, packErrOld     error
    )

    unpackErr = msg.Unpack(rawMsg)
    unpackErrOld = ParseDNSPacketSafely(rawMsg, msgOld)

    if unpackErr != nil && unpackErrOld != nil {
        return 0
    }

    if unpackErr != nil && unpackErr.Error() == "dns: out of order NSEC block" {
        // 97b0a31 - rewrite NSEC bitmap [un]packing to account for out-of-order
        return 0
    }

    if unpackErr != nil && unpackErr.Error() == "dns: bad rdlength" {
        // 3157620 - unpackStructValue: drop rdlen, reslice msg instead
        return 0
    }

    if unpackErr != nil && unpackErr.Error() == "dns: bad address family" {
        // f37c7ea - Reject a bad EDNS0_SUBNET family on unpack (not only on pack)
        return 0
    }

    if unpackErr != nil && unpackErr.Error() == "dns: bad netmask" {
        // 6d5de0a - EDNS0_SUBNET: refactor netmask handling
        return 0
    }

    if unpackErr != nil && unpackErrOld == nil {
        println("new code fails to unpack valid packets")
        panic(unpackErr)
    }

    res, packErr = msg.PackBuffer(buf)

    if packErr != nil {
        println("failed to pack back a message")
        spew.Dump(msg)
        panic(packErr)
    }

    if unpackErrOld == nil {

        resOld, packErrOld = msgOld.PackBuffer(bufOld)

        if packErrOld == nil && !bytes.Equal(res, resOld) {
            println("new code changed behavior of valid packets:")
            println()
            println(hex.Dump(res))
            println(hex.Dump(resOld))
            os.Exit(1)
        }

    }

    return 1
}

I was pretty happy about the robustness gain, but since we used the ParseDNSPacketSafely wrapper in RRDNS I didn't expect to find security vulnerabilities. I was wrong!

DNS names are made of labels, usually shown separated by dots. In a space saving effort, labels can be replaced by pointers to other names, so that if we know we encoded example.com at offset 15, www.example.com can be packed as www. + PTR(15). What we found is a bug in handling of pointers to empty names: when encountering the end of a name (0x00), if no label were read, "." (the empty name) was returned as a special case. Problem is that this special case was unaware of pointers, and it would instruct the parser to resume reading from the end of the pointed-to empty name instead of the end of the original name.

For example if the parser encountered at offset 60 a pointer to offset 15, and msg[15] == 0x00, parsing would then resume from offset 16 instead of 61, causing a infinite loop. This is a potential Denial of Service vulnerability.

A) Parse up to position 60, where a DNS name is found

| ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
| ... | 0x00 |      |      | ... |      |      | ->15 |      |

------------------------------------------------->     

B) Follow the pointer to position 15

| ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
| ... | 0x00 |      |      | ... |      |      | ->15 |      |

         ^                                        |
         ------------------------------------------      

C) Return a empty name ".", special case triggers

D) Erroneously resume from position 16 instead of 61

| ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
| ... | 0x00 |      |      | ... |      |      | ->15 |      |

                 -------------------------------->   

E) Rinse and repeat

We sent the fixes privately to the library maintainer while we patched our servers and we opened a PR once done. (Two bugs were independently found and fixed by Miek while we released our RRDNS updates, as it happens.)

Not just crashes and hangs

Thanks to its flexible fuzzing API, go-fuzz lends itself nicely not only to the mere search of crashing inputs, but can be used to explore all scenarios where edge cases are troublesome.

Useful applications range from checking output validation by adding crashing assertions to your Fuzz() function, to comparing the two ends of a unpack-pack chain and even comparing the behavior of two different versions or implementations of the same functionality.

For example, while preparing our DNSSEC engine for launch, I faced a weird bug that would happen only on production or under stress tests: NSEC records that were supposed to only have a couple bits set in their types bitmap would sometimes look like this

deleg.filippo.io.  IN  NSEC    3600    \000.deleg.filippo.io. NS WKS HINFO TXT AAAA LOC SRV CERT SSHFP RRSIG NSEC TLSA HIP TYPE60 TYPE61 SPF

The catch was that our "pack and send" code pools []byte buffers to reduce GC and allocation churn, so buffers passed to dns.msg.PackBuffer(buf []byte) can be "dirty" from previous uses.

var bufpool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 0, 2048)
    },
}

[...]

    data := bufpool.Get().([]byte)
    defer bufpool.Put(data)

    if data, err = r.Response.PackBuffer(data); err != nil {

However, buf not being an array of zeroes was not handled by some github.com/miekgs/dns packers, including the NSEC rdata one, that would just OR present bits, without clearing ones that are supposed to be absent.

case `dns:"nsec"`:
    lastwindow := uint16(0)
    length := uint16(0)
    for j := 0; j < val.Field(i).Len(); j++ {
        t := uint16((fv.Index(j).Uint()))
        window := uint16(t / 256)
        if lastwindow != window {
            off += int(length) + 3
        }
        length = (t - window*256) / 8
        bit := t - (window * 256) - (length * 8)

        msg[off] = byte(window) // window #
        msg[off+1] = byte(length + 1) // octets length

        // Setting the bit value for the type in the right octet
--->    msg[off+2+int(length)] |= byte(1 << (7 - bit)) 

        lastwindow = window
    }
    off += 2 + int(length)
    off++
}

The fix was clear and easy: we benchmarked a few different ways to zero a buffer and updated the code like this

// zeroBuf is a big buffer of zero bytes, used to zero out the buffers passed
// to PackBuffer.
var zeroBuf = make([]byte, 65535)

var bufpool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 0, 2048)
    },
}

[...]

    data := bufpool.Get().([]byte)
    defer bufpool.Put(data)
    copy(data[0:cap(data)], zeroBuf)

    if data, err = r.Response.PackBuffer(data); err != nil {

Note: a recent optimization turns zeroing range loops into memclr calls, so once 1.5 lands that will be much faster than copy().

But this was a boring fix! Wouldn't it be nicer if we could trust our library to work with any buffer we pass it? Luckily, this is exactly what coverage based fuzzing is good for: making sure all code paths behave in a certain way.

What I did then is write a Fuzz() function that would first parse a message, and then pack it to two different buffers: one filled with zeroes and one filled with 0xff. Any differences between the two results would signal cases where the underlying buffer is leaking into the output.

func Fuzz(rawMsg []byte) int {
    var (
        msg         = &dns.Msg{}
        buf, bufOne = make([]byte, 100000), make([]byte, 100000)
        res, resOne []byte

        unpackErr, packErr error
    )

    if unpackErr = msg.Unpack(rawMsg); unpackErr != nil {
        return 0
    }

    if res, packErr = msg.PackBuffer(buf); packErr != nil {
        return 0
    }

    for i := range res {
        bufOne[i] = 1
    }

    resOne, packErr = msg.PackBuffer(bufOne)
    if packErr != nil {
        println("Pack failed only with a filled buffer")
        panic(packErr)
    }

    if !bytes.Equal(res, resOne) {
        println("buffer bits leaked into the packed message")
        println(hex.Dump(res))
        println(hex.Dump(resOne))
        os.Exit(1)
    }

    return 1
}

I wish here, too, I could show a PR fixing all the bugs, but go-fuzz did its job even too well and we are still triaging and fixing what it finds.

Anyway, once the fixes are done and go-fuzz falls silent, we will be free to drop the buffer zeroing step without worry, with no need to audit the whole codebase!

Do you fancy fuzzing the libraries that serve 43 billion queries per day? We are hiring in London, San Francisco and Singapore!

Quick and dirty annotations for Go stack traces

Filippo Valsorda — Mon, 03 Aug 2015 11:26:04 GMT

CloudFlare’s DNS server, RRDNS, is entirely written in Go and typically runs tens of thousands goroutines. Since goroutines are cheap and Go I/O is blocking we run one goroutine per file descriptor we listen on and queue new packets for processing.

CC BY-SA 2.0 image by wiredforlego

When there are thousands of goroutines running, debug output quickly becomes difficult to interpret. For example, last week I was tracking down a problem with a file descriptor and wanted to know what its listening goroutine was doing. With 40k stack traces, good luck figuring out which one is having trouble.

Go stack traces include parameter values, but most Go types are (or are implemented as) pointers, so what you will see passed to the goroutine function is just a meaningless memory address.

We have a couple options to make sense of the addresses: get a heap dump at the same time as the stack trace and cross-reference the pointers, or have a debug endpoint that prints a goroutine/pointer -> IP map. Neither are seamless.

Underscore to the rescue

However, we know that integers are shown in traces, so what we did is first convert IPv4 addresses to their uint32 representation:

// addrToUint32 takes a TCPAddr or UDPAddr and converts its IP to a uint32.
// If the IP is v6, 0xffffffff is returned.
func addrToUint32(addr net.Addr) uint32 {
       var ip net.IP
       switch addr := addr.(type) {
       case *net.TCPAddr:
               ip = addr.IP
       case *net.UDPAddr:
               ip = addr.IP
       case *net.IPAddr:
               ip = addr.IP
       }
       if ip == nil {
               return 0
       }
       ipv4 := ip.To4()
       if ipv4 == nil {
               return math.MaxUint32
       }
       return uint32(ipv4[0])<<24 | uint32(ipv4[1])<<16 | uint32(ipv4[2])<<8 | uint32(ipv4[3])
}

And then pass the IPv4-as-uint32 to the listening goroutine as an _ parameter. Yes, as a parameter with name _; it's known as the blank identifier in Go.

// PacketUDPRead is a goroutine that listens on a specific UDP socket and reads
// in new requests
// The first parameter is the int representation of the listening IP address,
// and it's passed just so it will appear in stack traces
func PacketUDPRead(_ uint32, conn *net.UDPConn, ...) { ... }

go PacketUDPRead(addrToUint32(conn.LocalAddr()), conn, ...)

Now when we get a stack trace, we can just look at the first bytes, convert them back to dotted notation, and know on what IP the goroutine was listening.

goroutine 42 [IO wait]:
	[...]
	/.../request.go:195 +0x5d
rrdns/core.PacketUDPRead(0xc27f000001, 0x2b6328113ad8, 0xc20801ecc0, 0xc208044308, 0xc208e99280, 0xc208ad8180, 0x12a05f200)
	/.../server.go:119 +0x35a
created by rrdns/core.PacketIO
	/.../server.go:230 +0x8be

0xc27f000001 -> remove alignment byte -> 0x7f000001 -> 127.0.0.1

Obviously you can do the same with any piece of information you can represent as an int.

Are you interested in taming the goroutines that run the web? We're hiring in London, San Francisco and Singapore!

Setting Go variables from the outside

Filippo Valsorda — Wed, 01 Jul 2015 13:26:37 GMT

CloudFlare's DNS server, RRDNS, is written in Go and the DNS team used to generate a file called version.go in our Makefile. version.go looked something like this:

// THIS FILE IS AUTOGENERATED BY THE MAKEFILE. DO NOT EDIT.

// +build	make

package version

var (
	Version   = "2015.6.2-6-gfd7e2d1-dev"
	BuildTime = "2015-06-16-0431 UTC"
)

and was used to embed version information in RRDNS. It was built inside the Makefile using sed and git describe from a template file. It worked, but was pretty ugly.

Today we noticed that another Go team at CloudFlare, the Data team, had a much smarter way to bake version numbers into binaries using the -X linker option.

The -X Go linker option, which you can set with -ldflags, sets the value of a string variable in the Go program being linked. You use it like this: -X main.version 1.0.0.

A simple example: let's say you have this source file saved as hello.go.

package main

import "fmt"

var who = "World"

func main() {
    fmt.Printf("Hello, %s.\n", who)
}

Then you can use go run (or other build commands like go build or go install) with the -ldflags option to modify the value of the who variable:

$ go run hello.go
Hello, World.
$ go run -ldflags="-X main.who CloudFlare" hello.go
Hello, CloudFlare.

The format is importpath.name string, so it's possible to set the value of any string anywhere in the Go program, not just in main. Note that from Go 1.5 the syntax has changed to importpath.name=string. The old style is still supported but the linker will complain.

I was worried this would not work with external linking (for example when using cgo) but as we can see with -ldflags="-linkmode=external -v" the Go linker runs first anyway and takes care of our -X.

$ go build -x -ldflags="-X main.who CloudFlare -linkmode=external -v" hello.go
WORK=/var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T/go-build149644699
mkdir -p $WORK/command-line-arguments/_obj/
cd /Users/filippo/tmp/X
/usr/local/Cellar/go/1.4.2/libexec/pkg/tool/darwin_amd64/6g -o $WORK/command-line-arguments.a -trimpath $WORK -p command-line-arguments -complete -D _/Users/filippo/tmp/X -I $WORK -pack ./hello.go
cd .
/usr/local/Cellar/go/1.4.2/libexec/pkg/tool/darwin_amd64/6l -o hello -L $WORK -X main.hi hi -linkmode=external -v -extld=clang $WORK/command-line-arguments.a
# command-line-arguments
HEADER = -H1 -T0x2000 -D0x0 -R0x1000
searching for runtime.a in $WORK/runtime.a
searching for runtime.a in /usr/local/Cellar/go/1.4.2/libexec/pkg/darwin_amd64/runtime.a
 0.06 deadcode
 0.07 pclntab=284969 bytes, funcdata total 49800 bytes
 0.07 dodata
 0.08 symsize = 0
 0.08 symsize = 0
 0.08 reloc
 0.09 reloc
 0.09 asmb
 0.09 codeblk
 0.09 datblk
 0.09 dwarf
 0.09 sym
 0.09 headr
host link: clang -m64 -gdwarf-2 -Wl,-no_pie,-pagezero_size,4000000 -o hello -Qunused-arguments /var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T//go-link-mFNNCD/000000.o /var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T//go-link-mFNNCD/000001.o /var/folders/v8/xdj2snz51sg2m2bnpmwl_91c0000gn/T//go-link-mFNNCD/go.o -g -O2 -g -O2 -lpthread
 0.17 cpu time
33619 symbols
64 sizeof adr
216 sizeof prog
23412 liveness data

Do you want to work next to Go developers that can always make you learn a new trick? We are hiring in London, San Francisco and Singapore.

Go has a debugger—and it's awesome!

Filippo Valsorda — Thu, 18 Jun 2015 11:14:00 GMT

Something that often, uh... bugs[1] Go developers is the lack of a proper debugger. Sure, builds are ridiculously fast and easy, and println(hex.Dump(b)) is your friend, but sometimes it would be nice to just set a breakpoint and step through that endless if chain or print a bunch of values without recompiling ten times.

CC BY 2.0 image by Carl Milner

You could try to use some dirty gdb hacks that will work if you built your binary with a certain linker and ran it on some architectures when the moon was in a waxing crescent phase, but let's be honest, it isn't an enjoyable experience.

Well, worry no more! godebug is here!

godebug is an awesome cross-platform debugger created by the Mailgun team. You can read their introduction for some under-the-hood details, but here's the cool bit: instead of wrestling with half a dozen different ptrace interfaces that would not be portable, godebug rewrites your source code and injects function calls like godebug.Line on every line, godebug.Declare at every variable declaration, and godebug.SetTrace for breakpoints (i.e. wherever you type _ = "breakpoint").

I find this solution brilliant. What you get out of it is a (possibly cross-compiled) debug-enabled binary that you can drop on a staging server just like you would with a regular binary. When a breakpoint is reached, the program will stop inline and wait for you on stdin. It's the single-binary, zero-dependencies philosophy of Go that we love applied to debugging. Builds everywhere, runs everywhere, with no need for tools or permissions on the server. It even compiles to JavaScript with gopherjs (check out the Mailgun post above—show-offs ;) ).

You might ask, "But does it get a decent runtime speed or work with big applications?" Well, the other day I was seeing RRDNS—our in-house Go DNS server—hit a weird branch, so I placed a breakpoint a couple lines above the if in question, recompiled the whole of RRDNS with godebug instrumentation, dropped the binary on a staging server, and replayed some DNS traffic.

filippo@staging:~$ ./rrdns -config config.json
-> _ = "breakpoint"
(godebug) l

    q := r.Query.Question[0]

--> _ = "breakpoint"

    if !isQtypeSupported(q.Qtype) {
        return
(godebug) n
-> if !isQtypeSupported(q.Qtype) {
(godebug) q
dns.Question{Name:"filippo.io.", Qtype:0x1, Qclass:0x1}
(godebug) c

Boom. The request and the debug log paused (make sure to terminate any timeout you have in your tools), waiting for me to step through the code.

Sold yet? Here's how you use it: simply run godebug {build|run|test} instead of go {build|run|test}. We adapted godebug to resemble the go tool as much as possible. Remember to use -instrument if you want to be able to step into packages that are not main.

For example, here is part of the RRDNS Makefile:

bin/rrdns:
ifdef GODEBUG
	GOPATH="${PWD}" go install github.com/mailgun/godebug
	GOPATH="${PWD}" ./bin/godebug build -instrument "${GODEBUG}" -o bin/rrdns rrdns
else
	GOPATH="${PWD}" go install rrdns
endif

test:
ifdef GODEBUG
	GOPATH="${PWD}" go install github.com/mailgun/godebug
	GOPATH="${PWD}" ./bin/godebug test -instrument "${GODEBUG}" rrdns/...
else
	GOPATH="${PWD}" go test rrdns/...
endif

Debugging is just a make bin/rrdns GODEBUG=rrdns/... away.

This tool is still young, but in my experience, perfectly functional. The UX could use some love if you can spare some time (as you can see above it's pretty spartan), but it should be easy to build on what's there already.

About source rewriting

Before closing, I'd like to say a few words about the technique of source rewriting in general. It powers many different Go tools, like test coverage, fuzzing and, indeed, debugging. It's made possible primarily by Go’s blazing-fast compiles, and it enables amazing cross-platform tools to be built easily.

However, since it's such a handy and powerful pattern, I feel like there should be a standard way to apply it in the context of the build process. After all, all the source rewriting tools need to implement a subset of the following features:

Wrap the main function
Conditionally rewrite source files
Keep global state

Why should every tool have to reinvent all the boilerplate to copy the source files, rewrite the source, make sure stale objects are not used, build the right packages, run the right tests, and interpret the CLI..? Basically, all of godebug/cmd.go. And what about gb, for example?

I think we need a framework for Go source code rewriting tools. (Spoiler, spoiler, ...)

If you’re interested in working on Go servers at scale and developing tools to do it better, remember we’re hiring in London, San Francisco, and Singapore!

I'm sorry. ↩︎

The weird and wonderful world of DNS LOC records

John Graham-Cumming — Tue, 01 Apr 2014 01:19:00 GMT

A cornerstone of CloudFlare's infrastructure is our ability to serve DNS requests quickly and handle DNS attacks. To do both those things we wrote our own authoritative DNS server called RRDNS in Go. Because of it we've been able to fight off DNS attacks, and be consistenly one of the fastest DNS providers on the web.

Implementing an authoritative DNS server is a large task. That's in part because DNS is a very old standard (RFC 1035 dates to 1987), in part because as DNS has developed it has grown into a more and more complex system, and in part because what's written in the RFCs and what happens in the real-world aren't always the same thing.

One little used type of DNS record is the LOC (or location). It allows you to specify a physical location. CloudFlare handles millions of DNS records; of those just 743 are LOCs. Nevertheless, it's possible to set up a LOC record in the CloudFlare DNS editor.

My site geekatlas.com has a LOC record as an Easter Egg. Here's how it's configured in the CloudFlare DNS settings:

When you operate at CloudFlare scale the little-used nooks and crannies turn out to be important. And even though there are only 743 LOC records in our entire database, at least one customer contacted support to find out why their LOC record wasn't being served.

And that sent me into the RRDNS source code to find out why.

The answer was simple. Although RRDNS had code for receiving requests for LOC records, creating response packets containing LOC data, there was a missing link. The CloudFlare DNS server stores the LOC record as a string (such as the 33 40 31 N 106 28 29 W 10m above) and no one had written the code to parse that and turn it into the internal format. Oops.

The textual LOC format and the binary, on-the-wire, format are described in RFC 1876 and it's one of many RFCs that updated the original 1987 standard. RFC 1876 is from 1996.

The textual format is fairly simple. Here's what the RFC says:

The LOC record is expressed in a primary file in the following format:

owner TTL class LOC ( d1 [m1 [s1]] {"N"|"S"} d2 [m2 [s2]]
                           {"E"|"W"} alt["m"] [siz["m"] [hp["m"]
                           [vp["m"]]]] )

where:

   d1:     [0 .. 90]            (degrees latitude)
   d2:     [0 .. 180]           (degrees longitude)
   m1, m2: [0 .. 59]            (minutes latitude/longitude)
   s1, s2: [0 .. 59.999]        (seconds latitude/longitude)
   alt:    [-100000.00 .. 42849672.95] BY .01 (altitude in meters)
   siz, hp, vp: [0 .. 90000000.00] (size/precision in meters)

If omitted, minutes and seconds default to zero, size defaults to 1m,
horizontal precision defaults to 10000m, and vertical precision
defaults to 10m.  These defaults are chosen to represent typical
ZIP/postal code area sizes, since it is often easy to find
approximate geographical location by ZIP/postal code.

So, there are required latitude, longitude and altitude and three optional values for the size of the location and precision information. Pretty simple.

Then there's the on-the-wire format. Unlike a TXT record the LOC record data is parsed and turned into a fixed size binary format. Back to RFC 1876:

  MSB                                           LSB
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  0|        VERSION        |         SIZE          |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  2|       HORIZ PRE       |       VERT PRE        |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  4|                   LATITUDE                    |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  6|                   LATITUDE                    |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  8|                   LONGITUDE                   |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 10|                   LONGITUDE                   |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 12|                   ALTITUDE                    |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 14|                   ALTITUDE                    |
   +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

So, 32 bits of latitude, longitude and altitude and then three 8 bit values for the size and precision. The latitude and longitude values have a pretty simple encoding that treats the 32 bits as an unsigned integer:

The latitude of the center of the sphere described by the SIZE field, expressed as a 32-bit integer, most significant octet first (network standard byte order), in thousandths of a second of arc.  2^31 represents the equator; numbers above that are north latitude.

And the altitude can be below sea-level but still unsigned:

The altitude of the center of the sphere described by the SIZE field, expressed as a 32-bit integer, most significant octet first (network standard byte order), in centimeters, from a base of 100,000m below the [WGS 84] reference spheroid used by GPS.

But the 8 bit values use a very special encoding that allows a wide range of approximate values to be packed into 8 bits and also be human-readable when dumped out in hex!

The diameter of a sphere enclosing the described entity, in centimeters, expressed as a pair of four-bit unsigned integers, each ranging from zero to nine, with the most significant four bits representing the base and the second number representing the power of ten by which to multiply the base.  This allows sizes from 0e0 (<1cm) to 9e9 (90,000km) to be expressed.  This representation was chosen such that the hexadecimal representation can be read by eye; 0x15 = 1e5.

For example, the value 0x12 means 1 * 10^2 or 100cm. 0x99 means 9 * 10^9 or 90,000,000m. The smallest value that can be represented is 1cm (it's 0x10). So, in just 8 bits there's a range values from 1cm to larger than the diameter of Jupiter.

To fix this I wrote a parser for the LOC text record type (and associated tests). It can be found here.

We've now rolled out the fix and all the existing LOC records are being served by RRDNS. For example, my geekatlas.com LOC record can be queried like this:

$ dig geekatlas.com LOC
; <<>> DiG 9.8.3-P1 <<>> geekatlas.com LOC
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2997
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:    
;geekatlas.com.         IN  LOC

;; ANSWER SECTION:
geekatlas.com.      299 IN  LOC 33 40 31.000 N 106 28 29.000 W 10.00m 1m 10000m 10m

;; Query time: 104 msec
;; SERVER: 192.168.14.1#53(192.168.14.1)
;; WHEN: Tue Apr  1 14:13:48 2014
;; MSG SIZE  rcvd: 59

What we've been doing with Go

John Graham-Cumming — Mon, 11 Nov 2013 01:00:00 GMT

Almost two years ago CloudFlare started working with Go. What started as an experiment on one network and concurrency heavy project has turned into full, production use of Go for multiple services. Today Go is at the heart of CloudFlare's services including handling compression for high-latency HTTP connections, our entire DNS infrastructure, SSL, load testing and more.

I first wrote about CloudFlare's use of Go back in July 2012. At that time there was only one CloudFlare project, Railgun, written in Go, but others were starting to germinate around the company. Today we have many major projects in Go. So, we celebrate Go's 4th birthday with a short list of interesting things we've written in Go.

RRDNS

RRDNS is a DNS proxy that was written to help scale CloudFlare's DNS infrastructure. Our DNS was already very fast and we wanted to make it faster, more reliable and resilient to attack.

RRDNS provides response rate limiting to stop DNS attacks, caching to lower the load on the database, load balancing to detect downed upstreams, seamless binary upgrade (with no service interruption), CNAME flattening and more. It is built on a modular framework that allows the implementation of each behaviour to exist in separately maintained modules.

This modularity was trivial to enforce with interfaces, allowing each module to remain strictly self-contained but retain the flexibility of implementation (some use cgo, others have background workers). A goroutine is dedicated to each request to force isolation where panics are recovered and logged instead of crashing the server.

The guarantees needed to avoid leaving the server in a bad state when handling panics would be impossible without the defer mechanism Go provides. Other language features simplified the complexity of writing new modules; garbage collection allows us to avoid complex reference counting schemes and concentrate on the business logic. Managing the concurrent requests' access to shared data was also easy, either by wrapping this in a goroutine or using read-write locks.

Railgun

Railgun is CloudFlare's compression technology that's used to speed up connections between CloudFlare's data centers and origin web servers (especially when there is high latency between them). It's 100% written in Go and performs on the fly compression of web pages (and other textual assets) against recent versions of those pages to send the absolute minimum data possible.

Railgun also helps cache previously uncacheable assets (such as dynamically generated web pages) by spotting the often small changes between web pages over time, or the small changes between different users. Railgun is now widely used by CloudFlare customers to create responsive web sites wherever the end user is in the world.

It is now widely used by CloudFlare customers and hosting partners and achieves impressive real world speedups, including faster TTFB and better page load time, and bandwidth savings which translate into additional savings for people using services like AWS.

Red October

Red October is a cryptographically secure implementation of the two-man rule control mechanism. It is a software-based encryption and decryption server. The server allows authorized individuals to encrypt a payload in such a way that no one individual can decrypt it. To decrypt the payload, at least two authorized individuals must be logged into the server. The decryption of the payload is cryptographically tied to the login credentials of the authorized individuals and is only kept in memory for the duration of the decryption.

We'll be using Red October to secure very sensitive material (such as private encryption keys) so that no single CloudFlare employee (or single attacker) can get access to them. In coming months we'll write more about the cryptographic underpinnings of keeping CloudFlare secure.

SSL Bundler

The SSL Bundler allows us to take a customer's own SSL certificate and compute the fastest, shortest chain of intermediate certificates that can be used to verify the connection. When someone uploads a custom SSL certificate, we use our directory of intermediate certificates to build all the possible chains from the uploaded cert to a trusted browser root. We then rank these chains based on a number of factors including:

The length of the certificate chain
The ubiquity of the root certificate in browsers and other clients
The security of each step in the chain (e.g., does their Extended Key Usage include Server Authentication)
The length of the validity period of all the steps in the chain

The result is a server bundle that is small, fast and strong while having ubiquitous browser and client support. We then present that chain of certificates when an SSL connection is made so that the browser can securely verify the SSL connection as quickly as possible.

And that's not all

We're also experimenting with Go-based software like CoreOS (we're working with their go-raft library) and Docker. We have internal tools for load testing written in Go for Kyoto Tycoon, HTTP servers, SPDY and memcached. And we've open-sourced our Go stream processing library, a Go wrapper for high-performance syslogging, our changes to Go bindings for Kyoto Cabinet, and a tool for 'curling' SPDY sites. And, of course, we've been contributing directly to Go itself.

All in all Go is one of the main languages in use at CloudFlare and from here it looks like it has a bright future.

Happy 4th Birthday Go!

(And, yes, we're hiring Go programmers in London and San Francisco)

The story of a little DNS easter egg

Matthew Prince — Tue, 27 Aug 2013 00:10:00 GMT

About a year ago, we realized that CloudFlare's current DNS infrastructure had some challenges. We were using PowerDNS, an open source DNS server that is popular with hosting providers. We'd chosen PowerDNS four years ago for several reasons: 1) it was reasonably fast as an authoritative DNS server; 2) it seamlessly allowed us to add new records without rebooting; and 3) it allowed us to write custom backends that tied in to the rest of our system and allowed us to quickly update DNS records.

While PowerDNS got us a long way, it started to run into issues as we scaled and dealt with an increasing number of large denial of service attacks. There were three main challenges. First, PowerDNS when under large in-bound DDoS loads would consume excessive resources and occasionally fall over. Second, and more insidious, because the version of PowerDNS we were using did not have abuse detection and rate limiting, we were increasingly seeing attempts at using CloudFlare authoritative DNS network to launch reflection attacks against others. And, third, it was very difficult to extend PowerDNS and add any sort of application logic which kept us from doing interesting things... like adding Easter Eggs.

Enter RRDNS

To their credit, the PowerDNS community responded to the first two problems with some efforts at rate limiting and other abuse detection. However, because of CloudFlare's unique needs, we realized the only real solution to what we needed was a custom build authoritative name server. About nine months ago, we began the creation of what we affectionately refer to as RRDNS.

Today, every DNS query to CloudFlare runs through RRDNS, and we have eliminated PowerDNS from our stack. We're planning a more significant blog post on the technical details of RRDNS and how we build its attack detection and prevention. This, however, is a quick story of how the ease of adding application logic to RRDNS allowed us to create a couple fun Easter Eggs.

Easter Eggs

RRDNS is fully extensible in a modular framework. This allows us to easily add application logic to our name servers, allowing them to respond more dynamically based on the requests they receive. Over time, using this extensibility we plan to add features like DNSSEC. In the meantime, Ian, one of our engineers, wrote a quick extension to demonstrate the functionality.

DNS has many common records such as A, AAAA, CNAME, TXT, MX, etc. One of the largely deprecated records is the CH (Chaos) protocol. Ian decided to allow you to query for CloudFlare's current job listings via a DNS request. He wrote 15 lines of code as an extension to RRDNS. He pulled the listings dynamically from our RSS feed of job listings and returned them if you ran a particular query for the CH record for jobs.cloudflare directly against one of our DNS servers. Here's the output:

$ dig ch jobs.cloudflare @emma.ns.cloudflare.com TXT 
;; Truncated, retrying in TCP mode.  
; <<>> DiG 9.8.3-P1 <<>> ch jobs.cloudflare @emma.ns.cloudflare.com  
;; global options: +cmd  
;; Got answer:  
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28339  
;; flags: qr aa rd; QUERY: 1, ANSWER: 21, AUTHORITY: 0, ADDITIONAL: 0  
;; WARNING: recursion requested but not available  
;; QUESTION SECTION:  
;jobs.cloudflare. > CH > A  
;; ANSWER SECTION:  
jobs.cloudflare. > 86400 > CH > TXT > "Content Marketer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Customer Support Leader, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "General Marketer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Marketing Analytics Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Talent, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Build Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "DDoS Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Database Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Front End Developer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Integration Developer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "JavaScript Performance Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "PHP Developer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Product Manager, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "SDK Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Systems Engineer, London, United Kingdom"  
jobs.cloudflare. > 86400 > CH > TXT > "Systems Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Network Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Technical Operations Engineer, San Francisco, CA"  
jobs.cloudflare. > 86400 > CH > TXT > "Technical Operations Engineer, London, United Kingdom"  
jobs.cloudflare. > 86400 > CH > TXT > "Technical Support Engineer, London, United Kingdom"  
jobs.cloudflare. > 86400 > CH > TXT > "Technical Support Engineer, San Francisco, CA"  
;; Query time: 16 msec  
;; SERVER: 2400:cb00:2049:1::adf5:3a70#53(2400:cb00:2049:1::adf5:3a70)  
;; WHEN: Mon Aug 26 19:42:26 2013  
;; MSG SIZE  rcvd: 1122

The power of this isn't a new way for you to tell what job listings we have. Instead, it is that we can easily embed dynamic application logic into our DNS system. For an even more useless, albeit fun Easter Egg, try querying for the CH record for whois.cloudflare against one of our name servers. You may need to expand the size of your terminal window for full effect.

All Fun & Games Until Someone Gets Hurt

One of my first questions when I saw these Easter Eggs was: can someone now use these to in order to use CloudFlare's network to amplify a DNS reflection attack.

Thankfully, Ian had already thought of this and built in a safeguard. DNS reflection attacks rely on the attacker spoofing the source IP of the victim when querying a DNS server. Since most DNS requests are sent over UDP, which has no handshake, it is possible to cause DNS servers to respond to the spoofed source.

If you look at the response to the query above, the first line of the response reads:

;; Truncated, retrying in TCP mode.

Because RRDNS is highly flexible, Ian was able to force it when getting one of the Easter Egg-generating queries to respond back with a 0-byte UDP response and the DNS message truncated flag. This causes the client making the DNS request to retry via TCP. Since TCP has a handshake as part of its protocol, it prevents the source IP from being forged. Since the UDP response is smaller than the original query, and since the TCP response will only be sent to a source IP that has been verified with a full handshake, the risk of the Easter Eggs being used to amplify a DNS reflection attack is eliminated.

It doesn't take very much thought to see how we could use this extensibility to better defend against some of the DNS attacks we see, both those directed at us and also those using us to target others. Over the next few weeks we'll dive into more of the technical details behind RRDNS. Until then, enjoy the Easter Eggs.

CloudFlare: Fastest Free DNS, Among Fastest DNS Period

Matthew Prince — Mon, 07 Jan 2013 05:38:00 GMT

CloudFlare runs one of the largest networks of DNS servers in the world. Over the last few months, we've invested in making our DNS as fast and responsive as possible. We were happy to see these efforts pay off in third-party DNS test results.

The good folks at SolveDNS conduct a monthly survey of the fastest DNS providers in the world. CloudFlare has regularly been in the top-5 fastest DNS providers. This month we're up to number two, with SolveDNS's tests showing an average 4.51ms response time. That's just a hair behind number one (at 4.38ms) and almost twice as fast as number three (at 8.85ms). And, unlike most the other DNS providers in the top-10, CloudFlare's fast Anycast DNS service is provided even for our free plans.

Lest you think we're resting on our laurels, we've got a major DNS release (which we've dubbed RRDNS) scheduled for the next few months that we think will allow us to squeeze a bit more speed out of our DNS lookups. We're shooting for number one!

The Cloudflare Blog

How and why the leap second affected Cloudflare DNS

A little bit about Cloudflare DNS

A falsehood programmers believe about time

The one character fix

Timeline

Conclusion

Economical With The Truth: Making DNSSEC Answers Cheap

What You Need To Know: DNSSEC Edition

What You Need To Know: Negative Answer Edition

What Goes Into An NXDOMAIN With DNSSEC

Zone Walking

NODATA Responses

Problems With Negative Answers

The Trouble with Previous and Next Names

When CloudFlare Lies

DNS Shotgun

How Are Black Lies and DNS Shotgun Standards Compliant

Conclusion

Go coverage with external tests

Creative foot-shooting with Go RWMutex

DNS parser, meet Go fuzzer

Hot Fuzz

Not just crashes and hangs

Quick and dirty annotations for Go stack traces

Underscore to the rescue

Setting Go variables from the outside

Go has a debugger—and it's awesome!

About source rewriting

The weird and wonderful world of DNS LOC records

What we've been doing with Go

RRDNS

Railgun

Red October

SSL Bundler

And that's not all

The story of a little DNS easter egg

Enter RRDNS

Easter Eggs

All Fun & Games Until Someone Gets Hurt

CloudFlare: Fastest Free DNS, Among Fastest DNS Period

What Goes Into An `NXDOMAIN` With DNSSEC

`NODATA` Responses