
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 12 May 2026 06:11:49 GMT</lastBuildDate>
        <item>
            <title><![CDATA[When DNSSEC goes wrong: how we responded to the .de TLD outage]]></title>
            <link>https://blog.cloudflare.com/de-tld-outage-dnssec/</link>
            <pubDate>Wed, 06 May 2026 17:00:00 GMT</pubDate>
            <description><![CDATA[ On May 5, 2026, DENIC published broken DNSSEC signatures for the .de TLD, making millions of domains unreachable. Here's what 1.1.1.1 saw, how serve stale cushioned the impact, and how we restored resolution. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>On May 5, 2026, at roughly 19:30 UTC, DENIC, the registry operator for the <code>.de</code> country-code top-level domain (TLD), started publishing incorrect DNSSEC signatures for the <code>.de</code> zone. Any validating DNS resolver receiving these signatures was required by the DNSSEC specification to reject them and return SERVFAIL to clients, including <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/"><u>1.1.1.1</u></a>, the public DNS resolver operated by Cloudflare. </p><p>The country-code top-level domain for Germany, <code>.de</code>, is one of the largest on the Internet. On <a href="https://radar.cloudflare.com/tlds?dateRange=7d"><u>Cloudflare Radar</u></a>, it consistently ranks among the most broadly queried TLDs globally. An outage at this level of the DNS hierarchy has the potential to make millions of domains unreachable.</p><p>In this post, we’ll walk through what we saw, the impact of these events, and how we applied temporary mitigations while DENIC resolved the issue.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hF64h72z4oKRg28w0mDJm/7f535cf687750f9ea730c27fa5e729e3/BLOG-3309_2.png" />
          </figure>
    <div>
      <h2>How DNSSEC works</h2>
      <a href="#how-dnssec-works">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>DNSSEC</u></a> (Domain Name System Security Extensions) adds cryptographic authentication to DNS. When a zone is signed with DNSSEC, each set of records is accompanied by a digital signature known as an RRSIG record that lets a resolver verify the records haven’t been tampered with. Unlike encrypted DNS protocols, such as DNS over TLS (DoT) and DNS over HTTPs (DoH), DNSSEC is about integrity, not privacy. The records are visible, but their authenticity can be proven.</p><p>What makes DNSSEC unique is that the signatures travel together with the records they protect. This means integrity can be verified regardless of how many caches or hops a response has passed through. A cached record is just as verifiable as a fresh one.</p><p>DNSSEC is built on a chain of trust. Starting at the root zone, whose trust anchor is hard-coded into the resolvers, each zone delegates trust to child zones via Delegation Signer (DS) records. A DS record in the parent zone contains a cryptographic hash of a public key in the child zone. When a resolver validates <code>example.de</code> it verifies the chain: root trusts <code>.de</code>, <code>.de</code> trusts <code>example.de</code>. A break anywhere in that chain causes validation to fail for everything below it, which is why a misconfiguration at a TLD like <code>.de</code> affects every domain under it.</p><p>Zones typically use two types of keys: a Zone Signing Key (ZSK), used to sign the zone’s records, and a Key Signing Key (KSK), used to sign the ZSK itself. The KSK’s public key is what the parent zone’s DS record points to, anchoring the chain of trust. Rotating a ZSK is relatively straightforward: generate a new key, re-sign the zone’s records, and wait for caches to expire. Rotating a KSK is more involved, because the parent’s DS record must also be updated, often requiring coordination with a registrar or registry.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6EDg7LKirRAVrzXCYprNIv/f14a9e3a24595d898cc9a650e9101fdd/image13.png" />
          </figure><p>During a key rotation, there is a critical window where the old key is being phased out and the new one phased in. If the signatures published in the zone are made with a key that resolvers cannot verify against the zone’s published DNSKEY records, whether because the signing step failed, the timing was wrong, or the new key wasn’t fully distributed yet, resolvers have no choice but to reject the responses and return SERVFAIL.</p>
    <div>
      <h2>What we saw</h2>
      <a href="#what-we-saw">
        
      </a>
    </div>
    <p>On May 5, 2026, at roughly 19:30 UTC, DENIC, the operator for the <code>.de</code> TLD, started publishing incorrect DNSSEC signatures for the <code>.de</code> zone. Any validating resolver receiving these records was required by the DNSSEC specification to reject them and return SERVFAIL. 1.1.1.1 was no exception.</p><p>The graph below shows the response codes 1.1.1.1 returned for <code>.de</code> queries during the incident.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/78zFXArtjyc8vcUup4zr9L/4207aa01b3caad16392266b4c32037e7/BLOG-3309_4.png" />
          </figure><p>After the immediate spike in SERVFAILs at 19:30 UTC, it climbed steadily over the following three hours as cached records slowly started expiring. As each domain's cached records expired and resolvers went back to DENIC for fresh copies, they got back broken signatures and started failing.</p><p>Also visible is a large increase in query volume. This is typical during DNS incidents, as clients retry failed queries, often three or more times, inflating the raw numbers. The SERVFAIL rate looks more alarming than the actual user impact, as many of those queries represent the same user retrying the same domain.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KpMo46Phe5HtxmP34FYMK/46a1281a625760f58d592cbde91943f8/BLOG-3309_5.png" />
          </figure><p>What might be surprising is that the NOERROR rate stayed relatively stable throughout the incident. That's “serve stale” at work, which we'll cover in the next section.</p>
    <div>
      <h2>Serve stale</h2>
      <a href="#serve-stale">
        
      </a>
    </div>
    <p>Recursive resolvers cache the records they receive from authoritative nameservers for the duration of each record's TTL (Time-to-Live). While a record is cached, the resolver serves it directly without going back to the authoritative nameserver. When the TTL expires, the resolver fetches a fresh copy and re-caches it.</p><p>During the outage, freshly requested records ended up resolving to SERVFAIL. The DNSSEC signatures were broken and the resolver correctly rejected them. But many <code>.de</code> records were still sitting in cache from before the incident began. Rather than immediately discarding those and returning SERVFAIL to users, 1.1.1.1 continued serving them past their TTL. This is called “serving stale.”</p><p>1.1.1.1 implements <a href="https://datatracker.ietf.org/doc/html/rfc8767"><u>RFC 8767</u></a>, which formalizes this behavior. When upstream resolution fails, a resolver may continue serving expired cached records rather than returning an error. This significantly cushions the impact of an upstream outage, buying time for operators to respond.</p><p>The result is visible in the graph below, which shows response codes for <code>.de</code> queries during the incident excluding the stale-served responses. Without stale-served responses, the NOERROR rate drops steadily from 19:30 onward. These represent queries that users received good answers for only because their record was still in cache.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3YUtnXiFixcdswxtGik46r/78082f4b4439130cf23ff1473448781a/BLOG-3309_6.png" />
          </figure>
    <div>
      <h2>Our mitigation</h2>
      <a href="#our-mitigation">
        
      </a>
    </div>
    <p>While the issue was largely out of our own control, and serve stale was doing its job, there was still a legitimate impact for a lot of users. Luckily, there were some actions we were able to take to improve the situation.</p>
    <div>
      <h3>Negative Trust Anchors</h3>
      <a href="#negative-trust-anchors">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc7646"><u>RFC 7646</u></a> defines the concept of a Negative Trust Anchor (NTA). In normal DNSSEC operation, a validating resolver maintains a set of trust anchors: public keys at the root of the chain of trust. Each DNS zone signed with DNSSEC has a trust anchor, and every child zone builds its own trust anchor upon it. When the cryptographic signatures linking the chain together are broken, responses will be rejected and result in SERVFAIL. An NTA is an explicit exception. It tells the resolver to treat a specific zone as if it were unsigned, bypassing validation for names under that zone.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZPkgIvIf1R9rlScLS7ofh/3483daabde429f99e5ca3bc2f6b5709f/BLOG-3309_7.png" />
          </figure><p>NTAs exist precisely for these types of incidents. When a TLD operator publishes broken signatures, every DNSSEC-validating resolver is forced to return SERVFAIL for every domain under that TLD. Not because of anything wrong with those domains themselves, but because their parent zone is misconfigured. Continuing to return SERVFAIL in that situation provides no security value: the failure is already known, public, and being fixed. RFC 7646 explicitly names TLD misconfiguration as the primary use case for NTAs.</p>
    <div>
      <h3>What we actually deployed</h3>
      <a href="#what-we-actually-deployed">
        
      </a>
    </div>
    <p>For 1.1.1.1 we have our own resolver referred to as <a href="https://blog.cloudflare.com/big-pineapple-intro/"><u>Big Pineapple</u></a>, which also powers 1.1.1.1 for Families, Gateway DNS, DNS Firewall, and more. At this time, we have not implemented a native NTA mechanism. Instead, we used an existing override rule mechanism to mark <code>.de</code> as an insecure zone, which causes all <code>.de</code> queries to be resolved as if they don’t have DNSSEC enabled. This is functionality equivalent to an NTA, though it is not formally defined in any RFC.</p><p>The decision to bypass DNSSEC is a deliberate tradeoff. Without DNSSEC validation, <code>.de</code> domains become vulnerable to <a href="https://www.cloudflare.com/en-gb/learning/dns/dnssec/how-dnssec-works/"><u>genuine attacks</u></a> for the duration of the incident. During incidents like this, we weighed this as acceptable because the signing failure was widespread, publicly confirmed, and affected every validating resolver on the Internet equally. As it was put in our internal incident room: “<i>There is no user of 1.1.1.1 resolving a .de name right now who would prefer a SERVFAIL over an unvalidated response</i>.”</p><p>We rolled out our mitigation at 22:17 UTC, which marked the end of impact for 1.1.1.1. We communicated this with fellow DNS operators in the <a href="https://www.dns-oarc.net/oarc/services/chat"><u>DNS-OARC Mattermost</u></a>.</p>
    <div>
      <h3>Origin resolution mitigations</h3>
      <a href="#origin-resolution-mitigations">
        
      </a>
    </div>
    <p>While all Internet users can access our 1.1.1.1 resolver, we have a particular responsibility to customers using our CDN platform services. Those with <code>.de</code> origin names were also affected by this outage.</p><p>Cloudflare operates a separate internal resolver for origin resolution, distinct from our publicly available 1.1.1.1 service. To mitigate impact we applied a similar NTA for <code>.de</code> on the internal resolver service, restoring origin connectivity for affected customers.</p>
    <div>
      <h3>Extended DNS Errors</h3>
      <a href="#extended-dns-errors">
        
      </a>
    </div>
    <p>Before our mitigation, queries that couldn't be served from cache received a SERVFAIL response from 1.1.1.1. Each SERVFAIL included an Extended DNS Error (EDE) code, defined in <a href="https://datatracker.ietf.org/doc/html/rfc8914"><u>RFC 8914</u></a>, which gives clients more detail about what went wrong. </p><p>Some resolvers returned EDE 6 (DNSSEC Bogus) with a descriptive message pointing directly at the broken signature. This is the correct behavior:</p>
            <pre><code>EDE: 6 (DNSSEC Bogus): RRSIG with malformed signature found for example.de/nsec3 (keytag=33834)
</code></pre>
            <p></p><p>1.1.1.1, on the other hand, returned EDE 22 (No Reachable Authority), which on the surface suggests a connectivity problem with the upstream nameservers rather than a DNSSEC validation failure.  </p><p>The cause is a bug in how we propagate DNSSEC EDE codes up from our trust chain verifier. When the verifier detects a bogus signature it creates the DNSSEC Bogus EDE code, but this is never inserted into the response. Instead, the outer layer of the resolver sees a problem with recursive resolution with no error code and falls back to reporting “No Reachable Authority.” This obscures the underlying DNSSEC cause.</p><p>We're aware that this isn't helpful for 1.1.1.1 users and will be fixing our responses to surface the DNSSEC errors.</p>
    <div>
      <h2>Is this a failure of DNSSEC as a technology?</h2>
      <a href="#is-this-a-failure-of-dnssec-as-a-technology">
        
      </a>
    </div>
    <p>DNS is a critical part of the request chain for most Internet communication. It would be easy to come to the conclusion that this outage and the mitigations applied means DNSSEC has failed as a technology. However, any technology that is misconfigured will risk breaking for users that rely on it. Leaving critical fiber cables exposed on the seabed for sharks to chew on does not invalidate the important role underwater cables pose in today's Internet communications. It only highlights that we’ve sometimes failed to accurately protect it. The same applies here. DNSSEC serves a critical role in ensuring that we can rely on the DNS answers without tampering by malicious actors.</p>
    <div>
      <h2>#HugOps</h2>
      <a href="#hugops">
        
      </a>
    </div>
    <p>No one likes to have serious incidents. These things, unfortunately, happen to everyone who operates critical infrastructure at scale. When they do, the DNS community tends to show up for each other.</p><p>Incidents like this also highlight why relationships between operators matter. DNS is a decentralized system, no single organization controls all of it, and keeping it running reliably depends on mutual trust and open lines of communication between registries, resolver operators, and the broader community. Forums like <a href="https://dns-oarc.net/">DNS-OARC</a> provide exactly this: shared mailing lists and chat rooms where operators can coordinate quickly across organizational boundaries when something goes wrong.</p><p>DENIC has published <a href="https://blog.denic.de/en/technical-issue-with-de-domains-resolved/"><u>a short blog post about the incident</u></a> where they state: “The outage is linked to a routine, scheduled key rollover. During this process, non-validatable signatures were generated and distributed. As a precautionary measure, future rollovers have been suspended until the exact technical causes have been identified.”</p><p> We're sure we’ll hear more when their own analysis is ready. </p>
    <div>
      <h2>Takeaways from this incident</h2>
      <a href="#takeaways-from-this-incident">
        
      </a>
    </div>
    <p>This incident highlights a structural reality of the DNS hierarchy: when a registry at the TLD level fails, every domain under that TLD is affected simultaneously, regardless of where it's hosted or which resolver is used. This isn't unique to DNSSEC; the same is true if a TLD’s nameservers become unreachable. The hierarchy that makes the global DNS work is also what makes failures at the top propagate downward.</p><p>There is no simple fix for this. What the industry can do is respond quickly and consistently when it happens. In this incident, resolver operators across the Internet independently applied Negative Trust Anchors within an hour, restoring resolution while DENIC worked to fix the zone. Operational practices, industry communication channels like DNS-OARC, and features like serve stale all reduce the impact, even if they can’t eliminate the underlying dependency.</p><p>We also came away with some points to improve for ourselves. We will be working on our EDE errors to better surface DNSSEC errors.</p><p>We look forward to DENIC’s post-incident report and appreciate the transparency they showed throughout.</p><p>If you want to learn more about how DNSSEC works, visit our page <a href="https://www.cloudflare.com/en-gb/learning/dns/dnssec/how-dnssec-works/"><u>How does DNSSEC work?</u></a> And you can always follow real-time DNS trends and TLD data on <a href="https://radar.cloudflare.com/tlds/de?dateStart=2026-05-05&amp;dateEnd=2026-05-06"><u>Cloudflare Radar</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Outage]]></category>
            <guid isPermaLink="false">2MckFmlh9Epgpruqa9MXRh</guid>
            <dc:creator>Sebastiaan Neuteboom</dc:creator>
            <dc:creator>Christian Elmerot</dc:creator>
            <dc:creator>Max Worsley</dc:creator>
        </item>
        <item>
            <title><![CDATA[What came first: the CNAME or the A record?]]></title>
            <link>https://blog.cloudflare.com/cname-a-record-order-dns-standards/</link>
            <pubDate>Wed, 14 Jan 2026 00:00:00 GMT</pubDate>
            <description><![CDATA[ A recent change to 1.1.1.1 accidentally altered the order of CNAME records in DNS responses, breaking resolution for some clients. This post explores the technical root cause, examines the source code of affected resolvers, and dives into the inherent ambiguities of the DNS RFCs.   ]]></description>
            <content:encoded><![CDATA[ <p>On January 8, 2026, a routine update to 1.1.1.1 aimed at reducing memory usage accidentally triggered a wave of DNS resolution failures for users across the Internet. The root cause wasn't an attack or an outage, but a subtle shift in the order of records within our DNS responses.  </p><p>While most modern software treats the order of records in DNS responses as irrelevant, we discovered that some implementations expect CNAME records to appear before everything else. When that order changed, resolution started failing. This post explores the code change that caused the shift, why it broke specific DNS clients, and the 40-year-old protocol ambiguity that makes the "correct" order of a DNS response difficult to define.</p>
    <div>
      <h2>Timeline</h2>
      <a href="#timeline">
        
      </a>
    </div>
    <p><i>All timestamps referenced are in Coordinated Universal Time (UTC).</i></p><table><tr><th><p><b>Time</b></p></th><th><p><b>Description</b></p></th></tr><tr><td><p>2025-12-02</p></td><td><p>The record reordering is introduced to the 1.1.1.1 codebase</p></td></tr><tr><td><p>2025-12-10</p></td><td><p>The change is released to our testing environment</p></td></tr><tr><td><p>2026-01-07 23:48</p></td><td><p>A global release containing the change starts</p></td></tr><tr><td><p>2026-01-08 17:40</p></td><td><p>The release reaches 90% of servers</p></td></tr><tr><td><p>2026-01-08 18:19</p></td><td><p>Incident is declared</p></td></tr><tr><td><p>2026-01-08 18:27</p></td><td><p>The release is reverted</p></td></tr><tr><td><p>2026-01-08 19:55</p></td><td><p>Revert is completed. Impact ends</p></td></tr></table>
    <div>
      <h2>What happened?</h2>
      <a href="#what-happened">
        
      </a>
    </div>
    <p>While making some improvements to lower the memory usage of our cache implementation, we introduced a subtle change to CNAME record ordering. The change was introduced on December 2, 2025, released to our testing environment on December 10, and began deployment on January 7, 2026.</p>
    <div>
      <h3>How DNS CNAME chains work</h3>
      <a href="#how-dns-cname-chains-work">
        
      </a>
    </div>
    <p>When you query for a domain like <code>www.example.com</code>, you might get a <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-cname-record/"><u>CNAME (Canonical Name)</u></a> record that indicates one name is an alias for another name. It’s the job of public resolvers, such as <a href="https://www.cloudflare.com/learning/dns/what-is-1.1.1.1/"><u>1.1.1.1</u></a>, to follow this chain of aliases until it reaches a final response:</p><p><code>www.example.com → cdn.example.com → server.cdn-provider.com → 198.51.100.1</code></p><p>As 1.1.1.1 traverses this chain, it caches every intermediate record. Each record in the chain has its own <a href="https://www.cloudflare.com/learning/cdn/glossary/time-to-live-ttl/"><u>TTL (Time-To-Live)</u></a>, indicating how long we can cache it. Not all the TTLs in a CNAME chain need to be the same:</p><p><code>www.example.com → cdn.example.com (TTL: 3600 seconds) # Still cached
cdn.example.com → 198.51.100.1    (TTL: 300 seconds)  # Expired</code></p><p>When one or more records in a CNAME chain expire, it’s considered partially expired. Fortunately, since parts of the chain are still in our cache, we don’t have to resolve the entire CNAME chain again — only the part that has expired. In our example above, we would take the still valid <code>www.example.com → cdn.example.com</code> chain, and only resolve the expired <code>cdn.example.com</code> <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-a-record/"><u>A record</u></a>. Once that’s done, we combine the existing CNAME chain and the newly resolved records into a single response.</p>
    <div>
      <h3>The logic change</h3>
      <a href="#the-logic-change">
        
      </a>
    </div>
    <p>The code that merges these two chains is where the change occurred. Previously, the code would create a new list, insert the existing CNAME chain, and then append the new records:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        let mut answer_rrs = Vec::with_capacity(entry.answer.len() + self.records.len());
        answer_rrs.extend_from_slice(&amp;self.records); // CNAMEs first
        answer_rrs.extend_from_slice(&amp;entry.answer); // Then A/AAAA records
        entry.answer = answer_rrs;
    }
}
</code></pre>
            <p>However, to save some memory allocations and copies, the code was changed to instead append the CNAMEs to the existing answer list:</p>
            <pre><code>impl PartialChain {
    /// Merges records to the cache entry to make the cached records complete.
    pub fn fill_cache(&amp;self, entry: &amp;mut CacheEntry) {
        entry.answer.extend(self.records); // CNAMEs last
    }
}
</code></pre>
            <p>As a result, the responses that 1.1.1.1 returned now sometimes had the CNAME records appearing at the bottom, after the final resolved answer.</p>
    <div>
      <h3>Why this caused impact</h3>
      <a href="#why-this-caused-impact">
        
      </a>
    </div>
    <p>When DNS clients receive a response with a CNAME chain in the answer section, they also need to follow this chain to find out that <code>www.example.com</code> points to <code>198.51.100.1</code>. Some DNS client implementations handle this by keeping track of the expected name for the records as they’re iterated sequentially. When a CNAME is encountered, the expected name is updated:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.        IN    A

;; ANSWER SECTION:
www.example.com.    3600   IN    CNAME  cdn.example.com.
cdn.example.com.    300    IN    A      198.51.100.1
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Encounter <code>cdn.example.com. A 198.51.100.1</code></p></li></ol><p>When the CNAME suddenly appears at the bottom, this no longer works:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.	       IN    A

;; ANSWER SECTION:
cdn.example.com.    300    IN    A      198.51.100.1
www.example.com.    3600   IN    CNAME  cdn.example.com.
</code></pre>
            <p></p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>No more records are present, so the response is considered empty</p></li></ol><p>One such implementation that broke is the <a href="https://man7.org/linux/man-pages/man3/getaddrinfo.3.html"><code><u>getaddrinfo</u></code></a> function in glibc, which is commonly used on Linux for DNS resolution. When looking at its <code>getanswer_r</code> implementation, we can indeed see it expects to find the CNAME records before any answers:</p>
            <pre><code>for (; ancount &gt; 0; --ancount)
  {
    // ... parsing DNS records ...
    
    if (rr.rtype == T_CNAME)
      {
        /* Record the CNAME target as the new expected name. */
        int n = __ns_name_unpack (c.begin, c.end, rr.rdata,
                                  name_buffer, sizeof (name_buffer));
        expected_name = name_buffer;  // Update what we're looking for
      }
    else if (rr.rtype == qtype
             &amp;&amp; __ns_samebinaryname (rr.rname, expected_name)  // Must match!
             &amp;&amp; rr.rdlength == rrtype_to_rdata_length (type:qtype))
      {
        /* Address record matches - store it */
        ptrlist_add (list:addresses, item:(char *) alloc_buffer_next (abuf, uint32_t));
        alloc_buffer_copy_bytes (buf:abuf, src:rr.rdata, size:rr.rdlength);
      }
  }
</code></pre>
            <p>Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs. <a href="https://www.cisco.com/c/en/us/support/docs/smb/switches/Catalyst-switches/kmgmt3846-cbs-reboot-with-fatal-error-from-dnsc-process.html"><u>Cisco has published a service document describing the issue</u></a>.</p>
    <div>
      <h3>Not all implementations break</h3>
      <a href="#not-all-implementations-break">
        
      </a>
    </div>
    <p>Most DNS clients don’t have this issue. For example, <a href="https://www.freedesktop.org/software/systemd/man/latest/systemd-resolved.service.html"><u>systemd-resolved</u></a> first parses the records into an ordered set:</p>
            <pre><code>typedef struct DnsAnswerItem {
        DnsResourceRecord *rr; // The actual record
        DnsAnswerFlags flags;  // Which section it came from
        // ... other metadata
} DnsAnswerItem;


typedef struct DnsAnswer {
        unsigned n_ref;
        OrderedSet *items;
} DnsAnswer;
</code></pre>
            <p>When following a CNAME chain it can then search the entire answer set, even if the CNAME records don’t appear at the top.</p>
    <div>
      <h2>What the RFC says</h2>
      <a href="#what-the-rfc-says">
        
      </a>
    </div>
    <p><a href="https://datatracker.ietf.org/doc/html/rfc1034"><u>RFC 1034</u></a>, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-4.3.1"><u>Section 4.3.1</u></a> contains the following text:</p><blockquote><p>If recursive service is requested and available, the recursive response to a query will be one of the following:</p><p>- The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.</p></blockquote><p>While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>MUST and SHOULD</u></a> that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. <a href="https://datatracker.ietf.org/doc/html/rfc2119"><u>RFC 2119</u></a>, which standardized these key words, was published in 1997, 10 years <i>after</i> RFC 1034.</p><p>In our case, we did originally implement the specification so that CNAMEs appear first. However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.</p>
    <div>
      <h3>The subtle distinction: RRsets vs RRs in message sections</h3>
      <a href="#the-subtle-distinction-rrsets-vs-rrs-in-message-sections">
        
      </a>
    </div>
    <p>To understand why this ambiguity exists, we need to understand a subtle but important distinction in DNS terminology.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-3.6"><u>section 3.6</u></a> defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class. For RRsets, the specification is clear about ordering:</p><blockquote><p>The order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS.</p></blockquote><p>However, RFC 1034 doesn’t clearly specify how message sections relate to RRsets. While modern DNS specifications have shown that message sections can indeed contain multiple RRsets (consider <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/">DNSSEC</a> responses with signatures), RFC 1034 doesn’t describe message sections in those terms. Instead, it treats message sections as containing individual Resource Records (RRs).</p><p>The problem is that the RFC primarily discusses ordering in the context of RRsets but doesn't specify the ordering of different RRsets relative to each other within a message section. This is where the ambiguity lives.</p><p>RFC 1034 <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-6.2.1"><u>section 6.2.1</u></a> includes an example that demonstrates this ambiguity further. It mentions that the order of Resource Records (RRs) is not significant either:</p><blockquote><p>The difference in ordering of the RRs in the answer section is not significant.</p></blockquote><p>However, this example only shows two A records for the same name within the same RRset. It doesn't address whether this applies to different record types like CNAMEs and A records.</p>
    <div>
      <h2>CNAME chain ordering</h2>
      <a href="#cname-chain-ordering">
        
      </a>
    </div>
    <p>It turns out that this issue extends beyond putting CNAME records before other record types. Even when CNAMEs appear before other records, sequential parsing can still break if the CNAME chain itself is out of order. Consider the following response:</p>
            <pre><code>;; QUESTION SECTION:
;; www.example.com.              IN    A

;; ANSWER SECTION:
cdn.example.com.           3600  IN    CNAME  server.cdn-provider.com.
www.example.com.           3600  IN    CNAME  cdn.example.com.
server.cdn-provider.com.   300   IN    A      198.51.100.1
</code></pre>
            <p>Each CNAME belongs to a different RRset, as they have different owners, so the statement about RRset order being insignificant doesn’t apply here.</p><p>However, RFC 1034 doesn't specify that CNAME chains must appear in any particular order. There's no requirement that <code>www.example.com. CNAME cdn.example.com.</code> must appear before <code>cdn.example.com. CNAME server.cdn-provider.com.</code>. With sequential parsing, the same issue occurs:</p><ol><li><p>Find records for <code>www.example.com</code></p></li><li><p>Ignore <code>cdn.example.com. CNAME server.cdn-provider.com</code>. as it doesn’t match the expected name</p></li><li><p>Encounter <code>www.example.com. CNAME cdn.example.com</code></p></li><li><p>Find records for <code>cdn.example.com</code></p></li><li><p>Ignore <code>server.cdn-provider.com. A 198.51.100.1</code> as it doesn’t match the expected name</p></li></ol>
    <div>
      <h2>What should resolvers do?</h2>
      <a href="#what-should-resolvers-do">
        
      </a>
    </div>
    <p>RFC 1034 section 5 describes resolver behavior. <a href="https://datatracker.ietf.org/doc/html/rfc1034#section-5.2.2"><u>Section 5.2.2</u></a> specifically addresses how resolvers should handle aliases (CNAMEs): </p><blockquote><p>In most cases a resolver simply restarts the query at the new name when it encounters a CNAME.</p></blockquote><p>This suggests that resolvers should restart the query upon finding a CNAME, regardless of where it appears in the response. However, it's important to distinguish between different types of resolvers:</p><ul><li><p>Recursive resolvers, like 1.1.1.1, are full DNS resolvers that perform recursive resolution by querying authoritative nameservers</p></li><li><p>Stub resolvers, like glibc’s getaddrinfo, are simplified local interfaces that forward queries to recursive resolvers and process the responses</p></li></ul><p>The RFC sections on resolver behavior were primarily written with full resolvers in mind, not the simplified stub resolvers that most applications actually use. Some stub resolvers evidently don’t implement certain parts of the spec, such as the CNAME-restart logic described in the RFC. </p>
    <div>
      <h2>The DNSSEC specifications provide contrast</h2>
      <a href="#the-dnssec-specifications-provide-contrast">
        
      </a>
    </div>
    <p>Later DNS specifications demonstrate a different approach to defining record ordering. <a href="https://datatracker.ietf.org/doc/html/rfc4035"><u>RFC 4035</u></a>, which defines protocol modifications for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>DNSSEC</u></a>, uses more explicit language:</p><blockquote><p>When placing a signed RRset in the Answer section, the name server MUST also place its RRSIG RRs in the Answer section. The RRSIG RRs have a higher priority for inclusion than any other RRsets that may have to be included.</p></blockquote><p>The specification uses "MUST" and explicitly defines "higher priority" for <a href="https://www.cloudflare.com/learning/dns/dnssec/how-dnssec-works/"><u>RRSIG</u></a> records. However, "higher priority for inclusion" refers to whether RRSIGs should be included in the response, not where they should appear. This provides unambiguous guidance to implementers about record inclusion in DNSSEC contexts, while not mandating any particular behavior around record ordering.</p><p>For unsigned zones, however, the ambiguity from RFC 1034 remains. The word "preface" has guided implementation behavior for nearly four decades, but it has never been formally specified as a requirement.</p>
    <div>
      <h2>Do CNAME records come first?</h2>
      <a href="#do-cname-records-come-first">
        
      </a>
    </div>
    <p>While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely-deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in-order before any other records.</p><p>Based on what we have learned during this incident, we have reverted the CNAME re-ordering and do not intend to change the order in the future.</p><p>To prevent any future incidents or confusion, we have written a proposal in the form of an <a href="https://www.ietf.org/participate/ids/"><u>Internet-Draft</u></a> to be discussed at the IETF. If consensus is reached on the clarified behavior, this would become an RFC that explicitly defines how to correctly handle CNAMEs in DNS responses, helping us and the wider DNS community navigate the protocol. The proposal can be found at <a href="https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section/">https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-answer-section</a>. If you have suggestions or feedback we would love to hear your opinions, most usefully via the <a href="https://datatracker.ietf.org/wg/dnsop/about/"><u>DNSOP working group</u></a> at the IETF.</p> ]]></content:encoded>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[Post Mortem]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Resolver]]></category>
            <category><![CDATA[Standards]]></category>
            <category><![CDATA[Bugs]]></category>
            <category><![CDATA[Consumer Services]]></category>
            <guid isPermaLink="false">3fP84BsxwSxKr7ffpmVO6s</guid>
            <dc:creator>Sebastiaan Neuteboom</dc:creator>
        </item>
    </channel>
</rss>