The Cloudflare Blog

Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions

David Belson — Tue, 28 Apr 2026 13:00:00 GMT

In the first quarter of 2026, government-directed shutdowns figured prominently, with prolonged Internet blackouts in both Uganda and Iran, a stark contrast to the lack of observed government-directed shutdowns in the same quarter a year prior. This quarter, we also observed a number of Internet disruptions caused by power outages, including three separate collapses of Cuba's national electrical grid. Military action continued to disrupt connectivity in Ukraine and also impacted hyperscaler cloud infrastructure in the Middle East. Severe weather knocked out Internet connectivity in Portugal, while cable damage disrupted connectivity in the Republic of Congo. A technical problem hit Verizon Wireless in the United States, and unknown issues briefly disrupted connectivity for customers of providers in Guinea and the United Kingdom.

This post is intended as a summary overview of observed and confirmed disruptions and is not an exhaustive or complete list of issues that have occurred during the quarter. A larger list of detected traffic anomalies is available in the Cloudflare Radar Outage Center. Note that both bytes-based and request-based traffic graphs are used within this post to illustrate the impact of the observed disruptions, with the choice of metric generally made based on which better illustrates the impact of the disruption.

Government-directed shutdowns

Uganda

In advance of the January 15 presidential election, Ugandan authorities ordered a nationwide Internet shutdown. The Uganda Communications Commission (UCC) instructed mobile network operators to suspend public Internet access, effective 18:00 local time (15:00 UTC) on January 13. The UCC reportedly defended the shutdown as necessary to "curb misinformation, disinformation, electoral fraud and related risks." Domestic traffic at the Uganda Internet Exchange Point (UIXP) dropped from approximately 72 Gbps to 1 Gbps as a result of the action taken.

Similarly, Cloudflare data shows a near-complete loss of traffic from Uganda coincident with the start of the shutdown, with traffic remaining effectively at zero through 23:00 local time (20:00 UTC) on January 17, when Internet connectivity was partially restored after incumbent President Yoweri Museveni was declared winner of his seventh term.

Full Internet restoration was announced by the UCC on January 26, with mobile network operators MTN Uganda and Airtel Uganda both confirming on social media that restrictions had been lifted. The shutdown prompted lawsuits against UCC and the telecoms companies and drew criticism from digital rights organizations including CIPESA.

Uganda also blocked Internet access during its 2021 election. Authorities had repeatedly promised this time would be different, stating as recently as January 5 that "claims suggesting otherwise are false, misleading."

Iran

Iranian citizens spent a large part of Q1 2026 offline, or with severely limited connectivity, due to two nationwide Internet shutdowns. The first began around 20:00 local time (16:30 UTC) on January 8, and we explored the impact seen over the first few days in our What we know about Iran’s Internet shutdown blog post. Traffic from Iran remained near zero until January 21, when a small amount of traffic returned, only to disappear a little over 24 hours later. A similar brief restoration also occurred on January 25, before traffic recovered more aggressively starting on January 27.

A near-complete loss of announced IPv6 address space started several hours before the drop in traffic took place on January 8. Asiatech (AS43754) was by far the single largest contributor, losing 4.46 million /48-equivalents, accounting for ~9.4% of Iran's entire IPv6 space loss on its own. RASANA (AS31549) was the second-largest, losing 4.19 million /48-equivalents (~8.8% of the country total). As would be expected, this resulted in the share of IPv6 traffic in Iran going to zero. Given the gap in timing between this change and the loss of traffic across the country, this may have been a leading indicator of what was about to happen, but likely not a direct cause of it. Some nominal shifts in announced IPv4 address space are visible during the shutdown, but levels remained fairly consistent during the shutdown period. These observations suggest that the shutdown was implemented by other means, such as filtering.

Cloudflare Radar social media posts (X, Bluesky, Mastodon) throughout January and into early February documented our observations about the state of connectivity in Iran over the course of that month.

On February 28, as military strikes on Iran escalated, a second nationwide Internet shutdown began. Cloudflare Radar observed a sharp drop in traffic from Iran beginning around 10:30 local time (07:00 UTC). Traffic levels fell to well under 1% of previous levels, with only small amounts of Web and DNS traffic egressing the country.

No significant shifts in announced IP address space were observed around the onset of this shutdown. IPv4 space remained fairly consistent, and IPv6 space remained consistently volatile, suggesting that route withdrawals were not the cause of this second shutdown.

The continued announcement of IP address space, and the presence of traffic from the country, even if just a small amount, supports reports that the shutdown was effectively achieved through aggressive filtering, with so-called “whitelists” and “white SIM cards” restricting access to only approved Internet sites by selected users.

Iran remained effectively offline through the end of the quarter. As of late April, this shutdown remains largely in place, making it one of the longest sustained Internet disruptions observed in recent years.

Republic of Congo

On March 15, as the Republic of Congo held a presidential election expected to extend President Denis Sassou Nguesso's 42-year rule, a near-complete shutdown of Internet connectivity was observed in the country. Traffic from the country dropped precipitously around 06:30 local time (05:30 UTC), falling to near zero for approximately 60 hours through the election period and its immediate aftermath. Traffic began recovering around March 17 at 18:20 local time (17:20 UTC), rapidly returning to pre-shutdown levels. While Congolese authorities provided no official explanation for the drop in traffic, similar shutdowns were put into place during the 2021 and 2016 elections.

Military action

Ukraine (Dnipropetrovsk)

On January 7-8, Russian attacks on energy infrastructure in Ukraine caused power outages that disrupted Internet connectivity in Dnipropetrovsk and surrounding regions. Cloudflare Radar observed a significant drop in traffic from the region, reaching nearly 50% below the prior week’s levels, starting around 22:45 local time (20:45 UTC) on January 7. Recovery began approximately 06:00 local time (04:00 UTC) on January 8.

Ukraine (Kharkiv)

On January 26, Russia launched a drone and missile attack targeting energy infrastructure in Kharkiv. Cloudflare Radar observed an approximately 50% drop in traffic from the region beginning around 19:15 local time (17:15 UTC). Recovery progressed through January 27 as power was gradually restored.

Amazon Web Services Middle East (United Arab Emirates and Bahrain)

One of the most unusual disruptions of the quarter was the physical damage inflicted on Amazon Web Services data centers in the Middle East by drone strikes tied to the ongoing regional conflict. On the morning of March 1 (UTC), Amazon reported a fire started after objects hit a UAE data center. The following day, the company confirmed that two of its facilities in the United Arab Emirates (me-central-1 region) were "directly struck" by drones and that a facility in Bahrain (me-south-1 region) was also taken offline after being damaged by a nearby strike.

Cloudflare's Cloud Observatory data showed elevated connection failure rates for the me-central-1 and me-south-1 regions beginning March 1-2 and remaining higher for multiple days. Connection failures occur when Cloudflare fails to successfully connect to an origin server when attempting to retrieve uncacheable content, or content not in/expired from cache. These graphs illustrate the increased rate of failures experienced when attempting to connect to servers in these impacted regions.

In a status post on the AWS Health Dashboard, Amazon acknowledged: "These strikes have caused structural damage, disrupted power delivery to our infrastructure, and in some cases required fire suppression activities that resulted in additional water damage." The company warned that instability was likely to continue in the Middle East, making operations "unpredictable," and urged customers with workloads in the affected regions to back up their data or migrate to other AWS regions.

The AWS me-south-1 region in Bahrain suffered an additional disruption on March 23, following further drone activity.

Power outages

Argentina (Buenos Aires)

On January 15, a power outage struck Buenos Aires during a summer heat wave. The outage caused nominal disruptions in Internet connectivity for customers of multiple providers in the Buenos Aires area, including Telecom Argentina (AS7303), Telecentro (AS27747), and IPLAN (AS16814), with traffic from these networks dropping between 17:30 and 19:30 local time (20:30 - 22:30 UTC). Traffic returned to expected levels approximately two hours after the outage began.

Moldova and Ukraine

An emergency power cut on Ukraine's electricity grid on January 31 caused widespread power outages affecting Moldova and several Ukrainian regions including Kyiv and Kharkiv. Moldova was reportedly hit by widespread power cuts amid the Ukrainian grid problems, and the Ukrainian Energy Minister explained the cross-border impact, noting “Today at 10:42 a.m. (08:42 GMT), a technical malfunction occurred, causing a simultaneous shutdown of the 400 kilovolt line between the power grids of Romania and Moldova and the 750 kilovolt line between western and central Ukraine.” Traffic from Moldova, Kyiv, and Kharkiv began falling around 10:42 local time (08:42 UTC), reaching as much as 46% below the prior week, with recovery occurring around 14:00 local time (12:00 UTC).

Paraguay

On February 18, widespread power outages struck Paraguay after key transmission lines went out of service. The National Electricity Administration (ANDE) posted a series of updates on X documenting the incident and efforts to restore power. Internet traffic from Paraguay dropped as much as 72% compared to the prior week beginning around 15:15 local time (18:15 UTC), and the disruption lasted nearly three hours, with recovery occurring by approximately 18:30 local time (21:30 UTC).

Dominican Republic

A major failure in the Interconnected National Electric System (SENI) of the Dominican Republic caused a widespread power outage on February 23. The state-owned electric company Empresa de Transmisión Eléctrica Dominicana (ETED) posted updates on X documenting the failure and the recovery effort. Internet traffic from the country dropped sharply beginning around 10:50 local time (14:50 UTC), and recovered around midnight local time (04:00 UTC) on February 24, in line with a confirmation posted by ETED that “The authorities of the electric sector reported that the Interconnected National Electric System (SENI) was fully restored to 100% at 11:53 p.m. on this Monday…”.

Cuba

Cuba experienced three separate collapses of its National Electric System (SEN) during March, each causing widespread Internet disruption, reflecting the severe deterioration of the country's electrical infrastructure. (Power outages also disrupted Internet connectivity in Cuba during September and March 2025, and October 2024.)

The first collapse occurred on March 4, when a disconnection of Cuba's National Electroenergy System cascaded from Camagüey to Pinar del Río, cutting power to the western half of the island, including Havana. OSDE/UNE (Cuba's Electric Union) confirmed the failure on social media. Cloudflare Radar data showed traffic from the island dropping by nearly half beginning around 12:15 local time (17:15 UTC), with traffic recovering by approximately 05:01 local time (10:01 UTC) on March 5.

The second collapse occurred on March 16, when Cuba's entire National Electric Power System was disconnected. EnergíaMinas Cuba posted updates on the situation on X. Cloudflare Radar data again shows a significant loss of traffic from Cuba beginning around 13:35 local time (17:35 UTC) on March 16, dropping approximately 65%. Traffic returned to expected levels by approximately 20:00 local time on March 17 (00:00 UTC on March 18), with the disruption lasting over 30 hours.

The third collapse (the second in just a week) happened just days later, on March 21-22. EnergíaMinas Cuba and OSDE/UNE again provided situation updates via X. Cloudflare Radar data shows another significant loss of traffic from Cuba beginning around 18:30 local time (22:30 UTC) on March 21, falling as much as 77% compared to the previous week. Traffic recovered around 21:39 local time on March 22 (01:39 UTC on March 23).

U.S. Virgin Islands

According to a Facebook post from the Virgin Islands Water and Power Authority (WAPA) on March 24, a loss of generation at the Richmond Power Plant combined with damage to an underground cable caused a power outage affecting St. Croix and St. Thomas in the U.S. Virgin Islands. Cloudflare Radar data shows traffic from local provider VI Powernet (AS14434), the primary ISP for the U.S. Virgin Islands, dropping to near zero beginning around 12:15 local time (16:15 UTC), with recovery occurring by approximately 14:45 local time (18:45 UTC). Although VI Powernet experienced a near-complete outage, traffic from St. Thomas only fell by around 60%, and approximately 40% from St. Croix due to the presence of other providers.

Severe weather

Portugal

Storm Kristin made landfall in Portugal on January 28, causing widespread damage and power outages across the country. Approximately 1,500 incidents were registered by Civil Protection between midnight and 08:00 local time (00:00 - 08:00 UTC), with the hardest-hit areas being the districts of Leiria and Coimbra. Significant infrastructure damage was reported, and by 07:00 local time (07:00 UTC), over 850,000 E-Redes customers were without electricity.

The associated power outages disrupted Internet connectivity across Portugal, which Cloudflare Radar observed primarily in the regions of Leiria, Santarém, and Coimbra beginning around 04:10 local time (04:10 UTC) on January 28. Internet traffic dropped as much as 70% in Leiria, and 52% in Coimbra.

Recovery was slow: over 290,000 customers remained without power as late as January 30, and Cloudflare continued tracking gradual recovery of regional traffic over the following weeks. (Coimbra returned to expected levels within the first several days after the storm.) More than three weeks after the storm, over 6,000 customers in Leiria reportedly remained without electricity.

Cable damage

Republic of Congo

Just after the New Year, Internet connectivity in the Republic of Congo was disrupted by an incident on the WACS (West Africa Cable System) submarine cable. Congo Telecom (AS37451) posted on X announcing "an international incident on the WACS cable" was causing Internet disruptions, and stating that backup solutions had been activated. Cloudflare Radar observed a significant drop in traffic from Congo beginning around 00:00 local time on January 2 (23:00 UTC on January 1), falling to 82% below expected levels. A follow-up post from Congo Telecom confirmed that repairs were ongoing, with users potentially experiencing slowdowns during peak hours. Traffic returned to expected levels by approximately 15:00 local time (14:00 UTC) on January 4.

Technical problems

Verizon Wireless (United States)

On January 14, a software issue impacted voice and data services for customers of Verizon Wireless (AS6167) across the United States. Verizon published an official statement acknowledging that the outage began January 14 and that by 22:15 ET (03:15 UTC on January 15) the issue had been resolved. Multiple updates on X from @VerizonNews kept subscribers informed throughout the evening. Cloudflare Radar data shows a minor drop in traffic beginning around 12:30 ET (17:30 UTC) on January 14, consistent with the reported onset of the outage.

Grenada

On February 9-10, customers of Flow Grenada (AS46650) – the primary Internet provider serving Grenada – experienced an island-wide service disruption lasting approximately 12 hours. The provider posted on Facebook acknowledging a service disruption, though no details about the root cause were provided. Cloudflare Radar data shows traffic from the network initially dropping around 11:30 local time (15:30) UTC on February 9, disappearing completely around 20:00 local time (midnight UTC on February 10), and recovering by approximately 23:30 local time (03:30 UTC on February 10). Routing data shows a complete loss of announced IPv4 space at the same time traffic dropped to zero. Major spikes in BGP announcements around the time the disruption initially started, and bookending the complete outage, suggest that the whole event may have been routing-related.

Unknown cause

Orange Guinée (Guinea)

Customers of Orange Guinée (AS37461) in Guinea were unable to make phone calls or access the Internet starting around 10:45 local time (10:45 UTC) on January 6. Orange Guinée subsequently confirmed an "exceptional breakdown" affecting mobile phone and Internet services due to a technical incident, with teams mobilized to restore service. Service was restored by approximately 14:00 local time (14:00 UTC) that same day. No further details on the root cause of the incident were publicly disclosed.

TalkTalk (United Kingdom)

On March 25, customers of UK broadband provider TalkTalk (AS13285) reported widespread service disruptions. TalkTalk acknowledged the issues on X but did not publicly disclose a root cause. Cloudflare Radar observed traffic from the provider drop nearly 50% as compared to the previous week beginning around 07:00 local time (07:00 UTC), with service restored by approximately 08:15 local time (08:15 UTC).

A quarter marked by major disruptions

The first quarter of 2026 was marked by an unusually high number of severe and prolonged Internet disruptions. The major government-directed shutdowns, particularly the extended blackouts in Uganda and Iran, underscore how Internet access continues to be weaponized as a tool of political control. Cuba's three separate national grid collapses in a single month paint a troubling picture of infrastructure fragility with direct consequences for connectivity. And the drone strikes on AWS data centers in the Middle East represent an unprecedented escalation as active military conflict directly and physically damaged major cloud infrastructure, with disastrous consequences for the websites and applications hosted there.

The Cloudflare Radar team is constantly monitoring for Internet disruptions, sharing our observations on the Cloudflare Radar Outage Center, via social media, and in posts on blog.cloudflare.com. Follow us on social media at @CloudflareRadar(X), noc.social/@cloudflareradar (Mastodon), and radar.cloudflare.com (Bluesky), or contact us via email.

DIY BYOIP: a new way to Bring Your Own IP prefixes to Cloudflare

Ash Pallarito — Fri, 07 Nov 2025 14:00:00 GMT

When a customer wants to bring IP address space to Cloudflare, they’ve always had to reach out to their account team to put in a request. This request would then be sent to various Cloudflare engineering teams such as addressing and network engineering — and then the team responsible for the particular service they wanted to use the prefix with (e.g., CDN, Magic Transit, Spectrum, Egress). In addition, they had to work with their own legal teams and potentially another organization if they did not have primary ownership of an IP prefix in order to get a Letter of Agency (LOA) issued through hoops of approvals. This process is complex, manual, and time-consuming for all parties involved — sometimes taking up to 4–6 weeks depending on various approvals.

Well, no longer! Today, we are pleased to announce the launch of our self-serve BYOIP API, which enables our customers to onboard and set up their BYOIP prefixes themselves.

With self-serve, we handle the bureaucracy for you. We have automated this process using the gold standard for routing security — the Resource Public Key Infrastructure, RPKI. All the while, we continue to ensure the best quality of service by generating LOAs on our customers’ behalf, based on the security guarantees of our new ownership validation process. This ensures that customer routes continue to be accepted in every corner of the Internet.

Cloudflare takes the security and stability of the whole Internet very seriously. RPKI is a cryptographically-strong authorization mechanism and is, we believe, substantially more reliable than common practice which relies upon human review of scanned documents. However, deployment and availability of some RPKI-signed artifacts like the AS Path Authorisation (ASPA) object remains limited, and for that reason we are limiting the initial scope of self-serve onboarding to BYOIP prefixes originated from Cloudflare's autonomous system number (ASN) AS 13335. By doing this, we only need to rely on the publication of Route Origin Authorisation (ROA) objects, which are widely available. This approach has the advantage of being safe for the Internet and also meeting the needs of most of our BYOIP customers.

Today, we take a major step forward in offering customers a more comprehensive IP address management (IPAM) platform. With the recent update to enable multiple services on a single BYOIP prefix and this latest advancement to enable self-serve onboarding via our API, we hope customers feel empowered to take control of their IPs on our network.

An evolution of Cloudflare BYOIP

We want Cloudflare to feel like an extension of your infrastructure, which is why we originally launched Bring-Your-Own-IP (BYOIP) back in 2020.

A quick refresher: Bring-your-own-IP is named for exactly what it does - it allows customers to bring their own IP space to Cloudflare. Customers choose BYOIP for a number of reasons, but the main reasons are control and configurability. An IP prefix is a range or block of IP addresses. Routers create a table of reachable prefixes, known as a routing table, to ensure that packets are delivered correctly across the Internet. When a customer's Cloudflare services are configured to use the customer's own addresses, onboarded to Cloudflare as BYOIP, a packet with a corresponding destination address will be routed across the Internet to Cloudflare's global edge network, where it will be received and processed. BYOIP can be used with our Layer 7 services, Spectrum, or Magic Transit.

A look under the hood: How it works

Today’s world of prefix validation

Let’s take a step back and take a look at the state of the BYOIP world right now. Let’s say a customer has authority over a range of IP addresses, and they’d like to bring them to Cloudflare. We require customers to provide us with a Letter of Authorization (LOA) and have an Internet Routing Registry (IRR) record matching their prefix and ASN. Once we have this, we require manual review by a Cloudflare engineer. There are a few issues with this process:

Insecure: The LOA is just a document—a piece of paper. The security of this method rests entirely on the diligence of the engineer reviewing the document. If the review is not able to detect that a document is fraudulent or inaccurate, it is possible for a prefix or ASN to be hijacked.
Time-consuming: Generating a single LOA is not always sufficient. If you are leasing IP space, we will ask you to provide documentation confirming that relationship as well, so that we can see a clear chain of authorisation from the original assignment or allocation of addresses to you. Getting all the paper documents to verify this chain of ownership, combined with having to wait for manual review can result in weeks of waiting to deploy a prefix!

Automating trust: How Cloudflare verifies your BYOIP prefix ownership in minutes

Moving to a self-serve model allowed us to rethink the manner in which we conduct prefix ownership checks. We asked ourselves: How can we quickly, securely, and automatically prove you are authorized to use your IP prefix and intend to route it through Cloudflare?

We ended up killing two birds with one stone, thanks to our two-step process involving the creation of an RPKI ROA (verification of intent) and modification of IRR or rDNS records (verification of ownership). Self-serve unlocks the ability to not only onboard prefixes more quickly and without human intervention, but also exercises more rigorous ownership checks than a simple scanned document ever could. While not 100% foolproof, it is a significant improvement in the way we verify ownership.

Tapping into the authorities

Regional Internet Registries (RIRs) are the organizations responsible for distributing and managing Internet number resources like IP addresses. They are composed of 5 different entities operating in different regions of the world (RIRs). Originally allocated address space from the Internet Assigned Numbers Authority (IANA), they in turn assign and allocate that IP space to Local Internet Registries (LIRs) like ISPs.

This process is based on RIR policies which generally look at things like legal documentation, existing database/registry records, technical contacts, and BGP information. End-users can obtain addresses from an LIR, or in some cases through an RIR directly. As IPv4 addresses have become more scarce, brokerage services have been launched to allow addresses to be leased for fixed periods from their original assignees.

The Internet Routing Registry (IRR) is a separate system that focuses on routing rather than address assignment. Many organisations operate IRR instances and allow routing information to be published, including all five RIRs. While most IRR instances impose few barriers to the publication of routing data, those that are operated by RIRs are capable of linking the ability to publish routing information with the organisations to which the corresponding addresses have been assigned. We believe that being able to modify an IRR record protected in this way provides a good signal that a user has the rights to use a prefix.

Example of a route object containing validation token (using the documentation-only address 192.0.2.0/24):

% whois -h rr.arin.net 192.0.2.0/24

route:          192.0.2.0/24
origin:         AS13335
descr:          Example Company, Inc.
                cf-validation: 9477b6c3-4344-4ceb-85c4-6463e7d2453f
admin-c:        ADMIN2521-ARIN
tech-c:         ADMIN2521-ARIN
tech-c:         CLOUD146-ARIN
mnt-by:         MNT-CLOUD14
created:        2025-07-29T10:52:27Z
last-modified:  2025-07-29T10:52:27Z
source:         ARIN

For those that don’t want to go through the process of IRR-based validation, reverse DNS (rDNS) is provided as another secure method of verification. To manage rDNS for a prefix — whether it's creating a PTR record or a security TXT record — you must be granted permission by the entity that allocated the IP block in the first place (usually your ISP or the RIR).

This permission is demonstrated in one of two ways:

Directly through the IP owner’s authenticated customer portal (ISP/RIR).
By the IP owner delegating authority to your third-party DNS provider via an NS record for your reverse zone.

Example of a reverse domain lookup using dig command (using the documentation-only address 192.0.2.0/24):

% dig cf-validation.2.0.192.in-addr.arpa TXT

; <<>> DiG 9.10.6 <<>> cf-validation.2.0.192.in-addr.arpa TXT
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16686
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;cf-validation.2.0.192.in-addr.arpa. IN TXT

;; ANSWER SECTION:
cf-validation.2.0.192.in-addr.arpa. 300 IN TXT "b2f8af96-d32d-4c46-a886-f97d925d7977"

;; Query time: 35 msec
;; SERVER: 127.0.2.2#53(127.0.2.2)
;; WHEN: Fri Oct 24 10:43:52 EDT 2025
;; MSG SIZE  rcvd: 150

So how exactly is one supposed to modify these records? That’s where the validation token comes into play. Once you choose either the IRR or Reverse DNS method, we provide a unique, single-use validation token. You must add this token to the content of the relevant record, either in the IRR or in the DNS. Our system then looks for the presence of the token as evidence that the request is being made by someone with authorization to make the requested modification. If the token is found, verification is complete and your ownership is confirmed!

The digital passport 🛂

Ownership is only half the battle; we also need to confirm your intention that you authorize Cloudflare to advertise your prefix. For this, we rely on the gold standard for routing security: the Resource Private Key Infrastructure (RPKI), and in particular Route Origin Authorization (ROA) objects.

A ROA is a cryptographically-signed document that specifies which Autonomous System Number (ASN) is authorized to originate your IP prefix. You can think of a ROA as the digital equivalent of a certified, signed, and notarised contract from the owner of the prefix.

Relying parties can validate the signatures in a ROA using the RPKI.You simply create a ROA that specifies Cloudflare's ASN (AS13335) as an authorized originator and arrange for it to be signed. Many of our customers used hosted RPKI systems available through RIR portals for this. When our systems detect this signed authorization, your routing intention is instantly confirmed.

Many other companies that support BYOIP require a complex workflow involving creating self-signed certificates and manually modifying RDAP (Registration Data Access Protocol) records—a heavy administrative lift. By embracing a choice of IRR object modification and Reverse DNS TXT records, combined with RPKI, we offer a verification process that is much more familiar and straightforward for existing network operators.

The global reach guarantee

While the new self-serve flow ditches the need for the "dinosaur relic" that is the LOA, many network operators around the world still rely on it as part of the process of accepting prefixes from other networks.

To help ensure your prefix is accepted by adjacent networks globally, Cloudflare automatically generates a document on your behalf to be distributed in place of a LOA. This document provides information about the checks that we have carried out to confirm that we are authorised to originate the customer prefix, and confirms the presence of valid ROAs to authorise our origination of it. In this way we are able to support the workflows of network operators we connect to who rely upon LOAs, without our customers having the burden of generating them.

Staying away from black holes

One concern in designing the Self-Serve API is the trade-off between giving customers flexibility while implementing the necessary safeguards so that an IP prefix is never advertised without a matching service binding. If this were to happen, Cloudflare would be advertising a prefix with no idea on what to do with the traffic when we receive it! We call this “blackholing” traffic. To handle this, we introduced the requirement of a default service binding — i.e. a service binding that spans the entire range of the IP prefix onboarded.

A customer can later layer different service bindings on top of their default service binding via multiple service bindings, like putting CDN on top of a default Spectrum service binding. This way, a prefix can never be advertised without a service binding and blackhole our customers’ traffic.

Getting started

Check out our developer docs on the most up-to-date documentation on how to onboard, advertise, and add services to your IP prefixes via our API. Remember that onboardings can be complex, and don’t hesitate to ask questions or reach out to our professional services team if you’d like us to do it for you.

The future of network control

The ability to script and integrate BYOIP management into existing workflows is a game-changer for modern network operations, and we’re only just getting started. In the months ahead, look for self-serve BYOIP in the dashboard, as well as self-serve BYOIP offboarding to give customers even more control.

Cloudflare's self-serve BYOIP API onboarding empowers customers with unprecedented control and flexibility over their IP assets. This move to automate onboarding empowers a stronger security posture, moving away from manually-reviewed PDFs and driving RPKI adoption. By using these API calls, organizations can automate complex network tasks, streamline migrations, and build more resilient and agile network infrastructures.

A deep dive into BPF LPM trie performance and optimization

Matt Fleming — Tue, 21 Oct 2025 13:00:00 GMT

It started with a mysterious soft lockup message in production. A single, cryptic line that led us down a rabbit hole into the performance of one of the most fundamental data structures we use: the BPF LPM trie.

BPF trie maps (BPF_MAP_TYPE_LPM_TRIE) are heavily used for things like IP and IP+Port matching when routing network packets, ensuring your request passes through the right services before returning a result. The performance of this data structure is critical for serving our customers, but the speed of the current implementation leaves a lot to be desired. We’ve run into several bottlenecks when storing millions of entries in BPF LPM trie maps, such as entry lookup times taking hundreds of milliseconds to complete and freeing maps locking up a CPU for over 10 seconds. For instance, BPF maps are used when evaluating Cloudflare’s Magic Firewall rules and these bottlenecks have even led to traffic packet loss for some customers.

This post gives a refresher of how tries and prefix matching work, benchmark results, and a list of the shortcomings of the current BPF LPM trie implementation.

A brief recap of tries

If it’s been a while since you last looked at the trie data structure (or if you’ve never seen it before), a trie is a tree data structure (similar to a binary tree) that allows you to store and search for data for a given key and where each node stores some number of key bits.

Searches are performed by traversing a path, which essentially reconstructs the key from the traversal path, meaning nodes do not need to store their full key. This differs from a traditional binary search tree (BST) where the primary invariant is that the left child node has a key that is less than the current node and the right child has a key that is greater. BSTs require that each node store the full key so that a comparison can be made at each search step.

Here’s an example that shows how a BST might store values for the keys:

ABC
ABCD
ABCDEFGH
DEF

In comparison, a trie for storing the same set of keys might look like this.

This way of splitting out bits is really memory-efficient when you have redundancy in your data, e.g. prefixes are common in your keys, because that shared data only requires a single set of nodes. It’s for this reason that tries are often used to efficiently store strings, e.g. dictionaries of words – storing the strings “ABC” and “ABCD” doesn’t require 3 bytes + 4 bytes (assuming ASCII), it only requires 3 bytes + 1 byte because “ABC” is shared by both (the exact number of bits required in the trie is implementation dependent).

Tries also allow more efficient searching. For instance, if you wanted to know whether the key “CAR” existed in the BST you are required to go to the right child of the root (the node with key “DEF”) and check its left child because this is where it would live if it existed. A trie is more efficient because it searches in prefix order. In this particular example, a trie knows at the root whether that key is in the trie or not.

This design makes tries perfectly suited for performing longest prefix matches and for working with IP routing using CIDR. CIDR was introduced to make more efficient use of the IP address space (no longer requiring that classes fall into 4 buckets of 8 bits) but comes with added complexity because now the network portion of an IP address can fall anywhere. Handling the CIDR scheme in IP routing tables requires matching on the longest (most specific) prefix in the table rather than performing a search for an exact match.

If searching a trie does a single-bit comparison at each node, that’s a binary trie. If searching compares more bits we call that a multibit trie. You can store anything you like in a trie, including IP and subnet addresses – it’s all just ones and zeroes.

Nodes in multibit tries use more memory than in binary tries, but since computers operate on multibit words anyhow, it’s more efficient from a microarchitecture perspective to use multibit tries because you can traverse through the bits faster, reducing the number of comparisons you need to make to search for your data. It’s a classic space vs time tradeoff.

There are other optimisations we can use with tries. The distribution of data that you store in a trie might not be uniform and there could be sparsely populated areas. For example, if you store the strings “A” and “BCDEFGHI” in a multibit trie, how many nodes do you expect to use? If you’re using ASCII, you could construct the binary trie with a root node and branch left for “A” or right for “B”. With 8-bit nodes, you’d need another 7 nodes to store “C”, “D”, “E”, “F”, “G”, “H", “I”.

Since there are no other strings in the trie, that’s pretty suboptimal. Once you hit the first level after matching on “B” you know there’s only one string in the trie with that prefix, and you can avoid creating all the other nodes by using path compression. Path compression replaces nodes “C”, “D”, “E” etc. with a single one such as “I”.

If you traverse the tree and hit “I”, you still need to compare the search key with the bits you skipped (“CDEFGH”) to make sure your search key matches the string. Exactly how and where you store the skipped bits is implementation dependent – BPF LPM tries simply store the entire key in the leaf node. As your data becomes denser, path compression is less effective.

What if your data distribution is dense and, say, all the first 3 levels in a trie are fully populated? In that case you can use level compression and replace all the nodes in those levels with a single node that has 2**3 children. This is how Level-Compressed Tries work which are used for IP route lookup in the Linux kernel (see net/ipv4/fib_trie.c).

There are other optimisations too, but this brief detour is sufficient for this post because the BPF LPM trie implementation in the kernel doesn’t fully use the three we just discussed.

How fast are BPF LPM trie maps?

Here are some numbers from running BPF selftests benchmark on AMD EPYC 9684X 96-Core machines. Here the trie has 10K entries, a 32-bit prefix length, and an entry for every key in the range [0, 10K).

Operation	Throughput	Stddev	Latency
lookup	7.423M ops/s	0.023M ops/s	134.710 ns/op
update	2.643M ops/s	0.015M ops/s	378.310 ns/op
delete	0.712M ops/s	0.008M ops/s	1405.152 ns/op
free	0.573K ops/s	0.574K ops/s	1.743 ms/op

The time to free a BPF LPM trie with 10K entries is noticeably large. We recently ran into an issue where this took so long that it caused soft lockup messages to spew in production.

This benchmark gives some idea of worst case behaviour. Since the keys are so densely populated, path compression is completely ineffective. In the next section, we explore the lookup operation to understand the bottlenecks involved.

Why are BPF LPM tries slow?

The LPM trie implementation in kernel/bpf/lpm_trie.c has a couple of the optimisations we discussed in the introduction. It is capable of multibit comparisons at leaf nodes, but since there are only two child pointers in each internal node, if your tree is densely populated with a lot of data that only differs by one bit, these multibit comparisons degrade into single bit comparisons.

Here’s an example. Suppose you store the numbers 0, 1, and 3 in a BPF LPM trie. You might hope that since these values fit in a single 32 or 64-bit machine word, you could use a single comparison to decide which next node to visit in the trie. But that’s only possible if your trie implementation has 3 child pointers in the current node (which, to be fair, most trie implementations do). In other words, you want to make a 3-way branching decision but since BPF LPM tries only have two children, you’re limited to a 2-way branch.

A diagram for this 2-child trie is given below.

The leaf nodes are shown in green with the key, as a binary string, in the center. Even though a single 8-bit comparison is more than capable of figuring out which node has that key, the BPF LPM trie implementation resorts to inserting intermediate nodes (blue) to inject 2-way branching decisions into your path traversal because its parent (the orange root node in this case) only has 2 children. Once you reach a leaf node, BPF LPM tries can perform a multibit comparison to check the key. If a node supported pointers to more children, the above trie could instead look like this, allowing a 3-way branch and reducing the lookup time.

This 2-child design impacts the height of the trie. In the worst case, a completely full trie essentially becomes a binary search tree with height log2(nr_entries) and the height of the trie impacts how many comparisons are required to search for a key.

The above trie also shows how BPF LPM tries implement a form of path compression – you only need to insert an intermediate node where you have two nodes whose keys differ by a single bit. If instead of 3, you insert a key of 15 (0b1111), this won’t change the layout of the trie; you still only need a single node at the right child of the root.

And finally, BPF LPM tries do not implement level compression. Again, this stems from the fact that nodes in the trie can only have 2 children. IP route tables tend to have many prefixes in common and you typically see densely packed tries at the upper levels which makes level compression very effective for tries containing IP routes.

Here’s a graph showing how the lookup throughput for LPM tries (measured in million ops/sec) degrades as the number of entries increases, from 1 entry up to 100K entries.

Once you reach 1 million entries, throughput is around 1.5 million ops/sec, and continues to fall as the number of entries increases.

Why is this? Initially, this is because of the L1 dcache miss rate. All of those nodes that need to be traversed in the trie are potential cache miss opportunities.

As you can see from the graph, L1 dcache miss rate remains relatively steady and yet the throughput continues to decline. At around 80K entries, dTLB miss rate becomes the bottleneck.

Because BPF LPM tries to dynamically allocate individual nodes from a freelist of kernel memory, these nodes can live at arbitrary addresses. Which means traversing a path through a trie almost certainly will incur cache misses and potentially dTLB misses. This gets worse as the number of entries, and height of the trie, increases.

Where do we go from here?

By understanding the current limitations of the BPF LPM trie, we can now work towards building a more performant and efficient solution for the future of the Internet.

We’ve already contributed these benchmarks to the upstream Linux kernel — but that’s only the start. We have plans to improve the performance of BPM LPM tries, particularly the lookup function which is heavily used for our workloads. This post covered a number of optimisations that are already used by the net/ipv4/fib_trie.c code, so a natural first step is to refactor that code so that a common Level Compressed trie implementation can be used. Expect future blog posts to explore this work in depth.

If you’re interested in looking at more performance numbers, Jesper Brouer has recorded some here: https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org.

If the Linux kernel, performance, or optimising data structures excites you, our engineering teams are hiring.

Simplify allowlist management and lock down origin access with Cloudflare Aegis

Mia Malden — Thu, 20 Mar 2025 13:00:00 GMT

Today, we’re taking a deep dive into Aegis, Cloudflare’s origin protection product, to help you understand what the product is, how it works, and how to take full advantage of it for locking down access to your origin. We’re excited to announce the availability of Bring Your Own IPs (BYOIP) for Aegis, a customer-accessible Aegis API, and a gradual rollout for observability of Aegis IP utilization.

If you are new to Cloudflare Aegis, let’s take a step back and understand the product’s purpose and security benefits, process, and how it came to be.

Origin protection then…

Allowlisting a specific set of IP addresses has long existed as one of the simplest ways of restricting access to a server. This firewall mechanism is a starting state that just about every server supports. As we built Cloudflare’s network, one of the first features that customers requested was the ability to restrict access to their origin, so only Cloudflare could make requests to it. Back then, the most natural way to support this was to tell our customers which IP addresses belong to us, so they could allowlist those in their origin firewall. To that end, we have published our IP address ranges, providing an easy configuration to ensure that all traffic accessing your origin comes from Cloudflare’s network.

However, Cloudflare’s IP ranges are used across multiple Cloudflare services and customers, so restricting access to the full list doesn’t necessarily give customers the security benefit they need. With the frequency and scale of IP-based and DDoS attacks every day, origin protection is absolutely paramount. Some customers have the need for more stringent security precautions to ensure that traffic is only coming from Cloudflare’s network and, more specifically, only coming from their zones within Cloudflare.

Origin protection now…

Cloudflare has built out additional services to lock down access to your origin, like authenticated origin pulls (mTLS) and Cloudflare Tunnels, that no longer rely on IP addresses as an indicator of identity. These are part of a global effort towards Zero Trust security: whereas the Internet used to operate under a trust-but-verify model, we aim to operate as nothing is trusted, and everything is verified.

Having non-ephemeral IP addresses — upon which the firewall allowlist mechanism relies — does not quite fit the Zero Trust system. Although mTLS and similar solutions present a more modern approach to origin security, they aren’t always feasible for customers, depending on their hardware or system architecture.

We launched Cloudflare Aegis in March 2023 for customers seeking an intermediary security solution. Aegis provides a dedicated IP address, or set of addresses, from which Cloudflare sends requests, allowing you to further lock down your origin’s layer 3 firewall. Aegis also simplifies management by only requiring you to allowlist a small number of IP addresses.

Normally, Cloudflare’s publicly listed IP ranges are used to egress from Cloudflare’s network to the customer origin. With these IP addresses distributed across Cloudflare’s network, the customer traffic can egress from many servers to the customer origin.

With Aegis, a customer does not necessarily have an Aegis IP address on every server if they are using IPv4. That means requests must be routed through Cloudflare’s network to a server where Aegis IP addresses are present before the traffic can egress to the customer origin.

How requests are routed with Aegis

A few terms, before we begin:

Anycast: a technology where each of our data centers “announces” and can handle the same IP address ranges
Unicast: a technology where each server is given its own, unique unicast IP address

Dedicated egress Aegis IPs are located in a particular set of specific data centers. This list is handpicked by the customer, in conversation with Cloudflare, to be geographically close to their origin servers for optimal security and performance in tandem.

Aegis relies on a technology called soft-unicasting, which allows us to share a /32 egress IPv4 amongst many servers, thereby enabling us to spread a single subnet across many data centers. Then, the traffic going back from the origin servers (the return path) is routed to their closest data center. Once in Cloudflare's network, our in-house L4 XDP-based load balancer, Unimog, ensures that the return packets make it back to the machine that connected to the origin servers at the start.

This supports fast, local, and reliable egress from Cloudflare’s network. With this configuration, we essentially use Anycast at the BGP layer before using an IP and port range to reach a specific machine in the correct data center. Across Cloudflare’s network, we use a significant range of egress IPs to cover all data centers and machines. Since Aegis customers only have a few IPv4 addresses, the range is limited to a few data centers rather than Cloudflare’s entire egress IP range.

The capacity issue

Every IP address has 65,535 ports. A request egresses from exactly one port on the Aegis IP address to exactly one port on the origin IP address.

Each TCP request consists of a 4-way tuple that contains:

Source IP address
Source port
Destination IP address
Destination port

A UDP request can also consist of a 4-way tuple (if it’s connected) or a 2-way tuple (if it’s unconnected), simply including a bind IP address and port. Aegis supports both TCP and UDP traffic — in either case, the requests rely upon IP:port pairings between the source and destination.

When a request reaches the origin, it opens a connection, through which data can pass between the source and destination. One source port can sustain multiple connections at a time, only if the destination ip:ports are different.

Normally at Cloudflare, an IP address establishes connections to a variety of different destination IP ports or addresses to support high traffic volumes. With Aegis, that is no longer the case. The challenge with Aegis IP capacity is exactly that: all the traffic is egressing to the same (or a small set of) origin IP address(es) from the same (or a small set of) source IP address(es). That means Aegis IP addresses have capacity constraints associated with them.

The number of concurrent connections is the number of simultaneous connections for a given 4-way tuple. Between one client and one server, the volume of concurrent connections is inherently limited by the number of ports on an IP address to 65,535 — each source ip:port can only support a single outbound connection per destination IP:port. In practice, that maximum number of concurrent connections is often lower due to assignments of port ranges across many servers and imperfect load distribution.

For planning purposes, we use an estimate of ~80% of the IP capacity (the volume of concurrent connections a source IP address can support to a destination IP address) to protect against overload, in case of traffic spikes. If every port on an IP address is maintaining a concurrent connection, that address would reach and exceed capacity — it would be overloaded with port usage exhaustion. Requests may then be dropped since no new connections can be established. To build in resiliency, we only plan to support 40k concurrent connections per Aegis IP address per origin.

Aegis with IPv6

Each customer who onboards with Cloudflare Aegis receives two /64 prefixes to be globally allocated and announced. That means, outside of Cloudflare’s China Network, every Cloudflare data center has hundreds or even thousands of addresses reserved for egressing your traffic directly to your origin. Without Aegis, any data center in Cloudflare’s Anycast network can serve as a point of egress – so we built Aegis with IPv6 to preserve that level of resiliency and performance. The sheer scale of IPv6, with its available address space, allows us to cushion Aegis’ capacity to a point far beyond any reasonable concern. Globally allocating and announcing your Aegis IPv6 addresses maintains all of Cloudflare’s functionality as a reverse proxy without inducing additional friction.

Aegis with IPv4

Although using IPv6 with Aegis facilitates the best possible speed and resiliency for your traffic, we recognize the transition from IPv4 to IPv6 can be challenging for some customers. Moreover, some customers prefer Aegis IPv4 for granular control over their traffic’s physical egress locations. Still, IPv4 space is more limited and more expensive — while all Cloudflare Aegis customers simply receive two dedicated /64s for IPv6, enabling Aegis with IPv4 requires a touch more tailoring. When you onboard to Aegis, we work with you to determine the ideal number of IPv4 addresses for your Aegis configuration to maintain optimal performance and resiliency, while also ensuring cost efficiency.

Naturally, this introduces a bottleneck — whereas every Cloudflare data center can serve as a point of egress with Aegis IPv6, only a small fraction will have that capability with Aegis IPv4. We aim to mitigate this impact by careful provisioning of the IPv4 addresses.

Now that BYOIP for Aegis is supported, you can also onboard an entire IPv4 /24 prefix or IPv6 /64 for Aegis, allowing for a cost-effective configuration with a much higher volume of capacity.

When we launched Aegis, each IP address was allocated to one data center, requiring at least two IPv4 addresses for appropriate resiliency. To reduce the number of IP addresses necessary in your layer 3 firewall allowlist, and to manage the cost to the customer of leasing IPs, we expanded our Aegis functionality so that one address can be announced from up to four data centers. To do this, we essentially slice the available IP port range into four subsets and provision each at a unique data center.

A quick refresher: when a request travels through Cloudflare, it first hits our network via an ingress data center. The ingress data center is generally near the eyeball, who is sending the request. Then, the request is routed following BGP – or Argo Smart Routing, when enabled – to an exit, or egress, data center. The exit data center will generally fall in close geographic proximity to the request’s destination, which is the customer origin. This mitigates latency induced by the final hop from Cloudflare’s network to your origin.

With Aegis, the possible exit data centers are limited to the data centers in which an Aegis IP address has been allocated. For IPv6, this is a non-issue, since every data center outside our China Network is covered. With IPv4, however, the exit data centers are limited to a much smaller number (4 x the number of Aegis IPs). Aegis IP addresses are allocated, then, to data centers in close geographic proximity to your origin(s). This maximizes the likelihood that whichever data centers would ordinarily have been selected as the egress data center are already announcing Aegis IP addresses. Theoretically, no extra hop is necessary from the optimal exit data center to an Aegis-enabled data center – they are one and the same. In practice, this cannot be guaranteed 100% of the time because optimal routes are ever-changing. We recommend IPv6 to ensure optimal performance because of this possibility of an extra hop with IPv4.

A brief comparison, to summarize:

	Aegis IPv4	Aegis IPv6
Physical points of egress	4 physical data center sites (1-2 cities near origin) per IP address	All 300+ Cloudflare locations (excluding China network)
Capacity	One IPv4 address per 40,000 concurrent connections per origin	Two /64 prefixes for all Aegis customers (>36 quintillion IP addresses) ~50,000x capacity of IPv4 config
Pricing model	Monthly fee based on IPv4 leases or BYOIP for Aegis prefix fees	Included with product purchase or BYOIP for Aegis prefix fees

Aegis IPv4

Aegis IPv6

Physical points of egress

4 physical data center sites (1-2 cities near origin) per IP address

All 300+ Cloudflare locations (excluding China network)

Capacity

One IPv4 address per 40,000 concurrent connections per origin

Two /64 prefixes for all Aegis customers (>36 quintillion IP addresses)

~50,000x capacity of IPv4 config

Pricing model

Monthly fee based on IPv4 leases or BYOIP for Aegis prefix fees

Included with product purchase or BYOIP for Aegis prefix fees

Now, with Aegis analytics coming soon, customers can monitor and manage their IP address usage by Cloudflare data centers in aggregate. Every Cloudflare data center will now run a service with the sole purpose of calculating and reporting Aegis usage for each origin IP:port at regular intervals. Written to an internal database, these reports will be aggregated and exposed to customers via Cloudflare’s GraphQL Analytics API. Several aggregation functions will be available, such as average usage over a period of time, or total summed usage.

This will allow customers to track their own IP address usage to further optimize the distribution of traffic and addresses across different points of presence for IPv4. Additionally, the improved observability will support customer-created notifications via RSS feeds such that you can design your own notification thresholds for port usage.

How Aegis benefits from connection reuse & coalescence

As we mentioned earlier, requests egress from the source IP address to the destination IP address only when a connection has been established between the two. In early Internet protocols, requests and connections were 1:1. Now, once that connection is open, it can remain open and support hundreds or thousands of requests between that source and destination via connection reuse and connection coalescing.

Connection reuse, implemented by HTTP/1.1, allows for requests with the same source ip:port and destination IP:port to pass through the same connection to the origin. A “simple” website by modern standards can send hundreds of requests just to load initially; by streamlining these into a single origin connection, connection reuse reduces the latency derived from constantly opening and closing new connections between two endpoints. Still, any request from a different domain would need to create a new, unique connection to communicate with the origin.

As of HTTP/2, connection coalescing can group requests from different domains into one connection if the requests have the same destination IP address and the server certificate is authoritative for both domains. Depending on the traffic patterns routing from the eyeball to an Aegis IP address, the volume of connection reuse & coalescence can vary. One connection most likely facilitates the traffic of many requests, but each connection requires at least one request to open it in the first place. Therefore, the worst possible ratio between concurrent connections and concurrent requests is 1:1.

In practice, a 1:1 ratio between connections and requests almost never happens. Connection reuse and connection coalescence are very common but highly variable, due to sporadic traffic patterns. We size our Aegis IP address allocations accordingly, erring on the conservative side to minimize risk of capacity overload. With the proper number of dedicated egress IP addresses and optimal allocation to Cloudflare points of presence, we are able to lock down your origin with IPv4 addresses to block malicious layer 7 traffic and reduce overall load to your origin.

Connection reuse and coalescence pairs well with Aegis to reduce load on the origin’s side as well. Because a request can be reused if it comes from the same source IP:port and shares a destination IP:port, routing traffic from a reduced number of source IP addresses (Aegis addresses, in this case) to your origin facilitates a smaller number of total connections. Not only does this improve security by limiting open connection access, but also it reduces latency since fewer connections need to be opened. Maintaining fewer connections is also less resource intensive — more connections means more CPU and more memory handling the inbound requests. By reducing the number of connections to the origin through reuse and coalescence, HTTP/2 lowers the overall cost of operation by optimizing resource usage.

Recap and recommendations

Cloudflare Aegis locks down your origin by restricting access via your origin’s layer 3 firewall. By routing traffic from Cloudflare’s network to your origin through dedicated egress IP addresses, you can ensure that requests coming from Cloudflare are legitimate customer traffic. With a simple flip-switch configuration — allow listing your Aegis IP addresses in your origin’s firewall — you can block excessive noise and bad actors from access. So, to help you take full advantage of Aegis, let’s recap:

Concurrent connections can be, at worst, a 1:1 ratio to concurrent requests.
Cloudflare bases our IP address usage recommendations on 40,000 concurrent connections to minimize risk of capacity overload.
Each Aegis IP address supports an estimated 40,000 concurrent connections per origin IP address.

Additionally, we’re excited to now support:

Public Aegis API
BYOIP for Aegis
Customer-facing Aegis observability (coming soon via gradual rollout)

For customers leasing Cloudflare-owned Aegis IP addresses, the Aegis API will allow you to enable and disable Aegis on zones within your parent account (parent being the account which owns the IP lease). If you deploy your Aegis IP addresses across multiple accounts, you’ll still rely on Cloudflare’s account team to enable and disable Aegis on zones within those additional accounts.

For customers who leverage BYOIP for Aegis, the Aegis API will allow you to enable and disable Aegis on zones within your parent account and within any accounts to which you delegate prefix permissions. We recommend BYOIP for Aegis for improved configurability and cost efficiency.

	BYOIP	Cloudflare-owned IPs
Enable Aegis on zones on parent account	✓	✓
Enable Aegis on zones beyond parent account	✓	✗
Disable Aegis on zones on parent account	✓	✓
Disable Aegis on zones beyond parent account	✓	✗
Access Aegis analytics via the API	✓	✓

With the improved Aegis observability, all Aegis customers will be able to monitor their port usage by IP address, account, zone, and data centers in aggregate via the API. You will also be able to ingest these metrics to configure your own, customizable alerts based on certain port usage thresholds. Alongside the new configurability of Aegis, this visibility will better equip customers to manage their Aegis deployments themselves and alert us to any changes, rather than the other way around.

We also have a few adjacent recommendations to optimize your Aegis configuration. We generally encourage the following best practices for security hygiene for your origin and traffic as well.

IPv6 Compatibility: if your origin(s) support IPv6, you will experience even better resiliency, performance, and availability with your dedicated egress IP addresses at a lower overall cost.
HTTP/2 or HTTP/3 adoption: by supporting connection reuse and coalescence, you will reduce overall load to your origin and latency in the path of your request.
Multi-level origin protection: while Aegis protects your origin at the application level, it pairs well with Access and CNI, Authenticated Origin Pulls, and/or other Cloudflare products to holistically protect, verify, and facilitate your traffic from edge to origin.

If you or your organization want to enhance security and lock down your origin with dedicated egress IP addresses reach out to your account team to onboard today.

connect() - why are you so slow?

Frederick Lawler — Thu, 08 Feb 2024 14:00:27 GMT

It is no secret that Cloudflare is encouraging companies to deprecate their use of IPv4 addresses and move to IPv6 addresses. We have a couple articles on the subject from this year:

And many more in our catalog. To help with this, we spent time this last year investigating and implementing infrastructure to reduce our internal and egress use of IPv4 addresses. We prefer to re-allocate our addresses than to purchase more due to increasing costs. And in this effort we discovered that our cache service is one of our bigger consumers of IPv4 addresses. Before we remove IPv4 addresses for our cache services, we first need to understand how cache works at Cloudflare.

How does cache work at Cloudflare?

Describing the full scope of the architecture is out of scope of this article, however, we can provide a basic outline:

Internet User makes a request to pull an asset
Cloudflare infrastructure routes that request to a handler
Handler machine returns cached asset, or if miss
Handler machine reaches to origin server (owned by a customer) to pull the requested asset

The particularly interesting part is the cache miss case. When a website suddenly becomes very popular, many uncached assets may need to be fetched all at once. Hence we may make an upwards of: 50k TCP unicast connections to a single destination_._

That is a lot of connections! We have strategies in place to limit the impact of this or avoid this problem altogether. But in these rare cases when it occurs, we will then balance these connections over two source IPv4 addresses.

Our goal is to remove the load balancing and prefer one IPv4 address. To do that, we need to understand the performance impact of two IPv4 addresses vs one.

TCP connect() performance of two source IPv4 addresses vs one IPv4 address

We leveraged a tool called wrk, and modified it to distribute connections over multiple source IP addresses. Then we ran a workload of 70k connections over 48 threads for a period of time.

During the test we measured the function tcp_v4_connect() with the BPF BCC libbpf-tool funclatency tool to gather latency metrics as time progresses.

Note that throughout the rest of this article, all the numbers are specific to a single machine with no production traffic. We are making the assumption that if we can improve a worse case scenario in an algorithm with a best case machine, that the results could be extrapolated to production. Lock contention was specifically taken out of the equation, but will have production implications.

Two IPv4 addresses

The y-axis are buckets of nanoseconds in powers of ten. The x-axis represents the number of connections made per bucket. Therefore, more connections in a lower power of ten buckets is better.

We can see that the majority of the connections occur in the fast case with roughly ~20k in the slow case. We should expect this bimodal to increase over time due to wrk continuously closing and establishing connections.

Now let us look at the performance of one IPv4 address under the same conditions.

One IPv4 address

In this case, the bimodal distribution is even more pronounced. Over half of the total connections are in the slow case than in the fast! We may conclude that simply switching to one IPv4 address for cache egress is going to introduce significant latency on our connect() syscalls.

The next logical step is to figure out where this bottleneck is happening.

Port selection is not what you think it is

To investigate this, we first took a flame graph of a production machine:

Flame graphs depict a run-time function call stack of a system. Y-axis depicts call-stack depth, and x-axis depicts a function name in a horizontal bar that represents the amount of times the function was sampled. Checkout this in-depth guide about flame graphs for more details.

Most of the samples are taken in the function __inet_hash_connect(). We can see that there are also many samples for __inet_check_established() with some lock contention sampled between. We have a better picture of a potential bottleneck, but we do not have a consistent test to compare against.

Wrk introduces a bit more variability than we would like to see. Still focusing on the function tcp_v4_connect(), we performed another synthetic test with a homegrown benchmark tool to test one IPv4 address. A tool such as stress-ng may also be used, but some modification is necessary to implement the socket option IP_LOCAL_PORT_RANGE. There is more about that socket option later.

We are now going to ensure a deterministic amount of connections, and remove lock contention from the problem. The result is something like this:

On the y-axis we measured the latency between the start and end of a connect() syscall. The x-axis denotes when a connect() was called. Green dots are even numbered ports, and red dots are odd numbered ports. The orange line is a linear-regression on the data.

The disparity between the average time for port allocation between even and odd ports provides us with a major clue. Connections with odd ports are found significantly slower than the even. Further, odd ports are not interleaved with earlier connections. This implies we exhaust our even ports before attempting the odd. The chart also confirms our bimodal distribution.

__inet_hash_connect()

At this point we wanted to understand this split a bit better. We know from the flame graph and the function __inet_hash_connect() that this holds the algorithm for port selection. For context, this function is responsible for associating the socket to a source port in a late bind. If a port was previously provided with bind(), the algorithm just tests for a unique TCP 4-tuple (src ip, src port, dest ip, dest port) and ignores port selection.

Before we dive in, there is a little bit of setup work that happens first. Linux first generates a time-based hash that is used as the basis for the starting port, then adds randomization, and then puts that information into an offset variable. This is always set to an even integer.

net/ipv4/inet_hashtables.c

   offset &= ~1U;
    
other_parity_scan:
    port = low + offset;
    for (i = 0; i < remaining; i += 2, port += 2) {
        if (unlikely(port >= high))
            port -= remaining;

        inet_bind_bucket_for_each(tb, &head->chain) {
            if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
                if (!check_established(death_row, sk, port, &tw))
                    goto ok;
                goto next_port;
            }
        }
    }

    offset++;
    if ((offset & 1) && remaining > 1)
        goto other_parity_scan;

Then in a nutshell: loop through one half of ports in our range (all even or all odd ports) before looping through the other half of ports (all odd or all even ports respectively) for each connection. Specifically, this is a variation of the Double-Hash Port Selection Algorithm. We will ignore the bind bucket functionality since that is not our main concern.

Depending on your port range, you either start with an even port or an odd port. In our case, our low port, 9024, is even. Then the port is picked by adding the offset to the low port:

net/ipv4/inet_hashtables.c

port = low + offset;

If low was odd, we will have an odd starting port because odd + even = odd.

There is a bit too much going on in the loop to explain in text. I have an example instead:

This example is bound by 8 ports and 8 possible connections. All ports start unused. As a port is used up, the port is grayed out. Green boxes represent the next chosen port. All other colors represent open ports. Blue arrows are even port iterations of offset, and red are the odd port iterations of offset. Note that the offset is randomly picked, and once we cross over to the odd range, the offset is incremented by one.

For each selection of a port, the algorithm then makes a call to the function check_established() which dereferences __inet_check_established(). This function loops over sockets to verify that the TCP 4-tuple is unique. The takeaway is that the socket list in the function is usually smaller than not. This grows as more unique TCP 4-tuples are introduced to the system. Longer socket lists may slow down port selection eventually. We have a blog post on ephemeral port exhausting that dives into the socket list and port uniqueness criteria.

At this point, we can summarize that the odd/even port split is what is causing our performance bottleneck. And during the investigation, it was not obvious to me (or even maybe you) why the offset was initially calculated the way it was, and why the odd/even port split was introduced. After some git-archaeology the decisions become more clear.

Security considerations

Port selection has been shown to be used in device fingerprinting in the past. This led the authors to introduce more randomization into the initial port selection. Prior, ports were predictably picked solely based on their initial hash and a salt value which does not change often. This helps with explaining the offset, but does not explain the split.

Why the even/odd split?

Prior to this patch and that patch, services may have conflicts between the connect() and bind() heavy workloads. Thus, to avoid those conflicts, the split was added. An even offset was chosen for the connect() workloads, and an odd offset for the bind() workloads. However, we can see that the split works great for connect() workloads that do not exceed one half of the allotted port range.

Now we have an explanation for the flame graph and charts. So what can we do about this?

User space solution (kernel < 6.8)

We have a couple of strategies that would work best for us. Infrastructure or architectural strategies are not considered due to significant development effort. Instead, we prefer to tackle the problem where it occurs.

Select, test, repeat

For the “select, test, repeat” approach, you may have code that ends up looking like this:

sys = get_ip_local_port_range()
estab = 0
i = sys.hi
while i >= 0:
    if estab >= sys.hi:
        break

    random_port = random.randint(sys.lo, sys.hi)
    connection = attempt_connect(random_port)
    if connection is None:
        i += 1
        continue

    i -= 1
    estab += 1

The algorithm simply loops through the system port range, and randomly picks a port each iteration. Then test that the connect() worked. If not, rinse and repeat until range exhaustion.

This approach is good for up to ~70-80% port range utilization. And this may take roughly eight to twelve attempts per connection as we approach exhaustion. The major downside to this approach is the extra syscall overhead on conflict. In order to reduce this overhead, we can consider another approach that allows the kernel to still select the port for us.

Select port by random shifting range

This approach leverages the IP_LOCAL_PORT_RANGE socket option. And we were able to achieve performance like this:

That is much better! The chart also introduces black dots that represent errored connections. However, they have a tendency to clump at the very end of our port range as we approach exhaustion. This is not dissimilar to what we may see in “select, test, repeat”.

The way this solution works is something like:

IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51
sys = get_local_port_range()
window.lo = 0
window.hi = 1000
range = window.hi - window.lo
offset = randint(sys.lo, sys.hi - range)
window.lo = offset
window.hi = offset + range

sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
range = pack("@I", window.lo | (window.hi << 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, range)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))

We first fetch the system's local port range, define a custom port range, and then randomly shift the custom range within the system range. Introducing this randomization helps the kernel to start port selection randomly at an odd or even port. Then reduces the loop search space down to the range of the custom window.

We tested with a few different window sizes, and determined that a five hundred or one thousand size works fairly well for our port range:

Window size	Errors	Total test time	Connections/second
500	868	~1.8 seconds	~30,139
1,000	1,129	~2 seconds	~27,260
5,000	4,037	~6.7 seconds	~8,405
10,000	6,695	~17.7 seconds	~3,183

As the window size increases, the error rate increases. That is because a larger window provides less random offset opportunity. A max window size of 56,512 is no different from using the kernels default behavior. Therefore, a smaller window size works better. But you do not want it to be too small either. A window size of one is no different from “select, test, repeat”.

In kernels >= 6.8, we can do even better.

Kernel solution (kernel >= 6.8)

A new patch was introduced that eliminates the need for the window shifting. This solution is going to be available in the 6.8 kernel.

Instead of picking a random window offset for setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, …), like in the previous solution, we instead just pass the full system port range to activate the solution. The code may look something like this:

IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51
sys = get_local_port_range()
sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
range = pack("@I", sys.lo | (sys.hi << 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, range)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))

Setting IP_LOCAL_PORT_RANGE option is what tells the kernel to use a similar approach to “select port by random shifting range” such that the start offset is randomized to be even or odd, but then loops incrementally rather than skipping every other port. We end up with results like this:

The performance of this approach is quite comparable to our user space implementation. Albeit, a little faster. Due in part to general improvements, and that the algorithm can always find a port given the full search space of the range. Then there are no cycles wasted on a potentially filled sub-range.

These results are great for TCP, but what about other protocols?

Other protocols & connect()

It is worth mentioning at this point that the algorithms used for the protocols are mostly the same for IPv4 & IPv6. Typically, the key difference is how the sockets are compared to determine uniqueness and where the port search happens. We did not compare performance for all protocols. But it is worth mentioning some similarities and differences with TCP and a couple of others.

DCCP

The DCCP protocol leverages the same port selection algorithm as TCP. Therefore, this protocol benefits from the recent kernel changes. It is also possible the protocol could benefit from our user space solution, but that is untested. We will let the reader exercise DCCP use-cases.

UDP & UDP-Lite

UDP leverages a different algorithm found in the function udp_lib_get_port(). Similar to TCP, the algorithm will loop over the whole port range space incrementally. This is only the case if the port is not already supplied in the bind() call. The key difference between UDP and TCP is that a random number is generated as a step variable. Then, once a first port is identified, the algorithm loops on that port with the random number. This relies on an uint16_t overflow to eventually loop back to the chosen port. If all ports are used, increment the port by one and repeat. There is no port splitting between even and odd ports.

The best comparison to the TCP measurements is a UDP setup similar to:

sk = socket(AF_INET, SOCK_DGRAM)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))

And the results should be unsurprising with one IPv4 source address:

UDP fundamentally behaves differently from TCP. And there is less work overall for port lookups. The outliers in the chart represent a worst-case scenario when we reach a fairly bad random number collision. In that case, we need to more-completely loop over the ephemeral range to find a port.

UDP has another problem. Given the socket option SO_REUSEADDR, the port you get back may conflict with another UDP socket. This is in part due to the function udp_lib_lport_inuse() ignoring the UDP 2-tuple (src ip, src port) check given the socket option. When this happens you may have a new socket that overwrites a previous. Extra care is needed in that case. We wrote more in depth about these cases in a previous blog post.

In summary

Cloudflare can make a lot of unicast egress connections to origin servers with popular uncached assets. To avoid port-resource exhaustion, we balance the load over a couple of IPv4 source addresses during those peak times. Then we asked: “what is the performance impact of one IPv4 source address for our connect()-heavy workloads?”. Port selection is not only difficult to get right, but is also a performance bottleneck. This is evidenced by measuring connect() latency with a flame graph and synthetic workloads. That then led us to discovering TCP’s quirky port selection process that loops over half your ephemeral ports before the other for each connect().

We then proposed three solutions to solve the problem outside of adding more IP addresses or other architectural changes: “select, test, repeat”, “select port by random shifting range”, and an IP_LOCAL_PORT_RANGE socket option solution in newer kernels. And finally closed out with other protocol honorable mentions and their quirks.

Do not take our numbers! Please explore and measure your own systems. With a better understanding of your workloads, you can make a good decision on which strategy works best for your needs. Even better if you come up with your own strategy!

Using DNS to estimate the worldwide state of IPv6 adoption

Carlos Rodrigues — Thu, 14 Dec 2023 15:05:52 GMT

In order for one device to talk to other devices on the Internet using the aptly named Internet Protocol (IP), it must first be assigned a unique numerical address. What this address looks like depends on the version of IP being used: IPv4 or IPv6.

IPv4 was first deployed in 1983. It’s the IP version that gave birth to the modern Internet and still remains dominant today. IPv6 can be traced back to as early as 1998, but only in the last decade did it start to gain significant traction — rising from less than 1% to somewhere between 30 and 40%, depending on who’s reporting and what and how they’re measuring.

With the growth in connected devices far exceeding the number of IPv4 addresses available, and its costs rising, the much larger address space provided by IPv6 should have made it the dominant protocol by now. However, as we’ll see, this is not the case.

Cloudflare has been a strong advocate of IPv6 for many years and, through Cloudflare Radar, we’ve been closely following IPv6 adoption across the Internet. At three years old, Radar is still a relatively recent platform. To go further back in time, we can briefly turn to our friends at APNIC¹ — one of the five Regional Internet Registries (RIRs). Through their data, going back to 2012, we can see that IPv6 experienced a period of seemingly exponential growth until mid-2017, after which it entered a period of linear growth that’s still ongoing:

IPv6 adoption is slowed down by the lack of compatibility between both protocols — devices must be assigned both an IPv4 and an IPv6 address — along with the fact that virtually all devices on the Internet still support IPv4. Nevertheless, IPv6 is critical for the future of the Internet, and continued effort is required to increase its deployment.

Cloudflare Radar, like APNIC and most other sources today, publishes numbers that primarily reflect the extent to which Internet Service Providers (ISPs) have deployed IPv6: the client side. It’s a very important angle, and one that directly impacts end users, but there’s also the other end of the equation: the server side.

With this in mind, we invite you to follow us on a quick experiment where we aim for a glimpse of server side IPv6 adoption, and how often clients are actually (or likely) able to talk to servers over IPv6. We’ll rely on DNS for this exploration and, as they say, the results may surprise you.

IPv6 Adoption on the Client Side (from HTTP)

By the end of October 2023, from Cloudflare’s perspective, IPv6 adoption across the Internet was at roughly 36% of all traffic, with slight variations depending on the time of day and day of week. When excluding bots the estimate goes up to just over 46%, while excluding humans pushes it down close to 24%. These numbers refer to the share of HTTP requests served over IPv6 across all IPv6-enabled content (the default setting).

For this exercise, what matters most is the number for both humans and bots. There are many reasons for the adoption gap between both kinds of traffic — from varying levels of IPv6 support in the plethora of client software used, to varying levels of IPv6 deployment inside the many networks where traffic comes from, to the varying size of such networks, etc. — but that’s a rabbit hole for another day. If you’re curious about the numbers for a particular country or network, you can find them on Cloudflare Radar and in our Year in Review report for 2023.

It Takes Two to Dance

You, the reader, might point out that measuring the client side of the client-server equation only tells half the story: for an IPv6-capable client to establish a connection with a server via IPv6, the server must also be IPv6-capable.

This raises two questions:

What’s the extent of IPv6 adoption on the server side?
How well does IPv6 adoption on the client side align with adoption on the server side?

There are several possible answers, depending on whether we’re talking about users, devices, bytes transferred, and so on. We’ll focus on connections (it will become clear why in a moment), and the combined question we’re asking is:

How often can an IPv6-capable client use IPv6 when connecting to servers on the Internet, under typical usage patterns?

Typical usage patterns include people going about their day visiting some websites more often than others or automated clients calling APIs. We’ll turn to DNS to get this perspective.

Enter DNS

Before a client can attempt to establish a connection with a server by name, using either the classic IPv4 protocol or the more modern IPv6, it must look up the server’s IP address in the phonebook of the Internet, the Domain Name System (DNS).

Looking up a hostname in DNS is a recursive process. To find the IP address of a server, the domain hierarchy (the dot-separated components of a server’s name) must be followed across several DNS authoritative servers until one of them returns the desired response². Most clients, however, don’t do this directly and instead ask an intermediary server called a recursive resolver to do it for them. Cloudflare operates one such recursive resolver that anyone can use: 1.1.1.1.

As a simplified example, when a client asks 1.1.1.1 for the IP address where “www.example.com” lives, 1.1.1.1 will go out and ask the DNS root servers³ about “.com”, then ask the .com DNS servers about “example.com”, and finally ask the example.com DNS servers about “www.example.com”, which has direct knowledge of it and answers with an IP address. To make things faster for the next client asking a similar question, 1.1.1.1 caches (remembers for a while) both the final answer and the steps in between.

This means 1.1.1.1 is in a very good position to count how often clients try to look up IPv4 addresses (A-type queries) vs. how often they try to look up IPv6 addresses (AAAA-type queries), covering most of the observable Internet.

But how does a client know when to ask for a server’s IPv4 address or its IPv6 address?

The short answer is that clients with IPv6 available to them just ask for both — doing separate A and AAAA lookups for every server they wish to connect to. These IPv6-capable clients will prioritize connecting over IPv6 when they get a non-empty AAAA answer, whether they also get a non-empty A answer (which they almost always get, as we’ll see). The algorithm driving this preference for modernity is called Happy Eyeballs, if you’re interested in the details.

We’re now ready to start looking at some actual data…

IPv6 Adoption on the Client Side (from DNS)

The first step is establishing a baseline by measuring client IPv6 deployment from 1.1.1.1’s perspective and comparing it with the numbers from HTTP requests that we started with.

It’s tempting to count how often clients connect to 1.1.1.1 using IPv6, but the results are misleading for a couple of reasons, the strongest one being hidden in plain sight: 1.1.1.1 is the most memorable address of the set of IPv4 and IPv6 addresses that clients can use to perform DNS lookups through the 1.1.1.1 service. Ideally, IPv6-capable clients using 1.1.1.1 as their recursive resolver should have all four of the following IP addresses configured, not just the first two:

1.1.1.1 (IPv4)
1.0.0.1 (IPv4)
2606:4700:4700::1111 (IPv6)
2606:4700:4700::1001 (IPv6)

But, when manual configuration is involved4, humans find IPv6 addresses less memorable than IPv4 addresses and are less likely to configure them, viewing the IPv4 addresses as enough.

A related, but less obvious, confounding factor is that many IPv6-capable clients will still perform DNS lookups over IPv4 even when they have 1.1.1.1’s IPv6 addresses configured, as spreading lookups over the available addresses is a popular default option.

A more sensible approach to assess IPv6 adoption from DNS client activity is calculating the percentage of AAAA-type queries over the total amount of A-type queries, assuming IPv6-clients always perform both⁵, as mentioned earlier.

Then, from 1.1.1.1’s perspective, IPv6 adoption on the client side is estimated at 30.5% by query volume. This is a bit under what we observed from HTTP traffic over the same time period (35.9%) but such a difference between two different perspectives is not unexpected.

A Note on TTLs

It’s not only recursive resolvers that cache DNS responses, most DNS clients have their own local caches as well. Your web browser, operating system, and even your home router, keep answers around hoping to speed up subsequent queries.

How long each answer remains in cache depends on the time-to-live (TTL) field sent back with DNS records. If you’re familiar with DNS, you might be wondering if A and AAAA records have similar TTLs. If they don’t, we may be getting fewer queries for just one of these two types (because it gets cached for longer at the client level), biasing the resulting adoption figures.

The pie charts here break down the minimum TTLs sent back by 1.1.1.1 in response to A and AAAA queries⁶. There is some difference between both types, but the difference is very small.

IPv6 Adoption on the Server Side

The following graph shows how often A and AAAA-type queries get non-empty responses, shedding light on server side IPv6 adoption and getting us closer to the answer we’re after:

IPv6 adoption by servers is estimated at 43.3% by query volume, noticeably higher than what was observed for clients.

How Often Both Sides Swipe Right

If 30.5% of IP address lookups handled by 1.1.1.1 could make use of an IPv6 address to connect to their intended destinations, but only 43.3% of those lookups get a non-empty response, that can give us a pretty good basis for how often IPv6 connections are made between client and server — roughly 13.2% of the time.

The Potential Impact of Popular Domains

IPv6 server side adoption measured by query volume for the domains in Radar’s Top 100 list is 60.8%. If we exclude these domains from our overall calculations, the previous 13.2% figure drops to 8%. This is a significant difference, but not unexpected, as the Top 100 domains make up over 55% of all A and AAAA queries to 1.1.1.1.

If just a few more of these highly popular domains were to deploy IPv6 today, observed adoption would noticeably increase and, with it, the chance of IPv6-capable clients establishing connections using IPv6.

Closing Thoughts

Observing the extent of IPv6 adoption across the Internet can mean different things:

Counting users with IPv6-capable Internet access;
Counting IPv6-capable devices or software on those devices (clients and/or servers);
Calculating the amount of traffic flowing through IPv6 connections, measured in bytes;
Counting the fraction of connections (or individual requests) over IPv6.

In this exercise we chose to look at connections and requests. Keeping in mind that the underlying reality can only be truly understood by considering several different perspectives, we saw three different IPv6 adoption figures:

35.9% (client side) when counting HTTP requests served from Cloudflare's CDN;
30.5% (client side) when counting A and AAAA-type DNS queries handled by 1.1.1.1;
43.3% (server side) of positive responses to AAAA-type DNS queries, also from 1.1.1.1.

We combined the client side and server side figures from the DNS perspective to estimate how often connections to third-party servers are likely to be established over IPv6 rather than IPv4: just 13.2% of the time.

To improve on these numbers, ISPs, cloud and hosting providers, and corporations alike must increase the rate at which they make IPv6 available for devices in their networks. But large websites and content sources also have a critical role to play in enabling IPv6-capable clients to use IPv6 more often, as 39.2% of queries for domains in the Radar Top 100 (representing over half of all A and AAAA queries to 1.1.1.1) are still limited to IPv4-only responses.

On the road to full IPv6 adoption, the Internet isn’t quite there yet. But continued effort from all those involved can help it to continue to move forward, and perhaps even accelerate progress.

On the server side, Cloudflare has been helping with this worldwide effort for many years by providing free IPv6 support for all domains. On the client side, the 1.1.1.1 app automatically enables your device for IPv6 even if your ISP doesn’t provide any IPv6 support. And, if you happen to manage an IPv6-only network, 1.1.1.1’s DNS64 support also has you covered.

***

¹Cloudflare’s public DNS resolver (1.1.1.1) is operated in partnership with APNIC. You can read more about it in the original announcement blog post and in 1.1.1.1’s privacy policy.

²There’s more information on how DNS works in the “What is DNS?” section of our website. If you’re a hands-on learner, we suggest you take a look at Julia Evans’ “mess with dns”.

³Any recursive resolver will already know the IP addresses of the 13 root servers beforehand. Cloudflare also participates at the topmost level of DNS by providing anycast service to the E and F-Root instances, which means 1.1.1.1 doesn’t need to go far for that first lookup step.

⁴When using the 1.1.1.1 app, all four IP addresses are configured automatically.

⁵For simplification, we assume the amount of IPv6-only clients is still negligibly small. It’s a reasonable assumption in general, and other datasets available to us confirm it.

⁶1.1.1.1, like other recursive resolvers, returns adjusted TTLs: the record’s original TTL minus the number of seconds since the record was last cached. Empty A and AAAA answers get cached for the amount of time defined in the domain’s Start of Authority (SOA) record, as specified by RFC 2308.

Amazon’s $2bn IPv4 tax — and how you can avoid paying it

Anie Jacob — Tue, 26 Sep 2023 13:02:00 GMT

One of the wonderful things about the Internet is that, whether as a consumer or producer, the cost has continued to come down. Back in the day, it used to be that you needed a server room, a whole host of hardware, and an army of folks to help keep everything up and running. The cloud changed that, but even with that shift, services like SSL or unmetered DDoS protection were out of reach for many. We think that the march towards a more accessible Internet — both through ease of use, and reduced cost — is a wonderful thing, and we’re proud to have played a part in making it happen.

Every now and then, however, the march of progress gets interrupted.

On July 28, 2023, Amazon Web Services (AWS) announced that they would begin to charge “per IP per hour for all public IPv4 addresses, whether attached to a service or not”, starting February 1, 2024. This change will add at least \$43 extra per year for every IPv4 address Amazon customers use; this may not sound like much, but we’ve seen back of the napkin analysis that suggests this will result in an approximately \$2bn tax on the Internet.

In this blog, we’ll explain a little bit more about the technology involved, but most importantly, give you a step-by-step walkthrough of how Cloudflare can help you not only eliminate the need to pay Amazon for something that they shouldn’t be charging you for in the first place, but also if you’re a Pro or Business subscriber, we want to put \$43 in your pocket instead of taking it out. Don’t give Amazon \$43 for IPv4, let us give you \$43 and throw in IPv4 as well.

How can Cloudflare help avoid AWS IPv4 charges?

The only way to avoid Amazon’s IPv4 tax is to transition to IPv6 with AWS. But we recognize that not everyone is ready to make that shift — it can be an expensive and challenging process, and may present problems with hardware compatibility and network performance. We cover the finer details of these challenges below, so keep reading! Cloudflare can help ease this transition: let us deal with communicating to AWS using IPv6. Not only that, you’ll get all the rest of the benefits of using Cloudflare and our global network — including all the performance and security that Cloudflare is known for — and a \$43 dollar credit for using us!

IPv6 services like these are something we’ve been offering at Cloudflare for years - in fact this was first announced during Cloudflare's first birthday week in 2011! We’ve made this process simple to enable as well, so you can set this up as soon as today.

To set this feature up you will need to both enable IPv6 Compatibility and set up your origin for AWS to be an IPv6 origin.

To configure this feature simply follow these steps:

1. Login to your Cloudflare account.

2. Select the appropriate domain

3. Click the Network app.

4. Make sure IPv6 Compatibility is toggled on.

To get an IPv6 origin from Amazon you will likely have to follow these steps:

Associate an IPv6 CIDR block with your VPC and subnets
Update your route tables
Update your security group rules
Change your instance type
Assign IPv6 addresses to your instances
(Optional) Configure IPv6 on your instances

(For more information about this migration, check out this link.)

Once you have your IPv6 origins, you’ll want to update your origins on Cloudflare to use the IPv6 addresses. In the simple example of a single origin at root, this is done by creating a proxied (orange-cloud) AAAA record in your Cloudflare DNS editor:

If you are using Load Balancers, you will want to update the origin(s) there.

Once that’s done, you can remove the A/IPv4 record(s) and traffic will move over to the v6 address. While this process is easy now, we’re working on how we can make moving to IPv6 on Cloudflare even easier.

Once you have these features configured and have traffic running through Cloudflare to your origin for at least 6 months, you will be eligible to have a $43 credit deposited right into your Cloudflare account! You can use this credit for your Pro or Biz subscription or even for Workers and R2 usage. See here for more information on how to opt in to this offer.

Through this feature Cloudflare provides the flexibility to manage your IPv6 settings as per your requirements. By leveraging Cloudflare's robust IPv6 support, you can ensure seamless connectivity for your users, while avoiding additional costs associated with public IPv4 addresses.

What’s wrong with IPv4?

So if Cloudflare has this solution, why should you even move to IPv6? Well to clearly explain this let's start with the problem with IPv4.

IP addresses are used to identify and reach resources on a network, which could be a private network, like your office's private network, or a complex public network like the Internet. An example of an IPv4 address would be 198.51.100.1 or 198.51.100.50. And there are approximately 4.3 billion unique IPv4 addresses like these for websites, servers, and other destinations on the Internet to use for routing.

4.3 billion IPv4 addresses may sound like a lot, but it’s not as IPv4 space is running out. In September 2015 ARIN, one of the regional Internet registries that allows people to acquire IP addresses, announced that they had no available space: if you want to buy an IPv4 address you have to go and talk to private companies who are selling them. These companies charge a pretty penny for their IPv4 addresses. It costs about $40 per IPv4 address today. To buy a grouping of IPv4 addresses, also known as a prefix of which the minimum required size is 256 IP addresses, costs about \$10,000.

IP addresses are necessary for having a domain or device on the Internet, but today IPv4 addresses are an increasingly more complicated resource to acquire. Therefore, to facilitate the growth of the Internet there needed to be more unique addresses made available without breaking the bank. That’s where IPv6 comes in.

IPv4 vs. IPv6

In 1995 the IETF (Internet Engineering Task Force) published the RFC for IPv6, which proposed to solve this problem of the limited IPv4 space. Instead of 32 bits of addressable space, IPv6 expanded to 128 bits of addressable space. This means that instead of 4.3 billion addresses available, there are approximately 340 undecillion IPv6 addresses available. This is roughly equivalent to the number of grains of sand on Earth.

So this problem is solved, why should you care? The answer is because many networks on the Internet still prefer IPv4, and companies like AWS are starting to charge money for IPv4 usage.

Let's speak on AWS first: AWS today owns one of the largest chunks of the IPv4 space. During a period of time when IPv4 addresses were on the private market to purchase for dollars per IP address, AWS chose to use its large capital to its advantage and buy up a large amount of the space. Today AWS owns 1.7% of the IPv4 address space which equates to ~100 million IPv4 addresses.

So you would think that moving to IPv6 is the right move, however, for the Internet community it’s proven to be quite a challenge.

When IPv6 was published in the 90s very few networks had devices that supported IPv6. However, today in 2023, that is not the case: global networks supporting IPv6 has increased to 46 percent, so the hardware limitations around supporting it are decreasing. Additionally, anti-abuse and security tools initially had no idea how to deal with attacks or traffic that used IPv6 address space, and this still remains an issue for some of these tools. In 2014, we made it even easier for origin tools to convert by creating pseudo IPv4 to help bridge the gap to those tools.

Despite all of this, many networks don’t have good support infrastructure for IPv6 networking since most networks were built on IPv4. At Cloudflare, we have built our network to support both protocols, known as “dual-stack”.

For a while there were also many networks which had markedly worse performance for IPv6 than IPv4. This is not true anymore, as of today we see only a slight degradation in IPv6 performance across the whole Internet compared to IPv4. The reasons for this include things like legacy hardware, sub-optimal IPv6 connectivity outside our network and high cost for deploying IPv6. You can see in the chart below the additional latency of IPv6 traffic on Cloudflare’s network as compared to IPv4 traffic:

There were many challenges to adopting IPv6, and for some these issues with hardware compatibility and network performance are still worries. This is why still using IPv4 can be useful to folks while transitioning to IPv6, which is what makes AWS’ decision to charge for IPv4 impactful on many websites.

So, don’t pay for AWS IPv4 charges

At the end of the day the choice is clear: you could pay Amazon more to rent their IPs than to buy them, or move to Cloudflare and use our free service to help with the transition to IPv6 with little overhead.

Cloudflare's handling of a bug in interpreting IPv4-mapped IPv6 addresses

Lucas Ferreira — Thu, 02 Feb 2023 13:32:00 GMT

In November 2022, our bug bounty program received a critical and very interesting report. The report stated that certain types of DNS records could be used to bypass some of our network policies and connect to ports on the loopback address (e.g. 127.0.0.1) of our servers. This post will explain how we dealt with the report, how we fixed the bug, and the outcome of our internal investigation to see if the vulnerability had been previously exploited.

RFC 4291 defines ways to embed an IPv4 address into IPv6 addresses. One of the methods defined in the RFC is to use IPv4-mapped IPv6 addresses, that have the following format:

   |                80 bits               | 16 |      32 bits        |
   +--------------------------------------+--------------------------+
   |0000..............................0000|FFFF|    IPv4 address     |
   +--------------------------------------+----+---------------------+

In IPv6 notation, the corresponding mapping for 127.0.0.1 is ::ffff:127.0.0.1 (RFC 4038)

The researcher was able to use DNS entries based on mapped addresses to bypass some of our controls and access ports on the loopback address or non-routable IPs.

This vulnerability was reported on November 27 to our bug bounty program. Our Security Incident Response Team (SIRT) was contacted, and incident response activities began shortly after the report was filed. A hotpatch was deployed three hours later to prevent exploitation of the bug.

Date	Time (UTC)	Activity
27 November 2022	20:42	Initial report to Cloudflare's bug bounty program
	21:04	SIRT oncall is paged
	21:15	SIRT manager on call starts working on the report
	21:22	Incident declared and team is assembled and debugging starts
	23:20	A hotfix is ready and deployment starts
	23:47	Team confirms that the hotfix is deployed and working
	23:58	Team investigates if other products are affected. Load Balancers and Spectrum are potential targets. Both products are found to be unaffected by the vulnerability.
28 November 2022	21:14	A permanent fix is ready
29 November 2022	21:34	Permanent fix is merged

Blocking exploitation

Immediately after the vulnerability was reported to our Bug Bounty program, the team began working to understand the issue and find ways to quickly block potential exploitation. It was determined that the fastest way to prevent exploitation would be to block the creation of the DNS records required to execute the attack.

The team then began to implement a patch to prevent the creation of DNS records that include IPv6 addresses that map loopback or RFC 1918 (internal) IPv4 addresses. The fix was fully deployed and confirmed three hours after the report was filed. We later realized that this change was insufficient because records hosted on external DNS servers could also be used in this attack.

The exploit

The exploit provided consisted of the following: a DNS entry, and a Cloudflare Worker. The DNS entry was an AAAA record pointing to ::ffff:127.0.0.1:

exploit.example.com AAAA ::ffff:127.0.0.1

The worker included the following code:

export default {
    async fetch(request) {
        const requestJson = await request.json()
        return fetch(requestJson.url, requestJson)
    }
}

The Worker was given a custom URL such as proxy.example.com.

With that setup, it was possible to make the worker attempt connections on the loopback interface of the server where it was running. The call would look like this:

curl https://proxy.example.com/json -d '{"url":"http://exploit.example.com:80/url_path"}'

The attack could then be scripted to attempt to connect to multiple ports on the server.

It was also found that a similar setup could be used with other IPv4 addresses to attempt connections into internal services. In this case, the DNS entry would look like:

exploit.example.com AAAA ::ffff:10.0.0.1

This exploit would allow an attacker to connect to services running on the loopback interface of the server. If the attacker was able to bypass the security and authentication mechanisms of a service, it could impact the confidentiality and integrity of data. For services running on other servers, the attacker could also use the worker to attempt connections and map services available over the network. As in most networks, Cloudflare's network policies and ACLs must allow a few ports to be accessible. These ports would be accessible by an attacker using this exploit.

Investigation

We started an investigation to understand the root cause of the problem and created a proof-of-concept that allowed the team to debug the issue. At the same time, we started a parallel investigation to determine if the issue had been previously exploited.

It all happened when two bugs collided.

The first bug happened in our internal DNS system which is responsible for mapping hostnames to IP addresses of our customers’ origin servers (the DNS system). When the DNS system tried to answer a query for the DNS record from exploit.example.com, it serialized the IP as a string. The Golang net library used for DNS automatically converted the IP ::ffff:10.0.0.1 to string “10.0.0.1”. However, the DNS system still treated it as an IPv6 address. So a query response {ipv6: “10.0.0.1”} was returned.

The second bug was in our internal HTTP system (the proxy) which is responsible for forwarding HTTP traffic to customer’s origin servers. The bug happened in how the proxy validates this DNS response, {ipv6: “10.0.0.1”}. The proxy has two deny lists of IPs that are not allowed to be used, one for IPv4 and one for IPv6. These lists contain localhost IPs and private IPs. The bug was that the proxy system compared the address 10.0.0.1 against the IPv6 deny list because the address was in the “ipv6” section. Naturally the address didn’t match any entry in the deny list. So the address was allowed to be used as an origin IP address.

The second investigation team searched through the logs and found no evidence of previous exploitation of this vulnerability. The team also checked Cloudflare DNS for entries using IPv4-mapped IPv6 addresses and determined that all the existing entries had been used for testing purposes. As of now, there are no signs that this vulnerability could have been previously used against Cloudflare systems.

Remediating the vulnerability

To address this issue we implemented a fix in the proxy service to correctly use the deny list of the parsed address, not the deny list of the IP family the DNS API response claimed to be, to validate the IP address. We confirmed both in our test and production environments that the fix did prevent the issue from happening again.

Beyond maintaining a bug bounty program, we regularly perform internal security reviews and hire third-party firms to audit the software we develop. But it is through our bug bounty program that we receive some of the most interesting and creative reports. Each report has helped us improve the security of our services. We invite those that find a security issue in any of Cloudflare’s services to report it to us through HackerOne.

Building fast interpreters in Rust

Ingvar Stepanyan — Mon, 04 Mar 2019 16:00:00 GMT

In the previous post we described the Firewall Rules architecture and how the different components are integrated together. We also mentioned that we created a configurable Rust library for writing and executing Wireshark®-like filters in different parts of our stack written in Go, Lua, C, C++ and JavaScript Workers.

With a mixed set of requirements of performance, memory safety, low memory use, and the capability to be part of other products that we’re working on like Spectrum, Rust stood out as the strongest option.

We have now open-sourced this library under our Github account: https://github.com/cloudflare/wirefilter. This post will dive into its design, explain why we didn’t use a parser generator and how our execution engine balances security, runtime performance and compilation cost for the generated filters.

Parsing Wireshark syntax

When building a custom Domain Specific Language (DSL), the first thing we need to be able to do is parse it. This should result in an intermediate representation (usually called an Abstract Syntax Tree) that can be inspected, traversed, analysed and, potentially, serialised.

There are different ways to perform such conversion, such as:

Manual char-by-char parsing using state machines, regular expression and/or native string APIs.
Parser combinators, which use higher-level functions to combine different parsers together (in Rust-land these are represented by nom, chomp, combine and others).
Fully automated generators which, provided with a grammar, can generate a fully working parser for you (examples are peg, pest, LALRPOP, etc.).

Wireshark syntax

But before trying to figure out which approach would work best for us, let’s take a look at some of the simple official Wireshark examples, to understand what we’re dealing with:

ip.len le 1500
udp contains 81:60:03
sip.To contains "a1762"
http.request.uri matches "gl=se$"
eth.dst == ff:ff:ff:ff:ff:ff
ip.addr == 192.168.0.1
ipv6.addr == ::1

You can see that the right hand side of a comparison can be a number, an IPv4 / IPv6 address, a set of bytes or a string. They are used interchangeably, without any special notion of a type, which is fine given that they are easily distinguishable… or are they?

Let’s take a look at some IPv6 forms on Wikipedia:

2001:0db8:0000:0000:0000:ff00:0042:8329
2001:db8:0:0:0:ff00:42:8329
2001:db8::ff00:42:8329

So IPv6 can be written as a set of up to 8 colon-separated hexadecimal numbers, each containing up to 4 digits with leading zeros omitted for convenience. This appears suspiciously similar to the syntax for byte sequences. Indeed, if we try writing out a sequence like 2f:31:32:33:34:35:36:37, it’s simultaneously a valid IPv6 and a byte sequence in terms of Wireshark syntax.

There is no way of telling what this sequence actually represents without looking at the type of the field it’s being compared with, and if you try using this sequence in Wireshark, you’ll notice that it does just that:

ipv6.addr == 2f:31:32:33:34:35:36:37: right hand side is parsed and used as an IPv6 address
http.request.uri == 2f:31:32:33:34:35:36:37: right hand side is parsed and used as a byte sequence (will match a URL "/1234567")

Are there other examples of such ambiguities? Yup - for example, we can try using a single number with two decimal digits:

tcp.port == 80: matches any traffic on the port 80 (HTTP)
http.file_data == 80: matches any HTTP request/response with body containing a single byte (0x80)

We could also do the same with ethernet address, defined as a separate type in Wireshark, but, for simplicity, we represent it as a regular byte sequence in our implementation, so there is no ambiguity here.

Choosing a parsing approach

This is an interesting syntax design decision. It means that we need to store a mapping between field names and types ahead of time - a Scheme, as we call it - and use it for contextual parsing. This restriction also immediately rules out many if not most parser generators.

We could still use one of the more sophisticated ones (like LALRPOP) that allow replacing the default regex-based lexer with your own custom code, but at that point we’re so close to having a full parser for our DSL that the complexity outweighs any benefits of using a black-box parser generator.

Instead, we went with a manual parsing approach. While (for a good reason) this might sound scary in unsafe languages like C / C++, in Rust all strings are bounds checked by default. Rust also provides a rich string manipulation API, which we can use to build more complex helpers, eventually ending up with a full parser.

This approach is, in fact, pretty similar to parser combinators in that the parser doesn’t have to keep state and only passes the unprocessed part of the input down to smaller, narrower scoped functions. Just as in parser combinators, the absence of mutable state also allows to easily test and maintain each of the parsers for different parts of the syntax independently of the others.

Compared with popular parser combinator libraries in Rust, one of the differences is that our parsers are not standalone functions but rather types that implement common traits:

pub trait Lex<'i>: Sized {
   fn lex(input: &'i str) -> LexResult<'i, Self>;
}
pub trait LexWith<'i, E>: Sized {
   fn lex_with(input: &'i str, extra: E) -> LexResult<'i, Self>;
}

The lex method or its contextual variant lex_with can either return a successful pair of (instance of the type, rest of input) or a pair of (error kind, relevant input span).

The Lex trait is used for target types that can be parsed independently of the context (like field names or literals), while LexWith is used for types that need a Scheme or a part of it to be parsed unambiguously.

A bigger difference is that, instead of relying on higher-level functions for parser combinators, we use the usual imperative function call syntax. For example, when we want to perform sequential parsing, all we do is call several parsers in a row, using tuple destructuring for intermediate results:

let input = skip_space(input);
let (op, input) = CombinedExpr::lex_with(input, scheme)?;
let input = skip_space(input);
let input = expect(input, ")")?;

And, when we want to try different alternatives, we can use native pattern matching and ignore the errors:

if let Ok(input) = expect(input, "(") {
   ...
   (SimpleExpr::Parenthesized(Box::new(op)), input)
} else if let Ok((op, input)) = UnaryOp::lex(input) {
   ...
} else {
   ...
}

Finally, when we want to automate parsing of some more complicated common cases - say, enums - Rust provides a powerful macro syntax:

lex_enum!(#[repr(u8)] OrderingOp {
   "eq" | "==" => Equal = EQUAL,
   "ne" | "!=" => NotEqual = LESS | GREATER,
   "ge" | ">=" => GreaterThanEqual = GREATER | EQUAL,
   "le" | "<=" => LessThanEqual = LESS | EQUAL,
   "gt" | ">" => GreaterThan = GREATER,
   "lt" | "<" => LessThan = LESS,
});

This gives an experience similar to parser generators, while still using native language syntax and keeping us in control of all the implementation details.

Execution engine

Because our grammar and operations are fairly simple, initially we used direct AST interpretation by requiring all nodes to implement a trait that includes an execute method.

trait Expr<'s> {
    fn execute(&self, ctx: &ExecutionContext<'s>) -> bool;
}

The ExecutionContext is pretty similar to a Scheme, but instead of mapping arbitrary field names to their types, it maps them to the runtime input values provided by the caller.

As with Scheme, initially ExecutionContext used an internal HashMap for registering these arbitrary String -> RhsValue mappings. During the execute call, the AST implementation would evaluate itself recursively, and look up each field reference in this map, either returning a value or raising an error on missing slots and type mismatches.

This worked well enough for an initial implementation, but using a HashMap has a non-trivial cost which we would like to eliminate. We already used a more efficient hasher - [Fnv](https://github.com/servo/rust-fnv) - because we are in control of all keys and so are not worried about hash DoS attacks, but there was still more we could do.

Speeding up field access

If we look at the data structures involved, we can see that the scheme is always well-defined in advance, and all our runtime values in the execution engine are expected to eventually match it, even if the order or a precise set of fields is not guaranteed:

So what if we ditch the second map altogether and instead use a fixed-size array of values? Array indexing should be much cheaper than looking up in a map, so it might be well worth the effort.

How can we do it? We already know the number of items (thanks to the predefined scheme) so we can use that for the size of the backing storage, and, in order to simulate HashMap “holes” for unset values, we can wrap each item an Option<...>:

pub struct ExecutionContext<'e> {
    scheme: &'e Scheme,
    values: Box<[Option>]>,
}

The only missing piece is an index that could map both structures to each other. As you might remember, Scheme still uses a HashMap for field registration, and a HashMap is normally expected to be randomised and indexed only by the predefined key.

While we could wrap a value and an auto-incrementing index together into a custom struct, there is already a better solution: [IndexMap](https://github.com/bluss/indexmap). IndexMap is a drop-in replacement for a HashMap that preserves ordering and provides a way to get an index of any element and vice versa - exactly what we needed.

After replacing a HashMap in the Scheme with IndexMap, we can change parsing to resolve all the parsed field names to their indices in-place and store that in the AST:

impl<'i, 's> LexWith<'i, &'s Scheme> for Field<'s> {
   fn lex_with(mut input: &'i str, scheme: &'s Scheme) -> LexResult<'i, Self> {
       ...
       let field = scheme
           .get_field_index(name)
           .map_err(|err| (LexErrorKind::UnknownField(err), name))?;
       Ok((field, input))
   }
}

After that, in the ExecutionContext we allocate a fixed-size array and use these indices for resolving values during runtime:

impl<'e> ExecutionContext<'e> {
   /// Creates an execution context associated with a given scheme.
   ///
   /// This scheme will be used for resolving any field names and indices.
   pub fn new<'s: 'e>(scheme: &'s Scheme) -> Self {
       ExecutionContext {
           scheme,
           values: vec![None; scheme.get_field_count()].into(),
       }
   }
   ...
}

This gave significant (~2x) speed ups on our standard benchmarks:

Before:

test matching ... bench:       2,548 ns/iter (+/- 98)
test parsing  ... bench:     192,037 ns/iter (+/- 21,538)

After**:**

test matching ... bench:       1,227 ns/iter (+/- 29)
test parsing  ... bench:     197,574 ns/iter (+/- 16,568)

This change also improved the usability of our API, as any type errors are now detected and reported much earlier, when the values are just being set on the context, and not delayed until filter execution.

[not] JIT compilation

Of course, as with any respectable DSL, one of the other ideas we had from the beginning was “...at some point we’ll add native compilation to make everything super-fast, it’s just a matter of time...”.

In practice, however, native compilation is a complicated matter, but not due to lack of tools.

First of all, there is question of storage for the native code. We could compile each filter statically into some sort of a library and publish to a key-value store, but that would not be easy to maintain:

We would have to compile each filter to several platforms (x86-64, ARM, WASM, …).
The overhead of native library formats would significantly outweigh the useful executable size, as most filters tend to be small.
Each time we’d like to change our execution logic, whether to optimise it or to fix a bug, we would have to recompile and republish all the previously stored filters.
Finally, even if/though we’re sure of the reliability of the chosen store, executing dynamically retrieved native code on the edge as-is is not something that can be taken lightly.

The usual flexible alternative that addresses most of these issues is Just-in-Time (JIT) compilation.

When you compile code directly on the target machine, you get to re-verify the input (still expressed as a restricted DSL), you can compile it just for the current platform in-place, and you never need to republish the actual rules.

Looks like a perfect fit? Not quite. As with any technology, there are tradeoffs, and you only get to choose those that make more sense for your use cases. JIT compilation is no exception.

First of all, even though you’re not loading untrusted code over the network, you still need to generate it into the memory, mark that memory as executable and trust that it will always contain valid code and not garbage or something worse. Depending on your choice of libraries and complexity of the DSL, you might be willing to trust it or put heavy sandboxing around, but, either way, it’s a risk that one must explicitly be willing to take.

Another issue is the cost of compilation itself. Usually, when measuring the speed of native code vs interpretation, the cost of compilation is not taken into the account because it happens out of the process.

With JIT compilers though, it’s different as you’re now compiling things the moment they’re used and cache the native code only for a limited time. Turns out, generating native code can be rather expensive, so you must be absolutely sure that the compilation cost doesn’t offset any benefits you might gain from the native execution speedup.

I’ve talked a bit more about this at Rust Austin meetup and, I believe, this topic deserves a separate blog post so won’t go into much more details here, but feel free to check out the slides: https://www.slideshare.net/RReverser/building-fast-interpreters-in-rust. Oh, and if you’re in Austin, you should pop into our office for the next meetup!

Let’s get back to our original question: is there anything else we can do to get the best balance between security, runtime performance and compilation cost? Turns out, there is.

Dynamic dispatch and closures to the rescue

Introducing Fn trait!

In Rust, the Fn trait and friends (FnMut, FnOnce) are automatically implemented on eligible functions and closures. In case of a simple Fn case the restriction is that they must not modify their captured environment and can only borrow from it.

Normally, you would want to use it in generic contexts to support arbitrary callbacks with given argument and return types. This is important because in Rust, each function and closure implements a unique type and any generic usage would compile down to a specific call just to that function.

fn just_call(me: impl Fn(), maybe: bool) {
  if maybe {
    me()
  }
}

Such behaviour (called static dispatch) is the default in Rust and is preferable for performance reasons.

However, if we don’t know all the possible types at compile-time, Rust allows us to opt-in for a dynamic dispatch instead:

fn just_call(me: &dyn Fn(), maybe: bool) {
  if maybe {
    me()
  }
}

Dynamically dispatched objects don't have a statically known size, because it depends on the implementation details of particular type being passed. They need to be passed as a reference or stored in a heap-allocated Box, and then used just like in a generic implementation.

In our case, this allows to create, return and store arbitrary closures, and later call them as regular functions:

trait Expr<'s> {
    fn compile(self) -> CompiledExpr<'s>;
}

pub(crate) struct CompiledExpr<'s>(Box) -> bool>);

impl<'s> CompiledExpr<'s> {
   /// Creates a compiled expression IR from a generic closure.
   pub(crate) fn new(closure: impl 's + Fn(&ExecutionContext<'s>) -> bool) -> Self {
       CompiledExpr(Box::new(closure))
   }

   /// Executes a filter against a provided context with values.
   pub fn execute(&self, ctx: &ExecutionContext<'s>) -> bool {
       self.0(ctx)
   }
}

The closure (an Fn box) will also automatically include the environment data it needs for the execution.

This means that we can optimise the runtime data representation as part of the “compile” process without changing the AST or the parser. For example, when we wanted to optimise IP range checks by splitting them for different IP types, we could do that without having to modify any existing structures:

RhsValues::Ip(ranges) => {
   let mut v4 = Vec::new();
   let mut v6 = Vec::new();
   for range in ranges {
       match range.clone().into() {
           ExplicitIpRange::V4(range) => v4.push(range),
           ExplicitIpRange::V6(range) => v6.push(range),
       }
   }
   let v4 = RangeSet::from(v4);
   let v6 = RangeSet::from(v6);
   CompiledExpr::new(move |ctx| {
       match cast!(ctx.get_field_value_unchecked(field), Ip) {
           IpAddr::V4(addr) => v4.contains(addr),
           IpAddr::V6(addr) => v6.contains(addr),
       }
   })
}

Moreover, boxed closures can be part of that captured environment, too. This means that we can convert each simple comparison into a closure, and then combine it with other closures, and keep going until we end up with a single top-level closure that can be invoked as a regular function to evaluate the entire filter expression.

It’s turtles closures all the way down:

let items = items
   .into_iter()
   .map(|item| item.compile())
   .collect::>()
   .into_boxed_slice();

match op {
   CombiningOp::And => {
       CompiledExpr::new(move |ctx| items.iter().all(|item| item.execute(ctx)))
   }
   CombiningOp::Or => {
       CompiledExpr::new(move |ctx| items.iter().any(|item| item.execute(ctx)))
   }
   CombiningOp::Xor => CompiledExpr::new(move |ctx| {
       items
           .iter()
           .fold(false, |acc, item| acc ^ item.execute(ctx))
   }),
}

What’s nice about this approach is:

Our execution is no longer tied to the AST, and we can be as flexible with optimising the implementation and data representation as we want without affecting the parser-related parts of code or output format.
Even though we initially “compile” each node to a single closure, in future we can pretty easily specialise certain combinations of expressions into their own closures and so improve execution speed for common cases. All that would be required is a separate match branch returning a closure optimised for just that case.
Compilation is very cheap compared to real code generation. While it might seem that allocating many small objects (one Boxed closure per expression) is not very efficient and that it would be better to replace it with some sort of a memory pool, in practice we saw a negligible performance impact.
No native code is generated at runtime, which means that we execute only code that was statically verified by Rust at compile-time and compiled down to a static function. All that we do at the runtime is call existing functions with different values.
Execution turns out to be faster too. This initially came as a surprise, because dynamic dispatch is widely believed to be costly and we were worried that it would get slightly worse than AST interpretation. However, it showed an immediate ~10-15% runtime improvement in benchmarks and on real examples.

The only obvious downside is that each level of AST requires a separate dynamically-dispatched call instead of a single inlined code for the entire expression, like you would have even with a basic template JIT.

Unfortunately, such output could be achieved only with real native code generation, and, for our case, the mentioned downsides and risks would outweigh runtime benefits, so we went with the safe & flexible closure approach.

Bonus: WebAssembly support

As was mentioned earlier, we chose Rust as a safe high-level language that allows easy integration with other parts of our stack written in Go, C and Lua via C FFI. But Rust has one more target it invests in and supports exceptionally well: WebAssembly.

Why would we be interested in that? Apart from the parts of the stack where our rules would run, and the API that publishes them, we also have users who like to write their own rules. To do that, they use a UI editor that allows either writing raw expressions in Wireshark syntax or as a WYSIWYG builder.

We thought it would be great to expose the parser - the same one as we use on the backend - to the frontend JavaScript for a consistent real-time editing experience. And, honestly, we were just looking for an excuse to play with WASM support in Rust.

WebAssembly could be targeted via regular C FFI, but in that case you would need to manually provide all the glue for the JavaScript side to hold and convert strings, arrays and objects forth and back.

In Rust, this is all handled by wasm-bindgen. While it provides various attributes and methods for direct conversions, the simplest way to get started is to activate the “serde” feature which will automatically convert types using JSON.parse, JSON.stringify and [serde_json](https://docs.serde.rs/serde_json/) under the hood.

In our case, creating a wrapper for the parser with only 20 lines of code was enough to get started and have all the WASM code + JavaScript glue required:

#[wasm_bindgen]
pub struct Scheme(wirefilter::Scheme);

fn into_js_error(err: impl std::error::Error) -> JsValue {
   js_sys::Error::new(&err.to_string()).into()
}

#[wasm_bindgen]
impl Scheme {
   #[wasm_bindgen(constructor)]
   pub fn try_from(fields: &JsValue) -> Result {
       fields.into_serde().map(Scheme).map_err(into_js_error)
   }

   pub fn parse(&self, s: &str) -> Result {
       let filter = self.0.parse(s).map_err(into_js_error)?;
       JsValue::from_serde(&filter).map_err(into_js_error)
   }
}

And by using a higher-level tool called wasm-pack, we also got automated npm package generation and publishing, for free.

This is not used in the production UI yet because we still need to figure out some details for unsupported browsers, but it’s great to have all the tooling and packages ready with minimal efforts. Extending and reusing the same package, it should be even possible to run filters in Cloudflare Workers too (which also support WebAssembly).

The future

The code in the current state is already doing its job well in production and we’re happy to share it with the open-source Rust community.

This is definitely not the end of the road though - we have many more fields to add, features to implement and planned optimisations to explore. If you find this sort of work interesting and would like to help us by working on firewalls, parsers or just any Rust projects at scale, give us a shout!

Fixing an old hack - why we are bumping the IPv6 MTU

Marek Majkowski — Mon, 10 Sep 2018 09:21:25 GMT

Back in 2015 we deployed ECMP routing - Equal Cost Multi Path - within our datacenters. This technology allowed us to spread traffic heading to a single IP address across multiple physical servers.

You can think about it as a third layer of load balancing.

First we split the traffic across multiple IP addresses with DNS.
Then we split the traffic across multiple datacenters with Anycast.
Finally, we split the traffic across multiple servers with ECMP.

photo by Sahra by-sa/2.0

When deploying ECMP we hit a problem with Path MTU discovery. The ICMP packets destined to our Anycast IP's were being dropped. You can read more about that (and the solution) in the 2015 blog post Path MTU Discovery in practice.

To solve the problem we created a small piece of software, called pmtud (https://github.com/cloudflare/pmtud). Since deploying pmtud, our ECMP setup has been working smoothly.

Hardcoding IPv6 MTU

During that initial ECMP rollout things were broken. To keep services running until pmtud was done, we deployed a quick hack. We reduced the MTU of IPv6 traffic to the minimal possible value: 1280 bytes.

This was done as a tag on a default route. This is how our routing table used to look:

$ ip -6 route show
...
default via 2400:xxxx::1 dev eth0 src 2400:xxxx:2  metric 1024  mtu 1280

Notice the mtu 1280 in the default route.

With this setting our servers never transmitted IPv6 packets larger than 1280 bytes, therefore "fixing" the issue. Since all IPv6 routers must have an MTU of at least 1280, we could expect that no ICMP Packet-Too-Big message would ever be sent to us.

Remember - the original problem introduced by ECMP was that ICMP routed back to our Anycast addresses could go to a wrong machine within the ECMP group. Therefore we became ICMP black holes. Cloudflare would send large packets, they would be dropped with ICMP PTB packet flying back to us. Which, in turn would fail to be delivered to the right machine due to ECMP.

But why did this problem not appear for IPv4 traffic? We believe the same issue exists on IPv4, but it's less damaging due to the different nature of the network. IPv4 is more mature and the great majority of end-hosts support either MTU 1500 or have their MSS option well configured - or clamped by some middle box. This is different in IPv6 where a large proportion of users use tunnels, have Path MTU strictly smaller than 1500 and use incorrect MSS settings in the TCP header. Finally, Linux implements RFC4821 for IPv4 but not IPv6. RFC4821 (PLPMTUD) has its disadvantages, but does slightly help to alleviate the ICMP blackhole issue.

Our "fix" - reducing the MTU to 1280 - was serving us well and we had no pressing reason to revert it.

Researchers did notice though. We were caught red-handed twice:

In 2017 Geoff Huston noticed (pdf) that we sent DNS fragments of 1280 only (older blog post).
In June 2018 the paper "Exploring usable Path MTU in the Internet" (pdf) mentioned our weird setting - where we can accept 1500 bytes just fine, but transmit is limited to 1280.

When small MTU is too small

photo by NH53 by/2.0

This changed recently, when we started working on Cloudflare Spectrum support for UDP. Spectrum is a terminating proxy, able to handle protocols other than HTTP. Getting Spectrum to forward TCP was relatively straightforward (barring couple of awesome hacks). UDP is different.

One of the major issues we hit was related to the MTU on our servers.

During tests we wanted to forward UDP VPN packets through Spectrum. As you can imagine, any VPN would encapsulate a packet in another packet. Spectrum received packets like this:

 +---------------------+------------------------------------------------+
 +  IPv6 + UDP header  |  UDP payload encapsulating a 1280 byte packet  |
 +---------------------+------------------------------------------------+

It's pretty obvious, that our edge servers supporting IPv6 packets of max 1280 bytes won't be able to handle this type of traffic. We are going to need at least 1280+40+8 bytes MTU! Hardcoding MTU=1280 in IPv6 may be acceptable solution if you are an end-node on the internet, but is definitely too small when forwarding tunneled traffic.

Picking a new MTU

But what MTU value should we use instead? Let's see what other major internet companies do. Here is a couple of examples of advertised MSS values in TCP SYN+ACK packets over IPv6:

+---- site -----+- MSS --+- estimated MTU -+
| google.com    |  1360  |    1420         |
+---------------+--------+-----------------+
| facebook.com  |  1410  |    1470         |
+---------------+--------+-----------------+
| wikipedia.org |  1440  |    1500         |
+---------------+--------+-----------------+

I believe Google and Facebook adjust their MTU due to their use of L4 load balancers. Their implementations do IP-in-IP encapsulation so need a bit of space for the header. Read more:

Google - Maglev
Facebook - Katran

There may be other reasons for having a smaller MTU. A reduced value may decrease the probability of the Path MTU detection algorithm kicking in (ie: relying on ICMP PTB). We can theorize that for the misconfigured eyeballs:

MTU=1280 will never run Path MTU detection
MTU=1500 will always run it.
In-between values would have increasing different chances of hitting the problem.

But just what is the chance of that?

A quick unscientific study of the MSS values we encountered from eyeballs shows the following distributions. For connections going over IPv4:

IPv4 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cummulative
  1300 |                                                 *  1.28%   98.53%
  1360 |                                              ****  4.40%   95.68%
  1370 |                                                 *  1.15%   91.05%
  1380 |                                               ***  3.35%   89.81%
  1400 |                                          ********  7.95%   84.79%
  1410 |                                                 *  1.17%   76.66%
  1412 |                                              ****  4.58%   75.49%
  1440 |                                            ******  6.14%   65.71%
  1452 |                                      ************ 11.50%   58.94%
  1460 |************************************************** 47.09%   47.34%

Assuming the majority of clients have MSS configured right, we can say that 89.8% of connections advertised MTU=1380+40=1420 or higher. 75% had MTU >= 1452.

For IPv6 connections we saw:

IPv6 eyeball advertised MSS in SYN:
 value |-------------------------------------------------- count cummulative
  1220 |                                               ***  4.21%   99.96%
  1340 |                                                **  3.11%   93.23%
  1362 |                                                 *  1.31%   87.70%
  1368 |                                               ***  3.38%   86.36%
  1370 |                                               ***  4.24%   82.98%
  1380 |                                               ***  3.52%   78.65%
  1390 |                                                 *  2.11%   75.10%
  1400 |                                               ***  3.89%   72.25%
  1412 |                                               ***  3.64%   68.21%
  1420 |                                                 *  2.02%   64.54%
  1440 |************************************************** 54.31%   54.34%

On IPv6 87.7% connections had MTU >= 1422 (1362+60). 75% had MTU >= 1450. (See also: MTU distribution of DNS servers).

Interpretation

Before we move on it's worth reiterating the original problem. Each connection from an eyeball to our Anycast network has three numbers related to it:

Client advertised MTU - seen in MSS option in TCP header
True Path MTU value - generally unknown until measured
Our Edge server MTU - value we are trying to optimize in this exercise

(This is a slight simplification, paths on the internet aren't symmetric so the path from eyeball to Cloudflare could have different Path MTU than the reverse path.)

In order for the connection to misbehave, three conditions must be met:

Client advertised MTU must be "wrong", that is: larger than True Path MTU
Our edge server must be willing to send such large packets: Edge server MTU >= True Path MTU
The ICMP PTB messages must fail to be delivered to our edge server - preventing Path MTU detection from working.

The last condition could occur for one of the reasons:

the routers on the path are misbehaving and perhaps firewalling ICMP
due to the asymmetric nature of the internet the ICMP back is routed to the wrong Anycast datacenter
something is wrong on our side, for example pmtud process fails

In the past we limited our Edge Server MTU value to the smallest possible, to make sure we never encounter the problem. Due to the development of Spectrum UDP support we must increase the Edge Server MTU, while still minimizing the probability of the issue happening.

Finally, relying on ICMP PTB messages for a large fraction of traffic is a bad idea. It's easy to imagine the cost this induces: even with Path MTU detection working fine, the affected connection will suffer a hiccup. A couple of large packets will be dropped before the reverse ICMP will get through and reconfigure the saved Path MTU value. This is not optimal for latency.

Progress

In recent days we increased the IPv6 MTU. As part of the process we could have chosen 1300, 1350, or 1400. We choose 1400 because we think it's the next best value to use after 1280. With 1400 we believe 93.2% of IPv6 connections will not need to rely on Path MTU Detection/ICMP. In the near future we plan to increase this value further. We won't settle on 1500 though - we want to leave a couple of bytes for IPv4 encapsulation, to allow the most popular tunnels to keep working without suffering poor latency when Path MTU Detection kicks in.

Since the rollout we've been monitoring Icmp6InPktTooBigs counters:

$ nstat -az | grep Icmp6InPktTooBigs
Icmp6InPktTooBigs               738748             0.0

Here is a chart of the ICMP PTB packets we received over last 7 days. You can clearly see that when the rollout started, we saw a large increase in PTB ICMP messages (Y label - packet count - deliberately obfuscated):

Interestingly the majority of the ICMP packets are concentrated in our Frankfurt datacenter:

We estimate that in our Frankfurt datacenter, we receive ICMP PTB message on 2 out of every 100 IPv6 TCP connections. These seem to come from only a handful of ASNs:

AS6830 - Liberty Global Operations B.V.
AS20825- Unitymedia NRW GmbH
AS31334 - Vodafone Kabel Deutschland GmbH
AS29562 - Kabel BW GmbH

These networks send to us ICMP PTB messages, usually informing that their MTU is 1280. For example:

$ sudo tcpdump -tvvvni eth0 icmp6 and ip6[40+0]==2 
IP6 2a02:908:xxx > 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2a02:810d:xx > 2400:xxx ICMP6, packet too big, mtu 1280, length 1240
IP6 2001:ac8:xxx > 2400:xxx ICMP6, packet too big, mtu 1390, length 1240

Final thoughts

Finally, if you are an IPv6 user with a weird MTU and have misconfigured MSS - basically if you are doing tunneling - please let us know of any issues. We know that debugging MTU issues is notoriously hard. To aid that we created an online fragmentation and ICMP delivery test. You can run it:

IPv6 version: http://icmpcheckv6.popcount.org
(for completeness, we also have an IPv4 version)

If you are a server operator running IPv6 applications, you should not worry. In most cases leaving the MTU at default 1500 is a good choice and should work for the majority of connections. Just remember to allow ICMP PTB packets on the firewall and you should be good. If you serve variety of IPv6 users and need to optimize latency, you may consider choosing a slightly smaller MTU for outbound packets, to reduce the risk of relying on Path MTU Detection / ICMP.

Low level network tuning sound interesting? Join our world famous team in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland.

Enable Private DNS with 1.1.1.1 on Android 9 Pie

Stephen Pinkerton — Thu, 16 Aug 2018 15:01:15 GMT

Recently, Google officially launched Android 9 Pie, which includes a slew of new features around digital well-being, security, and privacy. If you’ve poked around the network settings on your phone while on the beta or after updating, you may have noticed a new Private DNS Mode now supported by Android.

This new feature simplifies the process of configuring a custom secure DNS resolver on Android, meaning parties between your device and the websites you visit won’t be able to snoop on your DNS queries because they’ll be encrypted. The protocol behind this, TLS, is also responsible for the green lock icon you see in your address bar when visiting websites over HTTPS. The same technology is useful for encrypting DNS queries, ensuring they cannot be tampered with and are unintelligible to ISPs, mobile carriers, and any others in the network path between you and your DNS resolver. These new security protocols are called DNS over HTTPS, and DNS over TLS.

Configuring 1.1.1.1

Android Pie only supports DNS over TLS. To enable this on your device:

Go to Settings → Network & internet → Advanced → Private DNS.
Select the Private DNS provider hostname option.
Enter 1dot1dot1dot1.cloudflare-dns.com and hit Save.
Visit 1.1.1.1/help (or 1.0.0.1/help) to verify that “Using DNS over TLS (DoT)” shows as “Yes”.

And you’re done!

Why Use Private DNS?

So how do DNS over HTTPS and DNS over TLS fit into the current state of internet privacy?

TLS is the protocol that encrypts your traffic over an untrusted communication channel, like when browsing your email on a cafe’s wireless network. Even with TLS, there is still no way of knowing if your connection to the DNS server has been hijacked or is being snooped on by a third party. This is significant because a bad actor could configure an open WiFi hotspot in a public place that responds to DNS queries with falsified records in order to hijack connections to common email providers and online banks. DNSSEC solves the problem of guaranteeing authenticity by signing responses, making tampering detectable, but leaves the body of the message readable by anyone else on the wire.

DNS over HTTPS / TLS solves this. These new protocols ensure that communication between your device and the resolver is encrypted, just like we’ve come to expect of HTTPS traffic.

However, there is one final insecure step in this chain of events: the revealing of the SNI (server name indication) during the initial TLS negotiation between your device and a specific hostname on a server. The requested hostname is not encrypted, so third parties still have the ability to see the websites you visit. It makes sense that the final step in completely securing your browsing activity involves encrypting SNI, which is an in-progress standard that Cloudflare has joined other organizations to define and promote.

DNS in an IPv6 World

You may have noticed that the private DNS field does not accept an IP address like 1.1.1.1 and instead wants a hostname like 1dot1dot1dot1.cloudflare-dns.com. This doesn’t exactly roll off the tongue, so we’re working on deploying an easier to remember address for the resolver, and will continue to support 1.1.1.1, 1.0.0.1, and 1dot1dot1dot1.cloudflare-dns.com.

Google requires a hostname for this field because of how mobile carriers are adapting to a dual-stack world in which IPv4 and IPv6 coexist. Companies are adopting IPv6 much more rapidly than generally expected, and all major mobile carriers in the US support it, including T-Mobile who has gone completely IPv6. In a world where the approximately 26 billion internet-connected devices vastly outnumber the 4.3 billion IPv4 addresses, this is good news. And in a forward-thinking move, Apple requires that all new iOS apps must support single-stack IPv6 networks.

However, we still live in a world with IPv4 addresses, so phone manufacturers and carriers have to architect their systems with backwards compatibility in mind. Currently, iOS and Android request both A and AAAA DNS records, which contain the IP address(es) corresponding to a domain in version 4 and version 6 format, respectively. Try it out yourself:

$ dig A +short 1dot1dot1dot1.cloudflare-dns.com
1.0.0.1
1.1.1.1

$ dig AAAA +short 1dot1dot1dot1.cloudflare-dns.com
2606:4700:4700::1001
2606:4700:4700::1111

To talk to a device with only an IPv4 address over an IPv6 only network, the DNS resolver has to translate IPv4 addresses into the IPv6 address using DNS64. The requests to those translated IP addresses then go through the NAT64 translation service provided by the network operator. This is all completely transparent to the device and web server.

Learn more about this process from APNIC.

IPv6 in China

Tom Paseka — Thu, 19 Jul 2018 00:03:37 GMT

Photo by chuttersnap / Unsplash

At the end of 2017, Xinhua reported that there will be 200 Million IPv6 users inside Mainland China by the end of this year. Halfway into the year, we’re seeing a rapid growth in IPv6 users and traffic originating from Mainland China.

Why does this matter?

IPv6 is often referred to the next generation of IP addressing. The reality is, IPv6 is what is needed for addressing today. Taking the largest mobile network in China today, China Mobile has over 900 Million mobile subscribers and over 670 Million 4G/LTE subscribers. To be able to provide service to their users, they need to provide an IP address to each subscriber’s device. This means close to a billion IP addresses would be required, which is far more than what is available in IPv4, especially as the available IP address pools have been exhausted.

What is the solution?

To solve the addressability of clients, many networks, especially mobile networks, will use Carrier Grade NAT (CGN). This allows thousands, possibly up to hundreds of thousands, of devices to be shared behind a single internet IP address. The CGN equipment can be very expensive to scale and further, given the scale of the networks, they might need to layer CGNs behind other CGNs. This increases costs per subscriber, can reduce performance and makes scaling very challenging. A further solution, NAT64, allows IPv6 addresses to be given to subscribers, but then translated to IPv4 addresses similar to other NATs. This allows networks and ISPs to begin deploying IPv6 to subscribers, a first step in transition to IPv6.

IPv6 IPv6 IPv6!

Announcements IPv6 address blocks from China Mobile. Source: Hurricane Electric

On June 7, China Mobile started to announce IPv6 address blocks to the Internet at large. At the same time, Cloudflare started seeing traffic being exchanged with China Mobile users over IPv6 connections.

IPv4 to IPv6 percentage of traffic as seen from Cloudflare to AS9808 China Mobile’s Guangdong network.

Throughout the past 45 days, we’ve seen more and more IPv6 address blocks being announced to the internet, along with very aggressive usage. Interestingly this all started on-or-around June 8th 2018 (seven years to the day from World IPv6 Day)

It’s natural to see traffic graphs like this go up; then down after a while. This could indicate there’s some testing still going on with the deployment. We fully expect that the traffic percentage will climb back up once this is fully rolled out.

It’s fantastic to see the IPv6 enablement! We congratulate China Mobile on their successful enablement going forward.

eBPF, Sockets, Hop Distance and manually writing eBPF assembly

Marek Majkowski — Thu, 29 Mar 2018 10:43:38 GMT

A friend gave me an interesting task: extract IP TTL values from TCP connections established by a userspace program. This seemingly simple task quickly exploded into an epic Linux system programming hack. The result code is grossly over engineered, but boy, did we learn plenty in the process!

CC BY-SA 2.0 image by Paul Miller

Context

You may wonder why she wanted to inspect the TTL packet field (formally known as "IP Time To Live (TTL)" in IPv4, or "Hop Count" in IPv6)? The reason is simple - she wanted to ensure that the connections are routed outside of our datacenter. The "Hop Distance" - the difference between the TTL value set by the originating machine and the TTL value in the packet received at its destination - shows how many routers the packet crossed. If a packet crossed two or more routers, we know it indeed came from outside of our datacenter.

It's uncommon to look at TTL values (except for their intended purpose of mitigating routing loops by checking when the TTL reaches zero). The normal way to deal with the problem we had would be to blocklist IP ranges of our servers. But it’s not that simple in our setup. Our IP numbering configuration is rather baroque, with plenty of Anycast, Unicast and Reserved IP ranges. Some belong to us, some don't. We wanted to avoid having to maintain a hard-coded blocklist of IP ranges.

The gist of the idea is: we want to note the TTL value from a returned SYN+ACK packet. Having this number we can estimate the Hop Distance - number of routers on the path. If the Hop Distance is:

zero: we know the connection went to localhost or a local network.

one: connection went through our router, and was terminated just behind it.

two: connection went through two routers. Most possibly our router, and one just near to it.

For our use case, we want to see if the Hop Distance was two or more - this would ensure the connection was routed outside the datacenter.

Not so easy

It's easy to read TTL values from a userspace application, right? No. It turns out it's almost impossible. Here are the theoretical options we considered early on:

A) Run a libpcap/tcpdump-like raw socket, and catch the SYN+ACK's manually. We ruled out this design quickly - it requires elevated privileges. Also, raw sockets are pretty fragile: they can suffer packet loss if the userspace application can’t keep up.

B) Use the IP_RECVTTL socket option. IP_RECVTTL requests a "cmsg" data to be attached to control/ancillary data in a recvmsg() syscall. This is a good choice for UDP connections, but this socket option is not supported by TCP SOCK_STREAM sockets.

Extracting the TTL is not so easy.

SO_ATTACH_FILTER to rule the world!

CC BY-SA 2.0 image by Lee Jordan

Wait, there is a third way!

You see, for quite some time it has been possible to attach a BPF filtering program to a socket. See socket(7)

SO_ATTACH_FILTER (since Linux 2.2), SO_ATTACH_BPF (since Linux 3.19)
    Attach a classic BPF (SO_ATTACH_FILTER) or an extended BPF
    (SO_ATTACH_BPF) program to the socket for use as a filter of
    incoming packets.  A packet will be dropped if the filter pro‐
    gram returns zero.  If the filter program returns a nonzero
    value which is less than the packet's data length, the packet
    will be truncated to the length returned.  If the value
    returned by the filter is greater than or equal to the
    packet's data length, the packet is allowed to proceed unmodi‐
    fied.

You probably take advantage of SO_ATTACH_FILTER already: This is how tcpdump/wireshark does filtering when you're dumping packets off the wire.

How does it work? Depending on the result of a BPF program, packets can be filtered, truncated or passed to the socket without modification. Normally SO_ATTACH_FILTER is used for RAW sockets, but surprisingly, BPF filters can also be attached to normal SOCK_STREAM and SOCK_DGRAM sockets!

We don't want to truncate packets though - we want to extract the TTL. Unfortunately with Classical BPF (cBPF) it's impossible to extract any data from a running BPF filter program.

eBPF and maps

This changed with modern BPF machinery, which includes:

modernised eBPF bytecode
eBPF maps
bpf() syscall
SO_ATTACH_BPF socket option

eBPF bytecode can be thought of as an extension to Classical BPF, but it's the extra features that really let it shine.

The gem is the "map" abstraction. An eBPF map is a thingy that allows an eBPF program to store data and share it with a userspace code. Think of an eBPF map as a data structure (a hash table most usually) shared between a userspace program and an eBPF program running in kernel space.

To solve our TTL problem, we can use eBPF filter program. It will look at the TTL values of passing packets, and save them in an eBPF map. Later, we can inspect the eBPF map and analyze the recorded values from userspace.

SO_ATTACH_BPF to rule the world!

To use eBPF we need a number of things set up. First, we need to create an "eBPF map". There are many specialized map types, but for our purposes let's use the "hash" BPF_MAP_TYPE_HASH type.

We need to figure out the "bpf(BPF_MAP_CREATE, map type, key size, value size, limit, flags)" parameters. For our small TTL program, let's set 4 bytes for key size, and 8 byte value size. The max element limit is set to 5. It doesn't matter, we expect all the packets in one connection to have just one coherent TTL value anyway.

This is how it would look in a Golang code:

bpfMapFd, err := ebpf.NewMap(ebpf.Hash, 4, 8, 5, 0)

A word of warning is needed here. BPF maps use the "locked memory" resource. With multiple BPF programs and maps, it's easy to exhaust the default tiny 64 KiB limit. Consider bumping this with ulimit -l, for example:

ulimit -l 10240

The bpf() syscall returns a file descriptor pointing to the kernel BPF map we just created. With it handy we can operate on a map. The possible operations are:

bpf(BPF_MAP_LOOKUP_ELEM, )
bpf(BPF_MAP_UPDATE_ELEM, , , )
bpf(BPF_MAP_DELETE_ELEM, )
bpf(BPF_MAP_GET_NEXT_KEY, )

BPF calling convention

Before we go further we should highlight couple of things about the eBPF environment. The official kernel documentation isn't too friendly:

Documentation/networking/filter.txt

The first important bit to know, is the calling convention:

R0 - return value from in-kernel function, and exit value for eBPF program
R1-R5 - arguments from eBPF program to in-kernel function
R6-R9 - callee saved registers that in-kernel function will preserve
R10 - read-only frame pointer to access stack

When the BPF is started, R1 contains a pointer to ctx. This data structure is defined as struct __sk_buff. For example, to access the protocol field you'd need to run:

r0 = *(u32 *)(r1 + 16)

Or in other words:

ebpf.BPFIDstOffSrc(ebpf.LdXW, ebpf.Reg0, ebpf.Reg1, 16),

Which is exactly what we do in first line of our program, since we need to choose between IPv4 or IPv6 code branches.

Accessing the BPF payload

Next, there are special instructions for packet payload loading. Most BPF programs (but not all!) run in the context of packet filtering, so it makes sense to accelerate data lookups by having magic opcodes for accessing packet data.

Instead of dereferencing context, like ctx->data[x] to load a byte, BPF supports the BPF_LD instruction that can do it in one operation. There are caveats though, the documentation says:

eBPF has two non-generic instructions: (BPF_ABS |  | BPF_LD) and
(BPF_IND |  | BPF_LD) which are used to access packet data.

They had to be carried over from classic BPF to have strong performance of
socket filters running in eBPF interpreter. These instructions can only
be used when interpreter context is a pointer to 'struct sk_buff' and
have seven implicit operands. Register R6 is an implicit input that must
contain pointer to sk_buff. Register R0 is an implicit output which contains
the data fetched from the packet. Registers R1-R5 are scratch registers
and must not be used to store the data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.

In other words: before calling BPF_LD we must move ctx to R6, like this:

ebpf.BPFIDstSrc(ebpf.MovSrc, ebpf.Reg6, ebpf.Reg1),

Then we can call the load:

ebpf.BPFIImm(ebpf.LdAbsB, int32(-0x100000+7)),

At this stage the result is in r0, but we must remember the r1-r5 should be considered dirty. For an instruction the BPF_LD looks very much like a function call.

Magical Layer 3 offset

Next note the load offset - we loaded the -0x100000+7 byte of the packet. This magic offset is another BPF context curiosity. It turns out that the BPF script loaded under SO_ATTACH_BPF on a SOCK_STREAM (or SOCK_DGRAM) socket, will only see Layer 4 and higher OSI layers by default. To extract the TTL we need access to the layer 3 header (i.e. the IP header). To access L3 in the L4 context, we must offset the data lookups by magical -0x100000.

This magic constant is defined in the kernel.

For completeness, the +7 is, of course, the offset of the TTL field in an IPv4 packet. Our small BPF program also supports IPv6 where the TTL/Hop Count is at offset +8.

Return value

Finally, the return value of the BPF program is meaningful. In the context of packet filtering it will be interpreted as a truncated packet length.Had we returned 0 - the packet would be dropped and wouldn't be seen by the userspace socket application. It's quite interesting that we can do packet-based data manipulation with eBPF on a stream-based socket. Anyway, our script returns -1, which when cast to unsigned will be interpreted as a very large number:

ebpf.BPFIDstImm(ebpf.MovImm, ebpf.Reg0, -1),
ebpf.BPFIOp(ebpf.Exit),

Extracting data from map

Our running BPF program will set a key on our map for any matched packet. The key is the recorded TTL value, the value is the packet count. The value counter is somewhat vulnerable to a tiny race condition, but it's ignorable for our purposes. Later on, to extract the data from userspace program we use this Golang loop:

var (
	value   MapU64
	k1, k2  MapU32
)

for {
	ok, err := bpfMap.Get(k1, &value, 8)
	if ok {
		// k1 is TTL, value is counter
		...
	}

	ok, err = bpfMap.GetNextKey(k1, &k2, 4)
	if err != nil || ok == false {
		break
	}
	k1 = k2
}

Putting it all together

Now with all the pieces ready we can make it a proper runnable program. There is little point in discussing it here, so allow me to refer to the source code. The BPF pieces are here:

ebpf.go

We haven't discussed how to catch inbound SYN+ACK in the BPF program. This is a matter of setting up BPF before calling connect(). Sadly, it's impossible to customize net.Dial in Golang. Instead we wrote a surprisingly painful and awful custom Dial implementation. The ugly custom dialer code is here:

magic_conn.go

To run all this you need kernel 4.4+ Kernel with the bpf() syscall compiled in. BPF features of specific kernels are documented in this superb page from BCC:

docs/kernel-versions.md

Run the code to observe the TTL Hop Counts:

$ ./ttl-ebpf tcp4://google.com:80 tcp6://google.com:80 \
             tcp4://cloudflare.com:80 tcp6://cloudflare.com:80
[+] TTL distance to tcp4://google.com:80 172.217.4.174 is 6
[+] TTL distance to tcp6://google.com:80 [2607:f8b0:4005:809::200e] is 4
[+] TTL distance to tcp4://cloudflare.com:80 198.41.215.162 is 3
[+] TTL distance to tcp6://cloudflare.com:80 [2400:cb00:2048:1::c629:d6a2] is 3

Takeaways

In this blog post we dived into the new eBPF machinery, including the bpf() syscall, maps and SO_ATTACH_BPF. This work allowed me to realize the potential of running SO_ATTACH_BPF on fully established TCP sockets. Undoubtedly, eBPF still requires plenty of love and documentation, but it seems to be a perfect bridge to expose low level toggles to userspace applications.

I highly recommend keeping the dependencies small. For small BPF programs, like the one shown, there is little need for complex clang compilation and ELF loading. Don't be afraid of the eBPF bytecode!

We only touched on SO_ATTACH_BPF, where we analyzed network packets with BPF running on network sockets. There is more! First, you can attach BPFs to a dozen "things", XDP being the most obvious example. Full list. Then, it's possible to actually affect kernel packet processing, here is a full list of helper functions, some of which can modify kernel data structures.

In February LWN jokingly wrote:

Developers should be careful, though; this could
prove to be a slippery slope leading toward something 
that starts to look like a microkernel architecture.

There is a grain of truth here. Maybe the ability to run eBPF on variety of subsystems feels like microkernel coding, but definitely the SO_ATTACH_BPF smells like STREAMS programming model from 1984.

Thanks to Gilberto Bertin and David Wragg for helping out with the eBPF bytecode.

Doing eBPF work sound interesting? Join our world famous team in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland.

2018 and the Internet: our predictions

John Graham-Cumming — Thu, 21 Dec 2017 14:01:43 GMT

At the end of 2016, I wrote a blog post with seven predictions for 2017. Let’s start by reviewing how I did.

Public Domain image by Michael Sharpe

I’ll score myself with two points for being correct, one point for mostly right and zero for wrong. That’ll give me a maximum possible score of fourteen. Here goes...

2017-1: 1Tbps DDoS attacks will become the baseline for ‘massive attacks’

This turned out to be true but mostly because massive attacks went away as Layer 3 and Layer 4 DDoS mitigation services got good at filtering out high bandwidth and high packet rates. Over the year we saw many DDoS attacks in the 100s of Gbps (up to 0.5Tbps) and then in September announced Unmetered Mitigation. Almost immediately we saw attackers stop bothering to attack Cloudflare-protected sites with large DDoS.

So, I’ll be generous and give myself one point.

2017-2: The Internet will get faster yet again as protocols like QUIC become more prevalent

Well, yes and no. QUIC has become more prevalent as Google has widely deployed it in the Chrome browser and it accounts for about 7% of Internet traffic. At the same time the protocol is working its way through the IETF standardization process and has yet to be deployed widely outside Google.

So, I’ll award myself one point for this as QUIC did progress but didn’t get as far as I thought.

2017-3: IPv6 will become the defacto for mobile networks and IPv4-only fixed networks will be looked upon as old fashioned

IPv6 continued to grow throughout 2017 and seems to be on a pretty steady trajectory upwards. Although it’s not yet deployed on ¼ of the top 25,000 web sites. Note that the large jump in IPv6 support that occurred in the middle of 2016 when Cloudflare enabled it by default for all our customers.

The Internet Society reported that mobile networks that switch to IPv6 see 70-95% of their traffic use IPv6. Google reports that traffic from Verizon is now 90% IPv6 and T-Mobile is turning off IPv4 completely.

Here I’ll award myself two points.

2017-4: A SHA-1 collision will be announced

That happened on 23 February 2017 with the announcement of an efficient way to generate colliding PDF documents. It’s so efficient that here are two PDFs containing the old and new Cloudflare logos. I generated these two PDFs using a web site that takes two JPEGs, embeds them in two PDFs and makes them collide. It does this instantly.

They have the same SHA-1 hash:

$ shasum *.pdf
e1964edb8bcafc43de6d1d99240e80dfc710fbe1  a.pdf
e1964edb8bcafc43de6d1d99240e80dfc710fbe1  b.pdf

But different SHA-256 hash:

$ shasum -a256 *.pdf
8e984df6f4a63cee798f9f6bab938308ebad8adf67daba349ec856aad07b6406  a.pdf
f20f44527f039371f0aa51bc9f68789262416c5f2f9cefc6ff0451de8378f909  b.pdf

So, two points for getting that right (and thanks, Nick Sullivan, for suggesting it and making me look smart).

2017-5: Layer 7 attacks will rise but Layer 6 won’t be far behind

The one constant of 2017 in terms of DDoS was the prevalence of Layer 7 attacks. Even as attackers decided that large scale Layer 3 and 4 DDoS attacks were being mitigated easily and hence stopped performing them so frequently, Layer 7 attacks continued apace with attacks in the 100s of krps common place.

Awarding myself one point because Layer 6 attacks didn’t materialize as much as predicted.

2017-6: Mobile traffic will account for 60% of all Internet traffic by the end of the year

Ericsson reported mid-year that mobile data traffic was continuing to grow strongly and grew 70% between Q116 and Q117. Stats show that while mobile traffic continued to increase its share of Internet traffic and passed 50% in 2017 it didn’t reach 60%.

Zero points for me.

2017-7: The security of DNS will be taken seriously

This has definitely happened. The 2016 Dyn DNS attack was a wake up call that often overlooked infrastructure was at risk of DDoS attack. In April 2017 Wired reported that hackers took over 36 Brazilian banking web sites by hijacking DNS registration, and in June Mozilla and ICANN proposed encrypting DNS by sending it over HTTPS and the IETF has a working group on what’s now being called doh.

DNSSEC deployment continued with SecSpider showing steady, continuous growth during 2017.

So, two points for me.

Overall, I scored myself a total of 9 out of 14, or 64% right. With that success rate in mind here are my predictions for 2018.

2018 Predictions

2018-1: By the end of 2018 more than 50% of HTTPS connections will happen over TLS 1.3

The roll out of TLS 1.3 has been stalled because of difficulty in getting it working correctly in the heterogenous Internet environment. Although Cloudflare has had TLS 1.3 in production and available for all customers for over a year only 0.2% of our traffic is currently using that version.

Given the state of standardization of TLS 1.3 today we believe that major browser vendors will enable TLS 1.3 during 2018 and by the end of the year more than 50% of HTTPS connections will be using the latest, most secure version of TLS.

2018-2: Vendor lock-in with Cloud Computing vendors becomes dominant worry for enterprises

In Mary Meeker’s 2017 Internet Trends report she gives on statistics (slide 183) on the top three concerns of users of cloud computing. These show a striking change from being primarily about security and cost to worries about vendor lock-in and compliance. Cloudflare believes that vendor lock-in will become the top concern of users of cloud computing in 2018 and that multi-cloud strategies will become common.

BillForward is already taking a multi-cloud approach with Cloudflare moving traffic dynamically between cloud computing providers. Alongside vendor lock-in, users will name data portability between clouds as a top concern.

2018-3: Deep learning hype will subside as self-driving cars don't become a reality but AI/ML salaries will remain high

Self-driving cars won’t become available in 2018, but AI/ML will remain red hot as every technology company tries to hire appropriate engineering staff and finds they can’t. At the same time deep learning techniques will be widely applied across companies and industries as it becomes clear that these techniques are not limited to game playing, classification, or translation tasks and can be widely applied.

Expect unexpected applications of techniques, that are already in use in Silicon Valley, when they are applied to the rest of the world. Don’t be surprised if there’s talk of AI/ML managed traffic management for highways, for example. Anywhere there's a heuristic we'll see AI/ML applied.

But it’ll take another couple of years for AI/ML to really have profound effects. By 2020 the talent pool will have greatly increased and manufacturers such as Qualcomm, nVidia and Intel will have followed Google’s lead and produced specialized chipsets designed for deep learning and other ML techniques.

2018-4: k8s becomes the dominant platform for cloud computing

A corollary to users’ concerns about cloud vendor lock-in and the need for multi-cloud capability is that an orchestration framework will dominate. We believe that Kubernetes will be that dominant platform and that large cloud vendors will work to ensure compatibility across implementations at the demand of customers.

We are currently in the infancy of k8s deployment with the major cloud computing vendors deploying incompatible versions. We believe that customer demand for portability will cause cloud computer vendors to ensure compatibility.

2018-5: Quantum resistant crypto will be widely deployed in machine-to-machine links across the internet

During 2017 Cloudflare experimented with, and open sourced, quantum-resistant cryptography as part of our implementation of TLS 1.3. Today there is a threat to the security of Internet protocols from quantum computers, and although the threat has not been realized, cryptographers are working on cryptographic schemes that will resist attacks from quantum computers when they arrive.

We predict that quantum-resistant cryptography will become widespread in links between machines and data centers especially where the connections being encrypted cross the public Internet. We don’t predict that quantum-resistant cryptography will be widespread in browsers, however.

2018-6: Mobile traffic will account for 60% of all Internet traffic by the end of the year

Based on the continued trend upwards in mobile traffic I’m predicting that 2018 (instead of 2017) will be the year mobile traffic shoots past 60% of overall Internet traffic. Fingers crossed.

2018-7: Stable BTC/USD exchanges will emerge as others die off from security-based Darwinism

The meteoric rise in the Bitcoin/USD exchange rate has been accompanied by a drumbeat of stories about stolen Bitcoins and failing exchanges. We believe that in 2018 the entire Bitcoin ecosystem will stabilize.

This will partly be through security-based Darwinism as trust in exchanges and wallets that have security problems plummets and those that survive have developed the scale and security to cope with the explosion in Bitcoin transactions and attacks on their services.

Less Is More - Why The IPv6 Switch Is Missing

Dani Grant — Thu, 25 May 2017 17:30:00 GMT

At Cloudflare we believe in being good to the Internet and good to our customers. By moving on from the legacy world of IPv4-only to the modern-day world where IPv4 and IPv6 are treated equally, we believe we are doing exactly that.

"No matter what happens in life, be good to people. Being good to people is a wonderful legacy to leave behind." - Taylor Swift (whose website has been IPv6 enabled for many many years)

Starting today with free domains, IPv6 is no longer something you can toggle on and off, it’s always just on.

How we got here

Cloudflare has always been a gateway for visitors on IPv6 connections to access sites and applications hosted on legacy IPv4-only infrastructure. Connections to Cloudflare are terminated on either IP version and then proxied to the backend over whichever IP version the backend infrastructure can accept.

That means that a v6-only mobile phone (looking at you, T-Mobile users) can establish a clean path to any site or mobile app behind Cloudflare instead of doing an expensive 464XLAT protocol translation as part of the connection (shaving milliseconds and conserving very precious battery life).

That IPv6 gateway is set by a simple toggle that for a while now has been default-on. And to make up for the time lost before the toggle was default on, in August 2016 we went back retroactively and enabled IPv6 for those millions of domains that joined before IPv6 was the default. Over those next few months, we enabled IPv6 for nearly four million domains –– you can see Cloudflare’s dent in the IPv6 universe below –– and by the time we were done, 98.1% of all of our domains had IPv6 connectivity.

As an interim step, we added an extra feature –– when you turn off IPv6 in our dashboard, we remind you just how archaic we think that is.

With close to 100% IPv6 enablement, it no longer makes sense to offer an IPv6 toggle. Instead, Cloudflare is offering IPv6 always on, with no off-switch. We’re starting with free domains, and over time we’ll change the toggle on the rest of Cloudflare paid-plan domains.

The Future: How Cloudflare and OpenDNS are working together to make IPv6 even faster and more globally deployed

In November we published stats about the IPv6 usage we see on the Cloudflare network in an attempt to answer who and what is pushing IPv6. The top operating systems by percent IPv6 traffic are iOS, ChromeOS, and MacOS respectively. These operating systems push significantly more IPv6 traffic than their peers because they use a routing choice algorithm called Happy Eyeballs. Happy Eyeballs opportunistically chooses IPv6 when available by doing two DNS lookups –– one for an IPv6 address (this IPv6 address is stored in the DNS AAAA record - pronounced quad-A) and then one for the IPv4 address (stored in the DNS A record). Both DNS queries are flying over the Internet at the same time and the client chooses the address that comes back first. The client even gives IPv6 a few milliseconds head start (iOS and MacOS give IPv6 lookups a 25ms head start for example) so that IPv6 may be chosen more often. This works and has fueled some of IPv6’s growth. But it has fallen short of the goal of a 100% IPv6 world.

While there are perfectly good historical reasons why IPv6 and IPv4 addresses are stored in separate DNS types, today clients are IP version agnostic and it no longer makes sense for it to require two separate round trips to learn what addresses are available to fetch a resource from.

Alongside OpenDNS, we are testing a new idea - what if you could ask for all the addresses in just one DNS query?

With OpenDNS, we are prototyping and testing just that –– a new DNS metatype that returns all available addresses in one DNS answer –– A records and AAAA records in one response. (A metatype is a query type in DNS that end users can’t add into their DNS zone file, it’s assembled dynamically by the authoritative nameserver.)

What this means is that in the future if a client like an iPhone wants to access a mobile app that uses Cloudflare DNS or using another DNS provider that supports the spec, the iPhone DNS client would only need to do one DNS lookup to find where the app’s API server is located, cutting the number of necessary round trips in half.

This reduces the amount of bandwidth on the DNS system, and pre-populates global DNS caches with IPv6 addresses, making IPv6 lookups faster in the future, with the side benefit that Happy Eyeballs clients prefer IPv6 when they can get the address quickly, which increases the amount of IPv6 traffic that flows through the Internet.

We have the metaquery working in code with the reserved TYPE65535 querytype. You can ask a Cloudflare nameserver for TYPE65535 of any domain on Cloudflare and get back all available addresses for that name.

$ dig cloudflare.com @ns1.cloudflare.com -t TYPE65535 +short
198.41.215.162
198.41.214.162
2400:cb00:2048:1::c629:d6a2
2400:cb00:2048:1::c629:d7a2
$

Did we mention Taylor Swift earlier?

$ dig taylorswift.com @ns1.cloudflare.com -t TYPE65535 +short
104.16.193.61
104.16.194.61
104.16.191.61
104.16.192.61
104.16.195.61
2400:cb00:2048:1::6810:c33d
2400:cb00:2048:1::6810:c13d
2400:cb00:2048:1::6810:bf3d
2400:cb00:2048:1::6810:c23d
2400:cb00:2048:1::6810:c03d
$

We believe in proving concepts in code and through the IETF standards process. We’re currently working on an experiment with OpenDNS and will translate our learnings to an Internet Draft we will submit to the IETF to become an RFC. We’re sure this is just the beginning to faster, better deployed IPv6.

98.01% of sites on Cloudflare now use IPv6

Martin J Levy — Mon, 21 Nov 2016 14:14:46 GMT

It's 2016 and almost every site using Cloudflare (more than 4 million of them) is using IPv6. Because of this, Cloudflare sees significant IPv6 traffic globally where networks have enabled IPv6 to the consumer.

The top IPv6 networks are shown here.

The chart shows the percentage of IPv6 within a specific network vs. the relative bandwidth of that network. We will talk about specific networks below.

Why IPv6? Because fast internet.

IPv6 is faster for two reasons. The first is that many major operating systems and browsers like iOS, macOS, Chrome and Firefox impose anywhere from a 25ms to 300ms artificial delay on connections made over IPv4. The second is that some mobile networks won’t need to perform extra v4 → v6 and v6 → v4 translations to connect visitors to IPv6 enabled sites if the phone is only assigned an IPv6 address. (IPv6-only phones are becoming very common. If you have a phone on T-Mobile, Telstra, SK Telecom, Orange, or EE UK, to name a few, it’s likely you’re v6-only.)

How much faster is IPv6? Our data shows that visitors connecting over IPv6 were able to connect and load pages in 27% less time than visitors connecting over IPv4. LinkedIn found an even more dramatic effect, with up to a 40% performance boost on mobile connections over IPv6. Facebook also found a significant performance increase, around 10-15% on IPv6.

Who and what is driving IPv6?

IPv6 is clearly important for driving a faster, better internet, so who is driving IPv6 adoption?

In terms of countries, Belgium leads by a mile. Over the last 30 days, 56.47% of traffic (in bytes) to Belgians on Cloudflare has been over IPv6. This is largely due to Telenet, an ISP in Belgium, doing almost 96.8% of their traffic over IPv6!

Belgium, 56.47%
Ireland, 31.75%*
Greece, 20.79%
Germany, 15.87%
Ecuador, 15.62%
Luxembourg, 15.51%
Portugal, 14.07%
Estonia, 13.75%
India, 11.84%
Peru, 10.57%

*The Irish numbers are artificially high due to several of Facebook’s IPv6 ranges being registered in Ireland. Facebook does an enormous amount of traffic over IPv6 –– 81% of Facebook’s traffic through Cloudflare is IPv6. In fact, Facebook’s crawling over IPv6 actually accounts for 6.9% of Cloudflare’s total outbound IPv6 traffic.

If you look just 6 months ago, less than 1% of India’s traffic was over IPv6. India has experienced an amazing rise in IPv6 connectivity, especially over the last month and a half.

The US is ranked 17th on the global list at 8.78%. And since we have a Canadian co-founder (and three Canadian PoPs), it hasn’t gone unnoticed at Cloudflare that Canada is just beating the US in 16th place at 9.14%.

A handful of networks are responsible for the majority of IPv6 traffic through Cloudflare. In fact, the top 10 networks by IPv6 traffic to and from Cloudflare are responsible for over half (55.4%) of Cloudflare’s IPv6 traffic. Those networks are:

The above chart shows which networks account for the highest percentage of IPv6 traffic through Cloudflare. In terms of which networks are promoting IPv6 on their own network traffic, these are the networks Cloudflare sees with the highest ratios of IPv6 to IPv4 traffic:

#	IPv6 %	ASN	NAME
1	100.0%	AS43447	Orange Polska
2	100.0%	AS23910	China Next Generation Internet CERNET2
3	100.0%	AS17419	HiNet IPv6 (Taiwan)
4	96.8%	AS6848	Telenet (Belgium)
5	91.5%	AS12271	Time Warner Cable
6	88.9%	AS3651	Sprint
7	81.0%	AS32934	Facebook
8	74.0%	AS54500	EGIHosting
9	65.9%	AS21321	Areti Internet
10	63.9%	AS3598	Microsoft
11	61.8%	AS4250	Alentus
12	60.3%	AS21928	T-Mobile USA
13	58.8%	AS22394	Verizon Wireless
14	57.6%	AS18126	Chubu Telecommunications Company
15	48.5%	AS5607	Sky (UK)
16	47.8%	AS16591	Google Fiber
17	44.6%	AS133481	AIS Fibre (Thailand)
18	43.6%	AS7018	AT&T
19	43.3%	AS6621	Hughes Network Systems
20	43.2%	AS15943	wilhelm.tel GmbH Norderstedt

In terms of devices, mobile traffic is 50% more likely to use IPv6 than desktop traffic (21.4% of mobile traffic uses IPv6 traffic, whereas only 13.6% of desktop traffic is over IPv6).

In mobile, iOS sends slightly more traffic than Android over IPv6, with iOS sending about a quarter of its traffic (23.5%) and Android sending about one fifth (18.7%).

For Windows, the newer the OS, the more traffic it sends over IPv6, with XP (2001) sending just 1.1% of requests over IPv6 and Windows 10 (2015) sending 18.7%, almost a fifth of all requests over IPv6.

Here’s the full breakdown of browsers and operating systems to explain the effect of OS and browser on the percentage of IPv6 traffic. In it, you can see some interesting things, for example, iOS and ChromeOS apps tend to use IPv6 a lot more than other operating systems. Also, Chrome on mobile sends about twice as much IPv6 traffic than Chrome on desktop.

Below you can see the net effect of using a specific browser or operating system on IPv6 traffic:

Even more interesting is that while only 10.97% of our total DNS traffic is over IPv6, 22.03% of our DNS packet floods are IPv6 as well as IPv4 (we see no IPv6-only attacks, all the attacks we see on IPv6 are also seen on IPv4).

Making the push for IPv6

At Cloudflare, we weren’t happy with only hundreds of thousands of IPv6-enabled websites. We wanted the millions and millions of Cloudflare customers to have IPv6. Over the last few months, we have been carefully enabling IPv6 for around 100,000 sites a day, all the time monitoring how our systems operated and how our traffic behaved. We also monitored customer support and of course, social media.

People noticed. People were happy. People posted what they saw!

Some with a long range view:

there is rapid growth in number of AAAA websites from 76K (08/2016) to 109K (10/2016) (source @dan_wing dataset: https://t.co/CRzN5TweKz ) pic.twitter.com/0KNhqBMFsS
— Vaibhav Bajpai (@bajpaivaibhav) October 26, 2016

Some zoomed into the last few months:

there is rapid growth in number of AAAA websites from 76K (08/2016) to 134K (11/2016) (source @dan_wing dataset: https://t.co/CRzN5TweKz) pic.twitter.com/ygFEVSNSps
— Vaibhav Bajpai (@bajpaivaibhav) November 15, 2016

In both of the above graphs, you can really see the impact of Cloudflare enabling IPv6 for all of our customers’ sites, starting in August and wrapping up this week. Below is one more public measurement from Eric Vyncke where you can see the effect on world IPv6 measurements:

The IETF and IPv6

As of a late 2016, the IETF’s Internet Architecture Board has taken one more step in its path towards v6 adoption. It's taken the bold step of saying that new protocols don't need to be v4 backward compatible. This can usher-in some great new protocol designs and solutions that simply don't need to live in v4's legacy world. At Cloudflare, we've always been aggressive with our desire to provide our customers cutting edge solutions. We can't wait to see what's cooking!

The growing IPv6 address usage

Along with the increased bandwidth observed, at Cloudflare, we are seeing a solid increase in unique IPv6 addresses seen.

In June number of IPv6 addresses (/64 masked) crossed IPv4 addresses connecting to @CloudFlare for the first time. pic.twitter.com/eTPFNoueHQ
— Matthew Prince (@eastdakota) August 17, 2016

Final thoughts

Cloudflare is committed to bringing the most up to date, fastest and most secure technologies to our customers. We are committed to IPv6, TLS 1.3 and ubiquitous DNSSEC. Just like we'd never go back to an unencrypted life, there's no going back to a v4-only world.

PS - Want to live an IPv6 life? We’re hiring.

Big thanks to Dani Grant, Igor Postelnik, Lukasz Mierzwa and Marty Strong.

Supporting the transition to IPv6-only networking services for iOS

Dragos Bogdan — Tue, 07 Jun 2016 18:55:29 GMT

Early last month Apple announced that all apps submitted to the Apple Store June 1 forward would need to support IPv6-only networking as they transition to IPv6-only network services in iOS 9. Apple reports that “Most apps will not require any changes”, as these existing apps support IPv6 through Apple's NSURLSession and CFNetwork APIs.

Our goal with IPv6, and any other emerging networking technology, is to make it ridiculously easy for our customers to make the transition. Over 2 years ago, we published Eliminating the last reasons to not enable IPv6 in celebration of World IPv6 Day. CloudFlare has been offering full IPv6 support as well as our IPv6-to-IPv4 gateway to all of our customers since 2012.

Why is the transition happening?

IPv4 represents a technical limitation, a hard stop to the number of devices that can access the Internet. When the Internet Protocol (IP) was first introduced by Vint Cerf and Bob Kahn in the late 1970s, Internet Protocol Version 4 (IPv4) used a 32-bit (four-byte) number, allowing about 4 billion unique addresses. At the time, IPv4 seemed more than sufficient to power the World Wide Web. On January 31, 2011, the top-level pool of Internet Assigned Numbers Authority (IANA) IPv4 addresses was officially exhausted. On September 24th 2015, The American Registry for Internet Numbers (ARIN) officially ran out of IPv4 addresses.

It was clear well before 2011 that 4 billion addresses would not be nearly enough for all of the people in the world—let alone all of the phones, tablets, TVs, cars, watches, and the upcoming slew of devices that would need to access the Internet. In 1998, the Internet Engineering Task Force (IETF) had formalized IPv4’s successor, IPv6. IPv6 uses a 128-bit address, which theoretically allows for approximately 340 trillion trillion trillion or 340,000,000,000,000,000,000,000,000,000,000,000,000 unique addresses.

So here we are nearly 20 years since IPv6 and… we’re at a little over 10% IPv6 adoption. :/

(source)

The good news, as the graph above indicates, is that the rate of IPv6 adoption has increased significantly; last year being a record year with a 4% increase, according to Google. The way Google derives these numbers is by having a small number of their users execute JavaScript code that tests whether a computer can load URLs over IPv6.

What’s up with the delay?

Transitioning from IPv4 to IPv6 is a complex thing to execute on a global scale. When data gets sent through the Internet, packets of data are sent using the IP protocol. Within each packet you have a number of elements besides the payload (the actual data you’re sending), including the source and destination.

In order for the transmission to get to its destination, each device passing it along (clients/servers, routers, firewalls, load balancers, etc) needs to be able to communicate with the other. Traditionally, these devices communicate with IPv4, the universal language on the Internet. IPv6 represents an entirely new language. In order for all of these devices to be able to communicate, they all need to talk IPv6 or have some sort of translator involved.

Translation requires technologies such as NAT64 and DNS64. NAT64 allow IPv6 hosts to communicate with IPv4 servers by creating a NAT-mapping between the IPv6 and the IPv4 address. DNS64 will synthesize AAAA records from A records. DNS64 has some known issues like DNSSEC validation failure (because the DNS server doing the translation is not the owner’s domain server). For service providers, supporting IPv4/IPv6 translation means providing separate IPv4 and IPv6 connectivity thus incurring additional complexity as well as additional operational and administrative costs.

Making the move

Using IP literals (hard coded IP addresses) in your code is a common pitfall to meeting Apple's IPv6 support requirement. Developers should check their configuration files for any IP literals and replace them with domain names. Literals should not be embedded in protocols either. Although literals may seem unavoidable when using certain low-level APIs in communications protocols like SIP, WebSockets, or P2PP, Apple offers high-level networking frameworks that are easy to implement and less error prone.

Stay away from network preflighting and instead simply attempt to connect to a network resource and gracefully handle failures. Preflighting often attempts to check for Internet connectivity by passing IP addresses to network reachability APIs. This is bad practice both in terms of introducing IP literals in your code and misusing reachability APIs.

For iOS developers, it’s important to review Supporting IPv6 DNS64/NAT64 Networks to ensure code compatibility. Within Apple’s documentation you’ll find a list of IPv4-specific APIs that need to be eliminated, IPv6 equivalents for IPv4 types, system APIs that can synthesize IPv6 addresses, as well as how to set up a local IPv6 DNS64/NAT64 network so you can regularly test for IPv6 DNS64/NAT64 compatibility.

CloudFlare offers a number of IPv6 features that developers can take advantage of during their migration. If your domain is running through CloudFlare, enabling IPv6 support is as simple as enabling IPv6 Compatibility in the dashboard.

Certain legacy IPv4 applications may be able to take advantage of CloudFlare's Pseudo IPv4. Pseudo IPv4 works by adding an HTTP header to requests established over IPv6 with a "pseudo" IPv4 address. Using a hashing algorithm, Pseudo IPv4 will create a Class E IPv4 address which always produces the same output for the same input; the same IPv6 address will always result in the same Pseudo IPv4 address. Using the Class E IP space, we have access to 268,435,456 possible unique IPv4 addresses.

Pseudo IPv4 offers 2 options: Add Header or Overwrite Headers. Add Header will automatically add a header (Cf-Pseudo-IPv4), that can be parsed by software as needed. Overwrite Headers will overwrite the existing Cf-Connecting-IP and X-Forwarded-For headers with a Pseudo IPv4 address. The overwrite option has the advantage (in most cases), of not requiring any software changes. If you choose the overwrite option, we'll append a new header (Cf-Connecting-IPv6) in order to ensure you can still find the actual connecting IP address for debugging.

For iOS developers under the gun to make the transition to IPv6, there are benefits to the move apart from compliance with Apple’s policy. Beyond the inherent security benefits of IPv6 like mandatory IPSec, many companies have seen performance gains as well. Real User Measurement studies conducted by Facebook show IPv6 made their site 10-15 percent faster and LinkedIn realized 40 percent gains on select mobile networks in Europe.

For domains currently running through CloudFlare, since we currently do not enable IPv6 by default you’ll want to go into your account and make sure IPv6 Compatibility is enabled under the Network tab. CloudFlare has been offering rock solid IPv6 since 2012 with one-click IPv6 provisioning, an IPv4-to-IPv6 translation gateway, Pseudo IPv4, and much more. For more information, be sure to check out our IPv6 page: https://www.cloudflare.com/ipv6/

Happy 5th Birthday, CloudFlare!

Matthew Prince — Mon, 28 Sep 2015 02:00:52 GMT

Today is September 27, 2015. It's a rare Super Blood Moon. And it's also CloudFlare's birthday. CloudFlare launched 5 years ago today. It was a Monday. While Michelle, Lee, and I had high expectations, we would never have imagined what's happened since then.

In the last five years we've stopped 7 trillion cyber attacks, saved more than 94,116 years worth of time, and served 99.4 trillion requests — nearly half of those in the last 6 months. You can learn more from this timeline of the last five years.

Celebrating by doing the impossible

Every year we like to celebrate our birthday by giving something seemingly impossible back to our users. Two years ago we enabled on our Automatic IPv6 Gateway, allowing our users to support IPv6 without having to update their own servers. Last year we made Universal SSL support available to all our customers, even those on our free plan. And this year, we announced the expansion across Mainland China, building the first truly global performance and security platform.

Internet Summit & Party

We celebrated in San Francisco last week with CloudFlare's first Internet Summit at our new San Francisco Headquarters with more than 500 of our customers and friends. Speakers discussed their visions for the challenges and opportunities for the Internet over the next five years. We'll be posting videos of those talks over the course of this week. In the meantime, here are some photographs from the Summit and the party later that night with one of our favorite bands, Walk Off the Earth.

Thanks everyone for your support over the last five years. As Michelle likes to say, we're just getting started.

Test all the things: IPv6, HTTP/2, SHA-2

John Graham-Cumming — Wed, 02 Sep 2015 10:15:44 GMT

NOTE: The https://http2.cloudflare.com site has been shut down. This blog was posted in 2015 and subsequently we took all the technologies here and made them part of our core product.

CloudFlare constantly tries to stay on the leading edge of Internet technologies so that our customers' web sites use the latest, fastest, most secure protocols. For example, in the past we've enabled IPv6 and SPDY/3.1.

Today we've switched on a test server that is open for people to test compatibility of web clients. It's a mirror of this blog and is served from https://http2.cloudflare.com/. The server uses three technologies that it may be helpful to test with: IPv4/IPv6, HTTP/2 and an SSL certificate that uses SHA-2 for its signature.

The server has both IPv4 and IPv6 addresses.

$ dig +short http2.cloudflare.com A
45.55.83.207
$ dig +short http2.cloudflare.com AAAA
2604:a880:800:10:5ca1:ab1e:f4:e001

The certificate is based on SHA-2 (in this case SHA-256). This is important because SHA-1 is being deprecated by some browsers very soon. On a recent browser the connection will also be secured using ECDHE (for forward secrecy).

And, finally, the server uses HTTP/2 if the browser is capable. For example, in Google Chrome, with the HTTP/2 and SPDY indicator extension the blue lightning bolt indicates that the page was served using HTTP/2:

This server isn't on the normal CloudFlare network and is intended for testing purposes only. We'll endeavor to keep it online so that people have an HTTP/2 endpoint to test clients against.

In Google Chrome's net-internals view you can see the HTTP/2 session in use when connecting to the site.

We hope that this will prove useful for people tests HTTP/2 compatible client software. Let us know how it goes.

iOS Developers — Migrate to iOS 9 with CloudFlare

Nick Sullivan — Thu, 11 Jun 2015 10:31:29 GMT

Thousands of developers use CloudFlare to accelerate and secure the backend of their mobile applications and websites. This week is Apple’s Worldwide Developers Conference (WWDC), where thousands of Apple developers come to San Francisco to talk, learn and share best practices for developing software for Apple platforms. New announcements from Apple this week make CloudFlare an even more obvious choice for application developers.

New operating systems, new application requirements

The flagship announcement of WWDC 2015 was a new version of Apple’s mobile operating system, iOS 9, to be released in September with a developer preview available now. They also announced a new Mac operating system, OS X El Capitan, launching in the fall. Apple has a track record of developing and supporting technologies that enhance user privacy and security with iMessage and Facetime and the trend is continuing with these new operating systems. In both cases, Apple is requiring application developers to make use of two network technologies that CloudFlare is big fan of: HTTPS and IPv6.

For iOS 9 and El Capitan, all applications submitted to the iOS and Mac App Stores must work over IPv6. In previous versions, applications were allowed that only worked with IPv4.

From [Sebastien Marineau, Apple’s VP of Core OS](https://developer.apple.com/v219 commitsideos/wwdc/2015/?id=102): "Because IPv6 support is so critical to ensuring your applications work across the world for every customer, we are making it an AppStore submission requirement, starting with iOS 9."

By default, all network connections in third party applications compiled for on iOS 9 and El Capitan use a new feature called App Transport Security. This feature forces the application to connect to backend APIs and the web via HTTPS. Plain unencrypted HTTP requests are disallowed unless the developer specifically modifies a configuration file to allow it.

From the iOS 9 developer documentation: "If you're developing a new app, you should use HTTPS exclusively. If you have an existing app, you should use HTTPS as much as you can right now, and create a plan for migrating the rest of your app as soon as possible."

What does this mean for application developers? If your application has a web backend component, you will need to update the backend to support these protocols. This can be difficult to do since not all hosting providers support IPv6, and HTTPS certificates can be tricky to obtain and difficult to configure correctly, not to mention maintain.

Free and automatic IPv6 and SSL

One of the best things CloudFlare does well is to take modern network protocols and make them affordable (or free) and accessible to everyone. Every CloudFlare-backed website and API backend has support for both IPv6 and HTTPS automatically, and with no configuration necessary.

With Universal SSL, this is now true even for customers on CloudFlare’s free plan. We also make sure your HTTPS configuration is the latest and greatest with TLS 1.2 support and Forward secrecy.

CloudFlare provides its Automatic IPv6 Gateway for free to all CloudFlare users. This makes any content or API required by an Apple iOS app instantly available using IPv4 and IPv6 independent of the inability of your hosting provider to supply IPv6 support. More information about CloudFlare’s IPv6 support can be found here.

Not only does CloudFlare help keep application service components up to date with the latest Apple requirements, it provides the performance benefits of a globally distributed network and protection against malicious attacks.

If you’re an iOS developer looking to upgrade your application backend to meet Apple’s new requirements, you can sign up for CloudFlare here.

To verify that IPv6 is enabled, open your DNS settings and ensure that the IPv6 toggle is on:

CloudFlare offers HTTPS to the backend, so if you already have HTTPS, you can keep full end-to-end encryption with CloudFlare’s Strict SSL mode. If your backend doesn’t support HTTPS, you can select the Flexible mode to encrypt the communication between your App and CloudFlare. These setting are available in your Crypto settings:

CloudFlare and iOS 9 were made for each other.

P.S.: We provide the same service to Android apps as well.