
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Fri, 10 Apr 2026 18:57:12 GMT</lastBuildDate>
        <item>
            <title><![CDATA[DIY BYOIP: a new way to Bring Your Own IP prefixes to Cloudflare]]></title>
            <link>https://blog.cloudflare.com/diy-byoip/</link>
            <pubDate>Fri, 07 Nov 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ Announcing a new self-serve API for Bring Your Own IP (BYOIP), giving customers unprecedented control and flexibility to onboard, manage, and use their own IP prefixes with Cloudflare's services. ]]></description>
            <content:encoded><![CDATA[ <p>When a customer wants to <a href="https://blog.cloudflare.com/bringing-your-own-ips-to-cloudflare-byoip/"><u>bring IP address space to</u></a> Cloudflare, they’ve always had to reach out to their account team to put in a request. This request would then be sent to various Cloudflare engineering teams such as addressing and network engineering — and then the team responsible for the particular service they wanted to use the prefix with (e.g., CDN, Magic Transit, Spectrum, Egress). In addition, they had to work with their own legal teams and potentially another organization if they did not have primary ownership of an IP prefix in order to get a <a href="https://developers.cloudflare.com/byoip/concepts/loa/"><u>Letter of Agency (LOA)</u></a> issued, often after multiple rounds of approval. This process is complex, manual, and time-consuming for all parties involved — sometimes taking 4–6 weeks depending on various approvals.</p><p>Well, no longer! Today, we are pleased to announce the launch of our self-serve BYOIP API, which enables our customers to onboard and set up their BYOIP prefixes themselves.</p><p>With self-serve, we handle the bureaucracy for you. We have automated this process using the gold standard for routing security — the Resource Public Key Infrastructure (RPKI). All the while, we continue to ensure the best quality of service by generating LOAs on our customers’ behalf, based on the security guarantees of our new ownership validation process. This ensures that customer routes continue to be accepted in every corner of the Internet.</p><p>Cloudflare takes the security and stability of the whole Internet very seriously. RPKI is a cryptographically strong authorization mechanism and is, we believe, substantially more reliable than the common practice of relying upon human review of scanned documents. 
However, deployment and availability of some RPKI-signed artifacts like the Autonomous System Provider Authorization (ASPA) object remain limited, and for that reason we are limiting the initial scope of self-serve onboarding to BYOIP prefixes originated from Cloudflare's autonomous system number (ASN), AS13335. By doing this, we only need to rely on the publication of Route Origin Authorization (ROA) objects, which are widely available. This approach has the advantage of being safe for the Internet while also meeting the needs of most of our BYOIP customers.</p><p>Today, we take a major step forward in offering customers a more comprehensive IP address management (IPAM) platform. With the recent update to <a href="https://blog.cloudflare.com/your-ips-your-rules-enabling-more-efficient-address-space-usage/"><u>enable multiple services on a single BYOIP prefix</u></a> and this latest advancement to enable self-serve onboarding via our API, we hope customers feel empowered to take control of their IPs on our network.</p>
    <div>
      <h2>An evolution of Cloudflare BYOIP</h2>
      <a href="#an-evolution-of-cloudflare-byoip">
        
      </a>
    </div>
    <p>We want Cloudflare to feel like an extension of your infrastructure, which is why we <a href="https://blog.cloudflare.com/bringing-your-own-ips-to-cloudflare-byoip/"><u>originally launched Bring-Your-Own-IP (BYOIP) back in 2020</u></a>.</p><p>A quick refresher: Bring Your Own IP is named for exactly what it does — it allows customers to bring their own IP space to Cloudflare. Customers choose BYOIP for a number of reasons, but the main ones are control and configurability. An IP prefix is a range or block of IP addresses. Routers create a table of reachable prefixes, known as a routing table, to ensure that packets are delivered correctly across the Internet. When a customer's Cloudflare services are configured to use the customer's own addresses, onboarded to Cloudflare as BYOIP, a packet with a corresponding destination address will be routed across the Internet to Cloudflare's global edge network, where it will be received and processed. BYOIP can be used with our Layer 7 services, Spectrum, or Magic Transit.</p>
    <div>
      <h2>A look under the hood: How it works</h2>
      <a href="#a-look-under-the-hood-how-it-works">
        
      </a>
    </div>
    
    <div>
      <h3>Today’s world of prefix validation</h3>
      <a href="#todays-world-of-prefix-validation">
        
      </a>
    </div>
    <p>Let’s take a step back and look at the state of the BYOIP world right now. Let’s say a customer has authority over a range of IP addresses, and they’d like to bring them to Cloudflare. We require customers to provide us with a Letter of Agency (LOA) and to have an Internet Routing Registry (IRR) record matching their prefix and ASN. Once we have this, we require manual review by a Cloudflare engineer. There are a few issues with this process:</p><ul><li><p>Insecure: The LOA is just a document—a piece of paper. The security of this method rests entirely on the diligence of the engineer reviewing the document. If the review fails to detect that a document is fraudulent or inaccurate, it is possible for a prefix or ASN to be hijacked.</p></li><li><p>Time-consuming: Generating a single LOA is not always sufficient. If you are leasing IP space, we will ask you to provide documentation confirming that relationship as well, so that we can see a clear chain of authorization from the original assignment or allocation of addresses to you. Gathering all the paper documents to verify this chain of ownership, combined with waiting for manual review, can mean weeks of delay before a prefix is deployed!</p></li></ul>
    <div>
      <h3>Automating trust: How Cloudflare verifies your BYOIP prefix ownership in minutes</h3>
      <a href="#automating-trust-how-cloudflare-verifies-your-byoip-prefix-ownership-in-minutes">
        
      </a>
    </div>
    <p>Moving to a self-serve model allowed us to rethink the way we conduct prefix ownership checks. We asked ourselves: How can we quickly, securely, and automatically prove you are authorized to use your IP prefix and intend to route it through Cloudflare?</p><p>We ended up killing two birds with one stone, thanks to a two-step process involving the creation of an RPKI ROA (verification of intent) and the modification of IRR or rDNS records (verification of ownership). Self-serve unlocks the ability not only to onboard prefixes more quickly and without human intervention, but also to apply more rigorous ownership checks than a simple scanned document ever could. While not 100% foolproof, it is a significant improvement in the way we verify ownership.</p>
    <div>
      <h3>Tapping into the authorities</h3>
      <a href="#tapping-into-the-authorities">
        
      </a>
    </div>
    <p>Regional Internet Registries (RIRs) are the organizations responsible for distributing and managing Internet number resources like IP addresses. Five <a href="https://developers.cloudflare.com/byoip/get-started/#:~:text=Your%20prefix%20must%20be%20registered%20under%20one%20of%20the%20Regional%20Internet%20Registries%20(RIRs)%3A"><u>RIRs</u></a> operate in different regions of the world. Originally allocated address space from the Internet Assigned Numbers Authority (IANA), they in turn assign and allocate that IP space to Local Internet Registries (LIRs) like ISPs.</p><p>This process is based on RIR policies which generally look at things like legal documentation, existing database/registry records, technical contacts, and BGP information. End users can obtain addresses from an LIR, or in some cases through an RIR directly. As IPv4 addresses have become more scarce, brokerage services have been launched to allow addresses to be leased for fixed periods from their original assignees.</p><p>The Internet Routing Registry (IRR) is a separate system that focuses on routing rather than address assignment. Many organizations operate IRR instances and allow routing information to be published, including all five RIRs. While most IRR instances impose few barriers to the publication of routing data, those operated by RIRs are capable of linking the ability to publish routing information to the organizations to which the corresponding addresses have been assigned. We believe that being able to modify an IRR record protected in this way provides a good signal that a user has the rights to use a prefix.</p><p>Example of a route object containing a validation token (using the documentation-only address 192.0.2.0/24):</p>
            <pre><code>% whois -h rr.arin.net 192.0.2.0/24

route:          192.0.2.0/24
origin:         AS13335
descr:          Example Company, Inc.
                cf-validation: 9477b6c3-4344-4ceb-85c4-6463e7d2453f
admin-c:        ADMIN2521-ARIN
tech-c:         ADMIN2521-ARIN
tech-c:         CLOUD146-ARIN
mnt-by:         MNT-CLOUD14
created:        2025-07-29T10:52:27Z
last-modified:  2025-07-29T10:52:27Z
source:         ARIN</code></pre>
            <p>For those who don’t want to go through the process of IRR-based validation, reverse DNS (rDNS) is provided as another secure method of verification. To manage rDNS for a prefix — whether it's creating a PTR record or a security TXT record — you must be granted permission by the entity that allocated the IP block in the first place (usually your ISP or the RIR).</p><p>This permission is demonstrated in one of two ways:</p><ul><li><p>Directly through the IP owner’s authenticated customer portal (ISP/RIR).</p></li><li><p>By the IP owner delegating authority to your third-party DNS provider via an NS record for your reverse zone.</p></li></ul><p>Example of a reverse domain lookup using the dig command (using the documentation-only address 192.0.2.0/24):</p>
            <pre><code>% dig cf-validation.2.0.192.in-addr.arpa TXT

; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; cf-validation.2.0.192.in-addr.arpa TXT
;; global options: +cmd
;; Got answer:
;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOERROR, id: 16686
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;cf-validation.2.0.192.in-addr.arpa. IN TXT

;; ANSWER SECTION:
cf-validation.2.0.192.in-addr.arpa. 300 IN TXT "b2f8af96-d32d-4c46-a886-f97d925d7977"

;; Query time: 35 msec
;; SERVER: 127.0.2.2#53(127.0.2.2)
;; WHEN: Fri Oct 24 10:43:52 EDT 2025
;; MSG SIZE  rcvd: 150</code></pre>
            <p>So how exactly is one supposed to modify these records? That’s where the validation token comes into play. Once you choose either the IRR or Reverse DNS method, we provide a unique, single-use validation token. You must add this token to the content of the relevant record, either in the IRR or in the DNS. Our system then looks for the presence of the token as evidence that the request is being made by someone with authorization to make the requested modification. If the token is found, verification is complete and your ownership is confirmed!</p>
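<p>As a minimal sketch (our illustration, not Cloudflare’s actual implementation), the final presence check for either method reduces to simple string matching once the IRR route object or TXT answers have been fetched. The record values below are taken from the examples earlier in this post:</p>

```python
def irr_contains_token(whois_text: str, token: str) -> bool:
    """Look for a matching 'cf-validation:' attribute in an IRR route object."""
    for line in whois_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "cf-validation" and value.strip() == token:
            return True
    return False

def rdns_contains_token(txt_values: list[str], token: str) -> bool:
    """Check the TXT answers for the cf-validation name under in-addr.arpa."""
    return any(v.strip().strip('"') == token for v in txt_values)

route_object = """route:          192.0.2.0/24
origin:         AS13335
                cf-validation: 9477b6c3-4344-4ceb-85c4-6463e7d2453f"""
print(irr_contains_token(route_object, "9477b6c3-4344-4ceb-85c4-6463e7d2453f"))  # True
print(rdns_contains_token(['"b2f8af96-d32d-4c46-a886-f97d925d7977"'],
                          "b2f8af96-d32d-4c46-a886-f97d925d7977"))               # True
```

<p>In practice the records would be fetched with a whois client or a DNS resolver; only the comparison against the issued token is shown here.</p>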
    <div>
      <h3>The digital passport 🛂</h3>
      <a href="#the-digital-passport">
        
      </a>
    </div>
    <p>Ownership is only half the battle; we also need to confirm your intention to authorize Cloudflare to advertise your prefix. For this, we rely on the gold standard for routing security: the Resource Public Key Infrastructure (RPKI), and in particular Route Origin Authorization (ROA) objects.</p><p>A ROA is a cryptographically-signed document that specifies which Autonomous System Number (ASN) is authorized to originate your IP prefix. You can think of a ROA as the digital equivalent of a certified, signed, and notarized contract from the owner of the prefix.</p><p>Relying parties can validate the signatures in a ROA using the RPKI. You simply create a ROA that specifies Cloudflare's ASN (AS13335) as an authorized originator and arrange for it to be signed. Many of our customers use hosted RPKI systems available through RIR portals for this. When our systems detect this signed authorization, your routing intention is instantly confirmed.</p><p>Many other companies that support BYOIP require a complex workflow involving creating self-signed certificates and manually modifying RDAP (Registration Data Access Protocol) records—a heavy administrative lift. By offering a choice of IRR object modification and reverse DNS TXT records, combined with RPKI, we offer a verification process that is much more familiar and straightforward for existing network operators.</p>
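<p>The check a relying party performs against ROAs is standardized as route origin validation (<a href="https://datatracker.ietf.org/doc/html/rfc6811"><u>RFC 6811</u></a>) and can be sketched in a few lines of Python. This is a toy illustration of the logic, not a production validator:</p>

```python
import ipaddress

def rov_state(route_prefix: str, origin_asn: int, roas: list[dict]) -> str:
    """Toy route-origin validation; each ROA is a dict with keys
    'prefix', 'max_length', and 'asn'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa["prefix"])
        if route.version == roa_net.version and route.subnet_of(roa_net):
            covered = True  # at least one ROA covers this prefix
            if roa["asn"] == origin_asn and route.prefixlen <= roa["max_length"]:
                return "valid"  # origin ASN and prefix length both authorized
    # Covered but never matched -> invalid; no covering ROA -> not found.
    return "invalid" if covered else "not-found"

# A ROA authorizing AS13335 to originate the documentation prefix 192.0.2.0/24.
roas = [{"prefix": "192.0.2.0/24", "max_length": 24, "asn": 13335}]
print(rov_state("192.0.2.0/24", 13335, roas))  # valid
print(rov_state("192.0.2.0/24", 64496, roas))  # invalid: wrong origin ASN
```

<p>A real validator additionally fetches and cryptographically verifies the signed ROAs from the RPKI repositories before applying this comparison.</p>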
    <div>
      <h3>The global reach guarantee</h3>
      <a href="#the-global-reach-guarantee">
        
      </a>
    </div>
    <p>While the new self-serve flow ditches the need for the "dinosaur relic" that is the LOA, many network operators around the world still rely on it as part of the process of accepting prefixes from other networks.</p><p>To help ensure your prefix is accepted by adjacent networks globally, Cloudflare automatically generates a document on your behalf to be distributed in place of a LOA. This document describes the checks that we have carried out to confirm that we are authorized to originate the customer prefix, and confirms the presence of valid ROAs authorizing our origination of it. In this way we are able to support the workflows of network operators we connect to who rely upon LOAs, without our customers having the burden of generating them.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1GimIe80gJn5PrRUGkEMpF/130d2590e45088d58ac62ab2240f4d5c/image1.png" />
          </figure>
    <div>
      <h2>Staying away from black holes</h2>
      <a href="#staying-away-from-black-holes">
        
      </a>
    </div>
    <p>One concern in designing the self-serve API is the trade-off between giving customers flexibility and implementing the safeguards necessary to ensure that an IP prefix is never advertised without a matching service binding. If this were to happen, Cloudflare would be advertising a prefix with no idea what to do with the traffic when we receive it! We call this “blackholing” traffic. To handle this, we introduced the requirement of a default service binding — i.e., a service binding that spans the entire range of the onboarded IP prefix.</p><p>A customer can later layer different service bindings on top of their default service binding via <a href="https://blog.cloudflare.com/your-ips-your-rules-enabling-more-efficient-address-space-usage/"><u>multiple service bindings</u></a>, like putting CDN on top of a default Spectrum service binding. This way, a prefix can never be advertised without a service binding, and our customers’ traffic can never be blackholed.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/20QAM5GITJ5m5kYkNlh701/82812d202ffa7b9a4e46838aa6c04937/image2.png" />
          </figure>
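<p>As a hypothetical example, creating the default service binding is a single API call once a prefix has been onboarded. The sketch below only builds the request: the account, prefix, and service identifiers are placeholders, and the exact endpoint path and field names should be confirmed against the developer docs:</p>

```python
import json
from urllib.request import Request

# Placeholder identifiers: substitute your own account, prefix, and service IDs.
ACCOUNT_ID = "0123456789abcdef0123456789abcdef"
PREFIX_ID = "f1aba936b94213e5b8dca0c0dbf1f9cc"
SPECTRUM_SERVICE_ID = "a25a9a7e9c00afc1fb2e0245519d725b"

def default_binding_request(cidr: str, service_id: str) -> Request:
    """Build a POST that binds a service to the full onboarded prefix."""
    url = (f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}"
           f"/addressing/prefixes/{PREFIX_ID}/bindings")
    body = json.dumps({"cidr": cidr, "service_id": service_id}).encode()
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json",
                            "Authorization": "Bearer <API_TOKEN>"})

# The default binding spans the entire prefix; narrower bindings (e.g. CDN on
# part of the range) can be layered on top of it afterwards.
req = default_binding_request("192.0.2.0/24", SPECTRUM_SERVICE_ID)
print(req.get_method(), req.full_url)
```

<p>Sending the request (for example with <code>urllib.request.urlopen</code>) requires a valid API token in place of the placeholder.</p>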
    <div>
      <h2>Getting started</h2>
      <a href="#getting-started">
        
      </a>
    </div>
    <p>Check out our <a href="https://developers.cloudflare.com/byoip/get-started/"><u>developer docs</u></a> for the most up-to-date documentation on how to onboard, advertise, and add services to your IP prefixes via our API. Remember that onboardings can be complex, so don’t hesitate to ask questions or reach out to our <a href="https://www.cloudflare.com/professional-services/"><u>professional services</u></a> team if you’d like us to do it for you.</p>
    <div>
      <h2>The future of network control</h2>
      <a href="#the-future-of-network-control">
        
      </a>
    </div>
    <p>The ability to script and integrate BYOIP management into existing workflows is a game-changer for modern network operations, and we’re only just getting started. In the months ahead, look for self-serve BYOIP in the dashboard, as well as self-serve BYOIP offboarding to give customers even more control.</p><p>Cloudflare's self-serve BYOIP API onboarding empowers customers with unprecedented control and flexibility over their IP assets. This move to automate onboarding empowers a stronger security posture, moving away from manually-reviewed PDFs and driving <a href="https://rpki.cloudflare.com/"><u>RPKI adoption</u></a>. By using these API calls, organizations can automate complex network tasks, streamline migrations, and build more resilient and agile network infrastructures.</p> ]]></content:encoded>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Addressing]]></category>
            <category><![CDATA[BYOIP]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[CDN]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <category><![CDATA[Egress]]></category>
            <category><![CDATA[Cloudflare Gateway]]></category>
            <category><![CDATA[RPKI]]></category>
            <category><![CDATA[Aegis]]></category>
            <category><![CDATA[Smart Shield]]></category>
            <guid isPermaLink="false">4usaEaUwShJ04VKzlMV0V9</guid>
            <dc:creator>Ash Pallarito</dc:creator>
            <dc:creator>Lynsey Haynes</dc:creator>
            <dc:creator>Gokul Unnikrishnan</dc:creator>
        </item>
        <item>
            <title><![CDATA[One IP address, many users: detecting CGNAT to reduce collateral effects]]></title>
            <link>https://blog.cloudflare.com/detecting-cgn-to-reduce-collateral-damage/</link>
            <pubDate>Wed, 29 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ IPv4 scarcity drives widespread use of Carrier-Grade Network Address Translation, a practice in ISPs and mobile networks that places many users behind each IP address, along with their collected activity and volumes of traffic. We introduce the method we’ve developed to detect large-scale IP sharing globally and mitigate the issues that result.  ]]></description>
            <content:encoded><![CDATA[ <p>IP addresses have historically been treated as stable identifiers for non-routing purposes such as geolocation and security operations. Many operational and security mechanisms, such as blocklists, rate-limiting, and anomaly detection, rely on the assumption that a single IP address represents a cohesive, accountable entity or even, possibly, a specific user or device.</p><p>But the structure of the Internet has changed, and those assumptions can no longer be made. Today, a single IPv4 address may represent hundreds or even thousands of users due to widespread use of <a href="https://en.wikipedia.org/wiki/Carrier-grade_NAT"><u>Carrier-Grade Network Address Translation (CGNAT)</u></a>, VPNs, and proxy middleboxes. This concentration of traffic can result in <a href="https://blog.cloudflare.com/consequences-of-ip-blocking/"><u>significant collateral damage</u></a> – especially to users in developing regions of the world – when security mechanisms are applied without taking into account the multi-user nature of IPs.</p><p>This blog post presents our approach to detecting large-scale IP sharing globally. We describe how we <a href="https://www.cloudflare.com/learning/ai/how-to-secure-training-data-against-ai-data-leaks/">build reliable training data</a>, and how detection can help avoid unintentional bias affecting users in regions where IP sharing is most prevalent. Arguably it's those regional variations that motivate our efforts more than any other.</p>
    <div>
      <h2>Why this matters: Potential socioeconomic bias</h2>
      <a href="#why-this-matters-potential-socioeconomic-bias">
        
      </a>
    </div>
    <p>Our work was initially motivated by a simple observation: CGNAT is a likely unseen source of bias on the Internet. Those biases would be more pronounced wherever there are more users and fewer addresses, such as in developing regions. And these biases can have profound implications for user experience, network operations, and digital equity.</p><p>The reasons are understandable, not least because of necessity. Countries in the developing world often have significantly fewer available IPs, and more users. The disparity is a historical artifact of how the Internet grew: the largest blocks of IPv4 addresses were allocated decades ago, primarily to organizations in North America and Europe, leaving a much smaller pool for regions where Internet adoption expanded later.</p><p>To visualize the IPv4 allocation gap, we plot country-level ratios of users to IP addresses in the figure below. We take online user estimates from the <a href="https://data.worldbank.org/indicator/IT.NET.USER.ZS"><u>World Bank Group</u></a> and the number of IP addresses in a country from Regional Internet Registry (RIR) records. The colour-coded map that emerges shows that the usage of each IP address is more concentrated in regions that generally have poor Internet penetration. For example, large portions of Africa and South Asia appear with the highest user-to-IP ratios. Conversely, the lowest user-to-IP ratios appear in Australia, Canada, Europe, and the USA — the very countries that otherwise have the highest Internet user penetration numbers.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2YBdqPx0ALt7pY7rmQZyLQ/049922bae657a715728700c764c4af16/BLOG-3046_2.png" />
          </figure><p>The scarcity of IPv4 address space means that regional differences can only worsen as Internet penetration rates increase. A natural consequence of increased demand in developing regions is that ISPs would rely even more heavily on CGNAT, and this is compounded by the fact that CGNAT is common in the mobile networks that users in developing regions so heavily depend on. All of this means that <a href="https://datatracker.ietf.org/doc/html/rfc7021"><u>actions known to be based</u></a> on IP reputation or behaviour would disproportionately affect developing economies.</p><p>Cloudflare is a global network in a global Internet. We are sharing our methodology so that others might benefit from our experience and help to mitigate unintended effects. First, let’s better understand CGNAT.</p>
    <div>
      <h3>When one IP address serves multiple users</h3>
      <a href="#when-one-ip-address-serves-multiple-users">
        
      </a>
    </div>
    <p>Large-scale IP address sharing is primarily achieved through two distinct methods. The first, and more familiar, involves services like VPNs and proxies. These tools emerge from a need to secure corporate networks or improve users' privacy, but can be used to circumvent censorship or even improve performance. Their deployment also tends to concentrate traffic from many users onto a small set of exit IPs. Typically, individuals are aware they are using such a service, whether for personal use or as part of a corporate network.</p><p>Separately, another form of large-scale IP sharing often goes unnoticed by users: <a href="https://en.wikipedia.org/wiki/Carrier-grade_NAT"><u>Carrier-Grade NAT (CGNAT)</u></a>. One way to explain CGNAT is to start with a much smaller version of network address translation (NAT) that very likely exists in your home broadband router, formally called Customer Premises Equipment (CPE), which translates unseen private addresses in the home to visible and routable addresses in the ISP. Once traffic leaves the home, an ISP may add an additional carrier-grade address translation that causes many households or unrelated devices to appear behind a single IP address.</p><p>The crucial difference between these forms of large-scale IP sharing is user choice: carrier-grade address sharing is not a user choice, but is configured directly by Internet Service Providers (ISPs) within their access networks. Users are not aware that CGNATs are in use.</p><p>The primary driver for this technology, understandably, is the exhaustion of the IPv4 address space. IPv4's 32-bit architecture supports only 4.3 billion unique addresses — a capacity that, while once seemingly vast, has been completely outpaced by the Internet's explosive growth. By the early 2010s, Regional Internet Registries (RIRs) had depleted their pools of unallocated IPv4 addresses. 
This left ISPs unable to easily acquire new address blocks, forcing them to maximize the use of their existing allocations.</p><p>While the long-term solution is the transition to IPv6, CGNAT emerged as the immediate, practical workaround. Instead of assigning a unique public IP address to each customer, ISPs use CGNAT to place multiple subscribers behind a single, shared IP address. This practice solves the problem of IP address scarcity. Since translated addresses are not publicly routable, CGNATs have also had the positive side effect of protecting many home devices that might be vulnerable to compromise. </p><p>CGNATs also create significant operational fallout stemming from the fact that hundreds or even thousands of clients can appear to originate from a single IP address. <b>This means an IP-based security system may inadvertently block or throttle large groups of users as a result of a single user behind the CGNAT engaging in malicious activity.</b></p><p>This isn't a new or niche issue. It has been recognized for years by the Internet Engineering Task Force (IETF), the organization that develops the core technical standards for the Internet. These standards, known as Requests for Comments (RFCs), act as the official blueprints for how the Internet should operate. <a href="https://www.rfc-editor.org/rfc/rfc6269.html"><u>RFC 6269</u></a>, for example, discusses the challenges of IP address sharing, while <a href="https://datatracker.ietf.org/doc/html/rfc7021"><u>RFC 7021</u></a> examines the impact of CGNAT on network applications. 
Both explain that traditional abuse-mitigation techniques, such as blocklisting or rate-limiting, assume a one-to-one relationship between IP addresses and users: when malicious activity is detected, the offending IP address can be blocked to prevent further abuse.</p><p>In shared IPv4 environments, such as those using CGNAT or other address-sharing techniques, this assumption breaks down because multiple subscribers can appear under the same public IP. Blocking the shared IP therefore penalizes many innocent users along with the abuser. In 2015, Ofcom, the UK's telecommunications regulator, reiterated these concerns in a <a href="https://oxil.uk/research/mc159-report-on-the-implications-of-carrier-grade-network-address-translators-final-report"><u>report</u></a> on the implications of CGNAT, noting that, “In the event that an IPv4 address is blocked or blacklisted as a source of spam, the impact on a CGNAT would be greater, potentially affecting an entire subscriber base.”</p><p>While the hope was that CGNAT would be only a temporary solution until the eventual switch to IPv6, as the old proverb says, nothing is more permanent than a temporary solution. While IPv6 deployment continues to lag, <a href="https://blog.apnic.net/2022/01/19/ip-addressing-in-2021/"><u>CGNAT deployments have become increasingly common</u></a>, and so have the related problems.</p>
    <div>
      <h2>CGNAT detection at Cloudflare</h2>
      <a href="#cgnat-detection-at-cloudflare">
        
      </a>
    </div>
    <p>Our goal is to identify large-scale IP sharing so that security techniques that rely on IP reputation can treat users behind CGNAT IPs more fairly. This allows traffic filtering to be better calibrated and collateral damage minimized. Additionally, we want to distinguish CGNAT IPs from IPs shared by other large-scale sharing (LSS) technologies, such as VPNs and proxies, because we may need to take different approaches to different kinds of IP-sharing technologies.</p><p>To do this, we decided to take advantage of Cloudflare’s extensive view of active IP clients, and build a supervised learning classifier that distinguishes CGNAT and VPN/proxy IPs from IPs that are allocated to a single subscriber (non-LSS IPs), based on behavioural characteristics. The figure below shows an overview of our supervised classifier:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7tFXZByKRCYxVaAFDG0Xda/d81e7f09b5d12e03e39c266696df9cc3/BLOG-3046_3.png" />
          </figure><p>While our classification approach is straightforward, a significant challenge is the lack of a reliable, comprehensive, and labeled dataset of CGNAT IPs for our training dataset.</p>
    <div>
      <h3>Detecting CGNAT using public data sources </h3>
      <a href="#detecting-cgnat-using-public-data-sources">
        
      </a>
    </div>
    <p>Detection begins by building an initial dataset of IPs believed to be associated with CGNAT. Cloudflare has vast HTTP and traffic logs. Unfortunately, there is no signal or label in any request to indicate whether its source IP is or is not a CGNAT IP.</p><p>To build an extensive labelled dataset to train our ML classifier, we employ a combination of network measurement techniques, as described below. We rely on public data sources to help disambiguate an initial set of large-scale shared IP addresses from others in Cloudflare’s logs.</p>
    <div>
      <h4>Distributed traceroutes</h4>
      <a href="#distributed-traceroutes">
        
      </a>
    </div>
    <p>The presence of a client behind CGNAT can often be inferred through traceroute analysis. CGNAT requires ISPs to insert a NAT step that typically uses the Shared Address Space (<a href="https://datatracker.ietf.org/doc/html/rfc6598"><u>RFC 6598</u></a>) after the customer premises equipment (CPE). By running a traceroute from the client to its own public IP and examining the hop sequence, the appearance of an address within 100.64.0.0/10 between the first private hop (e.g., 192.168.1.1) and the public IP is a strong indicator of CGNAT.</p><p>Traceroute can also reveal multi-level NAT, which CGNAT requires, as shown in the diagram below. If the ISP assigns the CPE a private <a href="https://datatracker.ietf.org/doc/html/rfc1918"><u>RFC 1918</u></a> address that appears right after the local hop, this indicates at least two NAT layers. While ISPs sometimes use private addresses internally without CGNAT, observing private or shared ranges immediately downstream combined with multiple hops before the public IP strongly suggests CGNAT or equivalent multi-layer NAT.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57k4gwGCHcPggIWtSy36HU/6cf8173c1a4c568caa25a1344a516e9e/BLOG-3046_4.png" />
          </figure><p>Although traceroute accuracy depends on router configurations, detecting private and shared IP ranges is a reliable way to identify large-scale IP sharing. We apply this method to distributed traceroutes from over 9,000 RIPE Atlas probes to classify hosts as behind CGNAT, single-layer NAT, or no NAT.</p>
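<p>This hop-classification heuristic can be sketched with Python's standard <code>ipaddress</code> module. The function below is an illustrative simplification: the name and exact decision rules are ours, not the production logic.</p>

```python
import ipaddress

# Address ranges used in the inference: RFC 1918 private and RFC 6598 shared.
PRIVATE = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
SHARED = ipaddress.ip_network("100.64.0.0/10")

def classify_path(hops):
    """Classify a traceroute (list of hop IPs, from the client toward its own
    public IP) as 'cgnat', 'single-nat', or 'no-nat'."""
    kinds = []
    for hop in hops:
        addr = ipaddress.ip_address(hop)
        if addr in SHARED:
            kinds.append("shared")
        elif any(addr in net for net in PRIVATE):
            kinds.append("private")
        else:
            kinds.append("public")
    # A shared (100.64.0.0/10) hop on the path strongly suggests CGNAT.
    if "shared" in kinds:
        return "cgnat"
    # Two or more private hops imply at least two NAT layers.
    if kinds.count("private") >= 2:
        return "cgnat"
    if "private" in kinds:
        return "single-nat"
    return "no-nat"
```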
    <div>
      <h4>Scraping WHOIS and PTR records</h4>
      <a href="#scraping-whois-and-ptr-records">
        
      </a>
    </div>
    <p>Many operators encode metadata about their IPs in the corresponding reverse DNS pointer (PTR) record that can signal administrative attributes and geographic information. We first query the DNS for PTR records for the full IPv4 space and then filter for a set of known keywords from the responses that indicate a CGNAT deployment. For example, each of the following three records matches a keyword (<code>cgnat</code>, <code>cgn</code> or <code>lsn</code>) used to detect CGNAT address space:</p><p><code>node-lsn.pool-1-0.dynamic.totinternet.net
103-246-52-9.gw1-cgnat.mobile.ufone.nz
cgn.gsw2.as64098.net</code></p><p>WHOIS and Internet Routing Registry (IRR) records may also contain organizational names, remarks, or allocation details that reveal whether a block is used for CGNAT pools or residential assignments. </p><p>Given that both PTR and WHOIS records may be manually maintained and therefore stale, we sanitize the extracted data by validating, against customer and market reports, that the corresponding ISPs indeed use CGNAT. </p>
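<p>A minimal sketch of this keyword filtering follows; the regular expression and keyword list are illustrative, based only on the examples above, not the exhaustive production set.</p>

```python
import re

# Keywords that commonly mark CGNAT pools in reverse DNS. The boundary
# classes avoid false hits on substrings inside unrelated words.
CGNAT_PATTERN = re.compile(r"(?:^|[.\-_])(cgnat|cgn|lsn)(?:[.\-_0-9]|$)")

def looks_like_cgnat(ptr_record):
    """Return True if a PTR record matches a known CGNAT keyword."""
    return bool(CGNAT_PATTERN.search(ptr_record.lower()))
```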
    <div>
      <h4>Collecting VPN and proxy IPs </h4>
      <a href="#collecting-vpn-and-proxy-ips">
        
      </a>
    </div>
    <p>Compiling a list of VPN and proxy IPs is more straightforward, as we can directly find such IPs in public service directories for anonymizers. We also subscribe to multiple VPN providers, and we collect the IPs allocated to our clients by connecting to a unique HTTP endpoint under our control. </p>
    <div>
      <h2>Modeling CGNAT with machine learning</h2>
      <a href="#modeling-cgnat-with-machine-learning">
        
      </a>
    </div>
    <p>By combining the above techniques, we accumulated a labeled dataset of more than 200K CGNAT IPs, 180K VPN &amp; proxy IPs, and close to 900K IPs that are not large-scale shared (LSS) IPs. These were the entry points to modeling with machine learning.</p>
    <div>
      <h3>Feature selection</h3>
      <a href="#feature-selection">
        
      </a>
    </div>
    <p>Our hypothesis was that aggregated activity from CGNAT IPs is distinguishable from activity from non-CGNAT IP addresses. Our feature extraction is an evaluation of that hypothesis: since networks do not disclose CGNAT and other uses of IPs, the quality of our inference is strictly dependent on our confidence in the training data. We claim the key discriminator is diversity, not just volume. For example, VM-hosted scanners may generate high numbers of requests, but with low information diversity. Similarly, globally routable CPEs may have individually unique characteristics, but with volumes that are less likely to be caught at lower sampling rates.</p><p>In our feature extraction, we parse a 1% sample of HTTP request logs for distinguishing features of the IPs compiled in our reference set, and the same features for the corresponding /24 prefix (namely, IPs with the same first 24 bits in common). We analyze these features for each class: VPN, proxy, CGNAT, and non-LSS IPs. We find that features from the following broad categories are key discriminators for the different types of IPs in our training dataset:</p><ul><li><p><b>Client-side signals:</b> We analyze the aggregate properties of clients connecting from an IP. A large, diverse user base (like on a CGNAT) naturally presents a much wider statistical variety of client behaviors and connection parameters than a single-tenant server or a small business proxy.</p></li><li><p><b>Network and transport-level behaviors:</b> We examine traffic at the network and transport layers. The way a large-scale network appliance (like a CGNAT) manages and routes connections often leaves subtle, measurable artifacts in its traffic patterns, such as in port allocation and observed network timing.</p></li><li><p><b>Traffic volume and destination diversity:</b> We also model the volume and "shape" of the traffic. An IP representing thousands of independent users will, on average, generate a higher volume of requests and target a much wider, less correlated set of destinations than an IP representing a single user.</p></li></ul><p>Crucially, to distinguish CGNAT from VPNs and proxies (which is absolutely necessary for calibrated security filtering), we had to aggregate these features at two different scopes: per IP and per /24 prefix. CGNAT deployments are typically allocated large blocks of IPs, whereas VPN IPs are more scattered across different IP prefixes. </p>
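<p>A simplified sketch of this two-scope aggregation, assuming hypothetical request records with <code>ip</code>, <code>user_agent</code>, and <code>host</code> fields (the real feature set is broader, as described above):</p>

```python
from collections import defaultdict

def prefix24(ip):
    """Map an IPv4 address string to its covering /24 prefix."""
    return ".".join(ip.split(".")[:3]) + ".0/24"

def aggregate_features(requests):
    """Compute simple diversity features at two scopes: per IP and per /24."""
    per_ip = defaultdict(lambda: {"requests": 0, "uas": set(), "hosts": set()})
    per_pfx = defaultdict(lambda: {"uas": set(), "active_ips": set()})
    for r in requests:
        ip, pfx = r["ip"], prefix24(r["ip"])
        per_ip[ip]["requests"] += 1
        per_ip[ip]["uas"].add(r["user_agent"])
        per_ip[ip]["hosts"].add(r["host"])
        per_pfx[pfx]["uas"].add(r["user_agent"])
        per_pfx[pfx]["active_ips"].add(ip)
    # Flatten sets into counts. A CGNAT IP should show high user-agent and
    # destination diversity, and its /24 should be densely active.
    features = {}
    for ip, f in per_ip.items():
        p = per_pfx[prefix24(ip)]
        features[ip] = {
            "requests": f["requests"],
            "distinct_uas": len(f["uas"]),
            "distinct_hosts": len(f["hosts"]),
            "pfx_active_ips": len(p["active_ips"]),
            "pfx_distinct_uas": len(p["uas"]),
        }
    return features
```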
    <div>
      <h3>Classification results</h3>
      <a href="#classification-results">
        
      </a>
    </div>
    <p>We compute the above features from HTTP logs over 24-hour intervals to increase data volume and reduce noise due to DHCP IP reallocation. The dataset is split into 70% training and 30% testing sets with disjoint /24 prefixes, and VPN and proxy labels are merged due to their similarity and lower operational importance compared to CGNAT detection.</p><p>Then we train a multi-class <a href="https://xgboost.readthedocs.io/en/stable/"><u>XGBoost</u></a> model with class weighting to address imbalance, assigning each IP to the class with the highest predicted probability. XGBoost is well-suited for this task because it efficiently handles large feature sets, offers strong regularization to prevent overfitting, and delivers high accuracy with limited parameter tuning. The classifier achieves 0.98 accuracy, 0.97 weighted F1, and 0.04 log loss. The figure below shows the confusion matrix of the classification.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/26i81Pe0yjlftHfIDrjB5X/45d001447fc52001a25176c8036a92cb/BLOG-3046_5.png" />
          </figure><p>Our model is accurate for all three labels. The errors observed are mainly misclassifications of VPN/proxy IPs as CGNATs, mostly for VPN/proxy IPs that are within a /24 prefix that is also shared by broadband users outside of the proxy service. We also evaluate the prediction accuracy using <a href="https://scikit-learn.org/stable/modules/cross_validation.html"><u>k-fold cross validation</u></a>, which provides a more reliable estimate of performance by training and validating on multiple data splits, reducing variance and overfitting compared to a single train–test split. We select 10 folds and we evaluate the <a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc"><u>Area Under the ROC Curve</u></a> (AUC) and the multi-class logloss. We achieve a macro-average AUC of 0.9946 (σ=0.0069) and log loss of 0.0429 (σ=0.0115). Prefix-level features are the most important contributors to classification performance.</p>
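<p>One subtle point in the setup above is keeping the /24 prefixes of the train and test sets disjoint, so sibling IPs from one prefix never leak across the split. One way to sketch that step (hashing each prefix for a deterministic, roughly uniform assignment; an illustrative approach, not necessarily the one used here):</p>

```python
import hashlib

def prefix24(ip):
    """The /24 prefix of an IPv4 address string."""
    return ".".join(ip.split(".")[:3])

def split_by_prefix(labeled_ips, test_fraction=0.3):
    """Split (ip, label) pairs into train/test sets with disjoint /24
    prefixes, so all IPs from one prefix land on the same side."""
    train, test = [], []
    for ip, label in labeled_ips:
        digest = hashlib.sha256(prefix24(ip).encode()).digest()
        bucket = digest[0] / 255.0  # deterministic pseudo-random in [0, 1]
        (test if bucket < test_fraction else train).append((ip, label))
    return train, test
```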
    <div>
      <h3>Users behind CGNAT are more likely to be rate limited</h3>
      <a href="#users-behind-cgnat-are-more-likely-to-be-rate-limited">
        
      </a>
    </div>
    <p>The figure below shows the daily number of CGNAT IP inferences generated by our CDN-deployed detection service between December 17, 2024 and January 9, 2025. The number of inferences remains largely stable, with noticeable dips during weekends and holidays such as Christmas and New Year’s Day. This pattern reflects expected seasonal variations, as lower traffic volumes during these periods lead to fewer active IP ranges and reduced request activity.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7hiYstptHAK6tFQrM2kEsf/7f8192051156fc6eaecdf26a829ef11c/BLOG-3046_6.png" />
          </figure><p>Next, recall that actions that rely on IP reputation or behaviour may be unduly influenced by CGNATs. One such example is bot detection. In an evaluation of our systems, we find that bot detection is resilient to those biases. However, we also learned that customers are more likely to rate limit IPs that we find are CGNATs.</p><p>We examine bot labels by measuring how often requests from CGNAT and non-CGNAT IPs are labeled as bots. <a href="https://www.cloudflare.com/resources/assets/slt3lc6tev37/JYknFdAeCVBBWWgQUtNZr/61844a850c5bba6b647d65e962c31c9c/BDES-863_Bot_Management_re_edit-_How_it_Works_r3.pdf"><u>Cloudflare assigns a bot score</u></a> to each HTTP request using CatBoost models trained on various request features, and these scores are then exposed through the Web Application Firewall (WAF), allowing customers to apply filtering rules. The median bot rate is nearly identical for CGNAT (4.8%) and non-CGNAT (4.7%) IPs. However, the mean bot rate is notably lower for CGNATs (7%) than for non-CGNATs (13.1%), indicating different underlying distributions. Non-CGNAT IPs show a much wider spread, with some reaching 100% bot rates, while CGNAT IPs cluster mostly below 15%. This suggests that non-CGNAT IPs tend to be dominated by either human or bot activity, whereas CGNAT IPs reflect mixed behavior from many end users, with human traffic prevailing.</p><p>Interestingly, despite bot scores that indicate traffic is more likely to be from human users, CGNAT IPs are subject to rate limiting three times more often than non-CGNAT IPs. This is likely because multiple users share the same public IP, increasing the chances that legitimate traffic gets caught by customers’ bot mitigation and firewall rules.</p><p>This tells us that users behind CGNAT IPs are indeed susceptible to collateral effects, and identifying those IPs allows us to tune mitigation strategies to disrupt malicious traffic quickly while reducing collateral impact on benign users behind the same address.</p>
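<p>The gap between near-identical medians and very different means is what a long-tailed distribution produces. A toy example with synthetic per-IP bot rates (not Cloudflare data) makes the effect concrete:</p>

```python
from statistics import mean, median

# Synthetic per-IP bot rates (%), for illustration only. CGNAT IPs mix many
# users, so rates cluster low; non-CGNAT IPs are often all-human or all-bot,
# giving a long tail of near-100% bot rates.
cgnat_rates = [3, 4, 5, 5, 6, 7, 8, 9, 10, 13]
non_cgnat_rates = [2, 3, 4, 4, 5, 6, 7, 9, 100, 100]

# The medians are close, but the tail drags the non-CGNAT mean far higher.
print(median(cgnat_rates), median(non_cgnat_rates))  # medians: 6.5 vs 5.5
print(mean(cgnat_rates), mean(non_cgnat_rates))      # means: 7 vs 24
```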
    <div>
      <h2>A global view of the CGNAT ecosystem</h2>
      <a href="#a-global-view-of-the-cgnat-ecosystem">
        
      </a>
    </div>
    <p>One of the early motivations of this work was to understand if our knowledge about IP addresses might hide a bias along socio-economic boundaries—and in particular if an action on an IP address may disproportionately affect populations in developing nations, often referred to as the Global South. Identifying where different IPs exist is a necessary first step.</p><p>The map below shows the fraction of a country’s inferred CGNAT IPs over all IPs observed in the country. Regions with a greater reliance on CGNAT appear darker on the map. This view highlights the geodiversity of CGNATs in terms of importance; for example, much of Africa and Central and Southeast Asia rely on CGNATs. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4P2XcuEebKfcYdCgykMWuP/4a0aa86bd619ba24533de6862175e919/BLOG-3046_7.png" />
          </figure><p>As further evidence of continental differences, the boxplot below shows the distribution of distinct user agents per IP across /24 prefixes inferred to be part of a CGNAT deployment in each continent. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7bqJSHexFuXFs4A8am1ibQ/591be6880e8f58c9d61b147aaf0487f5/BLOG-3046_8.png" />
          </figure><p>Notably, Africa has a much higher ratio of user agents to IP addresses than other regions, suggesting more clients share the same IP in African <a href="https://www.cloudflare.com/learning/network-layer/what-is-an-autonomous-system/"><u>ASNs</u></a>. So, not only do African ISPs rely more extensively on CGNAT, but the number of clients behind each CGNAT IP is higher. </p><p>While the deployment rate of CGNAT per country is consistent with the users-per-IP ratio per country, it is not sufficient by itself to confirm deployment. The scatterplot below shows the number of users (according to <a href="https://stats.labs.apnic.net/aspop/"><u>APNIC user estimates</u></a>) and the number of IPs per ASN for ASNs where we detect CGNAT. ASNs that have fewer available IP addresses than their user base appear below the diagonal. Interestingly the scatterplot indicates that many ASNs with more addresses than users still choose to deploy CGNAT. Presumably, these ASNs provide additional services beyond broadband, preventing them from dedicating their entire address pool to subscribers. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/34GKPlJWvkwudU5MbOtots/c883760a7c448b12995997e3e6e51979/BLOG-3046_9.png" />
          </figure>
    <div>
      <h3>What this means for everyday Internet users</h3>
      <a href="#what-this-means-for-everyday-internet-users">
        
      </a>
    </div>
    <p>Accurate detection of CGNAT IPs is crucial for minimizing collateral effects in network operations and for ensuring fair and effective application of security measures. Our findings underscore the potential socio-economic and geographical variations in the use of CGNATs, revealing significant disparities in how IP addresses are shared across different regions. </p><p>At Cloudflare we are going beyond just using these insights to evaluate policies and practices. We are using the detection systems to improve our systems across our application security suite of features, and working with customers to understand how they might use these insights to improve the protections they configure.</p><p>Our work is ongoing and we’ll share details as we go. In the meantime, if you’re an ISP or network operator running CGNAT and want to help, get in touch at <a href="mailto:ask-research@cloudflare.com"><u>ask-research@cloudflare.com</u></a>. Sharing knowledge and working together helps create a better and more equitable user experience for subscribers, while preserving web service safety and security.</p>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[WAF]]></category>
            <category><![CDATA[Web Application Firewall]]></category>
            <category><![CDATA[Better Internet]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Bots]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[Network Services]]></category>
            <guid isPermaLink="false">9cTCNUkDdgVjdBN6M6JLv</guid>
            <dc:creator>Vasilis Giotsas</dc:creator>
            <dc:creator>Marwan Fayed</dc:creator>
        </item>
        <item>
            <title><![CDATA[A deep dive into BPF LPM trie performance and optimization]]></title>
            <link>https://blog.cloudflare.com/a-deep-dive-into-bpf-lpm-trie-performance-and-optimization/</link>
            <pubDate>Tue, 21 Oct 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ This post explores the performance of BPF LPM tries, a critical data structure used for IP matching.  ]]></description>
            <content:encoded><![CDATA[ <p>It started with a mysterious soft lockup message in production. A single, cryptic line that led us down a rabbit hole into the performance of one of the most fundamental data structures we use: the BPF LPM trie.</p><p>BPF trie maps (<a href="https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_LPM_TRIE/">BPF_MAP_TYPE_LPM_TRIE</a>) are heavily used for things like IP and IP+Port matching when routing network packets, ensuring your request passes through the right services before returning a result. The performance of this data structure is critical for serving our customers, but the speed of the current implementation leaves a lot to be desired. We’ve run into several bottlenecks when storing millions of entries in BPF LPM trie maps, such as entry lookup times taking hundreds of milliseconds to complete and freeing maps locking up a CPU for over 10 seconds. For instance, BPF maps are used when evaluating Cloudflare’s <a href="https://www.cloudflare.com/network-services/products/magic-firewall/"><u>Magic Firewall</u></a> rules and these bottlenecks have even led to traffic packet loss for some customers.</p><p>This post gives a refresher of how tries and prefix matching work, benchmark results, and a list of the shortcomings of the current BPF LPM trie implementation.</p>
    <div>
      <h2>A brief recap of tries</h2>
      <a href="#a-brief-recap-of-tries">
        
      </a>
    </div>
    <p>If it’s been a while since you last looked at the trie data structure (or if you’ve never seen it before), a trie is a tree data structure (similar to a binary tree) for storing and searching data by key, in which each node stores some number of key bits.</p><p>Searches are performed by traversing a path that essentially reconstructs the key, meaning nodes do not need to store their full key. This differs from a traditional binary search tree (BST) where the primary invariant is that the left child node has a key that is less than the current node and the right child has a key that is greater. BSTs require that each node store the full key so that a comparison can be made at each search step.</p><p>Here’s an example that shows how a BST might store values for the keys:</p><ul><li><p>ABC</p></li><li><p>ABCD</p></li><li><p>ABCDEFGH</p></li><li><p>DEF</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1uXt5qwpyq7VzrqxXlHFLj/99677afd73a98b9ce04d30209065499f/image4.png" />
          </figure><p>In comparison, a trie for storing the same set of keys might look like this.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3TfFZmwekNAF18yWlOIVWh/58396a19e053bd1c02734a6a54eea18e/image8.png" />
          </figure><p>This way of splitting out bits is really memory-efficient when you have redundancy in your data, e.g. prefixes are common in your keys, because that shared data only requires a single set of nodes. It’s for this reason that tries are often used to efficiently store strings, e.g. dictionaries of words – storing the strings “ABC” and “ABCD” doesn’t require 3 bytes + 4 bytes (assuming ASCII), it only requires 3 bytes + 1 byte because “ABC” is shared by both (the exact number of bits required in the trie is implementation dependent).</p><p>Tries also allow more efficient searching. For instance, if you wanted to know whether the key “CAR” existed in the BST you are required to go to the right child of the root (the node with key “DEF”) and check its left child because this is where it would live if it existed. A trie is more efficient because it searches in prefix order. In this particular example, a trie knows at the root whether that key is in the trie or not.</p><p>This design makes tries perfectly suited for performing longest prefix matches and for working with IP routing using CIDR. CIDR was introduced to make more efficient use of the IP address space (no longer requiring that classes fall into 4 buckets of 8 bits) but comes with added complexity because now the network portion of an IP address can fall anywhere. Handling the CIDR scheme in IP routing tables requires matching on the longest (most specific) prefix in the table rather than performing a search for an exact match.</p><p>If searching a trie does a single-bit comparison at each node, that’s a binary trie. If searching compares more bits we call that a <b><i>multibit trie</i></b>. 
You can store anything you like in a trie, including IP and subnet addresses – it’s all just ones and zeroes.</p><p>Nodes in multibit tries use more memory than in binary tries, but since computers operate on multibit words anyhow, it’s more efficient from a microarchitecture perspective to use multibit tries because you can traverse through the bits faster, reducing the number of comparisons you need to make to search for your data. It’s a classic space vs time tradeoff.</p><p>There are other optimisations we can use with tries. The distribution of data that you store in a trie might not be uniform and there could be sparsely populated areas. For example, if you store the strings “A” and “BCDEFGHI” in a multibit trie, how many nodes do you expect to use? If you’re using ASCII, you could construct the binary trie with a root node and branch left for “A” or right for “B”. With 8-bit nodes, you’d need another 7 nodes to store “C”, “D”, “E”, “F”, “G”, “H", “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LO6izFC5e06dRf9ra2roC/167ba5c4128fcebacc7b7a8eab199ea5/image5.png" />
          </figure><p>Since there are no other strings in the trie, that’s pretty suboptimal. Once you hit the first level after matching on “B” you know there’s only one string in the trie with that prefix, and you can avoid creating all the other nodes by using <b><i>path compression</i></b>. Path compression replaces nodes “C”, “D”, “E” etc. with a single one such as “I”.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ADY3lNtF7NIgfUX7bX9vY/828a14e155d6530a4dc8cf3286ce8cc3/image13.png" />
          </figure><p>If you traverse the tree and hit “I”, you still need to compare the search key with the bits you skipped (“CDEFGH”) to make sure your search key matches the string. Exactly how and where you store the skipped bits is implementation dependent – BPF LPM tries simply store the entire key in the leaf node. As your data becomes denser, path compression is less effective.</p><p>What if your data distribution is dense and, say, all the first 3 levels in a trie are fully populated? In that case you can use <b><i>level compression</i></b><i> </i>and replace all the nodes in those levels with a single node that has 2**3 children. This is how Level-Compressed Tries work which are used for <a href="https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux">IP route lookup</a> in the Linux kernel (see <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a>).</p><p>There are other optimisations too, but this brief detour is sufficient for this post because the BPF LPM trie implementation in the kernel doesn’t fully use the three we just discussed.</p>
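<p>To make the recap concrete, here is a minimal binary LPM trie for IPv4 prefixes in Python: one bit per node, walking from the most significant bit, and remembering the longest match seen so far. This is a sketch of the idea only, not the kernel's implementation.</p>

```python
import ipaddress

class Node:
    __slots__ = ("children", "value")
    def __init__(self):
        self.children = [None, None]  # one child per bit: a binary trie
        self.value = None             # set when an inserted prefix ends here

class LPMTrie:
    """Minimal binary trie with longest-prefix-match lookup (IPv4 only)."""
    def __init__(self):
        self.root = Node()

    def insert(self, cidr, value):
        net = ipaddress.ip_network(cidr)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1  # walk from the most significant bit
            if node.children[bit] is None:
                node.children[bit] = Node()
            node = node.children[bit]
        node.value = value

    def lookup(self, ip):
        bits = int(ipaddress.ip_address(ip))
        node, best = self.root, None
        for i in range(32):
            if node.value is not None:
                best = node.value  # remember the longest match so far
            node = node.children[(bits >> (31 - i)) & 1]
            if node is None:
                return best
        return node.value if node.value is not None else best
```

Note how a lookup never stops at the first match: a /8 entry on the path is only returned if no more specific /16 or /24 entry matches further down.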
    <div>
      <h2>How fast are BPF LPM trie maps?</h2>
      <a href="#how-fast-are-bpf-lpm-trie-maps">
        
      </a>
    </div>
    <p>Here are some numbers from running <a href="https://lore.kernel.org/bpf/20250827140149.1001557-1-matt@readmodwrite.com/"><u>BPF selftests benchmark</u></a> on AMD EPYC 9684X 96-Core machines. Here the trie has 10K entries, a 32-bit prefix length, and an entry for every key in the range [0, 10K).</p><table><tr><td><p>Operation</p></td><td><p>Throughput</p></td><td><p>Stddev</p></td><td><p>Latency</p></td></tr><tr><td><p>lookup</p></td><td><p>7.423M ops/s</p></td><td><p>0.023M ops/s</p></td><td><p>134.710 ns/op</p></td></tr><tr><td><p>update</p></td><td><p>2.643M ops/s</p></td><td><p>0.015M ops/s</p></td><td><p>378.310 ns/op</p></td></tr><tr><td><p>delete</p></td><td><p>0.712M ops/s</p></td><td><p>0.008M ops/s</p></td><td><p>1405.152 ns/op</p></td></tr><tr><td><p>free</p></td><td><p>0.573K ops/s</p></td><td><p>0.574K ops/s</p></td><td><p>1.743 ms/op</p></td></tr></table><p>The time to free a BPF LPM trie with 10K entries is noticeably large. We recently ran into an issue where this took so long that it caused <a href="https://lore.kernel.org/lkml/20250616095532.47020-1-matt@readmodwrite.com/"><u>soft lockup messages</u></a> to spew in production.</p><p>This benchmark gives some idea of worst case behaviour. Since the keys are so densely populated, path compression is completely ineffective. In the next section, we explore the lookup operation to understand the bottlenecks involved.</p>
    <div>
      <h2>Why are BPF LPM tries slow?</h2>
      <a href="#why-are-bpf-lpm-tries-slow">
        
      </a>
    </div>
    <p>The LPM trie implementation in <a href="https://elixir.bootlin.com/linux/v6.12.43/source/kernel/bpf/lpm_trie.c"><u>kernel/bpf/lpm_trie.c</u></a> has a couple of the optimisations we discussed in the introduction. It is capable of multibit comparisons at leaf nodes, but since there are only two child pointers in each internal node, if your tree is densely populated with a lot of data that only differs by one bit, these multibit comparisons degrade into single bit comparisons.</p><p>Here’s an example. Suppose you store the numbers 0, 1, and 3 in a BPF LPM trie. You might hope that since these values fit in a single 32 or 64-bit machine word, you could use a single comparison to decide which next node to visit in the trie. But that’s only possible if your trie implementation has 3 child pointers in the current node (which, to be fair, most trie implementations do). In other words, you want to make a 3-way branching decision but since BPF LPM tries only have two children, you’re limited to a 2-way branch.</p><p>A diagram for this 2-child trie is given below.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ciL2t6aMyJHR2FfX41rNk/365abe47cf384729408cf9b98c65c0be/image9.png" />
          </figure><p>The leaf nodes are shown in green with the key, as a binary string, in the center. Even though a single 8-bit comparison is more than capable of figuring out which node has that key, the BPF LPM trie implementation resorts to inserting intermediate nodes (blue) to inject 2-way branching decisions into your path traversal because its parent (the orange root node in this case) only has 2 children. Once you reach a leaf node, BPF LPM tries can perform a multibit comparison to check the key. If a node supported pointers to more children, the above trie could instead look like this, allowing a 3-way branch and reducing the lookup time.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17VoWl8OY6tzcARKDKuSjS/b9200dbeddf13f101b7085a549742f95/image3.png" />
          </figure><p>This 2-child design impacts the height of the trie. In the worst case, a completely full trie essentially becomes a binary search tree with height log2(nr_entries) and the height of the trie impacts how many comparisons are required to search for a key.</p><p>The above trie also shows how BPF LPM tries implement a form of path compression – you only need to insert an intermediate node where you have two nodes whose keys differ by a single bit. If instead of 3, you insert a key of 15 (0b1111), this won’t change the layout of the trie; you still only need a single node at the right child of the root.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ecfKSeoqN3bfBXmC9KHw5/3be952edea34d6b2cc867ba31ce14805/image12.png" />
          </figure><p>And finally, BPF LPM tries do not implement level compression. Again, this stems from the fact that nodes in the trie can only have 2 children. IP route tables tend to have many prefixes in common and you typically see densely packed tries at the upper levels which makes level compression very effective for tries containing IP routes.</p><p>Here’s a graph showing how the lookup throughput for LPM tries (measured in million ops/sec) degrades as the number of entries increases, from 1 entry up to 100K entries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33I92exrEZTcUWOjxaBOqY/fb1de551b06e3272c8670d0117d738fa/image2.png" />
          </figure><p>Once you reach 1 million entries, throughput is around 1.5 million ops/sec, and continues to fall as the number of entries increases.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4OhaAaI5Y2XJCofI9V39z/567a01b3335f29ef3b46ccdd74dc27e5/image1.png" />
          </figure><p>Why is this? Initially, this is because of the L1 dcache miss rate. All of those nodes that need to be traversed in the trie are potential cache miss opportunities.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Gx4fOLKmhUKHegybQU7sl/4936239213f0061d5cbc2f5d6b63fde6/image11.png" />
          </figure><p>As you can see from the graph, L1 dcache miss rate remains relatively steady and yet the throughput continues to decline. At around 80K entries, dTLB miss rate becomes the bottleneck.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4Jy7aTN3Nyo2EsbSzw313n/d26871fa417ffe293adb47fe7f7dc56b/image7.png" />
          </figure><p>Because BPF LPM tries dynamically allocate individual nodes from a freelist of kernel memory, these nodes can live at arbitrary addresses. This means traversing a path through a trie will almost certainly incur cache misses, and potentially dTLB misses. This gets worse as the number of entries, and the height of the trie, increases.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6CB3MvSvSgH1T2eY7Xlei8/81ebe572592ca71529d79564a88993f0/image10.png" />
          </figure>
    <div>
      <h2>Where do we go from here?</h2>
      <a href="#where-do-we-go-from-here">
        
      </a>
    </div>
    <p>By understanding the current limitations of the BPF LPM trie, we can now work towards building a more performant and efficient solution for the future of the Internet.</p><p>We’ve already contributed these benchmarks to the upstream Linux kernel — but that’s only the start. We have plans to improve the performance of BPF LPM tries, particularly the lookup function which is heavily used for our workloads. This post covered a number of optimisations that are already used by the <a href="https://elixir.bootlin.com/linux/v6.12.43/source/net/ipv4/fib_trie.c"><u>net/ipv4/fib_trie.c</u></a> code, so a natural first step is to refactor that code so that a common Level Compressed trie implementation can be used. Expect future blog posts to explore this work in depth.</p><p>If you’re interested in looking at more performance numbers, Jesper Brouer has recorded some here: <a href="https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org">https://github.com/xdp-project/xdp-project/blob/main/areas/bench/bench02_lpm-trie-lookup.org</a>.</p><h6><i>If the Linux kernel, performance, or optimising data structures excites you, </i><a href="https://www.cloudflare.com/en-gb/careers/jobs/?department=Engineering&amp;location=default"><i>our engineering teams are hiring</i></a><i>.</i></h6><p></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Performance]]></category>
            <guid isPermaLink="false">2A4WHjTqyxprwUMPaZ6tfj</guid>
            <dc:creator>Matt Fleming</dc:creator>
            <dc:creator>Jesper Brouer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare 1.1.1.1 incident on July 14, 2025]]></title>
            <link>https://blog.cloudflare.com/cloudflare-1-1-1-1-incident-on-july-14-2025/</link>
            <pubDate>Tue, 15 Jul 2025 15:05:00 GMT</pubDate>
            <description><![CDATA[ July 14th, 2025, Cloudflare made a change to our service topologies that caused an outage for 1.1.1.1 on the edge, causing downtime for 62 minutes for customers using the 1.1.1.1 public DNS Resolver. ]]></description>
            <content:encoded><![CDATA[ <p>On 14 July 2025, Cloudflare made a change to our service topologies that caused an outage for 1.1.1.1 on the edge, resulting in downtime for 62 minutes for customers using the 1.1.1.1 public DNS Resolver as well as intermittent degradation of service for Gateway DNS.</p><p>Cloudflare's 1.1.1.1 Resolver service became unavailable to the Internet starting at 21:52 UTC and ending at 22:54 UTC. The majority of 1.1.1.1 users globally were affected. For many users, not being able to resolve names using the 1.1.1.1 Resolver meant that basically all Internet services were unavailable. This outage can be observed on <a href="https://radar.cloudflare.com/dns?dateStart=2025-07-14&amp;dateEnd=2025-07-15"><u>Cloudflare Radar</u></a>.</p><p>The outage occurred because of a misconfiguration of legacy systems used to maintain the infrastructure that advertises Cloudflare’s IP addresses to the Internet.</p><p>This was a global outage. During the outage, Cloudflare's 1.1.1.1 Resolver was unavailable worldwide.</p><p>We’re very sorry for this outage. The root cause was an internal configuration error and <u>not</u> the result of an attack or a <a href="https://blog.cloudflare.com/cloudflare-1111-incident-on-june-27-2024/"><u>BGP hijack</u></a>. In this blog, we’re going to talk about what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.</p>
    <div>
      <h2><b>Background</b></h2>
      <a href="#background">
        
      </a>
    </div>
    <p>Cloudflare <a href="https://blog.cloudflare.com/announcing-1111"><u>introduced</u></a> the <a href="https://one.one.one.one/"><u>1.1.1.1</u></a> public DNS Resolver service in 2018. Since the announcement, 1.1.1.1 has become one of the most popular DNS Resolver IP addresses and it is free for anyone to use.</p><p>Almost all of Cloudflare's services are made available to the Internet using a routing method known as <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/"><u>anycast</u></a>, a well-known technique intended to allow traffic for popular services to be served in many different locations across the Internet, increasing capacity and performance. This is the best way to ensure we can globally manage our traffic, but also means that problems with the advertisement of this address space can result in a global outage.   </p><p>Cloudflare announces these anycast routes to the Internet in order for traffic to those addresses to be delivered to a Cloudflare data center, providing services from many different places. Most Cloudflare services are provided globally, like the 1.1.1.1 public DNS Resolver, but a subset of services are specifically constrained to particular regions. </p><p>These services are part of our <a href="https://developers.cloudflare.com/data-localization/"><u>Data Localization Suite</u></a> (DLS), which allows customers to configure Cloudflare in a variety of ways to meet their compliance needs across different countries and regions. One of the ways in which Cloudflare manages these different requirements is to make sure the right service's IP addresses are Internet-reachable only where they need to be, so your traffic is handled correctly worldwide. 
A particular service has a matching "service topology" – that is, traffic for a service should be routed only to a <a href="https://blog.cloudflare.com/introducing-the-cloudflare-data-localization-suite/"><u>particular set of locations</u></a>.</p><p>On June 6, during a release to prepare a service topology for a future DLS service, a configuration error was introduced: the prefixes associated with the 1.1.1.1 Resolver service were inadvertently included alongside the prefixes that were intended for the new DLS service. This configuration error sat dormant in the production network as the new DLS service was not yet in use,  but it set the stage for the outage on July 14. Since there was no immediate change to the production network there was no end-user impact, and because there was no impact, no alerts were fired.</p>
    <div>
      <h2><b>Incident Timeline</b></h2>
      <a href="#incident-timeline">
        
      </a>
    </div>
    <table><tr><td><p>Time (UTC)</p></td><td><p>Event</p></td></tr><tr><td><p>2025-06-06 17:38</p></td><td><p><b>ISSUE INTRODUCED - NO IMPACT</b></p><p>
</p><p>A configuration change was made for a DLS service that was not yet in production. This configuration change accidentally included a reference to the 1.1.1.1 Resolver service and, by extension, the prefixes associated with the 1.1.1.1 Resolver service.</p><p>
</p><p>This change did not result in a change of network configuration, and so routing for the 1.1.1.1 Resolver was not affected.</p><p>
</p><p>Since there was no change in traffic, no alerts fired, but the misconfiguration lay dormant for a future release. </p></td></tr><tr><td><p>2025-07-14 21:48</p></td><td><p><b>IMPACT START</b></p><p>
</p><p>A configuration change was made for the same DLS service. The change attached a test location to the non-production service; this location itself was not live, but the change triggered a refresh of network configuration globally.</p><p>
</p><p>Due to the earlier configuration error linking the 1.1.1.1 Resolver's IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up.</p><p>
</p><p>The 1.1.1.1 Resolver prefixes started to be withdrawn from production Cloudflare data centers globally.</p></td></tr><tr><td><p>2025-07-14 21:52</p></td><td><p>DNS traffic to 1.1.1.1 Resolver service begins to drop globally</p></td></tr><tr><td><p>2025-07-14 21:54</p></td><td><p>Related, non-causal event: BGP origin hijack of 1.1.1.0/24 exposed by withdrawal of routes from Cloudflare. This <b>was not</b> a cause of the service failure, but an unrelated issue that was suddenly visible as that prefix was withdrawn by Cloudflare. </p></td></tr><tr><td><p>2025-07-14 22:01</p><p>
</p></td><td><p><b>IMPACT DETECTED</b></p><p>
</p><p>Internal service health alerts begin to fire for the 1.1.1.1 Resolver</p></td></tr><tr><td><p>2025-07-14 22:01</p></td><td><p><b>INCIDENT DECLARED</b></p></td></tr><tr><td><p>2025-07-14 22:20</p></td><td><p><b>FIX DEPLOYED</b></p><p>
</p><p>Revert was initiated to restore the previous configuration. To accelerate full restoration of service, a manually triggered action is validated in testing locations before being executed.</p></td></tr><tr><td><p>2025-07-14 22:54</p></td><td><p><b>IMPACT ENDS</b></p><p>
</p><p>Resolver alerts cleared and DNS traffic on Resolver prefixes return to normal levels</p></td></tr><tr><td><p>2025-07-14 22:55</p></td><td><p><b>INCIDENT RESOLVED</b></p></td></tr></table>
    <div>
      <h2><b>Impact</b></h2>
      <a href="#impact">
        
      </a>
    </div>
    <p>Any traffic coming to Cloudflare via 1.1.1.1 Resolver services on these IPs was impacted. Traffic to each of these addresses were also impacted on the corresponding routes. </p>
            <pre><code>1.1.1.0/24
1.0.0.0/24 
2606:4700:4700::/48
162.159.36.0/24
162.159.46.0/24
172.64.36.0/24
172.64.37.0/24
172.64.100.0/24
172.64.101.0/24
2606:4700:4700::/48
2606:54c1:13::/48
2a06:98c1:54::/48</code></pre>
            <p>When the impact started we observed an immediate and significant drop in queries over UDP, TCP and <a href="https://www.rfc-editor.org/rfc/rfc7858"><u>DNS over TLS (DoT)</u></a>. Most users have 1.1.1.1, 1.0.0.1, 2606:4700:4700::1111, or 2606:4700:4700::1001 configured as their DNS server. Below you can see the query rate for each of the individual protocols and how they were impacted during the incident:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/XATlkx1Im1QhnBTJL3ER5/6cc65fce22bd66815c348dac555a1501/image1.png" />
          </figure><p>It’s worth noting that <a href="https://developers.cloudflare.com/1.1.1.1/encryption/dns-over-https/"><u>DoH (DNS-over-HTTPS)</u></a> traffic remained relatively stable as most DoH users use the domain <a href="http://cloudflare-dns.com"><u>cloudflare-dns.com</u></a>, configured manually or through their browser, to access the public DNS resolver, rather than by IP address. DoH remained available and traffic was mostly unaffected as <a href="http://cloudflare-dns.com"><u>cloudflare-dns.com</u></a> uses a different set of IP addresses. Some DNS traffic over UDP that also used different IP addresses remained mostly unaffected as well.</p><p>As the corresponding prefixes were withdrawn, no traffic sent to those addresses could reach Cloudflare. We can see this in the timeline for the BGP announcements for 1.1.1.0/24:
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/c28k2YwaBVLevqmpV4cjG/f923ecef419b71e5b70cb6a6ca616bbd/image5.png" />
          </figure><p><sup><i>Pictured above is the timeline for BGP withdrawal and re-announcement of 1.1.1.0/24 globally</i></sup></p><p>When looking at the query rate of the withdrawn IPs it can be observed that almost no traffic arrives during the impact window. When the initial fix was applied at 22:20 UTC, a large spike in traffic can be seen before it drops off again. This spike is due to clients retrying their queries. When we started announcing the withdrawn prefixes again, queries were able to reach Cloudflare once more. It took until 22:54 UTC before routing was restored in all locations and traffic returned to mostly normal levels.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5vfTPQ6ndKXzsgphist0Mg/610477306f1f056b4cdf98fbbe274e5b/image6.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/67oZjnT3jx272udhoA5hp7/8c41c972162f81d020cb5d189885882a/image3.png" />
          </figure>
    <div>
      <h2><b>Technical description of the error and how it happened</b></h2>
      <a href="#technical-description-of-the-error-and-how-it-happened">
        
      </a>
    </div>
    
    <div>
      <h3>Failure of 1.1.1.1 Resolver Service</h3>
      <a href="#failure-of-1-1-1-1-resolver-service">
        
      </a>
    </div>
    <p>As described above, a configuration change on June 6 introduced an error in the service topology for a pre-production, DLS service. On July 14, a second change to that service was made: an offline data center location was added to the service topology for the pre-production DNS service in order to allow for some internal testing. This change triggered a refresh of the global configuration of the associated routes, and it was at this point that the impact from the earlier configuration error was felt. The service topology for the 1.1.1.1 Resolver's prefixes was reduced from all locations down to a single, offline location. The effect was to trigger the global and immediate withdrawal of all 1.1.1.1 prefixes.</p><p>As routes to 1.1.1.1 were withdrawn, the 1.1.1.1 service itself became unavailable. Alerts fired and an incident was declared.</p>
    <div>
      <h3>Technical Investigation and Analysis</h3>
      <a href="#technical-investigation-and-analysis">
        
      </a>
    </div>
    <p>The way that Cloudflare manages service topologies has been refined over time and currently consist of a combination of a legacy and a strategic system that are synced. Cloudflare's IP ranges are currently bound and configured across these systems that  dictate where an IP range should be announced (in terms of datacenter location) on the edge network. The legacy approach of hard-coding explicit lists of data center locations and attaching them to particular prefixes has proved error-prone, since (for example) bringing a new data center online requires many different lists to be updated and synced consistently. This model also has a significant flaw in that updates to the configuration do not follow a progressive deployment methodology: Even though this release was peer-reviewed by multiple engineers, the change didn’t go through a series of canary deployments before reaching every Cloudflare data center. Our newer approach is to describe service topologies without needing to hard-code IP addresses, which better accommodate expansions to new locations and customer scenarios while also allowing for a staged deployment model, so changes can propagate slowly with health monitoring. During the migration between these approaches, we need to maintain both systems and synchronize data between them, which looks like this:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ofHPUKzoes5uJY7VluA0F/b39b729457ef62361443f7c83444d8fe/image2.png" />
          </figure><p>Initial alerts were triggered for the DNS Resolver at 22:01, indicating query, proxy, and data center failures. While investigating the alerts, we noted traffic toward the Resolver prefixes had drastically dropped and was no longer being received at our edge data centers. Internally, we use BGP to control route advertisements, and we found the Resolver routes from servers were completely missing.</p><p>Once our configuration error had been exposed and Cloudflare systems had withdrawn the routes from our routing table, all of the 1.1.1.1 routes should have disappeared entirely from the global Internet routing table. However, this isn’t what happened with the prefix 1.1.1.0/24. Instead, we got reports from <a href="https://radar.cloudflare.com/routing/anomalies/hijack-107469"><u>Cloudflare Radar</u></a> that Tata Communications India (AS4755) had started advertising 1.1.1.0/24: from the perspective of the routing system, this looked exactly like a prefix hijack. This was unexpected to see while we were troubleshooting the routing problem, but to be perfectly clear: <b>this BGP hijack was not the cause of the outage.</b> We are following up with Tata Communications.</p>
    <div>
      <h3>Restoring the 1.1.1.1 Service</h3>
      <a href="#restoring-the-1-1-1-1-service">
        
      </a>
    </div>
    <p>We reverted to the previous configuration at 22:20 UTC. Near instantly, we began readvertising the BGP prefixes which were previously withdrawn from the routers, including 1.1.1.0/24. This restored 1.1.1.1 traffic levels to roughly 77% of what they were prior to the incident. However, during the period since withdrawal, approximately 23% of the fleet of edge servers had been automatically reconfigured to remove required IP bindings as a result of the topology change. To add the configurations back, these servers needed to be reconfigured with our change management system which is not an instantaneous process by default for safety. </p><p>The process by which the IP bindings can be restored normally takes some time, as the network in individual locations is designed to be updated over a course of multiple hours. We implement a progressive rollout, rather than on all nodes at once to ensure we don’t introduce additional impact. However, given the severity of the incident, we accelerated the rollout of the fix after verifying the changes in testing locations to restore service as quickly and safely as possible. Normal traffic levels were observed at 22:54 UTC.</p>
    <div>
      <h2><b>Remediation and follow-up steps</b></h2>
      <a href="#remediation-and-follow-up-steps">
        
      </a>
    </div>
    <p>We take incidents like this seriously, and we recognise the impact that this incident had. Though this specific issue has been resolved, we have identified several steps we can take to mitigate the risk of a similar problem occurring in the future. We are implementing the following plan as a result of this incident:</p><p><b>Staging Addressing Deployments: </b>Legacy components do not leverage a gradual, staged deployment methodology. Cloudflare will deprecate these systems which enables modern progressive and health mediated deployment processes to provide earlier indication in a staged manner and rollback accordingly.</p><p><b>Deprecating Legacy Systems:</b> We are currently in an intermediate state in which current and legacy components need to be updated concurrently, so we will be migrating addressing systems away from risky deployment methodologies like this one. We will accelerate our deprecation of the legacy systems in order to provide higher standards for documentation and test coverage.</p>
    <div>
      <h2><b>Conclusion</b></h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare's 1.1.1.1 DNS Resolver service fell victim to an internal configuration error.</p><p>We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[1.1.1.1]]></category>
            <category><![CDATA[WARP]]></category>
            <guid isPermaLink="false">5rRaCTCC50CW9n2PKjL7xY</guid>
            <dc:creator>Ash Pallarito</dc:creator>
            <dc:creator>Joe Abley</dc:creator>
        </item>
        <item>
            <title><![CDATA[Your IPs, your rules: enabling more efficient address space usage]]></title>
            <link>https://blog.cloudflare.com/your-ips-your-rules-enabling-more-efficient-address-space-usage/</link>
            <pubDate>Mon, 19 May 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ IPv4 is expensive, and moving network resources around is hard. Previously, when customers wanted to use multiple Cloudflare services, they had to bring a new address range. ]]></description>
            <content:encoded><![CDATA[ <p>IPv4 addresses have become a costly commodity, driven by their growing scarcity. With the original pool of 4.3 billion addresses long exhausted, organizations must now rely on the secondary market to acquire them. Over the years, prices have surged, often exceeding $30–$50 USD per address, with <a href="https://auctions.ipv4.global/?cf_history_state=%7B%22guid%22%3A%22C255D9FF78CD46CDA4F76812EA68C350%22%2C%22historyId%22%3A6%2C%22targetId%22%3A%22B695D806845101070936062659E97ADD%22%7D"><u>costs</u></a> varying based on block size and demand. Given the scarcity, these prices are only going to rise, particularly for businesses that haven’t transitioned to <a href="https://blog.cloudflare.com/amazon-2bn-ipv4-tax-how-avoid-paying/"><u>IPv6</u></a>. This rising cost and limited availability have made efficient IP address management more critical than ever. In response, we’ve evolved how we handle BYOIP (<a href="https://blog.cloudflare.com/bringing-your-own-ips-to-cloudflare-byoip/"><u>Bring Your Own IP</u></a>) prefixes to give customers greater flexibility.</p><p>Historically, when customers onboarded a BYOIP prefix, they were required to assign it to a single service, binding all IP addresses within that prefix to one service before it was advertised. Once set, the prefix's destination was fixed — to direct traffic exclusively to that service. If a customer wanted to use a different service, they had to onboard a new prefix or go through the cumbersome process of offboarding and re-onboarding the existing one.</p><p>As a step towards addressing this limitation, we’ve introduced a new level of flexibility: customers can now use parts of any prefix — whether it’s bound to Cloudflare CDN, Spectrum, or Magic Transit — for additional use with CDN or Spectrum. This enhancement provides much-needed flexibility, enabling businesses to optimize their IP address usage while keeping costs under control. </p>
    <div>
      <h2>The challenges of moving onboarded BYOIP prefixes between services</h2>
      <a href="#the-challenges-of-moving-onboarded-byoip-prefixes-between-services">
        
      </a>
    </div>
    <p>Migrating BYOIP prefixes dynamically between Cloudflare services is no trivial task, especially with thousands of servers capable of accepting and processing connections. The problem required overcoming several technical challenges related to IP address management, kernel-level bindings, and orchestration. </p>
    <div>
      <h3>Dynamic reallocation of prefixes across services</h3>
      <a href="#dynamic-reallocation-of-prefixes-across-services">
        
      </a>
    </div>
    <p>When configuring an IP prefix for a service, we need to update IP address lists and firewall rules on each of our servers to allow only the traffic we expect for that service, such as opening ports 80 and 443 to allow HTTP and HTTPS traffic for the Cloudflare CDN. We use Linux <a href="https://en.wikipedia.org/wiki/Iptables#:~:text=iptables%20is%20a%20user%2Dspace,to%20treat%20network%20traffic%20packets."><u>iptables</u></a> and <a href="https://en.wikipedia.org/wiki/Iptables"><u>IP sets</u></a> for this.</p><p>Migrating IP prefixes to a different service involves dynamically reassigning them to different IP sets and iptable rules. This requires automated updates across a large-scale distributed environment.</p><p>As prefixes shift between services, it is critical that servers update their IP sets and iptable rules dynamically to ensure traffic is correctly routed. Failure to do so could lead to routing loops or dropped connections.  </p>
    <div>
      <h3>Updating Tubular – an eBPF-based IP and port binding service</h3>
      <a href="#updating-tubular-an-ebpf-based-ip-and-port-binding-service">
        
      </a>
    </div>
    <p>Most web applications bind to a list of IP addresses at startup, and listen on only those IPs until shutdown. To allow customers to change the IPs bound to each service dynamically, we needed a way to add and remove IPs from a running service, without restarting it. <a href="https://blog.cloudflare.com/tubular-fixing-the-socket-api-with-ebpf/"><u>Tubular</u></a> is a <a href="https://blog.cloudflare.com/cloudflare-architecture-and-how-bpf-eats-the-world/"><u>BPF</u></a> program we wrote that runs on Cloudflare servers that allows services to listen on a single socket, dynamically updating the list of addresses that are routed to that socket over the lifetime of the service, without requiring it to restart when those addresses change.</p><p>A significant engineering challenge was extending Tubular to support traffic destined for Cloudflare’s CDN.  Without this enhancement, customers would be unable to leverage dynamic reassignment to bind prefixes onboarded through Spectrum to the Cloudflare CDN, limiting flexibility across services.</p><p>Cloudflare’s CDN depends on each server running an NGINX ingress proxy to terminate incoming connections. Due to the <a href="https://blog.cloudflare.com/how-we-built-pingora-the-proxy-that-connects-cloudflare-to-the-internet/"><u>scale and performance limitations of NGINX</u></a>, we are actively working to replace it by 2026. In the interim, however, we still depend on the current ingress proxy to reliably handle incoming connections.</p><p>One limitation is that this ingress proxy does not support <a href="https://systemd.io/"><u>systemd</u></a> socket activation, a mechanism Tubular relies on to integrate with other Cloudflare services on each server. 
For services that do support systemd socket activation, systemd independently starts the sockets for the owning service and passes them to Tubular, allowing Tubular to easily detect and route traffic to the correct terminating service.</p><p>Since this integration model is not feasible, an alternative solution was required. This was addressed by introducing a shared Unix domain socket between Tubular and the ingress proxy service on each server. Through this channel,  the ingress proxy service explicitly transmits socket information to Tubular, enabling it to correctly register the sockets in its datapath.</p><p>The final challenge was deploying the Tubular-ingress proxy integration across the fleet of servers without disrupting active connections. As of April 2025, Cloudflare handles an average of 71 million HTTP requests per second, peaking at 100 million. To safely deploy at this scale, the necessary Tubular and ingress proxy configuration changes were staged across all Cloudflare servers without disrupting existing connections. The final step involved adding bindings — IP addresses and ports corresponding to Cloudflare CDN prefixes — to the Tubular configuration. These bindings direct connections through Tubular via the Unix sockets registered during the previous integration step. To minimize risk, bindings were gradually enabled in a controlled rollout across the global fleet.</p>
    <div>
      <h4>Tubular data plane in action</h4>
      <a href="#tubular-data-plane-in-action">
        
      </a>
    </div>
    <p>This high-level representation of the Tubular data plane binds together the Layer 4 protocol (TCP), prefix (192.0.2.0/24 - which is 254 usable IP addresses), and port number 0 (any port). When incoming packets match this combination, they are directed to the correct socket of the service — in this case, Spectrum.​</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5yQpYeTxPM7B8DZwLsQATs/3f488c5b37ef2358eacf779a42ac59d5/image4.png" />
          </figure><p>In the following example, TCP 192.0.2.200/32 has been upgraded to the Cloudflare CDN via the edge <a href="https://developers.cloudflare.com/api/resources/addressing/subresources/prefixes/subresources/service_bindings/"><u>Service Bindings API</u></a>. Tubular dynamically consumes this information, adding a new entry to its data plane bindings and socket table. Using Longest Prefix Match, all packets within the 192.0.2.0/24 range port 0 will be routed to Spectrum, except for 192.0.2.200/32 port 443, which will be directed to the Cloudflare CDN.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6wWlR9gWb6JEoyZm4iOpgQ/4a59bcab4a6731a53ea235500596c7f5/image1.png" />
          </figure>
    <div>
      <h4>Coordination and orchestration at scale </h4>
      <a href="#coordination-and-orchestration-at-scale">
        
      </a>
    </div>
    <p>Our goal is to achieve a quick transition of IP address prefixes between services when initiated by customers, which requires a high level of coordination. We need to ensure that changes propagate correctly across all servers to maintain stability. Currently, when a customer migrates a prefix between services, there is a 4-6 hour window of uncertainty where incoming packets may be dropped due to a lack of guaranteed routing. To address this, we are actively implementing systems that will reduce this transition time from hours to just a matter of minutes, significantly improving reliability and minimizing disruptions.</p>
    <div>
      <h2>Smarter IP address management</h2>
      <a href="#smarter-ip-address-management">
        
      </a>
    </div>
    <p>Service Bindings are mappings that control whether traffic destined for a given IP address is routed to Magic Transit, the CDN pipeline, or the Spectrum pipeline.</p><p>Consider the example in the diagram below. One of our customers, a global finance infrastructure platform, is using BYOIP and has a /24 range bound to <a href="https://developers.cloudflare.com/spectrum/"><u>Spectrum</u></a> for DDoS protection of their TCP and UDP traffic. However, they are only using a few addresses in that range for their Spectrum applications, while the rest go unused. In addition, the customer is using Cloudflare’s CDN for their Layer 7 traffic and wants to set up <a href="https://developers.cloudflare.com/byoip/concepts/static-ips/"><u>Static IPs</u></a>, so that their customers can allowlist a consistent set of IP addresses owned and controlled by their own network infrastructure team. Instead of using up another block of address space, they asked us whether they could carve out those unused sub-ranges of the /24 prefix.</p><p>From there, we set out to determine how to selectively map sub-ranges of the onboarded prefix to different services using service bindings:</p><ul><li><p>192.0.2.0/24 is already bound to <b>Spectrum</b></p><ul><li><p>192.0.2.0/25 is updated and bound to <b>CDN</b></p></li><li><p>192.0.2.200/32 is also updated bound to <b>CDN</b></p></li></ul></li></ul><p>Both the /25 and /32 are sub-ranges within the /24 prefix and will receive traffic directed to the CDN. All remaining IP addresses within the /24 prefix, unless explicitly bound, will continue to use the default Spectrum service binding.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/uwhMHBEuI1NHfp9qD9IFM/d2dcea59a8d9f962f03389831fd73851/image3.png" />
          </figure><p>As you can see in this example, this approach provides customers with greater control and agility over how their IP address space is allocated. Instead of rigidly assigning an entire prefix to a single service, users can now tailor their IP address usage to match specific workloads or deployment needs. Setting this up is straightforward — all it takes is a few HTTP requests to the <a href="https://developers.cloudflare.com/api/resources/addressing/subresources/prefixes/subresources/service_bindings/"><u>Cloudflare API</u></a>. You can define service bindings by specifying which IP addresses or subnets should be routed to CDN, Spectrum, or Magic Transit. This allows you to tailor traffic routing to match your architecture without needing to restructure your entire IP address allocation. The process remains consistent whether you're configuring a single IP address or splitting up larger subnets, making it easy to apply across different parts of your network. The foundational technical work addressing the underlying architectural challenges outlined above made it possible to streamline what could have been a complex setup into a straightforward series of API interactions.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We envision a future where customers have granular control over how their traffic moves through Cloudflare’s global network, not just by service, but down to the port level. A single prefix could simultaneously power web applications on CDN, protect infrastructure through Magic Transit, and much more. This isn't just flexible routing, but programmable traffic orchestration across different services. What was once rigid and static becomes dynamic and fully programmable to meet each customer’s unique needs. </p><p>If you are an existing BYOIP customer using Magic Transit, CDN, or Spectrum, check out our <a href="https://developers.cloudflare.com/byoip/service-bindings/magic-transit-with-cdn/"><u>configuration guide here</u></a>. If you are interested in bringing your own IP address space and using multiple Cloudflare services on it, please reach out to your account team to enable setting up this configuration via <a href="https://developers.cloudflare.com/api/resources/addressing/subresources/prefixes/subresources/service_bindings/"><u>API</u></a> or reach out to sales@cloudflare.com if you’re new to Cloudflare.</p> ]]></content:encoded>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Addressing]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[BYOIP]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[CDN]]></category>
            <category><![CDATA[Magic Transit]]></category>
            <guid isPermaLink="false">7FAYMppkyZG4CEGdLEcLlR</guid>
            <dc:creator>Mark Rodgers</dc:creator>
            <dc:creator>Sphoorti Metri</dc:creator>
            <dc:creator>Ash Pallarito</dc:creator>
        </item>
        <item>
            <title><![CDATA[Simplify allowlist management and lock down origin access with Cloudflare Aegis]]></title>
            <link>https://blog.cloudflare.com/aegis-deep-dive/</link>
            <pubDate>Thu, 20 Mar 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare Aegis provides dedicated egress IPs for Zero Trust origin access strategies, now supporting BYOIP and customer-facing configurability, with observability of Aegis IP utilization soon. ]]></description>
            <content:encoded><![CDATA[ <p>Today, we’re taking a deep dive into <a href="https://developers.cloudflare.com/aegis/"><u>Aegis</u></a>, Cloudflare’s origin protection product, to help you understand what the product is, how it works, and how to take full advantage of it for locking down access to your origin. We’re excited to announce the availability of <a href="https://developers.cloudflare.com/byoip/"><u>Bring Your Own IPs (BYOIP)</u></a> for Aegis, a customer-accessible Aegis API, and a gradual rollout for observability of Aegis IP utilization.</p><p>If you are new to Cloudflare Aegis, let’s take a step back and understand the product’s purpose and security benefits, process, and how it came to be. </p>
    <div>
      <h3>Origin protection then…</h3>
      <a href="#origin-protection-then">
        
      </a>
    </div>
    <p>Allowlisting a specific set of IP addresses has long existed as one of the simplest ways of restricting access to a server. This firewall mechanism is a starting state that just about every server supports. As we built Cloudflare’s network, one of the first features that customers requested was the ability to restrict access to their origin, so only Cloudflare could make requests to it. Back then, the most natural way to support this was to tell our customers which IP addresses belong to us, so they could allowlist those in their origin firewall. To that end, we have published our <a href="https://www.cloudflare.com/ips/"><u>IP address ranges</u></a>, providing an easy configuration to ensure that all traffic accessing your origin comes from Cloudflare’s network.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4I2fP03AeszxuHL78Ap9lA/bf75dcb9259b8b97b73f55831b3c019f/BLOG-2609_2.png" />
          </figure><p>However, Cloudflare’s IP ranges are used across multiple Cloudflare services and customers, so restricting access to the full list doesn’t necessarily give customers the security benefit they need. With the <a href="https://www.cloudflare.com/2024-api-security-management-report/#:~:text=Cloudflare%20serves%20over%2050%20million,billion%20cyber%20threats%20each%20day."><u>frequency</u></a> and <a href="https://blog.cloudflare.com/how-cloudflare-auto-mitigated-world-record-3-8-tbps-ddos-attack/"><u>scale</u></a> of IP-based and DDoS attacks every day, origin protection is absolutely paramount. Some customers have the need for more stringent security precautions to ensure that traffic is only coming from Cloudflare’s network and, more specifically, only coming from their zones within Cloudflare.</p>
    <div>
      <h3>Origin protection now…</h3>
      <a href="#origin-protection-now">
        
      </a>
    </div>
    <p>Cloudflare has built out additional services to lock down access to your origin, like <a href="https://developers.cloudflare.com/ssl/origin-configuration/authenticated-origin-pull/"><u>authenticated origin pulls</u></a> (mTLS) and <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"><u>Cloudflare Tunnels</u></a>, which no longer rely on IP addresses as an indicator of identity. These are part of a global effort towards <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/#:~:text=Zero%20Trust%20security%20means%20that,shown%20to%20prevent%20data%20breaches."><u>Zero Trust security</u></a>: whereas the Internet used to operate under a trust-but-verify model, we aim to operate as if nothing is trusted and everything is verified. </p><p>Having non-ephemeral IP addresses — upon which the firewall allowlist mechanism relies — does not quite fit the Zero Trust model. Although mTLS and similar solutions present a more modern approach to origin security, they aren’t always feasible for customers, depending on their hardware or system architecture. </p><p>We launched <a href="https://blog.cloudflare.com/cloudflare-aegis/"><u>Cloudflare Aegis</u></a> in March 2023 for customers seeking an intermediary security solution. Aegis provides a dedicated IP address, or set of addresses, from which Cloudflare sends requests, allowing you to further lock down your origin’s layer 3 firewall. Aegis also simplifies management by only requiring you to allowlist a small number of IP addresses. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KmeBCAqzygLJWR7oMf7Mw/5303403c266e80d271160b9c32cbd764/BLOG-2609_3.png" />
          </figure><p>Normally, Cloudflare’s <a href="https://www.cloudflare.com/ips/"><u>publicly listed IP ranges</u></a> are used to egress from Cloudflare’s network to the customer origin. With these IP addresses distributed across Cloudflare’s network, the customer traffic can egress from many servers to the customer origin.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3RbmfIwMajVSjnHdEBN1Y/f99b2f41cb289df93f90c1960bf6a497/BLOG-2609_4.png" />
          </figure><p>With Aegis, a customer does not necessarily have an Aegis IP address on every server if they are using IPv4. That means requests must be routed through Cloudflare’s network to a server where Aegis IP addresses are present before the traffic can egress to the customer origin. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TdNsZJmckJZuVCmAyCGbA/d7ed52f94c8a6fc375363bf011e40229/BLOG-2609_5.png" />
          </figure>
    <div>
      <h3>How requests are routed with Aegis</h3>
      <a href="#how-requests-are-routed-with-aegis">
        
      </a>
    </div>
    <p>A few terms, before we begin:</p><ul><li><p>Anycast: a technology where each of our data centers “announces” and can handle the same IP address ranges</p></li><li><p>Unicast: a technology where each server is given its own, unique <i>unicast</i> IP address</p></li></ul><p>Dedicated egress Aegis IPs are located in a specific set of data centers. This list is handpicked by the customer, in conversation with Cloudflare, to be geographically close to their origin servers for optimal security and performance. </p><p>Aegis relies on a technology called <a href="https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-anymore/#soft-unicast-is-indistinguishable-from-magic"><u>soft-unicasting</u></a>, which allows us to share a /32 egress IPv4 address amongst many servers, thereby enabling us to spread a single subnet across many data centers. Then, the traffic going back from the origin servers (the return path) is routed to the closest data center. Once in Cloudflare's network, our in-house <a href="https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/"><u>L4 XDP-based load balancer, Unimog,</u></a> ensures that the return packets make it back to the machine that connected to the origin servers at the start.</p><p>This supports fast, local, and reliable egress from Cloudflare’s network. With this configuration, we essentially use <a href="https://www.cloudflare.com/learning/cdn/glossary/anycast-network/"><u>Anycast</u></a> at the <a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/"><u>BGP layer</u></a> before using an IP and port range to reach a specific machine in the correct data center. Across Cloudflare’s network, we use a significant range of egress IPs to cover all data centers and machines. Since Aegis customers only have a few IPv4 addresses, the range is limited to a few data centers rather than Cloudflare’s entire egress IP range.</p>
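To make the soft-unicast idea concrete, here is a minimal Python sketch of the "IP and port range to reach a specific machine" lookup. It is purely illustrative: the server names and the even four-way split are invented, and this is not Unimog's actual logic.

```python
# Illustrative sketch of soft-unicast: one egress IPv4 address is shared across
# many servers by giving each server a disjoint slice of the port range. Return
# packets are then steered to the owning machine by destination port.
PORT_MIN, PORT_MAX = 1024, 65535

def build_port_map(servers):
    """Assign each server a contiguous, disjoint slice of the shared port range."""
    total = PORT_MAX - PORT_MIN + 1
    slice_size = total // len(servers)
    port_map = {}
    for i, server in enumerate(servers):
        lo = PORT_MIN + i * slice_size
        hi = PORT_MAX if i == len(servers) - 1 else lo + slice_size - 1
        port_map[server] = (lo, hi)
    return port_map

def route_return_packet(port_map, dst_port):
    """Pick the server whose port slice contains the packet's destination port."""
    for server, (lo, hi) in port_map.items():
        if lo <= dst_port <= hi:
            return server
    raise ValueError(f"port {dst_port} outside the shared range")

servers = ["machine-a", "machine-b", "machine-c", "machine-d"]
pm = build_port_map(servers)
```

A return packet addressed to the shared IP on port 2000 lands in the first slice, so it is handed to "machine-a"; one on port 65000 lands in the last slice and goes to "machine-d".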
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Y2sZmlh9uy6tRAiJhCZOp/ae186c36f30ba80500b615451952dfd7/BLOG-2609_6.png" />
          </figure>
    <div>
      <h3>The capacity issue</h3>
      <a href="#the-capacity-issue">
        
      </a>
    </div>
    <p>Every IP address has <a href="https://www.cloudflare.com/learning/network-layer/what-is-a-computer-port/#:~:text=There%20are%2065%2C535%20possible%20port,File%20Transfer%20Protocol%20(FTP)."><u>65,535 ports</u></a>. A request egresses from exactly one port on the Aegis IP address to exactly one port on the origin IP address. </p><p>Each TCP connection is identified by a 4-way tuple that contains:</p><ol><li><p>Source IP address</p></li><li><p>Source port</p></li><li><p>Destination IP address</p></li><li><p>Destination port</p></li></ol><p>A <a href="https://blog.cloudflare.com/everything-you-ever-wanted-to-know-about-udp-sockets-but-were-afraid-to-ask-part-1/"><u>UDP request</u></a> can likewise be identified by a 4-way tuple (if it’s connected) or a 2-way tuple (if it’s unconnected), simply including a bind IP address and port. Aegis supports both TCP and UDP traffic — in either case, the requests rely upon IP:port pairings between the source and destination. </p><p>When a request reaches the origin, it opens a <i>connection</i>, through which data can pass between the source and destination. One source port can sustain multiple connections at a time, <i>only</i> if the destination IP:ports are different. </p><p>Normally at Cloudflare, an IP address establishes connections to a variety of different destination IP addresses and ports to support high traffic volumes. With Aegis, that is no longer the case. The challenge with Aegis IP capacity is exactly that: all the traffic is egressing to the same (or a small set of) origin IP address(es) from the same (or a small set of) source IP address(es). That means Aegis IP addresses have capacity constraints associated with them.</p><p>The number of <i>concurrent connections</i> is the number of simultaneous connections for a given 4-way tuple. 
Between one client and one server, the volume of concurrent connections is inherently limited by the number of ports on an IP address to 65,535 — each source IP:port can only support a single outbound connection per destination IP:port. In practice, that maximum number of concurrent connections is often lower due to assignments of port ranges across many servers and imperfect load distribution. </p><p>For planning purposes, we use an estimate of ~80% of the IP capacity (the volume of concurrent connections a source IP address can support to a destination IP address) to protect against overload, in case of traffic spikes. If every port on an IP address is maintaining a concurrent connection, that address has reached capacity — it is suffering port exhaustion. Requests may then be dropped since no new connections can be established. To build in resiliency, we only plan to support 40k concurrent connections per Aegis IP address per origin.</p>
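The arithmetic above can be captured in a short sketch. The 65,535-port ceiling, the ~80% estimate, and the 40k-per-IP-per-origin budget come from the text; the ceiling-division helper is our own, for illustration only.

```python
# Back-of-the-envelope Aegis capacity math: one source IP:port can hold exactly
# one connection to a given destination IP:port, so a single source IP tops out
# at 65,535 concurrent connections to one destination IP:port.
PORTS_PER_IP = 65_535

usable = PORTS_PER_IP * 80 // 100   # ~80% estimate to absorb traffic spikes
planned = 40_000                    # conservative per-IP, per-origin budget

def aegis_ips_needed(peak_concurrent_connections, per_ip_budget=planned):
    """IPv4 addresses a single origin would need at peak, per the planning budget."""
    return -(-peak_concurrent_connections // per_ip_budget)  # ceiling division
```

For example, an origin expecting 100,000 concurrent connections would need three Aegis IPv4 addresses under this budget.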
    <div>
      <h3>Aegis with IPv6</h3>
      <a href="#aegis-with-ipv6">
        
      </a>
    </div>
    <p>Each customer who onboards with Cloudflare Aegis receives two <a href="https://www.ripe.net/about-us/press-centre/understanding-ip-addressing/#:~:text=of%20IPv6%20addresses.-,IPv6%20Relative%20Network%20Sizes,-/128"><u>/64 prefixes</u></a> to be globally allocated and announced. That means, outside of Cloudflare’s China Network, every Cloudflare data center has hundreds or even thousands of addresses reserved for egressing your traffic directly to your origin. Without Aegis, any data center in Cloudflare’s Anycast network can serve as a point of egress – so we built Aegis with IPv6 to preserve that level of resiliency and performance. The sheer scale of IPv6, with its available address space, allows us to cushion Aegis’ capacity to a point far beyond any reasonable concern. Globally allocating and announcing your Aegis IPv6 addresses maintains all of Cloudflare’s functionality as a reverse proxy without inducing additional friction.</p>
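For a sense of that scale, Python's standard library can compute the size of a /64 directly. The prefix below is the reserved IPv6 documentation prefix, not a real Aegis allocation.

```python
import ipaddress

# A single /64 holds 2**64 addresses; two of them, as each Aegis customer
# receives, is more than 36 quintillion addresses.
prefix = ipaddress.ip_network("2001:db8::/64")
print(prefix.num_addresses)        # 18446744073709551616 (~18.4 quintillion)
print(2 * prefix.num_addresses)    # 36893488147419103232 for two /64s
```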
    <div>
      <h3>Aegis with IPv4</h3>
      <a href="#aegis-with-ipv4">
        
      </a>
    </div>
    <p>Although using IPv6 with Aegis facilitates the best possible speed and resiliency for your traffic, we recognize the transition from IPv4 to IPv6 can be challenging for some customers. Moreover, some customers prefer Aegis IPv4 for granular control over their traffic’s physical egress locations. Still, IPv4 space is more limited and more expensive — while all Cloudflare Aegis customers simply receive two dedicated /64s for IPv6, enabling Aegis with IPv4 requires a touch more tailoring. When you onboard to Aegis, we work with you to determine the ideal number of IPv4 addresses for your Aegis configuration to maintain optimal performance and resiliency, while also ensuring cost efficiency. </p><p>Naturally, this introduces a bottleneck — whereas every Cloudflare data center can serve as a point of egress with Aegis IPv6, only a small fraction will have that capability with Aegis IPv4. We aim to mitigate this impact by careful provisioning of the IPv4 addresses. </p><p>Now that BYOIP for Aegis is supported, you can also onboard an entire IPv4 <a href="https://www.ripe.net/about-us/press-centre/understanding-ip-addressing/#:~:text=in%20that%20%E2%80%9Cblock%E2%80%9D.-,IPv4,-The%20size%20of"><u>/24</u></a> prefix or IPv6 /64 for Aegis, allowing for a cost-effective configuration with a much higher volume of capacity.</p><p>When we launched Aegis, each IP address was allocated to one data center, requiring at least two IPv4 addresses for appropriate resiliency. To reduce the number of IP addresses necessary in your layer 3 firewall allowlist, and to manage the cost to the customer of leasing IPs, we expanded our Aegis functionality so that one address can be announced from up to four data centers. To do this, we essentially slice the available IP port range into four subsets and provision each at a unique data center. </p><p>A quick refresher: when a request travels through Cloudflare, it first hits our network via an <i>ingress data center</i>. 
The ingress data center is generally near the eyeball (the end user) sending the request. Then, the request is routed following BGP – or <a href="https://developers.cloudflare.com/argo-smart-routing/"><u>Argo Smart Routing</u></a>, when enabled – to an <i>exit, or egress, data center</i>. The exit data center will generally fall in close geographic proximity to the request’s destination, which is the customer origin. This mitigates latency induced by the final hop from Cloudflare’s network to your origin.</p><p>With Aegis, the possible exit data centers are limited to the data centers in which an Aegis IP address has been allocated. For IPv6, this is a non-issue, since every data center outside our China Network is covered. With IPv4, however, the exit data centers are limited to a much smaller number (4 × the number of Aegis IPs). Aegis IP addresses are allocated, then, to data centers in close geographic proximity to your origin(s). This maximizes the likelihood that whichever data centers would ordinarily have been selected as the egress data center are already announcing Aegis IP addresses. Theoretically, no extra hop is necessary from the optimal exit data center to an Aegis-enabled data center – they are one and the same. In practice, this cannot be guaranteed 100% of the time because optimal routes are ever-changing. We recommend IPv6 to ensure optimal performance because of this possibility of an extra hop with IPv4.</p><p>A brief comparison, to summarize:</p><table><tr><th><p>
</p></th><th><p><b>Aegis IPv4</b></p></th><th><p><b>Aegis IPv6</b></p></th></tr><tr><td><p>Physical points of egress</p></td><td><p>4 physical data center sites (1-2 cities near origin) per IP address</p></td><td><p>All 300+ Cloudflare <a href="https://www.cloudflare.com/network/"><u>locations</u></a> (excluding China network)</p></td></tr><tr><td><p>Capacity</p></td><td><p>One IPv4 address per 40,000 concurrent connections per origin</p></td><td><p>Two /64 prefixes for all Aegis customers (&gt;36 quintillion IP addresses)</p><p>~50,000x capacity of IPv4 config</p></td></tr><tr><td><p>Pricing model</p></td><td><p>Monthly fee based on IPv4 leases or BYOIP for Aegis prefix fees</p></td><td><p>Included with product purchase or BYOIP for Aegis prefix fees</p></td></tr></table><p>Now, with Aegis analytics coming soon, customers can monitor and manage their IP address usage by Cloudflare data centers in aggregate. Every Cloudflare data center will now run a service with the sole purpose of calculating and reporting Aegis usage for each origin IP:port at regular intervals. Written to an internal database, these reports will be aggregated and exposed to customers via Cloudflare’s <a href="https://developers.cloudflare.com/analytics/graphql-api/"><u>GraphQL Analytics API</u></a>. Several aggregation functions will be available, such as average usage over a period of time, or total summed usage.</p><p>This will allow customers to track their own IP address usage to further optimize the distribution of traffic and addresses across different points of presence for IPv4. Additionally, the improved observability will support customer-created notifications via RSS feeds such that you can design your own notification thresholds for port usage.</p>
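The aggregation that the upcoming analytics service performs can be sketched in a few lines of Python. The sample reports, data center codes, and field names below are invented for illustration; the real service writes per-interval port-usage reports for each origin IP:port to an internal database and exposes aggregates via the GraphQL Analytics API.

```python
# Sketch of rolling up per-interval Aegis usage reports into the kinds of
# aggregates the text mentions (average over a period, total summed usage).
from collections import defaultdict
from statistics import mean

reports = [  # (data_center, origin_ip_port, ports_in_use) - invented samples
    ("AMS", "203.0.113.10:443", 1200),
    ("AMS", "203.0.113.10:443", 1800),
    ("FRA", "203.0.113.10:443", 900),
]

by_origin = defaultdict(list)
for dc, origin, used in reports:
    by_origin[origin].append(used)

summary = {o: {"avg": mean(v), "total": sum(v)} for o, v in by_origin.items()}
```

A customer could compare such aggregates against the per-IP planning budget to decide when to add addresses or rebalance traffic.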
    <div>
      <h3>How Aegis benefits from connection reuse &amp; coalescence</h3>
      <a href="#how-aegis-benefits-from-connection-reuse-coalescence">
        
      </a>
    </div>
    <p>As we mentioned earlier, requests egress from the source IP address to the destination IP address only when a connection has been established between the two. In early Internet protocols, requests and connections were 1:1. Now, once that connection is open, it can remain open and support hundreds or thousands of requests between that source and destination via <i>connection reuse</i> and <i>connection coalescing</i>. </p><p>Connection reuse, implemented by <a href="https://datatracker.ietf.org/doc/html/rfc2616"><u>HTTP/1.1</u></a>, allows for requests with the same source IP:port and destination IP:port to pass through the same connection to the origin. A “simple” website by modern standards can send hundreds of requests just to load initially; by streamlining these into a single origin connection, connection reuse reduces the latency derived from constantly opening and closing new connections between two endpoints. Still, any request from a different domain would need to create a new, unique connection to communicate with the origin. </p><p>As of <a href="https://datatracker.ietf.org/doc/html/rfc7540"><u>HTTP/2</u></a>, connection coalescing can group requests from different domains into one connection if the requests have the same destination IP address and the server certificate is authoritative for both domains. Depending on the traffic patterns flowing from the eyeball to an Aegis IP address, the volume of connection reuse &amp; coalescence can vary. One connection most likely facilitates the traffic of many requests, but each connection requires at least one request to open it in the first place. Therefore, the worst possible ratio between concurrent connections and concurrent requests is 1:1. </p><p>In practice, a 1:1 ratio between connections and requests almost <i>never</i> happens. 
Connection reuse and connection coalescence are very common but highly variable, due to sporadic traffic patterns. We size our Aegis IP address allocations accordingly, erring on the conservative side to minimize risk of capacity overload. With the proper number of dedicated egress IP addresses and optimal allocation to Cloudflare points of presence, we are able to lock down your origin with IPv4 addresses to block malicious layer 7 traffic and reduce overall load to your origin. </p><p>Connection reuse and coalescence pair well with Aegis to reduce load on the origin’s side as well. Because a connection can be reused when requests come from the same source IP:port and share a destination IP:port, routing traffic from a reduced number of source IP addresses (Aegis addresses, in this case) to your origin facilitates a smaller number of total connections. Not only does this improve security by limiting open connection access, but it also reduces latency since fewer connections need to be opened. Maintaining fewer connections is also less resource intensive — more connections means more CPU and more memory handling the inbound requests. By reducing the number of connections to the origin through reuse and coalescence, HTTP/2 lowers the overall cost of operation by optimizing resource usage. </p>
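A rough model shows why reuse matters for sizing. The request counts and reuse factor below are invented, but the worst-case 1:1 ratio and the 40k-per-IP planning budget come from the text.

```python
# How many concurrent connections a given level of concurrent requests implies,
# as a function of how many requests each connection carries on average.
import math

def connections_needed(concurrent_requests, requests_per_connection):
    """Connections required to carry the given request load (ceiling)."""
    return math.ceil(concurrent_requests / requests_per_connection)

worst = connections_needed(120_000, 1)     # worst case, 1:1: exceeds one IP's budget
typical = connections_needed(120_000, 50)  # with healthy reuse: well within budget
```

In the worst case the 120,000 requests need 120,000 connections (three Aegis IPv4 addresses at 40k each); with an average of 50 requests per connection, 2,400 connections suffice.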
    <div>
      <h3>Recap and recommendations</h3>
      <a href="#recap-and-recommendations">
        
      </a>
    </div>
    <p>Cloudflare Aegis locks down your origin by restricting access via your origin’s layer 3 firewall. By routing traffic from Cloudflare’s network to your origin through dedicated egress IP addresses, you can ensure that requests coming from Cloudflare are legitimate customer traffic. With a simple flip-switch configuration — allowlisting your Aegis IP addresses in your origin’s firewall — you can block excessive noise and bad actors from access. So, to help you take full advantage of Aegis, let’s recap:</p><ul><li><p>Concurrent connections can be, at worst, a 1:1 ratio to concurrent requests.</p></li><li><p>Cloudflare bases our IP address usage recommendations on 40,000 concurrent connections to minimize risk of capacity overload.</p></li><li><p>Each Aegis IP address supports an estimated 40,000 concurrent connections per origin IP address.</p></li></ul><p>Additionally, we’re excited to now support:</p><ul><li><p><a href="https://developers.cloudflare.com/api/resources/zones/subresources/settings/methods/edit/"><u>Public Aegis API</u></a> </p></li><li><p><a href="https://developers.cloudflare.com/aegis/setup/"><u>BYOIP for Aegis</u></a> </p></li><li><p>Customer-facing Aegis observability (coming soon via gradual rollout)</p></li></ul><p>For customers leasing Cloudflare-owned Aegis IP addresses, the Aegis API will allow you to enable and disable Aegis on zones within your parent account (parent being the account which owns the IP lease). 
If you deploy your Aegis IP addresses across multiple accounts, you’ll still rely on Cloudflare’s account team to enable and disable Aegis on zones within those additional accounts.</p><p>For customers who leverage BYOIP for Aegis, the Aegis API will allow you to enable and disable Aegis on zones within your parent account <i>and</i> within any accounts to which you <a href="https://developers.cloudflare.com/byoip/concepts/prefix-delegations/#:~:text=BYOIP%20supports%20prefix%20delegations%2C%20which,service%20used%20with%20the%20prefix."><u>delegate prefix permissions</u></a>. We recommend BYOIP for Aegis for improved configurability and cost efficiency. </p><table><tr><th><p>
</p></th><th><p><b>BYOIP</b></p></th><th><p><b>Cloudflare-owned IPs</b></p></th></tr><tr><td><p>Enable Aegis on zones on parent account</p></td><td><p>✓</p></td><td><p>✓</p></td></tr><tr><td><p>Enable Aegis on zones beyond parent account</p></td><td><p>✓</p></td><td><p>✗</p></td></tr><tr><td><p>Disable Aegis on zones on parent account</p></td><td><p>✓</p></td><td><p>✓</p></td></tr><tr><td><p>Disable Aegis on zones beyond parent account</p></td><td><p>✓</p></td><td><p>✗</p></td></tr><tr><td><p>Access Aegis analytics via the API</p></td><td><p>✓</p></td><td><p>✓</p></td></tr></table><p>With the improved Aegis observability, all Aegis customers will be able to monitor their port usage by IP address, account, zone, and data centers in aggregate via the API. You will also be able to ingest these metrics to configure your own, customizable alerts based on certain port usage thresholds. Alongside the new configurability of Aegis, this visibility will better equip customers to manage their Aegis deployments themselves and alert <i>us</i> to any changes, rather than the other way around.</p><p>We also have a few adjacent recommendations to optimize your Aegis configuration. 
We generally encourage the following best practices for security hygiene for your origin and traffic as well.</p><ol><li><p><b>IPv6 Compatibility</b>: if your origin(s) support IPv6, you will experience even better resiliency, performance, and availability with your dedicated egress IP addresses at a lower overall cost.</p></li><li><p><a href="https://www.cloudflare.com/learning/performance/http2-vs-http1.1/"><b><u>HTTP/2</u></b></a><b> or </b><a href="https://www.cloudflare.com/learning/performance/what-is-http3/"><b><u>HTTP/3</u></b></a><b> adoption</b>: by supporting connection reuse and coalescence, you will reduce overall load to your origin and latency in the path of your request.</p></li><li><p><b>Multi-level origin protection</b>: while Aegis protects your origin at the network level, it pairs well with <a href="https://blog.cloudflare.com/access-aegis-cni/"><u>Access and CNI</u></a>, <a href="https://developers.cloudflare.com/ssl/origin-configuration/authenticated-origin-pull/"><u>Authenticated Origin Pulls</u></a>, and/or other Cloudflare products to holistically protect, verify, and facilitate your traffic from edge to origin.</p></li></ol><p>If you or your organization want to enhance security and lock down your origin with dedicated egress IP addresses, reach out to your account team to onboard today. </p> ]]></content:encoded>
            <category><![CDATA[Security Week]]></category>
            <category><![CDATA[Aegis]]></category>
            <category><![CDATA[Egress]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <guid isPermaLink="false">LPhv5n2cp5pkZBwAC8hN0</guid>
            <dc:creator>Mia Malden</dc:creator>
            <dc:creator>Adrien Vasseur</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare incident on September 17, 2024]]></title>
            <link>https://blog.cloudflare.com/cloudflare-incident-on-september-17-2024/</link>
            <pubDate>Fri, 20 Sep 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ On September 17, 2024, during planned routine maintenance, Cloudflare stopped announcing 15 IPv4 prefixes, affecting some Business plan websites for approximately one hour. During this time, IPv4 traffic for these customers would not have reached Cloudflare and users attempting to connect to websites using addresses within those prefixes would have received errors.  ]]></description>
            <content:encoded><![CDATA[ <p>On September 17, 2024, during routine maintenance, Cloudflare inadvertently stopped announcing fifteen IPv4 prefixes, affecting some Business plan websites for approximately one hour. During this time, IPv4 traffic for these customers would not have reached Cloudflare, and users attempting to connect to websites assigned addresses within those prefixes would have received errors. </p><p>We’re very sorry for this outage. </p><p>This outage was the result of an internal software error and not the result of an attack. In this blog post, we’re going to talk about what the failure was, why it occurred, and what we’re doing to make sure this doesn’t happen again.</p>
    <div>
      <h2>Background</h2>
      <a href="#background">
        
      </a>
    </div>
    <p>Cloudflare assembled a dedicated Addressing team in 2019 to simplify the ways that IP addresses are used across Cloudflare products and services. The team builds and maintains systems that help Cloudflare conserve and manage its own network resources. The Addressing team also manages periodic changes to the assignment of IP addresses across infrastructure and services at Cloudflare. In this case, our goal was to reduce the number of IPv4 addresses used for customer websites, allowing us to free up addresses for other purposes, like deploying infrastructure in new locations. Since IPv4 addresses are a finite resource and are becoming more scarce over time, we carry out these kinds of “renumbering” exercises quite regularly.</p><p>Renumbering in Cloudflare is carried out using internal processes that move websites between sets of IP addresses. A set of IP addresses that no longer has websites associated with it is no longer needed, and can be retired. Once that has happened, the associated addresses are free to be used elsewhere.</p><p>Back in July 2024, a batch of Business plan websites were moved from their original set of IPv4 addresses to a new, smaller set, appropriate to the forecast requirements of that particular plan. On September 17, after confirming that all of the websites using those addresses had been successfully renumbered, the next step was to be carried out: detach the IPv4 prefixes associated with those addresses from Cloudflare’s network and to withdraw them from service. That last part was to be achieved by removing those IPv4 prefixes from the Internet’s global routing table using the Border Gateway Protocol (<a href="https://www.cloudflare.com/learning/security/glossary/what-is-bgp/"><u>BGP</u></a>), so that traffic to those addresses is no longer routed towards Cloudflare. The prefixes concerned would then be ready to be deployed for other purposes.</p>
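The retirement precondition described here, that a set of addresses may only be withdrawn once no website still holds an address inside it, can be sketched with Python's ipaddress module. All names and addresses below are invented for illustration.

```python
import ipaddress

def websites_still_assigned(prefix, assignments):
    """Return websites still holding addresses in `prefix` (empty means safe to retire)."""
    net = ipaddress.ip_network(prefix)
    return [site for site, addr in assignments.items()
            if ipaddress.ip_address(addr) in net]

# One site still sits inside the prefix, so it is NOT safe to withdraw yet.
leftover = websites_still_assigned("198.51.100.0/24",
                                   {"example.com": "198.51.100.7",
                                    "example.org": "203.0.113.9"})
```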
    <div>
      <h2>What was released and how did it break?</h2>
      <a href="#what-was-released-and-how-did-it-break">
        
      </a>
    </div>
    <p>When we migrated customer websites out of their existing assigned address space in July, we used a one-time migration template that cycles through all the websites associated with the old IP addresses and moves them to new ones. This calls a function that updates the IP assignment mechanism to synchronize the IP address-to-website mapping.</p><p>A couple of months prior to the July migration, the relevant function code was updated as part of a separate project related to legacy SSL configurations. That update contained a fix that replaced legacy code to synchronize two address pools with a call to an existing synchronization function. The update was reviewed, approved, merged, and released.</p><p>Unfortunately, the fix had consequences for the subsequent renumbering work. Upon closer inspection (we’ve done some very close post-incident inspection), a side effect of the change was to suppress updates in cases where there was no linked reference to a legacy SSL certificate. Since not all websites use legacy certificates, the effect was that not all websites were renumbered — 1,661 customer websites remained linked to old addresses in the address pools that were intended to be withdrawn. This was not noticed during the renumbering work in July, which had concluded with the assumption that every website linked to the old addresses had been renumbered, and that assumption was not checked.</p><p>At 2024-09-17 17:51 UTC, fifteen IPv4 prefixes corresponding to the addresses that were thought to be safely unused were withdrawn using BGP. Cloudflare operates a <a href="https://www.cloudflare.com/network/"><u>global network</u></a> with hundreds of data centers, and there was some variation in the precise time when the prefixes were withdrawn from particular parts of the world. In the following ten minutes, we observed an aggregate 10 Gbps drop in traffic to the 1,661 affected websites network-wide.</p>
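In spirit, the failure mode looked something like the following sketch. Every name here is hypothetical and greatly simplified; it only illustrates how a suppressed update can silently leave websites on addresses that are about to be withdrawn.

```python
# Invented reconstruction of the bug class: the shared sync function becomes an
# effective no-op for any website with no legacy SSL certificate reference, so
# those websites silently keep their old address pool.
def sync_website_address(site, new_pool, mapping):
    if site.get("legacy_cert") is None:
        return  # BUG: update suppressed for sites without a legacy certificate
    mapping[site["name"]] = new_pool

mapping = {"a.example": "old-pool", "b.example": "old-pool"}
sites = [{"name": "a.example", "legacy_cert": "cert-1"},
         {"name": "b.example", "legacy_cert": None}]
for s in sites:
    sync_website_address(s, "new-pool", mapping)

# b.example is left behind on the pool that is about to be withdrawn - exactly
# the class of website affected in July.
stale = [name for name, pool in mapping.items() if pool == "old-pool"]
```

An explicit post-migration check that `stale` is empty, rather than assuming it, is precisely the kind of verification described later in the post.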
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/74QVf9RH9rOHLTc2FDzZ5l/f6cdd7faccb943a15b9eea344f2e1e94/BLOG-2577_2.png" />
          </figure><p><sub><i>The graph above shows traffic volume (in bits per second) for each individual prefix that was affected by the incident.</i></sub></p>
    <div>
      <h2>Incident timeline and impact</h2>
      <a href="#incident-timeline-and-impact">
        
      </a>
    </div>
    <p>All timestamps are UTC on 2024-09-17.</p><p>At 17:41, the Addressing engineering team initiated the release that disabled prefixes in production.</p><p>At 17:51, BGP announcements began to be withdrawn and traffic to Cloudflare on the impacted prefixes started to drop.</p><p>At 17:57, the SRE team noticed alerts triggered by an increase in unreachable IP address space and began investigating. The investigation ended shortly afterwards, since it is generally expected that IP addresses will become unreachable when they are being removed from service, and consequently the alerts did not seem to indicate an abnormal situation.</p><p>At 18:36, Cloudflare received escalations from two customers, and an incident was declared. A limited deployment window was quickly implemented once the severity of the incident was assessed.</p><p>At 18:46, Addressing team engineers confirmed that the change introduced in the renumbering release triggered the incident and began preparing the rollback procedure to revert changes.</p><p>At 18:50, the release was rolled back, prefixes were re-announced in BGP to the Internet, and traffic began flowing back through Cloudflare.</p><p>At 18:50:27, the affected routes were restored and prefixes began receiving traffic again.</p><p>There was no impact to IPv6 traffic. 1,661 customer websites that were associated with addresses in the withdrawn IPv4 prefixes were affected. There was no impact to other customers or services.</p>
    <div>
      <h2>How did we fix it?</h2>
      <a href="#how-did-we-fix-it">
        
      </a>
    </div>
    <p>The immediate fix to the problem was to roll back the release that was determined to be the proximate cause. Since all approved changes have tested rollback procedures, this is often a pragmatic first step to fix whatever has just been found to be broken. In this case, as in many, it was an effective way to resolve the immediate impact and return things to normal.</p><p>Identifying the root cause took more effort. The code mentioned above that had been modified earlier this year is quite old, and part of a legacy system that the Addressing team has been working on moving away from since the team’s inception. Much of the engineering effort during that time has been on building the modern replacement, rather than line-level dives into the legacy code.</p><p>We have since fixed the specific bug that triggered this incident. However, to address the more general problem of relying on old code that is not as well understood as the code in modern systems, we will do more. Sometimes software has bugs, and sometimes software is old, and these are not useful excuses; they are just the way things are. It’s our job to maintain the agility and confidence in our release processes while living in this reality, maintaining the level of safety and stability that our customers and their customers rely on.</p>
    <div>
      <h2>What are we doing to prevent this from happening again?</h2>
      <a href="#what-are-we-doing-to-prevent-this-from-happening-again">
        
      </a>
    </div>
    <p>We take incidents like this seriously, and we recognise the impact that this incident had. Though this specific bug has been resolved, we have identified several steps we can take to mitigate the risk of a similar problem occurring in the future. We are implementing the following plan as a result of this incident:</p><p><b>Test:</b> The Addressing Team is adding tests that check for the existence of outstanding assignments of websites to IP addresses as part of future renumbering exercises. These tests will verify that there are no remaining websites that inadvertently depend on the old addresses being in service. The changes that prompted this incident made incorrect assumptions that all websites had been renumbered. In the future, we will avoid making assumptions like those, and instead do explicit checks to make sure.</p><p><b>Process:</b> The Addressing team is improving the processes associated with the withdrawal of Cloudflare-owned prefixes, regardless of whether the withdrawal is associated with a renumbering event, to include automated and manual verification of traffic levels associated with the addresses that are intended to be withdrawn. Where traffic is attached to a service that provides more detailed logging, service-specific request logs will be checked for signs that the addresses thought to be unused are not associated with active traffic.</p><p><b>Implementation:</b> The Addressing Team is reviewing every use of stored procedures and functions associated with legacy systems. Where there is doubt, functionality will be re-implemented with present-day standards of documentation and test coverage.</p><p>We are sorry for the disruption this incident caused for our customers. We are actively making these improvements to ensure improved stability moving forward and to prevent this problem from happening again.</p> ]]></content:encoded>
            <category><![CDATA[Outage]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[CDN]]></category>
            <guid isPermaLink="false">2SwCyuXYfx1hPgiULhD2Pz</guid>
            <dc:creator>Joe Abley</dc:creator>
        </item>
        <item>
            <title><![CDATA[connect() - why are you so slow?]]></title>
            <link>https://blog.cloudflare.com/linux-transport-protocol-port-selection-performance/</link>
            <pubDate>Thu, 08 Feb 2024 14:00:27 GMT</pubDate>
            <description><![CDATA[ This is our story of what we learned about the connect() implementation for TCP in Linux. Both its strong and weak points. How connect() latency changes under pressure, and how to open connection so that the syscall latency is deterministic and time-bound ]]></description>
            <content:encoded><![CDATA[ <p>It is no secret that Cloudflare is encouraging companies to deprecate their use of IPv4 addresses and move to IPv6 addresses. We have a couple of articles on the subject from this year:</p><ul><li><p><a href="/amazon-2bn-ipv4-tax-how-avoid-paying/">Amazon’s $2bn IPv4 tax – and how you can avoid paying it</a></p></li><li><p><a href="/ipv6-from-dns-pov/">Using DNS to estimate worldwide state of IPv6 adoption</a></p></li></ul><p>And many more in our <a href="/searchresults#q=IPv6&amp;sort=date%20descending&amp;f:@customer_facing_source=[Blog]&amp;f:@language=[English]">catalog</a>. To help with this, we spent time this last year investigating and implementing infrastructure to reduce our internal and egress use of IPv4 addresses. We prefer to re-allocate our existing addresses rather than purchase more at ever-increasing prices. In this effort, we discovered that our cache service is one of our biggest consumers of IPv4 addresses. Before we remove IPv4 addresses from our cache services, we first need to understand how cache works at Cloudflare.</p>
    <div>
      <h2>How does cache work at Cloudflare?</h2>
      <a href="#how-does-cache-work-at-cloudflare">
        
      </a>
    </div>
    <p>Describing the full scope of the <a href="https://developers.cloudflare.com/reference-architecture/cdn-reference-architecture/#cloudflare-cdn-architecture-and-design">architecture</a> is out of scope of this article, however, we can provide a basic outline:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70ULgxsqU4zuyWYVrNn6et/8c80079d6dd93083059a875bbf48059d/image1-2.png" />
            
            </figure><ol><li><p>An Internet user makes a request to pull an asset</p></li><li><p>Cloudflare infrastructure routes that request to a handler</p></li><li><p>The handler machine returns the cached asset, or, on a miss</p></li><li><p>The handler machine reaches out to the origin server (owned by a customer) to pull the requested asset</p></li></ol><p>The particularly interesting part is the cache miss case. When a website suddenly becomes very popular, many uncached assets may need to be fetched all at once. Hence we may make upwards of 50,000 TCP unicast connections to a single destination.</p><p>That is a lot of connections! We have strategies in place to limit the impact of this or avoid the problem altogether. But in these rare cases when it occurs, we balance these connections over two source IPv4 addresses.</p><p>Our goal is to remove the load balancing and prefer one IPv4 address. To do that, we need to understand the performance impact of two IPv4 addresses vs one.</p>
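<p>As a sketch of what that balancing looks like, the snippet below round-robins new connections across two source addresses. This is an illustrative simplification, not our production code; loopback addresses stand in for real egress IPs so the example is runnable.</p>

```python
import socket
from itertools import cycle

# Stand-ins for two egress IPv4 addresses (loopback aliases, so this runs on a Linux box).
SOURCE_IPS = cycle(["127.0.0.1", "127.0.0.2"])

def connect_balanced(dest_ip, dest_port):
    """Open a TCP connection, alternating the source address per connection."""
    sk = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Binding port 0 defers ephemeral port selection to the kernel at connect() time.
    sk.bind((next(SOURCE_IPS), 0))
    sk.connect((dest_ip, dest_port))
    return sk
```

<p>Each source address gets its own ephemeral port space, which is why spreading connections over two addresses roughly doubles the usable port range.</p>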
    <div>
      <h2>TCP connect() performance of two source IPv4 addresses vs one IPv4 address</h2>
      <a href="#tcp-connect-performance-of-two-source-ipv4-addresses-vs-one-ipv4-address">
        
      </a>
    </div>
    <p>We leveraged a tool called <a href="https://github.com/wg/wrk">wrk</a>, and modified it to distribute connections over multiple source IP addresses. Then we ran a workload of 70k connections over 48 threads for a period of time.</p><p>During the test we measured the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201">tcp_v4_connect()</a> with the BPF BCC libbpf-tool <a href="https://github.com/iovisor/bcc/blob/master/libbpf-tools/funclatency.c">funclatency</a> tool to gather latency metrics as time progressed.</p><p>Note that throughout the rest of this article, all the numbers are specific to a single machine with no production traffic. We are making the assumption that if we can improve a worst-case scenario in an algorithm on a best-case machine, the results can be extrapolated to production. Lock contention was specifically taken out of the equation, but it will have production implications.</p>
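<p>The charts that follow group each measured latency into a power-of-ten bucket. As a rough model of that bucketing (a simplified sketch for intuition, not funclatency itself, which defaults to log2 histograms):</p>

```python
import math
from collections import Counter

def bucket_pow10(latencies_ns):
    """Count latencies per power-of-ten bucket, keyed by each bucket's lower bound."""
    buckets = Counter()
    for ns in latencies_ns:
        buckets[10 ** math.floor(math.log10(ns))] += 1
    return buckets

# For example, 5 ns lands in the 1 ns bucket, while 50 ns and 70 ns land in the 10 ns bucket.
```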
    <div>
      <h3>Two IPv4 addresses</h3>
      <a href="#two-ipv4-addresses">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1q7v3WNgI5X3JQg5ua0B8g/5b557ca762a08422badae379233dee76/image6.png" />
            
            </figure><p>The y-axis shows buckets of nanoseconds in powers of ten. The x-axis represents the number of connections made per bucket. Therefore, more connections in the lower power-of-ten buckets is better.</p><p>We can see that the majority of the connections occur in the fast case, with roughly 20k in the slow case. We should expect this bimodal distribution to become more pronounced over time as wrk continuously closes and establishes connections.</p><p>Now let us look at the performance of one IPv4 address under the same conditions.</p>
    <div>
      <h3>One IPv4 address</h3>
      <a href="#one-ipv4-address">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6kpueuXS3SbBTIig306IDN/b27ab899656fbfc0bf3c885a44fb04a4/image8.png" />
            
            </figure><p>In this case, the bimodal distribution is even more pronounced. Over half of the total connections now fall in the slow case rather than the fast one! We may conclude that simply switching to one IPv4 address for cache egress is going to introduce significant latency on our connect() syscalls.</p><p>The next logical step is to figure out where this bottleneck is happening.</p>
    <div>
      <h2>Port selection is not what you think it is</h2>
      <a href="#port-selection-is-not-what-you-think-it-is">
        
      </a>
    </div>
    <p>To investigate this, we first took a flame graph of a production machine:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tFwadYDdC5UVK78j4yKsv/64aca09189acba5bf3dab2e043265e0f/image7.png" />
            
            </figure><p>Flame graphs depict a run-time function call stack of a system. The y-axis depicts call-stack depth, and each horizontal bar on the x-axis represents a function, with the bar’s width proportional to the number of times the function was sampled. Check out this in-depth <a href="https://www.brendangregg.com/flamegraphs.html">guide</a> about flame graphs for more details.</p><p>Most of the samples are taken in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a>. We can see that there are also many samples for <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a>, with some lock contention sampled between. We now have a better picture of a potential bottleneck, but we do not have a consistent test to compare against.</p><p>Wrk introduces a bit more variability than we would like to see. So, still focusing on the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/tcp_ipv4.c#L201"><code>tcp_v4_connect()</code></a>, we performed another synthetic test with a homegrown benchmark tool against one IPv4 address. A tool such as <a href="https://github.com/ColinIanKing/stress-ng">stress-ng</a> may also be used, but some modification is necessary to implement the socket option <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a>; more on that socket option later.</p><p>We are now going to ensure a deterministic number of connections, and remove lock contention from the problem. The result is something like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5d6tJum5BBe3jsLRqhXtFN/7952fb3d0a3da761de158fae4f925eb5/Screenshot-2024-02-07-at-15.54.29.png" />
            
            </figure><p>On the y-axis we measured the latency between the start and end of a connect() syscall. The x-axis denotes when a connect() was called. Green dots are even-numbered ports, and red dots are odd-numbered ports. The orange line is a linear regression on the data.</p><p>The disparity between the average port allocation time for even and odd ports provides a major clue: connections with odd ports are found significantly more slowly than those with even ports. Further, odd ports are not interleaved with earlier connections, which implies that we exhaust our even ports before attempting the odd ones. The chart also confirms our bimodal distribution.</p>
    <div>
      <h3>__inet_hash_connect()</h3>
      <a href="#__inet_hash_connect">
        
      </a>
    </div>
    <p>At this point we wanted to understand this split a bit better. We know from the flame graph that <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1000"><code>__inet_hash_connect()</code></a> holds the algorithm for port selection. For context, this function is responsible for associating the socket with a source port in a late bind. If a port was previously provided with bind(), the algorithm just tests for a unique TCP 4-tuple (src ip, src port, dest ip, dest port) and skips port selection.</p><p>Before we dive in, there is a little bit of setup work that happens first. Linux first generates a time-based hash that is used as the basis for the starting port, adds randomization, and then puts that information into an offset variable, which is always forced to an even integer.</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1043">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>   offset &amp;= ~1U;
    
other_parity_scan:
    port = low + offset;
    for (i = 0; i &lt; remaining; i += 2, port += 2) {
        if (unlikely(port &gt;= high))
            port -= remaining;

        inet_bind_bucket_for_each(tb, &amp;head-&gt;chain) {
            if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
                if (!check_established(death_row, sk, port, &amp;tw))
                    goto ok;
                goto next_port;
            }
        }
    }

    offset++;
    if ((offset &amp; 1) &amp;&amp; remaining &gt; 1)
        goto other_parity_scan;</code></pre>
            <p>In a nutshell: for each connection, the loop scans one half of the ports in our range (all even or all odd) before moving on to the other half. Specifically, this is a variation of the <a href="https://datatracker.ietf.org/doc/html/rfc6056#section-3.3.4">Double-Hash Port Selection Algorithm</a>. We will ignore the bind bucket functionality since it is not our main concern.</p><p>Depending on your port range, you either start with an even port or an odd port. In our case, our low port, 9024, is even. The port is then picked by adding the offset to the low port:</p><p><a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L1045">net/ipv4/inet_hashtables.c</a></p>
            <pre><code>port = low + offset;</code></pre>
            <p>If low was odd, we will have an odd starting port because odd + even = odd.</p><p>There is a bit too much going on in the loop to explain in text. I have an example instead:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6uVqtAUR07epRRKqQbHWkp/2a5671b1dd3c68c012e7171b8103a53e/image5.png" />
            
            </figure><p>This example is bound by 8 ports and 8 possible connections. All ports start unused. As a port is used up, the port is grayed out. Green boxes represent the next chosen port. All other colors represent open ports. Blue arrows are even port iterations of offset, and red are the odd port iterations of offset. Note that the offset is randomly picked, and once we cross over to the odd range, the offset is incremented by one.</p><p>For each selection of a port, the algorithm then makes a call to the function <code>check_established()</code>, which dereferences <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/inet_hashtables.c#L544"><code>__inet_check_established()</code></a>. This function loops over sockets to verify that the TCP 4-tuple is unique. The takeaway is that the socket list in the function is usually short, but it grows as more unique TCP 4-tuples are introduced to the system, and longer socket lists may eventually slow down port selection. We have a blog post on <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">ephemeral port exhaustion</a> that dives into the socket list and port uniqueness criteria.</p><p>At this point, we can summarize that the odd/even port split is what is causing our performance bottleneck. During the investigation, it was not obvious to me (or maybe even to you) why the offset was initially calculated the way it was, and why the odd/even port split was introduced. After some git archaeology, the decisions became clearer.</p>
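<p>The scan order above can be made concrete with a small Python model of the kernel loop (a simplified sketch that ignores bind buckets and assumes every port is available):</p>

```python
def scan_order(low, high, offset):
    """Model __inet_hash_connect()'s search order: one parity fully, then the other."""
    remaining = high - low
    offset &= ~1                 # the kernel forces an even starting offset
    order = []
    for _ in range(2):           # first parity scan, then the other_parity_scan
        port = low + offset
        for _ in range(0, remaining, 2):
            if port >= high:     # wrap around within [low, high)
                port -= remaining
            order.append(port)
            port += 2
        offset += 1              # incrementing the offset flips the parity
    return order

# With ports [8, 16) and offset 4, every even port is tried before any odd port.
```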
    <div>
      <h3>Security considerations</h3>
      <a href="#security-considerations">
        
      </a>
    </div>
    <p>Port selection has been shown to be usable for device <a href="https://lwn.net/Articles/910435/">fingerprinting</a> in the past. This led the authors to introduce more randomization into the initial port selection. Previously, ports were predictably picked based solely on their initial hash and a salt value that does not change often. This helps explain the offset, but does not explain the split.</p>
    <div>
      <h3>Why the even/odd split?</h3>
      <a href="#why-the-even-odd-split">
        
      </a>
    </div>
    <p>Prior to this <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=07f4c90062f8fc7c8c26f8f95324cbe8fa3145a5">patch</a> and that <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1580ab63fc9a03593072cc5656167a75c4f1d173">patch</a>, services could suffer conflicts between connect()-heavy and bind()-heavy workloads. To avoid those conflicts, the split was added: an even offset was chosen for the connect() workloads, and an odd offset for the bind() workloads. However, as we have seen, the split only works well for connect() workloads that do not exceed one half of the allotted port range.</p><p>Now we have an explanation for the flame graph and charts. So what can we do about this?</p>
    <div>
      <h2>User space solution (kernel &lt; 6.8)</h2>
      <a href="#user-space-solution-kernel-6-8">
        
      </a>
    </div>
    <p>We have a couple of strategies that work well for us. Infrastructure or architectural strategies are not considered here due to the significant development effort they require. Instead, we prefer to tackle the problem where it occurs.</p><h3>Select, test, repeat</h3><p>For the “select, test, repeat” approach, you may have code that ends up looking like this:</p>
            <pre><code>import random

sys = get_ip_local_port_range()
estab = 0

while estab &lt; sys.hi - sys.lo:
    port = random.randint(sys.lo, sys.hi)
    connection = attempt_connect(port)
    if connection is None:
        # Port already taken: pick another at random and retry.
        continue
    estab += 1</code></pre>
            <p>The algorithm simply picks a random port in the system range on each iteration, then tests that the connect() worked. If not, rinse and repeat until range exhaustion.</p><p>This approach is good for up to roughly 70–80% port range utilization, though it may take roughly eight to twelve attempts per connection as we approach exhaustion. The major downside to this approach is the extra syscall overhead on conflict. To reduce this overhead, we can consider another approach that still lets the kernel select the port for us.</p><h3>Select port by random shifting range</h3><p>This approach leverages the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> socket option. With it, we were able to achieve performance like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Uz8whp12VuvqvKTDnE1u9/4701177d739bdffe2a2399213cf72941/Screenshot-2024-02-07-at-16.00.22.png" />
            
            </figure><p>That is much better! The chart also introduces black dots that represent errored connections. However, they have a tendency to clump at the very end of our port range as we approach exhaustion. This is not dissimilar to what we may see in “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>The way this solution works is something like:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51

sys = get_local_port_range()
size = 1000                          # width of the custom window
lo = randint(sys.lo, sys.hi - size)  # random shift within the system range
hi = lo + size

sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
# Pack the window as two 16-bit halves: low | (high &lt;&lt; 16)
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, pack("@I", lo | (hi &lt;&lt; 16)))
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>We first fetch the system's local port range, define a custom window size, and then randomly shift that window within the system range. Introducing this randomization lets the kernel start port selection randomly at an odd or even port, and reduces the search space down to the custom window.</p><p>We tested with a few different window sizes, and determined that a five hundred or one thousand port window works fairly well for our port range:</p>
<table>
<thead>
  <tr>
    <th><span>Window size</span></th>
    <th><span>Errors</span></th>
    <th><span>Total test time</span></th>
    <th><span>Connections/second</span></th>
  </tr>
</thead>
<tbody>
  <tr>
    <td><span>500</span></td>
    <td><span>868</span></td>
    <td><span>~1.8 seconds</span></td>
    <td><span>~30,139</span></td>
  </tr>
  <tr>
    <td><span>1,000</span></td>
    <td><span>1,129</span></td>
    <td><span>~2 seconds</span></td>
    <td><span>~27,260</span></td>
  </tr>
  <tr>
    <td><span>5,000</span></td>
    <td><span>4,037</span></td>
    <td><span>~6.7 seconds</span></td>
    <td><span>~8,405</span></td>
  </tr>
  <tr>
    <td><span>10,000</span></td>
    <td><span>6,695</span></td>
    <td><span>~17.7 seconds</span></td>
    <td><span>~3,183</span></td>
  </tr>
</tbody>
</table><p>As the window size increases, the error rate increases, because a larger window leaves fewer possible random offsets. A max window size of 56,512 is no different from using the kernel’s default behavior. Therefore, a smaller window size works better. But you do not want it to be too small either: a window size of one is no different from “<a href="#selecttestrepeat">select, test, repeat</a>”.</p><p>In kernels &gt;= 6.8, we can do even better.</p><h2>Kernel solution (kernel &gt;= 6.8)</h2><p>A new <a href="https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=207184853dbd">patch</a> was introduced that eliminates the need for the window shifting; it is available starting in the 6.8 kernel.</p><p>Instead of picking a random window offset for <code>setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, …)</code>, as in the previous solution, we just pass the full system port range to activate the new behavior. The code may look something like this:</p>
            <pre><code>IP_BIND_ADDRESS_NO_PORT = 24
IP_LOCAL_PORT_RANGE = 51
sys = get_local_port_range()
sk = socket(AF_INET, SOCK_STREAM)
sk.setsockopt(IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, 1)
range = pack("@I", sys.lo | (sys.hi &lt;&lt; 16))
sk.setsockopt(IPPROTO_IP, IP_LOCAL_PORT_RANGE, range)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>Setting the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91d0b78c5177f3e42a4d8738af8ac19c3a90d002"><code>IP_LOCAL_PORT_RANGE</code></a> option tells the kernel to use an approach similar to “<a href="#random">select port by random shifting range</a>”: the start offset is still randomized to be even or odd, but the search then loops incrementally rather than skipping every other port. We end up with results like this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ttWStZgNYfwftr71r8Vrt/7c333411ef01b674cc839f27ae4cbbbf/Screenshot-2024-02-07-at-16.04.24.png" />
            
            </figure><p>The performance of this approach is quite comparable to our user space implementation, albeit a little faster. This is due in part to general improvements, and in part to the fact that the algorithm can always find a port given the full search space of the range, so no cycles are wasted on a potentially filled sub-range.</p><p>These results are great for TCP, but what about other protocols?</p>
    <div>
      <h2>Other protocols &amp; connect()</h2>
      <a href="#other-protocols-connect">
        
      </a>
    </div>
    <p>It is worth mentioning at this point that the algorithms used for the protocols are <i>mostly</i> the same for IPv4 &amp; IPv6. Typically, the key difference is how the sockets are compared to determine uniqueness and where the port search happens. We did not compare performance for all protocols. But it is worth mentioning some similarities and differences with TCP and a couple of others.</p>
    <div>
      <h3>DCCP</h3>
      <a href="#dccp">
        
      </a>
    </div>
    <p>The DCCP protocol leverages the same port selection <a href="https://elixir.bootlin.com/linux/v6.6/source/net/dccp/ipv4.c#L115">algorithm</a> as TCP. Therefore, this protocol benefits from the recent kernel changes. It is also possible the protocol could benefit from our user space solution, but that is untested. We will let the reader exercise DCCP use-cases.</p>
    <div>
      <h3>UDP &amp; UDP-Lite</h3>
      <a href="#udp-udp-lite">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/ddos/glossary/user-datagram-protocol-udp/">UDP</a> leverages a different algorithm, found in the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L239"><code>udp_lib_get_port()</code></a>. Similar to TCP, the algorithm loops over the whole port range incrementally, but only if the port is not already supplied in the bind() call. The key difference from TCP is that a random number is generated and used as a step variable: once a first port is identified, the algorithm repeatedly adds that step to it, relying on a uint16_t overflow to eventually wrap back to the chosen port. If all ports are used, it increments the starting port by one and repeats. There is no port splitting between even and odd ports.</p><p>The best comparison to the TCP measurements is a UDP setup similar to:</p>
            <pre><code>sk = socket(AF_INET, SOCK_DGRAM)
sk.bind((src_ip, 0))
sk.connect((dest_ip, dest_port))</code></pre>
            <p>And the results should be unsurprising with one IPv4 source address:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4UM5d0RBTgqADgVLbbqMlQ/940306c90767ba4b5e3762c6467b71ed/Screenshot-2024-02-07-at-16.06.27.png" />
            
            </figure><p>UDP fundamentally behaves differently from TCP, and there is less work overall for port lookups. The outliers in the chart represent a worst-case scenario, when we hit a fairly bad random number collision and need to loop over more of the ephemeral range to find a port.</p><p>UDP has another problem. Given the socket option <code>SO_REUSEADDR</code>, the port you get back may conflict with another UDP socket. This is in part because the function <a href="https://elixir.bootlin.com/linux/v6.6/source/net/ipv4/udp.c#L141"><code>udp_lib_lport_inuse()</code></a> skips the UDP 2-tuple (src ip, src port) check when that option is set. When this happens, a new socket may overwrite a previous one, and extra care is needed. We wrote more in depth about these cases in a previous <a href="/how-to-stop-running-out-of-ephemeral-ports-and-start-to-love-long-lived-connections/">blog post</a>.</p>
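<p>The SO_REUSEADDR hazard is easy to demonstrate: with the option set, two UDP sockets can bind the exact same source address and port without any error (a minimal sketch, assuming a Linux host):</p>

```python
import socket

# Two UDP sockets silently share the same (address, port) when SO_REUSEADDR is set.
a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
a.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
a.bind(("127.0.0.1", 0))          # let the kernel pick a port
addr = a.getsockname()

b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
b.bind(addr)                      # succeeds, even though the 2-tuple is taken
```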
    <div>
      <h2>In summary</h2>
      <a href="#in-summary">
        
      </a>
    </div>
    <p>Cloudflare can make a lot of unicast egress connections to origin servers when popular uncached assets are requested. To avoid port-resource exhaustion, we balance the load over a couple of IPv4 source addresses during those peak times. That led us to ask: “what is the performance impact of one IPv4 source address on our connect()-heavy workloads?”. Port selection is not only difficult to get right, but is also a performance bottleneck, as evidenced by measuring connect() latency with a flame graph and synthetic workloads. That in turn led us to discover TCP’s quirky port selection process, which loops over one half of your ephemeral ports before the other for each connect().</p><p>We then proposed three solutions that avoid adding more IP addresses or other architectural changes: “<a href="#selecttestrepeat">select, test, repeat</a>”, “<a href="#random">select port by random shifting range</a>”, and an <a href="https://man7.org/linux/man-pages/man7/ip.7.html"><code>IP_LOCAL_PORT_RANGE</code></a> socket option <a href="#kernel">solution</a> in newer kernels. We closed out with honorable mentions of other protocols and their quirks.</p><p>Do not just take our numbers: explore and measure your own systems. With a better understanding of your workloads, you can make a good decision about which strategy works best for your needs. Even better if you come up with your own strategy!</p> ]]></content:encoded>
            <category><![CDATA[Linux]]></category>
            <category><![CDATA[Protocols]]></category>
            <category><![CDATA[Performance]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">1C6z0btasEsz1cmdmoug0m</guid>
            <dc:creator>Frederick Lawler</dc:creator>
        </item>
        <item>
            <title><![CDATA[Amazon’s $2bn IPv4 tax — and how you can avoid paying it]]></title>
            <link>https://blog.cloudflare.com/amazon-2bn-ipv4-tax-how-avoid-paying/</link>
            <pubDate>Tue, 26 Sep 2023 13:02:00 GMT</pubDate>
            <description><![CDATA[ In this blog, we’ll explain a little bit more about the technology involved, but most importantly, give you a step-by-step walkthrough of how Cloudflare can help you eliminate the need to pay Amazon for something that they shouldn’t be charging you for in the first place ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5XAwhvgK5l02B9TAvUiy9d/8aed64f222f749b8bc6515e421c624d8/image7-11.png" />
            
            </figure><p>One of the wonderful things about the Internet is that, whether as a consumer or producer, the cost has continued to come down. Back in the day, you needed a server room, a whole host of hardware, and an army of folks to help keep everything up and running. The cloud changed that, but even with that shift, services like <a href="https://www.cloudflare.com/application-services/products/ssl/">SSL</a> or unmetered <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> were out of reach for many. We think that the march towards a more accessible Internet — both through ease of use, and reduced cost — is a wonderful thing, and we’re proud to have played a part in making it happen.</p><p>Every now and then, however, the march of progress gets interrupted.</p><p>On July 28, 2023, Amazon Web Services (AWS) announced that they would begin to charge “per IP per hour for all public IPv4 addresses, whether attached to a service or not”, starting February 1, 2024. This change will add at least $43 extra per year for every IPv4 address Amazon customers use; this may not sound like much, but we’ve seen back-of-the-napkin analysis <a href="https://www.linkedin.com/posts/atoonk_new-aws-public-ipv4-address-charge-public-activity-7091779585368862720-x0zw/">that suggests</a> this will result in an approximately $2bn tax on the Internet.</p><p>In this blog, we’ll explain a little bit more about the technology involved, but most importantly, give you a step-by-step walkthrough of how Cloudflare can help you not only eliminate the need to pay Amazon for something that they shouldn’t be charging you for in the first place, but, if you’re a Pro or Business subscriber, also put $43 in your pocket instead of taking it out. Don’t give Amazon $43 for IPv4; let us give you $43 and throw in IPv4 as well.</p>
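<p>The per-address figure follows directly from the announced rate of $0.005 per public IPv4 address per hour:</p>

```python
rate_per_hour = 0.005        # USD per public IPv4 address per hour
hours_per_year = 24 * 365    # 8,760 hours, ignoring leap years
annual_cost = rate_per_hour * hours_per_year
print(f"${annual_cost:.2f} per address per year")   # prints "$43.80 per address per year"
```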
    <div>
      <h3><b>How can Cloudflare help avoid AWS IPv4 charges?</b></h3>
      <a href="#how-can-cloudflare-help-avoid-aws-ipv4-charges">
        
      </a>
    </div>
    <p>The only way to avoid Amazon’s IPv4 tax is to transition to IPv6 with AWS. But we recognize that not everyone is ready to make that shift — it can be an expensive and challenging process, and may present problems with hardware compatibility and network performance. We cover the finer details of these challenges below, so keep reading! Cloudflare can help ease this transition: let us handle communicating with AWS over IPv6. Not only that, you’ll get all the rest of the benefits of using Cloudflare and our global network — including all the <a href="https://www.cloudflare.com/learning/performance/speed-up-a-website/">performance</a> and security that Cloudflare is known for — and a $43 credit for using us!</p><p>IPv6 services like these are something we’ve been offering at Cloudflare for years; in fact, we first announced them during Cloudflare's first Birthday Week in 2011! We’ve also made this process simple to enable, so you can set it up as soon as today.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UYTukqpByLSMhpLwk7rQK/c47a8e37780e14ab2a26353f10bbf12d/image6-2.png" />
            
            </figure><p>To set this feature up, you will need to both enable IPv6 Compatibility on Cloudflare and configure your origin in <a href="https://docs.aws.amazon.com/vpc/latest/userguide/aws-ipv6-support.html">AWS to be an IPv6 origin</a>.</p><p>To enable IPv6 Compatibility, simply follow these steps:</p><p>1. Log in to your Cloudflare account.</p><p>2. Select the appropriate domain.</p><p>3. Click the Network app.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/66Te3OO0yjDXrPqmF4YVYq/5ad20b6f21de07cb239ea2ce16185727/Screenshot-2023-09-25-at-17.01.11.png" />
            
            </figure><p>4. Make sure IPv6 Compatibility is toggled on.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/62DVRA42xpR7KkUFPEqrKt/058d64e5e67472b89732dff7c7debe4b/Screenshot-2023-09-25-at-17.01.24.png" />
            
            </figure><p>To get an IPv6 origin from Amazon, you will likely need to follow these steps:</p><ol><li><p>Associate an IPv6 CIDR block with your VPC and subnets</p></li><li><p>Update your route tables</p></li><li><p>Update your security group rules</p></li><li><p>Change your instance type</p></li><li><p>Assign IPv6 addresses to your instances</p></li><li><p>(Optional) Configure IPv6 on your instances</p></li></ol><p>(For more information about this migration, check out <a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-migrate-ipv6.html#vpc-migrate-ipv6-example">this link</a>.)</p><p>Once you have your IPv6 origins, you’ll want to update your origin settings on Cloudflare to use the IPv6 addresses. In the simple example of a single origin at root, this is done by creating a proxied (orange-cloud) <a href="https://www.cloudflare.com/learning/dns/dns-records/dns-aaaa-record/">AAAA record</a> in your Cloudflare DNS editor:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6JN8TolCbQmiAUnzrDhnAX/4e3d75ce7cef9536290118c6344a125b/image2-8.png" />
            
            </figure><p>If you are using <a href="https://www.cloudflare.com/application-services/products/load-balancing/">Load Balancers</a>, you will want to <a href="https://developers.cloudflare.com/load-balancing/how-to/create-pool/#edit-a-pool">update the origin(s)</a> there.</p><p>Once that’s done, you can remove the A/IPv4 record(s) and traffic will move over to the IPv6 address. While this process is already easy, we’re working on making the move to IPv6 on Cloudflare even easier.</p><p>Once you have these features configured and have had traffic running through Cloudflare to your origin for at least 6 months, you will be eligible to have a $43 credit deposited right into your Cloudflare account! You can use this credit for your Pro or Business subscription, or even for Workers and <a href="https://www.cloudflare.com/developer-platform/r2/">R2</a> usage. See <a href="https://www.cloudflare.com/lp/avoid-amazon-ipv4-tax/">here</a> for more information on how to opt in to this offer.</p><p>This feature gives you the flexibility to manage your IPv6 settings to suit your requirements. By leveraging Cloudflare's robust IPv6 support, you can ensure seamless connectivity for your users while avoiding the additional costs associated with public IPv4 addresses.</p>
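Before creating the AAAA record, it can be worth validating the IPv6 address you copied from the AWS console. A minimal sketch in Rust using only the standard library (the address shown is the RFC 3849 documentation example, not a real origin):

```rust
use std::net::Ipv6Addr;

fn main() {
    // Documentation example address (RFC 3849), standing in for a real origin IP.
    let raw = "2001:0db8:0000:0000:0000:ff00:0042:8329";
    // Parsing rejects malformed addresses and gives you a canonical value.
    let addr: Ipv6Addr = raw.parse().expect("not a valid IPv6 address");
    // The Display form compresses the longest zero run, matching the
    // shortened form you would paste into the DNS editor.
    assert_eq!(addr.to_string(), "2001:db8::ff00:42:8329");
    println!("{addr}");
}
```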
    <div>
      <h3>What’s wrong with IPv4?</h3>
      <a href="#whats-wrong-with-ipv4">
        
      </a>
    </div>
    <p>So if Cloudflare has this solution, why should you move to IPv6 at all? To explain, let's start with the problem with IPv4.</p><p><a href="https://www.cloudflare.com/learning/network-layer/internet-protocol/#:~:text=The%20Internet%20Protocol%20(IP)%20is,Network%20layer">IP addresses</a> are used to identify and reach resources on a network, which could be a private network, like your office's private network, or a complex public network like the Internet. An example of an IPv4 address would be 198.51.100.1 or 198.51.100.50. There are approximately 4.3 billion unique IPv4 addresses like these for websites, servers, and other destinations on the Internet to use for routing.</p><p>4.3 billion IPv4 addresses may sound like a lot, but it isn't: the IPv4 space has effectively run out. In September 2015, ARIN, one of the regional Internet registries that allocates IP addresses, <a href="https://www.arin.net/resources/guide/ipv4/">announced</a> that it had no available space left: if you want to buy an IPv4 address, you have to talk to private companies that sell them. These companies charge a pretty penny: about <a href="https://auctions.ipv4.global/">$40 per IPv4 address today</a>. Buying a grouping of IPv4 addresses, known as a prefix (the minimum size is 256 IP addresses), costs about $10,000.</p><p>IP addresses are necessary for having a domain or device on the Internet, but today IPv4 addresses are an increasingly complicated resource to acquire. To keep the Internet growing, more unique addresses needed to be made available without breaking the bank. That's where IPv6 comes in.</p>
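For a sense of scale, the figures above are easy to check. A quick sketch (the $40-per-address figure is the market price quoted above):

```rust
fn main() {
    // IPv4 is a 32-bit address space: 2^32 unique addresses (~4.3 billion).
    let v4_space: u64 = 1 << 32;
    assert_eq!(v4_space, 4_294_967_296);

    // A minimum-size prefix of 256 addresses at roughly $40 apiece:
    let prefix_cost = 256 * 40;
    assert_eq!(prefix_cost, 10_240); // about the $10,000 quoted above
    println!("IPv4 space: {v4_space}, minimum prefix: ~${prefix_cost}");
}
```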
    <div>
      <h3>IPv4 vs. IPv6</h3>
      <a href="#ipv4-vs-ipv6">
        
      </a>
    </div>
    <p>In 1995 the IETF (Internet Engineering Task Force) published the <a href="https://datatracker.ietf.org/doc/html/rfc1883">RFC</a> for IPv6, which proposed to solve the problem of the limited IPv4 space. Instead of 32 bits of addressable space, IPv6 expanded to 128 bits of addressable space. <b>This means that instead of 4.3 billion addresses, there are approximately 340 undecillion IPv6 addresses available</b>. That is far more than the number of grains of sand on Earth.</p><p>So if the problem is solved, why should you care? Because many networks on the Internet still prefer IPv4, and companies like AWS are starting to charge money for IPv4 usage.</p><p>Let's talk about AWS first: AWS today owns one of the largest chunks of the IPv4 space. During the period when IPv4 addresses could be purchased on the private market for a few dollars per address, AWS used its large capital to its advantage and bought up a large amount of the space. Today AWS owns 1.7% of the IPv4 address space, which equates to <a href="https://toonk.io/aws-and-their-billions-in-ipv4-addresses/index.html">~100 million IPv4 addresses</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/toRGJVmQp33iZlby5l5vi/2528151de1f1655419f715594f0cb389/image1-11.png" />
            
            </figure><p>So you would think that moving to IPv6 is the obvious choice; however, for the Internet community it has proven to be quite a challenge.</p><p>When IPv6 was published in the 90s, very few networks had devices that supported it. Today, in 2023, that is no longer the case: the share of global networks supporting IPv6 has <a href="https://pulse.internetsociety.org/technologies">increased to 46 percent</a>, so the hardware limitations around supporting it are decreasing. Additionally, anti-abuse and security tools initially had no idea how to deal with attacks or traffic that used IPv6 address space, and this still remains an issue for some of these tools. In 2014, we made it even easier for origins to convert by creating <a href="/eliminating-the-last-reasons-to-not-enable-ipv6/">pseudo IPv4</a> to help bridge the gap for those tools.</p><p>Despite all of this, many networks still don’t have good support infrastructure for IPv6 networking, since most networks were built on IPv4. At Cloudflare, we have built our network to support both protocols, an approach known as “dual-stack”.</p><p>For a while, many networks also had markedly worse performance for IPv6 than for IPv4. This is no longer true: today we see only a slight degradation in IPv6 performance across the whole Internet compared to IPv4, for reasons that include legacy hardware, sub-optimal IPv6 connectivity outside our network, and the high cost of deploying IPv6. You can see in the chart below the additional latency of IPv6 traffic on Cloudflare’s network as compared to IPv4 traffic:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1gD5CMwSwUA7X2OkYlrLsT/4716cb91db061e660ccbe3ea8c5b3aae/image5-5.png" />
            
            </figure><p>There were many challenges to adopting IPv6, and for some, hardware compatibility and network performance remain worries. This is why continuing to use IPv4 while transitioning to IPv6 is still useful to many, and it is what makes AWS’ decision to charge for IPv4 so impactful for many websites.</p>
    <div>
      <h3>So, don’t pay for AWS IPv4 charges</h3>
      <a href="#so-dont-pay-for-aws-ipv4-charges">
        
      </a>
    </div>
    <p>At the end of the day the choice is clear: you can pay Amazon more to rent their IPv4 addresses each year than it would cost to buy them outright, or you can move to Cloudflare and use our free service to ease the transition to IPv6 with little overhead.</p>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <guid isPermaLink="false">6pnijheZgh5VXtQTUnLX9m</guid>
            <dc:creator>Anie Jacob</dc:creator>
        </item>
        <item>
            <title><![CDATA[Building fast interpreters in Rust]]></title>
            <link>https://blog.cloudflare.com/building-fast-interpreters-in-rust/</link>
            <pubDate>Mon, 04 Mar 2019 16:00:00 GMT</pubDate>
            <description><![CDATA[ In the previous post we described the Firewall Rules architecture and how the different components are integrated together. We created a configurable Rust library for writing and executing Wireshark®-like filters in different parts of our stack written in Go, Lua, C, C++ and JavaScript Workers. ]]></description>
            <content:encoded><![CDATA[ <p>In the <a href="/how-we-made-firewall-rules/">previous post</a> we described the Firewall Rules architecture and how the different components are integrated together. We also mentioned that we created a configurable Rust library for writing and executing <a href="https://www.wireshark.org/">Wireshark</a>®-like filters in different parts of our stack written in Go, Lua, C, C++ and JavaScript Workers.</p><blockquote><p>With a mixed set of requirements of performance, memory safety, low memory use, and the capability to be part of other products that we’re working on like Spectrum, Rust stood out as the strongest option.</p></blockquote>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3emjeRzzAw9z6ipj1FIjoD/fbfb5538cf10d6a5f0c096676dabfa63/Langs.png" />
            
            </figure><p>We have now open-sourced this library under our GitHub account: <a href="https://github.com/cloudflare/wirefilter">https://github.com/cloudflare/wirefilter</a>. This post will dive into its design, explain why we didn’t use a parser generator, and how our execution engine balances security, runtime performance and compilation cost for the generated filters.</p>
    <div>
      <h3>Parsing Wireshark syntax</h3>
      <a href="#parsing-wireshark-syntax">
        
      </a>
    </div>
    <p>When building a custom Domain Specific Language (DSL), the first thing we need to be able to do is parse it. This should result in an intermediate representation (usually called an Abstract Syntax Tree) that can be inspected, traversed, analysed and, potentially, serialised.</p><p>There are different ways to perform this conversion, such as:</p><ol><li><p>Manual char-by-char parsing using state machines, regular expression and/or native string APIs.</p></li><li><p>Parser combinators, which use higher-level functions to combine different parsers together (in Rust-land these are represented by <a href="https://github.com/Geal/nom">nom</a>, <a href="https://github.com/m4rw3r/chomp">chomp</a>, <a href="https://github.com/Marwes/combine">combine</a> and <a href="https://crates.io/keywords/parser-combinators">others</a>).</p></li><li><p>Fully automated generators which, provided with a grammar, can generate a fully working parser for you (examples are <a href="https://github.com/kevinmehall/rust-peg">peg</a>, <a href="https://github.com/pest-parser/pest">pest</a>, <a href="https://github.com/lalrpop/lalrpop">LALRPOP</a>, etc.).</p></li></ol>
    <div>
      <h4>Wireshark syntax</h4>
      <a href="#wireshark-syntax">
        
      </a>
    </div>
    <p>But before trying to figure out which approach would work best for us, let’s take a look at some of the simple <a href="https://wiki.wireshark.org/DisplayFilters">official Wireshark examples</a>, to understand what we’re dealing with:</p><ul><li><p><code>ip.len le 1500</code></p></li><li><p><code>udp contains 81:60:03</code></p></li><li><p><code>sip.To contains "a1762"</code></p></li><li><p><code>http.request.uri matches "gl=se$"</code></p></li><li><p><code>eth.dst == ff:ff:ff:ff:ff:ff</code></p></li><li><p><code>ip.addr == 192.168.0.1</code></p></li><li><p><code>ipv6.addr == ::1</code></p></li></ul><p>You can see that the right hand side of a comparison can be a number, an IPv4 / IPv6 address, a set of bytes or a string. They are used interchangeably, without any special notion of a type, which is fine given that they are easily distinguishable… or are they?</p><p>Let’s take a look at some <a href="https://en.wikipedia.org/wiki/IPv6#Address_representation">IPv6 forms</a> on Wikipedia:</p><ul><li><p><code>2001:0db8:0000:0000:0000:ff00:0042:8329</code></p></li><li><p><code>2001:db8:0:0:0:ff00:42:8329</code></p></li><li><p><code>2001:db8::ff00:42:8329</code></p></li></ul><p>So IPv6 can be written as a set of up to 8 colon-separated hexadecimal numbers, each containing up to 4 digits with leading zeros omitted for convenience. This appears suspiciously similar to the syntax for byte sequences. 
Indeed, if we try writing out a sequence like <code>2f:31:32:33:34:35:36:37</code>, it’s simultaneously a valid IPv6 address and a valid byte sequence in terms of Wireshark syntax.</p><p>There is no way of telling what this sequence actually represents without looking at the type of the field it’s being compared with, and if you try using this sequence in Wireshark, you’ll notice that it does just that:</p><ul><li><p><code>ipv6.addr == 2f:31:32:33:34:35:36:37</code>: right hand side is parsed and used as an IPv6 address</p></li><li><p><code>http.request.uri == 2f:31:32:33:34:35:36:37</code>: right hand side is parsed and used as a byte sequence (will match a URL <code>"/1234567"</code>)</p></li></ul><p>Are there other examples of such ambiguities? Yup - for example, we can try using a single number with two decimal digits:</p><ul><li><p><code>tcp.port == 80</code>: matches any traffic on the port 80 (HTTP)</p></li><li><p><code>http.file_data == 80</code>: matches any HTTP request/response with body containing a single byte (0x80)</p></li></ul><p>We could also do the same with an Ethernet address, defined as a separate type in Wireshark, but, for simplicity, we represent it as a regular byte sequence in our implementation, so there is no ambiguity here.</p>
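To illustrate how the same literal flips meaning with the field type, here is a small standalone sketch (the `Value` enum and `parse_rhs` helper are hypothetical simplifications, not the library's actual API):

```rust
use std::net::Ipv6Addr;

#[derive(Debug, PartialEq)]
enum Value {
    Ip(Ipv6Addr),
    Bytes(Vec<u8>),
}

// Hypothetical helper: interpret the right-hand-side literal according
// to the declared type of the field it is compared against.
fn parse_rhs(literal: &str, field_is_ip: bool) -> Option<Value> {
    if field_is_ip {
        literal.parse::<Ipv6Addr>().ok().map(Value::Ip)
    } else {
        literal
            .split(':')
            .map(|byte| u8::from_str_radix(byte, 16).ok())
            .collect::<Option<Vec<u8>>>()
            .map(Value::Bytes)
    }
}

fn main() {
    let lit = "2f:31:32:33:34:35:36:37";
    // Against an IPv6 field: parsed as an address.
    assert!(matches!(parse_rhs(lit, true), Some(Value::Ip(_))));
    // Against a bytes field: parsed as the byte string "/1234567".
    assert_eq!(parse_rhs(lit, false), Some(Value::Bytes(b"/1234567".to_vec())));
    println!("ok");
}
```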
    <div>
      <h4>Choosing a parsing approach</h4>
      <a href="#choosing-a-parsing-approach">
        
      </a>
    </div>
    <p>This is an interesting syntax design decision. It means that we need to store a mapping between field names and types ahead of time - a Scheme, as we call it - and use it for contextual parsing. This restriction also immediately rules out many if not most parser generators.</p><p>We could still use one of the more sophisticated ones (like LALRPOP) that allow replacing the default regex-based lexer with your own custom code, but at that point we’re so close to having a full parser for our DSL that the complexity outweighs any benefits of using a black-box parser generator.</p><p>Instead, we went with a manual parsing approach. While this might sound scary in unsafe languages like C / C++ (for good reason), in Rust all strings are bounds-checked by default. Rust also provides a rich string manipulation API, which we can use to build more complex helpers, eventually ending up with a full parser.</p><p>This approach is, in fact, pretty similar to parser combinators in that the parser doesn’t have to keep state and only passes the unprocessed part of the input down to smaller, narrower scoped functions. Just as in parser combinators, the absence of mutable state also allows us to easily test and maintain each of the parsers for different parts of the syntax independently of the others.</p><p>Compared with popular parser combinator libraries in Rust, one of the differences is that our parsers are not standalone functions but rather types that implement common traits:</p>
            <pre><code>pub trait Lex&lt;'i&gt;: Sized {
   fn lex(input: &amp;'i str) -&gt; LexResult&lt;'i, Self&gt;;
}
pub trait LexWith&lt;'i, E&gt;: Sized {
   fn lex_with(input: &amp;'i str, extra: E) -&gt; LexResult&lt;'i, Self&gt;;
}</code></pre>
            <p>The <code>lex</code> method or its contextual variant <code>lex_with</code> can either return a successful pair of <code>(instance of the type, rest of input)</code> or a pair of <code>(error kind, relevant input span)</code>.</p>
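As a standalone illustration of this pattern (with simplified error and result types, not the library's actual definitions), a parser for a decimal number might look like:

```rust
// Simplified stand-ins for the library's LexResult / Lex: the error
// type here is just a message plus the relevant input span.
type LexResult<'i, T> = Result<(T, &'i str), (&'static str, &'i str)>;

trait Lex<'i>: Sized {
    fn lex(input: &'i str) -> LexResult<'i, Self>;
}

#[derive(Debug, PartialEq)]
struct Num(u64);

impl<'i> Lex<'i> for Num {
    fn lex(input: &'i str) -> LexResult<'i, Self> {
        // Consume the longest run of ASCII digits from the front.
        let end = input
            .find(|c: char| !c.is_ascii_digit())
            .unwrap_or(input.len());
        if end == 0 {
            return Err(("expected a digit", &input[..input.len().min(1)]));
        }
        let n = input[..end]
            .parse()
            .map_err(|_| ("number too large", &input[..end]))?;
        // Return the parsed value together with the unconsumed remainder.
        Ok((Num(n), &input[end..]))
    }
}

fn main() {
    let (num, rest) = Num::lex("1500 and more").unwrap();
    assert_eq!(num, Num(1500));
    assert_eq!(rest, " and more");
    assert!(Num::lex("abc").is_err());
    println!("ok");
}
```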
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5L9MIL21iug4jVm8Eo1bGM/a4996c058b046ea785ff40d315772c53/parse.png" />
            
            </figure><p>The <code>Lex</code> trait is used for target types that can be parsed independently of the context (like field names or literals), while <code>LexWith</code> is used for types that need a <code>Scheme</code> or a part of it to be parsed unambiguously.</p><p>A bigger difference is that, instead of relying on higher-level functions for parser combinators, we use the usual imperative function call syntax. For example, when we want to perform sequential parsing, all we do is call several parsers in a row, using tuple destructuring for intermediate results:</p>
            <pre><code>let input = skip_space(input);
let (op, input) = CombinedExpr::lex_with(input, scheme)?;
let input = skip_space(input);
let input = expect(input, ")")?;</code></pre>
            <p>And, when we want to try different alternatives, we can use native pattern matching and ignore the errors:</p>
            <pre><code>if let Ok(input) = expect(input, "(") {
   ...
   (SimpleExpr::Parenthesized(Box::new(op)), input)
} else if let Ok((op, input)) = UnaryOp::lex(input) {
   ...
} else {
   ...
}</code></pre>
            <p>Finally, when we want to automate parsing of some more complicated common cases - say, enums - Rust provides a powerful macro syntax:</p>
            <pre><code>lex_enum!(#[repr(u8)] OrderingOp {
   "eq" | "==" =&gt; Equal = EQUAL,
   "ne" | "!=" =&gt; NotEqual = LESS | GREATER,
   "ge" | "&gt;=" =&gt; GreaterThanEqual = GREATER | EQUAL,
   "le" | "&lt;=" =&gt; LessThanEqual = LESS | EQUAL,
   "gt" | "&gt;" =&gt; GreaterThan = GREATER,
   "lt" | "&lt;" =&gt; LessThan = LESS,
});</code></pre>
            <p>This gives an experience similar to parser generators, while still using native language syntax and keeping us in control of all the implementation details.</p>
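For a rough idea of what such a macro saves you from writing, a hand-rolled equivalent for the ordering operators might look like this (a simplified sketch: no bit-flag discriminants and no word-boundary handling):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum OrderingOp {
    Equal,
    NotEqual,
    GreaterThanEqual,
    LessThanEqual,
    GreaterThan,
    LessThan,
}

fn lex_ordering_op(input: &str) -> Option<(OrderingOp, &str)> {
    use OrderingOp::*;
    // Two-character alternatives are listed before their one-character
    // prefixes, so ">=" wins over ">".
    let table: [(&str, OrderingOp); 12] = [
        ("eq", Equal), ("==", Equal),
        ("ne", NotEqual), ("!=", NotEqual),
        ("ge", GreaterThanEqual), (">=", GreaterThanEqual),
        ("le", LessThanEqual), ("<=", LessThanEqual),
        ("gt", GreaterThan), ("lt", LessThan),
        (">", GreaterThan), ("<", LessThan),
    ];
    table
        .iter()
        .find(|(tok, _)| input.starts_with(tok))
        .map(|&(tok, op)| (op, &input[tok.len()..]))
}

fn main() {
    assert_eq!(lex_ordering_op(">= 1500"), Some((OrderingOp::GreaterThanEqual, " 1500")));
    assert_eq!(lex_ordering_op("le 1500"), Some((OrderingOp::LessThanEqual, " 1500")));
    assert_eq!(lex_ordering_op("?!"), None);
    println!("ok");
}
```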
    <div>
      <h3>Execution engine</h3>
      <a href="#execution-engine">
        
      </a>
    </div>
    <p>Because our grammar and operations are fairly simple, initially we used direct AST interpretation by requiring all nodes to implement a trait that includes an <code>execute</code> method.</p>
            <pre><code>trait Expr&lt;'s&gt; {
    fn execute(&amp;self, ctx: &amp;ExecutionContext&lt;'s&gt;) -&gt; bool;
}</code></pre>
            <p>The <code>ExecutionContext</code> is pretty similar to a <code>Scheme</code>, but instead of mapping arbitrary field names to their types, it maps them to the runtime input values provided by the caller.</p><p>As with <code>Scheme</code>, initially <code>ExecutionContext</code> used an internal <code>HashMap</code> for registering these arbitrary <code>String</code> -&gt; <code>RhsValue</code> mappings. During the <code>execute</code> call, the AST implementation would evaluate itself recursively, and look up each field reference in this map, either returning a value or raising an error on missing slots and type mismatches.</p><p>This worked well enough for an initial implementation, but using a <code>HashMap</code> has a non-trivial cost which we would like to eliminate. We already used a more efficient hasher - <a href="https://github.com/servo/rust-fnv"><code>Fnv</code></a> - because we are in control of all keys and so are not worried about hash DoS attacks, but there was still more we could do.</p>
    <div>
      <h4>Speeding up field access</h4>
      <a href="#speeding-up-field-access">
        
      </a>
    </div>
    <p>If we look at the data structures involved, we can see that the scheme is always well-defined in advance, and all our runtime values in the execution engine are expected to eventually match it, even if the order or the precise set of fields is not guaranteed:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/mMOLvyXxOj9FxO3dIbYwr/6b308db1a7860c67f52209a689226b56/fieldaccess.png" />
            
            </figure><p>So what if we ditch the second map altogether and instead use a fixed-size array of values? Array indexing should be much cheaper than looking up in a map, so it might be well worth the effort.</p><p>How can we do it? We already know the number of items (thanks to the predefined scheme) so we can use that for the size of the backing storage, and, in order to simulate <code>HashMap</code> “holes” for unset values, we can wrap each item in an <code>Option&lt;...&gt;</code>:</p>
            <pre><code>pub struct ExecutionContext&lt;'e&gt; {
    scheme: &amp;'e Scheme,
    values: Box&lt;[Option&lt;LhsValue&lt;'e&gt;&gt;]&gt;,
}</code></pre>
            <p>The only missing piece is an index that could map both structures to each other. As you might remember, <code>Scheme</code> still uses a <code>HashMap</code> for field registration, and a <code>HashMap</code> is normally expected to be randomised and indexed only by the predefined key.</p><p>While we could wrap a value and an auto-incrementing index together into a custom struct, there is already a better solution: <a href="https://github.com/bluss/indexmap"><code>IndexMap</code></a>. <code>IndexMap</code> is a drop-in replacement for a <code>HashMap</code> that preserves ordering and provides a way to get an index of any element and vice versa - exactly what we needed.</p><p>After replacing a <code>HashMap</code> in the Scheme with <code>IndexMap</code>, we can change parsing to resolve all the parsed field names to their indices in-place and store that in the AST:</p>
            <pre><code>impl&lt;'i, 's&gt; LexWith&lt;'i, &amp;'s Scheme&gt; for Field&lt;'s&gt; {
   fn lex_with(mut input: &amp;'i str, scheme: &amp;'s Scheme) -&gt; LexResult&lt;'i, Self&gt; {
       ...
       let field = scheme
           .get_field_index(name)
           .map_err(|err| (LexErrorKind::UnknownField(err), name))?;
       Ok((field, input))
   }
}</code></pre>
            <p>After that, in the <code>ExecutionContext</code> we allocate a fixed-size array and use these indices for resolving values during runtime:</p>
            <pre><code>impl&lt;'e&gt; ExecutionContext&lt;'e&gt; {
   /// Creates an execution context associated with a given scheme.
   ///
   /// This scheme will be used for resolving any field names and indices.
   pub fn new&lt;'s: 'e&gt;(scheme: &amp;'s Scheme) -&gt; Self {
       ExecutionContext {
           scheme,
           values: vec![None; scheme.get_field_count()].into(),
       }
   }
   ...
}</code></pre>
            <p>This gave significant (~2x) speedups on our standard benchmarks:</p><p><i>Before:</i></p>
            <pre><code>test matching ... bench:       2,548 ns/iter (+/- 98)
test parsing  ... bench:     192,037 ns/iter (+/- 21,538)</code></pre>
            <p><i>After:</i></p>
            <pre><code>test matching ... bench:       1,227 ns/iter (+/- 29)
test parsing  ... bench:     197,574 ns/iter (+/- 16,568)</code></pre>
            <p>This change also improved the usability of our API, as any type errors are now detected and reported much earlier, when the values are just being set on the context, and not delayed until filter execution.</p>
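A minimal sketch of that idea (types and method names simplified, not the library's actual API): because the context knows each field's expected type from the scheme, a mismatched value can be rejected the moment it is set rather than during execution:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Type { Int, Bytes }

#[derive(Debug, PartialEq)]
enum LhsValue { Int(i64), Bytes(Vec<u8>) }

impl LhsValue {
    fn get_type(&self) -> Type {
        match self {
            LhsValue::Int(_) => Type::Int,
            LhsValue::Bytes(_) => Type::Bytes,
        }
    }
}

struct ExecutionContext {
    field_types: Vec<Type>,        // from the Scheme, indexed by field
    values: Vec<Option<LhsValue>>, // fixed-size storage, same indices
}

impl ExecutionContext {
    fn set_field_value(&mut self, index: usize, value: LhsValue) -> Result<(), String> {
        let expected = self.field_types[index];
        if value.get_type() != expected {
            // The type error surfaces here, at set time, not at execution.
            return Err(format!("expected {:?}, got {:?}", expected, value.get_type()));
        }
        self.values[index] = Some(value);
        Ok(())
    }
}

fn main() {
    let mut ctx = ExecutionContext {
        field_types: vec![Type::Int, Type::Bytes],
        values: vec![None, None],
    };
    assert!(ctx.set_field_value(0, LhsValue::Int(80)).is_ok());
    assert!(ctx.set_field_value(1, LhsValue::Bytes(b"hi".to_vec())).is_ok());
    // Wrong type for field 1: rejected immediately.
    assert!(ctx.set_field_value(1, LhsValue::Int(80)).is_err());
    println!("ok");
}
```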
    <div>
      <h4>[not] JIT compilation</h4>
      <a href="#not-jit-compilation">
        
      </a>
    </div>
    <p>Of course, as with any respectable DSL, one of the other ideas we had from the beginning was “...at some point we’ll add native compilation to make everything super-fast, it’s just a matter of time...”.</p><p>In practice, however, native compilation is a complicated matter, and not due to a lack of tools.</p><p>First of all, there is the question of storage for the native code. We could compile each filter statically into some sort of a library and publish it to a key-value store, but that would not be easy to maintain:</p><ul><li><p>We would have to compile each filter for several platforms (x86-64, ARM, WASM, …).</p></li><li><p>The overhead of native library formats would significantly outweigh the useful executable size, as most filters tend to be small.</p></li><li><p>Each time we’d like to change our execution logic, whether to optimise it or to fix a bug, we would have to recompile and republish all the previously stored filters.</p></li><li><p>Finally, even if we’re sure of the reliability of the chosen store, executing dynamically retrieved native code on the edge as-is is not something that can be taken lightly.</p></li></ul><p>The usual, more flexible alternative that addresses most of these issues is Just-in-Time (JIT) compilation.</p><p>When you compile code directly on the target machine, you get to re-verify the input (still expressed as a restricted DSL), you can compile it just for the current platform in-place, and you never need to republish the actual rules.</p><p>Looks like a perfect fit? Not quite. As with any technology, there are tradeoffs, and you only get to choose those that make more sense for your use cases. JIT compilation is no exception.</p><p>First of all, even though you’re not loading untrusted code over the network, you still need to generate it into memory, mark that memory as executable, and trust that it will always contain valid code and not garbage or something worse. Depending on your choice of libraries and the complexity of the DSL, you might be willing to trust it or put heavy sandboxing around it; either way, it’s a risk that one must explicitly be willing to take.</p><p>Another issue is the cost of compilation itself. Usually, when measuring the speed of native code vs. interpretation, the cost of compilation is not taken into account because it happens out of process.</p><p>With JIT compilers, though, it’s different: you’re now compiling things the moment they’re used and caching the native code only for a limited time. It turns out that generating native code can be rather expensive, so you must be absolutely sure that the compilation cost doesn’t offset any benefits you might gain from the native execution speedup.</p><p>I’ve talked a bit more about this at a <a href="https://www.meetup.com/rust-atx/">Rust Austin meetup</a> and, I believe, this topic deserves a separate blog post, so I won’t go into much more detail here, but feel free to check out the slides: <a href="https://www.slideshare.net/RReverser/building-fast-interpreters-in-rust">https://www.slideshare.net/RReverser/building-fast-interpreters-in-rust</a>. Oh, and if you’re in Austin, you should pop into our office for the next meetup!</p><p>Let’s get back to our original question: is there anything else we can do to get the best balance between security, runtime performance and compilation cost? It turns out there is.</p>
    <div>
      <h4>Dynamic dispatch and closures to the rescue</h4>
      <a href="#dynamic-dispatch-and-closures-to-the-rescue">
        
      </a>
    </div>
    <p>Introducing the <code>Fn</code> trait!</p><p>In Rust, the <code>Fn</code> trait and friends (<code>FnMut</code>, <code>FnOnce</code>) are automatically implemented on eligible functions and closures. In the case of plain <code>Fn</code>, the restriction is that they must not modify their captured environment and can only borrow from it.</p><p>Normally, you would want to use it in generic contexts to support arbitrary callbacks with given argument and return types. This is important because, in Rust, each function and closure implements a unique type, and any generic usage compiles down to a specific call to just that function.</p>
            <pre><code>fn just_call(me: impl Fn(), maybe: bool) {
  if maybe {
    me()
  }
}</code></pre>
            <p>Such behaviour (called static dispatch) is the default in Rust and is preferable for performance reasons.</p><p>However, if we don’t know all the possible types at compile time, Rust allows us to opt in to dynamic dispatch instead:</p>
            <pre><code>fn just_call(me: &amp;dyn Fn(), maybe: bool) {
  if maybe {
    me()
  }
}</code></pre>
            <p>Dynamically dispatched objects don't have a statically known size, because it depends on the implementation details of the particular type being passed. They need to be passed as a reference or stored in a heap-allocated <code>Box</code>, and then used just like in a generic implementation.</p><p>In our case, this allows us to create, return and store arbitrary closures, and later call them as regular functions:</p>
            <pre><code>trait Expr&lt;'s&gt; {
    fn compile(self) -&gt; CompiledExpr&lt;'s&gt;;
}

pub(crate) struct CompiledExpr&lt;'s&gt;(Box&lt;dyn 's + Fn(&amp;ExecutionContext&lt;'s&gt;) -&gt; bool&gt;);

impl&lt;'s&gt; CompiledExpr&lt;'s&gt; {
   /// Creates a compiled expression IR from a generic closure.
   pub(crate) fn new(closure: impl 's + Fn(&amp;ExecutionContext&lt;'s&gt;) -&gt; bool) -&gt; Self {
       CompiledExpr(Box::new(closure))
   }

   /// Executes a filter against a provided context with values.
   pub fn execute(&amp;self, ctx: &amp;ExecutionContext&lt;'s&gt;) -&gt; bool {
       self.0(ctx)
   }
}</code></pre>
            <p>The closure (an <code>Fn</code> box) will also automatically include the environment data it needs for the execution.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7x17xAapCcN3PjapVfoIyh/89ca29faa4b157fc2dcd7af0179eacb6/box.png" />
            
            </figure><p>This means that we can optimise the runtime data representation as part of the “compile” process without changing the AST or the parser. For example, when we wanted to optimise IP range checks by splitting them for different IP types, we could do that without having to modify any existing structures:</p>
            <pre><code>RhsValues::Ip(ranges) =&gt; {
   let mut v4 = Vec::new();
   let mut v6 = Vec::new();
   for range in ranges {
       match range.clone().into() {
           ExplicitIpRange::V4(range) =&gt; v4.push(range),
           ExplicitIpRange::V6(range) =&gt; v6.push(range),
       }
   }
   let v4 = RangeSet::from(v4);
   let v6 = RangeSet::from(v6);
   CompiledExpr::new(move |ctx| {
       match cast!(ctx.get_field_value_unchecked(field), Ip) {
           IpAddr::V4(addr) =&gt; v4.contains(addr),
           IpAddr::V6(addr) =&gt; v6.contains(addr),
       }
   })
}</code></pre>
            <p>Moreover, boxed closures can be part of that captured environment, too. This means that we can convert each simple comparison into a closure, then combine it with other closures, and keep going until we end up with a single top-level closure that can be invoked as a regular function to evaluate the entire filter expression.</p><p>It’s <s>turtles</s> closures all the way down:</p>
            <pre><code>let items = items
   .into_iter()
   .map(|item| item.compile())
   .collect::&lt;Vec&lt;_&gt;&gt;()
   .into_boxed_slice();

match op {
   CombiningOp::And =&gt; {
       CompiledExpr::new(move |ctx| items.iter().all(|item| item.execute(ctx)))
   }
   CombiningOp::Or =&gt; {
       CompiledExpr::new(move |ctx| items.iter().any(|item| item.execute(ctx)))
   }
   CombiningOp::Xor =&gt; CompiledExpr::new(move |ctx| {
       items
           .iter()
           .fold(false, |acc, item| acc ^ item.execute(ctx))
   }),
}</code></pre>
            <p>What’s nice about this approach is:</p><ul><li><p>Our execution is no longer tied to the AST, and we can be as flexible with optimising the implementation and data representation as we want without affecting the parser-related parts of the code or the output format.</p></li><li><p>Even though we initially “compile” each node to a single closure, in the future we can pretty easily specialise certain combinations of expressions into their own closures and so improve execution speed for common cases. All that would be required is a separate <code>match</code> branch returning a closure optimised for just that case.</p></li><li><p>Compilation is very cheap compared to real code generation. While it might seem that allocating many small objects (one <code>Box</code>ed closure per expression) is not very efficient and that it would be better to replace it with some sort of memory pool, in practice we saw a negligible performance impact.</p></li><li><p>No native code is generated at runtime, which means that we execute only code that was statically verified by Rust at compile-time and compiled down to a static function. All we do at runtime is call existing functions with different values.</p></li><li><p>Execution turns out to be faster too. This initially came as a surprise, because dynamic dispatch is widely believed to be costly and we were worried that it would get slightly worse than AST interpretation. However, it showed an immediate ~10-15% runtime improvement in benchmarks and on real examples.</p></li></ul><p>The only obvious downside is that each level of the AST requires a separate dynamically-dispatched call instead of a single inlined body for the entire expression, like you would get even with a basic template JIT.</p><p>Unfortunately, such output could be achieved only with real native code generation, and, for our case, the downsides and risks mentioned would outweigh the runtime benefits, so we went with the safe &amp; flexible closure approach.</p>
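<p>To illustrate the specialisation point above, here is a simplified sketch (hypothetical names, no lifetimes, not the crate’s actual code) of how a dedicated <code>match</code> branch could special-case an <code>And</code> of exactly two operands while keeping the generic fallback:</p>

```rust
// A simplified stand-in for the crate's CompiledExpr (no lifetimes,
// u32 stands in for the execution context) - a hypothetical sketch.
struct Compiled(Box<dyn Fn(u32) -> bool>);

impl Compiled {
    fn new(f: impl Fn(u32) -> bool + 'static) -> Self {
        Compiled(Box::new(f))
    }

    fn execute(&self, ctx: u32) -> bool {
        (self.0)(ctx)
    }
}

fn compile_and(items: Vec<Compiled>) -> Compiled {
    match items.len() {
        // Specialised branch: exactly two operands, no slice iteration.
        2 => {
            let mut it = items.into_iter();
            let (a, b) = (it.next().unwrap(), it.next().unwrap());
            Compiled::new(move |ctx| a.execute(ctx) && b.execute(ctx))
        }
        // Generic fallback, as in the combining snippet above.
        _ => {
            let items = items.into_boxed_slice();
            Compiled::new(move |ctx| items.iter().all(|item| item.execute(ctx)))
        }
    }
}

fn main() {
    let expr = compile_and(vec![
        Compiled::new(|ctx| ctx > 10),
        Compiled::new(|ctx| ctx < 20),
    ]);
    assert!(expr.execute(15));
    assert!(!expr.execute(25));
}
```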
    <div>
      <h3>Bonus: WebAssembly support</h3>
      <a href="#bonus-webassembly-support">
        
      </a>
    </div>
    <p>As was mentioned earlier, we chose Rust as a safe high-level language that allows easy integration with other parts of our stack written in Go, C and Lua via C FFI. But Rust has one more target it invests in and supports exceptionally well: WebAssembly.</p><p>Why would we be interested in that? Apart from the parts of the stack where our rules would run, and the API that publishes them, we also have users who like to write their own rules. To do that, they use a UI editor that allows either writing raw expressions in the Wireshark syntax or building them visually in a WYSIWYG builder.</p><p>We thought it would be great to expose the parser - the same one we use on the backend - to the frontend JavaScript for a consistent real-time editing experience. And, honestly, we were just looking for an excuse to play with WASM support in Rust.</p><p>WebAssembly could be targeted via regular C FFI, but in that case you would need to manually provide all the glue for the JavaScript side to hold and convert strings, arrays and objects back and forth.</p><p>In Rust, this is all handled by <a href="https://github.com/rustwasm/wasm-bindgen">wasm-bindgen</a>. While it provides various attributes and methods for direct conversions, the simplest way to get started is to activate the “serde” feature, which will automatically convert types using <code>JSON.parse</code>, <code>JSON.stringify</code> and <a href="https://docs.serde.rs/serde_json/"><code>serde_json</code></a> under the hood.</p><p>In our case, a parser wrapper of only 20 lines of code was enough to get started and generate all the required WASM code + JavaScript glue:</p>
            <pre><code>#[wasm_bindgen]
pub struct Scheme(wirefilter::Scheme);

fn into_js_error(err: impl std::error::Error) -&gt; JsValue {
   js_sys::Error::new(&amp;err.to_string()).into()
}

#[wasm_bindgen]
impl Scheme {
   #[wasm_bindgen(constructor)]
   pub fn try_from(fields: &amp;JsValue) -&gt; Result&lt;Scheme, JsValue&gt; {
       fields.into_serde().map(Scheme).map_err(into_js_error)
   }

   pub fn parse(&amp;self, s: &amp;str) -&gt; Result&lt;JsValue, JsValue&gt; {
       let filter = self.0.parse(s).map_err(into_js_error)?;
       JsValue::from_serde(&amp;filter).map_err(into_js_error)
   }
}</code></pre>
            <p>And by using a higher-level tool called <a href="https://github.com/rustwasm/wasm-pack">wasm-pack</a>, we also got automated npm package generation and publishing, for free.</p><p>This is not used in the production UI yet because we still need to figure out some details for unsupported browsers, but it’s great to have all the tooling and packages ready with minimal effort. By extending and reusing the same package, it should even be possible to run filters in Cloudflare Workers too (which <a href="/webassembly-on-cloudflare-workers/">also support WebAssembly</a>).</p>
    <div>
      <h3>The future</h3>
      <a href="#the-future">
        
      </a>
    </div>
    <p>The code in the current state is already doing its job well in production and we’re happy to share it with the open-source Rust community.</p><p>This is definitely not the end of the road though - we have many more fields to add, features to implement and planned optimisations to explore. If you find this sort of work interesting and would like to help us by working on firewalls, parsers or just any Rust projects at scale, give us a shout!</p> ]]></content:encoded>
            <category><![CDATA[Rust]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">2IkqAbjbvhsOMUuOPkvsnL</guid>
            <dc:creator>Ingvar Stepanyan</dc:creator>
            <dc:creator>Andrew Galloni</dc:creator>
        </item>
        <item>
            <title><![CDATA[eBPF, Sockets, Hop Distance and manually writing eBPF assembly]]></title>
            <link>https://blog.cloudflare.com/epbf_sockets_hop_distance/</link>
            <pubDate>Thu, 29 Mar 2018 10:43:38 GMT</pubDate>
            <description><![CDATA[ A friend gave me an interesting task: extract IP TTL values from TCP connections established by a userspace program. This seemingly simple task quickly exploded into an epic Linux system programming hack.  ]]></description>
            <content:encoded><![CDATA[ <p>A friend gave me an interesting task: extract IP TTL values from TCP connections established by a userspace program. This seemingly simple task quickly exploded into an epic Linux system programming hack. The resulting code is grossly over-engineered, but boy, did we learn plenty in the process!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1UrWjrMBPW4l3ipvThL0sn/1e78cd221cc63a5fb9964b3bd55cd11d/3845353725_7d7c624f34_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/paulmiller/3845353725/">image</a> by <a href="https://www.flickr.com/photos/paulmiller">Paul Miller</a></p>
    <div>
      <h3>Context</h3>
      <a href="#context">
        
      </a>
    </div>
    <p>You may wonder why she wanted to inspect the TTL packet field (formally known as "IP Time To Live (TTL)" in IPv4, or "Hop Count" in IPv6)? The reason is simple - she wanted to ensure that the connections are routed outside of our datacenter. The "Hop Distance" - the difference between the TTL value set by the originating machine and the TTL value in the packet received at its destination - shows how many routers the packet crossed. If a packet crossed two or more routers, we know it indeed came from outside of our datacenter.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6eYDA3vNOF8Cc9ytLOPo9N/c4862fac898725e1bd54b0284977f251/Screen-Shot-2018-03-29-at-10.52.49-AM-1.png" />
            
            </figure><p>It's uncommon to look at TTL values (except for their intended purpose of mitigating routing loops by checking when the TTL reaches zero). The normal way to deal with the problem we had would be to blocklist the IP ranges of our servers. But it’s not that simple in our setup. Our IP numbering configuration is rather baroque, with plenty of Anycast, Unicast and Reserved IP ranges. Some belong to us, some don't. We wanted to avoid having to maintain a hard-coded blocklist of IP ranges.</p><p>The gist of the idea is: we want to note the TTL value from a returned SYN+ACK packet. Having this number, we can estimate the Hop Distance - the number of routers on the path. If the Hop Distance is:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1LCtnMJcT3dIXlGwBrWf69/0dd6441edca81092fe6c536203657e21/Screen-Shot-2018-03-29-at-10.50.38-AM.png" />
            
            </figure><ul><li><p><b>zero</b>: we know the connection went to localhost or a local network.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ugFXo390P0VMXOpsvqtVQ/04fc3c2133fbcfaa40db47655b131ab3/Screen-Shot-2018-03-29-at-10.49.42-AM.png" />
            
            </figure><ul><li><p><b>one</b>: connection went through our router, and was terminated just behind it.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5dwa36qw1gHyNNLKJ24UiX/a3ee616930c1335d4e6f8dc3fa7da5bb/Screen-Shot-2018-03-29-at-10.49.48-AM.png" />
            
            </figure><ul><li><p><b>two</b>: connection went through two routers. Most likely our router, and one right next to it.</p></li></ul><p>For our use case, we want to see if the Hop Distance was two or more - this would ensure the connection was routed outside the datacenter.</p>
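<p>The estimation itself is simple arithmetic. A sketch of the idea (not the post’s code): since the receiver doesn’t know the sender’s initial TTL, one common trick is to assume the nearest default initial value (64, 128 or 255) at or above the observed TTL:</p>

```go
package main

import "fmt"

// estimateHopDistance guesses the sender's initial TTL as the smallest
// common default (64, 128 or 255) that is >= the received TTL, then
// returns the difference - the estimated number of routers crossed.
func estimateHopDistance(receivedTTL uint8) int {
	for _, initial := range []uint8{64, 128, 255} {
		if receivedTTL <= initial {
			return int(initial - receivedTTL)
		}
	}
	return 0 // unreachable: 255 is the maximum possible TTL
}

func main() {
	fmt.Println(estimateHopDistance(64))  // 0: local machine (initial TTL 64)
	fmt.Println(estimateHopDistance(61))  // 3: three routers away
	fmt.Println(estimateHopDistance(122)) // 6: sender with initial TTL 128
}
```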
    <div>
      <h3>Not so easy</h3>
      <a href="#not-so-easy">
        
      </a>
    </div>
    <p>It's easy to read TTL values from a userspace application, right? No. It turns out it's almost impossible. Here are the theoretical options we considered early on:</p><p>A) Run a libpcap/tcpdump-like raw socket, and catch the SYN+ACKs manually. We ruled out this design quickly - it requires elevated privileges. Also, raw sockets are pretty fragile: they can suffer packet loss if the userspace application can’t keep up.</p><p>B) Use the IP_RECVTTL socket option. IP_RECVTTL requests "cmsg" data to be attached to the control/ancillary data in a <code>recvmsg()</code> syscall. This is a good choice for UDP connections, but this socket option is not supported by TCP SOCK_STREAM sockets.</p><p>Extracting the TTL is not so easy.</p>
    <div>
      <h3>SO_ATTACH_FILTER to rule the world!</h3>
      <a href="#so_attach_filter-to-rule-the-world">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qX4zkQCTMRtJsnaqzysvp/a95be0f638ee764a86591c9576671281/315128991_d49c312fbc_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/leejordan/315128991/">image</a> by <a href="https://www.flickr.com/photos/leejordan/">Lee Jordan</a></p><p>Wait, there is a third way!</p><p>You see, for quite some time it has been possible to attach a BPF filtering program to a socket. See <a href="http://man7.org/linux/man-pages/man7/socket.7.html"><code>socket(7)</code></a></p>
            <pre><code>SO_ATTACH_FILTER (since Linux 2.2), SO_ATTACH_BPF (since Linux 3.19)
    Attach a classic BPF (SO_ATTACH_FILTER) or an extended BPF
    (SO_ATTACH_BPF) program to the socket for use as a filter of
    incoming packets.  A packet will be dropped if the filter pro‐
    gram returns zero.  If the filter program returns a nonzero
    value which is less than the packet's data length, the packet
    will be truncated to the length returned.  If the value
    returned by the filter is greater than or equal to the
    packet's data length, the packet is allowed to proceed unmodi‐
    fied.</code></pre>
            <p>You probably take advantage of SO_ATTACH_FILTER already: This is how tcpdump/wireshark does filtering when you're dumping packets off the wire.</p><p>How does it work? Depending on the result of a <a href="/bpf-the-forgotten-bytecode/">BPF program</a>, packets can be filtered, truncated or passed to the socket without modification. Normally SO_ATTACH_FILTER is used for RAW sockets, but surprisingly, BPF filters can also be attached to normal SOCK_STREAM and SOCK_DGRAM sockets!</p><p>We don't want to truncate packets though - we want to extract the TTL. Unfortunately with Classical BPF (cBPF) it's impossible to extract any data from a running BPF filter program.</p>
    <div>
      <h3>eBPF and maps</h3>
      <a href="#ebpf-and-maps">
        
      </a>
    </div>
    <p>This changed with modern BPF machinery, which includes:</p><ul><li><p>modernised eBPF bytecode</p></li><li><p>eBPF maps</p></li><li><p><a href="https://man7.org/linux/man-pages/man2/bpf.2.html"><code>bpf()</code> syscall</a></p></li><li><p>SO_ATTACH_BPF socket option</p></li></ul><p>eBPF bytecode can be thought of as an extension to Classical BPF, but it's the extra features that really let it shine.</p><p>The gem is the "map" abstraction. An eBPF map is a thingy that allows an eBPF program to store data and share it with a userspace code. Think of an eBPF map as a data structure (a hash table most usually) shared between a userspace program and an eBPF program running in kernel space.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/51BMjhXrkIIAlAatG0qsMh/bdd6b955f048ae0a2cd5f5005a460bfe/Screen-Shot-2018-03-29-at-10.59.43-AM.png" />
            
            </figure><p>To solve our TTL problem, we can use eBPF filter program. It will look at the TTL values of passing packets, and save them in an eBPF map. Later, we can inspect the eBPF map and analyze the recorded values from userspace.</p>
    <div>
      <h3>SO_ATTACH_BPF to rule the world!</h3>
      <a href="#so_attach_bpf-to-rule-the-world">
        
      </a>
    </div>
    <p>To use eBPF we need a number of things set up. First, we need to create an "eBPF map". There <a href="https://elixir.bootlin.com/linux/v4.15.13/source/include/uapi/linux/bpf.h#L99">are many specialized map types</a>, but for our purposes let's use the "hash" BPF_MAP_TYPE_HASH type.</p><p>We need to figure out the <i>"bpf(BPF_MAP_CREATE, map type, key size, value size, limit, flags)"</i> parameters. For our small TTL program, let's set 4 bytes for the key size and 8 bytes for the value size. The max element limit is set to 5; it doesn't matter much, since we expect all the packets in one connection to share a single coherent TTL value anyway.</p><p>This is how it would look in <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/ebpf.go#L57">Golang code</a>:</p>
            <pre><code>bpfMapFd, err := ebpf.NewMap(ebpf.Hash, 4, 8, 5, 0)</code></pre>
            <p>A word of warning is needed here. BPF maps use the "locked memory" resource. With multiple BPF programs and maps, it's easy to exhaust the default tiny 64 KiB limit. Consider bumping this with <code>ulimit -l</code>, for example:</p>
            <pre><code>ulimit -l 10240</code></pre>
            <p>The <code>bpf()</code> syscall returns a file descriptor pointing to the kernel BPF map we just created. With it handy we can operate on a map. The possible operations are:</p><ul><li><p><code>bpf(BPF_MAP_LOOKUP_ELEM, &lt;key&gt;)</code></p></li><li><p><code>bpf(BPF_MAP_UPDATE_ELEM, &lt;key&gt;, &lt;value&gt;, &lt;flags&gt;)</code></p></li><li><p><code>bpf(BPF_MAP_DELETE_ELEM, &lt;key&gt;)</code></p></li><li><p><code>bpf(BPF_MAP_GET_NEXT_KEY, &lt;key&gt;)</code></p></li></ul><p>More on this later.</p><p>With the map created, we need to create a BPF program. As opposed to classical BPF - where the bytecode was a parameter to SO_ATTACH_FILTER - the bytecode is now loaded by the <code>bpf()</code> syscall. Specifically: <code>bpf(BPF_PROG_LOAD)</code>.</p><p>In our <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/ebpf.go#L78-L131">Golang program the eBPF program setup</a> looks like:</p>
            <pre><code>ebpfInss := ebpf.Instructions{
	ebpf.BPFIDstOffSrc(ebpf.LdXW, ebpf.Reg0, ebpf.Reg1, 16),
	ebpf.BPFIDstOffImm(ebpf.JEqImm, ebpf.Reg0, 3, int32(htons(ETH_P_IPV6))),
	ebpf.BPFIDstSrc(ebpf.MovSrc, ebpf.Reg6, ebpf.Reg1),
	ebpf.BPFIImm(ebpf.LdAbsB, int32(-0x100000+8)),
...
	ebpf.BPFIDstImm(ebpf.MovImm, ebpf.Reg0, -1),
	ebpf.BPFIOp(ebpf.Exit),
}

bpfProgram, err := ebpf.NewProgram(ebpf.SocketFilter, &amp;ebpfInss, "GPL", 0)</code></pre>
            <p>Writing eBPF by hand is rather controversial. Most people use <code>clang</code> (from version 3.7 onwards) to compile code written in a C dialect into eBPF bytecode. The resulting bytecode is saved in an ELF file, which can be loaded by most eBPF libraries. This ELF file also includes a description of the maps, so you don’t need to set them up manually.</p><p>I personally don't see the point in adding an ELF/clang dependency for simple SO_ATTACH_BPF snippets. Don't be afraid of the raw bytecode!</p>
    <div>
      <h3>BPF calling convention</h3>
      <a href="#bpf-calling-convention">
        
      </a>
    </div>
    <p>Before we go further we should highlight a couple of things about the eBPF environment. The official kernel documentation isn't too friendly:</p><ul><li><p><a href="https://www.kernel.org/doc/Documentation/networking/filter.txt">Documentation/networking/filter.txt</a></p></li></ul><p>The first important bit to know is the calling convention:</p><ul><li><p>R0 - return value from in-kernel function, and exit value for eBPF program</p></li><li><p>R1-R5 - arguments from eBPF program to in-kernel function</p></li><li><p>R6-R9 - callee saved registers that in-kernel function will preserve</p></li><li><p>R10 - read-only frame pointer to access stack</p></li></ul><p>When the BPF program is started, R1 contains a pointer to <code>ctx</code>. This data structure <a href="https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L799">is defined as <code>struct __sk_buff</code></a>. For example, to access the <code>protocol</code> field you'd need to run:</p>
            <pre><code>r0 = *(u32 *)(r1 + 16)</code></pre>
            <p>Or in other words:</p>
            <pre><code>ebpf.BPFIDstOffSrc(ebpf.LdXW, ebpf.Reg0, ebpf.Reg1, 16),</code></pre>
            <p>Which is exactly what we do in the first line of our program, since we need to choose between the IPv4 and IPv6 code branches.</p>
    <div>
      <h3>Accessing the BPF payload</h3>
      <a href="#accessing-the-bpf-payload">
        
      </a>
    </div>
    <p>Next, there are special instructions for packet payload loading. Most BPF programs (but not all!) run in the context of packet filtering, so it makes sense to accelerate data lookups by having magic opcodes for accessing packet data.</p><p>Instead of dereferencing context, like <code>ctx-&gt;data[x]</code> to load a byte, BPF supports the <code>BPF_LD</code> instruction that can do it in one operation. There are caveats though, the documentation says:</p>
            <pre><code>eBPF has two non-generic instructions: (BPF_ABS | &lt;size&gt; | BPF_LD) and
(BPF_IND | &lt;size&gt; | BPF_LD) which are used to access packet data.

They had to be carried over from classic BPF to have strong performance of
socket filters running in eBPF interpreter. These instructions can only
be used when interpreter context is a pointer to 'struct sk_buff' and
have seven implicit operands. Register R6 is an implicit input that must
contain pointer to sk_buff. Register R0 is an implicit output which contains
the data fetched from the packet. Registers R1-R5 are scratch registers
and must not be used to store the data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.</code></pre>
            <p>In other words: before calling <code>BPF_LD</code> we must move <code>ctx</code> to R6, like this:</p>
            <pre><code>ebpf.BPFIDstSrc(ebpf.MovSrc, ebpf.Reg6, ebpf.Reg1),</code></pre>
            <p>Then we can call the load:</p>
            <pre><code>ebpf.BPFIImm(ebpf.LdAbsB, int32(-0x100000+8)),</code></pre>
            <p>At this stage the result is in r0, but we must remember that r1-r5 should be considered dirty. From the caller's perspective, <code>BPF_LD</code> looks very much like a function call.</p>
    <div>
      <h3>Magical Layer 3 offset</h3>
      <a href="#magical-layer-3-offset">
        
      </a>
    </div>
    <p>Next, note the load offset - we loaded the <code>-0x100000+8</code> byte of the packet. This magic offset is another BPF context curiosity. It turns out that a BPF script loaded with SO_ATTACH_BPF on a SOCK_STREAM (or SOCK_DGRAM) socket will only see Layer 4 and higher OSI layers by default. To extract the TTL we need access to the Layer 3 header (i.e. the IP header). To access L3 in the L4 context, we must offset the data lookups by the magical -0x100000.</p><p>This magic constant <a href="https://github.com/torvalds/linux/blob/ead751507de86d90fa250431e9990a8b881f713c/include/uapi/linux/filter.h#L84">is defined in the kernel</a>.</p><p>For completeness, the <code>+8</code> is, of course, the offset of the TTL field in an IPv4 header. Our small BPF program also supports IPv6, where the TTL/Hop Count is at offset <code>+7</code>.</p>
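<p>The field positions can be sanity-checked from userspace by indexing into raw header bytes - a sketch, not the post’s code:</p>

```go
package main

import "fmt"

// ttl returns the TTL / Hop Limit of a raw IP packet by looking at the
// version nibble: byte 8 for IPv4 (RFC 791), byte 7 for IPv6 (RFC 8200).
func ttl(pkt []byte) uint8 {
	if pkt[0]>>4 == 4 {
		return pkt[8] // IPv4: TTL is the 9th octet
	}
	return pkt[7] // IPv6: Hop Limit is the 8th octet
}

func main() {
	// Minimal synthetic headers: just the version nibble and the TTL field.
	ipv4 := make([]byte, 20)
	ipv4[0], ipv4[8] = 0x45, 61 // version 4, IHL 5, TTL 61
	ipv6 := make([]byte, 40)
	ipv6[0], ipv6[7] = 0x60, 57 // version 6, Hop Limit 57
	fmt.Println(ttl(ipv4), ttl(ipv6)) // 61 57
}
```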
    <div>
      <h3>Return value</h3>
      <a href="#return-value">
        
      </a>
    </div>
    <p>Finally, the return value of the BPF program is meaningful. In the context of packet filtering it is interpreted as a truncated packet length. Had we returned 0, the packet would have been dropped and wouldn't have been seen by the userspace socket application. It's quite interesting that we can do packet-based data manipulation with eBPF on a stream-based socket. Anyway, our script returns -1, which when cast to unsigned will be interpreted as a very large number:</p>
            <pre><code>ebpf.BPFIDstImm(ebpf.MovImm, ebpf.Reg0, -1),
ebpf.BPFIOp(ebpf.Exit),</code></pre>
            
    <div>
      <h3>Extracting data from map</h3>
      <a href="#extracting-data-from-map">
        
      </a>
    </div>
    <p>Our running BPF program will set a key on our map for any matched packet. The key is the recorded TTL value, the value is the packet count. The value counter is somewhat vulnerable to a tiny race condition, but it's ignorable for our purposes. Later on, to extract the data from userspace program we use this Golang loop:</p>
            <pre><code>var (
	value   MapU64
	k1, k2  MapU32
)

for {
	ok, err := bpfMap.Get(k1, &amp;value, 8)
	if ok {
		// k1 is TTL, value is counter
		...
	}

	ok, err = bpfMap.GetNextKey(k1, &amp;k2, 4)
	if err != nil || ok == false {
		break
	}
	k1 = k2
}</code></pre>
            
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>Now with all the pieces ready we can make it a proper runnable program. There is little point in discussing it here, so allow me to refer to the source code. The BPF pieces are here:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/ebpf.go">ebpf.go</a></p></li></ul><p>We haven't discussed how to catch the inbound SYN+ACK in the BPF program. This is a matter of setting up the BPF before calling <code>connect()</code>. Sadly, it's impossible to customize <code>net.Dial</code> in Golang. Instead we wrote a surprisingly painful and awful custom Dial implementation. The ugly custom dialer code is here:</p><ul><li><p><a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-03-ebpf/magic_conn.go">magic_conn.go</a></p></li></ul><p>To run all this you need a 4.4+ kernel with the <code>bpf()</code> syscall compiled in. BPF features of specific kernels are documented in this superb page from BCC:</p><ul><li><p><a href="https://github.com/iovisor/bcc/blob/master/docs/kernel-versions.md">docs/kernel-versions.md</a></p></li></ul><p>Run the code to observe the TTL Hop Counts:</p>
            <pre><code>$ ./ttl-ebpf tcp4://google.com:80 tcp6://google.com:80 \
             tcp4://cloudflare.com:80 tcp6://cloudflare.com:80
[+] TTL distance to tcp4://google.com:80 172.217.4.174 is 6
[+] TTL distance to tcp6://google.com:80 [2607:f8b0:4005:809::200e] is 4
[+] TTL distance to tcp4://cloudflare.com:80 198.41.215.162 is 3
[+] TTL distance to tcp6://cloudflare.com:80 [2400:cb00:2048:1::c629:d6a2] is 3</code></pre>
            
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>In this blog post we dived into the new eBPF machinery, including the <code>bpf()</code> syscall, maps and SO_ATTACH_BPF. This work allowed me to realize the potential of running SO_ATTACH_BPF on fully established TCP sockets. Undoubtedly, eBPF still requires plenty of love and documentation, but it seems to be a perfect bridge to expose low level toggles to userspace applications.</p><p>I highly recommend keeping the dependencies small. For small BPF programs, like the one shown, there is little need for complex clang compilation and ELF loading. Don't be afraid of the eBPF bytecode!</p><p>We only touched on SO_ATTACH_BPF, where we analyzed network packets with BPF running on network sockets. There is more! First, you can attach BPFs to a dozen "things", XDP being the most obvious example. <a href="https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L119">Full list</a>. Then, it's possible to actually affect kernel packet processing, here is a <a href="https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/bpf.h#L318">full list of helper functions</a>, some of which can modify kernel data structures.</p><p>In February <a href="https://lwn.net/Articles/747551/">LWN jokingly wrote</a>:</p>
            <pre><code>Developers should be careful, though; this could
prove to be a slippery slope leading toward something 
that starts to look like a microkernel architecture.</code></pre>
            <p>There is a grain of truth here. Maybe the ability to run eBPF on a variety of subsystems feels like microkernel coding, but SO_ATTACH_BPF definitely smells like the <a href="https://en.wikipedia.org/wiki/STREAMS">STREAMS</a> programming model <a href="https://cseweb.ucsd.edu/classes/fa01/cse221/papers/ritchie-stream-io-belllabs84.pdf">from 1984</a>.</p><hr /><p>Thanks to <a href="https://twitter.com/akajibi">Gilberto Bertin</a> and <a href="https://twitter.com/dwragg">David Wragg</a> for helping out with the eBPF bytecode.</p><hr /><p><i>Does eBPF work sound interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco, Champaign and our elite office in Warsaw, Poland</i>.</p> ]]></content:encoded>
            <category><![CDATA[TTL]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[eBPF]]></category>
            <guid isPermaLink="false">6LTjo1ZkNHiWvy46IuloEd</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[2018 and the Internet: our predictions]]></title>
            <link>https://blog.cloudflare.com/our-predictions-for-2018/</link>
            <pubDate>Thu, 21 Dec 2017 14:01:43 GMT</pubDate>
            <description><![CDATA[ At the end of 2016, I wrote a blog post with seven predictions for 2017. Let’s start by reviewing how I did. I’ll score myself with two points for being correct, one point for mostly right and zero for wrong. That’ll give me a maximum possible score of fourteen. Here goes... ]]></description>
            <content:encoded><![CDATA[ <p>At the end of 2016, I wrote a blog post with <a href="/2017-and-the-internet-our-predictions/">seven predictions for 2017</a>. Let’s start by reviewing how I did.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7zM4Q6zRAwVRz7Ffh2YqQU/c936cf42bddbd81ed8e08337f85a9b66/35619461910_0bbc594bb0_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/publicdomain/mark/1.0/">Public Domain</a> <a href="https://www.flickr.com/photos/151136904@N08/35619461910/in/photolist-34owvg-btghyJ-8w5ywj-bgRUbx-TjZ1QE-bB7KBf-ZUn5LW-e5izhc-aBAQEp-ejkW44-Wgz44C-8LCN5-XgqZfJ-TUMEpF-qJqrch-aNq3ip-9p3jCx-n8ro7-4pw3M6-bVqgoC-j36Sp-cErE8q-VLLr-bVq9wd-bVqf6S-bVqa2f-bVq7TA-bVq7dw-9ddPw-wT8ipq">image</a> by <a href="https://www.flickr.com/photos/151136904@N08/">Michael Sharpe</a></p><p>I’ll score myself with two points for being correct, one point for mostly right and zero for wrong. That’ll give me a maximum possible score of fourteen. Here goes...</p><p><b>2017-1: 1Tbps DDoS attacks will become the baseline for ‘massive attacks’</b></p><p>This turned out to be true but mostly because massive attacks went away as Layer 3 and Layer 4 DDoS mitigation services got good at filtering out high bandwidth and high packet rates. Over the year we saw many DDoS attacks in the 100s of Gbps (up to 0.5Tbps) and then in September announced <a href="/unmetered-mitigation/">Unmetered Mitigation</a>. Almost immediately we saw attackers stop bothering to attack Cloudflare-protected sites with large DDoS.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7zYSDBhiRxTXxR0yDktc2T/bc77555ea56ce9c5b7482ef29501b618/syn.png" />
            
            </figure><p>So, I’ll be generous and give myself one point.</p><p><b>2017-2: The Internet will get faster yet again as protocols like QUIC become more prevalent</b></p><p>Well, yes and no. QUIC has become more prevalent as Google has widely deployed it in the Chrome browser and it accounts for about 7% of Internet traffic. At the same time the protocol is working its way through the IETF <a href="https://datatracker.ietf.org/wg/quic/about/">standardization process</a> and has yet to be deployed widely outside Google.</p><p>So, I’ll award myself one point for this as QUIC did progress but didn’t get as far as I thought.</p><p><b>2017-3: IPv6 will become the defacto for mobile networks and IPv4-only fixed networks will be looked upon as old fashioned</b></p><p>IPv6 continued <a href="https://www.employees.org/~dwing/aaaa-stats/">to grow throughout 2017</a> and seems to be on a pretty steady upward trajectory, although it’s not yet deployed on a quarter of the top 25,000 web sites. Note the large jump in IPv6 support in the middle of 2016, when Cloudflare enabled it by default for all our customers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4nzOhssnadY4Nc5AbobgjB/166f7f858f5235b1c5835ecf95113077/ipv6.png" />
            
            </figure><p>The Internet Society <a href="https://www.internetsociety.org/resources/doc/2017/state-of-ipv6-deployment-2017/">reported</a> that mobile networks that switch to IPv6 see 70-95% of their traffic use IPv6. Google reports that traffic from Verizon is now 90% IPv6 and T-Mobile is <a href="http://www.rmv6tf.org/wp-content/uploads/2017/04/04-IPv6-NAv6TF-Langerholm-1.pdf">turning off IPv4 completely</a>.</p><p>Here I’ll award myself two points.</p><p><b>2017-4: A SHA-1 collision will be announced</b></p><p>That happened on 23 February 2017 with the announcement of an <a href="https://shattered.io/">efficient way to generate colliding</a> PDF documents. It’s so efficient that here are two PDFs containing the old and new Cloudflare logos. I generated these two PDFs using a <a href="https://alf.nu/SHA1">web site</a> that takes two JPEGs, embeds them in two PDFs and makes them collide. It does this instantly.</p>
            <figure>
            <a href="https://github.com/jgrahamc/sha1collision/raw/master/b.pdf">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23mueESj493cBGzcNqwCR4/82fd43f492925db82b3a7e1e650bddee/new-cf.jpg" />
            </a>
            </figure>
            <figure>
            <a href="https://github.com/jgrahamc/sha1collision/raw/master/a.pdf">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1zhhwfkFHSuCtBxXl2Ga4A/860eb9de58805669baae35c8cb2c8738/old-cf.jpg" />
            </a>
            </figure><p>They have the same SHA-1 hash:</p>
            <pre><code>$ shasum *.pdf
e1964edb8bcafc43de6d1d99240e80dfc710fbe1  a.pdf
e1964edb8bcafc43de6d1d99240e80dfc710fbe1  b.pdf</code></pre>
            <p>But different SHA-256 hash:</p>
            <pre><code>$ shasum -a256 *.pdf
8e984df6f4a63cee798f9f6bab938308ebad8adf67daba349ec856aad07b6406  a.pdf
f20f44527f039371f0aa51bc9f68789262416c5f2f9cefc6ff0451de8378f909  b.pdf</code></pre>
            <p>So, two points for getting that right (and thanks, <a href="/author/nick-sullivan/">Nick Sullivan</a>, for suggesting it and making me look smart).</p><p><b>2017-5: Layer 7 attacks will rise but Layer 6 won’t be far behind</b></p><p>The one constant of 2017 in terms of DDoS was the prevalence of Layer 7 attacks. Even as attackers decided that large scale Layer 3 and 4 DDoS attacks were being mitigated easily and hence stopped performing them so frequently, Layer 7 attacks continued apace with attacks in the 100s of krps common place.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6QV18HDWFkCyFAqEtQqeOr/05cb31c14e35b00481f5d745f0eb92d8/layer7.png" />
            
            </figure><p>Awarding myself one point because Layer 6 attacks didn’t materialize as much as predicted.</p><p><b>2017-6: Mobile traffic will account for 60% of all Internet traffic by the end of the year</b></p><p>Ericsson reported mid-year that <a href="https://www.ericsson.com/assets/local/mobility-report/documents/2017/ericsson-mobility-report-june-2017.pdf">mobile data traffic was continuing to grow strongly</a> and grew 70% between Q116 and Q117. Stats show that while mobile traffic <a href="https://www.statista.com/statistics/277125/share-of-website-traffic-coming-from-mobile-devices/">continued to increase its share</a> of Internet traffic and passed 50% in 2017 it didn’t reach 60%.</p><p>Zero points for me.</p><p><b>2017-7: The security of DNS will be taken seriously</b></p><p>This has definitely happened. The 2016 Dyn DNS attack was a wake up call that often overlooked infrastructure was at risk of DDoS attack. In April 2017 Wired reported that hackers took over 36 Brazilian banking web sites by <a href="https://www.wired.com/2017/04/hackers-hijacked-banks-entire-online-operation/">hijacking DNS registration</a>, and in June Mozilla and ICANN proposed <a href="https://tools.ietf.org/html/draft-hoffman-dns-over-https-01">encrypting DNS by sending it over HTTPS</a> and the IETF has a <a href="https://datatracker.ietf.org/wg/doh/about/">working group</a> on what’s now being called doh.</p><p>DNSSEC deployment continued with <a href="http://secspider.verisignlabs.com/growth.html">SecSpider</a> showing steady, continuous growth during 2017.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/BnqRCq5jI93WRfoaGy1k3/2f59c7cfe8687d6c43dfdced049ae74b/growth.png" />
            
            </figure><p>So, two points for me.</p><p>Overall, I scored myself a total of 9 out of 14, or 64% right. With that success rate in mind here are my predictions for 2018.</p>
    <div>
      <h2>2018 Predictions</h2>
      <a href="#2018-predictions">
        
      </a>
    </div>
    <p><a href="#20181"></a><b>2018-1: By the end of 2018 more than 50% of HTTPS connections will happen over TLS 1.3</b></p><p>The rollout of TLS 1.3 has been stalled because of the difficulty of getting it working correctly in the heterogeneous Internet environment. Although Cloudflare has had <a href="/introducing-tls-1-3/">TLS 1.3 in production and available for all customers for over a year</a>, only 0.2% of our traffic is currently using that version.</p><p>Given the state of <a href="https://blog.apnic.net/2017/12/12/internet-protocols-changing/">standardization</a> of TLS 1.3 today, we believe that major browser vendors will enable TLS 1.3 during 2018 and that by the end of the year more than 50% of HTTPS connections will be using the latest, most secure version of TLS.</p><p><a href="#20182"></a><b>2018-2: Vendor lock-in with Cloud Computing vendors becomes dominant worry for enterprises</b></p><p>In Mary Meeker’s 2017 Internet Trends report she <a href="http://www.kpcb.com/internet-trends">gives statistics</a> (slide 183) on the top three concerns of users of cloud computing. These show a striking change from being primarily about security and cost to worries about vendor lock-in and compliance. Cloudflare believes that vendor lock-in will become the top concern of users of cloud computing in 2018 and that multi-cloud strategies will become common.</p><p><a href="https://www.billforward.net/">BillForward</a> is already taking a <a href="/living-in-a-multi-cloud-world/">multi-cloud approach</a> with Cloudflare moving traffic dynamically between cloud computing providers. Alongside vendor lock-in, users will name data portability between clouds as a top concern.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5gEvIINYt655HbsC7TbZQ5/223c737aeda1ef5328b0c892a293eb7c/Mary-Meeker-Internet-Trends-2017.png" />
            
            </figure><p><a href="#20183"></a><b>2018-3: Deep learning hype will subside as self-driving cars don't become a reality but AI/ML salaries will remain high</b></p><p>Self-driving cars won’t become available in 2018, but AI/ML will remain red hot as every technology company tries to hire appropriate engineering staff and finds they can’t. At the same time <a href="https://www.cloudflare.com/learning/ai/what-is-deep-learning/">deep learning techniques</a> will be widely applied across companies and industries as it becomes clear that these techniques are <a href="http://learningsys.org/nips17/assets/slides/dean-nips17.pdf">not limited</a> to game playing, classification, or translation tasks and can be widely applied.</p><p>Expect unexpected applications of techniques, that are already in use in Silicon Valley, when they are applied to the rest of the world. Don’t be surprised if there’s talk of AI/ML managed traffic management for highways, for example. Anywhere there's a heuristic we'll see AI/ML applied.</p><p>But it’ll take another couple of years for AI/ML to really have profound effects. By 2020 the talent pool will have greatly increased and manufacturers such as Qualcomm, nVidia and Intel will have followed Google’s lead and produced specialized chipsets designed for deep learning and other ML techniques.</p><p><a href="#20184"></a><b>2018-4: k8s becomes the dominant platform for cloud computing</b></p><p>A corollary to users’ concerns about cloud vendor lock-in and the need for multi-cloud capability is that an orchestration framework will dominate. We believe that Kubernetes will be that dominant platform and that large cloud vendors will work to ensure compatibility across implementations at the demand of customers.</p><p>We are currently in the infancy of k8s deployment with the major cloud computing vendors deploying incompatible versions. 
We believe that customer demand for portability will cause cloud computing vendors to ensure compatibility.</p><p><a href="#20185"></a><b>2018-5: Quantum resistant crypto will be widely deployed in machine-to-machine links across the Internet</b></p><p>During 2017 Cloudflare experimented with, and open sourced, <a href="/sidh-go/">quantum-resistant cryptography</a> as part of our implementation of TLS 1.3. Today there is a threat to the security of Internet protocols from quantum computers, and although the threat has not been realized, cryptographers are working on cryptographic schemes that will resist attacks from quantum computers when they arrive.</p><p>We predict that quantum-resistant cryptography will become widespread in links between machines and data centers, especially where the connections being encrypted cross the public Internet. We don’t predict that quantum-resistant cryptography will be widespread in browsers, however.</p><p><a href="#20186"></a><b>2018-6: Mobile traffic will account for 60% of all Internet traffic by the end of the year</b></p><p>Based on the continued upward trend in mobile traffic, I’m predicting that 2018 (instead of 2017) will be the year mobile traffic shoots past 60% of overall Internet traffic. Fingers crossed.</p><p><a href="#20187"></a><b>2018-7: Stable BTC/USD exchanges will emerge as others die off from security-based Darwinism</b></p><p>The meteoric rise in the Bitcoin/USD exchange rate has been accompanied by a drumbeat of stories about stolen Bitcoins and failing exchanges. We believe that in 2018 the entire Bitcoin ecosystem will stabilize.</p><p>This will partly be through security-based Darwinism, as trust in exchanges and wallets that have security problems plummets and those that survive have developed the scale and security to cope with the explosion in Bitcoin transactions and attacks on their services.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[QUIC]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[Mobile]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[TLS]]></category>
            <guid isPermaLink="false">1aBngzXy6aFhIPVDlFBCdH</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[WHOIS going to be at the Grace Hopper Celebration?]]></title>
            <link>https://blog.cloudflare.com/ghc/</link>
            <pubDate>Tue, 03 Oct 2017 10:00:00 GMT</pubDate>
            <description><![CDATA[ Ubuntu us are doing the round trip! It’s time to live - WAN you arrive at GHC, come meet us and say HELO (we love GNU faces, we’ll be very api to meet you). When you’re exhausted like IPv4, git over to the Cloudflare corner to reboot. ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://en.wikipedia.org/wiki/Ubuntu_(operating_system)">Ubuntu</a> us are doing the <a href="https://en.wikipedia.org/wiki/Round-trip_delay_time">round trip</a>! It’s <a href="https://en.wikipedia.org/wiki/Time_to_live">time to live</a> - <a href="https://en.wikipedia.org/wiki/Wide_area_network">WAN</a> you arrive at GHC, come meet us and say <a href="https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol">HELO</a> (we love <a href="https://en.wikipedia.org/wiki/GNU">GNU</a> faces, we’ll be very <a href="https://en.wikipedia.org/wiki/Application_programming_interface">api</a> to meet you). When you’re exhausted like <a href="https://en.wikipedia.org/wiki/IPv4">IPv4</a>, <a href="https://en.wikipedia.org/wiki/Git">git</a> over to the Cloudflare corner to reboot –– we’ll have chargers and Wi-Fi (it’s not a <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol#CONNECTION-ESTABLISHMENT">SYN</a> to <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">REST</a>). <a href="https://en.wikipedia.org/wiki/R_(programming_language)">R</a> booth can be your <a href="https://en.wikipedia.org/wiki/Esc_key">ESC</a>. 
Then Thursday morning <a href="https://www.eventbrite.com/e/grace-hopper-better-together-breakfast-sponsored-by-cloudflare-zendesk-tickets-38404240116">we’re hosting a breakfast</a> <a href="https://en.wikipedia.org/wiki/Bash_(Unix_shell)">bash</a> with Zendesk –– it will be quite the <a href="https://en.wikipedia.org/wiki/Assembly_language">Assembly</a>, you should definitely <a href="https://en.wikipedia.org/wiki/Go_(programming_language)">Go</a>, <a href="https://en.wikipedia.org/wiki/Compiler">compile</a> a bowl of <a href="https://en.wikipedia.org/wiki/Serial_communication">serial</a>, drink a <a href="https://en.wikipedia.org/wiki/Bit">bit</a> of <a href="https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing">CIDR</a> or a cup of <a href="https://en.wikipedia.org/wiki/Tee_(command)">tee</a>.</p><p>I’m also speaking at 1:30PM on Wednesday in OCCC W414 <a href="https://en.wikipedia.org/wiki/Cryptographic_hash_function">hashing</a> out encryption and updates for IoT –– <a href="https://en.wikipedia.org/wiki/Data_Encryption_Standard">DES</a> should be a fun <a href="https://en.wikipedia.org/wiki/Session_(computer_science)">session</a>.</p><p><a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol">ACK!</a> I did <a href="https://en.wikipedia.org/wiki/Network_address_translation">NAT</a> tell you how to find us. <a href="https://en.wikipedia.org/wiki/Checksum">Check for sum</a> women in capes a few <a href="https://en.wikipedia.org/wiki/Hop_(networking)">hops away</a> from the booths with the lava <a href="https://en.wikipedia.org/wiki/LAMP_(software_bundle)">LAMP</a> stack. I'm the one with <a href="https://en.wikipedia.org/wiki/CURL">cURLs</a>.</p><p>In <a href="https://en.wikipedia.org/wiki/D_(programming_language)">D</a> air! Excited to <a href="https://en.wikipedia.org/wiki/Local_area_network">LAN</a>d. <a href="https://en.wikipedia.org/wiki/C_(programming_language)">C</a> you soon.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1tjItgsesKtLjcPjox0PI0/231114e3a4bc152a74ecc2b9f77a0534/gradient-Corgi.png" />
            
            </figure> ]]></content:encoded>
            <category><![CDATA[Events]]></category>
            <category><![CDATA[Grace Hopper]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[Go]]></category>
            <guid isPermaLink="false">6tkfLf5eLUToHtJa1JZTXr</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[How to use Cloudflare for Service Discovery]]></title>
            <link>https://blog.cloudflare.com/service-discovery/</link>
            <pubDate>Fri, 21 Jul 2017 08:01:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare runs 3,588 containers, making up 1,264 apps and services that all need to be able to find and discover each other in order to communicate -- a problem solved with service discovery. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare runs 3,588 containers, making up 1,264 apps and services that all need to be able to find and discover each other in order to communicate -- a problem solved with service discovery.</p><p>You can use Cloudflare for service discovery. By deploying microservices behind Cloudflare, microservices’ origins are masked, secured from DDoS and L7 exploits and authenticated, and service discovery is natively built in. Cloudflare is also cloud platform agnostic, which means that if you have distributed infrastructure deployed across cloud platforms, you still get a holistic view of your services and the ability to manage your security and authentication policies in one place, independent of where services are actually deployed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/461lV3kUQYG0CMPbfHrqTF/d31991eaa6a3d3fed698bddebbb74710/Service-Discovery-Diagram.png" />
            
            </figure>
    <div>
      <h3>How it works</h3>
      <a href="#how-it-works">
        
      </a>
    </div>
    <p>Service locations and metadata are stored in a distributed KV store deployed in all 100+ Cloudflare edge locations (the service registry).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IwFdoeuc1YrVKYY2rS39X/3351746e40fa6143fe8ab0d85259458a/ServiceRegistry.png" />
            
            </figure><p>Services register themselves to the service registry when they start up and deregister themselves when they spin down via a POST to Cloudflare’s API. Services provide data in the form of a DNS record, either by giving Cloudflare the address of the service in an A (IPv4) or AAAA (IPv6) record, or by providing more metadata like transport protocol and port in an SRV record.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5XIaHZx9pShYjAUc6emZcq/df2b9766e24cb0cd62404439c147ee2c/SRV-POST.png" />
            
            </figure><p>Services are also automatically registered and deregistered by health check monitors, so only healthy nodes are sent traffic. Health checks run over HTTP and can be set up with custom configuration requiring that responses to the health check return a specific response body and/or response code; otherwise the nodes are marked as unhealthy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3NrIrLYeW04BqxhC7vXoBd/c7230db77ea6169e57cd9161d5cd410e/healthcheck.png" />
            
            </figure><p>Traffic is distributed evenly between redundant nodes using a load balancer. Clients of the service discovery system query the load balancer directly over <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a>. The load balancer receives data from the service registry and returns the corresponding service address. If services are behind Cloudflare, the load balancer returns a Cloudflare IP address to route traffic to the service through Cloudflare’s L7 proxy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ZhYdGIkVKzDb6xlk9rYRR/baf664081d5ed6da310c41581efb1eb7/loadbalancer.png" />
            
            </figure><p>Traffic can also be sent to specific service nodes based on <a href="https://support.cloudflare.com/hc/en-us/articles/115000540888-Load-Balancing-Geographic-Regions">client geography</a>, so the data replication service in North America, for example, can talk to a specific North American version of the billing service, or European data can stay in Europe.</p><p>Clients query the service registry over DNS, and service location and metadata are packaged in A, AAAA, CNAME or SRV records. The benefit of this is that no additional client software needs to be installed on service nodes beyond a DNS client. Cloudflare works natively over DNS, meaning that if your services have a DNS client, there’s no extra software to install, manage, upgrade or patch.</p><p>Usually, TTLs in DNS mean that if a service location changes or deregisters, clients may still get stale information. Cloudflare DNS keeps TTLs low (it’s able to do this while maintaining <a href="https://www.dnsperf.com/">fast performance</a> because of its distributed network), and if you are using Cloudflare as a proxy, the DNS answers always point back to Cloudflare even when the IPs of services behind Cloudflare change, removing the effect of cache staleness.</p><p>If your services communicate over HTTP/S and websockets, you can additionally use Cloudflare as an L7 proxy for added security, authentication and optimization. Cloudflare <a href="https://www.cloudflare.com/learning/ddos/how-to-prevent-ddos-attacks/">prevents DDoS attacks</a> from hitting your infrastructure, masks your IPs behind its network, and routes traffic through an <a href="https://www.cloudflare.com/argo/">optimized edge PoP to edge PoP route</a> to shave latency off the Internet.</p><p>Once service &lt;--&gt; service traffic is going through Cloudflare, you can use TLS client certificates to <a href="/introducing-tls-client-auth/">authenticate traffic</a> between your services. 
Cloudflare can authenticate traffic at the edge by ensuring that the client certificate presented during the TLS handshake is signed by your root CA.</p>
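<p>As a sketch of what the client side of a lookup might look like: since the registry is ordinary DNS, any stock resolver tooling works, with no agent on the node. The hostname below is the illustrative SRV record from the registration example later in this post, not a real production service.</p>
```shell
# Resolve a service's SRV record; dig +short prints
# "priority weight port target" for each answer.
lookup_service() {
  # $1 = "_service._proto" prefix, $2 = service domain
  dig +short SRV "$1.$2"
}

# Example (illustrative name): lookup_service _http._tcp name.badtortilla.com
```
<p>A client would then connect to the returned target on the returned port, with priority and weight deciding which answer to prefer when several nodes are registered.</p>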
    <div>
      <h3>Setting it up</h3>
      <a href="#setting-it-up">
        
      </a>
    </div>
    <p><a href="https://cloudflare.com/a/sign-up">Sign up for a Cloudflare account</a></p><p>During the signup process, add all your initial services as DNS records in the DNS editor.</p><p>To finish signing up, move DNS to Cloudflare by logging into your <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name-registrar/">registrar</a> and changing your nameservers to the Cloudflare nameservers assigned to you when you signed up for Cloudflare. If you want traffic to those services to be proxied through Cloudflare, click on the cloud next to each DNS record to make it orange.</p><p>Run a script on each node so that:</p><p>On startup, the node sends a <a href="https://api.cloudflare.com/#dns-records-for-a-zone-create-dns-record">POST to the DNS record API</a> to register itself and a <a href="https://api.cloudflare.com/#load-balancer-pools-modify-a-pool">PUT to the load balancing API</a> to add itself to the origin pool.</p><p>On shutdown, the node sends a <a href="https://api.cloudflare.com/#dns-records-for-a-zone-delete-dns-record">DELETE to the DNS record API</a> to deregister itself and a <a href="https://api.cloudflare.com/#load-balancer-pools-modify-a-pool">PUT to the load balancing API</a> to remove itself from the origin pool.</p><p>These can be accomplished via <a href="https://cloud.google.com/compute/docs/startupscript">startup and shutdown scripts on Google Compute Engine</a>, or <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html#user-data-shell-scripts">user data scripts</a> or <a href="http://docs.aws.amazon.com/autoscaling/latest/userguide/lifecycle-hooks.html">auto scaling lifecycle hooks</a> on AWS.</p><p>Registration:</p>
            <pre><code>curl -X POST "https://api.cloudflare.com/client/v4/zones/023e105f4ecef8ad9ca31a8372d0c353/dns_records" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
     -H "Content-Type: application/json" \
     --data '{"type":"SRV","data":{"service":"_http","proto":"_tcp","name":"name","priority":1,"weight":1,"port":80,"target":"staging.badtortilla.com"},"ttl":1,"zone_name":"badtortilla.com","name":"_http._tcp.name.","content":"SRV 1 1 80 staging.badtortilla.com.","proxied":false,"proxiable":false,"priority":1}'
</code></pre>
            <p>De-Registration:</p>
            <pre><code>curl -X DELETE "https://api.cloudflare.com/client/v4/zones/023e105f4ecef8ad9ca31a8372d0c353/dns_records/372e67954025e0ba6aaa6d586b9e0b59" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
     -H "Content-Type: application/json"</code></pre>
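<p>Tying the two calls above together, a node's lifecycle hook might be sketched as below. This is a minimal sketch, not a production script: the zone ID and SRV payload are the placeholder values from the examples above, `CF_EMAIL`/`CF_KEY` are assumed environment variables, and error handling is omitted.</p>
```shell
#!/bin/sh
# Illustrative lifecycle hook: register this node's SRV record on
# startup, delete it on shutdown. IDs and the record payload are the
# placeholder values from the curl examples above.
API="https://api.cloudflare.com/client/v4"
ZONE="023e105f4ecef8ad9ca31a8372d0c353"

cf() {  # small wrapper for authenticated API calls: cf METHOD PATH [curl args...]
  method="$1"; path="$2"; shift 2
  curl -s -X "$method" "$API/$path" \
    -H "X-Auth-Email: $CF_EMAIL" \
    -H "X-Auth-Key: $CF_KEY" \
    -H "Content-Type: application/json" "$@"
}

register() {  # create the SRV record; the API response contains the record ID
  cf POST "zones/$ZONE/dns_records" \
    --data '{"type":"SRV","data":{"service":"_http","proto":"_tcp","name":"name","priority":1,"weight":1,"port":80,"target":"staging.badtortilla.com"},"ttl":1}'
}

deregister() {  # $1 = record ID saved from the registration response
  cf DELETE "zones/$ZONE/dns_records/$1"
}
```
<p>A startup script would call `register` and save the returned record ID; the matching shutdown hook would call `deregister` with that ID, alongside the origin pool PUT shown next.</p>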
            <p>Add or remove an origin from an origin pool (this should be a unique IP per node added to the pool):</p>
            <pre><code>curl -X PUT "https://api.cloudflare.com/client/v4/user/load_balancers/pools/17b5962d775c646f3f9725cbc7a53df4" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
     -H "Content-Type: application/json" \
     --data '{"description":"Primary data center - Provider XYZ","name":"primary-dc-1","enabled":true,"monitor":"f1aba936b94213e5b8dca0c0dbf1f9cc","origins":[{"name":"app-server-1","address":"1.2.3.4","enabled":true}],"notification_email":"someone@example.com"}'</code></pre>
            <p>Create a health check. You can do this <a href="https://api.cloudflare.com/#load-balancer-monitors-create-a-monitor">in the API</a> or in the <a href="https://www.cloudflare.com/a/traffic/">Cloudflare dashboard</a> (in the Load Balancer card).</p>
            <pre><code>curl -X POST "https://api.cloudflare.com/client/v4/organizations/01a7362d577a6c3019a474fd6f485823/load_balancers/monitors" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
     -H "Content-Type: application/json" \
     --data '{"type":"https","description":"Login page monitor","method":"GET","path":"/health","header":{"Host":["example.com"],"X-App-ID":["abc123"]},"timeout":3,"retries":0,"interval":90,"expected_body":"alive","expected_codes":"2xx"}'</code></pre>
            <p>Create an initial load balancer, either <a href="https://api.cloudflare.com/#load-balancers-create-a-load-balancer">through the API</a> or in the <a href="https://www.cloudflare.com/a/traffic/">Cloudflare dashboard</a>.</p>
            <pre><code>curl -X POST "https://api.cloudflare.com/client/v4/zones/699d98642c564d2e855e9661899b7252/load_balancers" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
     -H "Content-Type: application/json" \
     --data '{"description":"Load Balancer for www.example.com","name":"www.example.com","ttl":30,"fallback_pool":"17b5962d775c646f3f9725cbc7a53df4","default_pools":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196","00920f38ce07c2e2f4df50b1f61d4194"],"region_pools":{"WNAM":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"ENAM":["00920f38ce07c2e2f4df50b1f61d4194"]},"pop_pools":{"LAX":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"LHR":["abd90f38ced07c2e2f4df50b1f61d4194","f9138c5d07c2e2f4df57b1f61d4196"],"SJC":["00920f38ce07c2e2f4df50b1f61d4194"]},"proxied":true}'</code></pre>
            <p>(optional) Setup geographic routing rules. You can do this <a href="https://api.cloudflare.com/#load-balancers-modify-a-load-balancer">via API</a> or in the <a href="https://www.cloudflare.com/a/traffic/">Cloudflare dashboard</a>.</p>
            <pre><code>curl -X POST "https://api.cloudflare.com/client/v4/zones/699d98642c564d2e855e9661899b7252/load_balancers" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
     -H "Content-Type: application/json" \
     --data '{"description":"Load Balancer for www.example.com","name":"www.example.com","ttl":30,"fallback_pool":"17b5962d775c646f3f9725cbc7a53df4","default_pools":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196","00920f38ce07c2e2f4df50b1f61d4194"],"region_pools":{"WNAM":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"ENAM":["00920f38ce07c2e2f4df50b1f61d4194"]},"pop_pools":{"LAX":["de90f38ced07c2e2f4df50b1f61d4194","9290f38c5d07c2e2f4df57b1f61d4196"],"LHR":["abd90f38ced07c2e2f4df50b1f61d4194","f9138c5d07c2e2f4df57b1f61d4196"],"SJC":["00920f38ce07c2e2f4df50b1f61d4194"]},"proxied":true}'</code></pre>
            <p>(optional) Setup Argo for faster PoP to PoP transit in the <a href="https://www.cloudflare.com/a/traffic/">traffic app of the Cloudflare dashboard</a>.</p><p>(optional) Setup rate limiting <a href="https://api.cloudflare.com/#rate-limits-for-a-zone-create-a-ratelimit">via API</a> or <a href="https://www.cloudflare.com/a/firewall/">in the dashboard</a></p>
            <pre><code>curl -X POST "https://api.cloudflare.com/client/v4/zones/023e105f4ecef8ad9ca31a8372d0c353/rate_limits" \
     -H "X-Auth-Email: user@example.com" \
     -H "X-Auth-Key: c2547eb745079dac9320b638f5e225cf483cc5cfdda41" \
     -H "Content-Type: application/json" \
     --data '{"id":"372e67954025e0ba6aaa6d586b9e0b59","disabled":false,"description":"Prevent multiple login failures to mitigate brute force attacks","match":{"request":{"methods":["GET","POST"],"schemes":["HTTP","HTTPS"],"url":"*.example.org/path*"},"response":{"status":[401,403],"origin_traffic":true}},"bypass":[{"name":"url","value":"api.example.com/*"}],"threshold":60,"period":900,"action":{"mode":"simulate","timeout":86400,"response":{"content_type":"text/xml","body":"&lt;error&gt;This request has been rate-limited.&lt;/error&gt;"}}}'</code></pre>
            <p>(optional) Setup TLS client authentication. (Enterprise only) Send your account manager your root CA certificate and <a href="https://support.cloudflare.com/hc/en-us/articles/115000088491-Cloudflare-TLS-Client-Auth">which options you would like enabled</a>.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Argo Smart Routing]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <guid isPermaLink="false">6M0qCxRf8vx0pS0XWu7sMU</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[Less Is More - Why The IPv6 Switch Is Missing]]></title>
            <link>https://blog.cloudflare.com/always-on-ipv6/</link>
            <pubDate>Thu, 25 May 2017 17:30:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare we believe in being good to the Internet and good to our customers. By moving on from the legacy world of IPv4-only to the modern-day world where IPv4 and IPv6 are treated equally, we believe we are doing exactly that. ]]></description>
            <content:encoded><![CDATA[ <p>At Cloudflare we believe in being good to the Internet and good to our customers. By moving on from the legacy world of IPv4-only to the modern-day world where IPv4 and IPv6 are treated equally, we believe we are doing exactly that.</p><p><i>"No matter what happens in life, be good to people. Being good to people is a wonderful legacy to leave behind."</i> - Taylor Swift (whose website has been IPv6 enabled for many many years)</p><p>Starting today with free domains, IPv6 is no longer something you can toggle on and off; it’s always just on.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6UqOfN1xF7mt0PpJ1BbVCy/2d645cfd651e7722c6c0790f039a0aa0/before-after-ipv6.png" />
            
            </figure>
    <div>
      <h3>How we got here</h3>
      <a href="#how-we-got-here">
        
      </a>
    </div>
    <p>Cloudflare has always been a gateway for visitors on IPv6 connections to access sites and applications hosted on legacy IPv4-only infrastructure. Connections to Cloudflare are terminated on either IP version and then proxied to the backend over whichever IP version the backend infrastructure can accept.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7w5o9faqmZFjQSgXnYIVIP/49472b8b1f3eebc0682f3d7fd67eb4fb/ipv4-ipv6-translation-gateway.png" />
            
            </figure><p>That means that a v6-only mobile phone (looking at you, T-Mobile users) can establish a clean path to any site or mobile app behind Cloudflare instead of doing an expensive 464XLAT protocol translation as part of the connection (shaving milliseconds and conserving very precious battery life).</p><p>That IPv6 gateway is set by a simple toggle that for a while now has been default-on. And to make up for the time lost before the toggle was default on, in August 2016 we went back retroactively and enabled IPv6 for those millions of domains that joined before IPv6 was the default. Over those next few months, we <a href="/98-percent-ipv6/">enabled IPv6 for nearly four million domains</a> –– you can see Cloudflare’s <a href="https://www.vyncke.org/ipv6status/plotsite.php?metric=w&amp;global=legacy&amp;pct=y">dent in the IPv6 universe</a> below –– and by the time we were done, 98.1% of all of our domains had IPv6 connectivity.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1yqnG0ahmUyEnfeUjLrqOi/083db64874e3322e8080e0c44183722e/plotsite.png" />
            
            </figure><p>As an interim step, we added an extra feature –– when you turn off IPv6 in our dashboard, we remind you just how archaic we think that is.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3oXyg84nb4l00KrDpmbrez/e7828711f44a4cf0fe78d16277d912be/modal.png" />
            
            </figure><p>With close to 100% IPv6 enablement, it no longer makes sense to offer an IPv6 toggle. Instead, Cloudflare is offering IPv6 always on, with no off-switch. We’re starting with free domains, and over time we’ll change the toggle on the rest of Cloudflare paid-plan domains.</p>
    <div>
      <h3>The Future: How Cloudflare and OpenDNS are working together to make IPv6 even faster and more globally deployed</h3>
      <a href="#the-future-how-cloudflare-and-opendns-are-working-together-to-make-ipv6-even-faster-and-more-globally-deployed">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6NjiPMVo4MsctGZW2P7RvI/b9dc657c93e35a5f7955b1378f1da61c/logos.png" />
            
            </figure><p>In November <a href="/98-percent-ipv6/">we published stats about the IPv6 usage</a> we see on the Cloudflare network in an attempt to answer who and what is pushing IPv6. The top operating systems by percentage of IPv6 traffic are iOS, ChromeOS, and macOS, in that order. These operating systems push significantly more IPv6 traffic than their peers because they use an address selection algorithm called Happy Eyeballs. Happy Eyeballs opportunistically chooses IPv6 when available by doing two <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> lookups –– one for an IPv6 address (stored in the DNS AAAA record - pronounced quad-A) and one for the IPv4 address (stored in the DNS A record). Both DNS queries fly over the Internet at the same time, and the client chooses the address that comes back first. The client even gives IPv6 a few milliseconds of head start (iOS and macOS give IPv6 lookups a 25ms head start, for example) so that IPv6 may be chosen more often. This works and has fueled some of IPv6’s growth. But it has fallen short of the goal of a 100% IPv6 world.</p>
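            <p>The race described above can be sketched in a few lines of Python. This is an illustrative model only, not how any operating system actually implements Happy Eyeballs; the resolver callbacks are stand-ins, and the 25ms default mirrors the iOS/macOS head start:</p>
            <pre><code>import concurrent.futures
import time

def happy_eyeballs(resolve_aaaa, resolve_a, head_start=0.025):
    """Race the AAAA lookup against the A lookup, giving IPv6 a
    head start; the first successful answer wins."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(resolve_aaaa): "IPv6"}
        time.sleep(head_start)                  # IPv6 head start
        futures[pool.submit(resolve_a)] = "IPv4"
        for fut in concurrent.futures.as_completed(futures):
            try:
                return futures[fut], fut.result()
            except Exception:
                continue                        # that family failed; use the other
    raise OSError("no usable address")</code></pre>
            <p>With a fast AAAA answer the client ends up on IPv6; if the AAAA lookup fails or is slow, the A answer wins and the connection falls back to IPv4.</p>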
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/27vp45uqt2lIMVqjzndv86/f6136dbc91d329894a131433df044bfb/A-AAAA.png" />
            
            </figure><p>While there are perfectly good historical reasons why IPv6 and IPv4 addresses are stored in separate DNS record types, today clients are IP version agnostic, and it no longer makes sense to require two separate round trips to learn what addresses are available to fetch a resource from.</p><p>Alongside OpenDNS, we are testing a new idea: what if you could ask for all the addresses in just one DNS query?</p><p>With OpenDNS, we are prototyping and testing just that –– a new DNS metatype that returns all available addresses in one DNS answer –– A records and AAAA records in one response. (A metatype is a query type in DNS that end users can’t add to their DNS zone file; it’s assembled dynamically by the authoritative nameserver.)</p><p>What this means is that in the future, if a client like an iPhone wants to access a mobile app that uses Cloudflare DNS or another DNS provider that supports the spec, the iPhone DNS client would only need to do one DNS lookup to find where the app’s API server is located, cutting the number of necessary round trips in half.</p><p>This reduces the amount of bandwidth on the DNS system and pre-populates global DNS caches with IPv6 addresses, making IPv6 lookups faster in the future. As a side benefit, Happy Eyeballs clients prefer IPv6 when they can get the address quickly, which increases the amount of IPv6 traffic that flows through the Internet.</p><p>We have the metaquery working in code with the reserved TYPE65535 querytype. You can ask a Cloudflare nameserver for TYPE65535 of any domain on Cloudflare and get back all available addresses for that name.</p>
            <pre><code>$ dig cloudflare.com @ns1.cloudflare.com -t TYPE65535 +short
198.41.215.162
198.41.214.162
2400:cb00:2048:1::c629:d6a2
2400:cb00:2048:1::c629:d7a2
$</code></pre>
            <p>Did we mention Taylor Swift earlier?</p>
            <pre><code>$ dig taylorswift.com @ns1.cloudflare.com -t TYPE65535 +short
104.16.193.61
104.16.194.61
104.16.191.61
104.16.192.61
104.16.195.61
2400:cb00:2048:1::6810:c33d
2400:cb00:2048:1::6810:c13d
2400:cb00:2048:1::6810:bf3d
2400:cb00:2048:1::6810:c23d
2400:cb00:2048:1::6810:c03d
$</code></pre>
            <p>We believe in proving concepts in code and through the <a href="https://ietf.org/">IETF</a> standards process. We’re currently working on an experiment with OpenDNS and will translate our learnings to an Internet Draft we will submit to the IETF to become an RFC. We’re sure this is just the beginning to faster, better deployed IPv6.</p> ]]></content:encoded>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[OpenDNS]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <guid isPermaLink="false">6GuglMyL4s8AnqoL9NOliF</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[Supporting the transition to IPv6-only networking services for iOS]]></title>
            <link>https://blog.cloudflare.com/supporting-the-transition-to-ipv6-only-networking-services-for-ios/</link>
            <pubDate>Tue, 07 Jun 2016 18:55:29 GMT</pubDate>
            <description><![CDATA[ Early last month Apple announced that all apps submitted to the App Store from June 1 forward would need to support IPv6-only networking as Apple transitions to IPv6-only network services in iOS 9.  ]]></description>
            <content:encoded><![CDATA[ <p>Early last month Apple announced that all apps submitted to the App Store from June 1 forward would need to support IPv6-only networking as Apple transitions to IPv6-only network services in iOS 9. <a href="https://developer.apple.com/news/?id=05042016a">Apple reports</a> that “Most apps will not require any changes”, as these existing apps support IPv6 through Apple's <a href="https://developer.apple.com/library/ios/documentation/Foundation/Reference/NSURLSession_class/">NSURLSession</a> and <a href="https://developer.apple.com/library/mac/documentation/Networking/Conceptual/CFNetwork/Introduction/Introduction.html">CFNetwork</a> APIs.</p><p>Our goal with IPv6, and any other emerging networking technology, is to make it ridiculously easy for our customers to make the transition. Over two years ago, we published <a href="/eliminating-the-last-reasons-to-not-enable-ipv6/">Eliminating the last reasons to not enable IPv6</a> in celebration of World IPv6 Day. CloudFlare has been offering full IPv6 support as well as our IPv6-to-IPv4 gateway to all of our customers since 2012.</p>
    <div>
      <h2>Why is the transition happening?</h2>
      <a href="#why-is-the-transition-happening">
        
      </a>
    </div>
    <p>IPv4 represents a technical limitation, a hard stop to the number of devices that can access the Internet. When the Internet Protocol (IP) was first introduced by Vint Cerf and Bob Kahn in the late 1970s, Internet Protocol Version 4 (IPv4) used a 32-bit (four-byte) number, allowing about 4 billion unique addresses. At the time, IPv4 seemed more than sufficient to power the World Wide Web. On January 31, 2011, the top-level pool of Internet Assigned Numbers Authority (IANA) IPv4 addresses was officially exhausted. On September 24th 2015, the American Registry for Internet Numbers (ARIN) officially ran out of IPv4 addresses.</p><p>It was clear well before 2011 that 4 billion addresses would not be nearly enough for all of the people in the world—let alone all of the phones, tablets, TVs, cars, watches, and the upcoming slew of devices that would need to access the Internet. In 1998, the Internet Engineering Task Force (IETF) had formalized IPv4’s successor, IPv6. IPv6 uses a 128-bit address, which theoretically allows for approximately 340 trillion trillion trillion or 340,000,000,000,000,000,000,000,000,000,000,000,000 unique addresses.</p><p>So here we are, nearly 20 years after IPv6 was formalized, and… we’re at a little over <a href="https://www.google.com/intl/en/ipv6/statistics.html">10% IPv6 adoption</a>. :/</p>
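    <p>The address-count arithmetic above is easy to verify; a quick Python check of the figures quoted (about 4 billion for 32-bit IPv4, roughly 3.4 x 10^38 for 128-bit IPv6):</p>
            <pre><code># 32-bit IPv4: about 4 billion unique addresses
assert 2 ** 32 == 4_294_967_296

# 128-bit IPv6: about 3.4 x 10**38 unique addresses
assert 2 ** 128 == 340_282_366_920_938_463_463_374_607_431_768_211_456</code></pre>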
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7auUBz1f5QI5zQ6RshjzIQ/bc922b95265c50403926469f23493fcf/ipv6-adoption.png" />
            
            </figure><p>(<a href="http://www.google.com/intl/en/ipv6/statistics.html">source</a>)</p><p>The good news, as the graph above indicates, is that the rate of IPv6 adoption has increased significantly; last year being a record year with a 4% increase, according to Google. The way Google derives these numbers is by having a small number of their users execute JavaScript code that tests whether a computer can load URLs over IPv6.</p>
    <div>
      <h2>What’s up with the delay?</h2>
      <a href="#whats-up-with-the-delay">
        
      </a>
    </div>
    <p>Transitioning from IPv4 to IPv6 is a complex thing to execute on a global scale. When data gets sent through the Internet, packets of data are sent using the IP protocol. Within each packet you have a number of elements besides the payload (the actual data you’re sending), including the source and destination.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3QaaHehStrAqmqeUfDkgdF/23ebb39301d30906d76a564757e031e0/IPv4-Header.png" />
            
            </figure><p>In order for the transmission to get to its destination, each device passing it along (clients/servers, routers, firewalls, load balancers, etc) needs to be able to communicate with the other. Traditionally, these devices communicate with IPv4, the universal language on the Internet. IPv6 represents an entirely new language. In order for all of these devices to be able to communicate, they all need to talk IPv6 or have some sort of translator involved.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ldIqoD4iT6MrFRJEUVMaP/75e3c62e41ded36b7271ddcea3e55a9c/IPv6-Header.png" />
            
            </figure><p>Translation requires technologies such as NAT64 and DNS64. NAT64 allows IPv6 hosts to communicate with IPv4 servers by creating a NAT mapping between the IPv6 and the IPv4 address. DNS64 will synthesize AAAA records from A records. DNS64 has some known issues like DNSSEC validation failure (because the DNS server doing the translation is not the owner’s domain server). For service providers, supporting IPv4/IPv6 translation means providing separate IPv4 and IPv6 connectivity, thus incurring additional complexity as well as <a href="/amazon-2bn-ipv4-tax-how-avoid-paying">additional operational and administrative costs</a>.</p>
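            <p>As a rough illustration of what DNS64 synthesis does, here is how an A record's IPv4 address can be embedded in the well-known /96 NAT64 prefix 64:ff9b:: from RFC 6052. This is a sketch of the mapping only, not any particular resolver's implementation:</p>
            <pre><code>import ipaddress

def synthesize_aaaa(ipv4_answer, nat64_prefix="64:ff9b::"):
    """Embed an A record's IPv4 address in a /96 NAT64 prefix,
    producing the AAAA answer a DNS64 resolver would hand back."""
    v4 = int(ipaddress.IPv4Address(ipv4_answer))
    v6 = int(ipaddress.IPv6Address(nat64_prefix)) | v4
    return str(ipaddress.IPv6Address(v6))

# RFC 6052's own example mapping:
print(synthesize_aaaa("192.0.2.33"))   # 64:ff9b::c000:221</code></pre>
            <p>An IPv6-only client connects to the synthesized address, and the NAT64 gateway recovers the original IPv4 destination from the low 32 bits.</p>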
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6n6te48e8rLh9NVwIdUqS1/fdc1bd1eb37eb8bfc7d73811e3038a31/IPv6.png" />
            
            </figure>
    <div>
      <h2>Making the move</h2>
      <a href="#making-the-move">
        
      </a>
    </div>
    <p>Using IP literals (hard-coded IP addresses) in your code is a common pitfall in meeting Apple's IPv6 support requirement. Developers should check their configuration files for any IP literals and replace them with domain names. Literals should not be embedded in protocols either. Although literals may seem unavoidable when using certain low-level APIs in communications protocols like SIP, WebSockets, or P2P, Apple offers <a href="https://developer.apple.com/library/mac/documentation/NetworkingInternetWeb/Conceptual/NetworkingOverview/UnderstandingandPreparingfortheIPv6Transition/UnderstandingandPreparingfortheIPv6Transition.html#//apple_ref/doc/uid/TP40010220-CH213-SW13">high-level networking frameworks</a> that are easy to implement and less error-prone.</p><p>Stay away from network preflighting and instead simply attempt to connect to a network resource and gracefully handle failures. Preflighting often attempts to check for Internet connectivity by passing IP addresses to network <a href="https://developer.apple.com/library/mac/documentation/SystemConfiguration/Reference/SCNetworkReachabilityRef/index.html#//apple_ref/doc/uid/TP40007260">reachability APIs</a>. This is bad practice both in terms of introducing IP literals in your code and misusing reachability APIs.</p><p>For iOS developers, it’s important to review <a href="https://developer.apple.com/library/mac/documentation/NetworkingInternetWeb/Conceptual/NetworkingOverview/UnderstandingandPreparingfortheIPv6Transition/UnderstandingandPreparingfortheIPv6Transition.html">Supporting IPv6 DNS64/NAT64 Networks</a> to ensure code compatibility. 
Within Apple’s documentation you’ll find a list of IPv4-specific APIs that need to be eliminated, IPv6 equivalents for IPv4 types, system APIs that can synthesize IPv6 addresses, as well as how to set up a local IPv6 DNS64/NAT64 network so you can regularly test for IPv6 DNS64/NAT64 compatibility.</p><p>CloudFlare offers a number of IPv6 features that developers can take advantage of during their migration. If your domain is running through CloudFlare, enabling IPv6 support is as simple as enabling IPv6 Compatibility in the dashboard.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4T5GBCooyN3nfHgu3lfLvM/c033cbedb4f3f4f497fcd2f60811ebf9/turning-on-ipv6.png" />
            
            </figure><p>Certain legacy IPv4 applications may be able to take advantage of CloudFlare's Pseudo IPv4. Pseudo IPv4 works by adding an HTTP header containing a "pseudo" IPv4 address to requests established over IPv6. Using a hashing algorithm that always produces the same output for the same input, Pseudo IPv4 creates a <a href="https://en.wikipedia.org/wiki/Classful_network">Class E</a> IPv4 address; the same IPv6 address will always result in the same Pseudo IPv4 address. Using the Class E IP space, we have access to 268,435,456 possible unique IPv4 addresses.</p><p>Pseudo IPv4 offers two options: Add Header or Overwrite Headers. Add Header will automatically add a header (Cf-Pseudo-IPv4) that can be parsed by software as needed. Overwrite Headers will overwrite the existing Cf-Connecting-IP and X-Forwarded-For headers with a Pseudo IPv4 address. The overwrite option has the advantage (in most cases) of not requiring any software changes. If you choose the overwrite option, we'll append a new header (Cf-Connecting-IPv6) in order to ensure you can still find the actual connecting IP address for debugging.</p>
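            <p>To make the idea concrete, here is a toy version of such a deterministic mapping. This is not CloudFlare's actual hashing algorithm, just an illustration of hashing an IPv6 address into the 268,435,456-address Class E space (240.0.0.0/4):</p>
            <pre><code>import hashlib
import ipaddress

def pseudo_ipv4(ipv6_addr):
    """Deterministically hash an IPv6 address into Class E space
    (240.0.0.0/4). Illustrative only -- not the real algorithm."""
    digest = hashlib.sha256(ipaddress.IPv6Address(ipv6_addr).packed).digest()
    host = int.from_bytes(digest[:4], "big") &amp; 0x0FFFFFFF  # keep 28 bits
    return str(ipaddress.IPv4Address(0xF0000000 | host))

# The same IPv6 address always maps to the same pseudo address:
assert pseudo_ipv4("2400:cb00::1") == pseudo_ipv4("2400:cb00::1")</code></pre>
            <p>Because the mapping is deterministic, software that keys on client IP (rate limiting, logging, abuse tracking) keeps working without understanding IPv6.</p>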
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7mrxwvi0fTQXYOvzaOSddj/30e338e4a96449dc6b8f0311f6170b98/pseudo-ipv4.png" />
            
            </figure><p>For iOS developers under the gun to make the transition to IPv6, there are benefits to the move apart from compliance with Apple’s policy. Beyond the inherent security benefits of IPv6 like mandatory IPSec, many companies have seen performance gains as well. Real User Measurement studies conducted by <a href="https://code.facebook.com/posts/1192894270727351/ipv6-it-s-time-to-get-on-board/">Facebook</a> show IPv6 made their site 10-15 percent faster and <a href="https://www.youtube.com/watch?v=FUtG89C8h_A">LinkedIn</a> realized 40 percent gains on select mobile networks in Europe.</p><p>For domains currently running through CloudFlare, since we currently do not enable IPv6 by default you’ll want to go into your account and make sure IPv6 Compatibility is enabled under the Network tab. CloudFlare has been offering rock solid IPv6 since 2012 with one-click IPv6 provisioning, an IPv4-to-IPv6 translation gateway, Pseudo IPv4, and much more. For more information, be sure to check out our IPv6 page: <a href="https://www.cloudflare.com/ipv6/">https://www.cloudflare.com/ipv6/</a></p> ]]></content:encoded>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[iOS]]></category>
            <category><![CDATA[Mobile]]></category>
            <guid isPermaLink="false">3lLu53EgnErBiBadIuef0s</guid>
            <dc:creator>Dragos Bogdan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Test all the things: IPv6, HTTP/2, SHA-2]]></title>
            <link>https://blog.cloudflare.com/test-all-the-things-ipv6-http2-sha-2/</link>
            <pubDate>Wed, 02 Sep 2015 10:15:44 GMT</pubDate>
            <description><![CDATA[ CloudFlare constantly tries to stay on the leading edge of Internet technologies so that our customers' web sites use the latest, fastest, most secure protocols. For example, in the past we've enabled IPv6 and SPDY/3.1. ]]></description>
            <content:encoded><![CDATA[ <p><b>NOTE: The </b><a href="https://http2.cloudflare.com"><b>https://http2.cloudflare.com</b></a><b> site has been shut down. This blog was posted in 2015 and subsequently we took all the technologies here and made them part of our core product.</b></p><p>CloudFlare constantly tries to stay on the leading edge of Internet technologies so that our customers' web sites use the latest, fastest, most secure protocols. For example, in the past we've enabled <a href="https://www.cloudflare.com/ipv6">IPv6</a> and <a href="/staying-up-to-date-with-the-latest-protocols-spdy-3-1/">SPDY/3.1</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3mXGypzyBdMO90CpkHFyIr/c72a20a0df3300f5be24b4a41eca0492/1490956.jpg" />
            
            </figure><p>Today we've switched on a test server that is open for people to test compatibility of web clients. It's a mirror of this blog and is served from <a href="https://http2.cloudflare.com/">https://http2.cloudflare.com/</a>. The server uses three technologies that it may be helpful to test with: IPv4/IPv6, HTTP/2 and an SSL certificate that uses SHA-2 for its signature.</p><p>The server has both IPv4 and IPv6 addresses.</p>
            <pre><code>$ dig +short http2.cloudflare.com A
45.55.83.207
$ dig +short http2.cloudflare.com AAAA
2604:a880:800:10:5ca1:ab1e:f4:e001</code></pre>
            <p>The certificate is based on SHA-2 (in this case SHA-256). This is important because SHA-1 is being deprecated by some browsers <a href="https://blog.filippo.io/the-unofficial-chrome-sha1-faq/">very soon</a>. On a recent browser the connection will also be secured using ECDHE (for <a href="/staying-on-top-of-tls-attacks/">forward secrecy</a>).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4ln91gm5eQAL1msGVkURHp/906918fabce930d0f9d0db264b49770d/Screen-Shot-2015-08-21-at-16-24-37.png" />
            
            </figure><p>And, finally, the server uses HTTP/2 if the browser is capable. For example, in Google Chrome, with the <a href="https://chrome.google.com/webstore/detail/http2-and-spdy-indicator/mpbpobfflnpcgagjijhmgnchggcjblin">HTTP/2 and SPDY indicator</a> extension the blue lightning bolt indicates that the page was served using HTTP/2:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3n6g7gcgyLEALV7dx3ZqTd/631eae6397babc849ce829457405c675/Screen-Shot-2015-08-21-at-16-23-11.png" />
            
            </figure><p>This server isn't on the normal CloudFlare network and is intended for testing purposes only. We'll endeavor to keep it online so that people have an HTTP/2 endpoint to test clients against.</p><p>In Google Chrome's <a>net-internals</a> view you can see the HTTP/2 session in use when connecting to the site.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7A2LQeZ6ZGPQSXQ93XcsBG/3930ce4e4410c5d23198b30e37b7e302/Screen-Shot-2015-09-02-at-09-35-51.png" />
            
            </figure><p>We hope that this will prove useful for people testing HTTP/2-compatible client software. Let us know how it goes.</p> ]]></content:encoded>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">4FuCyGJPqzq2MiAAMnhbJV</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Four years later and CloudFlare is still doing IPv6 automatically]]></title>
            <link>https://blog.cloudflare.com/four-years-later-and-cloudflare-is-still-doing-ipv6-automatically/</link>
            <pubDate>Fri, 05 Jun 2015 18:42:40 GMT</pubDate>
            <description><![CDATA[ Over the past four years CloudFlare has helped well over two million websites join the modern web, making us one of the fastest growing providers of IPv6 web connectivity on the Internet.  ]]></description>
            <content:encoded><![CDATA[ <p>Over the past four years CloudFlare has helped well over two million websites join the modern web, making us one of the fastest growing providers of IPv6 web connectivity on the Internet. CloudFlare's <a href="/introducing-cloudflares-automatic-ipv6-gatewa/">Automatic IPv6 Gateway</a> allows IPv4-only websites to support IPv6-only clients with zero clicks. No hardware. No software. No code changes. And no need to change your hosting provider.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2V14CHrnFNob3rp8bKgyOh/eb4377b774b73ab7b8dee684b13292a6/Screen-Shot-2015-06-05-at-10-07-51-AM.png" />
            
            </figure><p><a href="http://cs.brown.edu/~adf/cerf/Cerf_ipv6_poster.pdf">Image</a> by <a href="http://cs.brown.edu/~adf/">Andrew D. Ferguson</a></p>
    <div>
      <h3>A Four Year Story</h3>
      <a href="#a-four-year-story">
        
      </a>
    </div>
    <p>The story of IPv6 support for customers of CloudFlare is about as long as the story of CloudFlare itself. June 6th, 2011 (four years ago) was the original World IPv6 Day, and CloudFlare participated. Each year since, the global Internet community has pushed forward with additional IPv6 deployment. Now, four years later, CloudFlare is celebrating June 6th knowing that our customers are being provided with a solid IPv6 offering that requires zero configuration to enable. CloudFlare is the only <a href="https://www.cloudflare.com/network-map">global CDN</a> that provides IPv4/IPv6 delivery of content by default and at scale.</p><p>IPv6 has been <a href="/eliminating-the-last-reasons-to-not-enable-ipv6/">featured</a> <a href="/three-years-after-world-ipv6-day/">in</a> <a href="/introducing-cloudflares-automatic-ipv6-gatewa/">our</a> <a href="/ipv6-challenge-to-the-web/">blog</a> various times over the last four years. We have provided support for legacy logging systems to handle IPv6 addresses, provided <a href="https://www.cloudflare.com/ddos">DDoS protection</a> on IPv6 alongside classic IPv4 address space, and provided the open source community with software to handle <a href="/path-mtu-discovery-in-practice/">ECMP load balancing</a> correctly.</p>
    <div>
      <h3>It's all about the numbers</h3>
      <a href="#its-all-about-the-numbers">
        
      </a>
    </div>
    <p>Today CloudFlare is celebrating four years of World IPv6 Day with a few new graphs:</p><p>First, let's measure IPv6 addresses (broken down by /64 blocks).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3g911dZxFjbTq0baNqsATc/41f4610aacbb4dcd3e9a2f2b6f29d1f6/Growth-of-IPv6-IPs-on-CloudFlare-over-the-last-year-2.png" />
            
            </figure><p>While CloudFlare operates the CDN portion of web traffic delivery to end-users, we don’t control the end-user's deployment of IPv6. We do, however, see IP addresses when they are used, and we clearly see an uptick in deployed IPv6 at the end-user sites. Measuring unique IP addresses is a good indicator of IPv6 deployment.</p><p>Next we can look at how end users, using IPv6, access CloudFlare customers' websites.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5XxlbrdqfSNuD7inHmj6rQ/9757467e882831bb824510eb68d56385/Percentage-of-IPv6-Traffic-by-Device-as-seen-by-CloudFlare.png" />
            
            </figure><p>With nearly 25% of our IPv6 traffic being delivered to mobile devices as of today, CloudFlare is happy to help demonstrate that mobile operators have embraced IPv6 with gusto.</p><p>The third graph looks at traffic based on destination country and is a measure of end-users (eyeballs as they are called in the industry) that are enabled for IPv6. The graph shows the top countries that CloudFlare delivers IPv6 web traffic to.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5YPKZmpdawhtnBYDIoJL1S/77a9aa36ec0ddca87e289f431663f53f/IPv6-as-percentage-of-all-traffic-by-country.png" />
            
            </figure><p>On the Y axis, we have the percentage of traffic delivered via IPv6. On the X axis, we have each of the last eight months. Let's just have a shoutout to Latin America. In Latin America CloudFlare operates five data centers: <a href="/buenos-aires/">Argentina</a>, <a href="/parabens-brasil-cloudflares-27th-data-center-now-live/">Brazil</a>, <a href="/bienvenido-a-chile-cloudflares-24th-data-center-now-live/">Chile</a>, <a href="/listo-medellin-colombia-cloudflares-28th-data-center/">Colombia</a>, and <a href="/lima-peru-cloudflares-29th-data-center/">Peru</a>. CloudFlare would like to highlight our Peru datacenter because it has the highest percentage of traffic being delivered over IPv6 in Latin America. Others <a href="https://twitter.com/lacnic/status/596445340905152512">agree</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6zzlrIo11Q7M09KKW2ojDO/b8e43c1b1b43397dc128fd1e5bb1b3be/twitter-lacnic-peru-ipv6.png" />
            
            </figure>
    <div>
      <h3>What's next for IPv6?</h3>
      <a href="#whats-next-for-ipv6">
        
      </a>
    </div>
    <p>The real question is: What's next for users stuck with only IPv4? Because the RIRs have depleted their IPv4 address space pools, ISPs simply have no option left but to embrace IPv6. Basically, <a href="/amazon-2bn-ipv4-tax-how-avoid-paying">there are no more IPv4 addresses available</a>.</p><p>Supporting IPv6 is even more important today than it was four years ago, and CloudFlare is happy to provide this automatic IPv6 solution for all its customers. Come try it out, and don’t let other web properties languish in the non-IPv6 world: tell a friend about our <a href="https://www.cloudflare.com/ipv6">automatic IPv6 support</a>.</p>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[World IPv6 Day]]></category>
            <guid isPermaLink="false">89RoHkcpxlReU78VRppFY</guid>
            <dc:creator>Martin J Levy</dc:creator>
        </item>
        <item>
            <title><![CDATA[Path MTU discovery in practice]]></title>
            <link>https://blog.cloudflare.com/path-mtu-discovery-in-practice/</link>
            <pubDate>Wed, 04 Feb 2015 14:16:31 GMT</pubDate>
            <description><![CDATA[ Last week, a very small number of our users who are using IP tunnels (primarily tunneling IPv6 over IPv4) were unable to access our services because a networking change broke "path MTU discovery" on our servers.  ]]></description>
            <content:encoded><![CDATA[ <p>Last week, a very small number of our users who are using IP tunnels (primarily tunneling IPv6 over IPv4) were unable to access our services because a networking change broke "path MTU discovery" on our servers. In this article, I'll explain what path MTU discovery is, how we broke it, how we fixed it, and the open source code we used.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VOqwclfvRu2ER1ZcOorNo/8744614ee557cb44b7e6604d51737af8/11774786603_c3bfd2d22c_z.jpg" />
            
            </figure><p>via <a href="https://www.flickr.com/photos/balintfoeldesi/">Flickr</a></p>
    <div>
      <h3>First There Was the Fragmentation</h3>
      <a href="#first-there-was-the-fragmentation">
        
      </a>
    </div>
    <p>When a host on the Internet wants to send some data, it must know how to divide the data into packets. In particular, it needs to know the maximum packet size it can send, called the <a href="http://en.wikipedia.org/wiki/Maximum_transmission_unit">Maximum Transmission Unit (MTU)</a>.</p><p>The larger the MTU, the better for performance, but the worse for reliability. This is because a lost packet means more data to retransmit, and because many routers on the Internet can't deliver very large packets.</p><p>The fathers of the Internet assumed that this problem would be solved at the IP layer with IP fragmentation. Unfortunately IP <a href="http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-87-3.pdf">fragmentation has serious disadvantages</a>, and it's avoided in practice.</p>
    <div>
      <h3>Do-not-fragment bit</h3>
      <a href="#do-not-fragment-bit">
        
      </a>
    </div>
    <p>To work around fragmentation problems the IP layer contains a "Don't Fragment" bit on every IP packet. It forbids any router on the path from performing fragmentation.</p><p>So what does a router do if it can't deliver a packet and can't fragment it either?</p><p>That's when the fun begins.</p><p>According to <a href="https://tools.ietf.org/html/rfc1191">RFC1191</a>:</p><blockquote><p>When a router is unable to forward a datagram because it exceeds the MTU of the next-hop network and its Don't Fragment bit is set, the router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating "fragmentation needed and DF set".</p></blockquote><p>So a router must send an <a href="http://www.networksorcery.com/enp/protocol/icmp/msg3.htm">ICMP type 3 code 4 message</a>. If you want to see one, run:</p>
            <pre><code>tcpdump -s0 -p -ni eth0 'icmp and icmp[0] == 3 and icmp[1] == 4'</code></pre>
            <p>This ICMP message is supposed to be delivered to the originating host, which in turn should adjust the MTU setting for that particular connection. This mechanism is called <a href="http://en.wikipedia.org/wiki/Path_MTU_Discovery">Path MTU Discovery</a>.</p><p>In theory, it's great but unfortunately many things can go wrong when delivering ICMP packets. The most common problems are caused by misconfigured firewalls that drop ICMP packets.</p>
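<p>If you're curious what path MTU your kernel currently has cached for a destination, Linux exposes it through the <code>IP_MTU</code> socket option on a connected socket. A minimal sketch (Linux-only; <code>cached_path_mtu</code> is our own helper name, and the numeric fallback assumes the value from the Linux headers):</p>

```python
import socket

# IP_MTU is 14 in the Linux headers; not every Python build exports it.
IP_MTU = getattr(socket, "IP_MTU", 14)

def cached_path_mtu(dst_ip, dst_port=53):
    """Ask the kernel for the path MTU it has cached for a destination.
    Connecting a UDP socket sends no packets; it only binds a route."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((dst_ip, dst_port))
        return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    finally:
        s.close()

print(cached_path_mtu("127.0.0.1"))  # loopback, so the interface MTU
```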
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5MjJtXICfrunkPwdh3NnEB/85952bb1d8c1b11df1cfa26559d101d2/15084150039_b1bed17210_z.jpg" />
            
            </figure><p>via <a href="https://www.flickr.com/photos/gsfc/">Flickr</a></p>
    <div>
      <h3>Another Black Hole</h3>
      <a href="#another-black-hole">
        
      </a>
    </div>
    <p>A situation when a router drops packets but for some reason can't deliver relevant ICMP messages is called <a href="http://en.wikipedia.org/wiki/Black_hole_(networking)">an ICMP black hole</a>.</p><p>When that happens the whole connection gets stuck. The sending side constantly tries to resend lost packets, while the receiving side acknowledges only the small packets that get delivered.</p><p>There are generally three solutions to this problem.</p><p><i>1) Reduce the MTU on the client side.</i></p><p>The network stack hints its MTU in SYN packets when it opens a TCP/IP connection. This is called the MSS TCP option. You can see it in tcpdump, for example on my host (notice "mss 1460"):</p>
            <pre><code>$ sudo tcpdump -s0 -p -ni eth0 '(ip and ip[20+13] &amp; tcp-syn != 0)'
10:24:13.920656 IP 192.168.97.138.55991 &gt; 212.77.100.83.23: Flags [S], seq 711738977, win 29200, options [mss 1460,sackOK,TS val 6279040 ecr 0,nop,wscale 7], length 0</code></pre>
            <p>If you know you are behind a tunnel you might consider advising your operating system to reduce the advertised MTUs. In Linux it's as simple as (notice the <code>advmss 1400</code>):</p>
            <pre><code>$ ip route change default via &lt;&gt; advmss 1400
$ sudo tcpdump -s0 -p -ni eth0 '(ip and ip[20+13] &amp; tcp-syn != 0)'
10:25:48.791956 IP 192.168.97.138.55992 &gt; 212.77.100.83.23: Flags [S], seq 389206299, win 28000, options [mss 1400,sackOK,TS val 6302758 ecr 0,nop,wscale 7], length 0</code></pre>
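<p>The <code>mss</code> values above are plain arithmetic: the MSS is the MTU minus the IP and TCP header overhead. A quick sketch of the relationship (assuming minimal 20-byte IPv4 / fixed 40-byte IPv6 headers and a 20-byte TCP header, with no options):</p>

```python
# MSS = MTU minus the IP and TCP header sizes (no header options assumed).
IPV4_HDR = 20  # minimal IPv4 header
IPV6_HDR = 40  # fixed IPv6 header
TCP_HDR = 20   # minimal TCP header

def mss_for(mtu, family="ipv4"):
    """TCP MSS that fits a segment into a given MTU."""
    return mtu - (IPV4_HDR if family == "ipv4" else IPV6_HDR) - TCP_HDR

print(mss_for(1500))          # 1460 -- the "mss 1460" in the first tcpdump
print(mss_for(1440))          # 1400 -- what "advmss 1400" corresponds to
print(mss_for(1280, "ipv6"))  # 1220 -- at the IPv6 minimum MTU
```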
            <p><i>2) Reduce the MTU of all connections to the minimal value on the server side.</i></p><p>It's a much harder problem to solve on the server side. A server can encounter clients with many different MTUs, and if you can't rely on Path MTU Discovery, you have a real problem. One commonly used workaround is to reduce the MTU for all of the outgoing packets. The minimal required <a href="https://www.ietf.org/rfc/rfc2460.txt">MTU for all IPv6 hosts is 1,280</a>, which is fair. Unfortunately for IPv4 the <a href="http://www.ietf.org/rfc/rfc791.txt">value is 576 bytes</a>. On the other hand <a href="http://www.ietf.org/rfc/rfc4821.txt">RFC4821</a> suggests that it's "probably safe enough" to assume minimal MTU of 1,024. To change the MTU setting you can type:</p>
            <pre><code>ip -f inet6 route replace &lt;&gt; mtu 1280</code></pre>
            <p>(As a side note, you could use <code>advmss</code> instead. The <code>mtu</code> setting is stronger, applying to all packets, not only TCP, while <code>advmss</code> affects only the MSS hint given at the TCP layer.)</p><p>Forcing a reduced packet size is not an optimal solution though.</p><p><i>3) Enable smart MTU black hole detection.</i></p><p><a href="http://www.ietf.org/rfc/rfc4821.txt">RFC4821</a> proposes a mechanism to detect ICMP black holes and adjust the path MTU in a smart way. To enable this on Linux type:</p>
            <pre><code>echo 1 &gt; /proc/sys/net/ipv4/tcp_mtu_probing
echo 1024 &gt; /proc/sys/net/ipv4/tcp_base_mss</code></pre>
            <p>The second setting bumps the starting MSS used in discovery from a miserable default of 512 bytes to an RFC4821 suggested 1,024.</p>
    <div>
      <h3>CloudFlare Architecture</h3>
      <a href="#cloudflare-architecture">
        
      </a>
    </div>
    <p>To understand what happened last week, first we need to make a detour and discuss our architecture. At CloudFlare <a href="/cloudflares-architecture-eliminating-single-p/">we use BGP in two ways</a>: externally, the BGP that organizes the Internet, and internally between servers.</p><p>The internal use of BGP gives us high availability across servers. If a machine goes down, the BGP router notices and automatically pushes the traffic to another server.</p><p>We've used this for a long time, but last week we increased its use in our network as we expanded internal BGP load balancing. The load balancing mechanism is known as <a href="http://en.wikipedia.org/wiki/Equal-cost_multi-path_routing">ECMP: Equal Cost Multi Path routing</a>. From the configuration point of view the change was minimal: it only required adjusting the BGP metrics to equal values; the ECMP-enabled router does the rest of the work.</p><p>But with ECMP the router must be somewhat smart. It can't do true load balancing, as packets from one TCP connection could end up on the wrong server. Instead, an ECMP router <a href="http://www.slideshare.net/SkillFactory/02-34638420">computes a hash over the usual tuple</a> extracted from every packet:</p><ul><li><p>For TCP it hashes the tuple (src ip, src port, dst ip, dst port)</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4TLVX51c8Cwvew4PFs20Ya/7ed16e92fc63d928e15ce35090fa2b1e/ECMP-hashing-TCP-1.png" />
            
            </figure><p>And it uses this hash to choose the destination server. This guarantees that packets from one "flow" will always hit the same server.</p>
    <div>
      <h3>ECMP dislikes ICMP</h3>
      <a href="#ecmp-dislikes-icmp">
        
      </a>
    </div>
    <p>ECMP will indeed forward TCP packets in a session to the appropriate server. Unfortunately, it has no special knowledge of ICMP and it hashes only (src ip, dst ip).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4lHEwmHyDSt3UxvhO1t3uu/bc2abcdfaec06bdfb6fb13fe952a9a87/ECMP-hashing-ICMP.png" />
            
            </figure><p>Here the source IP is most likely the IP of a router on the Internet. In practice, this means the ICMP packets may sometimes be delivered to a different server than the one handling the TCP connection.</p><p>This is exactly what happened in our case.</p>
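<p>To make the mismatch concrete, here is a toy model of ECMP selection (the hash below is an arbitrary stand-in; real routers use their own, typically proprietary, hash functions):</p>

```python
import hashlib

SERVERS = ["server-a", "server-b", "server-c", "server-d"]

def pick_server(fields):
    # Stand-in for the router's ECMP hash over the given header fields.
    digest = hashlib.md5(repr(fields).encode()).digest()
    return SERVERS[digest[0] % len(SERVERS)]

# A TCP flow is hashed on the full 4-tuple, so it consistently lands on
# one server...
flow = ("198.51.100.7", 40000, "203.0.113.1", 443)
# ...but the ICMP reply originates from some router's IP and is hashed
# only on (src ip, dst ip), so it may land on a different server entirely.
icmp = ("192.0.2.254", "203.0.113.1")

print(pick_server(flow), pick_server(icmp))
```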
    <div>
      <h3>Temporary Fix</h3>
      <a href="#temporary-fix">
        
      </a>
    </div>
    <p>A couple of users who were accessing CloudFlare via IPv6 tunnels reported this problem. They were affected because almost every IPv6 tunnel has a reduced MTU.</p><p>As a temporary fix we reduced the MTU for all IPv6 paths to 1,280 (solution #2 above). Many other providers have the same problem, use the same trick on IPv6, and never send IPv6 packets larger than 1,280 bytes.</p><p>For better or worse this problem is not so prevalent on IPv4: fewer people use IPv4 tunneling, and the Path MTU problems have been well understood for much longer. As a temporary fix for IPv4, we deployed RFC4821 path MTU discovery (solution #3 above).</p>
    <div>
      <h3>Our solution: PMTUD</h3>
      <a href="#our-solution-pmtud">
        
      </a>
    </div>
    <p>In the meantime we were working on a comprehensive and reliable solution that restores the real MTUs. The idea is to broadcast ICMP MTU messages to all servers. That way we can guarantee that an ICMP message will reach the server that handles a particular flow, no matter where the ECMP router decides to forward it.</p><p>We're happy to open source our small "Path MTU Daemon" that does the work:</p><ul><li><p><a href="https://github.com/cloudflare/pmtud">https://github.com/cloudflare/pmtud</a></p></li></ul>
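<p>This broadcast approach is viable because an ICMP "fragmentation needed" message embeds the IP header and the first 8 bytes of the packet that triggered it, which is enough to recover the original flow's 4-tuple. A hedged sketch of that extraction (ours, not pmtud's actual code):</p>

```python
import struct
from socket import inet_ntoa

def embedded_flow(icmp_payload):
    """Recover (src ip, src port, dst ip, dst port) from the payload of an
    ICMP type 3 code 4 message: the embedded original IPv4 header plus the
    first 8 bytes of its transport payload."""
    ihl = (icmp_payload[0] & 0x0F) * 4  # embedded IP header length in bytes
    src = inet_ntoa(icmp_payload[12:16])
    dst = inet_ntoa(icmp_payload[16:20])
    sport, dport = struct.unpack("!HH", icmp_payload[ihl:ihl + 4])
    return src, sport, dst, dport

# Build a minimal embedded IPv4 header (20 bytes) plus the first 8 bytes
# of a TCP segment, the way a router would quote them:
ip_hdr = (bytes([0x45, 0])                              # version/IHL, TOS
          + struct.pack("!HHHBBH", 1500, 0, 0x4000, 64, 6, 0)
          + bytes([198, 51, 100, 7])                    # original src IP
          + bytes([203, 0, 113, 1]))                    # original dst IP
tcp_head = struct.pack("!HHI", 40000, 443, 0)           # sport, dport, seq
print(embedded_flow(ip_hdr + tcp_head))  # ('198.51.100.7', 40000, '203.0.113.1', 443)
```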
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>Delivering Path MTU ICMP messages correctly in a network that uses ECMP for internal routing is a surprisingly hard problem. While there is still plenty to improve we're happy to report:</p><ul><li><p>Like many other companies we've reduced the MTU on IPv6 to the safe value of 1,280. We'll consider bumping it to improve performance once we get more confident that our other fixes are working as intended.</p></li><li><p>We're rolling out PMTUD to ensure "proper" routing of ICMP path MTU messages within our datacenters.</p></li><li><p>We're enabling the Linux RFC4821 Path MTU Discovery for IPv4, which should help with true ICMP black holes.</p></li></ul><p>Found this article interesting? <a href="http://www.cloudflare.com/join-our-team">Join our team</a>, including at our <a href="/cloudflare-london-were-hiring">elite office in London</a>. We're constantly looking for talented developers and network engineers!</p>
    <div>
      <h4>Appendix</h4>
      <a href="#appendix">
        
      </a>
    </div>
    <p>To simulate a black hole we used this iptables rule:</p>
            <pre><code>iptables -I INPUT -s &lt;src_ip&gt; -m length --length 1400:1500 -j DROP</code></pre>
            <p>And this <a href="http://www.secdev.org/projects/scapy/">Scapy</a> snippet to simulate MTU ICMP messages:</p>
            <pre><code>#!/usr/bin/python
from scapy.all import *

def callback(pkt):
    pkt2 = IP(dst=pkt[IP].src) / \
           ICMP(type=3,code=4,unused=1280) / \
           str(pkt[IP])[:28]
    pkt2.show()
    send(pkt2)

if __name__ == '__main__':
    sniff(prn=callback,
          store=0,
          filter="ip and src &lt;src_ip&gt; and greater 1400")</code></pre>
             ]]></content:encoded>
            <category><![CDATA[BGP]]></category>
            <category><![CDATA[IPv6]]></category>
            <category><![CDATA[IPv4]]></category>
            <guid isPermaLink="false">20u6JbwZF2dAcXP8MtehK3</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[DDoS Packet Forensics: Take me to the hex!]]></title>
            <link>https://blog.cloudflare.com/ddos-packet-forensics-take-me-to-the-hex/</link>
            <pubDate>Tue, 06 Jan 2015 23:10:34 GMT</pubDate>
            <description><![CDATA[ A few days ago, my colleague Marek sent an email about a DDoS attack against one of our DNS servers that we'd been blocking with our BPF rules. ]]></description>
            <content:encoded><![CDATA[ <p>A few days ago, my colleague Marek sent an email about a DDoS attack against one of our DNS servers that we'd been blocking with our <a href="/bpf-the-forgotten-bytecode/">BPF rules</a>. He noticed that there seemed to be a strange correlation between the TTL field in the IP header and the IPv4 source address.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6g3lAZht4S4HBqWLjL3y33/9a9631d741d3a9b895d1ad50e39bd241/30957385_bfb4b7fb79_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0">CC BY 2.0</a> <a href="https://www.flickr.com/photos/adactio/30957385/">image</a> by <a href="https://www.flickr.com/photos/adactio/">Jeremy Keith</a></p><p>The source address was being spoofed, as usual, and apparently chosen randomly, but something else was going on. He offered a bottle of Scotch to the first person to come up with a satisfactory solution.</p><p>Here's what some of the packets looked like:</p>
            <pre><code>$ tcpdump -ni eth0 -c 10 "ip[8]=40 and udp and port 53"
1.181.207.7.46337 &gt; x.x.x.x.53: 65098+
1.178.97.141.45569 &gt; x.x.x.x.53: 65101+
1.248.136.142.63489 &gt; x.x.x.x.53: 65031+
1.207.241.195.52993 &gt; x.x.x.x.53: 65072+

$ tcpdump -ni eth0 -c 10 "ip[8]=41 and udp and port 53"
2.10.30.2.2562 &gt; x.x.x.x.53: 65013+
2.4.9.36.1026 &gt; x.x.x.x.53: 65019+
2.98.1.99.25090 &gt; x.x.x.x.53: 64925+
2.109.69.229.27906 &gt; x.x.x.x.53: 64914+

$ tcpdump -ni eth0 -c 10 "ip[8]=42 and udp and port 53"
4.72.42.184.18436 &gt; x.x.x.x.53: 64439+
4.240.78.0.61444 &gt; x.x.x.x.53: 64271+
5.73.44.84.18693 &gt; x.x.x.x.53: 64182+
4.69.99.10.17668 &gt; x.x.x.x.53: 64442+</code></pre>
            <p>I've removed the destination IP address, but left in the DNS ID number (the number with the plus after it), the spoofed source IP, and the port. Filtering on ip[8] (the TTL byte of the IP header) shows three different TTLs: 40, 41, and 42.</p><p>Stop reading here if you want to go figure this out for yourself.</p>
    <div>
      <h3>Into the hex</h3>
      <a href="#into-the-hex">
        
      </a>
    </div>
    <p>I couldn't resist Marek's challenge, so I did what I always do with anything involving packets: I went straight for the hex. Since Marek hadn't given me a pcap, I manually converted the first few to hex.</p><p>The reason hex is useful is that bytes, words, dwords, and qwords are the real stuff of computing and looking at decimal obscures data.</p><p>Taking the first three (one for each TTL), I saw this pattern:</p>
            <pre><code>1.181.207.7.46337
2.10.30.2.2562
4.72.42.184.18436

Converted to hex they are

01.b5.cf.07.b501
02.0a.1e.02.0a02
04.48.2a.b8.4804</code></pre>
            <p>It's immediately obvious that the 'random' source port is the first two bytes of the random IP source address reversed: <code>01.b5.cf.07</code> has source port <code>b501</code>.</p><p>A little bit of tinkering revealed a relationship between the TTL and the first byte of the IP address.</p>
            <pre><code>TTL = (first byte &gt;&gt; 1) + 40</code></pre>
            <p>This relationship was confirmed by a later packet that had a TTL of 150 and a source IP of 220.255.141.181. The shift right also explained why the same TTL was used when the first byte differed by one (see the third group above for an example: 4.x.x.x and 5.x.x.x have the same TTL).</p><p>I then spotted that the DNS ID field was also related to the IP address.</p>
            <pre><code>1.181.207.7.46337 &gt; x.x.x.x.53: 65098+

Converted to hex:

01.b5.cf.07.b501 &gt; x.x.x.x.35: fe4a+</code></pre>
            <p>It's probably not obvious at first glance (unless you're a hex-level hacker), but fe4a is the one's complement of 01b5. So the DNS ID field was simply the one's complement of the first two bytes of the source IP.</p><p>Finally, Marek gave me a pcap, and I had one more relationship to find. The ID value in the IPv4 header was also related to the random source IP address—in fact, it was just the first two bytes of the IP address.</p>
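<p>All of these relationships can be checked mechanically. A small sketch over the packets quoted above (<code>check</code> is our own helper):</p>

```python
def check(ip_str, sport, ttl, dns_id):
    b = [int(x) for x in ip_str.split(".")]
    first_word = (b[0] << 8) | b[1]       # first two bytes of the source IP
    assert sport == (b[1] << 8) | b[0]    # source port: first two bytes swapped
    assert ttl == (b[0] >> 1) + 40        # TTL = (first byte >> 1) + 40
    assert dns_id == first_word ^ 0xFFFF  # DNS ID: one's complement
    # (and the IPv4 header ID field is simply first_word itself)

# Packets from the tcpdump output above:
check("1.181.207.7", 46337, 40, 65098)
check("2.10.30.2", 2562, 41, 65013)
check("4.72.42.184", 18436, 42, 64439)
print("all relationships hold")
```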
    <div>
      <h3>Mystery</h3>
      <a href="#mystery">
        
      </a>
    </div>
    <p>One mystery remains (and there's a CloudFlare T-shirt for the person with the most convincing explanation): how are the random source IPs being generated?</p><p>The answer might be boring (i.e. it might just be reading bytes from /dev/random), but given the author's love of relationships between fields, perhaps there's something else going on. We've seen IP addresses get reused, which leads us to think that there's something to discover here.</p><p>Here's a sequence of actual source IPs seen:</p>
            <pre><code>218.254.187.151
8.187.160.236
123.73.134.186
68.133.199.20
205.26.91.155
169.235.56.120
96.160.119.221
44.226.72.236
205.26.91.155
140.206.27.92
70.62.151.0
161.98.197.249</code></pre>
            <p>We have no real guarantee that the attacker generated them in this order, but perhaps there's an interesting way these are being generated.</p><p>Answers in comments if you manage to figure it out!</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[IPv4]]></category>
            <category><![CDATA[TTL]]></category>
            <guid isPermaLink="false">6c3dtzFmbR4HToIqhUW8aa</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
    </channel>
</rss>