
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Thu, 09 Apr 2026 08:13:10 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Finding the grain of sand in a heap of Salt]]></title>
            <link>https://blog.cloudflare.com/finding-the-grain-of-sand-in-a-heap-of-salt/</link>
            <pubDate>Thu, 13 Nov 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We explore the fundamentals of Saltstack and how we use it at Cloudflare. We also explain how we built the infrastructure to reduce release delays due to Salt failures on the edge by over 5%.  ]]></description>
<content:encoded><![CDATA[ <p>How do you find the root cause of a configuration management failure when you have a peak of hundreds of changes in 15 minutes on thousands of servers?</p><p>That was the challenge we faced as we built the infrastructure to reduce release delays due to failures of Salt, a configuration management tool. (We eventually reduced such delays on the edge by over 5%, as we’ll explain below.) We’ll explore the fundamentals of Salt, and how it is used at Cloudflare. We then describe the common failure modes and how they delay our ability to release valuable changes to serve our customers.</p><p>By first solving an architectural problem, we provided the foundation for self-service mechanisms to find the root cause of Salt failures on servers, data centers and groups of data centers. This system is able to correlate failures with git commits, external service failures and ad hoc releases. The result of this has been a reduction in the duration of software release delays, and an overall reduction in toilsome, repetitive triage for SRE.</p><p>To start, we will go into the basics of the Cloudflare network and how Salt operates within it. Then we’ll get to how we solved a challenge akin to finding a grain of sand in a heap of Salt.</p>
    <div>
      <h3>How Salt works</h3>
      <a href="#how-salt-works">
        
      </a>
    </div>
<p>Configuration management (CM) ensures that a system corresponds to its configuration information, and maintains the integrity and traceability of that information over time. A good configuration management system ensures that a system does not “drift” – i.e. deviate from the desired state. Modern CM systems include detailed descriptions of infrastructure, version control for these descriptions, and other mechanisms to enforce the desired state across different environments. Without CM, administrators must manually configure systems, a process that is error-prone and difficult to reproduce.</p><p><a href="https://docs.saltproject.io/en/latest/topics/tutorials/walkthrough.html"><u>Salt</u></a> is an example of such a CM tool. Designed for high-speed remote execution and configuration management, it uses a simple, scalable model to manage large fleets. As a mature CM tool, it provides consistency, reproducibility, change control, auditability and collaboration across team and organisational boundaries.</p><p>Salt’s design revolves around a <b>master</b>/<b>minion</b> architecture, a message bus built on ZeroMQ, and a declarative state system. (At Cloudflare we generally avoid the terms “master” and “minion.” But we will use them here because that’s how Salt describes its architecture.) The <b>salt master</b> is a central controller that distributes jobs and configuration data. It listens for requests on the message bus and dispatches commands to targeted minions. It also stores state files, pillar data and cache files. The <b>salt minion</b> is a lightweight agent installed on each managed host/server. Each minion maintains a connection to the master via ZeroMQ and subscribes to published jobs. 
When a job matches the minion, it executes the requested function and returns results.</p><p>The diagram below shows a simplification of the Salt architecture described <a href="https://docs.saltproject.io/en/3006/topics/salt_system_architecture.html"><u>in the docs</u></a>, for the purpose of this blog post.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/19pmijzWlmIf3ACxW5M4hF/1c2ff0aab6afd76d1bc2286953ef509b/image5.png" />
          </figure><p>The <b>state system</b> provides declarative configuration management. States are often written in YAML and describe a resource (package, file, service, user, etc.) and the desired attributes. A common example is a package state, which ensures that a package is installed at a specified version.</p>
            <pre><code># /srv/salt/webserver/init.sls
include:
  - common

nginx:
  pkg.installed: []

/etc/nginx/nginx.conf:
  file.managed:
    - source: salt://webserver/files/nginx.conf
    - require:
      - pkg: nginx</code></pre>
            <p>States can call <b>execution modules</b>, which are Python functions that implement system actions. When applying states, Salt returns a structured result containing whether the state succeeded (<code>result: True/False</code>), a comment, changes made, and duration.</p>
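<p>To make the shape of an execution module concrete, here is a minimal, hypothetical module (illustrative only, not one of Salt’s built-in modules). Dropped into a <code>_modules/</code> directory and synced to minions, its top-level functions become callable from states or the CLI, e.g. <code>salt '*' disk.percent_used /var</code>:</p>
<pre><code># _modules/disk.py -- hypothetical execution module for illustration
import shutil


def percent_used(path="/"):
    """Return the percentage of disk space used at `path`.

    Salt would expose this as `disk.percent_used`; the return value
    flows back to the master as part of the structured job result.
    """
    usage = shutil.disk_usage(path)
    return round(100 * usage.used / usage.total, 1)</code></pre>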
    <div>
      <h3>Salt at Cloudflare</h3>
      <a href="#salt-at-cloudflare">
        
      </a>
    </div>
<p>We use Salt to manage our ever-growing fleet of machines, and have previously written about our <a href="https://blog.cloudflare.com/tag/salt/"><u>extensive usage</u></a>. The master-minion architecture described above allows us to push configuration in the form of states to thousands of servers, which is essential for maintaining our network. We’ve designed our change propagation to involve blast radius protection. With these protections in place, a <a href="https://docs.saltproject.io/en/3006/ref/states/highstate.html"><u>highstate</u></a> failure becomes a signal, rather than a customer-impacting event.</p><p>This release design was intentional – we decided to “fail safe” instead of failing hard. By <a href="https://blog.cloudflare.com/safe-change-at-any-scale/"><u>further adding guardrails</u></a> to safely release new code before a feature reaches all users, we are able to propagate a change with confidence that failures will halt the Salt deployment pipeline by default. However, every halt blocks other configuration deployments and requires human intervention to determine the root cause. This can quickly become a toilsome process as the steps are repetitive and bring no enduring value.</p><p>Part of our deployment pipeline for Salt changes uses <a href="https://en.wikipedia.org/wiki/APT_(software)"><u>Apt</u></a>. Every X minutes a commit is merged into the master branch, and every Y minutes those merges are bundled and deployed to APT servers. The key file for retrieving Salt Master configuration from that APT server is the APT source file:</p>
            <pre><code># /etc/apt/sources.list.d/saltcodebase.sources
# MANAGED BY SALT -- DO NOT MODIFY

Types: deb
URIs: mirror+file:/etc/apt/mirrorlists/saltcodebase.txt
Suites: stable canary
Components: cloudflare
Signed-By: /etc/apt/keyrings/cloudflare.gpg</code></pre>
<p>This file directs a master to the correct suite for its specific environment. Using that suite, it retrieves the relevant Salt Debian package containing the latest changes. It installs that package and begins deploying the included configuration. As it deploys the configuration on machines, the machines report their health using <a href="https://blog.cloudflare.com/how-cloudflare-runs-prometheus-at-scale/"><u>Prometheus</u></a>. If a version is healthy, it will be progressed into the next environment. Before it can be progressed, a version has to pass a certain soak threshold, giving latent errors time to surface so that more complex issues become apparent. That is the happy case.</p><p>The unhappy case brings a myriad of complications: as we do progressive deployments, if a version is broken, any subsequent version is also broken. And because broken versions are continuously overtaken by newer versions, we need to stop deployments altogether. In a broken version scenario, it is crucial to get a fix out as soon as possible. This touches upon the core question of this blog post: What if a broken Salt version is propagated across the environment, we are abandoning deployments, and we need to get a fix out as soon as possible?</p>
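<p>The soak-and-promote gate described above can be sketched roughly as follows. The thresholds and field names here are illustrative assumptions, not our actual values:</p>
<pre><code># Rough sketch of the soak/health promotion gate (illustrative values).
from dataclasses import dataclass
from datetime import datetime, timedelta

SOAK = timedelta(hours=4)    # assumed soak threshold
MAX_ERROR_RATE = 0.01        # assumed health threshold


@dataclass
class Version:
    name: str
    deployed_at: datetime
    error_rate: float  # fraction of machines reporting unhealthy


def can_promote(version, now):
    """A version progresses only once it has soaked and stayed healthy."""
    soaked = now - version.deployed_at >= SOAK
    healthy = MAX_ERROR_RATE >= version.error_rate
    return soaked and healthy</code></pre>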
    <div>
      <h3>The pain: how Salt breaks and reports errors (and how it affects Cloudflare)</h3>
      <a href="#the-pain-how-salt-breaks-and-reports-errors-and-how-it-affects-cloudflare">
        
      </a>
    </div>
<p>While Salt aims for idempotent and predictable configuration, failures can occur during the render, compile, or runtime stages. These failures are commonly due to misconfiguration. Errors in Jinja templates or invalid YAML can cause the render stage to fail. Examples include missing colons, incorrect indentation, or undefined variables. A syntax error is often raised with a stack trace pointing to the offending line.</p><p>Another frequent cause of failure is missing pillar or grain data. Since pillar data is compiled on the master, forgetting to update pillar top files or refresh pillar can result in <i>KeyError</i> exceptions. Since Salt maintains order using requisites, misconfigured requisites can lead to states executing out of order or being skipped. Failures can also happen when minions are unable to authenticate with the master, or cannot reach the master due to network or firewall issues.</p><p>Salt reports errors in several ways. By default, the <code>salt</code> and <code>salt-call</code> commands exit with a retcode <b>1</b> when any state fails. Salt also sets internal retcodes for specific cases: 1 for compile errors, 2 when a state returns False, and 5 for pillar compilation errors. Test mode shows what changes would be made without actually executing them, and is useful for catching syntax or ordering issues. Debug logs can be toggled using the <code>-l debug</code> CLI option (<code>salt &lt;minion&gt; state.highstate -l debug</code>).</p><p>The state return also includes the details of the individual state failures: the durations, timestamps, functions and results. If we introduce a failure to the <code>file.managed</code> state by referencing a file that doesn’t exist in the Salt fileserver, we see this failure:</p>
            <pre><code>web1:
----------
          ID: nginx
    Function: pkg.installed
      Result: True
     Comment: Package nginx is already installed
     Started: 15:32:41.157235
    Duration: 256.138 ms
     Changes:   

----------
          ID: /etc/nginx/nginx.conf
    Function: file.managed
      Result: False
     Comment: Source file salt://webserver/files/nginx.conf not found in saltenv 'base'
     Started: 15:32:41.415128
    Duration: 14.581 ms
     Changes:   

Summary for web1
------------
Succeeded: 1 (changed=0)
Failed:    1
------------
Total states run:     2
Total run time: 270.719 ms</code></pre>
            <p>The return can also be displayed in JSON:</p>
            <pre><code>{
  "web1": {
    "pkg_|-nginx_|-nginx_|-installed": {
      "comment": "Package nginx is already installed",
      "name": "nginx",
      "start_time": "15:32:41.157235",
      "result": true,
      "duration": 256.138,
      "changes": {}
    },
    "file_|-/etc/nginx/nginx.conf_|-/etc/nginx/nginx.conf_|-managed": {
      "comment": "Source file salt://webserver/files/nginx.conf not found in saltenv 'base'",
      "name": "/etc/nginx/nginx.conf",
      "start_time": "15:32:41.415128",
      "result": false,
      "duration": 14.581,
      "changes": {}
    }
  }
}</code></pre>
<p>The flexibility of the output format means that humans can parse the output in custom scripts. But more importantly, it can also be consumed by more complex, interconnected automation systems. We knew we could easily parse these outputs to attribute the cause of a Salt failure to an input – e.g. a change in source control, an external service failure, or a software release. But something was missing.</p>
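<p>For instance, a short script along these lines (a sketch, assuming the JSON shape shown above) can pull out just the failed states:</p>
<pre><code># Sketch: extract failed states from a highstate JSON return.
import json


def failed_states(return_json):
    """Return (minion, state name, comment) for every failed state."""
    failures = []
    for minion, states in json.loads(return_json).items():
        for outcome in states.values():
            if outcome.get("result") is False:
                failures.append((minion, outcome["name"], outcome["comment"]))
    return failures</code></pre>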
    <div>
      <h3>The solutions</h3>
      <a href="#the-solutions">
        
      </a>
    </div>
    <p>Configuration errors are a common cause of failure in large-scale systems. Some of these could even lead to full system outages, which we prevent with our release architecture. When a new release or configuration breaks in production, our SRE team needs to find and fix the root cause to avoid release delays. As we’ve previously noted, this triage is tedious and increasingly difficult due to system complexity.</p><p>While some organisations use formal techniques such as automated root cause analysis, most triage is still frustratingly manual. After evaluating the scope of the problem, we decided to adopt an automated approach. This section describes the step-by-step approach to solving this broad, complex problem in production.</p>
    <div>
      <h3>Phase one: retrievable CM inputs</h3>
      <a href="#phase-one-retrievable-cm-inputs">
        
      </a>
    </div>
<p>When a Salt highstate failed on a minion, SRE teams faced a tedious investigation process: manually SSHing into minions, searching through logs for error messages, tracking down job IDs (JIDs), and locating the job associated with the JID on one of multiple associated masters. This is all while racing against a 4-hour retention window on master logs. The fundamental problem was architectural: job results live on Salt Masters, not on the minions where they’re executed, forcing operators to guess which master processed their job (SSHing into each one) and limiting visibility for users without master access.</p><p>We built a solution that caches job results directly on minions, similar to the <a href="https://docs.saltproject.io/en/3006/ref/returners/all/index.html"><u>local_cache</u></a> returner that exists for masters. This enables local job retrieval and extended retention periods. It transformed a multistep, time-sensitive investigation into a single query — operators can retrieve job details, automatically extract error context, and trace failures back to specific file changes and commit authors, all from the minion itself. The custom returner filters and manages cache size intelligently, eliminating the “which master?” problem while also enabling automated error attribution, reducing time to resolution, and removing human toil from routine troubleshooting.</p><p>By decentralizing job history and making it queryable at the source, we moved significantly closer to a self-service debugging experience where failures are automatically contextualized and attributed, letting SRE teams focus on fixes rather than forensics.</p>
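<p>A minion-side cache returner can be sketched like this. The path, the cache cap, and the extra <code>cache_dir</code> parameter are illustrative assumptions for this post; a real Salt returner exposes only <code>returner(ret)</code>:</p>
<pre><code># _returners/minion_job_cache.py -- illustrative sketch, not our actual returner
import json
import os

CACHE_DIR = "/var/cache/salt/minion/job_history"  # assumed location
MAX_ENTRIES = 200                                 # assumed cache cap


def returner(ret, cache_dir=CACHE_DIR):
    """Persist a job return on the minion itself, keyed by JID.

    (A real Salt returner takes only `ret`; the extra parameter
    here exists purely to make the sketch testable.)
    """
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, "{}.json".format(ret["jid"])), "w") as f:
        json.dump(ret, f)
    # Trim the oldest entries so the cache stays bounded; JIDs sort
    # chronologically because they are derived from timestamps.
    entries = sorted(os.listdir(cache_dir))
    for old in entries[:-MAX_ENTRIES]:
        os.remove(os.path.join(cache_dir, old))</code></pre>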
    <div>
      <h3>Phase two: Self-service using a Salt Blame Module</h3>
      <a href="#phase-two-self-service-using-a-salt-blame-module">
        
      </a>
    </div>
<p>Once job information was available on the minion, we no longer needed to resolve which master triggered the job that failed. The next step was to write a Salt execution module that would allow an external service to query for job information, and more specifically failed job information, without needing to know Salt internals. This led us to write a module called <b><i>Salt Blame</i></b>. Cloudflare prides itself on its blameless culture; our software, on the other hand…</p><p>The blame module is responsible for pulling together three things:</p><ul><li><p>Local job history information</p></li><li><p>CM inputs (latest commit present during the job)</p></li><li><p>Git repo commit history</p></li></ul><p>We chose to write an execution module for three reasons: simplicity, decoupling external automation from the need to understand Salt internals, and potential use by operators for further troubleshooting. Writing execution modules is already well established within operational teams and adheres to well-defined best practices such as unit tests, linting and extensive peer review.</p><p>The module itself is quite simple. It iterates in reverse chronological order through the jobs in the local cache, looking for the earliest job failure and the successful job immediately prior to it. This narrows down the true first failure and gives us before-and-after state results. At this stage, we have several avenues to present context to the caller: to find possible commit culprits, we look through all commits between the last successful Job ID and the failure to determine if any of these changed files relevant to the failure. We also provide the list of failed states and their outputs as another avenue to spot the root cause. We’ve learned that this flexibility is important to cover the wide range of failure possibilities.</p><p>We also make a distinction between normal failed states and compile errors.
As <a href="https://docs.saltproject.io/en/3007/topics/return_codes/index.html"><u>described in the Salt docs</u></a>, each job returns different retcodes based on the outcome. </p><ul><li><p>Compile Error: 1 is set when any error is encountered in the state compiler.</p></li><li><p>Failed State: 2 is set when any state returns a <code>False</code> result.</p></li></ul><p>Most of our failures manifest as failed states as a result of a change in source control. An engineer building a new feature for our customers may unintentionally introduce a failure that was uncaught by our CI and Salt Master tests. In the first iteration of the module, listing all the failed states was sufficient to pinpoint the root cause of a highstate failure.</p><p>However, we noticed that we had a blind spot. Compile errors do not result in a failed state, since no state runs. Since these errors returned a different retcode from what we checked for, the module was completely blind to them. Most compile errors happen when a Salt service dependency fails during the state compile phase. They can also happen as a result of a change in source control, although that is rare.</p><p>With both state failures and compile errors accounted for, we drastically improved our ability to pinpoint issues. We released the module to SREs who immediately realised the benefits of faster Salt triage.</p>
            <pre><code># List all the recent failed states
minion~$ salt-call -l info blame.last_failed_states
local:
    |_
      ----------
      __id__:
          /etc/nginx/nginx.conf
      __run_num__:
          5221
      __sls__:
          foo
      changes:
          ----------
      comment:
          Source file salt://webserver/files/nginx.conf not found in saltenv 'base'
      duration:
          367.233
      finish_time_stamp:
          2025-10-22T10:00:17.289897+00:00
      fun:
          file.managed
      name:
          /etc/nginx/nginx.conf
      result:
          False
      start_time:
          10:00:16.922664
      start_time_stamp:
          2025-10-22T10:00:16.922664+00:00

# List all the commits that correlate with a failed state
minion~$ salt-call -l info blame.last_highstate_failure
local:
    ----------
    commits:
        |_
          ----------
          author_email:
              johndoe@cloudflare.com
          author_name:
              John Doe
          commit_datetime:
              2025-06-30T15:29:26.000+00:00
          commit_id:
              e4a91b2c9f7d3b6f84d12a9f0e62a58c3c7d9b5a
          path:
              /srv/salt/webserver/init.sls
    message:
        reviewed 5 change(s) over 12 commit(s) looking for 1 state failure(s)
    result:
        True

# List all the compile errors
minion~$ salt-call -l info blame.last_compile_errors
local:
    |_
      ----------
      error_types:
      job_timestamp:
          2025-10-24T21:55:54.595412+00:00
      message: A service failure has occurred
      state: foo
      traceback:
          Full stack trace of the failure
      urls: http://url-matching-external-service-if-found</code></pre>
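<p>Under the hood, the search for the failure window can be sketched as follows (a simplification of the logic described above, with illustrative field names):</p>
<pre><code># Sketch: find the earliest failure and the last success before it,
# walking the local job cache newest-to-oldest.
def first_failure_window(jobs):
    """jobs: dicts with 'jid' and 'success', sorted newest first."""
    first_failure = None
    for job in jobs:
        if not job["success"]:
            first_failure = job  # keep updating: older failures win
        elif first_failure is not None:
            # First success older than a failure: we have our window.
            return job, first_failure
    return None, first_failure</code></pre>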
            
    <div>
      <h3>Phase three: automate, automate, automate!</h3>
      <a href="#phase-three-automate-automate-automate">
        
      </a>
    </div>
<p>Faster triage is always a welcome development, and engineers were comfortable running local commands on minions to triage Salt failures. But in a busy shift, time is of the essence. When failures spanned multiple data centers or machines, it quickly became cumbersome to run commands across all the affected minions. This solution also required context switching between multiple nodes and data centers. We needed a way to aggregate common failure types using a single command – for single minions, pre-production data centers and production data centers.</p><p>We implemented several mechanisms to simplify triage and eliminate manual triggers. We aimed to get this tooling as close to the triage location as possible, which is often chat. With three distinct commands, engineers were now able to triage Salt failures right from chat threads.</p><p>With a hierarchical approach, we made individual triage possible for minions, data centers and groups of data centers. A hierarchy makes this architecture fully extensible, flexible and self-organising. An engineer is able to triage a failure on one minion and, as needed, on an entire data center at the same time.</p>
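<p>The hierarchical fan-out amounts to running the same triage in parallel at each level. A toy sketch, where <code>triage_minion</code> is a placeholder for the real per-minion “blame” call:</p>
<pre><code># Toy sketch of the hierarchical fan-out; names are illustrative.
from concurrent.futures import ThreadPoolExecutor


def triage_minion(minion):
    # Placeholder for the real per-minion blame query.
    return {"minion": minion, "blame": "placeholder result"}


def triage_datacenter(dc, minions):
    # Every minion in a data center is triaged in parallel.
    with ThreadPoolExecutor() as pool:
        return {dc: list(pool.map(triage_minion, minions))}


def triage_group(datacenters):
    # ...and every data center in a group is triaged in parallel too.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda item: triage_datacenter(*item),
                             datacenters.items()))</code></pre>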
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/0lEOmpdw9err62tIWqn0Y/3905580de41daf9fc81928a740c2dd14/image2.png" />
</figure><p>The ability to triage multiple data centers at the same time became immediately useful for tracking the root cause of failures in pre-production data centers. These failures delay the propagation of changes to other data centers, and hinder our ability to release changes for customer features, bug fixes or incident remediation. The addition of this triage option has cut down the time to debug and remediate Salt failures by over 5%, allowing us to consistently release important changes for our customers.</p><p>While 5% does not immediately look like a drastic improvement, the magic is in the cumulative effect. We won’t share actual figures for how long releases are delayed, but we can do a simple thought experiment. If the average amount of time spent is even just 60 minutes per day, a 5% reduction saves us 90 minutes (one hour 30 minutes) per month.</p><p>Another indirect benefit lies in more efficient feedback loops. Since engineers spend less time fiddling with complex configurations, that energy is diverted towards preventing recurrence, further reducing the overall time by an amount we haven’t yet measured. Our future plans include measurement and data analytics to understand the outcomes of these direct and indirect feedback loops.</p><p>The image below shows an example of pre-production triage output. We are able to correlate failures with git commits, releases, and external service failures. During a busy shift, this information is invaluable for quickly fixing breakage. On average, each minion “blame” takes less than 30 seconds, while multiple data centers are able to return a result in a minute or less.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70qnYIpPDFI1gl0u7FrT4o/51899c016dabc382fb3346ccaab254b5/Screenshot_2025-11-12_at_20.40.53.png" />
          </figure><p>The image below describes the hierarchical model. Each step in the hierarchy is executed in parallel, allowing us to achieve blazing fast results.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HpNSQrNN4wZMjpOG9J30o/912164038586e57e1e09124f36998ff8/image6.png" />
          </figure><p>With these mechanisms available, we further cut down triage time by triggering the triage automation on known conditions, especially those with impact to the release pipeline. This directly improved the velocity of changes to the edge since it took less time to find a root cause and fix-forward or revert.</p>
    <div>
      <h3>Phase four: measure, measure, measure</h3>
      <a href="#phase-four-measure-measure-measure">
        
      </a>
    </div>
    <p>After we got blazing fast Salt triage, we needed a way to measure the root causes. While individual root causes are not immediately valuable, historical analysis was deemed important. We wanted to understand the common causes of failure, especially as they hinder our ability to deliver value to customers. This knowledge creates a feedback loop that can be used to keep the number of failures low.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QeJ1C9CT3rlDqaqHsQjdA/02ac41494ef3594ced1fddc0f801fae7/image1.png" />
          </figure><p>Using Prometheus and Grafana, we track the top causes of failure: git commits, releases, external service failures and unattributed failed states. The list of failed states is particularly useful because we want to know repeat offenders and drive better adoption of stable releasing practices. We are also particularly interested in root causes — a spike in the number of failures due to git commits indicates a need to adopt better coding practices and linting, a spike in external service failures indicates a regression in an internal system to be investigated, and a spike in release-based failures indicates a need for better gating and release-shepherding.</p><p>We analyse these metrics on a monthly cycle, providing feedback mechanisms through internal tickets and escalations. While the immediate impact of these efforts is not yet visible as the efforts are nascent, we expect to improve the overall health of our Saltstack infrastructure and release process by reducing the amount of breakage we see.</p>
    <div>
      <h3>The broader picture</h3>
      <a href="#the-broader-picture">
        
      </a>
    </div>
<p>Much operational work is often seen as a “necessary evil”. Humans in ops are conditioned to intervene when failures happen and remediate them. This cycle of alert and response is necessary to keep the infrastructure running, but it often leads to toil. We have <a href="https://blog.cloudflare.com/improving-platform-resilience-at-cloudflare/"><u>discussed the effect of toil</u></a> in a previous blog post.</p><p>This work represents another step in the right direction – removing more toil for our on-call SREs, and freeing up valuable time to work on novel issues. We hope that this encourages other operations engineers to share the progress they are making towards reducing overall toil in their organizations. We also hope that this sort of work can be adopted within Saltstack itself, although the lack of homogeneity in production systems across companies makes that unlikely.</p><p>In the future, we plan to improve the accuracy of detection and rely less on external correlation of inputs to determine the root cause of failed outcomes. We will investigate how to move more of this logic into our native Saltstack modules, further streamlining the process and avoiding regressions as external systems drift.</p><p>If this sort of work is exciting to you, we encourage you to take a look at our <a href="https://www.cloudflare.com/en-gb/careers/"><u>careers page</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Engineering]]></category>
            <category><![CDATA[Configuration Management]]></category>
            <category><![CDATA[SRE]]></category>
            <guid isPermaLink="false">6NzBMptWpdowos2RocUWxl</guid>
            <dc:creator>Opeyemi Onikute</dc:creator>
            <dc:creator>Menno Bezema</dc:creator>
            <dc:creator>Nick Rhodes</dc:creator>
        </item>
        <item>
            <title><![CDATA[Deploying firmware at Cloudflare-scale: updating thousands of servers in more than 285 cities]]></title>
            <link>https://blog.cloudflare.com/deploying-firmware-at-cloudflare-scale-how-we-update-thousands-of-servers-in-more-than-285-cities/</link>
            <pubDate>Fri, 10 Mar 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ We have a huge number of servers of varying kinds, from varying vendors, spread over 285 cities worldwide. We need to be able to rapidly deploy various types of firmware updates to all of them, reliably, and automatically, without any kind of manual intervention. ]]></description>
<content:encoded><![CDATA[ <p>As a security company, it’s critical that we have good processes for dealing with security issues. We regularly release software to our servers - on a daily basis even - which includes new features, bug fixes, and as required, security patches. But just as critical is the software which is <i>embedded</i> into the server hardware, known as firmware. Primarily of interest is the BIOS and <a href="/bmc-vuln/">Baseboard Management Controller</a> (BMC), but many other components also have firmware such as Network Interface Cards (NICs).</p><p>As the world becomes more digital, software which needs updating is appearing in more and more devices. As well as my computer, over the last year, I have waited patiently while firmware has updated in my TV, vacuum cleaner, lawn mower and light bulbs. It can be a cumbersome process, including obtaining the firmware, deploying it to the device which needs updating, navigating menus and other commands to initiate the update, and then waiting several minutes for the update to complete.</p><p>Firmware updates can be annoying even if you only have a couple of devices. We have more than a few devices at Cloudflare. We have a huge number of servers of varying kinds, from varying vendors, spread over 285 cities worldwide. We need to be able to rapidly deploy various types of firmware updates to all of them, reliably, and automatically, without any kind of manual intervention.</p><p>In this blog post I will outline the methods that we use to automate firmware deployment to our entire fleet. We have been using this method for several years now, and have deployed firmware entirely automatically, without interrupting our SRE team.</p>
    <div>
      <h3>Background</h3>
      <a href="#background">
        
      </a>
    </div>
<p>A key component of our ability to deploy firmware at scale is iPXE, an open source boot loader. iPXE is the glue which operates between the server and operating system, and is responsible for loading the operating system after the server has completed the Power On Self Test (POST). It is very flexible and contains a scripting language. With iPXE, we can write boot scripts which query the firmware version, continue booting if the correct firmware version is deployed, or if not, boot into a flashing environment to flash the correct firmware.</p><p>We only deploy new firmware when our systems are out of production, so we need a method to coordinate deployment only on out of production systems. The simplest way to do this is when they are rebooting, because by definition they are out of production then. We reboot our entire fleet every month, and have the ability to schedule reboots more urgently if required to deal with a security issue. Regularly rebooting our fleet has many advantages. We can <a href="https://www.youtube.com/watch?v=8mlJu8hPpQQ">deploy the latest Linux kernel</a>, base operating system, and ensure that we do not have any breaking changes in our operating system and configuration management environment that break on a fresh boot.</p><p>Our entire fleet operates in <a href="https://en.wikipedia.org/wiki/UEFI">UEFI mode</a>. UEFI is a modern replacement for the BIOS and offers more features and more security, such as Secure Boot. A full description of all of these changes is outside the scope of this article, but essentially UEFI provides a minimal environment and shell capable of executing binaries. <a href="/anchoring-trust-a-hardware-secure-boot-story/">Secure Boot ensures</a> that the binaries are signed with keys embedded in the system, to prevent a bad actor from tampering with our software.</p>
    <div>
      <h3>How we update the BIOS</h3>
      <a href="#how-we-update-the-bios">
        
      </a>
    </div>
    <p>We are able to update the BIOS without booting any operating system, purely by taking advantage of features offered by iPXE and the UEFI shell. This requires a <a href="https://wiki.osdev.org/UEFI#UEFI_applications_in_detail">flashing binary written for the UEFI environment</a>.</p><p>Upon boot, iPXE is started. Through iPXE’s built-in variable <code>${smbios/0.5.0}</code> it is possible to <a href="https://forum.ipxe.org/showthread.php?tid=7749">query the current BIOS version</a>, compare it to the latest version, and trigger a flash only if there is a mismatch. iPXE then downloads the files required for the firmware update to a ramdisk.</p><p>The following is an example of a very basic iPXE script which performs such an action:</p>
            <pre><code># Check whether the BIOS version is 2.03
iseq ${smbios/0.5.0} 2.03 || goto biosupdate
echo Nothing to do for {{ model }}
exit 0

:biosupdate
echo Trying to update BIOS/UEFI...
echo Current: ${smbios/0.5.0}
echo New: 2.03

imgfetch ${boot_prefix}/tools/x64/shell.efi || goto unexpected_error
imgfetch startup.nsh || goto unexpected_error

imgfetch AfuEfix64.efi || goto unexpected_error
imgfetch bios-2.03.bin || goto unexpected_error

imgexec shell.efi || goto unexpected_error</code></pre>
            <p>Meanwhile, <code>startup.nsh</code> contains the binary to run and the command line arguments to effect the flash:</p><p><code>startup.nsh</code>:</p>
            <pre><code>%homefilesystem%\AfuEfix64.efi %homefilesystem%\bios-2.03.bin /P /B /K /N /X /RLC:E /REBOOT</code></pre>
            <p>After rebooting, the machine will boot using its new BIOS firmware, version 2.03. Since <code>${smbios/0.5.0}</code> now contains 2.03, the machine continues to boot and enter production.</p>
    <div>
      <h3>Other firmware updates such as BMC, network cards and more</h3>
      <a href="#other-firmware-updates-such-as-bmc-network-cards-and-more">
        
      </a>
    </div>
    <p>Unfortunately, the number of vendors that support firmware updates with UEFI flashing binaries is limited. There are a large number of other updates that we need to perform, such as those for the BMC and NICs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Bnbihh4sYojHTFzb0VX5k/d4e818e8959aaacdcff156a64b30bd89/image1-6.png" />
            
            </figure><p>Consequently, we need another way to flash these binaries. Thankfully, these vendors invariably support flashing from Linux. Consequently we can perform flashing from a minimal Linux environment. Since vendor firmware updates are typically closed source utilities and vendors are often highly secretive about firmware flashing, we can ensure that the flashing environment does not provide an attackable surface by ensuring that the network is not configured. If it’s not on the network, it can’t be attacked and exploited.</p><p>Not being on the network means that we need to inject files into the boot process when the machine boots. We can accomplish this with an <a href="https://docs.kernel.org/admin-guide/initrd.html">initial ramdisk</a> (<code>initrd</code>), and iPXE makes it easy to add additional <code>initrd</code> to the boot.</p><p>Creating an <code>initrd</code> is as simple as creating an archive of the files using cpio using the newc archive format.</p><p>Let’s imagine we are going to flash Broadcom NIC firmware. We’ll use the bnxtnvm firmware update utility, the firmware image <code>firmware.pkg</code>, and a shell script called <code>flash</code> to automate the task.</p><p>The files are laid out in the file system like this:</p>
            <pre><code>cd broadcom
find .
./opt/preflight
./opt/preflight/scripts
./opt/preflight/scripts/flash
./opt/broadcom
./opt/broadcom/firmware.pkg
./opt/broadcom/bnxtnvm</code></pre>
            <p>Now we compress all of these files into an image called <code>broadcom.img</code>.</p>
            <pre><code>find . | cpio --quiet -H newc -o | gzip -9 -n &gt; ../broadcom.img</code></pre>
            <p>This is the first step completed; we have the firmware packaged up into an <code>initrd</code>.</p><p>Since it’s challenging to read, say, the NIC’s firmware version from the EFI shell, we store firmware versions as UEFI variables. These can be written from Linux via <a href="https://www.kernel.org/doc/html/next/filesystems/efivarfs.html"><code>efivars</code></a>, the UEFI variable file system, and then read by iPXE on boot.</p><p>An example of writing an EFI variable from Linux looks like this:</p>
            <pre><code>declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

/bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars

# Files on efivarfs are immutable by default, so remove the immutable flag so that we can write to it: https://docs.kernel.org/filesystems/efivarfs.html
if [ -f "${fw_path}" ] ; then
    /usr/bin/chattr -i "${fw_path}"
fi

echo -n -e "${efi_header}${version}" &gt;| "$fw_path"</code></pre>
            <p>Then we can write an iPXE configuration file to load the flashing kernel, userland and flashing utilities.</p>
            <pre><code>set cf/guid 9ca25c23-368a-4c21-943f-7d91f2b76008

iseq ${efivar/broadcom-fw-${cf/guid}} 1.05 &amp;&amp; echo Not flashing broadcom firmware, version already at 1.05 || goto update
exit

:update
echo Starting broadcom firmware update
kernel ${boot_prefix}/vmlinuz initrd=baseimg.img initrd=linux-initramfs-modules.img initrd=broadcom.img
initrd ${boot_prefix}/baseimg.img
initrd ${boot_prefix}/linux-initramfs-modules.img
initrd ${boot_prefix}/firmware/broadcom.img</code></pre>
            <p>Flashing scripts are deposited into <code>/opt/preflight/scripts</code> and we use <code>systemd</code> to execute them with <a href="https://manpages.debian.org/stretch/debianutils/run-parts.8.en.html">run-parts</a> on boot:</p><p><code>/etc/systemd/system/preflight.service</code>:</p>
            <pre><code>[Unit]
Description=Pre-salt checks and simple configurations on boot
Before=salt-highstate.service
After=network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/run-parts --verbose /opt/preflight/scripts

[Install]
WantedBy=multi-user.target
RequiredBy=salt-highstate.service</code></pre>
            <p>An example flashing script in <code>/opt/preflight/scripts</code> might look like:</p>
            <pre><code>#!/bin/bash

trap 'catch $? $LINENO' ERR
catch(){
    #error handling goes here
    echo "Error $1 occurred on line $2"
}

declare -r fw_path='/sys/firmware/efi/efivars/broadcom-fw-9ca25c23-368a-4c21-943f-7d91f2b76008'
declare -r efi_header='\x07\x00\x00\x00'
declare -r version='1.05'

if lspci | grep -q Broadcom; then
    echo "Broadcom firmware flashing starting"
    if [ ! -f "$fw_path" ] ; then
        chmod +x /opt/broadcom/bnxtnvm
        declare -r interface=$(/opt/broadcom/bnxtnvm listdev | grep "Device Interface Name" | awk -F ": " '{print $2}')
        /opt/broadcom/bnxtnvm -dev=${interface} -force -y install /opt/broadcom/firmware.pkg
        declare -r status=$?
        declare -r currentversion=$(/opt/broadcom/bnxtnvm -dev=${interface} device_info | grep "Package version on NVM" | awk -F ": " '{print $2}')
        declare -r expectedversion="$version"
        if [ $status -eq 0 -a "$currentversion" = "$expectedversion" ]; then
            echo "Broadcom firmware $version flashed successfully"
            /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
            echo -n -e "${efi_header}${version}" &gt;| "$fw_path"
            echo "Created $fw_path"
        else
            echo "Failed to flash Broadcom firmware $version"
            /opt/broadcom/bnxtnvm -dev=${interface} device_info
        fi
    else
        echo "Broadcom firmware up-to-date"
    fi
else
    echo "No Broadcom NIC installed"
    /bin/mount -o remount,rw,nosuid,nodev,noexec,noatime none /sys/firmware/efi/efivars
    if [ -f "${fw_path}" ] ; then
        /usr/bin/chattr -i "${fw_path}"
    fi
    echo -n -e "${efi_header}${version}" &gt;| "$fw_path"
    echo "Created $fw_path"
fi

if [ -f "${fw_path}" ]; then
    echo "rebooting in 60 seconds"
    sleep 60
    /sbin/reboot
fi</code></pre>
            
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Whether you manage just your laptop or desktop computer, or a fleet of servers, it’s important to keep the firmware updated to ensure that the availability, performance and security of the devices are maintained.</p><p>If you have a few devices and would benefit from automating the deployment process, we hope that we have inspired you to have a go by making use of some basic open source tools such as the iPXE boot loader and some scripting.</p><p>Finally, thanks to my colleague <a href="/author/ignat/">Ignat Korchagin</a> who did a large amount of the original work on the UEFI BIOS firmware flashing infrastructure.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">1jDz7Y0bfL3rEOW1hs3unN</guid>
            <dc:creator>Chris Howells</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using Apache Kafka to process 1 trillion inter-service messages]]></title>
            <link>https://blog.cloudflare.com/using-apache-kafka-to-process-1-trillion-messages/</link>
            <pubDate>Tue, 19 Jul 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ We learnt a lot about Kafka on the way to 1 trillion messages, and built some interesting internal tools to ease adoption that will be explored in this blog post ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3RvfaKNjkQcCDhobJWvu7o/e6fa0460c33b0250ceb61458d2d4bd8d/image3-8.png" />
            
            </figure><p>Cloudflare has been using Kafka in production since 2014. We have come a long way since then, and currently run 14 distinct Kafka clusters, across multiple data centers, with roughly 330 nodes. Between them, over a trillion messages have been processed over the last eight years.</p><p>Cloudflare uses Kafka to decouple microservices and communicate the creation, change or deletion of various resources via a common data format in a fault-tolerant manner. This decoupling is one of many factors that enables Cloudflare engineering teams to work on multiple features and products concurrently.</p><p>We learnt a lot about Kafka on the way to one trillion messages, and built some interesting internal tools to ease adoption that will be explored in this blog post. The focus here is on inter-application communication use cases alone, not logging (we have other Kafka clusters, powering the dashboards where customers view statistics, that handle more than one trillion messages <i>each day</i>). I am an engineer on the <a href="https://www.cloudflare.com/application-services/">Application Services</a> team, and our team has a charter to provide tools and services to product teams so they can focus on their core competency: delivering value to our customers.</p><p>In this blog I’d like to recount some of our experiences in the hope that it helps other engineering teams who are on a similar journey of adopting Kafka widely.</p>
    <div>
      <h3>Tooling</h3>
      <a href="#tooling">
        
      </a>
    </div>
    <p>One of our Kafka clusters is creatively named Messagebus. It is the most general purpose cluster we run, and was created to:</p><ul><li><p>Prevent data silos;</p></li><li><p>Enable services to communicate more clearly with basically zero integration cost (more on how we achieved this below);</p></li><li><p>Encourage the use of a self-documenting communication format, thereby removing the problem of out-of-date documentation.</p></li></ul><p>To make it as easy to use as possible and to encourage adoption, the Application Services team created two internal projects. The first is unimaginatively named Messagebus-Client. Messagebus-Client is a Go library that wraps the fantastic <a href="https://github.com/Shopify/sarama">Shopify Sarama</a> library with an opinionated set of configuration options and the ability to manage the rotation of mTLS certificates.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1oHafIQiSG7GPT5Vy4NK2o/fccef1fe85ff5d635975d7397a5d6299/unnamed1-2.png" />
            
            </figure><p>The success of this project is also somewhat its downfall. By providing a ready-to-go Kafka client, we ensured teams got up and running quickly, but we also abstracted some core concepts of Kafka a little too much, meaning that small, unassuming configuration changes could have a big impact.</p><p>One such example led to partition skew (a large portion of messages being directed towards a single partition, meaning we were not processing messages in real time; see the chart below). One drawback of Kafka is that each partition can only be read by one consumer within a consumer group, so when incidents do occur, you can’t trivially scale your way to faster throughput.</p><p>That also means that before your service hits production it is wise to do some back-of-the-napkin math to figure out what throughput might look like, otherwise you will need to add partitions later. We have since amended our library to make events like the below less likely.</p>
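<p>To make the skew failure mode concrete, here is a Go sketch (illustrative only, not the Messagebus-Client’s actual code; the keys and partition count are invented) of how key-hash partitioning pins every message with the same key to one partition:</p>

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor assigns a message to a partition by hashing its key,
// in the style of Kafka hash partitioners: the same key always maps
// to the same partition.
func partitionFor(key string, numPartitions int32) int32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	p := int32(h.Sum32()) % numPartitions
	if p < 0 {
		p = -p
	}
	return p
}

func main() {
	counts := make(map[int32]int)
	// If 90% of messages carry the same key (say, one very hot account),
	// 90% of the traffic lands on a single partition, and only the one
	// consumer owning that partition can drain it.
	for i := 0; i < 1000; i++ {
		key := "hot-account"
		if i%10 == 0 {
			key = fmt.Sprintf("account-%d", i)
		}
		counts[partitionFor(key, 12)]++
	}
	fmt.Println(counts)
}
```

<p>Choosing a higher-cardinality key (or a round-robin partitioner, where ordering by key is not required) spreads the same traffic across all partitions.</p>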
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZGNKq4vOi5vzkHe5FWqsV/f36dac854e7c8e67b5f66a75f16ddeda/image2-14.png" />
            
            </figure><p>The reception for the Messagebus-Client has been largely positive. We spent time as a team to understand what the predominant use cases were, and took the concept one step further to build out what we call the connector framework.</p>
    <div>
      <h3>Connectors</h3>
      <a href="#connectors">
        
      </a>
    </div>
    <p>The connector framework is based on Kafka-connectors and allows our engineers to easily spin up a service that can read from a system of record and push it somewhere else (such as Kafka, or even Cloudflare’s own <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a>). To make this as easy as possible, we use Cookiecutter templating to allow engineers to enter a few parameters into a CLI and in return receive a ready to deploy service.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zcySLWw14zCys58e2yJfn/5f0c05b47e424f5a9f3448fca01be0a1/unnamed2-3.png" />
            
            </figure><p>We provide the ability to configure data pipelines via environment variables. For simple use cases, we provide the functionality out of the box. However, extending the readers, writers and transformations is as simple as satisfying an interface and “registering” the new entry.</p><p>For example, adding the environment variables:</p>
            <pre><code>READER=kafka
TRANSFORMATIONS=topic_router:topic1,topic2|pf_edge
WRITER=quicksilver</code></pre>
            <p>will:</p><ul><li><p>Read messages from Kafka topic “topic1” and “topic2”;</p></li><li><p>Transform the message using a transformation function called “pf_edge” which maps the request from a Kafka protobuf to a Quicksilver request;</p></li><li><p>Write the result to Quicksilver.</p></li></ul><p>Connectors come readily baked with basic metrics and alerts, so teams know they can move to production quickly but with confidence.</p><p>Below is a diagram of how one team used our connector framework to read from the Messagebus cluster and write to various other systems. This is orchestrated by a system the Application Service team runs called Communication Preferences Service (CPS). Whenever a user opts in/out of marketing emails or changes their language preferences on cloudflare.com, they are calling CPS which ensures those settings are reflected in all the relevant systems.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73VDgC81vyvClHrzGhC3ks/51f86d58fd8b9477c33a6d2808663d62/unnamed3-2.png" />
            
            </figure>
    <div>
      <h3>Strict Schemas</h3>
      <a href="#strict-schemas">
        
      </a>
    </div>
    <p>Alongside the Messagebus-Client library, we also provide a repo called Messagebus Schema. This is a schema registry for all message types that will be sent over our Messagebus cluster. For the message format, we use protobuf and have been very happy with that decision. Previously, our team had used JSON for some of our Kafka schemas, but we found it much harder to enforce forward and backwards compatibility, and message sizes were substantially larger than the protobuf equivalent. Protobuf provides strict message schemas (including type safety), the forward and backwards compatibility we desired, the ability to generate code in multiple languages, and very human-readable files.</p><p>We encourage heavy commentary before approving a merge. Once merged, we use prototool to do breaking change detection, enforce some stylistic rules, and generate code for various languages (at the time of writing it’s just Go and Rust, but it is trivial to add more).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gukMjNyXSWw57NnycOs7U/7137d7d445fd0e0980406cce99b7cb80/image6-8.png" />
            
            </figure><p><i>An example Protobuf message in our schema</i></p><p>Furthermore, in Messagebus Schema we store a mapping of proto messages to a team, alongside that team’s chat room in our internal communication tool. This allows us to escalate issues to the correct team easily when necessary.</p><p>One important decision we made for the Messagebus cluster is to only allow one proto message per topic. This is configured in Messagebus Schema and enforced by the Messagebus-Client. This was a good decision to enable easy adoption, but it has led to numerous topics existing. When you consider that for each topic we create, we add numerous partitions and replicate them with a replication factor of at least three for resilience, there is a lot of potential to optimize compute for our lower throughput topics.</p>
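<p>As a rough illustration of the kind of definition kept in Messagebus Schema (hypothetical names and fields, not an actual Cloudflare schema):</p>

```protobuf
syntax = "proto3";

package messagebus.example;

// AccountEvent is emitted whenever an account resource is created,
// updated or deleted. Heavy commentary like this is expected before
// a schema change is approved. (Illustrative only.)
message AccountEvent {
  // Unique identifier of the account this event concerns.
  string account_id = 1;

  // What happened to the resource.
  enum Action {
    ACTION_UNSPECIFIED = 0;
    CREATED = 1;
    UPDATED = 2;
    DELETED = 3;
  }
  Action action = 2;

  // Unix timestamp (seconds) at which the event occurred.
  int64 occurred_at = 3;
}
```

<p>Adding a field with a new tag number is backwards compatible; renumbering or retyping an existing field is the kind of breaking change that prototool rejects.</p>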
    <div>
      <h3>Observability</h3>
      <a href="#observability">
        
      </a>
    </div>
    <p>Making it easy for teams to observe Kafka is essential for our decoupled engineering model to be successful. We therefore have automated metrics and alert creation wherever we can to ensure that all the engineering teams have a wealth of information available to them to respond to any issues that arise in a timely manner.</p><p>We use Salt to manage our infrastructure configuration and follow a GitOps-style model, where our repo holds the source of truth for the state of our infrastructure. To add a new Kafka topic, our engineers make a pull request into this repo and add a couple of lines of YAML. Upon merge, the topic and an alert for high lag (where lag is the difference in time between the last committed offset being read and the last produced offset being produced) will be created. Other alerts can (and should) be created, but this is left to the discretion of application teams. The reason we automatically generate alerts for high lag is that this simple alert is a great proxy for catching a high number of issues, including:</p><ul><li><p>Your consumer isn’t running.</p></li><li><p>Your consumer cannot keep up with the throughput, or there is an anomalous number of messages being produced to your topic at this time.</p></li><li><p>Your consumer is misbehaving and not acknowledging messages.</p></li></ul><p>For metrics, we use Prometheus and display them with Grafana. For each new topic created, we automatically provide a view into production rate, consumption rate and partition skew by producer/consumer. If an engineering team is called out, within the alert message is a link to this Grafana view.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IVbiHGj7DRPdVUXorpOEQ/1be04566221051b8ec499ae265f083fb/image7-cropped.png" />
            
            </figure><p>In our Messagebus-Client, we expose some metrics automatically, and users get the ability to extend them further. The metrics we expose by default are:</p><p>For producers:</p><ul><li><p>Messages successfully delivered.</p></li><li><p>Messages that failed to deliver.</p></li></ul><p>For consumers:</p><ul><li><p>Messages successfully consumed.</p></li><li><p>Message consumption errors.</p></li></ul><p>Some teams use these for alerting on a significant change in throughput, others use them to alert if no messages are produced/consumed in a given time frame.</p>
    <div>
      <h3>A Practical Example</h3>
      <a href="#a-practical-example">
        
      </a>
    </div>
    <p>As well as providing the Messagebus framework, the Application Services team identifies common concerns within Engineering and looks to solve them in a scalable, extensible way, which means other engineering teams can utilize the system and not have to build their own (thus meaning we are not building lots of disparate systems that are only slightly different).</p><p>One example is the Alert Notification System (ANS). ANS is the backend service for the “Notifications” tab in the Cloudflare dashboard. You may have noticed over the past 12 months that new alert and policy types have been made available to customers very regularly. This is because we have made it very easy for other teams to do this. The approach is:</p><ul><li><p>Create a new entry in ANS’s configuration YAML (we use CUE to validate the configuration as part of our continuous integration process);</p></li><li><p>Import our Messagebus-Client into your code base;</p></li><li><p>Emit a message to our alert topic when an event of interest takes place.</p></li></ul><p>That’s it! The producer team now has a means for customers to configure granular alerting policies for their new alert, including the ability to dispatch them via PagerDuty, Slack, Google Chat, a custom webhook, or email (by both API and dashboard). Retrying and dead letter messages are managed for them, and a whole host of metrics are made available, all by making some very small changes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4csJmS7UJAwNtrWvV0WKwx/73d4fd1ff6528c14d3ec42b9237f00a4/unnamed4.png" />
            
            </figure>
    <div>
      <h3>What’s Next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Usage of Kafka (and our Messagebus tools) is only going to increase at Cloudflare as we continue to grow, and as a team we are committed to making the tooling around Messagebus easy to use, customizable where necessary and (perhaps most importantly) easy to observe. We regularly take feedback from other engineers to help improve the Messagebus-Client (we are on the fifth version now) and are currently experimenting with abstracting the intricacies of Kafka away completely and allowing teams to use gRPC to stream messages to Kafka. A blog post on the success/failure of this is to follow!</p><p>If you're interested in building scalable services and solving interesting technical problems, we are hiring engineers on our team in <a href="https://boards.greenhouse.io/cloudflare/jobs/3252504?gh_jid=3252504">Austin</a> and <a href="https://boards.greenhouse.io/cloudflare/jobs/3252504?gh_jid=3252504">Remote US</a>.</p> ]]></content:encoded>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">227GWCIbrOPz01S0w39hHZ</guid>
            <dc:creator>Matt Boyle</dc:creator>
        </item>
        <item>
            <title><![CDATA[Future-proofing SaltStack]]></title>
            <link>https://blog.cloudflare.com/future-proofing-saltstack/</link>
            <pubDate>Thu, 31 Mar 2022 12:59:45 GMT</pubDate>
            <description><![CDATA[ This blog post chronicles our recent CVE investigation, our findings, and how we are helping secure Salt now and in the quantum future ]]></description>
            <content:encoded><![CDATA[ <p></p><p>At Cloudflare, we are <a href="/post-quantumify-cloudflare/">preparing the Internet and our infrastructure</a> for the arrival of quantum computers. A sufficiently large and stable quantum computer <a href="/quantum-solace-and-spectre/">will easily break</a> commonly deployed cryptography such as RSA. Luckily there is a solution: we can swap out the vulnerable algorithms with so-called <i>post-quantum</i> algorithms that are believed to be secure even against quantum computers. For a particular system, this means that we first need to figure out which cryptography is used, for what purpose, and under which (performance) constraints. Most systems use the <a href="https://www.cloudflare.com/learning/ssl/transport-layer-security-tls/">TLS</a> protocol in a standard way, and there a post-quantum upgrade is routine. However, some systems such as <a href="https://saltproject.io/">SaltStack</a>, the focus of this blog post, are more interesting. This blog post chronicles our path of making SaltStack quantum-secure, so welcome to this adventure: this secret extra post-quantum blog post!</p><p>SaltStack, or simply <i>Salt</i>, is an open-source infrastructure management tool used by many organizations. At Cloudflare, <a href="/tag/salt/">we rely on Salt</a> for provisioning and automation, and it has <a href="/manage-cloudflare-records-with-salt/">allowed us to grow our infrastructure quickly</a>.</p><p>Salt uses a bespoke cryptographic protocol to secure its communication. Thus, the first step to a post-quantum Salt was to examine what the protocol was actually doing. In the process we discovered a number of security vulnerabilities (<a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-22934">CVE-2022-22934</a>, <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-22935">CVE-2022-22935</a>, <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-22936">CVE-2022-22936</a>). 
This blog post chronicles the investigation, our findings, and how we are helping secure Salt now and in the quantum future.</p>
    <div>
      <h2>Cryptography in Salt</h2>
      <a href="#cryptography-in-salt">
        
      </a>
    </div>
    <p>Let’s start with a high-level overview of Salt.</p><p>The main agents in a Salt system are servers and clients (referred to as <i>masters</i> and <i>minions</i> in the Salt documentation). A server is the central control point for a number of clients, which can be in the tens of thousands: it can issue a command to the entire fleet, provision client machines with different characteristics, collect reports on jobs running on each machine, and much more. Depending on the architecture, there can be multiple servers managing the same fleet of clients. This is what makes Salt great, as it helps with the management of complex infrastructure.</p><p>By default, the communication between a server and a client happens over <a href="https://zeromq.org/">ZeroMQ</a> on top of TCP, though there is an experimental option to switch to a custom transport directly on <a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol">TCP</a>. The cryptographic protocol is largely the same for both transports. The experimental TCP transport has an option to enable TLS, which does not replace the custom protocol but wraps it in server-authenticated TLS. More about that later on.</p><p>The custom protocol relies on a setup phase in which each server and each client has its own long-term RSA-2048 keypair. On the surface similar to TLS, Salt defines a handshake, or key exchange protocol, that generates a shared secret, and a record protocol which uses this secret with symmetric encryption (the symmetric channel).</p>
    <div>
      <h3>Key exchange protocol</h3>
      <a href="#key-exchange-protocol">
        
      </a>
    </div>
    <p>In its basic form, the key exchange (or handshake) protocol is an RSA key exchange in which the server chooses the secret and encrypts it to the client’s public key. The exact form of the protocol then depends on whether either party already knows the other party’s long-term public key, since certificates (like in TLS) are not used. By default, clients trust the server’s public key on first use, and servers only trust the client’s public key after it has been accepted by an out-of-band mechanism. The shared secret is shared among the entire fleet of clients, so it is not specific to a particular server and client pair. This allows the server to encrypt a broadcast message only once. We will come back to this performance trade-off later on.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/OOEK47GfL4jKc8uua4lRX/843b1e90771e36612635a04c47b49509/image4-27.png" />
            
            </figure><p>Salt key exchange (as of version 3004) under default settings, showing the first connection between the given server and client.</p>
    <div>
      <h3>Symmetric channel</h3>
      <a href="#symmetric-channel">
        
      </a>
    </div>
    <p>The shared secret is used as the key for encryption. Most of the messages between a server and a client are secured in an Encrypt-then-MAC fashion, using AES-192 in CBC mode for encryption and HMAC with SHA-256 for authentication. For certain more sensitive messages, variations on this protocol are used to add more security. For example, commands are signed using the server’s long-term secret key, and “<a href="https://docs.saltproject.io/en/3004/topics/tutorials/pillar.html">pillar data</a>” (deemed more sensitive) is encrypted only to a particular client using a freshly generated secret key.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/xYB4BWt379cUihxp8hnjs/7f1c7c74be3981f69986df75491d1683/image3-43.png" />
            
            </figure><p>Symmetric channel in Salt (as of version 3004).</p>
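<p>The Encrypt-then-MAC construction above can be sketched with Go’s standard library like this. It is a minimal illustration of the pattern, with key sizes and message layout assumed for the sketch rather than taken from Salt’s actual format:</p>

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// sealEtM encrypts first (AES-CBC), then MACs the result (HMAC-SHA256):
// the receiver verifies the MAC before touching the ciphertext.
func sealEtM(encKey, macKey, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(encKey) // a 24-byte key selects AES-192
	if err != nil {
		return nil, err
	}
	// Pad the plaintext to the AES block size (PKCS#7).
	pad := aes.BlockSize - len(plaintext)%aes.BlockSize
	for i := 0; i < pad; i++ {
		plaintext = append(plaintext, byte(pad))
	}
	// Layout: IV || ciphertext || MAC.
	ct := make([]byte, aes.BlockSize+len(plaintext))
	iv := ct[:aes.BlockSize]
	if _, err := rand.Read(iv); err != nil {
		return nil, err
	}
	cipher.NewCBCEncrypter(block, iv).CryptBlocks(ct[aes.BlockSize:], plaintext)
	// MAC covers IV || ciphertext and is appended at the end.
	mac := hmac.New(sha256.New, macKey)
	mac.Write(ct)
	return mac.Sum(ct), nil
}

func main() {
	encKey := make([]byte, 24) // AES-192 key (all-zero only for the demo)
	macKey := make([]byte, 32)
	out, err := sealEtM(encKey, macKey, []byte("pillar data"))
	fmt.Println(len(out), err) // 64 <nil>
}
```

<p>Note that the MAC authenticates the ciphertext but says nothing about <i>who</i> holds the shared key; with a fleet-wide secret, that is exactly the gap the signature-based patches below the fold had to fill for sensitive messages.</p>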
    <div>
      <h2>Security vulnerabilities</h2>
      <a href="#security-vulnerabilities">
        
      </a>
    </div>
    <p>We found that the protocol variation used for pillar messages contains a flaw. As shown in the diagram below, a <a href="/monsters-in-the-middleboxes/">monster-in-the-middle attacker</a> (MitM) that sits between a server and a client can substitute arbitrary “pillar data” to the client. It needs to know the client’s public key, but that is easy to find since clients broadcast it as part of their key exchange request. The initial key exchange can be observed, or one could be triggered on-demand using a specially crafted message. This MitM was possible because neither the newly generated key nor the actual payload were authenticated as coming from the server. This matters because “pillar data” can include anything from specifying packages to be installed to credentials and cryptographic keys. Thus, it is possible that this flaw could allow an attacker to gain access to the vulnerable client machine.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Pho9IvH1FEmLmEywHZ2iM/9fcea35fd78622ed159a06e7c2026609/image1-107.png" />
            
            </figure><p>Illustration of the monster-in-the-middle attack, <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-22934">CVE-2022-22934</a>.</p><p>We reported the issue to Salt on November 5, 2021, and it was assigned <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-22934">CVE-2022-22934</a>. Earlier this week, on March 28, 2022, Salt <a href="https://saltproject.io/security_announcements/salt-security-advisory-release/">released a patch</a> that adds the server’s signature to the pillar message to prevent the attack.</p><p>There were several other smaller issues we found. Messages could be <a href="https://en.wikipedia.org/wiki/Replay_attack">replayed</a> to the same or a different client. This could allow a file intended for one client to be served to a different one, perhaps aiding lateral movement. This is <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-22936">CVE-2022-22936</a> and has been patched by adding the name of the addressed client, a nonce and a signature to messages.</p><p>Finally, there were some messages that could be manipulated to cause the client to crash. This is <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=2022-22935">CVE-2022-22935</a> and was patched similarly by adding the addressee, nonce and signature.</p><p>If you are running Salt, please update as soon as possible to either 3002.8, 3003.4 or 3004.1.</p>
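<p>The shape of the fix — binding the addressee, a fresh nonce and a signature into each message — can be sketched roughly as follows. This is an illustration only, not Salt’s implementation: HMAC stands in for the server’s asymmetric signature, and the message format and names are hypothetical.</p>

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)  # stand-in for the server's signing key

seen_nonces = set()  # per-client replay cache

def sign_message(addressee: str, payload: bytes) -> dict:
    nonce = secrets.token_bytes(16)
    body = addressee.encode() + nonce + payload
    sig = hmac.new(SERVER_KEY, body, hashlib.sha256).digest()
    return {"to": addressee, "nonce": nonce, "payload": payload, "sig": sig}

def accept_message(my_id: str, msg: dict) -> bytes:
    # Reject messages addressed to another client (prevents swapping).
    if msg["to"] != my_id:
        raise ValueError("not addressed to me")
    # Reject replays of a previously seen nonce.
    if msg["nonce"] in seen_nonces:
        raise ValueError("replayed message")
    # Verify the signature, which binds addressee, nonce and payload together.
    body = msg["to"].encode() + msg["nonce"] + msg["payload"]
    expected = hmac.new(SERVER_KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(msg["sig"], expected):
        raise ValueError("bad signature")
    seen_nonces.add(msg["nonce"])
    return msg["payload"]
```

<p>Because the signature covers the addressee and the nonce, a captured message can neither be redirected to a different client nor delivered twice.</p>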
    <div>
      <h2>Moving forward</h2>
      <a href="#moving-forward">
        
      </a>
    </div>
    <p>These patches add signatures to almost every single message. The decision to have a single shared secret was a performance trade-off: only a single encryption is required to broadcast a message. As signatures are computationally much more expensive than symmetric encryption, this trade-off didn’t work out that well. It’s better to switch to a separate shared key per client, so that we don’t need to add a signature on every separate message. In effect, we are creating a single long-lived mutually-authenticated channel per client. But then we are getting very close to what <a href="https://www.cloudflare.com/learning/access-management/what-is-mutual-tls/">mutually authenticated TLS</a> (mTLS) can provide. What would that look like? Hold that thought for a moment: we will return to it below.</p><p>We got sidetracked from our original question: what does this all mean for making Salt post-quantum secure? One thing to take away about post-quantum cryptography today is that signatures <a href="/sizing-up-post-quantum-signatures/">are much larger</a>: Dilithium2, for instance, weighs in at 2.4 kB compared to 256 bytes for an RSA-2048 signature. So ironically, patching these vulnerabilities has made our job more difficult as there are many more signatures. Thus, for our post-quantum goal as well, mTLS seems very attractive, not least because there are post-quantum TLS stacks ready to go.</p><p>Finally, as the security properties of mTLS are well understood, it will be much easier to add new messages and functionality to Salt’s protocol. With the current complex protocol, any change is much harder to judge with confidence.</p>
    <div>
      <h3>An mTLS-based transport</h3>
      <a href="#an-mtls-based-transport">
        
      </a>
    </div>
    <p>So what would such an mTLS-based protocol for Salt look like? Clients would pin the server certificate and the server would pin the client certificates. Thus, we wouldn’t have any traditional public key infrastructure (PKI). This matches up nicely with how Salt deals with keys currently. This allows clients to establish long-lived connections with their server and be sure that all data exchanged between them is confidential, mutually authenticated and has forward secrecy. This mitigates replays, swapping of messages, reflection attacks or subtle denial of service pathways for free.</p><p>We tested this idea by implementing a third transport using WebSocket over mTLS (WSS). As mentioned before, Salt already offers an option to use TLS with the TCP transport, but it doesn’t authenticate clients and creates a new TCP connection for every client request which leads to a multitude of unnecessary TLS handshakes. Internally, Salt has been architected to work with new connections for each request, so our proof of concept required some laborious retooling.</p><p>Our findings show promise that there would be no significant losses and potentially some improvements when it comes to performance at scale. In our preliminary experiments with a single server handling a thousand clients, there was no difference in several metrics compared to the default ZeroMQ transport. Resource-intensive operations such as the fetching of pillar and state data resulted, in our experiment, in lower CPU usage in the mTLS transport. Enabling long-lived connections reduced the amount of data transmitted between the clients and the server, in some cases, significantly so.</p><p>We have shared our preliminary results with Salt, and we are working together to add an mTLS-based transport upstream. Stay tuned!</p>
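<p>As a rough sketch of what certificate pinning with mutual authentication looks like in practice, here is how such contexts might be built with Python’s standard <code>ssl</code> module. The file paths and helper names are hypothetical, and this is not Salt’s actual transport code:</p>

```python
import ssl

def make_server_context(cert_file: str, key_file: str, client_ca_file: str) -> ssl.SSLContext:
    """Build a server-side context that requires client certificates (mTLS).

    In a pinning setup, client_ca_file would contain the pinned client
    certificates themselves rather than a traditional CA bundle.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=client_ca_file)
    ctx.verify_mode = ssl.CERT_REQUIRED  # reject clients without a valid cert
    return ctx

def make_client_context(server_cert_file: str, cert_file: str, key_file: str) -> ssl.SSLContext:
    """Client side: pin the server certificate instead of the system CA store."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.load_verify_locations(cafile=server_cert_file)  # trust only this cert
    ctx.check_hostname = False  # identity comes from the pinned cert, not DNS
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

<p>A single long-lived connection over such a pair of contexts gives both sides mutual authentication and forward secrecy for every message, instead of re-establishing trust per request.</p>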
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We had a look at how to make Salt post-quantum secure. While doing so, we found and helped fix several issues. We see a clear path forward to a future post-quantum Salt based on mTLS. Salt is but one system: we will continue our work, checking system-by-system, collaborating with vendors to bring post-quantum into the present.</p><p>With thanks to Bas and Sofía for their help on the project.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Post-Quantum]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">3GLfe9jBqeJ4TF1vpYmFzd</guid>
            <dc:creator>Lenka Mareková</dc:creator>
        </item>
        <item>
            <title><![CDATA[Research Directions in Password Security]]></title>
            <link>https://blog.cloudflare.com/research-directions-in-password-security/</link>
            <pubDate>Thu, 14 Oct 2021 12:59:14 GMT</pubDate>
            <description><![CDATA[ We've been studying password problems, including malicious logins using compromised credentials. Here's what we learned and here's where we think we can go from here with safer password systems. ]]></description>
            <content:encoded><![CDATA[ <p>As Internet users, we all deal with passwords every day. With so many different services, each with their own login systems, we have to somehow keep track of the credentials we use with each of these services. This situation leads some users to delegate credential storage to password managers like LastPass or a browser-based password manager, but this is far from universal. Instead, many people still rely on old-fashioned human memory, which has its limitations — leading to reused passwords and to security problems. This blog post discusses how Cloudflare Research is exploring ways to minimize password exposure and thwart password attacks.</p>
    <div>
      <h2>The Problem of Password Reuse</h2>
      <a href="#the-problem-of-password-reuse">
        
      </a>
    </div>
    <p>Because it’s too difficult to remember many distinct passwords, people often reuse them across different online services. When breached password datasets are leaked online, attackers can take advantage of these to conduct “credential stuffing attacks”. In a credential stuffing attack, an attacker tests breached credentials against multiple online login systems in an attempt to hijack user accounts. These attacks are highly effective because users tend to reuse the same credentials across different websites, and they have quickly become one of the most prevalent types of online guessing attacks. Automated attacks can be run at a large scale, testing out exposed passwords across multiple systems, under the assumption that some of these passwords will unlock accounts somewhere else (if they have been reused). When a data breach is detected, users of that service will likely receive a security notification and will reset that account password. However, if this password was reused elsewhere, they may easily forget that it needs to be changed for those accounts as well.</p><p>How can we protect against credential stuffing attacks? There are a number of methods that have been deployed — with varying degrees of success. Password managers address the problem of remembering a strong, unique password for every account, but many users have yet to adopt them. Multi-factor authentication is another potential solution — that is, using another form of authentication in addition to the username/password pair. This can work well, but has limits: for example, such solutions may rely on specialized hardware that not all clients have. Consumer systems are often reluctant to mandate multi-factor authentication, given concerns that people may find it too complicated to use; companies do not want to deploy something that risks impeding the growth of their user base.</p><p>Since there is no perfect solution, security researchers continue to try to find improvements. 
Two different approaches we will discuss in this blog post are hardening password systems using cryptographically secure keys, and detecting the reuse of compromised credentials, so they don’t leave an account open to guessing attacks.</p>
    <div>
      <h2>Improved Authentication with PAKEs</h2>
      <a href="#improved-authentication-with-pakes">
        
      </a>
    </div>
    <p>Investigating how to securely authenticate a user using only what they can remember has long been an important area of research in secure communication. To this end, the subarea of cryptography known as Password Authenticated Key Exchange (PAKE) came about. PAKEs deal with protocols for establishing cryptographically secure keys where the only source of authentication is a human-memorizable (low-entropy, attacker-guessable) password — that is, the “what you know” side of authentication.</p><p>Before diving into the details, we’ll provide a high-level overview of the basic problem. Although passwords are typically protected in transit by being sent over HTTPS, servers handle them in <i>plaintext</i> to verify them once they arrive. Handling plaintext passwords increases security risk — for instance, they might get inadvertently logged and exposed. Ideally, the user’s password never gets sent to the server in the first place. This is where PAKEs come in — a means of verifying that the user and server share a password, ideally without revealing information about the password that could help attackers to discover or crack it.</p>
    <div>
      <h3>A few words on PAKEs</h3>
      <a href="#a-few-words-on-pakes">
        
      </a>
    </div>
    <p>PAKE protocols let two parties turn a password into a shared key. Each party only gets one guess at the password the other holds. If a user tries to log in to the wrong server with a PAKE, that server will not be able to turn around and impersonate the user. As such, PAKEs guarantee that communication with one of the parties is the only way for an attacker to test their (single) password guess. This may seem like an unneeded level of complexity when we could use already available tools like a key distribution mechanism along with <a href="/opaque-oblivious-passwords/">password-over-TLS</a>, but this puts a lot of trust in the service. You may trust a service with your password for that service, but what if you accidentally use a password for a different service when trying to log in? Note the particular risks of a reused password: it is no longer just a secret shared between a user and a <i>single</i> service, but is now a secret shared between a user and <i>multiple</i> services. This therefore increases the password’s privacy sensitivity — a service should not know users’ account login information for <i>other</i> services.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/953CGN6o81Ak25ZlQLpdL/3b467111621985bded03b137ad464c1c/ifYbW6e-iYo9uU3KtBcbMzgURxbxi5Q-29EBuZT46b9RypGo3zco16W3sx9UuACOVKTpKe6ZiJq3QUoZQM3v9_G22YDk6cQJeXIVOlRbjP2C0ssQrMVG7qhDM7XM.png" />
            
            </figure><p>A comparison of shared secrets between passwords over TLS versus PAKEs. With passwords over TLS, a service might learn passwords used on another service. This problem does not arise with PAKEs.</p><p>PAKE protocols are built with the assumption that the server isn’t always working in the best interest of the client and, even more, cannot use any kind of public-key infrastructure during login (although it doesn’t hurt to have both!). This precludes the user from sending their plaintext password (or any information that could be used to derive it — in a computational sense) to the server during login.</p><p>PAKE protocols have expanded into new territory since the <a href="https://scholar.google.com/scholar?cluster=2364773997049033938&amp;hl=en&amp;as_sdt=0,31">seminal EKE paper of Bellovin and Merritt</a>, where the client and server both remembered a plaintext version of the password. As mentioned above, when the server stores the plaintext password, the client risks having the password logged or leaked. To address this, new protocols were developed, referred to as augmented, verifier-based, or <b>asymmetric PAKE</b>s (aPAKEs), where the server stored a modified version (similar to a hash) of the password instead of the plaintext password. This mirrors the way many of us were <a href="/keeping-passwords-safe-by-staying-up-to-date/">taught to store passwords</a> in a database, specifically as a hash of the password with accompanying salt and pepper. However, in these cases, attackers can still use traditional methods of attack such as targeted rainbow tables. To avoid these kinds of attacks, a new kind of PAKE was born, the <b>strong asymmetric PAKE</b> (saPAKE).</p><p><a href="https://scholar.google.com/scholar?cluster=5047618283495879801&amp;hl=en&amp;as_sdt=0,31">OPAQUE was the first saPAKE</a> and it guarantees defense against precomputation by hiding the password dictionary itself! 
It does this by replacing the noninteractive hash function with an interactive protocol referred to as an Oblivious Pseudorandom Function (OPRF) where one party inputs their “salt”, another inputs their “password”, and only the password-providing party learns the output of the function. The fact that the password-providing party learns nothing (computationally) about the salt prevents offline precomputation by disallowing an attacker from evaluating the function in their head.</p><p>Another way to think about the three PAKE paradigms has to do with how each of them treats the password dictionary:</p><table><thead><tr><th>PAKE type</th><th>Password Dictionary</th><th>Threat Model</th></tr></thead><tbody><tr><td>PAKE</td><td>The password dictionary is public and common to every user.</td><td>Without any guessing, the attacker learns the user’s password upon compromise of the server.</td></tr><tr><td>aPAKE</td><td>Each user gets their own password dictionary; a description of the dictionary (e.g., the “salt”) is leaked to the client when they attempt to log in.</td><td>The attacker must perform an independent precomputation for each client they want to attack.</td></tr><tr><td>saPAKE (e.g., OPAQUE)</td><td>Each user gets their own password dictionary; the server only provides an online interface (the OPRF) to the dictionary.</td><td>The adversary must wait until after they compromise the server to run an offline attack on the user’s password<sup>1</sup>.</td></tr></tbody></table><p>OPAQUE also goes one step further and allows the user to perform the password transformation on their own device so that the server doesn’t see the plaintext password during registration either. 
Cloudflare Research has been involved with OPAQUE for a while now — for instance, you can read about our previous <a href="/opaque-oblivious-passwords/">implementation work and demo</a> if you want to learn more.</p><p>But OPAQUE is not a panacea: in the event of server compromise, the attacker can learn the salt that the server uses to evaluate the OPRF and can still run the same offline attack that was available in the aPAKE world, although this is now considerably more time-consuming and can be made increasingly difficult through the use of memory-hard hash functions like scrypt. This means that despite our best efforts, when a server is breached, the attacker can eventually come out with a list of plaintext passwords. Indeed, this attack is always inevitable as the attacker can always run the (sa)PAKE protocol in their head acting as both parties to test each password. With this being the case, we still need to take steps to defend against automated password attacks such as credential stuffing attacks and have ways of mitigating them.</p>
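<p>The blind/evaluate/unblind flow of an OPRF can be sketched in a toy group. Real OPAQUE deployments use elliptic-curve groups and a vetted hash-to-group map; the tiny safe prime, the generator, and the function names below are purely illustrative:</p>

```python
import hashlib
import secrets

# Toy group: safe prime P = 2Q + 1; G = 4 generates the order-Q subgroup.
# Real OPRF deployments use elliptic-curve groups, not parameters this small.
P, Q, G = 1019, 509, 4

def hash_to_group(password: bytes) -> int:
    # Map the password to a non-identity element of the order-Q subgroup.
    e = int.from_bytes(hashlib.sha256(password).digest(), "big") % (Q - 1) + 1
    return pow(G, e, P)

def client_blind(password: bytes):
    r = secrets.randbelow(Q - 1) + 1               # random blinding factor
    return r, pow(hash_to_group(password), r, P)   # server sees only this

def server_evaluate(k: int, blinded: int) -> int:
    return pow(blinded, k, P)                      # k is the server's secret "salt"

def client_unblind(r: int, evaluated: int) -> int:
    r_inv = pow(r, -1, Q)                          # inverse of r mod the group order
    return pow(evaluated, r_inv, P)                # equals H(pw)^k

# The client learns H(pw)^k without revealing pw; the server never learns pw
# and the client never learns k.
k = secrets.randbelow(Q - 1) + 1
r, blinded = client_blind(b"hunter2")
out = client_unblind(r, server_evaluate(k, blinded))
assert out == pow(hash_to_group(b"hunter2"), k, P)
```

<p>Unblinding works because exponents compose modulo the group order: (H(pw)<sup>r</sup>)<sup>k</sup> raised to r<sup>-1</sup> equals H(pw)<sup>k</sup>, while the blinded value the server sees is uniformly random and so reveals nothing about the password.</p>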
    <div>
      <h2>Are You Overexposed?</h2>
      <a href="#are-you-overexposed">
        
      </a>
    </div>
    <p>To help detect and respond to credential stuffing, Cloudflare recently <a href="/account-takeover-protection/">rolled out the Exposed Credential Checks feature</a> on the <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">Web Application Firewall (WAF)</a>, which can alert the origin if a user’s login credentials have appeared in a recent breach. Historically, compromised credential checking services have allowed users to be proactive against credential stuffing attacks when their username and password appear together in a breach. However, they do not account for <a href="https://scholar.google.com/scholar?cluster=1197873264049178787&amp;hl=en&amp;as_sdt=0,31">recently proposed credential tweaking attacks</a>, in which an attacker tries <i>variants</i> of a breached password, under the assumption that users often use slight modifications of the same password for different accounts, such as “sunshineFB”, “sunshineIG”, and so on. Therefore, compromised credential check services should incorporate methods of checking for credential tweaks.</p><p>Under the hood, Cloudflare’s Exposed Credential Checks feature relies on a protocol called <a href="https://scholar.google.com/scholar?cluster=13673195091406912846&amp;hl=en&amp;as_sdt=0,31">Might I Get Pwned (MIGP)</a>. MIGP uses the bucketization method proposed in <a href="https://scholar.google.com/scholar?cluster=11401588548541554564&amp;hl=en&amp;as_sdt=0,31">Li et al.</a> to avoid sending the plaintext username or password to the server while handling a large breach dataset. After receiving a user’s credentials, MIGP hashes the username and sends a portion of that hash as a “bucket identifier” to the server. 
The client and server can then perform a private membership test protocol to verify whether the user’s username/password pair appeared in that bucket, without ever having to send plaintext credentials to the server.</p><p>Unlike previous compromised credential check services, MIGP also enables credential tweaking checks by augmenting the original breach dataset with a set of password “variants”. For each leaked password, it generates a list of password variants, which are labeled as such to differentiate them from the original leaked password and added to the original dataset. For more information, you can check out the Cloudflare Research <a href="/privacy-preserving-compromised-credential-checking">blog post</a> detailing our open-source implementation and deployment of the MIGP protocol.  </p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6s0cOYGpG1oabqonc0mF9O/11882b44b6803b77a7f7235c0e4efb01/image2-24.png" />
            
            </figure>
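<p>The bucketization step can be sketched as follows. The hash choice, prefix length, and username normalization below are illustrative assumptions, not MIGP’s actual parameters:</p>

```python
import hashlib

BUCKET_PREFIX_BYTES = 2  # hypothetical: a 16-bit bucket space (~65k buckets)

def bucket_id(username: str) -> str:
    """Derive the bucket identifier sent to the server.

    Only a short prefix of the username hash leaves the client, so the server
    cannot recover the username from what it receives.
    """
    digest = hashlib.sha256(username.lower().encode()).digest()
    return digest[:BUCKET_PREFIX_BYTES].hex()
```

<p>Many usernames map to the same bucket, which is what provides the anonymity set: the server learns only that the queried username is one of the many that hash into that bucket, and the private membership test does the rest.</p>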
    <div>
      <h2>Measuring Credential Compromises</h2>
      <a href="#measuring-credential-compromises">
        
      </a>
    </div>
    <p>The question remains, just how important are these exposed credential checks for detecting and preventing credential stuffing attacks in practice? To answer this question, the Research Team has initiated a study investigating login requests to our own Cloudflare dashboard. For this study, we are collecting the data logged by Cloudflare’s Exposed Credential Check feature (described above), designed to be privacy-preserving: this check does not reveal a password, but provides a “yes/no” response on whether the submitted credentials appear in our breach dataset. Along with this signal, we are looking at other fields that may be indicative of malicious behavior such as <a href="/mitigating-bot-attacks-against-cloudflare/">bot score</a> and IP reputation. As this project develops, we plan to cluster the data to find patterns of different types of credential stuffing attacks that we can generalize to form attack fingerprints. We can then feed these fingerprints into the alert logs for the Cloudflare Detection &amp; Response team to see if they provide useful information for the security analysts.</p><p>Additionally, we hope to investigate potential post-compromise behavior as it relates to these compromise check fields. After an attacker successfully hijacks an account, they may take a number of actions such as changing the password, revoking all valid access tokens, or setting up a malicious script. By analyzing compromised credential checks along with these signals, we may be able to better differentiate benign from malicious behavior.</p>
    <div>
      <h3>Future directions: OPAQUE and MIGP combined</h3>
      <a href="#future-directions-opaque-and-migp-combined">
        
      </a>
    </div>
    <p>This post has discussed how we’re approaching the problem of preventing credential stuffing attacks from two different angles. Through the deployment and analysis of compromised credential checks, we aim to prevent server compromise by detecting and preventing credential stuffing attacks before they happen. In addition, in the case that a server does get compromised, the wider use of OPAQUE would help address the problem of leaking passwords to an attacker by avoiding the reception and storage of plaintext passwords on the server as well as preventing precomputation attacks.</p><p>However, there are still remaining research challenges to address. Notably, the current method for interfacing with MIGP still requires the server to either pass along a plaintext version of the client’s password, or trust the client to honestly communicate with the MIGP service on behalf of the server. If we want to leverage the security guarantees of OPAQUE (or generally an saPAKE) with the analytics and alert system provided by MIGP in a privacy-preserving way, we need additional mechanisms.</p><p>At first glance, the privacy-preserving goals of both protocols seem to be perfect matches for each other. Both OPAQUE and MIGP are built upon the idea of replacing the traditional salted password hashes with an OPRF as a way of keeping the client’s plaintext passwords from ever leaving their device. However, both the interfaces for these protocols rely on user-provided inputs which aren’t cryptographically tied to each other. This allows an attacker to provide a false password to MIGP while providing their actual password to the OPAQUE server. Further, the security analyses of both protocols assume that their idealized building blocks are separated in an important way. 
This isn’t to say that the two protocols are incompatible; indeed, much of both protocols can be salvaged.</p><p>The next stage for password privacy will be an integration of these two protocols, so that a server can be made aware of credential stuffing attacks and patterns of compromised account usage, protecting it against the compromise of other servers while providing the same privacy guarantees that OPAQUE does. Our goal is to allow you to protect yourself from <i>other</i> compromised servers while protecting your clients from compromise of <i>your</i> server. Stay tuned for updates!</p><p>We’re always keen to collaborate with others to build more secure systems, and would love to hear from those interested in password research. You can reach us with questions, comments, and research ideas at <a>ask-research@cloudflare.com</a>. For those interested in joining our team, please visit our <a href="https://www.cloudflare.com/careers/jobs/?department=Technology%20Research&amp;location=default">Careers Page</a>.</p><p>...</p><p><sup>1</sup>There are other ways of constructing saPAKE protocols. The curious reader can see <a href="https://scholar.google.com/scholar?cluster=7458935618932751816&amp;hl=en&amp;as_sdt=0,31">this CRYPTO 2019 paper</a> for details.</p> ]]></content:encoded>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Passwords]]></category>
            <guid isPermaLink="false">5KZNCnkEzTgAysfN3nPrLX</guid>
            <dc:creator>Ian McQuoid</dc:creator>
            <dc:creator>Marina Sanusi</dc:creator>
            <dc:creator>Tara Whalen</dc:creator>
        </item>
        <item>
            <title><![CDATA[OPAQUE: The Best Passwords Never Leave your Device]]></title>
            <link>https://blog.cloudflare.com/opaque-oblivious-passwords/</link>
            <pubDate>Tue, 08 Dec 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Imagine passwords for online services that never leave your device, encrypted or otherwise. OPAQUE is a new cryptographic protocol that makes this idea possible, giving you and only you full control of your password. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/ooidY51NqRXSzRAvVf2Kc/d952f9c1e38f135a535cbd1641b03d0f/Opaque-Header-1.png" />
            
            </figure><p><i>Update: On January 19, 2022, we added </i><a href="https://opaque-full.research.cloudflare.com/"><i>a new demo for the OPAQUE protocol</i></a><i>.</i></p><p>Passwords are a problem. They are a problem for reasons that are familiar to most readers. For us at Cloudflare, the problem lies much deeper and broader. Most readers will immediately acknowledge that passwords are hard to remember and manage, especially as password requirements grow increasingly complex. Luckily there are great software packages and browser add-ons to help manage passwords. Unfortunately, the greater underlying problem is beyond the reaches of software to solve.</p><p>The fundamental password problem is simple to explain, but hard to solve: A password that leaves your possession is guaranteed to sacrifice security, no matter its complexity or how hard it may be to guess. Passwords are insecure by their very existence.</p><p>You might say, “but passwords are always stored in encrypted format!” That would be great. More accurately, they are likely stored as a salted hash, as explained below. Even worse is that there is no way to verify the way that passwords are stored, and so we can assume that on some servers passwords are stored in cleartext. The truth is that even responsibly stored passwords can be leaked and broken, albeit (and thankfully) with enormous effort. An increasingly pressing problem stems from the nature of passwords themselves: any direct use of a password, today, means that the password must be handled in the clear.</p><p>You say, “but my password is transmitted securely over HTTPS!” This is true.</p><p>You say, “but I know the server stores my password in hashed form, secure so no one can access it!” Well, this puts a lot of faith in the server. Even so, let’s just say that yes, this may be true, too.</p><p>There remains, however, an important caveat — a gap in the end-to-end use of passwords. 
Consider that once a server receives a password, between being securely transmitted and securely stored, the password has to be read and processed. Yes, as cleartext!</p><p>And it gets worse — because so many are used to thinking in software, it’s easy to forget about the vulnerability of hardware. This means that even if the software is somehow trusted, the password must at some point reside in memory. The password must at some point be transmitted over a shared bus to the CPU. These provide vectors of attack to onlookers in many forms. Of course, these attack vectors are far less likely than those presented by transmission and permanent storage, but they are no less severe (recent CPU vulnerabilities such as Spectre and Meltdown should serve as a stark reminder).</p><p>The only way to fix this problem is to get rid of passwords altogether. There is hope! Research and private sector communities are working hard to do just that. New standards are emerging and growing mature. Unfortunately, passwords are so ubiquitous that it will take a long time to agree on and supplant passwords with new standards and technology.</p><p>At Cloudflare, we’ve been asking if there is something that can be done now, imminently. Today’s deep-dive into OPAQUE is one possible answer. OPAQUE is one among many examples of systems that enable a password to be useful without it ever leaving your possession. No one likes passwords, but as long as they’re in use, at least we can ensure they are never given away.</p><p>I’ll be the first to admit that password-based authentication is annoying. Passwords are hard to remember, tedious to type, and notoriously insecure. Initiatives to reduce or replace passwords are promising. For example, <a href="/cloudflare-now-supports-security-keys-with-web-authentication-webauthn/">WebAuthn</a> is a standard for web authentication based primarily on public key cryptography using hardware <a href="https://github.com/github/SoftU2F">(or software)</a> tokens. 
Even so, passwords are frustratingly persistent as an authentication mechanism. Whether their persistence is due to their ease of implementation, familiarity to users, or simple ubiquity on the web and elsewhere, we’d like to make password-based authentication as secure as possible while they persist.</p><p>My internship at Cloudflare focused on OPAQUE, a cryptographic protocol that solves one of the most glaring security issues with password-based authentication on the web: though passwords are typically protected in transit by HTTPS, <b>servers</b> <b>handle them in plaintext</b> to check their correctness. Handling plaintext passwords is dangerous, as accidentally logging or caching them could lead to a catastrophic breach. The goal of the project, rather than to advocate for adoption of any particular protocol, is to show that OPAQUE is a viable option among many for authentication. Because the web case is most familiar to me, and likely many readers, I will use the web as my main example.</p>
    <div>
      <h3>Web Authentication 101: Password-over-TLS</h3>
      <a href="#web-authentication-101-password-over-tls">
        
      </a>
    </div>
    <p>When you type in a password on the web, what happens? The website must check that the password you typed is the same as the one you originally registered with the site. But how does this check work?</p><p>Usually, your username and password are sent to a server. The server then checks if the registered password associated with your username matches the password you provided. Of course, to prevent an attacker eavesdropping on your Internet traffic from stealing your password, your connection to the server should be encrypted via HTTPS (HTTP-over-TLS).</p><p>Despite use of HTTPS, there still remains a glaring problem in this flow: the server must store a representation of your password somewhere. Servers are hard to secure, and breaches are all too common. Leaking this representation can cause catastrophic security problems. (For records of the latest breaches, check out <a href="https://haveibeenpwned.com/">https://haveibeenpwned.com/</a>).</p><p>To make these leaks less devastating, servers often apply a <i>hash function</i> to user passwords. A hash function maps each password to a unique, random-looking value. It’s easy to apply the hash to a password, but almost impossible to reverse the function and retrieve the password. (That said, anyone can guess a password, apply the hash function, and check if the result is the same.)</p><p>With password hashing, plaintext passwords are no longer stored on servers.  An attacker who steals a password database no longer has direct access to passwords. Instead, the attacker must apply the hash to many possible passwords and compare the results with the leaked hashes.</p><p>Unfortunately, if a server hashes only the passwords, attackers can download precomputed <i>rainbow tables</i> containing hashes of trillions of possible passwords and almost instantly retrieve the plaintext passwords. 
(See <a href="https://project-rainbowcrack.com/table.htm">https://project-rainbowcrack.com/table.htm</a> for a list of some rainbow tables).</p><p>With this in mind, a good defense-in-depth strategy is to use <i>salted</i> hashing, where the server hashes your password appended to a random, per-user value called a <i>salt</i>. The server also saves the salt alongside the username, so the user never sees or needs to submit it. When the user submits a password, the server re-computes this hash function using the salt. An attacker who steals password data, i.e., the password representations and salt values, must then guess common passwords one by one and apply the (salted) hash function to each guessed password. Existing rainbow tables won’t help because they don’t take the salts into account, so the attacker needs to make a new rainbow table for each user!</p><p>This (hopefully) slows down the attack enough for the service to inform users of a breach, so they can change their passwords. In addition, the salted hashes should be <i>hardened</i> by applying a hash many times to further slow attacks. (See <a href="/keeping-passwords-safe-by-staying-up-to-date/">https://blog.cloudflare.com/keeping-passwords-safe-by-staying-up-to-date/</a> for a more detailed discussion).</p><p>These two mitigation strategies — encrypting the password in transit and storing salted, hardened hashes — are the current best practices.</p><p>A large security hole remains open. <i>Password-over-TLS</i> (as we will call it) requires users to <b>send plaintext passwords to servers during login</b>, because servers must see these passwords to match against registered passwords on file. Even a well-meaning server could accidentally cache or log your password attempt(s), or become corrupted in the course of checking passwords. 
(For example, Facebook detected in 2019 that it had <a href="https://about.fb.com/news/2019/03/keeping-passwords-secure/">accidentally been storing hundreds of millions of plaintext user passwords</a>). Ideally, servers should never see a plaintext password at all.</p><p>But that’s quite a conundrum: how can you check a password if you never see the password? Enter OPAQUE: a Password-Authenticated Key Exchange (PAKE) protocol that simultaneously proves knowledge of a password and derives a secret key. Before describing OPAQUE in detail, we’ll first summarize PAKE functionalities in general.</p>
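<p>To make the salted, hardened hashing described above concrete, here is a minimal sketch in Go. The function names are our own, and iterated SHA-256 stands in for illustration only; a real deployment would use a memory-hard password hash such as scrypt or Argon2.</p>

```go
package main

import (
	"bytes"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// hardenedHash hashes salt||password, then re-hashes the result
// `rounds` times to slow down brute-force guessing.
func hardenedHash(password string, salt []byte, rounds int) []byte {
	h := sha256.Sum256(append(salt, password...))
	for i := 0; i < rounds; i++ {
		h = sha256.Sum256(h[:])
	}
	return h[:]
}

// register picks a fresh random salt per user; the server stores
// (salt, hash), never the plaintext password.
func register(password string) (salt, hash []byte) {
	salt = make([]byte, 16)
	rand.Read(salt)
	return salt, hardenedHash(password, salt, 10000)
}

// verify recomputes the salted hash and compares it to the stored one.
func verify(password string, salt, stored []byte) bool {
	return bytes.Equal(hardenedHash(password, salt, 10000), stored)
}

func main() {
	salt, hash := register("hunter2")
	fmt.Println(verify("hunter2", salt, hash)) // true
	fmt.Println(verify("wrong", salt, hash))   // false
}
```

<p>Because the salt is random and per-user, a precomputed rainbow table is useless: the attacker must redo the work for every user, and only after stealing the database.</p>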
    <div>
      <h3>Password Proofs with Password-Authenticated Key Exchange</h3>
      <a href="#password-proofs-with-password-authenticated-key-exchange">
        
      </a>
    </div>
<p>Password-Authenticated Key Exchange (PAKE) was proposed by Bellovin and Merritt [1] in 1992, with an initial motivation of allowing password-authentication without the possibility of dictionary attacks based on data transmitted over an insecure channel.</p><p>Essentially, a plain, or <i>symmetric</i>, PAKE is a cryptographic protocol that allows two parties who share only a password to establish a strong shared secret key. The goals of PAKE are:</p><ol><li><p>The secret keys will match if the passwords match, and appear random otherwise.</p></li><li><p>Participants do not need to trust third parties (in particular, no Public Key Infrastructure).</p></li><li><p>The resulting secret key is not learned by anyone not participating in the protocol, including those who know the password.</p></li><li><p>The protocol does not reveal either party’s password to the other (unless the passwords match), or to eavesdroppers.</p></li></ol><p>In sum, the only way to successfully attack the protocol is to guess the password correctly while participating in the protocol.
(Luckily, such attacks can be mostly thwarted by rate-limiting, i.e., blocking a user from logging in after a certain number of incorrect password attempts).</p><p>Given these requirements, password-over-TLS is clearly <i>not</i> a PAKE, because:</p><ul><li><p>It relies on WebPKI, which places trust in third parties called Certificate Authorities (see <a href="/introducing-certificate-transparency-and-nimbus/">https://blog.cloudflare.com/introducing-certificate-transparency-and-nimbus/</a> for an in-depth explanation of WebPKI and some of its shortcomings).</p></li><li><p>The user’s password is revealed to the server.</p></li><li><p>Password-over-TLS provides the user no assurance that the server knows their password or a derivative of it — a server could accept any input from the user with no checks whatsoever.</p></li></ul><p>That said, plain PAKE is still worse than Password-over-TLS, simply because it requires the server to <i>store</i> plaintext passwords. We need a PAKE that lets the server store salted hashes if we want to beat the current practice.</p><p>An improvement over plain PAKE is what’s called an <i>asymmetric</i> PAKE (aPAKE), because only the client knows the password, and the server knows a <i>hashed</i> password. An aPAKE has the four properties of PAKE, plus one more:</p><ol start="5"><li><p>An attacker who steals password data stored on the server must perform a dictionary attack to retrieve the password.</p></li></ol><p>The issue with most existing aPAKE protocols, however, is that they do not allow for a <i>salted</i> hash (or if they do, they require that salt to be transmitted to the user, which means the attacker has access to the salt beforehand and can begin computing a rainbow table for the user before stealing any data).
We’d like, therefore, to upgrade the security property as follows:</p><p>5*) An attacker who steals password data stored on the server must perform a <i>per-user</i> dictionary attack to retrieve the password <i>after the data is compromised</i>.</p><p>OPAQUE is the first aPAKE protocol with a formal security proof that has this property: it allows for a completely secret salt.</p>
    <div>
      <h3>OPAQUE - Servers safeguard secrets without knowing them!</h3>
      <a href="#opaque-servers-safeguard-secrets-without-knowing-them">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/39WP72kgESNW6mjf08g5UM/db539617b444e7683825d02b9d704b14/opaque-wordmark.png" />
            
</figure><p><a href="https://eprint.iacr.org/2018/163.pdf">OPAQUE</a> is what’s referred to as a <i>strong aPAKE</i>, which simply means that it resists these pre-computation attacks by using a secretly salted hash on the server. OPAQUE was proposed and formally analyzed by Stanislaw Jarecki, Hugo Krawczyk and Jiayu Xu in 2018 (full disclosure: Stanislaw Jarecki is my academic advisor). The name OPAQUE is a combination of the names of two cryptographic protocols: OPRF and PAKE. We already know PAKE, but what is an OPRF? OPRF stands for Oblivious Pseudo-Random Function, which is a protocol by which two parties compute a function <i>F(key, x)</i> that is deterministic but outputs random-looking values. One party inputs the value <i>x</i>, and another party inputs the key: the party who inputs <i>x</i> learns the result <i>F(key, x)</i> but not the key, and the party providing the key learns nothing. (You can dive into the math of OPRFs here: <a href="/privacy-pass-the-math/">https://blog.cloudflare.com/privacy-pass-the-math/</a>).</p><p>The core of OPAQUE is a method to store user secrets for safekeeping on a server, without giving the server access to those secrets. Instead of storing a traditional salted password hash, the server stores a secret envelope for you that is “locked” by two pieces of information: your password known only by you, and a random secret key (like a salt) known only by the server. To log in, the client initiates a cryptographic exchange that reveals the envelope key to the client, but, importantly, not to the server.</p><p>The server then sends the envelope to the user, who now can retrieve the encrypted keys. (The keys included in the envelope are a private-public key pair for the user, and a public key for the server.)
These keys, once unlocked, will be the inputs to an Authenticated Key Exchange (AKE) protocol, which allows the user and server to establish a secret key which can be used to encrypt their future communication.</p><p>OPAQUE consists of two phases: credential registration, and login via key exchange.</p>
    <div>
      <h3>OPAQUE: Registration Phase</h3>
      <a href="#opaque-registration-phase">
        
      </a>
    </div>
<p>Before registration, the user first signs up for a service and picks a username and password. Registration begins with the OPRF flow we just described: Alice (the user) and Bob (the server) do an OPRF exchange. The result is that Alice has a random key <b><i>rwd</i></b>, derived from the OPRF output <i>F(key, pwd)</i>, where <i>key</i> is a server-owned OPRF key specific to Alice and <i>pwd</i> is Alice’s password.</p><p>Within his OPRF message, Bob sends the public key for his OPAQUE identity. Alice then generates a new private/public key pair, which will be her persistent OPAQUE identity for Bob’s service, and encrypts <i>her</i> private key along with <i>Bob’s</i> public key with the <b><i>rwd</i></b> (we will call the result an <i>encrypted envelope</i>). She sends this encrypted envelope along with her public key (unencrypted) to Bob, who stores the data she provided, along with Alice’s specific OPRF key, in a database indexed by her username.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5FwREcK0qMw6EVKrxDjUyg/399a52baf731039c16d03644a57cb5f2/OPAQUE-diagram-1_3x.png" />
            
            </figure>
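<p>The envelope step of this flow can be sketched as follows (a simplified illustration in Go with hypothetical helper names; the OPAQUE specification defines the exact envelope contents and key schedule). The envelope is sealed under a key derived from <i>rwd</i>, so only someone who can re-derive <i>rwd</i>, that is, who knows the password and runs the OPRF with the server, can open it:</p>

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// sealEnvelope encrypts Alice's private key together with Bob's public
// key under a key derived from rwd (the OPRF output). The server
// stores the result without being able to read it.
func sealEnvelope(rwd, clientPriv, serverPub []byte) ([]byte, error) {
	key := sha256.Sum256(rwd)
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	plaintext := append(append([]byte{}, clientPriv...), serverPub...)
	// Prepend the nonce so openEnvelope can recover it.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openEnvelope reverses sealEnvelope; it fails (as in the login flow)
// when rwd is wrong, e.g. because the password was mistyped.
func openEnvelope(rwd, envelope []byte) ([]byte, error) {
	key := sha256.Sum256(rwd)
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce, ct := envelope[:gcm.NonceSize()], envelope[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	env, _ := sealEnvelope([]byte("rwd-from-oprf"), []byte("alice-priv"), []byte("bob-pub"))
	keys, err := openEnvelope([]byte("rwd-from-oprf"), env)
	fmt.Println(err == nil, string(keys))
	_, err = openEnvelope([]byte("wrong-rwd"), env)
	fmt.Println(err != nil) // true: decryption fails with the wrong rwd
}
```

<p>Using an authenticated cipher (AES-GCM here) matters: a failed decryption is detected rather than silently producing garbage keys, which is what lets Alice abort the login when her password is wrong.</p>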
    <div>
      <h3>OPAQUE: Login Phase</h3>
      <a href="#opaque-login-phase">
        
      </a>
    </div>
    <p>The login phase is very similar. It starts the same way as registration — with an OPRF flow. However, on the server side, instead of generating a new OPRF key, Bob instead looks up the one he created during Alice’s registration. He does this by looking up Alice’s username (which she provides in the first message), and retrieving his record of her. This record contains her public key, her encrypted envelope, and Bob’s OPRF key for Alice.</p><p>He also sends over the encrypted envelope which Alice can decrypt with the output of the OPRF flow. (If decryption fails, she aborts the protocol — this likely indicates that she typed her password incorrectly, or Bob isn’t who he says he is). If decryption succeeds, she now has her own secret key and Bob’s public key. She inputs these into an AKE protocol with Bob, who, in turn, inputs his private key and her public key, which gives them both a fresh shared secret key.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/75WquAD9w7Z3AJ8RcBdtiF/03e77ccaad51380d96fcc8e1c395d1b8/OPAQUE-diagram-2_3x.png" />
            
            </figure>
    <div>
      <h3>Integrating OPAQUE with an AKE</h3>
      <a href="#integrating-opaque-with-an-ake">
        
      </a>
    </div>
    <p>An important question to ask here is: what AKE is suitable for OPAQUE? The <a href="https://tools.ietf.org/html/draft-irtf-cfrg-opaque-01">emerging CFRG specification</a> outlines several options, including 3DH and SIGMA-I. However, on the web, we already have an AKE at our disposal: TLS!</p><p>Recall that TLS is an AKE because it provides unilateral (and mutual) authentication with shared secret derivation. The core of TLS is a Diffie-Hellman key exchange, which by itself is <i>unauthenticated</i>, meaning that the parties running it have no way to verify who they are running it with. (This is a problem because when you log into your bank, or any other website that stores your private data, you want to be sure that they are who they say they are). Authentication primarily uses certificates, which are issued by trusted entities through a system called <a href="/how-to-build-your-own-public-key-infrastructure/">Public Key Infrastructure (PKI)</a>. Each certificate is associated with a secret key. To prove its identity, the server presents its certificate to the client, and signs the TLS handshake with its secret key.</p><p>Modifying this ubiquitous certificate-based authentication on the web is perhaps not the best place to start. Instead, an improvement would be to authenticate the TLS shared secret, <i>using</i> OPAQUE, after the TLS handshake completes. In other words, once a server is authenticated with its typical WebPKI certificate, clients could subsequently authenticate to the server. This authentication could take place “post handshake” in the TLS connection using OPAQUE.</p><p><a href="https://datatracker.ietf.org/doc/draft-ietf-tls-exported-authenticator/">Exported Authenticators</a> are one mechanism for “post-handshake” authentication in TLS. They allow a server or client to provide proof of an identity without setting up a new TLS connection. 
Recall that in the standard web case, the server establishes their identity with a certificate (proving, for example, that they are “cloudflare.com”). But if the same server also holds alternate identities, they must run TLS again to prove who they are.</p><p>The basic Exported Authenticator flow resembles a classical challenge-response protocol, and works as follows. (We’ll consider the server authentication case only, as the client case is symmetric).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3eYv5LQOx1cD9iTFxqfkAC/3231d07f2a1b4a140b345454dc2ca2e0/OPAQUE-diagram-3_3x.png" />
            
            </figure><p>At any point after a TLS connection is established, Alice (the client) sends an <i>authenticator request</i> to indicate that she would like Bob (the server) to prove an additional identity. This request includes a context (an unpredictable string — think of this as a challenge), and extensions which include information about what identity the client wants to be provided. For example, the client could include the SNI extension to ask the server for a certificate associated with a certain domain name other than the one initially used in the TLS connection.</p><p>On receipt of the client message, if the server has a valid certificate corresponding to the request, it sends back an <i>exported authenticator</i> which proves that it has the secret key for the certificate. (This message has the same format as an Auth message from the client in TLS 1.3 handshake - it contains a Certificate, a CertificateVerify and a Finished message). If the server cannot or does not wish to authenticate with the requested certificate, it replies with an empty authenticator which contains only a well formed Finished message.</p><p>The client then checks that the Exported Authenticator it receives is well-formed, and then verifies that the certificate presented is valid, and if so, accepts the new identity.</p><p>In sum, Exported Authenticators provide authentication in a higher layer (such as the application layer) safely by leveraging the well-vetted cryptography and message formats of TLS. Furthermore, it is tied to the TLS session so that authentication messages can't be copied and pasted from one TLS connection into another. In other words, Exported Authenticators provide exactly the right hooks needed to add OPAQUE-based authentication into TLS.</p>
    <div>
      <h3>OPAQUE with Exported Authenticators (OPAQUE-EA)</h3>
      <a href="#opaque-with-exported-authenticators-opaque-ea">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/13SUbsSrju6qTw13hFBhoh/8a41eefa575f8661b6c0b821d4516b72/OPAQUE-diagram-2_3x-1.png" />
            
</figure><p><a href="https://datatracker.ietf.org/doc/html/draft-sullivan-tls-opaque-00">OPAQUE-EA</a> allows OPAQUE to run at any point after a TLS connection has already been set up. Recall that Bob (the server) will store his OPAQUE identity, in this case a signing key and verification key, and Alice will store her identity — encrypted — on Bob’s server. (The registration flow where Alice stores her encrypted keys is the same as in regular OPAQUE, except she stores a signing key, so we will skip straight to the login flow). Alice and Bob run two request-authenticate EA flows, one for each party, and OPAQUE protocol messages ride along in the extensions section of the EAs. Let’s look in detail at how this works.</p><p>First, Alice generates her OPRF message based on her password. She creates an Authenticator Request asking for Bob’s OPAQUE identity, and includes (in the extensions field) her username and her OPRF message, and sends this to Bob over their established TLS connection.</p><p>Bob receives the message and looks up Alice’s username in his database. He retrieves her OPAQUE record containing her verification key and encrypted envelope, and his OPRF key. He uses the OPRF key on the OPRF message, and creates an Exported Authenticator proving ownership of his OPAQUE signing key, with an extension containing his OPRF message and the encrypted envelope. Additionally, he sends a new Authenticator Request asking Alice to prove ownership of her OPAQUE signing key.</p><p>Alice parses the message and completes the OPRF evaluation using Bob’s message to get output <i>rwd</i>, and uses <i>rwd</i> to decrypt the envelope. This reveals her signing key and Bob’s public key. She uses Bob’s public key to validate his Authenticator Response proof, and, if it checks out, she creates and sends an Exported Authenticator proving that she holds the newly decrypted signing key.
Bob checks the validity of her Exported Authenticator, and if it checks out, he accepts her login.</p>
    <div>
      <h3>My project: OPAQUE-EA over HTTPS</h3>
      <a href="#my-project-opaque-ea-over-https">
        
      </a>
    </div>
<p>Everything described above is supported by lots and lots of theory that has yet to find its way into practice. My project was to turn the theory into reality. I started with written descriptions of <a href="https://tools.ietf.org/html/draft-ietf-tls-exported-authenticator-13">Exported Authenticators</a>, <a href="https://tools.ietf.org/html/draft-irtf-cfrg-opaque-01">OPAQUE</a>, and a preliminary draft of <a href="https://tools.ietf.org/html/draft-sullivan-tls-opaque-00">OPAQUE-in-TLS</a>. My goal was to get from those to a working prototype.</p><p>My demo shows the feasibility of implementing OPAQUE-EA on the web, completely removing passwords from the wire, even in encrypted form. This provides a possible alternative to the current password-over-TLS flow with better security properties, but no visible change to the user.</p><p>A few of the implementation details are worth knowing. In computer science, abstraction is a powerful tool. It means that we can often rely on existing tools and APIs to avoid duplication of effort. In my project I relied heavily on <a href="https://github.com/bifurcation/mint">mint</a>, an open-source implementation of TLS 1.3 in Go that is great for prototyping. I also used <a href="https://github.com/cloudflare/circl/tree/master/oprf">CIRCL’s OPRF API</a>. I built libraries for Exported Authenticators, the core of OPAQUE, and OPAQUE-EA (which ties together the two).</p><p>I made the web demo by wrapping the OPAQUE-EA functionality in a simple HTTP server and client that pass messages to each other over HTTPS.
Since a browser can’t run Go, I compiled from Go to WebAssembly (WASM) to get the Go functionality in the browser, and wrote a simple script in JavaScript to call the WASM functions needed.</p><p>Because current browsers do not give access to the underlying TLS connection on the client side, I had to implement a work-around to allow the client to access the exporter keys, namely, that the server simply computes the keys and sends them to the client over HTTPS. This workaround reduces the security of the resulting demo — it means that trust is placed in the server to provide the right keys. Even so, the user’s password is still safe, even if a malicious server provided bad keys — they just don’t have assurance that they actually previously registered with that server. However, in the future, browsers could include a mechanism to support exported keys and allow OPAQUE-EA to run with its full security properties.</p><p>You can explore my implementation <a href="https://github.com/cloudflare/opaque-ea">on Github</a>, and even follow the instructions to spin up your own OPAQUE-EA test server and client. I’d like to stress, however, that the implementation is meant as a proof-of-concept only, and must not be used for production systems without significant further review.</p>
    <div>
      <h3>OPAQUE-EA Limitations</h3>
      <a href="#opaque-ea-limitations">
        
      </a>
    </div>
<p>Despite its great properties, there will definitely be some hurdles in bringing OPAQUE-EA from a proof-of-concept to a fully fledged authentication mechanism.</p><p><b>Browser support for TLS exporter keys.</b> As mentioned briefly before, to run OPAQUE-EA in a browser, you need to access secrets from the TLS connection called <i>exporter keys</i>. There is no way to do this in the current most popular browsers, so support for this functionality will need to be added.</p><p><b>Overhauling password databases.</b> To adopt OPAQUE-EA, servers need not only to update their password-checking logic, but also to completely overhaul their password databases. Because OPAQUE relies on special password representations that can only be generated interactively, existing salted hashed passwords cannot be automatically updated to OPAQUE records. Servers will likely need to run a special OPAQUE registration flow on a user-by-user basis. Because OPAQUE relies on buy-in from both the client and the server, servers may need to support the old method for a while before all clients catch up.</p><p><b>Reliance on emerging standards.</b> OPAQUE-EA relies on OPRFs, which are in the process of standardization, and Exported Authenticators, a proposed standard. This means that support for these dependencies is not yet available in most existing cryptographic libraries, so early adopters may need to implement these tools themselves.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
<p>As long as people still use passwords, we’d like to make the process as secure as possible. Current methods rely on the risky practice of handling plaintext passwords on the server side while checking their correctness. PAKEs, and specifically aPAKEs, allow secure password login without ever letting the server see the passwords.</p><p>OPAQUE is also being explored within other companies. According to Kevin Lewi, a research scientist from the Novi Research team at Facebook, they are “excited by the strong cryptographic guarantees provided by OPAQUE and are actively exploring OPAQUE as a method for further safeguarding credential-protected fields that are stored server-side.”</p><p>OPAQUE is one of the best aPAKEs out there, and can be fully integrated into TLS. You can check out the core OPAQUE implementation <a href="https://github.com/cloudflare/opaque-core">here</a> and the demo TLS integration <a href="https://github.com/cloudflare/opaque-ea">here</a>. A running version of the demo is also available <a href="https://opaque.research.cloudflare.com/">here</a>. A Typescript client implementation of OPAQUE is coming soon. If you're interested in implementing the protocol, or encounter any bugs with the current implementation, please drop us a line at <a>ask-research@cloudflare.com</a>! Consider also subscribing to the <a href="https://irtf.org/cfrg">IRTF CFRG mailing list</a> to track discussion about the OPAQUE specification and its standardization.</p><p>[1] Bellovin, S. M., and Merritt, M. “Encrypted key exchange: Password-based protocols secure against dictionary attacks.” In Proc. IEEE Computer Society Symposium on Research in Security and Privacy (Oakland, May 1992), pp. 72–84.</p> ]]></content:encoded>
            <category><![CDATA[Privacy Week]]></category>
            <category><![CDATA[Research]]></category>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[Protocols]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">7daEt3ab6jwAnizdsMzbfl</guid>
            <dc:creator>Tatiana Bradley</dc:creator>
        </item>
        <item>
            <title><![CDATA[Moving Quicksilver into production]]></title>
            <link>https://blog.cloudflare.com/moving-quicksilver-into-production/</link>
            <pubDate>Wed, 25 Nov 2020 15:11:22 GMT</pubDate>
            <description><![CDATA[ We previously explained how and why we built Quicksilver. This second blog post is about the long journey to production which culminates with Kyoto Tycoon removal from Cloudflare infrastructure and points to the first signs of obsolescence. ]]></description>
<content:encoded><![CDATA[ <p>One of the great arts of software engineering is making updates and improvements to working systems without taking them offline. For some systems this can be rather easy: spin up a new web server or load balancer, redirect traffic, and you’re done. For other systems, such as the core distributed data store which keeps millions of websites online, it’s a bit more of a challenge.</p><p>Quicksilver is the data store responsible for storing and distributing the billions of KV pairs used to configure the millions of sites and Internet services which use Cloudflare. <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">In a previous post</a>, we discussed why it was built and what it was replacing. Building it, however, was only a small part of the challenge. We needed to deploy it to production into a network which was designed to be fault tolerant and in which downtime was unacceptable.</p><p>We needed a way to deploy our new service seamlessly, and to roll back that deploy should something go wrong. Ultimately many, many things did go wrong, and every bit of failure tolerance put into the system proved to be worth its weight in gold because none of this was visible to customers.</p>
    <div>
      <h2>The Bridge</h2>
      <a href="#the-bridge">
        
      </a>
    </div>
    <p>Our goal in deploying Quicksilver was to run it in parallel with our then existing KV distribution tool, Kyoto Tycoon. Once we had evaluated its performance and scalability, we would gradually phase out reads and writes to Kyoto Tycoon, until only Quicksilver was handling production traffic. To make this possible we decided to build a bridge - QSKTBridge.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6pUDqhqWX3ZCG0ihI9972P/a6b75ed8cf5e34d9c44c1d6229faf98e/image3-16.png" />
            
            </figure><p><i>Image Credit: </i><a href="https://en.wikipedia.org/wiki/Bodie_Creek_Suspension_Bridge"><i>https://en.wikipedia.org/wiki/Bodie_Creek_Suspension_Bridge</i></a></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4AeGYvTKsKEaF4n4llh3KY/56f665316d1a91f6f1635a9fc0e13b96/image1-18.png" />
            
            </figure><p>Our bridge replicated from Kyoto Tycoon, sending write and delete commands to our Quicksilver root node. This bridge service was deployed on Kubernetes within our infrastructure, consumed the Kyoto Tycoon update feed, and wrote the batched changes every 500ms.</p><p>In the event of node failure or misconfiguration it was possible for multiple bridge nodes to be live simultaneously. To prevent duplicate entries we included a timestamp with each change. When writing into Quicksilver we used the Compare And Swap command to ensure that only one batch with a given timestamp would ever be written.</p>
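<p>The deduplication logic can be sketched as follows (a simplified in-memory illustration in Go with hypothetical names; the real bridge wrote batches to the Quicksilver root node, not a local map). Each batch carries the timestamp it was built from, and a compare-and-swap on a bookkeeping key ensures that if two bridge nodes are live at once, each batch is still written only once:</p>

```go
package main

import (
	"fmt"
	"sync"
)

// tsKey is the bookkeeping key holding the timestamp of the last
// applied batch (an illustrative name, not Quicksilver's).
const tsKey = "__bridge_last_ts"

type Store struct {
	mu   sync.Mutex
	data map[string]string
}

// ApplyBatch writes the batch only if its timestamp is newer than the
// last one applied (the compare-and-swap step); a duplicate batch sent
// by a second bridge node is rejected.
func (s *Store) ApplyBatch(ts string, batch map[string]string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[tsKey] >= ts { // duplicate or stale batch: drop it
		return false
	}
	for k, v := range batch {
		s.data[k] = v
	}
	s.data[tsKey] = ts // swap in the new high-water mark
	return true
}

func main() {
	s := &Store{data: map[string]string{}}
	batch := map[string]string{"zone:example.com": "config-v2"}
	fmt.Println(s.ApplyBatch("2017-01-01T00:00:00.500Z", batch)) // true: applied
	fmt.Println(s.ApplyBatch("2017-01-01T00:00:00.500Z", batch)) // false: duplicate dropped
}
```

<p>Comparing ISO-8601 timestamps as strings works here because their lexicographic order matches their chronological order.</p>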
    <div>
      <h3>First Baby Steps In To Production</h3>
      <a href="#first-baby-steps-in-to-production">
        
      </a>
    </div>
<p>Gradually we began to point reads over a loopback interface from the Kyoto Tycoon port to the Quicksilver port. Fortunately, we had implemented both the memcached and the KT HTTP protocols, making the switch transparent to the many client instances that connect to Kyoto Tycoon and Quicksilver.</p><p>We began with DOG, a virtual data center which serves traffic for Cloudflare employees only. It’s called DOG because we use it for dog-fooding. Testing releases on the dog-fooding data center helps prevent serious issues from ever reaching the rest of the world. Our DOG testing went suspiciously well, so we began to move forward with more of our data centers around the world. Little did we know this rollout would take more than a year!</p>
    <div>
      <h3>Replication Saturation</h3>
      <a href="#replication-saturation">
        
      </a>
    </div>
<p>To make life easier for SREs provisioning a new data center, we implemented a bootstrap mechanism that pulls an entire database from a remote server. This was an easy win because our datastore library - <a href="https://symas.com/lmdb/">LMDB</a> - provides the ability to copy an entire LMDB environment to a file descriptor. Quicksilver simply wrapped this code and sent a specific version of the DB to a network socket using its file descriptor.</p><p>When bootstrapping our Quicksilver DBs to more servers, Kyoto Tycoon was still serving production traffic in parallel. As we mentioned in the first blog post, Kyoto Tycoon has an exclusive write lock that is sensitive to I/O. We were used to the problems this can cause, such as write bursts slowing down reads. This time around we found the separate bootstrap process caused issues. It filled the page cache quickly, causing I/O saturation, which impacted all production services reading from Kyoto Tycoon.</p><p>There was no easy fix to that: to bootstrap Quicksilver DBs in a data center, we'd have to wait until it was moved offline for maintenance, which pushed the schedule beyond our initial estimates. Once we had a data center deployment we could finally see metrics of Quicksilver running in the wild. It was important to validate that the performance was healthy before moving on to the next data center. This required coordination with the engineering teams that consume from Quicksilver.</p>
    <div>
      <h3>The FL (front line) test candidate</h3>
      <a href="#the-fl-front-line-test-candidate">
        
      </a>
    </div>
<p>Many requests reaching Cloudflare’s edge are forwarded to a component called FL, which acts as the "brain". It uses multiple configuration values to decide what to do with the request. It might apply the <a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/">WAF</a>, pass the request to Workers and so on. FL requires a lot of Quicksilver access, so it was the perfect candidate to help validate Quicksilver performance in a real environment.</p><p>Here are some screenshots showing FL metrics. The vertical red line marks the switchover point: on the left of the line is Kyoto Tycoon, on the right is Quicksilver. The drop immediately before the line is due to the upgrade procedure itself. We saw an immediate and significant improvement in error rates and response time by migrating to Quicksilver.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3VmLmi8X5cQlWQ722gNbTq/124a47513364913e9fae8b6db4d9c42e/image6-6.png" />
            
</figure><p>The middle graph shows the error count. Green dot errors are non-critical errors: the first request to Kyoto Tycoon/Quicksilver failed, so FL will try again. The red dot errors are the critical ones: they mean the second request also failed, in which case FL will return an error code to the end user. Both counts decreased substantially.</p><p>The bottom graph shows the average read latency from Kyoto Tycoon/Quicksilver. The average drops a lot: it is usually around 500 microseconds and no longer reaches 1ms, which it used to do quite often with Kyoto Tycoon.</p><p>Some readers may notice a few things. First of all, it was said earlier that the consumers were reading through the loopback interface. How come we experience errors there? How come we need 500us to read from Quicksilver?</p><p>At that time Cloudflare had some old SSDs which could sometimes take <i>hundreds of milliseconds</i> to respond; this affected the response time from Quicksilver, triggering read timeouts from FL.</p><p>That was also, obviously, increasing the average response time above.</p><p>From there, some readers might ask: why use averages and not percentiles? This screenshot dates back to 2017, and at that time the tool used for gathering FL metrics did not offer percentiles, only min, max and average.</p><p>Also, back then, all consumers were reading through TCP over loopback. Later on we switched to Unix sockets.</p><p>Finally, we enabled memcached and KT HTTP malformed request detection. It turned out that some consumers were sending us bogus requests and not properly monitoring the value returned, so they could not detect any error. We now alert on all bogus requests, as this should never happen.</p>
    <div>
      <h3>Spotting the wood for the trees</h3>
      <a href="#spotting-the-wood-for-the-trees">
        
      </a>
    </div>
    <p>Removing Kyoto Tycoon on the edge took around a year. We'd done a lot of work already but rolling Quicksilver out to production services meant the project was just getting started. Thanks to the new metrics we put in place, we learned a lot about how things actually happen in the wild. And this let us come up with some ideas for improvement.</p><p>Our biggest surprise was that most of our replication delays were caused by saturated I/O and not network issues. This is something we could not easily see with Kyoto Tycoon. A server experiencing I/O contention could take seconds to write a batch to disk. It slowly fell out of sync and then failed to propagate updates fast enough to its own clients, causing them to fall out of sync. The workaround was quite simple: if the latest received transaction log is over 30 seconds old, Quicksilver would disconnect and try the next server in the list of sources.</p><p>This was again an issue caused by aging SSDs. When a disk reaches that state it is very likely to be queued for a hardware replacement.</p><p>We quickly had to address another weakness in our codebase. A Quicksilver instance which was not currently replicating from a remote server would stop its own replication server, forcing clients to disconnect and reconnect somewhere else. The problem is this behavior has a cascading effect: if an intermediate server disconnects from its server to move somewhere else, it will disconnect all its clients, and so on. The problem was compounded by an exponential backoff approach. In some parts of the world our replication topology has six layers, and triggering the backoff many times caused significant and unnecessary replication lag. Ultimately the feature was fully removed.</p>
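<p>The staleness rule above is simple enough to sketch. The following is a toy model, not Quicksilver's actual code; the function and data shapes are hypothetical:</p>

```python
STALENESS_LIMIT = 30.0  # seconds, the threshold described above

def next_source(sources, current, last_log_age):
    """Failover sketch: if the newest transaction log received from the
    current source is older than 30 seconds, disconnect and try the
    next server in the configured list, wrapping around at the end."""
    if last_log_age <= STALENESS_LIMIT:
        return current  # still fresh enough: stay connected
    idx = sources.index(current)
    return sources[(idx + 1) % len(sources)]
```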
    <div>
      <h3>Reads, writes, and sad SSDs</h3>
      <a href="#reads-writes-and-sad-ssds">
        
      </a>
    </div>
    <p>We discovered that we had been underestimating the frequency of I/O errors and their effect on performance. LMDB maps the database file into memory and when the kernel experiences an I/O error while accessing the SSD, it will terminate Quicksilver with a SIGBUS. We could see these in kernel logs. I/O errors are a sign of unhealthy disks and cause us to see spikes in Quicksilver's read and write latency metrics. They can also cause spikes in the system load average, which affects all running services. When I/O errors occur, a check usually finds such disks are close to their maximum allowed write cycles, and further writes will cause a permanent failure.</p><p>Something we didn't anticipate was LMDB's interaction with our filesystem, causing high write amplification that increased I/O load and accelerated the wear and tear of our SSDs. For example, when Quicksilver wrote 1 megabyte, 30 megabytes were flushed to disk. This amplification is due to LMDB's copy-on-write: writing a 2-byte key-value (KV) pair ends up copying the whole page. That was the price to pay for crashproof storage. We've seen amplification factors between 1.5X and 80X. All of this happens before any internal amplification that might happen in an SSD.</p><p>We also spotted that some producers kept writing the same KV pair over and over. Although the rate was modest, around 50 writes per second, the amplification described above meant repetition still put real pressure on the SSDs. To fix this we just drop identical KV writes on the root node.</p>
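<p>The identical-write fix can be illustrated with a short sketch (the data shapes here are hypothetical, not Quicksilver's internals):</p>

```python
def drop_identical_writes(writes, store):
    """Skip writes whose value already matches what is stored, so a
    producer rewriting the same KV pair over and over never reaches
    the storage engine and never triggers copy-on-write page flushes."""
    applied = []
    for key, value in writes:
        if store.get(key) == value:
            continue  # identical write: drop it
        store[key] = value
        applied.append(key)
    return applied
```

<p>Given the 1.5X to 80X amplification factors described above, every dropped write can save many whole-page flushes.</p>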
    <div>
      <h3>Where does a server replicate from?</h3>
      <a href="#where-does-a-server-replicate-from">
        
      </a>
    </div>
    <p>Each small data center replicates from multiple bigger sites and these replicate from our two core data centers. To decide which data centers to use, we talk to the Network team, get an answer and hardcode the topology in Salt, the same place we configure all our infrastructure.</p><p>Each replication node is identified using IP addresses in our Salt configuration. This is quite a rigid way of doing things; wouldn't DNS be easier? The interesting thing is that Quicksilver is a source of DNS for customers and Cloudflare infrastructure. We do not want Quicksilver to become dependent on itself; that could lead to problems.</p><p>While this somewhat rigid way of configuring topology works fine, as Cloudflare continues to scale there is room for improvement. Configuring Salt can be tricky and changes need to be thoroughly tested to avoid breaking things. It would be nicer for us to have a way to build the topology more dynamically, allowing us to more easily respond to any changes in network details and provision new data centers. That is something we are working on.</p>
    <div>
      <h3>Removing Kyoto Tycoon from the Edge</h3>
      <a href="#removing-kyoto-tycoon-from-the-edge">
        
      </a>
    </div>
    <p>Migrating services away from Kyoto Tycoon requires identifying them. There were some obvious candidates like FL and DNS which play a big part in day-to-day Cloudflare activity and do a lot of reads. But the tricky part of completely removing Kyoto Tycoon is finding <b>all</b> the services that use it.</p><p>We informed engineering teams of our migration work but that was not enough. Some consumers got less attention because they were built during a research activity, or they run as part of a team's side project. We added some logging code into Kyoto Tycoon so we could see who exactly was connecting and reached out directly to them. That's a good way to catch the consumers that are permanently connected or connect often. However, some consumers only connect rarely and briefly, for example at startup time to load some configuration. So this takes time.</p><p>We also spotted some special cases that used exotic setups like using stunnel to read from the memcached interface remotely. That might have been a "quick-win" back in the day but it's technical debt that has to be paid off; we now strictly enforce the methods that consumers use to access Quicksilver.</p><p>The migration activity was steady and methodical and it took a few months. We deployed Quicksilver per group of data centers and instance by instance, removing Kyoto Tycoon only when we deemed it ready. We needed to support two sets of configuration and alerting in parallel. That's not ideal but it gave us confidence and ensured that customers were not disrupted. The only thing they might have noticed is the performance improvement!</p>
    <div>
      <h3>Quicksilver, Core side, welcome to the mouth of technical debt madness</h3>
      <a href="#quicksilver-core-side-welcome-to-the-mouth-of-technical-debt-madness">
        
      </a>
    </div>
    <p>Once the edge migration was complete, it was time to take care of the core data centers. We had just dealt with 200+ data centers, so doing two should be easy, right? Well, no: the edge side of Quicksilver uses the reading interface, while the core side uses the writing interface, so it is a totally different kind of job. Removing Kyoto Tycoon from the core was harder than the edge.</p><p>Many Kyoto Tycoon components were all running on a <b><i>single</i></b> physical server - the Kyoto Tycoon root node. This single point of failure had become a growing risk over the years. It had been responsible for incidents; for example, it once completely disappeared due to faulty hardware. In that emergency we had to start all services on a backup server. The move to Quicksilver was a great opportunity to remove the Kyoto Tycoon root node server. This was a daunting task and almost all the engineering teams across the company got involved.</p><p><i>Note: Cloudflare’s migration from Marathon to Kubernetes was happening in parallel to this task, so each time we refer to Kubernetes it is likely that the job was started on Marathon before being migrated later on.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1D3wqBGrywDAwzET2SJSpA/35969bba003824ef7828842b97ce44fa/Quicksilver-blog-_2x--1-.png" />
            
            </figure><p>The diagram above shows how all components below interact with each other.</p>
    <div>
      <h3>From KTrest to QSrest</h3>
      <a href="#from-ktrest-to-qsrest">
        
      </a>
    </div>
    <p>KTrest is a stateless service providing a REST interface for writing and deleting KV pairs from Kyoto Tycoon. Its only purpose is to enforce access control (ACL) so engineering teams do not experience accidental keyspace overlap. Of course, sometimes keyspaces are shared on purpose but it is not the norm.</p><p>The Quicksilver team took ownership of KTrest and migrated it from the root server to Kubernetes. We asked all teams to move their KTrest producers to Kubernetes as well. This went smoothly and did not require code changes in the services.</p><p>Our goal was to move people to Quicksilver, so we built QSrest, a KTrest-compatible service for Quicksilver. Remember the problem about disk load? The QSKTBridge was in charge of batching: aggregating 500ms of updates before flushing to Quicksilver, to help our SSDs cope with our sync frequency. If we moved people to QSrest, it needed to support batching too, and so it does. We used QSrest as an opportunity to implement write quotas. Once a quota is reached, all further writes from the same producer are slowed down until usage drops back under the quota.</p><p>We found that not all teams actually used KTrest. Many teams were still writing directly to Kyoto Tycoon. One reason for this is that they were sending large range read requests in order to read thousands of keys in one shot - a feature that KTrest did not provide. People might ask why we would allow reads on a write interface. This is an unfortunate historical artefact and we wanted to put things right. Large range requests would hurt Quicksilver (we learned this from our experience with the bootstrap mechanism) so we added a cluster of servers which would serve these range requests: the Quicksilver consumer nodes. We asked the teams to switch their writes to KTrest and their reads to Quicksilver consumers.</p><p>Once the migration of the producers out of the root node was complete, we asked teams to start switching from KTrest to QSrest. Once the move was started, moving back would be really hard: our Kyoto Tycoon and Quicksilver databases were now inconsistent with each other. We had to move over 50 producers, owned by teams in different time zones. So we did it one after another while carefully babysitting our servers to spot any early warning that could end up in an incident.</p><p>Once we were all done, we then started to look at Kyoto Tycoon transaction logs: was there something still writing to it? We found an obsolete heartbeat mechanism, which was shut down.</p><p>Finally, Kyoto Tycoon root processes were turned off, never to be turned on again.</p><p><i>Removing Kyoto Tycoon from all over Cloudflare ended up taking four years.</i></p>
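<p>The 500ms batching idea can be modeled in a few lines. This is a toy sketch of the concept, not the QSrest implementation:</p>

```python
import time

class WriteBatcher:
    """Buffer incoming writes and flush them at most once per window
    (~500ms in the text above), so the downstream SSDs see a few large
    batches instead of many tiny synchronous writes."""

    def __init__(self, flush, window=0.5, clock=time.monotonic):
        self.flush = flush        # callback receiving a dict of pending writes
        self.window = window      # seconds between flushes
        self.clock = clock
        self.pending = {}
        self.last_flush = self.clock()

    def write(self, key, value):
        self.pending[key] = value  # a later write to the same key wins
        if self.clock() - self.last_flush >= self.window:
            self.flush(dict(self.pending))
            self.pending.clear()
            self.last_flush = self.clock()
```

<p>Because later writes to the same key supersede earlier ones inside the window, repeated updates also collapse into a single flush.</p>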
    <div>
      <h2>No hardware single point of failure</h2>
      <a href="#no-hardware-single-point-of-failure">
        
      </a>
    </div>
    <p>We have made great progress, Kyoto Tycoon is no longer among us, we smoothly moved all Cloudflare services to Quicksilver, our new KV store. And things now look more like this.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TAvdFTskE3MjXy8GF8GUP/c5959a7d5ae9ea84a195db68fd673491/Quicksilver-blog-2_2x.png" />
            
            </figure><p>The job is still far from done though. The Quicksilver root server was still a single point of failure, so we decided to build Quicksilver Raft - a <a href="https://en.wikipedia.org/wiki/Raft_(computer_science)">Raft</a>-enabled root cluster using the etcd Raft package. It was harder than I originally thought. Adding Raft into Quicksilver required proper synchronization between the Raft snapshot and the current Quicksilver DB. Sometimes there are situations where Quicksilver needs to fall back to asynchronous replication to catch up before doing proper Raft synchronous writes. Getting all of this to work required deep changes in our codebase, which also made it hard to build proper unit and integration tests. But removing single points of failure is worth it.</p>
    <div>
      <h2>Introducing QSQSBridge</h2>
      <a href="#introducing-qsqsbridge">
        
      </a>
    </div>
    <p>We like to test our components in an environment very close to production before releasing them. How do we test the core-side writer interface of Quicksilver then? Well, in the days of Kyoto Tycoon, we had a second Quicksilver root node feeding from Kyoto Tycoon through the KTQSbridge. In turn, some very small data centers were being fed from that root node. When Kyoto Tycoon was removed we lost that capability, limiting our testing. So we built QSQSbridge, which replicates from a Quicksilver instance and writes to a Quicksilver root node.</p>
    <div>
      <h2>Removing the Quicksilver legacy top-main</h2>
      <a href="#removing-the-quicksilver-legacy-top-main">
        
      </a>
    </div>
    <p>With Quicksilver Raft and QSQSbridge in place, we tested three small data centers replicating from the test Raft cluster over the course of weeks. We then wanted to promote the high availability (HA) cluster and make the whole world replicate from it. In the same way we did for Kyoto Tycoon removal, we wanted to be able to roll back at any time if things went wrong.</p><p>Ultimately we reworked the QSQSbridge so that it builds transaction logs compatible with those of the feeding root node. That way, we started to move groups of data centers underneath the Raft cluster and we could, at any time, move them back to the legacy top-main.</p><p>We started with this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5D04BFHgTQ5PLvyJy8zsTo/292e21bd14efd716055e4065f558e86b/Quicksilver-blog-3_2x--1-.png" />
            
            </figure><p>We moved our 10 QSrest instances one after another from the legacy root node to write against the HA cluster. We did this slowly without any significant issue, and we were finally able to turn off the legacy top-main. This is what we ended up with:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1qOBQIBnPZVqpOuQ9co2zK/e5c9f669c0ce7f2c11959a6f68ec4cd2/Quicksilver-blog-4_2x.png" />
            
            </figure><p>And that was it, <i>Quicksilver was running in the wild without a hardware single point of failure.</i></p>
    <div>
      <h2>First signs of obsolescence</h2>
      <a href="#first-signs-of-obsolescence">
        
      </a>
    </div>
    <p>The Quicksilver bootstrap mechanism we built to make SRE life better worked by pulling a full copy of the database, but it had pain points. When bootstrapping begins, a long-lived read transaction is opened on the server; the server must keep the requested DB version intact while also applying new log entries. This causes two problems. First, if the read is interrupted by something like a network glitch, the read transaction is closed and that version of the DB is no longer available. The entire process has to start from scratch. Secondly, it causes fragmentation in the source DB, which requires somebody to go and compact it.</p><p>For the first six months, the databases were fairly small and everything worked fine. However, as our databases continued to grow, these two problems caught up with us. Longer transfers are at risk of network glitches and the problem compounds, especially for more remote data centers. Longer transfers mean more fragmentation and more time compacting it. Bootstrapping was becoming a pain.</p><p>One day an SRE reported that manually bootstrapping instances one after another was much faster than bootstrapping all together in parallel. We investigated and saw that when all Quicksilver instances were bootstrapping at the same time the kernel and SSD were overwhelmed swapping pages. There was page cache contention caused by the ratio of the combined size of the DBs vs. available physical memory.</p><p>The workaround was to serialize bootstrapping across instances. We did this by implementing a lock between Quicksilver processes: a client grabs the lock to bootstrap and gives it back once done, ready for the next client. The lock moves around in round-robin fashion forever.</p>
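<p>The round-robin lock can be sketched as a rotating ring of instances. This is a conceptual model, not the actual inter-process implementation:</p>

```python
from collections import deque

class BootstrapLock:
    """Only the current holder may bootstrap; releasing the lock
    passes it to the next instance, round robin, forever."""

    def __init__(self, instances):
        self.ring = deque(instances)

    def holder(self):
        return self.ring[0]

    def release(self):
        self.ring.rotate(-1)  # hand the lock to the next instance
        return self.ring[0]
```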
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/525KVnuK1hF4S7YuihIwP4/a1d0607e4410d87240b9b356192bc248/Quicksilver-blog-5_2x.png" />
            
            </figure><p>Unfortunately, we hit another page cache thrashing issue. We know that LMDB mostly does random I/O but we turned on readahead to load as much of the DB as possible into memory. But again, as our DBs grew, the readahead started evicting useful pages. So we improved the performance by… disabling it.</p><p>Our biggest issue by far though is more fundamental: disk space and wear and tear! Each and every server in Cloudflare stores the same data. So a piece of data one megabyte in size consumes at least 10 gigabytes globally. To accommodate continued growth you can add more disks, replace dying disks, and use bigger disks. That's not a very sustainable solution. It became clear that the Quicksilver design we came up with five years ago was becoming obsolete and needed a rethink. We approached that problem in both the short term and the long term.</p>
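<p>The storage math is worth making concrete. The fleet size below is an assumption implied by the 1 megabyte to 10 gigabytes figure above, not an exact count:</p>

```python
def global_footprint(logical_bytes, servers=10_000):
    """Every server stores the full dataset, so global disk usage
    grows linearly with fleet size. 10,000 servers is a hypothetical
    figure consistent with the 1 MB -> 10 GB example in the text."""
    return logical_bytes * servers
```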
    <div>
      <h2>The long and short of it</h2>
      <a href="#the-long-and-short-of-it">
        
      </a>
    </div>
    <p>In the long term we are now working on a sharded version of Quicksilver. This will avoid storing all data on all machines while still guaranteeing that the entire dataset is present in each data center.</p><p>In the shorter term we have addressed our storage size problems with a few approaches.</p><p>Many engineering teams did not even know how many bytes of Quicksilver they used. Some might have had an impression that storage is free. We built a service named “QSusage” which uses the QSrest ACL to match KV pairs with their owner and computes the total disk space used per producer, including Quicksilver metadata. This service allows teams to better understand their usage and pace of growth.</p><p>We also built a service named "QSanalytics" to help answer the question “How do I know which of my KV pairs are never used?”. It gathers all keys being accessed all over Cloudflare, aggregates them and pushes them into a ClickHouse cluster, where we store this information over a 30-day rolling window. There's no sampling here; we keep track of all read accesses. We can easily report back to the engineering teams responsible for unused keys, who can then consider whether these can be deleted or must be kept.</p><p>Some of our issues are caused by the storage engine LMDB. So we started to look around and quickly got interested in RocksDB. It provides built-in compression, online defragmentation and other features like prefix deduplication.</p><p>We ran some tests using representative Quicksilver data and they seemed to show that RocksDB could store the same data in 40% of the space compared to LMDB. Read latency was a little bit higher in some cases but nothing too serious. CPU usage was also higher, using around 150% of CPU time compared to LMDB's 70%. Write amplification appeared to be much lower, giving some relief to our SSDs and the opportunity to ingest more data.</p><p>We decided to switch to RocksDB to help sustain our growing product and customer bases. In a follow-up blog we'll do a deeper dive into performance comparisons and explain how we managed to switch the entire business without impacting customers at all.</p>
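<p>The QSusage accounting idea reduces to attributing each key's size to the producer whose ACL prefix matches it. A sketch, with hypothetical data shapes:</p>

```python
def usage_by_producer(acls, key_sizes):
    """Sum each key's on-disk size (value plus metadata) under the
    producer whose ACL prefix matches it; keys matching no ACL are
    grouped under "unknown"."""
    totals = {producer: 0 for producer in acls}
    totals["unknown"] = 0
    for key, size in key_sizes.items():
        owner = next(
            (p for p, prefix in acls.items() if key.startswith(prefix)),
            "unknown",
        )
        totals[owner] += size
    return totals
```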
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Migrating from Kyoto Tycoon to Quicksilver was no mean feat. It required company-wide collaboration, time and patience to smoothly roll out a new system without impacting customers. We've removed single points of failure and improved the performance of database access, which helps to deliver a better experience.</p><p>In the process of moving to Quicksilver we learned a lot. Issues that were unknown at design time were surfaced by running software and services at scale. Furthermore, as the customer and product base grows and their requirements change, we've got new insight into what our KV store needs to do in the future. Quicksilver gives us a solid technical platform to do this.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">2fYMrDzxCQdRGlWLH3WxlU</guid>
            <dc:creator>Geoffrey Plouviez</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Cloudflare uses Cloudflare Spectrum: A look into an intern’s project at Cloudflare]]></title>
            <link>https://blog.cloudflare.com/how-cloudflare-uses-cloudflare-spectrum-a-look-into-an-interns-project-at-cloudflare/</link>
            <pubDate>Fri, 21 Aug 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ As part of my onboarding as an intern on the Spectrum (a layer 4 reverse proxy) team, I learned that many internal services dogfood Spectrum, as they are exposed to the Internet and benefit from layer 4 DDoS protection. ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3Z5QW1BQ7OCth8QdTIQddz/6db79a8f7df3cad36ac62c12ca2b65ba/Dogfooding-Spectrum_2x-1.png" />
            
            </figure><p>Cloudflare extensively uses its own products internally in a process known as <a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">dogfooding</a>. As part of my onboarding as an intern on the Spectrum (a layer 4 reverse proxy) team, I learned that many internal services dogfood Spectrum, as they are exposed to the Internet and benefit from layer 4 DDoS protection. One of my first tasks was to update the configuration for an internal service that was using Spectrum. The configuration was managed in Salt (<a href="/manage-cloudflare-records-with-salt/">used for configuration management at Cloudflare</a>), which was not particularly user-friendly, and required an engineer on the Spectrum team to handle updating it manually.</p><p>This process took about a week. That should instantly raise some questions, as a typical Spectrum customer can create a new Spectrum app in under a minute through Cloudflare Dashboard. So why couldn’t I?</p><p>This question formed the basis of my intern project for the summer.</p>
    <div>
      <h3>The Process</h3>
      <a href="#the-process">
        
      </a>
    </div>
    <p>Cloudflare uses various IP ranges for its products. Some customers also authorize Cloudflare to announce their IP prefixes on their behalf (this is known as <a href="https://developers.cloudflare.com/spectrum/getting-started/byoip/">BYOIP</a>). Collectively, we can refer to these IPs as <i>managed</i> addresses. To prevent Bad Stuff (defined later) from happening, we prohibit managed addresses from being used as Spectrum origins. To accomplish this, <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a> had its own table of denied networks that included the managed addresses. For the average customer, this approach works great – they have no legitimate reason to use a managed address as an origin.</p><p>Unfortunately, the services dogfooding Spectrum all use Cloudflare IPs, preventing those teams with a legitimate use-case from creating a Spectrum app through the configuration service (i.e. Cloudflare Dashboard). To bypass this check, these <i>internal customers</i> needed to define a custom Spectrum configuration, which had to be manually deployed to the edge via a pull request to our Salt repo, resulting in a time-consuming process.</p><p>If an internal customer wanted to change their configuration, the same time-consuming process had to be used. While this allowed internal customers to use Spectrum, it was tedious and error-prone.</p>
    <div>
      <h3>Bad Stuff</h3>
      <a href="#bad-stuff">
        
      </a>
    </div>
    <p><i>Bad Stuff</i> is quite vague and deserves a better definition. It may seem arbitrary that we deny Cloudflare managed addresses. To motivate this, consider two Spectrum apps, A and B, where the origin of app A is the Cloudflare edge IP of app B, and the origin of app B is the edge IP of app A. Essentially, app A will proxy incoming connections to app B, and app B will proxy incoming connections to app A, creating a cycle.</p><p>This could potentially crash the daemon or degrade performance. In practice, this configuration is useless, as the proxied connection never reaches an origin; it would only be created by a malicious user, so it is never allowed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1RAT655ZFI4uFjH14ymleJ/93d5e29de5b132b635bbea52560b006d/Spectrum-loop_2x.png" />
            
            </figure><p>In fact, the more general case of setting another Spectrum app as an origin (even when the configuration does not result in a cycle) is (almost<sup><a href="#footnote">1</a></sup>) never needed, so it also needs to be avoided.</p><p>As well, since we are providing a reverse proxy to customer origins, we do not need to allow connections to IP ranges that cannot be used on the public Internet, as specified in this <a href="https://tools.ietf.org/html/rfc6890">RFC</a>.</p>
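<p>The cycle above can be detected mechanically. Here is a toy check over a map from each app's edge IP to its configured origin (the data shape is hypothetical, not Spectrum's actual model):</p>

```python
def creates_cycle(edge_to_origin):
    """Follow each app's origin while it points at another app's edge
    IP; revisiting a node means the proxied connection would loop
    forever and never reach a real origin."""
    for start in edge_to_origin:
        seen = set()
        node = start
        while node in edge_to_origin:
            if node in seen:
                return True
            seen.add(node)
            node = edge_to_origin[node]
    return False
```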
    <div>
      <h3>The Problem</h3>
      <a href="#the-problem">
        
      </a>
    </div>
    <p>To improve usability and allow internal Spectrum customers to create apps using the Dashboard instead of the static configuration workflow, we needed a way to give particular customers permission to use Cloudflare managed addresses in their Spectrum configuration. Solving this problem was my main project for the internship.</p><p>A good starting point ended up being the Addressing API. The Addressing API is Cloudflare’s solution to IP management, an internal database and suite of tools to keep track of IP prefixes, with the goal of providing a unified source of truth for how IP addresses are being used across the organization. This makes it possible to provide a cross-product platform for products and features such as <a href="https://developers.cloudflare.com/spectrum/getting-started/byoip/">BYOIP</a>, <a href="/tag/bgp/">BGP On Demand</a>, and <a href="https://www.cloudflare.com/magic-transit/">Magic Transit</a>.</p><p>The Addressing API keeps track of all Cloudflare managed IP prefixes, along with who owns the prefix. As well, the owner of a prefix can give permission for someone else to use the prefix. We call this a <i>delegation</i>.</p><p>A user’s permission to use an IP address managed by the Addressing API is determined as follows:</p><ol><li><p>Is the user the owner of the prefix containing the IP address? (a) Yes: the user has permission to use the IP. (b) No: go to step 2.</p></li><li><p>Has the user been delegated a prefix containing the IP address? (a) Yes: the user has permission to use the IP. (b) No: the user does not have permission to use the IP.</p></li></ol>
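<p>The two-step check translates almost directly into code. A sketch, assuming owners and delegations are keyed by prefix (the shapes are hypothetical, not the Addressing API's schema):</p>

```python
from ipaddress import ip_address, ip_network

def may_use_ip(user, ip, owners, delegations):
    """Step 1: the user owns the prefix containing the IP.
    Step 2: otherwise, the user holds a delegation for that prefix."""
    addr = ip_address(ip)
    for prefix, owner in owners.items():
        if addr in ip_network(prefix):
            if owner == user:
                return True
            return user in delegations.get(prefix, set())
    return False  # the address is not managed here
```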
    <div>
      <h3>The Solution</h3>
      <a href="#the-solution">
        
      </a>
    </div>
    <p>With the information present in the Addressing API, the solution starts to become clear. For a given customer and IP, we use the following algorithm:</p><ol><li><p>Is the IP managed by Cloudflare (or contained in the previous <a href="https://tools.ietf.org/html/rfc6890">RFC</a>)? (a) Yes: go to step 2. (b) No: allow as origin.</p></li><li><p>Does the customer have permission to use the IP address? (a) Yes: allow as origin. (b) No: deny as origin.</p></li></ol><p>As long as the internal customer has been given permission to use the Cloudflare IP (through a delegation in the Addressing API), this approach would allow them to use it as an origin.</p><p>However, we run into a corner case here – since BYOIP customers also have permission to use their own ranges, they would be able to set their own IP as an origin, potentially causing a cycle. To mitigate this, we need to check if the IP is a Spectrum edge IP. Fortunately, the Addressing API also contains this information, so all we have to do is check if the given origin IP is already in use as a Spectrum edge IP, and if so, deny it. Since all of the denied-network checks occur in the Addressing API, we were able to remove Spectrum's own deny network database, reducing the engineering workload to maintain it along the way.</p><p>Let's go through a concrete example. Consider an internal customer who wants to use 104.16.8.54/32 as an origin for their Spectrum app. This address is managed by Cloudflare; suppose the customer has permission to use it and the address is not already in use as an edge IP. This means the customer is able to specify this IP as an origin, since it meets all of our criteria.</p><p>For example, a request to the Addressing API could look like this:</p>
            <pre><code>curl --silent 'https://addr-api.internal/edge_services/spectrum/validate_origin_ip_acl?cidr=104.16.8.54/32' -H "Authorization: Bearer $JWT" | jq .
{
  "success": true,
  "errors": [],
  "result": {
    "allowed_origins": {
      "104.16.8.54/32": {
        "allowed": true,
        "is_managed": true,
        "is_delegated": true,
        "is_reserved": false,
        "has_binding": false
      }
    }
  },
  "messages": []
}</code></pre>
            <p>Now we have completely moved the responsibility of validating the use of origin IP addresses from Spectrum’s configuration service to the Addressing API.</p>
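<p>Condensed, the whole origin decision looks like this. The predicates here stand in for Addressing API lookups and are assumptions for illustration, not real endpoints:</p>

```python
def allow_origin(ip, is_managed, has_permission, is_spectrum_edge):
    """Unmanaged (publicly routable) IPs pass; managed IPs require
    permission and must not already be a Spectrum edge IP, which
    could create a proxy cycle."""
    if not is_managed(ip):
        return True
    return has_permission(ip) and not is_spectrum_edge(ip)
```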
    <div>
      <h3>Performance</h3>
      <a href="#performance">
        
      </a>
    </div>
    <p>This approach required making another HTTP request on the critical path of every create app request in the Spectrum configuration service. Some basic performance testing showed (as expected) increased response times for the API call (about 100ms). This led to discussion among the Spectrum team about the performance impact of different HTTP requests throughout the critical path. To investigate, we decided to use OpenTracing.</p><p><a href="https://opentracing.io/">OpenTracing</a> is a standard for providing distributed tracing of microservices. When an HTTP request is received, special headers are added to it to allow it to be traced across the different services. Within a given trace, we can see how long a SQL query took, the time a function took to complete, the amount of time a request spent at a given service, and more.</p><p>We have been deploying a tracing system for our services to provide more visibility into a complex system.</p><p>After instrumenting the Spectrum config service with OpenTracing, we were able to determine that the Addressing API accounted for a very small amount of time in the overall request, and allowed us to identify potentially problematic request times to other services.</p>
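<p>A toy version of what a span records makes the idea concrete. This illustrates only the concept, not the OpenTracing API:</p>

```python
import time
from contextlib import contextmanager

@contextmanager
def span(name, trace):
    """Time a named operation within a request and record it, the way
    a tracing span would; `trace` collects (name, seconds) tuples."""
    start = time.monotonic()
    try:
        yield
    finally:
        trace.append((name, time.monotonic() - start))
```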
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hdvsPKhjS41OxLNPDltxj/27ccf978efbe73a81d47960aa6dcf981/image1-14.png" />
            
            </figure>
    <div>
      <h3>Lessons Learned</h3>
      <a href="#lessons-learned">
        
      </a>
    </div>
    <p>Reading documentation is important! Having a good understanding of how the Addressing API and the config service worked allowed me to create and integrate an endpoint that made sense for my use-case.</p><p>Writing documentation is just as important. For the final part of my project, I had to onboard <a href="/project-crossbow-lessons-from-refactoring-a-large-scale-internal-tool/">Crossbow</a> – an internal Cloudflare tool used for diagnostics – to Spectrum, using the new features I had implemented. I had written an onboarding guide, but some steps were unclear during the onboarding process, so I made sure to gather feedback from the Crossbow team to improve the guide.</p><p>Finally, I learned not to underestimate the amount of complexity required to implement relatively simple validation logic. In fact, the implementation required understanding the entire system. This includes how multiple microservices work together to validate the configuration and understanding how the data is moved from the Core to the Edge, and then processed there. I found increasing my understanding of this system to be just as important and rewarding as completing the project.</p><p><sup>1</sup> <a href="http://staging.blog.mrk.cfdata.org/introducing-regional-services/">Regional Services</a> actually makes use of proxying a Spectrum connection to another colocation, and then proxying to the origin, but the configuration plane is not involved in this setup.</p>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Internship Experience]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">2I8392q8zDjUlkCyIJ2vAk</guid>
            <dc:creator>Ryan Jacobs</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Bot Management: machine learning and more]]></title>
            <link>https://blog.cloudflare.com/cloudflare-bot-management-machine-learning-and-more/</link>
            <pubDate>Wed, 06 May 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ This is the ongoing story of Bot Management at Cloudflare and also an introduction to a series of blog posts about the detection mechanisms powering it ]]></description>
            <content:encoded><![CDATA[
    <div>
      <h3>Introduction</h3>
      <a href="#introduction">
        
      </a>
    </div>
    <p>Building the <a href="https://www.cloudflare.com/application-services/products/bot-management/">Cloudflare Bot Management platform</a> is an exhilarating experience. It blends Distributed Systems, Web Development, Machine Learning, Security and Research (and every discipline in between) while fighting ever-adaptive and motivated adversaries at the same time.</p><p>This is the ongoing story of Bot Management at Cloudflare and also an introduction to a series of blog posts about the detection mechanisms powering it. I’ll start with several definitions from the Bot Management world, then introduce the product and technical requirements, leading to an overview of the platform we’ve built. Finally, I’ll share details about the detection mechanisms powering our platform.</p><p>Let’s start with Bot Management’s nomenclature.</p>
    <div>
      <h3>Some Definitions</h3>
      <a href="#some-definitions">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/bots/what-is-a-bot/"><b>Bot</b></a> - an autonomous program on a network that can interact with computer systems or users, imitating or replacing a human user's behavior, performing repetitive tasks much faster than human users could.</p><p><b>Good bots</b> - bots which are useful to businesses they interact with, e.g. <a href="https://www.cloudflare.com/learning/bots/what-is-a-web-crawler/">search engine bots</a> like Googlebot, Bingbot or bots that operate on social media platforms like Facebook Bot.</p><p><b>Bad bots</b> - bots which are designed to perform malicious actions, ultimately hurting businesses, e.g. <a href="https://www.cloudflare.com/learning/bots/what-is-credential-stuffing/">credential stuffing</a> bots, <a href="https://www.cloudflare.com/learning/bots/what-is-data-scraping/">third-party scraping</a> bots, <a href="https://www.cloudflare.com/learning/bots/what-is-a-spambot/">spam</a> bots and sneakerbots.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nwyutK4DVA1cjsfMY8ore/fd2ebf6ec92ec7795ea423f4e7d61713/image12.png" />
            
            </figure><p><a href="https://www.cloudflare.com/learning/bots/what-is-bot-management/"><b>Bot Management</b></a> - blocking undesired or malicious Internet bot traffic while still allowing useful bots to access web properties by detecting bot activity, discerning between desirable and undesirable bot behavior, and identifying the sources of the undesirable activity.</p><p><a href="https://www.cloudflare.com/learning/ddos/glossary/web-application-firewall-waf/"><b>WAF</b></a> - a security system that monitors and controls network traffic based on a set of security rules.</p>
    <div>
      <h3>Gathering requirements</h3>
      <a href="#gathering-requirements">
        
      </a>
    </div>
    <p>Cloudflare has been stopping malicious bots from accessing websites or misusing <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api/">APIs</a> from the very <a href="/cloudflare-uses-intelligent-caching-to-avoid/">beginning</a>, at the same time <a href="/cleaning-up-bad-bots/">helping the climate</a> by offsetting the carbon costs from the bots. Over time it became clear that we needed a dedicated platform which would unite different bot fighting techniques and streamline the customer experience. In designing this new platform, we tried to fulfill the following key requirements.</p><ul><li><p><b>Complete, not complex</b> - customers can turn Bot Management on or off with a single click of a button, to protect their websites, mobile applications, or <a href="https://www.cloudflare.com/application-services/solutions/api-security/">APIs</a>.</p></li><li><p><b>Trustworthy</b> - customers want to know whether they can trust that a website visitor is who they say they are, along with a certainty indicator for that trust level.</p></li><li><p><b>Flexible</b> - customers should be able to define what subset of the traffic Bot Management mitigations should be applied to, e.g. only login URLs, pricing pages or sitewide.</p></li><li><p><b>Accurate</b> - Bot Management detections should have a very small error rate, i.e. few or no human visitors should ever be mistakenly identified as bots.</p></li><li><p><b>Recoverable</b> - if a wrong prediction is made, human visitors should still be able to access websites, and good bots should still be let through.</p></li></ul><p>Moreover, the goal for the new Bot Management product was to make it work well for the following use cases:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5z0o8rXLkecZWUhx1UDRHF/78e1e55cc47dac3d8bb58ac274870387/image8.png" />
            
            </figure>
    <div>
      <h3>Technical requirements</h3>
      <a href="#technical-requirements">
        
      </a>
    </div>
    <p>In addition to the product requirements above, we engineers had a list of must-haves for the new Bot Management platform. The most critical were:</p><ul><li><p><b>Scalability</b> - the platform should be able to calculate a score on every request, even at over 10 million requests per second.</p></li><li><p><b>Low latency</b> - detections must be performed extremely quickly, not slowing down request processing by more than 100 microseconds, and not requiring additional hardware.</p></li><li><p><b>Configurability</b> - it should be possible to configure which detections are applied to which traffic, including at the per-domain/data center/server level.</p></li><li><p><b>Modifiability</b> - the platform should be easily extensible with more detection mechanisms, different mitigation actions, richer analytics and logs.</p></li><li><p><b>Security</b> - no sensitive information from one customer should be used to build models that protect another customer.</p></li><li><p><b>Explainability &amp; debuggability</b> - we should be able to explain and tune predictions in an intuitive way.</p></li></ul><p>Equipped with these requirements, back in 2018, our small team of engineers got to work to design and build the next generation of Cloudflare Bot Management.</p>
    <div>
      <h3>Meet the Score</h3>
      <a href="#meet-the-score">
        
      </a>
    </div>
    <blockquote><p><i>“Simplicity is the ultimate sophistication.” - Leonardo Da Vinci</i></p></blockquote><p>Cloudflare operates on a vast scale. At the time of this writing, this means covering 26M+ Internet properties, processing on average 11M requests per second (with peaks over 14M), and examining more than 250 request attributes from different protocol levels. The key question: how do we harness the power of such “gargantuan” data to <a href="https://www.cloudflare.com/products/zero-trust/threat-defense/">protect all of our customers from modern day cyberthreats</a> in a simple, reliable and explainable way?</p><p>Bot management is hard. Some bots are much harder to detect than others, requiring analysis of multiple dimensions of request attributes over a long time, while sometimes a single request attribute gives them away. More signals may help, but are they generalizable?</p><p>When we classify traffic, should customers decide what to do with it, or are there decisions we can make on behalf of the customer? What concept could possibly address all these uncertainty problems and also help us deliver on the requirements above?</p><p>As you might’ve guessed from the section title, we came up with the concept of Trusted Score or simply <b>The Score</b> - one thing to rule them all - a value between 0 and 100 indicating the likelihood that a request originated from a human (high score) rather than an automated program (low score).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3caFJFhAGnz4989GtGg5s4/c03d3fd90168d7d35f79dcc547912db7/image5-1.jpg" />
            
            </figure><p><a href="https://www.flickr.com/photos/purple-lover/13583362554">"One Ring to rule them all"</a> by idreamlikecrazy, used under <a href="https://creativecommons.org/licenses/by/2.0/">CC BY</a> / Desaturated from original</p><p>Okay, let’s imagine that we are able to assign such a score on every incoming HTTP/HTTPS request. What are we or the customer supposed to do with it? Maybe it’s enough to provide such a score in the logs. Customers could then analyze them on their end, find the most frequent IPs with the lowest scores, and then use the <a href="https://www.cloudflare.com/application-services/products/waf/">Cloudflare Firewall</a> to block those IPs. Although useful, such a process would be manual, prone to error and, most importantly, could not be done in real time to protect the customer's Internet property.</p><p>Fortunately, around the same time we started working on this system, our colleagues from the Firewall team had <a href="/announcing-firewall-rules/">just announced Firewall Rules</a>. This new capability provided customers the ability to control requests in a flexible and intuitive way, inspired by the widely known Wireshark® language. Firewall rules supported a variety of request fields, and we thought - why not have the score be one of these fields? Customers could then write granular rules to block very specific attack types. That’s how the <code>cf.bot_management.score</code> field was born.</p><p>Having a score at the heart of Cloudflare Bot Management addressed multiple product and technical requirements in one stroke - it’s simple, flexible, configurable, and it provides customers with telemetry about bots on a per request basis. Customers can adjust the score threshold in firewall rules, depending on their sensitivity to false positives/negatives. Additionally, this intuitive score allows us to extend our detection capabilities under the hood without customers needing to adjust any configuration.</p><p>So how can we produce this score and how hard is it? Let’s explore it in the following section.</p>
    <div>
      <h3>Architecture overview</h3>
      <a href="#architecture-overview">
        
      </a>
    </div>
    <p>What is powering the Bot Management score? The short answer is a set of <a href="https://www.cloudflare.com/learning/serverless/glossary/serverless-microservice/">microservices</a>. In building this platform we tried to re-use as many pipelines, databases and components as we could; however, many services had to be built from scratch. Let’s have a look at the overall architecture (this simplified version shows only the Bot Management related services):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5OSC0Ty0XKfbmw651kO9bQ/ebfbf8204161f3db6622291ddb7e4721/image13.png" />
            
            </figure>
    <div>
      <h3>Core Bot Management services</h3>
      <a href="#core-bot-management-services">
        
      </a>
    </div>
    <p>In a nutshell, our systems process data received from the edge data centers, producing and storing the data required by the bot detection mechanisms, using the following technologies:</p><ul><li><p><b>Databases &amp; data stores</b> - <a href="/squeezing-the-firehose/">Kafka</a>, <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">ClickHouse</a>, Postgres, Redis, Ceph.</p></li><li><p><b>Programming languages</b> - Go, Rust, Python, Java, Bash.</p></li><li><p><b>Configuration &amp; schema management</b> - Salt, <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a>, <a href="https://capnproto.org/">Cap’n Proto</a>.</p></li><li><p><b>Containerization</b> - Docker, Kubernetes, Helm, Mesos/Marathon.</p></li></ul><p>Each of these services is built with resilience, performance, <a href="https://www.cloudflare.com/learning/performance/what-is-observability/">observability</a> and security in mind.</p>
    <div>
      <h3>Edge Bot Management module</h3>
      <a href="#edge-bot-management-module">
        
      </a>
    </div>
    <p>All bot detection mechanisms are applied to every request in real time during the request processing stage in the Bot Management module running on every machine at Cloudflare’s edge locations. When a request comes in, we extract and transform the required request attributes and feed them to our detection mechanisms. The Bot Management module produces the following output:</p><p><b>Firewall fields</b> - <a href="https://support.cloudflare.com/hc/en-us/articles/360027519452-Understanding-Cloudflare-Bot-Management">Bot Management fields</a>:</p><ul><li><p><b>cf.bot_management.score</b> - an integer between 0 and 100 indicating the likelihood that a request originated from an automated program (low score) rather than a human (high score).</p></li><li><p><b>cf.bot_management.verified_bot</b> - a boolean indicating whether the request comes from a Cloudflare-allowlisted bot.</p></li><li><p><b>cf.bot_management.static_resource</b> - a boolean indicating whether the request matches file extensions for many types of static resources.</p></li></ul><p><b>Cookies</b> - most notably it produces <a href="https://community.cloudflare.com/t/cf-bm-cookie/56696"><b>cf_bm</b></a>, which helps manage incoming traffic that matches criteria associated with bots.</p><p><b>JS challenges</b> - for some of our detections and customers we inject invisible JavaScript challenges, providing us with more signals for bot detection.</p><p><b>Detection logs</b> - we log details about each applied detection, the features used, and flags through our data pipelines to ClickHouse; some of these are used for analytics and customer logs, while others are used to debug and improve our models.</p><p>Once the Bot Management module has produced the required fields, the Firewall takes over the actual bot mitigation.</p>
    <div>
      <h3>Firewall integration</h3>
      <a href="#firewall-integration">
        
      </a>
    </div>
    <p>The Cloudflare Firewall's intuitive dashboard enables users to build powerful rules through easy clicks and also provides Terraform integration. Every request to the firewall is inspected against the rule engine. Based on the score produced by the Bot Management module and the configured threshold, suspicious requests can be blocked, challenged or logged as per the needs of the user, while legitimate requests are routed to the destination.</p>
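<p>As a concrete illustration (the threshold and path below are invented for the example, not recommendations), a customer could combine the score with other Firewall Rules fields to target likely bots on a sensitive path while sparing verified good bots:</p>

```
(http.request.uri.path eq "/login")
and (cf.bot_management.score lt 30)
and not cf.bot_management.verified_bot
```

A rule with this expression could then be paired with a Challenge or Block action, and the threshold of 30 tuned up or down depending on the customer's tolerance for false positives.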
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5oSiddu0b8BknoqrAQfCYV/3c6c1e456a6d5dedae0e7e50ac7cf346/image6.png" />
            
            </figure><p>Firewall rules provide the following bot mitigation <a href="https://developers.cloudflare.com/firewall/cf-firewall-rules/actions/">actions</a>:</p><ul><li><p><b>Log</b> - records matching requests in the Cloudflare Logs provided to customers.</p></li><li><p><b>Bypass</b> - allows customers to dynamically disable Cloudflare security features for a request.</p></li><li><p><b>Allow</b> - matching requests are exempt from challenge and block actions triggered by other Firewall Rules content.</p></li><li><p><b>Challenge (Captcha)</b> - useful for ensuring that the visitor accessing the site is human, and not automated.</p></li><li><p><b>JS Challenge</b> - useful for ensuring that bots and spam cannot access the requested resource; browsers, however, are free to satisfy the challenge automatically.</p></li><li><p><b>Block</b> - matching requests are denied access to the site.</p></li></ul><p>Our <a href="/updates-to-firewall-analytics/">Firewall Analytics</a> tool, powered by ClickHouse and <a href="/introducing-the-graphql-analytics-api-exactly-the-data-you-need-all-in-one-place/">GraphQL API</a>, enables customers to quickly identify and investigate security threats using an intuitive interface. In addition to analytics, we provide detailed logs on all bot-related activity via the <a href="https://developers.cloudflare.com/logs/logpull-api/">Logpull API</a> and/or <a href="/cloudflare-logpush-the-easy-way-to-get-your-logs-to-your-cloud-storage/">LogPush</a>, which provide an easy way to get your logs to your cloud storage.</p>
    <div>
      <h3>Cloudflare Workers integration</h3>
      <a href="#cloudflare-workers-integration">
        
      </a>
    </div>
    <p>If a customer wants more flexibility in what to do with requests based on the score (e.g. injecting new or changing existing HTML page content, serving incorrect data to bots, or stalling certain requests), <a href="https://www.cloudflare.com/developer-platform/workers/">Cloudflare Workers</a> provide an option to do that. For example, using this small code snippet, we can pass the score back to the origin server for more advanced real-time analysis or mitigation:</p>
            <pre><code>addEventListener('fetch', event =&gt; {
  event.respondWith(handleRequest(event.request))
})
 
async function handleRequest(request) {
  // Clone the request so its headers can be modified
  request = new Request(request);
 
  // Expose the bot score to the origin as a request header
  request.headers.set("Cf-Bot-Score", request.cf.bot_management.score)
 
  return fetch(request);
}</code></pre>
            <p>Now let’s have a look into how a single score is produced using multiple detection mechanisms.</p>
    <div>
      <h3>Detection mechanisms</h3>
      <a href="#detection-mechanisms">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3avrETqHNSco30pzXKi2Bi/d83a61d9a294c2f96ed8ae3101fc8839/image10.png" />
            
            </figure><p>The Cloudflare Bot Management platform currently uses five complementary detection mechanisms, producing their own scores, which we combine to form the single score going to the Firewall. Most of the detection mechanisms are applied on every request, while some are enabled on a per-customer basis to better fit their needs.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6kznWqiXGLzyUvqGbNgrco/3d3d777f8b268bcda1b319c841a518e9/image14.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2SlwP5FY8XUmzYUGbIne0n/f589b67119defdc46d4b80762f81d3f4/image4-1.png" />
            
            </figure><p>Having a score on every request for every customer has the following benefits:</p><ul><li><p><b>Ease of onboarding</b> - even before we enable Bot Management in active mode, we’re able to tell how well it’s going to work for the specific customer, including providing historical trends about bot activity.</p></li><li><p><b>Feedback loop</b> - availability of the score on every request along with all features has tremendous value for continuous improvement of our detection mechanisms.</p></li><li><p><b>Ensures scaling</b> - if we can compute a score for every request and customer, it means that every Internet property behind Cloudflare is a potential Bot Management customer.</p></li><li><p><b>Global bot insights</b> - Cloudflare sits in front of 26M+ Internet properties, which allows us to understand and react to the tectonic shifts happening in security and threat intelligence over time.</p></li></ul><p>Globally, more than a third of the Internet traffic visible to Cloudflare comes from bad bots, and for Bot Management customers the share of bad bots is even higher, at ~43%!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18xm3tKE1ir3wbIDSb1oet/8c7675efd856354dd50648b001662839/image7.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/79Mscr2mfcUGezTFmVG5nM/0bb09659f0707e84b8f8a15b52ba7af4/image9.png" />
            
            </figure><p>Let’s dive into specific detection mechanisms in chronological order of their integration with Cloudflare Bot Management.</p>
    <div>
      <h3>Machine learning</h3>
      <a href="#machine-learning">
        
      </a>
    </div>
    <p>The majority of decisions about the score are made using our machine learning models. These were also the first detection mechanisms to produce a score and to onboard customers back in 2018. The successful application of <a href="https://www.cloudflare.com/learning/ai/what-is-machine-learning/">machine learning</a> requires data high in <a href="/stop-the-bots-practical-lessons-in-machine-learning/">Quantity, Diversity, and Quality</a>, and thanks to both free and paid customers, Cloudflare has all three, enabling continuous learning and improvement of our models for all of our customers.</p><p>At the core of the machine learning detection mechanism is CatBoost - a high-performance open source library for gradient boosting on decision trees. The choice of CatBoost was driven by the library’s outstanding capabilities:</p><ul><li><p><b>Categorical features support</b> - allowing us to train on even very high cardinality features.</p></li><li><p><b>Superior accuracy</b> - allowing us to reduce overfitting by using a novel gradient-boosting scheme.</p></li><li><p><b>Inference speed</b> - in our case it takes less than 50 microseconds to apply any of our models, making sure request processing stays extremely fast.</p></li><li><p><b>C and Rust API</b> - most of our business logic on the edge is written using Lua, more specifically LuaJIT, so having a compatible FFI interface to be able to apply models is fantastic.</p></li></ul><p>Multiple CatBoost models run on Cloudflare’s Edge in <a href="https://christophergs.com/machine%20learning/2019/03/30/deploying-machine-learning-applications-in-shadow-mode/#what">shadow mode</a> on <i>every request on every machine</i>. One of the models runs in active mode, which influences the final score going to the Firewall. All ML detection results and features are logged and recorded in ClickHouse for further analysis, model improvement, analytics and customer-facing logs. We feed both categorical and numerical features into our models, extracted from request attributes and from inter-request features built using those attributes, calculated and delivered by the <i>Gagarin</i> inter-request features platform.</p><p>We’re able to deploy new ML models in a matter of seconds using an extremely reliable and performant <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a> configuration database. The same mechanism can be used to configure which version of an ML model should run in active mode for a specific customer.</p><p>A deep dive into our machine learning detection mechanism deserves a blog post of its own; it will cover how we train and validate our models on trillions of requests using GPUs, how model feature delivery and extraction work, and how we explain and debug model predictions both internally and externally.</p>
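<p>Our production models are applied through CatBoost's native API, but the shape of gradient-boosted-tree inference itself is simple: sum the outputs of many small trees and squash the result into a score. The toy sketch below is purely illustrative - the three "trees", the feature names and the score mapping are all invented, not our real model:</p>

```javascript
// A toy boosted ensemble: each "tree" is a tiny function of the
// request's features returning a small additive contribution.
const trees = [
  (f) => (f.headerCount < 5 ? -0.8 : 0.3),
  (f) => (f.hasCookies ? 0.5 : -0.6),
  (f) => (f.tlsFingerprintKnown ? 0.4 : -0.7),
];

// Sum the tree outputs, squash to (0, 1) with a sigmoid, and scale
// to a 1-99 range where high means human-like and low means bot-like.
function score(features) {
  const raw = trees.reduce((sum, tree) => sum + tree(features), 0);
  const p = 1 / (1 + Math.exp(-raw));
  return Math.round(1 + 98 * p);
}
```

A real ensemble has hundreds of much deeper trees learned from data, but the additive structure is why inference can stay in the tens of microseconds: it is just a series of cheap comparisons and additions.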
    <div>
      <h3>Heuristics engine</h3>
      <a href="#heuristics-engine">
        
      </a>
    </div>
    <p>Not all problems are best solved with machine learning. We can tweak the ML models in various ways, but in certain cases they will likely underperform basic heuristics. Often the problems machine learning is trying to solve are not entirely new. When building the Bot Management solution it became apparent that sometimes a single attribute of the request could give a bot away. This means that we can create a set of simple rules capturing bots in a straightforward way, while also ensuring the lowest false positives.</p><p>The heuristics engine was the second detection mechanism integrated into the Cloudflare Bot Management platform in 2019, and it is also applied on every request. We have multiple heuristic types and hundreds of specific rules based on certain attributes of the request, some of which are very hard to spoof. When a request matches any of the heuristics, we assign the lowest possible score of 1.</p><p>The engine has the following properties:</p><ul><li><p><b>Speed</b> - while ML model inference takes less than 50 microseconds per model, hundreds of heuristics can be applied in under 20 microseconds!</p></li><li><p><b>Deployability</b> - the heuristics engine allows us to add a new heuristic in a matter of seconds using <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a>, and it will be applied on every request.</p></li><li><p><b>Vast coverage</b> - a set of simple heuristics allows us to classify ~15% of global traffic and ~30% of Bot Management customers’ traffic as bots. Not too bad for a few if conditions, right?</p></li><li><p><b>Lowest false positives</b> - because we’re very sure and conservative about the heuristics we add, this detection mechanism has the lowest FP rate among all detection mechanisms.</p></li><li><p><b>Labels for ML</b> - because of the high certainty, we use requests classified with heuristics to train our ML models, which can then generalize the behavior learnt from heuristics and improve detection accuracy.</p></li></ul><p>So heuristics gave us <a href="https://developers.google.com/machine-learning/guides/rules-of-ml#rule_7_turn_heuristics_into_features_or_handle_them_externally">a lift when combined with machine learning</a>, and they encoded a lot of our intuition about bots, which helped to advance the Cloudflare Bot Management platform and allowed us to onboard more customers.</p>
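<p>The "match any rule, assign the lowest score" behavior described above can be sketched as a list of predicates over request attributes. The two example rules below are invented for illustration, not actual Cloudflare heuristics:</p>

```javascript
// Each heuristic is a predicate over request attributes; matching any
// one of them marks the request as a bot with the lowest score of 1.
const heuristics = [
  (req) => req.userAgent === '',                        // empty User-Agent
  (req) => /python-requests|curl/i.test(req.userAgent), // well-known tool UA
];

// Returns 1 (definitely a bot) if any heuristic fires, otherwise null
// so the other detection mechanisms decide the score.
function heuristicScore(req) {
  return heuristics.some((rule) => rule(req)) ? 1 : null;
}
```

Because each rule is just a short-circuiting predicate, running hundreds of them stays far cheaper than a model inference.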
    <div>
      <h3>Behavioral analysis</h3>
      <a href="#behavioral-analysis">
        
      </a>
    </div>
    <p>Machine learning and heuristics detections provide tremendous value, but both require human input on the labels - essentially a teacher to distinguish between right and wrong. While our supervised ML models can generalize well enough, even on novel threats similar to those we trained them on, we decided to go further. What if there was an approach that doesn’t require a teacher, but instead learns to distinguish bad behavior from normal behavior?</p><p>Enter the behavioral analysis detection mechanism, initially developed in 2018 and integrated with the Bot Management platform in 2019. This is an unsupervised machine learning approach, which has the following properties:</p><ul><li><p><b>Fitting specific customer needs</b> - it’s automatically enabled for all Bot Management customers, calculating and analyzing normal visitor behavior over an extended period of time.</p></li><li><p><b>Detects bots never seen before</b> - as it doesn’t use known bot labels, it can detect bots and anomalies from the normal behavior on a specific customer’s website.</p></li><li><p><b>Harder to evade</b> - anomalous behavior is often a direct result of the bot’s specific goal.</p></li></ul><p>Please stay tuned for a more detailed blog about behavioral analysis models and the platform powering this incredible detection mechanism, protecting many of our customers from unseen attacks.</p>
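<p>For a flavor of what "learning normal behavior without a teacher" can mean, here is a deliberately simple sketch - our production models are far more sophisticated, and the metric and threshold here are invented - that fits a per-site baseline for one behavioral metric and flags visitors far outside it:</p>

```javascript
// Learn the mean and standard deviation of a behavioral metric
// (e.g. requests per minute) from a site's historical visitors.
function fitBaseline(samples) {
  const mean = samples.reduce((a, b) => a + b, 0) / samples.length;
  const variance =
    samples.reduce((a, b) => a + (b - mean) ** 2, 0) / samples.length;
  return { mean, std: Math.sqrt(variance) };
}

// Flag a visitor as anomalous when its metric is more than
// `threshold` standard deviations away from the learned baseline.
function isAnomalous(baseline, value, threshold = 3) {
  if (baseline.std === 0) return value !== baseline.mean;
  return Math.abs(value - baseline.mean) / baseline.std > threshold;
}
```

Nothing here needs bot labels: the baseline is fit on whatever traffic the site normally sees, which is exactly what lets the approach catch bots never seen before.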
    <div>
      <h3>Verified bots</h3>
      <a href="#verified-bots">
        
      </a>
    </div>
    <p>So far we’ve discussed how to detect bad bots and humans. What about good bots, some of which are extremely useful for the customer website? Is there a need for a dedicated detection mechanism, or is there something we could use from the previously described detection mechanisms? While the majority of good bot requests (e.g. Googlebot, Bingbot, LinkedInbot) already have a low score produced by other detection mechanisms, we also need a way to avoid accidental blocks of useful bots. That’s how the Firewall field <i>cf.bot_management.verified_bot</i> came into existence in 2019, allowing customers to decide for themselves whether they want to let all of the good bots through or restrict access to certain parts of the website.</p><p>The actual platform calculating the Verified Bot flag deserves a detailed blog of its own, but in a nutshell it has the following properties:</p><ul><li><p><b>Validator based approach</b> - we support multiple validation mechanisms, each of them allowing us to reliably confirm good bot identity by clustering a set of IPs.</p></li><li><p><b>Reverse DNS validator</b> - performs a reverse DNS check to determine whether or not a bot's IP address matches its alleged hostname.</p></li><li><p><b>ASN Block validator</b> - similar to the rDNS check, but performed on an ASN block.</p></li><li><p><b>Downloader validator</b> - collects good bot IPs from either text files or HTML pages hosted on bot owner sites.</p></li><li><p><b>Machine learning validator</b> - uses an unsupervised learning algorithm, clustering good bot IPs which are not possible to validate through other means.</p></li><li><p><b>Bots Directory</b> - a database with UI that stores and manages bots that pass through the Cloudflare network.</p></li></ul>
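<p>The reverse DNS validator's core check can be sketched as a pure function: the PTR hostname for the claimed bot IP must fall under one of the bot owner's published domains, and that hostname must forward-resolve back to the same IP, so a spoofed PTR record alone is not enough. The DNS answers are passed in as arguments here (a real validator performs the lookups itself); the Googlebot domain suffixes used in the usage example are the ones Google publicly documents:</p>

```javascript
// Verify an alleged good-bot IP against its reverse DNS hostname.
// hostOk: the PTR hostname is under one of the allowed domains.
// forwardOk: the hostname resolves forward to the same IP.
function validateReverseDns(ip, ptrHostname, forwardIps, allowedSuffixes) {
  const hostOk = allowedSuffixes.some(
    (suffix) => ptrHostname === suffix || ptrHostname.endsWith('.' + suffix)
  );
  const forwardOk = forwardIps.includes(ip);
  return hostOk && forwardOk;
}
```

Note the `'.' + suffix` comparison: it prevents a hostname like `evilgooglebot.com` from passing a naive `endsWith('googlebot.com')` check.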
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ENyHN774m5EVfRIQpQzXO/d5d8127a3235fc8f26eb44b2de2a3919/image2-1.png" />
            
            </figure><p>Bots directory UI sample</p><p>Using the multiple validation methods listed above, the Verified Bots detection mechanism identifies hundreds of unique good bot identities, belonging to different companies and categories.</p>
    <div>
      <h3>JS fingerprinting</h3>
      <a href="#js-fingerprinting">
        
      </a>
    </div>
    <p>When it comes to Bot Management detection quality, it’s all about signal quality and quantity. All previously described detections use request attributes sent over the network and analyzed on the server side using different techniques. Are there more signals available that can be extracted from the client to improve our detections?</p><p>As a matter of fact there are plenty, as every browser has unique implementation quirks. Every web browser’s graphics output, such as canvas rendering, depends on multiple layers: hardware (GPU) and software (drivers, operating system rendering). This highly unique output allows precise differentiation between different browser/device types. Moreover, this is achievable without sacrificing website visitor privacy, as it’s not a supercookie and cannot be used to track and identify individual users, but only to confirm that a request’s user agent matches other telemetry gathered through the browser canvas API.</p><p>This detection mechanism is implemented as a challenge-response system, with a challenge injected into the webpage on Cloudflare’s edge. The challenge is then rendered in the background using the provided graphic instructions, and the result is sent back to Cloudflare for validation and further action, such as producing the score. There is a lot going on behind the scenes to make sure we get reliable results without sacrificing users’ privacy while being tamper-resistant to replay attacks. The system is currently in private beta and being evaluated for its effectiveness, and we already see very promising results. Stay tuned for this new detection mechanism becoming widely available and the blog on how we’ve built it.</p>
    <div>
      <h3>Summary</h3>
      <a href="#summary">
        
      </a>
    </div>
    <p>Cloudflare has the unique ability to collect data from trillions of requests flowing through its network every week. With this data, Cloudflare is able to identify likely bot activity with Machine Learning, Heuristics, Behavioral Analysis, and other detection mechanisms. Cloudflare Bot Management integrates seamlessly with other Cloudflare products, such as WAF  and Workers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7thZNko18uGAk7PKwxuG8s/1baf723065de6b686c81d139fbaa0163/image1-1.png" />
            
            </figure><p>All this would not be possible without hard work across multiple teams! First of all, thanks to everybody on the Bots Team for their tremendous efforts to make this platform come to life. Other Cloudflare teams (most notably Firewall, Data, Solutions Engineering, Performance, and SRE) helped us a lot to design, build and support this incredible platform.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3ox1rh9KPNDIy5h37liIEO/eadc3dc802b748ccc9279bc01b89b508/image11-1.jpg" />
            
            </figure><p>Bots team during Austin team summit 2019 hunting bots with axes :)</p><p>Lastly, there are more blogs from the Bots series coming soon, diving into internals of our detection mechanisms, so stay tuned for more exciting stories about Cloudflare Bot Management!</p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Bot Management]]></category>
            <category><![CDATA[Bots]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Firewall]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[AI]]></category>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Machine Learning]]></category>
            <guid isPermaLink="false">5yHu3o1Z2Z1reLsRB0q026</guid>
            <dc:creator>Alex Bocharov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Diving into Technical SEO using Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/diving-into-technical-seo-cloudflare-workers/</link>
            <pubDate>Thu, 07 Mar 2019 16:05:41 GMT</pubDate>
            <description><![CDATA[ With this post we illustrate the potential applications of Cloudflare Workers in relation to search engine optimization, more commonly referred to as ‘SEO’, using our research and testing over the past year building Sloth. ]]></description>
            <content:encoded><![CDATA[ <p><i>This is a guest post by Igor Krestov and Dan Taylor. Igor is a lead software developer at SALT.agency, and Dan a lead technical SEO consultant, who has also been credited with </i><a href="https://searchengineland.com/service-workers-and-seo-seo-for-developers-311292"><i>coining the term “edge SEO”</i></a><i>. </i><a href="https://salt.agency/"><i>SALT.agency</i></a><i> is a technical SEO agency with offices in London, Leeds, and Boston, offering bespoke consultancy to brands around the world. You can reach them both via </i><a href="https://twitter.com/salt_agency"><i>Twitter</i></a><i>.</i></p><p>With this post we illustrate the potential applications of <a href="https://www.cloudflare.com/products/cloudflare-workers/">Cloudflare Workers</a> in relation to search engine optimization, more commonly referred to as ‘SEO’, using our research and testing over the past year building Sloth.</p><p>This post is aimed both at readers who are proficient in writing performant JavaScript and at complete newcomers and less technical stakeholders who haven’t written many lines of code before.</p>
    <div>
      <h2>Endless practical applications to overcome obstacles</h2>
      <a href="#endless-practical-applications-to-overcome-obstacles">
        
      </a>
    </div>
    <p>Working with various clients and projects over the years, we’ve continuously encountered the same problems and obstacles in getting their websites to a point of “technical SEO excellence”. A lot of these problems come from platform restrictions at an enterprise level, legacy tech stacks, incorrect builds, and years of patching together various services and infrastructures.</p><p>As a team of technical SEO consultants, we can often be left frustrated by these barriers, which often lead to essential fixes and implementations either being impossible or being delayed for months (if not years) at a time – and in this time, the business is often losing traffic and revenue.</p><p>Workers offers us a Hail Mary solution to a lot of common frustrations in getting technical SEO implemented, and we believe that in the long run it can become an integral part of overcoming legacy issues, reducing DevOps costs, and speeding up lead times, all in addition to utilising a globally distributed serverless platform with blazing fast cold start times.</p>
    <div>
      <h2>Creating accessibility at scale</h2>
      <a href="#creating-accessibility-at-scale">
        
      </a>
    </div>
    <p>When we first started out, we needed to implement simple redirects, which should be easy to create on the majority of platforms but wasn’t supported in this instance.</p><p>When the second barrier arose, we needed to inject Hreflang tags, cross-linking an old multi-lingual website on a bespoke platform built to an outdated spec. This required experiments to find an efficient way of implementing the tags without increasing latency or adding new code to the server – in a manner befitting of search engine crawling.</p><p>At this point we had a number of other applications for Workers, with an arising need for non-developers to be able to modify and deploy new Worker code. This has since become the idea of Worker code generation, via a web UI or command line.</p><p>Having established a number of different use cases for Workers, we identified three processing phases:</p><ul><li><p>Incoming request modification – changing the origin request URL or adding authorization headers.</p></li><li><p>Outgoing response modification – adding security headers, Hreflang header injection, logging.</p></li><li><p>Response body modification – injecting/changing content, e.g. canonicals, robots and JSON-LD.</p></li></ul><p>We wanted to generate lean Worker code, which would enable us to keep each piece of functionality contained and independent of the others, so we went with the idea of filter chains, which can be used to compose fairly complex request processing.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5XNqSuHWNflu8gXityDn7E/71deac0da1e6aa40f55fb22e387e3ffd/request-filter-chain.png" />
            
            </figure><p>A request chain depicting the path of a request as it is transformed while moving from client to origin server and back again.</p><p>A key accessibility issue we identified from a non-technical perspective was the goal of making this serverless technology accessible to everyone in SEO, because with understanding comes buy-in from stakeholders. In order to do this, we had to make Workers:</p><ul><li><p>Accessible to users who don’t know how to write JavaScript, or performant JavaScript</p></li><li><p>Implementable through a process that complements existing deployment processes</p></li><li><p>Implementable through a process that is secure (internally and externally)</p></li><li><p>Implementable through a process that is compliant with data and privacy policies</p></li><li><p>Verifiable through existing processes and practices (BAU)</p></li></ul><p>Before we dive into the actual filters, here are partial TypeScript interfaces to illustrate the filter APIs:</p>
            <pre><code>interface FilterExecutor&lt;Type, Context, ReturnType extends Type | void&gt; {
    apply(filterChain: { next: (c: Context, obj: Type) =&gt; ReturnType | Promise&lt;ReturnType&gt; }, context: Context, obj: Type): ReturnType | Promise&lt;ReturnType&gt;;
}
interface RequestFilterContext {
    // Request URL
    url: URL;
    // Short-circuit request filters 
    respondWith(response: Response | Promise&lt;Response&gt;): void;
    // Short-circuit all filters
    respondWithAndStop(response: Response | Promise&lt;Response&gt;): void;
    // Add additional response filter
    appendResponseFilter(filter: ResponseFilter): void;
    // Add body filter
    appendBodyFilter(filter: BodyFilter): void;
}
interface RequestFilter extends FilterExecutor&lt;Request, RequestFilterContext, Request&gt; { };
interface ResponseFilterContext {
    readonly startMs: number;
    readonly endMs: number;
    readonly url: URL;
    waitUntil(promise: Promise&lt;any&gt;): void;
    respondWith(response: Response | Promise&lt;Response&gt;): void;
    respondWithAndStop(response: Response | Promise&lt;Response&gt;): void;
    appendBodyFilter(filter: BodyFilter): void;
}
interface ResponseFilter extends FilterExecutor&lt;Response, ResponseFilterContext, Response&gt; { };
interface BodyFilterContext {
    waitUntil(promise: Promise&lt;any&gt;): void;
}
interface ChunkChain {
    next: ChunkChain | null;
    chunk: Uint8Array;
}
interface BodyFilter extends MutableFilterExecutor&lt;ChunkChain | null, BodyFilterContext, ChunkChain | null&gt; { };</code></pre>
            <p>Request filter — Simple Redirects</p><hr /><p>Firstly, we would like to point out that this is a very niche use case: if your platform supports redirects natively, you should absolutely do it through your platform. But there are a number of limited, legacy or bespoke platforms where redirects are not supported, are limited, or are charged for (per line) by your <a href="https://www.cloudflare.com/developer-platform/solutions/hosting/">hosting</a> provider or platform. For example, GitHub Pages only supports redirects via an HTML meta refresh tag.</p><p>The most basic redirect filter would look like this:</p>
            <pre><code>class RedirectRequestFilter {
    constructor(redirects) {
        this.redirects = redirects;
    }

    apply(filterChain, context, request) {
        const redirect = this.redirects[context.url.href];
        if (redirect)
            context.respondWith(new Response('', {
                status: 301,
                headers: { 'Location': redirect }
            }));
        else
            return filterChain.next(context, request);
    }
}

const { requestFilterHandle } = self.slothRequire('./worker.js');
requestFilterHandle.append(new RedirectRequestFilter({
    "https://sloth.cloud/old-homepage": "https://sloth.cloud/"
}));</code></pre>
            <p>You can see it live in Cloudflare’s playground <a href="https://cloudflareworkers.com/#f59a71db84c026a5d2bee72170113f26:https://sloth.cloud/old-homepage">here</a>.</p><p>The one implemented in Sloth supports basic path matching, hostname matching and query string matching, as well as wildcards.</p>
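<p>To make the chain mechanics concrete, here is a simplified sketch (our own reconstruction for illustration, not Sloth’s actual implementation) of how request filters might be driven, with each filter either passing the request on via <code>next()</code> or short-circuiting by returning without calling it:</p>

```javascript
// Simplified request filter chain driver: filters run in order, each
// receiving a filterChain object whose next() invokes the following
// filter. A filter can transform the request and pass it on, or
// short-circuit by returning a result without calling next().
function makeRequestFilterChain(filters) {
  const link = (i) => ({
    next: (context, request) =>
      i < filters.length
        ? filters[i].apply(link(i + 1), context, request)
        : request, // end of chain: the request continues unchanged
  });
  return link(0);
}
```

<p>In this model, something like <code>requestFilterHandle.append()</code> simply pushes another filter onto the array, keeping each piece of functionality contained and independent.</p>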
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7bOFFqK57Xr5IsGHGsxCyd/683f3bb5a86104abf6112949e4ddcb6d/sloth-redirect-manager.png" />
            
            </figure><p>The Sloth dashboard for visually creating and modifying redirects.</p><p>This is all well and good when you do not have a lot of redirects to manage, but what do you do when the redirect map starts to take up a significant share of the memory available to a Worker? This is where we faced another scaling issue: going from a small handful of possible redirects to tens of thousands.</p>
    <div>
      <h2>Managing Redirects with Workers KV and Cuckoo Filters</h2>
      <a href="#managing-redirects-with-workers-kv-and-cuckoo-filters">
        
      </a>
    </div>
    <p>Well, here is one way you can solve it: by using <a href="https://developers.cloudflare.com/workers/kv/">Workers KV</a> – a key-value data store.</p><p>Instead of hard-coding redirects inside the Worker code, we will store them inside Workers KV. A naive approach would be to check KV for a redirect on every URL. But with Workers KV, maximum performance is not reached until a key is being read on the order of once per second in any given data center.</p><p>An alternative is a probabilistic data structure, like a <a href="https://en.wikipedia.org/wiki/Cuckoo_hashing">Cuckoo Filter</a>, stored in KV and possibly split between a couple of keys, as KV values are limited to 64KB. The filter flow would be:</p><ol><li><p>Retrieve the frequently read filter key(s).</p></li><li><p>Check whether the full URL (or only the pathname) is in the filter.</p></li><li><p>Only on a match, get the redirect from Workers KV using the URL as a key.</p></li></ol><p>In our tests, we managed to pack 20 thousand redirects into a Cuckoo Filter taking up 128KB, split between two keys, and verified it against 100 thousand active URLs with a false-positive rate of 0.5–1%.</p>
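<p>As a sketch of this approach (our own illustrative implementation, not Sloth’s; the class name, sizes and hash choices here are all hypothetical), a minimal cuckoo filter stores a short fingerprint of each URL in one of two candidate buckets, so membership tests are cheap and the serialized table stays small enough to cache in a couple of KV values:</p>

```javascript
// Minimal cuckoo filter sketch (illustrative, not production-grade).
const BUCKETS = 1 << 12;  // number of buckets (power of two, so we can mask)
const SLOTS = 4;          // fingerprints per bucket
const MAX_KICKS = 500;    // relocation attempts before declaring "too full"

// FNV-1a hash over a string, returning an unsigned 32-bit integer.
function fnv1a(str, seed = 0x811c9dc5) {
  let h = seed;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class CuckooFilter {
  constructor() {
    // 0 means "empty slot"; fingerprints are forced non-zero below.
    // This Uint16Array is what a Worker would serialize into KV.
    this.table = new Uint16Array(BUCKETS * SLOTS);
  }
  _fingerprint(key) {
    return (fnv1a(key, 0x9747b28c) & 0xffff) || 1; // non-zero 16-bit fingerprint
  }
  _indexes(key, fp) {
    // Partial-key cuckoo hashing: the alternate bucket is derived from
    // the fingerprint alone, so XOR recovers either bucket from the other.
    const i1 = fnv1a(key) & (BUCKETS - 1);
    const i2 = (i1 ^ fnv1a(String(fp))) & (BUCKETS - 1);
    return [i1, i2];
  }
  _bucketInsert(i, fp) {
    for (let s = 0; s < SLOTS; s++) {
      if (this.table[i * SLOTS + s] === 0) { this.table[i * SLOTS + s] = fp; return true; }
    }
    return false;
  }
  add(key) {
    const fp = this._fingerprint(key);
    const [i1, i2] = this._indexes(key, fp);
    if (this._bucketInsert(i1, fp) || this._bucketInsert(i2, fp)) return true;
    // Both buckets full: evict a random resident fingerprint and relocate it.
    let i = Math.random() < 0.5 ? i1 : i2, f = fp;
    for (let k = 0; k < MAX_KICKS; k++) {
      const s = (Math.random() * SLOTS) | 0;
      const evicted = this.table[i * SLOTS + s];
      this.table[i * SLOTS + s] = f;
      f = evicted;
      i = (i ^ fnv1a(String(f))) & (BUCKETS - 1); // the evicted entry's other bucket
      if (this._bucketInsert(i, f)) return true;
    }
    return false; // filter is too full
  }
  mightContain(key) {
    const fp = this._fingerprint(key);
    const [i1, i2] = this._indexes(key, fp);
    for (let s = 0; s < SLOTS; s++) {
      if (this.table[i1 * SLOTS + s] === fp || this.table[i2 * SLOTS + s] === fp) return true;
    }
    return false;
  }
}
```

<p>A Worker would hydrate the filter’s table from KV (split across keys if it exceeds the 64KB value limit), test the request URL against it, and only on a hit issue the KV read for the actual redirect target.</p>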
    <div>
      <h2>Body filter - Hreflang Injection</h2>
      <a href="#body-filter-hreflang-injection">
        
      </a>
    </div>
    <p>Hreflang meta tags need to be placed inside the HTML &lt;head&gt; element, so before actually injecting them, we need to find either the start or the end of the head tag, which in itself is a streaming search problem.</p><p>The caveat here is that the naive method of decoding UTF-8 into a JavaScript string, performing the search, and re-encoding back into UTF-8 is fairly slow. Instead, we attempted a pure JavaScript search over byte strings (<i>Uint8Array</i>), which straight away showed promising results.</p><p>For our use case, we picked the <a href="https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm">Boyer-Moore-Horspool</a> algorithm as the base of our streaming search, as it is simple, has great average-case performance, and only requires pre-processing the search pattern, with manual prefix/suffix matching at chunk boundaries.</p><p>Here is a comparison of the methods we have tested on Node v10.15.0:</p>
            <pre><code>| Chunk Size | Method                               | Ops/s               |
|------------|--------------------------------------|---------------------|
|            |                                      |                     |
| 1024 bytes | Boyer-Moore-Horspool over byte array | 163,086 ops/sec     |
| 1024 bytes | **precomputed BMH over byte array**  | **424,948 ops/sec** |
| 1024 bytes | decode utf8 into strings &amp; indexOf() | 91,685 ops/sec      |
|            |                                      |                     |
| 2048 bytes | Boyer-Moore-Horspool over byte array | 119,634 ops/sec     |
| 2048 bytes | **precomputed BMH over byte array**  | **232,192 ops/sec** |
| 2048 bytes | decode utf8 into strings &amp; indexOf() | 52,787 ops/sec      |
|            |                                      |                     |
| 4096 bytes | Boyer-Moore-Horspool over byte array | 78,729 ops/sec      |
| 4096 bytes | **precomputed BMH over byte array**  | **117,010 ops/sec** |
| 4096 bytes | decode utf8 into strings &amp; indexOf() | 25,835 ops/sec      |</code></pre>
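<p>The byte-level search benchmarked above can be sketched as follows (a minimal Horspool implementation over <i>Uint8Array</i>; the manual prefix/suffix matching at chunk boundaries is omitted for brevity):</p>

```javascript
// Boyer-Moore-Horspool over raw UTF-8 bytes (Uint8Array), avoiding the
// cost of decoding each chunk into a JavaScript string.

// Precompute the bad-character shift table once per pattern; reusing it
// across chunks is the "precomputed BMH" variant in the table above.
function horspoolShifts(pattern) {
  const shifts = new Uint32Array(256).fill(pattern.length);
  for (let i = 0; i < pattern.length - 1; i++) {
    shifts[pattern[i]] = pattern.length - 1 - i;
  }
  return shifts;
}

// Return the index of the first occurrence of pattern in haystack, or -1.
function horspoolSearch(haystack, pattern, shifts = horspoolShifts(pattern)) {
  const m = pattern.length;
  let i = 0;
  while (i + m <= haystack.length) {
    let j = m - 1;
    while (j >= 0 && haystack[i + j] === pattern[j]) j--;
    if (j < 0) return i;
    // Skip ahead based on the last byte of the current window.
    i += shifts[haystack[i + m - 1]];
  }
  return -1;
}
```

<p>Searching for a short pattern like <code>&lt;head&gt;</code> this way touches only a fraction of the haystack bytes, which is where the speedup over decode-then-<code>indexOf()</code> comes from.</p>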
            
    <div>
      <h2>Can we do better?</h2>
      <a href="#can-we-do-better">
        
      </a>
    </div>
    <p>Having achieved a decent performance improvement with a pure JavaScript search over the naive method, we wanted to see whether we could do better. As Workers support <a href="https://developers.cloudflare.com/workers/api/resource-bindings/webassembly-modules/">WASM</a>, we used Rust to build a simple WASM module, which exposed the standard Rust string search.</p>
            <pre><code>| Chunk Size | Method                              | Ops/s               |
|------------|-------------------------------------|---------------------|
|            |                                     |                     |
| 1024 bytes | Rust WASM                           | 348,197 ops/sec     |
| 1024 bytes | **precomputed BMH over byte array** | **424,948 ops/sec** |
|            |                                     |                     |
| 2048 bytes | Rust WASM                           | 225,904 ops/sec     |
| 2048 bytes | **precomputed BMH over byte array** | **232,192 ops/sec** |
|            |                                     |                     |
| 4096 bytes | **Rust WASM**                       | **129,144 ops/sec** |
| 4096 bytes | precomputed BMH over byte array     | 117,010 ops/sec     |</code></pre>
            <p>As the Rust version did not use a precomputed search pattern, it should be significantly faster still if we precomputed and cached the search patterns.</p><p>In our case, we were searching for a single pattern and stopping once it was found, so the pure JavaScript version was fast enough; but if you need multi-pattern or more advanced search, WASM is the way to go.</p><p>We could not record a statistically significant change in latency between a basic Worker and one with a body filter deployed to production, due to unstable network latency, with a mean response latency of 150ms and a 10% standard deviation at the 90th percentile.</p>
    <div>
      <h2>What’s next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>We believe that Workers and serverless applications can open up new opportunities to overcome a lot of issues faced by the SEO community when working with legacy tech stacks, platform limitations, and heavily congested development queues.</p><p>We are also investigating whether Workers can allow us to make a more efficient tag manager, which bundles and pushes only the matching tags with their code, to minimize the number of external requests caused by trackers and thus reduce load on the user’s browser.</p><p>You can experiment with Cloudflare Workers yourself through <a href="https://sloth.cloud/">Sloth</a>, even if you don’t know how to write JavaScript.</p>
            <category><![CDATA[SEO]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Cloudflare Workers KV]]></category>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <category><![CDATA[Developers]]></category>
            <guid isPermaLink="false">4hfeeckurXOimDtbX2H6cO</guid>
            <dc:creator>Guest Author</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leave your VPN and cURL secure APIs with Cloudflare Access]]></title>
            <link>https://blog.cloudflare.com/leave-your-vpn-and-curl-secure-apis-with-cloudflare-access/</link>
            <pubDate>Fri, 05 Oct 2018 18:30:07 GMT</pubDate>
            <description><![CDATA[ We built Access to solve a problem here at Cloudflare: our VPN. Our team members hated the slowness and inconvenience of the VPN, but that wasn’t the issue we needed to solve. The security risks posed by a VPN required a better solution. ]]></description>
            <content:encoded><![CDATA[ <p>We built Access to solve a problem here at Cloudflare: our VPN. Our team members hated the slowness and inconvenience of the VPN, but that wasn’t the issue we needed to solve. The security risks posed by a VPN required a better solution.</p><p>VPNs punch holes in the <a href="https://www.cloudflare.com/learning/access-management/what-is-the-network-perimeter/">network perimeter</a>. Once inside, individuals can access everything. This can include critically sensitive content like private keys, cryptographic salts, and log files. Cloudflare is a security company; this situation was unacceptable. We needed a better method that gives every application control over precisely who is allowed to reach it.</p><p>Access meets that need. We started by moving our browser-based applications behind Access. Team members could connect to applications faster, from anywhere, while we improved the security of the entire organization. However, we weren’t yet ready to turn off our VPN, as some tasks are better done through a command line. We cannot #EndTheVPN without replacing all of its use cases. Reaching a server from the command line required us to fall back to our VPN.</p><p>Today, we’re releasing a beta command line tool to help your team, and ours. Before we started using this feature at Cloudflare, curling a server required me to stop, find my VPN client and credentials, log in, and run my curl command. With Cloudflare’s command line tool, <code>cloudflared</code>, and Access, I can run <code>$ cloudflared access curl https://example.com/api</code> and Cloudflare authenticates my request to the server. I save time, and the security team at Cloudflare can control who reaches that endpoint (and monitor the logs).</p>
    <div>
      <h3>Protect your API with Cloudflare Access</h3>
      <a href="#protect-your-api-with-cloudflare-access">
        
      </a>
    </div>
    <p>To protect an <a href="https://www.cloudflare.com/learning/security/api/what-is-an-api/">API</a> with Access, you’ll follow the same <a href="https://developers.cloudflare.com/access/setting-up-access/securing-applications/">steps</a> that you use to protect a browser-based application. Start by adding the hostname where your API is deployed to your Cloudflare account.</p><p>Just like web applications behind Access, you can create granular policies for different paths of your HTTP API. Cloudflare Access will evaluate every request to the API for permission based on settings you configure. Placing your API behind Access means requests from any operation, CLI or other, will continue to be gated by Cloudflare. You can continue to use your API keys, if needed, as a second layer of security.</p>
    <div>
      <h3>Reach a protected API</h3>
      <a href="#reach-a-protected-api">
        
      </a>
    </div>
    <p>Cloudflare Access protects your application by checking for a valid JSON Web Token (JWT), whether the request comes through a browser or from the command line. We <a href="https://developers.cloudflare.com/access/setting-up-access/validate-jwt-tokens/">issue and sign</a> that JWT when you successfully log in with your identity provider. That token contains claims about your identity and session. The Cloudflare network looks at the claims in that token to determine if the request should proceed to the target application.</p><p>When you use a browser with Access, we redirect you to your identity provider, you log in, and we store that token in a cookie. Authenticating from the command line requires a different flow, but relies on the same principles. When you need to reach an application behind Access from your command line, the Cloudflare CLI tool, <code>cloudflared</code>, launches a browser window so that you can log in with your identity provider. Once you log in, Access will generate a JWT for your session, scoped to your user identity.</p><p>Rather than placing that JWT in a cookie, Cloudflare transfers the token in a cryptographically secure handoff to your machine. The client stores the token for you so that you don’t need to re-authenticate each time. The token is valid for the session duration configured in Access.</p><p>When you make requests from your command line, Access will look for an HTTP header, <code>cf-access-token</code>, instead of a cookie, and will evaluate the token in that header on every request. <b>If you use cURL, we can help you move even faster.</b> <code>cloudflared</code> includes a subcommand that wraps cURL and injects the JWT into the header for you.</p>
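<p>To illustrate what is being inspected, here is a sketch that decodes (but does not verify) the claims section of such a token. This is purely illustrative: a real validator must also verify the token’s signature against the public certificates Access publishes for your auth domain, which this sketch deliberately omits.</p>

```javascript
// Decode the claims section of a JWT, e.g. one taken from the
// cf-access-token header. NOTE: this only *reads* the claims; it does
// not verify the signature, which real validation must always do.
function decodeJwtClaims(token) {
  const parts = token.split(".");
  if (parts.length !== 3) throw new Error("not a JWT");
  // The payload is base64url-encoded JSON (Node's Buffer is used here;
  // a Worker could use atob() instead).
  const b64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  return JSON.parse(Buffer.from(b64, "base64").toString("utf8"));
}
```

<p>The claims typically include who you are and when the session expires, which is what the network checks before letting the request proceed.</p>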
    <div>
      <h3>Why use cloudflared to reach your application?</h3>
      <a href="#why-use-cloudflared-to-reach-your-application">
        
      </a>
    </div>
    <p>With <code>cloudflared</code> and its cURL wrapper, you can perform any cURL operation against an API protected by Cloudflare Access.</p><ul><li><p><b>Control endpoint access for specific users.</b> Cloudflare Access can be configured to protect specific endpoints. For example, you can create a rule that only a small group within your team can reach a particular URL path. You can apply that granular protection to sensitive endpoints so that you control who can reach those, while making other parts of the tool available to the full team.</p></li><li><p><b>Download sensitive data.</b> Placing applications with sensitive data behind Access lets you control who can reach that information. If a particular file is stored at a known location, you can save time by downloading it to your machine from the command line instead of walking through the UI flow.</p></li></ul>
    <div>
      <h3>What's next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>CLI authentication is available today to all Access customers through the <code>cloudflared</code> tool. Just add the API hostname to your Cloudflare account and enable Access to start building policies that control who can reach that API. If you do not have an Access subscription yet, you can read more about the plans <a href="https://www.cloudflare.com/products/cloudflare-access/">here</a> and sign up.</p><p>Once you’re ready to continue ditching your VPN, follow this <a href="https://developers.cloudflare.com/access/cli/connecting-from-cli/">link</a> to install <code>cloudflared</code> today. The tool is in beta and does not yet support automated scripting or service-to-service connections. Full instructions and known limitations can be found <a href="https://developers.cloudflare.com/access/cli/connecting-from-cli/">here</a>. If you are interested in providing feedback, you can post your comments in this <a href="https://community.cloudflare.com/t/cloudflare-access-cli-auth-beta/37564">thread</a>.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Access]]></category>
            <category><![CDATA[VPN]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[SASE]]></category>
            <guid isPermaLink="false">6jHWUuoTWxxosKdUqelM0K</guid>
            <dc:creator>Sam Rhea</dc:creator>
        </item>
        <item>
            <title><![CDATA[Validating Leaked Passwords with k-Anonymity]]></title>
            <link>https://blog.cloudflare.com/validating-leaked-passwords-with-k-anonymity/</link>
            <pubDate>Wed, 21 Feb 2018 19:00:44 GMT</pubDate>
            <description><![CDATA[ Today, v2 of Pwned Passwords was released as part of the Have I Been Pwned service offered by Troy Hunt. Containing over half a billion real world leaked passwords, this database provides a vital tool for correcting the course of how the industry combats modern threats against password security. ]]></description>
            <content:encoded><![CDATA[ <p>Today, <a href="https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/">v2 of <i>Pwned Passwords</i> was released</a> as part of the <i>Have I Been Pwned</i> service offered by Troy Hunt. Containing over half a billion real-world leaked passwords, this database provides a vital tool for correcting the course of how the industry combats modern threats against password security.</p><p>I have written about how we need to rethink password security and <i>Pwned Passwords v2</i> in the following post: <a href="/how-developers-got-password-security-so-wrong/"><i>How Developers Got Password Security So Wrong</i></a>. Instead, in this post I want to discuss one of the technical contributions Cloudflare has made towards protecting user information when using this tool.</p><p>Cloudflare continues to support <i>Pwned Passwords</i> by providing <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">CDN</a> and security functionality such that the data can easily be made available for download in raw form to organisations looking to protect their customers. Further, as part of the second iteration of this project, I have also worked with Troy on designing and implementing <a href="https://www.cloudflare.com/learning/security/api/what-is-api-endpoint/">API endpoints</a> that support anonymised <i>range queries</i>, which act as an additional, client-visible layer of security for those consuming the API.</p><p>This contribution allows <i>Pwned Passwords</i> clients to use <i>range queries</i> to search for breached passwords, without having to disclose a complete unsalted password hash to the service.</p>
    <div>
      <h3>Getting Password Security Right</h3>
      <a href="#getting-password-security-right">
        
      </a>
    </div>
    <p>Over time, the industry has realised that complex password composition rules (such as requiring a minimum number of special characters) have done little to improve user behaviour in making stronger passwords; they have done little to prevent users from putting personal information in passwords, from choosing common passwords, or from reusing previously breached passwords<a href="#fn1">[1]</a>. <a href="https://www.cloudflare.com/learning/bots/what-is-credential-stuffing/">Credential Stuffing</a> has become a real threat recently; usernames and passwords are obtained from compromised websites and then injected into other websites until compromised user accounts are found.</p><p>This fundamentally works because users reuse passwords across different websites; when one set of credentials is breached on one site, it can be reused on other websites. Here are some examples of how credentials can be breached from insecure websites:</p><ul><li><p>websites which don't use <a href="https://www.cloudflare.com/learning/bots/what-is-rate-limiting/">rate limiting</a> or challenge login requests can have a user's log-in credentials breached using brute-force attacks of common passwords for a given user,</p></li><li><p>database dumps from hacked websites can be taken offline and the password hashes cracked; modern GPUs make this very efficient for dictionary passwords (even with algorithms like Argon2, PBKDF2 and BCrypt),</p></li><li><p>many websites continue not to use any form of password hashing; once breached, passwords can be captured in raw form,</p></li><li><p>proxy attacks or hijacking a web server can allow for capturing passwords before they're hashed.</p></li></ul><p>This becomes a problem with password reuse; having obtained real-life username/password combinations, attackers can inject them into other websites (such as payment gateways, social networks, etc.) until access is obtained to more accounts (often of a higher value than the original compromised site).</p><p>Under <a href="https://pages.nist.gov/800-63-3/sp800-63b.html">recent NIST guidance</a>, it is a requirement, when storing or updating passwords, to ensure they do not contain values which are commonly used, expected or compromised<a href="#fn2">[2]</a>. Research has found that 88.41% of users who received a <i>fear appeal</i> later set unique passwords, whilst only 4.45% of users who did not receive a fear appeal would set a unique password<a href="#fn3">[3]</a>.</p><p>Unfortunately, there are a lot of leaked passwords out there; the downloadable raw data from <i>Pwned Passwords</i> currently contains over 30 GB of password hashes.</p>
    <div>
      <h3>Anonymising Password Hashes</h3>
      <a href="#anonymising-password-hashes">
        
      </a>
    </div>
    <p>The key problem in checking passwords against the old <i>Pwned Passwords</i> API (and all similar services) lies in how passwords are checked: users are effectively required to submit unsalted hashes of passwords to identify whether a password is breached. The hashes must be unsalted, as salting them makes them computationally difficult to search quickly.</p><p>Currently, there are two choices available for validating whether a password is or is not leaked:</p><ul><li><p>Submit the password (as an unsalted hash) to a third-party service, where the hash can potentially be stored for later cracking or analysis. For example, if you make an API call for a leaked password to a third-party API service using a WordPress plugin, the IP of the request can be used to identify the WordPress installation and then breach it when the password is cracked (such as from a later disclosure); or,</p></li><li><p>download the entire list of password hashes, uncompress the dataset and then run a search to see if your password hash is listed.</p></li></ul><p>Needless to say, this conflict can seem like being placed between a <a href="https://www.cloudflare.com/learning/security/how-to-improve-wordpress-security/">security-conscious rock</a> and an insecure hard place.</p>
    <div>
      <h3>The Middle Way</h3>
      <a href="#the-middle-way">
        
      </a>
    </div>
    
    <div>
      <h4>The Private Set Intersection (PSI) Problem</h4>
      <a href="#the-private-set-intersection-psi-problem">
        
      </a>
    </div>
    <p>Academic computer scientists have studied how two (or more) parties can compute the intersection of their datasets without either side revealing what it holds. Whilst this work is exciting, these techniques are new: they haven't been subject to long-term review by the cryptography community, and the cryptographic primitives involved have not been implemented in any major libraries. Additionally (but critically), PSI implementations have substantially higher overhead than our <i>k</i>-Anonymity approach (particularly in communication<a href="#fn4">[4]</a>). Even the current academic state of the art is not within acceptable performance bounds for an API service, with the communication overhead being equivalent to downloading the entire set of data.</p>
    <div>
      <h4>k-Anonymity</h4>
      <a href="#k-anonymity">
        
      </a>
    </div>
    <p>Instead, our approach adds an additional layer of security by utilising a mathematical property known as <i>k</i>-Anonymity and applying it to password hashes in the form of <i>range queries</i>. As such, the <i>Pwned Passwords</i> API service never gains enough information about a non-breached password hash to be able to breach it later.</p><p><i>k</i>-Anonymity is used in multiple fields to release anonymised but workable datasets; for example, so that hospitals can release patient information for medical research whilst withholding personally identifying information. Formally, a data set can be said to hold the property of <i>k</i>-Anonymity if, for every record in a released table, there are at least <code>k − 1</code> other records identical to it.</p><p>By using this property, we are able to separate hashes into anonymised "buckets". A client is able to anonymise the user-supplied hash and then download all leaked hashes in the same anonymised "bucket" as that hash, then do an offline check to see if the user-supplied hash is in that breached bucket.</p><p>In more concrete terms:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6FkAe2DhLMl2u6Dd4ZXx0P/fc76ccce5fe7570981625977f7d85ace/hash-bucket.png" />
            
            </figure><p>In essence, we turn the tables on password derivation functions; instead of seeking to salt hashes to the point at which they are unique (against identical inputs), we instead introduce ambiguity into what the client is requesting.</p><p>Given hashes are essentially fixed-length hexadecimal values, we are able to simply truncate them, instead of having to resort to a decision tree structure to filter down the data. This does mean buckets are of unequal sizes, but it allows clients to query in a single API request.</p><p>This approach can be implemented in a trivial way. Suppose a user enters the password <code>test</code> into a login form and the service they’re logging into is programmed to validate whether their password is in a database of leaked password hashes. Firstly, the client will generate a hash (in our example using SHA-1) of <code>a94a8fe5ccb19ba61c4c0873d391e987982fbbd3</code>. The client will then truncate the hash to a predetermined number of characters (for example, 5), resulting in a Hash Prefix of <code>a94a8</code>. This Hash Prefix is then used to query the remote database for all hashes starting with that prefix (for example, by making an HTTP request to <code>example.com/a94a8.txt</code>). The entire hash list is then downloaded and each downloaded hash is compared to see if any match the locally generated hash. If so, the password is known to have been leaked.</p><p>As this can easily be implemented over HTTP, client-side caching can be used for performance purposes; the API is simple enough for developers to implement with little pain.</p><p>Below is a simple Bash implementation of how the <i>Pwned Passwords</i> API can be queried using <i>range queries</i> (<a href="https://gist.github.com/IcyApril/56c3fdacb3a640f37c245e5813b98b99">Gist</a>):</p>
            <pre><code>#!/bin/bash

echo -n Password:
read -s password
echo
# Take only the final field: some OpenSSL versions prefix the digest with "(stdin)= "
hash="$(echo -n "$password" | openssl sha1 | awk '{print $NF}')"
upperCase="$(echo "$hash" | tr '[a-z]' '[A-Z]')"
prefix="${upperCase:0:5}"
response=$(curl -s "https://api.pwnedpasswords.com/range/$prefix")
while read -r line; do
  lineOriginal="$prefix$line"
  if [ "${lineOriginal:0:40}" == "$upperCase" ]; then
    echo "Password breached."
    exit 1
  fi
done &lt;&lt;&lt; "$response"

echo "Password not found in breached database."
exit 0</code></pre>
            
    <div>
      <h3>Implementation</h3>
      <a href="#implementation">
        
      </a>
    </div>
    <p>Hashes (even in unsalted form) have two properties that are useful in anonymising data.</p><p>Firstly, the Avalanche Effect means that a small change in the input produces a very different hash; this means you can't infer the contents of one hash from another, similar hash. This is true even in truncated form.</p><p>For example, the Hash Prefix <code>21BD1</code> contains 475 seemingly unrelated passwords, including:</p>
            <pre><code>lauragpe
alexguo029
BDnd9102
melobie
quvekyny</code></pre>
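<p>The Avalanche Effect is easy to observe directly. The following Python snippet (an illustrative sketch, not part of the original service) hashes two inputs differing only in the case of one letter and compares their 5-character prefixes:</p>

```python
import hashlib

# Two near-identical inputs: only the case of one letter differs
h1 = hashlib.sha1(b"test").hexdigest()  # a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
h2 = hashlib.sha1(b"Test").hexdigest()

# The truncated Hash Prefixes share nothing, despite the similar inputs
print(h1[:5], h2[:5])
```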
            <p>Further, hashes are fairly uniformly distributed. If we count the original 320 million leaked passwords (in Troy's dataset) by the first hexadecimal character of the hash, the difference in the number of hashes between the largest and the smallest Hash Prefix is ≈ 1%. The chart below shows hash count by their first hexadecimal digit:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7IqJBkvelNZ0QVFoDEfrKY/96682f7d36961752b3f2c14bc6235f20/hashes_by_hash_prefix.png" />
            
            </figure><p>Algorithm 1 provides a simple check to discover how far we can truncate hashes whilst ensuring every "bucket" has more than one hash in it. It requires the hashes to be sorted by hexadecimal value. The algorithm, including an initial merge sort, runs in roughly <code>O(n log n)</code> time (worst-case):</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/rdI6HJEupI0hSGAsDvr4m/74045dd1a84842e41127e4a9d9213480/Screen-Shot-2018-02-18-at-23.37.15.png" />
            
            </figure><p>After identifying the Maximum Hash Prefix length, it is fairly easy to separate the hashes into buckets, as described in Algorithm 3:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3peODDMG8GqOHrkA48VOPt/168f1da40b0b41342d571791c3789541/Screen-Shot-2018-02-18-at-23.02.02.png" />
            
            </figure><p>This implementation was originally evaluated on a dataset of over 320 million breached passwords, and we found the Maximum Prefix Length to which all hashes can be truncated, whilst maintaining the property of <i>k</i>-Anonymity, is 5 characters. When hashes are grouped together by a Hash Prefix of 5 characters, the median number of hashes associated with a Hash Prefix is 305. With the range of response sizes for a query varying from 8.6KB to 16.8KB (a median of 12.2KB), the dataset is usable in many practical scenarios and is certainly a good response size for an API client.</p><p>On the new <i>Pwned Passwords</i> dataset (with over half a billion passwords), whilst keeping the Hash Prefix length at 5, the average number of hashes returned is 478 - with the smallest bucket holding 381 hashes (<code>E0812</code> and <code>E613D</code>) and the largest holding 584 (<code>00000</code> and <code>4A4E8</code>).</p><p>Splitting the hashes into buckets by a Hash Prefix of 5 means a maximum of 16^5 = 1,048,576 buckets would be utilised (for SHA-1), assuming that every possible Hash Prefix contains at least one hash. In the datasets we found this to be the case: the number of distinct Hash Prefix values was equal to the highest possible number of buckets. Whilst for secure hashing algorithms it is computationally infeasible to invert the hash function, it is worth noting that as a SHA-1 hash is 40 hexadecimal characters long and 5 characters are used by the Hash Prefix, the total number of possible hashes associated with a Hash Prefix is 16^35 ≈ 1.39 × 10^42.</p>
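<p>To make the two algorithms concrete, here is an illustrative Python sketch (a re-implementation for explanation, not the production code). It finds the longest prefix length at which every hash still shares its prefix with at least one other hash (the check Algorithm 1 performs, using the fact that in sorted order a hash's longest-common-prefix partner is an immediate neighbour), and then groups hashes into buckets by that prefix:</p>

```python
from collections import defaultdict


def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading characters of two hex strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def max_prefix_length(hashes):
    """Longest truncation length that keeps k-anonymity with k >= 2.

    After an O(n log n) sort, a single linear scan suffices: each hash's
    best prefix match is one of its neighbours in sorted order.
    """
    hs = sorted(hashes)
    best = len(hs[0])
    for i, h in enumerate(hs):
        neighbour = 0
        if i > 0:
            neighbour = max(neighbour, common_prefix_len(h, hs[i - 1]))
        if i < len(hs) - 1:
            neighbour = max(neighbour, common_prefix_len(h, hs[i + 1]))
        best = min(best, neighbour)
    return best


def split_into_buckets(hashes, prefix_len):
    """Group hash suffixes under their Hash Prefix."""
    buckets = defaultdict(list)
    for h in sorted(hashes):
        buckets[h[:prefix_len]].append(h[prefix_len:])
    return buckets
```

<p>On the real dataset, this kind of scan yields the Maximum Hash Prefix length of 5 described above.</p>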
    <div>
      <h3>Important Caveats</h3>
      <a href="#important-caveats">
        
      </a>
    </div>
    <p>It is important to note that where a user's password is already breached, an API call for a specific range of breached passwords can narrow the search candidates used in a <a href="https://www.cloudflare.com/learning/bots/brute-force-attack/">brute-force attack</a>. Users with breached passwords are already vulnerable to such attacks, but a range query does shrink the candidate set - although the API service has no way of determining whether the client was searching for a password that was breached. Using a deterministic algorithm to also query other Hash Prefixes can help reduce this risk.</p><p>One reason this is important is that this implementation does not currently guarantee <i>l</i>-diversity, meaning a bucket may contain a hash which is in substantially higher use than others. In the future we hope to use percentile-based usage information from the original breached data to better guarantee this property.</p><p>For general users, <i>Pwned Passwords</i> is usually exposed via a web interface, which uses a JavaScript client to run this process; if the origin web server were hijacked to change the JavaScript being returned, this computation could be removed (and the password sent to the hijacked origin server). Whilst JavaScript requests are somewhat transparent to a developer, this should not be depended upon, and for technical users non-web clients are preferable.</p><p>The original use-case for this service was to be deployed privately in a Cloudflare data centre, where our services can use it to enhance user security, with <i>range queries</i> complementing the existing transport security. Depending on your risks, it's safer to deploy this service yourself (in your own data centre) and use the <i>k</i>-Anonymity approach to validate passwords where services do not themselves have the resources to store an entire database of leaked password hashes.</p><p>I would strongly recommend against storing the <i>range queries</i> made by users of your service; but if you do for whatever reason, store them only as aggregate analytics such that they cannot be linked back to any given user's password.</p>
    <div>
      <h3>Final Thoughts</h3>
      <a href="#final-thoughts">
        
      </a>
    </div>
    <p>Going forward, as we test this technology more, Cloudflare is looking into how we can use a private deployment of this service to better offer security functionality, both for log-in requests to our dashboard and for customers who want to protect against credential stuffing on their own websites using our edge network. We are also considering how to incorporate recent work on the Private Set Intersection Problem, alongside <i>l</i>-diversity, for additional security guarantees. As always, we'll keep you updated right here, on our blog.</p><hr /><ol><li><p>Campbell, J., Ma, W. and Kleeman, D., 2011. Impact of restrictive composition policy on user password choices. Behaviour &amp; Information Technology, 30(3), pp.379-388. <a href="#fnref1">↩︎</a></p></li><li><p>Grassi, P. A., Fenton, J. L., Newton, E. M., Perlner, R. A., Regenscheid, A. R., Burr, W. E., Richer, J. P., Lefkovitz, N. B., Danker, J. M., Choong, Y.-Y., Greene, K. K., and Theofanos, M. F. (2017). NIST Special Publication 800-63B Digital Identity Guidelines, chapter Authentication and Lifecycle Management. National Institute of Standards and Technology, U.S. Department of Commerce. <a href="#fnref2">↩︎</a></p></li><li><p>Jenkins, Jeffrey L., Mark Grimes, Jeffrey Gainer Proudfoot, and Paul Benjamin Lowry. "Improving password cybersecurity through inexpensive and minimally invasive means: Detecting and deterring password reuse through keystroke-dynamics monitoring and just-in-time fear appeals." Information Technology for Development 20, no. 2 (2014): 196-213. <a href="#fnref3">↩︎</a></p></li><li><p>De Cristofaro, E., Gasti, P. and Tsudik, G., 2012, December. Fast and private computation of cardinality of set intersection and union. In International Conference on Cryptology and Network Security (pp. 218-231). Springer, Berlin, Heidelberg. <a href="#fnref4">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[Best Practices]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">4k6ry6xuTMJwpexgEzV2hB</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[How Developers got Password Security so Wrong]]></title>
            <link>https://blog.cloudflare.com/how-developers-got-password-security-so-wrong/</link>
            <pubDate>Wed, 21 Feb 2018 19:00:11 GMT</pubDate>
            <description><![CDATA[ Both in our real lives, and online, there are times where we need to authenticate ourselves - where we need to confirm we are who we say we are. This can be done using three things. ]]></description>
            <content:encoded><![CDATA[ <p>Both in our real lives, and online, there are times where we need to authenticate ourselves - where we need to confirm we are who we say we are. This can be done using three things:</p><ul><li><p>Something you <i>know</i></p></li><li><p>Something you <i>have</i></p></li><li><p>Something you <i>are</i></p></li></ul><p>Passwords are an example of something you <i>know</i>; they were introduced in 1961 to authenticate users of a time-share computer at MIT. Shortly afterwards, a PhD researcher breached this system (by simply downloading a list of unencrypted passwords) and used the time allocated to others on the computer.</p><p>As time has gone on, developers have continued to store passwords insecurely, and users have continued to set them weakly. Despite this, no viable alternative has been created for password security. To date, no system has been created that retains all the benefits that passwords offer, as researchers have rarely considered real-world constraints<a href="#fn1">[1]</a>. For example, when using fingerprints for authentication, engineers often forget that a sizable percentage of the population do not have usable fingerprints, or overlook the cost of hardware upgrades.</p>
    <div>
      <h3>Cracking Passwords</h3>
      <a href="#cracking-passwords">
        
      </a>
    </div>
    <p>In the 1970s, people started thinking about how to better store passwords, and cryptographic hashing started to emerge.</p><p>Cryptographic hashes work like trapdoors; whilst it's easy to hash a password, it's far harder (computationally infeasible for an ideal hashing algorithm) to turn that "hash" back into the original input. They are used in a lot of things, from speeding up file searches to the One Time Password generators used by banks.</p><p>Passwords should ideally use specialised hashing functions like Argon2, BCrypt or PBKDF2, which are designed to prevent Rainbow Table attacks.</p><p>If you were to hash the password <code>p4$$w0rd</code> using the SHA-1 hashing algorithm, the output would be <code>6c067b3288c1b5c791afa04e12fb013ed2e84d10</code>. This output is the same every time the algorithm is run. As a result, attackers are able to create Rainbow Tables, which contain the hashes of common passwords, and use them to break any password hash that appears in the table.</p><p>Algorithms like BCrypt essentially salt passwords with a random string before hashing them. This random string is stored alongside the password hash and helps make the password harder to crack by making the output unique. The hashing process is repeated many times (defined by a difficulty variable), each time adding the random salt onto the output of the hash and rerunning the hash computation.</p><p>For example, the BCrypt hash <code>$2a$10$N9qo8uLOickgx2ZMRZoMyeIjZAgcfl7p92ldGxad68LJZdL17lhWy</code> starts with <code>$2a$10$</code>, which indicates the algorithm used is BCrypt, and contains a random salt of <code>N9qo8uLOickgx2ZMRZoMye</code> and a resulting hash of <code>IjZAgcfl7p92ldGxad68LJZdL17lhWy</code>. Storing the salt allows the password hash to be regenerated identically when the input is known.</p><p>Unfortunately, salting is no longer enough: passwords can be cracked ever faster using modern GPUs (specialised at doing the same task over and over). When a site suffers a security breach, users' passwords can be exfiltrated in database dumps and cracked offline.</p><p>Additionally, websites that fail to rate-limit login requests or use CAPTCHAs can be challenged by Brute Force attacks. For a given user, an attacker will repeatedly try different (but common) passwords until they gain access to that user's account.</p><p>Where sites lock users out after a handful of failed login attempts, attacks can instead be targeted to move on quickly to a new account once the most common passwords have been attempted. Lists like the following (in some cases with many, many more passwords) can be attempted to breach an account:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5uMRyoSntS2dkOn7C9Zc6v/c882908f5dfc1b8b5c257e8583f4b27a/common-weak-passwords.png" />
            
            </figure><p>The industry has tried to combat this problem with password composition rules: requiring users to comply with complex rules (such as a minimum number of digits or punctuation symbols) before setting passwords. Research has shown that this hasn't helped combat password reuse, weak passwords or users putting personal information in their passwords.</p>
    <div>
      <h4>Credential Stuffing</h4>
      <a href="#credential-stuffing">
        
      </a>
    </div>
    <p>Whilst it may seem that this is only a problem for websites that store passwords weakly, Credential Stuffing makes things even worse.</p><p>It is common for users to reuse passwords from site to site, meaning a username and password from a compromised website can be used to breach far more important accounts - like online banking gateways or government logins. When a password is reused, it takes just one website being breached to gain access to every other site a user shares credentials with.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6l6Zm2PWsT6cXcXHpPevKX/87148160a2ed6d72bfd6288e156ecd28/this-is-not-fine-009-a7b6e6.png" />
            
            </figure><p> <a href="https://thenib.com/this-is-not-fine">This Is Not Fine - The Nib</a></p>
    <div>
      <h3>Fixing Passwords</h3>
      <a href="#fixing-passwords">
        
      </a>
    </div>
    <p>There are fundamentally three things that need to be done to fix this problem:</p><ul><li><p>Improve UX to guide user decisions</p></li><li><p>Improve developer education</p></li><li><p>Eliminate the reuse of breached passwords</p></li></ul>
    <div>
      <h4>How Can I Secure Myself (or my Users)?</h4>
      <a href="#how-can-i-secure-myself-or-my-users">
        
      </a>
    </div>
    <p>Before discussing the things we're doing, I wanted to briefly discuss what you can do to help protect yourself now. For most users, there are three steps you can immediately take to help yourself.</p><p>Use a Password Manager (like 1Password or LastPass) to set random, unique passwords for every site. Additionally, look to enable Two-Factor Authentication where possible; this uses something you <i>have</i>, in addition to the password you <i>know</i>, to validate you. This will mean, alongside your password, you have to enter a short-lived password from a device like your phone before being able to log in to any site.</p><p>Two-Factor Authentication is supported on many of the world's most popular social media, banking and shopping sites. You can find out how to enable it on popular websites at <a href="https://www.turnon2fa.com/tutorials/">turnon2fa.com</a>. If you are a developer, you should take steps to ensure you support Two-Factor Authentication.</p><p>Set a secure, memorable password for your password manager; and yes, turn on Two-Factor Authentication for it (and keep your backup codes safe). You can find additional security tips (including tips on how to create a secure main password) in my blog post: <a href="/cyber-security-advice-for-your-parents/">Simple Cyber Security Tips</a>.</p><p>Developers should look to abolish bad-practice composition rules (and simplify password requirements as much as possible). Password expiration policies do more harm than good, so seek to do away with them. 
For further information refer to the blog post by the UK's National Cyber Security Centre: <a href="https://www.ncsc.gov.uk/articles/problems-forcing-regular-password-expiry">The problems with forcing regular password expiry</a>.</p><p>Finally, Troy Hunt has an excellent blog post on passwords for users and developers alike: <a href="https://www.troyhunt.com/passwords-evolved-authentication-guidance-for-the-modern-era/">Passwords Evolved: Authentication Guidance for the Modern Era</a>.</p>
    <div>
      <h4>Improving Developer Education</h4>
      <a href="#improving-developer-education">
        
      </a>
    </div>
    <p>Developers should seek to build a culture of security in the organisations where they work: talk about security, about the benefits of challenging malicious login requests, and about password hashing in simple terms.</p><p>If you're working on an open-source project that handles authentication, expose easy password-hashing APIs - for example the <code>password_hash</code>, <code>password_needs_rehash</code> &amp; <code>password_verify</code> functions in modern PHP versions.</p>
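<p>To illustrate what such an API surface looks like outside PHP, here is a hedged Python sketch using the standard library's PBKDF2 (the function names mirror PHP's but are mine, not a real library's; the iteration count is an illustrative placeholder you should tune upwards in production):</p>

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor; tune for your hardware


def password_hash(password: str) -> str:
    """Hash a password with a fresh random salt using PBKDF2-HMAC-SHA256."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return f"{salt.hex()}${ITERATIONS}${digest.hex()}"


def password_verify(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    salt_hex, iterations, digest_hex = stored.split("$")
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                 bytes.fromhex(salt_hex), int(iterations))
    return hmac.compare_digest(digest.hex(), digest_hex)
```

<p>Because the salt is random, hashing the same password twice yields different stored values - exactly the property that defeats Rainbow Tables.</p>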
    <div>
      <h4>Eliminating Password Reuse</h4>
      <a href="#eliminating-password-reuse">
        
      </a>
    </div>
    <p>We know that complex password composition rules are largely ineffective, and recent guidance has followed suit. A better alternative to composition rules is to block users from signing up with passwords which are known to have been breached. Under <a href="https://pages.nist.gov/800-63-3/sp800-63b.html">recent NIST guidance</a>, it is a requirement, when storing or updating passwords, to ensure they do not contain values which are commonly used, expected or compromised<a href="#fn2">[2]</a>.</p><p>This is easier said than done: the recent version of Troy Hunt's <i>Pwned Passwords</i> database contains over half a billion passwords (over 30 GB uncompressed). Whilst developers can use API services to check if a password is reused, this requires either sending the raw password, or the password in an unsalted hash. This can be especially problematic when multiple services handle authentication in a business, and each has to store a large quantity of passwords.</p><p>This is a problem I've started looking into recently; as part of our contribution to Troy Hunt's <i>Pwned Passwords</i> database, I have designed a <i>range search</i> API that allows developers to check if a password is reused without needing to share the password (even in hashed form) - instead only needing to send a short segment of the cryptographic hash used. You can find more information on this contribution in the post: <a href="/validating-leaked-passwords-with-k-anonymity/">Validating Leaked Passwords with k-Anonymity</a>.</p><p>Version 2 of <i>Pwned Passwords</i> is now available - you can find more information on how it works on Troy Hunt's blog post "<a href="https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/">I've Just Launched Pwned Passwords, Version 2</a>".</p><hr /><ol><li><p>Bonneau, J., Herley, C., Van Oorschot, P.C. and Stajano, F., 2012, May. The quest to replace passwords: A framework for comparative evaluation of web authentication schemes. 
In Security and Privacy (SP), 2012 IEEE Symposium on (pp. 553-567). IEEE. <a href="#fnref1">↩︎</a></p></li><li><p>Grassi, P. A., Fenton, J. L., Newton, E. M., Perlner, R. A., Regenscheid, A. R., Burr, W. E., Richer, J. P., Lefkovitz, N. B., Danker, J. M., Choong, Y.-Y., Greene, K. K., and Theofanos, M. F. (2017). NIST Special Publication 800-63B Digital Identity Guidelines, chapter Authentication and Lifecycle Management. National Institute of Standards and Technology, U.S. Department of Commerce. <a href="#fnref2">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[Passwords]]></category>
            <category><![CDATA[Best Practices]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">6lsrdeZvgI7O6CidQWI49j</guid>
            <dc:creator>Junade Ali</dc:creator>
        </item>
        <item>
            <title><![CDATA[NANOG - the art of running a network and discussing common operational issues]]></title>
            <link>https://blog.cloudflare.com/nanog-the-art-of-running-a-network-and-discussing-common-operational-issues/</link>
            <pubDate>Thu, 02 Feb 2017 12:15:26 GMT</pubDate>
            <description><![CDATA[ The North American Network Operators Group (NANOG) is the loci of modern Internet innovation and the day-to-day cumulative network-operational knowledge of thousands and thousands of network engineers. ]]></description>
            <content:encoded><![CDATA[ <p>The <a href="https://nanog.org/">North American Network Operators Group</a> (NANOG) is the locus of modern Internet innovation and the day-to-day cumulative network-operational knowledge of thousands and thousands of network engineers. NANOG itself is a non-profit membership organization, but you don’t need to be a member in order to attend the conference or <a href="https://nanog.org/list/join">join</a> the mailing list. That said, if you can become a member, then you’re helping a good cause.</p><p>The next NANOG conference starts in a few days (February 6-8 2017) in <a href="https://www.nanog.org/meetings/nanog69/agenda">Washington, DC</a>. Nearly 900 network professionals are converging on the city to discuss a variety of network-related issues, both big and small, all related to running and improving the global Internet. For this upcoming meeting, Cloudflare has three network professionals in attendance: two from the San Francisco office and one from the London office.</p><p>With the conference starting next week, it seemed a great opportunity to introduce readers of the blog to why a NANOG conference is so worth attending.</p>
    <div>
      <h2>Tutorials</h2>
      <a href="#tutorials">
        
      </a>
    </div>
    <p>While it seems obvious how to do some network tasks (you unpack the spiffy new wireless router from its box, you set up its security and plug it in), the global Internet is somewhat more complex. Even seasoned professionals could do with a recap on how <a href="https://youtu.be/WL0ZTcfSvB4">traceroute</a> actually works, or how <a href="https://youtu.be/9ksfOUyvNi8">DNSSEC</a> operates, or this year's subtle BGP complexities, or be enlightened about <a href="https://youtu.be/_KFpXuHqHQg">Optical Networking</a>. All this can assist you with deployments within your network or datacenter.</p>
    <div>
      <h2>Peering</h2>
      <a href="#peering">
        
      </a>
    </div>
    <p>If there’s one thing that keeps the Internet (a network-of-networks) operating, it’s peering. Peering is the act of bringing together two or more networks to allow traffic (bits, bytes, packets, email messages, web pages, audio and video streams) to flow efficiently and cleanly between source and destination. The Internet is nothing more than a collection of individual networks. NANOG provides one of many forums for diverse network operators to meet face-to-face and negotiate and enable those interconnections.</p><p>While NANOG isn’t the only event that draws networks together to discuss interconnection, it’s one of the early forums to support these peering discussions.</p>
    <div>
      <h2>Security and Reputation</h2>
      <a href="#security-and-reputation">
        
      </a>
    </div>
    <p>In this day and age we are brutally aware that security is the number-one issue when using the Internet. This is something to think about when you choose your email password, or the lock screen password on your laptop, tablet or smartphone. Hint: you should always have a lock screen!</p><p>At NANOG the security discussion is focused on a much deeper part of the global Internet: the very hardware and software practices that operate and support the underlying networks we all use on a daily basis. An Internet backbone (rarely seen) is a network that moves traffic from one side of the globe to the other (or from one side of a city to the other). At NANOG we discuss how that underlying infrastructure can operate efficiently, securely, and be continually strengthened. The growth of the Internet over the last handful of decades has pushed the envelope when it comes to hardware deployments and network complexity. Sometimes it only takes one compromised box to ruin your day. Discussions at conferences like NANOG are vital to the sharing of knowledge and collective improvement of everyone's networks.</p><p>Above the hardware layer (from a network stack point of view) is the Domain Name System (DNS). DNS has always been a major subject of discussion within the NANOG community. It’s very much up to the operational community to make sure that when you type a website name into a web browser, or someone’s email address into your email program, there’s a highly efficient process to convert from names to numbers (numbers, or IP addresses, are the address book and routing method of the Internet). DNS has had its fair share of focus in the security arena and it comes down to network operators (and their system administrator colleagues) to protect DNS infrastructure.</p>
    <div>
      <h2>Network Operations; best practices and stories of disasters</h2>
      <a href="#network-operations-best-practices-and-stories-of-disasters">
        
      </a>
    </div>
    <p>Nearly everyone knows that bad news sells. It’s a fact. To be honest, the same is the case in the network operator community. However, within NANOG, those stories of disasters are nearly always told from a learning and improvement point of view. There’s simply no need to repeat a failure; no-one enjoys it a second time around. Notable stories have included subjects like <a href="https://youtu.be/XNubCYBprjE">route-leaks</a>, <a href="https://youtu.be/_95pC8khh8Y?list=PLO8DR5ZGla8iHYAM_AL7ZcO8F2F4AcosD">BGP protocol hiccups</a>, <a href="https://www.nanog.org/meetings/nanog17/presentations/vixie.pdf">peering points</a>, and plenty more.</p><p>We simply can’t rule out failures within portions of the network; hence NANOG has spent plenty of time discussing redundancy. The Internet operates using routing protocols that explicitly allow for redundancy in the paths that traffic travels. Should a failure occur (a hardware failure, or a fiber cut), the theory is that the traffic will be routed around that failure. This is a recurring topic for NANOG meetings. <a href="https://youtu.be/GMi3pP21nHc">Subsea cables</a> (and their occasional cuts) always make for good talks.</p>
    <div>
      <h2>Network Automation</h2>
      <a href="#network-automation">
        
      </a>
    </div>
    <p>While we learned twenty or more years ago how to type into Internet routers on the command line, those days are quickly becoming history. We simply can’t scale if network operational engineers have to type the same commands into hundreds (or thousands?) of boxes around the globe. We need automation, and NANOG has been a leader in this space. Cloudflare has been active in this arena too: Mircea Ulinic presented our experience with <a href="https://youtu.be/gV2918bH5_c?list=PLO8DR5ZGla8hcpeEDSBNPE5OrZf70iXZg">Network Automation with Salt and NAPALM</a> at the previous NANOG meeting. Mircea (and Jérôme Fleury) will be giving a follow-up in-depth tutorial on the subject at next week’s meeting.</p>
    <div>
      <h2>Many more subjects covered</h2>
      <a href="#many-more-subjects-covered">
        
      </a>
    </div>
    <p>The first NANOG conference was held June 1994 in Ann Arbor, Michigan and the conference has grown significantly since then. While it’s fun to follow the <a href="https://www.nanog.org/history">history</a>, it’s maybe more important to realize that NANOG has covered a multitude of subjects since that start. Go scan the archives at <a href="https://www.nanog.org/">nanog.org</a> and/or watch some of the online <a href="https://www.youtube.com/user/TeamNANOG/playlists">videos</a>.</p>
    <div>
      <h2>The socials (downtime between technical talks)</h2>
      <a href="#the-socials-downtime-between-technical-talks">
        
      </a>
    </div>
    <p>Let’s not forget the advantages of spending time with other operators in a relaxed setting. After all, sometimes the big conversations happen over a beer while discussing common issues. NANOG has long understood this, and the Tuesday evening Beer ’n Gear social is set up specifically to let network geeks both grab a drink (soft drinks included) and poke around with the latest and greatest network hardware on show. The social is as much about blinking lights on shiny network boxes as it is about tracking down that network buddy.</p><p>Oh; there’s a fair number of vendor giveaways (so far 15 hardware and software vendors have signed up for next week’s event). After all, who doesn’t need a new t-shirt?</p><p>But there’s more to the downtime and casual hallway conversations. For myself (the author of this blog), I know that sometimes the most important work is done in the hallways during breaks in the meeting rather than standing at the microphone presenting from the podium. The industry has long recognized this, and the NANOG organizers were among the early pioneers in providing full-time coffee and snacks covering the full conference agenda. Why? Because sometimes you have to step out of the regular presentations to meet and discuss with someone from another network. NANOG knows its audience!</p>
    <div>
      <h2>Besides NANOG, there’s IETF, ICANN, ARIN, and many more</h2>
      <a href="#besides-nanog-theres-ietf-icann-arin-and-many-more">
        
      </a>
    </div>
    <p>NANOG isn’t the only forum to discuss network operational issues; however, it’s arguably the largest. It started off as a “North American” entity; however, in the same way that the Internet doesn’t have country barriers, NANOG meetings (which take place in the US, Canada and at least once in the Caribbean) have fostered an online community that has grown into a global resource. The mailing list (well worth reviewing) is a bastion of networking discussions.</p><p>In a different realm, the <a href="https://www.ietf.org/about/">Internet Engineering Task Force</a> (IETF) focuses on protocol standards. Its existence is why diverse entities can communicate. Operators participate in IETF meetings; however, those meetings are focused outside the core operational mindset.</p><p>Central to the Internet’s existence is ICANN. Meeting three times a year at locations around the globe, it focuses on the governance arena, domain names, and related items. Within the meetings there’s an excellent <a href="https://www.icann.org/resources/pages/tech-day-archive-2015-10-15-en">Tech Day</a>.</p><p>In the numbers arena, <a href="https://arin.net/">ARIN</a> is an example of a Regional Internet Registry (an RIR) that runs a members meeting. An RIR deals with allocating resources like IP addresses and AS numbers. ARIN focuses on the North American area and sometimes holds its meetings alongside NANOG meetings.</p><p>ARIN’s counterparts in other parts of the world also hold meetings. Sometimes they simply focus on resource policy, and sometimes they also cover network operational issues. For example, RIPE (in Europe, Central Asia and the Middle East) runs a five-day meeting that covers operational and policy issues. APNIC (Asia Pacific), AFRINIC (Africa), and LACNIC (Latin America &amp; Caribbean) all do similar variations. There isn’t one absolute method and that's a good thing. It’s worth pointing out that APNIC holds its members meeting once a year in conjunction with <a href="https://2017.apricot.net/">APRICOT</a>, which is the primary operations meeting in the Asia Pacific region.</p><p>While NANOG is somewhat focused on North America, there are also the regional NOGs. These regional NOGs are vital to the education of network operators globally. Japan has JANOG, Southern Africa has SAFNOG, the Middle East has MENOG, Australia &amp; New Zealand have AUSNOG &amp; NZNOG, Germany has DENOG, the Philippines has PHNOG, and, just to be different, the UK has UKNOF (“Forum” vs. “Group”). It would be hard to list them all, but each is a worthwhile forum for operational discussions.</p><p>Peering-specific meetings also exist: the <a href="https://www.peeringforum.com/">Global Peering Forum</a>, the <a href="https://www.peering-forum.eu/">European Peering Forum</a>, and the <a href="http://www.lacnog.org/wg-peeringforum/">Peering Forum de LACNOG</a>, for example. Those host bilateral meetings within a group of network operators or administrators and focus specifically on interconnect agreements.</p><p>In the commercial realm there are plenty of other meetings attended by networks like Cloudflare. <a href="https://council.ptc.org/">PTC</a> and <a href="https://www.internationaltelecomsweek.com/">International Telecoms Week</a> (ITW) are global telecom meetings specifically designed to host one-to-one (bilateral) meetings. They are very commercial in nature and less operational in focus.</p>
    <div>
      <h2>NANOG isn’t the only forum Cloudflare attends</h2>
      <a href="#nanog-isnt-the-only-forum-cloudflare-attends">
        
      </a>
    </div>
    <p>As you would guess, you will find our network team at RIR meetings, sometimes at IETF meetings, sometimes at ICANN meetings, and often at various regional NOG meetings (like SANOG in South Asia, NoNOG in Norway, RONOG in Romania, AUSNOG/NZNOG in Australia/New Zealand, and many other NOGs). We get around; however, we also run a global network and need to interact with many, many networks around the globe. These meetings provide an ideal opportunity for one-to-one discussions.</p><p>If you've heard something you like from Cloudflare at one of these operational-focused conferences, then check out our <a href="https://www.cloudflare.com/join-our-team/">jobs listings</a> (in various North American cities, London, Singapore, and beyond!)</p>
            <category><![CDATA[BGP]]></category>
            <category><![CDATA[Peering]]></category>
            <category><![CDATA[Events]]></category>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Best Practices]]></category>
            <guid isPermaLink="false">510pAYiPNeGP55tWvRXwko</guid>
            <dc:creator>Martin J Levy</dc:creator>
        </item>
        <item>
            <title><![CDATA[Manage Cloudflare records with Salt]]></title>
            <link>https://blog.cloudflare.com/manage-cloudflare-records-with-salt/</link>
            <pubDate>Wed, 14 Dec 2016 14:25:52 GMT</pubDate>
            <description><![CDATA[ We use Salt to manage our ever growing global fleet of machines. Salt is great for managing configurations and being the source of truth. We use it for remote command execution and for network automation tasks. ]]></description>
            <content:encoded><![CDATA[ <p>We use <a href="https://github.com/saltstack/salt">Salt</a> to manage our ever growing global fleet of machines. Salt is great for managing configurations and being the source of truth. We use it for remote command execution and for <a href="/the-internet-is-hostile-building-a-more-resilient-network/">network automation tasks</a>. It allows us to grow our infrastructure quickly with minimal human intervention.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HTPX88whBMGq7gOYzrjhw/1c30f21eeaae6e03d229aa8e548313dc/grains.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC-BY 2.0</a> <a href="https://secure.flickr.com/photos/pagedooley/2769134850/">image</a> by <a href="https://secure.flickr.com/photos/pagedooley/">Kevin Dooley</a></p><p>We got to thinking. Are DNS records not just a piece of the configuration? We concluded that they are and decided to manage our own records from Salt too.</p><p>We are strong believers in <a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">eating our own dog food</a>, so we make our employees use the next version of our service before rolling it to everyone else. That way if there's a problem visiting one of the 5 million websites that use Cloudflare it'll get spotted quickly internally. This is also why we keep our own DNS records on Cloudflare itself.</p><p>Cloudflare has an <a href="https://api.cloudflare.com/">API</a> that allows you to manage your zones programmatically without ever logging into the dashboard. Until recently, we were using handcrafted scripts to manage our own DNS records via our API. These scripts were in exotic languages like PHP for historical reasons and had interesting behavior that not everybody enjoyed. While we were dogfooding our own APIs, these scripts were pushing the source of truth for DNS out of Salt.</p><p>When we decided to move some zones to Salt, we had a few key motivations:</p><ol><li><p>Single source of truth</p></li><li><p>Peer-reviewed, audited and versioned changes</p></li><li><p>Making things that our customers would want to use</p></li></ol><p>Points 1 and 2 were achieved by having DNS records in a Salt repo. The Salt configuration is itself in git, so we get peer review and an audit trail for free. 
We think that we made progress on point 3 also.</p><p>After extensive internal testing and finding a few bugs in our API (that's what we wanted!), we are happy to announce the public availability of <a href="https://github.com/cloudflare/salt-cloudflare">Cloudflare Salt module</a>.</p><p>If you are familiar with Salt, it should be easy to see how to configure your records via Salt. All you need is the following:</p><p>Create the state <code>cloudflare</code> to deploy your zones:</p>
            <pre><code>example.com:
  cloudflare.manage_zone_records:
    - zone: {{ pillar["cloudflare_zones"]["example.com"]|yaml }}</code></pre>
            <p>Add a pillar to configure your zone:</p>
            <pre><code>cloudflare_zones:
  example.com:
    auth_email: ivan@example.com
    auth_key: auth key goes here
    zone_id: 0101deadbeefdeadbeefdeadbeefdead
    records:
      - name: blog.example.com
        content: 93.184.216.34
        proxied: true</code></pre>
            <p>Here we configure zone <code>example.com</code> to only have one record <code>blog.example.com</code> pointing to <code>93.184.216.34</code> behind Cloudflare.</p><p>You can test your changes before you deploy:</p>
            <pre><code>salt-call state.apply cloudflare test=true</code></pre>
            <p>And then deploy if you are happy with the dry run:</p>
            <pre><code>salt-call state.apply cloudflare</code></pre>
            <p>After the initial setup, all you need to do is change the <code>records</code> array in the pillar and re-deploy the state. See the <a href="https://github.com/cloudflare/salt-cloudflare">README</a> for more details.</p><p>DNS records are only one part of the configuration you may want to change for your Cloudflare domain. We have plans to "saltify" other settings like the WAF, caching, and page rules too.</p><p><a href="https://www.cloudflare.com/join-our-team/">Come work with us</a> if you want to help!</p> ]]></content:encoded>
            <category><![CDATA[GitHub]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">7cpWAbZAOswJhWNOR3oex8</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[The Internet is Hostile: Building a More Resilient Network]]></title>
            <link>https://blog.cloudflare.com/the-internet-is-hostile-building-a-more-resilient-network/</link>
            <pubDate>Tue, 08 Nov 2016 18:56:49 GMT</pubDate>
            <description><![CDATA[ The strength of the Internet is its ability to interconnect all sorts of networks — big data centers, e-commerce websites at small hosting companies, Internet Service Providers (ISP), and Content Delivery Networks (CDN) — just to name a few.  ]]></description>
            <content:encoded><![CDATA[ <p>In a recent <a href="/a-post-mortem-on-this-mornings-incident/">post</a> we discussed how we have been adding resilience to our network.</p><p>The strength of the Internet is its ability to interconnect all sorts of networks — big data centers, <a href="https://www.cloudflare.com/ecommerce/">e-commerce websites</a> at small hosting companies, Internet Service Providers (ISP), and <a href="https://www.cloudflare.com/learning/cdn/what-is-a-cdn/">Content Delivery Networks (CDN)</a> — just to name a few. These networks are either interconnected with each other directly using a dedicated physical fiber cable, through a common interconnection platform called an Internet Exchange (IXP), or they can even talk to each other by simply being on the Internet connected through intermediaries called transit providers.</p><p>The Internet is like the network of roads across a country and navigating roads means answering questions like “How do I get from Atlanta to Boise?” The Internet equivalent of that question is asking how to reach one network from another. For example, as you are reading this on the Cloudflare blog, your web browser is connected to your ISP and packets from your computer found their way across the Internet to Cloudflare’s blog server.</p><p>Figuring out the route between networks is accomplished through a protocol designed 25 years ago (on <a href="http://www.computerhistory.org/atchm/the-two-napkin-protocol/">two napkins</a>) called <a href="https://en.wikipedia.org/wiki/Border_Gateway_Protocol">BGP</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hf3aJRfh6KcsuPEJfYTSe/27f3c58f9490b6f6d135dafc287ee486/BGP.jpg" />
            
            </figure><p>BGP allows interconnections between networks to change dynamically. It provides an administrative protocol to exchange routes between networks, and allows for withdrawals in the case that a path is no longer viable (when some route no longer works).</p><p>The Internet has become such a complex set of tangled fibers, neighboring routers, and millions of servers that you can be certain there is a server failing or an optical fiber being damaged at any moment, whether it’s in a datacenter, a trench next to a railroad, or <a href="https://en.wikipedia.org/wiki/2008_submarine_cable_disruption">at the bottom of the ocean</a>. The reality is that the Internet is in a constant state of flux as connections break and are fixed; its incredible strength is that it operates in the face of a real world where conditions constantly change.</p><p>While BGP is the cornerstone of Internet routing, it does not provide first-class mechanisms to automatically deal with these events, nor does it provide tools to manage quality of service in general.</p><p>Although BGP is able to handle the coming and going of networks with grace, it wasn’t designed to deal with Internet brownouts. One common problem is that a connection enters a state where it hasn’t failed, but isn’t working correctly either. This usually presents itself as packet loss: packets enter a connection and never arrive at their destination. The only solution to these brownouts is active, continuous monitoring of the health of the Internet.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2HIfOhD0jflahkGrgaVNZA/3165e9cdb056961f56364bc86568a508/916142_ddc2fd0140_o.gif" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/jurvetson/916142/in/photolist-5Gky-6yWE3-e2fQKB-eWnwZ-6wHvD2-dgZcgm-6KosGR-e3Hopo-px8hdd-7ZC1ZE-6Kpc5g-mwSyiS-mwSmbA-9cEASh-jXe2g-mDdk7X-far2ZD-ajTJc1-jVhjV4-fsq7m3-p7tksy-6Dfpax-7mbpjF-8m3K8i-ryoZoC-7wCB5-687rPk-njcKr4-7wzXXn-4EnS8a-2kafd3-tcCu6-tcCti-6V5RaY-pGCHzT-4yzuYg-9uwrFi-d9CeFw-7BfKzq-7Bc244-7Bc15c-7BfQRh-6JRaSd-7Bc1zp-7BfRgJ-k17mn-6JM5pM-q4R4Kw-aBAv3F-7BfNCY">image</a> by <a href="https://www.flickr.com/photos/jurvetson/">Steve Jurvetson</a></p><p>Again, the metaphor of a system of roads is useful. A printed map may tell you the route from one city to another, but it won't tell you where there's a traffic jam. However, modern GPS applications such as Waze can tell you which roads are congested and which are clear. Similarly, Internet monitoring shows which parts of the Internet are blocked or losing packets and which are working well.</p><p>At Cloudflare we decided to deploy our own mechanisms to react to unpredictable events causing these brownouts. While most events do not fall under our jurisdiction — they are “external” to the Cloudflare network — we have to operate a reliable service by minimizing the impact of external events.</p><p>This is a journey of continual improvement, and it can be deconstructed into a few simple components:</p><ul><li><p>Building an exhaustive and consistent view of the quality of the Internet</p></li><li><p>Building a detection and alerting mechanism on top of this view</p></li><li><p>Building the automatic mitigation mechanisms to ensure the best reaction time</p></li></ul>
    <div>
      <h3>Monitoring the Internet</h3>
      <a href="#monitoring-the-internet">
        
      </a>
    </div>
    <p>Having deployed our network in <a href="/amsterdam-to-zhuzhou-cloudflare-global-network/">a hundred locations</a> worldwide, we are in a unique position to monitor the quality of the Internet from a wide variety of locations. To do this, we are leveraging the probing capabilities of our network hardware and have added some extra tools that we’ve built.</p><p>By collecting data from thousands of automatically deployed probes, we have a real-time view of the Internet’s infrastructure: packet loss in any of our transit provider’s backbones, packet loss on Internet Exchanges, or packet loss between continents. It is salutary to watch this real-time view over time and realize how often parts of the Internet fail and how resilient the overall network is.</p><p>Our monitoring data is stored in real-time in our metrics pipeline powered by a mix of open-source software: <a href="http://zeromq.org">ZeroMQ</a>, <a href="https://prometheus.io">Prometheus</a> and <a href="http://opentsdb.net/">OpenTSDB</a>. The data can then be queried and filtered on a single dashboard to give us a clear view of the state of a specific transit provider, or one specific PoP.</p>
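<p>As a rough sketch of what that aggregation looks like, here is how percent packet loss per transit provider could be derived from probe results. This is illustrative only: the provider names and the <code>(provider, sent, received)</code> tuple shape are assumptions, not our actual schema.</p>

```python
from collections import defaultdict

# Hypothetical probe results: (provider, packets sent, packets received).
probes = [
    ("transit-a", 1000, 998),
    ("transit-a", 1000, 991),
    ("transit-b", 1000, 1000),
]

def loss_by_provider(results):
    """Aggregate probe counts and return percent packet loss per provider."""
    sent = defaultdict(int)
    received = defaultdict(int)
    for provider, s, r in results:
        sent[provider] += s
        received[provider] += r
    return {p: 100.0 * (sent[p] - received[p]) / sent[p] for p in sent}
```

<p>Summing counts before dividing keeps a single noisy probe from dominating the loss figure for a provider.</p>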
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/451Jz0B6feZP4SHDMdUm0q/85f6ccc1e0ca94684c2abc4bef3c30d1/loss_1.gif" />
            
            </figure><p>Above we can see a time-lapse of a transit provider having some packet loss issues.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3g8KEnve4KdnOmZqvTWLrD/0666a47829dce22097cbdf4aa68bfdf5/Screenshot-2016-10-30-16.34.45.png" />
            
            </figure><p>Here we see a transit provider having some trouble on the US West Coast on October 28, 2016.</p>
    <div>
      <h3>Building a Detection Mechanism</h3>
      <a href="#building-a-detection-mechanism">
        
      </a>
    </div>
    <p>We didn’t want to stop here. Having a real-time map of Internet quality puts us in a great position to detect problems and create alerts as they unfold. We have defined a set of triggers that we know are indicative of a network issue, which allow us to quickly analyze and repair problems.</p><p>For example, 3% packet loss from Latin America to Asia is expected under normal Internet conditions and not something that would trigger an alert. However, 3% packet loss between two countries in Europe usually indicates a bigger and potentially more impactful problem, and thus will immediately trigger alerts for our Systems Reliability Engineering and Network Engineering teams to look into the issue.</p><p>Sitting between eyeball networks and content networks, it is easy for us to correlate this packet loss with various other metrics in our system, such as difficulty connecting to customer origin servers (which manifests as Cloudflare error 522) or a sudden decrease of traffic from a local ISP.</p>
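<p>The region-dependent triggers described above can be sketched as a lookup of per-path thresholds. The 3% figures come from the examples in this post; the threshold values and region names below are made up for illustration.</p>

```python
# Percent packet loss above which we alert, keyed by (source, destination)
# region pair. Values are illustrative, not our production configuration.
THRESHOLDS = {
    ("latam", "asia"): 5.0,     # long intercontinental paths: some loss is normal
    ("europe", "europe"): 1.0,  # intra-European paths: loss is a red flag
}
DEFAULT_THRESHOLD = 3.0

def should_alert(src_region, dst_region, loss_pct):
    """True if measured loss exceeds the threshold for this region pair."""
    limit = THRESHOLDS.get((src_region, dst_region), DEFAULT_THRESHOLD)
    return loss_pct > limit
```

<p>With these values, 3% loss from Latin America to Asia stays quiet, while 3% loss within Europe pages the on-call teams, matching the behavior described above.</p>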
    <div>
      <h3>Automatic Mitigation</h3>
      <a href="#automatic-mitigation">
        
      </a>
    </div>
    <p>Receiving valuable and actionable alerts is great; however, we were still facing a time-to-reaction that is hard to compress. Thankfully, in our early years we learned a lot from DDoS attacks. We learned how to detect and auto-mitigate most attacks with our <a href="/introducing-the-bpf-tools/">efficient automated DDoS mitigation pipeline</a>. So naturally we wondered whether we could apply what we’ve learned from DDoS mitigation to these generic Internet events. After all, they do exhibit the same characteristics: they’re unpredictable, they’re external to our network, and they can impact our service.</p><p>The next step was to correlate these alerts with automated actions. The actions should reflect what an on-call network engineer would have done given the same information. This includes running some important checks: is the packet loss really external to our network? Is the packet loss correlated to an actual impact? Do we currently have enough capacity to reroute the traffic? When all the stars align, we know we have a case to perform some action.</p><p>All that said, automating actions on network devices turns out to be more complicated than one would imagine. Without going into too much detail, we struggled to find a common language to talk to our equipment with because we’re a multi-vendor network. We decided to contribute to the brilliant open-source project <a href="https://github.com/napalm-automation/napalm">Napalm</a>, coupled it with the automation framework <a href="https://saltstack.com/">Salt</a>, and <a href="http://nanog.org/meetings/abstract?id=2951">improved it to bring us the features we needed</a>.</p><p>We wanted to be able to perform actions such as configuring probes, retrieving their data, and managing complex BGP neighbor configuration regardless of the network device a given PoP was using. With all these features put together into an automated system, we can see the impact of the actions it has taken:</p>
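<p>The sequence of checks above can be sketched as a simple gating function. The field names are hypothetical; the real system gathers these answers from live telemetry rather than a dict.</p>

```python
def plan_mitigation(event):
    """Return a mitigation action only when every precondition holds."""
    preconditions = (
        event["loss_is_external"],   # the fault lies outside our own network
        event["traffic_impacted"],   # the loss correlates with real impact
        event["spare_capacity"],     # remaining links can absorb the traffic
    )
    if all(preconditions):
        return "disable link " + event["link"]
    return None  # fall back to a human on-call engineer
```

<p>Requiring every check to pass before acting is what keeps an automated mitigation from making a bad situation worse, for example by rerouting traffic onto links that cannot carry it.</p>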
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7BUEy4XCIg71TuqUQvJqo1/3b49d0cee7acff31949542ad2e3f3665/Screenshot-2016-10-31-11.04.44.png" />
            
            </figure><p>Here you can see one of our transit providers having a sudden problem in Hong Kong. Our system automatically detects the fault and takes the necessary action, which is to disable this link for our routing.</p><p>Our system keeps improving, but it is already running at a high pace and making immediate adjustments across our network to <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">optimize performance</a> every single day.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6d2bX0eRLMj91oM9LrN2Ny/21f516b784e01bcdb7326cf3037e3962/Screenshot-2016-10-31-14.31.19-2.png" />
            
            </figure><p>Here we can see the actions taken by our mitigation bot over 90 days.</p><p>The impact of this is that we’ve managed to make the Internet perform better for our customers and reduce the number of errors that they'd see if they weren't using Cloudflare. One way to measure this is how often we're unable to reach a customer's origin. Sometimes origins are completely offline. However, we are increasingly at a point where if an origin is reachable we'll find a path to reach it. You can see the effects of our improvements over the last year in the graph below.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1Xhgv1pZz3h5ql0xp26nv9/d6493cdf3157ec25d3fcf3708cd6c561/522_year-1.png" />
            
            </figure>
    <div>
      <h3>The Future</h3>
      <a href="#the-future">
        
      </a>
    </div>
    <p>While we keep improving this resiliency pipeline every day, we are looking forward to deploying some new technologies to streamline it further: <a href="http://movingpackets.net/2016/01/11/99-problems-and-configuration-and-telemetry-aint-two/">Telemetry</a> will permit a more real-time collection of our data by moving from a pull model to a push model, and new automation languages like <a href="http://www.openconfig.net/">OpenConfig</a> will unify and simplify our communication with network devices. We look forward to deploying these improvements as soon as they are mature enough for us to release.</p><p>At Cloudflare our mission is to help build a better Internet. The Internet, though, by its nature and size, is in constant flux — breaking down, being added to, and being repaired at almost any given moment — meaning services are often interrupted and traffic is slowed without warning. By enhancing the reliability and resiliency of this complex network of networks, we think we are one step closer to fulfilling our mission and building a better Internet.</p>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Mitigation]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Network]]></category>
            <guid isPermaLink="false">1YLXNdbFOfJePGpwFKQuXa</guid>
            <dc:creator>Jérôme Fleury</dc:creator>
        </item>
        <item>
            <title><![CDATA[Economical With The Truth: Making DNSSEC Answers Cheap]]></title>
            <link>https://blog.cloudflare.com/black-lies/</link>
            <pubDate>Fri, 24 Jun 2016 16:31:10 GMT</pubDate>
            <description><![CDATA[ We launched DNSSEC late last year and are already signing 56.9 billion DNS record sets per day. At this scale, we care a great deal about compute cost. ]]></description>
            <content:encoded><![CDATA[ <p>We launched DNSSEC late last year and are already signing 56.9 billion DNS record sets per day. At this scale, we care a great deal about compute cost. One of the ways we save CPU cycles is our unique implementation of negative answers in DNSSEC.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/39pYW2SBON2GibSMDXYXr9/1f9e5f863b435eaa642f67b810acbfbf/217591669_c31a16e301_o.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by-sa/2.0/">CC BY-SA 2.0</a> <a href="https://www.flickr.com/photos/88478656@N00/217591669">image</a> by <a href="https://www.flickr.com/photos/chris-short/">Chris Short</a></p><p>I will briefly explain a few concepts you need to know about <a href="https://www.cloudflare.com/learning/dns/dnssec/ecdsa-and-dnssec/">DNSSEC</a> and negative answers, and then we will dive into how CloudFlare saves on compute when asked for names that don’t exist.</p>
    <div>
      <h3>What You Need To Know: DNSSEC Edition</h3>
      <a href="#what-you-need-to-know-dnssec-edition">
        
      </a>
    </div>
    <p>Here’s a quick summary of DNSSEC:</p><p>This is an unsigned DNS answer (unsigned == no DNSSEC):</p>
            <pre><code>cloudflare.com.		299	IN	A	198.41.214.162
cloudflare.com.		299	IN	A	198.41.215.162</code></pre>
            <p>This is an answer with DNSSEC:</p>
            <pre><code>cloudflare.com.		299	IN	A	198.41.214.162
cloudflare.com.		299	IN	A	198.41.215.162
cloudflare.com.		299	IN	RRSIG	A 13 2 300 20160311145051 20160309125051 35273     cloudflare.com. RqRna0qkih8cuki++YbFOkJi0DGeNpCMYDzlBuG88LWqx+Aaq8x3kQZX TzMTpFRs6K0na9NCUg412bOD4LH3EQ==</code></pre>
            <p>Answers with DNSSEC contain a signature for every record type that is returned. (In this example, only A records are returned so there is only one signature.) The signatures allow DNS resolvers to validate the records returned and prevent on-path attackers from intercepting and changing the answers.</p>
    <div>
      <h3>What You Need To Know: Negative Answer Edition</h3>
      <a href="#what-you-need-to-know-negative-answer-edition">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/EDDHJnSfGNBWos6pw07VT/cbc431245dbf7cb97dc788ea6a909fcb/250521158_0c5de0ef97_z.jpg" />
            
            </figure><p>There are two types of negative answers. The first is <code>NXDOMAIN</code>, which means that the name asked for does not exist. An example of this is a query asking for <code>missing.cloudflare.com</code>. <code>missing.cloudflare.com</code> doesn’t exist at all.</p><p>The second type is <code>NODATA</code>, which means that the name does exist, just not in the requested type. An example of this would be asking for the <code>MX</code> record of <code>blog.cloudflare.com</code>. There are <code>A</code> records for <code>blog.cloudflare.com</code> but no <code>MX</code> records so the appropriate response is <code>NODATA</code>.</p>
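<p>The distinction can be sketched as the lookup an authoritative server performs. The tiny in-memory zone below is illustrative, reusing the examples from this section:</p>

```python
# A toy zone: name -> {record type -> record data}.
ZONE = {
    "blog.cloudflare.com": {"A": ["198.41.214.162"]},
}

def classify(qname, qtype):
    """Return the answer, or the kind of negative answer, for a query."""
    rrsets = ZONE.get(qname)
    if rrsets is None:
        return "NXDOMAIN"   # the name does not exist at all
    if qtype not in rrsets:
        return "NODATA"     # the name exists, but not in this type
    return rrsets[qtype]
```

<p>Asking for <code>missing.cloudflare.com</code> yields <code>NXDOMAIN</code>, while asking for the <code>MX</code> of <code>blog.cloudflare.com</code> yields <code>NODATA</code>.</p>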
    <div>
      <h3>What Goes Into An <code>NXDOMAIN</code> With DNSSEC</h3>
      <a href="#what-goes-into-an-nxdomain-with-dnssec">
        
      </a>
    </div>
    <p>To see what gets returned in a negative <code>NXDOMAIN</code> answer, let’s look at the response for a query for <code>bogus.ietf.org</code>.</p><p>The first record that has to be returned in a negative answer with DNSSEC is an SOA, just like in an unsigned negative answer. The SOA contains some metadata about the zone and lets the recursor know how long to cache the negative answer for.</p>
            <pre><code>ietf.org.	1179	IN	SOA	ns0.amsl.com. glen.amsl.com. 1200000325 1800 1800 604800 1800</code></pre>
            <p>Because the domain is signed with DNSSEC, the signature for the <code>SOA</code> is also returned:</p>
            <pre><code>ietf.org.	1179	IN	RRSIG	SOA 5 2 1800 20170308083354 20160308073501 40452 ietf.org. S0gIjTnQGA6TyIBjCeBXL4ip8aEQEgg2y+kCQ3sLtFa3oNy9vj9kj4aP 8EVu4oIexr8X/i9L8Oj5ec4HOrQoYsMGObRUG0FGT0MEbxepi+wWrfed vD/3mq8KZg/pj6TQAKebeSQGkmb8y9eP0PdWdUi6EatH9ZY/tsoiKyqg U4vtq9sWZ/4mH3xfhK9RBI4M7XIXsPX+biZoik6aOt4zSWR5WDq27pXI 0l+BLzZb72C7McT4PlBiF+U86OngBlGxVBnILyW2aUisi2LY6KeO5AmK WNT0xHWe5+JtPD5PgmSm46YZ8jMP5mH4hSYr76jqwvlCtXvq8XgYQU/P QyuCpQ==</code></pre>
            <p>The next part of the negative answer in DNSSEC is a record type called <code>NSEC</code>. The <code>NSEC</code> record returns the previous and next name in the zone, which proves to the recursor that the queried name cannot possibly exist, because nothing exists between the two names listed in the NSEC record.</p>
            <pre><code>www.apps.ietf.org.	1062	IN	NSEC	cloudflare-verify.ietf.org. A RRSIG NSEC</code></pre>
            <p>This <code>NSEC</code> record above tells you that <code>bogus.ietf.org</code> does not exist because no names exist canonically between <code>www.apps.ietf.org</code> and <code>cloudflare-verify.ietf.org</code>. Of course, this record also has a signature contained in the answer:</p>
            <pre><code>www.apps.ietf.org.	1062	IN	RRSIG	NSEC 5 4 1800 20170308083322 20160308073501 40452 ietf.org. NxmjhCkTtoiolJUow/OreeBRxTtf2AnIPM/r2p7oS/hNeOdFI9tpgGQY g0lTOYjcNNoIoDB/r56Kd+5wtuaKT+xsYiZ4K413I+cmrNQ+6oLT+Mz6 Kfzvo/TcrJD99PVAYIN1MwzO42od/vi/juGkuKJVcCzrBKNHCZqu7clu mU3DEqbQQT2O8dYIUjLlfom1iYtZZrfuhB6FCYFTRd3h8OLfMhXtt8f5 8Q/XvjakiLqov1blZAK229I2qgUYEhd77n2pXV6SJuOKcSjZiQsGJeaM wIotSKa8EttJELkpNAUkN9uXfhU+WjouS1qzgyWwbf2hdgsBntKP9his 9MfJNA==</code></pre>
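<p>The covering check a validating resolver performs on this pair of names can be sketched in a few lines. This is an illustration, not RRDNS code: per RFC 4034 section 6.1, names are compared label by label starting from the rightmost label, case-insensitively.</p>

```python
# Sketch: does an NSEC record's owner/next pair prove that a queried
# name does not exist? Canonical DNS ordering (RFC 4034, section 6.1)
# compares names by their reversed, lowercased label sequences.

def canonical_key(name):
    """Sort key for a DNS name: reversed, lowercased label list."""
    return list(reversed(name.rstrip(".").lower().split(".")))

def nsec_covers(owner, next_name, qname):
    """True if qname sorts strictly between the NSEC owner and next name,
    i.e. the record proves that qname does not exist."""
    return canonical_key(owner) < canonical_key(qname) < canonical_key(next_name)

# the NSEC record from the bogus.ietf.org answer above
print(nsec_covers("www.apps.ietf.org.", "cloudflare-verify.ietf.org.",
                  "bogus.ietf.org."))  # True: the gap covers the name
```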
            <p>A second NSEC record is also returned to prove that there is no wildcard that would have covered <code>bogus.ietf.org</code>:</p>
            <pre><code>ietf.org.	1062	IN	NSEC	ietf1._domainkey.ietf.org. A NS SOA MX TXT AAAA RRSIG NSEC DNSKEY SPF</code></pre>
            <p>This record above tells you that a wildcard (<code>*.ietf.org</code>) would have fallen between those two names if it existed. Because there is no wildcard record at <code>*.ietf.org</code>, as proven by this NSEC record, the DNS resolver knows that nothing at all should have been returned for <code>bogus.ietf.org</code>. This <code>NSEC</code> record also has a signature:</p>
            <pre><code>ietf.org.	1062	IN	RRSIG	NSEC 5 2 1800 20170308083303 20160308073501 40452 ietf.org. homg5NrZIKo0tR+aEp0MVYYjT7J/KGTKP46bJ8eeetbq4KqNvLKJ5Yig ve4RSWFYrSARAmbi3GIFW00P/dFCzDNVlMWYRbcFUt5NfYRJxg25jy95 yHNmInwDUnttmzKuBezdVVvRLJY3qSM7S3VfI/b7n6++ODUFcsL88uNB V6bRO6FOksgE1/jUrtz6/lEKmodWWI2goFPGgmgihqLR8ldv0Dv7k9vy Ao1uunP6kDQEj+omkICFHaT/DBSSYq59DVeMAAcfDq2ssbr4p8hUoXiB tNlJWEubMnHi7YmLSgby+m8b97+8b6qPe8W478gAiggsNjc2gQSKOOXH EejOSA==</code></pre>
            <p>All in all, the negative answer for <code>bogus.ietf.org</code> contains an <code>SOA + SOA RRSIG + (2) NSEC + (2) NSEC RRSIG</code>. It is 6 records in total, returning an answer that is 1095 bytes (this is a large DNS answer).</p>
    <div>
      <h3>Zone Walking</h3>
      <a href="#zone-walking">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/IvO8IS03cvULjAZJGiGNh/493fdb6facdde177c88222bd3226f535/4950628049_030625b5e1_z.jpg" />
            
            </figure><p>What you may have noticed is that because the negative answer returns the previous and next name, you can keep asking for next names and essentially “walk” the zone until you learn every single name contained in it.</p><p>For example, if you ask for the <code>NSEC</code> on <code>ietf.org</code>, you will get back the first name in the zone, <code>ietf1._domainkey.ietf.org</code>:</p>
            <pre><code>ietf.org.		1799	IN	NSEC 	ietf1._domainkey.ietf.org.  A NS SOA MX TXT AAAA RRSIG NSEC DNSKEY SPF</code></pre>
            <p>Then if you ask for the <code>NSEC</code> on <code>ietf1._domainkey.ietf.org</code> you will get the next name in the zone:</p>
            <pre><code>ietf1._domainkey.ietf.org. 1799	IN	NSEC 	apps.ietf.org. TXT RRSIG NSEC</code></pre>
            <p>And you can keep going until you get every name in the zone:</p>
            <pre><code>apps.ietf.org.		1799	IN	NSEC 	mail.apps.ietf.org. MX RRSIG NSEC</code></pre>
            <p>The root zone uses <code>NSEC</code> as well, so you can walk the root to see every TLD:</p><p>The root NSEC:</p>
            <pre><code>.			21599	IN	NSEC 	aaa. NS SOA RRSIG NSEC DNSKEY</code></pre>
            <p><code>.aaa NSEC</code>:</p>
            <pre><code>aaa.			21599	IN	NSEC 	aarp. NS DS RRSIG NSEC</code></pre>
            <p><code>.aarp NSEC</code>:</p>
            <pre><code>aarp.			21599	IN	NSEC	 abb. NS DS RRSIG NSEC</code></pre>
            <p>Zone walking was actually considered a feature of the original design:</p><blockquote><p>The complete <code>NXT</code> chains specified in this document enable a resolver to obtain, by successive queries chaining through <code>NXT</code>s, all of the names in a zone. - <a href="https://www.ietf.org/rfc/rfc2535.txt">RFC2535</a></p></blockquote><p>(<a href="https://www.ietf.org/rfc/rfc2535.txt"><code>NXT</code></a> is the original DNS record type that <code>NSEC</code> was based on)</p><p>However, as you can imagine, this is a terrible idea for some zones. If you could walk the <code>.gov</code> zone, you could learn every US government agency and government agency portal. If you owned a real estate company where every realtor got their own subdomain, a competitor could walk through your zone and find out who all of your realtors are.</p>
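<p>The walk itself is mechanical. Here is a sketch that simulates it against an in-memory table of NSEC next-name pointers instead of live queries (the final wrap-around link back to the apex is assumed for illustration; a real walk would issue one DNS query per step):</p>

```python
# Simulated zone walk: each NSEC record points at the next name in
# canonical order, so following the pointers from the apex until the
# chain wraps around enumerates every name in the zone.

nsec_chain = {
    "ietf.org.": "ietf1._domainkey.ietf.org.",
    "ietf1._domainkey.ietf.org.": "apps.ietf.org.",
    "apps.ietf.org.": "mail.apps.ietf.org.",
    "mail.apps.ietf.org.": "ietf.org.",  # last record wraps back to the apex
}

def walk(apex, chain):
    """Follow NSEC next-name pointers until the chain wraps around."""
    names, current = [apex], chain[apex]
    while current != apex:
        names.append(current)
        current = chain[current]
    return names

print(walk("ietf.org.", nsec_chain))  # every name in the simulated zone
```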
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3r4Sa7oIA4l8dHp0Vfb8ZE/600b78872744b8a3af69169d1c3b2dfc/3694621030_76a7e356e8_z.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/kiui/3694621030/in/photolist-6CtU7A-asJTAM-7CPbHo-8EwW4-3eVmjc-hJRKMq-6yqn7i-fHud6f-vkbXQ-dYdajL-Tvem-hhyi-yCJFx-faZG48-4NqD5n-4AznYu-agJiCs-7rHcZi-uctPy-86shBZ-5i6jk8-2FM8ku-4r6pDg-nKT4AD-89gnn7-4raAGu-5GZFu6-ediZYA-4ALrZC-47jTd2-dDUiSp-2UDPN5-dDU83M-89d89Z-dDUhX6-52rBrz-hkLCcd-dDTPvR-9yZg4K-67tzSW-8YUTHK-4YeciD-pdczj5-otQve1-72reNd-dDUoZV-F5Pg-4qFHBC-6yuv4A-54xhoz">image</a> by <a href="https://www.flickr.com/photos/kiui/">KIUI</a></p><p>So the DNS community rallied together and found a solution. They would continue to return previous and next names, but they would hash the outputs. This was defined in an upgrade to <code>NSEC</code> called <a href="https://tools.ietf.org/html/rfc5155"><code>NSEC3</code></a>.</p>
            <pre><code>6rmo7l6664ki2heho7jtih1lea9k6los.icann.org. 3599 IN NSEC3 1 0 5 2C21FAE313005174 6S2J9F2OI56GPVEIH3KBKJGGCL21SKKL A RRSIG</code></pre>
            <p><code>NSEC3</code> was a “close but no cigar” solution to the problem. While it’s true that it made zone walking harder, it did not make it impossible. Zone walking with <code>NSEC3</code> is still possible with a dictionary attack. An attacker can use a list of the most common hostnames, hash them with the hashing algorithm used in the <code>NSEC3</code> record (which is listed in the record itself) and see if there are any matches. Even if the domain owner uses a salt on the hash, the salt itself (along with its length) is published in the <code>NSEC3</code> record, so it adds no secrecy against a dictionary attack.</p><blockquote><p>“The Salt Length field defines the length of the salt in octets, ranging in value from 0 to 255.”</p></blockquote><ul><li><p><a href="https://tools.ietf.org/html/rfc5155">RFC5155</a></p></li></ul>
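<p>To see why the dictionary attack works, here is a sketch (not a real attack tool) of the RFC 5155 hash computation: SHA-1 over the name's wire format plus the salt, iterated, then encoded in Base32 with the "extended hex" alphabet. The salt and iteration count below are copied from the icann.org record above; an attacker reads them straight out of the record.</p>

```python
import base64
import hashlib

def wire_format(name):
    """Canonical wire format: each lowercased label prefixed by its length."""
    out = b""
    for label in name.rstrip(".").lower().split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"  # terminating root label

# map standard Base32 output to the Base32hex alphabet used by NSEC3
B32HEX = str.maketrans("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567",
                       "0123456789ABCDEFGHIJKLMNOPQRSTUV")

def nsec3_hash(name, salt_hex, iterations):
    """RFC 5155 hash: iterated SHA-1 over name + salt, Base32hex-encoded."""
    salt = bytes.fromhex(salt_hex)
    digest = hashlib.sha1(wire_format(name) + salt).digest()
    for _ in range(iterations):
        digest = hashlib.sha1(digest + salt).digest()
    return base64.b32encode(digest).decode().translate(B32HEX)

# dictionary attack: hash guessed names with the published parameters
# (salt 2C21FAE313005174 and 5 iterations, from the icann.org record above)
guesses = {nsec3_hash(g + ".icann.org", "2C21FAE313005174", 5): g
           for g in ("www", "mail", "ftp", "secret")}
```

Any hash seen in an NSEC3 response that appears as a key in <code>guesses</code> reveals the corresponding plaintext name.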
    <div>
      <h3><code>NODATA</code> Responses</h3>
      <a href="#nodata-responses">
        
      </a>
    </div>
    <p>If you recall from above, <code>NODATA</code> is the response from a server when it is asked for a name that exists, but not in the requested type (like an <code>MX</code> record for <code>blog.cloudflare.com</code>). <code>NODATA</code> is similar in output to <code>NXDOMAIN</code>. It still requires <code>SOA</code>, but it only takes one <code>NSEC</code> record to prove the next name, and to specify which types do exist on the queried name.</p><p>For example, if you look for a <code>TXT</code> record on <code>apps.ietf.org</code>, the <code>NSEC</code> record will tell you that while there is no <code>TXT</code> record on <code>apps.ietf.org</code>, there are <code>MX</code>, <code>RRSIG</code> and <code>NSEC</code> records.</p>
            <pre><code>apps.ietf.org.		1799	IN	NSEC 	mail.apps.ietf.org. MX RRSIG NSEC</code></pre>
            
    <div>
      <h3>Problems With Negative Answers</h3>
      <a href="#problems-with-negative-answers">
        
      </a>
    </div>
    <p>There are two problems with negative answers:</p><p>The first is that the authoritative server needs to return the previous and next name. As you’ll see, this is computationally expensive for CloudFlare, and as you’ve already seen, it can leak information about a zone.</p><p>The second is that negative answers require two <code>NSEC</code> records and their two corresponding signatures (or three <code>NSEC3</code> records and three <code>NSEC3</code> signatures) to authenticate the nonexistence of one name. This means that answers are bigger than they need to be.</p>
    <div>
      <h3>The Trouble with Previous and Next Names</h3>
      <a href="#the-trouble-with-previous-and-next-names">
        
      </a>
    </div>
    <p>CloudFlare has a custom, in-house DNS server built in Go called <a href="/what-weve-been-doing-with-go/">RRDNS</a>. Unlike standard DNS servers, RRDNS does not have the concept of a zone file. Instead, it has a <a href="/kyoto-tycoon-secure-replication/">key value store</a> that holds all of the DNS records of all of the domains. When it gets a query for a record, it can just pick out the record that it needs.</p><p>Another unique aspect of CloudFlare's DNS is that a lot of our business logic is handled in the DNS. We often generate DNS answers dynamically, on the fly, so we don't always know what we will respond with before we are asked.</p><p>Traditional negative answers require the authoritative server to return the previous and next name of a missing name. Because CloudFlare does not have a full view of the zone file, we'd have to ask the database to do a sorted search just to figure out the previous and next names. Beyond that, because we generate answers on the fly, we don’t have a reliable way to know what the previous and next names might be, unless we were to precompute every possible option ahead of time.</p><p>One proposed solution to the previous-and-next-name and secrecy problems is <a href="https://www.ietf.org/rfc/rfc4470.txt">RFC4470</a>, dubbed 'White Lies'. This RFC proposes that DNS operators make up the previous and next names by randomly generating names that fall canonically slightly before and after the requested name.</p><p>White lies is a great solution to block zone walking (and it helps us prevent unnecessary database lookups), but it still requires two <code>NSEC</code> records (one for the previous and next names and another for the wildcard) to say one thing, so the answer is still bigger than it needs to be.</p>
    <div>
      <h3>When CloudFlare Lies</h3>
      <a href="#when-cloudflare-lies">
        
      </a>
    </div>
    <p>We decided to take lying in negative answers to its fullest extent. Instead of white lies, we do black lies.</p><p>For an <code>NXDOMAIN</code>, we always return <code>\000.</code> prepended to the missing name as the next name, and because we return an <code>NSEC</code> directly on the missing name, we do not have to return an additional <code>NSEC</code> for the wildcard. This way we only have to return <code>SOA</code>, <code>SOA RRSIG</code>, <code>NSEC</code> and <code>NSEC RRSIG</code>, and we do not need to search the database or precompute dynamic answers.</p><p>Our negative answers are usually around 300 bytes. For comparison, negative answers for <code>ietf.org</code>, which uses <code>NSEC</code>, and <code>icann.org</code>, which uses <code>NSEC3</code>, are both slightly over 1000 bytes, roughly three times the size. The reason this matters so much is that the maximum size of a plain (non-EDNS) DNS message over UDP is 512 octets. DNSSEC requires support for UDP messages of at least 1220 octets, but above that limit the client may need to fall back to DNS over TCP. A good practice is to keep enough headroom so that response sizes stay below the fragmentation threshold, even during zone signing key rollover periods.</p><p><code>NSEC</code>: 1096 bytes</p>
            <pre><code>ietf.org.		1799	IN	SOA	ns0.amsl.com. glen.amsl.com. 1200000317 1800 1800 604800 1800
ietf.org.		1799	IN	RRSIG	SOA 5 2 1800 20170213210533 20160214200831 40452 ietf.org. P8XoJx+SK5nUZAV/IqiJrsoKtP1c+GXmp3FvEOUZPFn1VwW33242LVrJ GMI5HHjMEX07EzOXZyLnQeEvlf2QLxRIQm1wAnE6W4SUp7TgKUZ7NJHP dgLr2gqKYim4CI7ikYj3vK7NgcaSE5jqIZUm7oFxxYO9/YPz4Mx7COw6 XBOMYS2v8VY3DICeJdZsHJnVKlgl8L7/yqrL8qhkSW1yDo3YtB9cZEjB OVk8uRDxK7aHkEnMRz0LODOJ10AngJpg9LrkZ1CO444RhZGgTbwzN9Vq rDyH47Cn3h8ofEOJtYCJvuX5CCzaZDInBsjq9wNAiNBgIQatPkNriR77 hCEHhQ==
ietf.org.		1799	IN	NSEC	ietf1._domainkey.ietf.org. A NS SOA MX TXT AAAA RRSIG NSEC DNSKEY SPF
ietf.org.		1799	IN	RRSIG	NSEC 5 2 1800 20170213210816 20160214200831 40452 ietf.org. B9z/JJs30tkn0DyxVz0zaRlm4HkeNY1TqYmr9rx8rH7kC32PWZ1Fooy6 16qmB33/cvD2wtOCKMnNQPdTG2qUs/RuVxqRPZaQojIVZsy/GYONmlap BptzgOJLP7/HOxgYFgMt5q/91JHfp6Mn0sd218/H86Aa98RCXwUOzZnW bdttjsmbAqONuPQURaGz8ZgGztFmQt5dNeNRaq5Uqdzw738vQjYwppfU 9GSLkT7RCh3kgbNcSaXeuWfFnxG1R2SdlRoDICos+RqdDM+23BHGYkYc /NEBLtjYGxPqYCMe/7lOtWQjtQOkqylAr1r7pSI2NOA9mexa7yTuXH+x o/rzRA==
www.apps.ietf.org.	1799	IN	NSEC	cloudflare-verify.ietf.org. A RRSIG NSEC
www.apps.ietf.org.	1799	IN	RRSIG	NSEC 5 4 1800 20170213210614 20160214200831 40452 ietf.org. U+hEHcTps2IC8VKS61rU3MDZq+U0KG4/oJjIHVYbrWufQ7NdMdnY6hCL OmQtsvuZVRQjWHmowRhMj83JMUagxoZuWTg6GuLPin3c7PkRimfBx7jI wjqORwcuvpBh92A/s/2HXBma3PtDZl2UDLy4z7wdO62rbxGU/LX1jTqY FoJJLJfJ/C+ngVMIE/QVneXSJkAjHV96FSEnreF81V62x9azv3AHo4tl qnoYvRDtK+cR072A5smtWMKDfcIr2fI11TAGIyhR55yAiollPDEz5koj BfMstC/JXVURJMM+1vCPjxvwYzTZN8iICf1AupyyR8BNWxgic5yh1ljH 1AuAVQ==</code></pre>
            <p>Black Lies: 357 bytes</p>
            <pre><code>cloudflare.com.		1799	IN	SOA	ns3.cloudflare.com. dns.cloudflare.com. 2020742566 10000 2400 604800 3600
blog.cloudflare.com.	3599	IN	NSEC	\000.blog.cloudflare.com. RRSIG NSEC
cloudflare.com.		1799	IN	RRSIG	SOA 13 2 86400 20160220230013 20160218210013 35273 cloudflare.com. kgjtJDuuNC/yX8yWQpol4ZUUr8s8yAXZi26KWBI6S3HDtry2t6LnP1ou QK10Ut7DXO/XhyZddRBVj3pIpWYdBQ==
blog.cloudflare.com.	3599	IN	RRSIG	NSEC 13 3 3600 20160220230013 20160218210013 35273 cloudflare.com. 8BKAAS8EXNJbm8DxEI1OOBba8KaiimIuB47mPlteiZf3sVLGN1edsrXE +q+pHaSHEfYG5mHfCBJrbi6b3EoXOw==</code></pre>
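<p>Synthesizing that NSEC is a pure string operation on the query name. A minimal sketch of the idea (not the actual RRDNS code):</p>

```python
# Black lies sketch: the NSEC owner is the queried (missing) name itself,
# and the "next" name is a \000 label prepended to it -- the first name
# that can follow the owner in canonical order. Nothing is looked up.

def black_lie_nsec(qname):
    owner = qname.rstrip(".").lower() + "."
    next_name = "\\000." + owner  # presentation form of a single NUL-byte label
    return owner, next_name

owner, nxt = black_lie_nsec("blog.cloudflare.com")
print(f"{owner}\tNSEC\t{nxt} RRSIG NSEC")
# blog.cloudflare.com.	NSEC	\000.blog.cloudflare.com. RRSIG NSEC
```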
            
    <div>
      <h3>DNS Shotgun</h3>
      <a href="#dns-shotgun">
        
      </a>
    </div>
    <p>Our take on <code>NODATA</code> responses is also unique. Traditionally, <code>NODATA</code> responses contain one <code>NSEC</code> record to tell the resolver which types exist on the requested name. This is highly inefficient. To do this, we’d have to search the database for all the types that do exist, just to answer that the requested type does not exist. Remember that’s not even always possible, because we have dynamic answers that are generated on the fly.</p><p>What we realized was that <code>NSEC</code> is a denial of existence. What matters in <code>NSEC</code> are the missing types, not the present ones. So what we do is set every type except the one that was queried. We say: this name does exist, just not with the type you asked for.</p><p>For example, if you asked for a <code>TXT</code> record of <code>blog.cloudflare.com</code>, we would say that all the types exist, just not <code>TXT</code>.</p>
            <pre><code>blog.cloudflare.com.	3599	IN	NSEC	\000.blog.cloudflare.com. A WKS HINFO MX AAAA LOC SRV CERT SSHFP IPSECKEY RRSIG NSEC TLSA HIP OPENPGPKEY SPF</code></pre>
            <p>And then if you queried for an <code>MX</code> on <code>blog.cloudflare.com</code>, we would respond that we have every record type (even <code>TXT</code>), just not <code>MX</code>.</p>
            <pre><code>blog.cloudflare.com.	3599	IN	NSEC	\000.blog.cloudflare.com. A WKS HINFO TXT AAAA LOC SRV CERT SSHFP IPSECKEY RRSIG NSEC TLSA HIP OPENPGPKEY SPF</code></pre>
            <p>This saves us a database lookup and from leaking any zone information in negative answers. We call this the DNS Shotgun.</p>
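<p>Building the shotgun type bitmap is just as cheap. A sketch, assuming a fixed list of supported types like the one in the answers above (not necessarily Cloudflare's actual list):</p>

```python
# DNS Shotgun sketch: claim every type in a fixed list except the one
# that was queried. The list mirrors the blog.cloudflare.com answers
# shown above; no database lookup is needed to build it.

ALL_TYPES = ["A", "WKS", "HINFO", "MX", "TXT", "AAAA", "LOC", "SRV", "CERT",
             "SSHFP", "IPSECKEY", "RRSIG", "NSEC", "TLSA", "HIP",
             "OPENPGPKEY", "SPF"]

def shotgun_types(qtype):
    """Every type except the queried one -- denying only what was asked."""
    return [t for t in ALL_TYPES if t != qtype]

print(shotgun_types("TXT"))  # no TXT in the bitmap
print(shotgun_types("MX"))   # no MX, but TXT is back
```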
    <div>
      <h3>How Are Black Lies and DNS Shotgun Standards Compliant</h3>
      <a href="#how-are-black-lies-and-dns-shotgun-standards-compliant">
        
      </a>
    </div>
    <p>We put a lot of care into ensuring CloudFlare’s negative answers are standards compliant. We’re even pushing for them to become an Internet Standard: we <a href="https://tools.ietf.org/html/draft-valsorda-dnsop-black-lies">published an Internet Draft</a> earlier this year.</p><p><a href="https://www.ietf.org/rfc/rfc4470.txt">RFC4470</a>, White Lies, allows us to randomly generate next names in <code>NSEC</code>. Not setting the second <code>NSEC</code> for the wildcard subdomain is also allowed, so long as there exists an <code>NSEC</code> record on the actual queried name. And lastly, our lie of setting every record type in <code>NSEC</code> records for <code>NODATA</code> is okay too: after all, domains are constantly changing, and it’s feasible that the zone file changed between the time the <code>NSEC</code> record indicated there was no <code>MX</code> record on <code>blog.cloudflare.com</code> and the time you queried successfully for <code>MX</code> on <code>blog.cloudflare.com</code>.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>We’re proud of our negative answers. They help us keep packet size small, and CPU consumption low enough for us to provide <a href="https://www.cloudflare.com/dnssec/">DNSSEC for free</a> for any domain. Let us know what you think, we’re looking forward to hearing from you.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[RRDNS]]></category>
            <category><![CDATA[Salt]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">2gk2SvhQNI2AbzoJivm97n</guid>
            <dc:creator>Dani Grant</dc:creator>
        </item>
        <item>
            <title><![CDATA[DNSSEC: Complexities and Considerations]]></title>
            <link>https://blog.cloudflare.com/dnssec-complexities-and-considerations/</link>
            <pubDate>Wed, 05 Nov 2014 14:25:04 GMT</pubDate>
            <description><![CDATA[ DNSSEC is an extension to DNS: it provides a system of trust for DNS records. In this post we examine some of the complications of DNSSEC, and what CloudFlare plans to do to reduce any negative impact they might have.  ]]></description>
            <content:encoded><![CDATA[ <p><i>This blog post is a follow-up to our previous </i><a href="/dnssec-an-introduction/"><i>introduction to DNSSEC</i></a><i>. Read that first if you are not familiar with DNSSEC.</i></p><p>DNSSEC is an extension to DNS: it provides a system of trust for DNS records. It’s a major change to one of the core components of the Internet. In this post we examine some of the complications of DNSSEC, and what CloudFlare plans to do to reduce any negative impact they might have. The main issues are zone content exposure, key management, and the impact on DNS reflection/amplification attacks.</p>
    <div>
      <h3>Zone content exposure</h3>
      <a href="#zone-content-exposure">
        
      </a>
    </div>
    <p>DNS is split into smaller pieces called zones. A zone typically starts at a <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name/">domain name</a>, and contains all records pertaining to the subdomains. Each zone is managed by a single manager. For example, cloudflare.com is a zone containing all DNS records for cloudflare.com and its subdomains (e.g. <a href="http://www.cloudflare.com">www.cloudflare.com</a>, api.cloudflare.com).</p><p>There is no directory service for subdomains in DNS, so if you want to know whether api.cloudflare.com exists, you have to ask a DNS server, and that DNS server will end up asking cloudflare.com whether api.cloudflare.com exists. With DNSSEC, this is not necessarily true. In some cases, enabling DNSSEC may expose otherwise obscured zone content. Not everyone cares about the secrecy of subdomains, and zone content may already be easily guessable because most sites have a ‘www’ subdomain; however, subdomains are sometimes used as login portals or other services that the site owner wants to keep private. A site owner may not want to reveal that “secretbackdoor.example.com” exists in order to protect that site from attackers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70515kXx5jQKNOoSIaHgTW/a545b4f7c5b4d48d1b5bf7ad8fbc02a0/x-ray_specs.jpg" />
            
            </figure><p>The reason DNSSEC can expose subdomains has to do with how zones are signed. Historically, DNSSEC is used to sign static zones. A static zone is a complete set of records for a given domain. The DNSSEC signature records are created using the Key Signing Key (KSK) and Zone Signing Key (ZSK) in a central location and sent to the authoritative server to be published. This set of records allows an authoritative server to answer any question it is asked, including questions about subdomains that don’t exist.</p><p>Unlike standard DNS, where the server returns an unsigned NXDOMAIN (Non-Existent Domain) response when a subdomain does not exist, DNSSEC guarantees that every answer is signed. This is done with a special record that serves as a proof of non-existence called the NextSECure (NSEC) record. An NSEC record can be used to say: “there are no subdomains between subdomains X and subdomain Y.” By filling the gap between every domain in the zone, NSEC provides a way to answer any query with a static record. The NSEC record also lists what Resource Record types exist at each name.</p><p>For statically signed zones, there are, by definition, a fixed number of records. Since each NSEC record points to the next, this results in a finite ‘ring’ of NSEC records that covers all the subdomains. Anyone can ‘walk’ a zone by following one NSEC record to the next until they know all subdomains. This method can be used to reveal all of the names in that zone---possibly exposing internal information.</p><p>Suppose there is a DNSSEC-enabled zone called example.com, with subdomains public.example.com and secret.example.com. Adding NSEC records will reveal the existence of all subdomains.</p><p>Asking for the NSEC record of example.com gives the following:</p>
            <pre><code>example.com.		NSEC	public.example.com.     A NS SOA TXT AAAA RRSIG NSEC DNSKEY</code></pre>
            <p>Asking for public.example.com gives the following NSEC record:</p>
            <pre><code>public.example.com.    	NSEC	secret.example.com.  A TXT AAAA RRSIG NSEC</code></pre>
            <p>Asking for secret.example.com gives the following NSEC record:</p>
            <pre><code>secret.example.com.    	NSEC	example.com.         A TXT AAAA RRSIG NSEC</code></pre>
            <p>The first one is for the zone top/apex, and says that the name “example.com” exists and the next name is “public.example.com”. The public.example.com record says that the next name is “secret.example.com”, revealing the existence of a private subdomain. The “secret.example.com” record says the next name is “example.com”, completing the chain of subdomains. Therefore, with a few queries anybody can know the complete set of records in the zone.</p><p>Technically, DNS records are not supposed to be secret, but in practice they are sometimes considered so. Subdomains have been used to keep things (such as a corporate login page) private for a while, and suddenly revealing the contents of the zone file may be unexpected and unappreciated.</p><p>Before DNSSEC, the only way to discover the contents of names in a zone was to either query for them, or attempt to perform a transfer of the zone from one of the authoritative servers. Zone Transfers (AXFR) are frequently blocked. NSEC’s alternative, NSEC3, was introduced to fight zone enumeration concerns, but even NSEC3 can be used to reveal the existence of subdomains.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5G9hHvkwGCjvyGbRWTQmsd/aec1d348a3977ac627c08e39fb4c8e0d/dnssec_nsec_stat.png" />
            
            </figure><p><a href="http://securityblog.switch.ch/2013/02/19/dnssec-deployment-in-ch/"><i>Most domains under .ch use NSEC3</i></a></p><p>The NSEC3 record is like an NSEC record, but, rather than a signed gap of domain names for which there are no answers to the question, NSEC3 provides a signed gap of hashes of domain names. This was intended to prevent zone enumeration. Thus, the NSEC3 chain for a zone containing “example.com” and “<a href="http://www.example.com">www.example.com</a>” could be (each NSEC3 record is on 3 lines for clarity):</p>
            <pre><code>231SPNAMH63428R68U7BV359PFPJI2FC.example.com. NSEC3 1 0 3 ABCDEF
            NKDO8UKT2STOL6EJRD1EKVD1BQ2688DM
            A NS SOA TXT AAAA RRSIG DNSKEY NSEC3PARAM</code></pre>
            
            <pre><code>NKDO8UKT2STOL6EJRD1EKVD1BQ2688DM.example.com. NSEC3 1 0 3 ABCDEF
            231SPNAMH63428R68U7BV359PFPJI2FC
            A TXT AAAA RRSIG</code></pre>
            <p>Where <code>231SPNAMH63428R68U7BV359PFPJI2FC</code> is the salted hash of <code>example.com</code> and <code>NKDO8UKT2STOL6EJRD1EKVD1BQ2688DM</code> is the salted hash of <code>www.example.com</code>. This is reminiscent of the way <a href="/keeping-passwords-safe-by-staying-up-to-date/">password databases work</a>.</p><p>The first line of the NSEC3 record contains the “name” of the zone after it has been hashed, the number of hash rounds and salt used in the hashing are the two last parameters on the first line “3 ABCDEF”. The “1 0” stands for digest algorithm (1 means SHA-1) and if the zone uses Opt-out (0 means no). The second line is the “next hashed name in the zone”, the third line lists the types at the name. You can see the “next name” at the first NSEC3 record matches the name on the second NSEC3 record and the “next name” on that one completes the chain.</p><p>For NSEC enumeration, you can create the full list of domains by starting to guess at possible names in the domain. If the zone has around 100 domain names, it will take around 100 requests to enumerate the entire zone. With NSEC3, when you request a record that does not exist, a signed NSEC3 record is returned with the next zone present ordered alphabetically by hash. Checking if the next query name candidate fits in one of the known gaps allows anyone to discover the full chain in around 100 queries. There are many tools that can do this computation for you, including <a href="http://nmap.org/nsedoc/scripts/dns-nsec3-enum.html">a plug-in to nmap</a>.</p><p>With the hashes that correspond to all the valid names in the zone, a dictionary attack can be used to figure out the real names. Short names are easily guessed, and by using a dictionary, longer names can be revealed as existing without having to flood the authoritative nameservers with guesses. 
Tools like <a href="http://hashcat.net/oclhashcat/">HashCat</a> make this easy to do in software, and the popularity of Bitcoin has dramatically lowered the price of hashing-specific hardware. There is a burgeoning cottage industry of devices built to compute cryptographic hashes. The Tesla Password cracker (below) is just one example of these off-the-shelf devices.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gsig6X5hT2Ke0PzKMBVgv/24f738923ef83e3a307e05c41370961d/Tesla-cracker.jpg" />
            
            </figure><p><i>The Tesla Password Cracker</i></p><p>Because hashing is cheap, zone privacy is only slightly improved when using NSEC3 as designed; the amount of protection a name gets is proportional to its unguessability.</p><p>In short, NSEC is like revealing plaintext passwords, and NSEC3 is like revealing a Unix-style passwords file. Neither technique is very secure. With NSEC3, a subdomain is only as private as it is hard to guess.</p><p>This vulnerability can be mitigated by techniques introduced in RFCs 4470 and 4471 (<a href="https://tools.ietf.org/html/rfc4470">https://tools.ietf.org/html/rfc4470</a> and <a href="https://tools.ietf.org/html/rfc4471">https://tools.ietf.org/html/rfc4471</a>) called “DNSSEC white lies”; this was implemented by Dan Kaminsky for demonstration purposes. When a request comes in for a domain that is not present, instead of providing an NSEC3 record of the next real domain, an NSEC3 record of the next hash alphabetically is presented. This does not break the NSEC3 guarantee that there are no domains whose hash fits lexicographically between the NSEC3 response and the question.</p><p>You can only implement NSEC3 or NSEC “white lies” if signatures can be computed in real time in response to questions. Traditionally, static zone records for DNS resolution are created offline, and all the records with signatures are stored in a zone file. This file is then read by a live DNS server, allowing it to answer questions about the zone. Having a DNS server with the minimum amount of logic inside allows the operator to conserve CPU resources and maximize the number of queries that can be answered. In order to do DNSSEC white lies, the servers must have keys available to generate signatures on the fly. This is a major change in the traditional operating practices of DNS servers, because the authoritative server itself needs to do the cryptographic operations in response to the incoming query. This demand for live signing introduces several other security problems in distributed environments.</p>
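<p>The white-lies idea itself is simple to sketch. For the NSEC3 variant, the server synthesizes a covering pair of hashes immediately before and after the hash of the queried name, so the gap contains only that one hash and nothing about the real zone leaks. A toy version, working on hex strings for readability (real NSEC3 records present hashes in Base32hex):</p>

```python
# White-lies sketch: given the hash of the queried name, fabricate the
# tightest possible covering pair -- hash - 1 and hash + 1 -- instead of
# looking up the real neighbouring hashes in the zone.

def neighbours(hash_hex):
    """Return (hash - 1, hash + 1) as fixed-width hex strings."""
    width = len(hash_hex)
    value = int(hash_hex, 16)
    before = format((value - 1) % 16**width, f"0{width}x")
    after = format((value + 1) % 16**width, f"0{width}x")
    return before, after

# hypothetical hash of a queried name, hex-encoded
print(neighbours("5d41402abc4b2a76b9719d911017c592"))
```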
    <div>
      <h3>Key management</h3>
      <a href="#key-management">
        
      </a>
    </div>
    <p>DNSSEC was designed to operate in various modes, each providing different security, performance and convenience tradeoffs. Live signing solves the zone content exposure problem in exchange for less secure key management.</p><p>The most common DNSSEC mode is offline signing of static zones. This allows the signing system to be highly protected from external threats by keeping the private keys on a machine that is not connected to the network. This operating model works well when the DNS information does not change often.</p><p>Another common operating mode is centralized online signing. Signing data on restricted-access, dedicated DNS signing systems allows DNS data to change and be published quickly. Some operators run DNS signing on their main DNS servers. Just like offline signing of static zones, this mode follows the central signing model, i.e., a single (or replicated) central signer does all the signing and the data gets propagated from it to the actual authoritative DNS servers.</p><p>A more radical mode is to allow the actual authoritative DNS servers to sign data on the fly when needed; this allows a number of new features, including geographically dependent information signed where the answer is generated. The downside is that the keying material is now on many different machines that have direct access to the Internet. Doing live signing at the edge introduces new problems, such as key distribution, and places extra computing requirements on the nodes.</p><p>Recently, a bug known as <a href="/answering-the-critical-question-can-you-get-private-ssl-keys-using-heartbleed/">Heartbleed</a> was found that opened a major security hole in server applications. It was caused by a coding error in OpenSSL that created a remote memory disclosure vulnerability. This bug allowed remote attackers to extract cryptographic keys from Internet-facing servers. 
Remote memory exposure bugs are just one of the many threats to private key security when the key is being used in an active process such as DNSSEC live signing. The more a machine is exposed to the Internet, the more vectors of attack there are. Offline signing machines have a much smaller window of exposure to such threats.</p><p>One way to keep keys secure is to use a hardware-backed solution such as a Hardware Security Module (HSM). The major drawback is cost: HSMs are very expensive (and slow). This is one of the stickiest points for running DNS servers that are spread out geographically in order to be close to their customers. Running an HSM in every server location is not only expensive, but can also raise <a href="http://www.kslaw.com/imageserver/KSPublic/library/publication/2011articles/11-11WorldECRCloutierCohen.pdf">legal complications</a>.</p><p>Another solution to protect keys from remote disclosure is to offload cryptographic operations into a trusted component of the system. This is where having a <a href="/cloudflare-fastest-free-dns-among-fastest-dns/">custom DNS server</a> that can offload cryptography comes in handy.</p><p>Key management for DNSSEC is similar to key management for TLS and has similar challenges. Earlier this year, we introduced <a href="/keyless-ssl-the-nitty-gritty-technical-details/">Keyless SSL</a> to help improve key security for TLS. We are looking at extending Keyless SSL to provide the advantages of remote key servers for DNSSEC live signing.</p>
    <div>
      <h3>Reflection/amplification threat</h3>
      <a href="#reflection-amplification-threat">
        
      </a>
    </div>
    <p>Operators running an authoritative DNS server are often nervous that their server will be used as a conduit for malicious distributed denial of service (DDoS) attacks. This stems from the fact that DNS uses UDP, a stateless protocol.</p><p>In TCP, each connection begins with a three-way handshake. This ensures that the IP address of both parties is known and correct before starting a connection. In UDP, there is no such handshake: messages are just sent directly to an IP with an unverified ‘from’ IP address. If an attacker can craft a UDP packet that says ‘hi, from IP X’ to a server, the server will typically respond by sending a UDP packet to X. Choosing X as a victim’s IP address instead of the sender’s is called UDP ‘spoofing’. By spoofing a victim, an attacker can cause a server that responds to UDP requests to flood the victim with ‘reflected’ traffic. This applies as much to authoritative servers as to open recursive resolvers.</p><p>DNSSEC also works over UDP, and the answers to DNS queries can be very long, containing multiple DNSKEY and RRSIG records. This is an attractive target for attackers since it allows them to ‘amplify’ their reflection attacks. If a small volume of spoofed UDP DNSSEC requests is sent to nameservers, the victim will receive a large volume of reflected traffic. Sometimes this is enough to overwhelm the victim’s server and cause a denial of service.</p><p>Asking for a <a href="https://www.cloudflare.com/learning/dns/top-level-domain/">TLD</a> that does not exist from a root server returns an answer that is around 100 bytes; the signed answer to the same question is about 650 bytes, an amplification factor of 6.5. The root is signed using a 1,024-bit RSA key and uses NSEC for negative answers. Asking for a domain that does not exist in a TLD using NSEC3 signed with a 1,024-bit key will yield an amplification factor of around 10. 
There are other queries that can yield even higher amplification factors, the most effective being the “ANY” query.</p><p>Like many services, DNS can also work over TCP. There is a ‘truncation’ flag that can be sent back to a resolver to indicate that TCP is required. This would fix the issue of DNS reflection at the cost of slower DNS requests. This solution is not practical at the moment since 16% of resolvers don’t respect the TCP truncation flag, and <a href="http://www.circleid.com/posts/20130820_a_question_of_dns_protocols/">4% don’t try a second server</a>.</p><p>Another option to reduce the size of responses is to use <a href="/ecdsa-the-digital-signature-algorithm-of-a-better-internet/">Elliptic Curve Digital Signature Algorithm (ECDSA)</a> keys instead of traditional RSA keys. ECDSA keys are much smaller than RSA keys of equivalent strength, and they produce smaller signatures, making DNSSEC responses much smaller and reducing the amplification factor. Unfortunately, many recursive resolvers (including <a href="https://lists.dns-oarc.net/pipermail/dns-operations/2014-January/011257.html">Google Public DNS</a>) do not support this key type yet, and many <a href="https://www.cloudflare.com/learning/dns/glossary/what-is-a-domain-name-registrar/">registrars</a> have not added the option to their DNS management portals.</p><p>Support for TCP and ECDSA is still lagging behind general DNSSEC support, so traditional anti-abuse methods can be used instead. This includes <a href="https://kb.isc.org/article/AA-01000/0/A-Quick-Introduction-to-Response-Rate-Limiting.html">Response Rate Limiting (RRL)</a> and other heuristics.</p><p>To protect against reflection attacks, CloudFlare is working on a multi-pronged approach: first, by using the attack identification heuristics and anti-abuse techniques already deployed in our DNS server, and second, by reducing the amplification factor of DNSSEC responses. 
Ways to reduce the maximum amplification factor include replying to “ANY” requests only over TCP, using smaller ECDSA keys when possible, and reducing the frequency of key rollovers.</p>
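<p>The size difference between the two signature algorithms is easy to see with Go's standard library. The sketch below is illustrative only (the digest contents and key sizes are arbitrary choices, not the ones any particular DNS operator uses): it signs the same SHA-256 digest with a 2,048-bit RSA key and a P-256 ECDSA key and compares signature lengths.</p>

```go
package main

import (
	"crypto"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
)

// signatureSizes signs one digest with RSA-2048 and with ECDSA P-256,
// and returns the two signature lengths in bytes.
func signatureSizes() (rsaLen, ecdsaLen int) {
	digest := sha256.Sum256([]byte("www.example.com. IN A"))

	rsaKey, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	rsaSig, err := rsa.SignPKCS1v15(rand.Reader, rsaKey, crypto.SHA256, digest[:])
	if err != nil {
		panic(err)
	}

	ecKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	ecSig, err := ecdsa.SignASN1(rand.Reader, ecKey, digest[:])
	if err != nil {
		panic(err)
	}

	return len(rsaSig), len(ecSig)
}

func main() {
	r, e := signatureSizes()
	fmt.Printf("RSA-2048 signature: %d bytes\n", r)    // always 256
	fmt.Printf("ECDSA P-256 signature: %d bytes\n", e) // roughly 70, DER-encoded
}
```

<p>An RSA-2048 signature is always 256 bytes, while a DER-encoded P-256 signature is around 70 bytes, which is why an ECDSA-signed response shrinks so dramatically.</p>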
    <div>
      <h3>Conclusions</h3>
      <a href="#conclusions">
        
      </a>
    </div>
    <p>CloudFlare is aware of the complexity introduced by DNSSEC with respect to zone privacy, key management, and reflection/amplification risk. With smart engineering decisions and operational controls in place, the dangers of DNSSEC can be mitigated.</p> ]]></content:encoded>
            <category><![CDATA[DNSSEC]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">4JIaBxM74zVImDCAEApK96</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[Red October: CloudFlare’s Open Source Implementation of the Two-Man Rule]]></title>
            <link>https://blog.cloudflare.com/red-october-cloudflares-open-source-implementation-of-the-two-man-rule/</link>
            <pubDate>Thu, 21 Nov 2013 09:00:00 GMT</pubDate>
            <description><![CDATA[ At CloudFlare, we are always looking for better ways to secure the data we’re entrusted with. This means hardening our system against outside threats such as hackers, but it also means protecting against insider threats.  ]]></description>
            <content:encoded><![CDATA[ <p>At CloudFlare, we are always looking for better ways to secure the data we’re entrusted with. This means hardening our system against outside threats such as hackers, but it also means protecting against insider threats. According to a <a href="http://www.verizonenterprise.com/DBIR/2013/">recent Verizon report</a>, insider threats accounted for around 14% of data breaches in 2013. While we perform background checks and carefully screen team members, we also implement technical barriers to protect the data with which we are entrusted.</p><p>One good information security practice is known as the “two-man rule.” It comes from military history, where a nuclear missile couldn’t be launched unless two people agreed and turned their launch keys simultaneously. This requirement was introduced in order to prevent one individual from accidentally (or intentionally) starting World War III.</p><p>To prevent the risk of rogue employees misusing sensitive data, we built a service in Go to enforce the two-person rule. We call the service Red October after the famous scene from “The Hunt for Red October.” In line with <a href="/a-note-about-kerckhoffs-principle">our philosophy on security software</a>, we are open sourcing the technology so you can use it in your own organization (<a href="https://github.com/cloudflare/redoctober">here’s a link</a> to the public Github repo). If you are interested in the nitty-gritty details, read on.</p>
    <div>
      <h3>What it is</h3>
      <a href="#what-it-is">
        
      </a>
    </div>
    <p>Red October is a cryptographically-secure implementation of the two-person rule to protect sensitive data. From a technical perspective, Red October is a software-based encryption and decryption server. The server can be used to encrypt a payload in such a way that no one individual can decrypt it. The encryption of the payload is cryptographically tied to the credentials of the authorized users.</p><p>Authorized persons can delegate their credentials to the server for a period of time. The server can decrypt any previously-encrypted payloads as long as the appropriate number of people have delegated their credentials to the server.</p><p>This architecture allows Red October to act as a convenient decryption service. Other systems, including CloudFlare’s build system, can use it for decryption and users can delegate their credentials to the server via a simple web interface. All communication with Red October is encrypted with TLS, ensuring that passwords are not sent in the clear.</p>
    <div>
      <h3>How to use it</h3>
      <a href="#how-to-use-it">
        
      </a>
    </div>
    <p>Setting up a Red October server is simple; all it requires is a locally-readable path and an SSL key pair. After that, all control is handled remotely through a set of JSON-based APIs.</p><p>Red October is backed by a database of accounts stored on disk in a portable password vault. The server never stores the account password there, only a <a href="/keeping-passwords-safe-by-staying-up-to-date">salted hash of the password</a> for each account. For each user, the server creates an RSA key pair and encrypts the private key with a key derived from the password and a randomly generated salt using a secure derivation function.</p><p>Any administrator can encrypt any piece of data with the encrypt API. This request takes a list of users and the minimum number of users needed to decrypt it. The server returns a somewhat larger piece of data that contains an encrypted version of this data. The encrypted data can then be stored elsewhere.</p><p>This data can later be decrypted with the decrypt API, but only if enough people have delegated their credentials to the server. The delegation API lets a user grant permission to a server to use their credentials for a limited amount of time and a limited number of uses.</p>
    <div>
      <h3>Cryptographic Design</h3>
      <a href="#cryptographic-design">
        
      </a>
    </div>
    <p>Red October was designed from cryptographic first principles, combining trusted and understood algorithms in known ways. CloudFlare is also opening the source of the server to allow others to analyze its design.</p><p>Red October is based on combinatorial techniques and trusted cryptographic primitives. We investigated using more complicated secret-sharing schemes like <a href="http://en.wikipedia.org/wiki/Shamir's_Secret_Sharing">Shamir's Secret Sharing</a>, but we found that a simpler combinatorial approach based on primitives from Go's standard library was preferable to implementing a mathematical algorithm from scratch. Red October uses <a href="http://en.wikipedia.org/wiki/Advanced_Encryption_Standard">128-bit AES</a>, <a href="http://en.wikipedia.org/wiki/RSA_(algorithm)">2048-bit RSA</a> and <a href="http://en.wikipedia.org/wiki/Scrypt">scrypt</a> as its cryptographic primitives.</p>
    <div>
      <h4>Creating an account</h4>
      <a href="#creating-an-account">
        
      </a>
    </div>
    <p>Each user is assigned a unique, randomly-generated RSA key pair when creating an account on a Red October server. The private key is encrypted with a password key derived from the user’s password and salt using scrypt. The public key is stored unencrypted in the vault with the encrypted private key.</p>
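<p>In rough outline, account creation might look like the sketch below. This is an illustration of the scheme, not Red October's actual vault code: the field names are invented, the iterated-SHA-256 key stretching is a stand-in for scrypt (which lives outside Go's standard library), and AES-GCM is one reasonable choice of authenticated encryption for sealing the private key.</p>

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"crypto/x509"
	"fmt"
)

// deriveKey is a placeholder for scrypt: it stretches password+salt by
// iterating SHA-256. A real vault should use a memory-hard KDF instead.
func deriveKey(password, salt []byte) []byte {
	sum := sha256.Sum256(append(salt, password...))
	for i := 0; i < 100000; i++ {
		sum = sha256.Sum256(sum[:])
	}
	return sum[:16] // 128-bit AES key
}

// Record is roughly what gets persisted per account. The password
// itself is never stored.
type Record struct {
	Salt       []byte
	PublicKey  []byte // stored unencrypted
	SealedPriv []byte // RSA private key, sealed under the password-derived key
}

func createAccount(password []byte) Record {
	salt := make([]byte, 16)
	if _, err := rand.Read(salt); err != nil {
		panic(err)
	}
	priv, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	block, _ := aes.NewCipher(deriveKey(password, salt))
	gcm, _ := cipher.NewGCM(block)
	nonce := make([]byte, gcm.NonceSize())
	rand.Read(nonce)
	return Record{
		Salt:       salt,
		PublicKey:  x509.MarshalPKCS1PublicKey(&priv.PublicKey),
		SealedPriv: gcm.Seal(nonce, nonce, x509.MarshalPKCS1PrivateKey(priv), nil),
	}
}

// unlock recovers the private key, as happens when a user delegates
// their credentials with the correct password.
func unlock(rec Record, password []byte) (*rsa.PrivateKey, error) {
	block, _ := aes.NewCipher(deriveKey(password, rec.Salt))
	gcm, _ := cipher.NewGCM(block)
	ns := gcm.NonceSize()
	der, err := gcm.Open(nil, rec.SealedPriv[:ns], rec.SealedPriv[ns:], nil)
	if err != nil {
		return nil, err // wrong password: authenticated decryption fails
	}
	return x509.ParsePKCS1PrivateKey(der)
}

func main() {
	rec := createAccount([]byte("hunter2"))
	_, err := unlock(rec, []byte("wrong"))
	fmt.Println("wrong password rejected:", err != nil)
	_, err = unlock(rec, []byte("hunter2"))
	fmt.Println("correct password unlocks key:", err == nil)
}
```

<p>Because the seal is authenticated, a wrong password simply fails to decrypt; nothing about the stored record reveals the password or the private key.</p>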
    <div>
      <h4>Encrypting data</h4>
      <a href="#encrypting-data">
        
      </a>
    </div>
    <p>When asked to encrypt a piece of data, the server generates a random 128-bit AES key. This key is used to encrypt the data. For each user that is allowed to decrypt the data, a user-specific key encryption key is chosen. For each unique pair of users, the data key is doubly encrypted, once with the key encryption key of each user. The key encryption keys are then encrypted with the public RSA key associated with their account. The encrypted data, the set of doubly-encrypted data keys, and the RSA-encrypted key encryption keys are all bundled together and returned. The encrypted data is never stored on the server.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/38JLECxsa5YaGcwkkJnhEn/3054808bc0b25afda9d7c3be1fdc8f6b/cryptography1.png" />
            
            </figure><p>Delegating credentials to the server</p><p>When a user delegates their key to the server, they submit their username and password over TLS using the delegate JSON API. For each account, the password is verified against the salted hash. If the password is correct, a password key is derived from the password and used to decrypt the user’s RSA private key. This key is now “Live” for the length of time and number of decryptions chosen by the user.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Jefh1rq4DwCnifHejHJNe/c10f4e6420354b9c16f3f8f12ff15ae8/cryptography3.png" />
            
            </figure><p>Decrypting data</p><p>To decrypt a file, the server validates that the requesting user is an administrator and has the correct password. If two users from the list of valid users have delegated their keys, then decryption can occur. First, the RSA private keys are used to decrypt the key encryption keys for these two users; then the key encryption keys are used to decrypt the doubly encrypted data key, which is then used to decrypt the data.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6nvlGuGFWguhaebYbrsY9n/1432bb6989d6b48631c7271d95bd12a6/cryptography2.png" />
            
            </figure><p>Some other key points:</p><ol><li><p>Cryptographic security. The Red October server does not have the ability to decrypt user keys without their password. This prevents someone with access to the vault from decrypting data.</p></li><li><p>Password flexibility. Passwords can be changed without changing the encryption of a given file. Key encryption keys ensure that password changes are decoupled from data encryption keys.</p></li></ol>
    <div>
      <h3>Looking ahead</h3>
      <a href="#looking-ahead">
        
      </a>
    </div>
    <p>The version of Red October we are releasing to GitHub is in beta. It is licensed under the <a href="http://opensource.org/licenses/BSD-3-Clause">3-clause BSD license</a>. We plan to continue to release our improvements to the open source community. Here is the project on GitHub: <a href="https://github.com/cloudflare/redoctober">Red October</a>.</p><p>Writing the server in Go allowed us to design the different components of this server in a modular way. Our hope is that this modularity will make it easy for anyone to build in support for different authentication methods that are not based on passwords (for example, TLS client certificates or time-based one-time passwords) and new core cryptographic primitives (for example, elliptic curve cryptography).</p><p>CloudFlare is always looking to improve the state of security on the Internet. It is important to us to share our advances with the world and <a href="/open-source-two-way-street">contribute back to the community</a>. See the <a href="http://cloudflare.github.io/">CloudFlare GitHub page</a> for the list of our open source projects and initiatives.</p> ]]></content:encoded>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[Encryption]]></category>
            <category><![CDATA[RSA]]></category>
            <category><![CDATA[GitHub]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cryptography]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">6tG7fMY0ykeCo1uPglwZFn</guid>
            <dc:creator>Nick Sullivan</dc:creator>
        </item>
        <item>
            <title><![CDATA[A note about Kerckhoff's Principle]]></title>
            <link>https://blog.cloudflare.com/a-note-about-kerckhoffs-principle/</link>
            <pubDate>Tue, 19 Jun 2012 15:56:00 GMT</pubDate>
            <description><![CDATA[ The other day I wrote a long post describing in detail how we used to and how we now store customer passwords. Some people were surprised that we were open about this. ]]></description>
            <content:encoded><![CDATA[ <p>Image credit: <a href="http://www.flickr.com/photos/psicologiaclinica/">psicologiaclinica</a></p><p>The other day I wrote a long post describing in detail <a href="/keeping-passwords-safe-by-staying-up-to-date">how we used to and how we now store customer passwords</a>. Some people were surprised that we were open about this, and others wanted to understand exactly what part of a security system needs to be kept secret.</p><p>The simple answer is: a security system is only secure if its details can be safely shared with the world. This is known as <a href="http://en.wikipedia.org/wiki/Kerckhoffs's_principle">Kerckhoff's Principle</a>.</p><p>The principle is sometimes stated as "a cryptosystem should be secure even if everything about the system, except the key, is public knowledge" or using Claude Shannon's simpler version: "the enemy knows the system".</p><p>The idea is that if any part of a cryptosystem (except the individual secret key) has to be kept secret then the cryptosystem is not secure. That's because if the simple act of disclosing some detail of the system were to make it suddenly insecure then you've got a problem on your hands.</p><p>You've somehow got to keep that detail secret and for that you'll need a cryptosystem! Given that the whole point of the cryptosystem was to keep secrets, it's useless if it needs some other system to keep itself secret.</p><p>So, the gold standard for any secret-keeping system is that all its details should be able to be made public without compromising the security of the system. The security relies on the system itself, not the secrecy of the system. (And as a corollary, if anyone tells you they've got some supersecret encryption system they can't tell you about, then it's likely rubbish.)</p><p>A great example of this is the breaking of the <a href="http://en.wikipedia.org/wiki/Enigma_machine">Nazi German Enigma cipher</a> during the Second World War. 
By stealing machines, receiving information from other secret services, and reading the manuals, the Allies knew everything there was to know about how the Enigma machine worked.</p><p>Enigma's security relied not on its secrecy, but on its complexity (and on keeping the daily key a secret). Enigma was broken by attacking the mathematics behind its encryption and building special machines to exploit mathematical flaws in the encryption.</p><p>That's just as true today. The security of HTTPS, SSL and ciphers like AES or RSA rely on the complexity of the algorithm, not on keeping them secret. In fact, they are all published, detailed standards. The only secret is the key that's chosen when you connect to a secure web site (that's done automatically and randomly by your browser and the server) or when you encrypt a document using a program like GPG.</p><p>Another example is home security. Imagine if you bought a lock that stated that it must be hidden from view so that no one knew what type of lock you had installed. That wouldn't provide much reassurance that the lock was any good. The security of the lock should depend on its mechanism and you keeping the key safe, not on you keeping the lock a secret!</p>
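<p>The same principle shows up in how passwords are stored. The sketch below illustrates it in Go; it is illustration only, since SHA-256 here is a stand-in for a proper slow password hash like bcrypt (which is not in Go's standard library). The point is Kerckhoff's: the algorithm is published and the salt is stored in the clear, yet the scheme stays secure because only the password is secret.</p>

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// hashPassword stands in for bcrypt: a real system should use a
// deliberately slow, well-studied password hash. Nothing here is
// secret except the password itself.
func hashPassword(password, salt []byte) []byte {
	h := sha256.Sum256(append(salt, password...))
	return h[:]
}

// verify recomputes the salted hash and compares in constant time.
func verify(password, salt, stored []byte) bool {
	return hmac.Equal(hashPassword(password, salt), stored)
}

func main() {
	pw := []byte("hunter2")

	saltA := make([]byte, 16)
	saltB := make([]byte, 16)
	rand.Read(saltA)
	rand.Read(saltB)

	// Same password, different random salts -> different digests, so a
	// precomputed (rainbow) table keyed on the bare password is useless.
	a := hashPassword(pw, saltA)
	b := hashPassword(pw, saltB)
	fmt.Println("digests differ:", string(a) != string(b))

	// The salt can be stored right next to the hash, in the open.
	fmt.Println("verify with public salt:", verify(pw, saltA, a))
}
```

<p>Disclosing the algorithm and the salt costs the scheme nothing; its strength rests entirely on the hash function and the secrecy of the password.</p>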
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2a7e3lQQwJd4EQe1cvSE3A/87c92e843914277edca262a9758859bf/7190315846_1a651daaf7.jpeg.scaled500.jpg" />
            
            </figure><p>Image credit: <a href="http://www.flickr.com/photos/paulorear/">paul.orear</a></p><p>When storing passwords securely, we rely on the complexity of the bcrypt algorithm. Everything about our storage mechanism is assumed to be something that can be made public. So it's safe to say that we choose a random salt, and that we use bcrypt. The salt is not a key, and it does not need to be kept secret.</p><p>But even more, we assume that in the horrible case that our password database were accessed, it would still be secure even though the hashed passwords and salts would be available to a hacker. The security of the system relies on the security of bcrypt and nothing else.</p><p>Of course, as a practical matter we don't leave the database lying around for anyone to access. It's kept behind firewalls and securely stored. But when thinking about security it's important to think about the worst case situation, a full disclosure of the secured information, and rely on the algorithm's strength and nothing else.</p> ]]></content:encoded>
            <category><![CDATA[Tech Talks]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">5jPpUkNtZivIfb5kAu7ja1</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
    </channel>
</rss>