Cloudflare extensively uses its own products internally in a process known as dogfooding. As part of my onboarding as an intern on the Spectrum (a layer 4 reverse proxy) team, I learned that many internal services dogfood Spectrum, as they are exposed to the Internet and benefit from layer 4 DDoS protection. One of my first tasks was to update the configuration for an internal service that was using Spectrum. The configuration was managed in Salt (used for configuration management at Cloudflare), which was not particularly user-friendly, and required an engineer on the Spectrum team to handle updating it manually.

This process took about a week. That should instantly raise some questions, as a typical Spectrum customer can create a new Spectrum app in under a minute through Cloudflare Dashboard. So why couldn’t I?

This question formed the basis of my intern project for the summer.

The Process

Cloudflare uses various IP ranges for its products. Some customers also authorize Cloudflare to announce their IP prefixes on their behalf (this is known as BYOIP). Collectively, we can refer to these IPs as managed addresses. To prevent Bad Stuff (defined later) from happening, we prohibit managed addresses from being used as Spectrum origins. To accomplish this, Spectrum had its own table of denied networks that included the managed addresses. For the average customer, this approach works great – they have no legitimate reason to use a managed address as an origin.

Unfortunately, the services dogfooding Spectrum all use Cloudflare IPs, preventing those teams with a legitimate use-case from creating a Spectrum app through the configuration service (i.e. Cloudflare Dashboard). To bypass this check, these internal customers needed to define a custom Spectrum configuration, which needed to be manually deployed to the edge via a pull request to our Salt repo, resulting in a time consuming process.

If an internal customer wanted to change their configuration, the same time consuming process must be used. While this allowed internal customers to use Spectrum, it was tedious and error prone.

Bad Stuff

Bad Stuff is quite vague and deserves a better definition. It may seem arbitrary that we deny Cloudflare managed addresses. To motivate this, consider two Spectrum apps, A and B, where the origin of app A is the Cloudflare edge IP of app B, and the origin of app B is the edge IP of app A. Essentially, app A will proxy incoming connections to app B, and app B will proxy incoming connections to app A, creating a cycle.

This could potentially crash the daemon or degrade performance. In practice, this configuration is useless and would only be created by a malicious user, as the proxied connection never reaches an origin, so it is never allowed.

In fact, the more general case of setting another Spectrum app as an origin (even when the configuration does not result in a cycle) is (almost1) never needed, so it also needs to be avoided.

As well, since we are providing a reverse proxy to customer origins, we do not need to allow connections to IP ranges that cannot be used on the public Internet, as specified in this RFC.

The Problem

To improve usability and allow internal Spectrum customers to create apps using the Dashboard instead of the static configuration workflow, we needed a way to give particular customers permission to use Cloudflare managed addresses in their Spectrum configuration. Solving this problem was my main project for the internship.

A good starting point ended up being the Addressing API. The Addressing API is Cloudflare’s solution to IP management, an internal database and suite of tools to keep track of IP prefixes, with the goal of providing a unified source of truth for how IP addresses are being used across the organization. This makes it possible to provide a cross-product platform for products and features such as BYOIP, BGP On Demand, and Magic Transit.

The Addressing API keeps track of all Cloudflare managed IP prefixes, along with who owns the prefix. As well, the owner of a prefix can give permission for someone else to use the prefix. We call this a delegation.

A user’s permission to use an IP address managed by the Addressing API is determined as followed:

  1. Is the user the owner of the prefix containing the IP address?
    a) Yes, the user has permission to use the IP
    b) No, go to step 2
  2. Has the user been delegated a prefix containing the IP address?
    a) Yes, the user has permission to use the IP.
    b) No, the user does not have permission to use the IP.

The Solution

With the information present in the Addressing API, the solution starts to become clear. For a given customer and IP, we use the following algorithm:

  1. Is the IP managed by Cloudflare (or contained in the previous RFC)?
    a) Yes, go to step 2
    b) No, allow as origin
  2. Does the customer have permission to use the IP address?
    a) Yes, allow as origin
    b) No, deny as origin

As long as the internal customer has been given permission to use the Cloudflare IP (through a delegation in the Addressing API), this approach would allow them to use it as an origin.

However, we run into a corner case here – since BYOIP customers also have permission to use their own ranges, they would be able to set their own IP as an origin, potentially causing a cycle. To mitigate this, we need to check if the IP is a Spectrum edge IP. Fortunately, the Addressing API also contains this information, so all we have to do is check if the given origin IP is already in use as a Spectrum edge IP, and if so, deny it. Since all of the denied networks checks occur in the Addressing API, we were able to remove Spectrum's own deny network database, reducing the engineering workload to maintain it along the way.

Let's go through a concrete example. Consider an internal customer who wants to use 104.16.8.54/32 as an origin for their Spectrum app. This address is managed by Cloudflare, and suppose the customer has permission to use it, and the address is not already in use as an edge IP. This means the customer is able to specify this IP as an origin, since it meets all of our criteria.

For example, a request to the Addressing API could look like this:

curl --silent 'https://addr-api.internal/edge_services/spectrum/validate_origin_ip_acl?cidr=104.16.8.54/32' -H "Authorization: Bearer $JWT" | jq .
{
  "success": true,
  "errors": [],
  "result": {
    "allowed_origins": {
      "104.16.8.54/32": {
        "allowed": true,
        "is_managed": true,
        "is_delegated": true,
        "is_reserved": false,
        "has_binding": false
      }
    }
  },
  "messages": []
}

Now we have completely moved the responsibility of validating the use of origin IP addresses from Spectrum’s configuration service to the Addressing API.

Performance

This approach required making another HTTP request on the critical path of every create app request in the Spectrum configuration service. Some basic performance testing showed (as expected) increased response times for the API call (about 100ms). This led to discussion among the Spectrum team about the performance impact of different HTTP requests throughout the critical path. To investigate, we decided to use OpenTracing.

OpenTracing is a standard for providing distributed tracing of microservices. When an HTTP request is received, special headers are added to it to allow it to be traced across the different services. Within a given trace, we can see how long a SQL query took, the time a function took to complete, the amount of time a request spent at a given service, and more.

We have been deploying a tracing system for our services to provide more visibility into a complex system.

After instrumenting the Spectrum config service with OpenTracing, we were able to determine that the Addressing API accounted for a very small amount of time in the overall request, and allowed us to identify potentially problematic request times to other services.

Lessons Learned

Reading documentation is important! Having a good understanding of how the Addressing API and the config service worked allowed me to create and integrate an endpoint that made sense for my use-case.

Writing documentation is just as important. For the final part of my project, I had to onboard Crossbow – an internal Cloudflare tool used for diagnostics – to Spectrum, using the new features I had implemented. I had written an onboarding guide, but some stuff was unclear during the onboarding process, so I made sure to gather feedback from the Crossbow team to improve the guide.

Finally, I learned not to underestimate the amount of complexity required to implement relatively simple validation logic. In fact, the implementation required understanding the entire system. This includes how multiple microservices work together to validate the configuration and understanding how the data is moved from the Core to the Edge, and then processed there. I found increasing my understanding of this system to be just as important and rewarding as completing the project.

Footnotes:

Regional Services actually makes use of proxying a Spectrum connection to another colocation, and then proxying to the origin, but the configuration plane is not involved in this setup.