In this blog post, we will walk you through the reliability model of services running in our more than 200 edge cities worldwide. Then, we will go over how deploying a new dynamic task scheduling system, HashiCorp Nomad, helped us improve the availability of services in each of those data centers, covering how we deployed Nomad and the challenges we overcame along the way. Finally, we will show you both how we currently use Nomad and how we are planning on using it in the future.

Reliability model of services running in each data center

For this blog post, we will distinguish between two different categories of services running in each data center:

  • Customer-facing services: all of our stack of products that our customers use, such as caching, WAF, DDoS protection, rate-limiting, load-balancing, etc.
  • Management services: software required to operate the data center, that is not in the direct request path of customer traffic.

Customer-facing services

The reliability model of our customer-facing services is to run them on all machines in each data center. This works well as it allows each data center’s capacity to scale dynamically by adding more machines.

Scaling is especially made easy thanks to our dynamic load balancing system, Unimog, which runs on each machine. Its role is to continuously re-balance traffic based on current resource usage and to check the health of services. This helps provide resiliency to individual machine failures and ensures resource usage is close to identical on all machines.

As an example, here is the CPU usage over a day in one of our data centers where each time series represents one machine and the different colors represent different generations of hardware. Unimog keeps all machines processing traffic and at roughly the same CPU utilization.

Management services

Some of our larger data centers have a substantial number of machines, but sometimes we need to reliably run just a single or a few instances of a management service in each location.

There are currently a couple of options to do this, each have their own pros and cons:

  1. Deploying the service to all machines in the data center:
    • Pro: it ensures the service’s reliability
    • Con: it unnecessarily uses resources which could have been used to serve customer traffic and is not cost-effective
  2. Deploying the service to a static handful of machines in each data center:
    • Pro: it is less wasteful of resources and more cost-effective
    • Con: it runs the risk of service unavailability when those handful of machines unexpectedly fail

A third, more viable option, is to use dynamic task scheduling so that only the right amount of resources are used while ensuring reliability.

A need for more dynamic task scheduling

Having to pick between two suboptimal reliability model options for management services we want running in each data center was not ideal.

Indeed, some of those services, even though they are not in the request path, are required to continue operating the data center. If the machines running those services become unavailable, in some cases we have to temporarily disable the data center while recovering them. Doing so automatically re-routes users to the next available data center and doesn’t cause disruption. In fact, the entire Cloudflare network is designed to operate with data centers being disabled and brought back automatically. But it’s optimal to route end users to a data center near them so we want to minimize any data center level downtime.

This led us to realize we needed a system to ensure a certain number of instances of a service were running in each data center, regardless of which physical machine ends up running it.

Customer-facing services run on all machines in each data center and do not need to be onboarded to that new system. On the other hand, services currently running on a fixed subset of machines with sub-optimal reliability guarantees and services which don’t need to run on all machines are good candidates for onboarding.

Our pick: HashiCorp Nomad

Armed with our set of requirements, we conducted some research on candidate solutions.

While Kubernetes was another option, we decided to use HashiCorp’s Nomad for the following reasons:

  • Satisfies our initial requirement, which was reliably running a single instance of a binary with resource isolation in each data center.
  • Has few dependencies and a straightforward integration with Consul. Consul is another piece of HashiCorp software we had already deployed in each datacenter. It provides distributed key-value storage and service discovery capabilities.
  • Is lightweight (single Go binary), easy to deploy and provision new clusters which is a plus when deploying as many clusters as we have data centers.
  • Has a modular task driver (part responsible for executing tasks and providing resource isolation) architecture to support not only containers but also binaries and any custom task driver.
  • Is open source and written in Go. We have Go language experience within the team, and Nomad has a responsive community of maintainers on GitHub.

Deployment architecture

Nomad is split in two different pieces:

  1. Nomad Server: instances forming the cluster responsible for scheduling, five per data center to provide sufficient failure tolerance
  2. Nomad Client: instances executing the actual tasks, running on all machines in every data center

To guarantee Nomad Server cluster reliability, we deployed instances on machines which are part of different failure domains:

  • In different inter-connected physical data centers forming a single location
  • In different racks, connected to different switches
  • In different multi-node chassis (most of our edge hardware comes in the form of multi-node chassis, one chassis contains four individual servers)

We also added logic to our configuration management tool to ensure we always keep a consistent number of Nomad Server instances regardless of the expansions and decommissions of servers happening on a close to daily basis.

The logic is rather simple, as server expansions and decommissions happen, the Nomad Server role gets redistributed to a new list of machines. Our configuration management tool then ensures that Nomad Server runs on the new machines before turning it off on the old ones.

Additionally, because server expansions and decommissions affect a subset of racks at a time and the Nomad Server role assignment logic provides rack-diversity guarantees, the cluster stays healthy as quorum is kept at all times.

Job files

Nomad job files are templated and checked into a git repository. Our configuration management tool then ensures the jobs are scheduled in every data center. From there, Nomad takes over and ensures the jobs are running at all times in each data center.

By exposing rack metadata to each Nomad Client, we are able to make sure each instance of a particular service runs in a different rack and is tied to a different failure domain. This way we make sure that the failure of one rack of servers won’t impact the service health as the service is also running in a different rack, unaffected by the failure.

We achieve this with the following job file constraint:

constraint {
  attribute = "${meta.rack}"
  operator  = "distinct_property"
}

Service discovery

We leveraged Nomad integration with Consul to get Nomad jobs dynamically added to the Consul Service Catalog. This allows us to discover where a particular service is currently running in each data center by querying Consul. Additionally, with the Consul DNS Interface enabled, we can also use DNS-based lookups to target services running on Nomad.

Observability

To be able to properly operate as many Nomad clusters as we have data centers, good observability on Nomad clusters and services running on those clusters was essential.

We use Prometheus to scrape Nomad Server and Client instances running in each data center and Alertmanager to alert on key metrics. Using Prometheus metrics, we built a Grafana dashboard to provide visibility on each cluster.

We set up our Prometheus instances to discover services running on Nomad by querying the Consul Service Directory and scraping their metrics periodically using the following Prometheus configuration:

- consul_sd_configs:
  - server: localhost:8500
  job_name: management_service_via_consul
  relabel_configs:
  - action: keep
    regex: management-service
    source_labels:
    - __meta_consul_service

We then use those metrics to create Grafana dashboards and set up alerts for services running on Nomad.

To restrict access to Nomad API endpoints, we enabled mutual TLS authentication and are generating client certificates for each entity interacting with Nomad. This way, only entities with a valid client certificate can interact with Nomad API endpoints in order to schedule jobs or perform any CLI operation.

Challenges

Deploying a new component always comes with its set of challenges; here is a list of a few hurdles we have had to overcome along the way.

Initramfs rootfs and pivot_root

When starting to use the exec driver to run binaries isolated in a chroot environment, we noticed our stateless root partition running on initramfs was not supported as the task would not start and we got this error message in our logs:

Feb 12 19:49:03 machine nomad-client[258433]: 2020-02-12T19:49:03.332Z [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=fa202-63b-33f-924-42cbd5 task=server error="failed to launch command with executor: rpc error: code = Unknown desc = container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"pivot_root invalid argument\\\"\"""

We filed a GitHub issue and submitted a workaround pull request which was promptly reviewed and merged upstream.

In parallel, for maximum isolation security, we worked on enabling pivot_root in our setup by modifying our boot process and other team members developed and proposed a patch to the kernel mailing list to make it easier in the future.

Resource usage containment

One very important aspect was to make sure the resource usage of tasks running on Nomad would not disrupt other services colocated on the same machine.

Disk space is a shared resource on every machine and being able to set a quota for Nomad was a must. We achieved this by isolating the Nomad data directory to a dedicated fixed-size mount point on each machine. Limiting disk bandwidth and IOPS, however, is not currently supported out of the box by Nomad.

Nomad job files have a resources section where memory and CPU usage can be limited (memory is in MB, cpu is in MHz):

resources {
  memory = 2000
  cpu = 500
}

This uses cgroups under the hood and our testing showed that while memory limits are enforced as one would expect, the CPU limits are soft limits and not enforced as long as there is available CPU on the host machine.

Workload (un)predictability

As mentioned above, all machines currently run the same customer-facing workload. Scheduling individual jobs dynamically with Nomad to run on single machines challenges that assumption.

While our dynamic load balancing system, Unimog, balances requests based on resource usage to ensure it is close to identical on all machines, batch type jobs with spiky resource usage can pose a challenge.

We will be paying attention to this as we onboard more services and:

  • attempt to limit resource usage spikiness of Nomad jobs with constraints aforementioned
  • ensure Unimog adjusts to this batch type workload and does not end up in a positive feedback loop

What we are running on Nomad

Now Nomad has been deployed in every data center, we are able to improve the reliability of management services essential to operations by gradually onboarding them. We took a first step by onboarding our reboot and maintenance management service.

Reboot and maintenance management service

In each data center, we run a service which facilitates online unattended rolling reboots and maintenance of machines. This service used to run on a single well-known machine in each data center. This made it vulnerable to single machine failures and when down prevented machines from enabling automatically after a reboot. Therefore, it was a great first service to be onboarded to Nomad to improve its reliability.

We now have a guarantee this service is always running in each data center regardless of individual machine failures. Instead of other machines relying on a well-known address to target this service, they now query Consul DNS and dynamically figure out where the service is running to interact with it.

This is a big improvement in terms of reliability for this service, therefore many more management services are expected to follow in the upcoming months and we are very excited for this to happen.