
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Fri, 03 Apr 2026 17:07:33 GMT</lastBuildDate>
        <item>
            <title><![CDATA[A one-line Kubernetes fix that saved 600 hours a year]]></title>
            <link>https://blog.cloudflare.com/one-line-kubernetes-fix-saved-600-hours-a-year/</link>
            <pubDate>Thu, 26 Mar 2026 13:00:00 GMT</pubDate>
            <description><![CDATA[ When we investigated why our Atlantis instance took 30 minutes to restart, we discovered a bottleneck in how Kubernetes handles volume permissions. By adjusting the fsGroupChangePolicy, we reduced restart times to 30 seconds. ]]></description>
            <content:encoded><![CDATA[ <p>Every time we restarted Atlantis, the tool we use to plan and apply Terraform changes, we’d be stuck for 30 minutes waiting for it to come back up. No plans, no applies, no infrastructure changes for any repository managed by Atlantis. With roughly 100 restarts a month for credential rotations and onboarding, that added up to over <b>50 hours of blocked engineering time every month</b>, and paged the on-call engineer every time.</p><p>This was ultimately caused by a safe default in Kubernetes that had silently become a bottleneck as the persistent volume used by Atlantis grew to millions of files. Here’s how we tracked it down and fixed it with a one-line change.</p>
    <div>
      <h3>Mysteriously slow restarts</h3>
      <a href="#mysteriously-slow-restarts">
        
      </a>
    </div>
    <p>We manage dozens of Terraform projects with GitLab merge requests (MRs) using <a href="https://www.runatlantis.io/"><u>Atlantis</u></a>, which handles planning and applying. It enforces locking to ensure that only one MR can modify a project at a time. </p><p>It runs on Kubernetes as a singleton StatefulSet and relies on a Kubernetes PersistentVolume (PV) to keep track of repository state on disk. Whenever a Terraform project needs to be onboarded or offboarded, or credentials used by Terraform are updated, we have to restart Atlantis to pick up those changes — a process that can take 30 minutes.</p><p>The slow restart became apparent when we recently ran out of inodes on the persistent storage used by Atlantis, forcing us to restart it to resize the volume. Inodes are consumed by each file and directory entry on disk, and the number available to a filesystem is determined by parameters passed when creating it. The Ceph persistent storage implementation provided by our Kubernetes platform does not expose a way to pass flags to <code>mkfs</code>, so we’re at the mercy of default values: growing the filesystem is the only way to grow available inodes, and resizing a PV requires a pod restart. </p><p>We talked about extending the alert window, but that would just mask the problem and delay our response to actual issues. Instead, we decided to investigate exactly why it was taking so long.</p>
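    <p>To make the inode pressure concrete: you can check how close a filesystem is to running out with <code>df -i</code>. The mount path and numbers below are illustrative, not our actual layout:</p>
    <pre><code>$ df -i /atlantis-data
Filesystem      Inodes   IUsed IFree IUse% Mounted on
/dev/rbd0      3276800 3276800     0  100% /atlantis-data</code></pre>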
    <div>
      <h3>Bad behavior</h3>
      <a href="#bad-behavior">
        
      </a>
    </div>
    <p>When we were asked to do a rolling restart of Atlantis to pick up a change to the secrets it uses, we would run <code>kubectl rollout restart statefulset atlantis</code>, which would gracefully terminate the existing Atlantis pod before spinning up a new one. The new pod would appear almost immediately, but looking at it would show:</p>
            <pre><code>$ kubectl get pod atlantis-0
NAME         READY   STATUS     RESTARTS   AGE
atlantis-0   0/1     Init:0/1   0          30m
</code></pre>
            <p>...so what gives? Naturally, the first thing to check would be events for that pod. It's waiting around for an init container to run, so maybe the pod events would illuminate why?</p>
            <pre><code>$ kubectl events --for=pod/atlantis-0
LAST SEEN   TYPE     REASON      OBJECT           MESSAGE
30m         Normal   Killing     Pod/atlantis-0   Stopping container atlantis-server
30m         Normal   Scheduled   Pod/atlantis-0   Successfully assigned atlantis/atlantis-0 to 36com1167.cfops.net
22s         Normal   Pulling     Pod/atlantis-0   Pulling image "oci.example.com/git-sync/master:v4.1.0"
22s         Normal   Pulled      Pod/atlantis-0   Successfully pulled image "oci.example.com/git-sync/master:v4.1.0" in 632ms (632ms including waiting). Image size: 58518579 bytes.</code></pre>
            <p>That looks almost normal... but what's taking so long between scheduling the pod and actually starting to pull the image for the init container? Unfortunately, that was all the data we had to go on from Kubernetes itself. But surely there <i>had</i> to be something more that could tell us why it was taking so long to actually start running the pod.</p>
    <div>
      <h3>Going deeper</h3>
      <a href="#going-deeper">
        
      </a>
    </div>
    <p>In Kubernetes, a component called <code>kubelet</code> that runs on each node is responsible for coordinating pod creation, mounting persistent volumes, and many other things. From my time on our Kubernetes team, I know that <code>kubelet</code> runs as a systemd service and so its logs should be available to us in Kibana. Since the pod has been scheduled, we know the host name we're interested in, and the log messages from <code>kubelet</code> include the associated object, so we could filter for <code>atlantis</code> to narrow down the log messages to anything we found interesting.</p><p>We were able to observe the Atlantis PV being mounted shortly after the pod was scheduled. We also observed all the secret volumes mount without issue. However, there was still a big unexplained gap in the logs. We saw:</p>
            <pre><code>[operation_generator.go:664] "MountVolume.MountDevice succeeded for volume \"pvc-94b75052-8d70-4c67-993a-9238613f3b99\" (UniqueName: \"kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com^0001-000e-rook-ceph-nvme-0000000000000002-a6163184-670f-422b-a135-a1246dba4695\") pod \"atlantis-0\" (UID: \"83089f13-2d9b-46ed-a4d3-cba885f9f48a\") device mount path \"/state/var/lib/kubelet/plugins/kubernetes.io/csi/rook-ceph-nvme.rbd.csi.ceph.com/d42dcb508f87fa241a49c4f589c03d80de2f720a87e36932aedc4c07840e2dfc/globalmount\"" pod="atlantis/atlantis-0"
[pod_workers.go:1298] "Error syncing pod, skipping" err="unmounted volumes=[atlantis-storage], unattached volumes=[], failed to process volumes=[]: context deadline exceeded" pod="atlantis/atlantis-0" podUID="83089f13-2d9b-46ed-a4d3-cba885f9f48a"
[util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="atlantis/atlantis-0"</code></pre>
            <p>The last two messages looped several times until eventually we observed the pod actually start up properly.</p><p>So <code>kubelet</code> thinks that the pod is otherwise ready to go, but it's not starting it and something's timing out.</p>
    <div>
      <h3>The missing piece</h3>
      <a href="#the-missing-piece">
        
      </a>
    </div>
    <p>The lowest-level logs we had on the pod didn't show us what's going on. What else do we have to look at? Well, the last message before it hangs is the PV being mounted onto the node. Ordinarily, if the PV has issues mounting (e.g. due to still being stuck mounted on another node), that will bubble up as an event. But something's still going on here, and the only thing we have left to drill down on is the PV itself. So I plug that into Kibana, since the PV name is unique enough to make a good search term... and immediately something jumps out:</p>
            <pre><code>[volume_linux.go:49] Setting volume ownership for /state/var/lib/kubelet/pods/83089f13-2d9b-46ed-a4d3-cba885f9f48a/volumes/kubernetes.io~csi/pvc-94b75052-8d70-4c67-993a-9238613f3b99/mount and fsGroup set. If the volume has a lot of files then setting volume ownership could be slow, see https://github.com/kubernetes/kubernetes/issues/69699</code></pre>
            <p>Remember how I said at the beginning we'd just run out of inodes? In other words, we have a <i>lot</i> of files on this PV. When the PV is mounted, <code>kubelet</code> is running <code>chgrp -R</code> to recursively change the group on every file and folder across this filesystem. No wonder it was taking so long — that's a ton of entries to traverse even on fast flash storage!</p><p>The pod's <code>spec.securityContext</code> included <code>fsGroup: 1</code>, which ensures that processes running under GID 1 can access files on the volume. Atlantis runs as a non-root user, so without this setting it wouldn’t have permission to read or write to the PV. The way Kubernetes enforces this is by recursively updating ownership on the entire PV <i>every time it's mounted</i>.</p>
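    <p>For reference, the relevant fragment of the pod spec looked roughly like this (a sketch, not our full manifest):</p>
    <pre><code>spec:
  securityContext:
    runAsNonRoot: true
    fsGroup: 1</code></pre>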
    <div>
      <h3>The fix</h3>
      <a href="#the-fix">
        
      </a>
    </div>
    <p>Fixing this was heroically...boring. Since version 1.20, Kubernetes has supported an additional field on <code>pod.spec.securityContext</code> called <code>fsGroupChangePolicy</code>. This field defaults to <code>Always</code>, which leads to the exact behavior we saw here. It has another option, <code>OnRootMismatch</code>, which only changes permissions when the root directory of the PV doesn't already have the expected ownership. The trade-off is that if something writes files to the PV with the wrong group, kubelet will no longer fix them on mount, so don't set <code>fsGroupChangePolicy: OnRootMismatch</code> unless you know exactly how files are created on your PV. We checked to make sure that nothing should be changing the group on anything in the PV, and then set that field: </p>
            <pre><code>spec:
  template:
    spec:
      securityContext:
        fsGroupChangePolicy: OnRootMismatch</code></pre>
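            <p>One way to try this out before committing it to your manifests is a <code>kubectl patch</code>; the statefulset name here is from our example:</p>
            <pre><code>kubectl patch statefulset atlantis --type merge \
  -p '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'</code></pre>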
            <p>Now, it takes about 30 seconds to restart Atlantis, down from the 30 minutes it was when we started.</p><p>Default Kubernetes settings are sensible for small volumes, but they can become bottlenecks as data grows. For us, this one-line change to <code>fsGroupChangePolicy</code> reclaimed nearly 50 hours of blocked engineering time per month. This was time our teams had been spending waiting for infrastructure changes to go through, and time that our on-call engineers had been spending responding to false alarms. That’s roughly 600 hours a year returned to productive work, from a fix that took longer to diagnose than deploy.</p><p>Safe defaults in Kubernetes are designed for small, simple workloads. But as you scale, they can slowly become bottlenecks. If you’re running workloads with large persistent volumes, it’s worth checking whether recursive permission changes like this are silently eating your restart time. Audit your <code>securityContext</code> settings, especially <code>fsGroup</code> and <code>fsGroupChangePolicy</code>. <code>OnRootMismatch</code> has been available since v1.20.</p><p>Not every fix is heroic or complex, and it’s usually worth asking “why does the system behave this way?”</p><p>If debugging infrastructure problems at scale sounds interesting, <a href="https://cloudflare.com/careers"><u>we’re hiring</u></a>. Come join us on the <a href="https://community.cloudflare.com/"><u>Cloudflare Community</u></a> or our <a href="https://discord.cloudflare.com/"><u>Discord</u></a> to talk shop.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Terraform]]></category>
            <category><![CDATA[Platform Engineering]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[SRE]]></category>
            <guid isPermaLink="false">6bSk27AUeu3Ja7pTySyy0t</guid>
            <dc:creator>Braxton Schafer</dc:creator>
        </item>
        <item>
            <title><![CDATA[Leveraging Kubernetes virtual machines at Cloudflare with KubeVirt]]></title>
            <link>https://blog.cloudflare.com/leveraging-kubernetes-virtual-machines-with-kubevirt/</link>
            <pubDate>Tue, 08 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ The Kubernetes team runs several multi-tenant clusters across Cloudflare’s core data centers. When multi-tenant cluster isolation is too limiting for an application, we use KubeVirt. KubeVirt is a cloud-native solution that enables our developers to run virtual machines alongside containers. ]]></description>
            <content:encoded><![CDATA[ <p>Cloudflare runs several <a href="https://kubernetes.io/docs/concepts/security/multi-tenancy/"><u>multi-tenant</u></a> <a href="https://kubernetes.io/"><u>Kubernetes</u></a> clusters across our core data centers. These general-purpose clusters run on bare metal and power our <a href="https://www.cloudflare.com/learning/network-layer/what-is-the-control-plane/"><u>control plane</u></a>, analytics, and various engineering tools such as build infrastructure and continuous integration.</p><p>Kubernetes is a container orchestration platform: it enables software engineers to deploy containerized applications to a cluster of machines, so teams can build highly-available software on a scalable and resilient platform.</p><p>In this blog post we discuss our Kubernetes architecture, why we needed virtualization, and how we’re using it today.</p>
    <div>
      <h2>Multi-tenant clusters</h2>
      <a href="#multi-tenant-clusters">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/learning/cloud/what-is-multitenancy/"><u>Multi-tenancy</u></a> is a concept where one system can share its resources among a wide range of customers. This model allows us to build and manage a small number of general purpose Kubernetes clusters for our internal application teams. Keeping the number of clusters small reduces our operational toil. This model shrinks costs and increases computational efficiency by sharing hardware. Multi-tenancy also allows us to scale more efficiently. Scaling is done at either a cluster or application level. Cluster operators scale the platform by adding more hardware. Teams scale their applications by updating their Kubernetes manifests. They can scale <a href="https://en.wikipedia.org/wiki/Scalability#Vertical_or_scale_up"><u>vertically</u></a> by increasing their <a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"><u>resource</u></a> requests or <a href="https://en.wikipedia.org/wiki/Scalability#Horizontal_or_scale_out"><u>horizontally</u></a> by increasing the number of replicas.</p><p>All of our Kubernetes clusters are multi-tenant with various components enabled for a secure and resilient platform.</p><p><a href="https://kubernetes.io/docs/concepts/workloads/pods/"><u>Pods</u></a> are secured using the latest standards recommended by the Kubernetes project. We use <a href="https://kubernetes.io/docs/concepts/security/pod-security-admission/"><u>Pod Security Admission</u></a> (PSA) and <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/"><u>Pod Security Standards</u></a> to ensure all workloads are following best practices. 
By default, all namespaces use the most <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/#restricted"><u>restrictive</u></a> profile, and only a few Kubernetes control plane namespaces are granted <a href="https://kubernetes.io/docs/concepts/security/pod-security-standards/#privileged"><u>privileged</u></a> access. For additional policies not covered by PSA, we built custom <a href="https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#validatingadmissionwebhook"><u>Validating Webhooks</u></a> on top of the <a href="https://github.com/kubernetes-sigs/controller-runtime/tree/main/pkg/webhook/admission"><u>controller-runtime</u></a> framework. PSA and our custom policies ensure clusters are secure and workloads are isolated.</p>
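    <p>Pod Security Admission is driven by namespace labels; a namespace opting into the restricted profile looks like this (the namespace name is a placeholder):</p>
    <pre><code>apiVersion: v1
kind: Namespace
metadata:
  name: example-team
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted</code></pre>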
    <div>
      <h2>Our need for virtualization</h2>
      <a href="#our-need-for-virtualization">
        
      </a>
    </div>
    <p>A select number of teams needed tight integration with the Linux kernel. Examples include Docker daemons for build infrastructure and the ability to simulate servers running the software and configuration of our <a href="https://www.cloudflare.com/network/"><u>global network</u></a>. With our pod security requirements, these workloads are not permitted to interface with the host kernel at a deep level (e.g. no <a href="https://en.wikipedia.org/wiki/Iptables"><u>iptables</u></a> or <a href="https://en.wikipedia.org/wiki/Sysctl"><u>sysctls</u></a>). Doing so may disrupt other tenants sharing the node and open additional <a href="https://www.cloudflare.com/learning/security/glossary/attack-vector/"><u>attack vectors</u></a> if an application was compromised. A virtualization platform would enable these workloads to interact with their own kernel within a secured Kubernetes cluster.</p><p>We considered various virtualization solutions. Running a separate virtualization platform outside of Kubernetes would have worked, but would not tightly integrate containerized workloads with virtual machines. It would also be an additional operational burden on our team, as backups, alerting, and fleet management would have to exist for both our Kubernetes and virtual machine clusters.</p><p>We then looked for solutions that run virtual machines within Kubernetes. Teams could already manually deploy <a href="https://www.qemu.org/"><u>QEMU</u></a> pods, but this was not an elegant solution. We needed a better way. There were several other options, but <a href="https://kubevirt.io/"><u>KubeVirt</u></a> was the tool that met the majority of our requirements. Other solutions required a privileged container to run a virtual machine, but KubeVirt did not – this was a crucial requirement in our goal of creating a more secure multi-tenant cluster. 
KubeVirt also uses a feature of the Kubernetes API called <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/"><u>Custom Resource Definitions</u></a> (CRDs), which extends the Kubernetes API with new objects, increasing the flexibility of Kubernetes beyond its built-in types. For KubeVirt, this includes objects such as VirtualMachine and VirtualMachineInstanceReplicaSet. We felt the use of CRDs would allow KubeVirt to grow as more features were added.</p>
    <div>
      <h2>What is KubeVirt?</h2>
      <a href="#what-is-kubevirt">
        
      </a>
    </div>
    <p>KubeVirt is a virtualization platform that enables users to run virtual machines within Kubernetes. With KubeVirt, <a href="https://www.cloudflare.com/learning/cloud/what-is-a-virtual-machine/"><u>virtual machines</u></a> run alongside containerized workloads on the same platform. Kubernetes primitives such as <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/"><u>network policies</u></a>, <a href="https://kubernetes.io/docs/concepts/configuration/configmap/"><u>configmaps</u></a>, and <a href="https://kubernetes.io/docs/concepts/services-networking/service/"><u>services</u></a> all integrate with virtual machines. KubeVirt scales with our needs and is successfully running hundreds of virtual machines across several clusters. We frequently <a href="https://blog.cloudflare.com/automatic-remediation-of-kubernetes-nodes"><u>remediate Kubernetes nodes</u></a>, so virtual machines and pods are always exercising their startup/shutdown processes.</p>
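    <p>To give a flavor of the API, here is a minimal VirtualMachine object; the name, image, and sizing are illustrative rather than taken from our clusters:</p>
    <pre><code>apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        - name: rootdisk
          containerDisk:
            image: quay.io/containerdisks/debian:12</code></pre>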
    <div>
      <h2>How Cloudflare uses KubeVirt</h2>
      <a href="#how-cloudflare-uses-kubevirt">
        
      </a>
    </div>
    <p>There are a number of internal projects leveraging virtual machines at Cloudflare. We’ll touch on a few of our more popular use cases:</p><ol><li><p>Kubernetes scalability testing</p></li><li><p>Development environments</p></li><li><p>Kernel and iPXE testing</p></li><li><p>Build pipelines</p></li></ol>
    <div>
      <h3>
Kubernetes scalability testing</h3>
      <a href="#kubernetes-scalability-testing">
        
      </a>
    </div>
    
    <div>
      <h4>Setup process</h4>
      <a href="#setup-process">
        
      </a>
    </div>
    <p>Our staging clusters are much smaller than our largest production clusters. They also run on bare metal and mirror the configuration we have for each production cluster. This is extremely useful when rolling out new software, operating systems, or kernel changes; however, they miss bugs that only surface at scale. We use KubeVirt to bridge this gap and virtualize Kubernetes clusters with hundreds of nodes and thousands of pods.</p><p>The setup process for virtualized clusters differs from our bare metal provisioning steps. For bare metal, we use <a href="https://saltproject.io/"><u>Salt</u></a> to provision clusters from start to finish. For our virtualized clusters we use <a href="https://www.ansible.com/"><u>Ansible</u></a> and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/"><u>kubeadm</u></a>.  Our bare metal staging clusters are responsible for testing and validating our Salt configuration. The virtualized clusters give us a vanilla Kubernetes environment without any Cloudflare customizations. Having a stock environment in addition to our Salt environment helps us isolate bugs down to a Kubernetes change, a kernel change, or a Cloudflare-specific configuration change.</p><p>Our virtualized clusters consist of a KubeVirt <a href="https://kubevirt.io/api-reference/v1.2.2/definitions.html#_v1_virtualmachine"><u>VirtualMachine</u></a> object per node. We create three control-plane nodes and any number of worker nodes. Each virtual machine starts out as a vanilla Debian generic <a href="https://cdimage.debian.org/images/cloud/"><u>cloud image</u></a>. 
Using KubeVirt’s <a href="https://kubevirt.io/user-guide/user_workloads/startup_scripts/#cloud-init"><u>cloud-init support</u></a>, the virtual machine downloads an internal <a href="https://www.ansible.com/"><u>Ansible</u></a> <a href="https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_intro.html"><u>playbook</u></a> which installs a recent kernel, <a href="https://cri-o.io/"><u>cri-o</u></a> (the container runtime we use), and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/"><u>kubeadm</u></a>.</p>
            <pre><code>- name: Add the Kubernetes gpg key
  apt_key:
    url: https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    state: present

- name: Add the Kubernetes repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/{{ kube_version }}/deb/ /" | tee /etc/apt/sources.list.d/kubernetes.list

- name: Add the CRI-O gpg key
  apt_key:
    url: https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/Release.key
    keyring: /etc/apt/keyrings/cri-o-apt-keyring.gpg
    state: present

- name: Add the CRI-O repository
  shell: echo "deb [signed-by=/etc/apt/keyrings/cri-o-apt-keyring.gpg] https://pkgs.k8s.io/addons:/cri-o:/{{ crio_version }}/deb/ /" | tee /etc/apt/sources.list.d/cri-o.list

- name: Install CRI-O and Kubernetes packages
  apt:
    name:
      - cri-o
      - kubelet
      - kubeadm
      - kubectl
    update_cache: yes
    state: present

- name: Enable and start CRI-O service
  service:
    state: started
    enabled: yes
    name: crio.service</code></pre>
            <p><sup><i>Ansible playbook steps to download and install Kubernetes tooling</i></sup></p><p>Once each node has completed its individual playbook, we can <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-init/"><u>initialize</u></a> and <a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-join/"><u>join</u></a> nodes to the cluster using another playbook that runs kubeadm. From there the cluster can be accessed by logging into a control-plane node using kubectl.</p>
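            <p>The kubeadm flow itself is the standard one: initialize the first control-plane node, generate a join command, and run it on the remaining nodes. The endpoint and placeholders below are illustrative:</p>
            <pre><code># on the first control-plane node
kubeadm init --control-plane-endpoint "lb.example.com:6443" --upload-certs

# generate a join command for additional nodes
kubeadm token create --print-join-command

# on each worker node
kubeadm join lb.example.com:6443 --token &lt;token&gt; \
  --discovery-token-ca-cert-hash sha256:&lt;hash&gt;</code></pre>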
    <div>
      <h4>Simulating at scale</h4>
      <a href="#simulating-at-scale">
        
      </a>
    </div>
    <p>When tens or hundreds of nodes are lost at once, Kubernetes needs to act quickly to minimize downtime. The sooner it recognizes node failure, the faster it can reroute traffic to healthy pods.</p><p>Using Kubernetes in KubeVirt, we are able to simulate a large cluster undergoing a network cut and observe how Kubernetes reacts. The KubeVirt Kubernetes cluster allows us to rapidly iterate on configuration changes and code patches.</p><p>The following Ansible playbook task simulates a network segmentation failure where only the control-plane nodes remain online.</p>
            <pre><code>- name: Disable network interfaces on all workers
  command: ifconfig enp1s0 down
  async: 5
  poll: 0
  ignore_errors: yes
  when: inventory_hostname in groups['kube-node']</code></pre>
            <p><sup><i>An Ansible task that disables the network on all worker nodes simultaneously.</i></sup></p><p>This framework allows us to exercise the code in <a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager/"><u>controller-manager</u></a>, Kubernetes’s daemon that reconciles the fundamental state of the system (Nodes, Pods, etc). Our simulation platform helped us drastically shorten full traffic recovery time when a large number of Kubernetes nodes <a href="https://blog.cloudflare.com/major-data-center-power-failure-again-cloudflare-code-orange-tested"><u>become unreachable</u></a>. We upstreamed our <a href="https://github.com/kubernetes/kubernetes/pull/114296"><u>changes</u></a> to Kubernetes and more controller-manager speed improvements are coming soon.</p>
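            <p>How quickly Kubernetes reacts to a cut like this is governed largely by <code>kube-controller-manager</code> flags such as <code>--node-monitor-period</code> and <code>--node-monitor-grace-period</code>; the values below are illustrative:</p>
            <pre><code>kube-controller-manager \
  --node-monitor-period=5s \
  --node-monitor-grace-period=40s</code></pre>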
    <div>
      <h3>Development environments</h3>
      <a href="#development-environments">
        
      </a>
    </div>
    <p>Compiling code on your laptop can be slow. Perhaps you’re working on a patch for a large open-source project (e.g. <a href="https://v8.dev/"><u>V8</u></a> or <a href="https://clickhouse.com/"><u>Clickhouse</u></a>) or need more bandwidth to upload and download containers. With KubeVirt, we enable our developers to rapidly iterate on software development and testing on <a href="https://blog.cloudflare.com/cloudflare-gen-12-server-bigger-better-cooler-in-a-2u1n-form-factor"><u>powerful server hardware</u></a>. KubeVirt <a href="https://kubevirt.io/user-guide/storage/disks_and_volumes/#persistentvolumeclaim"><u>integrates</u></a> with Kubernetes <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/"><u>Persistent Volumes</u></a>, which enables teams to persist their development environment across restarts.</p><p>There are a number of teams at Cloudflare using KubeVirt for a variety of development and testing environments. Most notable is a project called Edge Test Fleet, which emulates a physical server and all the software that runs Cloudflare’s <a href="https://www.cloudflare.com/network/"><u>global network</u></a>. Teams can test their code and configuration changes against the entire software stack without reserving dedicated hardware. Cloudflare uses <a href="https://blog.cloudflare.com/tag/salt/"><u>Salt</u></a> to provision systems. It can be difficult to iterate and test Salt changes without a complete virtual environment. Edge Test Fleet makes iterating on Salt easier, ensuring states compile and render the right output. With Edge Test Fleet, new developers can better understand how Cloudflare’s global network works without touching staging or production.</p><p>Additionally, one Cloudflare team developed a framework that allows users to build and test changes to <a href="https://blog.cloudflare.com/log-analytics-using-clickhouse"><u>Clickhouse</u></a> using a <a href="https://code.visualstudio.com/"><u>VSCode</u></a> environment. 
This framework is generally applicable to all teams requiring a development environment. Once a template environment is provisioned, <a href="https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource/"><u>CSI Volume Cloning</u></a> can duplicate a golden volume, separating persistent environments for each developer.</p>
            <pre><code>apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: devspace-jcichra-rootfs
  namespace: dev-clickhouse-vms
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: rook-ceph-nvme
  dataSource:
    kind: PersistentVolumeClaim
    name: dev-rootfs
  resources:
    requests:
      storage: 500Gi</code></pre>
            <p><sup><i>A PersistentVolumeClaim that clones data from another volume using CSI Volume Cloning</i></sup></p>
    <div>
      <h3>Kernel and iPXE testing</h3>
      <a href="#kernel-and-ipxe-testing">
        
      </a>
    </div>
    <p>Unlike <a href="https://en.wikipedia.org/wiki/User_space_and_kernel_space"><u>user space</u></a> software development, when a kernel crashes, the entire system crashes. The <a href="https://blog.cloudflare.com/tag/kernel"><u>kernel</u></a> team uses KubeVirt for development. KubeVirt gives all kernel engineers, regardless of laptop OS or architecture, the same x86 environment and <a href="https://en.wikipedia.org/wiki/Hypervisor"><u>hypervisor</u></a>. Virtual machines on server hardware can be scaled up to more cores and memory than on laptops. The Cloudflare kernel team has also found low-level issues which only surface in environments with many CPUs.</p><p>To make testing fast and easy, the kernel team serves <a href="https://ipxe.org/"><u>iPXE</u></a> images via an <a href="https://nginx.org/"><u>nginx</u></a> Pod and Service adjacent to the virtual machine. A recent kernel and Debian image are copied to the nginx pod via kubectl cp. The iPXE file can then be referenced in the KubeVirt virtual machine definition via the DNS name for the Kubernetes Service.</p>
            <pre><code>interfaces:
  - name: default
    masquerade: {}
    model: e1000e
    ports:
      - port: 22
    dhcpOptions:
      bootFileName: http://httpboot.u-$K8S_USER.svc.cluster.local/boot.ipxe</code></pre>
            <p>When the virtual machine boots, it will get an IP address on the default interface behind <a href="https://en.wikipedia.org/wiki/Network_address_translation"><u>NAT</u></a> due to our <a href="https://kubevirt.io/user-guide/network/interfaces_and_networks/#masquerade"><u>masquerade</u></a> setting. Then it will download boot.ipxe, which describes what additional files should be downloaded to start the system. In this case, the kernel (<code>vmlinuz-amd64</code>), Debian (<code>baseimg-amd64.img</code>) and additional kernel modules (<code>modules-amd64.img</code>) are downloaded.</p>
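            <p>The <code>boot.ipxe</code> script itself can be quite small; a sketch consistent with the files named above (the kernel arguments are illustrative):</p>
            <pre><code>#!ipxe
kernel vmlinuz-amd64 console=ttyS0
initrd baseimg-amd64.img
initrd modules-amd64.img
boot</code></pre>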
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/74Pndk3FS6TVPACarSKD4N/fc7c3add6bae3c2c8b5e086ef9061872/image2.png" />
          </figure><p><sup><i>UEFI iPXE boot connecting and downloading files from nginx pod in user’s namespace</i></sup></p><p>Once the system is booted, a developer can log in to the system for testing:</p>
            <pre><code>linux login: root
Password: 
Linux linux 6.6.35-cloudflare-2024.6.7 #1 SMP PREEMPT_DYNAMIC Mon Sep 27 00:00:00 UTC 2010 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@linux:~# </code></pre>
            <p>Custom kernels can be copied to the nginx pod via <code>kubectl cp</code>. Restarting the virtual machine will load that new kernel for testing. When a kernel panic occurs, the virtual machine can quickly be restarted with <code>virtctl restart linux</code> and it will go through the iPXE boot process again.</p>
    <div>
      <h3>Build pipelines</h3>
      <a href="#build-pipelines">
        
      </a>
    </div>
    <p>We leverage KubeVirt to build the majority of software at Cloudflare. Virtual machines give build system users full control over their pipeline. For example, Debian packages can easily be installed and separate container daemons (such as <a href="https://www.docker.com/"><u>Docker</u></a>) can run all within a Kubernetes namespace using the <code>restricted</code> Pod Security Standard. KubeVirt’s <a href="https://kubevirt.io/user-guide/user_workloads/replicaset/"><u>VirtualMachineReplicaSet</u></a> concept allows us to quickly scale the number of build agents up and down to match demand. We can roll out different sets of virtual machines with varying sizes, kernels, and operating systems.</p><p>To scale efficiently, we leverage <a href="https://kubevirt.io/user-guide/storage/disks_and_volumes/#containerdisk"><u>container disks</u></a> to store our agent virtual machine images. Container disks allow us to store the virtual machine image (for example, a <a href="https://en.wikipedia.org/wiki/Qcow"><u>qcow</u></a> image) in our container registry. This strategy works well when the state in virtual machines is ephemeral. <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-command"><u>Liveness probes</u></a> detect an unhealthy or broken agent, shutting down the virtual machine and replacing it with a fresh instance. Other automation limits virtual machine uptime, capping it to 3–4 hours to keep build agents fresh.</p>
    <div>
      <h2>Next steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>We’re excited to expand our use of KubeVirt and unlock new capabilities for our internal users. KubeVirt’s Linux ARM64 support will allow us to build ARM64 packages in-cluster and simulate ARM64 systems.</p><p>Projects like <a href="https://kubevirt.io/user-guide/operations/containerized_data_importer/"><u>KubeVirt CDI</u></a> (Containerized Data Importer) will streamline our users’ virtual machine experience. Instead of users manually building container disks, we can provide a catalog of virtual machine images. It also allows us to copy virtual machine disks between namespaces.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>KubeVirt has proven to be a great tool for virtualization in our Kubernetes-first environment. We’ve unlocked the ability to support more workloads with our multi-tenant model. The KubeVirt platform allows us to offer a single compute platform supporting containers and virtual machines. Managing it has been simple, and upgrades have been straightforward and non-disruptive. We’re exploring additional features KubeVirt offers to improve the experience for our users.</p><p>Finally, our team is expanding! We’re looking for more people passionate about Kubernetes to <a href="https://boards.greenhouse.io/cloudflare/jobs/5579824"><u>join our team</u></a> and help us push Kubernetes to the next level.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Infrastructure]]></category>
            <guid isPermaLink="false">1149BgOuHn2l6ubvzlzHar</guid>
            <dc:creator>Justin Cichra</dc:creator>
        </item>
        <item>
            <title><![CDATA[Intelligent, automatic restarts for unhealthy Kafka consumers]]></title>
            <link>https://blog.cloudflare.com/intelligent-automatic-restarts-for-unhealthy-kafka-consumers/</link>
            <pubDate>Tue, 24 Jan 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWbGD5pEX9bKf2p58iqOw/b55ba4bfd305da7ed38cf66fe770585c/image3-8-2.png" />
            
            </figure><p>At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts.</p><p>We have learned a lot about keeping our applications that leverage Kafka healthy, so they can always be operational. Application health checks are notoriously hard to implement: what makes an application healthy? How can we keep services operational at all times?</p><p>Health checks can be implemented in many ways. We’ll talk about an approach that allowed us to considerably reduce incidents involving unhealthy applications while requiring less manual intervention.</p>
    <div>
      <h3>Kafka at Cloudflare</h3>
      <a href="#kafka-at-cloudflare">
        
      </a>
    </div>
    <p><a href="/using-apache-kafka-to-process-1-trillion-messages/">Cloudflare is a big adopter of Kafka</a>. We use Kafka as a way to decouple services due to its asynchronous nature and reliability. It allows different teams to work effectively without creating dependencies on one another. You can also read more about how other teams at Cloudflare use Kafka in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">this</a> post.</p><p>Kafka is used to send and receive messages. Messages represent some kind of event, like a credit card payment or details of a new user created in your platform. These messages can be represented in multiple ways: JSON, Protobuf, Avro and so on.</p><p>Kafka organises messages in topics. A topic is an ordered log of events in which each message is marked with a progressive offset. When an external system writes an event, it is appended to the end of that topic. These events are not deleted from the topic by default (retention can be applied).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KUYbqCCL74YZVU8NXOThl/4ec5024168993a2300add7221016af0d/1-4.png" />
            
            </figure><p>Topics are stored as log files on disk, which are finite in size. Partitions are a systematic way of breaking the one topic log file into many logs, each of which can be hosted on a separate server, enabling topics to scale.</p><p>Topics are managed by brokers–nodes in a Kafka cluster. These are responsible for writing new events to partitions, serving reads and replicating partitions among themselves.</p><p>Messages can be consumed by individual consumers or co-ordinated groups of consumers, known as consumer groups.</p><p>Consumers use a unique id (consumer id) that allows them to be identified by the broker as an application which is consuming from a specific topic.</p><p>Each topic can be read by an infinite number of different consumers, as long as they use a different id. Each consumer can replay the same messages as many times as they want.</p><p>When a consumer starts consuming from a topic, it will process all messages, starting from a selected offset, from each partition. With a consumer group, the partitions are divided amongst each consumer in the group. This division is determined by the consumer group leader. This leader will receive information about the other consumers in the group and will decide which consumers will receive messages from which partitions (partition strategy).</p>
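<p>As an illustration of the partition-strategy idea, the sketch below shows one simple way a group leader could divide partitions among consumers, round-robin. The function name is ours, and real clients implement richer strategies (range, sticky, cooperative):</p>

```go
package main

import "fmt"

// assignRoundRobin divides topic partitions among consumers:
// partition p goes to consumer p % N. Illustrative only.
func assignRoundRobin(partitions int, consumers []string) map[string][]int {
	assignments := make(map[string][]int)
	for p := 0; p < partitions; p++ {
		c := consumers[p%len(consumers)]
		assignments[c] = append(assignments[c], p)
	}
	return assignments
}

func main() {
	// 6 partitions split between 2 consumers in the group.
	fmt.Println(assignRoundRobin(6, []string{"consumer-a", "consumer-b"}))
	// map[consumer-a:[0 2 4] consumer-b:[1 3 5]]
}
```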
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Qe2Qe5nQ5gcHyhV0zpTWw/5182eea9de66164a36a28e92270fdb3f/2-3.png" />
            
            </figure><p>The offset of a consumer’s commit can demonstrate whether the consumer is working as expected. Committing a processed offset is the way a consumer and its consumer group report to the broker that they have processed a particular message.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29Y9mQiHkvGKUzc3RGF1sk/09d2987f53eef026c164e6c49cacc95c/unnamed-6.png" />
            
            </figure><p>A standard measurement of whether a consumer is processing fast enough is lag. We use this to measure how far behind the newest message we are. This tracks time elapsed between messages being written to and read from a topic. When a service is lagging behind, it means that the consumption is at a slower rate than new messages being produced.</p><p>Due to Cloudflare’s scale, message rates typically end up being very large and a lot of requests are time-sensitive so monitoring this is vital.</p><p>At Cloudflare, our applications using Kafka are deployed as microservices on Kubernetes.</p>
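<p>The lag described above can be sketched, for a single partition, as the difference between the newest offset and the last committed offset. The function below is illustrative, not a client API:</p>

```go
package main

import "fmt"

// consumerLag reports how many messages behind a consumer is on one
// partition: the newest offset in the partition minus the last offset
// the consumer committed. Negative values are clamped to zero.
func consumerLag(latestOffset, committedOffset int64) int64 {
	lag := latestOffset - committedOffset
	if lag < 0 {
		return 0
	}
	return lag
}

func main() {
	fmt.Println(consumerLag(1050, 1000)) // consumer is 50 messages behind
}
```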
    <div>
      <h3>Health checks for Kubernetes apps</h3>
      <a href="#health-checks-for-kubernetes-apps">
        
      </a>
    </div>
    <p>Kubernetes uses <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/">probes</a> to understand if a service is healthy and ready to receive traffic or to run. When a liveness probe fails and the bounds for retrying are exceeded, Kubernetes restarts the service.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FagbTygES9L7dmEQ6ratD/0a6f0d4c5ac117b723ad726a12d3936a/4-3.png" />
            
            </figure><p>When a readiness probe fails and the bounds for retrying are exceeded, Kubernetes stops sending HTTP traffic to the targeted pods. In the case of Kafka applications this is not relevant, as they don’t run an HTTP server. For this reason, we’ll cover only liveness checks.</p><p>A classic Kafka liveness check done on a consumer checks the status of the connection with the broker. It’s often best practice to keep these checks simple and perform some basic operations - in this case, something like listing topics. If, for any reason, this check fails consistently, for instance the broker returns a TLS error, Kubernetes terminates the service and starts a new pod of the same service, therefore forcing a new connection. Simple Kafka liveness checks do a good job of understanding when the connection with the broker is unhealthy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gNWb3Rit0MmTutsurm7sf/70355c422fab7ebce7d59d8c2c682d6d/5-2.png" />
            
            </figure>
    <div>
      <h3>Problems with Kafka health checks</h3>
      <a href="#problems-with-kafka-health-checks">
        
      </a>
    </div>
    <p>Due to Cloudflare’s scale, a lot of our Kafka topics are divided into multiple partitions (in some cases this can be hundreds!) and in many cases the replica count of our consuming service doesn’t necessarily match the number of partitions on the Kafka topic. This can mean that in a lot of scenarios this simple approach to health checking is not quite enough!</p><p>Microservices that consume from Kafka topics are healthy if they are consuming and committing offsets at regular intervals when messages are being published to a topic. When such services are not committing offsets as expected, it means that the consumer is in a bad state, and it will start accumulating lag. An approach we often take is to manually terminate and restart the service in Kubernetes, which forces a reconnection and a rebalance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/N4YalYdgNRxYJK7PVAlzY/26b55fc38c53855a6c28c71b25cdac02/lag.png" />
            
            </figure><p>When a consumer joins or leaves a consumer group, a rebalance is triggered and the consumer group leader must re-assign which consumers will read from which partitions.</p><p>When a rebalance happens, each consumer is notified to stop consuming. Some consumers might get their assigned partitions taken away and re-assigned to another consumer. We noticed that, within our library implementation, if the consumer doesn’t acknowledge this command, it waits indefinitely for new messages from a partition that it’s no longer assigned to, ultimately leading to a deadlock. Usually a manual restart of the faulty client-side app is needed to resume processing.</p>
    <div>
      <h3>Intelligent health checks</h3>
      <a href="#intelligent-health-checks">
        
      </a>
    </div>
    <p>As we were seeing consumers reporting as “healthy” but sitting idle, it occurred to us that maybe we were focusing on the wrong thing in our health checks. Just because the service is connected to the Kafka broker and can read from the topic, it does not mean the consumer is actively processing messages.</p><p>Therefore, we realised we should be focused on message ingestion, using the offset values to ensure that forward progress was being made.</p>
    <div>
      <h4>The PagerDuty approach</h4>
      <a href="#the-pagerduty-approach">
        
      </a>
    </div>
    <p>PagerDuty wrote an excellent <a href="https://www.pagerduty.com/eng/kafka-health-checks/">blog</a> on this topic which we used as inspiration when coming up with our approach.</p><p>Their approach used the current (latest) offset and the committed offset values. The current offset signifies the last message that was sent to the topic, while the committed offset is the last message that was processed by the consumer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fwem7NtBnO6M1RMhrezr8/af4cbbd7a63d3145f5c7fe9f405bd04d/pasted-image-0-4.png" />
            
            </figure><p>We check that the consumer is moving forwards by ensuring that the latest offset is changing (new messages are being received) and that the committed offset is changing as well (the new messages are being processed).</p><p>Therefore, the solution we came up with:</p><ul><li><p>If we cannot read the current offset, fail liveness probe.</p></li><li><p>If we cannot read the committed offset, fail liveness probe.</p></li><li><p>If the committed offset == the current offset, pass liveness probe.</p></li><li><p>If the value for the committed offset has not changed since the last run of the health check, fail liveness probe.</p></li></ul>
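<p>The rules above can be condensed into a small function. This is an illustrative Go sketch (using -1 to mean "offset could not be read"), not our production implementation:</p>

```go
package main

import "fmt"

// partitionHealthy applies the liveness rules in order for a single
// partition. An offset of -1 stands for "could not be read".
func partitionHealthy(current, committed, previousCommitted int64) bool {
	if current < 0 || committed < 0 {
		return false // can't read an offset: fail the probe
	}
	if committed == current {
		return true // fully caught up: pass, even if idle
	}
	if committed == previousCommitted {
		return false // behind and no progress since the last check: fail
	}
	return true // behind but committing new offsets: pass
}

func main() {
	fmt.Println(partitionHealthy(100, 100, 100)) // caught up: true
	fmt.Println(partitionHealthy(100, 80, 80))   // stalled: false
	fmt.Println(partitionHealthy(100, 80, 60))   // progressing: true
}
```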
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5r76n2Iew7pSqA8vYNZzIy/c9e0f6a113a34d0c36a216c054e4d840/pasted-image-0--1--3.png" />
            
            </figure><p>To measure if the committed offset is changing, we need to store the value from the previous run, which we do using an in-memory map keyed by partition number. This means each instance of our service only has a view of the partitions it is currently consuming from, and it runs the health check for each.</p>
    <div>
      <h4>Problems</h4>
      <a href="#problems">
        
      </a>
    </div>
    <p>When we first rolled out our smart health checks we started to notice cascading failures some time after release. After initial investigations we realised this was happening when a rebalance happened. It would initially affect one replica then quickly result in the others reporting as unhealthy.</p><p>Because we stored the previous value of the committed offset in-memory, when a rebalance happened a service could get re-assigned a different partition. When this happened, our service incorrectly assumed that the committed offset for that partition had not changed (as this specific replica was no longer updating the latest value), so it would start to report the service as unhealthy. The failing liveness probe would then cause it to restart, which would in turn trigger another rebalance in Kafka, causing other replicas to face the same issue.</p>
    <div>
      <h4>Solution</h4>
      <a href="#solution">
        
      </a>
    </div>
    <p>To fix this issue we needed to ensure that each replica only kept track of the offsets for the partitions it was consuming from at that moment. Luckily, the Shopify Sarama library, which we use internally, has functionality to observe when a rebalancing happens. This meant we could use it to rebuild the in-memory map of offsets so that it would only include the relevant partition values.</p><p>This is handled by receiving the signal from the session context channel:</p>
            <pre><code>for {
  select {
  case message, ok := &lt;-claim.Messages(): // &lt;-- Message received
     if !ok {
        // Messages channel closed, stop consuming
        return nil
     }

     // Store latest received offset in-memory
     offsetMap[message.Partition] = message.Offset

     // Handle message
     handleMessage(ctx, message)

     // Commit message offset
     session.MarkMessage(message, "")

  case &lt;-session.Context().Done(): // &lt;-- Rebalance happened
     // Remove rebalanced partition from in-memory map
     delete(offsetMap, claim.Partition())
  }
}</code></pre>
            <p>Verifying this solution was straightforward: we just needed to trigger a rebalance. To test this worked in all possible scenarios, we spun up a single replica of a service consuming from multiple partitions, then proceeded to scale up the number of replicas until it matched the partition count, then scaled back down to a single replica. By doing this we verified that the health checks could safely handle new partitions being assigned as well as partitions being taken away.</p>
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>Probes in Kubernetes are very easy to set up and can be a powerful tool to ensure your application is running as expected. Well implemented probes can often be the difference between engineers being called out to fix trivial issues (sometimes outside of working hours) and a service which is self-healing.</p><p>However, without proper thought, “dumb” health checks can also lead to a false sense of security that a service is running as expected even when it’s not. One thing we have learnt from this was to think more about the specific behaviour of the service and decide what being unhealthy means in each instance, instead of just ensuring that dependent services are connected.</p> ]]></content:encoded>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Observability]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">7s1ijlG7zMlxJPI6Hcs3zl</guid>
            <dc:creator>Chris Shepherd</dc:creator>
            <dc:creator>Andrea Medda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Kubectl with Cloudflare Zero Trust]]></title>
            <link>https://blog.cloudflare.com/kubectl-with-zero-trust/</link>
            <pubDate>Fri, 24 Jun 2022 14:08:51 GMT</pubDate>
            <description><![CDATA[ Using Cloudflare Zero Trust with Kubernetes to enable kubectl without SOCKS proxies ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6qFUylvIuAuqfStdkrEDvb/c289f67a78b01fc25c7ec2d910aa2f02/Proxyless-KubeCTL-Support.png" />
            
            </figure><p>Cloudflare is a heavy user of Kubernetes for engineering workloads: it's used to power the backend of our APIs, to handle batch-processing such as analytics aggregation and bot detection, and to run engineering tools such as our <a href="https://www.cloudflare.com/learning/serverless/glossary/what-is-ci-cd/">CI/CD pipelines</a>. But between load balancers, API servers, etcd, ingresses, and pods, the surface area exposed by Kubernetes can be rather large.</p><p>In this post, we share a little bit about how our engineering team dogfoods Cloudflare Zero Trust to secure Kubernetes — and enables kubectl without proxies.</p>
    <div>
      <h2>Our General Approach to Kubernetes Security</h2>
      <a href="#our-general-approach-to-kubernetes-security">
        
      </a>
    </div>
    <p>As part of our security measures, we heavily limit what can access our clusters over the network. Where a network service is exposed, we add additional protections, such as requiring Cloudflare Access authentication or Mutual TLS (or both) to access ingress resources.</p><p>These network restrictions include access to the cluster's API server. Without access to this, engineers at Cloudflare would not be able to use tools like kubectl to introspect their team's resources. While we believe Continuous Deployments and GitOps are best practices, allowing developers to use the Kubernetes API aids in troubleshooting and increasing developer velocity. Not having access would have been a deal breaker.</p><p>To satisfy our security requirements, we're using Cloudflare Zero Trust, and we wanted to share how we're using it, and the process that brought us here.</p>
    <div>
      <h3>Before Zero Trust</h3>
      <a href="#before-zero-trust">
        
      </a>
    </div>
    <p>In the world before <a href="https://www.cloudflare.com/learning/security/glossary/what-is-zero-trust/">Zero Trust</a>, engineers could access the Kubernetes API by connecting to a VPN appliance. While this is common across the industry, and it does allow access to the API, it also dropped engineers as clients into the internal network: they had much more network access than necessary.</p><p>We weren't happy with this situation, but it was the status quo for several years. At the beginning of 2020, we retired our VPN and thus the Kubernetes team needed to find another solution.</p>
    <div>
      <h3>Kubernetes with Cloudflare Tunnels</h3>
      <a href="#kubernetes-with-cloudflare-tunnels">
        
      </a>
    </div>
    <p>At the time we worked closely with the team developing Cloudflare Tunnels to add support for handling <a href="/releasing-kubectl-support-in-access/">kubectl connections using Access</a> and cloudflared tunnels.</p><p>While this worked for our engineering users, it was a significant hurdle to onboarding new employees. Each Kubernetes cluster required its own tunnel connection from the engineer's device, which made shuffling between clusters annoying. While kubectl supported connecting through SOCKS proxies, this support was not universal to all tools in the Kubernetes ecosystem.</p><p>We continued using this solution internally while we worked towards a better solution.</p>
    <div>
      <h2>Kubernetes with Zero Trust</h2>
      <a href="#kubernetes-with-zero-trust">
        
      </a>
    </div>
    <p>Since the launch of Cloudflare One, we've been dogfooding the Zero Trust agent in various configurations. At first we'd been using it to implement secure <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> with 1.1.1.1. As time went on, we began to use it to dogfood additional Zero Trust features.</p><p>We're now leveraging the private network routing in Cloudflare Zero Trust to allow engineers to access the Kubernetes APIs without needing to set up cloudflared tunnels or configure kubectl and other Kubernetes ecosystem tools to use tunnels. This isn't something specific to Cloudflare; you can do this for your team today!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3BYeJk98ZxUpcycaPOjfOi/be71df17cceed62dd8df98c6f39d5e9d/Kubectl-WIth-Warp-Diagram--2-.png" />
            
            </figure>
    <div>
      <h3>Configuring Zero Trust</h3>
      <a href="#configuring-zero-trust">
        
      </a>
    </div>
    <p>We use a <a href="https://github.com/cloudflare/terraform-provider-cloudflare">configuration management tool</a> for our Zero Trust configuration to enable infrastructure-as-code, which we've adapted below. However, the same configuration can be achieved using the Cloudflare Zero Trust dashboard.</p><p>The first thing we need to do is create a new tunnel. This tunnel will be used to connect the Cloudflare edge network to the Kubernetes API. We run the tunnel endpoints within Kubernetes, using configuration shown later in this post.</p>
            <pre><code>resource "cloudflare_argo_tunnel" "k8s_zero_trust_tunnel" {
  account_id = var.account_id
  name       = "k8s_zero_trust_tunnel"
  secret     = var.tunnel_secret
}</code></pre>
            <p>The "tunnel_secret" secret should be a 32-byte random number, which you should temporarily save, as we'll reuse it for the Kubernetes setup later.</p><p>After we've created the tunnel, we need to create the routes so the Cloudflare network knows what traffic to send through the tunnel.</p>
            <pre><code>resource "cloudflare_tunnel_route" "k8s_zero_trust_tunnel_ipv4" {
  account_id = var.account_id
  tunnel_id  = cloudflare_argo_tunnel.k8s_zero_trust_tunnel.id
  network    = "198.51.100.101/32"
  comment    = "Kubernetes API Server (IPv4)"
}
 
resource "cloudflare_tunnel_route" "k8s_zero_trust_tunnel_ipv6" {
  account_id = var.account_id
  tunnel_id  = cloudflare_argo_tunnel.k8s_zero_trust_tunnel.id
  network    = "2001:DB8::101/128"
  comment    = "Kubernetes API Server (IPv6)"
}</code></pre>
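<p>When the API server is reached by hostname, the route networks should correspond to its resolved addresses. The hypothetical helper below (our own naming, for illustration) converts a resolved address into the single-host CIDR form used in these routes:</p>

```go
package main

import (
	"fmt"
	"net"
)

// routeFor converts a resolved API server address into a single-host
// CIDR: /32 for IPv4, /128 for IPv6. Returns "" for invalid input.
func routeFor(addr string) string {
	ip := net.ParseIP(addr)
	if ip == nil {
		return ""
	}
	if ip.To4() != nil {
		return addr + "/32"
	}
	return addr + "/128"
}

func main() {
	// In practice, feed in net.LookupHost results for your API server
	// hostname; these are the documentation addresses used above.
	for _, a := range []string{"198.51.100.101", "2001:DB8::101"} {
		fmt.Println(routeFor(a))
	}
}
```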
            <p>We support accessing the Kubernetes API via both IPv4 and IPv6 addresses, so we configure routes for both. If you're connecting to your API server via a hostname, these IP addresses should match what is returned via a DNS lookup.</p><p>Next we'll configure settings for Cloudflare Gateway so that it's compatible with the API servers and clients.</p>
            <pre><code>resource "cloudflare_teams_list" "k8s_apiserver_ips" {
  account_id = var.account_id
  name       = "Kubernetes API IPs"
  type       = "IP"
  items      = ["198.51.100.101/32", "2001:DB8::101/128"]
}
 
resource "cloudflare_teams_rule" "k8s_apiserver_zero_trust_http" {
  account_id  = var.account_id
  name        = "Don't inspect Kubernetes API"
  description = "Allow connections from kubectl to API"
  precedence  = 10000
  action      = "off"
  enabled     = true
  filters     = ["http"]
  traffic     = format("any(http.conn.dst_ip[*] in $%s)", replace(cloudflare_teams_list.k8s_apiserver_ips.id, "-", ""))
}</code></pre>
            <p>As we use mutual TLS between clients and the API server, and not all the traffic between kubectl and the API servers is HTTP, we've disabled HTTP inspection for these connections.</p><p>You can pair these rules with additional Zero Trust rules, such as <a href="https://developers.cloudflare.com/cloudflare-one/identity/devices/">device attestation</a>, <a href="https://developers.cloudflare.com/cloudflare-one/policies/filtering/enforce-sessions/">session lifetimes</a>, and <a href="https://developers.cloudflare.com/cloudflare-one/policies/filtering/identity-selectors/">user and group</a> access policies to further customize your security.</p>
    <div>
      <h3>Deploying Tunnels</h3>
      <a href="#deploying-tunnels">
        
      </a>
    </div>
    <p>Once you have your tunnels created and configured, you can deploy their endpoints into your network. We've chosen to deploy the tunnels as pods, as this allows us to use Kubernetes's deployment strategies for rolling out upgrades and handling node failures.</p>
            <pre><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: tunnel-api-zt
  namespace: example
  labels:
    tunnel: api-zt
data:
  config.yaml: |
    tunnel: 8e343b13-a087-48ea-825f-9783931ff2a5
    credentials-file: /opt/zt/creds/creds.json
    metrics: 0.0.0.0:8081
    warp-routing:
        enabled: true</code></pre>
            <p>This creates a Kubernetes ConfigMap with a basic configuration that enables WARP routing for the tunnel ID specified. You can get this tunnel ID from your configuration management system, the Cloudflare Zero Trust dashboard, or by running the following command from another device logged into the same account.</p><p><code>cloudflared tunnel list</code></p><p>Next, we'll need to create a secret for our tunnel credentials. While you should use a secret management system, for simplicity we'll create one directly here.</p>
            <pre><code>jq -cn --arg accountTag $CF_ACCOUNT_TAG \
       --arg tunnelID $CF_TUNNEL_ID \
       --arg tunnelName $CF_TUNNEL_NAME \
       --arg tunnelSecret $CF_TUNNEL_SECRET \
   '{AccountTag: $accountTag, TunnelID: $tunnelID, TunnelName: $tunnelName, TunnelSecret: $tunnelSecret}' | \
kubectl create secret generic -n example tunnel-creds --from-file=creds.json=/dev/stdin</code></pre>
            <p>This creates a new secret "tunnel-creds" in the "example" namespace with the credentials file the tunnel expects.</p><p>Now we can deploy the tunnels themselves. We deploy multiple replicas to ensure some are always available, even while nodes are being drained.</p>
            <pre><code>apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    tunnel: api-zt
  name: tunnel-api-zt
  namespace: example
spec:
  replicas: 3
  selector:
    matchLabels:
      tunnel: api-zt
  strategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  template:
    metadata:
      labels:
        tunnel: api-zt
    spec:
      containers:
        - args:
            - tunnel
            - --config
            - /opt/zt/config/config.yaml
            - run
          env:
            - name: GOMAXPROCS
              value: "2"
            - name: TZ
              value: UTC
          image: cloudflare/cloudflared:2022.5.3
          livenessProbe:
            failureThreshold: 1
            httpGet:
              path: /ready
              port: 8081
            initialDelaySeconds: 10
            periodSeconds: 10
          name: tunnel
          ports:
            - containerPort: 8081
              name: http-metrics
          resources:
            limits:
              cpu: "1"
              memory: 100Mi
          volumeMounts:
            - mountPath: /opt/zt/config
              name: config
              readOnly: true
            - mountPath: /opt/zt/creds
              name: creds
              readOnly: true
      volumes:
        - secret:
            name: tunnel-creds
          name: creds
        - configMap:
            name: tunnel-api-zt
          name: config</code></pre>
            
    <div>
      <h2>Using Kubectl with Cloudflare Zero Trust</h2>
      <a href="#using-kubectl-with-cloudflare-zero-trust">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2BGV6YJLH0Hnt1vh19F2lC/07c8806cd7c65674ddac42f8d2923b93/Screen-Shot-2022-06-23-at-3.43.37-PM.png" />
            
            </figure><p>After deploying the Cloudflare Zero Trust agent, members of your team can now access the Kubernetes API without needing to set up any special SOCKS tunnels!</p>
            <pre><code>kubectl version --short
Client Version: v1.24.1
Server Version: v1.24.1</code></pre>
            
    <div>
      <h2>What's next?</h2>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>If you try this out, send us your feedback! We're continuing to improve Zero Trust for non-HTTP workflows.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare One Week]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">4VWDm5LM1jLa6cjopWrCRu</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Automatic Remediation of Kubernetes Nodes]]></title>
            <link>https://blog.cloudflare.com/automatic-remediation-of-kubernetes-nodes/</link>
            <pubDate>Thu, 15 Jul 2021 12:59:42 GMT</pubDate>
            <description><![CDATA[ In Cloudflare’s core data centers, we are using Kubernetes to run many of the diverse services that help us control Cloudflare’s edge. We are automating some aspects of node remediation to keep the Kubernetes clusters healthy. ]]></description>
            <content:encoded><![CDATA[ <p>We use <a href="https://kubernetes.io/">Kubernetes</a> to run many of the diverse services that help us control Cloudflare’s edge. We have five geographically diverse clusters, with hundreds of nodes in our largest cluster. These clusters are self-managed on bare-metal machines, which gives us a good amount of power and flexibility in the software and integrations with Kubernetes. However, it also means we don’t have a cloud provider to rely on for virtualizing or managing the nodes. This distinction becomes even more prominent when considering all the different reasons that nodes degrade. With self-managed bare-metal machines, the reasons a node can become unhealthy include:</p><ul><li><p>Hardware failures</p></li><li><p>Kernel-level software failures</p></li><li><p>Kubernetes cluster-level software failures</p></li><li><p>Degraded network communication</p></li><li><p>Required software updates</p></li><li><p>Resource exhaustion<sup>1</sup></p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/59IDEspjCRavbrtH1Rtu4r/a5f057aa79b2aae10fc0073b8f3b5573/image2-5.png" />
            
            </figure>
    <div>
      <h2>Unhappy Nodes</h2>
      <a href="#unhappy-nodes">
        
      </a>
    </div>
    <p>We have plenty of examples of failures in the aforementioned categories, but one example has been particularly tedious to deal with. It starts with the following log line from the kernel:</p>
            <pre><code>unregister_netdevice: waiting for lo to become free. Usage count = 1</code></pre>
            <p>The issue is further observed when the number of network interfaces on the node owned by the Container Network Interface (CNI) plugin grows out of proportion to the number of running pods:</p>
            <pre><code>$ ip link | grep cali | wc -l
1088</code></pre>
            <p>This is unexpected, as it shouldn't exceed the maximum number of pods allowed on a node (we use the default limit of 110). While this issue is interesting and perhaps worthy of a whole separate blog, the short of it is that the Linux network interfaces owned by the CNI are not getting cleaned up after a pod terminates.</p><p>Some history on this can be read in a <a href="https://github.com/moby/moby/issues/5618">Docker GitHub issue</a>. We found this seems to plague nodes with a longer uptime, and after rebooting the node it would be fine for about a month. However, with a significant number of nodes, this was happening multiple times per day. Each instance needed rebooting, which meant going through our worker reboot procedure:</p><ol><li><p>Cordon off the affected node to prevent new workloads from scheduling on it.</p></li><li><p>Collect any diagnostic information for later investigation.</p></li><li><p>Drain the node of current workloads.</p></li><li><p>Reboot and wait for the node to come back.</p></li><li><p>Verify the node is healthy.</p></li><li><p>Re-enable scheduling of new workloads to the node.</p></li></ol><p>While solving the underlying issue would be ideal, we needed a mitigation to avoid toil in the meantime — an automated node remediation process.</p>
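<p>The steps above map almost directly onto kubectl commands. Below is a minimal bash sketch of the procedure; the node name, the ssh-based reboot, and the dry-run defaults are illustrative assumptions, not our actual tooling, and steps 2 and 5 remain manual:</p>

```shell
#!/usr/bin/env bash
# Sketch of the manual worker reboot procedure above.
# By default this is a dry run that only prints the commands it would run;
# set KUBECTL=kubectl and REBOOT_CMD=ssh to execute against a real cluster.
# Steps 2 (collect diagnostics) and 5 (verify health) remain manual.
set -uo pipefail

KUBECTL=${KUBECTL:-"echo kubectl"}
REBOOT_CMD=${REBOOT_CMD:-"echo ssh"}

remediate_node() {
  local node="$1"
  $KUBECTL cordon "$node"                      # 1. stop new workloads scheduling here
  $KUBECTL drain "$node" --ignore-daemonsets   # 3. evict current workloads
  $REBOOT_CMD "$node" sudo reboot              # 4. reboot the node out of band...
  $KUBECTL wait --for=condition=Ready "node/$node" --timeout=30m  # ...and wait
  $KUBECTL uncordon "$node"                    # 6. re-enable scheduling
}

remediate_node "worker1a"   # "worker1a" is an example node name
```

Run as-is, this prints the kubectl invocations instead of executing them, which also makes the sequencing easy to review.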
    <div>
      <h2>Existing Detection and Remediation Solutions</h2>
      <a href="#existing-detection-and-remediation-solutions">
        
      </a>
    </div>
    <p>While not complicated, the manual remediation process outlined previously became tedious and distracting, as we had to reboot nodes multiple times a day. Some manual intervention is unavoidable, but for cases matching the following, we wanted automation:</p><ul><li><p>Generic worker nodes</p></li><li><p>Software issues confined to a given node</p></li><li><p>Already researched and diagnosed issues</p></li></ul><p>Limiting automatic remediation to generic worker nodes is important as there are other node types in our clusters where more care is required. For example, for control-plane nodes the process has to be augmented to check <a href="https://etcd.io/">etcd</a> cluster health and ensure proper redundancy for components servicing the Kubernetes API. We are also going to limit the problem space to known software issues confined to a node where we expect automatic remediation to be the right answer (as in our ballooning network interface problem). With that in mind, we took a look at existing solutions that we could use.</p>
    <div>
      <h3>Node Problem Detector</h3>
      <a href="#node-problem-detector">
        
      </a>
    </div>
    <p><a href="https://github.com/kubernetes/node-problem-detector">Node problem detector</a> is a daemon that runs on each node, detecting problems and reporting them to the Kubernetes API. It has a pluggable problem daemon system, so one can add their own logic for detecting issues with a node. Node problems are classified as either temporary or permanent, with the latter persisted as status conditions on the Kubernetes node resources.<sup>2</sup></p>
    <div>
      <h3>Draino and Cluster-Autoscaler</h3>
      <a href="#draino-and-cluster-autoscaler">
        
      </a>
    </div>
    <p><a href="https://github.com/planetlabs/draino">Draino</a>, as its name implies, drains nodes, but does so based on Kubernetes node conditions. It is meant to be used with <a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler">cluster-autoscaler</a>, which can then add or remove nodes via the cluster plugins to scale node groups.</p>
    <div>
      <h3>Kured</h3>
      <a href="#kured">
        
      </a>
    </div>
    <p><a href="https://github.com/weaveworks/kured">Kured</a> is a daemon that looks at the presence of a file on the node to initiate a drain, reboot and uncordon of the given node. It uses a locking mechanism via the Kubernetes API to ensure only a single node is acted upon at a time.</p>
    <div>
      <h3>Cluster-API</h3>
      <a href="#cluster-api">
        
      </a>
    </div>
    <p>The <a href="https://github.com/kubernetes/community/tree/master/sig-cluster-lifecycle">Kubernetes cluster-lifecycle SIG</a> has been working on the <a href="https://github.com/kubernetes-sigs/cluster-api">cluster-api</a> project to enable declaratively defining clusters, simplifying the provisioning, upgrading, and operating of multiple Kubernetes clusters. It has a concept of machine resources, which back Kubernetes node resources, and a concept of <a href="https://cluster-api.sigs.k8s.io/tasks/healthcheck.html">machine health checks</a>. Machine health checks use node conditions to determine unhealthy nodes; the cluster-api provider is then delegated to replace that machine via create and delete operations.</p>
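<p>For a sense of what declarative remediation looks like there, below is a hedged sketch of a cluster-api MachineHealthCheck manifest. The names and thresholds are made up, and the exact API version and field names vary between cluster-api releases:</p>

```yaml
apiVersion: cluster.x-k8s.io/v1beta1   # version differs across cluster-api releases
kind: MachineHealthCheck
metadata:
  name: worker-health                  # hypothetical name
spec:
  clusterName: my-cluster              # hypothetical cluster
  maxUnhealthy: 40%                    # back off if too many machines fail at once
  selector:
    matchLabels:
      node-role: worker
  unhealthyConditions:                 # node conditions that mark a machine unhealthy
    - type: Ready
      status: "False"
      timeout: 300s
```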
    <div>
      <h2>Proof of Concept</h2>
      <a href="#proof-of-concept">
        
      </a>
    </div>
    <p>Interestingly, with all the above except for Kured, there is a theme of pluggable components centered around Kubernetes node conditions. We wanted to see if we could build a proof of concept using the existing theme and solutions. For the existing solutions, draino with cluster-autoscaler didn’t make sense in a non-cloud environment like our bare-metal set up. The cluster-api health checks are interesting, however they require a more complete investment into the cluster-api project to really make sense. That left us with node-problem-detector and kured. Deploying node-problem-detector was simple, and we ended up testing a <a href="https://github.com/kubernetes/node-problem-detector/blob/master/docs/custom_plugin_monitor.md">custom-plugin-monitor</a> like the following:</p>
            <pre><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector-config
data:
  check_calico_interfaces.sh: |
    #!/bin/bash
    set -euo pipefail
    
    count=$(nsenter -n/proc/1/ns/net ip link | grep cali | wc -l)
    
    if (( $count &gt; 150 )); then
      echo "Too many calico interfaces ($count)"
      exit 1
    else
      exit 0
    fi
  cali-monitor.json: |
    {
      "plugin": "custom",
      "pluginConfig": {
        "invoke_interval": "30s",
        "timeout": "5s",
        "max_output_length": 80,
        "concurrency": 3,
        "enable_message_change_based_condition_update": false
      },
      "source": "calico-custom-plugin-monitor",
      "metricsReporting": false,
      "conditions": [
        {
          "type": "NPDCalicoUnhealthy",
          "reason": "CalicoInterfaceCountOkay",
          "message": "Normal amount of interfaces"
        }
      ],
      "rules": [
        {
          "type": "permanent",
          "condition": "NPDCalicoUnhealthy",
          "reason": "TooManyCalicoInterfaces",
          "path": "/bin/bash",
          "args": [
            "/config/check_calico_interfaces.sh"
          ],
          "timeout": "3s"
        }
      ]
    }</code></pre>
            <p>Testing showed that when the condition became true, a condition would be updated on the associated Kubernetes node like so:</p>
            <pre><code>kubectl get node -o json worker1a | jq '.status.conditions[] | select(.type | test("^NPD"))'
{
  "lastHeartbeatTime": "2020-03-20T17:05:17Z",
  "lastTransitionTime": "2020-03-20T17:05:16Z",
  "message": "Too many calico interfaces (154)",
  "reason": "TooManyCalicoInterfaces",
  "status": "True",
  "type": "NPDCalicoUnhealthy"
}</code></pre>
            <p>With that in place, the actual remediation needed to happen. Kured seemed to do most everything we needed, except that it was looking at a file instead of Kubernetes node conditions. We hacked together a patch to change that and tested it successfully end to end — we had a working proof of concept!</p>
    <div>
      <h2>Revisiting Problem Detection</h2>
      <a href="#revisiting-problem-detection">
        
      </a>
    </div>
    <p>While the above worked, we found that node-problem-detector was unwieldy because we were replicating our existing monitoring into shell scripts and node-problem-detector configuration. A <a href="https://www.infoq.com/news/2017/10/monitoring-cloudflare-prometheus/">2017 blog post</a> describes Cloudflare’s monitoring stack, although some things have changed since then. What hasn’t changed is our extensive usage of <a href="https://prometheus.io/">Prometheus</a> and <a href="https://github.com/prometheus/alertmanager">Alertmanager</a>.</p><p>For the network interface issue and other issues we wanted to address, we already had the necessary exported metrics and alerting to go with them. Here is what our already existing alert looked like<sup>3</sup>:</p>
            <pre><code>- alert: CalicoTooManyInterfaces
  expr: sum(node_network_info{device=~"cali.*"}) by (node) &gt;= 200
  for: 1h
  labels:
    priority: "5"
    notify: chat-sre-core chat-k8s</code></pre>
            <p>Note that we use a “notify” label to drive the routing logic in Alertmanager. However, that got us asking, could we just route this to a Kubernetes node condition instead?</p>
    <div>
      <h2>Introducing Sciuro</h2>
      <a href="#introducing-sciuro">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2Hx2xX6FXjjTsCdZZFmn7O/1e8be40972f81d59b9be18a1e60c68c4/image1-6.png" />
            
            </figure><p>Sciuro is our open-source replacement for node-problem-detector that has one simple job: synchronize Kubernetes node conditions with currently firing alerts in Alertmanager. Node problems can be defined with a more holistic view, using already existing exporters such as <a href="https://github.com/prometheus/node_exporter">node exporter</a>, <a href="https://github.com/google/cadvisor">cadvisor</a> or <a href="https://github.com/google/mtail">mtail</a>. It also doesn’t run on the affected nodes, which allows us to rely on out-of-band remediation techniques. Here is a high-level diagram of how Sciuro works:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2TF20JXIsXxWApKN9yyEcy/fe4d2244e860fe8e0c1477e06ae24c30/image4-3.png" />
            
            </figure><p>Starting from the top, nodes are scraped by Prometheus, which collects those metrics and fires relevant alerts to Alertmanager. Sciuro polls Alertmanager for alerts with a matching receiver, matches them with a corresponding node resource in the Kubernetes API and updates that node’s conditions accordingly.</p><p>In more detail, we can start by defining an alert in Prometheus like the following:</p>
            <pre><code>- alert: CalicoTooManyInterfacesEarly
  expr: sum(node_network_info{device=~"cali.*"}) by (node) &gt;= 150
  labels:
    priority: "6"
    notify: node-condition-k8s</code></pre>
            <p>Note the two differences from the previous alert. First, we use a new name with a more sensitive trigger. The idea is that we want automatic node remediation to try fixing the node first as soon as possible, but if the problem worsens or automatic remediation is failing, humans will still get notified to act. The second difference is that instead of notifying chat rooms, we route to a target called “node-condition-k8s”.</p><p>Sciuro then comes into play, polling the Alertmanager API for alerts matching the “node-condition-k8s” receiver. The following shows the equivalent using <a href="https://github.com/prometheus/alertmanager/tree/master/cmd/amtool">amtool</a>:</p>
            <pre><code>$ amtool alert query -r node-condition-k8s
Alertname                 	Starts At            	Summary                                                               	 
CalicoTooManyInterfacesEarly  2021-05-11 03:25:21 UTC  Kubernetes node worker1a has too many Calico interfaces  </code></pre>
            <p>We can also check the labels for this alert:</p>
            <pre><code>$ amtool alert query -r node-condition-k8s -o json | jq '.[] | .labels'
{
  "alertname": "CalicoTooManyInterfacesEarly",
  "cluster": "a.k8s",
  "instance": "worker1a",
  "node": "worker1a",
  "notify": "node-condition-k8s",
  "priority": "6",
  "prometheus": "k8s-a"
}</code></pre>
            <p>Note the node and instance labels, which Sciuro uses to match the alert with the corresponding Kubernetes node. Sciuro uses the excellent <a href="https://github.com/kubernetes-sigs/controller-runtime">controller-runtime</a> to keep track of and update node resources in the Kubernetes API. We can observe the updated node condition on the status field via kubectl:</p>
            <pre><code>$ kubectl get node worker1a -o json | jq '.status.conditions[] | select(.type | test("^AlertManager"))'
{
  "lastHeartbeatTime": "2021-05-11T03:31:20Z",
  "lastTransitionTime": "2021-05-11T03:26:53Z",
  "message": "[P6] Kubernetes node worker1a has too many Calico interfaces",
  "reason": "AlertIsFiring",
  "status": "True",
  "type": "AlertManager_CalicoTooManyInterfacesEarly"
}</code></pre>
            <p>One important note is that Sciuro adds the AlertManager_ prefix to the node condition type to prevent conflicts with other node condition types. For example, DiskPressure, a kubelet-managed condition, could also be an alert name. Sciuro also properly updates the heartbeat and transition times to reflect when it first saw the alert and its last update. With node conditions synchronized by Sciuro, remediation can take place via one of the existing tools. As mentioned previously, we are using a modified version of Kured for now.</p><p>We’re happy to announce that we’ve open sourced Sciuro, and it can be found on <a href="https://github.com/cloudflare/sciuro">GitHub</a>, where you can read the code, find the deployment instructions, or open a Pull Request for changes.</p>
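<p>On the Alertmanager side, wiring this up needs nothing more than a route matched on the notify label and a receiver with no notification integrations, since Sciuro polls the Alertmanager API rather than receiving pushes. A sketch of the relevant configuration fragment (everything except the node-condition-k8s name is an assumption about your routing tree):</p>

```yaml
route:
  # ... top-level receiver and other routes omitted ...
  routes:
    - match:
        notify: node-condition-k8s
      receiver: node-condition-k8s

receivers:
  # Intentionally has no webhook/email/chat integrations:
  # Sciuro polls the Alertmanager API for alerts routed here.
  - name: node-condition-k8s
```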
    <div>
      <h2>Managing Node Uptime</h2>
      <a href="#managing-node-uptime">
        
      </a>
    </div>
    <p>While we began using automatic node remediation for obvious problems, we’ve expanded its purpose to additionally keep node uptime low. Low node uptime is desirable to further reduce drift on nodes, keep the node initialization process well-oiled, and encourage the best deployment practices on the Kubernetes clusters. To expand on the last point, services that are deployed with best practices and in a high-availability fashion should see negligible impact when a single node leaves the cluster. However, services that are not deployed with best practices will most likely have problems, especially if they rely on singleton pods. Draining nodes more frequently introduces regular chaos that encourages best practices. To enable this with automatic node remediation, the following alert was defined:</p>
            <pre><code>- alert: WorkerUptimeTooHigh
  expr: |
    (
      (
        (
              max by(node) (kube_node_role{role="worker"})
            - on(node) group_left()
              (max by(node) (kube_node_role{role!="worker"}))
          or on(node)
            max by(node) (kube_node_role{role="worker"})
        ) == 1
      )
    * on(node) group_left()
      (
        (time() - node_boot_time_seconds) &gt; (60 * 60 * 24 * 7)
      )
    )
  labels:
    priority: "9"
    notify: node-condition-k8s</code></pre>
            <p>There is a bit of juggling with the kube_node_role metric in the above to isolate the alert to generic worker nodes, but at a high level it looks at node_boot_time_seconds, a metric from <a href="https://github.com/prometheus/node_exporter">prometheus node_exporter</a>. Again, the notify label is configured to send to node conditions, which kicks off the automatic node remediation. One further detail is that the priority here is set to “9”, which is of lower precedence than our other alerts. Note that the message field of the node condition is prefixed with the alert priority in brackets. This allows the remediation process to take priority into account when choosing which node to remediate first, which is important because Kured uses a lock to act on a single node at a time.</p>
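<p>Because the priority is embedded in the condition message, the remediation component only needs a little parsing to decide which node to act on first. The following is an illustrative sketch of that ordering logic, not the actual code from our Kured fork:</p>

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
)

// priorityRe matches the "[P6] ..." style prefix that Sciuro puts in
// node condition messages.
var priorityRe = regexp.MustCompile(`^\[P(\d+)\]`)

// priority extracts the numeric priority from a condition message,
// defaulting to a low precedence when no prefix is present.
func priority(message string) int {
	m := priorityRe.FindStringSubmatch(message)
	if m == nil {
		return 99 // arbitrary "lowest precedence" default for this sketch
	}
	p, _ := strconv.Atoi(m[1])
	return p
}

func main() {
	// Hypothetical condition messages from two unhealthy nodes.
	messages := []string{
		"[P9] Kubernetes node worker3c has uptime over 7d",
		"[P6] Kubernetes node worker1a has too many Calico interfaces",
	}
	// A lower number means higher precedence, so the P6 node sorts first.
	sort.Slice(messages, func(i, j int) bool {
		return priority(messages[i]) < priority(messages[j])
	})
	fmt.Println(messages[0]) // the P6 message
}
```

In practice, the same parse would run over condition messages fetched from the Kubernetes API before the remediator takes its single-node lock.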
    <div>
      <h2>Wrapping Up</h2>
      <a href="#wrapping-up">
        
      </a>
    </div>
    <p>In the past 30 days, we’ve used the above automatic node remediation process to act on 571 nodes. That has saved our humans a considerable amount of time. We’ve also been able to reduce the time to repair for some issues, as automatic remediation can act at all times of the day and with a faster response time.</p><p>As mentioned before, we’re open sourcing Sciuro, and its code can be found on <a href="https://github.com/cloudflare/sciuro">GitHub</a>. We’re open to issues, suggestions, and pull requests. We do have some ideas for future improvements. For Sciuro, we may look to reduce latency, which is mainly due to polling, and potentially add a push model from Alertmanager, although this isn’t a need we’ve had yet. For the larger node remediation story, we hope to do an overhaul of the remediating component. As mentioned previously, we are currently using a fork of kured, but a future replacement component should include the following:</p><ul><li><p>Use out-of-band management interfaces to be able to shut down and power on nodes without a functional operating system.</p></li><li><p>Move from a decentralized architecture to a centralized one that can integrate more complicated logic. This might include being able to act on entire failure domains in parallel.</p></li><li><p>Handle specialized nodes such as masters or storage nodes.</p></li></ul><p>Finally, we’re looking for more people passionate about Kubernetes to <a href="https://boards.greenhouse.io/cloudflare/jobs/816059?gh_jid=816059">join our team</a>. Come help us push Kubernetes to the next level to serve Cloudflare’s many needs!</p><hr /><p><sup>1</sup>Exhaustion can be applied to hardware resources, kernel resources, or logical resources like the amount of logging being produced.</p><p><sup>2</sup>Nearly all Kubernetes objects have <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/">spec and status fields</a>. 
The status field is used to describe the current state of an object. For node resources, typically the kubelet manages a conditions field under the status field for reporting things like if the node is ready for servicing pods.
<sup>3</sup>The format of the following alert is documented on <a href="https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/">Prometheus Alerting Rules</a>.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Monitoring]]></category>
            <guid isPermaLink="false">76X316XLhCYnEUO7ZQR8CO</guid>
            <dc:creator>Andrew DeMaria</dc:creator>
        </item>
        <item>
            <title><![CDATA[Moving k8s communication to gRPC]]></title>
            <link>https://blog.cloudflare.com/moving-k8s-communication-to-grpc/</link>
            <pubDate>Sat, 20 Mar 2021 14:00:00 GMT</pubDate>
            <description><![CDATA[ How we use gRPC in combination with Kubernetes to improve the performance and usability of internal APIs. ]]></description>
            <content:encoded><![CDATA[ <p>Over the past year and a half, Cloudflare has been hard at work moving our back-end services running in our non-edge locations from bare metal solutions and Mesos Marathon to a more unified approach using <a href="https://kubernetes.io/">Kubernetes (K8s)</a>. We chose Kubernetes because it allowed us to split up our monolithic application into many different microservices with granular control of communication.</p><p>For example, a <a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/">ReplicaSet</a> in Kubernetes can provide high availability by ensuring that the correct number of pods are always available. A <a href="https://kubernetes.io/docs/concepts/workloads/pods/">Pod</a> in Kubernetes is similar to a container in <a href="https://www.docker.com/">Docker</a>. Both are responsible for running the actual application. These pods can then be exposed through a Kubernetes <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Service</a> to abstract away the number of replicas by providing a single endpoint that load balances to the pods behind it. The services can then be exposed to the Internet via an <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/">Ingress</a>. Lastly, a network policy can protect against unwanted communication by ensuring the correct policies are applied to the application. These policies can include L3 or L4 rules.</p><p>The diagram below shows a simple example of this setup.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33zWCqFZw2iuXfhllsHgdk/e290742e17a975a98a195bccc283297f/2-3.png" />
            
            </figure><p>Though Kubernetes does an excellent job at providing the tools for communication and traffic management, it does not help the developer decide the best way to communicate between the applications running on the pods. Throughout this blog we will look at some of the decisions we made and why we made them to discuss the pros and cons of two commonly used API architectures, REST and gRPC.</p>
    <div>
      <h3>Out with the old, in with the new</h3>
      <a href="#out-with-the-old-in-with-the-new">
        
      </a>
    </div>
    <p>When the DNS team first moved to Kubernetes, all of our pod-to-pod communication was done through REST APIs and in many cases also included Kafka. The general communication flow was as follows:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3GQIkngkqEYcBJNFzv1xNU/fe13bc8911a11a9c26dc5925b4fe6a19/1-5.png" />
            
            </figure><p>We use Kafka because it allows us to handle large spikes in volume without losing information. For example, during a Secondary DNS zone transfer, Service A tells Service B that the zone is ready to be published to the edge. Service B then calls Service A’s REST API, generates the zone, and pushes it to the edge. If you want more information about how this works, I wrote an entire blog post about the <a href="/secondary-dns-deep-dive/">Secondary DNS pipeline</a> at Cloudflare.</p><p>HTTP worked well for most communication between these two services. However, as we scaled up and added new endpoints, we realized that as long as we control both ends of the communication, we could improve the usability and performance of that communication. In addition, sending large DNS zones over the network using HTTP often caused issues with sizing constraints and compression.</p><p>In contrast, gRPC can easily stream data between client and server and is commonly used in microservice architectures. These qualities made gRPC the obvious replacement for our REST APIs.</p>
    <div>
      <h3>gRPC Usability</h3>
      <a href="#grpc-usability">
        
      </a>
    </div>
    <p>Often overlooked from a developer’s perspective, HTTP client libraries are clunky and require code that defines paths, handles parameters, and deals with responses in bytes. gRPC abstracts all of this away and makes network calls feel like any other function calls defined for a struct.</p><p>The example below shows a very basic schema to set up a gRPC client/server system. Because gRPC uses <a href="https://developers.google.com/protocol-buffers">protobuf</a> for serialization, it is largely language agnostic. Once a schema is defined, the <i>protoc</i> command can be used to generate code for <a href="https://grpc.io/docs/languages/">many languages</a>.</p><p>Protocol Buffer data is structured as <i>messages</i>, with each <i>message</i> containing information stored in the form of fields. The fields are strongly typed, providing type safety, unlike JSON or XML. Two messages have been defined, <i>Hello</i> and <i>HelloResponse</i>. Next, we define a service called <i>HelloWorldHandler</i>, which contains one RPC function, <i>SayHello</i>, that must be implemented if any object wants to call itself a <i>HelloWorldHandler</i>.</p><p>Simple Proto:</p>
            <pre><code>syntax = "proto3";

message Hello{
   string Name = 1;
}

message HelloResponse{}

service HelloWorldHandler {
   rpc SayHello(Hello) returns (HelloResponse){}
}</code></pre>
            <p>Once we run our <i>protoc</i> command, we are ready to write the server-side code. In order to implement the <i>HelloWorldHandler</i>, we must define a struct that implements all of the RPC functions specified in the protobuf schema above. In this case, the struct <i>Server</i> defines a function <i>SayHello</i> that takes in two parameters, a context and a <i>*pb.Hello</i>. <i>*pb.Hello</i> was previously specified in the schema and contains one field, <i>Name</i>. <i>SayHello</i> must also return a <i>*pb.HelloResponse</i>, which has been defined without fields for simplicity.</p><p>Inside the main function, we create a TCP listener, create a new gRPC server, and then register our handler as a <i>HelloWorldHandlerServer</i>. After calling <i>Serve</i> on our gRPC server, clients will be able to communicate with the server through the function <i>SayHello</i>.</p><p>Simple Server:</p>
            <pre><code>type Server struct{}

func (s *Server) SayHello(ctx context.Context, in *pb.Hello) (*pb.HelloResponse, error) {
    fmt.Printf("%s says hello\n", in.Name)
    return &amp;pb.HelloResponse{}, nil
}

func main() {
    lis, err := net.Listen("tcp", ":8080")
    if err != nil {
        panic(err)
    }
    gRPCServer := gRPC.NewServer()
    handler := Server{}
    pb.RegisterHelloWorldHandlerServer(gRPCServer, &amp;handler)
    if err := gRPCServer.Serve(lis); err != nil {
        panic(err)
    }
}</code></pre>
            <p>Finally, we need to implement the gRPC Client. First, we establish a TCP connection with the server. Then, we create a new <i>pb.HandlerClient</i>. The client is able to call the server's <i>SayHello</i> function by passing in a *<i>pb.Hello</i> object.</p><p>Simple Client:</p>
            <pre><code>conn, err := gRPC.Dial("127.0.0.1:8080", gRPC.WithInsecure())
if err != nil {
    panic(err)
}
defer conn.Close()
client := pb.NewHelloWorldHandlerClient(conn)
if _, err := client.SayHello(context.Background(), &amp;pb.Hello{Name: "alex"}); err != nil {
    panic(err)
}</code></pre>
            <p>Though I have removed some code for simplicity, these <i>services</i> and <i>messages</i> can become quite complex if needed. The most important thing to understand is that when a server attempts to announce itself as a <i>HelloWorldHandlerServer</i>, it is required to implement the RPC functions as specified within the protobuf schema. This agreement between the client and server makes cross-language network calls feel like regular function calls.</p><p>In addition to the basic Unary server described above, gRPC lets you decide between four types of service methods:</p><ul><li><p><b>Unary</b> (example above): client sends a single request to the server and gets a single response back, just like a normal function call.</p></li><li><p><b>Server Streaming:</b> server returns a stream of messages in response to a client's request.</p></li><li><p><b>Client Streaming:</b> client sends a stream of messages to the server and the server replies in a single message, usually once the client has finished streaming.</p></li><li><p><b>Bi-directional Streaming:</b> the client and server can both send streams of messages to each other asynchronously.</p></li></ul>
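<p>In the schema, these four variants differ only in where the stream keyword appears. Extending the earlier example (the method names other than SayHello are invented for illustration):</p>

```protobuf
service HelloWorldHandler {
   // Unary: one request, one response.
   rpc SayHello(Hello) returns (HelloResponse) {}
   // Server streaming: one request, a stream of responses.
   rpc ListGreetings(Hello) returns (stream HelloResponse) {}
   // Client streaming: a stream of requests, one response.
   rpc SayManyHellos(stream Hello) returns (HelloResponse) {}
   // Bi-directional streaming: both sides stream asynchronously.
   rpc Chat(stream Hello) returns (stream HelloResponse) {}
}
```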
    <div>
      <h3>gRPC Performance</h3>
      <a href="#grpc-performance">
        
      </a>
    </div>
    <p>Not all HTTP connections are created equal. Though Golang natively supports HTTP/2, the HTTP/2 transport must be set by the client, and the server must also support HTTP/2. Before moving to gRPC, we were still using HTTP/1.1 for client connections. We could have switched to HTTP/2 for performance gains, but we would have lost some of the benefits of native protobuf compression and usability changes.</p><p>The best option available in HTTP/1.1 is pipelining. Pipelining means that although requests can share a connection, they must queue up one after the other until the request in front completes. HTTP/2 improved on pipelining with connection multiplexing, which allows multiple requests to be sent on the same connection at the same time.</p><p>HTTP REST APIs generally use JSON for their request and response format. Protobuf is the native request/response format of gRPC because it has a standard schema agreed upon by the client and server during registration. In addition, protobuf is known to be significantly faster than JSON due to its serialization speeds. I’ve run some benchmarks on my laptop; the source code can be found <a href="https://github.com/Fattouche/protobuf-benchmark">here</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23TkQ80Iruo8IYflqIuobN/f7323dd5817ca47f9b203b8b63a3a980/image1-26.png" />
            
            </figure><p>As you can see, protobuf performs better in small, medium, and large data sizes. It is faster per operation, smaller after marshalling, and scales well with input size. This becomes even more noticeable when unmarshaling very large data sets. Protobuf takes 96.4ns/op but JSON takes 22647ns/op, a 235X reduction in time! For large DNS zones, this efficiency makes a massive difference in the time it takes us to go from record change in our API to serving it at the edge.</p><p>Combining the benefits of HTTP/2 and protobuf showed almost no performance change from our application’s point of view. This is likely due to the fact that our pods were already so close together that our connection times were already very low. In addition, most of our gRPC calls are done with small amounts of data where the difference is negligible. One thing that we did notice <b>—</b> likely related to the multiplexing of HTTP/2 <b>—</b> was greater efficiency when writing newly created/edited/deleted records to the edge. Our latency spikes dropped in both amplitude and frequency.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6PDR6bZkh8zzVVv2j6tmE5/9ce516e00daa832135964246eaf7b95c/image2-19.png" />
            
            </figure>
    <div>
      <h3>gRPC Security</h3>
      <a href="#grpc-security">
        
      </a>
    </div>
    <p>One of the best features in Kubernetes is the NetworkPolicy resource, which lets developers control exactly which traffic is allowed into and out of their pods.</p>
            <pre><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - ipBlock:
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend
    ports:
    - protocol: TCP
      port: 6379
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978</code></pre>
            <p>In this example, taken from the <a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/">Kubernetes docs</a>, we can see that this will create a network policy called test-network-policy. This policy controls both ingress and egress communication to or from any pod that matches the label <i>role: db</i> and enforces the following rules:</p><p>Ingress connections allowed:</p><ul><li><p>Any pod in the default namespace with the label “role=frontend”</p></li><li><p>Any pod in any namespace with the label “project=myproject”</p></li><li><p>Any source IP address in 172.17.0.0/16, except for 172.17.1.0/24</p></li></ul><p>Egress connections allowed:</p><ul><li><p>Any destination IP address in 10.0.0.0/24</p></li></ul><p>NetworkPolicies do a fantastic job of protecting APIs at the network level; however, they do nothing to protect APIs at the application level. If you wanted to control which endpoints can be accessed within the API, you would need Kubernetes to distinguish not only between pods, but also between endpoints within those pods. These concerns led us to <a href="https://grpc.io/docs/guides/auth/">per-RPC credentials</a>. Per-RPC credentials are easy to set up on top of pre-existing gRPC code. All you need to do is add interceptors to both your stream and unary handlers.</p>
            <pre><code>func (s *Server) UnaryAuthInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    // Get the targeted function, i.e. the "Method" in "/package.Service/Method"
    functionInfo := strings.Split(info.FullMethod, "/")
    function := functionInfo[len(functionInfo)-1]

    // Reject requests with missing metadata instead of panicking on index [0]
    md, ok := metadata.FromIncomingContext(ctx)
    if !ok || len(md.Get("username")) == 0 || len(md.Get("password")) == 0 {
        return nil, status.Error(codes.Unauthenticated, "missing credentials")
    }

    // Authenticate
    err := authenticateClient(md.Get("username")[0], md.Get("password")[0], function)
    // Blocked
    if err != nil {
        return nil, err
    }
    // Verified
    return handler(ctx, req)
}</code></pre>
            <p>In this example code snippet, we grab the requested function from the info object and the username and password from the incoming request metadata. We then authenticate the client to make sure that it has the correct rights to call that function. This interceptor runs before any of the other functions get called, which means one implementation protects all functions. The client would initialize its secure connection and send credentials like so:</p>
            <pre><code>transportCreds, err := credentials.NewClientTLSFromFile(certFile, "")
if err != nil {
    return nil, err
}
perRPCCreds := Creds{Password: grpcPassword, User: user}
conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(transportCreds), grpc.WithPerRPCCredentials(perRPCCreds))
if err != nil {
    return nil, err
}
client := pb.NewRecordHandlerClient(conn)
// Can now start using the client</code></pre>
            <p>Here the client first verifies that the server’s identity matches the certFile. This step ensures that the client does not accidentally send its password to a bad actor. Next, the client initializes the <i>perRPCCreds</i> struct with its username and password and dials the server with that information. Any time the client makes a call to an RPC-defined function, its credentials will be verified by the server.</p>
    <div>
      <h3>Next Steps</h3>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>Our next step is to remove the need for many applications to access the database and ultimately DRY up our codebase by pulling all DNS-related code into a single API, accessed from one gRPC interface. This removes the potential for mistakes in individual applications and makes updating our database schema easier. It also gives us more granular control over which functions can be accessed rather than which tables can be accessed.</p><p>So far, the DNS team is very happy with the results of our gRPC migration. However, we still have a long way to go before we can move entirely away from REST. We are also patiently waiting for <a href="https://github.com/grpc/grpc/issues/19126">HTTP/3 support</a> for gRPC so that we can take advantage of those super <a href="https://en.wikipedia.org/wiki/QUIC">quic</a> speeds!</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[gRPC]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[HTTP2]]></category>
            <category><![CDATA[QUIC]]></category>
            <guid isPermaLink="false">6oeh7RRYqhqnS7vtJ2BpwP</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Lessons Learned from Scaling Up Cloudflare’s Anomaly Detection Platform]]></title>
            <link>https://blog.cloudflare.com/lessons-learned-from-scaling-up-cloudflare-anomaly-detection-platform/</link>
            <pubDate>Fri, 12 Mar 2021 15:48:32 GMT</pubDate>
            <description><![CDATA[ Anomaly Detection uses an algorithm called Histogram-Based Outlier Scoring (HBOS) to detect anomalous traffic in a scalable way. While HBOS is less precise than algorithms like kNN when it comes to local outliers, it is able to score global outliers quickly (in linear time). ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h3>Introduction to Anomaly Detection for Bot Management</h3>
      <a href="#introduction-to-anomaly-detection-for-bot-management">
        
      </a>
    </div>
    <p>Cloudflare’s <a href="https://www.cloudflare.com/products/bot-management/">Bot Management platform</a> follows a “defense in depth” model. Although each layer of Bot Management has its own strengths and weaknesses, the combination of many different detection systems — including Machine Learning, rule-based heuristics, JavaScript challenges, and more — makes for a robust platform in which different detection systems compensate for each other’s weaknesses.</p><p>One of these systems is Anomaly Detection, a platform motivated by a simple idea: because bots are made to accomplish specific goals, such as credential stuffing or <a href="https://www.cloudflare.com/learning/ai/how-to-prevent-web-scraping/">content scraping</a>, they interact with websites in distinct and difficult-to-disguise ways. Over time, the actions of a bot are likely to differ from those of a real user. Anomaly detection aims to model the characteristics of legitimate user traffic as a healthy baseline. Then, when automated bot traffic is set against this baseline, the bots appear as outlying anomalies that can be targeted for mitigation.</p><p>An anomaly detection approach is:</p><ul><li><p>Resilient against bots that try to circumvent protections by spoofing request metadata (e.g., user agents)</p></li><li><p>Able to catch previously unseen bots without being explicitly trained against them.</p></li></ul><p>So, how well does this work?</p><p>Today, Anomaly Detection processes more than 500K requests per second. This translates to over 200K CAPTCHAs issued per minute, not including traffic that’s already caught by other bot management systems or traffic that’s outright blocked. These suspected bots originate from over 140 different countries and 2,200 different ASNs. And all of this happens using automatically generated baselines and visitor models which are unique to every enrolled site — no cross-site analysis or manual intervention required.</p>
    <div>
      <h3>How Anomaly Detection Identifies Bots</h3>
      <a href="#how-anomaly-detection-identifies-bots">
        
      </a>
    </div>
    <p>Anomaly Detection uses an algorithm called <a href="https://www.goldiges.de/publications/HBOS-KI-2012.pdf">Histogram-Based Outlier Scoring (HBOS)</a> to detect anomalous traffic in a scalable way. While HBOS is less precise than algorithms like kNN when it comes to local outliers, it is able to score global outliers quickly (in linear time).</p><p>There are two parts to every behavioral bot detection: the site-specific baseline and the visitor-specific behavior model. We make heavy use of <a href="https://clickhouse.tech/">ClickHouse</a>, an open-source columnar database, for <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">storing and analyzing enormous amounts of data</a>. When a customer opts in to Anomaly Detection, we begin aggregating traffic data in ClickHouse to form a unique baseline for their site.</p><p>To understand visitor behavior, we maintain a sliding window of ephemeral feature aggregates in memory using <a href="https://redis.io/">Redis</a>. <a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a> data structures allow us to efficiently store and estimate unique counts of these high-cardinality features. Because these data are privacy-sensitive, they are retained only within a recent time window and are specific to each opted-in site. The high cardinality of the resulting problem space makes efficient data representations essential.</p><p>The output of each detection run is an <i>outlier score</i>, representing how anomalous a visitor’s behavior is when viewed against the baseline for that particular site. This outlier score feeds into the final bot score calculation for use on the edge.</p>
    <div>
      <h3>The Anomaly Detection Platform</h3>
      <a href="#the-anomaly-detection-platform">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17JKOorPN2YNRAHZAJxFd2/53fae073ecb2cf4dd98e5a8b6308d1dc/image1-13.png" />
            
            </figure><p>The Anomaly Detection platform consists of a series of microservices running on <a href="https://kubernetes.io/">Kubernetes</a>. Request data come in through a dedicated Kafka topic and are inserted into both ClickHouse and Redis.</p><p>The Detector service lazily retrieves (and caches) baseline data from the Baseline service and calculates outlier scores for visitors compared to the baselines. These scores are then written to another Kafka topic to be persisted for later analysis.</p><p>Finally, the Publisher service collects batches of detections (visitors whose behavior is anomalous or bot-like) and sends them out to the edge to be applied as part of our bot score calculations.</p><p>Each microservice runs independently and tolerates downtime from its dependencies. They are also sized very differently: some services require dozens of replicas and gigabytes of memory, while others are much cheaper.</p><p>Today, the Anomaly Detection platform handles nearly 500K requests per second across ~310M unique visitors, representing 2x growth over the last six months. But once upon a time, we struggled to handle even 10K requests per second.</p><p>The story of our growth is also the story of how we adapted, redesigned, and improved our infrastructure in order to respond to the corresponding increases in resource demand, customer support requests, and maintenance challenges.</p>
    <div>
      <h3>Launch, then Iterate</h3>
      <a href="#launch-then-iterate">
        
      </a>
    </div>
    <p>In an earlier incarnation, most of the core Anomaly Detection logic was contained in a single (replicated) service running under Kubernetes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3nZtBtQAWGXjQBbez5abWH/fa6c555c497ae2b349b43961c1852522/image2-7.png" />
            
            </figure><p>Each service pod fulfilled multiple responsibilities: generating behavioral baselines from ClickHouse data, aggregating visitor profiles from Redis, and calculating outlier scores. These outlier detections were then forwarded to the edge by piggybacking on another bot management system’s existing integration with <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a>, Cloudflare’s replicated key-value store.</p><p>This simple design was easy to understand and implement. It also reused existing infrastructure, making it perfect for a v1 deployment in keeping with Cloudflare’s culture of fast iteration. Of course, it also had some shortcomings:</p><ul><li><p>A monolithic service meant a single (logical) point of failure.</p></li><li><p>From a resource (CPU, memory) perspective, it was difficult to scale pieces of functionality independently.</p></li><li><p>The “reused” integration with Quicksilver was never meant to support something like Anomaly Detection, causing instability for both systems.</p></li></ul><p>It’s easy in hindsight to focus on the flaws of an existing system, but it’s important to keep in mind that priorities can and should evolve over time. A design that doesn’t meet today’s needs was likely suited to the needs of yesterday.</p><p>One key benefit of a launch-and-iterate approach is that you get a concrete, real-world understanding of where your system needs improvement. Having a real system to observe means that improvements are both targeted and measurable.</p>
    <div>
      <h3>Tuning Redis</h3>
      <a href="#tuning-redis">
        
      </a>
    </div>
    <p>As mentioned above, Redis is a key part of the Anomaly Detection platform, used to store and aggregate features about site visitors. Although we only keep these data in a sliding window, the cardinality of the set is very large (per visitor per site). For this reason, many of the early improvements to Anomaly Detection performance centered on Redis. In fact, the first launch of Anomaly Detection struggled to keep up with only 10K requests per second.</p><p>Profiling revealed that load was centered on our heavy use of the Redis PFMERGE command for merging HyperLogLog structures. Unlike most Redis commands, which are O(1), PFMERGE runs in linear time (proportional to the number of features times the window size). As the demand for scoring increased, this proved to be a serious bottleneck.</p><p>To resolve this problem, we looked for ways to further optimize our use of Redis. One idea was lowering the threshold for promoting a sparse HyperLogLog representation to a dense one, trading memory for compute, as dense HyperLogLogs are generally faster to merge.</p><p>However, as is so often the case, a big win came from a simple idea: we introduced a “recency register,” essentially a cache that placed a time bound on how often we would run expensive detection logic on a given site-visitor pair. Since behavior patterns need to be established over a time window, the potential extra detection latency from the recency register was not a significant concern. This straightforward solution was enough to raise our throughput by an order of magnitude.</p><p>Working with Redis involves a lot of balancing between memory and compute resources. For example, our Redis shards’ memory sizes were empirically determined based on the observed CPU utilization when reaching memory bounds. A higher memory bound meant more visitors tracked per shard and thus more commands per second. The fact that Redis shards are single-threaded made reasoning about these situations easier as well.</p><p>As the number of features and visitors grew, we discovered that “standard” Redis recommendations didn’t always work for us in practice. Redis typically recommends using human-readable keys, even if they are longer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/l8FHluRsIC5sNkJzVeELp/535f4a5123f405f0657cd3a2fe74209f/image3-11.png" />
            
            </figure><p>However, by encoding our keys in a compact, binary-encoded format, we observed roughly 30% memory savings, as well as CPU savings — which again demonstrates the value of iterating on a real-world system.</p>
    <div>
      <h3>Moving to Microservices</h3>
      <a href="#moving-to-microservices">
        
      </a>
    </div>
    <p>As Anomaly Detection continued to grow in adoption, it became clear that optimizing individual pieces of the pipeline was no longer sufficient: our platform needed to scale up as well. But, as it turns out, scaling isn’t as simple as requesting more resources and running more replicas of whatever service isn’t keeping up with demand. As we expanded, the amount of load we placed on external (shared!) dependencies like ClickHouse grew. The way we piggybacked on Quicksilver to send updates to the edge coupled two systems together in a bloated and unreliable way.</p><p>So we set out to do more with less: to build a more resilient system that would also be a better steward of Cloudflare’s shared resources.</p><p>The idea of a microservice-based architecture was not a new one; in fact, even early Anomaly Detection designs suggested the eventual need for such a migration. But real-world observations indicated that the redesign was now fully worth the time investment.</p><p>Why did we think moving to microservices would help us solve our scalability issues?</p><p>First, we observed that a large contributor to our load on ClickHouse was repeated baseline aggregation. Because each replica of the monolithic Anomaly Detection service calculated its own copies of the baseline profiles, our pressure on ClickHouse would increase each time we horizontally scaled our service deployment. What’s more, this work was essentially duplicated: there was no reason each replica should need to recalculate copies of the same baseline. Moving this work to a dedicated Baseline service cut out the duplication, to the tune of a 10x reduction in load from this particular operation.</p><p>Second, we noticed that part of our use case (accept a stream of data from Kafka, apply simple transformations, and persist batches of this data to ClickHouse) was a pretty common one at Cloudflare. There already existed robust, battle-tested inserter code for launching microservices with exactly this pattern of operation. Adapting this code to suit our needs not only saved us development time, but brought us more in line with wider convention.</p><p>We also learned the importance of concrete details during design. When we initially began working on the redesign of the Anomaly Detection platform, we felt that Kafka might have a role to play in connecting some of our services. Still, we couldn’t fully justify the initial investment required to move away from the RESTful interfaces we already had.</p><p>The benefits of using Kafka only became clear and concrete once we committed to using ClickHouse as the storage solution for our outlier score data. ClickHouse performs best when data is inserted in larger, less frequent batches, rather than rapid, small operations, which create a lot of internal overhead. Transporting outlier scores via Kafka allowed us to batch updates while being resilient to data loss during transient downtime.</p>
    <div>
      <h3>The Future</h3>
      <a href="#the-future">
        
      </a>
    </div>
    <p>It’s been a journey getting to this point, but we’re far from done. Cloudflare’s mission is to help make the Internet better for everyone, from small websites to the largest enterprises. For Anomaly Detection, this means expanding into a problem space with huge cardinality: roughly the cross-product of the number of sites and the number of unique visitors. To do this, we’re continuing to improve the efficiency of our infrastructure through smarter traffic sampling, compressed baseline windows, and more memory-efficient data representations.</p><p>Additionally, we want to deliver even better detection accuracy on sites with multiple distinct traffic types. Traffic coming to a site from web browsers behaves quite differently than traffic coming from mobile apps, but both sources of traffic are legitimate. While the HBOS outlier detection algorithm is lightweight and efficient, there are alternatives which perform better in the presence of multiple traffic profiles.</p><p>One of these alternatives is local outlier factor (LOF) detection. LOF automatically builds baselines that capture “local clusters” of behavior corresponding to multiple traffic streams, rather than a single site-wide baseline. These new baselines allow Anomaly Detection to better distinguish between human use of a web browser and automated abuse of an API on the same site. Of course, there are trade-offs here as well: generating, storing, and working with these larger and more sophisticated behavioral baselines requires careful and creative engineering. But the reward for doing so is enhanced protection for an even wider range of sites using Anomaly Detection.</p><p>Finally, let’s not forget the very human side of building, supporting, and expanding Anomaly Detection and Bot Management. We’ve recently launched features that speed up model experimentation for Anomaly Detection, allow us to run “shadow” models to record and evaluate performance behind the scenes, give us instant “escape hatches” in case of unexpected customer impact, and more. But our team, as well as the many Solutions Engineers, Product Managers, Subject Matter Experts, and others who support Bot Management, is continuing to invest in improved tooling and education. It’s no small challenge, but it’s an exciting one.</p>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Bots]]></category>
            <guid isPermaLink="false">5KIqXowbX5rR6s5pmhujsE</guid>
            <dc:creator>Jeffrey Tang</dc:creator>
        </item>
        <item>
            <title><![CDATA[Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware]]></title>
            <link>https://blog.cloudflare.com/getting-to-the-core/</link>
            <pubDate>Fri, 20 Nov 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ A refresh of the hardware that Cloudflare uses to run analytics provided big efficiency improvements. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Maintaining a server fleet the size of Cloudflare’s is an operational challenge, to say the least. Anything we can do to lower complexity and improve efficiency has effects for our SRE (Site Reliability Engineer) and Data Center teams that can be felt throughout a server’s 4+ year lifespan.</p><p>At the Cloudflare Core, we process logs to analyze attacks and compute analytics. In 2020, our Core servers were in need of a refresh, so we decided to redesign the hardware to be more in line with our Gen X edge servers. We designed two major server variants for the core. The first is Core Compute 2020, an AMD-based server for analytics and general-purpose compute paired with solid-state storage drives. The second is Core Storage 2020, an Intel-based server with twelve spinning disks to run database workloads.</p>
    <div>
      <h2>Core Compute 2020</h2>
      <a href="#core-compute-2020">
        
      </a>
    </div>
    <p>Earlier this year, we blogged about our 10th generation edge servers or Gen X and the <a href="/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/">improvements</a> they delivered to our edge in <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">both</a> performance and <a href="/securing-memory-at-epyc-scale/">security</a>. The new Core Compute 2020 server leverages many of our learnings from the edge server. The Core Compute servers run a variety of workloads including Kubernetes, Kafka, and various smaller services.</p>
    <div>
      <h3>Configuration Changes (Kubernetes)</h3>
      <a href="#configuration-changes-kubernetes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation Compute</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Gold 6262</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>48C / 96T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>1.9 / 3.6 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2666</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>6 x 480GB SATA SSD</p></td><td><p>2 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>Configuration Changes (Kafka)</b></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation (Kafka)</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Silver 4116</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>24C / 48T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.1 / 3.0 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>6 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 1.92TB SATA SSD</p></td><td><p>10 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p>Both previous generation servers were Intel-based platforms, with the Kubernetes server based on Xeon 6262 processors, and the Kafka server based on Xeon 4116 processors. One goal with these refreshed versions was to converge the configurations in order to simplify spare parts and firmware management across the fleet.</p><p>As the above tables show, the configurations have been converged with the only difference being the number of NVMe drives installed depending on the workload running on the host. In both cases we moved from a dual-socket configuration to a single-socket configuration, and the number of cores and threads per server either increased or stayed the same. In all cases, the base frequency of those cores was significantly improved. We also moved from SATA SSDs to NVMe SSDs.</p>
    <div>
      <h3>Core Compute 2020 Synthetic Benchmarking</h3>
      <a href="#core-compute-2020-synthetic-benchmarking">
        
      </a>
    </div>
    <p>The heaviest user of the SSDs was determined to be Kafka. The majority of the time, Kafka is sequentially writing 2MB blocks to the disk. We created a simple FIO script with 75% sequential write and 25% sequential read, scaling the block size from the standard 4096B page size up to Kafka’s 2MB write size. The results aligned with what we expected from an NVMe-based drive.</p>
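<p>The post doesn’t include the FIO script itself; a jobfile along these lines would reproduce the described 75/25 sequential mix, with the caveat that every parameter below is an assumption and the block size would be varied per run:</p>

```ini
; kafka-sim.fio -- illustrative only; not the script used for these results.
; Run once per block size (e.g. 4k, 64k, 512k, 2m) to reproduce the sweep.
[kafka-sim]
; mixed sequential workload: 75% writes, 25% reads
rw=rw
rwmixwrite=75
; block size for this run; Kafka's write size is 2MB
bs=2m
size=10g
ioengine=libaio
iodepth=32
; bypass the page cache so the drive itself is measured
direct=1
```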
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gqCJeXAL1sUcfVmBW3tVx/7b8a3a9a233086a321967ebb20878434/image5-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6jvSoQBw4BnPCkDYGyeqBH/b2c0a79f10afbdd73ee73d2545f5700f/image4-9.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1N6j9iEmdVaGiomZaoT1wH/cea3efb6d4781f8c0856743869feeb39/image3-8.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KfgNVcwPiNMcb9Gv3T2KU/74c8932b2f69bacfa32754c4c7c1e2d8/image6-5.png" />
            
            </figure>
    <div>
      <h3>Core Compute 2020 Production Benchmarking</h3>
      <a href="#core-compute-2020-production-benchmarking">
        
      </a>
    </div>
    <p>Cloudflare runs many of our Core Compute services in Kubernetes containers, some of which are multi-core. By transitioning to a single socket, problems associated with dual sockets were eliminated, and we are guaranteed to have all cores allocated for any given container on the same socket.</p><p>Another heavy workload that is constantly running on Compute hosts is the Cloudflare <a href="/the-csam-scanning-tool/">CSAM Scanning Tool</a>. Our Systems Engineering team isolated a Core Compute 2020 host and a previous-generation compute host, had them run just this workload, and measured the time taken to compare the fuzzy hashes for images to the NCMEC hash lists and verify that they are a “miss”.</p><p>Because the CSAM Scanning Tool is very compute intensive, we specifically isolated it to take a look at its performance with the new hardware. We’ve spent a great deal of effort on software optimization and improved algorithms for this tool, but investing in faster, better hardware is also important.</p><p>In these heatmaps, the X axis represents time, and the Y axis represents “buckets” of time taken to verify that a hash is not a match to one of the NCMEC hash lists. For a given time slice in the heatmap, the red point is the bucket with the most times measured, the yellow point the second most, and the green points the least. The red points on the Compute 2020 graph are all in the 5 to 8 millisecond bucket, while the red points on the previous-generation heatmap are all in the 8 to 13 millisecond bucket, which shows that, on average, the Compute 2020 host verifies hashes significantly faster.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6JYLpJjG9bAUuQLyeC2kyH/9168b62067b8776a7521e7d831472f74/image2-10.png" />
            
            </figure>
    <div>
      <h2>Core Storage 2020</h2>
      <a href="#core-storage-2020">
        
      </a>
    </div>
    <p>Another major workload we identified was <a href="/clickhouse-capacity-estimation-framework/">ClickHouse</a>, which performs analytics over large datasets. The last time we upgraded our servers running ClickHouse was back in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">2018</a>.</p>
    <div>
      <h3>Configuration Changes</h3>
      <a href="#configuration-changes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon E5-2630 v4</p></td><td><p>1 x Intel Xeon Gold 6210U</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>20C / 40T</p></td><td><p>20C / 40T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.2 / 3.1 GHz</p></td><td><p>2.5 / 3.9 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>CPU Changes</b></p><p>For ClickHouse, we use a 1U chassis with 12 x 10TB 3.5” hard drives. At the time we were designing Core Storage 2020, our server vendor did not yet have an AMD version of this chassis, so we remained on Intel. However, we moved Core Storage 2020 to a single 20 core / 40 thread Xeon processor, rather than the previous generation’s dual-socket 10 core / 20 thread processors. By moving to the single-socket Xeon 6210U processor, we were able to keep the same core count, but gained a 14% higher base frequency and a 26% higher max turbo frequency. Meanwhile, the total CPU thermal design power (TDP), which is an approximation of the maximum power the CPU can draw, went down from 165W to 150W.</p><p>On a dual-socket server, remote memory accesses, which are memory accesses by a process on socket 0 to memory attached to socket 1, incur a latency penalty, as seen in this table:</p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>Memory latency, socket 0 to socket 0</p></td><td><p>81.3 ns</p></td><td><p>86.9 ns</p></td></tr><tr><td><p>Memory latency, socket 0 to socket 1</p></td><td><p>142.6 ns</p></td><td><p>N/A</p></td></tr></table><p>An additional advantage of having a CPU with all 20 cores on the same socket is the elimination of these remote memory accesses, which take 76% longer than local memory accesses.</p>
    <div>
      <h3>Memory Changes</h3>
      <a href="#memory-changes">
        
      </a>
    </div>
    <p>The memory in the Core Storage 2020 host is rated for operation at 2933 MHz; however, in the 8 x 32GB configuration we need on these hosts, the Intel Xeon 6210U processor clocks it at 2666 MHz. Compared to the previous generation, this still gives us a 13% boost in memory speed. While we would get a slightly higher clock speed with a balanced, six-DIMM configuration, we determined that the additional RAM capacity provided by the 8 x 32GB configuration was worth the sacrifice.</p>
    <div>
      <h3>Storage Changes</h3>
      <a href="#storage-changes">
        
      </a>
    </div>
    <p>Data capacity stayed the same, with 12 x 10TB SATA drives in a RAID 0 configuration for best throughput. Unlike the previous generation, the drives in the Core Storage 2020 host are helium-filled. Helium produces less drag than air, resulting in potentially lower latency.</p>
    <div>
      <h3>Core Storage 2020 Synthetic benchmarking</h3>
      <a href="#core-storage-2020-synthetic-benchmarking">
        
      </a>
    </div>
    <p>We performed synthetic four corners benchmarking: IOPS measurements of random reads and writes using 4k block size, and bandwidth measurements of sequential reads and writes using 128k block size. We used the fio tool to see what improvements we would get in a lab environment. The results show a 10% latency improvement and 11% IOPS improvement in random read performance. Random write testing shows 38% lower latency and 60% higher IOPS. Write throughput is improved by 23%, and read throughput is improved by a whopping 90%.</p><p></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% Improvement</b></p></td></tr><tr><td><p>4k Random Reads (IOPS)</p></td><td><p>3,384</p></td><td><p>3,758</p></td><td><p>11.0%</p></td></tr><tr><td><p>4k Random Read Mean Latency (ms, lower is better)</p></td><td><p>75.4</p></td><td><p>67.8</p></td><td><p>10.1% lower</p></td></tr><tr><td><p>4k Random Writes (IOPS)</p></td><td><p>4,009</p></td><td><p>6,397</p></td><td><p>59.6%</p></td></tr><tr><td><p>4k Random Write Mean Latency (ms, lower is better)</p></td><td><p>63.5</p></td><td><p>39.7</p></td><td><p>37.5% lower</p></td></tr><tr><td><p>128k Sequential Reads (MB/s)</p></td><td><p>1,155</p></td><td><p>2,195</p></td><td><p>90.0%</p></td></tr><tr><td><p>128k Sequential Writes (MB/s)</p></td><td><p>1,265</p></td><td><p>1,558</p></td><td><p>23.2%</p></td></tr></table>
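<p>For reference, a four-corners run of this shape can be captured in a single fio job file. The following is a sketch rather than our exact lab configuration: the target device, queue depth, and runtime below are placeholder assumptions.</p>

```ini
; four-corners.fio -- sketch of a four-corners benchmark.
; /dev/md0, iodepth, and runtime are placeholders. Writing to a raw
; device destroys its data, so only point this at a scratch device.
[global]
filename=/dev/md0
direct=1            ; bypass the page cache
ioengine=libaio
iodepth=32
runtime=300
time_based=1
group_reporting=1
stonewall           ; inherited by every job: run them one at a time

[4k-random-read]
rw=randread
bs=4k

[4k-random-write]
rw=randwrite
bs=4k

[128k-seq-read]
rw=read
bs=128k

[128k-seq-write]
rw=write
bs=128k
```

<p>Each job runs in turn, and fio reports IOPS, bandwidth, and latency for each corner, which is how the numbers in the table above can be compared across hardware generations.</p>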
    <div>
      <h3>CPU frequencies</h3>
      <a href="#cpu-frequencies">
        
      </a>
    </div>
    <p>The higher base and turbo frequencies of the Core Storage 2020 host’s Xeon 6210U processor allowed that processor to achieve higher average frequencies while running our production ClickHouse workload. A recent snapshot of two production hosts showed the Core Storage 2020 host being able to sustain an average of 31% higher CPU frequency while running ClickHouse.</p><table><tr><td><p>
</p></td><td><p><b>Previous generation (average core frequency)</b></p></td><td><p><b>Core Storage 2020 (average core frequency)</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean Core Frequency</p></td><td><p>2441 MHz</p></td><td><p>3199 MHz</p></td><td><p>31%</p></td></tr></table>
    <div>
      <h3>Core Storage 2020 Production benchmarking</h3>
      <a href="#core-storage-2020-production-benchmarking">
        
      </a>
    </div>
    <p>Our ClickHouse database hosts are continually performing merge operations to optimize the database data structures. Each individual merge operation takes just a few seconds on average, but since they’re constantly running, they can consume significant resources on the host. We sampled the average merge time every five minutes over seven days, then aggregated the samples to find the average, minimum, and maximum merge times reported by a Core Storage 2020 host and by a previous generation host. Results are summarized below.</p>
    <div>
      <h3>ClickHouse merge operation performance improvement</h3>
      <a href="#clickhouse-merge-operation-performance-improvement">
        
      </a>
    </div>
    <table><tr><td><p><b>Time (seconds)</b></p></td><td><p><b>Previous generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean time to merge</p></td><td><p>1.83</p></td><td><p>1.15</p></td><td><p>37% lower</p></td></tr><tr><td><p>Maximum merge time</p></td><td><p>3.51</p></td><td><p>2.35</p></td><td><p>33% lower</p></td></tr><tr><td><p>Minimum merge time</p></td><td><p>0.68</p></td><td><p>0.32</p></td><td><p>53% lower</p></td></tr></table><p>Our lab-measured CPU frequency and storage performance improvements on Core Storage 2020 have translated into significantly reduced times to perform this database operation.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With our Core 2020 servers, we were able to realize significant performance improvements, both in synthetic benchmarking outside production and in the production workloads we tested. This will allow Cloudflare to run the same workloads on fewer servers, saving CapEx costs and data center rack space. The similarity of the configuration of the Kubernetes and Kafka hosts should help with fleet management and spare parts management. For our next redesign, we will try to further converge the designs on which we run the major Core workloads to further improve efficiency.</p><p>Special thanks to Will Buckner and Chris Snook for their help in the development of these servers, and to Tim Bart for validating CSAM Scanning Tool’s performance on Compute.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[ClickHouse]]></category>
            <category><![CDATA[Gen X]]></category>
            <guid isPermaLink="false">4fzfrcZ8XekQ1ykkMxltp1</guid>
            <dc:creator>Brian Bassett</dc:creator>
        </item>
        <item>
            <title><![CDATA[Automated Origin CA for Kubernetes]]></title>
            <link>https://blog.cloudflare.com/automated-origin-ca-for-kubernetes/</link>
            <pubDate>Fri, 13 Nov 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ Today we're releasing origin-ca-issuer, an extension to cert-manager integrating with Cloudflare Origin CA to easily create and renew certificates for your account's domains. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>In 2016, we launched the <a href="/cloudflare-ca-encryption-origin/">Cloudflare Origin CA</a>, a certificate authority optimized for making it easy to secure the connection between Cloudflare and an origin server. Running our own CA has allowed us to support fast issuance and renewal, simple and effective revocation, and wildcard certificates for our users.</p><p>Out of the box, managing <a href="https://www.cloudflare.com/application-services/products/ssl/">TLS certificates</a> and keys within Kubernetes can be challenging and error prone. The secret resources have to be constructed correctly, as components expect secrets with specific fields. Some forms of domain verification require manually rotating secrets to pass. Once you're successful, don't forget to renew before the certificate expires!</p><p><a href="https://cert-manager.io/">cert-manager</a> is a project to fill this operational gap, providing Kubernetes resources that <a href="https://www.cloudflare.com/application-services/solutions/certificate-lifecycle-management/">manage the lifecycle of a certificate.</a> Today we're releasing <a href="https://github.com/cloudflare/origin-ca-issuer">origin-ca-issuer</a>, an extension to cert-manager integrating with Cloudflare Origin CA to easily create and renew certificates for your account's domains.</p>
    <div>
      <h2>Origin CA Integration</h2>
      <a href="#origin-ca-integration">
        
      </a>
    </div>
    
    <div>
      <h3>Creating an Issuer</h3>
      <a href="#creating-an-issuer">
        
      </a>
    </div>
    <p>After installing cert-manager and origin-ca-issuer, you can create an OriginIssuer resource. This resource creates a binding between cert-manager and the Cloudflare API for an account. Different issuers may be connected to different Cloudflare accounts in the same Kubernetes cluster.</p>
            <pre><code>apiVersion: cert-manager.k8s.cloudflare.com/v1
kind: OriginIssuer
metadata:
  name: prod-issuer
  namespace: default
spec:
  signatureType: OriginECC
  auth:
    serviceKeyRef:
      name: service-key
      key: key</code></pre>
            <p>This creates a new OriginIssuer named "prod-issuer" that issues certificates using ECDSA signatures, and the secret "service-key" in the same namespace is used to authenticate to the Cloudflare API.</p>
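<p>The OriginIssuer above references a secret named "service-key" with the key material under the field "key". As a sketch, such a secret could be created with a manifest along these lines; the key value below is a placeholder, with real Origin CA service keys issued by Cloudflare:</p>

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: service-key
  namespace: default
type: Opaque
stringData:
  # Placeholder; substitute your account's Origin CA service key.
  key: v1.0-placeholder-origin-ca-service-key
```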
    <div>
      <h3>Signing an Origin CA Certificate</h3>
      <a href="#signing-an-origin-ca-certificate">
        
      </a>
    </div>
    <p>After creating an OriginIssuer, we can now create a Certificate with cert-manager. This defines the domains, including wildcards, that the certificate should be issued for, how long the certificate should be valid, and when cert-manager should renew the certificate.</p>
            <pre><code>apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
  namespace: default
spec:
  # The secret name where cert-manager
  # should store the signed certificate.
  secretName: example-com-tls
  dnsNames:
    - example.com
  # Duration of the certificate.
  duration: 168h
  # Renew a day before the certificate expiration.
  renewBefore: 24h
  # Reference the Origin CA Issuer you created above,
  # which must be in the same namespace.
  issuerRef:
    group: cert-manager.k8s.cloudflare.com
    kind: OriginIssuer
    name: prod-issuer
</code></pre>
            <p>Once created, cert-manager begins managing the lifecycle of this certificate, including creating the key material, crafting a certificate signing request (CSR), and constructing a certificate request that will be processed by the origin-ca-issuer.</p><p>When signed by the Cloudflare API, the certificate will be made available, along with the private key, in the Kubernetes secret specified in the secretName field. You'll be able to use this certificate on servers proxied behind Cloudflare.</p>
    <div>
      <h3>Extra: Ingress Support</h3>
      <a href="#extra-ingress-support">
        
      </a>
    </div>
    <p>If you're using an Ingress controller, you can use cert-manager's <a href="https://cert-manager.io/docs/usage/ingress/">Ingress support</a> to automatically manage Certificate resources based on your Ingress resource.</p>
            <pre><code>apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/issuer: prod-issuer
    cert-manager.io/issuer-kind: OriginIssuer
    cert-manager.io/issuer-group: cert-manager.k8s.cloudflare.com
  name: example
  namespace: default
spec:
  rules:
    - host: example.com
      http:
        paths:
          - backend:
              serviceName: examplesvc
              servicePort: 80
            path: /
  tls:
    # specifying a host in the TLS section will tell cert-manager 
    # what DNS SANs should be on the created certificate.
    - hosts:
        - example.com
      # cert-manager will create this secret
      secretName: example-tls
</code></pre>
            
    <div>
      <h2>Building an External cert-manager Issuer</h2>
      <a href="#building-an-external-cert-manager-issuer">
        
      </a>
    </div>
    <p>An external cert-manager issuer is a specialized Kubernetes controller. There's no direct communication between cert-manager and external issuers at all; this means that you can use any existing tools and best practices for developing controllers to develop an external issuer.</p><p>We've decided to use the excellent <a href="https://github.com/kubernetes-sigs/controller-runtime">controller-runtime</a> project to build origin-ca-issuer, running two reconciliation controllers.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MASdtZ7CMaW8JO3SCgb5t/4956cb2ac00a920a9901605adf1e9f68/image1-4.png" />
            
            </figure>
    <div>
      <h3>OriginIssuer Controller</h3>
      <a href="#originissuer-controller">
        
      </a>
    </div>
    <p>The OriginIssuer controller watches for creation and modification of OriginIssuer custom resources. The controller creates a Cloudflare API client using the referenced details and credentials. This API client will later be used to sign certificates through the API. The controller periodically retries creating the API client; once it succeeds, it updates the OriginIssuer's status to ready.</p>
    <div>
      <h3>CertificateRequest Controller</h3>
      <a href="#certificaterequest-controller">
        
      </a>
    </div>
    <p>The CertificateRequest controller watches for the creation and modification of cert-manager's CertificateRequest resources. These resources are created automatically by cert-manager as needed during a certificate's lifecycle.</p><p>The controller looks for certificate requests that reference a known OriginIssuer (this reference is copied by cert-manager from the originating Certificate resource) and ignores all resources that do not match. The controller then verifies that the OriginIssuer is in the ready state before transforming the certificate request into an API request using the previously created client.</p><p>On a successful response, the signed certificate is added to the certificate request, which cert-manager will use to create or update the secret resource. On an unsuccessful request, the controller will periodically retry.</p>
    <div>
      <h2>Learn More</h2>
      <a href="#learn-more">
        
      </a>
    </div>
    <p>Up-to-date documentation and complete installation instructions can be found in our <a href="https://github.com/cloudflare/origin-ca-issuer">GitHub repository</a>. Feedback and contributions are greatly appreciated. If you're interested in Kubernetes at Cloudflare, including building controllers like these, <a href="https://www.cloudflare.com/careers/jobs/">we're hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[SSL]]></category>
            <category><![CDATA[TLS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[API]]></category>
            <guid isPermaLink="false">7akG4xBepli4ZP133CXBCf</guid>
            <dc:creator>Terin Stock</dc:creator>
        </item>
        <item>
            <title><![CDATA[Secondary DNS - Deep Dive]]></title>
            <link>https://blog.cloudflare.com/secondary-dns-deep-dive/</link>
            <pubDate>Tue, 15 Sep 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ The goal of Cloudflare operated Secondary DNS is to allow our customers with custom DNS solutions, be it on-premise or some other DNS provider, to be able to take advantage of Cloudflare's DNS performance and more recently, through Secondary Override, our proxying and security capabilities too. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>How Does Secondary DNS Work?</h2>
      <a href="#how-does-secondary-dns-work">
        
      </a>
    </div>
    <p>If you already understand how Secondary DNS works, please feel free to skip this section. It does not provide any Cloudflare-specific information.</p><p>Secondary DNS has many use cases across the Internet; however, traditionally, it was used as a synchronized backup for when the primary DNS server was unable to respond to queries. A more modern approach involves focusing on redundancy across many different nameservers, which in many cases announce the same anycast IP address.</p><p>Secondary DNS involves the unidirectional transfer of DNS zones from the primary to the Secondary DNS server(s). One primary can have any number of Secondary DNS servers that it must communicate with in order to keep track of any zone updates. A zone update is considered a change in the contents of a zone, which ultimately leads to a Start of Authority (SOA) serial number increase. The zone’s SOA serial is one of the key elements of Secondary DNS; it is how primary and secondary servers synchronize zones. Below is an example of what an SOA record might look like in a dig query.</p>
            <pre><code>example.com	3600	IN	SOA	ashley.ns.cloudflare.com. dns.cloudflare.com. 
2034097105  // Serial
10000 // Refresh
2400 // Retry
604800 // Expire
3600 // Minimum TTL</code></pre>
            <p>Each of the numbers is used in the following way:</p><ol><li><p>Serial - Used to keep track of the status of the zone; must be incremented at every change.</p></li><li><p>Refresh - The maximum number of seconds that can elapse before a Secondary DNS server must check for an SOA serial change.</p></li><li><p>Retry - The maximum number of seconds that can elapse before a Secondary DNS server must check for an SOA serial change, after previously failing to contact the primary.</p></li><li><p>Expire - The maximum number of seconds that a Secondary DNS server can serve stale information, in the event the primary cannot be contacted.</p></li><li><p>Minimum TTL - Per <a href="https://tools.ietf.org/html/rfc2308">RFC 2308</a>, the number of seconds that a DNS negative response should be cached for.</p></li></ol><p>Using the above information, the Secondary DNS server stores an SOA record for each of the zones it is tracking. When the serial increases, it knows that the zone must have changed and that a zone transfer must be initiated.</p>
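<p>One subtlety when comparing serials is that they are 32-bit counters that eventually wrap around, so "increases" are defined by serial number arithmetic (<a href="https://tools.ietf.org/html/rfc1982">RFC 1982</a>). A minimal Go sketch of the comparison, not taken from any production implementation:</p>

```go
package main

import "fmt"

// serialNewer reports whether serial a is newer than serial b using
// RFC 1982 serial number arithmetic, which tolerates 32-bit wrap-around.
// The RFC leaves serials exactly 2^31 apart undefined; this sketch
// treats that case as "not newer".
func serialNewer(a, b uint32) bool {
	return a != b && int32(a-b) > 0
}

func main() {
	fmt.Println(serialNewer(2034097106, 2034097105)) // true: simple increment
	fmt.Println(serialNewer(5, 4294967290))          // true: serial wrapped past 2^32-1
	fmt.Println(serialNewer(2034097105, 2034097105)) // false: no change, no transfer
}
```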
    <div>
      <h2>Serial Tracking</h2>
      <a href="#serial-tracking">
        
      </a>
    </div>
    <p>Serial increases can be detected in the following ways:</p><ol><li><p>The fastest way for the Secondary DNS server to keep track of a serial change is for the primary server to NOTIFY it any time a zone has changed, using the DNS protocol as specified in <a href="https://www.ietf.org/rfc/rfc1996.txt">RFC 1996</a>. Secondary DNS servers can then instantly initiate a zone transfer.</p></li><li><p>Another way is for the Secondary DNS server to simply poll the primary every “Refresh” seconds. This isn’t as fast as the NOTIFY approach, but it is a good fallback in case the NOTIFY messages have failed.</p></li></ol><p>One of the issues with the basic NOTIFY protocol is that anyone on the Internet could potentially notify the Secondary DNS server of a zone update. If an initial SOA query is not performed by the Secondary DNS server before initiating a zone transfer, this is an easy way to perform an <a href="https://www.cloudflare.com/learning/ddos/dns-amplification-ddos-attack/">amplification attack</a>. There are two common ways to prevent anyone on the Internet from being able to NOTIFY Secondary DNS servers:</p><ol><li><p>Using transaction signatures (TSIG) as per <a href="https://tools.ietf.org/html/rfc2845">RFC 2845</a>. These are placed as the last record in the extra records section of the DNS message. Usually the number of extra records (or ARCOUNT) should be no more than two in this case.</p></li><li><p>Using IP-based access control lists (ACLs). This increases security but also reduces flexibility in server location and IP address allocation.</p></li></ol><p>Generally NOTIFY messages are sent over UDP; however, TCP can be used in the event the primary server has reason to believe that TCP is necessary (e.g. firewall issues).</p>
    <div>
      <h2>Zone Transfers</h2>
      <a href="#zone-transfers">
        
      </a>
    </div>
    <p>In addition to serial tracking, it is important to ensure that a standard protocol is used between primary and Secondary DNS server(s), to efficiently transfer the zone. DNS zone transfer protocols do not attempt to provide confidentiality, authentication, or integrity on their own; however, the use of TSIG on top of the basic zone transfer protocols can provide integrity and authentication. As a result of <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> being a public protocol, confidentiality during the zone transfer process is generally not a concern.</p>
    <div>
      <h3>Authoritative Zone Transfer (AXFR)</h3>
      <a href="#authoritative-zone-transfer-axfr">
        
      </a>
    </div>
    <p>AXFR is the original zone transfer protocol, specified in <a href="https://tools.ietf.org/html/rfc1034">RFC 1034</a> and <a href="https://tools.ietf.org/html/rfc1035">RFC 1035</a> and later further explained in <a href="https://tools.ietf.org/html/rfc5936">RFC 5936</a>. AXFR is done over a TCP connection because a reliable protocol is needed to ensure packets are not lost during the transfer. Using this protocol, the primary DNS server transfers all of the zone contents to the Secondary DNS server in one connection, regardless of the serial number. AXFR is recommended for the first zone transfer, when the secondary has no copy of the zone yet; IXFR is recommended after that.</p>
    <div>
      <h3>Incremental Zone Transfer (IXFR)</h3>
      <a href="#incremental-zone-transfer-ixfr">
        
      </a>
    </div>
    <p>IXFR is the more sophisticated zone transfer protocol, specified in <a href="https://tools.ietf.org/html/rfc1995">RFC 1995</a>. Unlike AXFR, during an IXFR the primary server sends the secondary server only the records that have changed since its current version of the zone (based on the serial number). This means that when a Secondary DNS server wants to initiate an IXFR, it sends its current serial number to the primary DNS server. The primary DNS server then formats its response based on previous versions of changes made to the zone. IXFR messages must obey the following pattern:</p><ol><li><p><b><i>Current latest SOA</i></b></p></li><li><p><b><i>Secondary server current SOA</i></b></p></li><li><p><b><i>DNS record deletions</i></b></p></li><li><p><b><i>Secondary server current SOA + changes</i></b></p></li><li><p><b><i>DNS record additions</i></b></p></li><li><p><b><i>Current latest SOA</i></b></p></li></ol><p>Steps 2 through 6 can be repeated any number of times, as each repetition represents one change set of deletions and additions, ultimately leading to a new serial.</p><p>IXFR can be done over UDP or TCP, but again TCP is generally recommended to avoid packet loss.</p>
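<p>The repeating change-set pattern above can be illustrated with a short Go sketch. The types and names here are invented for illustration (a zone is reduced to a set of name/rdata pairs); the point is the replay logic and the serial check that would send a real secondary back to a full AXFR:</p>

```go
package main

import "fmt"

// record is a deliberately simplified DNS record (name and rdata only),
// invented for this illustration.
type record struct{ name, rdata string }

// changeSet mirrors one IXFR delta: the records removed since fromSerial,
// the records added, and the resulting toSerial. A real IXFR response can
// carry several of these back to back (steps 2-6 repeating).
type changeSet struct {
	fromSerial, toSerial uint32
	deletions, additions []record
}

// apply replays IXFR-style change sets on top of a zone snapshot. It
// refuses a delta whose starting serial does not match the zone's current
// serial; a real secondary would fall back to a full AXFR at that point.
func apply(zone map[record]bool, serial uint32, deltas []changeSet) (uint32, error) {
	for _, d := range deltas {
		if d.fromSerial != serial {
			return serial, fmt.Errorf("serial mismatch: have %d, delta starts at %d", serial, d.fromSerial)
		}
		for _, r := range d.deletions {
			delete(zone, r)
		}
		for _, r := range d.additions {
			zone[r] = true
		}
		serial = d.toSerial
	}
	return serial, nil
}

func main() {
	zone := map[record]bool{{"www.example.com.", "192.0.2.1"}: true}
	serial, err := apply(zone, 2034097105, []changeSet{{
		fromSerial: 2034097105,
		toSerial:   2034097106,
		deletions:  []record{{"www.example.com.", "192.0.2.1"}},
		additions:  []record{{"www.example.com.", "192.0.2.2"}},
	}})
	fmt.Println(serial, err) // 2034097106 <nil>
}
```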
    <div>
      <h2>How Does Secondary DNS Work at Cloudflare?</h2>
      <a href="#how-does-secondary-dns-work-at-cloudflare">
        
      </a>
    </div>
    <p>The DNS team loves microservice architecture! When we initially implemented Secondary DNS at Cloudflare, it was done using <a href="https://mesosphere.github.io/marathon/">Mesos Marathon</a>. This allowed us to separate each of our services into several different Marathon apps, individually scaling apps as needed. All of these services live in our core data centers. The following services were created:</p><ol><li><p>Zone Transferer - responsible for attempting IXFR, followed by AXFR if IXFR fails.</p></li><li><p>Zone Transfer Scheduler - responsible for periodically checking zone SOA serials for changes.</p></li><li><p>REST API - responsible for registering new zones and primary nameservers.</p></li></ol><p>In addition to the Marathon apps, we also had an app external to the cluster:</p><ol><li><p>Notify Listener - responsible for listening for NOTIFY messages from primary servers and telling the Zone Transferer to initiate an AXFR/IXFR.</p></li></ol><p>Each of these microservices communicates with the others through <a href="https://kafka.apache.org/">Kafka</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6cS9HY6kpiqdrdyD3tqQjI/55d2a781d351cc854f416dd9b1e7d4f3/image8-1.png" />
            
            </figure><p>Figure 1: Secondary DNS Microservice Architecture</p><p>Once the zone transferer completes the AXFR/IXFR, it passes the zone through to our zone builder, and the zone is finally pushed out to our edge at each of our <a href="https://www.cloudflare.com/network/">200 locations</a>.</p><p>Although this architecture worked great in the beginning, it left us open to many vulnerabilities and scalability issues down the road. As our Secondary DNS product became more popular, it was important that we proactively scaled and reduced the technical debt as much as possible. As with many companies in the industry, Cloudflare has recently migrated all of our core data center services to <a href="https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/">Kubernetes</a>, moving away from individually managed apps and Marathon clusters.</p><p>What this meant for Secondary DNS is that all of our Marathon-based services, as well as our NOTIFY Listener, had to be migrated to Kubernetes. Although this long migration ended up paying off, many difficult challenges arose along the way that required us to come up with unique solutions in order to have a seamless, zero-downtime migration.</p>
    <div>
      <h2>Challenges When Migrating to Kubernetes</h2>
      <a href="#challenges-when-migrating-to-kubernetes">
        
      </a>
    </div>
    <p>Although the entire DNS team agreed that Kubernetes was the way forward for Secondary DNS, it also introduced several challenges. These challenges arose from a need to properly scale up across many distributed locations while also protecting each of our individual data centers. Since our core does not rely on anycast to automatically distribute requests, introducing more customers opens us up to denial-of-service attacks.</p><p>The two main issues we ran into during the migration were:</p><ol><li><p>How do we create a distributed and reliable system that makes use of Kubernetes principles while also making sure our customers know which IPs we will be communicating from?</p></li><li><p>When opening up a public-facing UDP socket to the Internet, how do we protect ourselves while also preventing unnecessary spam towards primary nameservers?</p></li></ol>
    <div>
      <h3>Issue 1:</h3>
      <a href="#issue-1">
        
      </a>
    </div>
    <p>As was previously mentioned, one form of protection in the Secondary DNS protocol is to only allow certain IPs to initiate zone transfers. There is a fine line between primary servers allowlisting too many IPs and having to frequently update their IP ACLs. We considered several solutions:</p><ol><li><p><a href="https://github.com/nirmata/kube-static-egress-ip">Open source k8s controllers</a></p></li><li><p>Altering <a href="https://en.wikipedia.org/wiki/Network_address_translation">Network Address Translation (NAT)</a> entries</p></li><li><p>Do not use k8s for zone transfers</p></li><li><p>Allowlist all Cloudflare IPs and dynamically update</p></li><li><p>Proxy egress traffic</p></li></ol><p>Ultimately, we decided to proxy our egress traffic from k8s to the DNS primary servers using static proxy addresses. <a href="https://github.com/shadowsocks/shadowsocks-libev">Shadowsocks-libev</a> was chosen as the <a href="https://en.wikipedia.org/wiki/SOCKS">SOCKS5</a> implementation because it is fast, secure, and known to scale. In addition, it can handle both UDP/TCP and IPv4/IPv6.</p>
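<p>A shadowsocks-libev server instance that carries both TCP and UDP can be configured with a small JSON file. This is an illustrative sketch with placeholder address, port, and password, not our production configuration:</p>

```json
{
    "server": "0.0.0.0",
    "server_port": 8388,
    "password": "replace-with-a-strong-secret",
    "method": "chacha20-ietf-poly1305",
    "mode": "tcp_and_udp",
    "timeout": 300
}
```

<p>The "mode" setting is what lets a single proxy handle both the TCP zone transfers and UDP DNS traffic.</p>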
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2JcJoxa44FoxHtMrWBXCP5/62805a1a8091ba1597f33614bece420b/image1-8.png" />
            
            </figure><p>Figure 2: Shadowsocks Proxy Setup</p><p>The partnership of k8s and Shadowsocks, combined with a large enough IP range, brings many benefits:</p><ol><li><p>Horizontal scaling</p></li><li><p>Efficient load balancing</p></li><li><p>Primary server ACLs only need to be updated once</p></li><li><p>It allows us to make use of Kubernetes for both the Zone Transferer and the local Shadowsocks proxy</p></li><li><p>The Shadowsocks proxy can be reused by many different Cloudflare services</p></li></ol>
    <div>
      <h3>Issue 2:</h3>
      <a href="#issue-2">
        
      </a>
    </div>
    <p>The Notify Listener requires listening on static IPs for NOTIFY messages coming from primary DNS servers. This is mostly a solved problem through the use of <a href="https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-typeloadbalancer">k8s services of type loadbalancer</a>; however, exposing this service directly to the Internet makes us uneasy because of its susceptibility to <a href="https://www.cloudflare.com/learning/ddos/dns-flood-ddos-attack/">attacks</a>. Fortunately, <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> is one of Cloudflare's strengths, which led us to the likely solution of <a href="https://en.wikipedia.org/wiki/Eating_your_own_dog_food">dogfooding</a> one of our own products, <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum</a>.</p><p>Spectrum provides the following features to our service:</p><ol><li><p>Reverse proxy TCP/UDP traffic</p></li><li><p>Filter out malicious traffic</p></li><li><p>Optimal routing from edge to core data centers</p></li><li><p><a href="https://www.cisco.com/c/dam/en_us/solutions/industries/docs/gov/IPV6at_a_glance_c45-625859.pdf">Dual Stack</a> technology</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S3Ha1uEWxV2bbrzSJY9y2/84a6c819cf54b3ade09d9e36ecad8ac2/image5-3.png" />
            
            </figure><p>Figure 3: Spectrum interaction with Notify Listener</p><p>Figure 3 shows two interesting attributes of the system:</p><ol><li><p><b>Spectrum &lt;-&gt; k8s is IPv4 only:</b> our custom k8s load balancer currently supports only IPv4; however, Spectrum has no issue terminating the IPv6 connection and establishing a new IPv4 connection.</p></li><li><p><b>Spectrum &lt;-&gt; k8s routing decisions are based on the L4 protocol:</b> k8s only supports one of TCP/UDP/SCTP per service of type load balancer. Once again, Spectrum has no issue proxying this correctly.</p></li></ol><p>One of the problems with using an L4 proxy between services is that the source IP address gets rewritten to the source IP address of the proxy (Spectrum in this case). Not knowing the source IP address means we have no idea who sent the NOTIFY message, opening us up to attack vectors. Fortunately, Spectrum’s <a href="https://developers.cloudflare.com/spectrum/getting-started/proxy-protocol/">proxy protocol</a> feature can add custom headers to TCP/UDP packets that carry the source IP/port information.</p><p>Because we use <a href="https://github.com/miekg/dns">miekg/dns</a> for our Notify Listener, leaving proxy headers attached to the DNS NOTIFY messages would cause validation failures at the DNS server level. Instead, we implemented custom <a href="https://github.com/miekg/dns/blob/master/server.go#L156-L162">read and write decorators</a> that do the following:</p><ol><li><p><b>Reader:</b> Extract source address information from inbound NOTIFY messages, and place it into new DNS records in the additional section of the message.</p></li><li><p><b>Writer:</b> Remove those additional records from outbound NOTIFY replies, and generate a new reply using proxy protocol headers.</p></li></ol><p>There is no way to spoof these records, because the server only permits two extra records, one of which is the optional TSIG. Any other records will be overwritten.</p>
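To make the reader side concrete: the first job is splitting the proxy header off the front of the payload and recovering the real source address. The sketch below uses the widely deployed PROXY protocol v1 text format as an illustration (Spectrum deployments may use a different header variant, and the helper name is ours, not Cloudflare's code):

```python
def parse_proxy_v1(payload: bytes):
    """Split a PROXY protocol v1 header off the front of a TCP payload,
    returning (src_ip, src_port, rest), where rest is the DNS message."""
    header, sep, rest = payload.partition(b"\r\n")
    if not sep or not header.startswith(b"PROXY "):
        raise ValueError("no PROXY v1 header present")
    fields = header.split()
    # Header layout: PROXY <TCP4|TCP6> <src ip> <dst ip> <src port> <dst port>
    if len(fields) != 6:
        raise ValueError("malformed PROXY v1 header")
    return fields[2].decode(), int(fields[4]), rest
```

The reader decorator would then tuck the recovered address into the additional section of the NOTIFY message, and the writer decorator would strip it back out before the reply goes on the wire.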
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5pg2MCnr4MUQapXAtQRLeK/365f1bdaec1c60c88ae408b5c0dea128/image4-6.png" />
            
            </figure><p>Figure 4: Proxying Records Between Notifier and Spectrum</p><p>This custom decorator approach abstracts the proxying away from the Notify Listener through the use of the DNS protocol.</p><p>Although knowing the source IP will block a significant amount of bad traffic, NOTIFY messages can arrive over both UDP and TCP, and UDP is prone to <a href="/the-root-cause-of-large-ddos-ip-spoofing/">IP spoofing</a>. To ensure that the primary servers do not get spammed, we have made the following additions to the Zone Transferer:</p><ol><li><p>Always ensure that the SOA has actually been updated before initiating a zone transfer.</p></li><li><p>Only allow at most one ongoing transfer and one scheduled transfer per zone.</p></li></ol>
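The second rule amounts to a small piece of per-zone bookkeeping. The sketch below is a simplified illustration with hypothetical names (not Cloudflare's actual code): it admits at most one running and one scheduled transfer per zone and drops everything else as redundant:

```python
class TransferScheduler:
    """Allow at most one running and one scheduled transfer per zone."""

    def __init__(self):
        self._running = set()
        self._scheduled = set()

    def offer(self, zone):
        """Called for each transfer request; False means drop it."""
        if zone not in self._running:
            self._running.add(zone)
            return True
        if zone in self._scheduled:
            return False  # one is already queued; drop the duplicate
        self._scheduled.add(zone)
        return True

    def done(self, zone):
        """Called when a transfer finishes; a scheduled transfer, if
        any, immediately becomes the running one."""
        if zone in self._scheduled:
            self._scheduled.discard(zone)  # promoted: zone stays running
        else:
            self._running.discard(zone)
```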
    <div>
      <h2>Additional Technical Challenges</h2>
      <a href="#additional-technical-challenges">
        
      </a>
    </div>
    
    <div>
      <h3>Zone Transferer Scheduling</h3>
      <a href="#zone-transferer-scheduling">
        
      </a>
    </div>
    <p>As shown in figure 1, there are several ways of sending Kafka messages to the Zone Transferer in order to initiate a zone transfer. There is no benefit in having a large backlog of zone transfers for the same zone. Once a zone has been transferred, assuming no further changes, it does not need to be transferred again. This means that we should have at most one ongoing transfer, and at most one scheduled transfer, for any zone at the same time.</p><p>If we want to limit the number of scheduled messages to one per zone, we have to ignore some of the Kafka messages sent to the Zone Transferer. This is not as simple as ignoring specific messages in any arbitrary order. One of the benefits of Kafka is that it holds on to messages until the consumer actually decides to acknowledge them, by committing that message’s offset. Since Kafka is just a queue of messages, it has no concept of order other than first in, first out (FIFO): committing an offset implicitly acknowledges every earlier offset in the partition. If a consumer reads from the Kafka topic concurrently, it is entirely possible that a message from the middle of the queue is committed before a message from further back.</p><p>Most of the time this isn’t an issue, because we know that one of the concurrent readers has read the earlier message and is processing it. There is one Kubernetes-related catch, though: pods are ephemeral. The kube master doesn’t care what your concurrent reader is doing; it will kill the pod, and it’s up to your application to handle it.</p><p>Consider the following problem:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2LbPUGXSdf0uzNeI68U06Q/13295276ff697029d06c58b2e4a9df81/image2-2.png" />
            
            </figure><p>Figure 5: Kafka Partition</p><ol><li><p>Read offset 1. Start transferring zone 1.</p></li><li><p>Read offset 2. Start transferring zone 2.</p></li><li><p>Zone 2 transfer finishes. Commit offset 2, which implicitly also marks offset 1 as processed.</p></li><li><p>Restart pod.</p></li><li><p>Read offset 3. Start transferring zone 3.</p></li></ol><p>If these events happen, zone 1 will never be transferred. It is important that zones stay up to date with the primary servers, otherwise stale data will be served from the Secondary DNS server. The solution to this problem involves using a list to track which messages have been read and completely processed. In this case, when a zone transfer has finished, it does not necessarily mean that the Kafka message should be immediately committed. The solution is as follows:</p><ol><li><p>Keep a list of Kafka messages, sorted by offset.</p></li><li><p>When a transfer finishes, remove its message from the list.</p></li><li><p>If that message was the oldest in the list, commit its offset.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1g6Ku6Is2fixeHyiTIoEwC/0bb4026298db401018aefc89b027a717/image9-2.png" />
            
            </figure><p>Figure 6: Kafka Algorithm to Solve Message Loss</p><p>This solution essentially soft-commits Kafka messages until we can confidently say that all older messages have been fully processed. It’s important to note that this only truly works in a distributed manner if the Kafka messages are keyed by zone id, which ensures that the same zone is always processed by the same Kafka consumer.</p>
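The offset-tracking algorithm above can be sketched in a few lines (a simplified illustration with hypothetical names, not Cloudflare's actual implementation):

```python
class OffsetTracker:
    """Soft-commit Kafka offsets: an offset may only be committed once
    every earlier in-flight offset has also finished processing."""

    def __init__(self):
        self._in_flight = {}  # offset -> finished?

    def start(self, offset):
        """Record that the message at `offset` has been read."""
        self._in_flight[offset] = False

    def finish(self, offset):
        """Mark `offset` done; return the highest offset that is now
        safe to commit, or None while the oldest message is unfinished."""
        self._in_flight[offset] = True
        commit = None
        for o in sorted(self._in_flight):
            if not self._in_flight[o]:
                break  # oldest unfinished message blocks the commit
            commit = o
            del self._in_flight[o]
        return commit
```

Replaying the failure scenario from figure 5 against this tracker: finishing offset 2 first commits nothing, so a pod restart can no longer lose zone 1.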
    <div>
      <h2>Life of a Secondary DNS Request</h2>
      <a href="#life-of-a-secondary-dns-request">
        
      </a>
    </div>
    <p>Although Cloudflare has a <a href="https://www.cloudflare.com/network/">large global network</a>, as shown above, the zone transferring process does not take place at each of the edge data center locations (which would surely overwhelm many primary servers), but rather in our core data centers. How, then, do we propagate changes to our edge in seconds? After transferring the zone, there are a couple more steps that need to be taken before the change can be seen at the edge.</p><ol><li><p>Zone Builder - This interacts with the Zone Transferer to build the zone according to what the Cloudflare edge understands. It then writes the result to <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a>, our super fast, distributed KV store.</p></li><li><p>Authoritative Server - This reads from Quicksilver and serves the built zone.</p></li></ol>
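As a toy illustration of this build/serve split (a plain dict stands in for Quicksilver, and the class and function names are hypothetical, not Cloudflare's code), the builder writes once and the authoritative server only ever reads:

```python
class ZoneBuilder:
    """Converts transferred zone data into the edge's serving format and
    writes it to a KV store (Quicksilver in the real pipeline)."""

    def __init__(self, kv):
        self.kv = kv  # any mapping-like store

    def build(self, zone, records):
        # The real builder normalizes records into Cloudflare's internal
        # format; here we simply key each record by (zone, name, type).
        for name, rtype, value in records:
            self.kv[f"{zone}/{name}/{rtype}"] = value


def serve(kv, zone, name, rtype):
    """Authoritative-server lookup: a straight read from the KV store."""
    return kv.get(f"{zone}/{name}/{rtype}")
```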
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3INig0h7w1oSks6FwirZls/c99581c1c0019f5fc5dd2a80884dc943/image3-6.png" />
            
            </figure><p>Figure 7: End to End Secondary DNS</p>
    <div>
      <h2>What About Performance?</h2>
      <a href="#what-about-performance">
        
      </a>
    </div>
    <p>At the time of writing this post, according to <a href="http://dnsperf.com">dnsperf.com</a>, Cloudflare leads in global performance for both <a href="https://www.dnsperf.com/">Authoritative</a> and <a href="https://www.dnsperf.com/#!dns-resolvers">Resolver</a> DNS. Secondary DNS falls under the authoritative DNS category here. Let’s break down the performance of each part of the Secondary DNS pipeline, from the primary server updating its records to those records being present at the Cloudflare edge.</p><ol><li><p>Primary Server to Notify Listener - Our most accurate measurement is only precise to the second, but we know UDP/TCP communication is likely much faster than that.</p></li><li><p>NOTIFY to Zone Transferer - This is negligible.</p></li><li><p>Zone Transferer to Primary Server - 99% of the time we see ~800ms as the average latency for a zone transfer.</p></li></ol>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/70uQkuuFtfY5UPVrza1ZpK/ccfa5d2a7c522369d34c8b0671056167/image7-2.png" />
            
            </figure><p>Figure 8: Zone XFR latency</p><p>4. Zone Transferer to Zone Builder - 99% of the time we see ~10ms to build a zone.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3cfNN2ilrDOES8QwG5ok4I/ab8e3bafc66ec05e55530d997f628a68/image11-1.png" />
            
            </figure><p>Figure 9: Zone Build time</p><p>5. Zone Builder to Quicksilver edge: 95% of the time we see less than 1s propagation.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1imVuigo0GB7gLug2ISqL7/2ad64a24445b9aa75094d78437710d16/image6-2.png" />
            
            </figure><p>Figure 10: Quicksilver propagation time</p><p>End to end latency: less than 5 seconds on average. Although we have several external probes running around the world to test propagation latencies, they lack precision due to their sleep intervals, locations, providers, and the number of zones they need to run against. The actual propagation latency is likely much lower than what is shown in figure 11, where each of the different colored dots is a separate data center location around the world.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5BjcchQWzBqyYbX6ksIYMj/0f167f037efdeaf5cece6a054b3095ab/image10.png" />
            
            </figure><p>Figure 11: End to End Latency</p><p>An additional test was performed manually to get a real-world estimate. The test had the following attributes:</p><ul><li><p>Primary server: NS1</p></li><li><p>Number of records changed: 1</p></li><li><p>Start test timer event: Change record on NS1</p></li><li><p>Stop test timer event: Observe record change at Cloudflare edge using dig</p></li><li><p>Recorded timer value: 6 seconds</p></li></ul>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Cloudflare serves 15.8 trillion DNS queries per month, operating within 100ms of 99% of the Internet-connected population. The goal of Cloudflare-operated Secondary DNS is to allow customers with custom DNS solutions, whether on-premise or at another DNS provider, to take advantage of Cloudflare's DNS performance and, more recently through <a href="/orange-clouding-with-secondary-dns/">Secondary Override</a>, our proxying and security capabilities too. Secondary DNS is currently available on the Enterprise plan; if you’d like to take advantage of it, please let your account team know. For additional documentation on Secondary DNS, please refer to our <a href="https://support.cloudflare.com/hc/en-us/articles/360001356152-How-do-I-setup-and-manage-Secondary-DNS-">support article</a>.</p>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Growth]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Deep Dive]]></category>
            <guid isPermaLink="false">2W2no7YfwWXkJXtgYtK4CU</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Releasing kubectl support in Access]]></title>
            <link>https://blog.cloudflare.com/releasing-kubectl-support-in-access/</link>
            <pubDate>Mon, 27 Apr 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ Starting today, you can use Cloudflare Access and Argo Tunnel to securely manage your Kubernetes cluster with the kubectl command-line tool. Add SSO requirements and a zero-trust model to your Kubernetes management in under 30 minutes. ]]></description>
            <content:encoded><![CDATA[ <p>Starting today, you can use Cloudflare Access and Argo Tunnel to securely manage your Kubernetes cluster with the kubectl command-line tool.</p><p>We built this to address one of the edge cases that stopped all of Cloudflare, as well as some of our customers, from disabling the VPN. With this workflow, you can add SSO requirements and a zero-trust model to your Kubernetes management in under 30 minutes.</p><p>Once deployed, you can migrate to Cloudflare Access for controlling Kubernetes clusters without disrupting your current <code>kubectl</code> workflow, a lesson we learned the hard way from dogfooding here at Cloudflare.</p>
    <div>
      <h3>What is kubectl?</h3>
      <a href="#what-is-kubectl">
        
      </a>
    </div>
    <p>A Kubernetes <a href="https://kubernetes.io/docs/concepts/overview/components/">deployment consists</a> of a cluster that contains nodes, which run the containers, as well as a control plane that can be used to manage those nodes. Central to that control plane is the Kubernetes API server, which interacts with components like the scheduler and manager.</p><p><a href="https://kubernetes.io/docs/tasks/tools/install-kubectl/">kubectl</a> is the Kubernetes command-line tool that developers can use to interact with that API server. Users run <code>kubectl</code> commands to perform actions like starting and stopping the nodes, or modifying other elements of the control plane.</p><p>In most deployments, users connect to a VPN that allows them to run commands against that API server by addressing it over the same local network. In that architecture, user traffic to run these commands must be backhauled through a physical or virtual VPN appliance. More concerning, in most cases the user connecting to the API server will also be able to connect to other addresses and ports in the private network where the cluster runs.</p>
    <div>
      <h3>How does Cloudflare Access apply?</h3>
      <a href="#how-does-cloudflare-access-apply">
        
      </a>
    </div>
    <p>Cloudflare Access can secure web applications as well as non-HTTP connections like <a href="https://www.cloudflare.com/learning/access-management/what-is-ssh/">SSH</a>, RDP, and the commands sent over <code>kubectl</code>. Access deploys Cloudflare’s network in front of all of these resources. Every time a request is made to one of these destinations, Cloudflare’s network checks for identity like a bouncer in front of each door.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/135CckjPYvwTnJPzEwqwjq/ecafea64a120ae117e7b63e192eb778b/image1-21.png" />
            
            </figure><p>If the request lacks identity, we send the user to your team’s SSO provider, like Okta, AzureAD, and G Suite, where the user can login. Once they login, they are redirected to Cloudflare where we check their identity against a list of users who are allowed to connect. If the user is permitted, we let their request reach the destination.</p><p>In most cases, those granular checks on every request would slow down the experience. However, Cloudflare Access completes the entire check in just a few milliseconds. The authentication flow relies on Cloudflare’s serverless product, <a href="https://workers.cloudflare.com/">Workers</a>, and runs in every one of our data centers in 200 cities around the world. With that distribution, we can improve performance for your applications while also authenticating every request.</p>
    <div>
      <h3>How does it work with kubectl?</h3>
      <a href="#how-does-it-work-with-kubectl">
        
      </a>
    </div>
    <p>To replace your VPN with Cloudflare Access for <code>kubectl</code>, you need to complete two steps:</p><ul><li><p>Connect your cluster to Cloudflare with Argo Tunnel</p></li><li><p>Connect from a client machine to that cluster with Argo Tunnel</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/8oP2nLOQ22C1KHNxRH6AG/d5dbd3feecad56115db653a9250aa407/kubectl.png" />
            
            </figure>
    <div>
      <h3>Connecting the cluster to Cloudflare</h3>
      <a href="#connecting-the-cluster-to-cloudflare">
        
      </a>
    </div>
    <p>On the cluster side, Cloudflare Argo Tunnel connects those resources to our network by creating a secure tunnel with the Cloudflare daemon, <code>cloudflared</code>. As an administrator, you can run <code>cloudflared</code> in any space that can connect to the Kubernetes API server over TCP.</p><p>Once installed, an administrator authenticates the instance of <code>cloudflared</code> by logging in to a browser with their Cloudflare account and choosing a hostname to use. Once selected, Cloudflare will issue a certificate to <code>cloudflared</code> that can be used to create a subdomain for the cluster.</p><p>Next, an administrator starts the tunnel. In the example below, the <code>hostname</code> value can be any subdomain of the hostname selected in Cloudflare; the <code>url</code> value should be the API server for the cluster.</p>
            <pre><code>cloudflared tunnel --hostname cluster.site.com --url tcp://kubernetes.docker.internal:6443 --socks5=true </code></pre>
            <p>This should be run as a <code>systemd</code> process to ensure the tunnel reconnects if the resource restarts.</p>
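A minimal unit file for that looks something like the following (the binary path and hostname are illustrative; adjust them for your deployment):

```ini
[Unit]
Description=cloudflared Argo Tunnel to the Kubernetes API server
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/cloudflared tunnel --hostname cluster.site.com --url tcp://kubernetes.docker.internal:6443 --socks5=true
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` is what gives you the reconnect-on-restart behavior described above.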
    <div>
      <h3>Connecting as an end user</h3>
      <a href="#connecting-as-an-end-user">
        
      </a>
    </div>
    <p>End users do not need an agent or client application to connect to web applications secured by Cloudflare Access. They can authenticate to on-premise applications through a browser, without a VPN, like they would for SaaS tools. When we apply that same security model to non-HTTP protocols, we need to establish that secure connection from the client with an alternative to the web browser.</p><p>Unlike our SSH flow, end users cannot modify <code>kubeconfig</code> to proxy requests through <code>cloudflared</code>. <a href="https://github.com/kubernetes/kubernetes/pull/81443">Pull requests</a> have been submitted to add this functionality to <code>kubeconfig</code>, but in the meantime users can set an alias to serve a similar function.</p><p>First, users need <a href="https://developers.cloudflare.com/argo-tunnel/quickstart/">to download</a> the same <code>cloudflared</code> tool that administrators deploy on the cluster. Once downloaded, they will need to run a corresponding command to create a local SOCKS proxy. When the user runs the command, <code>cloudflared</code> will launch a browser window to prompt them to login with their SSO and check that they are allowed to reach this hostname.</p>
            <pre><code>$ cloudflared access tcp --hostname cluster.site.com --url 127.0.0.1:1234</code></pre>
            <p>The proxy allows your local kubectl tool to connect to <code>cloudflared</code> via a SOCKS5 proxy, which helps avoid issues with TLS handshakes to the cluster itself. In this model, TLS certificates can still be exchanged and verified with the Kubernetes API server without disabling or modifying that flow for end users.</p><p>Users can then create an alias to save time when connecting. The example below aliases all of the steps required to connect in a single command. This can be added to the user’s bash profile so that it persists between restarts.</p>
            <pre><code>$ alias kubeone="env HTTPS_PROXY=socks5://127.0.0.1:1234 kubectl"</code></pre>
            
    <div>
      <h3>A (hard) lesson when dogfooding</h3>
      <a href="#a-hard-lesson-when-dogfooding">
        
      </a>
    </div>
    <p>When we build products at Cloudflare, we release them to our own organization first. The entire company becomes a feature’s first customer, and we ask them to submit feedback in a candid way.</p><p>Cloudflare Access began as a product we built <a href="/dogfooding-from-home/">to solve our own challenges</a> with security and connectivity. The product impacts every user in our team, so as we’ve grown, we’ve been able to gather more expansive feedback and catch more edge cases.</p><p>The <code>kubectl</code> release was no different. At Cloudflare, we have a team that manages our own Kubernetes deployments and we went to them to discuss the prototype. However, they had more than just some casual feedback and notes for us.</p><p>They told us to stop.</p><p>We had started down an implementation path that was technically sound and solved the use case, but did so in a way that engineers who spend all day working with pods and containers would find to be a real irritant. The flow required a small change in presenting certificates, which did not feel cumbersome when we tested it, but we do not use it all day. That grain of sand would cause real blisters as a new requirement in the workflow.</p><p>With their input, we stopped the release, and changed that step significantly. We worked through ideas, iterated with them, and made sure the Kubernetes team at Cloudflare felt this was not just good enough, but better.</p>
    <div>
      <h3>What’s next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Support for <code>kubectl</code> is available in the latest release of the <code>cloudflared</code> tool. You can begin using it today, on any plan. More <a href="https://developers.cloudflare.com/access/other-protocols/kubectl/">detailed instructions are available</a> to get started.</p><p>If you try it out, <a href="https://community.cloudflare.com/t/feedback-for-cloudflare-access-support-for-kubectl/168530">please send us your feedback</a>! We’re focused on improving the ease of use for this feature, and other non-HTTP workflows in Access, and need your input.</p><p>New to Cloudflare for Teams? You can use all of the Teams products for free through September, including Cloudflare Access and Argo Tunnel. You can learn more about the program, and request a dedicated onboarding session, <a href="https://teams.cloudflare.com/">here</a>.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Cloudflare Access]]></category>
            <category><![CDATA[Zero Trust]]></category>
            <category><![CDATA[IETF]]></category>
            <guid isPermaLink="false">Q0Qzl6N8Ct6pqsN9MMbUE</guid>
            <dc:creator>Sam Rhea</dc:creator>
        </item>
        <item>
            <title><![CDATA[How To Minikube + Cloudflare]]></title>
            <link>https://blog.cloudflare.com/minikube-cloudflare/</link>
            <pubDate>Sun, 08 Jul 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ A step-by-step guide for how to run production Minikube deployments using the Cloudflare Ingress Controller. ]]></description>
            <content:encoded><![CDATA[ <p><sub><i>The following is a guest blog post by </i></sub><a href="https://www.linkedin.com/in/nathanfranzen/"><sub><i>Nathan Franzen</i></sub></a><sub><i>, Software Engineer at </i></sub><a href="https://stackpoint.io/"><sub><i>StackPointCloud</i></sub></a><sub><i>. StackPointCloud is the creator of Stackpoint.io, the leading multi-cloud management platform for cloud native workloads. They are the developers of the </i></sub><a href="https://github.com/cloudflare/cloudflare-warp-ingress"><sub><i>Cloudflare Ingress Controller</i></sub></a><sub><i> for Kubernetes.</i></sub></p>
    <div>
      <h3>Deploying Applications on Minikube with Argo Tunnels</h3>
      <a href="#deploying-applications-on-minikube-with-argo-tunnels">
        
      </a>
    </div>
    
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7lIbr7ys1slKmEegs1MUET/6062af04b716df41a8b014a1419adb06/minikube_plus_cloudflare.png" />
          </figure><blockquote><p>This article assumes basic knowledge of Kubernetes. If you're not familiar with Kubernetes, visit <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/">https://kubernetes.io/docs/tutorials/kubernetes-basics/</a> to learn the basics.</p></blockquote><p>Minikube is a tool which allows you to run a Kubernetes cluster locally. It’s not only a great way to experiment with Kubernetes, but also a great way to try out deploying services using a reverse tunnel.</p><p>At Cloudflare, we've created a product called <a href="https://www.cloudflare.com/products/argo-tunnel/">Argo Tunnel</a> which allows you to host services through a tunnel using Cloudflare as your edge. Tunnels provide a way to expose your services to the internet by creating a connection to Cloudflare's edge and routing your traffic over it. Since your service is creating its own outbound connection to the edge, you don’t have to open ports, configure a firewall, or even have a public IP address for your service. All traffic flows through Cloudflare, blocking attacks and intrusion attempts before they ever make it to you, completely securing your origin.</p><p>Deploying your service to more locations around the world is as simple as spinning up more containers. Anything that uses the Ingress Controller will receive your traffic, wherever the container is running in the world or on the Internet. Tunnels make it simpler to have robust security even while deploying across multiple regions or cloud providers.</p><p>Usually Minikube applications need to be ported over to a production Kubernetes setup to be deployed, but with Argo Tunnel, you can easily deploy a locally-running yet publicly-available Minikube instance, making it a great way to try out both Kubernetes and Argo Tunnel. In this example, we’ll create a simple microservice that returns data when given a key, deploy it into Minikube, and start up the Argo Tunnel machinery to get it exposed to the Internet.</p>
    <div>
      <h3>Getting Started with an Application API</h3>
      <a href="#getting-started-with-an-application-api">
        
      </a>
    </div>
    <p>We'll start by creating a web service in Python using <a href="http://flask.pocoo.org/">Flask</a>. We'll write a simple application to represent a small piece of an API in just a few lines of code. The complete application, secret_token.py, is simply:</p>
            <pre><code>from flask import Flask, jsonify, abort  
  
app = Flask(__name__)

@app.route('/api/v1/token/&lt;key&gt;', methods=['GET'])  
def token(key):  
	test_data = {  
		"e8990ab9be26": "3OX9+p39QLIvE6+x/w=",  
		"b01323031589": "wBvlo9G7Wqxsb2P9YS=",  
	}  
	secret = test_data.get(key)  
	if secret is None:  
		abort(404)  
	return jsonify({"key": key, "token": secret})</code></pre>
            <p></p><p>This tiny service will simply respond to a GET request with some secret data, given a key.</p>
    <div>
      <h3>Using Docker</h3>
      <a href="#using-docker">
        
      </a>
    </div>
    <p>We’ll take the next step toward deployment and package our application into a portable Docker image with a Dockerfile:</p>
            <pre><code>FROM  python:alpine3.7  
RUN  pip install flask gunicorn
COPY  secret_token.py .
CMD  gunicorn -b 0.0.0.0:8000 secret_token:app</code></pre>
            <p></p><p>This will allow us to define a Docker image, the blueprint for the containers Minikube will build.</p>
    <div>
      <h4>Deploying into Minikube</h4>
      <a href="#deploying-into-minikube">
        
      </a>
    </div>
    <blockquote><p>If you don't have Minikube installed, install it here: <a href="https://kubernetes.io/docs/tasks/tools/install-minikube/">https://kubernetes.io/docs/tasks/tools/install-minikube/</a></p></blockquote><p>Usually, we would build the Docker image with our Docker daemon and push it to a repository where the cluster can access it. With Minikube, however, that’s a round-trip we don’t need. We can share the Minikube Docker daemon with the Docker build process and avoid pushing to a cloud repository:</p>
            <pre><code>$ eval $(minikube docker-env)  
$ docker build -t myrepo/secret_token .</code></pre>
            <p></p><p>The image is now present on the Minikube VM where Kubernetes is running.</p><p>In a production Kubernetes system, we might spend a good deal of time going over the details of the deployment and service manifests, but <code>kubectl run</code> provides a simple way to get the basic app up and running. We add the <code>--image-pull-policy</code> flag to make sure that Kubernetes doesn’t first try to pull the image remotely from Docker Hub.</p>
            <pre><code>$ kubectl run token --image myrepo/secret_token --expose --port 8000 --image-pull-policy=IfNotPresent --replicas=3</code></pre>
            <p></p><p>We now have a Kubernetes deployment running 3 replicas of a container built from our image, and a service associated with it that exposes port <code>8000</code>. Save the two manifests locally into files:</p>
            <pre><code>kubectl get deployment token --export -o yaml &gt; deployment.yaml  
kubectl get svc token --export -o yaml &gt; service.yaml</code></pre>
            <p></p><p>We’ll be able to edit these files to make changes to our cluster configuration.</p><p>For local testing, let's change that service so that it exposes a NodePort -- this will proxy the service to a port on the Minikube VM. Replace the spec in the service.yaml file with:</p>
            <pre><code>spec:
  ports:
  - nodePort: 32080
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    run: token
  sessionAffinity: None
  type: NodePort</code></pre>
            <p></p><p>And apply the change to our cluster:</p>
            <pre><code>$ kubectl apply -f service.yaml</code></pre>
            <p></p><p>Now, we can test locally with <code>curl</code>, reaching the service via the NodePort on the Minikube VM:</p>
            <pre><code>$ minikube start
$ export MINIKUBE_IP=$(minikube ip)
$ curl http://$MINIKUBE_IP:32080/api/v1/token/b01323031589</code></pre>
            
    <div>
      <h3>Using Cloudflare’s Argo Tunnel</h3>
      <a href="#using-cloudflares-argo-tunnel">
        
      </a>
    </div>
    <p>The NodePort setup is fine for testing the application locally, but if we want to share this service with others or better simulate how it will work in the real world, we need to expose it to the internet. In most cases, this means running in a cloud environment and dealing with load balancer configuration, or setting up an NGINX ingress controller and dealing with network rules and routing. The Cloudflare Argo Tunnel Ingress Controller allows us to route almost anything to a Cloudflare domain, including services running inside of Minikube.</p><p>In the Kubernetes cluster, an <code>ingress</code> is an object that describes how we want our service exposed on the internet, and an <code>ingress-controller</code> is the process that actually exposes it. To install the Cloudflare Ingress Controller, you’ll need to have a Cloudflare domain and an Argo Tunnel certificate, configured with the <code>cloudflared</code> application.</p><p><code>kubectl run</code> was fine for quickly installing the test application, but for more complex installations, <a href="https://helm.sh/">helm</a> is a great tool, and is used to package the Cloudflare agent. Once you have the helm client installed, a simple <code>helm init</code> will configure Minikube to work with it. The chart for the ingress controller is found at the <a href="https://github.com/StackPointCloud/trusted-charts/tree/master/stable/cloudflare-warp-ingress">trusted-charts public repository</a> and can be installed directly from there.</p>
    <div>
      <h4>Cloudflared Configuration</h4>
      <a href="#cloudflared-configuration">
        
      </a>
    </div>
    <p><code>Cloudflared</code> is the end of the tunnel that runs on your machine and proxies traffic to and from your origin server through the tunnel. If you don't have it installed already, complete quickstart instructions for the <code>cloudflared</code> application can be found at <a href="https://developers.cloudflare.com/argo-tunnel/quickstart/">https://developers.cloudflare.com/argo-tunnel/quickstart/</a></p>
    <div>
      <h4>Installing the Controller with Helm</h4>
      <a href="#installing-the-controller-with-helm">
        
      </a>
    </div>
    <p>Now we will run some commands that define the repository that holds our chart and override a few default values. The exact repository URL and chart values are documented in the chart’s README; a representative installation looks like:</p>
            <pre><code>$ helm repo add trusted-charts https://trusted-charts.stackpoint.io
$ helm install trusted-charts/cloudflare-warp-ingress --name warp-ingress \
    --set replicaCount=2</code></pre>
            <p>This installation configures two cloudflare-warp-ingress controller replicas so that any service we expose will get two separate tunnels to the Cloudflare edge, paired together in a single pool.</p>
    <div>
      <h4>Exposing Our Application with an Ingress</h4>
      <a href="#exposing-our-application-with-an-ingress">
        
      </a>
    </div>
    <p>We'll need to write an ingress definition. Create a file called <code>warp-controller.yaml</code>:</p>
            <pre><code>apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: argo-tunnel
  name: token
  namespace: default
spec:
  rules:
  - host: token.anthopleura.net
    http:
      paths:
      - backend:
          serviceName: token
          servicePort: 8000</code></pre>
            <p>And apply the definition:</p>
            <pre><code>$ kubectl apply -f warp-controller.yaml</code></pre>
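            <p>Before moving on, you can confirm that Kubernetes accepted the ingress (standard <code>kubectl</code>; the exact output depends on your cluster):</p>
            <pre><code>$ kubectl get ingress token</code></pre>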
            
    <div>
      <h4>Examining the deployment</h4>
      <a href="#examining-the-deployment">
        
      </a>
    </div>
    
            <pre><code>$ kubectl get pod</code></pre>
            <p>Should print:</p>
            <pre><code>NAME                                      READY   STATUS    RESTARTS   AGE
cloudflare-argo-ingress-6b886994b-52fsl   1/1     Running   0          34s
token-766cd8dd4c-bmksw                    1/1     Running   0          2m
token-766cd8dd4c-l8gkw                    1/1     Running   0          2m
token-766cd8dd4c-p2phg                    1/1     Running   0          2m</code></pre>
            <p>The output shows the three <code>token</code> pods and the <code>cloudflare-warp-ingress</code> pod. Examine the logs from the argo pod to see the activity of the ingress controller:</p>
            <pre><code>$ kubectl logs cloudflare-argo-ingress-6b886994b-52fsl</code></pre>
            <p>The controller watches the cluster for creation of ingresses, services and pods.</p><p>The endpoint is live at <a href="https://token.anthopleura.net/api/v1/token/e8990ab9be26">https://token.anthopleura.net/api/v1/token/e8990ab9be26</a> returning</p>
            <pre><code>{
  "key": "e8990ab9be26",
  "token": "3OX9+p39QLIvE6+x/YK4DxWWCFi/D+c7g99c14oNB8g="
}</code></pre>
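            <p>As an aside, the <code>token</code> value above is 44 characters of base64, i.e. a 32-byte value. Purely as an illustrative sketch (this is not the demo app's actual code), a response of the same shape can be produced locally:</p>

```shell
# Illustrative only: mimic the token service's response shape.
key="e8990ab9be26"                          # hypothetical key
token=$(head -c 32 /dev/urandom | base64)   # 32 random bytes -> 44 base64 characters
printf '{ "key": "%s", "token": "%s" }\n' "$key" "$token"
```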
            <p>Now this small piece of an API is available publicly on the internet for testing. Obviously you don’t want to serve public traffic from a minikube instance, but it’s certainly handy for sharing preliminary work across development teams.</p><p>The Cloudflare <a href="https://dash.cloudflare.com/">dashboard</a>, under the analytics tab, will show some general statistics about the requests to your zone.</p>
    <div>
      <h3>Routing and relationships</h3>
      <a href="#routing-and-relationships">
        
      </a>
    </div>
    <p>A quick sketch of the routing in the Kubernetes cluster and from the Cloudflare network:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZiftW67CC2PZM3UfQ25kn/f23446ae0782e79e316e667164505d0b/Screenshot_2024-08-06_at_5.33.57_PM.png" />
          </figure><p>The warp controller pods provide a way for Argo Tunnels to connect the pods containing your application to the internet through Cloudflare's edge.</p>
    <div>
      <h3>Going farther with Cloudflare Load Balancers</h3>
      <a href="#going-farther-with-cloudflare-load-balancers">
        
      </a>
    </div>
    <p>This demo exposes a service through a single Argo Tunnel. If your Cloudflare account is enabled with load balancing, you can route traffic through a load balancer and pool of tunnels instead, by adding the annotation <code>argo.cloudflare.com/lb-pool=token</code> to the ingress. For details of load balancer routing and weighting please refer to the Cloudflare <a href="https://developers.cloudflare.com/">docs</a>.</p><p>If you do use load balancing, then it is possible to run multiple instances of the ingress controller. When installing from the helm chart, set the value <code>replicaCount</code> to two or more to get multiple instances of the controller in the minikube cluster. This configuration is useful for high availability in a single cluster. Load balancing can also be used to spread traffic across multiple clusters, with different argo ingress controllers connecting to the same load balancing pool.</p><p>With two ingress controllers, the <a href="https://www.cloudflare.com/a/traffic/stackpoint.io">Cloudflare UI</a> will show a pool named token.anthopleura.net with two origins, with tunnel ids as the origin addresses.</p> ]]></content:encoded>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <guid isPermaLink="false">5dKfmN3YfXAbTEsYyLgdJG</guid>
            <dc:creator>Guest Author</dc:creator>
        </item>
        <item>
            <title><![CDATA[Copenhagen & London developers, join us for five events this May]]></title>
            <link>https://blog.cloudflare.com/copenhagen-london-developers/</link>
            <pubDate>Thu, 26 Apr 2018 05:32:00 GMT</pubDate>
            <description><![CDATA[ Are you based in Copenhagen or London? Drop by some talks we're hosting about the use of Go, Kubernetes, and Cloudflare’s Mobile SDK. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Photo by <a href="https://unsplash.com/@nickkarvounis?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Nick Karvounis</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><p>Are you based in Copenhagen or London? Drop by one or all of these five events.</p><p><a href="https://twitter.com/0xRLG">Ross Guarino</a> and <a href="https://twitter.com/terinjokes">Terin Stock</a>, both Systems Engineers at Cloudflare are traveling to Europe to lead Go and Kubernetes talks in Copenhagen. They'll then join <a href="https://twitter.com/IcyApril">Junade Ali</a> and lead talks on their use of Go, Kubernetes, and Cloudflare’s Mobile SDK at Cloudflare's London office.</p><p>My Developer Relations teammates and I are visiting these cities over the next two weeks to produce these events with Ross, Terin, and Junade. We’d love to meet you and invite you along.</p><p>Our trip will begin with two meetups and a conference talk in Copenhagen.</p>
    <div>
      <h3>Event #1 (Copenhagen): 6 Cloud Native Talks, 1 Evening: Special KubeCon + CloudNativeCon EU Meetup</h3>
      <a href="#event-1-copenhagen-6-cloud-native-talks-1-evening-special-kubecon-cloudnativecon-eu-meetup">
        
      </a>
    </div>
    
            <figure>
            <a href="https://www.meetup.com/GOTO-Nights-CPH/events/249895973/">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3PkIfrgFfKT97ZlXTCkM8N/b29595b25228a344e0c1b66c880d004f/GOTO.jpeg.jpeg" />
            </a>
            </figure><p><b>Tuesday, 1 May</b>: 17:00-21:00</p><p><b>Location</b>: <a href="https://trifork.com/">Trifork Copenhagen</a> - <a href="https://www.google.com/maps/place/Borgergade+24b,+1300+K%C3%B8benhavn,+Denmark/@55.684785,12.5840548,17z/data=!3m1!4b1!4m5!3m4!1s0x46525318dfb3b89d:0x855a7fb57181604f!8m2!3d55.684785!4d12.5862435">Borgergade 24B, 1300 København K</a></p><p>How to extend your Kubernetes cluster</p><p>A brief introduction to controllers, webhooks and CRDs. Ross and Terin will talk about how Cloudflare’s internal platform builds on Kubernetes.</p><p><b>Speakers</b>: Ross Guarino and Terin Stock</p><p><a href="https://www.meetup.com/GOTO-Nights-CPH/events/249895973/">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>Event #2 (Copenhagen): Gopher Meetup At Falcon.io: Building Go With Bazel &amp; Internationalization in Go</h3>
      <a href="#event-2-copenhagen-gopher-meetup-at-falcon-io-building-go-with-bazel-internationalization-in-go">
        
      </a>
    </div>
    
            <figure>
            <a href="https://www.meetup.com/Go-Cph/events/249830850/">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1YCo27Er0ncIR677a3eKFL/ae8b09422245f7c111d27f23607e16ff/Viking-Gopher.png" />
            </a>
            </figure><p><b>Wednesday, 2 May</b>: 18:00-21:00</p><p><b>Location</b>: <a href="https://www.falcon.io/">Falcon.io</a> - <a href="https://www.google.com/maps/place/H.+C.+Andersens+Blvd.+27,+1553+K%C3%B8benhavn,+Denmark/@55.674143,12.5629923,15z/data=!4m5!3m4!1s0x4652531251d1c86d:0xd1f236f0ffef562e!8m2!3d55.674143!4d12.571747">H.C. Andersen Blvd. 27, København</a></p><p>Talk 1: Building Go with Bazel</p><p>Fast and Reproducible go builds with Bazel. Learn how to remove Makefiles from your repositories.</p><p><b>Speaker</b>: Ross Guarino</p><p>Talk 2: Internationalization in Go</p><p>Explore making effective use of Go’s internationalization and localization packages and easily making your applications world-friendly.</p><p><b>Speaker</b>: Terin Stock</p><p><a href="https://www.meetup.com/Go-Cph/events/249830850/">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>Event #3 (Copenhagen): Controllers: Lambda Functions for Extending your Infrastructure at KubeCon + CloudNativeCon 2018</h3>
      <a href="#event-3-copenhagen-controllers-lambda-functions-for-extending-your-infrastructure-at-kubecon-cloudnativecon-2018">
        
      </a>
    </div>
    
            <figure>
            <a href="http://sched.co/DqwM">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6DfqqwLIbsg89kEglVpmNr/5ee34593d6192c6cdb9bb3bac21da2e9/Screen-Shot-2018-04-25-at-2.41.41-PM.png" />
            </a>
            </figure><p><b>Friday, 4 May</b>: 14:45-15:20</p><p><b>Location</b>: <a href="https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2018/">KubeCon + CloudNativeCon 2018</a> - <a href="https://www.google.com/maps/place/Bella+Center/@55.6385357,12.5433961,13z/data=!3m1!5s0x465254a4eeec0777:0x55f95a7fe9ed3f83!4m13!1m5!2m4!1sBella+Center,+Center+Blvd.+5,+2300+K%C3%B8benhavn!5m2!5m1!1s2018-04-27!3m6!1s0x465254a363269c3d:0x61db300fc92fb898!5m1!1s2018-04-27!8m2!3d55.6375044!4d12.5785932">Bella Center, Center Blvd. 5, 2300 København</a></p><p>If you happen to be attending <a href="https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2018/">KubeCon + CloudNativeCon 2018</a>, check out Terin and Ross’s conference talk as well.</p><p>This session demonstrates how to leverage Kubernetes Controllers and Initializers as a framework for building transparent extensions of your Kubernetes cluster. Using a live coding exercise and demo, this presentation will showcase the possibilities of the basic programming paradigms the Kubernetes API server makes easy.</p><p><b>Speakers</b>: Ross Guarino and Terin Stock</p><p><a href="http://sched.co/DqwM">View Event Details &amp; Register Here »</a></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5NWrQuZUZgMIy7uJZv0OWr/4fe475faee0e83937f10542f86e6bb4c/photo-1508808402998-ec38e4bf0fd0" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@photobuffs?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Paul Buffington</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p><p>When <a href="https://events.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2018/">KubeCon + CloudNativeCon 2018</a> concludes, we're all heading to the Cloudflare London office where we are hosting two more meetups.</p>
    <div>
      <h3>Event #4 (London): Kubernetes Controllers: Lambda Functions for Extending your Infrastructure</h3>
      <a href="#event-4-london-kubernetes-controllers-lambda-functions-for-extending-your-infrastructure">
        
      </a>
    </div>
    
            <figure>
            <a href="https://kubernetes-controlers.eventbrite.com">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RvxFZQ2Thh4N7J4zBOcSu/4c02fcfd43a2fd108b19fac3c0e5aab8/Cloudflare-London.jpg" />
            </a>
            </figure><p><b>Wednesday, 9 May</b>: 18:00-20:00</p><p><b>Location</b>: Cloudflare London - <a href="https://www.google.com/maps/place/25+Lavington+St,+London+SE1+0NZ,+UK/@51.5047963,-0.1024043,17z/data=!3m1!4b1!4m5!3m4!1s0x487604a8a2b9c4f1:0x1126c5560c56cc41!8m2!3d51.5047963!4d-0.1002156">25 Lavington St, Second floor | SE1 0NZ London</a></p><p>This session demonstrates how to leverage Kubernetes Controllers and Initializers as a framework for building transparent extensions of your Kubernetes cluster. Using a live coding exercise and demo, this presentation will showcase the possibilities of the basic programming paradigms the Kubernetes API server makes easy. As an SRE, learn to build custom integrations directly into the Kubernetes API that transparently enhance the developer experience.</p><p><b>Speakers</b>: Ross Guarino and Terin Stock</p><p><a href="https://kubernetes-controlers.eventbrite.com">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>Event #5 (London): Architecture for Network Failure, Developing for Mobile Performance</h3>
      <a href="#event-5-london-architecture-for-network-failure-developing-for-mobile-performance">
        
      </a>
    </div>
    
            <figure>
            <a href="https://mobilearchitectureandperformance.eventbrite.com">
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/atA3RsFZPPZbbHcZKMhZW/1c5d56415e2513375ef5e6db50344910/Screen-Shot-2018-04-25-at-8.59.10-AM.png" />
            </a>
            </figure><p><b>Thursday, 10 May</b>: 18:00-20:00</p><p><b>Location</b>: Cloudflare London - <a href="https://www.google.com/maps/place/25+Lavington+St,+London+SE1+0NZ,+UK/@51.5047963,-0.1024043,17z/data=!3m1!4b1!4m5!3m4!1s0x487604a8a2b9c4f1:0x1126c5560c56cc41!8m2!3d51.5047963!4d-0.1002156">25 Lavington St, Second floor | SE1 0NZ London</a></p><p>Whether you're building an <a href="https://www.cloudflare.com/ecommerce/">e-commerce app</a> or a new mobile game, chances are your mobile app will need some network functionality at some point. Network performance can vary dramatically between carriers, networks, and APIs, but far too often mobile apps are tested in consistent conditions with the same decent network performance. Fortunately, we can iterate on our apps by collecting real-life performance measurements from users; unfortunately, existing mobile app analytics platforms only provide visibility into in-app performance and have no knowledge of outgoing network calls.</p><p>This talk will cover how you can easily collect vital performance data from your users at no cost and then use this data to improve your apps' reliability and experience, discussing the tips and tricks needed to <a href="https://www.cloudflare.com/solutions/ecommerce/optimization/">boost app performance</a>.</p><p><b>Speaker</b>: Junade Ali</p><p><a href="https://mobilearchitectureandperformance.eventbrite.com">View Event Details &amp; Register Here »</a></p>
    <div>
      <h3>More About the Speakers</h3>
      <a href="#more-about-the-speakers">
        
      </a>
    </div>
    <p><a href="https://twitter.com/0xRLG">Ross Guarino</a> is a Systems Engineer at Cloudflare in charge of the technical direction of the internal platform. He’s determined to improve the lives of developers building and maintaining everything from a simple function to complex globally distributed systems.</p><p><a href="https://twitter.com/terinjokes">Terin Stock</a> is a long-time engineer at Cloudflare, currently working on building an internal Kubernetes cluster. By night, he hacks on new hardware projects. Terin is also a member of the <a href="https://gulpjs.com/">gulp.js</a> core team and the author of the <a href="https://github.com/terinjokes/StickersStandard">Sticker Standard</a>.</p><p><a href="https://twitter.com/IcyApril">Junade Ali</a> is a software engineer who specialises in computer security and software architecture. Currently, Junade works at Cloudflare as a polymath, helping to make the Internet faster and more secure; prior to this, he was a technical lead at some of the UK's leading digital agencies before moving into architecting software for mission-critical road-safety systems.</p><p>We hope to meet you soon!</p>
            <category><![CDATA[Events]]></category>
            <category><![CDATA[Community]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[United Kingdom]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Mobile SDK]]></category>
            <category><![CDATA[Cloudflare Meetups]]></category>
            <category><![CDATA[MeetUp]]></category>
            <guid isPermaLink="false">4KmQqQHsaL4hmb1fHLo2VX</guid>
            <dc:creator>Andrew Fitch</dc:creator>
        </item>
        <item>
            <title><![CDATA[Creating a single pane of glass for your multi-cloud Kubernetes workloads with Cloudflare]]></title>
            <link>https://blog.cloudflare.com/creating-a-single-pane-of-glass-for-your-multi-cloud-kubernetes-workloads-with-cloudflare/</link>
            <pubDate>Fri, 23 Feb 2018 17:00:00 GMT</pubDate>
            <description><![CDATA[ One of the great things about container technology is that it delivers the same experience and functionality across different platforms. This frees you as a developer from having to rewrite or update your application to deploy it on a new cloud provider. ]]></description>
            <content:encoded><![CDATA[ <p><i>(This is a crosspost of a blog post </i><a href="https://cloudplatform.googleblog.com/2018/02/creating-a-single-pane-of-glass-for-your-multi-cloud-Kubernetes-workloads-with-Cloudflare.html"><i>originally published</i></a><i> on Google Cloud blog)</i></p><p>One of the great things about container technology is that it delivers the same experience and functionality across different platforms. This frees you as a developer from having to rewrite or update your application to deploy it on a new cloud provider—or lets you run it across multiple cloud providers. With a containerized application running on multiple clouds, you can avoid lock-in, run your application on the cloud for which it’s best suited, and lower your overall costs.</p><p>If you’re using Kubernetes, you probably manage traffic to clusters and services across multiple nodes using internal load-balancing services, which is the most common and practical approach. But if you’re running an application on multiple clouds, it can be hard to distribute traffic intelligently among them. In this blog post, we show you how to use Cloudflare Load Balancer in conjunction with Kubernetes so you can start to achieve the benefits of a multi-cloud configuration.</p><p>To continue reading follow the Google Cloud blog <a href="https://cloudplatform.googleblog.com/2018/02/creating-a-single-pane-of-glass-for-your-multi-cloud-Kubernetes-workloads-with-Cloudflare.html">here</a> or if you are ready to get started we created a <a href="https://support.cloudflare.com/hc/en-us/articles/115003384591-Using-Kubernetes-on-GKE-and-AWS-with-Cloudflare-Load-Balancer">guide</a> on how to deploy an application using Kubernetes on GCP and AWS along with our Cloudflare Load Balancer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1BCYwEZkuZnTcf06JBjegX/176554a91047c6c57c4bd83b815dc08f/Single_Pane_ofglass_Cloudflare.png" />
            
            </figure> ]]></content:encoded>
            <category><![CDATA[Google Cloud]]></category>
            <category><![CDATA[Google]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Serverless]]></category>
            <guid isPermaLink="false">4ZQwt7DyISJPGuH5oeauP7</guid>
            <dc:creator>Kamilla Amirova</dc:creator>
        </item>
        <item>
            <title><![CDATA[Introducing the Cloudflare Warp Ingress Controller for Kubernetes]]></title>
            <link>https://blog.cloudflare.com/cloudflare-ingress-controller/</link>
            <pubDate>Tue, 05 Dec 2017 14:00:00 GMT</pubDate>
            <description><![CDATA[ It’s ironic that the one thing most programmers would really rather not have to spend time dealing with is... a computer.  ]]></description>
            <content:encoded><![CDATA[ <p><i>NOTE: Prior to launch, this product was renamed Argo Tunnel. Read more in the </i><a href="/argo-tunnel/"><i>launch announcement</i></a><i>.</i></p><p>It’s ironic that the one thing most programmers would really rather not have to spend time dealing with is... a computer. When you write code it’s written in your head, transferred to a screen with your fingers and then it has to be run. On. A. Computer. Ugh.</p><p>Of course, code has to be run and typed on a computer so programmers spend hours configuring and optimizing shells, window managers, editors, build systems, IDEs, compilation times and more so they can minimize the friction all those things introduce. Optimizing your editor’s macros, fonts or colors is a battle to find the most efficient path to go from idea to running code.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/313EyDPITzKIECDYuj2qzq/a5eec59921a13d8aadfe7ed0e6ccc305/4532962327_c5a219d992_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/ivyfield/4532962327/in/photolist-7UyBLT-87ERdK-cCZVyC-8E715f-dbdKCZ-nGKair-cQay4s-ebwmMy-nAuTVv-jw9hxd-nqxc9h-nH1hJw-cp4c1Q-8B3PLE-PUxit-6gY6pQ-4P2q52-cCZVWL-6eRJAH-kNHY-nj1peY-nqxyHa-iNw9jP-5boJ6P-J3KVad-nj1hAZ-7yYuBu-8PCwt2-aJptFP-b4WLoM-nysiQJ-b8kxAV-BtcWbK-7yKiEj-cABXZ1-b8RR72-9LbLum-a6n7fX-X3SERX-br1nSQ-qdLBYQ-4XJsbd-5zXtUQ-dWePHa-qAi9Jt-awuoCM-cACicL-cA43Y1-nGQWPs-dotR4Y">image</a> by <a href="https://www.flickr.com/photos/ivyfield/">Yutaka Tsutano</a></p><p>Once the developer is managing their own universe they can write code at the speed of their mind. But when it comes to putting their code into production (which necessarily requires running their programs on machines that they don’t control) things inevitably go wrong. Production machines are never the same as developer machines.</p><p>If you’re not a developer, here’s an analogy. Imagine carefully writing an essay on a subject dear to your heart and then publishing it only to be told “unfortunately, the word ‘the’ is not available in the version of English the publisher uses and so your essay is unreadable”. That’s the sort of problem developers face when putting their code into production.</p><p>Over time different technologies have tried to deal with this problem: dual booting, different sorts of isolation (e.g. 
virtualenv, chroot), totally static binaries, virtual machines running on a developer desktop, elastic computing resources in clouds, and more recently <a href="https://en.wikipedia.org/wiki/Operating-system-level_virtualization">containers</a>.</p><p>Ultimately, using containers is all about a developer being able to say “it ran on my machine” and be sure that it’ll run in production, because fighting incompatibilities between operating systems, libraries and runtimes that differ from development to production is a waste of time (in particular developer brain time).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3w1Pjk0GrYgcVwz3LiliHI/9c14d99ad7b4366a4db6d961f6a5e472/14403331148_bf25864944_k-1.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/jumilla/14403331148/in/photolist-nWLQxE-3eHezK-a85qp6-iEoK39-iYjss-9at6Z7-iYjsr-9gYg6a-bW3NeL-6AMx91-6rJaN3-28rg3L-TvqJmB-GjMFdK-2UrVHv-fs2WCj-4LGK2-2UrYz4-2Us2tK-Eeqeo-85HX8j-rF6SGG-o9rBXe-fWrkwA-dcGZAo-aoHuTF-SGpPT3-boaQy8-u8Bei-62JAWa-s9fFGo-61fNWq-fJYrjR-axYxm-2h42pU-2h42w7-rRNyES-fKfUKQ-6YXYGU-VjiSN1-4Xcg61-7YmaKY-WyF1oq-bE83qB-dvQoQw-CRQx6-82fwLo-fvJhXq-gmkcM-U3mP5E">image</a> by <a href="https://www.flickr.com/photos/jumilla/">Jumilla</a></p><p>In parallel, the rise of microservices is also a push to optimize developer brain time. The reality is that we all have limited brain power and ability to comprehend the complex systems that we build in their entirety and so we break them down into small parts that we can understand and test: functions, modules and services.</p><p>A microservice with a well-defined API and related tests running in a container is the ultimate developer fantasy. An entire program, known to operate correctly, that runs on their machine and in production.</p><p>Of course, no silver lining is without its cloud and containers beget a coordination problem: how do all these little programs find each other, scale, handle failure, log messages, communicate and remain secure. The answer, of course, is a coordination system like <a href="https://kubernetes.io/">Kubernetes</a>.</p><p>Kubernetes completes the developer fantasy by allowing them to write and deploy a service and have it take part in a whole.</p><p>Sadly, these little programs have one last hurdle before they turn into useful Internet services: they have to be connected to the brutish outside world. Services must be safely and scalably exposed to the Internet.</p><p>Recently, Cloudflare introduced a new service that can be used to connect a web server to Cloudflare without needing to have a public IP address for it. 
That service, <a href="https://warp.cloudflare.com">Cloudflare Warp</a>, maintains a connection from the server into the Cloudflare network. The server is then only exposed to the Internet through Cloudflare with no way for attackers to reach the server directly.</p><p>That means that any connection to it is protected and accelerated by Cloudflare’s service.</p>
    <div>
      <h3>Cloudflare Warp Ingress Controller and StackPointCloud</h3>
      <a href="#cloudflare-warp-ingress-controller-and-stackpointcloud">
        
      </a>
    </div>
    <p>Today, we are extending Warp’s reach by announcing the Cloudflare Warp <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/">Ingress Controller</a> for Kubernetes (it’s an open source project and can be found <a href="https://github.com/cloudflare/cloudflare-warp-ingress">here</a>). We worked closely with the team at <a href="https://stackpoint.io/">StackPointCloud</a> to integrate Warp, Kubernetes and their universal control plane for Kubernetes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/P1ACOSWhcrBdZU7WlNcHi/798b6421302ef4c4c070607c2df97fb8/Screen-Shot-2017-12-04-at-5.28.18-PM-2.png" />
            
            </figure><p>Within Kubernetes, creating an ingress with the annotation <code>kubernetes.io/ingress.class: cloudflare-warp</code> will automatically create secure Warp tunnels to Cloudflare for any service using that ingress. The entire lifecycle of the tunnels is transparently managed by the ingress controller, making it trivially easy to securely expose Kubernetes-managed services via Cloudflare Warp.</p><p>The Warp Ingress Controller is responsible for finding Warp-enabled services and registering them with Cloudflare using the hostname(s) specified in the Ingress resource. It is added to a Kubernetes cluster by creating a file called warp-controller.yaml with the content below:</p>
            <pre><code>apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: null
  generation: 1
  labels:
    run: warp-controller
  name: warp-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      run: warp-controller
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: warp-controller
    spec:
      containers:
      - command:
        - /warp-controller
        - -v=6
        image: quay.io/stackpoint/warp-controller:beta
        imagePullPolicy: Always
        name: warp-controller
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - name: cloudflare-warp-cert
          mountPath: /etc/cloudflare-warp
          readOnly: true
      volumes:
        - name: cloudflare-warp-cert
          secret:
            secretName: cloudflare-warp-cert
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30</code></pre>
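            <p>To deploy it, create the <code>cloudflare-warp-cert</code> secret that the controller mounts (the certificate path below is the <code>cloudflared</code> default and may differ on your system), then apply the file:</p>
            <pre><code>$ kubectl create secret generic cloudflare-warp-cert \
    --from-file=cert.pem=$HOME/.cloudflared/cert.pem
$ kubectl apply -f warp-controller.yaml</code></pre>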
            <p>The full documentation is <a href="https://developers.cloudflare.com/argo-tunnel/reference/kubernetes/">here</a> and shows how to get up and running with Kubernetes and Cloudflare Warp on StackPointCloud, Google GKE, Amazon EKS or even <a href="https://kubernetes.io/docs/getting-started-guides/minikube/">minikube</a>.</p>
    <div>
      <h3>One Click with StackPointCloud</h3>
      <a href="#one-click-with-stackpointcloud">
        
      </a>
    </div>
    <p>Within StackPointCloud adding the Cloudflare Warp Ingress Controller requires just <i>a single click</i>. And one more click and you've deployed a Kubernetes cluster.</p><p>The connection between the Kubernetes cluster and Cloudflare is made using a TLS tunnel ensuring that all communication between the cluster and the outside world is secure.</p><p>Once connected the cluster and its services then benefit from Cloudflare's DDoS protection, WAF, global load balancing and health checks and huge global network.</p><p>The combination of Kubernetes and Cloudflare makes managing, scaling, accelerating and protecting Internet facing services simple and fast.</p> ]]></content:encoded>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[Argo Smart Routing]]></category>
            <category><![CDATA[Optimization]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Cloudflare Tunnel]]></category>
            <guid isPermaLink="false">3JzDlyg9g51wqTsZdMuHW3</guid>
            <dc:creator>John Graham-Cumming</dc:creator>
        </item>
        <item>
            <title><![CDATA[Living In A Multi-Cloud World]]></title>
            <link>https://blog.cloudflare.com/living-in-a-multi-cloud-world/</link>
            <pubDate>Tue, 21 Nov 2017 16:30:00 GMT</pubDate>
            <description><![CDATA[ A few months ago at Cloudflare’s Internet Summit, we hosted a discussion on A Cloud Without Handcuffs with Joe Beda, one of the creators of Kubernetes, and Brandon Phillips, the co-founder of CoreOS. ]]></description>
            <content:encoded><![CDATA[ <p>A few months ago at Cloudflare’s Internet Summit, we hosted a discussion on <a href="/a-cloud-without-handcuffs/">A Cloud Without Handcuffs</a> with Joe Beda, one of the creators of Kubernetes, and Brandon Phillips, the co-founder of CoreOS. The conversation touched on multiple areas, but it’s clear that more and more companies are recognizing the need to have some strategy around hosting their applications on multiple cloud providers.</p><p>Earlier this year, Mary Meeker published her annual <a href="http://www.kpcb.com/internet-trends">Internet Trends</a> report which revealed that 22% of respondents viewed Cloud Vendor Lock-In as a top 3 concern, up from just 7% in 2012. This is in contrast to previous top concerns, Data Security and Cost &amp; Savings, both of which dropped amongst those surveyed.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lCKoQPPU77zT1kPMwe2Mb/11e6f100efab86138150f3d911fa39ad/Mary-Meeker-Internet-Trends-2017.png" />
            
            </figure><p>At Cloudflare, our mission is to help build a better Internet. To fulfill this mission, our customers need to have consistent access to the best technology and services, over time. This is especially the case with respect to storage and compute providers. This means not becoming locked-in to any single provider and taking advantage of multiple cloud computing vendors (such as Amazon Web Services or Google Cloud Platform) for the same end user services.</p>
    <div>
      <h3>The Benefits of Having Multiple Cloud Vendors</h3>
      <a href="#the-benefits-of-having-multiple-cloud-vendors">
        
      </a>
    </div>
    <p>There are a number of potential challenges when selecting a single cloud provider. Though there may be scenarios where it makes sense to consolidate on a single vendor, we believe it is important that customers understand the choice they are making and the downsides of being locked in to that particular vendor. In short, know what trade-offs you are making should you decide to consolidate parts of your network, compute, and storage with a single cloud provider. While not comprehensive, here are a few trade-offs you may be making if you are locked in to one cloud.</p>
    <div>
      <h4>Cost Efficiencies</h4>
      <a href="#cost-efficiences">
        
      </a>
    </div>
    <p>For some companies, there may be cost savings involved in spreading traffic across multiple vendors. Some can take advantage of free or reduced-cost tiers at lower volumes. Vendors may provide reduced costs for certain times of day when their infrastructure is less utilized. Applications can have varying compute requirements amongst layers of the application: some may require faster, immediate processing while others may benefit from delayed processing at a lower cost.</p>
    <div>
      <h4>Negotiation Strength</h4>
      <a href="#negotiation-strength">
        
      </a>
    </div>
    <p>One of the most important reasons to consider deploying in multiple cloud providers is to minimize your reliance on a single vendor’s technology for your critical business processes. As you become more vertically integrated with any vendor, your negotiation posture for pricing or favorable contract terms becomes diminished. Having production ready code available on multiple providers allows you to have less technical debt should you need to change. If you go a step further and are already sending traffic to multiple providers, you have minimized the technical debt required to switch and can negotiate from a position of strength.</p>
    <div>
      <h4>Business Continuity or High Availability</h4>
      <a href="#business-continuity-or-high-availability">
        
      </a>
    </div>
    <p>While the major cloud providers are generally reliable, there have been a few notable outages in recent years. The most significant in recent memory was Amazon’s <a href="https://aws.amazon.com/message/41926/">US-EAST S3</a> outage in February. Some organizations have a policy specifying multiple providers for high availability, while others should consider it as a best practice where necessary and feasible. A multi-cloud strategy can lower the operational risk of a single vendor’s mistake causing a significant outage for a mission critical application.</p>
    <div>
      <h4>Experimentation</h4>
      <a href="#experimentation">
        
      </a>
    </div>
    <p>One of the exciting things about having competition in the space is the level of innovation and feature velocity of each provider. Every year there are major announcements of new products or features that may have a significant impact on improving your organization's competitive advantage. Having test and production environments in multiple providers gives your engineers the ability to understand and experiment with a new capability in the context of your technology stack and data. You may even try these features for a portion of your traffic and get real world data on any benefits realized.</p>
    <div>
      <h3>Cloudflare’s Role</h3>
      <a href="#cloudflares-role">
        
      </a>
    </div>
    <p>Cloudflare is an independent third party in your multi-cloud strategy. Our goal is to minimize the layers of lock-in between you and a provider and lower the effort of change. In particular, one area where we can help right away is to minimize the operational changes necessary at the network, similar to what Kubernetes can do at the storage and compute level. As a benefit of our network, you can also have a centralized point for security and operational control.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/63ggEFEj8gBOjaS6OJ8HKJ/849234502f57a31cfe7a84244831e930/Cloudflare_Multi_Cloud.png" />
            
            </figure><p>Cloudflare’s Load Balancing can easily be configured to act as your global application traffic aggregator and distribute your traffic amongst origins at as many clouds as you choose to utilize. Active layer 7 health checks continually probe your origins and can automatically move traffic in the case of network or application failure. All consolidated web traffic can be inspected and acted upon by Cloudflare’s best of breed <a href="https://www.cloudflare.com/security/">Security</a> services, providing a single control point and visibility across all application traffic, regardless of which cloud the origin may be on. You also have the benefit of Cloudflare’s <a href="https://www.cloudflare.com/network/">Global Anycast Network</a>, providing for better speed and higher availability regardless of which clouds your origins are hosted on.</p>
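To make the failover behavior concrete, the steering described above can be sketched in a few lines: walk the pools in priority order and serve from the first one that is both enabled and passing health checks. This is an illustrative model only, not Cloudflare's implementation; the `pick_pool` function and pool names are hypothetical.

```python
def pick_pool(pools, health, fallback):
    """Return the name of the first enabled pool that health checks
    report as up, walking pools in priority order; if every pool is
    down, send traffic to the designated fallback pool."""
    for pool in pools:
        if pool["enabled"] and health.get(pool["name"], False):
            return pool["name"]
    return fallback

# Two origin pools in priority order: a primary and a backup cloud.
pools = [{"name": "gce", "enabled": True}, {"name": "aws", "enabled": True}]

print(pick_pool(pools, {"gce": True, "aws": True}, "aws"))   # prints "gce"
print(pick_pool(pools, {"gce": False, "aws": True}, "aws"))  # prints "aws"
```

Because health probes run continuously from many locations, the selection is re-evaluated as origin state changes, which is what makes the failover automatic rather than operator-driven.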
    <div>
      <h3>Billforward: Using Cloudflare to Implement Multi-Cloud</h3>
      <a href="#billforward-using-cloudflare-to-implement-multi-cloud">
        
      </a>
    </div>
    <p>Billforward is a San Francisco and London based startup on a mission to change the way people bill and charge their customers, providing a solution to the complexities of Quote-to-Cash. Their platform is built on a number of REST APIs that other developers call to bill and generate revenue for their own companies.</p><p>Billforward is using Cloudflare for its core customer-facing application to fail over traffic between Google Compute Engine and Amazon Web Services. Acting as a reverse proxy, Cloudflare receives all requests and decides which of Billforward’s two configured cloud origins to use based upon the availability of each origin in near real-time. This allows Billforward to completely manage the connections to and from two disparate cloud providers using Cloudflare’s UI or API. Billforward is in the process of migrating all of their customer-facing domains to a similar setup.</p>
    <div>
      <h4>Configuration</h4>
      <a href="#configuration">
        
      </a>
    </div>
    <p>Billforward has a single load balanced hostname with two available Pools. They’ve named the two Pools with “gce” and “aws” labels and each Pool has one Origin associated with it. All of the Pools are enabled and the entire LB/hostname is proxied through Cloudflare (as indicated by the orange cloud).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/28B3npbU2VtpVbeHeQApBz/7914a5ca5d0d9b9019e77a669aa81fa7/Billforward_Config_UI.png" />
            
            </figure><p>Cloudflare probes Billforward’s Origins once every minute from all of Cloudflare’s data centers around the world (a feature available to all Load Balancing Enterprise customers). If Billforward’s GCE Origin goes down, Cloudflare will quickly and automatically failover to the AWS Origin with no actions required from Billforward’s team.</p><p>Google Compute Engine was chosen as the primary provider for this application by virtue of cost. Martin Lee, Site Reliability Engineer at Billforward says, “Essentially, GCE is cheaper for our general purpose computing needs but we're more experienced with deployments in AWS. This strategy allows us to switch back and forth at will and avoid being tied in to either platform.” It is likely that Billforward will change the priority as pricing models evolve.</p><blockquote><p>“It's a fairly fast moving world and features released by cloud providers can have a meaningful impact on performance and cost on a week by week basis - it helps to stay flexible,” says Martin. “We may also change priority based on features.”</p></blockquote><p>For orchestration of the compute and storage layers, Billforward uses <a href="https://www.docker.com/">Docker</a> containers managed through <a href="http://www.rancher.com/">Rancher</a>. They use distinct environments between cloud providers but are considering bridging an environment across cloud providers and using VPNs between them, which will enable them to move load between providers even more easily. “Our system is loosely coupled through a message queue,” adds Martin. “Having a container system across clouds means we can really take advantage of this - we can very easily move workloads across clouds without any danger of dropping tasks or ending up in an inconsistent state.”</p>
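The "switch back and forth at will" workflow can be driven entirely through the API. The sketch below builds (but does not send) the request that would disable a pool, shifting traffic to the remaining pools. The endpoint path, the <code>enabled</code> field, and the bearer-token header are assumptions based on Cloudflare's public v4 API documentation; the pool ID and token shown are placeholders.

```python
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"

def disable_pool_request(pool_id, token):
    """Build a PATCH request that sets a load balancer pool's
    "enabled" flag to false, draining traffic to the other pools.
    The request is built but not sent, so it can be inspected."""
    return urllib.request.Request(
        API + "/user/load_balancers/pools/" + pool_id,
        data=json.dumps({"enabled": False}).encode(),
        headers={
            "Authorization": "Bearer " + token,  # placeholder credential
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

# Drain the GCE pool so the AWS pool becomes the serving origin.
req = disable_pool_request("POOL_ID", "YOUR_API_TOKEN")
print(req.get_method(), req.full_url)
```

Sending the request with <code>urllib.request.urlopen(req)</code> (or an equivalent HTTP client) would apply the change; re-enabling the pool is the same call with <code>"enabled": true</code>.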
    <div>
      <h4>Benefits</h4>
      <a href="#benefits">
        
      </a>
    </div>
    <p>Billforward manages these connections at Cloudflare’s edge. Through this interface (or via the Cloudflare APIs), they can also manually move traffic from GCE to AWS by simply disabling the GCE Pool, or by rearranging the Pool priority to make AWS the primary. These changes are near instant on the Cloudflare network and require no downtime for Billforward’s customer-facing application. This allows them to act on potentially advantageous pricing changes between the two cloud providers or move traffic to hit pricing tiers.</p><p>In addition, Billforward is no longer “locked in” to either provider’s network; being able to move traffic without any downtime means they can make traffic changes independent of Amazon or Google. They can also integrate additional cloud providers any time they deem fit: adding Microsoft Azure, for example, as a third Origin would be as simple as creating a new Pool and adding it to the Load Balancer.</p><p>Billforward is a good example of a forward-thinking company that is taking advantage of technologies from multiple providers to best serve their business and customers, while not being reliant on a single vendor. For further detail on their setup using Cloudflare, please check their <a href="https://www.billforward.net/blog/being-multi-cloud-with-cloudflare/">blog</a>.</p> ]]></content:encoded>
            <category><![CDATA[Google Cloud]]></category>
            <category><![CDATA[Internet Summit]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">VwEr9XiNrvDVfzTQliPa4</guid>
            <dc:creator>Sergi Isasi</dc:creator>
        </item>
        <item>
            <title><![CDATA[A Cloud Without Handcuffs]]></title>
            <link>https://blog.cloudflare.com/a-cloud-without-handcuffs/</link>
            <pubDate>Thu, 14 Sep 2017 19:01:55 GMT</pubDate>
            <description><![CDATA[ Brandon Philips, Co-Founder & CTO, CoreOS, and Joe Beda, CTO, Heptio, & Co-Founder, Kubernetes ]]></description>
            <content:encoded><![CDATA[ <p><a href="https://twitter.com/brandonphilips">Brandon Philips</a>, Co-Founder &amp; CTO, CoreOS, and <a href="https://twitter.com/jbeda">Joe Beda</a>, CTO, Heptio, &amp; Co-Founder, Kubernetes</p><p>Moderator: Alex Dyner, Head of Special Projects, Cloudflare</p><p>We’re exploring the increasing risk of a few companies locking in customers and gaining more power over time.</p><p>AD: I want to hear your stories about how you got into what you do.</p><p>JB: Kubernetes faced the problem of either having Googlers use rbs or bringing X to the rest of the world. We wanted to have Googlers and outside people using something similar. We chose to do it as open source because you play a different game when you’re the underdog. Through open source we could garner interest. We wanted to provide application mobility.</p><p>AD: Brandon, talk about your mission and why you started the company.</p><p>BP: We started CoreOS four years ago; we spent a lot of time thinking about this problem, and containers were a natural choice. They are necessary for achieving our mission. We wanted to allow people to have mobility around their applications. We wanted to enable a new security model through containers. So we started building a product portfolio.</p><p>AD: There are tradeoffs between using a container or an open source tech; how do you think about those tradeoffs?</p><p>BP: First, Kubernetes provides an application-centric view. The abstraction is: how do we create a platform? Also, how do we build useful integrations?</p><p>The project tries to build useful integrations. It’s really about that initial abstraction.</p><p>JB: One useful comparison is that Kubernetes is a kernel for the system. There is a feeling that we want to keep Kubernetes as a flexible kernel, while recognizing that you have to build integrations &amp; user mode on top of it.</p><p>AD: How do you talk about different levels (developer, operational)?</p><p>JB: The advice I give is that lock-in is unavoidable.
The question is: What is the risk of that lock-in? You have to weigh that risk against the benefits. If you’re a startup, you’re not worried about the risk of moving away from a public cloud network. Versus a very large company. There are certain types of lock-in that present a problem for operations vs. development teams. Kubernetes makes it an operational problem versus a developmental problem.</p><p>BP: Operational: by using Kubernetes, people can bring up dev environments and test on internal infrastructure in our office. This is already providing value.</p><p>On the app side, risk comes in when cloud providers build databases where data is tied to the data center. Abstraction allows developers to be free from the data center.</p><p>AD: How does that work over time?</p><p>BP: For many organizations it comes down to a cost-benefit analysis. They look at their application code and figure out how long they’re locked in. Leverage only comes when you can call a bluff. Basically, it’s a business decision.</p><p>JB: It’s a new type of technical debt. There is no one answer.</p><p>AD: As fewer people can do this, salaries of mainframe programmers are going up now; what do you think about that?</p><p>JB: There is an analogy between the big public clouds and the legacy mainframe. Even if the mainframe is no longer the preferred choice, it will have a long future. It’s here to stay, even if the world moves on.</p><p>BP: The larger companies will be competing against the major tech companies that run clouds. We don’t have a term. Is it “cloud debt”? Cloud technical debt? It’s a nascent topic but becoming important. A new challenge.</p><p>JB: Data gravity.</p><p>AD: A lot of this is about Amazon. Are other large vendors approaching this because of their market position?</p><p>JB: Amazon is the big elephant for sure. But this goes beyond Amazon. When you look at Kubernetes and containers, they provide a model that did not exist before Amazon.
Amazon has been struggling to find a balance between infrastructure and ease of use.</p><p>So what is making this layer of infrastructure so interesting is not just a multi-cloud strategy, but a different way of thinking about programming and automating applications.</p><p>The interesting stuff is how we utilize this new tool set.</p><p>BP: It’s about ensuring the tech works across the board. When Kubernetes started, the tech wasn’t there yet for it to run on Amazon. One of our first challenges was to make it possible to get Kubernetes on Amazon. It’s an ongoing technological battle to figure out abstractions and to make cloud providers innovators themselves in data and network storage, etc.</p><p>AD: What’s the counter to, yes, CoreOS will help me not get locked into Amazon?</p><p>BP: Customers are getting APIs. We’re giving customers an API that we don’t modify, and they get upstream Kubernetes. We take open source software and integrate it; they can put that integration into their own apps. It’s about taking pieces and providing a cohesive experience.</p><p>Not just infrastructure but application monitoring. A lot of the value of the cloud is that it automates operations.</p><p>We provide you with open source software that is automated.</p><p>Software vendors have to start providing a value proposition of resecuring infrastructure when a vulnerability appears in the cloud. “Zero-toil automation.”</p><p><b>Q&amp;A:</b></p><p>Q: Customers with critical applications usually use multiple networks; is this one value proposition of the cloud lock-in argument?</p><p>BP: We have seen both; it depends on their internal risk assessment. You can have a beautiful architecture for how your business will survive, but if you don’t have applications around it, it’s all pointless.</p><p>JB: Geography is important.
Having a substrate to write apps against is important.</p><p>BP: It will be interesting, as we see global distribution of compute, network, and storage, to see the different cost-benefit analyses that become available. A lot of competition will arise outside of the US in terms of building data centers.</p><p>All our sessions will be streamed live! If you can't make it to Summit, here's the link: <a href="http://www.cloudflare.com/summit17">cloudflare.com/summit17</a></p> ]]></content:encoded>
            <category><![CDATA[Internet Summit]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">2HCFrLTeQdXcRz8imsUbtc</guid>
            <dc:creator>Internet Summit Team</dc:creator>
        </item>
    </channel>
</rss>