Dynamic Process Isolation: Research by Cloudflare and TU Graz

Last year, I wrote about the Cloudflare Workers security model, including how we fight Spectre attacks. In that post, I explained that there is no known complete defense against Spectre — regardless of whether you're using isolates, processes, containers, or virtual machines to isolate tenants. What we do have, though, is a huge number of tools to increase the cost of a Spectre attack, to the point where it becomes infeasible. Cloudflare Workers has been designed from the very beginning with protection against side channel attacks in mind, and because of this we have been able to incorporate many defenses that other platforms — such as virtual machines and web browsers — cannot. However, the performance and scalability requirements of edge compute make it infeasible to run every Worker in its own private process, so we cannot rely on the usual defenses provided by the operating system kernel and address space separation.

Given our different approach, we cannot simply rely on others to tell us if we are safe. We had to do our own research. To do this we partnered with researchers at Graz University of Technology (TU Graz) to study the impact of Spectre on our environment. The team at TU Graz are some of the foremost experts on the topic, having co-discovered Spectre initially as well as discovered several follow-on bugs like NetSpectre, ZombieLoad, Fallout, and others.

Today we are publishing a paper describing our findings, authored by Martin Schwarzl, Pietro Borrello, Andreas Kogler, Thomas Schuster, Daniel Gruss, Michael Schwarz, and myself. This paper covers research done in 2019 and early 2020. The research both tests the possibility of attacking Workers using Spectre, and proposes a new defense mechanism, which we now employ in production.

For this research, the team at TU Graz had full access to the Workers Runtime source code and were able to compile and run it locally for testing.

The research has two basic components.

Part 1: Develop an attack

A side channel attack (of which Spectre is one variety) is kind of like playing poker with a CPU. In poker, players try to understand what their opponents are thinking by looking for subtle unconscious behaviors, such as a nervous look or a hand motion. These behaviors are called "tells". In a side channel attack, the attacker wants to find out secrets that the CPU knows. The CPU won't reveal these secrets directly, but they can sometimes subtly affect how long the CPU spends to perform certain operations, kind of like a poker tell. If an attacker can carefully time the CPU's actions, they can potentially discover the underlying secrets. Spectre attacks in particular focus on side channels that result from the CPU's use of speculative execution, in which the CPU executes code that it is not yet sure should be executed, and then attempts to roll it back if not. Speculative execution is a particularly potent tool in side channel attacks because it essentially allows the attacker to program custom side channels in speculatively-executed code.

Many Spectre defenses focus on eliminating the "tells" by trying to prevent the variability in the CPU's timing. This is hard, because CPUs are extremely complex and there are many ways that their timing can be affected. While many specific "tells" have been found and mitigated, there are undoubtedly many more that haven't been disclosed. This has led to a game of whack-a-mole, where researchers continuously find new "tells" while CPU vendors rush out kernel and microcode patches to solve them — often with large performance losses as a side effect.

In Workers, we have focused on a different approach: preventing the attacker from seeing the "tells". The Workers Runtime is designed to prevent a Worker from measuring its own execution time, as well as to prevent other forms of non-deterministic behavior like multithreading that could be used in place of a timer. I described these techniques in detail in last year's post.

However, this approach can't be perfect as long as Workers are allowed to talk to the rest of the world. A Worker could always communicate with a remote time server to measure time. Such communications will be far less accurate than a local timer, and since the timing differences are extremely small, they will be hard to measure this way. But, by using amplification techniques to improve the strength of the signal, repeating the attack many times and applying statistics, it could still be possible to derive secrets.

We therefore set out to develop an attack based on this approach. Upon applying the best techniques available to us, we were indeed able to produce a working Spectre variant 1 attack that could leak memory at a rate of 120 bits per hour. Compared to attacks demonstrated on many other platforms, 120 bits per hour is pretty slow. However, it's obviously still fast enough to be a problem.

It's important to note, though, that this speed was achieved in an ideal scenario:

  • Since the Workers Runtime prevents Workers from measuring their own execution time, any attack would need to rely on a remote time server. But for the purpose of our test, the "remote" server was in fact located on the same machine. In a real-world scenario, such a server would need to be accessed over the Internet, making the timing less accurate.
  • The machine running the test had no other load. A real-world machine would be processing hundreds or thousands of requests concurrently, creating noise.
  • The attack only demonstrated that it could read some bits that it shouldn't. In order to read interesting bits, an attacker would first need to locate those bits, which likely would require reading hundreds or thousands of other bits first.

In the real world, these factors appear to make an attack too slow to be interesting. If an attack takes days or weeks to carry out, the contents of memory are highly likely to change before it can read them. For example, we update the Workers Runtime code at least once a week, which causes a restart of all processes.

That said, we did not feel comfortable relying on this argument as our defense. Instead, we set out to do better.

Part 2: Enhance our defenses

In the second part of the research, we designed and implemented a novel Spectre defense which we call Dynamic Process Isolation.

Dynamic Process Isolation was described in my blog post last year. At the time, this system was still in testing, but it has since been fully deployed in production.

In short, our defense uses hardware performance counters to detect Workers whose performance characteristics could be indicative of an attack. Before the attack has had enough time to leak any bits, we move the Worker into a separate operating system process, thus taking advantage of the additional defenses implemented by the OS kernel. Crucially, since a benign Worker can still operate normally while in an isolated process, we are able to use a detector that produces false positives, as long as the rate is relatively low. This affordance made it possible for us to develop a working classifier where previous work in the area had struggled.

Specifically, we developed a detector based on measuring branch mispredictions. Spectre variant 1 attacks — the fastest and easiest kind of Spectre attack — work by fooling the CPU's branch predictor to trigger speculative code execution. Such an attack, when running in our environment, must trigger repeated mispredictions in a loop, in order to get enough data to apply statistics to overcome the noise floor. We can see these mispredictions in the hardware performance counters. While an attack could try to evade the detector by spreading out its trials over a longer time period, doing so would slow down the attack by orders of magnitude, which is exactly our goal. Classifiers for other Spectre variants might be straightforward to build as well, however, we find other variants already produce much lower bandwidth or are otherwise effectively mitigated by our existing defenses.

This defense successfully detects and mitigates the attack we developed. We also tested it against a number of Spectre proofs of concept and found it caught all of them. Meanwhile, the rate of false positives is well within the range we can tolerate: Out of many thousands of Workers running on our platform, we see only about 20 being falsely detected as attacks.

For more details, check out the paper and my blog post from last year.

Read the Paper

Collaborating with TU Graz was a great experience. We are very happy to work with some of the world's foremost experts on this problem, and to have produced not just an attack but also a constructive defense.

For more details, download the full paper on arXiv.