As we enable more ARM64 machines in our network, I want to give some technical insight into the process we went through to reach software parity in our multi-architecture environment.
To give some idea of the scale of this task, it’s necessary to describe the software stack we run on our servers. The foundation is the Linux kernel. Then, we use the Debian distribution as our base operating system. Finally, we install hundreds of packages that we build ourselves. Some packages are based on open-source software, often tailored to better meet our needs. Other packages were written from scratch within Cloudflare.
Industry support for ARM64 is very active, so a lot of open-source software has already been ported. This includes the Linux kernel. Additionally, Debian made ARM64 a first-class release architecture starting with Stretch in 2017. This meant that upon obtaining our ARM64 hardware, a few engineers were able to bring Debian up quickly and smoothly. Our attention then turned to getting all our in-house packages to build and run for ARM64.
Our stack uses a diverse range of programming languages, including C, C++, Go, Lua, Python, and Rust. Different languages have different porting requirements, with some being easier than others.
Porting Go Code
Cross-compiling Go code is relatively simple, since ARM64 is a first-class citizen. Go compiles and links static binaries using the system’s crossbuild toolchain, meaning the only additional Debian package we had to install on top of
After installing the crossbuild toolchain, we then replaced every go build invocation with a loop of GOARCH=<arch> CGO_ENABLED=1 go build, where <arch> iterates through
CGO_ENABLED=1 is required, as cgo is disabled by default for cross-compilation. The generated binaries are then run through our testing framework.
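Such a loop might look like the following shell sketch. It assumes the architectures are amd64 and arm64, that the C cross-compilers carry Debian's usual names, and that binaries are written to a hypothetical bin/<arch>/ directory. The function names cc_for_goarch and build_all are illustrative and not taken from any real build system.

```shell
# Map a Go architecture name to the C compiler cgo should use.
# The compiler names follow Debian's gcc packaging (an assumption).
cc_for_goarch() {
    case "$1" in
        amd64) echo "x86_64-linux-gnu-gcc" ;;
        arm64) echo "aarch64-linux-gnu-gcc" ;;
        *) echo "unsupported GOARCH: $1" >&2; return 1 ;;
    esac
}

# Build every package once per architecture.
# ARCHES and the bin/<arch>/ layout are hypothetical defaults.
# Set DRY_RUN=1 to print the commands instead of running them.
build_all() {
    for arch in ${ARCHES:-amd64 arm64}; do
        cc=$(cc_for_goarch "$arch") || return 1
        echo "building for $arch with $cc"
        ${DRY_RUN:+echo} env GOOS=linux GOARCH="$arch" CGO_ENABLED=1 CC="$cc" \
            go build -o "bin/$arch/" ./...
    done
}
```

Calling build_all from the root of a Go module produces one set of binaries per architecture. Setting CC explicitly matters because cgo needs a C compiler that targets the same architecture as GOARCH.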
Porting Rust Code
Rust also has mature support for ARM64. The steps for porting start with installing crossbuild-essential-arm64 and defining the --target triple in either
cargo. Different targets are bucketed into different tiers of completeness. Full instructions are well-documented at rust-cross.
One thing to note, however, is that any crates pulled in by a package must also be cross-compiled. The more crates used, the higher the chance of running into one that does not cross-compile well.
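As a rough illustration of those steps, the sketch below cross-builds a crate for the aarch64-unknown-linux-gnu target. It assumes the Rust toolchain is managed with rustup and that crossbuild-essential-arm64 has provided aarch64-linux-gnu-gcc as the linker. The helper names cargo_linker_var and cross_build_arm64 are hypothetical.

```shell
# Cargo reads the linker for a target from an environment variable named
# after the target triple: upper-cased, with dashes turned into underscores.
cargo_linker_var() {
    printf 'CARGO_TARGET_%s_LINKER\n' "$(printf '%s' "$1" | tr 'a-z-' 'A-Z_')"
}

# Cross-build the crate in the current directory for ARM64.
# Assumes rustup is installed and the Debian cross toolchain is present.
cross_build_arm64() {
    target=aarch64-unknown-linux-gnu
    rustup target add "$target" &&
        env "$(cargo_linker_var "$target")=aarch64-linux-gnu-gcc" \
            cargo build --release --target "$target"
}
```

Every dependency is compiled for the same target, so a crate that wraps a C library will also need that library available for ARM64.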
Testing, Plus Porting Other Code
Other languages are less cooperative when it comes to cross-compilation. Fiddling with
LD values didn’t seem to be the best use of engineering resources. What we really wanted was an emulation layer. An emulation layer would leverage all of our x86_64 machines, from our distributed compute behemoths to developers’ laptops, for the purposes of both building and testing code.
QEMU is an emulator with multiple modes, including both full system emulation and user-space emulation. Our compute nodes are beefy enough to handle system-level emulation, but for developers’ laptops, user-space emulation provides most of the benefits, with less overhead.
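For a single cross-compiled binary, user-space emulation can be as simple as prefixing the command with the emulator. The sketch below assumes Debian's qemu-user and libc6-arm64-cross packages, which provide qemu-aarch64 and an ARM64 library tree under /usr/aarch64-linux-gnu. The helpers runner_for and run_arm64 are hypothetical names.

```shell
# Print the command prefix needed to run an aarch64 binary on a given host
# architecture (defaults to the current machine).
runner_for() {
    case "${1:-$(uname -m)}" in
        aarch64) ;;  # native: no emulator needed
        *) echo "qemu-aarch64 -L /usr/aarch64-linux-gnu" ;;
    esac
}

# Run an aarch64 binary, going through qemu-user when the host is not ARM64.
run_arm64() {
    # Word splitting of the prefix is intentional here.
    $(runner_for) "$@"
}
```

For example, run_arm64 ./hello would start a hypothetical ./hello binary under the emulator on an x86_64 laptop and directly on an ARM64 machine.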
For user-space emulation, our porting team did not want to intrude too much into our developers’ normal workflow. Our internal build system already uses Docker as a backend, so it would be ideal to be able to docker run into an ARM environment, like so:
host$ uname -m
x86_64
host$ docker run --rm -it stretch-arm64/master:latest
guest# uname -m
aarch64
Fortunately, we were not the first ones to come up with this idea: folks over at resin.io have solved this problem already! They’ve also submitted a patch to qemu-user that prepends the emulator into every execve call, similar to how binfmt_misc is implemented. By prepending the emulator, you’re essentially forcing every new process to also be emulated, resulting in a nice self-contained environment.
With the execve patch built into qemu-user, all we had to do was copy the emulator into an ARM64 container and set the appropriate entrypoint:
# internal build of qemu with patch
FROM qemu-aarch64/master:latest as qemu

# arm64v8/debian:stretch-slim at 2018-02-12T13:02:00Z
FROM arm64v8/debian@sha256:841bbe6f4132526be95c91bec6757831c76e603309a47992e6444de6a0b6521a

COPY --from=qemu /qemu-aarch64 /qemu-aarch64
SHELL ["/qemu-aarch64", "/bin/sh", "-c"]

# setcap is required for `sudo` to work, but breaks environment variable passing
# run `setcap -r /qemu-aarch64` to break sudo, but allow environment variable passing
# maybe one day we’ll have our cake and eat it too
RUN apt-get update && apt-get install -y --no-install-recommends libcap2-bin && \
    setcap cap_setuid,cap_setgid+ep /qemu-aarch64 && \
    apt-get remove --purge -y libcap2-bin && apt-get autoremove -y && \
    rm -rf /var/lib/apt/lists/*

ENTRYPOINT ["/qemu-aarch64", "--execve", "/qemu-aarch64"]
CMD ["/bin/bash"]
This Dockerfile resulted in the cross-architecture output we were looking for earlier.
Now that we had a self-contained ARM64 environment, we could build and test most of our code relatively smoothly.
Of course, there are always a few blockers on the road to perfection. Upon releasing this modified Debian image to our developers, they returned with a few interesting problems:
- tests were failing due to
- Go programs segfaulting at indeterminate intervals
- system-installed libraries were taking precedence over user-installed libraries
- slow builds and sporadic test case failures, speeding up our plan for native builds and CI
LD_LIBRARY_PATH and Friends
It turns out that LD_LIBRARY_PATH was not the only environment variable that failed to work correctly. All environment variables, either set on the command line or via other means (e.g. export), would fail to propagate into the
Through bisection of known good code, we found that it was the setcap in our Dockerfile which prevented the environment variable passthrough. Unfortunately, this setcap is the same one that allows us to call sudo, so we have a caveat for our developers that they can either run sudo inside their containers, or have environment variable passing, but not both.
Intermittent Go Failures
With a decent amount of Go code running through our CI system, it was easy to spot a trend of intermittent segfaults.
Going on a hunch, we confirmed a hypothesis that non-deterministic failures are generally due to threading issues. Unfortunately, opinion on the issue tracker showed that Go / QEMU incompatibilities aren’t a priority, so we were left without an upstream fix.
The workaround we came up with is simple: if the problem is threading-related, limit where the threads can run! When we package our internal go binaries, we add a .deb post-install script to detect if we’re running under ARM64 emulation, and if so, reduce the number of CPUs the go binary can run under to one. We lose performance by pinning to one CPU, but this slowdown is negligible when we’re already running under emulation, and slow code is better than non-working code.
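The post-install script itself is not shown here, so the following is only a sketch of the idea. The detection heuristic is an assumption: under qemu-user the guest reports aarch64 from uname, while on QEMU versions that pass the host's /proc/cpuinfo through, an x86 vendor string is still visible. Pinning is done with taskset from util-linux, and the names is_emulated_arm64 and run_go_binary are hypothetical.

```shell
# Heuristic check for ARM64 user-space emulation on an x86 host.
# Arguments default to the live system; they exist to make the check testable.
is_emulated_arm64() {
    machine=${1:-$(uname -m)}
    cpuinfo=${2:-/proc/cpuinfo}
    [ "$machine" = "aarch64" ] &&
        grep -qE 'GenuineIntel|AuthenticAMD' "$cpuinfo" 2>/dev/null
}

# Start a program pinned to CPU 0 when emulated, unchanged otherwise.
run_go_binary() {
    if is_emulated_arm64; then
        exec taskset -c 0 "$@"
    fi
    exec "$@"
}
```

A wrapper like run_go_binary /usr/local/bin/some-service keeps native ARM64 machines at full parallelism while restricting emulated runs to a single CPU. The binary path here is a placeholder.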
With the workaround in place, reports of intermittent crashes dropped to zero. On to the next problem!
Shared Library Mixups
We like to be at the forefront of technology, from suggesting improvements to what would become TLS 1.3, to partnering with Mozilla to make DNS queries more secure, and everything in between. To be able to support these new technologies, our software has to be at the cutting edge.
On the other hand, we also need a reliable platform to build on. One of the reasons we chose Debian is its long-term support lifecycle, in contrast to rolling-release operating systems.
With these two ideas counterposed, we opted not to overwrite system libraries in /usr/lib with our cutting edge version, but instead supplement the defaults by installing into
The same development team that reported the LD_LIBRARY_PATH issue also came to us saying the ARM64 version of their software would fail to load shared object symbols. A debugging session was launched and we eventually isolated it to the ordering of /etc/ld.so.conf.d/ in Debian.
$ uname -m
x86_64
$ ls /etc/ld.so.conf.d/
libc.conf  x86_64-linux-gnu.conf
$ cat /etc/ld.so.conf.d/libc.conf
# libc default configuration
/usr/local/lib
$ cat /etc/ld.so.conf.d/x86_64-linux-gnu.conf
# Multiarch support
/lib/x86_64-linux-gnu
/usr/lib/x86_64-linux-gnu
$ uname -m
aarch64
$ ls /etc/ld.so.conf.d/
aarch64-linux-gnu.conf  libc.conf
$ cat /etc/ld.so.conf.d/aarch64-linux-gnu.conf
# Multiarch support
/lib/aarch64-linux-gnu
/usr/lib/aarch64-linux-gnu
$ cat /etc/ld.so.conf.d/libc.conf
# libc default configuration
/usr/local/lib
Because the files in /etc/ld.so.conf.d/ are read in alphabetical order, shared libraries in /usr/local/lib would be loaded before /usr/lib/$(uname -m)-linux-gnu on x86_64, while the opposite is true for
Internal discussion resulted in us not changing the system default search order, but instead using the linker flag --rpath to ask the runtime loader to explicitly search our /usr/local/lib location first.
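Both the ordering effect and the workaround can be sketched in a few lines of shell. The first helper lists directories in the order a set of configuration files would contribute them, using the shell's sorted glob expansion as a stand-in for the alphabetical read order. The second prints compiler-driver flags that embed a run-time search path. Both search_order and rpath_flags are hypothetical names, and passing the flag through the compiler as -Wl,-rpath is an assumption about how the link step is invoked.

```shell
# Print the directories listed in DIR/*.conf in the order the files are read.
# Comment lines and blank lines are skipped.
search_order() {
    for conf in "$1"/*.conf; do
        grep -vE '^(#|$)' "$conf" || true
    done
}

# Print flags that make the link step find libraries in DIR and record DIR
# in the binary, so the runtime loader searches it ahead of the system-wide
# configuration.
rpath_flags() {
    printf '%s\n' "-L$1 -Wl,-rpath,$1"
}
```

Running search_order /etc/ld.so.conf.d on each architecture shows where /usr/local/lib lands in the list. A link line such as cc -o app app.o $(rpath_flags /usr/local/lib) -lfoo, with libfoo as a placeholder, can then be checked with readelf -d app, which shows the recorded path as an RPATH or RUNPATH entry.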
This issue applies to both the emulated and physical ARM64 environments, which is a boon for the emulation framework we’ve just put together.
Native Builds and CI
Cross- and emulated compilation brought over 99% of our edge codebase, but there were still a handful of packages that did not fit the models we defined. Some packages, e.g. llvm, parallelize their build so well that the cost of userspace emulation slowed the build time to over 6 hours. Other packages called more esoteric functions which QEMU was not prepared to handle.
Rather than devote resources to emulating the long tail, we allocated a few ARM64 machines for developer access, and one machine for a native CI agent. Maintainers of the long tail could iterate in peace, knowing their failing test cases were never due to the emulation layer. When ready, CI would pick up their changes and build an official package, post-review.
While native compilation is the least error-prone build method, the limited supply of machines made this option unattractive: the more machines we allocate for development and CI, the fewer we have left for our proving grounds.
Ideally, we should follow the Go team’s recommendation of running code natively, but as long as our developers iterate on their x86_64 laptops, supporting emulation is necessary for us.
With the most glaring blockers out of the way, we have now given our developers an even footing to easily build for multiple architectures.
The rest of the time was spent coordinating over a hundred packages, split between dozens of tech teams. At the beginning, responsibility for building ARM64 packages lay with the porting team. Working on a changing codebase required close collaboration between maintainer and porter.
Once we deemed our ARM64 platform production-ready, a self-guided procedure was created to use the build methods listed above, and a request was sent out to all of engineering to support ARM64 as a first-class citizen.
The end result of our stack is currently being tested, profiled, and optimized, with results coming soon!
Many more opportunities exist for systems integration, debugging deep dives, cross-team collaboration, and internal tools development. Come join us if you’re interested!
ARM64 is sometimes used interchangeably with aarch64 and ARMv8.
binfmt_misc is also what Docker for Mac uses to leverage multi-architecture support; we’re supporting something very similar, albeit from Linux to Linux, versus macOS to Linux.