
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 09:50:00 GMT</lastBuildDate>
        <item>
            <title><![CDATA[A year of improving Node.js compatibility in Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/nodejs-workers-2025/</link>
            <pubDate>Thu, 25 Sep 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ Over the year we have greatly expanded Node.js compatibility. There are hundreds of new Node.js APIs now available that make it easier to run existing Node.js code on our platform. ]]></description>
<content:encoded><![CDATA[ <p>We've been busy.</p><p>Compatibility with the broad JavaScript developer ecosystem has always been a key strategic investment for us. We believe in open standards and an open web. We want you to see <a href="https://workers.cloudflare.com/"><u>Workers</u></a> as a powerful extension of your development platform with the ability to just drop code in that Just Works. To deliver on this goal, the Cloudflare Workers team has spent the past year significantly expanding compatibility with the Node.js ecosystem, enabling hundreds (if not thousands) of popular <a href="https://npmjs.com"><u>npm</u></a> modules to now work seamlessly, including the ever-popular <a href="https://expressjs.com"><u>express</u></a> framework.</p><p>We have implemented a <a href="https://developers.cloudflare.com/workers/runtime-apis/nodejs/"><u>substantial subset of the Node.js standard library</u></a>, focusing on the most commonly used and most frequently requested APIs. These include:</p>
<div><table><colgroup>
<col></col>
<col></col>
</colgroup>
<thead>
  <tr>
    <th><span>Module</span></th>
    <th><span>API documentation</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>node:console</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/console.html"><span>https://nodejs.org/docs/latest/api/console.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:crypto</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/crypto.html"><span>https://nodejs.org/docs/latest/api/crypto.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:dns</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/dns.html"><span>https://nodejs.org/docs/latest/api/dns.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:fs</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/fs.html"><span>https://nodejs.org/docs/latest/api/fs.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:http</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/http.html"><span>https://nodejs.org/docs/latest/api/http.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:https</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/https.html"><span>https://nodejs.org/docs/latest/api/https.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:net</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/net.html"><span>https://nodejs.org/docs/latest/api/net.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:process</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/process.html"><span>https://nodejs.org/docs/latest/api/process.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:timers</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/timers.html"><span>https://nodejs.org/docs/latest/api/timers.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:tls</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/tls.html"><span>https://nodejs.org/docs/latest/api/tls.html</span></a><span> </span></td>
  </tr>
  <tr>
    <td><span>node:zlib</span></td>
    <td><a href="https://nodejs.org/docs/latest/api/zlib.html"><span>https://nodejs.org/docs/latest/api/zlib.html</span></a><span> </span></td>
  </tr>
</tbody></table></div><p>Each of these has been carefully implemented to approximate Node.js' behavior as closely as feasible. Where matching <a href="http://nodejs.org"><u>Node.js</u></a>' behavior is not possible, our implementations will throw a clear error when called, rather than silently failing or not being present at all. This ensures that packages that check for the presence of these APIs will not break, even if the functionality is not available.</p><p>In some cases, we had to implement entirely new capabilities within the runtime in order to provide the necessary functionality. For <code>node:fs</code>, we added a new virtual file system within the Workers environment. In other cases, such as with <code>node:net</code>, <code>node:tls</code>, and <code>node:http</code>, we wrapped the new Node.js APIs around existing Workers capabilities such as the <a href="https://developers.cloudflare.com/workers/runtime-apis/tcp-sockets/"><u>Sockets API</u></a> and <a href="https://developers.cloudflare.com/workers/runtime-apis/fetch/"><code><u>fetch</u></code></a>.</p><p>Most importantly, <b>all of these implementations are done natively in the Workers runtime</b>, using a combination of TypeScript and C++. Whereas our earlier Node.js compatibility efforts relied heavily on polyfills and shims injected at deployment time by developer tooling such as <a href="https://developers.cloudflare.com/workers/wrangler/"><u>Wrangler</u></a>, we are moving towards a model where future Workers will have these APIs available natively, without the need for any additional dependencies. This not only improves performance and reduces memory usage, but also ensures that the behavior is as close to Node.js as possible.</p>
    <div>
      <h2>The networking stack</h2>
      <a href="#the-networking-stack">
        
      </a>
    </div>
    <p>Node.js has a rich set of networking APIs that allow applications to create servers, make HTTP requests, work with raw TCP and UDP sockets, send DNS queries, and more. Workers do not have direct access to raw kernel-level sockets though, so how can we support these Node.js APIs so packages still work as intended? We decided to build on top of the existing <a href="https://developers.cloudflare.com/workers/runtime-apis/tcp-sockets/"><u>managed Sockets</u></a> and fetch APIs. These implementations allow many popular Node.js packages that rely on networking APIs to work seamlessly in the Workers environment.</p><p>Let's start with the HTTP APIs.</p>
    <div>
      <h3>HTTP client and server support</h3>
      <a href="#http-client-and-server-support">
        
      </a>
    </div>
    <p>From the moment we announced that we would be pursuing Node.js compatibility within Workers, users have been asking specifically for an implementation of the <code>node:http</code> module. There are countless modules in the ecosystem that depend directly on APIs like <code>http.get(...)</code> and <code>http.createServer(...)</code>.</p><p>The <code>node:http</code> and <code>node:https</code> modules provide APIs for creating HTTP clients and servers. <a href="https://blog.cloudflare.com/bringing-node-js-http-servers-to-cloudflare-workers/"><u>We have implemented both</u></a>, allowing you to create HTTP clients using <code>http.request()</code> and servers using <code>http.createServer()</code>. <a href="https://developers.cloudflare.com/workers/runtime-apis/nodejs/http/"><u>The HTTP client implementation</u></a> is built on top of the Fetch API, while the HTTP server implementation is built on top of the Workers runtime’s existing request handling capabilities.</p><p>The client side is fairly straightforward:</p>
            <pre><code>import http from 'node:http';

export default {
  async fetch(request) {
    return new Promise((resolve, reject) =&gt; {
      const req = http.request('http://example.com', (res) =&gt; {
        let data = '';
        res.setEncoding('utf8');
        res.on('data', (chunk) =&gt; {
          data += chunk;
        });
        res.on('end', () =&gt; {
          resolve(new Response(data));
        });
      });
      req.on('error', (err) =&gt; {
        reject(err);
      });
      req.end();
    });
  }
}
</code></pre>
            <p>The server side is just as simple but likely even more exciting. We've often been asked about the possibility of supporting <a href="https://expressjs.com/"><u>Express</u></a>, or <a href="https://koajs.com/"><u>Koa</u></a>, or <a href="https://fastify.dev/"><u>Fastify</u></a> within Workers, but it was difficult to do because these were so dependent on the Node.js APIs. With the new additions it is now possible to use both Express and Koa within Workers, and we're hoping to be able to add Fastify support later. </p>
            <pre><code>import { createServer } from "node:http";
import { httpServerHandler } from "cloudflare:node";

const server = createServer((req, res) =&gt; {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello from Node.js HTTP server!");
});

export default httpServerHandler(server);
</code></pre>
<p>The <code>httpServerHandler()</code> function from the <code>cloudflare:node</code> module integrates the HTTP <code>server</code> with the Workers fetch event, allowing it to handle incoming requests.</p>
    <div>
      <h3>The <code>node:dns</code> module</h3>
      <a href="#the-node-dns-module">
        
      </a>
    </div>
<p>The <code>node:dns</code> module provides an API for performing DNS queries. </p><p>At Cloudflare, we happen to have a <a href="https://developers.cloudflare.com/1.1.1.1/encryption/dns-over-https/"><u>DNS-over-HTTPS (DoH)</u></a> service and our own <a href="https://one.one.one.one/"><u>DNS service called 1.1.1.1</u></a>. We took advantage of this when exposing <code>node:dns</code> in Workers. When you use this module to perform a query, it simply makes a subrequest to 1.1.1.1 to resolve the query. This way the user doesn’t have to think about DNS servers, and queries just work.</p>
    <div>
      <h3>The <code>node:net</code> and <code>node:tls</code> modules</h3>
      <a href="#the-node-net-and-node-tls-modules">
        
      </a>
    </div>
<p>The <code>node:net</code> module provides an API for creating TCP sockets, while the <code>node:tls</code> module provides an API for creating secure TLS sockets. As we mentioned before, both are built on top of the existing <a href="https://developers.cloudflare.com/workers/runtime-apis/tcp-sockets/"><u>Workers Sockets API</u></a>. Note that not all features of the <code>node:net</code> and <code>node:tls</code> modules are available in Workers. For instance, it is not yet possible to create a TCP server using <code>net.createServer()</code> (but maybe soon!), but we have implemented enough of the APIs to allow many popular packages that rely on these modules to work in Workers.</p>
<pre><code>import net from 'node:net';

export default {
  async fetch(request) {
    const { promise, resolve, reject } = Promise.withResolvers();
    const socket = net.connect({ host: 'example.com', port: 80 }, () =&gt; {
      let buf = '';
      socket.setEncoding('utf8');
      socket.on('data', (chunk) =&gt; buf += chunk);
      socket.on('end', () =&gt; resolve(new Response(buf)));
      socket.on('error', reject);
      socket.end();
    });
    return promise;
  }
}
</code></pre>
            
    <div>
      <h2>A new virtual file system and the <code>node:fs</code> module</h2>
      <a href="#a-new-virtual-file-system-and-the-node-fs-module">
        
      </a>
    </div>
<p>What does supporting filesystem APIs mean in a serverless environment? When you deploy a Worker, it runs in Region:Earth, and we don’t want you to have to think about individual servers with individual file systems. There are, however, countless existing applications and modules in the ecosystem that leverage the file system to store configuration data, read and write temporary data, and more.</p><p>Workers do not have access to a traditional file system like a Node.js process does, and for good reason! A Worker does not run on a single machine; a single request to one Worker can run on any one of thousands of servers anywhere in Cloudflare's global <a href="https://www.cloudflare.com/network"><u>network</u></a>. Coordinating and synchronizing access to shared physical resources such as a traditional file system harbors major technical challenges, including the risk of deadlocks, that are inherent in any massively distributed system. Fortunately, Workers offers powerful tools like <a href="https://developers.cloudflare.com/durable-objects/"><u>Durable Objects</u></a> that provide a solution for coordinating access to shared, durable state at scale. To address the need for a file system in Workers, we built on what already makes Workers great.</p><p>We implemented a virtual file system that allows you to use the <code>node:fs</code> APIs to read and write temporary, in-memory files. This virtual file system is specific to each Worker. When using a stateless Worker, files created in one request are not accessible in any other request. When using a Durable Object, however, this temporary file space can be shared across multiple requests from multiple users. This file system is ephemeral (for now), meaning that files are not persisted across Worker restarts or deployments, so it does not replace the <a href="https://developers.cloudflare.com/durable-objects/api/storage-api/"><u>Durable Object Storage</u></a> mechanism, but it provides a powerful new tool that greatly expands the capabilities of your Durable Objects.</p><p>The <code>node:fs</code> module provides a rich set of APIs for working with files and directories:</p>
            <pre><code>import fs from 'node:fs';

export default {
  async fetch(request) {
    // Write a temporary file
    await fs.promises.writeFile('/tmp/hello.txt', 'Hello, world!');

    // Read the file
    const data = await fs.promises.readFile('/tmp/hello.txt', 'utf-8');

    return new Response(`File contents: ${data}`);
  }
}
</code></pre>
            <p>The virtual file system supports a wide range of file operations, including reading and writing files, creating and removing directories, and working with file descriptors. It also supports standard input/output/error streams via <code>process.stdin</code>, <code>process.stdout</code>, and <code>process.stderr</code>, symbolic links, streams, and more.</p><p>While the current implementation of the virtual file system is in-memory only, we are exploring options for adding persistent storage in the future that would link to existing Cloudflare storage solutions like <a href="https://www.cloudflare.com/developer-platform/products/r2/">R2</a> or Durable Objects. But you don't have to wait on us! When combined with powerful tools like Durable Objects and <a href="https://developers.cloudflare.com/workers/runtime-apis/rpc/"><u>JavaScript RPC</u></a>, it's certainly possible to create your own general purpose, durable file system abstraction backed by sqlite storage.</p>
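<p>Directory operations follow the same Node.js semantics. A minimal sketch (using <code>node:path</code> and <code>node:os</code> for portable temporary paths):</p>

```javascript
import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';

// Create a scratch directory, write two files, then list them.
// In Workers, these calls operate on the in-memory virtual file system.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'demo-'));
fs.writeFileSync(path.join(dir, 'a.txt'), 'alpha');
fs.writeFileSync(path.join(dir, 'b.txt'), 'beta');

const entries = fs.readdirSync(dir).sort();
console.log(entries); // [ 'a.txt', 'b.txt' ]
```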
    <div>
      <h2>Cryptography with <code>node:crypto</code></h2>
      <a href="#cryptography-with-node-crypto">
        
      </a>
    </div>
<p>The <code>node:crypto</code> module provides a comprehensive set of cryptographic functionality, including hashing, encryption, decryption, and more. We have implemented a full version of the <code>node:crypto</code> module, allowing you to use familiar cryptographic APIs in your Workers applications. There will be some differences in behavior compared to Node.js because Workers uses <a href="https://github.com/google/boringssl/blob/main/README.md"><u>BoringSSL</u></a> under the hood, while Node.js uses <a href="https://github.com/openssl"><u>OpenSSL</u></a>. However, we have strived to make the APIs as compatible as possible, and many popular packages that rely on <code>node:crypto</code> now work seamlessly in Workers.</p><p>To accomplish this, we didn't just copy the implementation of these cryptographic operations from Node.js. Rather, we worked within the Node.js project to extract the core crypto functionality out into a separate dependency project called <a href="https://github.com/nodejs/ncrypto"><code><u>ncrypto</u></code></a> that is used not only by Workers but also by Bun to implement Node.js-compatible functionality by simply running the exact same code that Node.js is running.</p>
            <pre><code>import crypto from 'node:crypto';

export default {
  async fetch(request) {
    const hash = crypto.createHash('sha256');
    hash.update('Hello, world!');
    const digest = hash.digest('hex');

    return new Response(`SHA-256 hash: ${digest}`);
  }
}
</code></pre>
            <p>All major capabilities of the <code>node:crypto</code> module are supported, including:</p><ul><li><p>Hashing (e.g., SHA-256, SHA-512)</p></li><li><p>HMAC</p></li><li><p>Symmetric encryption/decryption</p></li><li><p>Asymmetric encryption/decryption</p></li><li><p>Digital signatures</p></li><li><p>Key generation and management</p></li><li><p>Random byte generation</p></li><li><p>Key derivation functions (e.g., PBKDF2, scrypt)</p></li><li><p>Cipher and Decipher streams</p></li><li><p>Sign and Verify streams</p></li><li><p>KeyObject class for managing keys</p></li><li><p>Certificate handling (e.g., X.509 certificates)</p></li><li><p>Support for various encoding formats (e.g., PEM, DER, base64)</p></li><li><p>and more…</p></li></ul>
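<p>For instance, the HMAC and key derivation APIs from the list above behave just as they do in Node.js (a small sketch with made-up key material):</p>

```javascript
import crypto from 'node:crypto';

// HMAC-SHA256 of a message under a shared secret
const mac = crypto.createHmac('sha256', 'secret-key')
  .update('Hello, world!')
  .digest('hex');

// Derive a 32-byte key from a password using PBKDF2
const derived = crypto.pbkdf2Sync('password', 'salt', 100_000, 32, 'sha256');

console.log(mac.length, derived.length); // 64 32
```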
    <div>
      <h2>Process &amp; Environment</h2>
      <a href="#process-environment">
        
      </a>
    </div>
<p>In Node.js, the <code>node:process</code> module provides a global object that gives information about, and control over, the current Node.js process. It includes properties and methods for accessing environment variables, command-line arguments, the current working directory, and more. It is one of the most fundamental modules in Node.js, and many packages rely on it for basic functionality and simply assume its presence. There are, however, some aspects of the <code>node:process</code> module that do not make sense in the Workers environment, such as process IDs and user/group IDs, which are tied to the operating system and process model of a traditional server and have no equivalent in Workers.</p><p>When <code>nodejs_compat</code> is enabled, the <code>process</code> global will be available in your Worker scripts, or you can import it directly via <code>import process from 'node:process'</code>. Note that the <code>process</code> global is only available when the <code>nodejs_compat</code> flag is enabled. If you try to access <code>process</code> without the flag, it will be <code>undefined</code> and the import will throw an error.</p><p>Let's take a look at the <code>process</code> APIs that do make sense in Workers, and that have been fully implemented, starting with <code>process.env</code>.</p>
    <div>
      <h3>Environment variables</h3>
      <a href="#environment-variables">
        
      </a>
    </div>
<p>Workers have had <a href="https://developers.cloudflare.com/workers/configuration/environment-variables/"><u>support for environment variables</u></a> for a while now, but previously they were only accessible via the <code>env</code> argument passed to the Worker function. Accessing the environment at the top-level of a Worker was not possible:</p>
            <pre><code>export default {
  async fetch(request, env) {
    const config = env.MY_ENVIRONMENT_VARIABLE;
    // ...
  }
}
</code></pre>
            <p> With the <a href="https://developers.cloudflare.com/workers/configuration/environment-variables/"><code><u>new process.env</u></code><u> implementation</u></a>, you can now access environment variables in a more familiar way, just like in Node.js, and at any scope, including the top-level of your Worker:</p>
            <pre><code>import process from 'node:process';
const config = process.env.MY_ENVIRONMENT_VARIABLE;

export default {
  async fetch(request, env) {
    // You can still access env here if you need to
    const configFromEnv = env.MY_ENVIRONMENT_VARIABLE;
    // ...
  }
}
</code></pre>
            <p><a href="https://developers.cloudflare.com/workers/configuration/environment-variables/"><u>Environment variables</u></a> are set in the same way as before, via the <code>wrangler.toml</code> or <code>wrangler.jsonc</code> configuration file, or via the Cloudflare dashboard or API. They may be set as simple key-value pairs or as JSON objects:</p>
            <pre><code>{
  "name": "my-worker-dev",
  "main": "src/index.js",
  "compatibility_date": "2025-09-15",
  "compatibility_flags": [
    "nodejs_compat"
  ],
  "vars": {
    "API_HOST": "example.com",
    "API_ACCOUNT_ID": "example_user",
    "SERVICE_X_DATA": {
      "URL": "service-x-api.dev.example",
      "MY_ID": 123
    }
  }
}
</code></pre>
            <p>When accessed via <code>process.env</code>, all environment variable values are strings, just like in Node.js.</p><p>Because <code>process.env</code> is accessible at the global scope, it is important to note that environment variables are accessible from anywhere in your Worker script, including third-party libraries that you may be using. This is consistent with Node.js behavior, but it is something to be aware of from a security and configuration management perspective. The <a href="https://developers.cloudflare.com/secrets-store/"><u>Cloudflare Secrets Store</u></a> can provide enhanced handling around secrets within Workers as an alternative to using environment variables.</p>
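<p>In practice this means an object-valued var, like <code>SERVICE_X_DATA</code> in the configuration above, arrives as a string (presumably its JSON serialization) and must be parsed before use. A sketch, where the assignment below only simulates what the runtime populates:</p>

```javascript
import process from 'node:process';

// Simulate the runtime populating an object-valued var as a string
process.env.SERVICE_X_DATA = '{"URL":"service-x-api.dev.example","MY_ID":123}';

const serviceX = JSON.parse(process.env.SERVICE_X_DATA);
console.log(serviceX.URL, serviceX.MY_ID); // service-x-api.dev.example 123
```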
    <div>
      <h4>Importable environment and waitUntil</h4>
      <a href="#importable-environment-and-waituntil">
        
      </a>
    </div>
<p>We decided to go a step further and make it possible to import both the environment and the <a href="https://developers.cloudflare.com/workers/configuration/environment-variables/"><u>waitUntil mechanism</u></a> as a module, even when not using the <code>nodejs_compat</code> flag, rather than forcing users to always access them via the <code>env</code> and <code>ctx</code> arguments passed to the Worker function. This can make it easier to access the environment in a more modular way, and can help avoid passing the <code>env</code> argument through multiple layers of function calls. This is not a Node.js-compatibility feature, but we believe it is a useful addition to the Workers environment:</p>
            <pre><code>import { env, waitUntil } from 'cloudflare:workers';

const config = env.MY_ENVIRONMENT_VARIABLE;

export default {
  async fetch(request) {
    // You can still access env here if you need to
    const configFromEnv = env.MY_ENVIRONMENT_VARIABLE;
    // ...
  }
}

function doSomething() {
  // Bindings and waitUntil can now be accessed without
  // passing the env and ctx through every function call.
  waitUntil(env.RPC.doSomethingRemote());
}
</code></pre>
<p>One important note about <code>process.env</code>: changes to environment variables via <code>process.env</code> will not be reflected in the <code>env</code> argument passed to the Worker function, and vice versa. <code>process.env</code> is populated at the start of Worker execution and is not updated dynamically. This is consistent with Node.js behavior, where changes to <code>process.env</code> do not affect the actual environment variables of the running process. We did this to minimize the risk that a third-party library, originally meant to run in Node.js, could inadvertently modify the environment assumed by the rest of the Worker code.</p>
    <div>
      <h3>Stdin, stdout, stderr</h3>
      <a href="#stdin-stdout-stderr">
        
      </a>
    </div>
<p>Workers do not have traditional standard input/output/error streams like a Node.js process does. However, we have implemented <code>process.stdin</code>, <code>process.stdout</code>, and <code>process.stderr</code> as stream-like objects that can be used similarly. These streams are not connected to any actual process stdin and stdout; instead, output written to them is captured by the Worker in the same way as <code>console.log</code> and friends, and, just like them, it will show up in <a href="https://developers.cloudflare.com/workers/observability/logs/workers-logs/"><u>Workers Logs</u></a>.</p><p>Both <code>process.stdout</code> and <code>process.stderr</code> are Node.js writable streams:</p>
            <pre><code>import process from 'node:process';

export default {
  async fetch(request) {
    process.stdout.write('This will appear in the Worker logs\n');
    process.stderr.write('This will also appear in the Worker logs\n');
    return new Response('Hello, world!');
  }
}
</code></pre>
            <p>Support for <code>stdin</code>, <code>stdout</code>, and <code>stderr</code> is also integrated with the virtual file system, allowing you to write to the standard file descriptors <code>0</code>, <code>1</code>, and <code>2</code> (representing <code>stdin</code>, <code>stdout</code>, and <code>stderr</code> respectively) using the <code>node:fs</code> APIs:</p>
            <pre><code>import fs from 'node:fs';
import process from 'node:process';

export default {
  async fetch(request) {
    // Write to stdout
    fs.writeSync(process.stdout.fd, 'Hello, stdout!\n');
    // Write to stderr
    fs.writeSync(process.stderr.fd, 'Hello, stderr!\n');

    return new Response('Check the logs for stdout and stderr output!');
  }
}
</code></pre>
            
    <div>
      <h3>Other process APIs</h3>
      <a href="#other-process-apis">
        
      </a>
    </div>
<p>We cannot cover every <code>node:process</code> API in detail here, but here are some of the other notable APIs that we have implemented:</p><ul><li><p><code>process.nextTick(fn)</code>: Schedules a callback to be invoked after the current execution context completes. Our implementation uses the same microtask queue as promises so that it behaves exactly the same as <code>queueMicrotask(fn)</code>.</p></li><li><p><code>process.cwd()</code> and <code>process.chdir()</code>: Get and change the current virtual working directory. The current working directory is initialized to <code>/bundle</code> when the Worker starts, and every request has its own isolated view of the current working directory. Changing the working directory in one request does not affect the working directory in other requests.</p></li><li><p><code>process.exit()</code>: Immediately terminates the current Worker request execution. This is unlike Node.js, where <code>process.exit()</code> terminates the entire process. In Workers, calling <code>process.exit()</code> will stop execution of the current request and return an error response to the client.</p></li></ul>
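<p>The <code>process.nextTick()</code> ordering described above can be sketched as follows; the callback runs only after the current synchronous execution completes:</p>

```javascript
import process from 'node:process';

const order = [];
process.nextTick(() => order.push('tick'));
order.push('sync');

// Give the scheduled callback a chance to run, then observe the order
await new Promise((resolve) => setTimeout(resolve, 0));
console.log(order); // [ 'sync', 'tick' ]
```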
    <div>
      <h2>Compression with <code>node:zlib</code></h2>
      <a href="#compression-with-node-zlib">
        
      </a>
    </div>
    <p>The <code>node:zlib</code> module provides APIs for compressing and decompressing data using various algorithms such as gzip, deflate, and brotli. We have implemented the <code>node:zlib</code> module, allowing you to use familiar compression APIs in your Workers applications. This enables a wide range of use cases, including data compression for network transmission, response optimization, and archive handling.</p>
            <pre><code>import zlib from 'node:zlib';

export default {
  async fetch(request) {
    const input = 'Hello, world! Hello, world! Hello, world!';
    const compressed = zlib.gzipSync(input);
    const decompressed = zlib.gunzipSync(compressed).toString('utf-8');

    return new Response(`Decompressed data: ${decompressed}`);
  }
}
</code></pre>
<p>While Workers has had built-in support for gzip and deflate compression via the <a href="https://compression.spec.whatwg.org/"><u>Web Platform Standard Compression API</u></a>, the <code>node:zlib</code> module adds support for the Brotli compression algorithm, as well as a more familiar API for Node.js developers.</p>
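<p>A Brotli round trip mirrors the gzip example above:</p>

```javascript
import zlib from 'node:zlib';

const input = 'Hello, world! Hello, world! Hello, world!';

// Compress with Brotli, then decompress back to the original string
const compressed = zlib.brotliCompressSync(input);
const decompressed = zlib.brotliDecompressSync(compressed).toString('utf-8');

console.log(decompressed === input); // true
```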
    <div>
      <h2>Timing &amp; scheduling</h2>
      <a href="#timing-scheduling">
        
      </a>
    </div>
    <p>Node.js provides a set of timing and scheduling APIs via the <code>node:timers</code> module. We have implemented these in the runtime as well.</p>
            <pre><code>import timers from 'node:timers';

export default {
  async fetch(request) {
    timers.setInterval(() =&gt; {
      console.log('This will log every half-second');
    }, 500);

    timers.setImmediate(() =&gt; {
      console.log('This will log immediately after the current event loop');
    });

    return new Promise((resolve) =&gt; {
      timers.setTimeout(() =&gt; {
        resolve(new Response('Hello after 1 second!'));
      }, 1000);
    });
  }
}
</code></pre>
<p>The Node.js implementations of the timers APIs are very similar to their standard Web Platform counterparts, with one key difference: the Node.js timers APIs return <code>Timeout</code> objects that can be used to manage the timers after they have been created. We have implemented the <code>Timeout</code> class in Workers to provide this functionality, allowing you to clear or refresh timers as needed.</p>
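<p>For example, holding on to the returned <code>Timeout</code> lets you cancel a timer before it fires (in Node.js the object also exposes methods like <code>refresh()</code> to restart the countdown):</p>

```javascript
import timers from 'node:timers';

let fired = false;
// Unlike the Web API, this returns a Timeout object, not a number
const timeout = timers.setTimeout(() => { fired = true; }, 50);

// Cancel the timer before it has a chance to fire
timers.clearTimeout(timeout);

await new Promise((resolve) => setTimeout(resolve, 100));
console.log(fired); // false
```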
    <div>
      <h2>Console</h2>
      <a href="#console">
        
      </a>
    </div>
    <p>The <code>node:console</code> module provides a set of console logging APIs that are similar to the standard <code>console</code> global, but with some additional features. We have implemented the <code>node:console</code> module as a thin wrapper around the existing <code>globalThis.console</code> that is already available in Workers.</p>
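<p>In practice, importing the module gives you the same logger you already have globally (in Node.js, the default export of <code>node:console</code> is the global <code>console</code> instance, and the Workers wrapper is designed to behave the same way):</p>

```javascript
import nodeConsole from 'node:console';

// Logs via the module go to the same place as the global console
nodeConsole.log('Logged via node:console');
```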
    <div>
      <h2>How to enable the Node.js compatibility features</h2>
      <a href="#how-to-enable-the-node-js-compatibility-features">
        
      </a>
    </div>
    <p>To enable the Node.js compatibility features as a whole within your Workers, you can set the <code>nodejs_compat</code> <a href="https://developers.cloudflare.com/workers/configuration/compatibility-flags/"><u>compatibility flag</u></a> in your <a href="https://developers.cloudflare.com/workers/wrangler/configuration/"><code><u>wrangler.jsonc or wrangler.toml</u></code></a> configuration file. If you are not using Wrangler, you can also set the flag via the <a href="https://dash.cloudflare.com"><u>Cloudflare dashboard</u></a> or API:</p>
            <pre><code>{
  "name": "my-worker",
  "main": "src/index.js",
  "compatibility_date": "2025-09-21",
  "compatibility_flags": [
    // Get everything Node.js compatibility related
    "nodejs_compat",
  ]
}
</code></pre>
            <p><b>The compatibility date here is key! Update that to the most current date, and you'll always be able to take advantage of the latest and greatest features.</b></p><p>The <code>nodejs_compat</code> flag is an umbrella flag that enables all the Node.js compatibility features at once. This is the recommended way to enable Node.js compatibility, as it ensures that all features are available and work together seamlessly. However, if you prefer, you can also enable or disable some features individually via their own compatibility flags:</p>
<div><table><thead>
  <tr>
    <th><span>Module</span></th>
    <th><span>Enable Flag (default)</span></th>
    <th><span>Disable Flag</span></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>node:console</span></td>
    <td><span>enable_nodejs_console_module</span></td>
    <td><span>disable_nodejs_console_module</span></td>
  </tr>
  <tr>
    <td><span>node:fs</span></td>
    <td><span>enable_nodejs_fs_module</span></td>
    <td><span>disable_nodejs_fs_module</span></td>
  </tr>
  <tr>
    <td><span>node:http (client)</span></td>
    <td><span>enable_nodejs_http_modules</span></td>
    <td><span>disable_nodejs_http_modules</span></td>
  </tr>
  <tr>
    <td><span>node:http (server)</span></td>
    <td><span>enable_nodejs_http_server_modules</span></td>
    <td><span>disable_nodejs_http_server_modules</span></td>
  </tr>
  <tr>
    <td><span>node:os</span></td>
    <td><span>enable_nodejs_os_module</span></td>
    <td><span>disable_nodejs_os_module</span></td>
  </tr>
  <tr>
    <td><span>node:process</span></td>
    <td><span>enable_nodejs_process_v2</span></td>
    <td></td>
  </tr>
  <tr>
    <td><span>node:zlib</span></td>
    <td><span>nodejs_zlib</span></td>
    <td><span>no_nodejs_zlib</span></td>
  </tr>
  <tr>
    <td><span>process.env</span></td>
    <td><span>nodejs_compat_populate_process_env</span></td>
    <td><span>nodejs_compat_do_not_populate_process_env</span></td>
  </tr>
</tbody></table></div><p>By separating these features, you get more granular control over which Node.js APIs are available in your Workers. We initially rolled these features out under the single <code>nodejs_compat</code> flag, but we quickly realized that some users perform feature detection based on the presence of certain modules and APIs, and that by enabling everything at once we risked breaking some existing Workers. Users who check for the existence of these APIs manually can ensure new changes don’t break their Workers by opting out of specific APIs:</p>
            <pre><code>{
  "name": "my-worker",
  "main": "src/index.js",
  "compatibility_date": "2025-09-15",
  "compatibility_flags": [
    // Get everything Node.js compatibility related
    "nodejs_compat",
    // But disable the `node:zlib` module if necessary
    "no_nodejs_zlib",
  ]
}
</code></pre>
            <p>But, to keep things simple, <b>we recommend starting with the </b><code><b>nodejs_compat</b></code><b> flag, which will enable everything. You can always disable individual features later if needed.</b> There is no performance penalty to having the additional features enabled.</p>
    <div>
      <h3>Handling end-of-life'd APIs</h3>
      <a href="#handling-end-of-lifed-apis">
        
      </a>
    </div>
    <p>One important difference between Node.js and Workers is that Node.js has a <a href="https://nodejs.org/en/eol"><u>defined long term support (LTS) schedule</u></a> that allows it to make breaking changes at certain points in time. More specifically, Node.js can remove APIs and features when they reach end-of-life (EOL). On Workers, however, we have a rule that once a Worker is deployed, <a href="https://blog.cloudflare.com/backwards-compatibility-in-cloudflare-workers/"><u>it will continue to run as-is indefinitely</u></a>, without any breaking changes as long as the compatibility date does not change. This means that we cannot simply remove APIs when they reach EOL in Node.js, since this would break existing Workers. To address this, we have introduced a new set of compatibility flags that allow users to specify that they do not want the <code>nodejs_compat</code> features to include end-of-life APIs. These flags are based on the Node.js major version in which the APIs were removed:</p><p>The <code>remove_nodejs_compat_eol</code> flag will remove all APIs that have reached EOL up to your current compatibility date:</p>
            <pre><code>{
  "name": "my-worker",
  "main": "src/index.js",
  "compatibility_date": "2025-09-15",
  "compatibility_flags": [
    // Get everything Node.js compatibility related
    "nodejs_compat",
    // Remove Node.js APIs that have reached EOL up to your
    // current compatibility date
    "remove_nodejs_compat_eol",
  ]
}
</code></pre>
            <ul><li><p>The <code>remove_nodejs_compat_eol_v22</code> flag will remove all APIs that reached EOL in Node.js v22. When using <code>remove_nodejs_compat_eol</code>, this flag will be automatically enabled if your compatibility date is set to a date after Node.js v22's EOL date (April 30, 2027).</p></li><li><p>The <code>remove_nodejs_compat_eol_v23</code> flag will remove all APIs that reached EOL in Node.js v23. When using <code>remove_nodejs_compat_eol</code>, this flag will be automatically enabled if your compatibility date is set to a date after Node.js v24's EOL date (April 30, 2028).</p></li><li><p>The <code>remove_nodejs_compat_eol_v24</code> flag will remove all APIs that reached EOL in Node.js v24. When using <code>remove_nodejs_compat_eol</code>, this flag will be automatically enabled if your compatibility date is set to a date after Node.js v24's EOL date (April 30, 2028).</p></li></ul><p>If you look at the date for <code>remove_nodejs_compat_eol_v23</code>, you'll notice that it is the same as the date for <code>remove_nodejs_compat_eol_v24</code>. That is not a typo! Node.js v23 is not an LTS release, and as such it has a very short support window: it was released in October 2024 and reached EOL in June 2025. Accordingly, we have decided to group the end-of-life handling of non-LTS releases into the next LTS release. This means that when you set your compatibility date to a date after the EOL date for Node.js v24, you will also be opting out of the APIs that reached EOL in Node.js v23. Importantly, these flags will not be automatically enabled until your compatibility date is set to a date after the relevant Node.js version's EOL date, ensuring that existing Workers will have plenty of time to migrate before any APIs are removed, or can simply keep using the older APIs indefinitely by using reverse compatibility flags like <code>add_nodejs_compat_eol_v24</code>.</p>
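<p>For example, a Worker could pair a far-future compatibility date with one of these reverse flags to keep an API that reached EOL in Node.js v24 (a sketch based on the flag names above; the date shown is illustrative):</p>

```
{
  "name": "my-worker",
  "main": "src/index.js",
  "compatibility_date": "2028-06-01",
  "compatibility_flags": [
    "nodejs_compat",
    // Keep APIs that reached EOL in Node.js v24 available
    "add_nodejs_compat_eol_v24",
  ]
}
```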
    <div>
      <h2>Giving back</h2>
      <a href="#giving-back">
        
      </a>
    </div>
    <p>One other important bit of work that we have been doing is expanding Cloudflare's investment back into the Node.js ecosystem as a whole. There are now five members of the Workers runtime team (plus one summer intern) that are actively contributing to the <a href="https://github.com/nodejs/node"><u>Node.js project</u></a> on GitHub, two of which are members of Node.js' Technical Steering Committee. While we have made a number of new feature contributions such as an implementation of the Web Platform Standard <a href="https://blog.cloudflare.com/improving-web-standards-urlpattern/"><u>URLPattern</u></a> API and improved implementation of <a href="https://github.com/nodejs/ncrypto"><u>crypto</u></a> operations, our primary focus has been on improving the ability for other runtimes to interoperate and be compatible with Node.js, fixing critical bugs, and improving performance. As we continue to grow our efforts around Node.js compatibility we will also grow our contributions back to the project and ecosystem as a whole.</p>
<div><table><thead>
  <tr>
    <th><span>Aaron Snell</span></th>
    <th><span>2025 Summer Intern, Cloudflare Containers</span><br /><span>Node.js Web Infrastructure Team</span></th>
    <th><img src="https://images.ctfassets.net/zkvhlag99gkb/2ud1DF6HOI3ha2ySAhPOve/803132cf224695a48698afb806bf147b/Aaron.png?h=250" /></th>
  </tr>
  <tr>
    <th><img src="https://images.ctfassets.net/zkvhlag99gkb/2nqff7ZSEryQfXbl2OdwfJ/6b4a56a3e71f439032d3bc0413d2d72f/GitHub.png?h=250" /></th>
    <th><a href="https://github.com/flakey5"><span>flakey5</span></a></th>
  </tr></thead>
<tbody>
  <tr>
    <td><span>Dario Piotrowicz</span></td>
    <td><span>Senior System Engineer</span><br /><span>Node.js Collaborator</span></td>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/4K17bsjek1z4u2KRTtZ8uS/d7058dea515cb057a1727bcd01a0f5d2/Dario.png?h=250" /></td>
  </tr>
  <tr>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/2nqff7ZSEryQfXbl2OdwfJ/6b4a56a3e71f439032d3bc0413d2d72f/GitHub.png?h=250" /></td>
    <td><a href="https://github.com/dario-piotrowicz"><span>dario-piotrowicz</span></a></td>
  </tr>
  <tr>
    <td><span>Guy Bedford</span></td>
    <td><span>Principal Systems Engineer</span><br /><span>Node.js Collaborator</span></td>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/iYM8oWWSK89MesmQwctfc/4d86847238b1f10e18717771e2ad5ee8/Guy.png?h=250" /></td>
  </tr>
  <tr>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/2nqff7ZSEryQfXbl2OdwfJ/6b4a56a3e71f439032d3bc0413d2d72f/GitHub.png?h=250" /></td>
    <td><a href="https://github.com/guybedford"><span>guybedford</span></a></td>
  </tr>
  <tr>
    <td><span>James Snell</span></td>
    <td><span>Principal Systems Engineer</span><br /><span>Node.js TSC</span></td>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/4vN2YAqsEBlSnWtXRM0pTT/5e9130753ed71933fc94bc2c634425f3/James.png?h=250" /></td>
  </tr>
  <tr>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/2nqff7ZSEryQfXbl2OdwfJ/6b4a56a3e71f439032d3bc0413d2d72f/GitHub.png?h=250" /></td>
    <td><a href="https://github.com/jasnell"><span>jasnell</span></a></td>
  </tr>
  <tr>
    <td><span>Nicholas Paun</span></td>
    <td><span>Systems Engineer</span><br /><span>Node.js Contributor</span></td>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/4ePtfLAzk4pKYi4hU4dRLX/e4dcdfe86a4e54c4d02e356e2078d214/Nicholas.png?h=250" /></td>
  </tr>
  <tr>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/2nqff7ZSEryQfXbl2OdwfJ/6b4a56a3e71f439032d3bc0413d2d72f/GitHub.png?h=250" /></td>
    <td><a href="https://github.com/npaun"><span>npaun</span></a></td>
  </tr>
  <tr>
    <td><span>Yagiz Nizipli</span></td>
    <td><span>Principal Systems Engineer</span><br /><span>Node.js TSC</span></td>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/2nvpEqU0VHi3Se9fxJ5vE8/0f5628bc1756c7e3e363760be9c493ae/Yagiz.png?h=250" /></td>
  </tr>
  <tr>
    <td><img src="https://images.ctfassets.net/zkvhlag99gkb/2nqff7ZSEryQfXbl2OdwfJ/6b4a56a3e71f439032d3bc0413d2d72f/GitHub.png?h=250" /></td>
    <td><a href="https://github.com/anonrig"><span>anonrig</span></a></td>
  </tr>
</tbody></table></div><p>Cloudflare is also proud to continue supporting critical infrastructure for the Node.js project through its <a href="https://openjsf.org/blog/openjs-cloudflare-partnership"><u>ongoing strategic partnership</u></a> with the OpenJS Foundation, providing free access to the project to services such as Workers, R2, DNS, and more.</p>
    <div>
      <h2>Give it a try!</h2>
      <a href="#give-it-a-try">
        
      </a>
    </div>
    <p>Our vision for Node.js compatibility in Workers is not just about implementing individual APIs, but about creating a comprehensive platform that allows developers to run existing Node.js code seamlessly in the Workers environment. This involves not only implementing the APIs themselves, but also ensuring that they work together harmoniously, and that they integrate well with the unique aspects of the Workers platform.</p><p>In some cases, such as with <code>node:fs</code> and <code>node:crypto</code>, we have had to implement entirely new capabilities that were not previously available in Workers and did so at the native runtime level. This allows us to tailor the implementations to the unique aspects of the Workers environment and ensure both performance and security.</p><p>And we're not done yet. We are continuing to work on implementing additional Node.js APIs, as well as improving the performance and compatibility of the existing implementations. We are also actively engaging with the community to understand their needs and priorities, and to gather feedback on our implementations. If there are specific Node.js APIs or npm packages that you would like to see supported in Workers, <a href="https://github.com/cloudflare/workerd/"><u>please let us know</u></a>! If there are any issues or bugs you encounter, please report them on our <a href="https://github.com/cloudflare/workerd/"><u>GitHub repository</u></a>. While we might not be able to implement every single Node.js API, nor match Node.js' behavior exactly in every case, we are committed to providing a robust and comprehensive Node.js compatibility layer that meets the needs of the community.</p><p>All the Node.js compatibility features described in this post are <a href="https://developers.cloudflare.com/workers/runtime-apis/nodejs/"><u>available now</u></a>. 
To get started, simply enable the <code>nodejs_compat</code> compatibility flag in your <code>wrangler.toml</code> or <code>wrangler.jsonc</code> file, or via the Cloudflare dashboard or API. You can then start using the Node.js APIs in your Workers applications right away.</p> ]]></content:encoded>
            <category><![CDATA[Node.js]]></category>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Servers]]></category>
            <guid isPermaLink="false">rMNgTNdCcEh6MjAlrKkL3</guid>
            <dc:creator>James M Snell</dc:creator>
        </item>
        <item>
            <title><![CDATA[Bringing Node.js HTTP servers to Cloudflare Workers]]></title>
            <link>https://blog.cloudflare.com/bringing-node-js-http-servers-to-cloudflare-workers/</link>
            <pubDate>Mon, 08 Sep 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[ We've implemented the node:http client and server APIs in Cloudflare Workers, allowing developers to migrate existing Node.js applications with minimal code changes. ]]></description>
            <content:encoded><![CDATA[ <p>We’re making it easier to run your Node.js applications on <a href="https://www.cloudflare.com/developer-platform/products/workers/"><u>Cloudflare Workers </u></a>by adding support for the <code>node:http</code> client and server APIs. This significant addition brings familiar Node.js HTTP interfaces to the edge, enabling you to deploy existing Express.js, Koa, and other Node.js applications globally with zero cold starts, automatic scaling, and significantly lower latency for your users — all without rewriting your codebase. Whether you're looking to migrate legacy applications to a modern serverless platform or build new ones using the APIs you already know, you can now leverage Workers' global network while maintaining your existing development patterns and frameworks.</p>
    <div>
      <h2>The Challenge: Node.js-style HTTP in a Serverless Environment</h2>
      <a href="#the-challenge-node-js-style-http-in-a-serverless-environment">
        
      </a>
    </div>
    <p>Cloudflare Workers operate in a unique <a href="https://www.cloudflare.com/learning/serverless/what-is-serverless/"><u>serverless</u></a> environment where direct TCP connections aren't available. Instead, all networking operations are fully managed by specialized services outside the Workers runtime itself — systems like our <a href="https://blog.cloudflare.com/introducing-oxy/"><u>Open Egress Router (OER)</u></a> and <a href="https://github.com/cloudflare/pingora"><u>Pingora</u></a> that handle connection pooling, keeping connections warm, managing egress IPs, and all the complex networking details. This means that, as a developer, you don't need to worry about TLS negotiation, connection management, or network optimization — it's all handled for you automatically.</p><p>This fully-managed approach is actually why we can't support certain Node.js APIs — these networking decisions are handled at the system level for performance and security. While this makes Workers different from traditional Node.js environments, it also makes them better for serverless computing — you get enterprise-grade networking without the complexity.</p><p>This fundamental difference required us to rethink how HTTP APIs work at the edge while maintaining compatibility with existing Node.js code patterns.</p><p>Our solution: we've implemented the core <code>node:http</code> APIs by building on top of the web-standard technologies that Workers already excel at. Here's how it works:</p>
    <div>
      <h3>HTTP Client APIs</h3>
      <a href="#http-client-apis">
        
      </a>
    </div>
    <p>The <code>node:http</code> client implementation includes the essential APIs you're familiar with:</p><ul><li><p><code>http.get()</code> - For simple GET requests</p></li><li><p><code>http.request()</code> - For full control over HTTP requests</p></li></ul><p>Our implementations of these APIs are built on top of the standard <a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API"><code><u>fetch()</u></code></a> API that Workers use natively, providing excellent performance while maintaining Node.js compatibility.</p>
            <pre><code>import http from 'node:http';

export default {
  async fetch(request) {
    // Use familiar Node.js HTTP client APIs
    const { promise, resolve, reject } = Promise.withResolvers();

    const req = http.get('https://api.example.com/data', (res) =&gt; {
      let data = '';
      res.on('data', chunk =&gt; data += chunk);
      res.on('end', () =&gt; {
        resolve(new Response(data, {
          headers: { 'Content-Type': 'application/json' }
        }));
      });
    });

    req.on('error', reject);

    return promise;
  }
};</code></pre>
            
    <div>
      <h3>What's Supported</h3>
      <a href="#whats-supported">
        
      </a>
    </div>
    <ul><li><p>Standard HTTP methods (GET, POST, PUT, DELETE, etc.)</p></li><li><p>Request and response headers</p></li><li><p>Request and response bodies</p></li><li><p>Streaming responses</p></li><li><p>Basic authentication</p></li></ul>
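<p>Putting those supported pieces together, here is a self-contained sketch exercised against a local Node.js echo server purely for illustration; on Workers the same client calls are implemented on top of <code>fetch()</code>, and you would typically talk to an external origin instead:</p>

```javascript
import http from 'node:http';
import { once } from 'node:events';

// A tiny local echo server, used here only to exercise the client APIs
const server = http.createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ method: req.method, body }));
  });
});
server.listen(0); // pick a free port
await once(server, 'listening');
const { port } = server.address();

// http.request gives full control: method, headers, and a writable request body
const result = await new Promise((resolve, reject) => {
  const req = http.request(
    { hostname: '127.0.0.1', port, path: '/echo', method: 'POST',
      headers: { 'Content-Type': 'text/plain' } },
    (res) => {
      let data = '';
      res.on('data', (chunk) => (data += chunk));
      res.on('end', () => resolve(JSON.parse(data)));
    }
  );
  req.on('error', reject);
  req.write('hello'); // streamed request body
  req.end();
});

server.close();
console.log(result.method, result.body); // POST hello
```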
    <div>
      <h3>Current Limitations</h3>
      <a href="#current-limitations">
        
      </a>
    </div>
    <ul><li><p>The <a href="https://nodejs.org/api/http.html#class-httpagent"><code><u>Agent</u></code></a> API is provided but operates as a no-op.</p></li><li><p><a href="https://nodejs.org/docs/v22.19.0/api/http.html#responseaddtrailersheaders"><u>Trailers</u></a>, <a href="https://nodejs.org/docs/v22.19.0/api/http.html#responsewriteearlyhintshints-callback"><u>early hints</u></a>, and <a href="https://nodejs.org/docs/v22.19.0/api/http.html#event-continue"><u>1xx responses</u></a> are not supported.</p></li><li><p>TLS-specific options are not supported (Workers handle TLS automatically).</p></li></ul>
    <div>
      <h2>HTTP Server APIs</h2>
      <a href="#http-server-apis">
        
      </a>
    </div>
    <p>The server-side implementation is where things get particularly interesting. Since Workers can't create traditional TCP servers listening on specific ports, we've created a bridge system that connects Node.js-style servers to the Workers request-handling model.</p><p>When you create an HTTP server and call <code>listen(port)</code>, instead of opening a TCP socket, the server is registered in an internal table within your Worker. This internal table acts as a bridge between <code>http.createServer()</code> executions and incoming <code>fetch</code> requests, using the port number as the identifier. You then use one of two methods to bridge incoming Worker requests to your Node.js-style server.</p>
    <div>
      <h3>Manual Integration with <code>handleAsNodeRequest</code></h3>
      <a href="#manual-integration-with-handleasnoderequest">
        
      </a>
    </div>
    <p>This approach gives you the flexibility to integrate Node.js HTTP servers with other Worker features, and allows you to have multiple handlers in your default <a href="https://developers.cloudflare.com/workers/runtime-apis/bindings/service-bindings/rpc/"><u>entrypoint</u></a> such as <code>fetch</code>, <code>scheduled</code>, <code>queue</code>, etc.</p>
            <pre><code>import { handleAsNodeRequest } from 'cloudflare:node';
import { createServer } from 'node:http';

// Create a traditional Node.js HTTP server
const server = createServer((req, res) =&gt; {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Hello from Node.js HTTP server!');
});

// Register the server (doesn't actually bind to port 8080)
server.listen(8080);

// Bridge from Workers fetch handler to Node.js server
export default {
  async fetch(request) {
    // You can add custom logic here before forwarding
    if (request.url.includes('/admin')) {
      return new Response('Admin access', { status: 403 });
    }

    // Forward to the Node.js server
    return handleAsNodeRequest(8080, request);
  },
  async queue(batch, env, ctx) {
    for (const msg of batch.messages) {
      msg.retry();
    }
  },
  async scheduled(controller, env, ctx) {
    ctx.waitUntil(doSomeTaskOnSchedule(controller));
  },
};</code></pre>
            <p>This approach is perfect when you need to:</p><ul><li><p>Integrate with other Workers features like <a href="https://www.cloudflare.com/developer-platform/products/workers-kv/"><u>KV</u></a>, <a href="https://www.cloudflare.com/developer-platform/products/durable-objects/"><u>Durable Objects</u></a>, or <a href="https://www.cloudflare.com/developer-platform/products/r2/"><u>R2</u></a></p></li><li><p>Handle some routes differently while delegating others to the Node.js server</p></li><li><p>Apply custom middleware or request processing</p></li></ul>
    <div>
      <h3>Automatic Integration with <code>httpServerHandler</code></h3>
      <a href="#automatic-integration-with-httpserverhandler">
        
      </a>
    </div>
    <p>For use cases where you want to integrate a Node.js HTTP server without any additional features or complexity, you can use the <code>httpServerHandler</code> function. It handles the integration for you automatically, and is ideal for applications that don’t need Workers-specific features.</p>
            <pre><code>import { httpServerHandler } from 'cloudflare:node';
import { createServer } from 'node:http';

// Create your Node.js HTTP server
const server = createServer((req, res) =&gt; {
  if (req.url === '/') {
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end('&lt;h1&gt;Welcome to my Node.js app on Workers!&lt;/h1&gt;');
  } else if (req.url === '/api/status') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'ok', timestamp: Date.now() }));
  } else {
    res.writeHead(404, { 'Content-Type': 'text/plain' });
    res.end('Not Found');
  }
});

server.listen(8080);

// Export the server as a Workers handler
export default httpServerHandler({ port: 8080 });
// Or you can simply pass the http.Server instance directly:
// export default httpServerHandler(server);</code></pre>
            
    <div>
      <h2><a href="https://expressjs.com/"><u>Express.js</u></a>, <a href="https://koajs.com/"><u>Koa.js</u></a> and Framework Compatibility</h2>
      <a href="#and-framework-compatibility">
        
      </a>
    </div>
    <p>These HTTP APIs open the door to running popular Node.js frameworks like Express.js on Workers. If any of the middlewares for these frameworks don’t work as expected, please <a href="https://github.com/cloudflare/workerd/issues"><u>open an issue</u></a> on the Cloudflare Workers repository.</p>
            <pre><code>import { httpServerHandler } from 'cloudflare:node';
import express from 'express';

const app = express();

app.get('/', (req, res) =&gt; {
  res.json({ message: 'Express.js running on Cloudflare Workers!' });
});

app.get('/api/users/:id', (req, res) =&gt; {
  res.json({
    id: req.params.id,
    name: 'User ' + req.params.id
  });
});

app.listen(3000);
export default httpServerHandler({ port: 3000 });
// Or you can simply pass the http.Server instance directly:
// export default httpServerHandler(app.listen(3000));</code></pre>
            <p>In addition to <a href="https://expressjs.com"><u>Express.js</u></a>, <a href="https://koajs.com/"><u>Koa.js</u></a> is also supported:</p>
            <pre><code>import Koa from 'koa';
import { httpServerHandler } from 'cloudflare:node';

const app = new Koa()

app.use(async ctx =&gt; {
  ctx.body = 'Hello World';
});

app.listen(8080);

export default httpServerHandler({ port: 8080 });</code></pre>
            
    <div>
      <h2>Getting started with serverless Node.js applications</h2>
      <a href="#getting-started-with-serverless-applications">
        
      </a>
    </div>
    <p>The <code>node:http</code> and <code>node:https</code> APIs are available in Workers with Node.js compatibility enabled using the <a href="https://developers.cloudflare.com/workers/configuration/compatibility-dates/#nodejs-compatibility-flag"><code><u>nodejs_compat</u></code></a> compatibility flag with a compatibility date of 2025-08-15 or later.</p><p>The addition of <code>node:http</code> support brings us closer to our goal of making Cloudflare Workers the best platform for running JavaScript at the edge, whether you're building new applications or migrating existing ones.</p><a href="https://deploy.workers.cloudflare.com/?url=https://github.com/cloudflare/templates/tree/main/nodejs-http-server-template"><img src="https://deploy.workers.cloudflare.com/button" /></a>
<p>Ready to try it out? <a href="https://developers.cloudflare.com/workers/runtime-apis/nodejs/"><u>Enable Node.js compatibility</u></a> in your Worker and start exploring the possibilities of familiar <a href="https://developers.cloudflare.com/workers/runtime-apis/nodejs/http/"><u>HTTP APIs at the edge</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Node.js]]></category>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[JavaScript]]></category>
            <category><![CDATA[Serverless]]></category>
            <category><![CDATA[Servers]]></category>
            <guid isPermaLink="false">k5sD9WGL8BsJPuqsJj6Fn</guid>
            <dc:creator>Yagiz Nizipli</dc:creator>
            <dc:creator>James M Snell</dc:creator>
        </item>
        <item>
            <title><![CDATA[Is this thing on? Using OpenBMC and ACPI power states for reliable server boot]]></title>
            <link>https://blog.cloudflare.com/how-we-use-openbmc-and-acpi-power-states-to-monitor-the-state-of-our-servers/</link>
            <pubDate>Tue, 22 Oct 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ Cloudflare’s global fleet benefits from being managed by open source firmware for the Baseboard Management Controller (BMC), OpenBMC. This has come with various challenges, some of which we discuss here with an explanation of how the open source nature of the firmware for the BMC enabled us to fix the issues and maintain a more stable fleet. ]]></description>
            <content:encoded><![CDATA[ 
    <div>
      <h2>Introduction</h2>
      <a href="#introduction">
        
      </a>
    </div>
    <p>At Cloudflare, we provide a range of services through our global network of servers, located in <a href="https://www.cloudflare.com/network/"><u>330 cities</u></a> worldwide. When you interact with our long-standing <a href="https://www.cloudflare.com/application-services/products/"><u>application services</u></a>, or newer services like <a href="https://ai.cloudflare.com/"><u>Workers AI</u></a>, you’re in contact with one of the fleet of thousands of servers which support those services.</p><p>Each of these servers is managed by a Baseboard Management Controller (BMC). The BMC is a special-purpose processor — different from the Central Processing Unit (CPU) of a server — whose sole purpose is ensuring the smooth operation of the server.</p><p>Regardless of the server vendor, each server has this BMC. The BMC runs independently of the CPU and has its own embedded operating system, usually referred to as <a href="https://en.wikipedia.org/wiki/Firmware"><u>firmware</u></a>. At Cloudflare, we customize and deploy a server-specific version of the BMC firmware, based on the <a href="https://www.openbmc.org/"><u>Linux Foundation Project for BMCs, OpenBMC</u></a>. OpenBMC is an open-source firmware stack designed to work across a variety of systems, including enterprise, telco, and cloud-scale data centers. The open-source nature of OpenBMC gives us greater flexibility and ownership of this critical server subsystem than closed, proprietary firmware would. It gives us transparency (which is important to us as a security company) and lets us develop custom features and fixes more quickly for the BMC firmware that we run on our entire fleet.</p><p>In this blog post, we describe how we customized and extended the OpenBMC firmware to better monitor our servers’ boot-up processes, so that servers start more reliably and we can better diagnose issues that happen during boot-up.</p>
    <div>
      <h2>Server subsystems</h2>
      <a href="#server-subsystems">
        
      </a>
    </div>
    <p>Server systems consist of multiple complex subsystems that include the processors, memory, storage, networking, power supply, cooling, etc. When booting up the host of a server system, the power state of each subsystem of the server is changed in an asynchronous manner. This is done so that subsystems can initialize simultaneously, thereby improving the efficiency of the boot process. Though started asynchronously, these subsystems may interact with each other at different points of the boot sequence and rely on handshake/synchronization to exchange information. For example, during boot-up, the <a href="https://en.wikipedia.org/wiki/UEFI"><u>UEFI (Unified Extensible Firmware Interface)</u></a>, often referred to as the <a href="https://en.wikipedia.org/wiki/BIOS"><u>BIOS</u></a>, configures the motherboard in a phase known as the Platform Initialization (PI) phase, during which the UEFI collects information from subsystems such as the CPUs, memory, etc. to initialize the motherboard with the right settings.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6csPNEksLXsGgt3dq5xZ0S/3236656dbc01f3085bada5af853c3516/image1.png" />
          </figure><p><sup><i>Figure 1: Server Boot Process</i></sup></p><p>When the power state of the subsystems, handshakes, and synchronization are not properly managed, there may be race conditions that would result in failures during the boot process of the host. Cloudflare experienced some of these boot-related failures while rolling out open source firmware (<a href="https://en.wikipedia.org/wiki/OpenBMC"><u>OpenBMC</u></a>) to the Baseboard Management Controllers (BMCs) of our servers. </p>
    <div>
      <h2>Baseboard Management Controller (BMC) as a manager of the host</h2>
      <a href="#baseboard-management-controller-bmc-as-a-manager-of-the-host">
        
      </a>
    </div>
    <p>A BMC is a specialized microprocessor that is attached to the board of a host (server) to assist with remote management capabilities of the host. Servers usually sit in data centers and are often far away from the administrators, and this creates a challenge to maintain them at scale. This is where a BMC comes in, as the BMC serves as the interface that gives administrators the ability to securely and remotely access the servers and carry out management functions. The BMC does this by exposing various interfaces, including <a href="https://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface"><u>Intelligent Platform Management Interface (IPMI)</u></a> and <a href="https://www.dmtf.org/standards/redfish"><u>Redfish</u></a>, for distributed management. In addition, the BMC receives data from various sensors/devices (e.g. temperature, power supply) connected to the server, and also the operating parameters of the server, such as the operating system state, and publishes the values on its IPMI and Redfish interfaces.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/33dNmfyjqrbAGvcbZLTa0h/db3e6b79b1010081916ee6498b10c297/image2.png" />
          </figure><p><sup><i>Figure 2: Block diagram of BMC in a server system.</i></sup></p><p>At Cloudflare, we use the <a href="https://github.com/openbmc/openbmc"><u>OpenBMC</u></a> project for our Baseboard Management Controller (BMC).</p><p>Below are examples of management functions carried out on a server through the BMC. The interactions in the examples are done over <a href="https://github.com/ipmitool/ipmitool/wiki"><u>ipmitool</u></a>, a command line utility for interacting with systems that support IPMI.</p>
            <pre><code># Check the sensor readings of a server remotely (i.e. over a network)
$  ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; sdr
PSU0_CURRENT_IN  | 0.47 Amps         | ok
PSU0_CURRENT_OUT | 6 Amps            | ok
PSU0_FAN_0       | 6962 RPM          | ok
SYS_FAN          | 13034 RPM         | ok
SYS_FAN1         | 11172 RPM         | ok
SYS_FAN2         | 11760 RPM         | ok
CPU_CORE_VR_POUT | 9.03 Watts        | ok
CPU_POWER        | 76.95 Watts       | ok
CPU_SOC_VR_POUT  | 12.98 Watts       | ok
DIMM_1_VR_POUT   | 29.03 Watts       | ok
DIMM_2_VR_POUT   | 27.97 Watts       | ok
CPU_CORE_MOSFET  | 40 degrees C      | ok
CPU_TEMP         | 50 degrees C      | ok
DIMM_MOSFET_1    | 36 degrees C      | ok
DIMM_MOSFET_2    | 39 degrees C      | ok
DIMM_TEMP_A1     | 34 degrees C      | ok
DIMM_TEMP_B1     | 33 degrees C      | ok

…

# check the power status of a server remotely (i.e. over a network)
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power status
Chassis Power is off

# power on the server
ipmitool &lt;some authentication&gt; &lt;bmc ip&gt; power on
Chassis Power Control: On</code></pre>
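<p>The same power and state information is also exposed over the BMC’s Redfish interface mentioned earlier. Below is a sketch (not production code) that pulls the standard <code>PowerState</code> property out of a Redfish ComputerSystem payload; the sample payload is illustrative, and a real BMC would serve it at a path like <code>/redfish/v1/Systems/&lt;id&gt;</code>:</p>

```python
import json

# Illustrative ComputerSystem payload, trimmed to the fields relevant here.
sample_payload = """
{
  "@odata.type": "#ComputerSystem.v1_13_0.ComputerSystem",
  "Id": "system",
  "PowerState": "Off",
  "Status": {"State": "Disabled", "Health": "OK"}
}
"""

def power_state(payload: str) -> str:
    """Return the Redfish PowerState ("On", "Off", ...) from a ComputerSystem document."""
    return json.loads(payload)["PowerState"]

print(power_state(sample_payload))  # → Off
```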
            <p>Switching to OpenBMC firmware for our BMCs gives us more control over the software that powers our infrastructure. It has given us more flexibility, more room for customization, and a more uniform experience for managing our servers. Since OpenBMC is open source, we also leverage community fixes while upstreaming some of our own. Some of the advantages we have experienced with OpenBMC include a faster turnaround time for fixing issues, <a href="https://blog.cloudflare.com/de-de/thermal-design-supporting-gen-12-hardware-cool-efficient-and-reliable/"><u>optimizations around thermal cooling</u></a>, <a href="https://blog.cloudflare.com/gen-12-servers/"><u>increased power efficiency</u></a> and <a href="https://blog.cloudflare.com/how-we-used-openbmc-to-support-ai-inference-on-gpus-around-the-world/"><u>supporting AI inference</u></a>.</p><p>While developing Cloudflare’s OpenBMC firmware, however, we ran into a number of boot problems.</p><p><b><i>Host not booting:</i></b> When we sent a request over IPMI for a host to power on (as in the power-on example above), ipmitool would report the host’s power status as ON, but we would see no power going into the CPU and no CPU activity. ipmitool was correct that chassis power was on, but it gave us no information about the power state of the rest of the server, and we initially (and wrongly) assumed that since the chassis power was on, the rest of the server components must be ON too. The <a href="https://documents.uow.edu.au/~blane/netapp/ontap/sysadmin/monitoring/concept/c_oc_mntr_bmc-sys-event-log.html"><u>System Event Log (SEL)</u></a>, which records platform-specific events, gave us no useful information beyond indicating that the server was in a soft-off state (powered off) or a working state (operating system loading and running), or that a “System Restart” of the host was initiated.</p>
            <pre><code># System Event Logs (SEL) showing the various power states of the server
$ ipmitool sel elist | tail -n3
  4d |  Pre-Init  |0000011021| System ACPI Power State ACPI_STATUS | S5_G2: soft-off | Asserted
  4e |  Pre-Init  |0000011022| System ACPI Power State ACPI_STATUS | S0_G0: working | Asserted
  4f |  Pre-Init  |0000011023| System Boot Initiated RESTART_CAUSE | System Restart | Asserted</code></pre>
            <p>In the System Event Logs shown above, ACPI stands for Advanced Configuration and Power Interface, a standard for power management on computing systems. In the ACPI soft-off state, the host is powered off (the motherboard is on standby power, but the CPU/host isn’t powered on); according to the <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI specifications</u></a>, this state is called S5_G2. (These states are discussed in more detail below.) In the ACPI working state, known in the ACPI specifications as S0_G0, the host has booted and is running (which in our case happened to be false). The third row indicates that the restart was caused by a System Restart. Most of the boot-related SEL events are sent from the UEFI to the BMC. The UEFI has been something of a black box to us, as we rely on our original equipment manufacturers (OEMs) to develop the UEFI firmware, and for the generation of servers with this issue, the UEFI firmware did not implement sending the host’s boot progress to the BMC.</p><p>One discrepancy we observed was between the reported power status and the power actually going into the CPU, which we read with a sensor we call CPU_POWER.</p>
            <pre><code># Check power status
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  power status
Chassis Power is on
</code></pre>
            <p>However, checking the power into the CPU shows that the CPU was not receiving any power.</p>
            <pre><code># Check power going into the CPU
$ ipmitool &lt;some authentication&gt; &lt;bmc ip&gt;  sdr | grep CPU_POWER    
CPU_POWER        | 0 Watts           | ok</code></pre>
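<p>A mismatch like the one above (chassis power on, CPU drawing 0 watts) can be detected automatically. The following is a hedged sketch, not our monitoring code; it assumes the <code>ipmitool sdr</code> column layout shown earlier:</p>

```python
def parse_sdr_line(line: str):
    """Parse one `ipmitool sdr` row into (sensor name, numeric value)."""
    name, reading, _status = (field.strip() for field in line.split("|"))
    return name, float(reading.split()[0])

def host_really_on(chassis_on: bool, cpu_watts: float) -> bool:
    """Chassis power on with 0 W into the CPU means the host never actually booted."""
    return chassis_on and cpu_watts > 0.0

name, watts = parse_sdr_line("CPU_POWER        | 0 Watts           | ok")
print(name, watts, host_really_on(True, watts))  # → CPU_POWER 0.0 False
```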
            <p>The CPU_POWER reading of 0 watts contradicted all the previous indications that the host was powered up and working; the host was actually completely shut down.</p><p><b><i>Missing Memory Modules:</i></b> Our servers would randomly boot up with less memory than expected. Computers can boot with less memory than installed for a number of reasons, such as a loose connection, a hardware problem, or faulty memory. In our case, it was none of the usual suspects: both the BMC and the UEFI were trying to read from the memory modules simultaneously, leading to access contention. Memory modules usually contain a <a href="https://en.wikipedia.org/wiki/Serial_presence_detect"><u>Serial Presence Detect (SPD)</u></a>, which the UEFI uses to dynamically detect the memory module. The SPD is usually accessed over an <a href="https://learn.sparkfun.com/tutorials/i2c/all"><u>inter-integrated circuit (i2c)</u></a> bus, a low-speed, two-wire protocol for devices to talk to each other. The BMC also reads the temperature of the memory modules via i2c. When the server is powered on, the UEFI, amongst other hardware initializations, detects and initializes each memory module via its Serial Presence Detect (SPD). At the same time, the BMC could be trying to read the temperature of the same memory module over the same i2c bus. This simultaneous read attempt denies one of the parties access. When the UEFI is denied access to the SPD, it concludes that the memory module is not present and skips over it. Below is an example of the related i2c-bus contention logs we saw in the <a href="https://www.freedesktop.org/software/systemd/man/latest/journalctl.html"><u>journal</u></a> of the BMC while the host was booting.</p>
            <pre><code>kernel: aspeed-i2c-bus 1e78a300.i2c-bus: irq handled != irq. expected 0x00000021, but was 0x00000020</code></pre>
            <p>The log above indicates that the i2c controller at address 1e78a300 (which happens to be connected to the serial presence detect of the memory modules) could not properly handle a signal known as an interrupt request (irq). When the same contention hits the UEFI, the UEFI is unable to detect the memory module.</p>
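<p>Conceptually, the cure for this class of problem is making sure only one master touches the bus at a time. The toy model below (illustrative Python, not BMC code) stands in for the UEFI’s SPD reads and the BMC’s temperature polling; with a mutex around each transaction, neither reader ever observes the bus mid-transfer:</p>

```python
import threading

bus_lock = threading.Lock()  # models exclusive ownership of the shared i2c bus
bus_busy = False             # True while a bus transaction is in flight
collisions = 0               # transactions that observed another master mid-transfer

def i2c_read(n_reads):
    """Perform n_reads bus transactions, counting any overlap as a collision."""
    global bus_busy, collisions
    for _ in range(n_reads):
        with bus_lock:          # take the bus before touching it
            if bus_busy:
                collisions += 1
            bus_busy = True     # transaction in flight
            bus_busy = False    # transaction complete

spd_reader = threading.Thread(target=i2c_read, args=(1000,))   # stands in for UEFI SPD reads
temp_reader = threading.Thread(target=i2c_read, args=(1000,))  # stands in for BMC temperature polls
spd_reader.start(); temp_reader.start()
spd_reader.join(); temp_reader.join()
print(collisions)  # → 0: the lock serializes every transaction
```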
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Fe8wb6xqwXkanb8iPv8O2/eaecfe0474576a00cdc25bfeb6fba7a2/image4.png" />
          </figure><p><sup><i>Figure 3: I2C diagram showing I2C interconnection of the server’s memory modules (also known as DIMMs) with the BMC </i></sup></p><p><a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>DIMM</u></a> in Figure 3 refers to <a href="https://www.techtarget.com/searchstorage/definition/DIMM"><u>Dual Inline Memory Module</u></a>, the type of memory module used in servers.</p><p><b><i>Thermal telemetry:</i></b> During the boot-up process of some of our servers, some temperature devices, such as the temperature sensors of the memory modules, would show up as failed, causing some of the fans to enter a fail-safe <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>Pulse Width Modulation (PWM)</u></a> mode. <a href="https://en.wikipedia.org/wiki/Pulse-width_modulation"><u>PWM</u></a> is a technique for controlling the power delivered to an electronic device by adjusting the duty cycle (the fraction of time the signal is on) of the signal driving it. Here it is used to control fan speed. When a fan enters fail-safe mode, its PWM duty cycle is set to a preset value, irrespective of what the optimized PWM setting should be, which can negatively affect the server’s cooling and power consumption.</p>
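<p>The fail-safe behaviour can be expressed as a small decision function. This is a sketch with made-up duty-cycle values, not our actual thermal configuration:</p>

```python
FAILSAFE_PWM = 100  # percent duty cycle; an assumed fail-safe value

def fan_pwm(sensor_readings, optimized_pwm):
    """Return the fan PWM duty cycle: fail-safe if any sensor failed to report."""
    if any(value is None for value in sensor_readings.values()):
        return FAILSAFE_PWM  # a missing reading forces maximum cooling
    return optimized_pwm

# All sensors healthy: the optimized setting is used.
print(fan_pwm({"DIMM_TEMP_A1": 34.0, "DIMM_TEMP_B1": 33.0}, 40))  # → 40
# One failed sensor drives the fans to the fail-safe duty cycle.
print(fan_pwm({"DIMM_TEMP_A1": None, "DIMM_TEMP_B1": 33.0}, 40))  # → 100
```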
    <div>
      <h2>Implementing host ACPI state on OpenBMC</h2>
      <a href="#implementing-host-acpi-state-on-openbmc">
        
      </a>
    </div>
    <p>While studying the issues we faced in the boot-up process of the host, we learned how the power state of the subsystems within the chassis changes. This led us to investigate the Advanced Configuration and Power Interface (ACPI) and how the ACPI state of the host changes during the boot process.</p><p>Advanced Configuration and Power Interface (ACPI) is an open industry specification for power management used in desktop, mobile, workstation, and server systems. The <a href="https://uefi.org/sites/default/files/resources/ACPI_Spec_6_5_Aug29.pdf"><u>ACPI Specification</u></a> replaces previous power management methodologies such as <a href="https://en.wikipedia.org/wiki/Advanced_Power_Management"><u>Advanced Power Management (APM)</u></a>. ACPI provides the advantages of:</p><ul><li><p>Allowing OS-directed power management (OSPM).</p></li><li><p>Providing a standardized and robust interface for power management.</p></li><li><p>Sending system-level events, such as when the server power/sleep buttons are pressed.</p></li><li><p>Supporting hardware and software features, such as a real-time clock (RTC) to schedule the server to wake from sleep, or to reduce the functionality of the CPU based on RTC ticks when there is a loss of power.</p></li></ul><p>From a power management perspective, ACPI enables OS-driven conservation of energy by transitioning components that are not in active use to a lower power state, thereby reducing power consumption and contributing to more efficient power management.</p><p>The ACPI Specification defines four global “Gx” states, six sleeping “Sx” states, and four “Dx” device power states. These states are defined as follows:</p><div>
    <figure>
        <table>
            <colgroup>
                <col></col>
                <col></col>
                <col></col>
                <col></col>
            </colgroup>
            <tbody>
                <tr>
                    <td>
                        <p><span><span><strong>Gx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Sx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Working</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The run state. In this state the machine is fully running.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Sleeping</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where the CPU will suspend activity but retain its contexts.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A sleep state where memory contexts are held, but CPU contexts are lost. CPU re-initialization is done by firmware.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S2 where CPU re-initialization is done by device. Equates to Suspend to RAM.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>S4</span></span></p>
                    </td>
                    <td>
                        <p><span><span>A logically deeper sleep state than S3, in which DRAM context is not maintained and contexts are saved to disk. Can be implemented by either the OS or firmware.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Soft off but PSU still supplies power</span></span></p>
                    </td>
                    <td>
                        <p><span><span>S5</span></span></p>
                    </td>
                    <td>
                        <p><span><span>The soft off state. All activity will stop, and all contexts are lost. The Complex Programmable Logic Device (CPLD) responsible for power-up and power-down sequences of various components e.g. CPU, BMC is on standby power, but the CPU/host is off.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>G3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Mechanical off</span></span></p>
                    </td>
                    <td> </td>
                    <td>
                        <p><span><span>PSU does not supply power. The system is safe for disassembly.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span><strong>Dx</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Name</strong></span></span></p>
                    </td>
                    <td>
                        <p><span><span><strong>Description</strong></span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D0</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Fully powered on</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is fully functional and operational </span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is partially powered down</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Reduced functionality and can be quickly powered back to D0</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D2</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is in a deeper low-power state than D1</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Much more limited functionality and can only be slowly powered back to D0.</span></span></p>
                    </td>
                </tr>
                <tr>
                    <td>
                        <p><span><span>D3</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Hardware device is significantly powered down or off</span></span></p>
                    </td>
                    <td>
                        <p><span><span>Device is inactive with perhaps only the ability to be powered back on</span></span></p>
                    </td>
                </tr>
            </tbody>
        </table>
    </figure>
</div><p>The states that matter to us are:</p><ul><li><p><b>S0_G0_D0:</b> often referred to as the working state. Here we know our host system is running just fine.</p></li><li><p><b>S2_D2: </b>Memory contexts are held, but CPU context is lost. We usually use this state to know when the host’s UEFI is performing platform firmware initialization.</p></li><li><p><b>S5_G2:</b> Often referred to as the soft off state. Here we still have power going into the chassis, however, processor and DRAM context are not maintained, and the operating system power management of the host has no context.</p></li></ul><p>Since the issues we were experiencing were related to the power state changes of the host — when we asked the host to reboot or power on — we needed a way to track the various power state changes of the host as it went from power off to a complete working state. This would give us better management capabilities over the devices that were on the same power domain of the host during the boot process. Fortunately, the OpenBMC community already implemented an <a href="https://github.com/openbmc/google-misc/tree/master/subprojects/acpi-power-state-daemon"><u>ACPI daemon</u></a>, which we extended to serve our needs. We added an ACPI S2_D2 power state, in which memory contexts are held, but CPU context is lost, to the ACPI daemon running on the BMC to enable us to know when the host’s UEFI is performing firmware initialization, and also set up various management tasks for the different ACPI power states.</p><p>An example of a power management task we carry out using the S0_G0_D0 state is to re-export our Voltage Regulator (VR) sensors on S0_G0_D0 state, as shown with the service file below:</p>
            <pre><code>cat /lib/systemd/system/Re-export-VR-device.service 
[Unit]
Description=RE Export VR Device Process
Wants=xyz.openbmc_project.EntityManager.service
After=xyz.openbmc_project.EntityManager.service
Conflicts=host-s2-state.target

[Service]
Type=simple
ExecStart=/bin/bash -c 'set -a &amp;&amp; source /usr/bin/Re-export-VR-device.sh on'
SyslogIdentifier=Re-export-VR-device.service

[Install]
WantedBy=host-s0-state.target
</code></pre>
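<p>The way these targets consume the host’s state can be sketched as a small state tracker. This is an illustrative model, not the actual daemon; the state names follow the ACPI convention used in this post:</p>

```python
# Valid forward transitions during a normal boot, per the flow described in this post.
BOOT_SEQUENCE = ["S5_G2", "S2_D2", "S0_G0_D0"]

class AcpiTracker:
    """Track the host's ACPI state and report whether a boot progressed normally."""

    def __init__(self):
        self.state = "S5_G2"  # soft-off: chassis powered, host off
        self.history = [self.state]

    def on_state(self, new_state):
        self.state = new_state
        self.history.append(new_state)

    def boot_completed(self):
        """A clean boot passes through soft-off, firmware init, then working."""
        return self.history[-3:] == BOOT_SEQUENCE

tracker = AcpiTracker()
tracker.on_state("S2_D2")     # UEFI performing platform initialization
tracker.on_state("S0_G0_D0")  # OS loaded and running
print(tracker.boot_completed())  # → True
```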
            <p>Having set this up, OpenBMC’s <a href="https://github.com/openbmc/phosphor-host-ipmid/tree/master"><u>phosphor-host-ipmid</u></a> provides a command handler (ipmiSetACPIState) that is responsible for setting the host’s ACPI state on the BMC. The host invokes it using the standard IPMI command with NetFn=0x06 and Cmd=0x06.</p><p>In the event of an immediate power cycle (i.e. the host reboots without an operating system shutdown), the host is unable to send its S5_G2 state to the BMC. For this case, we created a patch to OpenBMC’s <a href="https://github.com/openbmc/x86-power-control/tree/master"><u>x86-power-control</u></a> to let the BMC detect that the host has entered the ACPI S5_G2 state (i.e. soft-off). When the host comes out of the power-off state, the UEFI performs the Power On Self Test (POST) and sends S2_D2 to the BMC; after the UEFI has loaded the OS on the host, it notifies the BMC by sending the ACPI S0_G0_D0 state.</p>
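<p>For reference, the same report can be sent by hand with <code>ipmitool raw</code>. The byte encoding below reflects our reading of the IPMI Set ACPI Power State command (bit 7 of each data byte marks the state as being set, and the low bits encode the state), so treat it as a sketch to be verified against the IPMI specification:</p>

```python
# ACPI system/device power state encodings as we understand them from the
# IPMI Set ACPI Power State command (NetFn 0x06, Cmd 0x06).
SET_FLAG = 0x80  # bit 7: "set the state" rather than "no change"
SYSTEM_STATES = {"S0_G0": 0x00, "S5_G2": 0x05}
DEVICE_STATES = {"D0": 0x00, "D2": 0x02}

def set_acpi_state_cmd(system, device):
    """Build the ipmitool invocation that reports an ACPI state to the BMC."""
    sys_byte = SET_FLAG | SYSTEM_STATES[system]
    dev_byte = SET_FLAG | DEVICE_STATES[device]
    return f"ipmitool raw 0x06 0x06 {sys_byte:#04x} {dev_byte:#04x}"

print(set_acpi_state_cmd("S0_G0", "D0"))  # → ipmitool raw 0x06 0x06 0x80 0x80
```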
    <div>
      <h2>Fixing the issues</h2>
      <a href="#fixing-the-issues">
        
      </a>
    </div>
    <p>Going back to the boot-up issues we faced, we discovered that they were mostly caused by devices in the same power domain as the CPU interfering with the UEFI/platform firmware initialization phase. Below is a high-level description of the fixes we applied.</p><p><b><i>Servers not booting</i></b><b>:</b> After identifying the devices that were interfering with the POST stage of firmware initialization, we used the host ACPI state to control when we set the appropriate power mode for those devices, so as not to cause POST to fail.</p><p><b><i>Memory modules missing</i></b><b>:</b> During the boot-up process, memory modules (DIMMs) are powered and initialized in the S2_D2 ACPI state. During this initialization, UEFI firmware sends read commands to the Serial Presence Detect (SPD) on each DIMM to retrieve information for DIMM enumeration. At the same time, the BMC could be sending commands to read the DIMM temperature sensors. This can cause SMBus collisions, which could cause either the DIMM temperature readings or UEFI DIMM enumeration to fail. The latter case would cause the system to boot up with reduced DIMM capacity, which could be mistaken for a failing DIMM. After discovering the race condition, we stopped the BMC from reading the DIMM temperature sensors during the S2_D2 ACPI state and set a fixed speed for the corresponding fans. This solution allows the UEFI to retrieve all the necessary DIMM information for enumeration, and our servers now boot up with the correct amount of memory.</p><p><b><i>Thermal telemetry:</i></b> In the S0_G0 power state, when sensors are not reporting values back to the BMC, the BMC assumes that devices may be overheating and puts the fan controller into a fail-safe mode in which fan speeds are ramped up to maximum. However, in the S5_G2 state, some thermal sensors, such as the CPU temperature, NIC temperature, etc., are not powered and not available. Our solution is to mark these thermal sensors as non-functional in their exported configuration while in the S5_G2 state and during the transition from S5_G2 to S2_D2. Marking the affected devices as non-functional, instead of waiting for thermal sensor read commands to error out, prevents the controller from entering fail-safe mode.</p>
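<p>The per-state sensor handling can be sketched as a lookup from ACPI state to the sensors expected to be functional. The sensor names here are illustrative, not our exact configuration:</p>

```python
ALL_SENSORS = {"CPU_TEMP", "NIC_TEMP", "DIMM_TEMP_A1", "INLET_TEMP"}

# Sensors that are unpowered (and therefore expected to be absent) per ACPI state.
UNPOWERED = {
    "S5_G2": {"CPU_TEMP", "NIC_TEMP", "DIMM_TEMP_A1"},  # host off: only chassis sensors
    "S2_D2": {"CPU_TEMP", "NIC_TEMP"},                  # firmware init: DIMMs are powered
    "S0_G0_D0": set(),                                  # working: everything is powered
}

def functional_sensors(acpi_state):
    """Sensors the fan controller should consult in the given ACPI state."""
    return ALL_SENSORS - UNPOWERED[acpi_state]

print(sorted(functional_sensors("S5_G2")))  # → ['INLET_TEMP']
```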
    <div>
      <h2>Moving forward</h2>
      <a href="#moving-forward">
        
      </a>
    </div>
    <p>Aside from resolving issues, we have seen other benefits from implementing the ACPI power state in our BMC firmware. One example is our automated firmware regression testing. Various parts of our tests require rebooting/power cycling the servers over a hundred times, during which we monitor the ACPI power state changes of our servers rather than using a boolean (running or not running, pingable or not pingable) to assert their status.</p><p>It has also given us the opportunity to learn more about the complex subsystems in a server system and the various power modes of the different subsystems. This is an area we are still actively learning about as we look to further optimize the boot sequence of our servers.</p><p>Over time, implementing ACPI states is helping us achieve the following:</p><ul><li><p>All components are enabled by the end of the boot sequence,</p></li><li><p>The BIOS and BMC are able to retrieve component information,</p></li><li><p>And the BMC is aware when thermal sensors are in a non-functional state.
</p></li></ul><p>For better observability of the boot progress and “last state” of our systems, we have also started adding the BootProgress object of the <a href="https://redfish.dmtf.org/schemas/v1/ComputerSystem.v1_13_0.json"><u>Redfish ComputerSystem Schema</u></a> to our systems. This will give us pre-operating system (OS) boot observability and an easier starting point for debugging when the UEFI has issues during server platform initialization (such as when the server isn’t coming on).</p><p>Every day, Cloudflare’s OpenBMC team, made up of folks from different embedded backgrounds, learns about, experiments with, and deploys OpenBMC across our global fleet. This has been made possible by relying on the OpenBMC community’s contributions (as well as upstreaming some of our own), and by our interactions with our various vendors, giving us the opportunity to make our systems more reliable and to take ownership of, and responsibility for, the firmware that powers the BMCs that manage our servers. If you are thinking of embracing open-source firmware on your BMC, we hope this blog post, written by a team that started deploying OpenBMC less than 18 months ago, has inspired you to give it a try.</p><p>If you are considering making the jump to open-source firmware, check it out <a href="https://github.com/openbmc/openbmc"><u>here</u></a>!</p> ]]></content:encoded>
            <category><![CDATA[Infrastructure]]></category>
            <category><![CDATA[Open Source]]></category>
            <category><![CDATA[OpenBMC]]></category>
            <category><![CDATA[Servers]]></category>
            <category><![CDATA[Firmware]]></category>
            <guid isPermaLink="false">2hySj1JFTXmlofjA6IRijm</guid>
            <dc:creator>Nnamdi Ajah</dc:creator>
            <dc:creator>Ryan Chow</dc:creator>
            <dc:creator>Giovanni Pereira Zantedeschi</dc:creator>
        </item>
        <item>
            <title><![CDATA[The effect of switching to TCMalloc on RocksDB memory use]]></title>
            <link>https://blog.cloudflare.com/the-effect-of-switching-to-tcmalloc-on-rocksdb-memory-use/</link>
            <pubDate>Wed, 03 Feb 2021 12:00:00 GMT</pubDate>
            <description><![CDATA[ The memory allocator is an important part of the system, so choosing the right allocator for a workload can bring huge benefits. Here is the story of how we decreased service memory usage by almost three times. ]]></description>
            <content:encoded><![CDATA[ <p>In previous posts we wrote about our configuration distribution system <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a> and the story of <a href="/moving-quicksilver-into-production/">migrating its storage engine to RocksDB</a>. This solution proved to be fast, resilient, and stable. During the migration, we noticed that <a href="/tag/quicksilver/">Quicksilver</a> memory consumption was unexpectedly high. Our investigation found that the root cause was the default memory allocator we used. Switching the memory allocator improved the service’s memory consumption by almost a factor of three.</p>
    <div>
      <h3>Unexpected memory growth</h3>
      <a href="#unexpected-memory-growth">
        
      </a>
    </div>
    <p>After migrating to RocksDB, the memory used by the application increased significantly. The way memory grew over time also looked suspicious: it was around 15GB immediately after startup and then grew steadily for multiple days before stabilizing at around 30GB. Below, you can see the memory consumption increase after migrating one of our test instances to RocksDB.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48oOnjn1zTcqYyKtQBeW0b/563b4460a530049d6879373a37ccfa11/image5-1.png" />
            
            </figure><p>We started our investigation with heap profiling, assuming we had a memory leak somewhere, and found that the heap size was almost three times smaller than the RSS value reported by the operating system. So, if our application does not actually use all this memory, it means that memory is ‘lost’ somewhere between the system and our application, which points to possible problems with the memory allocator.</p><p>We have multiple services running with the TCMalloc allocator, so in order to test our hypothesis, we ran a test with TCMalloc on a couple of instances. The test showed a significant improvement in memory usage. So why did this happen? We’ll dig into memory allocator internals to understand the issue.</p>
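<p>The gap we observed is simply the ratio between the profiled heap and the RSS the kernel reports. The sketch below parses <code>VmRSS</code> from a sample <code>/proc/&lt;pid&gt;/status</code> excerpt; the numbers are illustrative, not our production figures:</p>

```python
SAMPLE_STATUS = """\
Name:   quicksilver
VmPeak:   31457280 kB
VmRSS:    31457280 kB
VmData:   30408704 kB
"""

def vm_rss_kb(status_text):
    """Extract the resident set size (kB) from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("VmRSS not found")

rss_kb = vm_rss_kb(SAMPLE_STATUS)       # 30 GB resident, as seen by the kernel
heap_kb = 10 * 1024 * 1024              # ~10 GB reported by the heap profiler (illustrative)
print(round(rss_kb / heap_kb, 1))  # → 3.0
```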
    <div>
      <h3>glibc malloc</h3>
      <a href="#glibc-malloc">
        
      </a>
    </div>
    <p>Let’s begin with a high-level view of glibc’s malloc design. malloc uses a concept called an <code>arena</code>. An arena is a contiguous block of memory obtained from the system. An important part of glibc malloc’s design is that it expects developers to free memory in the reverse order of allocation; otherwise, a lot of memory will be ‘locked’ and never returned to the system. Let’s see what this means in practice:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3qbJXbtLNpoYVJltXOwZ5l/db3a7af30b7530c4d306aab178054dc3/image3-2.png" />
            
            </figure><p>In the picture, you can see an arena, from which we allocated three chunks of memory: 100kb, 40kb, 1kb. Next, the application frees the chunks with sizes of 40kb and 100kb:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3lc8IZCllTHUWs8tt5yYKY/6216c5ccd1210e6c9c4e0f6f565e9761/image7.png" />
            
            </figure><p>Before we go further, let me explain the terminology I use here and what each type of memory means:</p><ul><li><p>Free - this is virtual memory of a process, not backed by physical memory; it corresponds to the VIRT parameter of the top/ps command.</p></li><li><p>Used - memory used by the application, backed by physical memory; it contributes to the RES parameter of the top/ps command.</p></li><li><p>Available - memory held by the allocator, backed by physical memory. The allocator can either return this memory to the OS, making it ‘Free’, or later reuse it to satisfy application requests. From a system perspective, this memory is still held by the application. Available + Used = RES.</p></li></ul><p>So we see that the memory which was used by the application changed state to Available, and it is not returned to the operating system. This is because malloc can only return memory from the top of the heap, and in the case above we have a 1kb chunk of memory that blocks 140kb from being released back to the system. As soon as we release this 1kb object, all that memory can be returned to the system.</p><p>Let’s go further with our simple example: if our application allocates and frees memory without keeping malloc’s design in mind, after a while we will see roughly the following picture:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29qGxcTcszscPid2itjZdv/088280e8b87b336723239a315a425be5/image6.png" />
            
            </figure><p>Here we see one of the main problems that all allocators try to solve: memory fragmentation. Some chunks are used by the application, but a lot of the memory is not used at the moment. And since it is not returned to the system, other services can’t use this memory either. malloc implements several mechanisms to decrease memory fragmentation, but fragmentation is a problem every allocator has, and how bad it gets depends on many factors: allocator design, workload, settings, etc.</p><p>OK, so the problem is clear: memory fragmentation. But why did it lead to such high memory usage? To understand that, let’s take a step back and consider how malloc works for highly concurrent multithreaded applications.</p><p>To allocate a chunk of memory from an arena, a thread must acquire an exclusive lock on that arena. When an application has multiple threads, this creates lock contention and poor performance for multithreaded services. To handle this situation, malloc creates several arenas, using the following logic:</p><ul><li><p>A thread tries to get a chunk of memory from the arena it used last time; to do that, it acquires an exclusive lock on the arena</p></li><li><p>If the lock is held by another thread, it tries the next arena</p></li><li><p>If all arenas are locked, it creates a new arena and takes memory from it</p></li><li><p>There is a limit on the number of arenas - eight arenas per core</p></li></ul><p>Normally, our service has around 25 threads, and we have seen 60-80 arenas allocated by malloc using the logic above.</p><p>And this is where the fragmentation problem magnifies and leads to huge memory waste. All arenas are independent of each other, and memory can never move from one arena to another. Why is that bad? Let’s take a look at the following example:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5jS0cCTFnudbPUHycTaXjB/18d99cf95cfdf8045a207f40292fc2dd/image4.png" />
            
            </figure><p>Here, we can see that Thread 1 requests 20kb of memory from Arena 1; as I wrote before, malloc tries to allocate memory from the same arena a thread used before. Since Arena 1 still has enough free memory, Thread 1 gets a block from it, which in the end increases the memory the process takes from the system. Ideally, in this scenario, we would prefer to get this block from Arena 2, since it has a chunk of that size available. However, due to the design, this won’t happen.</p><p>The main point here: having multiple independent arenas improves the performance of multithreaded applications by reducing lock contention, but the trade-off is increased memory fragmentation, since each memory request chooses the best-fit fragment from an individual arena, not the best-fit fragment overall.</p><p>Remember, I wrote that memory locked between used chunks can never be returned to the system? Actually, there is a way to do that: <code>malloc_trim</code> is a function provided by glibc malloc, and it does exactly that. It goes through all the unused chunks and returns them to the system. The problem is that you need to explicitly call this function from your application. You might say: “Oh, wait, I remember that this function is sometimes called when you call the free function, I saw it in the man page.” No, that never happens; it’s a bug in the man page that existed for more than 15 years, and is now finally <a href="https://lore.kernel.org/linux-man/CAB6khqWO_meFaNn+cTtaKBDg8Zus-o6HD49Bo3KChk-5GkdFng@mail.gmail.com/T/#u">fixed</a>!</p><p>Let’s now discuss what options we have to improve the memory consumption of glibc malloc. Here are a couple of useful strategies to try out:</p><ul><li><p>The first thing you will find on the Internet is to reduce <code>MALLOC_ARENA_MAX</code> to a lower value, usually 2. This setting limits the number of arenas malloc creates per core. The fewer arenas we have, the better the memory reuse and hence the lower the fragmentation, but at the same time lock contention increases.</p></li><li><p>Calling <code>malloc_trim</code> from time to time. This function goes through the arenas one at a time, locking each one and releasing its unused chunks back to the system. In the end this increases lock contention, executes a lot of syscalls to return the memory, and later leads to more page faults and, again, worse performance.</p></li><li><p><code>M_MMAP_THRESHOLD</code>. All allocations larger than this parameter use the mmap syscall and do not take memory from an arena directly. That means memory allocated this way is never locked between used chunks and can always be returned to the system. It solves the fragmentation problem for large chunks, so only small chunks can get locked. The trade-off is that each such allocation executes an expensive syscall, and there is a system limit that caps the maximum number of chunks allocated with mmap.</p></li></ul><p>Short summary: multiple arenas cause higher memory fragmentation that can lead to 2-3x higher memory consumption.</p>
    <div>
      <h3>TCMalloc</h3>
      <a href="#tcmalloc">
        
      </a>
    </div>
    <p>While glibc malloc was designed for single-threaded applications and later optimized for multithreaded services, TCMalloc was built for multithreading from the beginning. Let’s take a look at how it tries to solve the problems we just talked about. The TCMalloc design is more complex, so if you want to understand the details I recommend reading the official design <a href="https://google.github.io/tcmalloc/design.html">page</a>. Here is a high-level view of its design:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2rECXU30tUMO67GFrvNkAl/e14a5671d673d7814bfeee6aabe20d10/image1-1.png" />
            
            </figure><p>Here we can see the three main parts of the TCMalloc design:</p><ul><li><p>Back-end: allocates big chunks of memory from the system, returns these chunks to the operating system when they are no longer needed, and also serves big allocation requests.</p></li><li><p>Front-end: serves allocation requests; there is one cache per core.</p></li><li><p>Middle-end: the core part of the TCMalloc design, which helps significantly reduce fragmentation for multithreaded applications. It populates the caches and returns unused memory to the back-end, but most importantly it can move memory from one cache to another, dramatically improving memory reuse.</p></li></ul><p>Let’s look at how it works, using the example we showed for malloc:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/pKqecbbqmOGXr6ze3lcTj/89b5c985510fbbeee702645443034ce6/image2-3.png" />
            
            </figure><p>Here we see the following:</p><ol><li><p>Cache 2 has a chunk of memory that it doesn’t need, so it returns it to the middle-end</p></li><li><p>Thread 1 requests 20kb of memory from cache 1</p></li><li><p>Cache 1 doesn’t have a chunk of that size, so it requests the memory from the middle-end, where it can reuse the memory returned by cache 2</p></li></ol><p>This design dramatically improves memory reuse: if memory was freed by one thread, it can be moved to the middle-end and later reused by other threads.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>The main goal of this post is to make people aware of the importance of the choice of memory allocator. After deploying TCMalloc, we decreased memory usage by 2.5 times.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57QSNKEwjnx9XL1rSiQIhN/10445087ef19eb588637bbc8010268b4/image8.png" />
            
            </figure><p>Using an allocator that is not optimal for your workload can cause a huge waste of memory. If you have a long-running application with a lot of threads and care about memory usage, then glibc malloc is probably not the right choice. Allocators designed for multithreaded services, like TCMalloc, jemalloc, and others, can provide much better memory utilization. So be conscious of this factor, and go check how much memory your application wastes.</p> ]]></content:encoded>
            <category><![CDATA[Servers]]></category>
            <guid isPermaLink="false">5iICvhHG1pdCmXMclo2H5z</guid>
            <dc:creator>Dmitry Vorobev</dc:creator>
        </item>
    </channel>
</rss>