io_submit: The epoll alternative you've never heard about

My curiosity was piqued by an LWN article about IOCB_CMD_POLL - A new kernel polling interface. It discusses the addition of a new polling mechanism to the Linux AIO API, which was merged in the 4.18 kernel. The whole idea is rather intriguing. The author of the patch proposes using the Linux AIO API with things like network sockets.

Hold on. The Linux AIO is designed for, well, Asynchronous disk IO! Disk files are not the same thing as network sockets! Is it even possible to use the Linux AIO API with network sockets in the first place?

The answer turns out to be a strong YES! In this article I'll explain how to use the strengths of Linux AIO API to write better and faster network servers.

But before we start, what is Linux AIO anyway?

Photo by Scott Schiller CC/BY/2.0

Introduction to Linux AIO

Linux AIO exposes asynchronous disk IO to userspace software.

Historically on Linux, all disk operations were blocking. Whether you did open(), read(), write() or fsync(), you could be sure your thread would stall if the needed data and meta-data were not ready in the disk cache. This usually isn't a problem. If you do a small amount of IO or have plenty of memory, the disk syscalls gradually fill the cache and on average are rather fast.

The IO operation performance drops for IO-heavy workloads, like databases or caching web proxies. In such applications it would be tragic if a whole server stalled, just because some odd read() syscall had to wait for disk.

To work around this problem, applications use one of the three approaches:

(1) Use thread pools and offload blocking syscalls to worker threads. This is what glibc POSIX AIO (not to be confused with Linux AIO) wrapper does. (See: IBM's documentation). This is also what we ended up doing in our application at Cloudflare - we offloaded read() and open() calls to a thread pool.

(2) Pre-warm the disk cache with posix_fadvise(2) and hope for the best.

(3) Use Linux AIO with XFS file system, file opened with O_DIRECT, and avoid the undocumented pitfalls.

None of these methods is perfect. Even Linux AIO, if used carelessly, can still block in the io_submit() call. This was recently mentioned in another LWN article:

The Linux asynchronous I/O (AIO) layer tends to have many critics and few defenders, but most people at least expect it to actually be asynchronous. In truth, an AIO operation can block in the kernel for a number of reasons, making AIO difficult to use in situations where the calling thread truly cannot afford to block.

Now that we know what Linux AIO API doesn't do well, let's see where it shines.

Simplest Linux AIO program

To use Linux AIO you first need to define the 5 needed syscalls yourself, because glibc doesn't provide wrapper functions. The basic flow uses three of them:

(1) First call io_setup() to set up the aio_context data structure. Kernel will hand us an opaque pointer.

(2) Then we can call io_submit() to submit a vector of "I/O control blocks" struct iocb for processing.

(3) Finally, we can call io_getevents() to block and wait for a vector of struct io_event - completion notification of the iocb's.
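Since glibc has no wrappers, the five syscalls have to be defined by hand. A minimal sketch, assuming the kernel headers provide linux/aio_abi.h and the SYS_io_* numbers (true on common Linux distributions); the function names simply mirror the syscall names:

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>   /* aio_context_t, struct iocb, struct io_event */
#include <sys/syscall.h>     /* SYS_io_* syscall numbers */
#include <time.h>            /* struct timespec */
#include <unistd.h>          /* syscall() */

/* Each wrapper returns -1 and sets errno on failure, like any raw syscall. */
static inline int io_setup(unsigned nr_events, aio_context_t *ctx)
{
    return (int)syscall(SYS_io_setup, (long)nr_events, ctx);
}

static inline int io_destroy(aio_context_t ctx)
{
    return (int)syscall(SYS_io_destroy, ctx);
}

static inline int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
    return (int)syscall(SYS_io_submit, ctx, nr, iocbpp);
}

static inline int io_cancel(aio_context_t ctx, struct iocb *iocb,
                            struct io_event *result)
{
    return (int)syscall(SYS_io_cancel, ctx, iocb, result);
}

static inline int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                               struct io_event *events, struct timespec *timeout)
{
    return (int)syscall(SYS_io_getevents, ctx, min_nr, max_nr, events, timeout);
}
```

Note that the context passed to io_setup() must be initialized to 0, otherwise the call fails with EINVAL.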

There are 8 commands that can be submitted in an iocb: two read, two write, two fsync variants, a no-op, and a POLL command introduced in the 4.18 kernel:

IOCB_CMD_PREAD = 0,
IOCB_CMD_PWRITE = 1,
IOCB_CMD_FSYNC = 2,
IOCB_CMD_FDSYNC = 3,
IOCB_CMD_POLL = 5,   /* from 4.18 */
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,

The struct iocb passed to io_submit is large, and tuned for disk IO. Here's a simplified version:

struct iocb {
  __u64 aio_data;       /* user data */
  ...
  __u16 aio_lio_opcode; /* see IOCB_CMD_ above */
  ...
  __u32 aio_fildes;     /* file descriptor */
  __u64 aio_buf;        /* pointer to buffer */
  __u64 aio_nbytes;     /* buffer size */
  ...
};

The completion notification retrieved from io_getevents:

struct io_event {
  __u64  data;  /* user data */
  __u64  obj;   /* pointer to request iocb */
  __s64  res;   /* result code for this event */
  __s64  res2;  /* secondary result */
};

Let's try an example. Here's the simplest program reading /etc/passwd file with Linux AIO API:

int fd = open("/etc/passwd", O_RDONLY);

/* ctx must be zero before io_setup() */
aio_context_t ctx = 0;
int r = io_setup(128, &ctx);

char buf[4096];
struct iocb cb = {.aio_fildes = fd,
                  .aio_lio_opcode = IOCB_CMD_PREAD,
                  .aio_buf = (uint64_t)buf,
                  .aio_nbytes = sizeof(buf)};
struct iocb *list_of_iocb[1] = {&cb};

/* queue one read request */
r = io_submit(ctx, 1, list_of_iocb);

/* wait for exactly one completion */
struct io_event events[1] = {{0}};
r = io_getevents(ctx, 1, 1, events, NULL);

long long bytes_read = events[0].res;
printf("read %lld bytes from /etc/passwd\n", bytes_read);

Full source is, as usual, on GitHub. Here's a strace of this program:

openat(AT_FDCWD, "/etc/passwd", O_RDONLY)
io_setup(128, [0x7f4fd60ea000])
io_submit(0x7f4fd60ea000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7ffc5ff703d0, aio_nbytes=4096, aio_offset=0}])
io_getevents(0x7f4fd60ea000, 1, 1, [{data=0, obj=0x7ffc5ff70390, res=2494, res2=0}], NULL)

This all worked fine! But the disk read was not asynchronous: the io_submit syscall blocked and did all the work! The io_getevents call finished instantly. We could try to make the disk read async, but this requires the O_DIRECT flag, which skips the caches.

Let's try to better illustrate the blocking nature of io_submit on normal files. Here's a similar example, showing the strace output when reading a large 1GiB block from /dev/zero:

io_submit(0x7fe1e800a000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7fe1a79f4000, aio_nbytes=1073741824, aio_offset=0}]) \
    = 1 <0.738380>
io_getevents(0x7fe1e800a000, 1, 1, [{data=0, obj=0x7fffb9588910, res=1073741824, res2=0}], NULL) \
    = 1 <0.000015>

The kernel spent 738ms in io_submit and only 15us in io_getevents. The kernel behaves the same way with network sockets - all the work is done in io_submit.
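The same measurement can be taken without strace. The sketch below times both syscalls around a single IOCB_CMD_PREAD from /dev/zero; the helper names timed_aio_read and seconds_since are invented for this example, and the raw syscall() calls stand in for the hand-written wrappers.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static double seconds_since(const struct timespec *t0)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0->tv_sec) +
           (double)(t1.tv_nsec - t0->tv_nsec) / 1e9;
}

/* Read nbytes from /dev/zero with one IOCB_CMD_PREAD and time both syscalls.
 * Returns io_event.res (bytes read or negative errno), or -1 if setup failed. */
static long long timed_aio_read(size_t nbytes, double *submit_s, double *getevents_s)
{
    long long result = -1;
    aio_context_t ctx = 0;
    char *buf = malloc(nbytes);
    int fd = open("/dev/zero", O_RDONLY);

    *submit_s = 0.0;
    *getevents_s = 0.0;

    if (buf != NULL && fd >= 0 && syscall(SYS_io_setup, 1L, &ctx) == 0) {
        struct iocb cb = {.aio_fildes = (uint32_t)fd,
                          .aio_lio_opcode = IOCB_CMD_PREAD,
                          .aio_buf = (uint64_t)(uintptr_t)buf,
                          .aio_nbytes = nbytes};
        struct iocb *list[1] = {&cb};
        struct io_event ev[1] = {{0}};
        struct timespec t0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        long submitted = syscall(SYS_io_submit, ctx, 1L, list);
        *submit_s = seconds_since(&t0);

        if (submitted == 1) {
            clock_gettime(CLOCK_MONOTONIC, &t0);
            long got = syscall(SYS_io_getevents, ctx, 1L, 1L, ev, NULL);
            *getevents_s = seconds_since(&t0);
            if (got == 1)
                result = ev[0].res;
        }
        syscall(SYS_io_destroy, ctx);
    }

    if (fd >= 0)
        close(fd);
    free(buf);
    return result;
}
```

On a buffered file one would expect submit_s to dominate, as in the strace above, though the exact numbers depend on the machine.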

Photo by Helix84 CC/BY-SA/3.0

Linux AIO with sockets - batching

The implementation of io_submit is rather conservative. Unless the passed descriptor is an O_DIRECT file, it will just block and perform the requested action. In the case of network sockets it means:

For blocking sockets, IOCB_CMD_PREAD will hang until a packet arrives.
For non-blocking sockets, IOCB_CMD_PREAD will complete with a -11 (EAGAIN) result code.

These are exactly the same semantics as for vanilla read() syscall. It's fair to say that for network sockets io_submit is no smarter than good old read/write calls.

It's important to note that the iocb requests passed to the kernel are evaluated sequentially, in order.

While Linux AIO won't help with async operations, it can definitely be used for syscall batching!

If you have a web server needing to send and receive data from hundreds of network sockets, using io_submit can be a great idea. This would avoid having to call send and recv hundreds of times. This will improve performance - jumping back and forth from userspace to kernel is not free, especially since the Meltdown and Spectre mitigations.

                        One buffer                    Multiple buffers
One file descriptor     read()                        readv()
Many file descriptors   io_submit + IOCB_CMD_PREAD    io_submit + IOCB_CMD_PREADV

To illustrate the batching aspect of io_submit, let's create a small program forwarding data from one TCP socket to another. In simplest form, without Linux AIO, the program would have a trivial flow like this:

while True:
  d = sd1.read(4096)
  sd2.write(d)

We can express the same logic with Linux AIO. The code will look like this:

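/* cb[0] writes what the previous iteration read; cb[1] reads the next chunk. */
struct iocb cb[2] = {{.aio_fildes = sd2,
                      .aio_lio_opcode = IOCB_CMD_PWRITE,
                      .aio_buf = (uint64_t)&buf[0],
                      .aio_nbytes = 0},   /* nothing to write on the first pass */
                     {.aio_fildes = sd1,
                      .aio_lio_opcode = IOCB_CMD_PREAD,
                      .aio_buf = (uint64_t)&buf[0],
                      .aio_nbytes = BUF_SZ}};
struct iocb *list_of_iocb[2] = {&cb[0], &cb[1]};
while(1) {
  r = io_submit(ctx, 2, list_of_iocb);

  struct io_event events[2] = {};
  r = io_getevents(ctx, 2, 2, events, NULL);
  /* events[1] is the read completion; stop on EOF or read error */
  if (events[1].res <= 0)
    break;
  /* the bytes just read become the next write */
  cb[0].aio_nbytes = events[1].res;
}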

This code submits two jobs to io_submit: first a request to write data to sd2, then a request to read data from sd1. After the read is done, the code fixes up the write buffer size and loops again. The code does a cool trick - the first write is of size 0. We do this because the size of a write depends on the result of the preceding read, which is known only after io_submit returns. So we can fuse write+read in one io_submit call, but not read+write.

Is this code faster than the simple read/write version? Not yet. Both versions have two syscalls: read+write and io_submit+io_getevents. Fortunately, we can improve it.

Getting rid of io_getevents

When running io_setup(), the kernel allocates a couple of pages of memory for the process. This is what this memory block looks like in /proc/<pid>/maps:

marek:~$ cat /proc/`pidof -s aio_passwd`/maps
...
7f7db8f60000-7f7db8f63000 rw-s 00000000 00:12 2314562     /[aio] (deleted)
...

The [aio] memory region (12KiB in my case) was allocated by io_setup. This memory range is used as a ring buffer storing the completion events. In most cases, there isn't any reason to call the real io_getevents syscall. The completion data can be easily retrieved from the ring buffer without the need to consult the kernel. Here is a fixed version of the code:

int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                 struct io_event *events, struct timespec *timeout)
{
    int i = 0;

    struct aio_ring *ring = (struct aio_ring*)ctx;
    if (ring == NULL || ring->magic != AIO_RING_MAGIC) {
        goto do_syscall;
    }

    while (i < max_nr) {
        unsigned head = ring->head;
        if (head == ring->tail) {
            /* There are no more completions */
            break;
        } else {
            /* There is another completion to reap */
            events[i] = ring->events[head];
            read_barrier();
            ring->head = (head + 1) % ring->nr;
            i++;
        }
    }

    if (i == 0 && timeout != NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
        /* Requested non blocking operation. */
        return 0;
    }

    if (i && i >= min_nr) {
        return i;
    }

do_syscall:
    return syscall(__NR_io_getevents, ctx, min_nr-i, max_nr-i, &events[i], timeout);
}

Here's full code. This ring buffer interface is poorly documented. I adapted this code from the axboe/fio project.

With this code fixing the io_getevents function, our Linux AIO version of the TCP proxy needs only one syscall per loop, and indeed is a tiny bit faster than the read+write code.

Photo by Train Photos CC/BY-SA/2.0

Epoll alternative

With the addition of IOCB_CMD_POLL in kernel 4.18, one can also use io_submit as a select/poll/epoll equivalent. For example, here's some code waiting for data on a socket:

struct iocb cb = {.aio_fildes = sd,
                  .aio_lio_opcode = IOCB_CMD_POLL,
                  .aio_buf = POLLIN};
struct iocb *list_of_iocb[1] = {&cb};

r = io_submit(ctx, 1, list_of_iocb);
r = io_getevents(ctx, 1, 1, events, NULL);

Full code. Here's the strace view:

io_submit(0x7fe44bddd000, 1, [{aio_lio_opcode=IOCB_CMD_POLL, aio_fildes=3}]) \
    = 1 <0.000015>
io_getevents(0x7fe44bddd000, 1, 1, [{data=0, obj=0x7ffef65c11a8, res=1, res2=0}], NULL) \
    = 1 <1.000377>

As you can see, this time the "async" part worked fine: io_submit finished instantly and io_getevents successfully blocked for 1s while awaiting data. This is pretty powerful and can be used instead of the epoll_wait() syscall.

Furthermore, normally dealing with epoll requires juggling epoll_ctl syscalls. Application developers go to great lengths to avoid calling this syscall too often. Just read the man page on EPOLLONESHOT and EPOLLET flags. Using io_submit for polling works around this whole complexity, and doesn't require any spurious syscalls. Just push your sockets to the iocb request vector, call io_submit exactly once and wait for completions. The API can't be simpler than this.

Summary

In this blog post we reviewed the Linux AIO API. While initially conceived as a disk-only API, it seems to work the same way as normal read/write syscalls on network sockets. But unlike read/write, io_submit allows syscall batching, potentially improving performance.

Since kernel 4.18, io_submit and io_getevents can be used to wait for events like POLLIN and POLLOUT on network sockets. This is great, and could be used as a replacement for epoll in the event loop.

I can imagine a network server that could just be doing io_submit and io_getevents syscalls, as opposed to the usual mix of read, write, epoll_ctl and epoll_wait. With such a design the syscall batching aspect of io_submit could really shine. Such a server could be meaningfully faster.

Sadly, even with recent Linux AIO API improvements, the larger discussion remains. Famously, Linus hates it:

AIO is a horrible ad-hoc design, with the main excuse being "other, less gifted people, made that design, and we are implementing it for compatibility because database people - who seldom have any shred of taste - actually use it". But AIO was always really really ugly.

Over the years there have been multiple attempts at creating better batching and async interfaces; unfortunately, they lacked a coherent vision. For example, the recent addition of sendto(MSG_ZEROCOPY) allows for truly async transmission operations, but no batching. io_submit allows batching but not async. It's even worse than that - Linux currently has three ways of delivering async notifications - signals, io_getevents and MSG_ERRQUEUE.

Having said that I'm really excited to see the new developments which allow developing faster network servers. I'm jumping on the code to replace my rusty epoll event loops with io_submit!

The Cloudflare Blog
