The Cloudflare Blog

Searching for the cause of hung tasks in the Linux kernel

Oxana Kharitonova — Fri, 14 Feb 2025 14:00:00 GMT

Depending on your configuration, the Linux kernel can produce a hung task warning message in its log. Searching the Internet and the kernel documentation, you can find a brief explanation that the kernel process is stuck in the uninterruptable state and hasn’t been scheduled on the CPU for an unexpectedly long period of time. That explains the warning’s meaning, but doesn’t provide the reason it occurred. In this blog post we’re going to explore how the hung task warning works, why it happens, whether it is a bug in the Linux kernel or application itself, and whether it is worth monitoring at all.

INFO: task XXX:1495882 blocked for more than YYY seconds.

The hung task message in the kernel log looks like this:

INFO: task XXX:1495882 blocked for more than YYY seconds.
     Tainted: G          O       6.6.39-cloudflare-2024.7.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:XXX         state:D stack:0     pid:1495882 ppid:1      flags:0x00004002
. . .

Processes in Linux can be in different states. Some of them are running or ready to run on the CPU — they are in the TASK_RUNNING state. Others are waiting for some signal or event to happen, e.g. network packets to arrive or terminal input from a user. They are in a TASK_INTERRUPTIBLE state and can spend an arbitrary length of time in this state until being woken up by a signal. The most important thing about these states is that they still can receive signals, and be terminated by a signal. In contrast, a process in the TASK_UNINTERRUPTIBLE state is waiting only for certain special classes of events to wake them up, and can’t be interrupted by a signal. The signals are not delivered until the process emerges from this state and only a system reboot can clear the process. It’s marked with the letter D in the log shown above.

What if this wake up event doesn’t happen or happens with a significant delay? (A “significant delay” may be on the order of seconds or minutes, depending on the system.) Then our dependent process is hung in this state. What if this dependent process holds some lock and prevents other processes from acquiring it? Or if we see many processes in the D state? Then it might tell us that some of the system resources are overwhelmed or are not working correctly. At the same time, this state is very valuable, especially if we want to preserve the process memory. It might be useful if part of the data is written to disk and another part is still in the process memory — we don’t want inconsistent data on a disk. Or maybe we want a snapshot of the process memory when the bug is hit. To preserve this behaviour, but make it more controlled, a new state was introduced in the kernel: TASK_KILLABLE — it still protects a process, but allows termination with a fatal signal.

How Linux identifies the hung process

The Linux kernel has a special thread called khungtaskd. It runs regularly depending on the settings, iterating over all processes in the D state. If a process is in this state for more than YYY seconds, we’ll see a message in the kernel log. There are settings for this daemon that can be changed according to your wishes:

$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 10
kernel.hung_task_warnings = 200

At Cloudflare, we changed the notification threshold kernel.hung_task_timeout_secs from the default 120 seconds to 10 seconds. You can adjust the value for your system depending on configuration and how critical this delay is for you. If the process spends more than hung_task_timeout_secs seconds in the D state, a log entry is written, and our internal monitoring system emits an alert based on this log. Another important setting here is kernel.hung_task_warnings — the total number of messages that will be sent to the log. We limit it to 200 messages and reset it every 15 minutes. It allows us not to be overwhelmed by the same issue, and at the same time doesn’t stop our monitoring for too long. You can make it unlimited by setting the value to "-1".

To better understand the root causes of the hung tasks and how a system can be affected, we’re going to review more detailed examples.

Example #1 or XFS

Typically, there is a meaningful process or application name in the log, but sometimes you might see something like this:

INFO: task kworker/13:0:834409 blocked for more than 11 seconds.
 	Tainted: G      	O   	6.6.39-cloudflare-2024.7.3 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/13:0	state:D stack:0 	pid:834409 ppid:2   flags:0x00004000
Workqueue: xfs-sync/dm-6 xfs_log_worker

In this log, kworker is the kernel thread. It’s used as a deferring mechanism, meaning a piece of work will be scheduled to be executed in the future. Under kworker, the work is aggregated from different tasks, which makes it difficult to tell which application is experiencing a delay. Luckily, the kworker is accompanied by the Workqueue line. Workqueue is a linked list, usually predefined in the kernel, where these pieces of work are added and performed by the kworker in the order they were added to the queue. The Workqueue name xfs-sync and the function which it points to, xfs_log_worker, might give a good clue where to look. Here we can make an assumption that the XFS is under pressure and check the relevant metrics. It helped us to discover that due to some configuration changes, we forgot no_read_workqueue / no_write_workqueue flags that were introduced some time ago to speed up Linux disk encryption.

Summary: In this case, nothing critical happened to the system, but the hung tasks warnings gave us an alert that our file system had slowed down.

Example #2 or Coredump

Let’s take a look at the next hung task log and its decoded stack trace:

INFO: task test:964 blocked for more than 5 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:964   ppid:916    flags:0x00004000
Call Trace:

__schedule (linux/kernel/sched/core.c:5378 linux/kernel/sched/core.c:6697) 
schedule (linux/arch/x86/include/asm/preempt.h:85 (discriminator 13) linux/kernel/sched/core.c:6772 (discriminator 13)) 
[do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)) 
? finish_task_switch.isra.0 (linux/arch/x86/include/asm/irqflags.h:42 linux/arch/x86/include/asm/irqflags.h:77 linux/kernel/sched/sched.h:1385 linux/kernel/sched/core.c:5132 linux/kernel/sched/core.c:5250) 
do_group_exit (linux/kernel/exit.c:1005) 
get_signal (linux/kernel/signal.c:2869) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_try_to_cancel.part.0 (linux/kernel/time/hrtimer.c:1347) 
arch_do_signal_or_restart (linux/arch/x86/kernel/signal.c:310) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
? hrtimer_nanosleep (linux/kernel/time/hrtimer.c:2105) 
exit_to_user_mode_prepare (linux/kernel/entry/common.c:176 linux/kernel/entry/common.c:210) 
syscall_exit_to_user_mode (linux/arch/x86/include/asm/entry-common.h:91 linux/kernel/entry/common.c:141 linux/kernel/entry/common.c:304) 
? srso_return_thunk (linux/arch/x86/lib/retpoline.S:217) 
do_syscall_64 (linux/arch/x86/entry/common.c:88) 
entry_SYSCALL_64_after_hwframe (linux/arch/x86/entry/entry_64.S:121)

The stack trace says that the process or application test was blocked for more than 5 seconds. We might recognise this user space application by the name, but why is it blocked? It’s always helpful to check the stack trace when looking for a cause. The most interesting line here is do_exit (linux/kernel/exit.c:433 (discriminator 4) linux/kernel/exit.c:825 (discriminator 4)). The source code points to the coredump_task_exit function. Additionally, checking the process metrics revealed that the application crashed during the time when the warning message appeared in the log. When a process is terminated based on some set of signals (abnormally), the Linux kernel can provide a core dump file, if enabled. The mechanism — when a process terminates, the kernel makes a snapshot of the process memory before exiting and either writes it to a file or sends it through the socket to another handler — can be systemd-coredump or your custom one. When it happens, the kernel moves the process to the D state to preserve its memory and early termination. The higher the process memory usage, the longer it takes to get a core dump file, and the higher the chance of getting a hung task warning.

Let’s check our hypothesis by triggering it with a small Go program. We’ll use the default Linux coredump handler and will decrease the hung task threshold to 1 second.

Coredump settings:

$ sudo sysctl -a --pattern kernel.core
kernel.core_pattern = core
kernel.core_pipe_limit = 16
kernel.core_uses_pid = 1

You can make changes with sysctl:

$ sudo sysctl -w kernel.core_uses_pid=1

Hung task settings:

$ sudo sysctl -a --pattern hung
kernel.hung_task_all_cpu_backtrace = 0
kernel.hung_task_check_count = 4194304
kernel.hung_task_check_interval_secs = 0
kernel.hung_task_panic = 0
kernel.hung_task_timeout_secs = 1
kernel.hung_task_warnings = -1

Go program:

$ cat main.go
package main

import (
	"os"
	"time"
)

func main() {
	_, err := os.ReadFile("test.file")
	if err != nil {
		panic(err)
	}
	time.Sleep(8 * time.Minute) 
}

This program reads a 10 GB file into process memory. Let’s create the file:

$ yes this is 10GB file | head -c 10GB > test.file

The last step is to build the Go program, crash it, and watch our kernel log:

$ go mod init test
$ go build .
$ GOTRACEBACK=crash ./test
$ (Ctrl+\)

Hooray! We can see our hung task warning:

$ sudo dmesg -T | tail -n 31
INFO: task test:8734 blocked for more than 22 seconds.
      Not tainted 6.6.72-cloudflare-2025.1.7 #1
      Blocked by coredump.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:test            state:D stack:0     pid:8734  ppid:8406   task_flags:0x400448 flags:0x00004000

By the way, have you noticed the Blocked by coredump. line in the log? It was recently added to the upstream code to improve visibility and remove the blame from the process itself. The patch also added the task_flags information, as Blocked by coredump is detected via the flag PF_POSTCOREDUMP, and knowing all the task flags is useful for further root-cause analysis.

Summary: This example showed that even if everything suggests that the application is the problem, the real root cause can be something else — in this case, coredump.

Example #3 or rtnl_mutex

This one was tricky to debug. Usually, the alerts are limited by one or two different processes, meaning only a certain application or subsystem experiences an issue. In this case, we saw dozens of unrelated tasks hanging for minutes with no improvements over time. Nothing else was in the log, most of the system metrics were fine, and existing traffic was being served, but it was not possible to ssh to the server. New Kubernetes container creations were also stalling. Analyzing the stack traces of different tasks initially revealed that all the traces were limited to just three functions:

rtnetlink_rcv_msg+0x9/0x3c0
dev_ethtool+0xc6/0x2db0 
bonding_show_bonds+0x20/0xb0

Further investigation showed that all of these functions were waiting for rtnl_lock to be acquired. It looked like some application acquired the rtnl_mutex and didn’t release it. All other processes were in the D state waiting for this lock.

The RTNL lock is primarily used by the kernel networking subsystem for any network-related config, for both writing and reading. The RTNL is a global mutex lock, although upstream efforts are being made for splitting up RTNL per network namespace (netns).

From the hung task reports, we can observe the “victims” that are being stalled waiting for the lock, but how do we identify the task that is holding this lock for too long? For troubleshooting this, we leveraged BPF via a bpftrace script, as this allows us to inspect the running kernel state. The kernel's mutex implementation has a struct member called owner. It contains a pointer to the task_struct from the mutex-owning process, except it is encoded as type atomic_long_t. This is because the mutex implementation stores some state information in the lower 3-bits (mask 0x7) of this pointer. Thus, to read and dereference this task_struct pointer, we must first mask off the lower bits (0x7).

Our bpftrace script to determine who holds the mutex is as follows:

#!/usr/bin/env bpftrace
interval:s:10 {
  $rtnl_mutex = (struct mutex *) kaddr("rtnl_mutex");
  $owner = (struct task_struct *) ($rtnl_mutex->owner.counter & ~0x07);
  if ($owner != 0) {
    printf("rtnl_mutex->owner = %u %s\n", $owner->pid, $owner->comm);
  }
}

In this script, the rtnl_mutex lock is a global lock whose address can be exposed via /proc/kallsyms – using bpftrace helper function kaddr(), we can access the struct mutex pointer from the kallsyms. Thus, we can periodically (via interval:s:10) check if someone is holding this lock.

In the output we had this:

rtnl_mutex->owner = 3895365 calico-node

This allowed us to quickly identify calico-node as the process holding the RTNL lock for too long. To quickly observe where this process itself is stalled, the call stack is available via /proc/3895365/stack. This showed us that the root cause was a Wireguard config change, with function wg_set_device() holding the RTNL lock, and peer_remove_after_dead() waiting too long for a napi_disable() call. We continued debugging via a tool called drgn, which is a programmable debugger that can debug a running kernel via a Python-like interactive shell. We still haven’t discovered the root cause for the Wireguard issue and have asked the upstream for help, but that is another story.

Summary: The hung task messages were the only ones which we had in the kernel log. Each stack trace of these messages was unique, but by carefully analyzing them, we could spot similarities and continue debugging with other instruments.

Epilogue

Your system might have different hung task warnings, and we have many others not mentioned here. Each case is unique, and there is no standard approach to debug them. But hopefully this blog post helps you better understand why it’s good to have these warnings enabled, how they work, and what the meaning is behind them. We tried to provide some navigation guidance for the debugging process as well:

analyzing the stack trace might be a good starting point for debugging it, even if all the messages look unrelated, like we saw in example #3
keep in mind that the alert might be misleading, pointing to the victim and not the offender, as we saw in example #2 and example #3
if the kernel doesn’t schedule your application on the CPU, puts it in the D state, and emits the warning – the real problem might exist in the application code

Good luck with your debugging, and hopefully this material will help you on this journey!

How to execute an object file: part 4, AArch64 edition

Oxana Kharitonova — Fri, 17 Nov 2023 14:00:35 GMT

Translating source code written in a high-level programming language into an executable binary typically involves a series of steps, namely compiling and assembling the code into object files, and then linking those object files into the final executable. However, there are certain scenarios where it can be useful to apply an alternate approach that involves executing object files directly, bypassing the linker. For example, we might use it for malware analysis or when part of the code requires an incompatible compiler. We’ll be focusing on the latter scenario: when one of our libraries needed to be compiled differently from the rest of the code. Learning how to execute an object file directly will give you a much better sense of how code is compiled and linked together.

To demonstrate how this was done, we have previously published a series of posts on executing an object file:

The initial posts are dedicated to the x86 architecture. Since then the fleet of our working machines has expanded to include a large and growing number of ARM CPUs. This time we’ll repeat this exercise for the aarch64 architecture. You can pause here to read the previous blog posts before proceeding with this one, or read through the brief summary below and reference the earlier posts for more detail. We might reiterate some theory as working with ELF files can be daunting, if it’s not your day-to-day routine. Also, please be mindful that for simplicity, these examples omit bounds and integrity checks. Let the journey begin!

Introduction

In order to obtain an object file or an executable binary from a high-level compiled programming language the code needs to be processed by three components: compiler, assembler and linker. The compiler generates an assembly listing. This assembly listing is picked up by the assembler and translated into an object file. All source files, if a program contains multiple, go through these two steps generating an object file for each source file. At the final step the linker unites all object files into one binary, additionally resolving references to the shared libraries (i.e. we don’t implement the printf function each time, rather we take it from a system library). Even though the approach is platform independent, the compiler output varies by platform as the assembly listing is closely tied to the CPU architecture.

GCC (GNU Compiler Collection) can run each step: compiler, assembler and linker separately for us:

main.c:

#include 

int main(void)
{
	puts("Hello, world!");
	return 0;
}

Compiler (output main.s - assembly listing):

$ gcc -S main.c
$ ls
main.c  main.s

Assembler (output main.o - an object file):

$ gcc -c main.s -o main.o
$ ls
main.c  main.o  main.s

Linker (main - an object file):

$ gcc main.o -o main
$ ls
main  main.c  main.o  main.s
$ ./main
Hello, world!

All the examples assume gcc is running on a native aarch64 architecture or include a cross compilation flag for those who want to reproduce and have no aarch64.

We have two object files in the output above: main.o and main. Object files are files encoded with the ELF (Executable and Linkable Format) standard. Although, main.o is an ELF file, it doesn’t contain all the information to be fully executable.

$ file main.o
main.o: ELF 64-bit LSB relocatable, ARM aarch64, version 1 (SYSV), not stripped

$ file main
main: ELF 64-bit LSB pie executable, ARM aarch64, version 1 (SYSV), dynamically
linked, interpreter /lib/ld-linux-aarch64.so.1,
BuildID[sha1]=d3ecd2f8ac3b2dec11ed4cc424f15b3e1f130dd4, for GNU/Linux 3.7.0, not stripped

The ELF File

The central idea of this series of blog posts is to understand how to resolve dependencies from object files without directly involving the linker. For illustrative purposes we generated an object file based on some C-code and used it as a library for our main program. Before switching to the code, we need to understand the basics of the ELF structure.

Each ELF file is made up of one ELF header, followed by file data. The data can include: a program header table, a section header table, and the data which is referred to by the program or section header tables.

The ELF Header

The ELF header provides some basic information about the file: what architecture the file is compiled for, the program entry point and the references to other tables.

The ELF Header:

$ readelf -h main
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Position-Independent Executable file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x640
  Start of program headers:          64 (bytes into file)
  Start of section headers:          68576 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         29
  Section header string table index: 28

The ELF Program Header

The execution process of almost every program starts from an auxiliary program, called loader, which arranges the memory and calls the program’s entry point. In the following output the loader is marked with a line “Requesting program interpreter: /lib/ld-linux-aarch64.so.1”. The whole program memory is split into different segments with associated size, permissions and type (which instructs the loader on how to interpret this block of memory). Because the execution process should be performed in the shortest possible time, the sections with the same characteristics and located nearby are grouped into bigger blocks — segments — and placed in the program header. We can say that the program header summarizes the types of data that appear in the section header.

The ELF Program Header:

$ readelf -Wl main

Elf file type is DYN (Position-Independent Executable file)
Entry point 0x640
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  PHDR           0x000040 0x0000000000000040 0x0000000000000040 0x0001f8 0x0001f8 R   0x8
  INTERP         0x000238 0x0000000000000238 0x0000000000000238 0x00001b 0x00001b R   0x1
      [Requesting program interpreter: /lib/ld-linux-aarch64.so.1]
  LOAD           0x000000 0x0000000000000000 0x0000000000000000 0x00088c 0x00088c R E 0x10000
  LOAD           0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000270 0x000278 RW  0x10000
  DYNAMIC        0x00fdd8 0x000000000001fdd8 0x000000000001fdd8 0x0001e0 0x0001e0 RW  0x8
  NOTE           0x000254 0x0000000000000254 0x0000000000000254 0x000044 0x000044 R   0x4
  GNU_EH_FRAME   0x0007a0 0x00000000000007a0 0x00000000000007a0 0x00003c 0x00003c R   0x4
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x00fdc8 0x000000000001fdc8 0x000000000001fdc8 0x000238 0x000238 R   0x1

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   03     .init_array .fini_array .dynamic .got .got.plt .data .bss 
   04     .dynamic 
   05     .note.gnu.build-id .note.ABI-tag 
   06     .eh_frame_hdr 
   07     
   08     .init_array .fini_array .dynamic .got

The ELF Section Header

In the source code of high-level languages, variables, functions, and constants are mixed together. However, in assembly you might see that the data and instructions are separated into different blocks. The ELF file content is divided in an even more granular way. For example, variables with initial values are placed into different sections than the uninitialized ones. This approach optimizes for space, otherwise the values for uninitialized variables would be filled with zeros. Along with the space efficiency, there are security reasons for stratification — executable instructions can’t have writable permissions, while memory containing variables can't be executable. The section header describes each of these sections.

The ELF Section Header:

$ readelf -SW main
There are 29 section headers, starting at offset 0x10be0:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        0000000000000238 000238 00001b 00   A  0   0  1
  [ 2] .note.gnu.build-id NOTE            0000000000000254 000254 000024 00   A  0   0  4
  [ 3] .note.ABI-tag     NOTE            0000000000000278 000278 000020 00   A  0   0  4
  [ 4] .gnu.hash         GNU_HASH        0000000000000298 000298 00001c 00   A  5   0  8
  [ 5] .dynsym           DYNSYM          00000000000002b8 0002b8 0000f0 18   A  6   3  8
  [ 6] .dynstr           STRTAB          00000000000003a8 0003a8 000092 00   A  0   0  1
  [ 7] .gnu.version      VERSYM          000000000000043a 00043a 000014 02   A  5   0  2
  [ 8] .gnu.version_r    VERNEED         0000000000000450 000450 000030 00   A  6   1  8
  [ 9] .rela.dyn         RELA            0000000000000480 000480 0000c0 18   A  5   0  8
  [10] .rela.plt         RELA            0000000000000540 000540 000078 18  AI  5  22  8
  [11] .init             PROGBITS        00000000000005b8 0005b8 000018 00  AX  0   0  4
  [12] .plt              PROGBITS        00000000000005d0 0005d0 000070 00  AX  0   0 16
  [13] .text             PROGBITS        0000000000000640 000640 000134 00  AX  0   0 64
  [14] .fini             PROGBITS        0000000000000774 000774 000014 00  AX  0   0  4
  [15] .rodata           PROGBITS        0000000000000788 000788 000016 00   A  0   0  8
  [16] .eh_frame_hdr     PROGBITS        00000000000007a0 0007a0 00003c 00   A  0   0  4
  [17] .eh_frame         PROGBITS        00000000000007e0 0007e0 0000ac 00   A  0   0  8
  [18] .init_array       INIT_ARRAY      000000000001fdc8 00fdc8 000008 08  WA  0   0  8
  [19] .fini_array       FINI_ARRAY      000000000001fdd0 00fdd0 000008 08  WA  0   0  8
  [20] .dynamic          DYNAMIC         000000000001fdd8 00fdd8 0001e0 10  WA  6   0  8
  [21] .got              PROGBITS        000000000001ffb8 00ffb8 000030 08  WA  0   0  8
  [22] .got.plt          PROGBITS        000000000001ffe8 00ffe8 000040 08  WA  0   0  8
  [23] .data             PROGBITS        0000000000020028 010028 000010 00  WA  0   0  8
  [24] .bss              NOBITS          0000000000020038 010038 000008 00  WA  0   0  1
  [25] .comment          PROGBITS        0000000000000000 010038 00001f 01  MS  0   0  1
  [26] .symtab           SYMTAB          0000000000000000 010058 000858 18     27  66  8
  [27] .strtab           STRTAB          0000000000000000 0108b0 00022c 00      0   0  1
  [28] .shstrtab         STRTAB          0000000000000000 010adc 000103 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  D (mbind), p (processor specific)

Executing example from Part 1 on aarch64

Actually, our initial code from Part 1 works on aarch64 as is!

Let’s have a quick summary about what was done in the code:

We need to find the code of two functions (add5 and add10) in the .text section of our object file (obj.o)
Load the functions in the executable memory
Return the memory locations of the functions to the main program

There is one nuance: even though all the sections are in the section header, neither of them have a string name. Without the names we can’t identify them. However, having an additional character field for each section in the ELF structure would be inefficient for the space — it must be limited by some maximum length and those names which are shorter would leave the space unfilled. Instead, ELF provides an additional section, .shstrtab. This string table concatenates all the names where each name ends with a null terminated byte. We can iterate over the names and match with an offset held by other sections to reference their name. But how do we find .shstrtab itself if we don’t have a name? To solve this chicken and egg problem, the ELF program header provides a direct pointer to .shstrtab. The similar approach is applied to two other sections: .symtab and .strtab. Where .symtab contains all information about the symbols and .strtab holds the list of symbol names. In the code we work with these tables to resolve all their dependencies and find our functions.

Executing example from Part 2 on aarch64

At the beginning of the second blog post on how to execute an object file we made the function add10 depend on add5 instead of being self-contained. This is the first time when we faced relocations. Relocations is the process of loading symbols defined outside the current scope. The relocated symbols can present global or thread-local variables, constant, functions, etc. We’ll start from checking assembly instructions which trigger relocations and uncovering how the ELF format handles them in a more general way.

After making add10 depend on add5 our aarch64 version stopped working as well, similarly to the x86. Let’s take a look at assembly listing:

$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 :
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	b9000fe0 	str	w0, [sp, #12]
   8:	b9400fe0 	ldr	w0, [sp, #12]
   c:	11001400 	add	w0, w0, #0x5
  10:	910043ff 	add	sp, sp, #0x10
  14:	d65f03c0 	ret

0000000000000018 :
  18:	a9be7bfd 	stp	x29, x30, [sp, #-32]!
  1c:	910003fd 	mov	x29, sp
  20:	b9001fe0 	str	w0, [sp, #28]
  24:	b9401fe0 	ldr	w0, [sp, #28]
  28:	94000000 	bl	0 
  2c:	b9001fe0 	str	w0, [sp, #28]
  30:	b9401fe0 	ldr	w0, [sp, #28]
  34:	94000000 	bl	0 
  38:	a8c27bfd 	ldp	x29, x30, [sp], #32
  3c:	d65f03c0 	ret

Have you noticed that all the hex values in the second column are exactly the same length, in contrast with the instructions lengths seen for x86 in Part 2 of our series? This is because all Armv8-A instructions are presented in 32 bits. Since it is impossible to encode every immediate value into less than 32 bits, some operations require more than one instruction, as we’ll see later. For now, we’re interested in one instruction - bl (branch with link) on rows 28 and 34. The bl is a “jump” instruction, but before the jump it preserves the next instruction after the current one in the link register (lr). When the callee finishes execution the caller address is recovered from lr. Usually, the aarch64 instructions reserve the last 6 bits [31:26] for opcode and some auxiliary fields such as running architecture (32 or 64 bits), condition flag and others. Remaining bits are shared between arguments like source register, destination register and immediate value. Since the bl instruction does not require a source or destination register, the full 26 bits can be used to encode the immediate offset instead. However, 26 bits can only encode a small range (+/-32 MB), but because the jump can only target a beginning of an instruction, it must always be aligned to 4 bytes, which increases the effective range of the encoded immediate fourfold, to +/-128 MB.

Similarly to what we did in Part 2 we’re going to resolve our relocations - first by manually calculating the correct addresses and then by using an approach similar to what the linker does. The current value of our bl instruction is 94000000 or in binary representation 100101**00000000000000000000000000**. All 26 bits are zeros, so we don't jump anywhere. The address is calculated by an offset from the current program counter (pc), which can be positive or negative. In our case we expect it to be -0x28 and -0x34. As described above, it should be divided by 4 and taken as two's complements: -0x28 / 4 = -0xA == 0xFFFFFFF6 and -0x34 / 4 = -0xD == 0xFFFFFFF3. From these values we need to take the lower 26 bits and concatenate them with the initial 6 bits to get the final instruction. So, the final ones will be: 100101**11111111111111111111110110** == 0x97FFFFF6 and 100101**11111111111111111111110011** == 0x97FFFFF3. Have you noticed that all the distance calculations are done relative to the bl (or current pc), not the next instruction as in x86?

Let’s add to the code and execute:

... 

static void parse_obj(void)
{
	...
	/* copy the contents of `.text` section from the ELF file */
	memcpy(text_runtime_base, obj.base + text_hdr->sh_offset, text_hdr->sh_size);

	*((uint32_t *)(text_runtime_base + 0x28)) = 0x97FFFFF6;
	*((uint32_t *)(text_runtime_base + 0x34)) = 0x97FFFFF3;

	/* make the `.text` copy readonly and executable */
	if (mprotect(text_runtime_base, page_align(text_hdr->sh_size), PROT_READ | PROT_EXEC)) {
	...

Compile and run:

$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52

It works! But this is not how the linker handles the relocations. The linker resolves relocation based on the type and formula assigned to this type. In our Part 2 we investigated it quite well. Here again we need to find the type and check the formula for this type:

$ readelf --relocs obj.o

Relocation section '.rela.text' at offset 0x228 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000028  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0
000000000034  000a0000011b R_AARCH64_CALL26  0000000000000000 add5 + 0

Relocation section '.rela.eh_frame' at offset 0x258 contains 2 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000001c  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 0
000000000034  000200000105 R_AARCH64_PREL32  0000000000000000 .text + 18

Our Type is R_AARCH64_CALL26 and the formula for it is:

ELF64 Code	Name	Operation
283	R__CALL26	S + A - P

where:

S (when used on its own) is the address of the symbol
A is the addend for the relocation
P is the address of the place being relocated (derived from r_offset)

Here are the relevant changes to loader.c:

/* Replace `#define R_X86_64_PLT32 4` with our Type */
#define R_AARCH64_CALL26 283
...

static void do_text_relocations(void)
{
	...
	uint32_t val;

	switch (type)
	{
	case R_AARCH64_CALL26:
		/* The mask separates opcode (6 bits) and the immediate value */
		uint32_t mask_bl = (0xffffffff << 26);
		/* S+A-P, divided by 4 */
		val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
		/* Concatenate opcode and value to get final instruction */
		*((uint32_t *)patch_offset) &= mask_bl;
		val &= ~mask_bl;
		*((uint32_t *)patch_offset) |= val;
		break;
	}
	...
}

Compile and run:

$ gcc -o loader loader.c 
$ ./loader
Calculated relocation: 0x97fffff6
Calculated relocation: 0x97fffff3
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52

So far so good. The next challenge is to add constant data and global variables to our object file and check relocations again:

$ readelf --relocs --wide obj.o

Relocation section '.rela.text' at offset 0x388 contains 8 entries:
    Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
0000000000000000  0000000500000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .rodata + 0
0000000000000004  0000000500000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .rodata + 0
000000000000000c  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000010  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000024  0000000300000113 R_AARCH64_ADR_PREL_PG_HI21 0000000000000000 .data + 0
0000000000000028  0000000300000115 R_AARCH64_ADD_ABS_LO12_NC 0000000000000000 .data + 0
0000000000000068  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
0000000000000074  000000110000011b R_AARCH64_CALL26       0000000000000040 add5 + 0
...

We have even two new relocations: R_AARCH64_ADD_ABS_LO12_NC and R_AARCH64_ADR_PREL_PG_HI21. Their formulas are:

ELF64 Code	Name	Operation
275	R__ ADR_PREL_PG_HI21	Page(S+A) - Page(P)
277	R__ ADD_ABS_LO12_NC	S + A

where:

Page(expr) is the page address of the expression expr, defined as (expr & ~0xFFF). (This applies even if the machine page size supported by the platform has a different value.)

It’s a bit unclear why we have two new types, while in x86 we had only one. Let’s investigate the assembly code:

$ objdump --disassemble --section=.text obj.o

obj.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 :
   0:	90000000 	adrp	x0, 0 
   4:	91000000 	add	x0, x0, #0x0
   8:	d65f03c0 	ret

000000000000000c :
   c:	90000000 	adrp	x0, 0 
  10:	91000000 	add	x0, x0, #0x0
  14:	b9400000 	ldr	w0, [x0]
  18:	d65f03c0 	ret

000000000000001c :
  1c:	d10043ff 	sub	sp, sp, #0x10
  20:	b9000fe0 	str	w0, [sp, #12]
  24:	90000000 	adrp	x0, 0 
  28:	91000000 	add	x0, x0, #0x0
  2c:	b9400fe1 	ldr	w1, [sp, #12]
  30:	b9000001 	str	w1, [x0]
  34:	d503201f 	nop
  38:	910043ff 	add	sp, sp, #0x10
  3c:	d65f03c0 	ret

We see that all adrp instructions are followed by add instructions. The [add](https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADD--immediate---Add--immediate--?lang=en) instruction adds an immediate value to the source register and writes the result to the destination register. The source and destination registers can be the same, the immediate value is 12 bits. The [adrp](https://developer.arm.com/documentation/ddi0596/2021-12/Base-Instructions/ADRP--Form-PC-relative-address-to-4KB-page-?lang=en) instruction generates a pc-relative (program counter) address and writes the result to the destination register. It takes pc of the instruction itself and adds a 21-bit immediate value shifted left by 12 bits. If the immediate value weren’t shifted it would lie in a range of +/-1 MB memory, which isn’t enough. The left shift increases the range up to +/-1 GB. However, because the 12 bits are masked out with the shift, we need to store them somewhere and restore later. That’s why we see add instruction following adrp and two types instead of one. Also, it’s a bit tricky to encode adrp: 2 low bits of immediate value are placed in the position 30:29 and the rest in the position 23:5. Due to size limitations, the aarch64 instructions try to make the most out of 32 bits.

In the code we are going to use the formulas to calculate the values and description of adrp and add instructions to obtain the final opcode:

#define R_AARCH64_CALL26 283
#define R_AARCH64_ADD_ABS_LO12_NC 277
#define R_AARCH64_ADR_PREL_PG_HI21 275
...

{
case R_AARCH64_CALL26:
	/* The mask separates opcode (6 bits) and the immediate value */
	uint32_t mask_bl = (0xffffffff << 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
	/* Concatenate opcode and value to get final instruction */
	*((uint32_t *)patch_offset) &= mask_bl;
	val &= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
case R_AARCH64_ADD_ABS_LO12_NC:
	/* The mask of `add` instruction to separate 
	* opcode, registers and calculated value 
	*/
	uint32_t mask_add = 0b11111111110000000000001111111111;
	/* S + A */
	uint32_t val = *(symbol_address + relocations[i].r_addend);
	val &= ~mask_add;
	*((uint32_t *)patch_offset) &= mask_add;
	/* Final instruction */
	*((uint32_t *)patch_offset) |= val;
case R_AARCH64_ADR_PREL_PG_HI21:
	/* Page(S+A)-Page(P), Page(expr) is defined as (expr & ~0xFFF) */
	val = (((uint64_t)(symbol_address + relocations[i].r_addend)) & ~0xFFF) - (((uint64_t)patch_offset) & ~0xFFF);
	/* Shift right the calculated value by 12 bits.
	 * During decoding it will be shifted left as described above, 
	 * so we do the opposite.
	*/
	val >>= 12;
	/* Separate the lower and upper bits to place them in different positions */ 
	uint32_t immlo = (val & (0xf >> 2)) << 29 ;
	uint32_t immhi = (val & ((0xffffff >> 13) << 2)) << 22;
	*((uint32_t *)patch_offset) |= immlo;
	*((uint32_t *)patch_offset) |= immhi;
	break;
}

Compile and run:

$ gcc -o loader loader.c 
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42

It works! The final code is here.

Executing example from Part 3 on aarch64

Our Part 3 is about resolving external dependencies. When we write code we don’t think much about how to allocate memory or print debug information to the console. Instead, we involve functions from the system libraries. But the code of system libraries needs to be passed through to our programs somehow. Additionally, for optimization purposes, it would be nice if this code would be stored in one place and shared between all programs. And another wish — we don’t want to resolve all the functions and global variables from the libraries, only those which we need and at those times when we need them. To solve these problems, ELF introduced two sections: PLT (the procedure linkage table) and GOT (the global offset table). The dynamic loader creates a list which contains all external functions and variables from the shared library, but doesn’t resolve them immediately; instead they are placed in the PLT section. Each external symbol is presented by a small function, a stub, e.g. puts@plt. When an external symbol is requested, the stub checks if it was resolved previously. If not, the stub searches for an absolute address of the symbol, returns to the requester and writes it in the GOT table. The next time, the address returns directly from the GOT table.

In Part 3 we implemented a simplified PLT/GOT resolution. Firstly, we added a new function say_hello in the obj.c, which calls unresolved system library function puts. Further we added an optional wrapper my_puts in the loader.c. The wrapper isn’t required, we could’ve resolved directly to a standard function, but it's a good example of how the implementation of some functions can be overwritten with custom code. In the next steps we added our PLT/GOT resolution:

PLT section we replaced with a jumptable
GOT we replaced with assembly instructions

Basically, we created a small stub with assembly code (our jumptable) to resolve the global address of our my_puts wrapper and jump to it.

The approach for aarch64 is the same. But the jumptable is very different as it consists of different assembly instructions.

The big difference here compared to the other parts is that we need to work with a 64-bit address for the GOT resolution. Our custom PLT or jumptable is placed close to the main code of obj.c and can operate with the relative addresses as before. For the GOT or referencing my_puts wrapper we’ll use different branch instructions — br or blr. These instructions branch to the register, where the aarch64 registers can hold 64-bit values.

We can check how it resolves with the native PLT/GOT in our loader assembly code:

$ objdump --disassemble --section=.text loader
...
1d2c:	97fffb45 	bl	a40 
1d30:	f94017e0 	ldr	x0, [sp, #40]
1d34:	d63f0000 	blr	x0
...

The first instruction is bl jump to puts@plt stub. The next ldr instruction tells us that some value was loaded into the register x0 from the stack. Each function has its own stack frame to hold the local variables. The last blr instruction makes a jump to the address stored in x0 register. There is an agreement in the register naming: if the stored value is 64-bits then the register is called x0-x30; if only 32-bits are used then it’s called w0-w30 (the value will be stored in the lower 32-bits and upper 32-bits will be zeroed).

We need to do something similar — place the absolute address of our my_puts wrapper in some register and call br on this register. We don’t need to store the link before branching, the call will be returned to say_hello from obj.c, which is why a plain br will be enough. Let’s check an assembly of simple C-function:

hello.c:

#include 

void say_hello(void)
{
    uint64_t reg = 0x555555550c14;
}

$ gcc -c hello.c
$ objdump --disassemble --section=.text hello.o

hello.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 :
   0:	d10043ff 	sub	sp, sp, #0x10
   4:	d2818280 	mov	x0, #0xc14                 	// #3092
   8:	f2aaaaa0 	movk	x0, #0x5555, lsl #16
   c:	f2caaaa0 	movk	x0, #0x5555, lsl #32
  10:	f90007e0 	str	x0, [sp, #8]
  14:	d503201f 	nop
  18:	910043ff 	add	sp, sp, #0x10
  1c:	d65f03c0 	ret

The number 0x555555550c14 is the address returned by lookup_ext_function. We’ve printed it out to use as an example, but any 48-bits hex value can be used.

In our output we see that the value was split in three sections and written in x0 register with three instructions: one mov and two movk. The documentation says that there are only 16 bits for the immediate value, but a shift can be applied (in our case left shift lsl).

However, we can’t use x0 in our context. By convention the registers x0-x7 are caller-saved and used to pass function parameters between calls to other functions. Let’s use x9 then.

We need to modify our loader. Firstly let’s change the jumptable structure.

loader.c:

...
struct ext_jump {
	uint32_t instr[4];
};
...

As we saw above, we need four instructions: mov, movk, movk, br. We don’t need a stack frame as we aren’t preserving any local variables. We just want to load the address into the register and branch to it. But we can’t write human-readable code

e.g. mov x0, #0xc14 into instructions, we need machine binary or hex representation, e.g. d2818280.

Let’s write a simple assembly code to get it:

hw.s:

.global _start

_start: mov     x9, #0xc14 
        movk    x9, #0x5555, lsl #16
        movk    x9, #0x5555, lsl #32
        br      x9

$ as -o hw.o hw.s
$ objdump --disassemble --section=.text hw.o

hw.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <_start>:
   0:	d2818289 	mov	x9, #0xc14                 	// #3092
   4:	f2aaaaa9 	movk	x9, #0x5555, lsl #16
   8:	f2caaaa9 	movk	x9, #0x5555, lsl #32
   c:	d61f0120 	br	x9

Almost done! But there’s one more thing to consider. Even if the value 0x555555550c14 is a real my_puts wrapper address, it will be different on each run if ASLR(Address space layout randomization) is enabled. We need to patch these instructions to put the value which will be returned by lookup_ext_function on each run. We’ll split the obtained value in three parts, 16-bits each, and replace them in our mov and movk instructions according to the documentation, similar to what we did before for our second part.

if (symbols[symbol_idx].st_shndx == SHN_UNDEF) {
	static int curr_jmp_idx = 0;

	uint64_t addr = lookup_ext_function(strtab +  symbols[symbol_idx].st_name);
	uint32_t mov = 0b11010010100000000000000000001001 | ((addr << 48) >> 43);
	uint32_t movk1 = 0b11110010101000000000000000001001 | (((addr >> 16) << 48) >> 43);
	uint32_t movk2 = 0b11110010110000000000000000001001 | (((addr >> 32) << 48) >> 43);
	jumptable[curr_jmp_idx].instr[0] = mov;         // mov  x9, #0x0c14
	jumptable[curr_jmp_idx].instr[1] = movk1;       // movk x9, #0x5555, lsl #16
	jumptable[curr_jmp_idx].instr[2] = movk2;       // movk x9, #0x5555, lsl #32
	jumptable[curr_jmp_idx].instr[3] = 0xd61f0120;  // br   x9

	symbol_address = (uint8_t *)(&jumptable[curr_jmp_idx].instr[0]);
	curr_jmp_idx++;
} else {
	symbol_address = section_runtime_base(§ions[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
}
uint32_t val;
switch (type)
{
case R_AARCH64_CALL26:
	/* The mask separates opcode (6 bits) and the immediate value */
	uint32_t mask_bl = (0xffffffff << 26);
	/* S+A-P, divided by 4 */
	val = (symbol_address + relocations[i].r_addend - patch_offset) >> 2;
	/* Concatenate opcode and value to get final instruction */
	*((uint32_t *)patch_offset) &= mask_bl;
	val &= ~mask_bl;
	*((uint32_t *)patch_offset) |= val;
	break;
...

In the code we took the address of the first instruction &jumptable[curr_jmp_idx].instr[0] and wrote it in the symbol_address, further because the type is still R_AARCH64_CALL26 it will be put into bl - jump to the relative address. Where our relative address is the first mov instruction. The whole jumptable code will be executed and finished with the blr instruction.

The final run:

$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
Executing say_hello...
my_puts executed
Hello, world!

The final code is here.

Summary

There are several things we covered in this blog post. We gave a brief introduction on how the binary got executed on Linux and how all components are linked together. We saw a big difference between x86 and aarch64 assembly. We learned how we can hook into the code and change its behavior. But just as it was said in the first blog post of this series, the most important thing is to remember to always think about security first. Processing external inputs should always be done with great care. Bounds and integrity checks have been omitted for the purposes of keeping the examples short, so readers should be aware that the code is not production ready and is designed for educational purposes only.

The Linux Crypto API for user applications

Oxana Kharitonova — Thu, 11 May 2023 13:00:58 GMT

In this post we will explore Linux Crypto API for user applications and try to understand its pros and cons.

The Linux Kernel Crypto API was introduced in October 2002. It was initially designed to satisfy internal needs, mostly for IPsec. However, in addition to the kernel itself, user space applications can benefit from it.

If we apply the basic definition of an API to our case, we will have the kernel on one side and our application on the other. The application needs to send data, i.e. plaintext or ciphertext, and get encrypted/decrypted text in response from the kernel. To communicate with the kernel we need to make a system call. Also, before starting the data exchange, we need to agree on some cryptographic parameters, at least the selected crypto algorithm and key length. These constraints, along with all supported algorithms, can be found in the /proc/crypto virtual file.

Below is a short excerpt from my /proc/crypto looking at ctr(aes). In the examples, we will use the AES cipher in CTR mode, further we will give more details about the algorithm itself.

name         : ctr(aes)
driver       : ctr(aes-generic)
module       : ctr
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16


name         : ctr(aes)
driver       : ctr(aes-aesni)
module       : ctr
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16


name         : ctr(aes)
driver       : ctr-aes-aesni
module       : aesni_intel
priority     : 400
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : yes
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16

In the output above, there are three config blocks. The kernel may provide several implementations of the same algorithm depending on the CPU architecture, available hardware, presence of crypto accelerators etc.

We can pick the implementation based on the algorithm name or the driver name. The algorithm name is not unique, but the driver name is. If we use the algorithm name, the driver with the highest priority will be chosen for us, which in theory should provide the best cryptographic performance in this context. Let’s see the performance of different implementations of AES-CTR encryption. I use the libkcapi library: it’s a lightweight wrapper for the kernel crypto API which also provides built-in speed tests. We will examine these tests.

$ kcapi-speed -c "AES(G) CTR(G) 128" -b 1024 -t 10
AES(G) CTR(G) 128   	|d|	1024 bytes|          	149.80 MB/s|153361 ops/s
AES(G) CTR(G) 128   	|e|	1024 bytes|          	159.76 MB/s|163567 ops/s
 
$ kcapi-speed -c "AES(AESNI) CTR(ASM) 128" -b 1024 -t 10
AES(AESNI) CTR(ASM) 128 |d|	1024 bytes|          	343.10 MB/s|351332 ops/s
AES(AESNI) CTR(ASM) 128 |e|	1024 bytes|         	310.100 MB/s|318425 ops/s
 
$ kcapi-speed -c "AES(AESNI) CTR(G) 128" -b 1024 -t 10
AES(AESNI) CTR(G) 128   |d|	1024 bytes|          	155.37 MB/s|159088 ops/s
AES(AESNI) CTR(G) 128   |e|	1024 bytes|          	172.94 MB/s|177054 ops/s

Here and later ignore the absolute numbers, as they depend on the environment where the tests were running. Rather look at the relationship between the numbers.

The x86 AES instructions showed the best results, twice as fast vs the generic portable C implementation. As expected, this implementation has the highest priority in the /proc/crypto. We will use only this one later.

This brief introduction can be rephrased as: “I can ask the kernel to encrypt or decrypt data from my application”. But, why do I need it?

Why do I need it?

In our previous blog post Linux Kernel Key Retention Service we talked a lot about cryptographic key protection. We concluded that the best Linux option is to store cryptographic keys in the kernel space and restrict the access to a limited number of applications. However, if all our cryptography is processed in user space, potentially damaging code still has access to the raw key material. We have to think wisely about using the key: what part of the code has access to it, don’t log it accidentally, how the open-source libraries manage it and if the memory is purged after using it. We may need to support a dedicated process to not have a key in network-facing code. Thus, many things need to be done for security, and for each application which works with cryptography. And even after all these precautionary measures, the best of the best are subject to bugs and vulnerabilities. OpenSSL, the most known and widely used cryptographic library in user space, has had a few problems in its security.

Can we move all the cryptography to the kernel and help solve these problems? Looks like it! Our recent patch to upstream extended the key types which can be used in symmetric encryption in the Crypto API directly from the Linux Kernel Key Retention Service.

But nothing is free. There will be some overhead for the system calls and data copying between user and kernel spaces. So, the next question is how fast it is.

Is it fast?

To answer this question we need to have some baseline to compare with. OpenSSL would be the best as it’s used all around the Internet. OpenSSL provides a good composite of toolkits, including C-functions, a console utility and various speed tests. For the sake of equality, we will ignore the built-in tests and write our own tests using OpenSSL C-functions. We want the same data to be processed and the same logic parts to be measured in both cases (Kernel versus OpenSSL).

So, the task: write a benchmark for AES-CTR-128 encrypting data split in chunks. Make implementations for the Kernel Crypto API and OpenSSL.

About AES-CTR-128

AES stands for Advanced Encryption Standard. It is a block cipher algorithm, which means the whole plaintext is split into blocks and two operations are applied: substitution and permutation. There are two parameters characterizing a block cipher: the block size and the key size. AES processes a block of 128 bits using a key of either 128, 192 or 256 bits. Each 128 bits or 16 bytes block is presented as a 4x4 two-dimensional array (matrix), where one element of the matrix presents one byte of the plaintext. To change the plaintext to ciphertext several rounds of transformation are applied: the bits of the block XORs with a key derived from the main key and substitution with permutation are applied to rows and columns of the matrix. There can be 10, 12 or 14 rounds depending on the key size (the key size determines how many keys can be derived from it).

AES is a secure cipher, but there is one nuance - the same plaintext/block of text will produce the same result. Look at Linux’s mascot Tux. To avoid this, a mode of operation (or just mode) has to be applied. It determines how the text changes, so the same input doesn't result in the same output. Tux was encrypted using ECB mode, there is no text transformation at all. Another mode example is CBC, where the ciphertext from the previously encrypted block is added to the next block, for the first block an initial value (IV) is added. This mode guarantees that for the same input and different IV the output will be different. However, this mode is slow as each block depends on the previous one and so encryption can’t be parallelized. CTR is a counter mode, instead of using previously encrypted blocks it uses a counter and a nonce. A counter is an integer which is incremented for each block. A nonce is just a random number similar to the IV. The nonce, and IV, should be different for each message and can be transferred openly with the encrypted text. So, the title AES-CTR-128 means AES used in CTR mode with the key size of 128 bits.

Implementing AES-CTR-128 with the Kernel Crypto API

The kernel and user spaces are isolated for security reasons and each time data needs to be transferred between them, it’s copied. In our case, it would add a significant overhead - copying a big bunch of plain or encrypted text to the kernel and back. However, the crypto API supports a zero-copy interface. Instead of transferring the actual data, a file descriptor is passed. But it has a limitation - the maximum size is only 16 pages. So for our tests we picked the number closest to the maximum limit - 63KB (16 pages of 4KB minus 1KB to avoid any potential edge cases).

The code below is the exact implementation of what is written in the kernel documentation. Firstly we created a socket of AF_ALG type. The salg_type and salg_name parameters can be taken from the /proc/crypto file. Instead of a generic name we used the driver name ctr-aes-aesni. We might put just a name ctr(aes) and the driver with the highest priority (ctr-aes-aesni in our context) will be picked for us by the Kernel. Further we put the key length and accepted the socket. The IV size is provided before the payload as ancillary data. Constraints of the key and IV sizes can be found in /proc/crypto too.

Now we are ready to start communication. We excluded all pre-set up steps from the measurements. In a loop we send plaintext for encryption with the flag SPLICE_F_MORE to inform the kernel that more data will be provided. And here in the loop we read the cipher text from the kernel. The last plaintext should be sent without the flag thus saying that we are done, and the kernel can finalize the encryption.

In favor of brevity, error handling is omitted in both examples.

kernel.c

#define _GNU_SOURCE

#include 
#include 
#include 

#include 
#include 
#include 
#include 
#include 
#include 

#define PT_LEN (63 * 1024)
#define CT_LEN PT_LEN
#define IV_LEN 16
#define KEY_LEN 16
#define ITER_COUNT 100000

static uint8_t pt[PT_LEN];
static uint8_t ct[CT_LEN];
static uint8_t key[KEY_LEN];
static uint8_t iv[IV_LEN];

static void time_diff(struct timespec *res, const struct timespec *start, const struct timespec *end)
{
    res->tv_sec = end->tv_sec - start->tv_sec;
    res->tv_nsec = end->tv_nsec - start->tv_nsec;
    if (res->tv_nsec < 0) {
        res->tv_sec--;
        res->tv_nsec += 1000000000;
    }
}

int main(void)
{
    // Fill the test data
    getrandom(key, sizeof(key), GRND_NONBLOCK);
    getrandom(iv, sizeof(iv), GRND_NONBLOCK);
    getrandom(pt, sizeof(pt), GRND_NONBLOCK);

    // Set up AF_ALG socket
    int alg_s, aes_ctr;
    struct sockaddr_alg sa = { .salg_family = AF_ALG };
    strcpy(sa.salg_type, "skcipher");
    strcpy(sa.salg_name, "ctr-aes-aesni");

    alg_s = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(alg_s, (const struct sockaddr *)&sa, sizeof(sa));
    setsockopt(alg_s, SOL_ALG, ALG_SET_KEY, key, KEY_LEN);
    aes_ctr = accept(alg_s, NULL, NULL);
    close(alg_s);

    // Set up IV
    uint8_t cmsg_buf[CMSG_SPACE(sizeof(uint32_t)) + CMSG_SPACE(sizeof(struct af_alg_iv) + IV_LEN)] = {0};
    struct msghdr msg = {
	.msg_control = cmsg_buf,
	.msg_controllen = sizeof(cmsg_buf)
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t));
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_OP;
    *((uint32_t *)CMSG_DATA(cmsg)) = ALG_OP_ENCRYPT;
    
    cmsg = CMSG_NXTHDR(&msg, cmsg);
    cmsg->cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv) + IV_LEN);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_IV;
    ((struct af_alg_iv *)CMSG_DATA(cmsg))->ivlen = IV_LEN;
    memcpy(((struct af_alg_iv *)CMSG_DATA(cmsg))->iv, iv, IV_LEN);
    sendmsg(aes_ctr, &msg, 0);

    // Set up pipes for using zero-copying interface
    int pipes[2];
    pipe(pipes);

    struct iovec pt_iov = {
        .iov_base = pt,
        .iov_len = sizeof(pt)
    };

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    int i;
    for (i = 0; i < ITER_COUNT; i++) {
        vmsplice(pipes[1], &pt_iov, 1, SPLICE_F_GIFT);
        // SPLICE_F_MORE means more data will be coming
        splice(pipes[0], NULL, aes_ctr, NULL, sizeof(pt), SPLICE_F_MORE);
        read(aes_ctr, ct, sizeof(ct));
    }
    vmsplice(pipes[1], &pt_iov, 1, SPLICE_F_GIFT);
    // A final call without SPLICE_F_MORE
    splice(pipes[0], NULL, aes_ctr, NULL, sizeof(pt), 0);
    read(aes_ctr, ct, sizeof(ct));
    
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(pipes[0]);
    close(pipes[1]);
    close(aes_ctr);

    struct timespec diff;
    time_diff(&diff, &start, &end);
    double tput_krn = ((double)ITER_COUNT * PT_LEN) / (diff.tv_sec + (diff.tv_nsec * 0.000000001 ));
    printf("Kernel: %.02f Mb/s\n", tput_krn / (1024 * 1024));
    
    return 0;
}

Compile and run:

$ gcc -o kernel kernel.c
$ ./kernel
Kernel: 2112.49 Mb/s

Implementing AES-CTR-128 with OpenSSL

With OpenSSL everything is straight forward, we just repeated an example from the official documentation.

openssl.c

#include 
#include 
#include 

#define PT_LEN (63 * 1024)
#define CT_LEN PT_LEN
#define IV_LEN 16
#define KEY_LEN 16
#define ITER_COUNT 100000

static uint8_t pt[PT_LEN];
static uint8_t ct[CT_LEN];
static uint8_t key[KEY_LEN];
static uint8_t iv[IV_LEN];

static void time_diff(struct timespec *res, const struct timespec *start, const struct timespec *end)
{
    res->tv_sec = end->tv_sec - start->tv_sec;
    res->tv_nsec = end->tv_nsec - start->tv_nsec;
    if (res->tv_nsec < 0) {
        res->tv_sec--;
        res->tv_nsec += 1000000000;
    }
}

int main(void)
{
    // Fill the test data
    getrandom(key, sizeof(key), GRND_NONBLOCK);
    getrandom(iv, sizeof(iv), GRND_NONBLOCK);
    getrandom(pt, sizeof(pt), GRND_NONBLOCK);

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv);

    int outl = sizeof(ct);
    
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    int i;
    for (i = 0; i < ITER_COUNT; i++) {
        EVP_EncryptUpdate(ctx, ct, &outl, pt, sizeof(pt));
    }
    uint8_t *ct_final = ct + outl;
    outl = sizeof(ct) - outl;
    EVP_EncryptFinal_ex(ctx, ct_final, &outl);

    clock_gettime(CLOCK_MONOTONIC, &end);

    EVP_CIPHER_CTX_free(ctx);

    struct timespec diff;
    time_diff(&diff, &start, &end);
    double tput_ossl = ((double)ITER_COUNT * PT_LEN) / (diff.tv_sec + (diff.tv_nsec * 0.000000001 ));
    printf("OpenSSL: %.02f Mb/s\n", tput_ossl / (1024 * 1024));

    return 0;
}

Compile and run:

$ gcc -o openssl openssl.c -lcrypto
$ ./openssl
OpenSSL: 3758.60 Mb/s

Results of OpenSSL vs Crypto API

OpenSSL: 3758.60 Mb/s
Kernel: 2112.49 Mb/s

Don’t pay attention to the absolute values, look at the relationship.

The numbers look pessimistic. But why? Can't the kernel implement AES-CTR similar to OpenSSL? We used bpftrace to understand this better. The encryption function is called on the read() system call. Trying to be as close to the encryption code as possible, we put a probe on the ctr_crypt function instead of the whole read call.

$ sudo bpftrace -e 'kprobe:ctr_crypt { @start=nsecs; @count+=1; } kretprobe:ctr_crypt /@start!=0/ { @total+=nsecs-@start; }'

We took the same plaintext, encrypted it in chunks of 63KB and measured how much time it took for both cases to encrypt it with bpftrace attached to the kernel:

OpenSSL: 1 sec 650532178 nsec
Kernel: 3 sec 120442931 nsec // 3120442931 ns
OpenSSL: 3727.49 Mb/s
Kernel: 1971.63 Mb/s

@total: 2031169756     //  2031169756 / 3120442931 = 0.6509235390339526

The @total number is output from bpftrace, which tells us how much time the kernel spent in the encryption function. To compare plain kernel encryption vs OpenSSL we need to say how many Mb/s kernel would have done if only encryption had been involved (excluding all system calls and data copy/ manipulation). We need to apply some math:

The correlation between the total time and the time which the kernel spent in the encryption is 2031169756 / 3120442931 = 0.6509235390339526 or 65%.
So throughput would be 1971.63 / 0.650923539033952 - 3028.97 Mb/s. Comparing this to OpenSSL Mb/s we get 3028.97 / 3727.49, so around 81%.

It would be fair to say that bpftrace adds some overhead and our numbers for the kernel are less than they could be. So, we can safely say that while the Kernel Crypto API is two times slower than OpenSSL, the crypto part itself is almost equal.

Conclusion

In this post we reviewed the Linux Kernel Crypto API and its user space interface. We reiterated some security benefits of doing encryption through the Kernel vs using some sort of cryptographic library. We also measured the performance overhead of doing data encryption/decryption through the Kernel Crypto API, confirmed that in-kernel crypto is likely as good as in OpenSSL, but a better user space interface is needed to make Kernel Crypto API as fast as using a cryptographic library. Using Crypto API is a subjective decision depending on your circumstances, it’s a trade-off in speed vs. security.

The Linux Kernel Key Retention Service and why you should use it in your next application

Oxana Kharitonova — Mon, 28 Nov 2022 14:57:20 GMT

We want our digital data to be safe. We want to visit websites, send bank details, type passwords, sign documents online, login into remote computers, encrypt data before storing it in databases and be sure that nobody can tamper with it. Cryptography can provide a high degree of data security, but we need to protect cryptographic keys.

At the same time, we can’t have our key written somewhere securely and just access it occasionally. Quite the opposite, it’s involved in every request where we do crypto-operations. If a site supports TLS, then the private key is used to establish each connection.

Unfortunately cryptographic keys sometimes leak and when it happens, it is a big problem. Many leaks happen because of software bugs and security vulnerabilities. In this post we will learn how the Linux kernel can help protect cryptographic keys from a whole class of potential security vulnerabilities: memory access violations.

Memory access violations

According to the NSA, around 70% of vulnerabilities in both Microsoft's and Google's code were related to memory safety issues. One of the consequences of incorrect memory accesses is leaking security data (including cryptographic keys). Cryptographic keys are just some (mostly random) data stored in memory, so they may be subject to memory leaks like any other in-memory data. The below example shows how a cryptographic key may accidentally leak via stack memory reuse:

broken.c

#include 
#include 

static void encrypt(void)
{
    uint8_t key[] = "hunter2";
    printf("encrypting with super secret key: %s\n", key);
}

static void log_completion(void)
{
    /* oh no, we forgot to init the msg */
    char msg[8];
    printf("not important, just fyi: %s\n", msg);
}

int main(void)
{
    encrypt();
    /* notify that we're done */
    log_completion();
    return 0;
}

Compile and run our program:

$ gcc -o broken broken.c
$ ./broken 
encrypting with super secret key: hunter2
not important, just fyi: hunter2

Oops, we printed the secret key in the “fyi” logger instead of the intended log message! There are two problems with the code above:

we didn’t securely destroy the key in our pseudo-encryption function (by overwriting the key data with zeroes, for example), when we finished using it
our buggy logging function has access to any memory within our process

And while we can probably easily fix the first problem with some additional code, the second problem is the inherent result of how software runs inside the operating system.

Each process is given a block of contiguous virtual memory by the operating system. It allows the kernel to share limited computer resources among several simultaneously running processes. This approach is called virtual memory management. Inside the virtual memory a process has its own address space and doesn’t have access to the memory of other processes, but it can access any memory within its address space. In our example we are interested in a piece of process memory called the stack.

The stack consists of stack frames. A stack frame is dynamically allocated space for the currently running function. It contains the function’s local variables, arguments and return address. When compiling a function the compiler calculates how much memory needs to be allocated and requests a stack frame of this size. Once a function finishes execution the stack frame is marked as free and can be used again. A stack frame is a logical block, it doesn’t provide any boundary checks, it’s not erased, just marked as free. Additionally, the virtual memory is a contiguous block of addresses. Both of these statements give the possibility for malware/buggy code to access data from anywhere within virtual memory.

The stack of our program broken.c will look like:

At the beginning we have a stack frame of the main function. Further, the main() function calls encrypt() which will be placed on the stack immediately below the main() (the code stack grows downwards). Inside encrypt() the compiler requests 8 bytes for the key variable (7 bytes of data + C-null character). When encrypt() finishes execution, the same memory addresses are taken by log_completion(). Inside the log_completion() the compiler allocates eight bytes for the msg variable. Accidentally, it was put on the stack at the same place where our private key was stored before. The memory for msg was only allocated, but not initialized, the data from the previous function left as is.

Additionally, to the code bugs, programming languages provide unsafe functions known for the safe-memory vulnerabilities. For example, for C such functions are printf(), strcpy(), gets(). The function printf() doesn’t check how many arguments must be passed to replace all placeholders in the format string. The function arguments are placed on the stack above the function stack frame, printf() fetches arguments according to the numbers and type of placeholders, easily going off its arguments and accessing data from the stack frame of the previous function.

The NSA advises us to use safety-memory languages like Python, Go, Rust. But will it completely protect us?

The Python compiler will definitely check boundaries in many cases for you and notify with an error:

>>> print("x: {}, y: {}, {}".format(1, 2))
Traceback (most recent call last):
  File "", line 1, in 
IndexError: Replacement index 2 out of range for positional args tuple

However, this is a quote from one of 36 (for now) vulnerabilities:

Python 2.7.14 is vulnerable to a Heap-Buffer-Overflow as well as a Heap-Use-After-Free.

Golang has its own list of overflow vulnerabilities, and has an unsafe package. The name of the package speaks for itself, usual rules and checks don’t work inside this package.

Heartbleed

In 2014, the Heartbleed bug was discovered. The (at the time) most used cryptography library OpenSSL leaked private keys. We experienced it too.

Mitigation

So memory bugs are a fact of life, and we can’t really fully protect ourselves from them. But, given the fact that cryptographic keys are much more valuable than the other data, can we do better protecting the keys at least?

As we already said, a memory address space is normally associated with a process. And two different processes don’t share memory by default, so are naturally isolated from each other. Therefore, a potential memory bug in one of the processes will not accidentally leak a cryptographic key from another process. The security of ssh-agent builds on this principle. There are always two processes involved: a client/requester and the agent.

The agent will never send a private key over its request channel. Instead, operations that require a private key will be performed by the agent, and the result will be returned to the requester. This way, private keys are not exposed to clients using the agent.

A requester is usually a network-facing process and/or processing untrusted input. Therefore, the requester is much more likely to be susceptible to memory-related vulnerabilities but in this scheme it would never have access to cryptographic keys (because keys reside in a separate process address space) and, thus, can never leak them.

At Cloudflare, we employ the same principle in Keyless SSL. Customer private keys are stored in an isolated environment and protected from Internet-facing connections.

Linux Kernel Key Retention Service

The client/requester and agent approach provides better protection for secrets or cryptographic keys, but it brings some drawbacks:

we need to develop and maintain two different programs instead of one
we also need to design a well-defined-interface for communication between the two processes
we need to implement the communication support between two processes (Unix sockets, shared memory, etc.)
we might need to authenticate and support ACLs between the processes, as we don’t want any requester on our system to be able to use our cryptographic keys stored inside the agent
we need to ensure the agent process is up and running, when working with the client/requester process

What if we replace the agent process with the Linux kernel itself?

it is already running on our system (otherwise our software would not work)
it has a well-defined interface for communication (system calls)
it can enforce various ACLs on kernel objects
and it runs in a separate address space!

Fortunately, the Linux Kernel Key Retention Service can perform all the functions of a typical agent process and probably even more!

Initially it was designed for kernel services like dm-crypt/ecryptfs, but later was opened to use by userspace programs. It gives us some advantages:

the keys are stored outside the process address space
the well-defined-interface and the communication layer is implemented via syscalls
the keys are kernel objects and so have associated permissions and ACLs
the keys lifecycle can be implicitly bound to the process lifecycle

The Linux Kernel Key Retention Service operates with two types of entities: keys and keyrings, where a keyring is a key of a special type. If we put it into analogy with files and directories, we can say a key is a file and a keyring is a directory. Moreover, they represent a key hierarchy similar to a filesystem tree hierarchy: keyrings reference keys and other keyrings, but only keys can hold the actual cryptographic material similar to files holding the actual data.

Keys have types. The type of key determines which operations can be performed over the keys. For example, keys of user and logon types can hold arbitrary blobs of data, but logon keys can never be read back into userspace, they are exclusively used by the in-kernel services.

For the purposes of using the kernel instead of an agent process the most interesting type of keys is the asymmetric type. It can hold a private key inside the kernel and provides the ability for the allowed applications to either decrypt or sign some data with the key. Currently, only RSA keys are supported, but work is underway to add ECDSA key support.

While keys are responsible for safeguarding the cryptographic material inside the kernel, keyrings determine key lifetime and shared access. In its simplest form, when a particular keyring is destroyed, all the keys that are linked only to that keyring are securely destroyed as well. We can create custom keyrings manually, but probably one the most powerful features of the service are the “special keyrings”.

These keyrings are created implicitly by the kernel and their lifetime is bound to the lifetime of a different kernel object, like a process or a user. (Currently there are four categories of “implicit” keyrings), but for the purposes of this post we’re interested in two most widely used ones: process keyrings and user keyrings.

User keyring lifetime is bound to the existence of a particular user and this keyring is shared between all the processes of the same UID. Thus, one process, for example, can store a key in a user keyring and another process running as the same user can retrieve/use the key. When the UID is removed from the system, all the keys (and other keyrings) under the associated user keyring will be securely destroyed by the kernel.

Process keyrings are bound to some processes and may be of three types differing in semantics: process, thread and session. A process keyring is bound and private to a particular process. Thus, any code within the process can store/use keys in the keyring, but other processes (even with the same user id or child processes) cannot get access. And when the process dies, the keyring and the associated keys are securely destroyed. Besides the advantage of storing our secrets/keys in an isolated address space, the process keyring gives us the guarantee that the keys will be destroyed regardless of the reason for the process termination: even if our application crashed hard without being given an opportunity to execute any clean up code - our keys will still be securely destroyed by the kernel.

A thread keyring is similar to a process keyring, but it is private and bound to a particular thread. For example, we can build a multithreaded web server, which can serve TLS connections using multiple private keys, and we can be sure that connections/code in one thread can never use a private key, which is associated with another thread (for example, serving a different domain name).

A session keyring makes its keys available to the current process and all its children. It is destroyed when the topmost process terminates and child processes can store/access keys, while the topmost process exists. It is mostly useful in shell and interactive environments, when we employ the keyctl tool to access the Linux Kernel Key Retention Service, rather than using the kernel system call interface. In the shell, we generally can’t use the process keyring as every executed command creates a new process. Thus, if we add a key to the process keyring from the command line - that key will be immediately destroyed, because the “adding” process terminates, when the command finishes executing. Let’s actually confirm this with [bpftrace](https://github.com/iovisor/bpftrace).

In one terminal we will trace the [user_destroy](https://elixir.bootlin.com/linux/v5.19.17/source/security/keys/user_defined.c#L146) function, which is responsible for deleting a user key:

$ sudo bpftrace -e 'kprobe:user_destroy { printf("destroying key %d\n", ((struct key *)arg0)->serial) }'
Att

And in another terminal let’s try to add a key to the process keyring:

$ keyctl add user mykey hunter2 @p
742524855

Going back to the first terminal we can immediately see:

…
Attaching 1 probe...
destroying key 742524855

And we can confirm the key is not available by trying to access it:

$ keyctl print 742524855
keyctl_read_alloc: Required key not available

So in the above example, the key “mykey” was added to the process keyring of the subshell executing keyctl add user mykey hunter2 @p. But since the subshell process terminated the moment the command was executed, both its process keyring and the added key were destroyed.

Instead, the session keyring allows our interactive commands to add keys to our current shell environment and subsequent commands to consume them. The keys will still be securely destroyed, when our main shell process terminates (likely, when we log out from the system).

So by selecting the appropriate keyring type we can ensure the keys will be securely destroyed, when not needed. Even if the application crashes! This is a very brief introduction, but it will allow you to play with our examples, for the whole context, please, reach the official documentation.

Replacing the ssh-agent with the Linux Kernel Key Retention Service

We gave a long description of how we can replace two isolated processes with the Linux Kernel Retention Service. It’s time to put our words into code. We talked about ssh-agent as well, so it will be a good exercise to replace our private key stored in memory of the agent with an in-kernel one. We picked the most popular SSH implementation OpenSSH as our target.

Some minor changes need to be added to the code to add functionality to retrieve a key from the kernel:

openssh.patch

diff --git a/ssh-rsa.c b/ssh-rsa.c
index 6516ddc1..797739bb 100644
--- a/ssh-rsa.c
+++ b/ssh-rsa.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 
 #include "sshbuf.h"
 #include "compat.h"
@@ -63,6 +64,7 @@ ssh_rsa_cleanup(struct sshkey *k)
 {
 	RSA_free(k->rsa);
 	k->rsa = NULL;
+	k->serial = 0;
 }
 
 static int
@@ -220,9 +222,14 @@ ssh_rsa_deserialize_private(const char *ktype, struct sshbuf *b,
 	int r;
 	BIGNUM *rsa_n = NULL, *rsa_e = NULL, *rsa_d = NULL;
 	BIGNUM *rsa_iqmp = NULL, *rsa_p = NULL, *rsa_q = NULL;
+	bool is_keyring = (strncmp(ktype, "ssh-rsa-keyring", strlen("ssh-rsa-keyring")) == 0);
 
+	if (is_keyring) {
+		if ((r = ssh_rsa_deserialize_public(ktype, b, key)) != 0)
+			goto out;
+	}
 	/* Note: can't reuse ssh_rsa_deserialize_public: e, n vs. n, e */
-	if (!sshkey_is_cert(key)) {
+	else if (!sshkey_is_cert(key)) {
 		if ((r = sshbuf_get_bignum2(b, &rsa_n)) != 0 ||
 		    (r = sshbuf_get_bignum2(b, &rsa_e)) != 0)
 			goto out;
@@ -232,28 +239,46 @@ ssh_rsa_deserialize_private(const char *ktype, struct sshbuf *b,
 		}
 		rsa_n = rsa_e = NULL; /* transferred */
 	}
-	if ((r = sshbuf_get_bignum2(b, &rsa_d)) != 0 ||
-	    (r = sshbuf_get_bignum2(b, &rsa_iqmp)) != 0 ||
-	    (r = sshbuf_get_bignum2(b, &rsa_p)) != 0 ||
-	    (r = sshbuf_get_bignum2(b, &rsa_q)) != 0)
-		goto out;
-	if (!RSA_set0_key(key->rsa, NULL, NULL, rsa_d)) {
-		r = SSH_ERR_LIBCRYPTO_ERROR;
-		goto out;
-	}
-	rsa_d = NULL; /* transferred */
-	if (!RSA_set0_factors(key->rsa, rsa_p, rsa_q)) {
-		r = SSH_ERR_LIBCRYPTO_ERROR;
-		goto out;
-	}
-	rsa_p = rsa_q = NULL; /* transferred */
 	if ((r = sshkey_check_rsa_length(key, 0)) != 0)
 		goto out;
-	if ((r = ssh_rsa_complete_crt_parameters(key, rsa_iqmp)) != 0)
-		goto out;
-	if (RSA_blinding_on(key->rsa, NULL) != 1) {
-		r = SSH_ERR_LIBCRYPTO_ERROR;
-		goto out;
+
+	if (is_keyring) {
+		char *name;
+		size_t len;
+
+		if ((r = sshbuf_get_cstring(b, &name, &len)) != 0)
+			goto out;
+
+		key->serial = request_key("asymmetric", name, NULL, KEY_SPEC_PROCESS_KEYRING);
+		free(name);
+
+		if (key->serial == -1) {
+			key->serial = 0;
+			r = SSH_ERR_KEY_NOT_FOUND;
+			goto out;
+		}
+	} else {
+		if ((r = sshbuf_get_bignum2(b, &rsa_d)) != 0 ||
+			(r = sshbuf_get_bignum2(b, &rsa_iqmp)) != 0 ||
+			(r = sshbuf_get_bignum2(b, &rsa_p)) != 0 ||
+			(r = sshbuf_get_bignum2(b, &rsa_q)) != 0)
+			goto out;
+		if (!RSA_set0_key(key->rsa, NULL, NULL, rsa_d)) {
+			r = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
+		rsa_d = NULL; /* transferred */
+		if (!RSA_set0_factors(key->rsa, rsa_p, rsa_q)) {
+			r = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
+		rsa_p = rsa_q = NULL; /* transferred */
+		if ((r = ssh_rsa_complete_crt_parameters(key, rsa_iqmp)) != 0)
+			goto out;
+		if (RSA_blinding_on(key->rsa, NULL) != 1) {
+			r = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
 	}
 	/* success */
 	r = 0;
@@ -333,6 +358,21 @@ rsa_hash_alg_nid(int type)
 	}
 }
 
+static const char *
+rsa_hash_alg_keyctl_info(int type)
+{
+	switch (type) {
+	case SSH_DIGEST_SHA1:
+		return "enc=pkcs1 hash=sha1";
+	case SSH_DIGEST_SHA256:
+		return "enc=pkcs1 hash=sha256";
+	case SSH_DIGEST_SHA512:
+		return "enc=pkcs1 hash=sha512";
+	default:
+		return NULL;
+	}
+}
+
 int
 ssh_rsa_complete_crt_parameters(struct sshkey *key, const BIGNUM *iqmp)
 {
@@ -433,7 +473,14 @@ ssh_rsa_sign(struct sshkey *key,
 		goto out;
 	}
 
-	if (RSA_sign(nid, digest, hlen, sig, &len, key->rsa) != 1) {
+	if (key->serial > 0) {
+		len = keyctl_pkey_sign(key->serial, rsa_hash_alg_keyctl_info(hash_alg), digest, hlen, sig, slen);
+		if ((long)len == -1) {
+			ret = SSH_ERR_LIBCRYPTO_ERROR;
+			goto out;
+		}
+	}
+	else if (RSA_sign(nid, digest, hlen, sig, &len, key->rsa) != 1) {
 		ret = SSH_ERR_LIBCRYPTO_ERROR;
 		goto out;
 	}
@@ -705,6 +752,18 @@ const struct sshkey_impl sshkey_rsa_impl = {
 	/* .funcs = */		&sshkey_rsa_funcs,
 };
 
+const struct sshkey_impl sshkey_rsa_keyring_impl = {
+	/* .name = */		"ssh-rsa-keyring",
+	/* .shortname = */	"RSA",
+	/* .sigalg = */		NULL,
+	/* .type = */		KEY_RSA,
+	/* .nid = */		0,
+	/* .cert = */		0,
+	/* .sigonly = */	0,
+	/* .keybits = */	0,
+	/* .funcs = */		&sshkey_rsa_funcs,
+};
+
 const struct sshkey_impl sshkey_rsa_cert_impl = {
 	/* .name = */		"ssh-rsa-cert-v01@openssh.com",
 	/* .shortname = */	"RSA-CERT",
diff --git a/sshkey.c b/sshkey.c
index 43712253..3524ad37 100644
--- a/sshkey.c
+++ b/sshkey.c
@@ -115,6 +115,7 @@ extern const struct sshkey_impl sshkey_ecdsa_nistp521_cert_impl;
 #  endif /* OPENSSL_HAS_NISTP521 */
 # endif /* OPENSSL_HAS_ECC */
 extern const struct sshkey_impl sshkey_rsa_impl;
+extern const struct sshkey_impl sshkey_rsa_keyring_impl;
 extern const struct sshkey_impl sshkey_rsa_cert_impl;
 extern const struct sshkey_impl sshkey_rsa_sha256_impl;
 extern const struct sshkey_impl sshkey_rsa_sha256_cert_impl;
@@ -154,6 +155,7 @@ const struct sshkey_impl * const keyimpls[] = {
 	&sshkey_dss_impl,
 	&sshkey_dsa_cert_impl,
 	&sshkey_rsa_impl,
+	&sshkey_rsa_keyring_impl,
 	&sshkey_rsa_cert_impl,
 	&sshkey_rsa_sha256_impl,
 	&sshkey_rsa_sha256_cert_impl,
diff --git a/sshkey.h b/sshkey.h
index 771c4bce..a7ae45f6 100644
--- a/sshkey.h
+++ b/sshkey.h
@@ -29,6 +29,7 @@
 #include 
 
 #ifdef WITH_OPENSSL
+#include 
 #include 
 #include 
 # ifdef OPENSSL_HAS_ECC
@@ -153,6 +154,7 @@ struct sshkey {
 	size_t	shielded_len;
 	u_char	*shield_prekey;
 	size_t	shield_prekey_len;
+	key_serial_t serial;
 };
 
 #define	ED25519_SK_SZ	crypto_sign_ed25519_SECRETKEYBYTES

We need to download and patch OpenSSH from the latest git as the above patch won’t work on the latest release (V_9_1_P1 at the time of this writing):

$ git clone https://github.com/openssh/openssh-portable.git
…
$ cd openssl-portable
$ $ patch -p1 < ../openssh.patch
patching file ssh-rsa.c
patching file sshkey.c
patching file sshkey.h

Now compile and build the patched OpenSSH

$ autoreconf
$ ./configure --with-libs=-lkeyutils --disable-pkcs11
…
$ make
…

Note that we instruct the build system to additionally link with [libkeyutils](https://man7.org/linux/man-pages/man3/keyctl.3.html), which provides convenient wrappers to access the Linux Kernel Key Retention Service. Additionally, we had to disable PKCS11 support as the code has a function with the same name as in `libkeyutils`, so there is a naming conflict. There might be a better fix for this, but it is out of scope for this post.

Now that we have the patched OpenSSH - let’s test it. Firstly, we need to generate a new SSH RSA key that we will use to access the system. Because the Linux kernel only supports private keys in the PKCS8 format, we’ll use it from the start (instead of the default OpenSSH format):

$ ./ssh-keygen -b 4096 -m PKCS8
Generating public/private rsa key pair.
…

Normally, we would be using `ssh-add` to add this key to our ssh agent. In our case we need to use a replacement script, which would add the key to our current session keyring:

ssh-add-keyring.sh

#/bin/bash -e

in=$1
key_desc=$2
keyring=$3

in_pub=$in.pub
key=$(mktemp)
out="${in}_keyring"

function finish {
    rm -rf $key
}
trap finish EXIT

# https://github.com/openssh/openssh-portable/blob/master/PROTOCOL.key
# null-terminanted openssh-key-v1
printf 'openssh-key-v1\0' > $key
# cipher: none
echo '00000004' | xxd -r -p >> $key
echo -n 'none' >> $key
# kdf: none
echo '00000004' | xxd -r -p >> $key
echo -n 'none' >> $key
# no kdf options
echo '00000000' | xxd -r -p >> $key
# one key in the blob
echo '00000001' | xxd -r -p >> $key

# grab the hex public key without the (00000007 || ssh-rsa) preamble
pub_key=$(awk '{ print $2 }' $in_pub | base64 -d | xxd -s 11 -p | tr -d '\n')
# size of the following public key with the (0000000f || ssh-rsa-keyring) preamble
printf '%08x' $(( ${#pub_key} / 2 + 19 )) | xxd -r -p >> $key
# preamble for the public key
# ssh-rsa-keyring in prepended with length of the string
echo '0000000f' | xxd -r -p >> $key
echo -n 'ssh-rsa-keyring' >> $key
# the public key itself
echo $pub_key | xxd -r -p >> $key

# the private key is just a key description in the Linux keyring
# ssh will use it to actually find the corresponding key serial
# grab the comment from the public key
comment=$(awk '{ print $3 }' $in_pub)
# so the total size of the private key is
# two times the same 4 byte int +
# (0000000f || ssh-rsa-keyring) preamble +
# a copy of the public key (without preamble) +
# (size || key_desc) +
# (size || comment )
priv_sz=$(( 8 + 19 + ${#pub_key} / 2 + 4 + ${#key_desc} + 4 + ${#comment} ))
# we need to pad the size to 8 bytes
pad=$(( 8 - $(( priv_sz % 8 )) ))
# so, total private key size
printf '%08x' $(( $priv_sz + $pad )) | xxd -r -p >> $key
# repeated 4-byte int
echo '0102030401020304' | xxd -r -p >> $key
# preamble for the private key
echo '0000000f' | xxd -r -p >> $key
echo -n 'ssh-rsa-keyring' >> $key
# public key
echo $pub_key | xxd -r -p >> $key
# private key description in the keyring
printf '%08x' ${#key_desc} | xxd -r -p >> $key
echo -n $key_desc >> $key
# comment
printf '%08x' ${#comment} | xxd -r -p >> $key
echo -n $comment >> $key
# padding
for (( i = 1; i <= $pad; i++ )); do
    echo 0$i | xxd -r -p >> $key
done

echo '-----BEGIN OPENSSH PRIVATE KEY-----' > $out
base64 $key >> $out
echo '-----END OPENSSH PRIVATE KEY-----' >> $out
chmod 600 $out

# load the PKCS8 private key into the designated keyring
openssl pkcs8 -in $in -topk8 -outform DER -nocrypt | keyctl padd asymmetric $key_desc $keyring

Depending on how our kernel was compiled, we might also need to load some kernel modules for asymmetric private key support:

$ sudo modprobe pkcs8_key_parser
$ ./ssh-add-keyring.sh ~/.ssh/id_rsa myssh @s
Enter pass phrase for ~/.ssh/id_rsa:
723263309

Finally, our private ssh key is added to the current session keyring with the name “myssh”. In addition, the ssh-add-keyring.sh will create a pseudo-private key file in ~/.ssh/id_rsa_keyring, which needs to be passed to the main ssh process. It is a pseudo-private key, because it doesn’t have any sensitive cryptographic material. Instead, it only has the “myssh” identifier in a native OpenSSH format. If we use multiple SSH keys, we have to tell the main ssh process somehow which in-kernel key name should be requested from the system.

Before we start testing it, let’s make sure our SSH server (running locally) will accept the newly generated key as a valid authentication:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now we can try to SSH into the system:

$ SSH_AUTH_SOCK="" ./ssh -i ~/.ssh/id_rsa_keyring localhost
The authenticity of host 'localhost (::1)' can't be established.
ED25519 key fingerprint is SHA256:3zk7Z3i9qZZrSdHvBp2aUYtxHACmZNeLLEqsXltynAY.
This key is not known by any other names.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.
Linux dev 5.15.79-cloudflare-2022.11.6 #1 SMP Mon Sep 27 00:00:00 UTC 2010 x86_64
…

It worked! Notice that we’re resetting the `SSH_AUTH_SOCK` environment variable to make sure we don’t use any keys from an ssh-agent running on the system. Still the login flow does not request any password for our private key, the key itself is resident of the kernel address space, and we reference it using its serial for signature operations.

User or session keyring?

In the example above, we set up our SSH private key into the session keyring. We can check if it is there:

$ keyctl show
Session Keyring
 577779279 --alswrv   1000  1000  keyring: _ses
 846694921 --alswrv   1000 65534   \_ keyring: _uid.1000
 723263309 --als--v   1000  1000   \_ asymmetric: myssh

We might have used user keyring as well. What is the difference? Currently, the “myssh” key lifetime is limited to the current login session. That is, if we log out and login again, the key will be gone, and we would have to run the ssh-add-keyring.sh script again. Similarly, if we log in to a second terminal, we won’t see this key:

$ keyctl show
Session Keyring
 333158329 --alswrv   1000  1000  keyring: _ses
 846694921 --alswrv   1000 65534   \_ keyring: _uid.1000

Notice that the serial number of the session keyring _ses in the second terminal is different. A new keyring was created and “myssh” key along with the previous session keyring doesn’t exist anymore:

$ SSH_AUTH_SOCK="" ./ssh -i ~/.ssh/id_rsa_keyring localhost
Load key "/home/ignat/.ssh/id_rsa_keyring": key not found
…

If instead we tell ssh-add-keyring.sh to load the private key into the user keyring (replace @s with @u in the command line parameters), it will be available and accessible from both login sessions. In this case, during logout and re-login, the same key will be presented. Although, this has a security downside - any process running as our user id will be able to access and use the key.

Summary

In this post we learned about one of the most common ways that data, including highly valuable cryptographic keys, can leak. We talked about some real examples, which impacted many users around the world, including Cloudflare. Finally, we learned how the Linux Kernel Retention Service can help us to protect our cryptographic keys and secrets.

We also introduced a working patch for OpenSSH to use this cool feature of the Linux kernel, so you can easily try it yourself. There are still many Linux Kernel Key Retention Service features left untold, which might be a topic for another blog post. Stay tuned!

ClickHouse Capacity Estimation Framework

Oxana Kharitonova — Thu, 05 Nov 2020 14:12:48 GMT

We use ClickHouse widely at Cloudflare. It helps us with our internal analytics workload, bot management, customer dashboards, and many other systems. For instance, before Bot Management can analyze and classify our traffic, we need to collect logs. The Firewall Analytics tool needs to store and query data somewhere too. The same goes for our new Cloudflare Radar project. We are using ClickHouse for this purpose. It is a big database that can store huge amounts of data and return it on demand. This is not the first time we have talked about ClickHouse, there is a dedicated blogpost on how we introduced ClickHouse for HTTP analytics.

Our biggest cluster has more than 100 nodes, another one about half that number. Besides that, we have over 20 clusters that have at least three nodes and the replication factor of three. Our current insertion rate is about 90M rows per second.

We use the standard approach in ClickHouse schema design. At the top level we have clusters, which hold shards, a group of nodes, and a node is a physical machine. You can find technical characteristics of the nodes here. Stored data is replicated between clusters. Different shards hold different parts of the data, but inside of each shard replicas are equal.

Schema of one cluster:

Capacity planning

As engineers, we periodically face the question of how many additional nodes we have to order to support the growing demand for the next X months, with disk space as our prime concern.

ClickHouse stores extensive information in system tables about the operating processes, which is helpful. From the early days of using ClickHouse we added clickhouse_exporter as part of our monitoring stack. One of the metrics we are interested in is exposed from the system.parts table. Roughly speaking, clickhouse_exporter runs SQL queries asking how many bytes are used by each table. After that, these metrics are sent from Prometheus to Thanos and stored for at least a year.

Every time we wanted to make a forecast of disk usage we queried Thanos for historical data using this expression:

sum by (table) (clickhouse_table_parts_bytes{cluster="{cluster}"})

After that, we uploaded data in dataframes to a Jupyter notebook.

There were a few problems with this approach. Only a few people knew where the notebooks were and how to get them running. It wasn't trivial to download historical data. And most importantly, it was hard to look at past predictions and assess whether or not they were correct since results were not stored anywhere except internal blog posts. Also, as the number and size of clusters and products grew it became impossible for a single team to work on capacity planning and we needed to get engineers building products involved as they have the most context on how the growth will change in the future.

We wanted to automate this process and made calculations more transparent for our colleagues, including those who use ClickHouse for their services. Honestly, at the beginning we weren’t sure if it was even possible and what we would get out of it.

Finding the right metrics

The crucial moment of adding new nodes for us is a disk space, so this was a place to start. We decided to use system.parts, as we used it before with the manual approach.

Luckily, we started doing it for the cluster that had recently changed its topology. That cluster had two shards with four and five nodes in every shard. After the topology change, it was replaced with three shards and three nodes in every shard, but the number of machines and unreplicated data on the disks remained the same. However, it had an impact on our metrics: we previously had four replicated nodes in one shard and five replicated in another, we took one node off from the first shard and two nodes from the second and created a new one based on these three nodes. The new shard was empty, so we just added it, but the total amount of data in the first and the second shards was less as the count of the remaining nodes.

You can see on the graph below in April we had this sharp decrease caused by topology changes. We got ~550T instead of ~850T among all shards and replicas.

When we tried to train our model based on the real data due to the April drop it thought we had a downward trend. Which was incorrect as we only dropped replicated data. The trend for unreplicated data hadn’t changed. So we decided to take into account only unreplicated data. It saved us from the topology change and node replacement in case of problems with hardware.

The rule that we use for metrics now is:

sum by(cluster) (
max by (cluster, shardgroup) (
node_clickhouse_shardgroupinfo{} *
on (instance) group_right (cluster, shardgroup) sum(table_parts_bytes{cluster="%s"}) by (instance)
))

We continue using system.parts from clickhouse_exporter, but instead of counting the whole amount of data we use the maximum of unreplicated data from every shard.

In the image below there is the same cluster as in the image above but instead of counting the whole amount of data we look at unreplicated data from all shards. You can clearly see that we continued to grow and didn’t have any drop in data.

Another problem we faced was that we migrated some tables from one cluster to another because we were running out of space and it required immediate action. However, our model didn’t know that part of the tables didn’t live there anymore, and we didn’t want them to be a part of the prediction. To solve this problem we queried Prometheus to get the list of the tables that existed at the prediction time, then filtered historical data to include only these tables and used them as the input for training a model.

Load of metrics

After determining the correct metrics, we needed to obtain them for our forecasting procedure. Our long-term metrics solution, Thanos, stores billions of data points. Querying it for a cluster with over one hundred nodes even for one day takes a huge amount of time, and we needed these data points for a year.

As we planned to use Python we wrote a small client using aiohttp that concurrently sends HTTP requests to Thanos. The requests are sent in chunks, and every request has start/end dates with a difference of one hour. We needed to get the data for the whole year once and then append new ones day by day. We got csv files: one file for one cluster. The client became a part of the project, and it runs once a day, queries Thanos for new metrics (previous day) and appends data to the files.

Forecasting procedure

At this point, we have collected metrics in files, now it’s time to make a forecast. We needed something for time-series metrics, so we chose Prophet from Facebook. It’s very simple to use, you can follow the documentation and get good results even with the default parameters.

One challenge we faced using Prophet was the need to feed it one data point for a day. In the metric files we have thousands of those for every day. It looks logical to take the point at the end of every day, but it’s not really true. All tables have a retention period, the time for how long we store data in ClickHouse. We don’t know when the data is cleared, it happens gradually throughout the day. So, we decided to take the maximum number for a day.

Drawing Graphs

We chose Grafana to present results, though we needed to store predicted data points somewhere. The first thought was to use Prometheus, but because of high cardinality, we had about 300,000 points in summary for clusters and tables, so we passed. We decided to use ClickHouse itself. We wanted to have both graphs, real and predicted, on the same dashboard. We had real data points in Prometheus and with mixed data source could do this. However, the problem was the same as the loading of metrics into files, for some clusters it’s impossible to obtain metrics for a long period of time. We added functionality to upload real metrics in ClickHouse as well, now both real and predicted metrics are displayed in Grafana, taken from ClickHouse.

Summary

This is what we have in Grafana:

The yellow line shows the real data;
The green line was created based on Prophet output;
The red line - the maximum disk capacity. We already increased it twice.

We have a service running in Kubernetes that does all the toil, and we created an environment for other metrics. We have the place where we collect metrics from Thanos and expose them in the required format to Grafana. If we find the right metrics for accounting other resources like IO, CPU or other systems like Kafka we can easily add them to our framework. We can easily replace Prophet with another algorithm, and we can go back months and evaluate how close we were in our prediction according to the real data.

With this automation we were able to spot we were going out of disk space for a couple of clusters which we didn’t expect. We have over 20 clusters and have updates for all of them every day. This dashboard is used not only by our colleagues who are direct customers of ClickHouse but by the team who makes a plan for buying servers. It is easy to read and costs none of developers time.

This project was carried out by the Core SRE team to improve our daily work. If you are interested in this project, check out our job openings.

We didn’t know what we would get at the end, we discussed, looked for solutions and tried different approaches. Huge thanks for this to Nicolae Vartolomei, Alex Semiglazov and John Skopis.