The Linux Crypto API for user applications

In this post we will explore Linux Crypto API for user applications and try to understand its pros and cons.

The Linux Kernel Crypto API was introduced in October 2002. It was initially designed to satisfy internal needs, mostly for IPsec. However, in addition to the kernel itself, user space applications can benefit from it.

If we apply the basic definition of an API to our case, we will have the kernel on one side and our application on the other. The application needs to send data, i.e. plaintext or ciphertext, and get encrypted/decrypted text in response from the kernel. To communicate with the kernel we need to make a system call. Also, before starting the data exchange, we need to agree on some cryptographic parameters, at least the selected crypto algorithm and key length. These constraints, along with all supported algorithms, can be found in the /proc/crypto virtual file.

Below is a short excerpt from my /proc/crypto looking at ctr(aes). In the examples, we will use the AES cipher in CTR mode, further we will give more details about the algorithm itself.

name         : ctr(aes)
driver       : ctr(aes-generic)
module       : ctr
priority     : 100
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16


name         : ctr(aes)
driver       : ctr(aes-aesni)
module       : ctr
priority     : 300
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : no
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16


name         : ctr(aes)
driver       : ctr-aes-aesni
module       : aesni_intel
priority     : 400
refcnt       : 1
selftest     : passed
internal     : no
type         : skcipher
async        : yes
blocksize    : 1
min keysize  : 16
max keysize  : 32
ivsize       : 16
chunksize    : 16
walksize     : 16

In the output above, there are three config blocks. The kernel may provide several implementations of the same algorithm depending on the CPU architecture, available hardware, presence of crypto accelerators etc.

We can pick the implementation based on the algorithm name or the driver name. The algorithm name is not unique, but the driver name is. If we use the algorithm name, the driver with the highest priority will be chosen for us, which in theory should provide the best cryptographic performance in this context. Let’s see the performance of different implementations of AES-CTR encryption. I use the libkcapi library: it’s a lightweight wrapper for the kernel crypto API which also provides built-in speed tests. We will examine these tests.

$ kcapi-speed -c "AES(G) CTR(G) 128" -b 1024 -t 10
AES(G) CTR(G) 128   	|d|	1024 bytes|          	149.80 MB/s|153361 ops/s
AES(G) CTR(G) 128   	|e|	1024 bytes|          	159.76 MB/s|163567 ops/s
 
$ kcapi-speed -c "AES(AESNI) CTR(ASM) 128" -b 1024 -t 10
AES(AESNI) CTR(ASM) 128 |d|	1024 bytes|          	343.10 MB/s|351332 ops/s
AES(AESNI) CTR(ASM) 128 |e|	1024 bytes|         	310.100 MB/s|318425 ops/s
 
$ kcapi-speed -c "AES(AESNI) CTR(G) 128" -b 1024 -t 10
AES(AESNI) CTR(G) 128   |d|	1024 bytes|          	155.37 MB/s|159088 ops/s
AES(AESNI) CTR(G) 128   |e|	1024 bytes|          	172.94 MB/s|177054 ops/s

Here and later ignore the absolute numbers, as they depend on the environment where the tests were running. Rather look at the relationship between the numbers.

The x86 AES instructions showed the best results, twice as fast vs the generic portable C implementation. As expected, this implementation has the highest priority in the /proc/crypto. We will use only this one later.

This brief introduction can be rephrased as: “I can ask the kernel to encrypt or decrypt data from my application”. But, why do I need it?

Why do I need it?

In our previous blog post Linux Kernel Key Retention Service we talked a lot about cryptographic key protection. We concluded that the best Linux option is to store cryptographic keys in the kernel space and restrict the access to a limited number of applications. However, if all our cryptography is processed in user space, potentially damaging code still has access to the raw key material. We have to think wisely about using the key: what part of the code has access to it, don’t log it accidentally, how the open-source libraries manage it and if the memory is purged after using it. We may need to support a dedicated process to not have a key in network-facing code. Thus, many things need to be done for security, and for each application which works with cryptography. And even after all these precautionary measures, the best of the best are subject to bugs and vulnerabilities. OpenSSL, the most known and widely used cryptographic library in user space, has had a few problems in its security.

Can we move all the cryptography to the kernel and help solve these problems? Looks like it! Our recent patch to upstream extended the key types which can be used in symmetric encryption in the Crypto API directly from the Linux Kernel Key Retention Service.

But nothing is free. There will be some overhead for the system calls and data copying between user and kernel spaces. So, the next question is how fast it is.

Is it fast?

To answer this question we need to have some baseline to compare with. OpenSSL would be the best as it’s used all around the Internet. OpenSSL provides a good composite of toolkits, including C-functions, a console utility and various speed tests. For the sake of equality, we will ignore the built-in tests and write our own tests using OpenSSL C-functions. We want the same data to be processed and the same logic parts to be measured in both cases (Kernel versus OpenSSL).

So, the task: write a benchmark for AES-CTR-128 encrypting data split in chunks. Make implementations for the Kernel Crypto API and OpenSSL.

About AES-CTR-128

AES stands for Advanced Encryption Standard. It is a block cipher algorithm, which means the whole plaintext is split into blocks and two operations are applied: substitution and permutation. There are two parameters characterizing a block cipher: the block size and the key size. AES processes a block of 128 bits using a key of either 128, 192 or 256 bits. Each 128 bits or 16 bytes block is presented as a 4x4 two-dimensional array (matrix), where one element of the matrix presents one byte of the plaintext. To change the plaintext to ciphertext several rounds of transformation are applied: the bits of the block XORs with a key derived from the main key and substitution with permutation are applied to rows and columns of the matrix. There can be 10, 12 or 14 rounds depending on the key size (the key size determines how many keys can be derived from it).

AES is a secure cipher, but there is one nuance - the same plaintext/block of text will produce the same result. Look at Linux’s mascot Tux. To avoid this, a mode of operation (or just mode) has to be applied. It determines how the text changes, so the same input doesn't result in the same output. Tux was encrypted using ECB mode, there is no text transformation at all. Another mode example is CBC, where the ciphertext from the previously encrypted block is added to the next block, for the first block an initial value (IV) is added. This mode guarantees that for the same input and different IV the output will be different. However, this mode is slow as each block depends on the previous one and so encryption can’t be parallelized. CTR is a counter mode, instead of using previously encrypted blocks it uses a counter and a nonce. A counter is an integer which is incremented for each block. A nonce is just a random number similar to the IV. The nonce, and IV, should be different for each message and can be transferred openly with the encrypted text. So, the title AES-CTR-128 means AES used in CTR mode with the key size of 128 bits.

Implementing AES-CTR-128 with the Kernel Crypto API

The kernel and user spaces are isolated for security reasons and each time data needs to be transferred between them, it’s copied. In our case, it would add a significant overhead - copying a big bunch of plain or encrypted text to the kernel and back. However, the crypto API supports a zero-copy interface. Instead of transferring the actual data, a file descriptor is passed. But it has a limitation - the maximum size is only 16 pages. So for our tests we picked the number closest to the maximum limit - 63KB (16 pages of 4KB minus 1KB to avoid any potential edge cases).

The code below is the exact implementation of what is written in the kernel documentation. Firstly we created a socket of AF_ALG type. The salg_type and salg_name parameters can be taken from the /proc/crypto file. Instead of a generic name we used the driver name ctr-aes-aesni. We might put just a name ctr(aes) and the driver with the highest priority (ctr-aes-aesni in our context) will be picked for us by the Kernel. Further we put the key length and accepted the socket. The IV size is provided before the payload as ancillary data. Constraints of the key and IV sizes can be found in /proc/crypto too.

Now we are ready to start communication. We excluded all pre-set up steps from the measurements. In a loop we send plaintext for encryption with the flag SPLICE_F_MORE to inform the kernel that more data will be provided. And here in the loop we read the cipher text from the kernel. The last plaintext should be sent without the flag thus saying that we are done, and the kernel can finalize the encryption.

In favor of brevity, error handling is omitted in both examples.

kernel.c

#define _GNU_SOURCE

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#include <unistd.h>
#include <fcntl.h>
#include <time.h>
#include <sys/random.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

#define PT_LEN (63 * 1024)
#define CT_LEN PT_LEN
#define IV_LEN 16
#define KEY_LEN 16
#define ITER_COUNT 100000

static uint8_t pt[PT_LEN];
static uint8_t ct[CT_LEN];
static uint8_t key[KEY_LEN];
static uint8_t iv[IV_LEN];

static void time_diff(struct timespec *res, const struct timespec *start, const struct timespec *end)
{
    res->tv_sec = end->tv_sec - start->tv_sec;
    res->tv_nsec = end->tv_nsec - start->tv_nsec;
    if (res->tv_nsec < 0) {
        res->tv_sec--;
        res->tv_nsec += 1000000000;
    }
}

int main(void)
{
    // Fill the test data
    getrandom(key, sizeof(key), GRND_NONBLOCK);
    getrandom(iv, sizeof(iv), GRND_NONBLOCK);
    getrandom(pt, sizeof(pt), GRND_NONBLOCK);

    // Set up AF_ALG socket
    int alg_s, aes_ctr;
    struct sockaddr_alg sa = { .salg_family = AF_ALG };
    strcpy(sa.salg_type, "skcipher");
    strcpy(sa.salg_name, "ctr-aes-aesni");

    alg_s = socket(AF_ALG, SOCK_SEQPACKET, 0);
    bind(alg_s, (const struct sockaddr *)&sa, sizeof(sa));
    setsockopt(alg_s, SOL_ALG, ALG_SET_KEY, key, KEY_LEN);
    aes_ctr = accept(alg_s, NULL, NULL);
    close(alg_s);

    // Set up IV
    uint8_t cmsg_buf[CMSG_SPACE(sizeof(uint32_t)) + CMSG_SPACE(sizeof(struct af_alg_iv) + IV_LEN)] = {0};
    struct msghdr msg = {
	.msg_control = cmsg_buf,
	.msg_controllen = sizeof(cmsg_buf)
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t));
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_OP;
    *((uint32_t *)CMSG_DATA(cmsg)) = ALG_OP_ENCRYPT;
    
    cmsg = CMSG_NXTHDR(&msg, cmsg);
    cmsg->cmsg_len = CMSG_LEN(sizeof(struct af_alg_iv) + IV_LEN);
    cmsg->cmsg_level = SOL_ALG;
    cmsg->cmsg_type = ALG_SET_IV;
    ((struct af_alg_iv *)CMSG_DATA(cmsg))->ivlen = IV_LEN;
    memcpy(((struct af_alg_iv *)CMSG_DATA(cmsg))->iv, iv, IV_LEN);
    sendmsg(aes_ctr, &msg, 0);

    // Set up pipes for using zero-copying interface
    int pipes[2];
    pipe(pipes);

    struct iovec pt_iov = {
        .iov_base = pt,
        .iov_len = sizeof(pt)
    };

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    int i;
    for (i = 0; i < ITER_COUNT; i++) {
        vmsplice(pipes[1], &pt_iov, 1, SPLICE_F_GIFT);
        // SPLICE_F_MORE means more data will be coming
        splice(pipes[0], NULL, aes_ctr, NULL, sizeof(pt), SPLICE_F_MORE);
        read(aes_ctr, ct, sizeof(ct));
    }
    vmsplice(pipes[1], &pt_iov, 1, SPLICE_F_GIFT);
    // A final call without SPLICE_F_MORE
    splice(pipes[0], NULL, aes_ctr, NULL, sizeof(pt), 0);
    read(aes_ctr, ct, sizeof(ct));
    
    clock_gettime(CLOCK_MONOTONIC, &end);

    close(pipes[0]);
    close(pipes[1]);
    close(aes_ctr);

    struct timespec diff;
    time_diff(&diff, &start, &end);
    double tput_krn = ((double)ITER_COUNT * PT_LEN) / (diff.tv_sec + (diff.tv_nsec * 0.000000001 ));
    printf("Kernel: %.02f Mb/s\n", tput_krn / (1024 * 1024));
    
    return 0;
}

Compile and run:

$ gcc -o kernel kernel.c
$ ./kernel
Kernel: 2112.49 Mb/s

Implementing AES-CTR-128 with OpenSSL

With OpenSSL everything is straight forward, we just repeated an example from the official documentation.

openssl.c

#include <time.h>
#include <sys/random.h>
#include <openssl/evp.h>

#define PT_LEN (63 * 1024)
#define CT_LEN PT_LEN
#define IV_LEN 16
#define KEY_LEN 16
#define ITER_COUNT 100000

static uint8_t pt[PT_LEN];
static uint8_t ct[CT_LEN];
static uint8_t key[KEY_LEN];
static uint8_t iv[IV_LEN];

static void time_diff(struct timespec *res, const struct timespec *start, const struct timespec *end)
{
    res->tv_sec = end->tv_sec - start->tv_sec;
    res->tv_nsec = end->tv_nsec - start->tv_nsec;
    if (res->tv_nsec < 0) {
        res->tv_sec--;
        res->tv_nsec += 1000000000;
    }
}

int main(void)
{
    // Fill the test data
    getrandom(key, sizeof(key), GRND_NONBLOCK);
    getrandom(iv, sizeof(iv), GRND_NONBLOCK);
    getrandom(pt, sizeof(pt), GRND_NONBLOCK);

    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    EVP_EncryptInit_ex(ctx, EVP_aes_128_ctr(), NULL, key, iv);

    int outl = sizeof(ct);
    
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    int i;
    for (i = 0; i < ITER_COUNT; i++) {
        EVP_EncryptUpdate(ctx, ct, &outl, pt, sizeof(pt));
    }
    uint8_t *ct_final = ct + outl;
    outl = sizeof(ct) - outl;
    EVP_EncryptFinal_ex(ctx, ct_final, &outl);

    clock_gettime(CLOCK_MONOTONIC, &end);

    EVP_CIPHER_CTX_free(ctx);

    struct timespec diff;
    time_diff(&diff, &start, &end);
    double tput_ossl = ((double)ITER_COUNT * PT_LEN) / (diff.tv_sec + (diff.tv_nsec * 0.000000001 ));
    printf("OpenSSL: %.02f Mb/s\n", tput_ossl / (1024 * 1024));

    return 0;
}

Compile and run:

$ gcc -o openssl openssl.c -lcrypto
$ ./openssl
OpenSSL: 3758.60 Mb/s

Results of OpenSSL vs Crypto API

OpenSSL: 3758.60 Mb/s
Kernel: 2112.49 Mb/s

Don’t pay attention to the absolute values, look at the relationship.

The numbers look pessimistic. But why? Can't the kernel implement AES-CTR similar to OpenSSL? We used bpftrace to understand this better. The encryption function is called on the read() system call. Trying to be as close to the encryption code as possible, we put a probe on the ctr_crypt function instead of the whole read call.

$ sudo bpftrace -e 'kprobe:ctr_crypt { @start=nsecs; @count+=1; } kretprobe:ctr_crypt /@start!=0/ { @total+=nsecs-@start; }'

We took the same plaintext, encrypted it in chunks of 63KB and measured how much time it took for both cases to encrypt it with bpftrace attached to the kernel:

OpenSSL: 1 sec 650532178 nsec
Kernel: 3 sec 120442931 nsec // 3120442931 ns
OpenSSL: 3727.49 Mb/s
Kernel: 1971.63 Mb/s

@total: 2031169756     //  2031169756 / 3120442931 = 0.6509235390339526

The @total number is output from bpftrace, which tells us how much time the kernel spent in the encryption function. To compare plain kernel encryption vs OpenSSL we need to say how many Mb/s kernel would have done if only encryption had been involved (excluding all system calls and data copy/ manipulation). We need to apply some math:

The correlation between the total time and the time which the kernel spent in the encryption is 2031169756 / 3120442931 = 0.6509235390339526 or 65%.
So throughput would be 1971.63 / 0.650923539033952 - 3028.97 Mb/s. Comparing this to OpenSSL Mb/s we get 3028.97 / 3727.49, so around 81%.

It would be fair to say that bpftrace adds some overhead and our numbers for the kernel are less than they could be. So, we can safely say that while the Kernel Crypto API is two times slower than OpenSSL, the crypto part itself is almost equal.

Conclusion

In this post we reviewed the Linux Kernel Crypto API and its user space interface. We reiterated some security benefits of doing encryption through the Kernel vs using some sort of cryptographic library. We also measured the performance overhead of doing data encryption/decryption through the Kernel Crypto API, confirmed that in-kernel crypto is likely as good as in OpenSSL, but a better user space interface is needed to make Kernel Crypto API as fast as using a cryptographic library. Using Crypto API is a subjective decision depending on your circumstances, it’s a trade-off in speed vs. security.

The Cloudflare Blog

The Linux Crypto API for user applications

Why do I need it?

Is it fast?

About AES-CTR-128

Implementing AES-CTR-128 with the Kernel Crypto API

Implementing AES-CTR-128 with OpenSSL

Results of OpenSSL vs Crypto API

Conclusion

QUIC restarts, slow problems: udpgrm to the rescue

A steam locomotive from 1993 broke my yarn test

Searching for the cause of hung tasks in the Linux kernel

Multi-Path TCP: revolutionizing connectivity, one path at a time