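Live-patching security vulnerabilities inside the Linux kernel with eBPF Linux Security Module

Linux Security Modules (LSM) is a hook-based framework for implementing security policies and Mandatory Access Control in the Linux kernel. Until recently, users looking to implement a security policy had just two options: configure an existing LSM module such as AppArmor or SELinux, or write a custom kernel module.

Linux 5.7 introduced a third way: LSM extended Berkeley Packet Filters (eBPF), or LSM BPF for short. LSM BPF lets developers write granular policies without configuring an existing LSM or loading a kernel module. LSM BPF programs are verified at load time and then executed when an LSM hook is reached in a call path.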

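Let's solve a real-world problem

Modern operating systems provide facilities allowing "partitioning" of kernel resources. For example, FreeBSD has "jails" and Solaris has "zones". Linux is different: it provides a set of seemingly independent facilities, each allowing isolation of a specific resource. These are called "namespaces" and have been growing in the kernel for years. They are the base of popular tools like Docker, lxc, and firejail. Many of the namespaces are uncontroversial, like the UTS namespace, which allows the host system to hide its hostname and domain name. Others are complex but straightforward: NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is the very special and very curious USER namespace.

The USER namespace is special because it allows its owner to operate as "root" inside it. How it works is beyond the scope of this blog post. Suffice it to say that it is the foundation that lets tools like Docker avoid operating as true root, and that makes things like rootless containers possible.

Due to its nature, allowing unprivileged users access to the USER namespace has always carried a great security risk. One such risk is privilege escalation.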

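Privilege escalation is a common attack surface for operating systems. One way users may gain privilege is by mapping their namespace to the root namespace via the unshare syscall with the CLONE_NEWUSER flag. This tells unshare to create a new user namespace with full permissions, and maps the new user and group ID to the previous namespace. You can use the unshare(1) program to map root to our original namespace:

$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …
$ unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root),65534(nogroup)
# cat /proc/self/uid_map
         0       1000          1

In most cases using unshare is harmless, and it is intended to run with lower privileges. However, this syscall has been known to be used to escalate privileges.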

The clone and clone3 syscalls are worth looking into as well, because they can also be passed CLONE_NEWUSER. For this post, however, we're going to focus on unshare.
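To make the uid_map output in the session above easier to read, here is a small user-space sketch that parses one such line and translates an ID from inside the namespace to the parent namespace. The struct and function names are ours and purely illustrative; the three-column layout (ID inside the namespace, ID outside, length of the range) is the format described in user_namespaces(7).

```c
#include <stdio.h>

/* One line of /proc/<pid>/uid_map: IDs [inside, inside + count) in the
 * namespace correspond to [outside, outside + count) in the parent. */
struct id_map_entry {
    unsigned int inside;
    unsigned int outside;
    unsigned int count;
};

/* Parse a single uid_map line. Returns 0 on success, -1 on malformed input. */
int parse_id_map_line(const char *line, struct id_map_entry *out)
{
    if (sscanf(line, "%u %u %u", &out->inside, &out->outside, &out->count) != 3)
        return -1;
    return 0;
}

/* Translate a namespace-local ID into the parent namespace.
 * Returns -1 if the ID is not covered by this entry. */
long map_to_parent(const struct id_map_entry *e, unsigned int id)
{
    if (id < e->inside || id - e->inside >= e->count)
        return -1;
    return (long)e->outside + (long)(id - e->inside);
}
```

Fed the line from the session above, `map_to_parent` resolves ID 0 inside the namespace to 1000 outside it, which is why the "root" shell is still fred as far as the host is concerned.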

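Debian solved this problem with the "add sysctl to disallow unprivileged CLONE_NEWUSER by default" patch, but it was not mainlined. Another similar patch, "sysctl: allow CLONE_NEWUSER to be disabled", attempted to reach mainline and was met with pushback. One critique was the inability to toggle this feature for specific applications. In the article "Controlling access to user namespaces" the author wrote: "... the current patches do not appear to have an easy path into the mainline." And as we can see, the patches were ultimately not included in the vanilla kernel.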

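Our solution - LSM BPF

Since upstreaming code that restricts the USER namespace does not seem to be an option, we decided to use LSM BPF to circumvent these issues. This requires no modifications to the kernel and allows us to express complex rules guarding access.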

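Finding appropriate hook candidates

First, let's track down the syscall we're targeting. The prototype is in the include/linux/syscalls.h file. It is not obvious how to get from there to the implementation, but the line:

/* kernel/fork.c */

gives us a clue about where to look next: kernel/fork.c. There, a call to ksys_unshare() is made. Digging through that function, we find a call to unshare_userns(). This looks promising.

Up to this point, we've identified the syscall implementation, but the next question is: which hooks are available for us to use? Because we know from the man pages that unshare is used to mutate tasks, we look at the task-based hooks in include/linux/lsm_hooks.h. Back in the function unshare_userns() we saw a call to prepare_creds(). This looks very similar to the cred_prepare hook. To verify that prepare_creds() is our match, we follow it to the security hook security_prepare_creds(), which ultimately calls the hook:

…
rc = call_int_hook(cred_prepare, 0, new, old, gfp);
…

Without going much further down this rabbit hole, we know this is a good hook to use because prepare_creds() is called right before create_user_ns() in unshare_userns(), and create_user_ns() is the operation we're trying to block.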
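For readers who have not met call_int_hook before, the following user-space sketch models how we understand it to behave in the kernels discussed here: the registered implementations of a hook are called in order, and the first non-zero return value ends the walk. Everything in the block (the function-pointer type, the toy hooks, the call counter) is a hypothetical stand-in, not kernel code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one registered cred_prepare implementation. */
typedef int (*cred_prepare_fn)(void);

/* Counts how many hooks were actually invoked, to show the early exit. */
int calls_made = 0;

/* Walk the hooks in registration order and stop at the first denial. */
int call_hooks_model(const cred_prepare_fn *hooks, size_t count, int default_rc)
{
    int rc = default_rc;

    for (size_t i = 0; i < count; i++) {
        rc = hooks[i]();
        if (rc != 0)
            break; /* later hooks are not consulted */
    }
    return rc;
}

/* Two toy hooks used to exercise the model. */
int hook_allow(void) { calls_made++; return 0; }
int hook_deny(void)  { calls_made++; return -EPERM; }
```

With the list `{ hook_allow, hook_deny, hook_allow }` the model returns -EPERM after two calls. This is also why an LSM BPF program receives the previous result as its `ret` argument: a denial from an earlier program should be passed through rather than overwritten.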

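LSM BPF solution

We're going to compile with the eBPF compile once-run everywhere (CO-RE) approach. This allows us to compile on one architecture and load on another. However, we're going to target x86_64 specifically. At the time of writing, LSM BPF for ARM64 is still in development, and the following code will not run on that architecture. Watch the BPF mailing list to follow the progress.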

This solution was tested on kernel versions >= 5.15 configured with the following:

BPF_EVENTS
BPF_JIT
BPF_JIT_ALWAYS_ON
BPF_LSM
BPF_SYSCALL
BPF_UNPRIV_DEFAULT_OFF
DEBUG_INFO_BTF
DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT
DYNAMIC_FTRACE
FUNCTION_TRACER
HAVE_DYNAMIC_FTRACE

The boot option lsm=bpf may be necessary if CONFIG_LSM does not contain "bpf" in its list.
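One way to check whether the BPF LSM is active on a running system is to look at the comma-separated list the kernel exposes in /sys/kernel/security/lsm. The sketch below is a minimal user-space check under that assumption; the function names are ours, and it requires securityfs to be mounted at its usual location.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if `name` appears as a whole entry in a comma-separated list. */
bool lsm_list_contains(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while (*p) {
        /* Length of the current entry, up to the next comma or line end. */
        size_t tok_len = strcspn(p, ",\n");

        if (tok_len == name_len && strncmp(p, name, name_len) == 0)
            return true;
        p += tok_len;
        if (*p) /* skip the separator */
            p++;
    }
    return false;
}

/* Read the active LSM list from securityfs; false if it cannot be read. */
bool bpf_lsm_is_active(void)
{
    char buf[256];
    bool found = false;
    FILE *f = fopen("/sys/kernel/security/lsm", "r");

    if (!f)
        return false;
    if (fgets(buf, sizeof(buf), f))
        found = lsm_list_contains(buf, "bpf");
    fclose(f);
    return found;
}
```

Matching whole entries rather than substrings matters here: a list containing only a longer name that merely starts with "bpf" should not count as the BPF LSM being enabled.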

Let's start with our preamble:

deny_unshare.bpf.c:

#include <linux/bpf.h>
#include <linux/capability.h>
#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/types.h>

#include <bpf/bpf_tracing.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

#define X86_64_UNSHARE_SYSCALL 272
#define UNSHARE_SYSCALL X86_64_UNSHARE_SYSCALL

Next we set up our necessary structures for CO-RE relocation in the following way:

deny_unshare.bpf.c:

…

typedef unsigned int gfp_t;

struct pt_regs {
    long unsigned int di;
    long unsigned int orig_ax;
} __attribute__((preserve_access_index));

typedef struct kernel_cap_struct {
    __u32 cap[_LINUX_CAPABILITY_U32S_3];
} __attribute__((preserve_access_index)) kernel_cap_t;

struct cred {
    kernel_cap_t cap_effective;
} __attribute__((preserve_access_index));

struct task_struct {
    unsigned int flags;
    const struct cred *cred;
} __attribute__((preserve_access_index));

char LICENSE[] SEC("license") = "GPL";

…

We don't need to fully flesh out the structs; we just need the absolute minimum information the program needs to function. CO-RE will do whatever is necessary to perform the relocations for your kernel. This makes writing LSM BPF programs easy!
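The kernel_cap_t definition above stores capabilities as an array of 32-bit words, and the policy tests a single bit of it with the CAP_TO_INDEX and CAP_TO_MASK macros. As a rough user-space illustration of that arithmetic, the sketch below re-implements the two steps with our own helper names. The EXAMPLE_ constants are assumptions that mirror the values in <linux/capability.h> (CAP_SYS_ADMIN is 21 and _LINUX_CAPABILITY_U32S_3 is 2); kernels newer than the ones tested here may lay capabilities out differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_CAP_SYS_ADMIN 21 /* mirrors CAP_SYS_ADMIN */
#define EXAMPLE_CAP_WORDS 2      /* mirrors _LINUX_CAPABILITY_U32S_3 */

/* Which 32-bit word holds the capability (same idea as CAP_TO_INDEX). */
unsigned int cap_word_index(unsigned int cap)
{
    return cap >> 5;
}

/* Which bit inside that word (same idea as CAP_TO_MASK). */
uint32_t cap_word_mask(unsigned int cap)
{
    return UINT32_C(1) << (cap & 31);
}

/* True if the capability bit is set in the given word array. */
bool cap_is_set(const uint32_t words[EXAMPLE_CAP_WORDS], unsigned int cap)
{
    return (words[cap_word_index(cap)] & cap_word_mask(cap)) != 0;
}
```

For CAP_SYS_ADMIN this selects word 0 and the mask 0x00200000, so the check in the policy boils down to a single AND on the first word of cap_effective.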

deny_unshare.bpf.c:

SEC("lsm/cred_prepare")
int BPF_PROG(handle_cred_prepare, struct cred *new, const struct cred *old,
             gfp_t gfp, int ret)
{
    struct pt_regs *regs;
    struct task_struct *task;
    kernel_cap_t caps;
    int syscall;
    unsigned long flags;

    // If previous hooks already denied, go ahead and deny this one
    if (ret) {
        return ret;
    }

    task = bpf_get_current_task_btf();
    regs = (struct pt_regs *) bpf_task_pt_regs(task);
    // On x86_64, orig_ax holds the number of the syscall being executed
    syscall = regs->orig_ax;
    caps = task->cred->cap_effective;

    // Only process UNSHARE syscall, ignore all others
    if (syscall != UNSHARE_SYSCALL) {
        return 0;
    }

    // PT_REGS_PARM1_CORE pulls the first parameter passed into the unshare syscall
    flags = PT_REGS_PARM1_CORE(regs);

    // Ignore any unshare that does not have CLONE_NEWUSER
    if (!(flags & CLONE_NEWUSER)) {
        return 0;
    }

    // Allow tasks with CAP_SYS_ADMIN to unshare (already root)
    if (caps.cap[CAP_TO_INDEX(CAP_SYS_ADMIN)] & CAP_TO_MASK(CAP_SYS_ADMIN)) {
        return 0;
    }

    return -EPERM;
}

Creating the program is the first step; the second is loading the program and attaching it to our desired hook. There are several ways to do this: the Cilium ebpf project, Rust bindings, and several others on the ebpf.io project landscape page. We're going to use native libbpf.
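The decision logic of handle_cred_prepare can be exercised without a kernel by modeling it as a pure function. The sketch below is only such a model: it takes the values the BPF program reads from the task (the previous hook result, the syscall number, the unshare flags, and whether CAP_SYS_ADMIN is effective) as plain arguments. The MODEL_ constants and the function name are ours; 272 is the x86_64 unshare syscall number used in the preamble, and 0x10000000 is the value of CLONE_NEWUSER.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_UNSHARE_NR 272             /* x86_64 unshare, as in the preamble */
#define MODEL_CLONE_NEWUSER 0x10000000UL /* value of CLONE_NEWUSER */

/* Mirror of the policy's checks, in the same order as the BPF program. */
int model_cred_prepare(int prev_ret, long syscall_nr, unsigned long flags,
                       bool has_cap_sys_admin)
{
    if (prev_ret)
        return prev_ret;              /* an earlier hook already denied */
    if (syscall_nr != MODEL_UNSHARE_NR)
        return 0;                     /* not unshare: out of scope */
    if (!(flags & MODEL_CLONE_NEWUSER))
        return 0;                     /* no new user namespace requested */
    if (has_cap_sys_admin)
        return 0;                     /* privileged callers pass through */
    return -EPERM;
}
```

A model like this is handy for unit-testing policy changes before they go anywhere near the verifier: only the unprivileged unshare(CLONE_NEWUSER) case should come back as -EPERM.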

deny_unshare.c:

#include <stdarg.h>
#include <stdio.h>
#include <unistd.h>

#include <bpf/libbpf.h>
#include "deny_unshare.skel.h"

static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
{
    return vfprintf(stderr, format, args);
}

int main(int argc, char *argv[])
{
    struct deny_unshare_bpf *skel;
    int err = 1; // Reported as the exit status if loading fails

    libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
    libbpf_set_print(libbpf_print_fn);

    // Loads and verifies the BPF program
    skel = deny_unshare_bpf__open_and_load();
    if (!skel) {
        fprintf(stderr, "failed to load and verify BPF skeleton\n");
        goto cleanup;
    }

    // Attaches the loaded BPF program to the LSM hook
    err = deny_unshare_bpf__attach(skel);
    if (err) {
        fprintf(stderr, "failed to attach BPF skeleton\n");
        goto cleanup;
    }

    printf("LSM loaded! ctrl+c to exit.\n");

    // The BPF link is not pinned, therefore exiting will remove program
    for (;;) {
        fprintf(stderr, ".");
        sleep(1);
    }

cleanup:
    deny_unshare_bpf__destroy(skel);
    return err;
}

Lastly, to compile, we use the following Makefile:

Makefile:

CLANG ?= clang-13
LLVM_STRIP ?= llvm-strip-13
ARCH := x86
INCLUDES := -I/usr/include -I/usr/include/x86_64-linux-gnu
LIBS_DIR := -L/usr/lib/lib64 -L/usr/lib/x86_64-linux-gnu
LIBS := -lbpf -lelf

.PHONY: all clean run

all: deny_unshare.skel.h deny_unshare.bpf.o deny_unshare

run: all
	sudo ./deny_unshare

clean:
	rm -f *.o
	rm -f deny_unshare.skel.h

#
# BPF is kernel code. We need to pass -D__KERNEL__ to refer to fields present
# in the kernel version of pt_regs struct. uAPI version of pt_regs (from ptrace)
# has different field naming.
# See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd56e0058412fb542db0e9556f425747cf3f8366
#
deny_unshare.bpf.o: deny_unshare.bpf.c
	$(CLANG) -g -O2 -Wall -target bpf -D__KERNEL__ -D__TARGET_ARCH_$(ARCH) $(INCLUDES) -c $< -o $@
	$(LLVM_STRIP) -g $@ # Removes debug information

deny_unshare.skel.h: deny_unshare.bpf.o
	sudo bpftool gen skeleton $< > $@

deny_unshare: deny_unshare.c deny_unshare.skel.h
	$(CC) -g -Wall -c $< -o $@.o
	$(CC) -g -o $@ $(LIBS_DIR) $@.o $(LIBS)

.DELETE_ON_ERROR:

Result

In a new terminal window run:

$ make run
…
LSM loaded! ctrl+c to exit.

In another terminal window, we're successfully blocked!

$ unshare -rU
unshare: unshare failed: Cannot allocate memory
$ id
uid=1000(fred) gid=1000(fred) groups=1000(fred) …

The policy has an additional feature to always allow privileged callers to pass through:

$ sudo unshare -rU
# id
uid=0(root) gid=0(root) groups=0(root)

In the unprivileged case the syscall aborts early. What is the performance impact in the privileged case?

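Measure performance

We're going to use a one-line unshare that maps the user namespace and executes a command inside it for the measurements:

$ unshare -frU --kill-child -- bash -c "exit 0"

With a resolution of CPU cycles for the unshare syscall enter/exit, we'll measure the following as root user: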

Command run without the policy
Command run with the policy

We'll record the measurements with ftrace:

$ sudo su
# cd /sys/kernel/debug/tracing
# echo 1 > events/syscalls/sys_enter_unshare/enable ; echo 1 > events/syscalls/sys_exit_unshare/enable

At this point, we have enabled tracing for the syscall enter and exit of unshare specifically. Now we set the time resolution of our enter/exit calls to count CPU cycles:

# echo 'x86-tsc' > trace_clock

Next we begin our measurements:

# unshare -frU --kill-child -- bash -c "exit 0" &
[1] 92014

Run the policy in a new terminal window, and then run our next syscall:

# unshare -frU --kill-child -- bash -c "exit 0" &
[2] 92019

Now we have our two calls for comparison:

# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
         unshare-92014   [002] ..... 762950852559027: sys_unshare(unshare_flags: 10000000)
         unshare-92014   [002] ..... 762950852622321: sys_unshare -> 0x0
         unshare-92019   [007] ..... 762975980681895: sys_unshare(unshare_flags: 10000000)
         unshare-92019   [007] ..... 762975980752033: sys_unshare -> 0x0

unshare-92014 used 63294 cycles.

unshare-92019 used 70138 cycles.

That is a penalty of 6,844 cycles (about 10%) between the two measurements. Not bad!
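The cycle counts come straight from subtracting the enter timestamp from the exit timestamp of each traced call. As a small worked example, the sketch below redoes that arithmetic on the four timestamps from the trace; the struct and function names are ours.

```c
#include <stdint.h>

/* TSC readings at sys_enter_unshare and sys_exit_unshare for one call. */
struct trace_pair {
    uint64_t enter_tsc;
    uint64_t exit_tsc;
};

/* Cycles spent inside the syscall. */
uint64_t cycles_spent(struct trace_pair p)
{
    return p.exit_tsc - p.enter_tsc;
}

/* Relative overhead of the second measurement over the first, in percent. */
double overhead_percent(uint64_t baseline, uint64_t with_policy)
{
    return 100.0 * ((double)with_policy - (double)baseline) / (double)baseline;
}
```

Plugging in the trace values gives 63294 and 70138 cycles, a difference of 6844 cycles, or a little under 11% relative to the run without the policy.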

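These numbers are for a single syscall, and they add up the more frequently the code is called. Unshare is typically called at task creation, not repeatedly during the normal execution of a program. Careful consideration and measurement are needed for your use case.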

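Outro

We learned a bit about what LSM BPF is, how unshare is used to map a user to root, and how to solve a real-world problem by implementing a solution in eBPF. Tracking down the appropriate hook is not an easy task; it takes some experimentation and a lot of reading of kernel code. Fortunately, the rest is comparatively straightforward. Because the policy is written in C, we can tweak it granularly to fit our problem. This means one may extend this policy with an allow-list to let certain programs or users continue to use an unprivileged unshare. Finally, we looked at the performance impact of this program, and saw that the overhead is worth it to block the attack vector.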
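As a rough idea of what the allow-list extension mentioned above could look like, here is a user-space model of the extra check. It is a sketch under our own assumptions, not part of the policy shown in this post: in a real LSM BPF program the list would more likely live in a BPF map and the caller's UID would come from a helper such as bpf_get_current_uid_gid(). All names below are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Linear scan is enough for a short, hypothetical allow-list of UIDs. */
bool uid_is_allowed(const unsigned int *allowed, size_t count, unsigned int uid)
{
    for (size_t i = 0; i < count; i++) {
        if (allowed[i] == uid)
            return true;
    }
    return false;
}

/* Final decision for an unshare(CLONE_NEWUSER) call: privileged callers
 * and allow-listed UIDs pass, everyone else is denied. */
int decide_with_allow_list(const unsigned int *allowed, size_t count,
                           unsigned int uid, bool has_cap_sys_admin)
{
    if (has_cap_sys_admin || uid_is_allowed(allowed, count, uid))
        return 0;
    return -EPERM;
}
```

With an allow-list of `{ 1000, 1001 }`, UID 1000 keeps its unprivileged unshare while UID 1002 is denied unless it holds CAP_SYS_ADMIN.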

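"Cannot allocate memory" is not a clear error message for a denied permission. We proposed a patch to propagate error codes from the cred_prepare hook up the call stack. Ultimately, we came to the conclusion that a new hook is better suited to this problem. Stay tuned!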
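Why does a policy that returns -EPERM surface as "Cannot allocate memory"? As far as we can tell from the kernel sources of the versions tested here, prepare_creds() returns NULL whenever the hook fails, and unshare_userns() reports any NULL cred as -ENOMEM. The following is a simplified user-space model of that path; the _model names and the placeholder struct are ours.

```c
#include <errno.h>
#include <stddef.h>

/* Placeholder for struct cred; its contents do not matter for the model. */
struct cred_model {
    int unused;
};

/* Model of prepare_creds(): any hook failure collapses to a NULL result. */
struct cred_model *prepare_creds_model(struct cred_model *storage, int hook_rc)
{
    if (hook_rc < 0)
        return NULL; /* the hook's specific error code is lost here */
    return storage;
}

/* Model of unshare_userns(): a NULL cred is reported as -ENOMEM. */
int unshare_userns_model(int hook_rc)
{
    struct cred_model storage = { 0 };
    struct cred_model *cred = prepare_creds_model(&storage, hook_rc);

    if (!cred)
        return -ENOMEM;
    return 0; /* create_user_ns() would run at this point */
}
```

In this model the caller sees -ENOMEM no matter which negative value the hook returned, which matches the message printed by unshare(1) in the Result section.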

Cloudflare Blog
