USER namespaces power the functionality of our favorite tools such as docker and podman. We wrote about Linux namespaces back in June and explained them like this:
Most of the namespaces are uncontroversial, like the UTS namespace which allows the host system to hide its hostname and time. Others are complex but straightforward - NET and NS (mount) namespaces are known to be hard to wrap your head around. Finally, there is this very special, very curious USER namespace. USER namespace is special since it allows the - typically unprivileged owner to operate as "root" inside it. It's a foundation to having tools like Docker to not operate as true root, and things like rootless containers.
Due to its nature, allowing unprivileged users access to USER namespace always carried a great security risk. With its help the unprivileged user can in fact run code that typically requires root. This code is often under-tested and buggy. Today we will look into one such case where USER namespaces are leveraged to exploit a kernel bug that can result in an unprivileged denial of service attack.
Enter Linux Traffic Control queue disciplines
In 2019, we were exploring leveraging Linux Traffic Control's queue discipline (qdisc) to schedule packets for one of our services with the Hierarchy Token Bucket (HTB) classful qdisc strategy. Linux Traffic Control is a user-configured system to schedule and filter network packets. Queue disciplines are the strategies in which packets are scheduled. In particular, we wanted to filter and schedule certain packets from an interface, and drop others into the noqueue qdisc.
noqueue is a special case qdisc, such that packets are supposed to be dropped when scheduled into it. In practice, this is not the case. Linux handles noqueue such that packets are passed through and not dropped (for the most part). The documentation states as much. It also states that “It is not possible to assign the noqueue queuing discipline to physical devices or classes.” So what happens when we assign noqueue to a class?
Let's write some shell commands to show the problem in action:
1. $ sudo -i
2. # dev=enp0s5
3. # tc qdisc replace dev $dev root handle 1: htb default 1
4. # tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
5. # tc qdisc add dev $dev parent 1:1 handle 10: noqueue
First we need to log in as root because that gives us CAP_NET_ADMIN to be able to configure traffic control.
We then assign a network interface to a variable. These can be found with
ip a
. Virtual interfaces can be located by callingls /sys/devices/virtual/net
. These will match with the output fromip a
.Our interface is currently assigned to the pfifo_fast qdisc, so we replace it with the HTB classful qdisc and assign it the handle of
1:
. We can think of this as the root node in a tree. The “default 1” configures this such that unclassified traffic will be routed directly through this qdisc which falls back to pfifo_fast queuing. (more on this later)Next we add a class to our root qdisc
1:
, assign it to the first leaf node 1 of root 1:1:1
, and give it some reasonable configuration defaults.Lastly, we add the noqueue qdisc to our first leaf node in the hierarchy:
1:1
. This effectively means traffic routed here will be scheduled to noqueue
Assuming our setup executed without a hitch, we will receive something similar to this kernel panic:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
...
Call Trace:
<TASK>
htb_enqueue+0x1c8/0x370
dev_qdisc_enqueue+0x15/0x90
__dev_queue_xmit+0x798/0xd00
...
</TASK>
We know that the root user is responsible for setting qdisc on interfaces, so if root can crash the kernel, so what? We just do not apply noqueue qdisc to a class id of a HTB qdisc:
# dev=enp0s5
# tc qdisc replace dev $dev root handle 1: htb default 1
# tc class add dev $dev parent 1: classid 1:2 htb rate 10mbit // A
// B is missing, so anything not filtered into 1:2 will be pfifio_fast
Here, we leveraged the default case of HTB where we assign a class id 1:2 to be rate-limited (A), and implicitly did not set a qdisc to another class such as id 1:1 (B). Packets queued to (A) will be filtered to HTB_DIRECT and packets queued to (B) will be filtered into pfifo_fast.
Because we were not familiar with this part of the codebase, we notified the mailing lists and created a ticket. The bug did not seem all that important to us at that time.
Fast-forward to 2022, we are pushing USER namespace creation hardening. We extended the Linux LSM framework with a new LSM hook: userns_create to leverage eBPF LSM for our protections, and encourage others to do so as well. Recently while combing our ticket backlog, we rethought this bug. We asked ourselves, “can we leverage USER namespaces to trigger the bug?” and the short answer is yes!
Demonstrating the bug
The exploit can be performed with any classful qdisc that assumes a struct Qdisc.enqueue function to not be NULL (more on this later), but in this case, we are demonstrating just with HTB.
$ unshare -rU –net
$ dev=lo
$ tc qdisc replace dev $dev root handle 1: htb default 1
$ tc class add dev $dev parent 1: classid 1:1 htb rate 10mbit
$ tc qdisc add dev $dev parent 1:1 handle 10: noqueue
$ ping -I $dev -w 1 -c 1 1.1.1.1
We use the “lo” interface to demonstrate that this bug is triggerable with a virtual interface. This is important for containers because they are fed virtual interfaces most of the time, and not the physical interface. Because of that, we can use a container to crash the host as an unprivileged user, and thus perform a denial of service attack.
Why does that work?
To understand the problem a bit better, we need to look back to the original patch series, but specifically this commit that introduced the bug. Before this series, achieving noqueue on interfaces relied on a hack that would set a device qdisc to noqueue if the device had a tx_queue_len = 0. The commit d66d6c3152e8 (“net: sched: register noqueue qdisc”) circumvents this by explicitly allowing noqueue to be added with the tc
command without needing to get around that limitation.
The way the kernel checks for whether we are in a noqueue case or not, is to simply check if a qdisc has a NULL enqueue() function. Recall from earlier that noqueue does not necessarily drop packets in practice? After that check in the fail case, the following logic handles the noqueue functionality. In order to fail the check, the author had to cheat a reassignment from noop_enqueue() to NULL by making enqueue = NULL in the init which is called way after register_qdisc() during runtime.
Here is where classful qdiscs come into play. The check for an enqueue function is no longer NULL. In this call path, it is now set to HTB (in our example) and is thus allowed to enqueue the struct skb to a queue by making a call to the function htb_enqueue(). Once in there, HTB performs a lookup to pull in a qdisc assigned to a leaf node, and eventually attempts to queue the struct skb to the chosen qdisc which ultimately reaches this function:
include/net/sch_generic.h
static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch,
struct sk_buff **to_free)
{
qdisc_calculate_pkt_len(skb, sch);
return sch->enqueue(skb, sch, to_free); // sch->enqueue == NULL
}
We can see that the enqueueing process is fairly agnostic from physical/virtual interfaces. The permissions and validation checks are done when adding a queue to an interface, which is why the classful qdics assume the queue to not be NULL. This knowledge leads us to a few solutions to consider.
Solutions
We had a few solutions ranging from what we thought was best to worst:
Follow tc-noqueue documentation and do not allow noqueue to be assigned to a classful qdisc
Instead of checking for NULL, check for struct noqueue_qdisc_ops, and reset noqueue to back to noop_enqueue
For each classful qdisc, check for NULL and fallback
While we ultimately went for the first option: "disallow noqueue for qdisc classes", the third option creates a lot of churn in the code, and does not solve the problem completely. Future qdiscs implementations could forget that important check as well as the maintainers. However, the reason for passing on the second option is a bit more interesting.
The reason we did not follow that approach is because we need to first answer these questions:
Why not allow noqueue for classful qdiscs?
This contradicts the documentation. The documentation does have some precedent for not being totally followed in practice, but we will need to update that to reflect the current state. This is fine to do, but does not address the behavior change problem other than remove the NULL dereference bug.
What behavior changes if we do allow noqueue for qdiscs?
This is harder to answer because we need to determine what that behavior should be. Currently, when noqueue is applied as the root qdisc for an interface, the path is to essentially allow packets to be processed. Claiming a fallback for classes is a different matter. They may each have their own fallback rules, and how do we know what is the right fallback? Sometimes in HTB the fallback is pass-through with HTB_DIRECT, sometimes it is pfifo_fast. What about the other classes? Perhaps instead we should fall back to the default noqueue behavior as it is for root qdiscs?
We felt that going down this route would only add confusion and additional complexity to queuing. We could also make an argument that such a change could be considered a feature addition and not necessarily a bug fix. Suffice it to say, adhering to the current documentation seems to be the more appealing approach to prevent the vulnerability now, while something else can be worked out later.
Takeaways
First and foremost, apply this patch as soon as possible. And consider hardening USER namespaces on your systems by setting sysctl -w
kernel.unprivileged_userns_clone
=0
, which only lets root create USER namespaces in Debian kernels, sysctl -w
user.max_user_namespaces
=[number]
for a process hierarchy, or consider backporting these two patches: [security_create_user_ns()](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cd4c5c2101cb092db00f61f69d24380cf7a0ee8)
and the SELinux implementation (now in Linux 6.1.x) to allow you to protect your systems with either eBPF or SELinux. If you are sure you're not using USER namespaces and in extreme cases, you might consider turning the feature off with CONFIG_USERNS=n
. This is just one example of many where namespaces are leveraged to perform an attack, and more are surely to crop up in varying levels of severity in the future.
Special thanks to Ignat Korchagin and Jakub Sitnicki for code reviews and helping demonstrate the bug in practice.