A few months ago, after reading about Cloudflare doubling its intern class size, I quickly dusted off my CV and applied for an internship. Long story short: a couple of months later, I found myself staring into Linux kernel code and adding a pretty cool feature to gVisor, a Linux container runtime.
My internship was under the Emerging Technologies and Incubation group on a project involving gVisor. A co-worker contacted my team about not being able to read the debug symbols of stack traces inside the sandbox. For example, when the isolated process crashed, this is what we saw in the logs:
*** Check failure stack trace: ***
    @ 0x7ff5f69e50bd (unknown)
    @ 0x7ff5f69e9c9c (unknown)
    @ 0x7ff5f69e4dbd (unknown)
    @ 0x7ff5f69e55a9 (unknown)
    @ 0x5564b27912da (unknown)
    @ 0x7ff5f650ecca (unknown)
    @ 0x5564b27910fa (unknown)
Obviously, this wasn't very useful. I eagerly volunteered to fix this stack unwinding code - how hard could it be?
After some debugging, we found that the logging library used in the project opened /proc/self/mem to look for ELF headers at the start of each memory-mapped region. This was necessary to calculate an offset to find the correct addresses for debug symbols.
It turns out this mechanism is rather common. The stack unwinding code often runs in unusual contexts - like a SIGSEGV handler - so it would not be appropriate to read the ELF headers by dereferencing raw memory addresses directly, since that could trigger another SIGSEGV. A failed read() on /proc/self/mem, by contrast, simply returns an error. And a SIGSEGV inside a SIGSEGV handler means either termination via the default handler for a segfault or recursing into the same handler again and again (if one sets SA_NODEFER), leading to a stack overflow.
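To make the lookup concrete, here is a minimal sketch in Python (chosen for brevity; this is not the logging library's actual code). It walks /proc/self/maps and reads the first four bytes of each file-backed mapping through /proc/self/mem, looking for the ELF magic. The function name `find_elf_regions` and the filtering rules (absolute path, file offset 0) are my own assumptions for illustration.

```python
ELF_MAGIC = b"\x7fELF"


def find_elf_regions(maps_path="/proc/self/maps", mem_path="/proc/self/mem"):
    """Return (start_address, pathname) for mappings that begin with an ELF header.

    Hypothetical helper: reads go through the mem file, so a bad address
    surfaces as an OSError instead of a SIGSEGV.
    """
    found = []
    with open(maps_path) as maps, open(mem_path, "rb", buffering=0) as mem:
        for line in maps:
            # Format: start-end perms offset dev inode pathname
            fields = line.split(maxsplit=5)
            if len(fields) < 6:
                continue  # anonymous mapping without a pathname
            pathname = fields[5].strip()
            if not pathname.startswith("/"):
                continue  # skip [heap], [stack], [vdso], ...
            if int(fields[2], 16) != 0:
                continue  # an ELF header can only sit at file offset 0
            start = int(fields[0].split("-")[0], 16)
            try:
                mem.seek(start)
                header = mem.read(len(ELF_MAGIC))
            except OSError:
                continue  # unreadable region: an error code, not a crash
            if header == ELF_MAGIC:
                found.append((start, pathname))
    return found
```

Calling `find_elf_regions()` in a normal Linux process should list at least the interpreter binary and its shared libraries; a real unwinder would go on to parse the headers to compute symbol offsets.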
However, inside gVisor, each attempt to open /proc/self/mem resulted in ENOENT, because the entire /proc/self/mem file was missing. In order to provide a robust sandbox, gVisor has to carefully reimplement the Linux kernel interfaces. This particular /proc file was simply unimplemented in the virtual file system of Sentry, one of gVisor's sandboxing components.
Marek asked the devs on the project chat and got confirmation - they would be happy to accept a patch implementing this file.
The easy way out would have been to make a small, local patch to the unwinder behavior, yet I found myself diving into the Linux kernel, trying to figure out how the mem file worked in an attempt to implement it in Sentry's VFS.
The file itself is quite powerful, because it allows raw access to the virtual address space of a process. According to the manpages, the documented file operations are open(), read() and lseek(). Typical use cases are debugging tasks or dumping process memory.
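As a small illustration of "raw access", the following Python sketch reads a buffer of the current process back through /proc/self/mem. The helper names `read_own_memory` and `demo` are hypothetical; `ctypes` is only used to obtain a real virtual address to seek to.

```python
import ctypes


def read_own_memory(address: int, length: int) -> bytes:
    """Read `length` bytes at virtual address `address` of this process."""
    # The file offset is interpreted as a virtual address.
    with open("/proc/self/mem", "rb", buffering=0) as mem:
        mem.seek(address)
        return mem.read(length)


def demo() -> bytes:
    # Place some bytes at a known address, then fetch them via the mem file.
    message = ctypes.create_string_buffer(b"hello from /proc/self/mem")
    return read_own_memory(ctypes.addressof(message), len(message.value))
```

On Linux, `demo()` should return the same bytes that were placed in the buffer, without the code ever dereferencing the address itself.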
Opening the file
When a process wants to open the file, the kernel does the file permissions check, looks up the associated operations for mem and invokes a method called proc_mem_open. It retrieves the associated task and calls a method named mm_access, which the kernel source describes as follows:
/*
 * Grab a reference to a task's mm, if it is not already going away
 * and ptrace_may_access with the mode parameter passed to it
 * succeeds.
 */
Seems relatively straightforward, right? The special thing about mm_access is that it verifies the permissions the current task has regarding the task to which the memory belongs. If the current task and target task do not share the same memory manager, the kernel invokes ptrace_may_access, the check named in the comment above:
/*
 * May we inspect the given task?
 * This check is used both for attaching with ptrace
 * and for allowing access to sensitive information in /proc.
 *
 * ptrace_attach denies several cases that /proc allows
 * because setting up the necessary parent/child relationship
 * or halting the specified task is impossible.
 */
According to the manpages, a process which would like to read from an unrelated /proc/[pid]/mem file should have access mode PTRACE_MODE_ATTACH_FSCREDS. This check does not verify that a process is attached via PTRACE_ATTACH, but rather whether it has the permission to attach with the specified credentials mode.
After skimming through the function, you will see that a process is allowed access if the current task belongs to the same thread group as the target task. Otherwise, access is denied unless at least one of the following conditions is met (depending on whether PTRACE_MODE_REALCREDS is set, the check uses either the real UID/GID or the file-system UID/GID, which is typically the same as the effective UID/GID):
- the current task's credentials (UID, GID) match up with the credentials (real, effective and saved set-UID/GID) of the target process
- the current task has CAP_SYS_PTRACE inside the user namespace of the target process
In the next check, access is denied if the current task does not have CAP_SYS_PTRACE inside the user namespace of the target task and the target's dumpable attribute is not set to SUID_DUMP_USER. The dumpable attribute is typically required to allow producing core dumps.
After these three checks, we also go through the commoncap Linux Security Module (and other LSMs) to verify our access mode is fine. LSMs you may know are SELinux and AppArmor. The commoncap LSM performs the checks on the basis of effective or permitted process capabilities (effective in FSCREDS mode, permitted in REALCREDS mode), allowing access if
- the capabilities of the current task are a superset of the capabilities of the target task, or
- the current task has CAP_SYS_PTRACE in the target task's user namespace
In conclusion, one has access (with only commoncap LSM checks active) if:
- the current task is in the same thread group as the target task, or
- the current task has CAP_SYS_PTRACE in the target task's user namespace, or
- the credentials of the current and target task match up in the given credentials mode, the target task is dumpable, they run in the same user namespace and the target task's capabilities are a subset of the current task's capabilities
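The three-way summary above can be written down as a small decision function. The following Python model is only a sketch of that summary, not kernel or gVisor code: the `Creds` and `Task` types, their fields and `may_access` are hypothetical, user namespaces are reduced to plain names (no namespace hierarchy), and other LSMs are ignored.

```python
from dataclasses import dataclass

CAP_SYS_PTRACE = "CAP_SYS_PTRACE"


@dataclass(frozen=True)
class Creds:
    """Hypothetical bundle of real, effective, saved and file-system IDs."""
    real: int
    effective: int
    saved: int
    fs: int

    @classmethod
    def plain(cls, ident: int) -> "Creds":
        return cls(ident, ident, ident, ident)


@dataclass(frozen=True)
class Task:
    """Hypothetical, heavily simplified stand-in for a kernel task."""
    tgid: int                       # thread group ID
    uid: Creds
    gid: Creds
    user_ns: str = "init"
    effective: frozenset = frozenset()
    permitted: frozenset = frozenset()
    dumpable: bool = True


def _ids_match(caller: Creds, target: Creds, realcreds: bool) -> bool:
    # REALCREDS compares the caller's real ID, FSCREDS its file-system ID.
    ident = caller.real if realcreds else caller.fs
    return ident == target.real == target.effective == target.saved


def may_access(current: Task, target: Task, realcreds: bool = False) -> bool:
    # 1. Same thread group: always allowed.
    if current.tgid == target.tgid:
        return True
    same_ns = current.user_ns == target.user_ns
    # 2. CAP_SYS_PTRACE in the target's user namespace (simplified to "same namespace").
    if same_ns and CAP_SYS_PTRACE in current.effective:
        return True
    # 3. Matching credentials, dumpable target, same namespace, capability superset.
    caller_caps = current.permitted if realcreds else current.effective
    return (
        _ids_match(current.uid, target.uid, realcreds)
        and _ids_match(current.gid, target.gid, realcreds)
        and target.dumpable
        and same_ns
        and target.permitted <= caller_caps
    )
```

For example, two unprivileged processes of the same user may inspect each other under this model, while an unprivileged process is refused access to a root-owned one unless it holds CAP_SYS_PTRACE.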
I highly recommend reading through the ptrace manpages to dig deeper into the different modes, options and checks.
Reading from the file
Since all the access checks occur when opening the file, reading from it is quite straightforward. When one invokes read() on a mem file, it calls mem_rw (which can actually do both reading and writing).
To avoid using lots of memory, mem_rw performs the copy in a loop and buffers the data in an intermediate page.
mem_rw has a hidden superpower: it uses FOLL_FORCE to bypass permission checks on user-owned pages (treating pages marked as non-readable or non-writable as readable and writable).
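The effect of FOLL_FORCE can be observed from user space. The sketch below (Python, with a hypothetical helper name `read_protected_page`) writes some bytes into an anonymous page, revokes all access with mprotect(PROT_NONE) and then reads the bytes back through /proc/self/mem. I am assuming a kernel that applies FOLL_FORCE to this file unconditionally; a kernel or sandbox configured to restrict it would raise an OSError from the read instead.

```python
import ctypes
import mmap

PROT_NONE = 0


def read_protected_page(payload: bytes) -> bytes:
    """Read `payload` back from a page that has been made PROT_NONE."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mprotect.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int]
    libc.mprotect.restype = ctypes.c_int

    size = mmap.PAGESIZE
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    region[: len(payload)] = payload

    # Obtain the page's virtual address, then drop the temporary view so
    # the mapping can be closed again later.
    view = ctypes.c_char.from_buffer(region)
    addr = ctypes.addressof(view)
    del view

    try:
        if libc.mprotect(addr, size, PROT_NONE) != 0:
            raise OSError(ctypes.get_errno(), "mprotect(PROT_NONE) failed")
        # A direct load from `addr` would now fault; the mem file still works.
        with open("/proc/self/mem", "rb", buffering=0) as mem:
            mem.seek(addr)
            return mem.read(len(payload))
    finally:
        # Restore access before tearing the mapping down.
        libc.mprotect(addr, size, mmap.PROT_READ | mmap.PROT_WRITE)
        region.close()
```

This is exactly the property that makes the file attractive for debuggers, and also why the access checks at open time matter so much.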
mem_rw has other specialties, such as its error handling. Some interesting cases are:
- if the target task has exited after opening the file descriptor, performing read() will always succeed with reading 0 bytes
- if the initial copy from the target task's memory to the intermediate page fails, it returns an error only if no data has been read yet; otherwise it returns the data read so far
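The loop and its error handling can be modelled in a few lines. This is a simplified Python model of the behavior described above, not a translation of the kernel function: `mem_rw_read`, the `read_at` callback (which returns however many bytes it could copy, possibly none) and the `mm_alive` flag are hypothetical names, and using EIO for the "nothing copied" case is my reading of the kernel source.

```python
import errno

PAGE_SIZE = 4096


def mem_rw_read(read_at, addr, count, mm_alive=True, page_size=PAGE_SIZE):
    """Copy up to `count` bytes starting at `addr`, one page-sized chunk at a time."""
    if not mm_alive:
        return b""  # target has exited: the read succeeds with 0 bytes
    copied = bytearray()
    while count > 0:
        this_len = min(count, page_size)
        page = read_at(addr, this_len)  # stands in for the intermediate page
        if not page:
            if not copied:
                # Nothing was transferred at all: report an error.
                raise OSError(errno.EIO, "could not read target memory")
            break  # partial success: return what we already have
        copied += page
        addr += len(page)
        count -= len(page)
    return bytes(copied)


def make_fake_memory(base, data):
    """Build a read_at callback over a single mapped region (for illustration)."""
    def read_at(addr, length):
        offset = addr - base
        if offset < 0 or offset >= len(data):
            return b""
        return data[offset:offset + length]
    return read_at
```

With a 6000-byte region, a request for 10000 bytes returns the 6000 that exist, while a request that starts in unmapped memory fails outright.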
You can also perform lseek on the file, excluding SEEK_END.
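A quick way to check the seek behavior is to try the different whence values on /proc/self/mem. In this Python sketch, `probe_seek` is a hypothetical helper; as far as I can tell from the kernel source, only SEEK_SET and SEEK_CUR are handled and anything else is rejected with EINVAL.

```python
import errno
import os


def probe_seek(path="/proc/self/mem"):
    """Return the resulting offset (or errno name) for each whence value."""
    results = {}
    fd = os.open(path, os.O_RDONLY)
    try:
        results["SEEK_SET"] = os.lseek(fd, 4096, os.SEEK_SET)
        results["SEEK_CUR"] = os.lseek(fd, 16, os.SEEK_CUR)
        try:
            results["SEEK_END"] = os.lseek(fd, 0, os.SEEK_END)
        except OSError as exc:
            # Record the symbolic error name instead of propagating.
            results["SEEK_END"] = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
    return results
```

An address space has no meaningful "end of file", which is one plausible reason the relative-to-end variant is not supported.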
How it works in gVisor
Luckily, gVisor already implemented kernel.task.CanTrace, so one can avoid reimplementing all the ptrace access logic. However, the implementation in gVisor is less complicated due to the lack of support for PTRACE_MODE_FSCREDS (which is still an open issue).
When a new file descriptor is opened, the GetFile method of the virtual Inode is invoked, so this is where the access check naturally happens. After a successful access check, the method returns an fs.File, which implements all the file operations you would expect, such as Read() and Write(). gVisor also provides tons of primitives for quickly building a working file structure, so that one does not have to reimplement a generic lseek(), for example.
When a task invokes a Read() call on the file, the Read method retrieves the memory manager of the file's Task.
Accessing the task's memory manager is a breeze with comfortable CopyIn and CopyOut methods, with interfaces similar to
After implementing all of this, we finally got a useful stack trace.
*** Check failure stack trace: ***
    @ 0x7f190c9e70bd google::LogMessage::Fail()
    @ 0x7f190c9ebc9c google::LogMessage::SendToLog()
    @ 0x7f190c9e6dbd google::LogMessage::Flush()
    @ 0x7f190c9e75a9 google::LogMessageFatal::~LogMessageFatal()
    @ 0x55d6f718c2da main
    @ 0x7f190c510cca __libc_start_main
    @ 0x55d6f718c0fa _start
A comprehensive victory! The /proc/[pid]/mem file is an important mechanism that gives insight into the contents of process memory. It is essential for stack unwinders to do their work in case of complicated and unforeseeable failures. Because process memory contains highly sensitive information, access to the file is determined by a complex set of poorly documented rules. With a bit of effort, you can emulate /proc/[pid]/mem inside gVisor's sandbox, where the process only has access to the subset of procfs that has been implemented by the gVisor authors and, as a result, you can get an easily readable stack trace in case of a crash.