
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Sat, 04 Apr 2026 10:24:17 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Diving into /proc/[pid]/mem]]></title>
            <link>https://blog.cloudflare.com/diving-into-proc-pid-mem/</link>
            <pubDate>Tue, 27 Oct 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ A few months ago, after reading about Cloudflare doubling its intern class, I quickly dusted off my CV and applied for an internship. Long story short: now, a couple of months later, I found myself staring at Linux kernel code and adding a pretty cool feature to gVisor. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>A few months ago, after reading about <a href="/cloudflare-doubling-size-of-2020-summer-intern-class/">Cloudflare doubling its intern class size</a>, I quickly dusted off my CV and applied for an internship. Long story short: now, a couple of months later, I found myself staring into Linux kernel code and adding a pretty cool feature <a href="https://gvisor.dev/">to gVisor, a Linux container runtime</a>.</p><p>My internship was under the Emerging Technologies and Incubation group on a project involving gVisor. A co-worker contacted my team about not being able to read the debug symbols of stack traces inside the sandbox. For example, when the isolated process crashed, this is what we saw in the logs:</p>
            <pre><code>*** Check failure stack trace: ***
    @     0x7ff5f69e50bd  (unknown)
    @     0x7ff5f69e9c9c  (unknown)
    @     0x7ff5f69e4dbd  (unknown)
    @     0x7ff5f69e55a9  (unknown)
    @     0x5564b27912da  (unknown)
    @     0x7ff5f650ecca  (unknown)
    @     0x5564b27910fa  (unknown)</code></pre>
            <p>Obviously, this wasn't very useful. I eagerly volunteered to fix this stack unwinding code - how hard could it be?</p><p>After some debugging, we found that the logging library used in the project opened <code>/proc/self/mem</code> to look for ELF headers at the start of each memory-mapped region. This was necessary to calculate an offset to find the correct addresses for debug symbols.</p><p>It turns out this mechanism is rather common. The stack unwinding code is often run in weird contexts - like a SIGSEGV handler - so it would not be appropriate to dig over real memory addresses back and forth to read the ELF. This could trigger another SIGSEGV. And SIGSEGV inside a SIGSEGV handler means either termination via the default handler for a segfault or recursing into the same handler again and again (if one sets <code>SA_NODEFER</code>) leading to a stack overflow.</p><p>However, inside gVisor, each call of <code>open()</code> on <code>/proc/self/mem</code> resulted in <code>ENOENT</code>, because the entire <code>/proc/self/mem</code> file was missing. In order to provide a robust sandbox, gVisor has to carefully reimplement the Linux kernel interfaces. This particular <code>/proc</code> file was simply unimplemented in the virtual file system of Sentry, one of gVisor's sandboxing components.<a href="/author/marek-majkowski/">Marek</a> asked the devs on the project chat and got confirmation - they would be happy to accept a patch implementing this file.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4r2jNRjbWTgK866HBQjSKe/2a32421d3b661d496a54dde61721b45b/image1-39.png" />
            
            </figure><p>The easy way out would have been to make a small, local patch to the unwinder behavior, yet I found myself diving into the Linux kernel trying to figure how the <code>mem</code> file worked in an attempt to implement it in Sentry's VFS.</p>
    <div>
      <h2>What does <code>/proc/[pid]/mem</code> do?</h2>
      <a href="#what-does-proc-pid-mem-do">
        
      </a>
    </div>
    <p>The file itself is quite powerful, because it allows raw access to the virtual address space of a process. <a href="https://man7.org/linux/man-pages/man5/proc.5.html">According to manpages</a>, the documented file operations are <code>open()</code>, <code>read()</code> and <code>lseek()</code>. Typical use cases are debugging tasks or dumping process memory.</p>
    <div>
      <h2>Opening the file</h2>
      <a href="#opening-the-file">
        
      </a>
    </div>
    <p>When a process wants to open the file, the kernel does the file permissions check, looks up the associated operations for <code>mem</code> and invokes a method called <code>proc_mem_open</code>. It retrieves the associated task and <a href="https://elixir.bootlin.com/linux/v5.9/source/include/linux/sched/mm.h#L119">calls a method named <code>mm_access</code></a>.</p>
            <pre><code>/*
 * Grab a reference to a task's mm, if it is not already going away
 * and ptrace_may_access with the mode parameter passed to it
 * succeeds.
 */</code></pre>
            <p>Seems relatively straightforward, right? The special thing about <code>mm_access</code> is that it verifies the permissions the current task has regarding the task to which the memory belongs. If the current task and target task do not share the same memory manager, the kernel <a href="https://elixir.bootlin.com/linux/v5.9/source/kernel/ptrace.c#L293">invokes a method named <code>__ptrace_may_access</code></a>.</p>
            <pre><code>/*
 * May we inspect the given task?
 * This check is used both for attaching with ptrace
 * and for allowing access to sensitive information in /proc.
 *
 * ptrace_attach denies several cases that /proc allows
 * because setting up the necessary parent/child relationship
 * or halting the specified task is impossible.
 *
 */</code></pre>
            <p><a href="https://man7.org/linux/man-pages/man5/proc.5.html">According to the manpages</a>, a process which would like to read from an unrelated <code>/proc/[pid]/mem</code> file should have access mode <a href="https://man7.org/linux/man-pages/man2/ptrace.2.html"><code>PTRACE_MODE_ATTACH_FSCREDS</code></a>. This check does not verify that a process is attached via <code>PTRACE_ATTACH</code>, but rather if it has the permission to attach with the specified credentials mode.</p>
    <div>
      <h2>Access checks</h2>
      <a href="#access-checks">
        
      </a>
    </div>
    <p>After skimming through the function, you will see that a process is allowed access if the current task belongs to the same thread group as the target task, or denied access (depending on whether <code>PTRACE_MODE_FSCREDS</code> or <code>PTRACE_MODE_REALCREDS</code> is set, we will use either the file-system UID / GID, which is typically the same as the effective UID/GID, or the real UID / GID) if none of the following conditions are met:</p><ul><li><p>the current task's credentials (UID, GID) match up with the credentials (real, effective and saved set-UID/GID) of the target process</p></li><li><p>the current task has <code>CAP_SYS_PTRACE</code> inside the user namespace of the target process</p></li></ul><p>In the next check, access is denied if the current task has neither <code>CAP_SYS_PTRACE</code> inside the user namespace of the target task, nor the target's dumpable attribute is set to <code>SUID_DUMP_USER</code>. <a href="https://man7.org/linux/man-pages/man2/prctl.2.html">The dumpable attribute</a> is typically required to allow producing core dumps.</p><p>After these three checks, we also go through the commoncap Linux Security Module (and other LSMs) to verify our access mode is fine. LSMs you may know are SELinux and AppArmor. The commoncap LSM performs the checks on the basis of effective or permitted process capabilities (depending on the mode being <code>FSCREDS</code> or <code>REALCREDS</code>), allowing access if</p><ul><li><p>the capabilities of the current task are a superset of the capabilities of the target task, or</p></li><li><p>the current task has <code>CAP_SYS_PTRACE</code> in the target task's user namespace</p></li></ul><p>In conclusion, one has access (with only commoncap LSM checks active) if:</p><ul><li><p>the current task is in the same task group as the target task, or</p></li><li><p>the current task has <code>CAP_SYS_PTRACE</code> in the target task's user namespace, or</p></li><li><p>the credentials of the current and target task match up in the given credentials mode, the target task is dumpable, they run in the same user namespace and the target task's capabilities are a subset of the current task's capabilities</p></li></ul><p>I highly recommend reading through the <a href="https://www.man7.org/linux/man-pages/man2/ptrace.2.html">ptrace manpages</a> to dig deeper into the different modes, options and checks.</p>
    <div>
      <h2>Reading from the file</h2>
      <a href="#reading-from-the-file">
        
      </a>
    </div>
    <p>Since all the access checks occur when opening the file, reading from it is quite straightforward. When one invokes <code>read()</code> on a <code>mem</code> file, <a href="https://elixir.bootlin.com/linux/v5.9/source/fs/proc/base.c#L835">it calls up <code>mem_rw</code></a> (which actually can do both reading and writing).</p><p>To avoid using lots of memory, <code>mem_rw</code> performs the copy in a loop and buffers the data in an intermediate page. <code>mem_rw</code> has a hidden superpower, that is, it uses <code>FOLL_FORCE</code> to avoid permission checks on user-owned pages (handling pages marked as non-readable/non-writable readable and writable).</p><p><code>mem_rw</code> has other specialties, such as its error handling. Some interesting cases are:</p><ul><li><p>if the target task has exited after opening the file descriptor, performing <code>read()</code> will always succeed with reading 0 bytes</p></li><li><p>if the initial copy from the target task's memory to the intermediate page fails, it does not always return an error but only if no data has been read</p></li></ul><p>You can also perform <code>lseek</code> on the file excluding <code>SEEK_END</code>.</p>
    <div>
      <h2>How it works in gVisor</h2>
      <a href="#how-it-works-in-gvisor">
        
      </a>
    </div>
    <p>Luckily, gVisor already implemented <code>ptrace_may_access</code> as <code>kernel.task.CanTrace</code>, so one can avoid reimplementing all the ptrace access logic. However, <a href="https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/ptrace.go;l=105;bpv=0;bpt=1">the implementation in gVisor</a> is less complicated due to the lack of support for <code>PTRACE_MODE_FSCREDS</code> (which is <a href="https://gvisor.dev/issue/260">still an open issue</a>).</p><p>When a new file descriptor is <code>open()</code>ed, the <code>GetFile</code> method of the virtual Inode is invoked, therefore this is where the access check naturally happens. After a successful access check, the <a href="https://pkg.go.dev/gvisor.dev/gvisor/pkg/sentry/fs#File">method returns a <code>fs.File</code></a>. The <code>fs.File</code> implements all the file operations you would expect such as <code>Read()</code> and <code>Write()</code>. gVisor also provides tons of primitives for quickly building a working file structure so that one does not have to reimplement a generic <code>lseek()</code> for example.</p><p>In case a task invokes a <code>Read()</code> call onto the <code>fs.File</code>, the <code>Read</code> method retrieves the memory manager of the file’s Task.<a href="https://pkg.go.dev/gvisor.dev/gvisor/pkg/sentry/mm#MemoryManager">Accessing the task's memory manager</a> is a breeze with comfortable <code>CopyIn</code> and <code>CopyOut</code> methods, with interfaces similar to <code>io.Writer</code> and <code>io.Reader</code>.</p><p>After implementing all of this, we finally got a useful stack trace.</p>
            <pre><code>*** Check failure stack trace: ***
    @     0x7f190c9e70bd  google::LogMessage::Fail()
    @     0x7f190c9ebc9c  google::LogMessage::SendToLog()
    @     0x7f190c9e6dbd  google::LogMessage::Flush()
    @     0x7f190c9e75a9  google::LogMessageFatal::~LogMessageFatal()
    @     0x55d6f718c2da  main
    @     0x7f190c510cca  __libc_start_main
    @     0x55d6f718c0fa  _start</code></pre>
            
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>A comprehensive victory! The <code>/proc/&lt;pid&gt;/mem</code> file is an important mechanism that gives insight into contents of process memory. It is essential to stack unwinders to do their work in case of complicated and unforeseeable failures. Because the process memory contains highly-sensitive information, data access to the file is determined by a complex set of poorly documented rules. With a bit of effort, you can emulate <code>/proc/[PID]/mem</code> inside gVisor’s sandbox, where the process only has access to the subset of procfs that has been implemented by the gVisor authors and, as a result, you can have access to an easily readable stack trace in case of a crash.</p><p><a href="https://github.com/google/gvisor/pull/4060">Now I can't wait to get the PR merged into gVisor.</a></p> ]]></content:encoded>
            <category><![CDATA[Deep Dive]]></category>
            <category><![CDATA[Programming]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">5p725SUdmDWpdefn3j7WPy</guid>
            <dc:creator>Lennart Espe</dc:creator>
        </item>
    </channel>
</rss>