A Short Guide to Kernel Debugging

A story about finding a kernel bug on a production system

Reddit
LinkedIn

Written by Benjamin Wester.

This article began as an internal email describing a problem that was found on Square’s production hosts. The story is written to serve as an introduction to the tools and practice of debugging the Linux kernel, though it assumes a quite a bit of familiarity with how C programs are constructed.

While I was on-call for our deployment infrastructure, a problem came my way: a service’s latest release wasn’t deploying properly. Log files pointed me to one host that seemed stuck. After logging into the machine and looking around, something seemed very wrong:

$ date
Thu Sep 10 16:04:00 UTC 2015
$ ps -fC cleanup_old.sh
UID    PID PPID  C STIME TTY      TIME CMD
root  5085 5081  0 Aug27 ?    00:00:10 /bin/sh ./cleanup_old.sh myapp
$ ps -fC rm
UID    PID PPID  C STIME TTY      TIME CMD
root  5109 5099  0 Aug27 ?    00:00:01 rm -rf /opt/myapp-1.0.4
$ sudo kill -9 5085 5109
$ ps -fC cleanup_old.sh
UID    PID PPID  C STIME TTY      TIME CMD
$ ps -fC rm
UID    PID PPID  C STIME TTY      TIME CMD
root  5109 5099  0 Aug27 ?    00:00:01 rm -rf /opt/myapp-1.0.4

Killing a management script unstuck the deployment process but left behind an orphaned, weeks-old, unkillable rm. My immediate problem was mitigated, yet there was clearly a deeper problem. What was going on?

I knew the process wasn’t spinning in user-space — the process didn’t show up in top’s output as a heavy CPU user before I tried to kill it, and it didn’t die afterwards — so it must have been blocked inside a system call to the kernel. Often, a blocked process can be quickly diagnosed just by knowing which function it blocked in. Hoping for a quick solution, I looked at the process’s kernel call stack:

$ sudo cat /proc/5109/stack
[<ffffffff811979bc>] vfs_unlink+0x5c/0xf0
[<ffffffff8119aac3>] do_unlinkat+0x163/0x260
[<ffffffff8119adf2>] sys_unlinkat+0x22/0x40
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Of course rm was blocked unlinking a file; it doesn’t do much else. What part of unlinking was it blocked on? Which file was it removing? What was so special about that file? Those answers depended on the data in this process’s kernel call stack, not just a list of its functions. I needed to peek into the kernel’s memory.

At this point, it was clear that simple diagnostics wouldn’t reveal the problem. A reboot would “fix” it quickly but wouldn’t prevent this from happening again. I sighed and set aside the rest of my day to find the root cause.

Debugging the Linux kernel is more involved than debugging a normal process. With an interesting problem to solve as an example, let’s take a look at some of the tools, techniques, and data structures that will be useful for understanding the internals of the kernel. Hopefully, this discussion can be a starting place the next time your own problem wanders out of user-space.

Let’s debug the kernel!

Setting up a kernel debugging environment

A typical way to examine a process’s state is to attach a debugger, like gdb, to it. The kernel is no different. Just like user-space processes, the binary must be built to support debugging. The machine we’re debugging runs RedHat 6.5, and thankfully RedHat builds its kernels with some debug options set. To save space, the debug symbols are separate from the main kernel binary, so we’ll need to install those.

$ uname -r
2.6.32-431.23.3.el6.x86_64
$ sudo yum --enablerepo=debug install kernel-debuginfo-2.6.32-431.23.3.el6 kernel-debuginfo-common-x86_64-2.6.32-431.23.3.el6

And if we want to understand what we’re seeing, we’ll have to install the source code for the kernel. The debug packages already install source code inside /usr/src/debug/…. If you want the source code without the debug package (perhaps on another machine), it can be generated from a Source RPM. (1)

$ curl -LO ftp://ftp.redhat.com/.../kernel-2.6.32-431.23.3.el6.src.rpm
$ rpm -Uvh kernel-2.6.32-431.23.3.el6.src.rpm
$ sudo yum install rpm-build yum-utils
$ sudo yum-builddep rpmbuild/SPECS/kernel.spec
$ [rpmbuild](http://www.rpm.org/max-rpm-snapshot/rpmbuild.8.html) -pb rpmbuild/SPECS/kernel.spec

Now, how do we attach gdb? There are many methods you could use depending on how stable the machine is, whether you can pause/reboot it, and whether it’s running in a virtual machine. This is a bare-metal machine that I don’t want to take down, so we’ll used RedHat’s crash tool. Crash is a wrapper around gdb that augments it with common functions for understanding kernel data structures. It’s also capable of debugging the live kernel it’s running on without disturbing it: just what we need.

$ sudo crash
  KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.23.3.el6.x86_64/vmlinux
DUMPFILE: /dev/crash
    DATE: Thu Sep 10 17:35:01 2015
 RELEASE: 2.6.32-431.23.3.el6.x86_64
 VERSION: #1 SMP Thu Jul 31 17:20:51 UTC 2014
 MACHINE: x86_64
     PID: 3943
 COMMAND: "crash"
    TASK: ffff880494dd6080  [THREAD_INFO: ffff88094a1ea000]
     CPU: 7
   STATE: TASK_RUNNING (ACTIVE)

crash>

Understanding kernel data and x86 ASM

Let’s start with an easy task: we’ll verify that the program’s stack is still the same as before:

crash> [set](http://people.redhat.com/anderson/crash_whitepaper/help_pages/set.html) 5109
    PID: 5109
COMMAND: "rm"
   TASK: ffff8802d0e5a080  [THREAD_INFO: ffff88057f85a000]
    CPU: 18
  STATE: TASK_UNINTERRUPTIBLE
crash> [bt](http://people.redhat.com/anderson/crash_whitepaper/help_pages/bt.html)
PID: 5109  TASK: ffff8802d0e5a080  CPU: 18  COMMAND: "rm"
 #0 [ffff88057f85bce8] schedule at ffffffff81528bc0
 #1 [ffff88057f85bdb0] __mutex_lock_slowpath at ffffffff8152a36e
 #2 [ffff88057f85be20] mutex_lock at ffffffff8152a20b
 #3 [ffff88057f85be40] vfs_unlink at ffffffff811979bc
 #4 [ffff88057f85be70] do_unlinkat at ffffffff8119aac3
 #5 [ffff88057f85bf70] sys_unlinkat at ffffffff8119adf2
 #6 [ffff88057f85bf80] system_call_fastpath at ffffffff8100b072
    RIP: 00000033b36dc68d  RSP: 00007ffff7f1ec48  RFLAGS: 00010202
    RAX: 0000000000000107  RBX: ffffffff8100b072  RCX: 0000000000000032
    RDX: 0000000000000000  RSI: 0000000001982a70  RDI: 0000000000000006
    RBP: 00007ffff7f1efa0   R8: 0000000000000003   R9: 0000000000000000
    R10: 0000000000000100  R11: 0000000000000246  R12: ffffffff8119adf2
    R13: ffff88057f85bf78  R14: 0000000000000000  R15: 0000000001979060
    ORIG_RAX: 0000000000000107  CS: 0033  SS: 002b

When a process makes a system call, all its registers are saved at the bottom of its stack; that’s what we see here. We could use this to reconstruct the syscall parameters using the amd64 syscall calling convention. We also get to see struct task_struct and struct thread_info pointers for the process. These two structs are what the kernel uses to track every process/thread in the system. (In Linux, a “thread” is just another process that happens to share the same address space, exception handlers, file table, etc..)

Looking at the full stack, it’s clear that the process is waiting to acquire a mutex (the mutex_lock() call). With line numbers now:

crash> bt -l
PID: 5109  TASK: ffff8802d0e5a080  CPU: 18  COMMAND: "rm"
...
 #2 [ffff88057f85be20] mutex_lock at ffffffff8152a20b
    /arch/x86/include/asm/thread_info.h: 216
 #3 [ffff88057f85be40] vfs_unlink at ffffffff811979bc
    /fs/namei.c: 2810
...

If we look into “/fs/namei.c” to figure out which mutex_lock() call it’s blocking on:

crash> l fs/namei.c:2810
...
2809            mutex_lock(&dentry->d_inode->i_mutex);
2810            if (d_mountpoint(dentry))
...

Line 2810 is blocking? No, the backtrace shows that the return address goes to line 2810. The mutex call is before that, in line 2809.

A struct dentry is the data structure that Linux uses to track and cache file system names. It associates a name in some directory with the actual file, a struct inode, that holds the data. Usually, files have only one name (and one corresponding dentry) in a file system. Hard links create additional names for a single file; removing an open file allows the file to have zero names (it won’t be truly deleted until the file handle is closed).

So our process blocks acquiring an inode’s mutex. Which inode? We could extract the file’s name from the arguments to sys_unlinkat(), but that wouldn’t lead us to the inode data structure. Let’s see if we can access some of the function parameters from the call stack.

crash> up
up = $1 =
 {void (struct semaphore *)} 0xffffffff810a0f30 <up>
crash> gdb up
crash: prohibited gdb command: up

Crash doesn’t support gdb’s standard “up” command for traversing a call stack. Instead, we can print a “full” backtrace that includes the stack’s memory.

crash> bt -f
PID: 5109  TASK: ffff8802d0e5a080  CPU: 18  COMMAND: "rm"
 #0 [ffff88057f85bce8] schedule at ffffffff81528bc0
    ffff88057f85bcf0: 0000000000000086 0000000000000000
    ffff88057f85bd00: ffffffffa0230da4 ffff88057f85bd18
...

Normally, gdb would interpret that memory for us. The binary’s debugging info tells gdb how to find the location of any saved variables or registers from any instruction. However, crash doesn’t expose this functionality, forcing us to unwind the call stack manually while looking for useful context.

According to the amd64 calling conventions, most function arguments are passed via registers without being explicitly saved anywhere else. We have access to the task’s stack, but we don’t get intermediate register states. Our best hope for recovering context is to look for a function in the call stack that happened to save an important variable to the stack.

After looking through a few disassembled functions, we find dounlinkat() where it calls into vfsunlink():

crash> [dis](http://people.redhat.com/anderson/crash_whitepaper/help_pages/dis.html) do_unlinkat
...
0xffffffff8119aaa2 <do_unlinkat+322>:   mov    -0xc8(%rbp),%rax
0xffffffff8119aaa9 <do_unlinkat+329>:   **mov    %rdx,%rsi**
0xffffffff8119aaac <do_unlinkat+332>:   mov    0x10(%rax),%rdi
0xffffffff8119aab0 <do_unlinkat+336>:   mov    %rcx,-0xe8(%rbp)
0xffffffff8119aab7 <do_unlinkat+343>:   **mov    %rdx,-0xe0(%rbp)**
0xffffffff8119aabe <do_unlinkat+350>:   callq  0xffffffff81197960 <vfs_unlink>
0xffffffff8119aac3 <do_unlinkat+355>:   mov    -0xe0(%rbp),%rdx
...

This assembly block corresponds to the following code segment.

crash> [sym](http://people.redhat.com/anderson/crash_whitepaper/help_pages/sym.html) 0xffffffff8119aaa2
ffffffff8119aaa2 (t) do_unlinkat+322 fs/namei.c: 2869
crash> l fs/namei.c:2869
2869:         error = vfs_unlink(nd.path.dentry->d_inode, **dentry**);
2870: exit2:
2871:         dput(dentry);

The second argument to vfs_unlink(), a dentry pointer, can tell us which file is being removed. We know from the calling conventions that the second argument is passed in the %rsi register. Its values comes from %rdx, which is saved to the stack before the function call, where we can recover it. (2)

Now, we can look at a full memory dump of the task’s stack, extract a frame pointer (FP), and retrieve the dentry:

crash> bt -f
PID: 5109  TASK: ffff8802d0e5a080  CPU: 18  COMMAND: "rm"
...
    ffff88057f85be68: **ffff88057f85bf68** ffffffff8119aac3 <- last frame's FP
 #4 [ffff88057f85be70] do_unlinkat at ffffffff8119aac3
    ffff88057f85be78: ffff88004b3d2248 ffff8808a3dbf588
    ffff88057f85be88: **ffff8809c6c82a40** ffff880b92bd4000 <- FP - 0xe0
...

Now we can dump the relevant file system data structures:

crash> [struct](http://people.redhat.com/anderson/crash_whitepaper/help_pages/struct.html) dentry.d_name,d_parent,d_inode **ffff8809c6c82a40**
  d_name = {
    len = 7,
    name = 0xffff8809c6c82ae0 "fire.js"
  }
  d_parent = **0xffff8809c6c82b00**
  d_inode = **0xffff8808a3dbf588**
crash> struct dentry.d_name **0xffff8809c6c82b00**
  d_name = {
    len = 2,
    name = 0xffff8809c6c82ba0 "js"
  }
crash> struct inode.i_mutex **0xffff8808a3dbf588**
  i_mutex = {
    count = {
      counter = -1
    },
    wait_lock = {
      raw_lock = {
        slock = 196611
      }
    },
    wait_list = {
      next = 0xffff880749a05c48,
      prev = 0xffff88057f85bdc8
    },
    owner = **0xffff88076b78c000**
  }

So the file is “fire.js” in the “js” directory. That’s good to know. Now, let’s see which process owns the mutex:

crash> struct thread_info.task **0xffff88076b78c000**
  task = 0xffff8807ef98aae0
crash> [ps](http://people.redhat.com/anderson/crash_whitepaper/help_pages/ps.html) 0xffff8807ef98aae0
   PID  PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  4723  4087   2  ffff8807ef98aae0  UN   0.3  641736 137204  ruby

One step forward, one step back

So, we now know which process has the mutex. What is it doing now?

crash> bt 4723
PID: 4723  TASK: ffff8807ef98aae0  CPU: 2   COMMAND: "ruby"
 #0 [ffff88076b78da08] schedule at ffffffff81528bc0
 #1 [ffff88076b78dad0] rwsem_down_failed_common at ffffffff8152b275
 #2 [ffff88076b78db30] rwsem_down_write_failed at ffffffff8152b3d3
 #3 [ffff88076b78db70] call_rwsem_down_write_failed at ffffffff8128f383
 #4 [ffff88076b78dbd0] xfs_ilock at ffffffffa02006fc [xfs]
 #5 [ffff88076b78dc00] xfs_setattr at ffffffffa021fdcd [xfs]
 #6 [ffff88076b78dcc0] xfs_vn_setattr at ffffffffa022c00b [xfs]
 #7 [ffff88076b78dcd0] notify_change at ffffffff811a7cf8
 #8 [ffff88076b78dd40] do_truncate at ffffffff81186f74
 #9 [ffff88076b78ddb0] do_filp_open at ffffffff8119c051
#10 [ffff88076b78df20] do_sys_open at ffffffff81185c39
#11 [ffff88076b78df70] sys_open at ffffffff81185d50
...

It’s waiting to acquire a write lock on a read-write semaphore. Which one? Using the same approach from above, we can recover frame #4’s first argument:

xfs_ilock(struct xfs_inode *ip = **0xffff8808a3dbf400**, ??)

So that one kernel can handle many different filesystems, there is a generic struct inode that the main virtual file system layer (VFS) can understand, but each different file system also has its own specific inode data structure. XFS uses a struct xfs_inode, and one if its fields is the embedded generic inode. For low-level synchronization, it includes a binary semaphore i_iolock, and this process is blocked trying to acquire that semaphore.

As it turns out, we’ve seen this inode before:

crash> struct -ox xfs_inode.i_vnode **0xffff8808a3dbf400**
struct xfs_inode {
  [**ffff8808a3dbf588**] struct inode i_vnode;
}

It’s the “fire.js” inode, whose i_mutex lock is already held by this process. It’s now acquiring an XFS-specific semaphore on the same inode.

Unfortunately, semaphores don’t carry an “owner” field (in this configuration) like a mutex. So how do we find its owner? In this case, we can get lucky by assuming first that we’re investigating a deadlock (though we haven’t quite found that yet) and second that this process is a critical part of that deadlock. If we’re right, another interesting process might be waiting on this process’s mutex.

crash> struct -ox inode.i_mutex **0xffff8808a3dbf588**
struct inode {
  [ffff8808a3dbf640] struct mutex i_mutex;
}
crash> struct -ox mutex.wait_list ffff8808a3dbf640
struct mutex {
  [ffff8808a3dbf648] struct list_head wait_list;
}
crash> [list](http://people.redhat.com/anderson/crash_whitepaper/help_pages/list.html) -s mutex_waiter.task -H ffff8808a3dbf648
ffff880749a05c48
  task = 0xffff88076dfe2040
ffff88057f85bdc8
  task = 0xffff8802d0e5a080
crash> ps 0xffff88076dfe2040 0xffff8802d0e5a080
   PID  PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  4724     1   6  ffff88076dfe2040  UN   0.3  468676 138156  ruby
  5109  5099  18  ffff8802d0e5a080  UN   0.0    4144    696  rm

We started our session with process 5109, but what’s this other process, 4724?

crash> bt 4724
PID: 4724  TASK: ffff88076dfe2040  CPU: 6   COMMAND: "ruby"
 #0 [ffff880749a05b68] schedule at ffffffff81528bc0
 #1 [ffff880749a05c30] __mutex_lock_slowpath at ffffffff8152a36e
 #2 [ffff880749a05ca0] mutex_lock at ffffffff8152a20b
 #3 [ffff880749a05cc0] generic_file_splice_write at ffffffff811b8c7a
 #4 [ffff880749a05d50] xfs_file_splice_write at ffffffffa02277b0 [xfs]
 #5 [ffff880749a05dc0] do_splice_from at ffffffff811b873e
 #6 [ffff880749a05e00] direct_splice_actor at ffffffff811b8790
 #7 [ffff880749a05e10] splice_direct_to_actor at ffffffff811b8a66
 #8 [ffff880749a05e80] do_splice_direct at ffffffff811b8bad
 #9 [ffff880749a05ed0] do_sendfile at ffffffff8118936c
#10 [ffff880749a05f30] sys_sendfile64 at ffffffff81189404
#11 [ffff880749a05f80] system_call_fastpath at ffffffff8100b072
...

Hmm, I see XFS-specific code calling back into generic VFS code. Indeed, looking at xfsfilesplice_write(), you can see where it grabs the XFS iiolock then calls genericfilesplicewrite(), which then grabs the generic i_mutex.

Conclusion

Now there is enough data to put together a complete picture of what happened. Most XFS write operations acquire a file’s imutex first, then acquire iiolock. The sysopen() call is an example of this ordering. However, the splice write operation is coded differently. It acquires iiolock, then acquires imutex. We see that in the syssendfile64() call. So, whenever two processes concurrently write and splice to the same inode, each process can acquire one lock and then block on the other, leaving them deadlocked in uninterruptible sleeps and poisoning the inode.

We found an XFS bug.

Searching for “xfs splice deadlock” turns up an email thread from 2011 that describes this problem. However, bisecting the kernel source repository shows that the bug wasn’t really addressed until April, 2014 (8d02076) for release in Linux 3.16.

So why did we just spend all this time debugging a longstanding known and fixed problem? RedHat doesn’t use the latest version of the kernel in their Enterprise Linux product. Using older, battle-hardened versions is a good thing when your goal is long-term stability. However, bugs will still be found, and they must be backported into the stable versions. RedHat just hasn’t included this one particular fix yet. Square is working with them to reach a solution. Until then, all current versions to-date of RHEL/CentOS 6 and 7 are susceptible to this bug. Update as of 9/2/16: RedHat has now shipped updated kernels that contain a fix for this deadlock: version 2.6.32–642 for RHEL 6 and version 3.10.0–327.22.2 for RHEL 7.

As for the process that initiated this deadlock, it’s a regular MRI Ruby 2.2.2 interpreter. Its core I/O functions have been optimized to use zero-copy operations when possible. Since early 2014 on Linux, a call to FileUtils.copy_file() from the standard library will use sendfile64(). Hence, there is a fairly compact trigger:

ruby -rfileutils -e'FN=ARGV.shift; Thread.new { loop { FileUtils.copy_file "/etc/issue", FN } }; loop {File.new FN, File::CREAT|File::TRUNC|File::RDWR }' /xfs/somefile

I suspect that Square has seen Java processes deadlock in this way too. These programs are trying to do the right thing, and I would expect that as more libraries and runtimes mature, they will also try to optimize their I/O behavior by adopting zero-copy operations.

Corrections: (1): Added explanation that you get the kernel source code along with the kernel-debuginfo-common package. (2): The second argument is passed in %rsi, not %rdx.