A Short Guide to Kernel Debugging
A story about finding a kernel bug on a production system
Written by Benjamin Wester.
This article began as an internal email describing a problem that was found on Square’s production hosts. The story is written to serve as an introduction to the tools and practice of debugging the Linux kernel, though it assumes quite a bit of familiarity with how C programs are constructed.
While I was on-call for our deployment infrastructure, a problem came my way: a service’s latest release wasn’t deploying properly. Log files pointed me to one host that seemed stuck. After logging into the machine and looking around, something seemed very wrong:
$ date
Thu Sep 10 16:04:00 UTC 2015
$ ps -fC cleanup_old.sh
UID PID PPID C STIME TTY TIME CMD
root 5085 5081 0 Aug27 ? 00:00:10 /bin/sh ./cleanup_old.sh myapp
$ ps -fC rm
UID PID PPID C STIME TTY TIME CMD
root 5109 5099 0 Aug27 ? 00:00:01 rm -rf /opt/myapp-1.0.4
$ sudo kill -9 5085 5109
$ ps -fC cleanup_old.sh
UID PID PPID C STIME TTY TIME CMD
$ ps -fC rm
UID PID PPID C STIME TTY TIME CMD
root 5109 5099 0 Aug27 ? 00:00:01 rm -rf /opt/myapp-1.0.4
Killing a management script unstuck the deployment process but left behind an orphaned, weeks-old, unkillable rm. My immediate problem was mitigated, yet there was clearly a deeper problem. What was going on?
I knew the process wasn’t spinning in user-space — the process didn’t show up in top’s output as a heavy CPU user before I tried to kill it, and it didn’t die afterwards — so it must have been blocked inside a system call to the kernel. Often, a blocked process can be quickly diagnosed just by knowing which function it blocked in. Hoping for a quick solution, I looked at the process’s kernel call stack:
$ sudo cat /proc/5109/stack
[<ffffffff811979bc>] vfs_unlink+0x5c/0xf0
[<ffffffff8119aac3>] do_unlinkat+0x163/0x260
[<ffffffff8119adf2>] sys_unlinkat+0x22/0x40
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Of course rm was blocked unlinking a file; it doesn’t do much else. What part of unlinking was it blocked on? Which file was it removing? What was so special about that file? Those answers depended on the data in this process’s kernel call stack, not just a list of its functions. I needed to peek into the kernel’s memory.
At this point, it was clear that simple diagnostics wouldn’t reveal the problem. A reboot would “fix” it quickly but wouldn’t prevent this from happening again. I sighed and set aside the rest of my day to find the root cause.
Debugging the Linux kernel is more involved than debugging a normal process. With an interesting problem to solve as an example, let’s take a look at some of the tools, techniques, and data structures that will be useful for understanding the internals of the kernel. Hopefully, this discussion can be a starting place the next time your own problem wanders out of user-space.
Let’s debug the kernel!
Setting up a kernel debugging environment
A typical way to examine a process’s state is to attach a debugger, like gdb, to it. The kernel is no different. Just as with a user-space process, the binary must be built to support debugging. The machine we’re debugging runs RedHat 6.5, and thankfully RedHat builds its kernels with some debug options set. To save space, the debug symbols are separate from the main kernel binary, so we’ll need to install those.
$ uname -r
2.6.32-431.23.3.el6.x86_64
$ sudo yum --enablerepo=debug install kernel-debuginfo-2.6.32-431.23.3.el6 kernel-debuginfo-common-x86_64-2.6.32-431.23.3.el6
And if we want to understand what we’re seeing, we’ll have to install the source code for the kernel. The debug packages already install source code inside /usr/src/debug/…. If you want the source code without the debug package (perhaps on another machine), it can be generated from a Source RPM. (1)
$ curl -LO ftp://ftp.redhat.com/.../kernel-2.6.32-431.23.3.el6.src.rpm
$ rpm -Uvh kernel-2.6.32-431.23.3.el6.src.rpm
$ sudo yum install rpm-build yum-utils
$ sudo yum-builddep rpmbuild/SPECS/kernel.spec
$ [rpmbuild](http://www.rpm.org/max-rpm-snapshot/rpmbuild.8.html) -bp rpmbuild/SPECS/kernel.spec
Now, how do we attach gdb? There are many methods you could use depending on how stable the machine is, whether you can pause/reboot it, and whether it’s running in a virtual machine. This is a bare-metal machine that I don’t want to take down, so we’ll use RedHat’s crash tool. Crash is a wrapper around gdb that augments it with common functions for understanding kernel data structures. It’s also capable of debugging the live kernel it’s running on without disturbing it: just what we need.
$ sudo crash
KERNEL: /usr/lib/debug/lib/modules/2.6.32-431.23.3.el6.x86_64/vmlinux
DUMPFILE: /dev/crash
DATE: Thu Sep 10 17:35:01 2015
RELEASE: 2.6.32-431.23.3.el6.x86_64
VERSION: #1 SMP Thu Jul 31 17:20:51 UTC 2014
MACHINE: x86_64
PID: 3943
COMMAND: "crash"
TASK: ffff880494dd6080 [THREAD_INFO: ffff88094a1ea000]
CPU: 7
STATE: TASK_RUNNING (ACTIVE)
crash>
Understanding kernel data and x86 ASM
Let’s start with an easy task: we’ll verify that the program’s stack is still the same as before:
crash> [set](http://people.redhat.com/anderson/crash_whitepaper/help_pages/set.html) 5109
PID: 5109
COMMAND: "rm"
TASK: ffff8802d0e5a080 [THREAD_INFO: ffff88057f85a000]
CPU: 18
STATE: TASK_UNINTERRUPTIBLE
crash> [bt](http://people.redhat.com/anderson/crash_whitepaper/help_pages/bt.html)
PID: 5109 TASK: ffff8802d0e5a080 CPU: 18 COMMAND: "rm"
#0 [ffff88057f85bce8] schedule at ffffffff81528bc0
#1 [ffff88057f85bdb0] __mutex_lock_slowpath at ffffffff8152a36e
#2 [ffff88057f85be20] mutex_lock at ffffffff8152a20b
#3 [ffff88057f85be40] vfs_unlink at ffffffff811979bc
#4 [ffff88057f85be70] do_unlinkat at ffffffff8119aac3
#5 [ffff88057f85bf70] sys_unlinkat at ffffffff8119adf2
#6 [ffff88057f85bf80] system_call_fastpath at ffffffff8100b072
RIP: 00000033b36dc68d RSP: 00007ffff7f1ec48 RFLAGS: 00010202
RAX: 0000000000000107 RBX: ffffffff8100b072 RCX: 0000000000000032
RDX: 0000000000000000 RSI: 0000000001982a70 RDI: 0000000000000006
RBP: 00007ffff7f1efa0 R8: 0000000000000003 R9: 0000000000000000
R10: 0000000000000100 R11: 0000000000000246 R12: ffffffff8119adf2
R13: ffff88057f85bf78 R14: 0000000000000000 R15: 0000000001979060
ORIG_RAX: 0000000000000107 CS: 0033 SS: 002b
When a process makes a system call, all its registers are saved at the bottom of its stack; that’s what we see here. We could use this to reconstruct the syscall parameters using the amd64 syscall calling convention. We also get to see struct task_struct and struct thread_info pointers for the process. These two structs are what the kernel uses to track every process/thread in the system. (In Linux, a “thread” is just another process that happens to share the same address space, exception handlers, file table, etc.)
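To make that output a little more concrete, here is a heavily abridged sketch of how those two structures relate in a 2.6.32-era kernel (the real definitions live in include/linux/sched.h and arch/x86/include/asm/thread_info.h; only a few fields are shown):
/* The thread_info lives at the base of each task's kernel stack and
 * points back at its task_struct, which is why crash can report both
 * addresses for a single process. */
struct thread_info {
    struct task_struct *task;   /* back-pointer to the owning task */
    __u32 flags;                /* low-level flags (TIF_*) */
    /* ... */
};
struct task_struct {
    volatile long state;        /* TASK_RUNNING, TASK_UNINTERRUPTIBLE, ... */
    void *stack;                /* kernel stack; thread_info sits at its base */
    pid_t pid;
    /* ... */
};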
Looking at the full stack, it’s clear that the process is waiting to acquire a mutex (the mutex_lock() call). With line numbers now:
crash> bt -l
PID: 5109 TASK: ffff8802d0e5a080 CPU: 18 COMMAND: "rm"
...
#2 [ffff88057f85be20] mutex_lock at ffffffff8152a20b
/arch/x86/include/asm/thread_info.h: 216
#3 [ffff88057f85be40] vfs_unlink at ffffffff811979bc
/fs/namei.c: 2810
...
Let’s look into “/fs/namei.c” to figure out which mutex_lock() call it’s blocked on:
crash> l fs/namei.c:2810
...
2809 mutex_lock(&dentry->d_inode->i_mutex);
2810 if (d_mountpoint(dentry))
...
Is line 2810 blocking? No, the backtrace shows the return address, which points to line 2810. The mutex_lock() call is just before that, on line 2809.
A struct dentry is the data structure that Linux uses to track and cache file system names. It associates a name in some directory with the actual file, a struct inode, that holds the data. Usually, files have only one name (and one corresponding dentry) in a file system. Hard links create additional names for a single file; removing an open file allows the file to have zero names (it won’t be truly deleted until the file handle is closed).
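For reference, here is a trimmed-down sketch of the fields we’ll rely on below (the real 2.6.32-era definitions are in include/linux/dcache.h and include/linux/fs.h; everything else is elided):
struct qstr {
    unsigned int len;           /* length of the name */
    const unsigned char *name;  /* the name itself, e.g. "fire.js" */
};
struct dentry {
    struct dentry *d_parent;    /* dentry of the containing directory */
    struct qstr d_name;         /* this path component's name */
    struct inode *d_inode;      /* the file this name refers to */
    /* ... */
};
struct inode {
    struct mutex i_mutex;       /* serializes operations on this inode */
    /* ... */
};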
So our process blocks acquiring an inode’s mutex. Which inode? We could extract the file’s name from the arguments to sys_unlinkat(), but that wouldn’t lead us to the inode data structure. Let’s see if we can access some of the function parameters from the call stack.
crash> up
up = $1 =
{void (struct semaphore *)} 0xffffffff810a0f30 <up>
crash> gdb up
crash: prohibited gdb command: up
Crash doesn’t support gdb’s standard “up” command for traversing a call stack. Instead, we can print a “full” backtrace that includes the stack’s memory.
crash> bt -f
PID: 5109 TASK: ffff8802d0e5a080 CPU: 18 COMMAND: "rm"
#0 [ffff88057f85bce8] schedule at ffffffff81528bc0
ffff88057f85bcf0: 0000000000000086 0000000000000000
ffff88057f85bd00: ffffffffa0230da4 ffff88057f85bd18
...
Normally, gdb would interpret that memory for us. The binary’s debugging info tells gdb how to find the location of any saved variables or registers from any instruction. However, crash doesn’t expose this functionality, forcing us to unwind the call stack manually while looking for useful context.
According to the amd64 calling conventions, most function arguments are passed via registers without being explicitly saved anywhere else. We have access to the task’s stack, but we don’t get intermediate register states. Our best hope for recovering context is to look for a function in the call stack that happened to save an important variable to the stack.
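As a crib for what follows, this is the System V amd64 convention the kernel uses for its internal function calls (summarized as a C comment):
/* Integer/pointer arguments, in order:
 *   arg1 -> %rdi    arg4 -> %rcx
 *   arg2 -> %rsi    arg5 -> %r8
 *   arg3 -> %rdx    arg6 -> %r9
 * The return value comes back in %rax. None of these registers are
 * preserved across a call, so an argument survives only if some
 * function happened to spill it to the stack. */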
After looking through a few disassembled functions, we find do_unlinkat() where it calls into vfs_unlink():
crash> [dis](http://people.redhat.com/anderson/crash_whitepaper/help_pages/dis.html) do_unlinkat
...
0xffffffff8119aaa2 <do_unlinkat+322>: mov -0xc8(%rbp),%rax
0xffffffff8119aaa9 <do_unlinkat+329>: **mov %rdx,%rsi**
0xffffffff8119aaac <do_unlinkat+332>: mov 0x10(%rax),%rdi
0xffffffff8119aab0 <do_unlinkat+336>: mov %rcx,-0xe8(%rbp)
0xffffffff8119aab7 <do_unlinkat+343>: **mov %rdx,-0xe0(%rbp)**
0xffffffff8119aabe <do_unlinkat+350>: callq 0xffffffff81197960 <vfs_unlink>
0xffffffff8119aac3 <do_unlinkat+355>: mov -0xe0(%rbp),%rdx
...
This assembly block corresponds to the following code segment.
crash> [sym](http://people.redhat.com/anderson/crash_whitepaper/help_pages/sym.html) 0xffffffff8119aaa2
ffffffff8119aaa2 (t) do_unlinkat+322 fs/namei.c: 2869
crash> l fs/namei.c:2869
2869: error = vfs_unlink(nd.path.dentry->d_inode, **dentry**);
2870: exit2:
2871: dput(dentry);
The second argument to vfs_unlink(), a dentry pointer, can tell us which file is being removed. We know from the calling conventions that the second argument is passed in the %rsi register. Its value comes from %rdx, which is saved to the stack before the function call, where we can recover it. (2)
Now, we can look at a full memory dump of the task’s stack, extract a frame pointer (FP), and retrieve the dentry:
crash> bt -f
PID: 5109 TASK: ffff8802d0e5a080 CPU: 18 COMMAND: "rm"
...
ffff88057f85be68: **ffff88057f85bf68** ffffffff8119aac3 <- last frame's FP
#4 [ffff88057f85be70] do_unlinkat at ffffffff8119aac3
ffff88057f85be78: ffff88004b3d2248 ffff8808a3dbf588
ffff88057f85be88: **ffff8809c6c82a40** ffff880b92bd4000 <- FP - 0xe0
...
Now we can dump the relevant file system data structures:
crash> [struct](http://people.redhat.com/anderson/crash_whitepaper/help_pages/struct.html) dentry.d_name,d_parent,d_inode **ffff8809c6c82a40**
d_name = {
len = 7,
name = 0xffff8809c6c82ae0 "fire.js"
}
d_parent = **0xffff8809c6c82b00**
d_inode = **0xffff8808a3dbf588**
crash> struct dentry.d_name **0xffff8809c6c82b00**
d_name = {
len = 2,
name = 0xffff8809c6c82ba0 "js"
}
crash> struct inode.i_mutex **0xffff8808a3dbf588**
i_mutex = {
count = {
counter = -1
},
wait_lock = {
raw_lock = {
slock = 196611
}
},
wait_list = {
next = 0xffff880749a05c48,
prev = 0xffff88057f85bdc8
},
owner = **0xffff88076b78c000**
}
So the file is “fire.js” in the “js” directory. That’s good to know. Now, let’s see which process owns the mutex:
crash> struct thread_info.task **0xffff88076b78c000**
task = 0xffff8807ef98aae0
crash> [ps](http://people.redhat.com/anderson/crash_whitepaper/help_pages/ps.html) 0xffff8807ef98aae0
PID PPID CPU TASK ST %MEM VSZ RSS COMM
4723 4087 2 ffff8807ef98aae0 UN 0.3 641736 137204 ruby
One step forward, one step back
So, we now know which process has the mutex. What is it doing now?
crash> bt 4723
PID: 4723 TASK: ffff8807ef98aae0 CPU: 2 COMMAND: "ruby"
#0 [ffff88076b78da08] schedule at ffffffff81528bc0
#1 [ffff88076b78dad0] rwsem_down_failed_common at ffffffff8152b275
#2 [ffff88076b78db30] rwsem_down_write_failed at ffffffff8152b3d3
#3 [ffff88076b78db70] call_rwsem_down_write_failed at ffffffff8128f383
#4 [ffff88076b78dbd0] xfs_ilock at ffffffffa02006fc [xfs]
#5 [ffff88076b78dc00] xfs_setattr at ffffffffa021fdcd [xfs]
#6 [ffff88076b78dcc0] xfs_vn_setattr at ffffffffa022c00b [xfs]
#7 [ffff88076b78dcd0] notify_change at ffffffff811a7cf8
#8 [ffff88076b78dd40] do_truncate at ffffffff81186f74
#9 [ffff88076b78ddb0] do_filp_open at ffffffff8119c051
#10 [ffff88076b78df20] do_sys_open at ffffffff81185c39
#11 [ffff88076b78df70] sys_open at ffffffff81185d50
...
It’s waiting to acquire a write lock on a read-write semaphore. Which one? Using the same approach from above, we can recover frame #4’s first argument:
xfs_ilock(struct xfs_inode *ip = **0xffff8808a3dbf400**, ??)
So that one kernel can handle many different file systems, there is a generic struct inode that the main virtual file system layer (VFS) can understand, but each file system also has its own specific inode data structure. XFS uses a struct xfs_inode, and one of its fields is the embedded generic inode. For low-level synchronization, it also includes a read-write semaphore, i_iolock, and this process is blocked trying to acquire that semaphore.
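A heavily abridged sketch of that layout, following the RHEL 6-era XFS source (fs/xfs/xfs_inode.h; field order and comments are simplifications):
/* The XFS per-inode structure embeds the generic VFS inode and adds
 * its own locks. i_iolock is an mrlock_t, a thin wrapper around a
 * read/write semaphore. */
typedef struct xfs_inode {
    mrlock_t i_lock;            /* protects the inode's metadata */
    mrlock_t i_iolock;          /* I/O lock: the rwsem our ruby process wants */
    /* ... */
    struct inode i_vnode;       /* the embedded generic inode */
} xfs_inode_t;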
As it turns out, we’ve seen this inode before:
crash> struct -ox xfs_inode.i_vnode **0xffff8808a3dbf400**
struct xfs_inode {
[**ffff8808a3dbf588**] struct inode i_vnode;
}
It’s the “fire.js” inode, whose i_mutex lock is already held by this process. It’s now acquiring an XFS-specific semaphore on the same inode.
Unfortunately, semaphores don’t carry an “owner” field (in this configuration) like a mutex. So how do we find its owner? In this case, we can get lucky by assuming first that we’re investigating a deadlock (though we haven’t quite found that yet) and second that this process is a critical part of that deadlock. If we’re right, another interesting process might be waiting on this process’s mutex.
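The wait_list hanging off a contended mutex is a list of mutex_waiter entries, each recording the task sleeping on it, so walking the list tells us exactly who is queued on the lock. Roughly (an abridged sketch of the 2.6.32-era include/linux/mutex.h):
struct mutex {
    atomic_t count;             /* 1: unlocked, 0: locked, negative: waiters */
    spinlock_t wait_lock;
    struct list_head wait_list; /* of struct mutex_waiter */
    struct thread_info *owner;  /* only present with certain config options */
};
struct mutex_waiter {
    struct list_head list;      /* links into mutex.wait_list */
    struct task_struct *task;   /* the sleeping task */
};
With that in hand, we can locate the i_mutex, walk its wait_list, and see who is waiting: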
crash> struct -ox inode.i_mutex **0xffff8808a3dbf588**
struct inode {
[ffff8808a3dbf640] struct mutex i_mutex;
}
crash> struct -ox mutex.wait_list ffff8808a3dbf640
struct mutex {
[ffff8808a3dbf648] struct list_head wait_list;
}
crash> [list](http://people.redhat.com/anderson/crash_whitepaper/help_pages/list.html) -s mutex_waiter.task -H ffff8808a3dbf648
ffff880749a05c48
task = 0xffff88076dfe2040
ffff88057f85bdc8
task = 0xffff8802d0e5a080
crash> ps 0xffff88076dfe2040 0xffff8802d0e5a080
PID PPID CPU TASK ST %MEM VSZ RSS COMM
4724 1 6 ffff88076dfe2040 UN 0.3 468676 138156 ruby
5109 5099 18 ffff8802d0e5a080 UN 0.0 4144 696 rm
We started our session with process 5109, but what’s this other process, 4724?
crash> bt 4724
PID: 4724 TASK: ffff88076dfe2040 CPU: 6 COMMAND: "ruby"
#0 [ffff880749a05b68] schedule at ffffffff81528bc0
#1 [ffff880749a05c30] __mutex_lock_slowpath at ffffffff8152a36e
#2 [ffff880749a05ca0] mutex_lock at ffffffff8152a20b
#3 [ffff880749a05cc0] generic_file_splice_write at ffffffff811b8c7a
#4 [ffff880749a05d50] xfs_file_splice_write at ffffffffa02277b0 [xfs]
#5 [ffff880749a05dc0] do_splice_from at ffffffff811b873e
#6 [ffff880749a05e00] direct_splice_actor at ffffffff811b8790
#7 [ffff880749a05e10] splice_direct_to_actor at ffffffff811b8a66
#8 [ffff880749a05e80] do_splice_direct at ffffffff811b8bad
#9 [ffff880749a05ed0] do_sendfile at ffffffff8118936c
#10 [ffff880749a05f30] sys_sendfile64 at ffffffff81189404
#11 [ffff880749a05f80] system_call_fastpath at ffffffff8100b072
...
Hmm, I see XFS-specific code calling back into generic VFS code. Indeed, looking at xfs_file_splice_write(), you can see where it grabs the XFS i_iolock then calls generic_file_splice_write(), which then grabs the generic i_mutex.
Conclusion
Now there is enough data to put together a complete picture of what happened. Most XFS write operations acquire a file’s i_mutex first, then acquire i_iolock. The sys_open() call is an example of this ordering. However, the splice write operation is coded differently. It acquires i_iolock, then acquires i_mutex. We see that in the sys_sendfile64() call. So, whenever two processes concurrently write and splice to the same inode, each process can acquire one lock and then block on the other, leaving them deadlocked in uninterruptible sleeps and poisoning the inode.
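Reconstructed from the two backtraces above (as pseudocode in a C comment; the lock-flag names are simplifications, not exact call sites):
/* PID 4723 (ruby): open("fire.js", ... | O_TRUNC)
 *     mutex_lock(&inode->i_mutex);       // VFS truncate path
 *     xfs_ilock(ip, XFS_IOLOCK_EXCL);    // blocks: held by PID 4724
 *
 * PID 4724 (ruby): sendfile64() splicing into "fire.js"
 *     xfs_ilock(ip, XFS_IOLOCK_EXCL);    // xfs_file_splice_write()
 *     mutex_lock(&inode->i_mutex);       // blocks: held by PID 4723
 *
 * Each process holds the lock the other needs; rm's unlinkat() later
 * queues up behind the same i_mutex and hangs as well. */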
We found an XFS bug.
Searching for “xfs splice deadlock” turns up an email thread from 2011 that describes this problem. However, bisecting the kernel source repository shows that the bug wasn’t really addressed until April 2014 (8d02076), for release in Linux 3.16.
So why did we just spend all this time debugging a longstanding, known, and fixed problem? RedHat doesn’t use the latest version of the kernel in their Enterprise Linux product. Using older, battle-hardened versions is a good thing when your goal is long-term stability. However, bugs will still be found, and their fixes must be backported into the stable versions. RedHat just hasn’t included this one particular fix yet. Square is working with them to reach a solution. Until then, all versions to date of RHEL/CentOS 6 and 7 are susceptible to this bug. Update as of 9/2/16: RedHat has now shipped updated kernels that contain a fix for this deadlock: version 2.6.32-642 for RHEL 6 and version 3.10.0-327.22.2 for RHEL 7.
As for the process that initiated this deadlock, it’s a regular MRI Ruby 2.2.2 interpreter. Its core I/O functions have been optimized to use zero-copy operations when possible. Since early 2014 on Linux, a call to FileUtils.copy_file() from the standard library will use sendfile64(). Hence, there is a fairly compact trigger:
ruby -rfileutils -e'FN=ARGV.shift; Thread.new { loop { FileUtils.copy_file "/etc/issue", FN } }; loop {File.new FN, File::CREAT|File::TRUNC|File::RDWR }' /xfs/somefile
I suspect that Square has seen Java processes deadlock in this way too. These programs are trying to do the right thing, and I would expect that as more libraries and runtimes mature, they will also try to optimize their I/O behavior by adopting zero-copy operations.
Corrections: (1): Added explanation that you get the kernel source code along with the kernel-debuginfo-common package. (2): The second argument is passed in %rsi, not %rdx.