How we spent two weeks hunting an NFS bug in the Linux kernel

On Sep. 14, the GitLab support team escalated a critical problem encountered by one of our customers: GitLab would run fine for a while, but after some time users would start to hit errors. When attempting to clone certain repositories via Git, users would see an opaque Stale file error message. The error persisted for a long time, blocking employees from working, unless a system administrator intervened manually by running ls in the directory itself.

Thus began an investigation into the inner workings of Git and the Network File System (NFS). The investigation uncovered a bug in the Linux v4.0 NFS client and culminated in a kernel patch that was written by Trond Myklebust and merged into the latest mainline Linux kernel on Oct. 26. This post describes the journey of investigating the issue and details the thought process and tools by which we tracked down the bug.

It was inspired by the fine detective work in How I spent two weeks hunting a memory leak in Ruby by Oleg Dashevskii. More importantly, this experience exemplifies how open source software debugging has become a team sport that involves expertise across multiple people, companies, and locations. The GitLab motto ‘everyone can contribute’ applies not only to GitLab itself, but also to other open source projects, such as the Linux kernel.

Although we ran NFS for many years, we have stopped using it to access repository data across our application machines. Instead, we have abstracted all Git calls to Gitaly. NFS remains a supported configuration for customers who manage their own installation of GitLab, but we had never before seen the exact problem this customer described.