NFS is the Network File System protocol. It’s actually a suite of standardized protocols that allow POSIX-like file access from a client to data stored on a remote server.

Although NFS itself has never been a high-performance interface to networked file systems, many of the concepts it employs are used by HPC parallel file systems. Knowing how NFS works helps in understanding how these other file systems work, and knowing why NFS is slow helps explain the design decisions that have guided HPC-optimized file systems.

There are also a few recent extensions to NFS that can turn it into a high-performance parallel client. VAST is perhaps the most interesting such implementation since it employs NFS over RDMA1, nconnect2, and enhanced NFS multipathing3 to enable parallel, scale-out I/O performance.
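
On a Linux client, the first two of these features are exposed as ordinary mount options. A rough sketch of what that looks like; the server name, export path, and option values below are purely illustrative:

# open several TCP connections to the same server (nconnect, Linux 5.3+)
mount -t nfs -o vers=3,nconnect=8 nfs-server:/export /mnt/nfs

# use NFS over RDMA instead of TCP (requires an RDMA-capable fabric;
# 20049 is the registered NFS/RDMA port)
mount -t nfs -o vers=3,proto=rdma,port=20049 nfs-server:/export /mnt/nfs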

NFS version 3

NFS version 3 is famously stateless; the NFS server does not keep track of who has the file system mounted or what files are open. It even goes so far as to not implement open or close commands, making it kind of like an object store in its low-level semantics. Instead, all NFS clients rely on the NFS server to provide file handles, which are magical tokens that uniquely identify files or directories.

As a result,

  • mounting an NFS export actually involves asking a special service (mountd) on the NFS server for the root file handle of the export.
  • opening a file on an NFS export involves performing recursive LOOKUP calls given the file handle of a parent directory and the name of a file or directory within it, starting from the root file handle.
  • reading from a file involves READing data from a file handle given a byte offset and a number of bytes.
  • writing to a file involves WRITEing data to a file handle at a given byte offset.
  • closing a file, well, doesn’t happen because you never opened the file. If you want, you can COMMIT and force both client and server to flush any cached writes down to persistent media. This mapping of POSIX calls onto stateless operations can be watched directly, as sketched below.
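
One way to see the mapping from a Linux client is to compare the per-operation RPC counters before and after touching a file on an NFSv3 mount. A minimal sketch, assuming a mount at /mnt/nfs containing a file named hello:

# NFSv3 client operation counts before the read
nfsstat -c -3

cat /mnt/nfs/hello > /dev/null

# the same counters afterwards: LOOKUP, ACCESS, and READ increase, but there
# are no open or close operations because NFSv3 simply has none
nfsstat -c -3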

Because NFSv3 is stateless but POSIX file I/O is inherently stateful, NFS servers in practice need a bunch of wonky bolt-on services to make stateless NFSv3 behave like a stateful file system. For example,

  • allowing clients to statefully mount the NFS export (mount -t nfs) is enabled by mountd
  • allowing clients to hold locks on files (e.g., flock) is enabled by lockd. Both show up as separate RPC services alongside nfs itself, as sketched below.
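
A quick way to see these helpers on a Linux NFS server is to ask the portmapper what RPC services are registered; the hostname below is just a placeholder:

# a "stateless" NFSv3 server still has to register mountd, nlockmgr (lockd),
# and status (statd) next to the nfs service itself
rpcinfo -p nfs-server | grep -E 'nfs|mountd|nlockmgr|status'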

The net effect is that NFSv3 is a good idea in principle that forgot to include a lot of the features real workloads need in practice.

Lock Manager (lockd) and Status Monitor (statd)

Lock state itself is managed by lockd, but statd persists a record of which clients hold outstanding locks to a local file system on the NFS server so that locks can be recovered after a reboot. These records can be viewed in /var/lib/nfs/sm. For example, taking a file lock on a file named hello on an NFS mount from a client named cloverdale:

glock@cloverdale:/mnt/nfs$ flock hello sleep 30

results in the following appearing on the server:

root@synology:/var/lib/nfs/sm# ls -lrt
total 4
-rw------- 1 root root 92 Mar 29 19:54 cloverdale
root@synology:/var/lib/nfs/sm# cat cloverdale
0100007f 000186b5 00000003 00000010 66a89ab53b650b00804d729c00000000 192.168.50.27 synology

Note that merely opening a file for writes does not lock it; this is why casual file editing from multiple NFS clients can result in file corruption.
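
Clients that genuinely need to coordinate have to take advisory locks explicitly. A minimal sketch using flock(1), assuming both clients have the export mounted at /mnt/nfs:

# on client A: hold an exclusive lock on hello for 30 seconds
flock /mnt/nfs/hello -c 'sleep 30'

# on client B, meanwhile: try to take the same lock without blocking
flock -n /mnt/nfs/hello -c 'echo got the lock' || echo 'lock held elsewhere'

These locks are only advisory, though; a third client that writes to the file without calling flock is not stopped by them.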

NFS version 4

NFSv4 is a massive improvement over NFSv3 that

  1. rolls up the core NFS service and all of the required add-ons (mounting, locking, status) into a single protocol on a single port
  2. adds a bunch of performance and functionality features that were completely missing from NFSv3, such as statefulness (real OPEN and CLOSE operations), delegations, and compound RPCs
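
On a Linux client, opting into NFSv4 is just a mount option. A rough sketch; the server name and paths are illustrative:

# mount the same export over NFSv4.2; no separate mountd or lockd traffic is needed
mount -t nfs -o vers=4.2 nfs-server:/export /mnt/nfs

# confirm which protocol version and options were actually negotiated
nfsstat -m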

Parallel NFS (pNFS) from NFS v4.1

The original implementation of Parallel NFS was kind of junky because it relied on clients getting file layout information from a single metadata server (like Lustre does) before they could talk directly to data servers. Panasas and NetApp were proponents of it, but it never gained a lot of traction.

pNFS v4.2 with Flex Files

There is a post on reddit4 that describes the benefits of pNFS from NFS v4.2 with Flex Files. For posterity:

cbg523 via r/hpc

Actually, there are multiple fundamental differences between pNFSv4.1 from 2011 and the introduction of pNFSv4.2 with Flex Files in 2019. With v4.2 it solves the innate chattiness of the NFS protocol, adds intelligence into the client, etc. The recent addition of LOCALIO in v6.12 of the Linux kernel adds even more efficiency, bypassing the NFS protocol within the kernel when storage is local within the server, such as NVMe that is included in DGX servers.

The advantage of course is that the client is already included in standard Linux distributions.

I think Hammerspace led the development of this, and it sounds like it addresses many of the issues of the original pNFS. This happened after I left the storage world though.
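
One rough way to tell whether a Linux client actually negotiated a pNFS layout after mounting with vers=4.1 or vers=4.2 is to look for the layout driver modules, which the kernel only loads when the server hands out layouts. The module names below assume a standard kernel build:

# nfs_layout_flexfiles handles Flex Files layouts; nfs_layout_nfsv41_files
# handles the original pNFS files layout
lsmod | grep nfs_layout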

Footnotes

  1. https://www.usenix.org/legacy/events/fast02/wips.html#callaghan

  2. https://lkml.iu.edu/hypermail/linux/kernel/1907.2/02845.html

  3. https://vastdata.com/blog/meet-your-need-for-speed-with-nfs/

  4. Parallel NFS (pNFS) - Anyone using it without Hammerspace?