NFS is the Network File System protocol. It’s actually a family of protocols that allows POSIX-like file access from a client to data stored on a remote server.
Although NFS itself has never been a high-performance interface for networked file systems, many of the concepts it employs are also used by HPC parallel file systems. Knowing how NFS works helps in understanding how these other file systems work, and knowing why NFS is slow helps explain the design decisions that have guided HPC-optimized file systems.
There are also a few recent extensions to NFS that can make it a high-performance parallel client. VAST is perhaps the most interesting of these implementations since it employs NFS over RDMA[1], nconnect[2], and enhanced NFS multipathing[3] to enable parallel, scale-out I/O performance.
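As a rough illustration of what those client-side features look like, the mount below combines the RDMA transport and nconnect on a Linux client, assuming a kernel new enough to support both options together; the server name and export path are made up, and 20049 is just the conventional NFS-over-RDMA port:

# hypothetical mount of an NFSv3 export over RDMA with eight connections to the server
mount -t nfs -o vers=3,proto=rdma,port=20049,nconnect=8 vastserver:/export /mnt/nfs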
NFS version 3
NFS version 3 is famously stateless; the NFS server does not keep track of who has the file system mounted or what files are open. It even goes so far as to not implement open or close commands, making it kind of like an object store in its low-level semantics. Instead, all NFS clients rely on the NFS server to provide file handles, which are magical tokens that uniquely identify files or directories.
As a result,
- mounting an NFS export actually involves asking a special service on the NFS server for the root file handle
- opening a file on an NFS export involves performing recursive LOOKUP commands given the file handle of a parent directory and the name of a file or directory in that parent, starting from the root file handle
- reading from a file involves READing data from a file handle given a byte offset and number of bytes
- writing to a file involves WRITEing data to a file handle at a given byte offset
- closing a file, well, doesn’t happen because you never opened the file

If you want, you can COMMIT and force both client and server to flush any cached writes down to persistent media.
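One rough way to see these operations in action is to read a file through an NFSv3 mount and then dump the client-side RPC counters; the path below is hypothetical, but the LOOKUP and READ counts reported by nfsstat should tick up accordingly:

cat /mnt/nfs/subdir/hello > /dev/null   # triggers LOOKUPs for the path components, then READs
nfsstat -c -3                           # show NFSv3 client-side operation counts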
Because NFSv3 is stateless but POSIX file I/O is inherently stateful, there’s a bunch of wonky bolt-on services required by NFS servers in practice to make stateless NFSv3 operate like a stateful file system. For example,
- allowing clients to statefully mount the NFS export (mount -t nfs) is enabled by mountd
- allowing clients to hold locks on files (e.g., flock) is enabled by lockd
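All of these side services register themselves with the server’s RPC portmapper, so you can see the whole menagerie with rpcinfo (nlockmgr is lockd’s RPC name and status is statd’s). The port numbers below are illustrative since most of them are assigned dynamically:

glock@cloverdale:~$ rpcinfo -p synology
   program vers proto   port  service
    100000    2   tcp    111  portmapper
    100003    3   tcp   2049  nfs
    100005    3   udp    892  mountd
    100021    4   udp  32768  nlockmgr
    100024    1   udp  32765  status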
In the end, NFSv3 is a good idea in principle that forgot to include a lot of the features people actually need in practice.
Lock Manager (lockd) and Status Monitor (statd)
The NFS statd keeps track of which clients hold outstanding locks, persisted on a local file system on the NFS server so that locks can be recovered after a reboot. These records can be viewed in /var/lib/nfs/sm. For example, taking a file lock on a file named hello on an NFS mount on a client named cloverdale:
glock@cloverdale:/mnt/nfs$ flock hello sleep 30
results in the following appearing on the server:
root@synology:/var/lib/nfs/sm# ls -lrt
total 4
-rw------- 1 root root 92 Mar 29 19:54 cloverdale
root@synology:/var/lib/nfs/sm# cat cloverdale
0100007f 000186b5 00000003 00000010 66a89ab53b650b00804d729c00000000 192.168.50.27 synology
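While the first client’s sleep 30 is still holding the lock, a second client can probe it; flock’s nonblocking mode exits nonzero if the lock is already taken. The second hostname here (petaluma) is made up for illustration:

glock@petaluma:/mnt/nfs$ flock -n hello true || echo "hello is locked by someone else"
hello is locked by someone else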
Note that merely opening a file for writes does not lock it; this is why casual file editing from multiple NFS clients can result in file corruption.
NFS version 4
NFSv4 is a massive improvement over NFSv3 that
- rolls up the core NFSv3 service and all the required add-ons into a single standard
- adds a bunch of new performance and functionality features that were completely missing from NFSv3
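One practical consequence of that roll-up is that an NFSv4 mount needs only the nfs service itself on TCP port 2049; mounting, locking, and state management all happen in-band, so none of the NFSv3 side services above are involved. A minimal sketch, with server and export names that are purely illustrative:

mount -t nfs -o vers=4.2 server:/export /mnt/nfs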
Parallel NFS (pNFS) from NFS v4.1
The original implementation of Parallel NFS was kind of junky because it relied on clients getting file layout information from a single metadata server (like Lustre does) before they could talk directly to data servers. Panasas and NetApp were proponents of this, but it never gained a lot of traction.
pNFS v4.2 with Flex Files
There is a post on reddit[4] that describes the benefits of pNFS from NFS v4.2 with Flex Files. For posterity:
cbg523 via r/hpc
Actually, there are multiple fundamental differences between pNFSv4.1 from 2011 and the introduction of pNFSv4.2 with Flex Files in 2019. With v4.2 it solves the innate chattiness of the NFS protocol, adds intelligence into the client, etc. The recent addition of LOCALIO in v6.12 of the Linux kernel adds even more efficiency, bypassing the NFS protocol within the kernel when storage is local within the server, such as NVMe that is included in DGX servers.
…
The advantage of course is that the client is already included in standard Linux distributions.
I think Hammerspace led the development of this, and it sounds like it addresses many of the issues of the original pNFS. This happened after I left the storage world though.
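On a Linux client there’s nothing special to do to use pNFS with Flex Files: mounting with vers=4.1 or vers=4.2 against a server that hands out flex files layouts is enough, and the kernel loads the corresponding layout driver automatically. A rough way to check whether the flex files driver is in play (server and export names made up):

mount -t nfs -o vers=4.2 mds.example.com:/export /mnt/pnfs
lsmod | grep nfs_layout_flexfiles   # flex files layout driver, loaded when such layouts are in use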