MPI-IO is an API within the MPI standard1 that lets parallel applications read and write data with semantics that are a better fit for a single, coordinated application running across many nodes.
Consistency model
Whereas POSIX has strict consistency2 and NFS has close-to-open consistency by default, MPI-IO defines three consistency levels which interplay with two modes.3
MPI-IO has two atomicity modes, set via MPI_File_set_atomicity(fh, flag):
- Atomic mode: Writes and reads behave as if they execute one at a time in some serial order. Every completed write is immediately visible to all subsequent reads. This is the closest to what POSIX-familiar developers expect, but carries a performance cost.
- Non-atomic mode (the default): No such guarantee. If two processes access overlapping byte ranges and at least one is writing, results are undefined unless the application explicitly synchronizes. This gives MPI implementations freedom to buffer and pipeline I/O aggressively.
These modes, combined with how file handles were opened, determine what consistency guarantees your application gets:
- Single file handle: Your own reads always see your own writes. If accesses don’t overlap across processes, they’re well-defined without any extra work. Conflicting accesses require atomic mode.
- Multiple handles from the same collective open: Same as above. Non-overlapping accesses are fine, but conflicting accesses require atomic mode.
- Handles from different collective opens: You must do the work yourself. To ensure a write by one process is visible to a read by another, both must call MPI_File_sync (a collective operation) with a barrier in between. Opening and closing a file each count as an implicit sync.
In practice, most MPI-IO applications are structured to avoid conflicting accesses entirely: each process owns disjoint regions of the file. That structure sidesteps these rules and makes the default non-atomic mode safe.
ROMIO
ROMIO is the implementation of MPI-IO included with MPICH and is one of two implementations available in Open MPI. It provides a flexible driver interface that allows file-system-specific optimizations to be written for each file system.
Much of the optimization done these days is around collective I/O, where a subset of MPI processes are designated as aggregators. These aggregators are given exclusive access to write to non-overlapping parts of shared files. MPI processes that wish to write to a part of a file owned by another aggregator simply send their data to the correct aggregator, which writes it down to the file.
ROMIO handles this collective buffering under the hood by partitioning shared files into file domains, which are non-overlapping ranges within a file. Each aggregator is assigned a file domain, and the file system sees only a few clients (aggregators) each writing to non-overlapping parts of a shared file. The application itself might be issuing writes from every MPI process, and ROMIO cleans it all up.
GPFS driver
GPFS forces writers to obtain byte-range locks (tokens) before they can write, and if such a token is already possessed by a different client, the token must be revoked and re-issued. When multiple writers try to write to the same shared file, they can fight over tokens and serialize.
The GPFS driver does a few things to address this:
- It partitions shared files into file domains along GPFS block boundaries so that two clients are never accessing the same GPFS block.
- It has each aggregator proactively obtain write tokens for its domain partitions so that when writes occur, there is no lock contention.
- When a file is opened, all of its locks are freed, cleaning up any dangling locks so they don't have to be revoked later and stall writes.
NFS driver
The NFS driver is very primitive and simply wraps every write in a byte-range lock using fcntl(F_SETLKW, F_WRLCK). This means a round trip to the server before (obtain lock) and after (release lock) every write, and forces cache flushes around every write.
The NFS driver is notable in the performance optimizations it doesn’t have:
- There is no NFS-specific collective I/O path, so every collective write (even when accesses are non-overlapping) is wrapped in the lock-unlock sequence to force synchronous consistency.
- It is hard-coded to always use buffered I/O. There is no way to force direct I/O to bypass the lock-unlock round trips for every write.
Internally, it also sets romio_visibility_immediate = false, which signals to other parts of ROMIO that writes should not be assumed to be immediately visible to other processes. This triggers behavior that prefers data safety over performance.
UFS driver
UFS is the Unix File System driver and provides sane behavior on file systems that lack a specific driver. It just calls the default