Here are my notes on what MLPerf Storage does. See also my related notes, storage benchmarking is dumb.

Training benchmark

MLPerf Storage defines three training benchmarks whose I/O patterns are emulated by the DLIO benchmark tool.1

  1. UNet3D - meant to represent training a volumetric medical image segmentation model
  2. ResNet-50 - meant to represent training a classification model using an ImageNet-like dataset
  3. CosmoFlow - meant to represent a cosmology parameter prediction model

A cynical observation

Amusingly, all three are convolutional neural networks—not transformers, which are what drive most of the storage in AI these days.

Furthermore, CosmoFlow isn’t a real app and never has been; despite being developed at NERSC, it comprises 0% of the NERSC workload2 and it was developed to demonstrate the potential of deep learning for science in 2018,3 not to represent a real workload.

The I/O patterns generated by these benchmarks are strangely arbitrary.

UNet3D

Abstract

UNet3D amounts to a file-per-process benchmark where each process reads its own ~140 MB file in a single op. If you run it without direct I/O, the benchmark ultimately tests how fast np.load is on a single big file. If you run it with direct I/O, it instead tests the performance of a readv(2) followed by an in-memory unzip.

Dataset generation

The dataset read by UNet3D consists of files, each containing exactly one sample. The files have an average size of 146,600,628 bytes4 with a standard deviation of 68,341,808 bytes. These are fixed values.

The number of files, N, is determined at benchmark launch by taking the max of two constraints:5

N = max(
    (5 * total_client_memory_bytes) / 146_600_628,   # 5× memory rule
    500 * num_accelerators * 7                       # 500-step rule
)

The constant 500 is a hard-coded literal.3 The constant 7 is batch_size, a fixed value as well.6 The 500-step rule dominates at any reasonable accelerator count, giving N = 3,500 × num_accelerators.
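As a worked example of the sizing rule above (a sketch, not the real code in mlpstorage/rules/utils.py; whether the memory term is rounded up is my assumption):

```python
import math

FILE_BYTES = 146_600_628   # mean file size from the UNet3D config
BATCH_SIZE = 7             # fixed batch size in the H100 config
STEPS = 500                # hard-coded step count

def num_files(total_client_memory_bytes, num_accelerators):
    # max() of the 5x-memory rule and the 500-step rule, per the formula above
    by_memory = math.ceil(5 * total_client_memory_bytes / FILE_BYTES)
    by_steps = STEPS * BATCH_SIZE * num_accelerators
    return max(by_memory, by_steps)

# One client with 512 GiB of RAM driving 16 accelerators: the memory rule
# asks for ~18,751 files, but the 500-step rule asks for 56,000 and wins.
print(num_files(512 * 2**30, 16))  # 56000
```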

Workload setup

DLIO can run across multiple nodes via MPI, using the --num-accelerators option to define the number of MPI ranks. Within each rank (accelerator), the benchmark then creates a PyTorch DataLoader with a configurable number of subprocesses7 to enable parallel workers. The input file list is distributed across those workers.

For example, if you run with --num-accelerators 16 and use the default read_threads: 4, you wind up reading 64 files concurrently.

I/O pattern

The actual read happens deep within DLIO and amounts to a single numpy.load of an entire file8 when direct I/O is not being used. Whether you use direct I/O is configurable.9
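In sketch form, the buffered read path is just this (not DLIO's literal code; the array key is hypothetical):

```python
import numpy as np

def read_sample(path):
    # One np.load slurps the whole ~140 MB file; without O_DIRECT this is
    # a single large buffered read(2) plus an in-memory decompress/unzip.
    with np.load(path) as npz:
        return npz["x"]   # "x" is a made-up key for illustration
```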

Without direct I/O, this means a single read(2) syscall that loads the entire contents of the ~140 MB file. How that single read is chopped into the actual I/Os that storage sees is governed by the client’s page cache:

  • Local SSDs’ readahead sizes are governed by /sys/block/<dev>/queue/read_ahead_kb. This is typically 128 KiB.
  • NFS readahead is governed by /sys/class/bdi/*/read_ahead_kb. Sometimes this is 128 KiB, but VAST recommends tuning it up to 4 MiB.10 The result is I/Os governed by the rsize mount parameter (usually 1 MiB) being issued sequentially in 4 MiB intervals.
  • Lustre readahead is governed by llite.*.max_read_ahead_mb, which defaults to dozens of megabytes. The actual I/Os are governed by max_pages_per_rpc (usually 1-4 MiB) and are streamed into the readahead buffer.

Micron confirmed that UNet3D generates 128 KiB reads,1 but they failed to note that this is a result of their client’s choice of read_ahead_kb and can be changed without DLIO ever knowing.

With direct I/O, DLIO still issues a single readv(2) syscall that loads the entire contents of the ~140 MB file.11 The Linux kernel then breaks this into smaller reads governed by a handful of client tunables:

  • Local SSDs are governed by /sys/block/<dev>/queue/max_sectors_kb - typically between 512 KiB and 2 MiB unless the file system driver imposes its own chunk size above that.
  • NFS will chop the I/O into whatever the rsize mount parameter is - typically 1 MiB on high-performance mounts.
  • Lustre will chop the I/O into whatever the max_pages_per_rpc client tunable is. The default is 256 pages (1 MiB) but is often dialed up to 4 MiB or more.
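A minimal sketch of that direct-I/O path, loosely modeled on DLIO's npy_reader_odirect (the helper name and alignment handling are mine, not DLIO's; Linux-only when direct=True):

```python
import mmap
import os

def one_shot_read(path, align=4096, direct=True):
    # Size the buffer up to the next multiple of `align`; an anonymous
    # mmap is page-aligned, which O_DIRECT requires.
    size = os.path.getsize(path)
    buf = mmap.mmap(-1, (size + align - 1) // align * align)
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    try:
        n = os.readv(fd, [buf])  # one readv(2) for the whole file
    finally:
        os.close(fd)
    return bytes(buf[:n])
```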

Warning

While DLIO’s UNet3D workload is meant to reproduce what PyTorch does, DLIO never actually calls PyTorch; it is a reimplementation of something PyTorch did at a point in time. Thus, it doesn’t necessarily reflect what PyTorch does today.

ResNet-50

Abstract

ResNet-50 boils down to a file-per-process sequential read test with a configurable transfer size. The default is 256 KiB, but it should be increased to suit the backing store.

Dataset generation

ResNet-50 uses TFRecord, not NPZ files, and is therefore tied to TensorFlow in the closed case. The dataset consists of N files, where N is defined by

N = ceil(500 * num_accelerators * 400 / 1251) # 400 = batch_size

Each file contains 1,251 samples which are exactly 114,244 bytes in size,12 translating to ~136 MiB files.
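Footnote 12's arithmetic, worked out:

```python
import math

nominal = 114_660.07                 # record_length_bytes in resnet50_h100.yaml
dim = int(math.sqrt(nominal))        # 338, since dimension_stdev = 0
record_bytes = dim * dim             # bytes actually written per record
samples_per_file = 1251

print(record_bytes)                             # 114244
print(record_bytes * samples_per_file / 2**20)  # ~136 MiB per file
# (the nominal 114,660-byte record size would instead give ~137 MiB)
```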

This dataset is meant to represent the actual ImageNet dataset on average, but the real ImageNet dataset has variation in both record size and file size. In addition, the number of files stops being a power of 2 at high concurrencies, introducing additional imbalance. As a result, the MLPerf Storage ResNet-50 test is a best-case scenario in terms of per-file access (really training against ImageNet would result in load imbalance) but a bad case in terms of sharding individual records across files.

Workload setup

DLIO can run across multiple nodes via MPI, using the --num-accelerators option to define the number of MPI ranks. Within each rank (accelerator), the benchmark constructs a tf.data pipeline13 with a configurable number of threads to enable parallel workers. Within each MPI rank, num_parallel_reads=read_threads causes the TF pipeline to interleave records from multiple files simultaneously using those threads.

For example, if you run with --num-accelerators 16 and the default read_threads: 8,14 and N = 1024, those 1024 files are divided evenly across the 16 MPI ranks (64 files per rank), and each rank reads 8 of its files concurrently, giving 128 concurrently read files across the job. One MPI rank gets one tf.data pipeline with 8 threads and 64 files.

I/O pattern

ResNet-50’s I/O pattern is governed by TensorFlow’s own TFRecordDataset I/O size, which is configurable via the reader.transfer_size tunable.15 Oddly, this is not advertised in many places, so you must know where to set it in order to realize the benefits. The default is 256 KiB,16 but it should be changed to match whatever is ideal for the underlying storage system.

Assuming transfer_size = 256K, each thread of each tf.data pipeline (i.e., of one MPI rank) walks through its file, reading 256 KiB at a time into an internal buffer. From that buffer, TensorFlow then decodes whole records (there will probably be some leftover bytes, since 256 KiB is not an even multiple of the record size), then proceeds with reading another 256 KiB. It does this until it reaches the end of the file, then begins work on its next file.

Threads have exclusive access to their own files. Direct I/O is not supported.
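The chunk-and-carry loop described above can be sketched in pure Python (simplified: real TFRecords carry length headers and CRCs, which this ignores; the fixed record size is that of the generated dataset):

```python
import io

RECORD = 114_244        # bytes per generated sample
CHUNK = 256 * 1024      # reader.transfer_size default

def records_from_file(f):
    buf = b""
    while chunk := f.read(CHUNK):
        buf += chunk
        # Decode every whole record in the buffer; leftover bytes carry
        # over and are completed by the next 256 KiB read.
        while len(buf) >= RECORD:
            yield buf[:RECORD]
            buf = buf[RECORD:]

# Three records' worth of bytes -> three records.
recs = list(records_from_file(io.BytesIO(b"\0" * (3 * RECORD))))
print(len(recs))  # 3
```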

So, this benchmark boils down to a concurrent read bandwidth test governed by whatever you set reader.transfer_size to. In practice,

  • The default is 256 KiB, which is a property of TensorFlow. This is why Micron observed most I/Os being this size in their analysis;1 I think their observation of larger sizes is a result of readahead sometimes colliding with this record size. Not sure though.
  • Local SSDs are probably fine with the default of 256 KiB.
  • NFS should increase transfer_size to whatever the rsize mount parameter is (probably 1 MiB) or an integer multiple of it (4 MiB) if you are using few processes per node or threads per process.
  • Lustre should increase transfer_size to whatever max_pages_per_rpc is (1 MiB - 4 MiB) or an integer multiple of it (4 MiB - 16 MiB) if you are using few processes per node or threads per process.
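If you want to raise it, the override lives in the workload YAML's reader section (key name per the tf_reader code; the exact layout here is my assumption):

```yaml
reader:
  transfer_size: 4194304   # 4 MiB, matched to rsize / max_pages_per_rpc
```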

CosmoFlow

As I said earlier, CosmoFlow is not a real application or workload. I don’t think it’s helpful to describe it, because that would imply that its I/O patterns mean something. If you think CosmoFlow performance is telling you something important about your storage system that cannot be gleaned by UNet3D or ResNet-50, shame on you.

Vector benchmark

Milvus has a benchmark that:

  • reads a single big index from S3 onto a local NVMe drive
  • executes the actual vector query against the local SSD, returning a list of nearest neighbors
  • optionally inserts both vectors and the documents those vectors represent; in this case, the documents are stored in S3
  • retrieves either just the nearest-neighbor vectors, or the vectors and their documents

Footnotes

  1. Discussion and Analysis of the MLPerf Storage Benchmark - SNIA SDC25

  2. N10_Workload_Analysis.latest.pdf

  3. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/mlpstorage/rules/utils.py#L100

  4. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/unet3d_h100.yaml#L18

  5. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/mlpstorage/rules/utils.py#L15

  6. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/unet3d_h100.yaml#L24

  7. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/unet3d_h100.yaml#L25

  8. https://github.com/argonne-lcf/dlio_benchmark/blob/57148a19ff004b214748b4290767c84392577aa2/dlio_benchmark/reader/npz_reader.py#L38

  9. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/Rules.md?plain=1#L352

  10. VAST Quick NFS Read Ahead Tuning

  11. https://github.com/argonne-lcf/dlio_benchmark/blob/main/dlio_benchmark/reader/npy_reader_odirect.py#L65

  12. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/resnet50_h100.yaml#L14; The generator creates every record at the same int(dimension) × int(dimension) size since dimension_stdev=0. However record_length_bytes: 114660.07 is a non-integer, and int(math.sqrt(114660.07)) = int(338.6...) = 338. So each generated record is actually 338 × 338 = 114,244 bytes, not 114,660.

  13. https://github.com/argonne-lcf/dlio_benchmark/blob/ea53bcfe26da8df15af6324ba6618e4593998104/dlio_benchmark/reader/tf_reader.py#L91

  14. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/resnet50_h100.yaml#L25

  15. https://github.com/argonne-lcf/dlio_benchmark/blob/ea53bcfe26da8df15af6324ba6618e4593998104/dlio_benchmark/reader/tf_reader.py#L97

  16. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/data/ops/readers.py#L36