Here are my notes on what MLPerf Storage does. See also my related notes, storage benchmarking is dumb.

Training benchmark

MLPerf Storage defines three training benchmarks whose I/O patterns are emulated by the DLIO benchmark tool.1

  1. UNet3D - meant to represent training a volumetric medical image segmentation model
  2. ResNet-50 - meant to represent training a classification model using an ImageNet-like dataset
  3. CosmoFlow - meant to represent a cosmology parameter prediction model

A cynical observation

Amusingly, all three are convolutional neural networks—not transformers, which are what drive most of the storage in AI these days.

Furthermore, CosmoFlow isn’t a real app and never has been; despite being developed at NERSC, it comprises 0% of the NERSC workload2 and it was developed to demonstrate the potential of deep learning for science in 2018,3 not to represent a real workload.

The I/O patterns generated by these benchmarks are strangely arbitrary.

UNet3D

Abstract

UNet3D amounts to a file-per-process benchmark where each process reads its own ~140 MB file in a single op. If you run it without direct I/O, the benchmark ultimately tests how fast np.load is on a single big file. If you run it with direct I/O, it instead tests the performance of a readv(2) followed by an in-memory unzip.

Dataset generation

The dataset read by UNet3D consists of files, each containing exactly one sample. The files have an average size of 146,600,628 bytes4 with a standard deviation of 68,341,808 bytes. These are fixed values.

The number of files, N, is determined at benchmark launch by taking the max of two constraints:5

N = max(
    (5 * total_client_memory_bytes) / 146_600_628,   # 5× memory rule
    500 * num_accelerators * 7                       # 500-step rule
)

The constant 500 is a hard-coded literal.3 The constant 7 is batch_size, a fixed value as well.6 The 500-step rule dominates at any reasonable accelerator count, giving N = 3,500 × num_accelerators.
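As a worked example of the sizing rule above (a sketch, not the real code in mlpstorage/rules/utils.py; whether the memory term is rounded up is my assumption):

```python
import math

FILE_BYTES = 146_600_628   # mean file size from the UNet3D config
BATCH_SIZE = 7             # fixed batch size in the H100 config
STEPS = 500                # hard-coded step count

def num_files(total_client_memory_bytes, num_accelerators):
    # max() of the 5x-memory rule and the 500-step rule, per the formula above
    by_memory = math.ceil(5 * total_client_memory_bytes / FILE_BYTES)
    by_steps = STEPS * BATCH_SIZE * num_accelerators
    return max(by_memory, by_steps)

# One client with 512 GiB of RAM driving 16 accelerators: the memory rule
# asks for ~18,751 files, but the 500-step rule asks for 56,000 and wins.
print(num_files(512 * 2**30, 16))  # 56000
```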

Workload setup

DLIO can run across multiple nodes via MPI, using the --num-accelerators option to define the number of MPI ranks. Within each rank (accelerator), the benchmark then creates a PyTorch DataLoader with a configurable number of subprocesses7 to enable parallel workers. The input file list is distributed across those workers.

For example, if you run with --num-accelerators 16 and use the default read_threads: 4, you wind up reading 64 files concurrently.

I/O pattern

The actual read happens deep within DLIO and amounts to a single numpy.load of an entire file8 when direct I/O is not being used. Whether you use direct I/O is configurable.9
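In sketch form, the buffered read path is just this (not DLIO's literal code; the array key is hypothetical):

```python
import numpy as np

def read_sample(path):
    # One np.load slurps the whole ~140 MB file; without O_DIRECT this is
    # a single large buffered read(2) plus an in-memory decompress/unzip.
    with np.load(path) as npz:
        return npz["x"]   # "x" is a made-up key for illustration
```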

Without direct I/O, this means a single read(2) syscall that loads the entire contents of the ~140 MB file. How that single read is chopped into the actual I/Os that storage sees is governed by the client’s page cache:

  • Local SSDs’ readahead sizes are governed by /sys/block/<dev>/queue/read_ahead_kb. This is typically 128 KiB.
  • NFS readahead is governed by /sys/class/bdi/*/read_ahead_kb. Sometimes this is 128 KiB, but VAST recommends tuning it up to 4 MiB.10 The result is I/Os governed by the rsize mount parameter (usually 1 MiB) being issued sequentially in 4 MiB intervals.
  • Lustre readahead is governed by llite.*.max_read_ahead_mb, which defaults to dozens of megabytes. The actual I/Os are governed by max_pages_per_rpc (usually 1-4 MiB) and are streamed into the readahead buffer.

Micron confirmed that UNet3D generates 128 KiB reads,1 but they failed to note that this is a result of their client’s choice of read_ahead_kb and can be changed without DLIO ever knowing.

With direct I/O, DLIO still issues a single readv(2) syscall that loads the entire contents of the ~140 MB file.11 The Linux kernel then breaks this into smaller reads governed by a handful of client tunables:

  • Local SSDs are governed by /sys/block/<dev>/queue/max_sectors_kb - typically between 512 KiB and 2 MiB unless the file system driver imposes its own chunk size above that.
  • NFS will chop the I/O into whatever the rsize mount parameter is - typically 1 MiB on high-performance mounts.
  • Lustre will chop the I/O into whatever the max_pages_per_rpc client tunable is. The default is 256 pages (1 MiB) but is often dialed up to 4 MiB or more.
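A minimal sketch of that direct-I/O path, loosely modeled on DLIO's npy_reader_odirect (the helper name and alignment handling are mine, not DLIO's; Linux-only when direct=True):

```python
import mmap
import os

def one_shot_read(path, align=4096, direct=True):
    # Size the buffer up to the next multiple of `align`; an anonymous
    # mmap is page-aligned, which O_DIRECT requires.
    size = os.path.getsize(path)
    buf = mmap.mmap(-1, (size + align - 1) // align * align)
    flags = os.O_RDONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    try:
        n = os.readv(fd, [buf])  # one readv(2) for the whole file
    finally:
        os.close(fd)
    return bytes(buf[:n])
```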

Warning

While DLIO’s UNet3D workload is meant to reproduce what PyTorch does, DLIO never actually calls PyTorch; it is a reimplementation of something PyTorch did at a point in time. Thus, it doesn’t necessarily reflect what PyTorch does today.

ResNet-50

Abstract

ResNet-50 boils down to a file-per-process sequential read test with a configurable transfer size. The default is 256 KiB, but it should be increased to suit the backing store.

Dataset generation

ResNet-50 uses TFRecord, not NPZ files, and is therefore tied to TensorFlow in the closed case. The dataset consists of N files, where N is defined by

N = ceil(500 * num_accelerators * 400 / 1251) # 400 = batch_size

Each file contains 1,251 samples which are exactly 114,244 bytes in size,12 translating to ~136 MiB files.
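Footnote 12's arithmetic, worked out:

```python
import math

nominal = 114_660.07                 # record_length_bytes in resnet50_h100.yaml
dim = int(math.sqrt(nominal))        # 338, since dimension_stdev = 0
record_bytes = dim * dim             # bytes actually written per record
samples_per_file = 1251

print(record_bytes)                             # 114244
print(record_bytes * samples_per_file / 2**20)  # ~136 MiB per file
# (the nominal 114,660-byte record size would instead give ~137 MiB)
```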

This dataset is meant to represent the actual ImageNet dataset on average, but the real ImageNet dataset has variation in both record size and file size. In addition, the number of files stops being a power of 2 at high concurrencies, introducing additional imbalance. As a result, the MLPerf Storage ResNet-50 test is a best-case scenario in terms of per-file access (really training against ImageNet would result in load imbalance) but a bad case in terms of sharding individual records across files.

Workload setup

DLIO can run across multiple nodes via MPI, using the --num-accelerators option to define the number of MPI ranks. Within each rank (accelerator), the benchmark constructs a tf.data pipeline13 with a configurable number of threads to enable parallel workers. Within each MPI rank, num_parallel_reads=read_threads causes the TF pipeline to interleave records from multiple files simultaneously using those threads.

For example, if you run with --num-accelerators 16 and the default read_threads: 8,14 and N = 1024, those 1024 files are divided evenly across the 16 MPI ranks (64 files per rank), and each rank reads 8 of its files concurrently, giving 128 concurrently read files across the job. One MPI rank gets one tf.data pipeline with 8 threads and 64 files.

I/O pattern

ResNet-50’s I/O pattern is governed by TensorFlow’s own TFRecordDataset I/O size, which is configurable via the reader.transfer_size tunable.15 Oddly, this is not advertised in many places, so you must know where to set it in order to realize the benefits. The default is 256 KiB,16 but it should be changed to match whatever is ideal for the underlying storage system.

Assuming transfer_size = 256K, each thread of each tf.data pipeline (i.e., of one MPI rank) walks through its file, reading 256 KiB at a time into an internal buffer. From that buffer, TensorFlow then decodes whole records (there will probably be some leftover bytes, since 256 KiB is not an even multiple of the record size), then proceeds with reading another 256 KiB. It does this until it reaches the end of the file, then begins work on its next file.

Threads have exclusive access to their own files. Direct I/O is not supported.
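The chunk-and-carry loop described above can be sketched in pure Python (simplified: real TFRecords carry length headers and CRCs, which this ignores; the fixed record size is that of the generated dataset):

```python
import io

RECORD = 114_244        # bytes per generated sample
CHUNK = 256 * 1024      # reader.transfer_size default

def records_from_file(f):
    buf = b""
    while chunk := f.read(CHUNK):
        buf += chunk
        # Decode every whole record in the buffer; leftover bytes carry
        # over and are completed by the next 256 KiB read.
        while len(buf) >= RECORD:
            yield buf[:RECORD]
            buf = buf[RECORD:]

# Three records' worth of bytes -> three records.
recs = list(records_from_file(io.BytesIO(b"\0" * (3 * RECORD))))
print(len(recs))  # 3
```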

So, this benchmark boils down to a concurrent read bandwidth test governed by whatever you set reader.transfer_size to. In practice,

  • The default is 256 KiB, which is a property of TensorFlow. This is why Micron observed most I/Os being this size in their analysis;1 I think their observation of larger sizes is a result of readahead sometimes colliding with this record size. Not sure though.
  • Local SSDs are probably fine with the default of 256 KiB.
  • NFS should increase transfer_size to whatever the rsize mount parameter is (probably 1 MiB) or an integer multiple of it (4 MiB) if you are using few processes per node or threads per process.
  • Lustre should increase transfer_size to whatever max_pages_per_rpc is (1 MiB - 4 MiB) or an integer multiple of it (4 MiB - 16 MiB) if you are using few processes per node or threads per process.
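If you want to raise it, the override lives in the workload YAML's reader section (key name per the tf_reader code; the exact layout here is my assumption):

```yaml
reader:
  transfer_size: 4194304   # 4 MiB, matched to rsize / max_pages_per_rpc
```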

CosmoFlow

As I said earlier, CosmoFlow is not a real application or workload. I don’t think it’s helpful to describe it, because that would imply that its I/O patterns mean something. If you think CosmoFlow performance is telling you something important about your storage system that cannot be gleaned by UNet3D or ResNet-50, shame on you.

Vector benchmark

Milvus has a benchmark that:

  • reads a single big index from S3 onto a local NVMe drive
  • executes the actual vector query against the local SSD, returning a list of nearest neighbors
  • optionally inserts both vectors and the documents those vectors represent; in this case, the documents are stored in S3
  • retrieves either just the nearest-neighbor vectors, or the vectors and their documents

Footnotes

  1. Discussion and Analysis of the MLPerf Storage Benchmark - SNIA SDC25

  2. N10_Workload_Analysis.latest.pdf

  3. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/mlpstorage/rules/utils.py#L100

  4. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/unet3d_h100.yaml#L18

  5. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/mlpstorage/rules/utils.py#L15

  6. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/unet3d_h100.yaml#L24

  7. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/unet3d_h100.yaml#L25

  8. https://github.com/argonne-lcf/dlio_benchmark/blob/57148a19ff004b214748b4290767c84392577aa2/dlio_benchmark/reader/npz_reader.py#L38

  9. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/Rules.md?plain=1#L352

  10. VAST Quick NFS Read Ahead Tuning

  11. https://github.com/argonne-lcf/dlio_benchmark/blob/main/dlio_benchmark/reader/npy_reader_odirect.py#L65

  12. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/resnet50_h100.yaml#L14; The generator creates every record at the same int(dimension) × int(dimension) size since dimension_stdev=0. However record_length_bytes: 114660.07 is a non-integer, and int(math.sqrt(114660.07)) = int(338.6...) = 338. So each generated record is actually 338 × 338 = 114,244 bytes, not 114,660.

  13. https://github.com/argonne-lcf/dlio_benchmark/blob/ea53bcfe26da8df15af6324ba6618e4593998104/dlio_benchmark/reader/tf_reader.py#L91

  14. https://github.com/mlcommons/storage/blob/6524b8f7ddbdce102af738ace4c05d638d8ca204/configs/dlio/workload/resnet50_h100.yaml#L25

  15. https://github.com/argonne-lcf/dlio_benchmark/blob/ea53bcfe26da8df15af6324ba6618e4593998104/dlio_benchmark/reader/tf_reader.py#L97

  16. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/data/ops/readers.py#L36