Introduction: Why Benchmarks Matter

  • Storage benchmarks (IO500, MLPerf Storage, etc.) are widely used to guide infrastructure decisions.
  • They are supposed to represent workloads that matter to end users.
  • But in AI/HPC they often miss the mark, because the I/O patterns of those workloads are arbitrary and malleable: algorithms operate on data in memory, not in storage, so storage’s only job is to get data into memory in whatever way is fastest.

The Core Problem: Who Shapes the Workload?

  • In enterprise storage, applications are fixed (e.g., databases, web services), so storage must adapt to the applications. Benchmarks built around those applications are stable and meaningful.
  • In HPC/AI, the relationship is inverted: applications (frameworks) adapt to storage, and AI models adapt to whatever infrastructure they are given (DeepSeek-R1 is a recent example).
  • Example: If reading huge files all at once performs poorly, an AI practitioner will simply reshard their data into smaller pieces to get higher performance, as sketched below.
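
A minimal sketch of that resharding step, assuming a single large binary file and an illustrative shard size; nothing here is specific to any one framework, and the paths are hypothetical:

```python
# Hypothetical sketch: split one large sample file into fixed-size shards so a
# data loader can issue many smaller, parallelizable reads instead of one huge
# sequential read. Shard size and file names are illustrative assumptions.
import os

SHARD_BYTES = 256 * 1024 * 1024  # 256 MiB per shard; tune to the storage system

def reshard(src_path: str, out_dir: str) -> None:
    """Split src_path into fixed-size shard files written to out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(src_path, "rb") as src:
        idx = 0
        while True:
            chunk = src.read(SHARD_BYTES)
            if not chunk:
                break
            with open(os.path.join(out_dir, f"shard-{idx:05d}.bin"), "wb") as dst:
                dst.write(chunk)
            idx += 1

if __name__ == "__main__":
    reshard("dataset.bin", "shards/")  # illustrative paths
```

The point is not the script itself but how cheap the adaptation is: a practitioner can change the I/O pattern in an afternoon, far faster than any benchmark suite can be revised.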

Why Current Benchmarks Fall Short

  • Benchmarks are created by storage practitioners without meaningful input from AI/HPC end users or an understanding of the state of the art in distributed training/inference frameworks.
  • Encoding today’s I/O patterns into benchmarks locks in behaviors that may already be obsolete by the time the next model or framework is released.
  • The result is that storage benchmarks become more about storage people talking to storage people than about predicting actual workload performance.

The Hidden Purpose of Benchmarks

  • Benchmarks persist not because they help workloads, but because they help infrastructure designers make decisions without becoming AI/HPC experts.
  • This is reasonable, because few system designers can understand PyTorch internals. But it creates a false sense of workload fidelity when people over-index on the importance of benchmark results.
  • Viewed cynically, MLPerf Storage exists to smooth infrastructure conversations, not necessarily to accelerate AI.

Implications & Alternatives

  • For storage practitioners: Recognize what benchmarks can and can’t tell you. Rather than assuming current benchmarks are sufficient, learn how AI frameworks adapt, or partner with people from whom you can learn.
  • Possible alternatives: Benchmark the workflows, not the traces (see the sketch below). Develop simplified workloads that co-evolve with AI libraries. Focus on metrics that emphasize system-level adaptability rather than application-specific fixed access patterns.
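
One way to read “benchmark the workflows” is sketched below: drive storage through an actual framework data loader and report end-to-end throughput while sweeping the knobs a practitioner would actually turn (here, worker count). The PyTorch DataLoader usage is standard, but the shard paths, the trivial ShardDataset, and the worker sweep are illustrative assumptions rather than a proposed standard.

```python
# Minimal sketch: measure the bytes-per-second an actual framework data loader
# achieves against the storage under test, rather than replaying a fixed trace.
import glob
import time

import torch
from torch.utils.data import DataLoader, Dataset

class ShardDataset(Dataset):
    """Reads whole shard files; a stand-in for a real decode/augment pipeline."""
    def __init__(self, pattern: str):
        self.paths = sorted(glob.glob(pattern))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, i: int) -> torch.Tensor:
        with open(self.paths[i], "rb") as f:
            data = f.read()
        # Materialize in host memory so the measurement covers the full path
        # from storage into something the framework can consume.
        return torch.frombuffer(bytearray(data), dtype=torch.uint8)

def measure(pattern: str, workers: int) -> float:
    """Return end-to-end throughput in bytes/second for one full pass."""
    loader = DataLoader(ShardDataset(pattern), batch_size=None, num_workers=workers)
    start, n_bytes = time.perf_counter(), 0
    for sample in loader:
        n_bytes += sample.numel()
    return n_bytes / (time.perf_counter() - start)

if __name__ == "__main__":
    for w in (1, 4, 8):  # sweep parallelism the way a framework user would
        print(f"{w} workers: {measure('shards/*.bin', w) / 1e6:.1f} MB/s")
```

A benchmark framed this way moves with the libraries: when the loader, sharding format, or prefetching strategy changes, the measurement changes with it instead of freezing yesterday’s access pattern.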

Conclusion

  • Benchmarks are not useless, but in HPC/AI they lag reality. Understand the context in which they have meaning.
  • The real question isn’t “does my storage run MLPerf fast?” but “what opportunities does my storage have to co-evolve with AI frameworks and open doors for higher end-to-end performance?”
  • Until benchmarks account for this, they risk reflecting the past rather than offering insight into future capabilities.