md-workbench generates a semi-synchronous metadata-intensive workload that was designed to mimic what a parallel compilation may look like to a file system. It runs in three phases which are described below.
Let's take a look at what a simple md-workbench does:
mpirun -n 2 md-workbench -I 7 -P 11 -D 3
According to the manual,
- -I is the number of "objects" (files) to manipulate per "data set" (directory)
- -P is the number of objects to precreate per data set
- -D is the number of data sets to manipulate per process
Do not run md-workbench with the default parameters because they reflect a benchmark that will run for a very long time.
Phase 1 - Precreate phase
This phase can be isolated by specifying the -1 or --run-precreate option.
Rank 0 does
mkdir out/0_0
mkdir out/0_1
mkdir out/0_2
and rank 1 does
mkdir out/1_0
mkdir out/1_1
mkdir out/1_2
Then rank 0 creates a bunch of files:
open(out/0_0/file-0, O_CREAT)
- write 3901 bytes to this file
- close this file
- ...
open(out/0_0/file-10, O_CREAT)
- write 3901 bytes to this file
- close this file
then repeat this for out/0_1 and out/0_2.
Then there's a barrier.
As you can see, we created three directories per rank because of -D 3 and eleven files per directory because of -P 11. The -I 7 does not play a role here.
The 3901-byte file size can be changed using -S or --object-size, and this phase can be run multiple times using -R or --iterations.
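The precreate sequence above can be emulated on a local file system with a few lines of Python. This is only a sketch of the access pattern, not md-workbench's own code: the function name `precreate` and its parameters are made up for illustration, the loop over ranks stands in for `mpirun -n 2`, and there is no MPI barrier.

```python
import os
import tempfile

def precreate(workdir, rank, datasets, precreate_count, object_size=3901):
    """Emulate what one rank does in Phase 1 (hypothetical helper, not md-workbench code)."""
    created = []
    # First make one directory per data set: out/<rank>_<dataset>
    for d in range(datasets):
        os.makedirs(os.path.join(workdir, "out", f"{rank}_{d}"))
    # Then create, write, and close each object in each directory
    for d in range(datasets):
        for f in range(precreate_count):
            path = os.path.join(workdir, "out", f"{rank}_{d}", f"file-{f}")
            with open(path, "wb") as handle:
                handle.write(b"\0" * object_size)
            created.append(path)
    return created

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        for rank in range(2):  # stands in for mpirun -n 2
            files = precreate(workdir, rank, datasets=3, precreate_count=11)
            print(f"rank {rank} created {len(files)} files")
```

With -D 3 and -P 11 each rank ends up with 3 x 11 = 33 files, and -I never enters the calculation.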
Phase 2 - Benchmark phase
This phase can be isolated by specifying the -2 or --run-benchmark option.
Rank 0 does:
stat(out/1_0/file-0)
open(out/1_0/file-0)
- read 3901 bytes from $WORKDIR/out/1_0/file-0 - this is an absolute path, not a relative one
- close($WORKDIR/out/1_0/file-0)
- unlink(./out/1_0/file-0)
open(./out/1_0/file-10, O_CREAT)
- write 3901 bytes to $WORKDIR/out/1_0/file-10
- close($WORKDIR/out/1_0/file-10)
This is repeated for file-0 and file-10 in different directories for a total of 21 times, or -I times -D. This is to say, for each directory:
- a file is statted, opened, read, closed, and deleted
- a new file is created, written, and closed
There is no net creation or destruction of files, but files are being created and destroyed repeatedly. The mapping of MPI ranks to directories/files here is shuffled relative to Phase 1. Stay tuned for more information on how this shuffling is determined.
Then there's a barrier.
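A single benchmark cycle can be sketched in Python as follows. This is an illustration of the system-call sequence only: `benchmark_cycle` is a hypothetical helper, and the caller chooses which file is consumed and which is created, since the rank-to-file mapping md-workbench uses is not reproduced here.

```python
import os
import tempfile

def benchmark_cycle(old_path, new_path, object_size=3901):
    """One Phase 2 cycle: consume an existing object, then create a new one."""
    os.stat(old_path)                      # stat
    with open(old_path, "rb") as handle:   # open
        data = handle.read(object_size)    # read; leaving the block closes the file
    os.unlink(old_path)                    # delete
    with open(new_path, "wb") as handle:   # create
        handle.write(b"\0" * object_size)  # write; leaving the block closes the file
    return len(data)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        old_path = os.path.join(directory, "file-old")  # hypothetical names
        new_path = os.path.join(directory, "file-new")
        with open(old_path, "wb") as handle:
            handle.write(b"\0" * 3901)
        before = len(os.listdir(directory))
        benchmark_cycle(old_path, new_path)
        after = len(os.listdir(directory))
        print(before, after)  # the file count is unchanged by a cycle
```

Each cycle removes one file and adds one, which is why the directory population stays constant over the whole phase.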
Note that -P plays no role here; it is solely for Phase 1. However, the first round of open-read-close-unlink in Phase 2 depends on precreated files which are generated by Phase 1, so you should make sure that -P is the same as -I. If you don't do this, Phase 2 will try to open files that weren't precreated and categorize these operations as errors on the first round. I've had this cause both harmless warnings and a full job failure, and I'm not sure what circumstances lead to what. To be safe, just always specify both -P and -I for all phases.
Also, the default number of iterations is 3 (-R 3) which means this test will run three times before completing. It's a good idea to specify -R 1 if you want the test to complete quickly.
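One way to follow this advice when running the phases as separate jobs is to build every command line from a single set of parameters, so that -P and -I cannot drift apart. The helper below is a hypothetical convenience wrapper, not part of md-workbench. It only uses flags mentioned above, and it assumes that passing -R alongside -1, -2, or -3 is accepted.

```python
def phase_command(phase, nprocs, objects, datasets, iterations=1):
    """Build an argv list for one md-workbench phase with -P forced equal to -I."""
    if phase not in (1, 2, 3):
        raise ValueError("phase must be 1 (precreate), 2 (benchmark), or 3 (cleanup)")
    return [
        "mpirun", "-n", str(nprocs),
        "md-workbench",
        f"-{phase}",             # isolate a single phase
        "-I", str(objects),      # objects manipulated per data set
        "-P", str(objects),      # precreate the same number of objects
        "-D", str(datasets),     # data sets per process
        "-R", str(iterations),   # one iteration keeps the run short
    ]

if __name__ == "__main__":
    for phase in (1, 2, 3):
        print(" ".join(phase_command(phase, nprocs=2, objects=7, datasets=3)))
```

The resulting lists can be handed to `subprocess.run` or pasted into a job script.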
Phase 3 - Cleanup phase
This phase can be isolated by specifying the -3 or --run-cleanup option.
Rank 0:
- unlinks the seven files in ./out/0_0/
- rmdir(./out/0_0/)
Repeat for 0_1/ and 0_2/.
Rank 1 does the same for its directories and files from Phase 1.
Understanding Output
The default output of md-workbench is not labeled very well. It looks something like this (but note that I added some line breaks for clarity):
benchmark process max:60.73s min:60.08s mean: 60.45s balance:98.9 stddev:0.2
rate:2420.7 iops/s objects:36750 rate:605.2 obj/s tp:4.5 MiB/s
op-max:4.7359e-01s (0 errs) stonewall-iter:32
read(6.6781e-04s, 1.2720e-03s, 2.1476e-03s, 4.7331e-03s, 9.7501e-03s, 2.6300e-02s, 9.0354e-02s)
stat(5.8293e-04s, 1.3051e-03s, 2.2000e-03s, 4.6260e-03s, 8.8098e-03s, 2.3064e-02s, 9.9011e-02s)
create(3.1090e-03s, 2.6335e-02s, 4.2686e-02s, 7.1706e-02s, 1.0507e-01s, 1.8351e-01s, 4.7359e-01s)
delete(1.4300e-03s, 9.6769e-03s, 1.4584e-02s, 2.1045e-02s, 3.1489e-02s, 7.1808e-02s, 3.3352e-01s)
Let's break this down. First are the basic statistics:
- max, min, and mean are the wall seconds used by the slowest MPI rank, the fastest MPI rank, and the average across all ranks
- balance is the fastest rank's time divided by the slowest rank's time (in percent)
- stddev is the standard deviation of all ranks' time
Then the benchmark rate summaries:
- rate (for iops/s) is the number of open/read/close/unlink/create+open/write/close cycles successfully completed, multiplied by four, divided by walltime. md-workbench considers one cycle to be four I/O operations (write, stat, read, delete), but I don't agree. You can ultimately divide rate by four to get the cycle rate, then multiply it by whatever number of I/O operations per cycle you care to use.
- objects is the number of objects (files) successfully manipulated
- rate (for obj/s) is objects divided by time. This is the same as the cycle rate and should be exactly 0.25 times the rate for iops/s discussed above.
- tp is the number of bytes successfully read and written over the whole benchmark phase divided by overall walltime. It is literally (objects created + objects read) multiplied by the mebibytes-per-object and divided by walltime.
- op-max is the time taken by the slowest single operation (stat, create, read, close, etc) by any MPI rank. Not a terribly useful metric, but it tells you if a single operation on a single MPI rank dominated the overall walltime.
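These definitions can be checked against the sample output with a little arithmetic. The sketch below makes two assumptions that the output itself does not state: that "walltime" means the slowest rank's time (max), and that the run used the 3901-byte object size. The function name `derive_summary` is made up for illustration.

```python
MIB = 1024 * 1024

def derive_summary(t_max, t_min, objects, object_size=3901, ops_per_cycle=4):
    """Recompute the summary rates from the raw numbers in the output line."""
    obj_rate = objects / t_max                   # cycles (objects) per second
    return {
        "balance": 100.0 * t_min / t_max,        # fastest / slowest, in percent
        "obj_rate": obj_rate,
        "iops": ops_per_cycle * obj_rate,        # md-workbench counts 4 ops per cycle
        # every cycle reads one object and writes one object
        "tp_mib": 2 * objects * object_size / MIB / t_max,
    }

if __name__ == "__main__":
    derived = derive_summary(t_max=60.73, t_min=60.08, objects=36750)
    reported = {"balance": 98.9, "obj_rate": 605.2, "iops": 2420.7, "tp_mib": 4.5}
    for key, value in derived.items():
        print(f"{key}: derived {value:.2f}, reported {reported[key]}")
```

Under those assumptions the derived values land within rounding of the reported ones; the small differences are consistent with the times being printed to only two decimal places. Changing `ops_per_cycle` gives the IOPS figure for whatever per-cycle operation count you prefer.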
The stonewall-iter is how many cycles successfully complete. This value will never exceed whatever you specified for -I.
Finally, statistics are shown for each of the I/O operations per cycle (read/stat/create/delete). They take the form
opname(min, q1, median, q3, q90, q99, max)
which is pretty self-explanatory:
- opname is the I/O operation being timed (read, stat, create, or delete)
- min - fastest time to complete the I/O operation
- q1 - time taken to complete the op corresponding to first quartile
- median - the median operation time
- q3 - time taken to complete the op corresponding to third quartile
- q90 - time taken to complete the op corresponding to the 90th percentile - these are going to be pretty slow
- q99 - time taken to complete the op corresponding to the 99th percentile - these are the long-tail stragglers
- max - slowest time to complete the I/O operation
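Since the output is unlabeled, it can help to parse these tuples into named fields before comparing runs. The following is a minimal sketch that assumes the output format shown above; `parse_op_stats` and the field names are my own labels, not anything md-workbench provides.

```python
import re

FIELDS = ("min", "q1", "median", "q3", "q90", "q99", "max")
# Matches e.g. "read(6.6781e-04s, 1.2720e-03s, ...)" and captures name and body
PATTERN = re.compile(r"(read|stat|create|delete)\(([^)]*)\)")

def parse_op_stats(text):
    """Return {opname: {field: seconds}} for every per-operation tuple in text."""
    stats = {}
    for name, body in PATTERN.findall(text):
        # Each value looks like "6.6781e-04s"; drop the trailing unit before converting
        values = [float(item.strip().rstrip("s")) for item in body.split(",")]
        if len(values) == len(FIELDS):
            stats[name] = dict(zip(FIELDS, values))
    return stats

if __name__ == "__main__":
    sample = (
        "op-max:4.7359e-01s (0 errs) stonewall-iter:32 "
        "read(6.6781e-04s, 1.2720e-03s, 2.1476e-03s, 4.7331e-03s, "
        "9.7501e-03s, 2.6300e-02s, 9.0354e-02s)"
    )
    for op, fields in parse_op_stats(sample).items():
        print(op, fields["median"], fields["q99"])
```

Tuples with an unexpected number of values are skipped rather than guessed at, so a format change shows up as missing keys instead of mislabeled numbers.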
If you specify --print-detailed-stats, you get a nice columnar summary of the benchmark phases' performance:
phase d name create delete ob nam create read stat delete t_inc_b t_no_bar thp max_t
benchmark 0 0 20075 20075 20075 20075 62.191s 62.191s 2.40 MiB/s 1.2642e+00
but strangely, it silences the other statistics for each operation.
Stonewalling
md-workbench supports stonewalling via the -w option, but if you run your phases as separate jobs using -1 and -2 explicitly, Phase 2 will only run with stonewalling if wear-out (-W) is also specified. I think this is because md-workbench cannot store and recall the progress of each MPI rank after Phase 1, so Phase 2 does not know how many files each rank should expect to touch.
Omitting I/O
You can make md-workbench only test metadata operations by specifying -S 0. This sets the file size to 0 bytes, and md-workbench is smart enough to simply never call read(2) or write(2) during the precreate and benchmark phases. I argue that this is not a realistic test from a user perspective, since there are few reasons to open a file without performing some I/O on it, but it is a good way to drive load on only the metadata subsystem on file systems that separate data from metadata (like Lustre).