Introduction
Darshan is a lightweight profiling tool that intercepts I/O calls within HPC applications. If you are interested in using it to profile your applications, see Building and Managing Darshan.
This page contains notes on how Darshan works internally with an eye towards extending its functionality.
Global Core Initialization
The `darshan_core_initialize()` function is called at initialization (either via the GNU C constructor or at `MPI_Init()`). It does the following:
- check for environment variable overrides (`DARSHAN_MEM_ALIGNMENT_OVERRIDE`, `DARSHAN_JOBID_OVERRIDE`, `DARSHAN_MOD_MEM_OVERRIDE`); a sketch of this step follows the list
- set the jobid
- malloc the core structure and structures for
    - the header (`struct darshan_header`)
    - job metadata (`struct darshan_job`)
    - the executable name
    - the "name record buffer"
    - the module scratchpad
- populate the header
- set job-level metadata (uid, start time, nprocs, jobid)
- record any MPI-IO hints specified for Darshan log output (`darshan_log_record_hints_and_ver()`)
- populate the mount table (`darshan_get_exe_and_mounts()`)
- initialize static modules (the `mod_static_init_fns[]` table)
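As a concrete illustration of the override step, here is a minimal sketch. It assumes the `DARSHAN_*_OVERRIDE` macros expand to the environment variable names shown in the comments, and it uses placeholder defaults rather than Darshan's actual values:

```c
/* Illustrative sketch only: variable names, defaults, and fallback behavior
 * are assumptions, not copied from darshan-core.c. */
#include <stdlib.h>
#include <unistd.h>

static int    darshan_mem_alignment = 8;               /* placeholder default */
static size_t darshan_mod_mem_quota = 2 * 1024 * 1024; /* placeholder default, bytes */
static int    jobid = 0;

static void check_env_overrides(void)
{
    char *envstr;

    /* DARSHAN_MEM_ALIGNMENT_OVERRIDE: memory alignment reported to modules */
    if ((envstr = getenv("DARSHAN_MEMALIGN")) != NULL)   /* assumed variable name */
        darshan_mem_alignment = atoi(envstr);

    /* DARSHAN_MOD_MEM_OVERRIDE: per-module record memory cap (assumed MiB) */
    if ((envstr = getenv("DARSHAN_MODMEM")) != NULL)     /* assumed variable name */
        darshan_mod_mem_quota = (size_t)atoi(envstr) * 1024 * 1024;

    /* DARSHAN_JOBID_OVERRIDE: take the job ID directly from the environment,
     * falling back to the PID if no scheduler job ID is available */
    if ((envstr = getenv("DARSHAN_JOBID")) != NULL)      /* assumed variable name */
        jobid = atoi(envstr);
    else
        jobid = (int)getpid();
}
```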
Per-module Initialization
When a wrapped function is called, it checks whether its parent module has been initialized via a `*_PRE_RECORD` macro (e.g., `POSIX_PRE_RECORD`). If the core is enabled but the module's runtime metadata is not yet initialized, the macro calls the module's `*_runtime_initialize` function (e.g., `posix_runtime_initialize()`).
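A simplified sketch of the guard these macros implement is below; the real `POSIX_PRE_RECORD` in darshan-posix.c also takes the module lock and checks whether instrumentation has been disabled at runtime, and `ret` here is the return value of the real (unwrapped) call, computed before the macro runs:

```c
/* Simplified sketch only, not the actual macro from darshan-posix.c. */
#define POSIX_PRE_RECORD() do {                                          \
        if (posix_runtime == NULL)                                       \
            posix_runtime_initialize();   /* lazy, first-use init */     \
        if (posix_runtime == NULL)                                       \
            return ret;   /* no module memory: skip instrumentation */   \
    } while (0)
```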
The `*_runtime_initialize()` function does the following (a sketch follows the list):
- estimate how much memory it will need
- call `darshan_core_register_module()` to trigger memory allocation, which is done exclusively by the core
- bail out if the core refused to give sufficient memory
- malloc and memset the runtime metadata structure
- enable DXT if `DXT_ENABLE_IO_TRACE` is defined
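Putting those steps together, a module initializer might look roughly like the sketch below. The record-size estimate, the struct contents, and the exact `darshan_core_register_module()` argument order are assumptions for illustration; the real function and its parameters are described next.

```c
/* Illustrative sketch of posix_runtime_initialize(); DARSHAN_POSIX_MOD and
 * posix_shutdown are assumed to come from Darshan's internal headers. */
#include <stdlib.h>
#include <string.h>

struct posix_runtime { int file_rec_count; /* ...module bookkeeping... */ };

static struct posix_runtime *posix_runtime = NULL;
static int my_rank = -1;
static int darshan_mem_alignment = 1;

void posix_runtime_initialize(void)
{
    enum { POSIX_REC_SIZE = 680,      /* placeholder record size */
           POSIX_MAX_RECS = 1024 };   /* placeholder record count */

    /* 1. estimate how much record memory the module will need */
    int buf_size = POSIX_MAX_RECS * POSIX_REC_SIZE;

    /* 2. ask the core to register the module and carve out that memory;
     *    the granted size comes back through buf_size */
    darshan_core_register_module(DARSHAN_POSIX_MOD, &posix_shutdown,
                                 &buf_size, &my_rank,
                                 &darshan_mem_alignment);

    /* 3. bail out if the core refused to give sufficient memory */
    if (buf_size < POSIX_REC_SIZE)
        return;                       /* posix_runtime stays NULL */

    /* 4. malloc and memset the runtime metadata structure */
    posix_runtime = malloc(sizeof(*posix_runtime));
    if (posix_runtime)
        memset(posix_runtime, 0, sizeof(*posix_runtime));

    /* 5. DXT tracing would be enabled here when DXT_ENABLE_IO_TRACE is set */
}
```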
Most of the magic happens in the Darshan core's `darshan_core_register_module()`, which does the following (a rough sketch follows the list):
- malloc and memset the module's metadata (`struct darshan_core_module`)
- set the module's initial offset in the Darshan heap (`mod->rec_buf_start`)
- map the module's shutdown function (`mod->mod_shutdown_func`)
- update the core's module list to reflect the addition of the new module
- pass back the actually allocated buffer size via the `inout_mod_buf_size` pointer argument
- pass back the system memory alignment to the module via the `sys_mem_alignment` pointer argument (??? which modules need this?)
- pass back the MPI rank via the `rank` pointer argument
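A rough sketch of that bookkeeping is below. Every field of the core state used here (`heap_cursor`, `heap_remaining`, `mod_array`, and so on) is a stand-in for whatever darshan-core.c actually keeps, and error handling is omitted:

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* per-module bookkeeping kept by the core (fields named in the list above) */
struct darshan_core_module {
    void *rec_buf_start;   /* module's slice of the Darshan record heap */
    void (*mod_shutdown_func)(MPI_Comm, void *, int, void **, int *);
};

/* assumed shape of the core state; the real core structure differs */
static struct {
    char *heap_cursor;      /* next free byte in the record heap */
    int   heap_remaining;   /* bytes left in the record heap */
    int   my_rank;
    int   mem_alignment;
    struct darshan_core_module *mod_array[16];   /* indexed by module ID */
} *darshan_core;

void darshan_core_register_module(
    int mod_id,              /* compile-time module ID (see the enum sketch below) */
    void (*shutdown_fn)(MPI_Comm, void *, int, void **, int *),
    int *inout_mod_buf_size, /* in: requested size; out: granted size */
    int *rank,               /* out: this process's MPI rank */
    int *sys_mem_alignment)  /* out: system memory alignment */
{
    /* malloc and memset the module's metadata */
    struct darshan_core_module *mod = malloc(sizeof(*mod));
    memset(mod, 0, sizeof(*mod));

    /* clamp the request to whatever is left in the core's record heap */
    if (*inout_mod_buf_size > darshan_core->heap_remaining)
        *inout_mod_buf_size = darshan_core->heap_remaining;

    /* set the module's initial heap offset and map its shutdown hook */
    mod->rec_buf_start = darshan_core->heap_cursor;
    mod->mod_shutdown_func = shutdown_fn;

    /* update the core's module list and heap accounting */
    darshan_core->mod_array[mod_id] = mod;
    darshan_core->heap_cursor    += *inout_mod_buf_size;
    darshan_core->heap_remaining -= *inout_mod_buf_size;

    /* pass results back through the pointer arguments */
    if (rank) *rank = darshan_core->my_rank;
    if (sys_mem_alignment) *sys_mem_alignment = darshan_core->mem_alignment;
}
```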
It's interesting to note that Darshan knows what modules are available at compile time, and the core and each module share an understanding of each module's unique module ID.
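For illustration, those shared module IDs can be pictured as a common enum known to both the core and the modules; the names below are meant to convey the idea and are not guaranteed to match the current darshan-log-format.h:

```c
/* Illustrative module ID enum; the real list and its ordering are defined
 * once in Darshan's log-format header and shared by the core, the modules,
 * and the log utilities. */
typedef enum {
    DARSHAN_NULL_MOD = 0,
    DARSHAN_POSIX_MOD,
    DARSHAN_MPIIO_MOD,
    DARSHAN_STDIO_MOD,
    /* ... one entry per instrumentation module ... */
    DARSHAN_MAX_MODS
} darshan_module_id;
```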
Global Core Finalization
The `darshan_core_shutdown()` function is called at shutdown (either via the GNU C destructor or at `MPI_Finalize()`). It does the following:
- ensure shutdown timing is synchronized using `MPI_Barrier()`
- unplug the core (`final_core = darshan_core; darshan_core = NULL`)
- set the job end time
- calculate the job's overall start and end times using global min/max `MPI_Reduce()`s and commit those to the final core's job metadata (see the sketch after this list)
- allocate a buffer to help with log compression
- allocate a buffer to store the Darshan log path (done in heap because it needs to be shared via MPI?)
- calculate the log file path on rank 0, then broadcast it to all ranks; check that everyone has a valid log path name and collectively fail if not
- identify modules that registered during the application's runtime across all MPI processes via `MPI_Allreduce()`
- determine common record names across all MPI processes (which files were opened by everyone) via `darshan_get_shared_records()`
- collectively open the log file and collectively check for errors
- rank 0 writes the job metadata and mount table and leaves room for an uncompressed log header; check for errors and collectively fail if rank 0 failed in the aforementioned task
- write the name->record mapping by calling `darshan_log_write_name_record_hash()`:
    - MPI ranks > 0 remove all their shared record names from the filename->record map (since they are already reduced and represented on rank 0 at this point)
    - rebuild the record map on ranks > 0 such that unique records are at the front of the list of mappings and shared records are at the end
    - then call `darshan_log_append_all()`, which does a collective append
- loop through all the modules; this is described below
- determine which MPI ranks ran out of memory via `MPI_Reduce()`
- determine the total number of active modules (never-activated modules aren't written to the final log file)
- write the Darshan log header from rank 0 and collectively fail on error (`MPI_Bcast()` the return value of the `MPI_File_write_at()`)
- close the Darshan log file and rename it from `*.darshan_partial` to `*.darshan`
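As an example of the timestamp step above, a hedged sketch of the min/max reduction might look like this (variable names and the int64 timestamp type are assumptions; the real code also folds the results into the final core's job metadata):

```c
#include <mpi.h>
#include <stdint.h>

/* Reduce each rank's local start/end timestamps to the job-wide earliest
 * start and latest end on rank 0. */
static void reduce_job_times(int64_t my_start, int64_t my_end,
                             int64_t *job_start, int64_t *job_end)
{
    MPI_Reduce(&my_start, job_start, 1, MPI_INT64_T, MPI_MIN,
               0, MPI_COMM_WORLD);   /* earliest start across all ranks */
    MPI_Reduce(&my_end, job_end, 1, MPI_INT64_T, MPI_MAX,
               0, MPI_COMM_WORLD);   /* latest end across all ranks */
}
```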
The "loop through all the modules" loop does the following:
- flag all globally never-active modules so they don't get written to the Darshan log
- set the shared record list for this module (`mod_shared_recs[]`)
- call the module's shutdown function if the module was used locally. How does this work if a module is only used by some MPI processes? `MPI_COMM_WORLD` is passed to the module-specific shutdown reducer, which would cause a deadlock if not all processes are doing POSIX, right?
All Darshan modules' shutdown functions must have the following prototype:
argument | intent | meaning |
---|---|---|
`MPI_Comm mod_comm` | in | MPI communicator to run collectives with |
`darshan_record_id *shared_recs` | in | pointer to the start of the shared records |
`int shared_rec_count` | in | count of shared data records |
`void **mod_buf` | in/out | pointer to the start of the module's records |
`int *mod_buf_sz` | out | number of records |
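Expressed as a C function-pointer type, that prototype would look roughly like the following; the typedef name and the definition of `darshan_record_id` are assumptions based on the table rather than copies from the Darshan headers:

```c
#include <mpi.h>
#include <stdint.h>

typedef uint64_t darshan_record_id;     /* assumed: opaque 64-bit record ID */

/* function-pointer type matching the argument table above */
typedef void (*darshan_module_shutdown_fn)(
    MPI_Comm mod_comm,                  /* in: communicator for the module's collectives */
    darshan_record_id *shared_recs,     /* in: pointer to the start of the shared records */
    int shared_rec_count,               /* in: count of shared data records */
    void **mod_buf,                     /* in/out: pointer to the start of the module's records */
    int *mod_buf_sz);                   /* out: number of records */
```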