Discrete-event simulation is a useful technique for modeling the behavior of complex systems (like supercomputers and data centers) where events, and the reactions to them, unfold over time. Rather than stepping through time in fixed increments (e.g., simulating hour by hour), discrete-event simulation skips directly from one event to the next. Because its cost scales with the number of events rather than the total time simulated, it is an efficient way to simulate rare phenomena like multiple simultaneous failures.

In HPC specifically, discrete-event simulation can be used to examine how different types of events (component failures, maintenance events) affect the whole system when they occur across different components (compute nodes, rack controllers, PDUs, CDUs). You describe the lifecycle of each component (it works, it breaks, it gets repaired, it gets returned to service) and how other components respond to each event (it breaks, it gets returned to service), then let the simulation run. By varying how many of each component there are and how often events happen to them over time, you can see how different design decisions affect the overall reliability of the system.

Features

Discrete-event simulation has a few key components:

  1. Entities: The components that make up an overall system (servers, PDUs, or network switches) that can experience events and change state (online, faulted, out for repair)
  2. Events: Distinct incidents that change the state of entities at a specific point in time (e.g., a node failure, a repair completion, or a cooling system fault)
  3. Processes: Sequences of events for a particular entity that describe its lifecycle (e.g., a server transitions between “online,” “faulted,” and “out for repair” in that order over and over)
  4. Queues: Points within the simulation where events or entities may wait due to constrained resources (e.g., waiting for a technician or a spare part during a repair)

The simulation framework manages time by fast-forwarding the simulation clock to the next scheduled event rather than stepping through every intervening unit of time.
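To make the clock-jumping idea concrete, here is a minimal sketch of an event loop written with only the standard library. It is not how any particular framework is implemented; the function name run_event_loop and the sample events are made up for illustration. The loop keeps events in a priority queue ordered by time and sets the clock to each event's timestamp as it is popped, so the work done depends on the number of events, not on the span of time covered.

```python
import heapq

def run_event_loop(events, until):
    """Process (time, name) events in time order, jumping the clock between them.

    Hypothetical helper for illustration; real frameworks do much more.
    """
    queue = list(events)
    heapq.heapify(queue)          # priority queue ordered by event time
    now = 0.0
    log = []
    while queue and queue[0][0] <= until:
        now, name = heapq.heappop(queue)   # jump straight to the next event
        log.append((now, name))
    return log

# Three made-up events spread across roughly a year of simulated hours
events = [(8000.0, "node repaired"), (12.5, "node failure"), (7990.0, "PDU fault")]
log = run_event_loop(events, until=8766.0)
for when, name in log:
    print(f"t={when:8.1f} h  {name}")
```

The loop body runs three times here, even though the events span thousands of simulated hours.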

SimPy

SimPy is a solid Python library for building discrete-event simulations. At a high level, you define processes (e.g., a component's lifecycle) as Python generator functions and schedule events that occur at specified or random time intervals.

Because SimPy is relatively straightforward to learn, it is great for quickly prototyping reliability models for HPC systems, storage arrays, or any other complex system. For example, here is a basic SimPy simulation that estimates the mean time to data loss for a RAID array:

import simpy
import random
import argparse
 
class RAIDArray:
    """Tracks failed drives in a RAID array and signals when data loss occurs.
    """
    def __init__(self, env, num_drives=10, num_parity=2, mtbf=1e6, repair_time=24):
        self.env = env
        self.num_drives = num_drives
        self.num_parity = num_parity        # number of parity drives (allowed simultaneous failures)
        self.mtbf = mtbf                    # mean time between drive failures (hours)
        self.repair_time = repair_time      # fixed repair time (hours)
        self.failed_drives = 0              # current number of failed drives
        self.failed = False                 # flag indicating that the array has suffered data loss
        self.data_loss_event = env.event()  # triggered when data loss occurs
 
    def lifecycle(self, drive_id):
        """Lifecycle for a single hard drive.
 
        Waits for a drive failure, then starts a repair (fixed delay). If there
        are more failed drives than parity drives, data loss is triggered.
        """
        while not self.failed:
            # Pick a time until this drive fails using an exponential distribution
            ttf = random.expovariate(1.0 / self.mtbf)
            yield self.env.timeout(ttf)
            if self.failed:
                break
 
            # Record the drive failure.
            self.failed_drives += 1
 
            # If too many drives have failed, mark the array as failed
            if self.failed_drives > self.num_parity:
                self.failed = True
                self.data_loss_event.succeed()
                print(f"Array failed after {self.env.now:15,.0f} hours")
                break
 
            # Simulate the replace and rebuild time
            yield self.env.timeout(self.repair_time)
            if self.failed:
                break
 
            # Drive is repaired
            self.failed_drives -= 1
 
def simulate_array(mtbf, repair_time, num_drives=10, num_parity=2):
    """Simulates a RAID array until data loss occurs.
 
    Returns the simulated time (in hours) at which data loss happened.
    """
    env = simpy.Environment()
    array = RAIDArray(env, num_drives=num_drives, num_parity=num_parity, mtbf=mtbf, repair_time=repair_time)
    for i in range(num_drives):
        env.process(array.lifecycle(i))
    env.run(until=array.data_loss_event)
    return env.now
 
def run_simulation(runs, simulation_func, **kwargs):
    """Runs the simulation for a given number of runs and returns the mean time
    to data loss.
    """
    times = []
    for i in range(runs):
        t = simulation_func(**kwargs)
        times.append(t)
    mean_time = sum(times) / len(times)
    return mean_time, times
 
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Simulates time to data loss for a RAID array.')
    parser.add_argument('--mtbf', type=float, default=1e6, help='Mean time between drive failures (hours)')
    parser.add_argument('--mttr', type=float, default=4*24, help='Repair time (hours)')
    parser.add_argument('--num_drives', type=int, default=5, help='Number of data+parity drives per array')
    parser.add_argument('--num_parity', type=int, default=1, help='Number of parity drives per array')
    parser.add_argument('--runs', type=int, default=100, help='Number of simulation runs to average over')
 
    args = parser.parse_args()
 
    print(f"Simulating a {args.num_drives - args.num_parity}+{args.num_parity} RAID array")
    mean_time_single, _ = run_simulation(
        args.runs, 
        simulate_array,
        mtbf=args.mtbf, repair_time=args.mttr,
        num_drives=args.num_drives, num_parity=args.num_parity
    )
    print(f"Mean time to data loss: {mean_time_single:,.0f} hours ({mean_time_single/8766:,.0f} years) over {args.runs} runs.")

In this example, the RAID array is the only entity, and a drive lifecycle process is launched for each of the drives in that array. As failures and repairs happen, the array keeps track of how many drives are in a failed state. When too many failures happen within a short timespan, a data loss event is triggered. This simulation does not use queues, but one could be added by representing data center staff as a limited resource that the repair part of the lifecycle must acquire before it can proceed.

Hot tips

As I’ve built models with SimPy, I’ve learned a few important things:

Keep callbacks simple. You shouldn’t try to make a callback reset the event that triggered it, because that event has already been triggered and is still in the middle of being processed while the callback runs. Instead, every event that triggers a callback should also have some lifecycle loop waiting for it (yield event) that cleans up after the event has been fully processed. At most, callbacks should enqueue other processes. Callbacks should never directly trigger events that would trigger more callbacks.

Have a clear hierarchy of responsibility. For example, if nodes and racks are being modeled, racks should respond only to events emitted by nodes; they should not respond to events emitted by other racks. This keeps processes easy to reason about and debug.