Availability is the fraction of a supercomputer’s nodes that are online and useable by jobs in a given interval of time.

Quantitative definition

Availability is pretty straightforward:

SLAs are often expressed as average monthly availability; for example, Azure offers a 99.9% (“three nines” or “3x9”) availability for a non-redundant VM on a monthly basis.1 If the average month has 730.5 hours, this means that VM will only be down for 0.1% of that, or (43.83 minutes) each month.

Job uptime, forward progress

Job uptime and forward progress are very similar to, but not exactly the same as, availability.

Job uptime is the fraction of time a job is up and running across nodes. If nodes are available, the job can be accumulating uptime, but the jobs may also be in the process of starting up and not yet running. If a job can restart across all nodes instantaneously as soon as all nodes are up, job uptime is the same as availability.

Forward progress is the fraction of time a job is up, running, and doing productive computation. By exclusion, any time the job is up and running but not doing productive computation does not count towards forward progress. This is an important distinction if a job spends a lot of time checkpointing to protect itself against failure; if a job runs for 10 minutes then spends 10 minutes checkpointing, its forward progress is only 50% even though its job uptime is 100%. Meta calls this metric “effective training time ratio” (ETTR)2 and Google calls this “runtime goodput.”3

Forward progress is a concept closely related to JMTTI and came out of the nuclear weapons simulation community.4

Relationship with reliability

If you know the MTBF and mean time to repair (MTTR), you can calculate the availability of a server:

You can relate this term directly to job reliability too. For example, Llama-3 had a JMTTI of 2.5 hours and a job uptime of 90%.5 Tweaking the above,

Since Meta used only 16,000 of the 24,000 GPUs on their H100 cluster, this probably reflects a mean time to restart from checkpoint of ~16.7 minutes.

Footnotes

  1. Azure Resiliency Infographic_Final (microsoft.com)

  2. [2410.21680v1] Revisiting Reliability in Large-Scale Machine Learning Research Clusters

  3. Goodput metric as measure of ML productivity | Google Cloud Blog

  4. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps (2006)

  5. The Llama-3 Herd of Models (arxiv.org)