availability

Availability is the fraction of a supercomputer’s nodes that are online and useable by jobs in a given interval of time.

Quantitative definition

Availability is pretty straightforward:

$Availability = \frac{uptime}{uptime + downtime}$

SLAs are often expressed as average monthly availability; for example, Azure offers a 99.9% (“three nines” or “3x9”) availability for a non-redundant VM on a monthly basis.¹ If the average month has 730.5 hours, this means that VM will only be down for 0.1% of that, or $730.5 hours \times (1 - 0.999) = 0.7305 hours$ (43.83 minutes) each month.

Job uptime, forward progress

Job uptime and forward progress are very similar to, but not exactly the same as, availability.

Job uptime is the fraction of time a job is up and running across nodes. If nodes are available, the job can be accumulating uptime, but the jobs may also be in the process of starting up and not yet running. If a job can restart across all nodes instantaneously as soon as all nodes are up, job uptime is the same as availability.

Forward progress is the fraction of time a job is up, running, and doing productive computation. By exclusion, any time the job is up and running but not doing productive computation does not count towards forward progress. This is an important distinction if a job spends a lot of time checkpointing to protect itself against failure; if a job runs for 10 minutes then spends 10 minutes checkpointing, its forward progress is only 50% even though its job uptime is 100%. Meta calls this metric “effective training time ratio” (ETTR)² and Google calls this “runtime goodput.”³

Forward progress is a concept closely related to JMTTI and came out of the nuclear weapons simulation community.⁴

Relationship with reliability

If you know the MTBF and mean time to repair (MTTR), you can calculate the availability of a server:

$Availability = \frac{MTBF}{MTBF + MTTR}$

You can relate this term directly to job reliability too. For example, Llama-3 had a JMTTI of 2.5 hours and a job uptime of 90%.⁵ Tweaking the above,

$Availability = \frac{JMTTI}{JMTTI + MTTR}$

$0.90 = \frac{2.5}{2.5 + MTTR}$

$MTTR = \frac{5}{18} = 0.278 hr = 16.7 min$

Since Meta used only 16,000 of the 24,000 GPUs on their H100 cluster, this probably reflects a mean time to restart from checkpoint of ~16.7 minutes.

In practice

See In practice for examples of availability from the literature.

Glenn's Digital Garden

Explorer

availability

Quantitative definition

Job uptime, forward progress

Relationship with reliability

In practice

Graph View

Table of Contents

Backlinks

Glenn's Digital Garden

Explorer

availability

Quantitative definition

Job uptime, forward progress

Relationship with reliability

In practice

Footnotes

Graph View

Table of Contents

Backlinks