There are many relevant terms and concepts. Here is a brain dump of them:

  • JMTTI is Job Mean Time To Interrupt.
  • Job uptime is the fraction of time in a fixed window (such as 24 hours) during which a large-scale HPC job is actually running.
  • Forward progress is like job uptime, but it only counts time when the job is productively computing new results. If the job is running but is writing or reading a checkpoint, it is not making forward progress. Similarly, if a job crashed and is re-computing the last 20 timesteps because they were lost since the last checkpoint, that recomputation is not forward progress. This metric is very hard to measure without direct knowledge of the specific application.
  • Normalized MTTI and normalized MTBF follow what Gupta et al.1 called “scale-normalized MTBF”: the system-level MTBF multiplied by the number of nodes, which puts systems of different sizes on a common per-node footing (see the sketch after this list).
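
To make these definitions concrete, here is a minimal sketch in Python with made-up numbers. The window length, interval breakdown, node count, and interrupt count are all illustrative assumptions, not measurements from any real system; the script just restates the definitions above as arithmetic.

```python
# Hypothetical 24-hour window for one large-scale job (all times in hours).
# The interval labels and numbers are illustrative, not from any real system.
WINDOW_HOURS = 24.0
NUM_NODES = 4096           # assumed system scale, for the normalization example
NUM_INTERRUPTS = 2         # assumed number of interrupts seen in the window

intervals = {
    "productive_compute": 18.0,   # computing new timesteps
    "checkpoint_io":       1.5,   # writing/reading checkpoints (running, not progressing)
    "recompute_lost_work": 1.0,   # redoing timesteps lost since the last checkpoint
    "down_or_queued":      3.5,   # job not running at all (failures, restarts, queue)
}

running_hours = (intervals["productive_compute"]
                 + intervals["checkpoint_io"]
                 + intervals["recompute_lost_work"])

# Job uptime: fraction of the window the job was actually running.
job_uptime = running_hours / WINDOW_HOURS

# Forward progress: fraction of the window spent computing new results only.
forward_progress = intervals["productive_compute"] / WINDOW_HOURS

# JMTTI: mean running time between interrupts (one common convention;
# wall-clock time per interrupt could also be used).
jmtti_hours = running_hours / NUM_INTERRUPTS

# Scale-normalized MTTI, per the definition above: JMTTI multiplied by the
# number of nodes, giving node-hours per interrupt so that systems of
# different sizes can be compared.
normalized_mtti_node_hours = jmtti_hours * NUM_NODES

print(f"job uptime       = {job_uptime:.1%}")
print(f"forward progress = {forward_progress:.1%}")
print(f"JMTTI            = {jmtti_hours:.1f} hours")
print(f"normalized MTTI  = {normalized_mtti_node_hours:,.0f} node-hours")
```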

In practice

Below are some anecdotes about failures when running large-scale workloads. Most of the literature covers traditional scientific computing workloads, but training frontier models is emerging as a new source of production data on failure rates.

| System / Metric | JMTTI | Job uptime | Components | Parameters | Tokens | Year deployed | Source |
|---|---|---|---|---|---|---|---|
| AWS HLAT | | 98.81% | 4,096x Trainium | 7B or 70B | 600B | 2024 | 2 |
| AWS HLAT, manual restarts | | 77.83% | 4,096x Trainium | 7B or 70B | 600B | 2024 | 2 |
| Meta Llama-3.1 | 2.5 hours | 90% | 16,000x H100 | 405B | 15.6T | 2024 | 3 |
| Google Gemini | | 85–97% | TPUv4 (?) | ? | ? | 2023 | 4 |
| Meta OPT-175B | 7.5 hours | 59% | 992x or 1,024x A100 | 175B | | 2021 | 5 |
| OLCF Titan | | | 18,688x K20X GPUs | N/A | N/A | 2014 | 1 |
| LANL Trinity | < 12 hours | | > 9,900x KNL + 18,400x Haswell CPUs | N/A | N/A | 2017 | 6 |
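
One way to read the JMTTI and job-uptime columns together: if every interrupt costs a roughly fixed amount of downtime plus recomputation, then uptime is approximately JMTTI / (JMTTI + time lost per interrupt). The sketch below illustrates that relationship; the function name, the 15-minute loss per interrupt, and the resulting percentage are illustrative assumptions and do not come from any of the cited reports.

```python
# Hypothetical relationship between JMTTI and job uptime, assuming each
# interrupt costs a fixed amount of lost time (restart plus work lost since
# the last checkpoint). The numbers are illustrative only.
def expected_uptime(jmtti_hours: float, hours_lost_per_interrupt: float) -> float:
    """Fraction of wall-clock time the job is running, if every jmtti_hours
    of running time is followed by hours_lost_per_interrupt of lost time."""
    return jmtti_hours / (jmtti_hours + hours_lost_per_interrupt)

# e.g., a 2.5-hour JMTTI with 15 minutes lost per interrupt:
print(f"{expected_uptime(2.5, 0.25):.1%}")   # ~90.9%
```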

A team at Sandia ran an application called SPARTA on LANL Trinity across 1.2 million MPI processes over the course of three days in 2018.7 Their memo contained the following anecdotes:

  • Six hardware failures (three SIGBUS and three node failures) across over 9,200 Haswell and 9,900 KNL nodes.
  • Ran srun inside a for loop to automatically restart the job when it failed due to a SIGBUS interrupt (a sketch of this pattern follows this list).
  • They identified and banned two slow Haswell nodes.
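
The srun-in-a-for-loop approach is essentially an automatic retry wrapper around the job launch. The memo does not include the script itself, so the following is a sketch of the same idea in Python: the srun command line, the SIGBUS return-code check, and the retry cap are all assumptions, not the team's actual code.

```python
import signal
import subprocess

# Hypothetical srun invocation; the real command line from the memo is not known.
SRUN_CMD = ["srun", "--ntasks=1228800", "./sparta", "-in", "input.deck"]
MAX_RETRIES = 10  # arbitrary safety cap, not from the memo

for attempt in range(1, MAX_RETRIES + 1):
    print(f"attempt {attempt}: launching {' '.join(SRUN_CMD)}")
    result = subprocess.run(SRUN_CMD)

    if result.returncode == 0:
        print("job completed successfully")
        break
    # subprocess reports death-by-signal as a negative return code; srun itself
    # often reports 128 + signal number when a task was killed by a signal.
    elif result.returncode in (-signal.SIGBUS, 128 + signal.SIGBUS):
        print("job died with SIGBUS; restarting from the last checkpoint")
        continue
    else:
        print(f"job failed with return code {result.returncode}; not retrying")
        break
```

A shell for loop around srun, as the team describes, accomplishes the same thing without the Python wrapper.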

They also cite software bugs they discovered at scale, so it is unclear how much of those three days was actually spent computing. As such, it is hard to say what the JMTTI was, except that its upper bound was 72 hours / 6 failures = 12 hours.

Footnotes

  1. Gupta et al. Failures in large scale systems: long-term measurement, analysis, and implications. 2017.

  2. [2404.10630] HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

  3. “Despite these challenges, for Llama 3, we achieved higher than 90% effective training time while supporting automated cluster maintenance, such as firmware and Linux kernel upgrades, which resulted in at least one training interruption daily. The effective training time measures the time spent on useful training over the elapsed time.” The Llama-3 Herd of Models (arxiv.org)

  4. “Compared to both PaLM and PaLM-2, this provided a substantial speedup in recovery time, despite the significantly larger training resources being used. As a result, the overall goodput for the largest-scale training job increased from 85% to 97%.” Gemini: A Family of Highly Capable Multimodal Models

  5. See the OPT-175B page for a breakdown of that training.

  6. Hemmert et al. Trinity: Opportunities and Challenges of a Heterogeneous System. 2018.

  7. Full Trinity run with SPARTA