There are many relevant terms and concepts. Here is a brain dump of them:

  • JMTTI is Job Mean Time To Interrupt.
  • Job uptime is the fraction of time in a fixed window (such as 24 hours) during which a large-scale HPC job is actually running.
  • Forward progress is like job uptime, but it only counts time when the job is productively computing new results. If the job is running but is writing or reading a checkpoint, it is not making forward progress. Similarly, if a job crashed and is re-computing the last 20 timesteps because they were lost since the last checkpoint, that recomputation is not forward progress. This metric is very hard to measure without direct knowledge of the specific application.
  • Normalized MTTI and normalized MTBF follow what Gupta et al.1 called “scale-normalized MTBF”: the system-level MTBF multiplied by the number of nodes, which puts systems of different sizes on a common per-node footing (see the sketch after this list).
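
To make these definitions concrete, here is a minimal sketch in Python with made-up numbers. The window length, interval breakdown, node count, and interrupt count are all illustrative assumptions, not measurements from any real system; the script just restates the definitions above as arithmetic.

```python
# Hypothetical 24-hour window for one large-scale job (all times in hours).
# The interval labels and numbers are illustrative, not from any real system.
WINDOW_HOURS = 24.0
NUM_NODES = 4096           # assumed system scale, for the normalization example
NUM_INTERRUPTS = 2         # assumed number of interrupts seen in the window

intervals = {
    "productive_compute": 18.0,   # computing new timesteps
    "checkpoint_io":       1.5,   # writing/reading checkpoints (running, not progressing)
    "recompute_lost_work": 1.0,   # redoing timesteps lost since the last checkpoint
    "down_or_queued":      3.5,   # job not running at all (failures, restarts, queue)
}

running_hours = (intervals["productive_compute"]
                 + intervals["checkpoint_io"]
                 + intervals["recompute_lost_work"])

# Job uptime: fraction of the window the job was actually running.
job_uptime = running_hours / WINDOW_HOURS

# Forward progress: fraction of the window spent computing new results only.
forward_progress = intervals["productive_compute"] / WINDOW_HOURS

# JMTTI: mean running time between interrupts (one common convention;
# wall-clock time per interrupt could also be used).
jmtti_hours = running_hours / NUM_INTERRUPTS

# Scale-normalized MTTI, per the definition above: JMTTI multiplied by the
# number of nodes, giving node-hours per interrupt so that systems of
# different sizes can be compared.
normalized_mtti_node_hours = jmtti_hours * NUM_NODES

print(f"job uptime       = {job_uptime:.1%}")
print(f"forward progress = {forward_progress:.1%}")
print(f"JMTTI            = {jmtti_hours:.1f} hours")
print(f"normalized MTTI  = {normalized_mtti_node_hours:,.0f} node-hours")
```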

In practice

Below are some anecdotes about failures when running large-scale workloads. Most of the literature covers traditional scientific computing workloads, but training frontier models is emerging as a new source of production data on failure rates.

| System / Metric | JMTTI | Job uptime | Components | Parameters | Tokens | Year deployed | Source |
|---|---|---|---|---|---|---|---|
| AWS HLAT | | 98.81% | 4,096x Trainium | 7B or 70B | 600B | 2024 | 2 |
| AWS HLAT, manual restarts | | 77.83% | 4,096x Trainium | 7B or 70B | 600B | 2024 | 2 |
| Meta Llama-3.1 | 2.5 hours | 90% | 16,000x H100 | 405B | 15.6T | 2024 | 3 |
| Google Gemini | | 85–97% | TPUv4 (?) | ? | ? | 2023 | 4 |
| Meta OPT-175B | 7.5 hours | 59% | 992x or 1,024x A100 | 175B | | 2021 | 5 |
| OLCF Titan | | | 18,688x K20X GPUs | N/A | N/A | 2014 | 1 |
| LANL Trinity | < 12 hours | | > 9,900x KNL + 18,400x Haswell CPUs | N/A | N/A | 2017 | 6 |
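
One way to read the JMTTI and job-uptime columns together: if every interrupt costs a roughly fixed amount of downtime plus recomputation, then uptime is approximately JMTTI / (JMTTI + time lost per interrupt). The sketch below illustrates that relationship; the function name, the 15-minute loss per interrupt, and the resulting percentage are illustrative assumptions and do not come from any of the cited reports.

```python
# Hypothetical relationship between JMTTI and job uptime, assuming each
# interrupt costs a fixed amount of lost time (restart plus work lost since
# the last checkpoint). The numbers are illustrative only.
def expected_uptime(jmtti_hours: float, hours_lost_per_interrupt: float) -> float:
    """Fraction of wall-clock time the job is running, if every jmtti_hours
    of running time is followed by hours_lost_per_interrupt of lost time."""
    return jmtti_hours / (jmtti_hours + hours_lost_per_interrupt)

# e.g., a 2.5-hour JMTTI with 15 minutes lost per interrupt:
print(f"{expected_uptime(2.5, 0.25):.1%}")   # ~90.9%
```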

A team at Sandia ran an application called SPARTA on LANL Trinity across 1.2 million MPI processes over the course of three days in 2018.7 Their memo contained the following anecdotes:

  • Six hardware failures (three SIGBUS and three node failures) across over 9,200 Haswell and 9,900 KNL nodes.
  • Ran srun inside a for loop to automatically restart the job when it failed due to a SIGBUS interrupt (a sketch of this pattern follows this list).
  • They identified and banned two slow Haswell nodes.
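
The srun-in-a-for-loop approach is essentially an automatic retry wrapper around the job launch. The memo does not include the script itself, so the following is a sketch of the same idea in Python: the srun command line, the SIGBUS return-code check, and the retry cap are all assumptions, not the team's actual code.

```python
import signal
import subprocess

# Hypothetical srun invocation; the real command line from the memo is not known.
SRUN_CMD = ["srun", "--ntasks=1228800", "./sparta", "-in", "input.deck"]
MAX_RETRIES = 10  # arbitrary safety cap, not from the memo

for attempt in range(1, MAX_RETRIES + 1):
    print(f"attempt {attempt}: launching {' '.join(SRUN_CMD)}")
    result = subprocess.run(SRUN_CMD)

    if result.returncode == 0:
        print("job completed successfully")
        break
    # subprocess reports death-by-signal as a negative return code; srun itself
    # often reports 128 + signal number when a task was killed by a signal.
    elif result.returncode in (-signal.SIGBUS, 128 + signal.SIGBUS):
        print("job died with SIGBUS; restarting from the last checkpoint")
        continue
    else:
        print(f"job failed with return code {result.returncode}; not retrying")
        break
```

A shell for loop around srun, as the team describes, accomplishes the same thing without the Python wrapper.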

They also cite software bugs they discovered at scale, so it is unclear how much of those three days was actually spent computing. As such, it is hard to say what the JMTTI was, except that its upper bound was 72 hours / 6 failures = 12 hours.

Footnotes

  1. Gupta et al. Failures in large scale systems: long-term measurement, analysis, and implications. 2017.

  2. [2404.10630] HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

  3. “Despite these challenges, for Llama 3, we achieved higher than 90% effective training time while supporting automated cluster maintenance, such as firmware and Linux kernel upgrades, which resulted in at least one training interruption daily. The effective training time measures the time spent on useful training over the elapsed time.” The Llama-3 Herd of Models (arxiv.org)

  4. “Compared to both PaLM and PaLM-2, this provided a substantial speedup in recovery time, despite the significantly larger training resources being used. As a result, the overall goodput for the largest-scale training job increased from 85% to 97%.” Gemini: A Family of Highly Capable Multimodal Models

  5. See the OPT-175B page for a breakdown of that training.

  6. Hemmert et al. Trinity: Opportunities and Challenges of a Heterogeneous System. 2018.

  7. Full Trinity run with SPARTA