There are many relevant terms and concepts. Here is a brain dump of them:

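  • FIT rate and AFR - Failures In Time rate (the number of failures expected per billion device-hours) and Annualized Failure Rate (the fraction of devices expected to fail in a year)
  • Weibull distribution - A statistical distribution often used to model how the failure rate of components or systems changes over time.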
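As a rough illustration of how these terms relate, the sketch below converts a FIT rate to an AFR and evaluates a Weibull hazard (instantaneous failure) rate. The function names are made up for this example, and the FIT-to-AFR conversion assumes a constant failure rate, which is only an approximation for real hardware.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def fit_to_afr(fit: float) -> float:
    """Convert FIT (failures per 1e9 device-hours) to an annual failure probability.

    Assumption: the failure rate is constant over the year (exponential lifetimes).
    """
    rate_per_hour = fit / 1e9
    return 1.0 - math.exp(-rate_per_hour * HOURS_PER_YEAR)


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k / lam) * (t / lam) ** (k - 1).

    shape (k) < 1: failure rate falls over time (early-life failures dominate)
    shape (k) = 1: constant failure rate (reduces to the exponential case)
    shape (k) > 1: failure rate rises over time (wear-out)
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the functions.
    print(f"AFR for 1000 FIT: {fit_to_afr(1000):.4%}")
    for k in (0.5, 1.0, 2.0):
        print(f"shape={k}: h(10)={weibull_hazard(10, k, 100):.5f}, "
              f"h(50)={weibull_hazard(50, k, 100):.5f}")
```

For small rates the AFR is close to FIT * 8760 / 1e9, so the exponential term mainly matters when the failure rate is high.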

In practice

Meta published reliability information from their 16K-GPU training run of Llama 3.1 405B. 466 job interruptions occurred over the 54-day training run, resulting in a 90% job uptime.
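To get a feel for these figures, here is a back-of-the-envelope calculation of the mean time between interruptions and the average time lost per interruption. It assumes that "90% job uptime" means 10% of wall-clock time was lost and that all of the lost time is attributable to the 466 interruptions; neither assumption comes from the source.

```python
RUN_DAYS = 54
TOTAL_INTERRUPTIONS = 466
UPTIME_FRACTION = 0.90  # assumption: treated as an exact figure

run_hours = RUN_DAYS * 24  # total wall-clock hours in the run

# Average spacing between interruptions across the whole run.
mean_hours_between = run_hours / TOTAL_INTERRUPTIONS

# Assumption: all non-uptime is spread evenly over the interruptions.
lost_hours = run_hours * (1 - UPTIME_FRACTION)
mean_minutes_lost = lost_hours / TOTAL_INTERRUPTIONS * 60

print(f"Run length: {run_hours} hours")
print(f"Mean time between interruptions: {mean_hours_between:.2f} hours")
print(f"Mean time lost per interruption: {mean_minutes_lost:.1f} minutes")
```

Under these assumptions the job was interrupted roughly every 2.8 hours and lost about 17 minutes each time.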

Of those 466, 47 were planned activities (hardware/firmware updates or changes to training). The remaining 419 were unplanned and broke down as follows [1]:

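| Component                      | Category              | Interruption Count | % of Interruptions |
|--------------------------------|-----------------------|--------------------|--------------------|
| Faulty GPU                     | GPU                   | 148                | 30.1%              |
| GPU HBM3 Memory                | GPU                   | 72                 | 17.2%              |
| Software Bug                   | Dependency            | 54                 | 12.9%              |
| Network Switch/Cable           | Network               | 35                 | 8.4%               |
| Host Maintenance               | Unplanned Maintenance | 32                 | 7.6%               |
| GPU SRAM Memory                | GPU                   | 19                 | 4.5%               |
| GPU System Processor           | GPU                   | 17                 | 4.1%               |
| NIC                            | Host                  | 7                  | 1.7%               |
| NCCL Watchdog Timeouts         | Unknown               | 7                  | 1.7%               |
| Silent Data Corruption         | GPU                   | 6                  | 1.4%               |
| GPU Thermal Interface + Sensor | GPU                   | 6                  | 1.4%               |
| SSD                            | Host                  | 3                  | 0.7%               |
| Power Supply                   | Host                  | 3                  | 0.7%               |
| Server Chassis                 | Host                  | 2                  | 0.5%               |
| IO Expansion Board             | Host                  | 2                  | 0.5%               |
| Dependency                     | Dependency            | 2                  | 0.5%               |
| CPU                            | Host                  | 2                  | 0.5%               |
| System Memory                  | Host                  | 2                  | 0.5%               |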
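The table can be rolled up by category to see where unplanned interruptions concentrate, and the GPU rows can be turned into a rough per-GPU failure rate. The sketch below recomputes shares directly from the counts, so they need not match the published percentage column exactly. The GPU count of 16,384 and the 54 x 24 hours of exposure per GPU are assumptions made for illustration, not figures taken from the table.

```python
from collections import defaultdict

# (component, category, interruption count), transcribed from the table above.
INTERRUPTIONS = [
    ("Faulty GPU", "GPU", 148),
    ("GPU HBM3 Memory", "GPU", 72),
    ("Software Bug", "Dependency", 54),
    ("Network Switch/Cable", "Network", 35),
    ("Host Maintenance", "Unplanned Maintenance", 32),
    ("GPU SRAM Memory", "GPU", 19),
    ("GPU System Processor", "GPU", 17),
    ("NIC", "Host", 7),
    ("NCCL Watchdog Timeouts", "Unknown", 7),
    ("Silent Data Corruption", "GPU", 6),
    ("GPU Thermal Interface + Sensor", "GPU", 6),
    ("SSD", "Host", 3),
    ("Power Supply", "Host", 3),
    ("Server Chassis", "Host", 2),
    ("IO Expansion Board", "Host", 2),
    ("Dependency", "Dependency", 2),
    ("CPU", "Host", 2),
    ("System Memory", "Host", 2),
]

# Sum interruption counts per category.
by_category = defaultdict(int)
for _component, category, count in INTERRUPTIONS:
    by_category[category] += count

total = sum(by_category.values())
for category, count in sorted(by_category.items(), key=lambda kv: -kv[1]):
    print(f"{category:<22} {count:>4} {count / total:6.1%}")

# Rough per-GPU rate. Assumptions: 16,384 GPUs, each exposed for the full run.
NUM_GPUS = 16_384
RUN_HOURS = 54 * 24
gpu_hours = NUM_GPUS * RUN_HOURS
gpu_rate_per_hour = by_category["GPU"] / gpu_hours
gpu_fit = gpu_rate_per_hour * 1e9          # failures per 1e9 GPU-hours
gpu_failures_per_year = gpu_rate_per_hour * 8760  # expected failures per GPU-year

print(f"GPU-category FIT: {gpu_fit:,.0f}")
print(f"GPU-category failures per GPU-year: {gpu_failures_per_year:.3f}")
```

Because the run covers only 54 days under heavy training load, extrapolating to a yearly rate in this way is a crude estimate rather than a measured AFR.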

The specific issue of silent data corruption is also discussed in LLM training at scale.

Footnotes

  1. The Llama 3 Herd of Models (arxiv.org)