The reliability of the individual components inside a node and rack contributes to the overall reliability of a supercomputer. The mathematics of how component reliability affects system reliability is documented in MTBF.
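
As a rough illustration of that math: if components fail independently with exponentially distributed lifetimes, their failure rates add, so system MTBF shrinks as components (and nodes) are added. A minimal sketch of that calculation; the MTBF values and component counts below are hypothetical placeholders, not measured figures:

```python
# Minimal sketch: MTBF of a series system (any component failure takes the
# node down) with independent, exponentially distributed failures.
# All MTBF values and counts below are hypothetical placeholders.
component_mtbf_hours = {"GPU": 50_000, "NIC": 200_000, "SSD": 300_000}
components_per_node = {"GPU": 8, "NIC": 4, "SSD": 2}

# Failure rates add in a series model: lambda_node = sum(count_i / MTBF_i)
node_failure_rate = sum(
    components_per_node[name] / mtbf for name, mtbf in component_mtbf_hours.items()
)
node_mtbf_hours = 1 / node_failure_rate
print(f"Node MTBF: {node_mtbf_hours:,.0f} hours")              # ~5,400 hours

# Scaling out multiplies the failure rate again: a 2,000-node cluster would
# see a node-level failure roughly every node_mtbf_hours / 2,000 hours.
print(f"Cluster MTBF (2,000 nodes): {node_mtbf_hours / 2000:.1f} hours")  # ~2.7 hours
```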

In practice

Meta

Meta published reliability information from its 16K-GPU training run of Llama 3.1 405B. 466 job interruptions occurred over the 54-day training run, with the job still achieving roughly 90% uptime.

Of those 466, 47 were planned activities (hardware/firmware updates or changes to training). The remaining 419 were unplanned and are broken down as follows:[1]

| Component | Category | Interruption Count | % of Interruptions |
|---|---|---|---|
| Faulty GPU | GPU | 148 | 30.1% |
| GPU HBM3 Memory | GPU | 72 | 17.2% |
| Software Bug | Dependency | 54 | 12.9% |
| Network Switch/Cable | Network | 35 | 8.4% |
| Host Maintenance | Unplanned Maintenance | 32 | 7.6% |
| GPU SRAM Memory | GPU | 19 | 4.5% |
| GPU System Processor | GPU | 17 | 4.1% |
| NIC | Host | 7 | 1.7% |
| NCCL Watchdog Timeouts | Unknown | 7 | 1.7% |
| Silent Data Corruption | GPU | 6 | 1.4% |
| GPU Thermal Interface + Sensor | GPU | 6 | 1.4% |
| SSD | Host | 3 | 0.7% |
| Power Supply | Host | 3 | 0.7% |
| Server Chassis | Host | 2 | 0.5% |
| IO Expansion Board | Host | 2 | 0.5% |
| Dependency | Dependency | 2 | 0.5% |
| CPU | Host | 2 | 0.5% |
| System Memory | Host | 2 | 0.5% |
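
A back-of-the-envelope sketch of what those figures imply for interruption cadence, using only the numbers reported above:

```python
# Back-of-the-envelope arithmetic from Meta's reported figures above.
run_hours = 54 * 24                 # 54-day training run
total_interruptions = 466
unplanned_interruptions = 419

print(f"Mean time between interruptions: {run_hours / total_interruptions:.1f} h")                # ~2.8 h
print(f"Mean time between unplanned interruptions: {run_hours / unplanned_interruptions:.1f} h")  # ~3.1 h

# Share of interruptions attributed to the GPU-category rows in the table:
gpu_category_pct = 30.1 + 17.2 + 4.5 + 4.1 + 1.4 + 1.4
print(f"GPU-attributed share: {gpu_category_pct:.1f}%")  # ~58.7%
```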

The specific issue of silent data corruption is also discussed in LLM training at scale.

Crusoe

Crusoe shared the following failure rates for an H200 cluster at GTC 2025:[2]

| Fault Cause | % of Cases |
|---|---|
| GPU recoverable faults | 19.5% |
| GPU host replacement | 41.5% |
| Other host issues (CPU, memory, etc.) | 4.9% |
| InfiniBand recoverable faults | 4.9% |
| InfiniBand hardware issues | 12.2% |
| InfiniBand other issues | 17% |

Treating each percentage as a whole-number share of a common total, the figures are consistent with 205 observed failures during the four-month period. The talk also discussed a 1,600-GPU H200 cluster in Iceland, which may be where this data originated.
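
A quick sanity check on that 205 figure: at that total, each reported share corresponds to a whole number of faults, and those counts sum back to exactly 205. A minimal sketch:

```python
# Sanity check: with 205 total faults, each reported share maps to a whole
# number of cases, and the per-category counts sum back to 205 exactly.
shares = {                      # % of cases from the table above
    "GPU recoverable faults": 19.5,
    "GPU host replacement": 41.5,
    "Other host issues": 4.9,
    "InfiniBand recoverable faults": 4.9,
    "InfiniBand hardware issues": 12.2,
    "InfiniBand other issues": 17.0,
}
total = 205
counts = {cause: round(pct / 100 * total) for cause, pct in shares.items()}
print(counts)                   # 40, 85, 10, 10, 25, 35
print(sum(counts.values()))     # 205
```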

NVIDIA

NVIDIA shared root-cause frequencies for a 6K-GPU training run spanning a four-month training campaign.[3]

“UB timeout” is the most frequent root cause, but NVIDIA doesn’t define the term. It may refer to the “intranode user buffer”,[4] which implies a PCIe error. Similarly, “NaN in gradients” is a symptom rather than a cause, and is likely a euphemism for silent data corruption.
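
If “NaN in gradients” does stand in for silent data corruption, it illustrates why such faults are hard to attribute: the training loop can detect the symptom but not the offending component. A minimal sketch of such a check, assuming a PyTorch-style training loop (the helper name is illustrative, not from any of the cited sources):

```python
import torch

def gradients_are_finite(model: torch.nn.Module) -> bool:
    """Return False if any parameter gradient contains NaN or Inf."""
    # Detects the symptom only; it cannot say which GPU, link, or memory
    # device corrupted the data.
    return all(
        torch.isfinite(p.grad).all()
        for p in model.parameters()
        if p.grad is not None
    )

# Typical use after loss.backward():
#   if gradients_are_finite(model):
#       optimizer.step()
#   else:
#       optimizer.zero_grad()   # drop the step and flag the node for triage
```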

Footnotes

  1. The Llama 3 Herd of Models (arxiv.org)

  2. Fault-Tolerant Managed Training: Crusoe Cloud’s Blueprint for AI Reliability (Presented by Crusoe.ai) | GTC 25 2025 | NVIDIA On-Demand

  3. Ensuring Reliable Model Training on NVIDIA DGX Cloud | NVIDIA Technical Blog

  4. New Scaling Algorithm and Initialization with NVIDIA Collective Communications Library 2.23 | NVIDIA Technical Blog