The reliability of the individual components inside a node and rack contributes to the overall reliability of a supercomputer. The mathematics of how component reliability affects system reliability is documented in MTBF.
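As a minimal sketch of that math, assuming independent components in series with constant (exponential) failure rates, and using illustrative MTBF values rather than measured ones, failure rates add and the effective MTBF shrinks as the job spans more hardware:

```python
# Minimal sketch: series reliability with constant (exponential) failure rates.
# MTBF values and component counts below are illustrative assumptions,
# not measured figures from any vendor.

component_mtbf_hours = {"GPU": 50_000, "HBM stack": 80_000, "NIC": 200_000}
components_per_node = {"GPU": 8, "HBM stack": 8, "NIC": 4}

# For components in series, failure rates (1 / MTBF) add.
node_failure_rate = sum(
    components_per_node[name] / mtbf for name, mtbf in component_mtbf_hours.items()
)
node_mtbf_hours = 1 / node_failure_rate
print(f"Node MTBF: {node_mtbf_hours:,.0f} hours")

# A job spanning many nodes is interrupted by the first failure anywhere,
# so its effective MTBF divides by the node count.
nodes_in_job = 2_000
print(f"Job-level MTBF: {node_mtbf_hours / nodes_in_job:.2f} hours")
```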
In practice
Meta
Meta published reliability data from its 16K-GPU training run of Llama 3.1 405B: 466 job interruptions occurred over the 54-day training run, resulting in roughly 90% job uptime.
Of those 466, 47 were planned activities (hardware/firmware updates or changes to training). The remaining 419 were unplanned and break down as follows:1
| Component | Category | Interruption Count | % of Interruptions |
|---|---|---|---|
| Faulty GPU | GPU | 148 | 30.1% |
| GPU HBM3 Memory | GPU | 72 | 17.2% |
| Software Bug | Dependency | 54 | 12.9% |
| Network Switch/Cable | Network | 35 | 8.4% |
| Host Maintenance | Unplanned Maintenance | 32 | 7.6% |
| GPU SRAM Memory | GPU | 19 | 4.5% |
| GPU System Processor | GPU | 17 | 4.1% |
| NIC | Host | 7 | 1.7% |
| NCCL Watchdog Timeouts | Unknown | 7 | 1.7% |
| Silent Data Corruption | GPU | 6 | 1.4% |
| GPU Thermal Interface + Sensor | GPU | 6 | 1.4% |
| SSD | Host | 3 | 0.7% |
| Power Supply | Host | 3 | 0.7% |
| Server Chassis | Host | 2 | 0.5% |
| IO Expansion Board | Host | 2 | 0.5% |
| Dependency | Dependency | 2 | 0.5% |
| CPU | Host | 2 | 0.5% |
| System Memory | Host | 2 | 0.5% |
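A quick back-of-the-envelope, using only the figures above, shows how often a job at this scale is interrupted; the sketch below simply redoes that arithmetic:

```python
# Back-of-the-envelope from Meta's published Llama 3.1 405B figures above.
training_days = 54
total_interruptions = 466
unplanned_interruptions = 419

training_hours = training_days * 24
print(f"Mean time between interruptions: {training_hours / total_interruptions:.1f} h")
print(f"Mean time between unplanned interruptions: {training_hours / unplanned_interruptions:.1f} h")
print(f"Share of interruptions that were unplanned: {unplanned_interruptions / total_interruptions:.0%}")
# Roughly one interruption every ~2.8 hours, ~90% of them unplanned.
```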
The specific issue of silent data corruption is also discussed in LLM training at scale.
Crusoe
Crusoe shared the following failure rates for an H200 cluster at GTC25:2
| Fault Cause | % of cases |
|---|---|
| GPU recoverable faults | 19.5% |
| GPU host replacement | 41.5% |
| Other host issues (CPU, memory, etc) | 4.9% |
| InfiniBand recoverable faults | 4.9% |
| InfiniBand hardware issues | 12.2% |
| InfiniBand other issues | 17% |
Treating these percentages as rounded shares of whole-number failure counts, the smallest total consistent with them is 41 observed failures over the four-month period (larger multiples, such as 205, also fit). The talk also discussed a 1,600-GPU H200 cluster in Iceland, which may be where this data originated.
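That reconstruction can be checked mechanically; the sketch below (purely an inference from the published percentages, not data from the talk) searches for the smallest total whose integer breakdown rounds back to the table above:

```python
# Smallest whole-number failure count consistent with Crusoe's rounded
# percentages. Purely a reconstruction from the table above; the talk did
# not give absolute counts.
shares = [19.5, 41.5, 4.9, 4.9, 12.2, 17.0]      # % of cases, as published
tolerance = [0.05, 0.05, 0.05, 0.05, 0.05, 0.5]  # "17%" looks rounded to an integer

def fits(total: int) -> bool:
    counts = [round(total * s / 100) for s in shares]
    return sum(counts) == total and all(
        abs(100 * c / total - s) <= t for c, s, t in zip(counts, shares, tolerance)
    )

print(next(n for n in range(1, 1001) if fits(n)))  # -> 41 (205 = 5 x 41 also fits)
```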
NVIDIA
NVIDIA shared failure-frequency data for a 6K-GPU training run spanning a four-month campaign:3

“UB timeout” is the most frequent root cause, but NVIDIA doesn’t define the term. It may refer to the “intranode user buffer”,4 which would imply a PCIe error. Similarly, “NaN in gradients” is a symptom rather than a cause, and is likely a euphemism for silent data corruption.
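Because NaN gradients are a symptom, large training jobs usually detect them explicitly and skip or retry the step; the sketch below is a generic PyTorch-style check (an illustration, not NVIDIA's or Meta's actual tooling):

```python
import torch

def gradients_are_finite(model: torch.nn.Module) -> bool:
    """Return False if any parameter gradient contains NaN or Inf.

    This only flags the symptom; the root cause could be a numerical issue,
    a bad batch, or silent data corruption in the hardware.
    """
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return False
    return True

# Typical use inside a training step: skip the optimizer update and log the
# incident for later triage rather than stepping on corrupted values.
# if not gradients_are_finite(model):
#     optimizer.zero_grad(set_to_none=True)
#     continue  # assumes this runs inside the training loop
```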
Footnotes
1. The Llama 3 Herd of Models | Meta AI ↩
2. Fault-Tolerant Managed Training: Crusoe Cloud's Blueprint for AI Reliability (Presented by Crusoe.ai) | GTC 25 2025 | NVIDIA On-Demand ↩
3. Ensuring Reliable Model Training on NVIDIA DGX Cloud | NVIDIA Technical Blog ↩
4. New Scaling Algorithm and Initialization with NVIDIA Collective Communications Library 2.23 | NVIDIA Technical Blog ↩