Silent data corruption (SDC) occurs when a CPU or GPU returns the wrong answer to an arithmetic operation. By definition, SDCs are hard to catch: they occur only when protection mechanisms like ECC fail and allow bad data to percolate through the rest of the computation.

SDCs are an acute problem for LLM training at scale: fault rates grow with the number of accelerators, so events that are vanishingly rare on a single chip become routine across tens of thousands of them.

Detection

There are a few ways I know of to detect silent data corruptions:

  1. Look for NaNs or Infs in model weights or activations as training occurs.
  2. Look for discontinuities in the loss as training occurs (the first two checks are sketched after this list).
  3. Periodically recompute calculations, or compute parity for them, and compare the results.
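The first two checks can be implemented as a lightweight monitor that runs between training steps. Here is a minimal PyTorch sketch; the function names and the spike threshold are illustrative choices, not taken from any of the reports cited below:

```python
import math

import torch


def find_nonfinite_params(model: torch.nn.Module) -> list[str]:
    """Check 1: return names of parameters containing NaN or Inf values."""
    return [
        name
        for name, param in model.named_parameters()
        if not torch.isfinite(param).all()
    ]


def loss_is_suspicious(loss: float, history: list[float], spike_factor: float = 5.0) -> bool:
    """Check 2: flag a non-finite loss or one far above the recent average."""
    if not math.isfinite(loss):
        return True
    if len(history) < 10:
        return False  # too little history to judge a discontinuity
    baseline = sum(history[-10:]) / 10.0
    return loss > spike_factor * baseline
```

Scanning every parameter on every step is wasteful at scale, so in practice you would run the full scan every N steps and rely on the cheap loss check in between.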

When any of these signals appears, you rewind the computation (e.g., restore from a checkpoint) and recompute to see whether the anomalous behavior happens again. If the loss does not spike again, or the NaNs or Infs do not reoccur, a silent data corruption was likely the cause; an anomaly that reproduces exactly on a deterministic replay points instead to a software or data problem.
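The replay logic might look like the sketch below. The three callables are hypothetical hooks into whatever your training loop provides, and the replay has to be deterministic (same seeds, same data order, deterministic kernels) for the comparison to mean anything.

```python
def diagnose_anomaly(restore_checkpoint, train_to_step, anomaly_detected, failure_step: int) -> str:
    """Rewind to the last checkpoint, replay deterministically, and classify the anomaly.

    All three callables are hypothetical stand-ins for hooks into the training loop.
    """
    restore_checkpoint()          # rewind to the last known-good state
    train_to_step(failure_step)   # recompute the same steps on the same data
    if anomaly_detected():
        # Reproduces deterministically: more likely a code or data problem.
        return "reproducible: suspect code or data"
    # Vanished on replay: more likely a transient silent data corruption.
    return "not reproducible: suspect silent data corruption"
```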

Frequency

Google reported silent data corruption events “every week or two.” They dedicated resources to monitoring for silent data corruption, and they trained deterministically so that training could be replayed and weight updates recalculated when a corruption was detected. They do not say exactly how they detect silent data corruption, or whether their detection catches only corruption of high-order bits or all corruption events.[^2]
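Replay-based detection like this only works if training is bit-deterministic. Google does not describe their setup, but in PyTorch the relevant knobs look roughly like this sketch:

```python
import os
import random

import numpy as np
import torch


def make_training_deterministic(seed: int = 0) -> None:
    """Best-effort bit-determinism so a replayed run recomputes identical weight updates."""
    # Must be set before any cuBLAS calls to get deterministic GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Error out on any op that lacks a deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
```

Data order also has to be fixed (e.g., a seeded sampler), and at scale the reduction order of collectives matters too, so a setup like this is necessary but not sufficient on its own.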

Meta reported six silent data corruptions when training across 16K H100 GPUs for 54 days.[^llama3]