Unique challenges arise when training LLMs at the scale required for frontier models, and there is no end in sight to the trend of training ever-larger models for higher-quality results.1

Multi-data center training

Gemini was “trained across multiple sites, and multiple clusters within those sites,” according to Thomas Kurian.2 Groups of 4096 TPUv4 chips were formed into superpods, which share a common optical switch. Superpods were then connected within and across data centers for synchronous training: each superpod hosts one model replica, and data parallelism is used across superpods. Google also reported keeping redundant in-memory copies of the model state so that, on hardware failure, training could rapidly restart from an intact model replica, “despite the significantly larger training resources being used.”3
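
This layout corresponds to hierarchical data parallelism: the model is replicated once per superpod, the global batch is split across replicas, and a synchronous gradient all-reduce ties the replicas together each step. The JAX sketch below illustrates only that outer data-parallel axis on whatever devices are available locally; the mesh shape, axis name, and tensor sizes are illustrative assumptions, not Gemini's actual configuration, and the model-parallel sharding inside each superpod is omitted.

```python
# Illustrative sketch of the outer data-parallel axis: parameters replicated,
# batch sharded across "replicas". In Gemini's setup a replica is a whole
# superpod with the model further sharded inside it; here each local device
# stands in for one replica. Names and sizes are assumptions for illustration.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())
n_replicas = devices.size  # a superpod-level setup would count superpods here

mesh = Mesh(devices, axis_names=("replica",))
param_sharding = NamedSharding(mesh, P())           # full copy on every replica
batch_sharding = NamedSharding(mesh, P("replica"))  # batch split across replicas

params = jax.device_put(jnp.ones((1024, 1024)), param_sharding)
batch = jax.device_put(jnp.ones((8 * n_replicas, 1024)), batch_sharding)

def loss_fn(p, x):
    return jnp.mean((x @ p) ** 2)

@jax.jit
def step(p, x):
    # The mean over the globally sharded batch makes the compiler insert the
    # cross-replica all-reduce that keeps training synchronous.
    loss, grads = jax.value_and_grad(loss_fn)(p, x)
    return loss, p - 1e-3 * grads

loss, params = step(params, batch)
```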

Silent data corruption

Google reported silent data corruption (SDC) events “every week or two.” Dedicated resources were used to monitor for SDC, and training was made deterministic so that suspect steps could be replayed and the weight updates recomputed when corruption was detected. The report does not say exactly how SDC was detected, or whether detection covered only high-order bit flips or all corruption events.3
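
The detection mechanism is not described, but replay-based recovery implies that a training step is a pure function of its inputs and RNG state. As a hedged illustration (not Google's implementation), the sketch below re-executes one deterministic step and compares the two results bit-for-bit; in practice the replay would run on different hardware or be checked against a reference worker.

```python
# Hypothetical sketch: with deterministic kernels and fixed inputs/RNG state,
# replaying a step should reproduce the update exactly; a mismatch flags
# possible silent data corruption. This is an illustration, not the
# (undisclosed) mechanism used for Gemini.
import jax
import jax.numpy as jnp

LEARNING_RATE = 1e-3

def loss_fn(params, batch):
    return jnp.mean((batch @ params) ** 2)

@jax.jit
def train_step(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    return params - LEARNING_RATE * grads

def update_is_consistent(params, batch):
    """Run the same step twice and require bit-identical results."""
    first = train_step(params, batch)
    replay = train_step(params, batch)  # in practice: replay on other hardware
    return bool(jnp.all(first == replay))

params = jax.random.normal(jax.random.PRNGKey(0), (256, 256))
batch = jax.random.normal(jax.random.PRNGKey(1), (32, 256))

if not update_is_consistent(params, batch):
    print("possible silent data corruption: replayed update differs")
```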

Meta reported six silent data corruption events while training across 16K H100 GPUs for 54 days.4

Anecdotes

  • ByteDance’s MegaScale paper5 contains descriptions of the entire infrastructure required to train at 10K+ GPU scale.
  • Meta AI’s OPT-175B logbook6 provides specific details about the errors and mitigations encountered while training an LLM across 1K GPUs over two months.

Footnotes

  1. See OpenAI Keynote on Building Scalable AI Infrastructure and the scaling plots cited from the GPT-4 technical report.

  2. Training Google’s Gemini: TPUs, multiple data centers, and risks of cosmic rays - DCD (datacenterdynamics.com)

  3. Gemini: A Family of Highly Capable Multimodal Models (arxiv.org)

  4. The Llama 3 Herd of Models (arxiv.org)

  5. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arxiv.org)

  6. metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq (github.com)