Multicluster training is a technique used in LLM training at scale that exploits hierarchies of cluster fabrics to train across massive numbers of GPUs. In HPC terms, this would be like running a single tightly coupled MPI-style job across Frontier, Aurora, and Perlmutter. Such setups require communication patterns that work around the high asymmetry between intra-cluster communication (e.g., within a system like Frontier) and inter-cluster communication (e.g., between Frontier and Aurora).
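
The pattern this asymmetry implies is a hierarchical collective: gradients are first reduced over the fast intra-cluster fabric, only a small set of ranks exchange the partial sums over the slow inter-cluster link, and the result is broadcast back locally. Below is a minimal, hypothetical sketch of such a two-level all-reduce using torch.distributed; CLUSTER_SIZE, the leader-rank convention, and the group layout are illustrative assumptions, not any vendor's actual scheme.

```python
# Hypothetical two-level (intra-cluster, then inter-cluster) gradient
# all-reduce sketch. Assumes dist.init_process_group() has already run
# and that ranks are laid out contiguously by cluster.
import torch
import torch.distributed as dist

CLUSTER_SIZE = 8  # assumed number of ranks per cluster


def build_groups():
    world = dist.get_world_size()
    n_clusters = world // CLUSTER_SIZE
    # One group per cluster, spanning its fast local fabric.
    intra = [
        dist.new_group(ranks=list(range(c * CLUSTER_SIZE, (c + 1) * CLUSTER_SIZE)))
        for c in range(n_clusters)
    ]
    # One group of per-cluster "leader" ranks, spanning the slow inter-cluster link.
    leaders = dist.new_group(ranks=[c * CLUSTER_SIZE for c in range(n_clusters)])
    return intra, leaders


def hierarchical_all_reduce(grad: torch.Tensor, intra, leaders):
    rank = dist.get_rank()
    cluster = rank // CLUSTER_SIZE
    leader = cluster * CLUSTER_SIZE  # first rank of the cluster acts as leader

    # 1) Sum gradients inside the cluster over the fast fabric.
    dist.all_reduce(grad, group=intra[cluster])
    # 2) Only the cluster leaders exchange partial sums across clusters.
    if rank == leader:
        dist.all_reduce(grad, group=leaders)
    # 3) Each leader broadcasts the global sum back within its cluster.
    dist.broadcast(grad, src=leader, group=intra[cluster])
```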

NVIDIA recently opened a pull request for NCCL that enables “cross data center communications and network topology awareness,” strongly indicating that this technique is reaching mainstream usage.1

Google

Gemini was “trained across multiple sites, and multiple clusters within those sites” according to Thomas Kurian2 and the Gemini paper.3 Groups of 4096 TPUv4 chips were formed into superpods, each sharing a common optical switch. Superpods were then connected within and across data centers for synchronous training. Each superpod hosts one model replica, and data parallelism is used across superpods. Google also said that multiple model replicas were stored in a single superpod to accelerate restarts “despite the significantly larger training resources being used.”3
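
Google's actual stack is JAX/Pathways on TPUs, but the layout described above can be sketched generically: every superpod holds one complete model replica (model parallelism inside the pod), and gradient averaging is data-parallel across superpods. The torch.distributed-style sketch below is purely illustrative; SUPERPOD_SIZE is the figure quoted above, and the rank-to-superpod mapping is an assumption.

```python
# Illustrative process-group layout: one model replica per superpod,
# data parallelism across superpods. This is NOT Google's implementation;
# the contiguous rank-to-superpod mapping is assumed for illustration.
import torch.distributed as dist

SUPERPOD_SIZE = 4096  # TPUv4 chips per superpod, per the description above


def build_superpod_groups(world_size: int):
    n_superpods = world_size // SUPERPOD_SIZE

    # Model-parallel groups: all ranks within one superpod share one replica.
    model_parallel = [
        dist.new_group(ranks=list(range(p * SUPERPOD_SIZE, (p + 1) * SUPERPOD_SIZE)))
        for p in range(n_superpods)
    ]
    # Data-parallel groups: ranks holding the same shard, one per superpod,
    # connected within and across data centers.
    data_parallel = [
        dist.new_group(ranks=[p * SUPERPOD_SIZE + i for p in range(n_superpods)])
        for i in range(SUPERPOD_SIZE)
    ]
    return model_parallel, data_parallel
```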

Microsoft

Mark Russinovich said:4


“I think it’s inevitable, especially when you get to the kind of scale that these things are getting to,” he said. “In some cases, that might be the only feasible way to train them is to go across data centers, or even across regions,” he said.

Dylan Patel published a detailed report on this behind a paywall; its public summary claims that both Google and Microsoft/OpenAI are pursuing this approach.5

OpenAI

OpenAI confirmed that they used multicluster training for GPT-4.5.6

Footnotes

  1. Add cross data center communications and network topology awareness to NCCL by thomasgillis · Pull Request #1659 · NVIDIA/nccl · GitHub

  2. Training Google’s Gemini: TPUs, multiple data centers, and risks of cosmic rays - DCD (datacenterdynamics.com)

  3. Gemini: A Family of Highly Capable Multimodal Models

  4. Microsoft Azure CTO: US data centers will soon hit limits of energy grid | Semafor

  5. Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure

  6. See Pre-Training GPT-4.5.