Multicluster training is a technique used in LLM training at scale that exploits hierarchies of cluster fabrics to train across massive numbers of GPUs. In HPC terms, this would be like running a single tightly coupled MPI-style job across Frontier, Aurora, and Perlmutter. Such setups require communication patterns that work around the high asymmetry between intra-cluster communication (e.g., within a system like Frontier) and inter-cluster communication (e.g., between Frontier and Aurora).
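
The pattern this asymmetry implies is a hierarchical collective: gradients are first reduced over the fast intra-cluster fabric, only a small set of ranks exchange the partial sums over the slow inter-cluster link, and the result is broadcast back locally. Below is a minimal, hypothetical sketch of such a two-level all-reduce using torch.distributed; CLUSTER_SIZE, the leader-rank convention, and the group layout are illustrative assumptions, not any vendor's actual scheme.

```python
# Hypothetical two-level (intra-cluster, then inter-cluster) gradient
# all-reduce sketch. Assumes dist.init_process_group() has already run
# and that ranks are laid out contiguously by cluster.
import torch
import torch.distributed as dist

CLUSTER_SIZE = 8  # assumed number of ranks per cluster


def build_groups():
    world = dist.get_world_size()
    n_clusters = world // CLUSTER_SIZE
    # One group per cluster, spanning its fast local fabric.
    intra = [
        dist.new_group(ranks=list(range(c * CLUSTER_SIZE, (c + 1) * CLUSTER_SIZE)))
        for c in range(n_clusters)
    ]
    # One group of per-cluster "leader" ranks, spanning the slow inter-cluster link.
    leaders = dist.new_group(ranks=[c * CLUSTER_SIZE for c in range(n_clusters)])
    return intra, leaders


def hierarchical_all_reduce(grad: torch.Tensor, intra, leaders):
    rank = dist.get_rank()
    cluster = rank // CLUSTER_SIZE
    leader = cluster * CLUSTER_SIZE  # first rank of the cluster acts as leader

    # 1) Sum gradients inside the cluster over the fast fabric.
    dist.all_reduce(grad, group=intra[cluster])
    # 2) Only the cluster leaders exchange partial sums across clusters.
    if rank == leader:
        dist.all_reduce(grad, group=leaders)
    # 3) Each leader broadcasts the global sum back within its cluster.
    dist.broadcast(grad, src=leader, group=intra[cluster])
```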

NVIDIA recently opened a pull request for NCCL that enables “cross data center communications and network topology awareness,” strongly indicating that this technique is reaching mainstream usage.1

Google

Gemini was “trained across multiple sites, and multiple clusters within those sites” according to Thomas Kurian2 and the Gemini paper.3 Groups of 4096 TPUv4 chips were formed into superpods, each sharing a common optical switch. Superpods were then connected within and across data centers for synchronous training. Each superpod hosts one model replica, and data parallelism is used across superpods. Google also said that multiple model replicas were stored in a single superpod to accelerate restarts “despite the significantly larger training resources being used.”3
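
Google's actual stack is JAX/Pathways on TPUs, but the layout described above can be sketched generically: every superpod holds one complete model replica (model parallelism inside the pod), and gradient averaging is data-parallel across superpods. The torch.distributed-style sketch below is purely illustrative; SUPERPOD_SIZE is the figure quoted above, and the rank-to-superpod mapping is an assumption.

```python
# Illustrative process-group layout: one model replica per superpod,
# data parallelism across superpods. This is NOT Google's implementation;
# the contiguous rank-to-superpod mapping is assumed for illustration.
import torch.distributed as dist

SUPERPOD_SIZE = 4096  # TPUv4 chips per superpod, per the description above


def build_superpod_groups(world_size: int):
    n_superpods = world_size // SUPERPOD_SIZE

    # Model-parallel groups: all ranks within one superpod share one replica.
    model_parallel = [
        dist.new_group(ranks=list(range(p * SUPERPOD_SIZE, (p + 1) * SUPERPOD_SIZE)))
        for p in range(n_superpods)
    ]
    # Data-parallel groups: ranks holding the same shard, one per superpod,
    # connected within and across data centers.
    data_parallel = [
        dist.new_group(ranks=[p * SUPERPOD_SIZE + i for p in range(n_superpods)])
        for i in range(SUPERPOD_SIZE)
    ]
    return model_parallel, data_parallel
```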

Microsoft

Mark Russinovich said:4


“I think it’s inevitable, especially when you get to the kind of scale that these things are getting to,” he said. “In some cases, that might be the only feasible way to train them is to go across data centers, or even across regions,” he said.

Dylan Patel published a detailed report on this behind a paywall; its public summary claims that both Google and Microsoft/OpenAI are pursuing this approach.5

OpenAI

OpenAI confirmed that they used multicluster training for GPT-4.5.6

Footnotes

  1. Add cross data center communications and network topology awareness to NCCL by thomasgillis · Pull Request #1659 · NVIDIA/nccl · GitHub

  2. Training Google’s Gemini: TPUs, multiple data centers, and risks of cosmic rays - DCD (datacenterdynamics.com)

  3. Gemini: A Family of Highly Capable Multimodal Models

  4. Microsoft Azure CTO: US data centers will soon hit limits of energy grid | Semafor

  5. Multi-Datacenter Training: OpenAI’s Ambitious Plan To Beat Google’s Infrastructure

  6. See Pre-Training GPT-4.5.