Although LLM training makes heavy use of global collectives (see LLM training communication), communication between processors is highly localized and concentrated:1

  • 99% of processor pairs never communicate with each other when using 3D (data, pipeline, and tensor) parallelism.
  • Tensor parallelism accounts for 75% or more of the traffic, and that communication is highly localized to less than 0.04% of processor pairs at scale.
  • When processor pairs do communicate, they transfer large amounts of data and are bandwidth-bound, not latency-bound.

All of this means that networks for LLM training at scale do not need to be densely interconnected: many topologies can support distributed training effectively at a lower cost than a fully nonblocking fat tree.
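To get a feel for those percentages, here is a back-of-the-envelope sketch that counts communicating GPU pairs under 3D parallelism. The parallelism degrees (TP=8, PP=16, DP=128), ring all-reduce for data parallelism, and neighbor-only pipeline traffic are illustrative assumptions, not figures from any particular cluster.

```python
# Count the GPU pairs that ever exchange data under 3D parallelism.
# All sizes below are illustrative assumptions.

TP, PP, DP = 8, 16, 128        # tensor-, pipeline-, data-parallel degrees
N = TP * PP * DP               # 16,384 GPUs in total

total_pairs = N * (N - 1) // 2

# Tensor parallelism: collectives within each TP group, so assume every
# pair inside a group communicates.
tp_pairs = (N // TP) * (TP * (TP - 1) // 2)

# Pipeline parallelism: assume point-to-point traffic only between
# adjacent stages, i.e. PP - 1 communicating pairs per pipeline.
pp_pairs = (N // PP) * (PP - 1)

# Data parallelism: assume ring all-reduce, i.e. DP edges per DP group.
dp_pairs = (N // DP) * DP

communicating = tp_pairs + pp_pairs + dp_pairs

print(f"total GPU pairs:           {total_pairs:,}")
print(f"pairs that ever talk:      {communicating:,}"
      f" ({100 * communicating / total_pairs:.3f}%)")
print(f"pairs carrying TP traffic: {tp_pairs:,}"
      f" ({100 * tp_pairs / total_pairs:.3f}%)")
```

Under these assumptions, well over 99.9% of GPU pairs never communicate, and the tensor-parallel pairs come out in the same ballpark as the <0.04% figure above.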

In practice

ByteDance uses a three-level, rail-optimized fat tree for their 10,000-GPU MegaScale cluster.2

Alibaba uses a two-level, dual-plane fat tree (HPN) that interconnects roughly 15,000 GPUs per pod.3

Eagle uses some form of rail-optimized fat tree; its exact topology is not public.

In principle

HammingMesh is a topology designed specifically to match the needs of distributed model training.4
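As a rough illustration of the idea, the sketch below classifies how a message would be routed in a HammingMesh-style topology. It assumes boards of mesh-connected GPUs arranged in a 2D grid, with one fat tree per row of boards and one per column; the paper's actual HxMesh parameterization is more detailed than this.

```python
# A simplified sketch of routing in a HammingMesh-style topology.
# Assumptions (not the paper's exact HxMesh parameters): boards of
# mesh-connected GPUs sit in a 2D grid, every row of boards shares a
# fat tree, and every column of boards shares another.

from dataclasses import dataclass

@dataclass(frozen=True)
class Gpu:
    board_row: int   # row of the GPU's board in the board grid
    board_col: int   # column of the GPU's board in the board grid
    local_x: int     # position on the board's on-board 2D mesh
    local_y: int

def path_class(src: Gpu, dst: Gpu) -> str:
    """Classify which links a src -> dst message would traverse."""
    same_board = (src.board_row, src.board_col) == (dst.board_row, dst.board_col)
    if same_board:
        return "on-board 2D mesh only (cheap board-level links)"
    if src.board_row == dst.board_row:
        return "row fat tree"
    if src.board_col == dst.board_col:
        return "column fat tree"
    # Boards differ in both dimensions: route via an intermediate board,
    # e.g. row fat tree first, then column fat tree.
    return "row fat tree + column fat tree (two-dimensional route)"

print(path_class(Gpu(0, 0, 1, 1), Gpu(0, 0, 0, 1)))  # same board
print(path_class(Gpu(0, 0, 1, 1), Gpu(0, 3, 0, 0)))  # same board row
print(path_class(Gpu(2, 0, 1, 1), Gpu(5, 3, 0, 0)))  # crosses both dims
```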

Rail-only is a design in which the rail planes of a multi-plane fat tree are left completely independent: the spine layer connecting them is removed, and the rare cross-rail traffic is forwarded through the high-bandwidth domain (e.g., NVLink) inside each server.1
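A minimal routing sketch of that idea, assuming one GPU per rail in each server and NVLink/NVSwitch as the intra-server path; because no spine connects the rails, cross-rail traffic first hops to the GPU sitting on the destination's rail inside the source server.

```python
# Minimal sketch of rail-only routing. Assumptions: GPUs with the same
# local rank across servers share a rail switch, and there is no spine
# connecting rails, so cross-rail traffic is forwarded over NVLink
# inside the source server before crossing the network.

from dataclasses import dataclass

@dataclass(frozen=True)
class Gpu:
    server: int
    rail: int  # local rank == rail index

def route(src: Gpu, dst: Gpu) -> list[str]:
    """Return the hops a src -> dst message takes in a rail-only fabric."""
    if src.server == dst.server:
        return ["NVLink within the server"]
    if src.rail == dst.rail:
        return [f"rail switch {src.rail} (network)"]
    # Different server *and* different rail: forward inside the source
    # server to the GPU on the destination's rail, then cross the
    # network on that rail.
    return [
        f"NVLink to (server {src.server}, rail {dst.rail})",
        f"rail switch {dst.rail} (network)",
    ]

print(route(Gpu(0, 3), Gpu(7, 3)))   # same rail: stays on one plane
print(route(Gpu(0, 3), Gpu(7, 5)))   # cross-rail: NVLink hop first
```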

Footnotes

  1. [2307.12169] Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters (arxiv.org)

  2. [2402.15627] MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arxiv.org)

  3. Alibaba HPN: A Data Center Network for Large Language Model Training (acm.org)

  4. HammingMesh (acm.org)