Although LLM training makes heavy use of global collectives (see LLM training communication), communication between processors is highly localized and concentrated:[^1]
- 99% of processor pairs never communicate with each other when using 3D (data, pipeline, and tensor) parallelism.
- Tensor parallelism accounts for 75% or more of the traffic, and that communication is highly localized to less than 0.04% of processor pairs at scale.
- When processor pairs do communicate, they transfer large amounts of data and are bandwidth-bound, not latency-bound.
This all means that networks for LLM training at scale do not need to be densely interconnected, and there are many topologies that can effectively support distributed training at a lower cost than a fully nonblocking fat tree topology.
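A back-of-the-envelope check makes this sparsity concrete. The sketch below is not taken from the cited papers: it assumes the usual 3D group construction (each rank communicates only with its tensor-parallel group, its data-parallel group, and its adjacent pipeline stages), and the job size (dp=256, pp=16, tp=8, i.e. 32,768 GPUs) is hypothetical.

```python
def pair_fractions(dp: int, pp: int, tp: int) -> tuple[float, float]:
    """Return (fraction of GPU pairs that ever communicate,
               fraction of GPU pairs that carry tensor-parallel traffic)."""
    world = dp * pp * tp
    total_pairs = world * (world - 1) / 2
    # Peers per rank: its full tensor-parallel group, its full data-parallel
    # group, and at most two pipeline neighbours (pipeline sends are
    # point-to-point between adjacent stages).
    peers = (tp - 1) + (dp - 1) + min(pp - 1, 2)
    comm_pairs = world * peers / 2        # each pair is counted from both ends
    tp_pairs = world * (tp - 1) / 2
    return comm_pairs / total_pairs, tp_pairs / total_pairs

# Hypothetical 32,768-GPU job: dp=256, pp=16, tp=8
comm, tp_only = pair_fractions(dp=256, pp=16, tp=8)
print(f"pairs that ever communicate: {comm:.2%}")     # ~0.81%, i.e. ~99% never do
print(f"tensor-parallel pairs:       {tp_only:.3%}")  # ~0.021%
```

The exact percentages shift with the parallelism degrees, but the qualitative picture holds for any realistic configuration: a small, regular subset of processor pairs carries essentially all of the traffic.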
## In practice
- ByteDance uses a three-level, rail-optimized fat tree for their 10,000-GPU cluster.[^2]
- Alibaba uses a two-level, dual-plane fat tree for their roughly 15,000-GPU cluster.[^3]
- Microsoft's Eagle uses a rail-optimized fat tree of some form; its exact topology is not public.
## In principle
- HammingMesh is a topology designed specifically to match the needs of distributed model training.[^4]
- Rail-only is a design in which the planes of a multi-plane, rail-optimized fat tree are completely independent: there is no switch layer joining different rails, and traffic that must cross rails is instead forwarded through the high-bandwidth NVLink domain inside each server (see the sketch below).[^1]
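A minimal sketch of how such a layout carries traffic, assuming 8 GPUs per server, GPU k of every server attached to network plane ("rail") k, and an NVLink/NVSwitch domain connecting all GPUs inside a server; the `path` helper and the rank numbers are illustrative, not taken from the papers above.

```python
GPUS_PER_SERVER = 8  # assumed rail width: one network plane per local GPU index

def path(src: int, dst: int) -> str:
    """Classify how traffic between two global GPU ranks is carried."""
    src_server, src_rail = divmod(src, GPUS_PER_SERVER)
    dst_server, dst_rail = divmod(dst, GPUS_PER_SERVER)
    if src_server == dst_server:
        return "intra-server NVLink/NVSwitch"
    if src_rail == dst_rail:
        return f"rail {src_rail} plane"
    # Rail-only: no switch layer joins different rails, so cross-rail traffic
    # first hops over NVLink to the local GPU sitting on the destination's
    # rail, then crosses that rail's plane.
    return f"NVLink hop, then rail {dst_rail} plane"

print(path(0, 5))   # same server                  -> intra-server NVLink/NVSwitch
print(path(3, 83))  # rank 83 = server 10, GPU 3   -> rail 3 plane (same rail)
print(path(3, 84))  # rank 84 = server 10, GPU 4   -> NVLink hop, then rail 4 plane
```

Because rail-optimized placement typically gives collective peers the same local GPU index on different servers, the cross-rail case is rare; that is what makes it plausible to drop the rail-crossing spine layer entirely, as the rail-only proposal argues.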
## Footnotes

[^1]: [2307.12169] Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters (arxiv.org)
[^2]: [2402.15627] MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arxiv.org)
[^3]: Alibaba HPN: A Data Center Network for Large Language Model Training (acm.org)
[^4]: HammingMesh: A Network Topology for Large-Scale Deep Learning (arxiv.org)