Meta has two flagship 24,576-GPU clusters.1 The Llama-3 paper calls them “Meta’s production clusters,” and they have no catchier name than that.
- One has a 400G RoCEv2 backend using Arista 7800 switches
- The other has a 400G NDR InfiniBand backend using NVIDIA Quantum-2 switches
System architecture
Overall, each system has2
- 24,576 NVIDIA H100 GPUs
- 3,072 Grand Teton nodes
- 1,536 racks
- 8 pods
Each pod is a nonblocking fabric domain consisting of2
- 3,072 NVIDIA H100 GPUs
- 384 Grand Teton nodes
- 192 racks
- 7:1 oversubscribed uplinks toward the rest of the cluster
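These counts are internally consistent; here is a quick sanity check that multiplies them out (the two-nodes-per-rack figure comes from the node architecture section below):

```python
# Sanity-check the cluster and pod counts against each other.
gpus_per_node = 8
nodes_per_rack = 2          # stated in the node architecture section below
racks_per_pod = 192
pods = 8

nodes_per_pod = nodes_per_rack * racks_per_pod      # 384
gpus_per_pod = gpus_per_node * nodes_per_pod        # 3,072
assert gpus_per_pod * pods == 24_576                # GPUs in the whole cluster
assert nodes_per_pod * pods == 3_072                # Grand Teton nodes
assert racks_per_pod * pods == 1_536                # racks
```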
Based on my estimate, each of these clusters should be able to reach roughly 957,700 TFLOPS (≈958 PFLOPS) on HPL.
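For context, a back-of-the-envelope version of that estimate lands in the same ballpark. The ~67 TFLOPS FP64 tensor-core peak is NVIDIA’s published H100 SXM figure; the ~58% HPL efficiency is my own assumption, roughly in line with other large H100 systems, not anything Meta has reported.

```python
# Back-of-the-envelope HPL estimate for one 24K cluster.
# Assumptions (mine, not Meta's): ~67 TFLOPS FP64 tensor-core peak per H100 SXM,
# and ~58% HPL efficiency, roughly in line with other large H100 machines.
gpus = 24_576
fp64_tensor_peak_tflops = 67        # per-GPU FP64 (tensor core) peak
hpl_efficiency = 0.58               # assumed fraction of peak achieved on HPL

rpeak_pflops = gpus * fp64_tensor_peak_tflops / 1000
rmax_pflops = rpeak_pflops * hpl_efficiency
print(f"Rpeak ≈ {rpeak_pflops:,.0f} PFLOPS, Rmax ≈ {rmax_pflops:,.0f} PFLOPS")
# Rpeak ≈ 1,647 PFLOPS, Rmax ≈ 955 PFLOPS
```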
Node architecture
Each node is built on Meta’s Grand Teton platform, which has:
- 1x HGX baseboard
- 8x H100 GPUs
- Full NVLink interconnectivity
- 2x CPUs, each with
  - 1x 400G frontend NIC
- 4x PCIe Gen5 switches, each with
  - 1x connection into the HGX baseboard
  - 2x 400G backend NICs
  - 2x NVMe drives
Two nodes fit in a single rack, so each 24K cluster spans 1,536 racks.
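Multiplying out the counts above gives the per-node bandwidth; this sketch uses only the numbers listed in this section, and works out to one 400G backend NIC per GPU:

```python
# Per-node bandwidth, multiplying out the Grand Teton counts listed above.
gpus_per_node = 8
pcie_switches = 4                   # per node
backend_nics_per_switch = 2
nic_gbps = 400

backend_nics = pcie_switches * backend_nics_per_switch    # 8 backend NICs per node
backend_tbps = backend_nics * nic_gbps / 1000             # 3.2 Tbps backend per node
per_gpu_gbps = backend_nics * nic_gbps / gpus_per_node    # 400 Gbps per GPU
frontend_gbps = 2 * nic_gbps                              # 2 CPUs x 1 frontend NIC each
print(backend_nics, backend_tbps, per_gpu_gbps, frontend_gbps)
# 8 NICs, 3.2 Tbps backend per node, 400 Gbps per GPU, 800 Gbps frontend per node
```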
Fabric architecture
Their backend network is a multilevel fat tree. They use inconsistent terminology to describe the three layers:
- T0 leaf switches are “RTSW” or “ToR” switches. They are Minipack2 devices.
- T1 aggregation switches are “CTSW” or “Cluster Switches.”
- T2 spine switches are “ATSW” or “Aggregation switches.”
T0-T1 links are undersubscribed by 2:1 to reduce congestion at this level,3 and the Llama-3 paper calls these nonblocking T1 domains “pods.” This is unusual: they paid for twice as much T0-T1 bandwidth as the GPUs can actually inject, purely to work around how poorly RoCEv2 handles congestion at the T1 layer.
T1-T2 links are oversubscribed by 7:1.2 They claim their communication library is aware of this extreme taper and schedules traffic to work around it during training.
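To make those ratios concrete, here is a rough per-pod bandwidth accounting. The 400G-per-GPU figure comes from the node section above; treating the subscription ratios as simple multipliers on aggregate injection bandwidth is my simplification.

```python
# Rough per-pod bandwidth accounting from the numbers in this section.
gpus_per_pod = 3072
nic_gbps = 400                                    # one 400G backend NIC per GPU

injection_tbps = gpus_per_pod * nic_gbps / 1000   # aggregate injection into the T0 layer
t0_t1_uplink_tbps = injection_tbps * 2            # 2:1 undersubscribed (over-provisioned)
t1_t2_uplink_tbps = injection_tbps / 7            # 7:1 oversubscribed toward the spine
print(f"injection: {injection_tbps:.1f} Tbps")
print(f"T0->T1:    {t0_t1_uplink_tbps:.1f} Tbps")
print(f"T1->T2:    {t1_t2_uplink_tbps:.1f} Tbps")
# injection: 1228.8 Tbps, T0->T1: 2457.6 Tbps, T1->T2: ~175.5 Tbps
```

Only about a seventh of a pod’s injection bandwidth can leave the pod at any one time, which is presumably why the training setup keeps the bandwidth-heaviest communication inside a pod.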