Meta has two flagship 24,576-GPU clusters.1 The Llama-3 paper calls them simply “Meta’s production clusters”; they have no catchier name.

  • One has a 400G RoCEv2 backend using Arista 7800 switches
  • One has a 400G NDR InfiniBand backend using NVIDIA Quantum-2 switches

System architecture

Overall, each system has2

  • 24,576 NVIDIA H100 GPUs
  • 3,072 Grand Teton nodes
  • 1,536 racks
  • 8 pods

Each pod is a nonblocking fabric domain consisting of2

  • 3,072 NVIDIA H100 GPUs
  • 384 Grand Teton nodes
  • 192 racks
  • 7:1 oversubscribed uplinks toward the T2 spine (cross-pod traffic)
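
The system-level and pod-level counts above are mutually consistent. Here is a quick sanity check in Python, using the 8-GPUs-per-node and 2-nodes-per-rack figures from the node architecture section below:

    # Sanity-check the published counts against the node architecture
    # described later in this post: 8 GPUs per Grand Teton node, 2 nodes
    # per rack, 8 pods per cluster.
    GPUS_PER_NODE = 8
    NODES_PER_RACK = 2
    PODS_PER_CLUSTER = 8

    cluster_gpus = 24_576
    cluster_nodes = cluster_gpus // GPUS_PER_NODE    # 3,072 nodes
    cluster_racks = cluster_nodes // NODES_PER_RACK  # 1,536 racks

    pod_gpus = cluster_gpus // PODS_PER_CLUSTER      # 3,072 GPUs
    pod_nodes = pod_gpus // GPUS_PER_NODE            # 384 nodes
    pod_racks = pod_nodes // NODES_PER_RACK          # 192 racks

    print(cluster_nodes, cluster_racks, pod_gpus, pod_nodes, pod_racks)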

Based on my estimate, each of these clusters should be able to reach roughly 957.7 PFLOPS on HPL.
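
One way to land near that figure, assuming roughly 67 TFLOPS of FP64 tensor-core peak per H100 SXM and an HPL efficiency in the high-50% range (both assumptions are mine, not Meta’s figures):

    # Back-of-envelope HPL estimate for one 24K cluster. The per-GPU peak
    # and efficiency are my assumptions, not numbers Meta has published.
    GPUS = 24_576
    FP64_PEAK_TFLOPS = 67    # assumed FP64 tensor-core peak per H100 SXM
    HPL_EFFICIENCY = 0.58    # assumed fraction of peak achieved on HPL

    peak_pflops = GPUS * FP64_PEAK_TFLOPS / 1_000
    hpl_pflops = peak_pflops * HPL_EFFICIENCY
    print(f"peak ≈ {peak_pflops:,.0f} PFLOPS, HPL ≈ {hpl_pflops:,.0f} PFLOPS")
    # peak ≈ 1,647 PFLOPS, HPL ≈ 955 PFLOPS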

Node architecture

Each node is built on Meta’s Grand Teton platform which has:

  • 1x HGX baseboard
    • 8x H100 GPUs
    • Full NVLink interconnectivity
  • 2x CPUs, each with
    • 1x 400G frontend NIC
  • 4x PCIe Gen5 switches, each with
    • 1x connection into the HGX base board
    • 2x 400G backend NICs
    • 2x NVMe drives

Two nodes fit in a single rack, so each 24K cluster spans 1,536 racks.
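
Adding up the NICs listed above gives each node 3.2 Tbit/s of backend injection bandwidth plus 800 Gbit/s on the frontend. A small sketch of that accounting (the one-backend-NIC-per-GPU pairing is my reading of the component counts):

    # Per-node bandwidth accounting for a Grand Teton node, derived from
    # the component counts listed above. Units are Gbit/s.
    PCIE_SWITCHES = 4
    BACKEND_NICS_PER_SWITCH = 2
    FRONTEND_NICS = 2          # one per CPU
    NIC_SPEED_GBPS = 400
    GPUS_PER_NODE = 8

    backend_nics = PCIE_SWITCHES * BACKEND_NICS_PER_SWITCH  # 8 NICs
    backend_gbps = backend_nics * NIC_SPEED_GBPS            # 3,200 Gbit/s
    frontend_gbps = FRONTEND_NICS * NIC_SPEED_GBPS          # 800 Gbit/s

    # With 8 backend NICs for 8 GPUs, each H100 effectively gets a dedicated
    # 400G path into the backend fabric alongside its intra-node NVLink.
    per_gpu_gbps = backend_gbps // GPUS_PER_NODE             # 400 Gbit/s
    print(backend_nics, backend_gbps, frontend_gbps, per_gpu_gbps)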

Fabric architecture

Their backend network is a multilevel fat tree. They use inconsistent terminology to describe the three layers:

  • T0 leaf switches are “RTSW” or “ToR” switches. They are Minipack2 devices.
  • T1 aggregation switches are “CTSW” or “Cluster Switches.”
  • T2 spine switches are “ATSW” or “Aggregation switches.”

T0-T1 links are undersubscribed by 2:1 to reduce congestion at this level,3 and the Llama-3 paper calls these nonblocking T1 domains “pods.” This is unusual: Meta provisioned twice as much uplink bandwidth as the racks can actually inject, purely to work around RoCEv2’s poor handling of congestion at the T1 layer.

T1-T2 links are oversubscribed by 7:1.2 They claim their communication library is aware of this extreme taper and works around it during training.
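
Putting the two ratios together gives a rough sense of per-GPU bandwidth at each tier. The taper arithmetic below is my own back-of-envelope, using the 400G-per-GPU injection rate from the node section:

    # Rough per-GPU bandwidth implied by the subscription ratios above.
    # This is my own back-of-envelope, not Meta's published numbers.
    INJECTION_GBPS_PER_GPU = 400   # one 400G backend NIC per GPU

    # T0 -> T1 is over-provisioned, so within a pod every GPU can drive
    # its full injection rate toward the CTSW layer.
    intra_pod_gbps = INJECTION_GBPS_PER_GPU

    # T1 -> T2 is oversubscribed 7:1, so if every GPU in a pod sends
    # cross-pod traffic at once, each gets roughly 1/7 of its injection rate.
    cross_pod_gbps = INJECTION_GBPS_PER_GPU / 7

    print(f"intra-pod: {intra_pod_gbps} Gbit/s per GPU")
    print(f"worst-case cross-pod: {cross_pod_gbps:.0f} Gbit/s per GPU")  # ~57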

Footnotes

  1. Building Meta’s GenAI Infrastructure

  2. The Llama-3 Herd of Models (arxiv.org)

  3. RDMA over Ethernet for Distributed AI Training at Meta Scale (2024)