“Pod” is a colloquial term for a high-bandwidth communication domain spanning a collection of GPUs, and “pod size” is the number of GPUs in that domain. For example,
- an Azure ND H100 v5 with 8x GPUs has a pod size of eight
- a rack of GB200 with 72 GPUs all interconnected with NVLink Switch has a pod size of 72
The larger the pod size, the larger the model you can train efficiently. For example, when employing a combination of data, pipeline, and tensor parallelism, you typically do not want to distribute a single tensor across memory coherence domains, because tensor parallelism needs very high bandwidth between the GPUs that share a tensor.
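A minimal sketch of that constraint, assuming ranks are numbered contiguously within each pod (an assumption about the launcher, not something guaranteed in general). The helper names `pod_of` and `tensor_parallel_groups` are hypothetical and do not come from any particular framework:

```python
def pod_of(rank: int, pod_size: int) -> int:
    """Return the pod index of a rank, assuming contiguous rank numbering per pod."""
    return rank // pod_size


def tensor_parallel_groups(world_size: int, pod_size: int, tp: int) -> list[list[int]]:
    """Split ranks into tensor-parallel groups that never straddle a pod boundary."""
    if world_size % pod_size != 0:
        raise ValueError("world_size must be a whole number of pods")
    if pod_size % tp != 0:
        # A tensor-parallel degree that does not divide the pod size would
        # force at least one group to span two pods.
        raise ValueError("tensor-parallel degree must divide the pod size")
    return [list(range(start, start + tp)) for start in range(0, world_size, tp)]


if __name__ == "__main__":
    # Two pods of 8 GPUs, tensor-parallel degree 4: every group stays in one pod.
    for group in tensor_parallel_groups(world_size=16, pod_size=8, tp=4):
        pods = sorted({pod_of(r, 8) for r in group})
        print(group, "-> pod(s)", pods)
```

Under this layout the tensor-parallel degree is capped by the pod size, so a larger pod lets a single tensor be split across more GPUs; data and pipeline parallelism then cover whatever spans multiple pods.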
I think of a “pod” as always being a memory coherence domain (NVIDIA adheres to this definition), but some supercomputer builders (like Meta; see Revisiting Reliability in Large-Scale Machine Learning Research Clusters) use “pod” to describe any nonblocking island within a larger cluster fabric.
Physical constraints
Because a pod is a high-bandwidth communication domain, you need very short, high-quality connections between the GPUs within it. This means the GPUs in a pod need to be physically very close to each other:
- A pod of 8 lives entirely on a single HGX baseboard for NVIDIA A100 and H100 GPUs, and all connections are copper traces on the circuit board.
- A pod of 72 lives within a single high-power rack for NVIDIA GB200, and connections use short copper flyover cables.
This implies that pod size is power-limited and cooling-limited. Consider:
- You need GPUs to all be within close physical proximity so that you can maintain signal integrity over copper.
- You need liquid cooling above a certain power density, as there isn’t enough dead space for air to flow between GPUs, given point #1 above.
- Your pod switches (e.g., NVLink Switch) need to be physically close to the GPUs for the same reason as #1
Since a pod is power/cooling-limited, you also have to trade off between GPUs and pod switches when designing an ideal pod. Every switch you add requires power/cooling that a GPU can no longer use. Thus, using a two-level switched fabric as the pod interconnect is not attractive, as the second layer of switches consumes a disproportionate amount of power compared to a pod with a single-level fabric. While a two-level pod fabric would let you connect more GPUs together into a pod, there would be less power to give to each GPU, undercutting the overall value of the pod. This means that the best way to increase pod size is to increase the pod switch radix and decrease the power-per-port.
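The tradeoff can be sketched with a toy power model. It assumes a nonblocking folded-Clos fabric, in which each GPU link occupies one switch port in a single-level fabric and three in a two-level fabric (leaf down, leaf up, spine down), and it assumes switch power scales linearly with port count. The function names and every number below are hypothetical illustrations, not measurements of any real system:

```python
def switch_ports_per_link(levels: int) -> int:
    """Switch ports consumed per GPU link in a nonblocking folded Clos."""
    if levels < 1:
        raise ValueError("levels must be >= 1")
    return 2 * levels - 1  # 1 port for one level, 3 ports for two levels


def max_gpus(radix: int, levels: int) -> int:
    """Largest pod (per switch plane) that switches of a given radix can connect."""
    return 2 * (radix // 2) ** levels  # radix for one level, radix**2 / 2 for two


def watts_per_gpu(budget_w: float, n_gpus: int, links_per_gpu: int,
                  watts_per_port: float, levels: int) -> float:
    """Power left for each GPU after the pod switches take their share."""
    switch_w = n_gpus * links_per_gpu * switch_ports_per_link(levels) * watts_per_port
    return (budget_w - switch_w) / n_gpus


if __name__ == "__main__":
    # Hypothetical pod: 120 kW budget, 72 GPUs, 18 links per GPU, 10 W per switch port.
    for levels in (1, 2):
        share = watts_per_gpu(120_000, 72, 18, 10.0, levels)
        print(f"{levels}-level fabric: {share:.0f} W available per GPU")
    # Raising the radix grows a single-level pod without adding ports per link.
    print(max_gpus(radix=72, levels=1), "vs", max_gpus(radix=144, levels=1))
```

In this model a second switch level triples the switch power charged to each GPU, whereas a higher radix enlarges the single-level pod at no extra ports per link and a lower power-per-port shrinks the switch share directly.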