See Google’s TPUv4 documentation.1

Processor

Each TPUv4 processor has:1

  • 2 TensorCores (not to be confused with NVIDIA’s terminology)
    • 8 MXUs (4 per TensorCore)
    • 2 vector units (1 per TensorCore)
    • 2 scalar units (1 per TensorCore)
    • No sparsity
  • 32 GB HBM2 (4? stacks2)
    • 1.2 TB/s (max)
  • x16 PCIe Gen3
  • 192 W maximum
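
As a sanity check on these specs, the 275 TFLOPS peak BF16 figure in the Performance table below follows from the MXU count, assuming the 128x128 MXU dimensions and ~1,050 MHz clock reported in the TPU v4 ISCA paper (neither figure is stated above):

```python
# Back-of-envelope peak BF16 throughput for one TPUv4 chip.
# Assumptions (from the TPU v4 ISCA'23 paper, not stated in this section):
# each MXU is a 128x128 systolic array and the chip clocks at ~1.05 GHz.
MXUS_PER_CHIP = 8          # 2 TensorCores x 4 MXUs
MACS_PER_MXU = 128 * 128   # one multiply-accumulate per cell per cycle
FLOPS_PER_MAC = 2          # a MAC counts as a multiply plus an add
CLOCK_HZ = 1.05e9          # assumed ~1,050 MHz clock

peak_tflops = MXUS_PER_CHIP * MACS_PER_MXU * FLOPS_PER_MAC * CLOCK_HZ / 1e12
print(f"peak BF16 ~ {peak_tflops:.0f} TFLOPS")  # ~275 TFLOPS
```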

ICI

ICI is a “proprietary inter-chip interconnect” that enables RDMA over P2P PCIe with full host bypass.2 It is the interconnect TPUs use to communicate with one another, providing 400 Gbit/s of bandwidth per link in each direction.

It uses a reliable data layer with in-order delivery and link-level credit-based flow control.2
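
The paper does not detail the data-layer mechanism, but link-level credit-based flow control generally works as in the minimal, hypothetical sketch below (the `CreditedLink` class is invented for illustration): the sender may transmit only while it holds credits, and the receiver returns one credit per buffer slot it frees, which makes the link lossless and preserves ordering.

```python
from collections import deque

class CreditedLink:
    """Toy model of link-level credit-based flow control (hypothetical;
    the actual ICI data layer is not publicly documented)."""

    def __init__(self, rx_buffer_slots: int):
        self.credits = rx_buffer_slots      # sender-side credit counter
        self.rx_queue = deque()             # receiver buffer (FIFO: in-order)

    def send(self, packet) -> bool:
        # The sender may transmit only while it holds credits, so it can
        # never overrun the receiver's buffer (the link stays lossless).
        if self.credits == 0:
            return False                    # back-pressure: stall the sender
        self.credits -= 1
        self.rx_queue.append(packet)
        return True

    def receive(self):
        # Draining a buffer slot returns one credit to the sender.
        packet = self.rx_queue.popleft()
        self.credits += 1
        return packet

link = CreditedLink(rx_buffer_slots=2)
assert link.send("a") and link.send("b")
assert not link.send("c")       # no credits left: sender stalls
assert link.receive() == "a"    # frees a slot, returns a credit
assert link.send("c")           # sender may proceed again
```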

Tray

A single TPU tray has 4 processors arranged in a 2x2x1 ICI mesh.2

Cubes

A cube has 16 trays (64 TPUs), which fit in a single physical rack. The cube is a 4x4x4 mesh2 of TPUs whose outward-facing ICI links attach to the pod’s shared OCS layer, connecting it to other cubes. Each x/y/z “face” of the cube exposes 16 optical ICI links (a 4x4 grid), for 96 links per cube.

Because each cube’s reconfigurable connectivity terminates at the OCS, this 4x4x4 cube is the minimum granularity of reconfigurability within a TPU cluster.

Supercomputer

64 cubes are assembled into a pod of 4,096 processors with 6,144 optical ICI links and 48 OCS switches.
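
These figures are mutually consistent, as a quick arithmetic check shows (the per-OCS port count is derived here, not stated in the sources):

```python
# Quick consistency check on the published cube and pod figures.
cube_dim = 4
links_per_face = cube_dim * cube_dim          # 16 optical ICI links per face
optical_links_per_cube = 6 * links_per_face   # 6 faces -> 96 links per cube

cubes_per_pod = 64
pod_chips = cubes_per_pod * cube_dim ** 3           # 4,096 processors
pod_links = cubes_per_pod * optical_links_per_cube  # 6,144 optical ICI links

ocs_switches = 48
# Assuming each optical link terminates on one OCS port (a port-accounting
# assumption, not stated in the sources), each OCS serves 128 links.
print(pod_chips, pod_links, pod_links // ocs_switches)  # 4096 6144 128
```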

Optical circuit switches (OCS) are used to connect (“xconnect”) multiple cubes to form a job-specific torus within a SuperPod.2 Such a grouping of cubes is called a “slice.” OCS is new in TPUv4 (it was not present in TPUv3), and Google uses its own Palomar OCS technology.

Reconfiguring the torus via OCS takes ten seconds.3

Cells

Multiple TPUv4 supercomputers may share a single Borg cell.1

Job scheduling

Jobs are scheduled using Borg, and Borg sends commands to a SuperPod’s Pod Manager to reconfigure the OCS switches.2 The ICI is reconfigured so that each job lands on a tight torus carved out of the pod’s 4x4x4 cubes via optical switching, avoiding the job fragmentation that occurs on other low-radix networks.
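
As a hypothetical illustration of why optical switching avoids fragmentation: the OCS can cross-connect any set of free cubes into a torus, so a job only needs enough free cubes rather than a physically contiguous block. The sketch below allocates at cube granularity (all names are invented; the real Pod Manager interface is not public):

```python
# Hypothetical sketch of cube-granularity slice allocation. Because the
# OCS can cross-connect arbitrary free cubes into a torus, any set of
# free cubes suffices; contiguity in the physical pod does not matter.
CUBE = 4  # cubes are 4x4x4

def cubes_needed(topology: tuple[int, int, int]) -> int:
    """Number of 4x4x4 cubes a requested slice topology consumes."""
    x, y, z = topology
    assert all(d % CUBE == 0 for d in topology), "dims must be multiples of 4"
    return (x // CUBE) * (y // CUBE) * (z // CUBE)

def allocate_slice(free_cubes: set[int], topology):
    """Pick any free cubes for the slice (no contiguity constraint)."""
    need = cubes_needed(topology)
    if len(free_cubes) < need:
        return None                       # job must wait, but never fragments
    chosen = [free_cubes.pop() for _ in range(need)]
    # A real Pod Manager would now program the OCS cross-connects to wire
    # the chosen cubes' face links into the requested torus (~10 s).
    return chosen

free = set(range(64))                     # all 64 cubes of a pod are free
print(allocate_slice(free, (4, 4, 8)))    # a 128-chip slice: 2 cubes
```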

A job specification must include a requested topology (e.g., 4x4x8) and a cell.
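
For instance, a requested topology might be written as an “AxBxC” string; this format is an assumption borrowed from Cloud TPU topology naming rather than the NSDI paper:

```python
def parse_topology(spec: str) -> tuple[int, ...]:
    """Parse a requested topology string such as '4x4x8'.
    The 'AxBxC' format is assumed from Cloud TPU naming conventions."""
    dims = tuple(int(d) for d in spec.lower().split("x"))
    assert len(dims) == 3, "TPUv4 slices are three-dimensional"
    return dims

dims = parse_topology("4x4x8")
print(dims, "->", dims[0] * dims[1] * dims[2], "chips")  # (4, 4, 8) -> 128
```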

Performance

The following are the theoretical maximum throughputs in TFLOPS (TOPS for the integer types):1

| Data Type | VFMA | Matrix | Sparse |
|-----------|------|--------|--------|
| FP64      |      |        |        |
| FP32      |      |        |        |
| TF32      |      |        |        |
| FP16      |      |        |        |
| BF16      |      | 275    |        |
| FP8       |      |        |        |
| INT32     |      |        |        |
| INT8      |      | 275    |        |
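
Combining the 275 TFLOPS Matrix peak with the 1.2 TB/s HBM bandwidth listed above gives the chip’s roofline balance point:

```python
# Roofline balance point implied by the figures above.
peak_matrix_flops = 275e12   # peak BF16 Matrix throughput, FLOP/s
hbm_bandwidth = 1.2e12       # peak HBM bandwidth, bytes/s

balance = peak_matrix_flops / hbm_bandwidth
print(f"~{balance:.0f} FLOPs per HBM byte")
# Kernels with arithmetic intensity below ~229 FLOPs/byte are
# memory-bandwidth-bound on this chip.
```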

Legacy TPU

TPUv3

TPUv3 pods had 1024 TPUs in a static 32x32 ICI torus and could be combined into a 128x32 mesh with “limited ICI routing capability” for the largest scales.2

TPUv2

TPUv2 pods had 256 TPUs in a static 16x16 ICI torus.2

Footnotes

  1. TPU v4 (cloud.google.com)

  2. Zu et al. “Resiliency at scale: managing Google’s TPUv4 machine learning supercomputer.” NSDI’24. nsdi24-zu.pdf

  3. Gemini: A Family of Highly Capable Multimodal Models