See Google’s TPUv4 documentation.1 In brief, each chip has:
- 2 TensorCores (not to be confused with NVIDIA’s terminology)
- 8 MXUs (4 per TensorCore)
- 2 vector units (1 per TensorCore)
- 2 scalar units (1 per TensorCore)
- No sparsity
- 32 GB HBM2 (? stacks)
- 1.2 TB/s memory bandwidth (max)
- x16 PCIe Gen3
- 192 W maximum
These TPUv4 chips are assembled into SuperPods of 4,096 chips sharing an optical circuit switch that can dynamically reconfigure 64-chip cubes into different torus topologies in ten seconds.2 Each 64-chip cube has 2 TiB of HBM2, and each SuperPod has 128 TiB of HBM2.
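The aggregate-memory figures follow directly from the per-chip HBM capacity. A minimal sketch of that arithmetic (treating the 32 GB per chip as 32 GiB, as the totals above do):

```python
# Aggregate HBM across TPUv4 topology tiers (figures from the spec above).
HBM_PER_CHIP_GIB = 32   # HBM2 per chip
CUBE_CHIPS = 64         # one optically switched cube
SUPERPOD_CHIPS = 4096   # one SuperPod

GIB_PER_TIB = 1024

def aggregate_hbm_tib(chips: int, per_chip_gib: int = HBM_PER_CHIP_GIB) -> float:
    """Total HBM in TiB for a group of `chips` chips."""
    return chips * per_chip_gib / GIB_PER_TIB

print(aggregate_hbm_tib(CUBE_CHIPS))      # 2.0 TiB per cube
print(aggregate_hbm_tib(SUPERPOD_CHIPS))  # 128.0 TiB per SuperPod
```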
Performance
The following are theoretical maximum throughputs in TFLOPS (TOPS for integer types):1
Data Type | VFMA | Matrix | Sparse |
---|---|---|---|
FP64 | | | |
FP32 | | | |
TF32 | | | |
FP16 | | | |
BF16 | | 275 | |
FP8 | | | |
INT32 | | | |
INT8 | | 275 | |
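The 275 TFLOPS matrix peak can be sanity-checked from the unit counts above. A back-of-envelope sketch, assuming each MXU is a 128×128 systolic array performing one multiply-accumulate (2 FLOPs) per cell per cycle at a ~1.05 GHz clock (the MXU dimensions and clock rate are assumptions, not stated in this spec):

```python
# Back-of-envelope check of the 275 TFLOPS BF16 matrix peak.
# Assumed (not stated above): 128x128 MXU systolic array, ~1.05 GHz clock.
MXUS_PER_CHIP = 8     # from the spec list above
MXU_DIM = 128         # assumed systolic-array dimension
FLOPS_PER_MAC = 2     # one multiply + one add per cell per cycle
CLOCK_HZ = 1.05e9     # assumed chip clock

peak_tflops = MXUS_PER_CHIP * MXU_DIM * MXU_DIM * FLOPS_PER_MAC * CLOCK_HZ / 1e12
print(round(peak_tflops, 1))  # ≈ 275.3
```

The same arithmetic applies to the INT8 row, since the MXUs retire INT8 MACs at the same per-cycle rate.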