See Google’s TPUv4 documentation.1 In brief, each chip has:
- 2 TensorCores (not to be confused with NVIDIA’s terminology)
- 8 MXUs (4 per TensorCore)
- 2 vector units (1 per TensorCore)
- 2 scalar units (1 per TensorCore)
- No sparsity
- 32 GB HBM2 (? stacks)
- 1.2 TB/s memory bandwidth (max)
- x16 PCIe Gen3
- 192 W maximum
These TPUv4 chips are assembled into SuperPods of 4,096 chips sharing an optical circuit switch that can dynamically reconfigure 64-chip cubes into different torus topologies in ten seconds.2 Each 64-chip cube has 2 TiB of HBM2, and each SuperPod has 128 TiB of HBM2.
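The aggregate-memory figures follow directly from the per-chip HBM capacity. A minimal sketch of that arithmetic (treating the 32 GB per chip as 32 GiB, as the totals above do):

```python
# Aggregate HBM across TPUv4 topology tiers (figures from the spec above).
HBM_PER_CHIP_GIB = 32   # HBM2 per chip
CUBE_CHIPS = 64         # one optically switched cube
SUPERPOD_CHIPS = 4096   # one SuperPod

GIB_PER_TIB = 1024

def aggregate_hbm_tib(chips: int, per_chip_gib: int = HBM_PER_CHIP_GIB) -> float:
    """Total HBM in TiB for a group of `chips` chips."""
    return chips * per_chip_gib / GIB_PER_TIB

print(aggregate_hbm_tib(CUBE_CHIPS))      # 2.0 TiB per cube
print(aggregate_hbm_tib(SUPERPOD_CHIPS))  # 128.0 TiB per SuperPod
```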
Performance
The following are theoretical maximum throughputs in TFLOPS (TOPS for integer types):1
Data Type | VFMA | Matrix | Sparse |
---|---|---|---|
FP64 | | | |
FP32 | | | |
TF32 | | | |
FP16 | | | |
BF16 | | 275 | |
FP8 | | | |
INT32 | | | |
INT8 | | 275 | |
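The 275 TFLOPS matrix peak can be sanity-checked from the unit counts above. A back-of-envelope sketch, assuming each MXU is a 128×128 systolic array performing one multiply-accumulate (2 FLOPs) per cell per cycle at a ~1.05 GHz clock (the MXU dimensions and clock rate are assumptions, not stated in this spec):

```python
# Back-of-envelope check of the 275 TFLOPS BF16 matrix peak.
# Assumed (not stated above): 128x128 MXU systolic array, ~1.05 GHz clock.
MXUS_PER_CHIP = 8     # from the spec list above
MXU_DIM = 128         # assumed systolic-array dimension
FLOPS_PER_MAC = 2     # one multiply + one add per cell per cycle
CLOCK_HZ = 1.05e9     # assumed chip clock

peak_tflops = MXUS_PER_CHIP * MXU_DIM * MXU_DIM * FLOPS_PER_MAC * CLOCK_HZ / 1e12
print(round(peak_tflops, 1))  # ≈ 275.3
```

The same arithmetic applies to the INT8 row, since the MXUs retire INT8 MACs at the same per-cycle rate.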