H100 is NVIDIA’s first Hopper-based GPU.
Specifications
H100 SXM5
Each H100 SXM5 GPU has:1
- 8 GPCs
- 66 TPCs (an uneven number of TPCs per GPC?)
- 132 SMs (2 per TPC)
- 16,896 FP32 cores (128 per SM)
- 528 4th generation tensor cores (4 per SM)
- 1.83 GHz (low precision), 1.98 GHz (high precision)
- 2:4 structured sparsity
- 80 GB HBM3 (5 stacks)
- 10 512-bit memory controllers
- 3.35 TB/s (max)?
- 900 GB/s NVLink (D2D)
- 128 GB/s PCIe Gen5 (H2D)
- 700 W maximum
GH100
GH100 GPUs ship with a higher-spec part than the SXM5 DGX-style node. Each GH100 GPU has:1
- 8 GPCs
- 72 TPCs (9 TPCs/GPC)
- 144 SMs (2 per TPC)
- 18,432 FP32 cores (128 per SM)
- 576 4th generation tensor cores (4 per SM)
- 6 HBM3 or HBM2e stacks (unclear how much capacity per stack)1
- 12 512-bit memory controllers
H200
H200 GPUs are the same as H100 GPUs (SXM5 or GH variant?), just with 144 GB of HBM3e (more and faster HBM).
Performance
The following are theoretical maximum performance in TFLOPS:2
Data Type | VFMA | Matrix | Sparse |
---|---|---|---|
FP64 | 33.5 | 66.9 | |
FP32 | 66.9 | ||
TF32 | 494.7 | 989.4 | |
FP16 | 133.8 | 989.4 | 1978.9 |
BF16 | 133.8 | 989.4 | 1978.9 |
FP8 | 1978.9 | 3957.8 | |
INT32 | 33.5 | ||
INT8 | 1978.9 | 3957.8 |
You can also project the HPL performance of any H100-based supercomputer by looking at the TFLOPS/node or TFLOPS/GPU from the Top500 list and linearly extrapolate.
Interestingly, the per H100 and H200 varies between 55.0 and 59.0 TF/GPU which is somewhere between the VFMA and Matrix FP64 performance above.