Calculating peak FLOPS is a bit contrived these days due to the complex thermal conditions under which processors operate.
Vector FLOPS
The H100 GPU is rated at 66.9 TFLOPS FP32 using vector FMAs (not matrix/tensor cores).
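As a minimal sketch of where that number comes from, assuming the H100 SXM figures of 128 FP32 cores per SM, 132 SMs, a ~1.98 GHz boost clock, and 2 FLOPs per FMA:

```python
# Sketch of the vector (CUDA-core) FP32 peak, assuming H100 SXM figures.
fp32_cores_per_sm = 128     # assumption: FP32 cores per SM
sms = 132                   # SMs per H100 GPU
flops_per_fma = 2           # one FMA = a multiply and an add
boost_clock_hz = 1.98e9     # assumption: ~1980 MHz boost clock

peak_fp32 = fp32_cores_per_sm * sms * flops_per_fma * boost_clock_hz
print(f"{peak_fp32 / 1e12:.1f} TFLOPS FP32")  # -> 66.9 TFLOPS
```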
Matrix FLOPS
Tensor cores are a little more complicated. For example, on the H100 with BF16:
- Each tensor core performs 512 FMAs per clock (1024 FLOPS per clock). See the History page for where this number comes from.
- There are 4 tensor cores per SM and 132 SMs per H100 GPU
- So a whole H100 has 528 tensor cores and is capable of 540,672 FLOPS/clock
- The H100 clocks at 1.83 GHz, which means 540,672 × 1.83 GHz = 989,429.76 GFLOPS BF16
When structured sparsity is included, NVIDIA doubles the effective FLOPS. The GPU doesn’t actually do twice as much math; it is just effectively doing twice as much math because 2 out of every 4 values are zero (and therefore do not require a FLOP to compute).
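The per-clock figures in the list above can be turned into these numbers with a few lines of arithmetic; here is a sketch that also applies the 2× sparsity multiplier (variable names are for illustration only):

```python
# Sketch of the H100 BF16 tensor-core peak, using the figures listed above.
fmas_per_tensor_core = 512          # FMAs per tensor core per clock
flops_per_fma = 2
tensor_cores_per_sm = 4
sms = 132
clock_hz = 1.83e9

flops_per_clock = fmas_per_tensor_core * flops_per_fma * tensor_cores_per_sm * sms
dense_peak = flops_per_clock * clock_hz
sparse_peak = dense_peak * 2        # NVIDIA's 2:4 structured-sparsity factor

print(f"{flops_per_clock:,} FLOPS/clock")                      # 540,672
print(f"{dense_peak / 1e9:,.2f} GFLOPS BF16")                  # 989,429.76
print(f"{sparse_peak / 1e12:,.0f} TFLOPS BF16 with sparsity")  # ~1,979
```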
Rpeak
HPL also has a measure of peak FLOPS called Rpeak, whose calculation is left up to the submitter. For example,
- NVIDIA used to take the clock that can be sustained at a GPU’s power limit while running HPL (e.g., 1305 MHz for A100). Then they calculated the peak FLOPS from that, even though the peak clock of the GPU is higher (e.g., 1410 MHz for A100).
- HPE/Cray used to assume HPL ran at the GPU’s base clock (e.g., 1095 MHz for A100), even though the GPU could boost higher. This underrepresented Rpeak, but it meant the measured FLOPS (Rmax) appeared closer to the theoretical max, thereby letting them claim higher efficiency.
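As an illustration of how the choice of clock shifts the reported efficiency, here is a sketch for a single A100 running FP64 HPL. It assumes the A100's 108 SMs at 128 FP64 tensor-core FLOPs per clock per SM, and the Rmax value is purely hypothetical:

```python
# Sketch: how the clock chosen for Rpeak changes the reported HPL "efficiency".
# Assumption: A100 FP64 on tensor cores = 108 SMs x 128 FLOPs/clock/SM.
flops_per_clock = 108 * 128           # 13,824 FP64 FLOPs per clock

def rpeak(clock_mhz):
    """Rpeak in FLOPS for a given clock assumption."""
    return flops_per_clock * clock_mhz * 1e6

peak     = rpeak(1410)  # A100 peak boost clock
nvidia   = rpeak(1305)  # clock sustainable at the power limit while running HPL
hpe_cray = rpeak(1095)  # base clock

rmax = 14.0e12  # hypothetical measured HPL result, FLOPS (illustrative only)

for name, rp in [("peak clock", peak), ("NVIDIA", nvidia), ("HPE/Cray", hpe_cray)]:
    print(f"{name:10s} Rpeak = {rp / 1e12:5.2f} TFLOPS, efficiency = {rmax / rp:.1%}")
```

The same hypothetical Rmax looks roughly 72% efficient against the peak-clock Rpeak but over 90% efficient against the base-clock Rpeak, which is the effect described above.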