Calculating peak FLOPS is a bit contrived these days due to the complex thermal conditions under which processors operate.

Vector FLOPS

The H100 GPU is rated at 66.9 TFLOPS FP32 using vector FMAs (not matrix/tensor cores):

66.9 TFLOPS = 132 SMs × 128 FP32 cores per SM × 2 FLOPs per FMA × 1.98 GHz (boost clock)
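
As a quick sanity check, here is the same arithmetic as a minimal Python sketch; the constants are the publicly documented H100 SXM5 figures:

```python
# Peak vector FP32 FLOPS for an H100 SXM5:
# SMs x FP32 cores per SM x FLOPs per FMA x boost clock
sms = 132                 # streaming multiprocessors
fp32_cores_per_sm = 128   # FP32 CUDA cores per SM
flops_per_fma = 2         # an FMA counts as a multiply plus an add
boost_clock_hz = 1.98e9   # advertised boost clock

peak = sms * fp32_cores_per_sm * flops_per_fma * boost_clock_hz
print(f"{peak / 1e12:.1f} TFLOPS")  # -> 66.9 TFLOPS
```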

Matrix FLOPS

Tensor cores are a little more complicated. For example, on H100 and BF16, the advertised 989.4 TFLOPS (dense) works out to:

989.4 TFLOPS = 132 SMs × 4 tensor cores per SM × 512 FMAs per tensor core per cycle × 2 FLOPs per FMA × 1.83 GHz

Note that the clock implied by this figure (~1.83 GHz) is lower than the ~1.98 GHz implied by the vector figure above.

When structured sparsity is included, NVIDIA doubles the quoted FLOPS. The GPU doesn’t actually do twice as much math: with 2:4 structured sparsity, 2 out of every 4 values are zero and therefore require no FLOP to compute, so the same hardware delivers twice the effective throughput.
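
The same back-of-the-envelope math as a sketch, with the dense and sparse figures side by side. The 512-FMA-per-tensor-core rate and the ~1.83 GHz clock are back-computed from the datasheet figure, not independently documented here:

```python
# H100 SXM5 BF16 tensor FLOPS, back-computed from the 989.4 TFLOPS
# datasheet figure. The per-tensor-core FMA rate and the implied
# ~1.83 GHz clock are inferred, not separately published.
sms = 132
tensor_cores_per_sm = 4
bf16_fmas_per_tc_per_clk = 512   # inferred dense FMA rate per tensor core
flops_per_fma = 2                # multiply + add
implied_clock_hz = 1.83e9        # implied by the 989.4 TFLOPS figure

dense = (sms * tensor_cores_per_sm * bf16_fmas_per_tc_per_clk
         * flops_per_fma * implied_clock_hz)
sparse = 2 * dense  # 2:4 structured sparsity: the zeroed half is skipped
print(f"dense:  {dense / 1e12:.0f} TFLOPS")   # -> 989 TFLOPS
print(f"sparse: {sparse / 1e12:.0f} TFLOPS")  # -> 1979 TFLOPS
```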

Rpeak

HPL also has a measure of peak FLOPS called Rpeak, whose calculation is left up to the submitter. For example,

  • NVIDIA used to use the clock that can be sustained at a GPU’s power limit while running HPL (e.g., 1305 MHz for A100), then calculate peak FLOPS from that, even though the peak clock of the GPU is higher (e.g., 1410 MHz for A100).
  • HPE/Cray used to assume HPL ran at the GPU’s base clock (e.g., 1095 MHz for A100), even though the GPU could boost higher. This underrepresented Rpeak, but it meant the measured FLOPS (Rmax) appeared closer to the theoretical max, thereby claiming higher efficiency. The sketch below puts numbers on both conventions.
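
To make the difference concrete, a minimal sketch assuming the A100’s documented FP64 tensor-core rate of 64 FMAs (128 FLOPs) per SM per clock, with the clock values cited above:

```python
# Rpeak for a single A100 under the different clock conventions above.
# HPL runs in FP64; each of the A100's 108 SMs can retire 64 FP64
# tensor-core FMAs (128 FLOPs) per clock.
SMS = 108
FP64_TENSOR_FLOPS_PER_SM_PER_CLK = 128

def rpeak_tflops(clock_mhz: float) -> float:
    """Peak FP64 FLOPS at the given clock, in TFLOPS."""
    return SMS * FP64_TENSOR_FLOPS_PER_SM_PER_CLK * clock_mhz * 1e6 / 1e12

for label, mhz in [
    ("peak boost clock", 1410),
    ("NVIDIA: clock sustained under HPL", 1305),
    ("HPE/Cray: base clock", 1095),
]:
    print(f"{label:>34}: {rpeak_tflops(mhz):5.2f} TFLOPS")

# A lower Rpeak makes the same measured Rmax look better, since HPL
# "efficiency" is reported as Rmax / Rpeak.
```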