Calculating peak FLOPS is a bit contrived these days due to the complex thermal conditions under which processors operate.
Vector FLOPS
The H100 GPU is rated at 66.9 TFLOPS FP32 using vector FMAs (not matrix/tensor cores).
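As a minimal sketch of where that number comes from, assuming the H100 SXM figures of 128 FP32 cores per SM, 132 SMs, a ~1.98 GHz boost clock, and 2 FLOPs per FMA:

```python
# Sketch of the vector (CUDA-core) FP32 peak, assuming H100 SXM figures.
fp32_cores_per_sm = 128     # assumption: FP32 cores per SM
sms = 132                   # SMs per H100 GPU
flops_per_fma = 2           # one FMA = a multiply and an add
boost_clock_hz = 1.98e9     # assumption: ~1980 MHz boost clock

peak_fp32 = fp32_cores_per_sm * sms * flops_per_fma * boost_clock_hz
print(f"{peak_fp32 / 1e12:.1f} TFLOPS FP32")  # -> 66.9 TFLOPS
```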
Matrix FLOPS
Tensor cores are a little more complicated. For example, on the H100 with BF16:
- Each tensor core performs 512 FMAs per clock (1024 FLOPS per clock). See the History page for where this number comes from.
- There are 4 tensor cores per SM and 132 SMs per H100 GPU
- So a whole H100 has 528 tensor cores and is capable of 540,672 FLOPS/clock
- The H100 clocks at 1.83 GHz, which means 540,672 × 1.83 GHz = 989,429.76 GFLOPS BF16
When structured sparsity is included, NVIDIA doubles the effective FLOPS. The GPU doesn’t actually do twice as much math; it is just effectively doing twice as much math because 2 out of every 4 values are zero (and therefore do not require a FLOP to compute).
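The per-clock figures in the list above can be turned into these numbers with a few lines of arithmetic; here is a sketch that also applies the 2× sparsity multiplier (variable names are for illustration only):

```python
# Sketch of the H100 BF16 tensor-core peak, using the figures listed above.
fmas_per_tensor_core = 512          # FMAs per tensor core per clock
flops_per_fma = 2
tensor_cores_per_sm = 4
sms = 132
clock_hz = 1.83e9

flops_per_clock = fmas_per_tensor_core * flops_per_fma * tensor_cores_per_sm * sms
dense_peak = flops_per_clock * clock_hz
sparse_peak = dense_peak * 2        # NVIDIA's 2:4 structured-sparsity factor

print(f"{flops_per_clock:,} FLOPS/clock")                      # 540,672
print(f"{dense_peak / 1e9:,.2f} GFLOPS BF16")                  # 989,429.76
print(f"{sparse_peak / 1e12:,.0f} TFLOPS BF16 with sparsity")  # ~1,979
```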
Rpeak
HPL also has a measure of peak FLOPS called Rpeak, whose calculation is left up to the submitter. For example,
- NVIDIA used to take the clock that can be sustained at a GPU’s power limit while running HPL (e.g., 1305 MHz for A100). Then they calculated the peak FLOPS from that, even though the peak clock of the GPU is higher (e.g., 1410 MHz for A100).
- HPE/Cray used to assume HPL ran at the GPU’s base clock (e.g., 1095 MHz for A100), even though the GPU could boost higher. This underrepresented Rpeak, but it meant the measured FLOPS (Rmax) appeared closer to the theoretical max, thereby letting them claim higher efficiency.
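As an illustration of how the choice of clock shifts the reported efficiency, here is a sketch for a single A100 running FP64 HPL. It assumes the A100's 108 SMs at 128 FP64 tensor-core FLOPs per clock per SM, and the Rmax value is purely hypothetical:

```python
# Sketch: how the clock chosen for Rpeak changes the reported HPL "efficiency".
# Assumption: A100 FP64 on tensor cores = 108 SMs x 128 FLOPs/clock/SM.
flops_per_clock = 108 * 128           # 13,824 FP64 FLOPs per clock

def rpeak(clock_mhz):
    """Rpeak in FLOPS for a given clock assumption."""
    return flops_per_clock * clock_mhz * 1e6

peak     = rpeak(1410)  # A100 peak boost clock
nvidia   = rpeak(1305)  # clock sustainable at the power limit while running HPL
hpe_cray = rpeak(1095)  # base clock

rmax = 14.0e12  # hypothetical measured HPL result, FLOPS (illustrative only)

for name, rp in [("peak clock", peak), ("NVIDIA", nvidia), ("HPE/Cray", hpe_cray)]:
    print(f"{name:10s} Rpeak = {rp / 1e12:5.2f} TFLOPS, efficiency = {rmax / rp:.1%}")
```

The same hypothetical Rmax looks roughly 72% efficient against the peak-clock Rpeak but over 90% efficient against the base-clock Rpeak, which is the effect described above.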