This is prefill:

Prefill is the phase of LLM inference in which the prompt is run through the model. The products of prefill are:

  1. K and V vectors for every layer of attention and every token in the prompt
  2. The final hidden state of the model prior to output tokens being generated
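To make these two products concrete, here is a minimal sketch of prefill for a toy attention-only model. All sizes, weight names, and the single-head, residual-only structure are illustrative assumptions, not any particular model's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

n_layers, n_tokens, d = 2, 4, 8            # toy sizes (assumptions)
x = rng.standard_normal((n_tokens, d))     # embedded prompt

# Per-layer projection weights (hypothetical toy model)
Wq = rng.standard_normal((n_layers, d, d))
Wk = rng.standard_normal((n_layers, d, d))
Wv = rng.standard_normal((n_layers, d, d))

kv_cache = []  # product 1: K and V for every layer and every prompt token
h = x
for layer in range(n_layers):
    q, k, v = h @ Wq[layer], h @ Wk[layer], h @ Wv[layer]
    kv_cache.append((k, v))
    # Causal self-attention over the whole prompt at once
    # (a matrix-matrix product, since all tokens are present).
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    h = h + weights @ v  # residual connection; MLP omitted for brevity

final_hidden = h[-1]  # product 2: hidden state used to sample the first output token
```

Decode then reuses `kv_cache` so that each new token only has to compute its own K and V, rather than re-running the whole prompt.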

Because the whole input prompt is processed at once, the work is dominated by large matrix-matrix multiplications with high arithmetic intensity, so prefill is compute-bound. This contrasts with the next phase, decode.

Specifically, the input of prefill is an n × d matrix, where n is the number of input tokens and d is the hidden dimension. Each projection therefore turns into a GEMM (matrix-matrix multiplication). As long as n is large enough to fill a tensor core tile, the GEMM takes longer than loading its inputs from HBM, so it is compute-bound.
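A back-of-envelope arithmetic-intensity calculation makes the compute-bound claim concrete. This sketch counts FLOPs and HBM traffic for a single projection GEMM; the sizes and the fp16 (2-byte) element assumption are illustrative, and it ignores caching effects:

```python
def arithmetic_intensity(n, d, bytes_per_elem=2):
    # Multiplying an n x d activation matrix by a d x d weight matrix
    # takes 2*n*d*d FLOPs and moves roughly the input (n*d), the
    # weights (d*d), and the output (n*d) between HBM and the chip.
    flops = 2 * n * d * d
    bytes_moved = bytes_per_elem * (n * d + d * d + n * d)
    return flops / bytes_moved

# Prefill: many tokens at once -> high FLOPs per byte, compute-bound.
print(arithmetic_intensity(n=2048, d=4096))

# Decode: one token at a time -> the same op degenerates to a GEMV,
# roughly one FLOP per byte of weights loaded, memory-bound.
print(arithmetic_intensity(n=1, d=4096))
```

With n = 2048 the intensity is around a thousand FLOPs per byte, far above the FLOP-to-bandwidth ratio of current GPUs, while with n = 1 it is about one FLOP per byte, which is why decode is limited by HBM bandwidth instead.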