This is prefill:
Prefill is the phase of LLM inference in which the input prompt is run through the model. The products of prefill are:
- K and V vectors for every layer of attention and every token in the prompt
- The final hidden state of the last prompt token, from which the first output token is sampled
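The two products above can be sketched with a toy single-head, single-layer-stack transformer in NumPy. The weight matrices, sizes, and the simplified attention (no residuals, no MLP, no multi-head split) are all illustrative assumptions, not a real model:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, n_layers = 8, 16, 2  # toy sizes: 8 prompt tokens, hidden dim 16, 2 layers

# Hypothetical random projection weights for each layer
Wq = [rng.standard_normal((d, d)) for _ in range(n_layers)]
Wk = [rng.standard_normal((d, d)) for _ in range(n_layers)]
Wv = [rng.standard_normal((d, d)) for _ in range(n_layers)]

x = rng.standard_normal((n, d))    # embedded prompt, shape (n, d)

kv_cache = []
for layer in range(n_layers):
    # All prompt tokens are projected at once: (n, d) @ (d, d) is a GEMM
    Q = x @ Wq[layer]
    K = x @ Wk[layer]
    V = x @ Wv[layer]
    kv_cache.append((K, V))        # product 1: K/V for every layer and token

    # Simplified causal attention (residuals and MLP omitted for brevity)
    scores = Q @ K.T / np.sqrt(d)
    scores += np.triu(np.full((n, n), -np.inf), k=1)   # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    x = weights @ V

final_hidden = x[-1]               # product 2: last token's hidden state
print(len(kv_cache), final_hidden.shape)  # → 2 (16,)
```

Decode then reuses `kv_cache` so that each new token only needs its own K/V computed, which is what makes the two phases so different in character.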
Because the whole input prompt is passed through the model at once, prefill is a dense computation and is therefore compute-bound. This contrasts with the next phase, decode, which is memory-bandwidth-bound.
Specifically, the input of prefill is an n × d matrix, where n is the number of input tokens and d is the hidden dimension. Multiplying it by each d × d weight matrix is a GEMM (matrix-matrix multiplication). As long as n is large enough to fill a tensor core tile, the GEMM takes longer than loading its operands from HBM, so it is compute-bound.
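A back-of-envelope arithmetic-intensity calculation makes the compute-bound claim concrete. The GPU numbers below are rough A100-class assumptions (312 TFLOP/s FP16 tensor-core peak, 2 TB/s HBM bandwidth), used only to place the ridge point:

```python
# Arithmetic intensity of one (n, d) @ (d, d) GEMM in FP16.
PEAK_FLOPS = 312e12          # assumed peak tensor-core FLOP/s
HBM_BW     = 2.0e12          # assumed HBM bandwidth, bytes/s
ridge = PEAK_FLOPS / HBM_BW  # FLOP/byte needed to be compute-bound (~156)

def intensity(n, d, bytes_per_elem=2):
    flops = 2 * n * d * d                              # one multiply-add per MAC
    bytes_moved = bytes_per_elem * (n*d + d*d + n*d)   # read X and W, write Y
    return flops / bytes_moved

for n in (1, 64, 1024):
    i = intensity(n, 4096)
    bound = "compute" if i > ridge else "memory"
    print(f"n={n:5d}: {i:7.1f} FLOP/byte ({bound}-bound)")
```

With d = 4096, n = 1 (one decode step) yields about 1 FLOP/byte, deep in memory-bound territory, while n = 1024 (a typical prefill) yields roughly 680 FLOP/byte, well past the ridge; this is the quantitative sense in which prefill is compute-bound and decode is not.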