This is decode:
Decode is the phase of LLM inference in which output tokens are generated (decoded) one at a time. It follows prefill, and each output token is generated conditioned on all the input and output tokens that preceded it.
Because the next token depends only on the tokens that preceded it (and not on tokens that will be generated after it), each K and V vector never changes once it is computed. These vectors can therefore be cached (in a KV cache) rather than repeatedly recomputed during every forward pass through the transformer as output tokens are generated.
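The caching described above can be sketched in a few lines. This is a toy single-head attention decode step (with an assumed tiny hidden dimension and random projection matrices, purely for illustration), showing that K and V for past tokens are read from the cache rather than recomputed:

```python
import numpy as np

# Toy single-head attention with a KV cache. Dimensions and weights are
# illustrative assumptions, not a real model.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []  # grows by one K and one V vector per decoded token

def decode_step(x):
    """Attend over all cached tokens given the newest token's hidden state x."""
    q = x @ W_q
    # Only the newest token's K and V are computed; past ones come from the cache.
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)             # (t, d)
    V = np.stack(v_cache)             # (t, d)
    scores = K @ q / np.sqrt(d)       # (t,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over past tokens
    return weights @ V                # (d,) attention output

for _ in range(4):                    # four decode steps
    out = decode_step(rng.standard_normal(d))
```

After four steps the cache holds exactly four K and four V vectors; without the cache, each step would recompute K and V for every earlier token.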
Most of the decode phase is spent repeatedly loading two types of data from HBM into the GPU's SRAM:
- Model weights
- K and V vectors from the KV cache
This makes decode highly sensitive to memory bandwidth instead of compute.
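A back-of-the-envelope estimate makes the traffic concrete. All numbers below are assumptions for illustration (a hypothetical 7B-parameter model in FP16, 32 layers, 32 KV heads of dimension 128, 2048 tokens of context, and roughly H100-class HBM bandwidth):

```python
# Rough estimate of HBM bytes loaded per decode step. All figures are
# illustrative assumptions, not measurements.
params = 7e9                  # hypothetical 7B-parameter model
bytes_per_param = 2           # FP16 weights
layers, kv_heads, head_dim = 32, 32, 128
seq_len = 2048                # tokens of context so far
bytes_per_elem = 2            # FP16 KV cache

weight_bytes = params * bytes_per_param
# One K and one V vector per layer, per KV head, per past token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

print(f"weights: {weight_bytes/1e9:.1f} GB, kv cache: {kv_bytes/1e9:.2f} GB")
# If every byte must cross HBM once per token, bandwidth alone caps tokens/sec.
hbm_bw = 3.35e12              # ~3.35 TB/s, approximate H100 HBM3 bandwidth
print(f"max tokens/sec per request ≈ {hbm_bw / (weight_bytes + kv_bytes):.0f}")
```

Even in this sketch, the weights dominate the traffic, and the bandwidth-implied ceiling on tokens per second is reached long before the GPU's compute is saturated.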
Specifically, the input to decode has shape 1 × d, where the 1 comes from processing only the last generated token and d is the hidden dimension. Multiplying this single row by each weight matrix is a GEMV (matrix-vector multiplication), which tensor cores cannot process efficiently: every weight element loaded from HBM is used only once. As a result, the time it takes to perform the GEMV is shorter than the time required to load its inputs from HBM, and decode is limited by memory bandwidth.
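The memory-bound claim can be checked with arithmetic intensity: FLOPs performed per byte loaded, compared against the GPU's ratio of compute to bandwidth. The hidden dimension and H100 figures below are assumptions for illustration:

```python
# Arithmetic intensity of a d x d FP16 GEMV: FLOPs per byte loaded.
d = 4096                        # assumed hidden dimension
flops = 2 * d * d               # one multiply + one add per weight element
bytes_loaded = 2 * d * d        # each FP16 weight read once; vectors are negligible
intensity = flops / bytes_loaded
print(intensity)                # 1 FLOP per byte, regardless of d

# Approximate H100 machine balance: ~989 TFLOP/s FP16 over ~3.35 TB/s HBM.
balance = 989e12 / 3.35e12
print(round(balance))           # FLOPs per byte needed to become compute-bound
```

At roughly 1 FLOP per byte against a machine balance in the hundreds, the GEMV sits deep in the memory-bound region: the tensor cores finish long before the next chunk of weights arrives from HBM. (Batching multiple requests turns the GEMV back into a GEMM and raises the intensity, which is why decode throughput improves with batch size.)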