This is decode:
Decode is the phase of LLM inference in which output tokens are generated (decoded) one at a time. It follows prefill, and each output token is generated conditioned on all the input and output tokens that preceded it.
Because the next token depends only on the tokens that preceded it (and not on tokens that will be generated after it), each K and V vector never changes once it is computed. These vectors can therefore be cached (in a KV cache) rather than repeatedly recomputed during every forward pass through the transformer as output tokens are generated.
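The caching described above can be sketched in a few lines. This is a toy single-head attention decode step (with an assumed tiny hidden dimension and random projection matrices, purely for illustration), showing that K and V for past tokens are read from the cache rather than recomputed:

```python
import numpy as np

# Toy single-head attention with a KV cache. Dimensions and weights are
# illustrative assumptions, not a real model.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []  # grows by one K and one V vector per decoded token

def decode_step(x):
    """Attend over all cached tokens given the newest token's hidden state x."""
    q = x @ W_q
    # Only the newest token's K and V are computed; past ones come from the cache.
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)             # (t, d)
    V = np.stack(v_cache)             # (t, d)
    scores = K @ q / np.sqrt(d)       # (t,) attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over past tokens
    return weights @ V                # (d,) attention output

for _ in range(4):                    # four decode steps
    out = decode_step(rng.standard_normal(d))
```

After four steps the cache holds exactly four K and four V vectors; without the cache, each step would recompute K and V for every earlier token.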
Most of the decode phase is spent repeatedly loading two types of data from HBM into the GPU's SRAM:
- Model weights
- K and V vectors from the KV cache
This makes decode highly sensitive to memory bandwidth instead of compute.
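A back-of-the-envelope estimate makes the traffic concrete. All numbers below are assumptions for illustration (a hypothetical 7B-parameter model in FP16, 32 layers, 32 KV heads of dimension 128, 2048 tokens of context, and roughly H100-class HBM bandwidth):

```python
# Rough estimate of HBM bytes loaded per decode step. All figures are
# illustrative assumptions, not measurements.
params = 7e9                  # hypothetical 7B-parameter model
bytes_per_param = 2           # FP16 weights
layers, kv_heads, head_dim = 32, 32, 128
seq_len = 2048                # tokens of context so far
bytes_per_elem = 2            # FP16 KV cache

weight_bytes = params * bytes_per_param
# One K and one V vector per layer, per KV head, per past token.
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

print(f"weights: {weight_bytes/1e9:.1f} GB, kv cache: {kv_bytes/1e9:.2f} GB")
# If every byte must cross HBM once per token, bandwidth alone caps tokens/sec.
hbm_bw = 3.35e12              # ~3.35 TB/s, approximate H100 HBM3 bandwidth
print(f"max tokens/sec per request ≈ {hbm_bw / (weight_bytes + kv_bytes):.0f}")
```

Even in this sketch, the weights dominate the traffic, and the bandwidth-implied ceiling on tokens per second is reached long before the GPU's compute is saturated.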
Specifically, the input to decode has shape 1 × d, where the 1 comes from processing only the last generated token and d is the hidden dimension. Multiplying this single row by each weight matrix is a GEMV (matrix-vector multiplication), which tensor cores cannot process efficiently: every weight element loaded from HBM is used only once. As a result, the time it takes to perform the GEMV is shorter than the time required to load its inputs from HBM, and decode is limited by memory bandwidth.
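The memory-bound claim can be checked with arithmetic intensity: FLOPs performed per byte loaded, compared against the GPU's ratio of compute to bandwidth. The hidden dimension and H100 figures below are assumptions for illustration:

```python
# Arithmetic intensity of a d x d FP16 GEMV: FLOPs per byte loaded.
d = 4096                        # assumed hidden dimension
flops = 2 * d * d               # one multiply + one add per weight element
bytes_loaded = 2 * d * d        # each FP16 weight read once; vectors are negligible
intensity = flops / bytes_loaded
print(intensity)                # 1 FLOP per byte, regardless of d

# Approximate H100 machine balance: ~989 TFLOP/s FP16 over ~3.35 TB/s HBM.
balance = 989e12 / 3.35e12
print(round(balance))           # FLOPs per byte needed to become compute-bound
```

At roughly 1 FLOP per byte against a machine balance in the hundreds, the GEMV sits deep in the memory-bound region: the tensor cores finish long before the next chunk of weights arrives from HBM. (Batching multiple requests turns the GEMV back into a GEMM and raises the intensity, which is why decode throughput improves with batch size.)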