The key-value (KV) cache is used during LLM inferencing to accelerate the attention part of transformers. It exploits the fact that, as previously generated tokens are used to generate new tokens (autoregressive decoding), the key and value vectors of older tokens do not change.
Every new token generated during decode depends only on the tokens that precede it, not on tokens that have not yet been generated. Previously generated tokens (and their key and value vectors) therefore never change once they are produced, so the K and V vectors of older tokens can be computed once and reused as subsequent output tokens are generated.
This repeated reuse of K and V vectors gives rise to KV caches, which store the key/value vectors of all previous tokens while the next tokens are being generated.
I wrote a more detailed explanation of why key and value vectors are cacheable in Full attention.
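To make this concrete, here is a minimal single-head sketch in plain NumPy (the projection matrices and dimensions are illustrative, not from a real model) of how a decode loop appends the new token’s K and V to a cache and reuses everything already cached:

```python
import numpy as np

# Minimal single-head sketch of KV caching during autoregressive decode.
# W_q, W_k, W_v are illustrative projection matrices; d is the head dimension.
d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """x: hidden state of the newest token, shape (d,)."""
    q = x @ W_q
    # The new token's K/V are computed once and appended; older entries are
    # reused as-is because earlier tokens never change.
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)            # (t, d) cached keys
    V = np.stack(v_cache)            # (t, d) cached values
    scores = K @ q / np.sqrt(d)      # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V               # attention output for the new token

# Each step reuses the cached K/V of every previous token.
for _ in range(3):
    out = decode_step(rng.standard_normal(d))
```

Without the cache, every decode step would recompute K and V for the entire prefix; with it, each step only projects the newest token.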
Applications
KV caches are very useful for prefix caching, where every query to a chatbot is preceded by the same system prompt, so that prompt’s key and value vectors only need to be computed once.
More generally, KV caches are useful for long-context inferencing, which is prevalent in
- AI models for science that operate on huge amounts of scientific data as input (telescope images, etc.).
- Code generation, where entire codebases can be included with a prompt. More generally, this can be extended to RAG with lots of relevant context.
- Multi-turn chat sessions with long wait times between turns.
Finally, I believe there is utility in KV caching for disaggregated inferencing. A fast, global KV cache allows prefill to be desynchronized from decode.
Implementation
KV caches can be implemented at multiple levels of the memory hierarchy:
- GPU HBM: This is where the KV vectors of the active inferencing session live, because attention for the next token is computed on the GPU.
- CPU DRAM: This is a larger pool where KV vectors that don’t belong to the current multi-turn inferencing session can go. The active session cannot be served from DRAM because all of a conversation’s KV vectors must be in HBM to generate that conversation’s next token.
- Local SSD: This is an even slower, even bigger pool where KV vectors can be cached. It has the same utility as the CPU DRAM.
- Remote storage: Same as above.
NVIDIA Dynamo labels these tiers G1, G2, G3, and G4, and it also attributes a data lifecycle to each tier.1
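As a rough illustration of how such a hierarchy might be traversed, here is a hypothetical sketch of a tiered lookup; the class, method, and tier names are mine for illustration, not Dynamo’s or any real library’s API. A block of K/V vectors is searched from fastest to slowest tier and promoted back into HBM on a hit, since decode can only consume it there.

```python
from typing import Optional

# Hypothetical sketch of a tiered KV-cache lookup (illustrative names only).
class KVCacheTier:
    def __init__(self, name: str, capacity_blocks: int):
        self.name = name
        self.capacity_blocks = capacity_blocks
        self.blocks: dict[str, bytes] = {}  # block_hash -> serialized K/V block

    def get(self, block_hash: str) -> Optional[bytes]:
        return self.blocks.get(block_hash)

    def put(self, block_hash: str, block: bytes) -> None:
        if len(self.blocks) >= self.capacity_blocks:
            self.blocks.pop(next(iter(self.blocks)))  # naive FIFO eviction
        self.blocks[block_hash] = block


def lookup(tiers: list[KVCacheTier], block_hash: str) -> Optional[bytes]:
    """Search tiers in order; promote a hit back into the fastest tier."""
    for i, tier in enumerate(tiers):
        block = tier.get(block_hash)
        if block is not None:
            if i > 0:
                tiers[0].put(block_hash, block)  # promote to HBM so decode can use it
            return block
    return None  # miss everywhere: the K/V block must be recomputed via prefill


tiers = [
    KVCacheTier("GPU HBM", capacity_blocks=1_000),
    KVCacheTier("CPU DRAM", capacity_blocks=10_000),
    KVCacheTier("Local SSD", capacity_blocks=100_000),
    KVCacheTier("Remote storage", capacity_blocks=1_000_000),
]
```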
Sizing
The size of the cached key (or value) vectors for a single token is the product of the following (see the worked example after this list):1
- Number of layers
- Number of attention heads per layer
- Dimension of the attention head
- Precision of the keys/values (bytes per element)
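For example, with Llama-2-7B-like numbers (assumed here purely for illustration: 32 layers, 32 KV heads, head dimension 128, FP16), the per-token footprint works out as follows:

```python
# Worked example of KV-cache sizing (Llama-2-7B-like numbers, assumed for illustration).
num_layers = 32
num_kv_heads = 32          # models with GQA/MQA cache fewer KV heads than query heads
head_dim = 128
bytes_per_element = 2      # FP16

# Per token: keys AND values are both cached, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
print(f"{bytes_per_token / 1024:.0f} KiB per token")         # 512 KiB

# A 32k-token context therefore needs roughly:
context_len = 32_768
print(f"{bytes_per_token * context_len / 2**30:.0f} GiB")    # 16 GiB
```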
Optimizations
Distributed KV caches can become “perforated,” where parts of a transformer’s cached vectors are missing due to either eviction or failure of one of the cache nodes. vLLM is adding support for recomputing KV vectors for perforated parts of a cache2 without throwing out the entire cache.
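A hypothetical sketch of the idea follows; the helper names (cached_blocks, recompute_kv) are illustrative, not vLLM’s actual API. The point is to walk the cached blocks for a sequence and re-run prefill only over the token ranges whose blocks are missing.

```python
# Hypothetical sketch of handling a "perforated" cache: recompute only the
# token positions whose K/V blocks are missing instead of discarding everything.
def fill_holes(token_ids, cached_blocks, block_size, recompute_kv):
    """cached_blocks: dict mapping block index -> K/V block (absent if evicted/lost)."""
    num_blocks = (len(token_ids) + block_size - 1) // block_size
    for b in range(num_blocks):
        if b not in cached_blocks:
            start, end = b * block_size, min((b + 1) * block_size, len(token_ids))
            # Re-run prefill attention only for the missing span; earlier blocks
            # (cached or just recomputed) supply the keys/values it attends to.
            cached_blocks[b] = recompute_kv(token_ids, start, end, cached_blocks)
    return cached_blocks
```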