The key-value (KV) cache is used during LLM inferencing to accelerate the attention computation in transformers. It exploits the fact that, as previously generated tokens are used to generate new tokens (autoregressive decoding), the key and value vectors of those older tokens do not change.

Every new token generated during decode depends only on the tokens that precede it, not on tokens that have yet to be generated. This means that previously generated tokens' key and value vectors do not change once they are computed, and that the K and V vectors for older tokens can therefore be computed once and reused repeatedly as output tokens are generated.

This repeated reuse of K and V vectors gives rise to KV caches, which store the key and value vectors of all previously generated tokens for reuse while generating the next tokens.
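As a rough sketch of the mechanism, here is a minimal single-head, unbatched decode step in PyTorch. The weight matrices, shapes, and function name are placeholders of my own, not any particular framework's API:

```python
import torch

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One autoregressive decode step with a KV cache.

    x_new:            (1, d_model) embedding of the newest token
    k_cache, v_cache: (t, d_head) key/value vectors of the t earlier tokens
    """
    q = x_new @ W_q   # query for the new token only
    k = x_new @ W_k   # key for the new token only
    v = x_new @ W_v   # value for the new token only

    # Older tokens' K and V never change, so we just append the new ones.
    k_cache = torch.cat([k_cache, k], dim=0)   # (t+1, d_head)
    v_cache = torch.cat([v_cache, v], dim=0)   # (t+1, d_head)

    # Attention over the full cached history, but only for the new query.
    scores = (q @ k_cache.T) / k_cache.shape[-1] ** 0.5   # (1, t+1)
    weights = torch.softmax(scores, dim=-1)
    out = weights @ v_cache                                # (1, d_head)
    return out, k_cache, v_cache
```

Only the new token's Q, K, and V projections are computed at each step; everything older is read back from the cache.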

I wrote a more detailed explanation of why key and value vectors are cacheable in Full attention.

Applications

KV caches are very useful for prefix caching, where every query to a chatbot is preceded by the same system prompt.
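A minimal sketch of the idea, assuming a hypothetical run_prefill helper that runs the model over some tokens (optionally continuing from cached K/V) and returns the accumulated K/V tensors:

```python
# Hypothetical in-memory cache keyed by the token IDs of the shared system prompt.
_prefix_kv = {}

def prefill_chat(system_ids, user_ids, run_prefill):
    """Prefill a chat request, reusing cached K/V for a shared system prompt.

    run_prefill(token_ids, past_kv=None) is an assumed helper that runs the
    model over token_ids, continuing from past_kv if given, and returns the
    accumulated K/V tensors.
    """
    key = tuple(system_ids)
    if key not in _prefix_kv:
        # The first request with this system prompt pays its full prefill cost once.
        _prefix_kv[key] = run_prefill(system_ids)
    # Every later request only has to prefill the user-specific suffix.
    return run_prefill(user_ids, past_kv=_prefix_kv[key])
```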

More generally, KV caches are useful for long-context inferencing, which is prevalent in

  • AI models for science which operate on huge amounts of scientific data as input
  • Code generation, where entire codebases can be included with a prompt

Finally, I believe there is utility in KV caching for disaggregated inferencing. A fast, global KV cache allows prefill to be desynchronized from decode.
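A toy sketch of what that desynchronization looks like, with an ordinary dict standing in for the fast, global KV store and run_prefill/run_decode as assumed helpers:

```python
# A plain dict stands in for a fast, global KV store shared by prefill and
# decode servers; in practice this would be a distributed, network-attached cache.
shared_kv_store = {}

def prefill_server(request_id, prompt_ids, run_prefill):
    """Compute K/V for the prompt and publish it for any decode server to use."""
    shared_kv_store[request_id] = run_prefill(prompt_ids)

def decode_server(request_id, run_decode):
    """Pick up the published K/V, possibly on a different node and later, and decode."""
    past_kv = shared_kv_store[request_id]
    return run_decode(past_kv)
```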

Optimizations

Distributed KV caches can become “perforated,” where parts of a transformer’s cached vectors are missing due to either eviction or failure of one of the cache nodes. vLLM is adding support for recomputing KV vectors for perforated parts of a cache¹ without throwing out the entire cache.
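This is not vLLM's actual implementation, but the general shape of such a repair is easy to sketch. Assuming a block-wise cache where missing blocks are marked None, and run_prefill/merge helpers of my own invention:

```python
def repair_kv_cache(token_ids, kv_blocks, block_size, run_prefill, merge):
    """Recompute only the missing ("perforated") blocks of a block-wise KV cache.

    Assumptions (not vLLM's API):
      - kv_blocks[i] holds K/V for tokens [i*block_size, (i+1)*block_size),
        or None if that block was evicted or lost.
      - run_prefill(token_ids, past_kv) recomputes K/V for token_ids, attending
        to past_kv, the K/V of all preceding tokens.
      - merge(blocks) concatenates a list of per-block K/V tensors.

    Blocks are repaired left to right so that each recomputation can attend to
    the (by then intact) K/V of everything before the hole.
    """
    for i, block in enumerate(kv_blocks):
        if block is None:
            start = i * block_size
            end = min(start + block_size, len(token_ids))
            past_kv = merge(kv_blocks[:i])
            # Recompute just this hole rather than discarding the whole cache.
            kv_blocks[i] = run_prefill(token_ids[start:end], past_kv)
    return kv_blocks
```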

Footnotes

  1. https://github.com/vllm-project/vllm/issues/25950