The key-value (KV) cache is used during LLM inferencing to accelerate the attention part of transformers. It exploits the fact that, as previously generated tokens are used to generate new tokens (autoregressive decoding), the key and value vectors of older tokens do not change.
Every new token generated during decode depends only on the tokens that precede it, not on tokens that have not yet been generated. Previously generated tokens (and their key and value vectors) therefore never change once they are produced, so the K and V vectors of older tokens can be computed once and reused as subsequent output tokens are generated.
This repeated reuse of K and V vectors gives rise to KV caches, which store the key/value vectors of all previous tokens while the next tokens are being generated.
I wrote a more detailed explanation of why key and value vectors are cacheable in Full attention.
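To make this concrete, here is a minimal single-head sketch in plain NumPy (the projection matrices and dimensions are illustrative, not from a real model) of how a decode loop appends the new token’s K and V to a cache and reuses everything already cached:

```python
import numpy as np

# Minimal single-head sketch of KV caching during autoregressive decode.
# W_q, W_k, W_v are illustrative projection matrices; d is the head dimension.
d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per generated token

def decode_step(x):
    """x: hidden state of the newest token, shape (d,)."""
    q = x @ W_q
    # The new token's K/V are computed once and appended; older entries are
    # reused as-is because earlier tokens never change.
    k_cache.append(x @ W_k)
    v_cache.append(x @ W_v)
    K = np.stack(k_cache)            # (t, d) cached keys
    V = np.stack(v_cache)            # (t, d) cached values
    scores = K @ q / np.sqrt(d)      # attend over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V               # attention output for the new token

# Each step reuses the cached K/V of every previous token.
for _ in range(3):
    out = decode_step(rng.standard_normal(d))
```

Without the cache, every decode step would recompute K and V for the entire prefix; with it, each step only projects the newest token.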
Applications
KV caches are very useful for prefix caching, where every query to a chatbot is preceded by the same system prompt, so that prompt’s key and value vectors only need to be computed once.
More generally, KV caches are useful for long-context inferencing, which is prevalent in
- AI models for science that operate on huge amounts of scientific data as input (telescope images, etc.).
- Code generation, where entire codebases can be included with a prompt. More generally, this can be extended to RAG with lots of relevant context.
- Multi-turn chat sessions with long wait times between turns.
Finally, I believe there is utility in KV caching for disaggregated inferencing. A fast, global KV cache allows prefill to be desynchronized from decode.
Implementation
KV caches can be implemented at multiple levels of the memory hierarchy:
- GPU HBM: This is where the KV vectors of the active inferencing session live, because attention for the next token is computed on the GPU.
- CPU DRAM: This is a larger pool where KV vectors that don’t belong to the current multi-turn inferencing session can go. The active session cannot be served from DRAM because all of a conversation’s KV vectors must be in HBM to generate that conversation’s next token.
- Local SSD: This is an even slower, even bigger pool where KV vectors can be cached. It has the same utility as the CPU DRAM.
- Remote storage: Same as above.
NVIDIA Dynamo labels these tiers G1, G2, G3, and G4, and it also attributes a data lifecycle to each tier.1
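As a rough illustration of how such a hierarchy might be traversed, here is a hypothetical sketch of a tiered lookup; the class, method, and tier names are mine for illustration, not Dynamo’s or any real library’s API. A block of K/V vectors is searched from fastest to slowest tier and promoted back into HBM on a hit, since decode can only consume it there.

```python
from typing import Optional

# Hypothetical sketch of a tiered KV-cache lookup (illustrative names only).
class KVCacheTier:
    def __init__(self, name: str, capacity_blocks: int):
        self.name = name
        self.capacity_blocks = capacity_blocks
        self.blocks: dict[str, bytes] = {}  # block_hash -> serialized K/V block

    def get(self, block_hash: str) -> Optional[bytes]:
        return self.blocks.get(block_hash)

    def put(self, block_hash: str, block: bytes) -> None:
        if len(self.blocks) >= self.capacity_blocks:
            self.blocks.pop(next(iter(self.blocks)))  # naive FIFO eviction
        self.blocks[block_hash] = block


def lookup(tiers: list[KVCacheTier], block_hash: str) -> Optional[bytes]:
    """Search tiers in order; promote a hit back into the fastest tier."""
    for i, tier in enumerate(tiers):
        block = tier.get(block_hash)
        if block is not None:
            if i > 0:
                tiers[0].put(block_hash, block)  # promote to HBM so decode can use it
            return block
    return None  # miss everywhere: the K/V block must be recomputed via prefill


tiers = [
    KVCacheTier("GPU HBM", capacity_blocks=1_000),
    KVCacheTier("CPU DRAM", capacity_blocks=10_000),
    KVCacheTier("Local SSD", capacity_blocks=100_000),
    KVCacheTier("Remote storage", capacity_blocks=1_000_000),
]
```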
Sizing
The size of the cached key (or value) vectors for a single token is the product of the following (see the worked example after this list):1
- Number of layers
- Number of attention heads per layer
- Dimension of the attention head
- Precision of the keys/values (bytes per element)
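For example, with Llama-2-7B-like numbers (assumed here purely for illustration: 32 layers, 32 KV heads, head dimension 128, FP16), the per-token footprint works out as follows:

```python
# Worked example of KV-cache sizing (Llama-2-7B-like numbers, assumed for illustration).
num_layers = 32
num_kv_heads = 32          # models with GQA/MQA cache fewer KV heads than query heads
head_dim = 128
bytes_per_element = 2      # FP16

# Per token: keys AND values are both cached, hence the factor of 2.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
print(f"{bytes_per_token / 1024:.0f} KiB per token")         # 512 KiB

# A 32k-token context therefore needs roughly:
context_len = 32_768
print(f"{bytes_per_token * context_len / 2**30:.0f} GiB")    # 16 GiB
```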
Optimizations
Distributed KV caches can become “perforated,” where parts of a transformer’s cached vectors are missing due to either eviction or failure of one of the cache nodes. vLLM is adding support for recomputing KV vectors for perforated parts of a cache2 without throwing out the entire cache.
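A hypothetical sketch of the idea follows; the helper names (cached_blocks, recompute_kv) are illustrative, not vLLM’s actual API. The point is to walk the cached blocks for a sequence and re-run prefill only over the token ranges whose blocks are missing.

```python
# Hypothetical sketch of handling a "perforated" cache: recompute only the
# token positions whose K/V blocks are missing instead of discarding everything.
def fill_holes(token_ids, cached_blocks, block_size, recompute_kv):
    """cached_blocks: dict mapping block index -> K/V block (absent if evicted/lost)."""
    num_blocks = (len(token_ids) + block_size - 1) // block_size
    for b in range(num_blocks):
        if b not in cached_blocks:
            start, end = b * block_size, min((b + 1) * block_size, len(token_ids))
            # Re-run prefill attention only for the missing span; earlier blocks
            # (cached or just recomputed) supply the keys/values it attends to.
            cached_blocks[b] = recompute_kv(token_ids, start, end, cached_blocks)
    return cached_blocks
```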