LLM inferencing

Inferencing differs from training in that models are often quantized to reduced precision to cut the memory and compute needed to process requests.
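
As an illustration, here is a minimal sketch of symmetric int8 weight quantization in PyTorch. It only demonstrates the memory saving; real serving stacks use more elaborate schemes (per-channel scales, calibration, GPTQ/AWQ), so treat the numbers and method as a toy example.

    # Minimal sketch of symmetric int8 weight quantization (illustrative only;
    # real serving stacks use per-channel scales and calibration, e.g. GPTQ/AWQ).
    import torch

    def quantize_int8(w: torch.Tensor):
        scale = w.abs().max() / 127.0               # largest magnitude maps to 127
        q = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
        return q, scale

    w = torch.randn(4096, 4096)                     # one fp32 weight matrix
    q, scale = quantize_int8(w)
    w_hat = q.to(torch.float32) * scale             # dequantize for comparison
    print(w.nelement() * w.element_size())          # 67,108,864 bytes in fp32
    print(q.nelement() * q.element_size())          # 16,777,216 bytes in int8 (4x smaller)
    print((w - w_hat).abs().mean())                 # small reconstruction error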

From the DeepSpeed-FastGen paper:

  • Prompt processing:
    • input is user-provided text (the prompt)
    • output is a key-value cache for attention
    • compute-bound and scales with the input length
  • Token generation:
    • adds a token to the KV cache, then generates a new token
    • memory bandwidth-bound and shows approximately O(1) scaling (see the sketch after this list)
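
The sketch below walks through the two phases with a greedy decoding loop. It assumes the Hugging Face transformers API; "gpt2" is only a stand-in model and the loop length is arbitrary. The prompt is processed in a single forward pass that produces the KV cache, and each subsequent step feeds one new token back in while reusing that cache.

    # Minimal greedy-decoding sketch of the two phases, assuming the Hugging Face
    # transformers API; "gpt2" is only a stand-in for a real serving model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    prompt_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

    with torch.no_grad():
        # Prompt processing: one forward pass over the whole prompt.
        # Compute-bound; cost grows with prompt length. Builds the KV cache.
        out = model(prompt_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        generated = [next_id]
        for _ in range(16):
            # Token generation: forward pass over a single new token, reusing
            # the KV cache. Memory-bandwidth-bound; roughly constant per token.
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
            generated.append(next_id)

    print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))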

Until I have time to summarize it, I recommend reading Efficient Memory Management for Large Language Model Serving with PagedAttention by Kwon et al. to understand how GPU memory is consumed during inferencing. The paper explains how key-value caches store the keys and values that attention reuses across generation steps, and why managing that cache dominates GPU memory during serving.
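
As a rough illustration of why that memory matters, the sketch below estimates KV-cache size from assumed model dimensions (roughly a 7B-class model in fp16; the defaults are my assumptions, not values from the paper). Per token, the cache holds one key and one value vector per head per layer.

    # Back-of-the-envelope KV-cache size, the memory PagedAttention manages.
    # The defaults are assumptions roughly matching a 7B-class model in fp16.
    def kv_cache_bytes(seq_len, batch_size=1, num_layers=32, num_kv_heads=32,
                       head_dim=128, bytes_per_elem=2):
        # 2 = one key vector plus one value vector per head, per layer, per token
        per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
        return seq_len * batch_size * per_token

    print(kv_cache_bytes(seq_len=4096) / 2**30, "GiB")   # ~2 GiB for one sequence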