From Splitwise: Efficient Generative LLM Inference Using Phase Splitting:

Quote

To realize such a setup, the cached context from the prompt computation needs to be communicated over from the prompt processing machine to the token generation machine at low latency. We implement these transfers in an optimized manner over the back-end Infiniband interconnects

This is in reference to disaggregated inferencing. NVIDIA uses NVLink at rack scale for this, but this paper shows that disaggregating inference over InfiniBand works just fine too. Interesting that they didn’t decouple the two phases with a shared KV cache in the middle; maybe that would’ve been too hard, since at the time of writing, KV caches were all node-local.
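
The paper overlaps these transfers with the prompt computation, layer by layer, to hide the latency. A minimal sketch of that idea, assuming a PyTorch point-to-point process group between a "prompt" rank and a "token" rank (the shapes, layer count, and gloo backend are illustrative assumptions, not the paper's implementation):

```python
# Launch with: torchrun --nproc_per_node=2 kv_transfer_sketch.py
import torch
import torch.distributed as dist

NUM_LAYERS, NUM_HEADS, HEAD_DIM, PROMPT_LEN = 4, 8, 64, 128

def main():
    dist.init_process_group(backend="gloo")  # NCCL would ride the IB/NVLink fabric
    rank = dist.get_rank()
    shape = (2, PROMPT_LEN, NUM_HEADS, HEAD_DIM)  # stacked K and V for one layer

    if rank == 0:
        # "Prompt machine": pretend prefill just produced these KV tensors.
        kv_cache = [torch.randn(shape) for _ in range(NUM_LAYERS)]
        # Ship each layer as it becomes available, overlapping the transfers
        # instead of doing one big blocking copy at the end.
        reqs = [dist.isend(kv, dst=1) for kv in kv_cache]
        for r in reqs:
            r.wait()
    else:
        # "Token machine": receive the cache and continue decoding from it.
        kv_cache = [torch.empty(shape) for _ in range(NUM_LAYERS)]
        for kv in kv_cache:
            dist.recv(kv, src=0)
        print(f"received KV cache for {len(kv_cache)} layers")

if __name__ == "__main__":
    main()
```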

Quote

After the first token is generated, the following tokens only use the last generated token and the KV-cache as inputs to the forward pass of the model. This makes the subsequent token generation more memory bandwidth and capacity intensive than the computationally heavy prompt phase.

This is the basis for disaggregated inferencing.
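
A toy sketch of the two phases, assuming a single attention layer and greedy decoding (names and shapes are illustrative, not the paper's model): prefill pushes the whole prompt through in one pass and builds the KV cache; each decode step pushes only one token through but still has to read the entire cache and all the weights.

```python
import torch

D_MODEL, VOCAB = 64, 1000
embed = torch.randn(VOCAB, D_MODEL)
wq, wk, wv = (torch.randn(D_MODEL, D_MODEL) for _ in range(3))
w_out = torch.randn(D_MODEL, VOCAB)

def attend(q, k, v):
    scores = torch.softmax(q @ k.T / D_MODEL ** 0.5, dim=-1)
    return scores @ v

def prefill(prompt_ids):
    # Prompt phase: one pass over ALL prompt tokens -> large matmuls,
    # compute-bound. Produces the KV cache plus the first output token.
    x = embed[prompt_ids]                    # (prompt_len, d_model)
    k, v = x @ wk, x @ wv                    # KV cache for the whole prompt
    out = attend(x @ wq, k, v)
    return int((out[-1] @ w_out).argmax()), (k, v)

def decode_step(token_id, kv_cache):
    # Token phase: only the last generated token flows through, but every
    # step reads the full KV cache and all weights -> bandwidth/capacity bound.
    k, v = kv_cache
    x = embed[token_id].unsqueeze(0)         # (1, d_model)
    k, v = torch.cat([k, x @ wk]), torch.cat([v, x @ wv])
    out = attend(x @ wq, k, v)
    return int((out[-1] @ w_out).argmax()), (k, v)

token, cache = prefill(torch.randint(0, VOCAB, (16,)))
for _ in range(8):
    token, cache = decode_step(token, cache)
```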

Quote

For batch tasks (e.g., summarization), TTFT or TBT latency metrics are less important than throughput. On the other hand, for latency-sensitive tasks (e.g., conversational APIs), TTFT and TBT are the more important metrics with tighter SLOs.
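
For reference, TTFT (time to first token) and TBT (time between tokens) fall straight out of per-token timestamps; a trivial sketch with made-up numbers:

```python
def ttft(arrival_time, token_times):
    """Time to first token: request arrival -> first generated token."""
    return token_times[0] - arrival_time

def tbt(token_times):
    """Time between tokens: gaps between consecutive generated tokens."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]

arrival = 0.0
token_times = [0.35, 0.40, 0.46, 0.51]      # seconds, illustrative only
print("TTFT:", ttft(arrival, token_times))  # dominated by prefill
print("worst TBT:", max(tbt(token_times)))  # per-step decode latency
```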

Quote

We use production traces taken from two Azure LLM inference services on November 11th 2023. Our traces represent the most common scenarios in LLM inference today: coding and conversation. We have released a subset of our traces at https://github.com/Azure/AzurePublicDataset.

This trace data is very basic. Unclear how much is multi-turn.
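
A quick sketch of pulling basic statistics out of the released trace, assuming the CSV layout in the AzurePublicDataset repo (the file name and column names here are assumptions and may differ):

```python
import pandas as pd

# File and column names are assumptions based on the public repo's LLM
# inference traces; adjust to whatever the release actually contains.
df = pd.read_csv("AzureLLMInferenceTrace_code.csv", parse_dates=["TIMESTAMP"])
print(df["ContextTokens"].describe())    # prompt-size distribution
print(df["GeneratedTokens"].describe())  # output-size distribution
# With no session/conversation ID in this (assumed) layout, there is no way
# to tell how much of the workload is multi-turn.
```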

Quote

we do not reuse the KV-cache between requests to emulate a cloud service with security guarantees.

Wish I knew what this really meant. Do CSPs dump cache just in case?

Quote

Since the coding service typically only generates the next few words in the program as the user types, the median number of output token is 13 tokens.

So code-generation workloads have huge inputs and tiny outputs.

Quote

if a prompt of 100 tokens is running in its prompt phase, we count the active tokens as 100. However, once the request is in the token phase, we count it as one active token, since the tokens are generated one at a time (assuming a beam search size of one [51]). We find that most of the time (60–70%) for conversation is spent running only 20 tokens or fewer. Since the coding service has very few output tokens, it experiences even worse batching in the token phase and runs with a single token for more than 20% of the time.

Note that their nomenclature is different from what has become standard:

  • Prompt phase = prefill
  • Token phase = decode

They used continuous batching in this study, so this is really saying that code-generation workloads get little benefit from continuous batching in the decode phase: with so few output tokens, the decode batches rarely fill up.
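
Their "active tokens" accounting is easy to restate in code; a minimal sketch, assuming a beam size of one as in the quote (the request representation is mine, not theirs):

```python
def active_tokens(batch):
    """batch: list of (phase, prompt_len) pairs, phase in {"prompt", "token"}.

    A request in the prompt phase contributes its full prompt length; a
    request in the token phase contributes exactly one token per step.
    """
    return sum(prompt_len if phase == "prompt" else 1
               for phase, prompt_len in batch)

# One 100-token prefill batched with three decoding requests -> 103 active
# tokens, no matter how long those decoding requests' prompts were.
print(active_tokens([("prompt", 100), ("token", 1500), ("token", 80), ("token", 7)]))
```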

Quote

we see that most of the E2E time is spent running the token phase. This holds true even for the coding trace, where prompt sizes are large and generated tokens few. In fact, we find that for BLOOM-176B, a prompt phase with 1500 input tokens takes the same time as token phase with only 6 output tokens.

Since decode is memory bandwidth-bound, this suggests that LLM inferencing is memory bandwidth-bound overall.
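
A back-of-envelope roofline check makes the 1500-input-vs-6-output comparison plausible. The numbers below are rough public ballpark figures for fp16 BLOOM-176B on an 8-GPU A100 box; they are my assumptions rather than the paper's measurements, ignore KV-cache reads and real-world efficiency, and land in the same ballpark rather than matching exactly:

```python
PARAMS = 176e9            # BLOOM-176B parameter count
BYTES_PER_PARAM = 2       # fp16 weights
PEAK_FLOPS = 8 * 312e12   # ~312 TFLOPS fp16 per A100, 8 GPUs (ballpark)
PEAK_BW = 8 * 2.0e12      # ~2 TB/s HBM bandwidth per A100, 8 GPUs (ballpark)

def prefill_time(prompt_tokens):
    # Compute-bound: ~2 FLOPs per parameter per token in the forward pass.
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS

def decode_time(output_tokens):
    # Bandwidth-bound: every decode step streams all the weights from HBM.
    return output_tokens * PARAMS * BYTES_PER_PARAM / PEAK_BW

print(f"prefill, 1500 tokens: ~{prefill_time(1500) * 1e3:.0f} ms")
print(f"decode,     6 tokens: ~{decode_time(6) * 1e3:.0f} ms")
```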

Quote

While the prompt phase utilizes the power budget of the GPU efficiently, the token phase does not.

Because the SMs spend most of the inference time waiting on HBM: decode dominates the end-to-end time and is memory bandwidth-bound, so the compute units sit idle and the full power budget never gets drawn.

Quote

Token generation can be run on less compute-capable hardware for better Perf/W and Perf/$ efficiencies.

But that hardware still needs high memory bandwidth, which is the limiting factor for decode. CPX takes the opposite approach, making the prompt phase (prefill) cheaper; given that this work found decode is where the time is spent, the value proposition of CPX doesn’t seem very high.

Quote

The prompt machine also sends over the KV-cache to the token machine, which continues the token generation until the response is complete. We use continuous batching at the token machines to maximize their utilization.

Impressive use of continuous batching.
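
A minimal sketch of what continuous batching on the token machines amounts to (the request and model classes are toy stand-ins, not the paper's system): requests join and leave the decode batch between steps, so the batch never has to drain before new work is admitted.

```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    max_new_tokens: int                        # stand-in stopping condition
    output: list = field(default_factory=list)
    def finished(self):
        return len(self.output) >= self.max_new_tokens

class ToyModel:
    def decode_step(self, batch):
        # Stand-in for one batched forward pass over the transferred KV caches.
        return [random.randint(0, 999) for _ in batch]

def serve_decode(model, incoming, max_batch=32):
    queue, active = deque(incoming), []
    while queue or active:
        # Continuous batching: admit new requests between steps instead of
        # waiting for the current batch to drain.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        for req, token in zip(active, model.decode_step(active)):
            req.output.append(token)
        # Retire finished requests; the freed slots go to queued work next step.
        active = [r for r in active if not r.finished()]

serve_decode(ToyModel(), [Request(max_new_tokens=n) for n in (3, 13, 40)])
```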