Disaggregated inferencing is an optimization in which the different phases of LLM inferencing are split across different sets of processors.

Prefill/decode disaggregation

As of 2026, “disaggregated inferencing” most commonly refers to running prefill (which is compute-bound) on one set of GPUs and decode (which is memory-bandwidth-bound) on another. This allows inferencing to be distributed more efficiently across GPU types optimized for either FLOPS or TB/s.
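The split follows from a roofline-style argument: prefill time scales with arithmetic throughput, while each decode step re-reads the weights and KV cache and so scales with memory bandwidth. A minimal sketch of that argument, using illustrative numbers (roughly a 70B-class fp16 model) that are assumptions, not measurements of any real GPU:

```python
# Toy roofline-style model of why prefill and decode stress different
# GPU resources. All numbers below are illustrative assumptions.

def prefill_time_s(batch_tokens, flops_per_token, gpu_tflops):
    # Prefill processes the whole prompt at once, so arithmetic
    # dominates: time ~ total FLOPs / compute throughput.
    return (batch_tokens * flops_per_token) / (gpu_tflops * 1e12)

def decode_time_s(kv_cache_bytes, weight_bytes, gpu_tb_per_s):
    # Decode emits one token per step; each step re-reads the model
    # weights and KV cache, so memory bandwidth dominates.
    return (kv_cache_bytes + weight_bytes) / (gpu_tb_per_s * 1e12)

if __name__ == "__main__":
    # Hypothetical 70B-class model: ~2 * 70e9 FLOPs per token,
    # ~140 GB of fp16 weights, 8 GB of KV cache for the batch.
    prefill = prefill_time_s(batch_tokens=4096,
                             flops_per_token=2 * 70e9,
                             gpu_tflops=1000)   # FLOPS-optimized GPU
    decode = decode_time_s(kv_cache_bytes=8e9,
                           weight_bytes=140e9,
                           gpu_tb_per_s=3.35)   # bandwidth-optimized GPU
    print(f"prefill step: {prefill:.3f} s")  # bound by TFLOPS
    print(f"decode step:  {decode:.3f} s")   # bound by TB/s
```

With these assumed numbers, adding FLOPS shortens only the prefill step and adding bandwidth shortens only the decode step, which is why matching each phase to differently-optimized hardware pays off.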

Disaggregated inferencing was first described in the SplitWise paper by Microsoft in 2023.1

Disaggregation ratios

There are a couple of ways to realize this optimization:

  • SplitWise demonstrated using H100s for prefill and A100s for decode to achieve higher throughput and lower cost than running both phases in series on H100s.1 This capability was later added to vLLM.
  • DeepSeek-V3 demonstrated using 32 H800 GPUs for prefill and 320 H800 GPUs for decode.2 Here the same GPU model was used for both phases but loaded differently, balancing tokens/second against latency.
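The DeepSeek-V3 deployment above implies a 1:10 prefill:decode GPU ratio. A sketch of how such a ratio can be derived by sizing each pool to its share of the workload; all throughput numbers here are hypothetical, chosen only so the arithmetic reproduces a 1:10 split:

```python
from math import ceil, gcd

def disaggregation_ratio(prompt_tokens_per_s, output_tokens_per_s,
                         prefill_tok_per_gpu_s, decode_tok_per_gpu_s):
    """Return (prefill_gpus, decode_gpus) needed to keep both pools
    saturated for a given workload. Inputs are assumptions about the
    workload and per-GPU throughput, not published figures."""
    prefill_gpus = ceil(prompt_tokens_per_s / prefill_tok_per_gpu_s)
    decode_gpus = ceil(output_tokens_per_s / decode_tok_per_gpu_s)
    return prefill_gpus, decode_gpus

# Hypothetical workload: 640k prompt tok/s and 320k output tok/s.
# A prefill GPU handles 20k tok/s; a decode GPU only 1k tok/s,
# since decode is far slower per token (bandwidth-bound).
p, d = disaggregation_ratio(640_000, 320_000, 20_000, 1_000)
g = gcd(p, d)
print(p, d, f"ratio {p // g}:{d // g}")  # 32 320 ratio 1:10
```

The right ratio is workload-dependent: longer prompts shift GPUs toward the prefill pool, longer generations toward the decode pool.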

Attention-FFN disaggregation

As of 2026, “attention-FFN disaggregation” (AFD) is the “next frontier” of optimizing inferencing through disaggregation. Instead of splitting by phase, AFD splits by layer type: attention computation (memory-bandwidth-bound, dominated by KV-cache reads) is placed on one set of GPUs, while FFN/MoE computation (compute-bound) is placed on another, with activations exchanged between the two pools at each layer.
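A minimal structural sketch of that split, with pure-Python stand-ins for the GPU kernels; the worker names and arithmetic are illustrative, not taken from any real AFD system:

```python
# Sketch of attention-FFN disaggregation: attention and FFN run as
# separate worker pools that exchange activations at every layer.
# The structure, not the math, is the point.

def attention_worker(x, layer):
    # Would run on the bandwidth-optimized pool (KV-cache reads).
    return [v + 0.1 * layer for v in x]

def ffn_worker(x, layer):
    # Would run on the compute-optimized pool (large matmuls).
    return [v * 2 for v in x]

def forward(x, num_layers):
    # In a real deployment each handoff below crosses an interconnect
    # between the two GPU pools; here it is just a function call.
    for layer in range(num_layers):
        x = attention_worker(x, layer)   # attention pool
        x = ffn_worker(x, layer)         # FFN/MoE pool
    return x

print(forward([1.0], num_layers=2))
```

The per-layer handoff is what distinguishes AFD from prefill/decode disaggregation, where activations cross pools only once per request (as transferred KV cache).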

Footnotes

  1. Splitwise: Efficient generative LLM inference using phase splitting

  2. [2412.19437v2] DeepSeek-V3 Technical Report