Disaggregated inferencing is an optimization in which the different phases of LLM inferencing are split across different sets of processors.

Prefill/decode disaggregation

As of 2026, “disaggregated inferencing” most commonly refers to running prefill (which is compute-bound) on one set of GPUs and decode (which is memory-bandwidth-bound) on another. This allows inferencing to be distributed more efficiently across GPU types optimized for either FLOPS or TB/s.
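The split follows from a roofline-style argument: prefill time scales with arithmetic throughput, while each decode step re-reads the weights and KV cache and so scales with memory bandwidth. A minimal sketch of that argument, using illustrative numbers (roughly a 70B-class fp16 model) that are assumptions, not measurements of any real GPU:

```python
# Toy roofline-style model of why prefill and decode stress different
# GPU resources. All numbers below are illustrative assumptions.

def prefill_time_s(batch_tokens, flops_per_token, gpu_tflops):
    # Prefill processes the whole prompt at once, so arithmetic
    # dominates: time ~ total FLOPs / compute throughput.
    return (batch_tokens * flops_per_token) / (gpu_tflops * 1e12)

def decode_time_s(kv_cache_bytes, weight_bytes, gpu_tb_per_s):
    # Decode emits one token per step; each step re-reads the model
    # weights and KV cache, so memory bandwidth dominates.
    return (kv_cache_bytes + weight_bytes) / (gpu_tb_per_s * 1e12)

if __name__ == "__main__":
    # Hypothetical 70B-class model: ~2 * 70e9 FLOPs per token,
    # ~140 GB of fp16 weights, 8 GB of KV cache for the batch.
    prefill = prefill_time_s(batch_tokens=4096,
                             flops_per_token=2 * 70e9,
                             gpu_tflops=1000)   # FLOPS-optimized GPU
    decode = decode_time_s(kv_cache_bytes=8e9,
                           weight_bytes=140e9,
                           gpu_tb_per_s=3.35)   # bandwidth-optimized GPU
    print(f"prefill step: {prefill:.3f} s")  # bound by TFLOPS
    print(f"decode step:  {decode:.3f} s")   # bound by TB/s
```

With these assumed numbers, adding FLOPS shortens only the prefill step and adding bandwidth shortens only the decode step, which is why matching each phase to differently-optimized hardware pays off.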

Disaggregated inferencing was first described in the SplitWise paper by Microsoft in 2023.1

Disaggregation ratios

There are a couple of ways to realize this optimization:

  • SplitWise demonstrated using H100s for prefill and A100s for decode to achieve higher throughput and lower cost than running both phases in series on H100s.1 This capability was later added to vLLM.
  • DeepSeek-V3 demonstrated using 32 H800 GPUs for prefill and 320 H800 GPUs for decode.2 Here the same GPU model was used for both phases but loaded differently, balancing tokens/second against latency.
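The DeepSeek-V3 deployment above implies a 1:10 prefill:decode GPU ratio. A sketch of how such a ratio can be derived by sizing each pool to its share of the workload; all throughput numbers here are hypothetical, chosen only so the arithmetic reproduces a 1:10 split:

```python
from math import ceil, gcd

def disaggregation_ratio(prompt_tokens_per_s, output_tokens_per_s,
                         prefill_tok_per_gpu_s, decode_tok_per_gpu_s):
    """Return (prefill_gpus, decode_gpus) needed to keep both pools
    saturated for a given workload. Inputs are assumptions about the
    workload and per-GPU throughput, not published figures."""
    prefill_gpus = ceil(prompt_tokens_per_s / prefill_tok_per_gpu_s)
    decode_gpus = ceil(output_tokens_per_s / decode_tok_per_gpu_s)
    return prefill_gpus, decode_gpus

# Hypothetical workload: 640k prompt tok/s and 320k output tok/s.
# A prefill GPU handles 20k tok/s; a decode GPU only 1k tok/s,
# since decode is far slower per token (bandwidth-bound).
p, d = disaggregation_ratio(640_000, 320_000, 20_000, 1_000)
g = gcd(p, d)
print(p, d, f"ratio {p // g}:{d // g}")  # 32 320 ratio 1:10
```

The right ratio is workload-dependent: longer prompts shift GPUs toward the prefill pool, longer generations toward the decode pool.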

Attention-FFN disaggregation

As of 2026, “attention-FFN disaggregation” (AFD) is the “next frontier” of optimizing inferencing through disaggregation. Instead of splitting by phase, AFD splits by layer type: attention computation (memory-bandwidth-bound, dominated by KV-cache reads) is placed on one set of GPUs, while FFN/MoE computation (compute-bound) is placed on another, with activations exchanged between the two pools at each layer.
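A minimal structural sketch of that split, with pure-Python stand-ins for the GPU kernels; the worker names and arithmetic are illustrative, not taken from any real AFD system:

```python
# Sketch of attention-FFN disaggregation: attention and FFN run as
# separate worker pools that exchange activations at every layer.
# The structure, not the math, is the point.

def attention_worker(x, layer):
    # Would run on the bandwidth-optimized pool (KV-cache reads).
    return [v + 0.1 * layer for v in x]

def ffn_worker(x, layer):
    # Would run on the compute-optimized pool (large matmuls).
    return [v * 2 for v in x]

def forward(x, num_layers):
    # In a real deployment each handoff below crosses an interconnect
    # between the two GPU pools; here it is just a function call.
    for layer in range(num_layers):
        x = attention_worker(x, layer)   # attention pool
        x = ffn_worker(x, layer)         # FFN/MoE pool
    return x

print(forward([1.0], num_layers=2))
```

The per-layer handoff is what distinguishes AFD from prefill/decode disaggregation, where activations cross pools only once per request (as transferred KV cache).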

Footnotes

  1. Splitwise: Efficient generative LLM inference using phase splitting

  2. [2412.19437v2] DeepSeek-V3 Technical Report