Disaggregated inferencing is an optimization in which the different phases of LLM inference are split across different sets of processors.
Prefill/decode disaggregation
As of 2026, “disaggregated inferencing” most commonly refers to running prefill (which is compute-bound) on one set of GPUs and decode (which is memory-bandwidth-bound) on a different set of GPUs. This makes it possible to distribute inference work across GPU types that are optimized for either raw compute (FLOPS) or memory bandwidth (TB/s).
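A back-of-envelope arithmetic-intensity estimate shows why prefill tends to be compute-bound while decode is bandwidth-bound. The model size, token counts, and cost model below are illustrative assumptions, not measurements:

```python
# Back-of-envelope arithmetic intensity (FLOPs per byte of HBM traffic)
# for one forward pass of a dense transformer. Deliberately crude model:
# ~2 * params FLOPs per token (matmuls), and every weight read from HBM
# once per pass (fp16 => 2 bytes per parameter).

def arithmetic_intensity(batch_tokens: int, params: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs per byte moved for one pass over `batch_tokens` tokens."""
    flops = 2 * params * batch_tokens
    bytes_moved = params * bytes_per_param
    return flops / bytes_moved

PARAMS = int(70e9)  # hypothetical 70B-parameter dense model

# Prefill: thousands of prompt tokens amortize every weight read.
prefill = arithmetic_intensity(batch_tokens=4096, params=PARAMS)
# Decode: one new token per sequence, so each weight read does little work.
decode = arithmetic_intensity(batch_tokens=1, params=PARAMS)

print(f"prefill intensity: {prefill:.0f} FLOPs/byte")  # 4096
print(f"decode  intensity: {decode:.0f} FLOPs/byte")   # 1
```

Under this model, intensity is simply tokens per pass: on an accelerator whose FLOPS-to-bandwidth ratio is in the hundreds of FLOPs per byte, prefill at thousands of FLOPs per byte saturates compute, while single-token decode at roughly 1 FLOP per byte is starved for memory bandwidth.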
Disaggregated inferencing was first described in the SplitWise paper by Microsoft in 2023.1
Disaggregation ratios
There are a couple of ways to realize this optimization:
- SplitWise demonstrated using H100 GPUs for prefill and A100 GPUs for decode, achieving higher throughput and lower cost than running both phases in series on H100s.1 This capability was later added to vLLM.
- DeepSeek-V3 demonstrated using 32 H800 GPUs for prefill and 320 H800 GPUs for decode.2 Here the same GPU type served both phases, but the two pools were provisioned differently to optimize throughput (tokens/second) and latency.
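The mechanics of the split can be sketched as a toy producer/consumer pipeline. This is illustrative only, not the SplitWise or vLLM implementation; in a real system the handoff moves attention key/value tensors between GPU pools over NVLink or InfiniBand, not Python objects between threads:

```python
import queue
import threading

# Toy prefill/decode disaggregation: the "KV cache" here is just a list of
# processed prompt tokens standing in for key/value tensors.

def prefill_worker(prompts: queue.Queue, handoff: queue.Queue) -> None:
    """Compute-bound phase: process the whole prompt in one pass."""
    while True:
        req = prompts.get()
        if req is None:          # shutdown sentinel; forward it downstream
            handoff.put(None)
            break
        req_id, prompt = req
        kv_cache = [f"kv({tok})" for tok in prompt.split()]
        handoff.put((req_id, kv_cache))

def decode_worker(handoff: queue.Queue, results: dict) -> None:
    """Bandwidth-bound phase: generate tokens from the received KV cache."""
    while True:
        item = handoff.get()
        if item is None:
            break
        req_id, kv_cache = item
        # Toy "generation": emit one token per cached entry.
        results[req_id] = [f"tok{i}" for i in range(len(kv_cache))]

prompts: queue.Queue = queue.Queue()
handoff: queue.Queue = queue.Queue()
results: dict = {}

t_prefill = threading.Thread(target=prefill_worker, args=(prompts, handoff))
t_decode = threading.Thread(target=decode_worker, args=(handoff, results))
t_prefill.start()
t_decode.start()

prompts.put(("r1", "the quick brown fox"))
prompts.put(None)
t_prefill.join()
t_decode.join()
print(results["r1"])  # ['tok0', 'tok1', 'tok2', 'tok3']
```

Because the two workers communicate only through the KV-cache handoff, each pool can be sized and scheduled independently, which is what makes the heterogeneous (H100/A100) and asymmetric (32/320) configurations above possible.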
Attention-FFN disaggregation
As of 2026, “attention-FFN disaggregation” (AFD) is the next frontier of optimizing inferencing through disaggregation. The idea is to split each decode step itself: the memory-bound attention layers (which own the KV cache) run on one pool of GPUs, the compute-bound FFN layers on another, with activations exchanged between the pools at every layer.
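A toy sketch of that per-layer ping-pong follows. Scalars stand in for activation tensors, and the pool functions and cost model are invented for illustration; in a real AFD deployment the two inner loops would be pipelined across microbatches so that both pools stay busy while activations are in flight:

```python
# Toy attention-FFN disaggregation (AFD): each layer's activations hop
# from the attention pool to the FFN pool and back. Illustrative sketch,
# not any specific system's design.

NUM_LAYERS = 3

def attention_step(h: float, layer: int) -> float:
    # Memory-bound stage: reads the KV cache; modeled as a small additive update.
    return h + 1.0

def ffn_step(h: float, layer: int) -> float:
    # Compute-bound stage: large matmuls; modeled as a scaling.
    return h * 2.0

def afd_forward(microbatches: list) -> list:
    """Run all microbatches through the alternating attention/FFN pools."""
    hidden = list(microbatches)
    for layer in range(NUM_LAYERS):
        for i in range(len(hidden)):        # network hop to attention pool
            hidden[i] = attention_step(hidden[i], layer)
        for i in range(len(hidden)):        # network hop to FFN pool
            hidden[i] = ffn_step(hidden[i], layer)
    return hidden

print(afd_forward([0.0, 1.0]))  # [14.0, 22.0]
```

The payoff is the same as in prefill/decode disaggregation: the attention pool can be provisioned for memory capacity and bandwidth, the FFN pool for FLOPS, and the two can scale independently.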