Attention-FFN disaggregation (AFD) is a form of disaggregated inference in which the forward pass of the attention blocks runs on separate GPUs from the feed-forward networks (FFNs). It exploits two facts:1
- Attention requires large memory capacity to store the KV cache, and re-reading that cache on every decode step makes it memory-bandwidth-bound
- Feed-forward networks (experts) are compute-bound because they don’t require storing/loading any KV cache-like state between decode steps.
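The memory-bound/compute-bound split above can be made concrete with a back-of-the-envelope arithmetic-intensity comparison. This is an illustrative sketch with assumed shapes and fp16 storage, not numbers from any specific model:

```python
# Rough arithmetic intensity (FLOPs per byte moved) for the two workloads.
# All shapes below are illustrative assumptions.

def attention_decode_intensity(batch, heads, head_dim, seq_len):
    # One decode step: each new query attends over seq_len cached keys/values.
    flops = 2 * 2 * batch * heads * head_dim * seq_len   # QK^T and PV matmuls
    # Traffic is dominated by reading the KV cache: K and V, fp16 (2 bytes each).
    bytes_moved = 2 * 2 * batch * heads * head_dim * seq_len
    return flops / bytes_moved

def ffn_gemm_intensity(batch, d_model, d_ff):
    # Batched GEMM: activations (batch x d_model) times weights (d_model x d_ff).
    flops = 2 * batch * d_model * d_ff
    # fp16: weights read once, activations read and written once.
    bytes_moved = 2 * (d_model * d_ff + batch * (d_model + d_ff))
    return flops / bytes_moved

# KV cache is unique per sequence, so batching does not raise attention's
# intensity: it stays around 1 FLOP/byte (memory-bound).
print(attention_decode_intensity(batch=32, heads=32, head_dim=128, seq_len=8192))

# FFN weights are shared across the whole batch, so intensity grows with
# batch size into the thousands of FLOPs/byte (compute-bound).
print(ffn_gemm_intensity(batch=4096, d_model=4096, d_ff=14336))
```

This asymmetry is what makes it attractive to place the two workloads on hardware with different memory/compute balances.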
AFD is only practical for mixture-of-experts models, where it piggybacks on the all-to-all communication that expert parallelism already requires. For dense transformers, shuttling activations between the attention GPUs and the FFN GPUs introduces communication overhead that is too costly, with no existing all-to-all to amortize it against.
AFD was first proposed by StepFun, one of China’s AI Tigers, when it released its Step-3 model.
Disaggregation ratios
AFD is the basis for NVIDIA acquiring Groq’s IP. FFNs can fit into Groq’s tiny SRAM, while attention (and its large memory footprint) can still be accommodated on Rubin GPUs.
A team led by the Hong Kong University of Science and Technology proposed a formal way to calculate the optimal ratio of accelerators to dedicate to attention vs. FFN.2
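A simplified way to think about such a ratio (a toy model of my own, not the HKUST team's actual formulation): if the two stages form a pipeline, GPU counts should be proportional to per-microbatch latency so neither side idles.

```python
from math import gcd

def balanced_afd_ratio(t_attn_ms, t_ffn_ms):
    """Toy balance model (an assumption, not the paper's method): given the
    per-microbatch latency of one attention GPU and one FFN GPU, return the
    smallest whole-number GPU ratio that equalizes the two stages' throughput."""
    # Scale to integer microseconds, then reduce by the GCD.
    a, f = round(t_attn_ms * 1000), round(t_ffn_ms * 1000)
    g = gcd(a, f)
    return a // g, f // g   # (attention GPUs : FFN GPUs)

# E.g. if attention takes 3 ms per microbatch and the FFN side takes 2 ms,
# a balanced deployment needs 3 attention GPUs for every 2 FFN GPUs.
print(balanced_afd_ratio(3.0, 2.0))  # (3, 2)
```

The real optimization also has to account for KV-cache capacity limits and interconnect bandwidth, which is why a formal treatment is needed.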