MOE models introduce the ability to use expert parallelism, a method by which each expert lives wholly on one GPU, with different experts distributed across GPUs. This introduces all-to-all communication as tokens are routed to the GPUs holding their experts.1

NVIDIA wrote a nice paper that explains the architectural implications of training MOE models:1

Memory capacity

Even though only a few experts are active for any given token, all of them must still live in GPU memory. This skews the ratio of compute to memory capacity even further toward memory capacity.

Communication bandwidth

All-to-alls are used to send tokens to experts and collect the results. Since each GPU only holds a fraction of the expert pool, the full hidden-state vector has to be transferred to each of a token's k active experts on other GPUs. As a result, the total cross-GPU traffic scales with the number of tokens, the hidden dimension, and the number of active experts per token. And as the total number of experts grows, so do the odds that a token's experts live off-node.
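A toy simulation makes the "odds of remote traffic" concrete. This is not from the paper: it assumes uniformly random top-k routing and that the local GPU holds the first `experts_per_gpu` experts, purely for illustration.

```python
import random

def remote_fraction(num_tokens, num_experts, top_k, experts_per_gpu, seed=0):
    """Fraction of routed (token, expert) copies that must leave the local GPU,
    under uniformly random top-k routing. The local GPU is assumed to hold
    experts 0 .. experts_per_gpu - 1 (an illustrative placement)."""
    rng = random.Random(seed)
    local = set(range(experts_per_gpu))
    remote = 0
    for _ in range(num_tokens):
        remote += sum(1 for e in rng.sample(range(num_experts), top_k)
                      if e not in local)
    return remote / (num_tokens * top_k)

# With 256 experts placed one per GPU, nearly every routed copy is remote;
# the expected remote fraction is 255/256 ≈ 0.996.
print(remote_fraction(4096, 256, 8, 1))
```

With fewer GPUs (more experts per GPU) the remote fraction drops, but it stays close to 1 whenever the expert pool is spread thinly.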

Take DeepSeek-R1 (really, DeepSeek-V3) as an example. In principle:

  • Hidden dimension of 7,168
  • 256 routed experts, 8 active at a time. There is a 257th shared expert,2 but it can be replicated to each GPU.
  • 4,096 tokens per sequence

The amount of data being sent out by each GPU per MOE layer is governed by:

  tokens per sequence × active experts per token × remote fraction
  = 4,096 × 8 × (255 / 256) = 32,640

(with one routed expert per GPU, a dispatched copy stays local only when it targets that GPU's own expert). So 32,640 tokens' hidden-state vectors are sent out by each GPU. Given that each token's hidden state has 7,168 elements, each of which is bf16, this means

  • Each token’s hidden state vector is 14,336 bytes
  • 32,640 tokens’ hidden states adds up to 468 MB of data being sent out by each GPU
  • …plus another 468 MB of data coming back to each GPU after the tokens have been processed by experts
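The per-layer arithmetic above can be checked in a few lines. The 255/256 remote fraction assumes one routed expert per GPU, which is what the 32,640 figure implies:

```python
tokens = 4096        # tokens per sequence
top_k = 8            # active routed experts per token
num_experts = 256    # routed experts, assumed one per GPU
hidden = 7168        # hidden dimension
bytes_per_elem = 2   # bf16

# Routed copies that leave the GPU: all but the 1-in-256 that stay local.
tokens_sent = tokens * top_k * (num_experts - 1) / num_experts
bytes_per_token = hidden * bytes_per_elem

mb_sent = tokens_sent * bytes_per_token / 1e6
print(tokens_sent)        # 32640.0 hidden states per GPU per MOE layer
print(bytes_per_token)    # 14336 bytes per hidden state
print(round(mb_sent))     # 468 MB out per GPU, and as much coming back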

For DeepSeek’s 58 MOE layers, this means

  • 54 GB of all-to-all traffic per forward pass
  • another 54 GB of traffic during backpropagation
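Scaling the per-layer figure to the whole model is one more multiplication:

```python
mb_per_layer_each_way = 468   # per-GPU all-to-all traffic, one direction
moe_layers = 58               # DeepSeek-V3's MOE layers

# Dispatch plus combine, across all MOE layers, per forward pass.
fwd_gb = mb_per_layer_each_way * 2 * moe_layers / 1000
print(round(fwd_gb))          # ≈ 54 GB; backpropagation adds the same again
```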

In reality, DeepSeek constrained expert routing so that each token's active experts were never spread over more than four nodes (32 GPUs). Each token's hidden state therefore crosses the InfiniBand fabric at most four times (once per destination node) rather than up to eight, capping per-GPU expert-parallelism traffic at roughly 4,096 × 4 = 16,384 hidden states (~235 MB) per MOE layer in each direction.
The IB transfers are the limiting factor; GPUs within each node can further communicate using higher-bandwidth NVLink.
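The effect of node-limited routing on IB traffic can be sketched as an upper bound. This is a simplification, not DeepSeek's exact accounting: it sends one copy per destination node and ignores the tokens whose destination set includes the local node.

```python
tokens = 4096
max_nodes = 4             # DeepSeek-V3 limits each token's experts to 4 nodes
hidden_bytes = 7168 * 2   # bf16 hidden state, 14,336 bytes

# Worst case over IB: one copy of each token's hidden state per destination
# node, instead of one copy per active expert (up to 8).
ib_tokens = tokens * max_nodes
ib_mb = ib_tokens * hidden_bytes / 1e6
print(ib_tokens, round(ib_mb))   # 16384 hidden states, ~235 MB per GPU per layer
```

Roughly half the unconstrained figure; the remaining fan-out to individual experts happens over NVLink inside each node.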

Compute inefficiency

Models with many small experts result in many small matrix multiplications, which are not executed efficiently on GPUs. NVIDIA found that GEMMs account for 70% of execution time in dense transformers but only 50% in MOE transformers.1 The 20-point difference is caused by “operations that scale with tensor count rather than FLOP count.”

In addition, optimized token routing adds 9% more time to layer execution,1 though it is unclear how much of this is dense computation versus other Amdahl-style overheads. For example, MOE models require more GPU kernel launches per layer than dense transformers. Furthermore, token routing is not deterministic, which can cause load imbalance across experts.
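The load-imbalance point can be illustrated with a small simulation. Even uniformly random routing (a stand-in for a learned router; real routers skew further without a balancing loss) leaves some experts with noticeably more work than others:

```python
import random
from collections import Counter

def expert_loads(num_tokens, num_experts, top_k, seed=0):
    """Count tokens assigned to each expert under uniformly random
    top-k routing (an illustrative stand-in for a learned router)."""
    rng = random.Random(seed)
    loads = Counter()
    for _ in range(num_tokens):
        loads.update(rng.sample(range(num_experts), top_k))
    return loads

loads = expert_loads(4096, 256, 8)
balanced = 4096 * 8 // 256   # 128 tokens per expert if perfectly balanced
print(balanced, max(loads.values()), min(loads.values()))
```

Since the slowest (most loaded) expert gates each layer's all-to-all combine, the spread around the balanced load translates directly into idle time on the other GPUs.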

Footnotes

  1. [2603.07685v1] Scalable Training of Mixture-of-Experts Models with Megatron Core

  2. DeepSeek-Inference-Theoretical-Model_Deriving-the-performance-from-hardware-primitives_02092025.pdf