The intermediate size of a transformer is the width of the inner layer of the feed-forward network (FFN) in each transformer layer: the FFN projects from the hidden dimension up to the intermediate size and back down.

In mixture of experts models, there may be multiple intermediate sizes; for example, Nemotron 3 uses

  • 5376 for the shared expert
  • 2688 for each routed expert

The intermediate size directly affects how many parameters a model has. A standard FFN contributes roughly 2 × hidden_size × intermediate_size weights (the 2 comes from the up and down projections; gated variants such as SwiGLU add a third matrix). In a mixture of experts model, each expert contributes its own such block of weights, so the per-expert intermediate sizes largely determine the total parameter count.
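The parameter counting above can be sketched as a small calculation. The intermediate sizes below are the Nemotron 3 values quoted earlier; the hidden size and the number of routed experts are hypothetical placeholders, not Nemotron 3's actual configuration.

```python
def ffn_params(hidden_size: int, intermediate_size: int, gated: bool = False) -> int:
    """Weights in one FFN: up and down projections, hence the factor of 2."""
    n = 2 * hidden_size * intermediate_size
    if gated:
        # Gated variants (e.g. SwiGLU) add a third hidden x intermediate matrix.
        n += hidden_size * intermediate_size
    return n

def moe_ffn_params(hidden_size: int, shared_intermediate: int,
                   routed_intermediate: int, num_routed: int) -> int:
    """Weights in one MoE layer's FFNs: one shared expert plus the routed experts."""
    return (ffn_params(hidden_size, shared_intermediate)
            + num_routed * ffn_params(hidden_size, routed_intermediate))

# hidden_size=4096 and num_routed=8 are illustrative assumptions.
total = moe_ffn_params(hidden_size=4096, shared_intermediate=5376,
                       routed_intermediate=2688, num_routed=8)
print(total)  # → 220200960
```

Note that the shared expert here (intermediate size 5376) holds exactly as many weights as two routed experts (2688 each), since the parameter count is linear in the intermediate size.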

See also hidden dimension.