The intermediate size of a transformer describes the size of the feed-forward network (FFN) part of each transformer layer.
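As a minimal sketch of what the intermediate size controls, the standard two-matrix FFN expands each token's activations from the hidden size up to the intermediate size, applies a nonlinearity, and projects back down. The sizes below are hypothetical, chosen only for illustration (gated variants such as SwiGLU add a third projection, which is not shown here):

```python
import numpy as np

hidden_size = 1024        # hypothetical model width
intermediate_size = 4096  # hypothetical FFN intermediate size

# Up projection: hidden -> intermediate; down projection: intermediate -> hidden.
w_up = np.zeros((hidden_size, intermediate_size))
w_down = np.zeros((intermediate_size, hidden_size))

x = np.zeros((1, hidden_size))     # one token's activations
h = np.maximum(x @ w_up, 0.0)      # expand to the intermediate size, apply ReLU
y = h @ w_down                     # project back to the hidden size
```

The intermediate activation `h` has width `intermediate_size`, while the input `x` and output `y` both have width `hidden_size`.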
In mixture-of-experts models, there may be multiple intermediate sizes; for example, Nemotron 3 uses:
- 5376 for the shared expert
- 2688 for each routed expert
The intermediate size directly affects how many parameters a model has. For example, each expert in a mixture-of-experts model contributes roughly 2 × hidden size × intermediate size weights (the 2 comes from the up and down projections).
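The arithmetic above can be sketched for one layer's FFN. The intermediate sizes are the Nemotron 3 figures quoted earlier; the hidden size and the number of routed experts are assumed values for illustration only:

```python
hidden_size = 4096        # assumed, for illustration only
shared_intermediate = 5376
routed_intermediate = 2688
num_routed_experts = 8    # assumed expert count

# Each expert has an up and a down projection, hence the factor of 2.
shared_params = 2 * hidden_size * shared_intermediate
routed_params = num_routed_experts * 2 * hidden_size * routed_intermediate
ffn_params_per_layer = shared_params + routed_params
print(f"{ffn_params_per_layer:,}")
```

Under these assumptions the shared expert contributes about 44M weights per layer and the routed experts about 176M, though only a subset of routed experts is active for any given token.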
See also hidden dimension.