The intermediate size of a transformer is the width of the inner layer of the feed-forward network (FFN) in each transformer layer: the FFN projects from the hidden dimension up to the intermediate size and back down.

In mixture of experts models, there may be multiple intermediate sizes; for example, Nemotron 3 uses

  • 5376 for the shared expert
  • 2688 for each routed expert

The intermediate size directly affects how many parameters a model has. A standard FFN contributes roughly 2 × hidden_size × intermediate_size weights (the 2 comes from the up and down projections; gated variants such as SwiGLU add a third matrix). In a mixture of experts model, each expert contributes its own such block of weights, so the per-expert intermediate sizes largely determine the total parameter count.
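The parameter counting above can be sketched as a small calculation. The intermediate sizes below are the Nemotron 3 values quoted earlier; the hidden size and the number of routed experts are hypothetical placeholders, not Nemotron 3's actual configuration.

```python
def ffn_params(hidden_size: int, intermediate_size: int, gated: bool = False) -> int:
    """Weights in one FFN: up and down projections, hence the factor of 2."""
    n = 2 * hidden_size * intermediate_size
    if gated:
        # Gated variants (e.g. SwiGLU) add a third hidden x intermediate matrix.
        n += hidden_size * intermediate_size
    return n

def moe_ffn_params(hidden_size: int, shared_intermediate: int,
                   routed_intermediate: int, num_routed: int) -> int:
    """Weights in one MoE layer's FFNs: one shared expert plus the routed experts."""
    return (ffn_params(hidden_size, shared_intermediate)
            + num_routed * ffn_params(hidden_size, routed_intermediate))

# hidden_size=4096 and num_routed=8 are illustrative assumptions.
total = moe_ffn_params(hidden_size=4096, shared_intermediate=5376,
                       routed_intermediate=2688, num_routed=8)
print(total)  # → 220200960
```

Note that the shared expert here (intermediate size 5376) holds exactly as many weights as two routed experts (2688 each), since the parameter count is linear in the intermediate size.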

See also hidden dimension.