The hidden dimension, model dimension, or residual stream dimension of a transformer is a hyperparameter that describes how many features every token has. It is commonly referred to as d_model or just d.
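As a concrete illustration (with made-up sizes, not taken from any particular model), the hidden dimension sets the width of every token vector, so it shows up directly in, for example, the size of the embedding table:

```python
# Hypothetical sizes, for illustration only.
vocab_size = 50_000
d_model = 4_096  # the hidden / model / residual stream dimension

# Each token is represented as a vector of d_model features, so the
# embedding table alone holds vocab_size * d_model weights.
embedding_params = vocab_size * d_model
print(embedding_params)  # 204800000
```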

A few examples: the largest GPT-3 model uses d_model = 12288,[1] and DeepSeek-V3 uses d_model = 7168.[2]

This hidden dimension directly contributes to the number of parameters in a model. A couple of examples:

  • Dense transformer: Each layer has…
    • Attention: 4 · d_model² weights (4 comes from the query, key, value, and output projections)
    • FFN (aka MLP): 2 · d_model · d_ff weights, where d_ff is the FFN's intermediate dimension (2 comes from the up and down projections)
  • Sparse transformer (MoE): Each layer has…
    • Attention: 4 · d_model² weights, same as the dense case (4 comes from the query, key, value, and output projections)
    • FFN (aka MLP): 2 · d_model · d_ff weights per expert, for n_experts · 2 · d_model · d_ff in total (2 comes from the up and down projections)
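The per-layer counts above can be sketched in a few lines of Python. This is a rough sketch that ignores biases, norms, embeddings, and attention variants like GQA; `d_ff` and `n_experts` stand for the FFN intermediate dimension and the number of experts:

```python
def dense_layer_params(d_model: int, d_ff: int) -> int:
    # Attention: 4 * d_model^2      (query, key, value, output projections)
    # FFN:       2 * d_model * d_ff (up and down projections)
    return 4 * d_model**2 + 2 * d_model * d_ff

def moe_layer_params(d_model: int, d_ff: int, n_experts: int) -> int:
    # Attention weights are shared; FFN weights repeat once per expert.
    return 4 * d_model**2 + n_experts * 2 * d_model * d_ff

# GPT-3-scale sizes: d_model = 12288 with the common choice d_ff = 4 * d_model.
per_layer = dense_layer_params(12_288, 4 * 12_288)
print(per_layer)       # 1811939328
print(96 * per_layer)  # ~174B across GPT-3's 96 layers, close to its 175B total
```

Multiplying the dense per-layer count by GPT-3's 96 layers recovers most of its 175B parameters, which is a useful sanity check that attention and FFN weights dominate the total.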

If a model uses SwiGLU, each FFN or expert has an additional d_model · d_ff weight matrix (three projections in total: up, down, and gate), giving 3 · d_model · d_ff FFN weights.
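A sketch of how SwiGLU changes the FFN count, with the same caveats as above and hypothetical sizes:

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    return 2 * d_model * d_ff  # up and down projections

def swiglu_ffn_params(d_model: int, d_ff: int) -> int:
    return 3 * d_model * d_ff  # up, gate, and down projections

# With the same d_ff, SwiGLU costs 1.5x the FFN weights; in practice
# models often shrink d_ff (e.g. to ~2/3 of its non-gated value) so the
# parameter count stays roughly comparable.
print(swiglu_ffn_params(4_096, 11_008))  # 135266304
```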

Footnotes

  1. Brown et al. Language Models are Few-Shot Learners.

  2. Liu et al. DeepSeek-V3 Technical Report.