The hidden dimension or model dimension or residual stream dimension of a transformer is a hyperparameter that describes how many features every token has. The number of weights a transformer has is a function of this hyperparameter.

A few examples:

Footnotes

  1. Brown et al. Language models are few-shot learners

  2. Liu et al. DeepSeek-V3 Technical Report.