The hidden dimension (also called the model dimension or residual stream dimension) of a transformer is a hyperparameter that describes how many features every token's representation has. It is commonly referred to as $d_{\text{model}}$ or simply the hidden size.
A few examples:
- GPT-3 175B has a hidden dimension of 12,288.[^1]
- Llama 3.1 405B has a hidden dimension of 16,384.
- DeepSeek-V3 (which is the basis for DeepSeek-R1) has a hidden dimension of 7,168.[^2]
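To make "features per token" concrete, here is a minimal sketch (using a made-up tiny hidden dimension rather than a real model's) of the residual stream as a matrix with one $d_{\text{model}}$-sized vector per token:

```python
import numpy as np

d_model = 8   # hypothetical tiny hidden dimension (GPT-3 175B uses 12,288)
seq_len = 4   # number of tokens in the sequence

# The residual stream: every token carries a vector of d_model features,
# so a sequence of tokens is a (seq_len, d_model) matrix.
hidden_states = np.zeros((seq_len, d_model))
print(hidden_states.shape)  # (4, 8)
```

Every layer of the transformer reads from and writes back to a matrix of this shape.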
This hidden dimension directly contributes to the number of parameters in a model. A couple of examples:
- Dense transformer: Each layer has…
- Attention: $4 d_{\text{model}}^2$ weights (the 4 comes from the query, key, value, and output projections)
- FFN (aka MLP): $2 d_{\text{model}} d_{\text{ff}}$ weights (the 2 comes from the up and down projections)
- Sparse transformer (MOE): Each layer has…
- Attention: $4 d_{\text{model}}^2$ weights (the 4 comes from the query, key, value, and output projections)
- FFN (aka MLP): $2 d_{\text{model}} d_{\text{ff}}$ weights per expert (the 2 comes from the up and down projections)
If a model uses SwiGLU, each expert or FFN has an additional $d_{\text{model}} \times d_{\text{ff}}$ weight matrix (three total: up, down, and gate).
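The per-layer breakdown above can be sketched as a small calculator. This is a simplified count that ignores embeddings, layer norms, biases, and MoE details like shared experts or smaller per-expert $d_{\text{ff}}$; the function name and signature are illustrative, not from any library:

```python
def params_per_layer(d_model, d_ff, n_experts=1, swiglu=False):
    """Approximate weight count for one transformer layer.

    Assumes the attention projections are square (d_model x d_model),
    which holds when head_dim * n_heads == d_model. Ignores embeddings,
    layer norms, and biases.
    """
    attn = 4 * d_model * d_model       # query, key, value, output projections
    ffn_mats = 3 if swiglu else 2      # up and down (+ gate for SwiGLU)
    ffn = n_experts * ffn_mats * d_model * d_ff
    return attn + ffn

# Dense layer with the common d_ff = 4 * d_model choice, at GPT-3's size:
print(params_per_layer(d_model=12_288, d_ff=4 * 12_288))  # 1811939328
```

With GPT-3's 96 layers, this works out to roughly 174B weights, which lines up with its 175B total once embeddings are included.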
Footnotes

[^1]: Brown et al., "Language Models Are Few-Shot Learners."

[^2]: Liu et al., "DeepSeek-V3 Technical Report."