The hidden dimension (also called the model dimension or residual stream dimension) of a transformer is a hyperparameter that describes how many features every token's representation has. It is commonly referred to as $d_{\text{model}}$ or simply the hidden size.
A few examples:
- GPT-3 175B has a hidden dimension of 12,288.[^1]
- Llama 3.1 405B has a hidden dimension of 16,384.
- DeepSeek-V3 (which is the basis for DeepSeek-R1) has a hidden dimension of 7,168.[^2]
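To make "features per token" concrete, here is a minimal sketch (using a made-up tiny hidden dimension rather than a real model's) of the residual stream as a matrix with one $d_{\text{model}}$-sized vector per token:

```python
import numpy as np

d_model = 8   # hypothetical tiny hidden dimension (GPT-3 175B uses 12,288)
seq_len = 4   # number of tokens in the sequence

# The residual stream: every token carries a vector of d_model features,
# so a sequence of tokens is a (seq_len, d_model) matrix.
hidden_states = np.zeros((seq_len, d_model))
print(hidden_states.shape)  # (4, 8)
```

Every layer of the transformer reads from and writes back to a matrix of this shape.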
This hidden dimension directly contributes to the number of parameters in a model. A couple of examples:
- Dense transformer: Each layer has…
- Attention: $4 d_{\text{model}}^2$ weights (the 4 comes from the query, key, value, and output projections)
- FFN (aka MLP): $2 d_{\text{model}} d_{\text{ff}}$ weights (the 2 comes from the up and down projections)
- Sparse transformer (MOE): Each layer has…
- Attention: $4 d_{\text{model}}^2$ weights (the 4 comes from the query, key, value, and output projections)
- FFN (aka MLP): $2 d_{\text{model}} d_{\text{ff}}$ weights per expert (the 2 comes from the up and down projections)
If a model uses SwiGLU, each expert or FFN has an additional $d_{\text{model}} \times d_{\text{ff}}$ weight matrix (three total: up, down, and gate).
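The per-layer breakdown above can be sketched as a small calculator. This is a simplified count that ignores embeddings, layer norms, biases, and MoE details like shared experts or smaller per-expert $d_{\text{ff}}$; the function name and signature are illustrative, not from any library:

```python
def params_per_layer(d_model, d_ff, n_experts=1, swiglu=False):
    """Approximate weight count for one transformer layer.

    Assumes the attention projections are square (d_model x d_model),
    which holds when head_dim * n_heads == d_model. Ignores embeddings,
    layer norms, and biases.
    """
    attn = 4 * d_model * d_model       # query, key, value, output projections
    ffn_mats = 3 if swiglu else 2      # up and down (+ gate for SwiGLU)
    ffn = n_experts * ffn_mats * d_model * d_ff
    return attn + ffn

# Dense layer with the common d_ff = 4 * d_model choice, at GPT-3's size:
print(params_per_layer(d_model=12_288, d_ff=4 * 12_288))  # 1811939328
```

With GPT-3's 96 layers, this works out to roughly 174B weights, which lines up with its 175B total once embeddings are included.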
Footnotes

[^1]: Brown et al., "Language Models Are Few-Shot Learners."

[^2]: Liu et al., "DeepSeek-V3 Technical Report."