Transformers are a family of neural networks that process text by analyzing relationships between words all at once rather than sequentially. This is done using self-attention (usually just called attention) to determine which words in a sentence are most relevant to each other, which allows them to understand context. Transformers are the basis for LLMs like OpenAI’s GPT family, and they allow chatbots to generate context-aware conversations by predicting the next word based on everything seen so far instead of processing words one by one.

Transformers are computationally expensive because self-attention compares every input token to every other input token. This results in the computational requirements scaling as $O(s^2)$ with the sequence length $s$. This gives them rich contextual understanding at high training costs. See LLM training for more information.

Fundamentally, I think of a transformer as a black box with

  • Input: a sequence of words¹ and a vocabulary of known words
  • Output: a table of every word in the vocabulary and the probability that it’s the next word in the input sequence

So transformers really are very fancy autocompletes.

Chatbots use autoregressive decoding to iteratively guess the next word in a sequence, then guess the word after that, and so on.

Dense transformer

Transformers are composed of transformer layers. Each transformer layer has two big parts to it:

  1. Self-attention
  2. Feed-forward network

If you are inferencing a model with a sequence length of $s$ tokens and a hidden dimension $d$, each layer then has an activation tensor (also called the residual stream) with the dimensions

$$s \times d$$

This tensor has one row per token ($s$ rows), and each token is represented by a vector of size $d$.

These transformer layers are strung together to produce a dense transformer.

Data flow within a transformer

When inferencing (or doing a forward pass during training), the flow of information through a transformer looks like this.

Before data goes into the transformer layers,

  1. Tokenization: Input text is tokenized. These tokens are converted into token IDs (typically int32) and stored in a vector of shape $(s)$. The values of these token IDs range from 0 to $v - 1$, where $v$ is the vocabulary size of the model.
  2. Embedding: Token IDs are converted into embeddings (vectors), resulting in a tensor of shape $s \times d$. This is the initial activation tensor. This utilizes the embedding lookup table or embedding matrix, a tensor with dimensions $v \times d$; the $d$ dimension of the activation tensor comes from mapping the token ID (a scalar) to an embedding vector of length $d$.
  3. Positional encoding occurs, where tokens are augmented to include positional information (e.g., the order of words in a sentence). The activation tensor remains $s \times d$.
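
Here is a minimal NumPy sketch of these three pre-processing steps. The sizes, token IDs, and random matrices are placeholders for illustration, and a learned positional-embedding table is assumed (sinusoidal encodings are another common choice):

```python
import numpy as np

# Toy sizes for illustration (not a real model's dimensions).
s, d, v = 8, 16, 1000   # sequence length, hidden size, vocabulary size

# 1. Tokenization: assume the tokenizer already produced these token IDs.
token_ids = np.array([17, 402, 9, 55, 0, 0, 0, 0], dtype=np.int32)  # shape (s,)

# 2. Embedding: the embedding matrix has shape (v, d); indexing it with the
#    token IDs maps each scalar ID to a length-d vector, giving shape (s, d).
embedding_matrix = np.random.randn(v, d)
activations = embedding_matrix[token_ids]           # shape (s, d)

# 3. Positional encoding: add a per-position vector so word order is encoded.
positional_embeddings = np.random.randn(s, d)
activations = activations + positional_embeddings   # still shape (s, d)
```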

This activation tensor of shape $s \times d$ is then passed through the transformer. That process looks like this, and the inputs and outputs of each step are always slightly tweaked versions of the activation tensor $X$.

  1. Self-attention is applied to the whole activation tensor. This mixes information across tokens.
  2. Residual add is applied. This just adds the activation tensor that fed into step 1 of this transformer layer back into the result of self-attention.
  3. LayerNorm is applied to each token within the activation tensor. This rescales all the components of the token’s vector so the mean is zero and the variance is one, then scales those values according to some learned scale and bias.
  4. The activation tensor goes through the feed-forward network. Each token’s vector is passed through the FFN independently, just as LayerNorm operates per token.
  5. Residual add happens again, but this time adding back the activation tensor that went into step 4.
  6. LayerNorm is applied to each token within the activation tensor again.

The output of this transformer layer is still a tensor of shape $s \times d$, and it is passed on to the next transformer layer (starting at self-attention again).

If this is the last layer of the transformer, the activation tensor is transformed into a logits tensor whose values encode the probability of every token being the next token. The logits tensor has shape $s \times v$.
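
Putting the six steps together, here is a sketch of one post-LayerNorm transformer layer. The `self_attention` and `feed_forward` functions are the ones sketched in the attention and feed-forward sections below, the `params` keys are made-up names, and many modern models instead apply LayerNorm before attention and the FFN (pre-LN):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance,
    # then apply the learned scale (gamma) and bias (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def transformer_layer(x, params):
    # x is the activation tensor of shape (s, d).
    attn_out = self_attention(x, params)                   # 1. mix info across tokens
    x = x + attn_out                                       # 2. residual add
    x = layer_norm(x, params["gamma1"], params["beta1"])   # 3. LayerNorm per token
    ffn_out = feed_forward(x, params)                      # 4. per-token FFN
    x = x + ffn_out                                        # 5. residual add
    x = layer_norm(x, params["gamma2"], params["beta2"])   # 6. LayerNorm per token
    return x                                               # still shape (s, d)
```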

Example of data flow

Let’s say you run a single word, prestidigitation, through a 30B parameter LLM. The LLM has the following:

  • Sequence length $s$: 8,192 tokens
  • Hidden size $d$: 6,144
  • Number of layers: 60
  • Feed-forward network hidden size $d_{ff}$: 28,672

Using the GPT-4o tokenizer, we can see…

  1. Tokenization: The word breaks down into four tokens, giving a vector of token IDs (type int32) that is padded out to length $s$, so only four entries hold real values.
  2. Embedding: Each ID indexes into the embedding weight matrix of shape $v \times d$, resulting in $s$ rows (four of which are non-null) of length $d$.
  3. Positional encoding: The four non-null rows in the embedding tensor change a little.

Our activation tensor at this point has dimensions $8192 \times 6144$ ($s \times d$), but only four rows actually do anything. This activation tensor is then passed through the transformer:

  1. Self-attention is applied, but the 8,188 null tokens are masked off and do not contribute anything to the activation tensor that pops out of attention.
  2. Add+norm are applied to each token’s activation (or hidden state). This happens even for the unused part of the sequence.
  3. Each token’s activations go through the feed-forward network. Again, even the unused tokens go through.
  4. Add+norm happens again, even for the unused tokens.

Self-attention is the only step where each token’s activations are affected by the other tokens’, so this is the only step where care must be taken to ensure that unused tokens in the sequence are masked off. This prevents those null tokens in the sequence from interacting with the four tokens we actually care about.

After our input sequence has completed its forward pass through the transformer, the logits tensor (with shape $s \times v$) pops out and describes the probability of every token in our vocabulary (of size $v$) being the next token.
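
As a concrete illustration of the masking, here is a small NumPy sketch of a padding mask applied to the attention scores before softmax. The sizes are toy values, not the example model’s; causal masking (hiding future tokens) works the same way with a different mask:

```python
import numpy as np

# Toy attention-score matrix for a padded sequence: 6 positions, 4 real tokens.
s, real_tokens = 6, 4
scores = np.random.randn(s, s)           # shape (s, s), one score per token pair

# Set every score in a padding column to -inf so softmax gives it zero weight.
is_real = np.arange(s) < real_tokens     # [True, True, True, True, False, False]
scores = np.where(is_real[None, :], scores, -np.inf)

# Row-wise softmax: the padding columns now get exactly zero attention weight,
# so null tokens cannot influence the real tokens.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```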

Attention

Info

This information is explored in more depth in attention.

Attention is the only part of the transformer where different tokens in the input sequence can interact with each other. The activations tensor is transformed into three intermediate tensors:

  • Query tensor $Q$, which captures what information each token needs from the other tokens
  • Key tensor $K$, which captures what information each token has to offer the other tokens
  • Value tensor $V$, which captures the features a token shares with the other tokens when their queries match its key

All three tensors are simply the activations multiplied by learned weights (which are model parameters). If the activation tensor is $X$, then

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

where these weight tensors ($W_Q$, $W_K$, $W_V$) have dimensions $d \times d$. This means $Q$, $K$, and $V$ have the same dimensions as the activation tensor ($s \times d$).

Roughly, attention works like this:

  1. It first calculates a score between each token and every other token by comparing their projected representations ($Q$ vs. $K$). The more similar they are, the higher the score. After this, we have a tensor of shape $s \times s$ describing how much every token attends to every other token. These scores are then normalized with softmax.
  2. For each token $i$, the normalized score for every token $j$ is multiplied by that token’s value vector $v_j$. These weighted value vectors are added up, producing a new context vector for token $i$. Doing this for all tokens results in an output tensor of shape $s \times d$.
  3. This output tensor is finally multiplied by one last matrix of learned weights, $W_O$ (shape $d \times d$), to produce the new activations that feed into the next part of the transformer layer.
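
The three steps above translate into a short NumPy sketch of single-head self-attention. It is a simplification of what real models do (they use many heads and apply masking), it adds the standard scaling by $\sqrt{d}$ that the list above omits, and the weight names in `params` are assumptions:

```python
import numpy as np

def self_attention(x, params):
    # x: activation tensor of shape (s, d).
    d = x.shape[-1]
    q = x @ params["W_q"]   # (s, d): what each token wants from others
    k = x @ params["W_k"]   # (s, d): what each token has to offer
    v = x @ params["W_v"]   # (s, d): what each token shares when attended to

    # 1. Score every token against every other token, then softmax each row.
    scores = q @ k.T / np.sqrt(d)                      # shape (s, s)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # 2. Each token's context vector is a weighted sum of all value vectors.
    context = weights @ v                              # shape (s, d)

    # 3. A final learned projection W_o (d x d) produces the new activations.
    return context @ params["W_o"]                     # shape (s, d)
```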

Feed-forward network

After attention, each token has a contextualized hidden state represented by its activations. The feed-forward network (FFN) operates on each token’s activations independently; whereas attention lets tokens exchange information, the FFN transforms each token’s own representation in more complex ways.

This FFN is just a multilayer perceptron and does the following:

  1. The input is the activations tensor of shape $s \times d$
  2. Each token’s activations are projected into a larger space of dimension $d_{ff}$ using a tensor of learned weights and a bias ($W_1$ and $b_1$)
  3. A nonlinear activation function (like ReLU or GELU) is applied element by element
  4. The resulting vector is then projected back down to the original model dimension $d$ using another set of learned weights and bias ($W_2$ and $b_2$)
  5. The resulting activations tensor (with shape $s \times d$) then moves on through the rest of the transformer.
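
A minimal sketch of that FFN in NumPy, using ReLU as the nonlinearity and assumed weight names ($W_1$ has shape $d \times d_{ff}$, $W_2$ has shape $d_{ff} \times d$):

```python
import numpy as np

def feed_forward(x, params):
    # x: activation tensor of shape (s, d). Each row (token) is transformed
    # independently; no information moves between tokens here.
    h = x @ params["W_1"] + params["b_1"]      # project up: (s, d) -> (s, d_ff)
    h = np.maximum(h, 0.0)                     # nonlinearity (ReLU here)
    return h @ params["W_2"] + params["b_2"]   # project down: (s, d_ff) -> (s, d)
```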

Intuitive explanation

The ideas of attention and feed-forward networks can seem very abstract, so I find it helpful to also develop an intuitive understanding of how information flows through a transformer.

Let’s assume the input sequence is I drank a cup of and tokens are words. This makes the input sequence [I, drank, a, cup, of].

Step 1. Embedding

Our input sequence is converted into an initial activation tensor, where each of our five tokens has an embedding vector that encodes its lexical meaning. For example,

  • I is a first-person pronoun that’s close to me and myself in the embedding space.
  • drank is a past-tense verb meaning something was consumed. It’s probably close to words like ate and sipped.
  • a is an indefinite article, but it doesn’t have much meaning.
  • cup is a container, probably close to mug and glass.
  • of is a preposition, probably close to with and from.

Step 2. Transformer layers

Attention is where each token looks at every other token and figures out how important each token is to the others. For example,

  • drank has a strong relationship with I
  • cup has a strong relationship with drank
  • of has a strong relationship with cup

The embeddings for each token are tweaked to encode this, and these vectors are now called the hidden state or activation of each token.

The feed-forward network then reprocesses each token’s newly updated hidden state to shift it towards more relevant parts of the semantic subspace. For example,

  • of’s FFN would push its hidden state towards a subspace that captures container-contents relationships.
  • cup’s FFN would push its hidden state towards a subspace of being drunk from.

After the FFN, the token’s hidden state now encodes its lexical meaning and its role in the context of the input sequence.

As each token’s activations (hidden state) pass through more transformer layers, they accumulate more contextual awareness.

Step 3. Output projection

By the time the activations all pop out of the end of the transformer, they can be projected into a logits tensor that reflects the likelihood of the next token based on the full context of all the tokens that preceded it. The vocabulary might contain words like coffee, tea, water, and chair, and the logits tensor might reflect:

  • coffee - high probability
  • tea - high probability
  • water - moderate probability
  • chair - near zero probability

From this, we roll the dice and pick a next token.
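
That dice roll is just sampling from the softmax of the logits for the position we are predicting. A small sketch, with a temperature knob included as a common (assumed) extra:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0):
    # logits: scores for the next token, shape (v,), one entry per vocab token.
    scaled = logits / temperature                # temperature < 1 sharpens, > 1 flattens
    probs = np.exp(scaled - scaled.max())        # softmax turns scores into...
    probs /= probs.sum()                         # ...a probability distribution
    return int(np.random.choice(len(probs), p=probs))  # roll the dice
```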

Chatbots

Chatbots operate using autoregressive decoding. They take an input sequence and

  1. Run it through the transformer
  2. Pick a next token from the logits tensor that pops out
  3. Append that token to the input sequence
  4. Send that new sequence back through the transformer (step 1)
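
As a sketch of that loop (the `transformer` callable and `eos_id` stopping condition are assumptions; `sample_next_token` is the sampling sketch above):

```python
def generate(token_ids, transformer, max_new_tokens, eos_id=None):
    # token_ids: list of prompt token IDs. `transformer` is assumed to return
    # a logits tensor of shape (s, v) for the sequence it is given.
    for _ in range(max_new_tokens):
        logits = transformer(token_ids)            # 1. run through the transformer
        next_id = sample_next_token(logits[-1])    # 2. pick a next token
        token_ids = token_ids + [next_id]          # 3. append it to the sequence
        if eos_id is not None and next_id == eos_id:
            break                                  # stop at end-of-sequence
    return token_ids                               # 4. otherwise, go again
```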

Seminal papers

I stole this reading list from No Hype DeepSeek-R1 Reading List.

Footnotes

  1. Pedantically, it’s really a sequence of tokens.