Transformers are a family of neural networks that process text by analyzing relationships between words all at once rather than sequentially. This is done using self-attention (usually just called attention) to determine which words in a sentence are most relevant to each other, which allows them to understand context. Transformers are the basis for LLMs like OpenAI’s GPT family, and they allow chatbots to generate context-aware conversations by predicting the next word based on everything seen so far instead of processing words one-by-one.
Transformers are computationally expensive because self-attention compares every input token to every other input token. This results in the computational requirements scaling as $O(n^2)$ with the sequence length $n$: doubling the input length roughly quadruples the attention work. This gives them rich contextual understanding at high training costs. See LLM training for more information.
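As a toy back-of-the-envelope illustration (counting only the entries in the attention score matrix and ignoring heads, the hidden dimension, and everything else):

```python
# Self-attention scores every (token, token) pair, so the score matrix has
# n * n entries for a sequence of n tokens.
for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {n * n:>12,} score entries per layer")
# Doubling the sequence length quadruples the attention work.
```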
Fundamentally, I think of a transformer as a black box with
- Input: a sequence of words¹ and a vocabulary of known words
- Output: a table of every word in the vocabulary and the probability that it’s the next word in the input sequence
So transformers really are very fancy autocompletes.
Chatbots use autoregressive decoding to iteratively guess the next word in a sequence, then guess the word after that, and so on.
Dense transformer
Transformers are composed of transformer layers. Each transformer layer has two big parts to it:
- Self-attention
- Feed-forward network
If you are inferencing a model with a sequence length of $s$ tokens and a hidden dimension $h$, each layer then has an activation tensor (also called the residual stream) with the dimensions

$$[s, h]$$

This tensor has one row per token ($s$ rows), and each token is represented by a vector of size $h$.
These transformer layers are strung together to produce a dense transformer.
Data flow within a transformer
When inferencing (or doing a forward pass during training), the flow of information through a transformer looks like this.
Before data goes into the transformer layers,
- Tokenization: Input text is tokenized. These tokens are converted into token IDs (typically int32) and stored in a vector of shape $[s]$. The values of these token IDs range from 0 to $v - 1$, where $v$ is the vocabulary size of the model.
- Embedding: Token IDs are converted into embeddings (vectors), resulting in a tensor of shape $[s, h]$. This is the initial activation tensor. This utilizes the embedding lookup table or embedding matrix, a tensor with dimensions $[v, h]$; the $h$ dimension of the activation tensor comes from mapping each token ID (a scalar) to an embedding vector of length $h$.
- Positional encoding occurs, where tokens are augmented to include positional information (e.g., the order of words in a sentence). The activation tensor remains $[s, h]$.
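Here is a minimal NumPy sketch of those three steps, using made-up toy sizes, a random embedding matrix, and a simplified sinusoidal positional encoding (not any particular model's scheme):

```python
import numpy as np

s, h, v = 8, 16, 100                        # toy sequence length, hidden size, vocab size

# Tokenization (stubbed): a vector of s token IDs in [0, v)
token_ids = np.random.randint(0, v, size=s, dtype=np.int32)

# Embedding: index into the [v, h] lookup table to get the [s, h] activation tensor
E = np.random.randn(v, h)
X = E[token_ids]

# Positional encoding: add position-dependent values (simplified; real schemes vary)
positions = np.arange(s)[:, None]
X = X + np.sin(positions / 10_000 ** (np.arange(h) / h))

assert X.shape == (s, h)                    # the initial activation tensor
```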
This activation tensor of shape $[s, h]$ is then passed through the transformer. That process looks like this, and the inputs and outputs of each step are always slightly tweaked versions of the activation tensor $X$.
- Self-attention is applied to the whole activation tensor. This mixes information across tokens.
- Residual add is applied. This just adds the activation tensor that fed into step 1 of this transformer layer back into the result of self-attention.
- LayerNorm is applied to each token within the activation tensor. This rescales all the components of the token’s vector so the mean is zero and the variance is one, then scales those values according to some learned scale and bias.
- The activation tensor goes through the feed-forward network. Each token’s vector is passed through the FFN independently, just as LayerNorm is applied per token.
- Residual add happens again, but this time with the activation tensor that went into step 5.
- LayerNorm is applied to each token within the activation tensor again.
The output of this transformer layer is still a tensor of shape $[s, h]$, and it is passed on to the next transformer layer (starting at self-attention again).
If this is the last layer of the transformer, the activation tensor is instead transformed into a logits tensor whose values encode (after a softmax) the probability of every token in the vocabulary being the next token. The logits tensor has shape $[s, v]$.
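Continuing that sketch, here is how the six per-layer steps and the final logits projection line up. This is a hand-rolled NumPy illustration, not any real model's code; attention and the FFN are stubbed out here and sketched in their own sections below.

```python
import numpy as np

s, h, v = 8, 16, 100

def layer_norm(X):
    # Per-token: zero mean, unit variance (learned scale and bias omitted)
    return (X - X.mean(axis=-1, keepdims=True)) / np.sqrt(X.var(axis=-1, keepdims=True) + 1e-5)

def self_attention(X):
    return X                                 # stub; see the attention section

def ffn(X):
    return X                                 # stub; see the feed-forward network section

def transformer_layer(X):
    X = layer_norm(X + self_attention(X))    # steps 1-3: attention, residual add, LayerNorm
    X = layer_norm(X + ffn(X))               # steps 4-6: FFN, residual add, LayerNorm
    return X                                 # still shape [s, h]

X = np.random.randn(s, h)                    # activation tensor from embedding + positional encoding
for _ in range(4):                           # a dense transformer = a stack of these layers
    X = transformer_layer(X)

W_unembed = np.random.randn(h, v)            # after the last layer: project to the vocabulary
logits = X @ W_unembed                       # shape [s, v]
assert logits.shape == (s, v)
```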
Example of data flow
Let’s say you run a single word, `prestidigitation`, through a 30B parameter LLM. The LLM has the following:
- Sequence length $s$: 8,192 tokens
- Hidden size $h$: 6,144
- Number of layers: 60
- Feed-forward network hidden size: 28,672
Using the GPT-4o tokenizer, we can see…
- Tokenization: The word breaks down into a vector of four token IDs of type int32 (a sketch of this step follows this list).
- Embedding: Each ID indexes into the embedding weight matrix of shape $[v, h]$, resulting in 8,192 rows (four of which are non-null) of length 6,144.
- Positional encoding: The four non-null rows of the embedding tensor change a little.
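If you want to poke at the tokenization step yourself, something like this should work, assuming you have the tiktoken library installed (o200k_base is the encoding GPT-4o uses); the exact token IDs aren't reproduced here:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")    # the GPT-4o tokenizer
token_ids = enc.encode("prestidigitation")
print(len(token_ids), token_ids)             # a handful of integer token IDs
```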
Our activation tensor at this point has dimensions $[8192, 6144]$, but only four rows actually do anything. This activation tensor is then passed through the transformer:
- Self-attention is applied, but the 8188 null tokens are masked off and do not contribute anything to the activation tensor that pops out of attention.
- Add+norm are applied to each token’s activation (or hidden state). This happens even for the unused part of the sequence.
- Each token’s activations go through the feed-forward network. Again, even the unused tokens go through.
- Add+norm happens again, even for the unused tokens.
Self-attention is the only step where each token’s activations are affected by the other tokens, so this is the only step where care must be taken to ensure that unused tokens in the sequence are masked off. This prevents those null tokens in the sequence from interacting with the four tokens we actually care about.
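Here is a minimal sketch of that masking, with toy numbers (4 real tokens padded out to a sequence length of 8): masked positions get a score of negative infinity before the softmax, so they end up with zero attention weight.

```python
import numpy as np

s, num_real = 8, 4                               # 4 real tokens, 4 unused padding slots
scores = np.random.randn(s, s)                   # raw attention scores, shape [s, s]

mask = np.zeros((s, s))
mask[:, num_real:] = -np.inf                     # nobody may attend to a padding position

weights = np.exp(scores + mask)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

assert np.allclose(weights[:, num_real:], 0.0)   # padding tokens contribute nothing
```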
After our input sequence has completed its forward pass through the transformer, the logits tensor (with shape $[s, v]$) pops out and describes the probabilities of every token in our vocabulary (of size $v$) being the next token.
Attention
Info
This information is explored in more depth in attention.
Attention is the only part of the transformer where different tokens in the input sequence can interact with each other. The activations tensor $X$ is transformed into three intermediate tensors:
- Query tensor $Q$, which captures what information each token needs from the other tokens
- Key tensor $K$, which captures what information each token has to offer the other tokens
- Value tensor $V$, which captures the features that get shared by other tokens when the query and key are similar
All three tensors are simply the activations multiplied by learned weights (which are model parameters). If the activation tensor is $X$, then

$$Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V$$

where these weight tensors $W_Q$, $W_K$, and $W_V$ have dimensions $[h, h]$. This means $Q$, $K$, and $V$ have the same dimensions as the activations tensor ($[s, h]$).
Roughly, attention works like this:
- It first calculates a score between each token and every other token by comparing their projected representations ($Q$ vs $K$). The more similar they are, the higher the score. After this, we have a tensor of shape $[s, s]$ describing how much every token attends to every other token. These scores are then normalized with softmax.
- For each token $i$, the normalized score for every token $j$ is multiplied by that token’s value vector $V_j$. These weighted value vectors are added up, producing a new context vector for token $i$. Doing this for all tokens results in an output tensor of shape $[s, h]$.
- This output tensor is finally multiplied by one last matrix of learned weights, $W_O$ (shape $[h, h]$), to produce the new activations that feed into the next part of the transformer layer.
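Putting those three steps together, here is a minimal single-head NumPy sketch. One detail not spelled out above: real implementations also divide the scores by $\sqrt{h}$ (or the per-head dimension) before the softmax, which is included here; multiple heads, biases, and masking are omitted.

```python
import numpy as np

s, h = 8, 16
X = np.random.randn(s, h)                        # activations, shape [s, h]

W_Q, W_K, W_V, W_O = (np.random.randn(h, h) for _ in range(4))   # learned weights, each [h, h]
Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # each shape [s, h]

scores = Q @ K.T / np.sqrt(h)                    # step 1: score every token pair, shape [s, s]
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

context = weights @ V                            # step 2: weighted sum of value vectors, [s, h]
out = context @ W_O                              # step 3: output projection, still [s, h]
assert out.shape == (s, h)
```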
Feed-forward network
After attention, each token has a contextualized hidden state represented by its activations. The feed-forward network (FFN) operates on each token’s activations independently; whereas attention lets tokens exchange information, the FFN transforms each token’s own representation in more complex ways.
This FFN is just a multilayer perceptron and does the following:
- The input is the activations tensor of shape $[s, h]$
- Each token’s activations are projected into a larger space of dimension $d_{ff}$ using a tensor of learned weights and a bias ($W_1$ with shape $[h, d_{ff}]$ and $b_1$)
- A nonlinear activation function (like ReLU or GELU) is applied element by element
- The resulting vector is then projected back down to the original model dimension $h$ using another set of learned weights and bias ($W_2$ with shape $[d_{ff}, h]$ and $b_2$)
- The resulting activations tensor (with shape $[s, h]$) then moves on through the rest of the transformer.
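A matching NumPy sketch of the FFN, with toy sizes and ReLU standing in for whatever nonlinearity a real model uses:

```python
import numpy as np

s, h, d_ff = 8, 16, 64                           # toy sizes; d_ff is the FFN hidden size
X = np.random.randn(s, h)                        # per-token activations, shape [s, h]

W1, b1 = np.random.randn(h, d_ff), np.zeros(d_ff)    # up-projection weights and bias
W2, b2 = np.random.randn(d_ff, h), np.zeros(h)       # down-projection weights and bias

hidden = np.maximum(0.0, X @ W1 + b1)            # expand each token to [d_ff], apply ReLU
out = hidden @ W2 + b2                           # project back down to [s, h]
assert out.shape == (s, h)                       # each token was processed independently
```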
Intuitive explanation
The ideas of attention and feed-forward networks seem very abstract, so I find it helpful to also develop an intuitive understanding of how information flows through a transformer.
Let’s assume the input sequence is `I drank a cup of` and tokens are words. This makes the input sequence `[I, drank, a, cup, of]`.
Step 1. Embedding
Our input sequence is converted into an initial activation tensor, where each of our five tokens has an embedding vector that encodes its lexical meaning. For example,
- `I` is a first-person pronoun that’s close to `me` and `myself` in the embedding space.
- `drank` is a past-tense verb meaning something was consumed. It’s probably close to words like `ate` and `sipped`.
- `a` is an indefinite article, but it doesn’t have much meaning.
- `cup` is a container, probably close to `mug` and `glass`.
- `of` is a preposition, probably close to `with` and `from`.
Step 2. Transformer layers
Attention is where each token looks at every other token and figures out how important each token is to the others. For example,
- `drank` has a strong relationship with `I`
- `cup` has a strong relationship with `drank`
- `of` has a strong relationship with `cup`
The embeddings for each token are tweaked to encode this, and these vectors are now called the hidden state or activation of each token.
The feed-forward network then reprocesses each token’s newly updated hidden state to shift it towards more relevant parts of the semantic subspace. For example,
- the FFN would push `of`’s hidden state towards a subspace that captures container-contents relationships
- the FFN would push `cup`’s hidden state towards a subspace of being drunk from.
After the FFN, the token’s hidden state now encodes its lexical meaning and its role in the context of the input sequence.
As each token’s activations (hidden state) pass through more transformer layers, they accumulate more contextual awareness.
Step 3. Output projection
By the time the activations all pop out of the end of the transformer, they can be projected into a logits tensor that reflects the likelihood of the next token based on the full context of all the tokens that preceded it. The vocabulary might contain words like `coffee`, `tea`, `water`, and `chair`, and the logits tensor might reflect:
- `coffee` - high probability
- `tea` - high probability
- `water` - moderate probability
- `chair` - near-zero probability
From this, we roll the dice and pick a next token.
Chatbots
Chatbots operate using autoregressive decoding. They take an input sequence and
- Run it through the transformer
- Pick a next token from the logits tensor that pops out
- Append that token to the input sequence
- Send that new sequence back through the transformer (step 1)
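A sketch of that loop with the transformer and the sampling strategy stubbed out; the function names here are placeholders, not a real API:

```python
import numpy as np

def transformer_forward(token_ids):
    """Placeholder: returns a logits tensor of shape [len(token_ids), vocab_size]."""
    return np.random.randn(len(token_ids), 100)

def sample_next_token(logits):
    """Greedy pick from the last position's logits; real chatbots usually
    sample with temperature / top-p instead of always taking the argmax."""
    return int(np.argmax(logits[-1]))

token_ids = [3, 14, 15, 92]                      # some tokenized prompt (made-up IDs)
for _ in range(5):                               # generate 5 new tokens
    logits = transformer_forward(token_ids)      # step 1: run through the transformer
    next_id = sample_next_token(logits)          # step 2: pick a next token
    token_ids.append(next_id)                    # step 3: append it to the sequence
                                                 # step 4: loop back to step 1
print(token_ids)
```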
Seminal papers
I stole this reading list from No Hype DeepSeek-R1 Reading List.
- Attention Is All You Need by Google is the original transformer paper that proposed using attention instead of recurrence to allow models to scale.
- Language Models are Unsupervised Multitask Learners is the GPT-2 paper by OpenAI which showed that large-scale pretraining allows a model to be good at many different tasks without fine-tuning.
- Language Models are Few-Shot Learners is the GPT-3 paper by OpenAI which showed that bigger models enable few-shot learning, where you can provide examples in the prompt itself to get the model to do tasks that it was never trained/fine-tuned to do.
- Training language models to follow instructions with human feedback is the OpenAI paper that explains how a general transformer can be turned into a chatbot using reinforcement learning from human feedback (RLHF).
Footnotes
1. Pedantically, it’s really a sequence of tokens.