Transformers are a family of neural networks that process text by analyzing relationships between all words in a sequence at once rather than sequentially. They do this with self-attention, which scores how relevant each word in a sentence is to every other word and uses those scores to build context-aware representations. Transformers are the basis for LLMs like OpenAI’s GPT family, and they let chatbots hold context-aware conversations by predicting the next word from everything seen so far rather than processing words one by one.
Transformers are computationally expensive because self-attention compares every input token to every other input token, so the computational requirements scale as $O(n^2)$ in the sequence length $n$. This buys rich contextual understanding at the price of high training costs. See LLM training for more information.
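A minimal sketch of scaled dot-product self-attention makes both points concrete (single head, no masking, no learned query/key/value projections, which a real transformer would have): every token's new representation is a weighted mix of all tokens, and the score matrix that produces those weights has shape $(n, n)$, which is where the quadratic cost comes from.

```python
import numpy as np

def self_attention(X):
    """Single-head scaled dot-product self-attention over token
    embeddings X of shape (n, d); no learned projections, no masking."""
    d = X.shape[-1]
    # Compare every token to every other token: (n, d) @ (d, n) -> (n, n).
    scores = X @ X.T / np.sqrt(d)  # this (n, n) matrix is the quadratic cost
    # Row-wise softmax so each token's attention weights sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a context-aware mix of every input token.
    return weights @ X  # shape (n, d)

# Toy example: 4 "tokens" with 8-dimensional embeddings.
X = np.random.default_rng(0).normal(size=(4, 8))
print(self_attention(X).shape)  # (4, 8)
```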
Seminal papers
I stole this reading list from No Hype DeepSeek-R1 Reading List.
- Attention Is All You Need by Google is the original transformer paper, which proposed replacing recurrence entirely with attention so that training can be parallelized and models can scale.
- Language Models are Unsupervised Multitask Learners is the GPT-2 paper by OpenAI, which showed that large-scale pretraining lets a single model perform well on many different tasks without fine-tuning.
- Language Models are Few-Shot Learners is the GPT-3 paper by OpenAI, which showed that bigger models enable few-shot learning, where you provide examples in the prompt itself to get the model to do tasks it was never trained or fine-tuned to do (see the sketch after this list).
- Training language models to follow instructions with human feedback is the OpenAI paper that explains how a pretrained language model can be turned into an instruction-following chatbot using reinforcement learning from human feedback (RLHF).
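To illustrate few-shot prompting, here is a sketch in the spirit of the translation example in the GPT-3 paper; the exact completion depends on the model.

```python
# A few-shot prompt: the task is demonstrated by examples placed directly
# in the prompt, and the model is asked to continue the pattern.
prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# A GPT-3-style model completes this with "fromage" even though it was
# never fine-tuned on translation.
print(prompt)
```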