Transformers are a family of neural networks that process text by analyzing relationships between all words in a sequence at once rather than sequentially. They use self-attention to determine which words in a sentence are most relevant to one another, which lets them capture context. Transformers are the basis for LLMs like OpenAI's GPT family, and they enable chatbots to generate context-aware responses by predicting the next word based on everything seen so far instead of processing words one by one.
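
A minimal sketch of scaled dot-product self-attention in NumPy may help make this concrete. The function name, toy dimensions, and random weights are all illustrative, not any particular library's API:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x:             (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    # Relevance of every token to every other token: (seq_len, seq_len)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over each row turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted mix of all value vectors
    return weights @ v

# Toy example: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8)
```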

Transformers are computationally expensive because self-attention compares every input token to every other input token, so the computational cost scales as O(n²) in the sequence length n. This gives them rich contextual understanding at the price of high training costs. See LLM training for more information.
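
To make the quadratic cost concrete, a quick back-of-the-envelope calculation (the sequence lengths here are chosen arbitrarily):

```python
# The attention score matrix has one entry per (query token, key token) pair,
# so its size, and the work to fill it, grows as n^2 in sequence length n.
for n in (1_024, 4_096, 16_384):
    print(f"seq_len={n:>6}: {n * n:>12,} attention scores per head per layer")
```

Quadrupling the context length (e.g. 4,096 to 16,384 tokens) multiplies the attention work per layer by sixteen, which is why long-context models are so costly to train and serve.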

Seminal papers

I stole this reading list from the No Hype DeepSeek-R1 Reading List.