Llama-3 is a family of models developed by Meta and described in The Llama 3 Herd of Models (arxiv.org). Its largest variant, Llama-3.1 405B, may be considered a frontier model.
Llama-3 is a dense transformer, essentially a bigger and better-trained version of its predecessor, Llama-2. Compared to that model:
- They trained on more, higher-quality data (15.6 trillion tokens vs. 1.8 trillion for Llama 2)
- They trained with far more compute, using up to 16K H100 GPUs on Meta’s RoCE cluster and cranking through roughly 3.8 × 10²⁵ FLOPs in total (see the quick estimate after this list).
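As a sanity check on that scale, the standard C ≈ 6·N·D rule of thumb applied to 405B parameters and 15.6T tokens lands right around that figure. This is a back-of-the-envelope sketch, not the paper's own accounting:

```python
# Rough total-compute estimate using the common C ~= 6 * N * D approximation,
# where N is the parameter count and D is the number of training tokens.
n_params = 405e9    # Llama-3.1 405B parameters
n_tokens = 15.6e12  # pre-training tokens reported in the paper

total_flops = 6 * n_params * n_tokens
print(f"{total_flops:.2e} FLOPs")  # ~3.79e+25, consistent with the ~3.8e25 figure above
```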
Of note, Llama-3 uses Grouped-Query Attention (GQA) rather than vanilla multi-head attention: several query heads share each key/value head, which shrinks the KV cache and reduces compute and memory requirements.1 They deliberately chose a dense architecture over a mixture of experts.
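A minimal PyTorch sketch of the GQA idea; the dimensions and projection layers here are illustrative, not Llama-3's actual configuration or code:

```python
import torch
import torch.nn.functional as F

# Grouped-query attention: many query heads share a smaller set of key/value
# heads, so the K/V projections and the KV cache are much smaller than in MHA.
batch, seq_len, d_model = 2, 16, 512
n_q_heads, n_kv_heads = 8, 2            # 4 query heads per KV head
head_dim = d_model // n_q_heads

q_proj = torch.nn.Linear(d_model, n_q_heads * head_dim, bias=False)
k_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # smaller than MHA
v_proj = torch.nn.Linear(d_model, n_kv_heads * head_dim, bias=False)

x = torch.randn(batch, seq_len, d_model)
q = q_proj(x).view(batch, seq_len, n_q_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(batch, seq_len, n_kv_heads, head_dim).transpose(1, 2)

# Each KV head is shared by n_q_heads // n_kv_heads query heads.
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
```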
Oxen.ai has a great summary of the paper.2 In brief, the paper has:
- A good explanation of how they cleaned their training data
- Great anecdotes about component reliability and JMTTI (job mean time to interruption)
- A description of the techniques they used to extend the context length, anneal the model at the end of pre-training, and other practical details.
Post-training Llama-3 involved supervised finetuning, rejection sampling, and direct preference optimization (DPO) rather than PPO-style reinforcement learning.3
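For reference, DPO replaces the reward-model-plus-RL loop with a single classification-style loss over preference pairs. Below is a minimal sketch of the textbook DPO objective, not Meta's training code; the tensor values in the usage example are made up:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on sequence log-probabilities.

    Each argument is a tensor of summed per-token log-probs for a batch of
    (prompt, response) pairs; beta controls how far the policy may drift
    from the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.2]))
print(loss.item())
```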