Llama-3 is a family of models described in The Llama-3 Herd of Models (arxiv.org).
Llama-3 is a dense transformer: a bigger, better-trained version of the earlier Llama-2. Compared to that model,
- They trained on more, higher-quality data (15 trillion tokens)
- They trained longer, using up to 16K H100 GPUs on Meta's RoCE cluster and cranking through roughly 3.8 × 10²⁵ FLOPs in total.
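That compute figure lines up with the standard back-of-the-envelope estimate of training FLOPs ≈ 6 × N × D (N = parameter count, D = training tokens). A quick sanity check using the largest model's 405B parameters and ~15.6T tokens:

```python
# Rough training-compute estimate via the common 6*N*D rule of thumb.
# N = 405e9 parameters, D = 15.6e12 tokens (the paper's largest model).
N = 405e9
D = 15.6e12
flops = 6 * N * D
print(f"{flops:.2e}")  # about 3.79e+25, matching the paper's ~3.8e25 figure
```

This is only an order-of-magnitude heuristic; it ignores activation recomputation and attention-specific terms, but it recovers the headline number.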
Oxen.ai has a great summary of the paper.1 In brief, the paper has:
- A good explanation of how they cleaned their training data
- Great anecdotes about component reliability and job mean time to interruption (JMTTI)
- A description of the techniques they used for long-context training, annealing the model, and other practical matters.