Llama-3 is a family of models described in The Llama 3 Herd of Models (arxiv.org).

Llama-3 is a dense transformer, a bigger and better-trained version of the previous Llama-2 model. Compared to Llama-2:

  1. They trained on more and higher-quality data: roughly 15 trillion tokens, up from Llama-2's 2 trillion.
  2. They trained with far more compute, using 16K H100 GPUs on Meta's RoCE cluster and cranking through roughly 3.8 × 10^25 FLOPs in total (a back-of-the-envelope check follows this list).
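
As a sanity check on that compute number, the standard dense-transformer rule of thumb (total training FLOPs ≈ 6 · parameters · tokens; an approximation, not a figure from the paper) lands in the right ballpark for the 405B flagship:

```python
# Back-of-the-envelope training compute for the 405B flagship model,
# using the common approximation: total FLOPs ~= 6 * params * tokens.
params = 405e9   # 405B parameters
tokens = 15e12   # ~15 trillion pre-training tokens

flops = 6 * params * tokens
print(f"~{flops:.1e} FLOPs")  # ~3.6e+25, consistent with the reported 3.8e25
```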

Oxen.ai has a great summary of the paper.1 In brief, the paper includes:

  • A good explanation of how they cleaned their training data
  • Great anecdotes about component reliability and JMTTI (job mean time to interruption)
  • A description of the techniques they used to extend the context length, anneal the model at the end of pre-training, and other practical details (see the sketch after this list).
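
On annealing: in the paper this is the final phase of pre-training, where the learning rate is driven down to zero while very high-quality data is upsampled. Here is a minimal sketch of what such a schedule can look like; the warmup length, peak rate, decay floor, and anneal fraction are illustrative placeholders, not the paper's values.

```python
import math

def lr_at_step(step, max_steps, peak_lr=8e-5, warmup=2000, anneal_frac=0.05):
    """Toy schedule: linear warmup, cosine decay to 10% of peak, then a
    final linear anneal to zero. All constants are illustrative."""
    anneal_start = int(max_steps * (1 - anneal_frac))
    floor = 0.1 * peak_lr
    if step < warmup:                # linear warmup from zero
        return peak_lr * step / warmup
    if step < anneal_start:          # cosine decay for the bulk of training
        t = (step - warmup) / (anneal_start - warmup)
        return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * t))
    # annealing: drive the LR linearly to zero over the last few percent
    return floor * (max_steps - step) / (max_steps - anneal_start)
```

In the paper, this final anneal is paired with upsampled high-quality data and checkpoint averaging to produce the final pre-trained model.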

Footnotes

  1. arXiv Dive: How Meta Trained Llama 3.1 (oxen.ai)