Llama-3 is a family of models developed by Meta and described in The Llama-3 Herd of Models (arxiv.org). Its largest variant, Llama-3.1 405B, may be considered a frontier model.

Llama-3 is a dense transformer that is a bigger and better-trained version of the previous Llama-2 model. Compared to that model,

  1. They trained on more, higher-quality data (15.6 trillion tokens vs. 1.8 trillion for Llama 2)
  2. They trained with far more compute, using up to 16K H100 GPUs on Meta’s RoCE-based cluster (see the rough estimate after this list).
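As a back-of-the-envelope check (an estimate, not a figure quoted from the paper), the standard 6 × parameters × tokens approximation gives a sense of the total pretraining compute for the 405B model:

```python
# Rough pretraining compute for Llama-3.1 405B using the common
# 6 * params * tokens FLOPs approximation.
params = 405e9      # model parameters
tokens = 15.6e12    # pretraining tokens
flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")  # ~3.79e+25 FLOPs
```

This is consistent in order of magnitude with the total compute reported in the paper.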

Of note, Llama-3 uses Grouped-Query Attention (GQA) instead of standard multi-head attention: multiple query heads share each key/value head, which shrinks the KV cache and reduces compute and memory requirements at inference time.1 They deliberately did not use a mixture-of-experts architecture.
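To make the sharing concrete, here is a toy GQA sketch in PyTorch (not Meta's implementation; the weight shapes, head counts, and function name are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads=32, n_kv_heads=8):
    """Toy grouped-query attention: n_q_heads query heads share n_kv_heads
    key/value heads, shrinking the KV cache by n_q_heads / n_kv_heads.
    Assumes wq has shape (dim, dim) and wk, wv have shape
    (dim, n_kv_heads * head_dim)."""
    bsz, seqlen, dim = x.shape
    head_dim = dim // n_q_heads
    q = (x @ wq).view(bsz, seqlen, n_q_heads, head_dim)
    k = (x @ wk).view(bsz, seqlen, n_kv_heads, head_dim)
    v = (x @ wv).view(bsz, seqlen, n_kv_heads, head_dim)
    # Each key/value head serves a group of n_q_heads // n_kv_heads query heads.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (bsz, heads, seq, head_dim)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(bsz, seqlen, dim)

# Example: 32 query heads sharing 8 KV heads on a small dummy input.
dim, n_q, n_kv = 128, 32, 8
x = torch.randn(2, 16, dim)
wq = torch.randn(dim, dim)
wk = torch.randn(dim, n_kv * (dim // n_q))
wv = torch.randn(dim, n_kv * (dim // n_q))
print(grouped_query_attention(x, wq, wk, wv, n_q, n_kv).shape)  # torch.Size([2, 16, 128])
```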

Oxen.ai has a great summary of the paper.2 In brief, the paper has:

  • A good explanation of how they cleaned their training data
  • Great anecdotes about component reliability and job mean time to interruption (JMTTI)
  • A description of the techniques they used for long-context training, model annealing, and other practical matters.

Post-training Llama-3 involved supervised finetuning, rejection sampling, and direct preference optimization (DPO) rather than reinforcement learning from human feedback.3
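For orientation, a minimal DPO loss might look like the sketch below (not Meta's training code; the tensor names and the beta value are assumptions, and the per-response log-probabilities are assumed to be precomputed):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of per-example summed log-probabilities of the
    chosen/rejected response under the trainable policy or the frozen
    reference model; beta scales the implicit KL penalty."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
lp = lambda: torch.randn(4) - 5.0
print(dpo_loss(lp(), lp(), lp(), lp()))
```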

Footnotes

  1. Movie Gen: A Cast of Media Foundation Models

  2. arXiv Dive: How Meta Trained Llama 3.1 (oxen.ai)

  3. The Llama-3 Herd of Models (arxiv.org)