Llama-3 is a family of models developed by Meta and described in The Llama-3 Herd of Models (arxiv.org). Its largest form, Llama-3.1 405B, may be considered a frontier model.

Llama-3 is a dense transformer: a bigger, better-trained version of the previous Llama-2 model. Compared to that model:

  1. They trained on more, higher-quality data (15.6 trillion tokens vs. 1.8 trillion for Llama 2).
  2. They trained more, using 2,048 nodes[1] on Meta’s H100 RoCE cluster and cranking through roughly 3.8 × 10^25 FLOPs total in bfloat16 (a figure sanity-checked just below).
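
As a sanity check on that figure, the standard C ≈ 6·N·D rule of thumb for dense-transformer training compute (an approximation, not something from the paper) lands right on it:

```python
# Back-of-the-envelope check using the common C ≈ 6·N·D rule of thumb
# for dense-transformer training compute (N = parameters, D = tokens).
N = 405e9    # Llama-3.1 405B parameters
D = 15.6e12  # pretraining tokens

print(f"{6 * N * D:.2e} FLOPs")  # ~3.79e+25, consistent with 3.8 × 10^25
```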

Of note, Llama-3 uses grouped-query attention (GQA) instead of standard multi-head attention, reducing the number of key and value heads and thereby shrinking both computation and the KV-cache memory footprint.[2] They deliberately did not use a mixture-of-experts architecture. They also used a significantly larger vocabulary (128K tokens), which helped them train Llama-3 as a multilingual model.
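
To make the GQA idea concrete, here is a minimal PyTorch sketch. It is illustrative only, not Meta's implementation; the toy head counts below are made up, though Llama-3 models do use 8 key/value heads:

```python
# Minimal sketch of grouped-query attention (GQA): several query heads
# share one key/value head, shrinking the KV cache and its bandwidth cost.
import torch
import torch.nn.functional as F

def gqa(q, k, v, n_q_heads=32, n_kv_heads=8):
    # q: (batch, seq, n_q_heads * head_dim); k, v: (batch, seq, n_kv_heads * head_dim)
    b, s, _ = q.shape
    head_dim = q.shape[-1] // n_q_heads
    q = q.view(b, s, n_q_heads, head_dim).transpose(1, 2)   # (b, hq, s, d)
    k = k.view(b, s, n_kv_heads, head_dim).transpose(1, 2)  # (b, hkv, s, d)
    v = v.view(b, s, n_kv_heads, head_dim).transpose(1, 2)
    # Repeat each KV head so a group of query heads attends to it.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)                   # (b, hq, s, d)
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v)           # (b, hq, s, d)
    return out.transpose(1, 2).reshape(b, s, n_q_heads * head_dim)

x_q = torch.randn(1, 16, 32 * 64)  # toy shapes: 32 query heads of dim 64
x_kv = torch.randn(1, 16, 8 * 64)  # only 8 KV heads -> 4x smaller KV cache
print(gqa(x_q, x_kv, x_kv).shape)  # torch.Size([1, 16, 2048])
```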

Oxen.ai has a great summary of the paper.[3] In brief, the paper has:

  • A good explanation of how they cleaned their training data.
  • Great anecdotes about component reliability and job mean time to interruption (JMTTI).
  • A description of techniques they used to train long contexts, anneal the model, and other practical things.

Post-training Llama-3 involved supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO) rather than more complex reinforcement-learning algorithms.[4]
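
Since DPO carries much of the preference-tuning weight, here is a minimal sketch of its core objective; this is the published DPO loss in generic form, not Meta's training code, and `beta` plus the toy log-probabilities are made-up values:

```python
# Sketch of the DPO objective: implicit rewards are beta-scaled log-ratios
# of the policy against a frozen reference model, and the loss is a
# logistic loss on the chosen-vs-rejected reward margin.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with per-example summed sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)  # smaller when the policy prefers chosen responses more than ref
```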

Hyperparameters

Pavan Balaji presented the following hyperparameters:[1]

GPUs   | Tensor Parallelism | Context Parallelism | Pipeline Parallelism | Tokens/batch | TFLOPS/GPU
------ | ------------------ | ------------------- | -------------------- | ------------ | ----------
8,192  | 8                  | 1                   | 16                   | 16M          | 430
16,384 | 8                  | 1                   | 16                   | 16M          | 400
16,384 | 8                  | 16                  | 16                   | 16M          | 380

A version of this table appears in the paper as well.
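
The rows multiply out cleanly if the leftover degree is data parallelism, per the paper's 4D-parallelism description; the arithmetic below is my own sketch, not from the slide:

```python
# Check that each row's parallelism degrees multiply out to the GPU count,
# with the leftover factor being data parallelism (DP).
def data_parallel_degree(gpus: int, tp: int, cp: int, pp: int) -> int:
    assert gpus % (tp * cp * pp) == 0
    return gpus // (tp * cp * pp)

rows = [(8_192, 8, 1, 16), (16_384, 8, 1, 16), (16_384, 8, 16, 16)]
for gpus, tp, cp, pp in rows:
    dp = data_parallel_degree(gpus, tp, cp, pp)
    print(f"{gpus:>6} GPUs = TP {tp} x CP {cp} x PP {pp} x DP {dp}")

# 430 TFLOPS/GPU against the H100's ~989 TFLOPS bf16 dense peak is ~43% MFU.
print(f"MFU at 430 TFLOPS/GPU: {430 / 989:.0%}")
```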

Footnotes

  1. Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24. He showed a slide with hyperparameters that included 16,384 GPUs.

  2. Movie Gen: A Cast of Media Foundation Models

  3. arXiv Dive: How Meta Trained Llama 3.1 (oxen.ai)

  4. The Llama-3 Herd of Models (arxiv.org)