Llama-3 is a family of models developed by Meta and described in The Llama-3 Herd of Models (arxiv.org). Its largest variant, Llama-3.1 405B, may be considered a frontier model.
Llama-3 is a dense transformer that is a bigger and better-trained version of the previous Llama-2 model. Compared to that model,
- They trained on more, higher-quality data (15.6 trillion tokens vs. 1.8 trillion for Llama 2)
- They trained more, using 2,048 nodes1 on Meta’s H100 RoCE cluster and cranking through roughly 3.8 × 10²⁵ FLOPs total in bfloat16 (a quick sanity check of that figure follows the list).
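That compute figure is easy to sanity-check with the usual ≈6·N·D FLOPs rule of thumb for dense transformers; the parameter and token counts below come from the bullets above, and the rule of thumb is only an approximation, not the paper's own accounting:

```python
# Rough training-compute estimate for Llama-3.1 405B using the common
# "6 * parameters * tokens" rule of thumb for dense transformers.
params = 405e9    # 405B parameters
tokens = 15.6e12  # 15.6T training tokens

flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")  # ~3.79e+25, consistent with the ~3.8e25 cited above
```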
Of note, though, Llama-3 uses Grouped-Query Attention (GQA) instead of full multi-head attention, reducing the number of key and value heads and with them the compute and KV-cache memory footprint.2 They deliberately did not use a mixture of experts. They also used a significantly larger (128K) vocabulary, which allowed them to train Llama-3 as a multilingual model.
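A minimal sketch of grouped-query attention, assuming PyTorch and purely illustrative sizes (8 query heads sharing 2 key/value heads; not Llama-3's actual configuration). The point is that several query heads attend through the same key/value head, so far fewer K/V tensors need to be computed and cached:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes, not Llama-3's real config: 8 query heads share 2 KV heads.
batch, seq = 2, 16
n_q_heads, n_kv_heads, head_dim = 8, 2, 64

q = torch.randn(batch, seq, n_q_heads, head_dim)
k = torch.randn(batch, seq, n_kv_heads, head_dim)  # only n_kv_heads K/V tensors to cache
v = torch.randn(batch, seq, n_kv_heads, head_dim)

# Each group of (n_q_heads // n_kv_heads) query heads shares one KV head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=2)  # expand KV heads to match query heads
v = v.repeat_interleave(group, dim=2)

# From here it is ordinary scaled dot-product attention, per head.
q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (batch, heads, seq, head_dim)
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v
out = out.transpose(1, 2).reshape(batch, seq, n_q_heads * head_dim)
```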
Oxen.ai has a great summary of the paper.3 In brief, the paper has:
- A good explanation of how they cleaned their training data.
- Great anecdotes about component reliability and JMTTI (job mean time to interruption)
- A description of techniques they used to train on long contexts, anneal the model, and handle other practical details.
Post-training Llama-3 involved supervised fine-tuning, rejection sampling, and direct preference optimization (DPO) rather than reinforcement learning from human feedback.4
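For intuition, the objective those preference pairs are trained against can be sketched as below. This is a minimal illustration of the standard published DPO loss, not Meta's implementation; the beta value and the toy log-probabilities are placeholders:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss over a batch of (chosen, rejected) preference pairs.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    """
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
```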
Hyperparameters
Pavan Balaji presented the following hyperparameters:1
| GPUs | Tensor Parallelism | Context Parallelism | Pipeline Parallelism | Tokens/batch | TFLOPS/GPU |
|---|---|---|---|---|---|
| 8,192 | 8 | 1 | 16 | 16M | 430 |
| 16,384 | 8 | 1 | 16 | 16M | 400 |
| 16,384 | 8 | 16 | 16 | 16M | 380 |
This table is probably in the paper as well.
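One way to read the table: the product of the tensor, context, and pipeline parallel degrees divides the GPU count, and whatever is left over is the data-parallel dimension; the TFLOPS column then implies a hardware utilization. A small sketch of that arithmetic, assuming the commonly cited ~989 TFLOPS dense BF16 peak for an H100 SXM (the peak figure is my assumption, not from the slide or the paper):

```python
# Sanity-check the parallelism layout in the table above.
# Data-parallel degree = GPUs / (TP * CP * PP).
configs = [
    # (gpus, tensor_parallel, context_parallel, pipeline_parallel, tflops_per_gpu)
    (8_192, 8, 1, 16, 430),
    (16_384, 8, 1, 16, 400),
    (16_384, 8, 16, 16, 380),
]

H100_BF16_PEAK_TFLOPS = 989  # assumed dense BF16 peak for an H100 SXM

for gpus, tp, cp, pp, tflops in configs:
    dp = gpus // (tp * cp * pp)
    mfu = tflops / H100_BF16_PEAK_TFLOPS
    print(f"{gpus:>6} GPUs -> data parallelism {dp:>3}, MFU ~{mfu:.0%}")
```

The last row (context parallelism 16) appears to correspond to the long-context training stage: data parallelism drops to 8 and per-GPU throughput falls accordingly.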
Footnotes
1. Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24. He showed a slide with hyperparameters which included 16,384 GPUs.