Llama-3 is a family of models developed by Meta and described in The Llama-3 Herd of Models (arxiv.org). Its largest variant, Llama-3.1 405B, may be considered a frontier model.
Llama-3 is a dense transformer that is a bigger and better-trained version of the previous Llama-2 model. Compared to that model,
- They trained on more, higher-quality data (15.6 trillion tokens vs. 1.8 trillion for Llama 2)
- They trained more, using 2,048 nodes1 on Meta’s H100 RoCE cluster and cranking through roughly 3.8 × 10²⁵ FLOPs total in bfloat16 (a quick sanity check of that figure follows the list).
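That compute figure is easy to sanity-check with the usual ≈6·N·D FLOPs rule of thumb for dense transformers; the parameter and token counts below come from the bullets above, and the rule of thumb is only an approximation, not the paper's own accounting:

```python
# Rough training-compute estimate for Llama-3.1 405B using the common
# "6 * parameters * tokens" rule of thumb for dense transformers.
params = 405e9    # 405B parameters
tokens = 15.6e12  # 15.6T training tokens

flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")  # ~3.79e+25, consistent with the ~3.8e25 cited above
```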
Of note, though, Llama-3 uses Grouped-Query Attention (GQA) instead of full multi-head attention, reducing the number of key and value heads and with them the compute and KV-cache memory footprint.2 They deliberately did not use a mixture of experts. They also used a significantly larger (128K) vocabulary, which allowed them to train Llama-3 as a multilingual model.
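A minimal sketch of grouped-query attention, assuming PyTorch and purely illustrative sizes (8 query heads sharing 2 key/value heads; not Llama-3's actual configuration). The point is that several query heads attend through the same key/value head, so far fewer K/V tensors need to be computed and cached:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes, not Llama-3's real config: 8 query heads share 2 KV heads.
batch, seq = 2, 16
n_q_heads, n_kv_heads, head_dim = 8, 2, 64

q = torch.randn(batch, seq, n_q_heads, head_dim)
k = torch.randn(batch, seq, n_kv_heads, head_dim)  # only n_kv_heads K/V tensors to cache
v = torch.randn(batch, seq, n_kv_heads, head_dim)

# Each group of (n_q_heads // n_kv_heads) query heads shares one KV head.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=2)  # expand KV heads to match query heads
v = v.repeat_interleave(group, dim=2)

# From here it is ordinary scaled dot-product attention, per head.
q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (batch, heads, seq, head_dim)
scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v
out = out.transpose(1, 2).reshape(batch, seq, n_q_heads * head_dim)
```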
Oxen.ai has a great summary of the paper.3 In brief, the paper has:
- A good explanation of how they cleaned their training data.
- Great anecdotes about component reliability and JMTTI (job mean time to interruption)
- A description of techniques they used to train on long contexts, anneal the model, and handle other practical details.
Post-training Llama-3 involved supervised fine-tuning, rejection sampling, and direct preference optimization (DPO) rather than reinforcement learning from human feedback.4
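For intuition, the objective those preference pairs are trained against can be sketched as below. This is a minimal illustration of the standard published DPO loss, not Meta's implementation; the beta value and the toy log-probabilities are placeholders:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss over a batch of (chosen, rejected) preference pairs.

    Each argument is the summed log-probability of a full response under the
    policy being trained or under the frozen reference model.
    """
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the chosen response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
```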
Hyperparameters
Pavan Balaji presented the following hyperparameters:1
| GPUs | Tensor Parallelism | Context Parallelism | Pipeline Parallelism | Tokens/batch | TFLOPS/GPU |
|---|---|---|---|---|---|
| 8,192 | 8 | 1 | 16 | 16M | 430 |
| 16,384 | 8 | 1 | 16 | 16M | 400 |
| 16,384 | 8 | 16 | 16 | 16M | 380 |
This table is probably in the paper as well.
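One way to read the table: the product of the tensor, context, and pipeline parallel degrees divides the GPU count, and whatever is left over is the data-parallel dimension; the TFLOPS column then implies a hardware utilization. A small sketch of that arithmetic, assuming the commonly cited ~989 TFLOPS dense BF16 peak for an H100 SXM (the peak figure is my assumption, not from the slide or the paper):

```python
# Sanity-check the parallelism layout in the table above.
# Data-parallel degree = GPUs / (TP * CP * PP).
configs = [
    # (gpus, tensor_parallel, context_parallel, pipeline_parallel, tflops_per_gpu)
    (8_192, 8, 1, 16, 430),
    (16_384, 8, 1, 16, 400),
    (16_384, 8, 16, 16, 380),
]

H100_BF16_PEAK_TFLOPS = 989  # assumed dense BF16 peak for an H100 SXM

for gpus, tp, cp, pp, tflops in configs:
    dp = gpus // (tp * cp * pp)
    mfu = tflops / H100_BF16_PEAK_TFLOPS
    print(f"{gpus:>6} GPUs -> data parallelism {dp:>3}, MFU ~{mfu:.0%}")
```

The last row (context parallelism 16) appears to correspond to the long-context training stage: data parallelism drops to 8 and per-GPU throughput falls accordingly.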
Footnotes
1. Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24. He showed a slide with hyperparameters which included 16,384 GPUs.