OPT-175B is the largest of Meta’s Open Pretrained Transformer models, released in May 2022 and trained at the end of 2021.

It was trained on 992[1] or 1024[2] NVIDIA A100 80GB GPUs running on Azure[3] over the course of 56 days.[4] Meta released a 114-page logbook documenting the work of the on-call engineers who kept the training running during that time, offering a unique view into LLM training at scale.

Efficiency

The model achieved 147 TFLOP/s per GPU[5] and required 4.30E+23 FLOPs in total.[2] Doing some math, the total compute time works out to roughly:

4.30E+23 FLOPs ÷ (1024 GPUs × 147 TFLOP/s per GPU) ≈ 2.86E+6 seconds ≈ 33 days (≈ 34 days with 992 GPUs)

The training began on November 5, 2021 and ended on January 6, 2022, for a total of 63 calendar days.[6] However, Susan Zhang said it was trained over 56 days,[4] and I haven’t scoured the logbook to see where the missing week went. This means the overall job uptime was between 51.7% and 58.9%.
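
For anyone who wants to reproduce the back-of-the-envelope numbers, here is a minimal sketch of the arithmetic in Python. It uses the FLOP and throughput figures from footnotes 2 and 5 and tries both candidate GPU counts and both calendar spans; the exact uptime bounds shift slightly depending on which pair you plug in.

```python
# Back-of-the-envelope check of OPT-175B compute time and job uptime.
TOTAL_FLOPS = 4.30e23          # total training FLOPs (footnote 2)
PER_GPU_THROUGHPUT = 147e12    # achieved FLOP/s per GPU (footnote 5)
SECONDS_PER_DAY = 86_400

for num_gpus in (992, 1024):               # paper vs. final_update.md GPU counts
    compute_seconds = TOTAL_FLOPS / (num_gpus * PER_GPU_THROUGHPUT)
    compute_days = compute_seconds / SECONDS_PER_DAY
    for calendar_days in (56, 63):         # Susan Zhang's figure vs. logbook dates
        uptime = compute_days / calendar_days
        print(f"{num_gpus} GPUs over {calendar_days} calendar days: "
              f"{compute_days:.1f} compute days, {uptime:.1%} uptime")
```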

Footnotes

  1. [2205.01068] OPT: Open Pre-trained Transformer Language Models (arxiv.org)

  2. metaseq/projects/OPT/chronicles/final_update.md at main · facebookresearch/metaseq (github.com)

  3. The logbook refers to “CSP” and “cloud,” but the Baselines Logbook refers to the same tools with “cloud” replaced by “azure” (fixmyazure, --full-azure-upload-path). The logbook is also full of Azure-specific terms, including blobs.

  4. about me | Susan Zhang (suchenzang.github.io)

  5. Democratizing access to large-scale language models with OPT-175B (meta.com)

  6. metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq (github.com)