OPT-175B is the largest of Meta’s Open Pretrained Transformer models, released in May 2022 and trained at the end of 2021.

It was trained on 992[1] or 1024[2] NVIDIA A100 80GB GPUs running on Azure[3] over the course of 56 days.[4] Meta released a 114-page logbook documenting the work of the on-call engineers who kept the training running during that time, offering a unique view into LLM training at scale.

Efficiency

The model achieved 147 TFLOP/s per GPU[5] and required 4.30E+23 FLOPs in total.[2] Doing some math, the total compute time works out to roughly:

4.30E+23 FLOPs ÷ (1024 GPUs × 147 TFLOP/s per GPU) ≈ 2.86E+6 seconds ≈ 33 days (≈ 34 days with 992 GPUs)

The training began on November 5, 2021 and ended on January 6, 2022, for a total of 63 calendar days.[6] However, Susan Zhang said it was trained over 56 days,[4] and I haven’t scoured the logbook to see where the missing week went. This means the overall job uptime was between 51.7% and 58.9%.
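
For anyone who wants to reproduce the back-of-the-envelope numbers, here is a minimal sketch of the arithmetic in Python. It uses the FLOP and throughput figures from footnotes 2 and 5 and tries both candidate GPU counts and both calendar spans; the exact uptime bounds shift slightly depending on which pair you plug in.

```python
# Back-of-the-envelope check of OPT-175B compute time and job uptime.
TOTAL_FLOPS = 4.30e23          # total training FLOPs (footnote 2)
PER_GPU_THROUGHPUT = 147e12    # achieved FLOP/s per GPU (footnote 5)
SECONDS_PER_DAY = 86_400

for num_gpus in (992, 1024):               # paper vs. final_update.md GPU counts
    compute_seconds = TOTAL_FLOPS / (num_gpus * PER_GPU_THROUGHPUT)
    compute_days = compute_seconds / SECONDS_PER_DAY
    for calendar_days in (56, 63):         # Susan Zhang's figure vs. logbook dates
        uptime = compute_days / calendar_days
        print(f"{num_gpus} GPUs over {calendar_days} calendar days: "
              f"{compute_days:.1f} compute days, {uptime:.1%} uptime")
```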

Footnotes

  1. [2205.01068] OPT: Open Pre-trained Transformer Language Models (arxiv.org)

  2. metaseq/projects/OPT/chronicles/final_update.md at main · facebookresearch/metaseq (github.com)

  3. The logbook refers to “CSP” and “cloud,” but the Baselines Logbook refers to the same tools with “cloud” replaced by “azure” (fixmyazure, --full-azure-upload-path). The logbook is also full of Azure-specific terms, including blobs.

  4. about me | Susan Zhang (suchenzang.github.io)

  5. Democratizing access to large-scale language models with OPT-175B (meta.com)

  6. metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq (github.com)