LLM training uses storage for two primary purposes:

  1. Training data, which is used to update model weights and converge the model
  2. Checkpointing, where the model weights themselves are saved from GPU memory

Requirements

Training data

Training transformers is extremely compute-intensive, so the storage intensity of training is comparatively small. Tokenized data, the binary form that models consume directly during training, amounts to a few bytes per token of English-language text, or a few terabytes per trillion tokens.

Leading models are trained on tens of trillions of tokens (see LLM training datasets), which amounts to dozens of terabytes of training data for text-based models.
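The arithmetic above can be sketched as a back-of-the-envelope calculation. The bytes-per-token and token-count figures below are illustrative assumptions, not measurements from any specific model:

```python
def dataset_size_tb(num_tokens: float, bytes_per_token: float = 2.0) -> float:
    """Estimate tokenized training data size in terabytes.

    Assumes a fixed number of bytes per token (2 here, e.g. uint16
    token IDs); adjust bytes_per_token for other encodings.
    """
    return num_tokens * bytes_per_token / 1e12

# A hypothetical 15-trillion-token text dataset at 2 bytes/token:
print(dataset_size_tb(15e12))  # 30.0 TB
```

At tens of trillions of tokens, this lands in the tens-of-terabytes range regardless of the exact byte width, which is why text training data is small relative to the compute involved.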

Checkpointing

The capacity required for LLM checkpointing scales with the size of the model, not the size of the training cluster.1

Model checkpoint sizes can be approximated by assuming 16 bytes per parameter (see LLM training memory requirements).
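Applying the 16-bytes-per-parameter rule of thumb, checkpoint size is a one-line calculation. The parameter count below is a hypothetical example, not a claim about any particular model:

```python
def checkpoint_size_tb(num_params: float, bytes_per_param: float = 16.0) -> float:
    """Estimate checkpoint size in terabytes.

    Uses the 16 bytes/parameter rule of thumb, which covers model
    weights plus optimizer state (see LLM training memory requirements).
    """
    return num_params * bytes_per_param / 1e12

# A hypothetical 70B-parameter model:
print(checkpoint_size_tb(70e9))  # 1.12 TB per checkpoint
```

Because this depends only on parameter count, adding more GPUs to a training run does not increase the size of each checkpoint, only (potentially) how often checkpoints are written.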

For more information, see checkpointing.

Footnotes

  1. A Checkpoint on Checkpoints in Large Language Models (vastdata.com)