The datasets used in LLM training exist in several formats:

Raw training data is the text, images, and audio files that have not yet been wedged into a shape that the model training framework can accept. For text-based data, this is often raw HTML exactly as it was scraped from the Internet.

Tokenized training data is significantly smaller than raw training data and is ready to be consumed by the model training process. For large language models, this means tokenized text, tokenized images, and tokenized audio.

As data is converted from raw to tokenized, it passes through various intermediate formats; for example, most open training datasets are distributed as collections of documents encoded, with metadata, in JSON or JSONL.
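To make the intermediate format concrete, here is a minimal sketch of reading a JSONL shard, where each line is one document. The field names (`text`, `meta`, `url`, `lang`) and the sample content are assumptions for illustration; real datasets each define their own schema.

```python
import io
import json

def read_jsonl(stream):
    """Yield one document (a dict) per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical two-document shard; the field names are illustrative only.
sample = io.StringIO(
    '{"text": "Hello world.", "meta": {"url": "https://example.com/a", "lang": "en"}}\n'
    '{"text": "Second document.", "meta": {"url": "https://example.com/b", "lang": "en"}}\n'
)

docs = list(read_jsonl(sample))

# Size of the document text alone, ignoring metadata and JSON overhead.
total_bytes = sum(len(d["text"].encode("utf-8")) for d in docs)
```

Because every line is an independent JSON object, a shard like this can be streamed or split across workers without parsing the whole file first.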

Data quality

When training LLMs, it is now widely understood that data quality dramatically affects how quickly the model trains (see Fallibility and end of scaling). If you do a better job of processing your data before it is fed into the model, you need less of it (and less compute) to reach the same level of quality as the same model trained on more, lower-quality data.

As a result, the developers of frontier models are now bootstrapping their training datasets using a couple of different methods:

  • Training on synthetic data (see synthetic data) produced by a model that can generate high-quality output. SLMs (small language models) are trained this way, and this is the basis for distillation.
  • Using smaller LLMs to curate the data that will ultimately be used to train a next-generation model. Meta described how they used Llama-2 to filter low-quality data out of the Llama-3 training set, while keeping humans in the loop to avoid propagating biases.¹
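The curation idea can be sketched as a simple score-and-threshold filter. This is not Meta's pipeline: `toy_score` is a hypothetical stand-in for whatever model produces the quality score, and the threshold is arbitrary.

```python
def filter_by_quality(docs, score_fn, threshold):
    """Keep only the documents whose quality score meets the threshold."""
    return [d for d in docs if score_fn(d) >= threshold]

def toy_score(doc):
    """Stand-in scorer: fraction of characters that are letters or whitespace.

    In a real pipeline this would be a call to a smaller language model
    or a trained classifier; the heuristic here is purely illustrative.
    """
    text = doc["text"]
    if not text:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in text)
    return good / len(text)

docs = [
    {"text": "A clean sentence about training data"},
    {"text": "$$$ >>> ### 1234 !!!"},
]
kept = filter_by_quality(docs, toy_score, threshold=0.8)
```

Keeping the scorer as a plain callable makes it easy to swap the heuristic for a model-based score, or to log borderline documents for human review.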

Storage requirements

Raw data

The process of converting raw text-based training data to a high-quality collection of text documents has been documented as a 67× reduction.²

Tokenized data

The average size of a single token (in bytes) is variable:

  • According to OpenAI, a token in a typical English-language dataset is about four bytes.
  • The Pile paper reports both token and word counts for its constituent datasets and found an average of 3.41 bytes per token.
  • WanJuan-CC reported an average of 4.37 to 4.45 bytes per token.²

Assuming:

  • OPT-175 was 4.44 bytes/token, per Appendix C.2 of the OPT-175 paper
  • The Pile dataset is 3.41 bytes/token
  • Everything else is OpenAI’s 4 bytes/token

We can estimate the size (in GB) of various LLM training datasets:

Dataset            | Training tokens | Training bytes (est.)
Llama-3            | > 15 trillion   | 60 TB (54.6 TiB)
LLaMa-2 70B        | 2.0 trillion    | 8 TB (7.3 TiB)
OpenELM            | 1.5 trillion    | 6 TB (5.5 TiB)
OPT-175            | 180 billion     | 800 GB (745 GiB)
GPT-3              | 300 billion     | 1.2 TB (1.1 TiB)
The Pile           | 260 billion     | 890 GB (830 GiB)
ROOTS/BLOOM        | 341 billion     | 1.6 TB (1.5 TiB)
C4.en.noBlocklist  | 156 billion     | 1,077 GB (1,003 GiB)¹

¹ C4's size is the size of the download (in TFDS format), not the size of the training tokens. I don't know how much TFDS overhead is included here, but the bytes per token for C4 comes out very high (6.9), which suggests the TFDS encoding adds considerable overhead.
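The arithmetic behind these estimates is easy to check. The sketch below multiplies a token count by an assumed bytes-per-token figure and reports the result in both decimal (TB) and binary (TiB) units; the function name is hypothetical, and the inputs are the token counts and ratios quoted above.

```python
def estimate_size(tokens, bytes_per_token=4.0):
    """Return (terabytes, tebibytes) for a token count.

    The default of 4 bytes/token is the rule of thumb quoted above,
    not a measured value for any particular tokenizer.
    """
    total_bytes = tokens * bytes_per_token
    return total_bytes / 1e12, total_bytes / 2**40

# Llama-3: more than 15 trillion tokens at the assumed 4 bytes/token.
llama3_tb, llama3_tib = estimate_size(15e12)

# The Pile: 260 billion tokens at 3.41 bytes/token.
pile_tb, pile_tib = estimate_size(260e9, bytes_per_token=3.41)

# Going the other way: implied bytes per token for the C4 download.
c4_bytes_per_token = 1077e9 / 156e9

print(f"Llama-3: {llama3_tb:.0f} TB ({llama3_tib:.1f} TiB)")
print(f"The Pile: {pile_tb * 1000:.0f} GB ({pile_tib * 1024:.0f} GiB)")
print(f"C4: {c4_bytes_per_token:.1f} bytes/token")
```

The gap between the TB and TiB figures grows with each unit prefix (about 9% at the tera scale), which is why both are worth reporting for storage planning.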

Computational requirements

Text processing

Training large language models requires a significant amount of text data, and this data is often derived from massive amounts of HTML scraped from the Internet. Converting these web scrapes into high-quality tokenized datasets requires extensive preprocessing, which typically happens on CPUs because they are good at processing large amounts of uneven, messy data in memory.

The process of converting web scrapes into clean, tokenized data is described in the following resources:

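The GPT-3 paper (Appendix A) describes a very specific approach to data processing that relies on a combination of a few Apache Spark built-in tools: the standard tokenizer and HashingTF to build features for a quality classifier, and MinHashLSH (with 10 hashes) for fuzzy deduplication.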

To identify overlaps between the training dataset and benchmark datasets, they also flagged documents that shared an exact N-gram with a benchmark, with N ranging from 8 to 13.
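A minimal sketch of this kind of exact N-gram overlap check follows. It assumes word-level N-grams over lowercased, whitespace-split text, which is a simplification; the function names are hypothetical, and how N is chosen within the 8 to 13 range for a given benchmark is left to the caller.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def has_overlap(train_doc, benchmark_doc, n=13):
    """True if the two documents share at least one exact n-gram."""
    if not 8 <= n <= 13:
        raise ValueError("n is expected to fall between 8 and 13")
    return bool(ngrams(train_doc, n) & ngrams(benchmark_doc, n))

benchmark = "the quick brown fox jumps over the lazy dog near the river bank"
contaminated = "yesterday we saw that " + benchmark + " again"
clean = "an unrelated sentence about tokenizers and storage sizes for large datasets"

overlap_found = has_overlap(contaminated, benchmark, n=8)
overlap_clean = has_overlap(clean, benchmark, n=8)
```

Note that any two documents sharing a 13-gram also share an 8-gram, so a smaller N is the stricter filter: it flags more documents, at the cost of more false positives on common phrases.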

The MM1 paper cited both the GPT-3 paper and CCNet as representative of their text processing pipeline.

Multimodal dataset creation

The MM1 paper cites the OBELICS paper as representative of how they constructed datasets that interleave text and images for multimodal training. They specifically filter images based on aspect ratio, size, and URI contents. They deduplicate across documents by URL and MD5 hash, dropping images that appear more than ten times, and within each document they retain only the first copy of any repeated image.
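The filtering and deduplication steps can be sketched as follows. This is an illustration under stated assumptions, not the OBELICS or MM1 implementation: the function names, the size and aspect-ratio thresholds, and the banned URL substrings are all hypothetical, and only the "more than ten times" cutoff comes from the description above.

```python
import hashlib
from collections import Counter

def keep_image(width, height, url, min_side=150, max_aspect=2.0,
               banned=("logo", "icon")):
    """Filter on size, aspect ratio, and URL contents (thresholds are hypothetical)."""
    if min(width, height) < min_side:
        return False
    if max(width, height) / min(width, height) > max_aspect:
        return False
    return not any(word in url.lower() for word in banned)

def dedup_images(documents, max_occurrences=10):
    """Deduplicate images given as lists of (url, image_bytes) per document.

    Images whose URL or MD5 appears more than max_occurrences times across
    all documents are dropped; within a document only the first copy stays.
    """
    url_counts, md5_counts = Counter(), Counter()
    for doc in documents:
        for url, data in doc:
            url_counts[url] += 1
            md5_counts[hashlib.md5(data).hexdigest()] += 1

    result = []
    for doc in documents:
        seen_urls, seen_md5s, kept = set(), set(), []
        for url, data in doc:
            digest = hashlib.md5(data).hexdigest()
            too_common = (url_counts[url] > max_occurrences
                          or md5_counts[digest] > max_occurrences)
            if too_common or url in seen_urls or digest in seen_md5s:
                continue
            seen_urls.add(url)
            seen_md5s.add(digest)
            kept.append((url, data))
        result.append(kept)
    return result

# One document with a repeated image, plus eleven documents sharing one image.
documents = [[("u1", b"a"), ("u1", b"a"), ("u2", b"b")]]
documents += [[("spam", b"s")] for _ in range(11)]
deduped = dedup_images(documents)
```

The two-pass structure (count globally, then filter per document) keeps the logic simple; at web scale the counting pass would more likely be a distributed aggregation than an in-memory `Counter`.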

Synthetic data

See synthetic data.

Public datasets

In addition to those listed above, from AMD-Llama-135M:

SlimPajama is a deduplicated version of RedPajama and sources from Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia and StackExchange. We drop the Books data from SlimPajama due to license issues;

Footnotes

  1. Balaji, Herding Llamas: A Sneak Peek Into Meta’s Infrastructure for Generative AI. SC’24.

  2. WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset. arXiv:2402.19282.