The datasets used in LLM training exist in several formats:
Raw training data are text, images, or audio files that have not yet been wedged into a shape that the model training framework can accept. For text-based data, this is often raw HTML exactly as it was scraped from the Internet.
Tokenized training data is significantly smaller than raw training data and is ready to be consumed by the model training process. For large language models, this means tokenized text, tokenized images, and tokenized audio.
As data is converted from raw to tokenized, it exists in various intermediate formats; for example, most open training datasets are distributed as collections of documents encoded, along with their metadata, in JSON or JSONL.
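As a minimal illustration of that intermediate form, each line of a JSONL file holds one document together with its metadata; the field names used below (id, text, metadata) are assumptions for illustration, since every open dataset defines its own schema.

```python
import json

# Sketch of reading a JSONL document collection; the field names are assumed.
with open("documents.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["id"], len(doc["text"]), doc.get("metadata", {}))
```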
Data quality
When training LLMs, it is now widely understood that data quality dramatically affects how quickly the model trains (see Fallibility and end of scaling). If you do a better job of processing your data before it is fed into the model, you need less of it (and less compute) to train a model to the same level of performance (or quality) as the same model trained on more, lower-quality data.
As a result, frontier models are now bootstrapping their training datasets using a couple of different methods:
- Generating synthetic data (see synthetic data) with a model that can already produce high-quality output. SLMs are trained this way, and this is the basis for distillation.
- Using smaller LLMs to curate the data that will ultimately be used to train a next-generation model. Meta described how they used Llama-2 to filter low-quality data out of the Llama-3 training set, but they kept humans in the loop to avoid propagating biases.1
Storage requirements
Raw data
The process of converting raw text-based training data into a high-quality collection of text documents has been documented as reducing the data volume by roughly 67×.2
Tokenized data
The average size of a single token (in bytes) is variable:
- According to OpenAI, a token in a typical English-language dataset is about four bytes.
- The Pile paper lists both tokens and words for its different component datasets, which works out to an average of 3.41 bytes per token.
- WanJuan-CC reported an average of 4.37 to 4.45 bytes per token.2
Assuming:
- OPT-175 was 4.44 bytes/token per the OPT-175 paper appendix C.2
- The Pile dataset is 3.41 bytes/token
- Everything else is OpenAI’s 4 bytes/token
We can estimate the size (in GB) of various LLM training datasets:
Dataset | Training tokens | Training Bytes (est) |
---|---|---|
Llama-3 | > 15 trillion | 60 TB (54.6 TiB) |
LLaMa-2 70B | 2.0 trillion | 8 TB (7.3 TiB) |
OpenELM | 1.5 trillion | 6 TB (5.5 TiB) |
OPT-175 | 180 billion | 800 GB (745 GiB) |
GPT-3 | 300 billion | 1.2 TB (1.1 TiB) |
The Pile | 260 billion | 890 GB (830 GiB) |
ROOTS/BLOOM | 341 billion | 1.6 TB (1.5 TiB) |
C4.en.noBlocklist | 156 billion | 1,077 GB (1,003 GiB)1 |
1 C4’s size is the size of the download (in TFDS format), not the size of the training tokens. I don’t know how much TFDS overhead is included here, but the bytes per token for C4 comes out very high (6.9), which indicates TFDS is very inefficient.
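The arithmetic behind these estimates is simply training tokens × average bytes per token. A minimal sketch using the assumptions above (anything without a paper-reported value defaults to OpenAI’s ~4 bytes/token):

```python
# Estimate dataset sizes as tokens × bytes/token, per the assumptions above.
datasets = {
    # name: (training tokens, assumed bytes per token)
    "Llama-3": (15e12, 4.0),
    "OPT-175": (180e9, 4.44),
    "The Pile": (260e9, 3.41),
}

for name, (tokens, bytes_per_token) in datasets.items():
    size = tokens * bytes_per_token
    print(f"{name}: {size / 1e12:.2f} TB ({size / 2**40:.2f} TiB)")
```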
Computational requirements
Text processing
Training large language models requires a significant amount of text data, and these data are often derived from massive amounts of HTML scraped from the Internet. Converting these web scrapes into high-quality tokenized datasets requires extensive data preprocessing, which typically happens on CPUs because they are good at processing large amounts of uneven, messy data in memory.
The process of converting web scrapes into clean, tokenized data is described in the following resources:
- The Dolma dataset used the CCNet processing pipeline
- Deduplicating Training Data Makes Language Models Better
- RedPajama-Data: The RedPajama-Data repository contains code for preparing large datasets for training large language models
- The WanJuan-CC subset of the Common Crawl dataset
The GPT-3 paper describes a very specific approach to data processing that relies on a combination of a few of Apache Spark’s built-in tools (sketched after the list below):
- Tokenizer
- HashingTF
- Spark’s MinHashLSH with ten hashes
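A minimal PySpark sketch of wiring those three components together for MinHash-based fuzzy deduplication follows; the toy DataFrame, the number of hash features, and the 0.3 Jaccard-distance threshold are assumptions for illustration, not values from the GPT-3 paper.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, MinHashLSH

spark = SparkSession.builder.appName("fuzzy-dedup-sketch").getOrCreate()

# Hypothetical schema: one row per document with an id and its raw text.
docs = spark.createDataFrame(
    [
        (0, "the quick brown fox jumps over the lazy dog"),
        (1, "the quick brown fox jumps over a lazy dog"),
        (2, "completely unrelated text about something else"),
    ],
    ["id", "text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
hashing_tf = HashingTF(inputCol="tokens", outputCol="features", numFeatures=1 << 18)
minhash = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=10)

featurized = hashing_tf.transform(tokenizer.transform(docs))
model = minhash.fit(featurized)

# Self-join to find candidate near-duplicate pairs, excluding self-matches and
# mirrored pairs; the 0.3 distance threshold is an assumption.
pairs = (
    model.approxSimilarityJoin(featurized, featurized, 0.3, distCol="jaccard_dist")
    .filter("datasetA.id < datasetB.id")
)
pairs.select("datasetA.id", "datasetB.id", "jaccard_dist").show()
```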
To identify contamination between the training dataset and benchmark datasets, they also looked for exact overlaps: documents sharing N-grams ranging from 8-grams to 13-grams.
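As a rough illustration of that kind of contamination check (whitespace tokenization and a single fixed n are simplifying assumptions, not the GPT-3 paper’s exact procedure):

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps_benchmark(train_text, benchmark_texts, n=13):
    """Flag a training document if it shares any n-gram with a benchmark."""
    benchmark_grams = set()
    for text in benchmark_texts:
        benchmark_grams |= ngrams(text.split(), n)
    return not ngrams(train_text.split(), n).isdisjoint(benchmark_grams)
```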
The MM1 paper cited both the GPT-3 paper and CCNet as representative of their text processing pipeline.
Multimodal dataset creation
The MM1 paper cites the OBELICS paper as representative of how they constructed datasets that interleave text and images for multimodal training. They filter images based on aspect ratio, size, and URI contents. They deduplicate based on URL and MD5 across documents (dropping images that appear more than ten times) and, within each document, retain only the first copy of a repeated image.
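A hedged sketch of that filtering and deduplication logic; the aspect-ratio and size bounds, the URI terms, and the (url, image_bytes) data layout are illustrative assumptions, not the OBELICS paper’s exact values.

```python
import hashlib
from collections import Counter

def keep_image(width, height, url):
    """Assumed filters on aspect ratio, minimum size, and URI contents."""
    aspect = width / height
    return (
        0.5 <= aspect <= 2.0                                     # aspect-ratio bounds (assumed)
        and min(width, height) >= 100                            # minimum dimension (assumed)
        and not any(term in url for term in ("logo", "icon"))    # URI filter (assumed terms)
    )

def dedup_images(documents):
    """documents: list of documents, each a list of (url, image_bytes) tuples."""
    # Count how often each image (by URL and by MD5) appears across all documents.
    counts = Counter()
    for doc in documents:
        for url, data in doc:
            counts[url] += 1
            counts[hashlib.md5(data).hexdigest()] += 1

    cleaned = []
    for doc in documents:
        seen_in_doc = set()
        kept = []
        for url, data in doc:
            digest = hashlib.md5(data).hexdigest()
            if counts[url] > 10 or counts[digest] > 10:
                continue  # drop images appearing more than ten times across documents
            if digest in seen_in_doc:
                continue  # keep only the first copy of a repeated image in a document
            seen_in_doc.add(digest)
            kept.append((url, data))
        cleaned.append(kept)
    return cleaned
```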
Synthetic data
See synthetic data.
Public datasets
In addition to those listed above, from AMD-Llama-135M:
SlimPajama is a deduplicated version of RedPajama and sources from Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia and StackExchange. We drop the Books data from SlimPajama due to license issues.