Multimodal models are LLMs that can input and/or output multiple modes of data: text, images, audio, and others.
Training
Training on text tokens is straightforward because the vocabulary is fixed a priori, and tokens are stored as 2-byte or 4-byte integers that index into the vocabulary. This means that 1 trillion tokens will always be 2 TB or 4 TB of data that must be loaded into GPUs.
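The arithmetic above can be sketched directly; the dtype names in the comments are just the usual fixed-width integer types that correspond to 2-byte and 4-byte ids:

```python
# Storage required for 1 trillion pre-tokenized text tokens.
TOKENS = 1_000_000_000_000  # 1 trillion

for bytes_per_token in (2, 4):  # e.g. uint16 vs uint32 token ids
    total_tb = TOKENS * bytes_per_token / 1e12
    print(f"{bytes_per_token}-byte ids -> {total_tb:.0f} TB")
# 2-byte ids -> 2 TB
# 4-byte ids -> 4 TB
```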
Training on visual tokens (see VLM) is more complex because of two factors:
- There is no fixed vocabulary for visual tokens; tokens are continuous values. This means you cannot preprocess visual training data into simple 2-byte or 4-byte vocabulary indices; you must load continuous-valued input data into GPUs during training.
- Vision encoders, which convert images into tokens, are usually trained along with the language model. This means they need images as input during training so that they learn how to respond when given images at inference time.
As a result, instead of a predictable bytes-per-token payload being read during training, there is a less predictable bytes-per-image payload that must be read inline with the training loop. Each image is then processed and encoded into tokens.
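A minimal sketch of why this payload cannot be precomputed into fixed-width ids. All names here (`load_image_bytes`, `decode_and_encode`) are hypothetical placeholders, not a real API; the byte-size range is made up to simulate on-disk variability:

```python
import random

def load_image_bytes(path):
    # Image files vary widely in size on disk; simulate that variability.
    return random.randbytes(random.randrange(50_000, 500_000))

def decode_and_encode(raw):
    # Stand-in for decode -> vision encoder. The output is a list of
    # continuous embedding vectors, so there is no small integer id
    # that could have been stored ahead of time.
    n_tokens, dim = 256, 16  # fixed-count encoder, as in older models
    return [[0.0] * dim for _ in range(n_tokens)]

raw = load_image_bytes("example.jpg")
tokens = decode_and_encode(raw)
print(len(raw), "bytes read inline ->", len(tokens), "continuous tokens")
```

The point of the sketch is the shape of the data flow: a variable-size byte payload is read inside the training loop, then expanded into dense floating-point tokens on the fly.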
Older models used vision encoders that generated the same number of tokens per image regardless of image size, which made the bytes read per token very unpredictable. For example, Llama-3.1 resized all images to fit within at most four tiles of 336x336 pixels, arranged in 4x1, 2x2, or 1x4 configurations to support different aspect ratios.
Newer models like Kimi K2.5 use techniques that generate more tokens for higher-resolution images, resulting in a more predictable ratio of bytes-per-token of visual data.
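The contrast can be illustrated with two toy token-count functions. The tile and patch sizes below are illustrative (the 4-tile/256-patch numbers echo the Llama-3.1 quotes further down; the 28-pixel patch for the native-resolution case is an assumption, not a documented Kimi K2.5 value):

```python
import math

def tokens_fixed_tiles(w, h, tiles=4, patches_per_tile=256):
    # Older, tile-based style: every image is resized into the same
    # tile budget, so token count is constant regardless of resolution.
    return tiles * patches_per_tile

def tokens_native_res(w, h, patch=28):
    # Newer style (illustrative): patchify at near-native resolution,
    # so higher-resolution images yield proportionally more tokens.
    return math.ceil(w / patch) * math.ceil(h / patch)

for w, h in [(336, 336), (1344, 1344)]:
    px_bytes = w * h * 3  # raw RGB payload
    print(f"{w}x{h}: bytes/token fixed={px_bytes / tokens_fixed_tiles(w, h):.0f}, "
          f"native={px_bytes / tokens_native_res(w, h):.0f}")
```

With the fixed-count encoder, bytes-per-token grows 16x when the image does; with resolution-proportional tokenization, it stays constant, which is the predictability the text describes.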
Llama 3 anecdotes
For Llama-3.1 vision pretraining:
- “We pre-train our image adapter on our dataset of ∼6B image-text pairs”
- “we resize all images to fit within at most four tiles of 336 x336 pixels each, where we arrange the tiles to support different aspect ratios”
- “the image encoder produces a 7680-dimensional representation for each of the resulting 16 x 16 = 256 patches.”
- “on average, images have more tokens than the associated text: an image has 2,308 tokens, whereas the associated text contains an average of only 192 tokens.”
- “The cross-attention layers introduce substantial numbers of additional trainable parameters into the model: for Llama 3 405B, the cross-attention layers have ≈100B parameters.” These extra cross-attention layer parameters are added on top of the base 405B language model.
- “We uniformly sample 16 frames from the full video, and represent each frame using four chunks, each of size of 448 x448 pixels.” and “We use a global batch size of 4,096, a sequence length of 190 tokens,”
So a Llama 3 vision pretraining example is 2,500 tokens (2,308 image + 192 text) of compute, derived from one image at up to 4x336x336x3 RGB pixels.
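Checking the arithmetic from the quotes above (token counts and tile sizes are taken directly from the text; the bytes-per-token ratio is derived, not quoted):

```python
# Per-example token budget from the Llama 3 quotes.
image_tokens, text_tokens = 2_308, 192
print(image_tokens + text_tokens)  # 2500 tokens of compute per example

# Maximum raw RGB payload for one image: 4 tiles of 336x336 pixels, 3 channels.
tiles, tile_px, channels = 4, 336, 3
raw_bytes = tiles * tile_px * tile_px * channels
print(raw_bytes)  # 1354752 bytes (~1.35 MB) per image at the tile budget
print(round(raw_bytes / image_tokens))  # ~587 bytes of pixel data per image token
```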
Meta went on to release these Llama 3.1 models with vision transformers and adapters as Llama 3.2.
Optimization
During multimodal inference, as query inputs become more image-heavy, prefill time is overtaken by image preprocessing time.1 To achieve high resource utilization, preprocessing must be disaggregated by mode, since prefill and image processing require different hardware.
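A hedged sketch of what mode-disaggregated serving looks like in miniature: image preprocessing runs in its own worker (CPU-bound decode/resize) and feeds a separate prefill stage (GPU-bound) through a queue, so neither stalls the other. All names and the string transforms are illustrative, not a real serving framework:

```python
import queue
import threading

preprocess_q = queue.Queue()
prefill_q = queue.Queue()
results = []

def preprocess_worker():
    # Would run on CPU hosts: decode bytes -> pixel tensors (simulated).
    while (item := preprocess_q.get()) is not None:
        prefill_q.put(f"pixels({item})")
    prefill_q.put(None)  # propagate shutdown downstream

def prefill_worker():
    # Would run on GPU hosts: encode pixels + run prefill (simulated).
    while (item := prefill_q.get()) is not None:
        results.append(f"kv_cache({item})")

threads = [threading.Thread(target=preprocess_worker),
           threading.Thread(target=prefill_worker)]
for t in threads:
    t.start()
for img in ["img0.jpg", "img1.jpg"]:
    preprocess_q.put(img)
preprocess_q.put(None)  # shutdown signal
for t in threads:
    t.join()
print(results)
```

The queue boundary is the point of the design: because the two stages only share a queue, each side can be scaled, scheduled, and placed on hardware independently.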