Multimodal models are LLMs that can take in and/or generate multiple modes of data: text, images, audio, and others.

Training

I think multimodal training is pretty straightforward. All input data from every mode is converted into tokens prior to training: text via a tokenizer, images typically via patch embeddings, and so on. Once training begins, tokens are tokens.
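
To make "tokens are tokens" concrete, here is a minimal sketch of how mixed-modal input might be flattened into one sequence before training. The names (`patchify`, `tokenize`) and the patch-based image handling are illustrative assumptions, not any specific model's pipeline.

```python
# Illustrative sketch: text and images both become entries in a single
# token sequence; downstream training code never cares which mode a
# token came from. All function names here are hypothetical.

def patchify(image, patch=4):
    """Split an H x W image (a list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(i, min(i + patch, h))
                            for c in range(j, min(j + patch, w))])
    return patches

def tokenize(parts):
    """Interleave text tokens and image patch tokens into one sequence."""
    seq = []
    for part in parts:
        if isinstance(part, str):
            seq.extend(("text", word) for word in part.split())
        else:  # assume anything non-string is a 2-D image
            seq.extend(("image", p) for p in patchify(part))
    return seq

img = [[0] * 8 for _ in range(8)]                 # dummy 8x8 "image"
seq = tokenize(["describe this image:", img])     # 3 words + 4 patches
```

In a real system the image patches would pass through a learned vision encoder rather than raw flattening, but the shape of the idea is the same: one interleaved sequence in, one stream of tokens out.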

Inferencing

Multimodal inferencing is much more complex, because different input modes require different types of preprocessing, which runs inline with response generation.

As query inputs become more image-heavy, image preprocessing time overtakes prefill time.1 To achieve high resource utilization, preprocessing must be disaggregated by mode, since prefill and image preprocessing run best on different hardware.
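
A toy sketch of that disaggregation: image preprocessing runs in its own worker pool and hands finished requests to a separate prefill pool, so the (nominally GPU-bound) prefill stage never blocks on CPU-side image work. This is a single-process simulation with queues and threads; the request fields and worker split are illustrative assumptions, not the design from the ModServe paper cited below.

```python
# Minimal sketch of modality-aware disaggregation: one queue/worker per
# stage, so each stage can be scaled on its own hardware independently.
import queue
import threading

preprocess_q = queue.Queue()   # image-heavy work (CPU pool)
prefill_q = queue.Queue()      # token-level work (GPU pool)
results = {}

def image_worker():
    while True:
        req = preprocess_q.get()
        if req is None:                      # sentinel: shut down
            break
        # Stand-in for decode/resize/encode: emit one token per patch.
        req["tokens"] = [f"img_patch_{i}" for i in range(req["patches"])]
        prefill_q.put(req)                   # hand off to the next stage

def prefill_worker():
    while True:
        req = prefill_q.get()
        if req is None:
            break
        results[req["id"]] = len(req["tokens"])  # stand-in for prefill

workers = [threading.Thread(target=image_worker),
           threading.Thread(target=prefill_worker)]
for t in workers:
    t.start()

preprocess_q.put({"id": "r1", "patches": 4})
preprocess_q.put({"id": "r2", "patches": 2})
preprocess_q.put(None)
workers[0].join()          # all image work handed off before we stop prefill
prefill_q.put(None)
workers[1].join()
```

Because the two stages only communicate through a queue, each pool could live on different machines with the queue replaced by RPC, which is the point of disaggregating by mode.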

Footnotes

  1. [2502.00937] ModServe: Modality- and Stage-Aware Resource Disaggregation for Scalable Multimodal Model Serving