Vision Language Models (VLMs) are a family of model architectures that operate on both text and images. IBM has a great summary of VLMs. In brief, they pair a language encoder (typically a transformer) with a vision encoder (historically a CNN, now usually a vision transformer) to combine textual and visual information.

Model architecture

What’s not clear to me:

How are different modalities embedded and trained? Do they share the same vocabulary or latent space?

Vision transformers usually do not have a vocabulary like text transformers:

  • Text tokens are scalar integers (token IDs) that index into a vocabulary. That vocabulary maps each token ID to a vector in the embedding space.
  • Visual tokens are vectors in an embedding space. They are not discrete token IDs; a vision encoder produces them directly, for example by projecting image patches (see the sketch after this list).
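
To make that distinction concrete, here is a minimal PyTorch sketch. The dimensions, token IDs, and patch size are illustrative rather than taken from any particular model: text tokens go through a vocabulary lookup table, while ViT-style visual tokens come from projecting image patches, with no vocabulary involved.

```python
import torch
import torch.nn as nn

# --- Text: discrete token IDs index into a learned vocabulary table ---
vocab_size, d_text = 32_000, 768                     # illustrative sizes
text_embed = nn.Embedding(vocab_size, d_text)

token_ids = torch.tensor([[101, 2054, 2003, 102]])   # (batch, seq_len) integer IDs
text_tokens = text_embed(token_ids)                  # (1, 4, 768) continuous vectors

# --- Vision: continuous patch vectors come straight from an encoder ---
# ViT-style patchify: split the image into 16x16 patches and project each
# patch to an embedding vector; there is no vocabulary lookup.
patch_size, d_vision = 16, 768
patchify = nn.Conv2d(3, d_vision, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)                         # (batch, channels, H, W)
vision_tokens = patchify(image).flatten(2).transpose(1, 2)  # (1, 196, 768)

print(text_tokens.shape, vision_tokens.shape)
```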

How are the vision and language encoders combined into a single model? Or are they? Are there separate attention heads or experts exclusively for each modality? Do they share the same FFN?

Text and vision each have their own embedding spaces. Multimodal models combine text and vision tokens through a linear projection (multiplying each embedding by a learned weight matrix) or a simple multilayer perceptron that maps each type of token into a shared latent space. For example, CLIP1 used “a linear projection to map from each encoder’s representation to the multi-modal embedding space.”
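
Here is a rough sketch of that idea. The dimensions are made up and the module names are my own, not CLIP's actual code: each modality gets its own learned map into a shared embedding space, either a single linear layer (as CLIP describes) or a small MLP.

```python
import torch
import torch.nn as nn

# Hypothetical encoder output sizes; real models vary.
d_vision, d_text, d_shared = 1024, 768, 512

# CLIP-style: one learned linear projection per modality maps each
# encoder's output into the shared multi-modal embedding space.
vision_proj = nn.Linear(d_vision, d_shared, bias=False)
text_proj = nn.Linear(d_text, d_shared, bias=False)

# The MLP alternative mentioned above: a small connector network
# instead of a single matrix multiply.
vision_mlp = nn.Sequential(
    nn.Linear(d_vision, d_shared),
    nn.GELU(),
    nn.Linear(d_shared, d_shared),
)

vision_features = torch.randn(1, 196, d_vision)   # tokens from a vision encoder
text_features = torch.randn(1, 12, d_text)        # tokens from a text encoder

shared_vision = vision_proj(vision_features)      # (1, 196, 512)
shared_text = text_proj(text_features)            # (1, 12, 512)
```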

Applications

NVIDIA has an AI Blueprint for Video Search and Summarization that explains how VLMs are integrated with other types of models to derive insight from video streams.

Footnotes

  1. See the CLIP paper: Learning Transferable Visual Models From Natural Language Supervision