Vision Language Models (VLMs) are a family of model architectures that operate on both text and images. IBM has a great summary of VLMs. In brief, they pair a language model (a transformer) with a vision encoder (formerly CNNs; now usually vision transformers) to combine textual and visual information.

What’s not clear to me:

How are the vision and language encoders combined into a single model? Or are they? Are there separate attention heads or experts exclusively for each modality? Do they share the same FFN?

How are the different modalities embedded and trained? Do they share the same vocabulary or latent space?
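
My rough mental model of one common design (LLaVA-style, as far as I understand it) is sketched below: the vision encoder's patch features get projected into the language model's embedding space and concatenated with the text token embeddings, so both modalities flow through the same attention layers and FFNs. Everything in the sketch (dimensions, module choices, names) is made up for illustration and may not match any particular model.

```python
# Minimal sketch of one common VLM pattern ("early fusion" / LLaVA-style):
# a vision encoder turns the image into patch features, a small projection
# maps them into the language model's embedding space, and the projected
# patches are prepended to the text tokens before the transformer runs.
# All dimensions and modules here are illustrative stand-ins.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    def __init__(self, vocab_size=32_000, d_text=512, d_vision=768,
                 n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in for a ViT: in a real model this is a pretrained vision transformer.
        self.vision_encoder = nn.Sequential(
            nn.Linear(3 * 16 * 16, d_vision),  # flattened 16x16 RGB patches
            nn.GELU(),
            nn.Linear(d_vision, d_vision),
        )
        # The "connector": projects vision features into the text embedding space.
        self.projector = nn.Linear(d_vision, d_text)
        # Stand-in for the language model (decoder-only transformer in practice).
        self.token_embedding = nn.Embedding(vocab_size, d_text)
        layer = nn.TransformerEncoderLayer(d_model=d_text, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_text, vocab_size)

    def forward(self, image_patches, text_tokens):
        # image_patches: (batch, n_patches, 3*16*16); text_tokens: (batch, seq_len)
        vision_feats = self.vision_encoder(image_patches)        # (B, P, d_vision)
        vision_embeds = self.projector(vision_feats)             # (B, P, d_text)
        text_embeds = self.token_embedding(text_tokens)          # (B, T, d_text)
        # Image "tokens" and text tokens share one sequence and one set of
        # attention layers; only the embedding path differs per modality.
        fused = torch.cat([vision_embeds, text_embeds], dim=1)   # (B, P+T, d_text)
        hidden = self.transformer(fused)
        return self.lm_head(hidden[:, vision_embeds.size(1):])   # logits over text positions


model = ToyVLM()
patches = torch.randn(1, 196, 3 * 16 * 16)
tokens = torch.randint(0, 32_000, (1, 10))
logits = model(patches, tokens)  # (1, 10, 32000)
```

Other designs (e.g., Flamingo-style gated cross-attention into a frozen language model) keep the modalities more separate, so this is just one point in the design space.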

NVIDIA has an AI Blueprint for Video Search and Summarization that explains how VLMs are integrated with other types of models to derive insight from video streams.
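
I haven't gone through the blueprint itself yet, but the general shape of that kind of pipeline seems to be: split the video into chunks, have a VLM caption each chunk, then have a text-only LLM aggregate the captions (and answer queries over them). A hand-wavy sketch, with placeholder functions that are not NVIDIA's APIs:

```python
# Hand-wavy sketch of a video search & summarization pipeline: sample frames
# per chunk, caption each chunk with a VLM, then let an LLM condense the
# per-chunk captions into a single summary. caption_chunk / summarize are
# placeholders standing in for real model calls.
from dataclasses import dataclass

@dataclass
class VideoChunk:
    start_s: float   # chunk start time in seconds
    end_s: float     # chunk end time in seconds
    frames: list     # sampled frames (e.g., a handful per chunk)

def caption_chunk(chunk: VideoChunk) -> str:
    """Placeholder for a VLM call: 'describe what happens in these frames'."""
    return f"[VLM caption for {chunk.start_s:.0f}-{chunk.end_s:.0f}s]"

def summarize(captions: list) -> str:
    """Placeholder for an LLM call that merges per-chunk captions into one summary."""
    return "Summary: " + " ".join(captions)

def summarize_video(chunks: list) -> str:
    # Dense captioning: one VLM pass per chunk keeps each context window small.
    captions = [caption_chunk(c) for c in chunks]
    # Aggregation: a text-only LLM reasons over the captions (and could also
    # serve search queries against them, e.g., via a vector index).
    return summarize(captions)

chunks = [VideoChunk(i * 10.0, (i + 1) * 10.0, frames=[]) for i in range(3)]
print(summarize_video(chunks))
```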