A vision transformer (ViT) is an encoder-only transformer that receives images as inputs and generates embeddings as outputs. It does to images what an embedding lookup table (dictionary) does for text models; it maps raw input (images) into the multidimensional space that transformers operate in.
The ViT (or some other visual encoder) is an essential part of multimodal models and VLMs.
Examples
Flow-matching decoder
See https://arxiv.org/abs/2210.02747, which is the approach to visual inputs employed by TML-Interaction-Small, Thinking Machines Lab’s real-time multimodal conversational model.1
Footnotes
-
https://info.deeplearning.ai/e3t/Ctc/LX+113/cJhC404/MXk8ZC34GVFW79T-4R5GVlGxVwDG2D5Pn3f-N3mxJ7j5nR3bW8wM7ks6lZ3lkMtsl5b6nchgW2VgCfs9cB1YTW7lR6D38-M5TGW108dPG15k7mNW83k9X975X5RwW3yTj0T1plxGBW8KmPTC25zTTZW7RgvHF3j48GTVmh6sH3lc_8qMW40j-NS-FcW3KVG045ylZDbW75-YTd1sFQpyW1Q8N1Q2XV4rPW1kTFnD3GTmfJW3rfz_x8Kf0bgW37sgj55p-rClW2Vm2zC2Y50MgW5324bb1zVLKbW26Q2Kx8-B8wfW4shWsb6MWRJYW7yNnfw4LMX85W4Gj_S24f026FW7zwhr847zw4lW2VdWKB9jlNx9W8t8jBm5x3yjsW1PdwqQ4Y3ZGvW2989D06VgbkFW6v3X_G8RhvxcW8KB1pw2DkZ97W87V0GK5mY-cpW8FkhFN1rhNh_N6cXNt81MJ8pN5KwQznw47nfW8Syt7F3CBr03W7xHCz93mwZF5W8108k84PpwScW49ZcVz85HNp1VD_pG_1WRC2hVbG6Vn2CLLfjN5FXvDG-BvdXW95sXSW9kS48BW6WjWcy7fl7SZN3_jdRG5378wW357L1t7l_nhxf90C40j04 ↩