vision transformer

A vision transformer (ViT) is an encoder-only transformer that takes images as input and produces embeddings as output. It does for images what an embedding lookup table (dictionary) does for text: it maps raw input (here, images) into the multidimensional space that transformers operate in.
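As a rough illustration of that mapping, here is a minimal sketch of the input stage of a ViT: the image is cut into fixed-size patches, and each flattened patch is linearly projected into the embedding space. It assumes NumPy, random placeholder weights instead of trained ones, and hypothetical names (`patchify`, `embed_patches`, `embed_dim`). The class token and the transformer encoder layers that follow this stage are left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(image, patch_size, projection, position_embeddings):
    """Project each flattened patch to an embedding and add position information."""
    patches = patchify(image, patch_size)                # (N, P * P * C)
    return patches @ projection + position_embeddings    # (N, D)


# Illustrative sizes only: a 224x224 RGB image, 14-pixel patches (the "/14"
# in names such as ViT-bigG/14), and a small embedding width for the demo.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch_size, embed_dim = 14, 64
num_patches = (224 // patch_size) ** 2                   # 16 * 16 = 256

# Placeholder weights; a real ViT learns these during training.
projection = rng.normal(size=(patch_size * patch_size * 3, embed_dim)) * 0.02
position_embeddings = rng.normal(size=(num_patches, embed_dim)) * 0.02

tokens = embed_patches(image, patch_size, projection, position_embeddings)
print(tokens.shape)  # (256, 64): one embedding per patch
```

Each row of `tokens` plays the same role as one looked-up token embedding in a text model, which is why the rest of the transformer can treat image patches and text tokens alike.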

A ViT (or another visual encoder) is an essential component of vision-language models (VLMs) and other multimodal models that accept images.

Examples

ViT Name	Parameters	Creator	Used in
14	303M	OpenAI	LLaVA variants
EVA-CLIP ViT-bigG/14		BAAI	Early Qwen

Flow-matching decoder

See the flow matching paper (https://arxiv.org/abs/2210.02747), which describes the approach to visual inputs employed by TML-Interaction-Small, Thinking Machines Lab’s real-time multimodal conversational model.¹

https://info.deeplearning.ai/e3t/Ctc/LX+113/cJhC404/MXk8ZC34GVFW79T-4R5GVlGxVwDG2D5Pn3f-N3mxJ7j5nR3bW8wM7ks6lZ3lkMtsl5b6nchgW2VgCfs9cB1YTW7lR6D38-M5TGW108dPG15k7mNW83k9X975X5RwW3yTj0T1plxGBW8KmPTC25zTTZW7RgvHF3j48GTVmh6sH3lc_8qMW40j-NS-FcW3KVG045ylZDbW75-YTd1sFQpyW1Q8N1Q2XV4rPW1kTFnD3GTmfJW3rfz_x8Kf0bgW37sgj55p-rClW2Vm2zC2Y50MgW5324bb1zVLKbW26Q2Kx8-B8wfW4shWsb6MWRJYW7yNnfw4LMX85W4Gj_S24f026FW7zwhr847zw4lW2VdWKB9jlNx9W8t8jBm5x3yjsW1PdwqQ4Y3ZGvW2989D06VgbkFW6v3X_G8RhvxcW8KB1pw2DkZ97W87V0GK5mY-cpW8FkhFN1rhNh_N6cXNt81MJ8pN5KwQznw47nfW8Syt7F3CBr03W7xHCz93mwZF5W8108k84PpwScW49ZcVz85HNp1VD_pG_1WRC2hVbG6Vn2CLLfjN5FXvDG-BvdXW95sXSW9kS48BW6WjWcy7fl7SZN3_jdRG5378wW357L1t7l_nhxf90C40j04 ↩

Glenn's Digital Garden
