👁️🗨️ Awesome VLM Architectures
Vision-Language Models (VLMs) feature a multimodal architecture that processes image and text data simultaneously. They can perform tasks such as Visual Question Answering (VQA), image captioning, and text-to-image search. VLMs use techniques like multimodal fusion with cross-attention, masked language modeling, and image-text matching to relate visual semantics to textual representations; a minimal fusion sketch follows below. This repository collects well-known Vision-Language Models (VLMs), with details about their architectures, training procedures, and the datasets used for training. Click to expand any architecture for further details. 📙 Visit my other repo to try Vision Language Models on ComfyUI
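To make the cross-attention fusion idea above concrete, here is a minimal, illustrative PyTorch sketch of text tokens attending to projected image patch embeddings. It is not the implementation of any specific model listed below; the class name, dimensions, and parameters are assumptions chosen for illustration.

```python
# Minimal, illustrative sketch of cross-attention fusion between image and
# text features. Not the implementation of any model listed in this repo;
# all names and dimensions are assumptions chosen for illustration.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, vision_dim=1024, num_heads=8):
        super().__init__()
        # Project image patch features into the text embedding space.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        # Text tokens (queries) attend to projected image patches (keys/values).
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (batch, num_text_tokens, text_dim)
        # image_patches: (batch, num_patches, vision_dim)
        vis = self.vision_proj(image_patches)
        attended, _ = self.cross_attn(query=text_tokens, key=vis, value=vis)
        # Residual connection keeps the original text representation.
        return self.norm(text_tokens + attended)

# Toy usage with random tensors standing in for encoder outputs.
fusion = CrossAttentionFusion()
text = torch.randn(2, 16, 768)     # e.g. output of a text encoder
image = torch.randn(2, 196, 1024)  # e.g. ViT patch embeddings
fused = fusion(text, image)        # shape: (2, 16, 768)
```

Individual models in the list differ in where and how often such fusion happens (interleaved cross-attention layers, a single projection into the LLM's token stream, Q-Former-style query tokens, etc.); see each model's expandable section for specifics.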
Contents
Models
LLaVA | LLaVA 1.5 | LLaVA 1.6 | PaliGemma | PaliGemma 2 | AIMv2 | Apollo | ARIA | EVE | EVEv2 | Janus-Pro | LLaVA-CoT | LLM2CLIP | Maya | MiniMax-01 | NVLM | OmniVLM | Pixtral 12B | Sa2VA | Tarsier2 | UI-TARS | VideoChat-Flash | VideoLLaMA 3 | Llama 3.2-Vision | SmolVLM | IDEFICS | IDEFICS2 | IDEFICS3-8B | InternLM-XComposer2 | InternLM-XComposer2-4KHD | InternLM-XComposer-2.5 | InternVL 2.5 | DeepSeek-VL | DeepSeek-VL2 | MANTIS | Qwen-VL | Qwen2-VL | Qwen2.5-VL | moondream1 | moondream2 | Moondream-next | SPHINX-X | BLIP | BLIP-2 | xGen-MM (BLIP-3) | InstructBLIP | KOSMOS-1 | KOSMOS-2 | ConvLLaVA | Parrot | OMG-LLaVA | EVLM | SlowFast-LLaVA | Nous-Hermes-2-Vision - Mistral 7B | TinyGPT-V | CoVLM | GLaMM | COSMO | FireLLaVA | u-LLaVA | MoE-LLaVA | BLIVA | MobileVLM | FROZEN | Flamingo | OpenFlamingo | PaLI | PaLI-3 | PaLM-E | MiniGPT-4 | MiniGPT-v2 | LLaVA-Plus | BakLLaVA | CogVLM | CogVLM2 | Ferret | Fuyu-8B | OtterHD | SPHINX | Eagle 2 | EAGLE | VITA | LLaVA-OneVision | MiniCPM-o-2.6 | MiniCPM-V | INF-LLaVA | Florence-2 | MULTIINSTRUCT | MouSi | LaVIN | CLIP | MetaCLIP | Alpha-CLIP | GLIP | ImageBind | SigLIP | ViT