👁️🗨️ Awesome VLM Architectures
Vision-Language Models (VLMs) feature a multimodal architecture that processes image and text data simultaneously. They can perform tasks such as Visual Question Answering (VQA), image captioning, and text-to-image search. VLMs use techniques like multimodal fusion with cross-attention, masked language modeling, and image-text matching to relate visual semantics to textual representations; a minimal fusion sketch follows below. This repository collects well-known Vision-Language Models (VLMs), with details about their architectures, training procedures, and the datasets used for training. Click to expand any architecture for further details. 📙 Visit my other repo to try Vision Language Models on ComfyUI
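To make the cross-attention fusion idea above concrete, here is a minimal, illustrative PyTorch sketch of text tokens attending to projected image patch embeddings. It is not the implementation of any specific model listed below; the class name, dimensions, and parameters are assumptions chosen for illustration.

```python
# Minimal, illustrative sketch of cross-attention fusion between image and
# text features. Not the implementation of any model listed in this repo;
# all names and dimensions are assumptions chosen for illustration.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, vision_dim=1024, num_heads=8):
        super().__init__()
        # Project image patch features into the text embedding space.
        self.vision_proj = nn.Linear(vision_dim, text_dim)
        # Text tokens (queries) attend to projected image patches (keys/values).
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (batch, num_text_tokens, text_dim)
        # image_patches: (batch, num_patches, vision_dim)
        vis = self.vision_proj(image_patches)
        attended, _ = self.cross_attn(query=text_tokens, key=vis, value=vis)
        # Residual connection keeps the original text representation.
        return self.norm(text_tokens + attended)

# Toy usage with random tensors standing in for encoder outputs.
fusion = CrossAttentionFusion()
text = torch.randn(2, 16, 768)     # e.g. output of a text encoder
image = torch.randn(2, 196, 1024)  # e.g. ViT patch embeddings
fused = fusion(text, image)        # shape: (2, 16, 768)
```

Individual models in the list differ in where and how often such fusion happens (interleaved cross-attention layers, a single projection into the LLM's token stream, Q-Former-style query tokens, etc.); see each model's expandable section for specifics.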
Contents
Models
LLaVA | LLaVA 1.5 | LLaVA 1.6 | PaliGemma | PaliGemma 2 | AIMv2 | Apollo | ARIA | EVE | EVEv2 | Janus-Pro | LLaVA-CoT | LLM2CLIP | Maya | MiniMax-01 | NVLM | OmniVLM | Pixtral 12B | Sa2VA | Tarsier2 | UI-TARS | VideoChat-Flash | VideoLLaMA 3 | Llama 3.2-Vision | SmolVLM | IDEFICS | IDEFICS2 | IDEFICS3-8B | InternLM-XComposer2 | InternLM-XComposer2-4KHD | InternLM-XComposer-2.5 | InternVL 2.5 | DeepSeek-VL | DeepSeek-VL2 | MANTIS | Qwen-VL | Qwen2-VL | Qwen2.5-VL | moondream1 | moondream2 | Moondream-next | SPHINX-X | BLIP | BLIP-2 | xGen-MM (BLIP-3) | InstructBLIP | KOSMOS-1 | KOSMOS-2 | ConvLLaVA | Parrot | OMG-LLaVA | EVLM | SlowFast-LLaVA | Nous-Hermes-2-Vision - Mistral 7B | TinyGPT-V | CoVLM | GLaMM | COSMO | FireLLaVA | u-LLaVA | MoE-LLaVA | BLIVA | MobileVLM | FROZEN | Flamingo | OpenFlamingo | PaLI | PaLI-3 | PaLM-E | MiniGPT-4 | MiniGPT-v2 | LLaVA-Plus | BakLLaVA | CogVLM | CogVLM2 | Ferret | Fuyu-8B | OtterHD | SPHINX | Eagle 2 | EAGLE | VITA | LLaVA-OneVision | MiniCPM-o-2.6 | MiniCPM-V | INF-LLaVA | Florence-2 | MULTIINSTRUCT | MouSi | LaVIN | CLIP | MetaCLIP | Alpha-CLIP | GLIP | ImageBind | SigLIP | ViT