Commonly used vision encoders
Paper | Framework | Data | Code | Publication | Preprint | Affiliation |
---|---|---|---|---|---|---|
DINOv2: Learning Robust Visual Features without Supervision | DINO+iBOT | 142M images (LVD-142M) | DINOv2 | TMLR | 2304.07193 | Meta |
Sigmoid Loss for Language Image Pre-Training | Contrastive (sigmoid) | 900M image-text pairs | SigLIP | ICCV 2023 | 2303.15343 | Google DeepMind |
Learning Transferable Visual Models From Natural Language Supervision | Contrastive (softmax) | 400M image-text pairs | CLIP | ICML 2021 | 2103.00020 | OpenAI |
- [2024-06] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs arXiv | compares different image encoders as the vision backbone of multimodal LLMs
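
The "Contrastive (softmax)" and "Contrastive (sigmoid)" entries in the Framework column differ in how a batch of image-text similarities is turned into a loss. The sketch below is a minimal pure-Python illustration, not code from the CLIP or SigLIP repositories. The function names are hypothetical, the logits are assumed to be already scaled by the temperature, and the `bias` default is only illustrative.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def logsumexp(values):
    # Stable log(sum(exp(v))) over a list of floats.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax_contrastive_loss(logits):
    """CLIP-style loss: cross-entropy over rows (image -> text) and
    columns (text -> image), with the matching pair on the diagonal."""
    n = len(logits)
    row_loss = sum(logsumexp(logits[i]) - logits[i][i] for i in range(n))
    col_loss = sum(
        logsumexp([logits[i][j] for i in range(n)]) - logits[j][j]
        for j in range(n)
    )
    return (row_loss + col_loss) / (2 * n)


def sigmoid_pairwise_loss(logits, bias=0.0):
    """SigLIP-style loss: every (image, text) pair is an independent
    binary classification, +1 on the diagonal and -1 elsewhere."""
    n = len(logits)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (logits[i][j] + bias))
    return -total / n


if __name__ == "__main__":
    # Hypothetical 2x2 batch: entry [i][j] scores image i against text j.
    logits = [[4.0, -1.0], [0.5, 3.0]]
    print("softmax loss:", softmax_contrastive_loss(logits))
    print("sigmoid loss:", sigmoid_pairwise_loss(logits))
```

The softmax variant normalizes each score against the whole batch, while the sigmoid variant scores each pair on its own, so it needs no batch-wide normalization.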