A curated list of prompt-based papers in computer vision and vision-language learning.
Keywords:
- Task tag, e.g.,
- Abbreviation tag, e.g.,
- Characteristic tag: a characteristic that makes the paper unique, e.g.,
- Bold font: We highlight pioneering work that may have contributed to the prevalence of visual prompting.
This section contains papers that design prompt modules (including adapters) for parameter-efficient adaptation of foundation models.
- Exploring Visual Prompts for Adapting Large-Scale Models [pdf] [code]
- DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning [pdf] [code]
- AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition [pdf] [code]
- Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [pdf] [code]
- Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks [pdf]
- Singular Value Fine-tuning: Few-shot Segmentation Requires Few-parameters Fine-tuning [pdf]
- Vision Transformer Adapter for Dense Predictions [pdf] [code]
- Convolutional Bypasses Are Better Vision Transformer Adapters [pdf] [code]
- Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets [pdf]
- Prompt Vision Transformer for Domain Generalization [pdf]
- Prompt-Matched Semantic Segmentation [pdf]
- Visual Prompt Tuning for Test-time Domain Adaptation [pdf]
- Visual Prompting for Adversarial Robustness [pdf]
- Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers [pdf] [code]
- Towards a Unified View on Visual Parameter-Efficient Transfer Learning [pdf] [code]
- FacT: Factor-Tuning For Lightweight Adaptation on Vision Transformer [pdf] [code]
- Learning Transferable Visual Models From Natural Language Supervision [pdf] [code]
- Prompt Distribution Learning [pdf]
- Conditional Prompt Learning for Vision-Language Models [pdf] [code]
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [pdf] [code]
- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [pdf] [code]
- VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks [pdf] [code]
- Prompting for Multi-Modal Tracking [pdf]
- Expanding Language-Image Pretrained Models for General Video Recognition [pdf] [code]
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [pdf] [code]
- Colorful Prompt Tuning for Pre-trained Vision-Language Models [pdf]
- ActionCLIP: A New Paradigm for Video Action Recognition [pdf] [code]
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [pdf] [code]
- Amortized Prompt: Lightweight Fine-Tuning for CLIP in Domain Generalization [pdf]
- Prompting Visual-Language Models for Efficient Video Understanding [pdf] [code]
- Unsupervised Prompt Learning for Vision-Language Models [pdf] [code]
- Parameter-Efficient Image-to-Video Transfer Learning [pdf]
- DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations [pdf]
- Rethinking the Openness of CLIP [pdf]
- OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression [pdf]
- Prompt Tuning for Generative Multimodal Pretrained Models [pdf] [code]
- Prompt Tuning with Soft Context Sharing for Vision-Language Models [pdf]
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [pdf] [code]
- CPL: Counterfactual Prompt Learning for Vision and Language Models [pdf] [code]
- Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [pdf] [code]
- Unified Vision and Language Prompt Learning [pdf]
- Multi-Prompt Alignment for Multi-source Unsupervised Domain Adaptation [pdf]
A language-interactable prompter develops zero-shot or few-shot capabilities by prompting one or several independent foundation models (VLMs, LMs, VMs, etc.) through a language interface.
- Multimodal Few-Shot Learning with Frozen Language Models [pdf]
- An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA [pdf] [code]
- A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models [pdf]
- VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning [pdf] [code]
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language [pdf] [code]
- Flamingo: a Visual Language Model for Few-Shot Learning [pdf]
- Language Models Can See: Plugging Visual Controls in Text Generation [pdf] [code]
- Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [pdf]
- Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning [pdf]
This section contains papers that use prompt modules as tools, such as papers that use prompts for pretraining or for specific applications.
- Unifying Vision-and-Language Tasks via Text Generation [pdf] [code]
  ICML 2021
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [pdf] [code]
  ICCV 2021
- Grounded Language-Image Pre-training [pdf] [code]
  CVPR 2022
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [pdf] [code]
  CVPR 2022
- GroupViT: Semantic Segmentation Emerges from Text Supervision [pdf] [code]
  CVPR 2022
- Unified Multimodal Pretraining and Prompt-based Tuning for Vision-Language Understanding and Generation [pdf]
  arXiv 2021/12
- Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning [pdf]
- PromptPapers: A comprehensive curated list of prompting papers (mainly in natural language processing)
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [pdf]
  arXiv 2021/07