Multi-modal-Deep-Learning

Recent Multi-modal Deep Learning Advances (list of papers and highlights).

Introduction

Prelude

Many recent advances use unified models (e.g., Transformers) to learn representations for multiple modalities. Some also fuse modalities so that they can help each other. Here, "modalities" covers not only natural language, vision, and speech, but also formal language (e.g., code), (semi-)structured knowledge (e.g., tables, knowledge graphs), and biological/chemical compounds (e.g., proteins, molecules). This is a list of recent important papers in this field. Contributions are welcome.

Introduction
- Prelude
Resources
Natural Language
Vision
- Supervised Vision Tasks
- Unsupervised Vision Representation Learning
Speech
- Unsupervised Speech Representation Learning
- Unsupervised Automatic Speech Recognition (ASR)
Formal Language / Code
Structured Knowledge
- Table
- Knowledge Graph
- Retrieval Paragraphs as Knowledge
Biology / Chemistry
- Protein
- Molecular
Modality Fusion
- Vision and Natural Language

Resources

Microsoft UniLM series

Natural Language

BERT, RoBERTa, BART, SpanBERT, UniLM, PEGASUS, ELECTRA, T5, GPT-k, FLAN, InstructGPT etc.
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, arxiv Feb 2022.

Vision

Supervised Vision Tasks

DETR: End-to-End Object Detection with Transformers, ECCV 2020.
ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
DeiT: Training data-efficient image transformers & distillation through attention, arxiv Dec 2020.
MoCo-V3: An Empirical Study of Training Self-Supervised Vision Transformers, ICCV 2021.
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, arxiv Aug 2021.

Unsupervised Vision Representation Learning

DINO: Emerging Properties in Self-Supervised Vision Transformers, arxiv April 2021.
BEiT: BERT Pre-Training of Image Transformers, arxiv Jun 2021.
SimMIM: A Simple Framework for Masked Image Modeling, arxiv Nov 2021.
MAE: Masked Autoencoders Are Scalable Vision Learners, arxiv Nov 2021.
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, arxiv Feb 2022.

Speech

Unsupervised Speech Representation Learning

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, arxiv Jun 2020.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, arxiv Jun 2021.
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing, arxiv Oct 2021.
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, arxiv Feb 2022.

Unsupervised Automatic Speech Recognition

wav2vec-U: Unsupervised Speech Recognition, arxiv May 2021.

Formal Language

CodeBERT: A Pre-Trained Model for Programming and Natural Languages, EMNLP 2020 (Findings).
Codex: Evaluating Large Language Models Trained on Code, arxiv Jul 2021.
GraphCodeBERT: Pre-training Code Representations with Data Flow, ICLR 2021.
Transformer Embeddings of Irregularly Spaced Events and Their Participants, ICLR 2022.
AlphaCode: Competition-Level Code Generation with AlphaCode.

Structured Knowledge

UNIFIEDSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models, arxiv Jan 2022.

Table

TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data, ACL 2020.
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing, ICLR 2021.
TAPAS: Weakly Supervised Table Parsing via Pre-training, ACL 2020.
STRUG: Structure-Grounded Pretraining for Text-to-SQL, NAACL 2021.
TAPEX: Table Pre-training via Learning a Neural SQL Executor, ICLR 2022.
TableFormer: Robust Transformer Modeling for Table-Text Encoding, ACL 2022.

Knowledge Graph

COMET: Commonsense Transformers for Automatic Knowledge Graph Construction, ACL 2019.
(COMET-)ATOMIC-2020: On Symbolic and Neural Commonsense Knowledge Graphs, arxiv Oct 2020.
Knowledge is Power: Symbolic Knowledge Distillation, Commonsense Morality, & Multimodal Script Knowledge, WSDM 2022.

Retrieval Paragraphs as Knowledge

REALM: Retrieval-Augmented Language Model Pre-Training, arxiv Feb 2020.
MARGE: Pre-training via Paraphrasing, NeurIPS 2020.
Dense Passage Retrieval for Open-Domain Question Answering, EMNLP 2020.
RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020.
End-to-End Training of Neural Retrievers for Open-Domain Question Answering, ACL 2021.
Condenser: a Pre-training Architecture for Dense Retrieval, EMNLP 2021.
Spider: Learning to Retrieve Passages without Supervision, arxiv Dec 2021.

Biology and Chemistry

Protein

Transformer protein language models are unsupervised structure learners, ICLR 2021.

Molecular

Graphormer: Do Transformers Really Perform Bad for Graph Representation?, NeurIPS 2021.

Modality Fusion

Vision and Natural Language

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019.
VisualBERT: A Simple and Performant Baseline for Vision and Language, ACL 2020.
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arxiv Dec 2019.
UNITER: UNiversal Image-TExt Representation Learning, arxiv July 2020.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV 2020.
VILLA: Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020.
ViLBERT-MT: 12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020.
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers, arxiv April 2020.
U-VisualBERT: Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions, NAACL 2021.
M6: A Chinese Multimodal Pretrainer, arxiv March 2021.
DALL·E: Zero-Shot Text-to-Image Generation, arxiv Feb 2021.
CLIP: Learning Transferable Visual Models From Natural Language Supervision, arxiv Feb 2021.
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning, ACL 2021.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision, ICML 2021.
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, arxiv Aug 2021.
ALBEF: Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, arxiv July 2021.
VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR 2021.
LAFITE: Towards Language-Free Training for Text-to-Image Generation, arxiv Nov 2021.
VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, arxiv Nov 2021.
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, arxiv Dec 2021.
FLAVA: A Foundational Language And Vision Alignment Model, arxiv Dec 2021.
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks, arxiv Dec 2021.
CM3: A Causal Masked Multimodal Model of the Internet, arxiv Jan 2022.
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, arxiv Feb 2022.
