This repository contains a list of papers on Protein Representation Learning (PRL), categorized by publication year. We will try to keep this list up to date. If you find any errors or missing papers, please don't hesitate to open an issue or a pull request.
Since 2023, however, the number of simple PRL methods has dwindled, and researchers have begun to focus on more diverse and difficult domain-specific problems.
- [ICML 2024] CLIPZyme: Reaction-Conditioned Virtual Screening of Enzymes[paper]
- [ICML 2024] Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates[paper][code]
- [ACMMM 2024] MetaEnzyme: Meta Pan-Enzyme Learning for Task-Adaptive Redesign[paper]
- [ICML 2024] Knowledge-aware Reinforced Language Models for Protein Directed Evolution[paper][code]
- [ICML 2024] Evolution-Inspired Loss Functions for Protein Representation Learning[paper]
- [ICML 2024] Diffusion Language Models Are Versatile Protein Learners[paper]
- [ICML 2024] ESM All-Atom: Multi-Scale Protein Language Model for Unified Molecular Modeling[paper]
- [ICML 2024] Protein Conformation Generation via Force-Guided SE(3) Diffusion Models[paper]
- [ICML 2024] Feature Reuse and Scaling: Understanding Transfer Learning with Protein Language Models[paper]
- [ICML 2024] AlphaFold Meets Flow Matching for Generating Protein Ensembles[paper]
- [ICML 2024] Context-Guided Diffusion for Out-of-Distribution Molecular and Protein Design[paper]
- [ICML 2024] Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space[paper]
- [ICML 2024] SurfPro: Functional Protein Design Based on Continuous Surface[paper]
- [ICML 2024] CarbonNovo: Joint Design of Protein Structure and Sequence Using a Unified Energy-based Model[paper]
- [ICML 2024] Generative Flows on Discrete State-Spaces: Enabling Multimodal Flows with Applications to Protein Co-Design[paper]
- [ICML 2024] Proteus: Exploring Protein Structure Generation for Enhanced Designability and Efficiency[paper]
- [ICML 2024] Learning to Predict Mutational Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning[paper]
- [ICML 2024] Interaction-based Retrieval-augmented Diffusion Models for Protein-specific 3D Molecule Generation[paper]
- [ICML 2024] Pre-Training Protein Bi-level Representation Through Span Mask Strategy On 3D Protein Chains[paper]
- [Arxiv 2024] HeMeNet: Heterogeneous Multichannel Equivariant Network for Protein Multi-task Learning[paper]
- [Arxiv 2024] Clustering for Protein Representation Learning[paper]
- [ICLR 2024] BioBridge: Bridging Biomedical Foundation Models via Knowledge Graph[paper]
- [ICLR 2024] Self-Guided Masked Autoencoders for Domain-Agnostic Self-Supervised Learning[paper]
- [ICLR 2024] Learning Scalar Fields for Molecular Docking with Fast Fourier Transforms[paper]
- [ICLR 2024] BEND: Benchmarking DNA Language Models on Biologically Meaningful Tasks[paper]
- [ICLR 2024] Searching for High-Value Molecules Using Reinforcement Learning and Transformers[paper]
- [ICLR 2024] Expected flow networks in stochastic environments and two-player zero-sum games[paper]
- [ICLR 2024] Conversational Drug Editing Using Retrieval and Domain Feedback[paper]
- [ICLR 2024] The Discovery of Binding Modes Requires Rethinking Docking Generalization[paper]
- [ICLR 2024] Hierarchical Data-efficient Representation Learning for Tertiary Structure-based RNA Design[paper]
- [ICLR 2024] Generative Adversarial Policy Network for Modelling Protein Complexes[paper]
- [ICLR 2024] Protein Multimer Structure Prediction via PPI-guided Prompt Learning[paper]
- [ICLR 2024] Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment[paper]
- [ICLR 2024] Protein-Ligand Interaction Prior for Binding-aware 3D Molecule Diffusion Models[paper]
- [ICLR 2024] Str2Str: A Score-based Framework for Zero-shot Protein Conformation Sampling[paper]
- [ICLR 2024] Neural Probabilistic Protein-Protein Docking via a Differentiable Energy Model[paper]
- [ICLR 2024] Robust Model-Based Optimization for Challenging Fitness Landscapes[paper]
- [ICLR 2024] Protein-ligand binding representation learning from fine-grained interactions[paper]
- [ICLR 2024] Learning to design protein-protein interactions with enhanced generalization[paper]
- [ICLR 2024] KW-Design: Pushing the Limit of Protein Design via Knowledge Refinement[paper]
- [ICLR 2024] Rigid Protein-Protein Docking via Equivariant Elliptic-Paraboloid Interface Prediction[paper]
- [ICLR 2024] Evaluating Representation Learning on the Protein Structure Universe[paper]
- [ICLR 2024] Improving protein optimization with smoothed fitness landscapes[paper]
- [ICLR 2024] Dynamics-Informed Protein Design with Structure Conditioning[paper]
- [ICLR 2024] Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning[paper]
- [ICLR 2024 spotlight] SE(3)-Stochastic Flow Matching for Protein Backbone Generation[paper]
- [ICLR 2024 spotlight] De novo Protein Design Using Geometric Vector Field Networks[paper]
- [ICLR 2024 spotlight] MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding[paper]
- [ICLR 2024 spotlight] SaProt: Protein Language Modeling with Structure-aware Vocabulary[paper][code]
- [ICLR 2024 oral] Protein Discovery with Discrete Walk-Jump Sampling[paper]
- [AAAI 2024] Improving PTM Site Prediction by Coupling of Multi-Granularity Structure and Multi-Scale Sequence Representation[paper][code]
- [NIPS 2023] DSR: Dynamical Surface Representation as Implicit Neural Networks for Protein[paper][code]
- [NIPS 2023] Predicting a Protein's Stability under a Million Mutations[paper][code]
- [NIPS 2023] Protein Design with Guided Discrete Diffusion[paper]
- [NIPS 2023] ProteinNPT: Improving protein property prediction and design with non-parametric transformers[paper][code]
- [NIPS 2023] ProteinShake: Building datasets and benchmarks for deep learning on protein structures[paper][code]
- [NIPS 2023] Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials[paper][code]
- [NIPS 2023] Full-Atom Protein Pocket Design via Iterative Refinement[paper][code]
- [NIPS 2023] Injecting Multimodal Information into Rigid Protein Docking via Bi-level Optimization[paper]
- [NIPS 2023] Unsupervised Protein-Ligand Binding Energy Prediction via Neural Euler's Rotation Equation[paper][code]
- [NIPS 2023] ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics[paper][code]
- [NIPS 2023] OpenProteinSet: Training data for structural biology at scale[paper]
- [NIPS 2023] CELLE-2: Translating Proteins to Pictures and Back with a Bidirectional Text-to-Image Transformer[paper][code]
- [NIPS 2023] FABind: Fast and Accurate Protein-Ligand Binding[paper][code]
- [NIPS 2023] ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction[paper][code]
- [NIPS 2023] Graph Denoising Diffusion for Inverse Protein Folding[paper][code]
- [NIPS 2023] DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing[paper][code]
- [NIPS 2023] DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening[paper]
- [NIPS 2023] Predicting mutational effects on protein-protein binding via a side-chain diffusion probabilistic model[paper]
- [NIPS 2023] PoET: A generative model of protein families as sequences-of-sequences[paper][code]
- [NIPS 2023] Pre-Training Protein Encoder via Siamese Sequence-Structure Diffusion Trajectory Prediction[paper]
- [ICML 2023] Bidirectional Learning for Offline Model-based Biological Sequence Design[paper][code]
- [ICML 2023] Reprogramming Pretrained Language Models for Antibody Sequence Infilling[paper][code]
- [ICML 2023] Learning Subpocket Prototypes for Generalizable Structure-based Drug Design[paper][code]
- [ICML 2023] Extrapolative Controlled Sequence Generation via Iterative Refinement[paper][code]
- [ICML 2023] AbODE: Ab initio antibody design using conjoined ODEs[paper][code]
- [ICML 2023] End-to-End Full-Atom Antibody Design[paper][code]
- [ICML 2023] Exploring Chemical Space with Score-based Out-of-distribution Generation[paper][code]
- [ICML 2023] SE(3) diffusion model with application to protein backbone generation[paper][code]
- [ICML 2023] Importance Weighted Expectation-Maximization for Protein Sequence Design[paper][code]
- [ICML 2023] Generating Novel, Designable, and Diverse Protein Structures by Equivariantly Diffusing Oriented Residue Clouds[paper][code]
- [ICML 2023] Chemically Transferable Generative Backmapping of Coarse-Grained Proteins[paper][code]
- [ICML 2023] Structure-informed Language Models Are Protein Designers[paper]
- [ICML 2023] ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts [paper][code]
- [bioRxiv 2023] Retrieved Sequence Augmentation for Protein Representation Learning [paper][code]
- [Arxiv 2023] Data-Efficient Protein 3D Geometric Pretraining via Refinement of Diffused Protein Structure Decoy [paper]
- [ICLR 2023] Protein Representation Learning by Geometric Structure Pretraining [paper]
- [ICLR 2023] Protein Representation Learning via Knowledge Enhanced Primary Structure Reasoning[paper]
- [ICLR 2023] Multi-level Protein Structure Pre-training via Prompt Learning[paper]
- [ICLR 2023] Learning Hierarchical Protein Representations via Complete 3D Graph Networks[paper]
- [ICLR 2023] Rotamer Density Estimator is an Unsupervised Learner of the Effect of Mutations on Protein-Protein Interaction[paper][code]
- [ICLR 2023] Matching receptor to odorant with protein language and graph neural networks[paper][code]
- [ICLR 2023] Protein Sequence and Structure Co-Design with Equivariant Translation[paper][code]
- [ICLR 2023] HotProtein: A Novel Framework for Protein Thermostability Prediction and Editing[paper][code]
- [ICLR 2023] PiFold: Toward effective and efficient protein inverse folding[paper][code]
- [ICLR 2023] Continuous-Discrete Convolution for Geometry-Sequence Modeling in Proteins[paper][code]
- [Nature 2023] De novo design of protein interactions with learned surface fingerprints[paper][code]
- [Nature Communications 2023] PeSTo: parameter-free geometric deep learning for accurate prediction of protein binding interfaces [paper][code]
- [IJCAI 2023] SemiGNN-PPI: Self-Ensembling Multi-Graph Neural Network for Efficient and Generalizable Protein-Protein Interaction Prediction [paper]
- [Science 2023] Top-down design of protein architectures with reinforcement learning[paper][code]
- [Nature Communications 2023] Hierarchical graph learning for protein–protein interaction [paper][code]
- [Nature Biotechnology 2023] Efficient evolution of human antibodies from general protein language models [paper][code]
- [Advanced Science 2023] A Multimodal Protein Representation Framework for Quantifying Transferability Across Biochemical Downstream Tasks [paper][code]
- [Nature Communications Biology] Learning the protein language of proteome-wide protein-protein binding sites via explainable ensemble deep learning[paper][code]
- [Nature Communications 2022] ProtGPT2 is a deep unsupervised language model for protein design[paper]
- [bioRxiv 2022] Codon language embeddings provide strong signals for protein engineering [paper]
- [Arxiv 2022] When Geometric Deep Learning Meets Pretrained Protein Language Models [paper]
- [Arxiv 2022] Contrastive Representation Learning for 3D Protein Structures [paper]
- [Bioinformatics 2022] Structure-aware Protein Self-supervised Learning [paper][video]
- [KDD 2022] GBPNet: Universal Geometric Representation Learning on Protein Structures [paper][code]
- [Arxiv 2022] Directed Weight Neural Networks for Protein Structure Representation Learning [paper]
- [PLOS Computational Biology 2022] Fast protein structure comparison through effective representation learning with contrastive graph neural networks [paper][code]
- [NeurIPS 2022] Exploring evolution-aware & -free protein language models as protein function predictors [paper]
- [bioRxiv 2022] High-resolution de novo structure prediction from primary sequence [paper][code]
- [Bioinformatics 2022] ProteinBERT: A universal deep-learning model of protein sequence and function [paper][code]
- [Communications Biology 2022] Artificial Intelligence Guided Conformational Mining of Intrinsically Disordered Proteins [paper][code]
- [Cell Systems 2022] Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins [paper][code]
- [bioRxiv 2022] Convolutions are competitive with transformers for protein sequence pretraining [paper][code]
- [bioRxiv 2022] Masked inverse folding with sequence transfer for protein representation learning [paper][code]
- [Nature Methods 2022] Self-supervised deep learning encodes high-resolution features of protein subcellular localization [paper][code]
- [bioRxiv 2022] Language models of protein sequences at the scale of evolution enable accurate structure prediction [paper]
- [bioRxiv 2022] Atomic protein structure refinement using all-atom graph representations and SE(3)-equivariant graph neural networks [paper]
- [Bioinformatics 2022] Cross-Modality and Self-Supervised Protein Embedding for Compound–Protein Affinity and Contact Prediction [paper][code]
- [bioRxiv 2022] COLLAPSE: A representation learning framework for identification and characterization of protein structural sites [paper]
- [bioRxiv 2022] An Analysis of Protein Language Model Embeddings for Fold Prediction [paper]
- [ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding [paper] [code]
- [Bioinformatics 2022] DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts [paper]
- [Briefings in Bioinformatics 2022] SPRoBERTa: protein embedding learning with local fragment modeling [paper]
- [NeurIPS 2022] Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models for Protein Structures[paper][code]
- [NeurIPS 2022] PEER: A Comprehensive and Multi-Task Benchmark for Protein Sequence Understanding[paper][code]
- [AAAI 2022] Self-Supervised Pre-training for Protein Embeddings Using Tertiary Structures[paper][code]
- [ICLR 2022] Geometric Transformers for Protein Interface Contact Prediction[paper][code]
- [ICLR 2022] Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking[paper][code]
- [ICML 2022] Proximal Exploration for Model-guided Protein Sequence Design[paper][code]
- [ICML 2022] Generating 3D Molecules for Target Protein Binding[paper][code]
- [ICML 2022] Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-time Retrieval[paper][code]
- [Arxiv 2022] DiffBP: Generative Diffusion of 3D Molecules for Target Protein Binding[paper]
- [Arxiv 2021] Pre-training co-evolutionary protein representation via a pairwise masked language model [paper]
- [NeurIPS 2021] Language models enable zero-shot prediction of the effects of mutations on protein function [paper][code]
- [ICML 2021] MSA Transformer [paper][code]
- [TPAMI 2021] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Learning [paper][code]
- [Arxiv 2021] Modeling Protein Using Large-scale Pretrain Language Model [paper][code]
- [IEEE Access 2021] Pre-Training of Deep Bidirectional Protein Sequence Representations With Structural Information [paper][code]
- [Bioinformatics 2021] GraphQA: protein model quality assessment using graph convolutional networks [paper][code]
- [ICLR 2021] Intrinsic-extrinsic convolution and pooling for learning on 3d protein structures [paper][code]
- [ICLR 2021] Learning from Protein Structure with Geometric Vector Perceptrons [paper][code]
- [NeurIPS 2021] Multi-Scale Representation Learning on Proteins [paper][code]
- [PNAS 2021] Neural networks to learn protein sequence–function relationships from deep mutational scanning data [paper][code]
- [bioRxiv 2021] LM-GVP: A Generalizable Deep Learning Framework for Protein Property Prediction from Sequence and Structure [paper][code]
- [Cell Systems 2021] Learning the protein language: Evolution, structure, and function [paper][code]
- [Nature Communications 2021] Structure-based protein function prediction using graph convolutional networks [paper][code]
- [KDD 2021] Geometric Graph Representation Learning on Protein Structure Prediction [paper]
- [Arxiv 2021] Adversarial Contrastive Pre-training for Protein Sequences [paper]
- [Emerg Top Life Sci 2021] Graph representation learning for structural proteomics [paper]
- [Arxiv 2021] Graph Representation Learning in Biomedicine [paper]
- [Applied Sciences 2021] GraphMS: Drug Target Prediction Using Graph Representation Learning with Substructures [paper][code]
- [JMC 2021] InteractionGraphNet: A Novel and Efficient Deep Graph Representation Learning Framework for Accurate Protein−Ligand Interaction Predictions [paper][code]
- [bioRxiv 2021] Self-Supervised Representation Learning of Protein Tertiary Structures (PtsRep) and Its Implications for Protein Engineering [paper]
- [Algorithms 2021] Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures [paper][code]
- [bioRxiv 2021] Combining evolutionary and assay-labelled data for protein fitness prediction [paper]
- [Science 2021] Accurate prediction of protein structures and interactions using a three-track neural network [paper][code]
- [Nature 2021] Highly accurate protein structure prediction with AlphaFold [paper][code]
- [IEEE TCBB 2021] Sequence representations and their utility for predicting protein-protein interactions [paper]
- [Bioinformatics 2021] Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function [paper][code]
- [CVPR 2021] Fast end-to-end learning on protein surfaces [paper]
- [Briefings in Functional Genomics 2021] Pretraining model for biological sequence data [paper]
- [bioRxiv 2021] Toward More General Embeddings for Protein Design: Harnessing Joint Representations of Sequence and Structure [paper]
- [NeurIPS 2021] Neural Distance Embeddings for Biological Sequences [paper][code]
- [Computational Biology and Chemistry 2021] Convolutional neural networks with image representation of amino acid sequences for protein function prediction [paper][code]
- [bioRxiv 2021] Distillation of MSA Embeddings to Folded Protein Structures with Graph Transformers [paper]
- [chemRxiv 2021] Identification of Enzymatic Active Sites with Unsupervised Language Modeling [paper]
- [bioRxiv 2021] Deciphering the language of antibodies using self-supervised learning [paper]
- [bioRxiv 2021] Hydrogen bonds meet self-attention: all you need for general-purpose protein structure embedding [paper]
- [bioRxiv 2021] Improving Generalizability of Protein Sequence Models with Data Augmentations [paper]
- [BCB 2020] Transforming the Language of Life: Transformer Neural Networks for Protein Prediction Tasks [paper][code]
- [Bioinformatics 2020] UDSMProt: universal deep sequence models for protein classification [paper][code]
- [bioRxiv 2020] Self-Supervised Contrastive Learning of Protein Representations By Mutual Information Maximization [paper]
- [bioRxiv 2020] End-to-end multitask learning, from protein language to protein features without alignments [paper]
- [bioRxiv 2020] Language modelling for biological sequences – curated datasets and baselines [paper][code]
- [NeurIPS 2020] Is Transfer Learning Necessary for Protein Landscape Prediction? [paper]
- [PNAS 2020] Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences [paper][code]
- [Arxiv 2020] Profile Prediction: An Alignment-Based Pre-Training Task for Protein Sequence Models [paper]
- [Arxiv 2020] ProGen: Language Modeling for Protein Generation [paper][code]
- [bioRxiv 2020] Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis [paper]
- [NAR Genomics and Bioinformatics 2020] Mutation effect estimation on protein–protein interactions using deep contextualized representation learning [paper][code]
- [CSBJ 2020] Representation learning applications in biological sequence analysis [paper]
- [bioRxiv 2020] TripletProt: Deep Representation Learning of Proteins based on Siamese Networks [paper]
- [RECOMB 2020] Evolutionary context-integrated deep sequence modeling for protein engineering [paper]
- [Arxiv 2020] What is a meaningful representation of protein sequences? [paper][code]
- [bioRxiv 2020] Transformer protein language models are unsupervised structure learners [paper][code]
- [bioRxiv 2020] Deep learning-based prediction of protein structure using learned representations of multiple sequence alignments [paper]
- [Cells 2019] A High Efficient Biological Language Model for Predicting Protein–Protein Interactions [paper][code]
- [Nature Methods 2019] Unified rational protein engineering with sequence-only deep representation learning [paper][code]
- [NeurIPS 2019] Evaluating Protein Transfer Learning with TAPE [paper][code]
- [Nature Communications 2019] Deciphering protein evolution and fitness landscapes with latent space models [paper][code]
- [bioRxiv 2019] DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences [paper][code]
- [ACS Nano 2019] A Self-Consistent Sonification Method to Translate Amino Acid Sequences into Musical Compositions and Application in Protein Design Using Artificial Intelligence [paper]
- [bioRxiv 2019] Augmenting protein network embeddings with sequence information [paper]
- [Nature Machine Intelligence 2019] Leveraging implicit knowledge in neural networks for functional dissection and engineering of proteins [paper][code]
- [bioRxiv 2019] Modeling the Language of Life – Deep Learning Protein Sequences [paper][code]
- [ICLR 2019] Learning protein sequence embeddings using information from structure [paper][code]
- [BIBM 2019] GraphCPI: Graph Neural Representation Learning for Compound-Protein Interaction [paper]
- [Bioinformatics 2018] Learned protein embeddings for machine learning [paper][code]
- [Bioinformatics 2018] Deep convolutional networks for quality assessment of protein folds [paper][code]
- [bioRxiv 2018] Deep Semantic Protein Representation for Annotation, Discovery, and Engineering [paper][code]
- [bioRxiv 2017] Predicting Protein Binding Affinity With Word Embeddings and Recurrent Neural Networks [paper][code]
- [Arxiv 2017] Variational auto-encoding of protein sequences [paper][code]
- [Arxiv 2016] Distributed Representations for Biological Sequence Analysis [paper]
- [Bioinformatics 2015] ProFET: Feature engineering captures high-level protein functions [paper]
- awesome-graph-representation-learning
- awesome-graph-self-supervised-learning
- awesome-self-supervised-gnn
- awesome-self-supervised-learning-for-graphs
- awesome-AI-based-protein-design
If you find this project useful for your research, please use the following BibTeX entry.
@article{wu2022survey,
title={A Survey on Protein Representation Learning: Retrospect and Prospect},
author={Wu, Lirong and Huang, Yufei and Lin, Haitao and Li, Stan Z},
journal={arXiv preprint arXiv:2301.00813},
year={2022}
}
If you have any questions about this work, please feel free to contact us by email:
- Lirong Wu: [email protected]
- Yufei Huang: [email protected]
- Bozhen Hu: [email protected]