Merge pull request #4 from chaitjo/v0.2

Major refactor and updates for v0.2 release
chaitjo · Jan 12, 2024 · b6c93a6
2 parents 5952cfe + 2ae09c7
Showing 69 changed files with 45,650 additions and 1,888 deletions.
diff --git a/.env.example b/.env.example
@@ -0,0 +1,12 @@
+export PROJECT_PATH='/home/ckj24/rna-inverse-folding/'
+
+export DATA_PATH='/home/ckj24/rna-inverse-folding/data/'
+
+export WANDB_PROJECT='gRNAde'
+export WANDB_ENTITY='chaitjo'
+export WANDB_DIR='/home/ckj24/rna-inverse-folding/'
+
+export ETERNAFOLD='/home/ckj24/rna-inverse-folding/tools/EternaFold'
+
+export X3DNA='/home/ckj24/rna-inverse-folding/tools/x3dna-v2.4'
+export PATH="/home/ckj24/rna-inverse-folding/tools/x3dna-v2.4/bin:$PATH"
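Review note: the `.env.example` above uses shell-style `export KEY='value'` lines. The sketch below shows one way to read such a file from Python and report which variables are still unset. It is a minimal, standard-library-only illustration, not the loader the repository uses (the README installs `python-dotenv` for that). The helper names `parse_env_text` and `missing_vars` are hypothetical, and no `$VAR` expansion is performed.

```python
import shlex


def parse_env_text(text):
    """Parse `export KEY='value'` lines into a dict (no $VAR expansion)."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Drop the optional shell `export` keyword.
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # shlex.split removes the surrounding shell quotes.
        parts = shlex.split(value)
        env[key.strip()] = parts[0] if parts else ""
    return env


def missing_vars(env, required):
    """Return the required variable names that are absent or empty."""
    return [name for name in required if not env.get(name)]


# A shortened copy of the example file, used here only for illustration.
EXAMPLE = """\
export PROJECT_PATH='/home/ckj24/rna-inverse-folding/'
export DATA_PATH='/home/ckj24/rna-inverse-folding/data/'
export WANDB_PROJECT='gRNAde'
"""

env = parse_env_text(EXAMPLE)
print(missing_vars(env, ["PROJECT_PATH", "DATA_PATH", "ETERNAFOLD", "X3DNA"]))
```

A check like this can catch a half-filled `.env` before the external tools (EternaFold, X3DNA) are first called.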
diff --git a/.gitignore b/.gitignore
@@ -1,8 +1,8 @@
 # Custom
+/tools
 /data
-/seq
-/env
 /wandb
+/slurm
 .DS_Store
 # *.ipynb
 

diff --git a/README.md b/README.md
@@ -1,103 +1,155 @@
 # 💣 gRNAde: Geometric RNA Design
 
-**gRNAde** is a geometric deep learning pipeline for 3D RNA inverse design conditioned on *multiple* backbone conformations. 
-gRNAde explicitly accounts for RNA conformational flexibility via a novel **multi-Graph Neural Network** architecture which independently encodes a set of conformers via message passing.
+**gRNAde** is a geometric deep learning pipeline for 3D RNA inverse design. 
 
-![](fig/grnade_pipeline.png)
+gRNAde generates an RNA sequence conditioned on one or more 3D RNA backbone conformations, i.e. both single- and multi-state **fixed-backbone sequence design**.
+RNA backbones are featurized as geometric graphs and processed via a multi-state GNN encoder which is equivariant to 3D roto-translation of coordinates as well as conformer order, followed by conformer order-invariant pooling and sequence design.
 
-Check out the accompanying paper ['Multi-State RNA Design with Geometric Multi-Graph Neural Networks'](https://arxiv.org/abs/2305.14749), which introduces gRNAde.
-> Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, and Pietro Liò. Multi-State RNA Design with Geometric Multi-Graph Neural Networks. *arXiv preprint, 2023.*
+![](/tutorial/fig/grnade_pipeline.png)
+
+⚙️ Want to use gRNAde for your own RNA designs? Check out the tutorial notebook: [gRNAde 101](/tutorial/tutorial.ipynb)
+
+📄 For more details on the methodology, see the accompanying paper: ['Multi-State RNA Design with Geometric Multi-Graph Neural Networks'](https://arxiv.org/abs/2305.14749)
+> Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, and Pietro Liò. Multi-State RNA Design with Geometric Multi-Graph Neural Networks. *ICML Computational Biology Workshop, 2023.*
 >
->[PDF](https://arxiv.org/pdf/2305.14749.pdf) | [Thread](https://twitter.com/chaitjo/status/1662118334412800001)
+>[PDF](https://arxiv.org/pdf/2305.14749.pdf) | [Tweet](https://twitter.com/chaitjo/status/1662118334412800001)
 
-❗️**Note:** gRNAde is under active development; the `main` branch contains the most recent version of the code and models, but the manuscript may not be updated with the latest results. Please check the ['Releases'](https://github.com/chaitjo/geometric-rna-design/releases) tab to reproduce our results.
 
 
-## Directory Structure and Usage
+## Installation
 
-```
-.
-├── README.md
-|
-├── data                    # Data files directory
-├── notebooks               # Jupyter notebooks directory
-├── configs                 # Configuration files directory
-| 
-├── main.py                 # Main script for launching experiments
-|
-└── src
-    ├── models.py           # Multi-GNN encoder layers and model
-    ├── train.py            # Helper functions for training and evaluation
-    ├── data.py             # RNA inverse design dataset
-    ├── data_utils.py       # Helper functions for data preparation
-    └── featurisation.py    # Input featurisation helpers
+To get started, set up a Python environment by following the installation instructions below.
+We have tested gRNAde on Linux with Python 3.10.12 and CUDA 11.8 on an NVIDIA A100 80GB GPU, as well as on macOS.
+
+```sh
+# Clone gRNAde repository
+cd ~  # change this to your preferred download location
+git clone https://github.com/chaitjo/geometric-rna-design.git
+cd geometric-rna-design
+
+# Install mamba (a faster conda)
+wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
+bash Miniforge3-Linux-x86_64.sh
+source ~/.bashrc
+# You may also use conda or virtualenv to create your environment
+
+# Create new environment
+mamba create -n rna python=3.10
+mamba activate rna
+
+# Install PyTorch (ensure appropriate CUDA version for your hardware)
+mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
+
+# Install PyTorch Geometric (ensure matching torch + CUDA version)
+pip install torch_geometric
+pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html
+
+# Install other dependencies
+mamba install mdanalysis MDAnalysisTests jupyterlab matplotlib seaborn pandas networkx biopython biotite torchmetrics lovely-tensors -c conda-forge
+pip install wandb pyyaml ipdb python-dotenv tqdm lmdb cpdb-protein
+
+# Install X3DNA for secondary structure determination
+cd ~/rna-inverse-folding/tools/
+tar -xvzf x3dna-v2.4-linux-64bit.tar.gz
+./x3dna-v2.4/bin/x3dna_setup
+# Follow the instructions to test your installation
+
+# Install EternaFold for secondary structure prediction
+cd ~/rna-inverse-folding/tools/
+git clone --depth=1 https://github.com/eternagame/EternaFold.git && cd EternaFold/src
+make
+# Notes: 
+# - Multithreaded version of EternaFold did not install for me
+# - To install on MacOS, start a shell in Rosetta using `arch -x86_64 zsh`
+
+# (Optional) Install CD-HIT for sequence identity clustering
+mamba install cd-hit -c bioconda
+
+# (Optional) Install US-align/qTMclust for structural similarity clustering
+cd ~/rna-inverse-folding/tools/
+git clone https://github.com/pylelab/USalign.git && cd USalign/ && git checkout 97325d3aad852f8a4407649f25e697bbaa17e186
+g++ -static -O3 -ffast-math -lm -o USalign USalign.cpp
+g++ -static -O3 -ffast-math -lm -o qTMclust qTMclust.cpp
 ```
 
+Once your Python environment is set up, create your `.env` file with the appropriate environment variables; see the `.env.example` file included in the codebase for reference.
+```sh
+cd ~/rna-inverse-folding/
+touch .env
+```
 
 
-## Installation
+## Directory Structure and Usage
 
-Our experiments used Python 3.8.16 and CUDA 11.3 on NVIDIA Quadro RTX 8000 GPUs.
+Detailed usage instructions are available in [the tutorial notebook](/tutorial/tutorial.ipynb).
 
-```sh
-# Create new conda environment
-conda create --prefix ./env python=3.8
-conda activate ./env
-
-# Install PyTorch (Check CUDA version for GPU!)
-# Option 1: CPU
-# conda install pytorch==1.12.0 -c pytorch
-#
-# Option 2: GPU, CUDA 11.3
-conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
-
-# Install dependencies
-conda install matplotlib pandas networkx
-pip install biopython wandb pyyaml ipdb 
-conda install jupyterlab -c conda-forge
-conda install -c bioconda cd-hit
-
-# Install PyG (Check CPU/GPU/MacOS)
-# Option 1: CPU, MacOS
-# pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-1.12.0+cpu.html 
-# pip install torch-geometric
-#
-# Option 2: GPU, CUDA 11.3
-pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-1.12.1+cu113.html
-pip install torch-geometric
-#
-# Option 3: 
-# conda install pyg -c pyg  # CPU/GPU, but may not work on MacOS
+```
+.
+├── README.md
+├── LICENSE
+|
+├── gRNAde.py                       # gRNAde python module and command line utility
+├── main.py                         # Main script for training models
+|
+├── .env.example                    # Example environment file
+├── .env                            # Your environment file
+|
+├── checkpoints                     # Saved model checkpoints
+├── configs                         # Configuration files directory
+├── data                            # Dataset and data files directory
+├── notebooks                       # Directory for Jupyter notebooks
+├── scripts                         # Directory for standalone scripts
+├── tutorial                        # Tutorial with example usage
+|
+├── tools                           # Directory for external tools
+|   ├── EternaFold                  # RNA sequence to secondary structure prediction
+|   └── x3dna-v2.4                  # RNA secondary structure determination from 3D
+|
+└── src                             # Source code directory
+    ├── constants.py                # Constant values for data, paths, etc.
+    ├── layers.py                   # PyTorch modules for building Multi-state GNN models
+    ├── models.py                   # Multi-state GNN models for gRNAde
+    ├── trainer.py                  # Training and evaluation loops
+    |
+    └── data                        # Data-related code
+        ├── clustering_utils.py     # Methods for clustering by sequence and structural similarity
+        ├── data_utils.py           # Methods for loading PDB files and handling coordinates
+        ├── dataset.py              # Dataset and batch sampler class
+        ├── featurizer.py           # Featurizer class
+        └── sec_struct_utils.py     # Methods for secondary structure prediction and determination
 ```
 
 
 
 ## Downloading Data
 
-We created a machine learning-ready dataset for RNA inverse design using [RNASolo](https://rnasolo.cs.put.poznan.pl) structures at resolution ≤3A. 
-Download and extract the raw `.pdb` files via the following script into the `data/raw/` directory.
-Running `main.py` for the first time will process the raw data and save the processed samples as a `.pt` file.
+gRNAde is trained on all RNA structures from the PDB at ≤4 Å resolution (12K 3D structures from 4.2K unique RNAs) downloaded via [RNASolo](https://rnasolo.cs.put.poznan.pl) on 31 October 2023.
+If you would like to train your own models from scratch, download and extract the raw `.pdb` files via the following script into the `data/raw/` directory.
 
 ```sh
-mkdir data/raw
-cd data/raw
-curl -O https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_3_0__3_280.zip
-unzip all_member_pdb_3_0__3_280.zip
-rm all_member_pdb_3_0__3_280.zip
+# Download structures in pdb format
+mkdir ~/rna-inverse-folding/data/raw
+cd ~/rna-inverse-folding/data/raw
+curl -O https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_4_0__3_300.zip
+unzip all_member_pdb_4_0__3_300.zip
+rm all_member_pdb_4_0__3_300.zip
+
+# Process raw data into ML-ready format (this may take several hours)
+cd ~/rna-inverse-folding/
+python scripts/process_data.py
 ```
 
 Manual download link: https://rnasolo.cs.put.poznan.pl/archive.
-Select the following for creating the download: 3D (PDB) + all molecules + all members + res. ≤3.0
-
+Select the following for creating the download: 3D (PDB) + all molecules + all members + res. ≤4.0
 
 
 ## Citation
 
 ```
-@article{joshi2023multi,
+@inproceedings{joshi2023multi,
   title={Multi-State RNA Design with Geometric Multi-Graph Neural Networks},
   author={Joshi, Chaitanya K. and Jamasb, Arian R. and Viñas, Ramon and Harris, Charles and Mathis, Simon and Liò, Pietro},
-  journal={arXiv preprint arXiv:2305.14749},
+  booktitle={ICML 2023 Workshop on Computational Biology},
   year={2023},
 }
 ```
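Review note: the updated README describes the encoder as followed by "conformer order-invariant pooling". The diff does not say which pooling operator is used, so the sketch below assumes a plain mean over the conformer axis, purely to illustrate the property: reordering the input conformers leaves the pooled per-nucleotide features unchanged. The function name `mean_pool_conformers` and the toy feature values are hypothetical.

```python
from itertools import permutations


def mean_pool_conformers(features):
    """Average features over conformers.

    features[c][n][d] is channel d of nucleotide n in conformer c;
    the result pooled[n][d] has no conformer axis.
    """
    num_conf = len(features)
    num_nodes = len(features[0])
    num_channels = len(features[0][0])
    return [
        [
            sum(features[c][n][d] for c in range(num_conf)) / num_conf
            for d in range(num_channels)
        ]
        for n in range(num_nodes)
    ]


# Toy input: 3 conformers, 2 nucleotides, 2 channels (assumed values).
states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 0.0], [5.0, 8.0]],
    [[2.0, 4.0], [1.0, 0.0]],
]

pooled = mean_pool_conformers(states)

# Every ordering of the conformers yields the same pooled features.
for order in permutations(states):
    assert mean_pool_conformers(list(order)) == pooled

print(pooled)
```

Any symmetric reduction (sum, max) would satisfy the same check; mean is used here only because it keeps the feature scale independent of the number of conformers.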
diff --git a/checkpoints/gRNAde_ARv1_1state.h5 b/checkpoints/gRNAde_ARv1_1state.h5
diff --git a/checkpoints/gRNAde_ARv1_2state.h5 b/checkpoints/gRNAde_ARv1_2state.h5
diff --git a/checkpoints/gRNAde_ARv1_3state.h5 b/checkpoints/gRNAde_ARv1_3state.h5
diff --git a/checkpoints/gRNAde_ARv1_5state.h5 b/checkpoints/gRNAde_ARv1_5state.h5
diff --git a/configs/default.yaml b/configs/default.yaml
@@ -3,7 +3,7 @@ gpu:
   value: 0
   desc: GPU ID
 seed:
-  value: 42
+  value: 0
   desc: Random seed for reproducibility
 save:
   value: True
@@ -13,75 +13,75 @@ save:
 data_path:
   value: "./data/"
   desc: Data directory (preprocessed and raw)
-process_raw:
-  value: True
-  desc: Whether to process datasets from raw .pdb files
-save_processed:
-  value: True
-  desc: Whether to save processed datasets
 radius:
   value: 4.5
   desc: Radius for determining local neighborhoods in Angstrom (currently not used)
 top_k:
-  value: 10
+  value: 32
   desc: Number of k-nearest neighbors
 num_rbf:
-  value: 16
+  value: 32
   desc: Number of radial basis functions to featurise distances
 num_posenc:
-  value: 16
+  value: 32
   desc: Number of positional encodings to featurise edges
-num_conformers:
-  value: 3
-  desc: Number of conformations sampled per sequence
+max_num_conformers:
+  value: 1
+  desc: Maximum number of conformations sampled per sequence
+noise_scale:
+  value: 0.1
+  desc: Std of gaussian noise added to node coordinates during training
+max_nodes_batch:
+  value: 3000
+  desc: Maximum number of nodes in batch
+max_nodes_sample:
+  value: 5000
+  desc: Maximum number of nodes in batches with single samples (i.e. maximum RNA length)
 
 # Splitting configurations
-eval_size:
-  value: 100
-  desc: Number of samples in val/test sets
 split:
-  value: 'seqid_rmsd'
-  desc: Type of data split (rmsd/struct)
+  value: 'das'
+  desc: Type of data split (das/structsim/seqid)
 
 # Model configurations
 model:
-  value: 'NAR'
+  value: 'ARv1'
   desc: Model architecture (AR/NAR)
 node_in_dim:
-  value: [64, 4]
+  value: [15, 4]  # (num_bb_atoms x 5, 2 + (num_bb_atoms - 1))
   desc: Input dimensions for node features (scalar channels, vector channels)
 node_h_dim:
   value: [128, 16]
   desc: Hidden dimensions for node features (scalar channels, vector channels)
 edge_in_dim:
-  value: [32, 1]
+  value: [131, 3]  # (num_bb_atoms x num_rbf + num_posenc + num_bb_atoms, num_bb_atoms)
   desc: Input dimensions for edge features (scalar channels, vector channels)
 edge_h_dim:
-  value: [32, 1]
+  value: [64, 4]
   desc: Hidden dimensions for edge features (scalar channels, vector channels)
 num_layers:
   value: 4
   desc: Number of layers for encoder/decoder
 drop_rate:
-  value: 0.1
+  value: 0.5
   desc: Dropout rate
 out_dim:
   value: 4
   desc: Output dimension (4 bases for RNA)
 
 # Training configurations
 epochs:
-  value: 100
+  value: 50
   desc: Number of training epochs
 lr:
-  value: 0.001
+  value: 0.0001
   desc: Learning rate
+label_smoothing:
+  value: 0.05
+  desc: Label smoothing for cross entropy loss
 batch_size:
   value: 8
   desc: Batch size for dataloaders (currently not used)
-max_nodes:
-  value: 5000
-  desc: Maximum number of nodes in batch
 num_workers:
   value: 8
   desc: Number of workers for dataloaders
@@ -92,13 +92,13 @@ val_every:
 # Evaluation configurations
 model_path:
   value: ''
-  desc: Path to model checkpoint (for testing)
-test_perplexity:
-  value: False
-  desc: Whether to test perplexity
-test_recovery:
+  desc: Path to model checkpoint for evaluation or reloading
+evaluate:
   value: False
-  desc: Whether to test recovery
+  desc: Whether to run evaluation (or training)
 n_samples:
-  value: 100
-  desc: Number of samples for testing recovery
+  value: 16
+  desc: Number of samples for evaluating recovery
+temperature:
+  value: 0.1
+  desc: Sampling temperature for evaluating recovery
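Review note: the new `node_in_dim` and `edge_in_dim` values carry inline formulas in their YAML comments. The sketch below evaluates those formulas to confirm they agree with the hard-coded values under the new defaults (`num_rbf: 32`, `num_posenc: 32`). It assumes three backbone atoms per nucleotide, which is inferred from the `[15, 4]` value rather than stated in the config. The function names are hypothetical.

```python
def node_in_dim(num_bb_atoms):
    """(scalar, vector) node channels: (num_bb_atoms x 5, 2 + (num_bb_atoms - 1))."""
    return (num_bb_atoms * 5, 2 + (num_bb_atoms - 1))


def edge_in_dim(num_bb_atoms, num_rbf, num_posenc):
    """(scalar, vector) edge channels:
    (num_bb_atoms x num_rbf + num_posenc + num_bb_atoms, num_bb_atoms)."""
    return (num_bb_atoms * num_rbf + num_posenc + num_bb_atoms, num_bb_atoms)


NUM_BB_ATOMS = 3   # assumption: inferred from node_in_dim = [15, 4]
NUM_RBF = 32       # from configs/default.yaml
NUM_POSENC = 32    # from configs/default.yaml

print(node_in_dim(NUM_BB_ATOMS))                       # (15, 4)
print(edge_in_dim(NUM_BB_ATOMS, NUM_RBF, NUM_POSENC))  # (131, 3)
```

Because the input dimensions are hard-coded rather than derived, changing `num_rbf` or `num_posenc` in the config would also require updating `edge_in_dim` by hand.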