Initial commit

chaitjo · May 24, 2023 · 7889ef8
1 parent a8206d0
Showing 13 changed files with 2,616 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,159 @@
+# Custom
+/data
+/env
+/wandb
+.DS_Store
+# *.ipynb
+
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+.idea/
diff --git a/README.md b/README.md
@@ -1,2 +1,103 @@
-# geometric-rna-design
-gRNAde: Geometric RNA Design Pipeline
+# 💣 gRNAde: Geometric RNA Design
+
+**gRNAde** is a geometric deep learning pipeline for 3D RNA inverse design conditioned on *multiple* backbone conformations.
+gRNAde explicitly accounts for RNA conformational flexibility via a novel **multi-Graph Neural Network** architecture that independently encodes each conformer in a set via message passing.
+
+![](fig/grnade_pipeline.png)
+
+Check out the accompanying paper ['Multi-State RNA Design with Geometric Multi-Graph Neural Networks'](https://arxiv.org/abs/TODO), which introduces gRNAde.
+> Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, and Pietro Liò. Multi-State RNA Design with Geometric Multi-Graph Neural Networks. *arXiv preprint, 2023.*
+>
+>[PDF](https://arxiv.org/pdf/TODO) |
+
+❗️**Note:** gRNAde is under active development.
+
+
+## Directory Structure and Usage
+
+```
+.
+├── README.md
+|
+├── data                    # Data files directory
+├── notebooks               # Jupyter notebooks directory
+├── configs                 # Configuration files directory
+| 
+├── main.py                 # Main script for launching experiments
+|
+└── src
+    ├── models.py           # Multi-GNN encoder layers and model
+    ├── train               # Helper functions for training and evaluation
+    ├── data.py             # RNA inverse design dataset
+    ├── data_utils.py       # Helper functions for data preparation
+    └── featurisation.py    # Input featurisation helpers
+```
+
+
+
+## Installation
+
+Our experiments used Python 3.8.16 and CUDA 11.3 on NVIDIA Quadro RTX 8000 GPUs.
+
+```sh
+# Create new conda environment
+conda create --prefix ./env python=3.8
+conda activate ./env
+
+# Install PyTorch (Check CUDA version for GPU!)
+# Option 1: CPU
+# conda install pytorch==1.12.0 -c pytorch
+#
+# Option 2: GPU, CUDA 11.3
+conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch
+
+# Install dependencies
+conda install matplotlib pandas networkx
+pip install biopython wandb pyyaml ipdb 
+conda install jupyterlab -c conda-forge
+conda install -c bioconda cd-hit
+
+# Install PyG (Check CPU/GPU/MacOS)
+# Option 1: CPU, MacOS
+# pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-1.12.0+cpu.html 
+# pip install torch-geometric
+#
+# Option 2: GPU, CUDA 11.3
+pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-1.12.1+cu113.html
+pip install torch-geometric
+#
+# Option 3: 
+# conda install pyg -c pyg  # CPU/GPU, but may not work on MacOS
+```
+
+
+
+## Downloading Data
+
+We created a machine learning-ready dataset for RNA inverse design using [RNASolo](https://rnasolo.cs.put.poznan.pl) structures at resolution ≤3.0 Å.
+Use the following script to download and extract the raw `.pdb` files into the `data/raw/` directory.
+Running `main.py` for the first time will process the raw data and save the processed samples as a `.pt` file.
+
+```sh
+mkdir -p data/raw
+cd data/raw
+curl -O https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_3_0__3_280.zip
+unzip all_member_pdb_3_0__3_280.zip
+rm all_member_pdb_3_0__3_280.zip
+```
+
+Manual download link: https://rnasolo.cs.put.poznan.pl/archive
+When creating the download, select: 3D (PDB) + all molecules + all members + resolution ≤3.0 Å.
+
+
+
+## Citation
+
+```
+@article{joshi2023multi,
+  title={Multi-State RNA Design with Geometric Multi-Graph Neural Networks},
+  author={Joshi, Chaitanya K. and Jamasb, Arian R. and Viñas, Ramon and Harris, Charles and Mathis, Simon and Liò, Pietro},
+  journal={arXiv preprint},
+  year={2023},
+}
+```
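
The download steps in the README's "Downloading Data" section can also be sketched in Python, which may help on systems without `curl` or `unzip`. This is a minimal sketch, not part of the commit: the helper names `extract_and_remove` and `download_raw_data` are hypothetical, and only the URL and the `data/raw` target directory are taken from the README.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the curl command in the README.
RNASOLO_ZIP_URL = (
    "https://rnasolo.cs.put.poznan.pl/media/files/zipped/"
    "bunches/pdb/all_member_pdb_3_0__3_280.zip"
)


def extract_and_remove(zip_path, dest_dir):
    """Unzip `zip_path` into `dest_dir`, delete the archive, return member names.

    Hypothetical helper mirroring the `unzip` + `rm` steps of the shell script.
    """
    zip_path, dest_dir = Path(zip_path), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    zip_path.unlink()  # same effect as `rm <archive>.zip`
    return names


def download_raw_data(raw_dir="data/raw", url=RNASOLO_ZIP_URL):
    """Download the RNASolo archive into `raw_dir` and extract it there.

    Hypothetical helper; requires network access when called.
    """
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p data/raw`
    zip_path = raw_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, zip_path)
    return extract_and_remove(zip_path, raw_dir)
```

Calling `download_raw_data()` from the repository root would populate `data/raw/`. Nothing is downloaded on import, so the helpers can be exercised offline with a small local zip file.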
diff --git a/configs/default.yaml b/configs/default.yaml
@@ -0,0 +1,104 @@
+# Misc configurations
+gpu:
+  value: 0
+  desc: GPU ID
+seed:
+  value: 42
+  desc: Random seed for reproducibility
+save:
+  value: True
+  desc: Whether to save current and best model checkpoint
+
+# Data configurations
+data_path:
+  value: "./data/"
+  desc: Data directory (preprocessed and raw)
+process_raw:
+  value: True
+  desc: Whether to process datasets from raw .pdb files
+save_processed:
+  value: True
+  desc: Whether to save processed datasets
+radius:
+  value: 4.5
+  desc: Radius for determining local neighborhoods in Angstrom (currently not used)
+top_k:
+  value: 10
+  desc: Number of k-nearest neighbors
+num_rbf:
+  value: 16
+  desc: Number of radial basis functions to featurise distances
+num_posenc:
+  value: 16
+  desc: Number of positional encodings to featurise edges
+num_conformers:
+  value: 3
+  desc: Number of conformations sampled per sequence
+
+# Splitting configurations
+eval_size:
+  value: 200
+  desc: Number of samples in val/test sets
+split:
+  value: 'rmsd'
+  desc: Type of data split (random/rmsd/struct/seq_identity)
+
+# Model configurations
+model:
+  value: 'MultiGVPGNN'
+  desc: Model architecture
+node_in_dim:
+  value: [1, 4]
+  desc: Input dimensions for node features (scalar channels, vector channels)
+node_h_dim:
+  value: [128, 16]
+  desc: Hidden dimensions for node features (scalar channels, vector channels)
+edge_in_dim:
+  value: [32, 1]
+  desc: Input dimensions for edge features (scalar channels, vector channels)
+edge_h_dim:
+  value: [32, 1]
+  desc: Hidden dimensions for edge features (scalar channels, vector channels)
+num_layers:
+  value: 3
+  desc: Number of layers for encoder/decoder
+drop_rate:
+  value: 0.1
+  desc: Dropout rate
+out_dim:
+  value: 4
+  desc: Output dimension (4 bases for RNA)
+
+# Training configurations
+epochs:
+  value: 100
+  desc: Number of training epochs
+lr:
+  value: 0.001
+  desc: Learning rate
+batch_size:
+  value: 8
+  desc: Batch size for dataloaders (currently not used)
+max_nodes:
+  value: 5000
+  desc: Maximum number of nodes in batch
+num_workers:
+  value: 8
+  desc: Number of workers for dataloaders
+val_every:
+  value: 5
+  desc: Interval of training epochs after which validation is performed
+
+# Evaluation configurations
+model_path:
+  value: ''
+  desc: Path to model checkpoint (for testing)
+test_perplexity:
+  value: False
+  desc: Whether to test perplexity
+test_recovery:
+  value: False
+  desc: Whether to test recovery
+n_samples:
+  value: 100
+  desc: Number of samples for testing recovery
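
Every entry in `configs/default.yaml` pairs a `value` with a human-readable `desc`. How `main.py` consumes this file is not shown in this part of the diff, so the following is only a sketch of one way to flatten such entries into attribute-style settings. The function name `flatten_config` is hypothetical, and the dictionary below stands in for what a YAML loader such as PyYAML's `yaml.safe_load` would return for a few of the entries above.

```python
from types import SimpleNamespace

# A few entries copied from configs/default.yaml, written as the nested
# dict a YAML loader would produce (loading via PyYAML is an assumption).
raw_config = {
    "seed": {"value": 42, "desc": "Random seed for reproducibility"},
    "top_k": {"value": 10, "desc": "Number of k-nearest neighbors"},
    "num_conformers": {
        "value": 3,
        "desc": "Number of conformations sampled per sequence",
    },
    "node_h_dim": {
        "value": [128, 16],
        "desc": "Hidden dimensions for node features "
                "(scalar channels, vector channels)",
    },
}


def flatten_config(raw):
    """Split {key: {value, desc}} entries into settings and descriptions.

    Hypothetical helper: returns a namespace of plain values plus a dict
    mapping each key to its description string.
    """
    values, descriptions = {}, {}
    for key, entry in raw.items():
        if not isinstance(entry, dict) or "value" not in entry:
            raise ValueError(f"config entry {key!r} has no 'value' field")
        values[key] = entry["value"]
        descriptions[key] = entry.get("desc", "")
    return SimpleNamespace(**values), descriptions


config, descriptions = flatten_config(raw_config)
scalar_dim, vector_dim = config.node_h_dim  # (scalar, vector) channels
print(config.seed, config.top_k, scalar_dim, vector_dim)
```

Keeping the descriptions in a separate mapping leaves the settings object free of metadata while still allowing the `desc` strings to be reused, for example as help text.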
diff --git a/fig/grnade_pipeline.png b/fig/grnade_pipeline.png