Initial commit
chaitjo committed May 24, 2023
1 parent a8206d0 commit 7889ef8
Showing 13 changed files with 2,616 additions and 2 deletions.
159 changes: 159 additions & 0 deletions .gitignore
@@ -0,0 +1,159 @@
# Custom
/data
/env
/wandb
.DS_Store
# *.ipynb

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
105 changes: 103 additions & 2 deletions README.md
@@ -1,2 +1,103 @@
# geometric-rna-design
gRNAde: Geometric RNA Design Pipeline
# 💣 gRNAde: Geometric RNA Design

**gRNAde** is a geometric deep learning pipeline for 3D RNA inverse design conditioned on *multiple* backbone conformations.
gRNAde explicitly accounts for RNA conformational flexibility via a novel **multi-Graph Neural Network** architecture which independently encodes a set of conformers via message passing.

![](fig/grnade_pipeline.png)

Check out the accompanying paper ['Multi-State RNA Design with Geometric Multi-Graph Neural Networks'](https://arxiv.org/abs/TODO), which introduces gRNAde.
> Chaitanya K. Joshi, Arian R. Jamasb, Ramon Viñas, Charles Harris, Simon Mathis, and Pietro Liò. Multi-State RNA Design with Geometric Multi-Graph Neural Networks. *arXiv preprint, 2023.*
>
> [PDF](https://arxiv.org/pdf/TODO)

❗️**Note:** gRNAde is under active development.


## Directory Structure and Usage

```
.
├── README.md
|
├── data                    # Data files directory
├── notebooks               # Jupyter notebooks directory
├── configs                 # Configuration files directory
|
├── main.py                 # Main script for launching experiments
|
└── src
    ├── models.py           # Multi-GNN encoder layers and model
    ├── train               # Helper functions for training and evaluation
    ├── data.py             # RNA inverse design dataset
    ├── data_utils.py       # Helper functions for data preparation
    └── featurisation.py    # Input featurisation helpers
```



## Installation

Our experiments used Python 3.8.16 and CUDA 11.3 on NVIDIA Quadro RTX 8000 GPUs.

```sh
# Create new conda environment
conda create --prefix ./env python=3.8
conda activate ./env

# Install PyTorch (Check CUDA version for GPU!)
# Option 1: CPU
# conda install pytorch==1.12.0 -c pytorch
#
# Option 2: GPU, CUDA 11.3
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.3 -c pytorch

# Install dependencies
conda install matplotlib pandas networkx
pip install biopython wandb pyyaml ipdb
conda install jupyterlab -c conda-forge
conda install -c bioconda cd-hit

# Install PyG (Check CPU/GPU/MacOS)
# Option 1: CPU, MacOS
# pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-1.12.0+cpu.html
# pip install torch-geometric
#
# Option 2: GPU, CUDA 11.3
pip install torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-1.12.1+cu113.html
pip install torch-geometric
#
# Option 3:
# conda install pyg -c pyg # CPU/GPU, but may not work on MacOS
```
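
After installing, it can help to confirm that every package imports cleanly in the new environment. The following is a minimal sketch and not part of the repository: the helper names `check_environment` and `format_report` are hypothetical, and the module list is an assumption derived from the install commands above (note that `biopython` imports as `Bio` and `pyyaml` as `yaml`).

```python
import importlib
import importlib.util

# Import names for the packages installed above (assumed list, adjust as needed).
REQUIRED = [
    "torch", "torch_geometric", "torch_scatter", "torch_sparse", "torch_cluster",
    "Bio", "wandb", "yaml", "pandas", "networkx", "matplotlib",
]


def check_environment(modules=REQUIRED):
    """Return {module_name: version string, or None if the module is not installed}."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            module = importlib.import_module(name)
        except Exception as err:
            # Installed but unusable, e.g. a compiled extension that fails to load.
            report[name] = f"import failed ({type(err).__name__})"
            continue
        report[name] = str(getattr(module, "__version__", "unknown"))
    return report


def format_report(report):
    """Render the report as one aligned line per module."""
    return "\n".join(
        f"{name:<16} {version if version is not None else 'MISSING'}"
        for name, version in report.items()
    )
```

Running `print(format_report(check_environment()))` inside the activated environment lists one line per package. A `MISSING` or `import failed` entry for one of the `torch_*` extensions suggests rechecking that the PyG wheel URL matches the installed PyTorch and CUDA versions.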



## Downloading Data

We created a machine learning-ready dataset for RNA inverse design using [RNASolo](https://rnasolo.cs.put.poznan.pl) structures at resolution ≤3 Å.
The following script downloads and extracts the raw `.pdb` files into the `data/raw/` directory.
Running `main.py` for the first time will process the raw data and save the processed samples as a `.pt` file.

```sh
mkdir -p data/raw
cd data/raw
curl -O https://rnasolo.cs.put.poznan.pl/media/files/zipped/bunches/pdb/all_member_pdb_3_0__3_280.zip
unzip all_member_pdb_3_0__3_280.zip
rm all_member_pdb_3_0__3_280.zip
```

To download manually, go to https://rnasolo.cs.put.poznan.pl/archive and select the following options when creating the download: 3D (PDB) + all molecules + all members + res. ≤3.0
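
For environments without `curl` and `unzip`, the same steps can be sketched in Python using only the standard library. This is an illustrative alternative and not a script shipped with the repository: the function names `extract_structures` and `download_rnasolo` are hypothetical, and the URL is copied from the shell commands above.

```python
import urllib.request
import zipfile
from pathlib import Path

# Archive URL copied from the shell snippet above.
RNASOLO_URL = (
    "https://rnasolo.cs.put.poznan.pl/media/files/zipped/"
    "bunches/pdb/all_member_pdb_3_0__3_280.zip"
)


def extract_structures(zip_path, raw_dir="data/raw"):
    """Extract an archive into raw_dir and return the sorted .pdb paths found there."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(raw_dir)
    return sorted(raw_dir.rglob("*.pdb"))


def download_rnasolo(raw_dir="data/raw", url=RNASOLO_URL):
    """Download the archive, extract it, delete the zip, and return the .pdb paths."""
    raw_dir = Path(raw_dir)
    raw_dir.mkdir(parents=True, exist_ok=True)
    zip_path = raw_dir / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, zip_path)
    pdb_files = extract_structures(zip_path, raw_dir)
    zip_path.unlink()  # mirrors the `rm` step of the shell version
    return pdb_files
```

Calling `download_rnasolo()` from the repository root should leave the extracted files under `data/raw/`, and the length of the returned list gives a quick count of the structures that were unpacked.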



## Citation

```
@article{joshi2023multi,
title={Multi-State RNA Design with Geometric Multi-Graph Neural Networks},
author={Joshi, Chaitanya K. and Jamasb, Arian R. and Viñas, Ramon and Harris, Charles and Mathis, Simon and Liò, Pietro},
journal={arXiv preprint},
year={2023},
}
```
104 changes: 104 additions & 0 deletions configs/default.yaml
@@ -0,0 +1,104 @@
# Misc configurations
gpu:
  value: 0
  desc: GPU ID
seed:
  value: 42
  desc: Random seed for reproducibility
save:
  value: True
  desc: Whether to save current and best model checkpoint

# Data configurations
data_path:
  value: "./data/"
  desc: Data directory (preprocessed and raw)
process_raw:
  value: True
  desc: Whether to process datasets from raw .pdb files
save_processed:
  value: True
  desc: Whether to save processed datasets
radius:
  value: 4.5
  desc: Radius for determining local neighborhoods in Angstrom (currently not used)
top_k:
  value: 10
  desc: Number of k-nearest neighbors
num_rbf:
  value: 16
  desc: Number of radial basis functions to featurise distances
num_posenc:
  value: 16
  desc: Number of positional encodings to featurise edges
num_conformers:
  value: 3
  desc: Number of conformations sampled per sequence

# Splitting configurations
eval_size:
  value: 200
  desc: Number of samples in val/test sets
split:
  value: 'rmsd'
  desc: Type of data split (random/rmsd/struct/seq_identity)

# Model configurations
model:
  value: 'MultiGVPGNN'
  desc: Model architecture
node_in_dim:
  value: [1, 4]
  desc: Input dimensions for node features (scalar channels, vector channels)
node_h_dim:
  value: [128, 16]
  desc: Hidden dimensions for node features (scalar channels, vector channels)
edge_in_dim:
  value: [32, 1]
  desc: Input dimensions for edge features (scalar channels, vector channels)
edge_h_dim:
  value: [32, 1]
  desc: Hidden dimensions for edge features (scalar channels, vector channels)
num_layers:
  value: 3
  desc: Number of layers for encoder/decoder
drop_rate:
  value: 0.1
  desc: Dropout rate
out_dim:
  value: 4
  desc: Output dimension (4 bases for RNA)

# Training configurations
epochs:
  value: 100
  desc: Number of training epochs
lr:
  value: 0.001
  desc: Learning rate
batch_size:
  value: 8
  desc: Batch size for dataloaders (currently not used)
max_nodes:
  value: 5000
  desc: Maximum number of nodes in batch
num_workers:
  value: 8
  desc: Number of workers for dataloaders
val_every:
  value: 5
  desc: Interval of training epochs after which validation is performed

# Evaluation configurations
model_path:
  value: ''
  desc: Path to model checkpoint (for testing)
test_perplexity:
  value: False
  desc: Whether to test perplexity
test_recovery:
  value: False
  desc: Whether to test recovery
n_samples:
  value: 100
  desc: Number of samples for testing recovery
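
Each option in `configs/default.yaml` is a mapping with a `value` field and a `desc` field, a layout that resembles the one Weights & Biases accepts for config files. How `main.py` consumes the file is not shown in this commit, so the following is only a sketch of how such a layout could be flattened into plain hyperparameters. The helper `flatten_config` is hypothetical, and the `raw` dict stands in for what `yaml.safe_load` would return for a few of the entries.

```python
def flatten_config(raw):
    """Split {name: {"value": v, "desc": d}} into ({name: v}, {name: d})."""
    values, descriptions = {}, {}
    for name, entry in raw.items():
        if not isinstance(entry, dict) or "value" not in entry:
            raise ValueError(f"config entry {name!r} has no 'value' field")
        values[name] = entry["value"]
        descriptions[name] = entry.get("desc", "")
    return values, descriptions


# A few entries from the config, written as the nested dict a YAML parser
# would produce (stand-in data, not read from disk).
raw = {
    "top_k": {"value": 10, "desc": "Number of k-nearest neighbors"},
    "node_h_dim": {"value": [128, 16], "desc": "Hidden dimensions for node features"},
    "split": {"value": "rmsd", "desc": "Type of data split"},
}

values, descriptions = flatten_config(raw)
print(values["top_k"], values["node_h_dim"], values["split"])
```

Keeping values and descriptions separate means the values dict can be passed straight to model or dataloader constructors, while the descriptions remain available for help text or logging.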
Binary file added fig/grnade_pipeline.png