Introduction

In this study we aim to analyse the effect that padding the amino acid sequences has on the performance of deep learning models. Specifically, we have chosen a hierarchical EC number prediction task to carry out the analysis. We use three different architectures (only_denses, 1_conv, stack_conv) to check if they affect the padding effect.

The data used in this study consists on all the reviewed protein sequences of the taxonomy Archaea from Uniprot version 2019_05.

The project is almost entirely coded in Python 3.6.7. R version 3.4.4 has been used for building the explanatory linear models (notebooks 07 and 09).

The findings of this analysis are explained in the following article: Lopez-del Rio, A., Martin, M., Perera-Lluna, A. et al. Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Sci Rep 10, 14634 (2020). https://doi.org/10.1038/s41598-020-71450-8.

Structure

Functions are defined in the src/ folder (.py and .R files)
The workflow of the analysis is applied through Jupyter Notebooks in the notebooks/ folder. The files are preceded by a number that indicates the chronological order of their execution.
raw_data/ folder contains the CSV file with reviewed Uniprot entries for Archaea.
data/ is the folder where intermediate and final results will be stored when running the notebooks.

Workflow (notebooks)

1. Data preprocessing

00_creating_data.ipynb
01_pre_statistics.ipynb

2. Models training and results processing

02_task1_training.ipynb
03_task1_comparison.ipynb
04_task2_training.ipynb
05_task2_comparison.ipynb

3. Performance metrics and graphical representation

06_comparing_architectures.ipynb
07_linear_models_metrics.ipynb
08_activations_sequences_pca.ipynb
09_linear_models_pcs.ipynb

System requirements

The runs have been executed on the following hardware from the B2SLab (Universitat Politecnica de Catalunya):

tob: 8 threads, 32GB RAM, NVIDIA TITAN Xp GeForce GTX 1070
lapsus: 12 threads, 32GB RAM, 2 x NVIDIA GeForce GTX 1070

Considerations

The absPath variable at the beginning of each notebook and src file should be changed once the repository is cloned to the correct path in each case.
Notebooks 02-04 are implemented to analyse only one architecture each time. The block of variables/parameters in the second cell corresponding to the studied architecture should be uncommented (and the rest, commented) prior to running the notebook.
In src/Target, all the different types of padding tested in this study are implemented.
Execution without GPU of notebooks 02-05 and 08 may require considerable time and it is not recommended.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Introduction

Structure

Workflow (notebooks)

1. Data preprocessing

2. Models training and results processing

3. Performance metrics and graphical representation

System requirements

Considerations

Files

README.md

Latest commit

History

README.md

File metadata and controls

Introduction

Structure

Workflow (notebooks)

1. Data preprocessing

2. Models training and results processing

3. Performance metrics and graphical representation

System requirements

Considerations