Merge pull request #16 from hoelzer/blastp

Adjustments after review comments
hoelzer · Feb 9, 2024 · 092166e · 092166e
2 parents 5c4abba + eae5409
commit 092166e
Show file tree

Hide file tree

Showing 9 changed files with 178 additions and 34 deletions.
diff --git a/README.md b/README.md
@@ -1,26 +1,59 @@
 # Calculation of the Percentage Of Conserved Proteins
 
 ![](https://img.shields.io/badge/nextflow->=20.01.0-brightgreen)
-![](https://img.shields.io/badge/uses-ruby-red)
 ![](https://img.shields.io/badge/can_use-conda/mamba-yellow.svg)
 ![](https://img.shields.io/badge/can_use-docker-blue.svg)
 ![](https://img.shields.io/badge/can_use-singularity-orange.svg)
 ![](https://img.shields.io/badge/licence-GLP3-lightgrey.svg)
 
 [![Twitter Follow](https://img.shields.io/twitter/follow/martinhoelzer.svg?style=social)](https://twitter.com/martinhoelzer) 
 
-__Update 2023/12: One-vs-All comparisons are now possible in genome and protein input mode. Check `--help` message.__
+1. [ Objective ](#objective)
+2. [ How ](#how)
+3. [ Requirements ](#need)
+4. [ Execution examples ](#run)
+5. [ Example output ](#example)
+6. [ Extend a calculation with more genomes ](#extend)
+7. [ A note on alignment calculation (DIAMOND vs. BLASTP) ](#diamond)
+8. [ Parameter adjustments (danger zone) ](#parameter)
+9. [ All-vs-All and One-vs-All ](#allvsall)
+10. [ Cite ](#cite)
+11. [ Update backlog ](#backlog)
 
-__Update 2023/10: Now using [Diamond](https://www.nature.com/articles/s41592-021-01101-x) instead of Blast for protein alignments. Thx [@michoug](https://github.com/michoug) for the Pull Request.__
+<a name="objective"></a>
 
-__Update 2023/05: Re-implementation as a [Nextflow pipeline](nextflow.io). Please feel free to report any [issues](https://github.com/hoelzer/pocp/issues)!__
+## Objective
+
+Sequence technology advancements have led to an exponential increase in bacterial genomes, necessitating robust taxonomic classification methods. The **P**ercentage **O**f **C**onserved **P**roteins (POCP), proposed initially by [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738), is a valuable metric for assessing prokaryote genus boundaries. A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%. Here, I introduce a computational pipeline for automated POCP calculation, aiming to enhance reproducibility and ease of use in taxonomic studies.
+
+<a name="how"></a>
+
+## How
 
 As input use one amino acid sequence FASTA file per genome such as provided by
-[Prokka](https://github.com/tseemann/prokka) or genome FASTA files which will be then annotated via [Prokka](https://github.com/tseemann/prokka). 
-The pipeline will then calculate all-vs-all pairwise alignments between all protein sequences and use this
-information for POCP calculation following [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738). For one-vs-all comparisons see below.
+[Prokka](https://github.com/tseemann/prokka) or [Bakta](https://github.com/oschwengers/bakta) or genome FASTA files which will be then annotated via [Prokka](https://github.com/tseemann/prokka) automatically. The pipeline will then calculate all-vs-all pairwise alignments between all protein sequences using the blastp mode of [DIAMOND](https://www.nature.com/articles/nmeth.3176) and use this information for POCP calculation following the original formula of [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738). One-vs-all comparisons are also possible, see below.
+
+<a name="need"></a>
+
+## Requirements
+
+You only need `nextflow` to install and run the pipeline. `nextflow` will take care of all dependencies and install them if necessary. For installing the dependencies (such as Prokka and DIAMOND), you can choose between `conda`, `mamba`, `docker` or `singularity`. I recommend using `docker`. Then install and run the pipeline:
+
+### Workflow management
+
+- [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html#installation)
+
+### Dependencies management
 
-You only need `nextflow` and `conda` or `mamba` or `docker` or `singularity` to run the pipeline. I recommend using `docker`. Then install and run the pipeline:
+- [Conda/Mamba](https://docs.conda.io/en/latest/miniconda.html)
+- [Docker](https://docs.docker.com/get-docker/)
+- [Singularity](https://apptainer.org/docs/)
+
+Use the `-profile` parameter to switch between different dependency managers, e.g. `-profile conda`.
+
+<a name="run"></a>
+
+## Execution examples
 
 ```bash
 # get the pipeline code
@@ -30,35 +63,87 @@ nextflow pull hoelzer/pocp
 nextflow info hoelzer/pocp 
 
 # get the help page and define a release version. ATTENTION: use latest version. 
-nextflow run hoelzer/pocp -r 2.2.0 --help
+nextflow run hoelzer/pocp -r 2.3.0 --help
 
 # example with genome files as input, all-vs-all comparison, performing a local execution and using Docker
-nextflow run hoelzer/pocp -r 2.2.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' -profile local,docker
+nextflow run hoelzer/pocp -r 2.3.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' -profile local,docker
 
 # example with protein FASTA files as input (e.g. from Prokka pre-calculated), all-vs-all comparison, performing a SLURM execution and using conda
-nextflow run hoelzer/pocp -r 2.2.0 --proteins $HOME'/.nextflow/assets/hoelzer/pocp/example/*.faa' -profile slurm,conda
+nextflow run hoelzer/pocp -r 2.3.0 --proteins $HOME'/.nextflow/assets/hoelzer/pocp/example/*.faa' -profile slurm,conda
 
 # example with genome files as input, additional genome file to activate one-vs-all comparison, performing a local execution and using Docker
-nextflow run hoelzer/pocp -r 2.2.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' --genome $HOME/.nextflow/assets/hoelzer/pocp/example/Cav_10DC88.fasta -profile local,docker
+nextflow run hoelzer/pocp -r 2.3.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' --genome $HOME/.nextflow/assets/hoelzer/pocp/example/Cav_10DC88.fasta -profile local,docker
 ```
 
+<a name="example"></a>
+
+## Example output
+
 The final output (`pocp-matrix.tsv`) should look like this (here, the resulting TSV was imported into Numbers on MacOS):
 
 ![Example output](example_output.png)
 
-If needed, the following parameters used for filtering the `diamond` results (blastp mode) can be
-adjusted:
+<a name="extend"></a>
+
+## Extend a calculation with more genomes
+
+`nextflow` allows you to resume a calculation without recalculating previous steps. This is possible, because `nextflow` stores intermediate and results files in a `work` directory. If you want to resume a calculation later, dont delete the `work` directory and use the same `nextflow run` command. For example:
+
+```bash
+# first iteration
+nextflow run hoelzer/pocp -r 2.3.0 --genomes 'input/*.fasta' -profile local,docker
+
+# add more genomes to the input folder:
+cp /some/other/genomes/*.fasta input/
+
+# resume the calculation via adding "-resume".
+nextflow run hoelzer/pocp -r 2.3.0 --genomes 'input/*.fasta' -profile local,docker -resume
+```
+
+<a name="diamond"></a>
+
+## A note on alignment calculation (DIAMOND vs. BLASTP)
+
+The pipeline identifies orthologous proteins between species using DIAMOND in blastp mode. Please note that the original POCP publication used BLASTP for calculating the alignments. However, DIAMOND is not only faster, which is an advantage when calculating POCP values for larger input data sets, but also achieves the sensitivity of BLASTP ([Buchfink (2021)](https://www.nature.com/articles/s41592-021-01101-x)), especially when using the `--ultra-sensitive` mode, which is activated by default in the pipeline. Another study comparing different alignment programs found that DIAMOND offered the best compromise between speed, sensitivity and quality when a sensitivity option other than the default setting was selected ([Hernández-Salmerón and Moreno-Hagelsieb (2020)](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07132-6)). Therefore I decided to use DIAMOND as a more modern solution for the alignment calculation in POCP-nf. Thx [@michoug](https://github.com/michoug) for the idea and the Pull Request.
+
+<a name="parameter"></a>
+
+## Parameter adjustments (danger zone)
+
+**Attention**: Although you can customize core parameters in the `nextflow` pipeline, I recommend sticking to the original parameters defined by [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738) and otherwise clearly indicating any changed parameter options along with the version of POCP-nf used when sharing POCP results! The default parameters defined in the configuration of the pipeline are:
 
 ```bash
 --evalue 1e-5
 --seqidentity 0.4
 --alnlength 0.5
 ```
 
-Please note that per default an "all-vs-all" comparison is performed based on the provided FASTA files. However, you can also switch to an "one-vs-all" comparison by additionally providing a single genome FASTA via `--genome` next to the `--genomes` input **or** a single protein multi-FASTA via `--protein` next to the `--proteins` input. In both cases, only "one-vs-all" comparisons will be performed. It is also possible to combine `--genomes` with a target `--protein` FASTA for "one-vs-all" and vice versa. 
+<a name="allvsall"></a>
+
+## All-vs-All and One-vs-All
+
+Please also note that per default an "all-vs-all" comparison is performed based on the provided FASTA files. However, you can also switch to an "one-vs-all" comparison by additionally providing a single genome FASTA via `--genome` next to the `--genomes` input **or** a single protein multi-FASTA via `--protein` next to the `--proteins` input. In both cases, only "one-vs-all" comparisons will be performed. It is also possible to combine `--genomes` with a target `--protein` FASTA for "one-vs-all" and vice versa. 
+
+<a name="cite"></a>
 
-If you use the POCP Nextflow pipeline, please cite the original POCP study that introduced the metric and the POCP-nf pipeline:
+## Cite
 
-**[Qin, Qi-Long, _et al_. "A proposed genus boundary for the prokaryotes based on genomic insights." Journal of bacteriology 196.12 (2014): 2210-2215.](https://pubmed.ncbi.nlm.nih.gov/24706738/)**
+If you use the POCP Nextflow pipeline, please cite the original POCP study that introduced the metric, the publications of the great tools used in the pipeline, and the POCP-nf pipeline:
 
-**[Martin Hölzer. "POCP: An automatic Nextflow pipeline for calculating the percentage of conserved proteins in bacterial taxonomy". SOME JOURNAL. Hopefully in 2024.]()**
+[Qin, Qi-Long, _et al_. "A proposed genus boundary for the prokaryotes based on genomic insights." Journal of bacteriology 196.12 (2014): 2210-2215.](https://pubmed.ncbi.nlm.nih.gov/24706738/)
+
+[Martin Hölzer. "POCP: An automatic Nextflow pipeline for calculating the percentage of conserved proteins in bacterial taxonomy". SOME JOURNAL. Hopefully in 2024.]()
+
+[Buchfink, Benjamin, Chao Xie, and Daniel H. Huson. "Fast and sensitive protein alignment using DIAMOND." Nature methods 12.1 (2015): 59-60.](https://www.nature.com/articles/nmeth.3176)
+
+[Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069.](https://doi.org/10.1093/bioinformatics/btu153)
+
+<a name="backlog"></a>
+
+## Updates backlog
+
+__Update 2023/12: One-vs-All comparisons are now possible in genome and protein input mode. Check `--help` message.__
+
+__Update 2023/10: Now using [Diamond](https://www.nature.com/articles/s41592-021-01101-x) instead of Blastp for protein alignments. Thx [@michoug](https://github.com/michoug) for the Pull Request.__
+
+__Update 2023/05: Re-implementation as a [Nextflow pipeline](nextflow.io). Please feel free to report any [issues](https://github.com/hoelzer/pocp/issues)!__
diff --git a/configs/conda.config b/configs/conda.config
@@ -1,5 +1,6 @@
 process {
     withLabel:	prokka	{ conda = "$baseDir/envs/prokka.yaml" } 
+    withLabel:	blast	{ conda = "$baseDir/envs/blast.yaml" } 
     withLabel:	diamond	{ conda = "$baseDir/envs/diamond.yaml" } 
     withLabel:	pocp	{ conda = "$baseDir/envs/diamond.yaml" } 
     withLabel:	ruby	{ conda = "$baseDir/envs/ruby.yaml" } 

diff --git a/configs/container.config b/configs/container.config
@@ -1,6 +1,7 @@
 process {
     withLabel: prokka            { container = 'nanozoo/prokka:1.14.6--773a90d' }
-    withLabel: diamond           { container = 'nanozoo/diamond:2.1.8--b6f1f11' }
-    withLabel: pocp              { container = 'nanozoo/diamond:2.1.8--b6f1f11' } 
+    withLabel: blast             { container = 'nanozoo/blast:2.15.0--139f5e2' }
+    withLabel: diamond           { container = 'nanozoo/diamond:2.1.9--d692301' }
+    withLabel: pocp              { container = 'nanozoo/diamond:2.1.9--d692301' } 
     withLabel: ruby              { container = 'nanozoo/ruby:3.2.2--fdadcb3' } 
 }
diff --git a/configs/local.config b/configs/local.config
@@ -1,5 +1,6 @@
 process {
     withLabel: prokka   { cpus = params.cores; memory = params.memory } 
+    withLabel: blast    { cpus = params.cores; memory = params.memory } 
     withLabel: diamond  { cpus = params.cores; memory = params.memory } 
     withLabel: pocp     { cpus = 1; memory = '1 GB' } 
     withLabel: ruby     { cpus = 1; memory = '1 GB' } 

diff --git a/configs/nodes.config b/configs/nodes.config
@@ -1,5 +1,6 @@
 process {   
     withLabel: prokka   { cpus = 8 ; memory = '2 GB' }
+    withLabel: blast    { cpus = 8 ; memory = '2 GB' }
     withLabel: diamond  { cpus = 8 ; memory = '2 GB' }
     withLabel: pocp     { cpus = 1 ; memory = '1 GB' }
     withLabel: ruby     { cpus = 1 ; memory = '1 GB' }

diff --git a/envs/blast.yaml b/envs/blast.yaml
@@ -0,0 +1,6 @@
+name: blast
+channels:
+  - bioconda
+dependencies:
+  - blast=2.15.0
+
diff --git a/main.nf b/main.nf
@@ -56,6 +56,7 @@ if (params.protein) { protein_single_input_ch = Channel
 
 // load modules
 include {prokka as prokka; prokka as prokka_single} from './modules/prokka'
+include {blast} from './modules/blast'
 include {diamond} from './modules/diamond'
 include {pocp; pocp_matrix} from './modules/pocp'
 
@@ -91,10 +92,15 @@ workflow {
       comparisons_ch = all_vs_all_ch
     }
 
-    diamond_hits_ch = diamond(comparisons_ch).hits.groupTuple()
+    // use either BLASTP or DIAMOND (default)
+    if (params.blastp) {
+      hits_ch = blast(comparisons_ch).hits.groupTuple()
+    } else {
+      hits_ch = diamond(comparisons_ch).hits.groupTuple()
+    }
 
     pocp_matrix(
-      pocp(diamond_hits_ch).map {comparison, pocp_file, pocp_value -> [pocp_file]}.collect()
+      pocp(hits_ch).map {comparison, pocp_file, pocp_value -> [pocp_file]}.collect()
     )
 }
 
@@ -104,6 +110,7 @@ def helpMSG() {
     c_reset = "\033[0m";
     c_yellow = "\033[0;33m";
     c_blue = "\033[0;34m";
+    c_red = "\033[0;31m";
     c_dim = "\033[2m";
     log.info """
     ____________________________________________________________________________________________
@@ -113,9 +120,14 @@ def helpMSG() {
     A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%.    
     
     ${c_yellow}Usage example:${c_reset}
-    nextflow run hoelzer/pocp -r 2.2.0 --genomes '*.fasta' 
+    nextflow run hoelzer/pocp -r 2.3.0 --genomes '*.fasta' 
     or
-    nextflow run hoelzer/pocp -r 2.2.0 --proteins '*.faa' 
+    nextflow run hoelzer/pocp -r 2.3.0 --proteins '*.faa' 
+
+    Use the following commands to check for latest pipeline versions:
+    
+    nextflow pull hoelzer/pocp
+    nextflow info hoelzer/pocp
 
     ${c_yellow}Input${c_reset}
     ${c_yellow}All-vs-all comparisons (default):${c_reset}
@@ -129,20 +141,27 @@ def helpMSG() {
     or
      --protein           proteins.faa         -> one protein multi-FASTA
 
-    ${c_yellow}Options:${c_reset}
-    --gcode             genetic code for Prokka annotation [default: $params.gcode]
-    --evalue            Evalue for diamond protein search [default: $params.evalue]
-    --seqidentity       Sequence identity for diamond alignments [default: $params.seqidentity]
-    --alnlength         Alignment length for diamond hits [default: $params.alnlength]
-    --cores             max cores per process for local use [default: $params.cores]
-    --max_cores         max cores (in total) for local use [default: $params.max_cores]
-    --memory            max memory for local use [default: $params.memory]
-    --output            name of the result folder [default: $params.output]
+    ${c_yellow}General Options:${c_reset}
+    --gcode             Genetic code for Prokka annotation [default: $params.gcode]
+    --cores             Max cores per process for local use [default: $params.cores]
+    --max_cores         Max cores (in total) for local use [default: $params.max_cores]
+    --memory            Max memory for local use [default: $params.memory]
+    --output            Name of the result folder [default: $params.output]
+
+    ${c_yellow}Special Options${c_reset} ${c_red}(Danger Zone!)${c_yellow}:${c_reset}
+    ATTENTION: changing these parameters will lead to different POCP values. 
+    If you have good reasons to do that, you must report the changed parameters together with the used pipeline version.
+    
+    --evalue            Evalue for DIAMOND protein search [default: $params.evalue]
+    --seqidentity       Sequence identity for DIAMOD alignments [default: $params.seqidentity]
+    --alnlength         Alignment length for DIAMOND hits [default: $params.alnlength]
+    --blastp            Use BLASTP instead of DIAMOND for protein alignment (slower, as in the original 2014 publication) [default: $params.blastp]
 
     ${c_dim}Nextflow options:
     -with-report rep.html    cpu / ram usage (may cause errors)
     -with-dag chart.html     generates a flowchart for the process tree
     -with-timeline time.html timeline (may cause errors)
+    -resume                  resume a previous calculation w/o recalculating everything (needs the same run command and work dir!)
 
     ${c_yellow}Caching:${c_reset}
     --condaCacheDir         Location for storing the conda environments [default: $params.condaCacheDir]

diff --git a/modules/blast.nf b/modules/blast.nf
@@ -0,0 +1,29 @@
+/*
+Run blastp and format output for downstream POCP calculations.
+*/
+process blast {
+    label 'blast'
+    publishDir "${params.output}/blast", mode: 'copy', pattern: "${name}-query-${name2}-db.blast*"
+
+    input:
+      tuple val(name), path(fasta), val(name2), path(fasta2) 
+
+    output:
+	    tuple env(genome_ids_sorted), path(fasta), file("${name}-query-${name2}-db.blast"), emit: blast 
+	    tuple env(genome_ids_sorted), path(fasta), file("${name}-query-${name2}-db.blast.hits"), emit: hits
+
+    script:
+    """
+    makeblastdb -in ${fasta2} -dbtype prot #-parse_seqids
+    blastp -task blastp -num_threads ${task.cpus} -query ${fasta} -db ${fasta2} -evalue ${params.evalue} -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" | awk '{if(\$3>${params.seqidentity*100} && \$4>(\$9*${params.alnlength})){print \$0}}' > ${name}-query-${name2}-db.blast
+    awk '{print \$1}' ${name}-query-${name2}-db.blast | sort | uniq | wc -l | awk '{print \$1}' > ${name}-query-${name2}-db.blast.hits
+    echo "\t${name}:\tFound \$(cat ${name}-query-${name2}-db.blast.hits) matches with an E value of less than ${params.evalue}, a sequence identity of more than ${params.seqidentity*100}%, and an alignable region of the query protein sequence of more than ${params.alnlength*100}%."
+
+    genome_ids_sorted='${name} ${name2}'
+    genome_ids_sorted=\$(echo \$genome_ids_sorted | xargs -n1 | sort | xargs | sed 's/ /-vs-/g')
+    """
+}
+
+/* Comments:
+I removed the -parse_seqids parameter from the makeblastdb command because of an error with fasta IDs that are longer than 50 chars. strange.
+*/
diff --git a/nextflow.config b/nextflow.config
@@ -25,7 +25,8 @@ params {
     // Prokka
     gcode = 0
 
-    // Diamond protein alignment
+    // DIAMOND or BLASTP protein alignment
+    blastp = false
     evalue = '1e-5'
     seqidentity = 0.4
     alnlength = 0.5