Skip to content

Commit

Permalink
Merge pull request #16 from hoelzer/blastp
Browse files Browse the repository at this point in the history
Adjustments after review comments
  • Loading branch information
hoelzer authored Feb 9, 2024
2 parents 5c4abba + eae5409 commit 092166e
Show file tree
Hide file tree
Showing 9 changed files with 178 additions and 34 deletions.
121 changes: 103 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,26 +1,59 @@
# Calculation of the Percentage Of Conserved Proteins

![](https://img.shields.io/badge/nextflow->=20.01.0-brightgreen)
![](https://img.shields.io/badge/uses-ruby-red)
![](https://img.shields.io/badge/can_use-conda/mamba-yellow.svg)
![](https://img.shields.io/badge/can_use-docker-blue.svg)
![](https://img.shields.io/badge/can_use-singularity-orange.svg)
![](https://img.shields.io/badge/licence-GLP3-lightgrey.svg)

[![Twitter Follow](https://img.shields.io/twitter/follow/martinhoelzer.svg?style=social)](https://twitter.com/martinhoelzer)

__Update 2023/12: One-vs-All comparisons are now possible in genome and protein input mode. Check `--help` message.__
1. [ Objective ](#objective)
2. [ How ](#how)
3. [ Requirements ](#need)
4. [ Execution examples ](#run)
5. [ Example output ](#example)
6. [ Extend a calculation with more genomes ](#extend)
7. [ A note on alignment calculation (DIAMOND vs. BLASTP) ](#diamond)
8. [ Parameter adjustments (danger zone) ](#parameter)
9. [ All-vs-All and One-vs-All ](#allvsall)
10. [ Cite ](#cite)
11. [ Update backlog ](#backlog)

__Update 2023/10: Now using [Diamond](https://www.nature.com/articles/s41592-021-01101-x) instead of Blast for protein alignments. Thx [@michoug](https://github.com/michoug) for the Pull Request.__
<a name="objective"></a>

__Update 2023/05: Re-implementation as a [Nextflow pipeline](nextflow.io). Please feel free to report any [issues](https://github.com/hoelzer/pocp/issues)!__
## Objective

Sequence technology advancements have led to an exponential increase in bacterial genomes, necessitating robust taxonomic classification methods. The **P**ercentage **O**f **C**onserved **P**roteins (POCP), proposed initially by [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738), is a valuable metric for assessing prokaryote genus boundaries. A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%. Here, I introduce a computational pipeline for automated POCP calculation, aiming to enhance reproducibility and ease of use in taxonomic studies.

<a name="how"></a>

## How

As input use one amino acid sequence FASTA file per genome such as provided by
[Prokka](https://github.com/tseemann/prokka) or genome FASTA files which will be then annotated via [Prokka](https://github.com/tseemann/prokka).
The pipeline will then calculate all-vs-all pairwise alignments between all protein sequences and use this
information for POCP calculation following [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738). For one-vs-all comparisons see below.
[Prokka](https://github.com/tseemann/prokka) or [Bakta](https://github.com/oschwengers/bakta) or genome FASTA files which will be then annotated via [Prokka](https://github.com/tseemann/prokka) automatically. The pipeline will then calculate all-vs-all pairwise alignments between all protein sequences using the blastp mode of [DIAMOND](https://www.nature.com/articles/nmeth.3176) and use this information for POCP calculation following the original formula of [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738). One-vs-all comparisons are also possible, see below.

<a name="need"></a>

## Requirements

You only need `nextflow` to install and run the pipeline. `nextflow` will take care of all dependencies and install them if necessary. For installing the dependencies (such as Prokka and DIAMOND), you can choose between `conda`, `mamba`, `docker` or `singularity`. I recommend using `docker`. Then install and run the pipeline:

### Workflow management

- [Nextflow](https://www.nextflow.io/docs/latest/getstarted.html#installation)

### Dependencies management

You only need `nextflow` and `conda` or `mamba` or `docker` or `singularity` to run the pipeline. I recommend using `docker`. Then install and run the pipeline:
- [Conda/Mamba](https://docs.conda.io/en/latest/miniconda.html)
- [Docker](https://docs.docker.com/get-docker/)
- [Singularity](https://apptainer.org/docs/)

Use the `-profile` parameter to switch between different dependency managers, e.g. `-profile conda`.

<a name="run"></a>

## Execution examples

```bash
# get the pipeline code
Expand All @@ -30,35 +63,87 @@ nextflow pull hoelzer/pocp
nextflow info hoelzer/pocp

# get the help page and define a release version. ATTENTION: use latest version.
nextflow run hoelzer/pocp -r 2.2.0 --help
nextflow run hoelzer/pocp -r 2.3.0 --help

# example with genome files as input, all-vs-all comparison, performing a local execution and using Docker
nextflow run hoelzer/pocp -r 2.2.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' -profile local,docker
nextflow run hoelzer/pocp -r 2.3.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' -profile local,docker

# example with protein FASTA files as input (e.g. from Prokka pre-calculated), all-vs-all comparison, performing a SLURM execution and using conda
nextflow run hoelzer/pocp -r 2.2.0 --proteins $HOME'/.nextflow/assets/hoelzer/pocp/example/*.faa' -profile slurm,conda
nextflow run hoelzer/pocp -r 2.3.0 --proteins $HOME'/.nextflow/assets/hoelzer/pocp/example/*.faa' -profile slurm,conda

# example with genome files as input, additional genome file to activate one-vs-all comparison, performing a local execution and using Docker
nextflow run hoelzer/pocp -r 2.2.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' --genome $HOME/.nextflow/assets/hoelzer/pocp/example/Cav_10DC88.fasta -profile local,docker
nextflow run hoelzer/pocp -r 2.3.0 --genomes $HOME'/.nextflow/assets/hoelzer/pocp/example/*.fasta' --genome $HOME/.nextflow/assets/hoelzer/pocp/example/Cav_10DC88.fasta -profile local,docker
```

<a name="example"></a>

## Example output

The final output (`pocp-matrix.tsv`) should look like this (here, the resulting TSV was imported into Numbers on MacOS):

![Example output](example_output.png)

If needed, the following parameters used for filtering the `diamond` results (blastp mode) can be
adjusted:
<a name="extend"></a>

## Extend a calculation with more genomes

`nextflow` allows you to resume a calculation without recalculating previous steps. This is possible, because `nextflow` stores intermediate and results files in a `work` directory. If you want to resume a calculation later, dont delete the `work` directory and use the same `nextflow run` command. For example:

```bash
# first iteration
nextflow run hoelzer/pocp -r 2.3.0 --genomes 'input/*.fasta' -profile local,docker

# add more genomes to the input folder:
cp /some/other/genomes/*.fasta input/

# resume the calculation via adding "-resume".
nextflow run hoelzer/pocp -r 2.3.0 --genomes 'input/*.fasta' -profile local,docker -resume
```

<a name="diamond"></a>

## A note on alignment calculation (DIAMOND vs. BLASTP)

The pipeline identifies orthologous proteins between species using DIAMOND in blastp mode. Please note that the original POCP publication used BLASTP for calculating the alignments. However, DIAMOND is not only faster, which is an advantage when calculating POCP values for larger input data sets, but also achieves the sensitivity of BLASTP ([Buchfink (2021)](https://www.nature.com/articles/s41592-021-01101-x)), especially when using the `--ultra-sensitive` mode, which is activated by default in the pipeline. Another study comparing different alignment programs found that DIAMOND offered the best compromise between speed, sensitivity and quality when a sensitivity option other than the default setting was selected ([Hernández-Salmerón and Moreno-Hagelsieb (2020)](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-07132-6)). Therefore I decided to use DIAMOND as a more modern solution for the alignment calculation in POCP-nf. Thx [@michoug](https://github.com/michoug) for the idea and the Pull Request.

<a name="parameter"></a>

## Parameter adjustments (danger zone)

**Attention**: Although you can customize core parameters in the `nextflow` pipeline, I recommend sticking to the original parameters defined by [Qin, Xie _et al_. 2014](https://www.ncbi.nlm.nih.gov/pubmed/24706738) and otherwise clearly indicating any changed parameter options along with the version of POCP-nf used when sharing POCP results! The default parameters defined in the configuration of the pipeline are:

```bash
--evalue 1e-5
--seqidentity 0.4
--alnlength 0.5
```

Please note that per default an "all-vs-all" comparison is performed based on the provided FASTA files. However, you can also switch to an "one-vs-all" comparison by additionally providing a single genome FASTA via `--genome` next to the `--genomes` input **or** a single protein multi-FASTA via `--protein` next to the `--proteins` input. In both cases, only "one-vs-all" comparisons will be performed. It is also possible to combine `--genomes` with a target `--protein` FASTA for "one-vs-all" and vice versa.
<a name="allvsall"></a>

## All-vs-All and One-vs-All

Please also note that per default an "all-vs-all" comparison is performed based on the provided FASTA files. However, you can also switch to an "one-vs-all" comparison by additionally providing a single genome FASTA via `--genome` next to the `--genomes` input **or** a single protein multi-FASTA via `--protein` next to the `--proteins` input. In both cases, only "one-vs-all" comparisons will be performed. It is also possible to combine `--genomes` with a target `--protein` FASTA for "one-vs-all" and vice versa.

<a name="cite"></a>

If you use the POCP Nextflow pipeline, please cite the original POCP study that introduced the metric and the POCP-nf pipeline:
## Cite

**[Qin, Qi-Long, _et al_. "A proposed genus boundary for the prokaryotes based on genomic insights." Journal of bacteriology 196.12 (2014): 2210-2215.](https://pubmed.ncbi.nlm.nih.gov/24706738/)**
If you use the POCP Nextflow pipeline, please cite the original POCP study that introduced the metric, the publications of the great tools used in the pipeline, and the POCP-nf pipeline:

**[Martin Hölzer. "POCP: An automatic Nextflow pipeline for calculating the percentage of conserved proteins in bacterial taxonomy". SOME JOURNAL. Hopefully in 2024.]()**
[Qin, Qi-Long, _et al_. "A proposed genus boundary for the prokaryotes based on genomic insights." Journal of bacteriology 196.12 (2014): 2210-2215.](https://pubmed.ncbi.nlm.nih.gov/24706738/)

[Martin Hölzer. "POCP: An automatic Nextflow pipeline for calculating the percentage of conserved proteins in bacterial taxonomy". SOME JOURNAL. Hopefully in 2024.]()

[Buchfink, Benjamin, Chao Xie, and Daniel H. Huson. "Fast and sensitive protein alignment using DIAMOND." Nature methods 12.1 (2015): 59-60.](https://www.nature.com/articles/nmeth.3176)

[Seemann, Torsten. "Prokka: rapid prokaryotic genome annotation." Bioinformatics 30.14 (2014): 2068-2069.](https://doi.org/10.1093/bioinformatics/btu153)

<a name="backlog"></a>

## Updates backlog

__Update 2023/12: One-vs-All comparisons are now possible in genome and protein input mode. Check `--help` message.__

__Update 2023/10: Now using [Diamond](https://www.nature.com/articles/s41592-021-01101-x) instead of Blastp for protein alignments. Thx [@michoug](https://github.com/michoug) for the Pull Request.__

__Update 2023/05: Re-implementation as a [Nextflow pipeline](nextflow.io). Please feel free to report any [issues](https://github.com/hoelzer/pocp/issues)!__
1 change: 1 addition & 0 deletions configs/conda.config
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
process {
withLabel: prokka { conda = "$baseDir/envs/prokka.yaml" }
withLabel: blast { conda = "$baseDir/envs/blast.yaml" }
withLabel: diamond { conda = "$baseDir/envs/diamond.yaml" }
withLabel: pocp { conda = "$baseDir/envs/diamond.yaml" }
withLabel: ruby { conda = "$baseDir/envs/ruby.yaml" }
Expand Down
5 changes: 3 additions & 2 deletions configs/container.config
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
process {
withLabel: prokka { container = 'nanozoo/prokka:1.14.6--773a90d' }
withLabel: diamond { container = 'nanozoo/diamond:2.1.8--b6f1f11' }
withLabel: pocp { container = 'nanozoo/diamond:2.1.8--b6f1f11' }
withLabel: blast { container = 'nanozoo/blast:2.15.0--139f5e2' }
withLabel: diamond { container = 'nanozoo/diamond:2.1.9--d692301' }
withLabel: pocp { container = 'nanozoo/diamond:2.1.9--d692301' }
withLabel: ruby { container = 'nanozoo/ruby:3.2.2--fdadcb3' }
}
1 change: 1 addition & 0 deletions configs/local.config
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
process {
withLabel: prokka { cpus = params.cores; memory = params.memory }
withLabel: blast { cpus = params.cores; memory = params.memory }
withLabel: diamond { cpus = params.cores; memory = params.memory }
withLabel: pocp { cpus = 1; memory = '1 GB' }
withLabel: ruby { cpus = 1; memory = '1 GB' }
Expand Down
1 change: 1 addition & 0 deletions configs/nodes.config
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
process {
withLabel: prokka { cpus = 8 ; memory = '2 GB' }
withLabel: blast { cpus = 8 ; memory = '2 GB' }
withLabel: diamond { cpus = 8 ; memory = '2 GB' }
withLabel: pocp { cpus = 1 ; memory = '1 GB' }
withLabel: ruby { cpus = 1 ; memory = '1 GB' }
Expand Down
6 changes: 6 additions & 0 deletions envs/blast.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
name: blast
channels:
- bioconda
dependencies:
- blast=2.15.0

45 changes: 32 additions & 13 deletions main.nf
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ if (params.protein) { protein_single_input_ch = Channel

// load modules
include {prokka as prokka; prokka as prokka_single} from './modules/prokka'
include {blast} from './modules/blast'
include {diamond} from './modules/diamond'
include {pocp; pocp_matrix} from './modules/pocp'

Expand Down Expand Up @@ -91,10 +92,15 @@ workflow {
comparisons_ch = all_vs_all_ch
}

diamond_hits_ch = diamond(comparisons_ch).hits.groupTuple()
// use either BLASTP or DIAMOND (default)
if (params.blastp) {
hits_ch = blast(comparisons_ch).hits.groupTuple()
} else {
hits_ch = diamond(comparisons_ch).hits.groupTuple()
}

pocp_matrix(
pocp(diamond_hits_ch).map {comparison, pocp_file, pocp_value -> [pocp_file]}.collect()
pocp(hits_ch).map {comparison, pocp_file, pocp_value -> [pocp_file]}.collect()
)
}

Expand All @@ -104,6 +110,7 @@ def helpMSG() {
c_reset = "\033[0m";
c_yellow = "\033[0;33m";
c_blue = "\033[0;34m";
c_red = "\033[0;31m";
c_dim = "\033[2m";
log.info """
____________________________________________________________________________________________
Expand All @@ -113,9 +120,14 @@ def helpMSG() {
A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%.
${c_yellow}Usage example:${c_reset}
nextflow run hoelzer/pocp -r 2.2.0 --genomes '*.fasta'
nextflow run hoelzer/pocp -r 2.3.0 --genomes '*.fasta'
or
nextflow run hoelzer/pocp -r 2.2.0 --proteins '*.faa'
nextflow run hoelzer/pocp -r 2.3.0 --proteins '*.faa'
Use the following commands to check for latest pipeline versions:
nextflow pull hoelzer/pocp
nextflow info hoelzer/pocp
${c_yellow}Input${c_reset}
${c_yellow}All-vs-all comparisons (default):${c_reset}
Expand All @@ -129,20 +141,27 @@ def helpMSG() {
or
--protein proteins.faa -> one protein multi-FASTA
${c_yellow}Options:${c_reset}
--gcode genetic code for Prokka annotation [default: $params.gcode]
--evalue Evalue for diamond protein search [default: $params.evalue]
--seqidentity Sequence identity for diamond alignments [default: $params.seqidentity]
--alnlength Alignment length for diamond hits [default: $params.alnlength]
--cores max cores per process for local use [default: $params.cores]
--max_cores max cores (in total) for local use [default: $params.max_cores]
--memory max memory for local use [default: $params.memory]
--output name of the result folder [default: $params.output]
${c_yellow}General Options:${c_reset}
--gcode Genetic code for Prokka annotation [default: $params.gcode]
--cores Max cores per process for local use [default: $params.cores]
--max_cores Max cores (in total) for local use [default: $params.max_cores]
--memory Max memory for local use [default: $params.memory]
--output Name of the result folder [default: $params.output]
${c_yellow}Special Options${c_reset} ${c_red}(Danger Zone!)${c_yellow}:${c_reset}
ATTENTION: changing these parameters will lead to different POCP values.
If you have good reasons to do that, you must report the changed parameters together with the used pipeline version.
--evalue Evalue for DIAMOND protein search [default: $params.evalue]
--seqidentity Sequence identity for DIAMOD alignments [default: $params.seqidentity]
--alnlength Alignment length for DIAMOND hits [default: $params.alnlength]
--blastp Use BLASTP instead of DIAMOND for protein alignment (slower, as in the original 2014 publication) [default: $params.blastp]
${c_dim}Nextflow options:
-with-report rep.html cpu / ram usage (may cause errors)
-with-dag chart.html generates a flowchart for the process tree
-with-timeline time.html timeline (may cause errors)
-resume resume a previous calculation w/o recalculating everything (needs the same run command and work dir!)
${c_yellow}Caching:${c_reset}
--condaCacheDir Location for storing the conda environments [default: $params.condaCacheDir]
Expand Down
29 changes: 29 additions & 0 deletions modules/blast.nf
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
/*
Run blastp and format output for downstream POCP calculations.
*/
process blast {
label 'blast'
publishDir "${params.output}/blast", mode: 'copy', pattern: "${name}-query-${name2}-db.blast*"

input:
tuple val(name), path(fasta), val(name2), path(fasta2)

output:
tuple env(genome_ids_sorted), path(fasta), file("${name}-query-${name2}-db.blast"), emit: blast
tuple env(genome_ids_sorted), path(fasta), file("${name}-query-${name2}-db.blast.hits"), emit: hits

script:
"""
makeblastdb -in ${fasta2} -dbtype prot #-parse_seqids
blastp -task blastp -num_threads ${task.cpus} -query ${fasta} -db ${fasta2} -evalue ${params.evalue} -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend qlen sstart send evalue bitscore slen" | awk '{if(\$3>${params.seqidentity*100} && \$4>(\$9*${params.alnlength})){print \$0}}' > ${name}-query-${name2}-db.blast
awk '{print \$1}' ${name}-query-${name2}-db.blast | sort | uniq | wc -l | awk '{print \$1}' > ${name}-query-${name2}-db.blast.hits
echo "\t${name}:\tFound \$(cat ${name}-query-${name2}-db.blast.hits) matches with an E value of less than ${params.evalue}, a sequence identity of more than ${params.seqidentity*100}%, and an alignable region of the query protein sequence of more than ${params.alnlength*100}%."
genome_ids_sorted='${name} ${name2}'
genome_ids_sorted=\$(echo \$genome_ids_sorted | xargs -n1 | sort | xargs | sed 's/ /-vs-/g')
"""
}

/* Comments:
I removed the -parse_seqids parameter from the makeblastdb command because of an error with fasta IDs that are longer than 50 chars. strange.
*/
3 changes: 2 additions & 1 deletion nextflow.config
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,8 @@ params {
// Prokka
gcode = 0

// Diamond protein alignment
// DIAMOND or BLASTP protein alignment
blastp = false
evalue = '1e-5'
seqidentity = 0.4
alnlength = 0.5
Expand Down

0 comments on commit 092166e

Please sign in to comment.