phylogenetic_analysis

Young edited this page Feb 10, 2024 · 14 revisions

Phylogenetic analysis

This subworkflow is mainly intended to give epidemiologists a reproducible way to support outbreak investigations in a research setting. It is skipped by default because it does not filter or sort samples. Thus, all samples put into the workflow must be "similar enough" for downstream analysis. The default for "similar enough" is that the input samples collectively share a core genome of at least 1,500 genes. This threshold can be adjusted with params.min_core_genes. Additionally, this workflow will not continue unless there are at least four genomes to compare. This is a requirement of IQTree2 and cannot be adjusted.
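The two conditions above can be sketched as a small check. This is only an illustration of the stated rules, not Grandeur's actual implementation; the function name and the idea of passing in a core gene count are assumptions made for the example.

```python
# Illustrative sketch only: mirrors the two rules described above.
# `phylogeny_can_proceed` is a hypothetical helper, not part of Grandeur.

MIN_GENOMES = 4  # fixed requirement described above; not adjustable


def phylogeny_can_proceed(n_genomes: int,
                          n_core_genes: int,
                          min_core_genes: int = 1500) -> bool:
    """Return True when both stated conditions are met.

    n_genomes:      number of genomes entering the comparison
    n_core_genes:   number of genes in the collective core genome
    min_core_genes: threshold, analogous to params.min_core_genes
    """
    enough_genomes = n_genomes >= MIN_GENOMES
    enough_core_genes = n_core_genes >= min_core_genes
    return enough_genomes and enough_core_genes


if __name__ == "__main__":
    # Three genomes are too few, no matter how large the core genome is.
    print(phylogeny_can_proceed(3, 2000))
    # Lowering the threshold lets a more diverse set of samples continue.
    print(phylogeny_can_proceed(10, 1200, min_core_genes=1000))
```

Lowering the threshold admits more diverse sample sets, at the cost of a smaller core genome to build the tree from.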

Mashtree and core genome evaluation (custom script) may be useful in understanding sample outliers that may need to be removed from the analysis.

To allow this subworkflow to run, set params.msa = true.

Adding in top hits from FastANI

The default method is to include the top FastANI hit for each sample in the comparison of isolates.

---
Phylogenetic analysis with FastANI (default)
---
flowchart LR
C["FastANI top hits"] --> prokka
A["contigs"] --> prokka
prokka --> panaroo
panaroo --> B["snp-dists"]
panaroo --> iqtree2
B --> heatcluster
iqtree2 --> phytreeviz
A --> mashtree
mashtree --> phytreeviz

Relevant params ('params.msa' has to be set to 'true'):

# skips fastani in addition to the information subworkflow to speed up the workflow
params.skip_extras       = false
# allows top fastani hits to be included in phylogenetic analysis
params.exclude_top_hit   = false
# specify outgroup for iqtree2
params.iqtree2_outgroup  = ""
# must be set to true to allow subworkflow to run
params.msa               = true
# sets the minimum core genes required to continue with iqtree2
params.min_core_genes    = 1500

Not adding in the top hits from FastANI

Adding in the top hit from FastANI may introduce genomes into the analysis that prevent the workflow from moving forward. This behavior can be disabled by setting params.exclude_top_hit to true. This is also the preferred option when there is a desired outgroup.

---
Phylogenetic analysis without FastANI
---
flowchart LR
A["contigs"] --> prokka
prokka --> panaroo
panaroo --> B["snp-dists"]
panaroo --> iqtree2
B --> heatcluster
iqtree2 --> phytreeviz
A --> mashtree
mashtree --> phytreeviz

Relevant params ('params.msa' and 'params.exclude_top_hit' are changed from their default value):

# skips fastani in addition to the information subworkflow to speed up the workflow
params.skip_extras       = false
# does not allow top fastani hits to be included in phylogenetic analysis
params.exclude_top_hit   = true
# specify outgroup for iqtree2
params.iqtree2_outgroup  = ""
# must be set to true to allow subworkflow to run
params.msa               = true
# sets the minimum core genes required to continue with iqtree2
params.min_core_genes    = 1500

Using an outgroup

This is the rationale behind 'params.iqtree2_outgroup'.

The default workflow of Grandeur is to add all the FastANI top hits to the subsequent steps of the phylogenetic analysis workflow. These top hits cannot always be predicted beforehand, which makes specifying an outgroup in iqtree2 difficult. If there is a desire to use an outgroup, there are two methods for this:

  1. Use the default workflow with params.exclude_top_hit = false and set the outgroup param to something that will fail (params.iqtree2_outgroup = "doesntexist"). This will cause the iqtree2 process to fail. The correct value for params.iqtree2_outgroup can then be ascertained from the sample names in the iqtree2 error, and the workflow resumed with -resume.

  2. Do not include the FastANI top hits (params.exclude_top_hit = true). Instead, use the contig files and other fasta files as input, and designate one of those inputs as the outgroup with params.iqtree2_outgroup = "outgroup". The outgroup name has to be consistent with the filename (e.g. "representative_genome.fasta" as an input file would be "representative_genome", without the ".fasta").
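The filename rule in the second method can be sketched as follows. This is a minimal illustration that assumes the outgroup name is simply the filename with its final extension removed, as in the example above; the helper name is hypothetical, and how Grandeur treats other cases (such as compressed files) is not covered here.

```python
# Illustrative sketch only: derives an outgroup name from an input filename
# by dropping the directory and the final extension.
# `outgroup_name` is a hypothetical helper, not part of Grandeur.
from pathlib import Path


def outgroup_name(fasta_path: str) -> str:
    """Return the filename without its directory or final extension."""
    return Path(fasta_path).stem


if __name__ == "__main__":
    name = outgroup_name("representative_genome.fasta")
    # Print the param line that would designate this input as the outgroup.
    print(f'params.iqtree2_outgroup = "{name}"')
```

Only the last extension is removed, so a name with several dots keeps everything before the final one.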

Excluding other processes

Since this portion of the workflow is likely to be run after the other portions of the workflow have already been run, the other subworkflows can be skipped by setting params.skip_extras to true. This simplifies the workflow and saves time and computational resources.

---
Phylogenetic analysis
---
flowchart LR
A["contigs"] --> prokka
prokka --> panaroo
panaroo --> B["snp-dists"]
panaroo --> iqtree2
B --> heatcluster
iqtree2 --> phytreeviz
A --> mashtree
mashtree --> phytreeviz

Relevant params ('params.skip_extras' and 'params.msa' are changed from their default values):

# skips fastani in addition to the information subworkflow to speed up the workflow
params.skip_extras       = true
# specify outgroup for iqtree2
params.iqtree2_outgroup  = ""
# must be set to true to allow subworkflow to run
params.msa               = true
# sets the minimum core genes required to continue with iqtree2
params.min_core_genes    = 1500

Visualizing the tree

It has been difficult to find a command-line tool that creates a high-quality tree and automatically resizes the text and tree appropriately. Grandeur currently uses Phytreeviz for a rough initial look at the phylogenetic tree. The tree is often not rooted and can be difficult to interpret. The phylogenetic tree found at 'grandeur/iqtree2/iqtree.treefile' or 'grandeur/iqtree2/iqtree.contree' can be visualized with other tools, such as ggtree, Microreact, or iTOL.

Visualizing the SNP matrix

Heatcluster is a tool developed to visualize SNP matrices.
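Alongside the heatmap, the same matrix can be scanned for closely related pairs. The sketch below is not part of Grandeur or Heatcluster. It assumes the matrix is tab-separated, with sample names in the first row and first column and a non-sample label in the top-left cell. The function name and the threshold are hypothetical.

```python
# Illustrative sketch only: lists sample pairs whose SNP distance is at or
# below a threshold, given the text of a square, tab-separated SNP matrix.
# The matrix layout is an assumption (see the lead-in above).
import csv
import io


def close_pairs(matrix_text: str, max_snps: int) -> list[tuple[str, str, int]]:
    """Return (sample_a, sample_b, distance) for each pair within max_snps."""
    rows = [r for r in csv.reader(io.StringIO(matrix_text), delimiter="\t") if r]
    names = rows[0][1:]  # skip the top-left cell, which is not a sample name
    pairs = []
    for i, row in enumerate(rows[1:]):
        # Only read the upper triangle so each pair is reported once.
        for j in range(i + 1, len(names)):
            distance = int(row[j + 1])
            if distance <= max_snps:
                pairs.append((row[0], names[j], distance))
    return pairs


if __name__ == "__main__":
    # Made-up toy matrix, for demonstration only.
    toy = "matrix\tA\tB\tC\nA\t0\t3\t40\nB\t3\t0\t41\nC\t40\t41\t0\n"
    for a, b, d in close_pairs(toy, max_snps=10):
        print(a, b, d)
```

What counts as "close" depends on the organism and the investigation, so the threshold here is purely a placeholder.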
