average_nucleotide_identity
One limiting factor of FastANI is that it requires a reference sequence. There are millions of available bacterial genomes, however, and the definition of a species is constantly being adjusted. To balance stability with being up to date, Grandeur has three options for supplying references to FastANI.
Also, please note that mash and fastani have limitations in species identification. For many closely related organisms, such as different Salmonella enterica subspecies or E. coli and Shigella, the top mash or fastani hit may belong to a closely related subtype rather than the sample's own subtype.
Downloading fasta files of reference genomes from NCBI has some issues:
- An internet connection is required
- It can take some time to download
Thus, Grandeur pulls a container from quay.io that includes reference genomes for many commonly sequenced organisms.
If you would like a reference genome to be included in this image for this workflow, please submit an issue.
---
Using static references supplied with Grandeur
---
flowchart LR
id5["Extract included references"] --> id6["FastANI"]
Relevant params with their default values.
# does not download files from NCBI
params.current_datasets = false
Several processes in Grandeur perform some form of species identification, including mash, blobtools (optional), and kraken2 (optional). The results from these processes are used to create a list of genomes to download from NCBI. The genomes in this list are then downloaded during the workflow and used as the references for fastani.
---
Downloading similar genomes from NCBI
---
flowchart LR
id1["mash results"] --> id2["list of species"]
id3["kraken2 results"] --> id2
id4["blobtools results"] --> id2
id2 --> id5["download genomes with datasets"]
id5 --> id6["FastANI"]
id7["genome_references"] --> id2
Relevant params with their default values and params.current_datasets set to 'true':
# allows downloading of genomes from NCBI
params.current_datasets = true
# sets the maximum number of genomes downloaded per species
params.datasets_max_genomes = 5
# limits the number of mash hits used to determine which species to download
params.mash_max_hits = 25
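To illustrate how the two limits above interact, here is a minimal sketch. It is not Grandeur's implementation: the function names, the `(species, distance)` tuple shape, and the sample hits are all hypothetical. It only shows that capping the number of mash hits bounds the species list, and that the per-species cap then bounds the total number of downloads.

```python
def species_to_download(mash_hits, other_species=(), mash_max_hits=25):
    """Collect unique species from the top mash hits plus other sources.

    mash_hits: list of (species_name, mash_distance) tuples (hypothetical shape).
    other_species: names from kraken2 or blobtools results (hypothetical input).
    """
    # A smaller mash distance means a closer match, so sort ascending.
    ranked = sorted(mash_hits, key=lambda hit: hit[1])
    top = ranked[:mash_max_hits]
    species = {name for name, _ in top}
    species.update(other_species)
    return sorted(species)


def max_downloads(species, datasets_max_genomes=5):
    """Upper bound on genomes fetched: at most N genomes per species."""
    return len(species) * datasets_max_genomes


# Made-up hits for illustration only.
hits = [
    ("Escherichia coli", 0.001),
    ("Shigella flexneri", 0.004),
    ("Escherichia coli", 0.002),
    ("Salmonella enterica", 0.120),
]
species = species_to_download(hits, other_species=["Shigella sonnei"], mash_max_hits=3)
print(species)
print(max_downloads(species))
```

With these made-up inputs, the distant Salmonella hit falls outside the top three and is never considered for download.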
We don't want to limit the End User to our limited supply of genomes or to the way genome downloads from NCBI datasets are set up. The End User can add their own fasta files with 'params.fastani_ref = <path_to_new_reference.fasta>'. These fasta files can also be gzip-compressed ('.gz').
---
Using static references supplied with Grandeur and by the End User
---
flowchart LR
id5["Extract included references"] --> id6["FastANI"]
id7["Extract supplied references"] --> id6["FastANI"]
If there are multiple fasta files to add, list the full or relative path of each, separated by commas.
Another option is to provide a file with the paths to each fasta file and read that in with params.fastani_ref_list.
Example file (file_list.txt):
genomes/genus_species_cool.fa
genomes/genus_species_another.fasta
genomes/genus_species_outgroup.fasta.gz
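If the references live in one directory, a small helper script can produce either form. This is a sketch under assumptions: the helper names and the list of accepted extensions are hypothetical and not part of Grandeur. It writes one path per line, as in the example file, or joins the paths with commas for params.fastani_ref.

```python
from pathlib import Path

# Assumed extensions; the examples on this page only show
# .fa, .fasta, .fasta.gz, and .fna.gz.
FASTA_SUFFIXES = (".fa", ".fasta", ".fna", ".fa.gz", ".fasta.gz", ".fna.gz")


def find_fastas(directory):
    """Return sorted paths of fasta files directly inside `directory`."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.endswith(FASTA_SUFFIXES)
    )


def write_ref_list(paths, list_file="file_list.txt"):
    """Write one path per line, the layout of the example file."""
    Path(list_file).write_text("".join(f"{p}\n" for p in paths))
    return list_file


def as_comma_separated(paths):
    """Join paths with commas, the form used for params.fastani_ref."""
    return ",".join(paths)
```

For example, `write_ref_list(find_fastas("genomes"))` would create a file_list.txt covering every fasta file in a `genomes` directory.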
The filenames of user-supplied reference genomes cannot match the filename of any included genome and must be unique among themselves, even when the files sit in different directories. Otherwise, there will be a duplicate filename input error.
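A quick pre-flight check can catch filename collisions before the workflow starts. The sketch below is illustrative only: `duplicate_basenames` is a hypothetical helper, and the names of the genomes included in the container must be supplied by the caller because they are not listed on this page.

```python
from collections import Counter
from pathlib import PurePath


def duplicate_basenames(user_paths, included_names=()):
    """Return sorted filenames that occur more than once across both inputs."""
    # Only the filename matters, not the directory it sits in.
    names = [PurePath(p).name for p in user_paths]
    names.extend(included_names)
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


refs = [
    "genomes/genus_species_cool.fa",
    "other_dir/genus_species_cool.fa",  # same filename, different directory
    "genomes/genus_species_outgroup.fasta.gz",
]
print(duplicate_basenames(refs))  # prints ['genus_species_cool.fa']
```

An empty result means the supplied list has no filename collisions, at least among the names that were checked.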
Relevant params (fastani_ref AND current_datasets must be changed from their default values):
# specifying a fasta file to include as a fastani reference
params.fastani_ref = "genus_species_interesting.fasta,genus_species_cool.fna.gz"
# specifying a file that lists fasta files to include as fastani references
params.fastani_ref_list = "file_list.txt"
More documentation about adding custom fastani references can be found on the fastani wiki page.