Bioinformatics workflow of E. coli whole-genome data, collected as part of the 52HN project
- Species identification
- Quality control
- Assembly
- Quality control of the assembly
- Serotype
- MLST
- Phylogroup
- Virulence genes typing
- Resistance genes typing
- Plasmid
- Annotation
- Pan-genome investigation
- Phylogenomic inference
- Results
Species identification was conducted using Mash
https://mash.readthedocs.io/en/latest/
Used database for screening: RefSeq88n.msh.gz
mash screen -p [threads] RefSeq88n.msh.gz [sample] > mash_output.tab
# Extracting the top 10 hits
sort -gr mash_output.tab -o mash_output_sort.tab
head -n 10 mash_output_sor.tab > mash_output_top10.tab
Three genomes were identified as non E.coli: 17.6a, 17.6b, 17.6c
✔️ Mash: v2.3
Quality control was done by FastQC Fastp was used for filtering reads with more than 30% of bases with Phred score below 15 and/or reads with more than 5 N-bases. Adapters trimming is disabled.
fastp -i [input.read1] -I [input.read2] -o [output.fastp1] -O [output.fastp2] -q 15 -u 30 -n 5 -A
✔️ FastQC: 0.11.5 ✔️ Fastp: 0.24.0
The assembly was done with SPAdes v4.0.0 using the k-mers: 21, 35, 55, 71, 91, 111
spades.py --isolate -1 [read1] -2 [read2] -o [output] --threads [threads] -k 21,35,55,71,91,111 --cov-cutoff auto
--isolate: This flag is highly recommended for high-coverage isolate and multi-cell Illumina data; improves the assembly quality and running time.
--cov-cutoff: Read coverage cutoff value. Must be a positive float value, or "auto", or "off". Default value is "off". When set to "auto" SPAdes automatically computes coverage threshold using conservative strategy.
More details on --cov-cutoff parameter (this should be handled with care)
✔️ SPAdes: 4.0.0
- Assemblies' qualities were checked with QUAST using default parameter
- Assemblies' completeness were assessed using BUSCO against database enterobacteriaceae_odb12
# Batch mode
busco -i [inputs.directory] -m geno -l enterobacteriaceae_odb12 -c [threads]
- Assembly filtration: contigs with coverage lower than 1 were discarded using SeqKit
seqkit fx2tab [assembly] | csvtk mutate -H -t -f 1 -p "cov_(.+)" | awk -F "\t" '$4>=1' | seqkit tab2fx > filtered_assembly.fasta
✔️ BUSCO: 5.8.2 ✔️ seqkit: v2.8.2
Serotypes were determined using ECTyper with default parameters.
✔️ ECTyper: 2.0.0
Sequence Typing was done using mlst, --scheme ecoli_achtman_4, and default parameters.
✔️ mlst
The phylogroup was determined using ClermonTyping and default parameters.
✔️ ClermonTyping: 24.02
The virulenece genes were screen against the curated database "Septicoli" using Abricate with thresholds --minid 80 and --mincov 90. The database was done by Guilhem Royer in Kieffer et al, 2019.
✔️ abricate: 1.0.1
The resistance genes were screened against the NCBI database using AMRfinderplus
amrfinder -n [assembly] --organism Escherichia --threads [threads] --plus --name [prefix] > output.tab
✔️ amrfinder: 4.0.3
Plasmid replicon types were screened against the PlasmidFinder_DB database using Abricate with thresholds --minid 80 and --mincov 90.
abricate --fofn [file of file names] --db plasmidfinder > output.tab
✔️ abricate: 1.0.1
The annotation was done using Bakta
#Bakta database needs to be downloaded manually per installation instruction and kept in bakta_db directory
bakta --db [bakta_db] --prefix [file_name] --output [out_dir] --keep-contig-headers --genus Escherichia -v [assembly] -t [threads]
✔️ bakta: 1.10.3
Pangenome calculation was done using Panaroo
panaroo -i [input_gff3_files] -o [out_dir] --clean-mode strict --remove-invalid-genes -a core
--clean-mode: The stringency mode at which to run panaroo. Must be one of 'strict','moderate' or 'sensitive'. Each of these modes can be fine tuned using the additional parameters in the 'Graph correction' section. strict: Requires fairly strong evidence (present in at least 5% of genomes) to keep likely contaminant genes. Will remove genes that are refound more often than they were called originally.
--remove-invalid-genes: removes annotations that do not conform to the expected Prokka format such as those including premature stop codons.
-a: Output alignments of core genes or all genes. Options are 'core' and 'pan'. Default: 'None' #Aligner: default 'mafft'
✔️ Panaroo: 1.5.1
The phylogenetic tree was constructed using IQTree
#Due to large number of samples and limited computing capacity, the model GTR+F+R7 was chosen in prior instead of running model search by ModelFinder (-m MFP).
#The limits for memory and cores usage were 54G (-mem) and 18 (-ntmax), respectively.
#Safe numerical mode (-safe) was turned on to avoid numerical underflow for large data sets with many sequences
iqtree -s [msa.aln] -pre [file_prefix] -m GTR+F+R7 -bb 1000 -alrt 1000 -nt AUTO -mem 54G -ntmax 18 -safe
-bb: number of bootstrap replicates -alrt: number of replicates to perform SH-like
✔️ IQTree: 1.6.12
All the typing results were processed using R 4.3.3.