52HN - Evolutionary dynamics of Escherichia coli in Vietnam

Bioinformatics workflow of E. coli whole-genome data, collected as part of the 52HN project

Steps

Species identification
Quality control
Assembly
Quality control of the assembly
Serotype
MLST
Phylogroup
Virulence genes typing
Resistance genes typing
Plasmid
Annotation
Pan-genome investigation
Phylogenomic inference
Results

Species identification

Species identification was conducted using Mash

https://mash.readthedocs.io/en/latest/

Used database for screening: RefSeq88n.msh.gz

mash screen  -p [threads] RefSeq88n.msh.gz [sample] > mash_output.tab
# Extracting the top 10 hits 
sort -gr mash_output.tab -o mash_output_sort.tab
head -n 10 mash_output_sor.tab > mash_output_top10.tab

Three genomes were identified as non E.coli: 17.6a, 17.6b, 17.6c

✔️ Mash: v2.3

Quality control

Quality control was done by FastQC Fastp was used for filtering reads with more than 30% of bases with Phred score below 15 and/or reads with more than 5 N-bases. Adapters trimming is disabled.

fastp -i [input.read1] -I [input.read2] -o [output.fastp1] -O [output.fastp2] -q 15 -u 30 -n 5 -A

✔️ FastQC: 0.11.5 ✔️ Fastp: 0.24.0

Assembly

The assembly was done with SPAdes v4.0.0 using the k-mers: 21, 35, 55, 71, 91, 111

Page AJ, De Silva N, Hunt M, Quail MA, Parkhill J, Harris SR, Otto TD, Keane JA. Robust high-throughput prokaryote de novo assembly and improvement pipeline for Illumina data. Microb Genom. 2016 Aug 25;2(8):e000083. doi: 10.1099/mgen.0.000083. PMID: 28348874; PMCID: PMC5320598.

spades.py --isolate -1 [read1] -2 [read2] -o [output] --threads [threads] -k 21,35,55,71,91,111 --cov-cutoff auto

--isolate: This flag is highly recommended for high-coverage isolate and multi-cell Illumina data; improves the assembly quality and running time.

--cov-cutoff: Read coverage cutoff value. Must be a positive float value, or "auto", or "off". Default value is "off". When set to "auto" SPAdes automatically computes coverage threshold using conservative strategy.

More details on --cov-cutoff parameter (this should be handled with care)

✔️ SPAdes: 4.0.0

Quality control of the assembly

Assemblies' qualities were checked with QUAST using default parameter
Assemblies' completeness were assessed using BUSCO against database enterobacteriaceae_odb12

# Batch mode
busco -i [inputs.directory] -m geno -l enterobacteriaceae_odb12 -c [threads]

Assembly filtration: contigs with coverage lower than 1 were discarded using SeqKit

seqkit fx2tab [assembly] | csvtk mutate -H -t -f 1 -p "cov_(.+)" | awk -F "\t" '$4>=1' | seqkit tab2fx > filtered_assembly.fasta

✔️ BUSCO: 5.8.2 ✔️ seqkit: v2.8.2

Serotype

Serotypes were determined using ECTyper with default parameters.

✔️ ECTyper: 2.0.0

MLST

Sequence Typing was done using mlst, --scheme ecoli_achtman_4, and default parameters.

✔️ mlst

Phylogroup

The phylogroup was determined using ClermonTyping and default parameters.

✔️ ClermonTyping: 24.02

Virulence genes typing

The virulenece genes were screen against the curated database "Septicoli" using Abricate with thresholds --minid 80 and --mincov 90. The database was done by Guilhem Royer in Kieffer et al, 2019.

Kieffer N, Royer G, Decousser JW, Bourrel AS, Palmieri M, Ortiz De La Rosa JM, Jacquier H, Denamur E, Nordmann P, Poirel L. mcr-9, an Inducible Gene Encoding an Acquired Phosphoethanolamine Transferase in Escherichia coli, and Its Origin. Antimicrob Agents Chemother. 2019 Aug 23;63(9):e00965-19. doi: 10.1128/AAC.00965-19. Erratum in: Antimicrob Agents Chemother. 2019 Oct 22;63(11): PMID: 31209009; PMCID: PMC6709461. https://doi.org/10.1128/AAC.00965-19

✔️ abricate: 1.0.1

Resistance genes typing

The resistance genes were screened against the NCBI database using AMRfinderplus

amrfinder -n [assembly] --organism Escherichia --threads [threads] --plus --name [prefix] > output.tab

✔️ amrfinder: 4.0.3

Plasmid

Plasmid replicon types were screened against the PlasmidFinder_DB database using Abricate with thresholds --minid 80 and --mincov 90.

abricate --fofn [file of file names] --db plasmidfinder > output.tab

✔️ abricate: 1.0.1

Annotation

The annotation was done using Bakta

#Bakta database needs to be downloaded manually per installation instruction and kept in bakta_db directory

bakta --db [bakta_db] --prefix [file_name] --output [out_dir] --keep-contig-headers --genus Escherichia -v [assembly] -t [threads]

✔️ bakta: 1.10.3

Pan-genome investigation

Pangenome calculation was done using Panaroo

panaroo -i [input_gff3_files] -o [out_dir] --clean-mode strict --remove-invalid-genes -a core

--clean-mode: The stringency mode at which to run panaroo. Must be one of 'strict','moderate' or 'sensitive'. Each of these modes can be fine tuned using the additional parameters in the 'Graph correction' section. strict: Requires fairly strong evidence (present in at least 5% of genomes) to keep likely contaminant genes. Will remove genes that are refound more often than they were called originally.

--remove-invalid-genes: removes annotations that do not conform to the expected Prokka format such as those including premature stop codons.

-a: Output alignments of core genes or all genes. Options are 'core' and 'pan'. Default: 'None' #Aligner: default 'mafft'

✔️ Panaroo: 1.5.1

Phylogenomic inference

The phylogenetic tree was constructed using IQTree

#Due to large number of samples and limited computing capacity, the model GTR+F+R7 was chosen in prior instead of running model search by ModelFinder (-m MFP).
#The limits for memory and cores usage were 54G (-mem) and 18 (-ntmax), respectively.
#Safe numerical mode (-safe) was turned on to avoid numerical underflow for large data sets with many sequences

iqtree -s [msa.aln] -pre [file_prefix] -m GTR+F+R7 -bb 1000 -alrt 1000 -nt AUTO -mem 54G -ntmax 18 -safe

-bb: number of bootstrap replicates -alrt: number of replicates to perform SH-like

✔️ IQTree: 1.6.12

Results

All the typing results were processed using R 4.3.3.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

52HN - Evolutionary dynamics of Escherichia coli in Vietnam

Steps

Species identification

Quality control

Assembly

Quality control of the assembly

Serotype

MLST

Phylogroup

Virulence genes typing

Resistance genes typing

Plasmid

Annotation

Pan-genome investigation

Phylogenomic inference

Results

About

Releases

Packages

Trang021200/52HN_Escherichia_coli

Folders and files

Latest commit

History

Repository files navigation

52HN - Evolutionary dynamics of Escherichia coli in Vietnam

Steps

Species identification

Quality control

Assembly

Quality control of the assembly

Serotype

MLST

Phylogroup

Virulence genes typing

Resistance genes typing

Plasmid

Annotation

Pan-genome investigation

Phylogenomic inference

Results

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages