Assembly Tools

Snakemake workflows used to assemble bacterial isolates.

These workflows were used to assemble five historical Bacillus anthracis isolates, soon to be published in Microbiology Resource Announcements.

The Bacillus anthracis assemblies have been deposited in DDBJ/ENA/GenBank under BioSample accession numbers SAMN12620928, SAMN12620929, SAMN12620930, SAMN12620931, and SAMN12620932. The raw Illumina paired-end sequencing reads have been deposited in the Sequence Read Archive under accession numbers SRR10019497, SRR10019498, SRR10019499, SRR10019500, and SRR10019501.

Installation

Read preprocessing workflow installation

Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Download asm_tools

git clone https://github.com/bioforensics/asm_tools.git

OR

Download a release from the repository's Releases page on GitHub.

Set up the Python environment and use conda to install the required packages (mash, fastp, etc.).

   cd asm_tools/preprocess
   conda env create -f preprocess_env.yml
   conda activate bmap_preprocess
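
After activating the environment, it can help to confirm that the expected tools are on the PATH. The sketch below is only a convenience check; the tool list (snakemake, mash, fastp) is an assumption based on the packages named above, and missing_tools is a hypothetical helper, not part of asm_tools.

```shell
# Report which of the given command-line tools are missing from PATH.
# missing_tools is a hypothetical helper, not part of asm_tools.
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Print the missing tools (if any), separated by single spaces.
    echo "${missing# }"
}

# Usage: run after `conda activate bmap_preprocess`.
# The tool list here is an assumption; adjust it to match preprocess_env.yml.
result=$(missing_tools snakemake mash fastp)
if [ -n "$result" ]; then
    echo "Missing from PATH: $result"
else
    echo "All tools found"
fi
```

If anything is reported missing, re-creating the environment from preprocess_env.yml is usually the first thing to try.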

(Optional) Download databases for "mash screen" to check for contaminants.
Mash sketch databases for RefSeq release 88:

RefSeq88n.msh.gz: Genomes (k=21, s=1000), 1.2 GB uncompressed
RefSeq88p.msh.gz: Proteomes (k=9, s=1000), 1.1 GB uncompressed

Edit preprocess/config.yml to set the path to the Mash database:

mashdb: path/to/mashdb
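
For scripted setups, the same edit can be made from the shell. This is a minimal sketch, assuming mashdb is a top-level key on its own line as in the example above; set_mashdb is a hypothetical helper and the database path in the usage comment is a placeholder. The downloaded .msh.gz file would need to be decompressed first (for example with gunzip).

```shell
# Set (or add) the top-level mashdb entry in a YAML config file.
# set_mashdb is a hypothetical helper, not part of asm_tools.
set_mashdb() {
    config=$1
    db_path=$2
    tmp=$(mktemp)
    # Copy every line except an existing top-level mashdb entry.
    # grep exits non-zero when nothing is printed, so ignore its status.
    grep -v '^mashdb:' "$config" > "$tmp" || true
    # Append the new entry; top-level key order does not matter in YAML.
    printf 'mashdb: %s\n' "$db_path" >> "$tmp"
    # Write back in place so the original file permissions are kept.
    cat "$tmp" > "$config"
    rm -f "$tmp"
}

# Usage (placeholder path, shown as a comment so nothing is modified here):
#   set_mashdb preprocess/config.yml /path/to/RefSeq88n.msh
```

Editing the file by hand is equally valid; the helper only avoids duplicate mashdb lines when the step is repeated.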

Run the read preprocessing workflow

path/to/asm_tools/preprocess/bmap_preprocess -r1 test/seq/test_R1.fastq.gz -r2 test/seq/test_R2.fastq.gz -s sample_name
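
To process several isolates, the command above can be wrapped in a loop. The sketch below assumes mates are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, following the test data naming; list_pairs is a hypothetical helper, and the loop only prints the commands (a dry run) rather than running them.

```shell
# Print "sample<TAB>R1<TAB>R2" for each complete read pair in a directory.
# Assumes the <sample>_R1.fastq.gz / <sample>_R2.fastq.gz naming of the test data.
# list_pairs is a hypothetical helper, not part of asm_tools.
list_pairs() {
    dir=$1
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue      # no match: the glob stays literal
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
        [ -e "$r2" ] || continue      # skip files without a mate
        sample=$(basename "$r1" _R1.fastq.gz)
        printf '%s\t%s\t%s\n' "$sample" "$r1" "$r2"
    done
}

# Dry run: print one bmap_preprocess command per pair instead of running it.
list_pairs test/seq | while IFS="$(printf '\t')" read -r sample r1 r2; do
    echo path/to/asm_tools/preprocess/bmap_preprocess -r1 "$r1" -r2 "$r2" -s "$sample"
done
```

Removing the echo runs the workflow for each pair in turn.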

Singularity Container installation

singularity pull bmap_preprocess.sif library://dsommer/default/bmap/bmap_preprocess
singularity exec bmap_preprocess.sif -r1 test/seq/test_R1.fastq.gz -r2 test/seq/test_R2.fastq.gz -s test1

Preprocessing Paired-End Reads

Snakemake DAG

Workflow outline

The preprocessing.smk Snakemake workflow prepares Illumina reads to be assembled.

Run fastp to trim adapter sequence, low-quality bases, and very short reads. By default, bases below Q20 at the ends of reads are trimmed, and reads shorter than 75 bases or containing Ns are removed.
Run "mash screen" against RefSeq to check for contaminants.
Estimate the genome size by building a k-mer profile of the reads.
Randomly downsample the reads to 150× coverage of the estimated genome size using the sample-reads program.
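
The trimming and downsampling steps above can be sketched in shell. This is an illustration under stated assumptions, not the contents of preprocessing.smk: the fastp flags are one plausible way to get the defaults described (Q20 end trimming, minimum length 75, no Ns), the output file names are hypothetical, and downsample_fraction only shows the coverage arithmetic rather than how sample-reads is invoked.

```shell
# Trimming sketch: fastp with settings matching the defaults described above.
# Flag choices and output names are assumptions, not copied from preprocessing.smk.
trim_reads() {
    r1=$1; r2=$2; prefix=$3
    fastp \
        --in1 "$r1" --in2 "$r2" \
        --out1 "${prefix}_trimmed_R1.fastq.gz" \
        --out2 "${prefix}_trimmed_R2.fastq.gz" \
        --cut_front --cut_tail --cut_mean_quality 20 \
        --length_required 75 \
        --n_base_limit 0
}

# Downsampling sketch: fraction of reads to keep for a target coverage.
# fraction = target_coverage * genome_size / total_bases, capped at 1.
downsample_fraction() {
    genome_size=$1; total_bases=$2; target_cov=${3:-150}
    awk -v g="$genome_size" -v b="$total_bases" -v c="$target_cov" 'BEGIN {
        f = (b > 0) ? c * g / b : 1
        if (f > 1) f = 1
        printf "%.4f\n", f
    }'
}

# Example: a 5.2 Mb genome with 1.56 Gb of trimmed bases (300x coverage).
downsample_fraction 5200000 1560000000 150   # prints 0.5000
```

When the reads already fall below the target coverage, the fraction is capped at 1, meaning all reads are kept.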
