Skip to content

A de novo assembly pipeline for sequencing data using Oxford Nanopore Technologies

Notifications You must be signed in to change notification settings

jesswiley-PBio/ont-de-novo-assembly

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Nextflow De Novo Assembly Pipeline for Oxford Nanopore Data

This pipeline automates the process of generating a de novo assembly of long-read sequencing data produced by Oxford Nanopore Technology.

Technical Considerations

Minimum Read Length

The assembler used for this pipeline, Flye, has a minimum read length limit of 1000bp. For most data generated using Nanopore, this should be fine. However, if the majority of your reads are less than this length, this pipeline will not be useful for your analysis. In my experience, the noisy/long read assembler Miniasm allows parameters to be set for reads shorter than 1000bp (Find a helpful tutorial for this tool here).

Medaka Model

Medaka requires information about the pore, sequencing device, and basecaller. This information is specified to the tool through a 'model', which is a string of text in the following format:

{pore}_{device}_{caller variant}_{caller version}

The pipeline requires a medaka model as input. To see a list of medaka models, use the command medaka tools list_models.

As well, it is important to note, the models will not contain individual versions of Guppy. Thus, you should choose the version most closest to and less than the version you used (Ex: using Guppy 6.00, enter R***_min_hac_g507)

Medaka Batch Size

A known issue with medaka is that it can use too much GPU memory and crash resulting in an incomplete assembly. If this happens, you can try entering the following command to allow for more GPU memory to be allotted:

export TF_FORCE_GPU_ALLOW_GROWTH=true

If this still does not solve the issue, you can reduce the batch size. The pipeline allows for the modification of this value through an optional argument --medakaBatchSize INT. The value is defaultly 100.

Installation

To install this pipeline, enter the following commands:

# Clone the repository
git clone https://github.com/rchapman2000/ont-de-novo-assembly.git

# Create a conda environment using the provided environment.yml file
conda env create -f environment.yml

# Activate the conda environment
conda activate ONT-DeNovoAssembly

Updating the Pipeline

If you already have the pipeline installed, you can update it using the following commands:

# Navigate to your installation directory
cd ont-de-novo-assembly

# Use git to pull the latest update
git pull

# Activate the conda environment and use the environment.yml file to download updates
conda activate ONT-DeNovoAssembly
conda env update --file environment.yml --prune

Usage

To run the pipeline, use the following command:

# You must either be in the same directory as the main.nf file or reference the file location.
nextflow run main.nf [OPTIONS] --input INPUT_DIR --output OUTPUT_DIR --model MEDAKA_MODEL

Optional Arguments

The pipeline also supports the following optional arguments:

Option Type Description
--trimONTAdapters None Enables ONT Adapter/Barcode trimming using Porechop [Default = off]
--minReadLen int If supplied, the pipeline will perform length filtering using Chopper excluding reads less than this size [Default = off]
--maxReadLen int If supplied, the pipeline will perform length filtering using Chopper excluding reads larger than this size [Default = off]
--medakaBatchSize int Medaka uses a lot of GPU memory, and if you're assembly is large enough it may cause Medaka to crash. Reducing the batch size will help solve this issue. [Default = 100]
--preGuppy5 None Flye handles data generated version of Guppy < 5.0 differently. Supply this parameter if your data was generated pre-Guppy 5.0
--meta None Flye option for meta-genome assembly mode. The main difference is that without this option, the "regular" mode assumes a relatively uniform coverage of the assembled genome and makes certain decisions based on that. The meta-genome mode is more general in this respect, and works well for assembly of complex microbial communities with highly non-uniform coverage and richer repeat content. It is sensitive to very short sequences and underrepresented organisms at low read coverage (as low as 3x) [Default = off].
--min-overlap int Flye minimum overlap between reads. The default minimum overlap length used by the assembler varies depending on the type of reads and the selected mode. For Nanopore reads, the typical default minimum overlap length is set to 3000 base pairs. This default setting is chosen to balance the potential for alignment errors common in longer reads and the need for sufficient overlap to confidently establish connections between reads in the assembly process. Intuitively, we want keep it as high as possible (e.g. 5-10kb) to reduce the complexity of a repeat graph. However, if the read length is not sufficient, this might lead to gaps in assembly. Flye automatically selects this parameter based on the read length distribution, and for the most of datasets the selected value works well. In some rare cases, this parameter needs to be adjusted manually, for example if the read length distribution is skewed. [Default = off, Flye chooses length base on type of reads and the selected mode].
--threads int The number of CPU threads that can be use to run pipeline tools in parallel

To view the list of options from the command line, use the following command:

nextflow run main.nf --help

About

A de novo assembly pipeline for sequencing data using Oxford Nanopore Technologies

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Nextflow 100.0%