inisite-nf

Introduction

inisite-nf is a bioinformatics analysis pipeline used for mapping initiation sites from nascent strand sequencing data (NS-seq).

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner.

Pipeline summary

The basic principle of the mapping workflow is abstracted from Cayrou et al., Genome Research 2015. In brief, we use the MACS peak caller to map NS-seq peaks genome-wide. The input can be processed in two modes, depending on whether a second NS-seq data set is given.

a. Single NS-seq data

Adapter and quality trimming with trim_galore
Alignment of reads with bowtie
Calling narrow peaks with or without control using MACS2
Spatial clustering of MACS peaks with ClusterScan
Filtering MACS peaks by overlap with identified peak clusters with BEDTools
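
The final filtering step can be illustrated with a minimal sketch. It assumes BED-style half-open coordinates and that a peak is kept if it shares at least one base with any cluster; the overlap options the pipeline actually passes to BEDTools are not stated here, and the function names are hypothetical.

```python
from typing import List, Tuple

# (chrom, start, end) in BED-style half-open coordinates
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals lie on the same chromosome and share >= 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def filter_peaks_by_clusters(peaks: List[Interval],
                             clusters: List[Interval]) -> List[Interval]:
    """Keep only the peaks that overlap at least one cluster (assumed rule)."""
    return [p for p in peaks if any(overlaps(p, c) for c in clusters)]


# Toy data: only the first peak falls inside the single cluster.
peaks = [("chr1", 100, 200), ("chr1", 5000, 5100), ("chr2", 100, 200)]
clusters = [("chr1", 50, 1000)]
kept = filter_peaks_by_clusters(peaks, clusters)
print(kept)
```

The quadratic scan is fine for a toy example; BEDTools uses far more efficient interval lookups on real peak sets.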

b. Dual NS-seq data

Adapter and quality trimming with trim_galore
Alignment of reads with bowtie
Calling narrow peaks for both datasets with or without control using MACS2
Identifying and retaining peaks common to both datasets with BEDTools
Spatial clustering of common peaks with ClusterScan
Filtering common peaks by overlap with identified peak clusters with BEDTools
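
As an illustration of the "common peaks" step in dual mode, the sketch below reports the regions shared by two peak sets. Whether the pipeline keeps the shared region or a whole peak is not stated here, so reporting the intersected region is an assumption, and `common_regions` is a hypothetical name.

```python
from typing import List, Tuple

# (chrom, start, end) in BED-style half-open coordinates
Interval = Tuple[str, int, int]


def common_regions(peaks_a: List[Interval],
                   peaks_b: List[Interval]) -> List[Interval]:
    """Return the regions covered by a peak in both sets, sorted by position."""
    common = []
    for chrom_a, start_a, end_a in peaks_a:
        for chrom_b, start_b, end_b in peaks_b:
            if chrom_a != chrom_b:
                continue
            # The shared region runs from the later start to the earlier end.
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                common.append((chrom_a, start, end))
    return sorted(common)


peaks_1 = [("chr1", 100, 200), ("chr1", 300, 400)]
peaks_2 = [("chr1", 150, 250), ("chr2", 300, 400)]
shared = common_regions(peaks_1, peaks_2)
print(shared)
```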

Quick Start

i. Install nextflow

ii. Clone repository with

nextflow pull pavrilab/inisite-nf

iii. Start running your own analysis!

a. Single

nextflow run pavrilab/inisite-nf --treatment ns_seq1.fastq [--control control1.fastq] --genome mm9

b. Dual

nextflow run pavrilab/inisite-nf --treatment ns_seq1.fastq --treatment2 ns_seq2.fastq [--control control1.fastq --control2 control2.fastq] --genome mm9

Note that we only support human, mouse, Drosophila and C. elegans by default (see conf/igenomes.conf). If you want to use a different genome, either specify --fasta/--bowtieIndex and --genomeSize, or add the respective genome to the igenomes.conf file.
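
For scripted runs, the two invocations above can be assembled programmatically. The helper below is a hypothetical sketch, not part of the pipeline: it only builds the argument list from the flags shown above (using the repository name from the `nextflow pull` line) and does not launch Nextflow.

```python
from typing import List, Optional


def build_command(treatment: str,
                  genome: str,
                  treatment2: Optional[str] = None,
                  control: Optional[str] = None,
                  control2: Optional[str] = None) -> List[str]:
    """Assemble a `nextflow run` argument list for single or dual mode."""
    cmd = ["nextflow", "run", "pavrilab/inisite-nf", "--treatment", treatment]
    if treatment2 is not None:  # a second treatment switches to dual mode
        cmd += ["--treatment2", treatment2]
    if control is not None:
        cmd += ["--control", control]
    if control2 is not None:
        cmd += ["--control2", control2]
    cmd += ["--genome", genome]
    return cmd


single = build_command("ns_seq1.fastq", "mm9")
dual = build_command("ns_seq1.fastq", "mm9",
                     treatment2="ns_seq2.fastq",
                     control="control1.fastq",
                     control2="control2.fastq")
print(" ".join(single))
print(" ".join(dual))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where Nextflow is installed.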

Main arguments

`-profile`

Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. For example, -profile cbe runs the processes through the Slurm workload manager. If no profile is given, the pipeline is executed locally.

`--genome`

The genome argument serves two purposes. First, it documents the name of the reference genome of the organism the NS-seq data originates from. Second, it can be used to retrieve prespecified reference data from a local igenomes database, from which the pipeline automatically takes the files it requires for processing the NS-seq data (i.e. a bowtie index, a genome fasta and the effective genome size).

There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.

Note that we only support genomes that have precompiled effective genome sizes in MACS. All other genomes have to be added by the user.

You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:

Human
- --genome hg19
Mouse
- --genome mm9
Drosophila
- --genome dm6

Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. See the Nextflow documentation for instructions on where to save such a file.

The syntax for this reference configuration is as follows:

params {
  genomes {
    'GRCh37' {
      fasta      = "/path/to/genome/fasta/file" // Used if no bowtie index
      bowtie     = "/path/to/bowtie/index/basename"
      genomeSize = 2.7e9 // Effective genome size in bp for MACS (example value; see MACS documentation)
    }
    // Any number of additional genomes, key is used with --genome
  }
}

`--treatment`

Raw NS-seq reads in FASTQ format

`--treatment2`

Optional second set of raw NS-seq reads in FASTQ format

`--control`

Raw reads of the background signal (e.g. sheared genomic DNA). If not given, MACS calls peaks without input.

`--control2`

Raw control reads for --treatment2. If --control was specified, this must also be specified. If neither control argument is given, MACS calls peaks without input for both treatments.
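
The rule for combining the control arguments can be written down as a small check. This is a hypothetical sketch rather than the pipeline's own validation; rejecting --control2 without --control, and --control2 without --treatment2, are assumptions that go beyond what is stated above.

```python
from typing import Optional


def check_controls(treatment2: Optional[str],
                   control: Optional[str],
                   control2: Optional[str]) -> str:
    """Return how peaks would be called, or raise on an inconsistent setup."""
    if treatment2 is None:
        # Single mode: --control2 has nothing to pair with (assumption).
        if control2 is not None:
            raise ValueError("--control2 given without --treatment2")
        return "with control" if control is not None else "without control"
    # Dual mode: controls must be given for both treatments or for neither.
    if (control is None) != (control2 is None):
        raise ValueError("give both --control and --control2, or neither")
    return "with control" if control is not None else "without control"


print(check_controls(None, None, None))
print(check_controls("ns_seq2.fastq", "control1.fastq", "control2.fastq"))
```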

Generic arguments

`--genome`

Genome from which the NS-seq reads originate (same as -g option in MACS). Has to be specified.

`--fasta`

This parameter specifies the genome fasta file and is only required if no igenomes database is available or the bowtie index is not available locally. The file is used to compute the bowtie index (if the index is not specified manually).

`--genomeSize`

Effective size of the genome to use with MACS, if not specified in igenomes

`--bowtieIndex`

Index to use with bowtie. It is generated from the fasta file if not specified in igenomes

`--extensionSize`

Integer value specifying the length to which each read is extended by MACS before peak calling
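
To make the effect of read extension concrete, the sketch below extends an aligned read from its 5' end to a fixed fragment length, which is the general idea behind fixed-size extension in MACS. That --extensionSize is handed to MACS in exactly this way is an assumption, and `extend_read` is a hypothetical helper.

```python
from typing import Tuple


def extend_read(start: int, end: int, strand: str,
                extension_size: int) -> Tuple[int, int]:
    """Extend a read (half-open coordinates) from its 5' end to a fixed length."""
    if strand == "+":
        # The 5' end of a forward read is its start; extend to the right.
        return start, start + extension_size
    # The 5' end of a reverse read is its end; extend to the left,
    # without running past the beginning of the chromosome.
    return max(0, end - extension_size), end


print(extend_read(100, 136, "+", 200))
print(extend_read(1000, 1036, "-", 200))
```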

`--qValueCutoff`

Float value between 0 and 1 specifying the q-value cutoff used by MACS to identify significant pileups
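
MACS applies the cutoff itself, but the meaning of the value can be illustrated on its output. In the narrowPeak format, column 9 holds -log10(q-value), so a cutoff of 0.05 corresponds to a threshold of about 1.3 in that column. The helper below is a hypothetical sketch and the example line is made up.

```python
import math


def passes_cutoff(narrowpeak_line: str, q_value_cutoff: float) -> bool:
    """True if the peak's q-value is at or below the cutoff."""
    fields = narrowpeak_line.rstrip("\n").split("\t")
    neg_log10_q = float(fields[8])  # column 9: -log10(q-value)
    return neg_log10_q >= -math.log10(q_value_cutoff)


# Made-up narrowPeak record with -log10(q-value) = 3.0, i.e. q = 0.001
line = "chr1\t100\t200\tpeak_1\t50\t.\t5.0\t7.0\t3.0\t40"
print(passes_cutoff(line, 0.05))
print(passes_cutoff(line, 0.0001))
```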

`--filePrefix`

Prefix for the names of the result files

`--outputDir`

Folder to which results will be written (created if it does not exist)

Results

The set of result files depends on the number of treatment files given (i.e. single or dual mode). If the pipeline is run in single mode, the --outputDir will contain:

The aligned reads in BAM format
A *_MACS.bed file containing the original peaks called by MACS2, named according to the treatment BAM basename
A *_IS.bed file containing the cluster-filtered MACS peaks
A *_IZ.bed file containing the clusters called by ClusterScan and used to filter the MACS peaks

If the pipeline is run in dual mode, the --outputDir will contain:

The aligned reads in BAM format
Two *_MACS.bed files containing the original peaks called by MACS2 for each treatment BAM, named according to the treatment BAM basename
A *.common.bed file containing all MACS peaks found in both treatment BAMs
A *_IS.bed file containing the cluster-filtered common MACS peaks
A *_IZ.bed file containing the clusters called by ClusterScan and used to filter the common MACS peaks

Credits

The pipeline was developed by Daniel Malzl for use at the IMP, Vienna.

Many thanks to others who have helped out along the way too, including (but not limited to): @t-neumann, @pditommaso.

Citations

Pipeline tools

Nextflow

Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

BEDTools

Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278. PubMed Central PMCID: PMC2832824.

Bowtie

B. Langmead, C. Trapnell, M. Pop, S. L. Salzberg, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10, R25 (2009). doi: 10.1186/gb-2009-10-3-r25. PubMed PMID: 19261174. PubMed Central PMCID: PMC2690996.

cutadapt

M. Martin, Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10-12 (2011). doi: 10.14806/ej.17.1.200

ClusterScan

M. Volpe, M. Miralto, S. Gustincich, R. Sanges, ClusterScan: simple and generalistic identification of genomic clusters. Bioinformatics. 2018 Nov 15;34(22):3921-3923. doi: 10.1093/bioinformatics/bty486. PubMed PMID: 29912285.

MACS

Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137. doi: 10.1186/gb-2008-9-9-r137. Epub 2008 Sep 17. PubMed PMID: 18798982. PubMed Central PMCID: PMC2592715.

Python Packages

pandas

Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)
