MIRFLOWZ

MIRFLOWZ is a Snakemake workflow for mapping miRNAs and isomiRs.

Table of Contents

  1. Installation
  2. Usage
  3. Workflow description
  4. Contributing
  5. License
  6. Contact

Installation

The workflow lives inside this repository and will be available for you to run after following the installation instructions laid out in this section.

Cloning the repository

Traverse to the desired path on your file system, then clone the repository and change into it with:

git clone https://github.com/zavolanlab/mirflowz.git
cd mirflowz

Dependencies

For improved reproducibility and reusability of the workflow, as well as an easy means to run it on a high performance computing (HPC) cluster managed, e.g., by Slurm, all steps of the workflow run inside isolated environments (Singularity containers or Conda environments). As a consequence, running this workflow has only a few individual dependencies. These are managed by the package manager Conda, which needs to be installed on your system before proceeding.

If you do not already have Conda installed globally on your system, we recommend that you install Miniconda. For faster creation of the environment (and Conda environments in general), you can also install Mamba on top of Conda. In that case, replace conda with mamba in the commands below (particularly in conda env create).

Setting up the virtual environment

Create and activate the environment with the necessary dependencies using Conda:

conda env create -f environment.yml
conda activate mirflowz

If you plan to run MIRFLOWZ via Conda, we recommend using the following command for faster environment creation, especially if you will run it on an HPC cluster.

conda config --set channel_priority strict

If you plan to run MIRFLOWZ via Singularity and do not already have it installed globally on your system, you must further update the Conda environment using the environment.root.yml with the command below. Mind that you must have the environment activated to update it.

conda env update -f environment.root.yml

Note that you will need to have root permissions on your system to be able to install Singularity. If you want to run MIRFLOWZ on an HPC cluster (recommended in almost all cases), ask your systems administrator about Singularity.

If you would like to contribute to MIRFLOWZ development, you may find it useful to further update your environment with the development dependencies:

conda env update -f environment.dev.yml

Testing your installation

Several tests are provided to check the integrity of the installation. Follow the instructions in this section to make sure the workflow is ready to use.

Run test workflow on local machine

Execute one of the following commands to run the test workflow on your local machine:

  • Test workflow on local machine with Singularity:
bash test/test_workflow_local_with_singularity.sh
  • Test workflow on local machine with Conda:
bash test/test_workflow_local_with_conda.sh

Run test workflow on a cluster via Slurm

Execute one of the following commands to run the test workflow on a Slurm-managed high-performance computing (HPC) cluster:

  • Test workflow with Singularity:
bash test/test_workflow_slurm_with_singularity.sh
  • Test workflow with Conda:
bash test/test_workflow_slurm_with_conda.sh

Rule graph

Execute the following command to generate a rule graph image for the workflow. The output will be found in the images/ directory in the repository root.

bash test/test_rule_graph.sh

You can see the rule graph below in the workflow description section.

Clean up test results

After successfully running the tests above, you can run the following command to remove all artifacts generated by the test runs:

bash test/test_cleanup.sh

Usage

Now that your virtual environment is set up and the workflow is deployed and tested, you can go ahead and run the workflow on your samples.

Preparing inputs

It is suggested to have all the input files for a given run (or hard links pointing to them) inside a dedicated directory, for instance under the MIRFLOWZ root directory. This way, it is easier to keep the data together, set up Singularity access to them and reproduce analyses.

1. Prepare a sample table

Refer to test/test_files/sample_table.tsv to see what this file must look like, or use it as a template.

touch path/to/your/sample/table.tsv

Fill the sample table according to the following requirements:

  • sample. Arbitrary name for the miRNA sequencing library.
  • sample_file. Path to the miRNA sequencing library file. The path must be relative to the directory where the workflow will be run.
  • adapter. Sequence of the 3'-end adapter used during library preparation.
  • format. One of fa/fasta or fq/fastq, if the library file is in FASTA or FASTQ format, respectively.
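The requirements above can be checked before starting a run. The following is a minimal sketch, not part of MIRFLOWZ: it assumes the table has a header row whose column names match the four field names listed above, and that adapters consist only of the letters A, C, G, T and N. The function name, sample names, paths and adapter sequences are hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "sample_file", "adapter", "format")
ALLOWED_FORMATS = {"fa", "fasta", "fq", "fastq"}


def validate_sample_table(handle):
    """Return a list of human-readable problems found in a sample table."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2 because line 1 is the (assumed) header.
    for line_no, row in enumerate(reader, start=2):
        fmt = row.get("format") or ""
        adapter = row.get("adapter") or ""
        path = row.get("sample_file") or ""
        if fmt not in ALLOWED_FORMATS:
            problems.append(f"line {line_no}: unknown format {fmt!r}")
        if not adapter or set(adapter.upper()) - set("ACGTN"):
            problems.append(f"line {line_no}: adapter is not a nucleotide sequence")
        if path.startswith("/"):
            problems.append(f"line {line_no}: sample_file should be a relative path")
    return problems


# Hypothetical two-row table; the second row is deliberately invalid.
example = io.StringIO(
    "sample\tsample_file\tadapter\tformat\n"
    "lib_A\tdata/lib_A.fastq\tTGGAATTCTCGGGTGCCAAGG\tfastq\n"
    "lib_B\t/abs/lib_B.fa\tAACTGTAGGCACCATCAAT\tbam\n"
)
for problem in validate_sample_table(example):
    print(problem)
```

An empty list means the table passed these checks. It does not guarantee that the workflow accepts the table, since the checks only mirror the requirements stated in this section.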

2. Prepare genome resources

There are 4 files you must provide:

  1. A gzipped FASTA file containing reference sequences, typically the genome of the source/organism from which the library was extracted.

  2. A gzipped GTF file with matching gene annotations for the reference sequences above.

MIRFLOWZ expects both the reference sequence and gene annotation files to follow Ensembl style/formatting. If you obtained these files from a source other than Ensembl, you must ensure that they adhere to the expected format by converting them, if necessary.

  3. An uncompressed GFF3 file with microRNA annotations for the reference sequences above.

MIRFLOWZ expects the miRNA annotations to follow miRBase style/formatting. If you obtained this file from a source other than miRBase, you must ensure that it adheres to the expected format by converting it, if necessary.

  4. An uncompressed tab-separated file with a mapping between the reference names used in the miRNA annotation file (column 1; "UCSC style") and in the gene annotations and reference sequence files (column 2; "Ensembl style"). Values in column 1 are expected to be unique, no header is expected, and any additional columns will be ignored. This resource provides such files for various organisms, and in the expected format.

  5. OPTIONAL: A BED6 file with regions for which to produce ASCII-style pileups. If not provided, no pileups are generated. See here for the expected format.

General note: If you want to process the genome resources before use (e.g., filtering), you can do that, but make sure the formats of any modified resource files meet the formatting expectations outlined above!
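As an illustration of the expectations for the reference name mapping file (no header, unique values in column 1, extra columns ignored), here is a minimal sketch of a loader that enforces them. It is not MIRFLOWZ code; the function name is hypothetical, and skipping blank lines is an assumption.

```python
import csv
import io


def load_chromosome_map(handle):
    """Read a headerless TSV and map column 1 names to column 2 names."""
    mapping = {}
    for line_no, fields in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not fields:
            continue  # assumption: blank lines are tolerated and skipped
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected at least two columns")
        source, target = fields[0], fields[1]  # any further columns are ignored
        if source in mapping:
            raise ValueError(f"line {line_no}: duplicate name {source!r} in column 1")
        mapping[source] = target
    return mapping


# UCSC-style names in column 1, Ensembl-style names in column 2.
example = io.StringIO("chr1\t1\nchrX\tX\nchrM\tMT\tignored extra column\n")
print(load_chromosome_map(example))
```

Raising on a duplicate in column 1, rather than silently keeping the last value, makes a malformed mapping file visible before any annotation is converted.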

3. Prepare a configuration file

We recommend creating a copy of the configuration file template:

cp config/config_template.yaml path/to/config.yaml

Open the new copy in your editor of choice and adjust the configuration parameters to your liking. The template explains what each of the parameters means and how you can meaningfully adjust it.

Running the workflow

With all the required files in place, you can now run the workflow locally via Singularity with the following command:

snakemake \
    --snakefile="path/to/Snakefile" \
    --cores 4  \
    --configfile="path/to/config.yaml" \
    --use-singularity \
    --singularity-args "--bind ${PWD}/../" \
    --printshellcmds \
    --rerun-incomplete \
    --verbose

NOTE: Depending on your working directory, you may not need the --snakefile and --configfile parameters. For instance, if the Snakefile is in the current working directory, or if the workflow/ directory is directly beneath it, there is no need for the --snakefile parameter. Refer to the Snakemake documentation for more information.

After successful execution of the workflow, results and logs will be found in the results/ and logs/ directories, respectively.

Expected output files

Upon successful execution of MIRFLOWZ, the tool automatically removes all intermediate files generated during the process. The final outputs comprise:

  1. A SAM file containing alignments intersecting a pri-miR locus. These alignments intersect with extended start and/or end positions specified in the provided pri-miR annotations. Please note that they may not contribute to the final counting and may not appear in the final table.

  2. A SAM file containing alignments intersecting a mature miRNA locus. Similar to the previous file, these alignments intersect with extended start and/or end positions specified in the provided miRNA annotations. They may not contribute to the final counting and might be absent from the final table.

  3. A BAM file containing the set of alignments contributing to the final counting and its corresponding index file (.bam.bai).

  4. Table(s) containing the counting data from all libraries for (iso)miRs and/or pri-miRs. Each row corresponds to a miRNA species, and each column represents a sample library. Each read is counted towards all the annotated miRNA species it aligns to, with 1/n, where n is the number of genomic and/or transcriptomic loci that read aligns to.

  5. OPTIONAL. ASCII-style pileups of read alignments produced for individual libraries, combinations of libraries and/or all libraries of a given run. The exact number and nature of the outputs depends on the workflow inputs/parameters. See the pileups section for a detailed description.

To retain all intermediate files, include --no-hooks in the workflow call.

snakemake \
    --snakefile="path/to/Snakefile" \
    --cores 4  \
    --configfile="path/to/config.yaml" \
    --use-conda \
    --printshellcmds \
    --rerun-incomplete \
    --no-hooks \
    --verbose

After successful execution of the workflow, the intermediate files will be found in the results/intermediates directory.

Creating a Snakemake report

Snakemake provides the option to generate a detailed HTML report on runtime statistics, workflow topology and results. If you want to create a Snakemake report, you must run the following command:

snakemake \
    --snakefile="path/to/Snakefile" \
    --configfile="path/to/config.yaml" \
    --report="snakemake_report.html"

NOTE: The report creation must be done after running the workflow in order to have the runtime statistics and the results.

Workflow description

MIRFLOWZ consists of a main Snakefile and four functional modules. In the Snakefile, the configuration file is validated and the modules are imported. In addition, handlers for both successful and failed runs are set: if the workflow finishes without errors, all intermediate files are removed; otherwise, a log file is created. To keep the intermediate files upon completion, use the --no-hooks CLI argument when running the pipeline.

The modules (1) process the genome resources, (2) map and (3) quantify the reads, and (4) generate pileups, as described in detail below.

NOTE: MIRFLOWZ uses the notation provided by miRBase (i.e., "miRNA primary transcript" for precursors and "miRNA" for the canonical mature miRNA). This implies that precursors are named "pri-miRs" across the workflow instead of "pre-miRs". The reason is that there is no guarantee that "miRNA primary transcripts" are full pre-miR (and pre-miR only) sequences.

Prepare module

The MIRFLOWZ workflow initially processes and indexes the genome resources provided by the user. The regions corresponding to mature miRNAs are extended by a fixed but user-adjustable number of nucleotides on both sides to accommodate isomiR species with shifted start and/or end positions. If necessary, pri-miR loci are extended to adjust to the new miRNA coordinates. In addition, to account for the different genomic locations at which a miRNA sequence can be annotated, the names of these sequences are modified to have the format SPECIES-mir-NAME-# for pri-miRs and SPECIES-miR-NAME-#-ARM or SPECIES-miR-NAME-# for mature miRNAs with both or just one arm, respectively, where # is the replica number.
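The coordinate extension can be pictured with a minimal sketch. It is not the workflow's implementation: the Locus class and both function names are hypothetical, coordinates are taken to be 1-based and inclusive as in GFF3, and clamping to the chromosome boundaries is an assumption. The example names only imitate the naming format described above, with a made-up species prefix.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Locus:
    name: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int    # 1-based, inclusive


def extend_mature(mirna, extension, chrom_length):
    """Extend a mature miRNA on both sides, clamped to the chromosome (assumed)."""
    return replace(
        mirna,
        start=max(1, mirna.start - extension),
        end=min(chrom_length, mirna.end + extension),
    )


def adjust_precursor(precursor, extended_matures):
    """Widen a pri-miR locus only where an extended mature miRNA sticks out."""
    start = min([precursor.start] + [m.start for m in extended_matures])
    end = max([precursor.end] + [m.end for m in extended_matures])
    return replace(precursor, start=start, end=end)


# Hypothetical precursor with two arms lying close to its ends.
precursor = Locus("xxx-mir-1-1", 100, 180)
arms = [Locus("xxx-miR-1-1-5p", 103, 124), Locus("xxx-miR-1-1-3p", 158, 179)]
extended = [extend_mature(arm, extension=6, chrom_length=1_000_000) for arm in arms]
print(extended)
print(adjust_precursor(precursor, extended))
```

In this example both extended arms reach beyond the original precursor, so the precursor is widened on both sides. A precursor that already contains its extended arms is returned unchanged.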

Map module

The user-provided short-read small RNA-seq libraries undergo quality filtering (skipped if libraries are provided in FASTA rather than FASTQ), followed by adapter removal. The resulting reads are independently mapped to both the genome and the transcriptome using two distinct aligners: Segemehl and our in-house tool Oligomap.

Segemehl implements a fast heuristic strategy that returns the alignment(s) with the smallest edit distance. Oligomap, on the other hand, implements a slower and more restricted approach that reports all the alignments with an edit distance of at most 1. Combining the fast, flexible results of the former with the strict selection of the latter yields results with higher fidelity than if only one of the tools were used.

Two merging steps are performed to collect all alignments in a single file. In the first, the transcriptome and genome mappings from both aligners are merged, and only alignments with an NH smaller than the provided threshold are kept. In the second, transcriptomic coordinates are converted into genomic ones and the alignments are combined into a single file. Duplicate alignments resulting from the partially redundant mapping strategy are discarded, and only the best alignments for each read (i.e., those with the smallest edit distance) are retained. Because aggregating the alignments changes NH, a second filter based on the new NH is then applied. A read that aligns to more loci than a specified threshold is removed because (1) the file size can otherwise increase rapidly, which hurts performance, and (2) each read contributes 1/N to each count, where N is the number of genomic loci it aligns to, so a large N makes its contribution negligible.
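The deduplication, best-alignment selection and NH filtering can be sketched as follows. This is an illustration under simplifying assumptions, not the workflow's code: alignments are reduced to a few hypothetical fields, all coordinates are already genomic, and the threshold is treated as inclusive (whether the actual filter is inclusive is not stated here).

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    read: str
    chrom: str
    start: int
    strand: str
    edit_distance: int


def merge_and_filter(alignments, max_nh):
    """Deduplicate, keep the best alignments per read, then apply an NH cap."""
    by_read = defaultdict(set)
    for aln in alignments:
        by_read[aln.read].add(aln)  # the set drops exact duplicates
    kept = {}
    for read, alns in by_read.items():
        best = min(a.edit_distance for a in alns)
        best_alns = sorted(a for a in alns if a.edit_distance == best)
        if len(best_alns) <= max_nh:  # NH after merging = number of retained hits
            kept[read] = best_alns
    return kept


alignments = [
    Alignment("r1", "1", 100, "+", 0),  # hit reported by one aligner
    Alignment("r1", "1", 100, "+", 0),  # the same hit reported by the other
    Alignment("r1", "2", 500, "-", 1),  # worse edit distance, discarded
    Alignment("r2", "1", 10, "+", 0),
    Alignment("r2", "3", 20, "+", 0),
    Alignment("r2", "4", 30, "-", 0),   # r2 has three equally good hits
]
print(merge_and_filter(alignments, max_nh=2))
```

With max_nh=2, read r1 keeps its single best alignment, while r2 is dropped entirely because it has three equally good hits.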

A final filter is applied to further increase the classification accuracy and reduce the number of multimappers. Given that isomiRs are known to present more mismatches than InDels when compared to the canonical sequence they come from, when a read has been mapped to multiple genomic locations, only the alignments with the fewest InDels are kept. Note that some multimappers might still be present if the number of InDels and mismatches is the same across alignments.
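A minimal sketch of such an InDel-based filter is shown below. It is not the workflow's implementation: it assumes InDels are counted as the number of inserted plus deleted bases in the CIGAR string (counting InDel events instead would be an equally plausible reading), and the function names and loci are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def count_indels(cigar):
    """Number of inserted plus deleted bases encoded in a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "ID")


def keep_fewest_indels(alignments):
    """From (locus, cigar) pairs of one read, keep those with the fewest InDels."""
    fewest = min(count_indels(cigar) for _, cigar in alignments)
    return [(locus, cigar) for locus, cigar in alignments if count_indels(cigar) == fewest]


# One read with three equally scored alignments (hypothetical loci).
hits = [("1:100", "22M"), ("5:900", "10M1I11M"), ("7:42", "22M")]
print(keep_fewest_indels(hits))
```

Here the alignment containing an insertion is dropped, and the read remains a multimapper with two hits because both remaining alignments are InDel-free.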

Quantify module

The filtered alignments are subsequently intersected with the user-provided, pre-processed miRNA annotation files using BEDTools. Each alignment is classified according to the miRNA species it fully intersects with, which forms the basis for counting.

Counts are tabulated separately for reads consistent with miRNA precursors, mature miRNAs and/or isomiRs, and all library counts are merged into a single table. Note that a read is only counted towards a given miRNA (or isomiR) species if one of its alignments falls fully within the (previously extended) locus annotated for that miRNA. Specifically, reads contribute 1/N to each miRNA for which that is the case, where N is the total number of genomic loci the read aligns to. Under this criterion, the counts of a precursor include reads that intersect with its mature arm(s), its hairpin sequence and/or the whole precursor itself.
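The 1/N weighting can be made concrete with a small sketch. It is a simplified illustration rather than the workflow's code: each alignment is assumed to fall within at most one annotated locus, the input structure and function name are hypothetical, and exact fractions are used only to keep the arithmetic readable.

```python
from collections import defaultdict
from fractions import Fraction


def count_reads(read_alignments):
    """Tally fractional counts per miRNA.

    read_alignments maps a read ID to one entry per genomic locus the read
    aligns to. Each entry is the name of the miRNA whose (extended) locus
    fully contains that alignment, or None if there is no such miRNA.
    """
    counts = defaultdict(Fraction)
    for hits in read_alignments.values():
        n_loci = len(hits)  # N: all loci, annotated or not
        for mirna in hits:
            if mirna is not None:
                counts[mirna] += Fraction(1, n_loci)
    return dict(counts)


reads = {
    "r1": ["miR-a"],                    # unique hit: contributes 1
    "r2": ["miR-a", "miR-b"],           # two loci: 1/2 each
    "r3": ["miR-b", None, None, None],  # four loci, one annotated: 1/4
}
for name, value in sorted(count_reads(reads).items()):
    print(name, float(value))
```

Note that r3 contributes only 1/4 to miR-b even though a single one of its alignments is annotated, because N counts every genomic locus the read aligns to.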

isomiRs notation

A sequence is considered to be an isomiR if it has a shift at either end, an InDel or a mismatch in its sequence when compared to the canonical miRNA it maps to and intersects with.

MIRFLOWZ employs an unambiguous notation to classify isomiRs using the format miRNA_name|5p-shift|3p-shift|CIGAR|MD, where 5p-shift and 3p-shift represent the differences between the annotated mature miRNA start and end positions and those of the read alignment, respectively.
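To show how such a label can be taken apart, here is a minimal parsing sketch. It is not part of MIRFLOWZ: the class and function names are hypothetical, the example labels are made up, and it assumes that the shifts are written as plain integers (the sign convention is not specified in this section).

```python
from typing import NamedTuple


class IsomiR(NamedTuple):
    mirna: str
    shift_5p: int
    shift_3p: int
    cigar: str
    md: str


def parse_isomir(label):
    """Split a 'miRNA_name|5p-shift|3p-shift|CIGAR|MD' label into its fields."""
    mirna, shift_5p, shift_3p, cigar, md = label.split("|")
    return IsomiR(mirna, int(shift_5p), int(shift_3p), cigar, md)


def is_canonical(isomir):
    """True if the read matches the annotated mature miRNA exactly."""
    no_shift = isomir.shift_5p == 0 and isomir.shift_3p == 0
    no_indel = not any(op in isomir.cigar for op in "ID")
    no_mismatch = isomir.md.isdigit()  # an MD tag of digits only has no mismatches
    return no_shift and no_indel and no_mismatch


# Made-up labels: a canonical read and one with a 3' shift and a mismatch.
for label in ("xxx-miR-1-1-5p|0|0|22M|22", "xxx-miR-1-1-5p|0|1|23M|10A12"):
    parsed = parse_isomir(label)
    print(parsed, is_canonical(parsed))
```

The is_canonical check mirrors the definition above: a read is an isomiR as soon as it has a shift at either end, an InDel or a mismatch.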

ASCII-style pileups module

Finally, to visualize the distribution of read alignments around miRNA loci, ASCII-style alignment pileups are optionally generated for user-defined regions of interest.

The schema below is a visual representation of the individual workflow steps and how they are related:

[Figure: rule graph of the workflow]

NOTE: For a detailed description of each rule, along with some examples, please refer to the workflow documentation.

Contributing

MIRFLOWZ is an open-source project which relies on community contributions. You are welcome to participate by submitting bug reports or feature requests, taking part in discussions, or proposing fixes and other code changes. Please refer to the contributing guidelines if you are interested in contributing.

License

This project is covered by the MIT License.

Contact

For questions or suggestions regarding the code, please use the issue tracker. Do not hesitate to contact us via email for any other inquiries.

© 2023 Zavolab, Biozentrum, University of Basel