diff --git a/README.md b/README.md index 2a8588d..68a49f6 100644 --- a/README.md +++ b/README.md @@ -15,6 +15,10 @@ _MIRFLOWZ_ is a [Snakemake][snakemake] workflow for mapping miRNAs and isomiRs. - [Expected output files](#expected-output-files) - [Creating a Snakemake report](#creating-a-snakemake-report) 3. [Workflow description](#workflow-description) + - [Prepare module](#prepare-module) + - [Map module](#map-module) + - [Quantify module](#quantify-module) + - [ASCII-style pileups module](#ascii-style-pileups-module) 4. [Contributing](#contributing) 5. [License](#license) 6. [Contact](#contact) @@ -40,8 +44,8 @@ For improved reproducibility and reusability of the workflow, as well as an easy means to run it on a high performance computing (HPC) cluster managed, e.g., by [Slurm][slurm], all steps of the workflow run inside isolated environments ([Singularity][singularity] containers or [Conda][conda] -environments). As a consequence, running this workflow has only a few individual -dependencies. These are managed by the package manager Conda, which +environments). As a consequence, running this workflow has only a few +individual dependencies. These are managed by the package manager Conda, which needs to be installed on your system before proceeding. If you do not already have Conda installed globally on your system, @@ -159,7 +163,7 @@ tested, you can go ahead and run the workflow on your samples. It is suggested to have all the input files for a given run (or hard links pointing to them) inside a dedicated directory, for instance under the _MIRFLOWZ_ root directory. This way, it is easier to keep the data together, -reproduce an analysis and set up Singularity access to them. +set up Singularity access to them and reproduce analyses. #### 1. Prepare a sample table @@ -210,8 +214,8 @@ There are 4 files you must provide: expected format. 5. **OPTIONAL**: A **BED6** file with regions for which to produce - [ASCII-style alignment pileups][ascii-pileups]. If not provided, no pileups - will be generated. See [here][bed-format] for the expected format. + [ASCII-style pileups][ascii-pileups]. If not provided, no pileups are + generated. See [here][bed-format] for the expected format. > General note: If you want to process the genome resources before use (e.g., > filtering), you can do that, but make sure the formats of any modified @@ -265,9 +269,7 @@ intermediate files generated during the process. The final outputs comprise: 1. A SAM file containing alignments intersecting a pri-miR locus. These alignments intersect with extended start and/or end positions specified in the provided pri-miR annotations. Please note that they may not contribute to the -final counting and may not appear in the final table. Alignments are discarded -if their start and/or end positions differ from the ends of the provided -pri-miR annotations by more bases than the extension used. +final counting and may not appear in the final table. 2. A SAM file containing alignments intersecting a mature miRNA locus. Similar to the previous file, these alignments intersect with extended start and/or end @@ -325,33 +327,126 @@ snakemake \ ## Workflow description -The _MIRFLOWZ_ workflow first processes and indexes the user-provided genome -resources. Afterwards, the user-provided short read small-RNA-seq libraries will -be aligned separately against the genome and transcriptome. For increased -fidelity, two separated aligners, [Segemehl][segemehl] and our in-house tool -[Oligomap][oligomap], are used. All the resulting alignments are merged such -that only the best alignments of each read are kept (smallest edit distance). -Alignments are intersected with the user-provided, pre-processed miRNA -annotation file using [BEDTools][bedtools]. Counts are tabulated separately for -reads consistent with either miRNA precursors, mature miRNA and/or isomiRs. -Finally, ASCII-style alignment pileups are optionally generated for -user-defined regions of interest. - -> **NOTE:** For a detailed description of each rule, please, refer to the -> [workflow documentation](pipeline_documentation.md) +_MIRFLOWZ_ consists of a main `Snakefile` and four functional modules. In the +`Snakefile`, the configuration file is validated, and the various modules are +imported. In addition, a handler for both, a successful and a failed run are +set. If the workflow finishes without any errors, all the intermediate files +are removed, otherwise, a log file is created. To keep the intermediate files +upon completion, use the `--no-hooks` CLI argument when running the pipeline. + +The modules [(1)](#prepare-module) process the genome resources, +[(2)](#map-module) map and [(3)](#quantify-module) quantify the reads, and +[(4)](#ascii-style-pileups-module) generate pileups, as described in detail +below. + +> **NOTE:** _MIRFLOWZ_ uses the notation provided by miRBase (_i.e._ +> "miRNA primary transcript" for precursors and "miRNA" for the canonical +> mature miRNA). This implies that precursors are named "pri-miRs" across the +> workflow instead of pre-miR. This decision is made upon the lack of +> guarantee that "miRNA primary transcripts" are full pre-miR (and pre-miR +> only) sequences. + +### Prepare module + +The _MIRFLOWZ_ workflow initially processes and indexes the genome resources +provided by the user. The regions corresponding to mature miRNAs are extended +by a fixed but user-adjustable number of nucleotides on both sides to +accommodate isomiR species with shifted start and/or end positions. If +necessary, pri-miR loci are extended to adjust to the new miRNA coordinates. +In addition, to account for the different genomic locations a miRNA sequence +can be annotated, the name of these sequences are modified to have the format +`SPECIES-mir-NAME-#` for pri-miRs and `SPECIES-miR-NAME-#-ARM` or +`SPECIES-miR-NAME-#` for mature miRNAs with both or just one arm respectively, +where `#` is the replica number. + +### Map module + +The user-provided short-read small RNA-seq libraries undergo quality filtering +(skipped if libraries are provided in FASTA rather than FASTQ), followed by +adapter removal. The resulting reads are independently mapped to both the +genome and the transcriptome using two distinct aligners: Segemehl and our +in-house tool Oligomap. + +Segemehl implements a fast heuristic strategy that returns the alignment(s) +with the smallest edit distance. Oligomap, on the other hand, implements a +slower and more restricted approach that reports all the alignments with an +edit distance of at most 1. The combination of the fast and flexible results +and the strict selection ensures results with a higher fidelity than if only +one of the tools was to be used. + +Two merging steps are done in order to have all the alignments in a single +file. In the first one, the transcriptome and the genome mappings from both +aligners are fused and only those alignments with a smaller NH than the one +provided are kept. For the second step, transcriptomic coordinates are turned +into genomic ones and alignments are combined into a single file. Duplicate +alignments resulting from the partially redundant mapping strategy are +discarded and only the best alignments for each read are retained (_i.e._ the +ones with the smallest edit distance). In addition, and due to the alignment's +aggregation, a second filtering according to the new NH is performed. +If a read has been aligned beyond a specified threshold, it is removed due to +(1) performance reasons as the file size can rapidly increase, and (2) the fact +that each read contributes to each count `1/N` where `N` is the number of +genomic loci it aligns to and a large `N` makes the contribution negligible. + +A final filter is made to further increase the classification accuracy and +reduce the amount of multimappers. Given that isomiRs are known to present more +mismatches than InDels when compared to the canonical sequence they come from, +when addressing the multiple genomic locations a read has been mapped to, the +alignments with fewer InDels are kept. Note that some multimappers might still +be present if the number of InDels and mismatches is the same across +alignments. + +### Quantify module + +The filtered alignments are subsequently intersected with the user-provided, +pre-processed miRNA annotation files using BEDTools. Each alignment is +classified according to the miRNA species it fully intersects with in order +to do the counts. + +Counts are tabulated separately for reads consistent with either miRNA +precursors, mature miRNA and/or isomiRs, and all library counts are fused +into a single table. Note that an alignment is only counted towards a given +miRNA (or isomiR) species if one of its alignments fully falls within the +(previously extended) locus annotated for that miRNA. Specifically, reads +contribute with `1/N` for each miRNA for which that is the case, where `N` is +the total number of genomic loci the read aligns to. Under this criterion, the +precursor counts contain reads that intersect with its mature arm(s), its +hairpin sequence and/or the whole precursor itself. + +#### isomiRs notation + +A sequence is considered to be an isomiR if it has a shift on either end, an +InDel or a mismatch on its sequence when compared to the canonical miRNA it +maps and intersects with. + +_MIRFLOWZ_ employs an unambiguous notation to classify isomiRs using the +format `miRNA_name|5p-shift|3p-shift|CIGAR|MD`, where `5p-shift` and +`3p-shift` represent the differences between the annotated mature miRNA +start and end positions and those of the read alignment, respectively. + +### ASCII-style pileups module + +Finally, to visualize the distribution of read alignments around miRNA +loci, ASCII-style alignment pileups are optionally generated for user-defined +regions of interest. + The schema below is a visual representation of the individual workflow steps and how they are related: > ![rule-graph][rule-graph] +> **NOTE:** For an elaborated description of each rule along with some +> examples, please, refer to the +> [workflow documentation](pipeline_documentation.md). + ## Contributing _MIRFLOWZ_ is an open-source project which relies on community contributions. You are welcome to participate by submitting bug reports or feature requests, taking part in discussions, or proposing fixes and other code changes. Please -refer to the [contributing guidelines](CONTRIBUTING.md) if you are interested in -contribute. +refer to the [contributing guidelines](CONTRIBUTING.md) if you are interested +in contributing. ## License @@ -359,13 +454,14 @@ This project is covered by the [MIT License](LICENSE). ## Contact -For questions or suggestions regarding the code, please use the [issue tracker][issue-tracker]. Do not hesitate on contacting us via [email][email] for any other inquiries. +For questions or suggestions regarding the code, please use the +[issue tracker][issue-tracker]. Do not hesitate to contact us via +[email][email] for any other inquiries. © 2023 [Zavolab, Biozentrum, University of Basel][zavolab] [ascii-pileups]: [bed-format]: -[bedtools]: [chrMap]: [conda]: [cluster execution]: @@ -375,9 +471,7 @@ For questions or suggestions regarding the code, please use the [issue tracker][ [mamba]: [miniconda-installation]: [mirbase]: -[oligomap]: [rule-graph]: images/rule_graph.svg -[segemehl]: [singularity]: [slurm]: [snakemake]: diff --git a/pipeline_documentation.md b/pipeline_documentation.md index 2e2ba72..7e1c6c2 100644 --- a/pipeline_documentation.md +++ b/pipeline_documentation.md @@ -87,7 +87,6 @@ on installation and usage please see [here](README.md). - [`create_per_condition_ascii_pileups`](#create_per_condition_ascii_pileups) - ## Third-party software used > Tag lines were taken from the developers' websites (code repository or manual) @@ -95,25 +94,26 @@ on installation and usage please see [here](README.md). | Name | License | Tag line | More info | | --- | --- | --- | --- | | **ASCII-style alignment pileups** | [Apache 2.0][license-apache2] | _"Generates ASCII-style pileups of read alignments in one or more BAM files for one or more genomic regions."_ | [code][code-ascii] | -| **BEDTools** | [GPLv2][license-gpl2] | _"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"_ | [code][code-bedtools] / [manual][manual-bedtools] / [publication][pub-bedtools] | +| **BEDTools** | [GPLv2][license-gpl2] | _"[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF"_ | [code][code-bedtools] / [manual][docs-bedtools] / [publication][pub-bedtools] | | **cufflinks** | [BSL-1.0][license-bsl1] | _"[...] assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples"_ | [code][code-cufflinks] / [manual][docs-cufflinks] / [publication][pub-cufflinks] | | **cutadapt** | [MIT][license-mit] | _"[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads"_ | [code][code-cutadapt] / [manual][docs-cutadapt] / [publication][pub-cutadapt] | | **FASTX-Toolkit** | [AGPL-3.0][license-agpl3] | _"[...] collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing"_ | [code][code-fastx] / [manual][docs-fastx] | -| **GFFUtils** | [AFL-3][license-afl3] | _"[...] a small set of utility programs for working with GFF and GTF files"_ | [code][code-gffutils] / [manual][docs-gffutils] | +| **GFFUtils** | [AFL-3][license-afl3] | _"[...] a small set of utility programs for working with GFF and GTF files"_ | [code][code-gffutils] / [manual][docs-gffutils] | | **Oligomap** | [GPLv3][license-gpl3] | _"[...] fast identification of nearly-perfect matches of small RNAs in sequence databases. It allows to exhaustively identify all the perfect and 1-error (where an error is defined to be a mismatch, insertion or deletion) matches of large sets of small RNAs to target sequences"_ | [code][code-oligomap] / [publication][pub-oligomap] | | **SAMtools** | [MIT][license-mit] | _"[...] suite of programs for interacting with high-throughput sequencing data"_ | [code][code-samtools] / [manual][docs-samtools] / [publication][pub-samtools] | -| **segemehl** | [GPLv3][license-gpl3] | _"[...] map short sequencer reads to reference genomes"_ | [manual][docs-segemehl] / [publication][pub-segemehl] | +| **segemehl** | [GPLv3][license-gpl3] | _"[...] map short sequencer reads to reference genomes"_ | [manual][docs-segemehl] / [publication][pub-segemehl] | ## Description of workflow steps -> The workflow consists of four Snakemake files: A main `Snakefile` and an -individual Snakemake file for each step in the workflow (the file preparation, -the reads mapping, the miRNA quantification and the ASCII-style alignment -pileups generation). The main `Snakefile` contains the configuration file -validation along with the inclusion of the sub-workflows. Individual steps of -the workflow are described briefly, and links to the respective software -manuals are given. Parameters that can be modified by the user (via the samples -table and the configuration file) are also described. +> The workflow consists of five Snakemake files: A main `Snakefile` and an +> individual Snakemake file each for the genome resources preparation, the +> reads mapping, the miRNA quantification and the ASCII-style pileups +> generation. The main `Snakefile` contains the configuration file validation +> and imports the various functional modules described below. Individual steps +> of the workflow are described briefly along with some examples, and links to +> the respective software manuals are given. Parameters that can be modified by +> the user (via the samples table and the configuration file) are also +> described. ### Rule graph @@ -128,17 +128,17 @@ Visual representation of the workflow. Automatically prepared with ##### Requirements -- tab-separated values (`.tsv`) file +- Tab-separated values (`.tsv`) file - First row has to contain parameter names as in [`samples_table.tsv`](test/test_files/samples_table.tsv) - First column used as sample identifiers Parameter name | Description | Data type(s) --- | --- | --- -sample | Descriptive sample name. | `str` -sample_file | Path of the library file in either `.fa.gz` or `.fastq.gz` format. | `str` -adapter | Required for [Cutadapt](#third-party-software-used). Use a value such as `XXXXXXXXXX` if no adapter is present or if no trimming is desired | `str` -format | There are two allowed values, `fa` and `fastq` according to the library format | `str` +sample | Arbitrary name for the miRNA sequence library. | `str` +sample_file | Path to the `gzip`ped miRNA sequencing library file. The path must be relative to the directory where the workflow will be run. | `str` +adapter | Sequence of the 3'-end adapter used during library preparation. Required for [Cutadapt](#third-party-software-used). Use a value such as `XXXXXXXXXX` if no adapter is present or if no trimming is desired. | `str` +format | One of `fa`/`fasta` or `fq`/`fastq`, if the library file is in FASTA or FASTQ format, respectively. | `str` #### Configuration file @@ -147,6 +147,7 @@ Some parameters within the workflow can be modified. Refer to the [configuration template](#config/config_template.yaml) for a detailed explanation of each option. + ### Snakefile #### `finish` @@ -156,19 +157,17 @@ Target rule as required by [Snakemake][docs-snakemake]. > Local rule - **Input** - - pri-miR intersections file (`.bed`); from - [**intersect_extended_primir**](#intersect_extended_primir) - - miRNA intersections file (`.bed`); from - [**intersect_extended_mirna**](#intersect_extended_mirna) - - Alignments file, sorted and tagged (`.sam`); from - [**sort_intersecting_mirna_by_feat_tag**](#sort_intersecting_mirna_by_feat_tag) - - (iso)miR and/or pri-miR counts table (`.tab`); from + - (**Workflow output**) SAM file with the pri-miR intersecting alignments; + from [**filter_sam_by_intersecting_primir**](#filter_sam_by_intersecting_primir) + - (**Workflow output**) SAM file with the mature miRNA intersecting alignments; from + [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) + - (**Workflow output**) (iso)miR and/or pri-miR counts table (`.tab`); from [**merge_tables**](#merge_tables) - - Alignments file (`.bam`); from - [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) - - `BAM` index file (`.bam.bai`); from + - (**Workflow output**) BAM file with the contributing alignments, sorted; + from [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) + - (**Workflow output**) BAM index file (`.bam.bai`); from [**index_uncollapsed_reads_bam**](#index_uncollapsed_reads_bam) - - Empty text file (`.txt`); from + - (**Workflow output**) Empty text file (`.txt`) [**create_per_library_ascii_pileups**](#create_per_library_ascii_pileups), [**create_per_run_ascii_pileups**](#create_per_run_ascii_pileups) and/or [**create_per_condition_ascii_pileups**](#create_per_condition_ascii_pileups) @@ -189,13 +188,13 @@ Target rule as required by [Snakemake][docs-snakemake]. [**generate_segemehl_index_transcriptome**](#generate_segemehl_index_transcriptome) - Exon annotations (`.bed`); from [**convert_exons_gtf_to_bed**](#convert_exons_gtf_to_bed) - - Genome header (`.sam`); from + - SAM header (`.sam`); from [**create_genome_header**](#create_genome_header) - Tab-separated table mapping chromosome name(s) and length(s) (`.tsv`); from [**extract_chr_len**](#extract_chr_len) - - Primary miRNA transcript (pri-miR) extended annotation (`.gff3`); + - Primary miRNA transcript (pri-miR) extended annotations (`.gff3`); from [**extend_mirs_annotations**](#extend_mirs_annotations) - - Mature miRNA (miRNA) extended annotation (`.gff3`); + - Mature miRNA (miRNA) extended annotations (`.gff3`); from [**extend_mirs_annotations**](#extend_mirs_annotations) @@ -204,9 +203,9 @@ Target rule as required by [Snakemake][docs-snakemake]. Trim genome sequence IDs with a [**custom script**][custom-script-trim-id]. - **Input** - - Genome sequence file (`.fasta`) + - (**Workflow input**) Genome sequence, `gzip`ed (`.fa.gz`/`.fasta.gz`) - **Output** - - Genome sequence file, trimmed IDs (`.fasta`); used in + - Genome sequence, trimmed IDs (`.fa`); used in [**extract_transcriptome_seqs**](#extract_transcriptome_seqs), [**create_genome_header**](#create_genome_header), [**create_index_genome_fasta**](#create_index_genome), @@ -218,25 +217,28 @@ Trim genome sequence IDs with a [**custom script**][custom-script-trim-id]. #### `extract_transcriptome_seqs` -Create transcriptome from genome and genome annotations with +Create transcriptome from genomic sequence and annotations with [**cufflinks**](#third-party-software-used). - **Input** - - Genome sequence file (`.fasta`) - - Genome annotation file (.`gtf`) + - (**Workflow input**) Genome annotations, `gzip`ed (`.gtf.gz`) + - Genome sequence, trimmed IDs (`.fa`); from + [**trim_genome_seq_ids**](#trim_genome_seq_ids) - **Output** - - Transcriptome sequence file (`.fasta`); used in + - Transcriptome sequence (`.fa`); used in [**trim_transcriptome_seq_ids**](#trim_transcriptome_seq_ids) #### `trim_transcriptome_seq_ids` -Trim transcriptome sequence IDs with a [**custom script**][custom-script-trim-id]. +Trim transcriptome sequence IDs with a +[**custom script**][custom-script-trim-id]. - **Input** - - Transcriptome sequence file (`.fasta`) + - Transcriptome sequence (`.fa`); from + [**extract_transcriptome_seqs**](#extract_transcriptome_seqs) - **Output** - - Transcriptome sequence, trimmed IDs (`.fasta`); used in + - Transcriptome sequence, trimmed IDs (`.fa`); used in [**generate_segemehl_index_transcriptome**](#generate_segemehl_index_transcriptome), [**mapping_transcriptome_segemehl**](#mapping_transcriptome_segemehl) and [**mapping_transcriptome_oligomap**](#mapping_transcriptome_oligomap) @@ -248,14 +250,14 @@ Generate transcriptome index for [**segemehl**](#third-party-software-used) short read aligner. > The transcriptome index only needs to be generated once for each combination -of genome and annotations and sample sets. +> of transcriptome sequence and annotations, and sample sets. - **Input** - - Transcriptome sequence file, trimmed IDs (`.fasta`); from + - Transcriptome sequence, trimmed IDs (`.fa`); from [**trim_transcriptome_seq_ids**](#trim_transcriptome_seq_ids) - **Output** - segemehl transcriptome index (`.idx`); used in - [**mapping_genome_segemehl**](#mapping_genome_segemehl) + [**mapping_transcriptome_segemehl**](#mapping_transcriptome_segemehl) #### `generate_segemehl_index_genome` @@ -264,10 +266,10 @@ Generate genome index for [**segemehl**](#third-party-software-used) short read aligner. > The genome index only needs to be generated once for each combination -of annotations and sample sets. +> of annotations and sample sets. - **Input** - - Genome sequence file with trim IDs (`.fasta`); from + - Genome sequence, trimmed IDs (`.fa`); from [**trim_genome_seq_ids**](#trim_genome_seq_ids) - **Output** - segemehl genome index (`.idx`); used in @@ -280,11 +282,12 @@ Retrieve exon annotations from genome annotations with a [**custom script**][custom-script-get-lines]. - **Input** - - Genomic annotations (`.gtf`) + - (**Workflow input**) Genome annotations, `gzip`ed (`.gtf.gz`) - **Output** - Exon annotations (`.gtf`); used in [**convert_exons_gtf_to_bed**](#convert_exons_gtf_to_bed) + #### `convert_exons_gtf_to_bed` Convert exon annotations `.gtf` to `.bed` with a @@ -299,33 +302,33 @@ Convert exon annotations `.gtf` to `.bed` with a #### `create_genome_header` -Create `SAM` header for the genome with +Create SAM header for the genome with [**SAMtools**](#third-party-software-used). -> Required by [SAMtools](#third-party-software-used) to work with the alignment -file. +> Required by [SAMtools](#third-party-software-used) to work with the +> alignment files. - **Input** - - Genome sequence file, trimmed IDs (`.fasta`); from + - Genome sequence, trimmed IDs (`.fa`); from [**trim_genome_seq_ids**](#trim_genome_seq_ids) - **Output** - - Genome header (`.sam`); used in [add_header_all_maps](#add_header_all_maps) + - SAM genome header (`.sam`); used in + [add_header_all_maps](#add_header_all_maps) #### `map_chr_names` -Map UCSC-like chromosome names with Ensembl-like ones in miRNA annotation +Map UCSC-like chromosome names with Ensembl-like ones in miRNA annotations with a [**custom script**][custom-script-map-chr]. > Required by [BEDTools](#third-party-software) to intersect alignments with -miRNA annotations. Several mapping tables are available [here][chr-maps]. +> miRNA annotations. Several mapping tables are available [here][chr-maps]. - **Input** - - miRNA annotations (`.gff3`) - - Tab-separated mappings table (`.tsv`) + - (**Workflow input**) miRNA annotations (`.gff3`) + - (**Workflow input**) Tab-separated chromosome name mappings table (`.tsv`) - **Output** - - miRNA annotations with mapped genes(`.gff3`); used in - [**extend_mirs_annotations**](#extend_mirs_annotations), + - miRNA annotations, mapped chromosome name(s) (`.gff3`); used in [**create_per_library_ascii_pileups**](#create_per_library_ascii_pileups), [**create_per_run_ascii_pileups**](#create_per_run_ascii_pileups) and/or [**create_per_condition_ascii_pileups**](#create_per_condition_ascii_pileups) @@ -333,14 +336,14 @@ miRNA annotations. Several mapping tables are available [here][chr-maps]. #### `create_index_genome_fasta` -Create a `FASTA` index for the genome with +Create a FASTA index for the genome with [**SAMtools**](#third-party-software-used). - **Input** - - Genome sequence file, trimmed IDs (`.fasta`); from + - Genome sequence, trimmed IDs (`.fa`); from [**trim_genome_seq_ids**](#trim_genome_seq_ids) - **Output** - - `FASTA` genome index (`.fa.fai`); used in + - FASTA genome index (`.fa.fai`); used in [**extract_chr_len**](#extract_chr_len) @@ -348,8 +351,12 @@ Create a `FASTA` index for the genome with Extract chromosome(s) length from the genome sequence. +> Required to ensure that the extended annotations in generated in the +> [**extend_mirs_annotations**](#extend_mirs_annotations) rule do not exceed +> the chromosome(s) boundaries. + - **Input** - - `FASTA` genome index (`.fa.fai`); from + - FASTA genome index (`.fa.fai`); from [**create_index_genome_fasta**](#create_index_genome_fasta) - **Output** - Tab-separated table mapping chromosome name(s) and length(s) (`.tsv`); used @@ -358,25 +365,148 @@ Extract chromosome(s) length from the genome sequence. #### `extend_mirs_annotations` -Extend miRNA annotations and split the file by feature with a -[**custom script**][custom-script-mir-ext]. - -> Mature miRNA regions are extended on both sides to account for isomiR species -with shifted start and/or end positions. If required, pri-miR loci are also -extended to accommodate the new miRNA coordinates. - -- **Input** - - miRNA annotations with mapped chromosomes(`.gff3`); from +Extend miRNA annotations, ensure feature names uniqueness and split the file +by feature with a [**custom script**][custom-script-mir-ext]. + +> Adjust miRNAs' 'Name' attribute to account for the different genomic +> locations the miRNA sequence is annotated on and ensure their uniqueness. +> The name format is `SPECIES-mir-NAME-#` for pri-miRs, and +> `SPECIES-miR-NAME-#-ARM` or `SPECIES-miR-NAME-#` for mature miRNA with both +> or just one arm respectively, where `#` is the replica integer. If a pri-miR +> has a replica but its number is set in the 'ID' attribute, the first instance +> does not has a suffix but the other one(s) do. If a precursor has no other +> occurrences, no further modifications are made. On the other hand, +> mature miRNA regions are extended on both sides to account for isomiR species +> with shifted start and/or end positions without exceeding chromosome(s) +> boundaries. If required, pri-miR loci are also extended to accommodate the +> new miRNA coordinates. In addition, pri-miR names are modified to record the +> final positions by appending `_-y` and `_+x` to them, where `y` is the 5' +> shift and `x` the 3' shift. + +- **Input** + - miRNA annotations, mapped chromosome name(s) (`.gff3`); from [**map_chr_names**](#map_chr_names) - **Parameters** - **config_template.yaml** - `extension`: Number of nucleotides by which mature miRNA annotated - regions are extended (default 6) + regions are extended at most (default: 6) - **Output** - - Primary miRNA transcript (pri-miR) extended annotation (`.gff3`); used in + - Primary miRNA transcript (pri-miR) extended annotations (`.gff3`); used in [**intersect_extended_primir**](#intersect_extended_primir) - - Mature miRNA (miRNA) extended annotation (`.gff3`); used in + - Mature miRNA (miRNA) extended annotations (`.gff3`); used in [**intersect_extended_mirna**](#intersect_extended_mirna) +- **Examples** + +```console +Example 1 | Extension | Mature miRNA extension + +IN: + pri-miR entry: + 19 . miRNA_primary_transcript 2517 2614 . + . ID=MI0003141;Alias=MI0003141;Name=hsa-mir-512-2 + mature miRNA entry: + 19 . miRNA 2536 2558 . + . ID=MIMAT0002822_1;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003141 + extension: + 6 +OUT: + pri-miR entry: + 19 . miRNA_primary_transcript 2517 2614 . + . ID=MI0003141;Alias=MI0003141;Name=hsa-mir-512-2_-0_+0 + mature miRNA entry: + 19 . miRNA 2530 2564 . + . ID=MIMAT0002822_1;Alias=MIMAT0002822;Name=hsa-miR-512-2-5p;Derives_from=MI0003141 + + +Example 2 | Extension | Mature miRNA and pri-miR extension + +IN: + pri-miR entry: + 19 . miRNA_primary_transcript 9 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1 + mature miRNA entry: + 19 . miRNA 12 74 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140 + extension: + 6 +OUT: + pri-miR entry: + 19 . miRNA_primary_transcript 6 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1_-3_+0 + mature miRNA entry: + 19 . miRNA 6 80 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-1-5p;Derives_from=MI0003140 + + +Example 3 | Extension | Matrue miRNA exceeding chromosome boundaries extension + +IN: + pri-miR entry: + 19 . miRNA_primary_transcript 2 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1 + mature miRNA entry: + 19 . miRNA 3 74 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140 + extension: + 6 +OUT: + pri-miR entry: + 19 . miRNA_primary_transcript 1 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1_-1_+0 + mature miRNA entry: + 19 . miRNA 1 80 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-1-5p;Derives_from=MI0003140 + + +Example 4 | Name uniqueness | Replica number in the ID + +IN: + pri-miR entries: + chr21 . miRNA_primary_transcript 8206563 8206618 . + . ID=MI0033425;Alias=MI0033425;Name=hsa-mir-10401 + chr21 . miRNA_primary_transcript 8250772 8250827 . + . ID=MI0033425_2;Alias=MI0033425;Name=hsa-mir-10401 + mature miRNA entries: + chr21 . miRNA 8206563 8206582 . + . ID=MIMAT0041633;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425 + chr21 . miRNA 8206598 8206618 . + . ID=MIMAT0041634;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425 + chr21 . miRNA 8250772 8250791 . + . ID=MIMAT0041633_1;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425 + chr21 . miRNA 8250807 8250827 . + . ID=MIMAT0041634_1;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425 +OUT: + pri-miR entries: + chr21 . miRNA_primary_transcript 8206563 8206618 . + . ID=MI0033425;Alias=MI0033425;Name=hsa-mir-10401 + chr21 . miRNA_primary_transcript 8250772 8250827 . + . ID=MI0033425_2;Alias=MI0033425;Name=hsa-mir-10401-2 + mature miRNA entries: + chr21 . miRNA 8206563 8206582 . + . ID=MIMAT0041633;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425 + chr21 . miRNA 8206598 8206618 . + . ID=MIMAT0041634;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425 + chr21 . miRNA 8250772 8250791 . + . ID=MIMAT0041633_1;Alias=MIMAT0041633;Name=hsa-miR-10401-2-5p;Derives_from=MI0033425 + chr21 . miRNA 8250807 8250827 . + . ID=MIMAT0041634_1;Alias=MIMAT0041634;Name=hsa-miR-10401-2-3p;Derives_from=MI0033425 + + +Example 5 | Name uniqueness | Replica number in the Name; single mature arm + +IN: + pri-miR entries: + chr21 . miRNA_primary_transcript 8205315 8205406 . + . ID=MI0022559;Alias=MI0022559;Name=hsa-mir-6724-1 + chr21 . miRNA_primary_transcript 8249505 8249596 . + . ID=MI0031516;Alias=MI0031516;Name=hsa-mir-6724-2 + mature miRNA entries: + chr21 . miRNA 8205325 8205347 . + . ID=MIMAT0025856;Alias=MIMAT0025856;Name=hsa-miR-6724-5p;Derives_from=MI0022559 + chr21 . miRNA 8249515 8249537 . + . ID=MIMAT0025856_1;Alias=MIMAT0025856;Name=hsa-miR-6724-5p;Derives_from=MI0031516 +OUT: + pri-miR entries: + chr21 . miRNA_primary_transcript 8205315 8205406 . + . ID=MI0022559;Alias=MI0022559;Name=hsa-mir-6724-1 + chr21 . miRNA_primary_transcript 8249505 8249596 . + . ID=MI0031516;Alias=MI0031516;Name=hsa-mir-6724-2 + mature miRNA entries: + chr21 . miRNA 8205325 8205347 . + . ID=MIMAT0025856;Alias=MIMAT0025856;Name=hsa-miR-6724-1-5p;Derives_from=MI0022559 + chr21 . miRNA 8249515 8249537 . + . ID=MIMAT0025856_1;Alias=MIMAT0025856;Name=hsa-miR-6724-2-5p;Derives_from=MI0031516 + + +Example 6 | Name uniqueness | Both mature miRNA arms but just one with the replica number + +IN: + pri-miR entries: + chr2 . miRNA_primary_transcript 135665397 135665478 . + . ID=MI0000447;Alias=MI0000447;Name=hsa-mir-128-1 + chr3 . miRNA_primary_transcript 35744476 35744559 . + . ID=MI0000727;Alias=MI0000727;Name=hsa-mir-128-2 + mature miRNA entries: + chr2 . miRNA 135665446 135665466 . + . ID=MIMAT0000424;Alias=MIMAT0000424;Name=hsa-miR-128-3p;Derives_from=MI0000447 + chr2 . miRNA 135665411 135665433 . + . ID=MIMAT0026477;Alias=MIMAT0026477;Name=hsa-miR-128-1-5p;Derives_from=MI0000447 + chr3 . miRNA 35744527 35744547 . + . ID=MIMAT0000424_1;Alias=MIMAT0000424;Name=hsa-miR-128-3p;Derives_from=MI0000727 + chr3 . miRNA 35744490 35744512 . + . ID=MIMAT0031095;Alias=MIMAT0031095;Name=hsa-miR-128-2-5p;Derives_from=MI0000727 +OUT: + pri-miR entries: + chr2 . miRNA_primary_transcript 135665397 135665478 . + . ID=MI0000447;Alias=MI0000447;Name=hsa-mir-128-1 + chr3 . miRNA_primary_transcript 35744476 35744559 . + . ID=MI0000727;Alias=MI0000727;Name=hsa-mir-128-2 + mature miRNA entries: + chr2 . miRNA 135665446 135665466 . + . ID=MIMAT0000424;Alias=MIMAT0000424;Name=hsa-miR-128-1-3p;Derives_from=MI0000447 + chr2 . miRNA 135665411 135665433 . + . ID=MIMAT0026477;Alias=MIMAT0026477;Name=hsa-miR-128-1-5p;Derives_from=MI0000447 + chr3 . miRNA 35744527 35744547 . + . ID=MIMAT0000424_1;Alias=MIMAT0000424;Name=hsa-miR-128-2-3p;Derives_from=MI0000727 + chr3 . miRNA 35744490 35744512 . + . ID=MIMAT0031095;Alias=MIMAT0031095;Name=hsa-miR-128-2-5p;Derives_from=MI0000727 +``` ### Map workflow @@ -387,89 +517,94 @@ Target rule as required by [Snakemake][docs-snakemake]. > Local rule - **Input** - - `BAM` index file (`.bam.bai`); from + - BAM index file (`.bam.bai`); from [**index_all_alns_bam**](#index_all_alns_bam) - #### `start` Copy and rename read files. > Local rule. -Depending on the read files format, the output files undergo a quality filter -(`.fastq`) or are directly formatted (`.fa`). +> Depending on the library file format, the output file undergoes a quality +> filter (`fa`/`.fastq`) or is directly formatted (`.fa`/`.fasta`). - **Input** - - Reads file (`.fa.gz`, `.fastq.gz`) + - (**Workflow input**) miRNA sequencing library, `gzip`ed + (`.fa.gz`/`.fasta.gz` or `.fq.gz`/`.fastq.gz`) - **Output** - - Reads file, copied, renamed (`.fa`, `.fastq`); used in - [**fastq_quality_filter**](#fastq_quality_filter) or + - miRNA sequencing library, copied, renamed (`.fa`, `.fastq`); used in + [**fastq_quality_filter**](#fastq_quality_filter) and/or [**format_fasta**](#format_fasta) #### `fastq_quality_filter` -Conduct quality control for reads library with +Conduct quality control for reads library with [**fastx_toolkit**](#third-party-software-used). - **Input** - - Reads file, copied, renamed (`.fastq`); from [**start**](#start) + - miRNA sequencing library, copied, renamed (`.fastq`); from + [**start**](#start) - **Parameters** - **config_template.yaml** - - `q_value`: Minimum Q (Phred) score to keep (default 10) + - `q_value`: Minimum Q (Phred) score to keep (default: 10) - `p_value`: Minimum % of bases that must have a Q (Phred) quality - (default 50) + (default: 50) - **Output** - - Reads file, filtered (`.fastq`); used in + - miRNA sequencing library, filtered (`.fastq`); used in [**fastq_to_fasta**](#fastq_to_fasta) #### `fastq_to_fasta` -Convert reads file from `.fastq` to `.fa` with +Convert reads file from FASTQ to FASTA with [**fastx_toolkit**](#third-party-software-used). +> Sequence identifiers are renamed to numbers. + - **Input** - - Reads file, filtered (`.fastq`); from + - miRNA sequencing library, filtered (`.fastq`); from [**fastq_quality_filter**](#fastq-quality-filter) - **Output** - - Reads file (`.fa`); used in [**format_fasta**](#format_fasta) + - miRNA sequencing library (`.fa`); used in [**format_fasta**](#format_fasta) + #### `format_fasta` -Format reads to appear on a single line with +Format read's sequences to appear on a single line with [**fastx_toolkit**](#third-party-software-used). - **Input** - - Reads file (`.fa`); from [**start**](#start) or + - miRNA sequencing library (`.fa`); from [**start**](#start) or [**fastq_to_fasta**](#fastq_to_fasta) - **Output** - - Reads file, formatted (`.fa`); used in + - miRNA sequencing library, formatted (`.fasta`); used in [**remove_adapters**](#remove_adapters) #### `remove_adapters` -Trim adapters and `N` bases at either end. Filter reads by minimum length and -number of inner `N` bases with [**cutadapt**](#third-party-software-used). +Trim 3' adapters and `N` bases at either end. Filter reads by minimum length +and number of inner `N` bases with [**cutadapt**](#third-party-software-used). - **Input** - - Reads file (`.fa`); from [**format_fasta**](#format_fasta) + - miRNA sequencing library, formatted (`.fasta`); from + [**format_fasta**](#format_fasta) - **Parameters** - **samples.tsv** - - Adapter to be removed; specify in sample table column `adapter` + - Adapter to be removed; specified in the sample's table column `adapter` - **config_template.yaml** - `error_rate`: Fraction of allowed errors in the matched adapters - (default 0.1) + (default: 0.1) - `overlap`: Minimum overlap length between adapter and read to trim the - bases (default 3) + bases (default: 3) - `minimum_length`: Minimum length for a processed read to be kept - (default 15) - - `max_n`: Maximum number of `N` bases for a processed read to be kept - (default 0) + (default: 15) + - `max_n`: Maximum number of inner `N` bases for a processed read to be + kept (default: 0) - **Output** - - Reads file (`.fa`; used in + - miRNA sequencing library, filtered, without adapters (`.fasta`); used in [**collapse_identical_reads**](#collapse_identical_reads) @@ -478,10 +613,15 @@ number of inner `N` bases with [**cutadapt**](#third-party-software-used). Collapse and rename identical reads [**fastx_toolkit**](#third-party-software-used). +> Sequences are renamed in the format `R-N`, where `R` is the assigned number +> to the unique entry, and `N` is the amount of identical sequences within the +> library collapsed in it. + - **Input** - - Reads file (`.fa`); from [**remove_adapters**](#remove_adapters) + - miRNA sequencing library, filtered, without adapters (`.fasta`); from + [**remove_adapters**](#remove_adapters) - **Output** - - Reads file, collapsed, rename; used in + - miRNA sequencing library, collapsed, renamed (`.fasta`); used in [**filter_fasta_for_oligomap**](#filter_fasta_for_oligomap), [**map_genome_segemehl**](#map_genome_segemehl) and [**map_transcriptome_segemehl**](#map_transcriptome_segemehl) @@ -493,11 +633,11 @@ Align short reads to reference genome with [**segemehl**](#third-party-software-used). - **Input** - - Reads file, collapsed (`.fa`); from + - miRNA sequencing library, collapsed, renamed (`.fasta`); from [**collapse_identical_reads**](#collapse_identical_reads) - - Genome sequence (`.fa`); from + - Genome sequence, trimmed IDs (`.fa`); from [**trim_genome_seq_ids**](#trim_genome_seq_ids) - - segemehl genome index (`idx`); from + - segemehl genome index (`.idx`); from [**generate_segemehl_index_genome**](#generate_segemehl_index_genome) - **Output** - Alignments file (`.sam`); used in @@ -510,11 +650,11 @@ Align short reads to reference transcriptome with [**segemehl**](#third-party-software-used). - **Input** - - Reads file, collapsed (`.fa`); from + - miRNA sequencing library, collapsed, renamed (`.fasta`); from [**collapse_identical_reads**](#collapse_identical_reads) - - Transcriptome sequence (`.fa`); from + - Transcriptome sequence, trimmed IDs (`.fa`); from [**trim_transcriptome_seq_ids**](#trim_transcriptome_seq_ids) - - segemehl transcriptome index (`idx`); from + - segemehl transcriptome index (`.idx`); from [**generate_segemehl_index_transcriptome**](#generate_segemehl_index_transcriptome) - **Output** - Alignments file (`.sam`); used in @@ -525,16 +665,21 @@ Align short reads to reference transcriptome with Filter reads by length with a [**custom script**][custom-script-validation]. +> Required for an optimal mapping speed. Oligomap is specifically written for +> short reads. Therefore, reads with more bases than the default maximum (30 +> nts) makes the mapping slower. + - **Input** - - Reads file, collapsed (`.fa`); from + - miRNA sequencing library, collapsed, renamed (`.fasta`); from [**collapse_identical_reads**](#collapse_identical_reads) - **Parameters** - **config_template.yaml** - - `max_length_reads`: Maximum length of processed reads to map with - [**oligomap**](#third-party-software-used) + - `max_length_reads`: Maximum length of processed reads to be mapped with + [**oligomap**](#third-party-software-used) (default: 30) - **Output** - - Reads file (`.fa`); used in [**map_genome_oligomap**](#map_genome_oligomap) - and [**map_transcriptome_oligomap**](#map_transcriptome_oligomap) + - miRNA sequencing library, collapsed, filtered (`.fasta`); used in + [**map_genome_oligomap**](#map_genome_oligomap) and + [**map_transcriptome_oligomap**](#map_transcriptome_oligomap) #### `map_genome_oligomap` @@ -542,16 +687,18 @@ Filter reads by length with a [**custom script**][custom-script-validation]. Align short reads to reference genome with [**oligomap**](#third-party-software-used). +> Refer to Oligomap's [**Output format section**][oligomap-out] for a specific +> explanation and examples on the output format. + - **Input** - - Reads file, collapsed (`.fa`); from + - miRNA sequencing library, collapsed, filtered (`.fasta`); from [**filter_fasta_for_oligomap**](#filter_fasta_for_oligomap) - - Genome sequence (`.fa`); from + - Genome sequence, trimmed IDs (`.fa`); from [**trim_genome_seq_ids**](#trim_genome_seq_ids) - **Output** - - Alignments file (`.fa`); used in - [**sort_genome_oligomap**](#sort_genome_oligomap) - - Alignment report (`.txt`); used in + - Alignments file (`.oligomap`); used in [**sort_genome_oligomap**](#sort_genome_oligomap) + - Alignments report (`.txt`) #### `sort_genome_oligomap` @@ -560,32 +707,33 @@ Sort [**oligomap**](#third-party-software-used) alignments by query name with a [**custom script**][custom-script-blocksort]. - **Input** - - Alignments file (`.fa`); from - [**map_genome_oligomap**](#map_genome_oligomap) - - Alignment report (`.txt`); from + - Alignments file (`.oligomap`); from [**map_genome_oligomap**](#map_genome_oligomap) - **Output** - - Alignments file, sorted (`.fa`); used in - [**convert_genome_to_sam_oligomap**](#convert_genome_to_sam_oligomap) - - Alignment report, sorted (`.txt`); used in + - Alignments file, sorted (`.oligomap`); used in [**convert_genome_to_sam_oligomap**](#convert_genome_to_sam_oligomap) #### `convert_genome_to_sam_oligomap` -Convert aligned reads `.fa` to `.sam` and filter alignments by number of hits -with a [**custom script**][custom-script-oligo-sam]. +Convert aligned reads `.oligomap` to `.sam` and filter alignments by number of +hits with a [**custom script**][custom-script-oligo-sam]. + +> If a read has been aligned beyond a specified threshold, it is removed due +> to (1) performance reasons as the file size can rapidly increase, and (2) +> the fact that each read contributes to each count `1/N` where `N` is the +> number of genomic loci it aligns to and a large `N` makes the contribution +> negligible. - **Input** - - Alignments file, sorted (`.fa`); from - [**sort_genome_oligomap**](#sort_genome_oligomap) - - Alignment report, sorted (`.txt`); from + - Alignments file, sorted (`.oligomap`); from [**sort_genome_oligomap**](#sort_genome_oligomap) - **Parameters** - **config_template.yaml** - - `nh`: Maximum number of hits an alignment can have to be kept + - `nh`: Maximum number of mappings per read to be kept (default: 100) - **Output** - - Alignments file (`.sam`); used in [**merge_genome_maps**](#merge_genome_maps) + - Alignments file, filtered (`.sam`); used in + [**merge_genome_maps**](#merge_genome_maps) #### `map_transcriptome_oligomap` @@ -593,16 +741,18 @@ with a [**custom script**][custom-script-oligo-sam]. Align short reads to reference transcriptome with [**oligomap**](#third-party-software-used). +> Refer to Oligomap's [Output format section][oligomap-out] for a specific +> explanation and examples on the output format. + - **Input** - - Reads file (`.fa`); from + - miRNA sequencing library, collapsed, filtered (`.fasta`); from [**filter_fasta_for_oligomap**](#filter_fasta_for_oligomap) - - Transcriptome sequence (`.fa`); from + - Transcriptome sequence, trimmed IDs (`.fa`); from [**trim_transcriptome_seq_ids**](#trim_transcriptome_seq_ids) - **Output** - - Alignments file (`.fa`); used in - [**sort_transcriptome_oligomap**](#sort_transcriptome_oligomap) - - Alignment report (`.txt`); used in + - Alignments file (`.oligomap`); used in [**sort_transcriptome_oligomap**](#sort_transcriptome_oligomap) + - Alignments report (`.txt`) #### `sort_transcriptome_oligomap` @@ -611,32 +761,32 @@ Sort [**oligomap**](#third-party-software-used) alignments by query name with a [**custom script**][custom-script-blocksort]. - **Input** - - Alignments file (`.fa`); from - [**map_transcriptome_oligomap**](#map_transcriptome_oligomap) - - Alignment report (`.txt`); from + - Alignments file (`.oligomap`); from [**map_transcriptome_oligomap**](#map_transcriptome_oligomap) - **Output** - - Aligned file, sorted (`.fa`); used in + - Alignments file, sorted (`.oligomap`); used in [**convert_transcriptome_to_sam_oligomap**](#convert_transcriptome_to_sam_oligomap) - - Alignment report, sorted (`.txt`); used in - [**convert_transcriptome_to_sam_oligomap**](#convert_transcriptme_to_sam_oligomap) #### `convert_transcriptome_to_sam_oligomap` -Convert aligned reads `.fa` to `.sam` and filter alignments by number of hits -with a [**custom script**][custom-script-oligo-sam]. +Convert aligned reads `.oligomap` to `.sam` and filter alignments by number of +hits with a [**custom script**][custom-script-oligo-sam]. + +> If a read has been aligned beyond a specified threshold, it is removed due +> to (1) performance reasons as the file size can rapidly increase, and (2) +> the fact that each read contributes to each count `1/N` where `N` is the +> number of genomic loci it aligns to and a large `N` makes the contribution +> negligible. - **Input** - - Alignments file, sorted (`.fa`); from - [**sort_transcriptome_oligomap**](#sort_transcriptome_oligomap) - - Alignment report, sorted (`.txt`); from + - Alignments file, sorted (`.oligomap`); from [**sort_transcriptome_oligomap**](#sort_transcriptome_oligomap) - **Parameters** - **config_template.yaml** - - `nh`: Maximum number of hits an alignment can have to be kept + - `nh`: Maximum number of mappings per read to be kept (default: 100) - **Output** - - Alignments file (`.sam`); used in + - Alignments file, filtered (`.sam`); used in [**merge_transcriptome_maps**](#merge_transcriptome_maps) @@ -648,7 +798,7 @@ Concatenate [**segemehl**](#third-party-software-used) and - **Input** - Alignments file (`.sam`); from [**map_genome_segemehl**](#map_genome_segemehl) - - Alignments file (`.sam`); from + - Alignments file, filtered (`.sam`); from [**convert_genome_to_sam_oligomap**](#convert_genome_to_sam_oligomap) - **Output** - Alignments file (`.sam`); used in @@ -663,7 +813,7 @@ Concatenate [**segemehl**](#third-party-software-used) and - **Input** - Alignments file (`.sam`); from [**map_transcriptome_segemehl**](#map_transcriptome_segemehl) - - Alignments file (`.sam`); from + - Alignments file, filtered (`.sam`); from [**convert_transcriptome_to_sam_oligomap**](#convert_transcriptome_to_sam_oligomap) - **Output** - Alignments file (`.sam`); used in @@ -675,27 +825,40 @@ Concatenate [**segemehl**](#third-party-software-used) and Filter merged genome alignments by the number of hits with a [**custom script**][custom-script-nh-filter]. +> If a read has been aligned beyond a specified threshold, it is removed due +> to (1) performance reasons as the file size can rapidly increase, and (2) +> the fact that each read contributes to each count `1/N` where `N` is the +> number of genomic loci it aligns to and a large `N` makes the contribution +> negligible. + - **Input** - Alignments file (`.sam`); from [**merge_genome_maps**](#merge_genome_maps) - **Parameters** - **config_template.yaml** - - `nh`: Maximum number of mappings per read to be kept (default 100) + - `nh`: Maximum number of mappings per read to be kept (default: 100) - **Output** - Alignments file, filtered (`.sam`); used in [**remove_header_genome_mappings**](#remove_header_genome_mappings) + #### `filter_transcriptome_by_nh` Filter merged transcriptome alignments by the number of hits with a [**custom script**][custom-script-nh-filter]. +> If a read has been aligned beyond a specified threshold, it is removed due +> to (1) performance reasons as the file size can rapidly increase, and (2) +> the fact that each read contributes to each count `1/N` where `N` is the +> number of genomic loci it aligns to and a large `N` makes the contribution +> negligible. + - **Input** - Alignments file (`.sam`); from [**merge_transcriptome_maps**](#merge_transcriptme_maps) - **Parameters** - **config_template.yaml** - - `nh`: Maximum number of mappings per read to be kept (default 100) + - `nh`: Maximum number of mappings per read to be kept (default: 100) - **Output** - Alignments file, filtered (`.sam`); used in [**remove_header_transcriptome_mappings**](#remove_header_transcriptome_mappings) @@ -703,63 +866,66 @@ Filter merged transcriptome alignments by the number of hits with a #### `remove_header_genome_mappings` -Remove the `SAM` header of the genome alignments file with +Remove the SAM header of the genome alignments file with [**SAMtools**](#third-party-software-used). - **Input** - Alignments file (`.sam`); from [**filter_genome_by_nh**](#filter_genome_by_nh) - **Output** - - Alignments file (`.sam`); used in [**merge_all_maps**](#merge_all_maps) + - Alignments file, without SAM header (`.sam`); used in + [**merge_all_maps**](#merge_all_maps) #### `remove_header_transcriptome_mappings` -Remove the `SAM` header of the transcriptome alignments file with +Remove the SAM header of the transcriptome alignments file with [**SAMtools**](#third-party-software-used). - **Input** - Alignments file (`.sam`); from [**filter_transcriptome_by_nh**](#filter_transcriptome_by_nh) - **Output** - - Alignments file (`.sam`); used in + - Alignments file, without SAM header (`.sam`); used in [**transcriptome_to_genome_maps**](#transcriptome_to_genome_maps) #### `transcriptome_to_genome_maps` -Convert the alignments transcriptome coordinates to genomic ones with a +Convert the alignments' transcriptome coordinates to genomic ones with a [**custom script**][custom-script-sam-trx]. - **Input** - - Alignments file (`.sam`); from + - Alignments file, without SAM header (`.sam`); from [**remove_header_transcriptome_mappings**](#remove_header_transcriptome_mappings) - Exon annotations (`.bed`); from [**convert_exons_gtf_to_bed**](#convert_exons_gtf_to_bed) - **Output** - - Alignments file (`.sam`); used in [**merge_all_maps**](#merge_all_maps) + - Alignments file, without SAM header (`.sam`); used in + [**merge_all_maps**](#merge_all_maps) #### `merge_all_maps` -Concatenate the four alignments files into a single file. +Concatenate the four alignment files into a single one. - **Input** - - Alignments file (`.sam`); from + - Alignments files, without SAM header (`.sam`); from [**remove_header_genome_mappings**](#remove_header_genome_mappings) and [**transcriptome_to_genome_maps**](#transcriptome_to_genome_maps) - **Output** - - Alignments file (`.sam`); used in + - Alignments file, without SAM header (`.sam`); used in [**add_header_all_maps**](#add_header_all_maps) #### `add_header_all_maps` -Add the `SAM` header to the aligned reads with +Add the SAM header to the aligned reads merged file with [**SAMtools**](#third-party-software-used). - **Input** - - Alignments file (`.sam`); from [**merge_all_maps**](#merge_all_maps) + - Alignments file, without SAM header (`.sam`); from + [**merge_all_maps**](#merge_all_maps) - **Output** - Alignments file (`.sam`); used in [**sort_maps_by_id**](#sort_maps_by_id) @@ -782,30 +948,94 @@ Remove duplicate and inferior alignments with a [**custom script**][custom-script-remove-dup]. > Alignments are considered to be duplicates if having identical entries for -the fields `QNAME`, `FLAG`, `RNAME`, `POS` and `CIGAR`. -Alignments are considered to be inferiors if having the same `QNAME` and -a bigger edit distance than the minimum one within the group. - +> the fields `QNAME`, `FLAG`, `RNAME`, `POS` and `CIGAR`. +> Alignments are considered to be inferiors if having the same `QNAME` and +> a bigger edit distance than the smaller one within the group. The tags `NH` +> (number of hits) and `HI` (query hit index) are updated accordingly. - **Input** - - Alignments file, sorted (`.sam`); from [**sort_maps_by_id**](#sort_maps_by_id) + - Alignments file, sorted (`.sam`); from + [**sort_maps_by_id**](#sort_maps_by_id) - **Output** - - Alignments file (`.sam`); used in + - Alignments file, filtered (`.sam`); used in [**filter_by_indels**](#filter_by_indels) +- **Examples** + +```console +Example 1 | Remove duplicates + +IN: + 1-2 0 19 44414 1 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * HI:i:0 NH:i:1 NM:i:0 MD:Z:21 RG:Z:A1 YZ:Z:0 + 1-2 0 19 44414 255 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * NM:i:0 MD:Z:21 NH:i:1 +OUT: + 1-2 0 19 44414 255 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * MD:Z:21 NH:i:1 NM:i:0 + + +Example 2 | Remove inferiors single alignment + +IN: + 1-704 16 19 207362 1 18M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:0 NH:i:2 NM:i:0 MD:Z:18 RG:Z:A1 YZ:Z:0 + 1-704 272 19 471264 1 16M1I1M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:1 NH:i:2 NM:i:2 MD:Z:11G5 RG:Z:A1 YZ:Z:0 +OUT: + 1-704 16 19 207362 1 18M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:0 NH:i:1 NM:i:0 MD:Z:18 RG:Z:A1 YZ:Z:0 + + +Example 3 | Remove inferiors multiple alignments + +IN: + + 1-1197 0 19 56327 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:0 NH:i:4 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0 + 1-1197 256 19 68983 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:1 NH:i:4 NM:i:3 MD:Z:1C10AT1 RG:Z:A1 YZ:Z:0 + 1-1197 256 19 76967 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:2 NH:i:4 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0 + 1-1197 256 19 92363 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:4 NH:i:4 NM:i:3 MD:Z:1C11TA RG:Z:A1 YZ:Z:0 + +OUT: + 1-1197 0 19 56327 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:0 NH:i:2 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0 + 1-1197 256 19 76967 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:1 NH:i:2 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0 +``` #### `filter_by_indels` -Remove multimappers favoring mismatches over indels with a +Filter multimappers favoring mismatches over InDels with a [**custom script**][custom-script-filter-mm]. +> Under the assumption that InDels are less frequent than mismatches only +> those alignments (of the same read with the same edit distance) with the +> lowest number of InDels are kept. This approach allows the presence of +> multimappers and/or InDels after the filtering if the alignments contain the +> same proportion of mismatches vs. InDels. + - **Input** - - Alignments file, sorted (`.sam`); from + - Alignments file, sorted, filtered (`.sam`); from [**remove_inferiors**](#remove_inferiors) - **Output** - - Alignments file (`.sam`); used in + - Alignments file, sorted, filtered (`.sam`); used in [**convert_all_alns_sam_to_bam**](#convert_all_alns_sam_to_bam) and [**filter_sam_by_intersecting_primir**](#filter_sam_by_intersecting_primr) +- **Examples** + +```console +Example 1 | Different proportion of mismatches vs. InDels + +IN: + 1-1 16 19 77595 255 14M1D8M * 0 0 GCAGGAGAATCACTGATGTCAG * MD:Z:14^T2A1C3 NH:i:2 NM:i:3 XA:Z:Q XI:i:1 + 1-1 0 19 330456 255 4M1D1M1I3M1D13M * 0 0 CTGACATCAGTGATTCTCCTGC * MD:Z:4^G4^A13 NH:i:2 NM:i:3 XA:Z:Q XI:i:0 +OUT: + 1-1 16 19 77595 255 14M1D8M * 0 0 GCAGGAGAATCACTGATGTCAG * MD:Z:14^T2A1C3 NM:i:3 XA:Z:Q XI:i:1 NH:i:1 HI:i:1 + + +Example 2 | Equal proportion of mismatches vs. InDels + +IN: + 1-1 0 19 142777 255 15M1I5M * 0 0 GCTAGGTGGGAGGCTTGAAGC * MD:Z:4C0T14 NH:i:3 NM:i:3 XA:Z:Q XI:i:0 + 1-1 16 19 270081 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14G0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:2 + 1-1 16 19 545543 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:1 +OUT: + 1-1 0 19 142777 255 15M1I5M * 0 0 GCTAGGTGGGAGGCTTGAAGC * MD:Z:4C0T14 NH:i:3 NM:i:3 XA:Z:Q XI:i:0 + 1-1 16 19 270081 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14G0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:2 + 1-1 16 19 545543 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:1 +``` #### `convert_all_alns_sam_to_bam` @@ -814,10 +1044,10 @@ Convert alignments `.sam` file to `.bam` with [**SAMtools**](#third-party-software-used). > Required by [BEDTools](#third-party-software-used) to intersect alignments -with pri-miR annotations. +> with pri-miR annotations. - **Input** - - Alignments file (`.sam`); from [**filter_by_indels**](#filter_by_indels) + - Alignments file, filtered (`.sam`); from [**filter_by_indels**](#filter_by_indels) - **Output** - Alignments file (`.bam`); used in [**sort_all_alns_bam_by_position**](#sort_all_alns_bam_by_position) @@ -828,7 +1058,7 @@ with pri-miR annotations. Sort alignments by position with [**SAMtools**](#third-party-software-used). > Required by [BEDTools](#third-party-software-used) to intersect alignments -with pri-miR annotations more efficiently. +> with pri-miR annotations more efficiently. - **Input** - Alignments file (`.bam`); from @@ -841,16 +1071,16 @@ with pri-miR annotations more efficiently. #### `index_all_alns_bam` -Create index `BAM` file with [**SAMtools**](#third-party-software-used). +Create index BAM file with [**SAMtools**](#third-party-software-used). > Indexing is required by genome viewers such as IGV to quickly display -alignments in a genomic region of interest. +> alignments in a genomic region of interest. - **Input** - - Alignments file (`.bam`); from + - Alignments file, sorted (`.bam`); from [**sort_all_alns_bam_by_position**](#sort_all_alns_bam_by_position) - **Output** - - `BAM` index file (`.bam.bai`); used in + - BAM index file (`.bam.bai`); used in [**intersect_extended_primir**](#intersect_extended_primir) @@ -865,24 +1095,25 @@ Target rule as required by [Snakemake][docs-snakemake]. - **Input** - Alignments file, filtered (`.sam`); from [**filter_sam_by_intersecting_primir**](#filter_sam_by_intersecting_primir) - - Alignments file, filtered (`.sam`); from - [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) - - Alignments file, sorted, tagged (`.sam`); from - [**sort_intersecting_mirna_by_feat_tag**](#sort_intersecting_mirna_by_feat_tag) + and [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) - (iso)miR and/or pri-miR counts table (`.tab`); from [**merge_tables**](#merge_tables) - - Alignments file (`.bam`); from + - Alignments file, uncollapsed, sorted (`.bam`); from [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) - - `BAM` index file (`.bam.bai`); from + - BAM index file (`.bam.bai`); from [**index_uncollapsed_reads_bam**](#index_uncollapsed_reads_bam) + #### `intersect_extendend_primir` Intersect the aligned reads with the extended pri-miR annotations with [**BEDTools**](#third-party-software-used). +> Only those alignments fully intersecting a (possibly extended) pri-miR +> annotated region are kept. + - **Input** - - Alignments file (`.bam`); from + - Alignments file, sorted (`.bam`); from [**sort_all_alns_bam_by_position**](#sort_all_alns_bam_by_position) - pri-miR extended annotations (`.gff3`); from [**extend_mirs_annotations**](#extend_mirs_annotations) @@ -897,14 +1128,16 @@ Intersect the aligned reads with the extended pri-miR annotations with Remove alignments that do not intersect with any pri-miR with [**SAMtools**](#third-party-software-used). -> Required to only intersect alignments within a pri-miR locus. +> Required to only intersect alignments within a (possibly extended) pri-miR +> locus. - **Input** - - Alignments file (`.sam`); from [**filter_by_indels**](#filter_by_indels) + - Alignments file, filtered (`.sam`); from + [**filter_by_indels**](#filter_by_indels) - pri-miR intersections file (`.bed`); from [**intersect_extended_primir**](#intersect_extended_primir) - **Output** - - Alignments file, filtered (`.sam`); used in + - (**Workflow output**) Alignments file, filtered (`.sam`); used in [**convert_intersecting_primir_sam_to_bam**](#convert_intersecting_primir_sam_to_bam) and [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) @@ -915,10 +1148,10 @@ Convert alignments `.sam` file to `.bam` with [**SAMtools**](#third-party-software-used). > Required by [BEDTools](#third-party-software-used) to intersect alignments -with miRNA annotations. +> with miRNA annotations. - **Input** - - Alignments file (`.sam`); from + - Alignments file, filtered (`.sam`); from [**filter_sam_by_intersecting_primir**](#filter_sam_by_intersecting_primir) - **Output** - Alignments file (`.bam`); used in @@ -930,7 +1163,7 @@ with miRNA annotations. Sort alignments by position with [**SAMtools**](#third-party-software-used). > Required by [BEDTools](#third-party-software-used) to intersect alignments -with miRNA annotations more efficiently. +> with miRNA annotations more efficiently. - **Input** - Alignments file (`.bam`); from @@ -943,16 +1176,16 @@ with miRNA annotations more efficiently. #### `index_intersecting_primir_bam` -Create index `BAM` file with [**SAMtools**](#third-party-software-used). +Create index BAM file with [**SAMtools**](#third-party-software-used). > Indexing is required by genome viewers such as IGV to quickly display -alignments in a genomic region of interest. +> alignments in a genomic region of interest. - **Input** - - Alignments file (`.bam`); from + - Alignments file, sorted (`.bam`); from [**sort_intersecting_primir_bam_by_position**](#sort_intersecting_primir_bam_by_position) - **Output** - - `BAM` index file (`.bam.bai`); used in + - BAM index file (`.bam.bai`); used in [**intersect_extended_mirna**](#intersect_extended_mirna) @@ -961,13 +1194,16 @@ alignments in a genomic region of interest. Intersect the aligned reads with the extended miRNA annotations with [**BEDTools**](#third-party-software-used). +> Only those alignments fully intersecting an extended mature miRNA annotated +> region are kept. + - **Input** - - Alignments file (`.bam`); from + - Alignments file, sorted (`.bam`); from [**sort_intersecting_primir_bam_by_position**](#sort_intersecting_primir_bam_by_position) - - miRNA extended annotations (`.gff3`); from + - Mature miRNA extended annotations (`.gff3`); from [**extend_mirs_annotations**](#extend_mirs_annotations) - **Output** - - miRNA intersections file (`.bed`); used in + - Mature miRNA intersections file (`.bed`); used in [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) and [**add_intersecting_mirna_tag**](#add_intersecting_mirna_tag) @@ -980,12 +1216,12 @@ Remove alignments that do not intersect with any miRNA with > Required to efficiently classify the alignments. - **Input** - - Alignments file (`.sam`); from + - Alignments file, filtered (`.sam`); from [**filter_sam_by_intersecting_primir**](#filter_sam_by_intersecting_primir) - - miRNA intersections file (`.bed`); from + - Mature miRNA intersections file (`.bed`); from [**intersect_extended_mirna**](#intersect_extended_mirna) - **Output** - - Alignments file, filtered (`.sam`); used in + - (**Workflow output**) Alignments file, filtered (`.sam`); used in [**add_intersecting_mirna_tag**](#add_intersecting_mirna_tag) and [**uncollapse_reads**](#uncollapse_reads) @@ -995,29 +1231,72 @@ Remove alignments that do not intersect with any miRNA with Classify and add the intersecting (iso)miR to each alignment as a tag with a [**custom script**][custom-script-iso-tag]. -> To classify the reads, miRNA annotations are turned into the original ones. -The alignment start and end shifts relative to the annotations are computed -and append along with the `CIGAR` and `MD` strings to the intersecting miRNA. -If the read is classified as a canonical miRNA, the name will not include -the star and end shift, nor the `CIGAR` and `MD` strings. +> In this step, the mature miRNA annotated regions are used instead of the +> extended ones. Each alignment gets an extra tag (`YW:Z`) with either the +> (iso)miR(s) it is considered to really intersect with or an empty string +> otherwise. The format of the intersecting mature miRNA species is +> `miRNA_name|5p-shift|3p-shift|CIGAR|MD`, where `5p-shift` and `3p-shift` are +> the difference between the miRNA start and end coordinates and the +> alignment's ones respectively. - **Input** - - Alignments file (`.sam`); from + - Alignments file, filtered (`.sam`); from [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) - - miRNA intersections file (`.bed`); from + - Mature miRNA intersections file (`.bed`); from [**intersect_extended_mirna**](#intersect_extended_mirna) - **Parameters** - **config_template.yaml** - `extension`: Number of nucleotides by which mature miRNA annotated - regions are extended (default 6) + regions are extended (default: 6) - **Output** - Alignments file, tagged (`.sam`); used in [**sort_intersecting_mirna_by_feat_tag**](#sort_intersecting_mirna_by_feat_tag) +- **Examples** + +```console +Example 1 | Intersecting a canoncial mature miRNA + +IN miRNA annotations: + chr19 . miRNA 44377 44398 . + . ID=MIMAT0002849;Alias=MIMAT0002849;Name=hsa-miR-524-5p;Derives_from=MI0003160 +IN SAM record: + 1-1_1 0 19 44377 255 22M * 0 0 CTACAAAGGGAAGCACTTTCTC * MD:Z:22 NH:i:1 NM:i:0 +NEW TAG: + YW:Z:hsa-miR-524-5p|0|0|22M|22 + + +Example 2 | Intersecting an isomiR (no shifts) + +IN miRNA annotations: + chr19 . miRNA 44377 44398 . + . ID=MIMAT0002849;Alias=MIMAT0002849;Name=hsa-miR-524-5p;Derives_from=MI0003160 +IN SAM record: + 1-1_1 0 19 44377 1 11M3I11M * 0 0 CTACAAAGGGAGGTAGCACTTTCTC * HI:i:0 MD:Z:22 NH:i:1 NM:i:3 +NEW TAG: + YW:Z:hsa-miR-524-5p|0|0|11M3I11M|22 + +Example 3 | Intersecting an isomiR (no InDels nor mismatches) + +IN miRNA annotations: + chr19 . miRNA 5338 5359 . + . ID=MIMAT0005795;Alias=MIMAT0005795;Name=hsa-miR-1323;Derives_from=MI0003786 +IN SAM record: + 1-1_1 0 19 5338 255 21M * 0 0 TCAAAACTGAGGGGCATTTTC * MD:Z:21 NH:i:1 NM:i:0 +NEW TAG: + YW:Z:hsa-miR-1323|0|-1|21M|21 + + +Example 4 | Not intersecting an (iso)miR + +IN miRNA annotations: + chr19 . miRNA 5338 5359 . + . ID=MIMAT0005795;Alias=MIMAT0005795;Name=hsa-miR-1323;Derives_from=MI0003786 +IN SAM record: + 1-1_1 0 19 5338 255 21M * 0 0 TCAAAACTGAGGGGCATTTTC * MD:Z:21 NH:i:1 NM:i:0 +NEW TAG: + YW:Z: +``` #### `sort_intersecting_mirna_by_feat_tag` -Sort the alignments by the tag containing the classified intersecting miRNA +Sort the alignments by the tag containing the classified intersecting (iso)miR with [**SAMtools**](#third-party-software-used). > Required for an efficient quantification. @@ -1032,24 +1311,102 @@ with [**SAMtools**](#third-party-software-used). #### `quantify_mirna` -Tabulate alignments according to its tag with a +Tabulate alignments according to its new tag (`YW:Z`) with a [**custom script**][custom-script-mir-quant]. -> Quantification is done with partial counts (_i.e._ each alignment contributes -by the number of collapsed reads divided by the number of hits). - -- **Input** - - Alignments file, sorted, tagged (`.sam`); from +> Each alignment contributes to the miRNA species in its `YW:Z` tag by `R/N`, +> where `R` is the number of collapsed reads in that alignment, and `N` is the +> number of genomic and/or transcriptomic loci it aligns to. The values of +> both, `R` and `N` are inferred from the sequence name which follows the +> format `ID-R_N`. The resulting table has a row for each mature miRNA species +> (isomiR, canonical miRNA or both) with the name format set in +> [**add_intersecting_mirna_tag**](#add_intersecting_mirna_tag) unless +> considered a canonical miRNA, which only keeps the annotated mature miRNA +> name. A miRNA species is considered to be canonical if there are no shifts +> between its start and end positions and the aligned read ones, and there +> are no mismatches nor InDels. + +- **Input** + - Alignments file, tagged, sorted (`.sam`); from [**sort_intersecting_mirna_by_feat_tag**](#sort_intersecting_mirna_by_feat_tag) - **Parameters** - **samples.tsv** - - Library name; specify in sample table column `sample` + - Library name; specified in the sample's table column `sample` - **config_template.yaml** - - `mir_list`: miRNA features to be quantified (default isomir, mirna + - `mir_list`: miRNA features to be quantified (default: isomir, mirna pri-miR) - **Output** - (iso)miR counts tab-delimited file; used in [**merge_tables**](#merge_tables) +- **Examples** + +```console +Example 1 | Canonical miRNA and isomiR + +IN SAM record: + 10-4_2 0 19 34627 255 21M * 0 0 AAAGTGCTTCCTTTTAGAGGG * MD:Z:21 NM:i:0 NH:i:2 HI:i:1 YW:Z:hsa-miR-520b-3p|0|0|21M|21 + 10-4_2 0 19 40866 255 21M * 0 0 AAAGTGCTTCCTTTTAGAGGG * MD:Z:21 NM:i:0 NH:i:2 HI:i:2 YW:Z:hsa-miR-520c-3p|0|-1|21M|21 + +Data: + Alignment: + Read ID: 10 + Number of collapsed reads: 4 + Number of mapped genomic loci: 2 + Contribution: 4/2 = 2 + + miRNA species: + Tag name: hsa-miR-520b-3p|0|0|21M|21 + Type: Canonical + Table name: hsa-miR-520b-3p + Total count: 2 + + Tag name: hsa-miR-520c-3p|0|-1|21M|21 + Type: isomiR + Table name: hsa-miR-520c-3p|0|-1|21M|21 + Total count: 2 + +OUT table: + ID lib_name + hsa-miR-520b-3p 2 + hsa-miR-520c-3p|0|-1|21M|21 2 + + +Example 2 | Different isomiRs + +IN SAM record: + 599-1_3 0 19 27804 255 20M * 0 0 AAAGTGCTTCCTTTTAGAGG * MD:Z:20 NM:i:0 NH:i:3 HI:i:1 YW:Z:hsa-miR-526b-3p|1|-1|20M|20 + 599-1_3 0 19 34627 255 20M * 0 0 AAAGTGCTTCCTTTTAGAGG * MD:Z:20 NM:i:0 NH:i:3 HI:i:2 YW:Z:hsa-miR-520b-3p|0|-1|20M|20 + 599-1_3 0 19 40866 255 20M * 0 0 AAAGTGCTTCCTTTTAGAGG * MD:Z:20 NM:i:0 NH:i:3 HI:i:3 YW:Z:hsa-miR-520c-3p|0|-2|20M|20 + +Data: + Alignment: + Read ID: 599 + Number of collapsed reads: 1 + Number of mapped genomic loci: 3 + Contribution: 1/3 = 0.33 + + miRNA species: + Tag name: hsa-miR-526b-3p|1|-1|20M|20 + Type: isomiR + Table name: hsa-miR-526b-3p|1|-1|20M|20 + Total count: 0.33 + + Tag name: hsa-miR-520b-3p|0|-1|20M|20 + Type: isomiR + Table name: hsa-miR-520b-3p|0|-1|20M|20 + Total count: 0.33 + + Tag name: hsa-miR-520c-3p|0|-2|20M|20 + Type: isomiR + Table name: hsa-miR-520c-3p|0|-2|20M|20 + Total count: 0.33 + +OUT table: + ID lib_name + hsa-miR-520b-3p|0|-1|20M|20 0.33 + hsa-miR-520c-3p|0|-2|20M|20 0.33 + hsa-miR-526b-3p|1|-1|20M|20 0.33 +``` #### `quantify_primir` @@ -1057,15 +1414,73 @@ by the number of collapsed reads divided by the number of hits). Tabulate alignments according to its intersecting pri-miR with a [**custom script**][custom-script-pri-quant] -> Quantification is done with partial counts (_i.e._ each alignment contributes -by the number of collapsed reads divided by the number of hits). +> Each alignment contributes to the pri-miR it intersects with by `R/N`, where +> `R` is the number of collapsed reads in that alignment, and `N` is the +> number of genomic and/or transcriptomic loci it aligns to. The values of +> both, `R` and `N` are inferred from the sequence name which follows the +> format `ID-R_N`. The resulting table has a row for each pri-miR with the +> name format set in [**mirna_extension**](#mirna_extension). - **Input** - pri-miR intersections file (`.bed`); from - [**intersecti_extended_primir**](#intersect_extended_primir) + [**intersect_extended_primir**](#intersect_extended_primir) - **Output** - pri-miR counts tab-delimited file; used in [**merge_tables**](#merge_tables) +- **Examples** + +```console +Example 1 | One single pri-miR with different alignments + +IN BED records: + 19 . miRNA_primary_transcript 27766 27788 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27765 27788 68-2_1 255 + + 19 . miRNA_primary_transcript 27766 27787 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27765 27787 316-1_7 1 + + 19 . miRNA_primary_transcript 27804 27823 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27803 27823 599-1_3 255 + + 19 . miRNA_primary_transcript 27805 27822 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27804 27822 226-1_4 1 + + +Alignments: + Read ID: 68 + Number of collapsed reads: 2 + Number of mapped genomic loci: 1 + Contribution: 2/1 = 2 + + Read ID: 316 + Number of collapsed reads: 1 + Number of mapped genomic loci: 7 + Contribution: 1/7 = 0.143 + + Read ID: 599 + Number of collapsed reads: 1 + Number of mapped genomic loci: 3 + Contribution: 1/3 = 0.33 + + Read ID: 226 + Number of collapsed reads: 1 + Number of mapped genomic loci: 4 + Contribution: 1/4 = 0.25 + +OUT table: + ID lib_name + hsa-mir-526b_-0_+0 2.723 + + +Example 2 | Different pri-miRs for a single read + +IN BED records: + 19 . miRNA_primary_transcript 40866 40886 . + . ID=MI0003158;Alias=MI0003158;Name=hsa-mir-520c_-0_+0 19 40865 40886 10-4_2 255 + + 19 . miRNA_primary_transcript 34627 34647 . + . ID=MI0003155;Alias=MI0003155;Name=hsa-mir-520b_-5_+6 19 34626 34647 10-4_2 255 + + +Alignment: + Read ID: 10 + Number of collapsed reads: 4 + Number of mapped genomic loci: 2 + Contribution: 4/2 = 2 + +OUT table: + ID lib_name + hsa-mir-520c_-0_+0 2 + hsa-mir-520b_-5_+6 2 +``` #### `merge_tables` @@ -1073,24 +1488,56 @@ by the number of collapsed reads divided by the number of hits). Merge all the tables from the different libraries into a single one with a [**custom script**][custom-script-merge-tab]. +> The final table(s) containing the counting data from all libraries for the +> (iso)miRs and/or pri-miRs have a row per miRNA species and a column per +> sample library. If a miRNA species is not found in a certain library, its +> value is set to `NA`. + - **Input** - - counts tab-delimited file; from [**quantify_mirna**](#quantify_mirna) + - Counts tab-delimited file; from [**quantify_mirna**](#quantify_mirna) and/or [**quantify_primir**](#quantify_primir) - **Parameters** - **cluster_schema.json** - - `mir_list`: miRNA features to be quantified (default isomir, mirna + - `mir_list`: miRNA features to be quantified (default: isomir, mirna pri-mir) - **Output** - - (iso)miR and/or pri-miR counts table (`.tab`) + - (**Workflow output**) (iso)miR and/or pri-miR counts table (`.tab`) +- **Example** + +```console +IN library 1 + ID lib_1 + hsa-miR-524-5p 1 + hsa-miR-524-5p|0|0|22M|9G12 1 + hsa-miR-524-5p|0|0|22M|9G9C2 1 + +IN library 2 + ID lib_2 + hsa-miR-524-5p 1 + hsa-miR-1283 1 + hsa-miR-1283|-1|-2|21M|21 1 + +IN library 3 + ID lib_3 + +OUT table + ID lib_1 lib_2 lib_3 + hsa-miR-524-5p 1 1 NA + hsa-miR-524-5p|0|0|22M|9G12 1 NA NA + hsa-miR-524-5p|0|0|22M|9G9C2 1 NA NA + hsa-miR-1283 NA 1 NA + hsa-miR-1283|-1|-2|21M|21 NA 1 NA +``` #### `uncollapse_reads` Reverse the collapsing of reads with identical sequences as done with -[**FASTX-Toolkit**](#third-party-software-used). +[**FASTX-Toolkit**](#third-party-software-used) with a +[**custom script**][custom-script-uncollapse]. - **Input** - - Alignments file (`.sam`); from + - Alignments file, filtered (`.sam`); from [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) - **Output** - Uncollapsed aligned reads (`.sam`); used in @@ -1103,10 +1550,10 @@ Convert alignments `.sam` file to `.bam` with [**SAMtools**](#third-party-software-used). - **Input** - - Alignments file (`.sam`); from + - Alignments file, uncollapsed (`.sam`); from [**filter_sam_by_intersecting_mirna**](#filter_sam_by_intersecting_mirna) - **Output** - - Alignments file (`.bam`); used in + - Alignments file, uncollapsed (`.bam`); used in [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) @@ -1115,11 +1562,10 @@ Convert alignments `.sam` file to `.bam` with Sort alignments by position with [**SAMtools**](#third-party-software-used). - **Input** - - Alignments file (`.bam`); from + - Alignments file, uncollapsed (`.bam`); from [**convert_uncollapsed_reads_sam_to_bam**](#convert_uncollapsed_reads_sam_to_bam) - **Output** - - Alignments file, sorted (`.bam`); used in - [**index_uncollapsed_reads_bam**](#index_uncollapsed_reads_bam), + - (**Workflow output**) Alignments file, uncollapsed, sorted (`.bam`); used in [**create_per_library_ascii_pileups**](#create_per_library_ascii_pileups), [**create_per_run_ascii_pileups**](#create_per_run_ascii_pileups) and/or [**create_per_condition_ascii_pileups**](#create_per_condition_ascii_pileups) @@ -1127,16 +1573,16 @@ Sort alignments by position with [**SAMtools**](#third-party-software-used). #### `index_uncollapsed_reads_bam` -Create index `BAM` file with [**SAMtools**](#third-party-software-used). +Create index BAM file with [**SAMtools**](#third-party-software-used). > Indexing is required by genome viewers such as IGV to quickly display -alignments in a genomic region of interest. +> alignments in a genomic region of interest. - **Input** - - Alignments file (`.bam`); from + - (**Workflow output**) Alignments file, uncollapsed, sorted (`.bam`); from [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) - **Output** - - `BAM` index file (`.bam.bai`); used in + - (**Workflow output**) BAM index file (`.bam.bai`); used in [**create_per_library_ascii_pileups**](#create_per_library_ascii_pileups), [**create_per_run_ascii_pileups**](#create_per_run_ascii_pileups) and/or [**create_per_condition_ascii_pileups**](#create_per_condition_ascii_pileups) @@ -1151,7 +1597,7 @@ Target rule as required by [Snakemake][docs-snakemake]. > Local rule - **Input** - - Empty text files (`.txt`); from + - (**Workflow output**) Empty text file (`.txt`) [**create_per_library_ascii_pileups**](#create_per_library_ascii_pileups) and [**create_per_run_ascii_pileups**](#create_per_run_ascii_pileups) @@ -1169,7 +1615,7 @@ Create an empty BED file if the user has not provided one. - `bed_file`: BED6 file with all the desired annotation regions to perform the ASCII-style alignment pileups on. (Default: None) - **Output** - - BED empty file (`.bed`); used in + - Empty BED file (`.bed`); used in [**create_per_library_ascii_pileups**](#create_per_library_ascii_pileups), [**create_per_run_ascii_pileups**](#create_per_run_ascii_pileups) and/or [**create_per_condition_ascii_pileups**](#create_per_condition_ascii_pileups) @@ -1177,17 +1623,16 @@ Create an empty BED file if the user has not provided one. #### `compress_reference_genome` -Compress the processed genome with trimmed IDs using `bgzip`. +Compress the processed genome with trimmed IDs using `bgzip` with +[**SAMtools**](#third-party-software-used). > Required to perform the ASCII-style alignment pileups. -> In order to be able to use the `bgzip` command when running the workflow -> using Singularity, [**SAMtools**](#third-party-software-used) is used. - **Input** - - Genome sequence file, trimmed IDs (`.fa`); from + - Genome sequence, trimmed IDs (`.fa`); from [**trim_genome_seq_ids**](#trim_genome_seq_ids) - **Output** - - Genome sequence file, trimmed IDs, `bgzip`ed (`.fa.bz`); used in + - Genome sequence, trimmed IDs, `bgzip`ed (`.fa.bz`); used in [**create_per_library_ascii_pileups**](#create_per_library_ascii_pileups), [**create_per_run_ascii_pileups**](#create_per_run_ascii_pileups) and/or [**create_per_condition_ascii_pileups**](#create_per_condition_ascii_pileups) @@ -1196,55 +1641,55 @@ Compress the processed genome with trimmed IDs using `bgzip`. #### `create_per_library_ascii_pileups` Create ASCII-style pileups for all the desired annotated regions across -libraries with [**ASCII-style alignment pilueups**](#third-party-software-used). +libraries with [**ASCII-style alignment pileups**](#third-party-software-used). > A directory containing the ASCII-style pileups is created for each > library. If no BED file is provided, the pileups' output directories will > only contain an empty file. - **Input** - - Genome sequence file, trimmed IDs, `bgzip`ed (`.fa.bz`); from + - Genome sequence, trimmed IDs, `bgzip`ed (`.fa.bz`); from [**compress_reference_genome**](#compress_reference_genome) - - miRNA annotations with mapped genes(`.gff3`); from + - miRNA annotations, mapped chromosome name(s) (`.gff3`); from [**map_chr_names**](#map_chr_names) - - Alignments file (`.bam`); from + - (**Workflow output**) Alignments file, uncollapsed, sorted (`.bam`); from [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) - - `BAM` index file (`.bam.bai`); from + - (**Workflow output**) BAM index file (`.bam.bai`); used in [**index_uncollapsed_reads_bam**](#index_uncollapsed_reads_bam) - Annotated genomic regions (`.bed`); from workflow input files or [**create_empty_bed**](#create_empty_bed) - **Output** - - Empty text file (`.txt`) + - (**Workflow output**) Empty text file (`.txt`) #### `create_per_run_ascii_pileups` Create ASCII-style pileups for all the desired annotated regions for the whole -run with [**ASCII-style alignment pilueups**](#third-party-software-used). +run with [**ASCII-style alignment pileups**](#third-party-software-used). > If no BED file is provided, the pileups' output directory will only contain > an empty file. - **Input** - - Genome sequence file, trimmed IDs, `bgzip`ed (`.fa.bz`); from + - Genome sequence, trimmed IDs, `bgzip`ed (`.fa.bz`); from [**compress_reference_genome**](#compress_reference_genome) - - miRNA annotations with mapped genes(`.gff3`); from + - miRNA annotations, mapped chromosome name(s) (`.gff3`); from [**map_chr_names**](#map_chr_names) - - Alignments file (`.bam`); from + - (**Workflow output**) Alignments file, uncollapsed, sorted (`.bam`); from [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) - - `BAM` index file (`.bam.bai`); from + - (**Workflow output**) BAM index file (`.bam.bai`); used in [**index_uncollapsed_reads_bam**](#index_uncollapsed_reads_bam) - Annotated genomic regions (`.bed`); from workflow input files or [**create_empty_bed**](#create_empty_bed) - **Output** - - Empty text file (`.txt`) + - (**Workflow output**) Empty text file (`.txt`) #### `create_per_condition_ascii_pileups` Create ASCII-style pileups for all the desired annotated regions across the different library subsets if provided with -[**ASCII-style alignment pilueups**](#third-party-software-used). +[**ASCII-style alignment pileups**](#third-party-software-used). > **OPTIONAL RULE.** The ASCII-style pileups for each annotated region are > made if, and only if, at least one library subset is specified in the @@ -1252,21 +1697,21 @@ different library subsets if provided with > executed, and no output will be generated. - **Input** - - Genome sequence file, trimmed IDs, `bgzip`ed (`.fa.bz`); from + - Genome sequence, trimmed IDs, `bgzip`ed (`.fa.bz`); from [**compress_reference_genome**](#compress_reference_genome) - - miRNA annotations with mapped genes(`.gff3`); from + - miRNA annotations, mapped chromosome name(s) (`.gff3`); from [**map_chr_names**](#map_chr_names) - - Alignments file (`.bam`); from + - (**Workflow output**) Alignments file, uncollapsed, sorted (`.bam`); from [**sort_uncollapsed_reads_bam_by_position**](#sort_uncollapsed_reads_bam_by_position) - - `BAM` index file (`.bam.bai`); from + - (**Workflow output**) BAM index file (`.bam.bai`); used in [**index_uncollapsed_reads_bam**](#index_uncollapsed_reads_bam) - Annotated genomic regions (`.bed`); from workflow input files or [**create_empty_bed**](#create_empty_bed) - **Parameters** - **config_template.yaml** - - `lib_dict`: Subset(s) of library name(s), as specified in the samples' - table column `sample` and the subset identifier stored in a dictionary. - (default: None) + - `lib_dict`: Dictionary of arbitrary condition names (keys) and library + names to aggregate alignment pileups for (values; MUST correspond to names + in samples table) (default: None) - **Output** - Empty text file (`.txt`) @@ -1312,6 +1757,7 @@ different library subsets if provided with [license-gpl2]: [license-gpl3]: [license-mit]: +[oligomap-out]: [pub-bedtools]: [pub-cufflinks]: [pub-cutadapt]: