Skip to content

Latest commit

 

History

History
1768 lines (1353 loc) · 68.5 KB

pipeline_documentation.md

File metadata and controls

1768 lines (1353 loc) · 68.5 KB

MIRFLOWZ: workflow documentation

This document describes the individual steps of the workflow. For instructions on installation and usage please see here.

Table of Contents

Third-party software used

Tag lines were taken from the developers' websites (code repository or manual)

Name License Tag line More info
ASCII-style alignment pileups Apache 2.0 "Generates ASCII-style pileups of read alignments in one or more BAM files for one or more genomic regions." code
BEDTools GPLv2 "[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF" code / manual / publication
cufflinks BSL-1.0 "[...] assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples" code / manual / publication
cutadapt MIT "[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads" code / manual / publication
FASTX-Toolkit AGPL-3.0 "[...] collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing" code / manual
GFFUtils AFL-3 "[...] a small set of utility programs for working with GFF and GTF files" code / manual
Oligomap GPLv3 "[...] fast identification of nearly-perfect matches of small RNAs in sequence databases. It allows to exhaustively identify all the perfect and 1-error (where an error is defined to be a mismatch, insertion or deletion) matches of large sets of small RNAs to target sequences" code / publication
SAMtools MIT "[...] suite of programs for interacting with high-throughput sequencing data" code / manual / publication
segemehl GPLv3 "[...] map short sequencer reads to reference genomes" manual / publication

Description of workflow steps

The workflow consists of five Snakemake files: A main Snakefile and an individual Snakemake file each for the genome resources preparation, the reads mapping, the miRNA quantification and the ASCII-style pileups generation. The main Snakefile contains the configuration file validation and imports the various functional modules described below. Individual steps of the workflow are described briefly along with some examples, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table and the configuration file) are also described.

Rule graph

rule_graph

Visual representation of the workflow. Automatically prepared with Snakemake.

Preparatory

Read sample table

Requirements
  • Tab-separated values (.tsv) file
  • First row has to contain parameter names as in samples_table.tsv
  • First column used as sample identifiers
Parameter name Description Data type(s)
sample Arbitrary name for the miRNA sequence library. str
sample_file Path to the gzipped miRNA sequencing library file. The path must be relative to the directory where the workflow will be run. str
adapter Sequence of the 3'-end adapter used during library preparation. Required for Cutadapt. Use a value such as XXXXXXXXXX if no adapter is present or if no trimming is desired. str
format One of fa/fasta or fq/fastq, if the library file is in FASTA or FASTQ format, respectively. str

Configuration file

Some parameters within the workflow can be modified. Refer to the configuration template for a detailed explanation of each option.

Snakefile

finish

Target rule as required by Snakemake.

Local rule

Prepare workflow

finish_prepare

Target rule as required by Snakemake.

Local rule

trim_genome_seq_ids

Trim genome sequence IDs with a custom script.

extract_transcriptome_seqs

Create transcriptome from genomic sequence and annotations with cufflinks.

trim_transcriptome_seq_ids

Trim transcriptome sequence IDs with a custom script.

generate_segemehl_index_transcriptome

Generate transcriptome index for segemehl short read aligner.

The transcriptome index only needs to be generated once for each combination of transcriptome sequence and annotations, and sample sets.

generate_segemehl_index_genome

Generate genome index for segemehl short read aligner.

The genome index only needs to be generated once for each combination of annotations and sample sets.

get_exons_gtf

Retrieve exon annotations from genome annotations with a custom script.

  • Input
    • (Workflow input) Genome annotations, gziped (.gtf.gz)
  • Output

convert_exons_gtf_to_bed

Convert exon annotations .gtf to .bed with a custom script.

create_genome_header

Create SAM header for the genome with SAMtools.

Required by SAMtools to work with the alignment files.

map_chr_names

Map UCSC-like chromosome names with Ensembl-like ones in miRNA annotations with a custom script.

Required by BEDTools to intersect alignments with miRNA annotations. Several mapping tables are available here.

create_index_genome_fasta

Create a FASTA index for the genome with SAMtools.

extract_chr_len

Extract chromosome(s) length from the genome sequence.

Required to ensure that the extended annotations in generated in the extend_mirs_annotations rule do not exceed the chromosome(s) boundaries.

extend_mirs_annotations

Extend miRNA annotations, ensure feature names uniqueness and split the file by feature with a custom script.

Adjust miRNAs' 'Name' attribute to account for the different genomic locations the miRNA sequence is annotated on and ensure their uniqueness. The name format is SPECIES-mir-NAME-# for pri-miRs, and SPECIES-miR-NAME-#-ARM or SPECIES-miR-NAME-# for mature miRNA with both or just one arm respectively, where # is the replica integer. If a pri-miR has a replica but its number is set in the 'ID' attribute, the first instance does not has a suffix but the other one(s) do. If a precursor has no other occurrences, no further modifications are made. On the other hand, mature miRNA regions are extended on both sides to account for isomiR species with shifted start and/or end positions without exceeding chromosome(s) boundaries. If required, pri-miR loci are also extended to accommodate the new miRNA coordinates. In addition, pri-miR names are modified to record the final positions by appending _-y and _+x to them, where y is the 5' shift and x the 3' shift.

  • Input
    • miRNA annotations, mapped chromosome name(s) (.gff3); from map_chr_names
  • Parameters
    • config_template.yaml
      • extension: Number of nucleotides by which mature miRNA annotated regions are extended at most (default: 6)
  • Output
  • Examples
Example 1 | Extension | Mature miRNA extension

IN:
    pri-miR entry:
        19	.	miRNA_primary_transcript	2517	2614	.	+	.	ID=MI0003141;Alias=MI0003141;Name=hsa-mir-512-2
    mature miRNA entry:
        19	.	miRNA	2536	2558	.	+	.	ID=MIMAT0002822_1;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003141
    extension:
        6
OUT:
    pri-miR entry:
        19	.	miRNA_primary_transcript	2517	2614	.	+	.	ID=MI0003141;Alias=MI0003141;Name=hsa-mir-512-2_-0_+0
    mature miRNA entry:
        19	.	miRNA	2530	2564	.	+	.	ID=MIMAT0002822_1;Alias=MIMAT0002822;Name=hsa-miR-512-2-5p;Derives_from=MI0003141


Example 2 | Extension | Mature miRNA and pri-miR extension

IN:
    pri-miR entry:
        19	.	miRNA_primary_transcript	9	122	.	+	.	ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1
    mature miRNA entry:
        19	.	miRNA	12	74	.	+	.	ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140
    extension:
        6
OUT:
    pri-miR entry:
        19	.	miRNA_primary_transcript	6	122	.	+	.	ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1_-3_+0
    mature miRNA entry:
        19	.	miRNA	6	80	.	+	.	ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-1-5p;Derives_from=MI0003140


Example 3 | Extension | Matrue miRNA exceeding chromosome boundaries extension

IN:
    pri-miR entry:
        19	.	miRNA_primary_transcript	2	122	.	+	.	ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1
    mature miRNA entry:
        19	.	miRNA	3	74	.	+	.	ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140
    extension:
        6
OUT:
    pri-miR entry:
        19	.	miRNA_primary_transcript	1	122	.	+	.	ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1_-1_+0
    mature miRNA entry:
        19	.	miRNA	1	80	.	+	.	ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-1-5p;Derives_from=MI0003140


Example 4 | Name uniqueness | Replica number in the ID

IN:
    pri-miR entries:
        chr21	.	miRNA_primary_transcript	8206563	8206618	.	+	.	ID=MI0033425;Alias=MI0033425;Name=hsa-mir-10401
        chr21	.	miRNA_primary_transcript	8250772	8250827	.	+	.	ID=MI0033425_2;Alias=MI0033425;Name=hsa-mir-10401
    mature miRNA entries:
        chr21	.	miRNA	8206563	8206582	.	+	.	ID=MIMAT0041633;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425
        chr21	.	miRNA	8206598	8206618	.	+	.	ID=MIMAT0041634;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425
        chr21	.	miRNA	8250772	8250791	.	+	.	ID=MIMAT0041633_1;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425
        chr21	.	miRNA	8250807	8250827	.	+	.	ID=MIMAT0041634_1;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425
OUT:
    pri-miR entries:
        chr21	.	miRNA_primary_transcript	8206563	8206618	.	+	.	ID=MI0033425;Alias=MI0033425;Name=hsa-mir-10401
        chr21	.	miRNA_primary_transcript	8250772	8250827	.	+	.	ID=MI0033425_2;Alias=MI0033425;Name=hsa-mir-10401-2
    mature miRNA entries:
        chr21	.	miRNA	8206563	8206582	.	+	.	ID=MIMAT0041633;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425
        chr21	.	miRNA	8206598	8206618	.	+	.	ID=MIMAT0041634;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425
        chr21	.	miRNA	8250772	8250791	.	+	.	ID=MIMAT0041633_1;Alias=MIMAT0041633;Name=hsa-miR-10401-2-5p;Derives_from=MI0033425
        chr21	.	miRNA	8250807	8250827	.	+	.	ID=MIMAT0041634_1;Alias=MIMAT0041634;Name=hsa-miR-10401-2-3p;Derives_from=MI0033425


Example 5 | Name uniqueness | Replica number in the Name; single mature arm

IN:
    pri-miR entries:
        chr21	.	miRNA_primary_transcript	8205315	8205406	.	+	.	ID=MI0022559;Alias=MI0022559;Name=hsa-mir-6724-1
        chr21	.	miRNA_primary_transcript	8249505	8249596	.	+	.	ID=MI0031516;Alias=MI0031516;Name=hsa-mir-6724-2
    mature miRNA entries:
        chr21	.	miRNA	8205325	8205347	.	+	.	ID=MIMAT0025856;Alias=MIMAT0025856;Name=hsa-miR-6724-5p;Derives_from=MI0022559
        chr21	.	miRNA	8249515	8249537	.	+	.	ID=MIMAT0025856_1;Alias=MIMAT0025856;Name=hsa-miR-6724-5p;Derives_from=MI0031516
OUT:
    pri-miR entries:
        chr21	.	miRNA_primary_transcript	8205315	8205406	.	+	.	ID=MI0022559;Alias=MI0022559;Name=hsa-mir-6724-1
        chr21	.	miRNA_primary_transcript	8249505	8249596	.	+	.	ID=MI0031516;Alias=MI0031516;Name=hsa-mir-6724-2
    mature miRNA entries:
        chr21	.	miRNA	8205325	8205347	.	+	.	ID=MIMAT0025856;Alias=MIMAT0025856;Name=hsa-miR-6724-1-5p;Derives_from=MI0022559
        chr21	.	miRNA	8249515	8249537	.	+	.	ID=MIMAT0025856_1;Alias=MIMAT0025856;Name=hsa-miR-6724-2-5p;Derives_from=MI0031516


Example 6 | Name uniqueness | Both mature miRNA arms but just one with the replica number

IN:
    pri-miR entries:
        chr2	.	miRNA_primary_transcript	135665397	135665478	.	+	.	ID=MI0000447;Alias=MI0000447;Name=hsa-mir-128-1
        chr3	.	miRNA_primary_transcript	35744476	35744559	.	+	.	ID=MI0000727;Alias=MI0000727;Name=hsa-mir-128-2
    mature miRNA entries:
        chr2	.	miRNA	135665446	135665466	.	+	.	ID=MIMAT0000424;Alias=MIMAT0000424;Name=hsa-miR-128-3p;Derives_from=MI0000447
        chr2	.	miRNA	135665411	135665433	.	+	.	ID=MIMAT0026477;Alias=MIMAT0026477;Name=hsa-miR-128-1-5p;Derives_from=MI0000447
        chr3	.	miRNA	35744527	35744547	.	+	.	ID=MIMAT0000424_1;Alias=MIMAT0000424;Name=hsa-miR-128-3p;Derives_from=MI0000727
        chr3	.	miRNA	35744490	35744512	.	+	.	ID=MIMAT0031095;Alias=MIMAT0031095;Name=hsa-miR-128-2-5p;Derives_from=MI0000727
OUT:
    pri-miR entries:
        chr2	.	miRNA_primary_transcript	135665397	135665478	.	+	.	ID=MI0000447;Alias=MI0000447;Name=hsa-mir-128-1
        chr3	.	miRNA_primary_transcript	35744476	35744559	.	+	.	ID=MI0000727;Alias=MI0000727;Name=hsa-mir-128-2
    mature miRNA entries:
        chr2	.	miRNA	135665446	135665466	.	+	.	ID=MIMAT0000424;Alias=MIMAT0000424;Name=hsa-miR-128-1-3p;Derives_from=MI0000447
        chr2	.	miRNA	135665411	135665433	.	+	.	ID=MIMAT0026477;Alias=MIMAT0026477;Name=hsa-miR-128-1-5p;Derives_from=MI0000447
        chr3	.	miRNA	35744527	35744547	.	+	.	ID=MIMAT0000424_1;Alias=MIMAT0000424;Name=hsa-miR-128-2-3p;Derives_from=MI0000727
        chr3	.	miRNA	35744490	35744512	.	+	.	ID=MIMAT0031095;Alias=MIMAT0031095;Name=hsa-miR-128-2-5p;Derives_from=MI0000727

Map workflow

finish_map

Target rule as required by Snakemake.

Local rule

start

Copy and rename read files.

Local rule. Depending on the library file format, the output file undergoes a quality filter (fa/.fastq) or is directly formatted (.fa/.fasta).

  • Input
    • (Workflow input) miRNA sequencing library, gziped (.fa.gz/.fasta.gz or .fq.gz/.fastq.gz)
  • Output

fastq_quality_filter

Conduct quality control for reads library with fastx_toolkit.

  • Input
    • miRNA sequencing library, copied, renamed (.fastq); from start
  • Parameters
    • config_template.yaml
      • q_value: Minimum Q (Phred) score to keep (default: 10)
      • p_value: Minimum % of bases that must have a Q (Phred) quality (default: 50)
  • Output

fastq_to_fasta

Convert reads file from FASTQ to FASTA with fastx_toolkit.

Sequence identifiers are renamed to numbers.

format_fasta

Format read's sequences to appear on a single line with fastx_toolkit.

remove_adapters

Trim 3' adapters and N bases at either end. Filter reads by minimum length and number of inner N bases with cutadapt.

  • Input
    • miRNA sequencing library, formatted (.fasta); from format_fasta
  • Parameters
    • samples.tsv
      • Adapter to be removed; specified in the sample's table column adapter
    • config_template.yaml
      • error_rate: Fraction of allowed errors in the matched adapters (default: 0.1)
      • overlap: Minimum overlap length between adapter and read to trim the bases (default: 3)
      • minimum_length: Minimum length for a processed read to be kept (default: 15)
      • max_n: Maximum number of inner N bases for a processed read to be kept (default: 0)
  • Output

collapse_identical_reads

Collapse and rename identical reads fastx_toolkit.

Sequences are renamed in the format R-N, where R is the assigned number to the unique entry, and N is the amount of identical sequences within the library collapsed in it.

map_genome_segemehl

Align short reads to reference genome with segemehl.

map_transcriptome_segemehl

Align short reads to reference transcriptome with segemehl.

filter_fasta_for_oligomap

Filter reads by length with a custom script.

Required for an optimal mapping speed. Oligomap is specifically written for short reads. Therefore, reads with more bases than the default maximum (30 nts) makes the mapping slower.

map_genome_oligomap

Align short reads to reference genome with oligomap.

Refer to Oligomap's Output format section for a specific explanation and examples on the output format.

sort_genome_oligomap

Sort oligomap alignments by query name with a custom script.

convert_genome_to_sam_oligomap

Convert aligned reads .oligomap to .sam and filter alignments by number of hits with a custom script.

If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count 1/N where N is the number of genomic loci it aligns to and a large N makes the contribution negligible.

  • Input
  • Parameters
    • config_template.yaml
      • nh: Maximum number of mappings per read to be kept (default: 100)
  • Output

map_transcriptome_oligomap

Align short reads to reference transcriptome with oligomap.

Refer to Oligomap's Output format section for a specific explanation and examples on the output format.

sort_transcriptome_oligomap

Sort oligomap alignments by query name with a custom script.

convert_transcriptome_to_sam_oligomap

Convert aligned reads .oligomap to .sam and filter alignments by number of hits with a custom script.

If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count 1/N where N is the number of genomic loci it aligns to and a large N makes the contribution negligible.

merge_genome_maps

Concatenate segemehl and oligomap genome alignments.

merge_transcriptome_maps

Concatenate segemehl and oligomap transcriptome alignments.

filter_genome_by_nh

Filter merged genome alignments by the number of hits with a custom script.

If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count 1/N where N is the number of genomic loci it aligns to and a large N makes the contribution negligible.

  • Input
  • Parameters
    • config_template.yaml
      • nh: Maximum number of mappings per read to be kept (default: 100)
  • Output

filter_transcriptome_by_nh

Filter merged transcriptome alignments by the number of hits with a custom script.

If a read has been aligned beyond a specified threshold, it is removed due to (1) performance reasons as the file size can rapidly increase, and (2) the fact that each read contributes to each count 1/N where N is the number of genomic loci it aligns to and a large N makes the contribution negligible.

remove_header_genome_mappings

Remove the SAM header of the genome alignments file with SAMtools.

remove_header_transcriptome_mappings

Remove the SAM header of the transcriptome alignments file with SAMtools.

transcriptome_to_genome_maps

Convert the alignments' transcriptome coordinates to genomic ones with a custom script.

merge_all_maps

Concatenate the four alignment files into a single one.

add_header_all_maps

Add the SAM header to the aligned reads merged file with SAMtools.

sort_maps_by_id

Sort alignments by reads ID with SAMtools.

remove_inferiors

Remove duplicate and inferior alignments with a custom script.

Alignments are considered to be duplicates if having identical entries for the fields QNAME, FLAG, RNAME, POS and CIGAR. Alignments are considered to be inferiors if having the same QNAME and a bigger edit distance than the smaller one within the group. The tags NH (number of hits) and HI (query hit index) are updated accordingly.

Example 1 | Remove duplicates

IN:
    1-2	0	19	44414	1	21M	*	0	0	GAAGGCGCTTCCCTTTGGAGT	*	HI:i:0	NH:i:1	NM:i:0	MD:Z:21	RG:Z:A1	YZ:Z:0
    1-2	0	19	44414	255	21M	*	0	0	GAAGGCGCTTCCCTTTGGAGT	*	NM:i:0	MD:Z:21	NH:i:1
OUT:
    1-2	0	19	44414	255	21M	*	0	0	GAAGGCGCTTCCCTTTGGAGT	*	MD:Z:21	NH:i:1	NM:i:0


Example 2 | Remove inferiors single alignment

IN:
    1-704	16	19	207362	1	18M	*	0	0	CCCGGGCCCGGCGCGCCG	*	HI:i:0	NH:i:2	NM:i:0	MD:Z:18	RG:Z:A1	YZ:Z:0
    1-704	272	19	471264	1	16M1I1M	*	0	0	CCCGGGCCCGGCGCGCCG	*	HI:i:1	NH:i:2	NM:i:2	MD:Z:11G5	RG:Z:A1	YZ:Z:0
OUT:
    1-704	16	19	207362	1	18M	*	0	0	CCCGGGCCCGGCGCGCCG	*	HI:i:0	NH:i:1	NM:i:0	MD:Z:18	RG:Z:A1	YZ:Z:0


Example 3 | Remove inferiors multiple alignments

IN:

    1-1197	0	19	56327	1	15M	*	0	0	TATGGCACTGGTAGA	*	HI:i:0	NH:i:4	NM:i:2	MD:Z:1C11T1	RG:Z:A1	YZ:Z:0
    1-1197	256	19	68983	1	15M	*	0	0	TATGGCACTGGTAGA	*	HI:i:1	NH:i:4	NM:i:3	MD:Z:1C10AT1	RG:Z:A1	YZ:Z:0
    1-1197	256	19	76967	1	15M	*	0	0	TATGGCACTGGTAGA	*	HI:i:2	NH:i:4	NM:i:2	MD:Z:1C11T1	RG:Z:A1	YZ:Z:0
    1-1197	256	19	92363	1	15M	*	0	0	TATGGCACTGGTAGA	*	HI:i:4	NH:i:4	NM:i:3	MD:Z:1C11TA	RG:Z:A1	YZ:Z:0

OUT:
    1-1197	0	19	56327	1	15M	*	0	0	TATGGCACTGGTAGA	*	HI:i:0	NH:i:2	NM:i:2	MD:Z:1C11T1	RG:Z:A1	YZ:Z:0
    1-1197	256	19	76967	1	15M	*	0	0	TATGGCACTGGTAGA	*	HI:i:1	NH:i:2	NM:i:2	MD:Z:1C11T1	RG:Z:A1	YZ:Z:0

filter_by_indels

Filter multimappers favoring mismatches over InDels with a custom script.

Under the assumption that InDels are less frequent than mismatches only those alignments (of the same read with the same edit distance) with the lowest number of InDels are kept. This approach allows the presence of multimappers and/or InDels after the filtering if the alignments contain the same proportion of mismatches vs. InDels.

Example 1 | Different proportion of mismatches vs. InDels

IN:
    1-1	16	19	77595	255	14M1D8M	*	0	0	GCAGGAGAATCACTGATGTCAG	*	MD:Z:14^T2A1C3	NH:i:2	NM:i:3	XA:Z:Q	XI:i:1
    1-1	0	19	330456	255	4M1D1M1I3M1D13M	*	0	0	CTGACATCAGTGATTCTCCTGC	*	MD:Z:4^G4^A13	NH:i:2	NM:i:3	XA:Z:Q	XI:i:0
OUT:
    1-1	16	19	77595	255	14M1D8M	*	0	0	GCAGGAGAATCACTGATGTCAG	*	MD:Z:14^T2A1C3	NM:i:3	XA:Z:Q	XI:i:1	NH:i:1	HI:i:1


Example 2 | Equal proportion of mismatches vs. InDels

IN:
    1-1	0	19	142777	255	15M1I5M	*	0	0	GCTAGGTGGGAGGCTTGAAGC	*	MD:Z:4C0T14	NH:i:3	NM:i:3	XA:Z:Q	XI:i:0
    1-1	16	19	270081	255	6M1I14M	*	0	0	GCTTCAAGCCTCCCACCTAGC	*	MD:Z:14G0G4	NH:i:3	NM:i:3	XA:Z:Q	XI:i:2
    1-1	16	19	545543	255	6M1I14M	*	0	0	GCTTCAAGCCTCCCACCTAGC	*	MD:Z:14A0G4	NH:i:3	NM:i:3	XA:Z:Q	XI:i:1
OUT:
    1-1	0	19	142777	255	15M1I5M	*	0	0	GCTAGGTGGGAGGCTTGAAGC	*	MD:Z:4C0T14	NH:i:3	NM:i:3	XA:Z:Q	XI:i:0
    1-1	16	19	270081	255	6M1I14M	*	0	0	GCTTCAAGCCTCCCACCTAGC	*	MD:Z:14G0G4	NH:i:3	NM:i:3	XA:Z:Q	XI:i:2
    1-1	16	19	545543	255	6M1I14M	*	0	0	GCTTCAAGCCTCCCACCTAGC	*	MD:Z:14A0G4	NH:i:3	NM:i:3	XA:Z:Q	XI:i:1

convert_all_alns_sam_to_bam

Convert alignments .sam file to .bam with SAMtools.

Required by BEDTools to intersect alignments with pri-miR annotations.

sort_all_alns_bam_by_position

Sort alignments by position with SAMtools.

Required by BEDTools to intersect alignments with pri-miR annotations more efficiently.

index_all_alns_bam

Create index BAM file with SAMtools.

Indexing is required by genome viewers such as IGV to quickly display alignments in a genomic region of interest.

Quantify workflow

finish_quantify

Target rule as required by Snakemake.

Local rule

intersect_extendend_primir

Intersect the aligned reads with the extended pri-miR annotations with BEDTools.

Only those alignments fully intersecting a (possibly extended) pri-miR annotated region are kept.

filter_sam_by_intersecting_primir

Remove alignments that do not intersect with any pri-miR with SAMtools.

Required to only intersect alignments within a (possibly extended) pri-miR locus.

convert_intersecting_primir_sam_to_bam

Convert alignments .sam file to .bam with SAMtools.

Required by BEDTools to intersect alignments with miRNA annotations.

sort_intersecting_primir_bam_by_position

Sort alignments by position with SAMtools.

Required by BEDTools to intersect alignments with miRNA annotations more efficiently.

index_intersecting_primir_bam

Create index BAM file with SAMtools.

Indexing is required by genome viewers such as IGV to quickly display alignments in a genomic region of interest.

intersect_extended_mirna

Intersect the aligned reads with the extended miRNA annotations with BEDTools.

Only those alignments fully intersecting an extended mature miRNA annotated region are kept.

filter_sam_by_intersecting_mirna

Remove alignments that do not intersect with any miRNA with SAMtools.

Required to efficiently classify the alignments.

add_intersecting_mirna_tag

Classify and add the intersecting (iso)miR to each alignment as a tag with a custom script.

In this step, the mature miRNA annotated regions are used instead of the extended ones. Each alignment gets an extra tag (YW:Z) with either the (iso)miR(s) it is considered to really intersect with or an empty string otherwise. The format of the intersecting mature miRNA species is miRNA_name|5p-shift|3p-shift|CIGAR|MD, where 5p-shift and 3p-shift are the difference between the miRNA start and end coordinates and the alignment's ones respectively.

Example 1 | Intersecting a canoncial mature miRNA

IN miRNA annotations:
    chr19	.	miRNA	44377	44398	.	+	.	ID=MIMAT0002849;Alias=MIMAT0002849;Name=hsa-miR-524-5p;Derives_from=MI0003160
IN SAM record:
    1-1_1	0	19	44377	255	22M	*	0	0	CTACAAAGGGAAGCACTTTCTC	*	MD:Z:22	NH:i:1	NM:i:0
NEW TAG:
	YW:Z:hsa-miR-524-5p|0|0|22M|22


Example 2 | Intersecting an isomiR (no shifts)

IN miRNA annotations:
    chr19	.	miRNA	44377	44398	.	+	.	ID=MIMAT0002849;Alias=MIMAT0002849;Name=hsa-miR-524-5p;Derives_from=MI0003160
IN SAM record:
    1-1_1	0	19	44377	1	11M3I11M	*	0	0	CTACAAAGGGAGGTAGCACTTTCTC	*	HI:i:0	MD:Z:22	NH:i:1	NM:i:3
NEW TAG:
    YW:Z:hsa-miR-524-5p|0|0|11M3I11M|22


Example 3 | Intersecting an isomiR (no InDels nor mismatches)

IN miRNA annotations:
    chr19	.	miRNA	5338	5359	.	+	.	ID=MIMAT0005795;Alias=MIMAT0005795;Name=hsa-miR-1323;Derives_from=MI0003786
IN SAM record:
    1-1_1	0	19	5338	255	21M	*	0	0	TCAAAACTGAGGGGCATTTTC	*	MD:Z:21	NH:i:1	NM:i:0
NEW TAG:
    YW:Z:hsa-miR-1323|0|-1|21M|21


Example 4 | Not intersecting an (iso)miR

IN miRNA annotations:
    chr19	.	miRNA	5338	5359	.	+	.	ID=MIMAT0005795;Alias=MIMAT0005795;Name=hsa-miR-1323;Derives_from=MI0003786
IN SAM record:
    1-1_1	0	19	5338	255	21M	*	0	0	TCAAAACTGAGGGGCATTTTC	*	MD:Z:21	NH:i:1	NM:i:0
NEW TAG:
    YW:Z:

sort_intersecting_mirna_by_feat_tag

Sort the alignments by the tag containing the classified intersecting (iso)miR with SAMtools.

Required for an efficient quantification.

quantify_mirna

Tabulate alignments according to its new tag (YW:Z) with a custom script.

Each alignment contributes to the miRNA species in its YW:Z tag by R/N, where R is the number of collapsed reads in that alignment, and N is the number of genomic and/or transcriptomic loci it aligns to. The values of both, R and N are inferred from the sequence name which follows the format ID-R_N. The resulting table has a row for each mature miRNA species (isomiR, canonical miRNA or both) with the name format set in add_intersecting_mirna_tag unless considered a canonical miRNA, which only keeps the annotated mature miRNA name. A miRNA species is considered to be canonical if there are no shifts between its start and end positions and the aligned read ones, and there are no mismatches nor InDels.

  • Input
  • Parameters
    • samples.tsv
      • Library name; specified in the sample's table column sample
    • config_template.yaml
      • mir_list: miRNA features to be quantified (default: isomir, mirna pri-miR)
  • Output
  • Examples
Example 1 | Canonical miRNA and isomiR

IN SAM record:
    10-4_2	0	19	34627	255	21M	*	0	0	AAAGTGCTTCCTTTTAGAGGG	*	MD:Z:21	NM:i:0	NH:i:2	HI:i:1	YW:Z:hsa-miR-520b-3p|0|0|21M|21
    10-4_2	0	19	40866	255	21M	*	0	0	AAAGTGCTTCCTTTTAGAGGG	*	MD:Z:21	NM:i:0	NH:i:2	HI:i:2	YW:Z:hsa-miR-520c-3p|0|-1|21M|21

Data:
    Alignment:
        Read ID: 10
        Number of collapsed reads: 4
        Number of mapped genomic loci: 2
        Contribution: 4/2 = 2

    miRNA species:
        Tag name: hsa-miR-520b-3p|0|0|21M|21
        Type: Canonical
        Table name: hsa-miR-520b-3p
        Total count: 2

        Tag name: hsa-miR-520c-3p|0|-1|21M|21
        Type: isomiR
        Table name: hsa-miR-520c-3p|0|-1|21M|21
        Total count: 2

OUT table:
    ID	                        lib_name
    hsa-miR-520b-3p         	2
    hsa-miR-520c-3p|0|-1|21M|21	2


Example 2 | Different isomiRs

IN SAM record:
    599-1_3	0	19	27804	255	20M	*	0	0	AAAGTGCTTCCTTTTAGAGG	*	MD:Z:20	NM:i:0	NH:i:3	HI:i:1	YW:Z:hsa-miR-526b-3p|1|-1|20M|20
    599-1_3	0	19	34627	255	20M	*	0	0	AAAGTGCTTCCTTTTAGAGG	*	MD:Z:20	NM:i:0	NH:i:3	HI:i:2	YW:Z:hsa-miR-520b-3p|0|-1|20M|20
    599-1_3	0	19	40866	255	20M	*	0	0	AAAGTGCTTCCTTTTAGAGG	*	MD:Z:20	NM:i:0	NH:i:3	HI:i:3	YW:Z:hsa-miR-520c-3p|0|-2|20M|20

Data:
    Alignment:
        Read ID: 599
        Number of collapsed reads: 1
        Number of mapped genomic loci: 3
        Contribution: 1/3 = 0.33

    miRNA species:
        Tag name: hsa-miR-526b-3p|1|-1|20M|20
        Type: isomiR
        Table name: hsa-miR-526b-3p|1|-1|20M|20
        Total count: 0.33

        Tag name: hsa-miR-520b-3p|0|-1|20M|20 
        Type: isomiR
        Table name: hsa-miR-520b-3p|0|-1|20M|20
        Total count: 0.33

        Tag name: hsa-miR-520c-3p|0|-2|20M|20
        Type: isomiR
        Table name: hsa-miR-520c-3p|0|-2|20M|20 
        Total count: 0.33
                      
OUT table:            
    ID	                        lib_name
    hsa-miR-520b-3p|0|-1|20M|20	0.33
    hsa-miR-520c-3p|0|-2|20M|20	0.33
    hsa-miR-526b-3p|1|-1|20M|20	0.33

quantify_primir

Tabulate alignments according to its intersecting pri-miR with a custom script

Each alignment contributes to the pri-miR it intersects with by R/N, where R is the number of collapsed reads in that alignment, and N is the number of genomic and/or transcriptomic loci it aligns to. The values of both, R and N are inferred from the sequence name which follows the format ID-R_N. The resulting table has a row for each pri-miR with the name format set in mirna_extension.

Example 1 | One single pri-miR with different alignments

IN BED records:
    19	.	miRNA_primary_transcript	27766	27788	.	+	.	ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0	19	27765	27788	68-2_1	255	+
    19	.	miRNA_primary_transcript	27766	27787	.	+	.	ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0	19	27765	27787	316-1_7	1	+
    19	.	miRNA_primary_transcript	27804	27823	.	+	.	ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0	19	27803	27823	599-1_3	255	+
    19	.	miRNA_primary_transcript	27805	27822	.	+	.	ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0	19	27804	27822	226-1_4	1	+

Alignments:
    Read ID: 68
    Number of collapsed reads: 2
    Number of mapped genomic loci: 1
    Contribution: 2/1 = 2

    Read ID: 316
    Number of collapsed reads: 1
    Number of mapped genomic loci: 7
    Contribution: 1/7 = 0.143

    Read ID: 599
    Number of collapsed reads: 1
    Number of mapped genomic loci: 3
    Contribution: 1/3 = 0.33

    Read ID: 226
    Number of collapsed reads: 1
    Number of mapped genomic loci: 4
    Contribution: 1/4 = 0.25

OUT table:            
    ID	                lib_name
    hsa-mir-526b_-0_+0	2.723


Example 2 | Different pri-miRs for a single read

IN BED records:
    19	.	miRNA_primary_transcript	40866	40886	.	+	.	ID=MI0003158;Alias=MI0003158;Name=hsa-mir-520c_-0_+0	19	40865	40886	10-4_2	255	+
    19	.	miRNA_primary_transcript	34627	34647	.	+	.	ID=MI0003155;Alias=MI0003155;Name=hsa-mir-520b_-5_+6	19	34626	34647	10-4_2	255	+

Alignment:
    Read ID: 10
    Number of collapsed reads: 4
    Number of mapped genomic loci: 2
    Contribution: 4/2 = 2

OUT table:            
    ID	                lib_name
    hsa-mir-520c_-0_+0	2
    hsa-mir-520b_-5_+6	2

merge_tables

Merge all the tables from the different libraries into a single one with a custom script.

The final table(s) containing the counting data from all libraries for the (iso)miRs and/or pri-miRs have a row per miRNA species and a column per sample library. If a miRNA species is not found in a certain library, its value is set to NA.

  • Input
  • Parameters
    • cluster_schema.json
      • mir_list: miRNA features to be quantified (default: isomir, mirna pri-mir)
  • Output
    • (Workflow output) (iso)miR and/or pri-miR counts table (.tab)
  • Example
IN library 1
    ID                              lib_1
    hsa-miR-524-5p          	    1	 
    hsa-miR-524-5p|0|0|22M|9G12	    1    
    hsa-miR-524-5p|0|0|22M|9G9C2	1    

IN library 2
    ID                              lib_2
    hsa-miR-524-5p          	    1
    hsa-miR-1283	                1
    hsa-miR-1283|-1|-2|21M|21	    1

IN library 3
    ID                              lib_3

OUT table
    ID                              lib_1  lib_2  lib_3
    hsa-miR-524-5p          	    1	    1       NA
    hsa-miR-524-5p|0|0|22M|9G12	    1   	NA      NA
    hsa-miR-524-5p|0|0|22M|9G9C2	1   	NA      NA
    hsa-miR-1283	                NA	    1       NA
    hsa-miR-1283|-1|-2|21M|21	    NA	    1       NA

uncollapse_reads

Reverse the collapsing of reads with identical sequences as done with FASTX-Toolkit with a custom script.

convert_uncollapsed_reads_sam_to_bam

Convert alignments .sam file to .bam with SAMtools.

sort_uncollapsed_reads_bam_by_position

Sort alignments by position with SAMtools.

index_uncollapsed_reads_bam

Create index BAM file with SAMtools.

Indexing is required by genome viewers such as IGV to quickly display alignments in a genomic region of interest.

Pileup workflow

finish_pileup

Target rule as required by Snakemake.

Local rule

create_empty_bed

Create an empty BED file if the user has not provided one.

OPTIONAL RULE. This rule will be executed if, and only if, the user has not provided a BED file in the configuration file with the regions the ASCII-style alignment pileups must be performed on.

compress_reference_genome

Compress the processed genome with trimmed IDs using bgzip with SAMtools.

Required to perform the ASCII-style alignment pileups.

create_per_library_ascii_pileups

Create ASCII-style pileups for all the desired annotated regions across libraries with ASCII-style alignment pileups.

A directory containing the ASCII-style pileups is created for each library. If no BED file is provided, the pileups' output directories will only contain an empty file.

create_per_run_ascii_pileups

Create ASCII-style pileups for all the desired annotated regions for the whole run with ASCII-style alignment pileups.

If no BED file is provided, the pileups' output directory will only contain an empty file.

create_per_condition_ascii_pileups

Create ASCII-style pileups for all the desired annotated regions across the different library subsets if provided with ASCII-style alignment pileups.

OPTIONAL RULE. The ASCII-style pileups for each annotated region are made if, and only if, at least one library subset is specified in the configuration file. Otherwise, this rule will not be executed, and no output will be generated.

  • Input
  • Parameters
    • config_template.yaml
      • lib_dict: Dictionary of arbitrary condition names (keys) and library names to aggregate alignment pileups for (values; MUST correspond to names in samples table) (default: None)
  • Output
    • Empty text file (.txt)