This document describes the individual steps of the workflow. For instructions on installation and usage please see here.
- Table of Contents
- Third-party software used
- Description of workflow steps
- Rule graph
- Preparatory
- Snakefile
- Prepare workflow
- Map workflow
finish_map
start
fastq_quality_filter
fastq_to_fasta
format_fasta
remove_adapters
collapse_identical_reads
map_genome_segemehl
map_transcriptome_segemehl
filter_fasta_for_oligomap
map_genome_oligomap
sort_genome_oligomap
convert_genome_to_sam_oligomap
map_transcriptome_oligomap
sort_transcriptome_oligomap
convert_transcriptome_to_sam_oligomap
merge_genome_maps
merge_transcriptome_maps
filter_genome_by_nh
filter_transcriptome_by_nh
remove_header_genome_mappings
remove_header_transcriptome_mappings
transcriptome_to_genome_maps
merge_all_maps
add_header_all_maps
sort_maps_by_id
remove_inferiors
filter_by_indels
convert_all_alns_sam_to_bam
sort_all_alns_bam_by_position
index_all_alns_bam
- Quantify workflow
finish_quantify
intersect_extended_primir
filter_sam_by_intersecting_primir
convert_intersecting_primir_sam_to_bam
sort_intersecting_primir_bam_by_position
index_intersecting_primir_bam
intersect_extended_mirna
filter_sam_by_intersecting_mirna
add_intersecting_mirna_tag
sort_intersecting_mirna_by_feat_tag
quantify_mirna
quantify_primir
merge_tables
uncollapse_reads
convert_uncollapse_reads_sam_to_bam
sort_uncollapse_reads_bam_by_position
index_uncollapse_reads_bam
- Pileup workflow
Tag lines were taken from the developers' websites (code repository or manual)
Name | License | Tag line | More info |
---|---|---|---|
ASCII-style alignment pileups | Apache 2.0 | "Generates ASCII-style pileups of read alignments in one or more BAM files for one or more genomic regions." | code |
BEDTools | GPLv2 | "[...] intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF" | code / manual / publication |
cufflinks | BSL-1.0 | "[...] assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples" | code / manual / publication |
cutadapt | MIT | "[...] finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads" | code / manual / publication |
FASTX-Toolkit | AGPL-3.0 | "[...] collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing" | code / manual |
GFFUtils | AFL-3 | "[...] a small set of utility programs for working with GFF and GTF files" | code / manual |
Oligomap | GPLv3 | "[...] fast identification of nearly-perfect matches of small RNAs in sequence databases. It allows to exhaustively identify all the perfect and 1-error (where an error is defined to be a mismatch, insertion or deletion) matches of large sets of small RNAs to target sequences" | code / publication |
SAMtools | MIT | "[...] suite of programs for interacting with high-throughput sequencing data" | code / manual / publication |
segemehl | GPLv3 | "[...] map short sequencer reads to reference genomes" | manual / publication |
The workflow consists of five Snakemake files: a main `Snakefile` and an individual Snakemake file each for the genome resources preparation, the reads mapping, the miRNA quantification and the ASCII-style pileups generation. The main `Snakefile` contains the configuration file validation and imports the various functional modules described below. Individual steps of the workflow are described briefly along with some examples, and links to the respective software manuals are given. Parameters that can be modified by the user (via the samples table and the configuration file) are also described.
Visual representation of the workflow. Automatically prepared with Snakemake.
- Tab-separated values (`.tsv`) file
- First row has to contain parameter names as in `samples_table.tsv`
- First column used as sample identifiers
Parameter name | Description | Data type(s) |
---|---|---|
sample | Arbitrary name for the miRNA sequence library. | str |
sample_file | Path to the `gzip`ped miRNA sequencing library file. The path must be relative to the directory where the workflow will be run. | str |
adapter | Sequence of the 3'-end adapter used during library preparation. Required for Cutadapt. Use a value such as `XXXXXXXXXX` if no adapter is present or if no trimming is desired. | str |
format | One of `fa`/`fasta` or `fq`/`fastq`, if the library file is in FASTA or FASTQ format, respectively. | str |
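
As an illustration, a samples table with the columns above could be read and checked as follows. This is only a sketch: the `read_samples_table` helper, the example rows, the file paths and the second adapter sequence are hypothetical and not taken from the workflow.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample", "sample_file", "adapter", "format"]
ALLOWED_FORMATS = {"fa", "fasta", "fq", "fastq"}


def read_samples_table(handle):
    """Parse a tab-separated samples table into a dict keyed by sample name."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Missing column(s): {', '.join(missing)}")
    samples = {}
    for row in reader:
        if row["format"] not in ALLOWED_FORMATS:
            raise ValueError(f"Unknown format for {row['sample']}: {row['format']}")
        samples[row["sample"]] = row
    return samples


# Hypothetical table content; names, paths and adapters are made up.
example = (
    "sample\tsample_file\tadapter\tformat\n"
    "lib_1\tdata/lib_1.fa.gz\tXXXXXXXXXX\tfa\n"
    "lib_2\tdata/lib_2.fastq.gz\tAACTGTAGGCACCATCAAT\tfastq\n"
)
samples = read_samples_table(io.StringIO(example))
print(sorted(samples))
```

Keying the result by the first column mirrors the rule that the first column is used as the sample identifier.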
Some parameters within the workflow can be modified. Refer to the configuration template for a detailed explanation of each option.
Target rule as required by Snakemake.
Local rule
- Input
  - (Workflow output) SAM file with the pri-miR intersecting alignments; from filter_sam_by_intersecting_primir
  - (Workflow output) SAM file with the mature miRNA intersecting alignments; from filter_sam_by_intersecting_mirna
  - (Workflow output) (iso)miR and/or pri-miR counts table (`.tab`); from merge_tables
  - (Workflow output) BAM file with the contributing alignments, sorted; from sort_uncollapsed_reads_bam_by_position
  - (Workflow output) BAM index file (`.bam.bai`); from index_uncollapsed_reads_bam
  - (Workflow output) Empty text file (`.txt`); from create_per_library_ascii_pileups, create_per_run_ascii_pileups and/or create_per_condition_ascii_pileups
Target rule as required by Snakemake.
Local rule
- Input
  - segemehl genome index (`.idx`); from generate_segemehl_index_genome
  - segemehl transcriptome index (`.idx`); from generate_segemehl_index_transcriptome
  - Exon annotations (`.bed`); from convert_exons_gtf_to_bed
  - SAM header (`.sam`); from create_genome_header
  - Tab-separated table mapping chromosome name(s) and length(s) (`.tsv`); from extract_chr_len
  - Primary miRNA transcript (pri-miR) extended annotations (`.gff3`); from extend_mirs_annotations
  - Mature miRNA (miRNA) extended annotations (`.gff3`); from extend_mirs_annotations
Trim genome sequence IDs with a custom script.
- Input
  - (Workflow input) Genome sequence, `gzip`ed (`.fa.gz`/`.fasta.gz`)
- Output
  - Genome sequence, trimmed IDs (`.fa`); used in extract_transcriptome_seqs, create_genome_header, create_index_genome_fasta, generate_segemehl_index_genome, mapping_genome_segemehl, mapping_genome_oligomap and compress_reference_genome
Create transcriptome from genomic sequence and annotations with cufflinks.
- Input
  - (Workflow input) Genome annotations, `gzip`ed (`.gtf.gz`)
  - Genome sequence, trimmed IDs (`.fa`); from trim_genome_seq_ids
- Output
  - Transcriptome sequence (`.fa`); used in trim_transcriptome_seq_ids
Trim transcriptome sequence IDs with a custom script.
- Input
  - Transcriptome sequence (`.fa`); from extract_transcriptome_seqs
- Output
  - Transcriptome sequence, trimmed IDs (`.fa`); used in generate_segemehl_index_transcriptome, mapping_transcriptome_segemehl and mapping_transcriptome_oligomap
Generate transcriptome index for segemehl short read aligner.
The transcriptome index only needs to be generated once for each combination of transcriptome sequence and annotations, and sample sets.
- Input
  - Transcriptome sequence, trimmed IDs (`.fa`); from trim_transcriptome_seq_ids
- Output
  - segemehl transcriptome index (`.idx`); used in mapping_transcriptome_segemehl
Generate genome index for segemehl short read aligner.
The genome index only needs to be generated once for each combination of annotations and sample sets.
- Input
  - Genome sequence, trimmed IDs (`.fa`); from trim_genome_seq_ids
- Output
  - segemehl genome index (`.idx`); used in mapping_genome_segemehl
Retrieve exon annotations from genome annotations with a custom script.
- Input
  - (Workflow input) Genome annotations, `gzip`ed (`.gtf.gz`)
- Output
  - Exon annotations (`.gtf`); used in convert_exons_gtf_to_bed
Convert exon annotations `.gtf` to `.bed` with a custom script.
- Input
  - Exon annotations (`.gtf`); from get_exons_gtf
- Output
  - Exon annotations (`.bed`); used in transcriptome_to_genome_maps
Create SAM header for the genome with SAMtools.
Required by SAMtools to work with the alignment files.
- Input
  - Genome sequence, trimmed IDs (`.fa`); from trim_genome_seq_ids
- Output
  - SAM genome header (`.sam`); used in add_header_all_maps
Map UCSC-like chromosome names with Ensembl-like ones in miRNA annotations with a custom script.
Required by BEDTools to intersect alignments with miRNA annotations. Several mapping tables are available here.
- Input
  - (Workflow input) miRNA annotations (`.gff3`)
  - (Workflow input) Tab-separated chromosome name mappings table (`.tsv`)
- Output
  - miRNA annotations, mapped chromosome name(s) (`.gff3`); used in create_per_library_ascii_pileups, create_per_run_ascii_pileups and/or create_per_condition_ascii_pileups
Create a FASTA index for the genome with SAMtools.
- Input
  - Genome sequence, trimmed IDs (`.fa`); from trim_genome_seq_ids
- Output
  - FASTA genome index (`.fa.fai`); used in extract_chr_len
Extract chromosome(s) length from the genome sequence.
Required to ensure that the extended annotations generated in the extend_mirs_annotations rule do not exceed the chromosome(s) boundaries.
- Input
  - FASTA genome index (`.fa.fai`); from create_index_genome_fasta
- Output
  - Tab-separated table mapping chromosome name(s) and length(s) (`.tsv`); used in extend_mirs_annotations
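
The extraction of chromosome lengths can be sketched as below. The first two columns of a `.fai` file hold the sequence name and its length, so the table only needs those. The `extract_chr_len` function and the index content are illustrative assumptions; how the workflow implements this step is not shown here.

```python
import io


def extract_chr_len(fai_handle):
    """Yield (name, length) pairs from a FASTA index (.fai)."""
    for line in fai_handle:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns 1 and 2 of a .fai file are the sequence name and its length.
        yield fields[0], int(fields[1])


# Made-up index content for two sequences.
fai = io.StringIO("19\t600000\t4\t60\t61\n21\t450000\t610008\t60\t61\n")
table = "".join(f"{name}\t{length}\n" for name, length in extract_chr_len(fai))
print(table, end="")
```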
Extend miRNA annotations, ensure feature names uniqueness and split the file by feature with a custom script.
Adjust miRNAs' 'Name' attribute to account for the different genomic locations the miRNA sequence is annotated on and ensure their uniqueness. The name format is `SPECIES-mir-NAME-#` for pri-miRs, and `SPECIES-miR-NAME-#-ARM` or `SPECIES-miR-NAME-#` for mature miRNA with both or just one arm respectively, where `#` is the replica integer. If a pri-miR has a replica but its number is set in the 'ID' attribute, the first instance does not have a suffix but the other one(s) do. If a precursor has no other occurrences, no further modifications are made. In addition, mature miRNA regions are extended on both sides to account for isomiR species with shifted start and/or end positions without exceeding chromosome(s) boundaries. If required, pri-miR loci are also extended to accommodate the new miRNA coordinates. Finally, pri-miR names are modified to record the final positions by appending `_-y` and `_+x` to them, where `y` is the 5' shift and `x` the 3' shift.
- Input
  - miRNA annotations, mapped chromosome name(s) (`.gff3`); from map_chr_names
- Parameters
  - config_template.yaml
    - `extension`: Number of nucleotides by which mature miRNA annotated regions are extended at most (default: 6)
- Output
  - Primary miRNA transcript (pri-miR) extended annotations (`.gff3`); used in intersect_extended_primir
  - Mature miRNA (miRNA) extended annotations (`.gff3`); used in intersect_extended_mirna
- Examples
Example 1 | Extension | Mature miRNA extension
IN:
pri-miR entry:
19 . miRNA_primary_transcript 2517 2614 . + . ID=MI0003141;Alias=MI0003141;Name=hsa-mir-512-2
mature miRNA entry:
19 . miRNA 2536 2558 . + . ID=MIMAT0002822_1;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003141
extension:
6
OUT:
pri-miR entry:
19 . miRNA_primary_transcript 2517 2614 . + . ID=MI0003141;Alias=MI0003141;Name=hsa-mir-512-2_-0_+0
mature miRNA entry:
19 . miRNA 2530 2564 . + . ID=MIMAT0002822_1;Alias=MIMAT0002822;Name=hsa-miR-512-2-5p;Derives_from=MI0003141
Example 2 | Extension | Mature miRNA and pri-miR extension
IN:
pri-miR entry:
19 . miRNA_primary_transcript 9 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1
mature miRNA entry:
19 . miRNA 12 74 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140
extension:
6
OUT:
pri-miR entry:
19 . miRNA_primary_transcript 6 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1_-3_+0
mature miRNA entry:
19 . miRNA 6 80 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-1-5p;Derives_from=MI0003140
Example 3 | Extension | Mature miRNA exceeding chromosome boundaries extension
IN:
pri-miR entry:
19 . miRNA_primary_transcript 2 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1
mature miRNA entry:
19 . miRNA 3 74 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-5p;Derives_from=MI0003140
extension:
6
OUT:
pri-miR entry:
19 . miRNA_primary_transcript 1 122 . + . ID=MI0003140;Alias=MI0003140;Name=hsa-mir-512-1_-1_+0
mature miRNA entry:
19 . miRNA 1 80 . + . ID=MIMAT0002822;Alias=MIMAT0002822;Name=hsa-miR-512-1-5p;Derives_from=MI0003140
Example 4 | Name uniqueness | Replica number in the ID
IN:
pri-miR entries:
chr21 . miRNA_primary_transcript 8206563 8206618 . + . ID=MI0033425;Alias=MI0033425;Name=hsa-mir-10401
chr21 . miRNA_primary_transcript 8250772 8250827 . + . ID=MI0033425_2;Alias=MI0033425;Name=hsa-mir-10401
mature miRNA entries:
chr21 . miRNA 8206563 8206582 . + . ID=MIMAT0041633;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425
chr21 . miRNA 8206598 8206618 . + . ID=MIMAT0041634;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425
chr21 . miRNA 8250772 8250791 . + . ID=MIMAT0041633_1;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425
chr21 . miRNA 8250807 8250827 . + . ID=MIMAT0041634_1;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425
OUT:
pri-miR entries:
chr21 . miRNA_primary_transcript 8206563 8206618 . + . ID=MI0033425;Alias=MI0033425;Name=hsa-mir-10401
chr21 . miRNA_primary_transcript 8250772 8250827 . + . ID=MI0033425_2;Alias=MI0033425;Name=hsa-mir-10401-2
mature miRNA entries:
chr21 . miRNA 8206563 8206582 . + . ID=MIMAT0041633;Alias=MIMAT0041633;Name=hsa-miR-10401-5p;Derives_from=MI0033425
chr21 . miRNA 8206598 8206618 . + . ID=MIMAT0041634;Alias=MIMAT0041634;Name=hsa-miR-10401-3p;Derives_from=MI0033425
chr21 . miRNA 8250772 8250791 . + . ID=MIMAT0041633_1;Alias=MIMAT0041633;Name=hsa-miR-10401-2-5p;Derives_from=MI0033425
chr21 . miRNA 8250807 8250827 . + . ID=MIMAT0041634_1;Alias=MIMAT0041634;Name=hsa-miR-10401-2-3p;Derives_from=MI0033425
Example 5 | Name uniqueness | Replica number in the Name; single mature arm
IN:
pri-miR entries:
chr21 . miRNA_primary_transcript 8205315 8205406 . + . ID=MI0022559;Alias=MI0022559;Name=hsa-mir-6724-1
chr21 . miRNA_primary_transcript 8249505 8249596 . + . ID=MI0031516;Alias=MI0031516;Name=hsa-mir-6724-2
mature miRNA entries:
chr21 . miRNA 8205325 8205347 . + . ID=MIMAT0025856;Alias=MIMAT0025856;Name=hsa-miR-6724-5p;Derives_from=MI0022559
chr21 . miRNA 8249515 8249537 . + . ID=MIMAT0025856_1;Alias=MIMAT0025856;Name=hsa-miR-6724-5p;Derives_from=MI0031516
OUT:
pri-miR entries:
chr21 . miRNA_primary_transcript 8205315 8205406 . + . ID=MI0022559;Alias=MI0022559;Name=hsa-mir-6724-1
chr21 . miRNA_primary_transcript 8249505 8249596 . + . ID=MI0031516;Alias=MI0031516;Name=hsa-mir-6724-2
mature miRNA entries:
chr21 . miRNA 8205325 8205347 . + . ID=MIMAT0025856;Alias=MIMAT0025856;Name=hsa-miR-6724-1-5p;Derives_from=MI0022559
chr21 . miRNA 8249515 8249537 . + . ID=MIMAT0025856_1;Alias=MIMAT0025856;Name=hsa-miR-6724-2-5p;Derives_from=MI0031516
Example 6 | Name uniqueness | Both mature miRNA arms but just one with the replica number
IN:
pri-miR entries:
chr2 . miRNA_primary_transcript 135665397 135665478 . + . ID=MI0000447;Alias=MI0000447;Name=hsa-mir-128-1
chr3 . miRNA_primary_transcript 35744476 35744559 . + . ID=MI0000727;Alias=MI0000727;Name=hsa-mir-128-2
mature miRNA entries:
chr2 . miRNA 135665446 135665466 . + . ID=MIMAT0000424;Alias=MIMAT0000424;Name=hsa-miR-128-3p;Derives_from=MI0000447
chr2 . miRNA 135665411 135665433 . + . ID=MIMAT0026477;Alias=MIMAT0026477;Name=hsa-miR-128-1-5p;Derives_from=MI0000447
chr3 . miRNA 35744527 35744547 . + . ID=MIMAT0000424_1;Alias=MIMAT0000424;Name=hsa-miR-128-3p;Derives_from=MI0000727
chr3 . miRNA 35744490 35744512 . + . ID=MIMAT0031095;Alias=MIMAT0031095;Name=hsa-miR-128-2-5p;Derives_from=MI0000727
OUT:
pri-miR entries:
chr2 . miRNA_primary_transcript 135665397 135665478 . + . ID=MI0000447;Alias=MI0000447;Name=hsa-mir-128-1
chr3 . miRNA_primary_transcript 35744476 35744559 . + . ID=MI0000727;Alias=MI0000727;Name=hsa-mir-128-2
mature miRNA entries:
chr2 . miRNA 135665446 135665466 . + . ID=MIMAT0000424;Alias=MIMAT0000424;Name=hsa-miR-128-1-3p;Derives_from=MI0000447
chr2 . miRNA 135665411 135665433 . + . ID=MIMAT0026477;Alias=MIMAT0026477;Name=hsa-miR-128-1-5p;Derives_from=MI0000447
chr3 . miRNA 35744527 35744547 . + . ID=MIMAT0000424_1;Alias=MIMAT0000424;Name=hsa-miR-128-2-3p;Derives_from=MI0000727
chr3 . miRNA 35744490 35744512 . + . ID=MIMAT0031095;Alias=MIMAT0031095;Name=hsa-miR-128-2-5p;Derives_from=MI0000727
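
The coordinate arithmetic behind Examples 1 to 3 can be sketched as follows. This is not the workflow's script: `extend_mirna` is a hypothetical helper, it only handles plus-strand features (as in the examples), and the chromosome length used below is a made-up value.

```python
def extend_mirna(primir, mirna, extension, chr_len):
    """Extend a mature miRNA and, if needed, its pri-miR (plus strand only).

    `primir` and `mirna` are (start, end) tuples in 1-based coordinates.
    Returns the new pri-miR coordinates, the pri-miR name suffix and the
    new mature miRNA coordinates.
    """
    # Extend the mature miRNA without leaving the chromosome.
    mir_start = max(1, mirna[0] - extension)
    mir_end = min(chr_len, mirna[1] + extension)
    # Widen the pri-miR only if the extended miRNA no longer fits inside it.
    pri_start = min(primir[0], mir_start)
    pri_end = max(primir[1], mir_end)
    shift_5p = primir[0] - pri_start
    shift_3p = pri_end - primir[1]
    return (pri_start, pri_end), f"_-{shift_5p}_+{shift_3p}", (mir_start, mir_end)


CHR_LEN = 1_000_000  # hypothetical chromosome length
print(extend_mirna((2517, 2614), (2536, 2558), 6, CHR_LEN))  # Example 1
print(extend_mirna((9, 122), (12, 74), 6, CHR_LEN))          # Example 2
print(extend_mirna((2, 122), (3, 74), 6, CHR_LEN))           # Example 3
```

The three calls use the input coordinates of Examples 1 to 3 and can be compared against the `OUT` entries listed there.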
Target rule as required by Snakemake.
Local rule
- Input
  - BAM index file (`.bam.bai`); from index_all_alns_bam
Copy and rename read files.
Local rule. Depending on the library file format, the output file undergoes a quality filter (`.fq`/`.fastq`) or is directly formatted (`.fa`/`.fasta`).
- Input
  - (Workflow input) miRNA sequencing library, `gzip`ed (`.fa.gz`/`.fasta.gz` or `.fq.gz`/`.fastq.gz`)
- Output
  - miRNA sequencing library, copied, renamed (`.fa`, `.fastq`); used in fastq_quality_filter and/or format_fasta
Conduct quality control for reads library with fastx_toolkit.
- Input
  - miRNA sequencing library, copied, renamed (`.fastq`); from start
- Parameters
  - config_template.yaml
    - `q_value`: Minimum Q (Phred) score to keep (default: 10)
    - `p_value`: Minimum % of bases that must have a Q (Phred) quality (default: 50)
- Output
  - miRNA sequencing library, filtered (`.fastq`); used in fastq_to_fasta
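
To make the interplay of `q_value` and `p_value` concrete, the sketch below keeps a read when at least `p_value` percent of its bases reach a Phred score of `q_value`. It is an approximation for illustration only: `passes_quality_filter` is a hypothetical function, a Phred+33 quality encoding is assumed, and the tool's exact boundary handling may differ.

```python
def passes_quality_filter(quality, q_value=10, p_value=50, offset=33):
    """Return True if at least `p_value` % of bases have a Phred score >= `q_value`.

    `quality` is the quality string of a FASTQ record; `offset` is the ASCII
    offset of the encoding (33 is assumed here).
    """
    scores = [ord(char) - offset for char in quality]
    good = sum(score >= q_value for score in scores)
    # Compare as integers to avoid floating point percentages.
    return 100 * good >= p_value * len(scores)


print(passes_quality_filter("IIII"))  # all bases at Q40
print(passes_quality_filter("I!!!"))  # one base at Q40, three at Q0
```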
Convert reads file from FASTQ to FASTA with fastx_toolkit.
Sequence identifiers are renamed to numbers.
- Input
  - miRNA sequencing library, filtered (`.fastq`); from fastq_quality_filter
- Output
  - miRNA sequencing library (`.fa`); used in format_fasta
Format reads' sequences to appear on a single line with fastx_toolkit.
- Input
  - miRNA sequencing library (`.fa`); from start or fastq_to_fasta
- Output
  - miRNA sequencing library, formatted (`.fasta`); used in remove_adapters
Trim 3' adapters and `N` bases at either end. Filter reads by minimum length and number of inner `N` bases with cutadapt.
- Input
  - miRNA sequencing library, formatted (`.fasta`); from format_fasta
- Parameters
  - samples.tsv
    - Adapter to be removed; specified in the sample's table column `adapter`
  - config_template.yaml
    - `error_rate`: Fraction of allowed errors in the matched adapters (default: 0.1)
    - `overlap`: Minimum overlap length between adapter and read to trim the bases (default: 3)
    - `minimum_length`: Minimum length for a processed read to be kept (default: 15)
    - `max_n`: Maximum number of inner `N` bases for a processed read to be kept (default: 0)
- Output
  - miRNA sequencing library, filtered, without adapters (`.fasta`); used in collapse_identical_reads
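
One way the parameters above could map onto a cutadapt call is sketched below. The command line is an assumption built from cutadapt's documented options (`-a`, `--error-rate`, `--overlap`, `--minimum-length`, `--max-n`, `--trim-n`, `-o`); the workflow's actual invocation is not reproduced here, and `cutadapt_command` and the file names are hypothetical.

```python
def cutadapt_command(reads, output, adapter, error_rate=0.1, overlap=3,
                     minimum_length=15, max_n=0):
    """Build an argument list for cutadapt from the workflow parameters."""
    return [
        "cutadapt",
        "-a", adapter,                            # 3' adapter from the samples table
        "--error-rate", str(error_rate),
        "--overlap", str(overlap),
        "--minimum-length", str(minimum_length),
        "--max-n", str(max_n),
        "--trim-n",                               # trim N bases at either end
        "-o", output,
        reads,
    ]


cmd = cutadapt_command("lib_1.fasta", "lib_1.trimmed.fasta", "XXXXXXXXXX")
print(" ".join(cmd))
```

Returning a list rather than a string means the command can be passed to `subprocess.run` without shell quoting concerns.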
Collapse and rename identical reads with fastx_toolkit.
Sequences are renamed in the format `R-N`, where `R` is the number assigned to the unique entry, and `N` is the number of identical sequences within the library collapsed into it.
- Input
  - miRNA sequencing library, filtered, without adapters (`.fasta`); from remove_adapters
- Output
  - miRNA sequencing library, collapsed, renamed (`.fasta`); used in filter_fasta_for_oligomap, map_genome_segemehl and map_transcriptome_segemehl
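
The `R-N` naming scheme can be illustrated with a few lines of Python. This sketch is not the tool's implementation: `collapse_reads` is a hypothetical function, and numbering the unique sequences by decreasing abundance is an assumption made for the example.

```python
from collections import Counter


def collapse_reads(sequences):
    """Collapse identical sequences and name each unique one `R-N`."""
    counts = Counter(sequences)
    collapsed = []
    # most_common() sorts by decreasing count; ties keep first-seen order.
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        collapsed.append((f"{rank}-{n}", seq))
    return collapsed


reads = ["ACGT", "TTTT", "ACGT", "GGGG", "ACGT", "TTTT"]
for name, seq in collapse_reads(reads):
    print(f">{name}\n{seq}")
```

Because `N` is stored in the read name, the original read counts can still be recovered later from the collapsed file.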
Align short reads to reference genome with segemehl.
- Input
  - miRNA sequencing library, collapsed, renamed (`.fasta`); from collapse_identical_reads
  - Genome sequence, trimmed IDs (`.fa`); from trim_genome_seq_ids
  - segemehl genome index (`.idx`); from generate_segemehl_index_genome
- Output
  - Alignments file (`.sam`); used in merge_genome_maps
Align short reads to reference transcriptome with segemehl.
- Input
  - miRNA sequencing library, collapsed, renamed (`.fasta`); from collapse_identical_reads
  - Transcriptome sequence, trimmed IDs (`.fa`); from trim_transcriptome_seq_ids
  - segemehl transcriptome index (`.idx`); from generate_segemehl_index_transcriptome
- Output
  - Alignments file (`.sam`); used in merge_transcriptome_maps
Filter reads by length with a custom script.
Required for an optimal mapping speed. Oligomap is specifically written for short reads. Therefore, reads with more bases than the default maximum (30 nts) make the mapping slower.
- Input
  - miRNA sequencing library, collapsed, renamed (`.fasta`); from collapse_identical_reads
- Parameters
  - config_template.yaml
    - `max_length_reads`: Maximum length of processed reads to be mapped with oligomap (default: 30)
- Output
  - miRNA sequencing library, collapsed, filtered (`.fasta`); used in map_genome_oligomap and map_transcriptome_oligomap
Align short reads to reference genome with oligomap.
Refer to Oligomap's Output format section for a specific explanation and examples on the output format.
- Input
  - miRNA sequencing library, collapsed, filtered (`.fasta`); from filter_fasta_for_oligomap
  - Genome sequence, trimmed IDs (`.fa`); from trim_genome_seq_ids
- Output
  - Alignments file (`.oligomap`); used in sort_genome_oligomap
  - Alignments report (`.txt`)
Sort oligomap alignments by query name with a custom script.
- Input
  - Alignments file (`.oligomap`); from map_genome_oligomap
- Output
  - Alignments file, sorted (`.oligomap`); used in convert_genome_to_sam_oligomap
Convert aligned reads `.oligomap` to `.sam` and filter alignments by number of hits with a custom script.
If a read aligns to more loci than the specified threshold, it is removed for two reasons: (1) performance, as the file size can rapidly increase, and (2) the fact that each read contributes `1/N` to each count, where `N` is the number of genomic loci it aligns to, so a large `N` makes the contribution negligible.
- Input
  - Alignments file, sorted (`.oligomap`); from sort_genome_oligomap
- Parameters
  - config_template.yaml
    - `nh`: Maximum number of mappings per read to be kept (default: 100)
- Output
  - Alignments file, filtered (`.sam`); used in merge_genome_maps
Align short reads to reference transcriptome with oligomap.
Refer to Oligomap's Output format section for a specific explanation and examples on the output format.
- Input
  - miRNA sequencing library, collapsed, filtered (`.fasta`); from filter_fasta_for_oligomap
  - Transcriptome sequence, trimmed IDs (`.fa`); from trim_transcriptome_seq_ids
- Output
  - Alignments file (`.oligomap`); used in sort_transcriptome_oligomap
  - Alignments report (`.txt`)
Sort oligomap alignments by query name with a custom script.
- Input
  - Alignments file (`.oligomap`); from map_transcriptome_oligomap
- Output
  - Alignments file, sorted (`.oligomap`); used in convert_transcriptome_to_sam_oligomap
Convert aligned reads `.oligomap` to `.sam` and filter alignments by number of hits with a custom script.
If a read aligns to more loci than the specified threshold, it is removed for two reasons: (1) performance, as the file size can rapidly increase, and (2) the fact that each read contributes `1/N` to each count, where `N` is the number of genomic loci it aligns to, so a large `N` makes the contribution negligible.
- Input
  - Alignments file, sorted (`.oligomap`); from sort_transcriptome_oligomap
- Parameters
  - config_template.yaml
    - `nh`: Maximum number of mappings per read to be kept (default: 100)
- Output
  - Alignments file, filtered (`.sam`); used in merge_transcriptome_maps
Concatenate segemehl and oligomap genome alignments.
- Input
  - Alignments file (`.sam`); from map_genome_segemehl
  - Alignments file, filtered (`.sam`); from convert_genome_to_sam_oligomap
- Output
  - Alignments file (`.sam`); used in filter_genome_by_nh
Concatenate segemehl and oligomap transcriptome alignments.
- Input
  - Alignments file (`.sam`); from map_transcriptome_segemehl
  - Alignments file, filtered (`.sam`); from convert_transcriptome_to_sam_oligomap
- Output
  - Alignments file (`.sam`); used in filter_transcriptome_by_nh
Filter merged genome alignments by the number of hits with a custom script.
If a read aligns to more loci than the specified threshold, it is removed for two reasons: (1) performance, as the file size can rapidly increase, and (2) the fact that each read contributes `1/N` to each count, where `N` is the number of genomic loci it aligns to, so a large `N` makes the contribution negligible.
- Input
  - Alignments file (`.sam`); from merge_genome_maps
- Parameters
  - config_template.yaml
    - `nh`: Maximum number of mappings per read to be kept (default: 100)
- Output
  - Alignments file, filtered (`.sam`); used in remove_header_genome_mappings
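
A filter of this kind can be sketched as follows, assuming that the number of hits is available in the `NH` tag of each SAM record. `filter_by_nh` is a hypothetical function, not the workflow's script, and keeping records that carry no `NH` tag is a choice made for this example only.

```python
def filter_by_nh(sam_lines, nh=100):
    """Keep header lines and alignments whose NH tag does not exceed `nh`."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        # Optional tags start at the 12th tab-separated field of a SAM record.
        tags = line.rstrip("\n").split("\t")[11:]
        hits = next((int(t[5:]) for t in tags if t.startswith("NH:i:")), None)
        if hits is None or hits <= nh:
            kept.append(line)
    return kept


header = "@SQ\tSN:19\tLN:600000\n"
few = "1-2\t0\t19\t44414\t255\t21M\t*\t0\t0\tGAAGGCGCTTCCCTTTGGAGT\t*\tNM:i:0\tMD:Z:21\tNH:i:1\n"
many = "2-1\t0\t19\t500\t255\t21M\t*\t0\t0\tGAAGGCGCTTCCCTTTGGAGT\t*\tNM:i:0\tMD:Z:21\tNH:i:150\n"
print(len(filter_by_nh([header, few, many])))
```

With the default threshold, a read with `NH:i:150` would contribute only 1/150 to each locus it maps to, which is why dropping it changes the counts very little.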
Filter merged transcriptome alignments by the number of hits with a custom script.
If a read aligns to more loci than the specified threshold, it is removed for two reasons: (1) performance, as the file size can rapidly increase, and (2) the fact that each read contributes `1/N` to each count, where `N` is the number of genomic loci it aligns to, so a large `N` makes the contribution negligible.
- Input
  - Alignments file (`.sam`); from merge_transcriptome_maps
- Parameters
  - config_template.yaml
    - `nh`: Maximum number of mappings per read to be kept (default: 100)
- Output
  - Alignments file, filtered (`.sam`); used in remove_header_transcriptome_mappings
Remove the SAM header of the genome alignments file with SAMtools.
- Input
  - Alignments file (`.sam`); from filter_genome_by_nh
- Output
  - Alignments file, without SAM header (`.sam`); used in merge_all_maps
Remove the SAM header of the transcriptome alignments file with SAMtools.
- Input
  - Alignments file (`.sam`); from filter_transcriptome_by_nh
- Output
  - Alignments file, without SAM header (`.sam`); used in transcriptome_to_genome_maps
Convert the alignments' transcriptome coordinates to genomic ones with a custom script.
- Input
  - Alignments file, without SAM header (`.sam`); from remove_header_transcriptome_mappings
  - Exon annotations (`.bed`); from convert_exons_gtf_to_bed
- Output
  - Alignments file, without SAM header (`.sam`); used in merge_all_maps
Concatenate the four alignment files into a single one.
- Input
  - Alignments files, without SAM header (`.sam`); from remove_header_genome_mappings and transcriptome_to_genome_maps
- Output
  - Alignments file, without SAM header (`.sam`); used in add_header_all_maps
Add the SAM header to the aligned reads merged file with SAMtools.
- Input
  - Alignments file, without SAM header (`.sam`); from merge_all_maps
- Output
  - Alignments file (`.sam`); used in sort_maps_by_id
Sort alignments by reads ID with SAMtools.
- Input
  - Alignments file (`.sam`); from add_header_all_maps
- Output
  - Alignments file, sorted (`.sam`); used in remove_inferiors
Remove duplicate and inferior alignments with a custom script.
Alignments are considered duplicates if they have identical entries for the fields `QNAME`, `FLAG`, `RNAME`, `POS` and `CIGAR`. Alignments are considered inferior if they share their `QNAME` with other alignments but have a larger edit distance than the smallest one within the group. The tags `NH` (number of hits) and `HI` (query hit index) are updated accordingly.
- Input
  - Alignments file, sorted (`.sam`); from sort_maps_by_id
- Output
  - Alignments file, filtered (`.sam`); used in filter_by_indels
- Examples
Example 1 | Remove duplicates
IN:
1-2 0 19 44414 1 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * HI:i:0 NH:i:1 NM:i:0 MD:Z:21 RG:Z:A1 YZ:Z:0
1-2 0 19 44414 255 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * NM:i:0 MD:Z:21 NH:i:1
OUT:
1-2 0 19 44414 255 21M * 0 0 GAAGGCGCTTCCCTTTGGAGT * MD:Z:21 NH:i:1 NM:i:0
Example 2 | Remove inferiors single alignment
IN:
1-704 16 19 207362 1 18M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:0 NH:i:2 NM:i:0 MD:Z:18 RG:Z:A1 YZ:Z:0
1-704 272 19 471264 1 16M1I1M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:1 NH:i:2 NM:i:2 MD:Z:11G5 RG:Z:A1 YZ:Z:0
OUT:
1-704 16 19 207362 1 18M * 0 0 CCCGGGCCCGGCGCGCCG * HI:i:0 NH:i:1 NM:i:0 MD:Z:18 RG:Z:A1 YZ:Z:0
Example 3 | Remove inferiors multiple alignments
IN:
1-1197 0 19 56327 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:0 NH:i:4 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
1-1197 256 19 68983 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:1 NH:i:4 NM:i:3 MD:Z:1C10AT1 RG:Z:A1 YZ:Z:0
1-1197 256 19 76967 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:2 NH:i:4 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
1-1197 256 19 92363 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:4 NH:i:4 NM:i:3 MD:Z:1C11TA RG:Z:A1 YZ:Z:0
OUT:
1-1197 0 19 56327 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:0 NH:i:2 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
1-1197 256 19 76967 1 15M * 0 0 TATGGCACTGGTAGA * HI:i:1 NH:i:2 NM:i:2 MD:Z:1C11T1 RG:Z:A1 YZ:Z:0
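
The selection logic illustrated by these examples can be sketched on simplified records. This is an illustration under assumptions, not the workflow's script: `remove_inferiors` here is a hypothetical function working on dictionaries instead of SAM lines, it keeps the first of several duplicates, and it numbers `HI` from 0 as in Examples 2 and 3.

```python
from itertools import groupby
from operator import itemgetter


def remove_inferiors(alignments):
    """Drop duplicate and inferior alignments.

    `alignments` is a list of dicts sorted by 'qname', each with the keys
    qname, flag, rname, pos, cigar and nm (edit distance).
    """
    result = []
    for _, group in groupby(alignments, key=itemgetter("qname")):
        seen, unique = set(), []
        for aln in group:
            key = (aln["qname"], aln["flag"], aln["rname"], aln["pos"], aln["cigar"])
            if key not in seen:  # keep only the first of identical entries
                seen.add(key)
                unique.append(aln)
        best = min(aln["nm"] for aln in unique)
        kept = [aln for aln in unique if aln["nm"] == best]
        # Update the number of hits and the hit index of the survivors.
        for index, aln in enumerate(kept):
            result.append({**aln, "nh": len(kept), "hi": index})
    return result


# Simplified version of the input of Example 3.
example_3 = [
    {"qname": "1-1197", "flag": 0, "rname": "19", "pos": 56327, "cigar": "15M", "nm": 2},
    {"qname": "1-1197", "flag": 256, "rname": "19", "pos": 68983, "cigar": "15M", "nm": 3},
    {"qname": "1-1197", "flag": 256, "rname": "19", "pos": 76967, "cigar": "15M", "nm": 2},
    {"qname": "1-1197", "flag": 256, "rname": "19", "pos": 92363, "cigar": "15M", "nm": 3},
]
for aln in remove_inferiors(example_3):
    print(aln["pos"], aln["nh"], aln["hi"])
```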
Filter multimappers favoring mismatches over InDels with a custom script.
Under the assumption that InDels are less frequent than mismatches, only those alignments (of the same read with the same edit distance) with the lowest number of InDels are kept. This approach allows the presence of multimappers and/or InDels after the filtering if the alignments contain the same proportion of mismatches vs. InDels.
- Input
  - Alignments file, sorted, filtered (`.sam`); from remove_inferiors
- Output
  - Alignments file, sorted, filtered (`.sam`); used in convert_all_alns_sam_to_bam and filter_sam_by_intersecting_primir
- Examples
Example 1 | Different proportion of mismatches vs. InDels
IN:
1-1 16 19 77595 255 14M1D8M * 0 0 GCAGGAGAATCACTGATGTCAG * MD:Z:14^T2A1C3 NH:i:2 NM:i:3 XA:Z:Q XI:i:1
1-1 0 19 330456 255 4M1D1M1I3M1D13M * 0 0 CTGACATCAGTGATTCTCCTGC * MD:Z:4^G4^A13 NH:i:2 NM:i:3 XA:Z:Q XI:i:0
OUT:
1-1 16 19 77595 255 14M1D8M * 0 0 GCAGGAGAATCACTGATGTCAG * MD:Z:14^T2A1C3 NM:i:3 XA:Z:Q XI:i:1 NH:i:1 HI:i:1
Example 2 | Equal proportion of mismatches vs. InDels
IN:
1-1 0 19 142777 255 15M1I5M * 0 0 GCTAGGTGGGAGGCTTGAAGC * MD:Z:4C0T14 NH:i:3 NM:i:3 XA:Z:Q XI:i:0
1-1 16 19 270081 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14G0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:2
1-1 16 19 545543 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:1
OUT:
1-1 0 19 142777 255 15M1I5M * 0 0 GCTAGGTGGGAGGCTTGAAGC * MD:Z:4C0T14 NH:i:3 NM:i:3 XA:Z:Q XI:i:0
1-1 16 19 270081 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14G0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:2
1-1 16 19 545543 255 6M1I14M * 0 0 GCTTCAAGCCTCCCACCTAGC * MD:Z:14A0G4 NH:i:3 NM:i:3 XA:Z:Q XI:i:1
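The selection shown in these examples can be sketched as follows. This is a simplified illustration, not the workflow's custom script: it assumes that the alignments passed in belong to the same read and share the same edit distance, and that "number of InDels" means the number of `I` and `D` operations in the CIGAR string. The function names are hypothetical.

```python
import re


def count_indels(cigar):
    """Count the insertion (I) and deletion (D) operations in a CIGAR string."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(1 for _, op in operations if op in "ID")


def filter_by_indels(cigars):
    """Keep the CIGAR strings with the fewest InDel operations."""
    fewest = min(count_indels(cigar) for cigar in cigars)
    return [cigar for cigar in cigars if count_indels(cigar) == fewest]


# Example 1: one InDel vs. three InDels, so only the first alignment is kept.
kept_1 = filter_by_indels(["14M1D8M", "4M1D1M1I3M1D13M"])
# Example 2: one InDel each, so all three alignments are kept.
kept_2 = filter_by_indels(["15M1I5M", "6M1I14M", "6M1I14M"])
```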
Convert alignments .sam file to .bam with SAMtools.
Required by BEDTools to intersect alignments with pri-miR annotations.
- Input
  - Alignments file, filtered (.sam); from filter_by_indels
- Output
  - Alignments file (.bam); used in sort_all_alns_bam_by_position
Sort alignments by position with SAMtools.
Required by BEDTools to intersect alignments with pri-miR annotations more efficiently.
- Input
  - Alignments file (.bam); from convert_all_alns_sam_to_bam
- Output
  - Alignments file, sorted (.bam); used in index_all_alns_bam and intersect_extended_primir
Create index BAM file with SAMtools.
Indexing is required by genome viewers such as IGV to quickly display alignments in a genomic region of interest.
- Input
  - Alignments file, sorted (.bam); from sort_all_alns_bam_by_position
- Output
  - BAM index file (.bam.bai); used in intersect_extended_primir
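The convert, sort and index steps correspond to three standard SAMtools subcommands. The sketch below only assembles the command lines and does not run them; the file names are placeholders, and the exact options used by the workflow's rules may differ.

```python
import shlex


def samtools_commands(sam_path, prefix):
    """Build the view, sort and index command lines for one alignments file.

    `prefix` is a hypothetical output prefix; nothing is executed here.
    """
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],  # SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],      # sort by position
        ["samtools", "index", sorted_bam],                # writes <file>.bai
    ]


commands = samtools_commands("alignments.sam", "alignments")
for command in commands:
    print(shlex.join(command))
```

To actually run one of the commands, the list can be passed to `subprocess.run(command, check=True)`, provided SAMtools is installed.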
Target rule as required by Snakemake.
Local rule
- Input
  - Alignments file, filtered (.sam); from filter_sam_by_intersecting_primir and filter_sam_by_intersecting_mirna
  - (iso)miR and/or pri-miR counts table (.tab); from merge_tables
  - Alignments file, uncollapsed, sorted (.bam); from sort_uncollapsed_reads_bam_by_position
  - BAM index file (.bam.bai); from index_uncollapsed_reads_bam
Intersect the aligned reads with the extended pri-miR annotations with BEDTools.
Only those alignments fully intersecting a (possibly extended) pri-miR annotated region are kept.
- Input
  - Alignments file, sorted (.bam); from sort_all_alns_bam_by_position
  - pri-miR extended annotations (.gff3); from extend_mirs_annotations
- Output
  - pri-miR intersections file (.bed); used in filter_sam_by_intersecting_primir and quantify_primir
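The "fully intersecting" criterion can be illustrated with a small containment check. This is a sketch of the idea only, not of how BEDTools is invoked in the workflow. It assumes BED intervals are 0-based and half-open while GFF3 coordinates are 1-based and inclusive; the pri-miR region used below is hypothetical.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open BED interval to 1-based, inclusive."""
    return start + 1, end


def is_fully_contained(aln_start, aln_end, region_start, region_end):
    """True if the alignment lies entirely within the annotated region.

    All coordinates are 1-based and inclusive.
    """
    return region_start <= aln_start and aln_end <= region_end


# An alignment reported in BED coordinates as 27765-27788.
aln_start, aln_end = bed_to_one_based(27765, 27788)

# Hypothetical (possibly extended) pri-miR region in GFF3 coordinates.
region = (27760, 27850)

inside = is_fully_contained(aln_start, aln_end, *region)
partial = is_fully_contained(27750, 27770, *region)
```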
Remove alignments that do not intersect with any pri-miR with SAMtools.
Required to only intersect alignments within a (possibly extended) pri-miR locus.
- Input
  - Alignments file, filtered (.sam); from filter_by_indels
  - pri-miR intersections file (.bed); from intersect_extended_primir
- Output
  - (Workflow output) Alignments file, filtered (.sam); used in convert_intersecting_primir_sam_to_bam and filter_sam_by_intersecting_mirna
Convert alignments .sam file to .bam with SAMtools.
Required by BEDTools to intersect alignments with miRNA annotations.
- Input
  - Alignments file, filtered (.sam); from filter_sam_by_intersecting_primir
- Output
  - Alignments file (.bam); used in sort_intersecting_primir_bam_by_position
Sort alignments by position with SAMtools.
Required by BEDTools to intersect alignments with miRNA annotations more efficiently.
- Input
  - Alignments file (.bam); from convert_intersecting_primir_sam_to_bam
- Output
  - Alignments file, sorted (.bam); used in index_intersecting_primir_bam and intersect_extended_mirna
Create index BAM file with SAMtools.
Indexing is required by genome viewers such as IGV to quickly display alignments in a genomic region of interest.
- Input
  - Alignments file, sorted (.bam); from sort_intersecting_primir_bam_by_position
- Output
  - BAM index file (.bam.bai); used in intersect_extended_mirna
Intersect the aligned reads with the extended miRNA annotations with BEDTools.
Only those alignments fully intersecting an extended mature miRNA annotated region are kept.
- Input
  - Alignments file, sorted (.bam); from sort_intersecting_primir_bam_by_position
  - Mature miRNA extended annotations (.gff3); from extend_mirs_annotations
- Output
  - Mature miRNA intersections file (.bed); used in filter_sam_by_intersecting_mirna and add_intersecting_mirna_tag
Remove alignments that do not intersect with any miRNA with SAMtools.
Required to efficiently classify the alignments.
- Input
  - Alignments file, filtered (.sam); from filter_sam_by_intersecting_primir
  - Mature miRNA intersections file (.bed); from intersect_extended_mirna
- Output
  - (Workflow output) Alignments file, filtered (.sam); used in add_intersecting_mirna_tag and uncollapse_reads
Classify and add the intersecting (iso)miR to each alignment as a tag with a custom script.
In this step, the mature miRNA annotated regions are used instead of the extended ones. Each alignment gets an extra tag (YW:Z) with either the (iso)miR(s) it is considered to really intersect with or an empty string otherwise. The format of the intersecting mature miRNA species is miRNA_name|5p-shift|3p-shift|CIGAR|MD, where 5p-shift and 3p-shift are the differences between the miRNA start and end coordinates and those of the alignment, respectively.
- Input
  - Alignments file, filtered (.sam); from filter_sam_by_intersecting_mirna
  - Mature miRNA intersections file (.bed); from intersect_extended_mirna
- Parameters
  - config_template.yaml
    - extension: Number of nucleotides by which mature miRNA annotated regions are extended (default: 6)
- Output
  - Alignments file, tagged (.sam); used in sort_intersecting_mirna_by_feat_tag
- Examples
Example 1 | Intersecting a canonical mature miRNA
IN miRNA annotations:
chr19 . miRNA 44377 44398 . + . ID=MIMAT0002849;Alias=MIMAT0002849;Name=hsa-miR-524-5p;Derives_from=MI0003160
IN SAM record:
1-1_1 0 19 44377 255 22M * 0 0 CTACAAAGGGAAGCACTTTCTC * MD:Z:22 NH:i:1 NM:i:0
NEW TAG:
YW:Z:hsa-miR-524-5p|0|0|22M|22
Example 2 | Intersecting an isomiR (no shifts)
IN miRNA annotations:
chr19 . miRNA 44377 44398 . + . ID=MIMAT0002849;Alias=MIMAT0002849;Name=hsa-miR-524-5p;Derives_from=MI0003160
IN SAM record:
1-1_1 0 19 44377 1 11M3I11M * 0 0 CTACAAAGGGAGGTAGCACTTTCTC * HI:i:0 MD:Z:22 NH:i:1 NM:i:3
NEW TAG:
YW:Z:hsa-miR-524-5p|0|0|11M3I11M|22
Example 3 | Intersecting an isomiR (no InDels nor mismatches)
IN miRNA annotations:
chr19 . miRNA 5338 5359 . + . ID=MIMAT0005795;Alias=MIMAT0005795;Name=hsa-miR-1323;Derives_from=MI0003786
IN SAM record:
1-1_1 0 19 5338 255 21M * 0 0 TCAAAACTGAGGGGCATTTTC * MD:Z:21 NH:i:1 NM:i:0
NEW TAG:
YW:Z:hsa-miR-1323|0|-1|21M|21
Example 4 | Not intersecting an (iso)miR
IN miRNA annotations:
chr19 . miRNA 5338 5359 . + . ID=MIMAT0005795;Alias=MIMAT0005795;Name=hsa-miR-1323;Derives_from=MI0003786
IN SAM record:
1-1_1 0 19 5338 255 21M * 0 0 TCAAAACTGAGGGGCATTTTC * MD:Z:21 NH:i:1 NM:i:0
NEW TAG:
YW:Z:
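The tag format used in Examples 1 to 3 can be reproduced with the sketch below. It is an illustration under stated assumptions rather than the workflow's script: shifts are computed as alignment coordinate minus miRNA coordinate, the feature lies on the plus strand, and the decision whether an alignment really intersects a miRNA (Example 4) is not modeled. The function names are hypothetical.

```python
import re

# CIGAR operations that consume reference positions.
REF_CONSUMING = "MDN=X"


def reference_length(cigar):
    """Number of reference positions covered by an alignment."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in REF_CONSUMING)


def build_yw_tag(name, mirna_start, mirna_end, aln_start, cigar, md):
    """Assemble the YW:Z tag for an alignment intersecting a mature miRNA."""
    aln_end = aln_start + reference_length(cigar) - 1
    shift_5p = aln_start - mirna_start
    shift_3p = aln_end - mirna_end
    return f"YW:Z:{name}|{shift_5p}|{shift_3p}|{cigar}|{md}"


tag_1 = build_yw_tag("hsa-miR-524-5p", 44377, 44398, 44377, "22M", "22")
tag_2 = build_yw_tag("hsa-miR-524-5p", 44377, 44398, 44377, "11M3I11M", "22")
tag_3 = build_yw_tag("hsa-miR-1323", 5338, 5359, 5338, "21M", "21")
```

Note that the insertion in the second call does not consume reference positions, so the alignment still ends at the annotated miRNA end and both shifts are 0.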
Sort the alignments by the tag containing the classified intersecting (iso)miR with SAMtools.
Required for an efficient quantification.
- Input
  - Alignments file, tagged (.sam); from add_intersecting_mirna_tag
- Output
  - Alignments file, tagged, sorted (.sam); used in quantify_mirna
Tabulate alignments according to their new tag (YW:Z) with a custom script.
Each alignment contributes to the miRNA species in its YW:Z tag by R/N, where R is the number of collapsed reads in that alignment, and N is the number of genomic and/or transcriptomic loci it aligns to. The values of both R and N are inferred from the sequence name, which follows the format ID-R_N. The resulting table has a row for each mature miRNA species (isomiR, canonical miRNA or both) with the name format set in add_intersecting_mirna_tag, except for canonical miRNAs, which keep only the annotated mature miRNA name. A miRNA species is considered canonical if there are no shifts between its start and end positions and those of the aligned read, and there are neither mismatches nor InDels.
- Input
  - Alignments file, tagged, sorted (.sam); from sort_intersecting_mirna_by_feat_tag
- Parameters
  - samples.tsv
    - Library name; specified in the sample column of the samples table
  - config_template.yaml
    - mir_list: miRNA features to be quantified (default: isomir, mirna, pri-mir)
- Output
  - (iso)miR counts tab-delimited file; used in merge_tables
- Examples
Example 1 | Canonical miRNA and isomiR
IN SAM record:
10-4_2 0 19 34627 255 21M * 0 0 AAAGTGCTTCCTTTTAGAGGG * MD:Z:21 NM:i:0 NH:i:2 HI:i:1 YW:Z:hsa-miR-520b-3p|0|0|21M|21
10-4_2 0 19 40866 255 21M * 0 0 AAAGTGCTTCCTTTTAGAGGG * MD:Z:21 NM:i:0 NH:i:2 HI:i:2 YW:Z:hsa-miR-520c-3p|0|-1|21M|21
Data:
Alignment:
Read ID: 10
Number of collapsed reads: 4
Number of mapped genomic loci: 2
Contribution: 4/2 = 2
miRNA species:
Tag name: hsa-miR-520b-3p|0|0|21M|21
Type: Canonical
Table name: hsa-miR-520b-3p
Total count: 2
Tag name: hsa-miR-520c-3p|0|-1|21M|21
Type: isomiR
Table name: hsa-miR-520c-3p|0|-1|21M|21
Total count: 2
OUT table:
ID lib_name
hsa-miR-520b-3p 2
hsa-miR-520c-3p|0|-1|21M|21 2
Example 2 | Different isomiRs
IN SAM record:
599-1_3 0 19 27804 255 20M * 0 0 AAAGTGCTTCCTTTTAGAGG * MD:Z:20 NM:i:0 NH:i:3 HI:i:1 YW:Z:hsa-miR-526b-3p|1|-1|20M|20
599-1_3 0 19 34627 255 20M * 0 0 AAAGTGCTTCCTTTTAGAGG * MD:Z:20 NM:i:0 NH:i:3 HI:i:2 YW:Z:hsa-miR-520b-3p|0|-1|20M|20
599-1_3 0 19 40866 255 20M * 0 0 AAAGTGCTTCCTTTTAGAGG * MD:Z:20 NM:i:0 NH:i:3 HI:i:3 YW:Z:hsa-miR-520c-3p|0|-2|20M|20
Data:
Alignment:
Read ID: 599
Number of collapsed reads: 1
Number of mapped genomic loci: 3
Contribution: 1/3 = 0.33
miRNA species:
Tag name: hsa-miR-526b-3p|1|-1|20M|20
Type: isomiR
Table name: hsa-miR-526b-3p|1|-1|20M|20
Total count: 0.33
Tag name: hsa-miR-520b-3p|0|-1|20M|20
Type: isomiR
Table name: hsa-miR-520b-3p|0|-1|20M|20
Total count: 0.33
Tag name: hsa-miR-520c-3p|0|-2|20M|20
Type: isomiR
Table name: hsa-miR-520c-3p|0|-2|20M|20
Total count: 0.33
OUT table:
ID lib_name
hsa-miR-520b-3p|0|-1|20M|20 0.33
hsa-miR-520c-3p|0|-2|20M|20 0.33
hsa-miR-526b-3p|1|-1|20M|20 0.33
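The counting rule in these examples can be sketched as below. This is a simplified illustration, not the workflow's script: it assumes one miRNA species per YW:Z tag, treats a species as canonical when both shifts are 0, the CIGAR is a single match block and the MD value is purely numeric, and skips alignments with an empty tag. The function names are hypothetical.

```python
import re
from collections import defaultdict


def contribution(read_name):
    """Return R/N for a read named ID-R_N."""
    _, counts = read_name.rsplit("-", 1)
    reads, loci = counts.split("_")
    return int(reads) / int(loci)


def table_name(tag):
    """Return the bare miRNA name for canonical species, else the full tag."""
    name, shift_5p, shift_3p, cigar, md = tag.split("|")
    is_canonical = (
        shift_5p == "0"
        and shift_3p == "0"
        and re.fullmatch(r"\d+M", cigar) is not None
        and md.isdigit()
    )
    return name if is_canonical else tag


def quantify(records):
    """Sum R/N per miRNA species over (read name, YW:Z value) pairs."""
    counts = defaultdict(float)
    for read_name, tag in records:
        if tag:  # alignments with an empty tag are not counted
            counts[table_name(tag)] += contribution(read_name)
    return dict(counts)


# Records from Example 1.
example_1 = quantify([
    ("10-4_2", "hsa-miR-520b-3p|0|0|21M|21"),
    ("10-4_2", "hsa-miR-520c-3p|0|-1|21M|21"),
])
```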
Tabulate alignments according to their intersecting pri-miR with a custom script.
Each alignment contributes to the pri-miR it intersects with by R/N, where R is the number of collapsed reads in that alignment, and N is the number of genomic and/or transcriptomic loci it aligns to. The values of both R and N are inferred from the sequence name, which follows the format ID-R_N. The resulting table has a row for each pri-miR with the name format set in mirna_extension.
- Input
  - pri-miR intersections file (.bed); from intersect_extended_primir
- Output
  - pri-miR counts tab-delimited file; used in merge_tables
- Examples
Example 1 | One single pri-miR with different alignments
IN BED records:
19 . miRNA_primary_transcript 27766 27788 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27765 27788 68-2_1 255 +
19 . miRNA_primary_transcript 27766 27787 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27765 27787 316-1_7 1 +
19 . miRNA_primary_transcript 27804 27823 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27803 27823 599-1_3 255 +
19 . miRNA_primary_transcript 27805 27822 . + . ID=MI0003150;Alias=MI0003150;Name=hsa-mir-526b_-0_+0 19 27804 27822 226-1_4 1 +
Alignments:
Read ID: 68
Number of collapsed reads: 2
Number of mapped genomic loci: 1
Contribution: 2/1 = 2
Read ID: 316
Number of collapsed reads: 1
Number of mapped genomic loci: 7
Contribution: 1/7 = 0.143
Read ID: 599
Number of collapsed reads: 1
Number of mapped genomic loci: 3
Contribution: 1/3 = 0.33
Read ID: 226
Number of collapsed reads: 1
Number of mapped genomic loci: 4
Contribution: 1/4 = 0.25
OUT table:
ID lib_name
hsa-mir-526b_-0_+0 2.723
Example 2 | Different pri-miRs for a single read
IN BED records:
19 . miRNA_primary_transcript 40866 40886 . + . ID=MI0003158;Alias=MI0003158;Name=hsa-mir-520c_-0_+0 19 40865 40886 10-4_2 255 +
19 . miRNA_primary_transcript 34627 34647 . + . ID=MI0003155;Alias=MI0003155;Name=hsa-mir-520b_-5_+6 19 34626 34647 10-4_2 255 +
Alignment:
Read ID: 10
Number of collapsed reads: 4
Number of mapped genomic loci: 2
Contribution: 4/2 = 2
OUT table:
ID lib_name
hsa-mir-520c_-0_+0 2
hsa-mir-520b_-5_+6 2
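A minimal sketch of this tabulation, assuming each intersection record consists of nine GFF3 columns followed by six BED columns separated by whitespace, with the pri-miR name in the `Name` attribute and the read name in the 13th column. The function names are hypothetical, and the workflow's script may parse and round differently.

```python
from collections import defaultdict
from fractions import Fraction


def primir_name(attributes):
    """Extract the Name value from a GFF3 attributes column."""
    fields = dict(item.split("=", 1) for item in attributes.split(";"))
    return fields["Name"]


def quantify_primir(lines):
    """Sum R/N per pri-miR over the intersection records."""
    counts = defaultdict(Fraction)
    for line in lines:
        columns = line.split()
        name = primir_name(columns[8])
        _, read_counts = columns[12].rsplit("-", 1)  # read name is ID-R_N
        reads, loci = read_counts.split("_")
        counts[name] += Fraction(int(reads), int(loci))
    return {name: float(total) for name, total in counts.items()}


# Records from Example 2.
records = [
    "19 . miRNA_primary_transcript 40866 40886 . + . "
    "ID=MI0003158;Alias=MI0003158;Name=hsa-mir-520c_-0_+0 "
    "19 40865 40886 10-4_2 255 +",
    "19 . miRNA_primary_transcript 34627 34647 . + . "
    "ID=MI0003155;Alias=MI0003155;Name=hsa-mir-520b_-5_+6 "
    "19 34626 34647 10-4_2 255 +",
]
table = quantify_primir(records)
```

Summing exact fractions for Example 1 gives 2 + 1/7 + 1/3 + 1/4 = 229/84, or about 2.726; the 2.723 shown in that example corresponds to adding the per-read contributions after rounding them.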
Merge all the tables from the different libraries into a single one with a custom script.
The final table(s) containing the counting data from all libraries for the (iso)miRs and/or pri-miRs have a row per miRNA species and a column per sample library. If a miRNA species is not found in a certain library, its value is set to NA.
- Input
  - Counts tab-delimited file; from quantify_mirna and/or quantify_primir
- Parameters
  - config_template.yaml
    - mir_list: miRNA features to be quantified (default: isomir, mirna, pri-mir)
- Output
  - (Workflow output) (iso)miR and/or pri-miR counts table (.tab)
- Example
IN library 1
ID lib_1
hsa-miR-524-5p 1
hsa-miR-524-5p|0|0|22M|9G12 1
hsa-miR-524-5p|0|0|22M|9G9C2 1
IN library 2
ID lib_2
hsa-miR-524-5p 1
hsa-miR-1283 1
hsa-miR-1283|-1|-2|21M|21 1
IN library 3
ID lib_3
OUT table
ID lib_1 lib_2 lib_3
hsa-miR-524-5p 1 1 NA
hsa-miR-524-5p|0|0|22M|9G12 1 NA NA
hsa-miR-524-5p|0|0|22M|9G9C2 1 NA NA
hsa-miR-1283 NA 1 NA
hsa-miR-1283|-1|-2|21M|21 NA 1 NA
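The merge shown in this example can be sketched with plain dictionaries. This is an illustration only: the row order (first appearance across libraries) and the in-memory representation are assumptions, and the function name is hypothetical.

```python
def merge_tables(tables):
    """Merge per-library count tables into one table with NA for gaps.

    `tables` maps a library name to a dict of miRNA species -> count.
    Returns a list of rows; the first row is the header.
    """
    species = []
    for counts in tables.values():
        for name in counts:
            if name not in species:
                species.append(name)

    header = ["ID", *tables]
    rows = [
        [name, *(str(counts.get(name, "NA")) for counts in tables.values())]
        for name in species
    ]
    return [header, *rows]


merged = merge_tables({
    "lib_1": {
        "hsa-miR-524-5p": 1,
        "hsa-miR-524-5p|0|0|22M|9G12": 1,
        "hsa-miR-524-5p|0|0|22M|9G9C2": 1,
    },
    "lib_2": {
        "hsa-miR-524-5p": 1,
        "hsa-miR-1283": 1,
        "hsa-miR-1283|-1|-2|21M|21": 1,
    },
    "lib_3": {},  # a library without any counts yields a column of NA
})

for row in merged:
    print("\t".join(row))
```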
Reverse the collapsing of reads with identical sequences as done with FASTX-Toolkit with a custom script.
- Input
  - Alignments file, filtered (.sam); from filter_sam_by_intersecting_mirna
- Output
  - Uncollapsed aligned reads (.sam); used in convert_uncollapsed_reads_sam_to_bam
Convert alignments .sam file to .bam with SAMtools.
- Input
  - Alignments file, uncollapsed (.sam); from uncollapse_reads
- Output
  - Alignments file, uncollapsed (.bam); used in sort_uncollapsed_reads_bam_by_position
Sort alignments by position with SAMtools.
- Input
  - Alignments file, uncollapsed (.bam); from convert_uncollapsed_reads_sam_to_bam
- Output
  - (Workflow output) Alignments file, uncollapsed, sorted (.bam); used in create_per_library_ascii_pileups, create_per_run_ascii_pileups and/or create_per_condition_ascii_pileups
Create index BAM file with SAMtools.
Indexing is required by genome viewers such as IGV to quickly display alignments in a genomic region of interest.
- Input
  - (Workflow output) Alignments file, uncollapsed, sorted (.bam); from sort_uncollapsed_reads_bam_by_position
- Output
  - (Workflow output) BAM index file (.bam.bai); used in create_per_library_ascii_pileups, create_per_run_ascii_pileups and/or create_per_condition_ascii_pileups
Target rule as required by Snakemake.
Local rule
- Input
  - (Workflow output) Empty text file (.txt); from create_per_library_ascii_pileups and create_per_run_ascii_pileups
Create an empty BED file if the user has not provided one.
OPTIONAL RULE. This rule will be executed if, and only if, the user has not provided a BED file in the configuration file with the regions the ASCII-style alignment pileups must be performed on.
- Condition
  - config_template.yaml
    - bed_file: BED6 file with all the desired annotation regions to perform the ASCII-style alignment pileups on (default: None)
- Output
  - Empty BED file (.bed); used in create_per_library_ascii_pileups, create_per_run_ascii_pileups and/or create_per_condition_ascii_pileups
Compress the processed genome with trimmed IDs using bgzip with SAMtools.
Required to perform the ASCII-style alignment pileups.
- Input
  - Genome sequence, trimmed IDs (.fa); from trim_genome_seq_ids
- Output
  - Genome sequence, trimmed IDs, bgzipped (.fa.bz); used in create_per_library_ascii_pileups, create_per_run_ascii_pileups and/or create_per_condition_ascii_pileups
Create ASCII-style pileups for all the desired annotated regions across libraries with ASCII-style alignment pileups.
A directory containing the ASCII-style pileups is created for each library. If no BED file is provided, the pileups' output directories will only contain an empty file.
- Input
  - Genome sequence, trimmed IDs, bgzipped (.fa.bz); from compress_reference_genome
  - miRNA annotations, mapped chromosome name(s) (.gff3); from map_chr_names
  - (Workflow output) Alignments file, uncollapsed, sorted (.bam); from sort_uncollapsed_reads_bam_by_position
  - (Workflow output) BAM index file (.bam.bai); from index_uncollapsed_reads_bam
  - Annotated genomic regions (.bed); from workflow input files or create_empty_bed
- Output
  - (Workflow output) Empty text file (.txt)
Create ASCII-style pileups for all the desired annotated regions for the whole run with ASCII-style alignment pileups.
If no BED file is provided, the pileups' output directory will only contain an empty file.
- Input
  - Genome sequence, trimmed IDs, bgzipped (.fa.bz); from compress_reference_genome
  - miRNA annotations, mapped chromosome name(s) (.gff3); from map_chr_names
  - (Workflow output) Alignments file, uncollapsed, sorted (.bam); from sort_uncollapsed_reads_bam_by_position
  - (Workflow output) BAM index file (.bam.bai); from index_uncollapsed_reads_bam
  - Annotated genomic regions (.bed); from workflow input files or create_empty_bed
- Output
  - (Workflow output) Empty text file (.txt)
Create ASCII-style pileups for all the desired annotated regions across the different library subsets if provided with ASCII-style alignment pileups.
OPTIONAL RULE. The ASCII-style pileups for each annotated region are made if, and only if, at least one library subset is specified in the configuration file. Otherwise, this rule will not be executed, and no output will be generated.
- Input
  - Genome sequence, trimmed IDs, bgzipped (.fa.bz); from compress_reference_genome
  - miRNA annotations, mapped chromosome name(s) (.gff3); from map_chr_names
  - (Workflow output) Alignments file, uncollapsed, sorted (.bam); from sort_uncollapsed_reads_bam_by_position
  - (Workflow output) BAM index file (.bam.bai); from index_uncollapsed_reads_bam
  - Annotated genomic regions (.bed); from workflow input files or create_empty_bed
- Parameters
  - config_template.yaml
    - lib_dict: Dictionary of arbitrary condition names (keys) and library names to aggregate alignment pileups for (values; MUST correspond to names in samples table) (default: None)
- Output
  - Empty text file (.txt)