The BCGSC miRNA Profiling Pipeline produces expression profiles of known miRNAs from BWA-aligned BAM files and generates summary reports and graphs describing the results.
Chu, Andy, Gordon Robertson, Denise Brooks, Andrew J. Mungall, Inanc Birol, Robin Coope, Yussanne Ma, et. al., Large-scale profiling of microRNAs for The Cancer Genome Atlas, Nucl. Acids Res. (2015) doi: 10.1093/nar/gkv808 First published online: August 13, 2015 http://nar.oxfordjournals.org/content/early/2015/08/13/nar.gkv808.full
The code used for trimming adapters and generating the required read-length summary table is available here: http://www.bcgsc.ca/platform/bioinfo/software/adapter-trimming-for-small-rna-sequencing
The following conventions are used in this document to differentiate between a word that has a technical meaning and a word that has a non-technical meaning.
- command - The word command has a special technical meaning, and should be typed exactly as it is spelled
- {xhost}~> command - command is to be run on a local computer
- {db-host}~> command - command is to be run on the database server
- {cluster-host}~> command - command is to be run on the head node of the compute cluster. The cluster-host must have mysql access to a miRbase and a UCSC database.
- ... - Truncated terminal output removed for clarity
- <> (angle brackets) enclose text that you need to replace with text that is appropriate to your installation
- - refers to a user-specified, sequencing library ID. e.g. A00107
- - refers to the top pipeline directory. Applications, scripts, inputs and ouputs can be stored in subdirectories of this base directory. e.g. projects/mirna/pipeline/
The analysis pipeline takes as input a .sam file containing BWA alignments of trimmed reads for a sample and generates a) annotations of the reads, b) expression profiles of features annotated in the sample, c) summarized miRNA expression as TCGA (The Cancer Genome Atlas) formatted expression quantification reports and as expression matrices, and d) graphs of overall feature representation, filtered results, and quality metrics. The miRNA analysis pipeline version v0.2.8 has been tested with BWA 0.5.7.
External applications required to run the pipeline can be placed in the apps folder in the top pipeline directory. e.g. /v0.2.8/apps/. Although applications do not have to be stored in this directory, Perl (perl-5.10-x86_64) and R (R-2.12.0) must be available on your system. In addition, Perl requires the MySQL DBI library. R is used to generate summary graphs, and may be disregarded if graphs are not desired.
Parameter | Description | Comment |
---|---|---|
-m | miRBase | The db_name from the db_connections.cfg file for the desired version of miRBase to use for the annotation |
-u | UCSC database | The db_name from the db_connections.cfg file for the desired version of the UCSC database to use for the annotation |
-o | Species code | The appropriate species miRBase species code to use for the annotation |
-p * | Project directory | The root directory containing the .sam and adapter.report files. This directory will be called the "project directory" or PROJECT |
Note: The annotation script will search through every file and subdirectory in the project directory, so do not put non-project related subdirectories in this directory.
- Create a project directory and relevant subdirectories where inputs and outputs are stored, e.g. <PROJECT>/<LIBID>/
- Obtain 3' adapter trimming results for each sample processed
- Generate a .bam file by aligning reads to a reference with BWA
- Edit the pipeline_params.cfg and db_connections.cfg files with the paths and database access information appropriate to your system
- Convert the .bam file to a .sam file using Samtools
- Run the annotate.pl script to annotate the .sam file
- Run the alignment_stats.pl script to generate alignment stat reports
- Run the tcga.pl script to generate TCGA formatted results
- Run the expression_matrix.pl script to generate miRNA expression matrices
- Run the graph_libs.pl script to graph results
Before any alignment can occur, some pre-processing of the sequences is necessary. MiRNAs tend to be shorter than the length of reads generated by sequencers, which means that the read will sequence past biological RNA into the 3' adapter of the sequencing protocol. This non-biological sequence at the end of the read will make it difficult for the aligner to align the read, so it is trimmed off first.
Documentation and code to trim adapters from miRNA reads and generate the required adapter.report is available at: http://www.bcgsc.ca/platform/bioinfo/software/adapter-trimming-for-small-rna-sequencing
The adapter trim summary report is a space delimited file which summarizes the adapter trimming done on the reads before alignment. It contains read lengths and number of reads, in this order.
adatper.report example
0 2232593
1 0
2 239
3 426
4 439
5 824
6 2167
7 4120
8 2768
9 2035
10 1835
11 1881
12 2704
13 3304
14 5431
15 10945
16 18786
17 39017
18 79237
19 144145
20 710922
21 1561322
22 519242
23 1022067
24 130369
25 25873
26 1611
27 22337
28 134916
29 1094298
30 739272
If you are working with bam files where the reads have already been trimmed and aligned, you can generate this summary report directly from the bam with the following command:
samtools view <abc.bam> | awk '{arr[length($10)]+=1} END {for (i in arr) {print i" "arr[i]}}' | sort -t " " -k1n
Information from the adapter report is included in the alignment_stats.csv report generated after miRNA profiling is complete.
Reads are aligned using BWA, typically with the following settings (the number of alternate alignments reported, -n, can be changed to any desired value):
#$infile is a miRNA fastq file
#$ref is a genomic reference fasta file
{cluster-host}~> bwa aln $ref $infile > $sai
{cluster-host}~> bwa samse -n 10 $ref $sai $infile > $sam
Collect the .bam and adapter.report files together under one project directory, typically with one subdirectory per pool of samples
example directory structure
{cluster-host}~> mkdir -p <PROJECT>/<LIBID1>
{cluster-host}~> mkdir <PROJECT>/<LIBID2>
{cluster-host}~> mkdir <PROJECT>/<LIBID3>
Symlink, copy, or move .bam files with their corresponding adapter.report files to the appropriate project subdirectory and rename them according to the convention <LIBID>_<INDEX>
file renaming example
{cluster-host}~> cp <MY_LIBID>.bam <PROJECT>/<LIBID1>/<LIBID1>_<INDEX1>.bam
{cluster-host}~> cp <MY_LIBID>_adapter.report <PROJECT>/<LIBID1>/<LIBID1>_<INDEX1>_adapter.report
{cluster-host}~> cp <MY_LIBID>.bam <PROJECT>/<LIBID1>/<LIBID1>_<INDEX2>.bam
{cluster-host}~> cp <MY_LIBID>_adapter.report <PROJECT>/<LIBID1>/<LIBID1>_<INDEX2>_adapter.report
...
{cluster-host}~> cp <MY_LIBID>.bam <PROJECT>/<LIBID2>/<LIBID2>_<INDEX1>.bam
{cluster-host}~> cp <MY_LIBID>_adapter.report <PROJECT>/<LIBID2>/<LIBID2>_<INDEX1>_adapter.report
{cluster-host}~> cp <MY_LIBID>.bam <PROJECT>/<LIBID1>/<LIBID2>_<INDEX2>.bam
{cluster-host}~> cp <MY_LIBID>_adapter.report <PROJECT>/<LIBID2>/<LIBID2>_<INDEX2>_adapter.report
There are two configuration files which must be modified:
- db_connections.cfg contains database settings to access MySQL databases containing the necessary UCSC and miRBase information. You must have a database connection to a miRBase instance and a UCSC database instance for annotations of miRNAs and other non-coding RNAs respectively. <db_name> field provides the database source for various script parameters <host> is the server name of the database. <user> and <password> are the user-specific login and password
- pipeline_params.cfg is provided for additional, optional settings such as the path to an Rscript binary for the optional graphing functions, e.g. Rscript=<BASEDIR>/apps/R-2.12.0/lib64/R/bin/Rscript
There is one additional script that must be modified with the path to perl 5. Modify profile.sh so that it points to the appropriate Perl to use for the scripts. Run source on this to generate the environment, e.g. {cluster-host}~> source /v0.2.8/config/profile.sh
The annotation process takes one or more .sam files and produces an annotated version of the .sam files where each read in each file is given an annotation based on its coordinates. A three base pair overlap with the coordinates of a known genomic feature is required for a read to be annotated. In the event that the coordinate range of an aligned read overlaps features in multiple databases, the read is annotated with each feature type overlapped, but the read is counted toward the expression of only the highest priority feature. The feature types considered for annotation by the pipeline, and their priority for read assignment, are given in Table 1 (Additional Information).
Before annotation can begin, .bam files must first be converted to .sam format. The bitmap flag in the .sam needs to be in hexidecimal format for bit calculations. The read will not be processed correctly if string format is used. Note that the annotation script expects to find a .sam file and its respective adapter trim summary report in the same project directory.
{xhost or cluster-host}~> samtools view -h <PROJECT><LIBID1>/<LIBID1>_<INDEX1>.bam > <PROJECT><LIBID1>/<LIBID1>_<INDEX1>.sam
During the annotation process, the script will create a version of the .sam named .sam.annot which adds the tags XC, XI, and XD to each line in the .sam file, while leaving everything else untouched. Once the annotation of a .sam file is complete, the original .sam file will be overwritten by the .sam.annot after checking both versions have the same number of reads.
Each aligned read in each .sam file in the project directory is annotated using miRBase and UCSC reference databases (any databases used must have the appropriate access information in the db_connections.cfg file).
Annotations are added to the .sam file as new columns (XC, XI, and DX), with multiple annotations for a single read separated by semicolons. Column XC contains the feature type the read overlapped; column XI contains the feature name; and column XD contains the MIMAT ID from miRBase if the read mapped to a known miRNA. If these columns already exist in the .sam file to be annotated, they are stripped off prior to annotation.
During annotation the alignment coordinates of each read are checked against reference databases for overlaps. To account for differences in chromosome naming conventions between aligners (e.g. chr1 vs 1), the naming convention used in the .sam is checked by taking a "sample" from the .sam file and checking the labeling format used.
All .sam files and corresponding adapter.report pairs in the given project directory.
- <PROJECT>/<LIBID>/<LIBID>_<INDEX>.sam
- <PROJECT>/<LIBID>/<LIBID>_<INDEX>_adapter.report
Appropriate miRbase and UCSC database access information (database name, host, userid, password) in db_connections.cfg
The annotation script requires the names of the desired miRbase and UCSC databases listed in db_connections.cfg, the miRbase species code to use (e.g. hsa for human), and the path to the top project directory (i.e. the directory where the LIBID subdirectories are located).
{cluster-host}~> perl <BASEDIR>/v0.2.8/code/annotation/annotate.pl -m <mirbase> -u <ucsc_database> -o <species_code> -p <PROJECT>
If the annotation fails, both a .sam and a .sam.annot file will be present in the <LIBID> subdirectory when the script finishes. If a failure occurs the script should be run again. The temporary .sam.annot files will be overwritten when the annotation script is rerun.
If annotation completes successfully, the original .sam file will be replaced by the annotated .sam and the .sam.annot file will have been removed.
Annotated read information is collated into summary reports by alignment_stats.pl.
This script returns the highest priority features from the XC, XI, and XD columns of an annotated .sam file (Additional Information: Table 1) to generate a set of alignment summary files for each annotated .sam file. A series of comma-separated data files is generated for each feature type - each summary file is named for the specific feature type. Each summary file lists the specific gene names with annotated reads followed by the expression level as a read count and sorted by expression level. An additional comma-separated file summarizing expression levels for each feature is generated for each .sam and adapter.report file pair in the project directory.
Note that the alignment_stats.pl script contains filtering specifications. Currently, only reads that map perfectly to the genome, are not soft clipped by BWA, and map to 3 or fewer locations on the genome are considered for annotation as miRNA. Reads that align perfectly, pass all filters, but fail chastity check, are not excluded, but are used for annotation as well.
Annotated <LIBID>_<INDEX>.sam files in the <LIBID> subdirectories of a <PROJECT> directory. Because the original .bam file has been converted to .sam and then overwritten by annotate.pl, original alignment files have been removed from the project directory after the annotation step.
{dbhost or xhost}~> perl <BASEDIR>/v0.2.8/code/library_stats/alignment_stats.pl -p <PROJECT>
The alignment_stats.pl script outputs summary and detailed reports which are placed in a report directory named after each library. e.g. <PROJECT>/<LIBID>/<LIBID>_<INDEX>_features/.
alignment_stats.csv is a comma-separated summary report listing summary annotation information (i.e. reads mapped to identified features) for each annotated .sam file. The file contains one line for each library in a LIBID subdirectory. Rerunning the script will regenerate the summary files and recover missing results. Feature-specific reports: For each of the features included in alignment_stats.csv, report files are created which list the specific gene expressed followed by the expression level (number of reads that overlapped the feature by at least 3 bp). Reports are sorted by expression level. miRNA specific reports:
- crossmapped.txt* contains all miRNAs that have been cross-mapped by any reads. A cross-mapped read indicates that a single read aligns to more than one miRNA. There are two categories of cross-mapped miRNAs. miRNAs containing at least one cross-mapped read are listed on a single line with only one miRNA ID. Lines containing more than one miRNA ID (separated by semicolons) represent the set of miRNA IDs that a read was cross-mapped between.
- miRNA.txt* contains the list of all uniquely mapped miRNAs with an extra annotation denoting the region of the miRNA the read mapped to. Note:* miRNA.txt and crossmapped.txt have a 3rd column which is the read count as a percentage of Total miRNA from alignment_stats.csv.
- mirna_species.txt is a summary of the miRNA.txt file with specific annotations about miRNA region left out. This means that each count is the number of reads aligning anywhere along the miRNA. Cross-mapped reads will be counted once for each miRNA it aligns to (i.e. one read will be counted as coverage for multiple miRNAs).
- isoforms.txt contains miRNAs in the miRNA.txt and crossmapped.txt files split by individual reads that make up the miRNA. Columns are: miRNA, chromosome number, start point, end point, strand(+/-), read sequence, read count, cross-mapped (1 = yes, 0 = no), annotation.
- A BED directory is created for each sample. This directory contains .txt files in BED format showing coverage of all reads which pass the alignment filters. The .txt files can be used to generate .wig coverage files. They are also sorted and filtered so that they can be used for peak discovery and novel miRNA prediction.
The tcga.pl script resolves crossmapped and multimapped reads then formats resulting expression files to the format defined for TCGA. Can be run on a desktop computer. Needs miRBase access.
expression files from <PROJECT>/<LIBID>/<LIBID>_<INDEX>_features/
- miRNA.txt
- crossmapped.txt
- isoforms.txt
{dbhost or xhost}~> perl <BASEDIR>/v0.2.8/code/custom\_output/tcga.pl -p <PROJECT>
These output data files are space delimited and located in a subdirectory of <PROJECT>/<LIBID>/<LIBID>_<INDEX>_features/tcga/
mirna.quantification.txt contains summed expression for all reads aligning to known miRNAs in the miRBase reference. If there are multiple alignments to different miRNAs or different regions of the same miRNA, the read is flagged as cross-mapped and every miRNA annotation is preserved. Comma-separated fields:
- miRNA name
- raw read count
- reads per million miRNA reads
- cross-mapped to other miRNA forms (Y or N) isoform.quantification.txt contains observed isoforms. Comma-separated fields:
- miRNA name
- alignment coordinates as <version>:<Chromosome>:<Start position>-<End position>:<Strand>
- raw read count
- reads per million miRNA reads
- cross-mapped to other miRNA forms (Y or N)
- region within miRNA
Scripts are provided to generate two types of expression matrices: pre-miRNA gene expression matrices and mature miRNA expression matrices. Expression matrices are tab-separated text files of read counts where counts for each miRNA gene are matrix rows with samples as columns.
library_stats/expression_matrix.pl
Generates an miRNA expression matrix with pre-miRNA genes as rows and gets read counts from the mirna_species.txt files (created by alignment_stats.pl in the <PROJECT>/<LIBID>/<LIBID>_<INDEX>_features/ subdirectories)
mirna_species.txt files
Run expression_matrix.pl
{dbhost or xhost}~> perl <BASEDIR>/v0.2.8/code/library\_stats/expression\_matrix.pl -m <mirbase> -o <mirbase\_species\_code> -p <PROJECT>
-m is the miRBase database to use, e.g. mirna\_20
-o is the species code used by miRBase for the desired organism. For human, use hsa
-p is the project directory that contains the library subdirectories
- expn_matrix.txt is an expression matrix containing raw read counts
- expn_matrix_norm.txt is an expression matrix of normalized read counts. Counts are divided by the total number of reads that aligned to miRNA in the library times one million.
- expn_matrix_norm_log.txt is an expression matrix of normalized, log2-transformed read counts.
The first of two methods for generating mature-miRNA expression matrices, the output is the same as that generated by expression_matrix.pl (Rows are miR counts, columns are samples, with three expression matrices generated: raw read count; normalized count; and normalized, log2-transformed count).
library_stats/expression_matrix_mimat.pl method uses queries miRBase to convert MIMAT IDs to miR names, and gets read counts from the crossmapped.txt files (generated by annotate.pl in the <PROJECT>/<LIBID>/<LIBID>_<INDEX>_features/
crossmapped.txt files
{dbhost}~> perl <BASEDIR>/v0.2.8/code/library\_stats/tcga/expression\_matrix\_mimat.pl -m <mirbase> -o <mirbase\_species\_code> -p <PROJECT>
-m is the miRBase database to use, e.g. mirna\_20
-o is the species code used by miRBase for the desired organism. For human, use hsa
-p is the directory that contains the library subdirectories
The second method for generating mature miRNA expression matrices uses a local miRNA adf text file to translate MIMAT IDs to miR names and gets read counts from <PROJECT>/<LIBID>/<LIBID>_<INDEX>_features/tcga/isoform.quantification.txt files (generated by code/custom_output/tcga/tcga.pl; these are the files that are submitted to the DCC by the BCGSC as Level 3 archive files)
Either rename the isoform.quantification.txt files you want included in the expression matrix as <sample_id>.isoform.quantification.txt files and copy them into a separate sub-directory, or download and unzip the desired Level 3 archives. (They will unzip to a single directory.)
{dbhost or xhost}~> perl <BASEDIR>/v0.2.8/code/library_stats/tcga/create_adf.pl -m <mirbase> -o <mirbase_species_code> -g <genome_version> -v <mirbase_adf_file>
-m is the miRBase database to use as listed in the db\_connections.cfg file, e.g. mirna\_20
-o is the species code used by miRBase for the desired organism. For human, use hsa.
-g specify the appropriate genome version for the version of miRBase that you are using. For mirna\_20 use hg\_19.
-v the name you want for the adf file
The adf file created will have the following format and be sorted by miRNA_ID: miRNA_ID miRBase_Ver Accession Genomic_Coords Precursor_Seq Mature_Coords Mature_Accession Alt_Mature_Coords Alt_Mature_Accession Star_Coords Star_Accession
{dbhost or xhost}~> perl <BASEDIR>/v0.2.8/code/library_stats/tcga/expression_matrix_mimat.pl -m <miRNA_adf_file> -p <Level_3_archive_directory>
-m miRNA\_ADF is the adf file generated by create\_adf.pl
-p the directory where the \*.isoform.quantification.txt files can be found. The script will loop through all isoform.quantification.txt files that it finds in this directory to build the expression matrix.
graph_libs.pl is a Perl wrapper that calls R to generate graphs of feature representations, filtered results and quality metrics based on files generated by alignment_stats.pl. The script generates graphs for each sample under the project directory and can be run on a desktop computer.
The alignment_stats.csv file and the report files located in each _features subdirectory in the project directory generated by alignment_stats.pl. (Note: graph_libs.pl searches for input files in the _features subdirectories by first trying to make a list of .sam files in the project directory. If the .sam files have been removed from the project directory, a set of appropriately-named, but empty, .sam files must be generated so that the wrapper script can find the needed input files.)
{xhost}~> perl <BASEDIR>/v0.2.8/code/library_stats/graph_libs.pl -p <PROJECT>
Make sure that if running this script through an ssh tunnel, X11 forwarding is enabled so that the R graphing libraries can be used. e.g. ~> ssh -X <MYSERVER>
Graphs will be created in both the library _features directories and in a new /graphs/ directory. The graphs are described in the online README document listed in section XI.
The miRNA annotation pipeline requires the following scripts and software
- adapter trimming script: http://www.bcgsc.ca/platform/bioinfo/software/adapter-trimming-for-small-rna-sequencing
- BWA 0.5.7
- samtools 0.1.7
- miRNA profiling 0.2.8
- Pre-alignment processing: A general description of how adapter trimming is performed is provided in the "1a. Preprocessing" section of the pipeline description at: /v0.2.8/DESCRIPTION.txt. Additional description and usage information is available through the publicly available archive http://www.bcgsc.ca/platform/bioinfo/software/adapter-trimming-for-small-rna-sequencing
- Annotation: Overview of miRNA analysis at the BCCA-GSC: <BASEDIR>/v0.2.8/DESCRIPTION.txt. Script usage detailed: <BASEDIR>/v0.2.8/code/annotation/HOWTO.txt
- Alignment stats: <BASEDIR>/v0.2.8/library_stats/README.txt
- Expression: TCGA formatted miRNA expression reports: <BASEDIR>/v0.2.8/custom_output/tcga/README.txt File formats for TCGA data files are also available online at: <https://wiki.nci.nih.gov/display/TCGA/miRNASeq
- miRNA expression maxtrices: <BASEDIR>/v0.2.8/library_stats/README.txt
- Optional graphs: <BASEDIR>/v0.2.8/library_stats/README.txt
Table 1. Annotation priorities used to resolve multiple matches, from highest to lowest priority. If a read has more than 1 alignment, and the annotations are different, the priorities from Table 1 are used as long as only 1 alignment is to a miRNA. If there are multiple alignments to different miRNAs (or even different regions of the same miRNA), the read is flagged as cross-mapped and every miRNA annotation is preserved. Using these rules, the reads are summed by annotation, and coverage reports are generated.*
Database order of priority | Feature order of priority |
---|---|
1 Annotation from miRBase | 1. mature strand |
| 2. star strand
| 3. precursor miRNA
| 4. stemloop, from 1 to 6 bases outside the mature strand, between the mature and star strands
| 5. "unannotated", any region other than the mature strand in miRNAs where there is no star strand annotated
2 From UCSC RepeatMasker (other small RNAs) | 1. snoRNA | 2. tRNA | 3. rRNA | 4. snRNA | 5. scRNA | 6. srpRNA | 7. other RepeatMasker RNAs 3 From UCSC knownGene | 1. coding exons where the annotated CDS region is 0 length | 2. 3' UTR | 3. 5' UTR | 4. coding exon 4 From UCSC RepeatMasker (other types of repeats) | 1. LINE | 2. SINE | 3. LTR | 4. Satellite | 5. RepeatMasker DNA | 6. RepeatMasker Low complexity | 7. RepeatMasker Simple Repeat | 8. RepeatMasker Other | 9. RepeatMasker Unknown
- What are precursors? What are mature strands?
A miRNA mature strand is the functional piece of miRNA that is transcribed from one arm of the precursor miRNA. Historically, one of the arms of the precursor express the "mature strand", and the other arm expresses a "star strand" which is thought to be nonfunctional and degraded. It has now been shown that both arms express functional miRNAs, and so both arms are now known as mature strands. On miRBase, the precursor miRNA has a MIxxxxxxx accession, whereas the mature strands have a MIMATxxxxxxx accession.
- What are the two .quantification.txt files in the Level 3 archives, and how to I extract data from them?
TCGA Level 3 archives for miRNA-seq data include 2 files: miRNA.quantification.txt and isoform.quantification.txt. miRNA.quantification.txt is a high level view of the miRNA, considering the expression of any region of the precursor miRNA, even those completely outside the mature strands. isoform.quantification file breaks the read alignments into detail, allowing one to extract data of any resolution from this file:
- The sum of read counts by the precursor miRNA name (in the miRNA_ID column) will result in counts equal to miRNA.quantification.txt.
- The sum of read counts by MIMAT accession (by MIMATxxxxxxxx in the miRNA_region column), excluding reads with other annotations, will result in the counts for each mature strand. (Note: In this case, the reads_per_million_miRNA_mapped (rpm) column is not typically summed to represent the rpm value for a MIMAT ID. Instead, the rpm for a MIMAT is generated from the summed counts for a MIMAT divided by the total summed MIMAT read counts for a sample multiplied by 1,000,000.)
- The sum of read counts by coordinates (in the isoforms_coords column) will result in counts for each miRNA isoform.
- What is crossmapping?
The trimmed sequence reads largely represent isomiRs, so are short (22±3 nt). Their 5’ and 3’ ends can differ from miRBase reference mature strand coordinates, particularly at the 3’ end. As well, some miRNAs occur as families of closely related sequences that have identical or nearly identical mature strand sequences (and MIMAT accession IDs). These factors result in reads crossmapping and multimapping, which we address as follows (see the publication for more details).
A short isomiR read may map exactly to mature strands whose sequences are similar but not identical, when the read sequence does not capture the bases that distinguish these miRNAs (e.g. hsa-mir-30a at 6q13 and hsa-mir-30e at 1p34.2, which differ at position 18). We report such a read as cross-mapped, and we increment the read count for each MIMAT that it mapped to.
A read can multimap to identical mature strands from a miRNA family whose members are in different locations in the genome (e.g. miR-181a-5p=MIMAT0000256 is present in hsa-mir-181a-1 at 1q32.1 and in hsa-mir-181a-2 at 9q33.3). When we annotate a read as miR-181a-5p, we increment the read count for this MIMAT, and we increment the read count of one of the genomic locations for the family’s stem-loops. See below.
- What are the different miRNAs with the same mature strand?
Many miRNAs have names in the format of hsa-mir-21, ie. organism-"mir"-microRNA. However, some miRNA names have an additional suffix, eg. hsa-mir-101-1 and hsa-mir-101-2. Note that miRNAs such as hsa-mir-29a and hsa-mir-29b are not an example of this; they are simply related, but distinct miRNAs.
These additional -1/-2 suffixes denote functionally identical miRNAs expressed from different loci. The miRBase entries for hsa-mir-101 provides the following information: hsa-mir-101-1; accession MI0000103; chr1: 65524117-65524191 [-] Mature sequence hsa-miR-101-5p; accession MIMAT0004513 Mature sequence hsa-miR-101-3p; accession MIMAT0000099 hsa-mir-101-2; accession MI0000739; chr9: 4850297-4850375 [+] Mature sequence hsa-miR-101-3p; accession MIMAT0000099
The mature sequence hsa-miR-101-3p has the same accession (and sequence) in both hsa-mir-101-1 and 101-2. This is the case for all miRNAs with only differing suffixes.
During miRNA profiling, if a read sequence has enough bases outside the canonical mature sequence to unambiguously assign it to a particular miRNA ID, then it does so. However, if the read maps to the mature strand, eg. MIMAT0000099, then there is no way to unambiguously assign it as hsa-mir-101-1 or 101-2, and one of the miRNA IDs is arbitrarily (not randomly) assigned. For this reason, when working at the precursor level, it is suggested that miRNA counts are summed to get one count for hsa-mir-101. When working at any higher resolution, there is no ambiguity when working with mature strand names or accessions. Counts can be summed by mature accession, and the precursor name can be ignored.
- Which read alignments are counted for miRNA expression profiles?
For a read annotation to be used the following rules are applied by examining the NM, X1, X0, XA, XC, and or XD fields in an annotated SAM file.
Brief field descriptions: The following information is reported for a primary read alignment in the first 11 columns of the SAM file: read id, bitflag, chromsome, start position on reference, mapping quality, cigar string, mate, mate position, insert size, read sequence, read sequence quality.
These fields are followed by "tag" fields with additional information about the read:
NM tag: number of "edits" for the main alignment to the reference
Tags with information for alternate alignments X0 number of what BWA considered to be the "best" alignments for the read X1 number of BWA secondary alignments XA contains semicolon separated fields that each contain comma-separated mapping info about each alternate read alignment, including an NM field with the number of mismatches for each specific read alignment.
Fields added by annotate.pl: XC general annotations for aligned reads, comma separated. XD specific annotations for aligned reads, commar separated. (Note: the first field of XC and XD is the annotation for the primary alignment, while subsequent fields are annotations for secondary alignments.)
Rules for counting a read as a miRNA:
- The minimum number of mismatches (NM tag) for all alignments for a read must be zero, e.g. the NM tag for the primary alignment or NM tag in a seconary alignment info field (semicolon-separated XA tag) if the primary alignment is not used.
- The number of "best" alignments (X0 tag) must be 3 or fewer
- The number of annotations (XC tags) considered is the number of alignment info fields (semicolon-separated XA tags), or the number of primary alignments (value of X0), which ever is smaller.
- Annotations for alignments used (3 or fewer) are read from the XD field, or are read from the XC field if there is no XD field. (For miRBase 16 annotations "mature" and "star" are trimmed off of the XD annotations.)
- The mapping info (XA tag) for each primary alignment considered is split into chromosome, position, cigar, and NM info
- If there are annotations in the XD field, these are used to determine which read has the highest annotation priority (annotation priorities are described in the main documentation). Otherwise, the annotations in the XC field are used. Basically, all annotations for seconary alignments are compared to the annotation for the primary alignment; if a secondary alignment has a higher priority annotation than the primary alignment, that read becomes the primary and the next alignment is then compared to the new primary.
The default settings can be changed by editing the following lines in the alignment_stats.pl
my $MAX_XA = 10; #number of alignments a read can have
#used to check for "overfiltering" e.g. if the number of alignments
#listed in X0 together with X1 are greater than 10, while the value
#of the X0 field (number of alignments) is less than or equal to 0,
#BWA is only reporting a subset of the read alignments.
my $MAX_NM = 0; #most mismatches allowed in the alignment, according to NM flag
my $MAX_X0 = 3; #most alignments a read can have before it's ignored
-
The number of alignment mappings allowed (XA)
- The number of mismatches (NM tag) must be zero
- The number of primary alignments (X0 tag) must be 3 or fewer
next if $nm > $MAX_NM;
next if $x0 > $MAX_X0;
- The number of annotations (XA tags) considered is the number of alignment info fields (XC tag contains one comma-separated info field for each alignment), or the number of primary alignments (value of X0), which ever is smaller.
#in case XA is overfiltered, get the smaller of X0 and length of XC array
$x0 = scalar @{$xc} if scalar @{$xc} < $x0;
- Annotations for the primary alignments used (3 or fewer) are read from the XD field, or are read from the XC field if there is no XD field. ("mature" and "star" are trimmed off of the XD annotations) Note: The annotations in the XC and XD fields are separated by semicolons. If a read has a good alignment, but no annotation, there will be an empty field before the semicolon in the XD field in the SAM file e.g. XD:Z:;;mature,MIMAT0018076;;;; indicates a read with a total of six alignments, but only one known annotation. This is easier to see in the XC field: XC:Z:Unknown;Unknown;miRNA;Intron;Unknown;Intron;
sub get_annot_key {
#returns the key used in ANNOTATION_PRIORITY
my $xc = shift;
my $xd = shift;
$xd =~ s/,.\*//; #remove specific information after the xd tag (mature and star strand miRNAs have the MIMAT ID appended to the end of xd, after a comma)
return $xd unless $xd eq '';
return $xc;
}
- The info for each primary alignment considered is split into chromosome, position, cigar, and NM info
#read only tags corresponding to the X0 alignments
for (my $i = 0; $i < $x0; $i++) {
my $xtag = get_annot_key($xc->[$i],$xd->[$i]);
($chr, $pos, $cigar, $nm) = split(',', $xa->[$i]);
...
- If the primary alignment has no mismatches (e.g. the NM field is less than or equal to 0) and the annotation priority for the corresponding annotation is higher than annotations for any of the alternate alignments for the read, the annotation is saved.
if ($ANNOTATION_PRIORITY{$xtag} < $ANNOTATION_PRIORITY{$prev_xtag} && $nm <= $MAX_NM) {
$prev_xtag = $xtag;
$index_used = $i;
}
###maintainer: [email protected]