miRNA_pipeline_description.txt

Overview of miRNA analysis
Last Edited by Andy Chu, August 09 2010.

	The analysis pipeline takes as input a .sam file containing BWA alignments for a sample, and generates a profile of the content of the sample. This includes metrics such as the expression of various RNAs with particular focus on the expression of miRNA species. This document aims to describe the algorithms in the miRNA analysis pipeline in detail, making note of specific thresholds used.

Preprocessing

	Before any alignment can occur, some preprocessing on the sequences is necessary. MiRNAs tend to be shorter than the length of reads generated by sequencers, which means that the read will sequence past biological RNA into the 3' adapter of the sequencing protocol. This non-biological sequence at the end of the read will make it difficult for the aligner to align the read, so it is trimmed off first.

	The adapter trimming works by looking for as much of the specified adapter sequence in the read as possible, allowing mismatches based on the length of adapter found. Only the first 15 bases of the adapter sequence are used for the string matching, regardless of the length of the adapter sequence supplied. If a read matches 15bp of non-biological adapter, it is safe to assume the sequence truly is adapter sequence.

	For each read, the algorithm first checks if it is an adapter dimer by checking if the adapter sequence is at the start of the read. Since it has been seen in indexed reads that the first few bases of the adapter can be lost, not only is the full 15bp of adapter sequence matched, the adapter sequence with the first 1, 2, and 3 bases trimmed off the beginning is also matched. If the read is not an adapter dimer, then the algorithm sees if there is an exact match of 15bp anywhere in the read, starting at the beginning of the read. Finally, fuzzy matching tries to match the adapter anywhere along the read, starting from the end. When matching the full 15bp of adapter, any match with up to 2 mismatches is accepted. If the full 15bp is not found, decreasing lengths of adapter, down to the first 8 bases, is checked with 1 mismatch allowed. If a match is still not found, 7 down to 1 base of adapter is checked with an exact match required. The algorithm will trim 1 base off the end of a read if it happens to match the first base of the adapter. This is because it is better to get a perfect alignment than an alignment with potentially 1 mismatch. In addition, if only 1 base of adapter was found, the read is likely not a miRNA and would not be a concern anyway.

	When trimming SOLiD reads, the base space anchor is not supplied as part of the adapter sequence, and the first character in the read sequence is ignored as the anchor base. The same matching algorithm is used as for base space trimming. If adapter sequence is found, there is one extra postprocessing step where the final colour space read is removed, since that represents a transition from the last base of RNA to the first base of adapter sequence, so it should not align to a reference. The read length reported will be the sum of all colour space transitions in the read, although the base space anchor will still remain, ie. the read length is 1 less than the string length of the read.

	After each read has been processed, a report is generated containing the number of reads at each read length. Any read that is under 15bp after trimming is discarded, the rest is sent to alignment. 15bp is chosen because it is the shortest mature miRNA in miRBase 16.
	
Annotation

	Given a .sam file, the first step of the pipeline is to annotate each of the aligned coordinates for each read against reference databases.	A 3 base pair overlap is required, and in the event that a coordinate range overlaps features in multiple databases, each feature type is given a priority. The feature types, in the order that they are applied, are in Table 1.

	For SOLiD reads aligned with Bioscope, a preprocessing step to reformat the .sam file into a BWA aligned format is first performed. Reads with multiple alignments are split into multiple lines by Bioscope. These lines are combined and the alignments are merged into an XA tag used by BWA, and read by the annotation and profiling scripts.

Table 1. Annotation priorities to resolve multiple matches.
	From miRBase.
		mature strand
		star strand
		precursor miRNA
		stemloop, from 1 to 6 bases outside the mature strand, between the mature and star strands
		"unannotated", any region other than the mature strand in miRNAs where there is no star strand annotated
	From UCSC RepeatMasker, other small RNAs
		snoRNA
		tRNA
		rRNA
		snRNA
		scRNA
		srpRNA
		other RepeatMasker RNAs
	From UCSC knownGene. For species where knownGene is not availabe, refGene is used instead.
		coding exons where the annotated CDS region is 0 length
		3' UTR
		5' UTR
		coding exon
		intron
	From UCSC RepeatMasker, other types of repeats
		LINE
		SINE
		LTR
		Satellite
		RepeatMasker DNA
		RepeatMasker Low complexity
		RepeatMasker Simple Repeat
		RepeatMasker Other
		RepeatMasker Unknown

Profiling

	After annotation is done, the results are aggregated to produce a report. For each read, a series of filters are used to check whether its alignment will be used. If the read has more than 3 alignments, it is discarded. Currently only perfect alignments with no mismatches are used. Based on expression profiles of test libraries, reads which fail chastity are kept, while reads which are softclipped by BWA are not. After filtering, each alignment annotation of the read is looked at to determine the annotation for the read itself. If a read has more than 1 alignment, and the annotations are different, the priorities from Table 1 are used as long as only 1 alignment is to a miRNA. If there are multiple alignments to different miRNAs, the read is flagged as cross-mapped and every miRNA annotation is preserved. MiRNAs are determined to be different by looking at their names. A miRBase name may have up to 4 separate sections separated by "-", eg. hsa-mir-26a-1. The final -1 in the example denotes functionally equivalent miRNAs expressed from different regions of the genome. Therefore only the first 3 sections (hsa-mir-26a in this example) is considered when comparing names. If 2 miRNAs are not considered crossmapped, then only 1 will be reported. The strand annotation is used to determine priority, but if even that is the same, then one annotation is arbitrarily chosen based on the order they were output by the aligner.
	
	Using these rules, the reads are summed by annotation and coverage reports are generated.	The reports generated are individually documented, and is packaged along with the code to keep up with updates. A current copy is included in this document in Appendix 1.

Custom Reporting

	The reports from profiling are specially formatted for different uses. All custom reporting is done by simply reformatting data files generated by the profiling step. There is no additional processing.
	
	Files submitted to TCGA use miRNA reports from mirna_species.txt, crossmapped.txt, and isoforms.txt to list all miRNA species found, along with their read coverage and isoform coverage.

Appendix 1. MiRNA Profiling Reports.

MiRNA Analysis Pipeline, Reports Guide
Last Edited Jul 28, 2010 by Andy Chu

This document explains the output from the library_stats scripts: alignment_stats.pl, graph_libs.pl, expression_matrix.pl.

1. alignment_stats.pl
Upon running this script, 1 set of data files specific to an input .bam file is created, detailing features found in the sample. A .csv file is also created in the project level directory, listing summary stats about each sample in the project.

1A. alignment_stats.csv file overview
The following fields are in the .csv summary file.

Library, Index: Identifier of the sample.
Total Reads: number of all reads used in alignment that can be associated with the sample. In the case of indexed libraries, this means the set of reads where the first N bases map back to the index sequence of length N. Note that since reads with less than 15bp of biological sequence were not sent for alignment, this number does not reflect those reads.
% Adapter dimers: number of adapter dimers as percentage of all reads associated with library, including reads with adapter before the 15th base which were not includedin the "Total Reads" column.
Adapter dimers: number of reads where the 3' adapter was found at the beginning of the read, implying 5' and 3' adapters formed a dimer with no biological RNA in the sequence. Discarded without alignment.
Adapter at 1-14bp: number of reads where the 3' adapter was found at the first 14bp of the read, meaning that there is less 15bp of biological sequence. Discarded without alignment.
Adapter at 15-25bp; Adapter at 26-35bp, Adapter after 35bp: number of reads with 3' adapter found after n bp of sequence.
Aligned Reads Post-Filter: number of reads that are considered for analysis. This is the number of reads that align perfectly to the genome, and map to 3 or fewer places. This number will be the basis of percentages for the feature counts.
Unaligned Reads: number of reads that could not be aligned to the genome at all. Total Reads - (Aligned Reads Post-Filter + Unaligned Reads) = number of reads that aligned but with any combination of mismatches, indels, mapped to too many regions, etc.
% Aligned Reads, %Unaligned Reads: Aligned and Unaligned Reads expressed as a percentage of Total Reads.
Filtered Reads without XA: ***Used for internal development*** Number of reads where extra mapped positions could not be used because they were not reported by bwa when X0 + X1 > N in `bwa samse -n N`.
Softclipped Reads: ***Used for internal development*** Number of reads that were perfectly aligned to the genome after bwa softclips part of the sequence. These alignments ARE NOT used in the core reports of reads mapped to features.
Chastity Failed Reads Post-Filter: ***Used for internal development*** Number of reads that aligned perfectly and passed all filters, but failed chastity check. These alignments ARE used in the core reports of reads mapped to features.
miRNA Species: number of miRNA species covered by at least 1 read.
miRNA Species Covered by >= 10 Reads without Crossmapping: number of miRNA species covered by at least 10 reads, does not consider cross-mapped miRNAs.
Total miRNA: number of reads aligning to miRNAs, this number is the sum of Crossmapped miRNA, mature miRNA, * miRNA, precursor miRNA, miRNA loop, and unannotated miRNA
Crossmapped miRNA: number of reads that align perfectly to more than 1 miRNA species, regardless of location within the miRNA. These reads are not counted in the mature, *, precursor, loop and unannotated miRNA numbers.
mature miRNA: number of reads mapped to the mature strand of miRNAs
* miRNA: number of reads mapped to the * strand of miRNAs
precursor miRNA: number of reads that either 1. do not map to the mature strand, but map to the region between the end of the mature strand and the end of the precursor miRNA, or 2. do not map to the * star strand, but map to the region between the end of the * strand and the end of the precursor miRNA
miRNA loop: number of reads that do not align to the mature strand, but align to the 5 bases after the mature strand, where the loop between the mature strand and * strand would be.
unannotated miRNA: number of reads that align to the region of the precursor miRNA that would appear to be on the opposite side of the mature strand, but miRBase does not have the * strand annotation. Therefore these reads cannot be annotated as * strand since there is no information regarding which part of the precursor miRNA would be the * strand.
snoRNA: number of reads that map to known snoRNAs. The list of snoRNAs is extracted from UCSC knownGenes and kgXref where geneSymbol is in the form "SNOR*" or the description contains the string " small nucleolar RNA".
tRNA; rRNA; snRNA; scRNA; srpRNA: number of reads that map to regions as listed by UCSC Genome Browser's RepeatMasker with the given repeat_class
Other RepeatMasker RNAs: number of reads that map to regions as listed by UCSC Genome Browser's RepeatMasker with the repeat_class "RNA"
RNA (No CDS): number of reads that map to genes in UCSC Genome Browser's knownGenes that have 0 length CDS regions (excluding snoRNAs matched by the filter described above)
3' UTR, 5' UTR: number of reads that map to 3' and 5' UTR Exons according to UCSC Genome Browser's knownGenes tables, after strand, coordinate, and CDS region has been calculated
Coding Exon, Intron: number of reads that map to the exon and intron regions of UCSC knownGenes
LINE, SINE, LTR, Satellite, RepeatMasker DNA, RepeatMasker Low complexity, RepeatMasker Simple repeat, RepeatMasker Other, RepeatMasker Unknown: number of reads mapping to other repeat_class categories of UCSC RepeatMasker
Unknown: number of reads that map to no features in the above reference datasets
% Total miRNA to Unknown: number of reads corresponding to above features, expressed as a percentage of Aligned Reads Post-Filter

1B. Sample specific data files
For each of the above features, files were created which lists the specific gene name expressed, and the expression level as a read count, sorted by expression.
Coding_exon.txt contains the list of all exons originating from UCSC knownGene. The "RNA (No CDS)" set is differentiated from other exons with the "No CDS" annotation.
crossmapped.txt contains all the miRNAs that have been crossmapped by any reads. This set includes the count of reads that did crossmap, and the set of reads that did not crossmap themselves, but map to reads that have been crossmapped. These 2 categories are distinguished by the line either containing 1 miRNA, or mulitple comma separated miRNA names.
miRNA.txt contains the list of all uniquely mapped miRNAs, with an extra annotation denoting the region of the miRNA the read mapped to.
miRNA expression files, miRNA.txt and crossmapped.txt have a 3rd column which is the read count as a percentage of "Total miRNA" from alignment_stats.csv.
mirna_species.txt is a summary of the miRNA.txt file, with specific annotations about miRNA region left out. This means that each count is the number of reads aligning anywhere along the miRNA. Note that this does not include the miRNAs in crossmapped.txt.
isoforms.txt includes miRNAs in the miRNA.txt and crossmapped.txt files, split by individual reads that make up the miRNA. Column definitions are: miRNA, chr, start, end, strand, read sequence, read count, crossmapped (1 = yes, 0 = no), annotation.
A bed directory is created for each sample. These directories contain text files in BED format showing coverage of all reads which pass the alignment filters. The text files can be used to generate wig coverage files. They are also sorted and filtered so that they can be used for peak discovery and novel miRNA prediction.
All other .csv files are used for generating graphs.

2. graph_libs.pl
This script is a perl wrapper that calls R to generate graphs based on files generated by alignment_stats.pl
The _tags.jpg is a graph which shows the the data corresponding to % feature columns described in 1A.
The _adapter.jpg is a graph which shows the "Adapter*" columns described in 1A in base pair resolution.
The _softcliip.jpg is used for internal development, to compare the expression profile of softclipped reads against the "normally used" reads in _tags.jpg.
The _chastity.jpg is used for internal development, to compare the expression profile of chasitity failed reads which managed to pass the filters against the "normally used" reads in _tags.jpg.

3. expression_matrix.pl
This script uses the mirna_species.txt file created by alignment_stats to create a tab-separated text file of read counts for each miRNA gene in each sample in the project. 3 versions of expression files are created, 1 containing raw counts, 1 normalizing the counts by dividing by the total number of miRNA aligned reads in the library times 1 million, and 1 which takes the log2 of the normalized counts.