Pipeline Background

jng2 edited this page Sep 17, 2021 · 6 revisions

This pipeline was developed to query small sequences against a large number of genomes. For example, one could examine the conservation of an enhancer sequence across many genomes. In particular, we developed this pipeline to use the newly generated genomes from the Vertebrate Genomes Project (VGP) as well as the standard genomes in the Ensembl database.

The inputs are BLAST databases built from the genome FASTA files. The outputs are: BLAST results of the query sequence for each reference genome, a MUSCLE alignment of all sequences, a PHYLIP reformatting of that alignment, a conversion to a GFA file, and finally a RAxML phylogenetic tree. One feature users can modify is the BLAST threshold value, so that only results at or below the threshold (i.e., high-quality matches) are aligned with MUSCLE. In addition to the threshold requirement, there is a percent query length requirement. It takes a decimal value corresponding to the fraction of the sequence covered in the alignment and filters out smaller miscellaneous sequences that may nonetheless have a good e-value. To assess homology between sequences, it is advised that sequences be over 30% identical across their entire lengths. Setting a minimum length requirement ensures that only larger sequences with e-values that meet the threshold are processed by the pipeline, which gives a cleaner and more concise result.

We suggest an e-value threshold of 0.00001 and a percent query length of 0.3 (30%). The query file cannot be larger than 1 megabase pair (Mbp) and cannot be a repeat element, as the pipeline is not optimized to work on those sequences.
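
The two filters described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's own code: it assumes the hits are in BLAST's default tabular layout (`-outfmt 6`), and it assumes that the percent query length is the aligned span of the query divided by the full query length. The function name `passes_filters`, its parameters, and the example hits are all hypothetical.

```python
# Illustrative sketch only; not taken from the pipeline's source.
# Assumes BLAST tabular output (-outfmt 6) with the default 12 columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def passes_filters(hit_line, query_length, max_evalue=1e-5, min_query_fraction=0.3):
    """Return True if a tabular BLAST hit meets both thresholds."""
    fields = hit_line.rstrip("\n").split("\t")
    qstart, qend = int(fields[6]), int(fields[7])
    evalue = float(fields[10])
    # Assumption: coverage = aligned query span / full query length.
    covered_fraction = (abs(qend - qstart) + 1) / query_length
    return evalue <= max_evalue and covered_fraction >= min_query_fraction


# Made-up example hits for a hypothetical 1,000 bp query.
good_hit = "query1\tchr1\t95.0\t400\t20\t0\t1\t400\t5000\t5399\t1e-50\t600"
short_hit = "query1\tchr2\t100.0\t40\t0\t0\t1\t40\t100\t139\t1e-10\t75"
weak_hit = "query1\tchr3\t80.0\t500\t100\t0\t1\t500\t900\t1399\t0.5\t30"

for hit in (good_hit, short_hit, weak_hit):
    # Print the subject name and whether the hit would be kept.
    print(hit.split("\t")[1], passes_filters(hit, query_length=1000))
```

In this sketch the second hit is rejected for covering too little of the query even though its e-value is good, which is the case the percent query length requirement is meant to catch.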

A successful run of the pipeline generates a total of 11 files. Parsed_Final.fa contains all sequences that met the user's threshold requirement. Files_Generated_Report.txt reports how many files contained hits, contained no hits, or did not meet the threshold requirement. It also tells the user the exact number of hits, no hits, and total sequences read.

Parsed_Final.fa is then converted into Multi_Seq_Align.aln, which takes all the parsed hit sequences and aligns them for computational use. MSA2GFA.gfa is a conversion of Multi_Seq_Align.aln into a GFA file that can be loaded into a Graphical Fragment Assembly viewer for analysis. Phy_Align.phy is like MSA2GFA.gfa, except that it is a multiple sequence file in PHYLIP format, which is required for running the RAxML analysis. When viewing the PHYLIP file or any RAxML file, please refer to Keydoc.txt. This document holds the unique names used to identify files and sequences in the named files. Changing Keydoc.txt will not change the names of files or the identifiers within files.

RAxML runs a GTRGAMMA model by default and generates a file containing the optimum tree, created by Maximum Likelihood with 100 bootstraps by default. This file is called RAxML_bestTree.RAXML_output. Four other supporting files are also created for the user to view. For more information regarding RAxML, please refer to the manual linked in the "More Information" section. To view a phylogenetic tree created by RAxML, the user will need an external phylogenetic tree viewer.
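
As a quick sanity check after a run, one could confirm that the named output files are present. The sketch below is a hypothetical helper, not part of the pipeline: it assumes all outputs are written to a single directory, and it checks only the seven files named in the text, since the four additional RAxML supporting files are not named there.

```python
from pathlib import Path

# File names as listed in the text; the four additional RAxML supporting
# files are not named there, so they are not checked here.
NAMED_OUTPUTS = [
    "Parsed_Final.fa",
    "Files_Generated_Report.txt",
    "Multi_Seq_Align.aln",
    "MSA2GFA.gfa",
    "Phy_Align.phy",
    "Keydoc.txt",
    "RAxML_bestTree.RAXML_output",
]


def missing_outputs(output_dir):
    """Return the named output files that are absent from output_dir."""
    directory = Path(output_dir)
    return [name for name in NAMED_OUTPUTS if not (directory / name).is_file()]


if __name__ == "__main__":
    # "." is a placeholder; point this at the directory the pipeline wrote to.
    missing = missing_outputs(".")
    print("Missing files:", ", ".join(missing) if missing else "none")
```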

The purpose of this pipeline is to provide a reproducible and faster way to obtain an in-depth analysis of particular sequences of interest in genomes from the Vertebrate Genomes Project and Ensembl, by running BLASTn with a user-supplied query sequence against both databases. Output sequences that meet a user-set threshold value are combined to create multiple files. These include files that can be loaded into an external Graphical Fragment Assembly viewer and a phylogenetic tree viewer for further visual analysis of the data.