Souvadra/SeqToGraphAlignment

SeqToGraphAlignment

This repository contains the code, data, and instructions needed to show the advantage of using a graph as a reference genome over string-based genomes. It is part of my DS202-2020 final project and includes instructions for replicating the experiment I performed to support this hypothesis in my final report. I ran everything on Debian GNU/Linux 10 (buster) on a Windows 10 x86_64 system.

Pre-requisites

Steps to replicate the experimental result

  1. Clone the SeqToGraphAlignment repository by typing
git clone https://github.com/Souvadra/SeqToGraphAlignment.git
  1. Download the vg-toolkit and relocate the executable for ease of use
wget https://github.com/vgteam/vg/releases/download/v1.33.0/vg
chmod +x vg
mv vg SeqToGraphAlignment/
  1. Download the minigraph repository, build it, and relocate the executable for ease of use
git clone https://github.com/lh3/minigraph
cd minigraph && make
cd ..
cp minigraph/minigraph SeqToGraphAlignment/
  1. Generate synthetic FASTA files from any one of the input FASTA files (MT-human.fa was chosen in this case)
cd SeqToGraphAlignment/
mkdir synDNA
python3 code/synthetic_MT.py -infile data/MT-human.fa -outdir synDNA/ -num_files 10 -outfile MT-syn
  1. Build the reference graph from the input and the synthetically generated FASTA files using minigraph
./minigraph -xggs -l10k data/MT.gfa data/MT-chimp.fa data/MT-orangA.fa data/MT-human.fa synDNA/MT-syn0.fa synDNA/MT-syn1.fa synDNA/MT-syn2.fa synDNA/MT-syn3.fa synDNA/MT-syn4.fa synDNA/MT-syn5.fa synDNA/MT-syn6.fa synDNA/MT-syn7.fa synDNA/MT-syn8.fa synDNA/MT-syn9.fa > MT-graph.gfa
  1. Simulate reads from the constructed reference graph using vg-toolkit (this example simulates ten reads, each 500 nucleotides long)
./vg view MT-graph.gfa > MT-graph.vg
./vg index -x x.xg -g x.gcsa -k 16 MT-graph.vg
./vg sim -n 10 -l 500 -x x.xg > read.sim.txt
  1. Create a separate FASTA file for each simulated read (so that the reads can be used directly for alignment with minigraph's default functionality)
mkdir sim_reads
python3 code/simulated_read_to_fasta.py -infile read.sim.txt -outdir sim_reads/ -outfile sim_read
  1. Perform the sequence-to-sequence and sequence-to-graph alignments, both with minigraph's default settings
mkdir seq2seq seq2graph
for i in {0..9}
do
./minigraph data/MT-human.fa sim_reads/sim_read"$i".fa > seq2seq/out_seq"$i".paf
./minigraph MT-graph.gfa sim_reads/sim_read"$i".fa > seq2graph/out_graph"$i".gaf
done
  1. The .paf and .gaf files contain enough information to assess how the alignment algorithm performed in both the seq-to-seq and seq-to-graph alignments. In my report, I show the number of residue matches, which is given in the 10th column of both the .paf and .gaf files. For more information regarding these file formats, see GAF and PAF.
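
As a minimal sketch of the last step, the snippet below reads the 10th tab-separated column (number of residue matches) from the .paf and .gaf files produced above and prints the two counts side by side for each read. It is not part of this repository: the function names `residue_matches` and `best_matches` are hypothetical, and taking the largest count when a file holds several records for one read is an assumption, not necessarily how the numbers in the report were obtained.

```python
from pathlib import Path


def residue_matches(record: str) -> int:
    """Return column 10 (number of residue matches) of one PAF/GAF record."""
    fields = record.rstrip("\n").split("\t")
    # Both PAF and GAF define 12 mandatory tab-separated columns.
    if len(fields) < 12:
        raise ValueError(f"expected at least 12 columns, got {len(fields)}")
    return int(fields[9])  # 10th column, zero-based index 9


def best_matches(path) -> int:
    """Largest residue-match count in a file; 0 if the file has no records.

    Assumption: when a read has several alignment records, the best one
    is the one worth reporting.
    """
    with open(path) as handle:
        counts = [residue_matches(line) for line in handle if line.strip()]
    return max(counts, default=0)


if __name__ == "__main__":
    # File names follow the alignment loop in the steps above.
    for i in range(10):
        paf = Path("seq2seq") / f"out_seq{i}.paf"
        gaf = Path("seq2graph") / f"out_graph{i}.gaf"
        if paf.exists() and gaf.exists():
            print(f"read {i}: seq2seq={best_matches(paf)} seq2graph={best_matches(gaf)}")
```

Run from inside SeqToGraphAlignment/ after the alignment loop, this prints one line per read. A higher count in the seq2graph column means more residues matched when the read was aligned to the graph.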

About

Final project for DS202.
