SeqToGraphAlignment

This repository contains the code, data, and instructions needed to show the advantage of using a graph as a reference genome over string-based reference genomes. It is part of my DS202-2020 final project and explains how to replicate the experiment I performed to support this hypothesis in my final report. I ran everything on Debian GNU/Linux 10 (buster) on a Windows 10 x86_64 operating system.

Pre-requisites

Steps to replicate the experimental result

Clone the SeqToGraphAlignment repository by typing

git clone https://github.com/Souvadra/SeqToGraphAlignment.git

Download the vg-toolkit and relocate the executable for ease of use

wget https://github.com/vgteam/vg/releases/download/v1.33.0/vg
chmod +x vg
mv vg SeqToGraphAlignment/

Download the minigraph repository, build it, and relocate the executable for ease of use

git clone https://github.com/lh3/minigraph
cd minigraph && make
cd ..
cp minigraph/minigraph SeqToGraphAlignment/

Make synthetic FASTA files, based on any of the input FASTA files (MT-human.fa was chosen in this case)

cd SeqToGraphAlignment/
mkdir synDNA
python3 code/synthetic_MT.py -infile data/MT-human.fa -outdir synDNA/ -num_files 10 -outfile MT-syn
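
For readers who want a feel for what a synthetic-genome generator might do, here is a minimal sketch. It is not the code in code/synthetic_MT.py: it assumes the variants are random point substitutions at a fixed per-base rate, and the function names `read_fasta`, `mutate` and `write_fasta` are hypothetical.

```python
import random


def read_fasta(path):
    """Return (header, sequence) for a single-record FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                header = line[1:]
            elif line:
                chunks.append(line)
    return header, "".join(chunks)


def mutate(seq, rate, rng):
    """Replace each A/C/G/T base, with probability `rate`, by a different base.

    Assumption: substitutions only (no insertions or deletions).
    """
    out = []
    for base in seq:
        if base.upper() in "ACGT" and rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base.upper()]))
        else:
            out.append(base)
    return "".join(out)


def write_fasta(path, header, seq, width=60):
    """Write one record, wrapping the sequence at `width` characters."""
    with open(path, "w") as fh:
        fh.write(f">{header}\n")
        for i in range(0, len(seq), width):
            fh.write(seq[i:i + width] + "\n")
```

Calling `mutate` on the sequence returned by `read_fasta("data/MT-human.fa")` ten times, with a seeded `random.Random`, and saving each result with `write_fasta` would produce ten files comparable in spirit to synDNA/MT-syn0.fa through synDNA/MT-syn9.fa.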

Build the reference graph from the input and the synthetically generated FASTA files using minigraph

./minigraph -xggs -l10k data/MT.gfa data/MT-chimp.fa data/MT-orangA.fa data/MT-human.fa synDNA/MT-syn0.fa synDNA/MT-syn1.fa synDNA/MT-syn2.fa synDNA/MT-syn3.fa synDNA/MT-syn4.fa synDNA/MT-syn5.fa synDNA/MT-syn6.fa synDNA/MT-syn7.fa synDNA/MT-syn8.fa synDNA/MT-syn9.fa > MT-graph.gfa
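
A quick sanity check on the resulting graph can be sketched as below. It assumes MT-graph.gfa is a tab-separated GFA file in which segment lines start with `S` (sequence in the third column) and link lines start with `L`; the helper name `gfa_summary` is hypothetical and not part of this repository.

```python
def gfa_summary(path):
    """Count segments, links and stored bases in a GFA file."""
    counts = {"segments": 0, "links": 0, "bases": 0}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "S":
                counts["segments"] += 1
                # Column 3 holds the sequence, or "*" when it is not stored inline.
                if len(fields) > 2 and fields[2] != "*":
                    counts["bases"] += len(fields[2])
            elif fields[0] == "L":
                counts["links"] += 1
    return counts
```

A graph with more than one segment and at least one link indicates that minigraph found variation between the input sequences; a single segment means the graph is effectively a linear reference.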

Simulate reads from the constructed reference graph using the vg-toolkit (this example simulates ten reads, each 500 nucleotides long)

./vg view MT-graph.gfa > MT-graph.vg
./vg index -x x.xg -g x.gcsa -k 16 MT-graph.vg
./vg sim -n 10 -l 500 -x x.xg > read.sim.txt

Create a separate FASTA file for each simulated read (so that the reads can be aligned directly with the default functionality of minigraph)

mkdir sim_reads
python3 code/simulated_read_to_fasta.py -infile read.sim.txt -outdir sim_reads/ -outfile sim_read
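
The conversion step can be sketched as follows. This is an illustration rather than the contents of code/simulated_read_to_fasta.py: it assumes read.sim.txt holds one read sequence per line, and the function name `reads_to_fasta` is hypothetical. The output names follow the sim_reads/sim_read0.fa pattern used in the alignment loop.

```python
import os


def reads_to_fasta(infile, outdir, prefix):
    """Write each non-empty line of `infile` to its own FASTA file.

    Files are named <prefix>0.fa, <prefix>1.fa, ... inside `outdir`.
    Returns the list of paths written.
    """
    os.makedirs(outdir, exist_ok=True)
    with open(infile) as fh:
        reads = [line.strip() for line in fh if line.strip()]
    written = []
    for i, seq in enumerate(reads):
        path = os.path.join(outdir, f"{prefix}{i}.fa")
        with open(path, "w") as out:
            out.write(f">{prefix}{i}\n{seq}\n")
        written.append(path)
    return written
```

For example, `reads_to_fasta("read.sim.txt", "sim_reads", "sim_read")` would create one file per simulated read.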

Perform the sequence-to-sequence and sequence-to-graph alignment, both using the default functions of minigraph

mkdir seq2seq seq2graph
for i in {0..9}
do
./minigraph data/MT-human.fa sim_reads/sim_read"$i".fa > seq2seq/out_seq"$i".paf
./minigraph MT-graph.gfa sim_reads/sim_read"$i".fa > seq2graph/out_graph"$i".gaf
done

The .paf and .gaf files contain enough information to judge how the alignment algorithm performed in both the seq-to-seq and seq-to-graph settings. In my report, I have shown the number of residue matches, which is given in the 10th column of both the .paf and .gaf files. For more information regarding these file formats, the reader can refer to the GAF and PAF format descriptions.
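
The comparison can be sketched in a few lines of Python. This is an assumption about how one might summarise the output, not the procedure used in the report: it takes the largest residue-match count among the alignment records of each read, and the function name `best_matches` is hypothetical.

```python
def best_matches(path):
    """Return the largest value in column 10 (residue matches) of a PAF/GAF file.

    Returns 0 when the file has no alignment records, i.e. the read was unaligned.
    """
    best = 0
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            # Both formats have 12 mandatory tab-separated columns.
            if len(fields) >= 12:
                best = max(best, int(fields[9]))
    return best
```

Running `best_matches` on seq2seq/out_seq0.paf and seq2graph/out_graph0.gaf, and likewise for the other reads, gives one pair of numbers per read to compare.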
