Commit: Final Pipeline
angadps committed Nov 14, 2011 (commit 40edd99, 0 parents)
Showing 20 changed files with 1,588 additions and 0 deletions.
30 changes: 30 additions & 0 deletions README
@@ -0,0 +1,30 @@
The scripts in this folder form a computational pipeline for analyzing WGS and
exome data. For a quick explanation of how to run the pipeline, skip ahead to the RUNNING THE PIPELINE section near the end.

The steps that are part of the pipeline are:

1. Realignment of bam files
2. Calibration
3. Recalibration
4. Depth of coverage calculation
5. Indel Calling
6. SNP Calling
7. Variant evaluation
8. Variant Filtering
9. Transition/Transversion ratio computation
10. SNV count computation
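Since the scripts carry SGE directives (`#$ -cwd`) and are submitted with qsub, the stages above could be chained with job dependencies. A minimal dry-run sketch; the per-stage script names below are illustrative placeholders (only some of the real scripts appear in this commit), and qsub is stubbed with echo so nothing is actually submitted:

```shell
# Dry-run sketch: chain the ten pipeline stages with SGE hold_jid
# dependencies. Stage names are placeholders; qsub is stubbed with echo.
QSUB="echo qsub"
PREV_JOB=""
for STAGE in realign calibrate recalibrate coverage indel_call \
             snp_call var_eval var_filter titv_ratio snv_count
do
    if [ -z "$PREV_JOB" ]
    then
        $QSUB -N "$STAGE" "$STAGE.scr"
    else
        # -hold_jid makes this stage wait for the previous one to finish.
        $QSUB -N "$STAGE" -hold_jid "$PREV_JOB" "$STAGE.scr"
    fi
    PREV_JOB="$STAGE"
done
```

Dropping the `echo` stub would submit the jobs for real.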

The steps of the main pipeline are invoked by run_pipeline.sh. This script receives as input values such as the reference fasta and the input bam file, among others. All steps are submitted using the qsub command. Another script, job.sh, invokes run_pipeline.sh itself; it may be used directly or serve as a template for running the pipeline on different sets of data. TCGA_job.sh and YALE_job.sh are modified versions of the script with some parameters preset.

There is a global configuration file, global_config.sh, used to set various environment parameters before running the pipeline. It is sourced near the beginning of each of the other .sh/.scr files.

The final output from the pipeline is VCF data. One may want to combine the VCF data from the various samples of a single dataset; the script gatk_smart_merge.scr accomplishes this. It may be run on the VCF files produced for all samples of the same dataset.

The scripts are written to be modular: one may run individual scripts or the entire pipeline. Each script performs the necessary parameter checks.

RUNNING THE PIPELINE

Modify any necessary parameters in global_config.sh: Java-related settings, paths to the various binaries, the reference fasta and dbsnp files, and, most importantly, the BPATH variable.

Now run job.sh, supplying it with all necessary parameters. If you plan to run the pipeline repeatedly, it is better to copy job.sh to a new file with parameters such as the reference fasta, dbsnp, platform and exon list filled in.
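Concretely, a run might look like the following. The flag set mirrors TCGA_job.sh's call into run_pipeline.sh; every file path is a placeholder, and the command is built into a variable and echoed here rather than executed:

```shell
# Illustrative only: the -I/-R/-E/-D/-P flags mirror the call that
# TCGA_job.sh makes into run_pipeline.sh. All paths are placeholders.
BPATH="/path/to/pipeline/scripts"      # normally set in global_config.sh
CMD="$BPATH/run_pipeline.sh -I /data/sample01/sample01.bam \
-R reference.fasta -E exon.list -D dbsnp.rod -P illumina"
echo "$CMD"
```

Running TCGA_job.sh or YALE_job.sh instead requires only `-I <bam>`, since the remaining parameters are preset.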

35 changes: 35 additions & 0 deletions TCGA_job.sh
@@ -0,0 +1,35 @@
#!/bin/sh
#$ -cwd

GLOBAL="global_config.sh"

# Load the global config file.
if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

# Only the input bam file needs to be specified.
USAGE="Usage: $0 -I <Input bam file>"

while getopts I:h o
do case "$o" in
I) INP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $INP == "" ]]
then
echo "Please pass input bam file as parameter"
exit 1
fi

DATAPATH=`dirname "$INP"`
rm -f $DATAPATH/*.output

# The pipeline is invoked here.
$BPATH/run_pipeline.sh -I $INP -R $TCGA_REF -E $TCGA_List -D $TCGA_DBSNP -P $TCGA_Platform > $DATAPATH/pipeline.output

35 changes: 35 additions & 0 deletions YALE_job.sh
@@ -0,0 +1,35 @@
#!/bin/sh
#$ -cwd

GLOBAL="global_config.sh"

# Load the global config file.
if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

# Only the input bam file needs to be specified.
USAGE="Usage: $0 -I <Input bam file>"

while getopts I:h o
do case "$o" in
I) INP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $INP == "" ]]
then
echo "Please pass input bam file as parameter"
exit 1
fi

DATAPATH=`dirname "$INP"`
rm -f $DATAPATH/*.output

# The pipeline is invoked here.
$BPATH/run_pipeline.sh -I $INP -R $YALE_REF -E $YALE_List -D $YALE_DBSNP -P $YALE_Platform > $DATAPATH/pipeline.output

99 changes: 99 additions & 0 deletions count_SNV.sh
@@ -0,0 +1,99 @@
#!/bin/sh
#$ -cwd

# Performs the SNV count for an individual chromosome. Invoked by count_all_SNV.sh.

# Keywords used in identifying lines in the VCF file that contain actual data.
CHECK1="HaplotypeScore"
CHECK2="HRun"

USAGE="Usage: $0 -V <Variants file> -E <ExonFile> -L <CHR>"

while getopts V:E:L:h o
do case "$o" in
V) VCF_FILE="$OPTARG";;
E) EXON_FILE="$OPTARG";;
L) CHR="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $VCF_FILE == "" || $EXON_FILE == "" || $CHR == "" ]]
then
echo $USAGE
exit 1
fi

# The hap.list.$CHR file is populated with the variant locations. The exon.list is populated with the exon regions
# for the particular chromosome. The variant file is compared against exon regions to find the SNVs in the
# exon regions alone.

DATAPATH=`dirname "$VCF_FILE"`
grep $CHECK1 $VCF_FILE | grep $CHECK2 | awk -v var1=$CHR '$1 == var1 {print $1 " " $2}' > $DATAPATH/hap.list.$CHR
cat $EXON_FILE | awk -v var1=$CHR '$1 == var1 {print $0}' > $DATAPATH/exon.list.$CHR

if [[ $? != 0 || ! -s $DATAPATH/hap.list.$CHR || ! -s $DATAPATH/exon.list.$CHR ]]
then
echo "SNV count failed"
exit 1
fi

FD1=7
FD2=8
eof1=0
eof2=0

exec 7<$DATAPATH/hap.list.$CHR
exec 8<$DATAPATH/exon.list.$CHR
rm -f $DATAPATH/SNV.list.$CHR

SNV=-1
LOW=0
HIGH=0

# The following logic implements a simple linear-time algorithm.
# We have a set of exon regions and a set of variant locations, both sorted.
#
# Identify the location of the first SNV and also the first exon region.
# If the SNV location is less than the lower limit of the current exon region, advance to the next SNV.
# If the SNV location is greater than the upper limit of the current exon region, advance to the next exon region.
# If the SNV location is within the limits of the current exon region, record it.

while [[ $eof1 -eq 0 && $eof2 -eq 0 ]]
do
while [[ $SNV -lt $LOW ]]
do
if read data1 <&$FD1
then
SNV=`echo "$data1" | cut -d ' ' -f 2`
else
eof1=1
break
fi
done

if [[ $SNV -le $HIGH ]]
then
echo SNV = $SNV TARGET REGION = $data2 >> $DATAPATH/SNV.list.$CHR
if read data1 <&$FD1
then
SNV=`echo "$data1" | cut -d ' ' -f 2`
else
eof1=1
fi
else
while [[ $SNV -gt $HIGH ]]
do
if read data2 <&$FD2
then
LOW=`echo "$data2" | cut -f 2`
HIGH=`echo "$data2" | cut -f 3`
else
eof2=1
break
fi
done
fi
done
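The merge-style scan above (one file descriptor per sorted list) can be exercised in isolation. A self-contained toy run with made-up positions and intervals, counting matches instead of writing them to a file:

```shell
# Toy reproduction of the linear scan in count_SNV.sh: walk a sorted list
# of SNV positions and a sorted list of exon intervals in a single pass,
# counting positions that fall inside an interval. Data is made up.
printf '%s\n' 5 12 40 55 90 > /tmp/snv.pos     # SNV positions
printf '10\t20\n50\t60\n' > /tmp/exon.ivl      # exon low/high pairs

exec 7</tmp/snv.pos
exec 8</tmp/exon.ivl
COUNT=0; SNV=-1; LOW=0; HIGH=0; eof1=0; eof2=0

while [ $eof1 -eq 0 ] && [ $eof2 -eq 0 ]
do
    # Advance SNVs until one reaches the current exon's lower bound.
    while [ $SNV -lt $LOW ]
    do
        read SNV <&7 || { eof1=1; break; }
    done
    if [ $SNV -le $HIGH ]
    then
        COUNT=$((COUNT + 1))                   # SNV inside current exon
        read SNV <&7 || eof1=1
    else
        # Advance exons until one reaches the current SNV.
        while [ $SNV -gt $HIGH ]
        do
            read LOW HIGH <&8 || { eof2=1; break; }
        done
    fi
done
exec 7<&- 8<&-
echo "SNVs inside exons: $COUNT"               # positions 12 and 55 match
```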

48 changes: 48 additions & 0 deletions count_all_SNV.sh
@@ -0,0 +1,48 @@
#!/bin/sh
#$ -cwd
# -l mem=5G,time=4::

# This step in the pipeline computes the SNV count for the entire genome.

GLOBAL="global_config.sh"

if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

USAGE="Usage: $0 -V <Variants file> -E <ExonFile> -I <Input Bam>"

while getopts V:E:I:h o
do case "$o" in
V) VCF_FILE="$OPTARG";;
E) EXON_FILE="$OPTARG";;
I) INP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $VCF_FILE == "" || $EXON_FILE == "" || $INP == "" ]]
then
echo $USAGE
exit 1
fi

DATAPATH=`dirname "$VCF_FILE"`
CONTIG_ORDER="`$SAMTOOLS idxstats $INP | grep -m 24 [0-9] | awk '{print $1}' | tr '\n' ' '`"

rm -f $DATAPATH/hap.list $DATAPATH/SNV.list

# Loop over all chromosomes. The work is split per chromosome to keep
# the counting logic simple.
for i in $CONTIG_ORDER
do
./count_SNV.sh -V $VCF_FILE -E $EXON_FILE -L $i
cat $DATAPATH/SNV.list.$i >> $DATAPATH/SNV.list
rm -f $DATAPATH/SNV.list.$i $DATAPATH/hap.list.$i $DATAPATH/exon.list.$i
done
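The CONTIG_ORDER extraction above can be checked against a fabricated idxstats table. samtools idxstats emits one contig per line (name, length, mapped reads, unmapped reads), and the grep/awk/tr pipeline keeps the leading contig names on a single line:

```shell
# Fabricated samtools idxstats output: contig, length, mapped, unmapped.
printf 'chr1\t249250621\t100\t5\nchr2\t243199373\t90\t4\nchrX\t155270560\t50\t2\n*\t0\t0\t7\n' > /tmp/idxstats.txt

# Same pipeline as count_all_SNV.sh, minus the samtools call itself.
# With a full BAM the script's -m 24 stops at the 24 canonical contigs,
# before the trailing "*" (unmapped) line; -m 3 plays that role here.
CONTIG_ORDER="`grep -m 3 '[0-9]' /tmp/idxstats.txt | awk '{print $1}' | tr '\n' ' '`"
echo "$CONTIG_ORDER"
```

Note that tr leaves a trailing space, which is harmless when the result is used as a `for` loop's word list.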

62 changes: 62 additions & 0 deletions gatk_calibrate.scr
@@ -0,0 +1,62 @@
#!/bin/sh
#$ -cwd
# -l mem=8G,time=8::

# This step in the pipeline performs calibration of the original input bam file.

GLOBAL="global_config.sh"

if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

USAGE="Usage: $0 -I <Input bam file> -R <Reference fasta> -D <DBSNP file> [-L \"#:#-#\"]"

while getopts I:L:R:D:h o
do case "$o" in
I) INP="$OPTARG";;
L) CHR="$OPTARG";;
R) REF="$OPTARG";;
D) DBSNP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $INP == "" || $REF == "" || $DBSNP == "" ]]
then
echo $USAGE
exit 1
fi

# The chromosome may be specified in either of two formats, for example
# chrX or X.
# Find out which one the bam uses with the samtools idxstats command.

CHR_NAME=`$SAMTOOLS idxstats $INP | grep chr`

if [[ $CHR != "" ]]
then
if [[ `echo $CHR | grep chr` == "" && -n $CHR_NAME ]]
then
CHR="chr${CHR}"
fi
fi

$GATK \
${CHR:+-L $CHR} \
-R $REF \
--DBSNP $DBSNP \
-I $INP \
-T CountCovariates \
-cov ReadGroupCovariate \
-cov QualityScoreCovariate \
-cov CycleCovariate \
-cov DinucCovariate \
-recalFile $INP.recal_data.csv

# [-B:mask,VCF sitesToMask.vcf] \
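The chr-prefix normalization can be exercised on its own. A small sketch of the intended check, testing the $CHR string itself (the original greps a file named by $CHR by mistake) and using a stub in place of the `samtools idxstats | grep chr` probe:

```shell
# Normalize a chromosome argument to match the BAM's contig naming.
# CHR_NAME stands in for the `samtools idxstats $INP | grep chr` probe;
# both values below are made up for the demonstration.
CHR_NAME="chr1"          # non-empty: the BAM uses chr-prefixed contigs
CHR="7:1000-2000"        # user passed the unprefixed form

if [ -n "$CHR" ]
then
    case "$CHR" in
        chr*) ;;                                   # already prefixed
        *) [ -n "$CHR_NAME" ] && CHR="chr${CHR}" ;;
    esac
fi
echo "$CHR"
```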
52 changes: 52 additions & 0 deletions gatk_combine_variants.scr
@@ -0,0 +1,52 @@
#!/bin/sh
#$ -cwd
#$ -l mem=8G,time=16::

GLOBAL="global_config.sh"
TEMP="temp_file"

rm -f variant_combine.output $TEMP

if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

USAGE="Usage: $0 -R <Reference Fasta> -V <VCF File list> [-D <DELIMITER>]"

while getopts V:R:D:h o
do case "$o" in
V) VCFLIST="$OPTARG";;
R) REF="$OPTARG";;
D) DELIMITER="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $VCFLIST == "" || $REF == "" ]]
then
echo $USAGE
exit 1
fi

PARAMETER=`cat $VCFLIST | awk '{n=split($0,a,"/"); split(a[n],b,"_"); print "-B:"b[1]"_"b[2]",VCF " $0}'`

$GATK \
-T CombineVariants \
-R $REF \
-o all_SNPs_output.vcf \
$PARAMETER \
-variantMergeOptions UNION \
-genotypeMergeOptions UNIQUIFY > variant_combine.output 2>&1

if [[ $? != 0 || `grep "$ERRORMESSAGE" variant_combine.output` != "" ]]
then
echo "Variants Combine FAILED"
exit 1
fi
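The awk one-liner that builds the PARAMETER value can be checked with fabricated paths. For a list entry whose basename starts with two underscore-separated sample fields, it emits a `-B:<sample>,VCF <path>` argument pair:

```shell
# Fabricated VCF list; the sample name is taken from the first two
# underscore-separated fields of each file's basename.
printf '%s\n' /data/run1/SAMPLE_01_snps.vcf /data/run1/SAMPLE_02_snps.vcf > /tmp/vcf.list

# Same awk as gatk_combine_variants.scr: a[n] is the basename, b[1]_b[2]
# the reconstructed sample name.
PARAMETER=`cat /tmp/vcf.list | awk '{n=split($0,a,"/"); split(a[n],b,"_"); print "-B:"b[1]"_"b[2]",VCF " $0}'`
echo "$PARAMETER"
```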
