Commit: Final Pipeline
angadps committed Nov 14, 2011 (commit 40edd99, 0 parents)
Showing 20 changed files with 1,588 additions and 0 deletions.
30 changes: 30 additions & 0 deletions README
@@ -0,0 +1,30 @@
The scripts in this folder form a computational pipeline for analyzing WGS and
exome data. For a quick explanation of how to run the pipeline, skip ahead to the RUNNING THE PIPELINE section near the end.

The steps that are part of the pipeline are:

1. Realignment of bam files
2. Calibration
3. Recalibration
4. Depth of coverage calculation
5. Indel Calling
6. SNP Calling
7. Variant evaluation
8. Variant Filtering
9. Transition/Transversion ratio computation
10. SNV count computation
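Since the scripts carry SGE directives (`#$ -cwd`) and are submitted with qsub, the stages above could be chained with job dependencies. A minimal dry-run sketch; the per-stage script names below are illustrative placeholders (only some of the real scripts appear in this commit), and qsub is stubbed with echo so nothing is actually submitted:

```shell
# Dry-run sketch: chain the ten pipeline stages with SGE hold_jid
# dependencies. Stage names are placeholders; qsub is stubbed with echo.
QSUB="echo qsub"
PREV_JOB=""
for STAGE in realign calibrate recalibrate coverage indel_call \
             snp_call var_eval var_filter titv_ratio snv_count
do
    if [ -z "$PREV_JOB" ]
    then
        $QSUB -N "$STAGE" "$STAGE.scr"
    else
        # -hold_jid makes this stage wait for the previous one to finish.
        $QSUB -N "$STAGE" -hold_jid "$PREV_JOB" "$STAGE.scr"
    fi
    PREV_JOB="$STAGE"
done
```

Dropping the `echo` stub would submit the jobs for real.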

The steps of the main pipeline are invoked by run_pipeline.sh. This script receives as input values such as the reference fasta and the input bam file, among others. All steps are submitted using the qsub command. Another script, job.sh, invokes run_pipeline.sh itself; it may be used directly or serve as a template for running the pipeline on different sets of data. TCGA_job.sh and YALE_job.sh are modified versions of the script with some parameters preset.

There is a global configuration file, global_config.sh, used to set various environment parameters before running the pipeline. It is sourced near the beginning of each of the other .sh/.scr files.

The final output from the pipeline is VCF data. One may want to combine the VCF data from the various samples of a single dataset; the script gatk_smart_merge.scr accomplishes this. It may be run on the VCF files produced for all samples of the same dataset.

The scripts are written to be modular: one may run individual scripts or the entire pipeline. Each script performs the necessary parameter checks.

RUNNING THE PIPELINE

Modify any necessary parameters in global_config.sh: Java-related settings, paths to the various binaries, the reference fasta and dbsnp files, and, most importantly, the BPATH variable.

Now run job.sh, supplying it with all necessary parameters. If you plan to run the pipeline repeatedly, it is better to copy job.sh to a new file with parameters such as the reference fasta, dbsnp, platform and exon list filled in.
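Concretely, a run might look like the following. The flag set mirrors TCGA_job.sh's call into run_pipeline.sh; every file path is a placeholder, and the command is built into a variable and echoed here rather than executed:

```shell
# Illustrative only: the -I/-R/-E/-D/-P flags mirror the call that
# TCGA_job.sh makes into run_pipeline.sh. All paths are placeholders.
BPATH="/path/to/pipeline/scripts"      # normally set in global_config.sh
CMD="$BPATH/run_pipeline.sh -I /data/sample01/sample01.bam \
-R reference.fasta -E exon.list -D dbsnp.rod -P illumina"
echo "$CMD"
```

Running TCGA_job.sh or YALE_job.sh instead requires only `-I <bam>`, since the remaining parameters are preset.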

35 changes: 35 additions & 0 deletions TCGA_job.sh
@@ -0,0 +1,35 @@
#!/bin/sh
#$ -cwd

GLOBAL="global_config.sh"

# Load the global config file.
if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

# Only the input bam file needs to be specified.
USAGE="Usage: $0 -I <Input bam file>"

while getopts I:h o
do case "$o" in
I) INP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $INP == "" ]]
then
echo "Please pass input bam file as parameter"
exit 1
fi

DATAPATH=`dirname "$INP"`
rm -f $DATAPATH/*.output

# The pipeline is invoked here.
$BPATH/run_pipeline.sh -I $INP -R $TCGA_REF -E $TCGA_List -D $TCGA_DBSNP -P $TCGA_Platform > $DATAPATH/pipeline.output

35 changes: 35 additions & 0 deletions YALE_job.sh
@@ -0,0 +1,35 @@
#!/bin/sh
#$ -cwd

GLOBAL="global_config.sh"

# Load the global config file.
if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

# Only the input bam file needs to be specified.
USAGE="Usage: $0 -I <Input bam file>"

while getopts I:h o
do case "$o" in
I) INP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $INP == "" ]]
then
echo "Please pass input bam file as parameter"
exit 1
fi

DATAPATH=`dirname "$INP"`
rm -f $DATAPATH/*.output

# The pipeline is invoked here.
$BPATH/run_pipeline.sh -I $INP -R $YALE_REF -E $YALE_List -D $YALE_DBSNP -P $YALE_Platform > $DATAPATH/pipeline.output

99 changes: 99 additions & 0 deletions count_SNV.sh
@@ -0,0 +1,99 @@
#!/bin/sh
#$ -cwd

# Performs the SNV count for an individual chromosome. Invoked by count_all_SNV.sh.

# Keywords used in identifying lines in the VCF file that contain actual data.
CHECK1="HaplotypeScore"
CHECK2="HRun"

USAGE="Usage: $0 -V <Variants file> -E <ExonFile> -L <CHR>"

while getopts V:E:L:h o
do case "$o" in
V) VCF_FILE="$OPTARG";;
E) EXON_FILE="$OPTARG";;
L) CHR="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $VCF_FILE == "" || $EXON_FILE == "" || $CHR == "" ]]
then
echo $USAGE
exit 1
fi

# The hap.list.$CHR file is populated with the variant locations. The exon.list is populated with the exon regions
# for the particular chromosome. The variant file is compared against exon regions to find the SNVs in the
# exon regions alone.

DATAPATH=`dirname "$VCF_FILE"`
grep $CHECK1 $VCF_FILE | grep $CHECK2 | awk -v var1=$CHR '$1 == var1 {print $1 " " $2}' > $DATAPATH/hap.list.$CHR
cat $EXON_FILE | awk -v var1=$CHR '$1 == var1 {print $0}' > $DATAPATH/exon.list.$CHR

if [[ $? != 0 || ! -s $DATAPATH/hap.list.$CHR || ! -s $DATAPATH/exon.list.$CHR ]]
then
echo "SNV count failed"
exit 1
fi

FD1=7
FD2=8
eof1=0
eof2=0

exec 7<$DATAPATH/hap.list.$CHR
exec 8<$DATAPATH/exon.list.$CHR
rm -f $DATAPATH/SNV.list.$CHR

SNV=-1
LOW=0
HIGH=0

# The following logic implements a simple linear-time algorithm.
# We have a set of exon regions and a set of variant locations, both sorted.
#
# Identify the location of the first SNV and also the first exon region.
# If the SNV location is less than the lower limit of the current exon region, advance to the next SNV.
# If the SNV location is greater than the upper limit of the current exon region, advance to the next exon region.
# If the SNV location is within the limits of the current exon region, record it.

while [[ $eof1 -eq 0 && $eof2 -eq 0 ]]
do
while [[ $SNV -lt $LOW ]]
do
if read data1 <&$FD1
then
SNV=`echo "$data1" | cut -d ' ' -f 2`
else
eof1=1
break
fi
done

if [[ $SNV -le $HIGH ]]
then
echo SNV = $SNV TARGET REGION = $data2 >> $DATAPATH/SNV.list.$CHR
if read data1 <&$FD1
then
SNV=`echo "$data1" | cut -d ' ' -f 2`
else
eof1=1
fi
else
while [[ $SNV -gt $HIGH ]]
do
if read data2 <&$FD2
then
LOW=`echo "$data2" | cut -f 2`
HIGH=`echo "$data2" | cut -f 3`
else
eof2=1
break
fi
done
fi
done
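The merge-style scan above (one file descriptor per sorted list) can be exercised in isolation. A self-contained toy run with made-up positions and intervals, counting matches instead of writing them to a file:

```shell
# Toy reproduction of the linear scan in count_SNV.sh: walk a sorted list
# of SNV positions and a sorted list of exon intervals in a single pass,
# counting positions that fall inside an interval. Data is made up.
printf '%s\n' 5 12 40 55 90 > /tmp/snv.pos     # SNV positions
printf '10\t20\n50\t60\n' > /tmp/exon.ivl      # exon low/high pairs

exec 7</tmp/snv.pos
exec 8</tmp/exon.ivl
COUNT=0; SNV=-1; LOW=0; HIGH=0; eof1=0; eof2=0

while [ $eof1 -eq 0 ] && [ $eof2 -eq 0 ]
do
    # Advance SNVs until one reaches the current exon's lower bound.
    while [ $SNV -lt $LOW ]
    do
        read SNV <&7 || { eof1=1; break; }
    done
    if [ $SNV -le $HIGH ]
    then
        COUNT=$((COUNT + 1))                   # SNV inside current exon
        read SNV <&7 || eof1=1
    else
        # Advance exons until one reaches the current SNV.
        while [ $SNV -gt $HIGH ]
        do
            read LOW HIGH <&8 || { eof2=1; break; }
        done
    fi
done
exec 7<&- 8<&-
echo "SNVs inside exons: $COUNT"               # positions 12 and 55 match
```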

48 changes: 48 additions & 0 deletions count_all_SNV.sh
@@ -0,0 +1,48 @@
#!/bin/sh
#$ -cwd
# -l mem=5G,time=4::

# This step in the pipeline computes the SNV count for the entire genome.

GLOBAL="global_config.sh"

if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

USAGE="Usage: $0 -V <Variants file> -E <ExonFile> -I <Input Bam>"

while getopts V:E:I:h o
do case "$o" in
V) VCF_FILE="$OPTARG";;
E) EXON_FILE="$OPTARG";;
I) INP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $VCF_FILE == "" || $EXON_FILE == "" || $INP == "" ]]
then
echo $USAGE
exit 1
fi

DATAPATH=`dirname "$VCF_FILE"`
CONTIG_ORDER="`$SAMTOOLS idxstats $INP | grep -m 24 [0-9] | awk '{print $1}' | tr '\n' ' '`"

rm -f $DATAPATH/hap.list $DATAPATH/SNV.list

# Loop over all chromosomes. The work is split per chromosome to keep
# the counting logic simple.
for i in $CONTIG_ORDER
do
./count_SNV.sh -V $VCF_FILE -E $EXON_FILE -L $i
cat $DATAPATH/SNV.list.$i >> $DATAPATH/SNV.list
rm -f $DATAPATH/SNV.list.$i $DATAPATH/hap.list.$i $DATAPATH/exon.list.$i
done
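The CONTIG_ORDER extraction above can be checked against a fabricated idxstats table. samtools idxstats emits one contig per line (name, length, mapped reads, unmapped reads), and the grep/awk/tr pipeline keeps the leading contig names on a single line:

```shell
# Fabricated samtools idxstats output: contig, length, mapped, unmapped.
printf 'chr1\t249250621\t100\t5\nchr2\t243199373\t90\t4\nchrX\t155270560\t50\t2\n*\t0\t0\t7\n' > /tmp/idxstats.txt

# Same pipeline as count_all_SNV.sh, minus the samtools call itself.
# With a full BAM the script's -m 24 stops at the 24 canonical contigs,
# before the trailing "*" (unmapped) line; -m 3 plays that role here.
CONTIG_ORDER="`grep -m 3 '[0-9]' /tmp/idxstats.txt | awk '{print $1}' | tr '\n' ' '`"
echo "$CONTIG_ORDER"
```

Note that tr leaves a trailing space, which is harmless when the result is used as a `for` loop's word list.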

62 changes: 62 additions & 0 deletions gatk_calibrate.scr
@@ -0,0 +1,62 @@
#!/bin/sh
#$ -cwd
# -l mem=8G,time=8::

# This step in the pipeline performs calibration of the original input bam file.

GLOBAL="global_config.sh"

if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

USAGE="Usage: $0 -I <Input bam file> -R <Reference fasta> -D <DBSNP file> [-L \"#:#-#\"]"

while getopts I:L:R:D:h o
do case "$o" in
I) INP="$OPTARG";;
L) CHR="$OPTARG";;
R) REF="$OPTARG";;
D) DBSNP="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $INP == "" || $REF == "" || $DBSNP == "" ]]
then
echo $USAGE
exit 1
fi

# The chromosome may be specified in either of two formats, for example
# chrX or X.
# Find out which one the bam uses with the samtools idxstats command.

CHR_NAME=`$SAMTOOLS idxstats $INP | grep chr`

if [[ $CHR != "" ]]
then
if [[ `echo $CHR | grep chr` == "" && -n $CHR_NAME ]]
then
CHR="chr${CHR}"
fi
fi

$GATK \
${CHR:+-L $CHR} \
-R $REF \
--DBSNP $DBSNP \
-I $INP \
-T CountCovariates \
-cov ReadGroupCovariate \
-cov QualityScoreCovariate \
-cov CycleCovariate \
-cov DinucCovariate \
-recalFile $INP.recal_data.csv

# [-B:mask,VCF sitesToMask.vcf] \
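The chr-prefix normalization can be exercised on its own. A small sketch of the intended check, testing the $CHR string itself (the original greps a file named by $CHR by mistake) and using a stub in place of the `samtools idxstats | grep chr` probe:

```shell
# Normalize a chromosome argument to match the BAM's contig naming.
# CHR_NAME stands in for the `samtools idxstats $INP | grep chr` probe;
# both values below are made up for the demonstration.
CHR_NAME="chr1"          # non-empty: the BAM uses chr-prefixed contigs
CHR="7:1000-2000"        # user passed the unprefixed form

if [ -n "$CHR" ]
then
    case "$CHR" in
        chr*) ;;                                   # already prefixed
        *) [ -n "$CHR_NAME" ] && CHR="chr${CHR}" ;;
    esac
fi
echo "$CHR"
```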
52 changes: 52 additions & 0 deletions gatk_combine_variants.scr
@@ -0,0 +1,52 @@
#!/bin/sh
#$ -cwd
#$ -l mem=8G,time=16::

GLOBAL="global_config.sh"
TEMP="temp_file"

rm -f variant_combine.output $TEMP

if [[ -e $GLOBAL ]]
then
. $GLOBAL
else
echo "Global config file not found. Exiting."
exit 1
fi

USAGE="Usage: $0 -R <Reference Fasta> -V <VCF File list> [-D <DELIMITER>]"

while getopts V:R:D:h o
do case "$o" in
V) VCFLIST="$OPTARG";;
R) REF="$OPTARG";;
D) DELIMITER="$OPTARG";;
h) echo $USAGE
exit 1;;
esac
done

if [[ $VCFLIST == "" || $REF == "" ]]
then
echo $USAGE
exit 1
fi

PARAMETER=`cat $VCFLIST | awk '{n=split($0,a,"/"); split(a[n],b,"_"); print "-B:"b[1]"_"b[2]",VCF " $0}'`

$GATK \
-T CombineVariants \
-R $REF \
-o all_SNPs_output.vcf \
$PARAMETER \
-variantMergeOptions UNION \
-genotypeMergeOptions UNIQUIFY > variant_combine.output 2>&1

if [[ $? != 0 || `grep "$ERRORMESSAGE" variant_combine.output` != "" ]]
then
echo "Variants Combine FAILED"
exit 1
fi
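The awk one-liner that builds the PARAMETER value can be checked with fabricated paths. For a list entry whose basename starts with two underscore-separated sample fields, it emits a `-B:<sample>,VCF <path>` argument pair:

```shell
# Fabricated VCF list; the sample name is taken from the first two
# underscore-separated fields of each file's basename.
printf '%s\n' /data/run1/SAMPLE_01_snps.vcf /data/run1/SAMPLE_02_snps.vcf > /tmp/vcf.list

# Same awk as gatk_combine_variants.scr: a[n] is the basename, b[1]_b[2]
# the reconstructed sample name.
PARAMETER=`cat /tmp/vcf.list | awk '{n=split($0,a,"/"); split(a[n],b,"_"); print "-B:"b[1]"_"b[2]",VCF " $0}'`
echo "$PARAMETER"
```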
