Genome analysis pipeline that follows GATK best practices from the Broad Institute. The pipeline covers alignment, BAM processing, variant calling, and copy-number calling.
See the scripts folder for ready-to-run scripts.
- BWA (Alignment) https://github.com/lh3/bwa
- SAMtools (BAM sorting/indexing) http://www.htslib.org/
- Octopus (SNV-calling) https://github.com/luntergroup/octopus/
- Strelka2 (SNV-calling) https://github.com/Illumina/strelka
- HATCHet (Copy-number calling) https://github.com/raphael-group/hatchet
Human reference genome downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
Index the reference (not sure if this is strictly necessary, but without it bwa mem failed with a [E::bwa_idx_load_from_disk] fail to locate the index files error):
$ bwa index hg19/ucsc.hg19.fasta.gz
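Not covered in the original notes, but GATK tools that take a reference (-R), and most downstream callers, also expect a FASTA index (.fai) and sequence dictionary (.dict) next to the reference. A sketch of creating them, assuming the UCSC download is plain gzip and so needs decompressing first:
$ gunzip -c hg19/ucsc.hg19.fasta.gz > hg19/ucsc.hg19.fasta
$ samtools faidx hg19/ucsc.hg19.fasta                      # writes ucsc.hg19.fasta.fai
$ gatk CreateSequenceDictionary -R hg19/ucsc.hg19.fasta    # writes ucsc.hg19.dict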
Read groups example:
-R "@RG\tID:HG5H.2\tSM:045F16\tLB:library1\tPL:illumina\tPU:HG5HWDSXX.2"
ID and PU are obtained from the fastq files. SM is the name of the fastq file.
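A quick way to recover ID and PU is to look at the first read header of the fastq (a sketch; this assumes Illumina CASAVA 1.8+ style headers):
$ head -n 1 /data/compbio/melkebir/ElKebir/043F13_CGGCGTGA-GCGCCTGT_L002_R1_001.fastq
# header format: @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> ...
# here the flowcell is HG5HWDSXX and the lane is 2, so PU = HG5HWDSXX.2 and ID is a shortened flowcell.lane (HG5H.2 above)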
Use bwa mem to map to the reference:
$ bwa mem -t 4 -R "@RG\tID:HG5H.2\tSM:043F13\tLB:library1\tPL:illumina\tPU:HG5HWDSXX.2" hg19/ucsc.hg19.fasta.gz /data/compbio/melkebir/ElKebir/043F13_CGGCGTGA-GCGCCTGT_L002_R1_001.fastq /data/compbio/melkebir/ElKebir/043F13_CGGCGTGA-GCGCCTGT_L002_R2_001.fastq > 043F13.bam
samtools sort bamfile.bam -o bamfile.sorted.bam
samtools index bamfile.sorted.bam
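Note that bwa mem writes SAM to stdout, so the 043F13.bam produced above is really an unsorted SAM with a .bam extension. An alternative sketch (not necessarily what the scripts do; fastq paths shortened here) pipes straight into samtools sort and skips the intermediate file:
$ bwa mem -t 4 -R "@RG\tID:HG5H.2\tSM:043F13\tLB:library1\tPL:illumina\tPU:HG5HWDSXX.2" hg19/ucsc.hg19.fasta.gz 043F13_R1.fastq 043F13_R2.fastq | samtools sort -@ 4 -o 043F13.sorted.bam -
$ samtools index 043F13.sorted.bam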
gatk --java-options -Xmx12G MarkDuplicates -I 045F16.sorted.bam -O 045F16.dedup.bam -M 045F16.dedup.metrics.txt
gatk --java-options "-Djava.io.tmpdir=/scratch/tmp -Xmx12G" MarkDuplicates -I 043F13.sorted.bam -O 043F13.dedup.bam -M 043F13.dedup.metrics.txt
Get the following .vcf files and their corresponding index files (.dict, .idx) from the GATK resource bundle:
dbsnp_138.hg19.vcf
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
1000G_phase1.indels.hg19.sites.vcf
(dbsnp and 1000G were also available on the original UCSC ftp server, but I got them all from the GATK bundle for consistency)
Run gunzip and IndexFeatureFile on dbsnp_138.hg19.vcf (to solve the block compression format error; see here: https://gatkforums.broadinstitute.org/gatk/discussion/11908/variantrecalibrator-error-using-gatk-bundle-vcf-files)
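Concretely, that step looks roughly like this (a sketch; recent GATK 4 releases take -I for IndexFeatureFile, older ones used -F):
$ gunzip dbsnp_138.hg19.vcf.gz
$ gatk IndexFeatureFile -I dbsnp_138.hg19.vcf    # writes dbsnp_138.hg19.vcf.idx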
Deviation from the Octopus paper: the Octopus paper uses .vcf.gz files, but we used unzipped .vcf files because of the block compression error, and GATK doesn't support the zipped files (the GATK documentation also does it this way).
See bcftools documentation: https://samtools.github.io/bcftools/bcftools.html
Take the union of all SNVs from samples 1 through 4. --force-samples allows merging files that have sample columns with the same name; each duplicated name gets the file's index (according to its order in the command) prefixed to the new column name.
bcftools merge -Ov -m none sample1/somatic.snvs.vcf.gz sample2/somatic.snvs.vcf.gz sample3/somatic.snvs.vcf.gz sample4/somatic.snvs.vcf.gz -o combined.vcf --force-samples
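To check how --force-samples renamed any duplicated sample columns, list the samples of the merged file:
$ bcftools query -l combined.vcf
# duplicated names come back prefixed with the file index, e.g. 2:TUMOR, 3:TUMOR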
Get the intersection of VCF files. dir is the name of a new directory this command will create. Inside, read README.md for more information about what each file is. Usually, if you have two input files, four files will be created:
- 0000.vcf - All SNVs unique to file1, taken from file1.
- 0001.vcf - All SNVs unique to file2, taken from file2.
- 0002.vcf - SNVs common to both file1 and file2, taken from file1.
- 0003.vcf - SNVs common to both file1 and file2, taken from file2.
bcftools isec -p dir -f PASS sample3/somatic.snvs.vcf.gz sample4/somatic.snvs.vcf.gz
Octopus supports multi-sample calling. Use the script run_octopus.sh.
--fast sacrifices some accuracy for speed.
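A sketch of a multi-sample somatic call with Octopus, based on its documentation (BAM names, the normal sample name, and thread count are placeholders; run_octopus.sh may differ):
$ octopus -R hg19/ucsc.hg19.fasta --threads 4 -I normal.dedup.bam tumour1.dedup.bam tumour2.dedup.bam --normal-sample NORMAL_SAMPLE_NAME -o octopus.somatic.vcf
# add --fast to trade some accuracy for speed, as noted above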
Strelka2 does not support multi-sample calling (but there is a workaround that improves recall: Illumina/strelka#59). Use the script run_strelka2.sh.
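A sketch of what run_strelka2.sh presumably wraps, following the Strelka2 user guide (install path, BAM names, and run directory are placeholders):
$ /path/to/strelka/bin/configureStrelkaSomaticWorkflow.py --normalBam normal.dedup.bam --tumorBam 043F13.dedup.bam --referenceFasta hg19/ucsc.hg19.fasta --runDir strelka_043F13
$ strelka_043F13/runWorkflow.py -m local -j 8
# results land in strelka_043F13/results/variants/, including the somatic.snvs.vcf.gz consumed by the bcftools commands above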
snpEff annotation provides the following information:
- What gene the SNV occurs in.
- The type/function of the mutation (e.g., synonymous, missense, stop_gained).
- The actual nucleotide change and/or protein change.
snpEff also lets you use your own reference genome if needed. This script uses a downloaded genome.
java -Xmx4g -jar snpEff.jar -v sus11.1 /scratch/data/output.vcf > Sus11.1_annotated.vcf
(documentation incomplete) We use HATCHet for copy number calling.
(documentation incomplete) Python implementation: https://github.com/lrgr/signature-estimation-py/tree/master/example
R implementation (heuristic): https://github.com/raerose01/deconstructSigs
I used deconstructSigs to create and normalize a trinucleotide count matrix. Then, I used that as input for signature-estimation-py.
Be sure to grant execute permission to your script before running it with bash. For instance:
chmod u+x XYZ.sh
bash XYZ.sh
(u means user, x means executable)
Finding matching genes from a genelist.
grep -w -F -f genelist.txt Sus11.1_genes.gff3 > overlapping_genes.gff3
Take one column from a file, use a delimiter to split up a column, sort, and count each unique value.
grep -v "\#" output_concat_annotated.vcf | cut -f8 | cut -d'|' -f2 | sort | uniq -c
Manipulate files using awk (e.g., prepending a value such as "chr" to every row in a column).
awk '{ print "sample\t" $1"\t"$2"\t"$3"\t"$4"\t" }' desig_strelka2_concat_input_header.txt | less
Print something in the first line of a file (i.e., a header)
awk 'BEGIN { print"chr\tpos\tref\talt\tgene\ttype" } {split($8,a,"|"); print $1"\t"$2"\t"$4"\t"$5"\t"a[4]"\t"a[2] }' output_concat_annotated.vcf | grep -v "\#"
- [ ] Document how to find statistics for bam/vcf files
- Clarify difference between bcftools and vcftools
- Copy number calling pipeline
- [ ] Finish documentation for mutational signatures
- [ ] Finish documentation for CNA calling and SNVs