- This is the data preparation process for the demo workflow. If you want to follow the whole workflow with a real-world dataset, refer to this
- TL;DR: run the following command
bash <(curl -s https://raw.githubusercontent.com/JiehoonKwak/MSE801_JHLEE/main/download_demo.sh)
- The original FASTQ files were too large, so I first tried to reverse-create fastq.gz files containing only reads aligned to chr5 (subsample.sh), but that decreased the depth and quality of the data. So I decided to randomly sample 1% of the original FASTQ files instead.
# note: random subsampling does not support effective variant calling
ls *.fastq.gz | parallel -j 6 'seqtk sample -s777 {} 0.01 | gzip > sample/subsampled_{}'
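As a sanity check on the 1% subsample, read counts before and after subsampling can be compared (a FASTQ record is always 4 lines). The snippet below illustrates the counting pattern on a tiny synthetic file; `demo.fastq.gz` is just a stand-in for a real subsampled file.

```shell
# Illustrative only: a 3-read synthetic FASTQ stands in for a real subsampled file.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' | gzip > demo.fastq.gz
lines=$(gzip -dc demo.fastq.gz | wc -l)
reads=$((lines / 4))          # each FASTQ record is exactly 4 lines
echo "reads: $reads"
rm demo.fastq.gz
```

Running the same count on an original file and its `subsampled_` counterpart should show roughly a 100:1 ratio.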
- Download the chr5 sequence (reference genome)
curl -L https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr5.fa.gz -o hg38.fa.gz
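Since the download is saved as hg38.fa.gz even though it contains only chr5, it is worth confirming the FASTA header before using it. This sketch shows the check on a synthetic stand-in; swap `ref_demo.fa.gz` for the real file.

```shell
# Synthetic stand-in for hg38.fa.gz; replace ref_demo.fa.gz with the real file.
printf '>chr5\nACGTACGTACGT\n' | gzip > ref_demo.fa.gz
header=$(gzip -dc ref_demo.fa.gz | head -n 1)   # first line should name the chromosome
echo "$header"
rm ref_demo.fa.gz
```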
- Download known-sites resources for Base Quality Score Recalibration (BQSR)
curl -L https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf -o Homo_sapiens_assembly38.dbsnp138.vcf
curl -L https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx -o Homo_sapiens_assembly38.dbsnp138.vcf.idx
- Download Mutect2 resources
curl -L https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz -o af-only-gnomad.hg38.vcf.gz
curl -L https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/somatic-hg38/af-only-gnomad.hg38.vcf.gz.tbi -o af-only-gnomad.hg38.vcf.gz.tbi
curl -L https://storage.googleapis.com/gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz -o 1000g_pon.hg38.vcf.gz
curl -L https://storage.googleapis.com/gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz.tbi -o 1000g_pon.hg38.vcf.gz.tbi
curl -L https://storage.googleapis.com/gcp-public-data--broad-references/hg38/v0/exome_calling_regions.v1.1.interval_list -o exome_calling_regions.v1.1.interval_list
gatk FuncotatorDataSourceDownloader --somatic --validate-integrity --extract-after-download --hg38
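Once everything is downloaded, a quick existence check catches failed or truncated curl transfers before the workflow starts. The file names below simply mirror the download commands above.

```shell
# Hedged sketch: report any resource file that is missing or empty after download.
missing=0
for f in Homo_sapiens_assembly38.dbsnp138.vcf Homo_sapiens_assembly38.dbsnp138.vcf.idx \
         af-only-gnomad.hg38.vcf.gz af-only-gnomad.hg38.vcf.gz.tbi \
         1000g_pon.hg38.vcf.gz 1000g_pon.hg38.vcf.gz.tbi \
         exome_calling_regions.v1.1.interval_list; do
  [ -s "$f" ] || { echo "MISSING: $f"; missing=$((missing + 1)); }
done
echo "missing files: $missing"
```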