GRCh38/hg38 human reference genome
UiO ML-nodes hold a copy of an indexed reference dataset at /storage/ngs/reference_data/GRCh38
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
from the NCBI FTP site
A gzipped file that contains FASTA format sequences for the following:
- chromosomes from the GRCh38 Primary Assembly unit. Note: the two PAR regions on chrY have been hard-masked with Ns. The chromosome Y sequence provided therefore has the same coordinates as the GenBank sequence, but it is not identical to the GenBank sequence. Similarly, duplicate copies of centromeric arrays and WGS on chromosomes 5, 14, 19, 21 and 22 have been hard-masked with Ns (the locations of the unmasked copies are listed in the README that accompanies the file on the FTP site).
- mitochondrial genome from the GRCh38 non-nuclear assembly unit.
- unlocalized scaffolds from the GRCh38 Primary Assembly unit.
- unplaced scaffolds from the GRCh38 Primary Assembly unit.
- Epstein-Barr virus (EBV) sequence. Note: The EBV sequence is not part of the genome assembly but is included in the analysis set as a sink for alignment of reads that are often present in sequencing samples.
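To check which of the categories above a given sequence belongs to, a small sketch like the following may help. It assumes UCSC-style sequence names (chr1 to chr22, chrX, chrY, chrM, chrUn_*, chr*_random, chrEBV), which is an assumption about this copy and not something stated above. The function names are hypothetical.

```python
import re

# Assumption: UCSC-style names. Adjust the patterns if the local copy
# uses different identifiers (e.g. GenBank accessions).
PRIMARY = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)$")


def classify_contig(name):
    """Return the analysis-set category for one sequence name."""
    if PRIMARY.match(name):
        return "chromosome"
    if name == "chrM":
        return "mitochondrion"
    if name == "chrEBV":
        return "EBV decoy"
    if name.startswith("chrUn_"):
        return "unplaced scaffold"
    if name.startswith("chr") and name.endswith("_random"):
        return "unlocalized scaffold"
    return "other"


def summarize_fai(lines):
    """Count categories from the lines of a .fai index (name is column 1)."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name = line.split("\t", 1)[0]
        category = classify_contig(name)
        counts[category] = counts.get(category, 0) + 1
    return counts
```

Passing an open .fai file (for example the one written by samtools faidx) to summarize_fai gives a per-category count, which is a quick way to confirm that no ALT contigs are present.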
NOTES: https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use
samtools faidx
: https://www.htslib.org/doc/samtools-faidx.html
bwa index
: https://bio-bwa.sourceforge.net/bwa.shtml
GATK CreateSequenceDictionary
: https://gatk.broadinstitute.org/hc/en-us/articles/360037422891-CreateSequenceDictionary-Picard-
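The three tools above could be driven from a short script along these lines. This is a sketch, not a record of how the copy on the ML-nodes was built. It assumes the FASTA has already been decompressed (samtools faidx reads uncompressed or bgzip-compressed files, not plain gzip) and that samtools, bwa and gatk are on PATH. The function names and the example path are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def index_commands(fasta: str) -> List[List[str]]:
    """Build the three indexing commands for an uncompressed FASTA file."""
    # The sequence dictionary sits next to the FASTA, with .dict
    # replacing the final extension.
    dict_path = str(Path(fasta).with_suffix(".dict"))
    return [
        ["samtools", "faidx", fasta],  # writes <fasta>.fai
        ["bwa", "index", fasta],       # writes <fasta>.amb/.ann/.bwt/.pac/.sa
        ["gatk", "CreateSequenceDictionary", "-R", fasta, "-O", dict_path],
    ]


def build_indexes(fasta: str) -> None:
    """Run each command in turn and stop at the first failure."""
    for cmd in index_commands(fasta):
        subprocess.run(cmd, check=True)
```

For example, build_indexes("GCA_000001405.15_GRCh38_no_alt_analysis_set.fna") would write the .fai, the BWA index files and the .dict file alongside the FASTA. Building the BWA index for a full human genome can take on the order of an hour.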