This table describes the contigs found in various GRCh38/hg38 reference genome fastas. It was anchored by the GRCh38.p12 standard release. All contigs, including patch, alt, unplaced/unlocalized, from that release are found in the first column "refseq". The rest of the columns are the identifiers other organizations use to identify that same contig.
The GRC actually publishes two separate releases of the genome:
The first, I'll call the "standard release", is updated with patches periodically. On their FTP, you can find GRCh38 or GRCh38.p12 for example. The .p12 indicates that this is patch version 12. These patch releases are also called minor releases. As new sequences are identified, gaps filled, etc., the GRC creates incremental updates that do_not affect the coordinates of the other contigs. They are added as additional seqeunces in the fasta files.
Additionaly, GRC has published a "pipeline release" specifically for data analysis pipelines. This version has different sequence names that match the UCSC-style naming convention. It also does not include any patches, and is not updated with minor releases.
Both of these are described below:
Every contig is named by its RefSeq accession number, the standard builds will also include patch contigs, notice "FIX PATCH" in some of the descriptions.
Source: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.27_GRCh38.p12/GCA_000001405.27_GRCh38.p12_genomic.fna.gz
Excerpt:
>CM000663.2 Homo sapiens chromosome 1, GRCh38 reference primary assembly
>KI270706.1 Homo sapiens chromosome 1 unlocalized genomic contig, GRCh38 reference primary assembly
>KI270707.1 Homo sapiens chromosome 1 unlocalized genomic contig, GRCh38 reference primary assembly
...
>KI270740.1 Homo sapiens chromosome Y unlocalized genomic contig, GRCh38 reference primary assembly
>KI270302.1 Homo sapiens unplaced genomic contig, GRCh38 reference primary assembly
>KI270304.1 Homo sapiens unplaced genomic contig, GRCh38 reference primary assembly
...
>KZ208924.1 Homo sapiens chromosome Y genomic contig HG1535_PATCH, GRC reference assembly FIX PATCH for GRCh38
>KN196487.1 Homo sapiens chromosome Y genomic contig HG2062_PATCH, GRC reference assembly FIX PATCH for GRCh38
>KI270762.1 Homo sapiens chromosome 1 genomic contig, GRCh38 reference assembly alternate locus group ALT_REF_LOCI_1
...
>KI270933.1 Homo sapiens chromosome 19 genomic contig, GRCh38 reference assembly alternate locus group ALT_REF_LOCI_34
>GL000209.2 Homo sapiens chromosome 19 genomic contig, GRCh38 reference assembly alternate locus group ALT_REF_LOCI_35
>J01415.2 Homo sapiens mitochondrion, complete genome
With the GRCh38 primary assembly release, the GRC added a folder to their FTP seqs_for_alignment_pipelines.ucsc_ids presumably because researchers would prefer to see chr1, chrX, etc. instead of the raw accession numbers. Some portions of the genome have been hard masked to prevent erroneous alignments. In addition, they added several extra "decoy" sequences. These sequences are not part of the human genome, but improve the alignment of short read sequencing data.
Source: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/eqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_plus_hs38d1_analysis_set.fna.gz
There are several fastas available in that same FTP directory. This is from one with the most content. The other files will omit the alt or decoy sequences for example.
Excerpt:
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
>chr2 AC:CM000664.2 gi:568336022 LN:242193529 rl:Chromosome M5:f98db672eb0993dcfdabafe2a882905c AS:GRCh38
>chr3 AC:CM000665.2 gi:568336021 LN:198295559 rl:Chromosome M5:76635a41ea913a405ded820447d067b0 AS:GRCh38
...
>chrX AC:CM000685.2 gi:568336001 LN:156040895 rl:Chromosome M5:2b3a55ff7f58eb308420c8a9b11cac50 AS:GRCh38
>chrY AC:CM000686.2 gi:568336000 LN:57227415 rl:Chromosome M5:ce3e31103314a704255f3cd90369ecce AS:GRCh38 hm:10001-2781479,56887903-57217415
>chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion M5:c68f52674c9fb33aef52dcf399755519 AS:GRCh38 tp:circular
...
>chrUn_GL000216v2 AC:GL000216.2 gi:568335254 LN:176608 rl:unplaced M5:725009a7e3f5b78752b68afa922c090c AS:GRCh38
>chrUn_GL000218v1 AC:GL000218.1 gi:224183305 LN:161147 rl:unplaced M5:1d708b54644c26c7e01c2dad5426d38c AS:GRCh38
>chr1_KI270762v1_alt AC:KI270762.1 gi:568335926 LN:354444 rg:chr1:2448811-2791270 rl:alt-scaffold M5:b0397179e5a9
...
>chr19_GL949753v2_alt AC:GL949753.2 gi:568335996 LN:796479 rg:chr19:54025634-55084318 rl:alt-scaffold M5:19162055ca3e800f14797b6cd37c1d4c AS:GRCh38
>chr19_KI270938v1_alt AC:KI270938.1 gi:568335998 LN:1066800 rg:chr19:54025634-55084318 rl:alt-scaffold M5:9363b56f7b34fb35ab3400b1093f431a AS:GRCh38
>chrEBV AC:AJ507799.2 gi:86261677 LN:171823 rl:decoy M5:6743bd63b3ff2b5b8985d8933c53290a SP:Human_herpesvirus_4 tp:circular
...
>chrUn_JTFH01001996v1_decoy AC:JTFH01001996.1 gi:725020270 LN:2009 rl:decoy M5:a6503ea36c1b69162d3eda87c4f33922 AS:hs38d1
>chrUn_JTFH01001997v1_decoy AC:JTFH01001997.1 gi:725020269 LN:2003 rl:decoy M5:d3a1bed41244634725882235662d11a6 AS:hs38d1
>chrUn_JTFH01001998v1_decoy AC:JTFH01001998.1 gi:725020268 LN:2001 rl:decoy M5:35916d4135a2a9db7bc0749955d7161a AS:hs38d1
Prefixes every contig name with "chr" including unplaced, unlocalized, patches, alts, etc. The tricky part is that they also change the name of some of the contigs. Don't fall into the trap of thinking you can convert from UCSC to Ensembl IDs by just removing the "chr". Even if you choose to ignore the alt contigs, those GL.. KI.. contigs will still be wrong. Here are examples:
Ensembl UCSC
KI270706.1 chr1_KI270706v1_random
KI270707.1 chr1_KI270707v1_random
KI270708.1 chr1_KI270708v1_random
GL000009.2 chr14_GL000009v2_random
GL000225.1 chr14_GL000225v1_random
Source: ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p12/hg38.p12.fa.gz
There are several fastas available in that same FTP directory. This is from one with the most content. The other files will omit the alt or unplaced contigs for example. Each contig can also be downloaded individually from the "chromosomes" directory found here: ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes
Excerpt:
>chr1
>chr10
>chr11
>chr11_KI270721v1_random
...
>chr19_KI270938v1_alt
>chrM
>chrUn_KI270302v1
...
>chrX
>chrY
>chrY_KI270740v1_random
...
>chr17_KZ559114v1_alt
>chr18_KZ559116v1_alt
>chr18_KZ559115v1_fix
Source: ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Ensembl uses single integers to represent the autosomes. For unplaced,
unlocalized, and alts, they generally just use the RefSeq accession numbers.
But, the patch contigs are not consistent. The names found in data downloads
from Ensembl FTP are different than what you'll find while browsing their
website or other sources like BioMart. Most contigs match, but for patches,
the name in their downloads includes CHR_<patchname>
, whereas the website
lists these contigs with just the patch name. The difference might be very
small, but it will break some simple match operations that researchers often
use.
Excerpt:
1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
2 dna:chromosome chromosome:GRCh38:2:1:242193529:1 REF
3 dna:chromosome chromosome:GRCh38:3:1:198295559:1 REF
...
X dna:chromosome chromosome:GRCh38:X:1:156040895:1 REF
Y dna:chromosome chromosome:GRCh38:Y:2781480:56887902:1 REF
MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF
CHR_HG76_PATCH dna:chromosome chromosome:GRCh38:CHR_HG76_PATCH:1:144938989:1 PATCH_FIX
CHR_HSCHR15_4_CTG8 dna:chromosome chromosome:GRCh38:CHR_HSCHR15_4_CTG8:1:102071387:1 HAP
CHR_HSCHR6_MHC_SSTO_CTG1 dna:chromosome chromosome:GRCh38:CHR_HSCHR6_MHC_SSTO_CTG1:1:170946136:1 HAP
...
KI270423.1 dna:scaffold scaffold:GRCh38:KI270423.1:1:981:1 REF
KI270392.1 dna:scaffold scaffold:GRCh38:KI270392.1:1:971:1 REF
KI270394.1 dna:scaffold scaffold:GRCh38:KI270394.1:1:970:1 REF
Source: ftp://ftp.ensembl.org/pub/release-95/gtf/homo_sapiens/Homo_sapiens.GRCh38.95.gtf.gz
Looking at the first column in this file, some lines redacted for more concise output. Note that the autosomes don't match UCSC or RefSeq names, the unplaced contigs match RefSeq names, and the patches don't match anyone, including their own database. They use their own ID system for exons, genes, transcripts etc. But, they also integrate HAVANA ids when applicable.
1
10
11
...
CHR_HG107_PATCH
CHR_HG126_PATCH
CHR_HG1311_PATCH
...
KI270744.1
KI270750.1
MT
...
Gencode is nice. They tried to add Ensembl contig names along with their own names. But, due to the problems with Ensembl discussed above, they don't exactly match the names that you'll find in files downloaded from Ensembl.
Source: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/GRCh38.p12.genome.fa.gz
Excerpt:
>chr1 1
>chr2 2
>chr3 3
...
>chrX X
>chrY Y
>chrM MT
>GL000008.2 GL000008.2
...
>GL000226.1 GL000226.1
>KQ759759.1 HG107_PATCH
>KN538364.1 HG126_PATCH
...
>KV766199.1 HSCHRX_3_CTG7
>KI270302.1 KI270302.1
>KI270303.1 KI270303.1
...
>KI270755.1 KI270755.1
>KI270756.1 KI270756.1
>KI270757.1 KI270757.1
Source: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.chr_patch_hapl_scaff.annotation.gtf.gz
Looking at the first column in this file, some lines redacted for more concise
output. The gene_id
, transcript_id
, and exon_id
fields are Ensembl IDs.
They also include HAVANA ids when applicable.
chr1
chr10
chr11
...
chrM
chrX
chrY
...
GL000009.2
GL000194.1
GL000195.1
...
KZ559114.1
KZ559115.1
KZ559116.1
It looks like UCSC was publishing its own genome builds for hg1-hg8, but later adopted the NCBI builds this table here. In December 2001, with the release of NCBI Build 28, UCSC lists the NCBI build IDs associated with the their own IDs hg10-19. Finally, upon the release of build 38, they harmonized their build numbers with the GRC and called the latest release hg38.
Often times the sequence descriptions will hold useful information. There is actually a standard published by the NCBI but nobody really uses it as far as I can tell:
From wikipedia FASTA_format:,
NCBI identifiers
The NCBI defined a standard for the unique identifier used for the sequence
(SeqID) in the header line. This allows a sequence that was obtained from a
database to be labelled with a reference to its database record. The database
identifier format is understood by the NCBI tools like makeblastdb and
table2asn. The following list describes the NCBI FASTA defined format for
sequence identifiers.
The mitochondrial chromosome is usually labelled chrM or MT, but there are also different sequences found across some genomes. * Current builds should only be using rCRS or RSRS, but be aware that RSRS has not been widely adopted.
Name | Length (bp) | NCBI Identifiers |
---|---|---|
RSRS | 16569 | Not in Genbank |
rCRS | 16569 | J01415.2/NC_012920.1 |
CRS | 16569 | Removed after release of rCRS |
Yoruba Sequence | 16571 | AF347015/NC_001807.4 |
Uganda Sequence | 16559 | D38112 |
Swedish Sequence | 16570 | X93334 |
Japanese Sequence | 16554 | AB055387 |
-
Revised Sapiens Reference Sequence - RSRS
Behar et al proposed a new sequence that is an "ancestral" reference rather than a "leaf" reference. Some groups are still using rCRS because they say the drawbacks to having yet another reference sequence floating around outweigh the benefits of using the ancestral sequence.
Behar DM, van Oven M, Rosset S, et al. A "Copernican" reassessment of the human mitochondrial DNA tree from its root. Am J Hum Genet. 2012;90(4):675-84.
-
- Revised Cambridge Reference Sequence - rCRS
Most common sequence used currently. Found in the NCBI/GRC fasta downloads for GRCh38.
-
Cambridge Reference Sequence - CRS
The first draft of the full mitochondrial sequence published. This was later corrected and released as rCRS. The original sequence was removed from Genbank and should no longer be used.
-
Other Sequences
Some other sequences can be found that are less common. Generally rCRS should be preferred over these, and remapping might be required for older microarray data that used these sequences.