Skip to content

Commit

Permalink
Update to match new PyReference schema. Change names from pyhgvs to s…
Browse files Browse the repository at this point in the history
…eedot, add new 38 annotation releases
  • Loading branch information
davmlaw committed Jan 19, 2022
1 parent aad2d68 commit 107559e
Show file tree
Hide file tree
Showing 15 changed files with 434 additions and 192 deletions.
54 changes: 53 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -1,2 +1,54 @@
# seedot
Transcript versions for HGVS libraries

Seedot is used to load transcripts for the 2 most popular Python [HGVS](http://varnomen.hgvs.org/) libraries.

It works by:

* Converting RefSeq and Ensembl GTFs into a JSON format
* Providing loaders from that JSON format (or a REST service)

We currently support 788k transcripts, 5.5x as many as [Universal Transcript Archive](https://github.com/biocommons/uta/)

## Examples

[Biocommons HGVS](https://github.com/biocommons/hgvs) example:

```
from seedot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider
hp = JSONDataProvider({"GRCh37": "./seedot_220119.grch38.json.gz") # Uses local JSON file
# hp = RESTDataProvider() # Uses API server at seedot.cc
am = AssemblyMapper(hp,
assembly_name='GRCh37',
alt_aln_method='splign', replace_reference=True)
hp = hgvs.parser.Parser()
var_c = hp.parse_hgvs_variant('NM_001637.3:c.1582G>A')
am.c_to_g(var_c)
```

[PyHGVS](https://github.com/counsyl/hgvs) example:

```
# TODO
```

## Philosophical differences from Universal Transcript Archive

Seedot aims to be as simple as possible: convert existing Ensembl/RefSeq GTFs into JSON format

Universal transcript archive is an excellent and ambitious project that:

* Performs its own mapping of transcript sequences to reference genomes
* Stores the transcript version data (exons etc) in a SQL database

This has some advantages, namely that you can resolve a GRCh37 coordinate for a transcript which was never officially released for that build.

However the complexity causes a few downsides:

* Alignments may not exactly match those in official Ensembl/RefSeq releases
* Local install requires a PostgreSQL installation
* Internet hosted UTA is a PostgreSQL server, so requires client Postgres libraries, is inaccessible behind firewalls. They have been planning on building a [REST server since 2014](https://github.com/biocommons/uta/issues/164)
* High complexity manual process for releases means they [do not support Ensembl](https://github.com/biocommons/uta/issues/233) and take a while to make RefSeq releases.

6 changes: 3 additions & 3 deletions generate_transcript_data/ensembl_gene_annotation_grch37.sh
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ pyreference_args=()
for release in 82 85 87; do
filename=Homo_sapiens.GRCh37.${release}.gff3.gz
url=ftp://ftp.ensembl.org/pub/grch37/release-${release}/gff3/homo_sapiens/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz
if [[ ! -e ${filename} ]]; then
wget ${url}
fi
Expand All @@ -20,9 +20,9 @@ for release in 82 85 87; do
pyreference_args+=(--pyreference-json ${pyreference_file})
done

merged_file="pyhgvs_transcripts_ensembl_grch37.json.gz"
merged_file="seedot-$(date --iso).ensembl.grch37.json.gz"
if [[ ! -e ${merged_file} ]]; then
BASE_DIR=$(dirname ${BASH_SOURCE[0]})

python3 ${BASE_DIR}/pyreference_to_pyhgvs_json.py ${pyreference_args[@]} --output ${merged_file}
python3 ${BASE_DIR}/pyreference_to_seedot_json.py ${pyreference_args[@]} --output ${merged_file}
fi
8 changes: 4 additions & 4 deletions generate_transcript_data/ensembl_gene_annotation_grch38.sh
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,10 @@

#81 is first GFF3 for GRCh38
pyreference_args=()
for release in 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104; do
for release in 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105; do
filename=Homo_sapiens.GRCh38.${release}.gff3.gz
url=ftp://ftp.ensembl.org/pub/release-${release}/gff3/homo_sapiens/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz

if [[ ! -e ${filename} ]]; then
wget ${url}
Expand All @@ -29,9 +29,9 @@ for release in 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101
pyreference_args+=(--pyreference-json ${pyreference_file})
done

merged_file="pyhgvs_transcripts_ensembl_grch38.json.gz"
merged_file="seedot-$(date --iso).ensembl.grch38.json.gz"
if [[ ! -e ${merged_file} ]]; then
BASE_DIR=$(dirname ${BASH_SOURCE[0]})

python3 ${BASE_DIR}/pyreference_to_pyhgvs_json.py ${pyreference_args[@]} --output ${merged_file}
python3 ${BASE_DIR}/pyreference_to_seedot_json.py ${pyreference_args[@]} --output ${merged_file}
fi
Original file line number Diff line number Diff line change
@@ -1,20 +1,23 @@
"""
Converts PyReference JSON.gz files (created from RefSeq/Ensembl GTF or GFFs) into format easy to load w/PyHGVS
Converts PyReference JSON.gz files (created from RefSeq/Ensembl GTF or GFFs) into format easy to load for HGVS
@see https://bitbucket.org/sacgf/pyreference/
Dave Lawrence ([email protected]) on 22/10/2021
Dave Lawrence ([email protected]) on 17/01/2022
"""
import gzip
import json
from argparse import ArgumentParser
from typing import Dict

import seedot


def handle_args():
parser = ArgumentParser(description='Convert multiple PyReference json.gz files into one for PyHGVS')
parser = ArgumentParser(description='Convert multiple PyReference json.gz files into one for SeeDot')
parser.add_argument('--pyreference-json', required=True, nargs="+", action="extend",
help='PyReference JSON.gz - list OLDEST to NEWEST (newest is kept)')
parser.add_argument("--store-genes", action='store_true', help="Also store gene/version information")
parser.add_argument('--output', required=True, help='Output filename')
return parser.parse_args()

Expand All @@ -37,50 +40,6 @@ def convert_gene_pyreference_to_gene_version_data(gene_data: Dict) -> Dict:
return gene_version_data


def convert_transcript_pyreference_to_pyhgvs(transcript_data: Dict) -> Dict:
start = transcript_data["start"]
end = transcript_data["stop"]
strand = transcript_data["strand"]
# PyHGVS has cds_start/cds_end be equal to end/end for non-coding transcripts (so coding length ie end-start = 0)
cds_start = transcript_data.get("cds_start", end)
cds_end = transcript_data.get("cds_end", end)
# PyHGVS exons are in genomic order, PyReference are in stranded
features = transcript_data["features_by_type"]
exons = [[ed["start"], ed["stop"]] for ed in features["exon"]]
cdna_match = []
for cdm in features.get("cDNA_match", []):
if "cdna_strand" in cdm: # None in Human RefSeq so haven't handled
raise ValueError("Haven't handled stranded Target alignment")
cdna_match.append([cdm.get(k) for k in ["start", "stop", "cdna_start", "cdna_end", "gap"]])

pyhgvs_data = {
'chrom': transcript_data["chr"],
'start': start,
'end': end,
'strand': strand,
'cds_start': cds_start,
'cds_end': cds_end,
'exons': exons,
}

# Optional stuff
for optional_key in ["start_codon_transcript_pos", "stop_codon_transcript_pos"]:
if value := transcript_data.get(optional_key):
pyhgvs_data[optional_key] = value

if cdna_match:
pyhgvs_data["cdna_match"] = cdna_match
if transcript_data.get("partial"):
pyhgvs_data["partial"] = 1
other_chroms = transcript_data.get("other_chroms")
if other_chroms:
pyhgvs_data["other_chroms"] = other_chroms

# Few extra fields (pyHGVS doesn't currently use)
pyhgvs_data["biotype"] = ",".join(transcript_data["biotype"])
return pyhgvs_data


def main():
args = handle_args()

Expand Down Expand Up @@ -110,22 +69,26 @@ def main():
for transcript_accession in gene["transcripts"]:
transcript_gene_version[transcript_accession] = gene_accession

for transcript_accession, transcript in pyref_data["transcripts_by_id"].items():
hgvs_data = convert_transcript_pyreference_to_pyhgvs(transcript)
for transcript_accession, transcript_version in pyref_data["transcripts_by_id"].items():
gene_accession = transcript_gene_version[transcript_accession]
gene_version = gene_versions[gene_accession]
hgvs_data["id"] = transcript_accession
hgvs_data["gene_version"] = gene_accession
hgvs_data["gene_name"] = gene_version["gene_symbol"]
hgvs_data["url"] = url
transcript_versions[transcript_accession] = hgvs_data

print("Writing PyHGVS data")
transcript_version["id"] = transcript_accession
transcript_version["gene_version"] = gene_accession
transcript_version["gene_name"] = gene_version["gene_symbol"]
transcript_version["url"] = url
transcript_versions[transcript_accession] = transcript_version
if hgnc := gene_version.get("hgnc"):
transcript_version["hgnc"] = hgnc

print("Writing SeeDot data")
with gzip.open(args.output, 'w') as outfile:
data = {
"transcripts": transcript_versions,
"genes": gene_versions,
"seedot_version": seedot.__version__,
}
if args.store_genes:
data["genes"] = gene_versions

json_str = json.dumps(data)
outfile.write(json_str.encode('ascii'))

Expand Down
16 changes: 8 additions & 8 deletions generate_transcript_data/refseq_gene_annotation_grch37.sh
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,7 @@ pyreference_args=()

filename=ref_GRCh37.p5_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/BUILD.37.3/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz
if [[ ! -e ${filename} ]]; then
wget ${url}
fi
Expand All @@ -18,7 +18,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})

filename=ref_GRCh37.p9_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.103/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz
if [[ ! -e ${filename} ]]; then
wget ${url}
fi
Expand All @@ -30,7 +30,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})

filename=ref_GRCh37.p10_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.104/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz
if [[ ! -e ${filename} ]]; then
wget ${url}
fi
Expand All @@ -42,7 +42,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})

filename=ref_GRCh37.p13_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz
if [[ ! -e ${filename} ]]; then
wget ${url}
fi
Expand All @@ -56,7 +56,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})
for release in 105.20190906 105.20201022; do
filename=GCF_000001405.25_GRCh37.p13_genomic.${release}.gff.gz
url=http://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/${release}/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz
if [[ ! -e ${filename} ]]; then
wget ${url} --output-document=${filename}
fi
Expand All @@ -66,10 +66,10 @@ for release in 105.20190906 105.20201022; do
pyreference_args+=(--pyreference-json ${pyreference_file})
done


merged_file="pyhgvs_transcripts_refseq_grch37.json.gz"
merged_file="seedot-$(date --iso).refseq.grch37.json.gz"
echo "Creating ${merged_file}"
if [[ ! -e ${merged_file} ]]; then
BASE_DIR=$(dirname ${BASH_SOURCE[0]})

python3 ${BASE_DIR}/pyreference_to_pyhgvs_json.py ${pyreference_args[@]} --output ${merged_file}
python3 ${BASE_DIR}/pyreference_to_seedot_json.py ${pyreference_args[@]} --output ${merged_file}
fi
19 changes: 9 additions & 10 deletions generate_transcript_data/refseq_gene_annotation_grch38.sh
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

filename=ref_GRCh38_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz

if [[ ! -e ${filename} ]]; then
wget ${url}
Expand All @@ -17,7 +17,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})

filename=ref_GRCh38.p2_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.107/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz

if [[ ! -e ${filename} ]]; then
wget ${url}
Expand All @@ -30,7 +30,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})

filename=ref_GRCh38.p7_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.108/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz

if [[ ! -e ${filename} ]]; then
wget ${url}
Expand All @@ -43,7 +43,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})

filename=ref_GRCh38.p12_top_level.gff3.gz
url=http://ftp.ncbi.nlm.nih.gov/genomes/archive/old_refseq/Homo_sapiens/ARCHIVE/ANNOTATION_RELEASE.109/GFF/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz

if [[ ! -e ${filename} ]]; then
wget ${url}
Expand All @@ -56,7 +56,7 @@ pyreference_args+=(--pyreference-json ${pyreference_file})

filename=GCF_000001405.38_GRCh38.p12_genomic.gff.gz
url=http://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/109/GCF_000001405.38_GRCh38.p12/${filename}
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz

if [[ ! -e ${filename} ]]; then
wget ${url}
Expand All @@ -68,10 +68,10 @@ pyreference_args+=(--pyreference-json ${pyreference_file})


# These all have the same name, so rename them based on release ID
for release in 109.20190607 109.20190905 109.20191205 109.20200228 109.20200522 109.20200815 109.20201120 109.20210226 109.20210514; do
for release in 109.20190607 109.20190905 109.20191205 109.20200228 109.20200522 109.20200815 109.20201120 109.20210226 109.20210514 109.20211119; do
filename=GCF_000001405.39_GRCh38.p13_genomic.${release}.gff.gz
url=http://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/annotation_releases/${release}/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
pyreference_file=${filename}.json.gz
pyreference_file=$(basename $filename .gz).json.gz
if [[ ! -e ${filename} ]]; then
wget ${url} --output-document=${filename}
fi
Expand All @@ -81,10 +81,9 @@ for release in 109.20190607 109.20190905 109.20191205 109.20200228 109.20200522
pyreference_args+=(--pyreference-json ${pyreference_file})
done


merged_file="pyhgvs_transcripts_refseq_grch38.json.gz"
merged_file="seedot-$(date --iso).refseq.grch38.json.gz"
if [[ ! -e ${merged_file} ]]; then
BASE_DIR=$(dirname ${BASH_SOURCE[0]})

python3 ${BASE_DIR}/pyreference_to_pyhgvs_json.py ${pyreference_args[@]} --output ${merged_file}
python3 ${BASE_DIR}/pyreference_to_seedot_json.py ${pyreference_args[@]} --output ${merged_file}
fi
6 changes: 6 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
[build-system]
requires = [
"setuptools>=42",
"wheel"
]
build-backend = "setuptools.build_meta"
7 changes: 7 additions & 0 deletions seedot/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
__version__ = "0.1.1"


def get_json_schema_version():
""" Return an int which increments upon breaking changes - ie anything other than patch """
major, minor, patch = __version__.split(".")
return 1000 * int(major) + int(minor)
1 change: 1 addition & 0 deletions seedot/hgvs/dataproviders/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
from .json_data_provider import *
Loading

0 comments on commit 107559e

Please sign in to comment.