-
Notifications
You must be signed in to change notification settings - Fork 5
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Update to match new PyReference schema. Change names from pyhgvs to s…
…eedot, add new 38 annotation releases
- Loading branch information
Showing
15 changed files
with
434 additions
and
192 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,54 @@ | ||
# seedot | ||
Transcript versions for HGVS libraries | ||
|
||
Seedot is used to load transcripts for the 2 most popular Python [HGVS](http://varnomen.hgvs.org/) libraries. | ||
|
||
It works by: | ||
|
||
* Converting RefSeq and Ensembl GTFs into a JSON format | ||
* Providing loaders from that JSON format (or a REST service) | ||
|
||
We currently support 788k transcripts, 5.5x as many as [Universal Transcript Archive](https://github.com/biocommons/uta/) | ||
|
||
## Examples | ||
|
||
[Biocommons HGVS](https://github.com/biocommons/hgvs) example: | ||
|
||
``` | ||
from seedot.hgvs.dataproviders import JSONDataProvider, RESTDataProvider | ||
hp = JSONDataProvider({"GRCh37": "./seedot_220119.grch38.json.gz") # Uses local JSON file | ||
# hp = RESTDataProvider() # Uses API server at seedot.cc | ||
am = AssemblyMapper(hp, | ||
assembly_name='GRCh37', | ||
alt_aln_method='splign', replace_reference=True) | ||
hp = hgvs.parser.Parser() | ||
var_c = hp.parse_hgvs_variant('NM_001637.3:c.1582G>A') | ||
am.c_to_g(var_c) | ||
``` | ||
|
||
[PyHGVS](https://github.com/counsyl/hgvs) example: | ||
|
||
``` | ||
# TODO | ||
``` | ||
|
||
## Philosophical differences from Universal Transcript Archive | ||
|
||
Seedot aims to be as simple as possible: convert existing Ensembl/RefSeq GTFs into JSON format | ||
|
||
Universal transcript archive is an excellent and ambitious project that: | ||
|
||
* Performs its own mapping of transcript sequences to reference genomes | ||
* Stores the transcript version data (exons etc) in a SQL database | ||
|
||
This has some advantages, namely that you can resolve a GRCh37 coordinate for a transcript which was never officially released for that build. | ||
|
||
However the complexity causes a few downsides: | ||
|
||
* Alignments may not exactly match those in official Ensembl/RefSeq releases | ||
* Local install requires a PostgreSQL installation | ||
* Internet hosted UTA is a PostgreSQL server, so requires client Postgres libraries, is inaccessible behind firewalls. They have been planning on building a [REST server since 2014](https://github.com/biocommons/uta/issues/164) | ||
* High complexity manual process for releases means they [do not support Ensembl](https://github.com/biocommons/uta/issues/233) and take a while to make RefSeq releases. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,20 +1,23 @@ | ||
""" | ||
Converts PyReference JSON.gz files (created from RefSeq/Ensembl GTF or GFFs) into format easy to load w/PyHGVS | ||
Converts PyReference JSON.gz files (created from RefSeq/Ensembl GTF or GFFs) into format easy to load for HGVS | ||
@see https://bitbucket.org/sacgf/pyreference/ | ||
Dave Lawrence ([email protected]) on 22/10/2021 | ||
Dave Lawrence ([email protected]) on 17/01/2022 | ||
""" | ||
import gzip | ||
import json | ||
from argparse import ArgumentParser | ||
from typing import Dict | ||
|
||
import seedot | ||
|
||
|
||
def handle_args(): | ||
parser = ArgumentParser(description='Convert multiple PyReference json.gz files into one for PyHGVS') | ||
parser = ArgumentParser(description='Convert multiple PyReference json.gz files into one for SeeDot') | ||
parser.add_argument('--pyreference-json', required=True, nargs="+", action="extend", | ||
help='PyReference JSON.gz - list OLDEST to NEWEST (newest is kept)') | ||
parser.add_argument("--store-genes", action='store_true', help="Also store gene/version information") | ||
parser.add_argument('--output', required=True, help='Output filename') | ||
return parser.parse_args() | ||
|
||
|
@@ -37,50 +40,6 @@ def convert_gene_pyreference_to_gene_version_data(gene_data: Dict) -> Dict: | |
return gene_version_data | ||
|
||
|
||
def convert_transcript_pyreference_to_pyhgvs(transcript_data: Dict) -> Dict: | ||
start = transcript_data["start"] | ||
end = transcript_data["stop"] | ||
strand = transcript_data["strand"] | ||
# PyHGVS has cds_start/cds_end be equal to end/end for non-coding transcripts (so coding length ie end-start = 0) | ||
cds_start = transcript_data.get("cds_start", end) | ||
cds_end = transcript_data.get("cds_end", end) | ||
# PyHGVS exons are in genomic order, PyReference are in stranded | ||
features = transcript_data["features_by_type"] | ||
exons = [[ed["start"], ed["stop"]] for ed in features["exon"]] | ||
cdna_match = [] | ||
for cdm in features.get("cDNA_match", []): | ||
if "cdna_strand" in cdm: # None in Human RefSeq so haven't handled | ||
raise ValueError("Haven't handled stranded Target alignment") | ||
cdna_match.append([cdm.get(k) for k in ["start", "stop", "cdna_start", "cdna_end", "gap"]]) | ||
|
||
pyhgvs_data = { | ||
'chrom': transcript_data["chr"], | ||
'start': start, | ||
'end': end, | ||
'strand': strand, | ||
'cds_start': cds_start, | ||
'cds_end': cds_end, | ||
'exons': exons, | ||
} | ||
|
||
# Optional stuff | ||
for optional_key in ["start_codon_transcript_pos", "stop_codon_transcript_pos"]: | ||
if value := transcript_data.get(optional_key): | ||
pyhgvs_data[optional_key] = value | ||
|
||
if cdna_match: | ||
pyhgvs_data["cdna_match"] = cdna_match | ||
if transcript_data.get("partial"): | ||
pyhgvs_data["partial"] = 1 | ||
other_chroms = transcript_data.get("other_chroms") | ||
if other_chroms: | ||
pyhgvs_data["other_chroms"] = other_chroms | ||
|
||
# Few extra fields (pyHGVS doesn't currently use) | ||
pyhgvs_data["biotype"] = ",".join(transcript_data["biotype"]) | ||
return pyhgvs_data | ||
|
||
|
||
def main(): | ||
args = handle_args() | ||
|
||
|
@@ -110,22 +69,26 @@ def main(): | |
for transcript_accession in gene["transcripts"]: | ||
transcript_gene_version[transcript_accession] = gene_accession | ||
|
||
for transcript_accession, transcript in pyref_data["transcripts_by_id"].items(): | ||
hgvs_data = convert_transcript_pyreference_to_pyhgvs(transcript) | ||
for transcript_accession, transcript_version in pyref_data["transcripts_by_id"].items(): | ||
gene_accession = transcript_gene_version[transcript_accession] | ||
gene_version = gene_versions[gene_accession] | ||
hgvs_data["id"] = transcript_accession | ||
hgvs_data["gene_version"] = gene_accession | ||
hgvs_data["gene_name"] = gene_version["gene_symbol"] | ||
hgvs_data["url"] = url | ||
transcript_versions[transcript_accession] = hgvs_data | ||
|
||
print("Writing PyHGVS data") | ||
transcript_version["id"] = transcript_accession | ||
transcript_version["gene_version"] = gene_accession | ||
transcript_version["gene_name"] = gene_version["gene_symbol"] | ||
transcript_version["url"] = url | ||
transcript_versions[transcript_accession] = transcript_version | ||
if hgnc := gene_version.get("hgnc"): | ||
transcript_version["hgnc"] = hgnc | ||
|
||
print("Writing SeeDot data") | ||
with gzip.open(args.output, 'w') as outfile: | ||
data = { | ||
"transcripts": transcript_versions, | ||
"genes": gene_versions, | ||
"seedot_version": seedot.__version__, | ||
} | ||
if args.store_genes: | ||
data["genes"] = gene_versions | ||
|
||
json_str = json.dumps(data) | ||
outfile.write(json_str.encode('ascii')) | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
[build-system] | ||
requires = [ | ||
"setuptools>=42", | ||
"wheel" | ||
] | ||
build-backend = "setuptools.build_meta" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
__version__ = "0.1.1" | ||
|
||
|
||
def get_json_schema_version(): | ||
""" Return an int which increments upon breaking changes - ie anything other than patch """ | ||
major, minor, patch = __version__.split(".") | ||
return 1000 * int(major) + int(minor) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from .json_data_provider import * |
Oops, something went wrong.