Vv ensembl dev susmi #615

Open
wants to merge 175 commits into
base: ensembl_update_2024
175 commits
9e0313e
Merge pull request #389 from openvar/update_to_vvta
Jul 7, 2022
caec3fc
Merge pull request #390 from openvar/vv_ensembl_develop_pete
sbenny1230 Jul 7, 2022
7b77b02
Merge pull request #391 from openvar/vv_ensembl_develop_pete
sbenny1230 Jul 7, 2022
9e4525e
Remove ensembl not supported msg
sbenny1230 Jul 15, 2022
2602240
Add transcript set when searching options
sbenny1230 Jul 15, 2022
56ceeec
Add alt_aln_method to all t_to_g methods
sbenny1230 Jul 15, 2022
af3eb12
Add alt_aln_method when searching
sbenny1230 Jul 15, 2022
5ad1528
Tidy up Mixin code
sbenny1230 Jul 17, 2022
76daf7b
Merge pull request #394 from openvar/vv_ensembl_develop
sbenny1230 Jul 17, 2022
2df0d59
Optimise and fix search through options
sbenny1230 Jul 19, 2022
8d9a28a
Switch key order for postgres in config test
sbenny1230 Jul 19, 2022
b4c9eeb
Clean up ensembl url code
sbenny1230 Jul 19, 2022
2cdc808
Sort grch37 and grch38 ensembl urls
sbenny1230 Jul 19, 2022
d5b8cf0
Fix bug in ensembl test variant 2
sbenny1230 Jul 19, 2022
b4cb8b1
Add ensembl urls to ensembl input test
sbenny1230 Jul 19, 2022
8444080
Sort incorrect values in ensembl tests 4 and 5
sbenny1230 Jul 19, 2022
f4fb451
Add alt_aln_method when getting hgvs_stash_t
sbenny1230 Jul 19, 2022
692a5ac
Update wrong genome build error
sbenny1230 Jul 21, 2022
f5bf878
Add ensembl test for wrong genome build
sbenny1230 Jul 21, 2022
65ffe0c
Pass in alt_aln_method to g_to_t methods
sbenny1230 Jul 21, 2022
d4713ca
Add specific wrong build msg for each build
sbenny1230 Jul 24, 2022
4849921
Tidy up wrong build warning code
sbenny1230 Jul 24, 2022
1f2d93a
Include genome build in new version msg
sbenny1230 Jul 25, 2022
40c6b49
Merge pull request #400 from openvar/vv_ensembl_dev_s_working
sbenny1230 Jul 25, 2022
334ecb8
Delete MANUAL.md
sbenny1230 Jul 25, 2022
05cc65c
Rename MANUAL_UPDATED.md to MANUAL.md
sbenny1230 Jul 25, 2022
1fa9306
Tidy up config
sbenny1230 Jul 25, 2022
d310056
Merge pull request #401 from openvar/vv_ensembl
sbenny1230 Jul 26, 2022
f11fe6e
Hide warn for irrelevant transcripts
sbenny1230 Jul 26, 2022
af1ffcc
Change build to grch38 in ensembl tests
sbenny1230 Jul 27, 2022
5bcc2b5
Fix version code in ensembl test
sbenny1230 Aug 3, 2022
d446b15
Add unit tests for gap alignments
sbenny1230 Aug 27, 2022
b1df309
Add all transcripts to test case 8
sbenny1230 Aug 29, 2022
e3cfd41
Use GRCh37 for test case 9
sbenny1230 Aug 29, 2022
938e05d
Add second transcript to test case 6
sbenny1230 Aug 31, 2022
c76b5cc
Remove debug statements
sbenny1230 Sep 3, 2022
261a1a2
Correct psql file name
sbenny1230 Oct 18, 2022
d08f447
Merge conflicts
Peter-J-Freeman May 7, 2024
4b55691
starting to add tests for Ensembl data sets and some code fixes to ma…
Peter-J-Freeman May 8, 2024
236730d
updating tests
Peter-J-Freeman May 13, 2024
074265f
Updates including insertion length warning improvements
Peter-J-Freeman May 14, 2024
c9be994
Addition of a new test for genomic ins interval error
Peter-J-Freeman May 14, 2024
a1f95c2
Addition of a new test for genomic ins interval error
Peter-J-Freeman May 14, 2024
4634bda
Addition of a new test for genomic ins interval error
Peter-J-Freeman May 14, 2024
50f2829
add code to warn of unsupported transcripts and suggest an alternativ…
Peter-J-Freeman May 15, 2024
79bdb45
Add more tests for ensembl
Peter-J-Freeman May 20, 2024
2127388
Add more tests for ensembl
Peter-J-Freeman May 20, 2024
bcb2828
Small bug fixed
Peter-J-Freeman May 21, 2024
0c73f92
Sort out ensembl gap warnings
Peter-J-Freeman May 24, 2024
b86e394
bug fixing
Peter-J-Freeman Jun 12, 2024
42927df
restore select_transcripts = select
Peter-J-Freeman Jun 12, 2024
5610de8
Bug fixes and close issue https://github.com/openvar/variantValidator…
Peter-J-Freeman Jul 9, 2024
5c8dc57
Bug fixes and close issue https://github.com/openvar/variantValidator…
Peter-J-Freeman Jul 9, 2024
be1add8
update docker conf
Peter-J-Freeman Jul 11, 2024
6d48f0f
Update the code to auto update Ensembl records if missing. Not perfec…
Peter-J-Freeman Jul 16, 2024
346ddcd
update exon sets in g2t for mito sets
Peter-J-Freeman Jul 17, 2024
7404f97
Update transcript version warnings to include genome build info for a…
Peter-J-Freeman Jul 18, 2024
4040c37
bug fix
Peter-J-Freeman Jul 18, 2024
a4bc09c
Handle integer submissions
Peter-J-Freeman Jul 19, 2024
eead53a
Add tests to VF set to ensure hybrid descriptions validate. Should no…
Peter-J-Freeman Jul 25, 2024
36d03e7
Add tests to VF set to ensure hybrid descriptions validate. Should no…
Peter-J-Freeman Jul 25, 2024
b2fee6b
Bug fix and tests added for VF warnings
Peter-J-Freeman Jul 29, 2024
25e04d5
Updates to the expanded repeat code for simple repeats only so far
Peter-J-Freeman Aug 9, 2024
4623e05
bug in utils. Replace a lowercase
Peter-J-Freeman Aug 12, 2024
a97e8af
add intronic boundary handling to uncertain and fuzzy end code
Peter-J-Freeman Aug 23, 2024
974531f
Update dockerfiles
Peter-J-Freeman Sep 3, 2024
23bbd73
Store progress
Peter-J-Freeman Sep 5, 2024
e2de282
Code changes and tests added that close issue #645
Peter-J-Freeman Sep 10, 2024
634797f
Merge pull request #647 from openvar/issue_645
Peter-J-Freeman Sep 10, 2024
fa124ff
No changes, commit to run CodeCov
Peter-J-Freeman Sep 10, 2024
52e733b
Don't use normalising mapper in expanded_repeats
John-F-Wagstaff Sep 13, 2024
d319f9a
Completely replace the exon checking func
John-F-Wagstaff Sep 14, 2024
fe1e894
Fix a number of issues on multi-base repeats
John-F-Wagstaff Sep 14, 2024
6abacc6
Fix 0->1 based coordinate switch in error message
John-F-Wagstaff Sep 14, 2024
59021f5
Clean up expanded repeat test for fixed locations
John-F-Wagstaff Sep 15, 2024
1ad174c
Fix underlying 1>0 based issue in expanded repeats
John-F-Wagstaff Sep 15, 2024
62a67ba
Add n VS c test set for expanded repeats
John-F-Wagstaff Sep 15, 2024
e017b63
Add genomic tests expanded repeat tests
John-F-Wagstaff Sep 15, 2024
20a8604
Add RefSeqGenomic expanded repeat tests
John-F-Wagstaff Sep 15, 2024
6a7bca3
Add LRG type tests for expanded repeat syntax
John-F-Wagstaff Sep 15, 2024
7ec09c3
Remove now unneeded check_transcript_type function
John-F-Wagstaff Sep 15, 2024
4b32ae6
Clean up unintuitive intronic code and some names
John-F-Wagstaff Sep 19, 2024
68af21a
Remove intronic_or_utr variable, fix repeat check
John-F-Wagstaff Sep 19, 2024
2f0e466
Fix outstanding bugs
Peter-J-Freeman Sep 24, 2024
7240b75
Add handling for 3' UTR and improve c<->n mapping
John-F-Wagstaff Sep 26, 2024
b07d610
Add tests for 3' utr and over 5' end handling
John-F-Wagstaff Sep 26, 2024
5ca5926
Remove now unused function for variant splitting
John-F-Wagstaff Sep 26, 2024
9410c37
Clean up input function, reduce regex usage
John-F-Wagstaff Sep 26, 2024
9b0014d
Merge pull request #642 from openvar/expanded_repeat_syntax
Peter-J-Freeman Sep 27, 2024
4f361e5
Merge pull request #649 from openvar/issue_645
Peter-J-Freeman Sep 27, 2024
206402e
Merge branch 'vv_ensembl_dev_susmi' of https://github.com/openvar/var…
Peter-J-Freeman Sep 27, 2024
a0eb76a
update dockerfiles to latest vvta and sr
Peter-J-Freeman Nov 6, 2024
6340024
code changes that refer to issue https://github.com/openvar/variantVa…
Peter-J-Freeman Dec 2, 2024
4e2e82e
Fixes that overcome NR transcripts with LOC based gene symbols in gen…
Peter-J-Freeman Dec 2, 2024
dbfb7b1
Fixes the mapping of NC_000009.12:g.92474742delinsATCA back to NM_017…
Peter-J-Freeman Dec 3, 2024
033d948
Add tweaks to genes2transcripts to handle gene symbols that are updat…
Peter-J-Freeman Dec 4, 2024
45f5d07
Update the code to accept intronic variants in transcripts with alter…
Peter-J-Freeman Dec 5, 2024
2570119
code that deals with protein references with nucleotide variant types
Peter-J-Freeman Dec 5, 2024
cf2cfe1
Update position added for Ter=
Peter-J-Freeman Dec 5, 2024
36d4237
Update vdb version and test issue 87
Peter-J-Freeman Dec 11, 2024
b3692f8
Changes to the code base to correct some unhandled descriptions e.g. …
Peter-J-Freeman Jan 9, 2025
cc05e3c
Expand code to handle expanded repeat syntax in allele descriptions
Peter-J-Freeman Jan 14, 2025
9a7e4b2
add in code to deal with common HGVS early stage typos like double co…
Peter-J-Freeman Jan 16, 2025
95d6591
Additional code changes to handle failed variants reported by the LOV…
Peter-J-Freeman Jan 17, 2025
64b5132
Final commit before merge with parse bypass code
Peter-J-Freeman Jan 17, 2025
b3cd378
Reduce SeqRepo calls on hgvs to VCF mapping
John-F-Wagstaff Oct 30, 2024
d274486
Attempt to reduce SeqRepo calls when shifting
John-F-Wagstaff Oct 31, 2024
ed61716
Don't re-build existing hgvs objects in mappers
John-F-Wagstaff Nov 5, 2024
a80c386
Add a 'All' valid genomes flag to report_hgvs2vcf
John-F-Wagstaff Nov 7, 2024
446074c
Exploit new report_hgvs2vcf flag for output
John-F-Wagstaff Nov 7, 2024
5ef024c
Reduce unnecessary vcf re-generation from liftover
John-F-Wagstaff Nov 8, 2024
ca23057
Use existing genomic vcf data in liftover
John-F-Wagstaff Nov 8, 2024
c6d7a8f
Add validation skip option, use in map fallback
John-F-Wagstaff Nov 8, 2024
5da3492
Avoid parsing of hgvs from text hgvs_utils 1
John-F-Wagstaff Nov 11, 2024
3e9eaee
Avoid parsing of hgvs from text hgvs_utils 2
John-F-Wagstaff Nov 11, 2024
0f83cd6
Add hgvs object span handling to parts>delins func
John-F-Wagstaff Nov 13, 2024
70a2476
Add object based equivalent to hgvs_dup2indel
John-F-Wagstaff Nov 13, 2024
cfd3c22
Upgrade delins creation in gapped_mapping.py
John-F-Wagstaff Nov 13, 2024
b48b22b
Add a function to build new variants from existing
John-F-Wagstaff Nov 14, 2024
b0f9abe
Improve hgvs obj handling for complex coordinates
John-F-Wagstaff Nov 18, 2024
cc06e5d
Don't re-parse hgvs obj from strings MixinConv 1
John-F-Wagstaff Nov 14, 2024
e140dfe
Don't re-parse hgvs obj from strings MixinConv 2
John-F-Wagstaff Nov 14, 2024
00ac401
Don't re-parse hgvs obj from strings MixinConv 3
John-F-Wagstaff Nov 14, 2024
f602f39
Add extra test for chr_to_rsg
John-F-Wagstaff Mar 2, 2025
8309a01
Allow obj for validateHGVS to avoid re-parse
John-F-Wagstaff Nov 14, 2024
5e536bf
Allow obj for genomic mapper to avoid re-parse
John-F-Wagstaff Nov 14, 2024
440b2c9
Return obj for chr_to_rsg to avoid re-parse
John-F-Wagstaff Nov 14, 2024
4d47b3f
More re-parsing reductions for mappers.py
John-F-Wagstaff Nov 14, 2024
c8cf78b
Reduce trivial re-parsing in vvMixinCore
John-F-Wagstaff Nov 15, 2024
45be50c
Prevent re-parsing of relevant_transcripts output
John-F-Wagstaff Nov 15, 2024
bd7ac94
Reduce re-parsing in vvMixinConverters.py
John-F-Wagstaff Nov 18, 2024
f921881
Do vcf style multi-alt variants without re-parsing
John-F-Wagstaff Nov 18, 2024
554a763
Further reduce re-parsing in format_converters.py
John-F-Wagstaff Nov 18, 2024
9749a3a
Remove last non-input re-parse in liftover
John-F-Wagstaff Nov 18, 2024
400badb
Improve ref-stripping from hgvs obj
John-F-Wagstaff Nov 19, 2024
6559f60
Variant.genomic_g str->hgvs obj, reduce re-parsing
John-F-Wagstaff Nov 19, 2024
490575b
Don't search for gene symbol on chr/transcript
John-F-Wagstaff Nov 19, 2024
7453e27
Remove unneeded re-parse from expanded_repeats
John-F-Wagstaff Nov 20, 2024
5804d54
Reduce re-parsing in gapped_mapping.py
John-F-Wagstaff Nov 22, 2024
d759d11
Remove final hgvs re-parsing from hgvs_utils.py
John-F-Wagstaff Nov 22, 2024
b7977bf
Don't unset ref for = type variants
John-F-Wagstaff Nov 25, 2024
36797e1
Convert hgvs transcript variation to hgvs object
John-F-Wagstaff Nov 25, 2024
8383088
Test improved alt ref != selected genomic ref code
John-F-Wagstaff Feb 12, 2025
3b3e3cf
Remove last hgvs obj re-parses from vvMixinInit.py
John-F-Wagstaff Nov 25, 2024
4d41452
Move hgvs obj conversion up a step in vvMixinCore
John-F-Wagstaff Nov 26, 2024
d41d1f7
Move abort for con type variants before parsing
John-F-Wagstaff Dec 13, 2024
fc3e7f8
Return early and reduce indent on non c prot map
John-F-Wagstaff Nov 26, 2024
26de224
Exploit early return to reduce indent in prot map
John-F-Wagstaff Nov 26, 2024
36c8476
Further use of returns to reduce indent in protmap
John-F-Wagstaff Nov 26, 2024
401e0fb
Add VVPosEdit for output formatting tweaks
John-F-Wagstaff Dec 5, 2024
d38b52b
Add helper functions for hgvs obj protein handling
John-F-Wagstaff Dec 5, 2024
81551f9
Switch protein to use hgvs obj to avoid re-parsing
John-F-Wagstaff Dec 5, 2024
5e3512b
Add test for RNA indel -> prot del case
John-F-Wagstaff Mar 2, 2025
3c1ef7b
Move expanded repeat formatting before obj convert
John-F-Wagstaff Dec 12, 2024
d79c2ed
Move input formatting before object creation point
John-F-Wagstaff Dec 6, 2024
006c44c
Move methyl syntax suffix handling before obj conv
John-F-Wagstaff Dec 13, 2024
ecb52c2
Update/improve tests for variant format_quibble
John-F-Wagstaff Feb 11, 2025
1a1b8e3
Reduce hgvs obj re-parsing in complex_descriptions
John-F-Wagstaff Nov 18, 2024
a9e431c
Slight output formatting improvement HGVS AA
John-F-Wagstaff Dec 23, 2024
a5d885b
Improve formatting hgvs output on prot/NA
John-F-Wagstaff Jan 8, 2025
b22dfdc
Improve handling of bad mappings in gapped_mapping
John-F-Wagstaff Jan 8, 2025
731878f
Fix handling of ref type/source with obj not str
John-F-Wagstaff Jan 8, 2025
bec39cf
Make refseq mistake finder work for txt *and* obj
John-F-Wagstaff Jan 8, 2025
8572b28
Harden initial convert to obj presence + parser
John-F-Wagstaff Jan 9, 2025
08e52c4
Handle t->c shift without declaring as a re-map
John-F-Wagstaff Jan 10, 2025
7a1326a
Harden against n variant issues in mapping
John-F-Wagstaff Jan 13, 2025
b797255
Add limited allele parser improvements
John-F-Wagstaff Jan 10, 2025
a491a81
Fix use_checking to handle object quibble as well
John-F-Wagstaff Jan 10, 2025
db70e6b
Reduce unneeded exon fetch and re-start in mappers
John-F-Wagstaff Jan 13, 2025
85e9474
Switch to early (singular) hgvs str->object parse
John-F-Wagstaff Jan 10, 2025
1e5d669
Add & use handling for checked variants resubmitted
John-F-Wagstaff Jan 10, 2025
1188b2a
Add methylation to VVPosEdit + PosEdit->VVPosEdit
John-F-Wagstaff Jan 20, 2025
b026ea2
Switch from text to VVPosEdit for methylation out
John-F-Wagstaff Jan 20, 2025
ed9ed24
Merge pull request #663 from openvar/JFW_reparse_reduction
Peter-J-Freeman Mar 4, 2025
40e9f66
update tests where gene symbol HIF1A-AS3 changed to HIF1A-AS1
Peter-J-Freeman Mar 4, 2025
6 changes: 6 additions & 0 deletions .gitignore
@@ -12,6 +12,11 @@ VariantValidator/testing/outputs*

 # backedup files after 2to3 conversion
 *.bak
+.eggs
+Users
+*.sql
+*.pyc
+temp.py

 #ignore databases
 seqrepo/*
@@ -20,3 +25,4 @@ tests/*/*.pyc
 /VariantValidator.egg-info/*
 validator_2021-07-21.sql
 VVTA_2021_2_noseq.psql.gz
+
421 changes: 371 additions & 50 deletions VariantValidator/modules/complex_descriptions.py

Large diffs are not rendered by default.

803 changes: 497 additions & 306 deletions VariantValidator/modules/expanded_repeats.py

Large diffs are not rendered by default.

657 changes: 422 additions & 235 deletions VariantValidator/modules/format_converters.py

Large diffs are not rendered by default.

602 changes: 355 additions & 247 deletions VariantValidator/modules/gapped_mapping.py

Large diffs are not rendered by default.

54 changes: 41 additions & 13 deletions VariantValidator/modules/gene2transcripts.py
@@ -6,6 +6,7 @@
 from . import utils as fn
 from . import seq_data

+
 # Pre compile variables
 vvhgvs.global_config.formatting.max_ref_length = 1000000

Expand Down Expand Up @@ -39,18 +40,29 @@ def gene2transcripts(g2t, query, validator=False, bypass_web_searches=False, sel
# Remove whitespace
query = ''.join(query.split())

try:
query = int(query)
except ValueError:
pass
if isinstance(query, int):
query = f"HGNC:{str(query)}"

# Search by gene IDs
if "HGNC:" in query:
store_query = query
query = query.upper()
query = g2t.db.get_stable_gene_id_from_hgnc_id(query)[1]
if query == "No data":
try:
query = validator.db.get_transcripts_from_annotations(store_query)
for tx in query:
if tx[5] != "unassigned":
query = tx[5]
break
query = g2t.db.get_transcripts_from_annotations(store_query)
if "none" not in query[0]:
for tx in query:
if tx[5] != "unassigned":
query = tx[5]
break
else:
return {'error': 'Unable to recognise HGNC ID. Please provide a gene symbol',
"requested_symbol": store_query}
except TypeError:
pass
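
For reviewers: the integer handling added in this hunk can be exercised in isolation roughly as below. This is a minimal sketch, not code from the PR. The helper name `normalise_gene_query` is hypothetical, and the `str()` call is an assumption added so that a real integer argument survives the whitespace step; the rest mirrors the lines above.

```python
def normalise_gene_query(query):
    """Hypothetical helper mirroring the hunk: numeric queries become HGNC IDs."""
    # Remove whitespace (str() is an assumption so that integer input also works)
    query = ''.join(str(query).split())
    try:
        query = int(query)
    except ValueError:
        pass  # not numeric, so leave it as a gene symbol or an existing ID
    if isinstance(query, int):
        query = f"HGNC:{query}"
    return query


print(normalise_gene_query(" 1100 "))     # HGNC:1100
print(normalise_gene_query("BRCA1"))      # BRCA1
print(normalise_gene_query("HGNC:1100"))  # HGNC:1100
```

Anything that does not parse as an integer passes through unchanged, so the later `"HGNC:" in query` branch still sees gene symbols and prefixed IDs as before.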

Expand Down Expand Up @@ -86,12 +98,12 @@ def gene2transcripts(g2t, query, validator=False, bypass_web_searches=False, sel
validator.alt_aln_method])

# Add refseqgene if available
if "NG_" in query.hgvs_refseqgene_variant:
if query.hgvs_refseqgene_variant and 'NG_' in query.hgvs_refseqgene_variant.ac:
tx_for_gene.append([query.gene_symbol,
tx_info[3],
0,
query.hgvs_coding.ac,
query.hgvs_refseqgene_variant.split(":")[0],
query.hgvs_refseqgene_variant.ac,
validator.alt_aln_method])

else:
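
The guard changed in this hunk appears to follow from `hgvs_refseqgene_variant` now holding an object (or an empty value) rather than a string, in line with the re-parse reduction commits. A small sketch of that pattern follows. `FakeVariant` and `refseqgene_accession` are hypothetical stand-ins, not VariantValidator or vvhgvs names, and only the `ac` attribute is modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeVariant:
    """Hypothetical stand-in for a parsed variant object; only `ac` is modelled."""
    ac: str


def refseqgene_accession(variant: Optional[FakeVariant]) -> Optional[str]:
    """Return the accession only when a RefSeqGene (NG_) variant is present."""
    # The truthiness check comes first so that a missing variant never reaches `.ac`
    if variant and 'NG_' in variant.ac:
        return variant.ac
    return None


print(refseqgene_accession(FakeVariant("NG_007400.1")))   # NG_007400.1
print(refseqgene_accession(FakeVariant("NC_000017.11")))  # None
print(refseqgene_accession(None))                         # None
```

Reading `.ac` directly also avoids splitting the string form on `":"` just to recover the accession.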
@@ -111,7 +123,6 @@ def gene2transcripts(g2t, query, validator=False, bypass_web_searches=False, sel
 g2t.hdp.get_tx_identity_info(refresh_hgnc)
 tx_found = refresh_hgnc
 found_res = True
-break
 except vvhgvs.exceptions.HGVSError as e:
 logger.debug("Except passed, %s", e)
 if not found_res:
@@ -132,8 +143,14 @@ def gene2transcripts(g2t, query, validator=False, bypass_web_searches=False, sel
 except vvhgvs.exceptions.HGVSError as e:
 return {'error': str(e),
 "requested_symbol": query}
+
 hgnc = tx_info[6]
-hgnc = g2t.db.get_hgnc_symbol(hgnc)
+hgnc2 = g2t.db.get_hgnc_symbol(hgnc)
+
+if re.match("LOC", hgnc2) and not re.match("LOC", hgnc):
+hgnc = hgnc
+else:
+hgnc = hgnc2

 # First perform a search against the input gene symbol or the symbol inferred from UTA
 symbol_identified = False
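
The symbol choice in this hunk reduces to the rule sketched below: keep the transcript's own symbol when the database lookup would replace it with a `LOC`-prefixed one. The function name and the example symbols are hypothetical. `str.startswith` is used in place of `re.match`, which behaves the same for a literal prefix.

```python
def choose_gene_symbol(tx_symbol, db_symbol):
    """Prefer the transcript symbol over a LOC-prefixed database symbol."""
    if db_symbol.startswith("LOC") and not tx_symbol.startswith("LOC"):
        return tx_symbol
    return db_symbol


# Illustrative symbols only; they are not taken from the PR's test data
print(choose_gene_symbol("GENE1", "LOC12345"))     # GENE1
print(choose_gene_symbol("GENE1", "GENE1A"))       # GENE1A
print(choose_gene_symbol("LOC12345", "LOC67890"))  # LOC67890
```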
@@ -158,10 +175,21 @@ def gene2transcripts(g2t, query, validator=False, bypass_web_searches=False, sel
 hgnc_id = vvta_record[0][0]
 previous_sym = hgnc
 symbol_identified = True
-if len(vvta_record) > 1:
+elif len(vvta_record) > 1:
 return {'error': '%s is a previous symbol for %s genes. '
-'Refer to https://www.genenames.org/' % (current_sym, str(len(vvta_record))),
+'Refer to https://www.genenames.org/' % (hgnc, str(len(vvta_record))),
 "requested_symbol": query}
+else:
+# Is it an updated symbol?
+old_symbol = g2t.db.get_uta_symbol(hgnc)
+if old_symbol is not None:
+vvta_record = g2t.hdp.get_gene_info(old_symbol)
+if vvta_record is not None:
+current_sym = hgnc
+gene_name = vvta_record[3]
+hgnc_id = vvta_record[0]
+previous_sym = old_symbol
+symbol_identified = True

 if symbol_identified is False:
 return {'error': 'Unable to recognise gene symbol %s' % hgnc,
Expand Down Expand Up @@ -298,13 +326,13 @@ def gene2transcripts(g2t, query, validator=False, bypass_web_searches=False, sel
# reverse the exon_set to maintain gene and not genome orientation if gene is -1 orientated
if tx_orientation == -1:
exon_set.reverse()

-if ('NG_' in line[4] or 'NC_000' in line[4]) and line[5] != 'blat':
+if ('NG_' in line[4] or 'NC_0' in line[4]) and line[5] != 'blat':
gen_span = True
else:
gen_span = False

tx_description = g2t.db.get_transcript_description(tx)

if tx_description == 'none':
try:
g2t.db.update_transcript_info_record(tx, g2t)
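
The prefix change from `'NC_000'` to `'NC_0'` in this hunk looks related to the mitochondrial sets mentioned in the commit list: the human mitochondrial reference NC_012920.1 does not contain `NC_000`. Below is a sketch of the updated predicate. The function name is hypothetical and the accession and alignment-method arguments are illustrative.

```python
def is_genomic_span(alt_ac, aln_method):
    """Mirror of the updated condition: a genomic accession aligned by anything but blat."""
    return ('NG_' in alt_ac or 'NC_0' in alt_ac) and aln_method != 'blat'


print(is_genomic_span("NC_000017.11", "splign"))  # True
print(is_genomic_span("NC_012920.1", "splign"))   # True; an 'NC_000' test would miss this
print(is_genomic_span("NC_000017.11", "blat"))    # False
print(is_genomic_span("NM_000088.3", "splign"))   # False
```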