HELP-357 Ensembl ID - Protein mapping issue with GlyGen data #1

jeremywalter · 2022-11-21T21:45:00Z

Data model: When we add a protein to a collection (collection_protein.tsv), the corresponding gene for that protein is added to the collection at the same time (collection_gene.tsv).

Submission issue: For several hundred protein/collection associations, the submission tool complained about invalid gene (Ensembl ID) information.

Problem: GlyGen releases new data every three months, including data integrated from UniProt (which contains the UniProt accession to Ensembl ID mapping). If UniProt/Ensembl is updated in between releases and the submission tooling (submission prep tool) is updated with the latest data, the Ensembl IDs may change (some are removed, some are added), which results in the submission issue described above.
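
To make the drift concrete, here is a minimal sketch of comparing the Ensembl IDs in two mapping snapshots. The function name and the placeholder IDs are hypothetical and not taken from GlyGen or the submission tooling; it only assumes that each snapshot can be reduced to a collection of Ensembl gene ID strings.

```python
def diff_ensembl_ids(old_ids, new_ids):
    """Return (removed, added) Ensembl gene IDs between two snapshots.

    old_ids / new_ids: any iterables of ID strings (hypothetical inputs,
    e.g. the gene column of two different mapping releases).
    """
    old_set, new_set = set(old_ids), set(new_ids)
    removed = old_set - new_set  # present in the old snapshot only
    added = new_set - old_set    # present in the new snapshot only
    return removed, added


# Placeholder IDs for illustration only, not real Ensembl accessions.
removed, added = diff_ensembl_ids(["GENE_A", "GENE_B"], ["GENE_B", "GENE_C"])
print(sorted(removed), sorted(added))
```

Any ID that ends up in `removed` is one the submission tool would presumably reject if it validates against the newer snapshot.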

Short term solution: For the recent release we removed all gene information from the submission (collections are only associated with proteins but not genes).

Long term solution: I am looking for a strategy to ensure that the UniProt accession -> Ensembl mapping we are using is in sync with the mapping used by the submission and submission prep tools. I was thinking about using the files in https://osf.io/5sxvt/ for this purpose, but “ensembl_genes.tsv” does not seem to contain the UniProt accession, and protein.tsv.gz does not contain Ensembl IDs.

Is there a solution or alternative strategy?

jeet-vora · 2022-12-06T21:36:33Z

#3 HELP-406 and this are related.

jonathancrabtree · 2023-01-26T20:42:53Z

I would propose that, instead of removing all gene information from the submission (the "Short term solution" mentioned above), next time we remove only the gene IDs that are not in the current ensembl_genes.tsv file, which can be downloaded from osf.io. Having a few hundred missing is vastly better than having them all missing, and it may be that the discrepancy between UniProt versions will be smaller or nonexistent next time around. To me it looks like the ensembl_genes.tsv file only includes human Ensembl genes, so we're talking about a relatively small and (hopefully) stable set of only about 24K genes. Even if we were to use the same protein-to-gene mapping, we'd still have the problem that GlyGen might be using a slightly different UniProt version.
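
A rough sketch of that filtering step, in case it helps. The column names (`id` in ensembl_genes.tsv, `gene` in collection_gene.tsv) are assumptions, not checked against the real file headers, and both function names are made up for illustration; adjust them to whatever the actual files use.

```python
import csv


def load_valid_gene_ids(genes_tsv, id_column="id"):
    """Collect the gene IDs listed in an ensembl_genes.tsv-style file.

    genes_tsv: an open text file (or any iterable of lines) with a header row.
    id_column: assumed name of the ID column; change to match the real header.
    """
    reader = csv.DictReader(genes_tsv, delimiter="\t")
    return {row[id_column] for row in reader}


def filter_collection_genes(collection_tsv, valid_ids, gene_column="gene"):
    """Split collection_gene.tsv-style rows into (kept, dropped).

    A row is kept only if its gene ID appears in valid_ids; dropped rows
    are returned too, so they can be logged rather than silently lost.
    """
    reader = csv.DictReader(collection_tsv, delimiter="\t")
    kept, dropped = [], []
    for row in reader:
        if row[gene_column] in valid_ids:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped
```

Typical use would be to open the downloaded ensembl_genes.tsv, build `valid_ids` once, run every collection_gene.tsv through `filter_collection_genes`, and write only the `kept` rows into the submission.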

jeremywalter assigned jonathancrabtree and ReneRanzinger Nov 21, 2022

jonathancrabtree mentioned this issue Nov 22, 2022

HELP-217 Missing Ensembl IDs #7

Closed
