HELP-357 Ensembl ID - Protein mapping issue with GlyGen data #1

Open
jeremywalter opened this issue Nov 21, 2022 · 2 comments

@jeremywalter

Data model: When we add a protein to a collection (collection_protein.tsv), the corresponding gene for that protein is added to the collection at the same time (collection_gene.tsv).

Submission issue: For several hundred protein/collection pairs, the submission tool complained about invalid gene (Ensembl ID) information.

Problem: GlyGen releases new data every three months, including data integrated from UniProt (which contains the UniProt accession to Ensembl ID mapping). If UniProt/Ensembl is updated in between and the submission tooling (submission prep tool) is updated with the latest data, the Ensembl IDs may change (some are removed, some are added), which results in the submission issue described above.
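
To make the drift concrete, here is a minimal sketch of how two snapshots of the UniProt accession to Ensembl ID mapping could be compared. The function name `diff_mappings`, the dictionary layout, and all accessions and gene IDs below are made up for illustration; they are not taken from GlyGen, UniProt, or the submission tool.

```python
def diff_mappings(old, new):
    """Compare two {uniprot_accession: set_of_ensembl_ids} snapshots.

    Returns the (accession, ensembl_id) pairs that disappeared and the
    pairs that appeared between the old and the new snapshot.
    """
    old_pairs = {(acc, gene) for acc, genes in old.items() for gene in genes}
    new_pairs = {(acc, gene) for acc, genes in new.items() for gene in genes}
    return {
        "removed": sorted(old_pairs - new_pairs),
        "added": sorted(new_pairs - old_pairs),
    }


# Hypothetical snapshots: one taken at GlyGen release time, one at submission time.
mapping_at_release = {
    "P00001": {"ENSG00000000001", "ENSG00000000002"},
    "P00002": {"ENSG00000000003"},
}
mapping_at_submission = {
    "P00001": {"ENSG00000000001"},
    "P00002": {"ENSG00000000003", "ENSG00000000004"},
}

drift = diff_mappings(mapping_at_release, mapping_at_submission)
```

Pairs listed under `removed` are the ones that would be reported as invalid if the older mapping were submitted against the newer data.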

Short term solution: For the recent release we removed all gene information from the submission (collections are only associated with proteins but not genes).

Long term solution: I am looking for a strategy to ensure that the UniProt accession -> Ensembl mapping we use is in sync with the mapping used by the submission and submission prep tools. I was thinking about using the files in https://osf.io/5sxvt/ for this purpose, but “ensembl_genes.tsv” does not seem to contain UniProt accessions, and protein.tsv.gz does not contain Ensembl IDs.

Is there a solution or alternative strategy?

@jeet-vora

This issue and #3 (HELP-406) are related.

@jonathancrabtree

I would propose that instead of removing all gene information from the submission (the "Short term solution" mentioned above), next time we remove only the gene IDs that are not in the current ensembl_genes.tsv file, which can be downloaded from osf.io. Having a few hundred genes missing is vastly better than having them all missing, and it may be that the discrepancy between UniProt versions will be smaller or nonexistent next time around. To me it looks like the ensembl_genes.tsv file only includes human Ensembl genes, so we're talking about a relatively small and (hopefully) stable set of only about 24K genes. Even if we were to use the same protein-to-gene mapping, we'd still have the problem that GlyGen might be using a slightly different UniProt version.
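
A minimal sketch of that filtering step, assuming both files are tab-separated with a header row. The column names `id` (in ensembl_genes.tsv) and `gene` (in collection_gene.tsv) are assumptions and should be checked against the actual file headers; the rows below are made-up examples held in memory rather than real data.

```python
import csv
import io


def load_valid_gene_ids(ensembl_genes_file, id_column="id"):
    """Collect the set of gene IDs listed in an ensembl_genes.tsv-style file."""
    reader = csv.DictReader(ensembl_genes_file, delimiter="\t")
    return {row[id_column] for row in reader}


def filter_collection_genes(collection_gene_file, valid_ids, gene_column="gene"):
    """Split collection_gene.tsv-style rows into kept and dropped lists.

    A row is kept only if its gene ID appears in valid_ids.
    """
    reader = csv.DictReader(collection_gene_file, delimiter="\t")
    kept, dropped = [], []
    for row in reader:
        if row[gene_column] in valid_ids:
            kept.append(row)
        else:
            dropped.append(row)
    return kept, dropped


# Made-up stand-ins for the two files (column names are assumptions).
ensembl_genes = io.StringIO(
    "id\tname\n"
    "ENSG00000000001\tGENE_A\n"
    "ENSG00000000002\tGENE_B\n"
)
collection_gene = io.StringIO(
    "collection_local_id\tgene\n"
    "COL1\tENSG00000000001\n"
    "COL1\tENSG00000000999\n"
)

valid_ids = load_valid_gene_ids(ensembl_genes)
kept, dropped = filter_collection_genes(collection_gene, valid_ids)
```

For real files, the `io.StringIO` objects would be replaced with files opened via `open(path, newline="")`. Keeping the `dropped` rows makes it easy to report how many gene associations were lost in a given release.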


4 participants