Data model: When we add a protein to a collection (collection_protein.tsv), the corresponding gene for that protein is added to the collection at the same time (collection_gene.tsv).
Submission issue: For several hundred protein/collection associations, the submission tool complained about invalid gene (Ensembl ID) information.
Problem: GlyGen releases new data every three months, including data integrated from UniProt (which contains the UniProt accession to Ensembl ID mapping). If UniProt/Ensembl is updated in between and the submission tooling (submission prep tool) is updated with the latest data, the Ensembl IDs may change (some are removed, some are added), which results in the submission issue described above.
Short term solution: For the recent release we removed all gene information from the submission (collections are associated only with proteins, not with genes).
Long term solution: I am looking for a strategy to ensure that the UniProt accession -> Ensembl mapping we are using is in sync with the mapping used by the submission and submission prep tools. I was thinking about using the files in https://osf.io/5sxvt/ for this purpose. But “ensembl_genes.tsv” does not seem to contain the UniProt accession, and protein.tsv.gz does not contain Ensembl IDs.
Is there a solution or alternative strategy?
I would propose that, instead of removing all gene information from the submission (the "Short term solution" mentioned above), next time we remove only the gene IDs that are not in the current ensembl_genes.tsv file, which can be downloaded from osf.io. Having a few hundred missing is vastly better than having them all missing, and it may be that the discrepancy between UniProt versions will be smaller or nonexistent next time around. To me it looks like the ensembl_genes.tsv file only includes human Ensembl genes, so we're talking about a relatively small and (hopefully) stable set of only about 24K genes. Even if we were to use the same protein-to-gene mapping, we'd still have the problem that GlyGen might be using a slightly different UniProt version.
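As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes both files are tab-separated with a header row. The column names `ensembl_id` and `gene`, and both function names, are hypothetical placeholders, since I have not checked the real headers in ensembl_genes.tsv or collection_gene.tsv. The in-memory sample data is made up for the example (the last digits of `ENSG00000999999` are invented to stand in for a retired ID).

```python
import csv
import io


def load_valid_gene_ids(genes_tsv, id_column="ensembl_id"):
    """Collect the Ensembl gene IDs listed in an ensembl_genes.tsv-style file.

    `id_column` is an assumed header name; adjust it to the real file.
    """
    reader = csv.DictReader(genes_tsv, delimiter="\t")
    return {row[id_column].strip() for row in reader if row.get(id_column)}


def split_collection_genes(collection_tsv, valid_ids, gene_column="gene"):
    """Split collection_gene.tsv-style rows into (kept, dropped).

    A row is kept only if its gene ID appears in `valid_ids`.
    `gene_column` is an assumed header name; adjust it to the real file.
    """
    reader = csv.DictReader(collection_tsv, delimiter="\t")
    kept, dropped = [], []
    for row in reader:
        target = kept if row[gene_column].strip() in valid_ids else dropped
        target.append(row)
    return kept, dropped


# Tiny in-memory stand-ins for the real files (hypothetical layout).
genes_tsv = io.StringIO(
    "ensembl_id\tsymbol\n"
    "ENSG00000141510\tTP53\n"
    "ENSG00000012048\tBRCA1\n"
)
collection_tsv = io.StringIO(
    "collection\tgene\n"
    "C1\tENSG00000141510\n"
    "C1\tENSG00000999999\n"  # invented ID, not in the gene list
    "C2\tENSG00000012048\n"
)

valid_ids = load_valid_gene_ids(genes_tsv)
kept, dropped = split_collection_genes(collection_tsv, valid_ids)
print(len(kept), "rows kept;", len(dropped), "rows dropped")
```

Keeping the dropped rows in a separate list, rather than discarding them silently, would make it easy to report how many genes were lost in each release and to see whether the discrepancy shrinks over time.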