Create system to validate node IDs, labels, and names #480

Open
andrewsu opened this issue Feb 11, 2022 · 0 comments

andrewsu commented Feb 11, 2022

We currently have no validation system to confirm that each node ID in a record is paired with the correct name and label. Our curators are detail-oriented and careful, but without automated checks there is still a risk that data errors are introduced. We should create a simple script that queries each ID against one or more authoritative resources (e.g., mygene.info/mychem.info/mydisease.info, OLS, UMLS, NodeNormalizer). More details below.

The first record in the indication_paths.yaml file is here:

-   directed: true
    graph:
        disease: CML (ph+)
        disease_mesh: MESH:D015464
        drug: imatinib
        drug_mesh: MESH:D000068877
        drugbank: DB:DB00619
    links:
    -   key: decreases activity of
        source: MESH:D000068877
        target: UniProt:P00519
    -   key: causes
        source: UniProt:P00519
        target: MESH:D015464
    multigraph: true
    nodes:
    -   id: MESH:D000068877
        label: Drug
        name: imatinib
    -   id: UniProt:P00519
        label: Protein
        name: BCR/ABL
    -   id: MESH:D015464
        label: Disease
        name: CML (ph+)

There are three IDs under nodes: MESH:D000068877, UniProt:P00519, and MESH:D015464. If we look up the first ID in the MeSH API (https://id.nlm.nih.gov/mesh/lookup/details?descriptor=D000068877), we see that the preferred name for MESH:D000068877 is actually Imatinib Mesylate, not imatinib. Likewise, the preferred name for MESH:D015464 is Leukemia, Myelogenous, Chronic, BCR-ABL Positive, not CML (ph+). The script described here would output a version of the input YAML with each node name replaced by the "preferred name" from the MeSH API.
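
As a rough illustration of the name-replacement step, here is a minimal Python sketch. It is not an existing DMDB script. The function names `fetch_mesh_preferred_name` and `replace_mesh_names` are hypothetical, and the assumed shape of the MeSH API response (a "terms" list whose entries carry "label" and "preferred" fields) should be checked against a real response before relying on it. The lookup is passed in as a parameter so the replacement logic can be tried offline with a plain dictionary.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

# Endpoint taken from the lookup link in the issue text.
MESH_DETAILS_URL = "https://id.nlm.nih.gov/mesh/lookup/details"


def fetch_mesh_preferred_name(descriptor: str) -> Optional[str]:
    """Query the MeSH API for one descriptor such as 'D000068877'.

    ASSUMPTION: the JSON response contains a "terms" list whose entries
    have "label" and "preferred" fields. Verify before use.
    """
    query = urllib.parse.urlencode({"descriptor": descriptor})
    with urllib.request.urlopen(f"{MESH_DETAILS_URL}?{query}", timeout=30) as resp:
        payload = json.load(resp)
    for term in payload.get("terms", []):
        if term.get("preferred"):
            return term.get("label")
    return None


def replace_mesh_names(record: dict, lookup: Callable[[str], Optional[str]]) -> list:
    """Overwrite MESH node names in place with the preferred name.

    Returns a list of (node_id, old_name, new_name) tuples so a curator
    can review what was changed. Non-MESH nodes are left untouched.
    """
    changes = []
    for node in record.get("nodes", []):
        prefix, _, local_id = node["id"].partition(":")
        if prefix != "MESH":
            continue
        preferred = lookup(local_id)
        if preferred is not None and preferred != node["name"]:
            changes.append((node["id"], node["name"], preferred))
            node["name"] = preferred
    return changes


if __name__ == "__main__":
    # Offline demo: a dict stands in for the API, using the two preferred
    # names quoted in the issue text.
    known = {
        "D000068877": "Imatinib Mesylate",
        "D015464": "Leukemia, Myelogenous, Chronic, BCR-ABL Positive",
    }
    record = {
        "nodes": [
            {"id": "MESH:D000068877", "label": "Drug", "name": "imatinib"},
            {"id": "UniProt:P00519", "label": "Protein", "name": "BCR/ABL"},
            {"id": "MESH:D015464", "label": "Disease", "name": "CML (ph+)"},
        ]
    }
    for node_id, old, new in replace_mesh_names(record, known.get):
        print(f"{node_id}: {old!r} -> {new!r}")
```

For a live run, `fetch_mesh_preferred_name` would be passed as `lookup` in place of `known.get`. Reading and writing the YAML file itself is left out here to keep the sketch free of third-party dependencies.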

The most common identifiers used in DMDB are shown here (with counts):

$ cat indication_paths.yaml.4 | grep ' id:' | sed 's/.*id: //;s/:.*//' | sort | uniq -c | sort -k1nr
   8695 MESH
   6227 GO
   4359 UniProt
   1014 HP
    741 NCBITaxon
    601 InterPro
    429 CHEBI
    428 UBERON
    339 REACT
    218 DB
    198 CL
     48 Pfam
     18 PR
      7 "REACT
      3 TIGR
      1 "InterPro

So let's start with MeSH, since it is the most common identifier type. After that, we'll expand to the other identifier types.
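
The prefix tally above could also be done in Python, which makes it easy to pull out the MeSH IDs to validate first. The sketch below is only one possible approach, and the helper names `node_ids`, `prefix_counts`, and `mesh_ids` are hypothetical. It works on raw text lines, like the grep/sed pipeline, but also strips a leading quote from the ID, so entries that show up as `"REACT` and `"InterPro` in the shell output would be counted under `REACT` and `InterPro`.

```python
import re
from collections import Counter

# Matches YAML lines such as "    -   id: MESH:D015464" and also IDs
# wrapped in single or double quotes.
ID_LINE = re.compile(r"""^\s*(?:-\s+)?id:\s*["']?([^"'\s]+)""")


def node_ids(yaml_text):
    """Yield every node ID found on an 'id:' line of the YAML text."""
    for line in yaml_text.splitlines():
        match = ID_LINE.match(line)
        if match:
            yield match.group(1)


def prefix_counts(ids):
    """Count IDs by the prefix before the first colon (MESH, GO, ...)."""
    return Counter(identifier.split(":", 1)[0] for identifier in ids)


def mesh_ids(ids):
    """Return the distinct MESH IDs, sorted, as the first validation batch."""
    return sorted({i for i in ids if i.startswith("MESH:")})


if __name__ == "__main__":
    # File name taken from the shell command in the issue.
    with open("indication_paths.yaml.4", encoding="utf-8") as handle:
        ids = list(node_ids(handle.read()))
    for prefix, count in prefix_counts(ids).most_common():
        print(f"{count:7d} {prefix}")
    print(f"{len(mesh_ids(ids))} distinct MESH IDs to validate first")
```

Deduplicating the MeSH IDs before validation means each descriptor only needs to be looked up once, however many records it appears in.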
