Design data dictionary output format #199

Closed
4 tasks done
surchs opened this issue Oct 18, 2022 · 4 comments
Labels: flag:discuss (Flag issue that needs to be discussed before it can be implemented.)

Comments

Contributor

surchs commented Oct 18, 2022

We need a data dictionary format that we can share with others and that our own tools can handle. It should be BIDS-compatible and must carry enough information to transform a raw participants.tsv into a harmonized representation.

To do this, it needs:

  • links to controlled terms (IRIs)
  • a list of "missing" values, i.e. entries that are typos or otherwise irrelevant (BIDS dictionaries usually omit these)
  • a way to indicate that a column belongs to an assessment tool

Here is a first example of the data dictionary format based on a BIDS data dictionary:

# Data dictionary schema draft
categorical:
  Description: "human readable description"
  Levels:
    level1: "human readable description"
    level2: "human readable description"
  Annotation:
    TermURL: "URI of controlled term"
    Label: "human readable label of term"
    Levels:
      level1: # The terms from above in Levels
        TermURL: "URI to a controlled vocabulary for this term"
        Label: "human readable label for the above term"
      level2:
        TermURL: "URI to a controlled vocabulary for this term"
        Label: "human readable label for the above term"
    MissingValues:
      - " "
      - ""

continuous:
  Description: "human readable description"
  Units: "unit name, possibly just human readable, but could probably be an xsd string"
  Annotation:
    IsAbout:
      TermURL: "URI of controlled term"
      Label: "human readable label of term"
    Unit:
      Identifier: "xsd:integer"
    Transformation:
      Identifier: "unique name of a well defined transformation"
      Description: "human readable description of the transformation"
    MissingValues:
      - ""
      - "99"

# Subscales that belong to a tool
subscale1:
  Description: "human readable description"
  IsPartOf:
    TermURL: "URI of the name of the tool"
    Label: "human readable label of term"
  MissingValues:
    - ""
    - "not present"


subscale2:
  Description: "human readable description"
  IsPartOf:
    TermURL: "URI of the name of the tool"
    Label: "human readable label of term"
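
To make the intent of the draft concrete, here is a minimal Python sketch of how a tool might use a categorical entry to harmonize raw values: anything listed in MissingValues becomes None, and every other value is replaced by the TermURL of its annotated level. The column content, the level names, the example.org IRIs, and the function name `harmonize_categorical` are all placeholders for illustration, not part of the draft.

```python
from typing import Optional

# Hypothetical entry that follows the "categorical" draft above.
# Level names and example.org IRIs are placeholders, not real controlled terms.
SEX_ENTRY = {
    "Description": "sex of the participant",
    "Levels": {"M": "male", "F": "female"},
    "Annotation": {
        "TermURL": "https://example.org/terms/sex",
        "Label": "sex",
        "Levels": {
            "M": {"TermURL": "https://example.org/terms/male", "Label": "male"},
            "F": {"TermURL": "https://example.org/terms/female", "Label": "female"},
        },
        "MissingValues": [" ", ""],
    },
}


def harmonize_categorical(raw_value: str, entry: dict) -> Optional[str]:
    """Return the TermURL for a raw value, or None if it is listed as missing."""
    annotation = entry["Annotation"]
    if raw_value in annotation.get("MissingValues", []):
        return None
    try:
        return annotation["Levels"][raw_value]["TermURL"]
    except KeyError:
        # Fail loudly on values the dictionary does not account for.
        raise ValueError(
            f"{raw_value!r} is neither an annotated level nor a missing value"
        )


# One raw column as it might appear in participants.tsv.
raw_column = ["M", "F", " ", "M", ""]
harmonized = [harmonize_categorical(value, SEX_ENTRY) for value in raw_column]
```

Raising on unknown values is one possible choice; a tool could instead collect them and ask the annotator whether they belong in MissingValues.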

Steps to complete:

  • document a draft of this in our docs
  • implement the schema in pydantic
  • create a JSON Schema to validate against
  • put this draft somewhere we can get feedback on it
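
As a rough idea of what the validation steps could enforce, here is a dependency-free sketch in plain Python. It is not pydantic and not JSON Schema, only a stand-in that shows the kind of structural rule they might encode. The function name `find_problems`, the chosen rules, and the example.org IRIs are assumptions for illustration.

```python
REQUIRED_TERM_KEYS = {"TermURL", "Label"}


def find_problems(dictionary: dict) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    for column, entry in dictionary.items():
        if not isinstance(entry.get("Description"), str):
            problems.append(f"{column}: Description must be a string")
        annotation = entry.get("Annotation", {})
        # Every annotated level needs both TermURL and Label, spelled exactly.
        for level, term in annotation.get("Levels", {}).items():
            missing = REQUIRED_TERM_KEYS - set(term)
            if missing:
                problems.append(
                    f"{column}.Annotation.Levels.{level}: missing {sorted(missing)}"
                )
        # MissingValues are compared against raw TSV cells, so they must be strings.
        for value in annotation.get("MissingValues", []):
            if not isinstance(value, str):
                problems.append(
                    f"{column}.Annotation.MissingValues: {value!r} is not a string"
                )
    return problems


# Hypothetical dictionary with two deliberate mistakes.
draft = {
    "categorical": {
        "Description": "human readable description",
        "Levels": {"level1": "first level", "level2": "second level"},
        "Annotation": {
            "Levels": {
                "level1": {"TermURL": "https://example.org/a", "Label": "a"},
                # Lowercase keys: a casing slip the check should catch.
                "level2": {"termURL": "https://example.org/b", "label": "b"},
            },
            "MissingValues": ["", 99],
        },
    }
}

problems = find_problems(draft)
for problem in problems:
    print(problem)
```

Collecting all problems instead of stopping at the first one is a design choice that suits annotators fixing a dictionary by hand; a pydantic model would report errors in its own format.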

Linked to #6

Contributor Author

surchs commented Oct 18, 2022

Split the work:

  • remove the output of an annotated / harmonized table
    • remove all instances of the annotated table copy (creation and storage) and make sure things don't break
    • replace the annotated table with the raw values table
    • disable the save functionality
  • don't delete the column object

@surchs surchs self-assigned this Oct 20, 2022
@surchs surchs moved this from Backlog to Next in Neurobagel Oct 20, 2022
@surchs surchs moved this from Next to Doing in Neurobagel Oct 25, 2022
@surchs surchs moved this from Doing to Review in Neurobagel Nov 28, 2022
@surchs surchs moved this from Review to Backlog in Neurobagel Nov 28, 2022

github-actions bot commented Dec 7, 2022

We want to keep our issues up to date and active. This issue hasn't seen any activity in the last 30 days.
We have applied the stale-issue label to indicate that this issue should be reviewed again and then either prioritized or closed.

@github-actions github-actions bot added the _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again label Dec 7, 2022
@surchs surchs added importance:high flag:discuss Flag issue that needs to be discussed before it can be implemented. and removed _flag:stale [BOT ONLY] Flag issue that hasn't been updated in a while and needs to be triaged again labels Dec 7, 2022
@surchs surchs added this to the v0.1.0 milestone Jan 10, 2023
@surchs surchs moved this from Backlog to Doing in Neurobagel Jan 10, 2023
Contributor Author

surchs commented Jan 17, 2023

A couple of open questions remain. For example, BIDS has no need to annotate participant_id because it is a mandatory field. But we allow more than one participant ID column in order to handle mixed ID systems, so we need some way to indicate that a column has been flagged as an ID column.

Similarly, session_id is a special column name in BIDS (although it is not allowed in participants.tsv), so BIDS has no need to annotate it either. Because we allow multiple session ID systems, we need to flag those columns too.
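
One way such a flag could look is sketched below in Python. The `Identifies` key and its values `participant` and `session` are purely hypothetical, as are the column names and the helper `columns_identifying`; nothing in this thread settles on a name. The point is only that an explicit marker inside Annotation lets a tool find every ID column regardless of what it is called.

```python
def columns_identifying(dictionary: dict, kind: str) -> list:
    """List the columns whose annotation flags them as an ID of the given kind."""
    return [
        name
        for name, entry in dictionary.items()
        # "Identifies" is a hypothetical key used only for this illustration.
        if entry.get("Annotation", {}).get("Identifies") == kind
    ]


# Hypothetical dictionary with two participant ID systems and one session ID.
example = {
    "participant_id": {
        "Description": "BIDS participant ID",
        "Annotation": {"Identifies": "participant"},
    },
    "site_subject_code": {
        "Description": "ID used by the acquisition site",
        "Annotation": {"Identifies": "participant"},
    },
    "visit": {
        "Description": "study visit label",
        "Annotation": {"Identifies": "session"},
    },
    "age": {"Description": "age in years"},
}

participant_columns = columns_identifying(example, "participant")
session_columns = columns_identifying(example, "session")
```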

@surchs surchs moved this from Doing to Review in Neurobagel Jan 18, 2023
@surchs surchs closed this as completed Jan 26, 2023
@github-project-automation github-project-automation bot moved this from Review to Done in Neurobagel Jan 26, 2023
Contributor

jarmoza commented Jan 26, 2023

Tagging an issue that I recently discovered while working with data dictionaries and the annotation tool.
