Add sparql query to check that all data objects are referenced by another collection #1247

Closed
aclum opened this issue Oct 26, 2023 · 3 comments

Comments

@aclum
Contributor

aclum commented Oct 26, 2023

While cleaning up some workflow execution activity records, I found documents in the data_object_set collection that aren't referenced anywhere else in Mongo. All data_object_set documents should be referenced either by has_output in omics_processing_set documents or by has_output in collections that inherit from workflow execution activities.
For example, when cleaning up a record at Emiley's request, I get 278 data_object_set documents when I query the data_object_set collection with {'description':{'$regex':'Gp0119870'}}, versus 41 data_object_set documents when I iterate through the workflow execution activities (WEA) to get a list of inputs and outputs.
If a data_object_set document is referenced in another collection, it will show up as the object in a SPARQL query. The subject is the WEA that references the data object, with a predicate of 'nmdc:has_output'.
The preview shows the expected results
Screenshot 2023-10-26 at 4 04 59 PM

It would be good to get a query set up that does this for all data objects for the napa compliance squad, and then to be able to run it on a routine basis to check data integrity in general.
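
The check described above amounts to a set difference: collect every id listed under has_output in the referencing collections, then report the data_object_set documents whose id is not in that set. A minimal sketch in Python over in-memory documents, assuming each document carries an `id` field and that has_output is a list of data object ids. The function name, the `example_activity_set` collection name, and all sample ids are hypothetical; a real run would read the documents from Mongo or the API instead.

```python
from typing import Iterable, Mapping


def find_unreferenced_data_objects(
    data_objects: Iterable[Mapping],
    referencing_collections: Mapping[str, Iterable[Mapping]],
) -> list:
    """Return the ids of data objects that no document lists in has_output."""
    referenced = set()
    for docs in referencing_collections.values():
        for doc in docs:
            # Assumption: has_output is a list of data object ids (may be absent).
            referenced.update(doc.get("has_output", []))
    return sorted(d["id"] for d in data_objects if d["id"] not in referenced)


# Hypothetical sample documents, not real NMDC records.
data_object_set = [
    {"id": "nmdc:dobj-1"},
    {"id": "nmdc:dobj-2"},
    {"id": "nmdc:dobj-3"},
]
referencing = {
    "omics_processing_set": [
        {"id": "nmdc:omprc-1", "has_output": ["nmdc:dobj-1"]},
    ],
    # Stand-in for any collection that inherits from workflow execution activity.
    "example_activity_set": [
        {"id": "nmdc:wf-1", "has_output": ["nmdc:dobj-2"]},
    ],
}

orphans = find_unreferenced_data_objects(data_object_set, referencing)
print(orphans)
```

With the sample data, only nmdc:dobj-3 is unreferenced, which corresponds to the kind of document that showed up in the 278 but not in the 41.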

@turbomam @mbthornton-lbl

@turbomam
Member

turbomam commented Oct 27, 2023

able to run it on a routine basis to check data integrity in general

Yes!

Let's base this on make-rdf and https://github.com/microbiomedata/nmdc-schema/blob/main/.github/workflows/migration.yaml

  • pulling all relevant data with pure-export in the API-only mode (--skip-collection-check)
  • running whatever migrations are necessary with migration-recursion
  • linkml-validate against either
    • nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
    • nmdc_schema/nmdc_materialized_patterns.yaml
  • linkml-convert to RDF/TTL
  • run through anyuri-strings-to-iris
  • load into a local Jena TDB database
  • query the TDB database to get a TSV report
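
The last step of the list above could look roughly like the following Python sketch. It holds a candidate query as a string and converts a result in the standard SPARQL 1.1 Query Results JSON format into a TSV report. The query is not the one from the screenshot in the issue: the `nmdc:` prefix IRI, the `nmdc:DataObject` class name, and the variable names are assumptions, and `sparql_json_to_tsv` is a hypothetical helper. How the query is executed against the TDB database is deliberately left out.

```python
import csv
import io

# Candidate query (assumption): data objects that are never the object of
# nmdc:has_output. Prefix IRI and class name are assumed, not taken from the issue.
ORPHAN_QUERY = """
PREFIX nmdc: <https://w3id.org/nmdc/>
SELECT ?data_object WHERE {
  ?data_object a nmdc:DataObject .
  FILTER NOT EXISTS { ?activity nmdc:has_output ?data_object }
}
"""


def sparql_json_to_tsv(results: dict) -> str:
    """Render a SPARQL JSON result set as TSV, one row per binding."""
    variables = results["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(variables)
    for binding in results["results"]["bindings"]:
        # Unbound variables are written as empty cells.
        writer.writerow([binding.get(v, {}).get("value", "") for v in variables])
    return buffer.getvalue()


# Hypothetical result set in the SPARQL 1.1 Query Results JSON shape.
sample_results = {
    "head": {"vars": ["data_object"]},
    "results": {
        "bindings": [
            {"data_object": {"type": "uri", "value": "https://w3id.org/nmdc/dobj-3"}},
        ]
    },
}

report = sparql_json_to_tsv(sample_results)
print(report, end="")
```

An empty report (header row only) would mean every data object is referenced by at least one has_output, so a routine job could fail whenever the report has more than one line.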

@aclum
Contributor Author

aclum commented Oct 14, 2024

Closing this as complete. There is a squad working on a longer-term solution; see microbiomedata/nmdc-runtime#318.

@aclum aclum closed this as completed Oct 14, 2024