Add sparql query to check that all data objects are referenced by another collection #1247

aclum · 2023-10-26T23:08:43Z

While cleaning up some workflow execution activity records, I found documents in the data_object_set collection that aren't referenced anywhere else in Mongo. All data_object_set documents should be referenced either by has_output in omics_processing_set documents or by has_output in collections that inherit from workflow execution activities.
For example, when cleaning up a record at Emiley's request, I get 278 data_object_set documents when I query the data_object_set collection with {'description':{'$regex':'Gp0119870'}}, versus 41 data_object_set documents when I iterate through the workflow execution activities (WEAs) to get a list of inputs and outputs.
If a data_object_set document is referenced in another collection, it will show up in a SPARQL query as the object of a triple. The subject is the WEA that references the data object, and the predicate is 'nmdc:has_output'.
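To make the comparison concrete, here is a minimal Python sketch of the same check done directly on Mongo-style documents. It assumes each collection has been loaded as a list of dicts with an `id` field and an optional `has_output` list; the function name, the collection name `metagenome_assembly_set`, and the toy ids are hypothetical and used only for illustration.

```python
from typing import Any

Database = dict[str, list[dict[str, Any]]]


def find_unreferenced_data_objects(db: Database) -> set[str]:
    """Return ids in data_object_set that no other collection lists in has_output."""
    all_ids = {doc["id"] for doc in db.get("data_object_set", [])}
    referenced: set[str] = set()
    for collection_name, docs in db.items():
        if collection_name == "data_object_set":
            continue  # only references from *other* collections count
        for doc in docs:
            referenced.update(doc.get("has_output", []))
    return all_ids - referenced


# Toy data; every id below is made up for illustration.
example_db: Database = {
    "data_object_set": [
        {"id": "nmdc:dobj-1"},
        {"id": "nmdc:dobj-2"},
        {"id": "nmdc:dobj-3"},
    ],
    "omics_processing_set": [
        {"id": "nmdc:omprc-1", "has_output": ["nmdc:dobj-1"]},
    ],
    "metagenome_assembly_set": [
        {"id": "nmdc:wfmgas-1", "has_output": ["nmdc:dobj-2"]},
    ],
}

print(sorted(find_unreferenced_data_objects(example_db)))
```

In this toy data only `nmdc:dobj-3` is unreferenced, which is the kind of document the 278 vs 41 discrepancy points at.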
The preview shows the expected results
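A rough sketch of what the inverse query could look like, listing data objects that never appear as the object of 'nmdc:has_output'. The prefix IRI and the `nmdc:DataObject` class name are assumptions that should be checked against the schema in use. The small Python helper mirrors the query's logic on in-memory triples so the idea can be tried without a triple store; it is not a SPARQL engine.

```python
# Assumed prefix IRI and class name; verify both against the deployed schema.
ORPHAN_QUERY = """
PREFIX nmdc: <https://w3id.org/nmdc/>
SELECT ?data_object
WHERE {
  ?data_object a nmdc:DataObject .
  FILTER NOT EXISTS { ?activity nmdc:has_output ?data_object . }
}
"""

Triple = tuple[str, str, str]  # (subject, predicate, object)


def orphan_data_objects(triples: list[Triple]) -> set[str]:
    """Mirror ORPHAN_QUERY: typed data objects that are never a has_output object."""
    data_objects = {
        s for s, p, o in triples if p == "rdf:type" and o == "nmdc:DataObject"
    }
    outputs = {o for s, p, o in triples if p == "nmdc:has_output"}
    return data_objects - outputs


# Toy triples; the ids are made up for illustration.
example_triples: list[Triple] = [
    ("nmdc:dobj-1", "rdf:type", "nmdc:DataObject"),
    ("nmdc:dobj-2", "rdf:type", "nmdc:DataObject"),
    ("nmdc:wfmgas-1", "nmdc:has_output", "nmdc:dobj-1"),
]

print(sorted(orphan_data_objects(example_triples)))
```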

It would be good to get a query set up that does this for all data objects for the napa compliance squad, and then to be able to run it on a routine basis to check data integrity in general.

@turbomam @mbthornton-lbl

turbomam · 2023-10-27T15:47:42Z

see

Document the various ways SPARQL queries can be run against our data extracted from MongoDB #1245

turbomam · 2023-10-27T15:49:22Z

able to run it on a routine basis to check data integrity in general

Yes!

Let's base this on make-rdf and https://github.com/microbiomedata/nmdc-schema/blob/main/.github/workflows/migration.yaml

- pulling all relevant data with pure-export in the API-only mode (--skip-collection-check)
- running whatever migrations are necessary with migration-recursion
- linkml-validate against either
  - nmdc_schema/nmdc_schema_accepting_legacy_ids.yaml
  - nmdc_schema/nmdc_materialized_patterns.yaml
- linkml-convert to RDF/TTL
- run through anyuri-strings-to-iris
- load into a local Jena TDB database
- query the TDB database to get a TSV report
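The last two steps might be scripted roughly as below. This is only a sketch under assumptions: it assumes the TDB2 flavour of the Jena command-line tools (`tdb2.tdbloader`, `tdb2.tdbquery`) is on the PATH, and the file and directory names are hypothetical placeholders. The helper only builds the command lines; passing each list to `subprocess.run` would execute them.

```python
import shlex


def jena_report_commands(
    ttl_path: str, db_dir: str, query_path: str
) -> list[list[str]]:
    """Build (but do not run) the Jena commands for loading and querying."""
    # Assumes the TDB2 tools; a TDB1 setup would use different command names.
    load = ["tdb2.tdbloader", "--loc", db_dir, ttl_path]
    query = [
        "tdb2.tdbquery",
        "--loc", db_dir,
        "--query", query_path,
        "--results", "TSV",
    ]
    return [load, query]


# Hypothetical paths, for illustration only.
commands = jena_report_commands("nmdc.ttl", "local_tdb", "orphans.rq")
for command in commands:
    print(shlex.join(command))
```

Redirecting the standard output of the second command to a file would give the TSV report.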

aclum · 2024-10-14T22:56:18Z

Closing this as complete. There is an ongoing squad working on a longer-term solution. See microbiomedata/nmdc-runtime#318

aclum closed this as completed Oct 14, 2024
