While cleaning up some workflow execution activity records, I found documents in the data_object_set collection that aren't referenced anywhere else in Mongo. All data_object_set documents should be referenced either by has_output in omics_processing_set documents or by has_output in collections that inherit from workflow execution activities.
![Screenshot 2023-10-26 at 4 04 59 PM](https://private-user-images.githubusercontent.com/8196591/278495307-39395447-5e92-4d08-a2ae-d3c57ac93fb3.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkzNjAyNDMsIm5iZiI6MTczOTM1OTk0MywicGF0aCI6Ii84MTk2NTkxLzI3ODQ5NTMwNy0zOTM5NTQ0Ny01ZTkyLTRkMDgtYTJhZS1kM2M1N2FjOTNmYjMucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDIxMiUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAyMTJUMTEzMjIzWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9MDg5Y2E2NDJmOTRkMTc4MWY3OWFiMzBjN2UxMTFiYTdlM2FmY2E3NjUxMjdlMmVmNWVhMTZlMjExNDQwYTNlMSZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.unpBNupW06FV-rZznednNrpH39k0BWZyketYV6msELY)
For example, when cleaning up a record at Emiley's request, I get 278 data_object_set documents when I query the data_object_set collection with {'description':{'$regex':'Gp0119870'}}, versus 41 data_object_set documents when I iterate through the workflow execution activities (WEAs) to build a list of inputs and outputs.
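The comparison above boils down to a set difference: every data object id minus every id that some other document lists in has_output. Here is a minimal sketch of that check in plain Python, assuming each document carries an `id` field and that has_output may be missing, a single id, or a list of ids. The function names and the sample ids are hypothetical, not taken from the database.

```python
from typing import Any, Iterable, List, Mapping, Set


def as_id_list(value: Any) -> List[str]:
    """Normalize a has_output value that may be absent, one id, or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def find_unreferenced_ids(
    data_objects: Iterable[Mapping[str, Any]],
    referencing_docs: Iterable[Mapping[str, Any]],
    ref_field: str = "has_output",
) -> Set[str]:
    """Return ids of data objects that no referencing document points to."""
    referenced: Set[str] = set()
    for doc in referencing_docs:
        referenced.update(as_id_list(doc.get(ref_field)))
    return {doc["id"] for doc in data_objects} - referenced


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    sample_data_objects = [{"id": "example:dobj-1"}, {"id": "example:dobj-2"}]
    sample_activities = [{"id": "example:wea-1", "has_output": ["example:dobj-1"]}]
    print(find_unreferenced_ids(sample_data_objects, sample_activities))
```

Against the live database, the two arguments could be pymongo cursors, for example `db.data_object_set.find({}, {"id": 1})` for the data objects and a chain of `find` calls over omics_processing_set and each WEA-derived collection for the referencing documents. Which collections count as WEA-derived is not listed in this issue, so that list would need to be filled in.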
If a data_object_set document is referenced from another collection, it will show up in a SPARQL query as the object of a triple. The subject is the WEA that references the data object, and the predicate is 'nmdc:has_output'.
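Inverting that triple pattern gives the orphans: data objects that never appear as the object of nmdc:has_output. The sketch below only builds such a query string in Python. The prefix IRI and the `nmdc:DataObject` class name are assumptions that are not stated in this issue, and the helper name is hypothetical; `FILTER NOT EXISTS` is standard SPARQL 1.1.

```python
# Assumption: the nmdc prefix IRI and the nmdc:DataObject class name below
# are illustrative defaults and should be checked against the triple store.
NMDC_PREFIX_IRI = "https://w3id.org/nmdc/"


def build_orphan_query(
    prefix_iri: str = NMDC_PREFIX_IRI,
    data_object_class: str = "nmdc:DataObject",
    predicate: str = "nmdc:has_output",
) -> str:
    """Build a SPARQL query selecting data objects that nothing outputs."""
    return f"""PREFIX nmdc: <{prefix_iri}>
SELECT ?data_object
WHERE {{
  ?data_object a {data_object_class} .
  FILTER NOT EXISTS {{ ?activity {predicate} ?data_object }}
}}"""


if __name__ == "__main__":
    print(build_orphan_query())
```

The query assumes every data object has an rdf:type triple in the store. If it does not, the first pattern would need to be replaced with whatever identifies data objects there.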
The preview shows the expected results
It would be good to set up a query that does this for all data objects for the napa compliance squad, and then to run it on a routine basis as a general data integrity check.
@turbomam @mbthornton-lbl