Our subject-system ingestion package has had a number of schema violations caused by overly permissive ingestion behaviors. We'd like to catch these sooner rather than later.
We're seeing reused identifiers within a single ingestion unit as well as across multiple ingestion units, both of which indicate mistakes. Unfortunately, these mistakes only show up as nodes with more than one title or more than one wasStartedAt property. If the different uses had happened to use non-overlapping properties, we'd have missed the mistake.
semtk
I'd like to extend SemTK's automatic ingestion nodegroup by class URI endpoint to support a pair of new, stricter modes:
Error if exists - we'll use this mode for all of our 1.csv ingestions. These ingestions should expect to be the sole creator of each new node. In this mode, the subtype column (if it exists) should be set to Error if missing.
Error if missing - we'll use this mode for all of our 2.csv ingestions. These files are used for defining relations. It should always be an error if data is missing.
We'll never need to use create if missing. In our data set, this behavior will never be desirable and will always end up hiding a mistake.
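To make the intended semantics of the two modes concrete, here is a minimal in-memory sketch. It is not SemTK code: `LookupMode`, `IngestionError`, `resolve_identifier`, and the use of a plain set as the store are all hypothetical names chosen for illustration.

```python
from enum import Enum


class LookupMode(Enum):
    # Hypothetical names for the two proposed modes; not SemTK's actual values.
    ERROR_IF_EXISTS = "error if exists"
    ERROR_IF_MISSING = "error if missing"


class IngestionError(Exception):
    """Raised when a row violates the selected lookup mode."""


def resolve_identifier(store, identifier, mode):
    """Check `identifier` against `store` (a set of known identifiers)."""
    exists = identifier in store
    if mode is LookupMode.ERROR_IF_EXISTS:
        # 1.csv style: this row must be the sole creator of the node.
        if exists:
            raise IngestionError(f"identifier already exists: {identifier}")
        store.add(identifier)
    elif mode is LookupMode.ERROR_IF_MISSING:
        # 2.csv style: relations may only refer to nodes that already exist.
        if not exists:
            raise IngestionError(f"identifier not found: {identifier}")
    return identifier
```

Note that there is deliberately no "create if missing" branch: every lookup either creates exactly once or refers to something that must already be there.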
scrapingtoolkit
I'd like to augment scraping toolkit to make it explicit when we're intending to create a new node. I plan to do this using a new named parameter in the generated classes. It would be good if scraping toolkit could notice that the same identifier was used twice in the same ingestion unit.
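As a rough illustration of what this could look like, the sketch below tracks created identifiers per ingestion unit. Everything here is an assumption rather than the toolkit's actual API: the `IngestionUnit` class, its `add` method, and the `create` parameter name are hypothetical, and the sketch assumes that only a repeated creation (not a repeated reference) counts as a mistake.

```python
class DuplicateIdentifierError(Exception):
    """Raised when one ingestion unit tries to create the same identifier twice."""


class IngestionUnit:
    """Hypothetical stand-in for one ingestion unit's collected rows."""

    def __init__(self):
        self._created = set()  # identifiers created so far in this unit
        self.rows = []

    def add(self, class_name, identifier, *, create=False, **properties):
        # `create` is the hypothetical named parameter marking intent to
        # create a new node; without it the row only refers to a node.
        if create:
            if identifier in self._created:
                raise DuplicateIdentifierError(
                    f"{identifier!r} created twice in the same ingestion unit"
                )
            self._created.add(identifier)
        self.rows.append((class_name, identifier, create, properties))
```

For example, calling `unit.add("REQUIREMENT", "REQ-1", create=True, title="First")` twice on the same unit would raise on the second call, instead of silently producing a node with two titles.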