Support strict imports #973

glguy · 2023-06-07T23:53:46Z

Our subject-system ingestion package has had a number of schema violations caused by overly permissive ingestion behaviors. It would be nice to catch these mistakes earlier rather than later.

We're seeing reused identifiers within a single ingestion unit as well as across multiple ingestion units, both of which indicate mistakes. Unfortunately, these mistakes only show up as nodes with more than one title or more than one wasStartedAt property. If the different uses had happened to set non-overlapping properties, we'd have missed the mistake entirely.
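
To illustrate why the reuse is only sometimes visible, here is a minimal sketch that merges rows by identifier the way a permissive ingest would. The row format, the function names, and the sample identifiers and values are all hypothetical and exist only for this illustration; this is not how SemTK stores data.

```python
from collections import defaultdict


def merge_by_identifier(rows):
    """Merge (identifier, property, value) rows into one node per identifier."""
    nodes = defaultdict(lambda: defaultdict(set))
    for identifier, prop, value in rows:
        nodes[identifier][prop].add(value)
    return nodes


def multi_valued(nodes):
    """Return identifiers that ended up with more than one value for a property."""
    return sorted(
        identifier
        for identifier, props in nodes.items()
        if any(len(values) > 1 for values in props.values())
    )


# Two unrelated uses of "ID-1" both set a title: the clash is visible afterwards.
overlapping = [("ID-1", "title", "Pump controller"), ("ID-1", "title", "Valve driver")]
# Two unrelated uses of "ID-2" set different properties: they merge silently.
disjoint = [("ID-2", "title", "Pump controller"), ("ID-2", "wasStartedAt", "2023-06-07")]

print(multi_valued(merge_by_identifier(overlapping)))  # ['ID-1']
print(multi_valued(merge_by_identifier(disjoint)))     # []
```

The second case is the one that worries us: the identifier was reused, but nothing in the merged result gives it away.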

semtk

I'd like to extend SemTK's automatic ingestion nodegroup by class URI endpoint to support a pair of new, stricter modes:

Error if exists - we'll use this mode for all of our 1.csv ingestions. These ingestions should be expecting to be the unique creator of new nodes. The subtype column (if it exists) should be set to Error if missing in this mode.
Error if missing - we'll use this mode for all of our 2.csv ingestions. These files are used for defining relations. It should always be an error if data is missing.

We'll never need to use create if missing. In our data set, this behavior will never be desirable and will always end up hiding a mistake.
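
The intended semantics of the two modes can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Mode` enum, `StrictImportError`, `check_identifier`, and `ingest` are hypothetical names, and an in-memory set stands in for the triple store. None of this is SemTK's actual API.

```python
from enum import Enum


class Mode(Enum):
    ERROR_IF_EXISTS = "error if exists"
    ERROR_IF_MISSING = "error if missing"


class StrictImportError(Exception):
    """Raised when a row violates the strict lookup mode (hypothetical)."""


def check_identifier(identifier, existing, mode):
    """Apply one strict mode; `existing` is the set of already-known identifiers."""
    present = identifier in existing
    if mode is Mode.ERROR_IF_EXISTS:
        if present:
            raise StrictImportError(f"{identifier!r} already exists")
        existing.add(identifier)  # this row is the unique creator of the node
    elif mode is Mode.ERROR_IF_MISSING:
        if not present:
            raise StrictImportError(f"{identifier!r} is missing")


def ingest(identifiers, existing, mode):
    """Check every identifier and collect error messages instead of stopping."""
    errors = []
    for identifier in identifiers:
        try:
            check_identifier(identifier, existing, mode)
        except StrictImportError as exc:
            errors.append(str(exc))
    return errors


store = set()
# A 1.csv-style load: "A" is created twice, which is reported.
print(ingest(["A", "B", "A"], store, Mode.ERROR_IF_EXISTS))  # ["'A' already exists"]
# A 2.csv-style load: "C" was never created, which is reported.
print(ingest(["A", "C"], store, Mode.ERROR_IF_MISSING))      # ["'C' is missing"]
```

Note that neither branch ever creates a node on a failed lookup, which is the point of dropping create if missing.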

scrapingtoolkit

I'd like to augment the scraping toolkit to make it explicit when we intend to create a new node. I plan to do this using a new named parameter in the generated classes. It would also be good if the scraping toolkit could notice when the same identifier is used twice in the same ingestion unit.
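
One possible shape for this, as a sketch only: the `IngestionUnit` class, its `record` method, the `create` keyword, and `DuplicateIdentifierError` are hypothetical names chosen for illustration, not the scraping toolkit's existing interface. The sketch treats a second creation of the same identifier within one unit as the error, while plain references remain allowed.

```python
class DuplicateIdentifierError(Exception):
    """Raised when one ingestion unit tries to create an identifier twice."""


class IngestionUnit:
    """Tracks which identifiers this unit has explicitly created (hypothetical)."""

    def __init__(self):
        self._created = set()

    def record(self, identifier, *, create=False, **properties):
        """Build a row; with create=True, insist the identifier is new to this unit."""
        if create:
            if identifier in self._created:
                raise DuplicateIdentifierError(
                    f"{identifier!r} was already created in this ingestion unit"
                )
            self._created.add(identifier)
        return {"identifier": identifier, "create": create, **properties}


unit = IngestionUnit()
unit.record("REQ-1", create=True, title="First requirement")  # creates the node
unit.record("REQ-1", wasStartedAt="2023-06-07")               # plain reference, allowed
try:
    unit.record("REQ-1", create=True, title="Copy-paste mistake")
except DuplicateIdentifierError as exc:
    print(exc)  # 'REQ-1' was already created in this ingestion unit
```

Making `create` keyword-only keeps the intent visible at every call site, which is what makes the mistake easy to spot in review as well as at run time.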


glguy self-assigned this Jun 7, 2023

chrisage added the needs-details Unactionnable until we get more information label Jul 21, 2023
