Support strict imports #973

Open
glguy opened this issue Jun 7, 2023 · 0 comments
Assignees
Labels
needs-details Unactionable until we get more information

Comments

Contributor

glguy commented Jun 7, 2023

Our subject-system ingestion package has had a number of schema violations caused by overly permissive ingestion behavior. It would be nice to catch these mistakes sooner rather than later.

We're seeing reused identifiers within a single ingestion unit as well as across multiple ingestion units, both of which indicate mistakes. Unfortunately, these mistakes only show up as nodes with more than one title or more than one wasStartedAt property. If the different uses had happened to set non-overlapping properties, we'd have missed the mistake entirely.
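
To illustrate why this is fragile, here is a minimal sketch of how a reused identifier becomes visible only when the properties overlap. It assumes rows are simple (identifier, property, value) tuples held in memory; `merge_rows` and `multi_valued` are hypothetical helper names for illustration and are not part of SemTK or the scraping toolkit.

```python
from collections import defaultdict


def merge_rows(rows):
    """Merge rows that share an identifier, as a permissive ingestion would.

    `rows` is an iterable of (identifier, property, value) tuples.
    This is an in-memory stand-in, not the real ingestion pipeline.
    """
    nodes = defaultdict(lambda: defaultdict(set))
    for identifier, prop, value in rows:
        nodes[identifier][prop].add(value)
    return nodes


def multi_valued(nodes, single_valued_props):
    """List (identifier, property) pairs that ended up with more than one value."""
    return sorted(
        (identifier, prop)
        for identifier, props in nodes.items()
        for prop in single_valued_props
        if len(props.get(prop, ())) > 1
    )


rows = [
    ("ID-1", "title", "First use"),
    ("ID-1", "title", "Second use"),        # reuse with an overlapping property
    ("ID-2", "title", "Only title"),
    ("ID-2", "wasStartedAt", "2023-06-07"),  # reuse with a non-overlapping property
]
suspects = multi_valued(merge_rows(rows), ["title", "wasStartedAt"])
# suspects == [("ID-1", "title")]; the reuse of ID-2 goes unnoticed
```

Only ID-1 is flagged here, even though both identifiers were used twice, which is why a check at ingestion time would be more reliable than inspecting the merged result.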

semtk

I'd like to extend SemTK's automatic ingestion nodegroup by class URI endpoint to support a pair of new, stricter modes:

  1. Error if exists - we'll use this mode for all of our 1.csv ingestions. Each of these ingestions expects to be the unique creator of its new nodes. In this mode, the subtype column (if it exists) should be set to Error if missing.
  2. Error if missing - we'll use this mode for all of our 2.csv ingestions. These files define relations, so it should always be an error if the data they refer to is missing.

We'll never need to use create if missing. In our data set, this behavior will never be desirable and will always end up hiding a mistake.
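
As a rough sketch of the intended semantics, the three lookup behaviors could be modeled as follows. This is not SemTK's API: the `LookupMode` enum, `StrictIngestionError`, and `resolve_node` are hypothetical names, and a plain dict stands in for the triple store.

```python
from enum import Enum


class LookupMode(Enum):
    CREATE_IF_MISSING = "create if missing"  # existing permissive behavior
    ERROR_IF_EXISTS = "error if exists"      # proposed for 1.csv ingestions
    ERROR_IF_MISSING = "error if missing"    # proposed for 2.csv ingestions


class StrictIngestionError(Exception):
    """Raised when a row violates the lookup mode chosen for its file."""


def resolve_node(store, identifier, mode):
    """Return the node for `identifier`, creating or rejecting it per `mode`.

    `store` is a dict mapping identifiers to property dicts; it is only
    an in-memory stand-in for the real data store.
    """
    exists = identifier in store
    if mode is LookupMode.ERROR_IF_EXISTS:
        if exists:
            raise StrictIngestionError(f"{identifier!r} already exists")
        store[identifier] = {}
    elif mode is LookupMode.ERROR_IF_MISSING:
        if not exists:
            raise StrictIngestionError(f"{identifier!r} is missing")
    elif not exists:
        # CREATE_IF_MISSING: silently create, which is what hides mistakes.
        store[identifier] = {}
    return store[identifier]
```

Under this model, a node-defining file would call `resolve_node` with `ERROR_IF_EXISTS` for each new identifier, and a relation-defining file would use `ERROR_IF_MISSING` for every identifier it refers to.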

scrapingtoolkit

I'd like to augment the scraping toolkit to make it explicit when we intend to create a new node. I plan to do this using a new named parameter in the generated classes. It would also be good if the scraping toolkit could notice when the same identifier is used twice in the same ingestion unit.
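
One possible shape for this, sketched under assumptions: the parameter name `create`, the `IngestionUnit` class, and `DuplicateIdentifierError` are all hypothetical and do not reflect the toolkit's current interface. The sketch also assumes that "used twice" means two creations of the same identifier, so that later references to an already-created node remain allowed.

```python
class DuplicateIdentifierError(Exception):
    """Raised when one ingestion unit tries to create the same identifier twice."""


class IngestionUnit:
    """Collects rows for one ingestion unit and tracks which nodes it creates."""

    def __init__(self):
        self._created = set()
        self.rows = []

    def add(self, class_name, identifier, *, create=False, **properties):
        """Record a row; `create=True` marks the intent to define a new node."""
        if create:
            if identifier in self._created:
                raise DuplicateIdentifierError(
                    f"{identifier!r} is created more than once in this unit"
                )
            self._created.add(identifier)
        self.rows.append((class_name, identifier, create, properties))


unit = IngestionUnit()
unit.add("REQUIREMENT", "REQ-1", create=True, title="First requirement")
unit.add("REQUIREMENT", "REQ-1", description="A later reference is allowed")
# A second unit.add(..., "REQ-1", create=True) would raise DuplicateIdentifierError.
```

Making the parameter keyword-only keeps the intent visible at every call site, which is the point of the proposal.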

@glguy glguy self-assigned this Jun 7, 2023
@chrisage chrisage added the needs-details Unactionable until we get more information label Jul 21, 2023
2 participants