-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[dg] Move check impl to lighter weight JSON validation #27214
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
b0de1ac
to
77812d5
Compare
725098b
to
1956009
Compare
7c52fad
to
4eea8a2
Compare
1956009
to
a4d6f07
Compare
python_modules/libraries/dagster-dg/dagster_dg/cli/component.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like a reasonable implementation, have some comments though
also might be nice to see a sample output somewhere
component_contents_by_dir = {} | ||
local_components = set() | ||
for component_dir in ( | ||
dg_context.root_path / dg_context.root_package_name / "components" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simplify this a bit with:
for component_path in component_path.rglob("component.yaml"):
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that'll handle walking the directory for you and finding all the component.yaml files. You can probably also use rglob to filter for things in the resolved paths, although not sure if that's that much better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bias a little towards the iterdir approach just to mirror 1:1 how we actually pick up components elsewhere in the codebase - we end up needing to find both the component yaml and the component dir anyway so it doesn't save a ton of space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like a case where a common utility would be nice here -- nbd for now
@@ -259,10 +272,76 @@ def component_check_command( | |||
**global_options: object, | |||
) -> None: | |||
"""Check component files against their schemas, showing validation errors.""" | |||
resolved_paths = [Path(path).absolute() for path in paths] | |||
top_level_component_validator = Draft202012Validator(schema=COMPONENT_FILE_SCHEMA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to use a specific Validator class like this instead of just jsonschema.validate()
? This class name seems kinda unfriendly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validate
only raises an exception with the first error; we need a validator in order to access iter_errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wild
python_modules/libraries/dagster-dg/dagster_dg/cli/component.py
Outdated
Show resolved
Hide resolved
class LocalComponentTypes: | ||
@classmethod | ||
def from_dg_context(cls, dg_context: "DgContext", component_dirs: Sequence[Path]): | ||
# TODO: cache | ||
|
||
raw_local_component_data = dg_context.external_components_command( | ||
["list", "local-component-types", *component_dirs] | ||
) | ||
local_component_data = json.loads(raw_local_component_data) | ||
|
||
components_by_path = {str(path): {} for path in component_dirs} | ||
for entry in local_component_data: | ||
components_by_path[entry["directory"]][entry["key"]] = RemoteComponentType( | ||
**entry["metadata"] | ||
) | ||
return cls(components_by_path) | ||
|
||
def __init__(self, components_by_path: dict[str, dict[str, RemoteComponentType]]): | ||
self._components_by_path = copy.copy(components_by_path) | ||
|
||
def get(self, path: Path, key: str) -> RemoteComponentType: | ||
return self._components_by_path[str(path)][key] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like the name of this class is pretty confusing -- it's called LocalComponentTypes but it's storing a bunch of RemoteComponentTypes
It does sort of feel like we should just have a param for RemoteComponentRegistry to load additional local component types. Like from_dg_context having an optional additional_component_dirs parameter.
That would simplify code that uses this stuff too -- you don't have to juggle two separate objects around and check which one you should be using
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree the name sucks. I do think it's somewhat useful that it's separate in that
- it's not comprehensive, e.g. a single instance represents a set of local component types from a few components, not necessarily the whole code location
- you may only want to load it in some scenarios, in others you only want to load the global
RemoteComponentRegistry
component type list
optional
additional_component_dirs
This would work, but would need a refactor of RemoteComponentRegistry
to no longer use component type as a unique key, since local components can have name overlap (since they're 'scoped' to their folder)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see what you mean about uniqueness -- I guess just more fundamentally I think "bag of remote components that we want to do stuff with" should be a single object. this code feels pretty brittle and painful to work with as time goes on.
what other cases are you imagining where we'd want only the LocalComponents?
i could be convinced that we could stick with this sort of thing (ideally with a rename) until a new usecase comes along for us to refactor based off of, just feels like this pattern shouldn't stay around forever
might be a matter of subclassing or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah let me feel out what it would look like to rewrite the existing remote components class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did a restructure, I'm happier with this approach
python_modules/libraries/dagster-dg/dagster_dg/cli/component.py
Outdated
Show resolved
Hide resolved
4eea8a2
to
185efb7
Compare
584c859
to
00df599
Compare
185efb7
to
1ed28d0
Compare
00df599
to
c9d5824
Compare
e269c1d
to
a873ee3
Compare
7cf47f1
to
15a9207
Compare
15a9207
to
ea0a069
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll want to do another pass on the local vs. global component stuff, but given that we're not sure of what the end state of that will be, I think this is good for now
34feb2b
to
3dc9f35
Compare
Summary
Moves the
dg check
implementation to usejsonschema
validation, eschewing the previous approach which would actually load each component using thedagster-components
machinery.Relies on #27285 to load schemas for local component types, in addition to our current cache for global component types & their schemas.
A few follow-ups to make this faster:
Test Plan
Shared test cases with the existing validation tests, moved over from
dagster-components
check tests.