Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dg] Move check impl to lighter weight JSON validation #27214

Merged
merged 10 commits into from
Jan 24, 2025

Conversation

benpankow
Copy link
Member

@benpankow benpankow commented Jan 18, 2025

Summary

Moves the dg check implementation to use jsonschema validation, eschewing the previous approach which would actually load each component using the dagster-components machinery.

Relies on #27285 to load schemas for local component types, in addition to our current cache for global component types & their schemas.

A few follow-ups to make this faster:

  • Cache local component schemas, should be easy enough to do so in a stacked PR.

Test Plan

Shared test cases with the existing validation tests, moved over from dagster-components check tests.

Copy link

vercel bot commented Jan 18, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
dagster-docs-legacy ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 18, 2025 6:16pm

@benpankow
Copy link
Member Author

benpankow commented Jan 18, 2025

@benpankow benpankow changed the base branch from master to benpankow/list-local-component-types January 22, 2025 20:35
@benpankow benpankow force-pushed the benpankow/check-lw branch 2 times, most recently from 725098b to 1956009 Compare January 22, 2025 22:31
@benpankow benpankow marked this pull request as ready for review January 22, 2025 22:32
@benpankow benpankow requested a review from schrockn January 22, 2025 22:32
@benpankow benpankow force-pushed the benpankow/list-local-component-types branch from 7c52fad to 4eea8a2 Compare January 22, 2025 22:51
Copy link
Contributor

@OwenKephart OwenKephart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a reasonable implementation, have some comments though

also might be nice to see a sample output somewhere

component_contents_by_dir = {}
local_components = set()
for component_dir in (
dg_context.root_path / dg_context.root_package_name / "components"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simplify this a bit with:

for component_path in component_path.rglob("component.yaml"):
    ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that'll handle walking the directory for you and finding all the component.yaml files. You can probably also use rglob to filter for things in the resolved paths, although not sure if that's that much better

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bias a little towards the iterdir approach just to mirror 1:1 how we actually pick up components elsewhere in the codebase - we end up needing to find both the component yaml and the component dir anyway so it doesn't save a ton of space

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a case where a common utility would be nice here -- nbd for now

@@ -259,10 +272,76 @@ def component_check_command(
**global_options: object,
) -> None:
"""Check component files against their schemas, showing validation errors."""
resolved_paths = [Path(path).absolute() for path in paths]
top_level_component_validator = Draft202012Validator(schema=COMPONENT_FILE_SCHEMA)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to use a specific Validator class like this instead of just jsonschema.validate()? This class name seems kinda unfriendly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

validate only raises an exception with the first error; we need a validator in order to access iter_errors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wild

Comment on lines 28 to 49
class LocalComponentTypes:
@classmethod
def from_dg_context(cls, dg_context: "DgContext", component_dirs: Sequence[Path]):
# TODO: cache

raw_local_component_data = dg_context.external_components_command(
["list", "local-component-types", *component_dirs]
)
local_component_data = json.loads(raw_local_component_data)

components_by_path = {str(path): {} for path in component_dirs}
for entry in local_component_data:
components_by_path[entry["directory"]][entry["key"]] = RemoteComponentType(
**entry["metadata"]
)
return cls(components_by_path)

def __init__(self, components_by_path: dict[str, dict[str, RemoteComponentType]]):
self._components_by_path = copy.copy(components_by_path)

def get(self, path: Path, key: str) -> RemoteComponentType:
return self._components_by_path[str(path)][key]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the name of this class is pretty confusing -- it's called LocalComponentTypes but it's storing a bunch of RemoteComponentTypes

It does sort of feel like we should just have a param for RemoteComponentRegistry to load additional local component types. Like from_dg_context having an optional additional_component_dirs parameter.

That would simplify code that uses this stuff too -- you don't have to juggle two separate objects around and check which one you should be using

Copy link
Member Author

@benpankow benpankow Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree the name sucks. I do think it's somewhat useful that it's separate in that

  1. it's not comprehensive, e.g. a single instance represents a set of local component types from a few components, not necessarily the whole code location
  2. you may only want to load it in some scenarios, in others you only want to load the global RemoteComponentRegistry component type list

optional additional_component_dirs

This would work, but would need a refactor of RemoteComponentRegistry to no longer use component type as a unique key, since local components can have name overlap (since they're 'scoped' to their folder)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see what you mean about uniqueness -- I guess just more fundamentally I think "bag of remote components that we want to do stuff with" should be a single object. this code feels pretty brittle and painful to work with as time goes on.

what other cases are you imagining where we'd want only the LocalComponents?

i could be convinced that we could stick with this sort of thing (ideally with a rename) until a new usecase comes along for us to refactor based off of, just feels like this pattern shouldn't stay around forever

might be a matter of subclassing or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah let me feel out what it would look like to rewrite the existing remote components class

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did a restructure, I'm happier with this approach

@benpankow benpankow force-pushed the benpankow/list-local-component-types branch from 4eea8a2 to 185efb7 Compare January 23, 2025 00:20
@benpankow benpankow force-pushed the benpankow/list-local-component-types branch from 185efb7 to 1ed28d0 Compare January 23, 2025 15:15
@benpankow benpankow requested a review from OwenKephart January 23, 2025 15:39
Base automatically changed from benpankow/list-local-component-types to master January 23, 2025 19:09
Copy link
Contributor

@OwenKephart OwenKephart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want to do another pass on the local vs. global component stuff, but given that we're not sure of what the end state of that will be, I think this is good for now

@benpankow benpankow merged commit 1c2821a into master Jan 24, 2025
5 checks passed
@benpankow benpankow deleted the benpankow/check-lw branch January 24, 2025 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants