[DataCatalog2.0]: Update KedroDataCatalog CLI logic and make it reusable
#3312
Comments
could we have something like …
This Viz issue is related: kedro-org/kedro-viz#1480
That would be perfect! We would need such a thing.
@MarcelBeining Can you explain a bit more why you need this? I am thinking about this again because I am trying to build a plugin for kedro, and this would come in handy to compile a static version of the configuration.
@noklam We try to find kedro datasets for which we have not written a data test, hence we iterate over the datasets returned by `catalog.list()`.
@MarcelBeining Did I understand this question correctly as: "Find which datasets are not written in catalog.yml yet, including dataset factory resolves"? Does `kedro catalog resolve` show what you need?
@noklam "Find which datasets is not written in catalog.yml including dataset factory resolves, yet" , yes kedro catalog resolve shows what I need, but it is a CLI command and I need it within Python (of course one could use os.system etc, but a simple extension of catalog.list() should not be that hard) |
@MarcelBeining Are you integrating this with some extra functionality? How do you consume this information, if that is ok to share?
Adding on from our discussion on Slack:

But I'd also like that information easily consumable in a notebook (for example). So if my catalog stores models like:

```yaml
"{experiment}.model":
  type: pickle.PickleDataset
  filepath: data/06_models/{experiment}/model.pickle
  versioned: true
```

I would want to be able to (somehow) do something like:

```python
models = {}
for model_dataset in [d for d in catalog.list(*~*magic*~*) if ".model" in d]:
    models[model_dataset] = catalog.load(model_dataset)
```

It's a small thing. But I was kind of surprised to not see my factory datasets in `catalog.list()`.
Another one, bumped to …
What if:

```python
for datasets in data_catalog:
    ...
```
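A minimal sketch of the kind of iteration being suggested, using a stand-in class rather than the real `KedroDataCatalog` (whose actual dunder support may differ):

```python
class TinyCatalog:
    """Stand-in for a catalog that supports `for name in catalog`."""

    def __init__(self, datasets: dict):
        self._datasets = datasets

    def __iter__(self):
        # Iterate over dataset names, mirroring dict behaviour.
        return iter(self._datasets)


catalog_like = TinyCatalog({"companies": object(), "model_input_table": object()})
for name in catalog_like:
    print(name)
```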
I think it's neat @noklam, but I don't know if it's discoverable. To me …
why_not_both.gif
I've also wanted to be able to iterate through the datasets for a while, but it raises some unanswered questions:
But we always face the same issue: we would need to "resolve" the dataset factory first, relative to a pipeline. It would eventually give: … The real advantage of doing so is that we do not need to create a search method with all types of supported search (by extension, by regex... as suggested in the corresponding issue) because it's easily customizable, so it's less maintenance burden in the end.
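The snippet that followed "It would eventually give:" is not captured here. The idea being described, a resolved name-to-dataset mapping that users filter themselves instead of Kedro shipping one search method per use case, might look roughly like this (the mapping below is faked with a plain dict so the example runs standalone):

```python
# Imagine the catalog handed back a resolved view after applying factory patterns:
# dataset name -> dataset (represented here by type names for illustration).
resolved = {
    "companies": "pandas.CSVDataset",
    "exp1.model": "pickle.PickleDataset",
    "exp2.model": "pickle.PickleDataset",
}

# Users then compose their own searches with plain Python, so Kedro doesn't need
# a dedicated method for each kind of lookup (by type, by suffix, by regex, ...):
models = {name: ds for name, ds in resolved.items() if name.endswith(".model")}
pickles = {name: ds for name, ds in resolved.items() if "PickleDataset" in ds}
print(models, pickles)
```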
Catalog.list already supports regex - isn't that identical to what you suggest as catalog.search?
@noklam you can only search by name, namespaces aren't really supported, and you can't search by attribute
namespace is just a prefix string so it works pretty well. I do believe there are benefits to improving it, but I think we should at least add an example for the existing feature, since @Galileo-Galilei told me he was not aware of it and most likely very few are.
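For reference, the existing name filtering referred to here is the `regex_search` argument of `catalog.list()`. A small example, assuming a Kedro IPython session where `catalog` is already injected and the dataset names are illustrative:

```python
# Filter registered dataset names with a regular expression.
all_names = catalog.list()                            # every registered dataset name
raw_names = catalog.list(regex_search="^raw_")        # names starting with "raw_"
model_names = catalog.list(regex_search=r"\.model$")  # names ending with ".model"
```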
Inconsistency between CLI commands and the interactive workflow is a problem related to all `kedro catalog` commands. Thus, we suggest refactoring the CLI logic and moving it to the catalog itself. We also think that we should not couple catalog and pipelines, so we do not consider extending `catalog.list()` with a pipeline argument.
So this came up yesterday - my colleague said 'I don't think the catalog is working'. We did `catalog.list()` and the datasets we expected weren't there. It took 10 minutes of playing around with the catalog to work out what was going on. What may have helped?
What did we do? We wrote something horrible like this:
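The actual snippet isn't captured in this thread. A guess at the kind of "horrible" manual matching being described, assuming the interactive `catalog`/`pipelines` variables and the `parse` library that Kedro's dataset factory patterns are based on (the pattern list is hypothetical and would be copied from `catalog.yml` by hand):

```python
from parse import parse  # same pattern syntax Kedro uses for dataset factories

# Copied by hand from catalog.yml - exactly the duplication that makes this horrible.
factory_patterns = ["{experiment}.model", "{namespace}.raw_data"]

factory_datasets = [
    name
    for name in pipelines["__default__"].datasets()
    if any(parse(pattern, name) for pattern in factory_patterns)
]
print(sorted(factory_datasets))
```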
In summary, I think we don't have to over-engineer this; I think having expected patterns show up in the `catalog.list()` output would be enough.
Which, I realise, is exactly what I pitched 18 months ago 😂
@datajoely The idea of listing datasets and patterns together looks interesting. But doing so could lead to confusion, as people may not differentiate datasets and patterns. So, providing an interface to access both separately and communicating this better seems more reasonable to me. There is also a high chance that … I see the pain point about the discrepancy between the CLI commands and `catalog.list()`.

Long story short - we suggest:

…
These 5 should address the points you mentioned as well as others we discussed above.
Okay, we're actually designing for different things. Perhaps we could print out some of these recommendations when the IPython extension is loaded, because even experienced Kedro users aren't going to know how to retrieve patterns etc.
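A minimal sketch of that idea, using the standard IPython extension hook (not how the real Kedro extension is structured; the hint text is illustrative):

```python
def load_ipython_extension(ipython):
    # IPython calls this hook on `%load_ext`; printing here means every notebook
    # user sees the hint once per session.
    print(
        "Tip: catalog.list() only shows explicitly declared datasets; "
        "dataset factory patterns are resolved lazily - see `kedro catalog resolve`."
    )
```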
@datajoely, yep, now it's a good time to point to possible improvements. Can you please elaborate on what you mean by …?
I think a JSON structure something like this would be more useful from a machine interpretability point of view:

```json
[
  {"dataset_name": "...", "dataset_type": "...", "pipelines": ["...", "..."]}
]
```

There is more that is possibly useful, such as the classpath for custom datasets, which YAML file it was found in, etc.
Description
Parent issue: #4472
Suggested plan: #3312 (comment)
Context
Background: https://linen-slack.kedro.org/t/16064885/when-i-say-catalog-list-in-a-kedro-jupter-lab-instance-it-do#ad3bb4aa-f6f9-44c6-bb84-b25163bfe85c
With dataset factories, the "definition" of a dataset is not known until the pipeline is run. When users work in a Jupyter notebook, they expect to see the full list of datasets with `catalog.list()`.

The current workaround to see the datasets for the `__default__` pipeline looks like this:
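The workaround snippet itself isn't captured here. One rough sketch of the kind of thing that works today, assuming the interactive `catalog` and `pipelines` variables and that touching a dataset (e.g. via `catalog.exists`) triggers factory resolution:

```python
# Touch every dataset used by the default pipeline so matching factory
# patterns get materialised, then list the now-resolved catalog.
for name in pipelines["__default__"].datasets():
    try:
        catalog.exists(name)  # resolves a factory pattern if the name matches one
    except Exception:
        pass  # e.g. datasets that need credentials or are purely in-memory

print(catalog.list())
```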
When using the CLI commands, e.g. `kedro catalog list`, we do matching to figure out which factory patterns in the catalog match the datasets used in the pipeline, but when going through the interactive flow no such checking has been done yet.

Possible Implementation
Could check dataset existence when the session is created. We need to verify if that has any unexpected side effects.
This ticket is still open in scope and we don't have a specific implementation in mind. The person who picks it up can evaluate different approaches, considering side effects and avoiding coupling with other components.
Possible Alternatives
- `catalog.list(pipeline=<name>)` - not a good solution because the catalog wouldn't have access to a pipeline
- … when `kedro catalog list` is called.