
[DataCatalog2.0]: Update KedroDataCatalog CLI logic and make it reusable #3312

Open · noklam opened this issue Nov 15, 2023 · 25 comments
Labels: Component: IO · Issue: Feature Request

@noklam
Contributor

noklam commented Nov 15, 2023

Description

Parent issue: #4472

Suggested plan: #3312 (comment)

Context

Background: https://linen-slack.kedro.org/t/16064885/when-i-say-catalog-list-in-a-kedro-jupter-lab-instance-it-do#ad3bb4aa-f6f9-44c6-bb84-b25163bfe85c

With dataset factories, the definition of a dataset is not known until the pipeline is run. When a user is working in a Jupyter notebook, they expect to see the full list of datasets with catalog.list().

The current workaround to see the datasets for the __default__ pipeline looks like this:

for dataset in pipelines["__default__"].data_sets():
    catalog.exists(dataset)  # touching each dataset forces factory resolution

When using the CLI commands, e.g. kedro catalog list, we do matching to figure out which factory patterns mentioned in the catalog match the datasets used in the pipeline, but in the interactive flow no such matching is done.

Possible Implementation

We could check dataset existence when the session is created; we need to verify whether that has any unexpected side effects.

The scope of this ticket is still open and we don't have a specific implementation in mind. Whoever picks it up can evaluate different approaches, weighing side effects and avoiding coupling with other components. A rough sketch of the session-creation idea follows.
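A minimal sketch, assuming a hooks-based implementation: the after_context_created hook exists in Kedro, but eagerly touching every dataset at session creation is exactly the side effect that would need verifying.

from kedro.framework.hooks import hook_impl
from kedro.framework.project import pipelines

class EagerFactoryResolutionHooks:
    @hook_impl
    def after_context_created(self, context):
        # Touch every dataset of the default pipeline so that factory
        # patterns are resolved up front and show up in catalog.list().
        for dataset_name in pipelines["__default__"].data_sets():
            context.catalog.exists(dataset_name)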

Possible Alternatives

  • catalog.list(pipeline=<name>) - not a good solution, because the catalog wouldn't have access to a pipeline
  • Do something similar to what happens when kedro catalog list is called (see the sketch below).
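A rough sketch of the second alternative, mimicking interactively what kedro catalog list does. It assumes the kedro ipython globals catalog and pipelines; only existing catalog methods are used.

pipeline_datasets = set(pipelines["__default__"].data_sets())
declared = set(catalog.list())
# Anything the pipeline uses but the catalog does not declare explicitly is
# either a dataset factory match or a runtime MemoryDataset.
undeclared = pipeline_datasets - declared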
@noklam noklam added the Issue: Feature Request New feature or improvement to existing feature label Nov 15, 2023
@datajoely
Contributor

could we have something like catalog.resolve(pipeline:Optional[str]).list()?

@merelcht
Member

This Viz issue is related: kedro-org/kedro-viz#1480

@MarcelBeining

could we have something like catalog.resolve(pipeline:Optional[str]).list()?

That would be perfect! We would need such a thing

@noklam
Contributor Author

noklam commented Feb 16, 2024

@MarcelBeining Can you explain a bit more why you need this? I am thinking about this again because I am trying to build a plugin for Kedro, and this would come in handy for compiling a static version of the configuration.

@ankatiyar ankatiyar added this to the Dataset Factory Improvements milestone Mar 4, 2024
@MarcelBeining

@noklam We try to find Kedro datasets for which we have not written a data test, hence we iterate over catalog.list(). However, if we use dataset factories, the datasets captured by a factory are not listed in catalog.list().

@noklam
Contributor Author

noklam commented Mar 12, 2024

@MarcelBeining Did I understand this question correctly as:

Do kedro catalog resolve or kedro catalog list help you? If not, what is missing?

@MarcelBeining

@noklam "Find which datasets is not written in catalog.yml including dataset factory resolves, yet" , yes

kedro catalog resolve shows what I need, but it is a CLI command and I need it within Python (of course one could use os.system etc., but a simple extension of catalog.list() should not be that hard).

@noklam
Contributor Author

noklam commented Mar 12, 2024

@MarcelBeining Are you integrating this with some extra functionality? How do you consume this information, if that is ok to share?

@ianwhale
Contributor

@noklam

Adding on from our discussion on slack,

kedro catalog resolve does what I'd want.

But I'd also like that information easily consumable in a notebook (for example).

So if my catalog stores models like:

"{experiment}.model":
  type: pickle.PickleDataset
  filepath: data/06_models/{experiment}/model.pickle
  versioned: true

I would want to be able to (somehow) do something like:

models = {}

for model_dataset in [d for d in catalog.list(*~*magic*~*) if ".model" in d]:
    models[model_dataset] = catalog.load(model_dataset)

It's a small thing, but I was kind of surprised to not see my {experiment}.model entries listed at all in catalog.list().

@noklam
Contributor Author

noklam commented May 24, 2024

Another one, bumped to high priority as discussed in slack.
https://linen-slack.kedro.org/t/18841749/hi-i-have-a-dataset-factory-specified-and-when-i-do-catalog-#eb609fb2-fce6-434d-a652-ffb62eb41e7b

@noklam
Contributor Author

noklam commented Jun 3, 2024

What if DataCatalog is iterable?

for dataset in data_catalog:
    ...

@datajoely
Contributor

I think it's neat @noklam, but I don't know if it's discoverable.

To me DataCatalog.list() feels more powerful in the IDE than list(DataCatalog)...

@astrojuanlu
Member

why_not_both.gif

def list(self):
    ...

def __iter__(self):
    return iter(self.list())  # list() returns a list, so wrap it in an iterator
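With both defined, either spelling would work; a small usage sketch:

for name in catalog:      # iteration via __iter__
    print(name)
names = catalog.list()    # explicit method, more discoverable in an IDE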

@Galileo-Galilei
Member

Galileo-Galilei commented Jun 3, 2024

I've also wanted to be able to iterate through the datasets for a while, but it raises some unanswered questions.

We always face the same issue: we would need to "resolve" the dataset factory first, relative to a pipeline. That would eventually give [dataset.name for dataset in catalog.resolve(pipeline)], but is it really a better / more intuitive syntax? I personally find it quite neat, but arguably beginners would prefer a "native" method.

The real advantage of doing so is that we do not need to create a search method with all the types of supported search (by extension, by regex... as suggested in the corresponding issue), because it's easily customisable, so it's less maintenance burden in the end.

@noklam
Contributor Author

noklam commented Jun 3, 2024

catalog.list() already supports regex; isn't that identical to what you suggest as catalog.search?

@datajoely
Contributor

@noklam you can only search by name; namespaces aren't really supported and you can't search by attribute.

@noklam
Contributor Author

noklam commented Jun 4, 2024

A namespace is just a prefix string, so it works pretty well. I do believe there are benefits to improving it, but I think we should at least add an example for the existing feature, since @Galileo-Galilei told me he was not aware of it and most likely very few are.

#3924
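For reference, the existing regex filtering looks like this (the dataset names are illustrative):

catalog.list(regex_search="^model_input")  # names starting with "model_input"
catalog.list(regex_search="csv$")          # names ending with "csv"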

@noklam noklam changed the title How to improve catalog.list or alternative for dataset factory? Improve catalog.list or alternative for dataset factory? Oct 23, 2024
@ElenaKhaustova ElenaKhaustova self-assigned this Jan 20, 2025
@ElenaKhaustova
Contributor

Inconsistency between CLI commands and the interactive workflow is a problem affecting all kedro catalog commands. Currently, all CLI logic is implemented within the commands themselves and cannot be reused across the codebase. This is confusing for users, as one cannot perform interactively the same actions that are available via the commands. It also requires additional maintenance, as any catalog-related change has to be validated against the CLI logic as well.

Thus, we suggest refactoring the CLI logic and moving it to the session level so that it can be reused for both the CLI and interactive workflows. We also suggest doing that together with the deprecation of the old catalog, since the CLI logic currently relies on some private fields and methods of the old catalog that will be removed.

We also think that we should not couple the catalog and pipelines, so we do not consider extending the catalog.list() method for this and suggest using the session instead.

@datajoely
Contributor

So this came up yesterday: my colleague said "I don't think the catalog is working". We did catalog.list() and all we saw was parameters.

It took 10 minutes of playing with %reload_kedro --env=... to realise that it was down to the fact that factories don't show up.

What may have helped?

catalog.list() could also show the patterns it is expecting, e.g. "{name}_data", so the user is aware of the factory patterns that have not been resolved yet.

What did we do? We wrote something horrible like this:

{x: catalog.load(x) for x in (pipelines["__default__"].outputs() | pipelines["__default__"].inputs())}

In summary, we don't have to over-engineer this: having the expected patterns show up in catalog.list() would be a really helpful addition that doesn't add too much complexity.

@datajoely
Contributor

Which, I realise, is exactly what I pitched 18 months ago 😂
#3312 (comment)

@ElenaKhaustova
Contributor

@datajoely
I think the conceptual problem here is that we were not clear enough in explaining to users that datasets are not equal to patterns. After we added CatalogConfigResolver, patterns can be listed like this for both the new and old catalogs: catalog.config_resolver.list_patterns().

The idea of listing datasets and patterns together looks interesting, but doing so could lead to confusion, as people may not differentiate datasets from patterns. So providing an interface to access both separately, and communicating this better, seems more reasonable to me.

There is also a high chance that catalog.list() will be replaced with another filtering method or removed, as we now have a dict-like interface to list keys, values, and items: #3917.

I see the pain point about the discrepancy between catalog.list() and the CLI command kedro catalog list, where the latter considers the pipeline but the former does not. But as suggested above, we think this can be solved via the session, not the catalog, to avoid coupling the catalog and pipelines.

Long story short - we suggest:

  1. Keep catalog.config_resolver.list_patterns() to list patterns
  2. Keep the dict-like interface to list datasets
  3. Replace catalog.list() with catalog.filter() or remove it
  4. Move the CLI logic to the session, so it is possible to list datasets with regard to pipelines both from the CLI and in the interactive workflow, something like: session.catalog_list()
  5. Communicate better the difference between datasets and patterns, and the resolution based on pipelines

These five points should address the ones you mentioned as well as others we discussed above; a rough sketch of how points 1, 2 and 4 would fit together follows.
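A sketch of the suggested split in an interactive session. Points 1 and 2 use existing APIs; session.catalog_list() in point 4 is the proposed API and does not exist yet.

patterns = catalog.config_resolver.list_patterns()  # point 1: factory patterns only
dataset_names = list(catalog.keys())                # point 2: materialised datasets via the dict-like interface
# point 4 (proposed): pipeline-aware listing via the session
# per_pipeline = session.catalog_list(pipeline_name="__default__")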

@datajoely
Contributor

Okay, we're actually designing for different things. Perhaps we could print out some of these recommendations when the ipython extension is loaded, because even experienced Kedro users aren't going to know how to retrieve patterns etc.

@astrojuanlu astrojuanlu moved this to To Do in Kedro 🔶 Jan 27, 2025
@ElenaKhaustova ElenaKhaustova changed the title Improve catalog.list or alternative for dataset factory? [DataCatalog2.0]: Update KedroDataCatalog CLI logic and make it reusable Feb 10, 2025
@ElenaKhaustova ElenaKhaustova added the Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets label Feb 10, 2025
@datajoely
Contributor

Now that this is prioritised for the 1.0.0 collection of breaking changes, I want to make the point that the current CLI command is useless:

[screenshot of the current kedro catalog list output]

I no longer have access to telemetry, but the last time I looked, it was barely used.

What it does well:

  • It interpolates the dataset factories, so you don't hit the situation where catalog.list() doesn't work until a dataset has been discovered eagerly.
  • It breaks down the datasets by type, but this is now solved in the Python API.

What it doesn't do well:

  • It's not really machine readable: the output is technically YAML, but it's hard to do anything with it, as the logging cannot be disabled by the user from the CLI on its own.
  • I think a detailed JSON output would be amazing; being able to do kedro catalog list | jq would be a productivity driver for users.

@ElenaKhaustova
Contributor

@datajoely, yep, now is a good time to point out possible improvements. Can you please elaborate on what you mean by "detailed"? Do you expect any other information to be included?

@datajoely
Contributor

I think a JSON structure something like this would be more useful from a machine-interpretability point of view:

[
  {"dataset_name": "...", "dataset_type": "...", "pipelines": ["...", "..."]}
]

There is more that is possibly useful, such as the classpath for custom datasets, which YAML file it was found in, etc.
