Kedro integration cannot load datasets created using dataset factories #988

Closed · gtauzin opened this issue Feb 4, 2025 · 13 comments · Fixed by #1001
Labels: Bug Report 🐛 · Needs triage 🔍
@gtauzin (Contributor) commented Feb 4, 2025

Which package?

vizro

Package version

0.1.30

Description

When using the kedro integration for data management, it is currently not possible to load datasets that have been generated by dataset factories. This is a known problem that is being addressed by the new KedroDataCatalog in the most recent versions of kedro (>=0.19.9). The issue is discussed here.

How to Reproduce

  1. Create a new kedro project: kedro new --starter=spaceflights-pandas
  2. Modify conf/base/catalog.yml to use the dataset factories syntax. Remove
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv
  metadata:
    description: "A CSV file containing raw companies."

reviews:
  type: pandas.CSVDataset
  filepath: data/01_raw/reviews.csv
  metadata:
    description: "A CSV file containing raw reviews."

and replace it with:

"{dataset_name}#csv":
  type: pandas.CSVDataset
  filepath: data/01_raw/{dataset_name}.csv
  3. Add vizro to the requirements and open a jupyter notebook with the kedro magic: %load_ext kedro.ipython. The kedro catalog is now defined.
  4. Write in a cell:
from vizro.integrations import kedro as kedro_integration

kedro_integration.datasets_from_catalog(catalog).keys()

Output

The steps above output a list of the kedro dataset names without "companies" and "reviews". However, as the kedro pipelines' inputs/outputs refer to "companies" and "reviews", and these names match the dataset factory defined above, they should also be listed.
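For illustration, a rough sketch of the observed versus expected behaviour (the output shown is indicative only, not a verbatim capture):

from vizro.integrations import kedro as kedro_integration

kedro_integration.datasets_from_catalog(catalog).keys()
# observed: the returned keys do not contain "companies" or "reviews"
# expected: dict_keys([..., "companies", "reviews"])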


@gtauzin (Contributor, Author) commented Feb 4, 2025

BTW, I am happy to open a PR to fix this :)

@antonymilne (Contributor)

Thanks for the very clear issue @gtauzin! I haven't been keeping up to date with all the kedro developments, but I was always a strong supporter of the idea of a dict interface to the data catalog, so I'm very pleased to see it's happened 😀 Thank you @astrojuanlu @ElenaKhaustova and everyone else! Just skimming through the kedro issues now, this seems like great progress 🙌

If you're happy to raise a PR for this then that would be amazing. I'm very happy to support with it and offer any help that you need over github issues or a Zoom call or anything - just let me know. A few things to note upfront:

  1. Is there a nice way to get catalog.list() or something else to include dataset factory datasets for versions kedro<0.19.9? @astrojuanlu @ElenaKhaustova
    • If yes: let's do it that way.
    • If no: we'll bump our kedro requirement to kedro>=0.19.9 and use KedroDataCatalog instead of DataCatalog throughout _data_manager.py. I see it's an experimental feature but our usage is so simple that hopefully it should all be fine.
  2. I rewrote our contributing guidelines recently so please do let me know if it's all clear or you need any help!
  3. We have some simple tests that should be expanded to include the dataset factory case: https://github.com/mckinsey/vizro/blob/main/vizro-core/tests/unit/vizro/integrations/kedro/test_kedro_data_manager.py.

And a few tangentially related points...

  1. When we get around to doing Should vizro support polars (or other dataframes besides pandas)? #286 (hopefully Q1 this year) we'll need to work out how to expand the if "pandas" in dataset.__module__ logic.
  2. I'd love to hear any more feedback/ideas you have for the kedro integration. At the moment it's pretty lightweight but I'd like to try and make it deeper.
  3. Would you be interested in doing a user interview with us some time so we can find out more about how you use vizro and how we can improve it in future? 🙏

@gtauzin (Contributor, Author) commented Feb 5, 2025

My pleasure @antonymilne. Let me answer your very first point now, since I have already asked this question on the kedro Slack, and in the interest of the time of the people who have already answered me there.

I will answer the rest of your points later on, as I want to take the time to address them properly.

If you're happy to raise a PR for this then that would be amazing. I'm very happy to support with it and offer any help that you need over github issues or a Zoom call or anything - just let me know. A few things to note upfront:

  1. Is there a nice way to get catalog.list() or something else to include dataset factory datasets for versions kedro<0.19.9? @astrojuanlu @ElenaKhaustova

    • If yes: let's do it that way.
    • If no: we'll bump our kedro requirement to kedro>=0.19.9 and use KedroDataCatalog instead of DataCatalog throughout _data_manager.py. I see it's an experimental feature but our usage is so simple that hopefully it should all be fine.

First, thank you. I really appreciate how supportive the vizro team is (and the kedro team for that matter)!

From @ankatiyar's answer on Slack:

The factory datasets are lazy so they don’t show up in catalog.list() (Discussion in https://github.com/kedro-org/kedro/issues/3312)

With the new catalog you can do -

catalog["<dataset_name>"] 

And it’ll resolve and get you the factory dataset

for dataset in pipelines['__default__'].datasets():
  catalog.exists(dataset) # or catalog.get_dataset(dataset)

# now it'll show up
catalog.list()

So, as I understand it, this is a simple workaround that should work for kedro<0.19.9:

from kedro.framework.project import pipelines
from vizro.integrations import kedro as kedro_integration

catalog = kedro_integration.catalog_from_project(project_path="/path/to/kedro/project", env="base")

for dataset in pipelines["__default__"].datasets():
    catalog.exists(dataset)

datasets = kedro_integration.datasets_from_catalog(catalog)

I tested this workaround, and it works well for me.

In practice, that would simply mean that adding those two lines to kedro_integration.datasets_from_catalog would fix the issue.
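For illustration, a minimal sketch of what that patch could look like for kedro<0.19.9. This is not the actual vizro implementation, just the workaround above folded into a datasets_from_catalog-shaped function; the pandas check mirrors the existing "pandas" in dataset.__module__ logic mentioned earlier:

from kedro.framework.project import pipelines
from kedro.io import DataCatalog


def datasets_from_catalog(catalog: DataCatalog):
    # Touch every dataset referenced by the default pipeline so that names matching
    # a dataset factory pattern get materialized and appear in catalog.list().
    for dataset_name in pipelines["__default__"].datasets():
        catalog.exists(dataset_name)

    datasets = {}
    for dataset_name in catalog.list():
        dataset = catalog._get_dataset(dataset_name, suggest=False)
        if "pandas" in dataset.__module__:
            datasets[dataset_name] = dataset.load
    return datasets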

@antonymilne (Contributor)

Thanks for the answer! This makes sense, although I have some follow-on questions (no problem if you can't answer them @gtauzin, I can take it up with the kedro team)... I looked through the source of KedroDataCatalog but couldn't quite figure it out myself.

If I do KedroDataCatalog.keys() then does kedro materialize all the lazy datasets? How does it do that? Does it need a specific pipeline like __default__ or does it look over all pipelines somehow? Can you point me to the code that does this? 🙏 @ankatiyar

The reason this slightly matters is that the above code that works for kedro<0.19.9 requires a pipeline to be set. There are then three sensible options I think:

  1. just use __default__ like you do above, which feels a little bit wrong since it's not given that __default__ will include all the datasets in your project. Realistically that works fine in 99% of cases though, so is probably ok in reality.
  2. we expose a pipeline argument (that defaults to __default__) to allow users to change that. Would solve the 1% of edge cases but maybe adds a small amount of unnecessary complexity since it's such an edge case.
  3. do something like union(*pipeline.datasets for pipeline in pipelines.values()) (untested but you get the idea) to get datasets from across all the pipelines.
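For reference, a rough (untested) sketch of option 3, collecting dataset names across every registered pipeline rather than just __default__ (this relies on Pipeline supporting addition, so summing the registered pipelines yields one combined pipeline):

from kedro.framework.project import pipelines

# Combine all registered pipelines and collect every dataset name they reference.
combined_pipeline = sum(pipelines.values())
all_dataset_names = combined_pipeline.datasets()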

Hence I am wondering how KedroDataCatalog.keys(), which doesn't take a pipeline argument, handles this 🤔 Since it might be best if we emulate that behaviour.

@ankatiyar

One thing to keep in mind is that catalog.exists(dataset) calls the exists function from the specific dataset, which might cause some issues (see kedro-org/kedro-viz#1645).
Kedro Viz currently uses the catalog._get_dataset() internal method to fetch and resolve the datasets (with the old catalog)
https://github.com/kedro-org/kedro-viz/blob/60b6a811cddaf201ed6e83be550adfb78c03620e/package/kedro_viz/data_access/managers.py#L79-L96

For the new KedroDataCatalog it might suffice to do catalog.get('dataset_name')

@ElenaKhaustova

I see there's confusion between datasets, lazy datasets and patterns. We'll try to communicate the difference better when switching to the new catalog. Here are some things to keep in mind:

  1. Datasets are objects stored in the catalog and instantiated from the configured class, which inherits from AbstractDataset.
  2. _LazyDataset objects (added in the new catalog) store dataset configurations: when the catalog is created, the actual dataset object is not instantiated; instead, we store its configuration in a _LazyDataset. The actual dataset object is instantiated when you access the dataset itself (get the object from the catalog, or before the pipeline run).

@antonymilne so KedroDataCatalog.keys() does not materialize datasets, since you access names but not the objects, but if you do KedroDataCatalog["ds_name"], KedroDataCatalog.get(), KedroDataCatalog.values() or KedroDataCatalog.items() they will be materialized.

  3. Patterns are just templates for creating datasets. They are stored separately from datasets and resolved inside the session, before the run, when we already have access to the pipeline. As mentioned above, patterns can only be resolved when you have pipelines.
  4. For KedroDataCatalog, patterns can be accessed via config_resolver like so: https://docs.kedro.org/en/stable/data/kedro_data_catalog.html#how-to-access-dataset-patterns
  5. To resolve patterns with regard to a pipeline there's a CLI command, kedro catalog list, but for now there's no separate API to call the same method programmatically. However, it is on our list and will be resolved when migrating to the new catalog. Here is the solution proposed: [DataCatalog2.0]: Update KedroDataCatalog CLI logic and make it reusable kedro-org/kedro#3312 (comment)
  6. The existing catalog.list() will most probably be replaced by catalog.filter(): Implement catalog filter for KedroDataCatalog kedro-org/kedro#4449
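To make the distinction concrete, a small sketch, assuming a KedroDataCatalog built from a factory-based catalog like the one in this issue (the dataset names are illustrative; only calls already mentioned in this thread are used):

# Accessing names only does not materialize lazy datasets:
catalog.keys()

# Accessing the dataset object does materialize it:
catalog["companies#csv"]  # or catalog.get("companies#csv")

# Patterns live in the config resolver, separately from the datasets themselves:
config_resolver = catalog.config_resolver
config_resolver.config  # configuration of the declared catalog entries
config_resolver.resolve_pattern("reviews#csv")  # empty dict if no pattern matches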

@antonymilne (Contributor)

Thanks for the comments @ankatiyar and @ElenaKhaustova! Indeed I had assumed that lazy datasets were the same thing as dataset patterns, but part of the confusion came from me misreading things so it's not necessarily a problem with the docs!

Let me explain how I understand things now and hopefully you can say if I got anything wrong here:

  1. dataset patterns work exactly the same way with the old DataCatalog and the new KedroDataCatalog
  2. the common interface these follow is kedro.io.core.CatalogProtocol (though DataCatalog doesn't subclass it - is there a reason why?)
  3. given that the choice of DATA_CATALOG_CLASS is set by the kedro user, on vizro we need to support both DataCatalog and KedroDataCatalog. Essentially this means using only methods in CatalogProtocol and not using the new keys() etc. functionality.
  4. catalog.list() gives just datasets but no dataset patterns
  5. it only makes sense to resolve dataset patterns with respect to some dataset name, which for our purposes ultimately means pipeline.datasets
  6. kedro catalog list is quite thorough (I just tried it and read the source): it goes through every registered pipeline and for each gives the datasets in the pipeline (regardless of whether they're in the catalog), the datasets in the catalog but not in the pipeline, and the datasets mentioned in the pipeline that come from a factory
  7. at the moment there's no API to do this but in future maybe session.catalog_list() or similar will do it

Let's say the catalog contains datasets A, B and pattern P, and we have a pipeline that contains A, C, D, where C resolves to P but D does not (and hence will be a memory dataset or whatever the default is). Then the options for finding "all" datasets would be:

  • catalog.list() ➡ A, B
  • pipeline.datasets() ➡ A, C, D
  • kedro catalog list would give all the above in the right categories

Am I correct so far?! 😅

Currently on vizro we use catalog.list() and so miss C and D above (hence the creation of this issue). If we swap to doing pipeline.datasets() then we miss B (defined in catalog but not used in any pipeline). So I need to think about whether we should do catalog.list() union pipeline.datasets() or whether just pipeline.datasets() is ok on vizro.

@ElenaKhaustova do you know what a future session.catalog_list() would be likely to do here? I guess return all A, B, C, D? If so then I'll probably just match that behaviour upfront on vizro.

The other question would be: when you do session.catalog_list(), what pipeline do you use for it? We will need to try and match this on vizro too. As above, there are maybe a few sensible options:

  1. Have mandatory pipelines argument - I'd prefer not to do this on vizro anyway but ok on kedro I guess
  2. Have optional pipelines argument with default as __default__ - ok by me but doesn't handle some edge cases
  3. Have optional pipelines argument with default as something like sum(pipelines.values()) to get datasets from across all the pipelines - probably best option. In most cases just the same as option 2 but more general.

wdyt? 🙏

@ElenaKhaustova commented Feb 5, 2025

I tried to clarify things with my previous reply but it looks like I achieved the opposite effect 😅

  1. Yes
  2. We intentionally moved to a protocol abstraction, which doesn't require inheritance, so that it is possible to create and use custom catalogs that do not depend on Kedro. We had to include the DataCatalog API in CatalogProtocol for compatibility, but it will be updated based on KedroDataCatalog when replacing the catalog (kedro 1.0.0).
  3. It makes sense for now
  4. Correct. It will be removed for KedroDataCatalog and replaced with filter().
  5. Yes
  6. Yes, though you can specify a pipeline name as well; by default all pipelines are taken. Probably kedro catalog resolve is closer to what you're looking for, but it's also only available as a CLI command for now.
  7. There is an API, in the sense that we already do it inside the CLI commands logic and when running pipelines: one can resolve via config_resolver or via get_dataset. But there's no API to do it for a pipeline within one function call. Yes, we plan to rework all CLI commands and make it possible to get the same result programmatically. However, this won't be part of the catalog API as we don't want to couple pipeline and catalog.

If you have any ideas about what you would like as an output, we can consider them as well. Now is a perfect time for that 🙂

Let's say the catalog contains datasets A, B and pattern P, and we have a pipeline that contains A, C, D, where C resolves to P but D does not (and hence will be memory dataset or whatever the default is). Then the options for finding "all" datasets would be:

  • both catalog.list() and pipeline.datasets() will give you A, B before you resolve C (run the pipeline or get the dataset) and A, B, C after. But the first gives names only, while datasets gives objects.
  • kedro catalog list would give all of the above: yes

@ElenaKhaustova

@ElenaKhaustova do you know what a future session.catalog_list() would be likely to do here? I guess return all A, B, C, D? If so then I'll probably just match that behaviour upfront on vizro.

From how I see it now, we will do the same thing as in the CLI command: accept an optional pipelines argument or use all pipelines if one is not provided. I guess that's option 3 that you mentioned above.

@antonymilne (Contributor)

I had a chat with @ElenaKhaustova this afternoon (thanks again Elena 🙏) and had another think about this, and here's the plan @gtauzin. It's a bit trickier than I first thought, because the assumed flow on vizro is that you first define a kedro catalog (in one of three ways) and then use datasets_from_catalog. Now that it's possible to get datasets from dataset factories too, we need to feed this function a new optional pipeline argument as well.

While the kedro catalog can come from three different places, the kedro pipeline can be found in two ways:

  1. bootstrap_project(project_path) and then from kedro.framework.project import pipelines. Let's wrap this into a function pipelines_from_project to match vizro's catalog_from_project
  2. or it's there for you already in Jupyter/iPython context because the above is already done for you

So to match our existing scheme of datasets_from_catalog I'd like to do this as two functions. This is not thoroughly tested but something like this hopefully works ok 🤞

from pathlib import Path
from typing import Optional, Union

from kedro.framework.startup import bootstrap_project
from kedro.io.core import CatalogProtocol
from kedro.pipeline import Pipeline


def pipelines_from_project(project_path: Union[str, Path]) -> dict[str, Pipeline]:
    bootstrap_project(project_path)
    from kedro.framework.project import pipelines

    return pipelines


# pd_DataFrameCallable is vizro's existing internal type alias used in _data_manager.py.
def datasets_from_catalog(catalog: CatalogProtocol, *, pipeline: Optional[Pipeline] = None) -> dict[str, pd_DataFrameCallable]:
    # This doesn't include things added to the catalog at run time but that is ok for our purposes.
    config_resolver = catalog.config_resolver
    kedro_datasets = config_resolver.config.copy()

    if pipeline is not None:
        # Go through all dataset names that weren't in the catalog and try to resolve them. Those that cannot be
        # resolved give an empty dictionary and are ignored.
        for dataset_name in set(pipeline.datasets()) - set(kedro_datasets):
            if dataset_config := config_resolver.resolve_pattern(dataset_name):
                kedro_datasets[dataset_name] = dataset_config

    vizro_data_sources = {}

    for dataset_name, dataset_config in kedro_datasets.items():
        # "type" key always exists because we filtered out patterns that resolve to an empty dictionary above.
        if "pandas" in dataset_config["type"]:
            # TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
            #  but need to check if it works with caching.
            dataset = catalog._get_dataset(dataset_name, suggest=False)
            vizro_data_sources[dataset_name] = dataset.load

    return vizro_data_sources

Intended usage is like this:

# This still works same as before:
datasets_from_catalog(catalog=catalog)

pipelines = pipelines_from_project("/path/to/kedro/project")
# or `pipelines` is already there in kedro ipython context

# This provides both catalog *and* pipeline datasets that use patterns:
datasets_from_catalog(catalog=catalog, pipeline=pipelines["__default__"])
# or use pipeline=pipelines["data_science"] for one specific pipeline
# or use pipeline=sum(pipelines.values()) for all registered pipelines

The tests which currently exist for this aren't great so let's start again with them. We should update test_catalog.yml like this:

"{P}1":
  type: pandas.CSVDataset
  filepath: {P}.csv

A:
  type: pandas.ExcelDataset
  filepath: A.xlsx

B:
  type: pandas.ParquetDataset
  filepath: B.parquet

Z:
  type: pandas.ParquetDataset
  filepath: Z.pkl

In test_kedro_data_manager.py define catalog and pipeline like this:

import yaml
from kedro.io import DataCatalog
from kedro.pipeline import node, pipeline

catalog = DataCatalog.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))
pipeline = pipeline(
    [
        node(
            func=lambda *args: None,
            inputs=["A", "C1", "D", "Z", "parameters", "params:z"],
            outputs=None,
        ),
    ]
)

And then minimal test cases would be:

# test 1
assert set(datasets_from_catalog(catalog)) == {"A", "B"}

# test 2
assert set(datasets_from_catalog(catalog, pipeline=pipeline)) == {"A", "B", "C1"}

Phew, that was more code than I expected to write! But this change turned out to be a bit more involved than I had first expected. @gtauzin please do feel free to proceed with the PR now that I think we've figured out the correct way to do this.

@gtauzin (Contributor, Author) commented Feb 7, 2025

Thank you for the insightful exchange! Sounds like a nice plan, I'll get to the PR very soon.

@antonymilne I have a few points:

  • Given it still seems there could be some uncertainties related to the future API of KedroDataCatalog, should I write tests based on KedroDataCatalog catalogs (as well as DataCatalog catalogs)? That way they'll serve as a reminder to update the code if anything changes.
  • Could you please help me double-check your tests?
"{P}1":
  type: pandas.CSVDataset
  filepath: {P}.csv

A:
  type: pandas.ExcelDataset
  filepath: A.xlsx

B:
  type: pandas.ParquetDataset
  filepath: B.parquet

Z:
  type: pandas.ParquetDataset
  filepath: Z.pkl

In test_kedro_data_manager.py define catalog and pipeline like this:

catalog = DataCatalog.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))
pipeline = pipeline(
        [
            node(
                func=lambda *args: None,
                inputs=["A", "C1", "D", "Z", "parameters", "params:z"],
                outputs=None,
            ),
        ]
    )

I think inputs should be inputs=["A", "C1", "B", "Z", "parameters", "params:z"], i.e. "B" instead of "D", as "D" is not defined in the catalog.

And then minimal test cases would be:

# test 1
assert set(datasets_from_catalog(catalog)) == {"A", "B"}

# test 2
assert set(datasets_from_catalog(catalog, pipeline=pipeline)) == {"A", "B", "C1"}

Shouldn't test 1 return {"A", "B", "Z"} and test 2 {"A", "B", "Z", "C1"}?

  • As DataCatalog does not inherit from CatalogProtocol, shouldn't the catalog type hint be something like catalog: DataCatalog | CatalogProtocol? Otherwise I think static type checkers will complain if the function is called with an instance of DataCatalog.
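For example, a possible widened signature might look like this (purely a sketch of the suggestion, not the final API; the optional pipeline argument follows the plan above):

from typing import Optional, Union

from kedro.io import DataCatalog
from kedro.io.core import CatalogProtocol
from kedro.pipeline import Pipeline


def datasets_from_catalog(
    catalog: Union[DataCatalog, CatalogProtocol],
    *,
    pipeline: Optional[Pipeline] = None,
):
    ...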

@antonymilne I also wanted to answer the questions in your first message :)

  2. I rewrote our contributing guidelines recently so please do let me know if it's all clear or you need any help!
  3. We have some simple tests that should be expanded to include the dataset factory case: https://github.com/mckinsey/vizro/blob/main/vizro-core/tests/unit/vizro/integrations/kedro/test_kedro_data_manager.py.

Sounds good, thank you!

And a few tangentially related points...

  1. When we get around to doing Should vizro support polars (or other dataframes besides pandas)? #286 (hopefully Q1 this year) we'll need to work out how to expand the if "pandas" in dataset.__module__ logic.

Yes, I've seen plotly (and more recently bokeh) moving to narwhals and I use narwhals myself, so I am excited about this change!

  2. I'd love to hear any more feedback/ideas you have for the kedro integration. At the moment it's pretty lightweight but I'd like to try and make it deeper.

Actually, I've been thinking of opening an issue on the kedro issue tracker to discuss this. Here are a few ideas:

  1. How should I think about integrating my vizro dashboards in my kedro projects?

Somehow, if you use both together, your dashboards are fed by kedro pipelines and take kedro datasets in. They are part of the same data science project, so it makes sense to me to deal with them together within a monorepo.

For now, I put them in a dashboards subfolder at the same level as pipelines

.
├── conf
├── data
├── docs
├── notebooks
├── pyproject.toml
├── README.md
├── src
│   └── my_kedro_project
│       ├── dashboards
│       │   ├── __init__.py
│       │   └── iris
│       ├── hooks.py
│       ├── __init__.py
│       ├── pipeline_registry.py
│       ├── pipelines
│       │   ├── data_processing
│       │   ├── data_science
│       │   ├── __init__.py
│       └── settings.py
└── tests

I organize dependencies using the optional-dependencies section of my pyproject.toml, e.g. I'll have one set of dependencies for "dashboards.iris". Note this relates to kedro-org/kedro#4147

  2. If the vizro dashboards are "nodes" in your pipeline, then I'd love to be able to see them within my kedro viz visualization.

  3. I would love to be able to use the CLI to create new vizro apps within my kedro project, the same way I create new pipelines: kedro vizro create my_dashboard

  4. kedro has kedro-docker, which helps containerize your pipelines; I'd love to have it work for vizro dashboards as well.

If there are more discussions on that topic, I'd love to follow them! Do you have any concrete plans for the kedro/vizro integration? It seems to me a kedro plugin for vizro would be really nice.

  3. Would you be interested in doing a user interview with us some time so we can find out more about how you use vizro and how we can improve it in future? 🙏

Would love to connect with you and the vizro team if you think that would be useful!

@antonymilne (Contributor)

@gtauzin Just so we don't lose track of this when your PR gets merged, I've split the ongoing discussion off into a new issue: #1008. Will follow up on it all over there!

Would love to connect with you and the vizro team if you think that would be useful!

Amazing! We would love to talk to you some time. Probably the easiest way is to send me a message on the kedro slack channel - I'm not very active there these days but I should see a notification if you ping me there 🙂 Or feel free to drop me an email at [email protected].

@antonymilne (Contributor)

This was released with vizro==0.1.34 🎉
