Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] Enable datasets_from_catalog to return factory-based datasets #1001

Merged
merged 17 commits into from
Feb 12, 2025
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 47 additions & 0 deletions vizro-core/changelog.d/20250208_114146_4648633+gtauzin.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
<!--
A new scriv changelog fragment.

Uncomment the section that is right (remove the HTML comment wrapper).
-->

<!--
### Highlights ✨

- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
<!--
### Removed

- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
<!--
### Added

- A bullet item for the Added category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
<!--
### Changed

- A bullet item for the Changed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
<!--
### Deprecated

- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->

### Fixed

- Fix a bug where datasets generated by dataset factories would not be returned by `kedro_integration.datasets_from_catalog`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))

<!--
### Security

- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
2 changes: 1 addition & 1 deletion vizro-core/docs/pages/explanation/authors.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@

<!-- vale off -->

[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).
[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Guillaume Tauzin](https://github.com/gtauzin), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).

with thanks to Sam Bourton and Kevin Staight for sponsorship, inspiration and guidance,

Expand Down
31 changes: 27 additions & 4 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ pip install vizro[kedro]

## Use datasets from the Kedro Data Catalog

`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:
`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [DataCatalog](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [KedroDataCatalog](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
Expand All @@ -23,6 +23,19 @@ for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).it
data_manager[dataset_name] = dataset
```

To add datasets that are defined using the [Kedro dataset factory](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), `datasets_from_catalog` needs to access the pipelines that use them.

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


pipeline = pipelines.get("my_pipeline_name")

for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog, pipeline=pipeline).items():
data_manager[dataset_name] = dataset
```

This imports all datasets of type [`kedro_datasets.pandas`](https://docs.kedro.org/en/stable/kedro_datasets.html) from the Kedro `catalog` into the Vizro `data_manager`.

The `catalog` variable may have been created in a number of different ways:
Expand All @@ -31,6 +44,11 @@ The `catalog` variable may have been created in a number of different ways:
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `catalog`.
1. Data Catalog configuration file (`catalog.yaml`). This can create a `catalog` entirely independently of a Kedro project using [`kedro.io.DataCatalog.from_config`](https://docs.kedro.org/en/stable/kedro.io.DataCatalog.html#kedro.io.DataCatalog.from_config).

Conversely, the `pipelines` variable may have been created the following ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate a `pipelines` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog into the Vizro data manager"
Expand All @@ -39,10 +57,13 @@ The full code for these different cases is given below.
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
pipelines = kedro_integration.catalog_from_project(project_path)

catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")
pipeline = pipelines.get("my_pipeline")

for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog, pipeline=pipeline).items():
data_manager[dataset_name] = dataset
```

Expand All @@ -51,7 +72,9 @@ The full code for these different cases is given below.
from vizro.managers import data_manager


for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog).items():
pipeline = pipelines.get("my_pipeline")

for dataset_name, dataset in kedro_integration.datasets_from_catalog(catalog, pipeline=pipeline).items():
data_manager[dataset_name] = dataset
```

Expand Down
44 changes: 35 additions & 9 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,25 +3,51 @@

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.io import DataCatalog
from kedro.io import CatalogProtocol, KedroDataCatalog
from kedro.pipeline import Pipeline

from vizro.managers._data_manager import pd_DataFrameCallable


def catalog_from_project(
project_path: Union[str, Path], env: Optional[str] = None, extra_params: Optional[dict[str, Any]] = None
) -> DataCatalog:
) -> CatalogProtocol | KedroDataCatalog:
bootstrap_project(project_path)
with KedroSession.create(
project_path=project_path, env=env, save_on_close=False, extra_params=extra_params
) as session:
return session.load_context().catalog


def datasets_from_catalog(catalog: DataCatalog) -> dict[str, pd_DataFrameCallable]:
datasets = {}
for name in catalog.list():
dataset = catalog._get_dataset(name, suggest=False)
if "pandas" in dataset.__module__:
datasets[name] = dataset.load
return datasets
def pipelines_from_project(project_path: Union[str, Path]) -> Pipeline:
bootstrap_project(project_path)
from kedro.framework.project import pipelines

return pipelines


def datasets_from_catalog(
catalog: CatalogProtocol | KedroDataCatalog, *, pipeline: Pipeline = None
) -> dict[str, pd_DataFrameCallable]:
# This doesn't include things added to the catalog at run time but that is ok for our purposes.
config_resolver = catalog.config_resolver
kedro_datasets = config_resolver.config.copy()

if pipeline is not None:
# Go through all dataset names that weren't in catalog and try to resolve them. Those that cannot be
# resolved give an empty dictionary and are ignored.
for dataset_name in set(pipeline.datasets()) - set(kedro_datasets):
if dataset_config := config_resolver.resolve_pattern(dataset_name):
kedro_datasets[dataset_name] = dataset_config

vizro_data_sources = {}

for dataset_name, dataset_config in kedro_datasets.items():
# "type" key always exists because we filtered out patterns that resolve to empty dictionary above.
if "pandas" in dataset_config["type"]:
# TODO: in future update to use lambda: catalog.load(dataset_name) instead of _get_dataset
# but need to check if works with caching.
dataset = catalog._get_dataset(dataset_name, suggest=False)
vizro_data_sources[dataset_name] = dataset.load

return vizro_data_sources
Original file line number Diff line number Diff line change
@@ -1,7 +1,19 @@
companies:
type: pandas.JSONDataset
filepath: companies.json
"{pandas_factory}1":
type: pandas.CSVDataset
filepath: ./{pandas_factory}.csv

reviews:
type: pickle.PickleDataset
filepath: reviews.pkl
pandas_excel:
type: pandas.ExcelDataset
filepath: pandas_excel.xlsx

pandas_parquet:
type: pandas.ParquetDataset
filepath: pandas_parquet.parquet

polars:
type: polars.CSVDataset
filepath: polars.csv

not_dataframe:
type: picke.PickleDataset
filepath: pickle.pkl
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@

kedro = pytest.importorskip("kedro")

import kedro.pipeline as kp # noqa: E402
from kedro.io import DataCatalog # noqa: E402

from vizro.integrations.kedro import datasets_from_catalog # noqa: E402
Expand All @@ -20,6 +21,25 @@ def catalog_path():

def test_datasets_from_catalog(catalog_path):
catalog = DataCatalog.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))
assert "companies" in datasets_from_catalog(catalog)
assert isinstance(datasets_from_catalog(catalog), dict)
assert isinstance(datasets_from_catalog(catalog)["companies"], types.MethodType)

datasets = datasets_from_catalog(catalog)
assert isinstance(datasets, dict)
assert set(datasets) == {"pandas_excel", "pandas_parquet"}
for dataset in datasets.values():
assert isinstance(dataset, types.MethodType)


def test_datasets_from_catalog_with_pipeline(catalog_path):
catalog = DataCatalog.from_config(yaml.safe_load(catalog_path.read_text(encoding="utf-8")))
pipeline = kp.pipeline(
[
kp.node(
func=lambda *args: None,
inputs=["pandas_excel", "C1", "polars", "Z", "parameters", "params:z"],
outputs=["pandas_parquet", "not_dataframe"],
),
]
)

datasets = datasets_from_catalog(catalog, pipeline=pipeline)
assert set(datasets) == {"pandas_excel", "pandas_parquet", "C1"}
Loading