[Feat] Enable datasets_from_catalog to return factory-based datasets #1001

Merged: 17 commits, Feb 12, 2025
39 changes: 39 additions & 0 deletions vizro-core/changelog.d/20250208_114146_4648633+gtauzin.md
@@ -0,0 +1,39 @@
<!--
A new scriv changelog fragment.

Uncomment the section that is right (remove the HTML comment wrapper).
-->

<!--
### Highlights ✨

- A bullet item for the Highlights ✨ category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
<!--
### Removed

- A bullet item for the Removed category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
### Added

- Kedro integration function `datasets_from_catalog` can handle dataset factories for `kedro>=0.19.9`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))

### Changed

- Bump optional dependency lower bound to `kedro>=0.19.0`. ([#1001](https://github.com/mckinsey/vizro/pull/1001))

<!--
### Deprecated

- A bullet item for the Deprecated category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->

<!--
### Security

- A bullet item for the Security category with a link to the relevant PR at the end of your entry, e.g. Enable feature XXX. ([#1](https://github.com/mckinsey/vizro/pull/1))

-->
2 changes: 1 addition & 1 deletion vizro-core/docs/pages/explanation/authors.md
@@ -10,7 +10,7 @@

<!-- vale off -->

[Ann Marie Ward](https://github.com/AnnMarieW), [Anna Xiong](https://github.com/Anna-Xiong), [Annie Wachsmuth](https://github.com/anniecwa), [ataraexia](https://github.com/ataraexia), [axa99](https://github.com/axa99), [Bhavana Sundar](https://github.com/bhavanaeh), [Bo Xu](https://github.com/boxuboxu), [Chiara Pullem](https://github.com/chiara-sophie), [Denis Lebedev](https://github.com/DenisLebedevMcK), [Elena Fridman](https://github.com/EllenWie), [Ferida Mohammed](https://github.com/feridaaa), [Guillaume Tauzin](https://github.com/gtauzin), [Hamza Oza](https://github.com/hamzaoza), [Hansaem Park](https://github.com/sammitako), [Hilary Ivy](https://github.com/hxe00570), [Jasmine Wu](https://github.com/jazwu), [Jenelle Yonkman](https://github.com/yonkmanjl), [Jingjing Guo](https://github.com/jjguo-mck), [Juan Luis Cano Rodríguez](https://github.com/astrojuanlu), [Kee Wen Ng](https://github.com/KeeWenNgQB), [Leon Nallamuthu](https://github.com/leonnallamuthu), [Lydia Pitts](https://github.com/LydiaPitts), [Manuel Konrad](https://github.com/manuelkonrad), [Ned Letcher](https://github.com/ned2), [Nikolaos Tsaousis](https://github.com/tsanikgr), [njmcgrat](https://github.com/njmcgrat), [Oleksandr Serdiuk](https://github.com/oserdiuk-lohika), [Prateek Bajaj](https://github.com/prateekdev552), [Qiuyi Chen](https://github.com/Qiuyi-Chen), [Rashida Kanchwala](https://github.com/rashidakanchwala), [Riley Dou](https://github.com/rilieo), [Rosheen C.](https://github.com/rc678), [Sylvie Zhang](https://github.com/sylviezhang37), and [Upekesha Ngugi](https://github.com/upekesha).

with thanks to Sam Bourton and Kevin Staight for sponsorship, inspiration and guidance,

2 changes: 1 addition & 1 deletion vizro-core/docs/pages/explanation/faq.md
@@ -95,7 +95,7 @@ Any attempt at a high-level explanation must rely on an oversimplification that

All are great entry points to the world of data apps. If you prefer a top-down scripting style, then Streamlit is a powerful approach. If you prefer full control and customization over callbacks and layouts, then Dash is a powerful approach. If you prefer a configuration approach with in-built best practices, and the potential for customization and scalability through Dash, then Vizro is a powerful approach.

For a more detailed comparison, it may help to read introductory articles about [Dash](https://medium.com/plotly/introducing-dash-5ecf7191b503), [Streamlit](https://blog.streamlit.io/streamlit-101-python-data-app/) and [Vizro](https://quantumblack.medium.com/introducing-vizro-a-toolkit-for-creating-modular-data-visualization-applications-3a42f2bec4db), to see how each tool serves a distinct purpose.

## How does Vizro compare with Python packages and business intelligence (BI) tools?

62 changes: 54 additions & 8 deletions vizro-core/docs/pages/user-guides/kedro-data-catalog.md
@@ -12,7 +12,7 @@ pip install vizro[kedro]

## Use datasets from the Kedro Data Catalog

`vizro.integrations.kedro` provides functions to help generate and process a [Kedro Data Catalog](https://docs.kedro.org/en/stable/data/index.html). It supports both the original [`DataCatalog`](https://docs.kedro.org/en/stable/data/data_catalog.html) and the more recently introduced [`KedroDataCatalog`](https://docs.kedro.org/en/stable/data/index.html#kedrodatacatalog-experimental-feature). Given a Kedro Data Catalog `catalog`, the general pattern to add datasets into the Vizro data manager is:

```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```

@@ -39,20 +39,21 @@ The full code for these different cases is given below.
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager

catalog = kedro_integration.catalog_from_project("/path/to/kedro/project")

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```

=== "app.py (Data Catalog configuration file)"
@@ -66,6 +67,51 @@

catalog = DataCatalog.from_config(yaml.safe_load(Path("catalog.yaml").read_text(encoding="utf-8")))

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(catalog).items():
    data_manager[dataset_name] = dataset_loader
```
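Note that in every variant above, `datasets_from_catalog` returns zero-argument loader callables (each dataset's `.load` method), not dataframes: the data manager stores the callable and defers loading until the data is first needed. A stdlib-only sketch of that contract, with a plain dictionary standing in for Vizro's `data_manager` (illustrative only, not Vizro's implementation):

```python
# Registration stores the loader callable itself, so reading the underlying
# file is deferred until the app actually requests the data.
data_manager = {}


def register(name, loader):
    data_manager[name] = loader


# Stand-in for a Kedro dataset's `.load` method.
register("companies", lambda: [{"id": 1}, {"id": 2}])

# Data is materialized only at call time:
rows = data_manager["companies"]()
assert rows == [{"id": 1}, {"id": 2}]
```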

### Use dataset factories

To add datasets that are defined using a [Kedro dataset factory](https://docs.kedro.org/en/stable/data/kedro_dataset_factories.html), `datasets_from_catalog` needs to resolve dataset patterns against explicit datasets. Given a Kedro `pipelines` dictionary, you should specify a `pipeline` argument as follows:

```python
kedro_integration.datasets_from_catalog(catalog, pipeline=pipelines["__default__"]) # (1)!
```

1. You can specify the name of your pipeline, for example `pipelines["my_pipeline"]`, or even combine multiple pipelines with `pipelines["a"] + pipelines["b"]`. The Kedro `__default__` pipeline is what runs by default with the `kedro run` command.

The `pipelines` variable may have been created in the following ways:

1. Kedro project path. Vizro exposes a helper function `vizro.integrations.kedro.pipelines_from_project` to generate `pipelines` given the path to a Kedro project.
1. [Kedro Jupyter session](https://docs.kedro.org/en/stable/notebooks_and_ipython/kedro_and_notebooks.html). This automatically exposes `pipelines`.

The full code for these different cases is given below.

!!! example "Import a Kedro Data Catalog with dataset factories into the Vizro data manager"
=== "app.py (Kedro project path)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


project_path = "/path/to/kedro/project"
catalog = kedro_integration.catalog_from_project(project_path)
pipelines = kedro_integration.pipelines_from_project(project_path)

for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
    catalog, pipeline=pipelines["__default__"]
).items():
    data_manager[dataset_name] = dataset_loader
```

=== "app.ipynb (Kedro Jupyter session)"
```python
from vizro.integrations import kedro as kedro_integration
from vizro.managers import data_manager


for dataset_name, dataset_loader in kedro_integration.datasets_from_catalog(
    catalog, pipeline=pipelines["__default__"]
).items():
    data_manager[dataset_name] = dataset_loader
```
12 changes: 3 additions & 9 deletions vizro-core/hatch.toml
@@ -3,13 +3,6 @@
[[envs.all.matrix]]
python = ["3.9", "3.10", "3.11", "3.12", "3.13"]

[envs.all.overrides]
# Kedro is currently not compatible with Python 3.13 and returns exceptions when trying to run the unit tests on
# Python 3.13. These exceptions turned out to be difficult to ignore: https://github.com/mckinsey/vizro/pull/216
matrix.python.features = [
{value = "kedro", if = ["3.9", "3.10", "3.11", "3.12"]}
]

[envs.changelog]
dependencies = ["scriv"]
detached = true
@@ -37,6 +30,7 @@ dependencies = [
"pyhamcrest",
"gunicorn"
]
features = ["kedro"]
installer = "uv"

[envs.default.env-vars]
@@ -133,9 +127,9 @@ extra-dependencies = [
"dash==2.18.0",
"plotly==5.24.0",
"pandas==2.0.0",
"numpy==1.23.0", # Need numpy<2 to work with pandas==2.0.0. See https://stackoverflow.com/questions/78634235/.
"kedro==0.19.0" # Includes kedro-datasets as a dependency.
]
features = ["kedro"]
python = "3.9"

[publish.index]
5 changes: 3 additions & 2 deletions vizro-core/pyproject.toml
@@ -25,7 +25,8 @@ dependencies = [
"flask_caching>=2",
"wrapt>=1",
"black",
"autoflake",
"packaging"
]
description = "Vizro is a package to facilitate visual analytics."
dynamic = ["version"]
@@ -36,7 +37,7 @@ requires-python = ">=3.9"

[project.optional-dependencies]
kedro = [
"kedro>=0.19.0",
"kedro-datasets" # no longer a dependency of kedro for kedro>=0.19.2
]

3 changes: 2 additions & 1 deletion vizro-core/src/vizro/__init__.py
@@ -5,6 +5,7 @@

import plotly.io as pio
from dash.development.base_component import ComponentRegistry
from packaging.version import parse

from ._constants import VIZRO_ASSETS_PATH
from ._vizro import Vizro, _make_resource_spec
@@ -23,7 +24,7 @@
# This would only be the case where you need to test something with serve_locally=False and have changed
# assets compared to main. In this case you need to push your assets changes to remote for the CDN to update,
# and it might also be necessary to clear the CDN cache: https://www.jsdelivr.com/tools/purge.
_git_branch = __version__ if not parse(__version__).is_devrelease else "main"
BASE_EXTERNAL_URL = f"https://cdn.jsdelivr.net/gh/mckinsey/vizro@{_git_branch}/vizro-core/src/vizro/"
# Enables the use of our own Bootstrap theme in a pure Dash app with `external_stylesheets=vizro.bootstrap`.
bootstrap = f"{BASE_EXTERNAL_URL}static/css/vizro-bootstrap.min.css"
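This change swaps the brittle substring test (`"dev" in __version__`) for PEP 440-aware parsing via `packaging.version.parse(...).is_devrelease`. The distinction can be sketched with a simplified stdlib check (a rough approximation of the PEP 440 rule, not a substitute for `packaging`):

```python
import re


def looks_like_devrelease(version: str) -> bool:
    # PEP 440 dev releases end in a ".devN" segment, e.g. "0.1.30.dev0".
    # (Simplified: ignores post-release and local segments that full PEP 440 allows.)
    return re.search(r"\.dev\d+$", version) is not None


assert looks_like_devrelease("0.1.30.dev0")
assert not looks_like_devrelease("0.1.30")
# A naive substring check would misclassify a hypothetical local version tag:
assert "dev" in "1.0.0+devbuild" and not looks_like_devrelease("1.0.0+devbuild")
```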
3 changes: 2 additions & 1 deletion vizro-core/src/vizro/_vizro.py
Expand Up @@ -11,6 +11,7 @@
import plotly.io as pio
from dash.development.base_component import ComponentRegistry
from flask_caching import SimpleCache
from packaging.version import parse

import vizro
from vizro._constants import VIZRO_ASSETS_PATH
@@ -209,7 +210,7 @@ def _make_resource_spec(path: Path) -> _ResourceSpec:
# This would only be the case where you need to test something with serve_locally=False and have changed
# assets compared to main. In this case you need to push your assets changes to remote for the CDN to update,
# and it might also be necessary to clear the CDN cache: https://www.jsdelivr.com/tools/purge.
_git_branch = vizro.__version__ if not parse(vizro.__version__).is_devrelease else "main"
BASE_EXTERNAL_URL = f"https://cdn.jsdelivr.net/gh/mckinsey/vizro@{_git_branch}/vizro-core/src/vizro/"

# Get path relative to the vizro package root, where this file resides.
4 changes: 2 additions & 2 deletions vizro-core/src/vizro/integrations/kedro/__init__.py
@@ -1,3 +1,3 @@
from ._data_manager import catalog_from_project, datasets_from_catalog, pipelines_from_project

__all__ = ["catalog_from_project", "datasets_from_catalog", "pipelines_from_project"]
54 changes: 50 additions & 4 deletions vizro-core/src/vizro/integrations/kedro/_data_manager.py
@@ -1,27 +1,73 @@
from __future__ import annotations

from importlib.metadata import version
from pathlib import Path
from typing import TYPE_CHECKING, Any, Optional, Union

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.pipeline import Pipeline
from packaging.version import parse

from vizro.managers._data_manager import pd_DataFrameCallable

if TYPE_CHECKING:
    from kedro.io import CatalogProtocol


def catalog_from_project(
    project_path: Union[str, Path], env: Optional[str] = None, extra_params: Optional[dict[str, Any]] = None
) -> CatalogProtocol:
    bootstrap_project(project_path)
    with KedroSession.create(
        project_path=project_path, env=env, save_on_close=False, extra_params=extra_params
    ) as session:
        return session.load_context().catalog


def pipelines_from_project(project_path: Union[str, Path]) -> Pipeline:
    bootstrap_project(project_path)
    from kedro.framework.project import pipelines

    return pipelines


def _legacy_datasets_from_catalog(catalog: CatalogProtocol) -> dict[str, pd_DataFrameCallable]:
    # The old version of datasets_from_catalog from before https://github.com/mckinsey/vizro/pull/1001.
    # This does not support dataset factories.
    # We keep this version to maintain backwards compatibility with 0.19.0 <= kedro < 0.19.9.
    # Note the pipeline argument does not exist.
    datasets = {}
    for name in catalog.list():
        dataset = catalog._get_dataset(name, suggest=False)
        if "pandas" in dataset.__module__:
            datasets[name] = dataset.load
    return datasets
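The legacy path keeps a dataset only when the implementing class's module path contains `"pandas"` (i.e. it comes from `kedro_datasets.pandas`). A stdlib-only sketch of that filter, with `SimpleNamespace` stubs standing in for real Kedro datasets (the names and module strings are illustrative):

```python
from types import SimpleNamespace


def filter_pandas_loaders(datasets):
    # Mirror the legacy filter: keep `.load` for datasets implemented in a pandas module.
    return {name: ds.load for name, ds in datasets.items() if "pandas" in ds.__module__}


stub_catalog = {
    "companies": SimpleNamespace(load=lambda: "a dataframe", **{"__module__": "kedro_datasets.pandas.csv_dataset"}),
    "model": SimpleNamespace(load=lambda: "a pickle", **{"__module__": "kedro_datasets.pickle.pickle_dataset"}),
}
loaders = filter_pandas_loaders(stub_catalog)
assert set(loaders) == {"companies"}
assert loaders["companies"]() == "a dataframe"
```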


def datasets_from_catalog(
    catalog: CatalogProtocol, *, pipeline: Optional[Pipeline] = None
) -> dict[str, pd_DataFrameCallable]:
    if parse(version("kedro")) < parse("0.19.9"):
        return _legacy_datasets_from_catalog(catalog)

    # This doesn't include things added to the catalog at run time, but that is ok for our purposes.
    config_resolver = catalog.config_resolver
    kedro_datasets = config_resolver.config.copy()

    if pipeline:
        # Go through all dataset names that weren't in the catalog and try to resolve them. Those that cannot be
        # resolved give an empty dictionary and are ignored.
        for dataset_name in set(pipeline.datasets()) - set(kedro_datasets):
            if dataset_config := config_resolver.resolve_pattern(dataset_name):
                kedro_datasets[dataset_name] = dataset_config

    vizro_data_sources = {}

    for dataset_name, dataset_config in kedro_datasets.items():
        # The "type" key always exists because patterns that resolve to an empty dictionary were filtered out above.
        if "pandas" in dataset_config["type"]:
            # TODO: in future, update to use lambda: catalog.load(dataset_name) instead of _get_dataset,
            # but need to check whether that works with caching.
            dataset = catalog._get_dataset(dataset_name, suggest=False)
            vizro_data_sources[dataset_name] = dataset.load

    return vizro_data_sources
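For factory-based entries, the heavy lifting is `config_resolver.resolve_pattern`, which matches a concrete dataset name against a catalog pattern and substitutes the captured placeholder into the config. A toy stand-in for that resolution step (hand-rolled for illustration; Kedro's real resolver handles arbitrary placeholder names and pattern ranking):

```python
import re


def resolve_pattern(dataset_name, patterns):
    # Turn each "{placeholder}" pattern into a regex; on a match, substitute the
    # captured value into every config field. Returns {} when nothing matches.
    for pattern, config in patterns.items():
        regex = "^" + re.escape(pattern).replace(r"\{pandas_factory\}", "(?P<pandas_factory>.+)") + "$"
        if match := re.match(regex, dataset_name):
            value = match.group("pandas_factory")
            return {key: val.replace("{pandas_factory}", value) for key, val in config.items()}
    return {}


patterns = {"{pandas_factory}#csv": {"type": "pandas.CSVDataset", "filepath": "{pandas_factory}.csv"}}
assert resolve_pattern("companies#csv", patterns) == {"type": "pandas.CSVDataset", "filepath": "companies.csv"}
assert resolve_pattern("model", patterns) == {}  # unresolvable names are ignored by the caller
```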
3 changes: 2 additions & 1 deletion vizro-core/tests/unit/test_vizro.py
@@ -2,11 +2,12 @@

import dash
import pytest
from packaging.version import parse

import vizro
from vizro._constants import VIZRO_ASSETS_PATH

_git_branch = vizro.__version__ if not parse(vizro.__version__).is_devrelease else "main"


def test_vizro_bootstrap():
@@ -1,7 +1,15 @@
"{pandas_factory}#csv":
  type: pandas.CSVDataset
  filepath: "{pandas_factory}.csv"

pandas_excel:
  type: pandas.ExcelDataset
  filepath: pandas_excel.xlsx

pandas_parquet:
  type: pandas.ParquetDataset
  filepath: pandas_parquet.parquet

not_dataframe:
  type: pickle.PickleDataset
  filepath: pickle.pkl