Plugin developers and advanced users face limitations due to the absence of public methods for modifying catalog datasets and injecting dynamic behaviour or configuration parameters on the fly during pipeline execution. Although these limitations are intentional, enforced by not providing corresponding public APIs, users bypass them by using private APIs.
We propose to:

- Rethink the concept of keeping DataCatalog immutable.
- Explore the feasibility of providing a public API for modifying the catalog datasets and configuration parameters, enabling users to adapt the pipeline's behaviour in response to changing runtime requirements or environmental conditions.
Users need the ability to view and modify information within the Data Catalog dynamically during pipeline execution. This includes injecting dynamic data or swapping dataset implementations to accommodate varying runtime requirements.
Plugin developers are interested in checking a dataset's type and injecting dynamic behaviour based on it: they want to determine whether a dataset belongs to a certain class or type and then modify its parameters or behaviour accordingly, for example configuring it based on their environment or integration needs.
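As an illustration of this pattern, a plugin hook might branch on dataset type and adjust configuration accordingly. This is a plain-Python sketch: the dataset classes and the hook function here are hypothetical stand-ins, not Kedro's actual API.

```python
class CSVDataset:
    """Hypothetical file-based dataset."""
    def __init__(self, filepath):
        self.filepath = filepath


class SQLDataset:
    """Hypothetical database-backed dataset."""
    def __init__(self, table, credentials=None):
        self.table = table
        self.credentials = credentials


def after_catalog_created(catalog):
    """Hypothetical hook: inject environment-specific config by dataset type."""
    for name, dataset in catalog.items():
        if isinstance(dataset, SQLDataset):
            # Only SQL-backed datasets get credentials swapped in.
            dataset.credentials = {"env": "staging"}


catalog = {"cars": CSVDataset("cars.csv"), "orders": SQLDataset("orders")}
after_catalog_created(catalog)
```

This is exactly the kind of type-based mutation that currently requires reaching into private attributes when done against the real catalog.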
There's general agreement that we don't necessarily want to make all mutations of the catalog easy (such as arbitrary injection of datasets in the middle of the lifecycle), but there may be more ways we can open up the collection of datasets just before the catalog is first instantiated for the rest of the run.
For interactive use, on the other hand, building the DataCatalog in an imperative way seems unnecessary, and there are other possibilities we can offer: see #3612 (comment)
In the new catalog, KedroDataCatalog, we implemented a dict-like interface and removed _FrozenDatasets, along with property-style access to datasets.
The new catalog is partially mutable: it supports a setter that allows adding new datasets or replacing existing ones.
The team also decided not to make the catalog fully mutable. The datasets property remains private so as not to encourage users to configure the catalog by modifying the datasets dictionary directly. For the same reason, KedroDataCatalog will not support all dictionary-specific methods, such as pop(), popitem(), or deletion by key (del).
It is also possible to modify existing datasets in place, since the get() method returns a reference to the dataset object, but we do not recommend this and encourage users to be careful: such changes might affect the pipeline run and lead to unexpected results, as the framework does not track these kinds of changes and does not synchronize them.
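The reference-returning behaviour, and why the framework cannot observe such edits, can be shown with an ordinary mapping of dataset objects (a plain-Python sketch with a hypothetical Dataset class, not Kedro code):

```python
class Dataset:
    """Hypothetical dataset with a single filepath attribute."""
    def __init__(self, filepath):
        self.filepath = filepath


datasets = {"cars": Dataset("data/cars.csv")}

# get() returns a reference to the stored object, not a copy...
ref = datasets.get("cars")

# ...so mutating it changes the object inside the mapping, while the
# mapping itself is never touched and nothing can "notice" the change.
ref.filepath = "s3://bucket/cars.csv"
```

Any later consumer reading `datasets["cars"].filepath` now silently sees the new value, which is precisely why in-place mutation of catalog datasets is discouraged.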
Relates to #2728
Context
https://github.com/Galileo-Galilei/kedro-mlflow/blob/64b8e94e1dafa02d979e7753dab9b9dfd4d7341c/kedro_mlflow/framework/hooks/mlflow_hook.py#L145
https://github.com/getindata/kedro-azureml/blob/d5c2011c7ed7fdc03235bf2bd6701f1901d1139c/kedro_azureml/hooks.py#L20