kedro-airflow --group-in-memory issues #998

Open
CF-FHB-X opened this issue Feb 6, 2025 · 1 comment
Labels
Community Issue/PR opened by the open-source community

Comments

CF-FHB-X commented Feb 6, 2025

Description

I'm having issues getting the --group-in-memory flag to actually group nodes.

Context

I'm running kedro v0.19.11 and kedro-airflow v0.9.2, and trying to deploy a simple 2-node test pipeline to our internal Airflow. Using the --group-in-memory flag doesn't seem to be doing anything.

Steps to Reproduce

  1. I have a simple test pipeline with 2 nodes. One fetches a file from a server, converts it to a DataFrame, and outputs as a MemoryDataset. The other node uses that DataFrame, does a simple group by with some stats, and dumps that out to a CSV.
  2. I run kedro airflow create --target-dir=dags/ --env-airflow --group-in-memory to convert the pipeline into an Airflow DAG.
    I should note that this is just a simple test to see if I can get kedro working with our Airflow deployment, so the nodes are just simple code snippets for testing purposes.

Expected Result

This could totally just be me misunderstanding the flag, but I expected those 2 nodes to be merged into one task in the DAG (since the output of the first node and the input of the second node are the same MemoryDataset).
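
To make that expectation concrete, here is a minimal sketch of the grouping behaviour described above: nodes that share an in-memory dataset end up in the same group. This is not kedro-airflow's actual implementation; `group_memory_nodes`, the node names, and the dataset names are all hypothetical and chosen only to mirror the 2-node test pipeline.

```python
from collections import defaultdict


def group_memory_nodes(nodes, memory_datasets):
    """Group node names that are linked through in-memory datasets.

    nodes: dict mapping node name -> (inputs, outputs)
    memory_datasets: set of dataset names that only live in memory
    (Hypothetical helper for illustration, not part of kedro-airflow.)
    """
    parent = {name: name for name in nodes}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Collect, per in-memory dataset, every node that reads or writes it.
    touching = defaultdict(list)
    for name, (inputs, outputs) in nodes.items():
        for dataset in set(inputs) | set(outputs):
            if dataset in memory_datasets:
                touching[dataset].append(name)

    # Nodes touching the same in-memory dataset must run in the same task.
    for names in touching.values():
        for other in names[1:]:
            union(names[0], other)

    groups = defaultdict(list)
    for name in nodes:
        groups[find(name)].append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical stand-in for the 2-node test pipeline.
nodes = {
    "fetch_file": ([], ["raw_df"]),
    "summarise": (["raw_df"], ["stats_csv"]),
}

print(group_memory_nodes(nodes, {"raw_df"}))  # [['fetch_file', 'summarise']]
print(group_memory_nodes(nodes, set()))       # [['fetch_file'], ['summarise']]
```

With `raw_df` treated as in-memory, both nodes land in one group, which is the single task I expected to see in the DAG.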

Actual Result

With or without the --group-in-memory flag, the resulting DAG file always has 2 tasks.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • Kedro version used (pip show kedro or kedro -V): v0.19.11
  • Kedro plugin and kedro plugin version used (pip show kedro-airflow): kedro-airflow v0.9.2
  • Python version used (python -V): 3.11.9
  • Operating system and version: Windows 10 Enterprise

merelcht added the Community label (Issue/PR opened by the open-source community) on Feb 6, 2025
github-project-automation bot moved this to Wizard inbox in Kedro Wizard 🪄 on Feb 6, 2025

CF-FHB-X commented Feb 10, 2025

I think I might have found the issue? If you look at the kedro_airflow.grouping._is_memory_dataset() function, it always returns False if the dataset name is not in the catalog (there's no check if the dataset is a MemoryDataset).

I think it should be something along the lines of:

def _is_memory_dataset(catalog, dataset_name: str) -> bool:
    return isinstance(catalog.datasets[dataset_name], MemoryDataset)

with a from kedro.io import MemoryDataset import added at the top of the module.

Running that seems to produce the desired outcome: all nodes with MemoryDatasets as inputs/outputs collapsed into one.
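
One case the one-liner above may not cover is a dataset name that is not registered in the catalog at all, where the lookup could raise instead of returning a boolean. Since Kedro treats unregistered datasets as in-memory by default, a more defensive variant might look like the sketch below. To keep it runnable without Kedro installed, `MemoryDataset` and `CSVDataset` here are stand-in classes and the catalog is a plain dict; in the real plugin the check would go against the actual catalog object, whose lookup API I have not verified across versions.

```python
class MemoryDataset:
    """Stand-in for kedro.io.MemoryDataset (assumption, for illustration only)."""


class CSVDataset:
    """Stand-in for a persisted dataset type (assumption, for illustration only)."""


def is_memory_dataset(datasets: dict, dataset_name: str) -> bool:
    """Return True if the dataset only lives in memory.

    datasets: plain dict of dataset name -> dataset object, standing in
    for the catalog. Names missing from it are assumed to be implicit
    in-memory datasets.
    """
    if dataset_name not in datasets:
        return True
    return isinstance(datasets[dataset_name], MemoryDataset)


datasets = {
    "raw_df": MemoryDataset(),   # explicitly declared as in-memory
    "stats_csv": CSVDataset(),   # persisted to disk
}

print(is_memory_dataset(datasets, "raw_df"))        # True
print(is_memory_dataset(datasets, "stats_csv"))     # False
print(is_memory_dataset(datasets, "intermediate"))  # True (not registered)
```

This keeps both situations grouped: datasets explicitly declared as MemoryDataset in the catalog, and datasets that are simply left out of it.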
