Airflow should support different types of dataset event consumer DAGs #34050

mrn-aglic · 2023-09-03T09:35:11Z

mrn-aglic
Sep 3, 2023

Description

Airflow should support defining different types of dataset consumer DAGs for dataset events. For example, if a DAG producer has created multiple dataset events, the DAG consumer should choose whether to digest these dataset events one by one or as a whole (for now).
The current implementation is non-deterministic. The job scheduler will pick up the dataset events which may be stored in the database by one or more DAG runs and will create a single DAG run. Sometimes the data intervals may correspond between the producer and consumer, sometimes they may not.

It would be nice if different types of consumers were also supported. For example, in our use case, we would need the consumer to replicate the data intervals of the producer DAG. I have taken a look at the source code and can propose a simple solution for this (on that later).

However, this also poses some issues/questions.

Let's break it down to pros and cons.
Pros:

in our case, we mostly use data-aware scheduling to break up large pipelines to multiple DAGs.This would allow us to be sure that data intervals are replicated across the entire pipeline

Cons:

what would it mean for situations where multiple DAGs produce the same dataset? These DAGs may have different data intervals, so replicating one doesn't make (exactly) sense. Different types of consuming modes could be proposed for these cases.
the dataset_dag_run_queue model in the database may not reflect what exactly will happen - the actual dag runs that will be queued. I'm not sure.

Here are some of my findings on what it would mean to add support for a consumer to replicate the data interval of the producer (again, this makes sense only if a single producer is updating the dataset):

change the DAG constructor to accept the mode for consuming dataset events (default to default or name it simple)
introduce an enum-like class (or whatever airflow uses for defining types) to define different consumer modes
add an __init__ method to the DatasetTimetable class to pass the mode for consumer
change the implementation for fetching data intervals to something like this:

def _data_interval_from_event(event: DatasetEvent) -> DataInterval:
    return DataInterval(event.source_dag_run.interval_start, event.source_dag_run.interval_end)

def _data_intervals_per_event(
    self,
    logical_date: DateTime,
    events: Collection[DatasetEvent],
) -> Collection[DataInterval]:

    if not events:
        return DataInterval(logical_date, logical_date)

    return [_data_interval_from_event(event) for event in events]

def _data_interval_for_event_set(
    self,
    logical_date: DateTime,
    events: Collection[DatasetEvent],
) -> Collection[DataInterval]:

    if not events:
        return DataInterval(logical_date, logical_date)

    start = min(
        events, key=operator.attrgetter("source_dag_run.data_interval_start")
    ).source_dag_run.data_interval_start
    end = max(
        events, key=operator.attrgetter("source_dag_run.data_interval_end")
    ).source_dag_run.data_interval_end
    return DataInterval(start, end)


def data_interval_for_events(
    self,
    logical_date: DateTime,
    events: Collection[DatasetEvent],
) -> Collection[DataInterval]:
    
    if self.event_mode == DatasetConsumerType.NON_DETERMINISTIC:
        return _data_interval_for_event_set(logical_date, events)
    elif self.event_mode == DatasetConsumerType.DATA_INTERVAL:
        return _data_intervals_per_event(logical_date, events)
    else: # this case should not be able to happen
        raise NotImplementedError()

and change the scheduler job logic to iterate over a list instead of a single value:

data_intervals = dag.timetable.data_interval_for_events(exec_date, dataset_events)

for data_interval in data_intervals:
    run_id = dag.timetable.generate_run_id(
        run_type=DagRunType.DATASET_TRIGGERED,
        logical_date=exec_date,
        data_interval=data_interval,
        session=session,
        events=dataset_events,
    )

    dag_run = dag.create_dagrun(
        run_id=run_id,
        run_type=DagRunType.DATASET_TRIGGERED,
        execution_date=exec_date,
        data_interval=data_interval,
        state=DagRunState.QUEUED,
        external_trigger=False,
        session=session,
        dag_hash=dag_hash,
        creating_job_id=self.job.id,
    )
    Stats.incr("dataset.triggered_dagruns")
    dag_run.consumed_dataset_events.extend(dataset_events)

session.query(DatasetDagRunQueue).filter(
    DatasetDagRunQueue.target_dag_id == dag_run.dag_id
).delete()

In the future, one could separate different types of dataset timetables.

Now, I have never submitted anything to Airflow, so I don't know the source code, and there may be issues with this approach that I am not aware off.

Thoughts?

Use case/motivation

While working with Data-aware scheduling in Airflow, there is an "issue" where multiple dataset events trigger a single DAG run of the consumer DAG. This presents an issue when using data_interal_start and data_interval_end variables in DAG execution as well as jinja templated queries.
The problem presents itself when either the producer DAG is running during catchup or backfill.

It seems like the culprit is that multiple dataset events are used to create a single DataInterval, where upon creation the min and max across all of the dataset events are taken (see DatasetTriggeredTimetable implementation).

It would be nice if different types of consumers were supported. For example, in our use case, we would need the consumer to replicate the data intervals of the producer DAG. I have taken a look at the source code and can propose a simple solution for this (on that later).

This would allow breaking up a complex pipeline into multiple DAGs and use backfill or catchup with more certainty in the end result.

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

potiuk · 2023-09-03T14:44:09Z

potiuk
Sep 3, 2023
Collaborator

I think such a discussion should happen in the devlist not in GitHub issue. I believe very soon we are going to have some proposal (or few proposals) on the devlist that will be follow-up to the original AIP-48 https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-48+Data+Dependency+Management+and+Data+Driven+Scheduling

This is far more than a feature request and touches a lot more than a simple "feature" discussion as it changes the architecture of core airflow feature. I will convert it into a discussion and maybe some people will continue chiming in here, but at the very least you should join thet devlist @mrn-aglic and raise your proposal there - maybe pointing to this discussion, digesting the "gist" of it, or maybe continuing it in the devlist in full - up to you. But generally speaking - as in all ASF projects "If it did not happen at the devlist - it did not happen" - so any discussions and agreements here will have to be brought back to the devlist, likely Airflow Improvement Proposal written and it should be formally voted.

You can read more at the https://airflow.apache.org/community/ of ours and https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals for the process of discussing and proposing new AIP.

1 reply

mrn-aglic Sep 5, 2023
Author

I'll try to send in to the devlist (email as in the document). Just want to write this in a Google doc. How much in detail should it be or touch on the internal workings of Airflow (including DB)?
As the time I'll need also depends on the depth required for the document.

cmarteepants · 2024-02-15T16:05:33Z

cmarteepants
Feb 15, 2024
Collaborator

Doc from devlist: https://docs.google.com/document/d/1Apgrlylk410sStf9AS-Yc2RaIOGu8noBl8pHyBfmE08/edit#heading=h.wix0jhslclfw

Quick summary of the problem based on the doc:

Data Interval Values that we set for for downstream/consumer dags are non-deterministic. If there is more than one event in the queue, the data interval is calculated by looking at all the source dag runs, and setting the data interval to be the min data_interval_start date and max date_interval_end date. Problem is compounded once you consider backfills.

3 replies

potiuk Feb 15, 2024
Collaborator

@mrn-aglic @cmarteepants -> devlist thread is what is needed for that - I understand you aim to start one ?

potiuk Feb 15, 2024
Collaborator

(and by quick look, without going into details, it does look like draft proposal for at least a discussion on new AIP)

cmarteepants Feb 15, 2024
Collaborator

Yes 👍 It's premature to go to the dev-list at this time, but once we have something more concrete, we will

hristo-mavrodiev · 2025-01-19T08:30:34Z

hristo-mavrodiev
Jan 19, 2025

Hello,
Was this proposal accepted to be done?
Is there an AIP, where to progress could be tracked?

Thanks

1 reply

potiuk Jan 19, 2025
Collaborator

Hello, Was this proposal accepted to be done? Is there an AIP, where to progress could be tracked?

Thanks

I don't think there is a simple "yes/no" answer.

You can see at the current asset planned works at https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+3+Workstreams. If you follow the devlist (there decisions are made @hristo-mavrodiev ) - very soon there will be additional modification voted to AIP-83 https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-83+amendment+to+support+classic+Airflow+authoring+style - summarized in https://cwiki.apache.org/confluence/display/AIRFLOW/Option+2+clarification+doc+WIP - so if you would like to dive deeper and understand the consequence and what is planned for Airlfow 3 - those are the docs and linked discussions that you should read and follow.

Generally - following the devlist for such discussion is something you should do - discussions in Github Discussions are just this - discussions and people exchanging their opinions, but decisions in Apache projects are made on the devlist, and in case of Airflow they are captured in Airflow Improvement Proposals -see also community page https://airflow.apache.org/community/ where all the links and processes are explained.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Airflow should support different types of dataset event consumer DAGs #34050

{{title}}

Replies: 3 comments 5 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Airflow should support different types of dataset event consumer DAGs #34050

mrn-aglic Sep 3, 2023

Description

Use case/motivation

Related issues

Are you willing to submit a PR?

Code of Conduct

Replies: 3 comments · 5 replies

potiuk Sep 3, 2023 Collaborator

mrn-aglic Sep 5, 2023 Author

cmarteepants Feb 15, 2024 Collaborator

potiuk Feb 15, 2024 Collaborator

potiuk Feb 15, 2024 Collaborator

cmarteepants Feb 15, 2024 Collaborator

hristo-mavrodiev Jan 19, 2025

potiuk Jan 19, 2025 Collaborator

mrn-aglic
Sep 3, 2023

Replies: 3 comments 5 replies

potiuk
Sep 3, 2023
Collaborator

mrn-aglic Sep 5, 2023
Author

cmarteepants
Feb 15, 2024
Collaborator

potiuk Feb 15, 2024
Collaborator

potiuk Feb 15, 2024
Collaborator

cmarteepants Feb 15, 2024
Collaborator

hristo-mavrodiev
Jan 19, 2025

potiuk Jan 19, 2025
Collaborator