
feat: automatically inject OL transport info into spark jobs #45326

Conversation

kacpermuda (Contributor)

Similar to #44477, this PR introduces a new feature to the OpenLineage integration. It will NOT impact users who are not using OpenLineage or have not explicitly enabled this feature (it is False by default).

TL;DR

When explicitly enabled by the user for supported operators, we will automatically inject transport information into the Spark job properties. For example, when submitting a Spark job using the DataprocSubmitJobOperator, we will configure Spark/OpenLineage integration to use the same transport configuration that Airflow integration uses.

Why?

Currently, this process requires manual configuration by the user, as described here. For example:

DataprocSubmitJobOperator(
    task_id="my_task",
    # ...
    job={
        # ...
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": openlineage_url,
        "spark.openlineage.transport.compression": "gzip",
        "spark.openlineage.transport.auth.apiKey": api_key,
        "spark.openlineage.transport.auth.type": "apiKey",
    },
)
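To illustrate what the automatic injection replaces, the `spark.openlineage.transport.*` keys above are simply a flattened form of the transport configuration that the Airflow-side OpenLineage integration already holds. A minimal sketch of that flattening (the helper name and recursion scheme are my illustration, not the PR's actual code):

```python
def transport_to_spark_properties(transport: dict) -> dict:
    """Flatten a nested OpenLineage transport config into
    'spark.openlineage.transport.*' Spark properties (illustrative only)."""
    props = {}

    def walk(node: dict, path: str) -> None:
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, f"{path}.{key}")
            else:
                props[f"{path}.{key}"] = str(value)

    walk(transport, "spark.openlineage.transport")
    return props


# Example: an http transport with apiKey auth, mirroring the manual config above.
properties = transport_to_spark_properties(
    {
        "type": "http",
        "url": "http://openlineage-backend:5000",
        "compression": "gzip",
        "auth": {"type": "apiKey", "apiKey": "secret"},
    }
)
```

With a helper like this, the operator no longer needs the user to hand-write each `spark.openlineage.transport.*` key.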

Understanding how various Airflow operators configure Spark allows us to automatically inject transport information.

Controlling the Behavior

We provide users with a flexible control mechanism to manage this injection, combining per-operator enablement with a global fallback configuration. This design is inspired by the deferrable argument in Airflow.

ol_inject_transport_info: bool = conf.getboolean(
    "openlineage", "spark_inject_transport_info", fallback=False
)

Each supported operator will include an argument like ol_inject_transport_info, which defaults to the global configuration value of openlineage.spark_inject_transport_info. This approach allows users to:

  1. Control behavior on a per-job basis by explicitly setting the argument.
  2. Rely on a consistent default configuration for all jobs if the argument is not set.

This design ensures both flexibility and ease of use, letting users fine-tune their workflows while minimizing repetitive configuration. I am aware that adding an OpenLineage-related argument to the operator affects all users, even those not using OpenLineage; however, since it defaults to False and can be safely ignored, I hope this will not pose any issues.
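The "per-operator argument with a global config fallback" pattern can be sketched as follows; `GLOBAL_CONF` and `getboolean` stand in for Airflow's real `conf.getboolean`, and the operator class is a hypothetical stand-in, not the PR's actual code:

```python
# Stand-in for airflow.configuration.conf; maps (section, key) to a raw string.
GLOBAL_CONF = {("openlineage", "spark_inject_transport_info"): "False"}


def getboolean(section: str, key: str, fallback: bool = False) -> bool:
    """Simplified imitation of conf.getboolean: parse the raw string, or fall back."""
    raw = GLOBAL_CONF.get((section, key))
    if raw is None:
        return fallback
    return raw.lower() in ("1", "true", "t", "yes")


class MySubmitJobOperator:
    def __init__(
        self,
        # Like the `deferrable` argument: the default is read from the global
        # config, but each task can override it explicitly.
        ol_inject_transport_info: bool = getboolean(
            "openlineage", "spark_inject_transport_info", fallback=False
        ),
    ):
        self.ol_inject_transport_info = ol_inject_transport_info
```

A task that passes `ol_inject_transport_info=True` opts in regardless of the global setting, while tasks that omit the argument follow the configured default.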

How?

The implementation is divided into three parts for better organization and clarity:

  1. Operator's Code (including the execute method):
    Contains minimal logic to avoid overwhelming users who are not actively working with OpenLineage.

  2. Google's Provider OpenLineage Utils File:
    Handles the logic for accessing Spark properties specific to a given operator or job.

  3. OpenLineage Provider's Utils:
Responsible for creating/extracting all necessary information in a format compatible with the OpenLineage Spark integration. We also perform the modifications to the Spark properties here.

For some operators, parts 1 and 2 may both live in the operator's code. In general, the specific operator/provider knows how to access the Spark properties, while the OpenLineage provider knows what to inject and performs the injection itself.
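The injection step in part 3 can be sketched as a merge that adds transport properties to the job's existing Spark properties; the helper name is hypothetical, and I am assuming user-set properties take precedence over the injected ones:

```python
def inject_transport_properties(
    job_properties: dict[str, str], transport_properties: dict[str, str]
) -> dict[str, str]:
    """Merge OpenLineage transport properties into the Spark job properties,
    never overwriting a key the user has already set (illustrative sketch)."""
    merged = dict(job_properties)
    for key, value in transport_properties.items():
        merged.setdefault(key, value)  # keep user-provided values untouched
    return merged


# Example: the user already pinned the transport URL, so only the missing
# keys get injected.
user_props = {"spark.openlineage.transport.url": "http://user-override:5000"}
injected = inject_transport_properties(
    user_props,
    {
        "spark.openlineage.transport.url": "http://auto-detected:5000",
        "spark.openlineage.transport.type": "http",
    },
)
```

This keeps the operator code minimal (part 1): it only needs to hand its properties dict to the OpenLineage utils and store the merged result back.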




potiuk commented Jan 2, 2025

Hey @kacpermuda -> I rebased your PR. We found an issue with @jscheffl in the new caching scheme (fixed in #45347) that would run the "main" version of the tests, so I am rebasing all affected PRs :)
