feat: automatically inject OL transport info into spark jobs #45326
Similar to #44477, this PR introduces a new feature to the OpenLineage integration. It will NOT impact users who are not using OpenLineage or who have not explicitly enabled this feature (disabled by default).
TL;DR
When explicitly enabled by the user for supported operators, we will automatically inject transport information into the Spark job properties. For example, when submitting a Spark job with the `DataprocSubmitJobOperator`, we will configure the Spark/OpenLineage integration to use the same transport configuration that the Airflow integration uses.
Why?
Currently, this process requires manual configuration by the user, as described here: the user has to set the Spark/OpenLineage transport properties on the job by hand.
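To illustrate, the manual setup this PR automates might look roughly like the sketch below, using the standard OpenLineage Spark integration property names; the URL, API key, cluster, and file values are placeholders.

```python
# Manual approach (what users do today): hand-write the OpenLineage transport
# settings into the Spark job properties, duplicating what Airflow's own
# OpenLineage transport already knows.
spark_properties = {
    # Register the OpenLineage listener on the Spark side.
    "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
    # Point the Spark integration at the same backend Airflow reports to.
    "spark.openlineage.transport.type": "http",
    "spark.openlineage.transport.url": "https://backend.example.com:5000",
    "spark.openlineage.transport.auth.type": "api_key",
    "spark.openlineage.transport.auth.apiKey": "placeholder-token",
}

# The properties are then passed verbatim in the job definition, e.g. inside
# the `job` argument of DataprocSubmitJobOperator.
job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/my-script.py",
        "properties": spark_properties,
    },
}
```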
Understanding how various Airflow operators configure Spark allows us to automatically inject transport information.
Controlling the Behavior
We provide users with a flexible control mechanism to manage this injection, combining per-operator enablement with a global fallback configuration. This design is inspired by the `deferrable` argument in Airflow. Each supported operator will include an argument like `ol_inject_transport_info`, which defaults to the global configuration value of `openlineage.spark_inject_transport_info`. This approach allows users to enable the injection globally while overriding it for individual operators, or to enable it only for selected operators. This design ensures both flexibility and ease of use, enabling users to fine-tune their workflows while minimizing repetitive configuration. I am aware that adding an OpenLineage-related argument to the operator will affect all users, even those not using OpenLineage, but since it defaults to False and can be ignored, I hope this will not pose any issues.
How?
The implementation is divided into three parts for better organization and clarity:
1. Operator's code (including the `execute` method): contains minimal logic, to avoid overwhelming users who are not actively working with OpenLineage.
2. Google provider's OpenLineage utils file: handles the logic for accessing the Spark properties specific to a given operator or job.
3. OpenLineage provider's utils: responsible for creating/extracting all necessary information in a format compatible with the OpenLineage Spark integration. The modifications to the Spark properties are also performed here.
For some operators, parts 1 and 2 may both live in the operator's code. In general, the specific operator/provider knows how to get the Spark properties, while the OpenLineage provider knows what to inject and performs the injection itself.
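The division of responsibilities described above could be sketched like this. All function names here are hypothetical illustrations of the pattern, not the helpers actually added by this PR.

```python
# Hypothetical sketch of the three-part split described above.

# Part 3 (OpenLineage provider utils): knows the Spark-side property names and
# performs the injection itself, leaving any user-set transport untouched.
def inject_transport_properties(properties: dict, transport: dict) -> dict:
    if any(key.startswith("spark.openlineage.transport") for key in properties):
        # User already configured transport manually; do not overwrite it.
        return properties
    injected = dict(properties)
    for key, value in transport.items():
        injected[f"spark.openlineage.transport.{key}"] = value
    return injected

# Part 2 (Google provider utils): knows where the Spark properties live inside
# a given Dataproc job definition.
def get_dataproc_spark_properties(job: dict) -> dict:
    return job.get("pyspark_job", {}).get("properties", {})

# Part 1 (operator's execute method): minimal glue, guarded by the new argument.
def maybe_inject(job: dict, *, ol_inject_transport_info: bool, transport: dict) -> dict:
    if ol_inject_transport_info:
        props = get_dataproc_spark_properties(job)
        job["pyspark_job"]["properties"] = inject_transport_properties(props, transport)
    return job

job = {"pyspark_job": {"main_python_file_uri": "gs://bucket/app.py", "properties": {}}}
job = maybe_inject(
    job,
    ol_inject_transport_info=True,
    transport={"type": "http", "url": "https://backend.example.com:5000"},
)
```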