Alternate Spark runtime implementation #406
Conversation
Force-pushed from 2dc2629 to 9413086
experimental/spark/src/data_processing_spark/runtime/spark/transform_launcher.py (review thread resolved, outdated)
Force-pushed from 3d8f301 to 81d41ac
data-processing-lib/python/src/data_processing/transform/abstract_transform.py (review thread resolved, outdated)
data-processing-lib/spark/src/data_processing_spark/runtime/spark/execution_configuration.py (review thread resolved, outdated)
data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_file_processor.py (review thread resolved)
data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_file_processor.py (review thread resolved, outdated)
data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_file_processor.py (review thread resolved, outdated)
        super().__init__(runtime_config, data_access_factory)
        self.execution_config = SparkTransformExecutionConfiguration(name=runtime_config.get_name())

    def __get_parameters(self) -> bool:
Maybe it's time to move these up to the superclass.
Which ones? We are using different classes.
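For context, a minimal self-contained sketch of the pattern under discussion. Everything below is illustrative and simplified: only SparkTransformExecutionConfiguration and __get_parameters appear in the diff above, the rest (field names, return values, the omitted base class) is assumed and does not reflect the PR's actual API.

# Illustrative sketch only: stub classes showing the shape of the code in the
# diff above. SparkTransformExecutionConfiguration is a stand-in; the real
# class carries more logic, and the launcher's base class is omitted here.
class SparkTransformExecutionConfiguration:
    def __init__(self, name: str):
        self.name = name
        self.pipeline_id = "pipeline_id"

    def get_input_params(self) -> dict:
        # Runtime parameters later folded into job metadata (see the next thread).
        return {"pipeline_id": self.pipeline_id}


class SparkTransformLauncher:
    def __init__(self, runtime_config, data_access_factory):
        self.runtime_config = runtime_config
        self.data_access_factory = data_access_factory
        # Spark-specific configuration is built in the subclass; the comment
        # above asks whether this wiring could be hoisted into the superclass.
        self.execution_config = SparkTransformExecutionConfiguration(name=runtime_config.get_name())

    def __get_parameters(self) -> bool:
        # Parse and validate runtime parameters; True on success.
        return True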
data-processing-lib/spark/src/data_processing_spark/runtime/spark/transform_orchestrator.py (review thread resolved)
    logger.debug("Building job metadata")
    input_params = runtime_config.get_transform_metadata() | execution_config.get_input_params()
    metadata = {
        "pipeline": execution_config.pipeline_id,
is the "parallelization" being captured as metadata somewhere?
Yes, it is included in the input parameters.
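To make that concrete, a small self-contained example of the dict union used in the snippet above (Python 3.9+). The key names and values here are invented for illustration and are not taken from the PR.

# Made-up stand-ins for what the transform and execution configurations report.
transform_metadata = {"transform": "noop", "files_processed": 42}
execution_params = {"pipeline_id": "pipeline_id", "parallelization": 8}

# dict union: right-hand side wins on duplicate keys, so the execution
# parameters (including parallelization) end up in the merged input params.
input_params = transform_metadata | execution_params

metadata = {
    "pipeline": execution_params["pipeline_id"],
    "job_input_params": input_params,  # key name illustrative
}
print(metadata["job_input_params"]["parallelization"])  # 8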
Force-pushed from 052e929 to 625132c
Please reinstate the two-stage build process for the Dockerfile. I also suggest adding a build argument that controls whether Hadoop support is included, with its value set to false by default.
I have also created a requirements page for supporting fuzzy dedup. I suggest that we provide support for fuzzy dedup in this runtime before merging it into the dev code branch.
Force-pushed from 75325f7 to 21105a8
I think it's time to merge.
Force-pushed from 576483b to 7117079
Signed-off-by: Constantin M Adam <[email protected]>
Why are these changes needed?
Related issue number (if any).
#197