Alternate spark runtime implementation #406

Merged: 32 commits merged into dev from spark_experimental, Sep 12, 2024

Conversation

blublinsky (Collaborator) commented Jul 12, 2024

Why are these changes needed?

Related issue number (if any): #197

blublinsky requested review from daw3rd and cmadam, July 12, 2024 12:22
daw3rd changed the title from "initial implementation" to "Alternate spark runtime implementation", Jul 12, 2024
daw3rd requested a review from touma-I, July 30, 2024 19:32
blublinsky force-pushed the spark_experimental branch from 2dc2629 to 9413086, July 31, 2024 12:45
blublinsky requested a review from daw3rd, August 5, 2024 08:42

daw3rd previously requested changes, Aug 5, 2024
        super().__init__(runtime_config, data_access_factory)
        self.execution_config = SparkTransformExecutionConfiguration(name=runtime_config.get_name())

    def __get_parameters(self) -> bool:
daw3rd (Member) commented:
Maybe it's time to move these up to the super class.

blublinsky (Collaborator, Author) replied Aug 5, 2024:

Which ones? We are using different classes.
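
An illustrative aside, not part of the thread: the refactoring daw3rd suggests would look roughly like the sketch below. Aside from SparkTransformExecutionConfiguration (quoted from the diff, stubbed here), every class and method name is a hypothetical stand-in, since the actual launcher classes differ per runtime.

# Hypothetical sketch of hoisting shared launcher logic into a base class.
from abc import ABC, abstractmethod
from typing import Any


class SparkTransformExecutionConfiguration:
    """Stand-in stub for the real Spark execution configuration class."""

    def __init__(self, name: str):
        self.name = name


class AbstractTransformLauncher(ABC):
    """Hypothetical base class holding logic shared by all runtimes."""

    def __init__(self, runtime_config: Any, data_access_factory: Any):
        self.runtime_config = runtime_config
        self.data_access_factory = data_access_factory
        self.execution_config = self._create_execution_config()

    @abstractmethod
    def _create_execution_config(self) -> Any:
        """Each runtime supplies its own execution configuration object."""

    def _get_parameters(self) -> bool:
        # Argument parsing common to all runtimes could be hoisted here,
        # which is the gist of the suggestion above.
        return True


class SparkTransformLauncher(AbstractTransformLauncher):
    """Hypothetical Spark launcher; only configuration creation differs."""

    def _create_execution_config(self) -> Any:
        # Mirrors the line quoted in the review excerpt.
        return SparkTransformExecutionConfiguration(name=self.runtime_config.get_name())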

        logger.debug("Building job metadata")
        input_params = runtime_config.get_transform_metadata() | execution_config.get_input_params()
        metadata = {
            "pipeline": execution_config.pipeline_id,
daw3rd (Member) commented:
Is the "parallelization" being captured as metadata somewhere?

blublinsky (Collaborator, Author) replied:
Yes, in the input parameters.
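
A minimal sketch, not from the thread, of how the dict union above carries a parallelization setting into the job metadata; every key and value here is a made-up example:

# Self-contained illustration of the dict-union pattern (Python 3.9+, PEP 584).
transform_metadata = {"transform": "noop", "version": "0.2.0"}
execution_params = {"parallelization": 4, "memory_per_executor": "2g"}

# The right-hand operand wins on key collisions.
input_params = transform_metadata | execution_params

metadata = {
    "pipeline": "example_pipeline_id",
    "job_input_params": input_params,
}

# parallelization survives inside the merged input parameters,
# which is what the reply above refers to.
assert metadata["job_input_params"]["parallelization"] == 4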

transforms/.make.transforms (review thread resolved)
blublinsky force-pushed the spark_experimental branch 3 times, most recently from 052e929 to 625132c, August 14, 2024 07:35
cmadam previously requested changes, Aug 14, 2024
cmadam (Collaborator) left a comment:

Please reinstate the two-stage build process for the Dockerfile. I also suggest that we keep a build argument that controls whether Hadoop is used during the build, with its value set to false by default.

I have also created a requirements page for supporting fuzzy dedup. I suggest that we provide support for fuzzy dedup in this runtime before merging it into the dev branch.

data-processing-lib/spark/Dockerfile (review thread outdated, resolved)
blublinsky (Collaborator, Author) commented:

I think it's time to merge. It might require some additional small changes to run fuzzy dedup, but we can work that out later.

blublinsky dismissed stale reviews from cmadam and daw3rd, September 12, 2024 12:04, with the comment:

it's done

daw3rd merged commit 03cba30 into dev, Sep 12, 2024
21 checks passed
3 participants