Alternate spark runtime implementation #406

Merged: 32 commits merged into dev from spark_experimental, Sep 12, 2024

Conversation

blublinsky (Collaborator) commented Jul 12, 2024

Why are these changes needed?

Related issue number (if any): #197

blublinsky requested review from daw3rd and cmadam, July 12, 2024 12:22
daw3rd changed the title from "initial implementation" to "Alternate spark runtime implementation", Jul 12, 2024
daw3rd requested a review from touma-I, July 30, 2024 19:32
blublinsky force-pushed the spark_experimental branch from 2dc2629 to 9413086, July 31, 2024 12:45
blublinsky requested a review from daw3rd, August 5, 2024 08:42

daw3rd previously requested changes, Aug 5, 2024
        super().__init__(runtime_config, data_access_factory)
        self.execution_config = SparkTransformExecutionConfiguration(name=runtime_config.get_name())

    def __get_parameters(self) -> bool:
daw3rd (Member) commented:
Maybe it's time to move these up to the super class.

blublinsky (Collaborator, Author) replied Aug 5, 2024:

Which ones? We are using different classes.
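
An illustrative aside, not part of the thread: the refactoring daw3rd suggests would look roughly like the sketch below. Aside from SparkTransformExecutionConfiguration (quoted from the diff, stubbed here), every class and method name is a hypothetical stand-in, since the actual launcher classes differ per runtime.

# Hypothetical sketch of hoisting shared launcher logic into a base class.
from abc import ABC, abstractmethod
from typing import Any


class SparkTransformExecutionConfiguration:
    """Stand-in stub for the real Spark execution configuration class."""

    def __init__(self, name: str):
        self.name = name


class AbstractTransformLauncher(ABC):
    """Hypothetical base class holding logic shared by all runtimes."""

    def __init__(self, runtime_config: Any, data_access_factory: Any):
        self.runtime_config = runtime_config
        self.data_access_factory = data_access_factory
        self.execution_config = self._create_execution_config()

    @abstractmethod
    def _create_execution_config(self) -> Any:
        """Each runtime supplies its own execution configuration object."""

    def _get_parameters(self) -> bool:
        # Argument parsing common to all runtimes could be hoisted here,
        # which is the gist of the suggestion above.
        return True


class SparkTransformLauncher(AbstractTransformLauncher):
    """Hypothetical Spark launcher; only configuration creation differs."""

    def _create_execution_config(self) -> Any:
        # Mirrors the line quoted in the review excerpt.
        return SparkTransformExecutionConfiguration(name=self.runtime_config.get_name())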

        logger.debug("Building job metadata")
        input_params = runtime_config.get_transform_metadata() | execution_config.get_input_params()
        metadata = {
            "pipeline": execution_config.pipeline_id,
daw3rd (Member) commented:
Is the "parallelization" being captured as metadata somewhere?

blublinsky (Collaborator, Author) replied:
Yes, in the input parameters.
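
A minimal sketch, not from the thread, of how the dict union above carries a parallelization setting into the job metadata; every key and value here is a made-up example:

# Self-contained illustration of the dict-union pattern (Python 3.9+, PEP 584).
transform_metadata = {"transform": "noop", "version": "0.2.0"}
execution_params = {"parallelization": 4, "memory_per_executor": "2g"}

# The right-hand operand wins on key collisions.
input_params = transform_metadata | execution_params

metadata = {
    "pipeline": "example_pipeline_id",
    "job_input_params": input_params,
}

# parallelization survives inside the merged input parameters,
# which is what the reply above refers to.
assert metadata["job_input_params"]["parallelization"] == 4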

transforms/.make.transforms (review thread resolved)
blublinsky force-pushed the spark_experimental branch 3 times, most recently from 052e929 to 625132c, August 14, 2024 07:35
cmadam previously requested changes, Aug 14, 2024
cmadam (Collaborator) left a comment:

Please reinstate the two-stage build process for the Dockerfile. I also suggest that we keep a build argument that controls whether Hadoop is used during the build, with its value set to false by default.

I have also created a requirements page for supporting fuzzy dedup. I suggest that we provide support for fuzzy dedup in this runtime before merging it into the dev branch.

data-processing-lib/spark/Dockerfile (review thread outdated, resolved)
blublinsky (Collaborator, Author) commented:

I think it's time to merge. It might require some additional small changes to run fuzzy dedup, but we can work that out later.

blublinsky dismissed stale reviews from cmadam and daw3rd, September 12, 2024 12:04, with the comment:

it's done

daw3rd merged commit 03cba30 into dev, Sep 12, 2024
21 checks passed
3 participants