Add uses_bulkdata argument to paasta spark run instance_config #4005

SuperMatt · 2025-01-14T11:09:51Z

This makes the change to paasta spark run so that https://github.yelpcorp.com/sysgit/yelpsoa-configs/pull/52010 will work as expected.

This works by adding the uses_bulkdata key to the instance config if the spark job has the key present and set to true.
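As a rough illustration of that rule (not the actual diff in this PR), the behaviour can be sketched as below. The function name `with_uses_bulkdata` and both dict arguments are hypothetical stand-ins; the real change lives in paasta's spark run code and may be structured differently.

```python
from typing import Any, Dict


def with_uses_bulkdata(
    instance_config: Dict[str, Any],
    spark_job_config: Dict[str, Any],
) -> Dict[str, Any]:
    """Hypothetical helper: copy uses_bulkdata onto the instance config.

    The key is only added when the spark job config has it present and
    set to True; otherwise the instance config is returned unchanged.
    """
    updated = dict(instance_config)  # avoid mutating the caller's dict
    if spark_job_config.get("uses_bulkdata") is True:
        updated["uses_bulkdata"] = True
    return updated
```

A missing key and an explicit `False` are treated the same way here: the instance config comes back without `uses_bulkdata`.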

I have added this arg to the tests so that they pass, and also created a test so that we can check all the different ways that uses_bulkdata can be set, either on paasta spark-run as an argument, or in the instance config.

See #3995 for more information about why we're doing this.


nemacysts · 2025-01-24T16:21:07Z

tests/cli/test_cmds_spark_run.py

+    assert (
+        mock_get_instance_config.return_value.config_dict["uses_bulkdata"] == expected
+    )


we've mocked this on L1436 - so this assertion isn't doing much atm (I guess right now this test is really just checking that spark_run.paasta_spark_run(args) doesn't crash if we remove this always-passing assertion?)

imo, deleting this test is fine (the new logic is pretty trivial) or updating the test to assert that the correct mounts are in place in the final command would be where we should go from here :)
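To illustrate the concern, here is a minimal sketch of why an assertion against a value the test itself placed on a mock cannot fail. It assumes the mock's `config_dict` was pre-populated with the expected value; `run_without_setting_key` is a hypothetical stand-in for the code under test, not the real `spark_run.paasta_spark_run`.

```python
from unittest import mock


def run_without_setting_key(get_instance_config):
    """Hypothetical code under test that never touches uses_bulkdata."""
    return get_instance_config()


mock_get_instance_config = mock.Mock()
# Assumption: the test setup puts the expected value on the mock itself.
mock_get_instance_config.return_value.config_dict = {"uses_bulkdata": True}

returned = run_without_setting_key(mock_get_instance_config)

# This check reads back the value set two lines above, so it holds even
# though run_without_setting_key did nothing with the key.
vacuous_check_passes = (
    mock_get_instance_config.return_value.config_dict["uses_bulkdata"] is True
)
```

An assertion on something the code under test actually produces, such as the final command, would not have this problem.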

SuperMatt · 2025-01-30T08:50:18Z

After discussing with @nemacysts, we think that we don't need a test that the volume is present: this test checks that uses_bulkdata ends up on the instance config, and we already have another test for the volume being added when uses_bulkdata is set to true, which can be seen here. Adding a spark test for this volume being included would be redundant.

SuperMatt force-pushed the u/mames/PERES-5194-uses-bulkdata-spark-run branch from 7c084d3 to 23fcbfa Compare January 20, 2025 15:37

SuperMatt requested review from nemacysts and chi-yelp January 20, 2025 15:38

timmow mentioned this pull request Jan 20, 2025

Add uses bulkdata argument to paasta spark run #3995

Closed

nemacysts reviewed Jan 24, 2025


chi-yelp approved these changes Feb 3, 2025


SuperMatt merged commit dde1ed3 into master Feb 3, 2025
10 checks passed
