-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Element-wise +, -, *, / for BoltArraySpark #94
base: master
Are you sure you want to change the base?
Conversation
Include elements of Spark Arrays that allow for element-wise addition, subtraction, multiplication, and division.
Fixed __truediv__ functionality. Submitted this as a patch to the main branch with the following note: Added basic element-wise addition (+), subtraction (-), multiplication (*), and division (/) to BoltSparkArrays. Method used is `_rdd().join().mapValues()`. Attempted to play around with `_rdd().cogroup().reduceByKey()` but `cogroup()` returns interables and I was not familiar enough with them to spend the time troubleshooting other methods for a time efficiency comparison. May be a better method to implement these operations, but I went with the best I knew how. I didn't include any of the reverse functions (__radd__ etc) as I wasn't sure how it would behave/interact with local array functions if there was an attempt to list the local array first (`local + Spark`). There are no issues as long as it is always two arrays being added together and the Spark array is listed first (`Spark + local`). I also didn't add support for adding/subtracting/multiplying/dividing by a constant integer across the whole array. I wasn't sure if this is functionality we would want to add, as it can be done using `map()`. My personal opinion is to require the user to use `map()` because I feel it forces the user to remember they are working on distributed arrays. However, this functionality could be easily implemented into the existing functions with some if statements in checking for integers and doing the `map()` under the hood. On a similar note to above, these functions could also be changed to not use the symbolic values (+, -, *, /) but be calls required in a map format (`.add()`, `.subtract()`, `multiply()`, `divide()`). Finally, if there is an intention to add many operations for distributed BoltSparkArrays similar to all the functions for python ndarrays, perhaps an independent class or .py file should be created to contain these operations that could then be called? I wasn't sure of the long-term intentions of what types of functions there is an interest to integrate into BoltArrays. Tried to include similar safety catches as `concatenate()`. For division, true division from __future__ is always performed. Basic troubleshooting was performed without any major issues, but more thorough troubleshooting may be useful.
@kkcthans This is great! One thought: all of these functions end up having the same boilerplate at the beginning. One nice way to organize this might be to have a generic |
Ya, I noticed that when I saw this was all ready implemented within Thunder/base.py. @freeman-lab said he was going to look at this pull request and talk with you about how to handle its implementation in addition to code all ready in Thunder (Decide what implementations should go where). I'd be happy to change it to that more generic element_wise function after you guys decide what will be best for the projects and implement it wherever. I'll change it and update appropriately after you guys decide where you want it. |
Just looked this over with @jwittenbach, it's looking really good! We should definitely put this in Look at what's in https://github.com/bolt-project/bolt/blob/master/test/spark/test_spark_basic.py to see how to structure them. You can run the tests on your own machine just by calling Let us know if you run into trouble! |
I'll look at implementing both of these and do that to update the code |
Does it matter too much where I put the tests? I was thinking of creating a new test, something like test_spark_binaryoperators.py. Or would it be better if I just append them to the end of the test_spark_basic.py, or even test_spark_functional.py? What is your preference in terms of organization for the long run? |
Ok @freeman-lab & @jwittenbach , I just added a test_spark_binaryoperators.py file to the test location. It looks like it worked. I added an elementwise function and tried to make sure it included appropriate checks to work with a BoltSparkArray, BoltLocalArray, or ndarray. One thing to note (probably in the documentation or something) unless we want to change this to make it more uniform, is that the output array will take the format of the first array. For example with: Let me know if there are any other things you guys feel I should tweak with it for this pull request. |
Added basic element-wise addition (+), subtraction (-), multiplication (*), and division (/) to BoltSparkArrays.
Method used is
_rdd().join().mapValues()
. Attempted to play around with_rdd().cogroup().reduceByKey()
butcogroup()
returns interables and I was not familiar enough with them to spend the time troubleshooting other methods for a time efficiency comparison. May be a better method to implement these operations, but I went with the best I knew how.I didn't include any of the reverse functions (radd etc) as I wasn't sure how it would behave/interact with local array functions if there was an attempt to list the local array first (
local + Spark
). There are no issues as long as it is always two arrays being added together and the Spark array is listed first (Spark + local
).I also didn't add support for adding/subtracting/multiplying/dividing by a constant integer across the whole array. I wasn't sure if this is functionality we would want to add, as it can be done using
map()
. My personal opinion is to require the user to usemap()
because I feel it forces the user to remember they are working on distributed arrays. However, this functionality could be easily implemented into the existing functions with some if statements in checking for integers and doing themap()
under the hood.On a similar note to above, these functions could also be changed to not use the symbolic values (+, -, *, /) but be calls required in a map format (
.add()
,.subtract()
,multiply()
,divide()
).Finally, if there is an intention to add many operations for distributed BoltSparkArrays similar to all the functions for python ndarrays, perhaps an independent class or .py file should be created to contain these operations that could then be called? I wasn't sure of the long-term intentions of what types of functions there is an interest to integrate into BoltArrays.
Tried to include similar safety catches as
concatenate()
.For division, true division from future is always performed.
Basic troubleshooting was performed without any major issues, but more thorough troubleshooting may be useful.