Fix #239 #314
Closed
I guess the comment is outdated by now?
Unfortunately it is not. However, this is the same behavior as vanilla Spark: if you try to average an empty collection, it returns null. That probably agrees with the SQL standard, which would explain why they do it (that's my guess). I am inclined to fix the test and keep the change. I think the bug it introduces is much worse than the problem it tried to solve.
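For reference, a minimal sketch of the vanilla behavior described above (assuming a local `SparkSession`; `value` is the default column name for a `Dataset[Int]`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// AVG over zero rows is defined as null in SQL, and Spark follows suit:
val empty = Seq.empty[Int].toDS()
val row = empty.agg(avg($"value")).head()
row.isNullAt(0) // true: averaging an empty collection yields null
```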
Here is the discussion about (I believe) the same issue in Spark: https://issues.apache.org/jira/browse/SPARK-20346.
Philosophically speaking, for a strongly typed API, returning an unexpected null is not ideal. Someone collecting the result of an aggregation operation that returns a (say) Tuple2[String, String] would never check for null, but if the dataset is empty (say, no data were loaded yet), this may result in an NPE in your code.
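To make the NPE risk concrete, a hypothetical sketch (`collectAgg` is a made-up stand-in for an aggregation that is statically typed but returns null on an empty dataset):

```scala
// Stand-in for the behavior described above: the static type promises a
// Tuple2, but the runtime value is null when no rows were aggregated.
def collectAgg(): (String, String) = null

val result: (String, String) = collectAgg()
result._1 // throws NullPointerException; the type signature never hinted at null
```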
Here are some approaches we can take:

1. The super-safe approach. All results of `agg` should return a (say) `Option[Tuple2[String, String]]`. The type makes it clear that the result can be absent. It's super functional and, at the same time, super clunky to work with. Accessing nested columns is still sub-optimal in Frameless (API-wise); accessing nested optional columns is even worse.
2. Try not to deviate that much from vanilla. If this is an accepted behavior in Spark that is compatible with SQL engines, then maybe that's what we should do. This also reminds me of Unexpected null equality #269, where we deviated from vanilla to give a more Scala/functional feel to the API, and that led to an unexpected join behavior. If we had just stuck to vanilla compatibility there, we would have one less bug in the list?
3. Be "smarter" about the last filter. We used to just `filter("_1 is not null")`, which we now know is wrong: if we do a `first` on an Optional column that happens to be the first column (`_1`), and the first entry happens to be `None`, the whole row is dropped, which is exactly what Optional aggregation columns shortcut the computation to an empty dataset #239 reports (this is undeniably a bug). What we can probably do instead is check whether the results for all columns are null, i.e. make the filter `filter("_1 is not null || _2 is not null || ...")` with as many columns as the resulting schema has. The only issue here is that instead of getting (say) `Array(Tuple3[String, String, String](None, None, None))` as a result, you now get an empty array. Semantically they are different, but practically speaking this should be fine for 99% of the use cases. If you really want to know whether the dataset is empty, maybe just check for that explicitly (just saying)?

So the current "fix" takes approach 2 above. The more I think about it, the more I like approach 3. It pretty much aligns with what we had, but fixes this corner-case bug. @OlivierBlanvillain?
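A rough sketch of how the disjunctive filter in approach 3 could be assembled (the helper name is illustrative, not the actual Frameless internals):

```scala
// Build "_1 is not null || _2 is not null || ..." over all output columns,
// so a row is dropped only when *every* aggregated column is null.
def keepIfAnyNonNull(columns: Seq[String]): String =
  columns.map(c => s"$c is not null").mkString(" || ")

keepIfAnyNonNull(Seq("_1", "_2", "_3"))
// "_1 is not null || _2 is not null || _3 is not null"
```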
Approach 3 also sounds good to me. At the same time, we already have several operations (`head` and `reduce`, off the top of my head) that return options to safely deal with emptiness...