
Fix #239 #314

Closed
wants to merge 3 commits into from

Conversation

imarios
Contributor

@imarios imarios commented Jul 4, 2018

No description provided.

@@ -111,7 +111,6 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
val selected = dataset.toDF()
.agg(cols.head.alias("_1"), cols.tail: _*)
.as[Out](TypedExpressionEncoder[Out])
.filter("_1 is not null") // otherwise spark produces List(null) for empty datasets
Contributor
I guess the comment is outdated by now?

Contributor Author

@imarios imarios Jul 5, 2018

Unfortunately it is not. However, this is the same behavior as vanilla Spark: if you try to average an empty collection, it returns null. That probably agrees with the SQL standard, which is my guess as to why they do it. I am inclined to fix the test and keep the change. I think the bug it introduces is much worse than the problem it tried to solve.

Contributor Author

@imarios imarios Jul 5, 2018

Here is the discussion about (I believe) the same issue in Spark: https://issues.apache.org/jira/browse/SPARK-20346.

Philosophically speaking, for a strongly typed API, returning an unexpected null is not ideal. Someone collecting the result of an aggregation operation that returns, say, a Tuple2[String, String] would never check for null; but if the dataset is empty (say, no data has been loaded yet), this may result in an NPE in their code.
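To make the NPE scenario concrete, here is a minimal pure-Scala sketch (no Spark required, names are illustrative) of what happens when a caller treats the collected aggregation result as non-null:

```scala
object NullAggExample extends App {
  // Stand-in for what collecting an aggregation over an empty dataset
  // can hand back in vanilla Spark: a null where a tuple is expected.
  val result: (String, String) = null

  // The caller never checks for null, so dereferencing a field throws.
  val threw =
    try { result._1; false }
    catch { case _: NullPointerException => true }

  println(s"NPE thrown: $threw")
}
```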

Here are some approaches we can take:

  1. The super-safe approach. All results of agg should return, say, an Option[Tuple2[String, String]]. The type makes it clear that the result can be null. It's super functional and, at the same time, super clunky to work with. Accessing nested columns is still sub-optimal in Frameless (API-wise); accessing nested optional columns is even worse.

  2. Try not to deviate that much from vanilla. If this is accepted behavior in Spark that is compatible with SQL engines, then maybe that's what we should do. This also reminds me of "Unexpected null equality" #269, where we deviated from vanilla to give the API a more Scala/functional feel, which led to an unexpected join behavior. If we had just stuck to vanilla compatibility there, we would have one less bug on the list.

  3. Be "smarter" about the last filter. We used to just filter("_1 is not null"), which we now know is wrong: If we do a first on an Optional column that happens to be my first column ("_1"), if the first entry happened to be None the whole row is dropped, which is exactly what Optional aggregation columns shortcut the computation to an empty dataset #239 reports (this is undeniably a bug). Now, what we can probably do is check if the result for all columns is null, so make the filter --> filter("_1 is not null || _2 is not null ....") to as many columns as the resulting schema has. The only issue here is that Instead of getting (say) Array(Tuple3[String, String, String](None, None, None)) as a result you now get an empty array. Semantically, they are different, but practically speaking this should be cool for 99% of the use cases. If you really want to know if the dataset is empty, maybe just check for it explicitly (just saying)?

So the current "fix" takes approach 2 above. The more I think about it, the more I like approach 3. It pretty much aligns with what we had, but fixes this corner case bug. @OlivierBlanvillain?
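As a rough sketch of approach 3 (the helper name is hypothetical, not the actual frameless internals), the filter condition could be built from the number of columns in the aggregation's output schema:

```scala
object Approach3Sketch extends App {
  // Hypothetical helper: build "_1 is not null || _2 is not null || ..."
  // for as many columns as the aggregation result has, so a row is kept
  // as long as at least one of its columns is non-null.
  def keepRowsWithAnyNonNull(columnCount: Int): String =
    (1 to columnCount)
      .map(i => s"_$i is not null")
      .mkString(" || ")

  // For a three-column aggregation result:
  println(keepRowsWithAnyNonNull(3))
  // _1 is not null || _2 is not null || _3 is not null
}
```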

Contributor

@OlivierBlanvillain OlivierBlanvillain Jul 18, 2018

Approach 3 also sounds good to me. At the same time we already have several operations (head & reduce off the top of my head) that return options to safely deal with emptiness...
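For comparison, the Option-returning style used by operations like head and reduce (sketched here with a hypothetical helper, not the actual frameless API) would surface emptiness in the type rather than as a null:

```scala
object OptionAggSketch extends App {
  // Hypothetical approach-1 style: wrap a possibly-null aggregation
  // result in an Option so an empty dataset yields None instead of null.
  def safeAgg[A](raw: A): Option[A] = Option(raw)

  println(safeAgg[(String, String)](null)) // None (empty dataset)
  println(safeAgg(("avg", "sum")))         // Some((avg,sum))
}
```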
