
Fix #239 #314

Closed
wants to merge 3 commits into from

Conversation

imarios
Contributor

@imarios imarios commented Jul 4, 2018

No description provided.

@@ -111,7 +111,6 @@ class TypedDataset[T] protected[frameless](val dataset: Dataset[T])(implicit val
val selected = dataset.toDF()
.agg(cols.head.alias("_1"), cols.tail: _*)
.as[Out](TypedExpressionEncoder[Out])
.filter("_1 is not null") // otherwise spark produces List(null) for empty datasets
Contributor
I guess the comment is outdated by now?

Contributor Author

@imarios imarios Jul 5, 2018

Unfortunately it is not. However, this is the same behavior as vanilla Spark: if you try to average an empty collection, it returns null. That probably agrees with the SQL standard, which is my guess as to why they do it. I am inclined to fix the test and keep the change. I think the bug it introduces is much worse than the problem it tried to solve.

Contributor Author

@imarios imarios Jul 5, 2018

Here is the discussion about (I believe) the same issue in Spark: https://issues.apache.org/jira/browse/SPARK-20346.

Philosophically speaking, for a strongly typed API, returning an unexpected null is not ideal. Someone collecting the result of an aggregation operation that returns, say, a Tuple2[String, String] would never check for null; but if the dataset is empty (say, no data has been loaded yet), this may result in an NPE in their code.
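To make the NPE scenario concrete, here is a minimal pure-Scala sketch (no Spark required, names are illustrative) of what happens when a caller treats the collected aggregation result as non-null:

```scala
object NullAggExample extends App {
  // Stand-in for what collecting an aggregation over an empty dataset
  // can hand back in vanilla Spark: a null where a tuple is expected.
  val result: (String, String) = null

  // The caller never checks for null, so dereferencing a field throws.
  val threw =
    try { result._1; false }
    catch { case _: NullPointerException => true }

  println(s"NPE thrown: $threw")
}
```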

Here are some approaches we can take:

  1. The super-safe approach. All results of agg should return, say, an Option[Tuple2[String, String]]. The type makes it clear that the result can be null. It's super functional and, at the same time, super clunky to work with. Accessing nested columns is still sub-optimal in Frameless (API-wise); accessing nested optional columns is even worse.

  2. Try not to deviate that much from vanilla. If this is accepted behavior in Spark that is compatible with SQL engines, then maybe that's what we should do. This also reminds me of "Unexpected null equality" #269, where we deviated from vanilla to give the API a more Scala/functional feel, which led to an unexpected join behavior. If we had just stuck to vanilla compatibility there, we would have one less bug on the list.

  3. Be "smarter" about the last filter. We used to just filter("_1 is not null"), which we now know is wrong: If we do a first on an Optional column that happens to be my first column ("_1"), if the first entry happened to be None the whole row is dropped, which is exactly what Optional aggregation columns shortcut the computation to an empty dataset #239 reports (this is undeniably a bug). Now, what we can probably do is check if the result for all columns is null, so make the filter --> filter("_1 is not null || _2 is not null ....") to as many columns as the resulting schema has. The only issue here is that Instead of getting (say) Array(Tuple3[String, String, String](None, None, None)) as a result you now get an empty array. Semantically, they are different, but practically speaking this should be cool for 99% of the use cases. If you really want to know if the dataset is empty, maybe just check for it explicitly (just saying)?

So the current "fix" takes approach 2 above. The more I think about it, the more I like approach 3. It pretty much aligns with what we had, but fixes this corner case bug. @OlivierBlanvillain?
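As a rough sketch of approach 3 (the helper name is hypothetical, not the actual frameless internals), the filter condition could be built from the number of columns in the aggregation's output schema:

```scala
object Approach3Sketch extends App {
  // Hypothetical helper: build "_1 is not null || _2 is not null || ..."
  // for as many columns as the aggregation result has, so a row is kept
  // as long as at least one of its columns is non-null.
  def keepRowsWithAnyNonNull(columnCount: Int): String =
    (1 to columnCount)
      .map(i => s"_$i is not null")
      .mkString(" || ")

  // For a three-column aggregation result:
  println(keepRowsWithAnyNonNull(3))
  // _1 is not null || _2 is not null || _3 is not null
}
```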

Contributor

@OlivierBlanvillain OlivierBlanvillain Jul 18, 2018

Approach 3 also sounds good to me. At the same time we already have several operations (head & reduce off the top of my head) that return options to safely deal with emptiness...
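For comparison, the Option-returning style used by operations like head and reduce (sketched here with a hypothetical helper, not the actual frameless API) would surface emptiness in the type rather than as a null:

```scala
object OptionAggSketch extends App {
  // Hypothetical approach-1 style: wrap a possibly-null aggregation
  // result in an Option so an empty dataset yields None instead of null.
  def safeAgg[A](raw: A): Option[A] = Option(raw)

  println(safeAgg[(String, String)](null)) // None (empty dataset)
  println(safeAgg(("avg", "sum")))         // Some((avg,sum))
}
```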
