-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make scalar and array handling for array_has consistent #13683
Conversation
@@ -215,7 +215,11 @@ fn array_has_dispatch_for_array<O: OffsetSizeTrait>( | |||
let needle_row = Scalar::new(needle.slice(i, 1)); | |||
let eq_array = compare_with_eq(&arr, &needle_row, is_nested)?; | |||
let is_contained = eq_array.true_count() > 0; | |||
boolean_builder.append_value(is_contained) | |||
if is_contained || eq_array.null_count() == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am surprised this isn't
if is_contained || eq_array.null_count() == 0 { | |
if is_contained && eq_array.null_count() == 0 { |
I thought if the eq_array is null that means there was at least one comparison that is not known 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics are basically:
- if the element is contained, return true
- otherwise if there is a null value in the array, return null
- otherwise return false
So nulls don't matter if there's match, it only matters if there's no match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @Kimahriman
I have a question
This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:
I think that is the right set of target systems.
@@ -5260,6 +5270,13 @@ select array_has([], null), | |||
---- | |||
NULL NULL NULL | |||
|
|||
# If lhs is has any Nulls, we return Null instead of false | |||
query BB | |||
select array_has([1, null, 2], 3), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a correct behavior.
Checking DuckDB
D select array_has([1, null, 2], 3);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(1, NULL, 2), 3) │
│ boolean │
├───────────────────────────────────────────┤
│ false │
└───────────────────────────────────────────┘
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm DuckDB is different then Spark and Postgres then
spark-sql (default)> SELECT array_contains(array(1, NULL, 3), 4);
NULL
postgres=# SELECT 4 =ANY('{1, NULL, 3}');
?column?
----------
(1 row)
Was hoping to use this to implement ArrayContains
for Comet, and this is the one behavior difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out the existing "correct" behavior for scalars doesn't match DuckDB either, which never returns null unless the whole array is null
DuckDB
D select array_has([null, null], 4);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(NULL, NULL), 4) │
│ boolean │
├───────────────────────────────────────────┤
│ false │
└───────────────────────────────────────────┘
DF
DataFusion CLI v43.0.0
> select array_has([null, null], 4);
+-------------------------------------------+
| array_has(make_array(NULL,NULL),Int64(4)) |
+-------------------------------------------+
| |
+-------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to match DuckDB behavior and updated description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the test gives a correct answer
We follow DuckDB for array function mostly, the best I can think of is implementing spark function in https://github.com/datafusion-contrib/datafusion-functions-extra |
I agree with @jayzhan211 that we should follow the DuckDB pattern for the array family function, as we consider the DuckDB results. |
Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065 But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52 So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it |
Interesting, in this case I think changing the result is not a bad idea. |
just checked the Trino, it returns null in such case. DuckDB returns false. |
I recommend following @jayzhan211 and @Weijun-H 's suggestion that we follow DuckDB semantics for array/list functions I think it is past time to document what we think we are doing with respect to dialects, and I will attempt to do so shortly |
Can we add config setting in Something like If |
That sounds reasonable but those discrepancies tends to grow so maintain those params could be unbearable. My feeling we should follow some standard whatever it is, but one and the idea of extensibility is awesome so users can implements their own behavior |
Yeah that does seem like a messy slippery slope. And I feel like part of the critique in #13525 was that too many things required the |
Maybe we you could just implement a version of array_has that has the null semantics that you need in your system 🤔 |
Yeah that seems like the only viable option. I'll just update this to match duckdb |
What is the status of this PR? It seems from the most recent comments we don't plan to proceed with this PR? I'll mark it draft to signify it isn't waiting on review, but please feel free to correct me if I am wrong |
This is ready. It still fixes a minor discrepancy between scalars and arrays handling of null values, so that both now match ducksb's behavior |
@@ -214,8 +214,7 @@ fn array_has_dispatch_for_array<O: OffsetSizeTrait>( | |||
let is_nested = arr.data_type().is_nested(); | |||
let needle_row = Scalar::new(needle.slice(i, 1)); | |||
let eq_array = compare_with_eq(&arr, &needle_row, is_nested)?; | |||
let is_contained = eq_array.true_count() > 0; | |||
boolean_builder.append_value(is_contained) | |||
boolean_builder.append_value(eq_array.true_count() > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a behavior change I don't think - just a refactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this was just matching the style of the other
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like an improvement to me -- that now DataFusion is at least consistent
Also since most of the array functions follow DuckDB I think picking to follow that is also a good idea.
Thank you for your patience @Kimahriman
🚀 thanks again @Kimahriman |
Which issue does this PR close?
Closes #13682
Rationale for this change
Makes null handling for
array_has
consistent across scalars and arrays, following the DuckDB behavior of returning false instead of null if a list contains null values.What changes are included in this PR?
Updates both scalar and array handling for
array_has
to return false regardless of if the left hand side has any null values in it. This matches DuckDB, but differs from systems like Postgres, Spark, and Trino.Are these changes tested?
Tests are added and updated to accommodate the changes.
Are there any user-facing changes?
Fix handling of array_has for scalars to match the array behavior.