Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make scalar and array handling for array_has consistent #13683

Merged
merged 3 commits into from
Jan 22, 2025

Conversation

Kimahriman
Copy link
Contributor

@Kimahriman Kimahriman commented Dec 7, 2024

Which issue does this PR close?

Closes #13682

Rationale for this change

Makes null handling for array_has consistent across scalars and arrays, following the DuckDB behavior of returning false instead of null if a list contains null values.

What changes are included in this PR?

Updates both scalar and array handling for array_has to return false regardless of if the left hand side has any null values in it. This matches DuckDB, but differs from systems like Postgres, Spark, and Trino.

Are these changes tested?

Tests are added and updated to accommodate the changes.

Are there any user-facing changes?

Fix handling of array_has for scalars to match the array behavior.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 7, 2024
@@ -215,7 +215,11 @@ fn array_has_dispatch_for_array<O: OffsetSizeTrait>(
let needle_row = Scalar::new(needle.slice(i, 1));
let eq_array = compare_with_eq(&arr, &needle_row, is_nested)?;
let is_contained = eq_array.true_count() > 0;
boolean_builder.append_value(is_contained)
if is_contained || eq_array.null_count() == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am surprised this isn't

Suggested change
if is_contained || eq_array.null_count() == 0 {
if is_contained && eq_array.null_count() == 0 {

I thought if the eq_array is null that means there was at least one comparison that is not known 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics are basically:

  • if the element is contained, return true
  • otherwise if there is a null value in the array, return null
  • otherwise return false

So nulls don't matter if there's match, it only matters if there's no match

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @Kimahriman

I have a question

This is the behavior defined by Postgres, and used by Spark as well, not sure what other database systems use:

I think that is the right set of target systems.

@@ -5260,6 +5270,13 @@ select array_has([], null),
----
NULL NULL NULL

# If lhs is has any Nulls, we return Null instead of false
query BB
select array_has([1, null, 2], 3),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is a correct behavior.

Checking DuckDB

D select array_has([1, null, 2], 3);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(1, NULL, 2), 3) │
│                  boolean                  │
├───────────────────────────────────────────┤
│ false                                     │
└───────────────────────────────────────────┘

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm DuckDB is different then Spark and Postgres then

spark-sql (default)> SELECT array_contains(array(1, NULL, 3), 4);
NULL
postgres=# SELECT 4 =ANY('{1, NULL, 3}');
 ?column? 
----------
 
(1 row)

Was hoping to use this to implement ArrayContains for Comet, and this is the one behavior difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out the existing "correct" behavior for scalars doesn't match DuckDB either, which never returns null unless the whole array is null

DuckDB

D select array_has([null, null], 4);
┌───────────────────────────────────────────┐
│ array_has(main.list_value(NULL, NULL), 4) │
│                  boolean                  │
├───────────────────────────────────────────┤
│ false                                     │
└───────────────────────────────────────────┘

DF

DataFusion CLI v43.0.0
> select array_has([null, null], 4);
+-------------------------------------------+
| array_has(make_array(NULL,NULL),Int64(4)) |
+-------------------------------------------+
|                                           |
+-------------------------------------------+
1 row(s) fetched. 
Elapsed 0.001 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to match DuckDB behavior and updated description

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the test gives a correct answer

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 8, 2024

We follow DuckDB for array function mostly, the best I can think of is implementing spark function in https://github.com/datafusion-contrib/datafusion-functions-extra

@Weijun-H
Copy link
Member

Weijun-H commented Dec 8, 2024

We follow DuckDB for array function mostly, the best I can think of is implementing spark function in datafusion-contrib/datafusion-functions-extra

I agree with @jayzhan211 that we should follow the DuckDB pattern for the array family function, as we consider the DuckDB results.

@Kimahriman
Copy link
Contributor Author

Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065

But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52

So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it

@jayzhan211
Copy link
Contributor

Interesting thing is that DuckDB says they based it on the PrestoDB behavior: duckdb/duckdb#3065

But a quick look at the PrestoDB implementation suggests they use the behavior I'm suggesting, if one null is found the result is null: https://github.com/prestodb/presto/blob/b68af58583b8d58992b6ab0804d8d3618f7f402b/presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayContains.java#L52

So I wonder if DuckDBs null handling was actually intentionally or just an oversight by whoever initially implemented it

Interesting, in this case I think changing the result is not a bad idea.

@comphead
Copy link
Contributor

comphead commented Dec 8, 2024

just checked the Trino, it returns null in such case. DuckDB returns false.
Not sure what should we do, @alamb WDYT to break the tie?

@alamb
Copy link
Contributor

alamb commented Dec 9, 2024

just checked the Trino, it returns null in such case. DuckDB returns false. Not sure what should we do, @alamb WDYT to break the tie?

I recommend following @jayzhan211 and @Weijun-H 's suggestion that we follow DuckDB semantics for array/list functions

I think it is past time to document what we think we are doing with respect to dialects, and I will attempt to do so shortly

@jayzhan211
Copy link
Contributor

jayzhan211 commented Dec 14, 2024

Can we add config setting in SessionConfig to tune the result of function after #13519 is done?

Something like returns_null_if_contains_null, ignore_nulls, convert_null_to_value and so on. (These should be dialect-agnostic)

If returns_null_if_contains_null is true, we can have the Spark result of array_has, otherwise we have Postgres/DuckDB result.

@comphead
Copy link
Contributor

Can we add config setting in SessionConfig to tune the result of function after #13519 is done?

Something like returns_null_if_contains_null, ignore_nulls, convert_null_to_value and so on. (These should be dialect-agnostic)

If returns_null_if_contains_null is true, we can have the Spark result of array_has, otherwise we have Postgres/DuckDB result.

That sounds reasonable but those discrepancies tends to grow so maintain those params could be unbearable. My feeling we should follow some standard whatever it is, but one and the idea of extensibility is awesome so users can implements their own behavior

@Kimahriman
Copy link
Contributor Author

Yeah that does seem like a messy slippery slope. And I feel like part of the critique in #13525 was that too many things required the SessionContext to customize.

@alamb
Copy link
Contributor

alamb commented Dec 16, 2024

Yeah that does seem like a messy slippery slope. And I feel like part of the critique in #13525 was that too many things required the SessionContext to customize.

Maybe we you could just implement a version of array_has that has the null semantics that you need in your system 🤔

@Kimahriman
Copy link
Contributor Author

Maybe we you could just implement a version of array_has that has the null semantics that you need in your system 🤔

Yeah that seems like the only viable option. I'll just update this to match duckdb

@alamb
Copy link
Contributor

alamb commented Jan 16, 2025

What is the status of this PR? It seems from the most recent comments we don't plan to proceed with this PR?

I'll mark it draft to signify it isn't waiting on review, but please feel free to correct me if I am wrong

@alamb alamb marked this pull request as draft January 16, 2025 22:23
@Kimahriman
Copy link
Contributor Author

This is ready. It still fixes a minor discrepancy between scalars and arrays handling of null values, so that both now match ducksb's behavior

@alamb alamb marked this pull request as ready for review January 18, 2025 23:03
@@ -214,8 +214,7 @@ fn array_has_dispatch_for_array<O: OffsetSizeTrait>(
let is_nested = arr.data_type().is_nested();
let needle_row = Scalar::new(needle.slice(i, 1));
let eq_array = compare_with_eq(&arr, &needle_row, is_nested)?;
let is_contained = eq_array.true_count() > 0;
boolean_builder.append_value(is_contained)
boolean_builder.append_value(eq_array.true_count() > 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a behavior change I don't think - just a refactor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this was just matching the style of the other

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an improvement to me -- that now DataFusion is at least consistent

Also since most of the array functions follow DuckDB I think picking to follow that is also a good idea.

Thank you for your patience @Kimahriman

@alamb alamb dismissed comphead’s stale review January 22, 2025 23:08

Changes were addressed

@alamb alamb merged commit 3efcd6a into apache:main Jan 22, 2025
26 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 22, 2025

🚀 thanks again @Kimahriman

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

array_has has inconsistent null handling for scalars and arrays
5 participants