
bug: improve schema checking for insert into cases #14572

Open
wants to merge 7 commits into base: main
Conversation

zhuqi-lucas
Contributor

Which issue does this PR close?

Describe the bug
In #14394, it was reported that while attempting to implement a DataSink, the record batches received had different schemas than the one declared by the RecordBatchStream.

A fix for the given example, an INSERT INTO ... VALUES query, was merged (#14472). However, this issue likely arises whenever the schema of the source of an INSERT statement contains fields that differ from the table schema in nullability. That is, the problem is not limited to INSERT INTO ... VALUES statements.

What changes are included in this PR?

Add a separate nullability check alongside the original check, which only covered the field name and datatype.
Improve the error message to include more information about the failure.

There are three cases we need to check:

  1. The plan schema and the table schema must have the same number of fields
  2. The plan schema and the table schema must agree on the nullability of each field
  3. The plan schema and the table schema must agree on the datatype of each field
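These three checks can be sketched as follows. This is a minimal illustration using a stand-in Field type, not DataFusion's actual DFSchema/Field types, and the error strings are illustrative rather than the exact ones in the PR:

```rust
// Stand-in for a schema field; the real code operates on arrow/DataFusion schemas.
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

// Sketch of the three schema checks for INSERT planning.
fn check_insert_schema(table: &[Field], plan: &[Field]) -> Result<(), String> {
    // 1. The number of fields must match.
    if table.len() != plan.len() {
        return Err(format!(
            "Inserting query must have the same schema length as the table \
             (expected {} fields, got {})",
            table.len(),
            plan.len()
        ));
    }
    for (t, p) in table.iter().zip(plan) {
        // 2. Nullability must agree (as initially described; the PR later
        //    relaxes this to a one-directional check).
        if t.nullable != p.nullable {
            return Err(format!(
                "Inserting query must have the same schema nullability as the table. \
                 Expected table field '{}' nullability: {}, got field: '{}', nullability: {}",
                t.name, t.nullable, p.name, p.nullable
            ));
        }
        // 3. Datatypes must agree.
        if t.data_type != p.data_type {
            return Err(format!(
                "Expected table field '{}' with type {}, got field '{}' with type {}",
                t.name, t.data_type, p.name, p.data_type
            ));
        }
    }
    Ok(())
}
```

Returning `Result<(), String>` rather than `Result<bool>` lets each failing case carry its own message, which is the design point discussed in the review below.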

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate labels Feb 10, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 10, 2025
@zhuqi-lucas zhuqi-lucas marked this pull request as draft February 10, 2025 10:16
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review February 10, 2025 15:32
Contributor

@jayzhan211 jayzhan211 left a comment


Can you explain the reason for the change in test.slt? Thanks.

// 1. The len of the schema of the plan and the schema of the table should be the same
// 2. The nullable flag of the schema of the plan and the schema of the table should be the same
// 3. The datatype of the schema of the plan and the schema of the table should be the same
fn logically_equivalent_names_and_types(&self, other: &Self) -> Result<(), String> {
Contributor


Why not Result<bool>?

Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


Originally I used Result<bool>, but I want to return three different error messages for the different cases, so I changed it to Result<(), String>.

Contributor


You can also define different messages with internal_err!("msg1") and internal_err!("msg2").

f1.name() == f2.name()
&& DFSchema::datatype_is_logically_equal(
.try_for_each(|(f1, f2)| {
if f1.is_nullable() != f2.is_nullable() {
Contributor


If the table field is nullable, we can still insert a non-nullable column. Similar to #14519.

Member


This seems like a regression to me 🤔. Even though the schema of a source is nullable, all of its data can be non-null, and in such cases it can still be inserted into a non-nullable sink. When inserting, we currently validate against the actual data rather than the schema. See check_not_null_constraints.
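The data-vs-schema distinction the reviewer describes can be sketched like this. The Column type and function here are hypothetical stand-ins, not the actual signature of DataFusion's check_not_null_constraints, which operates on arrow RecordBatches:

```rust
// Hypothetical stand-in for one column of a record batch.
struct Column {
    name: String,
    values: Vec<Option<i64>>,
}

// Data-level check: a column whose *schema* says "nullable" may still be
// inserted into a non-nullable sink field, as long as it contains no actual
// nulls. Only a real null in the data is an error.
fn check_not_null(sink_field_nullable: bool, col: &Column) -> Result<(), String> {
    if sink_field_nullable {
        // The sink accepts nulls; nothing to validate.
        return Ok(());
    }
    if col.values.iter().any(|v| v.is_none()) {
        return Err(format!(
            "column '{}' contains nulls but the sink field is not nullable",
            col.name
        ));
    }
    Ok(())
}
```

Under this model, rejecting a nullable-typed source at planning time is stricter than the existing runtime behavior, which is the regression being pointed out.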

Member


If 'DataSink receiving different schemas' is an issue, we can rewrite the schema of batches emitted by DataSinkExec.

Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


Thank you @jayzhan211 and @jonahgao for the review; this is a good point. I changed it so the nullability check only errors in this case:
// only check the case when the table field is not nullable and the insert data field is nullable
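The refined rule can be stated as a tiny predicate (an illustrative sketch, not the PR's actual code): only the genuinely unsafe direction is rejected at planning time.

```rust
// A non-nullable source can always feed a nullable or non-nullable table
// field; only "nullable source into non-nullable table field" is rejected.
fn nullability_compatible(table_nullable: bool, source_nullable: bool) -> bool {
    !(!table_nullable && source_nullable)
}
```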

@@ -78,7 +104,7 @@ physical_plan
query I
INSERT INTO table_without_values SELECT
SUM(c4) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
NULLIF(COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), 0)
Contributor


Why do we need NULLIF? Does its use indicate a potential issue?

Contributor Author


This regression no longer happens after the code changes above.

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion physical-expr Physical Expressions optimizer Optimizer rules proto Related to proto crate functions labels Feb 11, 2025
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion physical-expr Physical Expressions optimizer Optimizer rules proto Related to proto crate functions labels Feb 11, 2025
@@ -81,11 +77,9 @@ STORED AS arrow
LOCATION 'test_files/scratch/insert_to_external/arrow_dict_partitioned/'
PARTITIONED BY (b);

query I
query error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'b' nullability: false, got field: 'b', nullability: true
Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


This is strange: it means PARTITIONED BY (b) makes the field 'b' non-nullable (nullability: false)?

This is the only case that differs, and it happens only when PARTITIONED BY is used.

cc @jayzhan211 @jonahgao

@@ -228,7 +228,7 @@ CREATE TABLE aggregate_test_100_null (
c11 FLOAT
);

statement ok
statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


I think this is the only regression in the slt. cc @jayzhan211 @jonahgao

# Setup test data table
statement ok
CREATE EXTERNAL TABLE aggregate_test_100 (
  c1  VARCHAR NOT NULL,
  c2  TINYINT NOT NULL,
  c3  SMALLINT NOT NULL,
  c4  SMALLINT,
  c5  INT,
  c6  BIGINT NOT NULL,
  c7  SMALLINT NOT NULL,
  c8  INT NOT NULL,
  c9  INT UNSIGNED NOT NULL,
  c10 BIGINT UNSIGNED NOT NULL,
  c11 FLOAT NOT NULL,
  c12 DOUBLE NOT NULL,
  c13 VARCHAR NOT NULL
)
STORED AS CSV
LOCATION '../../testing/data/csv/aggregate_test_100.csv'
OPTIONS ('format.has_header' 'true');

statement ok
CREATE TABLE aggregate_test_100_null (
  c2  TINYINT NOT NULL,
  c5  INT NOT NULL,
  c3  SMALLINT,
  c11 FLOAT
);

statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
INSERT INTO aggregate_test_100_null
SELECT
  c2,
  c5,
  CASE WHEN c1 = 'e' THEN NULL ELSE c3 END as c3,
  CASE WHEN c1 = 'a' THEN NULL ELSE c11 END as c11
FROM aggregate_test_100;

I think the original behaviour was wrong, because the target table column is not nullable.

statement ok
CREATE TABLE aggregate_test_100_null (
c2 TINYINT NOT NULL,
c5 INT,
Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


Noted. I also added the successful case, in which the table field c5 is nullable.

@alamb alamb changed the title bug: improve schema checking for instert into cases bug: improve schema checking for `insert into cases Feb 11, 2025
@alamb alamb changed the title bug: improve schema checking for `insert into cases bug: improve schema checking for insert into cases Feb 11, 2025
@zhuqi-lucas
Contributor Author

Can you explain the reason for the change in test.slt? Thanks.

Thank you for the review @jayzhan211. I have updated the slt and added notes for the only two SQL queries whose results differ.

Labels
common Related to common crate core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Construct source plan schema with correct nullability during INSERT planning.
3 participants