
bug: improve schema checking for insert into cases #14572

Open
wants to merge 7 commits into base: main
Conversation

zhuqi-lucas
Contributor

Which issue does this PR close?

Describe the bug
In #14394, it was reported that while attempting to implement a DataSink, the record batches received had different schemas than the one declared by the RecordBatchStream.

A fix for the given example, an INSERT INTO ... VALUES query, was merged (#14472). However, this issue likely arises whenever the schema of the source of an INSERT statement contains fields that differ from the table schema in nullability. That is, the problem is not limited to INSERT INTO ... VALUES statements.

What changes are included in this PR?

Add a separate nullability check alongside the original check, which only covered the field name and datatype.
Improve the error message to include more information about the failure.

There are three cases we need to check:

  1. The plan schema and the table schema must have the same number of fields
  2. The plan schema and the table schema must agree on the nullability of each field
  3. The plan schema and the table schema must agree on the datatype of each field
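These three checks can be sketched as follows. This is a minimal illustration using a stand-in Field type, not DataFusion's actual DFSchema/Field types, and the error strings are illustrative rather than the exact ones in the PR:

```rust
// Stand-in for a schema field; the real code operates on arrow/DataFusion schemas.
struct Field {
    name: String,
    data_type: String,
    nullable: bool,
}

// Sketch of the three schema checks for INSERT planning.
fn check_insert_schema(table: &[Field], plan: &[Field]) -> Result<(), String> {
    // 1. The number of fields must match.
    if table.len() != plan.len() {
        return Err(format!(
            "Inserting query must have the same schema length as the table \
             (expected {} fields, got {})",
            table.len(),
            plan.len()
        ));
    }
    for (t, p) in table.iter().zip(plan) {
        // 2. Nullability must agree (as initially described; the PR later
        //    relaxes this to a one-directional check).
        if t.nullable != p.nullable {
            return Err(format!(
                "Inserting query must have the same schema nullability as the table. \
                 Expected table field '{}' nullability: {}, got field: '{}', nullability: {}",
                t.name, t.nullable, p.name, p.nullable
            ));
        }
        // 3. Datatypes must agree.
        if t.data_type != p.data_type {
            return Err(format!(
                "Expected table field '{}' with type {}, got field '{}' with type {}",
                t.name, t.data_type, p.name, p.data_type
            ));
        }
    }
    Ok(())
}
```

Returning `Result<(), String>` rather than `Result<bool>` lets each failing case carry its own message, which is the design point discussed in the review below.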

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate common Related to common crate labels Feb 10, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Feb 10, 2025
@zhuqi-lucas zhuqi-lucas marked this pull request as draft February 10, 2025 10:16
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review February 10, 2025 15:32
Contributor

@jayzhan211 jayzhan211 left a comment


Can you explain the reason for the change in test.slt? Thanks.

// 1. The len of the schema of the plan and the schema of the table should be the same
// 2. The nullable flag of the schema of the plan and the schema of the table should be the same
// 3. The datatype of the schema of the plan and the schema of the table should be the same
fn logically_equivalent_names_and_types(&self, other: &Self) -> Result<(), String> {
Contributor


Why not Result<bool>?

Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


Originally I used Result<bool>, but I want to return three different error messages for the different cases, so I changed it to Result<(), String>.

Contributor


You can also define different messages with internal_err!("msg1") and internal_err!("msg2").

f1.name() == f2.name()
&& DFSchema::datatype_is_logically_equal(
.try_for_each(|(f1, f2)| {
if f1.is_nullable() != f2.is_nullable() {
Contributor


If the table field is nullable, we can still insert a non-nullable column. Similar to #14519.

Member


This seems like a regression to me 🤔. Even though the schema of a source is nullable, all of its data can be non-null, and in such cases it can still be inserted into a non-nullable sink. When inserting, we currently validate against the actual data rather than the schema. See check_not_null_constraints.
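The data-vs-schema distinction the reviewer describes can be sketched like this. The Column type and function here are hypothetical stand-ins, not the actual signature of DataFusion's check_not_null_constraints, which operates on arrow RecordBatches:

```rust
// Hypothetical stand-in for one column of a record batch.
struct Column {
    name: String,
    values: Vec<Option<i64>>,
}

// Data-level check: a column whose *schema* says "nullable" may still be
// inserted into a non-nullable sink field, as long as it contains no actual
// nulls. Only a real null in the data is an error.
fn check_not_null(sink_field_nullable: bool, col: &Column) -> Result<(), String> {
    if sink_field_nullable {
        // The sink accepts nulls; nothing to validate.
        return Ok(());
    }
    if col.values.iter().any(|v| v.is_none()) {
        return Err(format!(
            "column '{}' contains nulls but the sink field is not nullable",
            col.name
        ));
    }
    Ok(())
}
```

Under this model, rejecting a nullable-typed source at planning time is stricter than the existing runtime behavior, which is the regression being pointed out.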

Member


If 'DataSink receiving different schemas' is an issue, we can rewrite the schema of batches emitted by DataSinkExec.

Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


Thank you @jayzhan211 and @jonahgao for the review; this is a good point. I changed it so the nullability check only errors in this case:
// only check the case when the table field is not nullable and the insert data field is nullable
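The refined rule can be stated as a tiny predicate (an illustrative sketch, not the PR's actual code): only the genuinely unsafe direction is rejected at planning time.

```rust
// A non-nullable source can always feed a nullable or non-nullable table
// field; only "nullable source into non-nullable table field" is rejected.
fn nullability_compatible(table_nullable: bool, source_nullable: bool) -> bool {
    !(!table_nullable && source_nullable)
}
```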

@@ -78,7 +104,7 @@ physical_plan
query I
INSERT INTO table_without_values SELECT
SUM(c4) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
NULLIF(COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), 0)
Contributor


Why do we need NULLIF? Does its use indicate a potential issue?

Contributor Author


This regression no longer happens after the code changes above.

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion physical-expr Physical Expressions optimizer Optimizer rules proto Related to proto crate functions labels Feb 11, 2025
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion physical-expr Physical Expressions optimizer Optimizer rules proto Related to proto crate functions labels Feb 11, 2025
@@ -81,11 +77,9 @@ STORED AS arrow
LOCATION 'test_files/scratch/insert_to_external/arrow_dict_partitioned/'
PARTITIONED BY (b);

query I
query error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'b' nullability: false, got field: 'b', nullability: true
Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


This is strange: it means PARTITIONED BY (b) makes the field 'b' non-nullable (nullability: false)?

This is the only case that differs, and it happens only when PARTITIONED BY is used.

cc @jayzhan211 @jonahgao

@@ -228,7 +228,7 @@ CREATE TABLE aggregate_test_100_null (
c11 FLOAT
);

statement ok
statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


I think this is the only regression in the slt. cc @jayzhan211 @jonahgao

# Setup test data table
statement ok
CREATE EXTERNAL TABLE aggregate_test_100 (
  c1  VARCHAR NOT NULL,
  c2  TINYINT NOT NULL,
  c3  SMALLINT NOT NULL,
  c4  SMALLINT,
  c5  INT,
  c6  BIGINT NOT NULL,
  c7  SMALLINT NOT NULL,
  c8  INT NOT NULL,
  c9  INT UNSIGNED NOT NULL,
  c10 BIGINT UNSIGNED NOT NULL,
  c11 FLOAT NOT NULL,
  c12 DOUBLE NOT NULL,
  c13 VARCHAR NOT NULL
)
STORED AS CSV
LOCATION '../../testing/data/csv/aggregate_test_100.csv'
OPTIONS ('format.has_header' 'true');

statement ok
CREATE TABLE aggregate_test_100_null (
  c2  TINYINT NOT NULL,
  c5  INT NOT NULL,
  c3  SMALLINT,
  c11 FLOAT
);

statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
INSERT INTO aggregate_test_100_null
SELECT
  c2,
  c5,
  CASE WHEN c1 = 'e' THEN NULL ELSE c3 END as c3,
  CASE WHEN c1 = 'a' THEN NULL ELSE c11 END as c11
FROM aggregate_test_100;

I think the original behaviour was wrong, because the target table column is not nullable.

statement ok
CREATE TABLE aggregate_test_100_null (
c2 TINYINT NOT NULL,
c5 INT,
Contributor Author

@zhuqi-lucas zhuqi-lucas Feb 11, 2025


Noted. I also added the successful case, in which the table field c5 is nullable.

@alamb alamb changed the title bug: improve schema checking for instert into cases bug: improve schema checking for `insert into cases Feb 11, 2025
@alamb alamb changed the title bug: improve schema checking for `insert into cases bug: improve schema checking for insert into cases Feb 11, 2025
@zhuqi-lucas
Contributor Author

Can you explain the reason for the change in test.slt? Thanks.

Thank you for the review @jayzhan211. I have updated the slt and added notes for the only two SQL queries whose results differ.

Labels
common Related to common crate core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Construct source plan schema with correct nullability during INSERT planning.
3 participants