bug: improve schema checking for insert into cases #14572
Conversation
Can you explain the reason for the change in test.slt? Thanks.
// 1. The len of the schema of the plan and the schema of the table should be the same
// 2. The nullable flag of the schema of the plan and the schema of the table should be the same
// 3. The datatype of the schema of the plan and the schema of the table should be the same
fn logically_equivalent_names_and_types(&self, other: &Self) -> Result<(), String> {
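A minimal standalone sketch of what such a three-step check could look like, using a hypothetical `Field` struct rather than DataFusion's real `DFSchema`/arrow types, and returning a distinct message per failure case via `Result<(), String>` (the nullability rule here follows the relaxed form discussed later in this thread):

```rust
// Hypothetical minimal field type; the real check operates on
// arrow fields inside DFSchema.
#[derive(Debug)]
struct Field {
    name: String,
    datatype: String,
    nullable: bool,
}

// Sketch of the three checks: schema length, per-field nullability,
// and per-field datatype, each with its own error message.
fn logically_equivalent(plan: &[Field], table: &[Field]) -> Result<(), String> {
    if plan.len() != table.len() {
        return Err(format!(
            "schema length mismatch: plan has {} fields, table has {}",
            plan.len(),
            table.len()
        ));
    }
    for (p, t) in plan.iter().zip(table.iter()) {
        // Only nullable data flowing into a non-nullable table field is an error.
        if p.nullable && !t.nullable {
            return Err(format!(
                "expected table field '{}' nullability: false, got nullability: true",
                t.name
            ));
        }
        if p.datatype != t.datatype {
            return Err(format!(
                "datatype mismatch for field '{}': {} vs {}",
                t.name, p.datatype, t.datatype
            ));
        }
    }
    Ok(())
}
```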
Why not Result<bool>?
Originally I used Result<bool>, but I want to produce three different error messages for the different cases, so I changed it to Result<(), String>.
You can also define different messages with internal_err!("msg1"), internal_err!("msg2")
datafusion/common/src/dfschema.rs
Outdated
f1.name() == f2.name()
    && DFSchema::datatype_is_logically_equal(
.try_for_each(|(f1, f2)| {
    if f1.is_nullable() != f2.is_nullable() {
If the field is nullable, we can still insert a non-null column. Similar to #14519
This seems like a regression to me 🤔. Even though the schema of a source is nullable, all of its data can be non-nullable, and in such cases it can still be inserted into a non-nullable sink. When inserting, we currently validate against the actual data rather than the schema. See check_not_null_constraints
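The idea behind validating actual data instead of the declared schema can be sketched as follows. This is not DataFusion's real `check_not_null_constraints` API, just an illustration of the principle: a nullable source column that happens to contain no nulls is still valid input for a non-nullable sink.

```rust
// Sketch: scan the actual column values for nulls instead of trusting
// the declared (possibly over-permissive) nullability of the source schema.
fn check_not_null(column: &[Option<i64>], field_name: &str) -> Result<(), String> {
    if column.iter().any(|v| v.is_none()) {
        Err(format!(
            "null value found in non-nullable column '{field_name}'"
        ))
    } else {
        Ok(())
    }
}
```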
If 'DataSink receiving different schemas' is an issue, we can rewrite the schema of batches emitted by DataSinkExec.
Thank you @jayzhan211 and @jonahgao for the review, this is a good point. I changed it to report an error only for the one problematic nullability case:
// only check the case where the table field is not nullable and the insert data field is nullable
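The relaxed rule above reduces to a single predicate over the two nullability flags; here is a tiny sketch (hypothetical helper name, not DataFusion code) making the truth table explicit:

```rust
// The only incompatible combination: the table forbids nulls while the
// inserted data's field may produce them. Every other combination is fine,
// including nullable table <- non-nullable source.
fn nullability_incompatible(table_nullable: bool, source_nullable: bool) -> bool {
    !table_nullable && source_nullable
}
```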
@@ -78,7 +104,7 @@ physical_plan
 query I
 INSERT INTO table_without_values SELECT
 SUM(c4) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
-COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
+NULLIF(COUNT(*) OVER(PARTITION BY c1 ORDER BY c9 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), 0)
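For context on why wrapping an expression in NULLIF changes its nullability: NULLIF(a, b) returns NULL when a equals b and a otherwise, so even though COUNT(*) itself is never null, the NULLIF result can be, and a planner will mark the output column nullable. Its scalar semantics can be sketched in Rust as:

```rust
// NULLIF(a, b): NULL (None) when the two arguments are equal, else a.
// Because it can return None, the resulting column is nullable even
// when its input expression is not.
fn nullif<T: PartialEq>(a: T, b: T) -> Option<T> {
    if a == b {
        None
    } else {
        Some(a)
    }
}
```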
Why do we need NULLIF? Does its use indicate a potential issue?
This regression no longer happens after the above code changes.
Force-pushed from 0e7dac6 to af496fb
@@ -81,11 +77,9 @@ STORED AS arrow
 LOCATION 'test_files/scratch/insert_to_external/arrow_dict_partitioned/'
 PARTITIONED BY (b);

-query I
+query error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'b' nullability: false, got field: 'b', nullability: true
This is strange; does it mean that PARTITIONED BY (b) makes the field 'b' non-nullable (nullability: false)?
This is the only case that differs when PARTITIONED BY is used.
@@ -228,7 +228,7 @@ CREATE TABLE aggregate_test_100_null (
 c11 FLOAT
 );

-statement ok
+statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
I think this is the only regression in the slt. cc @jayzhan211 @jonahgao
# Setup test data table
statement ok
CREATE EXTERNAL TABLE aggregate_test_100 (
c1 VARCHAR NOT NULL,
c2 TINYINT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT,
c5 INT,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 INT UNSIGNED NOT NULL,
c10 BIGINT UNSIGNED NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
LOCATION '../../testing/data/csv/aggregate_test_100.csv'
OPTIONS ('format.has_header' 'true');
statement ok
CREATE TABLE aggregate_test_100_null (
c2 TINYINT NOT NULL,
c5 INT NOT NULL,
c3 SMALLINT,
c11 FLOAT
);
statement error DataFusion error: Error during planning: Inserting query must have the same schema nullability as the table\. Expected table field 'c5' nullability: false, got field: 'c5', nullability: true
INSERT INTO aggregate_test_100_null
SELECT
c2,
c5,
CASE WHEN c1 = 'e' THEN NULL ELSE c3 END as c3,
CASE WHEN c1 = 'a' THEN NULL ELSE c11 END as c11
FROM aggregate_test_100;
I think the original behaviour was wrong, because the target table column is not nullable.
statement ok
CREATE TABLE aggregate_test_100_null (
c2 TINYINT NOT NULL,
c5 INT,
Noted. I also added the successful case, where the table field c5 is nullable.
Thank you for the review @jayzhan211. I have updated the slt and added a note about the only two differing results for the SQL.
Which issue does this PR close?
Describe the bug
In #14394, it was reported that while attempting to implement a DataSink, the record batches received had schemas different from that of the RecordBatchStream.
A fix for the given example, an INSERT INTO ... VALUES query, was merged (#14472). However, this issue likely arises whenever the schema of the source of an INSERT statement contains fields that differ from the table schema in terms of nullability. That is, the problem is not limited to INSERT INTO ... VALUES statements.
What changes are included in this PR?
Add a separate nullability check alongside the original check, which only covered field names and datatypes.
Improve the error message to include more information about the failure.
There are three cases to check: the schema length, the nullability, and the datatype of each field must be compatible between the plan and the table.
Are these changes tested?
Yes
Are there any user-facing changes?
No