Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: DataFrame::schema returns incorrect schema for NATURAL JOIN #14058

Closed
Tracked by #14008
DDtKey opened this issue Jan 9, 2025 · 4 comments · Fixed by #14102
Closed
Tracked by #14008

Regression: DataFrame::schema returns incorrect schema for NATURAL JOIN #14058

DDtKey opened this issue Jan 9, 2025 · 4 comments · Fixed by #14102
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed regression Something that used to work no longer does

Comments

@DDtKey
Copy link
Contributor

DDtKey commented Jan 9, 2025

Describe the bug

Affected Version: 42.x, 43.x, 44.x (regression since 41.x)

The DataFrame::schema (=> LogicalPlan::schema) method returns a schema that includes all columns from the joined sources (using NATURAL JOIN), including columns not present in the final output. This behavior is incorrect and inconsistent with the documented behavior:

Returns the DFSchema describing the output of this DataFrame.

To Reproduce

Simple MRE here:

// Works for datafusion: 41.x and earlier
// Failed for datafusion: 42.x and later (including 44.x)

use datafusion::arrow::util::pretty;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Create table1
    ctx.sql(
        r#"
        CREATE TABLE table1 AS
        SELECT * FROM (
            VALUES
            (1, 'a'),
            (2, 'b'),
            (3, 'c')
        ) AS t(id, value1)
        "#,
    )
    .await?;

    // Create table2
    ctx.sql(
        r#"
        CREATE TABLE table2 AS
        SELECT * FROM (
            VALUES
            (1, 'x'),
            (3, 'y'),
            (4, 'z')
        ) AS t(id, value2)
        "#,
    )
    .await?;

    // Execute NATURAL JOIN query
    let df = ctx.sql("SELECT * FROM table1 NATURAL JOIN table2").await?;

    // Incorrect schema includes all columns from both tables
    let schema = df.schema().as_arrow().clone();
    println!("Schema: {:?}", schema);

    // Output does not include all columns
    let result = df.collect().await?;
    pretty::print_batches(&result)?;

    let result_schema = result.first().unwrap().schema();
    assert_eq!(&schema, &*result_schema, "Schema mismatch");

    Ok(())
}

Deps:

datafusion = "44.0.0"
tokio = { version = "1", features = ["full"] }

Expected behavior

The schema returned by DataFrame::schema should match the structure of the output produced by collect/collect_partitioned and etc. Specifically:

  • Excluded columns from the result of a NATURAL JOIN should not appear in the schema.

Or, if it was intended - the documentation should be aligned and be clear how to access the schema.
However, I find previous behavior correct and useful (e.g - get schema before methods like write_parquet/csv/json)

Additional context

This is a regression, as the method previously worked correctly in version 41.x.x and earlier.

Also, it probably points to the missing test coverage for particular code-paths. In a sense it's not enough to compare SQL execution results

@DDtKey DDtKey added the bug Something isn't working label Jan 9, 2025
@DDtKey
Copy link
Contributor Author

DDtKey commented Jan 9, 2025

Bisect shows that the behavior changed in 3438b355308afa23dba399f4aec5760969d054c5 commit
It's #11681

Used to work properly before merging this PR

@alamb
Copy link
Contributor

alamb commented Jan 10, 2025

I think @jonahgao is working on something else related to Natural join as well

@alamb alamb added regression Something that used to work no longer does help wanted Extra attention is needed labels Jan 10, 2025
@jonahgao jonahgao self-assigned this Jan 10, 2025
@jonahgao
Copy link
Member

I'll try to fix this.

@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

I added this to the list of items that i think we should fix before releasing version 45.0.0:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed regression Something that used to work no longer does
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants