Parquet readers incorrectly interpret legacy nested lists #6756

Closed
etseidl opened this issue Nov 19, 2024 · 1 comment · Fixed by #6757
Labels
bug parquet Changes to the parquet crate

Comments

@etseidl
Contributor

etseidl commented Nov 19, 2024

Describe the bug
A file with the schema

message my_record {
  REQUIRED group a (LIST) {
    REPEATED group array (LIST) {
      REPEATED INT32 array;
    }
  }
}

is currently read by arrow-rs as a list<struct<list<int32>>>, i.e. a list of one-tuples, each wrapping a list of integers. Consensus is forming that it should instead be read as list<list<int32>>, a list of integer lists (see apache/parquet-format#466 and apache/arrow#43995).

To Reproduce
Run parquet-rewrite on the file old_list_structure.parquet in parquet-testing/data and print the schema from the resulting file.

% parquet-rewrite -i old_list_structure.parquet -o old.pq
% parquet-schema old.pq
Metadata for file: old.pq

version: 1
num of rows: 1
created by: parquet-rs version 53.2.0
metadata:
  parquet.avro.schema: {"type":"record","name":"my_record","fields":[{"name":"a","type":{"type":"array","items":{"type":"array","items":"int"}}}]}
  writer.model.name: avro
  ARROW:schema: /////wABAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAEAAAAEAAAAnP///xgAAAAMAAAAAAAADLQAAAABAAAACAAAAMD///+8////GAAAAAwAAAAAAAANiAAAAAEAAAAIAAAA4P///9z///8cAAAADAAAAAAAAAxcAAAAAQAAABwAAAAEAAQABAAAABAAFAAQAAAADwAEAAAACAAQAAAAGAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAEAAABhAAAA
message arrow_schema {
  REQUIRED group a (LIST) {
    REPEATED group list {
      REQUIRED group array {
        REQUIRED group array (LIST) {
          REPEATED group list {
            REQUIRED INT32 array;
          }
        }
      }
    }
  }
}

Expected behavior
The test file should be read as nested lists and produce the following schema:

message arrow_schema {
  REQUIRED group a (LIST) {
    REPEATED group list {
      REQUIRED group array (LIST) {
        REPEATED group list {
          REQUIRED INT32 array;
        }
      }
    }
  }
}

Additional context
The root cause is the naming of the repeated group as "array". This causes the code that handles legacy lists to use a rule which states:

If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.

This rule should not apply here because a) the only child of the repeated group "array" is itself repeated, and b) the repeated group "array" carries a LIST annotation. Both indicate that the repeated group is an inner list rather than a one-field struct.
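To make the proposed exception concrete, here is a minimal std-only sketch of the decision. It is not the arrow-rs implementation: `Node`, `Element`, `classify`, and the `fixed` flag are hypothetical names invented for illustration, and treating conditions a) and b) as independently sufficient is an assumption.

```rust
/// Repetition of a schema node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Hypothetical, simplified stand-in for a Parquet schema node.
#[derive(Debug, Clone)]
pub struct Node {
    pub name: String,
    pub repetition: Repetition,
    /// True when the node carries a LIST annotation.
    pub is_list: bool,
    /// Empty for primitive nodes.
    pub children: Vec<Node>,
}

impl Node {
    pub fn primitive(name: &str, repetition: Repetition) -> Self {
        Node { name: name.to_string(), repetition, is_list: false, children: Vec::new() }
    }

    pub fn group(name: &str, repetition: Repetition, is_list: bool, children: Vec<Node>) -> Self {
        Node { name: name.to_string(), repetition, is_list, children }
    }
}

/// How the repeated field directly under a LIST-annotated group is interpreted.
#[derive(Debug, PartialEq, Eq)]
pub enum Element {
    /// The repeated field is a primitive, so it is the element.
    Primitive,
    /// The repeated group is the element and is read as a struct.
    Struct,
    /// The repeated group is the element and is itself a list.
    NestedList,
    /// The repeated group is only a wrapper; its single child is the element.
    WrappedChild,
}

/// Classify the repeated field of the LIST group named `list_name`.
/// `fixed = false` models the behavior reported in the issue;
/// `fixed = true` adds the exception proposed there (an assumption of this sketch).
pub fn classify(list_name: &str, repeated: &Node, fixed: bool) -> Element {
    if repeated.children.is_empty() {
        return Element::Primitive;
    }
    if repeated.children.len() > 1 {
        return Element::Struct;
    }
    let child = &repeated.children[0];
    // Proposed exception: a LIST annotation or a repeated only-child marks an inner list.
    if fixed && (repeated.is_list || child.repetition == Repetition::Repeated) {
        return Element::NestedList;
    }
    // Legacy naming rule quoted above: "array" or "<list name>_tuple".
    let tuple_name = format!("{list_name}_tuple");
    if repeated.name == "array" || repeated.name == tuple_name {
        return Element::Struct;
    }
    Element::WrappedChild
}

/// The repeated group from the issue's schema:
/// `REPEATED group array (LIST) { REPEATED INT32 array; }`
pub fn issue_repeated_group() -> Node {
    Node::group(
        "array",
        Repetition::Repeated,
        true,
        vec![Node::primitive("array", Repetition::Repeated)],
    )
}
```

In this sketch, `classify("a", &issue_repeated_group(), false)` yields `Element::Struct`, which corresponds to the list-of-one-tuples reading, while passing `true` yields `Element::NestedList`, the reading requested under "Expected behavior".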

@etseidl etseidl added the bug label Nov 19, 2024
@alamb alamb added the parquet Changes to the parquet crate label Dec 17, 2024
@alamb
Contributor

alamb commented Dec 17, 2024

label_issue.py automatically added labels {'parquet'} from #6757
