Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36036: [C++][Python][Parquet] Implement Float16 logical type #36073

Merged
merged 37 commits into from
Nov 15, 2023

Conversation

benibus
Copy link
Collaborator

@benibus benibus commented Jun 14, 2023

Rationale for this change

There is currently an active proposal to support half-float types in Parquet. For more details/discussion, see the links in this PR's accompanying issue.

What changes are included in this PR?

This PR implements basic support for a Float16LogicalType in accordance with the proposed spec. More specifically, this includes:

  • Changes to parquet.thrift and regenerated parqet_types files
  • Basic LogicalType class definition, method impls, and enums
  • Support for specialized comparisons and column statistics

In the interest of scope, this PR does not currently deal with arrow integration and byte split encoding - although we will want both of these features resolved before the proposal is approved.

Are these changes tested?

Yes (tests are included)

Are there any user-facing changes?

Yes

@benibus benibus force-pushed the GH-36036-parquet-float16 branch from 47790ee to 2809fbf Compare June 14, 2023 22:07
@benibus benibus marked this pull request as ready for review June 14, 2023 23:39
@benibus benibus requested a review from wjones127 as a code owner June 14, 2023 23:39
@benibus
Copy link
Collaborator Author

benibus commented Jun 14, 2023

cc @anjakefala @zeroshade @pitrou

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 15, 2023
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @benibus ! Please note that the Thrift definitions will have to be synchronized from https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift once the Parquet format spec addition is accepted.

cpp/src/parquet/float_internal.h Outdated Show resolved Hide resolved
cpp/src/parquet/float_internal.h Outdated Show resolved Hide resolved
cpp/src/parquet/float_internal.h Outdated Show resolved Hide resolved
cpp/src/parquet/float_internal.h Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Show resolved Hide resolved
cpp/src/parquet/statistics_test.cc Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 15, 2023
@benibus
Copy link
Collaborator Author

benibus commented Jun 15, 2023

Thanks! Before moving forward on some of the suggestions, I just want to ensure that I interpreted the endianness requirement from the proposal correctly. As per the spec:

The FLOAT16 annotation represents half-precision floating-point numbers in the 2-byte IEEE little-endian format.

Is this in line with what I've done here (i.e. the FLBA itself representing a little-endian binary float) or does it imply something else?

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 15, 2023
@pitrou
Copy link
Member

pitrou commented Jun 15, 2023

Is this in line with what I've done here (i.e. the FLBA itself representing a little-endian binary float)

Yes!

cpp/src/parquet/float_internal.h Outdated Show resolved Hide resolved
cpp/src/parquet/float_internal.h Outdated Show resolved Hide resolved
@@ -435,6 +435,8 @@ std::shared_ptr<const LogicalType> LogicalType::FromThrift(
return BSONLogicalType::Make();
} else if (type.__isset.UUID) {
return UUIDLogicalType::Make();
} else if (type.__isset.FLOAT16) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to change thrift_internal.h for serialization of FLOAT16.

cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
@benibus benibus force-pushed the GH-36036-parquet-float16 branch 3 times, most recently from 5c109d3 to 2f9ca2a Compare June 18, 2023 03:42
@benibus benibus requested a review from pitrou June 18, 2023 04:14
cpp/src/arrow/util/float16.h Show resolved Hide resolved
cpp/src/arrow/util/float16.h Outdated Show resolved Hide resolved
cpp/src/arrow/util/float16.h Outdated Show resolved Hide resolved
cpp/src/arrow/util/float16.h Outdated Show resolved Hide resolved
cpp/src/arrow/util/float16.h Show resolved Hide resolved
cpp/src/arrow/util/float16_test.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
cpp/src/parquet/statistics.cc Outdated Show resolved Hide resolved
@pitrou
Copy link
Member

pitrou commented Jun 19, 2023

By the way:

In the interest of scope, this PR does not currently deal with arrow integration and byte split encoding - although we will want both of these features resolved before the proposal is approved.

Do you plan to tackle Arrow integration in a subsequent PR? This PR cannot be integrated before the proposal is approved, AFAICT.

@benibus
Copy link
Collaborator Author

benibus commented Jun 19, 2023

Do you plan to tackle Arrow integration in a subsequent PR? This PR cannot be integrated before the proposal is approved, AFAICT.

Yes, my plan was to handle Arrow integration in a subsequent PR (in tandem with the Java implementation), and then move forward with the proposal.

However, I think you're right that we'll need to roll those features into this PR. Just did a closer re-reading of @julienledem's comments on apache/parquet-format#184, and I believe we're going to want the full C++/Java PRs ready (but not merged) before approving the proposal - so we'll probably need to sit on this one in the meantime.

That being said, I can maintain this PR until then. Plus we may need to make adjustments if new requirements come in from the Java side.

@pitrou
Copy link
Member

pitrou commented Jun 21, 2023

Well, if Arrow integration does not come in this PR, then at least some testing of reading/writing using non-Arrow Parquet APIs should still be added.

@wgtmac
Copy link
Member

wgtmac commented Jun 21, 2023

Just curious: Is it enough to get approval if there is a full C++ implementation only? Or java parity should be ready at the meanwhile?

@pitrou
Copy link
Member

pitrou commented Jun 21, 2023

We probably want a Java implementation as well according to apache/parquet-format#184 (comment)

@wgtmac
Copy link
Member

wgtmac commented Jun 21, 2023

We probably want a Java implementation as well according to apache/parquet-format#184 (comment)

OK, if that is a must. I can spare some time to do this. Would be good after this PR gets completed.

Reverted several prior changes that were accidentally pushed

Enabled construction from native floats

Removed `uint16_t` conversion operator since it doesn't behave
consistently with standard floats. As a result, rolled back some of the
prior changes to `random_real` used in the Parquet test utils
@benibus benibus force-pushed the GH-36036-parquet-float16 branch from a672c73 to 157e0d7 Compare November 13, 2023 21:01
zeroshade pushed a commit that referenced this pull request Nov 13, 2023
### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: #37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
@benibus
Copy link
Collaborator Author

benibus commented Nov 13, 2023

Now that the proposal is merged, would anyone be willing to take a final look and potentially merge this? I've gone ahead and rebased for good measure.

I should also mention that the Go implementation was merged today, but it'll eventually need this PR's parquet.thrift to re-generate its files in the future.

loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…37599)

### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: apache#37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the nits above.

Would it be useful to add tests checking that one write or read a Parquet FLOAT16 with a byte length != 2?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

( It's a bit weird for me that fp16 can only use PLAIN, RLE_DICT and ... DELTA_BYTE_ARRAY, aha...)

@benibus
Copy link
Collaborator Author

benibus commented Nov 14, 2023

Would it be useful to add tests checking that one write or read a Parquet FLOAT16 with a byte length != 2?

Probably... I'll try to add something, assuming it wouldn't be redundant.

( It's a bit weird for me that fp16 can only use PLAIN, RLE_DICT and ... DELTA_BYTE_ARRAY, aha...)

Yeah, I think we're going to specifically address the encodings as a follow-up since there's been some recent discussion in that area.

@benibus
Copy link
Collaborator Author

benibus commented Nov 14, 2023

Update: I don't think it'd be very useful to add tests for different type lengths at the reader/writer level since it isn't even possible to construct a PrimitiveNode with an inapplicable length without throwing an exception. I suppose that's why there aren't similar test for other (e.g. decimal, uuid) types AFAICT.

However, it turns out that I forgot to add the relevant tests to schema_test.cc, so I went ahead and included those - along with the other requested changes.

@pitrou
Copy link
Member

pitrou commented Nov 15, 2023

LGTM, I'm waiting for a last CI check and will merge afterwards.

@mapleFU
Copy link
Member

mapleFU commented Nov 15, 2023

Also we're going to release this in parquet-2.10

apache/parquet-format#219

@pitrou pitrou merged commit b55d13c into apache:main Nov 15, 2023
32 of 34 checks passed
@pitrou pitrou removed the awaiting merge Awaiting merge label Nov 15, 2023
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit b55d13c.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 13 possible false positives for unstable benchmarks that are known to sometimes produce them.

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…37599)

### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: apache#37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…apache#36073)

### Rationale for this change

There is currently an active proposal to support half-float types in Parquet. For more details/discussion, see the links in this PR's accompanying issue.

### What changes are included in this PR?

This PR implements basic support for a `Float16LogicalType` in accordance with the proposed spec. More specifically, this includes:

- Changes to `parquet.thrift` and regenerated `parqet_types` files
- Basic `LogicalType` class definition, method impls, and enums
- Support for specialized comparisons and column statistics

In the interest of scope, this PR does not currently deal with arrow integration and byte split encoding - although we will want both of these features resolved before the proposal is approved.

### Are these changes tested?

Yes (tests are included)

### Are there any user-facing changes?

Yes

* Closes: apache#36036

Lead-authored-by: benibus <[email protected]>
Co-authored-by: Ben Harkins <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
kou pushed a commit to apache/arrow-go that referenced this pull request Aug 30, 2024
### Rationale for this change

There is an active proposal for a Float16 logical type in Parquet (apache/parquet-format#184) with C++/Python implementations in progress (apache/arrow#36073), so we should add one for Go as well.

### What changes are included in this PR?

- [x] Adds `LogicalType` definitions and methods for `Float16`
- [x] Adds support for `Float16` column statistics and comparators
- [x] Adds support for interchange between Parquet and Arrow's half-precision float

### Are these changes tested?

Yes

### Are there any user-facing changes?

Yes

* Closes: #37582

Authored-by: benibus <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Implement Float16 logical type
9 participants