ARROW-17464: [C++] Store/Read float16 values in Parquet as FixedSizeBinary #13947
…inary Half-float values are currently not supported in Parquet: https://issues.apache.org/jira/browse/PARQUET-1647
(Force push was needed to update my fork with updates from
If you run the
Result of stacktrace:
I managed to narrow this stacktrace down to this call:
in
This will then call:
I am going to continue trying to understand this, but I am outside my comfort zone, and really welcome help! In particular, my test might need adjustment, and perhaps my
Update from gdb:
Here is the location where the
My quick guess is that we can't cast a HalfFloatArray to a FixedSizeBinaryArray, and this is causing the crash. For example, the column writer also takes this path for Decimal128Array, but that array type extends FixedSizeBinaryArray and thus can be cast to it.
Yes, definitely, you cannot interpret one type of array like this as another.
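To make the point about subclassing concrete, here is a minimal, self-contained sketch. The classes below are hypothetical stand-ins that only mirror the inheritance relationship described in this thread; they are not the Arrow classes, and `AsFixedSizeBinary` and `ByteWidth` are names made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only; these are NOT the Arrow classes.
struct Array {
  virtual ~Array() = default;
};

// Fixed-width binary values carry a byte width that a writer would rely on.
struct FixedSizeBinaryArray : Array {
  explicit FixedSizeBinaryArray(int32_t width) : byte_width(width) {}
  int32_t byte_width;
};

// Mirrors the relationship noted above: the decimal array extends the
// fixed-size binary array, so a downcast to the base is valid.
struct Decimal128Array : FixedSizeBinaryArray {
  Decimal128Array() : FixedSizeBinaryArray(16) {}
};

// A sibling type: it derives from Array but not from FixedSizeBinaryArray.
struct HalfFloatArray : Array {};

// Checked downcast: returns nullptr when the object is not a
// FixedSizeBinaryArray. An unchecked static_cast in the same place would read
// byte_width from an object that has no such member (undefined behaviour).
const FixedSizeBinaryArray* AsFixedSizeBinary(const Array& array) {
  return dynamic_cast<const FixedSizeBinaryArray*>(&array);
}

// Returns the byte width; the caller must pass a fixed-size binary array.
int32_t ByteWidth(const Array& array) {
  const FixedSizeBinaryArray* fsb = AsFixedSizeBinary(array);
  assert(fsb != nullptr && "array is not a FixedSizeBinaryArray");
  return fsb->byte_width;
}
```

In this sketch, `AsFixedSizeBinary` succeeds for the decimal stand-in and returns a null pointer for the half-float stand-in, which is the situation an unchecked cast would turn into a crash.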
@@ -2050,6 +2050,7 @@ Status TypedColumnWriterImpl<FLBAType>::WriteArrowDense(
       WRITE_SERIALIZE_CASE(FIXED_SIZE_BINARY, FixedSizeBinaryType, FLBAType)
       WRITE_SERIALIZE_CASE(DECIMAL128, Decimal128Type, FLBAType)
       WRITE_SERIALIZE_CASE(DECIMAL256, Decimal256Type, FLBAType)
+      WRITE_SERIALIZE_CASE(HALF_FLOAT, FixedSizeBinaryType, FLBAType)
This would resolve the segfault but doesn't really feel elegant.
-      WRITE_SERIALIZE_CASE(HALF_FLOAT, FixedSizeBinaryType, FLBAType)
+      case ::arrow::Type::HALF_FLOAT: {
+        auto array_data = array.data();
+        const auto& arr = ::arrow::FixedSizeBinaryArray(
+            ::arrow::fixed_size_binary(2), array.length(), array_data->buffers[1],
+            array_data->buffers[0], array.null_count(), array.offset());
+        return WriteArrowSerialize<FLBAType, ::arrow::FixedSizeBinaryType>(
+            arr, num_levels, def_levels, rep_levels, ctx, this, maybe_parent_nulls);
+      }
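The idea behind the suggestion above is that a half-float value buffer is already a contiguous run of 2-byte slots, so it can be wrapped as fixed-width binary data without copying. The following sketch shows that idea in isolation, assuming the half floats are held as raw 16-bit patterns. `FixedWidthView`, `ViewHalfFloatsAsFixedWidth`, and `ReadHalfBits` are hypothetical names for illustration and are not part of Arrow or Parquet; validity bitmaps and offsets are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a fixed-width binary view: a pointer to contiguous
// bytes plus the width of each slot. Not an Arrow or Parquet type.
struct FixedWidthView {
  const uint8_t* data;
  int32_t byte_width;
  int64_t length;

  // Returns a pointer to the first byte of slot i.
  const uint8_t* GetValue(int64_t i) const {
    assert(i >= 0 && i < length);
    return data + i * byte_width;
  }
};

// Wraps a buffer of raw half-float bit patterns as 2-byte slots without
// copying. The view is only valid while half_bits is alive and unmodified.
FixedWidthView ViewHalfFloatsAsFixedWidth(const std::vector<uint16_t>& half_bits) {
  return FixedWidthView{reinterpret_cast<const uint8_t*>(half_bits.data()),
                        static_cast<int32_t>(sizeof(uint16_t)),
                        static_cast<int64_t>(half_bits.size())};
}

// Reads slot i back as a 16-bit pattern in native byte order; memcpy avoids
// alignment and aliasing problems.
uint16_t ReadHalfBits(const FixedWidthView& view, int64_t i) {
  uint16_t bits;
  std::memcpy(&bits, view.GetValue(i), sizeof(bits));
  return bits;
}
```

Because the view borrows the original buffer, nothing is copied; the trade-off is that the view must not outlive the buffer it points into.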
I'm not sure it's a good idea to do this before the HALF logical type is standardized in Parquet.
Edit: When I return to this, I want to re-open a PR off of my branch `arrow-17464`. It was a mistake/accident to do this off of `master`.

Half-float values are currently not supported in Parquet: https://issues.apache.org/jira/browse/PARQUET-1647
This is a draft PR! The tests are currently not passing. I am a pretty new committer, and opened this PR to more easily ask for help understanding what various parts of the code are doing.
The Parquet mailing list has a proposal to have float16 standardised in Parquet itself: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5
The PR to `parquet-format`: apache/parquet-format#184

This PR was inspired by: #12449