
ARROW-17464: [C++] Store/Read float16 values in Parquet as FixedSizeBinary #13947

Closed · wants to merge 2 commits

Conversation

@anjakefala (Collaborator) commented Aug 22, 2022

Edit: When I return to this, I want to re-open a PR off of my branch arrow-17464. It was a mistake to open this one off of master.

Half-float values are currently not supported in Parquet:
https://issues.apache.org/jira/browse/PARQUET-1647

This is a draft PR! The tests, as they currently stand, are not passing. I am a pretty new committer and opened this PR to make it easier to ask for help understanding what various parts of the code are doing.

There is a proposal on the Parquet mailing list to standardise float16 in Parquet itself: https://lists.apache.org/thread/03vmcj7ygwvsbno764vd1hr954p62zr5

The PR to parquet-format: apache/parquet-format#184

This PR was inspired by: #12449
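For context, a minimal sketch of the round-trip this PR is working towards, assuming the branch's writer maps Arrow float16 to Parquet FixedSizeBinary(2). The helper name and chunk size are illustrative, not part of the PR:

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <parquet/arrow/writer.h>

    // Write a table containing a HALF_FLOAT column to Parquet. On master this
    // path fails; on this branch each float16 value should be serialized as a
    // 2-byte fixed-size binary.
    arrow::Status WriteHalfFloatTable(const std::shared_ptr<arrow::Table>& table,
                                      const std::string& path) {
      ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
      // WriteTable is the same entry point the failing test exercises.
      return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                        sink, /*chunk_size=*/1024);
    }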

@anjakefala (Collaborator, Author)

(A force push was needed to update my fork with changes from the apache/arrow repository. No new changes.)

@anjakefala (Collaborator, Author) commented Sep 6, 2022

If you run the TestHalfFloatParquetIO test, you currently get a std::bad_cast failure. Test output:

/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc:664: Failure
Expected: for (::arrow::Status _st = ::arrow::internal::GenericToStatus((WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties))); !_st.ok();) return ::testing::internal::AssertHelper(::testing::TestPartResult::kFatalFailure, "/home/anja/git/arrow/cpp/src/parquet/arrow/arrow_reader_writer_test.cc", 664, "Failed") = ::testing::Message() << "'" "WriteTable(*table, ::arrow::default_memory_pool(), this->sink_, values->length(), default_writer_properties(), arrow_properties)" "' failed with " << _st.ToString() doesn't throw an exception.
  Actual: it throws std::bad_cast with description "std::bad_cast".

I managed to narrow this stack trace down to this call:

    WRITE_SERIALIZE_CASE(HALF_FLOAT, FixedSizeBinaryType, FLBAType)

in cpp/src/parquet/column_writer.cc

This then calls WriteArrowSerialize<FLBAType, FixedSizeBinaryType>():
- The std::bad_cast is thrown by RETURN_NOT_OK(functor.Serialize(checked_cast<const ArrayType&>(array), ctx, buffer)); (at ~line 1569)
- ArrayType is typename ::arrow::TypeTraits<ArrowType>::ArrayType;
- It uses SerializeFunctor<FLBAType, FixedSizeBinaryType>

I am going to continue trying to understand this, but I am outside my comfort zone, and really welcome help! In particular, my test might need adjustment, and perhaps my ParquetType and ArrowType should be changed.

Update from gdb: Here is the location where the bad_cast is happening:

#2  0x00007ffff77ed098 in arrow::internal::checked_cast<arrow::FixedSizeBinaryArray const&, arrow::Array const&> (value=...) at /home/anja/git/arrow/cpp/src/arrow/util/checked_cast.h:38
#3  0x00007ffff795f33d in parquet::WriteArrowSerialize<parquet::PhysicalType<(parquet::Type::type)7>, arrow::FixedSizeBinaryType> (array=..., num_levels=4, def_levels=0x7ffff300a700, rep_levels=0x0, 
    ctx=0x555555c861a8, writer=0x555555c870d0, maybe_parent_nulls=false) at /home/anja/git/arrow/cpp/src/parquet/column_writer.cc:1570
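
To make the failure mode concrete: in debug builds arrow::internal::checked_cast is a dynamic_cast (the stack trace above shows checked_cast.h:38), and a dynamic_cast on a reference, unlike on a pointer, throws std::bad_cast when the dynamic type does not match. A self-contained sketch with stand-in classes, not the real Arrow hierarchy:

    #include <iostream>
    #include <typeinfo>

    // Stand-ins: HalfFloatArray does not derive from FixedSizeBinaryArray,
    // so casting a type-erased reference between them cannot succeed.
    struct Array { virtual ~Array() = default; };
    struct FixedSizeBinaryArray : Array {};
    struct HalfFloatArray : Array {};

    int main() {
      HalfFloatArray values;
      const Array& erased = values;  // what the column writer receives
      try {
        // In debug builds, checked_cast<const FixedSizeBinaryArray&> does this:
        const auto& flba = dynamic_cast<const FixedSizeBinaryArray&>(erased);
        (void)flba;
      } catch (const std::bad_cast& e) {
        std::cout << "throws: " << e.what() << "\n";  // "std::bad_cast"
      }
      return 0;
    }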

@jorisvandenbossche (Member)

My quick guess is that we can't cast a HalfFloatArray to a FixedSizeBinaryArray, and that this is causing the crash. Another case that takes this path in the column writer is Decimal128Array, but that array extends FixedSizeBinaryArray and thus can be cast to that type.
While it's possible to implement such a cast operator, I am not sure that's something we typically do in our codebase for arrays. I suppose that somewhere in the write path we should explicitly convert the float array to a binary array (it might be that this conversion still needs to be implemented).
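
For reference, a simplified view of the inheritance involved (the class names are real Arrow C++ classes, everything else is elided); this is why the DECIMAL128 case can take the FixedSizeBinaryArray path while HALF_FLOAT cannot:

    // Array
    //  └─ PrimitiveArray
    //      ├─ FixedSizeBinaryArray
    //      │   └─ Decimal128Array          <- checked_cast<const FixedSizeBinaryArray&> succeeds
    //      └─ NumericArray<HalfFloatType>  (alias: HalfFloatArray)
    //                                      <- checked_cast<const FixedSizeBinaryArray&> throws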

@pitrou (Member) commented Sep 7, 2022

Yes, definitely: you cannot interpret one type of array as another like this.

@@ -2050,6 +2050,7 @@ Status TypedColumnWriterImpl<FLBAType>::WriteArrowDense(
       WRITE_SERIALIZE_CASE(FIXED_SIZE_BINARY, FixedSizeBinaryType, FLBAType)
       WRITE_SERIALIZE_CASE(DECIMAL128, Decimal128Type, FLBAType)
       WRITE_SERIALIZE_CASE(DECIMAL256, Decimal256Type, FLBAType)
+      WRITE_SERIALIZE_CASE(HALF_FLOAT, FixedSizeBinaryType, FLBAType)
A reviewer (Member) commented on the added line:

This would resolve the segfault but doesn't really feel elegant.

Suggested change:
-      WRITE_SERIALIZE_CASE(HALF_FLOAT, FixedSizeBinaryType, FLBAType)
+      case ::arrow::Type::HALF_FLOAT: {
+        auto array_data = array.data();
+        const auto& arr = ::arrow::FixedSizeBinaryArray(
+            ::arrow::fixed_size_binary(2), array.length(), array_data->buffers[1],
+            array_data->buffers[0], array.null_count(), array.offset());
+        return WriteArrowSerialize<FLBAType, ::arrow::FixedSizeBinaryType>(
+            arr, num_levels, def_levels, rep_levels, ctx, this, maybe_parent_nulls);
+      }
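
For experimenting with this reinterpretation outside the writer, a standalone sketch (the helper name ViewAsFixedSizeBinary is illustrative, not part of the PR). It builds a zero-copy FixedSizeBinaryArray view over the half-float array's buffers, which works because both element widths are 2 bytes:

    #include <arrow/api.h>

    // Zero-copy: shares the HalfFloatArray's validity (buffers[0]) and data
    // (buffers[1]) buffers rather than copying them.
    std::shared_ptr<arrow::FixedSizeBinaryArray> ViewAsFixedSizeBinary(
        const arrow::HalfFloatArray& array) {
      auto data = array.data();
      return std::make_shared<arrow::FixedSizeBinaryArray>(
          arrow::fixed_size_binary(2), array.length(),
          /*data=*/data->buffers[1], /*null_bitmap=*/data->buffers[0],
          array.null_count(), array.offset());
    }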

@pitrou (Member) commented Sep 7, 2022

I'm not sure it's a good idea to do this before the HALF logical type is standardized in Parquet.
