
GH-3122: Correct V2 page header compression fields for zero-size data pages #3148

Open · wants to merge 4 commits into base: master
Conversation

@ConeyLiu (Contributor) commented Feb 6, 2025

Rationale for this change

Fixes #3122

What changes are included in this PR?

Set is_compressed to false and compressed_page_size to 0 for V2 pages whose data size is zero.

Are these changes tested?

Yes, new unit tests.

Are there any user-facing changes?

No.

@ConeyLiu (Contributor Author) commented Feb 6, 2025

@mapleFU @pitrou @wgtmac please take a look. Thanks a lot.

Comment on lines 2126 to 2128
if (compressedSize == 0) {
  dataPageHeaderV2.setIs_compressed(false);
}
Member:

I think you should take an explicit bool isCompressed parameter instead.

Member:

Agreed. Data page v2 was designed to adaptively fall back to uncompressed data when compression is not promising (though we don't implement it yet). Using an explicit parameter makes sense.
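A minimal sketch of the suggested shape (illustrative standalone code, not the PR's actual method; the DataPageHeaderV2 constructor and setter follow the Thrift-generated class quoted in the diffs in this thread):

import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.Encoding;

// Illustration only: the caller states whether the page body is compressed,
// instead of the header builder inferring it from compressedSize == 0.
class ExplicitCompressedFlagSketch {
  static DataPageHeaderV2 newDataPageV2Header(
      int valueCount, int nullCount, int rowCount, Encoding dataEncoding,
      int dlByteLength, int rlByteLength, boolean compressed) {
    DataPageHeaderV2 header = new DataPageHeaderV2(
        valueCount, nullCount, rowCount, dataEncoding, dlByteLength, rlByteLength);
    header.setIs_compressed(compressed);
    return header;
  }
}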

if (compressedData.size() > 0) {
  compressedSize =
      toIntWithCheck(compressedData.size() + repetitionLevels.size() + definitionLevels.size(), "page");
}
Member:

So this class is duplicating code from ColumnChunkPageWriteStore.java above? Do you know why that is?

Contributor Author:

ParquetFileWriter supports writing out pages directly, so there is some duplicated code. I plan to reduce the duplication to make maintenance easier.

@@ -2123,6 +2123,9 @@ private PageHeader newDataPageV2Header(
int dlByteLength) {
DataPageHeaderV2 dataPageHeaderV2 = new DataPageHeaderV2(
valueCount, nullCount, rowCount, getEncoding(dataEncoding), dlByteLength, rlByteLength);
if (compressedSize == 0) {
dataPageHeaderV2.setIs_compressed(false);
Member:

I'm surprised that dataPageHeaderV2.setIs_compressed() has never been called before.

@wgtmac (Member) commented Feb 7, 2025:

BTW, I think the description in the spec needs to be improved, since it does not consider the case when compressed_page_size == (definition_levels_byte_length + repetition_levels_byte_length): https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L665-L668

Which means the section of the page between definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included) is compressed with the compression_codec.
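To make that boundary case concrete, a small illustrative check of the invariant under discussion (getter names assumed to follow the Thrift-generated parquet-format classes; this is not code from the PR):

import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.PageHeader;

// Illustration only: when compressed_page_size equals the combined levels length,
// the values section is empty, so the page should not be flagged as compressed.
class EmptyValuesSectionSketch {
  static boolean hasConsistentCompressionFields(PageHeader pageHeader) {
    DataPageHeaderV2 v2 = pageHeader.getData_page_header_v2();
    int levelsLength =
        v2.getRepetition_levels_byte_length() + v2.getDefinition_levels_byte_length();
    boolean emptyValuesSection = pageHeader.getCompressed_page_size() == levelsLength;
    return !emptyValuesSection || !v2.isIs_compressed();
  }
}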

Member:

Well, I don't think compressed_page_size can be 0, except if you have 0 levels and 0 data (is that possible?).

Member:

My bad. Fixed my comment.

@wgtmac (Member) left a comment:

Thanks for fixing this! I've left some comments.

@@ -55,7 +55,11 @@ public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD) {
public byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] AAD) {

int plainTextLength = cipherTextLength - NONCE_LENGTH;
if (plainTextLength < 1) {
if (plainTextLength == 0) {
Contributor Author:

The decryptor doesn't support decrypting zero-byte data; this fixes it here.

Member:

@ggershinsky Does this look ok to you?

Contributor:

Yep, technically this should work OK. Unlike GCM, the CTR mode doesn't guarantee or check integrity, so we don't need to protect against page-replacement attacks (we can't anyway).

In terms of implementation, it might be a bit cleaner to use cipher decryption of zero-sized arrays; fewer lines of code. But I'll first verify whether CTR supports this and will get back on that.

Contributor:

Yep, CTR can handle this as well, so the fix can simply be replacing if (plainTextLength < 1) with if (plainTextLength < 0).
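A standalone check of that claim using plain JDK cipher APIs (illustration only, not the Parquet decryptor; the 16-byte IV here is just for the demo, whereas Parquet's CTR layout uses a 12-byte nonce plus a counter):

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Illustration only: AES/CTR round-trips a zero-length plaintext without error,
// which is what makes the relaxed "< 0" guard safe.
public class EmptyCtrRoundTrip {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);

    Cipher encryptor = Cipher.getInstance("AES/CTR/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    byte[] cipherText = encryptor.doFinal(new byte[0]); // zero-length ciphertext

    Cipher decryptor = Cipher.getInstance("AES/CTR/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
    System.out.println(decryptor.doFinal(cipherText).length); // prints 0
  }
}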

Contributor Author:

Updated

org.apache.parquet.column.Encoding dataEncoding,
int rlByteLength,
int dlByteLength,
boolean compressed,
Contributor Author:

@pitrou @wgtmac I changed it to an explicit parameter.

@wgtmac (Member) left a comment:

Generally LGTM. I've left a few comments.

@@ -737,6 +737,7 @@ private void processChunk(
encryptColumn,
dataEncryptor,
dataPageAAD);
boolean compressed = compressor != null;
Member:

Shouldn't we get it from headerV2 to keep it unchanged?

Contributor Author:

Changed it to use headerV2.is_compressed, since UNCOMPRESSED is still a compression codec, so a non-null compressor doesn't imply the data is actually compressed.

*/
@Deprecated
Member:

Not related to this PR: we have plenty of deprecated public methods with a bunch of parameters like this. It is painful for downstream users to migrate once we remove them in 2.0.0, and the replacements are likely to be deprecated again the next time we need a new parameter. Should we change these into a Builder, or a class that bundles the many parameters, so we won't break anything in the future? @gszadovszky @Fokko
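A hypothetical sketch of the Builder idea (all names invented for illustration): adding a field later only needs a new setter and a default, so no public signature changes.

// Illustration only: bundling page parameters so that adding a field later does
// not change any public method signature.
public final class DataPageV2Params {
  private final int valueCount;
  private final int rowCount;
  private final int rlByteLength;
  private final int dlByteLength;
  private final boolean compressed;

  private DataPageV2Params(Builder builder) {
    this.valueCount = builder.valueCount;
    this.rowCount = builder.rowCount;
    this.rlByteLength = builder.rlByteLength;
    this.dlByteLength = builder.dlByteLength;
    this.compressed = builder.compressed;
  }

  public boolean isCompressed() {
    return compressed;
  }

  public static final class Builder {
    private int valueCount;
    private int rowCount;
    private int rlByteLength;
    private int dlByteLength;
    private boolean compressed = true; // a later addition only needs a sensible default

    public Builder withValueCount(int valueCount) { this.valueCount = valueCount; return this; }
    public Builder withRowCount(int rowCount) { this.rowCount = rowCount; return this; }
    public Builder withRlByteLength(int rlByteLength) { this.rlByteLength = rlByteLength; return this; }
    public Builder withDlByteLength(int dlByteLength) { this.dlByteLength = dlByteLength; return this; }
    public Builder withCompressed(boolean compressed) { this.compressed = compressed; return this; }

    public DataPageV2Params build() { return new DataPageV2Params(this); }
  }
}

Existing deprecated overloads could then delegate to a single method that takes such an object.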

Contributor Author:

Separating those methods from the ParquetFileWriter class and reducing the duplication with ColumnChunkPageWriteStore could help maintenance.

@@ -51,7 +51,11 @@ public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD) {
public byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] AAD) {

int plainTextLength = cipherTextLength - GCM_TAG_LENGTH - NONCE_LENGTH;
if (plainTextLength < 1) {
Contributor:

AES GCM decryption is able to handle an empty plaintext; it will still use the key to check the IV/tag integrity. So the fix can simply be replacing if (plainTextLength < 1) { with if (plainTextLength < 0) {.
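A standalone check of both points using plain JDK cipher APIs (illustration only, not the Parquet decryptor): an empty plaintext encrypts to just the 16-byte tag, and decryption still verifies that tag.

import java.security.SecureRandom;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustration only: AES-GCM handles an empty plaintext and still authenticates it.
public class EmptyGcmTagCheck {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    byte[] nonce = new byte[12];
    new SecureRandom().nextBytes(nonce);

    Cipher encryptor = Cipher.getInstance("AES/GCM/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
    byte[] cipherText = encryptor.doFinal(new byte[0]); // 16 bytes: the auth tag only

    cipherText[0] ^= 1; // corrupt the tag

    Cipher decryptor = Cipher.getInstance("AES/GCM/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, nonce));
    try {
      decryptor.doFinal(cipherText);
    } catch (AEADBadTagException e) {
      System.out.println("empty plaintext, but the tag is still verified"); // expected path
    }
  }
}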

Contributor Author:

Thanks for the suggestions. Updated.

@@ -81,7 +85,12 @@ public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {
int cipherTextOffset = SIZE_LENGTH;
int cipherTextLength = ciphertext.limit() - ciphertext.position() - SIZE_LENGTH;
int plainTextLength = cipherTextLength - GCM_TAG_LENGTH - NONCE_LENGTH;
if (plainTextLength < 1) {
Contributor:

same here

state = state.write();
int rlByteLength = toIntWithCheck(repetitionLevels.size(), "page repetition levels");
int dlByteLength = toIntWithCheck(definitionLevels.size(), "page definition levels");

-int compressedSize =
-    toIntWithCheck(compressedData.size() + repetitionLevels.size() + definitionLevels.size(), "page");
+int compressedSize = toIntWithCheck(bytes.size() + repetitionLevels.size() + definitionLevels.size(), "page");
Member:

Isn't compressedSize also supposed to include the size of pageHeaderAAD? @ggershinsky

Contributor:

The AAD suffix parameters are calculated in memory; they are not stored in the file.

The compressedSize does include the encryption IV/tag (28 bytes for GCM), but this was already accounted for in the old compressedData.size(). I presume the new bytes.size() is the same.
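For reference, the 28-byte figure is the fixed per-buffer GCM overhead (constant values assumed to match the decryptor code quoted earlier in this thread):

// Illustration only: nonce plus tag, as subtracted in the GCM decryptor's length check.
public class GcmOverheadSketch {
  private static final int NONCE_LENGTH = 12;   // IV written before the ciphertext
  private static final int GCM_TAG_LENGTH = 16; // authentication tag appended by GCM

  public static void main(String[] args) {
    System.out.println(NONCE_LENGTH + GCM_TAG_LENGTH); // prints 28
  }
}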

@ConeyLiu (Contributor Author) commented Feb 11, 2025:

Yes, this just renames compressedData to bytes, since the data may not be compressed.

@@ -91,7 +95,11 @@ public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {
int cipherTextLength = ciphertext.limit() - ciphertext.position() - SIZE_LENGTH;

int plainTextLength = cipherTextLength - NONCE_LENGTH;
if (plainTextLength < 1) {
Contributor:

same here


Successfully merging this pull request may close these issues:

Parquet-java sometimes produces 0-size compressed data in data page v2
4 participants