GH-3122: Correct V2 page header compression fields for zero-size data pages #3148
base: master
Conversation
parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV2.java (outdated; resolved)
```java
if (compressedSize == 0) {
  dataPageHeaderV2.setIs_compressed(false);
}
```
I think you should take an explicit `boolean isCompressed` parameter instead.
Agreed. Data page v2 was designed to adaptively fall back to uncompressed data when compression is not promising (though we don't implement it yet). Using an explicit parameter makes sense.
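For illustration, the suggested shape might look something like this (a sketch based on the `newDataPageV2Header` hunk quoted later in this thread; the exact final signature is an assumption, and a format-level `Encoding` is taken directly to keep the snippet self-contained):

```java
import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.Encoding;
import org.apache.parquet.format.PageHeader;
import org.apache.parquet.format.PageType;

// Sketch only: the caller states whether the page body is actually compressed,
// rather than the header builder inferring it from compressedSize == 0.
static PageHeader newDataPageV2Header(
    int uncompressedSize,
    int compressedSize,
    int valueCount,
    int nullCount,
    int rowCount,
    Encoding dataEncoding,
    int rlByteLength,
    int dlByteLength,
    boolean compressed) { // the explicit flag suggested above
  DataPageHeaderV2 dataPageHeaderV2 = new DataPageHeaderV2(
      valueCount, nullCount, rowCount, dataEncoding, dlByteLength, rlByteLength);
  dataPageHeaderV2.setIs_compressed(compressed);
  PageHeader pageHeader = new PageHeader(PageType.DATA_PAGE_V2, uncompressedSize, compressedSize);
  pageHeader.setData_page_header_v2(dataPageHeaderV2);
  return pageHeader;
}
```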
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java (outdated; resolved)
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java (outdated; resolved)
```java
if (compressedData.size() > 0) {
  compressedSize =
      toIntWithCheck(compressedData.size() + repetitionLevels.size() + definitionLevels.size(), "page");
}
```
So this class is duplicating code from ColumnChunkPageWriteStore.java above? Do you know why that is?
ParquetFileWriter supports writing out pages directly, so there is some duplicated code. I plan to reduce the duplication for better maintainability.
```diff
@@ -2123,6 +2123,9 @@ private PageHeader newDataPageV2Header(
         int dlByteLength) {
     DataPageHeaderV2 dataPageHeaderV2 = new DataPageHeaderV2(
         valueCount, nullCount, rowCount, getEncoding(dataEncoding), dlByteLength, rlByteLength);
+    if (compressedSize == 0) {
+      dataPageHeaderV2.setIs_compressed(false);
+    }
```
I'm surprised that `dataPageHeaderV2.setIs_compressed()` has never been called before.
BTW, I think the description in the spec needs to be improved, since it does not consider the case when `compressed_page_size == (definition_levels_byte_length + repetition_levels_byte_length)`: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L665-L668

> Which means the section of the page between definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included) is compressed with the compression_codec.
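To make the corner case concrete, here is a sketch of the size arithmetic involved (hypothetical helper, not a parquet identifier):

```java
// Illustrative only. A V2 page body is laid out as
//   [ repetition levels | definition levels | (possibly compressed) data ]
// with the levels always stored uncompressed.
static int dataSectionLength(int compressedPageSize, int rlByteLength, int dlByteLength) {
  int dataSection = compressedPageSize - rlByteLength - dlByteLength;
  // When dataSection == 0 there is no byte range left for the codec to apply
  // to, which is exactly the case the spec wording above does not cover.
  return dataSection;
}
```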
Well, I don't think `compressed_page_size` can be 0, except if you have 0 levels and 0 data (is that possible?).
My bad. Fixed my comment.
Thanks for fixing this! I've left some comments.
parquet-column/src/main/java/org/apache/parquet/column/page/Page.java (outdated; resolved)
parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV2.java (outdated; resolved)
```diff
@@ -55,7 +55,11 @@ public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD) {
   public byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] AAD) {

     int plainTextLength = cipherTextLength - NONCE_LENGTH;
-    if (plainTextLength < 1) {
+    if (plainTextLength == 0) {
```
The decryptor doesn't support decrypting zero-byte data; this fixes it here.
@ggershinsky Does this look ok to you?
yep, technically this should work ok - unlike GCM, the CTR mode doesn't guarantee/check the integrity, so we don't need to protect against page replacement attacks (we can't anyway).
In terms of implementation, it might be a bit cleaner if we use cipher decryption of zero-sized arrays; fewer lines of code. But I'll verify first if this is supported in CTR, will get back on that.
Yep, CTR can handle this as well. So the fix can simply be a replacement of `if (plainTextLength < 1)` with `if (plainTextLength < 0)`.
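As a quick standalone check of that claim (plain JDK crypto, not parquet-java code), AES/CTR round-trips a zero-length input without complaint:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CtrZeroLengthCheck {
  public static void main(String[] args) throws Exception {
    SecretKeySpec key = new SecretKeySpec(new byte[16], "AES"); // demo key, all zeros
    IvParameterSpec iv = new IvParameterSpec(new byte[16]);     // demo IV, all zeros

    Cipher encryptor = Cipher.getInstance("AES/CTR/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, key, iv);
    byte[] ciphertext = encryptor.doFinal(new byte[0]); // zero-length plaintext

    Cipher decryptor = Cipher.getInstance("AES/CTR/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, key, iv);
    byte[] plaintext = decryptor.doFinal(ciphertext);

    // Both lengths are 0: CTR is a stream mode, so empty input is fine.
    System.out.println(ciphertext.length + " / " + plaintext.length);
  }
}
```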
Updated
```java
    org.apache.parquet.column.Encoding dataEncoding,
    int rlByteLength,
    int dlByteLength,
    boolean compressed,
```
Generally LGTM. I've left a few comments.
```diff
@@ -737,6 +737,7 @@ private void processChunk(
                 encryptColumn,
                 dataEncryptor,
                 dataPageAAD);
+        boolean compressed = compressor != null;
```
Shouldn't we get it from `headerV2` to keep it unchanged?
Changed to `headerV2.is_compressed`, since `UNCOMPRESSED` is still a compression codec.
```java
 */
@Deprecated
```
Not related to this PR: we have plenty of deprecated public methods with long parameter lists like this. It is painful for downstream projects to migrate after we remove them in 2.0.0, and the replacements are still likely to be deprecated again whenever we need a new parameter. Should we change these into a `Builder` or a parameter class (see the sketch below) so we won't break anything in the future? @gszadovszky @Fokko
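A hypothetical sketch of that idea (illustrative names, not an existing parquet-java API): new parameters become builder setters with defaults, so adding one no longer breaks existing callers or forces another deprecation round.

```java
// Illustrative only: a parameter object replacing a long method signature.
public final class DataPageV2WriteParams {
  private final int valueCount;
  private final boolean compressed;
  // ... remaining page fields elided for brevity ...

  private DataPageV2WriteParams(Builder builder) {
    this.valueCount = builder.valueCount;
    this.compressed = builder.compressed;
  }

  public int getValueCount() {
    return valueCount;
  }

  public boolean isCompressed() {
    return compressed;
  }

  public static final class Builder {
    private int valueCount;
    // A newly introduced flag gets a default, so existing callers keep compiling.
    private boolean compressed = true;

    public Builder withValueCount(int valueCount) {
      this.valueCount = valueCount;
      return this;
    }

    public Builder withCompressed(boolean compressed) {
      this.compressed = compressed;
      return this;
    }

    public DataPageV2WriteParams build() {
      return new DataPageV2WriteParams(this);
    }
  }
}
```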
Separating those methods from the ParquetFileWriter class and reducing the duplication with ColumnChunkPageWriteStore could help maintainability.
```diff
@@ -51,7 +51,11 @@ public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD) {
   public byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] AAD) {

     int plainTextLength = cipherTextLength - GCM_TAG_LENGTH - NONCE_LENGTH;
     if (plainTextLength < 1) {
```
AES GCM decryption is able to handle an empty plaintext. It will also use the key to check the IV/tag integrity. So the fix can just be a replacement of `if (plainTextLength < 1) {` with `if (plainTextLength < 0) {`.
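The same standalone check works for GCM (plain JDK crypto, not parquet-java code): encrypting an empty plaintext yields just the 16-byte authentication tag, and decryption still verifies it against the key and nonce.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class GcmZeroLengthCheck {
  public static void main(String[] args) throws Exception {
    SecretKeySpec key = new SecretKeySpec(new byte[16], "AES"); // demo key
    byte[] nonce = new byte[12];
    new SecureRandom().nextBytes(nonce);

    Cipher encryptor = Cipher.getInstance("AES/GCM/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
    byte[] ciphertext = encryptor.doFinal(new byte[0]); // 16 bytes: the tag only

    Cipher decryptor = Cipher.getInstance("AES/GCM/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, nonce));
    byte[] plaintext = decryptor.doFinal(ciphertext); // throws AEADBadTagException if tampered

    System.out.println(ciphertext.length + " / " + plaintext.length); // 16 / 0
  }
}
```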
Thanks for the suggestions. Updated.
```diff
@@ -81,7 +85,12 @@ public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {
     int cipherTextOffset = SIZE_LENGTH;
     int cipherTextLength = ciphertext.limit() - ciphertext.position() - SIZE_LENGTH;
     int plainTextLength = cipherTextLength - GCM_TAG_LENGTH - NONCE_LENGTH;
     if (plainTextLength < 1) {
```
same here
```diff
     state = state.write();
     int rlByteLength = toIntWithCheck(repetitionLevels.size(), "page repetition levels");
     int dlByteLength = toIntWithCheck(definitionLevels.size(), "page definition levels");

-    int compressedSize =
-        toIntWithCheck(compressedData.size() + repetitionLevels.size() + definitionLevels.size(), "page");
+    int compressedSize = toIntWithCheck(bytes.size() + repetitionLevels.size() + definitionLevels.size(), "page");
```
Isn't `compressedSize` also supposed to include the size of `pageHeaderAAD`? @ggershinsky
The AAD suffix parameters are calculated in-memory; they are not stored in the file. The `compressedSize` does include the encryption IV/tag (28 bytes for GCM), but this was already accounted for in the old `compressedData.size()`. I presume the new `bytes.size()` is the same.
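For reference, the 28-byte figure is just the GCM framing, mirroring the `plainTextLength = cipherTextLength - GCM_TAG_LENGTH - NONCE_LENGTH` computation in the hunks above (illustrative helper, not parquet-java code):

```java
// 12-byte nonce + 16-byte tag = 28 bytes of overhead per encrypted buffer.
static int gcmCipherTextLength(int plainTextLength) {
  final int NONCE_LENGTH = 12;
  final int GCM_TAG_LENGTH = 16;
  return plainTextLength + NONCE_LENGTH + GCM_TAG_LENGTH;
}
```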
Yes, here I just renamed `compressedData` to `bytes`, since the data may not be compressed.
```diff
@@ -91,7 +95,11 @@ public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {
     int cipherTextLength = ciphertext.limit() - ciphertext.position() - SIZE_LENGTH;

     int plainTextLength = cipherTextLength - NONCE_LENGTH;
     if (plainTextLength < 1) {
```
same here
Rationale for this change
Fixes #3122
What changes are included in this PR?
Set `is_compressed` to false and `compressed_page_size` to 0 for a V2 page with no data size.

Are these changes tested?

Yes, new unit tests (a sketch of the kind of check involved appears at the end of this description).
Are there any user-facing changes?
No.
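For illustration, the zero-size case those tests cover reduces to an assertion like this (hypothetical sketch using the thrift-generated header class, not the PR's actual test code):

```java
import static org.junit.Assert.assertFalse;

import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.Encoding;

public class ZeroSizeDataPageV2HeaderSketch {
  public static void main(String[] args) {
    // A page with no bytes at all: 0 levels and 0 data.
    int valueCount = 10, nullCount = 10, rowCount = 10;
    int rlByteLength = 0, dlByteLength = 0, compressedSize = 0;

    DataPageHeaderV2 header = new DataPageHeaderV2(
        valueCount, nullCount, rowCount, Encoding.PLAIN, dlByteLength, rlByteLength);
    if (compressedSize == 0) {
      header.setIs_compressed(false); // the fix under test: the thrift default is true
    }
    assertFalse(header.isIs_compressed());
  }
}
```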