
GH-3122: Correct V2 page header compression fields for zero-size data pages #3148

Open · wants to merge 4 commits into base: master
Conversation

@ConeyLiu (Contributor) commented Feb 6, 2025

Rationale for this change

Fixes #3122

What changes are included in this PR?

Set is_compressed to false and compressed_page_size to 0 for V2 pages whose data size is zero.

Are these changes tested?

Yes, new unit tests.

Are there any user-facing changes?

No.

@ConeyLiu (Contributor Author) commented Feb 6, 2025

@mapleFU @pitrou @wgtmac please take a look. Thanks a lot.

Comment on lines 2126 to 2128
if (compressedSize == 0) {
  dataPageHeaderV2.setIs_compressed(false);
}
Member:

I think you should take an explicit bool isCompressed parameter instead.

Member:

Agreed. Data page v2 was designed to adaptively fall back to uncompressed data when compression is not promising (though we don't implement it yet). Using an explicit parameter makes sense.
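A minimal sketch of the suggested shape (illustrative standalone code, not the PR's actual method; the DataPageHeaderV2 constructor and setter follow the Thrift-generated class quoted in the diffs in this thread):

import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.Encoding;

// Illustration only: the caller states whether the page body is compressed,
// instead of the header builder inferring it from compressedSize == 0.
class ExplicitCompressedFlagSketch {
  static DataPageHeaderV2 newDataPageV2Header(
      int valueCount, int nullCount, int rowCount, Encoding dataEncoding,
      int dlByteLength, int rlByteLength, boolean compressed) {
    DataPageHeaderV2 header = new DataPageHeaderV2(
        valueCount, nullCount, rowCount, dataEncoding, dlByteLength, rlByteLength);
    header.setIs_compressed(compressed);
    return header;
  }
}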

if (compressedData.size() > 0) {
  compressedSize =
      toIntWithCheck(compressedData.size() + repetitionLevels.size() + definitionLevels.size(), "page");
}
Member:

So this class is duplicating code from ColumnChunkPageWriteStore.java above? Do you know why that is?

Contributor Author:

ParquetFileWriter supports writing out pages directly, so there is some duplicated code. I plan to reduce the duplication to make maintenance easier.

@@ -2123,6 +2123,9 @@ private PageHeader newDataPageV2Header(
int dlByteLength) {
DataPageHeaderV2 dataPageHeaderV2 = new DataPageHeaderV2(
valueCount, nullCount, rowCount, getEncoding(dataEncoding), dlByteLength, rlByteLength);
if (compressedSize == 0) {
dataPageHeaderV2.setIs_compressed(false);
Member:

I'm surprised that dataPageHeaderV2.setIs_compressed() has never been called before.

@wgtmac (Member) commented Feb 7, 2025:

BTW, I think the description in the spec needs to be improved, since it does not consider the case when compressed_page_size == (definition_levels_byte_length + repetition_levels_byte_length): https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L665-L668

Which means the section of the page between definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included) is compressed with the compression_codec.
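To make that boundary case concrete, a small illustrative check of the invariant under discussion (getter names assumed to follow the Thrift-generated parquet-format classes; this is not code from the PR):

import org.apache.parquet.format.DataPageHeaderV2;
import org.apache.parquet.format.PageHeader;

// Illustration only: when compressed_page_size equals the combined levels length,
// the values section is empty, so the page should not be flagged as compressed.
class EmptyValuesSectionSketch {
  static boolean hasConsistentCompressionFields(PageHeader pageHeader) {
    DataPageHeaderV2 v2 = pageHeader.getData_page_header_v2();
    int levelsLength =
        v2.getRepetition_levels_byte_length() + v2.getDefinition_levels_byte_length();
    boolean emptyValuesSection = pageHeader.getCompressed_page_size() == levelsLength;
    return !emptyValuesSection || !v2.isIs_compressed();
  }
}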

Member:

Well, I don't think compressed_page_size can be 0, except if you have 0 levels and 0 data (is that possible?).

Member:

My bad. Fixed my comment.

@wgtmac (Member) left a comment:

Thanks for fixing this! I've left some comments.

@@ -55,7 +55,11 @@ public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD) {
public byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] AAD) {

int plainTextLength = cipherTextLength - NONCE_LENGTH;
if (plainTextLength < 1) {
if (plainTextLength == 0) {
Contributor Author:

The decryptor doesn't support decrypting zero-byte data; this fixes it here.

Member:

@ggershinsky Does this look ok to you?

Contributor:

Yep, technically this should work OK. Unlike GCM, the CTR mode doesn't guarantee or check integrity, so we don't need to protect against page-replacement attacks (we can't anyway).

In terms of implementation, it might be a bit cleaner to use cipher decryption of zero-sized arrays; fewer lines of code. But I'll first verify whether CTR supports this and will get back on that.

Contributor:

Yep, CTR can handle this as well, so the fix can simply be replacing if (plainTextLength < 1) with if (plainTextLength < 0).
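A standalone check of that claim using plain JDK cipher APIs (illustration only, not the Parquet decryptor; the 16-byte IV here is just for the demo, whereas Parquet's CTR layout uses a 12-byte nonce plus a counter):

import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Illustration only: AES/CTR round-trips a zero-length plaintext without error,
// which is what makes the relaxed "< 0" guard safe.
public class EmptyCtrRoundTrip {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);

    Cipher encryptor = Cipher.getInstance("AES/CTR/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    byte[] cipherText = encryptor.doFinal(new byte[0]); // zero-length ciphertext

    Cipher decryptor = Cipher.getInstance("AES/CTR/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
    System.out.println(decryptor.doFinal(cipherText).length); // prints 0
  }
}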

Contributor Author:

Updated

org.apache.parquet.column.Encoding dataEncoding,
int rlByteLength,
int dlByteLength,
boolean compressed,
Contributor Author:

@pitrou @wgtmac I changed it to an explicit parameter.

@wgtmac (Member) left a comment:

Generally LGTM. I've left a few comments.

@@ -737,6 +737,7 @@ private void processChunk(
encryptColumn,
dataEncryptor,
dataPageAAD);
boolean compressed = compressor != null;
Member:

Shouldn't we get it from headerV2 to keep it unchanged?

Contributor Author:

Changed it to use headerV2.is_compressed, since UNCOMPRESSED is still a compression codec, so a non-null compressor doesn't imply the data is actually compressed.

*/
@Deprecated
Member:

Not related to this PR: we have plenty of deprecated public methods with a bunch of parameters like this. It is painful for downstream users to migrate once we remove them in 2.0.0, and the replacements are likely to be deprecated again the next time we need a new parameter. Should we change these into a Builder, or a class that bundles the many parameters, so we won't break anything in the future? @gszadovszky @Fokko
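A hypothetical sketch of the Builder idea (all names invented for illustration): adding a field later only needs a new setter and a default, so no public signature changes.

// Illustration only: bundling page parameters so that adding a field later does
// not change any public method signature.
public final class DataPageV2Params {
  private final int valueCount;
  private final int rowCount;
  private final int rlByteLength;
  private final int dlByteLength;
  private final boolean compressed;

  private DataPageV2Params(Builder builder) {
    this.valueCount = builder.valueCount;
    this.rowCount = builder.rowCount;
    this.rlByteLength = builder.rlByteLength;
    this.dlByteLength = builder.dlByteLength;
    this.compressed = builder.compressed;
  }

  public boolean isCompressed() {
    return compressed;
  }

  public static final class Builder {
    private int valueCount;
    private int rowCount;
    private int rlByteLength;
    private int dlByteLength;
    private boolean compressed = true; // a later addition only needs a sensible default

    public Builder withValueCount(int valueCount) { this.valueCount = valueCount; return this; }
    public Builder withRowCount(int rowCount) { this.rowCount = rowCount; return this; }
    public Builder withRlByteLength(int rlByteLength) { this.rlByteLength = rlByteLength; return this; }
    public Builder withDlByteLength(int dlByteLength) { this.dlByteLength = dlByteLength; return this; }
    public Builder withCompressed(boolean compressed) { this.compressed = compressed; return this; }

    public DataPageV2Params build() { return new DataPageV2Params(this); }
  }
}

Existing deprecated overloads could then delegate to a single method that takes such an object.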

Contributor Author:

Separating those methods from the ParquetFileWriter class and reducing the duplication with ColumnChunkPageWriteStore could help maintenance.

@@ -51,7 +51,11 @@ public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD) {
public byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, byte[] AAD) {

int plainTextLength = cipherTextLength - GCM_TAG_LENGTH - NONCE_LENGTH;
if (plainTextLength < 1) {
Contributor:

AES GCM decryption is able to handle an empty plaintext; it will still use the key to check the IV/tag integrity. So the fix can simply be replacing if (plainTextLength < 1) { with if (plainTextLength < 0) {.
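A standalone check of both points using plain JDK cipher APIs (illustration only, not the Parquet decryptor): an empty plaintext encrypts to just the 16-byte tag, and decryption still verifies that tag.

import java.security.SecureRandom;
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Illustration only: AES-GCM handles an empty plaintext and still authenticates it.
public class EmptyGcmTagCheck {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    byte[] nonce = new byte[12];
    new SecureRandom().nextBytes(nonce);

    Cipher encryptor = Cipher.getInstance("AES/GCM/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, nonce));
    byte[] cipherText = encryptor.doFinal(new byte[0]); // 16 bytes: the auth tag only

    cipherText[0] ^= 1; // corrupt the tag

    Cipher decryptor = Cipher.getInstance("AES/GCM/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, nonce));
    try {
      decryptor.doFinal(cipherText);
    } catch (AEADBadTagException e) {
      System.out.println("empty plaintext, but the tag is still verified"); // expected path
    }
  }
}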

Contributor Author:

Thanks for the suggestions. Updated.

@@ -81,7 +85,12 @@ public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {
int cipherTextOffset = SIZE_LENGTH;
int cipherTextLength = ciphertext.limit() - ciphertext.position() - SIZE_LENGTH;
int plainTextLength = cipherTextLength - GCM_TAG_LENGTH - NONCE_LENGTH;
if (plainTextLength < 1) {
Contributor:

same here

state = state.write();
int rlByteLength = toIntWithCheck(repetitionLevels.size(), "page repetition levels");
int dlByteLength = toIntWithCheck(definitionLevels.size(), "page definition levels");

-int compressedSize =
-    toIntWithCheck(compressedData.size() + repetitionLevels.size() + definitionLevels.size(), "page");
+int compressedSize = toIntWithCheck(bytes.size() + repetitionLevels.size() + definitionLevels.size(), "page");
Member:

Isn't compressedSize also supposed to include the size of pageHeaderAAD? @ggershinsky

Contributor:

The AAD suffix parameters are calculated in memory; they are not stored in the file.

The compressedSize does include the encryption IV/tag (28 bytes for GCM), but this was already accounted for in the old compressedData.size(). I presume the new bytes.size() is the same.
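For reference, the 28-byte figure is the fixed per-buffer GCM overhead (constant values assumed to match the decryptor code quoted earlier in this thread):

// Illustration only: nonce plus tag, as subtracted in the GCM decryptor's length check.
public class GcmOverheadSketch {
  private static final int NONCE_LENGTH = 12;   // IV written before the ciphertext
  private static final int GCM_TAG_LENGTH = 16; // authentication tag appended by GCM

  public static void main(String[] args) {
    System.out.println(NONCE_LENGTH + GCM_TAG_LENGTH); // prints 28
  }
}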

@ConeyLiu (Contributor Author) commented Feb 11, 2025:

Yes, this just renames compressedData to bytes, since the data may not be compressed.

@@ -91,7 +95,11 @@ public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {
int cipherTextLength = ciphertext.limit() - ciphertext.position() - SIZE_LENGTH;

int plainTextLength = cipherTextLength - NONCE_LENGTH;
if (plainTextLength < 1) {
Contributor:

same here


Successfully merging this pull request may close these issues:

Parquet-java sometimes produces 0-size compressed data in data page v2
4 participants