Support Hudi merged view files for partition path updates without compaction #24283

Open
wants to merge 4 commits into
base: master

codope
Contributor

@codope codope commented Dec 19, 2024

Description

Support Hudi merged view files for partition path updates without compaction. This is needed for Merge-on-Read Hudi tables when a record's partition path has been updated and the table has not yet been compacted.

  • Add a client and session config to specify the tables (comma-separated schemaName.tableName) where this support is needed.
  • Based on the config, if the current table needs to be listed using the merged view, do so in HudiDirectoryLister.
  • Add a test for a Merge-on-Read table with a record updated from partition p1 to p2 and no compaction.
  • Add some test fixtures to support the above test scenario.
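
The config-driven decision in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `MergedViewTables` and its method names are hypothetical, and the case-insensitive matching is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper; the names here are illustrative and not taken from the PR.
final class MergedViewTables
{
    private final Set<String> tables;

    private MergedViewTables(Set<String> tables)
    {
        this.tables = tables;
    }

    // Parses a comma-separated list such as "sales.orders, sales.returns".
    // Blank entries are skipped; names are lower-cased (an assumption).
    static MergedViewTables parse(String configValue)
    {
        if (configValue == null || configValue.trim().isEmpty()) {
            return new MergedViewTables(Collections.emptySet());
        }
        Set<String> parsed = Arrays.stream(configValue.split(","))
                .map(String::trim)
                .filter(entry -> !entry.isEmpty())
                .map(entry -> entry.toLowerCase(Locale.ENGLISH))
                .collect(Collectors.toSet());
        return new MergedViewTables(parsed);
    }

    // True when schemaName.tableName was listed in the config value.
    boolean shouldUseMergedView(String schemaName, String tableName)
    {
        String qualified = (schemaName + "." + tableName).toLowerCase(Locale.ENGLISH);
        return tables.contains(qualified);
    }
}
```

A directory lister could call `shouldUseMergedView` once per table and fall back to its existing listing path when the result is false, so tables that are not listed keep their current behavior.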

Motivation and Context

Merge-on-Read Hudi tables need merged view files when a record's partition path has been updated and the table has not yet been compacted.

Impact

With this change, the read-optimized view of a Merge-on-Read table should not return duplicates when a record's partition path has been updated, even if the table has not yet been compacted.
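
To illustrate why duplicates can appear, here is a deliberately simplified toy model. It is not Hudi's implementation. It assumes that the partition path update left the old copy of the record in a base file under p1, recorded its deletion only in a log file, and made the new copy readable under p2. The class name and the `key@partition` string encoding are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: rows are encoded as "recordKey@partition" strings.
final class PartitionUpdateToyModel
{
    private PartitionUpdateToyModel() {}

    // Reads base files only and ignores anything recorded in log files.
    static List<String> readBaseFilesOnly(List<String> baseRows)
    {
        return new ArrayList<>(baseRows);
    }

    // Merged read: base rows that have a logged delete are dropped.
    static List<String> readMerged(List<String> baseRows, Set<String> loggedDeletes)
    {
        List<String> result = new ArrayList<>();
        for (String row : baseRows) {
            if (!loggedDeletes.contains(row)) {
                result.add(row);
            }
        }
        return result;
    }
}
```

With base rows `k1@p1` and `k1@p2` and a logged delete for `k1@p1`, the base-only read in this model returns the key twice, while the merged read returns only `k1@p2`.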

Test Plan

  • Add a test for a Merge-on-Read table with a record updated from partition p1 to p2 and no compaction.
  • Add some test fixtures to support the above test scenario.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@codope codope requested a review from a team as a code owner December 19, 2024 04:25
@codope codope requested a review from presto-oss December 19, 2024 04:25
@tdcmeehan tdcmeehan self-assigned this Dec 19, 2024

@nsivabalan nsivabalan left a comment

can we add tests please

@codope codope force-pushed the hudi-hive-merged-view-part-update branch from 7198333 to 483e2cb Compare January 20, 2025 07:34
@codope
Contributor Author

codope commented Jan 20, 2025

can we add tests please

There is already a test in TestHudiDirectoryLister to cover this scenario specifically.

@codope codope force-pushed the hudi-hive-merged-view-part-update branch 2 times, most recently from 32c670f to 896e81c Compare January 23, 2025 08:03
@codope
Contributor Author

codope commented Jan 23, 2025

@ZacBlanco Feedback is addressed. Please take a look again.

@codope codope force-pushed the hudi-hive-merged-view-part-update branch 2 times, most recently from 7cd16cc to d0bf390 Compare January 27, 2025 06:28
@ZacBlanco ZacBlanco self-requested a review January 29, 2025 19:21
Contributor

@ZacBlanco ZacBlanco left a comment

Thanks for the updates. The configurations look better to me now.

I have one more concern: the files added for these tests take up about 2.2MB. That's not a whole lot, but for the health and size of the repository I would like to either generate the Hudi tables during the test or shrink the added files to under 1MiB.

zacblanco@zac-ibm presto % du -h -s presto-hive/src/test/resources/hudi_mor_part_update 
2.2M    presto-hive/src/test/resources/hudi_mor_part_update

@codope
Contributor Author

codope commented Jan 31, 2025

@ZacBlanco I have addressed your feedback. On the added resource files for the tests, currently we don't have a hudi writer integration. One solution is to zip the hudi testing data and unzip in tests. Overall size on disk would be less, with a little bit of overhead during tests (probably similar or lesser than writing hudi tables). If you think that's ok, I can do that in a separate PR. Please let me know your thoughts.
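
The zip-and-unzip approach suggested above could look roughly like the following sketch. It uses only the JDK's `java.util.zip` and `java.nio.file` APIs. The class and method names are hypothetical, and nothing here is taken from the PR.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical test utility; not part of the PR.
final class TestFixtureArchives
{
    private TestFixtureArchives() {}

    // Extracts a zip stream into a fresh temporary directory and returns that directory.
    static Path unzipToTempDirectory(InputStream zipStream, String prefix)
            throws IOException
    {
        Path target = Files.createTempDirectory(prefix);
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path resolved = target.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside the target.
                if (!resolved.startsWith(target)) {
                    throw new IOException("Zip entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(resolved);
                }
                else {
                    Files.createDirectories(resolved.getParent());
                    // Copies the current entry's bytes; the zip stream stays open.
                    Files.copy(zip, resolved);
                }
            }
        }
        return target;
    }
}
```

A test could open a zipped copy of the `hudi_mor_part_update` table from the classpath (the archive name would be the author's choice), pass the stream to `unzipToTempDirectory`, and point the directory lister at the returned path. The caller is responsible for deleting the temporary directory afterwards.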

@codope codope force-pushed the hudi-hive-merged-view-part-update branch from d0bf390 to 1b9e76d Compare January 31, 2025 17:59
@ZacBlanco
Contributor

ZacBlanco commented Jan 31, 2025

Thank you for the updates. It looks good to me now. On the table size aspect:

currently we don't have a hudi writer integration. One solution is to zip the hudi testing data and unzip in tests. Overall size on disk would be less, with a little bit of overhead during tests (probably similar or lesser than writing hudi tables).

Have you tested the size of the zipped table? Also, we might even avoid zipping if we perform the minimum number of operations and make the size of the table as small as possible.

Also, did you try to minimize the number of operations and records on this table before including it in the PR? I'm not sure how you generated this table, but ideally we would generate a table with the smallest amount of data and metadata. For example, this table should have one int-sized column and only a small number (<10?) of records across the smallest number of inserts (1-2?), alongside the partition path update.

If you think that's ok, I can do that in a separate PR. Please let me know your thoughts.

These files will get pulled on every clone of the repository. Presto is seeing 1-2K clones per day. With the current table, that adds up to 4GB of additional data being pulled per day.
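
As a rough check on that estimate, assuming every clone pulls the full 2.2 MB and using decimal units:

```latex
2.2\ \text{MB} \times 1000\ \text{clones/day} = 2.2\ \text{GB/day},
\qquad
2.2\ \text{MB} \times 2000\ \text{clones/day} = 4.4\ \text{GB/day}
```

The bytes actually transferred may differ somewhat, since Git compresses objects in packfiles.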

It needs to be done in the same commit that the table is introduced, or else the extra data will still exist in the git history and will affect non-shallow clones.

4 participants