Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Java Parquet reads via multiple host buffers #17673

Merged
merged 3 commits into from
Jan 7, 2025

Conversation

jlowe
Copy link
Member

@jlowe jlowe commented Jan 2, 2025

Description

Adds a custom cuio datasource that can provide file data via multiple host memory buffers. This allows data that arrives from multiple threads in multiple buffers to be read directly rather than requiring the buffers to be concatenated into a single host memory buffer before reading.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@jlowe jlowe added Java Affects Java cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 2, 2025
@jlowe jlowe self-assigned this Jan 2, 2025
@jlowe jlowe requested a review from a team as a code owner January 2, 2025 22:54
@github-actions github-actions bot added the CMake CMake build issue label Jan 2, 2025
Comment on lines +159 to +161
std::promise<size_t> p;
p.set_value(device_read(offset, size, dst, stream));
return p.get_future();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vuule is it safe to return from the future with device activity still potentially pending on the specified stream? It wasn't clear to me from the documentation whether the caller of this method expects the device operation to be synchronously completed when this future completes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is not really specified. I think letting the device side of the operation be performed after the future would not break anything, but it wasn't intended originally. How does this impact you here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid unnecessary stream synchronization if cuio will guarantee that it will do any required stream synchronization after the future completes. If that's not guaranteed then I'll need to add explicit stream synchronization per read operation for safety.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the semantics used here are consistent with cudf::io::device_buffer_source::device_read_async and device_read. Both just wait for the host future, but do not synchronize on the CUDA stream. I think the reader's fork_streams logic in the reader is doing the right thing, waiting on an event that has all actions from the stream we are using so far before they launch anything in the new streams.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing we have to be careful about is keeping the host buffers around during the parquet read. I don't think we have a worry about that given the API.

Copy link
Contributor

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@abellina
Copy link
Contributor

abellina commented Jan 6, 2025

some linting issues it looks like

@abellina
Copy link
Contributor

abellina commented Jan 7, 2025

/merge

@rapids-bot rapids-bot bot merged commit f308122 into rapidsai:branch-25.02 Jan 7, 2025
90 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue improvement Improvement / enhancement to an existing function Java Affects Java cuDF API. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants