Java Parquet reads via multiple host buffers #17673
Signed-off-by: Jason Lowe <[email protected]>
std::promise<size_t> p;
p.set_value(device_read(offset, size, dst, stream));
return p.get_future();
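For readers unfamiliar with the pattern in this snippet, here is a minimal host-only sketch of returning an already-satisfied future from a synchronous call. `sync_read` and `read_async` are hypothetical names used for illustration, not cudf functions, and a plain `memcpy` stands in for the device read, so this says nothing about CUDA stream ordering.

```cpp
#include <cstddef>
#include <cstring>
#include <future>

// Hypothetical stand-in for a synchronous read: copies `size` bytes from
// `src + offset` into `dst` and reports how many bytes were copied.
std::size_t sync_read(const char* src, std::size_t offset, std::size_t size, char* dst)
{
  std::memcpy(dst, src + offset, size);
  return size;
}

// Wraps the synchronous call in a future. The work is done before
// set_value() is called, so the returned future is ready immediately.
std::future<std::size_t> read_async(const char* src,
                                    std::size_t offset,
                                    std::size_t size,
                                    char* dst)
{
  std::promise<std::size_t> p;
  p.set_value(sync_read(src, offset, size, dst));
  return p.get_future();
}
```

The future here only reports that the host-side call has returned. Whether work queued on a stream has finished is a separate question, which is what the comments below discuss.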
@vuule is it safe to return from the future with device activity still potentially pending on the specified stream? It wasn't clear to me from the documentation whether the caller of this method expects the device operation to be synchronously completed when this future completes.
Yeah, this is not really specified. I think letting the device side of the operation be performed after the future completes would not break anything, but it wasn't intended originally. How does this impact you here?
I'd like to avoid unnecessary stream synchronization if cuio will guarantee that it will do any required stream synchronization after the future completes. If that's not guaranteed then I'll need to add explicit stream synchronization per read operation for safety.
I see the semantics used here are consistent with cudf::io::device_buffer_source::device_read_async and device_read. Both just wait for the host future but do not synchronize on the CUDA stream. I think the reader's fork_streams logic is doing the right thing: it waits on an event that captures all work issued so far on the stream we are using before launching anything on the new streams.
The only thing we have to be careful about is keeping the host buffers alive for the duration of the Parquet read. I don't think that's a concern given the API.
LGTM
Some linting issues, it looks like.
/merge
Description
Adds a custom cuio datasource that can provide file data via multiple host memory buffers. This allows data that arrives from multiple threads in multiple buffers to be read directly rather than requiring the buffers to be concatenated into a single host memory buffer before reading.
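To illustrate the idea of reading across several buffers without concatenating them, here is a rough host-only sketch of mapping a logical file offset onto a list of buffers. The class `multi_host_buffer_view` and its members are hypothetical and are not the datasource added by this PR. The real implementation plugs into the cuio datasource interface and handles device copies, neither of which is shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical, non-owning view over several host buffers that together
// form one logical file. The buffers must outlive every call to read().
class multi_host_buffer_view {
 public:
  struct segment {
    const std::uint8_t* data;
    std::size_t size;
  };

  explicit multi_host_buffer_view(std::vector<segment> segments)
    : segments_(std::move(segments))
  {
    // starts_[i] is the logical offset at which segment i begins.
    starts_.reserve(segments_.size());
    for (auto const& s : segments_) {
      starts_.push_back(total_);
      total_ += s.size;
    }
  }

  std::size_t size() const { return total_; }

  // Copies up to `size` bytes starting at logical `offset` into `dst`,
  // crossing segment boundaries as needed. Returns the bytes copied.
  std::size_t read(std::size_t offset, std::size_t size, std::uint8_t* dst) const
  {
    if (offset >= total_) { return 0; }
    size = std::min(size, total_ - offset);

    // Binary search for the last segment whose start is <= offset.
    auto it         = std::upper_bound(starts_.begin(), starts_.end(), offset);
    std::size_t idx = static_cast<std::size_t>(it - starts_.begin()) - 1;

    std::size_t copied = 0;
    while (copied < size) {
      auto const& seg    = segments_[idx];
      std::size_t within = offset + copied - starts_[idx];
      std::size_t n      = std::min(size - copied, seg.size - within);
      if (n > 0) { std::memcpy(dst + copied, seg.data + within, n); }
      copied += n;
      ++idx;
    }
    return copied;
  }

 private:
  std::vector<segment> segments_;
  std::vector<std::size_t> starts_;
  std::size_t total_ = 0;
};
```

Because the view does not own its buffers, the caller must keep them alive until all reads have finished, which is the lifetime concern raised in the review.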
Checklist