chore: Improve data loading performance in S3 store #244
Conversation
I explored this a little further today (but also in a new branch, given the merge conflicts) and I was just about to open a PR for `feat/optimize-data-access`. You beat me to it! 😄
We should merge in the caching layer for the `S3Store` as soon as possible, but the custom iterators need some further thinking. I've left some comments throughout the PR, but this is raising some higher-level thoughts for me:
- Given #243 (Add custom codecs for RDKit Molecules and Biotite AtomArrays), I think we could deprecate the `Adapter` and `DatasetFactory`, as well as constrain the structure of the Zarr archive to not have any subgroups. That would likely be a breaking change, but I think it's worth it.
- At several places, we mix data loading with data transformations, which makes it hard to load data from a cached chunk.
On a bit of a tangent, I found the following performance penalties for each chunk access:
- Fetching the data from the cloud bucket.
- Copying the data to a NumPy Array.
- Decompressing the data.
From what we examined, the decompression added minimal overhead, but the data copies added notable overhead. I wonder if this has been sped up in Zarr v3. See also: zarr-developers/zarr-python#1395
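For the fetching penalty specifically, a store-level cache is one option. Below is a minimal sketch assuming the zarr-python v2 API; the bucket path and cache size are illustrative placeholders, not the configuration in this PR.

```python
# Minimal sketch: wrap a remote store in an LRU cache so repeated chunk
# reads hit memory instead of S3. Assumes zarr-python v2; the store path
# and cache budget below are illustrative placeholders.
import zarr
from zarr.storage import FSStore, LRUStoreCache

# Any MutableMapping-style store works here; FSStore over s3:// is one option.
remote_store = FSStore("s3://my-bucket/dataset.zarr")  # hypothetical path
cached_store = LRUStoreCache(remote_store, max_size=256 * 2**20)  # ~256 MiB

root = zarr.open_group(store=cached_store, mode="r")
# Repeated reads of the same chunk now only pay the fetch cost once.
```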
I don't know if we specifically need to constrain the structure, but we do mix loading and transformations in a very ad hoc way, which makes things more complicated. I do think leveraging object codecs to make the Zarr layer handle transformations would let us build a simpler access layer.
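To make the object-codec idea concrete, here is a rough sketch of what such a codec could look like, assuming the numcodecs `Codec` interface and a 1-D object array; the `SmilesCodec` name and the SMILES round-trip are illustrative placeholders, not the codecs from #243.

```python
# Rough sketch of a custom object codec, assuming the numcodecs Codec API
# and a 1-D object array of RDKit molecules. Names here are illustrative.
import numpy as np
from numcodecs.abc import Codec
from numcodecs.compat import ensure_bytes
from numcodecs.registry import register_codec
from rdkit import Chem

class SmilesCodec(Codec):
    codec_id = "smiles"  # hypothetical identifier

    def encode(self, buf):
        # Serialize each molecule to a SMILES string, one per line.
        mols = np.asarray(buf, dtype=object).ravel()
        return "\n".join(Chem.MolToSmiles(mol) for mol in mols).encode("utf-8")

    def decode(self, buf, out=None):
        # Rebuild the object array of molecules from the serialized chunk.
        lines = ensure_bytes(buf).decode("utf-8").split("\n")
        mols = np.array([Chem.MolFromSmiles(s) for s in lines], dtype=object)
        if out is not None:
            out[...] = mols
            return out
        return mols

register_codec(SmilesCodec)
# With this registered, the Zarr layer hands back molecule objects directly,
# e.g. zarr.create(..., dtype=object, object_codec=SmilesCodec()).
```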
We can work on the fetching part, but the other two are inside Zarr-Python and out of our control. I stumbled upon this repo, which might be an interesting way to speed up the S3 access layer. There might also be a way to leverage a Rust-based codec pipeline in Zarr-Python. It's not as good as pure Rust, but the benchmarks show some interesting performance improvements. Probably the best we can do.
Thanks, @jstlaurent!
Changelogs
- Updated the `S3Store` class to cache fetched objects.

Checklist:
- Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
- Add a label to the PR: `feature`, `fix`, `chore`, `documentation` or `test` (or ask a maintainer to do it for you).

This builds on the work Cas did in `feat/optimize-dataloader`. I increased the number of cached objects in the S3 store, since most of them are small anyway. I also added a generator-based chunking iterator, which I think is simpler than the class-based approach. It should reduce the amount of decompression that takes place when iterating through a dataset row-by-row.
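For reference, the generator-based approach boils down to something like the sketch below, assuming a `zarr.Array` chunked along its first dimension; `iter_rows` is an illustrative name, not the exact function added in this branch.

```python
# Minimal sketch of a generator-based chunking iterator: each chunk is
# fetched and decompressed once, then its rows are yielded one by one.
# Assumes a zarr.Array chunked along the first dimension.
from typing import Any, Iterator

import zarr

def iter_rows(arr: zarr.Array) -> Iterator[Any]:
    chunk_len = arr.chunks[0]
    for start in range(0, arr.shape[0], chunk_len):
        # One slice read per chunk: a single fetch + decompression,
        # instead of one per row.
        chunk = arr[start : start + chunk_len]
        for row in chunk:
            yield row

# Usage (hypothetical dataset handle):
# for row in iter_rows(dataset["x"]):
#     process(row)
```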