
chore: Improve data loading performance in S3 store #244

Merged: 1 commit into main on Jan 14, 2025

Conversation

@jstlaurent (Contributor) commented on Jan 13, 2025

Changelogs

  • Add an LRU cache to the S3Store class to cache fetched objects
  • Add chunk-based iterators to improve performance when traversing dataset columns

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

This builds on the work Cas did in feat/optimize-dataloader. I increased the number of cached objects in the S3 store, since most of them are small anyway. I also added a generator-based chunking iterator, which I think is simpler than the class-based approach. It should reduce the amount of decompression that takes place when iterating through a dataset row by row.
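To make the iterator change concrete, here is a minimal sketch of the chunk-based iteration pattern. The function name and the assumption that a column is backed by a single Zarr array are illustrative, not the actual implementation in this PR:

```python
from typing import Any, Generator

import zarr


def iter_column_chunks(arr: zarr.Array) -> Generator[Any, None, None]:
    """Yield rows one by one, but fetch and decompress one chunk at a time.

    Reading whole chunks avoids re-fetching and re-decompressing the same
    chunk for every individual row access.
    """
    chunk_size = arr.chunks[0]
    for start in range(0, arr.shape[0], chunk_size):
        # One store request and one decompression per chunk, instead of per row.
        chunk = arr[start : start + chunk_size]
        yield from chunk
```

Iterating a column then becomes a plain `for row in iter_column_chunks(column_array): ...` loop, with no iterator class to maintain.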

@cwognum (Collaborator) left a comment

I explored this a little further today (but also in a new branch, given the merge conflicts) and I was just about to open a PR for feat/optimize-data-access. You beat me to it! 😄

We should merge in the caching layer for the S3Store as soon as possible, but the custom iterators need some further thinking. I've left some comments throughout the PR, but this is raising some higher level thoughts for me:

  1. Given Add custom codecs for RDKit Molecules and Biotite AtomArrays #243, I think we could deprecate the Adapter and DatasetFactory as well as constrain the structure of the Zarr archive to not have any subgroups. That would likely be a breaking change, but I think it's worth it.
  2. At several places, we mix data loading with data transformations, which makes it hard to load data from a cached chunk.

On a bit of a tangent, I found the following performance penalties for each chunk access:

  • Fetching the data from the cloud bucket.
  • Copying the data to a NumPy Array.
  • Decompressing the data.

As we examined, the decompression added minimal overhead, but the data copies added notable overhead. Wonder if this has been sped up in Zarr V3. See also: zarr-developers/zarr-python#1395

@jstlaurent jstlaurent closed this Jan 14, 2025
@jstlaurent jstlaurent force-pushed the chore/optimize-dataloading branch from 16ba67a to ffca9da on January 14, 2025 01:58
@jstlaurent jstlaurent reopened this Jan 14, 2025
@jstlaurent (Contributor, Author) commented

> We should merge in the caching layer for the S3Store as soon as possible, but the custom iterators need some further thinking. I've left some comments throughout the PR, but this is raising some higher level thoughts for me:
>
> 1. Given [Add custom codecs for RDKit Molecules and Biotite AtomArrays #243](https://github.com/polaris-hub/polaris/pull/243), I think we could deprecate the `Adapter` and `DatasetFactory` as well as constrain the structure of the Zarr archive to not have any subgroups. That would likely be a breaking change, but I think it's worth it.
> 2. At several places, we mix data loading with data transformations, which makes it hard to load data from a cached chunk.

I don't know if we specifically need to constrain the structure, but we do mix loading and transformations in a very ad-hoc way. That makes things more complicated. I do think leveraging object codecs to make the Zarr layer handle transformations lets us build a simpler access layer.
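To illustrate the object-codec direction (a sketch only, not code from #243 or this PR; the codec name and pickle-based encoding are placeholders), numcodecs lets the Zarr layer own the object-to-bytes transformation, so the access layer only ever sees ready-to-use objects:

```python
import pickle

import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class ObjectTransformCodec(Codec):
    """Placeholder object codec: (de)serializes Python objects inside Zarr.

    A real codec would encode e.g. RDKit molecules or Biotite AtomArrays,
    so the transformation lives in the Zarr layer rather than in the loader.
    """

    codec_id = "object-transform"  # placeholder identifier

    def encode(self, buf):
        # Object columns arrive as NumPy arrays with dtype=object.
        return pickle.dumps(np.asarray(buf, dtype=object), protocol=5)

    def decode(self, buf, out=None):
        decoded = pickle.loads(bytes(buf))
        if out is not None:
            out[:] = decoded
            return out
        return decoded


register_codec(ObjectTransformCodec)
```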

> On a bit of a tangent, I found the following performance penalties for each chunk access:
>
> * Fetching the data from the cloud bucket.
> * Copying the data to a NumPy Array.
> * Decompressing the data.
>
> As we examined, the decompression added minimal overhead, but the data copies added notable overhead. Wonder if this has been sped up in Zarr V3. See also: zarr-developers/zarr-python#1395

We can work on the fetching part, but the other two are inside Zarr-Python and out of our control.
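For the fetching part, the gist of the caching layer is roughly the following. This is a sketch only: the class and method names are hypothetical, and the real S3Store may cache at a different granularity:

```python
from functools import lru_cache


class CachedS3Store:
    """Sketch of an S3-backed store with an LRU cache on fetched objects."""

    def __init__(self, s3_client, bucket: str, cache_size: int = 1024):
        self._s3 = s3_client
        self._bucket = bucket
        # Bind the LRU-cached fetcher per instance so the cache is not
        # shared between stores that point at different buckets.
        self._cached_fetch = lru_cache(maxsize=cache_size)(self._fetch_object)

    def _fetch_object(self, key: str) -> bytes:
        # One network round trip per cache miss; most objects are small,
        # so a generous cache size is cheap.
        response = self._s3.get_object(Bucket=self._bucket, Key=key)
        return response["Body"].read()

    def __getitem__(self, key: str) -> bytes:
        # Repeated reads of the same key are served from memory.
        return self._cached_fetch(key)
```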

I stumbled upon this repo that might be an interesting way to speed up the S3 access layer.

There might be a way to leverage a Rust-based codec pipeline in Zarr-Python. It's not as good as pure Rust, but the benchmarks show some interesting performance improvements. It's probably the best we can do.
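If we go down that route, the usage pattern would presumably look something like the snippet below. This assumes the Rust pipeline is exposed the way the zarrs Python bindings expose theirs, which targets zarr-python v3 rather than the v2 code path we use today, so treat the package name and config key as assumptions:

```python
import zarr
import zarrs  # noqa: F401  # assumed package name for the Rust-backed pipeline

# Assumption: the Rust pipeline registers a codec-pipeline implementation
# that zarr-python (v3) can be pointed at through its runtime config.
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

# From here on, chunk encoding/decoding for opened arrays goes through the
# Rust pipeline instead of the pure-Python one.
```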

@jstlaurent jstlaurent requested a review from cwognum January 14, 2025 03:09
@jstlaurent jstlaurent changed the title from "chore: Improve data loading performance when iterating on a V2 dataset through the S3 store" to "chore: Improve data loading performance in S3 store" on Jan 14, 2025
@cwognum (Collaborator) left a comment

Thanks, @jstlaurent!

@jstlaurent jstlaurent merged commit fe9da3a into main Jan 14, 2025
20 checks passed
@jstlaurent jstlaurent deleted the chore/optimize-dataloading branch January 14, 2025 20:48