
chore: Improve data loading performance in S3 store #244

Merged: 1 commit into main on Jan 14, 2025

Conversation

@jstlaurent (Contributor) commented on Jan 13, 2025

Changelogs

  • Add an LRU cache to the S3Store class to cache fetched objects
  • Add chunk-based iterators to improve performance when traversing dataset columns

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

This builds on the work Cas did in feat/optimize-dataloader. I increased the number of cached objects in the S3 store, since most of them are small anyway. I also added a generator-based chunking iterator, which I think is simpler than the class-based approach. It should reduce the amount of decompression that takes place when iterating through a dataset row by row.
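To make the iterator change concrete, here is a minimal sketch of the chunk-based iteration pattern. The function name and the assumption that a column is backed by a single Zarr array are illustrative, not the actual implementation in this PR:

```python
from typing import Any, Generator

import zarr


def iter_column_chunks(arr: zarr.Array) -> Generator[Any, None, None]:
    """Yield rows one by one, but fetch and decompress one chunk at a time.

    Reading whole chunks avoids re-fetching and re-decompressing the same
    chunk for every individual row access.
    """
    chunk_size = arr.chunks[0]
    for start in range(0, arr.shape[0], chunk_size):
        # One store request and one decompression per chunk, instead of per row.
        chunk = arr[start : start + chunk_size]
        yield from chunk
```

Iterating a column then becomes a plain `for row in iter_column_chunks(column_array): ...` loop, with no iterator class to maintain.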

@cwognum (Collaborator) left a comment

I explored this a little further today (but also in a new branch, given the merge conflicts) and I was just about to open a PR for feat/optimize-data-access. You beat me to it! 😄

We should merge in the caching layer for the S3Store as soon as possible, but the custom iterators need some further thinking. I've left some comments throughout the PR, but this is raising some higher level thoughts for me:

  1. Given Add custom codecs for RDKit Molecules and Biotite AtomArrays #243, I think we could deprecate the Adapter and DatasetFactory as well as constrain the structure of the Zarr archive to not have any subgroups. That would likely be a breaking change, but I think it's worth it.
  2. At several places, we mix data loading with data transformations, which makes it hard to load data from a cached chunk.

On a bit of a tangent, I found the following performance penalties for each chunk access:

  • Fetching the data from the cloud bucket.
  • Copying the data to a NumPy Array.
  • Decompressing the data.

As we examined, the decompression added minimal overhead, but the data copies added notable overhead. Wonder if this has been sped up in Zarr V3. See also: zarr-developers/zarr-python#1395

@jstlaurent jstlaurent closed this Jan 14, 2025
@jstlaurent jstlaurent force-pushed the chore/optimize-dataloading branch from 16ba67a to ffca9da on January 14, 2025 01:58
@jstlaurent jstlaurent reopened this Jan 14, 2025
@jstlaurent (Contributor, Author) commented

> We should merge in the caching layer for the S3Store as soon as possible, but the custom iterators need some further thinking. I've left some comments throughout the PR, but this is raising some higher level thoughts for me:
>
> 1. Given [Add custom codecs for RDKit Molecules and Biotite AtomArrays #243](https://github.com/polaris-hub/polaris/pull/243), I think we could deprecate the `Adapter` and `DatasetFactory` as well as constrain the structure of the Zarr archive to not have any subgroups. That would likely be a breaking change, but I think it's worth it.
> 2. At several places, we mix data loading with data transformations, which makes it hard to load data from a cached chunk.

I don't know if we specifically need to constrain the structure, but we do mix loading and transformations in a very ad-hoc way. That makes things more complicated. I do think leveraging object codecs to make the Zarr layer handle transformations lets us build a simpler access layer.
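To illustrate the object-codec direction (a sketch only, not code from #243 or this PR; the codec name and pickle-based encoding are placeholders), numcodecs lets the Zarr layer own the object-to-bytes transformation, so the access layer only ever sees ready-to-use objects:

```python
import pickle

import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec


class ObjectTransformCodec(Codec):
    """Placeholder object codec: (de)serializes Python objects inside Zarr.

    A real codec would encode e.g. RDKit molecules or Biotite AtomArrays,
    so the transformation lives in the Zarr layer rather than in the loader.
    """

    codec_id = "object-transform"  # placeholder identifier

    def encode(self, buf):
        # Object columns arrive as NumPy arrays with dtype=object.
        return pickle.dumps(np.asarray(buf, dtype=object), protocol=5)

    def decode(self, buf, out=None):
        decoded = pickle.loads(bytes(buf))
        if out is not None:
            out[:] = decoded
            return out
        return decoded


register_codec(ObjectTransformCodec)
```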

> On a bit of a tangent, I found the following performance penalties for each chunk access:
>
> * Fetching the data from the cloud bucket.
> * Copying the data to a NumPy Array.
> * Decompressing the data.
>
> As we examined, the decompression added minimal overhead, but the data copies added notable overhead. Wonder if this has been sped up in Zarr V3. See also: zarr-developers/zarr-python#1395

We can work on the fetching part, but the other two are inside Zarr-Python and out of our control.
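For the fetching part, the gist of the caching layer is roughly the following. This is a sketch only: the class and method names are hypothetical, and the real S3Store may cache at a different granularity:

```python
from functools import lru_cache


class CachedS3Store:
    """Sketch of an S3-backed store with an LRU cache on fetched objects."""

    def __init__(self, s3_client, bucket: str, cache_size: int = 1024):
        self._s3 = s3_client
        self._bucket = bucket
        # Bind the LRU-cached fetcher per instance so the cache is not
        # shared between stores that point at different buckets.
        self._cached_fetch = lru_cache(maxsize=cache_size)(self._fetch_object)

    def _fetch_object(self, key: str) -> bytes:
        # One network round trip per cache miss; most objects are small,
        # so a generous cache size is cheap.
        response = self._s3.get_object(Bucket=self._bucket, Key=key)
        return response["Body"].read()

    def __getitem__(self, key: str) -> bytes:
        # Repeated reads of the same key are served from memory.
        return self._cached_fetch(key)
```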

I stumbled upon this repo that might be an interesting way to speed up the S3 access layer.

There might be a way to leverage a Rust-based codec pipeline in Zarr-Python. It's not as good as pure Rust, but the benchmarks show some interesting performance improvements. It's probably the best we can do.
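If we go down that route, the usage pattern would presumably look something like the snippet below. This assumes the Rust pipeline is exposed the way the zarrs Python bindings expose theirs, which targets zarr-python v3 rather than the v2 code path we use today, so treat the package name and config key as assumptions:

```python
import zarr
import zarrs  # noqa: F401  # assumed package name for the Rust-backed pipeline

# Assumption: the Rust pipeline registers a codec-pipeline implementation
# that zarr-python (v3) can be pointed at through its runtime config.
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

# From here on, chunk encoding/decoding for opened arrays goes through the
# Rust pipeline instead of the pure-Python one.
```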

@jstlaurent jstlaurent requested a review from cwognum January 14, 2025 03:09
@jstlaurent jstlaurent changed the title from "chore: Improve data loading performance when iterating on a V2 dataset through the S3 store" to "chore: Improve data loading performance in S3 store" on Jan 14, 2025
@cwognum (Collaborator) left a comment

Thanks, @jstlaurent!

@jstlaurent jstlaurent merged commit fe9da3a into main Jan 14, 2025
20 checks passed
@jstlaurent jstlaurent deleted the chore/optimize-dataloading branch January 14, 2025 20:48