Support a new group structure for columns #241

cwognum · 2025-01-09T19:53:43Z

Changelogs

Add support for a new group of arrays structure.

Checklist:

~~Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.~~
Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
Update the API documentation if a new function is added, or an existing one is deleted.
Write concise and explanatory changelogs above.
If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

A popular file format in drug discovery to represent protein structures is the PDB (Protein Data Bank) file. The 3D structure of the protein can be thought of as a list of properties for each of the protein's atoms (e.g. the atom's x, y and z coordinate). For each of these properties, you can save all values in a single array. This is the internal representation of the popular Biotite library, see e.g. AtomArray. In Zarr, a group of arrays thus makes for a natural representation for this data.

There are 14 properties and we need at least 3 files per property (i.e. the chunk, .zattrs, .zmetadata). With some extra metadata files on the group level, that makes for 44 files per protein. All of these files are small. For cloud-native Zarr archives, this makes the Zarr based solution a lot less performant than simply saving a single PDB file.

The primary alternative that comes to mind is to restructure the Zarr archive such that the arrays are larger.

The main idea is to concatenate the per-group arrays in one larger array. Since the per-group arrays are variable in size (i.e. the number of atoms per protein differs), we would use Ragged Arrays. This PR adds support for a new structure that allows us to built a datapoint from indexing the nth element in each array.

We now officially support three Zarr structures to represent the data in a column:

An array: Each element in the array corresponds to a data point in the dataset.
A group of groups: Each subgroup corresponds to a data point in the dataset.
- The special __index__ array specifies the ordering of the subgroups.
A group of arrays (new): The nth datapoint is constructed from indexing the nth element in each array.

jstlaurent

I trust you and we need this, so I'll approve it.

I feel uncomfortable with the implicit complexity we're baking into the Zarr structure. There's increasingly tight coupling between the validation performed here and code that lives elsewhere in the client. That coupling is non-obvious (and undocumented). That all need to be kept in sync, or else users are going to start getting weird (for them) errors that they are ill equipped to fix.

I fear we're also going to be ill equipped to evolve this to other embedded formats without losing control.

cwognum · 2025-01-10T12:05:44Z

@jstlaurent I share your discomfort.

A Zarr archive can get arbitrarily complex. We need some way to constrain this structure for use in a dataset, but I'm not convinced I've found the right place to do so. Perhaps something like a Column class that groups related logic for data access? We could have a ArrayColumn and a GroupWithSubgroupsColumn, for example.

For now, there are two options:

We don't need this feature per se and can choose to postpone merging it.
The issue seems to be with the structure of the code, more so than the structure of the Zarr archive. If so, that means we could merge it already and refactor the code later.

Andrewq11 · 2025-01-10T15:20:08Z

I see where you're both coming from regarding the implicit complexity, and I agree that without clearer documentation and a more concrete design decision around this, it will be troublesome at some point. I like Cas's idea related to a Column class which can be a place to clearly define/centralize/document the logic around all of this.

Given the performance gains we're seeing from this and the upcoming competitions on Monday, I think we should merge it. We can then soon after formalize and centralize this logic in some class.

jstlaurent · 2025-01-10T16:02:06Z

Yeah... It's not clear to me (or perhaps to all of us, since we haven't dug into it) on where we draw the line between turning an embedded format into a Zarr-native structure (groups and arrays), or embedding it as is.

For example, images in phenomics datasets get encoded as binary by using the codecs functionality in Zarr. Here, we try to turn PDBs into Zarr groups and arrays. But there's a world in where, once we're parsed the PDB to a series of dicts (which is what the PDBConverter class does), we encode them as is, through numcodecs.MsgPack or something similar, instead of making them into Zarr-native formats.

Edit: Cas and I discussed this live, and he's going to try something with codecs to see if it might work. We have this solution if it does not.

cwognum · 2025-01-10T16:42:43Z

The solution @jstlaurent proposed of using a custom object_codec works and it feels a lot more idiomatic. I'll hence close this PR!

Allow for groups with arrays for more efficient data PDB storage

9d01793

cwognum added the feature Annotates any PR that adds new features; Used in the release process label Jan 9, 2025

cwognum requested review from jstlaurent, mercuryseries, roselynh100, achchala and Andrewq11 January 9, 2025 19:53

cwognum self-assigned this Jan 9, 2025

jstlaurent approved these changes Jan 10, 2025

View reviewed changes

Andrewq11 approved these changes Jan 10, 2025

View reviewed changes

cwognum closed this Jan 10, 2025

cwognum mentioned this pull request Jan 10, 2025

Add custom codecs for RDKit Molecules and Biotite AtomArrays #243

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support a new group structure for columns #241

Support a new group structure for columns #241

cwognum commented Jan 9, 2025

jstlaurent left a comment

cwognum commented Jan 10, 2025

Andrewq11 commented Jan 10, 2025

jstlaurent commented Jan 10, 2025

cwognum commented Jan 10, 2025

Support a new group structure for columns #241

Support a new group structure for columns #241

Conversation

cwognum commented Jan 9, 2025

Changelogs

jstlaurent left a comment

Choose a reason for hiding this comment

cwognum commented Jan 10, 2025

Andrewq11 commented Jan 10, 2025

jstlaurent commented Jan 10, 2025

cwognum commented Jan 10, 2025