Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support a new group structure for columns #241

Closed
wants to merge 1 commit into from

Conversation

cwognum
Copy link
Collaborator

@cwognum cwognum commented Jan 9, 2025

Changelogs

  • Add support for a new group of arrays structure.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature into a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

A popular file format in drug discovery to represent protein structures is the PDB (Protein Data Bank) file. The 3D structure of the protein can be thought of as a list of properties for each of the protein's atoms (e.g. the atom's x, y and z coordinate). For each of these properties, you can save all values in a single array. This is the internal representation of the popular Biotite library, see e.g. AtomArray. In Zarr, a group of arrays thus makes for a natural representation for this data.

There are 14 properties and we need at least 3 files per property (i.e. the chunk, .zattrs, .zmetadata). With some extra metadata files on the group level, that makes for 44 files per protein. All of these files are small. For cloud-native Zarr archives, this makes the Zarr based solution a lot less performant than simply saving a single PDB file.

The primary alternative that comes to mind is to restructure the Zarr archive such that the arrays are larger.

The main idea is to concatenate the per-group arrays in one larger array. Since the per-group arrays are variable in size (i.e. the number of atoms per protein differs), we would use Ragged Arrays. This PR adds support for a new structure that allows us to built a datapoint from indexing the nth element in each array.

We now officially support three Zarr structures to represent the data in a column:

  • An array: Each element in the array corresponds to a data point in the dataset.
  • A group of groups: Each subgroup corresponds to a data point in the dataset.
    • The special __index__ array specifies the ordering of the subgroups.
  • A group of arrays (new): The nth datapoint is constructed from indexing the nth element in each array.

@cwognum cwognum added the feature Annotates any PR that adds new features; Used in the release process label Jan 9, 2025
@cwognum cwognum self-assigned this Jan 9, 2025
Copy link
Contributor

@jstlaurent jstlaurent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust you and we need this, so I'll approve it.

I feel uncomfortable with the implicit complexity we're baking into the Zarr structure. There's increasingly tight coupling between the validation performed here and code that lives elsewhere in the client. That coupling is non-obvious (and undocumented). That all need to be kept in sync, or else users are going to start getting weird (for them) errors that they are ill equipped to fix.

I fear we're also going to be ill equipped to evolve this to other embedded formats without losing control.

@cwognum
Copy link
Collaborator Author

cwognum commented Jan 10, 2025

@jstlaurent I share your discomfort.

A Zarr archive can get arbitrarily complex. We need some way to constrain this structure for use in a dataset, but I'm not convinced I've found the right place to do so. Perhaps something like a Column class that groups related logic for data access? We could have a ArrayColumn and a GroupWithSubgroupsColumn, for example.

For now, there are two options:

  • We don't need this feature per se and can choose to postpone merging it.
  • The issue seems to be with the structure of the code, more so than the structure of the Zarr archive. If so, that means we could merge it already and refactor the code later.

@Andrewq11
Copy link
Contributor

I see where you're both coming from regarding the implicit complexity, and I agree that without clearer documentation and a more concrete design decision around this, it will be troublesome at some point. I like Cas's idea related to a Column class which can be a place to clearly define/centralize/document the logic around all of this.

Given the performance gains we're seeing from this and the upcoming competitions on Monday, I think we should merge it. We can then soon after formalize and centralize this logic in some class.

@jstlaurent
Copy link
Contributor

Yeah... It's not clear to me (or perhaps to all of us, since we haven't dug into it) on where we draw the line between turning an embedded format into a Zarr-native structure (groups and arrays), or embedding it as is.

For example, images in phenomics datasets get encoded as binary by using the codecs functionality in Zarr. Here, we try to turn PDBs into Zarr groups and arrays. But there's a world in where, once we're parsed the PDB to a series of dicts (which is what the PDBConverter class does), we encode them as is, through numcodecs.MsgPack or something similar, instead of making them into Zarr-native formats.

Edit: Cas and I discussed this live, and he's going to try something with codecs to see if it might work. We have this solution if it does not.

@cwognum
Copy link
Collaborator Author

cwognum commented Jan 10, 2025

The solution @jstlaurent proposed of using a custom object_codec works and it feels a lot more idiomatic. I'll hence close this PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Annotates any PR that adds new features; Used in the release process
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants