Support a new group structure for columns #241
I trust you and we need this, so I'll approve it.
I feel uncomfortable with the implicit complexity we're baking into the Zarr structure. There's increasingly tight coupling between the validation performed here and code that lives elsewhere in the client. That coupling is non-obvious (and undocumented). These all need to be kept in sync, or else users are going to start getting weird (for them) errors that they are ill-equipped to fix.
I fear we're also going to be ill-equipped to evolve this to other embedded formats without losing control.
@jstlaurent I share your discomfort. A Zarr archive can get arbitrarily complex. We need some way to constrain this structure for use in a dataset, but I'm not convinced I've found the right place to do so. Perhaps something like a For now, there are two options:
I see where you're both coming from regarding the implicit complexity, and I agree that without clearer documentation and a more concrete design decision around this, it will be troublesome at some point. I like Cas's idea related to a Given the performance gains we're seeing from this and the upcoming competitions on Monday, I think we should merge it. We can then soon after formalize and centralize this logic in some class.
Yeah... It's not clear to me (or perhaps to all of us, since we haven't dug into it) where we draw the line between turning an embedded format into a Zarr-native structure (groups and arrays) and embedding it as is. For example, images in phenomics datasets get encoded as binary by using the codecs functionality in Zarr. Here, we try to turn PDBs into Zarr groups and arrays. But there's a world where, once we've parsed the PDB into a series of dicts (which is what the Edit: Cas and I discussed this live, and he's going to try something with codecs to see if it might work. We have this solution if it does not.
The solution @jstlaurent proposed of using a custom |
Changelogs
Checklist:
Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
feature, fix, chore, documentation or test (or ask a maintainer to do it for you).

A popular file format in drug discovery to represent protein structures is the PDB (Protein Data Bank) file. The 3D structure of the protein can be thought of as a list of properties for each of the protein's atoms (e.g. the atom's x, y and z coordinates). For each of these properties, you can save all values in a single array. This is the internal representation of the popular Biotite library, see e.g. AtomArray. In Zarr, a group of arrays thus makes for a natural representation of this data.
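To make the "one array per property" idea concrete, here is a minimal sketch in plain Python. The `protein` dict, its property names and its values are made up for illustration; they are not Biotite's or Zarr's actual API, only the same layout in miniature.

```python
# Hypothetical, simplified stand-in for a parsed PDB file: one list per atom
# property, where index i of every list describes atom i.
protein = {
    "element": ["N", "C", "C", "O"],
    "x": [11.10, 12.56, 13.05, 12.41],
    "y": [7.20, 7.31, 8.75, 9.79],
    "z": [3.40, 3.52, 3.61, 3.44],
}


def get_atom(arrays, i):
    """Rebuild the record for atom i by reading index i of every property array."""
    return {name: values[i] for name, values in arrays.items()}


# All property arrays must have the same length (the number of atoms).
n_atoms = len(protein["x"])
assert all(len(values) == n_atoms for values in protein.values())

atom = get_atom(protein, 2)
```

In a Zarr group, each key of this dict would correspond to one array.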
There are 14 properties and we need at least 3 files per property (i.e. the chunk, .zarray, .zattrs). With some extra metadata files on the group level, that makes for 44 files per protein. All of these files are small. For cloud-native Zarr archives, this makes the Zarr-based solution a lot less performant than simply saving a single PDB file.
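The file count can be checked with a few lines of arithmetic. The 14 properties, 3 files per property and the total of 44 come from the description; the value of 2 for group-level files is an assumption inferred from 44 - 14 * 3, and the function names are hypothetical.

```python
N_PROPERTIES = 14        # properties per protein, from the description
FILES_PER_PROPERTY = 3   # one chunk plus two metadata files per array
GROUP_LEVEL_FILES = 2    # assumption: inferred from 44 - 14 * 3


def files_per_protein():
    """Number of small files needed to store one protein as its own group."""
    return N_PROPERTIES * FILES_PER_PROPERTY + GROUP_LEVEL_FILES


def files_for_dataset(n_proteins):
    """Total number of files when every protein gets its own group."""
    return n_proteins * files_per_protein()
```

At this rate a dataset of 1,000 proteins needs 44,000 small files, versus 1,000 files if each protein were saved as a single PDB file.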
The primary alternative that comes to mind is to restructure the Zarr archive such that the arrays are larger.
The main idea is to concatenate the per-group arrays into one larger array. Since the per-group arrays are variable in size (i.e. the number of atoms per protein differs), we would use Ragged Arrays. This PR adds support for a new structure that allows us to build a datapoint by indexing the nth element in each array.
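As a rough illustration of the concatenation idea, the sketch below flattens variable-length per-protein lists into one list plus an offsets list, then builds the nth datapoint by slicing every property at the same position. This is one common way to model ragged data in plain Python, not necessarily how Zarr stores ragged arrays internally, and all names (`concatenate_ragged`, `build_datapoint`, the sample values) are hypothetical.

```python
from itertools import accumulate


def concatenate_ragged(per_protein):
    """Flatten variable-length lists into one list plus cumulative offsets."""
    flat = [value for values in per_protein for value in values]
    offsets = [0, *accumulate(len(values) for values in per_protein)]
    return flat, offsets


def build_datapoint(columns, offsets, n):
    """Build datapoint n by taking the nth slice of every property array."""
    start, stop = offsets[n], offsets[n + 1]
    return {name: flat[start:stop] for name, flat in columns.items()}


# Three proteins with 3, 1 and 2 atoms respectively (made-up values).
x_per_protein = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
y_per_protein = [[0.1, 0.2, 0.3], [0.4], [0.5, 0.6]]

x_flat, offsets = concatenate_ragged(x_per_protein)
y_flat, _ = concatenate_ragged(y_per_protein)

second_protein = build_datapoint({"x": x_flat, "y": y_flat}, offsets, 1)
```

With this layout the number of arrays stays fixed at one per property, no matter how many proteins the dataset contains.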
We now officially support three Zarr structures to represent the data in a column:
__index__ array specifies the ordering of the subgroups.
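A minimal sketch of how an __index__ array could order subgroups, using a plain dict in place of a Zarr group. It assumes that __index__ holds the subgroup names in datapoint order; the subgroup names and the `get_datapoint` helper are hypothetical.

```python
# Hypothetical column: a group whose subgroups each hold one datapoint.
# Assumption: "__index__" lists the subgroup names in datapoint order.
column = {
    "__index__": ["protein_b", "protein_a"],
    "protein_a": {"x": [1.0, 2.0]},
    "protein_b": {"x": [3.0]},
}


def get_datapoint(column, n):
    """Look up the nth datapoint through the explicit ordering in __index__."""
    name = column["__index__"][n]
    return column[name]


first = get_datapoint(column, 0)
```

The explicit index makes the ordering independent of how the subgroups happen to be named or listed in the store.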