How to save Awkward Arrays? (Parquet in specific) #329
Replies: 7 comments
-
The currently implemented ways to read and write Awkward 1 arrays are:
Since Awkward 1 can read and write Arrow, reading and writing Parquet is low-hanging fruit (discussed in #303). I (or someone else) would only need to port the Awkward 0 ↔ Arrow ↔ Parquet chain. All the pieces are in place to do that, especially PartitionedArrays and VirtualArrays (because we usually want to view a Parquet file lazily), so the differential "work to do" over "benefit" (dW/dB) is small and it ought to be a high-priority item.

Less low-hanging is the idea of making Awkward Array a Zarr v3 protocol extension (discussed in zarr-developers/zarr-specs#62). That's a thread I'd like to follow, but it will take some time.

I'm not 100% sure it's a good idea to reimplement Awkward 0's custom format, the If all of the logical data (everything that meaningfully affects the interpretation of the array) can be saved in Arrow and therefore Parquet files, then we have no business inventing a new file format and maybe the
-
Hi again and thanks for your answer! I'm converting the awkward1 arrays to awkward0 for now and saving them as HDF5 files, since this seems to be the easiest for me and for others who also need to be able to read the code. I'm looking forward to being able to save and load directly to Parquet files from awkward1 at some point. One last question: is it on purpose that awkward1 arrays are immutable (compared to awkward0 arrays)?
-
It's on purpose. It allows for references and values to be treated equally, which means that outputs of operations can share most of their internal data ("structural sharing"). Awkward 0 arrays were immutable apart from their properties (starts, stops, content, mask, index, etc.), which were mutable as a user convenience. Now the layout nodes, which have these properties, are hidden inside Note that
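To illustrate what "structural sharing" means here, the following is a minimal sketch in plain Python, not Awkward's actual implementation. The class `JaggedView` and its method `slice_lists` are hypothetical names invented for this example. The point is only that when the flat content can never change, a derived array can reuse it instead of copying it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class JaggedView:
    """Hypothetical immutable jagged array: offsets into a shared flat buffer."""

    offsets: Tuple[int, ...]    # length is (number of lists + 1)
    content: Tuple[float, ...]  # flat data, never modified after creation

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Non-negative list index only, to keep the sketch short.
        return self.content[self.offsets[i]:self.offsets[i + 1]]

    def slice_lists(self, start, stop):
        # Only the small offsets tuple is rebuilt; the content object is
        # reused as-is, which is safe because nothing can mutate it.
        return JaggedView(self.offsets[start:stop + 1], self.content)


full = JaggedView((0, 3, 3, 5), (1.1, 2.2, 3.3, 4.4, 5.5))
tail = full.slice_lists(1, 3)   # the last two lists of `full`

print(tail.content is full.content)  # the flat buffer is shared, not copied
print(tail[1])
```

If `content` were mutable, every derived view would either have to copy it or risk seeing changes made through another view.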
-
Hm, so it wouldn't be possible to implement e.g. a counter as a jagged awkward array? To be more precise, I'll shortly explain my use case. I need to be able to iterate over particles in each list (where all lists are combined as a jagged awkward array): [[particle1, particle2, particle3], [particle4, particle5], ...]. I need to be able to set some of the particles to be invalid, which I used to do by just setting the value to "-1" (an unphysical value for this variable) and then only iterating over the "good" particles. I guess a masked array could also work; however, I do not have any experience with the combination of awkward1, numba, and masked arrays. Do you have any tips for this sort of use case? Thanks a lot!
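To make the mask idea concrete, here is a minimal sketch in plain Python. It uses neither awkward1 nor Numba, and the names `valid`, `invalidate`, and `good_particles` are made up for illustration. It assumes the jagged structure is described by a list of offsets into a flat list of values, and it keeps validity in a separate flat list of flags instead of overwriting values with -1.

```python
# Jagged layout: offsets delimit each inner list inside the flat values.
offsets = [0, 3, 5]                      # two lists: values[0:3] and values[3:5]
values = [10.0, 20.0, 30.0, 40.0, 50.0]  # one entry per particle
valid = [True] * len(values)             # hypothetical mask, one flag per particle


def invalidate(list_index, item_index):
    """Mark one particle as invalid without touching its value."""
    valid[offsets[list_index] + item_index] = False


def good_particles(list_index):
    """Return the values of the still-valid particles in one inner list."""
    start, stop = offsets[list_index], offsets[list_index + 1]
    return [values[k] for k in range(start, stop) if valid[k]]


invalidate(0, 1)  # drop the second particle of the first list
for i in range(len(offsets) - 1):
    print(i, good_particles(i))
```

The values themselves stay untouched, so the same data can be viewed with a different mask later.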
-
That's right: if you mean a counter that both increases integers in nested lists and appends to those nested lists, the ListArray and ListOffsetArray in the Awkward library are not good data structures: appending to the first list would move (i.e. reallocate and copy) all the subsequent lists. There's no fundamental problem with increasing a fixed-length set of integers (no appending), but that could be a flat array that later gets wrapped as jagged.

Data structures with good mutation properties, such as the counter you mentioned, hash-maps, B-tree indexes for databases, etc., are a large, open set that extends beyond the data structures covered by Awkward Array. Awkward's data structures cover a more focused problem: viewing arbitrarily shaped input data. Since the goal is to view data that have already been made, immutability is a natural choice.

It is possible, however, for a momentary state of a dynamic data structure to be viewed as an Awkward Array. The "snapshots" of ak.ArrayBuilder are an example of that: the ArrayBuilder is a dynamic data structure in that it grows by allocating blocks, filling them until they reach their limit, then replacing them with larger blocks (i.e. the Other dynamic data structures could be viewed that way, and the

>>> import numpy as np
>>> import awkward1 as ak
>>> class Counter:
...     def __init__(self, offsets, content):
...         self.offsets, self.content = offsets, content
...     def __getitem__(self, i):
...         # A NumPy slice is a view, so updates through it change self.content.
...         return self.content[self.offsets[i]:self.offsets[i + 1]]
...     def snapshot(self):
...         # Wrap the current buffers as an Awkward Array without copying.
...         offsets = ak.layout.Index64(self.offsets)
...         content = ak.layout.NumpyArray(self.content)
...         listarray = ak.layout.ListOffsetArray64(offsets, content)
...         return ak.Array(listarray)
...
>>> counter = Counter(np.array([0, 3, 3, 5, 10]), np.zeros(10, int))
>>> counter.snapshot()
<Array [[0, 0, 0], [], ... 0], [0, 0, 0, 0, 0]] type='4 * var * int64'>
>>> counter[0][2] += 1
>>> counter.snapshot()
<Array [[0, 0, 1], [], ... 0], [0, 0, 0, 0, 0]] type='4 * var * int64'>
>>> counter[3][1] += 1
>>> counter[3][1] += 1
>>> counter[3][1] += 1
>>> counter.snapshot()
<Array [[0, 0, 1], [], ... 0], [0, 3, 0, 0, 0]] type='4 * var * int64'>

(Actually, in this implementation, the originally snapshotted array would see all updates because it views the data that are changing in place. ArrayBuilder escapes that by only making changes that are beyond the view of previously snapshotted arrays.)

If we want to append to inner lists, we'd have to have blocks that can be replaced, so the Awkward snapshot would need a layer of indirection, which can be implemented by ak.IndexedArray.

So, things can be done, but the conceptual constraint is that Awkward Arrays are views of data, not general-purpose data structures.
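The idea of replaceable blocks behind a layer of indirection can be sketched in plain Python. This is only an illustration under stated assumptions, not how Awkward or ak.IndexedArray is implemented: the class `GrowableJagged` and its attributes are hypothetical, and the per-list `starts` and `stops` positions play the role of the indirection. When one inner list outgrows its block, only that list is copied to a larger block at the end of a shared pool, and the other lists stay where they are.

```python
class GrowableJagged:
    """Hypothetical sketch: each inner list lives in its own block of a shared pool."""

    def __init__(self, num_lists, initial_capacity=2):
        self.pool = []  # flat storage shared by all blocks
        self.starts, self.stops, self.capacity = [], [], []
        for _ in range(num_lists):
            self.starts.append(len(self.pool))
            self.stops.append(len(self.pool))
            self.capacity.append(initial_capacity)
            self.pool.extend([None] * initial_capacity)

    def append(self, i, value):
        if self.stops[i] - self.starts[i] == self.capacity[i]:
            # Block is full: copy only this list into a block twice as large
            # at the end of the pool and repoint its start/stop positions.
            old = self.pool[self.starts[i]:self.stops[i]]
            self.capacity[i] *= 2
            self.starts[i] = len(self.pool)
            self.stops[i] = self.starts[i] + len(old)
            self.pool.extend(old + [None] * (self.capacity[i] - len(old)))
        self.pool[self.stops[i]] = value
        self.stops[i] += 1

    def snapshot(self):
        # Materialize the current state as nested Python lists.
        return [self.pool[a:b] for a, b in zip(self.starts, self.stops)]


jagged = GrowableJagged(3)
for v in (1, 2, 3):
    jagged.append(0, v)   # the third append forces list 0 into a new block
jagged.append(2, 9)
print(jagged.snapshot())
print(jagged.starts)      # only list 0 has moved
```

The abandoned block is simply left unused in the pool, which trades some memory for not having to move the other lists.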
-
Actually, this "counter" that you speak of sounds a lot like a histogram with a sparse axis. Perhaps this is something that's happening in boost-histogram or hist? (@HDembinski, @henryiii)
-
(The reason this closed yesterday is that I implemented Parquet reading and writing. That might help with your original problem.)
-
Hi!
I have a short question: what's the preferred method of saving awkward 1.0 arrays? Is it by first converting them to awkward0 arrays (ak.to_awkward0) and then saving them with awkward0.save(filename, array0, mode="w"), or how?
Cheers!