How to save Awkward Arrays? (Parquet in specific) #329

ChristianMichelsen · 2020-07-12T13:58:55Z

Hi!

I have a short question: what's the preferred method of saving Awkward 1.0 arrays? Is it to first convert them to Awkward 0 arrays (ak.to_awkward0) and then save them with awkward0.save(filename, array0, mode="w"), or is there another way?

Cheers!

jpivarski (Maintainer) · 2020-07-12T16:03:27Z

The currently implemented ways to read and write Awkward 1 arrays are:

- Convert to Awkward 0 and use Awkward 0's persistence ("save/load", "hdf5", "to/fromarrow", "to/fromparquet").
- Use Awkward 1's ak.to_arrow and ak.from_arrow.
- Read from ROOT files with Uproot 4.

Since Awkward 1 can read and write Arrow, reading and writing Parquet is low-hanging fruit (discussed in #303). I (or someone else) would only need to port the Awkward 0 ↔ Arrow ↔ Parquet chain. All the pieces are in place to do that, especially PartitionedArrays and VirtualArrays (because we usually want to view a Parquet file lazily), so the differential "work to do" over "benefit" (dW/dB) is small and it ought to be a high-priority item.

Less low-hanging is the idea of making Awkward Array a Zarr v3 protocol extension (discussed in zarr-developers/zarr-specs#62). That's a thread I'd like to follow, but it will take some time.

I'm not 100% sure it's a good idea to reimplement Awkward 0's custom format, the .awkd files (which were really just ZIP files with the flat-array components as binary blobs within the file) and the layer on HDF5 (which was the same thing, but it used HDF5 as a place to store flat arrays). The motivation for .awkd files was so that there would be at least one format that could save and retrieve Awkward arrays with 100% fidelity, but Arrow's custom application metadata and/or extension types could make it possible to express the extra information so that an Awkward array can losslessly round-trip to that format. Also, 1:1 parity between the object in memory and on disk is not always desirable: scikit-hep/awkward-0.x#246 is an example where a user was expecting an array to be smaller on disk after filtering out events, but it wasn't because slicing an Awkward array doesn't recursively compact all of its contents, and a 1:1 view of that uncompacted data on disk isn't any smaller. What that user really wanted was only the accessible elements to be saved, so he did want the file-writing to be lossy, in a sense.
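
To make the compaction point concrete, here is a toy model in plain Python. It is not Awkward's actual implementation, and the names `JaggedView`, `select`, and `compact` are invented for illustration. A jagged array is represented by `starts`/`stops` into a shared `content` buffer; selecting lists only shrinks the starts and stops, so a 1:1 dump of the result is as large as the original until it is explicitly compacted.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JaggedView:
    """Toy jagged array: list i is content[starts[i]:stops[i]]."""
    starts: List[int]
    stops: List[int]
    content: List[float]

    def to_list(self):
        return [self.content[a:b] for a, b in zip(self.starts, self.stops)]

    def select(self, keep):
        """Keep only the lists whose flag is True; content is shared, not copied."""
        starts = [a for a, k in zip(self.starts, keep) if k]
        stops = [b for b, k in zip(self.stops, keep) if k]
        return JaggedView(starts, stops, self.content)

    def compact(self):
        """Copy only the reachable elements into a fresh content buffer."""
        starts, stops, content = [], [], []
        for a, b in zip(self.starts, self.stops):
            starts.append(len(content))
            content.extend(self.content[a:b])
            stops.append(len(content))
        return JaggedView(starts, stops, content)


events = JaggedView([0, 3, 3], [3, 3, 5], [1.1, 2.2, 3.3, 4.4, 5.5])
filtered = events.select([False, True, True])   # drop the first event
compacted = filtered.compact()

print(filtered.to_list())        # same logical data before and after compacting
print(len(filtered.content))     # the full original buffer is still referenced
print(len(compacted.content))    # only the reachable elements remain
```

Writing `filtered` as-is would persist all five content values, while writing `compacted` persists only the two that are still reachable.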

If all of the logical data (everything that meaningfully affects the interpretation of the array) can be saved in Arrow and therefore Parquet files, then we have no business inventing a new file format and maybe the .awkd files should be retired. As for the HDF5 version, a user receiving those files without knowing that they're supposed to view them through the Awkward HDF5 layer might get very confused about what they're supposed to mean. Unlike ZIP, HDF5 is a high-level format that people expect to be able to understand without extra software.

ChristianMichelsen (Author) · 2020-07-15T14:22:20Z

Hi again and thanks for your answer!

I'm converting the awkward1 arrays to awkward0 for now and saving them as hdf5-files, since this seems to be the easiest for me and others who need to be able to also read the code.

I'm looking forward to being able to save and load directly to parquet files from awkward1 at some point.

One last question; is it on purpose that awkward1 arrays are immutable (compared to awkward0 arrays)?

jpivarski (Maintainer) · 2020-07-15T14:50:09Z

> One last question; is it on purpose that awkward1 arrays are immutable (compared to awkward0 arrays)?

It's on purpose. It allows for references and values to be treated equally, which means that outputs of operations can share most of their internal data ("structural sharing").

Awkward 0 arrays were immutable apart from their properties—starts, stops, content, mask, index, etc.—as a user convenience. Now the layout nodes, which have these properties, are hidden inside ak.Array as an "experts only" feature (you have to say array.layout to access them), so the user convenience of setting "content" after a node has been created is not as important. (Experts can create nodes with the constructor.) This also makes it impossible to build cyclic references, which is good because cyclic references are not supported by C++ std::shared_ptr and operations are not, in general, recursion-ready and raise segfaults when the stack overflows. Preventing this case as a side-effect of syntax is good for preventing errors downstream. (Cyclic references will have to be built a different way: #178.)

Note that ak.Array is mutable in one particular way: you can assign fields to records. This replaces the immutable layout within the ak.Array, but the ak.Array itself is changed in-place to have the new field. (See #273.)
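
A minimal sketch of that idea in plain Python, with invented names (`RecordLayout` and a toy `Array`) that only mimic the behaviour described above and are not Awkward's classes: the layout is immutable, adding a field builds a new layout that reuses the existing columns, and the mutable wrapper simply swaps which layout it points to.

```python
from types import MappingProxyType


class RecordLayout:
    """Immutable toy layout: a read-only mapping of field name -> column."""

    def __init__(self, fields):
        self._fields = MappingProxyType(dict(fields))

    @property
    def fields(self):
        return self._fields

    def with_field(self, name, column):
        """Return a NEW layout; existing columns are shared, not copied."""
        new_fields = dict(self._fields)
        new_fields[name] = column
        return RecordLayout(new_fields)


class Array:
    """Mutable toy wrapper around an immutable layout."""

    def __init__(self, layout):
        self.layout = layout

    def __setitem__(self, name, column):
        # The layout is replaced, never modified; the wrapper changes in place.
        self.layout = self.layout.with_field(name, column)

    def __getitem__(self, name):
        return self.layout.fields[name]


x = (1, 2, 3)
array = Array(RecordLayout({"x": x}))
old_layout = array.layout

array["y"] = (10, 20, 30)

print(sorted(array.layout.fields))           # the wrapper now has both fields
print(sorted(old_layout.fields))             # the old layout is untouched
print(array["x"] is old_layout.fields["x"])  # the "x" column is shared
```

Anything that still holds `old_layout` keeps seeing a consistent, unchanged object, which is what makes sharing the columns safe.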

ChristianMichelsen (Author) · 2020-07-16T09:03:16Z

Hm, so it wouldn't be possible to implement e.g. a counter as a jagged awkward array?

To be more precise, I'll briefly explain my use case. I need to iterate over the particles in each list, where all lists are combined into a single jagged Awkward array: [[particle1, particle2, particle3], [particle4, particle5], ...]. I need to be able to mark some of the particles as invalid, which I used to do by setting the value to "-1" (an unphysical value for this variable) and then iterating only over the "good" particles. I guess a masked array could also work; however, I have no experience with the combination of awkward1, numba, and masked arrays. Do you have any tips for this sort of use case?

Thanks a lot!

jpivarski (Maintainer) · 2020-07-16T13:31:06Z

That's right: if you mean a counter that both increases integers in nested lists and appends to those nested lists, the ListArray and ListOffsetArray in the Awkward library are not good data structures: appending to the first list would move (i.e. reallocate and copy) all the subsequent lists. There's not a fundamental problem with increasing a fixed-length set of integers (no appending), but that could be a flat array that later gets wrapped as jagged.
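
As a sketch of the fixed-length case using only NumPy (the `offsets` values and the helper name `flat_index` are made up for this example): keep the counts in a flat, mutable array, address them through the offsets, and only split them into nested lists when a jagged view is needed.

```python
import numpy as np

# Boundaries of the nested lists: list i covers counts[offsets[i]:offsets[i + 1]].
offsets = np.array([0, 3, 3, 5])
counts = np.zeros(offsets[-1], dtype=np.int64)   # flat and freely mutable


def flat_index(i, j):
    """Position in the flat array of element j of list i (hypothetical helper)."""
    length = offsets[i + 1] - offsets[i]
    if not 0 <= j < length:
        raise IndexError(f"list {i} has length {length}, got position {j}")
    return offsets[i] + j


counts[flat_index(0, 2)] += 1
counts[flat_index(2, 1)] += 1
counts[flat_index(2, 1)] += 1

# Wrap as jagged only at the end; the list lengths never changed.
jagged = [part.tolist() for part in np.split(counts, offsets[1:-1])]
print(jagged)
```

Because no list ever grows, nothing has to be reallocated or moved; only the final wrapping step knows about the nesting.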

Data structures with good mutation properties, such as the counter you mentioned, hash maps, B-tree indexes for databases, etc., are a large, open set that extends beyond the data structures covered by Awkward Array. Awkward's data structures cover a more focused problem: viewing arbitrarily shaped input data. Since the goal is to view data that have already been made, immutability is a natural choice.

It is possible, however, for a momentary state of a dynamic data structure to be viewed as an Awkward Array. The "snapshots" of ak.ArrayBuilder are an example of that: the ArrayBuilder is a dynamic data structure in that it grows by allocating blocks, filling them until they reach their limit, then replacing them with larger blocks (i.e. the std::vector algorithm); then when you call snapshot(), those blocks are viewed within an immutable ak.Array (while the ArrayBuilder might continue growing).
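
The growth strategy and the snapshot behaviour can be sketched with a toy buffer in NumPy. This is not ak.ArrayBuilder's implementation; `GrowableBuffer`, its initial capacity, and the doubling factor are assumptions chosen for illustration.

```python
import numpy as np


class GrowableBuffer:
    """Toy append-only buffer that grows like std::vector."""

    def __init__(self, capacity=4):
        self._data = np.empty(capacity, dtype=np.int64)
        self._length = 0

    def append(self, value):
        if self._length == len(self._data):
            # Block is full: allocate a larger one and copy the filled part.
            bigger = np.empty(2 * len(self._data), dtype=np.int64)
            bigger[: self._length] = self._data
            self._data = bigger
        self._data[self._length] = value
        self._length += 1

    def snapshot(self):
        """Read-only view of the filled part; no data are copied."""
        view = self._data[: self._length]
        view.flags.writeable = False  # affects this view only, not the buffer
        return view


builder = GrowableBuffer()
for value in (1, 2, 3):
    builder.append(value)

first = builder.snapshot()
builder.append(4)            # written beyond the range that `first` can see
builder.append(5)            # triggers reallocation; `first` keeps the old block
second = builder.snapshot()

print(first.tolist())
print(second.tolist())
```

The earlier snapshot stays valid because the buffer only ever writes past the end of what that snapshot views, or into a freshly allocated block.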

Other dynamic data structures could be viewed that way, and the snapshot() might be a lightweight view or a heavyweight copy, depending on the data structure. A counter with nested lists whose lengths do not change would be easy to wrap:

>>> import numpy as np
>>> import awkward1 as ak   # package name of Awkward 1 at the time of this thread
>>> class Counter:
...     def __init__(self, offsets, content):
...         self.offsets, self.content = offsets, content
...     def __getitem__(self, i):
...         # a NumPy slice is a view, so in-place updates reach self.content
...         return self.content[self.offsets[i]:self.offsets[i + 1]]
...     def snapshot(self):
...         # wrap the current buffers in layout nodes and a high-level ak.Array
...         offsets = ak.layout.Index64(self.offsets)
...         content = ak.layout.NumpyArray(self.content)
...         listarray = ak.layout.ListOffsetArray64(offsets, content)
...         return ak.Array(listarray)
...
>>> counter = Counter(np.array([0, 3, 3, 5, 10]), np.zeros(10, int))
>>> counter.snapshot()
<Array [[0, 0, 0], [], ... 0], [0, 0, 0, 0, 0]] type='4 * var * int64'>
>>> counter[0][2] += 1
>>> counter.snapshot()
<Array [[0, 0, 1], [], ... 0], [0, 0, 0, 0, 0]] type='4 * var * int64'>
>>> counter[3][1] += 1
>>> counter[3][1] += 1
>>> counter[3][1] += 1
>>> counter.snapshot()
<Array [[0, 0, 1], [], ... 0], [0, 3, 0, 0, 0]] type='4 * var * int64'>

(Actually, in this implementation, the originally snapshotted array would see all updates because it views the data that are changing in place. ArrayBuilder escapes that by only making changes that are beyond the view of previously snapshotted arrays.)

If we want to append to inner lists, we'd have to have blocks that can be replaced, so the Awkward snapshot would need a layer of indirection, which can be implemented by ak.IndexedArray.

So, things can be done, but the conceptual constraint is that Awkward Arrays are views of data, not general-purpose data structures.

jpivarski (Maintainer) · 2020-07-16T13:34:33Z

Actually, this "counter" that you speak of sounds a lot like a histogram with a sparse axis. Perhaps this is something that's happening in boost-histogram or hist? (@HDembinski, @henryiii)

jpivarski (Maintainer) · 2020-07-17T13:08:51Z

(This was closed yesterday because I implemented Parquet reading and writing. That might help with your original problem.)
