Export Zarr to S3 #475

Open · wants to merge 23 commits into main
Conversation

elehmer (Contributor) commented Dec 16, 2024

Description of PR

This PR adds an option to the export function to export data in Zarr format to S3. The Dataset or DataArray is sent directly to S3, with no local save option. It can then be opened in Python with xarray from the S3 scratch bucket and saved locally using the xr.to_netcdf() function.
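
For reference, opening the exported store from S3 and saving it locally would look roughly like the sketch below; the bucket and key names are illustrative placeholders, not the actual scratch-bucket path.

```python
# Minimal sketch of the local-save workflow, assuming s3fs and xarray are
# installed and AWS credentials are available in the environment.
# "my-scratch-bucket/my_export.zarr" is a placeholder, not the real key.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root="my-scratch-bucket/my_export.zarr", s3=fs)

ds = xr.open_zarr(store)        # lazily open the exported Zarr store
ds.to_netcdf("my_export.nc")    # write a local netCDF copy
```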

Summary of changes and related issue

Added a new _export_to_zarr function to the data_export module; it is called by the export method.
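
A rough sketch of what such a helper could look like is below; the bucket lookup, environment variable, and exact signature are assumptions for illustration, not the actual data_export implementation.

```python
# Hedged sketch of an S3 Zarr export helper; names and the scratch-bucket
# lookup are illustrative assumptions, not the code in this PR.
import os

import s3fs
import xarray as xr

def _export_to_zarr(data, filename):
    """Write an xarray Dataset or DataArray to a Zarr store in the S3 scratch bucket."""
    bucket = os.environ.get("SCRATCH_BUCKET", "my-scratch-bucket")  # hypothetical variable
    fs = s3fs.S3FileSystem(anon=False)
    store = s3fs.S3Map(root=f"{bucket}/{filename}.zarr", s3=fs)
    if isinstance(data, xr.DataArray):
        data = data.to_dataset()   # to_zarr writes Datasets
    data.to_zarr(store)            # default mode will not overwrite an existing store
```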

Relevant motivation and context

This added functionality will allow users to export variables larger than 4 GB to S3 and then save them to their local machines. Previously we exported to S3 as uncompressed netCDF3, which limits any single variable to 4 GB.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Definition of Done Checklist

Practical

  • 80% unit test coverage
  • Documentation
    • All functions/adjusted functions documented in the readthedocs.
    • Documentation is pushed
  • Complex code commented
  • Naming conventions followed
    • Helper functions hidden with _ before the name
  • Context of function is clearly provided
    • Intent of function is provided
    • How to test, so that it is not siloed on scientists and anyone can review
    • Appropriate manual testing was completed
  • Any notebooks known to utilize the affected functions are still working
  • Linting completed and resolved

Conceptual

  • Doesn't replicate existing functionality
  • Aligns with general coding standard of existing functions
  • Matches desired functionality from users/scientists

vicford (Contributor) commented Dec 16, 2024

Testing in the data_acquisition_for_hydro notebook, exporting a test DataArray with ck.export(mean_airtemp_data, filename_export, "Zarr") is giving me the following error:
ContainsGroupError: path '' contains a group

elehmer (Contributor Author) commented Dec 16, 2024

Hmm, yeah, this notebook has some issues now when trying to use Zarr. I was able to use bulk_run by just swapping in Zarr, but the merged exports are failing.

Which cell did you run into this issue in?

vicford (Contributor) commented Dec 16, 2024

I tested in the Step 1b option 2 cell, where I replaced 'netCDF' with 'Zarr'. I tried it both on the xr.merge with multiple variables and on a single variable.

elehmer (Contributor Author) commented Dec 17, 2024

OK, I am seeing this error now. I think it was introduced when main was merged? I will try and figure this out.

elehmer (Contributor Author) commented Dec 18, 2024

It looks like this happens when the filename already exists on S3. I need to add better error handling for that. Can you confirm?

elehmer (Contributor Author) commented Dec 18, 2024

OK, I have added a check for whether the key already exists in the bucket, and a deletion step before writing again. This will hopefully fix the issue. Let me know how it works for you.
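
For context, a check-and-delete step like that could look roughly like this, using s3fs; the path and sample data are placeholders, not the exact code in this PR.

```python
# Hedged sketch of deleting a stale Zarr store before re-writing it.
# `zarr_path` is a placeholder key and `data` is a stand-in Dataset.
import s3fs
import xarray as xr

data = xr.Dataset({"tas": ("time", [1.0, 2.0, 3.0])})   # stand-in for the exported data

fs = s3fs.S3FileSystem(anon=False)
zarr_path = "my-scratch-bucket/my_export.zarr"           # placeholder key

if fs.exists(zarr_path):
    fs.rm(zarr_path, recursive=True)                     # remove the existing store first

store = s3fs.S3Map(root=zarr_path, s3=fs)
data.to_zarr(store)                                      # write the fresh store
```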

vicford (Contributor) left a comment

Nice @elehmer! Testing in the data_acquisition_for_hydro notebook:

  • Using the bulk_run function with the zarr option worked perfectly and took less than 2 minutes (whoop!), and I appreciate the directions on how to export locally -- our users will really appreciate that
  • Using "regular" ck.export with zarr as an option, I get this error:
ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'Air Temperature at 2m' has incompatible dask chunks: ((1,), (3267, 1, 3388, 3, 3386, 4, 3385, 5, 3034, 11, 318), (50, 50, 11, 39, 50, 22, 28, 50, 33, 7), (40, 40, 9, 31, 40, 18, 22, 40, 27, 3)). Consider rechunking using `chunk()`.

Which is okay; if one option works, I can remove the second option from the notebook and share it with the hydro folks using this notebook.

elehmer (Contributor Author) commented Dec 19, 2024

I also run into this error if you comment out the loading into memory in bulk_run. @bkg would you know why this is happening?
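
In case it helps, the rechunking the error message suggests would look roughly like this; a sketch only, assuming the DataArray is still dask-backed and that ck.export accepts the rechunked object as before.

```python
# Possible workaround implied by the ValueError: let dask pick uniform chunks
# before exporting. Not a confirmed fix for the notebook.
# mean_airtemp_data, ck, and filename_export come from the notebook above.
rechunked = mean_airtemp_data.chunk("auto")       # uniform chunk sizes per dimension
ck.export(rechunked, filename_export, "Zarr")
```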
