Export Zarr to S3 #475
Conversation
Testing in the
Hmm, yeah, this notebook has some issues now when trying to use Zarr. I was able to use the
Which cell did you run into this issue in?
I tested in the Step 1b option 2 cell, where I replaced 'netCDF' with 'Zarr'. I tried both the xr.merge with multiple variables and a single variable.
Ok, I am seeing this error now. I think it was introduced when main was merged? I will try to figure this out.
It looks like this is caused when the filename already exists on S3? I need to add better error handling for that. Can you confirm this?
Ok, I have added a check for the existence of the key in the bucket, and deletion before writing again. This should fix the issue. Let me know how it works for you.
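The check-and-delete step described above can be sketched as a small helper, assuming an s3fs-style filesystem object (`exists` and `rm` are real s3fs methods; the helper name and bucket paths are made up for illustration):

```python
def ensure_fresh_key(fs, key):
    """If `key` already exists in the bucket, delete it recursively
    (a Zarr store is a directory of many small objects) so that a
    fresh write does not clash with stale chunks.
    Returns True if an old copy was removed."""
    if fs.exists(key):
        fs.rm(key, recursive=True)
        return True
    return False

# Typical use with s3fs (bucket and key names are assumptions):
# import s3fs
# fs = s3fs.S3FileSystem()
# ensure_fresh_key(fs, "scratch-bucket/my_export.zarr")
# ds.to_zarr(fs.get_mapper("scratch-bucket/my_export.zarr"))
```

Deleting recursively matters here because overwriting a Zarr store in place can leave behind chunk objects from the previous write.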
Nice @elehmer! Testing in the data_acquisition_for hydro notebook:
- Using the `bulk_run` function with the zarr option worked perfectly and took less than 2 minutes (whoop!), and I appreciate the directions on how to export locally -- our users will really appreciate that
- Using "regular" `ck.export` with zarr as an option, I get this error:

ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'Air Temperature at 2m' has incompatible dask chunks: ((1,), (3267, 1, 3388, 3, 3386, 4, 3385, 5, 3034, 11, 318), (50, 50, 11, 39, 50, 22, 28, 50, 33, 7), (40, 40, 9, 31, 40, 18, 22, 40, 27, 3)). Consider rechunking using `chunk()`.

Which is okay -- if one option works, I can remove the second option from the notebook and share it with the hydro folks using this notebook.
I also ran into this error if in
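For context on the ValueError: Zarr requires that, along each dimension, every dask chunk be the same size except possibly the final one, which may be smaller. A minimal, self-contained sketch of that rule (the helper name is made up for illustration; the dimension name in the comment is an assumption):

```python
def zarr_compatible(chunks):
    """Check Zarr's chunking rule for one dimension: all chunks must
    be equal, except the final chunk, which may be smaller."""
    if len(chunks) <= 1:
        return True
    first = chunks[0]
    return all(c == first for c in chunks[1:-1]) and chunks[-1] <= first

# One dimension from the error message above: clearly non-uniform
print(zarr_compatible((50, 50, 11, 39, 50, 22, 28, 50, 33, 7)))  # False
# After something like ds.chunk({"lat": 50}), chunks become uniform
print(zarr_compatible((50, 50, 50, 50, 50, 50, 50, 40)))  # True
```

Calling `ds.chunk({...})` with explicit sizes before writing with `to_zarr` normalizes the chunks and avoids this error, as the message itself suggests.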
Description of PR
This PR adds the option to export Zarr format to S3 in the `export` function. The `DataSet` or `DataArray` is sent directly to S3, with no local-save option. It can then be opened in Python using xarray from the S3 scratch bucket and saved locally using the `xr.to_netcdf()` function.
Summary of changes and related issue
Added a new `_export_to_zarr` function to the `data_export` module, run by the `export` method.
Relevant motivation and context
This added functionality will allow users to export variables larger than 4GB to S3 and then save them to their local machines. Previously we exported to S3 using uncompressed netCDF3, which has a 4GB limit on any variable size.
Type of change
Definition of Done Checklist
Practical
- `_` before the name
Conceptual