Export Zarr to S3 #475

Open · wants to merge 23 commits into main
Conversation

elehmer (Contributor) commented Dec 16, 2024

Description of PR

This PR adds an option to the export function to export data in Zarr format to S3. The Dataset or DataArray is sent directly to S3, with no local save option. It can then be opened in Python with xarray from the S3 scratch bucket and saved locally using the xr.to_netcdf() function.
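
For reference, opening the exported store from S3 and saving it locally would look roughly like the sketch below; the bucket and key names are illustrative placeholders, not the actual scratch-bucket path.

```python
# Minimal sketch of the local-save workflow, assuming s3fs and xarray are
# installed and AWS credentials are available in the environment.
# "my-scratch-bucket/my_export.zarr" is a placeholder, not the real key.
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root="my-scratch-bucket/my_export.zarr", s3=fs)

ds = xr.open_zarr(store)        # lazily open the exported Zarr store
ds.to_netcdf("my_export.nc")    # write a local netCDF copy
```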

Summary of changes and related issue

Added a new _export_to_zarr function to the data_export module; it is called by the export method.
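
A rough sketch of what such a helper could look like is below; the bucket lookup, environment variable, and exact signature are assumptions for illustration, not the actual data_export implementation.

```python
# Hedged sketch of an S3 Zarr export helper; names and the scratch-bucket
# lookup are illustrative assumptions, not the code in this PR.
import os

import s3fs
import xarray as xr

def _export_to_zarr(data, filename):
    """Write an xarray Dataset or DataArray to a Zarr store in the S3 scratch bucket."""
    bucket = os.environ.get("SCRATCH_BUCKET", "my-scratch-bucket")  # hypothetical variable
    fs = s3fs.S3FileSystem(anon=False)
    store = s3fs.S3Map(root=f"{bucket}/{filename}.zarr", s3=fs)
    if isinstance(data, xr.DataArray):
        data = data.to_dataset()   # to_zarr writes Datasets
    data.to_zarr(store)            # default mode will not overwrite an existing store
```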

Relevant motivation and context

This added functionality will allow users to export variables larger than 4 GB to S3 and then save them to their local machines. Previously we exported to S3 as uncompressed netCDF3, which limits any single variable to 4 GB.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Definition of Done Checklist

Practical

  • 80% unit test coverage
  • Documentation
    • All functions/adjusted functions documented in the readthedocs.
    • Documentation is pushed
  • Complex code commented
  • Naming conventions followed
    • Helper functions hidden with _ before the name
  • Context of function is clearly provided
    • Intent of function is provided
    • How to test, so that it is not siloed on scientists and anyone can review
    • Appropriate manual testing was completed
  • Any notebooks known to utilize the affected functions are still working
  • Linting completed and resolved

Conceptual

  • Doesn't replicate existing functionality
  • Aligns with general coding standard of existing functions
  • Matches desired functionality from users/scientists

vicford (Contributor) commented Dec 16, 2024

Testing in the data_acquisition_for_hydro notebook, exporting a test DataArray with ck.export(mean_airtemp_data, filename_export, "Zarr") is giving me the following error:
ContainsGroupError: path '' contains a group

elehmer (Contributor Author) commented Dec 16, 2024

Hmm, yeah, this notebook has some issues now when trying to use Zarr. I was able to use bulk_run by just swapping in Zarr, but the merged exports are failing.

Which cell did you run into this issue in?

vicford (Contributor) commented Dec 16, 2024

I tested in the Step 1b option 2 cell, where I replaced 'netCDF' with 'Zarr'. I tried it both on the xr.merge with multiple variables and on a single variable.

elehmer (Contributor Author) commented Dec 17, 2024

OK, I am seeing this error now. I think it was introduced when main was merged? I will try and figure this out.

elehmer (Contributor Author) commented Dec 18, 2024

It looks like this happens when the filename already exists on S3. I need to add better error handling for that. Can you confirm?

elehmer (Contributor Author) commented Dec 18, 2024

OK, I have added a check for whether the key already exists in the bucket, and a deletion step before writing again. This will hopefully fix the issue. Let me know how it works for you.
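
For context, a check-and-delete step like that could look roughly like this, using s3fs; the path and sample data are placeholders, not the exact code in this PR.

```python
# Hedged sketch of deleting a stale Zarr store before re-writing it.
# `zarr_path` is a placeholder key and `data` is a stand-in Dataset.
import s3fs
import xarray as xr

data = xr.Dataset({"tas": ("time", [1.0, 2.0, 3.0])})   # stand-in for the exported data

fs = s3fs.S3FileSystem(anon=False)
zarr_path = "my-scratch-bucket/my_export.zarr"           # placeholder key

if fs.exists(zarr_path):
    fs.rm(zarr_path, recursive=True)                     # remove the existing store first

store = s3fs.S3Map(root=zarr_path, s3=fs)
data.to_zarr(store)                                      # write the fresh store
```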

vicford (Contributor) left a comment

Nice @elehmer! Testing in the data_acquisition_for_hydro notebook:

  • Using the bulk_run function with the zarr option worked perfectly and took less than 2 minutes (whoop!), and I appreciate the directions on how to export locally -- our users will really appreciate that
  • Using "regular" ck.export with zarr as an option, I get this error:
ValueError: Zarr requires uniform chunk sizes except for final chunk. Variable named 'Air Temperature at 2m' has incompatible dask chunks: ((1,), (3267, 1, 3388, 3, 3386, 4, 3385, 5, 3034, 11, 318), (50, 50, 11, 39, 50, 22, 28, 50, 33, 7), (40, 40, 9, 31, 40, 18, 22, 40, 27, 3)). Consider rechunking using `chunk()`.

Which is okay; if one option works, I can remove the second option from the notebook and share it with the hydro folks using this notebook.

elehmer (Contributor Author) commented Dec 19, 2024

I also run into this error if you comment out the loading into memory in bulk_run. @bkg would you know why this is happening?
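
In case it helps, the rechunking the error message suggests would look roughly like this; a sketch only, assuming the DataArray is still dask-backed and that ck.export accepts the rechunked object as before.

```python
# Possible workaround implied by the ValueError: let dask pick uniform chunks
# before exporting. Not a confirmed fix for the notebook.
# mean_airtemp_data, ck, and filename_export come from the notebook above.
rechunked = mean_airtemp_data.chunk("auto")       # uniform chunk sizes per dimension
ck.export(rechunked, filename_export, "Zarr")
```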
