Doc improvements
asinghvi17 committed Oct 16, 2024
1 parent d7557d3 commit ac820ec
Showing 4 changed files with 24 additions and 4 deletions.
1 change: 1 addition & 0 deletions docs/make.jl
@@ -110,4 +110,5 @@ makedocs(;
deploydocs(;
repo="github.com/JuliaIO/Kerchunk.jl",
devbranch="main",
push_preview=true,
)
8 changes: 8 additions & 0 deletions docs/src/api.md
@@ -7,6 +7,14 @@ ReferenceStore

## Correction interface

Kerchunk files often need corrections to the metadata.

For example, the CF-convention `add_offset` and `scale_factor` metadata fields are stored as separate variables in the source data, but should ideally be stored as a single Zarr `FixedScaleOffset` filter so you can get performance as close to native as possible. Some CF datasets also encode an `_Unsigned` metadata field, which should simply be used to edit the `dtype` of the Zarr array.
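
As a rough illustration of what these two corrections amount to numerically, the sketch below decodes CF-packed values by hand and reinterprets signed storage as unsigned. It uses only Base Julia; the sample values and variable names are made up for illustration, and this is not how Kerchunk.jl itself applies the correction.

```julia
# CF packing convention: decoded = raw * scale_factor + add_offset
raw = Int16[0, 100, 200]      # hypothetical packed values as stored on disk
scale_factor = 0.01           # hypothetical CF attribute values
add_offset = 273.15
decoded = raw .* scale_factor .+ add_offset

# `_Unsigned`: the stored bytes are unchanged, only their interpretation is.
# This is why editing the array's `dtype` is enough.
signed = Int8[-1, 0, 1]
unsigned = collect(reinterpret(UInt8, signed))
```

Folding the scale and offset into a single filter means the decode happens once per chunk inside the Zarr read path, rather than as a separate pass over the loaded array.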

Kerchunk also sometimes places the compressor as the last filter, which is technically compliant with Zarr v3 but is not compliant with Zarr v2. This is corrected by moving the compressor to the `compressor` field of the metadata, but this has to be done before the Zarr array is loaded.
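
A minimal sketch of that kind of metadata fix, operating on a plain `Dict` that stands in for parsed Zarr v2 `.zarray` metadata. The function name `move_trailing_compressor!` and the `COMPRESSOR_IDS` list are hypothetical and exist only for this example; they are not part of Kerchunk.jl's API.

```julia
# Assumption: codec ids that should be treated as compressors rather than filters.
const COMPRESSOR_IDS = ("zlib", "blosc", "zstd")

# Hypothetical helper: if no compressor is set and the last filter is a
# compressor codec, move it into the `compressor` field.
function move_trailing_compressor!(meta::Dict{String,Any})
    filters = meta["filters"]
    if meta["compressor"] === nothing && filters !== nothing && !isempty(filters) &&
       last(filters)["id"] in COMPRESSOR_IDS
        meta["compressor"] = pop!(filters)
        # Zarr v2 represents "no filters" as null rather than an empty list.
        isempty(filters) && (meta["filters"] = nothing)
    end
    return meta
end

# Example metadata with the compressor misplaced at the end of the filter list.
meta = Dict{String,Any}(
    "compressor" => nothing,
    "filters" => Any[
        Dict{String,Any}("id" => "shuffle", "elementsize" => 2),
        Dict{String,Any}("id" => "zlib", "level" => 4),
    ],
)
move_trailing_compressor!(meta)
```

After the call, `meta["compressor"]` holds the `zlib` entry and only the `shuffle` filter remains in `meta["filters"]`.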

This is the point of the correction interface. As more idiosyncrasies are discovered, they can be added to it.

```@docs
do_correction!
add_scale_offset_filter_and_set_mask!
2 changes: 1 addition & 1 deletion docs/src/index.md
@@ -45,7 +45,7 @@ This effort was funded by the NASA MEaSUREs program in contribution to the Inter
## Alternatives and related packages

- You can always use Python's `xarray` directly via PythonCall.jl
- [FSSpec.jl](https://github.com/asinghvi17/FSSpec.jl) is an alternative storage backends for Zarr.jl that wraps the same [`fsspec`](https://github.com/fsspec/filesystem_spec) that `xarray` uses under the hood.
- [FSSpec.jl](https://github.com/asinghvi17/FSSpec.jl) is an alternative storage backend for Zarr.jl that wraps the same [`fsspec`](https://github.com/fsspec/filesystem_spec) that `xarray` uses under the hood.

This package is of course built on top of [Zarr.jl](https://github.com/JuliaIO/Zarr.jl), which is a pure-Julia Zarr array library.
[YAXArrays.jl](https://github.com/JuliaDataCubes/YAXArrays.jl) is a Julia package that can wrap Zarr arrays in a DimensionalData-compatible interface.
17 changes: 14 additions & 3 deletions docs/src/what_the_heck.md
@@ -1,10 +1,21 @@
# What is Kerchunk?

Kerchunk is a powerful tool designed to optimize access to large scientific datasets, particularly those stored in cloud-based object stores. It addresses the challenges of working with numerous small files or large, chunked files by creating a unified, efficient interface for data access.

At its core, Kerchunk works by creating a "fake" file system that maps to a Zarr store. The file system describes a mapping from Zarr chunks to byte ranges of the source files.

This approach allows Kerchunk to effectively wrap one or many data files into a single Zarr array, providing a consolidated view of the data. By doing so, it enables faster data access, reduces the number of API calls needed to retrieve information (by essentially front-loading the process), and greatly simplifies the process of working with multi-file datasets.


## Available data sources

The unit of Kerchunking is the _catalog_. Each catalog is either a single JSON file or a directory of Parquet files. The catalog is essentially a dictionary that maps Zarr chunk keys to byte ranges within the source files.

Catalogs are "sidecar" files, and may not always be present with the original data. Generally, at least for now, if there's no obvious Kerchunk file you would have to generate one yourself.

## Tips and tricks

#### Where's my CRS?
### Where's my CRS?

That's an interesting question. Over the short term, Julia doesn't have support for CF-style (climate-and-forecast conventions) CRS metadata. Additionally, the CRS in e.g. NetCDF files is stored as an empty variable, which Kerchunk removes.

@@ -28,7 +39,7 @@ new_crs = Rasters.EPSG(epsg_code)
new_crs = Rasters.ProjString(proj4_string)
```

#### S3 redirect errors
### S3 redirect errors

Many S3 buckets are restricted to only allow access from certain regions. If you get an error like this:
```
@@ -46,6 +57,6 @@ import AWS
AWS.global_aws_config(AWS.AWSConfig(; region="us-west-2"))
```

#### Version mismatches
### Version mismatches

Python and Julia load different versions of libraries, which can cause incompatibilities. For example, both NCDatasets.jl and Python's netcdf4 library depend on libhdf5, but the versions they try to load are incompatible.
