Merge pull request #14 from cedadev/dev
Version 1.1.1 Bug Fixes and Reconfigurations in preparation for V1.2
Showing 29 changed files with 1,656 additions and 1,374 deletions.
@@ -1,108 +1,12 @@
# kerchunk-builder
A repository for building a Kerchunk infrastructure using existing tools, and a set of showcase notebooks to use on example data in this repository.

Now a repository under the cedadev group!

Example Notebooks:
https://mybinder.org/v2/gh/cedadev/kerchunk-builder.git/main?filepath=showcase/notebooks

The Kerchunk Pipeline (soon to be renamed) is a data aggregation pipeline for creating Kerchunk files to represent various datasets in different original formats.
Currently the pipeline supports writing JSON/Parquet Kerchunk files for input NetCDF/HDF files. Further developments will allow GeoTIFF, GRIB and possibly Met Office (.pp) files to be represented, as well as using the Pangeo [Rechunker](https://rechunker.readthedocs.io/en/latest/) tool to create Zarr stores for Kerchunk-incompatible datasets.
# Pipeline Phases
[Example Notebooks at this link](https://mybinder.org/v2/gh/cedadev/kerchunk-builder.git/main?filepath=showcase/notebooks)

All pipeline phases are now run using the master scripts `single_run.py` or `group_run.py`.
[Documentation hosted at this link](https://cedadev.github.io/kerchunk-builder/)
## 0. Activating Environment Settings

`source build_venv/bin/activate`

Activates the Python virtual environment.
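If the virtual environment does not exist yet, it can be created first. A minimal sketch, assuming the dependencies are listed in a `requirements.txt` at the repository root (that file name is an assumption, not confirmed here):

```bash
# Create and activate the virtual environment used by the pipeline.
python -m venv build_venv
source build_venv/bin/activate

# Install dependencies; 'requirements.txt' is assumed to exist at the repo root.
pip install -r requirements.txt
```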
`. templates/<config>.sh`

Sets all environment variables, provided a shell script with the correct name is already present. The environment variables to set are:
- WORKDIR: (Required) - Central workspace for saving data
- GROUPDIR: (Required for parallel) - Workspace for a specific group
- SRCDIR: (Required for parallel) - Kerchunk pipeline repo path ending in `/kerchunk-builder`
- KVENV: (Required for parallel) - Path to the virtual environment

All of the above can be passed as flags to each script, or set as environment variables before processing; a minimal example config is sketched below.
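A minimal sketch of such a `templates/<config>.sh` file; the variable names come from the list above, but every path shown is a placeholder to be replaced with your own locations:

```bash
#!/bin/bash
# Hypothetical example config - adjust all paths to your own workspace.

export WORKDIR=/path/to/central/workspace    # (Required) central workspace for saving data
export GROUPDIR=$WORKDIR/groups/my_group     # (Required for parallel) workspace for a specific group
export SRCDIR=/path/to/kerchunk-builder      # (Required for parallel) pipeline repo path
export KVENV=/path/to/build_venv             # (Required for parallel) path to the virtual environment
```

It can then be sourced with `. templates/<config>.sh` before running any of the pipeline scripts.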
## 1. Running the Pipeline - Examples

### 1.1 Single running of an isolated dataset
`python single_run.py scan a11x34 -vfbd`

The above runs the scan process for project code `a11x34` with verbose level 1 (`-v`), forced running (`-f`, which overwrites existing files), error bypassing (`-b`) and dry-running (`-d`). Note that running with both `-f` and `-d` means that sections will not be skipped if files already exist, but no new files will be generated.
### 1.2 Single running for a dataset within a group
`python single_run.py scan 0 -vfbd -G CMIP6_exampleset_1 -r scan_2`

The above has the same features as before, except that project ID `0` is used in place of a project code, with a group ID (`-G`) and a repeat ID (`-r`) supplied to identify the correct project code within the group. This is an example of what each parallel job will execute, so using this format is solely for test purposes.
### 1.3 Group running of multiple datasets
`python group_run.py scan CMIP6_exampleset_1 -vfbd -r scan_2`

The above is the fully parallelised job execution command, which activates all jobs with the `single_run.py` script as detailed in section 1.2. This command creates an sbatch file and a separate command to start the parallel jobs, covering all datasets within the `scan_2` subgroup of the `CMIP6_exampleset_1` group. Subgroups are created with the `identify_reruns.py` script.
### 1.4 Full Worked Example
Using the example documents in this repository, we can run an example group containing just two datasets. Any number of datasets would follow the same method; two is used here simply because it is the smallest group size, to minimise duplication.

#### 1.4.1 Init
The first step is to initialise the group from the example CSV given. Here we give the group the identifier `UKCP_test1` as the second argument after the phase `init` which we are performing. With `-i` we supply an input CSV (this can also be a text file in some cases where the project code can be generated). Finally, `-v` means we see general information as the program is running.
`python group_run.py init UKCP_test1 -i examples/UKCP_test1.csv -v`
#### 1.4.2 Scan
Scanning gives an indication of how long each file will take to produce, along with some other characteristics which come into play in later phases.
`python group_run.py scan UKCP_test1 -v`

If running in `dryrun` mode, this will generate an sbatch submission command like:
`sbatch --array=0-2 /gws/nopw/j04/cmip6_prep_vol1/kerchunk-pipeline/groups/UKCP_test1/sbatch/scan.sbatch`

which can be copied into the terminal and executed. Otherwise the jobs will be submitted automatically.
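Once the array has been submitted, progress can be checked with standard Slurm commands; a brief sketch (the job ID below is illustrative):

```bash
# List your queued and running jobs, including the scan array tasks.
squeue -u $USER

# Inspect the state of a finished array (replace 12345678 with the real job ID).
sacct -j 12345678 --format=JobID,JobName,State,Elapsed
```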
#### 1.4.3 Compute
`python group_run.py compute UKCP_test1 -vv`

This command differs only in the level of verbosity; with two `v`s we see debug information as well as general information. Again, this will produce an sbatch command to be copied into the terminal if in dryrun mode.

#### 1.4.4 Validate
`python group_run.py validate UKCP_test1 -vv`

This final step will submit all datasets for validation, which includes copying the final output file to the `/complete` directory within the workdir set as an environment variable.
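Once the validation jobs have finished, a quick way to confirm the outputs arrived is to list the completed products; a small sketch assuming `WORKDIR` is set as described in section 0:

```bash
# Completed outputs are copied here by the validate phase.
ls -lh $WORKDIR/complete/
```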
## 2. Pipeline Phases in detail

### 2.1 Init
Initialise and configure the pipeline for running over any number of datasets in parallel.
If using the pipeline with a group of datasets, an input file is required (`-i` option), which must be one of:
- A text file containing the wildcard paths describing all files within each dataset (CMIP6)
- A properly formatted CSV with fields for each entry corresponding to the headers below (see the sketch after this list):
  - Project code: Unique identifier for this dataset, commonly taken from naming conventions in the path
  - Pattern/Filename
  - Updates: Boolean 1/0 for whether an updates file is present
  - Removals: Boolean 1/0 for whether a removals file is present
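A hypothetical input CSV for a two-dataset group, written here with a heredoc; the project codes and paths are invented for illustration, and the exact column layout (including whether a header row is expected) should be checked against the files in `examples/`:

```bash
# Write an illustrative input CSV (values are made up, not real datasets).
cat > examples/my_test_group.csv << 'EOF'
my_project_code_1,/path/to/dataset1/*.nc,0,0
my_project_code_2,/path/to/dataset2/*.nc,0,0
EOF

# Initialise a new group from it (mirrors the init command in section 1.4.1).
python group_run.py init my_test_group -i examples/my_test_group.csv -v
```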
### 2.2 Scan
Run the kerchunk-scan tool (or similar) to build a test kerchunk file and determine parameters:
- chunks per NetCDF file (Nc)
- average chunk size (Tc)
- total expected kerchunk size (Tk)
### 2.3 Compute
Create a Parquet store for a specified dataset, using a method that depends on the total expected kerchunk size (Tk).

#### 2.3.1 Create Kerchunk JSON dataset

#### 2.3.2 Large Chunkset Tk value - Parallel (Batch) processing
The batch route chains several scripts (a hypothetical invocation is sketched after this list):
- Batch process to create parts - `batch_process/process_wrapper.py`
- Combine parts using the copier script - `combine_refs.py`
- Correct metadata (shape, parameters) - `correct_meta.py`
- Run the time correction script if necessary - `correct_time.py`
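A hypothetical invocation of the batch route; the script names come from the list above, but the argument shown is an assumption rather than the scripts' real command-line interface:

```bash
# 1. Batch-process the dataset into partial reference sets (argument is illustrative).
python batch_process/process_wrapper.py my_project_code

# 2. Combine the partial references into a single store.
python combine_refs.py my_project_code

# 3. Correct metadata (shape, parameters) in the combined store.
python correct_meta.py my_project_code

# 4. Apply the time-correction step if the dataset needs it.
python correct_time.py my_project_code
```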
#### 2.3.3 Small Chunkset Tk value - Serial processing
Run the create-parquet script - `create_parq.py`
Not currently supported.

### 2.4 Validate
Run a series of tests on parquet store usage (a manual spot check is sketched after this list):
- Ensure a small plot succeeds with no errors
- Ensure a large plot (dask gateway) succeeds with no errors or killed jobs
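For a manual spot check of a finished Kerchunk JSON file, one option is to open it through `fsspec`'s reference filesystem and plot a small slice; a sketch assuming local NetCDF sources and an invented output file name:

```bash
python << 'EOF'
# Open a Kerchunk reference file (the path is illustrative) and inspect it.
import fsspec
import xarray as xr

fs = fsspec.filesystem(
    "reference",
    fo="/path/to/complete/my_project_code.json",  # hypothetical output location
    remote_protocol="file",                       # assumes local NetCDF sources
)
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)
print(ds)

# A small selection keeps the check cheap; the variable name is a placeholder.
# ds["tas"].isel(time=0).plot()
EOF
```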
![Kerchunk Pipeline](docs/source/_images/pipeline.png)