Improve partition #504

Merged
merged 43 commits into from
Jan 15, 2025
Changes from 37 commits
Commits
cfd4218
add calendar
juliettelavoie Jan 19, 2024
2dede98
improve cal
juliettelavoie Jan 25, 2024
48ed051
remove chunks
juliettelavoie Jan 26, 2024
16ed840
Merge remote-tracking branch 'origin/main' into improve-partition
juliettelavoie Jan 29, 2024
34739f6
drop_vars
juliettelavoie Jan 30, 2024
c2c648e
add subcat possibility to avoid merge
juliettelavoie Feb 21, 2024
4519048
real
juliettelavoie Feb 21, 2024
123d42d
Merge remote-tracking branch 'origin/main' into improve-partition
juliettelavoie Feb 27, 2024
13fa197
add datacatalog option
juliettelavoie Feb 27, 2024
d4ffe92
add real to part_dim
juliettelavoie Mar 6, 2024
d0a097d
remove moving_rearly_window
juliettelavoie May 9, 2024
3378757
fix A-DEC
juliettelavoie May 13, 2024
5ed7105
Merge branch 'main' into improve-partition
juliettelavoie May 14, 2024
95f50f9
subdivise
juliettelavoie May 23, 2024
650803e
common_attrs
juliettelavoie May 23, 2024
b8d66d0
add ref and method to cat
juliettelavoie May 23, 2024
2a41984
index
juliettelavoie May 23, 2024
f667026
real in from cat
juliettelavoie May 24, 2024
8414c10
to level
juliettelavoie Jun 7, 2024
e27ebb7
to level type
juliettelavoie Jun 7, 2024
d9d0535
adjustment instead
juliettelavoie Jun 10, 2024
79dd1b0
subcat
juliettelavoie Jun 10, 2024
721dd0f
merge
juliettelavoie Jan 6, 2025
0f53e79
cleanup
juliettelavoie Jan 6, 2025
7894515
add tests
juliettelavoie Jan 7, 2025
cc699fc
Merge branch 'main' into improve-partition
juliettelavoie Jan 7, 2025
61eb239
pr num
juliettelavoie Jan 7, 2025
e283333
Merge remote-tracking branch 'origin/improve-partition' into improve-…
juliettelavoie Jan 7, 2025
9ecc8c9
pin xarray
juliettelavoie Jan 7, 2025
1f5baa5
pin xarray
juliettelavoie Jan 7, 2025
10cd31b
pin xarray
juliettelavoie Jan 7, 2025
7fe8abd
fix doc
juliettelavoie Jan 7, 2025
c4aaf08
changelog
juliettelavoie Jan 7, 2025
98aa4ba
update xclim v
juliettelavoie Jan 7, 2025
0edc0aa
update xclim v
juliettelavoie Jan 7, 2025
fa884c0
fix docs
juliettelavoie Jan 8, 2025
32350ff
fix test
juliettelavoie Jan 8, 2025
4ebc4c3
remove test
juliettelavoie Jan 8, 2025
afd6437
try again
juliettelavoie Jan 8, 2025
e3c8bf5
Apply suggestions from code review
juliettelavoie Jan 15, 2025
ea1a99a
remove print
juliettelavoie Jan 15, 2025
0ca11fd
pin zarr
juliettelavoie Jan 15, 2025
d9ab991
fix name
juliettelavoie Jan 15, 2025
4 changes: 2 additions & 2 deletions CHANGELOG.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,11 +4,11 @@ Changelog

v0.11.0 (unreleased)
--------------------
Contributors to this version: Gabriel Rondeau-Genesse (:user:`RondeauG`).
Contributors to this version: Gabriel Rondeau-Genesse (:user:`RondeauG`), Juliette Lavoie (:user:`juliettelavoie`).

New features and enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* N/A
* Improve ``xs.ensembles.build_partition_data``. (:pull:`504`).

Breaking changes
^^^^^^^^^^^^^^^^
Expand Down
16 changes: 12 additions & 4 deletions docs/notebooks/4_ensembles.ipynb
Expand Up @@ -169,19 +169,28 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"is_executing": true
},
"outputs": [],
"source": [
"# Get catalog\n",
"from pathlib import Path\n",
"\n",
"import xclim as xc\n",
"\n",
"output_folder = Path().absolute() / \"_data\"\n",
"cat = xs.DataCatalog(str(output_folder / \"tutorial-catalog.json\"))\n",
"\n",
"# create a dictionary of datasets wanted for the partition\n",
"input_dict = cat.search(variable=\"tas\", member=\"r1i1p1f1\").to_dataset_dict(\n",
" xarray_open_kwargs={\"engine\": \"h5netcdf\"}\n",
")"
")\n",
"datasets = {}\n",
"for k, v in input_dict.items():\n",
" ds = xc.atmos.tg_mean(v.tas).to_dataset()\n",
" ds.attrs = v.attrs\n",
" datasets[k] = ds"
]
},
{
Expand All @@ -204,9 +213,8 @@
"import xclim as xc\n",
"\n",
"ds = xs.ensembles.build_partition_data(\n",
" input_dict,\n",
" datasets,\n",
" subset_kw=dict(name=\"mtl\", method=\"gridpoint\", lat=[45.5], lon=[-73.6]),\n",
" indicators_kw={\"indicators\": [xc.atmos.tg_mean]},\n",
")\n",
"ds"
]
Expand Down
5 changes: 3 additions & 2 deletions environment-dev.yml
Expand Up @@ -30,8 +30,8 @@ dependencies:
- shapely >=2.0
- sparse
- toolz
- xarray >=2023.11.0, !=2024.6.0
- xclim >=0.53.2, <0.54
- xarray >=2023.11.0, !=2024.6.0, <2024.10.0 #FIXME: 2024.10.0 breaks rechunker with zarr, https://github.com/pangeo-data/rechunker/issues/154
- xclim >=0.54, <0.55
- xesmf >=0.7, <0.8.8 # FIXME: 0.8.8 currently creates segfaults on ReadTheDocs.
- zarr >=2.13
# Opt
Expand All @@ -56,6 +56,7 @@ dependencies:
- pandoc
- pooch
- pre-commit >=3.5.0
- pygments <2.19 #FIXME: temporary fix, https://github.com/felix-hilden/sphinx-codeautolink/issues/153
- pytest >=8.3.2
- pytest-cov >=5.0.0
- pytest-xdist >=3.2.0
Expand Down
4 changes: 2 additions & 2 deletions environment.yml
Expand Up @@ -30,8 +30,8 @@ dependencies:
- shapely >=2.0
- sparse
- toolz
- xarray >=2023.11.0, !=2024.6.0
- xclim >=0.53.2, <0.54
- xarray >=2023.11.0, !=2024.6.0, <2024.10.0 #FIXME: 2024.10.0 breaks rechunker with zarr
- xclim >=0.54, <0.55
- xesmf >=0.7, <0.8.8 # FIXME: 0.8.8 currently creates segfaults on ReadTheDocs.
- zarr >=2.13
# To install from source
Expand Down
7 changes: 4 additions & 3 deletions pyproject.toml
Expand Up @@ -65,8 +65,8 @@ dependencies = [
"shapely >=2.0",
"sparse",
"toolz",
"xarray >=2023.11.0, !=2024.6.0",
"xclim >=0.53.2, <0.54",
"xarray >=2023.11.0, !=2024.6.0, <2024.10.0", # FIXME: 2024.10.0 breaks rechunker with zarr
"xclim >=0.54, <0.55",
"zarr >=2.13"
]

Expand Down Expand Up @@ -109,7 +109,8 @@ docs = [
"sphinx-intl",
"sphinx-mdinclude",
"sphinx-rtd-theme >=1.0",
"sphinxcontrib-napoleon"
"sphinxcontrib-napoleon",
"pygments <2.19" # FIXME: temporary fix, https://github.com/felix-hilden/sphinx-codeautolink/issues/153
]
extra = [
"xesmf>=0.7, <0.8.8" # FIXME: 0.8.8 currently creates segfaults on ReadTheDocs.
Expand Down
Binary file modified src/xscen/data/fr/LC_MESSAGES/xscen.mo
Binary file not shown.
209 changes: 173 additions & 36 deletions src/xscen/ensembles.py
Expand Up @@ -12,6 +12,8 @@
import xarray as xr
from xclim import ensembles

from .catalog import DataCatalog
from .catutils import generate_id
from .config import parse_config
from .indicators import compute_indicators
from .regrid import regrid_dataset
Expand Down Expand Up @@ -629,13 +631,150 @@ def generate_weights( # noqa: C901
return weights


def _partition_from_list(datasets, partition_dim, subset_kw, regrid_kw):
list_ds = []
# only keep attrs common to all datasets
common_attrs = False
for ds in datasets:
if subset_kw:
ds = subset(ds, **subset_kw)
ds = ds.drop_vars(
["lat", "lon", "rlat", "rlon", "rotated_pole"], errors="ignore"
)

if regrid_kw:
ds = regrid_dataset(ds, **regrid_kw)

for dim in partition_dim:
if f"cat:{dim}" in ds.attrs:
ds = ds.expand_dims(**{dim: [ds.attrs[f"cat:{dim}"]]})

if "bias_adjust_project" in ds.dims:
ds = ds.assign_coords(
adjustment=(
"bias_adjust_project",
[ds.attrs.get("cat:adjustment", np.nan)],
)
)
ds = ds.assign_coords(
reference=(
"bias_adjust_project",
[ds.attrs.get("cat:reference", np.nan)],
)
)

if "realization" in partition_dim:
new_source = f"{ds.attrs['cat:institution']}_{ds.attrs['cat:source']}_{ds.attrs['cat:member']}"
ds = ds.expand_dims(realization=[new_source])

a = ds.attrs
a.pop("intake_esm_vars", None) # remove list for intersection to work
common_attrs = dict(common_attrs.items() & a.items()) if common_attrs else a
list_ds.append(ds)
ens = xr.merge(list_ds)
ens.attrs = common_attrs
return ens
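A minimal, self-contained sketch of the promotion step in `_partition_from_list`: each `cat:` attribute named in `partition_dim` becomes a singleton dimension, and merging the expanded datasets builds the partition cube. The toy data, sources, and attribute values below are illustrative, not from the PR.

```python
import numpy as np
import xarray as xr

# Two toy datasets with catalog-style attributes, mimicking what
# _partition_from_list receives (names here are illustrative).
datasets = []
for source, experiment in [("CanESM5", "ssp245"), ("MIROC6", "ssp245")]:
    ds = xr.Dataset(
        {"tg_mean": ("time", np.random.rand(3))},
        coords={"time": np.arange(3)},
        attrs={"cat:source": source, "cat:experiment": experiment},
    )
    datasets.append(ds)

# Promote each "cat:" attribute to a singleton dimension, then merge:
# the outer join aligns the datasets into one cube with a dimension
# per partition component.
partition_dim = ["source", "experiment"]
expanded = []
for ds in datasets:
    for dim in partition_dim:
        if f"cat:{dim}" in ds.attrs:
            ds = ds.expand_dims({dim: [ds.attrs[f"cat:{dim}"]]})
    expanded.append(ds)

ens = xr.merge(expanded)
print(sorted(ens.tg_mean.dims))  # → ['experiment', 'source', 'time']
```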


def _partition_from_catalog(
datasets, partition_dim, subset_kw, regrid_kw, to_dataset_kw
):

if ("adjustment" in partition_dim or "reference" in partition_dim) and (
"bias_adjust_project" in partition_dim
):
raise ValueError(
"The partition_dim can have either adjustment and reference or bias_adjust_project, not both."
)

if ("realization" in partition_dim) and ("source" in partition_dim):
raise ValueError(
"The partition_dim can have either realization or source, not both."
)

# special case to handle source (create one dimension with institution_source_member)
ensemble_on_list = None
if "realization" in partition_dim:
partition_dim.remove("realization")
ensemble_on_list = ["institution", "source", "member"]

subcat = datasets

# get attrs that are common to all datasets
common_attrs = {}
for col, series in subcat.df.items():
if (series[0] == series).all():
common_attrs[f"cat:{col}"] = series[0]

col_id = [
(
"adjustment" if "adjustment" in partition_dim else None
), # instead of bias_adjust_project; use "adjustment", not "method", because of .sel
(
"reference" if "reference" in partition_dim else None
), # instead of bias_adjust_project
"bias_adjust_project" if "bias_adjust_project" in partition_dim else None,
"mip_era",
"activity",
"driving_model",
"institution" if "realization" in partition_dim else None,
"source",
"experiment",
"member" if "realization" in partition_dim else None,
"domain",
]

subcat.df["id"] = generate_id(subcat.df, col_id)

# create a dataset for each bias_adjust_project, modify grid and concat them
# choose dim that exists in partition_dim and first in the order of preference
order_of_preference = ["reference", "bias_adjust_project", "source"]
dim_with_different_grid = list(set(partition_dim) & set(order_of_preference))[0]

list_ds = []
for d in subcat.df[dim_with_different_grid].unique():
ds = subcat.search(**{dim_with_different_grid: d}).to_dataset(
concat_on=partition_dim,
create_ensemble_on=ensemble_on_list,
**to_dataset_kw,
)

if subset_kw:
ds = subset(ds, **subset_kw)
ds = ds.drop_vars(
["lat", "lon", "rlat", "rlon", "rotated_pole"], errors="ignore"
)
if regrid_kw:
ds = regrid_dataset(ds, **regrid_kw)

# add coords adjustment and reference
if "bias_adjust_project" in ds.dims:
ds = ds.assign_coords(
adjustment=(
"bias_adjust_project",
[ds.attrs.get("cat:adjustment", np.nan)],
)
) # use "adjustment", not "method", because of .sel
ds = ds.assign_coords(
reference=(
"bias_adjust_project",
[ds.attrs.get("cat:reference", np.nan)],
)
)
list_ds.append(ds)
ens = xr.concat(list_ds, dim=dim_with_different_grid)
ens.attrs = common_attrs
return ens
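Both helpers keep only the attributes shared by every dataset. The intersection trick can be sketched on plain dictionaries (the attribute names below are illustrative); values must be hashable for the set intersection, which is why list-valued attrs such as `intake_esm_vars` are dropped first.

```python
# Keep only the (key, value) pairs common to all attribute dictionaries,
# mirroring the common_attrs logic in the partition helpers above.
attrs_per_ds = [
    {"cat:mip_era": "CMIP6", "cat:source": "CanESM5", "cat:domain": "global"},
    {"cat:mip_era": "CMIP6", "cat:source": "MIROC6", "cat:domain": "global"},
]

common = None
for a in attrs_per_ds:
    # dict.items() views support set-style intersection.
    common = dict(common.items() & a.items()) if common is not None else dict(a)

print(sorted(common))  # → ['cat:domain', 'cat:mip_era']
```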


def build_partition_data(
datasets: dict | list[xr.Dataset],
partition_dim: list[str] = ["source", "experiment", "bias_adjust_project"],
partition_dim: list[str] = ["realization", "experiment", "bias_adjust_project"],
subset_kw: dict | None = None,
regrid_kw: dict | None = None,
indicators_kw: dict | None = None,
rename_dict: dict | None = None,
to_dataset_kw: dict | None = None,
to_level: str = "partition-ensemble",
):
"""
Get the input for the xclim partition functions.
Expand All @@ -644,29 +783,36 @@ def build_partition_data(
`partition_dim` dimensions (and time) to pass to one of the xclim partition functions
(https://xclim.readthedocs.io/en/stable/api.html#uncertainty-partitioning).
If the inputs have different grids,
they have to be subsetted and regridded to a common grid/point.
Indicators can also be computed before combining the datasets.
they have to be subsetted and/or regridded to a common grid/point.

Parameters
----------
datasets : dict
List or dictionnary of Dataset objects that will be included in the ensemble.
datasets : list, dict, DataCatalog
List or dictionary of Datasets, or a DataCatalog, that will be included in the ensemble.
The datasets should include the necessary ("cat:") attributes to understand their metadata.
Tip: With a project catalog, you can do: `datasets = pcat.search(**search_dict).to_dataset_dict()`.
partition_dim : list[str]
Tip: A dictionary can be created with `datasets = pcat.search(**search_dict).to_dataset_dict()`.

The use of a DataCatalog is recommended for large ensembles.
In that case, the datasets are loaded separately for each `bias_adjust_project`,
so the subsetting or regridding can be applied before the datasets are combined through concatenation.
If `bias_adjust_project` is not in `partition_dim`, `source` will be used instead.
partition_dim: list[str]
Components of the partition. They will become the dimension of the output.
The default is ['realization', 'experiment', 'bias_adjust_project'].
For realization, the dimension will actually be institution_source_member.
subset_kw : dict, optional
Arguments to pass to `xs.spatial.subset()`.
regrid_kw : dict, optional
Arguments to pass to `xs.regrid_dataset()`.
indicators_kw : dict, optional
Arguments to pass to `xs.indicators.compute_indicators()`.
All indicators have to be for the same frequency, in order to be put on a single time axis.
Note that regridding is computationally expensive. For large datasets,
it might be worth doing the regridding first, outside of this function.
rename_dict : dict, optional
Dictionary to rename the dimensions from xscen names to xclim names.
If None, the default is {'source': 'model', 'bias_adjust_project': 'downscaling', 'experiment': 'scenario'}.
The default is {'source': 'model', 'bias_adjust_project': 'downscaling', 'experiment': 'scenario'}.
to_dataset_kw : dict, optional
Arguments to pass to `xscen.DataCatalog.to_dataset()` if datasets is a DataCatalog.
to_level: str
The processing level of the output dataset. Default is 'partition-ensemble'.

Returns
-------
Expand All @@ -682,41 +828,32 @@ def build_partition_data(
# initialize dict
subset_kw = subset_kw or {}
regrid_kw = regrid_kw or {}
to_dataset_kw = to_dataset_kw or {}

list_ds = []
for ds in datasets:
if subset_kw:
ds = subset(ds, **subset_kw)

if regrid_kw:
ds = regrid_dataset(ds, **regrid_kw)

if indicators_kw:
dict_ind = compute_indicators(ds, **indicators_kw)
if len(dict_ind) > 1:
raise ValueError(
f"The indicators computation should return only indicators of the same frequency.Returned frequencies: {dict_ind.keys()}"
)
else:
ds = list(dict_ind.values())[0]
if isinstance(datasets, list):
ens = _partition_from_list(datasets, partition_dim, subset_kw, regrid_kw)

for dim in partition_dim:
if f"cat:{dim}" in ds.attrs:
ds = ds.expand_dims(**{dim: [ds.attrs[f"cat:{dim}"]]})
elif isinstance(datasets, DataCatalog):
ens = _partition_from_catalog(
datasets, partition_dim, subset_kw, regrid_kw, to_dataset_kw
)

if "source" in partition_dim:
new_source = f"{ds.attrs['cat:institution']}_{ds.attrs['cat:source']}_{ds.attrs['cat:member']}"
ds = ds.assign_coords(source=[new_source])
list_ds.append(ds)
ens = xr.merge(list_ds)
else:
raise ValueError(
"datasets should be a list or a dictionary of xarray Datasets, or an xscen.DataCatalog"
)

rename_dict = rename_dict or {}
rename_dict.setdefault("realization", "model")
rename_dict.setdefault("source", "model")
rename_dict.setdefault("experiment", "scenario")
rename_dict.setdefault("bias_adjust_project", "downscaling")
rename_dict = {k: v for k, v in rename_dict.items() if k in ens.dims}
ens = ens.rename(rename_dict)

ens.attrs["cat:processing_level"] = to_level
ens.attrs["cat:id"] = generate_id(ens)[0]

return ens
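The final renaming step maps xscen dimension names onto the names expected by xclim's partition functions, filtered to the dimensions actually present. A sketch in isolation (toy dataset; the mapping follows the defaults documented above):

```python
import numpy as np
import xarray as xr

# Toy ensemble with xscen-style dimensions (values are placeholders).
ens = xr.Dataset(
    {"tg_mean": (("realization", "experiment", "time"), np.zeros((2, 2, 3)))}
)

# Default xscen -> xclim name mapping; only keys present as dimensions
# on the ensemble are applied, so unused entries are harmless.
rename_dict = {
    "realization": "model",
    "source": "model",
    "experiment": "scenario",
    "bias_adjust_project": "downscaling",
}
rename_dict = {k: v for k, v in rename_dict.items() if k in ens.dims}
ens = ens.rename(rename_dict)

print(sorted(ens.tg_mean.dims))  # → ['model', 'scenario', 'time']
```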


Expand Down
2 changes: 1 addition & 1 deletion src/xscen/io.py
Expand Up @@ -1053,7 +1053,7 @@ def rechunk(
raise ValueError(
"No chunks given. Need to give at `chunks_over_var` or `chunks_over_dim`."
)

print(ds, chunks, worker_mem, str(path_out), str(temp_store))
plan = _rechunk(ds, chunks, worker_mem, str(path_out), temp_store=str(temp_store))

plan.execute()
Expand Down
7 changes: 3 additions & 4 deletions tests/test_biasadjust.py
Expand Up @@ -47,11 +47,10 @@ def test_basic_train(self, var, period):

def test_preprocess(self):

dref360 = self.dref.convert_calendar("360_day", align_on="year")

dhist360 = self.dhist.convert_calendar("360_day", align_on="year")
out = xs.train(
dref360,
self.dhist,
self.dref,
dhist360,
var="tas",
period=["2001", "2002"],
adapt_freq={"thresh": "2 K"},
Expand Down