This repository has been archived by the owner on Dec 20, 2024. It is now read-only.

Refactor Callbacks (#60)
* Refactor Callbacks
- Split into separate files
- Use list in config to add callbacks
- Split out plotting callbacks config

* Refactor rollout (#87)
- New rollout central function

---------

Co-authored-by: Mario Santa Cruz <[email protected]>
Co-authored-by: Sara Hahner <[email protected]>
3 people authored and JesperDramsch committed Oct 29, 2024
1 parent 6fc2e3b commit 9ea0390
Showing 24 changed files with 2,129 additions and 1,194 deletions.
8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -12,6 +12,10 @@ Keep it human-readable, your future self will thank you!

## [0.2.2 - Maintenance: pin python <3.13](https://github.com/ecmwf/anemoi-training/compare/0.2.1...0.2.2) - 2024-10-28

### Fixed
- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pulls/60)
- Refactored rollout [#87](https://github.com/ecmwf/anemoi-training/pulls/87)
- Enable longer validation rollout than training
### Added
- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)

@@ -29,7 +33,9 @@ Keep it human-readable, your future self will thank you!
- Feature: New `Boolean1DMask` class. Enables rollout training for limited area models. [#79](https://github.com/ecmwf/anemoi-training/pulls/79)

### Fixed

- Mlflow-sync to handle creation of new experiments in the remote server [#83] (https://github.com/ecmwf/anemoi-training/pull/83)
- Fix for multi-gpu when using mlflow due to refactoring of _get_mlflow_run_params function [#99] (https://github.com/ecmwf/anemoi-training/pull/99)
- ci: fix pyshtools install error (#100) https://github.com/ecmwf/anemoi-training/pull/100
- Mlflow-sync to handle creation of new experiments in the remote server [#83](https://github.com/ecmwf/anemoi-training/pull/83)
- Fix for multi-gpu when using mlflow due to refactoring of _get_mlflow_run_params function [#99](https://github.com/ecmwf/anemoi-training/pull/99)
- ci: fix pyshtools install error [#100](https://github.com/ecmwf/anemoi-training/pull/100)
97 changes: 84 additions & 13 deletions docs/modules/diagnostics.rst
@@ -21,23 +21,94 @@ functionality to use both Weights & Biases and Tensorboard.

The callbacks can also be used to evaluate forecasts over longer
rollouts beyond the forecast time that the model is trained on. The
number of rollout steps (or forecast iteration steps) is set using
``config.eval.rollout = *num_of_rollout_steps*``.

Note the user has the option to evaluate the callbacks asynchronously
(using the following config option
``config.diagnostics.plot.asynchronous``, which means that the model
training doesn't stop whilst the callbacks are being evaluated).
However, note that callbacks can still be slow, and therefore the
plotting callbacks can be switched off by setting
``config.diagnostics.plot.enabled`` to ``False`` or all the callbacks
can be completely switched off by setting
``config.diagnostics.eval.enabled`` to ``False``.
number of rollout steps for verification (or forecast iteration steps)
is set using ``config.dataloader.validation_rollout =
*num_of_rollout_steps*``.
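
The comment on ``validation_rollout`` in the dataloader config states that it must be equal to or greater than the rollout expected by the callbacks. A minimal sketch of that constraint is shown here; ``check_validation_rollout`` is a hypothetical helper written for illustration and is not part of anemoi-training.

```python
def check_validation_rollout(validation_rollout: int, callback_rollouts: list[int]) -> None:
    """Raise if any callback needs more rollout steps than validation provides.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    # The largest rollout requested by any callback (0 if there are none)
    needed = max(callback_rollouts, default=0)
    if validation_rollout < needed:
        msg = (
            f"dataloader.validation_rollout={validation_rollout} is smaller than "
            f"the rollout of {needed} expected by a callback"
        )
        raise ValueError(msg)


# A validation rollout of 12 covers callbacks that expect 12 and 4 steps
check_validation_rollout(12, [12, 4])
```

Referencing ``${dataloader.validation_rollout}`` from the callback config, as the examples in this section do, keeps the two values equal by construction.
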

Callbacks are configured in the config file under the
``config.diagnostics`` key.

Regular callbacks are provided as a list of dictionaries under the
``config.diagnostics.callbacks`` key. Each dictionary must have a
``_target_`` key, which hydra uses to instantiate the callback; any
other key is passed to the callback's constructor as a keyword argument.

.. code:: yaml

   callbacks:
     - _target_: anemoi.training.diagnostics.callbacks.evaluation.RolloutEval
       rollout: ${dataloader.validation_rollout}
       frequency: 20

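
To make the ``_target_`` mechanism concrete, the sketch below mimics it with the standard library only. It is a simplified stand-in for hydra's instantiation, not the code used by anemoi-training: ``instantiate_target`` is a hypothetical helper, and standard-library classes take the place of the real callback classes so that the example runs without any dependencies.

```python
import importlib
from typing import Any


def instantiate_target(spec: dict[str, Any]) -> Any:
    """Import the class named by ``_target_`` and call it with the remaining keys."""
    # Everything except ``_target_`` becomes a keyword argument
    kwargs = {key: value for key, value in spec.items() if key != "_target_"}
    # Split "package.module.ClassName" into module path and attribute name
    module_path, _, attr_name = spec["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# Stand-in entries: standard-library classes replace the real callbacks
callback_specs = [
    {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4},
    {"_target_": "datetime.timedelta", "minutes": 20},
]

# One object per list entry, in the order given in the config
callbacks = [instantiate_target(spec) for spec in callback_specs]
```

Because each list entry is self-describing, adding or removing a callback only requires editing the list in the config.
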
Plotting callbacks are configured in a similar way, but they are
specified underneath the ``config.diagnostics.plot.callbacks`` key.

This is done to ensure separation and ease of configuration between
experiments.

``config.diagnostics.plot`` is a broader config file specifying the
parameters to plot, as well as the plotting frequency and whether
plotting runs asynchronously.

Setting ``config.diagnostics.plot.asynchronous`` to ``True`` means that
model training does not stop whilst the plotting callbacks are being
evaluated.

.. code:: yaml

   plot:
     asynchronous: True # Whether to plot asynchronously
     frequency: # Frequency of the plotting
       batch: 750
       epoch: 5
     # Parameters to plot
     parameters:
       - z_500
       - t_850
       - u_850
     # Sample index
     sample_idx: 0
     # Precipitation and related fields
     precip_and_related_fields: [tp, cp]
     callbacks:
       - _target_: anemoi.training.diagnostics.callbacks.plot.PlotLoss
         # group parameters by categories when visualizing contributions to the loss
         # one-parameter groups are possible to highlight individual parameters
         parameter_groups:
           moisture: [tp, cp, tcw]
           sfc_wind: [10u, 10v]
       - _target_: anemoi.training.diagnostics.callbacks.plot.PlotSample
         sample_idx: ${diagnostics.plot.sample_idx}
         per_sample : 6
         parameters: ${diagnostics.plot.parameters}

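
Values such as ``${diagnostics.plot.sample_idx}`` are interpolations that point at another key in the config, so shared settings are written once and reused by each callback. The sketch below illustrates the idea with a simplified, standard-library-only resolver. It is an assumption-laden stand-in for the real config machinery: ``lookup`` and ``resolve`` are hypothetical helpers, and only whole-string, absolute references are handled.

```python
import re
from typing import Any

# Matches a value that consists of exactly one "${dotted.path}" reference
_INTERPOLATION = re.compile(r"^\$\{([^}]+)\}$")


def lookup(root: dict[str, Any], dotted_path: str) -> Any:
    """Follow a dotted path such as "diagnostics.plot.sample_idx" from the root."""
    node: Any = root
    for key in dotted_path.split("."):
        node = node[key]
    return node


def resolve(node: Any, root: dict[str, Any]) -> Any:
    """Return a copy of ``node`` with every whole-string interpolation replaced."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(value, root) for value in node]
    if isinstance(node, str):
        match = _INTERPOLATION.match(node)
        if match:
            # The referenced value may itself contain interpolations
            return resolve(lookup(root, match.group(1)), root)
    return node


config = {
    "diagnostics": {
        "plot": {
            "sample_idx": 0,
            "parameters": ["z_500", "t_850", "u_850"],
            "callbacks": [
                {
                    "_target_": "anemoi.training.diagnostics.callbacks.plot.PlotSample",
                    "sample_idx": "${diagnostics.plot.sample_idx}",
                    "per_sample": 6,
                    "parameters": "${diagnostics.plot.parameters}",
                }
            ],
        }
    }
}

resolved = resolve(config, config)
plot_sample = resolved["diagnostics"]["plot"]["callbacks"][0]
```

After resolution, ``plot_sample`` carries the shared sample index and parameter list, so changing ``diagnostics.plot.parameters`` in one place updates every callback that references it.
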
Below is the documentation for the default callbacks provided, but it is
also possible for users to add callbacks using the same structure:

.. automodule:: anemoi.training.diagnostics.callbacks
.. automodule:: anemoi.training.diagnostics.callbacks.checkpoint
:members:
:no-undoc-members:
:show-inheritance:

.. automodule:: anemoi.training.diagnostics.callbacks.evaluation
:members:
:no-undoc-members:
:show-inheritance:

.. automodule:: anemoi.training.diagnostics.callbacks.optimiser
:members:
:no-undoc-members:
:show-inheritance:

.. automodule:: anemoi.training.diagnostics.callbacks.plot
:members:
:no-undoc-members:
:show-inheritance:

.. automodule:: anemoi.training.diagnostics.callbacks.provenance
:members:
:no-undoc-members:
:show-inheritance:
4 changes: 2 additions & 2 deletions docs/user-guide/configuring.rst
@@ -21,7 +21,7 @@ settings at the top as follows:
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: eval_rollout
- diagnostics: evaluation
- hardware: example
- graph: multi_scale
- model: gnn
@@ -100,7 +100,7 @@ match the dataset you provide.
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: eval_rollout
- diagnostics: evaluation
- hardware: example
- graph: multi_scale
- model: transformer # Change from default group
2 changes: 1 addition & 1 deletion docs/user-guide/tracking.rst
@@ -33,7 +33,7 @@ the same experiment.
Within the MLflow experiments tab, it is possible to define different
namespaces. To create a new namespace, the user just needs to pass an
'experiment_name'
(``config.diagnostics.eval_rollout.log.mlflow.experiment_name``) to the
(``config.diagnostics.evaluation.log.mlflow.experiment_name``) to the
mlflow logger.

**Parent-Child Runs**
2 changes: 1 addition & 1 deletion src/anemoi/training/config/config.yaml
@@ -1,7 +1,7 @@
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: eval_rollout
- diagnostics: evaluation
- hardware: example
- graph: multi_scale
- model: gnn
2 changes: 2 additions & 0 deletions src/anemoi/training/config/dataloader/native_grid.yaml
@@ -45,6 +45,8 @@ training:
frequency: ${data.frequency}
drop: []

validation_rollout: 1 # number of rollouts to use for validation, must be equal or greater than rollout expected by callbacks

validation:
dataset: ${dataloader.dataset}
start: 2021
4 changes: 2 additions & 2 deletions src/anemoi/training/config/debug.yaml
@@ -1,7 +1,7 @@
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: eval_rollout
- diagnostics: evaluation
- hardware: example
- graph: multi_scale
- model: gnn
@@ -18,7 +18,7 @@ defaults:

diagnostics:
plot:
enabled: False
callbacks: []
hardware:
files:
graph: ???
@@ -0,0 +1 @@
# Add callbacks here
@@ -0,0 +1,4 @@
# Add callbacks here
- _target_: anemoi.training.diagnostics.callbacks.evaluation.RolloutEval
  rollout: ${dataloader.validation_rollout}
  frequency: 20
@@ -1,53 +1,8 @@
---
eval:
enabled: False
# use this to evaluate the model over longer rollouts, every so many validation batches
rollout: 12
frequency: 20
plot:
enabled: True
asynchronous: True
frequency: 750
sample_idx: 0
per_sample: 6
parameters:
- z_500
- t_850
- u_850
- v_850
- 2t
- 10u
- 10v
- sp
- tp
- cp
#Defining the accumulation levels for precipitation related fields and the colormap
accumulation_levels_plot: [0, 0.05, 0.1, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 6, 7, 100] # in mm
cmap_accumulation: ["#ffffff", "#04e9e7", "#019ff4", "#0300f4", "#02fd02", "#01c501", "#008e00", "#fdf802", "#e5bc00", "#fd9500", "#fd0000", "#d40000", "#bc0000", "#f800fd"]
precip_and_related_fields: [tp, cp]
# Histogram and Spectrum plots
parameters_histogram:
- z_500
- tp
- 2t
- 10u
- 10v
parameters_spectrum:
- z_500
- tp
- 2t
- 10u
- 10v
# group parameters by categories when visualizing contributions to the loss
# one-parameter groups are possible to highlight individual parameters
parameter_groups:
moisture: [tp, cp, tcw]
sfc_wind: [10u, 10v]
learned_features: False
longrollout:
enabled: False
rollout: [60]
frequency: 20 # every X epochs
defaults:
- plot: detailed
- callbacks: pretraining


debug:
# this will detect and trace back NaNs / Infs etc. but will slow down training
@@ -57,6 +12,7 @@ debug:
# remember to also activate the tensorboard logger (below)
profiler: False

enable_checkpointing: True
checkpoint:
every_n_minutes:
save_frequency: 30 # Approximate, as this is checked at the end of training steps
62 changes: 62 additions & 0 deletions src/anemoi/training/config/diagnostics/plot/detailed.yaml
@@ -0,0 +1,62 @@
asynchronous: True # Whether to plot asynchronously
frequency: # Frequency of the plotting
  batch: 750
  epoch: 5

# Parameters to plot
parameters:
- z_500
- t_850
- u_850
- v_850
- 2t
- 10u
- 10v
- sp
- tp
- cp

# Sample index
sample_idx: 0

# Precipitation and related fields
precip_and_related_fields: [tp, cp]

callbacks:
  # Add plot callbacks here
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphNodeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphEdgeTrainableFeaturesPlot
    epoch_frequency: 5
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotLoss
    # group parameters by categories when visualizing contributions to the loss
    # one-parameter groups are possible to highlight individual parameters
    parameter_groups:
      moisture: [tp, cp, tcw]
      sfc_wind: [10u, 10v]
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotSample
    sample_idx: ${diagnostics.plot.sample_idx}
    per_sample : 6
    parameters: ${diagnostics.plot.parameters}
    #Defining the accumulation levels for precipitation related fields and the colormap
    accumulation_levels_plot: [0, 0.05, 0.1, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 6, 7, 100] # in mm
    cmap_accumulation: ["#ffffff", "#04e9e7", "#019ff4", "#0300f4", "#02fd02", "#01c501", "#008e00", "#fdf802", "#e5bc00", "#fd9500", "#fd0000", "#d40000", "#bc0000", "#f800fd"]
    precip_and_related_fields: ${diagnostics.plot.precip_and_related_fields}

  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotSpectrum
    # batch_frequency: 100 # Override for batch frequency
    sample_idx: ${diagnostics.plot.sample_idx}
    parameters:
      - z_500
      - tp
      - 2t
      - 10u
      - 10v
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotHistogram
    sample_idx: ${diagnostics.plot.sample_idx}
    precip_and_related_fields: ${diagnostics.plot.precip_and_related_fields}
    parameters:
      - z_500
      - tp
      - 2t
      - 10u
      - 10v
1 change: 1 addition & 0 deletions src/anemoi/training/config/diagnostics/plot/none.yaml
@@ -0,0 +1 @@
callbacks: []
68 changes: 68 additions & 0 deletions src/anemoi/training/config/diagnostics/plot/rollout_eval.yaml
@@ -0,0 +1,68 @@
asynchronous: True # Whether to plot asynchronously
frequency: # Frequency of the plotting
  batch: 750
  epoch: 5

# Parameters to plot
parameters:
- z_500
- t_850
- u_850
- v_850
- 2t
- 10u
- 10v
- sp
- tp
- cp

# Sample index
sample_idx: 0

# Precipitation and related fields
precip_and_related_fields: [tp, cp]

callbacks:
  # Add plot callbacks here
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphNodeTrainableFeaturesPlot
  - _target_: anemoi.training.diagnostics.callbacks.plot.GraphEdgeTrainableFeaturesPlot
    epoch_frequency: 5
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotLoss
    # group parameters by categories when visualizing contributions to the loss
    # one-parameter groups are possible to highlight individual parameters
    parameter_groups:
      moisture: [tp, cp, tcw]
      sfc_wind: [10u, 10v]
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotSample
    sample_idx: ${diagnostics.plot.sample_idx}
    per_sample : 6
    parameters: ${diagnostics.plot.parameters}
    #Defining the accumulation levels for precipitation related fields and the colormap
    accumulation_levels_plot: [0, 0.05, 0.1, 0.25, 0.5, 1, 1.5, 2, 3, 4, 5, 6, 7, 100] # in mm
    cmap_accumulation: ["#ffffff", "#04e9e7", "#019ff4", "#0300f4", "#02fd02", "#01c501", "#008e00", "#fdf802", "#e5bc00", "#fd9500", "#fd0000", "#d40000", "#bc0000", "#f800fd"]
    precip_and_related_fields: ${diagnostics.plot.precip_and_related_fields}

  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotSpectrum
    # batch_frequency: 100 # Override for batch frequency
    sample_idx: ${diagnostics.plot.sample_idx}
    parameters:
      - z_500
      - tp
      - 2t
      - 10u
      - 10v
  - _target_: anemoi.training.diagnostics.callbacks.plot.PlotHistogram
    sample_idx: ${diagnostics.plot.sample_idx}
    precip_and_related_fields: ${diagnostics.plot.precip_and_related_fields}
    parameters:
      - z_500
      - tp
      - 2t
      - 10u
      - 10v
  - _target_: anemoi.training.diagnostics.callbacks.plot.LongRolloutPlots
    rollout:
      - ${dataloader.validation_rollout}
    epoch_frequency: 20
    sample_idx: ${diagnostics.plot.sample_idx}
    parameters: ${diagnostics.plot.parameters}