This repository has been archived by the owner on Dec 20, 2024. It is now read-only.

Merge branch 'develop' into pre-training-check
JesperDramsch authored Dec 3, 2024
2 parents e384516 + 460b604 commit e0fb176
Showing 27 changed files with 877 additions and 213 deletions.
29 changes: 21 additions & 8 deletions CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
Please add your functional changes to the appropriate section in the PR.
Keep it human-readable, your future self will thank you!

## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.1...HEAD)
### Fixed

### Added
@@ -19,23 +19,32 @@ Keep it human-readable, your future self will thank you!
### Removed
- Removed the resolution config entry [#120](https://github.com/ecmwf/anemoi-training/pull/120)

## [0.3.1 - AIFS v0.3 Compatibility](https://github.com/ecmwf/anemoi-training/compare/0.3.0...0.3.1) - 2024-11-28

### Changed
- Perform full shuffle of training dataset [#153](https://github.com/ecmwf/anemoi-training/pull/153)

### Fixed
- Update `n_pixel` used by datashader to better adapt across resolutions [#152](https://github.com/ecmwf/anemoi-training/pull/152)
- Fixed bug in power spectra plotting for the n320 resolution.
- Allow histogram and spectrum plot for one variable [#165](https://github.com/ecmwf/anemoi-training/pull/165)

### Added
- Introduce variable to configure (Cosine Annealing) optimizer warm up [#155](https://github.com/ecmwf/anemoi-training/pull/155)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)
- Bump `anemoi-graphs` version to 0.4.1 [#159](https://github.com/ecmwf/anemoi-training/pull/159)

### Changed
- Increase the default MlFlow HTTP max retries [#111](https://github.com/ecmwf/anemoi-training/pull/111)

## [0.3.0 - Loss & Callback Refactors](https://github.com/ecmwf/anemoi-training/compare/0.2.2...0.3.0) - 2024-11-14

### Fixed

- Rename loss_scaling to variable_loss_scaling [#138](https://github.com/ecmwf/anemoi-training/pull/138)
- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pulls/60)
- Updated docs [#115](https://github.com/ecmwf/anemoi-training/pull/115)
- Fix enabling LearningRateMonitor [#119](https://github.com/ecmwf/anemoi-training/pull/119)

- Refactored rollout [#87](https://github.com/ecmwf/anemoi-training/pulls/87)
  - Enable longer validation rollout than training

- Expand iterables in logging [#91](https://github.com/ecmwf/anemoi-training/pull/91)
  - Save entire config in mlflow

@@ -44,22 +44,25 @@ Keep it human-readable, your future self will thank you!

- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)
- Include option to use datashader and optimised asynchronous callbacks [#102](https://github.com/ecmwf/anemoi-training/pull/102)
- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)
- Allow updates to scalars [#137](https://github.com/ecmwf/anemoi-training/pulls/137)
  - Add without subsetting in ScaleTensor

- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [#38](https://github.com/ecmwf/anemoi-training/pull/38/)
- Feat: Save a gif for longer rollouts in validation [#65](https://github.com/ecmwf/anemoi-training/pull/65)
- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)
- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)
- Functionality to change the weight attribute of nodes in the graph at the start of training without re-generating the graph. [#136](https://github.com/ecmwf/anemoi-training/pull/136)
- Custom System monitor for Nvidia and AMD GPUs [#147](https://github.com/ecmwf/anemoi-training/pull/147)


### Changed

- Renamed frequency keys in callbacks configuration. [#118](https://github.com/ecmwf/anemoi-training/pull/118)
- Modified training configuration to support max_steps and tied lr iterations to max_steps by default [#67](https://github.com/ecmwf/anemoi-training/pull/67)
- Merged node & edge trainable feature callbacks into one. [#135](https://github.com/ecmwf/anemoi-training/pull/135)

### Removed

@@ -121,6 +133,7 @@ Keep it human-readable, your future self will thank you!
- Feature: Support training for datasets with missing time steps [#48](https://github.com/ecmwf/anemoi-training/pulls/48)
- Feature: `AnemoiMlflowClient`, an mlflow client with authentication support [#86](https://github.com/ecmwf/anemoi-training/pull/86)
- Long Rollout Plots
- Mask NaN values in training loss function [#72](https://github.com/ecmwf/anemoi-training/pull/72) and [#271](https://github.com/ecmwf-lab/aifs-mono/issues/271)

### Fixed

4 changes: 4 additions & 0 deletions docs/user-guide/distributed.rst
@@ -45,6 +45,10 @@ number of GPUs you wish to shard the model across. It is recommended to
only shard if the model does not fit in GPU memory, as data distribution
is a much more efficient way to parallelise the training.

When using model sharding, ``config.dataloader.read_group_size`` allows
for sharded data loading in subgroups. This should be set to the number
of GPUs per model for optimal performance.
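
For illustration, a minimal sketch of such a setup, assuming a model
sharded across four GPUs (``read_group_size`` defaults to
``${hardware.num_gpus_per_model}`` in the shipped dataloader config):

.. code:: yaml

   hardware:
     num_gpus_per_model: 4 # shard each model instance across 4 GPUs
   dataloader:
     # each of the 4 readers loads 1/4 of the data, which is then
     # all-gathered within the model communication group
     read_group_size: ${hardware.num_gpus_per_model}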

*********
Example
*********
37 changes: 32 additions & 5 deletions docs/user-guide/training.rst
@@ -183,15 +183,38 @@ levels nearer to the surface). By default anemoi-training uses a ReLU
Pressure Level scaler with a minimum weighting of 0.2 (i.e. no pressure
level has a weighting less than 0.2).
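
This default corresponds to the scaler configuration shipped in
``config/training/default.yaml`` (also visible further down in this
commit):

.. code:: yaml

   pressure_level_scaler:
     _target_: anemoi.training.data.scaling.ReluPressureLevelScaler
     minimum: 0.2 # no pressure level is weighted below 0.2
     slope: 0.001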

The loss is also scaled by assigning a weight to each node on the output
grid. These weights are calculated during graph-creation and stored as
an attribute in the graph object. The configuration option
``config.training.node_loss_weights`` is used to specify the node
attribute used as weights in the loss function. By default
anemoi-training uses area weighting, where each node is weighted
according to the size of the geographical area it represents.
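
The area-weighted default corresponds to the ``node_loss_weights``
entry added to ``config/training/default.yaml`` in this commit:

.. code:: yaml

   node_loss_weights:
     _target_: anemoi.training.losses.nodeweights.GraphNodeAttribute
     target_nodes: ${graph.data}
     node_attribute: area_weight # weight nodes by the area they represent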

It is also possible to rescale the weight of a subset of nodes after
they are loaded from the graph. For instance, for a stretched grid setup
we can rescale the weight of nodes in the limited area such that their
sum equals 0.25 of the sum of all node weights with the following config
setup:

.. code:: yaml

   node_loss_weights:
     _target_: anemoi.training.losses.nodeweights.ReweightedGraphNodeAttribute
     target_nodes: data
     scaled_attribute: cutout
     weight_frac_of_total: 0.25

***************
Learning rate
***************

Anemoi training uses the ``CosineLRScheduler`` from PyTorch as its
learning rate scheduler. Docs for this scheduler can be found at
https://github.com/huggingface/pytorch-image-models/blob/main/timm/scheduler/cosine_lr.py.
The user can configure the maximum learning rate by setting
``config.training.lr.rate``. Note that this learning rate is scaled by
the number of GPUs used for `data parallelism <distributed>`_.

.. code:: yaml
@@ -201,7 +224,11 @@ The user can also control the rate
by setting the total number of iterations through
``config.training.lr.iterations`` and the minimum learning rate reached
through ``config.training.lr.min``. Note that the minimum learning rate
is not scaled by the number of GPUs.
is not scaled by the number of GPUs. The user can also control the
warmup period by setting ``config.training.lr.warmup_t``. If the warmup
period is set to 0, the learning rate will start at the maximum learning
rate. If no warmup period is defined, a default warmup period of 1000
iterations is used.
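
Putting these options together, a minimal sketch of the scheduler
configuration, using the defaults shipped in
``config/training/default.yaml`` (where ``iterations`` is tied to
``training.max_steps``):

.. code:: yaml

   training:
     lr:
       rate: 0.625e-4 # maximum local learning rate, scaled by the data-parallel GPU count
       iterations: ${training.max_steps} # total length of the schedule
       min: 3e-7 # final learning rate, not scaled by the number of GPUs
       warmup_t: 1000 # warm-up iterations; 0 starts at the maximum rate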

*********
Rollout
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -40,8 +40,8 @@ classifiers = [
dynamic = [ "version" ]

dependencies = [
"anemoi-datasets>=0.4",
"anemoi-graphs>=0.4",
"anemoi-datasets>=0.5.2",
"anemoi-graphs>=0.4.1",
"anemoi-models>=0.3",
"anemoi-utils[provenance]>=0.4.4",
"datashader>=0.16.3",
11 changes: 11 additions & 0 deletions src/anemoi/training/config/dataloader/native_grid.yaml
@@ -1,6 +1,17 @@
prefetch_factor: 2
pin_memory: True

# ============
# read_group_size:
# Form subgroups of model comm groups that read data together.
# Each reader in the group only reads 1/read_group_size of the data
# which is then all-gathered between the group.
# This can reduce CPU memory usage as well as increase dataloader throughput.
# The number of GPUs per model must be divisible by read_group_size.
# To disable, set to 1.
# ============
read_group_size: ${hardware.num_gpus_per_model}

num_workers:
training: 8
validation: 8
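
For illustration, a hypothetical experiment-config override that
disables subgroup reading, so that every GPU in a model communication
group reads the full data:

```yaml
# Hypothetical override: disable reader subgroups entirely;
# each GPU then reads all of its model group's data.
read_group_size: 1
```
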
10 changes: 5 additions & 5 deletions src/anemoi/training/config/graph/encoder_decoder_only.yaml
@@ -22,15 +22,15 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

14 changes: 7 additions & 7 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -23,23 +23,23 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
# Processor configuration
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
target_mask_attr_name: cutout
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
target_mask_attr_name: cutout
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

16 changes: 8 additions & 8 deletions src/anemoi/training/config/graph/multi_scale.yaml
@@ -22,22 +22,22 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Processor configuration
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

12 changes: 6 additions & 6 deletions src/anemoi/training/config/graph/stretched_grid.yaml
@@ -34,22 +34,22 @@ edges:
# Encoder
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges
num_nearest_neighbours: 12
attributes: ${graph.attributes.edges}
# Processor
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
# Decoder
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges
num_nearest_neighbours: 3
attributes: ${graph.attributes.edges}

2 changes: 0 additions & 2 deletions src/anemoi/training/config/model/gnn.yaml
@@ -45,8 +45,6 @@ attributes:
- edge_dirs
nodes: []

node_loss_weight: area_weight

# Bounding configuration
bounding: #These are applied in order

2 changes: 0 additions & 2 deletions src/anemoi/training/config/model/graphtransformer.yaml
@@ -50,8 +50,6 @@ attributes:
- edge_dirs
nodes: []

node_loss_weight: area_weight

# Bounding configuration
bounding: #These are applied in order

2 changes: 0 additions & 2 deletions src/anemoi/training/config/model/transformer.yaml
@@ -49,8 +49,6 @@ attributes:
- edge_dirs
nodes: []

node_loss_weight: area_weight

# Bounding configuration
bounding: #These are applied in order

10 changes: 9 additions & 1 deletion src/anemoi/training/config/training/default.yaml
@@ -48,7 +48,9 @@ training_loss:
# Scalars to include in loss calculation
# Available scalars include:
# - 'variable': See `variable_loss_scaling` for more information
scalars: ['variable']
# - 'loss_weights_mask': Giving imputed NaNs a zero weight in the loss function
scalars: ['variable', 'loss_weights_mask']

ignore_nans: False

loss_gradient_scaling: False
@@ -81,6 +83,7 @@ lr:
rate: 0.625e-4 #local_lr
iterations: ${training.max_steps} # NOTE: When max_epochs < max_steps, scheduler will run for max_steps
min: 3e-7 #Not scaled by #GPU
warmup_t: 1000

# Changes in per-gpu batch_size should come with a rescaling of the local_lr
# in order to keep a constant global_lr
@@ -115,3 +118,8 @@ pressure_level_scaler:
_target_: anemoi.training.data.scaling.ReluPressureLevelScaler
minimum: 0.2
slope: 0.001

node_loss_weights:
_target_: anemoi.training.losses.nodeweights.GraphNodeAttribute
target_nodes: ${graph.data}
node_attribute: area_weight