This repository has been archived by the owner on Dec 20, 2024. It is now read-only.

Merge branch 'develop' into pre-training-check
JesperDramsch authored Dec 3, 2024
2 parents e384516 + 460b604 commit e0fb176
Showing 27 changed files with 877 additions and 213 deletions.
29 changes: 21 additions & 8 deletions CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
Please add your functional changes to the appropriate section in the PR.
Keep it human-readable, your future self will thank you!

## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.1...HEAD)
### Fixed

### Added
@@ -19,23 +19,32 @@ Keep it human-readable, your future self will thank you!
### Removed
- Removed the resolution config entry [#120](https://github.com/ecmwf/anemoi-training/pull/120)

## [0.3.1 - AIFS v0.3 Compatibility](https://github.com/ecmwf/anemoi-training/compare/0.3.0...0.3.1) - 2024-11-28

### Changed
- Perform full shuffle of training dataset [#153](https://github.com/ecmwf/anemoi-training/pull/153)

### Fixed
- Update `n_pixel` used by datashader to better adapt across resolutions [#152](https://github.com/ecmwf/anemoi-training/pull/152)
- Fixed bug in power spectra plotting for the n320 resolution.
- Allow histogram and spectrum plot for one variable [#165](https://github.com/ecmwf/anemoi-training/pull/165)

### Added
- Introduce variable to configure (Cosine Annealing) optimizer warm up [#155](https://github.com/ecmwf/anemoi-training/pull/155)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)
- Bump `anemoi-graphs` version to 0.4.1 [#159](https://github.com/ecmwf/anemoi-training/pull/159)

### Changed
- Increase the default MlFlow HTTP max retries [#111](https://github.com/ecmwf/anemoi-training/pull/111)

## [0.3.0 - Loss & Callback Refactors](https://github.com/ecmwf/anemoi-training/compare/0.2.2...0.3.0) - 2024-11-14

### Fixed

- Rename loss_scaling to variable_loss_scaling [#138](https://github.com/ecmwf/anemoi-training/pull/138)
- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pulls/60)
- Updated docs [#115](https://github.com/ecmwf/anemoi-training/pull/115)
- Fix enabling LearningRateMonitor [#119](https://github.com/ecmwf/anemoi-training/pull/119)

- Refactored rollout [#87](https://github.com/ecmwf/anemoi-training/pulls/87)
  - Enable longer validation rollout than training

- Expand iterables in logging [#91](https://github.com/ecmwf/anemoi-training/pull/91)
  - Save entire config in mlflow

@@ -44,22 +44,25 @@ Keep it human-readable, your future self will thank you!

- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)
- Include option to use datashader and optimised asynchronous callbacks [#102](https://github.com/ecmwf/anemoi-training/pull/102)
- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)
- Allow updates to scalars [#137](https://github.com/ecmwf/anemoi-training/pulls/137)
  - Add without subsetting in ScaleTensor

- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)
- Feat: Anemoi Profiler compatible with mlflow and using Pytorch (Kineto) Profiler for memory report [#38](https://github.com/ecmwf/anemoi-training/pull/38/)
- Feat: Save a gif for longer rollouts in validation [#65](https://github.com/ecmwf/anemoi-training/pull/65)
- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)
- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)
- Functionality to change the weight attribute of nodes in the graph at the start of training without re-generating the graph. [#136](https://github.com/ecmwf/anemoi-training/pull/136)
- Custom System monitor for Nvidia and AMD GPUs [#147](https://github.com/ecmwf/anemoi-training/pull/147)


### Changed

- Renamed frequency keys in callbacks configuration. [#118](https://github.com/ecmwf/anemoi-training/pull/118)
- Modified training configuration to support max_steps and tied lr iterations to max_steps by default [#67](https://github.com/ecmwf/anemoi-training/pull/67)
- Merged node & edge trainable feature callbacks into one. [#135](https://github.com/ecmwf/anemoi-training/pull/135)

### Removed

@@ -121,6 +133,7 @@ Keep it human-readable, your future self will thank you!
- Feature: Support training for datasets with missing time steps [#48](https://github.com/ecmwf/anemoi-training/pulls/48)
- Feature: `AnemoiMlflowClient`, an mlflow client with authentication support [#86](https://github.com/ecmwf/anemoi-training/pull/86)
- Long Rollout Plots
- Mask NaN values in training loss function [#72](https://github.com/ecmwf/anemoi-training/pull/72) and [#271](https://github.com/ecmwf-lab/aifs-mono/issues/271)

### Fixed

4 changes: 4 additions & 0 deletions docs/user-guide/distributed.rst
@@ -45,6 +45,10 @@ number of GPUs you wish to shard the model across. It is recommended to
only shard if the model does not fit in GPU memory, as data distribution
is a much more efficient way to parallelise the training.

When using model sharding, ``config.dataloader.read_group_size`` allows
for sharded data loading in subgroups. This should be set to the number
of GPUs per model for optimal performance.
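
For illustration, a minimal sketch of such a setup, assuming a model
sharded across four GPUs (``read_group_size`` defaults to
``${hardware.num_gpus_per_model}`` in the shipped dataloader config):

.. code:: yaml

   hardware:
     num_gpus_per_model: 4 # shard each model instance across 4 GPUs
   dataloader:
     # each of the 4 readers loads 1/4 of the data, which is then
     # all-gathered within the model communication group
     read_group_size: ${hardware.num_gpus_per_model}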

*********
Example
*********
37 changes: 32 additions & 5 deletions docs/user-guide/training.rst
@@ -183,15 +183,38 @@ levels nearer to the surface). By default anemoi-training uses a ReLU
Pressure Level scaler with a minimum weighting of 0.2 (i.e. no pressure
level has a weighting less than 0.2).
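
This default corresponds to the scaler configuration shipped in
``config/training/default.yaml`` (also visible further down in this
commit):

.. code:: yaml

   pressure_level_scaler:
     _target_: anemoi.training.data.scaling.ReluPressureLevelScaler
     minimum: 0.2 # no pressure level is weighted below 0.2
     slope: 0.001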

The loss is also scaled by assigning a weight to each node on the output
grid. These weights are calculated during graph-creation and stored as
an attribute in the graph object. The configuration option
``config.training.node_loss_weights`` is used to specify the node
attribute used as weights in the loss function. By default
anemoi-training uses area weighting, where each node is weighted
according to the size of the geographical area it represents.
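
The area-weighted default corresponds to the ``node_loss_weights``
entry added to ``config/training/default.yaml`` in this commit:

.. code:: yaml

   node_loss_weights:
     _target_: anemoi.training.losses.nodeweights.GraphNodeAttribute
     target_nodes: ${graph.data}
     node_attribute: area_weight # weight nodes by the area they represent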

It is also possible to rescale the weight of a subset of nodes after
they are loaded from the graph. For instance, for a stretched grid setup
we can rescale the weight of nodes in the limited area such that their
sum equals 0.25 of the sum of all node weights with the following config
setup:

.. code:: yaml

   node_loss_weights:
     _target_: anemoi.training.losses.nodeweights.ReweightedGraphNodeAttribute
     target_nodes: data
     scaled_attribute: cutout
     weight_frac_of_total: 0.25

***************
Learning rate
***************

Anemoi training uses the ``CosineLRScheduler`` from PyTorch as its
learning rate scheduler. Docs for this scheduler can be found at
https://github.com/huggingface/pytorch-image-models/blob/main/timm/scheduler/cosine_lr.py.
The user can configure the maximum learning rate by setting
``config.training.lr.rate``. Note that this learning rate is scaled by
the number of GPUs used for `data parallelism <distributed>`_.

.. code:: yaml
@@ -201,7 +224,11 @@ The user can also control the rate
by setting the total number of iterations through
``config.training.lr.iterations`` and the minimum learning rate reached
through ``config.training.lr.min``. Note that the minimum learning rate
is not scaled by the number of GPUs.
is not scaled by the number of GPUs. The user can also control the
warmup period by setting ``config.training.lr.warmup_t``. If the warmup
period is set to 0, the learning rate will start at the maximum learning
rate. If no warmup period is defined, a default warmup period of 1000
iterations is used.
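
Putting these options together, a minimal sketch of the scheduler
configuration, using the defaults shipped in
``config/training/default.yaml`` (where ``iterations`` is tied to
``training.max_steps``):

.. code:: yaml

   training:
     lr:
       rate: 0.625e-4 # maximum local learning rate, scaled by the data-parallel GPU count
       iterations: ${training.max_steps} # total length of the schedule
       min: 3e-7 # final learning rate, not scaled by the number of GPUs
       warmup_t: 1000 # warm-up iterations; 0 starts at the maximum rate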

*********
Rollout
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -40,8 +40,8 @@ classifiers = [
dynamic = [ "version" ]

dependencies = [
"anemoi-datasets>=0.4",
"anemoi-graphs>=0.4",
"anemoi-datasets>=0.5.2",
"anemoi-graphs>=0.4.1",
"anemoi-models>=0.3",
"anemoi-utils[provenance]>=0.4.4",
"datashader>=0.16.3",
11 changes: 11 additions & 0 deletions src/anemoi/training/config/dataloader/native_grid.yaml
@@ -1,6 +1,17 @@
prefetch_factor: 2
pin_memory: True

# ============
# read_group_size:
# Form subgroups of model comm groups that read data together.
# Each reader in the group only reads 1/read_group_size of the data
# which is then all-gathered between the group.
# This can reduce CPU memory usage as well as increase dataloader throughput.
# The number of GPUs per model must be divisible by read_group_size.
# To disable, set to 1.
# ============
read_group_size: ${hardware.num_gpus_per_model}

num_workers:
training: 8
validation: 8
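
For illustration, a hypothetical experiment-config override that
disables subgroup reading, so that every GPU in a model communication
group reads the full data:

```yaml
# Hypothetical override: disable reader subgroups entirely;
# each GPU then reads all of its model group's data.
read_group_size: 1
```
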
10 changes: 5 additions & 5 deletions src/anemoi/training/config/graph/encoder_decoder_only.yaml
@@ -22,15 +22,15 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

14 changes: 7 additions & 7 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -23,23 +23,23 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
# Processor configuration
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
target_mask_attr_name: cutout
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
target_mask_attr_name: cutout
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

16 changes: 8 additions & 8 deletions src/anemoi/training/config/graph/multi_scale.yaml
@@ -22,22 +22,22 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Processor configuration
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

12 changes: 6 additions & 6 deletions src/anemoi/training/config/graph/stretched_grid.yaml
@@ -34,22 +34,22 @@ edges:
# Encoder
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges
num_nearest_neighbours: 12
attributes: ${graph.attributes.edges}
# Processor
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
# Decoder
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges
num_nearest_neighbours: 3
attributes: ${graph.attributes.edges}

2 changes: 0 additions & 2 deletions src/anemoi/training/config/model/gnn.yaml
@@ -45,8 +45,6 @@ attributes:
- edge_dirs
nodes: []

node_loss_weight: area_weight

# Bounding configuration
bounding: #These are applied in order

2 changes: 0 additions & 2 deletions src/anemoi/training/config/model/graphtransformer.yaml
@@ -50,8 +50,6 @@ attributes:
- edge_dirs
nodes: []

node_loss_weight: area_weight

# Bounding configuration
bounding: #These are applied in order

2 changes: 0 additions & 2 deletions src/anemoi/training/config/model/transformer.yaml
@@ -49,8 +49,6 @@ attributes:
- edge_dirs
nodes: []

node_loss_weight: area_weight

# Bounding configuration
bounding: #These are applied in order

10 changes: 9 additions & 1 deletion src/anemoi/training/config/training/default.yaml
@@ -48,7 +48,9 @@ training_loss:
# Scalars to include in loss calculation
# Available scalars include:
# - 'variable': See `variable_loss_scaling` for more information
scalars: ['variable']
# - 'loss_weights_mask': Giving imputed NaNs a zero weight in the loss function
scalars: ['variable', 'loss_weights_mask']

ignore_nans: False

loss_gradient_scaling: False
@@ -81,6 +83,7 @@ lr:
rate: 0.625e-4 #local_lr
iterations: ${training.max_steps} # NOTE: When max_epochs < max_steps, scheduler will run for max_steps
min: 3e-7 #Not scaled by #GPU
warmup_t: 1000

# Changes in per-gpu batch_size should come with a rescaling of the local_lr
# in order to keep a constant global_lr
@@ -115,3 +118,8 @@ pressure_level_scaler:
_target_: anemoi.training.data.scaling.ReluPressureLevelScaler
minimum: 0.2
slope: 0.001

node_loss_weights:
_target_: anemoi.training.losses.nodeweights.GraphNodeAttribute
target_nodes: ${graph.data}
node_attribute: area_weight