Merge branch 'develop' into feature/model_freezing
icedoom888 committed Dec 18, 2024
2 parents 0b8a407 + 38b75fa commit 498a792
Showing 14 changed files with 457 additions and 108 deletions.
14 changes: 8 additions & 6 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ Keep it human-readable, your future self will thank you!
- Don't crash when using the profiler if certain env vars aren't set [#180](https://github.com/ecmwf/anemoi-training/pull/180)
- Remove saving of metadata to training checkpoint [#190](https://github.com/ecmwf/anemoi-training/pull/190)
- Fixes to callback plots [#182](https://github.com/ecmwf/anemoi-training/pull/182) (power spectrum large numpy array error and precip cmap for cases where precip is prognostic).
- GraphTrainableParameters callback will log a warning when no trainable parameters are specified [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Fixes to checkpoint saving - ensure the last checkpoint is saved when using max_steps [#191](https://github.com/ecmwf/anemoi-training/pull/191)
- Identify stretched grid models based on graph rather than configuration file [#204](https://github.com/ecmwf/anemoi-training/pull/204)

### Added
@@ -22,27 +24,27 @@ Keep it human-readable, your future self will thank you!
- Effective batch size: `(config.dataloader.batch_size["training"] * config.hardware.num_gpus_per_node * config.hardware.num_nodes) // config.hardware.num_gpus_per_model`.
Used for experiment reproducibility across different computing configurations.
- Added a check for the variable sorting on pre-trained/finetuned models [#120](https://github.com/ecmwf/anemoi-training/pull/120)
- Added default configuration files for stretched grid and limited area model experiments [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Added new metrics for stretched grid models to track losses inside/outside the regional domain [#199](https://github.com/ecmwf/anemoi-training/pull/199)
- <b>Model Freezing ❄️</b>: enables new functionality. You can now freeze parts of your model by specifying a list of submodules to freeze with the new config parameter `submodules_to_freeze` (see the sketch after this list).
- Introduces an optional config variable `submodules_to_freeze` -> `List[str]`, the list of submodules to freeze.
- Add supporting arrays (numpy) to checkpoint
- Support for masking out unconnected nodes in LAM [#171](https://github.com/ecmwf/anemoi-training/pull/171)
- Improved validation metrics, allow 'all' to be scaled [#202](https://github.com/ecmwf/anemoi-training/pull/202)
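
As a rough sketch of how the new model-freezing option might be configured (the config location and the submodule names below are illustrative assumptions, not taken from this commit):

    # Hypothetical example: freeze the encoder and processor, keep the decoder trainable.
    # Submodule names must match entries in the model's module tree.
    training:
      submodules_to_freeze:
        - encoder
        - processor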

### Changed

### Removed
- Removed the resolution config entry [#120](https://github.com/ecmwf/anemoi-training/pull/120)

## [0.3.1 - AIFS v0.3 Compatibility](https://github.com/ecmwf/anemoi-training/compare/0.3.0...0.3.1) - 2024-11-28

### Changed
- Perform full shuffle of training dataset [#153](https://github.com/ecmwf/anemoi-training/pull/153)

### Fixed
- Update `n_pixel` used by datashader to better adapt across resolutions [#152](https://github.com/ecmwf/anemoi-training/pull/152)
- Fixed bug in power spectra plotting for the n320 resolution.
- Allow histogram and spectrum plot for one variable [#165](https://github.com/ecmwf/anemoi-training/pull/165)

27 changes: 26 additions & 1 deletion docs/modules/losses.rst
@@ -73,11 +73,36 @@ Currently, the following scalars are available for use:
********************

Validation metrics as defined in the config file at
``config.training.validation_metrics`` follow the same initialisation
behaviour as the loss function, but can be a list. In this case all
losses are calculated and logged as a dictionary with the corresponding
name.

Scaling Validation Losses
=========================

Validation metrics can **not**, by default, be scaled by scalars across
the variable dimension, but they can be scaled by all other scalars. If
you want to scale a validation metric by the variable weights, it must
be added to ``config.training.scale_validation_metrics``.

These metrics are then kept in the normalised, preprocessed space, and
thus the indexing of scalars aligns with the indexing of the tensors.

By default, only ``all`` is kept in the normalised space and scaled.

.. code:: yaml

   # List of validation metrics to keep in normalised space, and scalars to be applied
   # Use '*' to reference all metrics, or provide a list of metric names.
   # Unlike above, variable scaling is possible because these metrics are
   # calculated in the same way as the training loss, within the internal model space.
   scale_validation_metrics:
     scalars_to_apply: ['variable']
     metrics:
       - 'all'
       # - "*"
***********************
Custom Loss Functions
***********************
11 changes: 5 additions & 6 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -10,7 +10,7 @@ nodes:
node_builder:
_target_: anemoi.graphs.nodes.ZarrDatasetNodes
dataset: ${dataloader.training.dataset}
attributes: ${graph.attributes.data_nodes}
attributes: ${graph.attributes.nodes}
# Hidden nodes
hidden:
node_builder:
@@ -26,8 +26,8 @@ edges:
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 2 # only for cutoff method
- _target_: anemoi.graphs.edges.CutOffEdges # connects only boundary nodes
cutoff_factor: 1.5 # only for cutoff method
source_mask_attr_name: boundary_mask
attributes: ${graph.attributes.edges}
# Processor configuration
@@ -46,16 +46,15 @@ edges:
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}


post_processors:
- _target_: anemoi.graphs.processors.RemoveUnconnectedNodes
nodes_name: data
ignore: cutout_mask # optional
save_mask_indices_to_attr: indices_connected_nodes # optional


attributes:
data_nodes:
nodes:
# Attributes for data nodes
area_weight:
_target_: anemoi.graphs.nodes.attributes.AreaWeights # options: Area, Uniform
norm: unit-max # options: l1, l2, unit-max, unit-sum, unit-std
18 changes: 8 additions & 10 deletions src/anemoi/training/config/graph/stretched_grid.yaml
@@ -11,12 +11,7 @@ nodes:
node_builder:
_target_: anemoi.graphs.nodes.ZarrDatasetNodes
dataset: ${dataloader.training.dataset}
attributes:
area_weight:
_target_: anemoi.graphs.nodes.attributes.AreaWeights
norm: unit-max
cutout:
_target_: anemoi.graphs.nodes.attributes.CutOutMask
attributes: ${graph.attributes.nodes}
hidden:
node_builder:
_target_: anemoi.graphs.nodes.StretchedTriNodes
@@ -25,10 +20,6 @@
reference_node_name: ${graph.data}
mask_attr_name: cutout
margin_radius_km: 11
attributes:
area_weights:
_target_: anemoi.graphs.nodes.attributes.AreaWeights
norm: unit-max

edges:
# Encoder
@@ -54,6 +45,13 @@
attributes: ${graph.attributes.edges}

attributes:
nodes:
# Attributes for data nodes
area_weight:
_target_: anemoi.graphs.nodes.attributes.AreaWeights
norm: unit-max
cutout:
_target_: anemoi.graphs.nodes.attributes.CutOutMask
edges:
edge_length:
_target_: anemoi.graphs.edges.attributes.EdgeLength
36 changes: 36 additions & 0 deletions src/anemoi/training/config/lam.yaml
@@ -0,0 +1,36 @@
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: evaluation
- hardware: example
- graph: limited_area
- model: graphtransformer
- training: default
- _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
# num_gpus_per_node: 1

dataloader:
dataset:
cutout:
- dataset: ${hardware.paths.data}/${hardware.files.dataset}
thinning: ???
- dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
adjust: all
min_distance_km: 0
grid_indices:
_target_: anemoi.training.data.grid_indices.MaskedGrid
nodes_name: data
node_attribute_name: indices_connected_nodes
model:
output_mask: cutout_mask # it must be a node attribute of the output nodes
hardware:
files:
dataset: ???
forcing_dataset: ???
37 changes: 37 additions & 0 deletions src/anemoi/training/config/stretched.yaml
@@ -0,0 +1,37 @@
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: evaluation
- hardware: example
- graph: stretched_grid
- model: graphtransformer
- training: default
- _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
# num_gpus_per_node: 1

dataloader:
dataset:
cutout:
- dataset: ${hardware.paths.data}/${hardware.files.dataset}
thinning: ???
- dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
adjust: all
min_distance_km: 0
training:
loss_scaling:
spatial:
_target_: anemoi.training.data.scaling.ReweightedGraphAttribute
target_nodes: ${graph.data}
scaled_attribute: area_weight # it must be a node attribute of the output nodes
cutout_weight_frac_of_global: ???
hardware:
files:
dataset: ???
forcing_dataset: ???
20 changes: 18 additions & 2 deletions src/anemoi/training/config/training/default.yaml
@@ -58,16 +58,32 @@ loss_gradient_scaling: False

# Validation metrics calculation,
# This may be a list, in which case all metrics will be calculated
# and logged according to their name.
# These metrics are calculated in the output model space, and thus
# have undergone postprocessing.
validation_metrics:
# loss class to initialise
- _target_: anemoi.training.losses.mse.WeightedMSELoss
# Scalars to include in loss calculation
# Cannot scale over the variable dimension due to possible remappings.
# Available scalars include:
# - 'loss_weights_mask': Giving imputed NaNs a zero weight in the loss function
# Use the `scale_validation_metrics` section to scale by variable.
scalars: []
# other kwargs
ignore_nans: True

# List of validation metrics to keep in normalised space, and scalars to be applied
# Use '*' to reference all metrics, or provide a list of metric names.
# Unlike above, variable scaling is possible due to these metrics being
# calculated in the same way as the training loss, within the internal model space.
scale_validation_metrics:
scalars_to_apply: ['variable']
metrics:
- 'all'
# - "*"


# length of the "rollout" window (see Keisler's paper)
rollout:
start: 1
30 changes: 29 additions & 1 deletion src/anemoi/training/diagnostics/callbacks/checkpoint.py
@@ -43,6 +43,7 @@ def __init__(self, config: OmegaConf, **kwargs: dict) -> None:
"""
super().__init__(**kwargs)

self.config = config
self.start = time.time()
self._model_metadata = None
@@ -76,6 +77,34 @@ def model_metadata(self, model: torch.nn.Module) -> dict:

return self._model_metadata

def _adjust_epoch_progress(self, trainer: pl.Trainer) -> None:
"""
Adjust the epoch progress when saving a mid-epoch checkpoint.
Since PyTorch Lightning advances one epoch at the end of training (on_train_end),
we need to correct the checkpoint epoch progress to avoid inconsistencies.
"""
trainer.fit_loop.epoch_progress.current.processed -= 1
trainer.fit_loop.epoch_progress.current.completed -= 1
trainer.fit_loop.epoch_progress.total.processed -= 1
trainer.fit_loop.epoch_progress.total.completed -= 1

def on_train_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
"""
Save the last checkpoint at the end of training.
If the candidates aren't better than the last checkpoint, then no checkpoints are saved.
Note: if this method is triggered when using max_epochs, it won't save any checkpoints,
since the monitor candidates won't show any changes with regard to the 'on_train_epoch_end' hook.
"""
del pl_module
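# Only save here when checkpointing at the end of each training epoch is not already enabled.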
if not self._should_skip_saving_checkpoint(trainer) and not self._should_save_on_train_epoch_end(trainer):
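# If the last epoch fully completed, Lightning has already advanced the epoch counters; undo that before saving.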
if trainer.fit_loop.epoch_progress.current.completed == trainer.fit_loop.epoch_progress.current.ready:
self._adjust_epoch_progress(trainer)
monitor_candidates = self._monitor_candidates(trainer)
self._save_topk_checkpoint(trainer, monitor_candidates)
self._save_last_checkpoint(trainer, monitor_candidates)

def tracker_metadata(self, trainer: pl.Trainer) -> dict:
if self._tracker_metadata is not None:
return {self._tracker_name: self._tracker_metadata}
@@ -169,7 +198,6 @@ def _save_checkpoint(self, trainer: pl.Trainer, lightning_checkpoint_filepath: s
self._last_global_step_saved = trainer.global_step

trainer.strategy.barrier()

# saving checkpoint used for pytorch-lightning based training
trainer.save_checkpoint(lightning_checkpoint_filepath, self.save_weights_only)
