Merge branch 'develop' into feature/model_freezing
icedoom888 committed Dec 18, 2024
2 parents 0b8a407 + 38b75fa commit 498a792
Showing 14 changed files with 457 additions and 108 deletions.
14 changes: 8 additions & 6 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ Keep it human-readable, your future self will thank you!
- Don't crash when using the profiler if certain env vars aren't set [#180](https://github.com/ecmwf/anemoi-training/pull/180)
- Remove saving of metadata to training checkpoint [#190](https://github.com/ecmwf/anemoi-training/pull/190)
- Fixes to callback plots [#182](https://github.com/ecmwf/anemoi-training/pull/182) (power spectrum large numpy array error and precip cmap for cases where precip is prognostic).
- GraphTrainableParameters callback will log a warning when no trainable parameters are specified [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Fixes to checkpoint saving - ensure the last checkpoint is saved when using max_steps [#191](https://github.com/ecmwf/anemoi-training/pull/191)
- Identify stretched grid models based on graph rather than configuration file [#204](https://github.com/ecmwf/anemoi-training/pull/204)

### Added
@@ -22,27 +24,27 @@ Keep it human-readable, your future self will thank you!
- Effective batch size: `(config.dataloader.batch_size["training"] * config.hardware.num_gpus_per_node * config.hardware.num_nodes) // config.hardware.num_gpus_per_model`.
Used for experiment reproducibility across different computing configurations.
- Added a check for the variable sorting on pre-trained/finetuned models [#120](https://github.com/ecmwf/anemoi-training/pull/120)
- Added default configuration files for stretched grid and limited area model experiments [#173](https://github.com/ecmwf/anemoi-training/pull/173)
- Added new metrics for stretched grid models to track losses inside/outside the regional domain [#199](https://github.com/ecmwf/anemoi-training/pull/199)
- <b>Model Freezing ❄️</b>: enables new functionality. You can now freeze parts of your model by specifying a list of submodules to freeze with the new config parameter `submodules_to_freeze` (see the sketch after this list).
- Introduces an optional config variable `submodules_to_freeze` -> `List[str]`, the list of submodules to freeze.
- Add supporting arrays (numpy) to checkpoint
- Support for masking out unconnected nodes in LAM [#171](https://github.com/ecmwf/anemoi-training/pull/171)
- Improved validation metrics, allow 'all' to be scaled [#202](https://github.com/ecmwf/anemoi-training/pull/202)
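
As a rough sketch of how the new model-freezing option might be configured (the config location and the submodule names below are illustrative assumptions, not taken from this commit):

    # Hypothetical example: freeze the encoder and processor, keep the decoder trainable.
    # Submodule names must match entries in the model's module tree.
    training:
      submodules_to_freeze:
        - encoder
        - processor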

### Changed

### Removed
- Removed the resolution config entry [#120](https://github.com/ecmwf/anemoi-training/pull/120)

## [0.3.1 - AIFS v0.3 Compatibility](https://github.com/ecmwf/anemoi-training/compare/0.3.0...0.3.1) - 2024-11-28

### Changed
- Perform full shuffle of training dataset [#153](https://github.com/ecmwf/anemoi-training/pull/153)

### Fixed
- Update `n_pixel` used by datashader to better adapt across resolutions [#152](https://github.com/ecmwf/anemoi-training/pull/152)
- Fixed bug in power spectra plotting for the n320 resolution.
- Allow histogram and spectrum plot for one variable [#165](https://github.com/ecmwf/anemoi-training/pull/165)

27 changes: 26 additions & 1 deletion docs/modules/losses.rst
@@ -73,11 +73,36 @@ Currently, the following scalars are available for use:
********************

Validation metrics as defined in the config file at
``config.training.validation_metrics`` follow the same initialisation
behaviour as the loss function, but can be a list. In this case all
losses are calculated and logged as a dictionary with the corresponding
name.

Scaling Validation Losses
=========================

Validation metrics can **not**, by default, be scaled by scalars across
the variable dimension, but they can be scaled by all other scalars. If
you want to scale a validation metric by the variable weights, it must
be added to ``config.training.scale_validation_metrics``.

These metrics are then kept in the normalised, preprocessed space, and
thus the indexing of scalars aligns with the indexing of the tensors.

By default, only ``all`` is kept in the normalised space and scaled.

.. code:: yaml

   # List of validation metrics to keep in normalised space, and scalars to be applied
   # Use '*' to reference all metrics, or provide a list of metric names.
   # Unlike above, variable scaling is possible because these metrics are
   # calculated in the same way as the training loss, within the internal model space.
   scale_validation_metrics:
     scalars_to_apply: ['variable']
     metrics:
       - 'all'
       # - "*"
***********************
Custom Loss Functions
***********************
11 changes: 5 additions & 6 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -10,7 +10,7 @@ nodes:
node_builder:
_target_: anemoi.graphs.nodes.ZarrDatasetNodes
dataset: ${dataloader.training.dataset}
attributes: ${graph.attributes.data_nodes}
attributes: ${graph.attributes.nodes}
# Hidden nodes
hidden:
node_builder:
@@ -26,8 +26,8 @@ edges:
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 2 # only for cutoff method
- _target_: anemoi.graphs.edges.CutOffEdges # connects only boundary nodes
cutoff_factor: 1.5 # only for cutoff method
source_mask_attr_name: boundary_mask
attributes: ${graph.attributes.edges}
# Processor configuration
@@ -46,16 +46,15 @@ edges:
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}


post_processors:
- _target_: anemoi.graphs.processors.RemoveUnconnectedNodes
nodes_name: data
ignore: cutout_mask # optional
save_mask_indices_to_attr: indices_connected_nodes # optional


attributes:
data_nodes:
nodes:
# Attributes for data nodes
area_weight:
_target_: anemoi.graphs.nodes.attributes.AreaWeights # options: Area, Uniform
norm: unit-max # options: l1, l2, unit-max, unit-sum, unit-std
18 changes: 8 additions & 10 deletions src/anemoi/training/config/graph/stretched_grid.yaml
@@ -11,12 +11,7 @@ nodes:
node_builder:
_target_: anemoi.graphs.nodes.ZarrDatasetNodes
dataset: ${dataloader.training.dataset}
attributes:
area_weight:
_target_: anemoi.graphs.nodes.attributes.AreaWeights
norm: unit-max
cutout:
_target_: anemoi.graphs.nodes.attributes.CutOutMask
attributes: ${graph.attributes.nodes}
hidden:
node_builder:
_target_: anemoi.graphs.nodes.StretchedTriNodes
@@ -25,10 +20,6 @@
reference_node_name: ${graph.data}
mask_attr_name: cutout
margin_radius_km: 11
attributes:
area_weights:
_target_: anemoi.graphs.nodes.attributes.AreaWeights
norm: unit-max

edges:
# Encoder
@@ -54,6 +45,13 @@
attributes: ${graph.attributes.edges}

attributes:
nodes:
# Attributes for data nodes
area_weight:
_target_: anemoi.graphs.nodes.attributes.AreaWeights
norm: unit-max
cutout:
_target_: anemoi.graphs.nodes.attributes.CutOutMask
edges:
edge_length:
_target_: anemoi.graphs.edges.attributes.EdgeLength
36 changes: 36 additions & 0 deletions src/anemoi/training/config/lam.yaml
@@ -0,0 +1,36 @@
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: evaluation
- hardware: example
- graph: limited_area
- model: graphtransformer
- training: default
- _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
# num_gpus_per_node: 1

dataloader:
dataset:
cutout:
- dataset: ${hardware.paths.data}/${hardware.files.dataset}
thinning: ???
- dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
adjust: all
min_distance_km: 0
grid_indices:
_target_: anemoi.training.data.grid_indices.MaskedGrid
nodes_name: data
node_attribute_name: indices_connected_nodes
model:
output_mask: cutout_mask # it must be a node attribute of the output nodes
hardware:
files:
dataset: ???
forcing_dataset: ???
37 changes: 37 additions & 0 deletions src/anemoi/training/config/stretched.yaml
@@ -0,0 +1,37 @@
defaults:
- data: zarr
- dataloader: native_grid
- diagnostics: evaluation
- hardware: example
- graph: stretched_grid
- model: graphtransformer
- training: default
- _self_


### This file is for local experimentation.
## When you commit your changes, assign the new features and keywords
## to the correct defaults.
# For example to change from default GPU count:
# hardware:
# num_gpus_per_node: 1

dataloader:
dataset:
cutout:
- dataset: ${hardware.paths.data}/${hardware.files.dataset}
thinning: ???
- dataset: ${hardware.paths.data}/${hardware.files.forcing_dataset}
adjust: all
min_distance_km: 0
training:
loss_scaling:
spatial:
_target_: anemoi.training.data.scaling.ReweightedGraphAttribute
target_nodes: ${graph.data}
scaled_attribute: area_weight # it must be a node attribute of the output nodes
cutout_weight_frac_of_global: ???
hardware:
files:
dataset: ???
forcing_dataset: ???
20 changes: 18 additions & 2 deletions src/anemoi/training/config/training/default.yaml
@@ -58,16 +58,32 @@ loss_gradient_scaling: False

# Validation metrics calculation,
# This may be a list, in which case all metrics will be calculated
# and logged according to their name.
# These metrics are calculated in the output model space, and thus
# have undergone postprocessing.
validation_metrics:
# loss class to initialise
- _target_: anemoi.training.losses.mse.WeightedMSELoss
# Scalars to include in loss calculation
# Cannot scale over the variable dimension due to possible remappings.
# Available scalars include:
# - 'loss_weights_mask': Giving imputed NaNs a zero weight in the loss function
# Use the `scale_validation_metrics` section to scale by variable.
scalars: []
# other kwargs
ignore_nans: True

# List of validation metrics to keep in normalised space, and scalars to be applied
# Use '*' to reference all metrics, or provide a list of metric names.
# Unlike above, variable scaling is possible due to these metrics being
# calculated in the same way as the training loss, within the internal model space.
scale_validation_metrics:
scalars_to_apply: ['variable']
metrics:
- 'all'
# - "*"


# length of the "rollout" window (see Keisler's paper)
rollout:
start: 1
30 changes: 29 additions & 1 deletion src/anemoi/training/diagnostics/callbacks/checkpoint.py
@@ -43,6 +43,7 @@ def __init__(self, config: OmegaConf, **kwargs: dict) -> None:
"""
super().__init__(**kwargs)

self.config = config
self.start = time.time()
self._model_metadata = None
@@ -76,6 +77,34 @@ def model_metadata(self, model: torch.nn.Module) -> dict:

return self._model_metadata

def _adjust_epoch_progress(self, trainer: pl.Trainer) -> None:
"""
Adjust the epoch progress when saving a mid-epoch checkpoint.
Since PyTorch Lightning advances one epoch at the end of training (on_train_end),
we need to correct the checkpoint epoch progress to avoid inconsistencies.
"""
trainer.fit_loop.epoch_progress.current.processed -= 1
trainer.fit_loop.epoch_progress.current.completed -= 1
trainer.fit_loop.epoch_progress.total.processed -= 1
trainer.fit_loop.epoch_progress.total.completed -= 1

def on_train_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
"""
Save the last checkpoint at the end of training.
If the candidates aren't better than the last checkpoint, then no checkpoints are saved.
Note: if this method is triggered when using max_epochs, it won't save any checkpoints,
since the monitor candidates won't show any changes with regard to the 'on_train_epoch_end' hook.
"""
del pl_module
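# Only save here when checkpointing at the end of each training epoch is not already enabled.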
if not self._should_skip_saving_checkpoint(trainer) and not self._should_save_on_train_epoch_end(trainer):
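# If the last epoch fully completed, Lightning has already advanced the epoch counters; undo that before saving.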
if trainer.fit_loop.epoch_progress.current.completed == trainer.fit_loop.epoch_progress.current.ready:
self._adjust_epoch_progress(trainer)
monitor_candidates = self._monitor_candidates(trainer)
self._save_topk_checkpoint(trainer, monitor_candidates)
self._save_last_checkpoint(trainer, monitor_candidates)

def tracker_metadata(self, trainer: pl.Trainer) -> dict:
if self._tracker_metadata is not None:
return {self._tracker_name: self._tracker_metadata}
@@ -169,7 +198,6 @@ def _save_checkpoint(self, trainer: pl.Trainer, lightning_checkpoint_filepath: s
self._last_global_step_saved = trainer.global_step

trainer.strategy.barrier()

# saving checkpoint used for pytorch-lightning based training
trainer.save_checkpoint(lightning_checkpoint_filepath, self.save_weights_only)
