Merge remote-tracking branch 'upstream/feature/limited_area' into lam-sub-hourly
MeteoSwiss · Oct 22, 2024 · af47bce
2 parents af9685d + bd28e67
commit af47bce
Showing 20 changed files with 317 additions and 103 deletions.
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -1,6 +1,6 @@
 # CODEOWNERS file
 
 # Protect workflow files
-/.github/ @theissenhelen @jesperdramsch @gmertes
-/.pre-commit-config.yaml @theissenhelen @jesperdramsch @gmertes
-/pyproject.toml @theissenhelen @jesperdramsch @gmertes
+/.github/ @theissenhelen @jesperdramsch @gmertes @b8raoult @floriankrb @anaprietonem @HCookie @JPXKQX @mchantry
+/.pre-commit-config.yaml @theissenhelen @jesperdramsch @gmertes @b8raoult @floriankrb @anaprietonem @HCookie @JPXKQX @mchantry
+/pyproject.toml @theissenhelen @jesperdramsch @gmertes @b8raoult @floriankrb @anaprietonem @HCookie @JPXKQX @mchantry
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -5,12 +5,12 @@ repos:
   - id: clear-notebooks-output
     name: clear-notebooks-output
     files: tools/.*\.ipynb$
-    stages: [commit]
+    stages: [pre-commit]
     language: python
     entry: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace
     additional_dependencies: [jupyter]
 - repo: https://github.com/pre-commit/pre-commit-hooks
-  rev: v4.6.0
+  rev: v5.0.0
   hooks:
   - id: check-yaml # Check YAML files for syntax errors only
     args: [--unsafe, --allow-multiple-documents]
@@ -40,7 +40,7 @@ repos:
     - --force-single-line-imports
     - --profile black
 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.6.4
+  rev: v0.6.9
   hooks:
   - id: ruff
     # Next line is for documentation code snippets
@@ -66,11 +66,11 @@ repos:
   - id: docconvert
     args: ["numpy"]
 - repo: https://github.com/tox-dev/pyproject-fmt
-  rev: "2.2.3"
+  rev: "2.2.4"
   hooks:
   - id: pyproject-fmt
 -   repo: https://github.com/jshwi/docsig # Check docstrings against function sig
-    rev: v0.60.1
+    rev: v0.64.0
     hooks:
     -   id: docsig
         args:

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,9 +8,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 Please add your functional changes to the appropriate section in the PR.
 Keep it human-readable, your future self will thank you!
 
-## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.1.0...HEAD)
+## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.2.0...HEAD)
 
 ### Added
+
+- Rollout training for Limited Area Models. [#79](https://github.com/ecmwf/anemoi-training/pulls/79)
+- Feature: New `Boolean1DMask` class. Enables rollout training for limited area models. [#79](https://github.com/ecmwf/anemoi-training/pulls/79)
+
+### Fixed
+### Changed
+
+## [0.2.0 - Feature release](https://github.com/ecmwf/anemoi-training/compare/0.1.0...0.2.0) - 2024-10-16
+
+- Make pin_memory of the Dataloader configurable (#64)
+
+### Added
+
+- Add anemoi-transform link to documentation
 - Codeowners file (#56)
 - Changelog merge strategy (#56)
 - Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
@@ -25,7 +39,9 @@ Keep it human-readable, your future self will thank you!
 - Enforce same binning for histograms comparing true data to predicted data
 - Fix: Inference checkpoints are now saved according to the frequency settings defined in the config [#37](https://github.com/ecmwf/anemoi-training/pull/37)
 - Feature: Add configurable models [#50](https://github.com/ecmwf/anemoi-training/pulls/50)
+- Feature: Authentication support for mlflow sync - [#51](https://github.com/ecmwf/anemoi-training/pull/51)
 - Feature: Support training for datasets with missing time steps [#48](https://github.com/ecmwf/anemoi-training/pulls/48)
+- Feature: `AnemoiMlflowClient`, an mlflow client with authentication support [#86](https://github.com/ecmwf/anemoi-training/pull/86)
 - Long Rollout Plots
 
 ### Fixed
@@ -34,10 +50,14 @@ Keep it human-readable, your future self will thank you!
 - Bugfixes for CI (#56)
 - Fix `mlflow` subcommand on python 3.9 [#62](https://github.com/ecmwf/anemoi-training/pull/62)
 - Show correct subcommand in MLFlow - Addresses [#39](https://github.com/ecmwf/anemoi-training/issues/39) in [#61](https://github.com/ecmwf/anemoi-training/pull/61)
+- Fix interactive multi-GPU training [#82](https://github.com/ecmwf/anemoi-training/pull/82)
+- Allow 500 characters in mlflow logging [#88](https://github.com/ecmwf/anemoi-training/pull/88)
 
 ### Changed
 
 - Updated configuration examples in documentation and corrected links - [#46](https://github.com/ecmwf/anemoi-training/pull/46)
+- Remove credential prompt from mlflow login, replace with seed refresh token via web - [#78](https://github.com/ecmwf/anemoi-training/pull/78)
+- Update CODEOWNERS
 
 ## [0.1.0 - Anemoi training - First release](https://github.com/ecmwf/anemoi-training/releases/tag/0.1.0) - 2024-08-16
 
@@ -51,6 +71,7 @@ Keep it human-readable, your future self will thank you!
 - Subcommand for checkpoint handling
 
 #### Functionality
+
 - Searchpaths for Hydra configs, to enable configs in CWD, `ANEMOI_CONFIG_PATH` env, and `.config/anemoi/training` in addition to package defaults
 - MlFlow token authentication
 - Configurable pressure level scaling

diff --git a/docs/conf.py b/docs/conf.py
@@ -101,6 +101,10 @@
         "https://anemoi-registry.readthedocs.io/en/latest/",
         ("../../anemoi-registry/docs/_build/html/objects.inv", None),
     ),
+    "anemoi-transform": (
+        "https://anemoi-transform.readthedocs.io/en/latest/",
+        ("../../anemoi-transform/docs/_build/html/objects.inv", None),
+    ),
 }
 
 # -- Options for HTML output -------------------------------------------------

diff --git a/docs/index.rst b/docs/index.rst
@@ -67,6 +67,7 @@ This package provides the *Anemoi* training functionality.
 *****************
 
 -  :ref:`anemoi-utils <anemoi-utils:index-page>`
+-  :ref:`anemoi-transform <anemoi-transform:index-page>`
 -  :ref:`anemoi-datasets <anemoi-datasets:index-page>`
 -  :ref:`anemoi-models <anemoi-models:index-page>`
 -  :ref:`anemoi-graphs <anemoi-graphs:index-page>`

diff --git a/pyproject.toml b/pyproject.toml
@@ -83,6 +83,10 @@ urls.Documentation = "https://anemoi-training.readthedocs.io/"
 urls.Homepage = "https://github.com/ecmwf/anemoi-training/"
 urls.Issues = "https://github.com/ecmwf/anemoi-training/issues"
 urls.Repository = "https://github.com/ecmwf/anemoi-training/"
+# command for interactive DDP (not supposed to be used directly)
+# the dot is intentional, so it doesn't trigger autocomplete
+scripts.".anemoi-training-train" = "anemoi.training.commands.train:main"
+
 # Add subcommand in the `commands` directory
 scripts.anemoi-training = "anemoi.training.__main__:main"
 

diff --git a/src/anemoi/training/commands/mlflow.py b/src/anemoi/training/commands/mlflow.py
@@ -45,23 +45,39 @@ def add_arguments(command_parser: argparse.ArgumentParser) -> None:
             "--source",
             "-s",
             help="The MLflow logs source directory.",
+            metavar="DIR",
             required=True,
             default=argparse.SUPPRESS,
         )
         sync.add_argument(
             "--destination",
             "-d",
             help="The destination MLflow tracking URI.",
+            metavar="URI",
+            required=True,
+            default=argparse.SUPPRESS,
+        )
+        sync.add_argument(
+            "--run-id",
+            "-r",
+            help="The run ID to sync.",
+            metavar="ID",
             required=True,
             default=argparse.SUPPRESS,
         )
-        sync.add_argument("--run-id", "-r", help="The run ID to sync.", required=True, default=argparse.SUPPRESS)
         sync.add_argument(
             "--experiment-name",
             "-e",
             help="The experiment name to sync to.",
+            metavar="NAME",
             default="anemoi-debug",
         )
+        sync.add_argument(
+            "--authentication",
+            "-a",
+            action="store_true",
+            help="The destination server requires authentication.",
+        )
         sync.add_argument(
             "--export-deleted-runs",
             "-x",
@@ -88,8 +104,18 @@ def run(args: argparse.Namespace) -> None:
             return
 
         if args.subcommand == "sync":
+            from anemoi.training.diagnostics.mlflow.utils import health_check
             from anemoi.training.utils.mlflow_sync import MlFlowSync
 
+            if args.authentication:
+                from anemoi.training.diagnostics.mlflow.auth import TokenAuth
+
+                auth = TokenAuth(url=args.destination)
+                auth.login()
+                auth.authenticate()
+
+            health_check(args.destination)
+
             log_level = "DEBUG" if args.verbose else "INFO"
 
             MlFlowSync(

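Review note: the `sync` options added above can be exercised in isolation. The following is a minimal sketch, assuming a stand-alone parser; `build_sync_parser`, the `sync-sketch` program name, and the example URL and run ID are hypothetical and not part of the commit. It shows how `metavar` changes the help text and how the new `--authentication` flag behaves as a `store_true` switch.

```python
import argparse


def build_sync_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-alone parser mirroring the `sync` options in the diff."""
    parser = argparse.ArgumentParser(prog="sync-sketch")
    parser.add_argument("--source", "-s", metavar="DIR", required=True,
                        help="The MLflow logs source directory.")
    parser.add_argument("--destination", "-d", metavar="URI", required=True,
                        help="The destination MLflow tracking URI.")
    parser.add_argument("--run-id", "-r", metavar="ID", required=True,
                        help="The run ID to sync.")
    parser.add_argument("--experiment-name", "-e", metavar="NAME", default="anemoi-debug",
                        help="The experiment name to sync to.")
    # store_true: the flag takes no value and defaults to False when omitted
    parser.add_argument("--authentication", "-a", action="store_true",
                        help="The destination server requires authentication.")
    return parser


# Example values below are placeholders, not real endpoints or run IDs
args = build_sync_parser().parse_args(
    ["-s", "./mlruns", "-d", "https://mlflow.example.org", "-r", "abc123", "-a"]
)
print(args.authentication, args.experiment_name)
```

With `metavar` set, the help output lists `--source DIR` rather than the default upper-cased destination name `SOURCE`.
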
diff --git a/src/anemoi/training/commands/train.py b/src/anemoi/training/commands/train.py
@@ -9,7 +9,9 @@
 from __future__ import annotations
 
 import logging
+import os
 import sys
+from pathlib import Path
 from typing import TYPE_CHECKING
 
 from anemoi.training.commands import Command
@@ -30,7 +32,8 @@ def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
         return parser
 
     def run(self, args: argparse.Namespace, unknown_args: list[str] | None = None) -> None:
-
+        # This will be picked up by the logger
+        os.environ["ANEMOI_TRAINING_CMD"] = f"{sys.argv[0]} {args.command}"
         # Merge the known subcommands with a non-whitespace character for hydra
         new_sysargv = self._merge_sysargv(args)
 
@@ -40,15 +43,15 @@ def run(self, args: argparse.Namespace, unknown_args: list[str] | None = None) -
         else:
             sys.argv = [new_sysargv]
 
-        # Import and run the training command
         LOGGER.info("Running anemoi training command with overrides: %s", sys.argv[1:])
-        from anemoi.training.train.train import main as anemoi_train
-
-        anemoi_train()
+        main()
 
     def _merge_sysargv(self, args: argparse.Namespace) -> str:
         """Merge the sys.argv with the known subcommands to pass to hydra.
 
+        This is done for interactive DDP, which will spawn the rank > 0 processes from sys.argv[0]
+        and for hydra, which ingests sys.argv[1:]
+
         Parameters
         ----------
         args : argparse.Namespace
@@ -59,10 +62,26 @@ def _merge_sysargv(self, args: argparse.Namespace) -> str:
         str
             Modified sys.argv as string
         """
-        modified_sysargv = f"{sys.argv[0]} {args.command}"
+        argv = Path(sys.argv[0])
+
+        # this will turn "/env/bin/anemoi-training train" into "/env/bin/.anemoi-training-train"
+        # the dot at the beginning is intentional to not interfere with autocomplete
+        modified_sysargv = argv.with_name(f".{argv.name}-{args.command}")
+
         if hasattr(args, "subcommand"):
-            modified_sysargv += f" {args.subcommand}"
-        return modified_sysargv
+            modified_sysargv = modified_sysargv.with_name(f"{modified_sysargv.name}-{args.subcommand}")
+        return str(modified_sysargv)
+
+
+def main() -> None:
+    # Use the environment variable to check if main is being called from the subcommand, not from the ddp entrypoint
+    if not os.environ.get("ANEMOI_TRAINING_CMD"):
+        error = "This entrypoint should not be called directly. Use `anemoi-training train` instead."
+        raise RuntimeError(error)
+
+    from anemoi.training.train.train import main as anemoi_train
+
+    anemoi_train()
 
 
 command = Train
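
Review note: the entrypoint renaming and the environment-variable guard in this file can be sketched on their own. This is an illustration under stated assumptions, not the project's code: `hidden_entrypoint` and `guarded_main` are hypothetical names, and `PurePosixPath` is used so the result does not depend on the host OS. The file name is assembled as a plain string before a single `with_name` call, because `pathlib` paths do not support `+=` with a `str`.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional


def hidden_entrypoint(argv0: str, command: str, subcommand: Optional[str] = None) -> str:
    """Turn '/env/bin/anemoi-training' + 'train' into '/env/bin/.anemoi-training-train'."""
    path = PurePosixPath(argv0)
    # Build the new file name as a string first, then swap it in once
    name = f".{path.name}-{command}"
    if subcommand is not None:
        name += f"-{subcommand}"
    return str(path.with_name(name))


def guarded_main(environ: Mapping[str, str] = os.environ) -> str:
    """Refuse to run unless the parent command set ANEMOI_TRAINING_CMD."""
    if not environ.get("ANEMOI_TRAINING_CMD"):
        raise RuntimeError(
            "This entrypoint should not be called directly. Use `anemoi-training train` instead."
        )
    return environ["ANEMOI_TRAINING_CMD"]


print(hidden_entrypoint("/env/bin/anemoi-training", "train"))
```

The guard relies on the `train` subcommand exporting `ANEMOI_TRAINING_CMD` before it calls `main()`, so processes started through the dotted script inherit the variable from their parent.
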
diff --git a/src/anemoi/training/config/dataloader/native_grid.yaml b/src/anemoi/training/config/dataloader/native_grid.yaml
@@ -1,4 +1,5 @@
 prefetch_factor: 2
+pin_memory: True
 
 num_workers:
   training: 8

diff --git a/src/anemoi/training/data/datamodule.py b/src/anemoi/training/data/datamodule.py
@@ -80,6 +80,9 @@ def __init__(self, config: DictConfig) -> None:
             )
             self.config.dataloader.training.end = self.config.dataloader.validation.start - 1
 
+        if not self.config.dataloader.get("pin_memory", True):
+            LOGGER.info("Data loader memory pinning disabled.")
+
     def _check_resolution(self, resolution: str) -> None:
         assert (
             self.config.data.resolution.lower() == resolution.lower()
@@ -214,7 +217,7 @@ def _get_dataloader(self, ds: NativeGridDataset, stage: str) -> DataLoader:
             num_workers=self.config.dataloader.num_workers[stage],
             # use of pinned memory can speed up CPU-to-GPU data transfers
             # see https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-pinning
-            pin_memory=True,
+            pin_memory=self.config.dataloader.get("pin_memory", True),
             # worker initializer
             worker_init_fn=worker_init_func,
             # prefetch batches

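Review note: the `pin_memory` change reads an optional key with a default of `True`, so existing configs without the key keep the previous behaviour. A minimal sketch of that lookup follows, assuming a plain `dict` in place of the `DictConfig` used in the diff; `dataloader_kwargs` is a hypothetical helper, not a function from the repository.

```python
from typing import Any, Dict, Mapping


def dataloader_kwargs(dataloader_cfg: Mapping[str, Any], stage: str) -> Dict[str, Any]:
    """Collect DataLoader keyword arguments from a config-like mapping."""
    return {
        "num_workers": dataloader_cfg["num_workers"][stage],
        "prefetch_factor": dataloader_cfg["prefetch_factor"],
        # Missing key falls back to True, matching the previously hard-coded value
        "pin_memory": dataloader_cfg.get("pin_memory", True),
    }


legacy_cfg = {"prefetch_factor": 2, "num_workers": {"training": 8}}
explicit_cfg = {**legacy_cfg, "pin_memory": False}

print(dataloader_kwargs(legacy_cfg, "training"))
print(dataloader_kwargs(explicit_cfg, "training"))
```
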
diff --git a/src/anemoi/training/diagnostics/callbacks/__init__.py b/src/anemoi/training/diagnostics/callbacks/__init__.py
@@ -139,6 +139,13 @@ def teardown(self, trainer: pl.Trainer, pl_module: pl.LightningModule, stage: st
         if self._executor is not None:
             self._executor.shutdown(wait=True)
 
+    def apply_output_mask(self, pl_module: pl.LightningModule, data: torch.Tensor) -> torch.Tensor:
+        if hasattr(pl_module, "output_mask") and pl_module.output_mask is not None:
+            # Fill with NaN values where the mask is False
+            data[:, :, ~pl_module.output_mask, :] = np.nan
+
+        return data
+
     @abstractmethod
     @rank_zero_only
     def _plot(
@@ -684,7 +691,7 @@ def _plot(
 
         output_tensor = pl_module.output_mask.apply(output_tensor, dim=2, fill_value=np.nan).numpy()
         data[1:, ...] = pl_module.output_mask.apply(data[1:, ...], dim=2, fill_value=np.nan)
-        data = data.numpy().squeeze()
+        data = data.numpy()
 
         for rollout_step in range(pl_module.rollout):
             fig = plot_predicted_multilevel_flat_sample(
@@ -693,8 +700,8 @@ def _plot(
                 self.latlons,
                 self.config.diagnostics.plot.accumulation_levels_plot,
                 self.config.diagnostics.plot.cmap_accumulation,
-                data[0, ...],
-                data[rollout_step + 1, ...],
+                data[0, ...].squeeze(),
+                data[rollout_step + 1, ...].squeeze(),
                 output_tensor[rollout_step, ...],
                 precip_and_related_fields=self.precip_and_related_fields,
             )
@@ -788,7 +795,7 @@ def _plot(
 
         output_tensor = pl_module.output_mask.apply(output_tensor, dim=2, fill_value=np.nan).numpy()
         data[1:, ...] = pl_module.output_mask.apply(data[1:, ...], dim=2, fill_value=np.nan)
-        data = data.numpy().squeeze()
+        data = data.numpy()
 
         for rollout_step in range(pl_module.rollout):
             if len(self.parameters_histogram) > 0:
@@ -801,8 +808,8 @@ def _plot(
 
                 fig = plot_histogram(
                     plot_parameters_dict_histogram,
-                    data[0, ...],
-                    data[rollout_step + 1, ...],
+                    data[0, ...].squeeze(),
+                    data[rollout_step + 1, ...].squeeze(),
                     output_tensor[rollout_step, ...],
                     precip_and_related_fields=self.precip_and_related_fields,
                 )
@@ -826,8 +833,8 @@ def _plot(
                 fig = plot_power_spectrum(
                     plot_parameters_dict_spectrum,
                     self.latlons,
-                    data[0, ...],
-                    data[rollout_step + 1, ...],
+                    data[0, ...].squeeze(),
+                    data[rollout_step + 1, ...].squeeze(),
                     output_tensor[rollout_step, ...],
                 )