Skip to content

Commit

Permalink
Merge remote-tracking branch 'upstream/feature/limited_area' into lam…
Browse files Browse the repository at this point in the history
…-sub-hourly
  • Loading branch information
OpheliaMiralles committed Oct 22, 2024
2 parents af9685d + bd28e67 commit af47bce
Show file tree
Hide file tree
Showing 20 changed files with 317 additions and 103 deletions.
6 changes: 3 additions & 3 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# CODEOWNERS file

# Protect workflow files
/.github/ @theissenhelen @jesperdramsch @gmertes
/.pre-commit-config.yaml @theissenhelen @jesperdramsch @gmertes
/pyproject.toml @theissenhelen @jesperdramsch @gmertes
/.github/ @theissenhelen @jesperdramsch @gmertes @b8raoult @floriankrb @anaprietonem @HCookie @JPXKQX @mchantry
/.pre-commit-config.yaml @theissenhelen @jesperdramsch @gmertes @b8raoult @floriankrb @anaprietonem @HCookie @JPXKQX @mchantry
/pyproject.toml @theissenhelen @jesperdramsch @gmertes @b8raoult @floriankrb @anaprietonem @HCookie @JPXKQX @mchantry
10 changes: 5 additions & 5 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -5,12 +5,12 @@ repos:
- id: clear-notebooks-output
name: clear-notebooks-output
files: tools/.*\.ipynb$
stages: [commit]
stages: [pre-commit]
language: python
entry: jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace
additional_dependencies: [jupyter]
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.6.0
rev: v5.0.0
hooks:
- id: check-yaml # Check YAML files for syntax errors only
args: [--unsafe, --allow-multiple-documents]
Expand Down Expand Up @@ -40,7 +40,7 @@ repos:
- --force-single-line-imports
- --profile black
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.6.4
rev: v0.6.9
hooks:
- id: ruff
# Next line if for documenation cod snippets
Expand All @@ -66,11 +66,11 @@ repos:
- id: docconvert
args: ["numpy"]
- repo: https://github.com/tox-dev/pyproject-fmt
rev: "2.2.3"
rev: "2.2.4"
hooks:
- id: pyproject-fmt
- repo: https://github.com/jshwi/docsig # Check docstrings against function sig
rev: v0.60.1
rev: v0.64.0
hooks:
- id: docsig
args:
Expand Down
23 changes: 22 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
Please add your functional changes to the appropriate section in the PR.
Keep it human-readable, your future self will thank you!

## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.1.0...HEAD)
## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.2.0...HEAD)

### Added

- Rollout training for Limited Area Models. [#79](https://github.com/ecmwf/anemoi-training/pulls/79)
- Feature: New `Boolean1DMask` class. Enables rollout training for limited area models. [#79](https://github.com/ecmwf/anemoi-training/pulls/79)

### Fixed
### Changed

## [0.2.0 - Feature release](https://github.com/ecmwf/anemoi-training/compare/0.1.0...0.2.0) - 2024-10-16

- Make pin_memory of the Dataloader configurable (#64)

### Added

- Add anemoi-transform link to documentation
- Codeowners file (#56)
- Changelog merge strategy (#56)
- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)
Expand All @@ -25,7 +39,9 @@ Keep it human-readable, your future self will thank you!
- Enforce same binning for histograms comparing true data to predicted data
- Fix: Inference checkpoints are now saved according the frequency settings defined in the config [#37](https://github.com/ecmwf/anemoi-training/pull/37)
- Feature: Add configurable models [#50](https://github.com/ecmwf/anemoi-training/pulls/50)
- Feature: Authentication support for mlflow sync - [#51](https://github.com/ecmwf/anemoi-training/pull/51)
- Feature: Support training for datasets with missing time steps [#48](https://github.com/ecmwf/anemoi-training/pulls/48)
- Feature: `AnemoiMlflowClient`, an mlflow client with authentication support [#86](https://github.com/ecmwf/anemoi-training/pull/86)
- Long Rollout Plots

### Fixed
Expand All @@ -34,10 +50,14 @@ Keep it human-readable, your future self will thank you!
- Bugfixes for CI (#56)
- Fix `mlflow` subcommand on python 3.9 [#62](https://github.com/ecmwf/anemoi-training/pull/62)
- Show correct subcommand in MLFlow - Addresses [#39](https://github.com/ecmwf/anemoi-training/issues/39) in [#61](https://github.com/ecmwf/anemoi-training/pull/61)
- Fix interactive multi-GPU training [#82](https://github.com/ecmwf/anemoi-training/pull/82)
- Allow 500 characters in mlflow logging [#88](https://github.com/ecmwf/anemoi-training/pull/88)

### Changed

- Updated configuration examples in documentation and corrected links - [#46](https://github.com/ecmwf/anemoi-training/pull/46)
- Remove credential prompt from mlflow login, replace with seed refresh token via web - [#78](https://github.com/ecmwf/anemoi-training/pull/78)
- Update CODEOWNERS

## [0.1.0 - Anemoi training - First release](https://github.com/ecmwf/anemoi-training/releases/tag/0.1.0) - 2024-08-16

Expand All @@ -51,6 +71,7 @@ Keep it human-readable, your future self will thank you!
- Subcommand for checkpoint handling

#### Functionality

- Searchpaths for Hydra configs, to enable configs in CWD, `ANEMOI_CONFIG_PATH` env, and `.config/anemoi/training` in addition to package defaults
- MlFlow token authentication
- Configurable pressure level scaling
Expand Down
4 changes: 4 additions & 0 deletions docs/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,10 @@
"https://anemoi-registry.readthedocs.io/en/latest/",
("../../anemoi-registry/docs/_build/html/objects.inv", None),
),
"anemoi-transform": (
"https://anemoi-transform.readthedocs.io/en/latest/",
("../../anemoi-transform/docs/_build/html/objects.inv", None),
),
}

# -- Options for HTML output -------------------------------------------------
Expand Down
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,7 @@ This package provides the *Anemoi* training functionality.
*****************

- :ref:`anemoi-utils <anemoi-utils:index-page>`
- :ref:`anemoi-transform <anemoi-transform:index-page>`
- :ref:`anemoi-datasets <anemoi-datasets:index-page>`
- :ref:`anemoi-models <anemoi-models:index-page>`
- :ref:`anemoi-graphs <anemoi-graphs:index-page>`
Expand Down
4 changes: 4 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,10 @@ urls.Documentation = "https://anemoi-training.readthedocs.io/"
urls.Homepage = "https://github.com/ecmwf/anemoi-training/"
urls.Issues = "https://github.com/ecmwf/anemoi-training/issues"
urls.Repository = "https://github.com/ecmwf/anemoi-training/"
# command for interactive DDP (not supposed to be used directly)
# the dot is intentional, so it doesn't trigger autocomplete
scripts.".anemoi-training-train" = "anemoi.training.commands.train:main"

# Add subcommand in the `commands` directory
scripts.anemoi-training = "anemoi.training.__main__:main"

Expand Down
28 changes: 27 additions & 1 deletion src/anemoi/training/commands/mlflow.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,23 +45,39 @@ def add_arguments(command_parser: argparse.ArgumentParser) -> None:
"--source",
"-s",
help="The MLflow logs source directory.",
metavar="DIR",
required=True,
default=argparse.SUPPRESS,
)
sync.add_argument(
"--destination",
"-d",
help="The destination MLflow tracking URI.",
metavar="URI",
required=True,
default=argparse.SUPPRESS,
)
sync.add_argument(
"--run-id",
"-r",
help="The run ID to sync.",
metavar="ID",
required=True,
default=argparse.SUPPRESS,
)
sync.add_argument("--run-id", "-r", help="The run ID to sync.", required=True, default=argparse.SUPPRESS)
sync.add_argument(
"--experiment-name",
"-e",
help="The experiment name to sync to.",
metavar="NAME",
default="anemoi-debug",
)
sync.add_argument(
"--authentication",
"-a",
action="store_true",
help="The destination server requires authentication.",
)
sync.add_argument(
"--export-deleted-runs",
"-x",
Expand All @@ -88,8 +104,18 @@ def run(args: argparse.Namespace) -> None:
return

if args.subcommand == "sync":
from anemoi.training.diagnostics.mlflow.utils import health_check
from anemoi.training.utils.mlflow_sync import MlFlowSync

if args.authentication:
from anemoi.training.diagnostics.mlflow.auth import TokenAuth

auth = TokenAuth(url=args.destination)
auth.login()
auth.authenticate()

health_check(args.destination)

log_level = "DEBUG" if args.verbose else "INFO"

MlFlowSync(
Expand Down
35 changes: 27 additions & 8 deletions src/anemoi/training/commands/train.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,9 @@
from __future__ import annotations

import logging
import os
import sys
from pathlib import Path
from typing import TYPE_CHECKING

from anemoi.training.commands import Command
Expand All @@ -30,7 +32,8 @@ def add_arguments(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
return parser

def run(self, args: argparse.Namespace, unknown_args: list[str] | None = None) -> None:

# This will be picked up by the logger
os.environ["ANEMOI_TRAINING_CMD"] = f"{sys.argv[0]} {args.command}"
# Merge the known subcommands with a non-whitespace character for hydra
new_sysargv = self._merge_sysargv(args)

Expand All @@ -40,15 +43,15 @@ def run(self, args: argparse.Namespace, unknown_args: list[str] | None = None) -
else:
sys.argv = [new_sysargv]

# Import and run the training command
LOGGER.info("Running anemoi training command with overrides: %s", sys.argv[1:])
from anemoi.training.train.train import main as anemoi_train

anemoi_train()
main()

def _merge_sysargv(self, args: argparse.Namespace) -> str:
"""Merge the sys.argv with the known subcommands to pass to hydra.
This is done for interactive DDP, which will spawn the rank > 0 processes from sys.argv[0]
and for hydra, which ingests sys.argv[1:]
Parameters
----------
args : argparse.Namespace
Expand All @@ -59,10 +62,26 @@ def _merge_sysargv(self, args: argparse.Namespace) -> str:
str
Modified sys.argv as string
"""
modified_sysargv = f"{sys.argv[0]} {args.command}"
argv = Path(sys.argv[0])

# this will turn "/env/bin/anemoi-training train" into "/env/bin/.anemoi-training-train"
# the dot at the beginning is intentional to not interfere with autocomplete
modified_sysargv = argv.with_name(f".{argv.name}-{args.command}")

if hasattr(args, "subcommand"):
modified_sysargv += f" {args.subcommand}"
return modified_sysargv
modified_sysargv += f"-{args.subcommand}"
return str(modified_sysargv)


def main() -> None:
# Use the environment variable to check if main is being called from the subcommand, not from the ddp entrypoint
if not os.environ.get("ANEMOI_TRAINING_CMD"):
error = "This entrypoint should not be called directly. Use `anemoi-training train` instead."
raise RuntimeError(error)

from anemoi.training.train.train import main as anemoi_train

anemoi_train()


command = Train
1 change: 1 addition & 0 deletions src/anemoi/training/config/dataloader/native_grid.yaml
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
prefetch_factor: 2
pin_memory: True

num_workers:
training: 8
Expand Down
5 changes: 4 additions & 1 deletion src/anemoi/training/data/datamodule.py
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,9 @@ def __init__(self, config: DictConfig) -> None:
)
self.config.dataloader.training.end = self.config.dataloader.validation.start - 1

if not self.config.dataloader.get("pin_memory", True):
LOGGER.info("Data loader memory pinning disabled.")

def _check_resolution(self, resolution: str) -> None:
assert (
self.config.data.resolution.lower() == resolution.lower()
Expand Down Expand Up @@ -214,7 +217,7 @@ def _get_dataloader(self, ds: NativeGridDataset, stage: str) -> DataLoader:
num_workers=self.config.dataloader.num_workers[stage],
# use of pinned memory can speed up CPU-to-GPU data transfers
# see https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-pinning
pin_memory=True,
pin_memory=self.config.dataloader.get("pin_memory", True),
# worker initializer
worker_init_fn=worker_init_func,
# prefetch batches
Expand Down
23 changes: 15 additions & 8 deletions src/anemoi/training/diagnostics/callbacks/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -139,6 +139,13 @@ def teardown(self, trainer: pl.Trainer, pl_module: pl.LightningModule, stage: st
if self._executor is not None:
self._executor.shutdown(wait=True)

def apply_output_mask(self, pl_module: pl.LightningModule, data: torch.Tensor) -> torch.Tensor:
if hasattr(pl_module, "output_mask") and pl_module.output_mask is not None:
# Fill with NaNs values where the mask is False
data[:, :, ~pl_module.output_mask, :] = np.nan

return data

@abstractmethod
@rank_zero_only
def _plot(
Expand Down Expand Up @@ -684,7 +691,7 @@ def _plot(

output_tensor = pl_module.output_mask.apply(output_tensor, dim=2, fill_value=np.nan).numpy()
data[1:, ...] = pl_module.output_mask.apply(data[1:, ...], dim=2, fill_value=np.nan)
data = data.numpy().squeeze()
data = data.numpy()

for rollout_step in range(pl_module.rollout):
fig = plot_predicted_multilevel_flat_sample(
Expand All @@ -693,8 +700,8 @@ def _plot(
self.latlons,
self.config.diagnostics.plot.accumulation_levels_plot,
self.config.diagnostics.plot.cmap_accumulation,
data[0, ...],
data[rollout_step + 1, ...],
data[0, ...].squeeze(),
data[rollout_step + 1, ...].squeeze(),
output_tensor[rollout_step, ...],
precip_and_related_fields=self.precip_and_related_fields,
)
Expand Down Expand Up @@ -788,7 +795,7 @@ def _plot(

output_tensor = pl_module.output_mask.apply(output_tensor, dim=2, fill_value=np.nan).numpy()
data[1:, ...] = pl_module.output_mask.apply(data[1:, ...], dim=2, fill_value=np.nan)
data = data.numpy().squeeze()
data = data.numpy()

for rollout_step in range(pl_module.rollout):
if len(self.parameters_histogram) > 0:
Expand All @@ -801,8 +808,8 @@ def _plot(

fig = plot_histogram(
plot_parameters_dict_histogram,
data[0, ...],
data[rollout_step + 1, ...],
data[0, ...].squeeze(),
data[rollout_step + 1, ...].squeeze(),
output_tensor[rollout_step, ...],
precip_and_related_fields=self.precip_and_related_fields,
)
Expand All @@ -826,8 +833,8 @@ def _plot(
fig = plot_power_spectrum(
plot_parameters_dict_spectrum,
self.latlons,
data[0, ...],
data[rollout_step + 1, ...],
data[0, ...].squeeze(),
data[rollout_step + 1, ...].squeeze(),
output_tensor[rollout_step, ...],
)

Expand Down
Loading

0 comments on commit af47bce

Please sign in to comment.