diff --git a/README.md b/README.md
index 11b707bba..a83f7ab40 100644
--- a/README.md
+++ b/README.md
@@ -80,7 +80,7 @@ To change parameters specific to this experiment like switching from `fp16` to `
 ```bash
 graphium-train dataset=toymix model=gcn trainer.trainer.precision=32
 ```
-or change them permamently in the dedicated experiment config under `expts/hydra-configs/toymix_gcn.yaml`.
+or change them permanently in the dedicated experiment config under `expts/hydra-configs/toymix_gcn.yaml`.
 Integrating `hydra` also allows you to quickly switch between accelerators. E.g., running
 ```bash
 graphium-train dataset=toymix model=gcn accelerator=gpu
diff --git a/docs/baseline.md b/docs/baseline.md
index 029996554..cac1ee282 100644
--- a/docs/baseline.md
+++ b/docs/baseline.md
@@ -4,6 +4,8 @@ From the paper to be released soon. Below, you can see the baselines for the `To
 One can observe that the smaller datasets (`Zinc12k` and `Tox21`) beneficiate from adding another unrelated task (`QM9`), where the labels are computed from DFT simulations.
 
+**NEW baselines added 2023/09/18**: Multitask baselines have been added for GatedGCN and MPNN++ (sum aggregator) using 3 random seeds. They achieve the best performance by a significant margin on Zinc12k and Tox21, while sacrificing a little on QM9.
+
 | Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
 |-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
 | | Single-Task Model Multi-Task Model |
 |
@@ -11,18 +13,24 @@ One can observe that the smaller datasets (`Zinc12k` and `Tox21`) beneficiate fr
 | **QM9** | GCN | 0.102 ± 0.0003 | 0.958 ± 0.0007 | 0.920 ± 0.002 | 0.119 ± 0.01 | 0.955 ± 0.001 | 0.915 ± 0.001 |
 | | GIN | 0.0976 ± 0.0006 | **0.959 ± 0.0002** | **0.922 ± 0.0004** | 0.117 ± 0.01 | 0.950 ± 0.002 | 0.908 ± 0.003 |
 | | GINE | **0.0959 ± 0.0002** | 0.955 ± 0.002 | 0.918 ± 0.004 | 0.102 ± 0.01 | 0.956 ± 0.0009 | 0.918 ± 0.002 |
-|
-| **Zinc12k** | GCN | 0.348 ± 0.02 | 0.941 ± 0.002 | 0.863 ± 0.01 | 0.226 ± 0.004 | 0.973 ± 0.0005 | 0.940 ± 0.003 |
+| | GatedGCN | | | | 0.1212 ± 0.0009 | 0.9457 ± 0.0002 | 0.8964 ± 0.0006 |
+| | MPNN++ (sum) | | | | 0.1174 ± 0.0012 | 0.9460 ± 0.0005 | 0.8989 ± 0.0008 |
+| **Zinc12k** | GCN | 0.348 ± 0.02 | 0.941 ± 0.002 | 0.863 ± 0.01 | 0.226 ± 0.004 | 0.973 ± 0.0005 | 0.940 ± 0.003 |
 | | GIN | 0.303 ± 0.007 | 0.950 ± 0.003 | 0.889 ± 0.003 | 0.189 ± 0.004 | 0.978 ± 0.006 | 0.953 ± 0.002 |
-| | GINE | 0.266 ± 0.02 | 0.961 ± 0.003 | 0.915 ± 0.01 | **0.147 ± 0.009** | **0.987 ± 0.001** | **0.971 ± 0.003** |
+| | GINE | 0.266 ± 0.02 | 0.961 ± 0.003 | 0.915 ± 0.01 | 0.147 ± 0.009 | 0.987 ± 0.001 | 0.971 ± 0.003 |
+| | GatedGCN | | | | 0.1282 ± 0.0045 | 0.9850 ± 0.0006 | 0.9639 ± 0.0024 |
+| | MPNN++ (sum) | | | | **0.1002 ± 0.0025** | **0.9909 ± 0.0004** | **0.9777 ± 0.0014** |
 
 | | | BCE ↓ | AUROC ↑ | AP ↑ | BCE ↓ | AUROC ↑ | AP ↑ |
 |-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
 | | Single-Task Model Multi-Task Model |
 |
-| **Tox21** | GCN | 0.202 ± 0.005 | 0.773 ± 0.006 | 0.334 ± 0.03 | **0.176 ± 0.001** | **0.850 ± 0.006** | 0.446 ± 0.01 |
+| **Tox21** | GCN | 0.202 ± 0.005 | 0.773 ± 0.006 | 0.334 ± 0.03 | 0.176 ± 0.001 | 0.850 ± 0.006 | 0.446 ± 0.01 |
 | | GIN | 0.200 ± 0.002 | 0.789 ± 0.009 | 0.350 ± 0.01 | 0.176 ± 0.001 | 0.841 ± 0.005 | 0.454 ± 0.009 |
-| | GINE | 0.201 ± 0.007 | 0.783 ± 0.007 | 0.345 ± 0.02 | 0.177 ± 0.0008 | 0.836 ± 0.004 | **0.455 ± 0.008** |
+| | GINE | 0.201 ± 0.007 | 0.783 ± 0.007 | 0.345 ± 0.02 | 0.177 ± 0.0008 | 0.836 ± 0.004 | 0.455 ± 0.008 |
+| | GatedGCN | | | | 0.1733 ± 0.0015 | 0.8522 ± 0.0022 | **0.4620 ± 0.0118** |
+| | MPNN++ (sum) | | | | **0.1725 ± 0.0012** | **0.8569 ± 0.0005** | 0.4598 ± 0.0044 |
+
 
 # LargeMix Baseline
 ## LargeMix test set metrics
@@ -88,6 +96,40 @@ This is not surprising as they contain two orders of magnitude more datapoints a
 | | GIN | 0.1873 ± 0.0033 | **0.1701 ± 0.0142** |
 | | GINE | 0.1883 ± 0.0039 | **0.1771 ± 0.0010** |
 
+## NEW: LargeMix improved sweep - 2023/08/18
+
+Unsatisfied with the prior results, we ran a Bayesian search over a broader set of parameters, including only the more expressive models, namely GINE, GatedGCN and MPNN++. We further increased the number of parameters to 10M due to evidence of underfitting. We evaluate only the multitask setting.
+
+We observe a significant improvement over all tasks, with a very notable R²-score increase of +0.53 (0.27 -> 0.80) compared to the best node-level property prediction on PCQM4M_N4.
+
+The results are reported below over 1 seed. We are currently running more seeds of the same models.
+
+| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ |
+|---------------|----------------|--------|---------|--------|
+| **PCQM4M_G25** | GINE | 0.2250 | 0.8840 | 0.7911 |
+| | GatedGCN | 0.2457 | 0.8698 | 0.7688 |
+| | MPNN++ (sum) | 0.2269 | 0.8802 | 0.7855 |
+|
+| **PCQM4M_N4** | GINE | 0.2699 | 0.8475 | 0.7182 |
+| | GatedGCN | 0.3337 | 0.8102 | 0.6566 |
+| | MPNN++ (sum) | 0.2114 | 0.8942 | 0.8000 |
+
+| Dataset | Model | BCE ↓ | AUROC ↑ | AP ↑ |
+|---------------|----------------|--------|---------|--------|
+| **PCBA_1328** | GINE | 0.0334 | 0.7879 | 0.2808 |
+| | GatedGCN | 0.0351 | 0.7788 | 0.2611 |
+| | MPNN++ (sum) | 0.0344 | 0.7815 | 0.2666 |
+|
+| **L1000_VCAP** | GINE | 0.1907 | 0.6416 | 0.4042 |
+| | GatedGCN | 0.1866 | 0.6395 | 0.4092 |
+| | MPNN++ (sum) | 0.1867 | 0.6478 | 0.4131 |
+|
+| **L1000_MCF7** | GINE | 0.1931 | 0.6352 | 0.4235 |
+| | GatedGCN | 0.1859 | 0.6547 | 0.4224 |
+| | MPNN++ (sum) | 0.1870 | 0.6593 | 0.4254 |
+
+
+
 # UltraLarge Baseline
 
 ## UltraLarge test set metrics
diff --git a/docs/datasets.md b/docs/datasets.md
index fc4e0f292..6733736f4 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -1,6 +1,8 @@
 # Graphium Datasets
 
-Graphium datasets are hosted at on Zenodo on [this link](https://zenodo.org/record/8206704).
+Graphium datasets are hosted on Zenodo:
+- ***ToyMix*** and ***LargeMix*** datasets are hosted at [this link](https://doi.org/10.5281/zenodo.7998401)
+- ***UltraLarge*** dataset is hosted at [this link](https://doi.org/10.5281/zenodo.8370547)
 
 Instead of provinding datasets as a single entity, our aim is to provide dataset mixes containing a variety of datasets that are meant to be predicted simultaneously using multi-tasking.
diff --git a/expts/hydra-configs/accelerator/ipu_pipeline.yaml b/expts/hydra-configs/accelerator/ipu_pipeline.yaml
new file mode 100644
index 000000000..996218646
--- /dev/null
+++ b/expts/hydra-configs/accelerator/ipu_pipeline.yaml
@@ -0,0 +1,22 @@
+type: ipu
+ipu_config:
+  - deviceIterations(60) # IPU would require large batches to be ready for the model.
+ # 60 for PCQM4mv2 + # 30 for largemix + - replicationFactor(4) + # - enableProfiling("graph_analyser") # The folder where the profile will be stored + # - enableExecutableCaching("pop_compiler_cache") + - TensorLocations.numIOTiles(128) + - _Popart.set("defaultBufferingDepth", 96) + - Precision.enableStochasticRounding(True) + +ipu_inference_config: + # set device iteration and replication factor to 1 during inference + # gradient accumulation was set to 1 in the code + - deviceIterations(60) + - replicationFactor(1) + - Precision.enableStochasticRounding(False) + +accelerator_kwargs: + _accelerator: "ipu" + gnn_layers_per_ipu: [4, 4, 4, 4] \ No newline at end of file diff --git a/expts/hydra-configs/tasks/loss_metrics_datamodule/admet.yaml b/expts/hydra-configs/tasks/loss_metrics_datamodule/admet.yaml index cfff5f689..89176f2b6 100644 --- a/expts/hydra-configs/tasks/loss_metrics_datamodule/admet.yaml +++ b/expts/hydra-configs/tasks/loss_metrics_datamodule/admet.yaml @@ -80,7 +80,7 @@ metrics: target_nan_mask: null multitask_handling: mean-per-label - name: r2_score - metric: r2 + metric: r2_score target_nan_mask: null multitask_handling: mean-per-label threshold_kwargs: null @@ -138,4 +138,4 @@ datamodule: args: # TDC specific tdc_benchmark_names: null - tdc_train_val_seed: ${constants.seed} \ No newline at end of file + tdc_train_val_seed: ${constants.seed} diff --git a/graphium/cli/train_finetune_test.py b/graphium/cli/train_finetune_test.py index 5d9a29b97..1be59e652 100644 --- a/graphium/cli/train_finetune_test.py +++ b/graphium/cli/train_finetune_test.py @@ -6,6 +6,7 @@ import fsspec import hydra +import numpy as np import torch import wandb import yaml @@ -42,6 +43,8 @@ TESTING_ONLY_CONFIG_KEY = "testing_only" +OmegaConf.register_new_resolver("eval", lambda x: eval(x, {"np": np})) + @hydra.main(version_base=None, config_path="../../expts/hydra-configs", config_name="main") def cli(cfg: DictConfig) -> None: @@ -51,13 +54,78 @@ def cli(cfg: DictConfig) -> None: return run_training_finetuning_testing(cfg) +def get_replication_factor(cfg): + try: + ipu_config = cfg.get("accelerator", {}).get("ipu_config", []) + for item in ipu_config: + if "replicationFactor" in item: + # Extract the number between parentheses + start = item.find("(") + 1 + end = item.find(")") + if start != 0 and end != -1: + return int(item[start:end]) + except Exception as e: + print(f"An error occurred: {e}") + + # Return default value if replicationFactor is not found or an error occurred + return 1 + + +def get_gradient_accumulation_factor(cfg): + try: + # Navigate through the nested dictionaries and get the gradient accumulation factor + grad_accumulation_factor = ( + cfg.get("accelerator", {}) + .get("config_override", {}) + .get("trainer", {}) + .get("trainer", {}) + .get("accumulate_grad_batches", 1) + ) + + # Ensure that the extracted value is an integer + return int(grad_accumulation_factor) + except Exception as e: + print(f"An error occurred: {e}") + + # Return default value if an error occurred + return 1 + + +def get_training_batch_size(cfg): + try: + # Navigate through the nested dictionaries and get the training batch size + batch_size_training = ( + cfg.get("accelerator", {}) + .get("config_override", {}) + .get("datamodule", {}) + .get("args", {}) + .get("batch_size_training", 1) + ) + + # Ensure that the extracted value is an integer + return int(batch_size_training) + except Exception as e: + print(f"An error occurred: {e}") + + # Return default value if an error occurred + return 1 + + def 
run_training_finetuning_testing(cfg: DictConfig) -> None: """ The main (pre-)training and fine-tuning loop. """ + unresolved_cfg = OmegaConf.to_container(cfg, resolve=False) cfg = OmegaConf.to_container(cfg, resolve=True) + # Get the current date and time + now = datetime.now() + # Format the datetime as a string + filename_datetime_suffix = now.strftime("%Y%m%d_%H%M%S") + # Append the datetime string to the existing filename in the cfg dictionary + cfg["trainer"]["model_checkpoint"]["filename"] += f"_{filename_datetime_suffix}" + dst_dir = cfg["constants"].get("results_dir") hydra_cfg = HydraConfig.get() output_dir = hydra_cfg["runtime"]["output_dir"] @@ -75,6 +143,12 @@ def run_training_finetuning_testing(cfg: DictConfig) -> None: st = timeit.default_timer() + replicas = get_replication_factor(cfg) + gradient_acc = get_gradient_accumulation_factor(cfg) + micro_bs = get_training_batch_size(cfg) + + global_bs = replicas * gradient_acc * micro_bs + # Disable wandb if the user is not logged in. wandb_cfg = cfg["constants"].get("wandb") if wandb_cfg is not None and wandb.login() is False: @@ -119,6 +193,9 @@ def run_training_finetuning_testing(cfg: DictConfig) -> None: accelerator_type=accelerator_type, featurization=datamodule.featurization, task_norms=datamodule.task_norms, + replicas=replicas, + gradient_acc=gradient_acc, + global_bs=global_bs, ) logger.info(predictor.model) @@ -135,7 +212,7 @@ def run_training_finetuning_testing(cfg: DictConfig) -> None: trainer.callbacks.append(GraphFinetuning(**finetuning_training_kwargs)) if wandb_cfg is not None: - save_params_to_wandb(trainer.logger, cfg, predictor, datamodule) + save_params_to_wandb(trainer.logger, cfg, predictor, datamodule, unresolved_config=unresolved_cfg) # Determine the max num nodes and edges in training and validation logger.info("Computing the maximum number of nodes and edges per graph") @@ -173,6 +250,11 @@ def run_training_finetuning_testing(cfg: DictConfig) -> None: logger.info("-" * 50) if wandb_cfg is not None: + # Save initial model state - and upload checkpoint to wandb + if cfg["trainer"]["model_checkpoint"]["save_last"] is True: + checkpoint_path = f"{cfg['trainer']['model_checkpoint']['dirpath']}{cfg['trainer']['model_checkpoint']['filename']}.ckpt" + # Log the initial model checkpoint to wandb + wandb.save(checkpoint_path) wandb.finish() # Save test metrics - Base utility in case someone doesn't use a logger. 
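A note for readers of this part of the patch: the three new helpers (`get_replication_factor`, `get_gradient_accumulation_factor`, `get_training_batch_size`) exist only to recover one number, the effective global batch size, which is then used to drive the `samples_seen` counter in the predictor. The sketch below is not part of the patch; the config values and the `parse_replication_factor` helper are hypothetical simplifications of the same arithmetic, shown only for illustration.

```python
# Minimal sketch (not part of the patch) of how the global batch size is derived.
# The ipu_config fragment and the numbers below are hypothetical.

def parse_replication_factor(ipu_config: list) -> int:
    """Extract N from an entry like 'replicationFactor(N)', defaulting to 1."""
    for item in ipu_config:
        if "replicationFactor" in item:
            return int(item[item.find("(") + 1 : item.find(")")])
    return 1

ipu_config = ["deviceIterations(60)", "replicationFactor(4)"]  # hypothetical config
replicas = parse_replication_factor(ipu_config)  # -> 4
gradient_acc = 2   # e.g. trainer.trainer.accumulate_grad_batches
micro_bs = 32      # e.g. datamodule.args.batch_size_training

# Mirrors `global_bs = replicas * gradient_acc * micro_bs` in the training script;
# this is the amount `samples_seen` is incremented by at every training step.
global_bs = replicas * gradient_acc * micro_bs
print(global_bs)  # 256
```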
diff --git a/graphium/config/_loader.py b/graphium/config/_loader.py index 48d7e9078..259c61e34 100644 --- a/graphium/config/_loader.py +++ b/graphium/config/_loader.py @@ -13,7 +13,7 @@ # Lightning from lightning import Trainer -from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint +from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint, LearningRateMonitor from lightning.pytorch.loggers import Logger, WandbLogger from loguru import logger @@ -76,7 +76,6 @@ def _get_ipu_opts(config: Union[omegaconf.DictConfig, Dict[str, Any]]) -> Tuple[ if accelerator_type != "ipu": return None, None - ipu_opts = accelerator_options["ipu_config"] ipu_inference_opts = accelerator_options.get("ipu_inference_config", None) @@ -126,6 +125,7 @@ def load_datamodule( ipu_inference_opts=ipu_inference_opts, precision=config["trainer"]["trainer"].get("precision"), ) + # Define the Dataloader options for the IPU on the training sets bz_train = cfg_data["batch_size_training"] ipu_dataloader_training_opts = IPUDataloaderOptions( @@ -261,6 +261,10 @@ def load_architecture( graph_output_nn_kwargs=graph_output_nn_kwargs, task_heads_kwargs=task_heads_kwargs, ) + # Get accelerator_kwargs if they exist + accelerator_kwargs = config["accelerator"].get("accelerator_kwargs", None) + if accelerator_kwargs is not None: + model_kwargs["accelerator_kwargs"] = accelerator_kwargs if model_class is FullGraphFinetuningNetwork: finetuning_head_kwargs = config["finetuning"].pop("finetuning_head", None) @@ -286,6 +290,9 @@ def load_predictor( accelerator_type: str, featurization: Dict[str, str] = None, task_norms: Optional[Dict[Callable, Any]] = None, + replicas: int = 1, + gradient_acc: int = 1, + global_bs: int = 1, ) -> PredictorModule: """ Defining the predictor module, which handles the training logic from `lightning.LighningModule` @@ -311,6 +318,9 @@ def load_predictor( task_levels=task_levels, featurization=featurization, task_norms=task_norms, + replicas=replicas, + gradient_acc=gradient_acc, + global_bs=global_bs, **cfg_pred, ) @@ -327,6 +337,9 @@ def load_predictor( task_levels=task_levels, featurization=featurization, task_norms=task_norms, + replicas=replicas, + gradient_acc=gradient_acc, + global_bs=global_bs, **cfg_pred, ) @@ -415,13 +428,18 @@ def load_trainer( if "model_checkpoint" in cfg_trainer.keys(): callbacks.append(ModelCheckpoint(**cfg_trainer["model_checkpoint"])) + if "learning_rate_monitor" in cfg_trainer.keys(): + callbacks.append(LearningRateMonitor(**cfg_trainer["learning_rate_monitor"])) + else: + callbacks.append(LearningRateMonitor()) + # Define the logger parameters wandb_cfg = config["constants"].get("wandb") if wandb_cfg is not None: name = wandb_cfg.pop("name", "main") if len(date_time_suffix) > 0: name += f"_{date_time_suffix}" - trainer_kwargs["logger"] = WandbLogger(name=name, **wandb_cfg) + trainer_kwargs["logger"] = WandbLogger(name=name, log_model=True, **wandb_cfg) trainer_kwargs["callbacks"] = callbacks trainer = Trainer( @@ -440,6 +458,7 @@ def save_params_to_wandb( config: Union[omegaconf.DictConfig, Dict[str, Any]], predictor: PredictorModule, datamodule: MultitaskFromSmilesDataModule, + unresolved_config: Optional[Union[omegaconf.DictConfig, Dict[str, Any]]] = None, ): """ Save a few stuff to weights-and-biases WandB @@ -448,13 +467,16 @@ def save_params_to_wandb( config: The config file, with key `trainer` predictor: The predictor used to handle the train/val/test steps logic datamodule: The datamodule used to load the data into training + 
unresolved_config: The unresolved config file """ # Get the wandb runner and directory wandb_run = logger.experiment + if wandb_run is None: - wandb_run = "" - wandb_dir = wandb_run.dir + wandb_dir = "" + else: + wandb_dir = wandb_run.dir # Save the mup base model to WandB as a yaml file mup.save_base_shapes(predictor.model, os.path.join(wandb_dir, "mup_base_params.yaml")) @@ -463,14 +485,18 @@ def save_params_to_wandb( with open(os.path.join(wandb_dir, "full_configs.yaml"), "w") as file: yaml.dump(config, file) + if unresolved_config is not None: + with open(os.path.join(wandb_dir, "unresolved_config.yaml"), "w") as file: + yaml.dump(unresolved_config, file) + # Save the featurizer into wandb featurizer_path = os.path.join(wandb_dir, "featurizer.pickle") joblib.dump(datamodule.smiles_transformer, featurizer_path) # Save the featurizer and configs into wandb if wandb_run is not None: - wandb_run.save("*.yaml") - wandb_run.save("*.pickle") + wandb_run.save(os.path.join(wandb_dir, "*.yaml"), wandb_dir) + wandb_run.save(os.path.join(wandb_dir, "*.pickle"), wandb_dir) def load_accelerator(config: Union[omegaconf.DictConfig, Dict[str, Any]]) -> Tuple[Dict[str, Any], str]: diff --git a/graphium/config/zinc_default_multitask_pyg.yaml b/graphium/config/zinc_default_multitask_pyg.yaml index 07ae4bf9b..b9435ec7e 100644 --- a/graphium/config/zinc_default_multitask_pyg.yaml +++ b/graphium/config/zinc_default_multitask_pyg.yaml @@ -181,3 +181,5 @@ architecture: # The parameters for the full graph network are taken from `co dropout: 0.2 normalization: none residual_type: none +accelerator: + type: cpu \ No newline at end of file diff --git a/graphium/features/featurizer.py b/graphium/features/featurizer.py index 66f241663..d8efdb2ab 100644 --- a/graphium/features/featurizer.py +++ b/graphium/features/featurizer.py @@ -1062,11 +1062,9 @@ def mol_to_graph_dict( mol = Chem.AddHs(mol) else: mol = Chem.RemoveHs(mol) - num_atoms = mol.GetNumAtoms() if (max_num_atoms is not None) and (num_atoms > max_num_atoms): raise ValueError(f"Maximum number of atoms greater than permitted {num_atoms}>{max_num_atoms}") - ( adj, ndata, diff --git a/graphium/finetuning/utils.py b/graphium/finetuning/utils.py index abcd11644..a2bd20d68 100644 --- a/graphium/finetuning/utils.py +++ b/graphium/finetuning/utils.py @@ -47,7 +47,6 @@ def modify_cfg_for_finetuning(cfg: Dict[str, Any]): """ Function combining information from configuration and pretrained model for finetuning. 
""" - task = cfg["finetuning"]["task"] # Filter the config based on the task name diff --git a/graphium/nn/architectures/encoder_manager.py b/graphium/nn/architectures/encoder_manager.py index e3e48aeba..464d9e9cc 100644 --- a/graphium/nn/architectures/encoder_manager.py +++ b/graphium/nn/architectures/encoder_manager.py @@ -135,6 +135,8 @@ def _initialize_positional_encoders(self, pe_encoders_kwargs: Dict[str, Any]) -> if pe_out_dim2 is not None: assert edge_pe_out_dim == pe_out_dim2, f"values mismatch {pe_out_dim}!={pe_out_dim2}" pe_encoders[encoder_name] = encoder(out_dim=edge_pe_out_dim, **this_in_dims, **encoder_kwargs) + else: + pe_encoders[encoder_name] = encoder(**this_in_dims, **encoder_kwargs) return pe_encoders diff --git a/graphium/nn/architectures/global_architectures.py b/graphium/nn/architectures/global_architectures.py index 0e4599b24..dc05dbe60 100644 --- a/graphium/nn/architectures/global_architectures.py +++ b/graphium/nn/architectures/global_architectures.py @@ -12,6 +12,7 @@ from torch import Tensor, nn import torch from torch_geometric.data import Data +from omegaconf import DictConfig, OmegaConf # graphium imports from graphium.data.utils import get_keys @@ -421,6 +422,7 @@ def __init__( residual_skip_steps: int = 1, in_dim_edges: int = 0, hidden_dims_edges: List[int] = [], + out_dim_edges: Optional[int] = None, name: str = "GNN", layer_kwargs: Optional[Dict] = None, virtual_node: str = "none", @@ -508,6 +510,11 @@ def __init__( Hidden dimensions for the edges. Most models don't support it, so it should only be used for those that do, i.e. `GatedGCNLayer` + out_dim_edges: + Output edge-feature dimensions of the network. Keep at 0 if not using + edge features, or if the layer doesn't support edges. Defaults to the + last value of hidden_dims_edges. + name: Name attributed to the current network, for display and printing purposes. @@ -551,9 +558,17 @@ def __init__( else: self.hidden_dims_edges = list(hidden_dims_edges) assert depth is None + self.out_dim_edges = ( + out_dim_edges + if out_dim_edges is not None + else self.hidden_dims_edges[-1] + if self.hidden_dims_edges + else 0 + ) self.full_dims_edges = None - if len(self.hidden_dims_edges) > 0: - self.full_dims_edges = [self.in_dim_edges] + self.hidden_dims_edges + [self.hidden_dims_edges[-1]] + if len(self.hidden_dims_edges) or self.out_dim_edges > 0: + assert self.out_dim_edges > 0, self.out_dim_edges + self.full_dims_edges = [self.in_dim_edges] + self.hidden_dims_edges + [self.out_dim_edges] self.virtual_node = virtual_node.lower() if virtual_node is not None else "none" @@ -593,6 +608,26 @@ def _check_bad_arguments(self): ) and not self.layer_class.layer_supports_edges: raise ValueError(f"Cannot use edge features with class `{self.layer_class}`") + def get_nested_key(self, d, target_key): + """ + Get the value associated with a key in a nested dictionary. + + Parameters: + - d: The dictionary to search in + - target_key: The key to search for + + Returns: + - The value associated with the key if found, None otherwise + """ + if target_key in d: + return d[target_key] + for key, value in d.items(): + if isinstance(value, (dict, DictConfig)): + nested_result = self.get_nested_key(value, target_key) + if nested_result is not None: + return nested_result + return None + def _create_layers(self): r""" Create all the necessary layers for the network. 
@@ -639,7 +674,8 @@
                 this_out_dim_edges = self.full_dims_edges[ii + 1]
                 this_edge_kwargs["out_dim_edges"] = this_out_dim_edges
             else:
-                this_out_dim_edges = self.layer_kwargs.get("out_dim_edges")
+                this_out_dim_edges = self.get_nested_key(self.layer_kwargs, "out_dim_edges")
+                this_edge_kwargs["out_dim_edges"] = this_out_dim_edges
             layer_out_dims_edges.append(this_out_dim_edges)
 
             # Create the GNN layer
@@ -900,6 +936,7 @@ def get_init_kwargs(self) -> Dict[str, Any]:
         new_kwargs = dict(
             in_dim_edges=self.in_dim_edges,
             hidden_dims_edges=self.hidden_dims_edges,
+            out_dim_edges=self.out_dim_edges,
             virtual_node=self.virtual_node,
             use_virtual_edges=self.use_virtual_edges,
         )
@@ -931,6 +968,7 @@ def make_mup_base_kwargs(
             kwargs["in_dim_edges"] = round(kwargs["in_dim_edges"] / divide_factor)
         if not self.last_layer_is_readout:
             kwargs["out_dim"] = round(kwargs["out_dim"] / divide_factor)
+            kwargs["out_dim_edges"] = round(kwargs["out_dim_edges"] / divide_factor)
 
         def _recursive_divide_dim(x: collections.abc.Mapping):
             for k, v in x.items():
diff --git a/graphium/nn/encoders/laplace_pos_encoder.py b/graphium/nn/encoders/laplace_pos_encoder.py
index ccf642e9d..7cc69919b 100644
--- a/graphium/nn/encoders/laplace_pos_encoder.py
+++ b/graphium/nn/encoders/laplace_pos_encoder.py
@@ -3,7 +3,7 @@
 import torch.nn as nn
 from torch_geometric.data import Batch
 
-from graphium.nn.base_layers import MLP, get_norm, FCLayer
+from graphium.nn.base_layers import MLP, get_norm, FCLayer, TransformerEncoderLayerMup
 from graphium.nn.encoders.base_encoder import BaseEncoder
 
 
@@ -70,7 +70,8 @@ def __init__(
         if self.model_type == "Transformer":
             # Transformer model for LapPE
             model_kwargs.setdefault("nhead", 1)
-            encoder_layer = nn.TransformerEncoderLayer(
+            encoder_layer = TransformerEncoderLayerMup(
+                None,
                 d_model=hidden_dim,
                 batch_first=True,
                 dropout=dropout,
diff --git a/graphium/nn/pyg_layers/gps_pyg.py b/graphium/nn/pyg_layers/gps_pyg.py
index f3da56979..7af7107ac 100644
--- a/graphium/nn/pyg_layers/gps_pyg.py
+++ b/graphium/nn/pyg_layers/gps_pyg.py
@@ -47,9 +47,10 @@ def __init__(
         activation: Union[Callable, str] = "relu",
         dropout: float = 0.0,
         node_residual: Optional[bool] = True,
+        edge_residual: Optional[bool] = True,
         normalization: Union[str, Callable] = "none",
         mpnn_type: str = "pyg:gine",
-        mpnn_kwargs=None,
+        mpnn_kwargs: Optional[dict] = None,
         attn_type: str = "full-attention",
         precision: str = "32",
         biased_attention_key: Optional[str] = None,
@@ -57,6 +58,7 @@ def __init__(
         droppath_rate_attn: float = 0.0,
         droppath_rate_ffn: float = 0.0,
         hidden_dim_scaling: float = 4.0,
+        output_scale: float = 1.0,
         **kwargs,
     ):
         r"""
@@ -99,6 +101,9 @@ def __init__(
             node_residual:
                 If node residual is used after on the gnn layer output
 
+            edge_residual:
+                If edge residual is used after the gnn layer output
+
             normalization:
                 Normalization to use. Choices:
 
@@ -141,6 +146,11 @@ def __init__(
             attn_kwargs:
                 Keyword arguments to pass to the attention layer
 
+            output_scale:
+                Float value used to scale the activations, which helps reduce
+                the growth of activations as the model gets deeper.
+
+                The default value of 1.0 leaves the layer unchanged.
         """
 
         super().__init__(
@@ -165,6 +175,7 @@ def __init__(
 
         # Residual connections
         self.node_residual = node_residual
+        self.edge_residual = edge_residual
 
         self.precision = precision
 
@@ -190,6 +201,37 @@ def __init__(
         self.mpnn = self._parse_mpnn_layer(mpnn_type, mpnn_kwargs)
         self.attn_layer = self._parse_attn_layer(attn_type, self.biased_attention_key, attn_kwargs)
 
+        self.output_scale = output_scale
+        self.use_edges = True if self.in_dim_edges is not None else False
+
+    def residual_add(self, feature: Tensor, input_feature: Tensor) -> Tensor:
+        r"""
+        Residual addition layer. Allows information to propagate through the model
+        by skipping the computational layers.
+        Parameters:
+            feature: The feature (typically nodes or edges) after message passing
+            input_feature: The same feature from before message passing
+        Returns:
+            The addition of the two tensors.
+        """
+        feature += input_feature
+        return feature
+
+    def scale_activations(self, feature: Tensor, scale_factor: Tensor) -> Tensor:
+        """Scale activations by a constant factor to stop growth of activation scale
+        and reduce numerical stability issues at low precision
+
+        Args:
+            feature (Tensor): The feature to scale
+            scale_factor (float): The floating point scale factor
+
+        Returns:
+            Tensor: The scaled features
+        """
+        scale_factor = torch.tensor(scale_factor).to(feature.device)
+        feature *= scale_factor.to(dtype=feature.dtype)
+        return feature
+
     def forward(self, batch: Batch) -> Batch:
         r"""
         forward function of the layer
@@ -200,6 +242,8 @@ def forward(self, batch: Batch) -> Batch:
         """
         # pe, feat, edge_index, edge_feat = batch.pos_enc_feats_sign_flip, batch.feat, batch.edge_index, batch.edge_feat
         feat = batch.feat
+        if self.use_edges:
+            edges_feat_in = batch.edge_feat
 
         feat_in = feat  # for first residual connection
 
@@ -208,10 +252,21 @@ def forward(self, batch: Batch) -> Batch:
         if self.mpnn is not None:
             batch_out = self.mpnn(batch_out)
         h_local = batch_out.feat
+        e_local = batch_out.edge_feat
         if self.dropout_local is not None:
             h_local = self.dropout_local(h_local)
+        # Apply the residual connection for the node features
         if self.node_residual:
-            h_local = feat_in + h_local  # Residual connection for nodes, not used in gps++.
+ h_local = self.residual_add(h_local, feat_in) + # Scale the activations by some value to help reduce activation growth + h_local = self.scale_activations(h_local, self.output_scale) + # Apply the residual connection for the edge features + if self.edge_residual and self.use_edges: + e_local = self.residual_add(e_local, edges_feat_in) + # Scale the activations by some value to help reduce activation growth + if self.use_edges: + e_local = self.scale_activations(e_local, self.output_scale) + if self.norm_layer_local is not None: h_local = self.norm_layer_local(h_local) @@ -240,7 +295,7 @@ def forward(self, batch: Batch) -> Batch: def _parse_mpnn_layer(self, mpnn_type, mpnn_kwargs: Dict[str, Any]) -> Optional[Module]: """Parse the MPNN layer.""" - if mpnn_type is None: + if mpnn_type is None or mpnn_type == "none": return mpnn_kwargs = deepcopy(mpnn_kwargs) @@ -375,7 +430,7 @@ def _self_attention_block(self, feat: Tensor, feat_in: Tensor, batch: Batch) -> ) attn_bias = None - if self.biased_attention_key is not None: + if self.biased_attention_key is not None and self.biased_attention_key != "none": attn_bias = batch[self.biased_attention_key] # h_dense[num_graphs, max_num_nodes, hidden_dim] -> feat_attn[num_graphs, max_num_nodes, hidden_dim] @@ -463,6 +518,8 @@ def layer_outputs_edges(self) -> bool: bool: Always ``False`` for the current class """ + if self.mpnn is None: + return False return self.mpnn.layer_outputs_edges @property diff --git a/graphium/nn/pyg_layers/mpnn_pyg.py b/graphium/nn/pyg_layers/mpnn_pyg.py index f2cdcb16c..25df03714 100644 --- a/graphium/nn/pyg_layers/mpnn_pyg.py +++ b/graphium/nn/pyg_layers/mpnn_pyg.py @@ -130,14 +130,15 @@ def __init__( self.num_edge_mlp = num_edge_mlp self.edge_dropout_rate = edge_dropout_rate - self.aggregator = MultiAggregation(aggregation_method) + self.aggregator = MultiAggregation(list(aggregation_method)) + n_agg = len(aggregation_method) # node_model: edge_dim = self.out_dim_edges if use_edges else self.in_dim_edges if self.node_combine_method == "concat": - node_model_in_dim = 3 * self.in_dim + 2 * edge_dim + node_model_in_dim = (1 + 2 * n_agg) * self.in_dim + 2 * n_agg * edge_dim elif self.node_combine_method == "sum": - node_model_in_dim = 2 * self.in_dim + edge_dim + node_model_in_dim = (1 + n_agg) * self.in_dim + n_agg * edge_dim else: raise ValueError(f"node_combine_method {self.node_combine_method} not recognised.") node_model_hidden_dim = self.mlp_expansion_ratio * self.in_dim diff --git a/graphium/trainer/predictor.py b/graphium/trainer/predictor.py index c4e700895..588d7e3f2 100644 --- a/graphium/trainer/predictor.py +++ b/graphium/trainer/predictor.py @@ -46,6 +46,9 @@ def __init__( flag_kwargs: Dict[str, Any] = None, task_norms: Optional[Dict[Callable, Any]] = None, metrics_every_n_train_steps: Optional[int] = None, + replicas: int = 1, + gradient_acc: int = 1, + global_bs: Optional[int] = 1, ): """ The Lightning module responsible for handling the predictions, losses, metrics, optimization, etc. @@ -175,6 +178,9 @@ def __init__( self.metrics_every_n_train_steps = metrics_every_n_train_steps # Wether save preds and targets for each training step. 
+        self.samples_seen = 0
+        self.global_bs = global_bs
+
     def forward(
         self, inputs: Dict
     ) -> Dict[str, Union[Tensor, Dict[str, Tensor], Dict[str, Dict[str, Tensor]]]]:
@@ -221,6 +227,7 @@ def configure_optimizers(self, impl=None):
 
         # Define the optimizer and schedulers
         optimiser = MuAdam(self.parameters(), **self.optim_options.optim_kwargs, impl=impl)
+        self.optim_options.torch_scheduler_kwargs.pop("module_type")
         torch_scheduler = self.optim_options.scheduler_class(
             optimizer=optimiser, **self.optim_options.torch_scheduler_kwargs
         )
@@ -461,6 +468,10 @@ def on_train_batch_end(self, outputs, batch: Any, batch_idx: int) -> None:
         # Get the metrics that are logged at every step (loss, grad_norm, batch_time, batch_tput)
         concatenated_metrics_logs = {}
         concatenated_metrics_logs["train/loss"] = outputs["loss"]
+        concatenated_metrics_logs["epoch_count"] = self.current_epoch
+        # Increment by the global batch size
+        self.samples_seen += self.global_bs
+        concatenated_metrics_logs["samples_seen"] = self.samples_seen
 
         # report the training loss for each individual tasks
         for task in self.tasks:
@@ -618,11 +629,6 @@ def on_validation_epoch_end(self) -> None:
             concatenated_metrics_logs = self.task_epoch_summary.concatenate_metrics_logs(metrics_logs)
             concatenated_metrics_logs["val/mean_time"] = torch.tensor(self.mean_val_time_tracker.mean_value)
             concatenated_metrics_logs["val/mean_tput"] = self.mean_val_tput_tracker.mean_value
-
-        if hasattr(self.optimizers(), "param_groups"):
-            lr = self.optimizers().param_groups[0]["lr"]
-            concatenated_metrics_logs["lr"] = torch.tensor(lr)
-            concatenated_metrics_logs["n_epochs"] = torch.tensor(self.current_epoch, dtype=torch.float32)
             self.log_dict(concatenated_metrics_logs)
 
             # Save yaml file with the per-task metrics summaries
diff --git a/graphium/trainer/predictor_options.py b/graphium/trainer/predictor_options.py
index 20e193bca..25fc6fad0 100644
--- a/graphium/trainer/predictor_options.py
+++ b/graphium/trainer/predictor_options.py
@@ -76,6 +76,7 @@ class OptimOptions:
     # Instead of passing a dictionary to be processed by the predictor,
     # this class will process the dictionary in advance and return the optimizer
     def set_kwargs(self):
+        torch_scheduler_kwargs = deepcopy(self.torch_scheduler_kwargs)
         # Set the parameters and default value for the optimizer, and check values
         if self.optim_kwargs is None:
             self.optim_kwargs = {}
@@ -94,12 +95,12 @@ def set_kwargs(self):
             self.scheduler_kwargs.setdefault("strict", True)
 
         # Set the pytorch scheduler arguments
-        if self.torch_scheduler_kwargs is None:
-            self.torch_scheduler_kwargs = {}
-        self.torch_scheduler_kwargs.setdefault("module_type", "ReduceLROnPlateau")
+        if torch_scheduler_kwargs is None:
+            torch_scheduler_kwargs = {}
+        torch_scheduler_kwargs.setdefault("module_type", "ReduceLROnPlateau")
 
         # Get the class for the scheduler
-        scheduler_class = self.torch_scheduler_kwargs.pop("module_type")
+        scheduler_class = torch_scheduler_kwargs.pop("module_type")
         if self.scheduler_class is None:
             if isinstance(scheduler_class, str):
                 self.scheduler_class = SCHEDULER_DICT[scheduler_class]
@@ -112,9 +113,9 @@ def set_kwargs(self):
         sig = signature(self.scheduler_class.__init__)
         key_args = [p.name for p in sig.parameters.values()]
         if "monitor" in key_args:
-            self.torch_scheduler_kwargs.setdefault("monitor", self.scheduler_kwargs["monitor"])
+            torch_scheduler_kwargs.setdefault("monitor", self.scheduler_kwargs["monitor"])
         if "mode" in key_args:
-            self.torch_scheduler_kwargs.setdefault("mode", self.scheduler_kwargs["mode"])
+            torch_scheduler_kwargs.setdefault("mode", self.scheduler_kwargs["mode"])
 
 
 @dataclass
diff --git a/graphium/trainer/predictor_summaries.py b/graphium/trainer/predictor_summaries.py
index d62e50a42..8ce863e74 100644
--- a/graphium/trainer/predictor_summaries.py
+++ b/graphium/trainer/predictor_summaries.py
@@ -248,8 +248,6 @@ def get_metrics_logs(self) -> Dict[str, Any]:
             metric_logs[self.metric_log_name(self.task_name, "median_target", self.step_name)] = nan_median(
                 targets
             )
-        if torch.cuda.is_available():
-            metric_logs[f"gpu_allocated_GB"] = torch.tensor(torch.cuda.memory_allocated() / (2**30))
 
         # Specify which metrics to use
         metrics_to_use = self.metrics
diff --git a/graphium/utils/spaces.py b/graphium/utils/spaces.py
index 6658d0ca0..d821223a4 100644
--- a/graphium/utils/spaces.py
+++ b/graphium/utils/spaces.py
@@ -52,6 +52,7 @@
 }
 
 LOSS_DICT = {
+    "bce_logits": torch.nn.BCEWithLogitsLoss,
     "mse": torch.nn.MSELoss,
     "bce": torch.nn.BCELoss,
     "l1": torch.nn.L1Loss,
@@ -106,7 +107,7 @@
     "msle": TorchMetrics.mean_squared_log_error,
     "pearsonr": TorchMetrics.pearson_corrcoef,
     "spearmanr": TorchMetrics.spearman_corrcoef,
-    "r2": TorchMetrics.r2_score,
+    "r2_score": TorchMetrics.r2_score,
     "cosine": TorchMetrics.cosine_similarity,
     "pearsonr_ipu": Metrics.pearson_ipu,
     "spearmanr_ipu": Metrics.spearman_ipu,
diff --git a/scripts/scale_mpnn.sh b/scripts/scale_mpnn.sh
new file mode 100644
index 000000000..8cd61fb86
--- /dev/null
+++ b/scripts/scale_mpnn.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+
+graphium-train \
+    --config-path=/home/frederik_valencediscovery_com/projects/graphium_hps/expts/configs/ \
+    --config-name=config_mpnn_base.yaml \
+    constants.max_epochs=100 \
+    trainer.model_checkpoint.dirpath=model_checkpoints/large-dataset/scale_mpnn/ \
+    +architecture.mup_scale_factor=2 +architecture.mup_base_path=mup/mpnn_base/base_shapes.yaml \
+    datamodule.args.batch_size_inference=1024 datamodule.args.batch_size_training=1024 +trainer.trainer.accumulate_grad_batches=2
\ No newline at end of file
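Closing note on the `gps_pyg.py` changes above: the new `edge_residual` and `output_scale` options amount to "residual add, then scale" applied to node (and optionally edge) features at every layer. The toy script below is not the actual `GPSLayerPyg` API; `layer_update` and the `torch.tanh` stand-in for the MPNN/attention block are illustrative assumptions, used only to show why a scale factor below 1.0 keeps activation magnitudes from growing with depth.

```python
# Toy illustration (not the actual GPSLayerPyg API) of the residual-add-then-scale
# pattern added in gps_pyg.py: out = (f(x) + x) * output_scale.
import torch

def layer_update(x: torch.Tensor, output_scale: float) -> torch.Tensor:
    h = torch.tanh(x)        # stand-in for the MPNN / attention block
    h = h + x                # residual connection (node or edge features)
    return h * output_scale  # activation scaling; 1.0 leaves the layer unchanged

x = torch.randn(16, 64)
for scale in (1.0, 0.75):
    h = x.clone()
    for _ in range(16):      # 16 stacked layers
        h = layer_update(h, scale)
    # With scale=1.0 the mean magnitude keeps growing layer after layer;
    # with scale=0.75 it settles around a bounded value.
    print(f"output_scale={scale}: mean |h| = {h.abs().mean().item():.2f}")
```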