diff --git a/expts/hydra-configs/README.md b/expts/hydra-configs/README.md new file mode 100644 index 000000000..5e189c304 --- /dev/null +++ b/expts/hydra-configs/README.md @@ -0,0 +1,154 @@ +# Configuring Graphium with Hydra +This document provides users with a point of entry to composing configs in Graphium. As a flexible library with many features, configuration is an important part of Graphium. To make configurations as reusable as possible while providing maximum flexibility, we integrated Graphium with `hydra`. Our config structure is designed to make the following functionality as accessible as possible: + +- Switching between **accelerators** (CPU, GPU and IPU) +- **Benchmarking** different models on the same dataset +- **Fine-tuning** a pre-trained model on a new dataset + +In what follows, we describe how each of these functionalities is achieved and how users can benefit from this design to get the most out of Graphium with as little configuration as possible. + +## Accelerators +With Graphium supporting CPU, GPU and IPU hardware, switching between these accelerators is pre-configured and straightforward. General accelerator-specific configs are specified under `accelerator/`, whereas experiment-specific differences between the accelerators are specialized under `training/accelerator`. + +## Benchmarking +Benchmarking multiple models on the same datasets and tasks requires us to easily switch between model configurations without redefining major parts of the architecture, task heads, featurization, metrics, predictor, etc. For example, when changing from a GCN to a GIN model, a simple switch of `architecture.gnn.layer_type: 'pyg:gin'` might suffice. Hence, we abstract the `model` configs under `model/`, where such model configurations can be specified. +In addition, switching models may have implications for configs specific to your current experiment, such as the name of the run or the directory to which model checkpoints are written. To enable such overrides, we can utilize `hydra` [specializations](https://hydra.cc/docs/patterns/specializing_config/). For example, for our ToyMix dataset, we specify the layer type under `model/[model_name].yaml`, e.g., for the GCN layer, + +```yaml +# @package _global_ + +architecture: + gnn: + layer_type: 'pyg:gcn' +``` + +and set experiment-related parameters in `training/model/toymix_[model_name].yaml` as a specialization, e.g., for the GIN layer, + +```yaml +# @package _global_ + +constants: + name: neurips2023_small_data_gin + ... + +trainer: + model_checkpoint: + dirpath: models_checkpoints/neurips2023-small-gin/ +``` +We can now utilize `hydra` to, e.g., run a sweep over our models on the ToyMix dataset via + +```bash +python main_run_multitask.py -m model=gcn,gin +``` +where the ToyMix dataset is pre-configured in `main.yaml`. Read on to find out how to define new datasets and architectures for pre-training and fine-tuning. + +## Pre-training / Fine-tuning +From a configuration point-of-view, fine-tuning requires us to load a pre-trained model and attach new task heads. However, in a highly configurable library such as ours, changing the task heads also requires changes to the logged metrics, loss functions and the source of the fine-tuning data. To allow a quick switch between pre-training and fine-tuning, we configure models and the corresponding tasks separately by default.
More specifically, + +- under `architecture/` we store architecture-related configurations such as the definition of the GNN/Transformer layers or positional/structural encoders +- under `tasks/` we store configurations specific to one task set, such as the multi-task dataset ToyMix +- under `training/` we store configurations specific to training models, which could be different for each combination of `architecture` and `tasks` + +Since architecture and tasks are logically separated, it now becomes very easy to, e.g., use an existing architecture backbone on a new set of tasks or a new dataset altogether. Additionally, separating training allows us to specify different training parameters for, e.g., pre-training and fine-tuning of the same architecture and task set. +We will now detail how you can add new architectures, tasks and training configurations. + +### Adding an architecture +The architecture config consists of the specification of the neural network components (including encoders) under the config key `architecture`, and the featurization, which defines the positional/structural information to be extracted from the data. +To add a new architecture, create a file `architecture/my_architecture.yaml` with the following information specified: +```yaml +# @package _global_ +architecture: + model_type: FullGraphMultiTaskNetwork # for example + pre_nn: + ... + + pre_nn_edges: + ... + + pe_encoders: + encoders: # your encoders + ... + + gnn: # your GNN definition + ... + + graph_output_nn: # output NNs for different levels such as graph, node, etc. + graph: + ... + node: + ... + ... + +datamodule: + module_type: "MultitaskFromSmilesDataModule" + args: # Make sure to not specify anything task-specific here + ... + featurization: + ... +``` +You can then select your new architecture during training, e.g., by running +```bash +python main_run_multitask.py architecture=my_architecture +``` + +### Adding tasks +The task set config consists of: the task head neural nets under the config key `architecture.task_heads`; if required, any task-specific arguments to the datamodule you use, e.g., `datamodule.args.task_specific_args` when using the `MultitaskFromSmilesDataModule` datamodule; the per-task metrics under the config key `metrics.[task]`, where `[task]` matches the tasks specified under `architecture.task_heads`; and the per-task configs of the `predictor` module, as well as the loss functions of the task set under the config key `predictor.loss_fun`. +To add a new task set, create a file `tasks/my_tasks.yaml` with the following information specified: +```yaml +# @package _global_ +architecture: + task_heads: + task1: + ... + task2: + ... + +datamodule: # optional, depends on your concrete datamodule class. Here: "MultitaskFromSmilesDataModule" + args: + task_specific_args: + task1: + ... + task2: + ... + +metrics: + task1: + ... + task2: + ... + +predictor: + metrics_on_progress_bar: + task1: + task2: + loss_fun: ... # your loss functions for the multi-tasking +``` +You can then select your new task set during training, e.g., by running +```bash +python main_run_multitask.py tasks=my_tasks +``` + +### Adding training configs +The training configs consist of specifications for the `predictor` and `trainer` modules.
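+Individual values inside these modules can also be overridden directly from the command line using Hydra's dotted override syntax. As a quick sketch (the values below are purely illustrative; the keys are the ones defined in `training/toymix.yaml`):
+
+```bash
+python main_run_multitask.py predictor.optim_kwargs.lr=1e-4 trainer.trainer.max_epochs=200
+```
+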
+To add new training configs, create a file `training/my_training.yaml` with the following information specified: +```yaml +# @package _global_ +predictor: + optim_kwargs: + lr: 4.e-5 + torch_scheduler_kwargs: # example + module_type: WarmUpLinearLR + max_num_epochs: &max_epochs 100 + warmup_epochs: 10 + verbose: False + scheduler_kwargs: + ... + +trainer: + ... + trainer: # example + precision: 16 + max_epochs: *max_epochs + min_epochs: 1 + check_val_every_n_epoch: 20 +``` diff --git a/expts/hydra-configs/architecture/toymix.yaml b/expts/hydra-configs/architecture/toymix.yaml new file mode 100644 index 000000000..6927f4e66 --- /dev/null +++ b/expts/hydra-configs/architecture/toymix.yaml @@ -0,0 +1,108 @@ +# @package _global_ + +architecture: + model_type: FullGraphMultiTaskNetwork + mup_base_path: null + pre_nn: + out_dim: 64 + hidden_dims: 256 + depth: 2 + activation: relu + last_activation: none + dropout: 0.18 + normalization: layer_norm + last_normalization: ${architecture.pre_nn.normalization} + residual_type: none + + pre_nn_edges: null + + pe_encoders: + out_dim: 32 + pool: "sum" #"mean" "max" + last_norm: None #"batch_norm", "layer_norm" + encoders: #la_pos | rw_pos + la_pos: # Set as null to avoid a pre-nn network + encoder_type: "laplacian_pe" + input_keys: ["laplacian_eigvec", "laplacian_eigval"] + output_keys: ["feat"] + hidden_dim: 64 + out_dim: 32 + model_type: 'DeepSet' #'Transformer' or 'DeepSet' + num_layers: 2 + num_layers_post: 1 # Num. layers to apply after pooling + dropout: 0.1 + first_normalization: "none" #"batch_norm" or "layer_norm" + rw_pos: + encoder_type: "mlp" + input_keys: ["rw_return_probs"] + output_keys: ["feat"] + hidden_dim: 64 + out_dim: 32 + num_layers: 2 + dropout: 0.1 + normalization: "layer_norm" #"batch_norm" or "layer_norm" + first_normalization: "layer_norm" #"batch_norm" or "layer_norm" + + gnn: # Set as null to avoid a post-nn network + in_dim: 64 # or otherwise the correct value + out_dim: &gnn_dim 96 + hidden_dims: *gnn_dim + depth: 4 + activation: gelu + last_activation: none + dropout: 0.1 + normalization: "layer_norm" + last_normalization: ${architecture.pre_nn.normalization} + residual_type: simple + virtual_node: 'none' + layer_type: 'pyg:gcn' #pyg:gine #'pyg:gps' # pyg:gated-gcn, pyg:gine,pyg:gps + layer_kwargs: null # Parameters for the model itself. 
You could define dropout_attn: 0.1 + + graph_output_nn: + graph: + pooling: [sum] + out_dim: *gnn_dim + hidden_dims: *gnn_dim + depth: 1 + activation: relu + last_activation: none + dropout: ${architecture.pre_nn.dropout} + normalization: ${architecture.pre_nn.normalization} + last_normalization: "none" + residual_type: none + +datamodule: + module_type: "MultitaskFromSmilesDataModule" + args: + prepare_dict_or_graph: pyg:graph + featurization_n_jobs: 30 + featurization_progress: True + featurization_backend: "loky" + processed_graph_data_path: "../datacache/neurips2023-small/" + num_workers: 30 # -1 to use all + persistent_workers: False + featurization: + atom_property_list_onehot: [atomic-number, group, period, total-valence] + atom_property_list_float: [degree, formal-charge, radical-electron, aromatic, in-ring] + edge_property_list: [bond-type-onehot, stereo, in-ring] + add_self_loop: False + explicit_H: False # if H is included + use_bonds_weights: False + pos_encoding_as_features: + pos_types: + lap_eigvec: + pos_level: node + pos_type: laplacian_eigvec + num_pos: 8 + normalization: "none" # normalization already applied on the eigen vectors + disconnected_comp: True # if eigen values/vector for disconnected graph are included + lap_eigval: + pos_level: node + pos_type: laplacian_eigval + num_pos: 8 + normalization: "none" # normalization already applied on the eigen vectors + disconnected_comp: True # if eigen values/vector for disconnected graph are included + rw_pos: # use same name as pe_encoder + pos_level: node + pos_type: rw_return_probs + ksteps: 16 \ No newline at end of file diff --git a/expts/hydra-configs/dataset/toymix.yaml b/expts/hydra-configs/dataset/toymix.yaml deleted file mode 100644 index 7ad9a82e6..000000000 --- a/expts/hydra-configs/dataset/toymix.yaml +++ /dev/null @@ -1,269 +0,0 @@ -# @package _global_ - -architecture: - model_type: FullGraphMultiTaskNetwork - mup_base_path: null - pre_nn: - out_dim: 64 - hidden_dims: 256 - depth: 2 - activation: relu - last_activation: none - dropout: &dropout 0.18 - normalization: &normalization layer_norm - last_normalization: *normalization - residual_type: none - - pre_nn_edges: null - - pe_encoders: - out_dim: 32 - pool: "sum" #"mean" "max" - last_norm: None #"batch_norm", "layer_norm" - encoders: #la_pos | rw_pos - la_pos: # Set as null to avoid a pre-nn network - encoder_type: "laplacian_pe" - input_keys: ["laplacian_eigvec", "laplacian_eigval"] - output_keys: ["feat"] - hidden_dim: 64 - out_dim: 32 - model_type: 'DeepSet' #'Transformer' or 'DeepSet' - num_layers: 2 - num_layers_post: 1 # Num. layers to apply after pooling - dropout: 0.1 - first_normalization: "none" #"batch_norm" or "layer_norm" - rw_pos: - encoder_type: "mlp" - input_keys: ["rw_return_probs"] - output_keys: ["feat"] - hidden_dim: 64 - out_dim: 32 - num_layers: 2 - dropout: 0.1 - normalization: "layer_norm" #"batch_norm" or "layer_norm" - first_normalization: "layer_norm" #"batch_norm" or "layer_norm" - - gnn: # Set as null to avoid a post-nn network - in_dim: 64 # or otherwise the correct value - out_dim: &gnn_dim 96 - hidden_dims: *gnn_dim - depth: 4 - activation: gelu - last_activation: none - dropout: 0.1 - normalization: "layer_norm" - last_normalization: *normalization - residual_type: simple - virtual_node: 'none' - layer_type: 'pyg:gcn' #pyg:gine #'pyg:gps' # pyg:gated-gcn, pyg:gine,pyg:gps - layer_kwargs: null # Parameters for the model itself. 
You could define dropout_attn: 0.1 - - graph_output_nn: - graph: - pooling: [sum] - out_dim: *gnn_dim - hidden_dims: *gnn_dim - depth: 1 - activation: relu - last_activation: none - dropout: *dropout - normalization: *normalization - last_normalization: "none" - residual_type: none - - task_heads: - qm9: - task_level: graph - out_dim: 19 - hidden_dims: 128 - depth: 2 - activation: relu - last_activation: none - dropout: *dropout - normalization: *normalization - last_normalization: "none" - residual_type: none - tox21: - task_level: graph - out_dim: 12 - hidden_dims: 64 - depth: 2 - activation: relu - last_activation: sigmoid - dropout: *dropout - normalization: *normalization - last_normalization: "none" - residual_type: none - zinc: - task_level: graph - out_dim: 3 - hidden_dims: 32 - depth: 2 - activation: relu - last_activation: none - dropout: *dropout - normalization: *normalization - last_normalization: "none" - residual_type: none - -predictor: - metrics_on_progress_bar: - qm9: ["mae"] - tox21: ["auroc"] - zinc: ["mae"] - loss_fun: - qm9: mae_ipu - tox21: bce_ipu - zinc: mae_ipu - random_seed: ${constants.seed} - optim_kwargs: - lr: 4.e-5 # warmup can be scheduled using torch_scheduler_kwargs - # weight_decay: 1.e-7 - torch_scheduler_kwargs: - module_type: WarmUpLinearLR - max_num_epochs: &max_epochs 100 - warmup_epochs: 10 - verbose: False - scheduler_kwargs: - target_nan_mask: null - multitask_handling: flatten # flatten, mean-per-label - -metrics: - qm9: &qm9_metrics - - name: mae - metric: mae_ipu - target_nan_mask: null - multitask_handling: flatten - threshold_kwargs: null - - name: pearsonr - metric: pearsonr_ipu - threshold_kwargs: null - target_nan_mask: null - multitask_handling: mean-per-label - - name: r2_score - metric: r2_score_ipu - target_nan_mask: null - multitask_handling: mean-per-label - threshold_kwargs: null - tox21: - - name: auroc - metric: auroc_ipu - task: binary - multitask_handling: mean-per-label - threshold_kwargs: null - - name: avpr - metric: average_precision_ipu - task: binary - multitask_handling: mean-per-label - threshold_kwargs: null - - name: f1 > 0.5 - metric: f1 - multitask_handling: mean-per-label - target_to_int: True - num_classes: 2 - average: micro - threshold_kwargs: &threshold_05 - operator: greater - threshold: 0.5 - th_on_preds: True - th_on_target: True - - name: precision > 0.5 - metric: precision - multitask_handling: mean-per-label - average: micro - threshold_kwargs: *threshold_05 - zinc: *qm9_metrics - -trainer: - seed: ${constants.seed} - logger: - save_dir: logs/neurips2023-small/ - name: ${constants.name} - project: ${constants.name} - model_checkpoint: - dirpath: models_checkpoints/neurips2023-small-gcn/ - filename: ${constants.name} - save_last: True - trainer: - precision: 16 - max_epochs: *max_epochs - min_epochs: 1 - check_val_every_n_epoch: 20 - -datamodule: - module_type: "MultitaskFromSmilesDataModule" - args: - prepare_dict_or_graph: pyg:graph - featurization_n_jobs: 30 - featurization_progress: True - featurization_backend: "loky" - processed_graph_data_path: "../datacache/neurips2023-small/" - num_workers: 30 # -1 to use all - persistent_workers: False - task_specific_args: - qm9: - df: null - df_path: ${constants.data_dir}/qm9.csv.gz - # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/qm9.csv.gz - # or set path as the URL directly - smiles_col: "smiles" - label_cols: ["A", "B", "C", "mu", "alpha", "homo", "lumo", "gap", "r2", "zpve", "u0", "u298", "h298", "g298", "cv", 
"u0_atom", "u298_atom", "h298_atom", "g298_atom"] - # sample_size: 2000 # use sample_size for test - splits_path: ${constants.data_dir}/qm9_random_splits.pt # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/qm9_random_splits.pt` - seed: ${constants.seed} #*seed - task_level: graph - label_normalization: - normalize_val_test: True - method: "normal" - - tox21: - df: null - df_path: ${constants.data_dir}/Tox21-7k-12-labels.csv.gz - # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/Tox21-7k-12-labels.csv.gz - # or set path as the URL directly - smiles_col: "smiles" - label_cols: ["NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD", "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53"] - # sample_size: 2000 # use sample_size for test - splits_path: ${constants.data_dir}/Tox21_random_splits.pt # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/Tox21_random_splits.pt` - seed: ${constants.seed} - task_level: graph - - zinc: - df: null - df_path: ${constants.data_dir}/ZINC12k.csv.gz - # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/ZINC12k.csv.gz - # or set path as the URL directly - smiles_col: "smiles" - label_cols: ["SA", "logp", "score"] - # sample_size: 2000 # use sample_size for test - splits_path: ${constants.data_dir}/ZINC12k_random_splits.pt # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/ZINC12k_random_splits.pt` - seed: ${constants.seed} - task_level: graph - label_normalization: - normalize_val_test: True - method: "normal" - featurization: - atom_property_list_onehot: [atomic-number, group, period, total-valence] - atom_property_list_float: [degree, formal-charge, radical-electron, aromatic, in-ring] - edge_property_list: [bond-type-onehot, stereo, in-ring] - add_self_loop: False - explicit_H: False # if H is included - use_bonds_weights: False - pos_encoding_as_features: - pos_types: - lap_eigvec: - pos_level: node - pos_type: laplacian_eigvec - num_pos: 8 - normalization: "none" # normalization already applied on the eigen vectors - disconnected_comp: True # if eigen values/vector for disconnected graph are included - lap_eigval: - pos_level: node - pos_type: laplacian_eigval - num_pos: 8 - normalization: "none" # normalization already applied on the eigen vectors - disconnected_comp: True # if eigen values/vector for disconnected graph are included - rw_pos: # use same name as pe_encoder - pos_level: node - pos_type: rw_return_probs - ksteps: 16 diff --git a/expts/hydra-configs/main.yaml b/expts/hydra-configs/main.yaml index 198bccb0c..24962eddc 100644 --- a/expts/hydra-configs/main.yaml +++ b/expts/hydra-configs/main.yaml @@ -1,8 +1,16 @@ defaults: + + # Accelerators - accelerator: ipu - - dataset: toymix + + # Pre-training/fine-tuning + - architecture: toymix + - tasks: toymix + - training: toymix + + # Benchmarking - model: gcn # Specializations - - experiment: ${dataset}_${model} - - dataset/accelerator: ${dataset}_${accelerator} \ No newline at end of file + - training/accelerator: ${training}_${accelerator} + - training/model: ${training}_${model} \ No newline at end of file diff --git a/expts/hydra-configs/tasks/toymix.yaml b/expts/hydra-configs/tasks/toymix.yaml new file mode 100644 index 000000000..e120c13a8 --- /dev/null +++ b/expts/hydra-configs/tasks/toymix.yaml @@ -0,0 +1,138 @@ +# @package 
_global_ + +architecture: + task_heads: + qm9: + task_level: graph + out_dim: 19 + hidden_dims: 128 + depth: 2 + activation: relu + last_activation: none + dropout: ${architecture.pre_nn.dropout} + normalization: ${architecture.pre_nn.normalization} + last_normalization: "none" + residual_type: none + tox21: + task_level: graph + out_dim: 12 + hidden_dims: 64 + depth: 2 + activation: relu + last_activation: none + dropout: ${architecture.pre_nn.dropout} + normalization: ${architecture.pre_nn.normalization} + last_normalization: "none" + residual_type: none + zinc: + task_level: graph + out_dim: 3 + hidden_dims: 32 + depth: 2 + activation: relu + last_activation: none + dropout: ${architecture.pre_nn.dropout} + normalization: ${architecture.pre_nn.normalization} + last_normalization: "none" + residual_type: none + +predictor: + metrics_on_progress_bar: + qm9: ["mae"] + tox21: ["auroc"] + zinc: ["mae"] + loss_fun: + qm9: mae_ipu + tox21: bce_logits_ipu + zinc: mae_ipu + +metrics: + qm9: &qm9_metrics + - name: mae + metric: mae_ipu + target_nan_mask: null + multitask_handling: flatten + threshold_kwargs: null + - name: pearsonr + metric: pearsonr_ipu + threshold_kwargs: null + target_nan_mask: null + multitask_handling: mean-per-label + - name: r2_score + metric: r2_score_ipu + target_nan_mask: null + multitask_handling: mean-per-label + threshold_kwargs: null + tox21: + - name: auroc + metric: auroc_ipu + task: binary + multitask_handling: mean-per-label + threshold_kwargs: null + - name: avpr + metric: average_precision_ipu + task: binary + multitask_handling: mean-per-label + threshold_kwargs: null + - name: f1 > 0.5 + metric: f1 + multitask_handling: mean-per-label + target_to_int: True + num_classes: 2 + average: micro + threshold_kwargs: &threshold_05 + operator: greater + threshold: 0.5 + th_on_preds: True + th_on_target: True + - name: precision > 0.5 + metric: precision + multitask_handling: mean-per-label + average: micro + threshold_kwargs: *threshold_05 + zinc: *qm9_metrics + +datamodule: + args: + task_specific_args: + qm9: + df: null + df_path: ${constants.data_dir}/qm9.csv.gz + # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/qm9.csv.gz + # or set path as the URL directly + smiles_col: "smiles" + label_cols: ["A", "B", "C", "mu", "alpha", "homo", "lumo", "gap", "r2", "zpve", "u0", "u298", "h298", "g298", "cv", "u0_atom", "u298_atom", "h298_atom", "g298_atom"] + # sample_size: 2000 # use sample_size for test + splits_path: ${constants.data_dir}/qm9_random_splits.pt # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/qm9_random_splits.pt` + seed: ${constants.seed} #*seed + task_level: graph + label_normalization: + normalize_val_test: True + method: "normal" + + tox21: + df: null + df_path: ${constants.data_dir}/Tox21-7k-12-labels.csv.gz + # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/Tox21-7k-12-labels.csv.gz + # or set path as the URL directly + smiles_col: "smiles" + label_cols: ["NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD", "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53"] + # sample_size: 2000 # use sample_size for test + splits_path: ${constants.data_dir}/Tox21_random_splits.pt # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/Tox21_random_splits.pt` + seed: ${constants.seed} + task_level: graph + + zinc: + df: null + df_path: 
${constants.data_dir}/ZINC12k.csv.gz + # wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/ZINC12k.csv.gz + # or set path as the URL directly + smiles_col: "smiles" + label_cols: ["SA", "logp", "score"] + # sample_size: 2000 # use sample_size for test + splits_path: ${constants.data_dir}/ZINC12k_random_splits.pt # Download with `wget https://storage.googleapis.com/graphium-public/datasets/neurips_2023/Small-dataset/ZINC12k_random_splits.pt` + seed: ${constants.seed} + task_level: graph + label_normalization: + normalize_val_test: True + method: "normal" \ No newline at end of file diff --git a/expts/hydra-configs/dataset/accelerator/toymix_cpu.yaml b/expts/hydra-configs/training/accelerator/toymix_cpu.yaml similarity index 71% rename from expts/hydra-configs/dataset/accelerator/toymix_cpu.yaml rename to expts/hydra-configs/training/accelerator/toymix_cpu.yaml index e201d6c93..eb12c8935 100644 --- a/expts/hydra-configs/dataset/accelerator/toymix_cpu.yaml +++ b/expts/hydra-configs/training/accelerator/toymix_cpu.yaml @@ -1,10 +1,5 @@ # @package _global_ -architecture: - task_heads: - tox21: - last_activation: none - datamodule: args: batch_size_training: 200 @@ -14,9 +9,7 @@ datamodule: predictor: optim_kwargs: {} - loss_fun: - tox21: bce_logits_ipu - metrics_every_n_steps: 300 + metrics_every_n_train_steps: 300 torch_scheduler_kwargs: max_num_epochs: &max_epochs 300 diff --git a/expts/hydra-configs/dataset/accelerator/toymix_gpu.yaml b/expts/hydra-configs/training/accelerator/toymix_gpu.yaml similarity index 73% rename from expts/hydra-configs/dataset/accelerator/toymix_gpu.yaml rename to expts/hydra-configs/training/accelerator/toymix_gpu.yaml index 9a08dde3c..3712373c3 100644 --- a/expts/hydra-configs/dataset/accelerator/toymix_gpu.yaml +++ b/expts/hydra-configs/training/accelerator/toymix_gpu.yaml @@ -3,11 +3,6 @@ accelerator: float32_matmul_precision: medium -architecture: - task_heads: - tox21: - last_activation: none - datamodule: args: batch_size_training: 200 @@ -17,9 +12,7 @@ datamodule: predictor: optim_kwargs: {} - loss_fun: - tox21: bce_logits_ipu - metrics_every_n_steps: 300 + metrics_every_n_train_steps: 300 torch_scheduler_kwargs: max_num_epochs: &max_epochs 300 diff --git a/expts/hydra-configs/dataset/accelerator/toymix_ipu.yaml b/expts/hydra-configs/training/accelerator/toymix_ipu.yaml similarity index 100% rename from expts/hydra-configs/dataset/accelerator/toymix_ipu.yaml rename to expts/hydra-configs/training/accelerator/toymix_ipu.yaml diff --git a/expts/hydra-configs/experiment/toymix_gcn.yaml b/expts/hydra-configs/training/model/toymix_gcn.yaml similarity index 100% rename from expts/hydra-configs/experiment/toymix_gcn.yaml rename to expts/hydra-configs/training/model/toymix_gcn.yaml diff --git a/expts/hydra-configs/experiment/toymix_gin.yaml b/expts/hydra-configs/training/model/toymix_gin.yaml similarity index 100% rename from expts/hydra-configs/experiment/toymix_gin.yaml rename to expts/hydra-configs/training/model/toymix_gin.yaml diff --git a/expts/hydra-configs/training/toymix.yaml b/expts/hydra-configs/training/toymix.yaml new file mode 100644 index 000000000..05d7c4715 --- /dev/null +++ b/expts/hydra-configs/training/toymix.yaml @@ -0,0 +1,30 @@ +# @package _global_ + +predictor: + random_seed: ${constants.seed} + optim_kwargs: + lr: 4.e-5 # warmup can be scheduled using torch_scheduler_kwargs + # weight_decay: 1.e-7 + torch_scheduler_kwargs: + module_type: WarmUpLinearLR + max_num_epochs: &max_epochs 100 + warmup_epochs: 
10 + verbose: False + scheduler_kwargs: null + target_nan_mask: null + multitask_handling: flatten # flatten, mean-per-label + +trainer: + seed: ${constants.seed} + logger: + save_dir: logs/neurips2023-small/ + name: ${constants.name} + project: ${constants.name} + model_checkpoint: + filename: ${constants.name} + save_last: True + trainer: + precision: 16 + max_epochs: *max_epochs + min_epochs: 1 + check_val_every_n_epoch: 20 \ No newline at end of file
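As a quick sanity check of the composed defaults in `main.yaml` above, the config groups can be swapped directly from the command line. A hypothetical invocation that keeps the ToyMix architecture/tasks/training defaults but switches the accelerator and model (assuming a GPU machine is available) might look like:

```bash
python main_run_multitask.py accelerator=gpu model=gin
```

With the specializations in `main.yaml`, this should resolve `training/accelerator` to `toymix_gpu` and `training/model` to `toymix_gin`, both of which are provided above.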